Good progress in optimisation: the frame rate of my little demo is now about 44-45 fps.
Most of the performance increase came when I realised that the function static inline void setup_data_16(uint16_t data), which is called a lot inside lcdRefreshMode2() (and used in other modes too), was actually compiled as a normal function (!). I verified it by looking at the assembler listing. By copying the statements of setup_data_16() directly into lcdRefreshMode2() I gained over 10 fps.
I have not yet figured out why the compiler decides to use a function call instead of inlining the code (I have optimization enabled). I have to check whether this is also the case in projects other than mine, but I suspect it is.
Thanks for the confirmation. This problem affects many functions in the library. I will try to find which compiler flag causes it. If I enable -Winline it gives a huge number of warnings: "inlining failed in call to 'void setup_data_16(uint16_t)': call is unlikely and code size would grow [-Winline]", which is weird.
One possibility is to use __attribute__((always_inline)) on the functions, but I would rather let the compiler decide.
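A minimal sketch of the attribute approach, for reference. The helper name setup_data_16_forced, the fake_data_port variable and the shift by 3 are assumptions made so the snippet can be compiled on a PC; the real setup_data_16() body is not shown in this thread.

```cpp
#include <cstdint>

// Stand-in for the memory-mapped data port (assumption): on the board this
// would be a hardware register, here it is an ordinary variable.
static volatile uint32_t fake_data_port = 0;

// GCC/Clang attribute: inline this function at every call site, even when the
// size heuristics of -Os would otherwise prefer a real call.
__attribute__((always_inline)) static inline void setup_data_16_forced(uint16_t data)
{
    fake_data_port = static_cast<uint32_t>(data) << 3;
}

// Hypothetical caller that pushes a run of pixels through the helper.
uint32_t push_pixels(const uint16_t *pixels, int count)
{
    for (int i = 0; i < count; ++i) {
        setup_data_16_forced(pixels[i]);
    }
    return fake_data_port; // last value written
}
```

Whether the call really disappeared can be confirmed the same way as before, by checking the assembler listing for a branch to the helper.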
Hah, "call is unlikely"... We're sure the call is going to be made a few million times per second. Maybe GCC would inline it if we used PGO? In the absence of that, using the attribute seems a reasonable/simple fix.
Alternatively:
setup_data_16 is generally called inside a loop. SET_MASK_P2; can be moved out, before the loop, and CLR_MASK_P2; after the loop.
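The hoisting described above could look roughly like this. The two macros here are counting stand-ins, not the library's real SET_MASK_P2 / CLR_MASK_P2, and fake_mpin replaces the hardware register, so that the difference can be observed on a PC.

```cpp
#include <cstdint>

// Counters standing in for the hardware (assumptions for illustration only).
static int mask_writes = 0;               // how often the mask register was touched
static int data_writes = 0;               // how many data words were sent
static volatile uint32_t fake_mpin = 0;   // stand-in for the masked data port

#define SET_MASK_P2 (++mask_writes)       // stand-in, not the real macro
#define CLR_MASK_P2 (++mask_writes)       // stand-in, not the real macro

// Before: the mask is set and cleared around every single word.
void send_masked_per_word(const uint16_t *words, int count)
{
    for (int i = 0; i < count; ++i) {
        SET_MASK_P2;
        fake_mpin = words[i];
        ++data_writes;
        CLR_MASK_P2;
    }
}

// After: the mask is set once before the loop and cleared once after it.
void send_masked_once(const uint16_t *words, int count)
{
    SET_MASK_P2;
    for (int i = 0; i < count; ++i) {
        fake_mpin = words[i];
        ++data_writes;
    }
    CLR_MASK_P2;
}
```

For N words the first version touches the mask register 2N times, the second version only twice.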
I haven't looked at the disassembly, but I observed that using the mbed macros for accessing memory registers (LPC_GPIO_PORT->MPIN[2] = ...) was a tiny bit slower than direct memory manipulation (*reinterpret_cast<uint32_t *>(0xA0002188) = ...). I assumed that the dereference is not optimized away and produces a few more ops, while the latter results in a simple STR.
Reducing setup_data_16 to just a STR should persuade GCC to inline it, as that'd be smaller than the code necessary to call it... but at that point, we might as well just put the MPIN write in a macro, just like the mask code.
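A sketch of what such a write macro might look like, in both the struct style and the direct-address style. FakeGpioPort, fake_gpio and the macro names are hypothetical, and the register lives in ordinary RAM so the code runs anywhere; on the device the address would be the constant quoted above. One deliberate difference from the cast in the post: the pointer is volatile here, because without it the compiler is allowed to merge or drop repeated stores to the same address.

```cpp
#include <cstdint>

// Hypothetical stand-in for the GPIO register block: only MPIN is modelled.
struct FakeGpioPort {
    volatile uint32_t MPIN[3];
};

static FakeGpioPort fake_gpio = {};

// Struct-style access, same shape as LPC_GPIO_PORT->MPIN[2] = ...
#define WRITE_DATA_STRUCT(value) (fake_gpio.MPIN[2] = (value))

// Address-style access. The address is taken from the fake block here;
// on the device it would be a fixed constant.
#define FAKE_MPIN2_ADDR (reinterpret_cast<uintptr_t>(&fake_gpio.MPIN[2]))
#define WRITE_DATA_DIRECT(value) \
    (*reinterpret_cast<volatile uint32_t *>(FAKE_MPIN2_ADDR) = (value))

// Helper for checking what ended up in the fake register.
uint32_t read_back()
{
    return fake_gpio.MPIN[2];
}
```

Both macros store to the same location; whether one of them really compiles to fewer instructions on the target is something only the disassembly can answer.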
Dammit! I had "-O3" (= turn on all optimizations) and after that "-Os" (= optimize for size)! The latter is the one in effect, which pretty much explains the inline problem.
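When several -O options are given, GCC uses the last one. A small sketch of how a build could report which mode actually won, using the __OPTIMIZE__ and __OPTIMIZE_SIZE__ macros that GCC predefines; the function name effective_optimisation is made up for this example.

```cpp
// Reports the optimisation mode the compiler was really run with.
// __OPTIMIZE_SIZE__ is defined when optimising for size (-Os),
// __OPTIMIZE__ is defined for any optimising build.
const char *effective_optimisation()
{
#if defined(__OPTIMIZE_SIZE__)
    return "size";   // -Os is in effect, even if -O3 appeared earlier
#elif defined(__OPTIMIZE__)
    return "speed";  // -O1, -O2, -O3 (or -Og)
#else
    return "none";   // -O0 or no -O flag at all
#endif
}
```

Printing this string once at start-up (or checking the macro in a static_assert) would have flagged the stray -Os right away.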
I will attempt a pull on GitHub later, but here is the mode13 function in HWLCD.cpp.
Every time I attempt to use the GitHub desktop app, it ends up deleting a whole bunch of my projects. If anyone else wants to push the above change, please be my guest.
note: it also needs
game.setFrameRate(100);
added to the mode13 example to get the benefit. Excellent work @Hanski and @FManga
The first and second for loops can be merged together. When you write to your scanline array, also write to MPIN.
That should perform a little better, even if you don't unroll.
Edit: Something like this (not tested):
void Pokitto::lcdRefreshMode13(uint8_t * scrbuf, uint16_t* paletteptr, uint8_t offset){
    uint16_t x,y;
    uint16_t scanline[2][110]; // two source columns; two bytes (= two pixels) are read at a time
    uint8_t *d;

    write_command(0x20); write_data(0);
    write_command(0x21); write_data(0);
    write_command(0x22);
    CLR_CS_SET_CD_RD_WR;
    SET_MASK_P2;

    for(x=0;x<110;x+=2)
    {
        d = scrbuf+x; // point to beginning of line in data
        uint8_t s=0;
        // fill both scanlines; scanline[0] is sent for the first time while it is filled
        for(y=0;y<88;y++)
        {
            uint8_t t = *d;
            uint8_t t1 = *(d+1);
            LPC_GPIO_PORT->MPIN[2] = scanline[0][s] = paletteptr[(t+offset)&255]<<3;
            CLR_WR;SET_WR;CLR_WR;SET_WR;
            scanline[1][s++] = paletteptr[(t1+offset)&255]<<3;
            d+=110; // jump to read byte directly below in screenbuffer
        }
        // scanline[0], second pass (unrolled by 8)
        for (s=0;s<88;) {
            LPC_GPIO_PORT->MPIN[2] = (scanline[0][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
            LPC_GPIO_PORT->MPIN[2] = (scanline[0][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
            LPC_GPIO_PORT->MPIN[2] = (scanline[0][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
            LPC_GPIO_PORT->MPIN[2] = (scanline[0][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
            LPC_GPIO_PORT->MPIN[2] = (scanline[0][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
            LPC_GPIO_PORT->MPIN[2] = (scanline[0][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
            LPC_GPIO_PORT->MPIN[2] = (scanline[0][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
            LPC_GPIO_PORT->MPIN[2] = (scanline[0][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        }
        // scanline[1], first pass
        for (s=0;s<88;) {
            LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
            LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
            LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
            LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
            LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
            LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
            LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
            LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        }
        // scanline[1], second pass
        for (s=0;s<88;) {
            LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
            LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
            LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
            LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
            LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
            LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
            LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
            LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        }
    }
    CLR_MASK_P2;
}
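Since the function above is untested, here is a PC-side sketch for checking that the merged loop sends exactly the same sequence as a version that fills the scanlines first and sends them afterwards. Everything in it is a simulation made up for this check: bus_log, strobe_twice, lookup and the two refresh_* functions are hypothetical names, and the write strobes are modelled by appending to a vector.

```cpp
#include <cstdint>
#include <vector>

constexpr int kWidth = 110;   // source pixels per row, as in the function above
constexpr int kHeight = 88;   // source rows

// Every simulated write strobe appends the current bus value to this log.
static std::vector<uint16_t> bus_log;

// Models "MPIN = value; CLR_WR;SET_WR;CLR_WR;SET_WR;" (two strobes per value).
static void strobe_twice(uint16_t value)
{
    bus_log.push_back(value);
    bus_log.push_back(value);
}

// Palette lookup with the same offset, wrap-around and shift as above.
static uint16_t lookup(const uint16_t *palette, uint8_t index, uint8_t offset)
{
    return static_cast<uint16_t>(palette[(index + offset) & 255] << 3);
}

// Reference: fill both scanlines first, then send each of them twice.
void refresh_separate(const uint8_t *scrbuf, const uint16_t *palette, uint8_t offset)
{
    uint16_t scanline[2][kHeight];
    for (int x = 0; x < kWidth; x += 2) {
        const uint8_t *d = scrbuf + x;
        for (int s = 0; s < kHeight; ++s, d += kWidth) {
            scanline[0][s] = lookup(palette, d[0], offset);
            scanline[1][s] = lookup(palette, d[1], offset);
        }
        for (int line = 0; line < 2; ++line)
            for (int repeat = 0; repeat < 2; ++repeat)
                for (int s = 0; s < kHeight; ++s)
                    strobe_twice(scanline[line][s]);
    }
}

// Merged: the first pass over scanline[0] is sent while it is being filled.
void refresh_merged(const uint8_t *scrbuf, const uint16_t *palette, uint8_t offset)
{
    uint16_t scanline[2][kHeight];
    for (int x = 0; x < kWidth; x += 2) {
        const uint8_t *d = scrbuf + x;
        for (int s = 0; s < kHeight; ++s, d += kWidth) {
            scanline[0][s] = lookup(palette, d[0], offset);
            strobe_twice(scanline[0][s]);
            scanline[1][s] = lookup(palette, d[1], offset);
        }
        for (int s = 0; s < kHeight; ++s)
            strobe_twice(scanline[0][s]);
        for (int repeat = 0; repeat < 2; ++repeat)
            for (int s = 0; s < kHeight; ++s)
                strobe_twice(scanline[1][s]);
    }
}
```

Running both functions on the same screen buffer and comparing the two logs shows whether the merge changed the output; it says nothing about timing on the real hardware.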
I think something is off... I'm not sure if it's the framerate limiter code or the FPS display, but I've noticed the FPS is often lower (about half) than the limit even when it should be able to reach the limit (raising the limit also raises the FPS). Setting it to 100, it maxes out at 50; raising it to 200, it maxes out at 100.