Improving FPS


#1

Good progress in optimisation, as the frame rate of my little demo is now about 44-45 fps :slight_smile:

Most of the performance increase came when I realised that the function static inline void setup_data_16(uint16_t data), which is called a lot inside lcdRefreshMode2() (and used in other modes too), was actually compiled as a normal function (!). I verified this by looking at the assembler listing. By copying the statements from setup_data_16() directly into lcdRefreshMode2() I gained over 10 fps.

I have not yet figured out why the compiler decides to use a function call instead of inlining the code (I have optimization enabled). I have to check whether this happens in projects other than mine, but I suspect it does.


#2

Well done Hanski!


#3

I can confirm. GCC simply ignores the inline hint, for some reason. I’ve been using a macro instead.


#4

Thanks for the confirmation. This problem affects many functions in the library. I will try to find which compiler flag causes it. If I enable -Winline it gives a huge number of warnings: “inlining failed in call to ‘void setup_data_16(uint16_t)’: call is unlikely and code size would grow [-Winline]”, which is weird.

One possibility is to use __attribute__((always_inline)) on the functions, but I would rather let the compiler decide.
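
For anyone following along, here is a minimal sketch of what the attribute approach could look like. The body of setup_data_16 and the fake_data_port variable are stand-ins I made up so it compiles on a PC; the real function writes to the LCD data port. push_pixels is also a hypothetical caller.

```cpp
#include <cstdint>

// Hypothetical stand-in for the LCD data port. On the real hardware this
// would be a memory-mapped GPIO register, not a global variable.
static volatile uint32_t fake_data_port = 0;

// GCC/Clang attribute: inline this function even when the optimiser's own
// heuristics would otherwise emit a normal call.
static inline __attribute__((always_inline)) void setup_data_16(uint16_t data) {
    fake_data_port = static_cast<uint32_t>(data) << 3;
}

// Hypothetical caller: pushes `count` pixels through the helper and returns
// the last value that reached the (fake) port.
static uint32_t push_pixels(const uint16_t* pixels, int count) {
    for (int i = 0; i < count; ++i) {
        setup_data_16(pixels[i]);
    }
    return fake_data_port;
}
```

Whether the call really disappeared still has to be checked in the assembler listing, the same way the original problem was found.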


#5

Hah, “call is unlikely”… We’re sure the call is going to be made a few million times per second. Maybe GCC would inline it if we used PGO? In the absence of that, using the attribute seems a reasonable/simple fix.

Alternatively:

setup_data_16 is generally called inside a loop. SET_MASK_P2; can be moved out, before the loop, and CLR_MASK_P2; after the loop.

I haven’t looked at the disassembly, but I observed that using the mbed macros for accessing memory registers (LPC_GPIO_PORT->MPIN[2] = ...) was a tiny bit slower than direct memory manipulation (*reinterpret_cast<uint32_t *>(0xA0002188) = ...). I assumed that the dereference is not optimized away and produces a few more ops, while the latter results in a simple STR.

Reducing setup_data_16 to just a STR should convince GCC into inlining, as that’d be smaller than the code necessary to call it… but at that point, might as well just put the MPIN write in a macro, just like the mask code.
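
A rough sketch of the hoisting idea, simulated on a PC. FakePort, set_mask and clr_mask are hypothetical stand-ins for the real port and the SET_MASK_P2 / CLR_MASK_P2 macros; the counter only exists to show how many mask accesses each variant performs.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical simulation of the masked GPIO port.
struct FakePort {
    uint32_t mpin = 0;          // last value written to the masked pin register
    unsigned mask_writes = 0;   // how many times the mask was set or cleared
};

inline void set_mask(FakePort& p) { ++p.mask_writes; }  // stands in for SET_MASK_P2
inline void clr_mask(FakePort& p) { ++p.mask_writes; }  // stands in for CLR_MASK_P2

// Mask handled around every single pixel, as when the helper is called in a loop.
unsigned write_pixels_naive(FakePort& port, const uint16_t* px, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        set_mask(port);
        port.mpin = static_cast<uint32_t>(px[i]) << 3;
        clr_mask(port);
    }
    return port.mask_writes;
}

// Mask set once before the loop and cleared once after it.
unsigned write_pixels_hoisted(FakePort& port, const uint16_t* px, std::size_t n) {
    set_mask(port);
    for (std::size_t i = 0; i < n; ++i) {
        port.mpin = static_cast<uint32_t>(px[i]) << 3;
    }
    clr_mask(port);
    return port.mask_writes;
}
```

For an 88-pixel column the naive variant touches the mask 176 times and the hoisted one twice, while the data written is identical.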


#6

Dammit! I had “-O3” (= turn on all optimizations) followed by “-Os” (= optimize for size)! When several -O options are given, the last one takes effect, which pretty much explains the inline problem.

With just -O3, inlining works OK.
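
One way to catch this kind of flag mix-up early, as a sketch: GCC and Clang predefine __OPTIMIZE_SIZE__ when the build is optimizing for size, so the code can report which mode it was actually built in. The names kBuiltForSize and optimisation_label are made up for this example.

```cpp
// __OPTIMIZE_SIZE__ is predefined by GCC/Clang when optimizing for size,
// regardless of which other -O flags appeared earlier on the command line.
#if defined(__OPTIMIZE_SIZE__)
constexpr bool kBuiltForSize = true;
#else
constexpr bool kBuiltForSize = false;
#endif

// Hypothetical helper: a label that could be printed in a debug build banner.
constexpr const char* optimisation_label() {
    return kBuiltForSize ? "size-optimised build" : "not a size-optimised build";
}
```

Printing the label once at startup would have shown that -Os was winning over -O3.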


#7

It’d be good to write the code in a way that -Os isn’t so much slower. -O3 often makes code that’s so much larger it isn’t an option.


#8

Wow, that was a good tip! I am at 50 fps now.

Edit: That is the C64 (pal) frame rate too :slight_smile:


#9

Going FTL !!!


#10


#11

Using the above changes, I managed to get mode13 up to 50fps also!

void Pokitto::lcdRefreshMode13(uint8_t * scrbuf, uint16_t* paletteptr, uint8_t offset){
uint16_t x,y;
uint16_t scanline[2][110]; // read two nibbles = pixels at a time
uint8_t *d;

write_command(0x20); write_data(0);
write_command(0x21); write_data(0);
write_command(0x22);
CLR_CS_SET_CD_RD_WR;

SET_MASK_P2;

for(x=0;x<110;x+=2)
  {
    d = scrbuf+x;// point to beginning of line in data
    uint8_t s=0;
    for(y=0;y<88;y++)
    {
        uint8_t t = *d;
        uint8_t t1 = *(d+1);
        scanline[0][s] = paletteptr[(t+offset)&255]<<3;
        scanline[1][s++] = paletteptr[(t1+offset)&255]<<3;
        d+=110; // jump to read byte directly below in screenbuffer
    }
    s=0;

    for (s=0;s<88;) {
        LPC_GPIO_PORT->MPIN[2] = (scanline[0][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[0][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[0][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[0][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[0][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[0][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[0][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[0][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
    }
    for (s=0;s<88;) {
        LPC_GPIO_PORT->MPIN[2] = (scanline[0][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[0][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[0][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[0][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[0][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[0][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[0][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[0][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
    }
    for (s=0;s<88;) {
        LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
    }
    for (s=0;s<88;) {
        LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
    }
  }
    CLR_MASK_P2;
}

I will attempt a pull request on GitHub later, but above is the mode13 function from HWLCD.cpp.
Every time I attempt to use the GitHub Desktop app, it ends up deleting a whole bunch of my projects. If anyone else wants to push the above change, please be my guest.

note: it also needs

    game.setFrameRate(100); 

added to the mode13 example to get the benefit. Excellent work @Hanski and @FManga


#12

The first and second for loops can be merged together. When you write to your scanline array, also write to MPIN.
That should perform a little better, even if you don’t unroll.

Edit- Something like this (not tested):

void Pokitto::lcdRefreshMode13(uint8_t * scrbuf, uint16_t* paletteptr, uint8_t offset){
uint16_t x,y;
uint16_t scanline[2][110]; // read two nibbles = pixels at a time
uint8_t *d;
write_command(0x20); write_data(0);
write_command(0x21); write_data(0);
write_command(0x22);
CLR_CS_SET_CD_RD_WR;
SET_MASK_P2;
for(x=0;x<110;x+=2)
  {
    d = scrbuf+x;// point to beginning of line in data
    uint8_t s=0;
    for(y=0;y<88;y++)
    {
        uint8_t t = *d;
        uint8_t t1 = *(d+1);
        LPC_GPIO_PORT->MPIN[2] = scanline[0][s] = paletteptr[(t+offset)&255]<<3;
        CLR_WR;SET_WR;CLR_WR;SET_WR;
        scanline[1][s++] = paletteptr[(t1+offset)&255]<<3;
        d+=110; // jump to read byte directly below in screenbuffer
    }
    for (s=0;s<88;) {
        LPC_GPIO_PORT->MPIN[2] = (scanline[0][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[0][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[0][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[0][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[0][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[0][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[0][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[0][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
    }
    for (s=0;s<88;) {
        LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
    }
    for (s=0;s<88;) {
        LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
        LPC_GPIO_PORT->MPIN[2] = (scanline[1][s++]);CLR_WR;SET_WR;CLR_WR;SET_WR;
    }
  }
    CLR_MASK_P2;
}

#13

It works, but it doesn’t look like it improves the fps at all. Is it possible that 50fps is the limit?

[edit] 89fps by setting game.setFrameRate(200);
[edit2] I seem to have lost the red section of my palette, I wonder when and how :frowning:


#14

I think something is off… I’m not sure if it’s the framerate limiter code or the FPS display, but I’ve noticed the FPS is often lower (about half) than the limit even if it’s supposed to be able to reach the limit (raising the limit also increases the FPS). So setting it to 100, it maxes at 50. Then raising it to 200, it maxes at 100.


#15

<<3 is going to overflow a uint16_t; you’ll lose the upper 3 bits.
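
To make the truncation concrete, here is a small sketch. It assumes an RGB565 palette entry (red in the top five bits), which would also explain the missing reds; the function names are made up for the example.

```cpp
#include <cstdint>

// Storing the shifted value in 16 bits: bits 16..18 of the result are dropped.
constexpr uint16_t shift_into_16(uint16_t colour) {
    return static_cast<uint16_t>(colour << 3);
}

// Widening before the shift keeps all 19 bits of the result.
constexpr uint32_t shift_into_32(uint16_t colour) {
    return static_cast<uint32_t>(colour) << 3;
}

// Pure red in RGB565 is 0xF800. After the 16-bit store only 0xC000 is left,
// so the top three bits of the red field never reach the port.
static_assert(shift_into_16(0xF800) == 0xC000, "upper three bits are lost");
static_assert(shift_into_32(0xF800) == 0x7C000, "32-bit storage keeps them");
```

Under that assumption, declaring the scanline buffer as uint32_t instead of uint16_t avoids the loss, at the cost of twice the buffer size.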


#16

cool, fixed that, but it dropped to 84fps. Still way better than the 24(ish) I had originally :slight_smile:


#17

I think I can get you some of that speed back, I’ll have to do some testing. Any opposition to using SRAM1?


#18

I just tried unrolling the display loops, got 100fps!

Honestly I wouldn’t know SRAM1 from a hole in the head, what’s it currently used for?


#19

Currently, it isn’t even enabled. It’s 2 KB of RAM that isn’t being used at all.


#20

I have no problem using something that currently isn’t used.