So I had been playing with the direct tile draw, as seen in Ultimate screen mode.

That has some trouble, since it's 16 bits per pixel, which is awfully large to keep in memory.
One of the questions I have is: what would be the best way to compress 16-bit color into something lower, like 8-bit,
without a significant loss in performance, and would that be faster or slower than a 256-color palette?

The second idea I had: what if we compressed images even more in flash, expanded them into RAM, and then drew them? Would there be a speed increase compared to the color palettes?

I’ve been toying with something along these lines, and here’s what I’ve got so far. It’s a little Aquarium demo with a resolution of 220x176 @ 256 colors, tiles+sprites. Unless I’ve made a mistake somewhere, I believe it runs at around 40 fps. I have only tested it with the latest version of armgcc; your kilometerage may vary.

The first optimization is to do the opposite of compression: inflate each 16-bit color in a 256-color palette into a 32-bit one. I put the inflated palette in SRAM1, which would otherwise go to waste anyway.
First, enable SRAM1:

``````
// Set bits 26 and 27 of the clock-control register (SRAM1, plus USBRAM).
*reinterpret_cast<uint32_t *>(0x40048080) |= 3 << 26;
``````

Yeah, that’s not how MBED code is supposed to do it… I’m just going by the processor datasheets, so pardon the bizarroness. Anyway, now for the inflation:

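``````
// The inflated palette lives in SRAM1.
uint32_t *palette = reinterpret_cast<uint32_t *>(0x20000000);

void load8BitPalette( const unsigned short *pal ){
    uint32_t *out = palette;
    for( int i=0; i<256; ++i ){
        // Pre-shift each 16-bit color so it can be written to the port as-is.
        *out++ = uint32_t(*pal++)<<3;
    }
}
``````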
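A quick compile-time check of what the shift does to a single color. This is only a sketch: `inflateColor` is a name made up for this example and is not part of the demo.

```cpp
#include <cstdint>

// Hypothetical helper: the same shift as load8BitPalette, for one color.
constexpr uint32_t inflateColor(uint16_t color16) {
    return static_cast<uint32_t>(color16) << 3;
}

// 0xF800 (red, if the colors are RGB565) no longer fits in 16 bits once
// shifted, which is why the inflated table needs 32-bit entries.
static_assert(inflateColor(0xF800) == 0x7C000u, "shifted value needs 19 bits");
static_assert(inflateColor(0xFFFF) == 0x7FFF8u, "largest possible entry");
```

So the table costs 256 x 4 = 1024 bytes instead of 512, in exchange for having no shift left in the per-pixel code.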

Get each color, shift it by 3… and you’re probably thinking: “WHY?!”
There are 38720 pixels (220 x 176) to push out the GPIO, so it needs to be done quickly.
Instead of calling write_data, which does a lot under the hood, for each pixel I do this:

Set up things before the loop that writes to the LCD.

``````
write_command(0x20); write_data(0);
write_command(0x21); write_data(0);
write_command(0x22); // write data to DRAM
CLR_CS_SET_CD_RD_WR;
``````

The audio interrupt might need to be adjusted to preserve the P2 mask… I haven’t looked into this yet.

Now simply look up the right color in the palette and write it directly to the GPIO. No per-pixel function call, no comparison with the previous color, no shifting, no NOPs. Just read the color (1 cycle), write it out (1 cycle), toggle the WR (2 cycles).

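``````
// volatile keeps the compiler from merging or dropping repeated register writes.
*reinterpret_cast<volatile uint32_t *>(0xA0002188) = palette[ color ]; // color onto the data pins
*reinterpret_cast<volatile uint32_t *>(0xA0002284) = 1 << LCD_WR_PIN;  // WR low
*reinterpret_cast<volatile uint32_t *>(0xA0002204) = 1 << LCD_WR_PIN;  // WR high
``````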
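To show how the pieces fit into a whole-frame loop, here is a rough sketch that can be run on a desktop. The three registers are replaced by a made-up `FakeLcdPort` struct, `pushFrame` is a hypothetical name, and `kLcdWrPin` is an assumed stand-in for `LCD_WR_PIN`; none of these come from the demo itself.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the three GPIO registers used above. On the
// device these would be the fixed addresses 0xA0002188, 0xA0002284 and
// 0xA0002204.
struct FakeLcdPort {
    volatile uint32_t data = 0;    // receives the pre-shifted color
    volatile uint32_t wrClear = 0; // writing a bit here pulls WR low
    volatile uint32_t wrSet = 0;   // writing a bit here pulls WR high
    std::size_t strobes = 0;       // bookkeeping for this example only
};

constexpr int kLcdWrPin = 12;      // assumed value, stands in for LCD_WR_PIN

// Push an 8-bit framebuffer through the inflated palette, one pixel per
// iteration: one lookup, one data write, two writes to strobe WR.
void pushFrame(FakeLcdPort &port, const uint32_t *inflatedPalette,
               const uint8_t *framebuffer, std::size_t pixelCount) {
    for (std::size_t i = 0; i < pixelCount; ++i) {
        port.data = inflatedPalette[framebuffer[i]]; // no shifting here
        port.wrClear = 1u << kLcdWrPin;
        port.wrSet = 1u << kLcdWrPin;
        ++port.strobes;
    }
}
```

Calling `pushFrame(port, palette, fb, 220 * 176)` strobes WR 38720 times, once per pixel.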

Tiles and sprites are optimized differently.

Tiles: These are of a known, fixed size, so I unrolled the loop that copies them out. Basically, something like this:

``````
// volatile matters here: the same registers are written back to back.
*reinterpret_cast<volatile uint32_t *>(0xA0002188) = palette[ *tile++ ];
*reinterpret_cast<volatile uint32_t *>(0xA0002284) = 1 << LCD_WR_PIN;
*reinterpret_cast<volatile uint32_t *>(0xA0002204) = 1 << LCD_WR_PIN;
*reinterpret_cast<volatile uint32_t *>(0xA0002188) = palette[ *tile++ ];
*reinterpret_cast<volatile uint32_t *>(0xA0002284) = 1 << LCD_WR_PIN;
*reinterpret_cast<volatile uint32_t *>(0xA0002204) = 1 << LCD_WR_PIN;
*reinterpret_cast<volatile uint32_t *>(0xA0002188) = palette[ *tile++ ];
*reinterpret_cast<volatile uint32_t *>(0xA0002284) = 1 << LCD_WR_PIN;
*reinterpret_cast<volatile uint32_t *>(0xA0002204) = 1 << LCD_WR_PIN;
// ... etc ...
``````
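If typing out every copy by hand gets tedious, one possible alternative (not what the demo does) is to let the compiler generate the repetition with a recursive template. In this sketch `PixelSink` and `emitRow` are made-up names, and the sink stands in for the three register writes so the code can run off-device.

```cpp
#include <cstdint>

// Hypothetical pixel sink; on the device `put` would be the three register
// writes shown above.
struct PixelSink {
    uint32_t last = 0;
    unsigned count = 0;
    void put(uint32_t inflated) { last = inflated; ++count; }
};

// Emits N pixels. The recursion is resolved at compile time, so with
// optimization enabled the compiler can inline it into straight-line code.
template <int N>
inline const uint8_t *emitRow(PixelSink &sink, const uint32_t *palette,
                              const uint8_t *tile) {
    sink.put(palette[*tile++]);
    return emitRow<N - 1>(sink, palette, tile);
}

// End of the recursion: nothing left to emit.
template <>
inline const uint8_t *emitRow<0>(PixelSink &, const uint32_t *,
                                 const uint8_t *tile) {
    return tile;
}
```

`emitRow<16>(sink, palette, tile)` would then cover one 16-pixel run and return the advanced tile pointer. Whether it ends up as fast as the hand-unrolled version depends on the compiler and flags, so it would need measuring.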

In practice, it’s a bit more complicated due to the mixture of tiles and sprites, but that’s the idea, and it helps some.

As for sprites, each column has an offset and a length, effectively RLE compression for the transparent area around a sprite. For the transparent areas enclosed within the sprite, regular comparison of the color index must be used. I did not unroll the sprite output loop, though it might be a good idea.
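A minimal sketch of that column format, under some assumptions of mine: the struct layout, the names `SpriteColumn` and `drawSpriteColumn`, the transparent index 0, and drawing into a column-major 8-bit buffer are all hypothetical, not taken from the demo.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout: one entry per sprite column. Rows before `offset`
// and after `offset + length` are transparent and are never stored.
struct SpriteColumn {
    uint8_t offset;        // first stored row in this column
    uint8_t length;        // number of stored pixels
    const uint8_t *pixels; // `length` palette indices
};

constexpr uint8_t kTransparent = 0; // assumed transparent color index

// Draw one sprite column into one column of an 8-bit buffer.
void drawSpriteColumn(uint8_t *bufColumn, int bufHeight, const SpriteColumn &col) {
    assert(col.offset + col.length <= bufHeight);
    uint8_t *out = bufColumn + col.offset; // the outer transparent run costs nothing
    for (int i = 0; i < col.length; ++i) {
        const uint8_t c = col.pixels[i];
        if (c != kTransparent) {           // holes inside the sprite still need a test
            out[i] = c;
        }
    }
}
```

For a roughly round sprite this skips most of the transparent pixels without ever reading them, and only the stored span pays for the per-pixel comparison.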

Voodoo magic here!!
Awesome!

@FManga Hahaha! Awesome trickery!

``````
*reinterpret_cast<uint32_t *>(0x40048080) |= 3 << 26;
``````

Haha! I have no idea what the mbed way of doing this is.
I looked around and couldn’t find it, so that’s what my code’s stuck with, for now. It’s mostly just a proof-of-concept anyway.

That line also enables USBRAM as I’m thinking of storing procedurally-generated tilemaps on it.

Your trick (with the palette) will speed up rendering considerably & can be used in all screen modes

I hope you can help to implement this as a part of the library.

Can you do a tile scroll of the background?

I think so… though all the code I’ve written for the Pokitto so far uses only direct drawing, so I haven’t looked at the various screen modes’ code yet.

I haven’t implemented this yet, but I had it in mind and it should not be too hard to do.

Maybe you can look at the one I did in Ultimate screen mode; idk if that could work for you.

I would use this for a platformer

I tested the palette trick in mode15 and it didn’t help significantly. In this case the bottleneck is not the color lookup, it’s drawBitmapData.

uhm i meant this code

``````
// tile test

#include "Pokitto.h"
Pokitto::Core game;

const uint16_t tile[]={22, 22, /*width, height*/
0xf800,0xf800,0xf800,0xf800,0xf800,0xf800,0xf800,0xf800,0xf800,0xf800,0xf800,0xf800,0xf800,0xf800,0xf800,0xf800,0xf800,0xc800,0xc800,0xc800,0xc800,0x9800,
0xf800,0xff0c,0xff0c,0xff0c,0xfff3,0xfff3,0xfff3,0xfff3,0xfff3,0xff0c,0xff0c,0xff0c,0xff0c,0xfff3,0xfff3,0xff0c,0xff0c,0xfff3,0xff0c,0xf629,0xf629,0x9800,
0xf800,0xff0c,0xff0c,0xff0c,0xff0c,0xfff3,0xfff3,0xfff3,0xfff3,0xfff3,0xfff3,0xfff3,0xfff3,0xfff3,0xff0c,0xff0c,0xfff3,0xff0c,0xff0c,0xf629,0xf629,0x9800,
0xf800,0xfff3,0xff0c,0xf629,0xf629,0xf629,0xc508,0xf629,0xc508,0xf629,0xf629,0xc508,0xf629,0xc508,0xc508,0xf629,0xc508,0xc508,0xf629,0x6aa3,0x6aa3,0x9800,
0xf800,0xfff3,0xfff3,0xc508,0xc508,0xf629,0xf629,0xc508,0xf629,0xf629,0xc508,0xc508,0xf629,0xf629,0xf629,0xc508,0xc508,0xf629,0xc508,0x6240,0x6aa3,0x9800,
0xf800,0xfff3,0xfff3,0xc508,0xf629,0xc508,0xc508,0xf629,0xff0c,0xc508,0xf629,0xc508,0xf629,0xc508,0xc508,0xc508,0xc508,0xc508,0xc508,0x6240,0x6aa3,0x9800,
0xf800,0xfff3,0xfff3,0xf629,0xc508,0xf629,0xff0c,0xc508,0xf629,0xc508,0xf629,0xc508,0xc508,0xc508,0xc508,0xc508,0xc508,0xac46,0xc508,0x6aa3,0x6aa3,0x9800,
0xf800,0xfff3,0xfff3,0xc508,0xf629,0xff0c,0xf629,0xc508,0xc508,0xf629,0xc508,0xc508,0xc508,0xf629,0xf629,0xac46,0x8384,0xc508,0xac46,0x6aa3,0x6240,0x9800,
0xf800,0xfff3,0xfff3,0xc508,0xc508,0xf629,0xc508,0xff0c,0xc508,0xac46,0xc508,0xac46,0xc508,0xc508,0xac46,0xac46,0xc508,0xac46,0xac46,0x6240,0x6aa3,0x9800,
0xf800,0xff0c,0xfff3,0xc508,0xf629,0xc508,0xc508,0xf629,0xc508,0xc508,0xc508,0xac46,0xac46,0xc508,0xc508,0xc508,0xc508,0x8384,0xc508,0x6240,0x6240,0x9800,
0xf800,0xff0c,0xfff3,0xf629,0xf629,0xf629,0xc508,0xc508,0xc508,0xc508,0xac46,0xc508,0xc508,0xc508,0xc508,0xc508,0xc508,0xac46,0xc508,0x6aa3,0x6240,0x9800,
0xf800,0xfff3,0xff0c,0xc508,0xc508,0xf629,0xf629,0xc508,0xc508,0xc508,0xc508,0xc508,0xc508,0xc508,0xc508,0xc508,0xac46,0x8384,0xc508,0x6240,0x6240,0x9800,
0xf800,0xff0c,0xff0c,0xac46,0xc508,0xc508,0xf629,0xc508,0xac46,0xac46,0xc508,0xc508,0xac46,0xac46,0xc508,0xc508,0x8384,0xac46,0xac46,0x6240,0x6aa3,0x9800,
0xf800,0xfff3,0xfff3,0xc508,0xc508,0xf629,0xc508,0xc508,0xc508,0xac46,0xc508,0xac46,0xac46,0xc508,0xac46,0xac46,0x8384,0xac46,0xc508,0x6aa3,0x6aa3,0x9800,
0xf800,0xff0c,0xff0c,0xac46,0xf629,0xf629,0xc508,0xc508,0xac46,0xac46,0xac46,0xac46,0x8384,0xac46,0x8384,0xac46,0x8384,0xc508,0xac46,0x6aa3,0x6240,0x9800,
0xf800,0xfff3,0xff0c,0xc508,0xc508,0xc508,0xac46,0xc508,0xc508,0xc508,0xc508,0xac46,0xc508,0xac46,0xac46,0xc508,0xc508,0xc508,0xc508,0x6aa3,0x6240,0x9800,
0xf800,0xff0c,0xff0c,0xc508,0xf629,0xf629,0xf629,0xc508,0xac46,0xc508,0xac46,0x8384,0xc508,0xc508,0x8384,0xc508,0xac46,0xc508,0xac46,0x6240,0x6aa3,0x9800,
0xc800,0xff0c,0xff0c,0xf629,0xf629,0xc508,0xc508,0xc508,0xc508,0xc508,0x8384,0xc508,0xac46,0xac46,0xac46,0x8384,0xac46,0xc508,0xac46,0x6aa3,0x6aa3,0x9800,
0xc800,0xff0c,0xff0c,0xf629,0xc508,0xf629,0xc508,0xac46,0xc508,0xac46,0xc508,0xc508,0xc508,0x8384,0xc508,0xac46,0xac46,0xac46,0xac46,0x6aa3,0x6240,0x9800,
0xc800,0xf629,0xf629,0x6aa3,0x6aa3,0x6aa3,0x6240,0x6aa3,0x6240,0x6240,0x6240,0x6240,0x6240,0x6240,0x6aa3,0x6240,0x6240,0x6aa3,0x6240,0x6aa3,0x6240,0x9800,
0xc800,0xf629,0xf629,0x6240,0x6aa3,0x6240,0x6aa3,0x6240,0x6240,0x6aa3,0x6aa3,0x6240,0x6aa3,0x6aa3,0x6240,0x6240,0x6aa3,0x6aa3,0x6aa3,0x6240,0x6240,0x9800,
0x9800,0x3000,0x3000,0x3000,0x3000,0x3000,0x3000,0x3000,0x3000,0x3000,0x3000,0x3000,0x3000,0x3000,0x3000,0x3000,0x3000,0x3000,0x3000,0x3000,0x3000,0x3000,
};
void drawTile(int x, int y, const uint16_t* spr ){
    if(x >220-spr[1])  game.display.directTile(x,y,220,y+spr[0],(uint16_t*)(spr+2));
    else if(x <0)      game.display.directTile(0,y,spr[1]+x,y+spr[0],(uint16_t*)(spr+2+(spr[0]*-x)));
    else               game.display.directTile(x,y,x+spr[1],y+spr[0],(uint16_t*)(spr+2));
}

void drawMap(int8_t x,int8_t y/*, uint8_t* mapdata*/){
    for(int i=0; i<11; i++){
        for(int j=0; j<8; j++){
            drawTile(x+(i*22),y+(j*22),tile);
        }
    }
}

int main(){
    game.begin();
    game.setFrameRate(100);
    int count = 0; // scroll offset; was uninitialized before

    while(game.isRunning()){
        if(game.update(true)){

            drawMap(count-22,0);

            count++;
            count = count%22;

        }
    }
}
``````

@FManga & @adekto : I just figured out how to make it work and get the speed boost in mode15. Let’s see if Felipe gets the same idea. Hint: 16 x 16 = ?

256… which means I can load the same 16-color palette 16 times into the inflated 256-color palette and not have to mask away the upper nibble to get the correct color for the lower nibble?
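If that reading is right, the table setup could look something like this. It is only a sketch: `loadPalette16x16` is a made-up name, and the shift of 3 is carried over from the inflated palette earlier in the thread.

```cpp
#include <cstdint>

// Fill a 256-entry inflated table with 16 copies of a 16-color palette.
// Entry i depends only on the low nibble of i, so a packed byte can index
// the table directly to get the color of its low-nibble pixel.
void loadPalette16x16(uint32_t *table256, const uint16_t *pal16) {
    for (int i = 0; i < 256; ++i) {
        table256[i] = static_cast<uint32_t>(pal16[i & 0x0F]) << 3;
    }
}
```

With this, `table[0x35]` and `table[0xA5]` hold the same color, so the `& 0x0F` disappears from the drawing loop. The high-nibble pixel still needs a shift, or a second table.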

2 tables of 256 each for low and high lookup

edit: actually no, 1 table of uint16_t [256][2]

first write value [hi|lo][0] then [hi|lo][1] (just increment pointer)

edit: holy cow. this might speed up the rendering a lot
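A rough sketch of the `uint16_t [256][2]` idea, written so it can be tried off-device. The names `pairTable`, `buildPairTable` and `expandPacked` are made up, and it assumes the high nibble is the first of the two pixels.

```cpp
#include <cstdint>

// One row per possible packed byte: [b][0] is the color of the high-nibble
// pixel, [b][1] the color of the low-nibble pixel.
static uint16_t pairTable[256][2];

// Build the table once per palette change.
void buildPairTable(const uint16_t *pal16) {
    for (int b = 0; b < 256; ++b) {
        pairTable[b][0] = pal16[b >> 4];
        pairTable[b][1] = pal16[b & 0x0F];
    }
}

// Expand packed 4-bit pixels into colors. The loop does one table lookup per
// byte and then just walks the pair, with no masking or shifting.
void expandPacked(const uint8_t *packed, int byteCount, uint16_t *out) {
    for (int i = 0; i < byteCount; ++i) {
        const uint16_t *pair = pairTable[packed[i]];
        *out++ = pair[0];
        *out++ = pair[1];
    }
}
```

The table takes 256 x 2 x 2 = 1024 bytes. On the device the two `*out++` writes would presumably become writes to the LCD port instead of a buffer.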

Ah, I only tried with the lower nibble because I was thinking of optimizations that would also apply to the 1- and 2-bit modes.

I was using the Mode15 demo for testing, and copying the large sprites onto the framebuffer seems to be much slower than from the framebuffer to the LCD. In that case, the more important optimization would be to provide a “copyBitmapData” function that doesn’t support transparency. It also wouldn’t hurt to invert the ifs and fors (have `if`s on the outside with case-specific `for`s inside.)
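Roughly what I mean by that, as a sketch only. It uses one byte per pixel to keep it short (mode15 packs two pixels per byte), and the function names and signatures are invented here; they are not the library's API.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical opaque copy: with no transparency test, each row is one memcpy.
void copyBitmapOpaque(uint8_t *dst, int dstStride, const uint8_t *src,
                      int w, int h) {
    for (int y = 0; y < h; ++y) {
        std::memcpy(dst + y * dstStride, src + y * w, static_cast<std::size_t>(w));
    }
}

// Hypothetical transparent copy, kept separate so that only bitmaps which
// need it pay for the per-pixel comparison.
void copyBitmapKeyed(uint8_t *dst, int dstStride, const uint8_t *src,
                     int w, int h, uint8_t key) {
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            const uint8_t c = src[y * w + x];
            if (c != key) dst[y * dstStride + x] = c;
        }
    }
}

// The `if` sits outside the loops: the case is decided once per bitmap.
void copyBitmap(uint8_t *dst, int dstStride, const uint8_t *src,
                int w, int h, bool hasTransparency, uint8_t key) {
    if (hasTransparency) copyBitmapKeyed(dst, dstStride, src, w, h, key);
    else                 copyBitmapOpaque(dst, dstStride, src, w, h);
}
```

Clipping is left out entirely; a real version would need it, ideally also decided before the loops.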

For games that use small sprites and no background, I think this would help. I’ll give it another go.

2-bit mode? uint16_t[256][4]

I suspect the second dimension in the array isn’t going to come for free. Thumb2 doesn’t have an LDR from a pointer with a shifted register+immediate offset. I’ll do some testing.

Edit: Are there any games out there using mode15? The Mode15.bin isn’t really representative of what we should be optimizing for.

@HomineLudens’ Pok15

And the unfinished demo, @adekto’s Pokitbeasts:
