Weird idea about compression

So I've been playing with the direct tile draw, as seen in "Ultimate screen mode?".

That has some trouble, though, since it's 16 bits per pixel, which is awfully large to keep in memory.
One question I have: what would be the best way to compress 16-bit color into something smaller, like 8-bit,
without a significant loss in performance? And would that be faster or slower than a 256-color palette?

The second idea: what if we compressed images even more in flash, expanded them into RAM, and then drew them? Would there be a speed increase compared to the color palettes?

I’ve been toying with something along these lines, and here’s what I’ve got so far. It’s a little Aquarium demo with a resolution of 220x176 @ 256 colors, tiles+sprites. Unless I’ve made a mistake somewhere, I believe it runs at around 40 fps. I have only tested it with the latest version of armgcc, your kilometerage may vary.

See in Emulator | Download Bin | View Source

The first optimization is to do the opposite of compression: inflate each 16-bit color in a 256-color palette into a 32-bit one. I put the inflated palette in SRAM1, which would otherwise go to waste anyway.
First, enable SRAM1:

*reinterpret_cast<uint32_t *>(0x40048080) |= 3 << 26;

Yeah, that’s not how MBED code is supposed to do it… I’m just going by the processor datasheets, so pardon the bizarroness. Anyway, now for the inflation:

  uint32_t *palette = reinterpret_cast<uint32_t *>(0x20000000);
  void load8BitPalette( const unsigned short *pal ){
    uint32_t *out = palette;
    for( int i=0; i<256; ++i ){
      *out++ = uint32_t(*pal++)<<3;
    }
  }

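Get each color, shift it by 3… and you’re probably thinking: “WHY?!”
There are 38720 pixels (220x176) to push out the GPIO every frame, so it needs to be done quickly. Pre-shifting each palette entry means the value read from the palette can be written to the GPIO port as-is, with no shift left in the per-pixel loop.
Instead of calling write_data for each pixel, which does a lot under the hood, I do this: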

Set up things before the loop that writes to the LCD.

  write_command(0x20); write_data(0);
  write_command(0x21); write_data(0);
  write_command(0x22); // write data to DRAM
  CLR_CS_SET_CD_RD_WR;

  SET_MASK_P2;

The audio interrupt might need to be adjusted to preserve the P2 mask… I haven’t looked into this, yet.

Now simply look up the right color in the palette and write it directly to the GPIO. No per-pixel function call, no comparison with the previous color, no shifting, no NOPs. Just read the color (1 cycle), write it out (1 cycle), toggle the WR (2 cycles).

	  *reinterpret_cast<uint32_t *>(0xA0002188) = palette[ color ];
	  *reinterpret_cast<uint32_t *>(0xA0002284) = 1 << LCD_WR_PIN;
	  *reinterpret_cast<uint32_t *>(0xA0002204) = 1 << LCD_WR_PIN;
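
For a rough sense of what those 4 cycles per pixel add up to, here is a back-of-the-envelope sketch. The 48 MHz core clock is an assumption (it is not stated above), and the 4-cycle figure ignores loop and tile-fetch overhead, so treat the result as a lower bound rather than a measurement.

```cpp
#include <cstdint>

// Screen size and per-pixel cost as quoted in the post.
constexpr std::uint32_t kPixels = 220u * 176u;            // 38720 pixels per frame
constexpr std::uint32_t kCyclesPerPixel = 1 + 1 + 2;      // read + write + WR toggle
constexpr std::uint32_t kCyclesPerFrame = kPixels * kCyclesPerPixel;

// ASSUMPTION: 48 MHz core clock; adjust if the board runs at another speed.
constexpr std::uint32_t kAssumedClockMHz = 48;

// Whole microseconds spent pushing one frame to the LCD (rounded down).
constexpr std::uint32_t kMicrosPerFrame = kCyclesPerFrame / kAssumedClockMHz;

static_assert(kPixels == 38720, "220 x 176");
static_assert(kCyclesPerFrame == 154880, "4 cycles per pixel");
```

Under those assumptions the raw LCD output costs a bit over 3 ms per frame, a small slice of the 25 ms available at 40 fps.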

Tiles and sprites are optimized differently.

Tiles: These are of a known, fixed size, so I unrolled the loop that copies them out. Basically, something like this:

	  *reinterpret_cast<uint32_t *>(0xA0002188) = palette[ *tile++ ];
	  *reinterpret_cast<uint32_t *>(0xA0002284) = 1 << LCD_WR_PIN;
	  *reinterpret_cast<uint32_t *>(0xA0002204) = 1 << LCD_WR_PIN;
	  *reinterpret_cast<uint32_t *>(0xA0002188) = palette[ *tile++ ];
	  *reinterpret_cast<uint32_t *>(0xA0002284) = 1 << LCD_WR_PIN;
	  *reinterpret_cast<uint32_t *>(0xA0002204) = 1 << LCD_WR_PIN;
	  *reinterpret_cast<uint32_t *>(0xA0002188) = palette[ *tile++ ];
	  *reinterpret_cast<uint32_t *>(0xA0002284) = 1 << LCD_WR_PIN;
	  *reinterpret_cast<uint32_t *>(0xA0002204) = 1 << LCD_WR_PIN;
... etc ...

In practice, it’s a bit more complicated due to the mixture of tiles and sprites, but that’s the idea, and it helps some.
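
If writing the unrolled stores out by hand gets tedious, one possible alternative is to let the compiler generate them. This is only a sketch, not the demo's code: emitRow and its sink parameter are hypothetical names, with sink standing in for the three GPIO stores, and I have not measured whether the generated code matches the hand-unrolled version.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>

// Expands to one sink(...) call per pixel, with no loop counter or branch.
template <typename Sink, std::size_t... I>
inline void emitRowImpl(const std::uint32_t* palette, const std::uint8_t* tile,
                        Sink& sink, std::index_sequence<I...>) {
    (sink(palette[tile[I]]), ...);  // C++17 fold over the comma operator
}

// Emits Width pixels of one tile row, in order. `sink` is a hypothetical
// callable that would perform the GPIO write and the WR toggle.
template <std::size_t Width, typename Sink>
inline void emitRow(const std::uint32_t* palette, const std::uint8_t* tile, Sink sink) {
    emitRowImpl(palette, tile, sink, std::make_index_sequence<Width>{});
}
```

A call would look like emitRow<16>(palette, tile, writePixel), where 16 is the tile width and writePixel is whatever does the actual port writes.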

As for sprites, each column has an offset and a length, effectively RLE compression for the transparent area around a sprite. For the transparent areas enclosed within the sprite, regular comparison of the color index must be used. I did not unroll the sprite output loop, though it might be a good idea.
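
To make the column format concrete, here is a minimal sketch of how such a column could be composited. The SpriteColumn layout, the field names, and the choice of index 0 as the transparent color are all assumptions for illustration; the demo's actual format may differ, and it writes to the LCD rather than into a buffer.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical column layout: rows to skip, stored pixel count, pixel indices.
struct SpriteColumn {
    std::uint8_t offset;         // transparent rows skipped at the top
    std::uint8_t length;         // number of stored pixels
    const std::uint8_t* pixels;  // `length` palette indices
};

// ASSUMPTION: palette index 0 marks transparent pixels inside the sprite.
constexpr std::uint8_t kTransparent = 0;

// Draws one sprite column over a column of background palette indices.
// Rows outside [offset, offset + length) are never touched or tested.
inline void blendColumn(const SpriteColumn& col, std::uint8_t* dst, std::size_t height) {
    for (std::size_t i = 0; i < col.length; ++i) {
        const std::size_t row = col.offset + i;
        if (row >= height) break;               // clip at the bottom edge
        const std::uint8_t c = col.pixels[i];
        if (c != kTransparent) dst[row] = c;    // enclosed holes still need the compare
    }
}
```

The offset and length remove the per-pixel test for everything above and below the sprite's outline; only the stored span pays for the comparison.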

:dizzy_face: Voodoo magic here!!
Awesome!

@FManga Hahaha! Awesome trickery!


*reinterpret_cast<uint32_t *>(0x40048080) |= 3 << 26;

Haha! I have no idea what the mbed way of doing this is.
I looked around and couldn’t find it, so that’s what my code’s stuck with, for now. It’s mostly just a proof-of-concept anyway.

That line also enables USBRAM as I’m thinking of storing procedurally-generated tilemaps on it.

Your trick (with the palette) will speed up rendering considerably and can be used in all screen modes.

I hope you can help to implement this as a part of the library.

Can you do a tile scroll of the background?

I think so… though all the code I’ve written for the Pokitto so far uses only direct drawing, so I haven’t looked at the various screen modes’ code yet.

I haven’t implemented this yet, but I had it in mind and it should not be too hard to do.

If you can, look at the one I did in Ultimate screen mode. I don't know if that could work for you.

I would use this for a platformer

I tested the palette trick in mode15 and it didn’t help significantly. In this case the bottleneck is not the color lookup, it’s drawBitmapData. :frowning:

Is mode 15 that bad?

Uhm, I meant this code:

// tile test


#include "Pokitto.h"
Pokitto::Core game;

const uint16_t tile[]={22, 22, /*width, height*/
0xf800,0xf800,0xf800,0xf800,0xf800,0xf800,0xf800,0xf800,0xf800,0xf800,0xf800,0xf800,0xf800,0xf800,0xf800,0xf800,0xf800,0xc800,0xc800,0xc800,0xc800,0x9800,
0xf800,0xff0c,0xff0c,0xff0c,0xfff3,0xfff3,0xfff3,0xfff3,0xfff3,0xff0c,0xff0c,0xff0c,0xff0c,0xfff3,0xfff3,0xff0c,0xff0c,0xfff3,0xff0c,0xf629,0xf629,0x9800,
0xf800,0xff0c,0xff0c,0xff0c,0xff0c,0xfff3,0xfff3,0xfff3,0xfff3,0xfff3,0xfff3,0xfff3,0xfff3,0xfff3,0xff0c,0xff0c,0xfff3,0xff0c,0xff0c,0xf629,0xf629,0x9800,
0xf800,0xfff3,0xff0c,0xf629,0xf629,0xf629,0xc508,0xf629,0xc508,0xf629,0xf629,0xc508,0xf629,0xc508,0xc508,0xf629,0xc508,0xc508,0xf629,0x6aa3,0x6aa3,0x9800,
0xf800,0xfff3,0xfff3,0xc508,0xc508,0xf629,0xf629,0xc508,0xf629,0xf629,0xc508,0xc508,0xf629,0xf629,0xf629,0xc508,0xc508,0xf629,0xc508,0x6240,0x6aa3,0x9800,
0xf800,0xfff3,0xfff3,0xc508,0xf629,0xc508,0xc508,0xf629,0xff0c,0xc508,0xf629,0xc508,0xf629,0xc508,0xc508,0xc508,0xc508,0xc508,0xc508,0x6240,0x6aa3,0x9800,
0xf800,0xfff3,0xfff3,0xf629,0xc508,0xf629,0xff0c,0xc508,0xf629,0xc508,0xf629,0xc508,0xc508,0xc508,0xc508,0xc508,0xc508,0xac46,0xc508,0x6aa3,0x6aa3,0x9800,
0xf800,0xfff3,0xfff3,0xc508,0xf629,0xff0c,0xf629,0xc508,0xc508,0xf629,0xc508,0xc508,0xc508,0xf629,0xf629,0xac46,0x8384,0xc508,0xac46,0x6aa3,0x6240,0x9800,
0xf800,0xfff3,0xfff3,0xc508,0xc508,0xf629,0xc508,0xff0c,0xc508,0xac46,0xc508,0xac46,0xc508,0xc508,0xac46,0xac46,0xc508,0xac46,0xac46,0x6240,0x6aa3,0x9800,
0xf800,0xff0c,0xfff3,0xc508,0xf629,0xc508,0xc508,0xf629,0xc508,0xc508,0xc508,0xac46,0xac46,0xc508,0xc508,0xc508,0xc508,0x8384,0xc508,0x6240,0x6240,0x9800,
0xf800,0xff0c,0xfff3,0xf629,0xf629,0xf629,0xc508,0xc508,0xc508,0xc508,0xac46,0xc508,0xc508,0xc508,0xc508,0xc508,0xc508,0xac46,0xc508,0x6aa3,0x6240,0x9800,
0xf800,0xfff3,0xff0c,0xc508,0xc508,0xf629,0xf629,0xc508,0xc508,0xc508,0xc508,0xc508,0xc508,0xc508,0xc508,0xc508,0xac46,0x8384,0xc508,0x6240,0x6240,0x9800,
0xf800,0xff0c,0xff0c,0xac46,0xc508,0xc508,0xf629,0xc508,0xac46,0xac46,0xc508,0xc508,0xac46,0xac46,0xc508,0xc508,0x8384,0xac46,0xac46,0x6240,0x6aa3,0x9800,
0xf800,0xfff3,0xfff3,0xc508,0xc508,0xf629,0xc508,0xc508,0xc508,0xac46,0xc508,0xac46,0xac46,0xc508,0xac46,0xac46,0x8384,0xac46,0xc508,0x6aa3,0x6aa3,0x9800,
0xf800,0xff0c,0xff0c,0xac46,0xf629,0xf629,0xc508,0xc508,0xac46,0xac46,0xac46,0xac46,0x8384,0xac46,0x8384,0xac46,0x8384,0xc508,0xac46,0x6aa3,0x6240,0x9800,
0xf800,0xfff3,0xff0c,0xc508,0xc508,0xc508,0xac46,0xc508,0xc508,0xc508,0xc508,0xac46,0xc508,0xac46,0xac46,0xc508,0xc508,0xc508,0xc508,0x6aa3,0x6240,0x9800,
0xf800,0xff0c,0xff0c,0xc508,0xf629,0xf629,0xf629,0xc508,0xac46,0xc508,0xac46,0x8384,0xc508,0xc508,0x8384,0xc508,0xac46,0xc508,0xac46,0x6240,0x6aa3,0x9800,
0xc800,0xff0c,0xff0c,0xf629,0xf629,0xc508,0xc508,0xc508,0xc508,0xc508,0x8384,0xc508,0xac46,0xac46,0xac46,0x8384,0xac46,0xc508,0xac46,0x6aa3,0x6aa3,0x9800,
0xc800,0xff0c,0xff0c,0xf629,0xc508,0xf629,0xc508,0xac46,0xc508,0xac46,0xc508,0xc508,0xc508,0x8384,0xc508,0xac46,0xac46,0xac46,0xac46,0x6aa3,0x6240,0x9800,
0xc800,0xf629,0xf629,0x6aa3,0x6aa3,0x6aa3,0x6240,0x6aa3,0x6240,0x6240,0x6240,0x6240,0x6240,0x6240,0x6aa3,0x6240,0x6240,0x6aa3,0x6240,0x6aa3,0x6240,0x9800,
0xc800,0xf629,0xf629,0x6240,0x6aa3,0x6240,0x6aa3,0x6240,0x6240,0x6aa3,0x6aa3,0x6240,0x6aa3,0x6aa3,0x6240,0x6240,0x6aa3,0x6aa3,0x6aa3,0x6240,0x6240,0x9800,
0x9800,0x3000,0x3000,0x3000,0x3000,0x3000,0x3000,0x3000,0x3000,0x3000,0x3000,0x3000,0x3000,0x3000,0x3000,0x3000,0x3000,0x3000,0x3000,0x3000,0x3000,0x3000,
};
void drawTile(int x, int y, const uint16_t* spr ){
    if(x >220-spr[1])  game.display.directTile(x,y,220,y+spr[0],(uint16_t*)(spr+2));
    else if(x <0)               game.display.directTile(0,y,spr[1]+x,y+spr[0],(uint16_t*)(spr+2+(spr[0]*-x)));
    else                        game.display.directTile(x,y,x+spr[1],y+spr[0],(uint16_t*)(spr+2));
}

void drawMap(int8_t x,int8_t y/*, uint8_t* mapdata*/){
    for(int i=0; i<11; i++){
        for(int j=0; j<8; j++){
             drawTile(x+(i*22),y+(j*22),tile);
        }
    }
}
    
int main(){
    game.begin();
    game.setFrameRate(100);
    int count = 0; // start from a known value; it is read before the first assignment
   
    while(game.isRunning()){
       if(game.update(true)){
           
           drawMap(count-22,0);
          
           count++;
           count = count%22;
         
       }
    }
}

@FManga & @adekto : I just figured out how to make it work and get the speed boost in mode15. Let’s see if Felipe gets the same idea. Hint: 16 x 16 = ?

256… which means I can load the same 16-color palette 16 times into the inflated 256-color palette and not have to mask away the upper nibble to get the correct color for the lower nibble?
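
A minimal sketch of that idea, assuming the same shift-by-3 inflation as load8BitPalette above. The function name and the array parameters are hypothetical; the real code would write into the SRAM1 palette instead.

```cpp
#include <cstdint>

// Fills a 256-entry inflated palette with 16 copies of a 16-color palette,
// so indexing with a whole framebuffer byte yields the color of its low
// nibble without masking the high nibble away first.
inline void loadReplicatedPalette(const std::uint16_t (&pal16)[16],
                                  std::uint32_t (&out)[256]) {
    for (unsigned i = 0; i < 256; ++i) {
        out[i] = std::uint32_t(pal16[i & 0x0F]) << 3;  // same pre-shift as before
    }
}
```

The mask moves from the per-pixel loop into the one-time palette load, at the cost of 1 KB for the table.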

2 tables of 256 each for low and high lookup

edit: actually no, 1 table of uint16_t [256][2]

first write value [hi|lo][0] then [hi|lo][1] (just increment pointer)

edit: holy cow. This might speed up the rendering a lot.
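
Roughly, the table could be built and used like this. It is a sketch only: the names are made up, and which nibble holds the first pixel is an assumption (high nibble first here).

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// One entry per possible framebuffer byte, holding both pixels' colors.
using PairTable = std::array<std::array<std::uint16_t, 2>, 256>;

// ASSUMPTION: the high nibble holds the first pixel of each byte.
inline PairTable buildPairTable(const std::array<std::uint16_t, 16>& pal16) {
    PairTable table{};
    for (std::size_t b = 0; b < 256; ++b) {
        table[b][0] = pal16[b >> 4];    // first pixel
        table[b][1] = pal16[b & 0x0F];  // second pixel
    }
    return table;
}

// Expands `count` packed bytes into 2 * count colors: one table lookup per
// byte, then two sequential reads (just increment the pointer).
inline void expandPacked(const std::uint8_t* src, std::size_t count,
                         const PairTable& table, std::uint16_t* out) {
    for (std::size_t i = 0; i < count; ++i) {
        const std::uint16_t* pair = table[src[i]].data();
        *out++ = pair[0];
        *out++ = pair[1];
    }
}
```

With 16-bit entries the table takes 1 KB; neither nibble needs a shift or a mask at draw time.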

Ah, I only tried with the lower nibble because I was thinking of optimizations that would also apply to the 1- and 2-bit modes.

I was using the Mode15 demo for testing, and copying the large sprites onto the framebuffer seems to be much slower than from the framebuffer to the LCD. In that case, the more important optimization would be to provide a “copyBitmapData” function that doesn’t support transparency. It also wouldn’t hurt to invert the ifs and fors (have ifs on the outside with case-specific fors inside.)

For games that use small sprites and no background, I think this would help. I’ll give it another go.
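
To illustrate what I mean by inverting the ifs and fors, here is a sketch with one byte per pixel for simplicity (mode15 packs two pixels per byte, so the real thing would be fiddlier). All the names are hypothetical; none of these are existing library functions.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Opaque copy: no per-pixel test at all.
inline void copyRowOpaque(std::uint8_t* dst, const std::uint8_t* src, std::size_t n) {
    std::memcpy(dst, src, n);
}

// Keyed copy: pixels equal to `key` leave the destination untouched.
inline void copyRowKeyed(std::uint8_t* dst, const std::uint8_t* src,
                         std::size_t n, std::uint8_t key) {
    for (std::size_t i = 0; i < n; ++i) {
        if (src[i] != key) dst[i] = src[i];
    }
}

// The `if` sits outside; each branch owns a loop specialized for its case.
inline void blitRows(std::uint8_t* dst, std::size_t dstStride,
                     const std::uint8_t* src, std::size_t w, std::size_t h,
                     bool transparent, std::uint8_t key) {
    if (transparent) {
        for (std::size_t y = 0; y < h; ++y)
            copyRowKeyed(dst + y * dstStride, src + y * w, w, key);
    } else {
        for (std::size_t y = 0; y < h; ++y)
            copyRowOpaque(dst + y * dstStride, src + y * w, w);
    }
}
```

Backgrounds and other fully opaque bitmaps would take the second branch and skip the transparency test for every pixel.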

2-bit mode? uint16_t[256][4]

I suspect the second dimension in the array isn’t going to come for free. Thumb2 doesn’t have an LDR from a pointer with a shifted register+immediate offset. I’ll do some testing.

Edit: Are there any games out there using mode15? The Mode15.bin isn’t really representative of what we should be optimizing for.

@HomineLudens’ Pok15

And the unfinished demo, @adekto’s Pokitbeasts:

@adekto: I added the tile scrolling, see the links in my first post again.
It might be good to split this thread in two.

I get 10fps in Pok15 initially.
With a uint32_t[2][256] palette and the necessary adaptations, I get 15fps… which is still pretty slow, but 50% faster.
Pok15_paltrick.bin (163.2 KB)
