I saw you were using zboy, so I went with it too. I'm also keeping the ROM in flash, but I think it may be possible to put the ROM in RAM for mbc0 games (that way we'd be able to load games from the SD). I've been testing with Tetris, Dr Mario and Kwirk. They're still a bit slow, but playable. I'm only focusing on mbc0 games for now.
I uploaded the code in its current/rough state here: https://github.com/felipemanga/PokittoZBoy
At first I also thought timing could be an issue, so I disabled the throttling and didn't get any improvement. Since I couldn't find a hardware debugger in Brazil, I modified ProjectABE for profiling. I can now make a "hotspot ranking" like this:
#6ca1 ___ 100%
#6865 ___ 99.48%
#5f8d ___ 99.4%
#b4d6 ___ 73.38%
#b4d8 ___ 73.38%
#b504 ___ 71.88%
#383c ___ 36.82%
#3844 ___ 36.82%
#3848 ___ 36.82%
#3854 ___ 36.82%
ModeGBC was the first bottleneck, and replacing it with the version I posted previously helped a bit. Then I had to rewrite setPixel, getPixel, and most of DrawBackground. There's no point in drawing to a framebuffer, copying that to PokittoLib's framebuffer, then copying that to the LCD.
I removed the huge switch from the CPU interpreter and used an array of function pointers, instead. To lower the per-op overhead, I update the CPU more often than the other systems (16:1 was the most I could get away with).
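To illustrate the idea, here is a minimal sketch of table-based dispatch with the CPU run in batches. The Cpu struct, handler names and the three sample opcodes are hypothetical and not taken from the PokittoZBoy source; unimplemented opcodes simply act as NOP here.

```cpp
#include <cstdint>

// Hypothetical, stripped-down CPU state; the real emulator keeps more registers.
struct Cpu {
    uint8_t a = 0;
    uint16_t pc = 0;
    uint32_t cycles = 0;
    const uint8_t* rom = nullptr;
};

using OpHandler = void (*)(Cpu&);

static void op_nop(Cpu& c)    { c.cycles += 4; }                      // 0x00 NOP
static void op_inc_a(Cpu& c)  { ++c.a; c.cycles += 4; }               // 0x3C INC A
static void op_ld_a_n(Cpu& c) { c.a = c.rom[c.pc++]; c.cycles += 8; } // 0x3E LD A,n

// One slot per opcode: dispatch is a single indexed load plus an indirect call.
static OpHandler g_ops[256];

static void init_ops() {
    for (auto& handler : g_ops) handler = op_nop;  // placeholder for unimplemented opcodes
    g_ops[0x3C] = op_inc_a;
    g_ops[0x3E] = op_ld_a_n;
}

// Run `count` instructions back to back before returning to the caller,
// so video/timer updates happen once per batch instead of once per opcode.
static void run_cpu(Cpu& c, int count) {
    for (int i = 0; i < count; ++i) {
        uint8_t opcode = c.rom[c.pc++];
        g_ops[opcode](c);
    }
}
```

With a 16:1 ratio the main loop would call run_cpu(cpu, 16) and then update the other subsystems once, using the cycle count accumulated during the batch.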
The MMU was next: I use SRAM1 as a "RAM palette", to index addresses into blocks of memory. Reading is now simple enough that the compiler inlines it.
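A rough sketch of what such a block table can look like follows. The 256-byte block size, the names g_blocks, read8 and map_region, and the use of plain pointers are assumptions for illustration; the actual layout in SRAM1 may differ.

```cpp
#include <cstdint>

constexpr unsigned kBlockBits = 8;                        // assumed block size: 256 bytes
constexpr unsigned kBlockCount = 0x10000u >> kBlockBits;  // 256 entries cover 64 KiB
constexpr unsigned kBlockMask = (1u << kBlockBits) - 1;

// One pointer per block of the emulated address space.
// Unmapped blocks stay null in this sketch, so they must not be read.
static const uint8_t* g_blocks[kBlockCount];

// A read is a table lookup plus an offset, small enough for the compiler to inline.
inline uint8_t read8(uint16_t addr) {
    return g_blocks[addr >> kBlockBits][addr & kBlockMask];
}

// Point every block in [start, start + length) at the matching part of `backing`.
// `start` and `length` are expected to be multiples of the block size.
static void map_region(uint16_t start, uint32_t length, const uint8_t* backing) {
    for (uint32_t off = 0; off < length; off += 1u << kBlockBits) {
        g_blocks[(start + off) >> kBlockBits] = backing + off;
    }
}
```

For an mbc0 game, map_region(0x0000, 0x8000, rom) would be called once at startup, and the same table could point at flash or RAM without the read path changing.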
The profiler also pointed out things like:
CurLY = VideoClkCounterVBlank / 456
This compiles to a call to udivsi3 (a software division routine), so I replaced it with a fixed-point multiply:
CurLY = VideoClkCounterVBlank * (0x1000000 / 456) >> 24
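One detail worth checking with this trick: 0x1000000 / 456 truncates to 36792, which is slightly below the true reciprocal, so the expression comes out one lower than real division at exact multiples of 456 (456 gives 0, 457 gives 1). The sketch below compares it against a variant that adds 1 to the counter before multiplying. It assumes the counter is an unsigned 32-bit value that stays below one frame (154 * 456 = 70224 clocks); the function names are made up for the example.

```cpp
#include <cstdint>

constexpr uint32_t kRecip456 = 0x1000000 / 456;  // 36792, i.e. 2^24 / 456 rounded down

// The expression from the post: multiply by the truncated reciprocal, then shift.
inline uint32_t div456_truncated(uint32_t x) {
    return (x * kRecip456) >> 24;
}

// Same cost plus one add: biasing the input by 1 makes up for the truncated
// reciprocal. The 32-bit product does not overflow while x + 1 <= 116736.
inline uint32_t div456_biased(uint32_t x) {
    return ((x + 1) * kRecip456) >> 24;
}

// Count how many inputs in [0, limit) differ from exact integer division.
template <typename F>
uint32_t count_mismatches(F f, uint32_t limit) {
    uint32_t bad = 0;
    for (uint32_t x = 0; x < limit; ++x) {
        if (f(x) != x / 456) ++bad;
    }
    return bad;
}
```

If the counter only ever advances several clocks at a time, the late-by-one-clock result of the plain version may never be visible, so whether the extra add is worth it depends on how LY is consumed.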
Now individual opcodes are the bottleneck (#6ca1 is the implementation of LD A,($FF00+n)). This is a good thing, because that means the rest of the emulator is not in the way, and a bad thing because it's hard to make it any faster.