[Tool]Gameboy Emulator

open-source
release
tool

#201

That’s a great problem! One I hope to contribute to because the world needs more Pokitto!


#202

All pleasure is mine!

In fact, this is something I already experimented with, like, 10 years ago. The performance difference was almost nil, which is why I did not apply this method back then (the crude if/else method was easier to maintain during prototyping). Now I wondered whether my tests years ago were flawed, or whether I misremembered or simply missed something. So I re-implemented MMU matching, using function pointers instead of branches. And again, my benchmarks show no difference in speed, none at all. Apparently function call overhead is more or less equivalent to executing a branch (esp. if we factor in modern branch prediction). For completeness I also tested this on a pure i386 build, since I suspected gcc might optimize the if/else with some magic found on current processors, but even with an i386 build the results did not differ.

So my question is: did it really make a significant difference in your tests? If so, then I’d assume that branching on a Cortex unit is hugely more expensive than on a desktop CPU, unless you also observed differences on your PC? (In which case I’d be really interested in figuring out what I have missed.)


#203

As far as I’m aware, not many embedded CPUs have branch prediction.


#204

Ah, I see. I remember this optimization helped, though on its own it wasn’t a drastic difference, and I didn’t test on the PC. How are you benchmarking your code? I have no idea what GCC does on x86, but counting cycles to emulate a 4 MHz processor on a >1 GHz one probably won’t help much. :stuck_out_tongue:

On the Cortex-M0+ there is no branch prediction and function calls are relatively cheap. There is also no cache, so using tables everywhere doesn’t cause thrashing.

One difference was to allow the MemoryRead/Write functions to actually get inlined. All the inline keyword does is tell the compiler to not complain if it finds the same function more than once, while making functions smaller helps convince the compiler to actually inline. Instead of a function call and a chain of branches, it becomes a lookup and a call.
In simple cases, like writing to a regular block of memory, the result doesn’t even need a prolog/epilog:

void RAMWrite( uint32_t WriteAddr, uint8_t DataHolder, uint8_t *buffer ){
    buffer[ WriteAddr ] = DataHolder;
}

Becomes:

RAMWrite:
strb	r1, [r2, r0]
bx	lr
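
The "lookup and a call" idea can be sketched roughly as below, next to the if/else version it replaces. This is only an illustration under assumptions: the `writeHandlers` name comes from this thread, but the handler signature, the 4 KiB page granularity, the flat `mem` array, and the `rom_write`/`ram_write` helpers are made up for the example and are not the actual emulator code.

```c
#include <stdint.h>

/* Flat 64 KiB address space, for illustration only (assumption). */
static uint8_t mem[0x10000];

/* Hypothetical handler signature; the real one may differ. */
typedef void (*WriteHandler)(uint16_t addr, uint8_t value);

/* ROM region: writes are ignored in this sketch (no bank switching). */
static void rom_write(uint16_t addr, uint8_t value) { (void)addr; (void)value; }

/* RAM-like regions: a single store. */
static void ram_write(uint16_t addr, uint8_t value) { mem[addr] = value; }

/* One handler per 4 KiB page, indexed by the top four address bits. */
static const WriteHandler writeHandlers[16] = {
    rom_write, rom_write, rom_write, rom_write,   /* 0x0000-0x3FFF ROM */
    rom_write, rom_write, rom_write, rom_write,   /* 0x4000-0x7FFF ROM */
    ram_write, ram_write,                         /* 0x8000-0x9FFF video RAM */
    ram_write, ram_write,                         /* 0xA000-0xBFFF cartridge RAM */
    ram_write, ram_write,                         /* 0xC000-0xDFFF work RAM */
    ram_write, ram_write                          /* 0xE000-0xFFFF echo, OAM, I/O, HRAM */
};

/* Table version: one lookup, one indirect call. */
static void MemoryWrite_table(uint16_t addr, uint8_t value) {
    writeHandlers[addr >> 12](addr, value);
}

/* Branch version: a chain of range checks (only two ranges here). */
static void MemoryWrite_branch(uint16_t addr, uint8_t value) {
    if (addr < 0x8000) {
        /* ROM: ignored in this sketch */
    } else {
        mem[addr] = value;
    }
}
```

Both functions behave the same for the ranges modelled here; a real MMU would need finer-grained handlers for the I/O registers in the last page.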

#205

FPS counting (with unlimited framerate).

Why? A cycle is a cycle. On my i5 laptop I get about 1300 FPS running Marioland, and 2500 FPS with video output disabled. Whatever change I make to the code, if it increases the FPS count, it’s good.

zBoy spends about half of its time outputting pixels to the screen. That’s what gprof says, and it’s confirmed by FPS measurements with and without video output, which is why I disable video output when trying to measure the impact of changes on the CPU/MMU modules.
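
A minimal sketch of this kind of measurement, assuming a hypothetical `emulate_frame()` that stands in for one frame of emulation (the placeholder loops below are not zBoy code). It runs frames back to back with no frame limiter and divides the frame count by the CPU time reported by the standard `clock()` function.

```c
#include <stdint.h>
#include <time.h>

static volatile uint32_t sink;      /* keeps the placeholder work from being optimized away */
static unsigned long frames_run;    /* number of frames emulated so far */

/* Hypothetical stand-in for one emulated frame. */
static void emulate_frame(int video_enabled) {
    for (uint32_t i = 0; i < 10000u; i++) sink += i;      /* pretend CPU/MMU work */
    if (video_enabled) {
        for (uint32_t i = 0; i < 10000u; i++) sink ^= i;  /* pretend pixel output */
    }
    frames_run++;
}

/* Runs `frames` frames with no frame limiter and returns frames per second
   of CPU time, or 0.0 if the run was too short for clock() to measure. */
static double measure_fps(int frames, int video_enabled) {
    clock_t start = clock();
    for (int f = 0; f < frames; f++) emulate_frame(video_enabled);
    clock_t end = clock();
    double seconds = (double)(end - start) / CLOCKS_PER_SEC;
    return seconds > 0.0 ? (double)frames / seconds : 0.0;
}
```

Calling `measure_fps(n, 0)` and `measure_fps(n, 1)` on the same build gives the with/without video comparison described above.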

Anyway, my branchless MMU version is running at exactly the same speed on my i5… And when running on my PII 300 MHz, it’s even slower than the branch-rich original (110 FPS with branching, 100 with function pointer calls replacing the if/else chains). This is the reason I was surprised you got different results. But I understand you did not get that much of a speed-up either, and the little you got is due to the specifics of Cortex CPUs (no cache to thrash, no branch prediction to miss).

Ultimately, after the new tests I did during the last 48h, I doubt that fiddling with the MMU can bring any significant improvement, on x86 at least. Nonetheless, I will continue looking for ways to make things faster.

True, inlining (within reasonable limits, to avoid cache thrashing) can bring very good performance results. But it troubles me that you mention it while at the same time doing function calls through pointers (through the writeHandlers[] array), since such calls cannot be inlined, and moreover are likely to end up performing long jumps (much slower than short jumps, at least on x86). I am sure this method is somehow beneficial on the Cortex arch, since you applied it again in the CPU emulation through the OP[] function pointer array. Unfortunately it seems to be an optimization that works only on this kind of MCU.

I also noticed that you execute several CPU instructions within a single emulation cycle (8 instructions, to be precise). While this certainly makes things many times faster overall, because the rest of the emulation modules are called 8 times less often, I worry about the emulation accuracy… Didn’t it lead to regressions? I’d expect at least some games (esp. those with very tight timing loops) to break; it could also cause some games to be notified of an H-blank too late and produce glitches. I’d be curious to know how often such troubles really occur with your approach (if at all).
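
To make the trade-off concrete, here is a rough sketch of instruction batching. Everything in it is hypothetical: `cpu_step()`, `peripherals_update()`, and the fixed cost of 4 cycles per instruction are assumptions for illustration, not the Pokitto port's code.

```c
#include <stdint.h>

static unsigned long peripheral_updates;   /* how many times peripherals were synced */
static unsigned long total_cycles;         /* cycles reported to the peripherals */

/* Hypothetical CPU step: pretend every instruction costs 4 clock cycles. */
static int cpu_step(void) { return 4; }

/* Hypothetical video/timer/interrupt update, given the elapsed cycles. */
static void peripherals_update(int cycles) {
    total_cycles += (unsigned long)cycles;
    peripheral_updates++;
}

/* Executes `instructions` CPU instructions, syncing peripherals once per `batch`. */
static void run(unsigned long instructions, int batch) {
    int pending = 0;    /* cycles accumulated since the last sync */
    int in_batch = 0;   /* instructions executed since the last sync */
    for (unsigned long i = 0; i < instructions; i++) {
        pending += cpu_step();
        if (++in_batch == batch) {
            peripherals_update(pending);
            pending = 0;
            in_batch = 0;
        }
    }
    if (pending > 0) peripherals_update(pending);   /* flush a partial batch */
}
```

With `batch` set to 8 the peripherals see the same total number of cycles but are called one eighth as often. The cost is that, in this sketch, an event such as an H-blank can be noticed up to `batch - 1` instructions late.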


#206

If a >1 GHz processor has trouble emulating a 4 MHz one, it’s likely that a micro-optimization involving cycle counting wouldn’t be significant and there are high-level optimizations to be made instead. Diminishing returns.
It seems the machines you’re benchmarking on don’t need optimization anyway: they already run much faster than they need to, unless you’re trying to prolong laptop battery life.

They have the same cost here. Are they still that much slower on x86? I haven’t benchmarked x86 code in a long time.

8 instructions seems to be the upper limit; more than that will start breaking things, IIRC. Yes, it’s trading accuracy for performance, but, given the restrictions, the priorities are a bit different. I’d rather have a good number of playable games than a whole lot of really slow ones. Since I can’t test them all, I have some games I test with, then I find out if I’ve gone too far by releasing it and letting others use it.

There is a bug regarding some games not accepting input. I guess it could be related to this, I’ll have to check.


#207

Battery life is a possible concern, yes, but more generally, anything that wins some extra FPS on my i5 is usually just as beneficial on slower machines. I test zBoy periodically on my PII PC, and the percentage gain is usually similar on both machines. Of course, both are x86 architectures; results on different hardware (like the Pokitto) may vary wildly. For instance, on both my x86 machines, 50% of CPU time is spent drawing pixels, even though the video backends are completely different (SDL2 vs VGA mode 13h). But apparently on the Pokitto this activity is negligible, since I noticed you commented out my “detect screen areas that need to be refreshed” code in the Pokitto build (hence I assume it was costing more CPU cycles than it was saving, so pixels are probably almost free there).

Of course they are, I don’t question this choice at all. I’m only sad that I can’t reuse your clever bits upstream, since they are either Pokitto-specific or imply possible regressions :slight_smile:

Possibly. But it could also be due to a zBoy bug… I fixed a major joypad emulation bug in r211 the other day. This fixed the controls at least for Galaga & Galaxians (the game where I first noticed the problem, the day after I released v0.70). It’s a miracle that almost every title worked at all until now.


#208

This was mostly a RAM optimization. Instead of storing a full framebuffer, I switched to a scanline buffer. I wrote the code that copies the scanline to the LCD in assembly so it wouldn’t become a bottleneck.
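
A rough sketch of the scanline-buffer approach, assuming one byte per pixel and two hypothetical helpers, `render_scanline()` and `lcd_push_line()` (the real LCD copy is the assembly routine mentioned above and is not reproduced here). The Game Boy screen is 160x144, so at one byte per pixel a full framebuffer would take 23040 bytes, while a single reusable line takes 160.

```c
#include <stdint.h>

#define GB_WIDTH  160
#define GB_HEIGHT 144

static uint8_t scanline[GB_WIDTH];   /* 160 bytes, reused for every line */
static unsigned lines_sent;          /* counts lines handed to the LCD */

/* Hypothetical renderer: fills one line with 2-bit shade values (0..3). */
static void render_scanline(uint8_t *dst, int y) {
    for (int x = 0; x < GB_WIDTH; x++) {
        dst[x] = (uint8_t)((x + y) & 3);   /* placeholder pattern */
    }
}

/* Hypothetical LCD transfer; here it only counts the lines it receives. */
static void lcd_push_line(const uint8_t *src, int y) {
    (void)src;
    (void)y;
    lines_sent++;
}

/* Draws one frame line by line, never holding more than one line in RAM. */
static void draw_frame(void) {
    for (int y = 0; y < GB_HEIGHT; y++) {
        render_scanline(scanline, y);
        lcd_push_line(scanline, y);
    }
}
```

Because each line is overwritten as soon as it has been sent, a "which areas changed since the last frame" check has no previous frame to compare against, which fits with that code being disabled in this build.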

Ah, I’ll have to check that fix then. My dad sent me a bug report about those two games. :stuck_out_tongue:


#209

Did you count the GameBoy Color ROMs that also run on the DMG? Okay, there might not be a lot of such ROMs within Pokitto’s 128K limit :slight_smile: Roadsters '98 does come to mind, though (and there are probably a few more).


#210

Awesome. Played Super Mario Land on the Pokitto today.
The latest version of zBoy (0.7) got sound. Is it possible to port this version?
With sound everything is better :wink:


#211

It should be possible, and I intend to do so, but I’ve got some other things I need to finish first.
:coffee::coffee::coffee: :sleepy: