FPS counting (with unlimited framerate).
Why? A cycle is a cycle. On my i5 laptop I get about 1300 FPS running Marioland, and 2500 FPS with video output disabled. Whatever change I make to the code, if it increases the FPS count, it’s good.
zBoy spends about half of its time outputting pixels to the screen. That’s what gprof says, and it’s confirmed by FPS measurements with and without video output, which is why I disable video output when measuring the impact of changes on the CPU/MMU modules.
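To make the measuring method concrete, here is a minimal sketch of an uncapped FPS counter. It is not zBoy's actual code: the names fps_counter, fps_compute and fps_tick are hypothetical, and it assumes clock() (CPU time) is an acceptable time base because an uncapped emulator keeps the CPU busy the whole time.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical FPS counter, for illustration only (not taken from zBoy). */
struct fps_counter {
    unsigned long frames;  /* frames emulated in the current window */
    clock_t start;         /* CPU time at the start of the window */
};

/* Starts a new measuring window. */
static void fps_reset(struct fps_counter *c)
{
    c->frames = 0;
    c->start = clock();
}

/* Average frames per second over a window; 0 if the window is empty. */
static double fps_compute(unsigned long frames, double seconds)
{
    return (seconds > 0.0) ? (double)frames / seconds : 0.0;
}

/* Call once per emulated frame, with no frame limiter in the main loop.
 * Prints the average roughly once per second of CPU time. */
static void fps_tick(struct fps_counter *c)
{
    double elapsed;

    c->frames++;
    elapsed = (double)(clock() - c->start) / CLOCKS_PER_SEC;
    if (elapsed >= 1.0) {
        printf("%.0f FPS\n", fps_compute(c->frames, elapsed));
        fps_reset(c);
    }
}
```

The idea would be to call fps_reset() once before the main loop, then fps_tick() after every frame, and to compare the printed figure before and after a code change, with video output disabled when only the CPU/MMU is of interest.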
Anyway, my branchless MMU version runs at exactly the same speed on my i5… And on my PII 300 MHz, it’s even slower than the branch-rich original (110 FPS with branching, 100 FPS with function pointer calls replacing the if/else chains). This is why I was surprised that you got different results - but I understand you did not get much of a speed-up either, and the little you got is due to the specifics of Cortex CPUs (no cache to thrash, no branch prediction to miss).
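For readers who have not seen both variants, here is a rough sketch of what is being compared. It is a heavily simplified assumption of mine, not the code of either emulator: the memory map is reduced to three cases (ROM area, echo RAM, everything else), the table is indexed by the high byte of the address, and all names except writeHandlers[] are made up for the example.

```c
#include <stdint.h>

static uint8_t mem[0x10000];   /* flat 64 KiB address space */
static unsigned mbc_writes;    /* counts writes landing in the ROM area */

/* Writes to 0x0000-0x7FFF would go to the cartridge's bank controller. */
static void write_rom(uint16_t addr, uint8_t v) { (void)addr; (void)v; mbc_writes++; }
/* Plain RAM write. */
static void write_ram(uint16_t addr, uint8_t v) { mem[addr] = v; }
/* Echo RAM (0xE000-0xFDFF) mirrors the work RAM located 0x2000 below. */
static void write_echo(uint16_t addr, uint8_t v) { mem[addr - 0x2000] = v; }

/* Variant 1: branch-rich if/else chain. */
static void mmu_write_branch(uint16_t addr, uint8_t v)
{
    if (addr < 0x8000) write_rom(addr, v);
    else if (addr >= 0xE000 && addr < 0xFE00) write_echo(addr, v);
    else write_ram(addr, v);
}

/* Variant 2: one handler per 256-byte page, picked without branching. */
typedef void (*write_fn)(uint16_t, uint8_t);
static write_fn writeHandlers[256];

/* Fills the table once at startup, using the same rules as the chain above. */
static void mmu_init_table(void)
{
    int page;
    for (page = 0; page < 256; page++) {
        if (page < 0x80) writeHandlers[page] = write_rom;
        else if (page >= 0xE0 && page < 0xFE) writeHandlers[page] = write_echo;
        else writeHandlers[page] = write_ram;
    }
}

/* The indirect call replaces the comparisons, but cannot be inlined. */
static void mmu_write_table(uint16_t addr, uint8_t v)
{
    writeHandlers[addr >> 8](addr, v);
}
```

Both variants do the same work; the second trades a few well-predicted comparisons for a table load plus an indirect call, which is consistent with it not being an automatic win on x86.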
Ultimately, after the new tests I did during the last 48 hours, I doubt that fiddling with the MMU can bring any significant improvement, on x86 at least. Nonetheless, I will keep looking for ways to make things faster.
True, inlining (within reasonable limits, to avoid cache thrashing) can bring very good performance results. But it puzzles me that you mention it while at the same time making function calls through pointers (via the writeHandlers[] array), since such calls cannot be inlined, and moreover are likely to end up performing long jumps (much slower than short jumps, at least on x86). I am sure that this method is somehow beneficial on the Cortex arch, since you applied it again in the CPU emulation through the OP[] function pointer array. Unfortunately it seems to be an optimization that works only on this kind of MCU.
I also noticed that you execute several CPU instructions within a single emulation cycle (8 instructions, to be precise). While this certainly makes things much faster overall, because the rest of the emulation modules are called 8 times less often, I worry about emulation accuracy… Didn’t it lead to regressions? I’d expect at least some games (esp. those with very tight timing loops) to break; it could also cause some games to be notified of an H-blank too late, and produce glitches. I’d be curious to know how often such problems really occur with your approach (if at all).
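To illustrate the kind of lateness I have in mind, here is a toy model (purely hypothetical, not taken from either emulator). It assumes every instruction costs the same number of cycles and that the video module is only polled after each batch of instructions; notice_cycle() and its parameters are names invented for the example.

```c
/* Returns the cycle count at which a video event due at `event_at` is first
 * noticed, when the video module is polled only once per `batch` instructions.
 * Toy model: every instruction costs `cycles_per_insn` cycles.
 * Both `batch` and `cycles_per_insn` must be non-zero. */
static unsigned long notice_cycle(unsigned batch, unsigned cycles_per_insn,
                                  unsigned long event_at)
{
    unsigned long now = 0;
    unsigned i;

    for (;;) {
        /* Stand-in for executing `batch` CPU instructions back to back. */
        for (i = 0; i < batch; i++)
            now += cycles_per_insn;
        /* Stand-in for the video module checking its timers. */
        if (now >= event_at)
            return now;
    }
}
```

With 4-cycle instructions and an event due at cycle 204 (an arbitrary value), notice_cycle(1, 4, 204) returns 204, while notice_cycle(8, 4, 204) returns 224: the event is seen 20 cycles late. In this model the worst case is a delay of just under one full batch, which is the window in which I would expect timing-sensitive code to misbehave.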