I was wondering why I got a 5 fps drop in performance when I started to use textures of width 12 pixels (because of mipmapping) instead of width 8 pixels. The LPC11U68 should have 1-cycle multiplication, so it should be no problem. Except that it is. I have the following calculation in the inner loop of the “Mode7” bitmap transformation function.
So GCC is using a combination of shifts and adds to multiply by 12, instead of using the mul instruction. This is something @FManga has mentioned before.
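To illustrate what such a shift-and-add sequence computes, here is a minimal sketch in C. The function name `mul12_shift_add` is made up for this example, and the decomposition shown (12x = 8x + 4x) is only one possibility; GCC may pick a different but equivalent sequence.

```c
#include <stdint.h>

/* Hypothetical helper: multiply by 12 without a multiply instruction.
   12 * x == 8 * x + 4 * x == (x << 3) + (x << 2). */
static inline uint32_t mul12_shift_add(uint32_t x)
{
    return (x << 3) + (x << 2);
}
```

A sequence like this typically needs three instructions per multiplication, whereas a MULS needs one, plus a constant load that can sit outside the loop.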
So the question is: how can we force GCC to use the mul instruction?
If the constants are replaced with variables, it will force the compiler to use MULS:
[Note that multiplying two variables together will cause the generation of a MULS instruction, as in such cases the compiler has no way of knowing what combination of add/shifts to use.]
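A minimal sketch of the variable trick, assuming the texture width is the constant in question. The names `tex_width` and `texel_offset` are hypothetical, and `volatile` is used here only to make sure the compiler cannot see the value at compile time; a plain non-const global may be enough, depending on the optimization settings.

```c
#include <stdint.h>

/* Assumption: the texture width is kept in a variable, so its value is
   not a compile-time constant where the multiplication happens. */
volatile uint32_t tex_width = 12;

/* Hypothetical inner-loop helper: offset of texel (u, v) in the bitmap. */
uint32_t texel_offset(uint32_t u, uint32_t v)
{
    uint32_t w = tex_width;  /* read once; in real code, hoist this out of the loop */
    return v * w + u;        /* variable * variable: no shift/add expansion possible */
}
```

Whether this actually wins depends on the extra load and on what else the optimizer loses by not knowing the value, so it is worth checking the generated assembly.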
Good point. I tested it and it does indeed use the muls instruction. Of course, it can also affect optimization in other ways, as the value is not const any more.
I would still like to find a generic solution, so that I could squeeze out some cycles without tricking the compiler case by case.
That’s a shame, because now we cannot (fully) benefit from the 1-cycle multiplication (!). The options are to try to avoid NPOT (non-power-of-two) multiplication in the inner loops, or, when that is inevitable, to write inline assembler.
For this particular case I can avoid the problem by placing the mipmaps below the original texture (total size: 8x12 pixels) instead of to its right (12x8 pixels), because I need only the bitmap width in the inner loop.
Or could we make our own MULS macro or inline function for integer multiplication (implemented in inline assembler) that can be used instead of ‘*’? It would be interesting to see how the optimizer handles it.
Edit: I can also check if there is a newer EmBitz version available with an updated GCC.
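The MULS helper idea could be sketched roughly like this with GCC extended inline assembler. The name `muls_u32` is hypothetical, the fallback branch is only there so the code also builds on non-ARM targets, and the ARM path has not been measured on the LPC11U68 here.

```c
#include <stdint.h>

/* Hypothetical helper: 32-bit multiply that always uses MULS on Thumb. */
static inline uint32_t muls_u32(uint32_t a, uint32_t b)
{
#if defined(__GNUC__) && defined(__arm__) && defined(__thumb__)
    /* a = a * b (low 32 bits). MULS updates the flags, hence the "cc"
       clobber; "l" restricts both operands to low registers r0-r7. */
    __asm__ (".syntax unified\n\tmuls %0, %1" : "+l"(a) : "l"(b) : "cc");
#else
    a *= b;  /* portable fallback for other targets */
#endif
    return a;
}
```

Because the asm statement is opaque to the optimizer, it cannot be constant-folded or replaced by a plain shift when the factor is a power of two, so a helper like this is probably best kept to the few inner-loop spots where the shift/add expansion has been shown to hurt.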
Multiplying a variable by a constant is (relatively) slow, unless the constant is a POT (power of two).
The option for the latter is to store the constant in a variable, but it is not as fast as the POT option, even though the memory access is really fast (1 cycle iirc). The optimizer can do a better job with constants, as the instruction pipeline can be filled more efficiently.