GCC optimizer problems

I was wondering why I got a 5 fps drop in performance when I started to use textures 12 pixels wide (because of mipmapping) instead of 8 pixels wide. The LPC11U68 should have single-cycle multiplication, so it should be no problem. Except that it is. I have the following calculation in the inner loop of the “Mode7” bitmap transformation function.

Bitmap width: 8 pixels

( finalV & 0x7 ) * 8

The assembly is as follows:

	ands    r3, r4
	lsls	r3, r3, #3

Bitmap width: 12 pixels

( finalV & 0x7 ) * 12

	ands	r3, r4
	lsls	r6, r3, #1
	mov	fp, r6
	add	r3, r3, fp
	lsls	r3, r3, #2

So GCC is using a combination of shifts and an add to multiply by 12, instead of using the mul instruction. This is something @FManga has mentioned before.
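To see what that five-instruction sequence computes, it can be written back out in C. This is only an illustrative sketch: the function names are made up, and it mirrors the listing above rather than the actual Mode7 code.

```c
#include <stdint.h>

/* What the source asks for: (finalV & 0x7) * 12. */
static inline uint32_t offset_mul12(uint32_t finalV)
{
    return (finalV & 0x7u) * 12u;
}

/* What the generated code does: ((x << 1) + x) << 2 == 3x * 4 == 12x. */
static inline uint32_t offset_shift_add12(uint32_t finalV)
{
    uint32_t x = finalV & 0x7u;   /* ands r3, r4                 */
    uint32_t t = x << 1;          /* lsls r6, r3, #1             */
    x = x + t;                    /* mov fp, r6 ; add r3, r3, fp */
    return x << 2;                /* lsls r3, r3, #2             */
}
```

Both functions return the same value; the second just spells out the shift/add decomposition, including the detour through a high register (fp) that the width-8 case does not need.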

So the question is: how can we force GCC to use the mul instruction?

I think it’s -mtune=cortex-m0plus, but I haven’t had a chance to test it yet.

I tested it, but unfortunately it did not work: the assembly code is the same.

Some M0+ CPUs have the “small” 32-cycle multiplier.

What is the assembly output if you add the -Os option to GCC?

Does your IDE have a “Cortex M0+ Fast Multiplier” architecture option?

Are you using -Os?

I think @dmarschal read the same page I ended up at :P

https://community.nxp.com/thread/388760

@Hanski how are you building? mbed?
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13304.html

Oh, try -mtune=cortex-m0plus.small-multiply.

https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html#index-mtune-3

Isn’t that the opposite (force GCC to not use MUL)?

I don’t know for definite but I’m assuming a single MUL is ‘smaller’ than lots of shifting and adding.

If it is the opposite then the people who pick the flag names for GCC aren’t very good at their job. :P

I think smaller refers to the circuit that does the math, not the code.

I am using EmBitz.
-Os has always produced slower code for me than -O3

Hanski, let’s see a longer code snippet of the multiplication you are doing, particularly the data types of the integers involved.

If constants are replaced with variables, it will force the compiler to use MULS:

[Note that multiplying two variables together will cause the generation of a MULS instruction, as in such cases the compiler has no way of knowing what combination of add/shifts to use.]

https://community.nxp.com/thread/423115

Edit: in other words, use a variable instead of 8/12 constants
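A minimal sketch of that workaround, assuming the width lives in an ordinary global (the names `texture_width` and `texel_row_offset` are hypothetical, not from the actual code). The variable is deliberately neither `const` nor `static`; otherwise the compiler may work out that it is always 12 and fall back to shifts and adds.

```c
/* Hypothetical name. Not const and not static, so the compiler cannot
   assume its value at compile time (assuming no LTO / whole-program
   optimization is in use). */
int texture_width = 12;

/* Row offset into the texture; the multiply now has two run-time operands. */
int texel_row_offset(int finalV)
{
    return (finalV & 0x7) * texture_width;
}
```

The cost is that the width has to be loaded from memory, ideally once before the inner loop rather than on every iteration.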


Good point. I tested it and it did indeed use the muls instruction. Of course, it can also affect optimization in other ways, as the value is no longer const.

I would still like to find a generic solution, so that I could squeeze out some cycles without tricking the compiler case by case.

AFAIK this is a general problem with GCC and Cortex-M0+. Some reports say that it depends on the version of GCC used.

That’s a shame, because now we cannot (fully) benefit from the 1 cycle multiplication (!). The options are to try to avoid NPOT (non-power-of-two) multiplication in the inner loops or, when that is inevitable, to write inline assembler.

For this particular case I can avoid the problem by putting mipmaps under the original texture (total size:8x12 pixels) instead of the right side (12x8 pixels), because I need only the bitmap width in the inner loop.
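Roughly what that layout could look like in code. This is a sketch under assumptions: the array, the function names, and the exact placement of the 4x4 mip level in rows 8..11 are all made up for illustration; only the 8-wide, 12-tall total size comes from the description above.

```c
#include <stdint.h>

enum { TEX_STRIDE = 8 };   /* row width stays a power of two */

/* Assumed layout: rows 0..7 hold the 8x8 base texture,
   rows 8..11 hold the mipmaps (8 wide x 12 tall in total). */
static uint8_t texture[12 * TEX_STRIDE];

/* Base level: the row multiply is by 8, i.e. a single shift. */
static inline uint8_t sample_base(int u, int v)
{
    return texture[(v & 0x7) * TEX_STRIDE + (u & 0x7)];
}

/* First mip level, assumed to sit at rows 8..11, columns 0..3. */
static inline uint8_t sample_mip1(int u, int v)
{
    return texture[(8 + (v & 0x3)) * TEX_STRIDE + (u & 0x3)];
}
```

Either way the inner loop only ever multiplies by the row stride of 8, so the 12 never shows up as a multiplier.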

Or could we make our own MULS macro or inline function for integer multiplication (implemented in inline assembler) that can be used instead of ‘*’? It would be interesting to see how the optimizer handles it ;)
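One possible shape for such a helper, as a sketch only. Instead of hand-writing the muls itself, it uses an empty GCC extended asm statement to hide the constant from the optimizer, so the ordinary ‘*’ has two operands the compiler cannot see through. The names `opaque` and `mul_opaque` are hypothetical, and whether this actually produces a muls with the toolchain discussed here is an assumption that has not been tested.

```c
/* Hide a value from the optimizer: the empty asm claims to modify x,
   so GCC can no longer treat it as a compile-time constant.
   (GCC extended asm, a GNU extension.) */
static inline int opaque(int x)
{
    __asm__ ("" : "+r" (x));
    return x;
}

/* Hypothetical helper: multiply without letting GCC see the constant. */
static inline int mul_opaque(int a, int b)
{
    return a * opaque(b);
}
```

Usage would be something like `mul_opaque(finalV & 0x7, 12)` in place of `(finalV & 0x7) * 12`. Because the asm is not volatile, the compiler is still free to hoist the hidden constant out of the loop.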

Edit: I can also check if there is newer EmBitz version available with updated GCC.

AFAIK you can specify which GCC it uses and from which directory. Just go to settings->compiler or something; it’s in the menu.


I already have the latest version of EmBitz, 1.11. It uses a tailored GCC 5.4 (embedded version).

The same thing happens with GCC 7.2.1.


As a conclusion:

  • multiplying two int variables should be fast
  • multiplying a variable by a constant is (relatively) slow, unless the constant is a POT (power of two).

The workaround for the latter is to store the constant in a variable, but that is not as fast as the POT case, even if the memory access is really fast (1 cycle iirc). The optimizer can do a better job with constants, as the instruction pipeline can be filled more efficiently.
