Optimizing 4bpp bitmap drawing

dir3kt · January 17, 2019, 9:26am

Howdy!

In my quest to make a sidescrolling platformer in mode 15 I have been trying to optimize sprite rendering. It quickly became apparent that drawing tiles is the bottleneck, so this is the first area I tried to improve.

The test consist of a simple program with uncapped FPS that draws a 14x11 tilemap of 16x16 tiles on screen each frame. Basically it draws a full screen image, but drawing a tilemap is more realistic in my particular use case. There are two versions of the test: One that draws the tilemap on an even position, another that draws the tilemap on an odd position. This is because routines are quite different depending on the pixel placement.

The idea behind this post is to discuss those techniques and see what can eventually make its way to PokittoLib. But first the numbers…

Using PokittoLib

This uses the drawBitmapData from PokittoLib.

Even: 18 FPS
Odd: 21 FPS

Not much difference between odd and even pixel placement. Surprisingly odd pixel placement is faster which looking at the code shouldn’t. I guess this is due to some obscure O3 optimization.

Optimization 1: Disabling transparency

The first, obvious thing, is to disable transparency as most of the time this is not needed for tiles. So I just disabled the transparency check on the Pokitto routines.

Even: 35 FPS
Odd: 25 FPS

The odd frame rate could still be optimzed. My current code simply don’t check for transparency but still performs read on the screen buffer, which could be prevented.

Optimization 2: Even pixel placement

Even pixel placement is much faster and easier to work with (when writing blitting routines). So I decided to rewrite a routine that can only display sprites on an even pixel position, also it only support sprites with even width so that there is no ‘x clip’.

Even: 42 FPS

This first version is simply a loop that reads byte from the bitmap and put them in the screen buffer. I had a private discussion with @Pharap about the Pokitto CPU and it occurred this is a 32bit CPU so why push 8bit bytes instead of 32bit words? Here is the result with another version that use memcpy to copy the bitmap content to the screen buffer.

Even with memcpy: 47 FPS

Using even pixel placement

I have been thinking how to use even pixel placement. There could be some ticks such as doubling assets with a normal version and a ‘shifted by 4 bit version’. Or using a screen buffer that is bigger by 1 byte, and perform the shifting when sending data to the LCD. But this all seems overly complicated and somehow still limited.

So I used a much simpler trick which consists of making use of even pixel position as much as possible. This is rather simple: The players moves by 2 pixel each frame, the camera follows the player so it also move by 2 pixels. This insure the background tilemap is always drawn on even position.

Putting it all together

To prove my points and show it works I put together a small demo platformer were you can move, jump and shoot.

Without capping the FPS the game runs between 45-47 FPS. I decided to cap it to 30 FPS so that there is still room for entities and sound, and that the games run at a constant speed. Also 30 FPS is good for porting the game to PC which has a 60Hz vsync.

In this sort demo there are 3 layers:

parralax background. Only the clouds are drawn, blue is the background color. This layer doesn’t use even pixel placement, it uses optimization 1.
background. Follows the player so use optimization 2.
foreground. Same as background put drawn over the player.

dg_0_1.bin (57.6 KB)

Enjoy!

This game uses the DeadGunner sprites from Surt of OpenGameArt. By the way I recommend checking other works by Surt, lots of NES style stuffs that could be great Pokitto games!

Note that this is version 0.1 of DeadGunner. I’m planning to continue work on it, hopefully turning it into a full game. But let’s not discuss this here, if there is interest I will create a separated thread.

Hanski · January 17, 2019, 10:10am

Thanks for sharing your experiments! This is very interesting.

Some thoughts below.

You can save multiple calls to drawBitmapData() per frame by making an own function “drawTileMap()”, which gets the tilemap and textures as a parameter and loops over the whole screen. I have made something like that in the “Scroll2Layers” example (in lo-res mode) which comes with Pokittolib. I am not sure if this gets you around the performance problems in “odd pixel placement” also.

I have measured how much the audio (1 channel) affects to performace. It is about 10% - 30 %, depending on the sample rate. The link is below:

That’s impressive in hi-res mode(!)

DeadGunner sprites look cool:-) I will try your demo later.

dir3kt · January 18, 2019, 11:05am

Interesting. This could save some offset calculation, and also the function call. Will have to make some measurements. This could maybe help for odd pixel placement too. One little thing with real case (not the test case) is that some tile are not displayed so I would need to add some way to skip tiles.

Thanks for the link to the audio perfs!

Pharap · January 18, 2019, 1:40pm

std::memcpy isn’t always the fastest way to copy memory,
it depends how the implementation is written and how the compiler optimises it.

I think it would be worth writing a custom copy routine just to see if it can run faster than std::memcpy.

Example

void copy(char * destination, const char * source, std::size_t count)
{
	using word = unsigned int;

	const std::size_t remainderCount = (count % sizeof(word));
	const std::size_t alignedCount = (count / sizeof(word));

	if (alignedCount > 0)
	{
		word * destinationWord = reinterpret_cast<word *>(destination);
		const word * sourceWord = reinterpret_cast<const word *>(source);

		for (std::size_t index = 0; index < alignedCount; ++index)
		{
			destinationWord[index] = sourceWord[index];
		}
	}

	if (remainderCount > 0)
	{
		const std::size_t startIndex = (alignedCount * sizeof(word));

		char * destinationChar = &destination[startIndex];
		const char * sourceChar = &source[startIndex];

		for (std::size_t index = 0; index < alignedCount; ++index)
		{
			destinationChar[index] = sourceChar[index];
		}
	}
}

template<std::size_t destinationSize, std::size_t sourceSize>
void copy(char(&destination)[destinationSize], const char(&source)[sourceSize], std::size_t count)
{
	std::assert(count <= destinationSize);
	std::assert(count <= sourceSize);
	copy(destination, source, count);
}

template<std::size_t destinationSize, std::size_t sourceSize>
void copy(char(&destination)[destinationSize], const char(&source)[sourceSize])
{
	static_assert(destinationSize <= sourceSize, "");
	copy(destination, source, destinationSize);
}

Also maybe try std::copy to see what the performance of that is like in comparison.

Hanski · January 21, 2019, 4:55am

Looks nice and works very smoothly. Good work!

dir3kt · January 21, 2019, 10:29am

Thanks for the routine it’s a good idea. Applying it to drawBitmap, remainderCount, alignedCount and startIndex could be calculated only once outside of the ‘y’ loop.

Any reasons for using using array indexer [index] instead of incrementing the pointers and using dereference?

jonne · January 21, 2019, 11:20am

AFAIK the latter should be faster than the previous but it depends on the compilers ability to optimize

FManga · January 21, 2019, 3:01pm

@dir3kt: Some of these optimizations were done in Abbaye des Morts, if you want to compare implementations.

Is destination guaranteed to be word-aligned?

Both methods require 3 registers, so it should be the same… Hard to guess what GCC is going to do sometimes, though.

dir3kt · January 22, 2019, 10:05am

In my case nope. Should it be?

Pharap · January 22, 2019, 11:48am

I usually use indexing and ‘address of’ when dealing with pointers because it’s more self documenting and easier to reason about (checking whether an index is valid is easier than checking whether a pointer is valid).

As far as I can tell indexing should use at least 4:
index, source, destination and alignedCount

Pointer incrementing should use 3:
source, destination and either sourceEnd or destinationEnd

Whether pointer incrementing is faster than indexing depends on the instruction set.

In the case of thumb, there’s an LDR instruction that is basically Rd = *(Rn + Rm);,
and a corresponding STD instruction that does *(Rn + Rm) = Rd;.
So actually indexing should be either just as fast or possibly even faster than pointer incrementing.

For the word copying part the final thing should look something like this:

LDR temp, source, index
STD destination, source, index
ADD index, index, 1
CMP index, count
B <GTE>, -4

(This is a guess, I don’t know exactly which version of thumb the Pokitto’s processor uses, and I’m having a hard time sifting through all the instructions and different info.)

Either way, the compiler should be smart enough to pick the best way.
Probably best to try both approaches to see if there’s a measurable difference.

I don’t think the function can guarantee that.

It’s certainly possible to force arrays of chars to be word-aligned as of C++11 (using alignas(int)).
(But then that’s only going to work for the compilers that support C++11, which rules out mbed online.)

Unfortunately, if custom alignment is given to an array,
it’s impossible to detect because alignment can only be detected from a type, not a variable/value.

It would be possible to define a custom bitmap type that must be word aligned,
but it would make initialising the bitmap data a bit more complicated.

It seems that desktop programs on Windows all static variables are word-aligned, so I’m wondering if it will be the same for Pokitto.
I’ll test later.

FManga · January 22, 2019, 3:46pm

I checked the disassemblies in compiler explorer (ARM gcc 8.2, -mcpu=cortex-m0plus -O3) for:

void square(int num, char *x, char *y) {
    for( int i=0; i<num; ++i )
        x[i] = y[i];
    return;
}

vs

void square(int num, char *x, char *y) {
    for( int i=0; i<num; ++i )
        *x++ = *y++;
    return;
}

and in both cases you’ll find this:

.L3:
        ldrb    r4, [r2, r3]
        strb    r4, [r1, r3]
        adds    r3, r3, #1
        cmp     r0, r3
        bne     .L3

So 5 registers in total, identical implementations either way.
That is interesting in itself, but there’s a second takeaway: in either case there’s a lot more code. Including a check to see if the addresses are word-aligned and a specifically optimized version of the loop that copies 4 bytes at a time.
This is necessary because the STR/LDR opcode can only write/read to 4-byte aligned addresses (hence my question regarding alignment earlier). If it isn’t aligned, the compiler might resort to loading storing byte-by-byte.

The handwritten optimization probably has no effect though it doesn’t hurt to test and measure. It might actually be slower.

Pharap · January 26, 2019, 3:55am

Sorry for the delay, I’ve only just spotted this comment.

I suspected the compiler would generate the same for both.

Any improvements if you exchange int for std::size_t and/or mark y as const char *?

And is the return; there to prevent a certain kind of optimisation or is it just randomly there?

And not too far off my earlier guess:

LDR temp, source, index
STD destination, source, index
ADD index, index, 1
CMP index, count
B <GTE>, -4

vs

.L3:
        ldrb    r4, [r2, r3]
        strb    r4, [r1, r3]
        adds    r3, r3, #1
        cmp     r0, r3
        bne     .L3

r0 = count
r1 = destination
r2 = source
r3 = index
r4 = temp

Seeing the actual thing I can see now that I wrote it slightly wrong.
The second line should have been STD temp, destination, index, which is what I’d intended to write.
(Or rather it should have been STD temp, [destination, index], though I’m not sure what the square bracket syntax means.)
At least I was pretty close, which is reassuring because I’m always slightly doubtful of my assembly knowledge (it’s not something I use often).

This is probably true, but it depends on how easy it is to check for word alignment and whether or not there’s any requirement for statically allocated variables to be word aligned.

Like I say, when testing on a desktop, all global (statically allocated) variables appear to be word aligned.
If that’s the case on the Pokitto then in this case the check might not be necessary.

Even if it is necessary, it should be as simple as:

constexpr uintptr wordMask = ((1 << sizeof(const void *)) - 1);
constexpr bool isWordAligned(const void * pointer)
{
    return ((reinterpret_cast<uintptr>(pointer) & wordMask) == 0);
}

Precisely, always test and never assume. (Although that’s not always practical.)

FManga · January 27, 2019, 11:41am

No change with int/size_t or const.

It’s just randomly there.
Compiler Explorer starts up with this:

// Type your code here, or load an example.
int square(int num) {
    return num * num;
}

I just removed the num * num instead of removing the entire line. That also explains the name square instead of something like copy. I tested without the return and it makes no difference.

I don’t think there is a requirement based on how variables are allocated, just on how they’re accessed and therefore the type. I had to add __attribute__ ((aligned)) to the framebuffer in the PokittoLib since I wanted to read 32-bits at a time from it, for example.
Also, even if the beginning of a buffer is aligned, the copy function might not receive an offset that is aligned.

And when it’s not practical, one should err to the side of clear and concise code.

Pharap · January 29, 2019, 10:25am

Unsurprising on ARM I suppose.
But on the bright side it also shows that correctness costs nothing.

Ah, fair enough.

Unsurprising.
The return will be there either way, it’s just a matter of implicit vs explicit.

Unfortunate but true.

Precisely.
(Partly why I mentioned std::szie_t and const.)