Optimal/standard full screen tiled-mode proposal

andrea_i · July 5, 2019, 9:22pm

Same here, loving the restrictions and simplicity of the underlying architecture, it’s a refreshing way to look at game development! So, the game like f-zero is yours? loved the reminiscent aesthetics : )

Pharap · July 5, 2019, 10:33pm

I’d say don’t do it unless you can actually prove that the generated machine code is better.

9 times out of 10 the compiler knows what it’s doing when it comes to low level optimisations and you should only contradict it when you can prove that your code is doing a better job than what the compiler generates.

For example, prefer to use % 2 than prematurely optimising to & 1 because the compiler already knows about that optimisation and it’s not really worth the negative impact to readability.

It depends on what the tangible benefit is and how much less readable it is.

For example, in some cases moving an if out of a loop and having two loops can be faster than having a loop with an if inside it.
Depending on the number of iterations, that optimisation could be worth the penalty to maintenance and readability.

It’s kind of a case-by-case thing though.
There are so many possibilities that it’s hard to have a general rule.

It depends on what the usage is like.
What’s the maximum, minimum and average size that the array will be?

The problem isn’t just the slowness of allocation, there’s also the potential for heap fragmentation and the fact that simply using new and delete means that new functions are added to the code.

(As far as I’m aware neither the Pokitto library or the Mbed code use new and delete. If they do then my last point is no longer valid.)

There’s also the problem of what actually happens if the allocation fails.
I’m not sure what the Pokitto’s default behaviour would be because I’ve never got round to testing it.
There is a way to test if the allocation failed, but that just raises the issue of what to do next.
(Unfortunately there’s usually nothing you can do by that point.)

If the function is actually called then yes.
If the compiler decides to inline it then no.

That’s the downside to modern computers.

People are so spoilt for resources that many programmers end up writing code that gobbles up far more memory/processing power than it really needs.
(E.g. doing O(n) searches over a list instead of maintaining a binary search tree for O(log(n)) searches.)

It’s fine when there’s just one program running, but once you get several ‘greedy’ programs running side by side they soon eat up all the memory/processing power.

Is this library intended to be made part of the Pokitto library or as an extension library?

I think it could benefit heavily from using C++11, but the Pokitto library still needs to target mbed online which only runs C++03.
(I’m hoping the introduction of Femto will mean that can be changed so we can finally bring C++11 to the Pokitto lib.)

It’s late here so I’ll have to wait until tomorrow to give it a proper look over.
For now I’ll mention a few things that are probably due to a lack of C++-specific knowledge in the hopes that it will help you get to know the language better.

When you do new uint32_t[num]{0}, you don’t need the 0, you can just do new uint32_t[num]{}.
The 0 only gets copied to the first element, and the remaining elements are then zero-initialsed, so effectively the 0 is redundant.
Using just an empty aggregate initialiser ({}) means that all the elements are zero-initialised as intended.

If you’d wanted to skip the zero initialisation and leave the elements as indeterminate values then you could have used new uint32_t[num].

renderInfo->palette[16] = NULL; has two problems.
Firstly it’s trying to assign NULL to a uint16_t.
I’m actually kind of surprised that isn’t erroring because NULL should be either void * on C++03 or std::nullptr_t on C++11+ and neither of those should be implicitly convertible to uint16_t.

Secondly, and more worryingly, renderInfo->palette[16] is actually outside the bounds of the array.
renderInfo->palette[0] is the first element and renderInfo->palette[15] is the last element.

You don’t actually need the unnamed namespace.
No other file can ‘see’ the things defined in a .cpp file unless they’ve been declared elsewhere, so the anonymous namespace is effectively redundant.

bufferSize is actually a constant value so you could just make it a static const (or static constexpr in C++11) member variable to save memory.

Instead of using -= and += you could use ++ (e.g. ++pixelX;).

I’ll explain the differences later but you should be able to change this:

uint8_t pixels = renderInfo->bufferImage[tilePixelIt + 2];
uint8_t pixel1 = pixels >> 4;
uint8_t pixel2 = pixels & (0x0F);

uint8_t bufferPixels = buffer[bufferPixelIt];
uint8_t bufferPixel1 = bufferPixels >> 4;
uint8_t bufferPixel2 = bufferPixels & (0x0F);

if(pixel1 == transparentPaletteId){
pixel1 = bufferPixel1;
}
if(pixel2 == transparentPaletteId){
pixel2 = bufferPixel2;
}

bufferPixels = (bufferPixels&0x0F) | (pixel2<<4);
bufferPixels = (bufferPixels&0xF0) | (pixel1);

buffer[bufferPixelIt] = bufferPixels;

To this:

uint8_t imagePixels = renderInfo->bufferImage[tilePixelIt + 2];
uint8_t bufferPixels = buffer[bufferPixelIt];

uint8_t pixel1 = ((imagePixels >> 4) & (0x0F));
uint8_t pixel2 = ((imagePixels >> 0) & (0x0F));

if(pixel1 == transparentPaletteId)
	pixel1 = ((bufferPixels >> 4) & (0x0F));

if(pixel2 == transparentPaletteId)
	pixel2 = ((bufferPixels >> 0) & (0x0F));

	
buffer[bufferPixelIt] = ((pixel1 << 0) | (pixel2 << 4));

And this:

void pdk::clearBuffer(uint8_t paletteColorId){
	for(uint16_t i = 0; i < renderInfo->bufferSize; ++i){
		uint8_t pixel = 0;
		pixel = (pixel&0xF0) | (paletteColorId);
		pixel = (pixel&0x0F) | (paletteColorId<<4);
		
		buffer[i] = pixel;
	}

	renderInfo->bufferUpdated = true;
}

To this:

void pdk::clearBuffer(uint8_t paletteColorId)
{
	uint8_t paletteColourByte = (((paletteColorId & 0x0F) << 4) | ((paletteColorId & 0x0F) << 0))
	
	for(size_t index = 0; index < renderInfo->bufferSize; ++index)
		buffer[i] = paletteColourByte;

	renderInfo->bufferUpdated = true;
}

Or maybe even just:

// Goes at the top of the code
#include <algorithm>

void pdk::clearBuffer(uint8_t paletteColorId)
{
	uint8_t paletteColourByte = (((paletteColorId & 0x0F) << 4) | ((paletteColorId & 0x0F) << 0))
	
	// Should produce the exact same code, but best to check to be sure
	std::fill_n(buffer, renderInfo->bufferSize, paletteColourByte);

	renderInfo->bufferUpdated = true;
}

And I have something that could make loadPalette better by refusing or adapting to arrays of different sizes, but it would need quite a bit of rewriting (and would be a lot easier if C++11 were available).

I’ll look again tomorrow.
The biggest improvement you could make is probably to substitute all the magic numbers with named constants.

Hanski · July 6, 2019, 6:25am

It depends a lot which level you are in code: per frame, per line or per pixel. In the latter one, every instruction counts and it is worthwhile to do “micro optimization”, even if it makes the code less readable. So you should put a lot of comments too. Here are some tips for pixel level optimization:

If you want the maximum speed, make sure you use -O3 optimisation option for the compiler. I do not know what is the default in FemtoIDE.
If you have a condition in a loop, check if you could move the condition outside the loop, and make separate loops for different conditions
For simple loops you could unroll it, or make it e.g. 8-times unrolled per loop.
Avoid function calls unless you are sure the optimiser can make it an inline function
Avoid non-POT divisions. If you are incrementing instead, beware of cumulative error. Using “fixed point” integers can add precision and reduce cumulative error.
Multiplication is a one-cycle operation in Pokitto, so that can be used!

andrea_i · July 6, 2019, 7:31am

Thanks for all the advice guys!
I’m removing the dynamic array allocation, it’s probably for the best.
I thought because of the limited architecture I could have used more strict optimizations, but it still looks like the best rule of thumb is to first implement, then optimize if needed.

this is a big oops : )

This lib encloses pokitto to take advantage of some low level functions, but it doesn’t expose the original pokitto lib, it’s a replacement.

The original pokitto lib seems to be all about flexibility in terms of render modes, but every mode seems to have downsides and yet the interface of the apis seem to talk about every single mode.
It’s big and does a lot of things in many different ways.

This lib I’m doing takes an opposite approach: only one rendering mode, single header with just a handful of functions to do everything, and flexibility is shifted on how data is allocated in ram.
Rather than pre-configured render modes, It promotes a unified tile-based workflow, you can still draw individual pixels on a buffer, you can have a full screen buffer or a tiny one in one portion of the screen, or no buffer at all and only render with tiles, and you can change any of that on each frame.
It uses one palette of 16 colors that can be swapped on each frame.

Not trying to say this is the way to go at all! Just my take on the matter : )

Hanski · July 6, 2019, 7:57am

While it is certainly easier to get started to make an own fork of Pokittolib, I hope that you at some point integrate changes over the current PokittoLib. It is tedious to maintain two libraries when there comes new features, bug fixes or optimisations.

The current lib is not perfect, but huge amount of work has been done to make it better.

Not to say that “PokittoLib2” cannot be done, but if it will be, that would need commitment from all contributors so that we do not divide our already small resources into two groups.

FManga · July 6, 2019, 11:02am

Depends on the division.

If it’s dividing by a power-of-two (POT), GCC emits a shift instead of actually doing a division. (1 cycle)

If it’s not a POT, but you’re dividing by a constant, use fixed-point multiplication by a reciprocal and a shift. (2 cycles)

If your algorithm can naturally be expressed using increments (much like Bresenham’s line drawing algorithm), it’s worth doing so if it happens in a “hot spot”. 1 division per scanline? Probably not a big deal. 1 division per pixel? It’s going to crawl.

If it’s not a POT nor a constant, the Pokitto’s ROM features a division routine that is specifically tuned for the processor. You can find documentation on how to use it in UM10732.pdf, page 467.
It has a worse best-case, but a better worst-case than the division routine in libc, so for some values of x and y it’s going to be faster than simply doing “x / y” and for others it’s going to be slower. You’ll have to profile with your specific code and see.

If it’s code that will probably need to be rewritten in assembly later on, favour readability and correctness for now.

Because of the clockspeed, it’s common to try to follow guidelines that were valid for old Pentiums… but they’re very different architectures. Here we have no cache, no branch prediction, no speculative execution, etc.
That means that having lots of little functions isn’t likely to be a performance problem, especially because GCC is likely to inline them where it makes sense. If it doesn’t inline, small functions often have no prologue/epilogue, so the only cost is that of jumping in and back out… which is really cheap here too.

Accessing an array allocated with new is going to be the same speed as one allocated elsewhere.
For a small array, it’s probably fine. For large ones you risk fragmenting the heap. You also risk getting into some hard to debug problems when you’re running out of heap and the stack starts trampling on it.
Generally, you have a few options, depending on what you want the API to look like.
If the library simply accepts a pointer from user code, the problem gets delegated to the user and you’re done. The user can then allocate a temporary array on the stack and not have to worry about the heap.
You can also statically allocate the array with the worst-case in mind… which wastes RAM, but is safer because you can easily add up the amount of memory your game needs to run.

new and delete are simply forwarded to malloc and free, so it’s not much. Off the top of my head: on initialization, PokittoLib calls srand which calls malloc. The bulk of the heap-related code can’t be escaped.

Pokitto’s malloc calls _sbrk when it needs more memory. When it runs out of memory, _sbrk calls exit with an error code and you can’t recover. For my GC to do its thing in java, I had to implement my own sbrk that returns an error instead, then malloc returns null.

So do I, Femto defaults to "-std=c++14".

If you make a debug build (press F5) it uses -Og -g3 -ggdb, which has no optimisations. When you make a release build (press build, Ctrl+B or Ctrl+G) it uses -O3. This is all in the project.json file, so it can be customized as you wish. You have to clean the build for it to take effect on external files like the PokittoLib.

andrea_i · July 6, 2019, 11:50am

I’m not usually one for reinventing the wheel, but the pokitto high level interface is bloated by the multitude of mutually exclusive render modes, I don’t think you’d want yet another one in there (with all the related methods exclusive to this mode that would come with it).
This new lib is very simple at the moment, and it can be fine tuned more than mode15, I think in the future if it gets consolidated it might be a good alternative that still sits on top of the core display routines of HwLcd.

The algorithms I’ve optimized are those that draw on the buffer and those that draw on screen, I think what I should do is use a compiler condition and have both optimized/unreadable and unoptimized/readable code in the same function, so if someone wants to take a crack at further refining it he can start from the readable code.

Pharap · July 6, 2019, 10:04pm

I’ll probably have to wait until tomorrow to reexamine the code.
Did all my suggestions make sense?

If it turns out that static allocation becomes a problem then you can always change it back later, but it’s worth at least trying to avoid it.

It is and it isn’t.

It’s flexible in that there are multiple options,
but the fact they’re all controlled by macros hampers the situation somewhat.

You can’t really escape that.
That’s just the nature of having limited resources.

Programming is all about tradeoffs.
Every new feature consumes resources,
and every algorithm has different characteristics.
Most of the time you’re trading speed for memory (or vice versa), or sometimes trading flexibility for speed or memory (or vice versa).

In fairness that’s not a new thing, that’s possible on all of the existing screen modes that use a palette.

That’s not really an issue.
In C++ you don’t pay for what you don’t use.
If a function isn’t used, it’s not included in the compiled code.
(Which I don’t think is true for languages like Java and C#.)

I’m half and half on this.

I think graphics modes should be available as separate libraries,
but I think core functionality should not be duplicated.

Hopefully some day we will be at that point.

I’ve done a few mockups/prototypes before but I hardly ever find the chance to work on them these days.

This is the kind of thing we need documented on the forum somewhere.

This is the ideal approach and what I would expect from most real-world libraries,
but it makes things harder for beginners (and for people who are too lazy to do their own memory management and/or don’t realise what a blessing it is).

In general I’m very much in favour of giving the end user fine control over the library (e.g. letting the user choose when the screen is drawn to and when the buttons are updated et cetera), because that gives experienced users greater capability,
but as a result it makes things harder for beginners (and lazy people).

That’s what I was expecting.

Technically they don’t have to be implemented that way,
(the ‘heap’ and the ‘free store’ are considered different things by the standard),
but I’ve yet to see an implementation that does otherwise.

srand calling malloc doesn’t really make sense.
That sounds like poor design to me.

Most Arduino programs manage to avoid it.

I’m not saying it’s never a good idea to use dynamic allocation,
but I think libraries should either avoid it or advertise the fact they use dynamic allocation so that the people using them can make informed decisions.

Ideally it should be possible to write a Pokitto game without using any dynamic allocation.

Presumably that means the Pokitto resets, which is sort of what I’d expect (the Arduboy does that too).

Presumably that’s used to throw an OutOfMemoryError.
Or are exceptions not implemented?

The ‘way to test if the allocation failed’ is what I was talking about.
There’s an overload of new that returns nullptr instead of throwing an exception.

Even better - more constexpr flexibility, and actual binary literals (instead of silly compiler extensions).

This is precisely why I’m hoping we’ll eventually be able to move to a situation where every render mode is a separate library.

Though as I said before, the size only affects the size of the Pokitto library, not the size of the user’s code.
Unused functions aren’t included in the compiled executable.

I’d be concerned about speed.

I suspect reimplementing some of the existing display modes in terms of this mode might result in a notable slowdown.
Making each mode monolithic might be a bit of a maintenance issue, but it allows for better fine-tuning.

FManga · July 6, 2019, 10:30pm

I brought it up once before.

It’s used to trigger the garbage collector. If it still can’t allocate memory after a clean up, it halts. It doesn’t throw as that would require allocating the exception object, which might not be possible at that point.
Exceptions are implemented, but disabled by default. Enabling exceptions results in much larger bins.

Pharap · July 6, 2019, 11:50pm

By ‘somewhere’ I mean in something like an ‘optimisation tips’ thread.
I.e. not buried half way down an unrelated thread.

I think most implementations get around that be pre-allocating the exception and modifying it to have the right info before throwing.

A shame, but unsurprising.

andrea_i · July 7, 2019, 6:51am

Thanks for all the precious advice Pharap and FMAnga!

Please don’t waste time reviewing that code yet, I’ll upload it on git once it’s in a decent state and then it will be interesting to see if it has any advantage over what pokitto has so far.

I also agree having separate apis for each render mode would greatly help.

There are quite a few things that every mode will do the same way:

routines to write bitmaps to a buffer (one routine for each kind of buffer we might have)
routines to display a buffer on the lcd (in this lib’s case, I can’t reuse mode15 display routine because it’s not necessarily full screen)
I’m already reusing the buttons routines

As for performance, now that I’ve rewritten every routine with a slower/readable code option, I can confirm my unreadable/semi-optimized code that avoids divisions and nested loops as much as possible does gain 3 fps when I start to have 10 or more large sprites on screen.

Hanski · July 7, 2019, 8:34am

What is the fps value? Increase of 3 out of 30 would be 10% which is rather good.

andrea_i · July 7, 2019, 9:08am

Basically, any fps between 10 and 30 will be bumped +3. But if I draw enough bitmaps to make it drop under 10, then the gain is gone, for watever reason.

Also interesting, but probably a well known thing here, the main bottleneck right now is sending pixels to the lcd calling setup_data_16(color);CLR_WR;SET_WR;
That call will make my game jump from 32fps to 94

EDIT to explain myself better:
if I comment out the call to setup_data_16() in my drawbuffer function, then I jump from 32fps to 94.

FManga · July 7, 2019, 9:14am

I imagine commenting that line turns much of your code into a big no-op that GCC optimizes away.

andrea_i · July 7, 2019, 9:18am

I’ve now replaced that call with an incremental to a global variable used within the same loop, so the compiler can’t take that off now. still 93fps

andrea_i · July 7, 2019, 9:20am

@FManga the two commented lines are the ones that can be swapped to cause the huge drop

void drawBuffer(){
	for(uint8_t bufferTileX = 0; bufferTileX < renderInfo.bufferWidth; ++bufferTileX){
		write_command_16(0x21); write_data_16(renderInfo.bufferStartX + bufferTileX);// move to x lcd coord
		write_command_16(0x20); write_data_16(renderInfo.bufferStartY);// move to y lcd coord
		
		write_command(0x22); // ready to write pixels column
		CLR_CS_SET_CD_RD_WR;

		uint16_t bufferIt = bufferTileX / 2;

		for(uint8_t bufferY = 0; bufferY < renderInfo.bufferHeight; ++bufferY){
			uint8_t paletteId = renderInfo.buffer[bufferIt];
			
			if((bufferTileX & 1) == 0){ // even pixel column
				paletteId = paletteId >> 4;
			}
			else{// odd pixel column
				paletteId = paletteId & (0x0F);
			}

			uint16_t color = renderInfo.palette[paletteId];
			
			//renderInfo.bufferStartX+=1; 93fps
			// setup_data_16(color);CLR_WR;SET_WR;// 32fps

			 bufferIt += renderInfo.bufferHalfWidth;
		}
	}

	renderInfo.bufferUpdated = false;
}

FManga · July 7, 2019, 9:24am

color, being an unused variable, gets optimized away. The renderInfo.palette[paletteId] has no-where to go, so it goes away too. Then paletteId becomes unused and it goes away, along with the if-else.
So all your for loop does is increment a counter.

andrea_i · July 7, 2019, 9:33am

you’re right, 49fps if I increment adding color to have the compiler pulling that in as well.
so 32fps to 49.

FManga · July 7, 2019, 9:39am

Makes sense. One optimization that is worth doing is drawing rows, instead of columns. That way you only read each buffer byte once.
As for writing to the LCD being slow, switching to hand-written assembly will give it a boost.

andrea_i · July 7, 2019, 9:43am

I’m writing columns because the lcd routine automatically goes down to the next vertical pixel (which is an horizontal pixel in the pokitto lcd really).
So that spares me from having to reset the lcd pixel position, which is another bunch of instructions.
Maybe I should use a buffer that merges vertical pixels together instead of horizontal? dunno, confusing as hell : )