Improving FPS


#102

Hrm, possibly because it acts as a sort of memory barrier preventing the rearranging of instructions?

Does the pause have to be just one cycle?
I’m wondering if calling an empty function marked with __attribute__((noinline)) would work, or if a function call would take too many cycles.

Or perhaps doing:

volatile int nop __attribute__((unused)) = 0;
++nop; // Optional

Would force GCC to create and use the variable?
It might cause similar issues though if GCC ends up having to dump a register it’s using.

Or perhaps:

// global
static int nopVar = 0;
inline void __attribute__((always_inline)) nop(void)
{
  ++nopVar;
}

I think the fact it’s a global variable should mean that the compiler can’t optimise the operation away,
but it could probably still reorder it.
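
For comparison, here is a minimal sketch of two ways to stop the compiler dropping or moving a do-nothing statement, assuming GCC or Clang. compilerBarrier and spinDelay are names I made up for this example; they are not from PokittoLib. The empty asm with a "memory" clobber emits no instruction at all, so it only pins the ordering of memory accesses and does not burn a cycle the way a real nop would.

```cpp
#include <cstdint>

// Hypothetical helper: an empty asm statement with a "memory" clobber.
// It emits no instruction, but the compiler may not move memory
// accesses across it.
static inline void compilerBarrier()
{
    __asm__ volatile("" ::: "memory");
}

// Hypothetical helper: every access to a volatile object counts as an
// observable side effect, so this loop cannot be optimised away.
static inline std::uint32_t spinDelay(std::uint32_t iterations)
{
    volatile std::uint32_t counter = 0;
    for (std::uint32_t i = 0; i < iterations; ++i)
    {
        counter = counter + 1; // one volatile read and one volatile write
        compilerBarrier();
    }
    return counter;
}
```

How many cycles each iteration costs depends on the target and the optimisation level, so it would need measuring on the hardware.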


#103

I haven’t looked into the rules regarding GCC and inline asm on ARM yet, so I have no idea. Maybe it considers a set of registers to get clobbered and has to reload them?
It doesn’t have to be one cycle, but it adds up.
We’re getting to a point where the high-level language is getting in the way and writing these routines in assembly would be best, both in terms of performance and legibility.


#104

Alignment is another possibility. I remember reading that if the function/loop is not aligned you get padding.


#105

The fundamental rules of inline asm on GCC are the same for all chips. What changes is the registers, the available instructions and the register aliases, e.g. on AVR r means any register and d means an upper register.

It shouldn’t clobber a register unless you explicitly state it to be an output operand or to be clobbered.
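
As a rough illustration of that rule, here is a sketch using GCC's extended asm syntax with empty templates, so it should build for any target GCC or Clang supports. opaque, emptyAsm and addOpaque are hypothetical names for this example only.

```cpp
// Hypothetical helper: the "+r" constraint declares `value` as a register
// operand that the asm block both reads and writes. The template is empty,
// so no instruction is emitted, but the compiler must assume the value
// may have changed.
static inline int opaque(int value)
{
    __asm__ volatile("" : "+r"(value));
    return value;
}

// With no operands and no clobber list, the compiler is told that no
// registers and no memory are modified by the block.
static inline void emptyAsm()
{
    __asm__ volatile("");
}

static inline int addOpaque(int a, int b)
{
    emptyAsm();
    return opaque(a) + opaque(b);
}
```

Only the first block gives the compiler a reason to keep a value in a register around the asm; the second declares nothing, so by the stated rules nothing needs reloading.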

Considering this is rendering code, I’m inclined to agree.
As long as the external interface is C++ friendly,
the implementation can be pure assembly for speed’s sake,
providing it’s well-commented.

Functions do have a memory alignment, but if it pads I would assume it’s just padding with more nops,
which shouldn’t be a major issue.


#106

But it would be noticeable in a tight, speed-critical loop.


#107

ARM instructions are 32 bits wide, but I can’t find any information about function alignment.

I found a thing discussing the alignment of structs which said that the largest field being int causes 4-byte alignment and the largest field being long long causes 8-byte alignment, but nothing beyond that.
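
The struct rule is easy to check with alignof. This is only a sketch: WithInt and WithLongLong are made-up types, and the 4-byte and 8-byte figures are what I would expect on a typical 32-bit ARM ABI, not something the checks below assume.

```cpp
#include <cstddef>

// Made-up example types: one small field followed by the largest field.
struct WithInt      { char tag; int value; };
struct WithLongLong { char tag; long long value; };

// The struct takes the alignment of its most-aligned member, so the
// member after the char is padded out to exactly that boundary.
static_assert(offsetof(WithInt, value) == alignof(WithInt),
              "int member sits on the struct's alignment boundary");
static_assert(offsetof(WithLongLong, value) == alignof(WithLongLong),
              "long long member sits on the struct's alignment boundary");

// The size is rounded up to a multiple of the alignment so that arrays
// of the struct keep every element aligned.
static_assert(sizeof(WithInt) % alignof(WithInt) == 0, "size is padded");
static_assert(sizeof(WithLongLong) % alignof(WithLongLong) == 0, "size is padded");
```

None of this says anything about function alignment, which is a separate matter.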

I’m erring on the side of memory barriers/formal no-tampering rules though, see this SO question that demonstrates how something as simple as asm("# im in ur loop"); can have an impact because of the side effects of the presence of an asm block (note that the asm block here is implicitly asm volatile).


#108

Oh yeah, did I already mention inline asm cannot be used with the mbed online IDE :wink: ?


#109

The Pokitto uses Thumb, not the full 32-bit ARM IS, so it’s just 16 bits wide. I vaguely remember the TRM mentioning 4-byte instruction alignment for certain ops.

What do you mean? You’re using inline asm in the PokittoLib:

#define CLR_WR { LPC_GPIO_PORT->CLR[LCD_WR_PORT] = 1 << LCD_WR_PIN;__asm("nop");}//__asm("nop");}//

#110

Right, I didn’t know that. I’ve added that to the technical spec thread since that’s probably important.

If @jonne tried asm volatile and it didn’t work, maybe it’s just __asm that works on mbed online?

Apparently:

When writing code that can be compiled with -ansi and the various -std options, use __asm__ instead of asm (see Alternate Keywords).

Though __asm__ isn’t the same as __asm so maybe __asm is an mbed thing?


#111

How did you fix it?


#112

It seems fileOpen returns false if it works, not true. So I was only attempting to render if the file didn’t open.


#113

fileOpen doesn’t return bool, it returns a uint8_t, and signals success with 0 and failure with 1.

This is one of the many reasons I hate implicit conversion between integer types and bool. If conversion was explicit, you’d have got a compiler error telling you that you were trying to put an integer expression in an if instead of a boolean expression.
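
A sketch of how a scoped enum gets you that compiler error. demoFileOpen is a hypothetical stand-in that only copies the 0-for-success convention described above; it is not the real fileOpen, and OpenResult and the other names are made up too.

```cpp
#include <cstdint>

// Hypothetical stand-in for fileOpen: 0 signals success, 1 signals failure.
static std::uint8_t demoFileOpen(bool fileExists)
{
    return fileExists ? 0 : 1;
}

// A scoped enum has no implicit conversion to bool, so writing
// `if (demoFileOpenChecked(...))` is a compile error rather than a
// silently inverted test.
enum class OpenResult : std::uint8_t { Success = 0, Failure = 1 };

static OpenResult demoFileOpenChecked(bool fileExists)
{
    return static_cast<OpenResult>(demoFileOpen(fileExists));
}

static bool shouldRender(bool fileExists)
{
    // The comparison has to be spelled out, which makes the intent obvious.
    return demoFileOpenChecked(fileExists) == OpenResult::Success;
}
```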

At least that seems to be the case.
On the simulator it all seems to work fine:

But the hardware code is more than a little confusing.
It’s full of comments and empty functions:

https://github.com/pokitto/PokittoLib/blob/master/Pokitto/POKITTO_LIBS/FileIO/FileIO_HW.cpp#L21


#114

more reason for that big rewrite effort

so how is that video working? is it like interlaced or compressed in some way? or just full bitmaps?


#115

Uncompressed bitmap data using a single 8-bit palette. I’m quite amazed that the SD read and LCD display are fast enough. (Explanation link.)


#116

I hadn’t noticed before, but:

What’s wrong with doing this?

bool stillGoing = true;
while(stillGoing){

Definitely.

At one point I’d written a nice object-oriented wrapper for some of the file functions, but I lost it somewhere (and may have accidentally deleted it).
I was aiming to design it so you didn’t have to remember to call close because the object’s destructor would call it when the object fell out of scope.
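
Something along these lines, as a rough reconstruction rather than the code I lost. demoOpen and demoClose are hypothetical stand-ins for the library's open and close calls so the sketch can run on a PC, and ScopedFile is a made-up name.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the file library: demoOpen returns 0 on
// success, and openHandles tracks how many files are currently open.
static int openHandles = 0;
static std::uint8_t demoOpen(const char * /*path*/) { ++openHandles; return 0; }
static void demoClose() { --openHandles; }

class ScopedFile
{
public:
    explicit ScopedFile(const char * path) : opened(demoOpen(path) == 0) {}

    // The destructor closes the file when the object falls out of scope.
    ~ScopedFile() { if (opened) demoClose(); }

    // Copying is disabled so the file can never be closed twice.
    ScopedFile(const ScopedFile &) = delete;
    ScopedFile & operator=(const ScopedFile &) = delete;

    bool isOpen() const { return opened; }

private:
    bool opened;
};

static int handlesAfterScope()
{
    {
        ScopedFile file("video.dat"); // hypothetical file name
        if (!file.isOpen())
            return -1;
        // ... read from the file here ...
    } // file is closed here without an explicit close call
    return openHandles;
}
```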

If it’s only supposed to play one way, I’d hope it was at least using delta compression.


#117

Nothing I guess, just how I wrote it.

On a related note, is there any example of a Pokitto program playing a sound sample from data in an array?


#118

.wav playing demo from the lib examples:


#119

If ROM files are read only, would it not be possible to leave them on the SD and stream the data from there? Or would seeking slow things down too much?


#120

It would be really slow, and getting it to full speed is proving challenging enough. Besides, doing it this way I can use defines to optimize for the specific mapper in the cartridge. Also, the Pokitto loader already acts as a menu for picking a game, so putting another one in the emulator would be redundant.


#121

@spinal: I think I found a way to get rid of the noise without wasting cycles. Try this:

#define TGL_WR(OP)							\
  *reinterpret_cast< volatile uint32_t *>(0xA0002284) = 1 << LCD_WR_PIN; \
  OP;									\
  *reinterpret_cast< volatile uint32_t *>(0xA0002204) = 1 << LCD_WR_PIN;

 void Pokitto::lcdRefreshMode13(uint8_t * scrbuf, uint16_t* paletteptr, uint8_t offset){
   uint16_t x,y;
   uint32_t scanline[110]; // read two nibbles = pixels at a time
   uint8_t *d;
   uint32_t *s;

   write_command_16(0x03); write_data_16(0x1038);
   write_command(0x20); write_data(0);
   write_command(0x21); write_data(0);
   write_command(0x22);
   CLR_CS_SET_CD_RD_WR;
   SET_MASK_P2;

   volatile uint32_t *LCD = reinterpret_cast< volatile uint32_t * >(0xA0002188);

   d = scrbuf;// point to beginning of line in data
   for(y=0;y<88;y++){

     s = scanline;

     for(x=0;x<110;x+=10){
       *LCD = *s = paletteptr[(*d + offset)&255]<<3; TGL_WR(s++);TGL_WR(d++);	
       *LCD = *s = paletteptr[(*d + offset)&255]<<3; TGL_WR(s++);TGL_WR(d++);	
       *LCD = *s = paletteptr[(*d + offset)&255]<<3; TGL_WR(s++);TGL_WR(d++);	
       *LCD = *s = paletteptr[(*d + offset)&255]<<3; TGL_WR(s++);TGL_WR(d++);	
       *LCD = *s = paletteptr[(*d + offset)&255]<<3; TGL_WR(s++);TGL_WR(d++);	
       *LCD = *s = paletteptr[(*d + offset)&255]<<3; TGL_WR(s++);TGL_WR(d++);	
       *LCD = *s = paletteptr[(*d + offset)&255]<<3; TGL_WR(s++);TGL_WR(d++);	
       *LCD = *s = paletteptr[(*d + offset)&255]<<3; TGL_WR(s++);TGL_WR(d++);	
       *LCD = *s = paletteptr[(*d + offset)&255]<<3; TGL_WR(s++);TGL_WR(d++);	
       *LCD = *s = paletteptr[(*d + offset)&255]<<3; TGL_WR(s++);TGL_WR(d++);	
     }

     s = scanline;
     uint32_t c = *s;
     for(x=0;x<110;x+=10){
       *LCD = c; TGL_WR(s++);TGL_WR(c=*s);
       *LCD = c; TGL_WR(s++);TGL_WR(c=*s);
       *LCD = c; TGL_WR(s++);TGL_WR(c=*s);
       *LCD = c; TGL_WR(s++);TGL_WR(c=*s);
       *LCD = c; TGL_WR(s++);TGL_WR(c=*s);
       *LCD = c; TGL_WR(s++);TGL_WR(c=*s);
       *LCD = c; TGL_WR(s++);TGL_WR(c=*s);
       *LCD = c; TGL_WR(s++);TGL_WR(c=*s);
       *LCD = c; TGL_WR(s++);TGL_WR(c=*s);
       *LCD = c; TGL_WR(s++);TGL_WR(c=*s);
     }
     
   }
   
 }
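
The idea in TGL_WR is to fill the time WR is held low with work that had to happen anyway (the pointer increments) instead of a nop. Here is a PC-side model of just that pattern, as a sketch. wrLow, wrHigh, toggleWr and pushLineDoubled are made-up names, and plain variables stand in for the GPIO registers, so this says nothing about the real timing.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-ins for the GPIO CLR/SET registers. A strobe is counted each
// time WR goes low and then high again.
static std::uint32_t strobeCount = 0;
static bool wrIsLow = false;

static inline void wrLow()  { wrIsLow = true; }
static inline void wrHigh() { if (wrIsLow) { ++strobeCount; } wrIsLow = false; }

// Same shape as TGL_WR(OP): pull WR low, do some useful work, raise WR.
template <typename Work>
static inline void toggleWr(Work work)
{
    wrLow();
    work();
    wrHigh();
}

// Looks each pixel up once, stores it for the repeated line, and strobes
// WR twice so the pixel is doubled horizontally.
static void pushLineDoubled(const std::uint8_t * source,
                            const std::uint16_t * palette,
                            std::uint32_t * scanline,
                            std::size_t pixelCount)
{
    for (std::size_t i = 0; i < pixelCount; ++i)
    {
        *scanline = static_cast<std::uint32_t>(palette[*source]) << 3;
        toggleWr([&] { ++scanline; }); // first strobe hides the scanline increment
        toggleWr([&] { ++source; });   // second strobe hides the source increment
    }
}
```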