My assembly programming experiences I would like to share

Hello Pokitto guys and gals, Dennis here.

I would like to share my experiences with you.
I use Segger Embedded Studio for writing assembly programs with custom LPC11U68 startup code.
For debugging I use a J-link probe.


I have written a very simple fixed-point Mandelbrot generator
and tested it with different clock settings from 24 MHz to 72 MHz.
Test conditions:
40 Mandelbrot sets were generated while zooming in, and the total run time was measured.
220x176, 4BPP mode was used with EGA palette settings.
d-mandel.bin is attached to this post. It does not contain the loader.
d-mandel.bin (2.2 KB)

Results (with the SYSPLLCTRL values used for the PLL settings):
24 MHz: 18.09 s (MSEL=1, PSEL=2)
36 MHz: 12.07 s (MSEL=2, PSEL=2)
48 MHz: 9.12 s (MSEL=3, PSEL=1), the default Pokitto speed
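
For anyone who wants to check how the MSEL/PSEL pairs map to the clock speeds above, here is a small host-side sketch. It assumes the PLL is fed from the 12 MHz internal oscillator and uses the usual LPC11U6x relations (output = input x (MSEL+1), FCCO = output x 2 x 2^PSEL, with MSEL in bits 4:0 and PSEL in bits 6:5 of SYSPLLCTRL). The function names are made up for illustration, and as far as I recall FCCO should stay roughly between 156 MHz and 320 MHz, so please check the user manual before relying on that.

```c
#include <stdint.h>

/* Assumption: the PLL input is the 12 MHz internal RC oscillator. */
#define PLL_INPUT_HZ 12000000u

/* PLL output clock: Fout = Fin * (MSEL + 1). */
static uint32_t pll_output_hz(uint32_t msel)
{
    return PLL_INPUT_HZ * (msel + 1u);
}

/* Internal oscillator frequency: FCCO = Fout * 2 * P, where P = 2^PSEL. */
static uint32_t pll_fcco_hz(uint32_t msel, uint32_t psel)
{
    return pll_output_hz(msel) * 2u * (1u << psel);
}

/* Pack both fields into a SYSPLLCTRL value (MSEL: bits 4:0, PSEL: bits 6:5). */
static uint32_t syspllctrl_value(uint32_t msel, uint32_t psel)
{
    return (msel & 0x1Fu) | ((psel & 0x3u) << 5);
}
```

With these helpers, `pll_output_hz(3)` gives the default 48 MHz, and `pll_fcco_hz(3, 1)` gives 192 MHz.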


The fastest screen buffer to LCD copy is 6 clocks per native pixel.
That is 24 clocks per pixel in 110x88 4BPP mode, where each pixel covers 2x2 native pixels.
The copy alone runs at ~200 FPS at 48 MHz.
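
The frame rate follows directly from the clocks-per-pixel figure. A quick host-side sketch of the arithmetic, assuming the full 220x176 panel is copied and nothing else runs in between (the helper names are only for illustration):

```c
#include <stdint.h>

#define LCD_WIDTH               220u
#define LCD_HEIGHT              176u
#define CLOCKS_PER_NATIVE_PIXEL 6u

/* Clocks needed to push one full frame to the LCD. */
static uint32_t clocks_per_frame(void)
{
    return LCD_WIDTH * LCD_HEIGHT * CLOCKS_PER_NATIVE_PIXEL;
}

/* Upper bound on frames per second when the CPU does nothing but copy. */
static double copy_fps(uint32_t cpu_hz)
{
    return (double)cpu_hz / (double)clocks_per_frame();
}

/* A 110x88 pixel covers 2x2 native pixels. */
static uint32_t clocks_per_lowres_pixel(void)
{
    return CLOCKS_PER_NATIVE_PIXEL * 2u * 2u;
}
```

One frame is 232320 clocks, so `copy_fps(48000000u)` lands a little above 206, which matches the ~200 FPS figure.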

I use the GPIO NOT register (GPIO_BASE | 0x2300) to toggle pin states.
(This saves a few registers and instructions in a loop. SET2 at 0x2208 and CLR2 at 0x2288 are 0x80 bytes apart, so switching between them takes two instructions to calculate the address, OR an extra register, OR an STM - STR instruction pair.)
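
To illustrate why NOT needs only one address, here is a tiny host-side model of the SET, CLR and NOT write semantics. It is not real register access: the struct, the function names and the WR bit position used in the example are assumptions for illustration only.

```c
#include <stdint.h>

/* Host-side model of one GPIO port's output state (not real hardware). */
typedef struct {
    uint32_t pin;
} gpio_port_model;

/* Writing a mask to SET drives the selected pins high. */
static void write_set(gpio_port_model *p, uint32_t mask) { p->pin |= mask; }

/* Writing a mask to CLR drives the selected pins low. */
static void write_clr(gpio_port_model *p, uint32_t mask) { p->pin &= ~mask; }

/* Writing a mask to NOT toggles the selected pins. */
static void write_not(gpio_port_model *p, uint32_t mask) { p->pin ^= mask; }

/* One strobe: the same store to the same address, issued twice. */
static void strobe_with_not(gpio_port_model *p, uint32_t wr_mask)
{
    write_not(p, wr_mask); /* first write flips the pin  */
    write_not(p, wr_mask); /* second write flips it back */
}
```

With SET and CLR the strobe needs two different addresses, while with NOT the identical `str` instruction can simply be repeated.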

// r0=8 pieces of 4BPP pixels
// r6=*palette[]   // stores combined values for 2 pixels
// r4=*gpio_base|MPIN2
// r1=NOT2-MPIN2
// r5=WR pin mask
// r7=buffer on stack to store pixel data
 lsrs r2,r0,#24
 lsls r0,#8
 lsls r2,#2
 ldr r3,[r6,r2] // load Combined Palette element for 2 pixels
 str r3,[r4]    // MPIN  1st pixel
 str r5,[r4,r1] // SET WR
 str r5,[r4,r1] // CLR WR
 stm r7!,{r3}
 movs r2,#16   //         ***
 str r5,[r4,r1] // SET WR
 str r5,[r4,r1] // CLR WR
 rors r3,r2
 str r3,[r4]    // MPIN  2nd pixel  *** 
 str r5,[r4,r1] // SET WR
 str r5,[r4,r1] // CLR WR
 nop                   // ***
 str r5,[r4,r1] // SET WR
 str r5,[r4,r1] // CLR WR

This is the fastest method I know. You need to insert
an instruction between the SET-CLR WR pairs, or
else it will write black pixels.
I marked these “hot spots” in the code with three stars ***
No extra instruction is needed between changing WR or RD states.


STM really speeds things up. A single STR takes 2(3) clocks to execute,
while an STM takes 1+N clocks for N registers.

STR r0,[r2]     // 2 clocks
STR r1,[r2,#4]  // 2 clocks

STR r0,[r2]     // 2 clocks
STR r1,[r0]     // 2 clocks + 1 clock penalty for using r0

STM r4!, {r0-r3} // 5 clocks total for storing 4 registers (base register kept out of the list)
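
The saving grows with the number of registers. A small sketch of the arithmetic, using only the 2-clock STR and 1+N-clock STM figures from above (the function names are for illustration):

```c
/* Clocks for storing n words with n separate STR instructions (2 each). */
static unsigned str_clocks(unsigned n_words)
{
    return 2u * n_words;
}

/* Clocks for storing n words with a single STM (1 + n). */
static unsigned stm_clocks(unsigned n_words)
{
    return 1u + n_words;
}
```

For four registers that is 8 clocks with STR against 5 with STM. For a single register the two come out equal.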

Data alignment on word boundary is important!
The loaded registers can be used after 3 instructions counted from the STM instruction.

A drawback is that STM/LDM cannot store or load the high registers (r8 and up).


Portrait mode is available with the following settings:
LCD Command=3, Data=0x28 (AM=1, ID=0b10)
LCD Command=0x20, Data=0xaf
LCD Command=0x21, Data=0
This mode is good for 8BPP and 16BPP modes only.
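
The Data value for command 3 can be built from the two fields. This is a sketch only: the bit positions (AM in bit 3, ID in bits 5:4) are inferred from the value 0x28 given above and from the usual entry mode layout on this family of controllers, so verify them against the datasheet. The macro and function names are invented for the example.

```c
#include <stdint.h>

/* Assumed bit positions in the entry mode register (command 3). */
#define ENTRY_MODE_AM_SHIFT 3u /* AM: address update direction */
#define ENTRY_MODE_ID_SHIFT 4u /* ID[1:0]: increment/decrement */

/* Combine the AM and ID fields into the entry mode data word. */
static uint16_t entry_mode_value(unsigned am, unsigned id)
{
    return (uint16_t)(((id & 0x3u) << ENTRY_MODE_ID_SHIFT) |
                      ((am & 0x1u) << ENTRY_MODE_AM_SHIFT));
}
```

`entry_mode_value(1, 2)` reproduces the 0x28 used for portrait mode.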


Reading data or commands from the LCD controller is not stable.
It needs manual adjustment and special timings.


Hi Dennis!

It's going to be very useful to have one more assembly expert in the house. Very neat trick on the GPIO->NOT.

Cool, can’t wait to see some nice updates to the display routines!

Either way, you need two registers to control the screen (in your snippet, R4 and R1). You can reach SET from MPIN with an immediate offset, so I only use a register to reach CLR. When starved for registers, this gives me the option of recycling a temporary register at the cost of just one cycle: movs %[tmp], 252.
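
The reason one offset works as an immediate and the other needs a register comes from the Thumb encodings: a word-sized `str` takes an immediate offset of 0 to 124 in steps of 4, and `movs` takes an 8-bit immediate of 0 to 255. A host-side sketch of those two limits (the helper names are made up, and treating 124 as the offset of SET is an assumption, taken from 252 minus the 0x80 gap between SET and CLR):

```c
/* Word-sized Thumb STR/LDR: 5-bit immediate scaled by 4, so 0..124. */
static int fits_str_word_imm(unsigned offset)
{
    return (offset % 4u == 0u) && (offset <= 124u);
}

/* Thumb MOVS with immediate: 8-bit value, so 0..255. */
static int fits_movs_imm8(unsigned value)
{
    return value <= 255u;
}
```

An offset of 124 fits in the `str` itself, while 252 does not, but it still fits in a single `movs`, which is where the one extra cycle comes from.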
To use NOT you might as well keep an absolute reference instead of an offset in R1.

At one point I considered using SP as a pointer to MPIN2, keeping a backup in one of the high registers. That would allow complete control over the screen using just one register, freeing 2 low registers in the process. On the other hand, it would require disabling interrupts for the duration of the copy. :thinking:


Neither can STR/LDR, so that’s not really a drawback of STM/LDM. The only instructions that have access to r8-r12 are mov, add, cmp, if I remember correctly.


Interesting, I hadn’t tried reading from the controller yet. I was hoping this would allow some way to detect vsync.

VSYNC works only in RGB mode.
Reading out even the ID code from the LCD controller is not stable. )-:

VSYNC works in RGB mode but not BGR? o_O