Problem with LZ4 compression in Pokitto and fade out routines in emulator

manuelsagra · February 18, 2020, 8:44pm

I’ve created a small project to show two problems I’ve faced recently. One of them is the reason I’m trying to do some hardware debugging, and is related with LZ4 (de)compression on the real device. The project runs without problems in the emulator, but in the device the project hangs after the first frame is shown. Maybe is it trying to write to a wrong memory space?

The other issue is not important but I found it interesting. Remove the flag Frame function and draw something on the screen. The fadeout shows green tones on the emulator, but in the real device is working properly. Am I doing something wrong?

Flag.bin.2

Any clues?

Thanks!

Flag.zip (263.7 KB)

FManga · February 18, 2020, 9:54pm

It crashes when it tries to uncompress flag01, tries to read from an address that doesn’t exist.
Maybe flagFramesLength[1] is incorrect?
Edit: no, it’s not the length…
Edit: Interesting, it only crashes on odd frames (1, 3, 5).

manuelsagra · February 18, 2020, 10:03pm

Maybe it’s something related to odd memory addresses on the ARM, and that explains why x86-64 doesn’t fail… just a thought.

Edit: I tried to align all the arrays to even lenghts (adding a 0 at the end on odd length arrays, but not altering the actual length number) but it’s still failing.

Pharap · February 18, 2020, 10:24pm

Does the same problem occur if you substitute the ARM assembly for C or C++ code?
(I believe there is some available on the LZ4 repo?)

If not, it might be worth comparing the code generated for the compiled C or C++ with the hand written assembly to see if that could give any hints.

manuelsagra · February 18, 2020, 10:28pm

The problem is that the C implementations I’ve seen are quite complex (such as the reference one) or use too much memory (https://create.stephan-brumme.com/smallz4/). I’ve to look for a better alternative.

Pharap · February 18, 2020, 10:42pm

I’m not suggesting you replace the assembly with some C or C++ code permenantly,
I’m suggesting comparing the generated assembly code to the hand-written assembly code to see what’s different.
Maybe the handwritten assembly has a mistake that won’t be replicated by the compiler.

That’s assuming that there actually is some source code that behaves similarly to the assembly code.
If there isn’t then my suggestion is useless. (You could try to write one I guess?)

FManga · February 18, 2020, 10:54pm

I also suspected alignment could be an issue, but __attribute__ ((aligned)) didn’t help. Looking at the code, almost all of the memory accesses are done with LDRB so alignment won’t be an issue.

Pharap · February 19, 2020, 7:13pm

Last night before I left I started having a go at trying to write the C++ equivalent of the assembly code, but I only got so far.

I’m not even sure how much of this is right to be honest…

// r0 = source, r1 = destination
void unlz4(const std::uint8_t * source, std::uint8_t * destination)
{
	// Get length of data
	// LDRH r2, [r0]
	// Advance source pointer, R0 += 2
	// ADDS r0, r0, #2

	// The actual code is more like const std::uint16_t length = *reinterpret_cast<const std::uint16_t *>(source);
	// but that would be in the realm of undefined behaviour
	const std::uint16_t length = ((source[0] << 8) | (source[1] << 0));
	unlz4_length(&source[2], destination, length);
}

// r0 = source, r1 = destination, r2 = length
void unlz4_length(const std::uint8_t * source, std::uint8_t * destination, std::uint32_t length)
{
	// PUSH {r4-r6,lr}
	// ADDS r5, r2, r0
	std::uint32_t r5 = &source[length];

	getToken:
	// R6 = *R0
	// LDRB r6, [r0]
	std::uint8_t r6 = *source;
	// ++R0
	// ADDS r0, r0, #1
	source += 1;
	
	// R4 = R6 >> 4
	// LSRS r4, r6, #4
	std::uint32_t r4 = (r6 >> 4);
	
	// BEQ getOffset
	if(r4 == 0)
		goto getOffset;
	
	// BL getLength
	// call getLength;
	
	// MOVS R2, R0
	// ???
	
	// BL copyData	
	// call copyData
	
	// MOVS R0, R2
	// ???
	
	getOffset:
	// R3 = R0[0]
	// LDRB r3, [r0, #0]
	// R2 = R1 - R3
	// SUBS r2, r1, r3
	// R3 = R0[1]
	// LDRB r3, [r0, #1]
	// R3 = (R3 << 8)
	// LSLS r3, r3, #8
	// R2 -= R3
	// SUBS r2, r2, r3
	// R0 += 2
	// ADDS r0, r0, #2
	// R4 = R6 & 0x3
	// LSLS r4, r6, #28
	// LSRS r4, r4, #28
	// call getLength
	// BL getLength
	// R4 += 4
	// ADDS r4, r4, #4
	// BL copyData
	// CMP R0, R5
	// BLT getToken
	// POP {r4-r6,pc}
	
	//getLength:
	//CMP r4, #0x0F
	//BNE gotLength
	if(r4 == 15)
	{
		//getLengthLoop:
		//LDRB r3, [r0]
		//ADDS r0, r0, #1
		//ADDS r4, r4, r3
		//CMP r3, #0xFF
		//BEQ getLengthLoop
		
		do
		{
			const std::uint8_t r3 = *r0;
			r0 += 1;
			r4 += r3;
		}
		while(r3 == 0xFF);
		//gotLength:
	}	
	// return;
	//BX LR
}

// R1, R2, R4
void copyData(const std::uint8_t * source, std::uint8_t * destination, std::int32_t length)
{
	// R4 = -R4
	// RSBS r4, r4, #0
	// R2 -= R4
	// SUBS r2, r2, r4
	// R1 -= R4
	// SUBS r1, r1, r4
	
	length = (0 - length);
	destination = (destination - length);
	source = (source - length);
	
	// copyDataLoop:
	do
	{
		// R3 = R2[R4]
		//LDRB r3, [r2, r4]
		// R1[R4] = R3
		//STRB r3, [r1, r4]
		// R4 += 1;
		//ADDS r4, r4, #1
		const std::uint8_t value = *(destination + length);
		*(source + length) = value;
		length = (length + 1);
	}
	while(length != 0);
	// BNE copyDataLoop

	// return;
	// BX lr
}

So far the thing I’m most surprised about is copyData.
It all seems technically valid, but it’s a really backwards way to go about things.
I get that they’re probably trying to use fewer instructions, but even so.

I probably won’t finish implementing the rest, so make of it what you will.
(Bear in mind that even if this compiles it probably won’t do anything worthwhile yet.)

Not really relevant, but for some reason I’ve always found that lowercase assembly looks odd.
I tend to think it looks better when it’s uppercase.
I don’t know why that should be.

manuelsagra · February 19, 2020, 8:30pm

Thanks a lot! I’ll try to finish this implementation and see if it works on real hardware. Wish me luck

FManga · February 19, 2020, 9:49pm

This is what I’ve been debugging with, in case anyone else finds it easier to read:
Edit: see next post

FManga · February 19, 2020, 10:48pm

Got it to work:

.syntax             unified
.cpu                cortex-m0
.thumb
        
.func               unlz4
.global             unlz4, unlz4_len
.type               unlz4, %function
.type               unlz4_len, %function
        
CData0          .req r0
DestData1       .req r1
CopySrc         .req r2
Tmp3            .req r3
CopyLen4        .req r4
EndCData        .req r5
Token6          .req r6
        
.thumb_func
unlz4:
	ldrb r2, [CData0]             /* get length of compressed data */
	ldrb Tmp3, [CData0, #1]
	lsls r2, #8
	adds r2, Tmp3
	adds CData0, #2            /* advance source pointer */
        
.thumb_func
unlz4_len:
        push {r4-r6, lr}          /* save r4,  EndCData,  r6 and return-address */
        
        adds EndCData, r2, CData0            /* point EndCData to end of compressed data */
        
getToken:
        ldrb Token6, [CData0]             /* get token */
        adds CData0, #1            /* advance source pointer */
        lsrs CopyLen4, Token6, #4            /* get literal length,  keep token in r6 */
        beq  getOffset           /* jump forward if there are no literals */
        bl   getLength           /* get length of literals */
        movs CopySrc, CData0               /* point r2 to literals */
        bl   copyData            /* copy literals (r2=src,  DestData1=dst,  r4=len) */
        movs CData0, CopySrc               /* update source pointer */

getOffset:
        cmp  CData0, EndCData
        bge  done
        ldrb Tmp3, [CData0, #0]          /* get match offset's low byte */
        subs CopySrc, DestData1, Tmp3            /* subtract from destination; this will become the match position */
        ldrb Tmp3, [CData0, #1]          /* get match offset's high byte */
        lsls Tmp3, #8            /* shift to high byte */
        subs CopySrc, Tmp3            /* subtract from match position */
        adds CData0, #2            /* advance source pointer */
        lsls CopyLen4, Token6, #28           /* get rid of token's high 28 bits */
        lsrs CopyLen4, #28           /* move the 4 low bits back where they were */
        bl   getLength           /* get length of match data */
        adds CopyLen4, #4            /* minimum match length is 4 bytes */
        bl   copyData            /* copy match data (r2=src,  DestData1=dst,  r4=len) */
        cmp  CData0, EndCData               /* check if we've reached the end of the compressed data */
        blt  getToken            /* if not,  go get the next token */
done:   
        pop  {r4-r6, pc}          /* restore r4,  EndCData and r6,  then return */
        
.thumb_func
getLength:
        cmp  CopyLen4, #0x0f            /* if length is 15,  then more length info follows */
        bne  gotLength           /* jump forward if we have the complete length */
getLengthLoop:
        ldrb Tmp3, [CData0]             /* read another byte */
        adds CData0, #1            /* advance source pointer */
        adds CopyLen4, Tmp3            /* add byte to length */
        cmp  Tmp3, #0xff            /* check if end reached */
        beq  getLengthLoop       /* if not,  go round loop */
gotLength:
        bx   lr                  /* return */
        
.thumb_func
copyData:
        rsbs CopyLen4, #0            /* index = -length */
        subs CopySrc, CopyLen4            /* point to end of source */
        subs DestData1, CopyLen4            /* point to end of destination */
        
copyDataLoop:
        ldrb Tmp3, [CopySrc, CopyLen4]          /* read byte from source_end[-index] */
        strb Tmp3, [DestData1, CopyLen4]          /* store byte in destination_end[-index] */
        adds CopyLen4, #1            /* increment index */
        bne  copyDataLoop        /* keep going until index wraps to 0 */
        bx   lr                  /* return */
        
.size               unlz4, .-unlz4
        
.endfunc

As for the green tones you mentioned, I see them in hardware as well.

jonne · February 19, 2020, 10:51pm

Yeeees? And the problem was?

FManga · February 19, 2020, 11:07pm

tl;dr: it just needed some more instructions:

        cmp  CData0, EndCData
        bge  done

Slightly more detailed explanation, but still simplifying a lot:
Lz4 compresses data in Blocks. Blocks can add new data to the output (the “literal” part) and they can repeat data that was already added (the “match” part). The code expects all blocks to have these two parts, but the last blocks in @manuelsagra’s images don’t contain the match part. Before processing the match, I made it check if it had already reached the end (CData0 == EndCData) and bail if necessary.

For the full explanation, the original article does a pretty good job, other than assuming that blocks will always have match parts.

Hanski · February 20, 2020, 7:45am

Good work! Did you find any explanation why the emulator did not crash?

FManga · February 20, 2020, 11:32am

Yes. Because of the broken block it tries to read from the void that is between Flash and SRAM (addresses greater than 0x40000 and below 0x10000000). Since not all memory addresses have been implemented yet, the emulator just returns zero while hardware triggers a hard fault.

manuelsagra · February 20, 2020, 11:42am

Thanks a lot @FManga!!!

Meanwhile, I’m trying to write a simple C implementation of the algorithm, in order to try to understand it better, but somehow this doesn’t work yet:

void unlz4c(const uint8_t* source, uint8_t* destination, int length) {
	int sourceOffset = 0;
	int destinationOffset = 0;
	uint8_t moreLength = 0;
	int copyIndex = 0;
	
	while (sourceOffset < length) {
	    uint8_t token = source[sourceOffset++];
	    
	    uint8_t literalNibble = (token & 0xF0) >> 4;
	    uint8_t matchNibble = token & 0x0F;
	    
	    if (literalNibble) {
	        int literalLength = literalNibble;
	        if (literalNibble == 0xF) {
	            do {
	                moreLength = source[sourceOffset++];
	                literalLength += moreLength;
	            } while (moreLength == 0xFF);
	        }
	        while (literalLength > 0) {
	            *(destination + destinationOffset++) = *(source + sourceOffset++);
	            literalLength--;
	        }
	    }
	    
	    if (matchNibble) {
	        int matchOffset = source[sourceOffset++] | (source[sourceOffset++] << 8);
	        int matchLength = matchNibble + 4;
	        if (matchNibble == 0xF) {
	            do {
	                moreLength = source[sourceOffset++];
	                matchLength += moreLength;
	            } while (moreLength == 0xFF);
	        }
	        while (matchLength > 0) {
	            while (copyIndex < matchOffset) {
	                *(destination + destinationOffset) = *(destination + destinationOffset - matchOffset);
	                destinationOffset++;
	                copyIndex++;
	            }
	            copyIndex = 0;
	            matchLength--;
	        }
	    }
	}
}

BTW, the blocks are created in Java using commons-compress from Apache, and maybe that code doesn’t create the last match block:

private static byte[] getLZ4(byte[] in) {
	ByteArrayOutputStream dest = new ByteArrayOutputStream(); 
	try {
		BlockLZ4CompressorOutputStream lzOut = new BlockLZ4CompressorOutputStream(dest);
		lzOut.write(in);
		lzOut.close();
	} catch (IOException e) {
		System.err.println("Error compressing with LZ4");
	}
	return dest.toByteArray();
}

And about the fade out, I guess that RGB being 565 tends to fade to green in the last steps, but I didn’t spot it in the Pokitto screen, only in my computer monitor.

Thanks again!

FManga · February 20, 2020, 12:38pm

Try changing that to this:
if (matchNibble && sourceOffset < length) {

manuelsagra · February 20, 2020, 12:56pm

The check is wrong: matchNibble can be 0, but the other check is right. In fact, the fifth block is:

0x10, 0xBC, 0x24, 0x00

And that translates to:

0xBC, 0xCC, 0xCC, 0xCC, 0xCC

Because there’s a 0xCC byte with an offset -0x0024.

Edit: This is the proper code, I misunderstood how the the match copy worked, it’s quite simple in fact

void unlz4c(const uint8_t* source, uint8_t* destination, int length) {
	int sourceOffset = 0;
	int destinationOffset = 0;
	uint8_t moreLength = 0;
	
	while (sourceOffset < length) {
	    uint8_t token = source[sourceOffset++];
	    
	    uint8_t literalNibble = (token & 0xF0) >> 4;
	    uint8_t matchNibble = token & 0x0F;
	    
	    if (literalNibble) {
	        int literalLength = literalNibble;
	        if (literalNibble == 0xF) {
	            do {
	                moreLength = source[sourceOffset++];
	                literalLength += moreLength;
	            } while (moreLength == 0xFF);
	        }
	        while (literalLength > 0) {
	            *(destination + destinationOffset++) = *(source + sourceOffset++);
	            literalLength--;
	        }
	    }
	    
	    if (sourceOffset < length) {
            int matchOffset = source[sourceOffset++] | (source[sourceOffset++] << 8);
            int matchLength = matchNibble + 4;
            if (matchNibble == 0xF) {
                do {
                    moreLength = source[sourceOffset++];
                    matchLength += moreLength;
                } while (moreLength == 0xFF);
            }
            while (matchLength > 0) {
                *(destination + destinationOffset) = *(destination + destinationOffset - matchOffset);
                destinationOffset++;
                matchLength--;
            }
	    }
	}
}

Edit2: Now the question is: If the algorithm is so simple, WHY I COULDN’T FIND A SHORT FUNCTION LIKE THIS ANYWHERE?