Unfortunately many of the points in ‘Performance’ lack examples at the moment, but I’ve read examples in other documents that show most of these rules hold true.
My favourite parts of ‘Performance’ are probably points 4 and 5:
“Don’t assume that complicated code is necessarily faster than simple code”
“Don’t assume that low-level code is necessarily faster than high-level code”
I like them because I see so many people make those assumptions without taking the time to understand why those assumptions aren’t necessarily correct.
(Also #10 “Rely on the static type system” and #11 “Move computation from run time to compile time” are good bits of advice.)
True, I’ve seen people write complicated code that was supposed to be optimised but ended up slower because its unpredictable memory access pattern killed the cache. (Though recently I’ve been informed there is no cache on the Pokitto MCU!)
Now I’m interested in Per.17 (“Declare the most used member of a time-critical struct first”), which sadly lacks details.
In my raycasting engine I recently started moving a lot of computation to compile time, it helps a lot!
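For anyone curious what that looks like in practice, here’s a minimal sketch (assuming C++17; the table size and formula are made up for illustration, not taken from my actual engine) of moving a lookup-table computation from run time to compile time with constexpr:

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kSteps = 16;

// Hypothetical example: build a lookup table at compile time instead of
// computing the values at run time. The formula is a stand-in.
constexpr std::array<int, kSteps> makeTable()
{
    std::array<int, kSteps> table {};
    for (std::size_t i = 0; i < kSteps; ++i)
        table[i] = static_cast<int>(i * i);
    return table;
}

// Evaluated entirely at compile time; the binary just contains the data.
constexpr auto kTable = makeTable();

// Proof that the values exist before the program even runs.
static_assert(kTable[4] == 16, "computed at compile time");
```

The nice part is that the run-time cost of `kTable[i]` is a single array lookup, with no maths at all.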
There is indeed no cache on the Pokitto’s MCU, nor on the Arduboy’s MCU.
In fact I think anything small enough to be classed as an ‘MCU’ rather than a ‘CPU’ would probably be without a cache.
There’s more to it than just the cache though.
For example, C++-style std::sort is generally considered faster than C-style qsort (reference here and here), thanks to the way templates allow the code to be optimised on a per-type basis,
and if a function object (or lambda) is used in place of a function pointer then the comparison can be inlined into the sort and a level of indirection elided - something not possible with qsort.
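To make the contrast concrete, here’s a minimal sketch of the two interfaces side by side (the helper names are mine, just for illustration):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>

// C style: qsort only ever sees the comparator as an opaque function
// pointer, so it generally cannot inline the comparison into its loop.
inline int compareInts(const void * a, const void * b)
{
    const int lhs = *static_cast<const int *>(a);
    const int rhs = *static_cast<const int *>(b);
    return (lhs > rhs) - (lhs < rhs);
}

inline void sortWithQsort(int * data, std::size_t count)
{
    std::qsort(data, count, sizeof(int), compareInts);
}

// C++ style: std::sort is instantiated specifically for int and this
// exact lambda type, so the comparison can be inlined and the
// function-pointer indirection elided.
inline void sortWithStdSort(int * data, std::size_t count)
{
    std::sort(data, data + count,
        [](int lhs, int rhs) { return lhs < rhs; });
}
```

Both produce the same result; the difference is purely in what the optimiser is allowed to see.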
Fortunately I can answer this one.
Members of any class (or struct) are accessed by taking the this pointer (the pointer that points to the first address of the class in memory, possibly word aligned) and offsetting it by a given amount.
Hence more frequently accessed members should be given a smaller offset, and the most frequently accessed member should be at the top, because that requires no offset, just an indirection.
I’ve actually seen this in action when helping @Botisaurus optimise his Arduwars.
By moving the large arrays to later in the structures and the more frequently accessed members towards the start, the progmem usage dropped.
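Roughly what that reordering looks like (the struct and member names here are invented for illustration, not Arduwars’ actual data):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical layout: frequently accessed members go first, so they
// get the smallest offsets; rarely touched bulk data goes last.
struct Entity
{
    std::uint8_t health;        // accessed every frame -> offset 0
    std::uint8_t x;             // offset 1
    std::uint8_t y;             // offset 2
    std::uint8_t inventory[32]; // large, rarely used -> pushed to the end
};

// All members are single bytes, so there's no padding and the offsets
// are exactly the declaration order.
static_assert(offsetof(Entity, health) == 0, "most-used member first");
static_assert(offsetof(Entity, inventory) == 3, "bulk data at the end");
```

On AVR the smaller offsets can be encoded directly in the load instruction’s displacement, which is (as far as I understand it) where the progmem saving came from.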
I think it’s possible for a struct to be split across a series of registers if it’s small enough, but I’m not certain - there might be language rules that forbid that.