Hah, "call is unlikely"... We're sure the call is going to be made a few million times per second. Maybe GCC would inline it if we used PGO? In the absence of that, using the attribute seems a reasonable/simple fix.
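For reference, a minimal sketch of the attribute route, assuming the attribute meant here is GCC's always_inline. The signature of setup_data_16 and the fake_mpin variable are hypothetical stand-ins so the snippet builds on a host; on the target the body would write the real MPIN register.

```cpp
#include <cstdint>

// Hypothetical host-side stand-in for the memory-mapped MPIN register.
static volatile std::uint32_t fake_mpin = 0;

// always_inline asks GCC to inline the function regardless of its
// cost heuristics (such as the "call is unlikely" verdict).
__attribute__((always_inline)) static inline void setup_data_16(std::uint16_t data)
{
    fake_mpin = data;  // assumed body: a single register write
}
```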
setup_data_16 is generally called inside a loop, so SET_MASK_P2; can be moved out to before the loop, and CLR_MASK_P2; to after it.
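Roughly what I mean, as a host-runnable sketch. The macro bodies, the mask value, write_data_16 and send_pixels are all made up for illustration; only the names SET_MASK_P2 and CLR_MASK_P2 come from the real code.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins: the real macros touch the GPIO mask register.
static volatile std::uint32_t fake_mask = 0;
static volatile std::uint32_t fake_mpin = 0;
static unsigned mask_writes = 0;  // counts mask updates, for illustration only

#define SET_MASK_P2 do { fake_mask = 0xFFFF0000u; ++mask_writes; } while (0)
#define CLR_MASK_P2 do { fake_mask = 0u; ++mask_writes; } while (0)

// Assumed shape of setup_data_16 once the mask handling is removed from it.
static inline void write_data_16(std::uint16_t data) { fake_mpin = data; }

static void send_pixels(const std::uint16_t *pixels, std::size_t count)
{
    SET_MASK_P2;                      // once, before the loop
    for (std::size_t i = 0; i < count; ++i) {
        write_data_16(pixels[i]);     // only the data write stays in the hot path
    }
    CLR_MASK_P2;                      // once, after the loop
}
```

The mask is then touched twice per burst instead of twice per word.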
I haven't looked at the disassembly, but I observed that using the mbed macros for accessing memory-mapped registers (LPC_GPIO_PORT->MPIN = ...) was a tiny bit slower than direct memory manipulation (*reinterpret_cast<uint32_t *>(0xA0002188) = ...). I assume the struct dereference is not optimized away and costs a few extra instructions, while the direct write compiles to a simple STR.
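The two access styles side by side, as a sketch that runs on a host. FakeGpioPort and FAKE_GPIO_PORT are hypothetical stand-ins for the vendor header; on the target the second form would use the literal address 0xA0002188. Note that I've added volatile to the direct cast here: register structs in vendor headers are normally declared volatile, and without it the compiler is free to merge or drop stores, which might account for part of the speed difference. That is a guess until someone checks the disassembly.

```cpp
#include <cstdint>

// Hypothetical stand-in for the vendor's GPIO port struct (one field only).
struct FakeGpioPort {
    volatile std::uint32_t MPIN;
};
static FakeGpioPort fake_port = {0};

// Mirrors the header style: a macro that yields a pointer to the struct.
#define FAKE_GPIO_PORT (&fake_port)

// Style 1: access through the struct pointer macro.
static inline void write_via_struct(std::uint32_t value)
{
    FAKE_GPIO_PORT->MPIN = value;
}

// Style 2: access through a raw address cast to a volatile pointer.
// On the host the address is taken from the fake struct instead of a literal.
static inline void write_via_address(std::uint32_t value)
{
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(&fake_port.MPIN);
    *reinterpret_cast<volatile std::uint32_t *>(addr) = value;
}
```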
Reducing setup_data_16 to just a STR should convince GCC to inline it, as that'd be smaller than the code needed to call it... but at that point, we might as well just put the MPIN write in a macro, just like the mask code.
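Something along these lines, as a sketch only. MPIN_REG and WRITE_DATA_16 are names I made up, and fake_mpin stands in for the register so this builds on a host.

```cpp
#include <cstdint>

// Hypothetical host-side stand-in for the MPIN register.
static volatile std::uint32_t fake_mpin = 0;

// On the target this would expand to the real register, for example
// (*reinterpret_cast<volatile std::uint32_t *>(0xA0002188)).
#define MPIN_REG (fake_mpin)

// Hypothetical macro replacing the setup_data_16 call: one store, no call.
#define WRITE_DATA_16(value) \
    do { MPIN_REG = static_cast<std::uint32_t>(value); } while (0)
```

The do/while wrapper keeps it usable as a single statement, e.g. WRITE_DATA_16(pixel); inside an unbraced if.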