I tested it on my desktop, so it works for that at least. I also verified that it fails if you leave off either nocreat or notrunc.
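To make it clearer what those two flags buy you, here is a rough Python sketch of the same kind of write at the system-call level. It is only an illustration: `overwrite_in_place` is a made-up helper name, and it is not what dd literally runs. The point is that the output is opened without `O_CREAT` (so no new directory entry is created) and without `O_TRUNC` (so the existing file keeps its length and its clusters).

```python
import os

def overwrite_in_place(path: str, data: bytes) -> None:
    """Overwrite the start of an existing file without creating or truncating it.

    Hypothetical helper for illustration only.
    """
    # No O_CREAT: a missing file raises FileNotFoundError (compare conv=nocreat).
    # No O_TRUNC: bytes past len(data) keep their old contents (compare conv=notrunc).
    fd = os.open(path, os.O_WRONLY)
    try:
        view = memoryview(data)
        while len(view) > 0:
            written = os.write(fd, view)  # may write fewer bytes than requested
            view = view[written:]
        os.fsync(fd)  # flush to the device before returning (compare conv=fsync)
    finally:
        os.close(fd)
```

Calling it on a file that does not exist fails instead of creating one, and writing 2 bytes over an 8-byte file leaves the last 6 bytes as they were.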
The fsync is just a timing change (also not available on OSX), as the unmount will do it for you. Adding it to the dd just makes it more likely to work if something breaks between the dd finishing and the unmount finishing.
Adding sync pads the last block out with nulls. Since the file wasn’t truncated, without the padding the remaining bytes of that block have to come from the old file, so the file system has to read its own block (the file system’s block, not dd’s) in order to have them to write back. The idea is to save that read.
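As a small sketch of what that padding amounts to for a regular file (where only the last block can come up short), assuming a hypothetical helper name `pad_last_block`:

```python
def pad_last_block(data: bytes, block_size: int) -> bytes:
    """Pad data with NUL bytes up to the next multiple of block_size.

    Illustrates the effect of conv=sync on a regular file, where only
    the final input block can be short. Hypothetical helper name.
    """
    remainder = len(data) % block_size
    if remainder == 0:
        return data  # already block-aligned, nothing to add
    return data + b"\x00" * (block_size - remainder)
```

With a 1K block size, a 1500-byte image goes out as 2048 bytes, so the second block is written whole and nothing has to be merged in from the old file.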
Changing the block size - well, that one’s hard to predict, because the effective block size on the media is determined by the file system layout presented by the chip, the buffering strategy, file system syncs, and the flash sector size.
Checking the manual, it appears that the USB ISP code doesn’t use the “erase page” facility, so writing is always going to be full sectors (4K up to ~100K, then 32K for things beyond that). The file system presented by the ROM bootloader has a 1K cluster size. So yeah, 265 bytes was a bad choice. The question is whether you want 1K or 4K writes. If each dd buffer is physically written to flash individually, then writing 4K of data in 1K buffers will mean reading the sector in 4 times, each time copying the 1K buffer into the right place in it, erasing the flash and then writing the sector back out. Of course, using a 4K buffer in dd may not help, because that could well turn into writes for each individual 1K cluster.
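To put rough numbers on that worst case, here is a back-of-the-envelope sketch. It assumes that every dd buffer reaches the flash as its own read/erase/program cycle with no buffering anywhere in between, and that all sectors are 4K. The function name and the 32K image size are made up for the example.

```python
def flash_cycles(image_size: int, write_size: int, sector_size: int = 4096) -> int:
    """Count sector read/erase/program cycles for an image written in
    write_size chunks, assuming each chunk hits the flash individually.

    Worst-case model only; real behaviour depends on buffering in the
    file system, the USB storage driver, and the chip itself.
    """
    cycles = 0
    for offset in range(0, image_size, write_size):
        end = min(offset + write_size, image_size)
        first_sector = offset // sector_size
        last_sector = (end - 1) // sector_size
        # Each sector touched by this chunk is rewritten in full.
        cycles += last_sector - first_sector + 1
    return cycles

# Hypothetical 32K image on 4K sectors:
print(flash_cycles(32 * 1024, 1024))  # 32 cycles with 1K writes
print(flash_cycles(32 * 1024, 4096))  # 8 cycles with 4K writes
```

Under those assumptions 1K writes cost four times the erase cycles of 4K writes, which is the gap that any buffering further down the stack would shrink or remove.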
The VFAT file system, the UMS device driver it talks to, and the MSC implementation on the NXP chip all get a chance to buffer things and change the size of what’s physically written to flash. The first question should be whether the MSC implementation does any buffering, because if it’s sufficiently smart, then what the others do won’t really matter much. The second question is whether the VFAT file system and UMS driver will pass a 4K write through to the device, because if they won’t, then 1K and 4K will behave pretty much the same way, just that 1K makes 4x the system calls and so is a little bit slower (at least without conv=sync).