
So short answer:
Align memory. It's faster that way on Intel machines.
Good post! TY that helps a lot :)
Just to iron out some thoughts: how useful is 8-byte alignment on modern CPUs, in this case a P4? For context, in the Linux kernel code the movsl alignment mask is defined as 8 bytes (arch/i386/cpu/intel.c).
Right, after reading the source I think I've figured that out. There is an optimised Intel version of the copy code which unrolls a short part of the copy (the first 64 bytes, I think) before using rep movsl for the rest. If the copy doesn't meet the alignment or minimum-length requirements, it uses rep movsl without the unrolled copy first. I'm not sure why it needs to be aligned to 8 bytes; from my understanding, 4-byte alignment should be sufficient. It could be that on a 64-bit machine it's copying 8 bytes, not 4.
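To make the alignment/minimum-length check described above concrete, here is a rough sketch in portable C. It is not the kernel's code: the names (COPY_ALIGN_MASK, COPY_MIN_LEN, is_aligned, qualifies_for_unrolled_copy) are made up for illustration, and the values (a mask of 7 for 8-byte alignment, a 64-byte minimum) are just the numbers mentioned in this thread. The real kernel logic may well differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical values taken from the discussion: a mask of 7 tests for
   8-byte alignment, and 64 bytes is the assumed minimum copy length. */
#define COPY_ALIGN_MASK ((uintptr_t)7)
#define COPY_MIN_LEN    ((size_t)64)

/* Returns 1 if the address of p is a multiple of (mask + 1), else 0. */
int is_aligned(const void *p, uintptr_t mask)
{
    /* The bit test is only valid when mask + 1 is a power of two. */
    assert(((mask + 1) & mask) == 0);
    return ((uintptr_t)p & mask) == 0;
}

/* Returns 1 if a copy meets both requirements discussed above:
   source and destination are 8-byte aligned and at least 64 bytes
   are being moved. Otherwise returns 0. */
int qualifies_for_unrolled_copy(const void *dst, const void *src, size_t n)
{
    return n >= COPY_MIN_LEN
        && is_aligned(dst, COPY_ALIGN_MASK)
        && is_aligned(src, COPY_ALIGN_MASK);
}
```

With these assumed values, a 64-byte copy between addresses 0x1000 and 0x2008 qualifies, while the same copy from 0x2004 does not, because 0x2004 is only 4-byte aligned.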
Without actually running some low-level benchmarks, is it likely that larger or smaller alignment values would reduce overhead for the CPU? Perhaps in special cases such as audio?
Generally, larger alignment is faster. Also, as you get closer to the hardware, the alignment requirements get stricter. Early DMA controllers, for instance, required buffers to be "page" aligned (16 bytes, IIRC). I have no idea if this is still true today. The downside of large alignment is that it uses more memory, and if you end up having to touch that alignment padding, you waste resources. The trick is trading off this "wasted" space against the speedup you get from the alignment. What exactly are you trying to discover here? Are you trying to figure out how to deglitch some audio?
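To put rough numbers on the "wasted" space side of that trade-off, here is a small sketch in C that rounds a size up to a given alignment and reports the padding. The helper names (align_up, padding_for) are made up for this post, and it assumes the alignment is a power of two. It says nothing about the speed side, which you would still have to benchmark.

```c
#include <assert.h>
#include <stddef.h>

/* Rounds size up to the next multiple of align.
   Assumes align is a non-zero power of two. */
size_t align_up(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}

/* Number of padding bytes added when size is rounded up to align. */
size_t padding_for(size_t size, size_t align)
{
    return align_up(size, align) - size;
}
```

For a 13-byte object, padding_for(13, 8) is 3 bytes, while padding_for(13, 64) is 51 bytes, so the cost of a large alignment grows quickly when the objects themselves are small.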