Re: [wlug] movsl alignment

15 May 2006

      Glenn Enright wrote:
...
On Monday 15 May 2006 9:58 am, Perry Lorier wrote:
...
Generally larger alignment is faster.  Also as you start getting closer
to the hardware you start getting stricter requirements for alignment.
Early DMA controllers for instance had to be "page" aligned (16 bytes
IIRC).  I have no idea if this is still true today.
The downside of large alignment is that it uses more memory, and if you
end up having to touch that alignment padding then you end up wasting
resources.  The trick is trading off this "Wasted" space, vs the speedup
you get from the alignment.
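
To put rough numbers on that trade-off, here is a minimal C11 sketch (not
taken from the kernel). It assumes a 64-byte cache line, which is only an
assumption and varies by CPU; CACHE_LINE, the two structs and the helper
functions are all hypothetical names made up for illustration.

```c
#include <assert.h>    /* static_assert */
#include <stdalign.h>  /* alignas */
#include <stddef.h>    /* size_t */

/* Assumption: 64-byte cache lines. Check your own CPU; this is not universal. */
#define CACHE_LINE 64

/* One int of payload at its natural alignment. */
struct counter_packed {
    int value;
};

/* Same payload, but every object starts on its own cache line. */
struct counter_padded {
    alignas(CACHE_LINE) int value;
};

/* The compiler pads the struct out to a whole multiple of its alignment. */
static_assert(sizeof(struct counter_padded) == CACHE_LINE,
              "one padded counter per cache line");

/* Bytes of padding spent per object to buy the alignment. */
size_t padding_per_object(void)
{
    return sizeof(struct counter_padded) - sizeof(struct counter_packed);
}

/* Total bytes used by an array of n counters in each layout. */
size_t packed_bytes(size_t n) { return n * sizeof(struct counter_packed); }
size_t padded_bytes(size_t n) { return n * sizeof(struct counter_padded); }
```

With these assumptions an array of padded counters costs 64 bytes per
element regardless of how small the payload is. Whether that is worth it
depends on whether the alignment actually buys a measurable speedup.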
Right, pretty much matches what I was thinking. I will look at DMA stuff next.
I should probably start doing some profiling to see what the objective
results are, rather than subjective ;-p. Might look into the ck series kernels
to see the differences.
Unfortunately most of my assembly language coding was done in the dark
ages of the 16-bit DOS days (segmentation, 640k and TSRs, oh my!), and
I've really not bothered about what's happening under the hood since then.

Goodness knows how APIC/IOMMU/ACPI work[1] :)
...
My understanding is that P4 chips have quite bad latency for some common I/O
instructions, which is where AMD makes ground. There is more on this here:
http://answers.google.com/answers/threadview?id=321522
Intel chips have always been slow at various things, they've always just
done better with clock cycles.  A memorable example of this was the LOOP
opcode on Pentium chips.  It was very slow on Intel machines, so
programs used it as a timing delay.  On AMD chips it was extremely fast
(1 cycle IIRC), so those timing loops effectively became no-ops, causing
lots of programs to fail in amusing and/or spectacular fashion.
...
I traced the movsl mask back to L1_CACHE_BYTES, which makes sense.
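
For anyone following along, this is roughly how a cache-line mask gets used
for alignment checks. It is a sketch of the general technique, not the
kernel's code: LINE_BYTES merely stands in for L1_CACHE_BYTES with an
assumed value of 64, and the helper names are made up for illustration.

```c
#include <assert.h>   /* static_assert */
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 64-byte L1 cache line, purely for illustration. */
#define LINE_BYTES 64u
#define LINE_MASK  ((uintptr_t)(LINE_BYTES - 1))

/* The mask trick only works when the line size is a power of two. */
static_assert((LINE_BYTES & (LINE_BYTES - 1)) == 0,
              "line size must be a power of two");

/* True if addr sits exactly on a line boundary. */
bool is_line_aligned(uintptr_t addr)
{
    return (addr & LINE_MASK) == 0;
}

/* Round addr up to the next line boundary (unchanged if already aligned). */
uintptr_t line_align_up(uintptr_t addr)
{
    return (addr + LINE_MASK) & ~LINE_MASK;
}

/* True if both addresses have the same offset within a line, so a copy
   between them can keep source and destination aligned at the same time. */
bool same_line_offset(uintptr_t a, uintptr_t b)
{
    return ((a ^ b) & LINE_MASK) == 0;
}
```

The XOR test is the interesting one for copies: if source and destination
differ in their low bits, no amount of leading byte copies will get both of
them aligned together.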
...
What exactly are you trying to discover here?  Are you trying to figure
out how to deglitch some audio?
Not really, just using that as an example (although ac97 drivers do still have
a bad I/O-related bug).
Only one?  Miracle!
...
I'm hacking round in i386 arch trying to learn a bit
more about how io is handled and increasing my knowledge of system 
programming at the same time. Been attempting to absorb Intel docs on this. 
Kinda hobby type thing.
Ah, I remember the good ol' days of doing this myself :)  Although 
documentation wasn't quite as free flowing as it is these days.
...
So far, recent test builds have shown about a 5% decrease in core code
size (subtle bugs aside) using gcc 3.4.6, just by manually optimizing
kernel code for the P4 2.6 (stepping 9) which I'm running.
Building with '-march=pentium4' has worked nicely so far.
Nice.  I guess an important lesson here is that when compiling a kernel,
you should compile it for your CPU; it'll be spiffier!
...
Also the MB (Abit IS7) appears to have really good IO subsystems,
which fascinates me :).  Learning what kernel devs do when things
break has been fun. I realise that newer 64bit offerings do many
things differently, but this is what I have to play with for now.
Yeah, I've not looked that closely at the 64bit stuff other than the 
more, bigger, registers.

---
[1] I realise that implicit in this statement I assume they do
*actually* work.