
Can anyone help with the relation between the assembler command 'movsl' and alignment? All I've been able to find so far is fairly general stuff in kernel patches about bulk memory moves. -- This fortune is inoperative. Please try another.

Glenn Enright wrote:
Can anyone help with the relation between the assembler command 'movsl' and alignment? All I've been able to find so far is fairly general stuff in kernel patches about bulk memory moves.
movsl is an assembler command for "move string (long)". It copies 4 bytes (the 64-bit variant, movsq, copies 8) from DS:ESI to ES:EDI, and increments ESI and EDI by the number of bytes copied. It's usually used with the "REP" prefix, which repeats the instruction, decrementing ECX each iteration, until ECX is zero. So you can say "REP MOVSL" to copy lots of memory in one opcode. Ahh, the wonderful bounties of a CISC world.

Now, Intel machines don't really care too much about alignment: you can fetch a byte at any address without issue. However, the various busses inside your PC usually don't work in a byte-by-byte manner, but in multiples of a byte; 4 to 16 bytes are common. When you read memory from an address, the bus reads the surrounding bytes and returns them too, and they get thrown away if they are unused.

So, say I fetch a byte at address 0x02 over a 32-bit (4-byte) bus. The bus will return the memory at addresses 0x00, 0x01, 0x02 and 0x03 at the same time, and everything but the byte at 0x02 gets thrown away.

Now, if I fetch 4 bytes over a 32-bit bus starting at address 0x02, I'll get two bus transactions, one with "0x00, 0x01, 0x02 and 0x03" and one with "0x04, 0x05, 0x06 and 0x07". Then it'll throw away 0x00, 0x01, 0x06 and 0x07. This is obviously a waste, so if you are trying to read quantities larger than one byte, you generally try to align them to the nearest multiple of the size you are reading; otherwise you waste half your bus bandwidth.

So short answer: Align memory. It's faster that way on Intel machines.

On Sunday 14 May 2006 3:01 pm, Perry Lorier wrote:
Glenn Enright wrote:
Can anyone help with the relation between the assembler command 'movsl' and alignment? All I've been able to find so far is fairly general stuff in kernel patches about bulk memory moves.
movsl is an assembler command for "move string (long)". It copies 4 bytes (the 64-bit variant, movsq, copies 8) from DS:ESI to ES:EDI, and increments ESI and EDI by the number of bytes copied.
It's usually used with the "REP" prefix, which repeats the instruction, decrementing ECX each iteration, until ECX is zero. So you can say "REP MOVSL" to copy lots of memory in one opcode. Ahh, the wonderful bounties of a CISC world.
Now, Intel machines don't really care too much about alignment: you can fetch a byte at any address without issue. However, the various busses inside your PC usually don't work in a byte-by-byte manner, but in multiples of a byte; 4 to 16 bytes are common. When you read memory from an address, the bus reads the surrounding bytes and returns them too, and they get thrown away if they are unused.
So, say I fetch a byte at address 0x02 over a 32-bit (4-byte) bus. The bus will return the memory at addresses 0x00, 0x01, 0x02 and 0x03 at the same time, and everything but the byte at 0x02 gets thrown away.
Now, if I fetch 4 bytes over a 32-bit bus starting at address 0x02, I'll get two bus transactions, one with "0x00, 0x01, 0x02 and 0x03" and one with "0x04, 0x05, 0x06 and 0x07". Then it'll throw away 0x00, 0x01, 0x06 and 0x07. This is obviously a waste, so if you are trying to read quantities larger than one byte, you generally try to align them to the nearest multiple of the size you are reading; otherwise you waste half your bus bandwidth.
So short answer:
Align memory. It's faster that way on Intel machines.
Good post! TY, that helps a lot :) Just to iron out some thoughts: how useful is 8-byte alignment with modern CPUs, in this case a P4? For context, in Linux kernel code the movsl alignment mask is defined as 8 bytes (arch/i386/cpu/intel.c). Without actually performing some low-level benchmarks, is it likely that larger or smaller alignment values might generate better overhead conditions for the CPU? Special cases perhaps, such as audio.

So short answer:
Align memory. It's faster that way on Intel machines.
Good post! TY that helps a lot :)
Just to iron out some thoughts: how useful is 8-byte alignment with modern CPUs, in this case a P4? For context, in Linux kernel code the movsl alignment mask is defined as 8 bytes (arch/i386/cpu/intel.c).
Right, after reading the source I think I've figured that out. There is an optimised Intel version of the copy code which unrolls a short part of the copy (the first 64 bytes, I think) before using rep movsl for the rest. If it doesn't meet the alignment/minimum-length requirements, it uses rep movsl without the unrolled copy first. I'm not sure why it needs to be aligned to 8 bytes; from my understanding a 4-byte alignment should be sufficient. It could be that on a 64-bit machine it's copying 8 bytes, not 4.
Without actually performing some low-level benchmarks, is it likely that larger or smaller alignment values might generate better overhead conditions for the CPU? Special cases perhaps, such as audio.
Generally larger alignment is faster. Also, as you start getting closer to the hardware you start getting stricter requirements for alignment. Early DMA controllers, for instance, had to be "paragraph" aligned (16 bytes, IIRC). I have no idea if this is still true today. The downside of large alignment is that it uses more memory, and if you end up having to touch that alignment padding then you end up wasting resources. The trick is trading off this wasted space vs. the speedup you get from the alignment. What exactly are you trying to discover here? Are you trying to figure out how to deglitch some audio?

On Monday 15 May 2006 9:58 am, Perry Lorier wrote:
Generally larger alignment is faster. Also, as you start getting closer to the hardware you start getting stricter requirements for alignment. Early DMA controllers, for instance, had to be "paragraph" aligned (16 bytes, IIRC). I have no idea if this is still true today.
The downside of large alignment is that it uses more memory, and if you end up having to touch that alignment padding then you end up wasting resources. The trick is trading off this wasted space vs. the speedup you get from the alignment.
Right, that pretty much matches what I was thinking. I will look at the DMA stuff next. I should probably start doing some profiling to see what the objective results are, rather than subjective ;-p. Might look into the -ck series kernels to see differences. My understanding is that P4 chips have quite bad latency for some common I/O instructions, which is where AMD makes ground. For example, more here... http://answers.google.com/answers/threadview?id=321522 I traced the movsl mask back to L1_CACHE_BYTES, which makes sense.
What exactly are you trying to discover here? Are you trying to figure out how to deglitch some audio?
Not really, just using that as an example (although the ac97 drivers do still have a bad I/O-related bug). I'm hacking around in the i386 arch trying to learn a bit more about how I/O is handled and increasing my knowledge of system programming at the same time. Been attempting to absorb the Intel docs on this. Kind of a hobby-type thing. So far, recent test builds produced about a 5% decrease in core code size (subtle bugs aside) using gcc 3.4.6, just by manually optimizing kernel code for the P4 2.6 (stepping 9) which I'm running. Building with '-march=pentium4' has worked nicely so far. Also, the MB (Abit IS7) appears to have really good I/O subsystems, which fascinates me :). Learning what kernel devs do when things break has been fun. I realise that newer 64-bit offerings do many things differently, but this is what I have to play with for now. -- Whether you can hear it or not, The Universe is laughing behind your back. -- National Lampoon, "Deteriorata"

Glenn Enright wrote:
On Monday 15 May 2006 9:58 am, Perry Lorier wrote:
Generally larger alignment is faster. Also, as you start getting closer to the hardware you start getting stricter requirements for alignment. Early DMA controllers, for instance, had to be "paragraph" aligned (16 bytes, IIRC). I have no idea if this is still true today.
The downside of large alignment is that it uses more memory, and if you end up having to touch that alignment padding then you end up wasting resources. The trick is trading off this wasted space vs. the speedup you get from the alignment.
Right, that pretty much matches what I was thinking. I will look at the DMA stuff next. I should probably start doing some profiling to see what the objective results are, rather than subjective ;-p. Might look into the -ck series kernels to see differences.
Unfortunately most of my assembly language coding was done in the dark ages of the 16-bit DOS days (segmentation, 640K and TSRs, oh my!), and I've really not bothered about what's happening under the hood since then. Goodness knows how APIC/IOMMU/ACPI work[1] :)
My understanding is that P4 chips have quite bad latency for some common io instructions, which is where AMD makes ground. For example more here... http://answers.google.com/answers/threadview?id=321522
Intel chips have always been slow at various things; they've just done better with clock cycles. A memorable example of this was the LOOP opcode on Pentium chips. It was very slow on Intel machines, so programs used it as a timing delay. On AMD chips it was extremely fast (1 cycle, IIRC), so those timing loops effectively became no-ops, causing lots of programs to fail in amusing and/or spectacular fashion.
I traced the movsl mask back to L1_CACHE_BYTES, which makes sense.
What exactly are you trying to discover here? Are you trying to figure out how to deglitch some audio?
Not really, just using that as an example (although the ac97 drivers do still have a bad I/O-related bug).
only one? Miracle!
I'm hacking around in the i386 arch trying to learn a bit more about how I/O is handled and increasing my knowledge of system programming at the same time. Been attempting to absorb the Intel docs on this. Kind of a hobby-type thing.
Ah, I remember the good ol' days of doing this myself :) Although documentation wasn't quite as free flowing as it is these days.
So far, recent test builds produced about a 5% decrease in core code size (subtle bugs aside) using gcc 3.4.6, just by manually optimizing kernel code for the P4 2.6 (stepping 9) which I'm running. Building with '-march=pentium4' has worked nicely so far.
Nice. I guess an important lesson here is that when compiling a kernel, compile it for your CPU; it'll be spiffier!
Also, the MB (Abit IS7) appears to have really good I/O subsystems, which fascinates me :). Learning what kernel devs do when things break has been fun. I realise that newer 64-bit offerings do many things differently, but this is what I have to play with for now.
Yeah, I've not looked that closely at the 64-bit stuff, other than the more and bigger registers. --- [1] I realise that implicit in this statement is the assumption that they do *actually* work.

On Monday 15 May 2006 4:21 pm, Perry Lorier wrote:
Unfortunately most of my assembly language coding was done in the dark ages of the 16-bit DOS days (segmentation, 640K and TSRs, oh my!), and I've really not bothered about what's happening under the hood since then.
Goodness knows how APIC/IOMMU/ACPI work[1] :)
[1] I realise that implicit in this statement is the assumption that they do *actually* work.
Yes indeed. Thank you so much for your insight though; you have saved me a lot of time researching I don't know what.
Intel chips have always been slow at various things; they've just done better with clock cycles. A memorable example of this was the LOOP opcode on Pentium chips. It was very slow on Intel machines, so programs used it as a timing delay. On AMD chips it was extremely fast (1 cycle, IIRC), so those timing loops effectively became no-ops, causing lots of programs to fail in amusing and/or spectacular fashion.
lol, there's a good reason for not using things for purposes other than intended sometimes :) Would've loved to see the expressions on the faces of the guys at Intel as they came across stuff like that. Still, every choice has its tradeoffs. -- Your mouse has moved. Windows NT must be restarted for the change to take effect. Reboot now? [ OK ] -- From a Slashdot.org post

I missed the beginning of this thread so my comments may be off topic, but as a student games developer the subject of assembler comes up occasionally. My advice is to avoid it like the plague, but still be aware of your architecture's quirks and traits. E.g., a PlayStation 2 cache miss costs you 50 CPU cycles. Lesson: don't scatter your data all over RAM. :)

Word alignment is important in languages like C too. E.g.,

struct stuff {
    char x;
    int y;
    char z;
};

will probably use more memory on a 32-bit chip than:

struct smallerstuff {
    char x;
    char z;
    int y;
};

This is due to compilers word-aligning data by default. The top struct turns into:

struct stuff {
    char x;
    // three bytes of padding
    int y;
    char z;
    // three bytes of padding
};

The smallerstuff struct, however, should only have 2 bytes of padding after the 2nd char. You can force most compilers to pack the bits of a struct into the smallest possible size, but that causes big slowdowns.

Regards
Intel chips have always been slow at various things; they've just done better with clock cycles. A memorable example of this was the LOOP opcode on Pentium chips. It was very slow on Intel machines, so programs used it as a timing delay. On AMD chips it was extremely fast (1 cycle, IIRC), so those timing loops effectively became no-ops, causing lots of programs to fail in amusing and/or spectacular fashion.
lol, there's a good reason for not using things for purposes other than intended sometimes :) Would've loved to see the expressions on the faces of the guys at Intel as they came across stuff like that. Still, every choice has its tradeoffs.

On Wednesday 17 May 2006 1:45 am, Oliver Jones wrote:
I missed the beginning of this thread so my comments may be off topic
Hey the more info the better :)
but as a student games developer the subject of Assembler comes up occasionally. My advice is to avoid it like the plague but still be aware of your architecture's quirks and traits. Eg, a PlayStation 2 cache miss costs you 50 cpu cycles. Lesson, don't scatter your data all over RAM. :)
Holy cow!
Word alignment is important in languages like C too. Eg,
Yeah, whoever says the way the code reads doesn't make a difference to the final program has probably never done system-level stuff ;). Compilers are good, but they're not psychic. In the past I've been keen to make the code more readable, but examples like that probably crop up too often to ignore. That also partly explains why kernel code is somewhat sparse sometimes. Thanks for your insight. -- It seems like once people grow up, they have no idea what's cool. -- Calvin

* Glenn Enright <elinar(a)ihug.co.nz> [2006-05-17 00:55]:
On Wednesday 17 May 2006 1:45 am, Oliver Jones wrote:
Eg, a PlayStation 2 cache miss costs you 50 cpu cycles. Lesson, don't scatter your data all over RAM. :)
Holy cow!
Yeah, memory has not kept up with CPUs in terms of clock speed. Before the 386, there were no caches in PC-compatibles. With the 386, on-chipset caches were first introduced by some manufacturers, and they made a noticeable difference. The 486 was the first CPU with an on-chip cache, so the first chipsets didn't have a cache of their own. When L2 caches were introduced, their contribution to performance was controversial. With the Pentium, however, there was no question that the gap between L1 cache speed and main memory had gotten so large that an L2 cache was an undeniable boost.

I don't know how much fact this was based on, but there has always been a claim that the fastest Pentium 1 ran barely any faster than a 286 if you disabled all caches. Considering how much caches have grown in speed and size, though, it is quite believable that most of the computation speed of modern chips is mainly due to cache performance.

Some points of reference: the 386 chipset caches were on the order of 32KB; the 486's L1 cache was just 8KB but clocked at full CPU speed; the first L2 caches were on the order of 128KB and mainly got their advantage from bandwidth, not clock speed, if memory serves; nowadays L1 caches are on the order of 128KB and L2 caches are megabytes in size, with the L1 cache running at full clock speed.

So yeah, locality is incredibly important.

Regards, -- Aristotle Pagaltzis // <http://plasmasturm.org/>

* Daniel Lawson <daniel(a)meta.net.nz> [2006-05-17 04:50]:
The Playstation 3 will use RAMBUS ram, clocked at... core speed of 3.2GHz. Oooh yeah.
No wonder it costs so much.
Memory used to be extremely expensive up until roughly the time when 32MB became the baseline, which is when it started noticeably falling behind CPUs in terms of speed. Seems like a reversion to those days? Regards, -- Aristotle Pagaltzis // <http://plasmasturm.org/>

The PS3 is the shiz. But yes, expensive. http://www.gametrailers.com/player.php?id=10401&type=mov&pl=game Regards A. Pagaltzis wrote:
* Daniel Lawson <daniel(a)meta.net.nz> [2006-05-17 04:50]:
The Playstation 3 will use RAMBUS ram, clocked at... core speed of 3.2GHz. Oooh yeah.
No wonder it costs so much.
Memory used to be extremely expensive up until roughly the time when 32MB became the baseline, which is when it started noticeably falling behind CPUs in terms of speed. Seems like a reversion to those days?
Regards,

In the PlayStation 2's case, however, the massive cost is probably more to do with a shitty MMU than with RAM/bus speed. The PS2 has Rambus-designed RAM and huge bandwidth (a 1024-bit bus and such). The reason it needs this is that it has almost no texture RAM (2MB), so you're moving huge chunks of memory around all the time.
Yeah, memory has not kept up with CPUs in terms of clock speed. Before the 386, there were no caches in PC-compatibles. With the 386, on-chipset caches were first introduced by some manufacturers, and they made a noticeable difference. The 486 was the first CPU with an on-chip cache, so the first chipsets didn't have a cache of their own. When L2 caches were introduced, their contribution to performance was controversial. With the Pentium, however, there was no question that the gap between L1 cache speed and main memory had gotten so large that an L2 cache was an undeniable boost.
Regards

* A. Pagaltzis <pagaltzis(a)gmx.de> [2006-05-17 04:45]:
The 486 was the first CPU with an on-chip cache;
Err, the first Intel CPU. I doubt the 486 pioneered this ground. Regards, -- Aristotle Pagaltzis // <http://plasmasturm.org/>

However, you have to consider this trait of crappy programmers... *CPU cycles are a precious commodity, and your programming style and language reflect that belief.* Always remember you're writing code for other people to read, not the compiler. Even if that "other person" is just you in 3 months' time. Be aware of your target architecture and its quirks, but don't be a slave to it. Regards
Word alignment is important in languages like C too. Eg,
Yeah, whoever says the way the code reads doesn't make a difference to the final program has probably never done system-level stuff ;). Compilers are good, but they're not psychic. In the past I've been keen to make the code more readable, but examples like that probably crop up too often to ignore. That also partly explains why kernel code is somewhat sparse sometimes. Thanks for your insight.
participants (5)
- A. Pagaltzis
- Daniel Lawson
- Glenn Enright
- Oliver Jones
- Perry Lorier