I missed the beginning of this thread so my comments may be off topic, but as a student games developer the subject of assembler comes up occasionally.  My advice is to avoid it like the plague but still be aware of your architecture's quirks and traits.  E.g., a PlayStation 2 cache miss costs you 50 CPU cycles.  Lesson: don't scatter your data all over RAM. :)

Word alignment is important in languages like C too.  E.g.,

struct stuff {
    char x;
    int y;
    char z;
};

will probably use more memory on a 32-bit chip than:

struct smallerstuff {
    char x;
    char z;
    int y;
};

This is because compilers word-align data by default.  The top struct turns into:

struct stuff {
    char x;
    // three bytes of padding
    int y;
    char z;
    // three bytes of padding
};

The smallerstuff struct, however, should only have 2 bytes of padding after the second char.  You can force most compilers to pack a struct into the smallest possible size, but accessing the resulting unaligned members can be a lot slower.
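
If you want to see what your own compiler actually does with the structs above, here is a rough sketch that reports sizes and offsets.  The names packedstuff, MEMBER_BYTES, stuff_padding, smallerstuff_padding and print_layouts are made up for this example, and `#pragma pack` is a compiler extension (GCC, Clang and MSVC accept it) rather than standard C.  The exact numbers are implementation-defined, so treat it as something to experiment with rather than a guarantee.

```c
#include <assert.h>  /* assert */
#include <stddef.h>  /* offsetof, size_t */
#include <stdio.h>   /* printf */

struct stuff        { char x; int y; char z; };
struct smallerstuff { char x; char z; int y; };

/* Compiler-specific: pack(1) asks for no padding between members. */
#pragma pack(push, 1)
struct packedstuff  { char x; int y; char z; };
#pragma pack(pop)

/* Bytes taken by the members themselves, ignoring any padding. */
#define MEMBER_BYTES (2 * sizeof(char) + sizeof(int))

/* Padding = total struct size minus the bytes the members need. */
static size_t stuff_padding(void)
{
    return sizeof(struct stuff) - MEMBER_BYTES;
}

static size_t smallerstuff_padding(void)
{
    return sizeof(struct smallerstuff) - MEMBER_BYTES;
}

/* Print size, padding and member offsets for each layout. */
static void print_layouts(void)
{
    /* In an unpacked struct, an int sits at a multiple of int's alignment. */
    assert(offsetof(struct stuff, y) % _Alignof(int) == 0);

    printf("stuff:        size=%zu padding=%zu  x@%zu y@%zu z@%zu\n",
           sizeof(struct stuff), stuff_padding(),
           offsetof(struct stuff, x), offsetof(struct stuff, y),
           offsetof(struct stuff, z));
    printf("smallerstuff: size=%zu padding=%zu  x@%zu z@%zu y@%zu\n",
           sizeof(struct smallerstuff), smallerstuff_padding(),
           offsetof(struct smallerstuff, x), offsetof(struct smallerstuff, z),
           offsetof(struct smallerstuff, y));
    printf("packedstuff:  size=%zu\n", sizeof(struct packedstuff));
}
```

Call print_layouts() from your main().  On a typical compiler with a 4-byte, 4-byte-aligned int I'd expect sizes of 12, 8 and 6, but check on your own target.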

Regards
Intel chips have always been slow at various things; they've always just
done better with clock cycles.  A memorable example of this was the LOOP
opcode on Pentium chips.  It was very slow on Intel machines, so
programs used it as a timing delay.  On AMD chips it was extremely fast
(1 cycle IIRC), so those timing loops effectively became no-ops, causing
lots of programs to fail in amusing and/or spectacular fashion.
    

lol there's good reason for not using things for purposes other than intended 
sometimes :) Would've loved to see the expressions on the faces of the guys at 
Intel as they came across stuff like that. Still, every choice has its 
tradeoffs.