
On Tue, Nov 16, 2021 at 11:48:09AM +1300, Lawrence D'Oliveiro wrote:
Somebody did a nice chart a while back <https://gist.github.com/hellerbarde/2843375> of the time it takes to perform many common data transfer operations on contemporary hardware. And note the multi-part figure, where the basic unit in each block is a bit more than an order of magnitude greater than the one in the previous block.
It's that 100ns for a main memory reference (i.e. a random access to memory that is not in cache) compared to the L1 (or even L2) cache reference of 0.5ns (or 7ns) that is really significant to one of the most important algorithms to modern computing, namely the discrete Fourier transform (and its variants such as the FFT and the DCT). Program it naively and you will be hitting that 100ns access time for just about every memory access it makes on a reasonable sized transform (particularly one that is bigger than L1 cache; remember for single precision complex we are talking 8-bytes per point) because your memory access pattern aliases to the same cache line repeatedly! Program it carefully reorganising the order of calculations and it will run very much faster. Cheers, Michael.