Latency Numbers Every Programmer Should Know

Somebody did a nice chart a while back <https://gist.github.com/hellerbarde/2843375> of the time it takes to perform many common data transfer operations on contemporary hardware. And note the multi-part figure, where the basic unit in each block is a bit more than an order of magnitude greater than the one in the previous block. And if the figure isn’t dramatic enough, a commenter suggests multiplying all the numbers by a billion (just to bring them to more human-comprehensible scales) and then likening them to common things that people do, from making a cup of coffee all the way up to completing a bachelor’s degree.

On Tue, Nov 16, 2021 at 11:48:09AM +1300, Lawrence D'Oliveiro wrote:
Somebody did a nice chart a while back <https://gist.github.com/hellerbarde/2843375> of the time it takes to perform many common data transfer operations on contemporary hardware. And note the multi-part figure, where the basic unit in each block is a bit more than an order of magnitude greater than the one in the previous block.
It's that 100ns for a main memory reference (i.e. a random access to memory that is not in cache) compared to the L1 (or even L2) cache reference of 0.5ns (or 7ns) that is really significant to one of the most important algorithms to modern computing, namely the discrete Fourier transform (and its variants such as the FFT and the DCT). Program it naively and you will be hitting that 100ns access time for just about every memory access it makes on a reasonably sized transform (particularly one bigger than the L1 cache; remember that for single-precision complex we are talking 8 bytes per point) because your memory access pattern aliases to the same cache line repeatedly! Program it carefully, reorganising the order of calculations, and it will run very much faster. Cheers, Michael.
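
As a concrete illustration of that effect, here is a minimal C sketch (not from the thread; the 64 MiB array and the stride of 4096 floats are arbitrary illustrative choices): it reads the same array once sequentially and once with a large power-of-two stride, so the second pass touches a new cache line on essentially every access.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)     /* 16M floats = 64 MiB, much larger than any cache */
#define STRIDE 4096     /* power-of-two stride, in elements (16 KiB) */

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    float *a = malloc((size_t)N * sizeof *a);
    if (!a) return 1;
    for (size_t i = 0; i < N; i++) a[i] = 1.0f;

    volatile float sum = 0.0f;

    double t0 = seconds();
    for (size_t i = 0; i < N; i++)              /* sequential: streams nicely */
        sum += a[i];

    double t1 = seconds();
    for (size_t s = 0; s < STRIDE; s++)         /* strided: same loads, poor locality */
        for (size_t i = s; i < N; i += STRIDE)
            sum += a[i];

    double t2 = seconds();
    printf("sequential: %.3f s   strided: %.3f s\n", t1 - t0, t2 - t1);
    free(a);
    return 0;
}

On typical current hardware the strided pass takes several times longer than the sequential one, even though both perform exactly the same 16 million additions.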

On Tue, 16 Nov 2021 12:28:01 +1300, Michael Cree wrote:
It's that 100ns for a main memory reference (i.e. a random access to memory that is not in cache) compared to the L1 (or even L2) cache reference of 0.5ns (or 7ns) that is really significant to one of the most important algorithms to modern computing, namely the discrete Fourier transform (and its variants such as the FFT and the DCT).
I suspect (but haven’t checked) that it also renders old techniques like binary search of an ordered array worthless, or at least of very limited use -- probably faster (and easier) to do a sequential search of an unordered array in many cases. Also garbage-collected languages like Java are likely to be impacted. The way I like to put it is “memory is cheap, but accessing memory is expensive”. The first part has been true since about the 1980s, but the ever-greater emphasis on caches for performance on current architectures, particularly in this century, has brought the second part into sharper relief.

On Tue, Nov 16, 2021 at 12:50:40PM +1300, Lawrence D'Oliveiro wrote:
On Tue, 16 Nov 2021 12:28:01 +1300, Michael Cree wrote:
It's that 100ns for a main memory reference (i.e. a random access to memory that is not in cache) compared to the L1 (or even L2) cache reference of 0.5ns (or 7ns) that is really significant to one of the most important algorithms to modern computing, namely the discrete Fourier transform (and its variants such as the FFT and the DCT).
I suspect (but haven’t checked) that it also renders old techniques like binary search of an ordered array worthless, or at least of very limited use -- probably faster (and easier) to do a sequential search of an unordered array in many cases.
Excepting that binary search turns a sequential O(n) search into an O(log n) search, and the memory access timings don't change that fact, so for a sufficiently large array a binary search will always outperform a linear search even if access time is much worse for individual accesses in the binary search.
The way I like to put it is “memory is cheap, but accessing memory is expensive”.
It is better to consider that modern "random access memory" (or RAM) is grossly misnamed. It should be called "streaming memory". It is extremely efficient to stream sequential memory locations into the CPU but it is very inefficient (100s of CPU clock cycles) to do a single random access to memory. Cheers, Michael.
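
To make the two approaches concrete, here is a minimal sketch in C (my own code, not from the thread): a linear scan of an unsorted array versus a textbook binary search of a sorted copy of it. The million-element size echoes the example later in the thread; which one is actually faster depends on exactly the access costs being discussed here.

#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

static long linear_search(const int *a, long n, int key)
{
    for (long i = 0; i < n; i++)        /* O(n), but perfectly sequential */
        if (a[i] == key) return i;
    return -1;
}

static long binary_search(const int *a, long n, int key)
{
    long lo = 0, hi = n - 1;
    while (lo <= hi) {                  /* O(log n), but each probe jumps around */
        long mid = lo + (hi - lo) / 2;
        if (a[mid] == key) return mid;
        if (a[mid] < key) lo = mid + 1; else hi = mid - 1;
    }
    return -1;
}

int main(void)
{
    const long n = 1000000;
    int *a = malloc(n * sizeof *a);
    if (!a) return 1;
    for (long i = 0; i < n; i++) a[i] = rand();

    int key = a[n / 2];                 /* pick a key we know is present */
    printf("linear (unsorted): index %ld\n", linear_search(a, n, key));

    qsort(a, n, sizeof *a, cmp_int);    /* the one-off sorting cost */
    printf("binary (sorted):   index %ld\n", binary_search(a, n, key));

    free(a);
    return 0;
}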

On Tue, 16 Nov 2021 14:09:07 +1300, Michael Cree wrote:
Excepting that binary search turns a sequential O(n) search into an O(log n) search, and the memory access timings don't change that fact, so for a sufficiently large array a binary search will always outperform a linear search even if access time is much worse for individual accesses in the binary search.
I’m thinking that successive elements in the sequential search are more likely to be in the same cache line, instead of bouncing around between different cache lines as in the binary-search situation. With access times differing by a couple of orders of magnitude, that could make quite a difference.
It is better to consider that modern "random access memory" (or RAM) is grossly misnamed. It should be called "streaming memory". It is extremely efficient to stream sequential memory locations into the CPU but it is very inefficient (100s of CPU clock cycles) to do a single random access to memory.
And sequential search seems optimized for precisely such streaming behaviour.
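
To put rough numbers on that locality argument, here is a back-of-the-envelope sketch in C (my own assumptions, not from the thread: 64-byte cache lines, 4-byte keys, a million-element array):

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Illustrative assumptions: 64-byte cache lines, 4-byte keys. */
    const double line_bytes = 64, key_bytes = 4, n = 1e6;
    double keys_per_line = line_bytes / key_bytes;      /* 16 keys per line */
    double linear_lines  = (n / 2) / keys_per_line;     /* average-case scan, streamed */
    double binary_lines  = log2(n);                     /* ~20 probes, scattered */

    printf("linear scan:   ~%.0f lines fetched, %.0f comparisons per line\n",
           linear_lines, keys_per_line);
    printf("binary search: ~%.0f lines fetched, ~1 comparison per line\n",
           binary_lines);
    return 0;
}

The binary search touches far fewer lines, but each probe lands on a different, hard-to-predict line, whereas the linear scan's lines arrive in order where the hardware prefetcher can stream them in.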

On Tue, Nov 16, 2021 at 02:46:35PM +1300, Lawrence D'Oliveiro wrote:
On Tue, 16 Nov 2021 14:09:07 +1300, Michael Cree wrote:
Excepting that binary search turns a sequential O(n) search into an O(log n) search, and the memory access timings don't change that fact, so for a sufficiently large array a binary search will always outperform a linear search even if access time is much worse for individual accesses in the binary search.
I’m thinking that successive elements in the sequential search are more likely to be in the same cache line, instead of bouncing around between different cache lines as in the binary-search situation. With access times differing by a couple of orders of magnitude, that could make quite a difference.
Yes, but not in a large array. Say you have an unsorted array of 1000000 elements. A sequential search takes on average 500000 comparisons. Assuming a memory access for each one at 0.5ns gives 250µs total. Now let's say you have a sorted array of 1000000 elements. The binary search needs log(1000000) accesses to find the result (on average); taking a base-2 logarithm, as it is a binary search, that is 20 memory accesses at 100ns per random memory access, i.e., a total of 2000ns (or 2µs). Binary search will always win out once the array is large enough. (Of course the array needs to be sorted, but that only has to be done once, and the ordering can easily be maintained as new elements are added, because a binary search will find the insertion point efficiently if the array is large.)

The reason why the FFT is especially bad for memory access is that the naive algorithm steps through the array in 2^n-sized steps (for various whole-number values of n). If the size of a word is w (for single-precision complex, w is 8 bytes) then the step size in bytes is w·2^n. When w·2^n reaches the size of the cache, all hell breaks loose, because every access aliases to the same cache line. Thus potentially every memory access is a cache miss, forcing an access to main memory that incurs the longest memory access times. This is why the conjugate-pair algorithm is faster than the standard radix-2 algorithm even though it involves the same number of arithmetical computations: it reorganises those computations so that a subsequent computation is more likely to have its operands still in cache. Cheers, Michael.
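
The arithmetic in that comparison can be checked with a small back-of-the-envelope program, a sketch in C using the same assumed costs (0.5ns per sequential access, 100ns per random main-memory access) and treating every binary-search probe as a full random access:

#include <math.h>
#include <stdio.h>

int main(void)
{
    const double t_seq = 0.5e-9;    /* sequential (streamed/cached) access */
    const double t_rand = 100e-9;   /* random access to main memory */

    /* The million-element example from the message above. */
    double n = 1e6;
    printf("n = %.0f: linear ~ %.0f us, binary ~ %.0f us\n",
           n, (n / 2) * t_seq * 1e6, log2(n) * t_rand * 1e6);

    /* Find (by doubling) roughly where binary search starts to win. */
    for (n = 2; (n / 2) * t_seq <= log2(n) * t_rand; n *= 2)
        ;
    printf("binary search wins by roughly n = %.0f\n", n);
    return 0;
}

Under these assumptions the linear scan only stays ahead up to arrays of a few thousand elements, consistent with the point that binary search wins once the array is large enough.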