
On Tue, Nov 16, 2021 at 02:46:35PM +1300, Lawrence D'Oliveiro wrote:
> On Tue, 16 Nov 2021 14:09:07 +1300, Michael Cree wrote:
>> Excepting that binary search turns a sequential O(n) search into an O(log n) search, and the memory access timings don't change that fact, so for a sufficiently large array a binary search will always outperform a linear search, even if the time for each individual access is much worse in the binary search.
> I’m thinking that successive elements in the sequential search are more likely to be in the same cache line, instead of bouncing around between different cache lines in the binary-search situation. With access times differing by a couple of orders of magnitude, that could make quite a difference.
Yes, but not in a large array. Say you have an unsorted array of 1000000 elements. A sequential search takes on average 500000 comparisons; assuming a memory access for each one at 0.5ns gives 250µs total. Now say you have a sorted array of 1000000 elements. The binary search needs log(1000000) accesses to find the result (on average); taking the base-2 logarithm, as it is a binary search, that is about 20 memory accesses at 100ns per random memory access, i.e., a total of 2000ns (or 2µs). Binary search will always win out once the array is large enough. (Of course the array needs to be sorted, but that only has to be done once, and the ordering can easily be maintained as new elements are added, because a binary search will find the insertion point efficiently even if the array is large.)

The reason why the FFT is especially bad for memory access is that the naive algorithm steps through the array in strides of 2^n elements (for various whole-number values of n). If the size of a word is w (for single-precision complex data w is 8 bytes) then the stride in bytes is w·2^n. When w·2^n reaches the size of the cache all hell breaks loose, because every access maps to the same cache set and the lines evict one another. Potentially every memory access is then a cache miss, forcing an access to main memory and incurring the longest memory access times.

This is why the conjugate-pair algorithm is faster than the standard radix-2 algorithm even though it involves the same number of arithmetic operations: it reorganises those computations so that a subsequent computation is more likely to find its operands still in cache.
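
To put rough numbers on the linear-versus-binary comparison, here is a quick sketch in plain C (the array size, lookup count and helper names are just mine for illustration, not taken from any particular benchmark):

/*
 * Rough comparison of linear vs binary search over a sorted array.
 * Build (hypothetical): cc -O2 search_compare.c -o search_compare
 */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N        1000000L   /* one million elements, as in the example above */
#define LOOKUPS  1000       /* number of searches to average over */

/* Linear scan: O(n) comparisons, but walks memory sequentially. */
static long linear_search(const long *a, long n, long key)
{
    for (long i = 0; i < n; i++)
        if (a[i] == key)
            return i;
    return -1;
}

/* Binary search: O(log n) comparisons, each probe a near-random access. */
static long binary_search(const long *a, long n, long key)
{
    long lo = 0, hi = n - 1;
    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        if (a[mid] < key)      lo = mid + 1;
        else if (a[mid] > key) hi = mid - 1;
        else                   return mid;
    }
    return -1;
}

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    long *a = malloc(N * sizeof *a);
    if (a == NULL) return 1;
    for (long i = 0; i < N; i++)
        a[i] = 2 * i;                    /* sorted: every other integer */

    long sink = 0;
    double t0 = seconds();
    for (long k = 0; k < LOOKUPS; k++)
        sink += linear_search(a, N, 2 * ((k * 997) % N));
    double t1 = seconds();
    for (long k = 0; k < LOOKUPS; k++)
        sink += binary_search(a, N, 2 * ((k * 997) % N));
    double t2 = seconds();

    printf("linear: %8.2f us/lookup   binary: %8.2f us/lookup   (sink %ld)\n",
           (t1 - t0) * 1e6 / LOOKUPS, (t2 - t1) * 1e6 / LOOKUPS, sink);
    free(a);
    return 0;
}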
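
And to illustrate the stride problem, a similarly rough sketch (again the array size, strides and file name are mine, and it only shows the general cost of strided access, not an actual FFT): every pass reads exactly the same 16M doubles, but as the power-of-two stride grows, fewer of those reads hit the cache and the pass gets dramatically slower.

/*
 * Rough illustration of why large power-of-two strides hurt: each pass
 * reads the same N elements, only the visiting order changes.
 * Build (hypothetical): cc -O2 stride_demo.c -o stride_demo
 */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N ((size_t)1 << 24)   /* 16M doubles = 128 MB, bigger than any cache */

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Sum all n elements, visiting them with the given stride and wrapping
 * around, so the arithmetic work is identical for every stride. */
static double strided_sum(const double *a, size_t n, size_t stride)
{
    double s = 0.0;
    for (size_t start = 0; start < stride; start++)
        for (size_t i = start; i < n; i += stride)
            s += a[i];
    return s;
}

int main(void)
{
    double *a = malloc(N * sizeof *a);
    if (a == NULL) return 1;
    for (size_t i = 0; i < N; i++)
        a[i] = 1.0;

    size_t strides[] = { 1, (size_t)1 << 10, (size_t)1 << 16, (size_t)1 << 20 };
    for (size_t k = 0; k < sizeof strides / sizeof strides[0]; k++) {
        double t0 = seconds();
        double s  = strided_sum(a, N, strides[k]);
        double t1 = seconds();
        printf("stride %8zu doubles: %6.3f s  (sum %.0f)\n",
               strides[k], t1 - t0, s);
    }
    free(a);
    return 0;
}

Cheers,
Michael.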