
On Tue, Nov 16, 2021 at 02:46:35PM +1300, Lawrence D'Oliveiro wrote:
> On Tue, 16 Nov 2021 14:09:07 +1300, Michael Cree wrote:
>> Excepting that binary search turns a sequential O(n) search into an O(log n) search, and the memory access timings don't change that fact, so for a sufficiently large array a binary search will always outperform a linear search, even if the time for each individual access is much worse in the binary search.
> I’m thinking that successive elements in the sequential search are more likely to be in the same cache line, instead of bouncing around between different cache lines in the binary-search situation. With access times differing by a couple of orders of magnitude, that could make quite a difference.
Yes, but not in a large array. Say you have an unsorted array of 1000000 elements. A sequential search takes on average 500000 comparisons; assuming a memory access for each one at 0.5ns gives 250µs total. Now say you have a sorted array of 1000000 elements. The binary search needs log(1000000) accesses to find the result (on average); taking the base-2 logarithm, as it is a binary search, that is about 20 memory accesses at 100ns per random memory access, i.e., a total of 2000ns (or 2µs). Binary search will always win out once the array is large enough. (Of course the array needs to be sorted, but that only has to be done once, and the ordering can easily be maintained as new elements are added, because a binary search will find the insertion point efficiently even if the array is large.)

The reason why the FFT is especially bad for memory access is that the naive algorithm steps through the array in strides of 2^n elements (for various whole-number values of n). If the size of a word is w (for single-precision complex data w is 8 bytes) then the stride in bytes is w·2^n. When w·2^n reaches the size of the cache all hell breaks loose, because every access maps to the same cache set and the lines evict one another. Potentially every memory access is then a cache miss, forcing an access to main memory and incurring the longest memory access times.

This is why the conjugate-pair algorithm is faster than the standard radix-2 algorithm even though it involves the same number of arithmetic operations: it reorganises those computations so that a subsequent computation is more likely to find its operands still in cache.
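
To put rough numbers on the linear-versus-binary comparison, here is a quick sketch in plain C (the array size, lookup count and helper names are just mine for illustration, not taken from any particular benchmark):

/*
 * Rough comparison of linear vs binary search over a sorted array.
 * Build (hypothetical): cc -O2 search_compare.c -o search_compare
 */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N        1000000L   /* one million elements, as in the example above */
#define LOOKUPS  1000       /* number of searches to average over */

/* Linear scan: O(n) comparisons, but walks memory sequentially. */
static long linear_search(const long *a, long n, long key)
{
    for (long i = 0; i < n; i++)
        if (a[i] == key)
            return i;
    return -1;
}

/* Binary search: O(log n) comparisons, each probe a near-random access. */
static long binary_search(const long *a, long n, long key)
{
    long lo = 0, hi = n - 1;
    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        if (a[mid] < key)      lo = mid + 1;
        else if (a[mid] > key) hi = mid - 1;
        else                   return mid;
    }
    return -1;
}

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    long *a = malloc(N * sizeof *a);
    if (a == NULL) return 1;
    for (long i = 0; i < N; i++)
        a[i] = 2 * i;                    /* sorted: every other integer */

    long sink = 0;
    double t0 = seconds();
    for (long k = 0; k < LOOKUPS; k++)
        sink += linear_search(a, N, 2 * ((k * 997) % N));
    double t1 = seconds();
    for (long k = 0; k < LOOKUPS; k++)
        sink += binary_search(a, N, 2 * ((k * 997) % N));
    double t2 = seconds();

    printf("linear: %8.2f us/lookup   binary: %8.2f us/lookup   (sink %ld)\n",
           (t1 - t0) * 1e6 / LOOKUPS, (t2 - t1) * 1e6 / LOOKUPS, sink);
    free(a);
    return 0;
}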
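
And to illustrate the stride problem, a similarly rough sketch (again the array size, strides and file name are mine, and it only shows the general cost of strided access, not an actual FFT): every pass reads exactly the same 16M doubles, but as the power-of-two stride grows, fewer of those reads hit the cache and the pass gets dramatically slower.

/*
 * Rough illustration of why large power-of-two strides hurt: each pass
 * reads the same N elements, only the visiting order changes.
 * Build (hypothetical): cc -O2 stride_demo.c -o stride_demo
 */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N ((size_t)1 << 24)   /* 16M doubles = 128 MB, bigger than any cache */

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Sum all n elements, visiting them with the given stride and wrapping
 * around, so the arithmetic work is identical for every stride. */
static double strided_sum(const double *a, size_t n, size_t stride)
{
    double s = 0.0;
    for (size_t start = 0; start < stride; start++)
        for (size_t i = start; i < n; i += stride)
            s += a[i];
    return s;
}

int main(void)
{
    double *a = malloc(N * sizeof *a);
    if (a == NULL) return 1;
    for (size_t i = 0; i < N; i++)
        a[i] = 1.0;

    size_t strides[] = { 1, (size_t)1 << 10, (size_t)1 << 16, (size_t)1 << 20 };
    for (size_t k = 0; k < sizeof strides / sizeof strides[0]; k++) {
        double t0 = seconds();
        double s  = strided_sum(a, N, strides[k]);
        double t1 = seconds();
        printf("stride %8zu doubles: %6.3f s  (sum %.0f)\n",
               strides[k], t1 - t0, s);
    }
    free(a);
    return 0;
}

Cheers,
Michael.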