
How B-trees can win is by taking advantage of each node's keys being kept sorted in an array. This allows a binary search to be used.

A B-tree of any branching factor (not only 2) is equivalent to a binary tree.

We can take any B-tree node n:

           n
           |
  ---------+-----------
  |   |    |   |   |  |
  a   b    c   d   e  f
and regard it as an unbalanced tree:

    n
    |
   / \
  a  /\
     b/\
      c/\
       d/\
        e/\
         f  nil
When searching for an item, if we linearly search within every node of the B-tree, like from a to f, it is as if we are traversing the unbalanced tree (except for constant-factor instruction and cache-level efficiencies).

What the unbalanced tree cannot do is binary search through a to f to find which one to descend into; it doesn't have those nodes in a neat array for that.

That binary search makes a B-tree more or less equivalent to a well-balanced binary tree.
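
As a concrete illustration, here is a minimal sketch in C (the names are mine) of that per-node binary search; it picks the child slot to descend into using ~log2(nkeys) comparisons, just like a small balanced BST would:

    /* Binary search over a B-tree node's sorted key array.
       Returns the number of keys strictly less than x, which
       is the index of the child to descend into. */
    static int node_search(const int *keys, int nkeys, int x)
    {
        int lo = 0, hi = nkeys;            /* search keys[lo..hi) */
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;  /* avoids overflow */
            if (keys[mid] < x)
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo;
    }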




Though, just spitballing here - I'd guess that, in practice, a linear search still tends to handily outperform a binary search (and, by extension, a well-balanced binary tree) whenever the keys are stored in a compact array, due to better interaction with the cache and speculative execution.
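
The linear version is just a scan, as in this rough sketch (int keys assumed); there is no hard-to-predict branch, and compilers will typically auto-vectorize the loop:

    /* Branchless linear rank: how many of the n keys are < x. */
    static int linear_rank(const int *keys, int n, int x)
    {
        int r = 0;
        for (int i = 0; i < n; i++)
            r += (keys[i] < x);  /* comparison result added, not branched on */
        return r;
    }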


Definitely depends on the branching factor! There are other layouts like Eytzinger that can provide further improvements over "standard" binary search too [1].

[1] https://en.algorithmica.org/hpc/data-structures/binary-searc...
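
The gist of [1] is to store the sorted data in BFS (heap) order, so each level of the search touches predictable, prefetchable locations. A rough sketch of the search described there (1-indexed array, my variable names):

    /* Eytzinger-layout lower bound: a[1..n] holds the data in BFS
       order, so the children of a[k] are a[2k] and a[2k+1].
       Returns the position (in Eytzinger order) of the smallest
       element >= x, or 0 if there is none. */
    static int eytzinger_search(const int *a, int n, int x)
    {
        int k = 1;
        while (k <= n)
            k = 2 * k + (a[k] < x);   /* branchless descent */
        k >>= __builtin_ffs(~k);      /* cancel the trailing right turns */
        return k;
    }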


Suppose the key comparisons are much more expensive than the array or pointer navigation, as the article does: it says that it is about theory, and it looks at the number of comparisons as the performance measure.

The fact that we can search a B-tree node that has, say, 256 children in 8 steps (log2 256 = 8) means that the node effectively contains the equivalent of a small balanced binary tree.


I remember reading a blog post about this: they compared binary search on integer keys with a highly optimized, unrolled SIMD linear search.

The takeaway was that the break-even point was surprisingly small - just tens of elements - which is crazy considering a CPU can do something like 2x 8-wide AVX2 integer comparisons in a single cycle. Modern CPUs are just that good at executing branchy code.

But with B-trees we can have the best of both worlds: trees with a relatively high branching factor, like 16, whose nodes can still be tested in one go using SIMD.
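
Something like this AVX2 sketch (hypothetical names; assumes 16 sorted int32 keys per node) tests a whole node with two 8-wide compares and no branches:

    #include <immintrin.h>
    #include <stdint.h>

    /* Rank of needle among 16 sorted int32 keys: two vector compares,
       then a popcount over the packed lane masks. */
    static int node_rank16(const int32_t keys[16], int32_t needle)
    {
        __m256i n   = _mm256_set1_epi32(needle);
        __m256i lt0 = _mm256_cmpgt_epi32(n, _mm256_loadu_si256((const __m256i *)&keys[0]));
        __m256i lt1 = _mm256_cmpgt_epi32(n, _mm256_loadu_si256((const __m256i *)&keys[8]));
        /* movemask yields one bit per byte, i.e. 4 bits per 32-bit lane */
        uint64_t m = (uint32_t)_mm256_movemask_epi8(lt0)
                   | ((uint64_t)(uint32_t)_mm256_movemask_epi8(lt1) << 32);
        return (int)(__builtin_popcountll(m) / 4);  /* count of keys < needle */
    }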

Another advantage is cache locality, which is one of the reasons B-trees are favored when implementing filesystems: most of the tree doesn't need to be fetched into memory at all, while the parts that do are contiguous in memory.


It does. The break-even point varies by platform, but linear will win for sufficiently short arrays.


Yeah, locality of reference is way better as you traverse the array than as you follow links in the tree.


I think you might be missing the real performance characteristics of the hardware. The number of comparisons might be the same, but comparisons are only a proxy for the expensive part of lookups: the latency of asking RAM to provide data.

If the whole tree is in CPU cache then the difference is likely noise, but if the tree is large enough for some data to live in RAM, then the cost to copy it into cache will depend not on the size but on the number of queries from the CPU to RAM (unless the size is way out of proportion). RAM latency is generally an order of magnitude worse than CPU cache, and generally an order of magnitude better than fast contemporary storage.

There are many places you can find "Latency numbers every programmer needs to know" https://fullstackengineer.pro/blog/latency-numbers-programme...

But this trend has held for about 30 years and seems fundamental to the economics of building hardware: CPUs will be fastest, cache slower, RAM slower still, and fast storage slower yet.

From your example, if it takes 100ns to get the node containing a~f, then the whole B-tree search for f might take 150ns (1 lookup from RAM + a few CPU instructions), but the strictly binary tree will need 6 lookups from RAM, and we can ignore the CPU time because it is so small. This could take 600ns (or 1300ns if each branch is a node holding only pointers and the data needs to be dereferenced too). A smart cache prefetcher might guess that these were pointers and precache a and b once n was referenced, and might even cache things pointed to by loaded pointers, but even in that highly optimistic scenario it could still take 300ns to load everything.

Consider filesystems on storage, which have had this problem, only worse, for a long time. Tree structures with many children per node beat binary trees in proportion to the latency between the thing processing the tree and the place the tree is stored.

EDIT - And don't get me started on clever use of SIMD instructions. Imagine if a node in a tree had a binary layout matching an AVX register; then the comparisons looking for the searched item could check a whole node in a single instruction! I am curious whether there are any implementations that do this?

EDIT 2 - It does exist and I am late to the show, people have been doing this for like 15 years: https://stackoverflow.com/questions/20616605/using-simd-avx-...
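
For what it's worth, the layout idea sketches out directly: give each node exactly 8 int32 keys aligned to 32 bytes, and one aligned load plus a single vector compare covers the whole node (hypothetical struct and names; assumes AVX2):

    #include <immintrin.h>
    #include <stdint.h>

    /* A node whose sorted keys exactly fill one 256-bit AVX register. */
    struct node8 {
        _Alignas(32) int32_t keys[8];
        struct node8 *child[9];
    };

    /* Child slot to descend into: the number of keys < needle,
       computed from one vector comparison. */
    static int which_child(const struct node8 *nd, int32_t needle)
    {
        __m256i k  = _mm256_load_si256((const __m256i *)nd->keys);
        __m256i lt = _mm256_cmpgt_epi32(_mm256_set1_epi32(needle), k);
        int mask   = _mm256_movemask_ps(_mm256_castsi256_ps(lt));  /* 1 bit per lane */
        return __builtin_popcount((unsigned)mask);
    }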


Like the article, I'm only concerned with comparison counts, trying to explain why B-trees look like balanced BSTs, and why that gets tighter and tighter with increasing numbers of children per node.

It's simply because the nodes themselves contain the equivalent of a balanced binary tree, in the form of a binary-searched array.

In the ultimate case, we set the number of children large enough to contain all the data, so there is only one B-tree node, which is binary searched to find the element. Then we have the same number of comparisons as a well-balanced binary search tree.


Where M is the amount of data per node and N is the total amount of data, each node that you descend skips a proportion (M-1)/M of all the remaining data, as long as N is large enough to require multiple levels of nesting.

So for practical sizes of M, as in a few cache lines, and sufficiently large datasets of size N, a binary tree eliminates 1/2 of the data at each node checked, while a B-tree with M set to 8 skips 7/8ths of the data at each node. Within a node they are both doing log2 work, but when jumping between nodes this hypothetical B-tree needs only log8(N) hops.
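
To put numbers on it: with N = 1,000,000 and M = 8, a binary tree visits about log2(10^6) ≈ 20 nodes, while the B-tree descends through only log8(10^6) ≈ 7 nodes, doing roughly log2(8) = 3 in-node comparisons at each. The total comparison count comes out about the same, but the number of separate memory locations that must be touched drops by about 3x.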

So yeah, if N and M are close then they are similar, but those aren't interesting cases: datasets that small are fast to iterate even if unsorted, and large unwieldy values of M are silly and impractical. Once you have thousands or millions of items in N, the savings from eliminating downtree data from being checked at all become real. Or, put another way, binary trees always average about log2(N) checks, but B-trees do about logM(N) node visits (with cheap in-node checks at each), which should be cheaper in most cases and so close in the edge cases as to not matter.

But again, I maintain that counting comparisons is silly. On any modern hardware (the past 20 years), load time dominates comparison time and B-trees simply have fewer loads; memory complexity is the "interesting" part of this problem unless you are on a C64 or at some zany future company with hyper-fast RAM.


Just because it was first discovered 15 years ago doesn't mean it isn't still a good idea.

Keep on thinking. You never know what else you will come up with.


They measured a speedup. SIMD is a super common way to speed stuff up and good compilers do it automatically.

A lot of string comparisons use SIMD now. In C++, if you use std::string on GCC, it has done this for years. I bet Rust and Python (in the compiled runtime) are doing it too.

No sense leaving easy performance gains on the table.


I don't think so?

I think the B-tree is using the start/end sequence knowledge at each node to do fewer node evaluations. That is, a B-tree is "shorter", so of course it'll take fewer node evaluations to reach the leaf.

"b-tree is equivalent to balanced binary tree", and "unbalanced tree", don't make a ton of sense when ostensibly the article compares b-trees to balanced BST.



