/ .NET

Why B-trees?

Before I attempt to fix all of the existing compilation errors, it's worth spending a moment answering the question -- why do we even need a custom B-tree implementation?

First, you need to understand what a B-tree is, and Wikipedia's page is as good as any. In particular, "the B-tree is optimized for systems that read and write large blocks of data. B-trees are a good example of a data structure for external memory."

Looking at what Bobo's implementation of B-tree does, it looks like a classic .NET hashtable or dictionary -- it lets you store and retrieve a value via a key. The data type of the key and the value are extremely specific. This leads to some questions:

  1. Why didn't you use System.Collections.Hashtable?
  2. Why didn't you use System.Collections.Generic.Dictionary<TKey, TValue>?
  3. Why didn't you at least make it a generic class?

First and foremost, this part of Bobo is some of the oldest code -- and was written in the timeframe of .NET Framework 1.0. And, believe it or not, there was a time when none of these capabilities exists. Generics were a feature of 2.0. The Hashtable did not even exist until 1.1.

But, equally important, the B-tree implementation is used to maintain the searchable index of tokens (read: words) within the search engine. It needs to be stored in persistent storage -- not simply in memory. Updates to the index need to be fast. Hashtable, Dictionary and the like are all in-memory data structures. So while it may be worth considering switching from this custom code to an existing .NET Core class, the need for persistence cannot be ignored.

This is not an uncommon need -- an internet search for '.net persistent dictionary' gets more than a few hits.

The concern is that these implementations did not optimize for disk read/write speed. Bobo's B-tree uses a high degree of branching -- 29 children per node -- so that reads and writes are approximately the size of an entire disk sector. The reason for this is that is takes essentially the same amount of time to read 1 byte from a disk sector as reading all of the bytes in a disk sector.

That being said, there is room for improvement in Bobo's B-tree implementation. It can be made generic without sacrificing performance. It seems to be oddly coded for specifically writing to the file system rather than more generically to streams. It has protected methods in a sealed class.

Addressing some of these concerns may have a positive side-effect -- the new BTree implementation may be easier to test, especially using just a mocking framework. The original implementation would likely require a combination of mocks, fakes and stubs.