Bobo

I'm really not sure what I was thinking.

On paper, it really did look like a great idea: as an educational exercise, take an enterprise-grade, large (> 250k lines of code) .NET Framework 1.0/1.1 server application and migrate it to .NET Core 1.0.

As soon as I hit all of the many warnings and errors in the migration wizard when I opened the code in Visual Studio 2015 Update 3 (code which had originally been authored in Visual Studio .NET -- before Microsoft even bothered to give that version of Visual Studio a version number or year in its title), I knew I was in trouble.

Well, maybe not 'trouble', but clearly I had a lot of work ahead of me. 22 projects (20 C#, 1 installer project and 1 C++ project). Thousands of files. Hundreds of thousands of lines of code.

Where to begin?

Well, as discussed in overview.md, at the very heart of the search engine is the index, and the index has two major pieces -- a B-Tree of tokens and a list of positions for each token.
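Before diving into the real code, here's a rough mental model of how those two pieces relate. This is only a sketch of the concept -- the names and in-memory types are mine, not Bobo's, and the real structures live on disk:

using System.Collections.Generic;

// Conceptual model only: the B-Tree maps each token to a pointer, and
// following that pointer gets you the list of positions where the token occurs.
internal sealed class ConceptualIndex
{
	// What the B-Tree provides: token -> pointer into the position data.
	private readonly SortedDictionary<string, long> _tokenToPointer =
		new SortedDictionary<string, long>();

	// What the position list provides: pointer -> positions of that token.
	private readonly Dictionary<long, List<long>> _pointerToPositions =
		new Dictionary<long, List<long>>();

	public IEnumerable<long> Find(string token)
	{
		long pointer;
		if (_tokenToPointer.TryGetValue(token, out pointer))
			return _pointerToPositions[pointer];
		return new long[0];
	}
}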

So let's start with the B-Tree, which we immediately observe to implement the IKeyToDataPointerStore interface:

internal sealed class BTree : IKeyToDataPointerStore

This should make sense -- the B-Tree is an index of all of the tokens ("words"), with pointers to the positional lists for each of those tokens.

The interface is very simple, but it reveals one interesting "oh, this was code written in 2002" subtlety -- a #define is being used to switch between creating a 32-bit vs. a 64-bit index. Making this a compile-time decision suggests the trade-off was meaningful: a 64-bit index likely cost more in performance and space than a 32-bit one, and there were probably plenty of cases where a 32-bit index could handle enough data to be useful. (The primary difference between a 32-bit index and a 64-bit index is how much data they can store.)

For the migration effort, we can probably just focus on the 64-bit use case.

using System;

namespace Daxat.Ess.Indexers.FullTextIndexers
{
	/// <summary>
	/// Summary description for IKeyToDataPointerStore.
	/// </summary>
	internal interface IKeyToDataPointerStore
	{
		// TODO: Hmmm..I think we need a IKey interface that is a superinterface
		// to IBTreeKey
#if BTREE_DATA_POINTERS_32BITS
		void Insert(IComparable key, int val);
#else
		void Insert(IComparable key, long val);
#endif
		// All refs to BTree's need to be replaced with more generic interfaces
		IKeyDataPair Search(IComparable key);
	}
}

A BTree interface doesn't get much simpler than that -- there are only two operations: Insert, which takes a key/value pair, and Search, which takes a key and returns a key/value pair. Yeah, it's odd that the return type of a search is IKeyDataPair while the insertion method takes two distinct arguments. There may be a good reason for that -- most likely a performance saving. Allocating a new IKeyDataPair simply to insert a key/value pair may have been judged syntactic sugar at the expense of performance. Remember, a lot of the convenience and speed of the .NET Framework did not exist in the 1.0 days.
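For what it's worth, we can guess at the shape of IKeyDataPair from how Search uses it. The members below are my inference, not the actual interface from the codebase, and they stick to the 64-bit data pointer case per the note above:

using System;

// A guess at the shape of IKeyDataPair, inferred from Search's return type;
// the real interface in the codebase may well differ.
internal interface IKeyDataPair
{
	IComparable Key { get; }
	long Data { get; }
}

Whatever the real members are, the asymmetry stands: Search hands back a pair object per call, while Insert avoids allocating one.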

The second thing we notice about the BTree class is that its constructor takes some easy-to-deduce arguments -- a path specifying where to store the data, some utility objects (which look ripe to be replaced with lambdas), and a bucket.

What's a bucket?

Although we talk about a Bobo index having two main pieces -- the BTree and the position list -- for performance and other reasons the index is actually stored across many files -- potentially hundreds of files. The overall index is divided into a number of buckets, and each bucket has a B-Tree and a position list. So, an index with 1 bucket has 2 files, whereas an index with 100 buckets has 200 files.
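To make that arithmetic concrete, here's a tiny sketch of the per-bucket file layout. The index{bucket}.idx naming is what we'll see in the constructor shortly; the position-list file name is purely a placeholder of mine, since we haven't looked at that code yet:

using System;
using System.IO;

internal static class BucketLayoutSketch
{
	public static void Main()
	{
		string indexRoot = @"C:\bobo\index";   // hypothetical index directory
		int bucketCount = 100;

		for (int bucket = 0; bucket < bucketCount; bucket++)
		{
			// One B-Tree file and one position-list file per bucket.
			string btreeFile = Path.Combine(indexRoot, "index" + bucket + ".idx");
			string positionsFile = Path.Combine(indexRoot, "positions" + bucket + ".dat"); // made-up name
			Console.WriteLine(btreeFile + ", " + positionsFile);
		}
		// 2 files per bucket: 1 bucket -> 2 files, 100 buckets -> 200 files.
	}
}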

Looking at the constructor, we can see that the B-Tree is physically stored in a bunch of files with names such as index0.idx, index1.idx, and so on -- with the numerical part of the file name representing the bucket.

		public BTree(string path, ICompareIBTreeKeyToInternalNumericKey comparer, IMapKeyToObject mapper, int bucket)
		{
			_mapper = mapper;
			// Store index on disk; do NOT use a write cache.
			string btreePath = Path.Combine(path, "index"+bucket+".idx");
			storage = new BTreeStorage(btreePath, new FileStream(btreePath, FileMode.OpenOrCreate, FileAccess.ReadWrite, FileShare.None), false, _mapper);
			// Store index on disk; DO use a write cache.
			//storage = new BTreeStorage(new FileStream("index"+bucket+".idx", FileMode.OpenOrCreate, FileAccess.ReadWrite, FileShare.None), true, _mapper);
			// Store index in memory
			//storage = new BTreeStorage();

			this.comparer = comparer;
			if (storage.BaseStream.Length == 0)
				Create();
			else root = storage.Read(0);
		}

The comments in the constructor also reveal some other potential ways the BTree can be stored -- (a) with write-caching turned on or (b) in memory. Both of those sound really dangerous for a server-based process, so I think we can safely assume those options exist merely for increasing performance during development.

Most of the remainder of the BTree code looks like standard B-Tree code -- insertion, child-splitting, and search operations, to name but a few. There are also a bunch of utility functions to dump out the data, such as returning XML versions of the index contents (likely for debugging purposes). But if you look closely at some of the operations on the individual nodes of the BTree, you'll see things like:

		public void Insert(IComparable comparableKey, long val)
		{
			IBTreeKey key = comparableKey as IBTreeKey;
			// -1 == Timeout.Infinite: wait as long as it takes to acquire the write lock
			_readerWriterLock.AcquireWriterLock(-1);

The ReaderWriterLock is pretty important to Bobo -- it's a lock which permits multiple simultaneous readers but, when a write operation is needed, locks the resource exclusively so that the update can be done safely. This is fundamental to Bobo operating in a thread-safe yet high-performing manner, since read operations commonly far outnumber write operations (i.e., you're likely searching the index far more often than you're updating it).

Still, this was the best solution available for the problem in 2002 -- it's definitely worth evaluating modern alternatives as part of the migration.
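For example, here's a minimal sketch of what the same guard might look like with ReaderWriterLockSlim, which is available in .NET Core and is generally lighter-weight than the old ReaderWriterLock. This isn't Bobo's code -- just the shape a replacement could take:

using System;
using System.Threading;

internal sealed class GuardedBTreeSketch
{
	private readonly ReaderWriterLockSlim _lock = new ReaderWriterLockSlim();

	public void Insert(IComparable key, long val)
	{
		_lock.EnterWriteLock();     // exclusive: blocks all readers and writers
		try
		{
			// ... mutate the tree ...
		}
		finally
		{
			_lock.ExitWriteLock();
		}
	}

	public object Search(IComparable key)
	{
		_lock.EnterReadLock();      // shared: many searches can proceed in parallel
		try
		{
			// ... walk the tree ...
			return null;
		}
		finally
		{
			_lock.ExitReadLock();
		}
	}
}

Wrapping each enter/exit in try/finally guarantees the lock is released even if the tree operation throws.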

Next we'll dive a bit deeper into the BTree implementation and find some scary stuff.