
A blunt tool

B-Trees are regularly used by relational databases to enforce primary key constraints.

Let’s look at what we know about a primary key…

  • It is unique
  • The semantic ordering of a primary key is irrelevant (we do not care that one primary key is more or less than any other)

But what properties does a B-Tree have in these respects?

  • The structure does not implicitly enforce uniqueness (you have to find a key first and then make a separate judgement; see the sketch after this list).
  • The structure rigorously enforces semantic ordering upon the keys.
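
To illustrate the first point, here is a minimal C++ sketch in which std::map stands in for the ordered index (the block layout of a real B-Tree is immaterial to the point). Enforcing uniqueness means searching for the key and then making a separate judgement before the insert can go ahead:

```cpp
#include <cstdio>
#include <map>
#include <string>

// std::map stands in for the primary key index; a real B-Tree's block
// layout does not change the shape of the work being done here.
std::map<long, std::string> pk_index;

// Enforcing the primary key: descend to where the key would live, then make
// a separate judgement about whether it is already present before inserting.
bool insert_unique(long pk, const std::string& row) {
    auto it = pk_index.lower_bound(pk);           // the search every insert must pay for
    if (it != pk_index.end() && it->first == pk)  // the explicit uniqueness check
        return false;                             // duplicate key: reject
    pk_index.emplace_hint(it, pk, row);           // only now can the insert go ahead
    return true;
}

int main() {
    std::printf("%d\n", insert_unique(42, "first row"));   // 1: inserted
    std::printf("%d\n", insert_unique(42, "second row"));  // 0: duplicate rejected
}
```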

At first glance, the B-Tree does not sound like the best tool for the job, and on closer inspection things don’t get any better…

For each key inspected in random order, we have to visit a block at every level of the B-Tree, and for each of those blocks we then have to perform either a binary search across the block (at best) or iterate sequentially through it (at worst). So even for a modest population of 1 billion keys, using a B-Tree with 3 levels and 1000 entries per block will require somewhere between roughly 30 memory transfers (about 10 binary-search probes per block across the 3 levels) and 3000 (a full sequential scan of each block) just to find one key – and that’s on a good day, i.e. when the index happens to be completely cached within several gigabytes of memory.
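
To make that arithmetic explicit, here is a tiny C++ sketch of the cost model; the level count and block size are just the illustrative figures quoted above, not measurements from a real index:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // Illustrative cost model only, using the figures quoted above:
    // a 3-level B-Tree with 1000 entries per block (~10^9 keys).
    const int levels          = 3;
    const int entries_per_blk = 1000;

    // Binary search within a block costs ~ceil(log2(entries)) probes.
    const int best  = levels * static_cast<int>(std::ceil(std::log2(entries_per_blk)));
    // A sequential scan of a block costs up to 'entries' probes.
    const int worst = levels * entries_per_blk;

    std::printf("per-key probes, best case  (binary search): ~%d\n", best);  // ~30
    std::printf("per-key probes, worst case (sequential)   : ~%d\n", worst); // 3000
}
```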

Of course, the B-Tree is also available to service queries. But the semantic ordering enforced by the B-Tree index is irrelevant to a query navigating a primary/foreign key relationship, and the key order doesn’t assist with either nested loop or hash join processing. While the key order can be exploited by a merge join, the merge join is rarely a strategy of first choice, and that limited advantage evaporates if additional query predicates need to be applied to the same table that holds the primary key index.

In essence, the B-Tree contains redundant information which is maintained at a high cost but yields minimal benefit. That’s not a good cost/benefit ratio.

Up and out

Scaling out a database across a network of database servers requires an architecture which equitably distributes independent tasks across the network; where equitable distribution is required to ensure balanced use of the resources available and independence is required to minimise communication between those tasks. This latter point is important because communication can create serialization between tasks and can saturate the network bandwidth as the number of tasks is increased. Both of these aspects hurt scalability.

Serialization is undesirable, because it forces tasks to wait on one another and a waiting task is an unproductive one – so serialization reduces the effectiveness of scale out. Network bandwidth saturation creates a ceiling for the number of tasks that can be executed simultaneously because adding more tasks just starves the remaining tasks of bandwidth – so bandwidth saturation limits the extent of scale out. Hence, both serialization and saturation make the architecture less scalable.

Of course, there has to be some communication at the start and end of each task if a task is to perform any useful general-purpose work; but each task needs to operate independently between those communication points to achieve scalability. The Map-Reduce framework exemplifies this approach wherein the map phase distributes the tasks which operate independently until completion and then the reduce phase consolidates the results.
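
As a toy illustration of that shape, here is a minimal C++ sketch in which threads stand in for distributed workers (the data and the summing work are invented for the example): the map tasks run to completion without talking to one another, and the reduce step is the only point of consolidation.

```cpp
#include <cstdio>
#include <future>
#include <numeric>
#include <vector>

int main() {
    // Toy data split into partitions; each partition is handled by an
    // independent "map" task that needs no communication with its peers.
    std::vector<std::vector<int>> partitions = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};

    // Map phase: one independent task per partition.
    std::vector<std::future<long>> mapped;
    for (const auto& part : partitions)
        mapped.push_back(std::async(std::launch::async, [&part] {
            return std::accumulate(part.begin(), part.end(), 0L);  // purely local work
        }));

    // Reduce phase: the only point of communication, consolidating the results.
    long total = 0;
    for (auto& f : mapped) total += f.get();
    std::printf("total = %ld\n", total);
}
```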

Interestingly, a similar approach is also required within a single database server. It is rare to find a server with a single CPU core these days (outside of the mobile computing environment) and to scale up performance the database must equitably distribute independent tasks across the CPU cores. Again communication between the cores must be kept to a minimum to avoid serialization between tasks, memory bandwidth saturation and cache coherency traffic. Let’s take a closer look at that last point.

In a multi-core environment, each core will have a private cache as well as a shared cache to overcome the latencies experienced with memory access. The private caches will be small but will execute at near CPU speed, while the shared cache will be larger and slower but still much faster than normal memory access. Indeed, memory access is likely to be non-uniform in a multi-core environment and different cores will likely experience different latencies when accessing the same memory location. These caches attempt to overcome memory latency and provide a cache hierarchy whereby content works its way from memory to shared cache to private cache for a memory read access, and vice versa for a memory write access.

Therefore, an update to a memory address starts in the private cache of a core and is not immediately visible to the other cores (because it is a private cache). To avoid consistency problems with memory access across multiple cores, a cache coherency protocol has to be observed. A typical protocol will invalidate the content of a memory address in a private cache when the corresponding address is updated in the private cache of another CPU core, and will force a flush of the memory address from the updated private cache into the shared cache and possibly memory also. Any core attempting to access its invalidated private cache location will then be forced to fetch the content again from the shared cache and/or memory.

The cache coherency protocol involves traffic between cores and/or snooping between cores (to see which cores have accessed which memory locations) and forces access to slower shared cache or memory when a memory address is invalidated by another core. In other words, cache coherency is an expensive protocol which hits core performance, and the only way to minimise the cache coherency overhead is to avoid updating the content of memory locations shared between cores wherever possible. Therefore, not only can communication between tasks create serialization and saturate memory bandwidth, it will also scupper core performance through the burden of cache coherency.
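
To make that cost concrete, here is a hypothetical C++ micro-benchmark sketch (the thread count and iteration count are arbitrary, and the actual timings depend entirely on the hardware): a single counter updated by every thread versus per-thread counters that are only consolidated once the threads finish.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

constexpr int  kThreads        = 4;
constexpr long kItersPerThread = 10'000'000;

// Time a workload in milliseconds.
long long time_ms(void (*body)()) {
    auto t0 = std::chrono::steady_clock::now();
    body();
    return std::chrono::duration_cast<std::chrono::milliseconds>(
               std::chrono::steady_clock::now() - t0).count();
}

// Every thread updates one shared counter: the cache line holding it is
// repeatedly invalidated and re-fetched as the cores take turns updating it.
void shared_counter() {
    std::atomic<long> counter{0};
    std::vector<std::thread> ts;
    for (int i = 0; i < kThreads; ++i)
        ts.emplace_back([&counter] {
            for (long n = 0; n < kItersPerThread; ++n)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& t : ts) t.join();
}

// Each thread updates its own cache-line-sized counter: the writes stay
// core-local, and the totals are consolidated only at the end.
void local_counters() {
    struct alignas(64) Local { long n = 0; };       // one cache line per thread
    Local locals[kThreads];
    std::vector<std::thread> ts;
    for (int i = 0; i < kThreads; ++i)
        ts.emplace_back([&locals, i] {
            for (long n = 0; n < kItersPerThread; ++n)
                ++locals[i].n;                      // core-local write, no sharing
        });
    for (auto& t : ts) t.join();
    long total = 0;                                 // consolidate only at the end
    for (const auto& l : locals) total += l.n;
    (void)total;
}

int main() {
    std::printf("shared counter    : %lld ms\n", time_ms(shared_counter));
    std::printf("per-core counters : %lld ms\n", time_ms(local_counters));
}
```

On a typical multi-core machine the shared counter runs markedly slower, purely because the cache line holding it bounces between the cores’ private caches on every update.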

Note that sharing a memory location across cores for read only access is perfectly fine. In this case, each private cache will get its own copy of the data and provided no core updates it, the cache coherency protocol need never be invoked for it. Problems only arise when a core updates the content of a shared memory address.

Inevitably, there will be some updating of memory shared between cores, and there is a plethora of research regarding blocking, obstruction-free, lock-free and wait-free algorithms for dealing with shared data across contending threads. However, outside of the research community there is some misunderstanding about the effects and benefits of these algorithms, so let’s take a closer look.

A blocking algorithm is the classic approach that places a guard or latch around a critical section to prevent more than one thread from accessing the critical section at the same time. It typically uses a low-level mutex or semaphore for this controlled access, and these structures deliberately force serialized access to critical sections and hence serialization between tasks. Ultimately, a semaphore is a shared memory location with atomic test and increment/decrement operations. The use of a semaphore becomes particularly unfortunate when it is used to count the number of concurrent readers of a critical section to prevent concurrent read/write access; in that case every read of the critical section may invoke cache coherency traffic even though the critical section itself is only being read and not updated at all.
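
As a sketch of that last point, here is a hypothetical reader-counting guard in C++; the atomics below stand in for the semaphore’s atomic test and increment/decrement operations, and the code is an illustration rather than a production synchronisation primitive. Note that every reader must update the shared counter on entry and on exit, so even purely read-only use of the critical section generates cache coherency traffic:

```cpp
#include <atomic>
#include <thread>

// A naive reader-counting guard of the kind described above. Readers are
// counted so that a writer can exclude them; the point of interest is that
// every reader performs at least two atomic writes to the shared counter.
class NaiveRWGuard {
    std::atomic<int>  readers{0};
    std::atomic<bool> writer{false};
public:
    void reader_enter() {
        for (;;) {
            readers.fetch_add(1);                    // shared write, just to read!
            if (!writer.load()) return;              // no writer present: proceed
            readers.fetch_sub(1);                    // back off while a writer is in
            while (writer.load()) std::this_thread::yield();
        }
    }
    void reader_exit() { readers.fetch_sub(1); }     // another shared write
    void writer_enter() {
        bool expected = false;
        while (!writer.compare_exchange_weak(expected, true)) {
            expected = false;
            std::this_thread::yield();
        }
        while (readers.load() != 0) std::this_thread::yield();  // drain the readers
    }
    void writer_exit() { writer.store(false); }
};

int main() {
    NaiveRWGuard guard;
    int shared_value = 0;
    std::thread w([&] { guard.writer_enter(); shared_value = 42; guard.writer_exit(); });
    std::thread r([&] { guard.reader_enter(); int seen = shared_value; guard.reader_exit(); (void)seen; });
    w.join();
    r.join();
}
```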

Moreover, a thread holding a block on a critical section may be suspended and cause active threads using the same critical section to wait until the controlling thread becomes active again. Even worse, a block may be accidentally left in place if a controlling thread abnormally terminates. Hence, with a blocking protocol, the progress of any single thread and the system overall is heavily dependent upon the activities of all of the threads sharing the same critical section.

In light of this, the following non-blocking algorithms can be used as an alternative – they are designed to provide a guarantee about the progress of one or more threads competing for shared data.

  • An obstruction-free algorithm guarantees that a hibernating thread cannot block an active thread from accessing the same data;
  • A lock-free algorithm guarantees that at least one active thread will make (genuine) progress amongst multiple active threads accessing the same data;
  • A wait-free algorithm guarantees that all active threads will make (genuine) progress when accessing the same data.

Such algorithms are generally available for common structures such as stacks and queues etc. But they still require the use of atomic increment/decrement, test and exchange instructions operating on shared memory locations. Hence they still incur the burden of cache coherency and still remain limited by memory access bandwidth. They only alleviate the serialization experienced between threads. Nothing more.
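
For concreteness, here is a minimal Treiber-style lock-free stack push in C++ (a sketch only: pop and safe memory reclamation are deliberately elided). At least one contending thread always wins the compare-and-exchange, which is the lock-free guarantee above, yet every attempt still operates on the shared head pointer and therefore still pays the cache coherency and bandwidth costs just described:

```cpp
#include <atomic>
#include <utility>

// Sketch of a Treiber-style lock-free stack: push only, with no memory
// reclamation (nodes are leaked), purely to show the shape of the algorithm.
template <typename T>
class LockFreeStack {
    struct Node { T value; Node* next; };
    std::atomic<Node*> head{nullptr};
public:
    void push(T value) {
        Node* n = new Node{std::move(value), head.load(std::memory_order_relaxed)};
        // A failed compare-and-exchange means some other thread made progress;
        // this thread simply retries (the retry loop is the "spinning" of a
        // non-progressing thread mentioned below).
        while (!head.compare_exchange_weak(n->next, n,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {
            // n->next has been refreshed with the current head; try again.
        }
    }
};

int main() {
    LockFreeStack<int> s;
    s.push(1);
    s.push(2);  // both pushes succeed without any lock being taken
}
```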

Indeed, the reduced serialization can come at an additional cost in memory and processing cycles. As the progress guarantee gets more stringent, the cost of that guarantee also increases. Some phrase involving the words lunch and free comes to mind.

Clearly, the wait-free algorithm is the most stringent of all and only makes sense when threads accessing the same data are performing logically independent operations (albeit against the same physical structure). For example, if I push an item into a queue at the same time you push another independent item into the same queue, these are logically independent operations and both can progress independently. In effect, two versions of the queue are created (requiring more memory) and subsequently merged (requiring more processing) and memory reclaimed (requiring garbage collection) to achieve progress for both operations simultaneously. (Note that there is an analogy here with eventual consistency in a scaled-out architecture.)

A lock-free algorithm will cause non-progressing threads to spin around with repeated unsuccessful attempts at accessing the shared data, likely incurring cache coherency overhead on the progressing thread.

An obstruction-free algorithm may allow optimistic access to shared data and force unsuccessful threads to clean up their misdemeanours after the fact.

Hence the use of the more stringent non-blocking algorithms does not deliver the best overall throughput. In particular, the wait-free guarantee is very expensive and is usually confined to real-time environments where any given thread cannot afford to wait too long.

Therefore a database architecture wishing to scale up effectively will use lock-free or obstruction-free algorithms to reduce serialization to a reasonable degree while maintaining a high overall throughput.

But more importantly, these non-blocking protocols are not a magic bullet and, regardless of the protocols used, it is far more important to minimise communication wherever possible. Less communication between cores means better scale-up, and a database designed to scale up well will need to be scaled out a lot less than one that isn’t.