Tag Archives: Data Structures

Black and white

The constant background debate about relational versus NoSQL has become entrenched in dogma and misses many fundamental aspects of data management.  Fortunately, the notion that the difference between SQL and NoSQL has anything to do with SQL has already been thoroughly dispelled – but there still remains an active debate about the virtues or otherwise of schemas and ACID transactions.

Let’s deal with schemas first. A schema is just an abstraction of the data and is not a method of storage. The schema is simply there to provide a conceptual model for data access and exists as a logical layer above the storage structure in much the same way that a file system provides a logical abstraction of the physical storage system.

While the relational model was conceived to provide an intuitive framework for understanding real-world entities and a self-documenting structure for supporting queries, the database industry made the mistake of closely coupling that model with the storage organisation. This has resulted in the unnecessary polarisation between row stores and column stores and the horrible complexity and inertia typically experienced with traditional relational databases. Whereas the underlying database storage structure should be schema-less and the relational model should simply be an optional schema layered upon it.

By using a schema-less storage structure that efficiently resolves arbitrary access paths between stored elements it does not matter whether the logical access model is relational or not. Indeed, there is no reason why the relational schema cannot co-exist with other models in the same database at the same time.

ACID or not? Actually, there’s a false premise to that question. ACID is not a single property, it includes atomicity, durability and consistency and these are all orthogonal features which can be varied independently. Should a sequence of updates be applied atomically as a single change? Does durability of a change need to be guaranteed? How consistent should these changes be across the database infrastructure? These are not the same question and that last one actually requires more than a simple binary answer because consistency is really a spectrum of possibilities ranging from none, to eventual, to immediate.

In fact the requirements for atomicity, durability and consistency can vary by individual transaction. It is easy to imagine an application where a class of transactions that require durability coexist with a class of transactions that do not require it at all. Similarly for atomicity and consistency.  So it is not even adequate to configure these guarantees globally across the whole database – let alone hardwire them into the database infrastructure itself, which is the typical state of affairs.

While the relational model may choose to enforce global acidity, durability and immediate consistency for transaction – the database engine need not.

Even durability is not a simple yes or no.

Does committing a change to a disk make that change durable? Not if the disk fails. Is mirrored storage sufficient? How many mirrors are deemed safe? Is a single-site sufficient? How distant do these replicas need to be? One man’s durable is another man’s ephemeral.

Forget the arguments about choosing black or white – we need grey-scale.

Cover up

A traditional relational database falls into one of two camps – it is either a row store or a column store. These are both intuitive arrangements for storing a table structure, where either each row or each column is stored contiguously. The idea being that if you want to fetch a row, you will find it in one place or if you want to scan a column you will find all of the column values in one place.

Of course, the problem with both storage models is that they are overly simplistic and neither helps you find relevant rows in a selective OLTP query or eliminate irrelevant rows in a collective analytical query and the only way to make these storage models effective is to layer indexes on top or to throw more hardware at them.

Indexes are very familiar in a row store – but they also appear in column stores as row projections where multiple columns are stitched together to avoid the high cost of row re-construction; and more familiar index structures can also be found in some column stores too, for filtering columns and for joining between tables.

In its most general sense, any index can be considered to be a store of key and address pairs where both the key and the address may be stored either explicitly or implicitly for an index entry. So for example, in a key comparative structure (such as a B-Tree, LSM tree, fractal tree etc) both the key and the address are explicitly recorded in the index entry; whereas in a bit map, both the key and the address are implied from their position within the index. Naturally, a variety of possible index structures covers all four possible combinations  as shown in the table below (which is not an exhaustive list of structures by any means).

The advantage of an index with an implicit address is that the row address order will be preserved and scans across multiple indexes against the same table can be efficiently resolved to yield a combined result. Thus with bit map indexes, multiple indexes can be logically conjoined or disjoined for the same table without any need to sort/merge explicit addresses. With these types of indexes there is rarely any need for a composite key index because separate indexes can be easily combined into a single composite result.

An index with an explicit address will also typically retain the address order for a single and specific key value (since there is typically no advantage to changing the address order). Thus, with an inverted list we can easily merge the results from different lists; and similarly for a B-Tree, where we choose one particular key from each index we can merge the results directly because the address order is preserved.

However, when we attempt to scan an index with an explicit key and address for multiple keys, the addresses may no longer be in row address order and we have to either sort the addresses or create an ordered address map (such as a bit map) for each index to be able to merge addresses across indexes on the same table. Naturally, as a query covers more of the key domain, the more addresses must be mapped or sorted prior to merge and the less trivial this task becomes.

However, the important point here is that as soon as a query covers more than one key in an explicit key and address index, the overhead on index merging changes significantly.

Why is this important? Well, it is widely assumed, that an index built with composite keys (that is, multiple keys concatenated together) covers and removes the need for indexes that might be built on a subset of those keys. A classic example of this is building a compound B-Tree index on the department number and employee salary column in an employees table. This index might be used to find employees in a specific department in a certain salary range. Since we have already included the department number, this index can also be used to find all employees in a specific department regardless of their salary. Therefore, there is no need to build a separate department number index because the composite index will suffice. Right? Not quite. Whereas an index built solely on department number can be easily merged with an indexed built solely on employee gender to find all male employees in a specific department; the composite index requires the employees in the queried department to be sorted or address mapped before they can be merged with the gender index.

Worse still, while index scans that retain address order can be easily decomposed into an arbitrary number of address partitions across the address space with all partitions scanned and merged in parallel, this is no longer immediately possible if one or more the indexes has to be sorted or mapped first.

Therefore, the idea that a row store can be made analytical by overlaying covering indexes is only effective where those indexes precisely match the query requirements – but to cover all those requirements typically requires an awful lot of indexes.

A blunt tool

B-Trees are regularly used by relational databases to enforce primary key constraints.

Let’s look at what we know about a primary key…

  • It is unique
  • The semantic ordering of a primary key is irrelevant (we do not care that one primary key is more or less than any another)

But what properties does a B-Tree have in these respects?

  • The structure does not implicitly enforce uniqueness (you have to find a key first and then make a separate judgement).
  • The structure rigorously enforces semantic ordering upon the keys.

At first glance, the B-Tree does not sound like the best tool for the job and on  closer inspection things don’t get any better…

For each key inspected in random order, we have to visit a block at every level of the B-Tree and for each of those blocks that we visit we then have to perform either a binary search across the block (at best) or iterate sequentially through it (at worst). So even for a modest population  of 1 billion keys, using a B-Tree with 3 levels and 1000 entries per block will require somewhere between 30 and 3000 memory transfers just to find one key – and that’s on a good day i.e. when it happens to be completely cached within several gigabytes of memory.

Of course, the B-Tree is also available to service queries too. But the semantic ordering enforced by the B-Tree index is irrelevant to a query navigating a primary/foreign key relationship and the key order doesn’t assist with either nested loop or hash join processing. While the key order can be exploited by a merge join, the merge join is rarely a strategy of first choice and that limited advantage evaporates if additional query predicates need to be applied to the same table that holds the primary key index.

In essence, the B-Tree contains redundant information which is maintained at a high cost but yields minimal benefit. That’s not a good cost/benefit ratio.


There are two very different strategies for dealing with updates and deletes in a relational database. Update-in-place semantics will directly overwrite data in-situ with a new version of that data; while append semantics will simply append the new version to a chronological log and leave the old data untouched. The update-in-place approach was extensively used in early database architectures but more recent database systems tend to opt for append semantics. Let’s look at the reasons why that may be.

When we append updates rather than overwrite data we are able to keep the old and consistent version of the data while the update is in progress and this affords a robust recovery mechanism should the update fail. We are able to easily watermark the updates and we can choose to move that watermark only when we know that the updates have been completed and successfully persisted in storage. Whereas with update-in-place semantics we would have to write and retain the old version in a separate log before the update took place to be able to guarantee recovery from a failed update (assuming that a full recovery from a database backup copy is a little too rudimentary for most people).

The use of an append semantics allows multiple updates to be appended to the same area of storage rather than scattered at random locations across the database. This affords a huge performance benefit to updates because large numbers of small updates become coalesced into a smaller number of larger writes and this hugely reduces the load on fsync() operations – which are essential for durability in write back cache arrangements and for avoiding anomalies arising from out-of-order write behaviour in modern disk systems.

Even on flash storage this type of large append write pattern has huge benefits over small random writes because much less of an erase block is wasted with each write. With a small random update I/O, a flash device has to copy the entirety  of an updated block to a new empty location and remap the block’s location. Once a flash device fills, unused blocks must be explicitly erased before they can be re-used and on a busy device this garbage collection activity can impact performance even where it happens asynchronously ahead of time.

Append semantics show significant performance gains in other areas too, such as in transaction isolation. With update-in-place semantics either an explicit MVCC (multi-version consistency control) scheme with separate before-update images must be managed; or worse still, read activity must lock-out update activity on the same data to ensure consistent reads. While the use of read locks is a death knell to performance with concurrent users and is tantamount to eating small children in a public place; the overhead of managing before and after images also places a huge burden on throughput. However, none of this band aid architecture is needed with append semantics because versioning is implicit and watermarks demarcate the boundaries between transactions.

There’s also a slight deception to the term update-in-place. What happens when the version we are attempting to overwrite is smaller than the updated version? This usually leads to schemes such as row chaining where database rows are split to accommodate increasing row length. These chained row fragments may not even reside in the same storage block as the original row and this ultimately leads to further performance horror at query time.

Moreover, contemporary data management requirements commonly require a history of transactions to be retained for management, analysis and legal reasons. So why go out of your way to destroy that very history by overwriting it?

That all sounds great for append semantics, but are there any good reasons for choosing to update in place? The usual response here is that append semantics will require more storage space than update-in-place semantics because we are retaining a complete history of the updates. This is indeed true, but is it important?

On any very large database system dealing with machine generated or web data, the huge majority of data is derived from immutable inserts that never get updated and any data that can be updated (such as customer or account data) is insignificant by comparison. Besides, customer and account data is the very type of data we want to maintain a history for because these are the significant real-world entities on which the business depends.

At this point, someone will helpfully suggest that they might use a table which contains a single row which gets frequently updated for the sole purpose of maintaining a sequence number.  (Suddenly, a silence falls and a distant church bell can be heard to slowly toll…) Enough said.

Therefore, either the volume of updates is unlikely to significantly impact storage space on a large system where space is critical; or the system is too small for it to matter at today’s storage prices. Indeed, storage bandwidth is a lot costlier than storage capacity and update-in-place offers no benefit there at all.

Neither is query performance improved by update-in-place semantics since with append semantics, all versions of a row can be equally accessible and resolvable through whatever structures are use to locate an original row. Indeed query performance is likely to suffer more with update-in-place semantics because of row chaining.

So why do some database vendors continue to use update-in-place semantics? Good question.

A precedent embalms a principle. Benjamin Disraeli.

Times have changed

It occurred to me some time ago that while both hardware architecture and business requirements have changed beyond recognition since the 1970s, the same cannot be said for relational databases. It is quite surprising that the relational database industry remains so entrenched in the algorithms and data structures most suited to a hardware architecture that has long since disappeared, resulting in a fundamental disharmony between these relational database engines and the hardware on which they now execute. 

Consider the environment in which software was designed when the relational database era began

  • Single CPU core
  • Small memory capacity
  • No memory wall
  • Poor transfer rates
  • Small user population
  • Purely transactional
  • Low transaction rates
  • Few analytical queries
  • Monthly reports
  • Minimal history retention
  • Limited size (a 1GB database was large)
  • Static schema
  • Static queries
Every single aspect of the list above has dramatically changed. Yet the algorithms and data structures of that era are still ubiquitous today. 

This is not about SSD vs HDD; nor is this about in-memory vs in-storage. This is about how the algorithms and storage structures still at the core of today’s operational and analytical databases, that were optimal when first conceived, have now become inappropriate for modern hardware architecture and contemporary user requirements – regardless of were the data happens to be stored or distributed.