
Black and white

The constant background debate about relational versus NoSQL has become entrenched in dogma and misses many fundamental aspects of data management.  Fortunately, the notion that the difference between SQL and NoSQL has anything to do with SQL has already been thoroughly dispelled – but there still remains an active debate about the virtues or otherwise of schemas and ACID transactions.

Let’s deal with schemas first. A schema is just an abstraction of the data and is not a method of storage. The schema is simply there to provide a conceptual model for data access and exists as a logical layer above the storage structure in much the same way that a file system provides a logical abstraction of the physical storage system.

While the relational model was conceived to provide an intuitive framework for understanding real-world entities and a self-documenting structure for supporting queries, the database industry made the mistake of closely coupling that model with the storage organisation. This has resulted in the unnecessary polarisation between row stores and column stores and the horrible complexity and inertia typically experienced with traditional relational databases. Instead, the underlying database storage structure should be schema-less, with the relational model simply an optional schema layered upon it.

With a schema-less storage structure that efficiently resolves arbitrary access paths between stored elements, it does not matter whether the logical access model is relational or not. Indeed, there is no reason why the relational schema cannot co-exist with other models in the same database at the same time.

ACID or not? Actually, there’s a false premise to that question. ACID is not a single property: it includes atomicity, durability and consistency, and these are all orthogonal features which can be varied independently. Should a sequence of updates be applied atomically as a single change? Does durability of a change need to be guaranteed? How consistent should these changes be across the database infrastructure? These are not the same question, and that last one actually requires more than a simple binary answer because consistency is really a spectrum of possibilities ranging from none, to eventual, to immediate.

In fact the requirements for atomicity, durability and consistency can vary by individual transaction. It is easy to imagine an application where a class of transactions that require durability coexist with a class of transactions that do not require it at all. Similarly for atomicity and consistency.  So it is not even adequate to configure these guarantees globally across the whole database – let alone hardwire them into the database infrastructure itself, which is the typical state of affairs.
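
To make that concrete, the sketch below (plain Python, with an entirely hypothetical interface rather than any particular product’s API) shows how atomicity, durability and consistency might be expressed as independent, per-transaction knobs rather than a single global switch:

    from dataclasses import dataclass
    from enum import Enum

    class Consistency(Enum):
        # Consistency is a spectrum, not a yes/no answer.
        NONE = 0        # no cross-replica guarantee at all
        EVENTUAL = 1    # replicas converge at some later point
        IMMEDIATE = 2   # all replicas agree before the commit returns

    @dataclass
    class TxnOptions:
        atomic: bool = True                 # apply the whole change set or none of it
        durable: bool = True                # the change must survive a crash once committed
        consistency: Consistency = Consistency.IMMEDIATE

    # A throwaway metrics counter can tolerate loss and lag; a payment cannot.
    metrics_txn = TxnOptions(atomic=False, durable=False, consistency=Consistency.EVENTUAL)
    payment_txn = TxnOptions(atomic=True, durable=True, consistency=Consistency.IMMEDIATE)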

While the relational model may choose to enforce global atomicity, durability and immediate consistency for its transactions, the database engine need not.

Even durability is not a simple yes or no.

Does committing a change to a disk make that change durable? Not if the disk fails. Is mirrored storage sufficient? How many mirrors are deemed safe? Is a single-site sufficient? How distant do these replicas need to be? One man’s durable is another man’s ephemeral.

Forget the arguments about choosing black or white – we need grey-scale.

Cover up

A traditional relational database falls into one of two camps – it is either a row store or a column store. These are both intuitive arrangements for storing a table structure, where either each row or each column is stored contiguously. The idea is that if you want to fetch a row you will find it in one place, and if you want to scan a column you will find all of the column values in one place.
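
As a rough sketch (toy data, with plain Python lists standing in for pages on disk), the two layouts differ only in which values end up adjacent to one another:

    # A toy employees table; in a real engine these values would sit in
    # fixed-size pages on disk rather than in Python lists.
    table = [
        (1, 'alice', 'sales', 52000),
        (2, 'bob',   'hr',    48000),
        (3, 'carol', 'sales', 61000),
    ]

    # Row store: each row is stored contiguously, so fetching one whole
    # row touches a single run of adjacent values.
    row_store = [value for row in table for value in row]

    # Column store: each column is stored contiguously, so scanning one
    # column (every salary, say) touches a single run of adjacent values.
    column_store = [[row[i] for row in table] for i in range(4)]

    print(row_store[4:8])    # all of row 2 in one place
    print(column_store[3])   # all salaries in one place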

Of course, the problem with both storage models is that they are overly simplistic: neither helps you find relevant rows in a selective OLTP query nor eliminate irrelevant rows in a collective analytical query. The only way to make these storage models effective is to layer indexes on top of them or to throw more hardware at them.

Indexes are very familiar in a row store – but they also appear in column stores as row projections, where multiple columns are stitched together to avoid the high cost of row re-construction; and more familiar index structures can be found in some column stores too, for filtering columns and for joining between tables.

In its most general sense, any index can be considered to be a store of key and address pairs, where both the key and the address may be stored either explicitly or implicitly for an index entry. So, for example, in a key-comparative structure (such as a B-Tree, LSM tree, fractal tree etc.) both the key and the address are explicitly recorded in the index entry; whereas in a bit map, both the key and the address are implied from their position within the index. Naturally, a variety of possible index structures covers all four possible combinations, as shown in the table below (which is not an exhaustive list of structures by any means).

The advantage of an index with an implicit address is that the row address order will be preserved and scans across multiple indexes against the same table can be efficiently resolved to yield a combined result. Thus with bit map indexes, multiple indexes can be logically conjoined or disjoined for the same table without any need to sort/merge explicit addresses. With these types of indexes there is rarely any need for a composite key index because separate indexes can be easily combined into a single composite result.
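
A minimal sketch of why this works, using plain Python lists and made-up table data: each bit map is already in row-address order, so combining two predicates is just a positional AND.

    # Hypothetical employees table held as simple column lists; the row
    # address is just the position within each list.
    department = ['sales', 'hr', 'sales', 'it', 'sales', 'hr']
    gender     = ['m',     'f',  'f',     'm',  'm',     'm']

    def bitmap(column, key):
        # One bit per row: the address is implied by the bit position and
        # the key is implied by which bit map we built.
        return [1 if value == key else 0 for value in column]

    sales = bitmap(department, 'sales')   # [1, 0, 1, 0, 1, 0]
    male  = bitmap(gender, 'm')           # [1, 0, 0, 1, 1, 1]

    # Conjunction is a positional AND - no sorting or merging of explicit
    # addresses, and no need for a composite (department, gender) index.
    matches = [row for row, (s, m) in enumerate(zip(sales, male)) if s and m]
    print(matches)   # rows 0 and 4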

An index with an explicit address will also typically retain the address order for a single, specific key value (since there is usually no advantage in changing that order). Thus, with an inverted list we can easily merge the results from different lists; and similarly for a B-Tree: where we choose one particular key from each index, we can merge the results directly because the address order is preserved.

However, when we attempt to scan an index with an explicit key and address for multiple keys, the addresses may no longer be in row address order and we have to either sort the addresses or create an ordered address map (such as a bit map) for each index before we can merge addresses across indexes on the same table. Naturally, the more of the key domain a query covers, the more addresses must be mapped or sorted prior to the merge and the less trivial this task becomes.

The important point here is that as soon as a query covers more than one key in an explicit key and address index, the overhead of index merging changes significantly.

Why is this important? Well, it is widely assumed that an index built with composite keys (that is, multiple keys concatenated together) covers and removes the need for indexes that might be built on a subset of those keys. A classic example of this is building a compound B-Tree index on the department number and employee salary columns in an employees table. This index might be used to find employees in a specific department within a certain salary range. Since we have already included the department number, this index can also be used to find all employees in a specific department regardless of their salary. Therefore, there is no need to build a separate department number index because the composite index will suffice. Right? Not quite. Whereas an index built solely on department number can be easily merged with an index built solely on employee gender to find all male employees in a specific department, the composite index requires the employees in the queried department to be sorted or address mapped before they can be merged with the gender index.
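
A small sketch with made-up index entries (plain Python tuples standing in for B-Tree leaf entries) makes the difference visible; the only thing that matters here is which address streams arrive in row order:

    # Hypothetical index entries as (key, row_address) pairs, in the key
    # order a B-Tree scan would return them.

    # Single-column index on department: for the one key 'sales', the
    # addresses come back already in row-address order.
    dept_index = [('sales', 2), ('sales', 5), ('sales', 9)]

    # Composite index on (department, salary): scanning every 'sales' entry
    # returns addresses in salary order, which is not row-address order.
    dept_salary_index = [(('sales', 40000), 9), (('sales', 55000), 2), (('sales', 70000), 5)]

    # Single-column index on gender, again in row-address order for one key.
    gender_index = [('m', 2), ('m', 7), ('m', 9)]

    def merge_in_row_order(a, b):
        # Linear intersection of two address streams; only valid when both
        # streams are already in ascending row-address order.
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                out.append(a[i]); i += 1; j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return out

    sales_rows = [addr for _, addr in dept_index]                  # [2, 5, 9] - in row order
    sales_via_composite = [addr for _, addr in dept_salary_index]  # [9, 2, 5] - out of order
    male_rows = [addr for _, addr in gender_index]                 # [2, 7, 9]

    print(merge_in_row_order(sales_rows, male_rows))                   # [2, 9]
    print(merge_in_row_order(sorted(sales_via_composite), male_rows))  # same result, extra sort first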

Worse still, while index scans that retain address order can be easily decomposed into an arbitrary number of address partitions across the address space, with all partitions scanned and merged in parallel, this is no longer immediately possible if one or more of the indexes has to be sorted or mapped first.

Therefore, the idea that a row store can be made analytical by overlaying covering indexes is only effective where those indexes precisely match the query requirements – but to cover all those requirements typically requires an awful lot of indexes.

Location, location, location

Locality of reference is one of the most important factors in mitigating the latency and finite size of a working area such as memory and storage. From the perspective of the CPU there are many types of locality of reference, some of which are quite esoteric and which can arise from CPU optimisation techniques such as branch prediction and speculative execution; but by far the two most significant types of locality of reference are spatial locality and temporal locality.

Spatial locality is the proximity of address locations for data referenced by successive CPU instructions. With this type of locality, access latency can be amortized by retrieving multiple data items in a single fetch cycle and using good spatial locality to ensure that all the data fetched is useful. If the data is not useful, because of poor spatial locality, then more fetch cycles will be required to process the same data.

Spatial locality is only beneficial where there is latency. If there is no latency at all, and every address location is equally randomly accessible from the CPU, then spatial locality is utterly irrelevant. In that case it does not matter whether the successive data is near to or far away from the current data, as all address locations are equally accessible. This was how memory behaved before the 1990s – spatial locality was of no concern at all within memory.

Temporal locality is the proximity in time for the re-use of data – this is what caching is all about. With this type of locality, access latency can be avoided by re-using the data fetched previously. With poor temporal locality, data that could have been re-used is lost from a working area and must be fetched again. Data is typically lost because it has been overwritten or evicted by other data; therefore temporal locality is only beneficial where there is a finite size to the working area. If the working area is of infinite size then data is unlikely to be lost and can always be re-used.
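
A toy illustration of temporal locality within a finite working area (the working area here is a small LRU dictionary, chosen purely for illustration):

    from collections import OrderedDict

    class LRUCache:
        # A tiny working area of fixed size; the least recently used entry
        # is evicted when the area overflows.
        def __init__(self, capacity):
            self.capacity = capacity
            self.data = OrderedDict()
            self.hits = self.misses = 0

        def get(self, key, fetch):
            if key in self.data:
                self.data.move_to_end(key)     # re-use: temporal locality pays off
                self.hits += 1
                return self.data[key]
            self.misses += 1                   # evicted or never seen: fetch again
            value = fetch(key)
            self.data[key] = value
            if len(self.data) > self.capacity:
                self.data.popitem(last=False)  # evict the least recently used entry
            return value

    cache = LRUCache(capacity=3)
    for key in [1, 2, 1, 1, 3, 4, 1, 2]:       # a reference pattern with some re-use
        cache.get(key, fetch=lambda k: k * 10)
    print(cache.hits, cache.misses)            # 3 hits, 5 misses for this pattern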

Effectively, storage has an infinite size because it persists all of the data (by definition) and therefore temporal locality is irrelevant to storage. Meanwhile, memory has always had a finite and restrictive size and continues to do so in most situations. With today’s larger memory sizes it is sometimes possible to retain all of the data within memory; but this is still the exception rather than the rule and temporal locality remains important within memory. Of course, it is still crucially important for the CPU cache where the size of the working area is still extremely limited.

Good spatial locality mostly arises from data structure; whereas good temporal locality mostly arises from process. There is some connection between the two, because algorithms are related to the data structures on which they operate; but it is reasonable to assert that structure is more important in determining spatial locality than process is. This also means that spatial locality is inherited from its source (because the structure is retained) whereas temporal locality is not propagated (because process is localised to a working area). Indeed, we would expect different temporal locality between the CPU cache and memory, since good temporal locality in CPU cache is specifically used to avoid access to memory. Therefore we see that providing good spatial locality in storage will propagate good spatial locality into memory and beyond; whereas good temporal locality must be provided locally by the algorithms operating on CPU cache and memory.

The latency we now see in memory makes spatial locality more important than it was before the 1990s. These days, memory is no longer randomly accessible, and good spatial locality within memory structures is a critical aspect of performance.
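
One rough way to see this from user code, assuming NumPy is available (exact numbers will vary by machine): summing a contiguous row of a row-major matrix is typically several times faster than summing a strided column, purely because of where the values sit in memory.

    import timeit
    import numpy as np

    # A row-major (C-ordered) matrix: each row is contiguous in memory,
    # while consecutive elements of a column sit a whole row apart.
    a = np.random.rand(5_000, 5_000)

    row = a[0, :]   # contiguous view: good spatial locality
    col = a[:, 0]   # strided view: every element lands on a different cache line

    print(timeit.timeit(row.sum, number=1_000))
    print(timeit.timeit(col.sum, number=1_000))   # typically noticeably slower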

We are familiar with row stores and column stores offering spatial locality for very specific query patterns. Row storage offers good spatial locality within a specific row (although it doesn’t help in finding that row); while column storage offers good spatial locality within a column (although it doesn’t help in eliminating irrelevant rows). Clearly, neither is likely to offer good temporal locality, and both require indexes built on top to overcome their spatial limitations.

But what about the spatial and temporal localities of those indexes such as the venerable B-Tree?

You might think that B-Trees offer good spatial locality, but it’s not that simple. While a leaf block, which contains many similar keys clustered together, offers good spatial locality across the index entries that are relevant to the query, the root block can only contain one entry that is precisely relevant. Thus we see poor spatial locality at the root of the tree, improving as we progress towards its leaves; it is not a uniform virtue across the entire structure. Indeed, the need to visit multiple independently located blocks for every tree navigation rather dilutes the spatial locality of the structure as a whole.

Meanwhile, the temporal locality of the tree works inversely to its spatial locality; while the root block will be re-used with every query, it is very unlikely that any given leaf block will be re-used by successive queries.
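
A toy simulation of that gradient (fanout and depth picked arbitrarily): over a run of random lookups the single root block serves every query, while nearly every leaf access lands on a block that has not been touched before.

    import random

    # A toy three-level tree: 1 root block, 100 branch blocks, 10,000 leaf
    # blocks, each lookup touching exactly one block per level.
    FANOUT = 100
    LEVELS = 3

    def blocks_touched(key):
        # The block visited at each level (root, branch, leaf) for this key.
        return [key // (FANOUT ** (LEVELS - level)) for level in range(LEVELS)]

    random.seed(0)
    touched = [set() for _ in range(LEVELS)]
    lookups = 1_000
    for _ in range(lookups):
        key = random.randrange(FANOUT ** LEVELS)   # one of a million possible keys
        for level, block in enumerate(blocks_touched(key)):
            touched[level].add(block)

    for level, blocks in enumerate(touched):
        # The fewer distinct blocks a level needs, the better its temporal locality.
        print(f"level {level}: {len(blocks)} distinct blocks over {lookups} lookups")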

Hence the spatial and temporal localities of reference have opposite gradients, such that the middle layers of a B-Tree offer good locality of reference in neither dimension and consistently yield poor locality in both at the same time. While the root block held in memory works well and the leaf blocks held on storage work well, the rest of the tree behaves poorly regardless of whether it sits in memory or on storage.