Category Archives: Relational Database


Event sourcing is a design pattern for retaining an exhaustive history of events such that the state of the system now (or at some point in the past) can be derived by a replay of those events. The goal of event sourcing is to derive the current system state from the combination of past events with current application logic such that changes to logic are applied retrospectively and become implicitly reflected in the current system state.

Conceptually, event sourcing only requires a single chronological log to record every event about every entity and a primary key generator for creating primary keys for new entities. Furthermore, in extremis, application state need never be persisted since it can always be recreated – you can think of this as lazy evaluation of system state. You can find a thorough discussion of event sourcing here.
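To make this concrete, here is a minimal sketch in Python of that single chronological log plus a primary key generator, with system state lazily evaluated by replay. All names here are illustrative – no particular event store product is implied.

```python
from dataclasses import dataclass

# One event about one entity; the fields are purely illustrative.
@dataclass
class Event:
    entity_id: int
    attribute: str
    value: object

class EventStore:
    def __init__(self):
        self._log = []        # a single chronological log for every entity
        self._next_id = 0     # primary key generator for new entities

    def new_entity(self):
        self._next_id += 1
        return self._next_id

    def record(self, event):
        self._log.append(event)

    def state_of(self, entity_id):
        # Lazy evaluation of state: replay every event about this entity.
        state = {}
        for e in self._log:
            if e.entity_id == entity_id:
                state[e.attribute] = e.value
        return state
```

Note that nothing other than the log is ever persisted here; the current state of any entity is always derived on demand.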

There may appear to be analogies here to audit logs and write-ahead logging in databases; but closer inspection shows that an audit log only provides a history of events for a known system state; and a write-ahead log provides a limited history of recent changes which is retained only until the whole system state can be reliably persisted.

In principle, a relational database could apply an event sourcing design pattern such that every definition command, every insert, every update and every delete statement is recorded as an autonomous event so that the state of a row is recreated by replaying all of the events related to it. Of course, it is time consuming and inefficient to reconstruct everything from first principles for every query and therefore databases retain current state to avoid the reconstruction costs. Indeed, reconstruction of system state typically only occurs at database start-up when system state is recovered from any pending write-ahead or recovery logs. Hopefully, a database doesn’t expect to change its internal processing logic that frequently, so any potential advantages are enormously outweighed by the disadvantages of a wholly lazy evaluation.

While more volatile applications may choose an event sourcing pattern to effect a robust delivery environment for rapidly evolving requirements, performance considerations will often dictate a hybrid approach whereby current system state is recreated by applying recent events (rather than all events) to recent system state (rather than an empty state).
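A toy illustration of that hybrid approach, assuming (purely for illustration) that events carry a sequence number and that a snapshot records the sequence number at which it was taken:

```python
def replay(snapshot, events, since):
    """Recreate current state by applying only the events newer than the
    snapshot's watermark to the snapshot, rather than replaying the whole
    history against an empty state. Event shape is illustrative:
    (sequence_number, attribute, value)."""
    state = dict(snapshot)
    for seq, attribute, value in events:
        if seq > since:
            state[attribute] = value
    return state
```

The snapshot amortises the replay cost; the tail of recent events preserves the ability to recover the very latest state.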

We also have to be careful to understand what an event is. Simply recording a change in state (such as a new attribute value) is not really event sourcing; whereas recording the cause (such as a command) that gave rise to that change is. While the latter allows system state to be fully revised according to changes in application logic; the former only allows us to recreate state at any given point in time – yet this is still very useful within a database for the purposes of auditing and analytics.

Much like Memento (the film), which offers a delightfully ambiguous interpretation of reality, event sourcing allows a new reality to be created simply through an alternative interpretation of events – but, thankfully, databases are expected to hold a more definitive view of their world.

Black and white

The constant background debate about relational versus NoSQL has become entrenched in dogma and misses many fundamental aspects of data management.  Fortunately, the notion that the difference between SQL and NoSQL has anything to do with SQL has already been thoroughly dispelled – but there still remains an active debate about the virtues or otherwise of schemas and ACID transactions.

Let’s deal with schemas first. A schema is just an abstraction of the data and is not a method of storage. The schema is simply there to provide a conceptual model for data access and exists as a logical layer above the storage structure in much the same way that a file system provides a logical abstraction of the physical storage system.

While the relational model was conceived to provide an intuitive framework for understanding real-world entities and a self-documenting structure for supporting queries, the database industry made the mistake of closely coupling that model with the storage organisation. This has resulted in the unnecessary polarisation between row stores and column stores and the horrible complexity and inertia typically experienced with traditional relational databases. In reality, the underlying database storage structure should be schema-less, and the relational model should simply be an optional schema layered upon it.

With a schema-less storage structure that efficiently resolves arbitrary access paths between stored elements, it does not matter whether the logical access model is relational or not. Indeed, there is no reason why the relational schema cannot co-exist with other models in the same database at the same time.

ACID or not? Actually, there’s a false premise to that question. ACID is not a single property – it bundles atomicity, consistency, isolation and durability, and these are orthogonal features which can be varied independently. Should a sequence of updates be applied atomically as a single change? Does durability of a change need to be guaranteed? How consistent should these changes be across the database infrastructure? These are not the same question and that last one actually requires more than a simple binary answer because consistency is really a spectrum of possibilities ranging from none, to eventual, to immediate.

In fact the requirements for atomicity, durability and consistency can vary by individual transaction. It is easy to imagine an application where a class of transactions that require durability coexist with a class of transactions that do not require it at all. Similarly for atomicity and consistency.  So it is not even adequate to configure these guarantees globally across the whole database – let alone hardwire them into the database infrastructure itself, which is the typical state of affairs.
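As a sketch of what per-transaction guarantees might look like – the names and knobs here are entirely hypothetical and do not reflect any real database’s API:

```python
from dataclasses import dataclass

# Hypothetical per-transaction guarantees: this only illustrates that
# atomicity, durability and consistency are orthogonal knobs which could,
# in principle, be set per transaction rather than globally.
@dataclass
class TxnGuarantees:
    atomic: bool = True             # apply all updates as one change?
    durable: bool = True            # must the change survive a crash?
    consistency: str = "immediate"  # "none" | "eventual" | "immediate"

audit_write = TxnGuarantees()                          # full guarantees
metric_sample = TxnGuarantees(atomic=False,            # fire-and-forget
                              durable=False,
                              consistency="eventual")
```

The point is simply that both classes of transaction can coexist in the same application against the same database.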

While the relational model may choose to enforce global atomicity, durability and immediate consistency for transactions – the database engine need not.

Even durability is not a simple yes or no.

Does committing a change to a disk make that change durable? Not if the disk fails. Is mirrored storage sufficient? How many mirrors are deemed safe? Is a single-site sufficient? How distant do these replicas need to be? One man’s durable is another man’s ephemeral.

Forget the arguments about choosing black or white – we need grey-scale.

Cover up

A traditional relational database falls into one of two camps – it is either a row store or a column store. These are both intuitive arrangements for storing a table structure, where either each row or each column is stored contiguously. The idea being that if you want to fetch a row, you will find it in one place or if you want to scan a column you will find all of the column values in one place.

Of course, the problem with both storage models is that they are overly simplistic: neither helps you find the relevant rows in a selective OLTP query, nor eliminate irrelevant rows in a collective analytical query. The only way to make these storage models effective is to layer indexes on top or to throw more hardware at them.

Indexes are very familiar in a row store – but they also appear in column stores as row projections where multiple columns are stitched together to avoid the high cost of row re-construction; and more familiar index structures can also be found in some column stores too, for filtering columns and for joining between tables.

In its most general sense, any index can be considered to be a store of key and address pairs where both the key and the address may be stored either explicitly or implicitly for an index entry. So for example, in a key comparative structure (such as a B-Tree, LSM tree, fractal tree, etc.) both the key and the address are explicitly recorded in the index entry; whereas in a bit map, both the key and the address are implied from their position within the index. Naturally, a variety of possible index structures covers all four possible combinations as shown in the table below (which is not an exhaustive list of structures by any means).

The advantage of an index with an implicit address is that the row address order will be preserved and scans across multiple indexes against the same table can be efficiently resolved to yield a combined result. Thus with bit map indexes, multiple indexes can be logically conjoined or disjoined for the same table without any need to sort/merge explicit addresses. With these types of indexes there is rarely any need for a composite key index because separate indexes can be easily combined into a single composite result.
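A quick illustration of that conjunction, using Python integers as stand-in bit maps (bit i set means row i holds the key; the departments and genders are made up):

```python
# Each bit map is keyed by value; bit i set means "row i has this value".
dept_index = {10: 0b10110, 20: 0b01001}      # department number -> rows
gender_index = {"M": 0b00111, "F": 0b11000}  # gender -> rows

def rows_of(bitmap):
    """Expand a bit map back into a list of row addresses."""
    return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]

# Male employees in department 10: a single bitwise AND, with no sorting
# or merging, because both bit maps share the same implicit row order.
combined = dept_index[10] & gender_index["M"]
```

The same applies to disjunction (bitwise OR), and to combining any number of separate single-key indexes into one composite result.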

An index with an explicit address will also typically retain the address order for a single and specific key value (since there is typically no advantage to changing the address order). Thus, with an inverted list we can easily merge the results from different lists; and similarly for a B-Tree, where we choose one particular key from each index we can merge the results directly because the address order is preserved.

However, when we attempt to scan an index with an explicit key and address for multiple keys, the addresses may no longer be in row address order and we have to either sort the addresses or create an ordered address map (such as a bit map) for each index to be able to merge addresses across indexes on the same table. Naturally, as a query covers more of the key domain, the more addresses must be mapped or sorted prior to merge and the less trivial this task becomes.

However, the important point here is that as soon as a query covers more than one key in an explicit key and address index, the overhead on index merging changes significantly.

Why is this important? Well, it is widely assumed, that an index built with composite keys (that is, multiple keys concatenated together) covers and removes the need for indexes that might be built on a subset of those keys. A classic example of this is building a compound B-Tree index on the department number and employee salary columns in an employees table. This index might be used to find employees in a specific department in a certain salary range. Since we have already included the department number, this index can also be used to find all employees in a specific department regardless of their salary. Therefore, there is no need to build a separate department number index because the composite index will suffice. Right? Not quite. Whereas an index built solely on department number can be easily merged with an index built solely on employee gender to find all male employees in a specific department; the composite index requires the employees in the queried department to be sorted or address mapped before they can be merged with the gender index.
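A toy illustration of that problem, with made-up rows and addresses:

```python
# Entries of a composite (department, salary) index: key -> row address.
# Scanning one whole department walks many (dept, salary) keys, so the
# addresses come back in salary order, not in row-address order.
composite = [((10, 40000), 7), ((10, 55000), 2), ((10, 90000), 5),
             ((20, 30000), 1)]
gender_rows = [2, 3, 5, 6]   # single-key scan: already in address order

dept10_addresses = [addr for (dept, _), addr in composite if dept == 10]
# -> [7, 2, 5]: out of address order, so a sort (or an address map such
# as a bit map) is needed before merging with the gender scan.
merged = sorted(set(dept10_addresses) & set(gender_rows))
```

With a dedicated single-key department index the addresses would already be in order and the sort would be unnecessary.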

Worse still, while index scans that retain address order can be easily decomposed into an arbitrary number of address partitions across the address space with all partitions scanned and merged in parallel, this is no longer immediately possible if one or more of the indexes has to be sorted or mapped first.

Therefore, the idea that a row store can be made analytical by overlaying covering indexes is only effective where those indexes precisely match the query requirements – but to cover all those requirements typically requires an awful lot of indexes.

A blunt tool

B-Trees are regularly used by relational databases to enforce primary key constraints.

Let’s look at what we know about a primary key…

  • It is unique
  • The semantic ordering of a primary key is irrelevant (we do not care whether one primary key is more or less than any other)

But what properties does a B-Tree have in these respects?

  • The structure does not implicitly enforce uniqueness (you have to find a key first and then make a separate judgement).
  • The structure rigorously enforces semantic ordering upon the keys.

At first glance, the B-Tree does not sound like the best tool for the job and on closer inspection things don’t get any better…

For each key inspected in random order, we have to visit a block at every level of the B-Tree and for each of those blocks that we visit we then have to perform either a binary search across the block (at best) or iterate sequentially through it (at worst). So even for a modest population of 1 billion keys, using a B-Tree with 3 levels and 1000 entries per block will require somewhere between 30 and 3000 memory transfers just to find one key – and that’s on a good day i.e. when it happens to be completely cached within several gigabytes of memory.
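The arithmetic behind those numbers can be captured in a back-of-envelope cost model:

```python
import math

def btree_lookup_transfers(keys, entries_per_block):
    """Rough memory transfers for one random key probe: one block is
    visited per level, then a binary search (best case) or a sequential
    scan (worst case) is performed within each block."""
    levels, capacity = 0, 1
    while capacity < keys:            # levels needed to cover all keys
        capacity *= entries_per_block
        levels += 1
    best = levels * math.ceil(math.log2(entries_per_block))
    worst = levels * entries_per_block
    return best, worst
```

For 1 billion keys and 1000 entries per block this gives 3 levels, and hence between 3 × 10 = 30 and 3 × 1000 = 3000 transfers per probe.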

Of course, the B-Tree is available to service queries too. But the semantic ordering enforced by the B-Tree index is irrelevant to a query navigating a primary/foreign key relationship and the key order doesn’t assist with either nested loop or hash join processing. While the key order can be exploited by a merge join, the merge join is rarely a strategy of first choice and that limited advantage evaporates if additional query predicates need to be applied to the same table that holds the primary key index.

In essence, the B-Tree contains redundant information which is maintained at a high cost but yields minimal benefit. That’s not a good cost/benefit ratio.

Permit or forgive?

This post is about row update locking within a relational database instance.

But before we delve into the nuances of locking, it is important to lay out the context of this discussion.

We will be only talking about locking for the purposes of resolving competing updates to the same data. The practice of a writer locking out reads for the same data (or vice versa) is unnecessary; similarly locking a whole table for one or more row updates is beyond rudimentary. Both practices have a devastating effect on performance amongst concurrent users and frankly, if you are running a shared database that invokes either of these dubious practices then it’s time to move on, you can do a lot better. OK, now that’s out of the way, let’s get into the detail of row update locking.

Locking can be pessimistic or optimistic – in the former we prevent any nastiness happening from competing updates and in the latter we clean up any nastiness that did happen because we didn’t stop it. In that sense, locking can be thought of as collision prevention (pessimistic locking with no nastiness) or collision detection (optimistic locking with nastiness clean-up).

There has been some debate over which type of locking/collision management affords the best performance and because it depends on so many factors, databases typically offer a choice. But how do you choose?

The first factor to be considered is the amount of real contention that goes on. If any two users are likely to update the same row at the same time then prevention is more efficient than clean-up; whereas if real contention is exceptional then it may be more efficient to deal with the consequences of a collision rather than prevent it.

It’s a bit like operating traffic lights at a road junction. If the traffic is sufficiently light you can get away with turning the lights off and clearing up after occasional accidents – but if the traffic gets heavy you’d better turn them back on to avoid carnage.

If you really don’t know what the likelihood of a collision is, then it’s probably better to opt for prevention for all the reasons about to be outlined.

Collision detection (optimistic locking) requires the application using the database to deal with the consequences of a collision. If updates from two users do collide, then one of them has to be denied their update and informed that the data has changed so that they can choose to re-submit an action, change an action or cancel it entirely. Realistically, the database cannot decide on the user’s behalf and the user or application will have to resolve it.

Collision detection is easy for a database that performs an update-in-place (see previous post). In this case, writers place a new version number against each row updated so that a process can take note of a row version at read time and see if it has changed at update time (which implies a collision happened since the read). But collision detection gets a whole lot harder for a database that appends updates. Here, the original version of the data never changes and the update is placed elsewhere. Hence a writer has to go look and see if there have been any colliding appends since the original version was read and if there is a lot of update activity this can become a costly overhead. Plus what happens if the collision occurs between the collision check and the append? We now have to marshal access to the append area to prevent this and it starts to sound a lot more like collision prevention.
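A minimal sketch of version-based collision detection under update-in-place. In a real database the check and the update would be a single atomic statement (e.g. an UPDATE qualified by the expected version); here a plain dictionary stands in for the store and the names are illustrative:

```python
class VersionedStore:
    """Toy update-in-place store: each row carries a version number so a
    writer can detect that the row changed since it was read."""
    def __init__(self):
        self.rows = {}   # row_id -> (version, value)

    def read(self, row_id):
        return self.rows[row_id]   # caller takes note of the version

    def update(self, row_id, expected_version, new_value):
        version, _ = self.rows[row_id]
        if version != expected_version:
            return False           # collision: caller must re-read and decide
        self.rows[row_id] = (version + 1, new_value)
        return True
```

As the post notes, the database cannot resolve the failed case by itself – the application or user must re-submit, change or cancel the action.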

Ideally, any form of collision prevention (pessimistic locking) would be sufficiently fast and lightweight such that it presents a marginal cost in cases where genuine contention does not occur. But pessimistic locking is often considered a major overhead for a number of reasons.

It may be that lock requests become serialised because they are managed by a centralised lock manager and processes that have no natural contention are then forced to wait on each other.

It may be that locks are such a limited resource that processes have to wait for their availability or enlarge the granularity of each lock in order to use fewer of them. Again, this creates artificial contention which will hit performance.

However, these performance issues arise from the locking implementations used and not from the principle of pessimistic locking itself.

Then there are deadlocks. A deadlock occurs where two or more processes hold mutual locks between each other such that no process can proceed, unless one or more processes relinquish one or more locks allowing the other processes to continue.

If there’s only one knife and one fork on a dining table and I grab the fork at the same time you grab the knife, then neither of us can eat until one of us gives up an item of cutlery. We could be obstinate and both starve; or we could be altruistic and both starve; or we could defer to the choice of a third party in which case only one of us will starve. In the same way, deadlock detection has to choose which process will be forced to relinquish its locks and will use a graph of processes and locks to determine which process has the least to lose and the most to offer. This can be complicated and time consuming if the graph is large with lots of locks to consider.

However, if locks are sufficiently granular then deadlocks can only occur where there is genuine conflict and then it is quite reasonable to back out one or more of the conflicting parties in that case.

But if each party requests all of its locks through a single atomic request which is either granted or denied in whole then deadlocks cannot occur between those parties. In our dining example, we can each make a single request – for both a knife and a fork; one of us will succeed and the other will fail – but there is no deadlock. In the context of updating rows in a table, the rows to be updated are identified first and then locked atomically – rather than requesting locks incrementally. Deadlocks could still arise where multiple tables are being updated because each set of locks for each table will be requested separately. However, the deadlock graph is much simpler because it need only consider a handful of table dependencies rather than thousands of individual lock dependencies and this makes deadlock detection a much faster proposition.
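A sketch of such an all-or-nothing lock request – a hypothetical lock manager, where a real implementation would queue or retry rather than simply deny:

```python
import threading

class AtomicLockManager:
    """Grant a whole set of row locks in one request, or none of them.
    Two parties that each lock via a single atomic request cannot
    deadlock with each other: one wins and the other is simply denied."""
    def __init__(self):
        self._mutex = threading.Lock()
        self._held = {}               # row_id -> owner

    def acquire_all(self, owner, row_ids):
        with self._mutex:
            if any(r in self._held for r in row_ids):
                return False          # denied in whole: no partial grant
            for r in row_ids:
                self._held[r] = owner
            return True

    def release_all(self, owner):
        with self._mutex:
            self._held = {r: o for r, o in self._held.items() if o != owner}
```

In the dining example, each diner makes one request for both the knife and the fork; one request succeeds, the other fails, and nobody holds half the cutlery forever.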

Plus requesting locks in a bulk manner will minimise the number of requests made – which in turn reduces request serialization and amortizes the cost of lock requests.

Then there are a couple of classic red herrings often raised about pessimistic locking…

A user who can issue ad-hoc SQL, locks a set of rows for update and then goes off to lunch before finishing his transaction – leaving anybody else needing to update the same rows waiting until he returns later in the afternoon.

If this is just a test system, then who cares? If this is a production system then who allows users access to the SQL prompt anyway? If you do that kind of thing on a production system then it is probably already toast for a whole bunch of reasons unrelated to locking.

Another objection is that long-lasting web transactions may disconnect and leave table rows locked.

The clue to the misdirection here is the phrase long-lasting web transaction. Nobody (of sane mind) designs a web transaction to be long-lasting or allows it to span across temporary database connections. Web transactions should be encased in procedures such that they either happen or don’t – they can’t partially happen. Browsing available stock, selecting a stock item and purchasing a stock item are all separate and independent transactions. You never lock rows in a database for the period between adding an item to a cart and paying for it.

This all rather begs the question of why would you use collision detection (optimistic locking) with append semantics if you also have a fast and efficient method of preventing collisions? You wouldn’t.

It may be easier to ask for forgiveness rather than permission – but only if you’re not doing it all the time.


There are two very different strategies for dealing with updates and deletes in a relational database. Update-in-place semantics will directly overwrite data in-situ with a new version of that data; while append semantics will simply append the new version to a chronological log and leave the old data untouched. The update-in-place approach was extensively used in early database architectures but more recent database systems tend to opt for append semantics. Let’s look at the reasons why that may be.

When we append updates rather than overwrite data we are able to keep the old and consistent version of the data while the update is in progress and this affords a robust recovery mechanism should the update fail. We are able to easily watermark the updates and we can choose to move that watermark only when we know that the updates have been completed and successfully persisted in storage. Whereas with update-in-place semantics we would have to write and retain the old version in a separate log before the update took place to be able to guarantee recovery from a failed update (assuming that a full recovery from a database backup copy is a little too rudimentary for most people).
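A toy append log with a watermark illustrates that recovery mechanism – nothing here reflects a real engine’s on-disk format:

```python
class AppendLog:
    """Toy append-only store: updates never overwrite; the watermark
    marks the last position known to be safely persisted, so recovery
    simply discards anything beyond it."""
    def __init__(self):
        self.entries = []    # (row_id, value) appended chronologically
        self.watermark = 0   # entries[:watermark] are durable

    def append(self, row_id, value):
        self.entries.append((row_id, value))

    def persisted(self):
        # move the watermark only once the batch is on stable storage
        self.watermark = len(self.entries)

    def recover(self):
        # a failed update leaves the old version intact: just truncate
        self.entries = self.entries[:self.watermark]
```

The old, consistent version of every row survives a failed update by construction, with no separate before-image log required.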

The use of append semantics allows multiple updates to be appended to the same area of storage rather than scattered at random locations across the database. This affords a huge performance benefit to updates because large numbers of small updates become coalesced into a smaller number of larger writes and this hugely reduces the load on fsync() operations – which are essential for durability in write-back cache arrangements and for avoiding anomalies arising from out-of-order write behaviour in modern disk systems.

Even on flash storage this type of large append write pattern has huge benefits over small random writes because much less of an erase block is wasted with each write. With a small random update I/O, a flash device has to copy the entirety of an updated block to a new empty location and remap the block’s location. Once a flash device fills, unused blocks must be explicitly erased before they can be re-used and on a busy device this garbage collection activity can impact performance even where it happens asynchronously ahead of time.

Append semantics show significant performance gains in other areas too, such as in transaction isolation. With update-in-place semantics either an explicit MVCC (multi-version concurrency control) scheme with separate before-update images must be managed; or worse still, read activity must lock out update activity on the same data to ensure consistent reads. While the use of read locks is a death knell to performance with concurrent users and is tantamount to eating small children in a public place; the overhead of managing before and after images also places a huge burden on throughput. However, none of this band-aid architecture is needed with append semantics because versioning is implicit and watermarks demarcate the boundaries between transactions.
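A sketch of how implicit versioning falls out of the append log: a reader pins a watermark and simply ignores anything appended after it (toy log shape, for illustration only):

```python
def read_at(entries, watermark, row_id):
    """Snapshot read under append semantics: return the newest version
    of a row at or below the reader's watermark. No read locks and no
    separate before-images: older versions are simply still in the log."""
    value = None
    for rid, v in entries[:watermark]:
        if rid == row_id:
            value = v
    return value
```

Writers appending beyond the watermark never disturb a reader’s consistent view.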

There’s also a slight deception to the term update-in-place. What happens when the version we are attempting to overwrite is smaller than the updated version? This usually leads to schemes such as row chaining where database rows are split to accommodate increasing row length. These chained row fragments may not even reside in the same storage block as the original row and this ultimately leads to further performance horror at query time.

Moreover, contemporary data management requirements commonly require a history of transactions to be retained for management, analysis and legal reasons. So why go out of your way to destroy that very history by overwriting it?

That all sounds great for append semantics, but are there any good reasons for choosing to update in place? The usual response here is that append semantics will require more storage space than update-in-place semantics because we are retaining a complete history of the updates. This is indeed true, but is it important?

On any very large database system dealing with machine generated or web data, the huge majority of data is derived from immutable inserts that never get updated and any data that can be updated (such as customer or account data) is insignificant by comparison. Besides, customer and account data is the very type of data we want to maintain a history for because these are the significant real-world entities on which the business depends.

At this point, someone will helpfully suggest that they might use a table which contains a single row which gets frequently updated for the sole purpose of maintaining a sequence number.  (Suddenly, a silence falls and a distant church bell can be heard to slowly toll…) Enough said.

Therefore, either the volume of updates is unlikely to significantly impact storage space on a large system where space is critical; or the system is too small for it to matter at today’s storage prices. Indeed, storage bandwidth is a lot costlier than storage capacity and update-in-place offers no benefit there at all.

Neither is query performance improved by update-in-place semantics since, with append semantics, all versions of a row can be equally accessible and resolvable through whatever structures are used to locate an original row. Indeed, query performance is likely to suffer more with update-in-place semantics because of row chaining.

So why do some database vendors continue to use update-in-place semantics? Good question.

A precedent embalms a principle. Benjamin Disraeli.

Too little, too late

If we were set the task of getting from the first floor to the penthouse suite at the top of a hundred storey apartment block as quickly as possible and we were given the choice of a fully-functioning express elevator that goes straight there or the stairwell – we would probably choose the former. We might like to think that we had optimized this task by choosing the elevator – but wasn’t the real optimization performed by the architect who made that elevator available to us?

In a relational database, the optimizer kicks in at query time – but by then it’s already way too late to really optimize the query. When executing a query, either the optimal access paths exist for that query or they don’t. Horse gone, door firmly bolted.

In other words, optimization has to begin a lot earlier than at query time – by making the optimal access structures available to the queries that need them.

Today’s relational databases rely on humans to do that optimization ground work – by designing indexes, by partitioning and by de-normalizing data schemas. Optimization is a joint effort requiring human design input and automated query planning. Without sufficient human input, the query optimizer is left to choose the least painful of the available options; but just because something happens to be the least painful doesn’t mean it isn’t painful.