Monthly Archives: February 2011

More of the same

Normalization is bad for queries.


But consider the classic example of the employee and department tables, where employee records join to the department table through the department number, as illustrated below. Here we see that employee Anita Tickles works in the Development department.

But does this normalization differ so much from dictionary coded compression? In effect, the “Development” symbol has been compressed down to the more compact dictionary index of “30” and we know that compression is typically a benefit to query performance. So why isn’t normalization good for queries too?
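The parallel can be sketched in a few lines of Python (the department numbers and employee rows are invented for illustration):

```python
# Dictionary-coded compression of a repeated column value.
# Department numbers and employee rows are invented for illustration.
departments = {10: "Accounting", 20: "Sales", 30: "Development"}

# Normalized: each employee row stores only the compact dictionary index.
employees = [("Anita Tickles", 30), ("Joe Bloggs", 20), ("Jane Doe", 30)]

def department_of(employee):
    """Resolving the join is just a dictionary lookup - i.e. decompression."""
    name, dept_no = employee
    return departments[dept_no]

print(department_of(employees[0]))  # Development
```

The department table plays exactly the role of a compression dictionary: the long symbol "Development" appears once, and every referencing row carries only the index "30".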

Now consider the alternative to a join, where we de-normalize the tables (effectively decompressing the data) into a single table like so

Now we have larger rows for each employee and the size of the de-normalized table is actually larger than the sum of the normalized tables. That is not going to be a good outcome if we have to scan the employee table. Furthermore, if we have an index on the department name then that index now has more rows in it and we all know how indexes just love to suck up yet more data.

Well, de-normalization doesn’t look that promising, but let’s look at some numbers from the TPC-H benchmark data just to see the real impact of this. Consider the Customers, Orders and LineItems tables

With a very modest scale factor of one, these tables have the following approximate row counts and row sizes.

Hence the relative scan costs between the normalized tables and their de-normalized counterpart, and the relative index sizes for a customer key index in the normalized table versus the de-normalized table, would be as follows
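Since the comparison table is not reproduced here, a rough sketch gives a feel for the magnitudes. The row counts below are the approximate TPC-H scale factor 1 cardinalities; the row sizes are my own rough assumptions, not figures from the benchmark:

```python
# Approximate TPC-H scale factor 1 row counts; the per-row byte sizes
# are rough assumptions for illustration, not figures from the spec.
rows = {"customer": 150_000, "orders": 1_500_000, "lineitem": 6_000_000}
bytes_per_row = {"customer": 180, "orders": 110, "lineitem": 120}

# Normalized: scan each table separately.
normalized = sum(rows[t] * bytes_per_row[t] for t in rows)

# De-normalized: every line item row carries its order and customer columns.
denorm_row = sum(bytes_per_row.values())
denormalized = rows["lineitem"] * denorm_row

print(f"scan cost ratio:  {denormalized / normalized:.1f}x")
# A customer-key index grows from one entry per customer
# to one entry per line item.
print(f"index row ratio: {rows['lineitem'] // rows['customer']}x")
```

With these assumptions the de-normalized table is nearly three times the size of the normalized tables combined, and the customer key index holds forty times as many entries.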

These are very significant increases for the de-normalization strategy in both cases and will hit performance accordingly. De-normalization actually looks like a very poor strategy; yet experience with today’s relational databases shows that it delivers better query performance than joining normalized (compressed) data.

So why do joins perform so badly? So badly, in fact, that we are prepared to bloat the data to the extent seen above simply to avoid them. The answer lies in the join strategies used, and in subsequent posts we will take a look at what is wrong with these join strategies and why they perform so dismally.

Location, location, location

Locality of reference is one of the most important factors in mitigating the latency and finite size of a working area such as memory and storage. From the perspective of the CPU there are many types of locality of reference, some of which are quite esoteric and which can arise from CPU optimisation techniques such as branch prediction and speculative execution; but by far the two most significant types of locality of reference are spatial locality and temporal locality.

Spatial locality is the proximity of address locations for data referenced by successive CPU instructions. With this type of locality, access latency can be amortized by retrieving multiple data items in a single fetch cycle and using good spatial locality to ensure that all the data fetched is useful. If the data is not useful, because of poor spatial locality, then more fetch cycles will be required to process the same data.
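The amortization can be made concrete with a toy model. The 64-byte line and 8-byte item sizes below are assumptions chosen to match a typical cache line, nothing more:

```python
# Toy model: count the distinct cache lines touched by an access pattern.
LINE_BYTES, ITEM_BYTES = 64, 8             # assumed sizes
ITEMS_PER_LINE = LINE_BYTES // ITEM_BYTES  # 8 items arrive per fetch cycle

def fetches(indices):
    """Number of distinct cache lines a sequence of item accesses touches."""
    return len({i // ITEMS_PER_LINE for i in indices})

n = 1024
sequential = fetches(range(n))             # good spatial locality
strided = fetches(range(0, n * 8, 8))      # one useful item per line fetched

print(sequential, strided)  # 128 vs 1024
```

Both patterns read the same number of items, but the sequential scan makes every fetched line fully useful, while the strided scan wastes seven-eighths of every fetch and needs eight times as many fetch cycles.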

Spatial locality is only beneficial where there is latency. If there is no latency at all, and every address location is equally randomly accessible from the CPU, then spatial locality is utterly irrelevant. If this is the case, then it does not matter whether the successive data is near to or far away from the current data, as all address locations are equally accessible. This was how memory behaved before the 1990s – spatial locality was of no concern at all within memory.

Temporal locality is the proximity in time for the re-use of data – this is what caching is all about. With this type of locality, access latency can be avoided by re-using the data fetched previously. With poor temporal locality, data that could have been re-used is lost from a working area and must be fetched again. Data is typically lost because it has been overwritten or evicted by other data; therefore temporal locality is only beneficial where there is a finite size to the working area. If the working area is of infinite size then data is unlikely to be lost and can always be re-used.
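A minimal LRU cache makes the point: a finite working area rewards prompt re-use and punishes anything else with eviction and a re-fetch. This is only a sketch of the idea, not any particular database's cache:

```python
from collections import OrderedDict

class LRUCache:
    """A finite working area: poor temporal locality causes evictions,
    forcing data to be fetched again from the slower tier."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()
        self.hits = self.misses = 0

    def get(self, key, fetch):
        if key in self.data:
            self.hits += 1
            self.data.move_to_end(key)         # mark as recently used
        else:
            self.misses += 1
            self.data[key] = fetch(key)        # fetch from the slower tier
            if len(self.data) > self.capacity:
                self.data.popitem(last=False)  # evict least recently used
        return self.data[key]

cache = LRUCache(capacity=2)
for key in ["a", "b", "a", "c", "a"]:          # "a" is re-used promptly
    cache.get(key, fetch=str.upper)
print(cache.hits, cache.misses)                # 2 hits, 3 misses
```

The prompt re-use of "a" turns two of its three accesses into hits; "b", touched once and never again, is simply evicted. With infinite capacity there would be no evictions and temporal locality would not matter, which is exactly the point made above.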

Effectively, storage has an infinite size because it persists all of the data (by definition) and therefore temporal locality is irrelevant to storage. Meanwhile, memory has always had a finite and restrictive size and continues to do so in most situations. With today’s larger memory sizes it is sometimes possible to retain all of the data within memory; but this is still the exception rather than the rule and temporal locality remains important within memory. Of course, it is still crucially important for the CPU cache where the size of the working area is still extremely limited.

Good spatial locality mostly arises from data structure, whereas good temporal locality mostly arises from process. There is some connection between the two, because algorithms are related to the data structures on which they operate; but it is reasonable to assert that structure matters more than process in determining spatial locality. This also means that spatial locality is inherited from its source (because the structure is retained), whereas temporal locality is not propagated (because process is localised to a working area). Indeed, we would expect different temporal locality between the CPU cache and memory, since good temporal locality in the CPU cache is specifically used to avoid access to memory. Therefore, providing good spatial locality in storage will propagate good spatial locality into memory and beyond; whereas good temporal locality must be provided locally by the algorithms operating on the CPU cache and memory.

The latency we now see in memory makes spatial locality more important than it was before the 1990s. These days, memory is no longer randomly accessible, and good spatial locality within memory structures is a critical aspect of performance.

We are familiar with row stores and column stores offering spatial locality for very specific query patterns. Row storage offers good spatial locality within a specific row (although it doesn't help in finding that row); while column storage offers good spatial locality within a column (although it doesn't help in eliminating irrelevant rows). Clearly, neither is likely to offer good temporal locality, and both require the use of indexes built on top to overcome their spatial limitations.
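The asymmetry can be shown with the same toy relation in both layouts (the data is invented):

```python
# The same relation in a row layout and a column layout (invented data).
row_store = [("Anita", 30, 52_000), ("Joe", 20, 48_000), ("Jane", 30, 61_000)]

# Column store: one contiguous sequence per attribute.
names, depts, salaries = map(list, zip(*row_store))

# Scanning one column: the column store touches only the values it needs...
total = sum(salaries)
# ...while the row store must step over every field of every row.
total_from_rows = sum(row[2] for row in row_store)

# Fetching one whole row: trivial in the row store...
row = row_store[0]
# ...but the column store must gather one value from each column.
row_from_cols = (names[0], depts[0], salaries[0])

print(total, row)
```

Each layout answers one access pattern with perfect spatial locality and the other by touching data it does not need; neither helps with finding or eliminating rows, which is why indexes get built on top.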

But what about the spatial and temporal localities of those indexes such as the venerable B-Tree?

You might think that B-Trees offer good spatial locality, but it's not that simple. While a leaf block, which contains many similar keys clustered together, offers good spatial locality across the index entries that are relevant to the query, the root block can contain only one entry that is precisely relevant. Thus we see poor spatial locality at the root of the tree, improving as we progress towards its leaves; it is not a uniform virtue across the entire structure. Indeed, the need to visit multiple independently located blocks for every tree navigation rather dilutes the spatial locality of the structure as a whole.

Meanwhile, the temporal locality of the tree works inversely to its spatial locality; while the root block will be re-used with every query, it is very unlikely that any given leaf block will be re-used by successive queries.

Hence the spatial and temporal localities of reference have opposite gradients, such that the middle layers of a B-Tree cannot offer good locality of reference in either dimension and consistently yield poor locality in both dimensions at the same time. While the root block held in memory works well and the leaf blocks held on storage work well, the rest of the tree behaves poorly regardless of whether it sits in memory or in storage.
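The opposite gradients can be put into numbers with a toy three-level tree. The fanout of 1000 and the assumption that a range query uses about half a leaf block are mine, purely for illustration:

```python
# Toy model: a 3-level B-Tree with fanout 1000 under random queries.
FANOUT, LEVELS = 1000, 3
blocks_per_level = [FANOUT ** d for d in range(LEVELS)]  # 1, 1000, 1000000

for depth, blocks in enumerate(blocks_per_level):
    # Spatial locality: fraction of a block's entries one query uses.
    # Assume a range query uses ~half a leaf; inner blocks yield 1 entry.
    spatial = 0.5 if depth == LEVELS - 1 else 1 / FANOUT
    # Temporal locality: chance a random query re-visits this block.
    temporal = 1 / blocks
    print(f"depth {depth}: spatial {spatial:.3f}, temporal {temporal:.6f}")
```

Spatial locality rises from 0.001 at the root to around 0.5 at a leaf, while the chance of re-use falls from 1 at the root to one in a million at a leaf: the middle layer is mediocre on both counts at once.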


The phrase “agile relational database” is commonly viewed as an oxymoron.

This is not surprising when you consider the decisions and actions required to create or alter a table or column; or simply to introduce a new query. Whereas the logical schema definition is straightforward in itself, the implications that can arise from a change in the logical model are another matter.

Consider the simple act of creating a new table with its associated columns. Easy enough from the standpoint of the data model – just define the column names, their data types and any constraints required. From the requirements definition standpoint, we are already done.

But then we are forced to consider the physical implementation of this table if it is going to contain anything other than a trivial amount of data and allow queries to deliver results in a reasonably timely manner. We now have to consider one or more of the following

  • How many rows will be stored in the table?
  • Are row deletes likely?
  • What queries are expected against the table?
  • Does the table need to be partitioned to support the number of rows?
  • What partitioning scheme would need to be used?
  • Should heap storage or clustered storage be used?
  • What storage parameters are best for the table?
  • What columns need to be indexed?
  • In what order should columns appear in each index key?
  • What type of index is required in each case?
  • What storage parameters are best for each index?
  • Do the indexes need to be partitioned and how?
  • Are materialised aggregations required?
  • Do one or more tables need to be de-normalized to support join performance?

Worse still, each of the above questions can arise again each time we add a new column to a table, or even when we simply introduce a new query.

Moreover, the inertia created by these decisions increases as the data volumes grow and it becomes ever more difficult to change them.

Yet none of the questions above have anything to do with our logical data model; they are all physical design considerations inflicted upon the data model by the underlying database engine.

There is nothing quite like agility – and this is nothing like it.

Cache rich

Let’s complete the analogy of the hapless components retailer to see the effect of caching B-Trees in memory by housing copies of the card indexes in a building near to the workshop and using space within the workshop as a small CPU cache…

Now the retailer keeps a copy of all of his card indexes in a wooden shed which is situated right next to his workshop to avoid all of the lengthy round trips to the warehouse.

He hires another assistant to perform all of the fetching, copying and carrying operations between the shed and the workshop; but again, while the assistant is adept with numbers, he seems completely clueless with the alphabet.

The retailer is able to summon the assistant via a buzzer operated from the workshop. The assistant is very vigilant and the shed is close enough that, when the retailer presses the buzzer, it takes the assistant only a minute to appear at the workshop to receive his instructions; and each round trip from the workshop to the shed – going to a specific card index, copying some cards and returning to the workshop – takes only 2 minutes.

The retailer has optimized the way he flicks through a card index by using a binary search method (starting at the middle card and then going to the middle card in the bottom half or the middle card of the top half depending on what is on the first card etc). With this method, the retailer only needs to look at 10 cards on average from the 1000 cards in the index and it takes him about a minute to do this.
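The retailer's method is a textbook binary search. A sketch, counting how many cards he has to look at (the card labels are invented):

```python
from math import ceil, log2

def binary_search(cards, target):
    """Return (index, looks), where looks counts the cards examined."""
    lo, hi, looks = 0, len(cards) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        looks += 1                    # the retailer looks at one card
        if cards[mid] == target:
            return mid, looks
        if cards[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, looks

cards = [f"part-{i:04d}" for i in range(1000)]  # invented card labels
_, looks = binary_search(cards, "part-0617")
print(looks, "looks; ceil(log2(1000)) =", ceil(log2(1000)))
```

Halving the range on every look means at most ceil(log2(1000)) = 10 cards examined out of 1000, which is where the retailer's "10 cards on average" comes from.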

The assistant cannot copy and carry all of the 1000 cards in an index in a single trip, but for efficiency delivers them in batches of 100 cards at a time – rather than just 1 card at a time. With a round trip time of 2 minutes this works out at almost 1 card per second delivery rate (which is a very good rate when compared with the work the retailer can achieve in a single minute).

In addition to the copy of the top-level card index he keeps in his workshop, the retailer also makes some space available for some of the more frequently used cards, so that he doesn't always have to ask his assistant to fetch a card he already happens to have to hand.

On the face of it, this arrangement looks like it has legs.

The obvious way for the retailer to request a card index is to ask the assistant for the complete card index; wait until it arrives in full and then run his binary search – this takes about 21 minutes overall for one card index (1 minute to summon the assistant and 2 minutes per batch).

Alternatively, the retailer could ask for specific bundles of cards based on his search outcomes – for example, he could start by asking for a bundle of cards that covers the middle range and then choose another bundle based on what he finds. Unfortunately, though, his assistant has a habit of wandering off back to the shed to fetch the next 100 cards while the retailer is busy looking at the latest batch; which means the assistant has to be summoned by the buzzer yet again, taking another minute for him to re-appear. With an average of 10 searches per card index, this would take about 30 minutes (3 minutes per batch) – so the retailer is actually better off asking for the whole card index right at the start, meaning that the best time he can achieve is 21 minutes per card index.

Remember that the retailer always keeps a copy of the top card index in his workshop and so he only needs to consult two card indexes kept in the nearby shed. Therefore, it takes 45 minutes to identify a crate by consulting 3 card indexes and fetching 2 of those card indexes from the shed.
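The arithmetic behind these figures, taken straight from the analogy:

```python
# Timings from the analogy, in minutes.
SUMMON, ROUND_TRIP, SEARCH = 1, 2, 1
CARDS, BATCH, LOOKS = 1000, 100, 10

# Option 1: fetch the whole index up front, then binary-search it.
whole_index = SUMMON + (CARDS // BATCH) * ROUND_TRIP   # 1 + 10*2 = 21

# Option 2: fetch one batch per search step, re-summoning each time.
batch_per_step = LOOKS * (SUMMON + ROUND_TRIP)         # 10 * 3 = 30

# Full lookup: 3 indexes searched, but only 2 fetched from the shed.
total = 2 * whole_index + 3 * SEARCH                   # 42 + 3 = 45
print(whole_index, batch_per_step, total)
```

Note how little of the 45 minutes is the retailer's own work: 3 minutes of searching against 42 minutes of card delivery, which sets up the complaints that follow.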

Well, this is much better than going all the way to the warehouse for each card index (which took several months), but there are still a number of aspects that continue to frustrate the retailer

  • He only needs to look at 10 cards in each card index, but he ends up receiving 1000 cards each time. This takes up space in his workshop and worse still, he often finds himself throwing away other potentially useful cards just to make room for cards he later discovers are not useful at all. As a result, he finds that the majority of cards he keeps in his workshop are rarely useful in practice and simply clutter his workshop, wasting valuable space.
  • He is still spending a lot of his time waiting for his assistant to deliver more cards. Of the 45 minutes taken, only 3 minutes are spent by the retailer doing any useful work. For the remaining 42 minutes, he is waiting for 2000 cards to be delivered (99% of which he does not really need).
  • He is annoyed by the serial nature of the task. He cannot ask for the next card index until he has searched the prior one. He wishes he knew in advance what card indexes (or cards) were required so he could ask for them up front and avoid the need to repeatedly summon his assistant.
  • The size of the shed seems excessive for the benefit derived. He uses only a tiny 0.000002% of the cards stored there for each attempt to identify a crate.
  • If he can’t build a shed big enough to keep a copy of all of those card indexes, then extra lengthy trips to the warehouse will be incurred and to make matters worse, he will not know which orders will be delayed until he starts to work on them.
  • The area in which the wooden shed is located is often the subject of riotous behaviour and it is always in danger of being burnt down. When this happens he has no choice but to rebuild it and send his assistant all the way back to the warehouse whenever he needs a card index that can't be found in the new shed. On such occasions, the service to his customers suddenly and markedly deteriorates.

It is clear that, even with copies of the card indexes stored close by, the method the retailer is using to identify a crate does not serve him well.

Logistical nightmare

Imagine a man who runs a small retail company selling electronic components.

He works out of a tiny workshop in which he packages his orders and has just enough room to hold a handful of parcels and keep a small card index.

He also owns a vast warehouse where he stores his inventory for fulfilling his retail orders. This warehouse is organised into aisles, racks and crates. Each crate holds 1000 items; each rack holds 1000 crates; and each aisle houses 1000 racks. To keep things organised, every crate, rack and aisle is uniquely numbered and every stored component is arranged in strict alphabetical order across all of the crates, with each aisle, rack and crate labelled with the alphabetical range of its contents.

Each aisle holds a card index which lists the alphabetical range of each rack and each rack holds a card index which lists the alphabetical range of each crate. There is also a card index for identifying the correct aisle and this card index is kept at the workshop. To find the correct crate simply requires identifying the correct aisle, then the correct rack and then the correct crate using the respective card indexes.

This man is immobile and unable to drive a car, and so works with an assistant who drives between the workshop and the warehouse in a pick-up truck to collect the items as required. Unfortunately, while the assistant can read numbers well enough, he is completely illiterate and needs to be told exactly which numbered aisle, rack or crate he has to go to. Worse still, the warehouse is so far away that it takes a whole day for the assistant to drive there and back.

To parcel up an order, the man has to follow these steps for each item…

  • Consult the card index in his workshop to work out which aisle the item belongs in
  • Send his assistant to the warehouse to fetch the card index from that numbered aisle
  • When the assistant returns, consult the card index for the aisle to identify the correct rack
  • Send his assistant back to the warehouse to fetch the card index from that numbered rack (and return the card index for the aisle)
  • When the assistant returns, consult the card index for the rack to identify the correct crate
  • Send his assistant back to the warehouse to fetch the numbered crate (and return the card index for the rack)
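The steps above are exactly a three-level index descent. A sketch, with invented card-index contents (each index maps the first key of a range to a child number):

```python
import bisect

def consult(card_index, item):
    """Binary-search a sorted card index for the entry covering `item`."""
    keys = sorted(card_index)
    pos = bisect.bisect_right(keys, item) - 1
    return card_index[keys[pos]]

# Invented card indexes: range start -> numbered aisle/rack/crate.
aisle_index = {"a": 1, "m": 2}                  # kept at the workshop
rack_indexes = {1: {"a": 10, "f": 11}, 2: {"m": 20, "s": 21}}
crate_indexes = {11: {"f": 500, "h": 501}}

item = "gyroscope"
aisle = consult(aisle_index, item)              # step 1: no trip needed
rack = consult(rack_indexes[aisle], item)       # steps 2-3: round trip one
crate = consult(crate_indexes[rack], item)      # steps 4-5: round trip two
print(aisle, rack, crate)                       # step 6: fetch the crate
```

Three consultations, each gated on the result of the one before it: the lookups are cheap, but every level below the top costs a full round trip before the next lookup can even begin.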

As you might expect, the man is actually reasonably fast at reading a card index, sorting through a crate and packaging the parcels; but with a whole day consumed by each of his assistant's round trips, he actually spends the majority of his time just waiting for his assistant to return with a card index.

Moreover, the pick-up truck could potentially transport multiple crates at once but spends its time mostly empty – carting around just one card index and the occasional crate.

This is a logistical nightmare and you wouldn’t run a business like this – yet this is exactly the logistics of using a B-tree index.

You might think the times above are bonkers, but if you scale the timings precisely to that of a CPU and hard disk, they become insane…

  • 1 minute to flick through the card index
  • Over 2 months for the assistant to return from the warehouse

Even scaling the timings for flash storage, it doesn’t look that viable either

  • 1 minute to flick through the card index
  • 10 hours for the assistant to drive to and return from the warehouse.

Of course the man can be doing other things while he is waiting for his assistant to return, but on this scale it would take him 15 minutes to switch to another parcel (to do another minute of useful work).
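The scaling behind those figures can be reconstructed as follows. The latency numbers are order-of-magnitude assumptions (roughly 100 ns to binary-search a cached card index, ~10 ms for a hard disk seek, ~60 µs for a flash read), not measurements:

```python
# Assume ~1 minute of "retailer time" corresponds to ~100 ns of CPU time
# (ten comparisons over a card index held in cache).
SCALE = 60 / 100e-9                  # analogy seconds per real second

def scaled_days(latency_seconds):
    """Real-world latency expressed in analogy days."""
    return latency_seconds * SCALE / 86_400

print(f"hard disk (~10 ms seek): {scaled_days(10e-3):.0f} days")
print(f"flash (~60 us read): {scaled_days(60e-6) * 24:.0f} hours")
```

With these assumptions a disk seek scales to roughly 69 days, comfortably "over 2 months", and a flash read to about 10 hours, matching the figures above.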

Life gets even worse for this poor man when a new consignment of components arrives at his workshop and the components have to be dispatched to the distant warehouse and placed into their correct crates one by one…

This is certainly not a sensible way to run a business; nor is it likely to be the best way to retrieve data.