
Too little, too late

If we were set the task of getting from the first floor to the penthouse suite at the top of a hundred storey apartment block as quickly as possible and we were given the choice of a fully-functioning express elevator that goes straight there or the stairwell – we would probably choose the former. We might like to think that we had optimized this task by choosing the elevator – but wasn’t the real optimization performed by the architect who made that elevator available to us?

In a relational database, the optimizer kicks in at query time – but by then it’s already way too late to really optimize the query. When executing a query, either the optimal access paths exist for that query or they don’t. Horse gone, door firmly bolted.

In other words, optimization has to begin a lot earlier than at query time – by making the optimal access structures available to the queries that need them.

Today’s relational databases rely on humans to do that optimization ground work – by designing indexes, by partitioning and by de-normalizing data schemas. Optimization is a joint effort requiring human design input and automated query planning. Without sufficient human input, the query optimizer is left to choose the least painful of the available options; but just because something happens to be the least painful doesn’t mean it isn’t painful.

Bad pixels

We saw in the previous post that statistics are about as useful as a chocolate teapot when we daisy-chain joins together.

But statistics are still useful and effective for deciding whether an index scan or a table scan is preferable, right? Err… not exactly.

Firstly, we only need statistics if there is a skew in the data distribution. If the data is uniformly distributed then all we need to know is the range of data values and the number of rows. While detailed statistics could indeed tell us that the distribution is uniform, they are not really necessary.
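As a rough sketch of that first point (the column and the numbers below are invented purely for illustration), a range estimate over uniformly distributed data needs nothing more than the row count and the column's min/max values:

```python
# A minimal sketch, assuming a uniformly distributed column.
# The row count and value range are made-up illustration figures.

row_count = 1_000_000
col_min, col_max = 0, 10_000           # known range of the column's values

def estimate_range_rows(lo, hi):
    """Rows expected for 'WHERE col BETWEEN lo AND hi', assuming uniformity."""
    selectivity = (hi - lo) / (col_max - col_min)
    return round(row_count * selectivity)

print(estimate_range_rows(2_000, 2_500))   # 50000 rows expected
```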

Secondly, statistics can only be of limited detail for any data set with a large population of distinct values. There may be millions of distinct values, but it’s not feasible to store statistics about every single one. This is where histograms come in – they can profile the data to provide an indication of its value distribution. But histograms can conceal as much as they reveal.

Let’s look at an example of how histograms are created and used. Suppose we have an English essay of 2,000 words with the following distribution of letters amongst those words.

Note that I have rounded the percentages for ease of example (my maths isn’t quite as bad as you may be thinking).

And we can represent this as a frequency bar chart, as shown below, where each distinct letter sits on the X axis and its frequency of occurrence appears as a percentage on the Y axis.

The bar chart above is a direct copy of the letters table and gives us an accurate view of the data distribution because it contains the full detail for every possible letter. Like the table, it tells us that if we are looking for the letter K we will find it in 1% of the words. So far, so good.

Now let’s convert the bar chart into a height-balanced histogram, where we change the size of the letter categories to provide a more uniform height across those categories. This technique is universally used by relational databases to cope with distributions that have a large category domain (X axis), where it is not possible to maintain statistics for every discrete category.

For our example, we could accumulate letters into variable-sized bands until we reach a cumulative frequency of 10% in each band, like so…

Note that the letter E already exceeds 10% and therefore remains the only letter within its category.

With this histogram, we have to account for the width of a category as well as its height. If we now look for the letter K, we find it in the category IJK, which contains 3 letters. We have lost the statistical information about any individual letter in this category, so we have to estimate its frequency by dividing the band’s total frequency (8%) by the category size (3). This yields a frequency of roughly 2.7% – compared with the actual frequency of 1% – and instead of the true 20 words we would now estimate around 53. It may not sound like much, but the estimate is more than two and a half times the true figure.
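To make that arithmetic concrete, here is a minimal sketch using only the figures quoted above – the 2,000-word essay, an 8% band spanning three letters and a true frequency of 1% for K:

```python
# A minimal sketch of the estimate above, using only the figures quoted
# in the text: a 2,000-word essay, an 8% band covering 3 letters (IJK)
# and a true frequency of 1% for the letter K.

essay_words  = 2000
band_share   = 0.08    # cumulative frequency of the IJK band
band_width   = 3       # distinct letters folded into that band
actual_share = 0.01    # true frequency of the letter K

estimated_share = band_share / band_width         # ~2.7% per letter
estimated_words = essay_words * estimated_share   # ~53 words
actual_words    = essay_words * actual_share      # 20 words

print(f"estimated {estimated_words:.0f} words, actual {actual_words:.0f}")
print(f"the estimate is {estimated_words / actual_words:.1f}x the true figure")
```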

These histograms remind me of those badly pixelated pictures where large, featureless pixels obscure the specific details we are most interested in. The reality is that these histograms become badly pixelated precisely for the type of data distribution where they are most needed – where there is a large value domain (leading to cumulative categories) and significant data skew (leading to false estimates within those categories). Worse still, joins have a habit of amplifying those estimation errors, which can grow exponentially through a sequence of joins.

Statistics are like bikinis…

…what they reveal is suggestive, but what they conceal is vital.
(Aaron Levenstein)

Let’s look at how statistics are used in join optimizations. Consider three tables (blue, green and purple).


Let’s call these tables B, G and P (from their colours).

We want to join B with G with P by the numbers and letters that have corresponding values – but we can do this in table pairs in two different ways…

Firstly, we could join B with G – then join that intermediate result with P, like so

Or secondly, we could join G with P – then join that intermediate result with B, like so

Thankfully, we ended up with the same result in both cases. But notice the remarkable difference in the intermediate result between the two options. Clearly the second option is the optimal choice – but we would need to know that before we start the join process if we wish to optimize it.

We can do this with statistics. If we keep a count of every value in each column, we can work out how many rows will emerge from a join before we perform it. With statistics for each table as shown below, we could therefore determine that joining B with G yields 7 rows while joining G with P yields only 1 row.

That’s enough information to tell us which join order requires the least work.
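As a sketch of how those counts get used (the table contents below are hypothetical stand-ins, chosen to reproduce the 7-row and 1-row figures quoted above), an equi-join estimate simply sums, value by value, the product of the matching row counts on each side:

```python
from collections import Counter

# Hypothetical stand-ins for tables B, G and P, chosen so that the
# estimates come out at 7 rows for B-join-G and 1 row for G-join-P.
B = [1, 1, 1, 2]                       # numbers
G = [(1, 'A'), (1, 'B'), (2, 'C')]     # (number, letter) pairs
P = ['C', 'D']                         # letters

def estimated_join_rows(left_counts, right_counts):
    """Equi-join estimate: for each value, multiply the number of rows
    carrying it on each side, then sum the products."""
    return sum(n * right_counts[v] for v, n in left_counts.items())

b_numbers = Counter(B)
g_numbers = Counter(number for number, _ in G)
g_letters = Counter(letter for _, letter in G)
p_letters = Counter(P)

print("B join G:", estimated_join_rows(b_numbers, g_numbers), "rows")   # 7
print("G join P:", estimated_join_rows(g_letters, p_letters), "rows")   # 1
```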

But what if we then want to join our final result with yet another table? Look at those statistics again for B and G below…

How many rows for each letter will be produced by the join? For example, the letter A could appear in zero, one or two rows. We have no idea before we perform the join, because how the values in one column pair up with the values in another is determined by the individual rows of the table.
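Here is a minimal sketch of that blindness, again with made-up contents: the two versions of G below carry identical per-column statistics – the same numbers and the same letters – yet because they pair those values differently within their rows, the join produces a different spread of letters in each case:

```python
from collections import Counter

# Two hypothetical versions of a table with identical per-column
# statistics (numbers 1 and 2, letters 'A' and 'B') but different
# pairings of number and letter within each row.
B     = [1, 1, 2]                 # the numbers we join against
G_one = [(1, 'A'), (2, 'B')]
G_two = [(1, 'B'), (2, 'A')]

def letters_after_join(b_rows, g_rows):
    """Count how many joined rows carry each letter."""
    return Counter(letter for b in b_rows
                   for number, letter in g_rows if number == b)

print(letters_after_join(B, G_one))   # Counter({'A': 2, 'B': 1})
print(letters_after_join(B, G_two))   # Counter({'B': 2, 'A': 1})
```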

As soon as we join two tables, we have lost all of the statistics about the data that has been joined – we’re flying blind from then on.

Of course, we could create some statistics as we go – but that would require us to commit to joining two tables first, and by then it’s too late to decide whether it was a bad choice – unless we’re prepared to throw the result away and start again. But in query optimization, it’s always too late to learn from one’s mistakes…

Hope is not a strategy

Imagine a carpenter who needs to cut a piece of wood in half but does not know what tools are available for the job until he looks inside his tool bag to find out. With luck, he will find a wood saw – but he might only find a chisel – or worse yet, just a screwdriver…

In much the same way, the query optimizer in a relational database has a lousy tool bag and is left to make the best of a bad job. The available tools are unsatisfactory for all of the following reasons:

  • Any indexes that do exist are unlikely to be the best possible for the specific query
  • It is quite possible that one or more of the required indexes are missing altogether
  • The data distribution statistics used to decide on the best query plan are at best imprecise
  • These table statistics may not be up to date and may be wholly wrong
  • The statistics about a table immediately become junk following a join with another table
  • The number of possible combinations of table joins is often so immense that not all of them can be evaluated

In general, the choice of an execution plan is based on heuristics and assumptions which make the process less like a deterministic optimization and more like a lottery – where hope is all there is.