We saw in the previous post that statistics are about as useful as a chocolate teapot when we daisy-chain joins together.
But statistics are still useful and effective for deciding whether an index scan or a table scan is preferable, right? Err… not exactly.
Firstly, we only need statistics if there is a skew in the data distribution. If the data is uniformly distributed then all we need to know is the range of data values and the number of rows. While detailed statistics could indeed tell us that the distribution is uniform, they are not really necessary.
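To make that concrete, here is a minimal sketch of estimation under a uniformity assumption: range plus row count is all it takes, with no histogram involved. The function name and the numbers are hypothetical, purely for illustration.

```python
def estimate_rows_uniform(total_rows, lo, hi, pred_lo, pred_hi):
    """Estimate rows matching pred_lo <= value <= pred_hi, assuming
    values are spread uniformly over the column's range [lo, hi]."""
    selectivity = (pred_hi - pred_lo) / (hi - lo)
    return total_rows * selectivity

# 1,000,000 rows spread uniformly over 0..100,000: a predicate covering
# a tenth of the range should match roughly a tenth of the rows.
est = estimate_rows_uniform(1_000_000, 0, 100_000, 0, 10_000)
```

With genuinely uniform data this estimate is accurate; it is skew that makes it break down.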
Secondly, statistics can only be of limited detail for any data set with a large population of distinct values. There may be millions of distinct values, but it’s not feasible to store statistics about every single one. This is where histograms come in – they can profile the data to provide an indication of its value distribution. But histograms can conceal as much as they reveal.
Let’s look at an example of how histograms are created and used. Suppose we have an English essay of 2,000 words with the following distribution of letters amongst those words.
Note that I have rounded the percentages for ease of example (my maths isn’t quite as bad as you may be thinking).
And we can represent this as a frequency bar chart as shown below where each distinct letter is represented on the X axis and its frequency of occurrence is represented as a percentage on the Y axis.
The bar chart above is a direct copy of the letters table and gives us an accurate view of the data distribution because it contains the full detail for every possible letter. Like the table, it tells us that if we are looking for the letter K we will find it in 1% of the words. So far, so good.
Now let’s convert the bar chart into a height balanced histogram, where we change the size of the letter categories to provide a more uniform height across those categories. This technique is widely used by relational databases to cope with distributions that have a large category domain (X axis), where it is not feasible to maintain statistics for every discrete category.
For our example, we could accumulate letters into variable sized bands until we reach a cumulative frequency of 10% in each band, like so…
Note that the letter E already exceeds 10% and therefore remains the only letter within its category.
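The banding rule can be sketched in Python as below. The letter frequencies here are illustrative only (and cover just A–K); they are not the essay's actual table, but they are chosen so that E exceeds 10% on its own and the trailing band IJK sums to 8%, matching the example.

```python
def build_bands(freqs, target=10.0):
    """Accumulate (letter, percent) pairs, in order, into a band until
    its cumulative frequency reaches `target`, then start a new band."""
    bands, current, total = [], [], 0.0
    for letter, pct in freqs:
        current.append(letter)
        total += pct
        if total >= target:          # band is "full": close it
            bands.append(("".join(current), total))
            current, total = [], 0.0
    if current:                      # flush the final partial band
        bands.append(("".join(current), total))
    return bands

# Hypothetical (truncated) distribution, percentages by letter:
sample = [("A", 4.0), ("B", 3.0), ("C", 2.0), ("D", 3.0), ("E", 13.0),
          ("F", 2.0), ("G", 2.0), ("H", 6.0), ("I", 6.5), ("J", 0.5),
          ("K", 1.0)]
bands = build_bands(sample)
```

Because E alone crosses the 10% target, it closes its band immediately and ends up as a single-letter category, exactly as in the example above.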
With this histogram, we have to account for the width of a category as well as its height. If we now look for the letter K, we find it in the category IJK, which contains three letters. We have lost statistical information about any individual letter in this category, so we have to estimate its frequency by dividing the band's total frequency (8%) by the category size (3). This yields a frequency of 2.6%, against an actual frequency of 1%: instead of the true 20 words we would now estimate 52. It may not sound like much, but the estimate is 2.6 times the actual value.
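The arithmetic can be sketched as follows. Note that the exact band average is 8/3 ≈ 2.67%, which the post rounds down to 2.6% (hence its figure of 52 rather than 53 words).

```python
# Within a band we can only assume a uniform spread, so every letter in
# the IJK band gets the band-average frequency.
band_total_pct = 8.0   # cumulative frequency of band IJK
band_width = 3         # letters in the band: I, J, K
words = 2000           # total words in the essay

est_pct = band_total_pct / band_width   # 8/3, roughly 2.67%
est_words = words * est_pct / 100       # roughly 53 words estimated
actual_words = words * 1.0 / 100        # K really appears in 1% = 20 words
error_factor = est_words / actual_words # the estimate is ~2.7x the truth
```

The individual-letter detail is simply gone: every letter in the band, whether it truly occurs in 0.5% of words or 6.5%, receives the same estimate.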
These histograms remind me of those badly pixelated pictures where large featureless pixels obscure the specific details we are most interested in. The reality is that these histograms go badly pixelated precisely for the type of data distribution where they are most needed – where there is a large value domain (leading to cumulative categories) and where there is significant data skew (leading to false estimates in cumulative categories). Worse still, joins have a habit of amplifying those estimation errors, which can grow exponentially through a sequence of joins.
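A rough sketch of why the growth is exponential: if each join's cardinality estimate is off by a similar factor, the factors multiply with every additional join. The 2.6x figure below is just the single-band error from the example; real per-join errors vary.

```python
# Hypothetical compounding of a fixed per-join estimation error.
per_join_error = 2.6
for joins in range(1, 5):
    compounded = per_join_error ** joins
    print(f"{joins} join(s): estimate off by ~{compounded:.1f}x")
```

After only four joins at that rate, the optimizer's row estimate is already more than 45 times the truth, which is easily enough to push it toward the wrong join order or access method.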