
A picture of performance

Let’s draw a picture…

Consider three metrics that govern the performance of a CPU.

  • Instruction execution rate
  • Maximum data transfer rate (between CPU and memory)
  • Maximum data transfer request rate (between CPU and memory)

These represent the performance capacity of the hardware and are indicative of its potential throughput. Note that the data transfer request rate is the inverse of the more conventional transfer latency figure; we use this metric because, for diagrammatic purposes, we want measures that are directly proportional to performance.
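To make the inverse relationship concrete, here is a tiny Python sketch of the three metrics. The figures and variable names are purely illustrative; they are not taken from any particular CPU.

    # Illustrative (made-up) figures for the three capacity metrics.
    # The transfer request rate is simply the reciprocal of the usual
    # latency figure, so that all three measures grow as performance grows.
    instruction_rate = 1e9               # instructions per second
    transfer_rate = 10e9                 # bytes per second, CPU <-> memory
    memory_latency = 100e-9              # seconds per transfer request
    request_rate = 1.0 / memory_latency  # 1e7 transfer requests per second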

We can consider these metrics as largely orthogonal dimensions in a three-dimensional space and construct a cuboid to represent the performance capacity, like so

Clearly the CPU may operate at any point within the space bounded by the polyhedron, but it cannot exceed its boundaries. In the diagram above, the point labelled H represents the point of optimal throughput for the hardware in question.

It may be the case that the metrics are not completely independent and that the limits of capacity are represented by a more complex surface, but for clarity of argument let us just consider a simple cuboid.

That’s hardware. But what about software?

Clearly, an algorithm utilises the available capacity of the hardware to some extent, but an algorithm has no fixed boundary – the faster the hardware, the faster an algorithm can operate. Interestingly, though, an algorithm does have a shape – this makes sense when you consider that each iteration of an algorithm demands a given number of CPU cycles, a given number of transfer requests and a given volume of data transferred between the CPU and memory.

Hence we can see that software also forms a polyhedron of a specific shape (within the hardware polyhedron) which can expand in size until it meets a boundary created by the hardware. Clearly, the software can expand to a greater extent if its shape is more closely matched to that of the enclosing hardware polyhedron.

For example, consider an algorithm that requires a lot of memory transfer activity but little in the way of instruction execution. This is illustrated below. The software polyhedron expands until it hits one of the memory transfer boundaries, but retains its shape. The best throughput achieved by this software on this hardware is denoted by the point labelled S.
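In code terms, this expansion is just a scaling exercise: each iteration of the algorithm demands a fixed amount along each dimension, and the achievable iteration rate is capped by whichever dimension meets its hardware boundary first. A rough Python sketch, with made-up capacity and demand figures, looks like this:

    # Hardware capacity along each dimension (illustrative figures only).
    capacity = {"instructions": 1e9,    # instructions per second
                "bytes":        10e9,   # bytes transferred per second
                "requests":     1e7}    # transfer requests per second

    # Per-iteration demand of a transfer-heavy, compute-light algorithm
    # (again, made-up figures): few instructions, lots of memory traffic.
    demand = {"instructions": 50,
              "bytes":        4096,
              "requests":     64}

    # The software "expands" until the tightest dimension meets its hardware
    # boundary; that dimension fixes the achievable iteration rate (point S).
    rate = min(capacity[k] / demand[k] for k in capacity)
    limit = min(capacity, key=lambda k: capacity[k] / demand[k])
    print(f"max iteration rate ~{rate:,.0f}/s, limited by {limit}")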

Now, let’s take a step back to the start of the relational era (the 1970s). Again, we can create a cuboid for the hardware characteristics of the time – but let us also normalize the dimensions such that we achieve a perfect cube of unit dimensions. It would be reasonable to assume that the software of that time would fit this cube reasonably well; otherwise the implication is that the algorithms of that time were grossly inefficient, and I don’t think that is the case. In the 1970s we therefore have a picture like so…

So far, so good. No real surprises there. But let’s expand the dimensions of the hardware polyhedron to reflect the performance improvements we have already seen in hardware and detailed again here.

We arrive at a polyhedron of a very different shape, something like the diagram below, where the hardware polyhedron now resembles a flat plane rather than its original cube. When we expand the software cube up to the boundaries of the hardware cuboid, we see that a lot of the hardware potential remains unused.
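To put some rough numbers on the waste, suppose, purely for illustration (these are not measured figures), that relative to the normalized 1970s cube the instruction rate has grown 10,000-fold, the transfer rate 1,000-fold, but the transfer request rate only 10-fold. An algorithm shaped like the original unit cube can then only scale by the smallest of those factors:

    import math

    # Assumed growth factors relative to the normalized 1970s cube;
    # illustrative orders of magnitude only, not measured figures.
    growth = {"instruction rate": 10_000.0,
              "transfer rate":    1_000.0,
              "request rate":     10.0}

    # A 1970s-shaped algorithm (a unit cube) keeps its shape, so it can only
    # scale uniformly until it meets the smallest dimension of the new cuboid.
    scale = min(growth.values())                # 10x
    used = scale ** 3                           # volume of the expanded software cube
    available = math.prod(growth.values())      # volume of the new hardware cuboid
    print(f"fraction of capacity used: {used / available:.0e}")  # ~1e-05

Under those assumed factors, only around a hundred-thousandth of the available capacity is actually exercised.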


The hardware performance polyhedron has changed shape, but the algorithms still being used on that hardware have not. This illustrates the extent of the mismatch between 20th century data structures and 21st century hardware, and the resulting waste of performance potential.

Moving the goal posts

The Relational Model, proposed by Codd in 1969, heralded the start of the relational era. Shortly afterwards, both row store and column store database systems began to emerge throughout the 1970s, along with associated index structures such as the B-tree index.

In those days, the CPU and memory were effectively joined at the hip, with storage latency viewed as the central hurdle to performance. Memory was small but fast, and was regarded as truly randomly accessible with negligible latency for any CPU access; structures such as the B-tree index were designed to overcome the then dominant storage wall (high storage latency). The goal in the 1970s was to achieve good spatial locality within storage (to amortize the cost of a storage transfer) and good temporal locality within memory (to avoid storage transfers).
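As a hypothetical back-of-the-envelope illustration of that spatial locality goal (the block size, entry size and row count below are assumed, not taken from any particular system), sizing a B-tree node to fill a storage block yields a large fan-out, so a lookup needs only a handful of storage transfers:

    import math

    # Assumed figures for illustration only.
    block_size_bytes = 8 * 1024    # one storage block = one B-tree node
    entry_size_bytes = 16          # key plus child pointer
    rows = 100_000_000

    fan_out = block_size_bytes // entry_size_bytes     # 512 entries per node
    height = math.ceil(math.log(rows, fan_out))        # levels needed
    print(f"fan-out {fan_out}: about {height} storage transfers per lookup")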

The chart below typifies the hardware performance characteristics prevalent both then and now. 


It is evident that CPU speed and memory transfer rates have improved dramatically while memory latency has improved much less so. It is a very similar story for storage performance.

From the current perspective of the CPU, memory is now large but slow. Because memory latency is so high with respect to CPU speed, memory can no longer be treated as truly randomly accessible, nor as an efficient cache; so now there is a memory wall too (high memory latency), and we see CPUs designed with their own caches. The goal today is to achieve good spatial locality in both storage and memory, and good temporal locality within the CPU cache. The goal posts have therefore moved to the extent that the structures of the 1970s are unable to score goals any more.
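The memory wall is easy to demonstrate. The sketch below (assuming NumPy is available; the exact ratio varies by machine) sums the same matrix twice, once with cache-friendly sequential access and once through a strided view that defeats spatial locality; the strided pass is typically several times slower even though it touches exactly the same elements.

    import time
    import numpy as np

    n = 4000
    a = np.random.rand(n, n)    # C-contiguous: each row is contiguous in memory

    def timed_sum(x):
        """Sum the matrix one 'row' at a time and return the elapsed time."""
        start = time.perf_counter()
        total = 0.0
        for i in range(x.shape[0]):
            total += x[i].sum()
        return time.perf_counter() - start

    t_contiguous = timed_sum(a)    # sequential access: good spatial locality
    t_strided = timed_sum(a.T)     # a.T is a view: each 'row' strides across memory
    print(f"contiguous: {t_contiguous:.3f}s  strided: {t_strided:.3f}s")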

Meanwhile, much has been written about enhancing B-tree-like structures (such as cache-oblivious (CO) and log-structured merge (LSM) variants) to improve write performance. But even for read access, the spatial locality of these types of structures is insufficient to effectively exploit the higher transfer rates now possible from storage and memory, or to overcome the memory wall that now overshadows CPU performance.