
Gone in a flash

Flash storage has become mainstream. But is it here to stay? Probably not.

There is a whole raft of non-volatile memory technologies snapping at the heels of flash memory. These new technologies use different effects such as magnetic, resistive and material phase effects to store data rather than the charge effects used by flash memory. These new technologies include Phase Change Memory, Spin Torque Transfer Memory, Resistive Memory and Racetrack Memory and some of these are already in production.

The important thing about these new technologies is that they are all much faster than flash, offer greater density and have much lower power requirements. In many cases, these new forms of memory promise orders of magnitude greater speed and density. In fact, we are likely to see non-volatile memory that is both faster and denser than current DRAM.

In the longer term, these technologies have the potential to change the architecture of computing, whereby storage is memory and is integrated with the CPU. It may be that memory becomes sufficiently dense and close to the CPU that caching becomes unnecessary. That makes CPU cores a whole lot simpler and removes the need for pre-fetching, branch prediction, pipelining or superscalar execution. Simpler cores can mean faster cores and many more of them. We may be looking at an architecture of interconnected memory modules with integrated cores (and no storage).

In fact, recent research has proposed the idea that resistive memory can perform logic operations in addition to functioning as memory. That could create a very new and challenging architecture indeed.

Ultimately, we will probably need to forget about making software cache-efficient (because there won't be a cache) but we will need to ensure that software is embarrassingly parallel instead.

In the meantime, flash will have its decade and will become temporarily ubiquitous in mobile devices. What about enterprise storage? Can it replace hard disk in a decade? There is so much hard disk storage out there and new disk remains such a cheap option that flash can probably only become dominant as a storage cache to accelerate mechanical systems – before it is overtaken by one of these rapidly emerging technologies.

Flash storage. RIP.

Up and out

Scaling out a database across a network of database servers requires an architecture that equitably distributes independent tasks across the network: equitable distribution is required to ensure balanced use of the available resources, and independence is required to minimise communication between those tasks. This latter point is important because communication can create serialization between tasks and can saturate the network bandwidth as the number of tasks increases. Both of these aspects hurt scalability.

Serialization is undesirable, because it forces tasks to wait on one another and a waiting task is an unproductive one – so serialization reduces the effectiveness of scale out. Network bandwidth saturation creates a ceiling for the number of tasks that can be executed simultaneously because adding more tasks just starves the remaining tasks of bandwidth – so bandwidth saturation limits the extent of scale out. Hence, both serialization and saturation make the architecture less scalable.

Of course, there has to be some communication at the start and end of each task if a task is to perform any useful general-purpose work; but each task needs to operate independently between those communication points to achieve scalability. The Map-Reduce framework exemplifies this approach wherein the map phase distributes the tasks which operate independently until completion and then the reduce phase consolidates the results.
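
As a rough illustration of that shape (a minimal sketch, not any particular framework's API), the map phase below runs one independent task per partition with no shared writes, and the reduce phase is the only point of consolidation:

```cpp
#include <numeric>
#include <thread>
#include <vector>

// Illustrative only: each "map" task sums its own partition with no
// communication until the final "reduce" step consolidates the results.
int main() {
    std::vector<std::vector<int>> partitions = {{1, 2, 3}, {4, 5}, {6, 7, 8, 9}};
    std::vector<long> partial(partitions.size(), 0);

    // Map phase: one independent task per partition, no shared writes.
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < partitions.size(); ++i) {
        workers.emplace_back([&, i] {
            partial[i] = std::accumulate(partitions[i].begin(),
                                         partitions[i].end(), 0L);
        });
    }
    for (auto& w : workers) w.join();

    // Reduce phase: the only point of consolidation.
    long total = std::accumulate(partial.begin(), partial.end(), 0L);
    return total == 45 ? 0 : 1;
}
```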

Interestingly, a similar approach is also required within a single database server. It is rare to find a server with a single CPU core these days (outside of the mobile computing environment) and to scale up performance the database must equitably distribute independent tasks across the CPU cores. Again communication between the cores must be kept to a minimum to avoid serialization between tasks, memory bandwidth saturation and cache coherency traffic. Let’s take a closer look at that last point.

In a multi-core environment, each core will have a private cache as well as a shared cache to overcome the latencies experienced with memory access. The private caches will be small but will operate at near CPU speed, while the shared cache will be larger and slower but still much faster than normal memory access. Indeed, memory access is likely to be non-uniform in a multi-core environment and different cores will likely experience different latencies when accessing the same memory location. These caches attempt to overcome memory latency and provide a cache hierarchy whereby content works its way from memory to shared cache to private cache for a memory read, and vice versa for a memory write.

Therefore, an update to a memory address starts in the private cache of a core and is not immediately visible to the other cores (because it is a private cache). To avoid consistency problems with memory access across multiple cores, a cache coherency protocol has to be observed. A typical protocol will invalidate the content of a memory address in a private cache when the corresponding address is updated in the private cache of another CPU core, and will force a flush of the memory address from the updated private cache into the shared cache and possibly memory as well. Any core attempting to access its invalidated private cache location will then be forced to fetch the content again from the shared cache and/or memory.

The cache coherency protocol involves traffic between cores and/or snooping between cores (to see which cores have accessed which memory locations) and forces access to slower shared cache or memory when a memory address is invalidated by another core. In other words, cache coherency is an expensive protocol which hits core performance, and the only way to minimise the cache coherency overhead is to avoid updating the content of memory locations shared between cores wherever possible. Therefore, not only can communication between tasks create serialization and saturate memory bandwidth, it will also scupper core performance through the burden of cache coherency.

Note that sharing a memory location across cores for read only access is perfectly fine. In this case, each private cache will get its own copy of the data and provided no core updates it, the cache coherency protocol need never be invoked for it. Problems only arise when a core updates the content of a shared memory address.
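
To make the cost concrete, here is a minimal C++ sketch (the 64-byte line size is an assumption about typical hardware): two counters packed onto the same cache line will ping-pong between the private caches of the two updating cores, while padding each counter onto its own line avoids the coherency traffic even though the logic is identical.

```cpp
#include <atomic>
#include <thread>

// Two counters deliberately packed onto (most likely) the same cache line...
struct Packed {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// ...and two counters padded onto separate lines (64 bytes is a common,
// but assumed, cache-line size).
struct Padded {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <typename T>
void hammer(T& counters) {
    std::thread t1([&] { for (int i = 0; i < 10'000'000; ++i) counters.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (int i = 0; i < 10'000'000; ++i) counters.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join();
    t2.join();
}

int main() {
    Packed packed;   // each update invalidates the other core's private cache line
    Padded padded;   // updates stay core-local; no coherency ping-pong
    hammer(packed);
    hammer(padded);  // typically measurably faster, despite identical logic
}
```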

Inevitably, there will be some updating of memory shared between cores and there is a plethora of research regarding blocking, obstruction-free, lock-free and wait-free algorithms for dealing with shared data across contending threads. However, outside of the research community there is some misunderstanding about the effects and benefits of these algorithms, so let's take a closer look.

A blocking algorithm is the classic approach that places a guard or latch around a critical section to prevent more than one thread accessing the critical section at the same time. It typically uses a low-level mutex or semaphore for this controlled access and these structures deliberately force serialized access to critical sections and hence serialization between tasks. Ultimately, a semaphore is a shared memory location with atomic test and increment/decrement operations. The use of a semaphore becomes particularly unfortunate when it is used to count the number of concurrent readers of a critical section to prevent concurrent read/write access; in this case every read of the critical section may invoke cache coherency traffic even though the shared data is only being read and not updated at all.
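
A minimal sketch of that last point (the structure is illustrative, not any particular database's latch): even a "read lock" implemented as a shared reader count forces every reader to update the same memory location, so every read still dirties a shared cache line.

```cpp
#include <atomic>

// Illustrative read guard: readers never modify the protected data,
// yet every reader increments and decrements a shared counter, so the
// cache line holding 'readers' bounces between cores on every read.
struct ReadGuardedValue {
    std::atomic<int> readers{0};
    int value = 42;  // the data being protected

    int read() {
        readers.fetch_add(1);  // shared write just to record a read
        int v = value;         // the actual (read-only) work
        readers.fetch_sub(1);  // another shared write
        return v;
    }
};
```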

Moreover, a thread holding a block on a critical section may be suspended and cause active threads using the same critical section to wait until the controlling thread becomes active again. Even worse, a block may be accidentally left in place if a controlling thread abnormally terminates. Hence, with a blocking protocol, the progress of any single thread and the system overall is heavily dependent upon the activities of all of the threads sharing the same critical section.

In light of this, the following non-blocking algorithms can be used as an alternative – they are designed to provide a guarantee about the progress of one or more threads competing for shared data.

  • An obstruction-free algorithm guarantees that a hibernating thread cannot block an active thread from accessing the same data;
  • A lock-free algorithm guarantees that at least one active thread will make (genuine) progress amongst multiple active threads accessing the same data;
  • A wait-free algorithm guarantees that all active threads will make (genuine) progress when accessing the same data.

Such algorithms are generally available for common structures such as stacks and queues etc. But they still require the use of atomic increment/decrement, test and exchange instructions operating on shared memory locations. Hence they still incur the burden of cache coherency and still remain limited by memory access bandwidth. They only alleviate the serialization experienced between threads. Nothing more.
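
For example, a typical lock-free push onto a Treiber-style stack looks something like the sketch below (simplified; a real version also has to deal with safe memory reclamation). Note that the compare-and-exchange still operates on a shared head pointer, so the cache coherency cost is still paid on every push.

```cpp
#include <atomic>

// Minimal lock-free stack push (Treiber stack). At least one contending
// thread always succeeds: if the compare_exchange fails, some other thread
// must have pushed successfully in the meantime.
struct Node {
    int value;
    Node* next;
};

struct LockFreeStack {
    std::atomic<Node*> head{nullptr};

    void push(int value) {
        Node* node = new Node{value, head.load(std::memory_order_relaxed)};
        // Retry until the head we read is still the head when we swap it in.
        while (!head.compare_exchange_weak(node->next, node,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {
            // node->next has been reloaded with the current head; try again.
        }
    }
};
```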

Indeed, the reduced serialization can come at an additional cost in memory and processing cycles. As the guarantee gets more stringent, the cost of that guarantee also increases. Some phrase involving the words lunch and free comes to mind.

Clearly, the wait-free algorithm is the most stringent of all and only makes sense when threads accessing the same data are performing logically independent operations (albeit against the same physical structure). For example, if I push an item into a queue at the same time you push another independent item into the same queue, these are logically independent operations and both can progress independently. In effect, two versions of the queue are created (requiring more memory) and subsequently merged (requiring more processing) and memory reclaimed (requiring garbage collection) to achieve progress for both operations simultaneously. (Note that there is an analogy here with eventual consistency in a scaled-out architecture.)

A lock-free algorithm will cause non-progressing threads to spin with repeated unsuccessful attempts at accessing the shared data, likely incurring cache coherency overhead on the progressing thread.

An obstruction-free algorithm may allow optimistic access to shared data and force unsuccessful threads to clean up their misdemeanours after the fact.

Hence the use of the more stringent non-blocking algorithms does not deliver the best overall throughput. In particular, the wait-free guarantee is very expensive and is usually confined to real-time environments where any given thread cannot afford to wait too long.

Therefore a database architecture wishing to scale up effectively will use lock-free or obstruction-free algorithms to reduce serialization to a reasonable degree while maintaining a high overall throughput.

But more importantly, these non-blocking protocols are not a magic bullet and, regardless of the protocols used, it is far more important to minimise communication as much as possible. Less communication between cores means better scale-up, and a database designed to scale up well will need to be scaled out a lot less than one that isn't.

Tradition

There are two very different strategies for dealing with updates and deletes in a relational database. Update-in-place semantics will directly overwrite data in-situ with a new version of that data; while append semantics will simply append the new version to a chronological log and leave the old data untouched. The update-in-place approach was extensively used in early database architectures but more recent database systems tend to opt for append semantics. Let’s look at the reasons why that may be.

When we append updates rather than overwrite data we are able to keep the old and consistent version of the data while the update is in progress and this affords a robust recovery mechanism should the update fail. We are able to easily watermark the updates and we can choose to move that watermark only when we know that the updates have been completed and successfully persisted in storage. Whereas with update-in-place semantics we would have to write and retain the old version in a separate log before the update took place to be able to guarantee recovery from a failed update (assuming that a full recovery from a database backup copy is a little too rudimentary for most people).

The use of append semantics allows multiple updates to be appended to the same area of storage rather than scattered at random locations across the database. This affords a huge performance benefit to updates because large numbers of small updates become coalesced into a smaller number of larger writes, and this hugely reduces the load on fsync() operations – which are essential for durability in write-back cache arrangements and for avoiding anomalies arising from out-of-order write behaviour in modern disk systems.
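
A rough POSIX sketch of the idea (file name and batch contents are illustrative): many small logical updates are coalesced into one appended batch, so there is a single sequential write and a single fsync() rather than one scattered write and barrier per update.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <string>
#include <vector>

// Illustrative append-only log: many logical updates are buffered and then
// persisted with one sequential write and one fsync, rather than one random
// in-place write (and fsync) per update.
int main() {
    int fd = open("updates.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) return 1;

    std::vector<std::string> updates = {"u1\n", "u2\n", "u3\n"};
    std::string batch;
    for (const auto& u : updates) batch += u;   // coalesce the small updates

    if (write(fd, batch.data(), batch.size()) < 0) return 1;  // one large append
    if (fsync(fd) != 0) return 1;               // one durability barrier
    return close(fd);
}
```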

Even on flash storage this type of large append write pattern has huge benefits over small random writes because much less of an erase block is wasted with each write. With a small random update I/O, a flash device has to copy the entirety of an updated block to a new empty location and remap the block’s location. Once a flash device fills, unused blocks must be explicitly erased before they can be re-used and on a busy device this garbage collection activity can impact performance even where it happens asynchronously ahead of time.

Append semantics show significant performance gains in other areas too, such as in transaction isolation. With update-in-place semantics, either an explicit MVCC (multi-version concurrency control) scheme with separate before-update images must be managed; or worse still, read activity must lock out update activity on the same data to ensure consistent reads. While the use of read locks is a death knell to performance with concurrent users and is tantamount to eating small children in a public place, the overhead of managing before and after images also places a huge burden on throughput. However, none of this band-aid architecture is needed with append semantics because versioning is implicit and watermarks demarcate the boundaries between transactions.
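
As a minimal sketch of what implicit versioning might look like (the structures and visibility rule here are illustrative, not any vendor's implementation): each update appends a new version tagged with its transaction, and a reader simply ignores anything beyond its watermark, so no read locks or separate before-images are needed.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Illustrative append-only versioning: every update appends a new version
// tagged with the transaction that wrote it. A reader sees the latest
// version at or below its watermark; in-flight versions beyond the
// watermark are simply invisible, with no locks or undo images required.
struct Version {
    std::uint64_t txn;   // transaction id that appended this version
    int value;
};

struct VersionedRow {
    std::vector<Version> versions;  // appended in txn order, never overwritten

    std::optional<int> read(std::uint64_t watermark) const {
        std::optional<int> result;
        for (const auto& v : versions) {
            if (v.txn <= watermark) result = v.value;  // latest visible version
        }
        return result;
    }
};
```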

There’s also a slight deception to the term update-in-place. What happens when the version we are attempting to overwrite is smaller than the updated version? This usually leads to schemes such as row chaining where database rows are split to accommodate increasing row length. These chained row fragments may not even reside in the same storage block as the original row and this ultimately leads to further performance horror at query time.

Moreover, contemporary data management requirements commonly require a history of transactions to be retained for management, analysis and legal reasons. So why go out of your way to destroy that very history by overwriting it?

That all sounds great for append semantics, but are there any good reasons for choosing to update in place? The usual response here is that append semantics will require more storage space than update-in-place semantics because we are retaining a complete history of the updates. This is indeed true, but is it important?

On any very large database system dealing with machine generated or web data, the huge majority of data is derived from immutable inserts that never get updated and any data that can be updated (such as customer or account data) is insignificant by comparison. Besides, customer and account data is the very type of data we want to maintain a history for because these are the significant real-world entities on which the business depends.

At this point, someone will helpfully suggest that they might use a table which contains a single row which gets frequently updated for the sole purpose of maintaining a sequence number.  (Suddenly, a silence falls and a distant church bell can be heard to slowly toll…) Enough said.

Therefore, either the volume of updates is unlikely to significantly impact storage space on a large system where space is critical; or the system is too small for it to matter at today’s storage prices. Indeed, storage bandwidth is a lot costlier than storage capacity and update-in-place offers no benefit there at all.

Neither is query performance improved by update-in-place semantics since, with append semantics, all versions of a row can be equally accessible and resolvable through whatever structures are used to locate an original row. Indeed, query performance is likely to suffer more with update-in-place semantics because of row chaining.

So why do some database vendors continue to use update-in-place semantics? Good question.

“A precedent embalms a principle.” – Benjamin Disraeli

Location, location, location

Locality of reference is one of the most important factors in mitigating the latency and finite size of a working area such as memory and storage. From the perspective of the CPU there are many types of locality of reference, some of which are quite esoteric and which can arise from CPU optimisation techniques such as branch prediction and speculative execution; but by far the two most significant types of locality of reference are spatial locality and temporal locality.

Spatial locality is the proximity of address locations for data referenced by successive CPU instructions. With this type of locality, access latency can be amortized by retrieving multiple data items in a single fetch cycle and using good spatial locality to ensure that all the data fetched is useful. If the data is not useful, because of poor spatial locality, then more fetch cycles will be required to process the same data.
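
A standard illustration in C++ (the array shape is arbitrary): both functions below touch exactly the same elements, but the row-major traversal walks memory sequentially so every cache line fetched is fully used, while the column-wise traversal uses only one element from each line it fetches.

```cpp
#include <vector>

// Same work, different spatial locality.
long sum_row_major(const std::vector<std::vector<int>>& m) {
    long total = 0;
    for (const auto& row : m)            // adjacent elements, adjacent addresses:
        for (int v : row) total += v;    // every fetched cache line is fully used
    return total;
}

long sum_column_wise(const std::vector<std::vector<int>>& m) {
    long total = 0;
    std::size_t cols = m.empty() ? 0 : m[0].size();
    for (std::size_t c = 0; c < cols; ++c)          // stride across rows: each fetch
        for (const auto& row : m) total += row[c];  // brings a line, uses one element
    return total;
}
```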

Spatial locality is only beneficial where there is latency. If there is no latency at all, and every address location is equally and randomly accessible from the CPU, then spatial locality is utterly irrelevant. In that case it does not matter whether the successive data is near to or far away from the current data, as all address locations are equally accessible. This is how memory behaved before the 1990s – spatial locality was of no concern at all within memory.

Temporal locality is the proximity in time for the re-use of data – this is what caching is all about. With this type of locality, access latency can be avoided by re-using the data fetched previously. With poor temporal locality, data that could have been re-used is lost from a working area and must be fetched again. Data is typically lost because it has been overwritten or evicted by other data; therefore temporal locality is only beneficial where there is a finite size to the working area. If the working area is of infinite size then data is unlikely to be lost and can always be re-used.
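
A minimal sketch of exploiting temporal locality (the chunk size is an assumption about a typical cache): applying every pass to a cache-sized chunk before moving on re-uses data while it is still resident, whereas streaming the whole array once per pass evicts the data between passes.

```cpp
#include <algorithm>
#include <vector>

// Illustrative only: apply several passes over a large array. The chunked
// version applies every pass to a cache-sized chunk before moving on, so the
// chunk is re-used while still resident (good temporal locality); the naive
// version streams the whole array once per pass, evicting data between passes.
constexpr std::size_t kChunk = 64 * 1024 / sizeof(int);  // assumed ~64 KB working set

void passes_naive(std::vector<int>& data, int passes) {
    for (int p = 0; p < passes; ++p)
        for (auto& v : data) v += 1;               // whole array per pass
}

void passes_chunked(std::vector<int>& data, int passes) {
    for (std::size_t start = 0; start < data.size(); start += kChunk) {
        std::size_t end = std::min(start + kChunk, data.size());
        for (int p = 0; p < passes; ++p)           // all passes while the chunk is hot
            for (std::size_t i = start; i < end; ++i) data[i] += 1;
    }
}
```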

Effectively, storage has an infinite size because it persists all of the data (by definition) and therefore temporal locality is irrelevant to storage. Meanwhile, memory has always had a finite and restrictive size and continues to do so in most situations. With today’s larger memory sizes it is sometimes possible to retain all of the data within memory; but this is still the exception rather than the rule and temporal locality remains important within memory. Of course, it is still crucially important for the CPU cache where the size of the working area is still extremely limited.

Good spatial locality mostly arises from data structure; whereas good temporal locality mostly arises from process. There is some connection between the two because algorithms are related to the data structures on which they operate; but it is reasonable to assert that structure is more important in determining spatial locality than process is. This also means that spatial locality is inherited from its source (because the structure is retained) whereas temporal locality is not propagated (because process is localised to a working area). Indeed, we would expect different temporal locality between the CPU cache and memory, since good temporal locality in the CPU cache is specifically used to avoid access to memory. Therefore we see that providing good spatial locality in storage will propagate good spatial locality in memory and throughout; whereas good temporal locality must be provided locally by the algorithms operating on CPU cache and memory.

The latency we now see in memory makes spatial locality more important than it was before the 1990s. These days, memory is no longer randomly accessible, and good spatial locality within memory structures is a critical aspect of performance.

We are familiar with row stores and column stores offering spatial locality for very specific query patterns. Row storage offers good spatial locality within a specific row (although it doesn’t help in finding that row); while column storage offers good spatial locality within a column (although it doesn’t help in eliminating irrelevant rows). Clearly, neither is likely to offer good temporal locality and both require the use of indexes built on top to overcome their spatial limitations.
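
In code terms the trade-off is essentially array-of-structs versus struct-of-arrays (the column names below are made up): the row layout keeps all the fields of one row adjacent, the column layout keeps all the values of one column adjacent.

```cpp
#include <string>
#include <vector>

// Row store: all the fields of one row sit together in memory,
// which suits "fetch everything about this one row".
struct RowStore {
    struct Row { int id; double price; std::string name; };
    std::vector<Row> rows;
};

// Column store: all the values of one column sit together in memory,
// which suits "scan this one column across all rows".
struct ColumnStore {
    std::vector<int> ids;
    std::vector<double> prices;
    std::vector<std::string> names;
};
```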

But what about the spatial and temporal localities of those indexes such as the venerable B-Tree?

You might think that B-Trees offer good spatial locality, but it’s not that simple. While a leaf block, which contains many similar keys clustered together, offers good spatial locality across the index entries that are relevant to the query, the root block can only contain one entry that is precisely relevant. Thus we see poor spatial locality at the root of the tree, improving as we progress towards its leaves; it is not a uniform virtue across the entire structure. Indeed, the need to visit multiple independently located blocks for every tree navigation rather dilutes the spatial locality of the structure as a whole.

Meanwhile, the temporal locality of the tree works inversely to its spatial locality; while the root block will be re-used with every query, it is very unlikely that any given leaf block will be re-used by successive queries.

Hence the spatial and temporal localities of reference have opposite gradients, such that the middle layers of a B-Tree cannot offer good locality of reference in either dimension and consistently yield poor locality in both dimensions at the same time. While the root block held in memory works well and the leaf blocks held on storage work well, the rest of the tree behaves poorly regardless of whether it sits in memory or in storage.

Cache rich

Let’s complete the analogy of the hapless components retailer to see the effect of caching B-Trees in memory by housing copies of the card indexes in a building near to the workshop and using space within the workshop as a small CPU cache…

Now the retailer keeps a copy of all of his card indexes in a wooden shed which is situated right next to his workshop to avoid all of the lengthy round trips to the warehouse.

He hires another assistant to perform all of the fetching, copying and carrying operations between the shed and the workshop; but again, while the assistant is adept at numbers he seems completely clueless with the alphabet.

The retailer is able to summon the assistant via a buzzer operated from the workshop. The assistant is very vigilant and the shed is close enough that, when the retailer presses the buzzer, it only takes the assistant a minute to appear at the workshop to receive his instructions, and only 2 minutes for each round trip from the workshop to the shed to go to a specific card index, copy some cards and return to the workshop.

The retailer has optimized the way he flicks through a card index by using a binary search method (starting at the middle card and then going to the middle card of the bottom half or the middle card of the top half depending on what is on that middle card, and so on). With this method, the retailer only needs to look at 10 cards on average from the 1000 cards in the index and it takes him about a minute to do this.
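
The retailer's method is just a textbook binary search; a sketch (the card contents are hypothetical) shows why roughly log2(1000) ≈ 10 comparisons are enough to find one card among 1000.

```cpp
#include <string>
#include <vector>

// Binary search over a sorted "card index": halve the range each step,
// so 1000 cards need at most ~10 comparisons (2^10 = 1024).
int find_card(const std::vector<std::string>& cards, const std::string& key) {
    int lo = 0, hi = static_cast<int>(cards.size()) - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;        // the "middle card"
        if (cards[mid] == key) return mid;
        if (cards[mid] < key) lo = mid + 1;  // keep the top half
        else hi = mid - 1;                   // keep the bottom half
    }
    return -1;  // not found
}
```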

The assistant cannot copy and carry all of the 1000 cards in an index in a single trip, but for efficiency delivers them in batches of 100 cards at a time – rather than just 1 card at a time. With a round trip time of 2 minutes this works out at almost 1 card per second delivery rate (which is a very good rate when compared with the work the retailer can achieve in a single minute).

In addition to the copy of the top-level card index he keeps in his workshop, the retailer also makes some space available to keep some of the more frequently used cards so that he doesn’t always have to ask his assistant to fetch them if he already happens to have a card to hand.

On the face of it, this arrangement looks like it has legs.

The obvious way for the retailer to request a card index is to ask the assistant for the complete card index, wait until it arrives in full and then run his binary search – this takes about 21 minutes overall for one card index (1 minute to summon the assistant and 2 minutes for each of the 10 batches).

Alternatively, the retailer could ask for specific bundles of cards based on his search outcomes – for example, he could start by asking for a bundle of cards that cover the middle range then choose another bundle based on what he finds. Unfortunately though, his assistant has a habit of wandering off back to the shed to fetch the next 100 cards while the retailer is busy looking at the latest batch of cards; which means the assistant has to be summoned by the buzzer yet again, taking another minute for him to re-appear. With an average of 10 searches per card index, this would take about 30 minutes (3 minutes per batch) – so the retailer is actually better off asking for the whole card index right at the start, meaning that the best time he can achieve is 21 minutes per card index.

Remember that the retailer always keeps a copy of the top card index in his workshop and so he only needs to consult two card indexes kept in the nearby shed. Therefore, it takes 45 minutes to identify a crate by consulting 3 card indexes and fetching 2 of those card indexes from the shed.

Well, this is much better than going all the way to the warehouse for each card index (which took several months), but there are still a number of aspects that continue to frustrate the retailer…

  • He only needs to look at 10 cards in each card index, but he ends up receiving 1000 cards each time. This takes up space in his workshop and worse still, he often finds himself throwing away other potentially useful cards just to make room for cards he later discovers are not useful at all. As a result, he finds that the majority of cards he keeps in his workshop are rarely useful in practice and simply clutter his workshop, wasting valuable space.
  • He is still spending a lot of his time waiting for his assistant to deliver more cards. Of the 45 minutes taken, only 3 minutes are spent by the retailer doing any useful work. For the remaining 42 minutes, he is waiting for 2000 cards to be delivered (99% of which he does not really need).
  • He is annoyed by the serial nature of the task. He cannot ask for the next card index until he has searched the prior one. He wishes he knew in advance what card indexes (or cards) were required so he could ask for them up front and avoid the need to repeatedly summon his assistant.
  • The size of the shed seems excessive for the benefit derived. He uses only a tiny 0.000002% of the cards stored there for each attempt to identify a crate.
  • If he can’t build a shed big enough to keep a copy of all of those card indexes, then extra lengthy trips to the warehouse will be incurred and to make matters worse, he will not know which orders will be delayed until he starts to work on them.
  • The area in which the wooden shed is located is often the subject of riotous behaviour and it is always in danger of being burnt down. When this happens he has no choice but to rebuild it and send his assistant all the way back to the warehouse when he needs a card index that can’t be found in the new shed. On such occasions, the service to his customers suddenly and markedly deteriorates.

It is clear that, even with copies of the card indexes stored close by, the method the retailer is using to identify a crate does not serve him well.

Logistical nightmare

Imagine a man who runs a small retail company selling electronic components.

He works out of a tiny workshop in which he packages his orders and has just enough room to hold a handful of parcels and keep a small card index.

He also owns a vast warehouse where he stores his inventory for fulfilling his retail orders. This warehouse is organised into aisles, racks and crates. Each crate holds 1000 items; each rack holds 1000 crates; and each aisle houses 1000 racks. To keep things organised, every crate, rack and aisle is uniquely numbered and every stored component is arranged in strict alphabetical order across all of the crates, with each aisle, rack and crate labelled with the alphabetical range of its contents.

Each aisle holds a card index which lists the alphabetical range of each rack and each rack holds a card index which lists the alphabetical range of each crate. There is also a card index for identifying the correct aisle and this card index is kept at the workshop. To find the correct crate simply requires identifying the correct aisle, then the correct rack and then the correct crate using the respective card indexes.

This man is immobile and unable to drive a car and so works with an assistant who drives between the workshop and the warehouse in a pick-up truck to collect the items as required. Unfortunately while the assistant can read numbers well enough, he is completely illiterate and needs to be told exactly which numbered aisle, rack or crate he has to go to. Worse still, the warehouse is so far away that it takes a whole day for the assistant to drive there and back.

To parcel up an order, the man has to follow these steps for each item…

  • Consult the card index in his workshop to work out which aisle the item belongs in
  • Send his assistant to the warehouse to fetch the card index from that numbered aisle
  • When the assistant returns, consult the card index for the aisle to identify the correct rack
  • Send his assistant back to the warehouse to fetch the card index from that numbered rack (and return the card index for the aisle)
  • When the assistant returns, consult the card index for the rack to identify the correct crate
  • Send his assistant back to the warehouse to fetch the numbered crate (and return the card index for the rack)

As you might expect, the man is reasonably fast at reading a card index, sorting through a crate and packaging the parcels; but with a whole day consumed by each round trip, he actually spends the majority of his time just waiting for his assistant to return with a card index.

Moreover, the pick-up truck could potentially transport multiple crates at once but spends its time mostly empty – carting around just one card index and the occasional crate.

This is a logistical nightmare and you wouldn’t run a business like this – yet this is exactly the logistics of using a B-tree index.
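
Stripped of the analogy, the lookup is just the sketch below, where fetch_block stands in for whatever storage read the database performs (here it simply simulates a fixed latency). Each level's result is needed before the next fetch can even be issued, so the latencies add up serially.

```cpp
#include <array>
#include <chrono>
#include <cstdint>
#include <thread>

// Stand-in for a storage read: in the analogy this is a round trip to the
// warehouse; here we just simulate a fixed latency per block fetched.
struct Block { std::array<std::uint64_t, 1000> child{}; };

Block fetch_block(std::uint64_t /*block_number*/) {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));  // pretend disk seek
    return Block{};
}

// A three-level lookup, exactly like aisle -> rack -> crate: each level's
// result is needed before the next fetch can even be issued, so the three
// latencies add up serially rather than overlapping.
std::uint64_t locate(std::uint64_t root_block, int aisle_slot, int rack_slot, int crate_slot) {
    Block aisle_index = fetch_block(root_block);                     // trip 1
    Block rack_index  = fetch_block(aisle_index.child[aisle_slot]);  // trip 2
    Block crate_index = fetch_block(rack_index.child[rack_slot]);    // trip 3
    return crate_index.child[crate_slot];                            // the crate number
}

int main() {
    return static_cast<int>(locate(0, 1, 2, 3));  // three dependent fetches, ~30 ms here
}
```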

You might think the times above are bonkers, but if you scale the timings precisely to that of a CPU and hard disk, they become insane…

  • 1 minute to flick through the card index
  • Over 2 months for the assistant to return from the warehouse

Even scaling the timings for flash storage, it doesn’t look that viable either…

  • 1 minute to flick through the card index
  • 10 hours for the assistant to drive to and return from the warehouse.

Of course the man can be doing other things while he is waiting for his assistant to return, but on this scale it would take him 15 minutes to switch to another parcel (to do another minute of useful work).

Life gets even worse for this poor man when a new consignment of components arrives at his workshop and the components have to be dispatched to the distant warehouse and placed into their correct crates one by one…

This is certainly not a sensible way to run a business; nor is it likely to be the best way to retrieve data.

A picture of performance

Let’s draw a picture…

Consider three metrics that govern the performance of a CPU.

  • Instruction execution rate
  • Maximum data transfer rate (between CPU and memory)
  • Maximum data transfer request rate (between CPU and memory)

These represent the performance capacity of this hardware and are indicative of its potential throughput. Note that the data transfer request rate is the inverse of the more conventional transfer latency figure and we are using this metric because we want measures that are directly proportional to performance for diagrammatic purposes.

We can consider these metrics as largely orthogonal dimensions in a 3-dimensional space and can construct a cuboid to represent the performance capacity, like so…

Clearly the CPU may operate at any point within the space bounded by the polyhedron but we cannot exceed its boundaries. In the diagram above, the point labelled H represents the point of optimal throughput for the hardware in question.

It may be the case that the metrics are not completely independent and the limits of capacity are represented by a more complex surface, but for clarity of argument let us just consider a simple cuboid.

That’s hardware. But what about software?

Clearly, an algorithm utilises the available capacity of the hardware to some extent, but an algorithm has no fixed boundary – the faster the hardware, the faster an algorithm can operate. But interestingly, an algorithm does have a shape – this makes sense when you consider that for each iteration of an algorithm there will be a demand for a given number of CPU cycles, transfer requests and transfer volume between the CPU and memory.

Hence we can see that software also forms a polyhedron of a specific shape (within the hardware polyhedron) which can expand in size until it meets a boundary created by the hardware. Clearly, the software can expand to a greater extent if its shape is more closely matched to that of the enclosing hardware polyhedron.

For example, consider an algorithm that requires a lot of memory transfer activity but little in the way of instruction execution. This is illustrated below. The software polyhedron expands until it hits one of the memory transfer boundaries, but retains its shape. The best throughput achieved by this software on this hardware is denoted by the point labelled S.

Now, let’s take a step back to the start of the relational era (the 1970s). Again, we can create a cuboid for the hardware characteristics of the time – but let us also normalize the dimensions such that we achieve a perfect cube of unitary dimensions. It would be reasonable to assume that the software of that time would fit this cube reasonably well; otherwise the implication is that the algorithms of that time were grossly inefficient, and I don’t think that is the case. In the 1970s we therefore have a picture like so…

So far, so good. No real surprises there. But let’s expand the dimensions of the hardware polyhedron to reflect the performance improvements we have already seen in hardware and detailed again here.

We arrive at a polyhedron of a very different shape that looks something like the diagram below, where the hardware polyhedron now resembles a flat plane rather than its original cube. When we expand the software cube up to the boundaries of the hardware cuboid, we see that a lot of the hardware potential remains unused.

The hardware performance polyhedron has changed shape but the algorithms still being used on that hardware have not, and this illustrates the extent of the mismatch between 20th century data structures and 21st century hardware, and the resulting waste of performance potential.

At the speed of light

It is quite enlightening to look at the maximum distance that can be travelled by light (or information) within a single CPU clock tick. This has reduced from around 0.1 km in the 1970s (at 3 MHz) down to 100 mm now (at 3 GHz).
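
The arithmetic is simply distance = speed of light / clock frequency; a couple of lines of code reproduce the figures:

```cpp
#include <cstdio>

// Distance light travels in one clock tick: d = c / f.
int main() {
    const double c = 3.0e8;                                // speed of light, m/s (approx.)
    std::printf("3 MHz: %.1f m per tick\n", c / 3.0e6);    // ~100 m  (about 0.1 km)
    std::printf("3 GHz: %.3f m per tick\n", c / 3.0e9);    // ~0.1 m  (about 100 mm)
}
```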

In other words, for today’s CPU to request information and have it returned within a single clock tick, the information must be in the near vicinity of the CPU. Therefore, any kind of storage is going to create a storage wall no matter how fast that storage technology is (flash, racetrack, PCM etc). Even storage as fast as the CPU cache itself will still present a storage wall if it has to sit beyond the motherboard – simply because information will not be able to travel faster than the speed of light (quantum computing and entanglement aside).

Ultimately, if we do replace current hard disk and flash storage with fast non-volatile RAM we still will not have removed the access latency problems we currently see, unless all of the data can be located near the CPU itself or we abandon the von Neumann architecture that underpins the computing industry.