Tag Archives: Memory

Gone in a flash

Flash storage has become mainstream. But is it here to stay? Probably not.

There is a whole raft of non-volatile memory technologies snapping at the heels of flash memory. These new technologies use different effects such as magnetic, resistive and material phase effects to store data rather than the charge effects used by flash memory. These new technologies include Phase Change Memory, Spin Torque Transfer Memory, Resistive Memory and Racetrack Memory and some of these are already in production.

The important thing about these new technologies is that they are all much faster than flash, they all offer greater density and have much lower power requirements than flash. In many cases, these new forms of memory promise orders of magnitude greater speed and density. In fact, we are likely to see non-volatile memory that is both faster and more dense than current DRAM.

In the longer term, these technologies have the potential to change the architecture of computing whereby storage is memory and is integrated with the CPU. It may be that memory becomes sufficiently dense and close to the CPU that caching becomes unnecessary. That makes CPU cores a whole lot simpler and removes the need for pre-fetching, branch predictions, pipelining or super-scaling. Simpler cores can mean faster cores and many more cores. We may be looking at an architecture of interconnected memory modules with integrated cores (and no storage).

In fact, recent research has proposed the idea that resistive memory can perform logic operations in addition to functioning as memory. That could create a very new and challenging architecture indeed.

Ultimately, we will probably need to forget about making software cache efficient (because there won’t be one) but we will need to ensure that software is embarrassingly parallel instead.

In the meantime, flash will have its decade and will become temporarily ubiquitous in mobile devices. What about enterprise storage? Can it replace hard disk in a decade? There is so much hard disk storage out there and new disk remains such a cheap option that flash can probably only become dominant as a storage cache to accelerate mechanical systems – before it is overtaken by one of these rapidly emerging technologies.

Flash storage. RIP.

Location, location, location

Locality of reference is one of the most important factors in mitigating the latency and finite size of a working area such as memory and storage. From the perspective of the CPU there are many types of locality of reference, some of which are quite esoteric and which can arise from CPU optimisation techniques such as branch prediction and speculative execution; but by far the two most significant types of locality of reference are spatial locality and temporal locality.

Spatial locality is the proximity of address locations for data referenced by successive CPU instructions. With this type of locality, access latency can be amortized by retrieving multiple data items in a single fetch cycle and using good spatial locality to ensure that all the data fetched is useful. If the data is not useful, because of poor spatial locality, then more fetch cycles will be required to process the same data.

Spatial locality is only beneficial where there is latency. If there is no latency at all, and every address location is equally randomly accessible from the CPU, then spatial locality is utterly irrelevant. If this is the case, then it does not matter if the successive data is near or far away from the current data as all address locations are equally accessible. This was how memory was back then before the 1990s – spatial locality was of no concern at all within memory.

Temporal locality is the proximity in time for the re-use of data – this is what caching is all about. With this type of locality, access latency can be avoided by re-using the data fetched previously. With poor temporal locality, data that could have been re-used is lost from a working area and must be fetched again. Data is typically lost because it has been overwritten or evicted by other data; therefore temporal locality is only beneficial where there is a finite size to the working area. If the working area is of infinite size then data is unlikely to be lost and can always be re-used.

Effectively, storage has an infinite size because it persists all of the data (by definition) and therefore temporal locality is irrelevant to storage. Meanwhile, memory has always had a finite and restrictive size and continues to do so in most situations. With today’s larger memory sizes it is sometimes possible to retain all of the data within memory; but this is still the exception rather than the rule and temporal locality remains important within memory. Of course, it is still crucially important for the CPU cache where the size of the working area is still extremely limited.

Good spatial locality mostly arises from data structure; whereas good temporal locality mostly arises from process. There is some connection between the two because algorithms are related to the data structures on which they operate; but it is reasonable to assert that structure is more important in determining spatial locality than process is, for example. This also means that spatial locality is inherited from its source (because the structure is retained) whereas temporal locality is not propagated (because process is localised to a working area). Indeed we would expect different temporal locality between the CPU cache and memory since good temporal locality in CPU cache is specifically used to avoid access to memory. Therefore we see that providing good spatial locality in storage will propagate good storage locality in memory and throughout; whereas good temporal locality must be provided locally by the algorithms operating on CPU cache and memory.

The latency we now see in memory makes spatial locality more important than it was before the 1990s. These days, memory is no longer randomly accessible, and good spatial locality within memory structures is a critical aspect of performance.

We are familiar with row stores and column stores offering spatial locality for very specific query patterns. Row storage offers good spatial locality within a specific row (although it doesn’t help in finding that row); while column storage offers good spatial locality within a column (although it doesn’t help in eliminating irrelevant rows). Clearly, neither are likely to offer good temporal locality and both require the use of indexes built on top to overcome their spatial limitations.

But what about the spatial and temporal localities of those indexes such as the venerable B-Tree?

You might think that B-Trees offer good spatial locality, but it’s not that simple. While a leaf block, which contains many similar keys clustered together, offers good spatial locality across the index entries that are relevant to the query; the root block can only contain one entry that is precisely relevant. Thus we see poor spatial locality at the root of the tree but improving as we progress towards its leaves; it is not a uniform virtue across the entire structure. Indeed, the need to visit multiple independently located blocks for every tree navigation rather dilutes the spatial locality of structure as a whole.

Meanwhile, the temporal locality of the tree works inversely to its spatial locality; while the root block will be re-used with every query, it is very unlikely that any given leaf block will be re-used by successive queries.

Hence the spatial and temporal localities of reference have opposite gradients such that middle layers of a B-Tree cannot offer good locality of reference in either dimension and consistently yields poor locality in both dimensions at the same time. While the the root block held in memory works well and the leaf blocks held on storage work well, the rest of the tree behaves poorly regardless of whether it sits in memory or in storage.

Cache rich

Let’s complete the analogy of the hapless components retailer to see the effect of caching B-Trees in memory by housing copies of the card indexes in a building near to the workshop and using space within the workshop as a small CPU cache…

Now the retailer keeps a copy of all of his card indexes in a wooden shed which is situated right next to his workshop to avoid all of the lengthy round trips to the warehouse.

He hires another assistant to perform all of the fetching, copying and carrying operations between the shed and the workshop; but again, while the assistant is adept at numbers he seems completely clueless with the alphabet.

The retailer is able to summon the assistant via a buzzer operated from the workshop. The assistant is very vigilant and the shed is close enough such that, when the retailer presses the buzzer, it only takes the assistant a minute to appear at the workshop to receive his instructions, and there’s only 2 minutes for each round trip from the workshop to the shed, to go to a specific card index, copy some cards and then return back to the workshop again.

The retailer has optimized the way he flicks through a card index by using a binary search method (starting at the middle card and then going to the middle card in the bottom half or the middle card of the top half depending on what is on the first card etc). With this method, the retailer only needs to look at 10 cards on average from the 1000 cards in the index and it takes him about a minute to do this.

The assistant cannot copy and carry all of the 1000 cards in an index in a single trip, but for efficiency delivers them in batches of 100 cards at a time – rather than just 1 card at a time. With a round trip time of 2 minutes this works out at almost 1 card per second delivery rate (which is a very good rate when compared with the work the retailer can achieve in a single minute).

In addition to the copy of the top level card index the retailer keeps in his workshop, the retailer also makes some space available to keep some of the more frequently used cards so that he doesn’t always have to ask his assistant to fetch them if he already happens to have a card to hand.

On the face of it, this arrangement looks like it has legs.

The obvious way for the retailer to request a card index is to ask the assistant for the complete card index; wait until it arrives in full and then run his binary search – this takes about 21 minutes overall for one card index (1 minute to summon the assistant and 2 minutes per batch).

Alternatively, the retailer could ask for specific bundles of cards based on his search outcomes – for example, he could start by asking for a bundle of cards that cover the middle range then choose another bundle based on what he finds. Unfortunately though, his assistant has a habit of wandering off back to the shed to fetch the next 100 cards while the retailer is busy looking at the latest batch of cards; which means the assistant has be summoned by the buzzer yet again, taking another minute for him to re-appear. With an average of 10 searches per card index, this would take about 30 minutes (3 minutes per batch) – so the retailer is actually better off asking for the whole card index right at the start, meaning that the best time he can achieve is 21 minutes per card index.

Remember that the retailer always keeps a copy of the top card index in his workshop and so he only needs to consult two card indexes kept in the nearby shed. Therefore, it takes 45 minutes to identify a crate by consulting 3 card indexes and fetching 2 of those card indexes from the shed.

Well, this is much better than going all the way to the warehouse for each card index (which took several months), but there are still a number of aspects that continue to frustrate the retailer

  • He only needs to look at 10 cards in each card index, but he ends up receiving 1000 cards each time. This takes up space in his workshop and worse still, he often finds himself throwing away other potentially useful cards just to make room for cards he later discovers are not useful at all. As a result, he finds that the majority of cards he keeps in his workshop are rarely useful in practice and simply clutter his workshop, wasting valuable space.
  • He is still spending a lot of his time waiting for his assistant to deliver more cards. Of the 45 minutes taken, only 3 minutes are spent by the retailer doing any useful work. For the remaining 42 minutes, he is waiting for 2000 cards to be delivered (99% of which he does not really need).
  • He is annoyed by the serial nature of the task. He cannot ask for the next card index until he has searched the prior one. He wishes he knew in advance what card indexes (or cards) were required so he could ask for them up front and avoid the need to repeatedly summon his assistant.
  • The size of the shed seems excessive for the benefit derived. He uses only a tiny 0.000002% of the cards stored there for each attempt to identify a crate.
  • If he can’t build a shed big enough to keep a copy of all of those card indexes, then extra lengthy trips to the warehouse will be incurred and to make matters worse, he will not know which orders will be delayed until he starts to work on them.
  • The area in which the wooden shed is located is often the subject of riotous behaviour and it  is always in danger of being burnt down. When this happens he has no choice, but to rebuild it and send his assistant all the way back to the warehouse when he needs a card index that can’t be found in the new shed. On such occasions, the service to his customers suddenly and markedly deteriorates.

It is clear that, even with copies of the card indexes stored close by,  the method the retailer is using to identify a crate does not serve him well.

At the speed of light

It is quite enlightening to look at the maximum distance that can be travelled by light (or information) within a single CPU clock tick. This has reduced from around 0.1Km in the 1970s (at 3MHz) down to 100mm now (at 3 GHz). 

In other words, for today’s CPU to request information and have it returned within a single clock tick, the information must be in the near vicinity of the CPU. Therefore, any kind of storage is going to create a storage wall no matter how fast that storage technology is (flash, racetrack, PCM etc). Even storage as fast as the CPU cache itself will still present a storage wall if it has to sit beyond the motherboard – simply because information will not be able to travel faster than the speed of light (quantum computing and entanglement aside).

Ultimately, if we do replace current hard disk and flash storage with fast non-volatile RAM we still will not have removed the access latency problems we currently see, unless all of the data can be located near the CPU itself or we abandon the von Neumann architecture that underpins the computing industry.

Moving the goal posts

The Relational Model, proposed by Codd in 1969, heralded the start of the relational era and shortly after both row store and column store database systems began to emerge throughout the 1970s along with associated index structures such as the B-tree index. 

In those days, the CPU and memory were effectively joined at the hip with storage latency viewed as the central hurdle to performance. Memory was small but fast and regarded as truly randomly accessible with negligible latency for any CPU access and the likes of B-tree indexes et al were designed to overcome the then dominant storage wall (high storage latency). The goal in the 1970s was to achieve good spatial locality within storage (to amortize the cost of a storage transfer) and good temporal locality within memory (to avoid storage transfers).

The chart below typifies the hardware performance characteristics prevalent both then and now. 

It is evident that CPU speed and memory transfer rates have improved dramatically while memory latency has improved much less so. It is a very similar story for storage performance.

From the current perspective of the CPU, memory is now large but slow. Because memory latency is so high with respect to CPU speed, memory is no longer randomly accessible and cannot be treated as an efficient cache; so now there is a memory wall too (high memory latency) and we see CPUs designed with their own cache. The goal today is to achieve good spatial locality in both storage and memory and to achieve good temporal locality within CPU cache. Therefore the goal posts have moved to the extent that the structures of the 1970s are unable to score goals any more.

Meanwhile, much has been written about enhancing B-tree like structures (such as CO and LSM variants) to improve write performance. But even for read access, the spatial locality of these types of structures is insufficient to effectively exploit the higher transfer rates now possible from storage and memory or to overcome the memory wall that now overshadows CPU performance.