“There are only two hard problems in Computer Science: cache invalidation, and naming things.”
— Phil Karlton
Caches are all around us – when we combine first name and last name into a new derived attribute, full name, that is a cache of sorts. But we typically think of a cache as something our web browser keeps to avoid sending things over a network. Caches are also typically aggregated data derived from other data – data we snapshot at one point in time and then keep in fact and dimension tables for multidimensional cubes, in JSON or XML documents, as reduced aggregated sums, or the like.
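To make the derived-field example concrete, here is a minimal Python sketch – the Person class and its members are mine, purely for illustration:

```python
# A derived field as a cache: full_name is computed once from its
# sources and stored, so reading it costs nothing – but it must be
# kept in sync (invalidated) whenever a source attribute changes.
class Person:
    def __init__(self, first_name: str, last_name: str):
        self._first_name = first_name
        self._last_name = last_name
        self._full_name = f"{first_name} {last_name}"  # the cached value

    def set_first_name(self, value: str) -> None:
        self._first_name = value
        # Refresh the cache; forget this and full_name serves stale data.
        self._full_name = f"{self._first_name} {self._last_name}"

    @property
    def full_name(self) -> str:
        return self._full_name  # served from the cache, no recomputation
```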
The main driver for caching is delivery speed through being prepared. Just like when the TV chefs say "…to save some time I have already peeled the potatoes". I stress the concept of being prepared: it is a much more important way of gaining speed than being fast. A TV chef might be the world record holder in potato peeling – but peeling the potatoes ahead of time is an enormously more efficient way to serve them up fast when needed.
Software development is not potato peeling – but the concept of preparation is the same: if we can be prepared to deliver a requested result, we will be faster when it actually matters.
Being fast when it matters is what performance is all about. Obviously.
Being fast when it matters can always be solved by being prepared, and being prepared always translates to caching.
If I am correct – why isn't everyone caching everything all the time? There are several reasons why developers in some situations choose to avoid caching:
- Not knowing what to prepare for – or an inability to forecast future requests
- Potentially wasted resources – storage and time spent preparing for requests that happen seldom, or where there is little gain in being prepared
- No easy and efficient way to know when a cached result has gone stale because the underlying data changed
- The risk and implications of serving up old, stale data outweigh the benefits of being fast
For reason 1 & 2 I argue that the thing worthy of caching always is computational expensive to derive and that we must be reasonable certain that this cost is motivated by someone asking for the result while the cache is still valid.
Point 3 is what we mean by Cache Invalidation – how we know when to not trust the current cache anymore and create a new one.
Point 4 is the risk we take when not having a good cache invalidation strategy.
When point 3 or 4 is cited as reason for not caching we understand why cache invalidation is an important problem in computer science.
The easiest, and probably the most common, way to invalidate a cache – i.e. declaring that it is stale and must be refreshed – is time. A developer may assume that a reader of the cached data will not suffer too much from a value that is potentially one hour old. Given that assumption, the cache may be invalidated every hour, accepting the risk defined in point 4.
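As a sketch, such a time-based (TTL) cache could look like this in Python – the names TtlCache and expensive_query are mine, purely for illustration:

```python
import time

def expensive_query() -> int:
    # Hypothetical stand-in for the real, expensive computation.
    return sum(range(1_000_000))

# A minimal TTL cache: an entry counts as fresh for max_age seconds
# after it was built, regardless of whether the source data changed.
# This deliberately accepts the risk described in point 4.
class TtlCache:
    def __init__(self, max_age: float):
        self.max_age = max_age
        self._store: dict = {}      # key -> (value, built_at)

    def get(self, key, compute):
        hit = self._store.get(key)
        if hit is not None:
            value, built_at = hit
            if time.monotonic() - built_at < self.max_age:
                return value        # still considered fresh
        value = compute()           # stale or missing: rebuild
        self._store[key] = (value, time.monotonic())
        return value

cache = TtlCache(max_age=3600)      # tolerate values up to one hour old
report = cache.get("monthly_sales", expensive_query)
```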
If we can easily see when the thing being cached was updated – such as a change timestamp on the original data – then checking whether the cache time is earlier than the change time is a perfect dynamic signal to invalidate the cache, and we do not have the problem described in point 3.
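A sketch of that check, using a file's modification time as one concrete example of a "last changed" signal – the class and names are my own, not from any particular product:

```python
import os

# Timestamp-based invalidation: when the source exposes a cheap
# "last changed" signal – here a file's mtime – comparing it with the
# stamp recorded when the cache was built is a perfect dynamic
# invalidation check.
class SourceStampedCache:
    def __init__(self, path: str, compute):
        self.path = path            # the underlying data
        self.compute = compute      # derives the cached value
        self._value = None
        self._source_stamp = None   # source mtime at build time

    def get(self):
        changed_at = os.path.getmtime(self.path)
        if self._source_stamp is None or self._source_stamp < changed_at:
            self._value = self.compute()  # source changed: rebuild
            self._source_stamp = changed_at
        return self._value
```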
In the two cases above we either accept a high risk of serving up old, stale data, or we have a cache for which it is evidently not computationally expensive to find out whether it has been invalidated. In neither situation do we have a big issue with cache invalidation.
The truly hard problem of cache invalidation arises when it is computationally expensive to check whether the underlying data has changed after the cache was created – and we cannot, or will not, accept the risk of serving up old, stale data. When this is the case, the problem is something that keeps software developers up at night and makes them lose sleep.
If software architects had a 100% foolproof cache invalidation scheme that was cheap to implement, they would use caching a lot more than they currently do. If they used caching a lot more, super performance would be easier to deliver. It would be easier to divide larger systems into multiple smaller systems that rely on cached common reference data, updated when needed. And they would sleep better at night. Computers would work on refreshing the caches that actually changed, rather than redoing things just because time has passed – saving millions and millions of clock cycles that can be used for better things. Users would get served fresh data faster with less effort. There are no losers in this equation.
All of the reasons above lead to the conclusion that cache invalidation is a problem worth addressing.
Since the problem is worth addressing, you would think that IBM, Microsoft, Google, Apple, Facebook, Oracle and the rest of the world are investing heavily in this – right? The surprising answer is no-ish. Yes, of course they do caching – but no, they do not offer a 100% foolproof cache invalidation scheme that is cheap to implement, and I will tell you why:
- They all lack a generic, consistent way to discover data changes
- They all lack a declarative description of the transformation from data to cache
- They all lack a way to derive the needed data subscriptions from such a declarative transformation
- They all lack a way of discovering data not yet part of the cache that changes in such a way that it now should be part of the cache
Since there is no declarative way to handle information, and transformations of information, in the portfolios of the big tech companies, they are unable to provide solutions that need detailed metadata in order to function. What I mean by "no declarative way to handle information" is that all the large tech companies are stuck in the cul-de-sac of imperative coding – instead of modeling information. For more on this, read the book Doing effective business by taking control of information.
Imperative coding might feel good, might be fun, and might come naturally to many – but it wreaks havoc on any attempt to statically analyze what the system actually can do. Havoc that eventually will crash some airplanes, accelerate some cars into walls and blow up some nuclear reactors. But it also makes cache invalidation really hard.
If we refrain from archaic imperative coding for describing the gist of the system, and instead describe the information in UML and the transformations in declarative viewmodels with OCL, we actually have no issue at all with cache invalidation. Who would have guessed?
The trick is to make a note, at the member level, of everything that changes – from what and to what – and, for associations, to note both ends of the changed association – then compare those changes with the exact member instances the cache used when it was compiled. Voilà! This is in fact an almost exact persisted replica of the derived-member implementation that has been available for many years in MDriven's in-memory handling of modeled objects – but now for persisted databases with potentially millions of rows.
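To make the trick concrete, here is my own rough sketch of the principle in Python – emphatically not MDriven's actual implementation, just an illustration of member-level change tracking compared against a cache's recorded footprint (all names are hypothetical):

```python
from itertools import count

# Sketch: log every member-level change as (object id, member name) –
# for associations, both ends – and have each cache record exactly
# which member instances it read when it was built. The cache is stale
# iff a change logged after the build touches that footprint.
class ChangeLog:
    def __init__(self):
        self._seq = count(1)
        self._changes: list[tuple[int, tuple[str, str]]] = []

    def note_attribute(self, obj_id: str, member: str) -> None:
        self._changes.append((next(self._seq), (obj_id, member)))

    def note_association(self, a_id: str, a_member: str,
                         b_id: str, b_member: str) -> None:
        # A link change touches both ends, so note both.
        self.note_attribute(a_id, a_member)
        self.note_attribute(b_id, b_member)

    def now(self) -> int:
        return self._changes[-1][0] if self._changes else 0

    def stale(self, built_at: int, footprint: set) -> bool:
        return any(seq > built_at and key in footprint
                   for seq, key in self._changes)

class TrackedCache:
    def __init__(self, compute):
        # compute(footprint) must add every (obj_id, member) it reads.
        self.compute = compute
        self.footprint: set = set()
        self.built_at = -1
        self.value = None

    def get(self, log: ChangeLog):
        if self.built_at < 0 or log.stale(self.built_at, self.footprint):
            self.footprint = set()
            self.built_at = log.now()
            self.value = self.compute(self.footprint)
        return self.value
```

Note that this sketch only covers changes to member instances the cache has already read; discovering data that should now become part of the cache (the fourth lack in the list above) is exactly what requires the transformation to be declared rather than coded.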
It is easy to use, it follows the pattern of server-side declarative viewmodels, and it is consistent. This opens many new doors. Stay tuned for the practical details.