I misread Linearizability. A mistake that led me to the SOSP 2024 Best Paper insight
I am writing this pretty late. In this one, I recall how I came about on a very unique insight after obsessing over the definition of linearizability.
A few months ago, I was trying to understand why Google Spanner uses the term external consistency instead of simply saying linearizability.
At first, I thought I understood linearizability.
The definition sounded simple:
“A strong consistency model guaranteeing that every operation appears to take effect instantaneously at a specific point in time between its invocation and completion, such that a linear ordering can be established that respects real-time ordering of operations.”
My interpretation was wrong. I thought linearizability meant that operations literally happen according to wall-clock time. If operation A finished before operation B started, then the system must execute A before B.
But the important words were not “instantaneously” or even “real-time”.
The important words were: “appears to take effect” and “respects real-time ordering.”
There is a subtle difference between following a rule and respecting a rule. That difference completely changed my mental model.
When I first encountered Spanner's external consistency, I thought:
“Wait, isn't this just another fancy name for linearizability?”
But the more I tried to understand it, the more confused I became.
Something felt inconsistent. Spanner is one of the most carefully designed distributed systems ever built. The authors obviously weren't randomly inventing terminology. So I went back to the definition of linearizability. I kept obsessivng over for it for hours.
And then I noticed something. Linearizability does not say:
“The system must execute every operation exactly when it happens in the real world.”
It says:
“The history observed by clients should be explainable by some ordering of operations that respects real-time constraints.”
The difference is ginormous.
Imagine a system receives these two writes:
Client A: write(x = 1) → completes
Client B: write(x = 2) → completes
A traditional design might immediately decide:
“Okay, A happened before B. Let's assign them positions in that order.”
But what if the system did something strange?
What if it accepted both writes, stored them somewhere, and waited 40 seconds before deciding their final order? Would it still be linearizable?
Surprisingly, yes.
As long as the final history observed by clients can be explained as a valid timeline that respects what actually happened, the system is still linearizable. The system does not need to know the final ordering immediately.
It only needs to produce a valid ordering when the history becomes visible. That realization was the interesting part.
(The vertical lines that you see on Client 1 and Client 2 represent the linearization point aka the point at which the operation was executed between it's invocation and completion)
(Notice how we buffer the writes and defer the execution until 40s have passed. But the real-time ordering is still respected.)At the time, I thought:
"Surely someone has already built something around this idea."
I had only been studying distributed systems for a few months. Researchers with decades of experience had probably explored this entire space.
A few weeks later, I was browsing papers from SOSP 2024. One paper caught my attention (for no real reason, just that the name sounded cool): LazyLog.
I downloaded it and glanced at the abstract.
My stomach dropped.
This was the exact idea. Not the implementation details. Not the complete system design.
But the core intuition:
Ordering does not always need to happen on the critical write path.
Traditional distributed logs usually have a bottleneck:
A write arrives → the system contacts a global sequencer → the write gets an exact position → then it can be stored.
The sequencer becomes the thing every operation waits for.
LazyLog challenges that assumption. Instead of immediately ordering every write, it allows writes to enter storage without a final position. The system delays ordering.
Later, when clients read the log, LazyLog reconstructs a valid linearizable order using the timing relationships between operations. The result is that the expensive global ordering step is removed from the critical path.
The system is still linearizable. It just gets there lazily.
Of course, the actual LazyLog paper contains many more engineering details and optimizations. My realization was only a small conceptual observation.
But what fascinated me was the path that led there.
I did not find it by designing a new algorithm.
I found it by refusing to move past a sentence in a definition that I did not fully understand.
A lot of distributed systems feels like this. The difficult part is often not memorizing algorithms.
It is building the right mental model.
And sometimes one word in a definition can completely change the way you see a system.
Maybe one day, I will have a paper in SOSP. I genuinely wish for that day.
Fingers crossed 🤞