
Earlier this week someone asked me to describe the extent of my experience with distributed systems. I replied, only somewhat hyperbolically, that “at this point in my evolution as an engineer it feels like distributed systems are the only kind of system”. I was not being facetious; quite the contrary.
I may have been subconsciously channeling a bit of Yair Amir, my former Hopkins distributed systems professor, with the nature of my remark. I remember, some seventeen years ago, on the first day of class his calling on each student to say why they were interested in distributed systems. On my turn I offered something like “it’s where some of the most interesting security problems are”. I hazily recall his rejoinder, at substantial risk of gross misquotation, as being similar to the one I gave this week: “that’s where the only security problems are”.
Work on messy problems in this realm long enough and you begin to see everything through the lens of distributed systems. In cryptography one may speak of “security in transit” and “security at rest”, but what is “security at rest”, really, if not the case of data transiting through time? Writing some SQL twenty years ago already involved, whether you knew it or not, the coordination of application space, database engine, operating system, and device drivers, with caching everywhere. Today the underpinning “devices” may not be just components plugged into a shared motherboard but rather spread across buildings or even regions.
I recall a meeting from my government days where several managers were in a room diagnosing the problem of a complex set of globally distributed systems regularly breaking in ways that involved substantial downtime and labor to restore. “Why aren’t people building self-healing systems?”, one participant without programming experience lamented, to which my reply was “because distributed systems programming is hard, most people don’t know how to do it, and many people don’t even realize that they are doing it”. And therein lies the crux: we’re all distributed systems programmers but only some of us are embracing it.
Perhaps a major distinction resides in whether the attendant failure scenarios seem to merit one’s attention. This in turn comes down to probability and risk tolerance, which stem from system complexity and mission criticality. The right questions are not binary ones such as “is this a distributed system?” and “is this an important system?” but rather “how distributed?” and “how important?”, which lead to “what are the meaningful failure modes?”. Much in the way that “strategy” and “tactics” begin to blur in the eyes of experts, so, too, should we be thinking of system complexity and risk in terms of gradients and tolerances.
If you want to build sensible distributed systems then you must start by accepting a few simple truths and subsequently decide how much to care about them:
- Messages can arrive in any order, any number of times, and arbitrarily corrupted
- Networks can exhibit wildly fluctuating throughput, latency, and reliability
- Storage media can fail without warning and without recourse
- Processes can experience faults at any point when consuming a message
- Stateful systems can exhibit emergent phenomena under prolonged stress
That’s it! That’s all you have to remember. Keep these realities ever in mind and you are on your way to building robust systems. Mind you, that does not imply that the solutions will be easy to craft and trivial to understand. Your toolbox will include techniques as varied as idempotency, transactions, journaling, message replay, agreed ordering, checksums/signatures, persistence, replication, pre-emptive caching, leader election, and many more. Nonetheless the earlier list provides a simple framework for asking the right questions about resilience and exploring potential solutions. Ignore it at your peril.
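To make a few of those tools concrete, here is a minimal sketch of a receiver that copes with the first truth on the list: messages arriving out of order, more than once, or corrupted. It combines a checksum, deduplication (so that handling a message twice is harmless), and ordering by sequence number. Everything here is an assumption made for illustration: the names `Message`, `Receiver`, `make_message`, and `digest` are hypothetical, the sender is assumed to number its messages from zero, and a corrupted message is simply dropped on the expectation that the sender will replay it.

```python
import hashlib
from dataclasses import dataclass, field


def digest(seq: int, payload: bytes) -> str:
    """Checksum covering both the sequence number and the payload."""
    return hashlib.sha256(str(seq).encode() + b":" + payload).hexdigest()


@dataclass(frozen=True)
class Message:
    seq: int        # sender-assigned sequence number (assumed to start at 0)
    payload: bytes
    checksum: str   # hex SHA-256 produced by digest()


def make_message(seq: int, payload: bytes) -> Message:
    return Message(seq, payload, digest(seq, payload))


@dataclass
class Receiver:
    next_seq: int = 0                              # next sequence number to deliver
    pending: dict = field(default_factory=dict)    # buffered out-of-order messages
    delivered: list = field(default_factory=list)  # payloads, in agreed order

    def receive(self, msg: Message) -> bool:
        """Return True if the message was accepted, False if it was dropped."""
        # Corrupted in flight: drop it and rely on the sender replaying it.
        if digest(msg.seq, msg.payload) != msg.checksum:
            return False
        # Duplicate: already delivered or already buffered, so ignore it.
        if msg.seq < self.next_seq or msg.seq in self.pending:
            return False
        self.pending[msg.seq] = msg
        # Deliver every message that is now contiguous with what came before.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq).payload)
            self.next_seq += 1
        return True


msgs = [make_message(i, p) for i, p in enumerate([b"a", b"b", b"c"])]
corrupted = Message(1, b"X", msgs[1].checksum)  # payload no longer matches checksum

rx = Receiver()
for m in [msgs[2], corrupted, msgs[0], msgs[0], msgs[1], msgs[2]]:
    rx.receive(m)

print(rx.delivered)  # [b'a', b'b', c'] order restored: [b'a', b'b', b'c']
```

Note what the sketch leaves out: the receiver's state lives only in memory, so a process fault or a failed disk loses it. Surviving those takes the journaling, persistence, and replication mentioned above, and the buffer of pending messages would need a bound before it met a misbehaving network.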
And finally, when you step away from the code editor, or perhaps because “programming” is not your thing, consider the many marvelous distributed systems in your life. Zoom all the way in: you are yourself one such amazing electrical, chemical, and mechanical system. Zoom all the way out: the planetary ecosystem, the global financial system, the omnipresent Internet, each with trillions of loosely coupled moving parts. Zoom partway back in: cities, companies, and families, with their geometrically complex collections of relationships. All of these systems struggle with challenges around capacity, latency, reliability, and emergent phenomena. A little bit of distributed systems thinking could go a long way in every aspect of our existence.