Blood On The Data Center Floor
I put my head in my hands, elbows propped on the desk, the realization sinking in of what kind of day it would prove to be: one like far too many around that long-ago time, alas. With some regularity a key processing system operated by a sister office was going into fits of data loss, leaving our operational picture resembling Swiss cheese. How could this happen? How could people let this happen?
In some sense the answer was simple even though the problem was complex. A flow-based programming system born of one domain and built to run on bare metal had first been ported to a fundamentally different domain and subsequently resettled to a container farm during a fit of irrational and ill-informed “Cloud” and “NoCode” exuberance. Where, in another place and time, the system had experienced low rates of data loss in a context where losing a little data was no big deal, we now had high rates of data loss in a context where even a little loss was severely damaging.
The slightly more complex answer involved a firehose of data transiting a complex workflow that for any object might take an unpredictably long time to execute. Worse still, the processing of an object entailed invoking an external metadata management service common to all workflows. This external service, notably, was subject to getting overwhelmed and dropping requests, a scenario that API clients managed by implementing retry protocols on their side, often fomenting a flash crash as competing systems raced to the bottom. Thus objects would pile up deep within these workflows for indeterminate amounts of time as the memory usage of an underlying container crept upward until…
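The usual antidote to that race to the bottom is exponential backoff with jitter: each client spreads its retries out randomly instead of stampeding the recovering service in lockstep. A minimal Python sketch of the idea (the function names and parameters here are illustrative, not drawn from the system in the story):

```python
import random
import time

def backoff_delays(max_retries=5, base=0.5, cap=30.0):
    """Yield sleep durations using "full jitter" exponential backoff.

    Randomizing each delay keeps a fleet of clients from retrying in
    lockstep and re-crushing a service that is trying to recover.
    """
    for attempt in range(max_retries):
        yield random.uniform(0.0, min(cap, base * (2 ** attempt)))

def call_with_backoff(request, max_retries=5, base=0.5, cap=30.0):
    """Invoke a zero-argument callable, retrying on failure with jitter."""
    last_error = None
    for delay in backoff_delays(max_retries, base, cap):
        try:
            return request()
        except Exception as exc:  # in production, catch only retryable errors
            last_error = exc
            time.sleep(delay)
    raise last_error
```

Backoff alone merely delays the stampede; the jitter is what decorrelates the clients and lets the overwhelmed service actually catch its breath.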
*BANG*. All your data are belong to /dev/null.
The historical stability of the underpinning infrastructure led to a software product with systemic fragility that could not withstand a sudden change in circumstances.
The Goad To Reflection
I recently finished reading Taleb’s Antifragile, in which he explores the properties of systems that thrive on volatility, randomness, disorder, and stressors. You should read it, too, as well as his other works. Throughout my read I found neural pathways lighting up as I connected his ideas to situations in my own reality. At the end Taleb exhorts readers to reflect on applications in their own lives. This represents my attempt to do so.
Fractally Distributed Systems
I credit my Distributed Systems professor Yair Amir with stoking a fascination with and passion for understanding how complex systems break and what means we have for dealing with them. Some twenty years later I have memories of profound aphorisms such as “the network is not a disk” and hands-on projects that drove the points home. From his class I gleaned many timeless and universal techniques for fostering resilience in systems. Certain higher order lessons, however, would have to wait for the violence of directing the development of complex systems over the span of years and countless calamities. I would require a many-layered journey to properly become a generator and purveyor of such knowledge myself.
Taleb, meanwhile, begins by explaining to us that not only is our intuition about what “anti-fragile” ought to mean wrong, but human language has altogether lacked a single word to capture the concept. To be anti-fragile does not mean simply to be “resilient” or “robust” in the face of adversity but rather to gain from it. It implies an ability to evolve, which hinges on many things.
To build a robust software system in a complex domain, first you must set out to build an anti-fragile wrapper system, one capable of absorbing and operationalizing myriad hard lessons accumulated over many years wrangling chaotic problems. Notably, that anti-fragile system must be viewed not just through the lens of technology but also culture, with key areas spanning the following…
We must have a culture that encourages people to own their mistakes, to do so publicly without fear of retribution, and to seek the counsel of others on how to do better. We must likewise encourage people to be humble and explicit about the risks they are assuming, demonstrating a realism about the struggle they may be taking on with a project. Without this people will live in fear and treat their troubles as liabilities instead of assets while behaving in a manner simultaneously secretive and reckless.
We require the knowledge to understand what triggered a bad outcome. This implies the generation, processing, storage, and analysis of high quality telemetry. If we don’t know what is true then we cannot tackle the problem that exists. In systems engineering, premature optimization is the root of much evil, and without a rich radar picture of where problems lie we will forever find ourselves focused on the wrong problem while laboring under bogus theories.
We must have the technology and culture to move quickly yet responsibly. This implies not just a mature DevOps toolchain in the realms of CI/CD and infrastructure-as-code but also flexibility in Data Engineering architecture and furthermore a dynamic approach to managing risk that views dangers in strata and makes recommendations accordingly. The more agile a team’s approach, the more the team can exhibit anti-fragility by quickly codifying lessons learned in its software and organizational DNA.
Critical to our ability to innovate is access to data and systems that allow us to tinker. Empowering people to fashion a bricolage that solves a problem they have, doing so quickly and unilaterally, accelerates the generation of knowledge and capture of value. This entails open access to APIs and data lakes as well as sandboxing mechanisms that enable exploration while avoiding compliance headaches.
We often find ourselves locked into rigid hierarchies where bureaucrats assign “lanes” in which subordinates possess the mandate to operate and deviation from which yields punishment. This fosters a toxic ecosystem of turf wars, learned helplessness, and incestuous amplification. Even though a specific group or product may have formal responsibility for a particular area we should encourage tinkerers everywhere to self-actualize.
And yet we squander all of this accessibility and diversity if we fail to make it safe for people to be transparent about how they are approaching problems. If officious bureaucrats shout “not your lane!” and seek to quash skunkworks initiatives then they inevitably drive them underground, either destroying them altogether or at least failing to fully capitalize on their innovations by triggering compartmentalization, simultaneously damaging morale and fueling attrition.
But if we are to embrace such fluidity then we must also remain clear-eyed about maturity. The hand-edited Python script that haphazardly scrapes some APIs while running in someone’s private sandbox lives on a whole other plane of existence from something that is explicitly versioned, cleanly packaged, automatically deployed, robustly monitored, demonstrably scalable, and so on. Far too often stakeholders conflate the potential of a PoC with the sustainability of a productized system. Avoiding such pitfalls requires mutual respect and open dialogue between multiple parties.
As tools and tradecraft that underpin solutions to cross-cutting concerns mature owing to diverse and transparent activities transacting on accessible data and systems we should cautiously and gently act as gardeners who tame and channel such organic growth into lasting solutions while accepting the need for continual change.
Toward Resilient Software Systems
There is a certain beauty inherent in the violence of running an application within containers underpinned by a shared cloud and subject to unruly third party integrations and downright hostile actors. Execution contexts disappear without warning on a regular basis. Latency for all manner of things drifts as a matter of course. Load shocks reliably emerge from nowhere at the most inconvenient times. Deployment of components happens in an utterly asynchronous and parallel fashion. And, always and forever, humans gonna human. It is not just that “anything that can go wrong will go wrong” but that it will do so “all the damn time”. Best to start figuring out solutions as soon as possible and organically instead of trying to become robust one future day all at once and via top-down command-and-control edicts.
A central thesis in Taleb’s work states that local fragility enables systemic anti-fragility which in turn fosters a greater resilience over time. What better ecosystem for this could we imagine than the aforementioned application architecture coupled with a healthy engineering culture? The pervasive and continual unreliability of our componentry serves as an anti-fragility pressure cooker, forcing us to continually work at resilience in a way that proactively builds resistance to mundane mishaps and Black Swans alike. The lessons to be learned span all of the following and more…
Your compute fabric is going to sporadically disappear out from under you. You had better not claim responsibility for a piece of data before it has made it to its rightful destination. What business have you consuming a message before you have successfully processed it? And wouldn’t it be nice to practice with losing containers before you lose a whole rack of equipment?
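That consume-only-after-processing discipline can be sketched in broker-agnostic Python. Here `process`, `ack`, and `nack` stand in for whatever your queue’s client library actually provides (deleting an SQS message, an AMQP basic_ack, and so on); none of the names come from the system in the story:

```python
def handle_message(message, process, ack, nack):
    """At-least-once handling: acknowledge only after successful processing.

    If the container hosting this worker vanishes mid-flight, the
    unacknowledged message becomes visible again and another worker
    picks it up. A lost execution context then costs a redelivery,
    not the data itself.
    """
    try:
        result = process(message)
    except Exception:
        nack(message)  # return the message to the queue for redelivery
        raise
    ack(message)       # safe now: the work is durably done
    return result
```

The price of this pattern is occasional duplicate delivery, which is exactly why the idempotence discussed below earns its keep.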
You are going to ship buggy code. It is going to mis-process data. If you don’t make data replay a first class citizen then you’re gonna have a bad time. And if you don’t ready your processors to consume replayed data you’re going to have an even worse time.
No reliable ordering? No exactly-once delivery? No problem! Develop a sequencing protocol, slap timestamps on everything, queue up fragments, and make your processing idempotent. Hard drive failures? No worries! Just stop installing packages on servers, instead building fully self-contained Docker images, and configure your most critical data to replicate across Availability Zones and maybe even Regions.
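A toy Python sketch of that combination of timestamps and idempotence (the in-memory dict stands in for a durable store, and the class is illustrative rather than any particular product’s API):

```python
class IdempotentStore:
    """Last-writer-wins by timestamp, immune to duplicates and replays.

    Applying the same update twice, or applying updates out of order,
    converges on the same state: any fragment older than what we have
    already seen for a key is simply dropped.
    """

    def __init__(self):
        self._state = {}  # key -> (timestamp, value)

    def apply(self, key, timestamp, value):
        seen = self._state.get(key)
        if seen is not None and seen[0] >= timestamp:
            return False  # duplicate or stale fragment: ignore it
        self._state[key] = (timestamp, value)
        return True

    def get(self, key):
        entry = self._state.get(key)
        return entry[1] if entry else None
```

Because applying an update is now a pure function of key, timestamp, and value, replaying an entire day’s traffic through the pipeline is safe rather than terrifying.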
People are going to try to game your system. Consider making your APIs asynchronous, queue based, partitioned, and metered. Hijackers gonna hijack. Build your container images minimally to reduce attack surface and starve them of tools, minimize tasks’ connectivity and permissions to reduce blast radius, make processes and tokens short-lived to reduce the window of compromise, and wall off subsystems with network and account boundaries.
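Metering can be as simple as a per-client token bucket. In this sketch the clock is injected so behavior is deterministic under test; the rates and names are illustrative assumptions, not a prescription:

```python
class TokenBucket:
    """Per-client request metering: refill `rate` tokens per second,
    up to `capacity`, and charge one token per request.

    Clients that try to game the API with bursts drain their bucket
    and get throttled; well-behaved clients never notice.
    """

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # burst allowance
        self.tokens = float(capacity)
        self.last = 0.0

    def allow(self, now):
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A real deployment would keep one bucket per API key in shared storage, but the shape of the decision is the same.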
Hating Heroism (in some of its forms)
The greatest fragility of all perhaps resides in the people who build and operate a system. Make them take vacations and see what breaks. Unplug direct administrator access to production systems to see who freaks out and what bursts into flames. Ensure that releases kick off with a single click and have every last configuration parameter tracked in a version control system. When a hero rushes in to save the day, consider asking one simple question — Is this savior not just the solution to but also the cause of my problems? Making people take time off doesn’t just stave off burn-out. It also surfaces critical fragilities that could easily escalate from nuisance to catastrophe.
The Larger World
We are tempted all the damn time to build fragile systems in the name of efficiency, safety, and comfort, but it is a fool’s errand that leaves us complacently sitting on a keg of dynamite in the form of tail risk.
Want social media to regulate content? Sounds nice, but… What entrenched interests will end up deciding what is true and/or palatable? How much power do you want to put in the hands of tech oligarchs? What happens when The Other Guy is in office tomorrow? How do you feel about Big Tech becoming an organ of the state? What unintended consequences will we reap from letting algorithms control our discourse?
Worried about gun violence? Tempted to strip all private citizens of guns? How do you respond when some lunatic, bigot, or terrorist runs amok and mere seconds make the difference between a short-circuited attack and a massacre? What happens when a populist authoritarian thug hijacks a democratic government and begins rounding up an unarmed populace? What timely resistance will you offer if an expansionist neighboring country begins rolling armored vehicles across your border? How will you distinguish yourself as a citizen versus a subject? Some subset of the population will always have arms. Why not everyone within reason?
Want a hyper-efficient economy predicated on globalism, specialization, outsourcing, consolidation, and just-in-time manufacture? Cheap and plentiful products are nice, but… What happens when a country weaponizes its position as an energy provider? How do you fare when a single enormous boat blocks a critical canal? Or when a planetary-scale SaaS provider suffers a class break that compromises all of its customers? What happens when a pandemic spooks people into hoarding staples and countries into nationalizing strategic resources? How do you adhere to your principles when it turns out that the mining and manufacturing in foreign lands that underpin the products to which you have become addicted involve environmental destruction, forced labor, child soldiers, and perhaps even genocide? What happens when such platforms as smart phones are monopolized by just two Big Tech companies? Or public clouds by three? Or the production of your food is run by an ever shrinking number of increasingly massive agribusinesses and food conglomerates who buy politicians, poison people, and commit atrocities against animals while shielded by Ag-Gag laws?
Not sure about climate change? Yes, it is a big messy subject with a lot we don’t yet understand regarding both the structure of causal chains and the efficacy of potential solutions. But don’t get hung up on that. Instead start by simply thinking in terms of Risk Of Ruin and Via Negativa. We have exactly one planet, this planet represents the most complex system we have ever known, we are growing rapidly while tinkering recklessly, and getting things wrong could quite literally foment the extinction of our species. Perhaps some humility is in order? Perhaps the burden of proof should lie on those claiming no harm while profiting from rapidly making large scale and poorly understood changes to a system evolved over billions of years? Perhaps predicating our whole economic reality on perpetual growth to manage extreme indebtedness is a recipe for collapse?
Let us forsake the false idols of global efficiency, local safety, and temporary comfort, replacing them with the hard work and humility required for the anti-fragility we need to survive and thrive.
What better way to exhibit personal anti-fragility than to keep cogitating on confusing topics while maintaining the courage to press the “Publish” button and the open-mindedness to hear feedback? I often get to the end of a writing project and find myself tempted to stuff the piece in a private drawer because I feel that the treatment is either too inchoate or too incoherent. Mostly I overcome that cowardice and just push it out to the world in all its imperfect glory, humbly accepting that today’s ruminations represent works in progress that may take decades to reach fruition. How different a world might we inhabit if everyone could feel safe in doing so? How much more meaningful a dialogue might we have if the bulk of our discourse were not held hostage to machine learning algorithms implementing Sort-By-Controversial while fostering 15 second and 280 character attention spans?