Toward Resilience

Various folks in the realms of engineering, government, and business have long been navigating a class of challenge whose relevance the general public has recently come to feel viscerally.

How do you build and operate systems that can handle temporally varying loads imposed by heterogeneous collections of missions executed by uncoordinated entities?

With the onslaught of COVID-19 we suddenly observe desperate overload and squandered capacity existing side-by-side.

State-administered web applications for unemployment benefits are crashing under the strain.  Meanwhile tech built in many places for a burgeoning economy suddenly seems over-provisioned.

Medical staff the world over find themselves grossly overextended and quickly headed for conditions usually reserved for third world catastrophes.  Simultaneously folks in the hospitality industry are experiencing forced idleness and indigence.

We can still enjoy a variety of consumer goods that will arrive at our doorstep in a matter of hours or days.  And we can while away the time with those toys in the safety of our homes as people die for want of critical medical supplies over the coming months.

We arrived in this situation, and are struggling to manage it, because we have not yet grokked the implications of modern technologies and global economies.  And yet those same things hold the keys to short-term solutions and long-term resilience.

Now a brief interlude to define some adjectives we might apply to systems…

  1. Scalable: sustains system performance and unit economics as load grows
  2. Elastic: expands and contracts the underlying infrastructure in sync with load
  3. Adaptable: readily reconfigurable to support a wide variety of activities

You probably need not be a systems engineer to understand the heartbreak presently resulting from myriad organizations and technologies failing to exhibit the above properties.

I suspect that the competition between three objectives lies at the center of our problems:

  1. Affordability
  2. Availability
  3. Sovereignty 

Under ordinary circumstances we want to be paying reasonable prices for goods and services, under extraordinary circumstances we would like the essentials to remain attainable, and under all circumstances we dislike being pushed around or held hostage.  Regrettably we denizens of Earth have over the past several decades been prioritizing Affordability and very specific kinds of Sovereignty in a way that has left us ill-prepared for recent events and thus thrust into an Availability crisis.  There likely exists an assortment of reasonable trade-offs to be made in these realms but many of the relevant tools and techniques are new, learning to leverage them will take time, and the transition period will prove bumpy.

In my worlds of Data Engineering and Security Engineering I have seen these struggles play out time and again and in no ecosystem did the results ever feel entirely satisfying.  I’ve lived in government data centers wrangling mission-critical and ultra-sensitive workloads where Availability and Sovereignty were king and the consequence was a money fire.  I’ve built systems in organizations where Affordability trumped all and the price was ceding Sovereignty and accepting huge risks around Availability.  And I’ve experienced places trying to take a measured approach to balancing all three and damn is it tough to get the relevant stakeholders in sync on a roadmap.

As we navigate these trade-offs we must carefully wield various double-edged swords:

  1. Debt: immediate flexibility and leverage at the expense of capacity tomorrow
  2. Specialization: economy of scale today at the expense of flexibility tomorrow
  3. Centralization: efficient but jeopardizes scalability, availability, and adaptability
  4. Reactivity: avoids speculation while risking non-availability and exploding costs
  5. Secrecy: short-term localized safety at the expense of long-term global efficacy
  6. Proprietorship: spurs immediate investment while risking long-term stagnation

Fortunately the innovations of recent decades have armed us with tools and techniques such that many things become much less of an either/or decision.  We must, however, engage on these novel approaches with patience, persistence, and humility.

In the digital realm we have made great strides in commoditizing compute resources and building large distributed systems, but the techniques are not widely understood, the general purpose tooling is immature, a handful of vendors dominate the space while differentiating via incompatible protocols, and staggering amounts of legacy tech weigh us down (calling all “cobalt” programmers! weirdly enough the world desperately needs you, though not for the first time in recent memory).

In the physical realm we have created a magical world where travelers can transit the globe with unprecedented ease, consumers can source goods from their couches via pocket-sized super-computers and expect them to arrive nearly instantaneously, and it has all become shockingly cheap and mostly reliable.  But this has been possible only by way of just-in-time manufacturing, extremely complex supply chains, hyper-specialized professionals and facilities, an obsession with growth, and razor-thin margins on capacity and profitability, all of which entail a great deal of fragility, risk, and long-term cost.

And now we find ourselves in a crisis that is shaking us to our very core.

What to do?

We are already seeing some awesome examples of ingenuity and collaboration, big and small, in the face of disaster.  Countries are smoothing supply and demand disparities through mutual aid agreements, government agencies and hoteliers are collaborating to convert idled properties into makeshift hospitals, distilleries are reallocating manufacturing capacity from liquor to hand sanitizer, and individuals are crafting masks at home in their spare time.  We should look to these activities not just as noble and ingenious short-term solutions but also as a guide on how to re-architect our future society to be more resilient in the face of unpredictable circumstances by emphasizing Scalability, Elasticity, and Adaptability.

We appear to have learned some valuable things from the 2008 financial crisis that policy makers are rapidly deploying to blunt the economic impact of COVID-19.  This is heartening and with any luck will prevent a full-blown Depression.  I am, however, sad to realize that we failed to generalize those learnings.  If you Google something like “financial system stress test” you’ll get reams of hits discussing the extensive activity and research in this area, but try “supply chain stress test” and you’ll be relatively disappointed.  Apparently we have to get our asses handed to us in a very specific way before we develop a very specific solution.  It took, for example, the 1973-1974 oil embargo to goad the USA into creating the Strategic Petroleum Reserve, and that trend seems to continue.

We need to be asking better questions about what can kill us, what we can do about it, and how we can know whether our approach is proving effective.  Hopefully we not only come out of this crisis having added Medicine to the list that includes Energy and Finance, but also looking to proactively establish practices for blunting similar impending crises in the realms of Manufacturing, Communications, Transportation, Environment, and Agriculture.  And hopefully the approaches we take preserve our humanity.

Improvements in sensors, telemetry, and predictive analytics show great promise for helping us to anticipate surges in demand and thus proactively provision capacity, but we must also take great care not to put too much stock in their prescience and meanwhile remember that the risk of tech-enabled human rights abuse is enormous.  We need more of this data but also more robust legal and policy frameworks to ensure responsible and ethical behavior.

Lastly, to make effective use of such prescience we must build toward systems of greater simplicity, adaptability, and interoperability, leverage those properties to create more robust and localized shared resource pools, maintain a responsible amount of slack both within and across those resource pools, further mitigate risk by formalizing mutual aid pacts, and engage in regular stress testing to validate our key assumptions in a variety of worst-case scenarios.
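
For the engineers in the audience, here is a toy sketch of what stress testing a set of resource pools might look like.  Everything in it is my own illustrative assumption rather than a model of any real system: the `Pool` class, the `unmet_demand` function, the region names, and the numbers are all hypothetical.  Each pool has a fixed capacity and a baseline demand, a stress scenario multiplies demand in chosen pools, and a mutual aid pact is crudely modeled as letting any pool's spare capacity cover any other pool's shortfall at no cost.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """A hypothetical localized resource pool (e.g. hospital beds in a region)."""
    name: str
    capacity: float  # units the pool can serve per period
    demand: float    # baseline units requested per period


def unmet_demand(pools, surges, mutual_aid=True):
    """Total demand left unserved under a stress scenario.

    `surges` maps pool names to demand multipliers; pools not listed
    keep their baseline demand. With `mutual_aid` enabled, spare
    capacity anywhere may be applied to shortfalls anywhere (a
    deliberately optimistic simplification: no transport cost or delay).
    """
    total_shortfall = 0.0
    total_spare = 0.0
    for pool in pools:
        load = pool.demand * surges.get(pool.name, 1.0)
        total_shortfall += max(load - pool.capacity, 0.0)
        total_spare += max(pool.capacity - load, 0.0)
    if mutual_aid:
        return max(total_shortfall - total_spare, 0.0)
    return total_shortfall


if __name__ == "__main__":
    # Three regions, each running with 20 units of slack.
    pools = [
        Pool("north", capacity=100, demand=80),
        Pool("south", capacity=100, demand=80),
        Pool("east", capacity=100, demand=80),
    ]
    scenario = {"north": 2.0}  # demand doubles in one region
    print("isolated:  ", unmet_demand(pools, scenario, mutual_aid=False))
    print("mutual aid:", unmet_demand(pools, scenario, mutual_aid=True))
```

In this made-up scenario the surging region falls 60 units short on its own, and pooling the slack of its two neighbors cuts that to 20.  The point is not the numbers but the habit: write down the assumptions about slack and sharing, then run the ugly scenarios against them before reality does.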

We have crashed quite unexpectedly into a new era almost overnight.  Let us find a way to make it one characterized by resilience and sustainability.
