This is a disconcerting time. On the one hand I have more opportunity than ever for my writing and a sizable backlog of topics I’ve long been accumulating. On the other hand it can feel tone-deaf to write on any of those topics as if COVID-19 weren’t turning the world upside down. And so this is my latest attempt to bridge what now feels like two worlds and ponder what transcends them.
Early in my Software Engineering career I began taking on System Administration knowledge as an enabler to bootstrap products quickly. Over time these disparate domains would knit together into DevOps.
Shortly thereafter Security Engineering began infusing my skill set. This proved an enabler in the middle phase of projects where gaining traction meant experiencing scrutiny. Baking security into a product from the outset is orders of magnitude easier than retrofitting it.
More recently we have seen the term DevSecOps emerge. At first glance the casual observer might cynically conclude that we are just gluing buzzwords together. And certainly many folks are. But as a grizzled practitioner of these historically disparate domains I feel we are onto something.
There has long existed an adversarial role between the Software Engineer and their counterparts. The System Administrator laments “thrown over the wall” delivery of applications, the Security Engineer gets backed into acting more like cop or bureaucrat than collaborator, and the Quality Assurance Engineer often shows up just in time to be the ultimate buzz kill. If we zoom out, however, and apply a shared set of principles across all of those domains, then magic can begin to happen.
Do a Google search on “prevention detection response” and you will find the first two pages of results exclusively discussing cyber-security. You must persist to page three of the results (who does that?!) to reach mention of other domains where, lo and behold, we begin seeing hits on Public Health and Pandemics from CDC, ASEAN, and GHS Index (more on that later). I gave up on page six without having seen any mention of DevOps. But why?
First let’s assert a few basic realities:
- No safety or security mechanism is bulletproof
- Increasing expenditure shows diminishing return
- Timely adaptation to evolving circumstances is crucial to survival
If you cannot sustain a reasonable delivery velocity for meaningful product features then customer engagement will suffer. At the same time if either you or hostile actors unduly disrupt the product experience then, again, customer engagement will suffer. Thus we experience the central tightrope walk of Product Engineering. To that end we must manage risk in a context-aware fashion, adopt a multi-faceted approach, and remember that resilience is attainable in a way that perfection is not.
Consider the foundations of a good DevOps process:
- Automated Testing
- Observability and Monitoring
- Version Control
- Infrastructure|Configuration As Code
- Continuous Integration / Continuous Delivery|Deployment
Now ponder how all of those elements provide some manner of Prevention, Detection, or Response to underpin responsible system ownership from a Safety standpoint while also providing myriad hooks where we can attach technology that provides ongoing Security value.
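The mapping above can be sketched concretely. This is purely illustrative Python, and the role assignments are my own framing rather than any industry standard:

```python
# Illustrative mapping of the DevOps foundations above to the Prevention /
# Detection / Response role(s) each plays. The groupings are my own framing.
DEVOPS_ROLES = {
    "automated_testing": {"prevention"},
    "monitoring": {"detection"},
    "observability": {"detection", "response"},
    "version_control": {"prevention", "response"},
    "iac_cac": {"prevention", "response"},       # Infrastructure|Configuration As Code
    "ci_cd": {"prevention", "response"},
}

def practices_providing(role):
    """Return the practices that contribute to a given role, sorted by name."""
    return sorted(p for p, roles in DEVOPS_ROLES.items() if role in roles)
```

Notice that every practice shows up under at least one role, and most of the Response column is carried by the same automation that provides Prevention, which is exactly the dual-use leverage argued for here.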
We will never ship defect-free code but we need not be paralyzed by this risk if:
- Automated testing makes shipping defects reasonably unlikely
- Monitoring tips us off to problems quickly
- Observability enables rapid debugging
- Version control precisely describes the software in production
- Infrastructure and configurations are included in what we consider our “software”
- IaC/CaC + CI/CD enables realistic test environments and rapid shipment of software
- Blue-Green Deploys reduce downtime and enable phased roll-outs and roll-backs
How much more aggressively can we responsibly ship code with those layered defenses in place? How much readier for Disaster Recovery will we be when we have so thoroughly described the system and automated its roll-out?
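The blue-green pattern in particular deserves a sketch, since it is the mechanism that makes roll-out and roll-back cheap. This is a minimal, hypothetical model of the traffic cut-over, not any particular platform’s API:

```python
class BlueGreenRouter:
    """Minimal sketch of blue-green deployment: two identical environments,
    only one of which serves live traffic at any moment."""

    def __init__(self):
        self.live = "blue"    # currently serving customer traffic
        self.idle = "green"   # standing by to receive the next release

    def deploy(self, version, health_check):
        """Deploy `version` to the idle environment; cut over only if it
        passes the supplied health check. Live traffic is never touched
        until the new environment has proven itself."""
        # ... push `version` to self.idle here (omitted in this sketch) ...
        if health_check(self.idle, version):
            # Instant cut-over; the old environment remains intact,
            # so rolling back is just swapping these two names again.
            self.live, self.idle = self.idle, self.live
            return True
        return False

router = BlueGreenRouter()
ok = router.deploy("v2.0", health_check=lambda env, version: True)
```

The design point is that failure is cheap: a failed health check leaves the live environment untouched, and even a bad cut-over can be reversed by swapping back to the previous environment, which is still running the last known-good version.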
Consider now all the junction points in such an ecosystem where Security Engineers can integrate as high-value contributors in a fashion that is collaborative, automated, and continual:
- If we describe in an exhaustive and programmatic fashion how we arrive at build artifacts and grant compute nodes access to storage and messaging resources then Application Security Engineers can continually scan in-house code and third party dependencies for vulnerabilities and reason about minimal permissioning.
- If we do similar for how we construct our network fabrics and attach them to compute nodes we similarly empower Network Security Engineers to reason efficiently about the attack surface, blast radius, and exfiltration opportunities extant in their area of responsibility.
- If we take Observability seriously by standardizing how we generate, transmit, and store telemetry then we help SOC Analysts looking for threats as much as the Support Engineer monitoring operational health.
- If we take build-and-deploy automation seriously then what better way to ensure we can rapidly patch vulnerable systems, minimize the lifetime of execution environments that serve as adversarial toe holds, and quickly recover from compromises?
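The first of those junction points, continual dependency scanning, can be sketched as a CI step. The advisory entries below are invented for illustration; a real pipeline would pull from a vulnerability feed rather than a hard-coded set:

```python
# Hedged sketch of a CI gate that fails the build when a pinned dependency
# matches a known-vulnerable (name, version) pair. ADVISORIES is invented
# sample data standing in for a real vulnerability feed.
ADVISORIES = {
    ("leftpad", "1.0.0"),   # hypothetical advisory entry
    ("parson", "2.3.1"),    # hypothetical advisory entry
}

def scan(pinned_dependencies):
    """Return the pinned (name, version) pairs that have open advisories."""
    return sorted(set(pinned_dependencies) & ADVISORIES)

hits = scan([("leftpad", "1.0.0"), ("requests", "2.31.0")])
if hits:
    print(f"build blocked: vulnerable dependencies {hits}")
```

Because the dependency manifest lives in version control alongside the code, this check runs on every commit rather than as a periodic audit, which is the collaborative, automated, continual posture argued for above.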
Slick tools and techniques, however, do not relieve us of the burden of continual application of human judgment as we iteratively field systems. Every time we release a new version we ought be reasoning about risk in terms of Probability and Impact, and folded into Impact furthermore is Recoverability. If we’ve taken our DevOps’ing seriously then all circumstances will generally be more recoverable than otherwise, but certain classes of errors are inherently less easily recoverable, and so we must take care to front-load an appropriate amount of mitigative effort in our process. My favorite mental map for this problem comes from a conference presentation of forgotten venue and authorship (I regret I cannot give due credit) which, as I recall, layered the domain thus:
- Photons: you botched the code that does final rendering for a feature that is of low impact and/or is tolerant of moderate latency and you need only ship a new version of that code to return to full capability
- Electrons: you corrupted data and need not only ship new application code but also create and run data repair code that may take a while to execute during which period there may be material delays incurred by time critical business functions
- Atoms: your system gave (or failed to give) instructions such that there were substantial bad outcomes in the physical world, perhaps damaging, degrading, losing, or misrouting physical items that will be costly to recover or replace
- Meat: your system gave (or failed to give) instructions such that people were traumatized, maimed, or killed, consequences for which no amount of engineering work can compensate, generally limited to areas such as Transportation, Medicine, Industrial Control Systems, Command-and-Control Systems, Fire Control Systems, Law Enforcement, and Intelligence
You must always know which class of risk you are taking as you engage your Change Control Management process. Likely this calculus will prove the most difficult for systems somewhere in the middle that might variously see all classes of the problem at different times, while the creators of Candy Crush and the Phalanx System alike get a pass on having to do much thinking here.
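One way to internalize the photons/electrons/atoms/meat map is to notice that Recoverability acts as a multiplier on Impact. The weights below are invented for illustration; the point is the shape of the scale, not the specific numbers:

```python
# Hedged sketch: relative recoverability weights are invented for
# illustration. The four classes come from the photons/electrons/atoms/meat
# map described above; only their ordering matters.
RECOVERY_WEIGHT = {
    "photons": 1,            # redeploy the rendering code and you're done
    "electrons": 10,         # new code plus data repair that takes real time
    "atoms": 100,            # physical-world cleanup, costly to recover/replace
    "meat": float("inf"),    # no amount of engineering work can compensate
}

def risk_score(probability, impact, risk_class):
    """Risk = Probability x Impact, scaled by how hard recovery is."""
    return probability * impact * RECOVERY_WEIGHT[risk_class]
```

The infinite weight on the Meat class is deliberate: for such systems no release is “low risk enough,” and the change process has to front-load mitigation rather than lean on recoverability at all.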
Prevention, Detection, Response… Context awareness… Now let’s bring it back to something currently on everyone’s minds to further illustrate the timelessness and universality of such a multi-faceted approach.
We are in the throes of a global pandemic. This puts us squarely in the Meat quadrant of our context map so we ought be exceptionally cautious about the attendant risks. How are we doing on Prevention, Detection, and Response?
- We don’t yet have a vaccine, the global effort to develop one is patchwork, our distribution networks for delivering one are spotty, many people are suspicious of vaccines, and we will feel pressured to deploy one sooner than ordinary clinical evaluation periods would allow.
- We have advanced VTC technology, and the world’s office workers have broadly moved to make good use of it to minimize contact, but those workers depend on goods and services provided by people who have had much less opportunity to maintain social distancing while at work whether it be Amazon warehouses or Tyson slaughterhouses.
- Within shared public or commercial spaces we have the opportunity to maintain social distancing but the guidance is non-standard across locales and a sufficiently large minority of people are behaving badly enough to undermine the efforts of the majority.
- Both the availability of and guidance on Personal Protective Equipment have been a giant mess, though there appears to have been marked improvement of late.
- We don’t yet have an antibody test that is both pervasively available and reliably accurate.
- There is no digital solution for contact tracing that could be adequately reliable without both substantial advances in technology and acceptance of an Orwellian nightmare.
- We have not taken seriously until now the need for an adequately provisioned contact tracing human workforce and it will take time to recruit and train such an army.
- Medical personnel at hospitals are exhausted from the physical and emotional strain of their caseload and meanwhile it is not at all clear that we have comprehensively flattened the curve at a national level in the US.
- The availability of and guidance on ventilators has been as big a mess as the PPE situation.
- Personnel in many commercial venues are reluctant to engage with non-compliant customers, there have been instances of extreme violence when they have engaged, and there are no clear protocols for getting law enforcement professionals involved.
- We don’t currently have a great way to enforce quarantines in the US and this is in some ways an even harder problem than contact tracing.
What should we be doing to get through this?
- Be Unwavering
- Be Patient
- Be Humble
- Be Empathetic
- Be Generous
- Be Apolitical
- Respect Experts
- Standardize Protocols
- Enforce Protocols
- Pool Resources
These are all incredibly important, but I chose to put a few of them at the top of the list because something really timely popped for me a couple of weeks ago as I was reading the book Good To Great at the suggestion of colleague Andrew Morris, in particular The Stockdale Paradox. In this segment author Jim Collins recounts his conversation with Admiral Jim Stockdale, who was the highest-ranking military officer in the so-called Hanoi Hilton prison camp. The relevant passage follows.
A little later in the conversation, after I’d absorbed that and said nothing for about five minutes because I was just stunned, I asked him who didn’t make it out of those systemic circumstances as well as he had.
He said, “Oh, it’s easy. I can tell you who didn’t make it out. It was the optimists.”
And I said, “I’m really confused, Admiral Stockdale.”
He said, “The optimists. Yes. They were the ones who always said, ‘We’re going to be out by Christmas.’ Christmas would come and it would go. And there would be another Christmas. And they died of a broken heart.” Then he grabbed me by the shoulders and he said, “This is what I learned from those years in the prison camp, where all those constraints just were oppressive. You must never ever ever confuse, on the one hand, the need for absolute, unwavering faith that you can prevail despite those constraints with, on the other hand, the need for the discipline to begin by confronting the brutal facts, whatever they are. We’re not getting out of here by Christmas.”
Heed those words and gird your loins. We’re not getting out of here by Christmas.