Timeless And Universal

This is a disconcerting time. On the one hand I have more opportunity than ever for my writing and a sizable backlog of topics I’ve long been accumulating. On the other hand it can feel kind of tone-deaf to write on any of those topics as if COVID-19 weren’t turning the world upside down. And so this is my latest attempt to bridge what now feels like two worlds and ponder what transcends them.

Early in my Software Engineering career I began taking on System Administration knowledge as an enabler to bootstrap products quickly. Over time these disparate domains would knit together into DevOps.

Shortly thereafter Security Engineering began infusing my skill set. This proved an enabler in the middle phase of projects where gaining traction meant experiencing scrutiny. Baking security into a product from the outset is orders of magnitude easier than retrofitting it.

More recently we have seen the term DevSecOps emerge. At first glance the casual observer might cynically conclude that we are just gluing buzzwords together. And certainly many folks are. But as a grizzled practitioner of these historically disparate domains I feel we are onto something.

There has long existed an adversarial role between the Software Engineer and their counterparts. The System Administrator laments “thrown over the wall” delivery of applications, the Security Engineer gets backed into acting more like cop or bureaucrat than collaborator, and the Quality Assurance Engineer often shows up just in time to be the ultimate buzz kill. If we zoom out, however, and apply a shared set of principles across all of those domains, then magic can begin to happen.

Do a Google search on “prevention detection response” and you will find the first two pages of results exclusively discussing cyber-security. You must persist to page three of the results (who does that?!) to reach mention of other domains where, lo and behold, we begin seeing hits on Public Health and Pandemics from CDC, ASEAN, and GHS Index (more on that later). I gave up on page six without having seen any mention of DevOps. But why?

First let’s assert a few basic realities:

  1. No safety or security mechanism is bulletproof
  2. Increasing expenditure shows diminishing return
  3. Timely adaptation to evolving circumstances is crucial to survival

If you cannot sustain a reasonable delivery velocity for meaningful product features then customer engagement will suffer. At the same time if either you or hostile actors unduly disrupt the product experience then, again, customer engagement will suffer. Thus we experience the central tightrope walk of Product Engineering. To that end we must manage risk in a context-aware fashion, adopt a multi-faceted approach, and remember that resilience is attainable in a way that perfection is not.

Consider the foundations of a good DevOps process:

  1. Automated Testing
  2. Observability and Monitoring
  3. Version Control
  4. Infrastructure|Configuration As Code
  5. Continuous Integration / Continuous Delivery|Deployment

Now ponder how all of those elements provide some manner of Prevention, Detection, or Response to underpin responsible system ownership from a Safety standpoint while also providing myriad hooks where we can attach technology that provides ongoing Security value.

We will never ship defect free code but we need not be paralyzed by this risk if:

  1. Automated testing makes shipping defects reasonably unlikely
  2. Monitoring tips us off to problems quickly
  3. Observability enables rapid debugging
  4. Version control precisely describes the software in production
  5. Infrastructure and configurations are included in what we consider our “software”
  6. IaC/CaC + CI/CD enables realistic test environments and rapid shipment of software
  7. Blue-Green Deploys reduce downtime and enable phased roll-outs and roll-backs

How much more aggressively can we responsibly ship code with those layered defenses in place? How much readier for Disaster Recovery will we be when we have so thoroughly described the system and automated its roll-out?

Consider now all the junction points in such an ecosystem where Security Engineers can integrate as high-value contributors in a fashion that is collaborative, automated, and continual:

  1. If we describe in an exhaustive and programmatic fashion how we arrive at build artifacts and grant compute nodes access to storage and messaging resources then Application Security Engineers can continually scan in-house code and third party dependencies for vulnerabilities and reason about minimal permissioning.
  2. If we do similar for how we construct our network fabrics and attach them to compute nodes we similarly empower Network Security Engineers to reason efficiently about the attack surface, blast radius, and exfiltration opportunities extant in their area of responsibility.
  3. If we take Observability seriously by standardizing how we generate, transmit, and store telemetry then we help SOC Analysts looking for threats as much as the Support Engineer monitoring operational health.
  4. If we take build-and-deploy automation seriously then what better way to ensure we can rapidly patch vulnerable systems, minimize the lifetime of execution environments that serve as adversarial toe holds, and quickly recover from compromises?

Slick tools and techniques, however, do not relieve us of the burden of continual application of human judgment as we iteratively field systems. Every time we release a new version we ought be reasoning about risk in terms of Probability and Impact, and folded into Impact furthermore is Recoverability. If we’ve taken our DevOps’ing seriously then all circumstances will generally be more recoverable than otherwise, but certain classes of errors are inherently less easily recoverable, and so we must take care to front-load an appropriate amount of mitigative effort in our process. My favorite mental map for this problem comes from a conference presentation of forgotten venue and authorship (I regret I cannot give due credit) that, as I recall, layered the domain thus:

  1. Photons: you botched the code that does final rendering for a feature that is of low impact and/or is tolerant of moderate latency and you need only ship a new version of that code to return to full capability
  2. Electrons: you corrupted data and need not only ship new application code but also create and run data repair code that may take a while to execute during which period there may be material delays incurred by time critical business functions
  3. Atoms: your system gave (or failed to give) instructions such that there were substantial bad outcomes in the physical world, perhaps damaging, degrading, losing, or misrouting physical items that will be costly to recover or replace
  4. Meat: your system gave (or failed to give) instructions such that people were traumatized, maimed, or killed, consequences for which no amount of engineering work can compensate, generally limited to areas such as Transportation, Medicine, Industrial Control Systems, Command-and-Control Systems, Fire Control Systems, Law Enforcement, and Intelligence

You must always know which class of risk you are taking as you engage your Change Control Management process. Likely this calculus will prove the most difficult for systems somewhere in the middle that might variously see all classes of the problem at different times while the creators of Candy Crush and the Phalanx System alike get a pass on having to do much thinking here.
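One way to make that calculus explicit is to encode the four layers as an ordered scale and let the change process demand more up-front mitigation as the worst plausible outcome climbs. What follows is only a sketch of the idea: the RiskClass enum, the required_mitigations function, and the specific mitigation names are all my own assumptions for illustration, not anything from the original presentation.

```python
from enum import IntEnum


class RiskClass(IntEnum):
    """Ordered from most to least recoverable (names follow the list above)."""
    PHOTONS = 1    # botched rendering; ship a fix
    ELECTRONS = 2  # corrupted data; ship a fix and run a repair job
    ATOMS = 3      # bad outcomes for physical items
    MEAT = 4       # harm to people; no engineering fix compensates


# Hypothetical mitigations, cumulative as the class gets less recoverable.
_MITIGATIONS = {
    RiskClass.PHOTONS: ["automated tests", "monitoring"],
    RiskClass.ELECTRONS: ["verified backups", "rehearsed data repair"],
    RiskClass.ATOMS: ["phased roll-out", "manual sign-off"],
    RiskClass.MEAT: ["independent safety review"],
}


def required_mitigations(worst_case: RiskClass) -> list[str]:
    """Return every mitigation owed at or below the given class."""
    steps: list[str] = []
    for level in RiskClass:
        if level <= worst_case:
            steps.extend(_MITIGATIONS[level])
    return steps
```

The point of making the scale cumulative is that a change touching the Atoms layer still owes everything a Photons change owes, plus more.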

Prevention, Detection, Response… Context awareness… Now let’s bring it back to something currently on everyone’s minds to further illustrate the timelessness and universality of such a multi-faceted approach.

We are in the throes of a global pandemic. This puts us squarely in the Meat quadrant of our context map so we ought be exceptionally cautious about the attendant risks. How are we doing on Prevention, Detection, and Response?

  1. Prevention:
    • We don’t yet have a vaccine, our global response to develop one is patchwork, our distribution networks for them are spotty, many people are suspicious of them, and we will feel pressured to use them sooner than ordinary clinical evaluation periods would allow.
    • We have advanced VTC technology, and the world’s office workers have broadly moved to make good use of it to minimize contact, but those workers depend on goods and services provided by people who have had much less opportunity to maintain social distancing while at work whether it be Amazon warehouses or Tyson slaughterhouses.
    • Within shared public or commercial spaces we have the opportunity to maintain social distancing but the guidance is non-standard across locales and a sufficiently large minority of people are behaving badly enough to undermine the efforts of the majority.
    • Both the availability of and guidance on Personal Protective Equipment have been a giant mess though there appears to have been marked improvement of late.
  2. Detection:
    • We don’t yet have an antibody test that is both pervasively available and reliably accurate.
    • There is no digital solution for contact tracing that could be adequately reliable without both substantial advances in technology and acceptance of an Orwellian nightmare.
    • We have not taken seriously until now the need for an adequately provisioned contact tracing human workforce and it will take time to recruit and train such an army.
  3. Response:
    • Medical personnel at hospitals are exhausted from the physical and emotional strain of their caseload and meanwhile it is not at all clear that we have comprehensively flattened the curve at a national level in the US.
    • The availability of and guidance on ventilators have been as big a mess as the PPE situation.
    • Personnel in many commercial venues are reluctant to engage with non-compliant customers, there have been instances of extreme violence when they have engaged, and there are not clear protocols for getting law enforcement professionals involved.
    • We don’t currently have a great way to enforce quarantines in the US and this is in some ways an even harder problem than contact tracing.

What should we be doing to get through this?

  1. Be Unwavering
  2. Be Patient
  3. Be Humble
  4. Be Empathetic
  5. Be Generous
  6. Be Apolitical
  7. Respect Experts
  8. Standardize Protocols
  9. Enforce Protocols
  10. Pool Resources

These are all incredibly important but I chose to put a few of them at the top of the list because something really timely popped for me a couple of weeks ago as I was reading the book Good To Great at the suggestion of colleague Andrew Morris, in particular The Stockdale Paradox. In this segment author Jim Collins recounts his conversation with Admiral Jim Stockdale who was the highest-ranking military officer in the so-called Hanoi Hilton prison camp. The relevant passage therein follows.

A little later in the conversation, after I’d absorbed that and said nothing for about five minutes because I was just stunned, I asked him who didn’t make it out of those systemic circumstances as well as he had.

He said, “Oh, it’s easy. I can tell you who didn’t make it out. It was the optimists.”

And I said, “I’m really confused, Admiral Stockdale.”  

He said, “The optimists. Yes. They were the ones who always said, ‘We’re going to be out by Christmas.’ Christmas would come and it would go. And there would be another Christmas. And they died of a broken heart.” Then he grabbed me by the shoulders and he said, “This is what I learned from those years in the prison camp, where all those constraints just were oppressive. You must never ever ever confuse, on the one hand, the need for absolute, unwavering faith that you can prevail despite those constraints with, on the other hand, the need for the discipline to begin by confronting the brutal facts, whatever they are. We’re not getting out of here by Christmas.”

Heed those words and gird your loins. We’re not getting out of here by Christmas.

Contact Conundrum

An ounce of prevention is worth a pound of cure. Rarely have we felt this so acutely as during the COVID-19 crisis. Thus we stand on the precipice of solidifying a panopticon.

Weighty Decisions

Brace yourselves. We are learning the implications of our hyper-connected, technologically-advanced, and planet-spanning super-society at an exponentially accelerating rate. And yet such understanding arrives with troublesome latency.

Consider healthcare. Roughly eighty (!) years later we are still navigating the insurance-related consequences of WWII-era wage-fixing. The long tail of such macro-level decisions argues for making them with great care and re-examining them rigorously and regularly.

Now consider the established epidemiological practice of contact tracing and in particular the initiative by tech behemoths Apple and Google to implement it worldwide that hit the news in recent days.

On initial examination the approach seems well intentioned and thoughtfully reasoned:

  1. The protocol encodes your identity as a series of rotating, random, anonymous keys
  2. The system will keep these keys local to your phone
  3. Key generation and exchange will be opt-in for end-users
  4. Key exchange occurs between nearby Bluetooth LE devices
  5. The protocol captures only key-to-key contact, not location data
  6. An infectee can report their status by uploading their relevant anonymous keys
  7. Performing (6) will require a confirmation code from a healthcare provider
  8. Users of the system will download the keys of infectees to ascertain their exposure

There is a lot to like about this. The rotation of anonymous keys makes unmasking identities via pattern-of-life analysis difficult, the healthcare provider’s certification thwarts spammers, the local storage of telemetry and linking of contacts prevents third party access to non-pertinent information, and geospatial-free fact-of-contact data limits utility to the intended purpose.
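To make the mechanics concrete, here is a minimal sketch of the rotate-locally, publish-on-diagnosis, match-on-device flow. It is emphatically not the actual Apple/Google cryptography: the HMAC-based derivation, the 16-byte sizes, the 144 intervals per day, the idea of publishing one daily key from which the rotating identifiers are re-derived, and every function name are assumptions chosen for illustration.

```python
import hashlib
import hmac
import secrets

INTERVALS_PER_DAY = 144  # assumed: one identifier per 10-minute window


def new_daily_key() -> bytes:
    """Random secret generated on, and kept on, the phone."""
    return secrets.token_bytes(16)


def rolling_id(daily_key: bytes, interval: int) -> bytes:
    """Derive the anonymous identifier broadcast during one interval."""
    msg = interval.to_bytes(4, "big")
    return hmac.new(daily_key, msg, hashlib.sha256).digest()[:16]


def exposed(published_keys: list[bytes], heard: set[bytes]) -> bool:
    """Re-derive infectees' identifiers locally and compare with what we heard."""
    for key in published_keys:
        for interval in range(INTERVALS_PER_DAY):
            if rolling_id(key, interval) in heard:
                return True
    return False


# Alice broadcasts; Bob's phone records what it hears over Bluetooth.
alice_key = new_daily_key()
bob_heard = {rolling_id(alice_key, 42)}

# Alice tests positive and uploads her daily key; Bob downloads it and checks.
bob_was_exposed = exposed([alice_key], bob_heard)
stranger_was_exposed = exposed([new_daily_key()], bob_heard)
```

Nothing in this flow involves location or identity. The match happens on Bob's phone, against identifiers that only his phone recorded.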

What could be so bad about stepping onto this seemingly high-friction slope? To answer that let’s zoom out a bit.

Prelude

It is seldom that liberty of any kind is lost all at once. ~ David Hume

One of the great tragedies of the past decade took the form of our collective myopia in the wake of the Snowden leaks. Reasonable people can disagree on matters surrounding intelligence operations, civil liberties, and whistle blowers. Where we lost control of the narrative was in thinking too small.

In 2007 news agencies began reporting on divorce court proceedings buttressed by the E-ZPass toll collection system whose deployments began in 1993. This offered a glimpse of the world to come in which we would casually and pervasively trade anonymity for convenience and cash-back, but few people could have guessed the ultimate scope and scale. Apple had publicly released the iPhone just that year, AWS launched the previous year, only three years earlier Facebook launched and Google IPO’d, and meanwhile credit card providers were sitting on a rich vein of ore whose value they would take a few more years to fully realize.

Then in 2013, just as these private tech goliaths really started coming into their own, Snowden swung the spotlight onto government intelligence agencies. Three years later, in case there was any chance of re-focusing this conversation, Trump stole the stage. And only a few more years later, just as GDPR was beginning to enter the planetary consciousness by spamming us with cookie acknowledgement banners, COVID-19 comes roaring onto the scene and seems poised to further cement our dependence on Big Tech for nearly everything from ordering essential goods to maintaining relationships afar to tracing infected individuals.

We have been giving up an assortment of liberties piecemeal over a long time and now all the pieces are about to click together.

Back To The Present

On some level the Apple/Google partnership for contact tracing seems like a noble idea and a reasonable thing to try. In reality it will likely prove ineffective and governments will experience temptation to leverage more invasive means. Consideration of these means will raise public awareness of their existence, a potential headache for Big Tech, and actual usage at scale would carry huge risks of abuse, a reality that ought be blindingly obvious. The purveyors of the world’s biggest sensor networks have good reason at least to try to short-circuit proceedings.

Let’s take a moment to consider goals we may pursue in our current crisis:

  1. Alert potential infectees of their risk status
  2. Locate potential infectees, test them, and quarantine them
  3. Enforce isolation of known infectees

The Apple/Google partnership, like a handful of similar recent proposals, supports only the first use case. Furthermore, an opt-in approach to activating the technique will make a key exchange event unlikely, the messiness inherent to inferring a meaningful contact will result in a high error rate, and latent disease communicants will allow transmission without human proximity. These three problems in concert seem likely to render the data set ineffectual, providing us only a poor solution to part of the problem.

Let’s assume for a moment, though, that there is an adequate deployment rate, the tech is pretty good, human proximity dominates as a transmission vector, and consequently a reasonable signal exists. How long until government and health officials are clamoring to extend the tech to the other use cases? And how much more comfortable would we be with this next increment of invasiveness having already gotten part way there? We cannot readily implement the other two use cases without much bigger privacy implications. And thus over time we may be tempted to employ an all-source intelligence approach as South Korea has, an automated reporting system as China has, or an electronic fence as Taiwan has.

Stumbling Into A Dark Future

We could totally do it. It wouldn’t even be that hard. Smart phones from Apple and Google have high-resolution GPS sensors to enable app functionality, a growing density of cell towers means an improving accuracy of your location even without GPS, cell phone carriers like Verizon and AT&T won’t let your phone communicate via their towers without identifiers sufficient to bill you, Facebook has assembled a 2.5 billion human strong training data set for facial recognition, CCTV cameras are cheap and popping up everywhere, credit card usage is pervasive, and compute on massive scales is available to anyone with a credit card to plug into the cloud.

Three properties are worth noting about the history of the component technology. Firstly, it has emerged gradually, preventing us from properly appreciating the liberties we have been surrendering. Secondly, in many cases companies have created the tech for relatively benign purposes while providing consumers delightful benefits often at little or no monetary cost, making us willing participants. Thirdly, the development of the technology and storage of the resultant data has occurred in a highly fragmentary fashion, yielding a system with built-in integration challenges and inherent checks and balances.

But now we find ourselves in circumstances where it is incredibly tempting to centralize vast amounts of raw data in the hands of governments. And later, when COVID-19 is a painful but distant memory, the inheritors of such stupefyingly powerful technology may well weaponize it in ways we will lament. Make no mistake — we are poised to execute on some of the most impactful decisions in human history with repercussions that could last centuries.

Pondering Alternatives

Fielding high-tech solutions on a short-fuse will likely prove either ineffectual because they don’t go far enough or cause disastrous long-term collateral damage because they do go far enough. In the short-term we should employ a variety of low-tech means, in the medium-term we should field new technologies that can have more of the safety benefits while minimizing privacy risks, and in the long-term we should be taking a more holistic look at our nascent suite of technologies and the relationship we want to have with it.

Short Term

On the prevention front we need to encourage and enforce risk-reducing behavior by standardizing practices exemplified by the grocery stores that are laying out tape markers for distancing purposes. For detection we ought not just be improving the accessibility of testing but also demonstrating some adaptability and elasticity by leveraging the myriad recently unemployed folks to perform the old-fashioned meatware-based contact chaining services that health departments have been perfecting for decades. And, obviously, on the response front we need to provide better access to personal protective equipment, ramp up hospital capacity to meet demand, make testing procedures readily accessible, and flood the world with hand sanitization stations.

Medium Term

I would wager that there is a bunch of tech on the horizon that could tip the equation in our favor. How about ultra-wideband technology which compared to WiFi and Bluetooth demonstrates relatively low power consumption, has superior range resolution, already inhabits new Apple devices, and is soon to find its way into Android devices? Maybe in addition to higher confidence “contact” events I could also white list my socially non-distant brethren and then have my phone buzz if I carelessly wander too close to others?

Long Term

We have talked a lot about the societies we would like to inhabit. Meanwhile armies of scientists and engineers kept on inventing things for a combination of love and money, businesses kept on exploiting those technologies in ways that are entirely rational from their local standpoint, and now we are courting a form of environmental disaster as we find ourselves surrounded by a dangerous amount of incredibly complex technology, not entirely unlike what we found at the beginning of the Industrial Age when a renaissance in Chemistry resulted in a catastrophe for our water, soil, and air, or when a renaissance in Physics delivered us into the Nuclear Age. The Information Age is at least as dangerous but in ways less screamingly obvious. Let us tread carefully and thoughtfully.

Liberty once lost is lost forever. ~ John Adams, 7 July 1775

Inobtrusive Automation

For some time now I’ve been in the habit of turning on a beefy Honeywell air filtration system that lives in my bedroom prior to leaving my house in the morning for the office…

… or at least that is what I was doing until my bedroom became my office for the foreseeable future. Peak allergy season, meanwhile, is coinciding with peak COVID-19 and thus I’ve found myself wanting an approach where I could have clean air without the droning of the air filter being my constant companion where I am spending most of my day. Home Assistant to the rescue along with the help of Aeotec’s Multi Sensor 6 and Smart Switch 6. The ultimate solution ended up being a mashup of two earlier automation adventures plus some new techniques.

At first I took the approach of using a Multi Sensor to detect when I was in the bedroom, a Smart Switch to control whether the air filter was on, and some Home Assistant Automations that would engage the Smart Switch whenever the Multi Sensor was not reporting motion and the time of day was outside a conservative approximation of when I might be sleeping. That was OK but kind of disappointing because either I would be wasting time when the filter could be running or risking it kicking on while I was sleeping. I needed another signal. As it turns out my Chilipad Ooler was the perfect candidate since its active state correlates strongly with my sleeping.

And so ultimately the solution involved using two Smart Switches, one as a control mechanism and one as a signal generator, the former gating the filter and the latter inferring my presence from the Chilipad’s power consumption.

automation:
  # Turn the filter off as soon as the Multi Sensor state changes to '8' (motion).
  - alias: Disable Bedroom Air Filter On Trigger
    trigger:
      platform:  state
      entity_id: sensor.aeon_labs_zw100_multisensor_6_burglar
      to:        '8'
    action:
      service: switch.turn_off
      entity_id: switch.aeon_labs_zw096_smart_switch_6_switch
  # Safety net: every minute, turn the filter off if motion is still reported.
  - alias: Disable Bedroom Air Filter On Timer
    trigger:
      platform: time_pattern
      minutes:  "/1"
    condition:
      condition: and
      conditions:
        - condition: state
          entity_id: sensor.aeon_labs_zw100_multisensor_6_burglar
          state:     '8'
        - condition: state
          entity_id: switch.aeon_labs_zw096_smart_switch_6_switch
          state:     'on'
    action:
      service: switch.turn_off
      entity_id: switch.aeon_labs_zw096_smart_switch_6_switch
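  # Every minute, turn the filter on if the room has been still for five minutes,
  # the filter is off, and the second Smart Switch (the Chilipad's) reads below 1.
  - alias: Enable Bedroom Air Filter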
    trigger:
      platform: time_pattern
      minutes:  "/1"
    action:
      service: switch.turn_on
      entity_id: switch.aeon_labs_zw096_smart_switch_6_switch
    condition:
      condition: and
      conditions:
        - condition: state
          entity_id: sensor.aeon_labs_zw100_multisensor_6_burglar
          state:     '0'
          for:
            minutes: 5
        - condition: state
          entity_id: switch.aeon_labs_zw096_smart_switch_6_switch
          state:     'off'
        - condition: numeric_state
          entity_id: sensor.aeon_labs_zw096_smart_switch_6_power_2
          below:     1

The above configuration has been running for the past few days and seems to reliably do what I might hope. Within a few minutes of my departing the room the air filter kicks on and I am protected from it ever kicking on while I am trying to sleep. Meanwhile a few fiddly configurations of the Aeotec devices were required to get them to behave as I wanted…
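For anyone who finds YAML conditions hard to scan, the same decision logic can be restated as a pure function. This is only my paraphrase of the configuration for reasoning and testing, not something Home Assistant runs. The function and parameter names are invented here, and it folds the three automations into a single decision (so it skips a turn-off when the filter is already off, which would be a no-op anyway).

```python
from typing import Optional


def filter_action(motion: bool, quiet_minutes: float,
                  filter_on: bool, bed_power: float) -> Optional[str]:
    """Return 'turn_off', 'turn_on', or None (leave the switch alone).

    motion        -- Multi Sensor currently reports motion (state '8')
    quiet_minutes -- how long the sensor has reported no motion (state '0')
    filter_on     -- current state of the Smart Switch gating the filter
    bed_power     -- power reading from the second Smart Switch
    """
    if motion and filter_on:
        return "turn_off"          # someone is in the room
    room_empty = (not motion) and quiet_minutes >= 5
    bed_idle = bed_power < 1       # Chilipad idle, so probably nobody asleep
    if room_empty and bed_idle and not filter_on:
        return "turn_on"
    return None
```

Calling filter_action(False, 6, False, 40.0) returns None: the room is still, but the Chilipad is drawing power, so the filter stays off.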

Something along the lines of the first three of the following entries was applied to both Smart Switches via zwave.set_config_parameter to, respectively, get them to send all of the relevant kinds of data in their reports, get the reports to ship every minute, and tell them not to leave their LEDs on all the time. The fourth item was applied to just a single switch because it happened to be plugged into a surge protector strip and I wanted to avoid the potential mayhem of competing protection devices.

{ "node_id" : 4, "parameter" : 101, "value" : 15 }
{ "node_id" : 4, "parameter" : 111, "value" : 60 }
{ "node_id" : 4, "parameter" : 81, "value" : "When the state of the Switch changes, the LED will follow the status (on/off) of its load, but the LED will turn off after 5 seconds." }
{ "node_id" : 5, "parameter" : 3, "value" : "Deactivate Overload Protection" }
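If you would rather script these than click through the UI, one option is Home Assistant's REST API, which exposes services at /api/services/<domain>/<service>. The sketch below only builds the requests and does not send them. The host name and token are hypothetical placeholders, the config_request helper is my own, and I am assuming the service accepts the same fields as the entries above.

```python
import json
import urllib.request

BASE_URL = "http://homeassistant.local:8123"  # hypothetical host
TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"        # hypothetical placeholder


def config_request(node_id: int, parameter: int, value) -> urllib.request.Request:
    """Build (but do not send) a call to zwave.set_config_parameter."""
    body = {"node_id": node_id, "parameter": parameter, "value": value}
    return urllib.request.Request(
        url=f"{BASE_URL}/api/services/zwave/set_config_parameter",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# The two numeric settings from the entries above, for one switch (node 4).
requests_to_send = [
    config_request(4, 101, 15),  # which kinds of data go into the reports
    config_request(4, 111, 60),  # how often the reports ship
]
# To apply them: for r in requests_to_send: urllib.request.urlopen(r)
```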

I still get a little bit giddy when the system does exactly what I might hope it would.

Toward Resilience

Various folks in the realms of engineering, government, and business have long been navigating a class of challenge whose relevance the general public has recently come to feel viscerally.

How do you build and operate systems that can handle temporally varying loads imposed by heterogeneous collections of missions executed by uncoordinated entities?

With the onslaught of COVID-19 we suddenly observe desperate overload and squandered capacity existing side-by-side.

State-administered web-applications for unemployment benefits are crashing under the strain.  Meanwhile tech built in many places for a burgeoning economy suddenly seems over-provisioned.

Medical staff the world over find themselves grossly overextended and quickly headed for conditions usually reserved for third world catastrophes.  Simultaneously folks in the hospitality industry are experiencing forced idleness and indigence.

We can still enjoy a variety of consumer goods that will arrive at our doorstep in a matter of hours or days.  And we can while away the time with those toys in the safety of our homes as people die for want of critical medical supplies over the coming months.

We arrived in this situation, and are struggling to manage it, because we have not yet grokked the implications of modern technologies and global economies.  And yet those same things hold the keys to short-term solutions and long-term resilience.

Now a brief interlude to define some adjectives we might apply to systems…

  1. Scalable: sustains system performance and unit economics as load grows
  2. Elastic: expands and contracts the underlying infrastructure in sync with load
  3. Adaptable: readily reconfigurable to support a wide variety of activities

You probably need not be a systems engineer to understand the heartbreak presently resulting from myriad organizations and technologies failing to exhibit the above properties.
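Elasticity in particular lends itself to a compact illustration. The sketch below sizes a pool of interchangeable units to the current load. The desired_capacity function, its parameters, and the default bounds are all assumptions of mine, and real systems add smoothing and cool-downs that I omit here.

```python
import math


def desired_capacity(load: float, per_unit: float,
                     floor: int = 1, ceiling: int = 100) -> int:
    """Units needed to serve `load`, clamped to [floor, ceiling]."""
    needed = math.ceil(load / per_unit)
    return max(floor, min(ceiling, needed))
```

With per_unit at 100, a load of 950 calls for 10 units and a load of 0 falls back to the floor of 1. A load of 50,000 hits the ceiling of 100, which is exactly the point at which an elastic system stops being a scalable one.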

I suspect that the competition between three objectives lies at the center of our problems:

  1. Affordability
  2. Availability
  3. Sovereignty 

Under ordinary circumstances we want to be paying reasonable prices for goods and services, under extraordinary circumstances we would like the essentials to remain attainable, and under all circumstances we dislike being pushed around or held hostage.  Regrettably we denizens of Earth have over the past several decades been prioritizing Affordability and very specific kinds of Sovereignty in a way that has left us ill-prepared for recent events and thus thrust into an Availability crisis.  There likely exists an assortment of reasonable trade-offs to be made in these realms but many of the relevant tools and techniques are new, learning to leverage them will take time, and the transition period will prove bumpy.

In my worlds of Data Engineering and Security Engineering I have seen these struggles play out time and again and in no ecosystem did the results ever feel entirely satisfying.  I’ve lived in government data centers wrangling mission-critical and ultra-sensitive workloads where Availability and Sovereignty were king and the consequence was a money fire.  I’ve built systems in organizations where Affordability trumped all and the price was ceding Sovereignty and accepting huge risks around Availability.  And I’ve experienced places trying to take a measured approach to balancing all three and damn is it tough to get the relevant stakeholders in sync on a roadmap.

As we navigate these trade-offs we must carefully wield various double-edged swords:

  1. Debt: immediate flexibility and leverage at the expense of capacity tomorrow
  2. Specialization: economy of scale today at the expense of flexibility tomorrow
  3. Centralization: efficient but jeopardizes scalability, availability, and adaptability
  4. Reactivity: avoids speculation while risking non-availability and exploding costs
  5. Secrecy: short-term localized safety at the expense of long-term global efficacy
  6. Proprietorship: spurs immediate investment while risking long-term stagnation

Fortunately the innovations of recent decades have armed us with tools and techniques such that many things become much less of an either/or decision.  We must, however, engage on these novel approaches with patience, persistence, and humility.

In the digital realm we have made great strides in commoditizing compute resources and building large distributed systems, but the techniques are not widely understood, the general purpose tooling is immature, a handful of vendors dominate the space while differentiating via incompatible protocols, and staggering amounts of legacy tech weigh us down (calling all “cobalt” programmers! weirdly enough the world desperately needs you, though not for the first time in recent memory).

In the physical realm we have created a magical world where travelers can transit the globe with unprecedented ease, consumers can source goods from their couches via pocket-sized super-computers and expect them to arrive nearly instantaneously, and it has all become shockingly cheap and mostly reliable, but this has been possible by way of just-in-time manufacturing, extremely complex supply chains, hyper-specialized professionals and facilities, an obsession with growth, and razor thin margins on capacity and profitability, all of which entails a great deal of fragility, risk, and long-term costs.

And now we find ourselves in a crisis that is shaking us to our very core.

What to do?

We are already seeing some awesome examples of ingenuity and collaboration, big and small, in the face of disaster.  Countries are smoothing supply and demand disparities through mutual aid agreements, government agencies and hoteliers are collaborating to convert idled properties into makeshift hospitals, distilleries are reallocating manufacturing capacity from liquor to hand sanitizer, and individuals are crafting masks at home in their spare time.  We should look to these activities not just as noble and ingenious short-term solutions but also as a guide on how to re-architect our future society to be more resilient in the face of unpredictable circumstances by emphasizing Scalability, Elasticity, and Adaptability.

We appear to have learned some valuable things from the 2008 financial crisis that policy makers are rapidly deploying to blunt the economic impact of COVID-19.  This is heartening and with any luck will prevent a full blown Depression.  I am, however, sad to realize that we failed to generalize those learnings.  If you Google on something like “financial system stress test” you’ll get reams of hits discussing the extensive activity and research in this area but try “supply chain stress test” and you’ll be relatively disappointed.  Apparently we have to get our asses handed to us in a very specific way before we develop a very specific solution.  It took, for example, the 1973-1974 oil embargo to goad the USA into creating the Strategic Petroleum Reserve, and that trend seems to continue.

We need to be asking better questions about what can kill us, what we can do about it, and how we can know whether our approach is proving effective.  Hopefully we not only come out of this crisis having added Medicine to the list that includes Energy and Finance, but also looking to proactively establish practices for blunting similar impending crises in the realms of Manufacturing, Communications, Transportation, Environment, and Agriculture.  And hopefully the approaches we take preserve our humanity.

Improvements in sensors, telemetry, and predictive analytics show great promise for helping us to anticipate surges in demand and thus proactively provision capacity, but we must also take great care not to put too much stock in their prescience and meanwhile remember that the risk for tech-enabled human rights abuse is enormous.  We need more of this data but also more robust legal and policy frameworks to ensure responsible and ethical behavior.

Lastly, to make effective use of such prescience we must build toward systems of greater simplicity, adaptability, and interoperability, leverage those properties to create more robust and localized shared resource pools, maintain a responsible amount of slack both within and across those resource pools, further mitigate risk by formalizing mutual aid pacts, and engage in regular stress testing to validate our key assumptions in a variety of worst case scenarios.
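To make the stress testing idea concrete, here is a toy sketch in Python.  Everything in it is an illustrative assumption rather than real data: three hypothetical resource pools, each provisioned with 20% slack over a baseline demand of 100 units, and a "mutual aid" mode that assumes spare capacity can move between pools with zero friction (which is wildly optimistic).

```python
# Toy stress test: how much demand goes unserved under different scenarios,
# with and without a mutual aid pact between resource pools.
# All pool names and numbers are made up for illustration.

CAPACITY = {"north": 120, "south": 120, "east": 120}  # baseline 100 + 20% slack

SCENARIOS = {
    "baseline": {"north": 100, "south": 100, "east": 100},
    "local surge": {"north": 150, "south": 100, "east": 100},
    "widespread surge": {"north": 150, "south": 150, "east": 150},
}


def unmet_demand(capacity: dict, demand: dict, mutual_aid: bool) -> int:
    """Total demand left unserved across all pools in one scenario."""
    if mutual_aid:
        # Pools share spare capacity, so only the aggregate shortfall matters.
        return max(0, sum(demand.values()) - sum(capacity.values()))
    # Each pool stands alone; spare capacity elsewhere cannot help.
    return sum(max(0, demand[name] - capacity[name]) for name in capacity)


def stress_test(capacity: dict, scenarios: dict, mutual_aid: bool) -> dict:
    """Run every scenario and report the unmet demand for each."""
    return {
        label: unmet_demand(capacity, demand, mutual_aid)
        for label, demand in scenarios.items()
    }


if __name__ == "__main__":
    print("isolated:  ", stress_test(CAPACITY, SCENARIOS, mutual_aid=False))
    print("mutual aid:", stress_test(CAPACITY, SCENARIOS, mutual_aid=True))
```

In this toy model the pact fully absorbs the local surge (30 units unmet in isolation, 0 when pooled) but does nothing for the widespread one (90 units unmet either way).  That is the sort of assumption a stress test exists to surface: mutual aid covers uncorrelated shocks, while correlated ones still demand slack.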

We have crashed quite unexpectedly into a new era almost overnight.  Let us find a way to make it one characterized by resilience and sustainability.

Irresponsible And Grasping

I have to hand it to the Data Scientists: after a quarter century the Internet is finally serving me ads for products and services that I end up buying.

But that is only tangential to this post which is actually about protecting your credentials and encouraging responsible behavior.

Phishing remains a potent vector for initiating exploitation workflows and the first line of defense is Internet user education and hygiene.  Worrisomely we see powerful actors encouraging sloppy security practices in this realm for the sake of workflow expediency and intelligence gathering.

Consider a model that has become common within the FaceBook iPhone app that starts with an inline ad that includes a “Shop Now” button.

Clicking that button launches not the default web browser but rather an In-App Browser.

As I prepare to check out I am steered toward PayPal as an option.

And then I am asked to put my PayPal login credentials into this sub-window.

Let’s stop and reflect on that…  I am inside the FaceBook app, nested in which is an in-app browser that says I have a secure connection to a third party merchant, and I’m supposed to feel good about entering the password to a third party financial account.  We are in a world where Phishing Countermeasures 101 is “Don’t Click Links In Emails” and we have a consumer facing Big Tech company with a billion users of widely varying sophistication training people to enter sensitive credentials into unverifiable places.

Shame.

My workaround: Leave the FaceBook app and go directly to the merchant’s web site.  And if I am going to use PayPal then I will log into it via a discrete tab in my for-realz browser so that I know the connection is secure and my PayPal credentials are only going to PayPal.

This had been bugging me for a while but now let’s get to the even more egregious example that goaded me into writing this morning when I attempted to link a bank account to Expensify.

Alright, let me plug in my bank routing and account numbers…

e1.png

Er…  Log into my bank account?

e2.png

Plaid…  Um, ok…

e3_2.png

Whoa.  Fuck.  You.  NO!  I have no idea what is happening to my bank password in this workflow.  And even if I did I don’t want anybody but me to have it!

e4_2.png

Wow.  As I back out from their credential harvester app I discover that the more responsible workflow is available as an Easter egg!

e5.png

Yay!  I can get reimbursed without doing something that feels excessively dirty.

e6.png

But, again, what are we training Internet users to do by subjecting them to workflows like this?

Or, to ask another question: Why are we desensitizing Internet users to engaging in highly risky security practices?

Two answers: to retain attention and to harvest intelligence.

In the case of the FaceBook app it all comes down to keeping you in-app and clicking.  Having you divert to an external browser risks disruption to your social media flow state which affects FaceBook’s bottom line.

In the case of Expensify and Plaid it appears even more nefarious.  A perusal of Plaid’s privacy policy makes it seem like they have bigger designs on your account credentials than just performing systems integration.

It starts off sounding warm and fuzzy…

p1.png

But breezing past the marketing and getting down to brass tacks it reads a lot more scarily…

p2.png

Gross!  All I wanted to do was configure Expensify to push payments to my bank and Plaid is trying to get in the middle and hoover up EVERYTHING.

On a personal level this feels manipulative and violating.  On a societal level this feels like an externalization of costs, a form of cyber pollution in which certain commercial entities are acting in narrow self-interest in a fashion that harms the overall security of the Internet and its users.

Do not fall prey to this manner of thing.  Practice defensive surfing to keep your credentials safe.

Surviving Telework

The benefits of SCIF-life tend to exclude telework.  And you’ll count yourself lucky if your phone does not fry in the parking lot come summertime.  Nonetheless in the underlying restrictions one finds a wealth of under-appreciated inducements to stay focused and productive.  As we roll into this new socially distanced phase I am thankful to have had a few years of post-government life to practice operating in an increasingly fluid and virtualized modality.  I can imagine the shock others may be experiencing as they go full-remote for the first time with no preparation.

To avoid both personal and professional misery I suggest pondering three areas:

  1. Habit / Environment
  2. Ergonomics
  3. Communication

Habit / Environment

With the breakdown of natural boundaries between personal and professional reality, sustaining habits and protecting focus become crucial.  You must cultivate environmental cues to get into the zone and erect barriers both physical and virtual to protect it.  Failure to do so will leave you sluggish, distracted, frustrated, and ineffectual.

If at all possible do NOT attempt to work in a setup that your brain associates with leisure.  I recall, shortly after leaving the government, my first unexpected telework day at Bridgewater being borderline wasted.  I did not intend it to be, but I had flopped unshowered onto a beanbag chair in front of my wide screen TV, thereby adopting a context and posture that my brain associated with computer gaming on a Saturday morning.  FAIL.

Instead start by grounding yourself in a boot sequence that runs independent of where your work day will be.  For me that is coffee, breakfast, hygiene, and clothing, followed by getting to a standing desk with a proper keyboard, mouse, and monitor.  Your approach can be different but you need something to provide the cue to get into the zone.  You may also wish to take a walk that simulates your morning commute (though maybe not right now).

And then you must stay in the zone. Visually and acoustically isolate your home office setup from other household goings on.  If possible get yourself in a room with a door that you can shut.  Leverage headphones and a noise generator to enhance that isolating effect.  Negotiate and maintain clear boundaries and protocols with your cohabitants (pets and humans alike).

Ergonomics

Do not get hurt.  Suffering repetitive strain injuries in a proper office environment is all too easy.  In working at home unmindful of such things one courts disaster.

Get yourself an adjustable desk and spend as much time standing as you can manage.  Ensure that your monitor is at a height such that your neck and spine can maintain a neutral posture.  If you must sit, the most important thing is having enough padding to prevent pressure points, not the “support” that fosters a weak core and ultimately a range of musculoskeletal injuries.  And when you are standing, do so on an anti-fatigue mat, avoid extended static positions, and continually stretch as a matter of course.  Consider getting a fidget bar.  Maybe keep yourself honest with an Upright GO.

Avoid working directly on a laptop for any serious amount of time.  Certainly you can use a laptop, but do not interact with it directly.  This device, for all its wondrousness, is an ergonomic disaster.  The reason comes from your need to independently adjust your input and output devices’ relative locations.  This in turn stems from the simultaneous requirements of viewing your monitor at a height such that your neck and spine can remain neutral while keyboarding with your shoulders back, your upper arm segments perpendicular to the floor, your lower arm segments parallel to the floor, and your wrists aligned with your lower arms in a neutral posture.

While there may be many options that suit your purposes, I personally fell in love with the Comfort Keyboard ~15 years ago when my government agency’s ergonomics folks introduced me to it at a time when I was scared that carpal tunnel syndrome was going to derail my nascent career as a software engineer.  This keyboard in particular exemplifies the idea of having multiple independently adjustable axes.  I’ve never used another keyboard since discovering this one and the terrifying pain that drove me to find it has never returned.

Also beware the multi-monitor trap and arrange your desktop windows mindfully.  You do not want to be spending any significant amount of time with your neck twisted or tilted.  Whatever tasks are consuming your focus for substantial periods ought be front-and-center.  Position windows to avoid having applications continually drag your focus off-center.  And if you have multiple monitors, ensure that one sits directly in front of you and others are leveraged only for peripheral awareness and/or very brief tasks.  For most folks a single massive monitor proves a superior option to multiple smaller ones.

Lastly, take breaks deliberately and switch up modalities even if just briefly to prevent rigor mortis.  Stand, sit, lie down, walk, and stretch.  Don’t eat lunch at your desk.  Take a break for some leisure items and chores (but time box them and only start them once you’ve fully flushed your latest work context).

Communication

Communicating in a quality way can prove difficult under the best of circumstances.  And having an entire company doing remote work, especially when transitioning to that modality en masse and without warning, is far from the best of circumstances.  Worse still, the high-bandwidth and serendipity-driven nature of in-office life often serves as a crutch that forms and hardens bad practices.

In everything you communicate, remain aware of how your engagement style and choice of medium either fosters or undermines effective knowledge transfer.  When transmitting, aim to be precise, concise, transparent, and transactable.  When receiving, assume noble intent and ask for clarification.  Think carefully about the urgency of your requests and strive for asynchronous messaging and batch-mode operation to preserve the focus and efficiency of others.  But also recognize when a problem requires a high-bandwidth and low-latency interaction and elevate your communication to the appropriate medium.

Consider the full range of your communication tools and how each one of them presents an opportunity to empower or hinder your co-workers in a way amplified by remoteness.

Stand Ups — Your organization likely consists of very different people when it comes to their skill profiles, social patterns, communication style, self-sufficiency, and comfort raising problems.  Your daily stand-up meeting represents the one opportunity to cut through all of those challenges and keep people happy and delivering.  It can also prove a quagmire without good protocols and enforcement.  Keep it simple — What did you do yesterday, what are you doing today, what promised deliverables are at risk, and how do you need help?  Everybody should be able to cover this in about a minute.  Maintain a “parking lot” for follow-up items so that you get around the room quickly instead of rabbit-holing.  And then be damned sure that all of the requisite follow-on conversations happen to get people the help they need and re-examine at-risk deliverables.

Instant Messaging — Prefer posts to public channels over DMs to foster awareness and crowd source solutions serendipitously.  Leverage threading to ensure that channels remain skim-friendly and searchable while individual messages have the necessary context.  To preserve others’ focus, refrain from flaring individuals or channels unless your need is urgent.

Email — Be inclusive to ensure adequate transparency, clear about how you view different recipients, and mindful of wasting bandwidth.  Use the “To” line to indicate the people from whom you need something, the “CC” line to provide transparency while indicating optionality, the “BCC” line to let someone know you’re on the task while eliminating future chatter, and the “Subject” line to make inbox skimming efficient.  Leverage distribution lists to assist in adequate dissemination of information.

Issue Logs — Don’t wait until sprint retro day to dredge up things from your tired brain.  Document problems as you run into them.  Simply record the concrete bad outcome in an open-minded way.  Save diagnosis and design for later when the pain signal warrants, the frustration of the moment has bled off, and you can leverage others’ insights to get to root causes and iterate on systems to make them more resilient.

Ticketing Systems — First and foremost, just write it down, for all values of “it”.  Always.  Facing the headwind of tech debt, document it in the backlog for triage.  Having tripped over a bug, write a really clear description and include system output, screen shots, or videos to make it easy to reproduce.  Planning a sprint, write really clear user stories such that a ticket can stand on its own for a developer to know they are done and a tester to believe or refute that claim.  And, most crucially, continually update your evolving understanding of the goals or problems in the ticket instead of relying on out-of-band renegotiations.  Few things are more contentious than what “done” means when you don’t have a written agreement.  Treat your ticketing system as the system of record where your latest synthesized understanding lives.

Version Control Systems — Your VCS is not just where you push code to get reviews, maintain deltas, and trigger system deploys.  Done well this is your opportunity to have an assortment of out-of-band conversations with a variety of individuals wrangling different situations.  Technical leaders looking for patterns and problem areas, team leads trying to keep things on track, individual contributors looking for inspiration, and bug hunters seeking root causes will all thank you for taking your commit messages seriously.  Write a pithy summary line that captures the essence of the “What?” and fits within the summary line length constraints of your tooling.  Embellish in prose form within the message to summarize the “How?” and “Why?” if it won’t be evident from reading the diff.  Squash your branch to a single commit before merging and clean out your in-progress junk comments like “whoops”, “damn it”, and “adding debug statement”.
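If you want a machine to nag you about this, a check along the following lines could run as a pre-merge step.  Treat it as a minimal sketch: the 50-character summary limit and the blank second line are common conventions that I am assuming here, not rules from any particular tool, and `lint_commit_message` is a hypothetical helper name.

```python
# Minimal commit message lint.  The limit and the junk list are assumptions;
# tune them to whatever your own tooling and team actually expect.

MAX_SUMMARY_LENGTH = 50
JUNK_SUMMARIES = {"whoops", "damn it", "adding debug statement"}


def lint_commit_message(message: str) -> list:
    """Return a list of human-readable problems found in a commit message."""
    lines = message.strip().splitlines()
    if not lines:
        return ["message is empty"]

    problems = []
    summary = lines[0].strip()
    if len(summary) > MAX_SUMMARY_LENGTH:
        problems.append(
            f"summary is {len(summary)} characters; limit is {MAX_SUMMARY_LENGTH}"
        )
    if summary.lower().rstrip(".!") in JUNK_SUMMARIES:
        problems.append("summary looks like in-progress junk; squash it away")
    if len(lines) > 1 and lines[1].strip():
        problems.append("second line should be blank to separate summary from body")
    return problems


if __name__ == "__main__":
    good = (
        "Retry failed uploads with backoff\n"
        "\n"
        "Uploads used to fail permanently on the first timeout.  Retry up to\n"
        "three times so that transient network errors stop paging on-call.\n"
    )
    print(lint_commit_message(good))      # no problems expected
    print(lint_commit_message("whoops"))  # flagged as junk
```

The point is not the specific rules but that the “What?” lives in a skimmable first line while the “How?” and “Why?” live in the body.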

Automation — Automate processes not just to improve efficiency and repeatability but also to tell a story of how things are supposed to work.  Structure code repositories the way a librarian would to foster discoverability.  Lay out the source code in projects the way an author of a book would to afford comprehensibility at multiple levels of zoom.  Craft logging and exception handling to be equally useful to humans and computers alike.  Think carefully about how you decompose logic into units and put serious effort into giving them meaningful names that communicate their purpose.  Or, if that’s too many things to remember, just remind yourself to always code as if the person who ends up maintaining your code is a violent psychopath who knows where you live.
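As one illustration of logging that serves humans and computers alike, the sketch below uses Python’s standard `logging` module to emit one JSON object per line: easy enough to eyeball and trivial to parse.  The names `JsonFormatter`, `build_logger`, and the `context` field are my own inventions for this example, not part of any library.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Anything passed via `extra={"context": ...}` lands on the record.
        context = getattr(record, "context", None)
        if context is not None:
            payload["context"] = context
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("payments")
    try:
        int("not a number")
    except ValueError:
        # The message tells a human what went wrong; the context and the
        # traceback give a machine (or a tired on-call engineer) the details.
        log.exception("could not parse amount", extra={"context": {"order_id": "A-17"}})
```

The design choice worth noting is that the human-readable message and the machine-readable context travel together, so nobody has to regex the former to recover the latter.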

Parting Advice

Wash your hands, stop touching your face, find a way to exercise, resolve to stay calm, support your medical professionals, and if at all possible #staythefuckhome.

 

Three Options

Over a lifetime of wrangling projects and relationships one repeatedly encounters a decision point with three options around managing sub-optimal outcomes.  These options exist in all contexts and carry similar costs, benefits, and risks.  Maintaining awareness of them and choosing deliberately between them makes all the difference.  These options are:

  1. Renovate
  2. Tolerate
  3. Separate

I imagine that many of my most expensive mistakes in life stem from losing sight of this or failing to prioritize making such a hard decision.

In my time at Bridgewater Associates I had many opportunities to reflect on this meta-challenge, perhaps provoked in substantial measure by staring at the Dot Collector (jump to 9:00) and thinking on the oft thin line between “Problems – Not Tolerating Them”, “Determination”, and “Practical Thinking” in many contexts, a realm where “Seeing Multiple Possibilities” and “Dealing With Ambiguity” hold central importance.  A high tolerance for pain is a powerful but dangerous personality trait.  One must take great care to separate “can” from “should” here.

For the sake of brevity I will focus on the application of this thinking to Software Engineering in a highly entrepreneurial context, a domain with which I have wrestled for most of my career at many different kinds of employers.

The key elements where one will have investments and opportunities include:

  1. People
  2. Product
  3. Process
  4. Technology
  5. Market

Wisdom involves constantly making an explicit decision between renovation, toleration, or separation for matters in each of these realms.  Meanwhile the criteria by which one ought reason about ongoing approaches include:

  1. Present Value
  2. Future Value
  3. Support Cost
  4. Opportunity Cost
  5. Risk Profile

As a Maker who takes pride in one’s work, one harbors a strong desire for continual and unending improvement, but the optimal capture of value generally comes from an implementation that falls far short of perfection.  As a Dreamer who can imagine the applicability of one’s work to many problems, one tends to keep fighting for one’s envisioned utopia, but there is no validation of one’s ideas quite like users, funding, and revenue.  As an Entrepreneur one needs to be both of these things, but within reason, tempered by humility, practicality, and data-driven analysis that underpin ruthless prioritization, fanatical focus, and judicious risk management.

Dijkstra perhaps nails a central problem in insisting that ‘we should not regard [lines of code] as “lines produced” but as “lines spent”’.  With every line of code we produce we create a maintenance burden, increase the cognitive load to add other new features, forgo countless other opportunities, decrease the system’s reliability, and add complexities and risk around security.  Sometimes Good Enough is truly Good Enough.  Use that third-party tool that gives you 90% of what you need and move on.  Maybe tolerate that annoying but rare bug with a frustrating but bearable manual remediation approach.  Or, in what is a much harder pill to swallow but sometimes the right choice: Burn it down.

Letting go of things is hard.  Firing people, abandoning products, scrapping processes, ditching technologies, and leaving markets HURTS.  But being willing to do so is key in being able to innovate.

If you want to win, then you have to be agile.  If you want to be agile, then you have to be able to pivot quickly _and_ progress rapidly.  This entails a combination of:

  1. Continually making the right foundational investments
  2. Maturing processes in sync with value realization
  3. Aggressively pruning fruitless approaches to free resources

Tech has gotten incredibly complicated, most problems have multiple acceptable solutions, and having your engineers slog through duplicative, non-differentiating, infrastructure-level sludge as you progress through a series of application-layer pivots is horrifically expensive, so some foundational investments in DevOps and Data Engineering are critical.  Bias toward continual renovation here, but don’t prematurely optimize and don’t be afraid to give up on tech that isn’t working.

Change control and automated testing are wonderful things for maintaining agility but treat them as a double-edged sword.  They are wonderful for being able to move fast _without_ breaking things once you have proven value creation whose disruption would seriously damage client relations.  They can prove a pointless encumbrance when you don’t even know what you ought be building and nobody cares about it yet.  Unless you find yourself subject to undue risk or drag, be prepared to tolerate shortcomings in these realms.

When wrangling application-layer development: fail fast, fail often, and do so in a highly informed way.  In the short-term, leverage usage telemetry and analytics to understand how people are employing your system and where their engagement breaks down.  In the medium-term, gauge interest by a willingness to become a paying customer. In the long-term, customer retention and market capitalization tell the story.
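For the short-term signal, even a crude funnel analysis can show where engagement breaks down.  The sketch below is a minimal illustration in Python; the stage names and counts are made up, and `largest_drop_off` is a hypothetical helper rather than anything from a real analytics package.

```python
# Find the transition in a usage funnel where the largest share of users is lost.
# Stage names and counts are invented for illustration.

FUNNEL = [
    ("visited", 1000),
    ("signed_up", 400),
    ("created_project", 300),
    ("invited_teammate", 60),
]


def largest_drop_off(funnel: list) -> tuple:
    """Return (from_stage, to_stage, fraction_lost) for the worst transition."""
    worst = None
    # Walk consecutive pairs of stages and compare the fraction of users lost.
    for (prev_name, prev_count), (next_name, next_count) in zip(funnel, funnel[1:]):
        if prev_count == 0:
            continue  # nobody reached this stage, so there is nothing to lose
        lost = 1 - next_count / prev_count
        if worst is None or lost > worst[2]:
            worst = (prev_name, next_name, lost)
    if worst is None:
        raise ValueError("funnel needs at least two stages with users")
    return worst


if __name__ == "__main__":
    start, end, lost = largest_drop_off(FUNNEL)
    print(f"Biggest drop: {start} -> {end} ({lost:.0%} lost)")
```

With these invented numbers the worst transition is from creating a project to inviting a teammate, where 80% of users fall away, which would be the place to aim the next time-boxed bet.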

Give people what they want enough to pay to have.  Time-box your bets on things where you can’t seem to get that signal.  Focus and prioritize your efforts as if those were your most important tasks because they are.  Always remember all of your options.  Steer clear of the Sunk Cost Fallacy.  Be willing to set fire to what is not working and use its light to guide your way.