Simplicity Begets Complexity

Aeons ago, in a pre-cloud era of my professional life, I found myself bootstrapping a software system as a (mostly) solo developer.  It would ultimately prove successful beyond my imagination, bringing along an assortment of people and systems (the key ingredients to real and lasting success), but it had humble beginnings.  It commenced with the scaffolding of Ruby on Rails, which apart from ActiveRecord quickly fell away, but its PostgreSQL core persisted.

A colleague recently remarked that “PostgreSQL is the second best choice for everything”.  That resonated with me.  As you bring a new system into existence, a needlessly complex tech stack thwarts progress.  Finding a way to leverage a single piece of coherent tech as the bedrock yields enormous benefits at the outset.  Doing so also entails substantial peril as you find yourself outgrowing yet addicted to that foundation.

git pull; rake db:migrate; touch tmp/restart.txt

During the earliest days, and for a long time, that was pretty much the deploy process for the aforementioned system.  I was able to move fast.

Of course, even in that one line an insidious race condition lurks.  The disk in your environment gets new code, the processes in memory keep running old code for a transient period, and some of that running code may be out of sync with your database’s layout, which can cause data loss or corruption.

But…  Move fast and break things!  That’s a fine philosophy when working on a prototype.  It may even be fine for relatively mature products with low criticality and/or under certain definitions of “break”.  And certainly failing to have anyone care enough about your software to notice that it broke often proves the bigger risk.

Eventually, though, reality catches up and you begin to harden your system iteratively.  For me, in this particular adventure, that meant continually rearchitecting for looser coupling.

Durable message queues came to sit between my PostgreSQL and other systems.  Message processing became more decoupled, with distinct “ingest” and “digest” phases ensuring that we never lost data by first just writing down a raw document into a blob field and subsequently unpacking it into a richer structure.  Substantive changes to data formats rolled out in two or more phases to provide overlapping support that prevented data loss while allowing zero(-ish) downtime deploys.  An assortment of queues, caches, reporting faculties, and object stores accreted within the storage engine.
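
To make that ingest/digest split concrete, here is a minimal sketch of the idea using the Ruby pg gem.  The table and column names are hypothetical stand-ins rather than the original schema; the shape is the point: write the raw document down first, then unpack it in a separate, retryable step.

require "pg"
require "json"

conn = PG.connect(dbname: "appdb")

# Ingest: persist the raw payload verbatim so it can never be lost,
# even if parsing or downstream processing blows up.
def ingest(conn, raw_payload)
  conn.exec_params(
    "INSERT INTO raw_documents (payload, ingested_at) VALUES ($1, now()) RETURNING id",
    [raw_payload]
  ).first["id"]
end

# Digest: unpack the blob into richer structure after the fact.
def digest(conn, raw_id)
  payload = conn.exec_params(
    "SELECT payload FROM raw_documents WHERE id = $1", [raw_id]
  ).first["payload"]
  doc = JSON.parse(payload)
  conn.exec_params(
    "INSERT INTO readings (raw_document_id, sensor, value) VALUES ($1, $2, $3)",
    [raw_id, doc["sensor"], doc["value"]]
  )
end

digest(conn, ingest(conn, '{"sensor":"pump-7","value":42.1}'))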

And so PostgreSQL found itself variously serving as data lake, data warehouse, data mart, and data bus.  This simplified certain classes of problems a lot.

Imagine that you are implementing backup-and-restore procedures.  Wouldn’t it be great if your entire state store were one engine that you could snapshot easily?

Imagine that you are a very small team supporting all of your own ops.  Wouldn’t it be great if you only had a single tech product at the core of your system to instrument and grok?

PostgreSQL was that thing.

It was wonderful.

And it was awful.

Consolidating on that one piece of tech proved pivotal in getting the system off the ground.  Eventually, however, the data processing volume, variety, and velocity ballooned, and having everything running on PostgreSQL took years off of my life (and apologies to those on whom I inflicted a similar fate).

Load shocks to internal queues made Swiss cheese of on-disk data layouts.  Reporting processes would bulldoze caches.  Novel data would confound the query planner and cause index scans through the entirety of massive tables.  The housekeeping “vacuum” processes would compete with real-time mission data processing while also running the risk of failing to complete before hitting a transaction wrap-around failure (and once caused what was possibly the most stressful day of my career).
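
For anyone who has not lived it: the wrap-around risk is at least easy to see coming.  A check along these lines is cheap insurance; the alert threshold here is purely illustrative, and the same pg-gem idiom as above is assumed.

require "pg"

conn = PG.connect(dbname: "appdb")

# age(datfrozenxid) reports how many transactions old each database's oldest
# unfrozen rows are; PostgreSQL stops accepting writes as this nears ~2 billion.
WARN_AT = 1_000_000_000

conn.exec(
  "SELECT datname, age(datfrozenxid) AS xid_age FROM pg_database ORDER BY xid_age DESC"
).each do |row|
  flag = row["xid_age"].to_i > WARN_AT ? "  <-- vacuum aggressively, now" : ""
  puts "#{row['datname']}: #{row['xid_age']}#{flag}"
end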

“It’s always vacuum”, I wrote on a sticky-note that I affixed to the monitor of the guy who often took the brunt of front-line support.  “Always” was only slightly hyperbolic, true often enough to be a useful heuristic.

So simple, yet ultimately so complex.  We ended up spending a lot of time doing stabilization work.  It was stressful.  But at least we knew it was engineering muscle going into a proven and useful product.

Fast-forward to the present.

I have of late been building systems within AWS in a cloud-first/cloud-native fashion that anticipates and preempts an assortment of the aforementioned challenges.  The allure of Serverless, high availability, extreme durability, and elastic scalability is strong.  These things come, however, at a cost, often nefarious.  The raw body of tech knowledge you need to understand grows linearly with the number of pieces of component tech you have fielded.  The systems integration challenges, meanwhile, grow geometrically complex, and holding someone accountable to fix a problem at component boundaries proves maddening.

When a CloudFormation stack tries to delete a Lambda that is attached to a VPC via a shared ENI and unpredictably hangs for forty minutes because of what is likely a reference counting bug and unclear lines of responsibility, who you gonna call?  And when the fabric of your universe unravels without warning or transparency because you have embraced Serverless in all its glory and your cloud provider rolls out a change that is supposed to just be an implementation detail but that bleeds up to the surface, what you gonna do?

This other way of building apps, leveraging a suite of purpose-focused tools and perhaps riding atop the PaaS-level offerings of a cloud provider, can provide an enormous amount of lift, raise the scalability ceiling, and relieve you from stressing over certain classes of unpredictability.  It does, however, come at the risk of front-loading a lot of technical complexity when you are struggling to prove your application’s viability.

In some cases the best approach may be to employ a very simple tech stack while structuring the logical boundaries of your code base in a way that anticipates an eventual lift-and-shift to a more heterogeneous architecture.  In other cases wisdom may lie in reducing guarantees about correctness or timeliness, at least for a while.  Know the application you are building, remain cognizant of where you are in the application’s life cycle, make risk-based choices from that context, and beware the gravity well that you are creating.
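
As one sketch of what structuring those logical boundaries can look like (the names here are invented for illustration, not lifted from any of the systems above): hide the queueing decision behind a tiny seam whose first implementation is just a PostgreSQL table, so that swapping in SQS or Kafka later is a local change rather than a rewrite.

# Callers speak to "a queue", never to PostgreSQL directly.
class PgQueue
  def initialize(conn, table: "work_items")
    @conn = conn
    @table = table
  end

  def enqueue(payload)
    @conn.exec_params("INSERT INTO #{@table} (payload) VALUES ($1)", [payload])
  end

  def dequeue
    # FOR UPDATE SKIP LOCKED (PostgreSQL 9.5+) keeps competing workers
    # from fighting over the same row.
    res = @conn.exec(
      "DELETE FROM #{@table}
        WHERE id = (SELECT id FROM #{@table} ORDER BY id
                    LIMIT 1 FOR UPDATE SKIP LOCKED)
        RETURNING payload"
    )
    res.first && res.first["payload"]
  end
end

# Later, an SqsQueue exposing the same enqueue/dequeue methods can be
# dropped in without touching the call sites.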

If today you can get by on a simple snapshot-and-restore scheme and tomorrow move to a journal-and-replay approach, perhaps you will have had the best of both worlds.

And, remember: It’s always vacuum.
