All posts by awgibbs

Simplicity Begets Complexity

Aeons ago, in a pre-cloud era of my professional life, I found myself bootstrapping a software system as a (mostly) solo developer.  It would ultimately prove successful beyond my imagination, bringing along an assortment of people and systems (the key ingredients to real and lasting success), but it had humble beginnings.  It commenced with the scaffolding of Ruby on Rails, which apart from ActiveRecord quickly fell away, but its PostgreSQL core persisted.

A colleague recently remarked that “PostgreSQL is the second best choice for everything”.  That resonated with me.  As you bring a new system into existence, a needlessly complex tech stack thwarts progress.  Finding a way to leverage a single piece of coherent tech as the bedrock yields enormous benefits at the outset.  Doing so also entails substantial peril as you find yourself outgrowing yet addicted to that foundation.

git pull; rake db:migrate; touch tmp/restart.txt

During the earliest days, and for a long time, that was pretty much the deploy process for the aforementioned system.  I was able to move fast.

Of course even in that one line an insidious race condition lurks.  The disk in your environment gets new code while, for a transient period, the memory of your processor nodes is still running the old code, and some of that running code may be out of sync with your database’s layout, which may cause data loss or corruption.

But…  Move fast and break things!  That’s a fine philosophy when working on a prototype.  It may even be fine for relatively mature products with low criticality and/or under certain definitions of “break”.  And certainly failing to have anyone care enough about your software to notice that it broke often proves the bigger risk.

Eventually, though, reality catches up and you begin to harden your system iteratively.  For me, in this particular adventure, that meant continually rearchitecting for looser coupling.

Durable message queues came to sit between my PostgreSQL and other systems.  Message processing became more decoupled, with distinct “ingest” and “digest” phases ensuring that we never lost data by first just writing down a raw document into a blob field and subsequently unpacking it into a richer structure.  Substantive changes to data formats rolled out in two or more phases to provide overlapping support that prevented data loss while allowing zero(-ish) downtime deploys.  An assortment of queues, caches, reporting faculties, and object stores accreted within the storage engine.

And so PostgreSQL found itself variously serving as data lake, data warehouse, data mart, and data bus.  This simplified certain classes of problems a lot.

Imagine that you are implementing backup-and-restore procedures.  Wouldn’t it be great if your entire state store was one engine that you could snapshot easily?

Imagine that you are a very small team supporting all of your own ops.  Wouldn’t it be great if you only had a single tech product at the core of your system to instrument and grok?

PostgreSQL was that thing.

It was wonderful.

And it was awful.

Consolidating on that one piece of tech proved pivotal in getting the system off the ground.  Eventually, however, the data processing volume, variety, and velocity ballooned, and having everything running on PostgreSQL took years off of my life (and apologies to those on whom I inflicted a similar fate).

Load shocks to internal queues made Swiss cheese of on-disk data layouts.  Reporting processes would bulldoze caches.  Novel data would confound the query planner and cause index scans through the entirety of massive tables.  The housekeeping “vacuum” processes would compete with real-time mission data processing while also running the risk of failing to complete before hitting a transaction wrap-around failure (and once caused what was possibly the most stressful day of my career).
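One way to keep an eye on the wrap-around risk is to watch how old each database's frozen transaction ID horizon has become.  The sketch below is illustrative only: the query is the standard one against pg_database, but the classify_xid_age and report helpers and their warning thresholds are assumptions of mine, not PostgreSQL defaults, and the code takes already-fetched rows rather than assuming any particular database driver.

```python
# Reports, per database, how many transactions old the frozen XID horizon is.
# Run it with whatever PostgreSQL client you already use.
XID_AGE_QUERY = "SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY 2 DESC"

# Transaction IDs are 32-bit, so roughly two billion are usable before wrap-around.
XID_SPACE = 2**31


def classify_xid_age(xid_age, warn_fraction=0.5, critical_fraction=0.8):
    """Bucket a database by how much of the XID space it has consumed.

    The fractions are illustrative thresholds chosen for this sketch.
    """
    fraction = xid_age / XID_SPACE
    if fraction >= critical_fraction:
        return "critical"
    if fraction >= warn_fraction:
        return "warn"
    return "ok"


def report(rows, **thresholds):
    """Map each (datname, xid_age) row from XID_AGE_QUERY to a status string."""
    return {datname: classify_xid_age(age, **thresholds) for datname, age in rows}
```

Feeding the rows from XID_AGE_QUERY into report() on a schedule gives a crude early warning well before vacuum is racing the clock.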

“It’s always vacuum”, I wrote on a sticky-note that I affixed to the monitor of the guy who often took the brunt of front-line support.  “Always” was only slightly hyperbolic, true often enough to be a useful heuristic.

So simple, yet ultimately so complex.  We ended up spending a lot of time doing stabilization work.  It was stressful.  But at least we knew it was engineering muscle going into a proven and useful product.

Fast-forward to the present.

I have of late been building systems in a cloud-first/cloud-native fashion within AWS that anticipates and preempts an assortment of the aforementioned challenges.  The allure of Serverless, high availability, extreme durability, and elastic scalability is strong.  These things come, however, at a cost, often nefarious.  The raw body of tech knowledge you need to understand grows linearly with the number of pieces of component tech you have fielded.  The systems integration challenges, meanwhile, grow geometrically in complexity, and holding someone accountable to fix a problem at component boundaries proves maddening.

When a CloudFormation stack tries to delete a Lambda that is attached to a VPC via a shared ENI and unpredictably hangs for forty minutes because of what is likely a reference counting bug and unclear lines of responsibility, who you gonna call?  And when the fabric of your universe unravels without warning or transparency because you have embraced Serverless in all its glory and your cloud provider rolls out a change that is supposed to just be an implementation detail but that bleeds up to the surface, what you gonna do?

This other way of building apps, leveraging a suite of purpose focused tools and perhaps riding atop the PaaS-level offerings of a cloud provider, can provide an enormous amount of lift, raise the scalability ceiling, and relieve you from stressing over certain classes of unpredictability.  It does, however, come at the risk of front-loading a lot of technical complexity when you are struggling to prove your application’s viability.

In some cases the best approach may be to employ a very simple tech stack while structuring the logical boundaries of your code base in a way that anticipates an eventual lift-and-shift to a more heterogeneous architecture.  In other cases wisdom may lie in reducing guarantees about correctness or timeliness, at least for a while.  Know the application you are building, remain cognizant of where you are in the application’s life cycle, make risk-based choices from that context, and beware the gravity well that you are creating.

If today you can get by on a simple snapshot-and-restore scheme and tomorrow move to a journal-and-replay approach, perhaps you will have had the best of both worlds.
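To make the distinction concrete, here is a toy sketch of the two recovery schemes.  The Store class and its operations are hypothetical and held entirely in memory; a real system would persist both the snapshot and the journal durably.

```python
import copy


class Store:
    """Toy key-value store that records every change in an append-only journal."""

    def __init__(self):
        self.state = {}
        self.journal = []  # list of (op, key, value) tuples, oldest first

    def apply(self, op, key, value=None):
        """Validate, journal, then apply a single change."""
        if op not in ("set", "delete"):
            raise ValueError(f"unknown op: {op}")
        self.journal.append((op, key, value))
        if op == "set":
            self.state[key] = value
        else:
            self.state.pop(key, None)

    def snapshot(self):
        """Capture the full state plus the journal position it corresponds to."""
        return copy.deepcopy(self.state), len(self.journal)

    @classmethod
    def restore(cls, snapshot):
        """Snapshot-and-restore: simple, but changes made after the snapshot are lost."""
        state, _position = snapshot
        store = cls()
        store.state = copy.deepcopy(state)
        return store

    @classmethod
    def replay(cls, snapshot, journal):
        """Journal-and-replay: start from the snapshot, then re-apply later changes."""
        state, position = snapshot
        store = cls()
        store.state = copy.deepcopy(state)
        store.journal = list(journal[:position])
        for op, key, value in journal[position:]:
            store.apply(op, key, value)
        return store
```

If changes already flow through something like apply() while you are still relying on plain snapshots, adding replay later is an extension rather than a rewrite.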

And, remember: It’s always vacuum.


Last Tuesday when I learned of Trump’s crude caricature of Christine Blasey Ford’s testimony I felt my stomach turn.  The baseness and puerility, underpinned by a new depth in realpolitik, filled me with shame and disgust.  I want to say “I don’t understand how we got here”, but I think I do.

Now let’s zoom out and survey the larger landscape.

John Oliver is a genius.  I have for a long time enjoyed his show.  Lately, however, I’ve been struggling to partake of it.  It’s not that I disagree thematically with his positions.  Rather, I have increasingly found his presentation of topics disagreeable.  And that brings us to yesterday.

I am inclined to believe that on Saturday we confirmed a rapist to serve as a judge on the highest court in our land.  And yet the next day when I watched the previous week’s episode of Oliver’s “Last Week Tonight” I found the experience similarly excruciating to Trump’s performance.  I go to Oliver for, among other things, some left-leaning gallows humor.  But I do not recall laughing a single damn time during the whole episode.  I felt like I was watching an imminent train wreck while not knowing what to do about it.  I experienced Oliver’s treatment of the matter as crude, divisive, and generally unhelpful, preaching to a choir that has given up on reaching across the aisle.

The national discourse has sunk to such a level that fears of civil war are not unfounded. Instead of a calm examination of facts and testimony meshed with the perspective of subject matter experts we ran a media circus intended to inflame opinion and further polarize an already divided electorate.  Republicans and Democrats alike had very clear agendas and were each weaponizing the Kavanaugh hearings by and large along party lines in a self-serving fashion.

Judges, at least in theory, are supposed to act as dispassionate and logical arbiters of the laws that our legislative branch puts on the books.  If that were true, however, it would seem unlikely that we would find ourselves so embroiled in conversations around the ideology of judges, alternately fretful of judges promoted by a competing ideology as “legislating from the bench” or excited to pack the court with judges of our ilk to lock in our preferred version of reality.

Can we all stop being such unmitigated assholes?  On the current trajectory things end poorly for everyone.

Check Yourself

Last summer I found myself rebooting my flight training at KDXR through Arrow Aviation with Duke Morasco as my instructor.

Things were going pretty well.  I was ~15 flight hours into the process and Duke thought I was about ready to solo.  I felt confident and capable and in control.  “PP-ASEL, here I come!”, or so I thought.

I found myself out for a lesson with Duke on Thursday 10 August 2017 and…  it was an outlier of a lesson.  I wasn’t sure what was up, but it was our worst lesson together.  I had the sense that Duke was agitated and abrupt, out of character from all of our earlier flights, but I reserved substantial probability mass for it having been my fault, the result of some rust having accumulated from a couple of weeks out of town.

I hoped that it was a fluke and scheduled another lesson with Duke on Saturday 12 August.  That lesson would never take place.

On Friday 11 August I received a call from Arrow Aviation.  Duke had been killed in an accident while up in N1727V with another student.


I found myself in shock, confused, and light on information.  For a long time I had little to go on, just an assortment of news articles and a preliminary NTSB report.  Was it during take-off or landing?  Was there a mechanical failure or operator error?

Somewhat insensitively Arrow asked if I wanted to schedule with a new instructor on Sunday.  I told them I needed some time to reflect.  Insanely, Arrow had just lost another plane on 30 July during a failed take-off, and I did not feel like tempting fate.

For over a year I found myself wondering what had happened.  At last the NTSB has issued a final report.

Most notably…

According to GPS data, the airplane landed on and then took off from a grass airstrip, climbed about 150 ft, then collided with terrain about 1,000 ft past the end of the runway.


… and furthermore…

An examination of the wreckage did not reveal any evidence of a preaccident mechanical malfunction or anomaly. An examination of the flight controls revealed that the wing flaps were in the fully extended (40°) position at impact. The airplane’s operating checklist stated that normal and obstacle clearance takeoffs are performed with wing flaps up, and flap settings greater than 10° are not recommended at any time for takeoff. Upon landing on the grass runway, the flaps should have been retracted as part of the after-landing checklist, then confirmed up as part of the before takeoff and takeoff checklists. It is likely that the flap setting at the time of takeoff resulted in an aerodynamic stall and loss of control during the initial climb.


Well, shit.

The student pilot was apparently pretty green.  And it seems like nobody realized that the aircraft was in an excessively high-drag wing configuration prior to take-off.  This, in concert with the natural resistance of a grass-field airstrip, and in conjunction with some nasty trees beyond the threshold, presumably led to a late rotation and inadequate rate of climb that culminated in a panic, stall, and crash.


So preventable.

Take your time.  Run your checklists.  Don’t get complacent.

And be wary of relying on “experts”.  They get over-confident or overwhelmed and make mistakes just like everyone else.

This is doubtless good advice in many contexts, professional and recreational.  If what you’re doing is complicated and dangerous, take the time in a calm and quiet moment to codify how you want to operate in every circumstance.  Your future stressed-out self will thank you.

And it’s not just about the operation’s procedures.  It’s about assessing you, the operator. Every aircraft comes with a comprehensive checklist for every stage of flight.  And yet pilots are further counseled to run the IMSAFE checklist against themselves before getting behind the controls.  The risks of illness, medication, stress, alcohol, fatigue, and emotion are all too real.  And some of those items are extremely difficult to gauge.  It’s pretty straightforward to avoid getting into a cockpit while sick, medicated, or drunk.  But how stressed, fatigued, or emotional is too much?

I wonder how to navigate these circumstances when the impacts are less dramatic and more ambiguous than crashing a plane.  How many times have I driven a car when exhausted and distracted?  How many times should I have waited to share an opinion or make a decision until I had attained a better mind-state?

Choices and consequences.


Thoughtless Development

Back when I was a boy, we ran servers on bare metal and we liked it.

And then there were containers.

And then there was AWS Lambda: “Run code without thinking about servers.”

Dwindling are the folks who might even know what “lsof”, “ps”, “top”, “nc”, “traceroute”, “df”, and “ldd” are, much less when to use them.

Actually, Lambda is pretty great, and I use it a lot, but damn does it make it easy to grow your attack surface and forget that you’ve done so.  And, at the end of the day, there are servers, and that reality has implications for availability and latency in whatever system you are building.

Meanwhile, infra-as-code faculties have proliferated, and many folks are using them, but the siren’s song of infra-as-clicks is quite strong, and the potential to create a non-repeatable mess in the cloud provider of your choice is great.

Be Strong.

But let’s get more concrete…

Today I was in the pantry at the office and on the TV I saw some talking head with a green-screen behind him on which three logos were painted in a repeating pattern: New England Patriots, Dunkin Donuts, and…  Zudy: “No Code Apps”.


Football, donuts, and faux enterprise software development.  LOL WUT.

Zudy’s marketing hype is intense: “No Code Enterprise Apps; Join The No-Code Evolution; Build game changing apps in days”.

Oh, FFS.  It was bad enough that we had to endure the No SQL shenanigans for about a decade before Make SQL Great Again got legs.  Now we’re going to pretend that we can develop apps without even thinking?

Spoiler alert: creating apps is easy; developing them over time once data has begun accumulating and people have begun broadly using them is hard.

We are witnessing a proliferation of shiny technologies that make it easy to bring new capabilities into existence, with the promise of old baggage being jettisoned, but we are not seeing commensurate faculties to manage and evolve these capabilities as we attempt to navigate a full system lifecycle.

I’m sorry, but the majority of the code written for a mature software system centers on logging, testing, data modeling, exception handling, security hardening, performance tuning, configuration management, release management, and inter-version compatibility.  This is the inescapable bread-and-butter engineering work of taking the kernel of an idea to a robust system that can handle day-to-day usage by an army of users in a way that is not completely maddening.

This is not new.  But the frequency with which products like this crop up is increasing.  We see examples of it in such offerings as Splunk Phantom and NiFi.  And yet the well of uncomfortable truths tells us that “you’ll never find a programming language that frees you from the burden of clarifying your ideas”.

But, fear not…  If you get yourself wrapped around the axle, Zudy has an “AppFactory” and is more than happy to “Let Zudy’s experts build your apps for you.”  Congratulations.  You just built yourself a thicket of tech debt and hired some third-rate contract programmers who will hold you hostage in perpetuity.

There are two kinds of enterprises: the kind who create and manage software deliberately and wittingly, and the kind who do so accidentally and unwittingly.  Which will you be?

Rage Against The Machine

At last week’s Strata Conference the buzzword exhibiting the highest frequency count appeared to be “Explainable” as prepended to “Artificial Intelligence”.  We have collectively transcended “can we make it work?” and landed squarely in “why did it make that decision?” territory.

In highly regulated industries the government applies a strong back pressure on non-explainable algorithmic decisions.  This serves as a check against runaway and impenetrable automation of decision making.  Yet clearly not all AI-driven industries that can exert an enormous impact on our lives find themselves subject to such controlling forces.  And from one country to another the degree of regulation for a given industry can vary greatly.

The UAE’s Daman gave an interesting talk on how they applied Natural Language Processing techniques to non-textual data in the healthcare claims adjudication space.  The strategy appeared to enjoy substantial and measurable success.  What creeped me out, though, was their seeming heavy reliance on customer complaints to act as the corrective force on falsely flagging claims as invalid.  The presenter offered the opinion that if a customer did not fight a claim rejection then the claim was probably invalid or unimportant anyway.

This feels like data scientists engaging in cost externalization to customers who exist in a fairly disadvantaged position and who must now fight back against a maddeningly opaque decision engine.  This appeared especially so in the case of Daman who apparently controls 80% of the health care market in the UAE (cited by one of the presenters as a reason why this particular data set was super cool to work on).

What force would stop such a company from taking the next logical step in profit optimization?  Auto-tune the rejection of valid claims to the sweet spot where statistically customers don’t fight it because getting their due does not justify the cost.

There has been much talk of how we must not allow the “Kill Decision” to fall into the hands of robots in warfare.  How easy it would be to make the same mistake in less sensational contexts.

Social Engineering

As an opportunistic hobby I will occasionally engineer my way into “illicit” access to my own stuff as a reminder of how vulnerable I am to shenanigans.

Tonight I returned to my hotel room and found my key card unwilling to open my door. It was not that authorization had failed, but rather authentication, as neither the red nor green light came on. I reduced the theory space to a fried card by swiping it on someone else’s door which also gave no recognition of it. I suppose it could also have been that _all_ readers were dead, but that seemed unlikely as there was not a line of irate guests at the front desk as I passed it moments earlier. And I suppose it could have been awkward if that room’s occupants had shown up just as I swiped at their door, but #whatevs.

“I think my card is fried, room XXX”, I said, and handed it to the desk attendant. “Name?”, he asked. “Andrew”, I replied, giving as little information as possible, and not offering my ID, which he did not request. “Oh, yeah, totally dead. You put it next to a phone or something?”, he remarked. “Maybe. Not sure”, I replied noncommittally. He programmed up a new card and handed it over, no more questions asked.

I looked at the physical card afterward. There is no identifier imprinted on it.

So I am pretty sure all I need to go into any arbitrary room in this hotel is knowledge of someone’s name and a room card over which I have dragged a magnet.

The things a less scrupulous person could do with so little… Maybe snoop the guest ahead of you for their name and have access to a card that was “lost” from a previous visit? In you go. Maybe not worth the trouble to steal someone’s wallet. But maybe to leave a little Novichok behind?

Assigning Credit

I was listening to a Motley Fool podcast this morning in which their product promotion segment referred to the Eero home WiFi system and provided a promo code of “fool” to get free shipping off of the Eero web site.  I have of late been having a crappy experience with my NetGear router and so thought I would give it a try.  I went to the web-site on my iPhone and…  it didn’t load.  I went on with my morning and a little bit later tried pulling it up on my laptop.

While there I decided to search for it on Amazon to see what its reviews were.  Lo and behold it was reviewed very positively.  And its list price was $50 cheaper.  And I could have free shipping with a guaranteed Tuesday 2 January delivery.

But it felt crappy to learn about the product from TMF and not have them get credit.  And I’ve been hearing increasingly unhappy stories about vendors getting bullied by Amazon over pricing.  So I went all the way to the check-out step on both Amazon and on Eero to see what the delta in cost and experience would be.

To Amazon’s lower list price would be added $22 in taxes.  Either that was already in the Eero website price or it was not being taxed.  For shipping, Eero’s site would provide free one-day FedEx shipping (thanks to the promo code), but also said that “orders ship within 1-2 business days”, leaving me with poor clarity on when the item would arrive.

So, with a $28 higher price, a worse shipping experience, an unclear product return workflow, and order history fragmentation, I couldn’t bring myself to buy directly from Eero.  If the total difference in experience had been just a small delta in price, say within $20, I would have probably on principle purchased directly through Eero, but the holistic Amazon experience was too superior to pass on, and their site reliability and review system pulled me into their gravity well on this purchase before Eero could close the deal directly.

Oops.  It’s actually even worse.  The 5% cash back my Amazon Prime Rewards Visa will give me almost entirely closes the effective pricing gap.

Ironic Fail: The TMF podcast had as one of its topics the runaway dominance of Amazon in the e-commerce space.