
I remember a simpler time some twenty-five years ago when a three-tiered architecture of web server, application server, and database server defined my reality at a part-time job I worked during university. If memory serves, this consisted of a lightweight Apache front-end, a heavier middle-tier Apache running mod_perl, and a backend running MySQL. Maybe there was another tier in the form of a database connection pooler, but more likely I am hallucinating that we even had separate Apache tiers. There were, meanwhile, actual servers on which all this ran, to which I might SSH from my college dorm room to read log files and restart processes on the local system. Actually, maybe there was just a single server with all this on it. These were, as earlier noted, simpler times.
In various ways this setup managed to exhibit both better and worse security than one might hope. The network architecture, for one, was flat and promiscuous. Secrets probably lived in plaintext in a local configuration file, vulnerable to pickup by a backup process. Code ran, as previously mentioned, on long-lived servers, seeing as virtual machines and containers hadn’t yet gone mainstream. I bet we weren’t great at patching. Oh, and everybody had admin. But, hey, at least our log files weren’t being shipped to a third party of unknown trustworthiness through a network conduit that could also be used to exfiltrate data to anyone. I think we at least had local packet filtering hardened enough that its set of rules went beyond “allow from anywhere” and “allow to anywhere”. Maybe.
My eleven years in the government, from 2005 to 2016, spanned 3.5 years on one project and 7.5 years on another, both bootstrapped by yours truly because that was more fun than waiting around for someone else to tell me what to do. The first such project began life hung off someone else’s web server in a way that might have drawn the ire of compliance officers had they only known what was up; eventually, though, it landed on its own cluster of physical servers after I wowed the director and he promised a truckload of cash. There are, however, as Oscar Wilde apparently said, only two tragedies in life, namely not getting what you want and getting what you want, a reality impressed upon me by the long slog from allocated dollars to racked equipment with routable IPs. My second project, meanwhile, had me navigating a similarly scrappy start, but at least in a world where people had discovered not just transistors but also VMs, allowing me to allocate more of my time to wrangling software and peopleware versus hardware.
In neither case did my government-era projects embody a wildly different application architecture than the aforementioned university one. The technical challenges were greater, the mission stakes were higher, the personal accountability relatively enormous, and there were unprecedentedly thorny enterprise integration struggles, but beyond adding a message bus to coordinate “workers” and various API interactions to retrieve data, the communication patterns were pretty familiar (well, except for that one time I hand-built a cluster scheduler to underpin a near-real-time workflow orchestrator, but that’s another story). What was quite divergent, however, was that operations lived on a classified network with insanely limited connectivity to the outside world, exhibiting about as well-defined and thoroughly hardened a perimeter as you could find anywhere in the world. The Tibco JMS messaging hub and the SAN with which my system interacted, for instance, quite explicitly lived within these walls, beyond which “The Internet” seemed a distant and foreign land. If data exfiltration were to happen, it would have been far more likely to result from a thumb-drive-toting treasonous turd than from some kind of multi-hop network shenanigans.
I departed the government for Bridgewater Associates in 2016 both with a heavy heart for what I was leaving behind and with a good deal of stoke for the opportunity to play in the public cloud while wrangling different yet related challenges. Jumping into AWS with both feet, I found myself both liberated and horrified. Servers, networks, storage, and messaging fabric sprung into existence with the snap of my fingers as if I were some kind of sorcerer. Much of it, however, was still quite immature compared to the present day, leading to both integration struggles and security risks. Where exactly did my code and data live? What was reachable and by whom? How could I reason about systems that could grow so rapidly and organically? Simple misconfigurations coupled with adversarial action could result in staggeringly bad security breaches of a kind from which I had previously been sheltered by default. The perimeter had gotten weird.
Some history may be in order… AWS released their queueing service, SQS, in 2004, their VM service, EC2, in 2006, and their software-defined networking offering, VPC (Virtual Private Cloud), in 2009. VPC Endpoints, meanwhile, the mechanisms that bring connectivity to such services “inside” your VPC, such that public or NAT’ed IPs are no longer a requirement to reach AWS services and data flows no longer go “over the Internet”, did not begin to appear until 2015, and except in the cases of S3 and DynamoDB generally did not include support for policy attachment until substantially later. For the raft of endpoints that long went without policy support, namely the “Interface” style that debuted with PrivateLink circa 2017, AWS proudly touted the mere fact that such flows “no longer had to go over the Internet” as a huge security improvement, which was a load of bollocks.
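These days, conjuring such an endpoint takes a single API call per flavor. Here is a minimal sketch using Python and boto3, assuming a VPC in us-east-1; every resource ID below is a hypothetical placeholder.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Gateway-style endpoint for S3 (the 2015-era variety): traffic to S3 is
# steered through the endpoint via subnet route tables rather than a NAT
# Gateway or Internet Gateway.
ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",            # hypothetical VPC ID
    ServiceName="com.amazonaws.us-east-1.s3",
    VpcEndpointType="Gateway",
    RouteTableIds=["rtb-0123456789abcdef0"],  # hypothetical route table
)

# Interface-style endpoint (the PrivateLink variety) for SQS: an ENI with a
# private IP lands in your subnet, and DNS for the service resolves to it.
ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.sqs",
    VpcEndpointType="Interface",
    SubnetIds=["subnet-0123456789abcdef0"],   # hypothetical subnet
    PrivateDnsEnabled=True,
)
```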
Actually, having traffic bound for AWS services not go over the Internet was a big improvement, but mostly only in the sense that you no longer had to foot the bill for AWS’s networking limitations. Your traffic was secure from snooping in either case because TLS provided both encryption of the traffic (thwarting man-on-the-side) and authentication of the counterparty (thwarting man-in-the-middle). The win mostly came in the form of lower bandwidth charges and, I suppose, making data infiltration and exfiltration slightly more difficult by requiring that an adversary exploit specific AWS services as a transfer fabric instead of using the general-purpose Internet through, say, a NAT Gateway. Against a serious adversary, however, this was closer to adding a speed bump to a road than a proper vehicle trap.
So how, one might wonder, did a company of such world-class technical sophistication as Amazon release components as insecure as this and take so long to fix them? The answer lies in the danger of lifting technology out of its original context and placing it in a new one with different assumptions. These components weren’t insecure, at least not in the same way, when it was just Amazon making use of them, for one simple reason — they lived inside Amazon’s perimeter. Much like my government-era applications could take some solace in knowing that they existed on a classified network, so, too, could Amazon’s internal applications presumably glean some defense-in-depth from the whole shebang living inside Amazon’s data centers.
When Amazon repurposed these components by making them available to the general public in the form of cloud services, however, they went from running in a single-tenant context to a multi-tenant context, a situation greatly and unavoidably more dangerous for such tenants until IAM policies could catch up. While I lack direct knowledge of the relevant internal history, I have to imagine that (at least for a time) Amazon protected themselves against such risks by somehow segmenting services like SQS into discrete internal and external domains, relying on such segmentation to protect themselves while leaving a proliferation of third-party users correspondingly vulnerable to attacks by each other.
To give credit where it is due, I will note that AWS has stalwartly continued to grind on all manner of such problems, and in 2025 we have a dramatically improved set of IAM features for both identity-attached and resource-attached policies. Nearly all AWS resource types now support VPC Endpoints that allow the attachment of fine-grained policies, dramatically improving our ability to thwart data infiltration and exfiltration attacks by letting us enumerate which resources are valid candidates for specific actions, in many cases even with object-level security.
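To make that concrete, here is a minimal sketch, again in Python with boto3, of attaching a deliberately narrow policy to an S3 endpoint; the bucket, prefix, and endpoint ID are all hypothetical.

```python
import json
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# A deliberately narrow endpoint policy: only reads of one bucket prefix are
# allowed through this conduit, so a compromised workload cannot use the
# endpoint to push data into an attacker-controlled bucket.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": "*",
        "Action": ["s3:GetObject"],
        # Object-level scoping: a single prefix within a single bucket.
        "Resource": ["arn:aws:s3:::my-app-bucket/app-data/*"],
    }],
}

ec2.modify_vpc_endpoint(
    VpcEndpointId="vpce-0123456789abcdef0",  # hypothetical endpoint ID
    PolicyDocument=json.dumps(policy),
)
```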
The addition of various global condition context keys has also been game-changing for reducing the friction in using such faculties to craft logical perimeters. No longer must we laboriously plug concrete account numbers or IAM entities into policies, a tedium that doubtless slowed the value capture from VPC Endpoints and resource-based policies. In 2018 we got the “aws:PrincipalOrgID” key and in the following year “aws:PrincipalOrgPaths”, which when used in a resource policy allow us to say things like “only these tenants can access this resource” in a flexible way. Coincident with the arrival of VPC Endpoints we got the “aws:SourceVpce” key, which further allows us to granularly specify which network enclaves are authorized to access a resource. And finally in 2022 we got “aws:ResourceOrgID” and “aws:ResourceOrgPaths”, which when used in a VPC Endpoint policy allow us to say things like “only resources my organization owns can be accessed via this network conduit”. Each of these is valuable in its own right, but together they provide powerful interlocking fields of fire to a defender.
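Sketched below, with placeholder org, bucket, and endpoint IDs, is one way those keys interlock: a bucket policy that pins who may call and over which conduit, plus an endpoint policy that pins whose resources may be reached. (Mind the usual caveat that blanket Deny statements like these can lock out consoles and admins too; test before trusting.)

```python
import json

# Resource side: a bucket policy that denies any principal outside our AWS
# Organization ("o-exampleorgid" is a placeholder org ID), and separately
# denies any request that did not arrive via our known VPC endpoint.
bucket_arns = ["arn:aws:s3:::my-app-bucket", "arn:aws:s3:::my-app-bucket/*"]
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyOutsideOrg",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": bucket_arns,
            "Condition": {"StringNotEquals": {"aws:PrincipalOrgID": "o-exampleorgid"}},
        },
        {
            # Pin the network path as well as the principal.
            "Sid": "DenyWrongConduit",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": bucket_arns,
            "Condition": {"StringNotEquals": {"aws:SourceVpce": "vpce-0123456789abcdef0"}},
        },
    ],
}

# Conduit side: an endpoint policy that refuses to carry traffic to any
# resource owned outside our Organization, closing the "exfil to the
# attacker's own bucket" path.
endpoint_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": "*",
        "Action": "*",
        "Resource": "*",
        "Condition": {"StringEquals": {"aws:ResourceOrgID": "o-exampleorgid"}},
    }],
}

print(json.dumps(bucket_policy, indent=2))
print(json.dumps(endpoint_policy, indent=2))
```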
As difficult as these mechanisms may still prove to employ effectively, given both their newness and their want of sensible defaults, they nonetheless collectively represent a huge leap forward in regaining, in the public cloud, some of the defense-in-depth benefits we used to inherit from running in physically distinct enclaves. At least, that is, if all of your components exist within a single security domain of an entity like AWS and thus share a coherent IAM approach… Which of course would likely be silly of you to do because so many of the kinds of components you need in a modern software stack have at best a sad “also ran” offering within AWS… And so the struggle continues…
Consider, for instance, the quite reasonable scenario where within your AWS workloads you would like to employ the third-party product DataDog to ingest, index, analyze, and report on logs, metrics, and traces. How are we to do this in a sensible manner with respect to managing data exfiltration risks? The options, alas, are all fairly crummy, each in their own way. Let us, for the purposes of illustration, consider the specific case of getting logs from an ECS Task running in an AWS VPC to the DataDog mothership.
You could of course cheerily say “screw it — security and unit economics sound like problems for future me” and attach that task to a subnet that can reach DataDog fairly directly via a NAT Gateway, but that risks burning a ton of cash on metered bandwidth while letting someone who has popped your subnet exfil to wherever they please on the Internet. More recently DataDog added support for a PrivateLink VPC Endpoint that you can attach to your VPC, which will greatly reduce your bandwidth charges but — WOMP WOMP WOMP WOMP — deliver precisely zero relief on exfil risks, because oh looky looky AWS provides no faculty whatsoever for the creator of a third-party VPC Endpoint service to support policies. You could have your ECS Task live in a locked-down subnet and write logs to CloudWatch and then have another entity in a different, more promiscuous subnet forward them to DataDog but, yuck, now you’re paying for log ingestion twice (both CloudWatch and DataDog), and oh by the way DataDog has seemingly retired support for their official Lambda forwarder so you need to roll your own. Maybe your ambitions include both avoiding needlessly setting money on fire and not unduly accepting exfil risk, so now you’re on the path of attaching not one but TWO sidecar containers to your ECS Task’s main container, one to forward logs and another to forward traces, the former of which is going to be a Fluentd-underpinned abomination, possibly also scripted by Lua plugins, that probably won’t keep up with data volume and will crash without forwarding all the logs and WHAT THE HELL WHY ARE ALL THE OPTIONS FOR SOMETHING SO SIMPLE SO TERRIBLE?!?!
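For concreteness, the logging half of that last option might look something like the following sketch of a Fargate task definition, in Python with boto3, using FireLens to route the app container’s stdout to DataDog’s log intake. Every ARN, image, and hostname here is a placeholder, and the specifics of DataDog’s Fluent Bit output plugin are assumptions to verify against their current docs rather than gospel.

```python
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# Trimmed-down task definition showing a log-forwarding sidecar wired up via
# FireLens. The router here is Fluent Bit rather than Fluentd proper.
ecs.register_task_definition(
    family="my-app",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="512",
    memory="1024",
    executionRoleArn="arn:aws:iam::123456789012:role/my-task-exec-role",  # hypothetical
    containerDefinitions=[
        {
            # The sidecar: receives log records from the app container.
            "name": "log-router",
            "image": "amazon/aws-for-fluent-bit:stable",
            "essential": True,
            "firelensConfiguration": {"type": "fluentbit"},
        },
        {
            "name": "app",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest",  # hypothetical
            "essential": True,
            # stdout/stderr are handed to the sidecar, which forwards them to
            # DataDog's intake over TLS; the API key comes from Secrets Manager
            # rather than living in plaintext in the task definition.
            "logConfiguration": {
                "logDriver": "awsfirelens",
                "options": {
                    "Name": "datadog",
                    "Host": "http-intake.logs.datadoghq.com",
                    "TLS": "on",
                    "provider": "ecs",
                },
                "secretOptions": [{
                    "name": "apikey",
                    "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:dd-api-key",  # hypothetical
                }],
            },
        },
    ],
)
```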
And so the saga continues… I remember, many years ago, sitting in a grad school classroom through a lecture where the professor was opining on the sequential nature of our computing technology ecosystem’s advancement. First, he noted, we had to cultivate hardware platforms that didn’t entirely suck so that we could even begin to do the same for programming languages, and then operating systems, and then database systems, and on and on to the present day. If my memory serves, his position was that each of these foundational layers took about thirty years (with a little bit of overlap) to figure out, and that seems about right. So here we are, perhaps about twenty years into this public cloud thing, and it’s sorta kinda starting not to be a total dumpster fire, and what is old is ever new again, and I see a glimmer of hope that something resembling perimeter defense is becoming practicable again, just in time for the trash fire that is the modern AI era… I don’t know about you but I think I might like a little bit of perimeter defense as a bulwark against the rising tide of slop.