Information Leakage

Seemingly small data engineering choices can have massive business implications.

I found myself listening this morning to the latest episode of the “Invest Like The Best” podcast, whose guest was Mario Cibelli of Marathon Partners Equity Management. Among assorted topics he spoke of his firm’s long position on Netflix many years ago and the research they put into both rationalizing and bolstering it. The story reminded me of a simple technique I learned long ago that, done right, can prove as valuable from a Security Engineering standpoint as from a Data Engineering one. Consequently, let’s go backward before we go forward.

Early in my career a colleague introduced me to the concept of a surrogate key when modeling relationships in a database. This involves linking rows not by a semantically meaningful key but rather by an abstract key that serves only to identify a row. Notably, the technique prevents an assortment of potential migration headaches as your understanding of the data’s relationships, uniqueness, and volatility evolves. Database setups commonly employ an auto-increment integer primary key to this end, a simple approach that solves many problems.
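To make the idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customer and rental tables and their columns are hypothetical, invented purely for illustration. The point is that a semantically meaningful value (an email address) can change without disturbing any row that references the customer by surrogate key.

```python
import sqlite3

# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute(
    "CREATE TABLE customer ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"  # surrogate key: identifies the row, nothing more
    " email TEXT NOT NULL UNIQUE)"            # meaningful, but free to change
)
conn.execute(
    "CREATE TABLE rental ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " customer_id INTEGER NOT NULL REFERENCES customer(id),"
    " title TEXT NOT NULL)"
)

customer_id = conn.execute(
    "INSERT INTO customer (email) VALUES (?)", ("old@example.com",)
).lastrowid
conn.execute(
    "INSERT INTO rental (customer_id, title) VALUES (?, ?)",
    (customer_id, "Some Movie"),
)

# The email changes; the rental row needs no update because it never stored the email.
conn.execute(
    "UPDATE customer SET email = ? WHERE id = ?", ("new@example.com", customer_id)
)

rows = conn.execute(
    "SELECT c.email, r.title FROM rental r JOIN customer c ON c.id = r.customer_id"
).fetchall()
print(rows)
```

Had the rental table been keyed on the email address instead, that one UPDATE would have had to cascade through every referencing table.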

Databases often extend their tendrils out to the larger world, however, and here related matters get dicey. When exposing the entities of a database via an API these surrogate keys make their way into public view and in doing so, despite being individually abstract, in aggregate may leak information. The range and progression of values that they exhibit can provide a keen observer insight into business operations one might like to keep private.
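As a rough sketch of what a keen observer might do with a handful of sequential identifiers, consider the following. The function names and sample numbers are my own invention, and the first estimator (the classic "German tank problem" formula) assumes that IDs start at 1, increase by one per record, and were sampled uniformly at random. None of that is guaranteed in practice, so treat the results as ballpark figures.

```python
def estimate_total(observed_ids):
    """Estimate how many records exist from a sample of sequential IDs.

    Uses the estimator m + m/k - 1, where m is the largest ID seen
    and k is the number of IDs observed.
    """
    k = len(observed_ids)
    m = max(observed_ids)
    return m + m / k - 1


def daily_growth(id_then, id_now, days_between):
    """Estimate new records per day from two freshly issued IDs seen at different times."""
    return (id_now - id_then) / days_between


# Hypothetical observations, for illustration only.
print(estimate_total([19, 40, 42, 60]))    # 74.0
print(daily_growth(10_000, 13_000, 30))    # 100.0
```

Two sign-ups a month apart are enough to approximate a growth rate, which is precisely the kind of figure a business might prefer to keep to itself.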

In fact Mario Cibelli and company exploited exactly this when analyzing Netflix’s potential. As part of their research they took the various services of the day for a test drive. Among Netflix’s contemporary competitors was Blockbuster, and on one fateful day a Marathon analyst had an “aha!” moment upon realizing that Blockbuster’s DVD return envelopes exposed serial numbers that revealed valuable insights about the size of its customer base and its churn. Marathon not only used this information to rationalize their long position on Netflix but also shared it with Netflix as competitive intelligence. Boom! Pwnd by way of careless usage of surrogate keys. Where are you now, Blockbuster?

How to avoid leaking such valuable intel to adversaries? I commend to your attention the humble UUID, and specifically the random (version 4) variety. Not only does a UUID have the Data Engineering benefit of being a semantically meaningless identifier, which renders it impervious to the vagaries of data evolution, but by way of being random it also provides the Security Engineering benefit of not leaking aggregate information. (Time-based variants such as version 1 embed a timestamp and so do not share this property.) As gravy, consider the additional Data Engineering benefit of improved horizontal scalability: there is no need for a central authority to generate keys, and the resulting keys distribute across shards as nicely as any could.
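Generating such keys takes very little code. The sketch below uses Python's standard uuid module; the new_key helper and the two "nodes" are hypothetical stand-ins for independent services that mint keys without consulting one another.

```python
import uuid


def new_key() -> uuid.UUID:
    """Return a random (version 4) UUID; no central authority is involved."""
    return uuid.uuid4()


# Two hypothetical nodes minting keys independently of each other.
node_a_keys = [new_key() for _ in range(1000)]
node_b_keys = [new_key() for _ in range(1000)]
all_keys = node_a_keys + node_b_keys

# Collisions are astronomically unlikely, so no coordination is needed.
print(len(set(all_keys)) == len(all_keys))

# Every key is version 4, and its string form is 36 characters long.
print(all_keys[0].version)    # 4
print(len(str(all_keys[0])))  # 36
```

Because successive values bear no relation to one another, seeing one key tells an observer nothing about how many others exist or when they were issued.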

Consequently, whenever possible, expose only a single UUID to external observers and leverage that in concert with your private database to reconstruct the larger picture. Doing so will not only harden your posture against adversaries but also lend greater durability to the links you provide to clients.
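One way this might look in practice, again sketched with sqlite3 and a hypothetical account table: the auto-increment key stays private for internal joins, while a separate public_id column holds the UUID that clients see. The create_account and get_account functions stand in for API handlers and are assumptions of mine, not a prescribed design.

```python
import sqlite3
import uuid

# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"  # internal only, never serialized
    " public_id TEXT NOT NULL UNIQUE,"        # the sole identifier exposed to clients
    " name TEXT NOT NULL)"
)


def create_account(name):
    """Insert an account and return only its public UUID."""
    public_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO account (public_id, name) VALUES (?, ?)", (public_id, name)
    )
    return public_id


def get_account(public_id):
    """Look up an account by its public UUID; the integer key never leaves the database."""
    row = conn.execute(
        "SELECT public_id, name FROM account WHERE public_id = ?", (public_id,)
    ).fetchone()
    if row is None:
        return None
    return {"id": row[0], "name": row[1]}


ada_id = create_account("Ada")
print(get_account(ada_id))
```

The client-facing "id" is the UUID, so links built on it remain valid even if the internal keys are later renumbered or the table is split across shards.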

If UUIDs don’t hold a prominent place in your Data Engineering toolbelt then I strongly encourage you to make the upgrade. Your fate may hinge on it more than you know.
