Intro
Imagine how different the following tasks are:
- Generate a monthly report on sales data
- Test a theory against a body of astrophysics data
- Provide an analyst a UI to navigate related data and quickly build ad hoc reports
- Perform analysis on web traffic to modify an advertising campaign on the fly
- Send alerts to members of a social network about their associates’ activity
- Analyze social networks to ascertain relationships and propagation paths
- Indicate in a web shopping cart when an item is likely to ship
- Reserve a seat on a flight
- Reserve a seat at a concert
- Transfer money or property between two or more accounts
- Update a ride sharer on the status of their vehicle
- Manage real-time military command-and-control operations
Figuring out how to wrangle the wildly disparate requirements of such systems is just another day in the life of a Data Engineer. Such an engineer needs principles.
While the specific technologies of the trade come and go, certain approaches transcend time, and many of the biggest challenges are non-technical. This guide attempts to be a compendium of the timeless, yet remain short enough to read in a single sitting, so that it can serve as a jumping-off point for deeper learning, exploration, and regular revisiting.
I hope that this will be a living document. I encourage your feedback.
Principles
I. Stay Grounded
Solve tractable and painful problems iteratively to build momentum.
- Tackle challenges with measurable outcomes.
- Dumpster dive to discover exploitable data.
- Identify the gaps in the data that prevent producing the required information.
- Focus on data quality before system performance.
- Lean heavily on existing capabilities while bootstrapping.
- Mold an infrastructure iteratively as you learn the requirements.
II. Preserve Knowledge
Ensure durability, traceability, and immutability to maintain integrity and confidence.
- Start by durably persisting data unmodified with associated meta-data.
- Do not accept responsibility for data until it has been durably persisted.
- Attach a UUID to every object and use that as the sole way to reference it.
- Attach meta-data that documents everything pertinent about an object’s origins.
- Treat all data as immutable, supporting its modification through versioning.
- Retain data as long as is legal, safe, practical, and ethical.
- Establish a clear Source System of Record for every kind of data in your enterprise.
- Maintain an immutable record of your system’s dispositions and actions over time.
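A minimal sketch of several of these points together, assuming Python and illustrative field names: every object is minted with a UUID and provenance meta-data, and is never modified in place, only superseded by a new version.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)  # frozen: instances are immutable once created
class Record:
    payload: bytes                       # the raw data, persisted unmodified
    source: str                          # Source System of Record identifier
    received_at: str                     # when we accepted responsibility for it
    record_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    previous_version: str | None = None  # UUID of the record this one supersedes

def revise(old: Record, new_payload: bytes) -> Record:
    """"Modify" a record by minting a new immutable version that
    references its predecessor; the original is never overwritten."""
    return Record(
        payload=new_payload,
        source=old.source,
        received_at=datetime.now(timezone.utc).isoformat(),
        previous_version=old.record_id,
    )
```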
III. Limit Complexity
Maintain loose coupling between systems to sustain delivery velocity and reliability.
- Maintain a central registry for data schemas, event streams, and API endpoints.
- Normalize data to explicit schemas before publishing for general consumption.
- Evolve the schemas of interfaces with backward and/or forward compatibility.
- Separate logical phases of processing with a message bus that supports queueing.
- Publish events to a message bus that facilitates reliable dissemination and replay.
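As a hedged illustration of explicit, compatibly evolving schemas, here is a sketch assuming a JSON event envelope; `publish` is a hypothetical stand-in for your message bus client.

```python
import json
import uuid

def publish(topic: str, message: bytes) -> None:
    """Stand-in for a real message-bus client (e.g., a Kafka producer)."""
    print(f"{topic}: {message.decode()}")

def make_order_event(order_id: str, amount_cents: int,
                     currency: str = "USD") -> bytes:
    # The envelope names its schema and version explicitly, so consumers
    # dispatch on it rather than guessing at shape. Evolving v1 -> v2 by
    # *adding* the optional `currency` field keeps old consumers working
    # (backward compatible); consumers that ignore unknown fields
    # tolerate newer producers as well (forward compatible).
    event = {
        "schema": "orders.OrderPlaced",
        "schema_version": 2,
        "event_id": str(uuid.uuid4()),
        "order_id": order_id,
        "amount_cents": amount_cents,
        "currency": currency,
    }
    return json.dumps(event).encode()

publish("orders", make_order_event("o-123", 4999))
```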
IV. Innovate Sustainably
Make data accessible but track its usage to encourage convergence.
- Accumulate data in warehouses that facilitate efficient interactive exploration.
- Encourage exploratory activities but continually systematize them.
- Provide data to exploratory activities via mechanisms that track which activities consume it.
- Run exploratory initiatives in sandboxes to constrain retention and redistribution.
- Encourage exploratory activities to publish their code for requirements analysis.
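One sketch of providing data via mechanisms that track its consumers: a thin accessor that records every exploratory read in a usage registry. The registry and function names are illustrative, not a real API.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Illustrative in-memory registry; a real one would be a durable store.
usage_registry: dict[str, list[dict]] = defaultdict(list)

def read_dataset(dataset: str, consumer: str) -> str:
    """Hand out data only through a door that logs who walked through it."""
    usage_registry[dataset].append({
        "consumer": consumer,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return f"rows of {dataset}"  # placeholder for the actual data handle

read_dataset("web_traffic_daily", consumer="ads-experiment-sandbox")
print(usage_registry["web_traffic_daily"])
```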
V. Expect Chaos
Design resiliency into systems at every level to preserve integrity and availability.
- Process events idempotently and independently of arrival ordering and delays.
- Process events in a transactional or eventually consistent fashion as appropriate.
- Process events such that a re-processing corrects an earlier mis-processing.
- Choose wisely between consistency and availability when network partitions occur.
- Process events in a fashion that anticipates the loss of a worker mid-job.
- Establish trigger- and delay-based patterns to reattempt event processing.
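A minimal sketch of idempotent, order-tolerant processing, assuming every event carries a UUID and a timestamp: duplicate deliveries become no-ops, and a late-arriving older event never clobbers newer state, so retries and re-processing are safe.

```python
processed_ids: set[str] = set()          # dedupe ledger (durable in real life)
state: dict[str, tuple[str, str]] = {}   # key -> (value, event_timestamp)

def handle(event: dict) -> None:
    """Apply an event idempotently, regardless of arrival order."""
    if event["event_id"] in processed_ids:
        return  # duplicate delivery: already applied, do nothing
    current = state.get(event["key"])
    # Last-writer-wins by event time, so a late-arriving older event
    # cannot overwrite newer state.
    if current is None or event["ts"] > current[1]:
        state[event["key"]] = (event["value"], event["ts"])
    processed_ids.add(event["event_id"])

# Out-of-order and duplicate deliveries converge to the same state.
handle({"event_id": "b", "key": "k", "value": "new", "ts": "2024-01-02"})
handle({"event_id": "a", "key": "k", "value": "old", "ts": "2024-01-01"})
handle({"event_id": "b", "key": "k", "value": "new", "ts": "2024-01-02"})
print(state)  # {'k': ('new', '2024-01-02')}
```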
VI. Succeed Gracefully
Separate discrete workloads and establish appropriate patterns to scale effectively.
- Avoid collocating data that changes quickly with data that changes slowly.
- Leverage immutable self-contained documents when practical to stay simple.
- Favor stateless processing where possible to facilitate elastic scalability.
- Micro-batch timely stateful processing where eventual consistency is acceptable.
- Use sticky routing or optimistic locking for real-time, consistent, stateful processing.
- Leverage ACID-compliant technologies where transient inconsistency is intolerable.
- Isolate different workloads to prevent cache bulldozing and storage fragmentation.
- Isolate workloads of different SLAs/SLOs to prevent load-shock spill-over.
- Leverage evented query patterns for high-volume/high-latency external API calls.
- Provide queue-based APIs to maintain control of parallelism, time-outs, and QoS.
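The optimistic-locking bullet above, sketched against an in-memory stand-in for a table with a version column; in a real datastore the check-and-write must be a single atomic operation.

```python
# key -> (value, version); stands in for a row with a version column
store: dict[str, tuple[int, int]] = {"seat-count": (0, 0)}

def update_with_retry(key: str, delta: int, max_attempts: int = 5) -> bool:
    """Optimistic concurrency: read, compute, then commit only if the row's
    version is unchanged. In a real datastore the check-and-write below must
    be one atomic operation, e.g. UPDATE ... WHERE version = :read_version."""
    for _ in range(max_attempts):
        value, version = store[key]          # read current value and version
        new_value = value + delta            # compute against what we read
        if store[key][1] == version:         # did anyone write in between?
            store[key] = (new_value, version + 1)
            return True                      # commit succeeded
        # lost the race: loop and retry against fresh state
    return False

update_with_retry("seat-count", delta=1)
print(store["seat-count"])  # (1, 1)
```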
VII. Stay Frosty
Weave telemetry generation into all sub-systems and monitor behavior proactively.
- Model tasks granularly, emit standardized telemetry, and centralize its storage.
- Record current time, task state, triggering conditions, and queue and service time.
- Contextualize with code version, container lifetime, and container placement.
- Index stored telemetry to support breadcrumb trail exploration for forensics work.
- Leverage standardized telemetry to auto-generate metrics and alerts.
- Baseline behavior and track trends to control costs and manage capacity.
- Maintain awareness of outlier behavior to stay within your SLAs/SLOs.
- Pump synthetic data through all of your flows to prevent a false sense of security.
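A sketch of standardized task telemetry, with hypothetical field names: every task emits one structured record capturing state, trigger, queue and service time, and placement context, so metrics and alerts can be generated uniformly downstream.

```python
import json
import os
import time

CODE_VERSION = os.environ.get("GIT_SHA", "dev")  # stamped at deploy time

def emit_task_telemetry(task: str, trigger: str,
                        enqueued_at: float, started_at: float,
                        state: str) -> None:
    """Emit one standardized record per task; centralized storage and
    indexing of these records is assumed to happen downstream."""
    now = time.time()
    record = {
        "task": task,
        "state": state,                      # e.g. succeeded / failed / retried
        "trigger": trigger,                  # what caused this task to run
        "queue_seconds": started_at - enqueued_at,
        "service_seconds": now - started_at,
        "emitted_at": now,
        "code_version": CODE_VERSION,
        "host": os.environ.get("HOSTNAME", "unknown"),  # container placement
    }
    print(json.dumps(record))  # stand-in for shipping to central storage

enqueued = time.time()
started = enqueued + 0.05
emit_task_telemetry("nightly-report", "cron", enqueued, started, "succeeded")
```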
VIII. Assuage Regulators
Integrate and index meta-data early to serve diverse compliance requirements.
- Attach standard security and compliance meta-data to every object.
- Authenticate recipients of data and ensure they possess adequate credentials.
- Include and index age-off meta-data to comply with data retention requirements.
- Establish protocols to recall inappropriately collected or incorrectly processed data.
- Maintain logs of all queries/actions and the ability to reconstruct answers and state.
- Ponder whether your software may eventually run in a different legal jurisdiction.
- Know that following the law is not enough to keep customers and regulators happy.
- Disentangle certification and accreditation to foster rigor _and_ efficiency.
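A hedged sketch of age-off meta-data, with illustrative field names: each object declares at ingest when it must be deleted, and a periodic sweep enforces retention.

```python
from datetime import datetime, timedelta, timezone

def with_compliance_metadata(obj: dict, classification: str,
                             retain_days: int) -> dict:
    """Attach standard security and retention meta-data at ingest."""
    now = datetime.now(timezone.utc)
    obj["_meta"] = {
        "classification": classification,    # drives access control
        "collected_at": now.isoformat(),
        "age_off_at": (now + timedelta(days=retain_days)).isoformat(),
    }
    return obj

def sweep(objects: list[dict]) -> list[dict]:
    """Retention enforcement: keep only objects not yet due for age-off."""
    now = datetime.now(timezone.utc).isoformat()
    return [o for o in objects if o["_meta"]["age_off_at"] > now]

records = [with_compliance_metadata({"id": 1}, "internal", retain_days=90)]
print(len(sweep(records)))  # 1, until the retention window lapses
```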
IX. Defend Proactively
Bake security into your system early on to avoid embarrassment and expensive refactors.
- Keep components tightly focused to reduce attack surface.
- Keep components minimally permissioned to limit their blast radius.
- Separate ingestion, normalization, analysis, and action into discrete components.
- Grant discrete permissions for configuration control, app deployment, and app operation.
- Build data stores that support fine-grained access based on security labeling.
- Today’s LAN is tomorrow’s SDN. Encrypt all traffic to avoid getting burned.
- Combine physical security, encryption, and limited admin access for data at rest.
- Establish an authentication framework that will prove both secure and scalable.
- Avoid static credentials wherever possible and be mindful of where tokens end up.
- Favor short-lived, minimally equipped, dynamically credentialed processes.
- Enforce authorization at every point in the chain of custody.
- Catalog and control 3rd party software in your stack. Monitor it for vulnerabilities.
- You will get compromised. Aim to impede surveillance, spreading, and infil/exfil.
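One sketch of enforcing authorization at every point in the chain of custody: each internal operation re-checks the caller's entitlements rather than trusting that an upstream gateway already did. The entitlement model here is hypothetical.

```python
import functools

# Hypothetical entitlements store; in practice, short-lived signed claims.
ENTITLEMENTS = {"svc-reporting": {"read:sales"}}

def requires(permission: str):
    """Decorator that re-checks authorization at this hop rather than
    assuming an upstream component already enforced it."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(caller: str, *args, **kwargs):
            if permission not in ENTITLEMENTS.get(caller, set()):
                raise PermissionError(f"{caller} lacks {permission}")
            return fn(caller, *args, **kwargs)
        return wrapper
    return decorator

@requires("read:sales")
def read_sales(caller: str) -> str:
    return "sales rows"

print(read_sales("svc-reporting"))   # authorized
# read_sales("svc-unknown")          # would raise PermissionError
```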
X. Nurture Trust
Set clear expectations and monitor associated metrics closely to maintain goodwill.
- Empathize with your users on what it would mean for your system to fail.
- Clearly communicate SLAs to consumers to manage expectations.
- Maintain SLOs with enough headroom to make failing to meet SLAs unlikely.
- Keep your eye on outlier performance and understand its impact.
- Support QoS facilities to protect and prioritize critical traffic.
- Proactively communicate integrity and performance issues to engender confidence.
- Maintain the tooling to quickly and precisely tell customers what went wrong.
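A small sketch of SLO headroom: the internal objective sits deliberately below the promised SLA, so an alert on an SLO breach fires while the customer-facing promise is still intact. The thresholds and percentile math are illustrative.

```python
import math

SLA_P99_MS = 500   # promised to consumers
SLO_P99_MS = 350   # internal objective, with headroom below the SLA

def p99(latencies_ms: list[float]) -> float:
    """Nearest-rank 99th percentile of observed latencies."""
    ordered = sorted(latencies_ms)
    rank = max(0, math.ceil(0.99 * len(ordered)) - 1)
    return ordered[rank]

def check(latencies_ms: list[float]) -> None:
    observed = p99(latencies_ms)
    if observed > SLO_P99_MS:
        # SLO breach: act now, while the SLA is still intact.
        print(f"ALERT: p99={observed:.0f}ms exceeds SLO {SLO_P99_MS}ms")
    else:
        print(f"OK: p99={observed:.0f}ms within SLO")

check([120.0] * 98 + [400.0, 410.0])  # tail breaches the SLO, not the SLA
```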
XI. Foster Repeatability
Move fast _safely_ with systematized, phased, modular deployment mechanisms.
- Ensure that system clones can be constructed from scratch fully automatically.
- Strive never to touch production systems directly.
- Treat configurations and migrations with the same rigor as application code.
- Avoid massive, single-release, manually intensive migrations like the plague.
- Pin the versions of your dependencies and test them like any other code.
- Pre-stage new storage engines and run comparisons in production before going hot.
- Make rollback a practical exercise for every release you perform.
- Use feature flags to prevent release pipeline traffic jams and enhance safety.
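The feature-flag bullet, sketched: new behavior ships dark behind a flag, decoupling deployment from release and making "rollback" of the feature a configuration change rather than a redeploy. The flag store here is a stand-in for a dynamic flag service.

```python
# Stand-in for a dynamic flag service; flipping a flag requires no deploy.
FLAGS = {"new-report-engine": False}

def flag_enabled(name: str) -> bool:
    return FLAGS.get(name, False)  # default off: unknown flags stay dark

def generate_report(data: list[int]) -> int:
    if flag_enabled("new-report-engine"):
        return sum(data)              # new path, shipped dark until enabled
    return len(data)                  # old path remains the live behavior

print(generate_report([1, 2, 3]))     # 3: old path while the flag is off
FLAGS["new-report-engine"] = True     # "release" without a redeploy
print(generate_report([1, 2, 3]))     # 6: new path, instantly revertible
```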
XII. Be Unkillable
Eschew single points of failure and test recovery procedures to prevent catastrophe.
- Deploy your system in a way that leverages multiple availability zones.
- Ensure that your source data is stored in a highly durable and available fashion.
- Make it easy to reconstruct a state store by re-processing source data.
- Regularly test the loss of system elements and observe system behavior.
- Regularly rebuild infra and data stores from scratch to validate the process.
- Employ multi-party control for the most sensitive operations.
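A sketch of reconstructing a state store from retained source data: if the durably stored events are the system of record, any derived store can be rebuilt by replaying them, which is exactly what a recovery drill should exercise.

```python
def rebuild_state(source_events: list[dict]) -> dict[str, int]:
    """Reconstruct a derived store purely from retained source data.
    Losing the derived store is then an inconvenience, not a disaster."""
    state: dict[str, int] = {}
    for event in source_events:  # replay in order from the beginning
        state[event["account"]] = state.get(event["account"], 0) + event["delta"]
    return state

events = [
    {"account": "a", "delta": 100},
    {"account": "b", "delta": 250},
    {"account": "a", "delta": -40},
]
print(rebuild_state(events))  # {'a': 60, 'b': 250}
```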
Author's Note
Data Principles represents an attempt by Andrew W. Gibbs to codify the lessons gleaned through years of wrangling messy data-intensive problems.
This compendium of knowledge would likely not have emerged, and certainly not have been as approachable, without the encouragement and counsel of his perennial colleague Aaron Zollman.