Intro
Imagine how different the following tasks are:
- Generate a monthly report on sales data
- Test a theory against a body of astrophysics data
- Provide an analyst a UI to navigate related data and quickly build ad hoc reports
- Perform analysis on web traffic to modify an advertising campaign on the fly
- Send alerts to members of a social network about their associates’ activity
- Analyze social networks to ascertain relationships and propagation paths
- Indicate in a web shopping cart when an item is likely to ship
- Reserve a seat on a flight
- Reserve a seat at a concert
- Transfer money or property between two or more accounts
- Update a ride sharer on the status of their vehicle
- Manage real-time military command-and-control operations
Figuring out how to wrangle the wildly disparate requirements of such systems is just another day in the life of a Data Engineer. Such an engineer needs principles.
While the specific technologies of the trade come and go, certain approaches transcend time, and many of the biggest challenges are non-technical. This guide attempts to be a compendium of the timeless, yet remain short enough to read in a single sitting, so that it can serve as a jumping-off point for deeper learning, exploration, and regular revisiting.
I hope that this will be a living document. I encourage your feedback.
Principles
I. Stay Grounded
Solve tractable and painful problems iteratively to build momentum.
- Tackle challenges with measurable outcomes.
- Dumpster dive to discover exploitable data.
- Identify the gaps in the data that prevent producing the required information.
- Focus on data quality before system performance.
- Lean heavily on existing capabilities while bootstrapping.
- Mold an infrastructure iteratively as you learn the requirements.
II. Preserve Knowledge
Ensure durability, traceability, and immutability to maintain integrity and confidence.
- Start by durably persisting data unmodified with associated meta-data.
- Do not accept responsibility for data until it has been durably persisted.
- Attach a UUID to every object and use that as the sole way to reference it.
- Attach meta-data that documents everything pertinent about an object’s origins.
- Treat all data as immutable, supporting its modification through versioning.
- Retain data as long as is legal, safe, practical, and ethical.
- Establish a clear Source System of Record for every kind of data in your enterprise.
- Maintain an immutable record of your system’s dispositions and actions over time.
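A minimal sketch of several of these points together, assuming Python and illustrative field names: every object is minted with a UUID and provenance meta-data, and is never modified in place, only superseded by a new version.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)  # frozen: instances are immutable once created
class Record:
    payload: bytes                       # the raw data, persisted unmodified
    source: str                          # Source System of Record identifier
    received_at: str                     # when we accepted responsibility for it
    record_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    previous_version: str | None = None  # UUID of the record this one supersedes

def revise(old: Record, new_payload: bytes) -> Record:
    """"Modify" a record by minting a new immutable version that
    references its predecessor; the original is never overwritten."""
    return Record(
        payload=new_payload,
        source=old.source,
        received_at=datetime.now(timezone.utc).isoformat(),
        previous_version=old.record_id,
    )
```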
III. Limit Complexity
Maintain loose coupling between systems to sustain delivery velocity and reliability.
- Maintain a central registry for data schemas, event streams, and API endpoints.
- Normalize data to explicit schemas before publishing for general consumption.
- Evolve the schemas of interfaces with backward and/or forward compatibility.
- Separate logical phases of processing with a message bus that supports queueing.
- Publish events to a message bus that facilitates reliable dissemination and replay.
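As a hedged illustration of explicit, compatibly evolving schemas, here is a sketch assuming a JSON event envelope; `publish` is a hypothetical stand-in for your message bus client.

```python
import json
import uuid

def publish(topic: str, message: bytes) -> None:
    """Stand-in for a real message-bus client (e.g., a Kafka producer)."""
    print(f"{topic}: {message.decode()}")

def make_order_event(order_id: str, amount_cents: int,
                     currency: str = "USD") -> bytes:
    # The envelope names its schema and version explicitly, so consumers
    # dispatch on it rather than guessing at shape. Evolving v1 -> v2 by
    # *adding* the optional `currency` field keeps old consumers working
    # (backward compatible); consumers that ignore unknown fields
    # tolerate newer producers as well (forward compatible).
    event = {
        "schema": "orders.OrderPlaced",
        "schema_version": 2,
        "event_id": str(uuid.uuid4()),
        "order_id": order_id,
        "amount_cents": amount_cents,
        "currency": currency,
    }
    return json.dumps(event).encode()

publish("orders", make_order_event("o-123", 4999))
```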
IV. Innovate Sustainably
Make data accessible but track its usage to encourage convergence.
- Accumulate data in warehouses that facilitate efficient interactive exploration.
- Encourage exploratory activities but continually systematize them.
- Provide data to exploratory activities via mechanisms that track which activities consume it.
- Run exploratory initiatives in sandboxes to constrain retention and redistribution.
- Encourage exploratory activities to publish their code for requirements analysis.
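One sketch of providing data via mechanisms that track its consumers: a thin accessor that records every exploratory read in a usage registry. The registry and function names are illustrative, not a real API.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Illustrative in-memory registry; a real one would be a durable store.
usage_registry: dict[str, list[dict]] = defaultdict(list)

def read_dataset(dataset: str, consumer: str) -> str:
    """Hand out data only through a door that logs who walked through it."""
    usage_registry[dataset].append({
        "consumer": consumer,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return f"rows of {dataset}"  # placeholder for the actual data handle

read_dataset("web_traffic_daily", consumer="ads-experiment-sandbox")
print(usage_registry["web_traffic_daily"])
```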
V. Expect Chaos
Design resiliency into systems at every level to preserve integrity and availability.
- Process events idempotently and independently of arrival ordering and delays.
- Process events in a transactional or eventually consistent fashion as appropriate.
- Process events such that a re-processing corrects an earlier mis-processing.
- Choose wisely between consistency and availability when network partitions occur.
- Process events in a fashion that anticipates the loss of a worker mid-job.
- Establish trigger- and delay-based patterns to reattempt event processing.
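A minimal sketch of idempotent, order-tolerant processing, assuming every event carries a UUID and a timestamp: duplicate deliveries become no-ops, and a late-arriving older event never clobbers newer state, so retries and re-processing are safe.

```python
processed_ids: set[str] = set()          # dedupe ledger (durable in real life)
state: dict[str, tuple[str, str]] = {}   # key -> (value, event_timestamp)

def handle(event: dict) -> None:
    """Apply an event idempotently, regardless of arrival order."""
    if event["event_id"] in processed_ids:
        return  # duplicate delivery: already applied, do nothing
    current = state.get(event["key"])
    # Last-writer-wins by event time, so a late-arriving older event
    # cannot overwrite newer state.
    if current is None or event["ts"] > current[1]:
        state[event["key"]] = (event["value"], event["ts"])
    processed_ids.add(event["event_id"])

# Out-of-order and duplicate deliveries converge to the same state.
handle({"event_id": "b", "key": "k", "value": "new", "ts": "2024-01-02"})
handle({"event_id": "a", "key": "k", "value": "old", "ts": "2024-01-01"})
handle({"event_id": "b", "key": "k", "value": "new", "ts": "2024-01-02"})
print(state)  # {'k': ('new', '2024-01-02')}
```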
VI. Succeed Gracefully
Separate discrete workloads and establish appropriate patterns to scale effectively.
- Avoid collocating data that changes quickly with data that changes slowly.
- Leverage immutable self-contained documents when practical to stay simple.
- Favor stateless processing where possible to facilitate elastic scalability.
- Micro-batch timely stateful processing where eventual consistency is acceptable.
- Use sticky routing or optimistic locking for real-time, consistent, stateful processing.
- Leverage ACID-compliant technologies where transient inconsistency is intolerable.
- Isolate different workloads to prevent cache bulldozing and storage fragmentation.
- Isolate workloads of different SLAs/SLOs to prevent load-shock spill-over.
- Leverage evented query patterns for high-volume/high-latency external API calls.
- Provide queue-based APIs to maintain control of parallelism, time-outs, and QoS.
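The optimistic-locking bullet above, sketched against an in-memory stand-in for a table with a version column; in a real datastore the check-and-write must be a single atomic operation.

```python
# key -> (value, version); stands in for a row with a version column
store: dict[str, tuple[int, int]] = {"seat-count": (0, 0)}

def update_with_retry(key: str, delta: int, max_attempts: int = 5) -> bool:
    """Optimistic concurrency: read, compute, then commit only if the row's
    version is unchanged. In a real datastore the check-and-write below must
    be one atomic operation, e.g. UPDATE ... WHERE version = :read_version."""
    for _ in range(max_attempts):
        value, version = store[key]          # read current value and version
        new_value = value + delta            # compute against what we read
        if store[key][1] == version:         # did anyone write in between?
            store[key] = (new_value, version + 1)
            return True                      # commit succeeded
        # lost the race: loop and retry against fresh state
    return False

update_with_retry("seat-count", delta=1)
print(store["seat-count"])  # (1, 1)
```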
VII. Stay Frosty
Weave telemetry generation into all sub-systems and monitor behavior proactively.
- Model tasks granularly, emit standardized telemetry, and centralize its storage.
- Record current time, task state, triggering conditions, and queue and service time.
- Contextualize with code version, container lifetime, and container placement.
- Index stored telemetry to support breadcrumb trail exploration for forensics work.
- Leverage standardized telemetry to auto-generate metrics and alerts.
- Baseline behavior and track trends to control costs and manage capacity.
- Maintain awareness of outlier behavior to stay within your SLAs/SLOs.
- Pump synthetic data through all of your flows to prevent a false sense of security.
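A sketch of standardized task telemetry, with hypothetical field names: every task emits one structured record capturing state, trigger, queue and service time, and placement context, so metrics and alerts can be generated uniformly downstream.

```python
import json
import os
import time

CODE_VERSION = os.environ.get("GIT_SHA", "dev")  # stamped at deploy time

def emit_task_telemetry(task: str, trigger: str,
                        enqueued_at: float, started_at: float,
                        state: str) -> None:
    """Emit one standardized record per task; centralized storage and
    indexing of these records is assumed to happen downstream."""
    now = time.time()
    record = {
        "task": task,
        "state": state,                      # e.g. succeeded / failed / retried
        "trigger": trigger,                  # what caused this task to run
        "queue_seconds": started_at - enqueued_at,
        "service_seconds": now - started_at,
        "emitted_at": now,
        "code_version": CODE_VERSION,
        "host": os.environ.get("HOSTNAME", "unknown"),  # container placement
    }
    print(json.dumps(record))  # stand-in for shipping to central storage

enqueued = time.time()
started = enqueued + 0.05
emit_task_telemetry("nightly-report", "cron", enqueued, started, "succeeded")
```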
VIII. Assuage Regulators
Integrate and index meta-data early to serve diverse compliance requirements.
- Attach standard security and compliance meta-data to every object.
- Authenticate recipients of data and ensure they possess adequate credentials.
- Include and index age-off meta-data to comply with data retention requirements.
- Establish protocols to recall inappropriately collected or incorrectly processed data.
- Maintain logs of all queries/actions and the ability to reconstruct answers and state.
- Ponder whether your software may eventually run in a different legal jurisdiction.
- Know that following the law is not enough to keep customers and regulators happy.
- Disentangle certification and accreditation to foster rigor _and_ efficiency.
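A hedged sketch of age-off meta-data, with illustrative field names: each object declares at ingest when it must be deleted, and a periodic sweep enforces retention.

```python
from datetime import datetime, timedelta, timezone

def with_compliance_metadata(obj: dict, classification: str,
                             retain_days: int) -> dict:
    """Attach standard security and retention meta-data at ingest."""
    now = datetime.now(timezone.utc)
    obj["_meta"] = {
        "classification": classification,    # drives access control
        "collected_at": now.isoformat(),
        "age_off_at": (now + timedelta(days=retain_days)).isoformat(),
    }
    return obj

def sweep(objects: list[dict]) -> list[dict]:
    """Retention enforcement: keep only objects not yet due for age-off."""
    now = datetime.now(timezone.utc).isoformat()
    return [o for o in objects if o["_meta"]["age_off_at"] > now]

records = [with_compliance_metadata({"id": 1}, "internal", retain_days=90)]
print(len(sweep(records)))  # 1, until the retention window lapses
```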
IX. Defend Proactively
Bake security into your system early on to avoid embarrassment and expensive refactors.
- Keep components tightly focused to reduce attack surface.
- Keep components minimally permissioned to limit their blast radius.
- Separate ingestion, normalization, analysis, and action into discrete components.
- Grant discrete permissions for configuration control, app deployment, and app operation.
- Build data stores that support fine-grained access based on security labeling.
- Today’s LAN is tomorrow’s SDN. Encrypt all traffic to avoid getting burned.
- Combine physical security, encryption, and limited admin access for data at rest.
- Establish an authentication framework that will prove both secure and scalable.
- Avoid static credentials wherever possible and be mindful of where tokens end up.
- Favor short-lived, minimally equipped, dynamically credentialed processes.
- Enforce authorization at every point in the chain of custody.
- Catalog and control 3rd party software in your stack. Monitor it for vulnerabilities.
- You will get compromised. Aim to impede surveillance, spreading, and infil/exfil.
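One sketch of enforcing authorization at every point in the chain of custody: each internal operation re-checks the caller's entitlements rather than trusting that an upstream gateway already did. The entitlement model here is hypothetical.

```python
import functools

# Hypothetical entitlements store; in practice, short-lived signed claims.
ENTITLEMENTS = {"svc-reporting": {"read:sales"}}

def requires(permission: str):
    """Decorator that re-checks authorization at this hop rather than
    assuming an upstream component already enforced it."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(caller: str, *args, **kwargs):
            if permission not in ENTITLEMENTS.get(caller, set()):
                raise PermissionError(f"{caller} lacks {permission}")
            return fn(caller, *args, **kwargs)
        return wrapper
    return decorator

@requires("read:sales")
def read_sales(caller: str) -> str:
    return "sales rows"

print(read_sales("svc-reporting"))   # authorized
# read_sales("svc-unknown")          # would raise PermissionError
```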
X. Nurture Trust
Set clear expectations and monitor associated metrics closely to maintain goodwill.
- Empathize with your users on what it would mean for your system to fail.
- Clearly communicate SLAs to consumers to manage expectations.
- Maintain SLOs with enough headroom to make failing to meet SLAs unlikely.
- Keep your eye on outlier performance and understand its impact.
- Support QoS facilities to protect and prioritize critical traffic.
- Proactively communicate integrity and performance issues to engender confidence.
- Maintain the tooling to quickly and precisely tell customers what went wrong.
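A small sketch of SLO headroom: the internal objective sits deliberately below the promised SLA, so an alert on an SLO breach fires while the customer-facing promise is still intact. The thresholds and percentile math are illustrative.

```python
import math

SLA_P99_MS = 500   # promised to consumers
SLO_P99_MS = 350   # internal objective, with headroom below the SLA

def p99(latencies_ms: list[float]) -> float:
    """Nearest-rank 99th percentile of observed latencies."""
    ordered = sorted(latencies_ms)
    rank = max(0, math.ceil(0.99 * len(ordered)) - 1)
    return ordered[rank]

def check(latencies_ms: list[float]) -> None:
    observed = p99(latencies_ms)
    if observed > SLO_P99_MS:
        # SLO breach: act now, while the SLA is still intact.
        print(f"ALERT: p99={observed:.0f}ms exceeds SLO {SLO_P99_MS}ms")
    else:
        print(f"OK: p99={observed:.0f}ms within SLO")

check([120.0] * 98 + [400.0, 410.0])  # tail breaches the SLO, not the SLA
```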
XI. Foster Repeatability
Move fast _safely_ with systematized, phased, modular deployment mechanisms.
- Ensure that system clones can be constructed from scratch fully automatically.
- Strive never to touch production systems directly.
- Treat configurations and migrations with the same rigor as application code.
- Avoid massive, single-release, manually intensive migrations like the plague.
- Pin the versions of your dependencies and test them like any other code.
- Pre-stage new storage engines and run comparisons in production before going hot.
- Make rollback a practical exercise for every release you perform.
- Use feature flags to prevent release pipeline traffic jams and enhance safety.
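The feature-flag bullet, sketched: new behavior ships dark behind a flag, decoupling deployment from release and making "rollback" of the feature a configuration change rather than a redeploy. The flag store here is a stand-in for a dynamic flag service.

```python
# Stand-in for a dynamic flag service; flipping a flag requires no deploy.
FLAGS = {"new-report-engine": False}

def flag_enabled(name: str) -> bool:
    return FLAGS.get(name, False)  # default off: unknown flags stay dark

def generate_report(data: list[int]) -> int:
    if flag_enabled("new-report-engine"):
        return sum(data)              # new path, shipped dark until enabled
    return len(data)                  # old path remains the live behavior

print(generate_report([1, 2, 3]))     # 3: old path while the flag is off
FLAGS["new-report-engine"] = True     # "release" without a redeploy
print(generate_report([1, 2, 3]))     # 6: new path, instantly revertible
```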
XII. Be Unkillable
Eschew single points of failure and test recovery procedures to prevent catastrophe.
- Deploy your system in a way that leverages multiple availability zones.
- Ensure that your source data is stored in a highly durable and available fashion.
- Make it easy to reconstruct a state store by re-processing source data.
- Regularly test the loss of system elements and observe system behavior.
- Regularly rebuild infra and data stores from scratch to validate the process.
- Employ multi-party control for the most sensitive operations.
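A sketch of reconstructing a state store from retained source data: if the durably stored events are the system of record, any derived store can be rebuilt by replaying them, which is exactly what a recovery drill should exercise.

```python
def rebuild_state(source_events: list[dict]) -> dict[str, int]:
    """Reconstruct a derived store purely from retained source data.
    Losing the derived store is then an inconvenience, not a disaster."""
    state: dict[str, int] = {}
    for event in source_events:  # replay in order from the beginning
        state[event["account"]] = state.get(event["account"], 0) + event["delta"]
    return state

events = [
    {"account": "a", "delta": 100},
    {"account": "b", "delta": 250},
    {"account": "a", "delta": -40},
]
print(rebuild_state(events))  # {'a': 60, 'b': 250}
```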
Author's Note
Data Principles represents an attempt by Andrew W. Gibbs to codify the lessons gleaned through years of wrangling messy data-intensive problems.
This compendium of knowledge would likely not have emerged, and certainly not have been as approachable, without the encouragement and counsel of his perennial colleague Aaron Zollman.