A gentle introduction to data contracts
Production services are no longer just generating data as a byproduct. Instead, data is the product and should be treated as such. This is the argument that data contracts make.
They have been a hot topic recently, with Chad Sanderson of Convoy and Andrew Jones of GoCardless both writing lengthy blog posts cheerleading their usage. But are they actually worth building? In this blog post, we explore what they are, how they can be implemented, and their pros and cons.
What is a data contract?
Data contracts are API-like agreements between data producers and data consumers. Their goal is to export high quality data that is resilient to change.
In the data contract paradigm, instead of dumping data generated by production services into data warehouses, service owners decide which data to expose to consumers. Then they expose it in an agreed-upon, structured fashion, similar to an API endpoint.
As a result, responsibility for data quality shifts from the data scientist/analyst to the software engineer.
Example of a data contract
Imagine a rideshare application. Production microservices write into the "rides", "payments", "customers", and "trip request" tables in the database. Over time, these schemas evolve as the business runs promos and expands into different markets.
With no action taken, these production tables eventually end up in a data warehouse. Subsequently, any machine learning engineer or data engineer consuming the analogous tables in the data warehouse has to rewrite data transformations upon schema changes.
With data contracts, data analysts and scientists don’t consume near-raw tables in data warehouses. Instead, they consume from an API that has already munged the data and produced a human-readable event, like a “trip request.” The trip request metadata is attached to the event (pricing, whether surge pricing applied, promo, payment details, reviews).
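To make this concrete, here is a minimal sketch of what such a trip request event could look like in Python. The class name `TripRequested` and every field name are assumptions made for illustration; they are not taken from any real rideshare service.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class TripRequested:
    """Hypothetical contract event; all field names are illustrative."""

    trip_id: str
    customer_id: str
    requested_at: datetime
    price_cents: int
    surge_pricing: bool
    promo_code: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject values that would violate the contract before publishing.
        if self.price_cents < 0:
            raise ValueError("price_cents must be non-negative")
        if self.requested_at.tzinfo is None:
            raise ValueError("requested_at must be timezone-aware")


event = TripRequested(
    trip_id="trip-001",
    customer_id="cust-042",
    requested_at=datetime(2023, 1, 15, 8, 30, tzinfo=timezone.utc),
    price_cents=1850,
    surge_pricing=False,
)

# Consumers receive one flat, documented record instead of joining raw tables.
payload = asdict(event)
```

The point of the sketch is that consumers see a single semantic event, while the tables behind it remain free to change.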
Pros of data contracts
1. Consumers of data don’t have to worry about recreating the business logic that generated it
The current ELT model, where data is dumped into data warehouses and then transformed in massive joins across different tables, replicates the business logic of the production services that generated the data in the first place.
Data contracts, on the other hand, expose semantic events that are not tied to the transactional database. They should remain compatible as the transaction database evolves. Downstream users no longer need to maintain matching logic and data models.
2. Since it’s a strongly defined schema, you can document it, version it, and have CI/CD on it
Schemas aren’t just items in Google Docs. They’re usually defined in JSON, Protobuf, or some other schema language that can be checked in on GitHub, code reviewed, and gate-kept with CI/CD. This brings a level of transparency and standardization that was previously impossible to maintain.
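As a rough illustration of what a CI gate on a schema might check, the sketch below compares two versions of a contract and lists changes that would break existing consumers. The flat "field name to type name" format and the function `breaking_changes` are assumptions for this example, not a real tool's API.

```python
from typing import Dict, List

# Simplified, hypothetical schema format: field name -> type name.
Schema = Dict[str, str]


def breaking_changes(old: Schema, new: Schema) -> List[str]:
    """List changes that would break consumers relying on `old`."""
    problems = []
    for field, old_type in old.items():
        if field not in new:
            problems.append(f"removed field: {field}")
        elif new[field] != old_type:
            problems.append(
                f"type changed: {field} ({old_type} -> {new[field]})"
            )
    # Fields that exist only in `new` are additive, so they are allowed.
    return problems


v1 = {"trip_id": "string", "price_cents": "int", "surge_pricing": "bool"}
v2 = {"trip_id": "string", "price_cents": "float", "promo_code": "string"}

problems = breaking_changes(v1, v2)
for problem in problems:
    print(problem)
```

A CI job could fail the build whenever `problems` is non-empty, which is one way to implement the "if it doesn’t pass, it doesn’t merge" rule.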
3. Root-cause analysis is easier when there is a data quality issue
With data quality efforts that focus on monitoring the data warehouse, you may learn that there’s a problem in your data without learning why. While you can certainly monitor the lineage of tables to get a sense of the problem's location (Bigeye provides this as a feature), data contracts are meant to catch data quality issues at the producer, so those issues should never have the opportunity to travel downstream.
Cons of data contracts
1. Difficulty in getting buy-in from software engineers
Since the burden of data quality/data transformation now falls onto software engineers instead of data engineers, implementing data contracts requires a process change. This change can be a tricky sell. Even if software engineers are willing, they may be unfamiliar with data modeling.
2. Difficulty in enforcing the data contract
In theory, data contract enforcement is a matter of good CI/CD. If it doesn’t pass, it doesn’t merge. In practice, tables within organizations are not always created through proper CI/CD. Instead, many tables originate during prototyping or exploration and, over time, end up referenced by downstream services.
3. Data consumer needs may change
In theory, data contracts should be designed in a backwards-compatible way. In practice, they probably still need occasional modifications. For instance, using the rideshare example from above, the data contract can handle changes in the metadata of trip requests, such as new pricing algorithms or name displays. But what if the machine learning team suddenly needs information about food orders? That’s a different entity that would need a separate data contract established.
Implementing data contracts
While Sanderson and Jones agreed on the broad strokes of what data contracts mean and why people should use them, they outlined slightly different implementations at their employers.
At Convoy, Chad Sanderson follows these steps to implement data contracts:
- Come up with an enterprise data model
- Teams that own production services define entities and events using Protobufs
- Events that occur to these entities are published to Kafka (pub-sub service)
- Teams consume data directly from Kafka
At GoCardless, Andrew Jones follows these steps for data contract implementation:
- The producing team uses JSON to define the schemas for the data they want to make available
- They categorize the data and choose their service needs
- Once the JSON file is merged into GitHub, dedicated BigQuery and PubSub resources are automatically deployed and populated with the requested data via a Kubernetes cluster
- The consuming team gets their desired data from their dedicated BigQuery
As you can see, both GoCardless and Convoy make use of the same basic ingredients in creating data contracts:
- Definition of entities and events
- Contract defined with some templating language
- A pub-sub system to handle events
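The sketch below shows how these three ingredients could fit together in a few lines of Python. It is not how Convoy or GoCardless implement their systems: the in-memory bus stands in for Kafka or PubSub, the contract is a plain dictionary rather than Protobuf or JSON, and all names (`InMemoryBus`, `TRIP_REQUESTED_CONTRACT`, `validate`) are hypothetical.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]
Contract = Dict[str, type]

# The contract: field name -> required Python type (a simplified stand-in
# for a Protobuf or JSON definition).
TRIP_REQUESTED_CONTRACT: Contract = {
    "trip_id": str,
    "price_cents": int,
    "surge_pricing": bool,
}


def validate(event: Event, contract: Contract) -> None:
    """Raise ValueError if the event does not satisfy the contract."""
    missing = set(contract) - set(event)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    for field, expected in contract.items():
        # Exact type match, so that a bool is not accepted as an int.
        if type(event[field]) is not expected:
            raise ValueError(f"{field} must be {expected.__name__}")


class InMemoryBus:
    """Toy stand-in for a real pub-sub system."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[Event], None]]] = (
            defaultdict(list)
        )

    def subscribe(self, topic: str, handler: Callable[[Event], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Event, contract: Contract) -> None:
        # Enforce the contract on the producer side, before delivery.
        validate(event, contract)
        for handler in self._subscribers[topic]:
            handler(event)


bus = InMemoryBus()
received: List[Event] = []
bus.subscribe("trip_requested", received.append)
bus.publish(
    "trip_requested",
    {"trip_id": "trip-001", "price_cents": 1850, "surge_pricing": False},
    TRIP_REQUESTED_CONTRACT,
)
```

Validating inside `publish` means a malformed event is rejected at the producer and never reaches a consumer.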
What’s the difference between data contracts and data SLAs?
Here at Bigeye, we’ve talked a lot about data SLAs, and you might be wondering what the difference is between data SLAs and data contracts.
As a reminder, SLAs are agreements between the producers and consumers of a service that set performance expectations for that service. Data SLAs are agreements between the producers and consumers of data that set certain metadata expectations for that data, e.g. freshness and accuracy.
Data contracts complement data SLAs. While data SLAs guarantee meta-properties about the data, data contracts guarantee what the data actually is.
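One way to picture the difference is as two separate checks on the same dataset. In the hypothetical sketch below, `meets_freshness_sla` looks only at a meta-property (how recently the data was updated), while `meets_contract` looks at the shape of the data itself. The function names, the one-hour threshold, and the field list are all assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict


def meets_freshness_sla(
    last_updated: datetime, now: datetime, max_age: timedelta
) -> bool:
    """Data SLA check: is the data recent enough?"""
    return now - last_updated <= max_age


def meets_contract(record: Dict[str, Any], required: Dict[str, type]) -> bool:
    """Data contract check: does the record have the agreed fields and types?"""
    return all(type(record.get(name)) is kind for name, kind in required.items())


now = datetime(2023, 1, 15, 12, 0, tzinfo=timezone.utc)
last_updated = datetime(2023, 1, 15, 11, 30, tzinfo=timezone.utc)
record = {"trip_id": "trip-001", "price_cents": "18.50"}
required = {"trip_id": str, "price_cents": int}

# Fresh data can still violate the contract, and vice versa.
is_fresh = meets_freshness_sla(last_updated, now, timedelta(hours=1))
is_valid = meets_contract(record, required)
```

Here the record is fresh (`is_fresh` is `True`) but breaks the contract (`is_valid` is `False`) because `price_cents` arrived as a string, which is why the two kinds of guarantees complement each other.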