The complete guide to understanding data SLAs
An SLA is a "service level agreement." If you have questions about what an SLA means, what it is for, or what goes into one, this post will walk you through it.
In the world of software engineering, companies like Slack, Stripe, and Zoom ensure 24/7 service availability by measuring performance and publishing SLAs that define the expected behavior of their software. These companies maintain high levels of reliability despite making rapid changes to their services. For example, Stripe reported 99.99% uptime over 90 days for their API, even while deploying code changes multiple times per day.
Data Service Level Agreements (data SLAs) are analogous to SLAs for software. They guarantee a certain quality and availability for data assets. This comprehensive guide will provide insights into what data SLAs are, their importance, implementation, examples, and when it's time to establish them in your organization.
What are data SLAs?
Data SLAs are agreements between data providers and data consumers that outline the expected level of data quality and data observability.
Data SLAs come from the broader concept of Service Level Agreements (SLAs), which are formal commitments between service providers and their stakeholders, or between different departments within an organization. SLAs define the expected level of service, along with the consequences if these expectations are not met.
Who uses data SLAs?
Data SLAs should be joint projects between the data platform teams, which are responsible for providing data to different departments within a company, and the teams that are consuming the data, such as the product, finance, and marketing teams.
Why are data SLAs important?
Data SLAs help bridge the gap between data engineers and consumers by providing clear expectations and accountability for data quality. They also help mediate the needs of different consumers of the data.
For example, the product team might want to move fast and make changes to their data whenever necessary, while other teams like marketing and finance might expect the data to remain stable and reliable. The SLA is the pre-defined source of truth for those opposing viewpoints.
Implementing data SLAs
Implementing data SLAs is a multi-step process. It starts with identifying the assets that require SLAs, such as executive-facing dashboards, core machine-learning models, or core tables. The next step is assembling the constituent components of a data SLA, the SLIs and SLOs, which are then packaged into the SLA itself. The components are as follows:
Service Level Indicators (SLIs)
SLIs are quantifiable, agreed-upon measurements of the data. For example, a team might decide to measure the duplicate rate of user records in a users table and set a limit on the acceptable percentage of duplicated records. This measurement can then be monitored and used to evaluate the health of the data. In addition to duplicate rate, there are various other aspects of data that could be measured using SLIs, such as:
- data freshness
- nulls and blanks
- out-of-range values
- formatting issues
By establishing a set of SLIs, data platform teams can avoid time-consuming back-and-forth conversations and focus on clearly quantified measurements of data quality.
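As a rough illustration of what an SLI can look like in practice, here is a minimal sketch that computes a duplicate rate and a null-or-blank rate over a column of values. The function names and the sample data are hypothetical and chosen only for this example; a real pipeline would more likely compute these in SQL against the warehouse.

```python
from collections import Counter


def duplicate_rate(user_ids):
    """Fraction of records that repeat an ID already present in the column."""
    if not user_ids:
        return 0.0
    counts = Counter(user_ids)
    # Every occurrence beyond the first of a given ID counts as a duplicate.
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(user_ids)


def null_or_blank_rate(values):
    """Fraction of values that are None or whitespace-only strings."""
    if not values:
        return 0.0
    missing = sum(
        1 for v in values
        if v is None or (isinstance(v, str) and not v.strip())
    )
    return missing / len(values)


# Hypothetical sample of a users table's ID column.
sample_ids = ["u1", "u2", "u2", "u3", "u4", "u4", "u4", "u5"]
print(f"duplicate rate: {duplicate_rate(sample_ids):.2%}")
print(f"null/blank rate: {null_or_blank_rate(['a', None, '', 'b']):.2%}")
```

Each function returns a single number between 0 and 1, which is what makes it usable as an indicator: it can be recorded on a schedule and compared against a target.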
Service Level Objectives (SLOs)
SLOs are targets set for the performance of the various attributes measured by SLIs. These targets help define what is considered normal or acceptable for a given data aspect. For instance, a team may decide that a 0.25% duplicate user ID rate is tolerable, but anything above 1-2% would negatively impact other processes or teams, such as finance or machine learning models.
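One way to picture an SLO is as a pair of thresholds applied to an SLI value. The sketch below uses the 0.25% and 1% figures from the example above; the `SLO` class, the three status labels, and the idea of a "warning" band between the two thresholds are assumptions made for illustration, not a standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    name: str
    target: float    # values at or below this are considered acceptable
    critical: float  # values above this are considered harmful downstream

    def evaluate(self, value: float) -> str:
        """Classify an observed SLI value against the objective."""
        if value <= self.target:
            return "ok"
        if value <= self.critical:
            return "warning"  # hypothetical middle band, not from any standard
        return "breach"


# Thresholds taken from the example: 0.25% tolerable, above 1% harmful.
duplicate_slo = SLO("duplicate_user_id_rate", target=0.0025, critical=0.01)

for observed in (0.001, 0.005, 0.02):
    print(f"{observed:.2%} -> {duplicate_slo.evaluate(observed)}")
```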
Service Level Agreements (SLAs)
In the final step, SLIs and SLOs are packaged up into SLAs. An SLA is an agreement that the SLI will stay within the SLO, and it also defines what happens when that SLO is not met.
For example, maybe the data team is tracking the duplicate rate of user UUIDs and internally aiming to keep it within the threshold 99.5% of the time. However, in the SLA, they make a commitment to a slightly lower level of reliability: 99%. Over the trailing 30 days, they commit to meeting the duplicate rate SLO 99% of the time, allowing up to 7.2 hours of downtime (1% of the 720 hours in the window).
In the SLA, it’s also agreed that if this threshold is exceeded, the data team will halt all changes to the ELT jobs that feed the users table and stop changes to all upstream services. This commitment ensures that the data infrastructure remains stable and that upstream changes do not disrupt the users table.
Finally, SLAs can include escalation procedures if disagreements arise. For instance, in case of a disagreement over downtime, the issue could be escalated to the VP of Infrastructure for resolution. The SLA serves as a binding commitment to ensure that all stakeholders work together to maintain data quality and reliability.
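To make the pieces concrete, the sketch below bundles an SLI, an SLO, a compliance commitment, the agreed consequences, and an escalation contact into one record. The `DataSLA` class and its field names are hypothetical. The values are illustrative: a 0.25% duplicate threshold and a 99% commitment over 30 days, which is the commitment level that works out to 7.2 hours of allowed violation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSLA:
    dataset: str
    sli: str                  # what is measured
    slo_threshold: float      # acceptable upper bound for the SLI
    compliance_target: float  # fraction of the window the SLO must hold
    window_days: int
    on_breach: tuple          # agreed consequences when the SLA is missed
    escalation_contact: str

    def allowed_violation_hours(self) -> float:
        """Hours within the window during which the SLO may be violated."""
        return (1 - self.compliance_target) * self.window_days * 24


# Illustrative values only; a real agreement would be negotiated per dataset.
users_sla = DataSLA(
    dataset="users",
    sli="duplicate_user_uuid_rate",
    slo_threshold=0.0025,
    compliance_target=0.99,
    window_days=30,
    on_breach=(
        "halt changes to the ELT jobs that feed the users table",
        "stop changes to upstream services",
    ),
    escalation_contact="VP of Infrastructure",
)

print(f"allowed violation: {users_sla.allowed_violation_hours():.1f} hours")
```

Keeping the consequences and the escalation path in the same record as the numeric targets reflects the point that an SLA is more than a threshold: it also says what happens when the threshold is missed.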
Examples of common data SLAs
- Freshness: Guaranteeing that data is no more than a certain number of hours or days old.
- Completeness: Ensuring a specific percentage of expected records and fields is present.
- Accuracy: Defining acceptable error rates for data values.
- Availability: Ensuring a certain level of uptime for data storage and retrieval systems.
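A freshness guarantee, the first item in the list, is probably the simplest of these to check. The following is a minimal sketch, assuming the last successful load time of a table is available as a timezone-aware timestamp; `is_fresh` and its parameters are hypothetical names.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(last_loaded_at, max_age, now=None):
    """Return True if the data was loaded within the allowed age."""
    # `now` can be passed in explicitly to make the check testable.
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_loaded_at <= max_age


# Hypothetical example: the table must be no more than 6 hours old.
checked_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
loaded_at = datetime(2024, 1, 1, 7, 30, tzinfo=timezone.utc)
print(is_fresh(loaded_at, timedelta(hours=6), now=checked_at))
```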
For instance, let's consider a duplicate rate SLI with a 99.5% reliability target. We would measure this SLI every 30 minutes and track the results over a 30-day window. During this period, the data team is allowed a total of 3.6 hours when the duplicate rate exceeds the set threshold. If the duplicate rate surpasses the limit for more than 3.6 hours, the data team has not met its commitment to the company in terms of dataset reliability.
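The arithmetic in this example can be sketched as follows. A 30-day window sampled every 30 minutes yields 1,440 checks, and a 99.5% target leaves a budget of 3.6 hours. The function names are hypothetical, and the sketch assumes that each failed check counts as a full 30-minute interval of violation, which is a simplification.

```python
def violation_hours(check_results, interval_minutes=30):
    """Hours of violation, treating each failed check as one full interval.

    check_results: sequence of bools, True when the SLI was within threshold.
    """
    failed = sum(1 for ok in check_results if not ok)
    return failed * interval_minutes / 60


def meets_commitment(check_results, reliability=0.995, interval_minutes=30):
    """True if the violation time fits inside the window's error budget."""
    results = list(check_results)
    window_hours = len(results) * interval_minutes / 60
    budget_hours = (1 - reliability) * window_hours
    return violation_hours(results, interval_minutes) <= budget_hours


checks_per_window = 30 * 24 * 2  # 30 days, one check every 30 minutes

# Seven failed checks is 3.5 hours, inside the 3.6-hour budget.
mostly_good = [True] * (checks_per_window - 7) + [False] * 7
# Eight failed checks is 4 hours, which exceeds it.
too_many = [True] * (checks_per_window - 8) + [False] * 8

print(violation_hours(mostly_good), meets_commitment(mostly_good))
print(violation_hours(too_many), meets_commitment(too_many))
```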
SLOs can be set at different levels of reliability, depending on the specific requirements and priorities of a dataset. Examples include:
- 99.9% reliability (three nines): 43 minutes of downtime in a 30-day window.
- 99.5% reliability: 3.6 hours of downtime (as shown in our example).
- 99% reliability (two nines): 7.2 hours of downtime.
- 95% reliability: 1.5 days of downtime.
- 90% reliability: 3 days of downtime.
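The downtime figures in this list all come from the same formula: the allowed violation time is (1 - reliability) multiplied by the length of the window. A short sketch, with a hypothetical helper name, that reproduces them for a 30-day window:

```python
def error_budget_hours(reliability, window_days=30):
    """Hours within the window during which the SLI may be out of bounds."""
    return (1 - reliability) * window_days * 24


for reliability in (0.999, 0.995, 0.99, 0.95, 0.90):
    hours = error_budget_hours(reliability)
    print(
        f"{reliability:.1%} reliability: "
        f"{hours:.1f} hours ({hours * 60:.0f} minutes, {hours / 24:.2f} days)"
    )
```

The same helper works for other window lengths by changing `window_days`, which is useful if a team prefers a trailing 7-day or 90-day window.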
Setting stricter SLOs signifies a stronger commitment to stakeholders regarding dataset reliability. For example, a 99.9% reliability target means that the SLI could be violated for roughly a minute and a half per day on average (43 minutes spread over 30 days), which is generally acceptable for applications like analytics dashboards.
Common signs it's the right time for data SLAs
It's usually prudent to create an SLA between two teams. As with so many contracts, creating an SLA is a "better safe than sorry" action with little downside. However, there are a few common signs that it's time for your organization to build SLAs into the workflow right away. Those signs are:
- Frequent data quality issues impacting data consumers and their ability to trust the data
- Disagreements between different consumers on the definitions of data quality metrics
- A growing data engineering team that requires clear priorities and guidelines for managing data quality
Data SLAs are essential for maintaining high-quality data and ensuring that data consumers can trust the data they are working with. Implementing data SLAs can lead to better communication, prioritization, and accountability for data quality within an organization. By understanding and establishing data SLAs, businesses can optimize their data-driven decision-making processes and maximize the value of their data assets.