Thought leadership

August 30, 2023

20 data reliability use cases from real-life teams

Liz Elfman

Modern, data-driven companies need reliable data. But on the path to building trustworthy analytics, ML models, and data products, your data team is bound to hit some roadblocks. In theory, data reliability is straightforward. But when it comes to the messy business of actually implementing it, what do real-life teams do?

Organizations like Lyft, Walmart, and LinkedIn have applied data reliability techniques to solve their data challenges. In this post, we highlight 20 of those real-world examples.

1. Monitoring data freshness/staleness

LinkedIn built a system called Data Health Monitor (DHM) that automatically monitors the freshness and staleness of datasets. With this system, they can detect issues like pipelines unintentionally using an older dataset version.
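A freshness check of this kind can be sketched in a few lines. This is a minimal illustration, not DHM's actual implementation: the function name `is_stale`, the 26-hour allowance, and the timestamps are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_updated: datetime, max_age: timedelta, now: datetime) -> bool:
    """Return True when the newest data is older than the allowed age."""
    return now - last_updated > max_age


# Hypothetical daily dataset that may lag by at most 26 hours.
max_age = timedelta(hours=26)
now = datetime(2023, 8, 30, 12, 0, tzinfo=timezone.utc)
fresh_partition = datetime(2023, 8, 30, 1, 0, tzinfo=timezone.utc)
old_partition = datetime(2023, 8, 28, 1, 0, tzinfo=timezone.utc)

print(is_stale(fresh_partition, max_age, now))  # 11 hours old
print(is_stale(old_partition, max_age, now))    # 59 hours old
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps the check easy to test and replay against historical runs.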

2. Monitoring data volume changes

LinkedIn’s DHM also monitors for sudden drops or increases in data volume, which can indicate partial data or insufficient resources, respectively. Being aware of volume changes helps LinkedIn maintain pipeline and data quality.
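One simple way to flag a sudden volume change is to compare today's row count with a trailing average. The sketch below assumes a 30% tolerance and hypothetical function names; it is not how DHM computes its alerts, and production systems typically also account for seasonality.

```python
from statistics import mean


def volume_change(history: list[int], today: int) -> float:
    """Relative change of today's row count against the trailing average."""
    baseline = mean(history)  # assumes a non-empty history with a non-zero average
    return (today - baseline) / baseline


def classify_volume(history: list[int], today: int, tolerance: float = 0.3) -> str:
    """Label today's volume as a 'drop', a 'spike', or 'ok'."""
    change = volume_change(history, today)
    if change < -tolerance:
        return "drop"
    if change > tolerance:
        return "spike"
    return "ok"


recent_counts = [1000, 1100, 900, 1000]  # hypothetical daily row counts
print(classify_volume(recent_counts, 500))
print(classify_volume(recent_counts, 1600))
print(classify_volume(recent_counts, 1050))
```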

3. Monitoring the quality of offline data

Lyft built Verity, a check-based system to monitor the quality of offline data. It allows users to define checks that run queries to validate expectations, such as checking for null values in a column. Verity checks can be configured to run automatically on a schedule or as part of data pipelines. The check results are stored to enable debugging when failures occur.
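The idea of a query-backed check can be illustrated with Python's built-in `sqlite3` module. This is a sketch of the general pattern only: `run_check`, the `rides` table, and its columns are hypothetical and do not reflect Verity's real interface.

```python
import sqlite3


def run_check(conn: sqlite3.Connection, query: str, expected: int = 0) -> dict:
    """Run a query that returns a single number and compare it with the expectation."""
    actual = conn.execute(query).fetchone()[0]
    # Keeping the query and the observed value makes failures easier to debug later.
    return {"query": query, "actual": actual, "passed": actual == expected}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (ride_id INTEGER, driver_id INTEGER)")
conn.executemany("INSERT INTO rides VALUES (?, ?)", [(1, 10), (2, None), (3, 12)])

# Expectation: no ride should have a null driver_id.
result = run_check(conn, "SELECT COUNT(*) FROM rides WHERE driver_id IS NULL")
print(result["passed"], result["actual"])
```

The returned dictionary stands in for the stored check result; in practice it would be written to a results table along with a timestamp.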

4. Bringing objectivity to quality

Walmart built the Data Quality Assessment Framework (DQAF), the company’s product for continuous data quality. DQAF lets stakeholders define objective thresholds for quality metrics based on what "good quality" means to them, which makes quality less subjective. For example, a business user can set a threshold that a critical column must be 95-100% complete.
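A completeness threshold like the 95-100% example can be expressed as a score plus a range check. The helper names below are assumptions for illustration and are not part of DQAF.

```python
def completeness(values: list) -> float:
    """Percentage of non-missing (non-None) values in a column."""
    if not values:
        return 0.0
    present = sum(1 for v in values if v is not None)
    return 100.0 * present / len(values)


def meets_threshold(score: float, low: float = 95.0, high: float = 100.0) -> bool:
    """True when the score falls inside the agreed quality range."""
    return low <= score <= high


# Hypothetical column with 19 populated values out of 20.
order_ids = list(range(19)) + [None]
score = completeness(order_ids)
print(score, meets_threshold(score))
```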

5. Clarifying ownership

Walmart’s framework also assigns ownership of quality scores for different data domains to the relevant teams. So for example, the "Orders" team owns order data quality scores. This functionality delineates responsibilities across siloed teams.

6. Tracking improvements

By storing quality scores over time, Walmart can quantify improvements as data stewards fix issues. If the completeness score for a column goes from 80% to 95% after fixes, it demonstrates the business impact of quality efforts.

7. Depicting interconnectedness

Even though teams own data in silos, quality scores in Walmart’s framework show interdependencies between data sets. For example, the Orders team’s data quality is connected to the Customer team’s data quality.

8. Enabling custom algorithms

At Walmart, teams can define custom data quality algorithms tailored to their specific data needs. For example, the Orders team could create a validity check unique to a column in the Orders table.

9. Answering questions about data quality

Through Walmart's quality score tracking, analysts can also answer questions like "Why did data quality dip last month?" These analyses provide data-driven narratives around quality that can be surfaced to executives.

10. Making data discoverable

LinkedIn built "Super Tables", which are centralized, well-documented datasets that have been pre-computed and normalized. They aim to be the “go-to datasets” for certain domains, e.g.:

JOBS Super Table: Consolidates data from 57+ different job-related data sources into a single table with 158 columns. Provides precomputed information commonly needed for job analytics and insights.
Ad Events Super Table: Consolidates data from 7 different ad-related tables, including ad impressions, clicks, video views, etc. Joins in campaign and advertiser dimensions. Provides 150+ columns for ad analytics and reporting.

The goal of both Super Tables is to simplify data discovery, reduce redundant joins and storage, and precompute commonly used data for downstream analytics.

11. Guaranteeing table availability

LinkedIn’s Super Tables also have well-defined service level agreements (SLAs) that specify availability, supportability, and change management commitments.

For availability, the goal is to achieve 99%+ uptime. For a daily Super Table flow, this translates to about one SLA miss per quarter. To improve availability, Super Tables can be materialized in multiple clusters with active-active configurations. This provides redundancy in case of failures.
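The "about one SLA miss per quarter" figure follows from simple arithmetic. The sketch below assumes one run per day and a 91-day quarter; both numbers are assumptions for illustration.

```python
def allowed_misses(runs: int, availability: float) -> float:
    """Expected number of runs that may miss the SLA at a given availability."""
    return runs * (1.0 - availability)


runs_per_quarter = 91  # assumption: one daily run over a 91-day quarter
print(round(allowed_misses(runs_per_quarter, 0.99), 2))
```

At 99% availability the budget is a little under one missed run per quarter, so a second miss in the same quarter already breaches the target.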

Upstream data sources must also commit to SLAs that enable the Super Table to meet its own SLA. The SLAs of upstream sources are tracked and monitored.

12. Managing schema changes in upstream sources

By default, schema changes (additions, deletions, etc.) in upstream source data do not automatically affect the Super Table schema.

If a new column is added in a source, it does not appear in the Super Table. If a source column is deleted, its value is nullified in the Super Table.
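This behavior amounts to projecting each source row onto a fixed, explicitly declared column list. The sketch below uses a hypothetical three-column schema and a hypothetical `project_row` helper; it illustrates the rule described above, not LinkedIn's pipeline code.

```python
SUPER_TABLE_COLUMNS = ["job_id", "title", "company"]  # hypothetical fixed schema


def project_row(source_row: dict, columns: list[str] = SUPER_TABLE_COLUMNS) -> dict:
    """Keep only declared columns; columns missing upstream become None."""
    return {col: source_row.get(col) for col in columns}


# The source added a new column ("salary"): it is ignored.
print(project_row({"job_id": 1, "title": "Engineer", "company": "Acme", "salary": 100}))

# The source dropped a column ("company"): its value is nullified.
print(project_row({"job_id": 2, "title": "Analyst"}))
```

Because the output schema never changes implicitly, downstream consumers see new columns only after a deliberate, communicated schema change.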

The Super Table governance body is notified of source schema changes that could potentially impact the table. All planned schema changes to the Super Table itself are documented and communicated to downstream consumers, and there is a monthly release cadence for accepting schema change requests to the Super Table.

13. Reducing alert fatigue

Uber used tiering to classify and prioritize its various data assets, such as tables, pipelines, machine learning models, and dashboards. By assigning different tiers to these assets, Uber is able to manage its resources more efficiently, ensuring that only the most important data gets alerted on:

Tier 0: These are the most critical data assets that are foundational for the business to operate. Any disruption in these assets could have severe consequences. Kafka as a service, for example, falls under this category.
Tier 1: Extremely important datasets that could be essential for decision-making, analytics, or operational aspects. These could be things like user data, transaction data, etc.
Tier 2: Important but not critical datasets. These could be important for some departments or features but aren't as universally crucial.
Tiers 3, 4: Less critical data that may still be useful for specific analyses or features.
Tier 5: These are individually owned datasets, often generated in staging or test environments. They have no guarantees of quality or availability and are the least prioritized.

By identifying just 2,500 Tier 1 and Tier 2 tables out of over 130,000 tables, Uber focused its efforts on a manageable but critically important subset of its data, allowing for better quality, reliability, and resource allocation.
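Tier-based alert filtering can be sketched as a single comparison against a cutoff tier. The `Asset` class, the asset names, and the choice to alert on tiers 0 through 2 are assumptions for this example, not Uber's actual configuration.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    tier: int  # 0 = most critical, 5 = individually owned


def assets_to_alert_on(assets: list[Asset], max_tier: int = 2) -> list[str]:
    """Return the names of assets critical enough to generate alerts."""
    return [a.name for a in assets if a.tier <= max_tier]


catalog = [
    Asset("kafka_service", 0),
    Asset("trips", 1),
    Asset("marketing_attribution", 2),
    Asset("adhoc_experiment_scratch", 5),
]
print(assets_to_alert_on(catalog))
```

With the numbers quoted above, 2,500 of more than 130,000 tables means fewer than 2% of tables fall into the alerting set.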

14. Reducing manual data issue debugging

Stripe built a centralized observability platform and internal UI that allowed users to select different runs of a data job and compare metrics like runtime, data volume processed, and logs across runs.

Based on current runtime progression and historical runtimes, the UI would also predict estimated completion time for running jobs, which would help address stakeholder questions.
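One plausible way to combine current progress with historical runtimes is to average two estimates of total runtime. The blending rule and all names below are assumptions for illustration; the source does not describe how Stripe's UI computes its prediction.

```python
from statistics import median


def estimate_remaining_seconds(
    elapsed: float, fraction_done: float, historical_runtimes: list[float]
) -> float:
    """Estimate remaining runtime from current progress and past runs."""
    historical_total = median(historical_runtimes)
    if fraction_done <= 0:
        # No progress signal yet: rely on history alone.
        total = historical_total
    else:
        projected_total = elapsed / fraction_done  # extrapolate the current pace
        total = (projected_total + historical_total) / 2
    return max(0.0, total - elapsed)


# Hypothetical job: 10 minutes in, half done, past runs took about 1000 seconds.
print(estimate_remaining_seconds(600, 0.5, [1000, 1100, 900]))
```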

Finally, users could use the UI to configure data tests and standardized fallback behaviors for different failure cases.

15. On-call training

Playbooks and runbooks are documents that outline the steps for responding to specific types of issues/incidents. In the context of running a data organization, they ensure that everyone involved has a shared understanding of the plan of action. More specifically, they provide a checklist of action items so that nothing is forgotten. This checklist can also be used to train new staff on data issue response.

16. Data producer-consumer alignment

Convoy pioneered data contracts. These are API-based agreements between software engineers who own services and business-focused data consumers, with the goal of generating well-modeled, high-quality, trusted data. They allow a service to define the entities and application-level events they own, along with their schema and semantics.

Data contracts ensure that production-grade data pipelines are treated as part of the product, with clear SLAs and ownership. They also orient everyone in the same direction so that problem-solving work is effective.
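At its simplest, a data contract is a declared schema that events are validated against before they are published. The sketch below uses a hypothetical shipment event and a hypothetical `violations` helper; it is not Convoy's implementation, and real contracts also cover semantics, SLAs, and ownership.

```python
# Hypothetical contract: field name -> required Python type.
SHIPMENT_CONTRACT = {"shipment_id": str, "weight_kg": float, "delivered": bool}


def violations(event: dict, contract: dict) -> list[str]:
    """List every way an event breaks the contract (empty list means valid)."""
    problems = []
    for field, expected_type in contract.items():
        if field not in event:
            problems.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            actual = type(event[field]).__name__
            problems.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    return problems


good_event = {"shipment_id": "s-1", "weight_kg": 12.0, "delivered": False}
bad_event = {"shipment_id": "s-2", "weight_kg": "12"}
print(violations(good_event, SHIPMENT_CONTRACT))
print(violations(bad_event, SHIPMENT_CONTRACT))
```

Running a check like this in the producing service's tests or CI surfaces breaking changes before they reach downstream consumers.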

17. Preventing degradation in machine learning model performance

At Lyft, input features to models are validated in real-time against valid value ranges. This catches issues like incorrect units or data types passing to models.
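Range validation of input features can be sketched as a lookup of allowed bounds per feature. The feature names and ranges below are hypothetical, and the helper is an illustration rather than Lyft's code.

```python
# Hypothetical valid ranges: feature name -> (min, max).
FEATURE_RANGES = {
    "trip_distance_km": (0.0, 500.0),
    "driver_rating": (1.0, 5.0),
}


def out_of_range(features: dict, ranges: dict = FEATURE_RANGES) -> list[str]:
    """Return the names of features that are missing, non-numeric, or out of bounds."""
    bad = []
    for name, (low, high) in ranges.items():
        value = features.get(name)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not (low <= value <= high):
            bad.append(name)
    return bad


print(out_of_range({"trip_distance_km": 12.5, "driver_rating": 4.8}))
# A distance sent in meters instead of kilometers, and a rating sent as a string:
print(out_of_range({"trip_distance_km": 12500, "driver_rating": "4.8"}))
```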

They also monitor distributions of model score outputs with time series alerts, and analyze historical logs of features and predictions to catch unusual statistical deviations that could imply model degradation. If upstream feature changes or data drift are detected, they automatically retrain models to prevent performance from declining.
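A very simple form of score-distribution monitoring compares the current mean model score with recent history using a z-score. This is a sketch under stated assumptions: the z-score rule, the threshold of 3, and the sample values are all illustrative, and the source does not say which statistical tests Lyft uses.

```python
from statistics import mean, stdev


def score_shift_alert(
    baseline_means: list[float], current_mean: float, z_threshold: float = 3.0
) -> bool:
    """Alert when the current mean score deviates strongly from the baseline."""
    mu = mean(baseline_means)
    sigma = stdev(baseline_means)  # requires at least two baseline points
    if sigma == 0:
        return current_mean != mu
    return abs(current_mean - mu) / sigma > z_threshold


daily_mean_scores = [0.50, 0.52, 0.49, 0.51, 0.50]  # hypothetical history
print(score_shift_alert(daily_mean_scores, 0.51))
print(score_shift_alert(daily_mean_scores, 0.70))
```

An alert from a check like this could be the trigger for investigation or for the automated retraining described above.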

18. Making it easier for business users to answer data questions

Pinterest built Querybook, an open-source data collaboration platform for sharing SQL queries, datasets, and insights. Querybook also has a ChatGPT-like interface to automatically generate and execute SQL queries from plain text questions. For example, users can ask natural language questions like "How many daily active users in the past month?" and it will generate the appropriate SQL query.

19. Making data incidents less stressful

Following the principles of data reliability will hopefully mean you face fewer data incidents, but it also means that even when data incidents occur, they’re less stressful.

You can apply standard incident response frameworks to data incidents too. These cover both the response process (incident detection, response, root cause analysis, resolution, and a blameless post-mortem) and the response team (incident leader, SME, liaison, and scribe). Together, they give you a tried-and-true plan of attack.

20. Encouraging data-driven business decisions

Ultimately, you’re not collecting and analyzing all this data at a company for fun: it should be in service of making business or product decisions. Data reliability principles ensure that analyses and reports are accurate, that metrics and trends can be tracked over time, and that key financial information is always up-to-date and correct for compliance reasons.

Final thoughts

Modern data stacks enable tremendous analytical capabilities but also introduce reliability challenges from complexity and scale. Companies like Lyft, LinkedIn, Uber, Walmart, and Pinterest apply data reliability principles to build trust and confidence in their data products and make better business choices.
