Engineering
-
August 7, 2023

Why data lineage is mission-critical for businesses today

What is data lineage, what makes it difficult to do well, and how should organizations leverage it? In this post, we'll walk through it.

Egor Gryaznov

In this 3-part series, we’ll discuss the ins and outs of data lineage, why it’s difficult to get right, and why it’s mission-critical for businesses looking to supercharge their data quality monitoring.

Data lineage - the knowledge about how data moves and transforms across various systems and processes - is an essential tool in any organization's data management toolbox. But contrary to what you might think, lineage is not a standalone product. The real value of data lineage is in the contextual information it’s able to provide to other workflows. Whether it’s understanding where data originates so you can ask the right person about its definition, or knowing who will be impacted when a field gets dropped, lineage turns what would normally be tribal knowledge into something teams can look up easily.

Where did this data come from?

This is a question that has inevitably been asked by every data user in every organization. Before doing anything meaningful with data, it’s critical to understand what data you’re looking at, how it arrived where you’re consuming it, and who to ask for help. Traditionally, this work has been done via tribal knowledge - you would (often literally) tap your coworker on the shoulder and they’d explain it to you. The next time someone needed to find out about the data, they might have come to you.

Data lineage products traditionally look to automate this process. As data storage, transformation, and consumption patterns normalized, it became easier to extract data lineage information from the whole data ecosystem. More and more teams became eager to adopt data lineage products in order to answer questions in their organization more efficiently. However, adoption of these products was not as revolutionary as it was expected to be…

What do I do with all this lineage information?

At the end of the day, being able to collect and navigate all this data lineage led to a wealth of information, but an even more fractured workflow. Data users ask questions not just because they want the answer - there’s always a question behind the question. It might be that they are building a new dashboard and need to quickly wrap their head around an unfamiliar dataset - lineage would help them understand who else uses that dataset and how. Maybe they are digging into something that looks strange in their report - lineage helps them find the original source system for their data and who to ask about any possible changes.

Ultimately, lineage gives context to business workflows. It’s not a cure-all for data problems within the business, but at the same time a lot of these problems would not be solvable without it.

A real-world example of a data lineage-powered workflow

Let's take a data engineer who woke up to a DM from their manager saying something is wrong with the CEO’s dashboard. The first thing the data engineer needs to do is find out what datasets are feeding the dashboard. Instead of asking an analyst or reading hundreds of lines of SQL, the data engineer can go to their data catalog, search for the dashboard, and look at the lineage graph to understand everything feeding into it. Once they understand that, they might go to their data observability tool to understand what issues are going on with those tables and what other data products are impacted. Or maybe that information is already in the catalog!

In both of these workflows, lineage is providing important information, but the real goal is to solve a problem - the broken dashboard - not just to learn about the data.
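To make the "look at the lineage graph" step concrete, here is a minimal sketch of the lookup a catalog might perform behind the scenes. It assumes lineage is already available as a simple mapping from each asset to the assets it reads from directly; the asset names and the `upstream_of` helper are hypothetical and not taken from any particular product.

```python
from collections import deque

# Hypothetical lineage edges: each key is a data asset, each value lists the
# assets it reads from directly. All names are made up for illustration.
UPSTREAM = {
    "ceo_dashboard": ["revenue_summary", "active_users"],
    "revenue_summary": ["orders_clean"],
    "active_users": ["events_clean"],
    "orders_clean": ["orders_raw"],
    "events_clean": ["events_raw"],
}


def upstream_of(asset, edges):
    """Return every asset feeding `asset`, directly or indirectly.

    Assets are listed in breadth-first order, so the closest inputs come first.
    """
    ordered = []          # discovery order, nearest inputs first
    visited = set()       # guards against revisiting shared or cyclic inputs
    queue = deque(edges.get(asset, []))
    while queue:
        current = queue.popleft()
        if current in visited:
            continue
        visited.add(current)
        ordered.append(current)
        queue.extend(edges.get(current, []))
    return ordered


print(upstream_of("ceo_dashboard", UPSTREAM))
```

With this toy graph, the data engineer would see the two summary tables first, then the cleaned tables, then the raw sources. In practice the hard part is collecting the edges, not walking them.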

Bigeye and data lineage

Here at Bigeye, we think a lot about making it quick and easy for data teams to understand the health of their data. Lineage plays a massive part in that - knowing what data products are affected by changes, or understanding how far upstream a data pipeline broke. Some of those data pipelines are dozens of steps long and span a handful of different systems. Recently, we acquired Data Advantage Group, in large part for its extensive data lineage capabilities, allowing Bigeye to map data lineage across transactional databases, ETL platforms, data lakes, data warehouses, and business intelligence tools.
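As a rough illustration of those two questions (how far upstream did the pipeline break, and which data products are affected), the sketch below works on a toy lineage mapping. It is an assumption-laden example, not a description of how Bigeye implements lineage: the table names, the `INPUTS` mapping, and both helper functions are hypothetical.

```python
# Hypothetical pipeline: each key is an asset, each value lists its direct inputs.
INPUTS = {
    "orders_raw": [],
    "orders_clean": ["orders_raw"],
    "revenue_summary": ["orders_clean"],
    "ceo_dashboard": ["revenue_summary"],
    "finance_report": ["revenue_summary"],
}


def root_failures(failing, inputs):
    """Failing assets with no failing direct input: the likely origin of the break."""
    return sorted(
        asset
        for asset in failing
        if not any(parent in failing for parent in inputs.get(asset, []))
    )


def downstream_of(asset, inputs):
    """Every asset that depends on `asset`, directly or indirectly."""
    impacted = set()
    frontier = [asset]
    while frontier:
        current = frontier.pop()
        for child, parents in inputs.items():
            if current in parents and child not in impacted:
                impacted.add(child)
                frontier.append(child)
    return sorted(impacted)


# Assumed monitoring result: these three assets currently have open issues.
failing = {"orders_clean", "revenue_summary", "ceo_dashboard"}
print(root_failures(failing, INPUTS))
print(downstream_of("orders_clean", INPUTS))
```

Here the break traces back to `orders_clean`, and the downstream walk also surfaces `finance_report`, which has not raised an alert yet but depends on the same broken table.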

In the next post, we’ll talk about why automating lineage collection across the whole data landscape is so difficult, how some products and teams do it, and give a sneak peek at some of the exciting integration work we have going on.

Can’t wait for part two? Book a demo today.

