Data pipeline monitoring vs. data quality monitoring: What's the difference?
Why does the difference between data pipeline monitoring and data quality monitoring matter? In this post, we'll define both and explain what the difference means in practice.
Data monitoring often conflates the health of the data pipeline with the health of the data itself. In practice, these are two separate disciplines: data pipeline monitoring and data quality monitoring. In this post, we'll delve into the key differences between the two and why it's essential to have both in place.
Data pipeline monitoring: Ensuring smooth data flow
Data pipeline monitoring (DPM) focuses on the jobs and tables that move the data through systems such as Snowflake and Airflow. The main aspects of DPM are freshness (when each table was last updated), volume (how many rows are being moved), and job run durations. DPM is typically the responsibility of data engineering or data platform teams.
By monitoring the data pipeline, you ensure that your ETL (Extract, Transform, Load) processes are running smoothly and that data flows seamlessly between the different stages of the pipeline. This helps you avoid bottlenecks and keeps your data up to date and ready for analysis.
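As a rough illustration, here is a minimal Python sketch of the freshness and volume side of DPM, assuming the table metadata (last update time and row count) has already been pulled from the warehouse or orchestrator. The table names, numbers, 24-hour SLA, and 50% volume tolerance are made up for the example, not a reference implementation.

```python
from datetime import datetime, timedelta, timezone

# Illustrative snapshot of pipeline metadata. In practice this would come from
# the warehouse or orchestrator (e.g. Snowflake's INFORMATION_SCHEMA.TABLES or
# Airflow's metadata database); the names and numbers here are made up.
TABLE_METADATA = [
    {"table": "orders",    "last_updated": datetime.now(timezone.utc) - timedelta(hours=2),  "row_count": 2_600_000, "expected_rows": 1_250_000},
    {"table": "customers", "last_updated": datetime.now(timezone.utc) - timedelta(hours=30), "row_count": 88_000,    "expected_rows": 90_000},
]

FRESHNESS_SLA = timedelta(hours=24)   # assumed per-table freshness threshold
VOLUME_TOLERANCE = 0.5                # assumed: flag row counts more than 50% off baseline

def pipeline_alerts(metadata):
    """Flag tables that are stale or whose row count deviates from the baseline."""
    now = datetime.now(timezone.utc)
    alerts = []
    for m in metadata:
        if now - m["last_updated"] > FRESHNESS_SLA:
            alerts.append(f"{m['table']}: not refreshed within {FRESHNESS_SLA}")
        deviation = abs(m["row_count"] - m["expected_rows"]) / m["expected_rows"]
        if deviation > VOLUME_TOLERANCE:
            alerts.append(f"{m['table']}: row count off baseline by {deviation:.0%}")
    return alerts

for alert in pipeline_alerts(TABLE_METADATA):
    print(alert)
```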
Data quality monitoring: Assessing the contents of the data
Data quality monitoring (DQM), on the other hand, focuses on the contents of the data. DQM includes aspects such as freshness (how old the values are), completeness (rate of nulls, blanks, etc.), duplication, and format compliance. DQM is often the responsibility of data science and analytics teams, who need to ensure that the data they use is accurate and reliable.
By implementing DQM, you can identify issues such as null values, duplicates, and outliers that may affect the accuracy of your data-driven insights. With proper DQM in place, your ML models and analytics work off of high-quality data, ultimately leading to better decision-making.
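To make this concrete, here is a minimal Python sketch of two common DQM checks, a null-rate check and a duplicate-key check, run against an illustrative batch of records. The column names, sample rows, and helper functions are assumptions for the example.

```python
from collections import Counter

# Illustrative batch of records; column names and values are made up.
ROWS = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 2, "email": "b@example.com"},  # duplicate id
]

def null_rate(rows, column):
    """Fraction of rows where the column is null or blank."""
    if not rows:
        return 0.0
    missing = sum(1 for r in rows if r.get(column) in (None, ""))
    return missing / len(rows)

def duplicate_keys(rows, key):
    """Values of `key` that appear in more than one row."""
    counts = Counter(r[key] for r in rows)
    return [value for value, n in counts.items() if n > 1]

print(f"email null rate: {null_rate(ROWS, 'email'):.0%}")   # 33%
print(f"duplicate ids:   {duplicate_keys(ROWS, 'id')}")     # [2]
```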
The importance of both data pipeline and data quality monitoring
While DPM and DQM can be done with two separate systems, to truly understand the behavior of your pipeline, you should correlate information from both sources. For instance, if you notice that a table has been refreshed later than usual with a larger number of rows, and you also find a significant number of duplicated IDs, this could indicate an issue with an ETL job. In this case, combining data pipeline monitoring (freshness and volume) with data quality monitoring (duplicates) can help you identify and resolve the problem.
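Here is a hypothetical Python sketch of that correlation: it combines a pipeline signal (a row-count spike) with a quality signal (duplicate IDs) to suggest a probable cause. The diagnose_load helper and the 1.5x volume threshold are assumptions made for illustration.

```python
# Illustrative correlation of a pipeline signal (row-count spike) with a
# quality signal (duplicate IDs); the threshold and inputs are assumptions.
def diagnose_load(row_count, expected_row_count, duplicate_ids):
    volume_spike = row_count > 1.5 * expected_row_count
    if volume_spike and duplicate_ids:
        return "Likely ETL issue: the load grew unexpectedly and introduced duplicate IDs."
    if volume_spike:
        return "Volume anomaly only: check upstream sources for a backfill or double load."
    if duplicate_ids:
        return "Duplicates only: check the transformation logic."
    return "No anomaly detected."

print(diagnose_load(row_count=2_600_000, expected_row_count=1_250_000, duplicate_ids=[2]))
```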
You want to prioritize data pipeline monitoring over data quality monitoring. If the data isn't flowing smoothly through the pipeline, there's little point in worrying about its quality. Once the data engineering team has ensured the smooth operation of the data pipeline, they can hand over the responsibility of data quality monitoring to the data science and analytics teams. This division of labor lets each team focus on its area of expertise and ensures that both aspects of data management are adequately addressed.
The role of analytics engineers in pipeline and quality monitoring
With the rise of tools like dbt, the role of analytics engineer has evolved into a mix of data analyst and data engineer. Analytics engineers understand how the data is consumed in dashboards and statistical models, and write SQL to perform data transformations. They can serve as a valuable bridge in the correlation work mentioned above.
In practice: The intersection of data pipeline and data quality monitoring
In reality, the division between data pipeline monitoring and data quality monitoring is not always clear-cut. However, having a strong understanding of the two concepts and their respective responsibilities can help organizations make informed decisions about which aspects of their data management processes need attention.