Data pipeline monitoring vs. data quality monitoring: What's the difference?
Why does the difference between data pipeline monitoring and data quality monitoring matter? In this post, we'll define both and explain what the difference means in practice.
Data monitoring often conflates the health of the data pipeline with the health of the data itself. In practice, these are two separate disciplines: data pipeline monitoring and data quality monitoring. In this post, we'll delve into the key differences between the two and why it's essential to have both in place.
Data pipeline monitoring: Ensuring smooth data flow
Data pipeline monitoring (DPM) focuses on the jobs and tables that move the data through systems such as Snowflake and Airflow. The main aspects of DPM are freshness (when each table was last updated), volume (how many rows are being moved), and job run durations. DPM is typically the responsibility of data engineering or data platform teams.
By monitoring the data pipeline, you ensure that your ETL (Extract, Transform, Load) processes are running smoothly and that data flows seamlessly between the different stages of the pipeline. This helps you avoid bottlenecks and keeps your data up to date and ready for analysis.
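As a rough illustration, here is a minimal Python sketch of the freshness and volume side of DPM, assuming the table metadata (last update time and row count) has already been pulled from the warehouse or orchestrator. The table names, numbers, 24-hour SLA, and 50% volume tolerance are made up for the example, not a reference implementation.

```python
from datetime import datetime, timedelta, timezone

# Illustrative snapshot of pipeline metadata. In practice this would come from
# the warehouse or orchestrator (e.g. Snowflake's INFORMATION_SCHEMA.TABLES or
# Airflow's metadata database); the names and numbers here are made up.
TABLE_METADATA = [
    {"table": "orders",    "last_updated": datetime.now(timezone.utc) - timedelta(hours=2),  "row_count": 2_600_000, "expected_rows": 1_250_000},
    {"table": "customers", "last_updated": datetime.now(timezone.utc) - timedelta(hours=30), "row_count": 88_000,    "expected_rows": 90_000},
]

FRESHNESS_SLA = timedelta(hours=24)   # assumed per-table freshness threshold
VOLUME_TOLERANCE = 0.5                # assumed: flag row counts more than 50% off baseline

def pipeline_alerts(metadata):
    """Flag tables that are stale or whose row count deviates from the baseline."""
    now = datetime.now(timezone.utc)
    alerts = []
    for m in metadata:
        if now - m["last_updated"] > FRESHNESS_SLA:
            alerts.append(f"{m['table']}: not refreshed within {FRESHNESS_SLA}")
        deviation = abs(m["row_count"] - m["expected_rows"]) / m["expected_rows"]
        if deviation > VOLUME_TOLERANCE:
            alerts.append(f"{m['table']}: row count off baseline by {deviation:.0%}")
    return alerts

for alert in pipeline_alerts(TABLE_METADATA):
    print(alert)
```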
Data quality monitoring: Assessing the contents of the data
Data quality monitoring (DQM), on the other hand, focuses on the contents of the data. DQM includes aspects such as freshness (how old the values are), completeness (rate of nulls, blanks, etc.), duplication, and format compliance. DQM is often the responsibility of data science and analytics teams, who need to ensure that the data they use is accurate and reliable.
By implementing DQM, you can identify issues such as null values, duplicates, and outliers that may affect the accuracy of your data-driven insights. With proper DQM in place, your ML models and analytics work off of high-quality data, ultimately leading to better decision-making.
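To make this concrete, here is a minimal Python sketch of two common DQM checks, a null-rate check and a duplicate-key check, run against an illustrative batch of records. The column names, sample rows, and helper functions are assumptions for the example.

```python
from collections import Counter

# Illustrative batch of records; column names and values are made up.
ROWS = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 2, "email": "b@example.com"},  # duplicate id
]

def null_rate(rows, column):
    """Fraction of rows where the column is null or blank."""
    if not rows:
        return 0.0
    missing = sum(1 for r in rows if r.get(column) in (None, ""))
    return missing / len(rows)

def duplicate_keys(rows, key):
    """Values of `key` that appear in more than one row."""
    counts = Counter(r[key] for r in rows)
    return [value for value, n in counts.items() if n > 1]

print(f"email null rate: {null_rate(ROWS, 'email'):.0%}")   # 33%
print(f"duplicate ids:   {duplicate_keys(ROWS, 'id')}")     # [2]
```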
The importance of both data pipeline and data quality monitoring
While DPM and DQM can be done with two separate systems, to truly understand the behavior of your pipeline, you should correlate information from both sources. For instance, if you notice that a table has been refreshed later than usual with a larger number of rows, and you also find a significant number of duplicated IDs, this could indicate an issue with an ETL job. In this case, combining data pipeline monitoring (freshness and volume) with data quality monitoring (duplicates) can help you identify and resolve the problem.
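Here is a hypothetical Python sketch of that correlation: it combines a pipeline signal (a row-count spike) with a quality signal (duplicate IDs) to suggest a probable cause. The diagnose_load helper and the 1.5x volume threshold are assumptions made for illustration.

```python
# Illustrative correlation of a pipeline signal (row-count spike) with a
# quality signal (duplicate IDs); the threshold and inputs are assumptions.
def diagnose_load(row_count, expected_row_count, duplicate_ids):
    volume_spike = row_count > 1.5 * expected_row_count
    if volume_spike and duplicate_ids:
        return "Likely ETL issue: the load grew unexpectedly and introduced duplicate IDs."
    if volume_spike:
        return "Volume anomaly only: check upstream sources for a backfill or double load."
    if duplicate_ids:
        return "Duplicates only: check the transformation logic."
    return "No anomaly detected."

print(diagnose_load(row_count=2_600_000, expected_row_count=1_250_000, duplicate_ids=[2]))
```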
You want to prioritize data pipeline monitoring over data quality monitoring. If the data isn't flowing smoothly through the pipeline, there's little point in worrying about its quality. Once the data engineering team has ensured the smooth operation of the data pipeline, they can hand over the responsibility of data quality monitoring to the data science and analytics teams. This division of labor lets each team focus on its area of expertise and ensures that both aspects of data management are adequately addressed.
The role of analytics engineers in pipeline and quality monitoring
With the rise of tools like dbt, the role of analytics engineer has evolved into a mix of data analyst and data engineer. Analytics engineers understand how the data is consumed in dashboards and statistical models, and write SQL to perform data transformations. They can serve as a valuable bridge in the correlation work mentioned above.
In practice: The intersection of data pipeline and data quality monitoring
In reality, the division between data pipeline monitoring and data quality monitoring is not always clear-cut. However, having a strong understanding of the two concepts and their respective responsibilities can help organizations make informed decisions about which aspects of their data management processes need attention.