Efficiency with issue monitoring states
In this post, we'll take a general look at the lifecycle of an issue, from triage to close, using Bigeye.
As a data engineer or data reliability engineer, you need to decide quickly how to handle the data quality issues Bigeye detects. Those decisions need to be efficient, and they need to let you act with confidence that data quality improves with each issue handled. You also need a mechanism that focuses attention on the urgent problems, avoids duplicated effort across teammates, and makes it easy to report progress to your stakeholders. Changing the status of issues is the primary way you can convey this in Bigeye.
Initially, Bigeye emulated how observability tools handle anomalies and issues in metrics: a triage state for new issues or issues that need to be re-evaluated, an acknowledged state for issues under investigation, and a closed state for resolved issues. Here’s the general lifecycle of an issue.
In practice, we’ve found that by looking at a metric’s chart, you can often tell whether an issue is due to the data or due to the metric’s thresholds. The problem is that a pipeline fix may take hours or days to deploy and to push corrected data through. That leaves a long gap between reaching a conclusion and seeing proof that the problem is fixed. Data engineers need a way to separate the issues they have reached conclusions on from newly arriving issues or issues that require further investigation.
This is where Bigeye’s new monitoring state comes in. With it, data engineers can specify whether an issue is a data problem or a threshold problem and set expectations for how the anomalous metric should behave moving forward. This eliminates toil: Bigeye takes on the responsibility of monitoring the metric and sends out a notification when expectations are met. Here’s the updated lifecycle of an issue.
Each state now has a single purpose, which makes it easier to prioritize issues based on whether or not they need human intervention. The table below summarizes each state’s purpose.

| State | Purpose |
| --- | --- |
| Triage | Newly detected issues, or issues that need to be re-evaluated |
| Acknowledged | Issues under active investigation, no conclusion reached yet |
| Monitoring | A conclusion has been made; Bigeye watches for a return to health |
| Closed | Resolved issues |
The states make triaging issue lists more efficient. When you visit an issue list, prioritize the issues in the triage state first. If there are none, look into the acknowledged issues: they are under investigation and haven’t reached a conclusion. If your issue list only has items in monitoring, you’re all set, because those issues are just waiting for a healthy run. Used this way, the states both track progress and make that progress visible to your stakeholders.
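As a rough illustration of that triage order, a script could sort an issue list so triage items surface first. The dict shapes and state names below are hypothetical stand-ins, not a Bigeye API response:

```python
# Work issues in the order described above: triage first, then
# acknowledged; monitoring items are Bigeye's job, so they sort last.
STATE_PRIORITY = {"TRIAGE": 0, "ACKNOWLEDGED": 1, "MONITORING": 2}

# Hypothetical issue records for illustration.
issues = [
    {"id": 101, "status": "MONITORING"},
    {"id": 102, "status": "TRIAGE"},
    {"id": 103, "status": "ACKNOWLEDGED"},
]

for issue in sorted(issues, key=lambda i: STATE_PRIORITY[i["status"]]):
    print(issue["id"], issue["status"])  # handle the list top to bottom
```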
An issue’s state can be set from Slack, from issue list pages, and from the issue details page. The timeline on the issue details page shows the state transitions and comments, summarizing the lifecycle of the issue and providing context when you revisit it days later.
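If you handle issues at volume, these transitions could also be scripted. Here’s a minimal sketch against an HTTP API; the endpoint path, payload fields, and state values are assumptions for illustration, not Bigeye’s documented API:

```python
# Hypothetical sketch of driving issue state changes over HTTP.
# The endpoint, field names, and enum values are illustrative assumptions.
import requests

BIGEYE_URL = "https://app.bigeye.com"  # assumed base URL for your workspace
API_TOKEN = "..."                      # your API credential

def set_issue_state(issue_id: int, state: str, comment: str,
                    feedback: str | None = None) -> None:
    """Move an issue into a new state, optionally with threshold feedback."""
    payload = {"status": state, "comment": comment}
    if feedback is not None:
        payload["feedback"] = feedback  # e.g. maintain vs. adapt autothresholds
    resp = requests.put(
        f"{BIGEYE_URL}/api/v1/issues/{issue_id}/status",  # hypothetical endpoint
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json=payload,
    )
    resp.raise_for_status()

# Example: acknowledge an issue while you investigate it.
set_issue_state(12345, "ACKNOWLEDGED", "Looking into last night's load job.")
```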
So, what does this mean in practice? Here are three scenarios where monitoring states can help you monitor your most critical datasets:
- When a fix is coming
- When the data is changing
- When downstream tables are impacted
Scenario 1: When a fix is coming
You’ve been alerted about an anomaly and have concluded that it is a bug caused by a change in a data pipeline or by an upstream data delivery delay. You know why the problem exists and may have already deployed the fix. Either way, you expect the metric to return to within the existing thresholds once the pipeline updates and its execution completes. You want Bigeye to maintain the current thresholds but not alert again if the same values recur before the metric has first returned to health.
In this situation, you would transition the issue to the new monitoring state and leave a comment with the reasoning. If the metric uses autothresholds, you’d also give Bigeye feedback to maintain the thresholds. The issue will then auto-close and send a notification when the metric returns to health, with no further intervention.
Changing the state to monitoring lowers the issue’s urgency and removes the need to revisit it on subsequent triage passes while you wait for the data to return to normal.
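In code, the Scenario 1 decision might look like the call below, reusing the hypothetical set_issue_state helper sketched earlier; MAINTAIN_THRESHOLDS is an assumed label for the “maintain thresholds” feedback, not a documented enum:

```python
# Scenario 1: keep the current thresholds, hand the issue to Bigeye to
# watch, and record the reasoning. Values shown are illustrative.
set_issue_state(
    issue_id=12345,
    state="MONITORING",
    comment="Upstream delivery delay; backfill deployed, expecting recovery overnight.",
    feedback="MAINTAIN_THRESHOLDS",
)
```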
Scenario 2: When the data is changing
You’ve been alerted about an anomaly but know that this is a legitimate change. Maybe a new data set has been added, or maybe the thresholds need to be more permissive to account for a new normal.
If you are using manual thresholds, you can modify your metric’s threshold configuration, put the issue into the monitoring state, and leave a comment with the reasoning while you wait for the metric to return to health.
If you are using autothresholds, you want Bigeye to adapt the current thresholds to accept the new values moving forward without triggering another notification. Transition the issue to the new monitoring state, give Bigeye feedback to adapt the thresholds, and leave a comment with the reasoning. The autothresholds will update, and Bigeye will auto-close the issue when the metric returns to health.
As in the previous scenario, changing the state to monitoring lowers the issue’s urgency and removes the need to revisit it on subsequent triage passes while you wait for the thresholds to adapt.
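The Scenario 2 decision differs only in the feedback given. Again using the hypothetical helper from earlier, with ADAPT_THRESHOLDS as an assumed label for the “adapt thresholds” feedback:

```python
# Scenario 2: the new values are the new normal, so ask Bigeye to adapt
# the autothresholds to accept them. Values shown are illustrative.
set_issue_state(
    issue_id=67890,
    state="MONITORING",
    comment="New data source onboarded; row counts legitimately doubled.",
    feedback="ADAPT_THRESHOLDS",
)
```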
Scenario 3: When downstream tables are impacted
Monitoring states can also dramatically improve the efficiency of issue management when combined with Bigeye’s lineage capabilities. With lineage involved, you can set a monitoring state and give threshold feedback on an issue and on all the issues downstream of it with a single action. So if an issue is detected at an upstream table, you can apply the same monitoring state and feedback to that table’s metrics and to the metrics of its downstream tables at once.
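Here’s a sketch of what that single action could look like programmatically, reusing the hypothetical helper and constants from the earlier sketch. The lineage endpoint and response shape are assumptions; in the Bigeye UI this is one action, not a loop:

```python
# Scenario 3: apply one decision to an issue and every issue downstream
# of it. Reuses BIGEYE_URL, API_TOKEN, and set_issue_state from above.
import requests

def cascade_monitoring_state(issue_id: int, comment: str, feedback: str) -> None:
    """Set the monitoring state on an issue and all issues downstream of it."""
    resp = requests.get(
        f"{BIGEYE_URL}/api/v1/issues/{issue_id}/downstream",  # hypothetical endpoint
        headers={"Authorization": f"Bearer {API_TOKEN}"},
    )
    resp.raise_for_status()
    downstream_ids = [i["id"] for i in resp.json()]  # assumed response shape
    for target_id in [issue_id, *downstream_ids]:
        set_issue_state(target_id, "MONITORING", comment, feedback)

cascade_monitoring_state(12345, "Root cause fixed at the source table.", "MAINTAIN_THRESHOLDS")
```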
Because Bigeye monitors each issue and sends a notification when it auto-closes upon returning to health, your pipeline’s status should improve as new metric runs cascade through. This eliminates the follow-up interactions otherwise required to resolve each related downstream issue individually.
If the fix leaves unresolved problems in the pipeline, those issues will remain open. This is a good thing: progress has been made because the open issue list has been whittled down, and a fresh investigation can focus on the remaining issues.
Conclusion
Bigeye’s monitoring state is a simple mechanism that lets you declare the expected behavior in response to an issue and lets the system take care of the rest. The workflow for managing data issues becomes clearer because issues that require investigation are separated from those where a resolution decision has been made but the metric needs time to return to health. When many issues are involved, the states reduce the number of repeated interactions with individual issues, and as pipelines get deeper, applying monitoring states in bulk down lineage relations helps the workflow scale.