How Bigeye detected issues that our dbt CI jobs missed
This is a story about broken data, and how Bigeye pinpointed the issue, minimizing the impact to internal analytics. Crisis, averted. Read on to learn more.
At Bigeye we use a variety of tools in our data stack to analyze our customers' usage patterns. We use dbt to transform our data into meaningful insights, and Bigeye to observe our data reliability and quality.
As a relatively new user of dbt, I unknowingly broke our internal dbt runs and created stale data. This is the story of how Bigeye detected that problem, and helped us minimize the impact to our internal analytics.
The dbt seed file change
We use dbt seeds to map our customer metric usage data to overarching metric themes. For example, we categorize freshness and volume metrics under a pipeline reliability theme, while formatting metrics like null percentage or percent UUID fall under a data quality theme. If you are unfamiliar with dbt seeds, they are CSV files that dbt loads into your warehouse as tables, and they are great for static data that changes infrequently, like a mapping of metric categories to overarching business themes.
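As a rough sketch of what such a seed looks like, here is a small CSV in the style of ours (the metric names and theme labels below are illustrative, not our actual file contents):

    metric_type,metric_theme
    hours_since_max_timestamp,pipeline reliability
    row_count,pipeline reliability
    percent_null,data quality
    percent_uuid,data quality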
In a seemingly simple update, I added a new column to our metric_category.csv seed file in dbt for a different way of categorizing metric types that the customer success team preferred. It was at a slightly lower grain, which we wanted to use to improve our customer usage tracking. The update was very simple: I went into the seed file and added a new column with a new mapping for each of the 80+ column-level metric types that Bigeye offers.
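Conceptually, the change just added one more column to the sketch above, mapping each metric type to a lower-grain category (again, the column name and values here are made up for illustration; the real file covers 80+ metric types):

    metric_type,metric_theme,metric_subcategory
    hours_since_max_timestamp,pipeline reliability,freshness
    row_count,pipeline reliability,volume
    percent_null,data quality,completeness
    percent_uuid,data quality,validity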
Discovering the data error
After adding in this new column, I pushed the commit and created a new pull request to merge this change into production. As a best practice, we run CI checks on every pull request, which dbt Cloud enables via Slim CI jobs. This CI check is intended to give users confidence that their new code is building as expected and not introducing errors when merged into the main branch. So when the CI job passed, I felt good and merged my small change into production. That night, however, all of our dbt runs failed due to the changes made in that PR.
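For context, a Slim CI job typically builds only the models and seeds modified in the pull request, deferring to production artifacts for everything else. A generic command along these lines captures the pattern (this is a sketch of the approach, not our exact dbt Cloud job definition, and the state path is a placeholder):

    dbt build --select state:modified+ --defer --state ./prod-run-artifacts

Because a CI run like this builds the modified seed into a fresh, temporary schema, the new column is created there without conflict; the clash with the already-existing production table only surfaces in the scheduled production run.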
Fortunately, since we use Bigeye to monitor all of our internal analytics tables for freshness and volume, I was quickly notified of the stale datasets the next morning. The image below shows the exact metric we configured in Bigeye to report the stale table. In this example, we created a Bigeye metric that checks the hours since the latest value in the _DBT_RUN_AT column every 60 minutes, using our autothresholds. Because those autothresholds had already learned the patterns in our update frequency, we were alerted via Slack when the observed value exceeded the typical hours since the latest value.
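Under the hood, the check is easy to reason about: it measures how many hours have passed since the table last received data. The Snowflake-style SQL below is only an approximation of what that metric observes (the table name is hypothetical, and Bigeye computes and evaluates this for you rather than requiring a handwritten query):

    -- roughly what an "hours since latest value" freshness check measures
    select datediff('hour', max(_DBT_RUN_AT), current_timestamp()) as hours_since_latest_value
    from analytics.metric_usage;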
The broken data and the Bigeye fix
After being notified by Bigeye that morning, I quickly acknowledged the issue and went into investigation mode to determine the root cause and total impact. It didn’t take long to realize that our dbt runs were failing because the dbt seed was trying to truncate the existing table and re-insert data into a table where the new column did not exist. The solution was to temporarily update our dbt job to use dbt seed --full-refresh. The --full-refresh flag forces dbt to drop (cascade) the existing table before rebuilding it, which fixed the problem.
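The temporary job change amounted to a one-line command. As a sketch (the seed selector here is illustrative; dropping --select would full-refresh every seed in the project):

    dbt seed --select metric_category --full-refresh

Once the seed table had been rebuilt with the new column, the job could go back to a plain dbt seed, since the normal truncate-and-insert behavior works again once the columns match.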
After the updated dbt job ran successfully, I refreshed the metric in Bigeye, it returned to a healthy state, and the issue on the table was auto-healed. I could also have waited out the 60-minute schedule and let the metric run and the healing happen without any action on my part, but I was eager to close the issue and move on with my day.
Final thoughts
Sometimes the most seemingly minor changes can create significant issues in our data. Even when we use best practices like CI jobs to keep these issues from making their way into our production environments, some still slip through. Having a data observability solution like Bigeye can help detect issues when all else fails and provide additional confidence in the reliability of our data.