Using Bigeye Collections to organize and scale your monitoring
There are four different ways to use Bigeye Collections to organize and scale your monitoring. Let's explore each one.
You’ve onboarded your tables, deployed Autometrics, and you're monitoring your data. Over time, your inner Marie Kondo may feel the need to get even more organized. How can Bigeye help you achieve this organized state?
Bigeye Collections let you bundle metrics and route notifications, presenting a shared view that a team can own. Different users want to track different things, so owners and consumers can mark the collections they care about as favorites to surface their status quickly.
Each collection has an owner, or a team of owners, responsible for keeping it healthy and communicating with its consumers. The owners decide how to react when problems are flagged.
How do people organize their collections in practice? Here are a few strategies:
1. By team and project: One ETL silo vs. another ETL silo
2. By source system: Stitch, Fivetran, and dbt operations
3. By metric type: Data ops / data reliability engineer vs. data analyst
4. By urgency: Route notifications to PagerDuty vs. specific Slack channels vs. email
By team and project
Most commonly, users employ Bigeye Collections to divide monitoring by team. Each team is responsible for a subset of the data in the data warehouse or data lake, and notifications are routed specifically to them. Users can go one level deeper with a collection per project or dataset; the extra organizational level keeps short-lived datasets and experiments separate from long-lived production datasets.
These datasets are generally owned by a line of business or a function within the company. For example, Marketing may have metrics about campaigns, Engineering about app reliability, Sales about the sales funnel, and Product about usage analytics. Each group cares primarily about its own datasets, because they drive its business decisions and actions. When anomalies are detected in those datasets, the group wants to be informed and to understand the cause so it can act.
By federating responsibility for data quality, long-lived datasets are easier to share across teams. Bigeye’s checks provide part of a data contract, enabling data engineers and analysts to trust data as they combine different datasets to produce new insights. For example, a Product team may need data from Engineering, Sales, and Marketing to segment feature usage by customer vertical, customer pipeline status, or marketing campaign.
By source system
Some users give one team responsibility for data ingest operations, while other teams handle transformations and reporting. The ingest team enables the different lines of business, who have the domain expertise, to get the data they need. They may not know the data itself, but they own the processes and operations of tools like Fivetran, Stitch, and Airbyte that bring the data in.
In these situations, the Bigeye user is typically a data ops person who creates collections of operational metrics (freshness, row count) for each group of ingested tables. These metrics tend to cover a wide set of tables and schemas. Bigeye notifications for these collections are often sent to the same channels the ingest tools already use for their own notifications.
Examples:
- Engineering wants to analyze data from an application’s database. They create a collection covering the application’s Postgres database and the data warehouse tables a Stitch job replicates into. In this case, a Bigeye Deltas job can further validate the replicated data.
- Sales and Marketing want funnel data to project revenue growth. They create a collection for the data warehouse replica of HubSpot or Salesforce data that Fivetran has ingested.
- Product wants analytics data from Heap. They create a collection with operational metrics against a dataset delivered via a Snowflake share.
By metric type
Some customers split collections to focus on different facets of their monitoring – “pipeline reliability” collections, “data quality” collections, and “business insight” collections – which correspond to different stakeholders. Problems in the operational tier need to be resolved before schema constraint problems can be addressed, and schema constraint problems need to be resolved before the data can yield clean business metrics. Splitting these into separate collections makes it easier to prioritize issues and to route notifications to the team responsible for each tier.
A “pipeline reliability” collection is responsible for making sure that data pipelines are connected and that data is flowing. A collection that tracks only freshness and row count metrics gives a data ops person a quick, at-a-glance health summary. The pipeline reliability collection often verifies that data from an app makes it into the data warehouse, that pipelines (dbt, Airflow) continue to run, or that backfills have returned data to a steady state. These collections tend to be slightly larger than average, with 20-100 metrics covering a few dozen tables spread across several schemas.
A “data quality” collection is responsible for enforcing that pipeline data is clean. This is tracked with constraint metrics such as null counts and primary key uniqueness. These metrics matter most to the data engineers who built the pipelines and to the people consuming the data. These collections typically have fewer than 30 metrics on tables in a schema.
Finally, a “business insight” collection is responsible for identifying when data values stray from the norm. A detected anomaly may indicate a semantic change, a bug in the pipeline, or a new trend that can affect data-driven decision making. For example, one could track usage by customer segment for different features or engagement patterns, using a grouped metric for dimension tracking or relative metrics to ensure values are monotonically increasing or decreasing. These collections typically have fewer than 30 metrics on a few tables in a schema.
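To make the split concrete, here is a minimal sketch of the three tiers as plain Python data. It is illustrative only: the metric type names, owners, and notification channels are assumptions for this example, not Bigeye API objects.

```python
# Illustrative only: a plain-Python description of the three collection tiers.
# Metric types, owners, and channels are assumptions, not Bigeye API objects.
from dataclasses import dataclass

@dataclass
class CollectionTier:
    name: str
    metric_types: list[str]
    owner: str
    notify: str

TIERS = [
    CollectionTier("pipeline reliability", ["freshness", "row_count"], "data ops", "#data-ops-alerts"),
    CollectionTier("data quality", ["null_percent", "primary_key_uniqueness"], "data engineering", "#data-quality"),
    CollectionTier("business insight", ["grouped_metric", "relative_change"], "analytics", "#analytics"),
]

for tier in TIERS:
    # Print a quick summary of what each tier tracks and where it notifies.
    print(f"{tier.name}: {', '.join(tier.metric_types)} -> {tier.notify} ({tier.owner})")
```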
By urgency
Some Bigeye users divide their collections by alert urgency. For example, some users have an “urgent” collection whose notifications go to PagerDuty via email to page the on-call person, and a “normal” collection that sends messages to Slack for handling during regular business hours.
Collections labeled “urgent” tend to contain a few critical metrics on production tables. Conversely, collections labeled “non-critical” may have hundreds of metrics and are treated as informational rather than immediately actionable.
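If you also consume alert events in your own tooling (for example, via a webhook), a small routing rule keeps this split explicit. The sketch below is hypothetical: the payload fields, collection names, and send helpers are assumptions, not Bigeye’s webhook schema.

```python
# Hypothetical alert fan-out by collection urgency.
# The payload shape, collection names, and send_* helpers are illustrative assumptions.

URGENT_COLLECTIONS = {"checkout urgent", "billing urgent"}

def send_to_pagerduty(alert: dict) -> None:
    # In practice this would use a PagerDuty email or events integration.
    print(f"PAGE on-call: {alert['metric']} on {alert['table']}")

def send_to_slack(alert: dict, channel: str) -> None:
    # In practice this would post to a Slack webhook.
    print(f"Slack {channel}: {alert['metric']} on {alert['table']}")

def route_alert(alert: dict) -> None:
    """Route an alert based on the collection it belongs to."""
    if alert["collection"] in URGENT_COLLECTIONS:
        send_to_pagerduty(alert)              # wake someone up now
    else:
        send_to_slack(alert, "#data-alerts")  # triage during business hours

route_alert({"collection": "checkout urgent", "metric": "freshness", "table": "prod.orders"})
```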
Combining conventions
The strategies above do not need to be used in isolation; they can be combined to keep collections at a manageable scale. In fact, roughly 70% of our users’ collections follow a naming convention that combines at least two of these techniques in the collection name.
Examples:
- [team] [project] [ops|schema|biz] [priority]
- [source] [ops|schema] [priority]
- [team] [priority]
Most Bigeye Collections have fewer than 30 metrics in them. Grouping at this scale allows data platform owners and leaders to manage data pipelines at a coarser grain.
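As a sketch of how a convention like [team] [project] [ops|schema|biz] [priority] might be kept consistent, a small helper can build and validate collection names before they are created in Bigeye. The allowed tiers and priorities below are assumptions drawn from the templates above, not a Bigeye requirement.

```python
# Minimal name builder for a "[team] [project] [ops|schema|biz] [priority]" convention.
# The allowed tiers and priorities are illustrative assumptions.

ALLOWED_TIERS = {"ops", "schema", "biz"}
ALLOWED_PRIORITIES = {"urgent", "normal"}

def collection_name(team: str, project: str, tier: str, priority: str) -> str:
    """Build a lowercase collection name and reject values outside the convention."""
    if tier not in ALLOWED_TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    if priority not in ALLOWED_PRIORITIES:
        raise ValueError(f"unknown priority: {priority!r}")
    return " ".join([team, project, tier, priority]).lower()

print(collection_name("Marketing", "campaigns", "ops", "urgent"))
# -> "marketing campaigns ops urgent"
```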
Summary
Keeping your Bigeye Collections organized and human-readable helps keep your metrics healthy and your data reliable. There's no single right way to manage your collections, but hopefully the strategies above have given you inspiration for the routes you might take, depending on your team, goals, and time frames. Ultimately, if your collections facilitate an easier handoff of responsibility between teams, your inner Marie Kondo can be happy.