The Five C's: A Blueprint for Building High-Scale Data Observability Platforms
At Bigeye, our core engineering team brought a wealth of experience directly from enterprise environments: they've managed thousands of data consumers, maintained repositories of 100k+ datasets, and operated over 1,000 microservices.
Those backgrounds deeply informed our data architecture decisions.
Inspired by another CEO's discussion of "Four C's" in relation to their own observability company, I realized the importance of distilling our core architectural insights into our own set of guiding principles, the "Five C's," which have acted as the foundation for building Bigeye.
By understanding how these elements intertwine, organizations can build observability platforms that not only identify issues but also adapt to the unique landscapes of diverse data environments.
Connectivity: Getting end-to-end coverage
The first critical “C” is Connectivity.
Observability platforms aim to help data teams identify when, where, and why problems are occurring in their environment. Ideally this happens as early in the pipeline as possible, before the issue has propagated downstream. This creates a strong incentive to provide visibility as far upstream in the pipeline as possible.
A platform that can only monitor 80% of the total data pipeline space will never perform as well as one that can cover 90%. Getting to 100% may be intractable in some cases, but incremental gains in this area can help drive down mean time to resolution, and improve the accuracy of the impact assessment.
The set of pipelines in an enterprise is likely to span a mix of technologies from the 1990s, 2000s, and 2010s, with possibly a handful of truly modern technologies sprinkled in as well. To achieve good coverage in this kind of environment, the platform needs either to ship with an existing library of connectors or to make it simple for the customer's engineering team to get metrics and metadata out of the target systems and into the observability platform.
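As a rough illustration of what such a connector contract might look like, here is a minimal sketch in Python. The Connector interface, the DatasetMetric shape, and the warehouse queries are all hypothetical, not Bigeye's actual API; the point is that each source system only needs to expose a small, uniform surface for metrics and metadata.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterable


@dataclass
class DatasetMetric:
    """A single observation about a dataset, e.g. row count or hours since last load."""
    dataset: str
    metric_name: str
    value: float


class Connector(ABC):
    """Hypothetical contract a source-system integration must satisfy."""

    @abstractmethod
    def list_datasets(self) -> Iterable[str]:
        """Enumerate the datasets (tables, topics, files) visible in the source."""

    @abstractmethod
    def collect_metrics(self, dataset: str) -> Iterable[DatasetMetric]:
        """Pull metrics and metadata for one dataset into the platform."""


class WarehouseConnector(Connector):
    """Sketch of a SQL-warehouse connector; the queries are illustrative only."""

    def __init__(self, client):
        # `client` is any object exposing query(sql) -> list[dict]; assumed, not a real driver.
        self.client = client

    def list_datasets(self) -> Iterable[str]:
        return [row["name"] for row in self.client.query("SHOW TABLES")]

    def collect_metrics(self, dataset: str) -> Iterable[DatasetMetric]:
        rows = self.client.query(f"SELECT COUNT(*) AS n FROM {dataset}")
        yield DatasetMetric(dataset, "row_count", float(rows[0]["n"]))
```

With a contract this small, adding coverage for a 1990s-era source or a brand-new one becomes a matter of writing one more adapter rather than rearchitecting the platform.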
An observability tool that’s centered too strongly around one specific element in the pipeline (e.g. the orchestration / transformation layer, or the data warehouse) will struggle to deliver the coverage needed to produce the benefits the business is ultimately looking for.
Connectivity serves as the backbone of observability, offering insight into the when, where, and why of issues occurring in the data environment. Covering as much of the pipeline as possible is essential, as it aids early identification, impact assessment, and critical communication during triage.
Configurability: Adjusting for Unique Requirements
Our second “C” is Configurability.
Configurability within data observability refers to the platform's capability to be adjusted, fine-tuned, or tailored according to specific needs and use cases. It enables users to customize settings, rules, or configurations based on the unique requirements of their data infrastructure, ensuring flexibility and adaptability.
By allowing users to adapt settings to different use cases, configurability fosters a more precise and effective observability platform. For instance, consider a multinational corporation with multiple data sources spanning different technologies. Configurability allows the observability platform to cater to various databases, data formats, and processing methods, accommodating the intricacies of each system.
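To make that concrete, here is a hedged sketch of what per-source configuration could look like. MonitoringConfig and its fields are hypothetical, not a real Bigeye schema; they simply show one platform tuned very differently for a near-real-time warehouse table and a legacy nightly batch source.

```python
from dataclasses import dataclass


@dataclass
class MonitoringConfig:
    """Hypothetical per-source settings an operator can tune."""
    freshness_sla_hours: int = 24        # how stale data may get before alerting
    null_rate_threshold: float = 0.01    # max tolerated fraction of nulls
    schedule_cron: str = "0 * * * *"     # how often checks run
    sample_rows: int = 0                 # 0 = full scan; >0 caps query cost on big sources


# The same platform, configured differently for two very different systems.
configs = {
    "snowflake.finance.transactions": MonitoringConfig(
        freshness_sla_hours=1,
        schedule_cron="*/15 * * * *",    # fresh, frequently loaded data
    ),
    "db2.legacy.orders": MonitoringConfig(
        freshness_sla_hours=48,          # nightly batch with occasional delays
        sample_rows=100_000,             # full scans are too expensive here
    ),
}

print(configs["db2.legacy.orders"])
```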
Each organization will want to address the use cases they bought the tool for, but they have to navigate the unique set of requirements (read: obstacles) that exist in their organization. If the tool can’t adapt to those requirements, it can’t deliver on the use cases. This stalls the path to ROI for the organization and limits the value they can get from their purchase.
There are a lot of data observability players out there these days. The barrier to entry is low, at least for creating a working product that can serve purely modern-data-stack environments (operated by small, highly technical teams that mostly care about a single warehouse and a single analytics tool). In these cases, configurability isn't as critical.
Where things start to get tough is in heterogeneous environments with multiple decades' worth of technologies stitched together and managed by multiple teams, all with different business priorities. They need column-level lineage from DB2 and SAP, into Snowflake AND Databricks (yes, both), into five different analytics tools owned by five different lines of business.
By creating a platform that is flexible enough to accommodate a wide variety of customers, we can meet the needs of each unique use case.
Composability: Scaling gracefully
Our third critical design element is Composability.
At Bigeye, we prefer a way of doing things that's less about building different solutions for each need and more about creating a single, versatile framework that keeps things organized and moving forward.
The naive approach to supporting each additional use case customers ask for is to build a new workflow every time. This addresses the problem quickly by giving the customer a path to follow in the product to solve that use case. But it also creates product sprawl over time, as more and more use-case-specific workflows get bolted onto all sides of the core product.
We believe in this streamlined style of working because it helps us build a system that's both adaptable and cohesive. By creating a system that's made up of distinct parts but still works as a whole, we're ready for whatever changes come our way in the world of data observability.
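As a rough sketch of that philosophy (the names below are hypothetical, not Bigeye's internals), a handful of composable primitives (a metric, a rule, and a notification target) can be combined into many use cases instead of hard-coding a workflow for each one:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Check:
    """One composable unit: how to measure, when to alert, and where to send it."""
    metric: Callable[[], float]     # produces a measurement
    rule: Callable[[float], bool]   # returns True when the measurement is acceptable
    notify: Callable[[str], None]   # delivers the alert

    def run(self, name: str) -> None:
        value = self.metric()
        if not self.rule(value):
            self.notify(f"{name} failed: observed {value}")


# A freshness use case and a volume use case built from the same primitives;
# a new use case means composing parts, not building another workflow.
freshness = Check(
    metric=lambda: 3.0,              # hours since last load (stubbed)
    rule=lambda hours: hours < 2,
    notify=print,
)
volume = Check(
    metric=lambda: 10_500.0,         # today's row count (stubbed)
    rule=lambda rows: rows > 1_000,
    notify=print,
)

for name, check in {"freshness": freshness, "volume": volume}.items():
    check.run(name)
```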
Controllability: Watching the watchmen
The fourth factor in Bigeye’s design is Controllability.
Data observability opens a ton of questions because of its connection to the business’s data.
There's the obvious question of who can see which data, at what granularity, and how to control that so folks can still perform their roles while minimizing risk to the business (and, really, to the customers whose data is being processed).
But there’s also the question of who can operate the observability platform itself. When fully deployed these things have the potential to generate a lot of noise through alerting, and a lot of infrastructure load on the target systems being monitored. The business wants to give freedom to data engineers, scientists, analysts, stewards, etc. to collaborate on the quality and reliability of the pipelines so they can get the most from their data. But a lack of controls could easily lead to situations where monitoring is misconfigured and causes problems.
Controllability should give the organization a set of tools to construct that balance. This could mean engineering has configuration-driven monitoring defined in code and kept under version control. It could also mean that UI-driven changes can be easily rolled back and are fully auditable. Monitoring configuration may need to first get created in a draft or staging state, and then get promoted to production once it’s confirmed to be the right setup.
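One way to picture that balance is monitoring configuration that carries an explicit lifecycle state and an audit trail. This is a hedged sketch under assumed names (MonitorConfig, promote, and rollback are illustrative, not Bigeye's actual model):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class State(Enum):
    DRAFT = "draft"
    PRODUCTION = "production"


@dataclass
class MonitorConfig:
    """Hypothetical monitoring config with a draft-to-production workflow and audit log."""
    name: str
    threshold: float
    state: State = State.DRAFT
    audit_log: list = field(default_factory=list)

    def _record(self, actor: str, action: str) -> None:
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        self.audit_log.append(f"{stamp} {actor}: {action}")

    def promote(self, actor: str) -> None:
        """Only a draft can go to production, and the change is attributable."""
        if self.state is not State.DRAFT:
            raise ValueError("only drafts can be promoted")
        self.state = State.PRODUCTION
        self._record(actor, "promoted to production")

    def rollback(self, actor: str) -> None:
        """UI-driven changes stay reversible and auditable."""
        self.state = State.DRAFT
        self._record(actor, "rolled back to draft")


cfg = MonitorConfig(name="orders_freshness", threshold=2.0)
cfg.promote(actor="data-platform-ci")       # e.g. applied from version-controlled config
cfg.rollback(actor="analyst@example.com")   # e.g. undone from the UI
print(cfg.audit_log)
```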
Consumability: Optimizing for the Operator
Last but not least, Consumability has been top of mind while building and scaling Bigeye.
Early in my career, Edward Tufte had a big impact on my thinking about the human-data relationship. The concept of limited bandwidth between the human and the data, and how best to use that bandwidth, is an important consideration in this workflow. A world-class data observability platform should be designed to optimize the use of the man-machine interface.
I believe we'll eventually get to see autonomous resolution workflows happening in the real world, where the observability platform can run through a runbook on its own and attempt to fix the root cause of the issue before involving a human operator.
Until that happens, consumability is going to be a critical factor. Once the issue is detected and the human gets paged, the man-machine interface starts to impact the MTTR. A less consumable interface reduces bandwidth between the signals the observability system can present about the data pipeline, and the brain of the operator who has to respond to the situation and develop a resolution. Slowing down the operator lengthens each resolution time and increases the cost of each outage.
This includes simpler questions like: "What data am I looking at?" "What caused the alert to fire?" "Where in the pipeline are there dependencies whose behavior might have caused this?" It can also involve more advanced techniques like correlating signals across columns or tables. For example, if we know that every table in a schema is loaded by a Talend job and all of them are getting stale at the same time, then the root cause is probably in Talend, and we can present that to the operator directly instead of making them sift through graphs.
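A hedged sketch of that correlation step, with made-up table and job names, might look like the following: group the stale-table alerts by the job that loads each table, and surface the shared upstream dependency instead of a pile of per-table alerts.

```python
from collections import defaultdict

# Stubbed alert feed: which tables are stale, and which upstream job loads each one.
stale_alerts = [
    {"table": "sales.orders",     "loaded_by": "talend_nightly_load"},
    {"table": "sales.refunds",    "loaded_by": "talend_nightly_load"},
    {"table": "sales.customers",  "loaded_by": "talend_nightly_load"},
    {"table": "marketing.clicks", "loaded_by": "fivetran_hourly_sync"},
]

# Group stale tables by the job responsible for loading them.
stale_by_job = defaultdict(list)
for alert in stale_alerts:
    stale_by_job[alert["loaded_by"]].append(alert["table"])

# If most of a job's downstream tables are stale together, point the operator at the job.
for job, tables in stale_by_job.items():
    if len(tables) >= 3:
        print(f"Probable root cause: {job} ({len(tables)} stale downstream tables)")
    else:
        print(f"Isolated staleness: {', '.join(tables)}")
```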
The Five C's in Action
The elements of connectivity, configurability, composability, controllability, and consumability work hand in hand, acting as the building blocks for a resilient observability platform.
Connectivity establishes the scope, allowing for end-to-end visibility and identification of problems across the data pipeline. Configurability adapts to unique requirements, enabling the system to cater to diverse use cases. Composability keeps the platform coherent as it scales, letting new use cases be assembled from shared primitives rather than one-off workflows. Controllability ensures proper governance of the monitoring itself, managing access and reducing risk. Consumability optimizes the interface, allowing operators to efficiently interpret data signals and respond swiftly to issues.
Future Trends and Challenges
Designing and maintaining observability platforms is not without challenges. The complexity of interconnected systems, varied data sources, and differing user needs all add to the difficulty, and balancing ease of use with sophisticated functionality, integrating new technologies seamlessly, and maintaining data security are ongoing tasks in the design and operation of these systems.
Looking forward, the future of observability platforms holds promising advancements, including the integration of AI and machine learning, predictive analytics, and automated issue resolution. Even so, the same tensions persist: user-friendliness, cutting-edge functionality, seamless technology integration, and stringent data security still have to be balanced against one another.
The implementation and integration of the Five C's - Connectivity, Configurability, Composability, Controllability, and Consumability - have been the bedrock of our journey in constructing Bigeye. These fundamental pillars have not only shaped our observability platform but have become a guiding light for anyone seeking to scale their own system.
In this era of transformative trends and evolving challenges, the Five C's pave the way for creating observability systems that not only monitor data but also proactively respond to the dynamic needs of the ever-changing technological landscape.