Data in Practice: Building out a data quality infrastructure with Brex
In the "Data in Practice" series, we talk to real data engineers who have put data reliability engineering concepts into practice, learning from their challenges and successes in the real world.
In the "Data in Practice" series, we talk to real data engineers who have put data reliability engineering concepts into practice, learning from their challenges and successes in the real world.
In this interview, we look at one example from Brex, the fintech unicorn. We spoke with Andre Menck, a former member of Brex’s data platform team, to learn how they approached building reliable data infrastructure through four years of hypergrowth.
Andre's role at Brex
Andre was the senior engineer on the data platform team at Brex for four years. As one of the earliest data engineers at the company, he had a hand in most of the projects on the team, either in an advisory capacity or doing hands-on work.
The data platform team’s function was to create self-serve platforms for data scientists, analytics users, and business users.
Evolution of Brex's data infrastructure
Andre was heavily involved in building out Brex’s data infrastructure, including migrating data storage to Snowflake, deploying Airflow, and supporting the three core data science use cases that developed over the next few years:
- Fraud detection - things like transaction fraud, ACH fraud, account takeover.
- Underwriting - measuring customer creditworthiness and credit risk.
- Product features that involved data science - for example, ranking customers by value so the customer experience (CX) team could prioritize their calls, and showing customers insights about their own spending habits.
Where they started
When Andre joined Brex, there was no data platform team. At the time, Brex’s data “warehouse” consisted of a MySQL instance, and the ETL process amounted to data engineers using AWS DMS (Database Migration Service) to replicate data from the production database into that MySQL instance.
In terms of analytics, Brex employees with read access on the database ran manual SQL queries, and there were no data science use cases yet.
Brex's data stack today
Brex’s data stack has evolved by leaps and bounds from where it was a few years ago. As of the beginning of 2023, it consists of:
- Orchestration: Airflow
- Data warehouse: Snowflake (with some custom permissioning infrastructure on top)
- Transformation: dbt
- Asynchronous events: Kafka
- BI: Looker
- Notebooking for data scientists: Databricks
Making machine learning self service
As the number of data science use cases increased, the team decided to build a platform that could support more use cases without creating engineering bottlenecks. The goal was to move toward a self-service model, requiring engineering help only for extreme or fringe cases.
The platform (essentially a Python library) allowed data scientists to prototype machine learning models and then save them in the platform, where they would run in production. Each model would then be available for consumption as an API endpoint, with the platform handling all of the data fetching.
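To make this concrete, here is a minimal sketch of what such a library's interface might look like. The interview doesn't describe Brex's actual API, so the names (ModelPlatform, register_model, predict) and the pickle-based storage are purely illustrative assumptions.

```python
# Toy sketch of a self-serve model platform library. All names are hypothetical.
from dataclasses import dataclass
from typing import Any, Callable, Dict
import pickle


@dataclass
class ModelArtifact:
    name: str
    version: int
    blob: bytes               # serialized model
    feature_names: list[str]  # features the platform fetches at inference time


class ModelPlatform:
    """In-memory stand-in for a model registry plus serving layer."""

    def __init__(self, feature_store: Callable[[str, list[str]], Dict[str, Any]]):
        # feature_store(entity_id, feature_names) -> dict of feature values
        self._feature_store = feature_store
        self._models: Dict[str, ModelArtifact] = {}

    def register_model(self, name: str, model: Any, feature_names: list[str]) -> ModelArtifact:
        """What a data scientist would call from a notebook to 'save' a prototype."""
        version = self._models[name].version + 1 if name in self._models else 1
        artifact = ModelArtifact(name, version, pickle.dumps(model), feature_names)
        self._models[name] = artifact
        return artifact

    def predict(self, name: str, entity_id: str) -> Any:
        """What the API endpoint would do: fetch features, load the model, score."""
        artifact = self._models[name]
        features = self._feature_store(entity_id, artifact.feature_names)
        model = pickle.loads(artifact.blob)
        return model.predict([[features[f] for f in artifact.feature_names]])
```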
Solving data quality
Because Brex works with payments and financial data, it was important for the company to ensure that the analytics it used for fraud detection were up to date and error-free. Data quality is critical to the business model: in the worst cases, data quality issues can quickly spiral into regulatory, financial, and legal catastrophes. Andre helped firefight a number of data quality issues while at Brex:
Problem 1: Airflow DAGs
Brex had around 3,000 data transformations running on Airflow. This huge number was a direct consequence of hypergrowth: “Our analytics team was probably 30, 40 people when I left, so you had all these new people, PMs and analysts, they go to our Snowflake, they see tables, and they’re like, oh, I don’t know where to get this data. It looks like no one has written this query before, so I’ll just do it myself…it’s sort of the flip side of having very independently operating teams.”
While each Airflow DAG was internally consistent (they are, after all, directed acyclic graphs), Airflow had no knowledge that some of the DAGs also depended on each other: the output of one was often the expected input of another. After seven or eight layers of transformations, they often ended up with 10-day-old data.
The solution to this problem was to merge everything into one DAG. This was still an ongoing project when Andre left the company.
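For illustration, here is a minimal sketch of what consolidation buys you: once the transformations live in one DAG, Airflow can see that one task feeds another and will never run the downstream step against stale inputs. The DAG, task, and dbt selector names below are invented; this is not Brex's actual pipeline code.

```python
# Sketch of a consolidated DAG where cross-transformation dependencies are explicit.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="consolidated_transformations",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    stage_transactions = BashOperator(
        task_id="stage_transactions",
        bash_command="dbt run --select staging.transactions",
    )
    score_fraud = BashOperator(
        task_id="score_fraud",
        bash_command="dbt run --select marts.fraud_scores",
    )

    # The dependency that separate, unrelated DAGs could not express:
    stage_transactions >> score_fraud
```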
Problem 2: Connections with banks
The second problem was in the underwriting space. Brex used Plaid to get customers' financial information from their banks to determine their real-time creditworthiness, but maintaining an ongoing connection to a customer's bank account was challenging.
To handle this issue, Brex came up with a couple of different solutions.
The first was to build an underwriting policy that could deal with stale data – to make sure that if a customer's data was stale, Brex didn’t lower their credit limits right away. “We made that more and more complex to basically be able to guess what is the risk that we attribute to this customer. Let’s say the data we have on the customer is 90 days old, but if they had $100 million in their bank account, they’re probably not insolvent now.”
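A toy sketch of that kind of staleness-aware policy is shown below. The thresholds, field names, and decision logic are invented for illustration; Brex's actual underwriting policy is certainly far more involved.

```python
# Sketch: only act on a low balance when the data is fresh enough to trust.
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class BankSnapshot:
    balance_usd: float
    as_of: datetime


def should_lower_limit(snapshot: BankSnapshot, now: datetime) -> bool:
    age = now - snapshot.as_of
    if age > timedelta(days=90) and snapshot.balance_usd >= 100_000_000:
        # 90-day-old data, but the last known balance was $100M:
        # almost certainly still solvent, so don't cut the limit on stale data.
        return False
    if age > timedelta(days=30):
        # Stale and not obviously safe: defer to manual review rather than
        # automatically lowering the limit.
        return False
    return snapshot.balance_usd < 10_000
```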
The second solution was to monitor the bank connections in a more systematic way, so the team could pin down exactly which bank was having problems and work it out directly with Plaid, without impacting customers.
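One simple way to implement that kind of per-bank monitoring is to aggregate connection failures by institution. The sketch below assumes a hypothetical event shape (an "institution" and "status" field) and an arbitrary alert threshold; it is not based on Brex's internal tooling.

```python
# Sketch: count connection errors per institution and flag the noisy ones.
from collections import Counter
from typing import Iterable, Mapping


def failing_institutions(
    connection_events: Iterable[Mapping[str, str]],
    threshold: int = 50,
) -> dict[str, int]:
    errors = Counter(
        event["institution"]
        for event in connection_events
        if event["status"] == "error"
    )
    return {bank: count for bank, count in errors.items() if count >= threshold}
```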
Problem 3: Product Teams
The third problem was product teams changing the events they emitted without informing the data platform team, which meant Brex no longer had the data behind certain features used by its machine learning models.
While one solution to this problem would have been to write extremely thorough integration tests, that would have been expensive and difficult to maintain. Instead, the data platform team wrote some manual tests and supplemented them with a more “process-driven” solution: close coordination between the data science team that consumed the data and the product teams that emitted it.
For example, they had a library used to produce events to Kafka for data science models. Whenever a PR changed how that library was used, anywhere in the codebase, a data scientist would be alerted through a linter or auto-tagged on the PR.
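The library and alerting setup are Brex-internal, but the general idea can be sketched: if every event type is declared as an explicit, versioned schema in one place, any PR that changes an event has to touch that declaration, which gives a linter or auto-tagging rule something concrete to watch. All names and fields below are hypothetical.

```python
# Sketch of an event-producer wrapper with an explicit, versioned event schema.
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class TransactionScoredV1:
    """Schema for a hypothetical 'transaction_scored' event consumed by fraud models."""
    transaction_id: str
    merchant_id: str
    amount_usd: float
    fraud_score: float


def produce(producer, topic: str, event: TransactionScoredV1) -> None:
    """Serialize a declared event type and send it to Kafka."""
    payload = json.dumps(asdict(event)).encode("utf-8")
    producer.send(topic, value=payload)  # e.g. kafka-python's KafkaProducer.send
```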
Andre’s advice for building data infrastructure
When asked what advice he would give to himself if he were to build data infrastructure again for a new startup, Andre had the following two principles:
Automate as much as possible
Andre advises building intentionally, and automating as much as possible, even when you're under the gun: “You always build fraud-detection products in a rush, because you’re trying to stop the crime. As such, there’s a tendency toward relying on manual processes to test them. But that ends up being way more of a pain in the long term.”
Grow thoughtfully
Says Andre: “If you just build and build and build, you get that disorganized mess of data assets, and it’s impossible to recover from it as a company. If you have 2,000 employees and that’s your data picture, that will be with you forever."