Mark Grover, Stemma
Kyle, from The Observatory, and Mark Grover, founder and CEO of Stemma, discuss data catalogs, the history of data, and how open source and data catalogs go hand in hand.
Read on for a lightly edited version of the transcript.
Kyle: Hello, everyone, and welcome to the first episode of The Observatory. Today we're going to be chatting with Mark Grover. He's the founder and CEO of Stemma. And he's also the co-creator of the Amundsen data catalog. Mark, welcome to the first episode of The Observatory.
Mark: My pleasure, Kyle. Super glad to be here.
Kyle: Mark, why don't you tell us what data catalogs are and how you got interested in data catalogs in the first place?
Mark: A data catalog is a system that allows you to be able to search, understand, discover, and trust the data that's present in the organization. I got into this problem out of necessity, instead of desire. I was working at Lyft prior to starting Stemma, and the problem I saw was that Lyft had all their data consolidated in one central data lake in the cloud. But there were two things happening in terms of growth at Lyft. One was the amount of data at Lyft was growing exponentially every year. And second, the number of people who were being hired at Lyft was also growing every year.
The combination of these two things meant that there was a ton of data. And the people being hired in roles like operations management, general management, product management, data scientists, data analysts, and data engineers, of course, all wanted to use data. All of which meant that there were a lot of questions on this internal Slack channel called #analyticsquestions: “Do we have this data? Who else has used this data? Who knows the most about this data? How do I use this data? When was it last updated? Can I trust it?” These were the questions that kept getting asked over and over again.
And I had looked at some solutions that claimed to solve this problem in the past. I evaluated those, and they were very much focused on curation as a means of solving the problem. So, let’s say Kyle is the data steward for the ETA area or the marketplace area, and we'll get Kyle to document this whole wiki-page-style thing for the ETA area that says: this is a certified source of truth, and here are the people you should ask. But in a fast-growing company like Lyft, that was really untenable. The moment you documented something like this, it would be out of date the next day.
And that pushed me to create an automated data catalog, called Amundsen, that leverages metadata from your query logs, from your BI tools, from your team’s HR system, from your Slack, and puts it all together to power a view, similar to what Google did for the web, letting you search for data within the organization. It was super successful at Lyft, with 750 users every week; 80% of data engineers and data scientists used it weekly. Then we open-sourced it, and companies like Instacart, Brex, Asana, Square, Workday, and ING are all using it. And Stemma is an automated data catalog inspired by that work in Amundsen and open source, providing the same capabilities of understanding and trusting your data to the larger enterprise. And that's how I got involved.
Kyle: I've seen different data catalogs that have sprung up at other companies, as well as internal projects. Some of them never made it to open source the same way Amundsen did. Was open source a key point of interest, or something that you had an intention for at the beginning?
Mark: It wasn't something that was a clear intention for me. In fact, when I evaluated these off-the-shelf products and found them unsatisfactory, I went to a few companies in the Bay Area to see if they had the problem and how they were solving it. And the three conversations I recall very vividly were with Netflix, Facebook, and Airbnb.
And it was pretty clear from talking to those companies that the data teams all had this problem, and they were all trying to build these mini in-house data catalogs that were automated. I asked each of them: would you open source the thing you're working on, so we could leverage that and not have to build one? The problem was that each of these products was built very custom to the organization, for a lot of reasons, different for each company. Had one of those projects been open sourced, maybe Amundsen wouldn't exist. But none of those projects were generic enough to be usable at any company, and none of those organizations were incentivized enough to open theirs up and spend the effort and energy to make it so.
So the idea we started Amundsen with was that, even though those projects were custom to their companies, the artifacts, the nomenclature, the entities were all universal. And so all we had to do was build some constructs in the product that would allow you to plug in your own data warehouse. Lyft was very heavy on Hive, Presto, and Redshift, but what if you use BigQuery or Snowflake, which Lyft didn't? Can you plug in your own integration? I think that was the part that really benefited from open source: we built some constructs, and then others were able to ingest and integrate their data warehouses and their BI tools pretty easily.
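The plug-in construct Mark describes can be sketched in a few lines of Python. This is a simplified, hypothetical interface for illustration, not Amundsen's actual API: each warehouse gets its own extractor class, and the catalog core depends only on the shared interface, so adding a new warehouse never touches the core.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterator, List

@dataclass
class TableMetadata:
    """The universal entity the catalog stores, regardless of warehouse."""
    database: str
    schema: str
    name: str
    columns: List[str]

class Extractor(ABC):
    """Pluggable metadata source; one subclass per warehouse or BI tool."""
    @abstractmethod
    def extract(self) -> Iterator[TableMetadata]:
        ...

class InMemoryWarehouseExtractor(Extractor):
    """Stands in for a real Snowflake/BigQuery/Hive extractor in this sketch."""
    def __init__(self, rows):
        self.rows = rows  # tuples of (database, schema, table, columns)

    def extract(self) -> Iterator[TableMetadata]:
        for db, schema, name, cols in self.rows:
            yield TableMetadata(db, schema, name, cols)

def build_catalog(extractors):
    """The catalog core only knows the Extractor interface, not any warehouse."""
    catalog = {}
    for ex in extractors:
        for table in ex.extract():
            key = f"{table.database}.{table.schema}.{table.name}"
            catalog[key] = table
    return catalog
```

A BigQuery or Snowflake integration would then be just another `Extractor` subclass; the catalog-building code never changes.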
Kyle: Data's been around for many decades, but there is this wide variety of backgrounds that folks come into data and data engineering from, and maybe that's not entirely unique to the data field. So it's interesting to hear your background there. Do you feel like that has changed much these days? Are the paths folks take into data now different from what they might have been in the past? And would you have any advice for somebody who is just starting their career in data or data engineering?
Mark: I don't think degrees in data were as common; they're maybe a little more common now, but mostly only in the data science world, not even in data engineering. And so the folks I've seen enter data engineering are usually software developers who have a knack for attention to detail. There's a taste that's involved. If I were to zoom back out and piece data engineering apart, I think it involves two things.
One is a tasteful design of the data model: your data model should be just broad enough, not too broad, not too narrow, and just deep enough. You have to be very judicious about what the grain of the dataset is and why you choose it. So that's a very tasteful exercise. I do not think that can be automated; that will remain a data engineer's job going forward, no matter what. The second part of the data engineer's job is optimization. I'm writing a Snowflake pipeline, and it's taking more than 24 hours to complete when it runs every day. That stuff is nuanced. It requires a lot of understanding of Snowflake internals.
And this has gotten better; it used to be 100x worse. That part, I think, is going to keep getting easier, thanks to technologies that make it harder and harder to make mistakes. So someone could say, “Hey, this part is actually going to get more and more automated, and more data engineering time will be spent on the modeling work.” And I still think the modeling work is tasteful. Analysts often carry a good taste for this; people who have used data in the past carry a good taste for it, and some people intrinsically have very good taste for it. Others who come from a more traditional software engineering background would be pretty good off the bat at optimization, but data modeling would be a place where they'd have to build a little more taste. And in my opinion, there's no better way to do that than to get some experience under the belt.
Kyle: Yeah, totally. I think it's interesting that you describe it as taste, because it really is kind of subjective, right? The decisions that you make when you're designing the data model are going to have a huge impact on how easy it is to use in different types of applications, by different people, and how easy it is to understand.
It's interesting to hear it described as taste because I think that's actually a pretty good word for it. Because you could make these different subjective design decisions, and they're going to have different strengths. And to make those decisions, you're going to have to kind of understand, not just the fundamentals of the data you're working with, but also, you need to understand or predict how the users of the data and the consumers are going to need to interact with it. So this is actually a really great segue into my third question.
So I'm always curious to hear how people see the space evolving. Everybody's got a different perspective, and they approach it from a different angle. So I want to ask: within the larger data ecosystem, how do you see data discovery and data cataloging, this area that you have all this experience in, evolving over the next several years? And what roles do you see Amundsen and Stemma playing in that?
Mark: I see evolution in three key ways. The first one is that data cataloging has historically been used from a risk management perspective.
And so the idea is, say you work for a large financial organization; you have a certain set of datasets that you need to report to a regulator for compliance purposes. What you do is stand up one of these data catalog systems that are maybe 10 years old, and you end up using them essentially for process management around data. Someone can, shopping-cart style, check out the data, and then an approver, a governor, would approve or reject that request. You're essentially tightly controlling a very small subset of data within the organization for compliance purposes. So one thing I believe will evolve very strongly is that these catalogs will not just be used for regulatory and compliance purposes; they will also be used for democratization, enablement, data-mesh-style purposes. The idea is that an analyst who uses Tableau every day would also use a data catalog every day, much like you and I use Google every day to search. It's similar with Google: there are maybe one or two websites whose URLs I type directly, but most of the time, my entry point is Google. And I think that's one place I see that change happening.
The second place where I see change happening is that, historically, data catalogs have been more of a curation exercise. Like I was saying earlier, you appoint data stewards who put in this information. But technology has come a long way in the last 10 years. You can take a bunch of these existing systems, Snowflake, Redshift, BigQuery, these are the three most common warehouses, and BI tools like Tableau and Looker, and build processes that scrape, infer, and augment the metadata that comes from them. So: I know Kyle and Mark work on the same team; Kyle uses this table all the time; Mark, maybe you should too. Or: I know a new executive is looking at this dashboard, so hey, data engineer, if you're changing this, be aware that it impacts the executive dashboard. All that to say that curation will still play a role, but not the majority role. Automation will come in, and curation will be used to review and augment the metadata that comes from automation. So that's key number two.
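The kind of inference Mark describes, suggesting tables to someone based on what their teammates query, can be sketched from parsed query logs. This is a toy illustration under assumed inputs: the `(user, table)` pair format and all names here are hypothetical, and a real system would derive these pairs by parsing SQL from the warehouse's query history.

```python
from collections import Counter, defaultdict

def table_usage(query_log):
    """Aggregate per-user table read counts.

    query_log: iterable of (user, table) pairs, assumed to be parsed
    out of the warehouse's query history.
    """
    usage = defaultdict(Counter)
    for user, table in query_log:
        usage[user][table] += 1
    return usage

def recommend(user, team, usage, top_n=3):
    """Suggest tables that teammates query often but `user` has never touched."""
    seen = set(usage.get(user, {}))
    scores = Counter()
    for mate in team:
        if mate == user:
            continue
        scores.update(usage.get(mate, {}))
    for table in seen:  # drop tables the user already knows
        scores.pop(table, None)
    return [table for table, _ in scores.most_common(top_n)]
```

Running it on a tiny hypothetical log, `recommend("mark", ["kyle", "mark"], table_usage(log))` surfaces the tables Kyle relies on that Mark hasn't queried yet, which is the "Kyle uses this table all the time, maybe you should too" signal in miniature.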
Third is that the personas who have used the data catalog in the past have been limited to a data governance persona, again tied to the first point I was making. I find that these personas now have to include not just the analyst, which I think is pretty well understood, but also the data engineer, who uses the catalog to triage when a data quality check fails, or to figure out whom to notify before changing or deprecating a table.
So to summarize these three keys: I feel like catalogs will be used not just for governance and regulatory needs, but also for enabling data democratization. I believe that catalogs will be not merely curated, but largely automated. And I believe that the personas who use catalogs will be more diverse: definitely data engineers and data scientists, as well as business users.
Kyle: Oh, thanks, Mark. Those are sort of the big, juicy questions I wanted to talk to you about today. I do have three rapid-fire ones for you, just to round us out. Number one: you've lived in a couple of places, I think. What is your favorite place that you've lived so far?
Mark: Boulder, Colorado. I'm happy to share more details.
Kyle: Number two. Are you still actively coding?
Mark: Not quite. I changed the website for Stemma two weeks ago.
Kyle: Okay, well, that counts. You're on the edge. And number three, if you could change any one data buzzword (and we have plenty of them these days), what would you change?
Mark: Oh, man, that's a hard one. I would change data governance. I think a lot of people put tons of stuff into that word, and it can probably be broken down into more meaningful words so that we can have a better conversation as a community.
Kyle: Or, just a little more specificity. Cool. All right. Well, Mark, thanks so much for chatting with me today. And thanks for being on the first episode of The Observatory.
If you want to learn more about either Stemma or Amundsen, we have links down in the description below. So you can click those to check it out. And I will see you all next time.