The Data Stack Show - 182: Building a Dynamic Data Infrastructure at Enterprise Scale Featuring Kevin Liu of Stripe
Episode Date: March 20, 2024

Highlights from this week's conversation include:
Kevin's background and work at Stripe (0:31)
Evolution of Data Infrastructure at Stripe (2:18)
Kevin's Interest in Data (5:29)
Software Engineer or Data Engineer? (8:27)
Speech Recognition Work at Amazon (11:06)
Efficiency and Cost Management (15:50)
Metadata and Query Analysis (18:38)
Surprising Discoveries in Metadata Analysis (21:43)
Optimizing Cost and Value (23:55)
Productizing Stripe Data (26:39)
Popular Tool for Data Interaction (30:08)
Enabling Data Infrastructure Integration (35:22)
Value of Data Pipelining for Stripe (39:32)
Next Generation Product and Technology (43:54)
Maximizing value in a decentralized environment (51:34)
Future of open source projects in data infrastructure (57:59)
Final thoughts and takeaways (59:02)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Discussion (0)
Before we start the show this week, we've got a quick message from a big show supporter,
Data Council founder, Pete Soderling.
Hi, Data Stack Show listeners.
I'm Pete Soderling, and I'd like to personally invite you to Data Council Austin this March
26 to 28, where I'll play host to hundreds of attendees, 100-plus top speakers, and dozens
of hot startups on the cutting edge of data science, engineering, and AI.
If you're sick and tired of salesy data conferences like I was, you'll understand
exactly why I started Data Council and how it's become known for being the best vendor-neutral,
no BS, technical data conference around. The community that attends Data Council includes some
of the smartest founders, data engineers, and scientists, CTOs, heads of data, lead engineers, investors,
and community organizers. We're all working together to build the future of data and AI.
And as a listener to the Data Stack Show, you can join us at the event at a special price.
Get a 20% discount on tickets by using promo code DATASTACK20. That's DATASTACK20. But don't just
take my word that it's the best data event out there.
Our attendees refer to Data Council as Spring Break for Data Geeks.
So come on down to Austin and join us for an amazing time with the data community.
I can't wait to see you there.
Welcome to the Data Stack Show.
Each week, we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by RudderStack, the CDP for developers.
You can learn more at rudderstack.com.
We are here on the Data Stack Show with Kevin Liu.
Kevin, thank you so much for giving us a little bit of your time today.
Yeah, thanks for having me.
All right. Well, you've done a couple of really interesting things in data,
but just give us your brief background.
How did you start and what are you doing today?
Sure. I'm currently a software engineer at Stripe.
I've been working there for around three years,
working with data infrastructure. So a lot of open source technologies such as Trino and Iceberg; my team
powers our internal BI analytics. And recently I've taken on another challenge on the data product side. The product is
called Stripe Data Pipeline.
We essentially enable merchants to get their Stripe data back into their
warehouse, into their data ecosystem, in an efficient way.
This is great.
So actually, I've known you, Kevin, for a while now. We've been talking
since the time when I was at Starburst,
about Trino specifically.
And I'm very excited today because I had the opportunity
and the pleasure to work with Stripe quite a few times.
And it's one of these companies that has been around long enough
to go through
many changes, but is always trying to stay at, let's say, the forefront of what
is happening out there.
For example, like very early adopter of Spark, right?
I'm pretty sure you probably still have pipelines in Scala in there because of that.
And you keep innovating.
You are open to using new technologies, and many things have
happened in these past 10 years, let's say. So having you from there,
and you being there long enough to see at least the past
three, four years of the evolution, I think will give
us a great opportunity to talk about where data infrastructure stands today and what,
let's say, some interesting problems are.
And also, based on your latest move into turning data into products,
to talk about that, because I think it's a very important next evolution step when
it comes to infrastructure around data.
So that's what I'm really excited about today.
What about you?
What are a few things that you'd love to talk about?
Yeah, I think in general, I've been really happy working at Stripe, just because the company is known for its size, for the kind of
engineering culture there. It really helped me learn and get to understand a lot of what is going
on, especially in the data world. Kind of, you know, what is the newest and
shiniest thing that we can work with, right?
So I took a database class in college, didn't think much of it, came to Stripe, started
working with OLAP systems, Trino, Iceberg, and it was very new to me.
But then eventually I started to realize that it was new to the industry as well.
That's been really exciting to me, to say, okay, well, how do I take this new concept,
how do I run it efficiently at Stripe, and then how do I help the community, because it
is an open source project? How do I take the ideas
that we come up with
and share them with the community as well?
And then on the data product side,
I think Stripe is positioned very well
to do data sharing.
Not a lot of companies can do that,
because not a lot of companies have
data whose value
can be shared with their customers in a way where the customers are
asking for it on a daily basis, right? So, you know, I'm still learning. I think I just want to share
some ideas with you guys, and yeah, happy to talk more about things.
Yeah, let's do it.
What do you think, Eric?
Are we ready?
I was born ready, Costas.
I was born ready.
I know that.
Let's do it.
Let's do it.
Kevin, so excited to have you on the show.
And we have some really exciting subjects to talk about.
You gave a brief introduction, but I'm interested to know, sort of going back to the beginning,
what sort of sparked your interest in the data side of things? So you have a background as a
software engineer, but what drew you into the data aspect of software engineering?
Yeah, I think I always like to just dabble around different domains on the internet.
And I think data has been one of those things that just stood out to me, in terms of my work, right? You know, I work on data infrastructure at Stripe, kind
of making it so that thousands of Stripes have access to data.
That's just been really interesting to me, to see how that kind of evolved from your traditional data warehouse.
And I think the open source aspect of it also drove me to participate more, to join
communities, to learn from each other, and to share what I've learned. I think that's
been really motivating for me
to work in this field.
And obviously, you know, in the recent months, years,
there's been a lot of new developments.
You know, I was watching the history of databases,
and they're calling this, you know,
data lakehouse a new wave.
And in a way, I do believe that it is a different paradigm from before.
And I see it firsthand and it enables a lot of interesting features and value to derive from there.
Fun fact, we used to run on Redshift until we couldn't run on Redshift anymore.
We migrated to Trino and to Iceberg
with open source technologies
and we see firsthand
how much value it provides to the company
and how, you know,
folks who use it on a daily basis
think it's like magical, right?
That we're able to, you know, analyze petabytes
of data super, super fast, right? And at Stripe especially, we have a way for
people to interact with data very easily. We have an internal tool that you can just go to and just write some SQL.
And so that approach of democratizing data at the company has been very well accepted at Stripe.
Yeah.
I have a question.
So your title is software engineer, but you work with a ton
of data stuff. Just out of curiosity, do you kind of consider yourself like more of a software
person or a data person? I know that title can be a little bit abstract because it can mean so
many things, right? And in some ways, building a data platform is what you've been doing. But
yeah, just interested in your perspective on that. Yeah, sometimes I think to myself too,
when I first learned the term data engineer,
I'm like, am I that?
Am I a data engineer?
I don't know.
I'm not sure.
I mean, my day-to-day goes from SQL to front-end to BI
to distributed system to like,
every part of the data infrastructure,
we kind of have some kind of lever that we can pull.
In a way, yeah, a lot of what I do is considered data engineering,
but I think especially on the data infrastructure side,
there's a lot of software that, you know, exposes a good interface, but sometimes
you really need to dig into the internals of it. And this is where big open source and having the
community is great, because a lot of the time we're able to talk with other folks at other
companies who also run infrastructure, and share what we learn with each other and
share with the community.
I, you know, went to a Trino Fest event like 2021, 2022 and learned a lot.
And I came back to my team and like, hey, you know, Lyft runs their data infra, their
Trino clusters very efficiently.
What can we learn from them?
So a lot of those things I really enjoy.
And I guess that's what software engineers do.
I'm not too sure.
I don't know.
Like, we don't have data engineer roles at Stripe.
So I'm not really sure what that means either.
So, you know, I think I do a little bit of both.
Yeah, yeah. I mean, you know,
I think that's actually, you know, part of the reason I asked the question is that,
you know, as we think about, like you said, there's all sorts of interesting new developments,
right, in data technology and operating platforms. And so it is really interesting to think about
the confluence of multiple different skill sets that are really useful when running, you know, large data systems.
Okay, I have a ton of questions about Stripe, but I want to jump back just a little bit.
And you worked on some speech recognition stuff at Amazon previously. And I just have to ask about that, especially after hearing you talk about,
you know, sort of being a data person and a software person: did those two things come
together in that work as well? Because, you know, you're sort of dealing with massive amounts of
data and then, you know, trying to build a system that can essentially operationalize it.
Yeah, I think in a way, yes.
I think, I forgot who I was talking to,
but I was talking to someone
with a lot of years of experience in the industry,
a software engineer,
and they basically told me that,
you know, software engineer and writing software
is essentially just moving data around.
So I think my role in this data engineering, big data world is being a software engineer and specializing in that.
I worked on the speech recognition system for Alexa,
and we were kind of supporting the data science team there.
So a lot of the job is, you know, how do we provide the right abstraction for data scientists,
for ML engineers to run their speech recognition model?
How do we have the right environment for them to do their work in a way that produces value, right?
Yeah.
And it's the same thing at Stripe.
I think a lot of our work enables folks from other parts of the company to do their job
and to get whatever they need, whatever data they need, whatever insight they need in a
fast and efficient way. Yeah, absolutely. Well, let's dig into the world of
Stripe. So can you give us a little bit more detail on what you've done at Stripe?
What are the big projects that you've worked on and built?
Yeah, we did a bunch of stuff at Stripe in the years that I've
been in. I was talking to a co-worker before, and we were kind of reminiscing about the projects that we took on, and it just felt like a decade ago. So when I first started at Stripe, the whole company was in this big project to support India.
And it was really interesting to me because India has this concept of data locality, where it's not a concept.
It's a law that Indian merchant data should not leave the continent.
Right, yeah, it stays within the borders.
Yes, I'm familiar with this, yeah.
Yeah, which breaks
the concept of like software engineering
and like abstraction layer
and everything, right?
Because now your data
is physically in some space
instead of, you know, just data as blobs in S3.
So that's the first project that I kind of worked on.
And that actually required kind of a foundational shift at Stripe to say, you know, apply this concept all the way down the stack, and make sure that we're supporting it
everywhere.
So that was really interesting for me to see
in that Stripe scale
to support this
strange concept
that's outside of
what software
engineering has taught me.
And then
a lot of what my team supported
was our internal kind of data analytics BI product.
So we have a very popular kind of internal tool
called Hubble,
which essentially is just a text box of SQL
and a button that you can press for running the SQL and you
know, you get some results back, right?
Very simple interface, very well received.
I think daily active user count was in the thousands, apparently.
I went to the Seattle office and walked around, and, you know, folks all have it up. And we work a lot on the
front end and the back end, which is powered by Trino, the various components. So we had
Hive tables, we had Iceberg tables. So, you know, my role was really a little bit of everything in that.
And, you know, recently, last year, the year of efficiency, what we worked on and focused on was tracking our spend and seeing, like, what exactly are we paying money for? Paying EC2, paying this?
So we did a lot of work around metadata and especially attributing what is going on in our infrastructure.
So, for example, whenever someone presses run, we want to be able to say, okay, this query was run,
and hopefully, for this reason, right?
And to compound the issue, we also expose an API endpoint.
So a lot of integration is done in this like SQL format, right? There can be cron jobs,
there can be event handlers to say,
when this happens, we want to do something,
find some data in our data infra and then perform something else.
So a lot of the "let me get data,
let me find data, let me work with data"
works off of this endpoint.
And that is where it's very easy to have runaway costs because once you expose the internal endpoint
and once everyone at Stripe wants to integrate with it
because it's very easy to just send a query to it,
for us on the infra side, very quickly, we need to figure out, you know, like what is
actually happening and what are we spending money on?
Because over the years, we just assumed that it's natural growth, right?
Like, every couple of months we say, okay, well, Stripe is growing by this much, the
business is growing by this much.
So the compute need naturally grows with it.
So let's just turn up our cluster, right?
Let's add a new cluster.
Let's add new machines.
But when efficiency is important,
and when we, I mean, we know that over the years
we valued growth over efficiency,
but when it's time for efficiency,
we really had to like hunker down
and figure out what exactly
we're spending on.
I want to ask you about,
so you have metadata on a query being run.
How did you tie that back,
or how did you go discover the why?
Because a lot of times I would think
that's sort of the big question.
And just what comes to my mind is that
a lot of times analytics projects can be ad hoc, right?
Where you need to run a bunch of queries
on a bunch of data to answer a question,
but then when you answer it,
you sort of have the insight you need
and then you sort of move on, right?
It's not like that's a persistent report or whatever.
So how did you figure out that why or whether something was ad hoc or ongoing?
Yeah, I think the first thing we wanted to figure out is a big picture of like what is
happening.
So we know there are certain
kinds of data operations
going on. We know there's ad hoc analytics.
We know there's BI
reporting. We know there's
operational stuff,
like "tell me when
something happens."
We know that there are a lot of these
use cases, and there's an
ever-growing amount of use cases.
From the infrastructure side, we treat these all as kind of the same, even though they aren't.
Like ad hoc analytics require a different latency spec than service, right?
Like if it's a cron job, it just wants to run in the next 30 minutes,
whenever, versus, like,
if it's ad hoc, someone's waiting.
But for us, on the data infrastructure side,
like, we wanted to see
exactly what is going on
throughout, kind of, the realms, right?
So the first step was actually
just to collect that data, right?
Do we know how many people are running ad hoc queries? Do we know how much of our compute is spent on dashboarding,
on service queries, on this and that? So this is where the metadata comes in. And depending on how
you structure the metadata, you can really slice and dice your way into the different kinds of usage.
So for us, the first thing we did was, like,
you know, we know specific services have specific queries.
Yep.
Like, this website we have internally, most
people go there for ad hoc stuff, right?
This cron service that we have, you know,
a lot of these services also build out their own services. So this cron
service actually has
different teams under them.
So how do we ask the cron
service, like, give us more information
so we can slice and dice that
too? So you
really kind of get into a realm
where, you know, in the cron service, every time you send
a query to us, give us as much information as you can about it. And this is easy because, like,
we own it all, right? The code base is all Stripe's. We can go to that team and say, hey,
you know, I want to add extra metadata every time you send us a query. They're like, okay, cool.
It doesn't matter much to them, right?
It's not that big of a deal.
But for us, it is, right?
For us, we see that this query is from this cron service,
which is from this team,
which is from this task that runs every so often.
Now you really get into the kind of analysis part of it,
with just, you know, three fields in your metadata.
Yeah.
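For readers following along, here is a minimal sketch of the kind of attribution Kevin describes, assuming the open source trino Python client, which can attach client tags to queries. The hostname, tags, and table are hypothetical, not Stripe's actual setup:

    import trino

    # Hypothetical: a cron service attaches a few attribution fields to
    # every query it sends, so the infra team can later slice compute
    # spend by team, service, and task.
    conn = trino.dbapi.connect(
        host="trino.internal.example.com",  # hypothetical endpoint
        port=443,
        http_scheme="https",
        user="cron-service",
        # Client tags travel with the query and show up in Trino's
        # query metadata and completion events.
        client_tags=["team:payments", "service:report-cron", "task:daily-refresh"],
    )
    cur = conn.cursor()
    cur.execute("SELECT count(*) FROM warehouse.events")  # hypothetical table
    print(cur.fetchall())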
I have to know, what's one of the most surprising things that you and your team discovered when you started slicing and dicing the metadata?
Yeah.
So, you know, we always know there's like some inefficiencies in our system.
And, you know, at a hyper growth company, it happens.
And, you know, sometimes the best thing you can do is to, you know, focus on the most impactful things.
And sometimes it's not cleaning up stuff. So I think once we started gathering data, the most egregious thing we found was that
there is a cron service that runs every hour.
And what it does is it just runs select max(updated_at) on a table.
Pretty simple.
I just want to know the last time this table was updated, right?
But then when you dig into the details: this table, maybe
when this query was first set up two years ago, was a couple of megabytes,
a couple of gigabytes.
This table is now like a petabyte of data.
This table is not structured correctly, not partitioned correctly, so your max(updated_at) is now
doing a full table scan of, like, petabytes of data, right? And now you're doing this
in a distributed Trino environment, where you can have, like, 10, 100 machines running. It takes around two, three CPU days
to run one of these queries.
And then you see that this query is run every hour
on a cron job.
So you multiply all those factors
and we're spending so much compute
on this one simple query.
And then you go back and you say,
okay, well, who owns this?
What is this for?
Can we tell them, you know, Trino's Iceberg connector has this concept of a metadata table, where
you can look at the metadata instead of doing a full table scan?
It's like, okay, well, this is how we're going to optimize it.
We find the team is no longer around and they don't need this.
Right?
So this whole process where we're doing this much compute for zero value.
Yeah.
And there's a lot of that we found that was very surprising.
And, you know, for us, it's great.
It's all savings, right?
We can take a lot of these and say, okay, well, you know, every so often we'll just write a report, do some analysis and stop this from happening.
But it was just really surprising from our side to find something like that.
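To make the fix Kevin describes concrete: Trino's Iceberg connector exposes metadata tables such as "$snapshots", so the last-update check can read a handful of metadata rows instead of scanning the data. A hedged sketch with made-up names:

    import trino

    conn = trino.dbapi.connect(
        host="trino.internal.example.com",  # hypothetical endpoint
        port=443,
        http_scheme="https",
        user="analyst",
        catalog="iceberg",
        schema="warehouse",
    )
    cur = conn.cursor()

    # Before: full scan over a petabyte-scale table just to find the
    # most recent update.
    cur.execute("SELECT max(updated_at) FROM events")

    # After: read Iceberg snapshot metadata instead; the latest commit
    # timestamp tells you when the table last changed.
    cur.execute('SELECT max(committed_at) FROM "events$snapshots"')
    print(cur.fetchone())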
Yeah, I guess, you know, I can see both sides, right?
On the one hand, it is surprising to see that, where you're like, okay, maybe this one wins the award for most expensive
query in the history of the company. But at the same time, I mean, Stripe is, you know,
a huge company growing fast. It was probably a significant need and things change, right? And
it's, you know, everyone knows it's really hard to, I mean, I would guess also with something
like that, you know, if you don't have the context, it's scary to go back in and touch stuff like that
because it may be running some really important piece of the business.
But yeah, man, I can't imagine the cost of that.
On the infra side, a lot of the problem that we have
is the disconnect from what these things are used for. That really helps push us to go specifically to the domain and ask, hey, I see this is happening in our system.
Like, what is happening? Can we help optimize it? Because, you know, the domain experts might not know how exactly to write this
query to get the same result,
but in a better way.
Sure.
But on the infra side,
we know how to give you that,
right?
We can say, now write it
with the metadata table,
and now you're reading, like, a few megabytes of metadata instead
of doing a full table scan.
But that disconnect is where this helps facilitate as well. And obviously it would
be great if we could automate all of this and no one had to think about it, but a lot of
the time you have to push all the way up to the domain and kind of figure it out from there together.
Yeah, yeah. I mean, my opinion and interested to see if you agree with this is that,
you know, it's not necessarily the responsibility of that end user to understand how to optimize
that, right? They're trying to pull data so that they can do their job. Yeah, super interesting.
Well, let's change gears just a little bit here.
One of the latest projects that you've been working on is actually productizing Stripe data,
which sounds absolutely fascinating. I know Costas has a million questions about that, but
can you just describe that concept? What was the sort of need and what's the project like?
Yeah.
So this is how I've been internalizing this, right?
Stripe is a API-first payments company,
or at least when it first started,
that was the flag that we have, right?
We have a set of APIs where you can interact with and you can work off of the global payments rail.
Super cool idea.
This evolves into, I have a set of reporting APIs.
As a merchant, I do a bunch of stuff with Stripe.
Stripe helps me facilitate a lot of payments.
Now I want information back to say, you know, how many payments have gone through?
How much money have I gone through with Stripe?
And either I can keep a system of record on my side, right?
Every time I send Stripe some information, I also
keep some information. Or, you know, Stripe builds out this suite of products to say, no, I am the
source of truth, I'm the record keeper, here's your information, and let me repackage it in a way that
adds value for you, the merchant. And this evolved from the API into something
called Stripe Sigma, which is
like on
the Stripe website, a way to interact
with your own Stripe data as a
merchant. So you can go on
Stripe Sigma, you can
write some SQL queries, press run
and have some results back.
And the data can be like, you know,
how much have you processed? How much have you
utilized Stripe
for? Right. But
for a lot of enterprise cases,
they don't want to work
off of Stripe.com.
Right. They don't want to use
a SaaS product. They have their
own data engineering team.
They have their own data infrastructure
ecosystem. And they want that data in their system so they can integrate it with maybe their system
record of truth. And they want to add different features, different values to that data.
So that's kind of where the problem statement is,
is to say as a merchant and especially an enterprise merchant,
I want Stripe's data in my ecosystem.
Like, how can you give me that data?
And there's a lot of, you know, off-the-shelf software.
You know, Fivetran is kind of the market leader in this
where I think they just
scrape Stripe's API, write it down, and push it out.
They facilitate it. But on our side,
we have all the data. We just need to push it out.
We want to make it easier and seamless to integrate
with different ecosystems. So that's what we're working with.
And I think there's a lot of interesting development
in this area from different cloud vendors,
different data vendors in this space.
And I'm pretty excited to be working on this.
Gosh, well, I have a thousand questions,
but Costas, I'm going to hand the mic over to you
because I've been monopolizing.
Thank you, Eric.
Kevin, before we go back to the data product case
that you just talked about,
I want to go back to the tool that you mentioned
that became really popular inside Stripe.
And you mentioned that it was just like a text box
where you could write a SQL query and run this query, right?
And my question is, in a world with so many BI tools out there,
so many hours spent on figuring out what's the most efficient way
for someone to interact with data through a graphical user interface,
why did this tool become so popular?
And what was the need that it was fulfilling and couldn't be served by all these BI tools
out there?
Yeah, that's a good question.
I think why this tool was made in the first place was kind of before my time.
But one thing I do know is that I really enjoy using this.
And so do a lot of people in the company.
And I think I have been trying to figure out why it's so popular, why it's so successful.
I think it's just, one, it's very simple.
The interface is very simple.
It accomplishes what you want
So, like, you know, you write some SQL, you get some data back. There's simple filtering. And,
you know, if you press graph, you can turn it into a line graph, a pie chart, whatever you want, right? But a lot of the most used features
are these features with reasonable defaults, right?
So it's very powerful for me to just write a query,
you know, select date of whatever
and max, like, aggregate whatever,
and get a result,
press "turn this into a line graph," and boom, that's all you get, right? And if you want to tweak it more, you can go in and
write more visuals and whatnot. But I think for the majority of folks doing analytics,
that's enough. I know for me it's very useful. And I think Trino
being the back end of it really powers this kind of magical "wow, it's so fast" kind of thing.
And it being federated as well, we're able to connect a lot of other different data sources.
So what we were talking about with the attribution of different queries, we threw that
into a database and connected it back. And now your data ecosystem is all connected. So I can query
on this interface how many queries were run in the last hour from the ad hoc stuff,
versus just from the service stuff. So it's just very kind of central to our data ecosystem.
And, you know, I was looking at Superset, right?
And I was trying to figure out, like, okay, well, can we migrate to something
open source?
And I think the difference between Superset and what we use, at least, you know, when I prototyped with it on my own time, is these very simple defaults.
Yeah.
There's like two or three features that everyone uses and everyone loves.
And with Superset, it's a little bit more difficult to set things up. But that jump in difficulty really is a big factor
when you're working with tooling. Yeah, that makes a lot of sense. And it's super interesting.
And then you also mentioned exposing endpoints to work with data, right? So
you're not just offering, let's say, a way for people to go and visualize
the data, but you also want builders to go and build on top of the data, to integrate with
the data infrastructure, right?
Right.
So how do you do that?
I'm assuming also that, okay, the typical use case around BI and the OLAP concept is that you don't have too many concurrent queries.
It's much more that things tend to take much longer to complete.
It's a very different, let's say, set of trade-offs that are assumed there, right? Compared to, I don't know, having, let's say, someone from the front-end
engineering team decide, oh, now I have this data.
Let's create this service that it's going to be hitting every second or
sub-second or whatever, right?
So how do you balance that, right?
Because we're talking about like opening opportunities
to, you know, like every possible use case out there.
And some of them might not be, let's say,
compatible with the basic data infrastructure.
Yeah, I think that's exactly right.
I think the API is both a blessing and a curse,
I would say.
It makes it very easy to integrate with all of the environments that we have, all of the different languages,
because HTTP is pretty universal. But on the flip side, a lot of our compute costs could
be reduced if you are in the Java environment and you're working with Iceberg
to just go and use the native Iceberg library, right?
Instead of round tripping through compute that goes through Iceberg and then back again,
you can really just, you know, go and read from the source.
So that's been something that we've been struggling with.
And that's something that's just an optimization at the end.
But the pro case for opening up this as an API is that integration is much easier.
Getting things done is much easier.
Getting data is much easier, no matter where you're working on whatever repo,
whatever language, whatever environment.
Totally with you on like a lot of the time,
it's not the best way to do it,
but, you know, for now,
kind of being able to build out these use cases
without being blocked by
how do I get this data
has been very useful
for Stripe to build out
different features, different products.
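As a sketch of the "skip the round trip" option Kevin mentions, here is what reading an Iceberg table directly with the open source pyiceberg library might look like, instead of going through an HTTP query endpoint. The catalog URI and table names are hypothetical:

    from pyiceberg.catalog import load_catalog

    # Hypothetical REST catalog endpoint; this could also be Hive
    # Metastore, Glue, or another backend.
    catalog = load_catalog("internal", type="rest", uri="https://catalog.example.com")
    table = catalog.load_table("warehouse.events")

    # Plan the scan client-side and read only the needed files and
    # columns; no query cluster sits in the middle.
    result = table.scan(
        row_filter="event_date >= '2024-01-01'",
        selected_fields=("event_id", "amount"),
    ).to_arrow()
    print(result.num_rows)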
Yeah, 100%.
I think it's
a testament to the culture of the company.
You promote
creativity and control
over the resources.
That's the trade-off that you're making there,
and it makes total sense.
And I think it's a trade-off that always exists
with engineering.
When you start optimizing,
then usability usually goes down
unless you narrow down the use cases a lot.
So it's this balance between,
okay, how accessible I'll make my systems
versus how robust, let's say, I'm going to make them,
and all these things.
And it's always a dance that's very delicate there.
And it's very interesting to see how this is performed
in a company like Stripe, right?
I think we over-index on, well, we're not over-indexing. I think we value being able
to unblock and facilitate product development and feature development, and have, you know, folks
not be blocked on accessing data. Yeah, that's kind of something that I've been really fond of, working at Stripe.
That's amazing, actually, especially at the scale of a company like Stripe.
Because these queries, at that scale, cost a lot of money.
When you're at that scale where, let's say, 1% performance gains
translate into probably millions of dollars, right,
things are much more complicated.
So it needs to be part of the culture of the company to promote that.
And that's amazing, I think.
All right, let's go back to the pipelining stuff.
Because that's also very interesting.
So as you said, there have been vendors out there for quite a while now, right? Facilitating the exporting, extracting of data, and loading of data into other systems,
like Fivetran.
Why does Stripe want to get into that business, in a way, right?
What's the value for someone like Stripe,
where, okay, the core competence of the company is not moving data around, right?
It's processing payments.
Why is it becoming so important today
that Stripe actually, you know,
dedicates resources to go and find a robust solution for that?
Yeah.
I can give you what I think is the answer, right? So, you know, Stripe is pretty innovative in that
a lot of the features that get developed, the roadmap, a lot of it is
driven by the customers themselves. So you probably go on Twitter, see a bunch of people,
product leads, co-founders ask, hey, how do you want to see Stripe improved? What part of it
do you want to see improved? We have Friday firesides where other company founders come in, talk about how they use Stripe.
And the question is, what don't you like about it?
Where can we improve?
And I think with that mentality, a lot of on the data side has been a natural progression of what the customers want.
So Stripe Sigma, so it's essentially a SaaS on Stripe.com where you can write SQL to
interact with your own data.
So that was the first iteration.
And it's very similar to what we have internally, you know: just a website, a SQL dialog,
and a run button, and it returns you the data, right?
So that came out of, like, you know, customers wanting to interact with their data, right?
And for SMBs, people without their own data infrastructure, that's pretty good,
right? You go and do a bunch of
SQL analysis just through Stripe. And then for enterprises, they don't want to use that.
Maybe their data size or their regulation, just privacy,
some reason they don't want to use that product,
but they still want to interact with this data.
So there's been a need to provide this data
to our customers.
And the need is pretty validated, right?
Like, you have other companies that,
you know, these merchants go to to
say, hey, I want my Stripe data, and you can give it to me. I don't care how, just give it to me,
and I'll pay you for it. So then the natural progression is, like, well, why go through the extra step?
And a lot of the time, you know, the way that these companies get data is also pretty costly.
They call the APIs, write them down, send it to other companies.
So the natural progression is like, okay, how do we do this in a way where our customers benefit and we can also turn this into a product?
So that's kind of been the line of thinking.
And I think the way that it was started at first was a customer ask.
Like a pretty big customer asked for this.
They're like, hey, I don't want to work off of your website.
I have my own data engineering team.
I have my own data engineering ecosystem.
Just give me the data.
Let me do what I want with it.
And then, you know, more and more companies come in to ask for this.
Yeah. Right. The way we see it is there's segmentation: you know, SMBs can use Sigma and enterprises can use Stripe Data Pipeline. Yeah. Makes sense. So what's the difference
between someone using, let's say, like a third-party vendor that is going to continuously hit the API
or Stripe to export data and reload the data
on the S3 bucket of the customer
with what Stripe does with their pipelines, right?
And let's talk briefly about, let's say, the product experience,
if you can talk about that.
But also, most importantly, about the technology.
What's the difference there?
In one case, we have HTTP, right?
As you said before, it's pretty inefficient, but it's pretty universal at the same time.
But maybe there's a better way to do that. So what are the technical choices that you as an engineer make to enable
a different product experience at the end, right?
Right. Yeah. And this is where I
really believe the
next generation of this product is. If you go to
Stripe Data Pipelines right now, we have GA in Redshift and Snowflake.
As a merchant, you can sign up for this product and you can get your Stripe data in your Redshift
cluster, in your Snowflake cluster.
And we do this in a way where we get our data from our source of truth.
The reliability factor, the data consistency, data correctness factor, we take that on and
we guarantee that.
In a way where anything that happened upstream, we can just say, here, we calculated the source of truth.
Let me push the data out to you.
That's very difficult when you have a man in the middle with a third-party vendor.
I'm sure there's a way to solve it. But at the end of the day, going from the source is a lot cleaner.
It's a lot easier for both Stripe and the merchant.
But API calls are expensive.
If you're scraping a website, the API calls get super expensive.
When you're scraping Stripe, there's a cost to Stripe as well. And internally,
migrating all those API calls onto this product is just a win.
I think in terms of technology, something that I'm really interested in is just the idea of data sharing, right? Like, you know, API call is one of them.
SFTP is one of them.
A lot of these things are very old.
Well, not old, but, you know,
they're proven methods from, like,
the 80s and 90s.
And with a lot of the developments in the data space,
data infraspace,
especially with a lot of cloud vendors,
with a lot of data vendors,
innovating on a bunch of different data sharing technologies,
I think Stripe is in a good position to piggyback off that
so then we can offer our merchants integration with all of these ecosystems.
So something that has been going on in the industry is the rise of Apache Iceberg.
Something I just saw recently, I think last year with Salesforce and I think Snowflake,
there's a blog post that said
they're integrating Salesforce
data with Snowflake.
One click or zero click,
zero ETL, whatever.
You can get your
Salesforce data in Snowflake
super fast, super easily.
Right?
We see
the same thing for Stripe, right? We want to give you your data on Snowflake, Databricks, AWS, Azure, like anywhere that your data is set up, we want to be able to give you that data. And I think with the rise of the lakehouse kind of architecture, where compute is separated
from storage, that really helps our case.
Because right now we publish to specific warehouses, right?
It has to be Redshift.
It has to be Snowflake.
But with this lakehouse architecture, we want to publish the storage, and you bring
your compute, and the integration should happen seamlessly.
We can use Iceberg, we can use different technologies to facilitate this, but the core concept of
we'll give you the storage, you bring your compute. I think it's very
exciting to me for the next iteration of this product.
So just to understand the use case here with Iceberg: the way that you see it is that the data lives, let's say, on Stripe,
but the user has, let's say,
the capability to choose where to expose this data
through Iceberg, right?
So an external query engine can just go and query that.
Or you see more of like, okay, this is your data.
We're going to export it on your own S3 bucket
because that's your storage and you want to have it there.
And we are going to do that by using Iceberg.
So it's easy then to go and expose it
to different query engines and all that stuff.
Which one of the two approaches is usually more
favorable for the users out there? Yeah. I think there's multiple levels of abstraction.
At the core, we're exposing some data where the merchant wants to be able to interact with that data, right?
We can throw it into a SFTP server as a CSV, right?
Or we can throw it onto Azure or AWS S3 as, like, Parquet files, right?
And then it's about bringing where the merchant is and their ecosystem into our own ecosystem.
So Iceberg is one of the abstractions.
We can throw our files on S3 and create some kind of catalog to represent them. The reason Iceberg is so popular, or so interesting for us, is that all of these
vendors, all of these compute systems, are now integrated with Iceberg.
So this is a step kind of removed from us, an extra step that we don't have to do, where
if we just deliver something in Iceberg, you can read it in Snowflake, you can read it
in Databricks, you can read it with Athena, with Redshift.
It's about us taking the data and making these levels of
abstraction so that our merchants can integrate it
in a better way. If our merchants want
Delta tables, we have the underlying files. We just need to generate
some metadata and boom, you have Delta tables, right?
Yeah.
So for us, it's about thinking through
where we want to meet our users
and where they are,
where their ecosystem is,
and kind of meeting that demand on our side
and enabling them to get the data.
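One way to picture the "we give you the storage, you bring your compute" hand-off: if the delivered files are Iceberg, the merchant can register the shared table into their own catalog and query it with whatever engine they already run. A hedged sketch with the pyiceberg library; the catalog, names, and paths are made up:

    from pyiceberg.catalog import load_catalog

    # Merchant side: load your own catalog (Glue here, hypothetically).
    catalog = load_catalog("merchant", type="glue")

    # Register the delivered table by pointing at its Iceberg metadata
    # file; no data is copied, the files stay where they were delivered.
    table = catalog.register_table(
        "stripe_share.charges",  # hypothetical namespace.table
        "s3://merchant-bucket/stripe/charges/metadata/v42.metadata.json",
    )
    print(table.schema())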
Yeah, makes a lot of sense.
And one question here, because, okay,
I think the value of decentralizing the data
in this way is obvious, right?
Both from an engineering perspective,
in terms of efficiency there,
but also from a business perspective
of not having to, okay,
use 100 different tools
and all these different vendors
and pay for all that at the end,
without having the best possible experience
and maximizing your value.
My question, though, is,
okay, in this highly decentralized environment
with all these different options,
how can people keep track of what is available
to them, right?
How can they find the data that they need? How can they know that this is the right data?
Like, yes, of course, you can create some metadata and create Iceberg tables and have
a catalog that a system can go and access.
And it can be, like, a Hive metastore, right?
But then if you go to something like Snowflake, then probably you need a different catalog
to be populated there for that to happen, right?
So we get to this meta problem, in a way, of how do we keep consistent and available all this
metadata that is needed in order for people to go and figure out what they can use and how to
work with it, right? So first of all, do you think this is a problem or might be just in my mind,
right? I don't know. And if it is, what are the possible solutions out there?
Yeah.
No, I think it's definitely a problem.
Well, not a problem.
It's just the way that it's set up, right?
Iceberg and any table format, it's essentially your data with some metadata.
Yeah.
You have to keep your metadata somewhere.
And for Iceberg, it's like a catalog, right? The catalog just does the translation of like,
here's my table and here's everything I know about this table,
where it is.
You have Hive Metastore, you have Glue,
you have REST Catalog.
I think this concept of catalog is super interesting.
When you're talking about these table formats, it's essentially the
abstraction that a lot of these vendors are using to not lock you into their ecosystem,
but it's one of those things that's difficult to work with when you're across many ecosystems, right?
So you can have an Iceberg table in Snowflake, but if it's
managed by Snowflake, it's in their own catalog, right?
And if maybe you're like an enterprise and you have multiple different ecosystems, you
want to use Snowflake and Databricks and something else and
Athena, right? Where your catalog is determines which systems you can use. So if you have an
Iceberg table that's in Snowflake only, the Snowflake catalog, it's really difficult for you to use that in Databricks.
If you have a Unity catalog, which also works with Iceberg,
it's hard to export that and put it into Snowflake.
Now you need integrations between these catalogs.
And this is where Iceberg's kind of
innovation with the REST catalog is, I think, very interesting:
they're just saying, there's a REST protocol, it represents a catalog, and you can plug and play
whatever backend you have, right? And it's a level of abstraction that kind of
does away with
the details and
the vendors and everything.
I think what it means
for us is
we're still trying to flesh
out how this works.
If we want to integrate with
table formats,
where are we going to store our catalog?
Yeah.
Do we need to store multiple copies, right?
Like, do we need one in Glue for AWS?
Do we need one in Unity for Databricks?
Like, now you have this kind of lock-in at the catalog level.
Yeah.
How do we get out of that? I think those are
interesting questions. A lot of the integration is happening, too. You know,
Glue is able to be read in other places. But with these vendors, a lot of it is: we make it easy for
you to read from other catalogs, but we make it hard for you to read out
anything that we have. So, you know, it's an interesting kind of time period that we're in.
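The plug-and-play idea behind the REST catalog, sketched with pyiceberg: the client speaks one REST protocol, and whatever backs the endpoint can change without touching the reading code. The endpoint and table are hypothetical:

    from pyiceberg.catalog import load_catalog

    # The same client code works no matter what actually backs the REST
    # endpoint (a database, Glue, a vendor service, ...).
    catalog = load_catalog("shared", type="rest", uri="https://catalog.example.com")

    # Discover what is available, then read; swapping the backend later
    # requires no change here.
    print(catalog.list_namespaces())
    table = catalog.load_table("analytics.orders")  # hypothetical table
    print(table.scan(limit=5).to_arrow())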
That makes total sense. Okay, I think we should have another episode just talking about
catalogs, to be honest. But we are close to the end here, and I would like to let Eric ask any other questions
he might have. So Eric, all yours again.
Yeah, Kevin, I think it's been so interesting
to hear you talk about
a lot of the practical ways that you're solving problems
day to day with your infrastructure.
But you are a very curious guy.
And so I'm dying to know,
when you look out at the data landscape in general,
what are the most interesting new projects that are exciting to you?
Maybe even in the open source,
because I know that's exciting,
when you sort of remove yourself
from the limitations of the infrastructure
you work in every day.
Yeah, I think Iceberg has definitely been on my list.
I've been kind of participating on the Python Iceberg library, just contributing there.
I think a lot of the disaggregation of different database components, like OLAP components, right? Like, I think of our current infrastructure as databases kind of just turned inside out,
into different services, essentially.
Yeah, yeah, yeah.
You know, there's compute, and S3 and Iceberg are like storage.
And now people are building indexes, are building all these features on the side.
So I think a lot of what interests me is Apache Arrow.
So then you can integrate these systems together.
Sure.
Like DataFusion, where you can have components
of your traditional databases and work with them
in a way
where you can have your planning,
your compute layer, your storage layer
in different libraries, and then you can mix and match.
So a lot of these foundational core pieces of the database
are now being ripped out and brought into these open
source projects. So, you know, I'm very interested in seeing the development of those. And there's
a lot of active development in those fields. And we'll see, you know, maybe
in a year or two we'll go back to what a traditional database looks like,
but just in the cloud with all of the bells and whistles.
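A tiny sketch of that "database turned inside out" idea, using the Apache DataFusion Python bindings with Arrow as the interchange format; the file path is hypothetical:

    from datafusion import SessionContext

    # Storage layer: plain Parquet files registered as a table.
    ctx = SessionContext()
    ctx.register_parquet("events", "/data/events.parquet")  # hypothetical path

    # Planning + compute layer: DataFusion parses, optimizes, and executes.
    df = ctx.sql("SELECT count(*) AS n FROM events")

    # Interchange: results come back as Arrow record batches, ready to
    # hand to other Arrow-native components.
    print(df.collect())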
Yeah.
Well, Kevin, this has been such a great conversation.
Thanks again for joining us for the show today.
Yeah, thanks for having me.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app
to get notified about new episodes every week.
We'd also love your feedback.
You can email me at eric at datastackshow.com. That's E-R-I-C at datastackshow.com.
The show is brought to you by RudderStack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.