The Data Stack Show - 182: Building a Dynamic Data Infrastructure at Enterprise Scale Featuring Kevin Liu of Stripe
Episode Date: March 20, 2024

Highlights from this week's conversation include:
Kevin's background and work at Stripe (0:31)
Evolution of Data Infrastructure at Stripe (2:18)
Kevin's Interest in Data (5:29)
Software Engineer or Data Engineer? (8:27)
Speech Recognition Work at Amazon (11:06)
Efficiency and Cost Management (15:50)
Metadata and Query Analysis (18:38)
Surprising Discoveries in Metadata Analysis (21:43)
Optimizing Cost and Value (23:55)
Productizing Stripe Data (26:39)
Popular Tool for Data Interaction (30:08)
Enabling Data Infrastructure Integration (35:22)
Value of Data Pipelining for Stripe (39:32)
Next Generation Product and Technology (43:54)
Maximizing value in a decentralized environment (51:34)
Future of open source projects in data infrastructure (57:59)
Final thoughts and takeaways (59:02)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Discussion (0)
Before we start the show this week, we've got a quick message from a big show supporter,
Data Council founder, Pete Soderling.
Hi, Data Stack Show listeners.
I'm Pete Soderling, and I'd like to personally invite you to Data Council Austin this March
26 to 28, where I'll play host to hundreds of attendees, 100-plus top speakers, and dozens
of hot startups on the cutting edge of data science, engineering, and AI.
If you're sick and tired of salesy data conferences like I was, you'll understand
exactly why I started Data Council and how it's become known for being the best vendor-neutral,
no BS, technical data conference around. The community that attends Data Council includes some
of the smartest founders, data engineers, and scientists, CTOs, heads of data, lead engineers, investors,
and community organizers. We're all working together to build the future of data and AI.
And as a listener to the Data Stack Show, you can join us at the event at a special price.
Get a 20% discount on tickets by using promo code DATASTACK20. That's DATASTACK20. But don't just
take my word that it's the best data event out there.
Our attendees refer to Data Council as Spring Break for Data Geeks.
So come on down to Austin and join us for an amazing time with the data community.
I can't wait to see you there.
Welcome to the Data Stack Show.
Each week, we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by RudderStack, the CDP for developers.
You can learn more at rudderstack.com.
We are here on the Data Stack Show with Kevin Liu.
Kevin, thank you so much for giving us a little bit of your time today.
Yeah, thanks for having me.
All right. Well, you've done a couple of really interesting things in data,
but just give us your brief background.
How did you start and what are you doing today?
Sure. I'm currently a software engineer at Stripe.
I've been working there for around three years,
working with data infrastructure. So a lot of open source technologies such as Trino and Iceberg; my team
powers our internal BI analytics. And recently I've taken on another challenge on the data product side. The product is
called Stripe Data Pipeline.
We essentially enable merchants to get their Stripe data back into their
warehouse, into their data ecosystem, in an efficient way.
This is great.
So actually, I've known you, Kevin, for a while now. We've been talking
since the time when I was at Starburst,
about Trino specifically.
And I'm very excited today because I had the opportunity
and the pleasure to work with Stripe quite a few times.
And it's one of these companies that has been around long enough
to go through
many changes, but is always trying to stay at, let's say, the forefront of what
is happening out there.
For example, like very early adopter of Spark, right?
I'm pretty sure you probably still have pipelines in Scala in there because of that.
And you keep innovating.
You are open to using new technologies, and many things have
happened in these past 10 years, let's say. So having you from there,
and you being there long enough to see at least the past
three, four years of the evolution, I think will give
us a great opportunity to talk about where data infrastructure stands today and what,
let's say, some interesting problems are.
And also, based on your latest move into turning data into products,
to talk about that, because I think it's a very important next evolution step when
it comes to infrastructure around data.
So that's what I'm really excited about today.
What about you?
What are a few things that you'd love to talk about?
Yeah, I think in general, I've been really happy working at Stripe, just because the company is known for its size, for the kind of
engineering culture there. It really helped me learn and get to understand a lot of what is going
on, especially in the data world. Kind of, you know, what is the newest and
shiniest thing that we can work with, right?
So I took a database class in college, didn't think much of it, came to Stripe, started
working with OLAP systems, Trino, Iceberg, and it was very new to me.
But then eventually I started to realize that it was new to the industry as well.
That's been really exciting to me, to say, okay, well, how do I take this new concept,
how do I run it efficiently at Stripe, and then how do I help the community, because it
is an open source project? How do I take the ideas
that we come up with
and share them with the community as well?
And then on the data product side,
I think Stripe is positioned very well
to do data sharing.
Not a lot of companies can do that,
because not a lot of companies have
data whose value
can be shared with their customers in a way where the customers are
asking for it on a daily basis, right? So, you know, I'm still learning. I think I just want to share
some ideas with you guys, and yeah, happy to talk more about things.
Yeah, let's do it.
What do you think, Eric?
Are we ready?
I was born ready, Costas.
I was born ready.
I know that.
Let's do it.
Let's do it.
Kevin, so excited to have you on the show.
And we have some really exciting subjects to talk about.
You gave a brief introduction, but I'm interested to know, sort of going back to the beginning,
what sort of sparked your interest in the data side of things? So you have a background as a
software engineer, but what drew you into the data aspect of software engineering?
Yeah, I think I always like to just dabble around different domains on the internet.
And I think data has been one of those things that just stood out to me, in terms of my work, right? You know, I work on data infrastructure at Stripe, kind
of making it so that thousands of Stripes have access to data.
That's just been really interesting to me, to see how that kind of evolved from your traditional data warehouse.
And I think the open source aspect of it also drove me to participate more, to join
communities, to learn from each other, and to share what I've learned. I think that's
been really motivating for me
to work in this field.
And obviously, you know, in the recent months, years,
there's been a lot of new developments.
You know, I was watching the history of databases,
and they're calling this, you know,
data lakehouse a new wave.
And in a way, I do believe that it is a different paradigm from before.
And I see it firsthand and it enables a lot of interesting features and value to derive from there.
Fun fact, we used to run on Redshift until we couldn't run on Redshift anymore.
We migrated to Trino and to Iceberg
with open source technologies
and we see firsthand
how much value it provides to the company
and how, you know,
folks who use it on a daily basis
think it's like magical, right?
That we're able to, you know, analyze petabytes
of data super, super fast, right? And at Stripe especially, we have a way for
people to interact with data very easily. We have an internal tool that you can just go to and just write some SQL.
And so that approach of democratizing data at the company has been very well accepted at Stripe.
Yeah.
I have a question.
So your title is software engineer, but you work with a ton
of data stuff. Just out of curiosity, do you kind of consider yourself like more of a software
person or a data person? I know that title can be a little bit abstract because it can mean so
many things, right? And in some ways, building a data platform is what you've been doing. But
yeah, just interested in your perspective on that. Yeah, sometimes I think to myself too,
when I first learned the term data engineer,
I'm like, am I that?
Am I a data engineer?
I don't know.
I'm not sure.
I mean, my day-to-day goes from SQL to front-end to BI
to distributed system to like,
every part of the data infrastructure,
we kind of have some kind of lever that we can pull.
In a way, yeah, a lot of what I do is considered data engineering,
but I think especially on the data infrastructure side,
there's a lot of software that, you know, exposes a good interface, but sometimes
you really need to dig into the internals of it. And this is where big open source and having the
community is great, because a lot of the time we're able to talk with other folks at other
companies who also run infrastructure, and share what we learn with each other and
share with the community.
I, you know, went to a Trino Fest event like 2021, 2022 and learned a lot.
And I came back to my team and like, hey, you know, Lyft runs their data infra, their
Trino clusters very efficiently.
What can we learn from them?
So a lot of those things I really enjoy.
And I guess that's what software engineers do.
I'm not too sure.
I don't know.
Like, we don't have data engineer roles at Stripe.
So I'm not really sure what that means either.
So, you know, I think I do a little bit of both.
Yeah, yeah. I mean, you know,
I think that's actually, you know, part of the reason I asked the question is that,
you know, as we think about, like you said, there's all sorts of interesting new developments,
right, in data technology and operating platforms. And so it is really interesting to think about
the confluence of multiple different skill sets that are really useful when running, you know, large data systems.
Okay, I have a ton of questions about Stripe, but I want to jump back just a little bit.
And you worked on some speech recognition stuff at Amazon previously. And I just have to ask about that, especially after hearing you talk about,
you know, sort of being a data person and a software person: did those two things come
together in that work as well? Because, you know, you're sort of dealing with massive amounts of
data and then, you know, trying to build a system that can essentially operationalize it.
Yeah, I think in a way, yes.
I think, I forgot who I was talking to,
but I was talking to someone
with a lot of years of experience in the industry,
a software engineer,
and they basically told me that,
you know, software engineer and writing software
is essentially just moving data around.
So I think my role in this data engineering, big data world is being a software engineer and specializing in that.
I worked on the speech recognition system for Alexa,
and we were kind of supporting the data science team there.
So a lot of the job is, you know, how do we provide the right abstraction for data scientists,
for ML engineers to run their speech recognition model?
How do we have the right environment for them to do their work in a way that produces value, right?
Yeah.
And it's the same thing at Stripe.
I think a lot of our work enables folks from other parts of the company to do their job
and to get whatever they need, whatever data they need, whatever insight they need in a
fast and efficient way. Yeah, absolutely. Well, let's dig into the world of
Stripe. So can you give us a little bit more detail on what you've done at Stripe?
What are the big projects that you've worked on and built?
Yeah, we did a bunch of stuff at Stripe in the years that I've
been in. I was talking to a co-worker before, and we were kind of reminiscing about the projects that we took on, and it just felt like a decade ago. So when I first started at Stripe, the whole company was in this big project to support India.
And it was really interesting to me because India has this concept of data locality, where it's not a concept.
It's a law that Indian merchant data should not leave the continent.
Right, yeah, it stays within the borders.
Yes, I'm familiar with this, yeah.
Yeah, which breaks
the concept of like software engineering
and like abstraction layer
and everything, right?
Because now your data
is physically in some space
instead of, you know, just data as blobs in S3.
So that's the first project that I kind of worked on.
And that actually required kind of a foundational shift at Stripe to say, you know, apply this concept all the way down the stack, and make sure that we're supporting it
everywhere.
So that was really interesting for me to see
in that Stripe scale
to support this
strange concept
that's outside of
what software
engineering has taught me.
And then
a lot of what my team supported
was our internal kind of data analytics BI product.
So we have a very popular kind of internal tool
called Hubble,
which essentially is just a text box of SQL
and a button that you can press for running the SQL and you
know, you get some results back, right?
Very simple interface, very well received.
I think daily active user count was in the thousands, apparently.
I went to the Seattle office and walked around, and, you know, folks all have it up. And we work a lot on the
front end and the back end, which is powered by Trino, the various components. So we had
Hive tables, we had Iceberg tables. So, you know, my role was really a little bit of everything in that.
And, you know, recently, last year, the year of efficiency, what we worked on and focused on was tracking our spend and seeing, like, what exactly are we paying money for? Paying EC2, paying this?
So we did a lot of work around metadata and especially attributing what is going on in our infrastructure.
So, for example, whenever someone presses run, we want to be able to say, okay, this query was run,
and hopefully, for this reason, right?
And to compound the issue, we also expose an API endpoint.
So a lot of integration is done in this like SQL format, right? There can be cron jobs,
there can be event handlers to say,
when this happens, we want to do something,
find some data in our data infra and then perform something else.
So a lot of the "let me get data,
let me find data, let me work with data"
works off of this endpoint.
And that is where it's very easy to have runaway costs because once you expose the internal endpoint
and once everyone at Stripe wants to integrate with it
because it's very easy to just send a query to it,
for us on the infra side, very quickly, we need to figure out, you know, like what is
actually happening and what are we spending money on?
Because over the years, we just assumed that it's natural growth, right?
Like, every couple of months we say, okay, well, Stripe is growing by this much, the
business is growing by this much.
So the compute need naturally grows with it.
So let's just turn up our cluster, right?
Let's add a new cluster.
Let's add new machines.
But when efficiency is important,
and when we, I mean, we know that over the years
we valued growth over efficiency,
but when it's time for efficiency,
we really had to like hunker down
and figure out what exactly
we're spending on.
I want to ask you about,
so you have metadata on a query being run.
How did you tie that back,
or how did you go discover the why?
Because a lot of times I would think
that's sort of the big question.
And just what comes to my mind is that
a lot of times analytics projects can be ad hoc, right?
Where you need to run a bunch of queries
on a bunch of data to answer a question,
but then when you answer it,
you sort of have the insight you need
and then you sort of move on, right?
It's not like that's a persistent report or whatever.
So how did you figure out that why or whether something was ad hoc or ongoing?
Yeah, I think the first thing we wanted to figure out is a big picture of like what is
happening.
So we know there are certain
kinds of data operations
going on. We know there's ad hoc analytics.
We know there's BI
reporting. We know there's
operational stuff,
like "tell me when
something happens."
We know that there are a lot of these
use cases, and there's an
ever-growing amount of use cases.
From the infrastructure side, we treat these all as kind of the same, even though they aren't.
Like ad hoc analytics require a different latency spec than service, right?
Like if it's a cron job, it just wants to run in the next 30 minutes,
whenever, versus, like,
if it's ad hoc, someone's waiting.
But for us, on the data infrastructure side,
like, we wanted to see
exactly what is going on
throughout, kind of, the realms, right?
So the first step was actually
just to collect that data, right?
Do we know how many people are running ad hoc queries? Do we know how much of our compute is spent on dashboarding,
on service queries, on this and that? So this is where the metadata comes in. And depending on how
you structure the metadata, you can really slice and dice your way into the different kinds of usage.
So for us, the first thing we did was, like,
you know, we know specific services have specific queries.
Yep.
Like, this website we have internally, most
people go there for ad hoc stuff, right?
This cron service that we have, you know,
a lot of these services also build out their own services. So this cron
service actually has
different teams under them.
So how do we ask the cron
service, like, give us more information
so we can slice and dice that
too? So you
really kind of get into a realm
where, you know, in the cron service, every time you send
a query to us, give us as much information as you can about it. And this is easy because, like,
we own it all, right? The code base is all Stripe's. We can go to that team and say, hey,
you know, I want to add extra metadata every time you send us a query. They're like, okay, cool.
It doesn't matter much to them, right?
It's not that big of a deal.
But for us, it is, right?
For us, we see that this query is from this cron service,
which is from this team,
which is from this task that runs every so often.
Now you really get into the kind of analysis part of it,
with just, you know, three fields in your metadata.
Yeah.
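For readers following along, here is a minimal sketch of the kind of attribution Kevin describes, assuming the open source trino Python client, which can attach client tags to queries. The hostname, tags, and table are hypothetical, not Stripe's actual setup:

    import trino

    # Hypothetical: a cron service attaches a few attribution fields to
    # every query it sends, so the infra team can later slice compute
    # spend by team, service, and task.
    conn = trino.dbapi.connect(
        host="trino.internal.example.com",  # hypothetical endpoint
        port=443,
        http_scheme="https",
        user="cron-service",
        # Client tags travel with the query and show up in Trino's
        # query metadata and completion events.
        client_tags=["team:payments", "service:report-cron", "task:daily-refresh"],
    )
    cur = conn.cursor()
    cur.execute("SELECT count(*) FROM warehouse.events")  # hypothetical table
    print(cur.fetchall())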
I have to know, what's one of the most surprising things that you and your team discovered when you started slicing and dicing the metadata?
Yeah.
So, you know, we always know there's like some inefficiencies in our system.
And, you know, at a hyper growth company, it happens.
And, you know, sometimes the best thing you can do is to, you know, focus on the most impactful things.
And sometimes it's not cleaning up stuff. So I think once we started gathering data, the most egregious thing we found was that
there is a cron service that runs every hour.
And what it does is it just runs select max(updated_at) on a table.
Pretty simple.
I just want to know the last time this table was updated, right?
But then when you dig into the details: this table, maybe
when this query was first set up two years ago, was a couple of megabytes,
a couple of gigabytes.
This table is now like a petabyte of data.
This table is not structured correctly, not partitioned correctly, so your max(updated_at) is now
doing a full table scan of, like, petabytes of data, right? And now you're doing this
in a distributed Trino environment, where you can have, like, 10, 100 machines running. It takes around two, three CPU days
to run one of these queries.
And then you see that this query is run every hour
on a cron job.
So you multiply all those factors
and we're spending so much compute
on this one simple query.
And then you go back and you say,
okay, well, who owns this?
What is this for?
Can we tell them, you know, Trino's Iceberg connector has this concept of a metadata table, where
you can look at the metadata instead of doing a full table scan?
It's like, okay, well, this is how we're going to optimize it.
We find the team is no longer around and they don't need this.
Right?
So this whole process where we're doing this much compute for zero value.
Yeah.
And there's a lot of that we found that was very surprising.
And, you know, for us, it's great.
It's all savings, right?
We can take a lot of these and say, okay, well, you know, every so often we'll just write a report, do some analysis and stop this from happening.
But it was just really surprising from our side to find something like that.
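To make the fix Kevin describes concrete: Trino's Iceberg connector exposes metadata tables such as "$snapshots", so the last-update check can read a handful of metadata rows instead of scanning the data. A hedged sketch with made-up names:

    import trino

    conn = trino.dbapi.connect(
        host="trino.internal.example.com",  # hypothetical endpoint
        port=443,
        http_scheme="https",
        user="analyst",
        catalog="iceberg",
        schema="warehouse",
    )
    cur = conn.cursor()

    # Before: full scan over a petabyte-scale table just to find the
    # most recent update.
    cur.execute("SELECT max(updated_at) FROM events")

    # After: read Iceberg snapshot metadata instead; the latest commit
    # timestamp tells you when the table last changed.
    cur.execute('SELECT max(committed_at) FROM "events$snapshots"')
    print(cur.fetchone())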
Yeah, I guess, you know, I can see both sides, right?
On the one hand, it is surprising to see that, where you're like, okay, maybe this one wins the award for most expensive
query in the history of the company. But at the same time, I mean, Stripe is, you know,
a huge company growing fast. It was probably a significant need and things change, right? And
it's, you know, everyone knows it's really hard to, I mean, I would guess also with something
like that, you know, if you don't have the context, it's scary to go back in and touch stuff like that
because it may be running some really important piece of the business.
But yeah, man, I can't imagine the cost of that.
On the infra side, a lot of the problem that we have
is the disconnect from what these things are used for. That really helps push us to go specifically to the domain and ask, hey, I see this is happening in our system.
Like, what is happening? Can we help optimize it? Because, you know, the domain experts might not know how exactly to write this
query to get the same result,
but in a better way.
Sure.
But on the infra side,
we know how to give you that,
right?
We can say, now write it
with the metadata table,
and now you're reading, like, a few megabytes of metadata instead
of doing a full table scan.
But that disconnect is where this helps facilitate as well. And obviously it would
be great if we could automate all of this and no one had to think about it, but a lot of
the time you have to push all the way up to the domain and kind of figure it out from there together.
Yeah, yeah. I mean, my opinion and interested to see if you agree with this is that,
you know, it's not necessarily the responsibility of that end user to understand how to optimize
that, right? They're trying to pull data so that they can do their job. Yeah, super interesting.
Well, let's change gears just a little bit here.
One of the latest projects that you've been working on is actually productizing Stripe data,
which sounds absolutely fascinating. I know Costas has a million questions about that, but
can you just describe that concept? What was the sort of need and what's the project like?
Yeah.
So this is how I've been internalizing this, right?
Stripe is a API-first payments company,
or at least when it first started,
that was the flag that we have, right?
We have a set of APIs where you can interact with and you can work off of the global payments rail.
Super cool idea.
This evolves into, I have a set of reporting APIs.
As a merchant, I do a bunch of stuff with Stripe.
Stripe helps me facilitate a lot of payments.
Now I want information back to say, you know, how many payments have gone through?
How much money have I gone through with Stripe?
And either I can keep a system of record on my side, right?
Every time I send Stripe some information, I also
keep some information. Or, you know, Stripe builds out this suite of products to say, no, I am the
source of truth, I'm the record keeper, here's your information, and let me repackage it in a way that
adds value for you, the merchant. And this evolved from the API into something
called Stripe Sigma, which is
like on
the Stripe website, a way to interact
with your own Stripe data as a
merchant. So you can go on
Stripe Sigma, you can
write some SQL queries, press run
and have some results back.
And the data can be like, you know,
how much have you processed? How much have you
utilized Stripe
for? Right. But
for a lot of enterprise cases,
they don't want to work
off of Stripe.com.
Right. They don't want to use
a SaaS product. They have their
own data engineering team.
They have their own data infrastructure
ecosystem. And they want that data in their system so they can integrate it with maybe their system
record of truth. And they want to add different features, different values to that data.
So that's kind of where the problem statement is,
is to say as a merchant and especially an enterprise merchant,
I want Stripe's data in my ecosystem.
Like, how can you give me that data?
And there's a lot of, you know, off-the-shelf software.
You know, Fivetran is kind of the market leader in this
where I think they just
scrape Stripe's API, write it down, and push it out.
They facilitate it. But on our side,
we have all the data. We just need to push it out.
We want to make it easier and seamless to integrate
with different ecosystems. So that's what we're working with.
And I think there's a lot of interesting development
in this area from different cloud vendors,
different data vendors in this space.
And I'm pretty excited to be working on this.
Gosh, well, I have a thousand questions,
but Costas, I'm going to hand the mic over to you
because I've been monopolizing.
Thank you, Eric.
Kevin, before we go back to the data product case
that you just talked about,
I want to go back to the tool that you mentioned
that became really popular inside Stripe.
And you mentioned that it was just like a text box
where you could write a SQL query and run this query, right?
And my question is, in a world with so many BI tools out there,
so many hours spent on figuring out what's the most efficient way
for someone to interact with data through a graphical user interface,
why did this tool become so popular?
And what was the need that it was fulfilling and couldn't be served by all these BI tools
out there?
Yeah, that's a good question.
I think why this tool was made in the first place was kind of before my time.
But one thing I do know is that I really enjoy using this.
And so do a lot of people in the company.
And I think I have been trying to figure out why it's so popular, why it's so successful.
I think it's just, one, it's very simple.
The interface is very simple.
It accomplishes what you want
So, like, you know, you write some SQL, you get some data back. There's simple filtering. And,
you know, if you press graph, you can turn it into a line graph, a pie chart, whatever you want, right? But a lot of the most used features
are these features with reasonable defaults, right?
So it's very powerful for me to just write a query,
you know, select date of whatever
and max, like, aggregate whatever,
and get a result,
press "turn this into a line graph," and boom, that's all you get, right? And if you want to tweak it more, you can go in and
write more visuals and whatnot. But I think for the majority of folks doing analytics,
that's enough. I know for me it's very useful. And I think Trino
being the back end of it really powers this kind of magical "wow, it's so fast" kind of thing.
And it being federated as well, we're able to connect a lot of other different data sources.
So what we were talking about with the attribution of different queries, we threw that
into a database and connected it back. And now your data ecosystem is all connected. So I can query
on this interface how many queries were run in the last hour from the ad hoc stuff,
versus just from the service stuff. So it's just very kind of central to our data ecosystem.
And, you know, I was looking at Superset, right?
And I was trying to figure out, like, okay, well, can we migrate to something
open source?
And I think the difference between Superset and what we use, at least, you know, when I prototyped with it on my own time, is these very simple defaults.
Yeah.
There's like two or three features that everyone uses and everyone loves.
And with Superset, it's a little bit more difficult to set things up. But that jump in difficulty really is a big factor
when you're working with tooling. Yeah, that makes a lot of sense. And it's super interesting.
And then you also mentioned exposing endpoints to work with data, right? So
you're not just offering, let's say, a way for people to go and visualize
the data, but you also want builders to go and build on top of the data, to integrate with
the data infrastructure, right?
Right.
So how do you do that?
I'm assuming also that, okay, the typical use case around BI and the OLAP concept is that you don't have too many concurrent queries.
It's much more that things tend to take much longer to complete.
It's a very different, let's say, set of trade-offs that are assumed there, right? Compared to, I don't know, having, let's say, someone from the front-end
engineering team decide, oh, now I have this data.
Let's create this service that it's going to be hitting every second or
sub-second or whatever, right?
So how do you balance that, right?
Because we're talking about like opening opportunities
to, you know, like every possible use case out there.
And some of them might not be, let's say,
compatible with the basic data infrastructure.
Yeah, I think that's exactly right.
I think the API is both a blessing and a curse,
I would say.
It makes it very easy to integrate with all of the environments that we have, all of the different languages,
because HTTP is pretty universal. But on the flip side, a lot of our compute costs could
be reduced if you are in the Java environment and you're working with Iceberg
to just go and use the native Iceberg library, right?
Instead of round tripping through compute that goes through Iceberg and then back again,
you can really just, you know, go and read from the source.
So that's been something that we've been struggling with.
And that's something that's just an optimization at the end.
But the pro case for opening up this as an API is that integration is much easier.
Getting things done is much easier.
Getting data is much easier, no matter where you're working on whatever repo,
whatever language, whatever environment.
Totally with you on like a lot of the time,
it's not the best way to do it,
but, you know, for now,
kind of being able to build out these use cases
without being blocked by
how do I get this data
has been very useful
for Stripe to build out
different features, different products.
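As a sketch of the "skip the round trip" option Kevin mentions, here is what reading an Iceberg table directly with the open source pyiceberg library might look like, instead of going through an HTTP query endpoint. The catalog URI and table names are hypothetical:

    from pyiceberg.catalog import load_catalog

    # Hypothetical REST catalog endpoint; this could also be Hive
    # Metastore, Glue, or another backend.
    catalog = load_catalog("internal", type="rest", uri="https://catalog.example.com")
    table = catalog.load_table("warehouse.events")

    # Plan the scan client-side and read only the needed files and
    # columns; no query cluster sits in the middle.
    result = table.scan(
        row_filter="event_date >= '2024-01-01'",
        selected_fields=("event_id", "amount"),
    ).to_arrow()
    print(result.num_rows)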
Yeah, 100%.
I think it's
a testament to the culture of the company.
You promote
creativity and control
over the resources.
That's the trade-off that you're making there,
and it makes total sense.
And I think it's a trade-off that always exists
with engineering.
When you start optimizing,
then usability usually goes down
unless you narrow down the use cases a lot.
So it's this balance between,
okay, how accessible I'll make my systems
versus how robust, let's say, I'm going to make them,
and all these things.
And it's always a dance that's very delicate there.
And it's very interesting to see how this is performed
in a company like Stripe, right?
I think we over-index on, well, we're not over-indexing. I think we value being able
to unblock and facilitate product development and feature development, and have, you know, folks
not be blocked on accessing data. Yeah, that's kind of something that I've been really fond of, working at Stripe.
That's amazing, actually, especially at the scale of a company like Stripe.
Because these queries, at that scale, cost a lot of money.
When you're at that scale where, let's say, 1% performance gains
translate into probably millions of dollars, right,
things are much more complicated.
So it needs to be part of the culture of the company to promote that.
And that's amazing, I think.
All right, let's go back to the pipelining stuff.
Because that's also very interesting.
So as you said, there have been vendors out there for quite a while now, right? Facilitating the exporting, extracting of data, and loading of data into other systems,
like Fivetran.
Why does Stripe want to get into that business, in a way, right?
What's the value for someone like Stripe,
where, okay, the core competence of the company is not moving data around, right?
It's processing payments.
Why is it becoming so important today
that Stripe actually, you know,
dedicates resources to go and find a robust solution for that?
Yeah.
I can give you what I think is the answer, right? So, you know, Stripe is pretty innovative in that
a lot of the features that get developed, the roadmap, a lot of it is
driven by the customers themselves. So you probably go on Twitter, see a bunch of people,
product leads, co-founders ask, hey, how do you want to see Stripe improved? What part of it
do you want to see improved? We have Friday firesides where other company founders come in, talk about how they use Stripe.
And the question is, what don't you like about it?
Where can we improve?
And I think with that mentality, a lot of on the data side has been a natural progression of what the customers want.
So Stripe Sigma, so it's essentially a SaaS on Stripe.com where you can write SQL to
interact with your own data.
So that was the first iteration.
And it's very similar to what we have internally, you know: just a website, a SQL dialog,
and a run button, and it returns you the data, right?
So that came out of, like, you know, customers wanting to interact with their data, right?
And for SMBs, people without their own data infrastructure, that's pretty good,
right? You go and do a bunch of
SQL analysis just through Stripe. And then for enterprises, they don't want to use that.
Maybe their data size or their regulation, just privacy,
some reason they don't want to use that product,
but they still want to interact with this data.
So there's been a need to provide this data
to our customers.
And the need is pretty validated, right?
Like, you have other companies that,
you know, these merchants go to to
say, hey, I want my Stripe data, and you can give it to me. I don't care how, just give it to me,
and I'll pay you for it. So then the natural progression is, like, well, why go through the extra step?
And a lot of the time, you know, the way that these companies get data is also pretty costly.
They call the APIs, write them down, send it to other companies.
So the natural progression is like, okay, how do we do this in a way where our customers benefit and we can also turn this into a product?
So that's kind of been the line of thinking.
And I think the way that it was started at first was a customer ask.
Like a pretty big customer asked for this.
They're like, hey, I don't want to work off of your website.
I have my own data engineering team.
I have my own data engineering ecosystem.
Just give me the data.
Let me do what I want with it.
And then, you know, more and more companies come in to ask for this.
Yeah. Right. The way we see it is there's segmentation: you know, SMBs can use Sigma and enterprises can use Stripe Data Pipeline. Yeah. Makes sense. So what's the difference
between someone using, let's say, like a third-party vendor that is going to continuously hit the API
or Stripe to export data and reload the data
on the S3 bucket of the customer
with what Stripe does with their pipelines, right?
And let's talk briefly about, let's say, the product experience,
if you can talk about that.
But also, most importantly, about the technology.
What's the difference there?
In one case, we have HTTP, right?
As you said before, it's pretty inefficient, but it's pretty universal at the same time.
But maybe there's a better way to do that. So what are the technical choices that you as an engineer make to enable
a different product experience at the end, right?
Right. Yeah. And this is where I
really believe the
next generation of this product is. If you go to
Stripe Data Pipelines right now, we have GA in Redshift and Snowflake.
As a merchant, you can sign up for this product and you can get your Stripe data in your Redshift
cluster, in your Snowflake cluster.
And we do this in a way where we get our data from our source of truth.
The reliability factor, the data consistency, data correctness factor, we take that on and
we guarantee that.
In a way where anything that happened upstream, we can just say, here, we calculated the source of truth.
Let me push the data out to you.
That's very difficult when you have a man in the middle with a third-party vendor.
I'm sure there's a way to solve it. But at the end of the day, going from the source is a lot cleaner.
It's a lot easier for both Stripe and the merchant.
But API calls are expensive.
If you're scraping a website, the API calls get super expensive.
When you're scraping Stripe, there's a cost to Stripe as well. And internally,
migrating all those API calls onto this product is just a win.
I think in terms of technology, something that I'm really interested in is just the idea of data sharing, right? Like, you know, API call is one of them.
SFTP is one of them.
A lot of these things are very old.
Well, not old, but, you know,
they're proven methods from, like,
the 80s and 90s.
And with a lot of the developments in the data space,
data infraspace,
especially with a lot of cloud vendors,
with a lot of data vendors,
innovating on a bunch of different data sharing technologies,
I think Stripe is in a good position to piggyback off that
so then we can offer our merchants integration with all of these ecosystems.
So something that has been going on in the industry is the rise of Apache Iceberg.
Something I just saw recently, I think last year with Salesforce and I think Snowflake,
there's a blog post that said
they're integrating Salesforce
data with Snowflake.
One click or zero click,
zero ETL, whatever.
You can get your
Salesforce data in Snowflake
super fast, super easily.
Right?
We see
the same thing for Stripe, right? We want to give you your data on Snowflake, Databricks, AWS, Azure, like anywhere that your data is set up, we want to be able to give you that data. And I think with the rise of the lakehouse kind of architecture, where compute is separated
from storage, that really helps our case.
Because right now we publish to specific warehouses, right?
It has to be Redshift.
It has to be Snowflake.
But with this lakehouse architecture, we want to publish the storage, and you bring
your compute, and the integration should happen seamlessly.
We can use Iceberg, we can use different technologies to facilitate this, but the core concept of
we'll give you the storage, you bring your compute. I think it's very
exciting to me for the next iteration of this product.
So just to understand the use case here with Iceberg: the way that you see it is that the data lives, let's say, on Stripe,
but the user has, let's say,
the capability to choose where to expose this data
through Iceberg, right?
So an external query engine can just go and query that.
Or you see more of like, okay, this is your data.
We're going to export it on your own S3 bucket
because that's your storage and you want to have it there.
And we are going to do that by using Iceberg.
So it's easy then to go and expose it
to different query engines and all that stuff.
Which one of the two approaches is usually more
favorable for the users out there? Yeah. I think there's multiple levels of abstraction.
At the core, we're exposing some data where the merchant wants to be able to interact with that data, right?
We can throw it into a SFTP server as a CSV, right?
Or we can throw it onto Azure or AWS S3 as, like, Parquet files, right?
And then it's about bringing where the merchant is and their ecosystem into our own ecosystem.
So Iceberg is one of the abstractions.
We can throw our files on S3 and create some kind of catalog to represent them. The reason Iceberg is so popular, or so interesting for us, is that all of these
vendors, all of these compute systems, are now integrated with Iceberg.
So this is a step kind of removed from us, an extra step that we don't have to do, where
if we just deliver something in Iceberg, you can read it in Snowflake, you can read it
in Databricks, you can read it with Athena, with Redshift.
It's about us taking the data and making these levels of
abstraction so that our merchants can integrate it
in a better way. If our merchants want
Delta tables, we have the underlying files. We just need to generate
some metadata and boom, you have Delta tables, right?
Yeah.
So for us, it's about thinking through
where we want to meet our users
and where they are,
where their ecosystem is,
and kind of meeting that demand on our side
and enabling them to get the data.
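One way to picture the "we give you the storage, you bring your compute" hand-off: if the delivered files are Iceberg, the merchant can register the shared table into their own catalog and query it with whatever engine they already run. A hedged sketch with the pyiceberg library; the catalog, names, and paths are made up:

    from pyiceberg.catalog import load_catalog

    # Merchant side: load your own catalog (Glue here, hypothetically).
    catalog = load_catalog("merchant", type="glue")

    # Register the delivered table by pointing at its Iceberg metadata
    # file; no data is copied, the files stay where they were delivered.
    table = catalog.register_table(
        "stripe_share.charges",  # hypothetical namespace.table
        "s3://merchant-bucket/stripe/charges/metadata/v42.metadata.json",
    )
    print(table.schema())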
Yeah, makes a lot of sense.
And one question here, because, okay,
I think the value of decentralizing the data
in this way is obvious, right?
Both from an engineering perspective,
in terms of efficiency there,
but also from a business perspective
of not having to, okay,
use 100 different tools
and all these different vendors
and pay for all that at the end,
without having the best possible experience
and maximizing your value.
My question, though, is,
okay, in this highly decentralized environment
with all these different options,
how can people keep track of what is available
to them, right?
How can they find the data that they need? How can they know that this is the right data?
Like, yes, of course, you can create some metadata and create Iceberg tables and have
a catalog that a system can go and access.
And it can be, like, a Hive metastore, right?
But then if you go to something like Snowflake, then probably you need a different catalog
to be populated there for that to happen, right?
So we get to this meta problem, in a way, of how do we keep consistent and available all this
metadata that is needed in order for people to go and figure out what they can use and how to
work with it, right? So first of all, do you think this is a problem or might be just in my mind,
right? I don't know. And if it is, what are the possible solutions out there?
Yeah.
No, I think it's definitely a problem.
Well, not a problem.
It's just the way that it's set up, right?
Iceberg and any table format, it's essentially your data with some metadata.
Yeah.
You have to keep your metadata somewhere.
And for Iceberg, it's like a catalog, right? The catalog just does the translation of like,
here's my table and here's everything I know about this table,
where it is.
You have Hive Metastore, you have Glue,
you have REST Catalog.
I think this concept of catalog is super interesting.
When you're talking about these table formats, it's essentially the
abstraction that a lot of these vendors are using to not lock you into their ecosystem,
but it's one of those things that's difficult to work with when you're across many ecosystems, right?
So you can have an Iceberg table in Snowflake, but if it's
managed by Snowflake, it's in their own catalog, right?
And if maybe you're like an enterprise and you have multiple different ecosystems, you
want to use Snowflake and Databricks and something else and
Athena, right? Where your catalog is determines which systems you can use. So if you have an
Iceberg table that's in Snowflake only, the Snowflake catalog, it's really difficult for you to use that in Databricks.
If you have a Unity catalog, which also works with Iceberg,
it's hard to export that and put it into Snowflake.
Now you need integrations between these catalogs.
And this is where Iceberg's kind of
innovation with the REST catalog is, I think, very interesting:
they're just saying, there's a REST protocol, it represents a catalog, and you can plug and play
whatever backend you have, right? And it's a level of abstraction that kind of
does away with
the details and
the vendors and everything.
I think what it means
for us is
we're still trying to flesh
out how this works.
If we want to integrate with
table formats,
where are we going to store our catalog?
Yeah.
Do we need to store multiple copies, right?
Like, do we need one in Glue for AWS?
Do we need one in Unity for Databricks?
Like, now you have this kind of lock-in at the catalog level.
Yeah.
How do we get out of that? I think those are
interesting questions. A lot of the integration is happening, too. You know,
Glue is able to be read in other places. But with these vendors, a lot of it is: we make it easy for
you to read from other catalogs, but we make it hard for you to read out
anything that we have. So, you know, it's an interesting kind of time period that we're in.
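The plug-and-play idea behind the REST catalog, sketched with pyiceberg: the client speaks one REST protocol, and whatever backs the endpoint can change without touching the reading code. The endpoint and table are hypothetical:

    from pyiceberg.catalog import load_catalog

    # The same client code works no matter what actually backs the REST
    # endpoint (a database, Glue, a vendor service, ...).
    catalog = load_catalog("shared", type="rest", uri="https://catalog.example.com")

    # Discover what is available, then read; swapping the backend later
    # requires no change here.
    print(catalog.list_namespaces())
    table = catalog.load_table("analytics.orders")  # hypothetical table
    print(table.scan(limit=5).to_arrow())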
That makes total sense. Okay, I think we should have another episode just talking about
catalogs, to be honest. But we are close to the end here, and I would like to let Eric ask any other questions
he might have. So Eric, all yours again.
Yeah, Kevin, I think it's been so interesting
to hear you talk about
a lot of the practical ways that you're solving problems
day to day with your infrastructure.
But you are a very curious guy.
And so I'm dying to know,
when you look out at the data landscape in general,
what are the most interesting new projects that are exciting to you?
Maybe even in the open source,
because I know that's exciting,
when you sort of remove yourself
from the limitations of the infrastructure
you work in every day.
Yeah, I think Iceberg has definitely been on my list.
I've been kind of participating on the Python Iceberg library, just contributing there.
I think a lot of the disaggregation of different database components, like OLAP components, right? Like, I think of our current infrastructure as databases kind of just turned inside out,
into different services, essentially.
Yeah, yeah, yeah.
You know, there's compute, and S3 and Iceberg are like storage.
And now people are building indexes, are building all these features on the side.
So I think a lot of what interests me is Apache Arrow.
So then you can integrate these systems together.
Sure.
Like DataFusion, where you can have components
of your traditional databases and work with them
in a way
where you can have your planning,
your compute layer, your storage layer
in different libraries, and then you can mix and match.
So a lot of these foundational core pieces of the database
are now being ripped out and brought into these open
source projects. So, you know, I'm very interested in seeing the development of those. And there's
a lot of active development in those fields. And we'll see, you know, maybe
in a year or two we'll go back to what a traditional database looks like,
but just in the cloud with all of the bells and whistles.
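A tiny sketch of that "database turned inside out" idea, using the Apache DataFusion Python bindings with Arrow as the interchange format; the file path is hypothetical:

    from datafusion import SessionContext

    # Storage layer: plain Parquet files registered as a table.
    ctx = SessionContext()
    ctx.register_parquet("events", "/data/events.parquet")  # hypothetical path

    # Planning + compute layer: DataFusion parses, optimizes, and executes.
    df = ctx.sql("SELECT count(*) AS n FROM events")

    # Interchange: results come back as Arrow record batches, ready to
    # hand to other Arrow-native components.
    print(df.collect())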
Yeah.
Well, Kevin, this has been such a great conversation.
Thanks again for joining us for the show today.
Yeah, thanks for having me.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app
to get notified about new episodes every week.
We'd also love your feedback.
You can email me at eric at datastackshow.com. That's E-R-I-C at datastackshow.com.
The show is brought to you by RudderStack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.