The Data Stack Show - 59: Making ETL Optional with Justin Borgman of Starburst Data
Episode Date: October 27, 2021
Highlights from this week’s conversation include:
Starburst Data is Justin’s second startup (2:42)
Starburst focuses on doing data warehousing analytics without the need for the data warehouse (4:14...)
Multi-cloud solutions among merger and acquisition use cases (8:32)
Ways the stack is increasing in complexity (12:25)
Comparing essential components of a data stack from 2010 to now (15:01)
The future of ETL (27:36)
The best maturity stage for an organization to implement Starburst (31:27)
Starburst connectors (36:55)
Monetizing enterprise solutions while promoting open source ones (41:52)
The history of Presto and Trino (45:37)
Benefits of a decentralized data mesh (49:53)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are
run at top companies.
The Data Stack Show is brought to you by RudderStack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the show.
We have Justin Borgman from Starburst Data, and I'm really excited to talk with him because
he, I think, may help us make some sense of data mesh, but at the very least, we'll learn a ton about federated queries and building analytics across different components of the stack.
So my main question, and we'll talk about Presto and Trino and get into the details there.
But I think my main question, Costas, is the view of the stack increasing in complexity.
So we had a guest recently talk about how
the promise of the cloud was that it'll unify all this data and everything. And in fact,
it's creating more complexity and more data silos. I thought that was very compelling.
And I think Justin is living that every day with Starburst, trying to make it easier to drive
analytics with an increasingly fragmented stack.
So I want to ask him about the complexity of the stack and how that's changing.
How about you?
Yeah, I want to learn more about Presto in general.
Presto has been around for quite a while and it has gone through many different transformations. So that's definitely part of the conversation that we are going to have. And I want to learn more about
Justin's view of how this data stack is maturing and where he thinks that we are really going
with the technology. Mainly because the interesting part with Presto is that it has a very, very
different approach when it comes to querying. It has a very decentralized approach,
which is something completely different, actually opposite to the best practice of trying to source
all the data and store it in one centralized location and do the queries there. So yeah,
I think we will have a lot to chat about with him. Well, let's dive in and get to know Justin and Starburst. Let's do it.
Justin, welcome to the show. It's really great to have you with us today.
Thanks, Eric. Super excited to be here. Well, let's start where we always start.
Would love to hear your background. You've done some really cool stuff,
but kind of what led you to Starburst? Yeah. So let's see, this is my second startup.
My first startup was back in 2010. It was called Hadapt and it was an early SQL engine for Hadoop,
just as Hadoop was starting to pick up momentum. And really at the time, people were thinking about
Hadoop as kind of cheap storage or a way of doing batch processing on mass amounts of data.
And our idea was to turn it into a data
warehouse. In fact, I think the business plan we wrote was to become the next Teradata with really
doing data warehousing within Hadoop. Now, as luck would have it, we actually ended up being
acquired by Teradata four years later. And I became a vice president and general manager at Teradata,
responsible for emerging
technologies and really trying to think about the future of data warehousing analytics and what
that might look like. And it was in that context that I actually met the creators of an open source
project called Presto. They were at Facebook at the time, Martin, Dain, and David. And we started
collaborating and working on making Presto better and better and better. And today that effort is now known as Trino. So the name
changed along the way, but that's really how Starburst was ultimately born as really the
founders and creators of that open source project, leaving our respective companies.
I left Teradata, they left Facebook, and Starburst was formed.
Very cool.
And can you just give us a quick rundown of what is Starburst and what does it do, just so our listeners have a sense of the product?
Yeah.
So much the way my first company was really SQL on Hadoop, this is SQL on anything.
And I think that was what got me so excited about it.
It's about doing data warehousing analytics without
the need for the data warehouse. And from a technical perspective, it's basically a database
without storage. And it thinks of all other storage as though it's its own. So you can query
the data where it lives. You might have data in Mongo. You might have data streaming in Kafka.
You might have data that you want to access via Elastic and text search. You might have data in traditional legacy systems like Oracle or Teradata, you might have Snowflake, you might
have data lakes, that's one of the areas where we really excel is accessing data in data lakes.
And in all of those cases, you have kind of this single point of access to query the data where it
lives, without the need to move it around and do those typical kind of ETL pipelines. So it's really about
giving you faster time to insight. That's the way we think about it. And removing a lot of
that friction traditionally associated with classic data warehousing. Super interesting.
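To make the "database without storage" idea concrete, here is a minimal Python sketch. It is not how Starburst or Trino is actually implemented. The `Connector` protocol, the `ListConnector` and `CsvConnector` classes, the `Engine` class, and the sample tables are all hypothetical names invented for illustration. The point is only that the engine holds no data of its own and asks each registered source for rows at query time.

```python
import csv
import io
from typing import Callable, Dict, Iterable, Iterator, List, Protocol

Row = Dict[str, str]


class Connector(Protocol):
    """Hypothetical interface: anything that can stream rows for a table."""

    def scan(self, table: str) -> Iterable[Row]: ...


class ListConnector:
    """Stands in for an operational store; rows live in the source, not the engine."""

    def __init__(self, tables: Dict[str, List[Row]]) -> None:
        self._tables = tables

    def scan(self, table: str) -> Iterable[Row]:
        return iter(self._tables[table])


class CsvConnector:
    """Stands in for files in a data lake, represented here as CSV text."""

    def __init__(self, files: Dict[str, str]) -> None:
        self._files = files

    def scan(self, table: str) -> Iterable[Row]:
        return csv.DictReader(io.StringIO(self._files[table]))


class Engine:
    """Query layer with no storage: it only knows how to reach each source."""

    def __init__(self) -> None:
        self._catalogs: Dict[str, Connector] = {}

    def register(self, name: str, connector: Connector) -> None:
        self._catalogs[name] = connector

    def select(self, catalog: str, table: str, where: Callable[[Row], bool]) -> Iterator[Row]:
        # Rows are read from the source when the query runs; nothing is copied ahead of time.
        return (row for row in self._catalogs[catalog].scan(table) if where(row))


engine = Engine()
engine.register(
    "app_db",
    ListConnector({"users": [{"id": "1", "plan": "pro"}, {"id": "2", "plan": "free"}]}),
)
engine.register(
    "lake",
    CsvConnector({"events": "user_id,event\n1,login\n2,login\n1,purchase\n"}),
)

pro_users = list(engine.select("app_db", "users", lambda r: r["plan"] == "pro"))
purchases = list(engine.select("lake", "events", lambda r: r["event"] == "purchase"))
print(pro_users)
print(purchases)
```

Adding another source in this toy model means writing one more class with a `scan` method; the query side does not change, which is the "single point of access" property described above.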
So let's talk about, I love your perspective because you have a great perspective because
you've both built systems that drive analytics from a database standpoint
and then are now leading a company
that solves problems
across different pieces of infrastructure.
We had a guest recently who made a really good point.
It sounds very obvious,
but the data stack is increasing in complexity, right?
I mean, you have all these tools
that are making functions within the stack easier to
do that before required a significant amount of engineering effort.
And it's like, okay, great.
Like we're getting beyond some of the low-level plumbing problems and which is awesome.
But especially as you reach scale, the stack is increasing in complexity, right?
So you have data warehouses, data lakes, Kafka. There are a
number of different sort of core pieces of infrastructure that you're running at scale,
which actually makes traditional linear data into warehouse, into BI dashboard way harder.
So can you just talk us through what you're seeing on the front lines? Like how are stacks increasing in complexity?
And then I'd just love to hear like your perspective on Starburst as the answer to managing that
without necessarily having to get into the plumbing.
Yeah, absolutely.
Well, first of all, I 100% agree with your previous guest about the stack gaining complexity.
And I think of an old quote from really a legend in the database space,
a guy named Mike Stonebraker, who's a professor at MIT, and he was the creator of Ingres and
Postgres and Vertica and a variety of different database systems over the years, won the Turing
Award. And he had written a paper that basically said there is no one size fits all database system,
meaning that you're always going
to have different databases for different types of jobs, different types of use cases.
And I think that's true. Some applications you want to build on Mongo, some might be Oracle,
some might be something else. And I think that for better or worse leads to greater complexity
because now you have even more data sources. And we find particularly in large enterprises, this is compounded by the fact that you have
different departments, different groups within an organization doing their own thing.
You may acquire businesses.
And every time you have M&A and you acquire a business, you just acquired their data stack
as well, right?
Sure, yeah.
Right?
And that's actually one of the fastest ways we find that our customers end up being
multi-cloud is because they bought somebody who runs on Azure or GCP, and now they're multi-cloud.
So 100% agree on complexity. And that's a big part of what we hope to solve by essentially
allowing you to go direct to source and be able to run those analytics by connecting directly to
where the data is. I think that's the power of the platform. Essentially, I like to describe it as
really giving a data architect or a data engineer infinite optionality. If they still want to
consolidate data into a data lake or data warehouse, that's cool. I would argue data lakes
are probably the better bet over the long run for consolidating
data. And we could talk about that just from a TCO perspective. We will. We'll definitely talk
about it. Yeah, absolutely. But the point is at least you have the freedom of choice. And so
that's really what we're trying to do is kind of create this single point of access across
all of those different data sources to add an abstraction. And abstractions are always
really for the purposes of creating simplicity where there is complexity. And I think we allow
you to do that within the data architecture realm. Let me ask you, so you're a two-time entrepreneur.
So I'm going to ask you a business question that relates directly to this problem. So a lot of times, let's take the example
that you gave of a business acquiring another company and inheriting their stack, right?
Yep. Integrations and all of that are a whole subject unto themselves. But I would argue that
in a lot of those cases, like the synergy, wow, synergy is such a bad buzzword, but let's say that the results
you can produce from understanding the power of the relationship between the two businesses
tends to have an outsized impact. Okay. And we'll just call that synergy for the purpose.
Yeah. No, I mean, that's like the truest definition. I agree with you. I know it,
it has negative connotations only because it's usually, I think, overinflated, right? Like people talk about synergy and then maybe they don't find the synergy, but you're absolutely right. Yeah. And I think in this day and age... Like, do you see that, especially among Starburst
customers where ultimately a lot of these things come to a head in analytics that then influence
business processes that influence product? You know, there's a variety of implications here,
right? But analytics is, and understanding those components is usually the tip of the
spear in terms of like driving the decisions that filter out and shape the business.
Do you see that a lot where when you can combine data from different sources in a way that would
be, I mean, some of these things, like you're talking multi-cloud, if you put a set of data
engineers on this, you're talking months of work to get a basic understanding of how the data
relates. And then you have a ton of BI work and analyst work to get the insights on top of that. And so do you see that a
lot among your customers? Yeah, a hundred percent. In fact, it's a great use case actually for us
because when we see that an M&A transaction is taking place, we know that there's instantly
going to be an opportunity for the reasons that you mentioned. You're inherently talking about
two different sets of data and you're talking about an integration effort, which from speaking to at least one customer that is quite acquisitive,
often takes like two years to fully integrate those two entities to get the value that the
investment banker had written up in the original proposal, right? So it takes a long time. And
the beauty of this mindset or this approach of kind of a single point of access or what some are now calling a data mesh, which I'm sure we'll talk about as well, is that you're getting instant connectivity.
So you don't have the delays of all the challenges associated with getting the data out of one system, navigating how to transform it and load it and get it prepared
into another system. All of that can be done in weeks rather than months or years. And I think
that speaks to that time to insight ability that we can provide. Yeah. Okay. One other question
for me, and I'm just genuinely curious about this. So stack is increasing in complexity and
you're seeing this on the front lines because you're providing an antidote to that. How is it increasing in
complexity? Are there specific trends that you see around particular technologies that maybe add to
the complication of what you would normally solve from a low-level plumbing standpoint?
Yeah, well, one thing I'll mention, and this ties a little bit back to my Stonebraker quote,
but there's a lot of different systems out there now.
And it's not just different types of databases.
It's other forms of data as well.
It's CRM systems.
It's web analytics.
It's a whole host of different data sources that you want to combine to understand your business better. Customer 360 is a very classic use case that we work on with our
customers. And very often that involves pulling together a variety of data sources. I think part
of this also, candidly, is I think fueled by a tremendous amount of venture capital that's
poured into the data space over the last decade. There's a data landscape that
FirstMark Capital produces every year. I'm not sure if you've seen it. Matt Turck is the VC who
maintains this. And I like to go back just for fun sometimes and look at like the 2012 version
of this data landscape. And it already looks complicated. There's like 30 different data
sources. And then you look at the 2021 version, you're like, that's an eye chart. Like you have to zoom in. Like it's hard to even find my own
company in that space. So I think that's part of it as well. You've got a lot of different
niche players. Maybe at some point there'll be some consolidation that simplifies, but
we don't see that at least any time soon. And that means ever greater complexity. The one other thing I'll mention that I think is compounding this problem is a demand from
the user side, which could be an analyst or data scientist for more self-service access
to the data that the organization has.
And so you've got greater complexity on one end and a wider variety of potential users
on the other end.
And I think that that's a painful place to be in the middle.
Yeah, for sure.
We had a, on a recent show, we did a fun exercise where someone asked us, how would you build
this in 2012?
Which is a really interesting mental exercise, right?
Relative to all the options we have now.
So that's great.
This is super fascinating.
Costas, please dive in.
I have quite a few questions,
but Justin, I'd like to start with a pretty simple one
that has to do with the conversation
that we had around the data stack.
And I'd like to ask you from your experience
and your experience also through the lens of Starburst,
what are the essential
components today of a data stack that a company needs? And if you can, I'd like to compare it to
how a data stack looked back in the Hadoop era when you started your previous company and what
are the differences there? Okay, great. All right. Well, I'll start there. Maybe I'll start with the past
and then go to today. So 2010 was an interesting transition point or the beginning of a transition.
I would say the concept of a data lake was in its infancy back then. Of course, back then,
data lake was synonymous with Hadoop. That was the only data lake. Now it's increasingly cloud object
storage like S3 or Azure Data Lake Storage or Google Cloud Storage. But back then it was Hadoop.
And I think what people at the time were just starting to think about or transition is like,
can I do some data warehousing in Hadoop? Can I do some ETL in Hadoop? At least the T part of ETL,
of course. Can I do some transformations in Hadoop, and essentially offload very expensive compute from my Teradata system or my Oracle system and
use this cheaper batch oriented, infinitely scalable open source platform instead. And so
it was very interesting from that perspective. I think a lot fewer data sources in that world,
Teradata was striving to be the single source of truth with, I will say, mixed results, meaning that they were probably the closest thing to the single source of truth.
But you still had different data marts and other databases, SQL Server here and there and Oracle here and there.
And so still a bit of a heterogeneous environment, but not nearly to the degree that it is today.
The players back then, I would say Tableau was the new kid on the block and killing it.
But absolutely the new kid back then, displacing maybe some of the older BI tools like Business Objects or Cognos or MicroStrategy at the time.
And ETL back then was synonymous with Informatica. I think that's another big change,
right? So if we fast forward to today, I think we are in a much cloudier world. I mean that in
a sense of like more data is in the cloud, which maybe makes it cloudier in multiple levels,
especially for those customers who are hybrid. I think those are unique challenges too.
But data lake now is synonymous with cloud object storage. I think Snowflake is trying to be the Teradata of the future, very much embracing this same concept of a single source of truth. And then you have Fivetran or Matillion or other players sort of like being Informatica 2.0. So on a surface level, you could say maybe, and then at the BI level,
Tableau is still very strong. Maybe Looker is a more recent addition. There's also Preset,
the company behind Superset, which is interesting too. But on a surface level, you might say these
are similar. I think though, we're at a point where data lakes have matured, or at least data lake as a data warehousing
alternative has matured a lot as a concept. I think back in 2010, when I was doing that first
business, it was an appealing idea, but not a lot of people were doing it in practice,
largely because it takes a long time to build an analytic database. I learned this the hard way,
building a cost-based optimizer, building an execution engine takes a long time. And in 2010, they were all very early. So you couldn't get the
same performance out of SQL on Hadoop as you could in Teradata, for example. If we fast forward
to today, that gap is much, much narrower to the point that it's almost insignificant. And whether
that's Starburst querying data in a data
lake or other players in the space, like Databricks has a SQL engine now for querying
the data lake as well, you see this idea of like a lakehouse becoming more popular where I'm going
to store a lot of my data in a data lake and maybe skip out on the Snowflake model. So I guess I would summarize by saying, I think the data warehousing model, irrespective
of the individual players, is being challenged now today in a way that it wasn't previously
in history.
Yeah, yeah.
Makes total sense.
That was a very, very interesting comparison between the two points in time. You mentioned data lakes, and it's been
like a couple of months, at least now that we see quite a few data related companies getting
substantial funding, right? And also quite a few open source projects. We have Iceberg that came
out of Netflix, Hudi, which came from Uber.
And of course we have Delta Lake, right?
So what's your opinion there?
Like, what do you see?
Because the way that I see it and how I feel about it
is that we have like
some kind of decomposition
of a database system, right?
Because if you think about
something like Postgres,
you have an extremely complex system that is like a black box at the end that you query using SQL.
A very simple, let's say, language.
And we have reached the point right now where we are talking about transaction logs, about query engines on top of the file system.
It kind of feels like we have decomposed the database system into small components
and the data engineering teams are trying to take all these and recreate, let's say,
a large scale database system.
Where are we today?
Like how mature are these technologies?
Like if we take, for example, Hudi or like Delta Lake compared to something like Snowflake.
Yeah. So first of all,
I agree with your general sentiments. I mentioned in the opener that we're like a database without
storage. So you could say we're like the top half of a database, the query engine, the execution
engine, SQL parser, the query optimizer. And Iceberg is like the bottom half, if you will,
of a database. It's the storage piece or Hudi or Delta. And I think what we're
seeing right now, which is kind of an exciting period in history, is back to that point about
data warehousing analytics in a data lake, the one missing piece throughout the last 10 years
has been the ability to do updates and deletes of your data. And that's the gap that I think we're closing with those data formats,
which now allows for what Teradata calls
active data warehousing,
like being able to do updates, do deletes,
modify your data,
and still perform high-performance analytics
and power BI tools all within one system.
And that's, I think, like we're right on the cusp
of eliminating that delta,
if you will, no pun intended, between data warehouses and data lakes as we speak.
And I think that decomposition is good for customers in the sense that it gives them
a lot of optionality. So for example, if you're going to standardize on Delta,
you can use Databricks to train a
machine learning model, create a recommendation engine. If you're a retailer, if you buy this
pair of shoes, you might like this pair of pants. That's a great use case for Databricks.
And then you might use Starburst to generate your reports, use Tableau to access that data
and figure out how much did we sell last month or how much do we think we're going to sell next month? And they can both work off of the same file formats. And that's pretty
cool. So I think that gives, again, customers just a lot of flexibility to interchange engines. And
also they have flexibility around which formats do they choose. Iceberg, Hudi, Delta, all very
interesting and promising options. And I guess I'll just mention one last point.
I think the big distinction between this way of thinking and Snowflake is when you load your data
into Snowflake, you've now locked it into a proprietary format. And that's an important
piece with respect to vendor lock-in and having control and ownership over your own data. And
that's one of the things that I observed even in my time at Teradata. Nobody ever said Teradata was a bad database. It's a great database,
but they really hated the fact that it was inflexible and it was very expensive, right?
So. And Justin, one question and Costas, I apologize to jump in here, but I'd love to
just benchmark when we talk about performance, a lot of times and speed to insight is a term that
you've mentioned a couple of times. I'd love to just benchmark on that because one way I like to
frame this question is the definition of real time has changed over time. Right. And so real time
at one point may have meant a couple of times a day, right? And so it was getting faster and faster and faster. I'd just love to know, like, what's your perspective on that changing,
especially relative to query performance? And I know that can change based on business model, but
when you talk about recommendations in an e-commerce standpoint, the bleeding edge of
that is generally like has very heavy requirements as far as performance in real time, but that also is relative.
So I'd just love to know, what are you seeing with your customers as far as requirements
on performance and delivery from that standpoint?
Yeah, so there are two dimensions that we think about with our customers.
One is the query response time.
And that's what I think people have classically referred to as performance when it comes to analytic database systems. Like I run a query on a certain amount
of data, how fast does it return? And there are industry benchmarks that have been used for a
long time, TPC-H, TPC-DS. These are sort of like standardized benchmarks that you can run your
queries through. And of course, we would always say the best benchmarking is actually on your own data though, even better than industry benchmarks.
But that's one dimension of performance. The other dimension, which I think is often overlooked,
and this is what we really refer to when we think about time to insight, we think of that as a bit
more holistic of a measure factoring in how long did it take from the moment the data was created to
my ability to analyze it. And if you think about it in that context, just to compare and contrast,
let's say Snowflake versus Starburst. Snowflake, maybe a query runs in two seconds, and maybe it
takes Starburst 2.6 seconds. And you might say, oh, well, Snowflake ran that query faster. Yeah,
okay, a little bit faster. but it might've taken three weeks to
get the data into Snowflake in the first place. And so really that query was three weeks. Right.
And that's what I mean by time to insight is I think people learn over time that there's
a prerequisite step before that traditional data warehouse is able to actually run that first query.
And that's an important tax that you don't necessarily need to pay.
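The arithmetic behind that comparison can be sketched in a few lines of Python. The 2 second, 2.6 second, and three week figures are the illustrative numbers from the conversation, not measured benchmarks, and the `time_to_insight` helper is a hypothetical name. The sketch also assumes that querying the data where it lives needs no load step before the first query.

```python
from datetime import timedelta


def time_to_insight(load_delay: timedelta, query_time: timedelta) -> timedelta:
    """Elapsed time from data being created to the first answer coming back."""
    return load_delay + query_time


# Illustrative figures from the conversation, not benchmark results.
warehouse = time_to_insight(load_delay=timedelta(weeks=3), query_time=timedelta(seconds=2))
# Assumption: querying the source in place means no load step before the first query.
in_place = time_to_insight(load_delay=timedelta(0), query_time=timedelta(seconds=2.6))

print(f"warehouse path: {warehouse}")
print(f"query in place: {in_place}")
print(f"extra waiting caused by the load step: {warehouse - in_place}")
```

Under these assumptions the 0.6 second difference in query response time is negligible next to the load delay, which is the sense in which "that query was three weeks."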
Yeah, super interesting.
Yeah, that's, I think, a subject
that we want to explore more in the show
just because when you talk about latency,
time to insight,
like those are very subjective
depending on where you are in the pipeline.
So super interesting.
Yeah, and that's also something else very interesting, Justin.
So let's talk a little bit about ETL.
Okay.
And I want to hear from you,
what do you think is the future of ETL?
ETL has been around like since we had the first database systems,
exactly because as you said at the beginning,
we cannot have one system that does everything.
Different kinds of workloads require different architectures and different systems.
And today the environment is probably even more complex.
If you consider that you have to download data through REST APIs because something is behind your Salesforce instance, for example, or NetSuite or whatever, right?
What do you see happening to ETL?
Because from what I understand, when you are incorporating like Starburst in your architecture,
for example, the need for ETLing the data from, I don't know, like a production database,
for example, to your data warehouse is reduced, right?
And at the same time, like I've seen, I was looking like today, for example,
there was an announcement from Snowflake
that Iterable, which is like a company
like in marketing, if I'm not mistaken, Eric, right?
It's a marketing product.
Yes, indeed.
Yeah.
Yeah, like customer journey, like orchestration.
Yeah, yeah, yeah.
So now you can get access to your Iterable data on Snowflake
directly on Snowflake without
doing the ETL through the data
sharing capabilities that Snowflake
has, right? Interesting. I didn't...
That's interesting.
Yeah, yeah. They just announced
the product today.
Again, where is the ETL there,
right? Until yesterday, if
I was using Iterable,
I would have to have a pipeline there to pull the data.
It will take days, blah, blah, blah, and put it into Snowflake.
So how do you feel about ETL?
What's the future of ETL based on your experience?
Yeah, so I was going to say, and I did not read the news
because you're more up to speed than I am,
but my guess is that Iterable is probably running
Snowflake themselves, just because the way that Snowflake is building its data sharing marketplace
is really like a proprietary network. It's basically other companies using Snowflake can
share data with other companies using Snowflake. So that would make sense to me in that context.
And I think that's like Snowflake's view of world domination.
It's like, if everybody's using Snowflake, then great.
Yeah, it's a happy world.
You can share among Snowflake databases.
So I get it from a business perspective.
And obviously, they've been a very successful business.
And Frank Slootman is a very successful CEO.
However, I don't think it reflects
necessarily the reality of the data landscapes that customers have. I think it's probably naive
to think that everything will get ingested and sucked into Snowflake databases so that it can be
shared and used. So our approach basically just says all data sources are essentially equal and
we can work with any of them.
But to answer your question about the future of ETL, so I think it's the E and the L that
we're most focused on making optional, I guess you could say.
There may still be times where you want to do the T for sure.
And I think the way we see the future of this industry moving forward is we still think there's going to be great reasons to pull data together into one physical place.
Maybe it's to power a particular dashboard or for certain applications, it would make a lot of sense to pull data together.
But we think that increasingly that will be the data lake because of the economics
involved, right? Like at the end of the day, the data lake is always going to be your lowest TCO
play. The storage is going to be the cheapest, whether it's S3, Azure Data Lake, whatever.
And you get to work with these open data formats that we already touched on earlier. So you're not
locked in. And so we think that's going to be like your best bet for when you need to consolidate
data. And then for other cases, you can just query the data source directly. And again, that kind of goes back to
that optionality. So I guess to summarize, I would say, I don't think ETL goes away,
but I think it becomes more optional. Interesting. Just to jump in there,
Justin, that is really insightful, and I'm going to put my marketing hat on here because I've been burned
many times by a marketing tool saying we have this direct integration. And in reality, it's actually
just a sort of behind the scenes, like ETL job. And so it makes total sense that like, if it really
is delivering on the promise, it probably is that they have their data in Snowflake. And from an actual data
movement standpoint, that makes a ton of sense. That was just very clarifying for me because it's
like, yeah, I've heard that so many times before and it's not true. They're actually just running
some job in the background and it's not real time. And of course, ETL has major problems when it
comes to schemas and all that sort of stuff. But if both systems are in Snowflake,
like that would actually work pretty well.
But then to your point,
you're in the Snowflake ecosystem, right?
And that's the boundaries of the boundaries.
So I just appreciated that as a marketer,
understanding the technical limitations
of problems I faced before trying to move data around.
All right.
That was super, super interesting.
I'm very interested in ETL, as we can all understand.
So Justin, let's chat a little bit more about Starburst as a product, right?
And my first question is, at what stage of maturity of the data stack, as we talked about,
does it make sense for Starburst to become part of this data stack?
Yeah, well, it depends on where you're starting from.
We kind of think about customers as being on a journey to somewhere, but they're all starting at a different point in time.
For some of our customers, it's simple.
The most simplistic way to get started with us is you have data in S3 and you want to query it.
And you're currently thinking about, well, do I load it into a data warehouse like Snowflake or do I just leave it in open data formats?
Do I use something like Athena on AWS, which, by the way, is actually Presto/Trino under the covers?
And that's what powers Athena.
How do I want to build my modern data warehouse type of stack?
And that's a great application.
That's where the kind of leading internet companies end up using our technology.
They have the luxury of designing their stack from the ground up.
And very often it is a data lake in S3 or some other cloud object storage and just querying
it directly with Starburst.
And in that sense, you're essentially building
an alternative data warehousing style platform. Again, you might use Iceberg, you might use Delta,
you might use Hudi if you want that ability to do updates and deletes as well. So that's a very
simple place where people often start, particularly if they have the luxury of starting with a clean
slate. Another place that customers start is they say, okay,
I have a data lake, but I also have a bunch of other databases. And maybe I've got Mongo,
maybe I've got Oracle, and I really need to join a table that I have in Oracle with some tables
that I have in S3 or Hadoop. And that's another great place to start is really combining data sets that currently live in different silos.
And we can very easily provide fast SQL access to both systems.
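The cross-silo join described here can be sketched in a few lines. This is only an analogy, not Starburst or Trino itself: Python's built-in sqlite3 module stands in for the query engine, and two attached in-memory databases (hypothetically named oracle_db and lake) stand in for Oracle and the S3 data lake. The table names and rows are made up for illustration. In Trino the same idea is written with catalog.schema.table names.

```python
import sqlite3

# One connection plays the role of the query engine. Each attached database
# plays the role of a separate source system (names are hypothetical).
engine = sqlite3.connect(":memory:")
engine.execute("ATTACH DATABASE ':memory:' AS oracle_db")
engine.execute("ATTACH DATABASE ':memory:' AS lake")

# "Oracle" side: a small customer table.
engine.execute("CREATE TABLE oracle_db.customers (id INTEGER, name TEXT)")
engine.executemany(
    "INSERT INTO oracle_db.customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)

# "Data lake" side: an orders table.
engine.execute("CREATE TABLE lake.orders (customer_id INTEGER, amount REAL)")
engine.executemany(
    "INSERT INTO lake.orders VALUES (?, ?)",
    [(1, 20.0), (1, 5.0), (2, 12.5)],
)


def spend_by_customer(conn):
    """Join tables that live in two different sources with one SQL statement."""
    return conn.execute(
        """
        SELECT c.name, SUM(o.amount) AS total
        FROM oracle_db.customers AS c
        JOIN lake.orders AS o ON o.customer_id = c.id
        GROUP BY c.name
        ORDER BY c.name
        """
    ).fetchall()


result = spend_by_customer(engine)
print(result)
```

The person writing the query only sees qualified table names. Where each table physically lives is a detail of how the engine was configured.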
Another way that people think about us is as an abstraction layer that hides the complexity of data migration.
So a lot of people going through digital transformation where they want to move data off of Teradata or Hadoop and they want to move it to the cloud. But that can be a pretty disruptive endeavor if you're trying
to really like just turn a system off and move it to some totally different system. So another
approach is you connect Starburst to those systems, have your users end up sending queries to Starburst
and that gives you a bit of breathing room and the luxury of time to kind of move tables out of one
system and move them into another system more gradually without the end user having to know where the data lives.
And that's sort of like hiding where the data lives. That's thinking of us as a semantic layer,
essentially, above where all the data is. So those are kind of three different areas where we
typically start working with customers. Yeah, makes total sense.
And let's talk a little bit about the experience,
the product experience.
And when I say the product experience,
I have like two personas, let's say, in mind.
One is like the data engineer,
like the person who is maintaining the data infrastructure and probably has to interact with Starburst
as a piece of infrastructure.
And then the users who are
querying the data, right? So they are different, obviously. So what's the experience that these
two personas have when they are interacting with Starburst? Yeah, so for the data engineer,
they first of all have two choices. We have really two product offerings today. We have
Starburst Enterprise, which you manage yourself. So if you want to control the entire infrastructure, maybe you
want to deploy on-prem, maybe you want to deploy in the cloud, but you have a particular setup
that you want to maintain. Maybe you need Kerberos integration or LDAP integration, or you want to
run on Kubernetes on-prem. You have a lot of flexibility with Starburst Enterprise, but you have to manage it yourself. So that's for somebody who's up to that challenge,
or maybe who has the requirements to run in their own environment.
The other option is something called Starburst Galaxy. And Galaxy is a cloud-hosted offering.
We manage all that complexity. And essentially, you have a control plane that allows you to
connect to your different data sources and configure the system. You can auto scale up and down. So you're
using your EC2 resources efficiently. You can even auto suspend the cluster where it'll just shut off
automatically if it's not being queried. And because we're like a database without storage,
restoring it takes a few seconds and we're connected already to the data sources you
have. So there's a lot of nice kind of ease of use features in particular around Galaxy to make the
data engineer's life as seamless as possible. For the end user, the experience for both platforms
should be roughly the same in the sense that this whole thing should be pretty transparent,
meaning that they are just using their favorite tool,
whether it's a query tool and they like to write their own SQL, or they're using a popular BI tool.
And that connects to either our JDBC, ODBC, or REST API. And now they're accessing data and
they can be joining table A in one data source with table B in another data source and not have
to deal with any of that complexity.
Back to Eric's earlier question about the growing complexity of the data stack.
So we really try to hide that from the end user.
Are there some requirements from the side of the data sources in order to work properly with Starburst in terms of data modeling, for example?
Are there limitations there?
How do you take something from Mongo, right? Which
is like a document-based database and something from Postgres, for example, and you query them
at the same time. Like how do you do that? Yeah. So the short answer is we have this notion of
connectors, but the word connector almost sells it short because the connectors are actually pretty
sophisticated. There's quite a bit of logic involved in each one. And each connector is different based on
the source system that you're working with. So in a nutshell, the connector is connecting to
the catalog of the underlying system and knows how to essentially pass that SQL query or execute
that SQL query or translate that SQL query to the underlying
system. It also has the ability to do push down in some cases to minimize the data moving over
the network. Some connectors are parallel. So if you're connecting to an MPP database system,
like let's say Oracle or again, Teradata or Snowflake, that creates a parallel connection.
So you get even faster read.
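To make the pushdown idea concrete, here is a toy sketch in Python. It is not the Trino connector API: the Connector class, its scan method, and the rows_shipped counter are all hypothetical names invented for this illustration. It only shows why letting the source apply a filter reduces the data that crosses the network.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Connector:
    """Hypothetical connector wrapping one source system."""
    rows: list
    supports_pushdown: bool
    rows_shipped: int = 0  # how many rows crossed the "network"

    def scan(self, predicate: Callable[[dict], bool]) -> list:
        # With pushdown, the source filters first and ships only matches.
        # Without it, every row is shipped to the engine.
        if self.supports_pushdown:
            out = [r for r in self.rows if predicate(r)]
        else:
            out = list(self.rows)
        self.rows_shipped += len(out)
        return out


def run_query(connector: Connector, predicate: Callable[[dict], bool]) -> list:
    """Engine side: re-applies the filter so results match either way."""
    return [r for r in connector.scan(predicate) if predicate(r)]


def is_large(row: dict) -> bool:
    return row["amount"] > 100


orders = [{"id": i, "amount": a} for i, a in enumerate([50, 150, 75, 300, 20])]
pushdown = Connector(rows=orders, supports_pushdown=True)
no_pushdown = Connector(rows=orders, supports_pushdown=False)

large_a = run_query(pushdown, is_large)
large_b = run_query(no_pushdown, is_large)
print(pushdown.rows_shipped, no_pushdown.rows_shipped)
```

Both connectors return the same answer. The difference is only in how many rows had to move to produce it.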
So each connector is a bit different, but that's essentially where the logic lies that tells the
system how to actually pass through and execute that query. Interesting. So that was going to be
one of my questions is maybe a way to frame this would be like ergonomics. So like in terms of the ergonomics, like it is writing SQL and then having the
connectors. And so again, that abstraction layer where you're not having to go low level, is that,
is that the idea? Yeah. Yeah. So those connectors, I mean, many of them were created by us. Some of
them were created by others in the community. And, and again, they, they vary in terms of the,
the level of performance or sophistication.
The most popular ones tend to be the fastest, most feature rich, just because we have the most
people using them. But yeah, that's exactly right. In fact, you can build your own connector. Maybe
you have a particular, I was just speaking with a customer who had their own time series database
that they had homegrown and they wanted to create a connector to that time series database
and were asking, like, how do I build a connector? And it's open source, and we can point you
to the documentation on how to create a connector to your data source as well. So Justin, from what
I understand, Starburst is mainly for asking questions, right? It's like a querying mechanism.
Do you also have use cases where like
people are using it to write data back? Like for example, I'm creating some features to train a
model, right? Or something like that. So I need this information that I have created out of the
initial data set to write it back into S3. So then I can use, as you mentioned as an example,
Databricks and train my model. Is this something that you see as a use case? And is it also
something that can happen with the product right now? Yeah, it can. Now it depends on the data
source and the connector again. But yes, many of those connectors do support the ability to write
data back. In fact, we've discovered some actually pretty interesting
use cases that we wouldn't have even thought of where companies are doing what you described and
also even doing kind of ETL style workloads, despite our conversation earlier where they're
taking data out of one system, maybe it's a traditional data warehouse, and writing it to
Google Cloud Storage to then be ingested by BigQuery.
And they're using Starburst as that federation layer.
So it's pretty flexible that way.
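The read-from-one-system, write-to-another pattern mentioned here can be sketched as a single CREATE TABLE ... AS SELECT statement. Again this is an analogy using Python's sqlite3 module, with two attached in-memory databases (hypothetically named warehouse and lake) standing in for the traditional data warehouse and the cloud storage target. Whether a given real connector supports writes depends on that connector, as noted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-ins for a source warehouse and a target data lake.
conn.execute("ATTACH DATABASE ':memory:' AS warehouse")
conn.execute("ATTACH DATABASE ':memory:' AS lake")

conn.execute("CREATE TABLE warehouse.sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO warehouse.sales VALUES (?, ?)",
    [("east", 10.0), ("east", 15.0), ("west", 7.0)],
)

# Read from one source, aggregate, and write the result into the other
# source in one statement.
conn.execute(
    """
    CREATE TABLE lake.sales_by_region AS
    SELECT region, SUM(amount) AS total
    FROM warehouse.sales
    GROUP BY region
    """
)

written = conn.execute(
    "SELECT region, total FROM lake.sales_by_region ORDER BY region"
).fetchall()
print(written)
```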
Yeah.
No, that's super, super interesting.
So if I understand correctly from all the conversations that we have so far, like a very solid stack would be, I have my data lake, right?
With something like Hudi or like Iceberg.
That depends on me.
From the Starburst point of view, it doesn't matter what kind of, let's say, format I'm
using.
Then on top of that, I can have Starburst to query the data, right?
And on top of that, I have a BI tool like Looker, for example, or Tableau, right?
Yep.
And I can use either the on-prem version of the product
or the cloud version.
Yep.
So how important, and this is a question that it's not just technical
or product-oriented, it's also a question to the CEO of the company.
How important is the cloud model for data-related products?
It's something that we have seen happening with many companies,
like Databricks, for example, is a case like this, Confluent, right?
And it's also a very common evolution that we see with open-source projects.
We start with a project, and we end up also offering a cloud solution.
How important is this?
And also, like, do you see any alternatives to that if someone wants to monetize a data-related product, especially if it starts from an open source project?
Man, heavy, heavy questions, Kostas.
Softball.
Softball, Justin.
Well, I have you here, so I have to ask my questions.
Absolutely. I mean, Justin's solving the problems. I'm super interested to hear.
Yeah, absolutely. Look, I think cloud-hosted solutions are the new frontier for building
businesses around open source. And I think there are a couple reasons for that. I think, first of
all, it gets you out of the sometimes challenging situation of deciding what to contribute to the
open source versus hold back for your enterprise edition, which can sometimes be, you know,
challenging conversations because you want to grow the open source project because that's your
adoption vehicle. But you also want to be able to convert that. So you end up with this tension between growing the pie and increasing your share of the pie, right? And I
think the cloud offering takes a lot of that away because you're actually adding a new dimension of
value for the customer, which is you're removing complexity and you're making it easy. And people
are very willing to pay for that, I think.
I think that's the way they're used to consuming products now at this point.
So, yeah, big deal for us.
I mean, I think Confluent and Mongo are great role models for us in particular,
largely because both of them actually went through the same journey that we're going through, where they had a self-managed enterprise edition and then built a cloud offering
and really
serve both markets and have these markets kind of work together. For Mongo, it was the Atlas product,
which was their cloud product. Confluent has built a cloud offering as well. And what we've seen in
both cases, in fact, Mongo had a nice jump in stock price a few weeks ago, is it represents now
more than half of their revenue and is the fastest growing part of their business.
And similarly for Confluent, maybe less of a share,
but the fastest growing element of their business as well.
And so we're very bullish on the future
and the prospects of a cloud product here.
Yeah, yeah, it's very interesting.
One last question from me.
And I know that you have like a lot of experience
also like in the enterprise space
where we have had, primarily, the model of on-prem installations until recently.
Yeah. Many people predict that the cloud is going to dominate
completely, right? All these large enterprises out there are going to migrate completely to
the cloud. Do you see this as the net result at the end,
or do you feel like things are going to be a little bit more hybrid at the end?
What's your opinion on that? I really do think they're going to be hybrid,
either for a very long time or forever, at least long enough that it feels like it will be forever.
Because I think we serve a lot of financial services customers. We serve a lot of healthcare customers.
These regulated industries are going to be just more cautious about putting their data somewhere else.
And also not for nothing, I think there are actually sometimes TCO arguments to be made
for actually running some infrastructure on-prem, despite the complexity of having to run your
own data center.
So I think we're going to live in a hybrid world, at least among large enterprise, Fortune 500 customers for quite a long time.
And we think that's also good for our business in the sense that we can provide connectivity
even across from one cloud to another cloud or from the cloud to on-prem.
Yeah. Super interesting. We're getting close to time here. One thing I'd love to do is actually just take a step back and talk about Presto and Trino because you were there towards the beginning
and you have some insight and would just love to know how have those projects developed
individually and what are the differences? And I would just love for our audience, I mean,
I think Presto is pretty familiar to a lot of our audience, like in general.
But the difference between Presto and Trino and just the way that those communities have
developed, like you have some specific insight and would love to hear about that.
Yeah.
Okay, sure.
So Presto, just as a refresher, was created at Facebook in 2012 and open sourced in 2013, created by Martin, Dain, and David and a guy named Eric as well.
And all of those guys work at Starburst today.
But in 2012, 2013, they worked at Facebook.
And I actually first met them in roughly 2014, so maybe a year after Presto had been open sourced and we started
collaborating together again while I was at Teradata. And that collaboration grew over years
and my team at Teradata, which had been acquired from Hadapt, was contributing and they became
leading contributors. And so you have this really vibrant core of, call it 10 or 12 engineers who were writing the overwhelming lion's share of the
project. That continued. Starburst was formed in 2017. And actually, initially, the creators of
Presto were still at Facebook. And it was not until maybe a year or so after we had started with
Starburst that they decided to join us. And in the process of joining
us, actually before they joined us, they had left Facebook over kind of a disagreement of
how the project would be governed, how it would be run. Martin, Dain, and David were very adamant
that it be a meritocratic sort of governance model. And Facebook had Facebook's priorities, which makes
sense, right? Like they wanted to take the project in a direction that benefited their
needs. And by the way, Facebook was running basically all of their analytics on the project.
So it had become very core and very strategic to them. But these were slightly divergent goals
where Martin, Dain, and David wanted this open community, a vibrant diversity of users and
contributors, where you would earn maintainer or committer status based on the merits of your contributions. And Facebook
was like, we got to ship this feature. We need to do this thing for our business needs. And so
because of that, they ended up parting ways. And so Martin, Dain, and David left Facebook
and continued developing, but developed on a different code repo called PrestoSQL.
So there was PrestoDB and PrestoSQL.
And for a few years, nobody knew that there were two Prestos.
People weren't really paying attention.
But there were actually these two divergent code repositories.
Now, they ended up joining Starburst.
We already had about half the contributors, leading contributors to the project.
So the PrestoSQL side ended up moving much, much faster
as a development organization. And long story short, about a year ago, there were some disputes
over the trademark itself, the trademark of Presto. And it had turned out that Facebook
ended up donating the trademark, the name, which they technically owned because even though Martin, Dain, and David
created it, they created it while employees of Facebook. And so I guess my lesson for any open
source creators out there is if you are working for a company and you create an open source
project, that name is technically owned by the company you work for. So just keep that in mind.
But ultimately they donated that to the Linux Foundation and the Linux Foundation said, hey,
we can't have two Prestos. So you're going to have to rename PrestoSQL. And that's how Trino was born. So Trino is
really that lineage of Presto. It is what the creators and leading contributors work on. And since
then, a number of the leading contributors from Facebook have joined us as well, so they are working
on Trino now, instead of the original Presto. Trino is what Netflix and Airbnb and LinkedIn and a lot of the big internet companies are
running with.
And that's the future.
But that's the backstory of the names and how we got where we are.
Yeah, love it.
No, that's a great backstory.
I love it.
It's really fun to peel back layers on the evolution of open source technologies.
Well, we're close to time here.
Two more questions for you.
One is, what's the future look like for Starburst?
I mean, we've talked about problems you're solving now, but as you look at the stack,
I mean, your bet is that we have hybrid on-prem cloud.
Stack is increasing in complexity.
So I would love to know how Starburst is thinking
about the future. And then second, how can people explore Starburst if they're interested in it
today? Cool. So in terms of the future, I will say we're very bullish on this concept of a data mesh.
So I don't know if your audience has heard of a data mesh at this point, but it's basically this kind of paradigm shift that essentially recognizes that data is inherently decentralized.
Not only as like a practical matter for a lot of the reasons we mentioned, but also that there's actually benefits to decentralization if you think about it in the right way.
And the analogy that I like to use with people is if you think about Wikipedia, where anybody can sort of like create an article, it's generally the expert who knows the most about that particular subject who's writing the Wikipedia article.
So you get the person writing about a particular subject area who knows it very, very well, and they have ownership for that. That's kind of like part of what this notion of decentralization
means from a domain authority perspective, meaning that like the people who know the domain
end up making the decisions about how to interact with that data, what fields are available.
So rather than centralization, putting everything in the hands of a data warehouse team in a
monolithic way, you sort of let the owners of
the data itself essentially curate the data and publish it, serve it up to the organization as
a data product. And that's another big pillar of data mesh is thinking about data as a product,
which is an interesting concept, I think, as well. So it's an area we're very excited about.
Okay. So in terms of data mesh, this is a really interesting topic because it's a new term.
There are different sort of interpretations of how to define it.
And hearing you talk about Starburst actually is a little bit of a light bulb for me in
terms of data mesh, because in the conversations that we've had, the challenge with defining
data mesh is a tension between decentralization of data,
but also the need to actually centralize that, right? In a way that makes a ton of sense for
the business as a whole. And so I would love your thoughts on that tension, right? Because
decentralization generally applies to technology where you have different technologies being
employed by different teams.
That means different formats of data, all that sort of stuff. But you still have this need to
centralize it. And so I would love for you to speak to that, that tension as it relates to
data mesh and then specifically like, is Starburst the stepping stone to like making sense of that?
Yeah. So, I mean, to me, it all centers around
this concept of a data product and having the data owners, the ones who understand the domain
of that data, be the ones responsible for creating and curating that data product. Now,
that data product, I want to stress, doesn't have to be a specific database or even a specific table or a specific data set. It could
be any combination of those things. So the data product might have a table that lives in S3,
and it might have a table that lives in SQL Server, but the product together, which is
the customers who spend the most and watch ESPN, if you're a cable provider,
for example, maybe those live in two different data sets. One's a billing data set. One is a
shows watched data set that you have in two different systems. But the data product that
you're offering is top spend sports enthusiasts, right? Product now can span across those data
sources, but it's still offered up to
the organization to consume that way. And Starburst essentially becomes the abstraction layer that
allows you to serve up those products without having to necessarily reveal where those data
sets live. Like the end consumer of that product doesn't need to know it came from a data warehouse
over here and a data lake over there.
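The cable provider example can be sketched as a named view over two sources. This is an illustration only, using Python's sqlite3 module: the billing and viewing databases, the table and column names, and the spend threshold of 100 are all assumptions made up for the sketch. The point is that the consumer queries the product by name and never references the underlying systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two hypothetical source systems owned by different teams.
conn.execute("ATTACH DATABASE ':memory:' AS billing")
conn.execute("ATTACH DATABASE ':memory:' AS viewing")

conn.execute("CREATE TABLE billing.accounts (customer_id INTEGER, monthly_spend REAL)")
conn.execute("CREATE TABLE viewing.shows_watched (customer_id INTEGER, channel TEXT)")
conn.executemany(
    "INSERT INTO billing.accounts VALUES (?, ?)",
    [(1, 180.0), (2, 60.0), (3, 210.0)],
)
conn.executemany(
    "INSERT INTO viewing.shows_watched VALUES (?, ?)",
    [(1, "ESPN"), (2, "ESPN"), (3, "HGTV"), (1, "ESPN")],
)

# The data product: a curated, named view spanning both sources.
# (SQLite only lets TEMP views reference other attached databases.)
conn.execute(
    """
    CREATE TEMP VIEW top_spend_sports_enthusiasts AS
    SELECT DISTINCT a.customer_id, a.monthly_spend
    FROM billing.accounts AS a
    JOIN viewing.shows_watched AS s ON s.customer_id = a.customer_id
    WHERE s.channel = 'ESPN' AND a.monthly_spend >= 100
    """
)


def consume_product(c):
    """Consumer side: query the product by name, with no knowledge of sources."""
    return c.execute(
        "SELECT customer_id, monthly_spend "
        "FROM top_spend_sports_enthusiasts ORDER BY customer_id"
    ).fetchall()


product_rows = consume_product(conn)
print(product_rows)
```

If the billing table later moved to a different system, only the view definition would need to change, and the consumer's query would stay the same.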
Quickly, listeners who are interested in checking out Starburst, what should they do?
Yeah, you can check us out at starburst.io.
And you're welcome to either download the product and get started or register to use Galaxy, which is currently in beta and will be GA in November. So depending on when
this podcast comes out, it may be GA already, but those are your options. Awesome. Well, Justin,
this has been really informative and just a great conversation. We'd love to have you back to talk
about team structures around data mesh as we shed more light on that subject on the show. Yeah, I think it's a great topic. It's probably one of the most important elements of actually
implementing a data mesh. It is all about people, process, and technology, and the people being
the trickiest part. So would love to. Awesome. Well, we'll catch up again soon. And thanks again
for taking the time. Cool. Thank you, guys. Thank you, Justin.
As always, a great conversation.
I think my big takeaway is actually on the data mesh side of things.
I think that federated analytics, as Justin talked about it, is the most tactical explanation of the value of data mesh that I've heard yet, in a way that makes sense from a technological standpoint.
Because I think as we've talked with other guests on the show, one of the challenges of data mesh is fragmented technology.
Everything's decentralized.
Centralization across all of that
is very difficult. And having an infrastructure technology agnostic solution to that makes data
mesh make a lot of sense. I think my follow-up question, which we didn't have time to get to is,
okay, analytics is one thing, like taking action on that data is another thing.
But that was really helpful. So I really just appreciated his perspective on that.
Yeah, absolutely.
And I think we have many reasons
to want to have him on another episode.
There are many things to talk about.
One hour wasn't enough.
Yeah, for me, I think the most interesting takeaway
was the conversation around ETL
and how ETL is changing
in this more decentralized and federated world that we are moving to.
And it was interesting to hear from him that the E and the L are not going away, but they are not as important as they used to be.
But the transformation is there and we will keep needing to transform the data.
So, yeah, it was very interesting.
It was also interesting to hear
about the history,
the story behind Trino
and the trademarks.
Oh, I loved it.
Yeah, coming out of Facebook
and open source drama,
which is always interesting.
And yeah, I'm really looking forward
to recording another episode
with Justin.
It was great. For sure. Well, thanks for joining us again and we'll catch you on the next show.
We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite
podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com. That's E-R-I-C
at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.