The Data Stack Show - 59: Making ETL Optional with Justin Borgman of Starburst Data

Episode Date: October 27, 2021

Highlights from this week's conversation include:

- Starburst Data is Justin's second startup (2:42)
- Starburst focuses on doing data warehousing analytics without the need for the data warehouse (4:14)
- Multi-cloud solutions among merger and acquisition use cases (8:32)
- Ways the stack is increasing in complexity (12:25)
- Comparing essential components of a data stack from 2010 to now (15:01)
- The future of ETL (27:36)
- The best maturity stage for an organization to implement Starburst (31:27)
- Starburst connectors (36:55)
- Monetizing enterprise solutions while promoting open source ones (41:52)
- The history of Presto and Trino (45:37)
- Benefits of a decentralized data mesh (49:53)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to the show. We have Justin Borgman from Starburst Data, and I'm really excited to talk with him because
Starting point is 00:00:35 he, I think, may help us make some sense of data mesh, but at the very least, we'll learn a ton about federated queries and building analytics across different components of the stack. So my main question, and we'll talk about Presto and Trino and get into the details there. But I think my main question, Costas, is the view of the stack increasing in complexity. So we had a guest recently talk about how the promise of the cloud was that it'll unify all this data and everything. And in fact, it's creating more complexity and more data silos. I thought that was very compelling. And I think Justin is living that every day with Starburst, trying to make it easier to drive analytics with an increasingly fragmented stack.
Starting point is 00:01:28 So I want to ask him about the complexity of the stack and how that's changing. How about you? Yeah, I want to learn more about Presto in general. Presto has been around for quite a while and he has gone through many different transformations. So that's definitely part of the conversation that we are going to have. And I want to learn more about Justin's view of how this data stack is maturing and where he thinks that we are really going to with the technology. Mainly because the interesting part with Presto is that it has a very, very different approach when it comes to querying. It has a very decentralized approach, which is something completely different, actually opposite to the best practice of trying to source
Starting point is 00:02:13 all the data and store it in one centralized location and do the queries there. So yeah, I think we will have a lot to chat about with him. Well, let's dive in and get to know Justin and Starburst. Let's do it. Justin, welcome to the show. It's really great to have you with us today. Thanks, Eric. Super excited to be here. Well, let's start where we always start. Would love to hear your background. You've done some really cool stuff, but kind of what led you to Starburst? Yeah. So let's see, this is my second startup. My first startup was back in 2010. It was called Hadapt and it was a early SQL engine for Hadoop, just as Hadoop was starting to pick up momentum. And really at the time, people were thinking about
Starting point is 00:02:57 Hadoop as kind of cheap storage or a way of doing batch processing on mass amounts of data. And our idea was to turn it into a data warehouse. In fact, I think the business plan we wrote was to become the next Teradata with really doing data warehousing within Hadoop. Now, as luck would have it, we actually ended up being acquired by Teradata four years later. And I became a vice president general manager at Teradata, responsible for emerging technologies and really trying to think about the future of data warehousing analytics and what that might look like. And it was in that context that I actually met the creators of an open source
Starting point is 00:03:38 project called Presto. They were at Facebook at the time, Martine, Dan, and David. And we started collaborating and working on making Presto better and better and better. And today that effort is now known as Trino. So the name changed along the way, but that's really how Starburst was ultimately born as really the founders and creators of that open source project, leaving our respective companies. I left Teradata, they left Facebook, and Starburst was formed. Very cool. And can you just give us a quick rundown of what is Starburst and what does it do, just so our listeners have a sense of the product? Yeah.
Starting point is 00:04:15 So much the way my first company was really SQL and Hadoop, this is SQL and anything. And I think that was what got me so excited about it. It's about doing data warehousing analytics without the need for the data warehouse. And from a technical perspective, it's basically a database without storage. And it thinks of all other storage as though it's its own. So you can query the data where it lives. You might have data in Mongo. You might have data streaming in Kafka. You might have data that you want to access via Elastic and TechSearch. You might have data in traditional legacy systems like Oracle or Teradata, you might have Snowflake, you might have data lakes, that's one of the areas where we really excel is accessing data and data lakes.
Starting point is 00:04:54 And in all of those cases, you have kind of this single point of access to query the data where it lives, without the need to move it around and do those typical kind of ETL pipelines. So it's really about giving you faster time to insight, that's the way we think about it around and do those typical kind of ETL pipelines. So it's really about giving you faster time to insight. That's the way we think about it. And removing a lot of that friction traditionally associated with classic data warehousing. Super interesting. So let's talk about, I love your perspective because you have a great perspective because you've both built systems that drive analytics from a database standpoint and then are now leading a company
Starting point is 00:05:29 that solves problems across different pieces of infrastructure. We had a guest recently who made a really good point. It sounds very obvious, but the data stack is increasing in complexity, right? I mean, you have all these tools that are making functions within the stack easier to do that before required a significant amount of engineering effort.
Starting point is 00:05:52 And it's like, okay, great. Like we're getting beyond some of the low-level plumbing problems and which is awesome. But especially as you reach scale, the stack is increasing in complexity, right? So you have data warehouses, data lakes, Kafka. There are a number of different sort of core pieces of infrastructure that you're running at scale, which actually makes traditional linear data into warehouse, into BI dashboard way harder. So can you just talk us through what you're seeing on the front lines? Like how are stacks increasing in complexity? And then I'd just love to hear like your perspective on Starburst as the answer to managing that
Starting point is 00:06:31 without necessarily having to get into the plumbing. Yeah, absolutely. Well, first of all, I 100% agree with your previous guest about the stack gaining complexity. And I think of a old quote from really a legend in the database space, a guy named Mike Stonebreaker, who's a professor at MIT, and he was the creator of Ingress and Postgres and Vertica and a variety of different database systems over the years, won the Turing Award. And he had written a paper that basically said there is no one size fits all database system, meaning that you're always going
Starting point is 00:07:05 to have different databases for different types of jobs, different types of use cases. And I think that's true. Some applications you want to build on Mongo, some might be Oracle, some might be something else. And I think that for better or worse leads to greater complexity because now you have even more data sources. And we find particularly in large enterprises, this is compounded by the fact that you have different departments, different groups within an organization doing their own thing. You may acquire businesses. And every time you have M&A and you acquire a business, you just acquired their data stack as well, right?
Starting point is 00:07:40 Sure, yeah. Right? And that's actually one of the fastest ways we find that our customers end up being multi-cloud is because they bought somebody who runs on Azure or GCP, and now they're multi-cloud. So 100% agree on complexity. And that's a big part of what we hope to solve by essentially allowing you to go direct to source and be able to run those analytics by connecting directly to where the data is. I think that's the power of the platform. Essentially, I like to describe it as really giving a data architect or a data engineer infinite optionality. If they still want to
Starting point is 00:08:16 consolidate data into a data lake or data warehouse, that's cool. I would argue data lakes are probably the better bet over the long run for consolidating data. And we could talk about that just from a TCO perspective. We will. We'll definitely talk about it. Yeah, absolutely. But the point is at least you have the freedom of choice. And so that's really what we're trying to do is kind of create this single point of access across all of those different data sources to add an abstraction. And abstractions are always really for the purposes of creating simplicity where there is complexity. And I think we allow you to do that within the data architecture realm. Let me ask you, so you're a two-time entrepreneur.
Starting point is 00:08:59 So I'm going to ask you a business question that relates directly to this problem. So a lot of times, let's take the example that you gave of a business acquiring another company and inheriting their stack, right? Yep. Integrations and all of that are a whole subject unto themselves. But I would argue that in a lot of those cases, like the synergy, wow, synergy is such a bad buzzword, but let's say that the, the results you can produce from understanding the power of the relationship between the two businesses tends to have an outsized impact. Okay. And we'll just call that synergy for the purpose. Yeah. No, I mean, that's like the truest definition. I agree with you. I know it, it has negative connotations only because it's usually, I think, overinflated, right? Like people talk about synergy and then maybe they don't find the synergy, but you're absolutely right. Yeah. And I think in this day and Like, do you see that, especially among Starburst
Starting point is 00:10:05 customers where ultimately a lot of these things come to a head in analytics that then influence business processes that influence product? You know, there's a variety of implications here, right? But analytics is, and understanding those components is usually the tip of the spear in terms of like driving the decisions that filter out and shape the business. Do you see that a lot where when you can combine data from different sources in a way that would be, I mean, some of these things, like you're talking multi-cloud, if you put a set of data engineers on this, you're talking months of work to get a basic understanding of how the data relates. And then you have a ton of BI work and analyst work to get the insights on top of that. And so do you see that a
Starting point is 00:10:46 lot among your customers? Yeah, a hundred percent. In fact, it's a great use case actually for us because when we see that an M&A transaction is taking place, we know that there's instantly going to be an opportunity for the reasons that you mentioned. You're inherently talking about two different sets of data and you're talking about an integration effort, which from speaking to at least one customer that is quite acquisitive, often takes like two years to fully integrate those two entities to get the value that the investment banker had written up in the original proposal, right? So it takes a long time. And the beauty of this mindset or this approach of kind of a single point of access or what some are now calling a data mesh, which I'm sure we'll talk about as well, is that you're getting instant connectivity. So you don't have the delays of all the challenges associated with getting the data out of one system, navigating how to transform it and load it and get it prepared
Starting point is 00:11:45 into another system. All of that can be done in weeks rather than months or years. And I think that speaks to that time to insight ability that we can provide. Yeah. Okay. One other question for me, and I'm just genuinely curious about this. So stack is increasing in complexity and you're seeing this on the front lines because you're providing an antidote to that. How is it increasing in complexity? Are there specific trends that you see around particular technologies that maybe add to the complication of what you would normally solve from a low-level plumbing standpoint? Yeah, well, one thing I'll mention, and this ties a little bit back to my Stonebraker quote, but there's a lot of different systems out there now.
Starting point is 00:12:35 And it's not just different types of databases. It's other forms of data as well. It's CRM systems. It's web analytics. It's a whole host of different data sources that you want to combine to understand your business better. Customer 360 is a very classic use case that we work on with our customers. And very often that involves pulling together a variety of data sources. I think part of this also, candidly, is I think fueled by a tremendous amount of venture capital that's poured into the data space over the last decade. There's a data landscape that
Starting point is 00:13:05 First Smart Capital produces every year. I'm not sure if you've seen it. Matt Turek is the VC who maintains this. And I like to go back just for fun sometimes and look at like the 2012 version of this data landscape. And it's already look complicated. There's like 30 different data sources. And then you look at the 2021 version, you're like, that's an eye chart. Like you have to zoom in. Like it's hard to even find my own company in that space. So I think that's part of it as well. You've got a lot of different niche players. Maybe at some point there'll be some consolidation that simplifies, but we don't see that at least any time soon. And that means ever greater complexity. The one other thing I'll mention that I think is compounding this problem is a demand from the user side, which could be an analyst or data scientist for more self-service access
Starting point is 00:13:56 to the data that the organization has. And so you've got greater complexity on one end and a wider variety of potential users on the other end. And I think that that's a painful place to be in the middle. Yeah, for sure. We had a, on a recent show, we did a fun exercise where someone asked us, how would you build this in 2012? Which is a really interesting mental exercise, right?
Starting point is 00:14:20 Relative to all the options we have now. So that's great. This is super fascinating. Costas, please dive in. I have quite a few questions, but Justin, I'd like to start with a pretty simple one that has to do with the conversation that we had around the data stack.
Starting point is 00:14:38 And I'd like to ask you from your experience and your experience also through the lenses of Starburst, what are the essential components today of a data stack that a company needs? And if you can, I'd like to compare it on how a data stack looked back in the Hadoop era when you started your previous company and what are the differences there? Okay, great. All right. Well, I'll start there. Maybe I'll start with the past and then go to today. So 2010 was an interesting transition point or the beginning of a transition. I would say the concept of a data lake was in its infancy back then. Of course, back then,
Starting point is 00:15:19 data lake was synonymous with Hadoop. That was the only data lake. Now it's increasingly cloud object storage like S3 or Azure data lake storage or Google cloud storage. But back then it was Hadoop. And I think what people at the time were just starting to think about or transition is like, can I do some data warehousing in Hadoop? Can I do some ETL in Hadoop? At least the T part of ETL, of course. Can I do some transformations in Hadoop, and essentially offload very expensive compute from my Teradata system or my Oracle system and use this cheaper batch oriented, infinitely scalable open source platform instead. And so it was very interesting from that perspective. I think a lot fewer data sources in that world, Teradata was striving to be the single source of truth with, I will say, mixed results, meaning that they were probably the closest thing to the single source of truth.
Starting point is 00:16:13 But you still had different data marts and other databases, SQL Server here and there and Oracle here and there. And so still a bit of a heterogeneous environment, but not nearly at the degree that it is today. The players back then, I would say Tableau was the new kid on the block and killing it. But absolutely the new kid back then, displacing maybe some of the older BI tools like Business Objects or Cognos or MicroStrategy at the time. And ETL back then was synonymous with Informatica. I think that's another big change, right? So if we fast forward to today, I think we are in a much cloudier world. I mean that in a sense of like more data is in the cloud, which maybe makes it cloudier in multiple levels, especially for those customers who are hybrid. I think those are unique challenges too.
Starting point is 00:17:03 But Data Lake now is synonymous with cloud object storage. I think Snowflake is trying to be the Teradata of the future, very much embracing this same concept of a single source of truth. And then you have Fivetran or Matillion or other players sort of like being Informatica 2.0. So on a surface level, you could say maybe, and then at the BI level, Tableau is still very strong. Maybe Looker is a more recent addition. There's also Preset, the company behind Superset, which is interesting too. But on a surface level, you might say these are similar. I think though, we're at a point where data lakes have matured, or at least data lake as a data warehousing alternative has matured a lot as a concept. I think back in 2010, when I was doing that first business, it was an appealing idea, but not a lot of people were doing it in practice, largely because it takes a long time to build an analytic database. I learned this the hard way, building a cost-based optimizer, building an execution engine takes a long time. And in 2010, they were all very early. So you couldn't get the
Starting point is 00:18:10 same performance out of SQL and Hadoop as you could in Teradata, for example. If we fast forward to today, that gap is much, much narrower to the point that it's almost insignificant. And whether that's Starburst querying data in a data lake for other players in the space, like Databricks has a SQL engine now for querying the data lake as well, you see this idea of like a lake house becoming more popular where I'm going to store a lot of my data in a data lake and maybe skip out on the Snowflake model. So I guess I would summarize by saying, I think the data warehousing model, irrespective of the individual players, is being challenged now today in a way that it wasn't previously in history.
Starting point is 00:18:56 Yeah, yeah. Makes total sense. That was a very, very interesting comparison between the two points in time. You mentioned data lakes, and it's been like a couple of months, at least now that we see quite a few data related companies getting substantial funding, right? And also quite a few open source projects. We have Iceberg that came out of Netflix, Hoodie, which came from Uber. And of course we have Delta Lake, right? So what's your opinion there?
Starting point is 00:19:32 Like, what do you see? Because the way that I see it and how I feel about it is that we have like some kind of decomposition of a database system, right? Because if you think about something like Postgres, you have an extremely complex system that is like a black box at the end that you query using SQL.
Starting point is 00:19:50 A very simple, let's say, language. And we have reached the point right now where we are talking about transaction logs, about query engines on top of the file system. It kind of feels like we have decomposed the database system into small components and the data engineering teams are trying to take all these and recreate, let's say, a large scale database system. Where are we today? Like how mature are these technologies? Like if we take, for example, Hudi or like Delta Lake compared to something like Snowflake.
Starting point is 00:20:24 Yeah. So first of all, I agree with your general sentiments. I mentioned in the opener that we're like a database without storage. So you could say we're like the top half of a database, the query engine, the execution engine, SQL parser, the query optimizer. And Iceberg is like the bottom half, if you will, of a database. It's the storage piece or Hudi or Delta. And I think what we're seeing right now, which is kind of an exciting period in history, is back to that point about data warehousing analytics in a data lake, the one missing piece throughout the last 10 years has been the ability to do updates and deletes of your data. And that's the gap that I think we're closing with those data formats,
Starting point is 00:21:06 which now allows for what Teradata calls active data warehousing, like being able to do updates, do deletes, modify your data, and still perform high-performance analytics and Power BI tools all within one system. And that's, I think, like you're right on the cusp of eliminating that delta,
Starting point is 00:21:26 if you will, no pun intended, between data warehouses and data lakes as we speak. And I think that decomposition is good for customers in the sense that it gives them a lot of optionality. So for example, if you're going to standardize on delta, you can use Databricks to train a machine learning model, create a recommendation engine. If you're a retailer, if you buy this pair of shoes, you might like this pair of pants. That's a great use case for Databricks. And then you might use Starburst to generate your reports, use Tableau to access that data and figure out how much did we sell last month or how much do we think we're going to sell next month? And they can both work off of the same file formats. And that's pretty
Starting point is 00:22:08 cool. So I think that gives, again, customers just a lot of flexibility to interchange engines. And also they have flexibility around which formats do they choose. Iceberg, Hudi, Delta, all very interesting and promising options. And I guess I'll just mention one last point. I think the big distinction between this way of thinking and Snowflake is when you load your data into Snowflake, you've now locked it into a proprietary format. And that's an important piece with respect to vendor lock-in and having control and ownership over your own data. And that's one of the things that I observed even in my time at Teradata. Nobody ever said Teradata was a bad database. It's a great database, but they really hated the fact that it was inflexible and it was very expensive, right?
Starting point is 00:22:54 So. And Justin, one question and Costas, I apologize to jump in here, but I'd love to just benchmark when we talk about performance, a lot of times and speed to insight is a term that you've mentioned a couple of times. I'd love to just benchmark on that because one way I like to frame this question is the definition of real time has changed over time. Right. And so real time at one point may have meant a couple of times a day, right? And so it was getting faster and faster and faster. I just love to know, like, what's your perspective on that changing, especially relative to query performance? And I know that can change based on business model, but when you talk about recommendations in an e-commerce standpoint, the bleeding edge of that is generally like has very heavy requirements as far as performance in real time, but that also is relative.
Starting point is 00:23:47 So I'd just love to know, what are you seeing with your customers as far as requirements on performance and delivery from that standpoint? Yeah, so there are two dimensions that we think about with our customers. One is the query response time. And that's what I think people have classically referred to as performance when it comes to analytic database systems. Like I run a query on a certain amount of data, how fast does it return? And there are industry benchmarks that have been used for a long time, TPC-H, TPC-DS. These are sort of like standardized benchmarks that you can run your queries through. And of course, we would always say the best benchmarking is actually on your own data though, even better than industry benchmarks.
Starting point is 00:24:28 But that's one dimension of performance. The other dimension, which I think is often overlooked, and this is what we really refer to when we think about time to insight, we think of that as a bit more holistic of a measure factoring in how long did it take from the moment the data was created to my ability to analyze it. And if you think about it in that context, just to compare and contrast, let's say Snowflake versus Starburst. Snowflake, maybe a query runs in two seconds, and maybe it takes Starburst 2.6 seconds. And you might say, oh, well, Snowflake ran that query faster. Yeah, okay, a little bit faster. but it might've taken three weeks to get the data into Snowflake in the first place. And so really that query was three weeks. Right.
Starting point is 00:25:10 And that's what I mean by time to insight is I think people learn over time that the, there's a prerequisite step before that traditional data warehouse is able to actually run that first query. And that's an important tax that you don't necessarily need to pay. Yeah, super interesting. Yeah, that's, I think, a subject that we want to explore more in the show just because when you talk about latency, time to insight,
Starting point is 00:25:37 like those are very subjective depending on where you're on the pipeline. So super interesting. Yeah, and that's also something else very interesting, Justin. So let's talk a little bit about ETL. Okay. And I want to hear from you, what do you think is the future of ETL?
Starting point is 00:25:55 ETL has been around like since we had the first database systems, exactly because as you said at the beginning, we cannot have one system that does everything. Different kind of like workloads requires different architectures and different systems. And probably today is also a bit even more complex, the environment. If you consider that you have to download data through REST APIs because something is behind your Salesforce instance, for example, or NetSuite or whatever, right? What do you see happening to ETL? Because from what I understand, when you are incorporating like Starburst in your architecture,
Starting point is 00:26:30 for example, the need for ETLing the data from, I don't know, like a production database, for example, to your data warehouse is reduced, right? And at the same time, like I've seen, I was looking like today, for example, there was an announcement from Snowflake that Iterable, which is like a company like in marketing, if I'm not mistaken, Eric, right? It's a marketing product. Yes, indeed.
Starting point is 00:26:57 Yeah. Yeah, like customer journey, like orchestration. Yeah, yeah, yeah. So now you can get access to your iterable data on Snowflake directly on Snowflake without doing the ETL through the data sharing capabilities that Snowflake has, right? Interesting. I didn't...
Starting point is 00:27:14 That's interesting. Yeah, yeah. They just announced the product today. Again, where is the ETL there, right? Until yesterday, if I was using Interable, I would have to have a pipeline there to pull the data. It will take days, blah, blah, blah, and put it into Snowflake.
Starting point is 00:27:30 So how do you feel about ETL? What's the future of ETL based on your experience? Yeah, so I was going to say, and I did not read the news because you're more up to speed than I am, but my guess is that Iterable is probably running Snowflake themselves, just because the way that Snowflake is building its data sharing marketplace is really like a proprietary network. It's basically other companies using Snowflake can share data with other companies using Snowflake. So that would make sense to me in that context.
Starting point is 00:28:06 And I think that's like Snowflake's view of world domination. It's like, if everybody's using Snowflake, then great. Yeah, it's a happy world. You can share among Snowflake databases. So I get it from a business perspective. And obviously, they've been a very successful business. And Frank Slootman is a very successful CEO. However, I don't think it reflects
Starting point is 00:28:26 necessarily the reality of the data landscapes that customers have. I think it's probably naive to think that everything will get ingested and sucked into Snowblake databases so that it can be shared and used. So our approach basically just says all data sources are essentially equal and we can work with any of them. But to answer your question about the future of ETL, so I think it's the E and the L that we're most focused on making optional, I guess you could say. There may still be times where you want to do the T for sure. And I think the way we see the future of this industry moving forward is we still think there's going to be great reasons to pull data together into one physical place.
Starting point is 00:29:12 Maybe it's to power a particular dashboard or for certain applications, it would make a lot of sense to pull data together. But we think that increasingly that will be the data lake because of the economics involved, right? Like at the end of the day, the data lake is always going to be your lowest TCO play. The storage is going to be the cheapest, whether it's S3, Azure data lake, whatever. And you get to work with these open data formats that we already touched on earlier. So you're not locked in. And so we think that's going to be like your best bet for when you need to consolidate data. And then for other cases, you can just query the data source directly. And again, that kind of goes back to that optionality. So I guess to summarize, I would say, I don't think ETL goes away,
Starting point is 00:29:54 but I think it becomes more optional. Interesting. Just to jump in there, Justin, that is a really insightful, and I'm going to put my marketing hat on here because I've been burned many times by a marketing tool saying we have this direct integration. And in reality, it's actually just a sort of behind the scenes, like ETL job. And so it makes total sense that like, if it really is delivering on the promise, it probably is that they have their data in Snowflake. And from an actual data movement standpoint, that makes a ton of sense. That was just very clarifying for me because it's like, yeah, I've heard that so many times before and it's not true. They're actually just running some job in the background and it's not real time. And of course, ETL has major problems when it
Starting point is 00:30:40 comes to schemas and all that sort of stuff. But if both systems are in Snowflake, like that would actually work pretty well. But then to your point, you're in the Snowflake ecosystem, right? And that's the boundaries of the boundaries. So I just appreciated that as a marketer, understanding the technical limitations of problems I faced before trying to move data around.
Starting point is 00:31:04 All right. That was super, super interesting. I'm very interested in ETL, as we can all understand. So Justin, let's chat a little bit more about Starburst as a product, right? And my first question is, at what stage of maturity of the data stack, as we talked about, Starburst makes sense to become part of this data stack? Yeah, well, it depends on where you're starting from. We kind of think about customers on a journey, journey to somewhere, but they're all starting at a different point in time.
Starting point is 00:31:38 For some of our customers, it's simple. The most simplistic way to get started with us is you have data in S3 and you want to query it. And you're currently thinking about, well, do I load it into a data warehouse like Snowflake or do I just leave it in open data formats? Do I use something like Athena on AWS, which, by the way, is actually Presto Trino under the covers? And that's what powers Athena. How do I want to build my modern data warehouse type of stack? And that's a great application. That's where the kind of leading internet companies end up using our technology.
Starting point is 00:32:11 They have the luxury of designing their stack from the ground up. And very often it is a data lake in S3 or some other cloud object storage and just querying it directly with Starburst. And in that sense, you're essentially building an alternative data warehousing style platform. Again, you might use Iceberg, you might use Delta, you might use Hudi if you want that ability to do updates and deletes as well. So that's a very simple place where people often start, particularly if they have the luxury of starting with a clean slate. Another place that customers start is they say, okay,
Starting point is 00:32:45 I have a data lake, but I also have a bunch of other databases. And maybe I've got Mongo, maybe I've got Oracle, and I really need to join a table that I have in Oracle with some tables that I have in S3 or Hadoop. And that's another great place to start is really combining data sets that currently live in different silos. And we can very easily provide fast SQL access to both systems. Another way that people think about us is as an abstraction layer that hides the complexity of data migration. So a lot of people going through digital transformation where they want to move data off of Teradata or Hadoop and they want to move it to the cloud. But that can be a pretty disruptive endeavor if you're trying to really like just turn a system off and move it to some totally different system. So another approach is you connect Starburst to those systems, have your users end up sending queries to Starburst
Starting point is 00:33:38 and that gives you a bit of breathing room and the luxury of time to kind of move tables out of one system and move them into another system more gradually without the end user having to know where the data lives. And that's sort of like hiding where the data lives. That's thinking of us as a semantic layer, essentially, above where all the data is. So those are kind of three different areas where we typically start working with customers. Yeah, makes total sense. And let's talk a little bit about the experience, the product experience. And when I say the product experience,
Starting point is 00:34:13 I have like two personas, let's say, in mind. One is like the data engineer, like the person who is maintaining the data infrastructure and probably has to interact with Starburst as a piece of infrastructure. And then the users who are querying the data, right? So they are different, obviously. So what's the experience that these two personas have when they are interacting with Starburst? Yeah, so for the data engineer, they first of all have two choices. We have really two product offerings today. We have
Starting point is 00:34:43 Starburst Enterprise, which you manage yourself. So if you want to control the entire infrastructure, maybe you want to deploy on-prem, maybe you want to deploy in the cloud, but you have a particular setup that you want to maintain. Maybe you need Kerberos integration or LDAP integration, or you want to run on Kubernetes on-prem. You have a lot of flexibility with Starburst Enterprise, but you have to manage it yourself. So that's for somebody who's up to that challenge, or maybe who has the requirements to run in their own environment. The other option is something called Starburst Galaxy. And Galaxy is a cloud-hosted offering. We manage all that complexity. And essentially, you have a control plane that allows you to connect to your different data sources and configure the system. You can auto scale up and down. So you're
Starting point is 00:35:30 using your EC2 resources efficiently. You can even auto suspend the cluster where it'll just shut off automatically if it's not being queried. And because we're like a database without storage, restoring it takes a few seconds and we're connected already to the data sources you have. So there's a lot of nice kind of ease of use features in particular around Galaxy to make the data engineer's life as seamless as possible. For the end user, the experience for both platforms should be roughly the same in the sense that this whole thing should be pretty transparent, meaning that they are just using their favorite tool, whether it's a query tool and they like to write their own SQL, or they're using a popular BI tool.
Starting point is 00:36:11 And that connects to either our JDBC, ODBC, or REST API. And now they're accessing data and they can be joining table A in one data source with table B in another data source and not have to deal with any of that complexity. Back to Eric's earlier question about the growing complexity of the data stack. So we really try to hide that from the end user. Are there some requirements from the side of the data sources in order to work properly with Starburst in terms of data modeling, for example? Are there limitations there? How do you take something from Mongo, right? Which
Starting point is 00:36:46 is like a document-based database and something from Postgres, for example, and you query them at the same time. Like how do you do that? Yeah. So the short answer is we have this notion of connectors, but the word connector almost sells it short because the connectors are actually pretty sophisticated. There's quite a bit of logic involved in each one. And each connector is different based on the source system that you're working with. So in a nutshell, the connector is connecting to the catalog of the underlying system and knows how to essentially pass that SQL query or execute that SQL query or translate that SQL query to the underlying system. It also has the ability to do push down in some cases to minimize the data moving over
Starting point is 00:37:31 the network. Some connectors are parallel. So if you're connecting to an MPP database system, like let's say Oracle or again, Teradata or Snowflake, that creates a parallel connection. So you get even faster read. So each connector is a bit different, but that's essentially where the logic lies that tells the system how to actually pass through and execute that query. Interesting. So that was going to be one of my questions is maybe a way to frame this would be like ergonomics. So like in terms of the ergonomics, like it is writing SQL and then having the connectors. And so again, that abstraction layer where you're not having to go low level, is that, is that the idea? Yeah. Yeah. So those connectors, I mean, many of them were created by us. Some of
Starting point is 00:38:19 them were created by others in the community. And, and again, they, they vary in terms of the, the level of performance or sophistication. The most popular ones tend to be the fastest, most feature rich, just because we have the most people using them. But yeah, that's exactly right. In fact, you can build your own connector. Maybe you have a particular, I was just speaking with a customer who had their own time series database that they had homegrown and they wanted to create a connector to that time series database and we're asking like how do i build a connector and that's it's open source and we can point you to the documentation on how to create a connector to your data source as well so justin from what
Starting point is 00:38:56 i understand starburst is mainly for asking questions right it's like a querying mechanism do you also have use cases where like people are using it to write data back? Like for example, I'm creating some features to train a model, right? Or something like that. So I need this information that I have created out of the initial data set to write it back into S3. So then I can get, as you mentioned, as an example, data breaks and train my model. Is this something that you see as a use case? And it's also like something that it can happen with a product right now? Yeah, it can. Now it depends on the data source and the connector again. But yes, many of those connectors do support the ability to write
Starting point is 00:39:41 data back. In fact, we've discovered some actually pretty interesting use cases that we wouldn't have even thought of where companies are doing what you described and also even doing kind of ETL style workloads, despite our conversation earlier where they're taking data out of one system, maybe it's a traditional data warehouse, and writing it to Google Cloud Storage to then be ingested by BigQuery. And they're using Starburst as that federation layer. So it's pretty flexible that way. Yeah.
Starting point is 00:40:12 No, that's super, super interesting. So if I understand correctly from all the conversations that we have so far, like a very solid stack would be, I have my data lake, right? With something like Hudi or like Iceberg. That depends on me. From the Starburst side of view, like doesn't matter what kind of, let's say, format I'm using. Then on top of that, I can have Starburst to query the data, right? And on top of that, I have a BI tool like Looker, for example, or Tableau, right?
Starting point is 00:40:43 Yep. And I can use either like the on-prem version of the product, which is the, or I use cloud. Yep. So how important, and this is a question that it's not just technical or product-oriented, it's also a question to the CEO of the company. How important is the cloud model for data-related products? It's something that we have seen happening with many companies,
Starting point is 00:41:10 like Databricks, for example, is a case like this, Confluent, right? And it's also a very common evolution that we see with open-source projects. We start with a project, and we up like also offering like a cloud solution. How important is this? And also, like, do you see any alternatives to that if someone wants to monetize a data-related product, especially if it starts from an open source project? Man, heavy, heavy questions, Kostas. Softball. Softball, Justin.
Starting point is 00:41:43 Well, I have you here, so I have to ask my questions. Absolutely. I mean, Justin's solving the problems. I'm super interested to hear. Yeah, absolutely. Look, I think cloud-hosted solutions are the new frontier for building businesses around open source. And I think there are a couple reasons for that. I think, first of all, it gets you out of the sometimes challenging situation of deciding what to contribute to the open source versus hold back for your enterprise edition, which can sometimes be, you know, challenging conversations because you want to grow the open source project because that's your adoption vehicle. But you also want to be able to convert that. So you end up with this tension between growing the pie and increasing your share of the pie, right? And I
Starting point is 00:42:29 think the cloud offering takes a lot of that away because you're actually adding a new dimension of value for the customer, which is you're removing complexity and you're making it easy. And people are very willing to pay for that, I think. I think that's the way they're used to consuming products now at this point. So, yeah, big deal for us. I mean, I think Confluent and Mongo are great role models for us in particular, largely because both of them actually went through the same journey that we're going through, where they had a self-managed enterprise edition and then built a cloud offering and really
Starting point is 00:43:06 serve both markets and have these markets kind of work together. For Mongo, it was the Atlas product, which was their cloud product. Confluent has built a cloud offering as well. And what we've seen in both cases, in fact, Mongo had a nice jump in stock price a few weeks ago, is it represents now more than half of their revenue and is the fastest growing part of their business. And similarly for Confluent, maybe less of a share, but the fastest growing element of their business as well. And so we're very bullish on the future and the prospects of a cloud product here.
Starting point is 00:43:38 Yeah, yeah, it's very interesting. One last question from me. And I know that you have like a lot of experience also like in the enterprise space where we have uh primarily like the model of the on-prem like installations until recently yeah do you see because many people like predict that the cloud is going to be to dominate completely right like all these large enterprises out there they are going to migrate completely to the cloud do you see this as a net result at the end or you feel like things are going to migrate completely to the cloud. Do you see this as a net result at the end,
Starting point is 00:44:05 or do you feel like things are going to be a little bit more hybrid at the end? What's your opinion on that? I really do think they're going to be hybrid, either for a very long time or forever, at least long enough that it feels like it will be forever. Because I think we serve a lot of financial services customers. We serve a lot of healthcare customers. These regulated industries lot of healthcare customers. These regulated industries are going to be just more cautious about putting their data somewhere else. And also not for nothing, I think there are actually sometimes TCO arguments to be made for actually running some infrastructure on-prem, despite the complexity of having to run your
Starting point is 00:44:41 own data center. So I think we're going to live in a hybrid world, at least among large enterprise, Fortune 500 customers for quite a long time. And we think that's also good for our business in the sense that we can provide connectivity even across from one cloud to another cloud or from the cloud to on-prem. Yeah. Super interesting. We're getting close to time here. One thing I'd love to do is actually just take a step back and talk about Presto and Trino because you were there towards the beginning and you have some insight and would just love to know how have those projects developed individually and what are the differences? And I would just love for our audience, I mean, I think Presto is pretty familiar to a lot of our audience, like in general.
Starting point is 00:45:28 But the difference between Presto and Trino and just the way that those communities have developed, like you have some specific insight and would love to hear about that. Yeah. Okay, sure. So Presto, just as a refresher, was created at Facebook in 2012 and open sourced in 2014, created by Martine, Dan, and David and a guy named Eric as well. And all of those guys work at Starburst today. But in 2012, 2013, they worked at Facebook. And I actually first met them in roughly 2014, so maybe a year after Presto had been open sourced and we started
Starting point is 00:46:06 collaborating together again while I was at Teradata. And that collaboration grew over years and my team at Teradata, which had been acquired from Hidap, was contributing and they became leading contributors. And so you have this really vibrant core of, call it 10 or 12 engineers who were writing the overwhelming lion's share of the project. That continued. Starburst was formed in 2017. And actually, initially, the creators of Presto were still at Facebook. And it was not until maybe a year or so after we had started with Starburst that they decided to join us. And in the process of joining us, actually before they joined us, they had left Facebook over kind of a disagreement of how the project would be governed, how it would be run. Martine, Dan, and David were very adamant
Starting point is 00:46:57 that it be a meritocratic sort of governance model. And Facebook had Facebook's priorities, which makes sense, right? Like they wanted to take the direction in a direction that benefited their needs. And by the way, Facebook was running basically all of their analytics on the project. So it had become very core and very strategic to them. But these were slightly divergent goals where Martine, Dana and David wanted this open community, a vibrant diversity of users and contributors, where you would earn maintainer or committer status based on the merits of your contributions. And Facebook was like, we got to ship this feature. We need to do this thing for our business needs. And so because of that, they ended up parting ways. And so Martine, Dana, David left Facebook
Starting point is 00:47:39 and continued developing, but developed on a different code repo called Presto SQL. So there was PrestoDB and Presto SQL. And for a few years, nobody knew that there were two Prestos. People weren't really paying attention. But there were actually these two divergent code repositories. Now, they ended up joining Starburst. We already had about half the contributors, leading contributors to the project. So the Presto SQL side ended up moving much, much faster
Starting point is 00:48:06 as a development organization. And long story short, about a year ago, there were some disputes over the trademark itself, the trademark of Presto. And it had turned out that Facebook ended up donating the trademark, the name, which they technically own because even though Martine, Dan, and David created it, they created it while employees of the book. And so I guess my lesson for any open source creators out there is if you are working for a company and you create an open source project, that name is technically owned by the company you work for. So just keep that in mind. But ultimately they donated that to the Linux Foundation and the Linux Foundation said, hey, we can't have two Prestos. So you're going to have to rename Presto SQL. And that's how Trino was born. So Trino is
Starting point is 00:48:50 really that lineage of Presto. It is what, what the creators and leading contributors. And since then, a number of the, of the leading contributors from Facebook have joined us as well. So working on Trino now, instead of the original Presto. Trino is what Netflix and Airbnb and LinkedIn and a lot of the big internet companies are running with. And that's the future. But that's the backstory of the names and how we got where we are. Yeah, love it. No, that's a great backstory.
Starting point is 00:49:19 I love it. It's really fun to peel back layers on the evolution of open source technologies. Well, we're close to time here. Two more questions for you. One is, what's the future look like for Starburst? I mean, we've talked about problems you're solving now, but as you look at the stack, I mean, your bet is that we have hybrid on-prem cloud. Stack is increasing in complexity.
Starting point is 00:49:43 So I would love to know how Starburst is thinking about the future. And then second, how can people explore Starburst if they're interested in it today? Cool. So in terms of the future, I will say we're very bullish on this concept of a data mesh. So I don't know if your audience has heard of a data mesh at this point, but it's basically this kind of paradigm shift that essentially recognizes that data is inherently decentralized. Not only as like a practical matter for a lot of the reasons we mentioned, but also that there's actually benefits to decentralization if you think about it in the right way. And the analogy that I like to use with people is if you think about Wikipedia, where anybody can sort of like create an article, it's generally the expert who knows the most about that particular subject who's writing the Wikipedia article. So you get the person writing about a particular subject area who knows it very, very well, and they have ownership for that. That's kind of like part of what this notion of decentralization means from a domain authority perspective, meaning that like the people who know the domain
Starting point is 00:50:49 end up making the decisions about how to interact with that data, what fields are available. So rather than centralization, putting everything in the hands of a data warehouse team in a monolithic way, you sort of let the owners of the data itself essentially curate the data and publish it, serve it up to the organization as a data product. And that's another big pillar of data mesh is thinking about data as a product, which is an interesting concept, I think, as well. So it's an area we're very excited about. Okay. So in terms of data mesh, this is a really interesting topic because it's a new term. There are different sort of interpretations of how to define it.
Starting point is 00:51:32 And hearing you talk about Starburst actually is a little bit of a light bulb for me in terms of data mesh, because in the conversations that we've had, the challenge with defining data mesh is a tension between decentralization of data, but also the need to actually centralize that, right? In a way that makes a ton of sense for the business as a whole. And so I would love your thoughts on that tension, right? Because decentralization generally applies to technology where you have different technologies being employed by different teams. That means different formats of data, all that sort of stuff. But you still have this need to
Starting point is 00:52:11 centralize it. And so I would love for you to speak to that, that tension as it relates to data mesh and then specifically like, is Starburst the stepping stone to like making sense of that? Yeah. So, I mean, to me, it all centers around this concept of a data product and having the data owners, the ones who understand the domain of that data, be the ones responsible for creating and curating that data product. Now, that data product, I want to stress, doesn't have to be a specific database or even a specific table or a specific data set. It could be any combination of those things. So the data product might have a table that lives in S3, and it might have a table that lives in SQL Server, but the product together, which is
Starting point is 00:52:58 the customers who spend the most and watch ESPN, if you're a cable provider, for example, maybe those live in two different data sets. One's a billing data set. One is a shows watched data set that you have in two different systems. But the data product that you're offering is top spend sports enthusiasts, right? Product now can span across those data sources, but it's still offered up to the organization to consume that way. And Starburst essentially becomes the abstraction layer that allows you to serve up those products without having to necessarily reveal where those data sets live. Like the end consumer of that product doesn't need to know it came from a data warehouse
Starting point is 00:53:44 over here and a data lake over there. Quickly, listeners who are interested in checking out Starburst, what should they do? Yeah, you can check us out at starburst.io. And you're welcome to either download the product and get started or register to use Galaxy, which is currently in beta and will be GA in November. So depending on when this podcast comes out, it may be GA already, but those are your options. Awesome. Well, Justin, this has been really informative and just a great conversation. We'd love to have you back to talk about team structures around data mesh as we shed more light on that subject on the show. Yeah, I think it's a great topic. It's probably one of the most important elements of actually implementing a data mesh. It is all about people, process, and technology, and the people being
Starting point is 00:54:35 the trickiest part. So would love to. Awesome. Well, we'll catch up again soon. And thanks again for taking the time. Cool. Thank you, guys. Thank you, Justin. As always, a great conversation. I think my big takeaway is actually on the data mesh side of things. I think that analytics, federated analytics, as Justin talked about them, I think is the most tactical explanation of the value of data mesh that I've heard yet in a way that makes sense from a technological standpoint.
Starting point is 00:55:11 Because I think as we've talked with other guests on the show, one of the challenges of data mesh is fragmented technology. Everything's decentralized. Centralization across all of that is very difficult. And having an infrastructure technology agnostic solution to that makes data mesh make a lot of sense. I think my follow-up question, which we didn't have time to get to is, okay, analytics is one thing, like taking action on that data is another thing. But that was really helpful. So I really just appreciated his perspective on that. Yeah, absolutely.
Starting point is 00:55:47 And I think we have many reasons to want to have him on another episode. There are many things to talk about. One hour wasn't enough. Yeah, for me, I think the most interesting takeaway was the conversation around ETL and how ETL is changing in this more decentralized and federated world that we are moving to.
Starting point is 00:56:10 And it was interesting to hear from him that the E and the L are not going away, but they are not as important as they used to be. But the transformation is there and we will keep needing to transform the data. So, yeah, it was very interesting. It was also interesting to hear about the history, the story behind Trino and the trademarks. Oh, I loved it.
Starting point is 00:56:35 Yeah, coming out of Facebook and open source drama, which is always interesting. And yeah, I'm really looking forward to have to record another episode with Justin. It was great. For sure. Well, thanks for joining us again and we'll catch you on the next show. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite
Starting point is 00:56:55 podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com. you
