The Data Stack Show - 181: OLAP Engines and the Next Generation of Business Intelligence with Mike Driscoll of Rill Data

Episode Date: March 13, 2024

Highlights from this week's conversation include:

Michael's background and journey in data (0:33)
The origin story of Druid (2:39)
Experiences and growth in data (8:08)
Druid's evolution (21:46)
Druid's architectural decisions (26:32)
The user experience (30:06)
The developer experience (35:14)
The evolution of BI tools (40:55)
Data architecture and integration (47:53)
AI's impact on BI (52:26)
What would Mike be doing if he didn't work in data? (56:27)
Final thoughts and takeaways (57:02)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Before we start the show this week, we've got a quick message from a big show supporter, Data Council founder, Pete Soderling. Hi, Data Stack Show listeners. I'm Pete Soderling, and I'd like to personally invite you to Data Council Austin this March 26 to 28, where I'll play host to hundreds of attendees, 100 plus top speakers, and dozens of hot startups in the cutting edge of data science, engineering, and AI. If you're sick and tired of salesy data conferences like I was, you'll understand exactly why I started Data Council and how it's become known for being the best vendor-neutral,
Starting point is 00:00:34 no BS, technical data conference around. The community that attends Data Council are some of the smartest founders, data engineers, and scientists, CTOs, heads of data, lead engineers, investors, and community organizers. We're all working together to build the future of data and AI. And as a listener to the Data Stack Show, you can join us at the event at a special price. Get 20% discount off tickets by using promo code DATASTACK20. That's DATASTACK20. But don't just take my word that it's the best data event out there. Our attendees refer to Data Council as Spring Break for Data Geeks. So come on down to Austin and join us for an amazing time with the data community.
Starting point is 00:01:14 I can't wait to see you there. Welcome to the Data Stack Show. Each week, we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. We are here with Michael Driscoll from Real Data. Michael, thank you so much for joining us on the show today.
Starting point is 00:01:47 Great to be here, Eric. All right, well, give us your brief background. How did you originally get into data and what are you doing at Real today? Yeah, thanks. My background is actually probably not that dissimilar from a few of your guests you've had over the years. I actually started my career as a software developer working for the Human Genome Project a couple decades back. And naturally, there's a lot of data in the Human Genome Project. And that was really the beginning of a multi-decade love affair, working with data at scale, heterogeneous data.
Starting point is 00:02:30 And since then, I've started a few companies. My first startup was an e-tailer called CustomInc.com. We sell t-shirts on the internet. I later started a consultancy called Dataspora. We did a lot of consultant work for banks and folks in the big data era. I then went on to start a company called Metamarkets, which was acquired by Snapchat or Snap, the makers of Snapchat that did analytics for advertising. And now I've got Rill Data. We're a few years into that journey
Starting point is 00:03:05 and focused on an operational business intelligence product with RIL. All right, that's quite a journey, Michael. And I know that part of this journey also includes some very interesting technologies like Druid. And from the conversation we had earlier, I've learned a few things that I wasn't aware like Druid. And from the conversation we had there, I learned a few things that I wasn't aware about Druid and the relationship it had with BI and what were the initial ideas behind it. And I'm super excited to get into that and learn more about how you started building
Starting point is 00:03:44 Druid while you did that, and how you ended up today, actually, with real data that has Druids on the backend, but it's more than a query engine, right? So I'm super excited to get into the details. What about you? What are you excited to talk about today? Yeah, well, I think there's a few big macro trends that we're seeing in the data world today.
Starting point is 00:04:11 I would say I would be delighted to talk about some of the emerging data engines that are out there for powering fast analytics at scale, really at any scale. So Druid and ClickHouse, also DuckDB, we all know is an exciting new engine. But I think the other trend that for me is particularly exciting is the trend towards serverless frameworks. I think if we've, those of us, and I know all of you pay close attention to the space, I think that there's a lot of new frameworks out there for really taking not just data technologies to the cloud, but making them serverless in the cloud.
Starting point is 00:05:06 And so I look at, yeah, almost any area of the data stack, I think is being remade to be truly serverless at scale in the cloud. And that's a pretty exciting area that's going to take several years to play out. Yeah, 100%. We'll have a lot to talk about that too.
Starting point is 00:05:29 So Eric, what do you think? Should we dive in? Let's do it. All right. Well, I want to talk about Druid. I don't think that we've covered this extensively on the show, but maybe you can help us understand Druid by telling us the origin story and sort of where it came from in your time at Metamarkets. Sure. Yeah. The story of Druid really is similar to the story of a lot of technology innovation, necessity is the mother of that innovation.
Starting point is 00:06:10 Metamarkets was started in the early 2010s as an advertising analytics business. I was the CTO and co-founder there. And we were building basically an exploratory BI tool for some of the largest digital advertising platforms that were emerging back then. And as you can imagine, the data that we were looking at comprised of billions and billions of advertising events. I've often said that, you know, in general, advertising is the crucible of a lot of technology innovation. It's one of the first industries that was kind of fully digitally transformed, right? The digital media already was made of bits. And so unlike e-commerce or other verticals, digital media and advertising really adopted a lot of data infrastructure technologies and invented a lot of data infrastructure technologies much earlier than other verticals.
Starting point is 00:07:27 So here we were dealing with billions and billions of records. We had an early customer based here in California that was called OpenX actually was their name. They're still around today. One of the first programmatic advertising businesses doing real-time buying and selling of ads. So I had a lot of different databases. I first started with Greenplum, which was a distributed Postgres engine I'd worked with. Tried to build dashboards that were interactive.
Starting point is 00:07:59 On top of that, we struggled at high concurrency. We eventually moved to a technique that people still use quite a bit, which is we put everything into HBase, a key value store, and we pre-computed all of the different projections of the OLAP cube and stored those keys and values and age base. But that quickly becomes untenable as you kind of expand the dimensionality of your data. It gets kind of massive. And so then an engineer that we had hired out of LinkedIn, very talented at the time, young guy named Eric Cheddar showed up and said, hey, I have an idea for a distributed in memory OLAP engine. And we were young and possibly naive. And we thought, all right, let's give it a shot. So Druid was, I think we started work on it in maybe late, maybe late 2000 2011.
Starting point is 00:09:22 I think Eric, maybe actually early 2011. Eric wrote a spec for it. I have a, it's on my blog actually, where he wrote out the architecture and a few hundred lines of, of code, 550 word requirements stock. And then, and then about, you know, eight weeks later, the first version of Druid was in production. That was April 2011. We open sourced it at an O'Reilly Strata conference in October 2012. And since then, obviously, it's been widely adopted by lots and lots of companies, probably most notably like Netflix, Lyft, eBay, Salesforce, Pinterest, Yahoo. And of course, Metamarket used it widely
Starting point is 00:10:12 and we were acquired by Snap. I know Snap still today runs a pretty substantial Druid cluster. Wow. What an incredible story because, you know, if you're a company that's providing a BI product and you told someone, well, we're going to build our own real-time analytics database, probably they would say, that's a really bad idea, building a database as an internal tool. But what an incredible story
Starting point is 00:10:43 with the wide adoption of Druid. Did you ever imagine that it would be adopted that wisely when, or that widely when you started? You know, I, I, I think what we, we, I don't think we expected it to get adopted so widely. I think in some ways, you know, the, some of, I believe, the architectural advantages of Druid were that it was very purpose-built. So we weren't trying to create, at the time, a general purpose database. We were trying to solve our own problems. I think that turns out that can be an advantage, that level of focus, because we were able to sidestep a lot of requirements that we would have had to incorporate if we're trying to build a general purpose tool
Starting point is 00:11:32 but for instance you know drew didn't support joints initially and i would say even today it's i don't think it's known for great joint support. But I think what happens when you solve your own focused, well-defined problem is that it turns out other people have similar problems out there. So I think the decision to open source it was, one, I give some credit to Vinod Khosla, who was one of our early investors at Metamarkets. He supported that decision to open source it. Part of the reason we did open source it and it did gain adoption is it was not the focus of the business. We weren't trying to monetize Druid. We were trying to really, I think, be part of a broader ethos in Silicon
Starting point is 00:12:29 Valley, which is create more value than you capture. And we were huge beneficiaries of lots of other Apache ecosystem tools. And it just felt like the thing we needed to do was to give this back. And yeah, I think it was fairly surprising. I think a lot of the credit goes to the engineers also who were... Engineers love working on open source tools. And so there was a lot of investment by an early team to evangelize Druid, to give talks about it, to go help others that were trying to get it running at scale. So it may have been surprising, the adoption, but I think it was also a lot of effort went into kind of driving that early adoption in the Valley. Sure. Well, and it sounds like you tried a lot of other tools before you ended up building it, right? Which is expressive of there being a big need for it.
Starting point is 00:13:28 Can you, I'd love to, you know, you mentioned this, but it's purpose built for pretty specific use cases, right? And you mentioned joins as, you know, not great join support as a characteristic of that. I'd love to know what other characteristics are of Druid and what it really shines at. And maybe you can help us understand that by starting with the very specific problem you were trying to solve at Metamarkets. What were the reports that you were trying to build that none of these other products could support?
Starting point is 00:14:06 Sure, right. So I think fundamentally, and maybe as an aside, I would say, well, it seemed crazy to build our own sort of in-house data engine to power our BI tool. I do think if you look at the BI landscape, we're certainly not alone in that decision. Power BI is powered by VertiPack, which is a quite powerful
Starting point is 00:14:35 OLAP engine. Tableau has Hyper. It's an internal engine. If you look actually at, you know, Qlik, which is an inspiration for the BI tool that, you know, we built at Metamarkets and which we're continuing with at RealData, Qlik also had an internal engine. And so I think if you look at these BI tools and the problems that they're generally
Starting point is 00:15:06 trying to solve, and again, we didn't have the benefit VertiPack and Hyper and Click and even Sisense as an engine. None of those engines were open source. So we weren't able to adopt those at Metamarkets when we were building our analytics visualization tool. I would say there's a few kind of primitives that are really important to support in the kind of ad hoc exploratory business intelligence tool that we built. First and foremost, one of the most important is filtering. So the ability to look at a data set and then filter, in the case of billions of digital media impressions, filter on all of the impressions that are coming from CNN.com. That's a really critical thing.
Starting point is 00:15:59 People do it all the time in their BI tools. It's often thought of as a drill down. The way that there's a number of techniques that Druid uses, but fundamentally one of the core data structures under the covers is basically just inverted indices. So you go through and you essentially index all of your columns and you have a column named publisher website and cnn.com gets tokenized into a number and then you store an index that has all of the places
Starting point is 00:16:40 where that particular value exists and you can do a very fast lookup on on on that data, and then aggregate, you know, only over your values that match. So those are bitmap indices, primarily. And so Druid makes heavy use of these bitmap indices to do indexing of high cardinality dimensional columns in the data. And I think that's the same technique that a lot of the other BI engines use as well. Makes total sense. And tell us about Rill. So Metamark has developed Druid, open sourced it. You sold the company to Snap.
Starting point is 00:17:34 Can you tell us a little bit about your time at Snap? And then I want to ask about Rill, because you're sort of returning to Druid in a way. So yeah, tell us about the time at Snap and kind of how they leveraged Metamarkets technology. Yeah, so I think for the team at Metamarkets, I think we always had aspirations of selling this very unique exploratory analytics tool to multiple verticals. I think ultimately what we found, which is, again, no surprise given my confidence
Starting point is 00:18:16 around digital media, is often at the crucible of innovation for data infrastructure. The companies that had the most data that really needed this analytic stack that we built at Metamarkets, which consisted of pipelines, real-time ETL pipelines that fed into an Apache Druid data layer,
Starting point is 00:18:41 which then powered an interactive visualization tool. That kind of three-layered stack turned out to be very valuable for digital media businesses. And our customers ended up being AOL and Twitter was actually one of our largest customers and a number of kind of leading platforms in the advertising space. What started as a commercial discussion with Snap in 2017 turned into an acquisition conversation, as can sometimes happen. And Snap at that point was looking to accelerate their internal analytics roadmap. They were definitely behind at that point what Facebook, NetMeta, and what Google were offering
Starting point is 00:19:31 to their advertisers. And so MetaMarkets turned out to be an extremely valuable technology asset for Snap to bring in-house and actually build out their own internal and kind of advertiser-facing BI platform. So what we learned at Snap, which was interesting, is that, of course, this Druid-powered analytics stack had a lot of value beyond just advertising data. It soon became something that was used internally
Starting point is 00:20:11 to look at lots of other data streams at scale at Snap, including Snap telemetry data. So another thing that Snap was going through at that point was they were attempting to roll out their Android app. And you can imagine the amount of telemetry. I'm not a mobile app developer, but I I would say, operational intelligence at Stat more broadly. for their application, certainly for looking at their monetization, how many impressions, and what sort of monetization results
Starting point is 00:21:13 they're getting for their advertisers. And it also was used widely by the, not just by engineers, but by sales team and customer success folks. And so I think just being at Snap and watching that wide adoption of this tool internally was the inspiration for thinking, hey, could we take this? Could we do more with this? And so after a couple of years at Snap, I exited. And I was really kind of fortunate that I was able to actually license the core
Starting point is 00:21:52 Metamarkets IP back out of Snap. And that became the genesis of RHEL data today. So we really just saw the power of this platform and really the generality of it. And that was the inspiration to start RIL data now over three years ago. Very cool. want to ask about RIL, there are certainly a lot of technologies out there that are available outside of Druid to do this sort of thing, which I want to ask you about. But the technology landscape has changed significantly since you created Druid. Can you give us a picture of how Druid has evolved over time? Because I think you said 2011, you open sourced it in 2012. And so we're talking about the early days of the cloud data warehouse there even, which itself has changed significantly. So I'd just love to hear about the story of, Druid's had obviously a ton of staying power,
Starting point is 00:23:10 but relative to sort of database world, has been around for quite some time. Yeah, I think the market certainly shifted. The technology landscape has shifted dramatically since Druid was created in 2011 and open source in 2012. And so I would say, you know, what are some of the major shifts today? Probably, you know, if I were starting metamarkets today and we were looking for an engine to power, you know,
Starting point is 00:23:44 interactive exploratory data visualizations, we almost certainly would not need to create Druid. There's a lot of other, I think we're all familiar with a number of pretty powerful engines out there that are quite similar to Druid. You've got Apache Pino, which I think is fantastic for, particularly for streaming use cases. You've got ClickHouse, which is great, I think, in terms of its simplicity and ease to get running
Starting point is 00:24:14 on a single node. And then now I think it supports quite well at tremendous scale in a distributed manner. I think even a lot of the cloud data warehouses have gotten faster and better. I think they're still not quite... I don't know that I would want to run my BI stack or my BI applications directly on a data warehouse like Redshift or Snowflake or BigQuery,
Starting point is 00:24:42 but they've certainly gotten faster and approaching some of the speed that Druid, ClickHouse, Pino offer. Yeah, so I think it's a very different world now. I still think that there's still a need, very much a need for fast engines when it comes to user-facing analytics applications, when it comes to user-facing analytics applications, when it comes to data applications. And so what's probably changed the most
Starting point is 00:25:10 is that you can delay the decision of going to a distributed system longer than you used to be able to. I think the reason why DuckTB has gained so much attention lately is because, look, in the early days of Hadoop, you couldn't wrangle a billion records that easily on a single machine. And Moore's Laws had eight cycles, 10 plus cycles since the early days of when Hadoop was created.
Starting point is 00:25:48 So, and similarly Spark, you know, it was created in an era where machines were smaller and you needed to kind of run things in a distributed way. So I think maybe one of the biggest changes is that we now, we can run much bigger data workloads on single machines. And I think DuckTB, I think its popularity is a reflection that you may not need Spark, you may not need Druid or ClickHouse or Pino to get the kind of fast interactive speeds
Starting point is 00:26:18 that you may want for your data applications. Fascinating. Costas, I could keep going here, but we've entered the realm of talking about the current technology landscape in DuckDB. And so I can see your hand reaching for the microphone. So go for it. Thank you.
Starting point is 00:26:38 Thank you, Eric. So Michael, I want to ask you something because I think there's's a very unique opportunity here with Druid because we have a technology that has been out there for 10 years now. And as you said, and I think some of the stuff that have been already communicated is how different things were in 2012
Starting point is 00:27:03 compared to how it was in 2024, right? And I'd like to ask you, when Druid came out, what was, let's say, the main competition? Like what people, and when I say competition, let's not take that in terms of business competition, but more of how people were solving the problems back then. And how it is today. When do people today go and use...
Starting point is 00:27:35 When is a good time for someone to go and do it? Considering all these changes that you mentioned about the hardware, the software, the market needs, everything has changed in these 10 years, right? Yeah. I would say what's interesting is some things don't change nearly as much, I think, as people might think. Some things do change.
Starting point is 00:28:01 But I think what's basically you know the key features of the engine that we developed i think there's a few kind of decisions that were made in that architecture that are you know that are powerful and and by the way i think these architectural decisions again still remain necessary at scale today so So one of the first decisions was we need this to be a distributed database, right? We cannot, the data exceeds what we can fit on one node. So we need to make it distributed in parallel. And I think if you look,
Starting point is 00:28:36 a lot of the tricks of the trade of making things faster across different data tools is essentially make them parallel. The second thing that we really focused on was aggregation of data. So there was a post, I think, in one of the DBT labs log posts about introduction to OLAP cubes cubes olap cubes aren't you know aren't going away people still use them to aggregate data across dozens of different dimensions and and instead of storing raw event level data storing aggregates which can be
Starting point is 00:29:22 depending on how you do aggregation can be between 10 and 100 times less, have a less of a footprint than your kind of event level data. And then the third piece, I would say is just indexing, right. And there's lots of ways to do indexing. But each of those pieces, you know, parallelization, you know, via distribution, aggregation, and indexing, our customers back at Metamarkets. And I think this is true of a lot of data applications. Customers don't care. The end users don't care about the engine that's powering the application. They just care about the user experience.
Starting point is 00:30:28 And so I think that anyone who's starting to build a data stack today, there's a lot of different tools out there. I would just encourage, you know, Druid's one of them. But ultimately, you know, you've got to pick the right, depending on your scale, just pick the right engine that can deliver, I think, you know, fast sub-second performance for a data application and you'll make your end users happy. Yeah, okay.
Starting point is 00:30:51 You gave me a very interesting cue here because you said user experience. And I think we have to make a distinction here. We have, especially with a system that is, I'd say, user-facing, you have someone who's not necessarily an engineer there who's going to do their own analytics. Maybe even a business user when we're talking about BI.
Starting point is 00:31:15 So we have the user there who they care about a specific set of things around the experience that they have. And then there's also developer experience, right? Like it's all that are responsible for deploying, operating, building what the users need. And I think we need both. We need to balance both at the end, like in this environment that we have. Can you tell us a little bit more about that?
Starting point is 00:31:41 Like what's, let's say, the user experience that you talk about? Like what it means for a user? What they care more? How you would define in a few words, let's say, the user experience? And then talk also about the developer experience.
Starting point is 00:31:55 What's different and how it differs compared to the user experience? Well, I think probably the most important value that we embrace in the design of the BI tool that mind or the sort of sophisticated analyst persona in mind. questions of data, I think one of your guests maybe from AlterX made this point. Look, every knowledge worker is an analyst. Every knowledge worker needs to be a data worker. And so at Rill, we've really focused on simplicity. And some of the UX pieces of that are direct interaction. If you want to know more about a value in the tool, click on it, and you can filter on it.
Starting point is 00:33:11 If you want to zoom in on a time period, you should be able to drag that sub-range and be able to zoom in easily. So we really focused on simplicity, where you don't need to get training to use. People shouldn't have to be trained on how to use dashboards. They're such a part of the fabric of modern work that none of us are being trained on how to use a lot of the great tools that we use day to day. And I think dashboards should be no different and data tools should be no different. The second value in talking about user experience is speed, speed of interaction. So that term business intelligence, when we think about an intelligent person,
Starting point is 00:34:05 we think about somebody who responds to a question within seconds of us asking it. Slow, I think, is often synonymous with unintelligent. And so at REL, we really have focused on making our data, exploratory data application, sub-second. And the experience of sub-second tools, that just resonates with the human cognitive system. This is how we interact with the physical world in a sub-second way. And I think we've all gotten, unfortunately, too used to slow data applications. I think that's a consequence of some unfortunate architectures that have been built. But at RIL, we really want to return speed to be at the forefront of working with data. And then the third value that we really embrace at Realist is scale. And maybe just recognizing that in our experience, what may start out as a small data set that you can keep simple and keep speedy
Starting point is 00:35:16 often evolves to be quite a large data set. Most of our customers tend to grow. Some of our customers are dealing with trillions of data points. And so thinking about scalable systems, it does mean you have to make certain decisions. And one decision we made at Rill for the user experience is we do require a lot of upfront modeling of data. We don't let people kind of play fast and loose with their data model. It's not,
Starting point is 00:35:47 we don't really embrace a lot of ad hoc or like post hoc changes to data. We really focus on, we want our organizations to invest time building their data models. And then the result of that is that we can support that third value of scale. Because if you're going to scale up to billions or even trillions of events in your data, you do have to have a pretty well thought out data model to start with. So yeah, simplicity, speed, and scale
Starting point is 00:36:23 are the three values that we think are directed towards a better user experience of the real product. And what about the developer experience? Like, what's the difference there? Like, with a developer who has to go and manage, let's say, like, real data or Pino or any other system, like as part of like a broader data infrastructure there, right? Like what are, from your experience, let's say like the good and the bad things that are happening out there today? Yeah, well, I think there's always this sort of yin and yang
Starting point is 00:36:59 in the world of technology or things, you know, swing from one side of a continuum to the other. One of those is server versus client. So I think one thing that we've embraced at Rill, and I think a lot of developers seem to like, is the ability to do development locally versus development kind of remotely on the cloud. And I think those of us who kind of do local development, we know why we like that as developers. It's the speed of interaction, the speed of feedback. So I would say that's one almost shift, right?
Starting point is 00:37:42 I think we continue to see the value. We have these incredible, you know, most of us have Apple Silicon on our developer machines and an incredible amount of computational power underneath our keyboard. It's a tragedy to not be using that power in our day-to-day experience as developers. So that's one piece. I would say another that there's been some debate is people often ask, okay, low-code or no-code interfaces versus codeful interfaces. I think that at Real, we've made the decision to be very much a code-first developer tool. And everything we do, from defining data sources, to designing data models, to configuring the look and feel of our dashboards, everything is basically defined in SQL and YAML declarative artifacts. And I think that for developers, I think if you can be thoughtful about the code that you choose,
Starting point is 00:38:52 we really made sure we leaned into SQL as a kind of primary language for data modeling. A lot of other BI tools have kind of proprietary data modeling languages like DAX for Power BI or Tableau has its own expression language, LookML for Looker, but everyone knows SQL. So I think the code first approach, I think does serve developers. I think CLIs can be extremely powerful and again, well-crafted CLIs spark joy for developers and i the last thing i would say when it comes to those you know whether you kind of embrace a code first path or you know a no code
Starting point is 00:39:34 or low code path in the era of ai i think there's a quote from someone on Twitter that text is the universal interface. Code is such a powerful interface for the world. Here we are essentially communicating about lots of things just using effectively speech. I think that in a world of AI, I think the code- first interfaces will dominate because that's an API. So for real, it's not hard to use Copilot and develop on real because everything we do is code first. It would be very hard to have Copilot interact with a set of UX components
Starting point is 00:40:22 and design dashboards and data models and data source credentials, if everything were kind of point and click. So yeah, those are probably two things I think a lot about the code versus no code approach and the local versus cloud development approach. That's super interesting, actually. And a very good point about like what's going on with copilots and the AI situation right now and how they work well with code-first interfaces
Starting point is 00:40:55 instead of these drag-and-drop, which I never thought about. That's very interesting. Okay, so let's talk a little bit about like the present and the future of like bi and i'll take you a little bit like back in the past so bi went through let's say already some kind of like a cycle where we had around like 2015 2016 let's say like we had Looker, we had Sisense, we had Periscope data, we had Mode Analytics, we had ChartIO, we had all these different BI tools that some of them were targeting, let's say, other personas, and they were trying to differentiate based on that.
Starting point is 00:41:41 But what eventually happened, from what it seems, is the peak of that cycle was the acquisition of Looker by Google, which I think was also the biggest outcome in this space. And things got a little bit, I'll tell you that, not that exciting anymore. We've had some merges there with like Sisense and Periscope data. I think Mold now like got fired by another company, but... ThoughtSpot. Yeah, ThoughtSpot, correct. And it's not very clear like where this cycle ended and if there is a new cycle of innovation, what is like going to happen bi and what's happening with bi in general right so tell us a little bit about that like what happens
Starting point is 00:42:33 in the this previous cycle let's say in yourations and acquisitions do reflect similar to the world of databases. ask themselves, you know, when you see new database companies being started, you know, in the last few years, like, gosh, do we really need another database? Right. And I guess my broad view on sort of on cycles of BI would be just that, look, the world of data is so massive and so critical, you know, to the global economy and to every business that, you know, in the same way that like we don't have, you know, just kind of, I guess, you know, one type of manufacturing company that, you know, makes atoms. We, you know, there's really not a lot of grand uniformity when it comes to manufacturing bits, whether it's ETL or databases or exploratory business intelligence tools.
Starting point is 00:43:56 So I think my first comment is that I don't see anytime soon this sort of massive consolidation around one database or one BI tool to rule them all. I think the world is far too heterogeneous in terms of its problems for that to be the case. But as far as the kind of current cycle in BI goes, I would say, I think probably, I would argue that there's maybe three generations of BI we can really point to, and we're kind of in the third generation here. I think the first generation was desktop and server-based BI. So I think the Power BI as an early business intelligence tool. I think back years ago, Oracle had a BI tool that they shipped. You had SAP. You had a number of, I would call them old school, in the 1990s,
Starting point is 00:44:59 companies that were shipping desktop BI tools or beefy server BI tools. Click was in that category as well. And many of them had, as I mentioned before, embedded database engines that they came with. And that was generation one. And that worked pretty well for kind of, I think, the nature of enterprise architecture then. But then I think the big shift that occurred with, and frankly, Looker, I think heralded this era,
Starting point is 00:45:28 was where we had the shift to cloud BI. Looker was one of the first companies to really embrace that they weren't going to have an embedded engine. Looker was going to run on top of other databases. It was just going to have its semantic layer talk directly to a cloud data warehouse. So Looker grew, I think, very quickly because the cloud grew and people realized that was a better,
Starting point is 00:45:54 I think, a better architecture. Ultimately, some of the legacy BI tools did embrace that server architecture. Tableau allowed you know, allowed it, was able to connect to remote data warehouses as well. But that was sort of the second generation that I think we saw. And by the way, I think mode and ThoughtSpot probably represent that second generation as well. You know, mode primarily talks to, you know,
Starting point is 00:46:22 a remote data warehouse. ThoughtSpot increasingly is, you increasingly is about connecting to remote systems. I think now we're in a potential third generation of BI. And what's different now? You know, as I mentioned when we were chatting before the show, I think the next big disruption in the data stack is going to be the commoditization of the cloud data warehouse as the source of truth for company data. I think that more and more companies are embracing object stores like S3 and GCS and Azure object storage. More and more companies are embracing structured data
Starting point is 00:47:11 on object stores as their core foundational data layer. And as that happens, I think we need a new generation of data applications that can connect directly to the object stores and not just rely on the data warehouse like Looker did. And so that, frankly, is where we're certainly making a bet at Rill. We're making an enormous investment in support for things like Delta Lake and Apache Iceberg, also the commercial support for it by Tabular.
Starting point is 00:47:47 And I think there's a lot of exciting stuff to be done with that new architecture. So as we move basically through these three generations, we go from kind of server architectures for BI, and we move to kind of cloud warehouse architectures for BI. And I think we're now in the era of object store architectures for BI. And I think there's a lot of innovation that can be done in this kind of new data architecture. That's super interesting.
Starting point is 00:48:18 So in this new paradigm that we are talking about, how are all these pieces fit together? We have data warehouses. We have data lakes. that we are talking about, like how are all these pieces fit together, right? We have data warehouses, we have data lakes, we have BI tools that they have their own engines, right? We have systems like the more real-time systems like Pinot or like Druid and ClickHouse. How do these things fit together? And do they, let's say, overlap? Or are there some clear boundaries there where, let's say, a user, like a company,
Starting point is 00:49:00 has to cross in order to start considering using some other technologies? Well, I think that, again, I think it's still early days, but my own view on how these pieces may fit together, some broad thoughts. First of all, I think that all data will ultimately live in the data lake. It will ultimately live as Parquet, or I know one of your guests was the creator of LanceDB. All structured data will live in a structured data lake in an object store in the future. I think that will be the governing lowest common denominator of data across most organizations. And so that means that all data producing and data consuming systems will go through that foundational object store fabric.
Starting point is 00:50:00 I think Microsoft actually got it right when they talk about their fabric architecture. It doesn't make a lot of sense to try it directly, in my opinion, for only rare use cases, would you want to consume directly from Kafka? I think if you look at like, even what you know, the folks at warpstream labs are doing, they're using Kafka backed by an object store, it's serverless Kafka. So I think that, again, all data technologies, data services will create and write to and read from the object store. So then that does simplify things in a lot of ways, having that kind of fabric there. Then you just have different requirements for different styles of data applications you'd want to power off of that data.
Starting point is 00:50:47 For business intelligence applications, it's really important that things are fast. And so the only way to make sure that things are fast is you need your compute and your storage to be co-located in some way. So you have two choices. You can either move the data to the compute, or you can move the compute to the data. Both of those I think are acceptable. I think a tool like DuckDB is very powerful because it allows you to move compute to the data. You can spin up a Lambda job and stick a DuckDB in it, and you can run that compute very close in the correct region where you have fast access to the object store.
Starting point is 00:51:33 In Rill's case, we decide to actually orchestrate data out of the object store and aggregate it and move it to our compute nodes. But I think co-loc localization of data and compute is a key is a key piece. I would say, but in general, other workloads don't need that, you know, for a lot of reporting workloads, one of the challenges we see today is people are constantly moving data between data systems, one of the advantages of having everything in the object store is you don't need to do that migration. So I think reluctantly Snowflake
Starting point is 00:52:09 and several other tools have embraced the Iceberg format. I think we'll see that continue to expand in its adoption. And the idea there is that for asynchronous workloads, you don't need to move data into Snowflake to query it with Snowflake. You can query an external table from Snowflake and not have to do, you know, an ETL job and on the nature of the workloads. But increasingly, I think we'll see a lot of in situ data applications that operate on the data effectively sitting in the object store. And that's, I think, a huge efficiency gain for that style of architecture versus a lot of the, you know, a lot of the systems today that, you know, where you have a lot of data moving around. Yeah, that versus a lot of the systems today where you have a lot of data moving around. Yeah, that makes total sense.
Starting point is 00:53:10 All right, one last question from me, and then I'll give the mic back to Eric because we are close to the end here. One of the things that has happened in the past two years that is changing, I think, rapidly. This space of data is AI, right? And especially, I think, BI tools have been very eager to embrace that for very obvious reasons. I think, as you said, text being the API, right? It's a very strong concept there. But my feeling is that there are probably much more deeper things happening with AI
Starting point is 00:53:49 and how it will change the way we work with data. So what's your take on that? How do you see BI being affected by AI? And what's next there? Well, I would say maybe three consequences that I can think of. I think I once commented thinking about AI is that I think we'll know we have AGI once we have solved for data engineers not having to write regular expressions on their own. So I think one of the first and highest uses of AI
Starting point is 00:54:26 is actually for data wrangling. We all know that practitioners in data spend far too much time doing writing regular expressions, parsing data. And I think that the tremendous benefits will emerge on that front through things like Copilot. I think we can dramatically improve and reduce the pain around data munging with AI. Second thing I would say
Starting point is 00:54:59 is that in terms of its impact on the languages that data practitioners use, as I said before, obviously AI is code-based today, primarily prompt-based. And I think that we will actually see a lot of people have been trying to create new languages for data transformation. And I applaud those efforts. We need new languages for data transformation. And I applaud those efforts. We need new languages always. But I think that SQL is still, you know, still early days of SQL being adopted,
Starting point is 00:55:36 not just for querying data, but increasingly for transformation of data and ETL and data modeling. And so I think that AI is going to further propel SQL just because it's a lingua franca. There's so much for these large language models to learn from in terms of the massive corpus of blog posts and stack overflow answers that are using SQL to manipulate data. So I think AI will actually propel SQL to even greater dominance as the lingua franca for all data work. And then I would say the third
Starting point is 00:56:13 consequence of AI in the data space is, I think, solving the cold start problem. I think AI is great at sort of generating a scaffold of something that then an analyst can edit versus having to create from whole cloth. And so in particular, the area that I think AI has great potential for data work is, we've seen this already with OpenAI's analytics module. You know, a lot of people spend a lot of time pushing pixels when it comes to building data visualizations to make their data viz pretty. I think that being able to go from a data set to an informative, useful visualization of that data set or generating, you know, eight or 10 different possible visualizations of a particular data set. I think AIs get great potential to aid in that somewhat creative task that not all analysts are great at. So those are three areas I would say. Yeah, the propelling the dominance of, well, helping out with ETL, propelling
Starting point is 00:57:27 the dominance of SQL and providing a path for beautiful data visualization without a lot of effort. All right. That's awesome. I have plenty more questions, but I think we have to reserve them like for another episode. Eric, all yours. Yeah, well, we're right here at the buzzer. But Michael, what a fascinating conversation.
Starting point is 00:57:49 And you have such a long and fascinating career in data. But I have to know, we've talked so much about data on the show. If you couldn't work in data or technology, what would you do? If I couldn't work in data or technology, what would you do? If I couldn't work in data or technology, what would I do? I would probably be, I think my secret dream when I was in college was to be a
Starting point is 00:58:17 skip writer. I wrote a stand-up comedy show when I was in college. And so I would say if I were not working in data, I would probably be, yeah, maybe trying to work in Hollywood writing bad jokes for late night TV. I love it.
Starting point is 00:58:39 That's so fun. Or write for Saturday Night Live. Yeah. I don't know if I'm funny enough for that, but I certainly was, yeah, it would be a fun job, even if I may not have been the best at it. But yeah, that was my alternative dream.
Starting point is 00:58:55 I love it. Well, Michael, thank you again for sharing your time with us today. We learned so much. And best of luck as you continue working on real data. Thank you, Eric and Costas. Thanks for having me. And I look forward to meeting up in person sometime here in the Bay Area.
Starting point is 00:59:12 Thanks, guys. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, ericdodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers.
Starting point is 00:59:37 Learn how to build a CDP on your data warehouse at rudderstack.com
