The Data Stack Show - 181: OLAP Engines and the Next Generation of Business Intelligence with Mike Driscoll of Rill Data
Episode Date: March 13, 2024

Highlights from this week's conversation include:

- Michael's background and journey in data (0:33)
- The origin story of Druid (2:39)
- Experiences and growth in data (8:08)
- Druid's evolution (21:46)
- Druid's architectural decisions (26:32)
- The user experience (30:06)
- The developer experience (35:14)
- The evolution of BI tools (40:55)
- Data architecture and integration (47:53)
- AI's impact on BI (52:26)
- What would Mike be doing if he didn't work in data? (56:27)
- Final thoughts and takeaways (57:02)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Discussion (0)
Before we start the show this week, we've got a quick message from a big show supporter,
Data Council founder, Pete Soderling.
Hi, Data Stack Show listeners.
I'm Pete Soderling, and I'd like to personally invite you to Data Council Austin this March
26 to 28, where I'll play host to hundreds of attendees, 100 plus top speakers, and dozens
of hot startups on the cutting edge of data science, engineering, and AI.
If you're sick and tired of salesy data conferences like I was, you'll understand
exactly why I started Data Council and how it's become known for being the best vendor-neutral,
no BS, technical data conference around. The community that attends Data Council are some
of the smartest founders, data engineers, and scientists, CTOs, heads of data, lead engineers, investors,
and community organizers. We're all working together to build the future of data and AI.
And as a listener to the Data Stack Show, you can join us at the event at a special price.
Get a 20% discount on tickets by using promo code DATASTACK20. That's DATASTACK20. But don't just
take my word that it's the best data event out there.
Our attendees refer to Data Council as Spring Break for Data Geeks.
So come on down to Austin and join us for an amazing time with the data community.
I can't wait to see you there.
Welcome to the Data Stack Show.
Each week, we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
We are here with Michael Driscoll from Rill Data.
Michael, thank you so much for joining us on the show today.
Great to be here, Eric.
All right, well, give us your brief background.
How did you originally get into data and what are you doing at Rill today?
Yeah, thanks.
My background is actually probably not that dissimilar from a few of your guests you've
had over the years. I actually
started my career as a software developer working for the Human Genome Project a couple decades back.
And naturally, there's a lot of data in the Human Genome Project. And that was really the beginning of a multi-decade love affair, working with data at scale, heterogeneous data.
And since then, I've started a few companies.
My first startup was an e-tailer called CustomInk.com.
We sell t-shirts on the internet.
I later started a consultancy called Dataspora.
We did a lot of consulting work for banks and
folks in the big data era. I then went on to start a company called Metamarkets,
which did analytics for advertising and was acquired by Snap, the makers of Snapchat.
And now I've got Rill Data. We're a few years into that journey
and focused on an operational business intelligence product with Rill.
All right, that's quite a journey, Michael.
And I know that part of this journey also includes
some very interesting technologies like Druid.
And from the conversation we had earlier,
I learned a few things that I wasn't aware of about
Druid and the relationship it had with BI and what the initial ideas behind it were.
And I'm super excited to get into that and learn more about how you started building
Druid while you did that,
and how you ended up today, actually,
with Rill Data, which has Druid on the backend,
but it's more than a query engine, right?
So I'm super excited to get into the details.
What about you?
What are you excited to talk about today?
Yeah, well, I think there's a few big macro trends that we're seeing in the data world today.
I would say I would be delighted to talk about some of the emerging data engines that are out there for powering fast analytics at scale, really at any scale. So Druid and ClickHouse,
also DuckDB, which we all know is an exciting new engine. But I think the other trend that for me
is particularly exciting is the trend towards serverless frameworks.
For those of us,
and I know all of you, who pay close attention to the space,
I think that there's a lot of new frameworks out there
for really taking not just data technologies to the cloud,
but making them serverless in the cloud.
And so I look at, yeah,
almost any area of the data stack,
I think is being remade
to be truly serverless at scale in the cloud.
And that's a pretty exciting area
that's going to take several years to play out.
Yeah, 100%.
We'll have a lot to talk about that too.
So Eric, what do you think?
Should we dive in?
Let's do it.
All right.
Well, I want to talk about Druid.
I don't think that we've covered this extensively on the show, but maybe you can help us understand Druid
by telling us the origin story and sort of where it came from in your time at Metamarkets.
Sure. Yeah. The story of Druid really is similar to the story of a lot of technology innovation, necessity is the mother of that innovation.
Metamarkets was started in the early 2010s as an advertising analytics business. I was the CTO and co-founder there. And we were building basically an exploratory BI tool for some of the largest digital advertising platforms that were emerging back then.
And as you can imagine, the data that we were looking at comprised billions and billions of advertising events.
I've often said that, you know, in general, advertising is the crucible of a lot of technology innovation.
It's one of the first industries that was kind of fully digitally transformed, right?
The digital media already was made of bits.
And so unlike e-commerce or other verticals,
digital media and advertising really adopted a lot of data infrastructure technologies
and invented a lot of data infrastructure technologies much earlier than other verticals.
So here we were dealing with billions and billions of records.
We had an early customer based here in California, OpenX actually was their name.
They're still around today.
One of the first programmatic advertising businesses doing real-time buying and selling of ads.
So I tried a lot of different databases.
I first started with Greenplum,
which was a distributed Postgres engine I'd worked with.
We tried to build dashboards that were interactive
on top of that, but we struggled at high concurrency. We eventually moved to a technique that people
still use quite a bit, which is we put everything into HBase, a key value store, and we pre-computed
all of the different projections of the OLAP cube and stored those keys and values in HBase.
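To make the pre-computation approach concrete, here is a minimal Python sketch of the idea (a toy stand-in for a key-value store like HBase, not Metamarkets' actual code): every projection of a small OLAP cube is materialized as key-value pairs, which makes reads instant but makes the key space explode as dimensions are added.

```python
from itertools import combinations
from collections import defaultdict

# Toy event stream: each event has a few dimensions and one metric.
events = [
    {"publisher": "cnn.com", "country": "US", "device": "mobile", "impressions": 3},
    {"publisher": "cnn.com", "country": "UK", "device": "desktop", "impressions": 1},
    {"publisher": "bbc.co.uk", "country": "UK", "device": "mobile", "impressions": 2},
]

DIMENSIONS = ("publisher", "country", "device")
cube = defaultdict(int)  # stand-in for a key-value store like HBase

for event in events:
    # Pre-compute every projection: every subset of the dimensions.
    for r in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, r):
            key = tuple((d, event[d]) for d in dims)
            cube[key] += event["impressions"]

# Reads are now simple key lookups.
print(cube[(("publisher", "cnn.com"),)])                 # 4 impressions
print(cube[(("country", "UK"), ("device", "mobile"))])   # 2 impressions

# The catch: d dimensions mean 2^d projections per event, and the key
# space also multiplies with each dimension's cardinality.
```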
But that quickly becomes untenable as you kind of expand the dimensionality of your data.
It gets kind of massive. And so then an engineer that we had hired out of LinkedIn, a very talented young guy named Eric Tschetter, showed up and said, hey, I have an idea for a distributed in-memory OLAP engine.
And we were young and possibly naive.
And we thought, all right, let's give it a shot.
So Druid, I think we started work on it in maybe late 2010,
maybe actually early 2011. Eric wrote a spec for it. It's on my
blog actually, where he wrote out the architecture in a few hundred lines, a 550-word
requirements doc. And then about, you know, eight weeks later, the first version of Druid was in production.
That was April 2011.
We open sourced it at an O'Reilly Strata conference in October 2012.
And since then, obviously, it's been widely adopted by lots and lots of companies, probably most notably like Netflix, Lyft, eBay,
Salesforce, Pinterest, Yahoo.
And of course, Metamarkets used it widely
and we were acquired by Snap.
I know Snap still today runs a pretty substantial
Druid cluster.
Wow.
What an incredible story because, you know,
if you're a company that's providing a BI product and you told someone,
well, we're going to build our own real-time analytics database, probably they would say,
that's a really bad idea, building a database as an internal tool. But what an incredible story
with the wide adoption of Druid. Did you ever
imagine that it would be adopted that widely when you started?
You know, I don't think we expected it to get adopted so widely.
I think in some ways, some of the architectural advantages of Druid, I believe, were that it was very
purpose-built. So we weren't trying to create, at the time, a general purpose database. We were
trying to solve our own problems. It turns out that level of focus can be an advantage,
because we were able to sidestep a lot of
requirements that we would have had to incorporate if we were trying to build a general-purpose tool.
For instance, you know, Druid didn't support joins initially, and I would say even today
it's not known for great join support.
But I think what happens when you solve your own focused, well-defined problem is that it turns out other people have similar problems out there.
So I think the decision to open source it was, one, I give some credit to Vinod Khosla,
who was one of our early
investors at Metamarkets. He supported that decision to open source it. Part of the reason
we did open source it and it did gain adoption is it was not the focus of the business.
We weren't trying to monetize Druid. We were trying to really, I think, be part of a broader ethos in Silicon
Valley, which is create more value than you capture. And we were huge beneficiaries of
lots of other Apache ecosystem tools. And it just felt like the thing we needed to do
was to give this back. And yeah, I think it was fairly surprising. I think a lot of the
credit goes to the engineers also who were... Engineers love working on open source tools.
And so there was a lot of investment by an early team to evangelize Druid, to give talks about it,
to go help others that were trying to get it running at scale. So it may have been surprising, the adoption, but I think a lot of effort also
went into kind of driving that early adoption in the Valley. Sure. Well, and it sounds like you
tried a lot of other tools before you ended up building it, right? Which is expressive of there being a big need for it.
Can you, I'd love to, you know, you mentioned this,
but it's purpose built for pretty specific use cases, right?
And you mentioned joins as, you know,
not great join support as a characteristic of that.
I'd love to know what other characteristics
are of Druid and what it really shines at. And maybe you can help us understand that by starting
with the very specific problem you were trying to solve at Metamarkets. What were the reports
that you were trying to build that none of these other products could support?
Sure, right.
So I think fundamentally, and maybe as an aside, I would say, while it seemed crazy
to build our own sort of in-house data engine to power our BI tool,
I do think if you look at the BI landscape,
we're certainly not alone in that decision.
Power BI is powered by
VertiPaq, which is a
quite powerful
OLAP engine.
Tableau has
Hyper.
It's an internal
engine.
If you look actually at, you know,
Qlik, which is an inspiration for the BI tool that, you know, we built at Metamarkets and which we're continuing with at Rill Data, Qlik also had an internal engine. And so I think if
you look at these BI tools and the problems that they're generally
trying to solve, and again, we didn't have the benefit of VertiPaq and Hyper and Qlik's engine, and even
Sisense has an engine. None of those engines were open source. So we weren't able to adopt those
at Metamarkets when we were building our analytics visualization tool. I would say there's
a few kind of primitives that are really important to support in the kind of ad hoc exploratory
business intelligence tool that we built. First and foremost, one of the most important is filtering. So the ability to look at a data set
and then filter, in the case of billions of
digital media impressions, filter on all of the impressions that are
coming from CNN.com. That's a really critical thing.
People do it all the time in their BI tools. It's often thought of
as a drill down.
There's a number of techniques that Druid uses,
but fundamentally one of the core data structures under the covers is basically just inverted indices.
So you go through and you essentially index all of your columns
and you have a column named publisher website
and cnn.com gets tokenized into a number
and then you store an index that has all of the places
where that particular value exists
and you can do a very fast lookup on that
data, and then aggregate, you know, only over your values that match. So those are bitmap indices,
primarily. And so Druid makes heavy use of these bitmap indices to do indexing of high cardinality dimensional columns in the data.
And I think that's the same technique that a lot of the other BI engines use as well.
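As a rough illustration of the technique Mike is describing (a sketch, not Druid's actual implementation), here is a bitmap-style inverted index over a dictionary-encoded publisher column, used to filter and then aggregate only the matching rows:

```python
# Minimal sketch of a bitmap (inverted) index over one string column.
publisher = ["cnn.com", "bbc.co.uk", "cnn.com", "cnn.com", "nytimes.com"]
impressions = [3, 2, 5, 1, 4]

# Dictionary-encode the column: each distinct value gets an integer token.
dictionary = {}
encoded = []
for value in publisher:
    token = dictionary.setdefault(value, len(dictionary))
    encoded.append(token)

# Inverted index: value token -> bitmap of row positions where it occurs.
bitmaps = {token: 0 for token in dictionary.values()}
for row, token in enumerate(encoded):
    bitmaps[token] |= 1 << row  # set bit `row` for this value

# Filter "publisher = cnn.com" and aggregate only the matching rows.
mask = bitmaps[dictionary["cnn.com"]]
total = sum(v for row, v in enumerate(impressions) if mask & (1 << row))
print(total)  # 3 + 5 + 1 = 9

# Real engines compress these bitmaps (e.g. roaring bitmaps) and AND/OR
# them together to combine filters across many high-cardinality columns.
```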
Makes total sense.
And tell us about Rill. So Metamarkets developed Druid, open sourced it.
You sold the company to Snap.
Can you tell us a little bit about your time at Snap?
And then I want to ask about Rill,
because you're sort of returning to Druid in a way. So yeah,
tell us about the time at Snap and kind of how they leveraged Metamarkets technology.
Yeah, so I think for the team at Metamarkets, I think we always had aspirations of selling this very unique exploratory analytics tool
to multiple verticals.
I think ultimately what we found,
which is, again, no surprise given my comments
about digital media often being at the crucible of innovation
for data infrastructure, is this:
The companies that had the most data that really needed this analytic stack
that we built at Metamarkets,
which consisted of pipelines,
real-time ETL pipelines
that fed into an Apache Druid data layer,
which then powered an interactive visualization tool.
That kind of three-layered stack turned out to be very valuable for digital media businesses.
And our customers ended up being AOL and Twitter, which was actually one of our largest customers,
and a number of kind of leading platforms in the advertising space.
What started as a commercial discussion with Snap in 2017 turned into an acquisition conversation,
as can sometimes happen. And Snap at that point was looking to accelerate their internal
analytics roadmap. They were definitely behind at that point
what Facebook, now Meta, and Google were offering
to their advertisers.
And so Metamarkets turned out to be
an extremely valuable technology asset for Snap
to bring in-house and actually build out their own internal
and kind of advertiser-facing BI platform. So what we learned at Snap, which was interesting,
is that, of course, this Druid-powered analytics stack
had a lot of value beyond just advertising data.
It soon became something that was used internally
to look at lots of other data streams at scale at Snap,
including Snap telemetry data.
So another thing that Snap was going through at that point
was they were attempting to roll out their Android app.
And you can imagine the amount of telemetry.
I'm not a mobile app developer, but I would say it became operational intelligence at Snap more broadly: for their application, certainly for looking at their monetization,
how many impressions,
and what sort of monetization results
they're getting for their advertisers.
And it also was used widely,
not just by engineers,
but by the sales team and customer success folks.
And so I think just being at Snap and watching that wide adoption of this tool
internally was the inspiration for thinking, hey, could we take this?
Could we do more with this?
And so after a couple of years at Snap, I exited. And I was really kind of fortunate that I was able to actually license the core
Metamarkets IP back out of Snap. And that became the genesis of Rill Data today. So we really just saw the power of this platform and really the generality of it.
And that was the inspiration to start Rill Data now over three years ago.
Very cool. I want to ask about Rill. There are certainly a lot of technologies out there that are available
outside of Druid to do this sort of thing, which I want to ask you about. But the technology
landscape has changed significantly since you created Druid. Can you give us a picture of how Druid has evolved over time? Because I think
you said 2011, you open sourced it in 2012. And so we're talking about the early days of the cloud
data warehouse there even, which itself has changed significantly. So I'd just love to hear about the story of,
Druid's had obviously a ton of staying power,
but relative to sort of database world,
has been around for quite some time.
Yeah, I think the market certainly shifted.
The technology landscape has shifted dramatically since Druid was created in 2011 and open source in 2012.
And so I would say, you know,
what are some of the major shifts today?
Probably, you know, if I were starting Metamarkets today
and we were looking for an engine to power, you know,
interactive exploratory
data visualizations, we almost certainly would not need to create Druid. There's a lot of other,
I think we're all familiar with a number of pretty powerful engines out there
that are quite similar to Druid. You've got Apache Pinot, which I think is fantastic, particularly for streaming
use cases. You've got
ClickHouse, which
is great, I think, in terms
of its simplicity and ease to get running
on a single node.
And then now I think it scales
quite well, at tremendous scale,
in a distributed manner.
I think even
a lot of the cloud data warehouses have gotten faster
and better. I think they're still not quite... I don't know that I would want to run my BI stack
or my BI applications directly on a data warehouse like Redshift or Snowflake or BigQuery,
but they've certainly gotten faster and approaching some of the speed
that Druid, ClickHouse, and Pinot offer.
Yeah, so I think it's a very different world now.
I still think that there's still a need,
very much a need for fast engines
when it comes to user-facing analytics applications,
when it comes to data applications.
And so what's probably changed the most
is that you can delay the decision
of going to a distributed system
longer than you used to be able to.
I think the reason why DuckDB
has gained so much attention lately
is because, look, in the early days of Hadoop,
you couldn't wrangle a billion records that easily on a single machine. And Moore's Law has had
eight cycles, 10-plus cycles since the early days of when Hadoop was created.
So, and similarly Spark, you know,
it was created in an era where machines were smaller and you needed to kind of run things in a distributed way.
So I think maybe one of the biggest changes is that we now,
we can run much bigger data workloads on single machines.
And I think DuckDB, I think its popularity is a reflection
that you may not need Spark,
you may not need Druid or ClickHouse or Pinot
to get the kind of fast interactive speeds
that you may want for your data applications.
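For a sense of what that single-machine point looks like in practice, here is a hedged sketch using DuckDB's Python API; the file path and column names are illustrative, not a real dataset.

```python
import duckdb

# One process on one laptop can aggregate hundreds of millions of rows
# straight out of Parquet files, no cluster involved.
top_publishers = duckdb.sql("""
    SELECT publisher,
           count(*)     AS impressions,
           sum(revenue) AS revenue
    FROM read_parquet('events/*.parquet')  -- illustrative local files
    GROUP BY publisher
    ORDER BY impressions DESC
    LIMIT 10
""")

print(top_publishers)
```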
Fascinating.
Costas, I could keep going here,
but we've entered the realm of talking about
the current technology landscape in DuckDB.
And so I can see your hand reaching for the microphone.
So go for it.
Thank you.
Thank you, Eric.
So Michael, I want to ask you something
because I think there's a very unique opportunity here
with Druid because we have a technology
that has been out there for 10 years now.
And as you said, and I think some of the stuff
that has already been communicated
is how different things were in 2012
compared to how they are in 2024, right?
And I'd like to ask you, when Druid came out,
what was, let's say, the main competition?
Like what people, and when I say competition,
let's not take that in terms of business competition,
but more of how people were solving the problems back then.
And how it is today.
When do people today go and use it?
When is a good time for someone to adopt it?
Considering all these changes that you mentioned
about the hardware, the software, the market needs,
everything has changed in these 10 years, right?
Yeah.
I would say what's interesting is some things don't change nearly as much, I think, as people
might think.
Some things do change.
But if you think about, basically, you know, the key features
of the engine that we developed, I think there's a few kind of decisions that were made in that
architecture that are, you know, powerful. And by the way, I think these architectural
decisions, again, still remain necessary at scale today. So one of the first decisions was
we need this to be a distributed database, right?
We cannot, the data exceeds what we can fit on one node.
So we need to make it distributed in parallel.
And I think if you look,
a lot of the tricks of the trade of making things faster
across different data tools
is essentially make them parallel.
The second thing that we really focused on was aggregation of data.
So there was a post, I think, in one of the dbt Labs blog posts,
an introduction to OLAP cubes. OLAP cubes aren't
going away. People still use them to aggregate data across dozens of different
dimensions. And instead of storing raw event-level data, you store aggregates, which,
depending on how you do the aggregation, can have between 10 and 100 times less
of a footprint than your kind of event-level data. And then the third piece,
I would say, is just indexing, right? And there's lots of ways to do indexing. But
each of those pieces, parallelization via distribution, aggregation, and indexing, was essentially invisible to our customers back at Metamarkets.
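As a rough sketch of the kind of roll-up Mike describes, here is a DuckDB-from-Python example; the table, columns, and cardinalities are invented for illustration. Raw event-level rows are aggregated down to one row per hour, publisher, and country, which is where the 10-to-100-times footprint reduction comes from.

```python
import duckdb

con = duckdb.connect()  # in-memory database

# Hypothetical event-level table: one row per ad impression.
con.sql("""
    CREATE TABLE events AS
    SELECT
        to_timestamp(1704067200 + (i % 86400))               AS event_time,
        ['cnn.com', 'bbc.co.uk', 'nytimes.com'][(i % 3) + 1] AS publisher,
        ['US', 'UK'][(i % 2) + 1]                            AS country,
        random()                                             AS revenue
    FROM range(1000000) t(i)
""")

# Roll the raw events up to one row per (hour, publisher, country).
con.sql("""
    CREATE TABLE events_rollup AS
    SELECT date_trunc('hour', event_time) AS hour,
           publisher,
           country,
           count(*)     AS impressions,
           sum(revenue) AS revenue
    FROM events
    GROUP BY ALL
""")

raw = con.sql("SELECT count(*) FROM events").fetchone()[0]
agg = con.sql("SELECT count(*) FROM events_rollup").fetchone()[0]
print(raw, agg)  # the aggregate table is a tiny fraction of the raw rows
```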
And I think this is true of a lot of data applications.
Customers don't care.
The end users don't care about the engine that's powering the application.
They just care about the user experience.
And so I think that anyone who's starting to build a data stack today,
there's a lot of different tools out there.
I would just encourage, you know, Druid's one of them.
But ultimately, you know, you've got to pick the right,
depending on your scale, just pick the right engine that can deliver, I think, you know,
fast sub-second performance for a data application
and you'll make your end users happy.
Yeah, okay.
You gave me a very interesting cue here
because you said user experience.
And I think we have to make a distinction here.
We have, especially with a system that is,
I'd say, user-facing,
you have someone who's not necessarily an engineer there
who's going to do their own analytics.
Maybe even a business user when we're talking about BI.
So we have the user there who they care about a specific set of things
around the experience that they have.
And then there's also developer experience, right?
Like all the folks that are responsible for deploying, operating, and building
what the users need.
And I think we need both.
We need to balance both at the end, like in this environment that we have.
Can you tell us a little bit more about that?
Like what's, let's say, the user experience that you talk about?
Like what it means for a user?
What they care more?
How you would define
in a few words,
let's say, the user experience?
And then talk also
about the developer experience.
What's different
and how it differs
compared to the user experience?
Well, I think
probably the most important value that we embrace in the design of the BI tool is simplicity. A lot of BI tools are designed with the data scientist in mind or the sort of sophisticated analyst persona in mind. But if you want everyone to be able to ask questions of data, I think one of your guests, maybe from Alteryx, made this point. Look, every
knowledge worker is an analyst. Every knowledge worker needs to be a data worker. And so at Rill,
we've really focused on simplicity. And some of the UX pieces of that are direct interaction.
If you want to know more about a value in the tool, click on it, and you can filter on it.
If you want to zoom in on a time period, you should be able to drag that sub-range and be able to zoom in easily.
So we really focused on simplicity, where you don't need training to use it.
People shouldn't have to be trained on how to use dashboards.
They're such a part of the fabric of modern work that none of us are being trained on
how to use a lot of the great tools that we use day to day.
And I think dashboards should be no different and data tools should be no different.
The second value in talking about user experience is speed, speed of interaction.
So that term business intelligence, when we think about an intelligent person,
we think about somebody who responds to a question within seconds of us asking it.
Slow, I think, is often synonymous with unintelligent. And so at Rill, we really have focused on making our exploratory data application sub-second. And the experience of sub-second tools, that just resonates with the human
cognitive system. This is how we interact with the physical world in a sub-second way.
And I think we've all gotten, unfortunately, too used to slow data applications. I think that's a consequence of some unfortunate architectures
that have been built. But at Rill, we really want to return speed to the forefront of
working with data. And then the third value that we really embrace at Rill is scale.
And maybe just recognizing that in our experience,
what may start out as a small data set that you can keep simple and keep speedy
often evolves to be quite a large data set.
Most of our customers tend to grow.
Some of our customers are dealing
with trillions of data points.
And so thinking about scalable systems, it does mean you have to make certain decisions.
And one decision we made at Rill for the user experience is we do require a lot of upfront
modeling of data.
We don't let people kind of play fast and loose with their data model.
We don't really embrace a lot of ad hoc or, like, post hoc changes to data. We really focus on,
we want our organizations to invest time building their data models. And then the result of that
is that we can support that third value of scale.
Because if you're going to scale up to billions
or even trillions of events in your data,
you do have to have a pretty well thought out data model
to start with.
So yeah, simplicity, speed, and scale
are the three values that we think are directed towards a better user experience of the Rill product.
And what about the developer experience?
Like, what's the difference there?
Like, with a developer who has to go and manage, let's say, Rill Data or Pinot or any other system, as part of a broader data infrastructure there, right?
Like what are, from your experience,
let's say like the good and the bad things
that are happening out there today?
Yeah, well, I think there's always this sort of yin and yang
in the world of technology or things, you know,
swing from one side of a continuum to the other.
One of those is server versus client. So I think one thing that we've embraced at Rill,
and I think a lot of developers seem to like, is the ability to do development locally versus
development kind of remotely on the cloud.
And I think those of us who kind of do local development, we know why we like that as developers.
It's the speed of interaction, the speed of feedback.
So I would say that's one almost shift, right?
I think we continue to see the value. We have these incredible, you know, most of us have Apple Silicon on our developer machines and an incredible amount of computational power underneath our keyboard. It's a tragedy to not be using that power in our day-to-day experience as developers. So that's one piece.
I would say another that there's been some debate is people often ask,
okay, low-code or no-code interfaces versus codeful interfaces.
I think that at Rill, we've made the decision to be very much a code-first developer tool.
And everything we do, from defining data sources, to designing data models, to configuring the look and feel of our dashboards,
everything is basically defined in SQL and YAML declarative artifacts. And I think that for developers,
I think if you can be thoughtful
about the code that you choose,
we really made sure we leaned into SQL
as a kind of primary language for data modeling.
A lot of other BI tools have kind of proprietary
data modeling languages like DAX for Power BI or
Tableau has its own expression language, LookML for Looker, but everyone knows SQL. So I think
the code first approach, I think does serve developers. I think CLIs can be extremely
powerful, and again, well-crafted CLIs spark joy for developers. And the last thing I would say,
when it comes to whether you kind of embrace a code-first path or, you know, a no-code
or low-code path in the era of AI: I think there's a quote from someone on Twitter that text is the universal interface.
Code is such a powerful interface for the world.
Here we are essentially communicating about lots of things
just using effectively speech.
I think that in a world of AI,
I think the code-first interfaces will dominate because
that's an API. So for Rill, it's not hard to use Copilot and develop on Rill because everything we
do is code-first. It would be very hard to have Copilot interact with a set of UX components
and design dashboards and data models and data
source credentials, if everything were kind of point and click.
So yeah, those are probably two things I think a lot about the code versus no code approach
and the local versus cloud development approach.
That's super interesting, actually.
And a very good point about like what's going on with copilots
and the AI situation right now
and how they work well with code-first interfaces
instead of these drag-and-drop,
which I never thought about.
That's very interesting.
Okay, so let's talk a little bit about the present and the future of BI, and I'll
take you a little bit back in the past. So BI went through, let's say, already some kind of a
cycle where, around 2015, 2016, let's say, we had Looker, we had Sisense, we had Periscope Data, we had Mode Analytics,
we had ChartIO, we had all these different BI tools, and some of them were targeting,
let's say, other personas, and they were trying to differentiate based on that.
But what eventually happened, from what it seems, is the peak of that cycle was the acquisition of Looker by Google, which I think was also the biggest outcome in this space.
And things got a little bit, I'll tell you that, not that exciting anymore.
We've had some mergers there with, like, Sisense and Periscope
Data. I think Mode now got acquired by another company, but...
ThoughtSpot.
Yeah, ThoughtSpot, correct. And it's not very clear like where this cycle ended and if there
is a new cycle of innovation, what is going to happen with BI and
what's happening with BI in general, right? So tell us a little bit about that. What happened
in this previous cycle, and where do you see it going?
Well, I think these consolidations and acquisitions do reflect something similar to the world of databases. People ask themselves, you know, when you see new database companies being started, you know,
in the last few years, like, gosh, do we really need another database?
Right.
And I guess my broad view on cycles of BI would be just that, look, the world of data is so massive and so critical, you know, to the global economy
and to every business that, you know, in the same way that we don't have
just one type of manufacturing company that makes atoms,
there's really not a lot of grand uniformity when it comes to
manufacturing bits, whether it's ETL or databases or exploratory business intelligence tools.
So I think my first comment is that I don't see anytime soon this sort of massive consolidation around one database or one BI tool to rule them
all. I think the world is far too heterogeneous in terms of its problems for that to be the case.
But as far as the kind of current cycle in BI goes, I would say, I think probably,
I would argue that there's maybe three generations of BI we can really point to,
and we're kind of in the third generation here.
I think the first generation was desktop and server-based BI.
So I think of Power BI as an early business intelligence tool. Think back years ago: Oracle had a BI tool that
they shipped. You had SAP. You had a number of, I would call them old-school, 1990s
companies that were shipping desktop BI tools or beefy server BI tools.
Qlik was in that category as well.
And many of them had, as I mentioned before, embedded database engines that they came with.
And that was generation one.
And that worked pretty well for kind of, I think, the nature of enterprise architecture
then.
But then I think the big shift that occurred,
and frankly, Looker, I think, heralded this era,
was the shift to cloud BI.
Looker was one of the first companies
to really embrace that they weren't going
to have an embedded engine.
Looker was going to run on top of other databases.
It was just going to have its semantic layer
talk directly to a cloud data warehouse.
So Looker grew, I think, very quickly because the cloud grew and people realized that was a better,
I think, a better architecture. Ultimately, some of the legacy BI tools did embrace that
architecture. Tableau, you know,
was able to connect to remote data warehouses as well.
But that was sort of the second generation
that I think we saw.
And by the way, I think Mode and ThoughtSpot
probably represent that second generation as well.
You know, Mode primarily talks to, you know,
a remote data warehouse.
ThoughtSpot increasingly is about connecting to remote systems.
I think now we're in a potential third generation of BI.
And what's different now? You know, as I mentioned when we were chatting before the show, I think the next big disruption
in the data stack is going to be the commoditization of the cloud data warehouse as the source
of truth for company data.
I think that more and more companies are embracing object stores like S3 and GCS and Azure object storage.
More and more companies are embracing structured data
on object stores as their core foundational data layer.
And as that happens,
I think we need a new generation of data applications
that can connect directly to the object stores
and not just rely on the data warehouse like Looker did.
And so that, frankly, is where we're certainly making a bet at Rill.
We're making an enormous investment in support for things like Delta Lake and Apache Iceberg,
also the commercial support for it by Tabular.
And I think there's a lot of exciting stuff
to be done with that new architecture.
So as we move basically through these three generations,
we go from kind of server architectures for BI,
and we move to kind of cloud warehouse architectures for BI.
And I think we're now in the era of object store architectures for BI.
And I think there's a lot of innovation that can be done in this kind of new data architecture.
That's super interesting.
So in this new paradigm that we are talking about, how do all these pieces fit together, right?
We have data warehouses, we have data lakes, we have BI tools that
have their own engines, right?
We have systems like the more real-time systems like Pinot or like Druid and ClickHouse.
How do these things fit together?
And do they, let's say, overlap?
Or are there some clear boundaries there where, let's say, a user, like a company,
has to cross in order to start considering
using some other technologies?
Well, I think that, again, I think it's still early days, but my own view on how these pieces
may fit together, some broad thoughts. First of all, I think that all data will ultimately live in the data lake.
It will ultimately live as Parquet, or I know one of your guests was the creator of LanceDB.
All structured data will live in a structured data lake in an object store in the future. I think that will be the governing
lowest common denominator of data across most organizations. And so that means that all
data producing and data consuming systems will go through that foundational object store fabric.
I think Microsoft actually got it right when they talk about their fabric architecture.
It doesn't make a lot of sense, in my opinion; only for rare use cases would
you want to consume directly from Kafka.
I think if you look at, like, even what, you know, the folks at WarpStream Labs are doing,
they're using Kafka backed by an object store; it's serverless Kafka. So I think that, again, all data technologies, data services
will create and write to and read from the object store. So then that does simplify things in a lot
of ways, having that kind of fabric there. Then you just have different requirements for different
styles of data applications you'd want to power off of that data.
For business intelligence applications, it's really important that things are fast.
And so the only way to make sure that things are fast is you need your compute and your
storage to be co-located in some way.
So you have two choices.
You can either move the data to the compute,
or you can move the compute to the data. Both of those I think are acceptable.
I think a tool like DuckDB is very powerful because it allows you to move compute to the data. You can spin up a Lambda job and stick a DuckDB in it, and you can run that compute very close in the correct region
where you have fast access to the object store.
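Here is a hedged sketch of that "move the compute to the data" pattern: a hypothetical AWS Lambda handler that runs DuckDB against Parquet files sitting in an object store. The bucket, prefix, and column names are assumptions for illustration, not anyone's production setup.

```python
import duckdb

def handler(event, context):
    # Hypothetical Lambda handler: run the aggregation next to the data.
    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    # Credentials and region are assumed to come from the Lambda's
    # environment (e.g. its IAM role); a real deployment may also need
    # settings such as SET s3_region.

    rows = con.sql("""
        SELECT publisher, count(*) AS impressions
        FROM read_parquet('s3://example-datalake/events/*.parquet')
        GROUP BY publisher
        ORDER BY impressions DESC
        LIMIT 20
    """).fetchall()

    return {"top_publishers": rows}
```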
In Rill's case, we decide to actually orchestrate data
out of the object store and aggregate it
and move it to our compute nodes.
But I think co-location of data and compute is a key piece. I would say, in general, other workloads don't
need that. You know, for a lot of reporting workloads, one of the challenges we see today
is that people are constantly moving data between data systems. One of the advantages of having
everything in the object store is you don't need to do that migration.
So I think reluctantly Snowflake
and several other tools have embraced the Iceberg format.
I think we'll see that continue to expand in its adoption.
And the idea there is that for asynchronous workloads,
you don't need to move data into Snowflake to query it with Snowflake.
You can query an external table from Snowflake and not have to do, you know, an ETL job, depending on the nature of the workloads. But increasingly, I think we'll see a lot of in situ data applications that operate on the data effectively sitting in the object store.
And that's, I think, a huge efficiency gain for that style of architecture versus a lot of the systems today
where you have a lot of data moving around.
Yeah, that makes total sense.
All right, one last question from me,
and then I'll give the mic back to Eric
because we are close to the end here.
One of the things that has happened in the past two years,
and that is changing this space of data rapidly, I think, is AI, right? And especially,
I think, BI tools have been very eager to embrace that for very obvious reasons. I think,
as you said, text being the API, right? It's a very strong concept there.
But my feeling is that there are probably much more deeper things happening with AI
and how it will change the way we work with data.
So what's your take on that?
How do you see BI being affected by AI?
And what's next there?
Well, I would say maybe three consequences
that I can think of.
I once commented, thinking about AI, that I think we'll know we have AGI once we have solved for data engineers not having to write regular expressions on their own.
So I think one of the first and highest uses of AI
is actually for data wrangling.
We all know that practitioners in data
spend far too much time
writing regular expressions,
parsing data.
And I think that the tremendous benefits
will emerge on that front through things like Copilot. I think we can
dramatically improve and reduce the pain around data munging with AI. Second thing I would say
is that in terms of its impact on the languages that data practitioners use, as I said before,
obviously AI is code-based today, primarily prompt-based.
And I think, well, a lot of people have been trying to create new languages
for data transformation.
And I applaud those efforts.
We always need new languages.
But I think that SQL, you know, it's
still early days of SQL being adopted,
not just for querying data,
but increasingly for transformation of data and ETL
and data modeling.
And so I think that AI is going to further propel SQL
just because it's a lingua franca. There's so much for these large language models to learn
from in terms of the massive corpus of blog posts and stack overflow answers that are using SQL
to manipulate data. So I think AI will actually propel SQL to even
greater dominance as the lingua franca for all data work. And then I would say the third
consequence of AI in the data space is, I think, solving the cold start problem. I think AI is great at sort of generating a scaffold of something that then an analyst can edit versus having to create from whole cloth.
And so in particular, the area that I think AI has great potential for data work is, we've seen this already with OpenAI's analytics module.
You know, a lot of people spend a lot of time pushing pixels when it comes to building data
visualizations to make their data viz pretty. I think that being able to go from a data set
to an informative, useful visualization of that data set or generating, you know, eight or 10 different
possible visualizations of a particular data set. I think AI has great potential to
aid in that somewhat creative task that not all analysts are great at. So those are three areas
I would say. Yeah: helping out with ETL, propelling
the dominance of SQL, and providing a path to beautiful data visualization without a
lot of effort.
All right.
That's awesome.
I have plenty more questions, but I think we have to reserve them like for another episode.
Eric, all yours.
Yeah, well, we're right here at the buzzer.
But Michael, what a fascinating conversation.
And you have such a long and fascinating career in data.
But I have to know, we've talked so much about data on the show.
If you couldn't work in data or technology, what would you do?
If I couldn't work in data or technology,
what would I do?
I would probably be,
I think my secret dream when I was in college
was to be a
sketch writer.
I wrote a stand-up comedy show
when I was in college.
And so I would say if I were not working in data,
I would probably be, yeah,
maybe trying to work in Hollywood
writing bad jokes for late night TV.
I love it.
That's so fun.
Or write for Saturday Night Live.
Yeah.
I don't know if I'm funny enough for that,
but I certainly was, yeah,
it would be a fun job,
even if I may not have been the best at it.
But yeah, that was my alternative dream.
I love it.
Well, Michael, thank you again
for sharing your time with us today.
We learned so much.
And best of luck as you continue working on Rill Data.
Thank you, Eric and Costas.
Thanks for having me.
And I look forward to meeting up in person sometime here in the Bay Area.
Thanks, guys.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app to get notified about new episodes every
week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric at datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com