The Data Stack Show - 83: Closing the Gap Between Business Analytics and Operational Analytics With Max Beauchemin of Preset
Episode Date: April 13, 2022

Highlights from this week's conversation include:
- Max's career journey and role today (2:56)
- Hitting the limits of traditional BI (11:06)
- The most influential technology (14:34)
- Merging with BI and visualization (17:35)
- Two thoughts on real-time (21:02)
- Defining BI (24:53)
- How many have actually achieved self-serve BI (29:54)
- How preset.io fits in the BI architecture of today (32:36)
- How to use preset.io to expose analytics (35:23)
- The analytics process to power something like embedded (42:45)
- Opportunities that exist right now in the BI market (44:53)
- Commoditization in visualizations across business models (47:58)
- What it felt like to create data tooling (51:34)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, one platform for all your customer data pipelines.
Learn more at rudderstack.com.
And don't forget, we're hiring for all sorts of roles.
Welcome to the Data Stack Show. Costas, the guests that we have never cease to amaze me. And we're talking with Max from Preset today.
Not only has he worked at some of the biggest, most successful Silicon Valley companies, we're talking Ubisoft, Facebook, Airbnb, Lyft, but he is also the originator of several major open source projects in the data space: Airflow and Superset, both part of the Apache Foundation. Pretty unbelievable.
And what a privilege to talk to someone like Max. I'm super excited. One of the things that I really want to ask him about is, we'll see if I can sneak this in because maybe it's a little bit of a personal question, but starting two projects like that, I wonder what it feels like to be the inventor of something like Airflow. And I mean, that's just a really cool thing. And I think I tend to make that like a really grandiose thing in my mind, and maybe it was.
But also, I think a lot of times inventors are just trying to solve a problem that really
interests them.
So that's what I'm going to ask.
How about you?
Yeah, absolutely.
And so I think it would also be awesome to ask him, what was the process?
Like, how do you do that stuff, right? Like, how do you end up building something that has this kind of adoption by, like, the
industry or the community or whatever out there, right?
So, yeah, I'd love to hear that from him.
I think it's going to be, like, super interesting.
But I also want to ask him about BI.
I mean, he's also a founder of a company that's, like, in the BI space.
Right now, the BI space is something that, I don't know, lately we don't talk that much about it.
But, you know, it's one of the most fundamental parts of the data stack.
So it would be awesome to hear from him, like what's the current state of the industry, what happened these past few years and what's next.
All right, let's do it.
Max, welcome to the Data Stack Show.
We can't wait to chat with you.
Hey, happy to be here.
It's an honor to be on the show.
I just realized how many episodes there are on this show.
Yeah, yeah, it's fun.
It's been super fun.
Okay, I don't know where to begin
because your resume is just incredible.
But I would love to hear about how you got started in the world of tech.
And then tell us how that transitioned specifically into working on data stuff.
So I started my career in early 2000.
So right after the dot-com bust is when I started.
I started early.
So I did not finish my program in college.
I didn't really go to college.
I was lucky to strike an internship at a company called Ubisoft.
That's well-known now, at least, a big video game company.
And I joined, I did a little bit of web development early on during my internship.
And then I had an opportunity to work on, you know, their first data team.
And then, of course, like the data landscape was very different back then.
But what timeframe was that?
Like 2000, 2001.
Oh, wow.
Yeah.
Data.
So that's very different data landscape.
Yeah.
The tech stack.
Well, so what's interesting, one topic we could talk about today could be, you know, how we're kind of reinventing the same stuff over and over with different parameters.
But at the time we used SQL Server, so the Microsoft SQL Server stack. There was SQL Server Analysis Services, SQL Server, the server itself, there was Reporting Services, I think that came a little bit after, Office Web Components, and I think Integration Services was the other one.
So that was the stack that we had selected at the time.
And then I was lucky enough to be part of the team that created the first data
warehouse, the first kind of business intelligence team there.
So before my time there, there were very few databases, just Excel files.
Right.
And then I worked on financial reporting,
supply chain, like kind of your,
so retail type stuff,
and less like your game analytics
or like your kind of modern analytics.
So that was really like counting dollars
and units sold at the time,
if you had like inventories, all that reporting.
So very specialized team.
So I worked there for quite a while,
worked at three different offices. I worked in Montreal, Paris, and San Francisco. So I
traveled the world over this first decade of my career. And then soon after, I joined Yahoo, which was the birthplace of Hadoop at the time. Kind of interesting times, right? So the Hadoop team would meet in the very office where I was at. So I had the opportunity to meet some of the early Hadoop folks. I read some of the early Pig scripts. If people are familiar with the Pig language, it's a little bit of a funky dataflow language, not really SQL, but SQL-like in some ways. So I worked on some of that early stuff. And then I think the part that's really interesting for me is when I joined
Facebook. It seemed like they were really kind of on the other side of what I would call data modernity. We were in this completely different phase at Facebook at the time.
So it was 2012, where everything was getting rebuilt from the ground up on top of Hadoop and other things, right? There was this internal, Cambrian explosion of data tools, this hackathon culture of: build it if it doesn't exist.
So they had rebuilt internally a lot of the things that existed in the market
at the time, like from scratch, but also were building things that had never been built before, things that now exist in some of the spaces where we're building stuff. I'll try to describe a little bit more what I mean by that, but essentially people internally at Facebook had rebuilt, you know, a dashboarding tool, a data exploration tool, an in-memory real-time database,
something that was a big inspiration for Airflow that's called DataSwarm.
That was an internal tool.
There was like multiple experiments in the DAG kind of data orchestration space too.
And there was like all of these like kind of mutant little data tools too that, you
know, some like data quality stuff, some data dictionary
stuff, data graph, metadata browser things.
And so early, early and not so early versions of some of these things that we
see really kind of emerge today on the market.
So it was a really inspiring time.
You know, for me, I was going from being kind of a BI engineer, a data warehouse architect, to being a software engineer building tools to enable more people to play with data. Data had been very democratized at Facebook.
People were like building all sorts of cool stuff.
And it was super fun to be there at that time.
Really inspiring too.
Absolutely.
Okay.
So what do you do today? I mean, you actually went to another couple
like really amazing companies, but what do you do today?
Yeah. I mean, I can keep going. I think it makes sense to do the transition. So right after, I joined Airbnb, and that's where I was missing a lot of the data tools that we had internally at Facebook. And I kind of brought there, along with others, this mentality of: let's
build some stuff, let's solve these problems in a new way, and if I'm not
going to have something like Data Swarm here, I'm going to build something
that's going to solve my problems and my team's problem as a data engineer.
And that's what became Airflow.
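The orchestration idea that became Airflow can be illustrated with a toy sketch: tasks plus dependencies form a DAG, and the scheduler runs each task only after its upstreams complete. This is not Airflow's actual API, just the underlying concept in plain Python, with invented task names:

```python
# Toy illustration of the core idea behind a workflow orchestrator like
# Airflow: tasks plus dependencies form a DAG, and each task runs only
# after everything it depends on has completed. (Assumes the graph is
# acyclic; real Airflow also handles scheduling, retries, backfills, ...)

def run_dag(tasks, deps):
    """Run `tasks` (name -> callable) respecting `deps` (name -> upstream names)."""
    done, order = set(), []
    def run(name):
        if name in done:
            return
        for upstream in deps.get(name, []):   # run dependencies first
            run(upstream)
        tasks[name]()
        done.add(name)
        order.append(name)
    for name in tasks:
        run(name)
    return order

log = []
tasks = {
    "extract": lambda: log.append("pulled raw events"),
    "transform": lambda: log.append("built fact table"),
    "load_report": lambda: log.append("refreshed dashboard dataset"),
}
deps = {"transform": ["extract"], "load_report": ["transform"]}

print(run_dag(tasks, deps))  # ['extract', 'transform', 'load_report']
```

However the tasks are declared, the execution order falls out of the dependency graph, which is what lets a data engineer describe pipelines declaratively.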
So I was like, I want to get involved in open source. I was always an admirer of people who had built open source projects, looking at the Linux kernel and other things, just being inspired by that. I was like, oh, maybe I can give it a shot. So I thought the timing was good to build something like Airflow.
And then just decided to go with it.
So I started building it actually
between the two jobs, before joining. I was so excited. I'm like, I'm going to make it open source. I'm going to start working on it.
Oh, wow. So, okay. So the initial birth was between jobs?
It was in between jobs, but I knew I was joining Airbnb.
We had talked about it, I had spent a lot of time with the data team there, talking to people, and it was clear that they needed something like that and that I would be enabled to build it. So I was like, I'm just going to get started in that month in between. I think I missed a vesting cycle there, like a three-month vesting cycle, by a few days doing that. Not a great thing. 2014 Airbnb was a really good time,
a pretty decent time to join.
But as a result, though, I got to put the project, like Airflow,
I think it had a different name.
Originally it was called Flux.
And I got to put it in open source under my GitHub.
And as I joined, I'm like, well, this thing is already open source. Let's keep working on it.
So I started working on a data mart.
You know, my primary function was to do the data engineering for core, what we called customer experience, CX, internally.
And then I was building Airflow at the time, working with a small team of data engineers, building stuff. And we were kind of building Airflow at the same time as we were solving these data engineering challenges for them.
And then, you know, after I went like for a brief time at Lyft, so I spent a year there.
Well, I also, while I was at Airbnb, I started Apache Superset.
It's also very well known.
So Superset is very much in the data visualization, exploration, dashboarding space.
And the general idea there is like we were using,
we were investing heavily in Presto and Apache Druid,
or I think it was pre-Apache.
So Druid, the in-memory real-time database.
Sure.
And none of the tools on the market, you know, Looker and Tableau and the tools that existed at the time, worked, or worked well, with the databases we were investing in.
Let me ask you a specific question there, because I think it's fun for our audience. You know, they cross a sort of wide spectrum.
And I think some of them probably hear that and they say, like, yes, I know, like, I get that, that I had similar pain.
But for a lot of people, it's like, man, you know, Looker and Tableau are so powerful.
Like, how could you ever, you know, sort of reach the limit of those?
What did that look like and feel like inside the company in terms of hitting the limits of traditional BI?
I know you mentioned that it was like sort of database integration stuff, but like, could you explain that dynamic a little bit?
Well, so one dynamic was, you know, we had a large Presto cluster
and had hired people from the Presto team
or, you know, people who had used Presto in the past
and were investing heavily in this thing.
And that's our ad hoc layer.
And then, I think we had Tableau at the time, and we had to load stuff into extracts, which is, you know, like a subpar database, if you ask me, at least at the time.
And there was this thing called the live mode, which would defer, kind of run the heavy lifting on the database itself.
But that didn't work very well for a variety of reasons I could expand into, but I won't.
And then Druid at the time didn't speak SQL.
It had this funky kind of dimensional query interface
and it just wouldn't talk to any BI tools at all.
And there was no front end for it.
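For context, the "funky dimensional query interface" refers to Druid's native JSON queries: before Druid spoke SQL, clients POSTed a structured query object (topN, groupBy, timeseries) to the broker. A rough sketch of what one looked like; the datasource and field names here are invented for illustration:

```python
import json

# Sketch of a Druid *native* query from the pre-SQL days: instead of a
# SELECT statement, the client POSTs a JSON object describing the query
# shape (topN, groupBy, timeseries, ...) to the Druid broker.
# The datasource and field names below are invented for illustration.
top_markets = {
    "queryType": "topN",                      # dimensional query type, not SQL
    "dataSource": "booking_events",           # hypothetical ingested datasource
    "dimension": "market",                    # group by this dimension
    "metric": "bookings",                     # rank results by this aggregate
    "threshold": 10,                          # keep the top 10 values
    "granularity": "all",
    "intervals": ["2015-01-01/2015-02-01"],   # ISO-8601 time interval
    "aggregations": [
        {"type": "longSum", "name": "bookings", "fieldName": "booking_count"}
    ],
}

# A client would POST this JSON to the broker endpoint, e.g. /druid/v2
payload = json.dumps(top_markets)
print(payload[:60])
```

Because queries were expressed as structured objects rather than SQL, generic BI tools had no way to "speak" to Druid, which is exactly the gap Superset was built to fill.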
So the very premise for Apache Superset was: I'm going to build something quick. It was a three-day hackathon project to allow for exploration of Druid datasets.
So Druid in-memory database, real-time, super fast, super fun database, heavily indexed, right?
So just like blazing, blazing fast in real-time.
And we had some real-time use cases, you know, internally at Airbnb.
Sure.
Tableau just wouldn't, you know, there's no socket to connect anything.
So I'm like, oh, I'm going to build this thing.
And then quickly, you know, I could go deeper in that story.
But Superset then, you know, over a weekend or, you know, at some point I was like, I've got to make that work with Presto too.
This is fun.
It's a cool tool.
You know, you can explore the data.
You can save charts.
You can make dashboards.
Let's make it work with Presto too.
And, you know, that became much more ambitious over time because of internal adoption at Airbnb. People liked having a tool that had very fast time to chart, very fast time to dashboard, provided that the datasets have been created. So if the premise is you have a dataset that has all the metrics and dimensions you need to create a dashboard, then exploring and visualizing it, the visual grammar, you know, comes quickly.
Sure.
And then be able to do very sophisticated things.
I call it the Photoshop of data visualization, right?
You can do powerful things.
And then maybe Looker was all about, you know, the semantic layer, being able to replicate your business logic in this semantic layer.
So for us, it's just: visualize a dataset quickly. So that took off and worked really well in the team at Airbnb, and elsewhere, and everywhere really.
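The premise described above, that a dataset already containing the right metrics and dimensions makes charting fast, can be sketched in a few lines: a chart is essentially "pick a dimension, pick a metric, aggregate". The rows and field names below are invented for illustration:

```python
from collections import defaultdict

# Sketch of the "fast time to chart" premise: if a dataset already
# exposes the dimensions and metrics you need, a chart reduces to
# "pick a dimension, pick a metric, aggregate".
# Rows and field names are invented for illustration.
rows = [
    {"market": "SF", "bookings": 120},
    {"market": "NYC", "bookings": 200},
    {"market": "SF", "bookings": 80},
]

def chart_data(rows, dimension, metric):
    """Aggregate a metric by a dimension, the core operation behind a chart."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[dimension]] += row[metric]
    return dict(totals)

print(chart_data(rows, "market", "bookings"))  # {'SF': 200, 'NYC': 200}
```

Everything upstream of this step, building the dataset with the right metrics and dimensions, is where the real modeling work lives; the visualization layer just makes this last hop fast.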
Amazing.
Yeah, that's quite the story.
So Max, I mean, you've been around like for so long
and you have seen like so many different things
like happening.
So before we get deeper into what you're doing today and understand what's Preset and Superset,
is there a technology that you have seen all these years that really surprised you in terms of how it changed, let's say, the landscape?
Or to put it in a different way, from all that stuff that you have seen, from Hadoop to whatever we have today, right, what do you think is the most influential technology that has shaped today's landscape?
Yeah.
And, you know, so first, it's like, yeah, I've got the gray beard to show for all the years of data, and all the hair lost from scratching my head for decades. But I would say a lot of what we're doing now is reinventing some of the things that existed, you know, a decade or two ago, based on new premises.
And then there's these cycles in software where there's some big shifts, right?
The move to the cloud and the move to distributed systems and containerization, right?
So once in a while you take a new set of new premises
and you have to rebuild everything
on these premises.
And then the pyramid of needs,
like maybe is flipped a little sideways
or upside down, right?
I think one thing that's new in the last five to ten years, that did not really exist, or exist well, before, is streaming, streaming use cases.
I think it's been cool to see, you know, different solutions emerge around data streaming: kind of streaming queries, streaming computation frameworks, things like Flink, you know, or Spark Streaming. And, you know, on top of that, there are new semantics and new things around streaming that are interesting. There are people trying to bridge the two worlds and having these common languages to express both batch and streaming using those same frameworks.
I would say that's been interesting to see, like, brand new technology emerge there. There's a question of, what does that mean in terms of visualization too, for these use cases?
And, you know, tons of thoughts there too. I think it also does relate to the chasm in analytics between operational data and business data, where operational data is much more kind of timely in a lot of ways. So there are some things there that are really interesting to see too. We've gotten really good at operational streaming analytics. You look at things like, you know, Datadog and Elastic, and there's some cool stuff there too.
That's actually pretty interesting, because you mentioned streaming, and I never thought about the other section of streaming visualization. So I have to ask more now.
Yeah, I've seen all these, you know, very interesting platforms. We have Kafka, we have Flink, we have Spark Streaming. But how do they fit with BI and visualization? Is it needed, first of all?
I mean, I start thinking right now about, what was it called, the Lambda architecture? That was a thing a couple of years ago, right?
Where you had the streaming layer that was more about notifications and like something just broke and you have to go and fix it, you know, but that kind of stuff.
And then you have batch, obviously, where you have reporting, and reporting is usually the most common use case for visualization.
But how do you see these two things like merging and how do you see like visualization playing a role like with streaming?
Yeah, it's interesting.
Like one thought, one immediate thought is like when you really think about the data that needs to be really fresh.
Well, so first I would say there's latency and there's freshness. Talking about latency: if we describe it as how long it takes from the moment you run a query or ask a question to the moment you get the answer, I think that's infinitely important, right? Like, to be able to kind of dance with the data and slice and dice, and to be able to ask the next question as you go. I think that's transformative. An example that I
keep giving there is if Google takes a fraction of milliseconds to give you a result, if it took
five seconds, think about the implication. If Google took five seconds, 15 seconds, 30 seconds,
a minute, 10 minutes. Say it took 10 minutes to resolve a Google search: it would still be a wonder of the universe in terms of what it would allow people to do, but how you interact with it, how you engage with it while waiting for it to complete, is completely different. So there, what I'm trying to point to is that latency is
super important. Like now talking about freshness, it's like, if you really need to know what
happened someplace in the past 30 seconds, and you're looking at a chart, refreshing and waiting for something to appear, you should be a bot. You're not doing the right thing with your time. No one should be looking at a dashboard non-stop, you know, waiting for an event to happen, like: now I'm going to refund this person, click, you know, done, approve, yes, I've done my job. So I think operational stuff,
then in some cases is really great
for automation and bots
and things like taking action.
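The "you should be a bot" point can be made concrete: rather than a human refreshing a dashboard, a small script polls the metric and takes the action automatically. A toy sketch, where the metric source and threshold are stand-ins for whatever serves your operational data:

```python
# Toy sketch of the "you should be a bot" point: instead of a human
# refreshing a dashboard waiting for an event, a script polls a metric
# and takes the action automatically. The metric source and threshold
# are stand-ins (in practice: Druid, Datadog, Elastic, ...).

def check_and_act(error_rate, threshold=0.05, alert=print):
    """Fire an alert (or any automated action) when the metric crosses a threshold."""
    if error_rate > threshold:
        alert(f"500-error rate at {error_rate:.1%}, paging on-call")
        return True   # action taken
    return False      # nothing to do, and no human needed either way

# In production this would run on a schedule (cron, an Airflow sensor, ...):
for rate in [0.01, 0.02, 0.09]:
    check_and_act(rate)
```

The human only gets involved once the bot pages them, which is the troubleshooting scenario Max describes next.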
It's also good for troubleshooting, right?
So there's an alert or something
that happens in Datadog,
your number of 500 errors,
like, you know, peaks.
And then you're like,
what the heck is going on?
You need to know, live and now, what's happening in the past five minutes. So I think that stuff is super operational and very different from, like, is my business doing well? Like, you
know, if you're thinking about like, is my product doing well? There's some other use cases I think
interesting around streaming. It's like when you launch something, you want to make sure it's good,
right? Like you launch a new product change or you release something, you might want to get a
little closer to real time to make sure your launch is doing well.
But other than that, like for me, the bulk of the time, you know, I'm thinking like 90
days analytics, a lot of the high value questions are, you know, not things that happened in
the past five minutes.
Yeah, yeah.
When you were talking about sitting on top of a dashboard and reloading all the time, trying to see that something is happening, I cannot help myself but think of a first-time founder that just created the dashboard for signups or something like that, being like, okay, where are the signups?
Where are the signups?
And you know, like you just need to go out there and create them and
not look at the dashboard.
That's what's going on with the signups.
So yeah, many times we get really, you know, into this: let's make everything more real time and faster and more fresh or whatever,
but real time is a very relative term, right?
Like not all use cases have the same definition of real time out there.
So it makes a lot of sense.
So, okay.
I had one or two more thoughts to unpack on real time. So one way I've been thinking about it, too, because I've had a lot of conversations with, I call them streamers, people who are, you know, kind of streaming first, and people are arguing like, oh, we should just get rid of those mountains of SQL and all the batch stuff and just rewrite everything in real time. That's going to be easy, right?
Let's just do it. And I'm like, yeah, wait a minute. But like, when you think about like,
what about freshness? like, what do you need
like visibility into for things that happen in the past, like, you know, minute or two or five
minutes or 10 minutes? What are really the metrics and dimensions and the level of accuracy that you
need for these things? So when you really start looking into the use cases, and we've done that internally at Lyft and Airbnb, and Lyft was much more of a real-time business.
When you ask people about why do you need freshness?
Why do you need to know about what happened the past minute?
Then you realize the requirements are not as complex: a handful of dimensions and metrics, and maybe you don't need to know the exact number of bookings; just clicks on the booking button is enough, right? So knowing this, I tend to say you don't even need, in a lot of cases, the Lambda architecture. It just seems like: you have real-time requirements, you solve those with specialized tools; then you have your business analytics, and you solve that with the right set of tools. And then, you know, if they diverge and the numbers are not exactly the same, you explain the difference, because we used different tools, different definitions for things, and you move on with your life, instead of trying to bring two worlds together that are very far apart in reality.
I think the other thing is, I think people lump
a lot of stuff into the real-time versus batch debate. And there's, I think, a pretty clear
separation between the analytics component and then the customer experience side of things.
So there are certain things that need to be delivered in an application in real-time because
a user is performing some action and there needs to be
some sort of response, right? I mean, it's non-trivial to build that stuff, but even then,
not everything needs to be real time, which is interesting. And I'm just thinking about some of
the companies that, you know, like e-commerce companies who are running thousands of tests,
you know, for them, real time is like 15 minutes, right? I mean, a testing team can't really process results faster than that. Even at 15 minutes, it's pretty unbelievable. So definitely, real time is such a relative term.
Right. And you have to really identify where there's value in it and then
what you're going to do to kind of support the use cases and what it's worth to you.
And maybe though the tooling will converge, right? Like we've seen some of that convergence a little bit
where the chasm between like business analytics
and operational analytics is not as wide as it used to be.
With the rise of tools that are,
like these next generation databases,
they can serve, you know, on both sides of the fence.
And you see things like, you know, Superset becoming more and more used for operational analytics, and things like Grafana more used for business. So I think the chasm is getting thinner over time. That's a good thing. And we used to have these very specialized databases; the time series databases for real-time use cases were very different from the ones that really support OLAP use cases. And now that's kind of converging with things like, you know, Druid, ClickHouse, Pinot, right? Like these new next-generation databases.
Yep.
Yep.
It's very interesting actually, what you said about how things overlap, with Grafana, for example, and Superset.
So I have a question.
Let's go, let's talk a little bit about BI, okay, and visualization.
And let's give some definitions, right?
So what's BI?
What's BI?
Business intelligence, right?
It sounds so...
Intelligence as...
It's an aging term.
You know, I think it was around when I started my career.
So when, you know, no gray hair, more hair up here, you know, that term was already well
established.
So that's like 20 something years ago.
I think the word intelligence comes from, you know, the way that the government thinks about intelligence. It's like intel, you know, data, insight, that kind of stuff. And then as applied to business.
So I guess it means it's the set of tools and best practices around analyzing and organizing and serving data.
A huge trend, I would say, maybe, I'm not sure exactly, depending on where you look in the world, when that trend was most active, but data democratization is the trend that has maybe caught on on top of business intelligence. So the general idea of: let's give access to more people to more data. It's usually business intelligence coupled with this idea of data warehousing, which is this practice of hoarding data, right? Like, I'm going to take all the data that has anything to do with my business, that lives everywhere in the world, and I kind of hoard it and bring it into this warehouse that becomes a little bit the library for the data in the organization.
And typically you have a BI tool, a business intelligence tool, that sits on top of the data warehouse.
And then people can self-serve in this,
I would call it like general purpose,
but specialist tool, right?
So it's a tool that, you know, is made to, it's generic in the sense that you can use
a BI tool to query any type of data, you know, healthcare, business products, whatever it
might be.
And it's generally geared towards, you know, specialists, people who are trained. And we used to have much more specialists, people that are business intelligence professionals.
And I think that's changing.
We're seeing the rise of data literacy now.
More people are more sophisticated with data and use data every day.
So that means these tools are kind of changing.
And, you know, maybe talking about some of the trends, I think BI was originally a little bit like a restaurant, to use an imperfect analogy, where you'd come and get a menu, and you could kind of order your report, your chart, and get it served to you. And then over time, it maybe changed to become a little bit more like a buffet, right?
Like people can come in and self-serve and then, you know,
have access to a wider variety of things and can assemble a meal for themselves.
But that's the general idea, maybe, like trying to describe BI.
I don't think I'm doing a great job at it.
Yeah, it's too much content to unpack here.
So it's more than visualization, right?
It's not just visualization, but visualization is an important part.
Yeah.
And I call it the database user interface, right?
So it's like, somewhere you have all of your data, and somewhere you have people with, you know, their visual cortex and their brain, and somehow you need to get that data into people's heads so that it becomes intelligence. So, yes, I think if you describe what these tools do: they expose your datasets in a way that hopefully people can self-serve to explore and visualize their data.
There's usually a dashboarding component where you're able to gather an interactive set of
visualization with some guardrails so people can understand their data and interact with
it in a safe or in somewhat intuitive way.
One thing that's interesting about BI is it's like any kind of data for any type of persona, you know, with any kind of background.
So it becomes like this tool
that's, you know, not very specialized.
We don't have really clear personas
or like a lot of standards.
Very general purpose in that sense.
Yeah.
Max, I have a question.
How many companies, so
the goal of having self-serve
BI is so appealing
and I think it's something that many companies
are working towards.
I mean,
I know you can't actually estimate
this, but what percentage of companies do you think actually achieve that? I mean, you've built these platforms inside of really large companies, but even though we have all the tools to do this, it's still pretty hard to actually achieve that inside of a company where you have a wide variety of stakeholders who can access datasets that contain the information that is key to the business, that combine data from other functions.
Like, you know, it still seems like a pretty big challenge for most companies.
It's huge, right?
When you think about, you know, this, we call it like the data maturity curve, you know,
and how, you know, different companies or individuals can distribute on that curve.
I think, and then, you know, I have this view of the world where I work in Silicon Valley, at very data-forward companies. So the answer is, I probably don't know.
But what I want to point out is, you know, the analytics process is extremely involved. And maybe I'll try to describe what I mean by the analytics process: it's the process by which you instrument, store, organize, and transform your data so that it can be explored, visualized, consumed, and acted upon. If BI is that last layer of consume, visualize, and act upon, so much stuff needs to happen first for that to be even possible. Now we're talking about data engineering, and having data analysts, data engineers, having, you know, systems in place that actually store the data and make it available. There's just so much that needs to happen for that to be possible that I would say,
you know, if we were to visualize companies on this data maturity
life cycle, we would have a huge amount of companies that are very
young at it, to use, you know, not a generous
but a respectful term. A lot of companies just suck at it,
really bad. I think in general, in the past decade and in the next, there's a migration
where everyone's going to become much better with data. It's just a matter of survival at this
point. And, you know, one thing is people have been thinking like, oh, but data should be easy.
One day someone's going to fix it all and figure it all out. We're going to solve data
engineering, we're going to solve BI, and it will all be done. And what we're realizing now is the
problems are at least as complex and intricate, and require specialists, the same
way that software engineering does. We've accepted that software engineering is complicated and expensive.
There's a bunch of specialists, there's a bunch of sub-disciplines.
I think we're realizing that data is just as important.
We're very far from, you can have a team of five to ten people that are going to do data for your large company.
That's just, no, that's not going to cut it. Yeah.
Max, can you give us like a description of how the BI market looks today and where Preset fits in that?
Yeah.
I mean, the market is gigantic and extremely validated, you know,
and there are very big incumbents.
I'm thinking first-wave BI. And when I say that, you know, I'm thinking
Business Objects, MicroStrategy, Cognos, things that are very much like dinosaurs.
You don't hear about them as much unless you work, I don't know, at a company that made
decisions about their technology and their data stack a while ago. I think there's one transformation that has not yet
happened, that we're going to see happen, that we're really interested in at Preset.
I was talking about data democratization before,
which was like bringing more people to the special place where you do data,
right? Data democratization was: give more access to more data to more people. And I think
the real question ahead of us is, instead of how do we bring
people to the data buffet, it's more how do we bring food everywhere in the world? How do we do
kind of Uber Eats? How do we bring the right
meal to everyone where they sit, on top of the buffet?
The buffet is great.
I think the buffet is okay.
It totally works for a bunch of use cases.
But I think what we're going to see is analytics transcend the BI tool,
the special-purpose tools like the BI tools, and come out and be part of
everyday experiences, right? So that means in every app that you use on your phone and every
SaaS tool that you buy, there's going to be interactive analytics in context, where it's most
useful. Too, on top of the buffet, there's also people walking around with hors d'oeuvres everywhere: would you like, you know, a little side of analytics with whatever
you're doing right now? And I think we're thinking very actively about this at Preset.
Let's go beyond, you know, embedded analytics, which I could get into, which is how you
bring a dashboard or charts into other contexts. The problem we're after is how do we enable the next
generation of application builders to easily bring interactive analytics
into the things they're building today?
And I think that's a really interesting question.
It's still very, very hard, if you're building a product today,
if you're building experiences, to bring interactive analytics into these experiences. So we want to make it a lot easier for people to do that.
Okay. So let's say I'm building a product and I also need to expose some analytics,
right, to my customers. How can I use Preset to do that?
Yeah. So that's a complex and intricate question
that has multiple components.
I think like what's clear is that there should be multiple ways to do this.
And there's a bunch of trade-offs as to how you want to do this.
The most obvious one is what I was just referring to as embedded analytics.
So that's, you know, you build a dashboard in a no-code tool, right?
By the way, BI is the original no-code tool.
When you think about it, you're able to do these very, very complex things by dragging and dropping things on the screen.
But you go and you build a dashboard, you style it, you parameterize it, and you embed this dashboard inside your application, right? Maybe you have an analytics portion of your SaaS product that shows a dashboard.
And with embedded analytics with Preset, you're able to apply row-level security to
say, oh, I know it's this customer;
therefore, the dashboard will apply these filters so that they can see exactly the things
that pertain to them.
So there's isolation.
So there's embedded analytics.
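That row-level security pattern, where every query a tenant's dashboard runs gets filtered by the logged-in customer's ID, can be sketched in a few lines. This is a toy illustration, not Preset's actual mechanism; the function name and table are made up for the example:

```python
import sqlite3

def scope_to_customer(sql: str, customer_id: int) -> str:
    """Wrap a dashboard query so it only sees one tenant's rows.

    A real BI tool rewrites queries far more carefully; wrapping the
    statement in a filtered subquery shows the core idea.
    """
    return f"SELECT * FROM ({sql}) WHERE customer_id = {int(customer_id)}"

# Shared multi-tenant table holding two customers' data.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)",
               [(1, 10.0), (1, 20.0), (2, 99.0)])

# The embedded dashboard issues a generic query; row-level security
# scopes it to the authenticated customer before it runs.
scoped = scope_to_customer("SELECT * FROM orders", customer_id=1)
rows = db.execute(scoped).fetchall()
print(rows)  # customer 1's rows only: [(1, 10.0), (1, 20.0)]
```

In practice the BI layer would bind `customer_id` from the authenticated session rather than take it as an argument, which is what gives the isolation Max describes.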
There's another idea that we cater to at Preset that I call white-label BI.
The idea there,
and it's also not necessarily a new idea, is being able to have these prepackaged BI environments: essentially
a Superset instance that's preloaded with the datasets, dashboards, charts, and queries that are
relevant to that one customer. So you can say, for each one of my customers, I'm going to create
a Superset sandbox with all the data assets that they need. And then they can come and self-serve, and if you
authorize them, they can write SQL against the
data that you exposed to them.
So that's the white-label use case.
And then another use case is more the component library.
So we're a little bit in the infancy of this, but I think we're
excited to expose the building blocks of preset
and superset as component libraries for people to remix into the experiences they
want to create.
So there you can picture, you have a React library where you bring in some
charts, you bring in some controls, maybe a selection picker, a filtering control, a date range picker. And as the application
developer, or the engineer building the product, you create the exact experience that
you want with these rich components, which enable you to have cross-filtering and drill-down,
these rich experiences that would be really prohibitive to build from scratch
but are really easy to build
if you have the right framework.
Yeah, makes sense.
And is this, as a problem,
related only to the BI side,
like something like Preset?
Or do we still need to do work with our data warehouses,
or the storage layer, the query
layer out there, to enable these use cases?
Or can I just push all the data into Snowflake and
then rely on Preset to do that?
Like, what's your experience?
Yeah.
So clearly this is part of what I talked about earlier, the analytics
process. There's no BI happening,
no visualization happening, unless you've gotten all these things right.
In this specific case, you've created the data sets that you need, with the
dimensions and the metrics that you want to expose, in order to build a dashboard.
So, going back to how BI is evolving:
I think BI was very monolithic, in the sense that you would have the tool that includes
the data munging, the data transformation, the semantic layer,
which is a super loaded term.
We can decide whether we want to unpack
"semantic layer" or not,
but you had all of these things
as part of a platform
like MicroStrategy or Cognos or Business Objects.
They would tend to be very monolithic.
And I think what we're seeing now is, say, for me, the semantic layer belongs
in the transform layer, and that's, you know, the dbt and Airflow space.
So Superset and Preset don't really actively
solve that problem.
We just say other people solve it much better than we could,
and we just want to team up and integrate very well with them.
Not sure if I'm answering the question.
I know I'm digressing quite a bit too,
but we're exploring a pretty complex space.
Yeah, no, you do.
I mean, okay.
I also made my question a little bit more specific
to the storage layer, but as you said, it's
not just one thing that has to be in place in order for this to work.
There are so many different things that need to happen.
And you used to have the monolith of Cognos, but we don't anymore.
I mean, we still have monoliths, but things are starting to break into smaller monoliths at least.
And I think probably we see that with BI.
Right?
Like, I think it's one of the things that we see happening out there.
It probably was one of the first markets to see that.
But I think we also see that even with data warehousing, right?
Like the whole idea of having the data lake,
where you have the storage on S3,
and then you have a different query engine on top of that.
And now you have a table engine on top.
I mean, everything lands in the lake.
Yeah, warehouse, lakehouse,
special-purpose databases, right?
Like, do you use the same database for real time or not?
Is it going to be, you know,
BigQuery with BI Engine and their real-time option, which
becomes a monolith that serves it all?
Or for real time, are you going to use ClickHouse, Pinot, Druid, some of the databases that are very
specialized, and use something else for your warehouse?
So definitely, I think in the database space, with all of the money we've seen flowing into the Snowflake space, now people are looking
to say, oh, maybe there are some layers that we can delaminate out of
that and build some big businesses and some tools in some of these
areas. To me, I see exploration and visualization as something that needs to
explode, right. On the database side, I'm not as confident, I'm not as sure about it. I
think, you know, as a customer of a cloud data warehouse, I would like for the same database
to do it all, to be like a BigQuery, or: hey, Snowflake, this is a table that's
high availability and pinned to memory
because I need it to be fast, but I want to stay in the same warehouse. I don't want to go and
purchase a bunch of tools. Yeah, right. People who are not in BI feel the same way about
BI, you know. It's like, I want to find one thing that does it all, but then you realize
that thing that does it all doesn't do anything right, you know? Yeah. Talking
a little bit about the question of what the analytics process is to power something like embedded,
or what I referred to as white-label BI: for embedded, it's pretty simple.
The BI tool, Preset or Superset, can apply row-level security on the fly.
So all that's required is to say, this user is this customer ID; just apply a customer
ID filter on all the queries that you run.
It's a little bit more complicated than that, but it's row-level security.
I think for white-label, it's a little bit more intricate, where you might want to create
a data mart for each customer.
And really, what I mean by that, it can just be a view layer, right?
Say you have five tables you want to expose to each one of your customers.
What you can do is create these schemas that have views that filter on a specific customer ID,
and then you give them access with a service account
to that specific schema that is limited to their data. But hopefully you have a universal schema that's the same for all your customers, and it's all
refreshed, perhaps atomically, right?
Like, you refresh the whole thing every night or every hour, but you have
these little islands, or windows into the warehouse, that are filtered and isolated just for them.
And then you put a BI tool on top of that schema, and you provide canned reporting, canned dashboards, and they can knock themselves out and go and push that further.
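The per-customer view layer described here can be sketched concretely. In a real warehouse, each customer would get a schema reachable only through their own service account; this toy example uses SQLite, so prefixed view names stand in for schemas, and all names are illustrative:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE revenue (customer_id INTEGER, month TEXT, amount REAL)")
db.executemany("INSERT INTO revenue VALUES (?, ?, ?)",
               [(1, "2022-01", 100.0), (1, "2022-02", 120.0), (2, "2022-01", 250.0)])

# One filtered view per customer: the "little island" into the warehouse.
# In a real warehouse each view would live in that customer's schema,
# granted only to their service account.
for cid in (1, 2):
    db.execute(
        f"CREATE VIEW customer_{cid}_revenue AS "
        f"SELECT month, amount FROM revenue WHERE customer_id = {cid}"
    )

# Refreshing the one universal table refreshes every customer's view at once.
rows = db.execute("SELECT * FROM customer_2_revenue").fetchall()
print(rows)  # only customer 2's data, without the customer_id column
```

Because every view reads from the same universal table, one atomic refresh of that table updates every customer's island, which is the property Max points out.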
Yeah, that's a super interesting thing.
One last question for me, and then I'll give the microphone to Eric
to ask his question.
So, can you share some
opportunities that you think
exist right now in the BI market,
like things that you would like to see happen
or expect to see happening?
Something to help our listeners, I don't know, maybe go
and build a product out there, who knows?
Yeah.
I mean, I really liked the idea of like bringing analytics everywhere.
And the premise is like, people are more data literate than they used to be, right?
Not only do people expect to find a dashboard
in every SaaS application,
but they expect that,
and it becomes a requirement.
I think also, in everything that people do:
if you post a blog post on Medium,
or even me participating in this podcast,
I would expect to see a dashboard
on how this podcast is doing, in real time as we release it, and be able to see
who's listening to the podcast and what the demographics are. I think we're
starting to really expect more and more analytics everywhere, and people are trained, and they want
interactive analytics, not just static. And it's going to be real hard to go and build that, you know, with the building blocks that exist today.
The building blocks being charting libraries and, you know, data warehouse drivers.
So I think there's a real opportunity in thinking about how we enable people to bring analytics into all the experiences, you know, every day.
I think that's interesting.
So how is BI going to come out of its shell? Or is analytics going to come out of the shell and then, you know, outshine and kind of radiate out and be everywhere?
So we're really interested in actually doing this, you know, at Preset.
And then I think other trends, maybe beyond business intelligence:
a big topic for me has been thinking about how we take the learnings from some of the DevOps
practices and the DevOps movement, and transform, reapply, and reinvent
that for data people. And then the big thing that's
really interesting too is, how is the modern data team evolving?
What do they do?
What are the roles, you know?
And then it becomes, how do we enable others to become better with
data?
So you become this vector to enable everyone to kind of self-serve.
Super fascinating.
Well, well, two more questions for me,
and one of them may take us
down a little bit of a rabbit hole,
but hopefully Brooks doesn't get too upset with us
for going a little bit over.
Just thinking about what you were saying
on the sort of the frontiers in BI,
what do you think is going to be commoditized first?
Or what do you see being commoditized?
And the context behind that question is,
you make such a good point in that
the amount of work that needs to be done
in order, like on the backend,
in order to enable self-serve BI is immense.
But also there are patterns around that.
So if you think about visualization,
a lot of businesses can conform to a particular data model for the business,
whatever, like a direct-to-consumer mobile app, and everyone has their different KPIs.
But do you think that there will be a lot of commoditization in the actual visualizations
as the data layer becomes
more established and defined
sort of across business models?
I think so.
I mean, for me,
I think we're trying to accelerate
that in some way
so we can really innovate too, right?
So as you commoditize things,
there's opportunity to go
further.
Open source is a tidal wave
of commoditization. It's free,
remixable.
So I think clearly
we're doing that in the BI space
with Apache Superset first.
And we also have
freemium: Preset has a freemium
offering on top of Apache Superset.
So today you can go and sign up
and have the open source
project run for you, for up to five people, for free.
And I think the price point, too, is very competitive.
So I think we're trying to accelerate that.
Our mission is to make every team a data team, to enable everyone to have the best tools to visualize and collaborate with data.
So I think that's clearly happening.
But beyond that, I think there's, you know, as we commoditize the consumption visualization
layer, there's opportunities to go and innovate.
For us, a theme is like enabling people to bring analytics everywhere.
That's one theme.
One thing I've discovered too is like in the data world, as you walk to the horizon, then, you know, the horizon gets further.
Like the universe, right?
Ever expanding.
That's it.
There's no, and you're like, oh, you know, I climbed on the top of that tree and I saw how far the horizon is.
And by the time I get there, I will be at the edge of the world.
And then you walk there, the horizon kind of moves with you.
Yeah, yeah, yeah.
Climbing a tree and you see
you're not there quite yet.
So, along those lines,
one thing I would really like to see
is all the data
in the world, or in your company,
being instantly queryable, right?
It'd be great if everything was in memory and you
could ask any question and have all the answers. But the moment we create an amazing next
generation of in-memory databases, then people are like, well, I'm going to log more, you know?
Then you start hoarding. The thing is, the more you hoard stuff, the more you never quite get to fully solve any of the problems.
And maybe that's the beauty of it.
Like, how are we going to solve software engineering?
It won't be solved.
It will just continue to morph and grow and evolve.
Yeah, no, I think that's a great perspective.
Okay, last question,
because we really are close to time here,
but I'm interested to know,
so you've been really,
I mean, you've started some projects
that have become, you know, data tooling
that are, you know,
at least in whatever subset of the world,
you know, that a lot of data engineers operate in.
I sort of just go to tooling,
which is pretty amazing.
And I'm interested to know, what did it feel like when you were creating those? Did it feel like you
were just sort of solving a problem that was right in front of you? Or, you know, it's kind of,
I guess I'm asking you this almost as like an inventor, right? Like, did you feel like you
were inventing something or was it just a problem in front of you that happened to like, you know, solve like a pretty critical, like pervasive, you know, pain point?
Yeah, I don't know.
I think the history of innovation is made of people kind of remixing things.
So if you look at any of the great inventions in the history of humanity (well, not any;
there are some really important exceptions to the rule),
you look at what Isaac Newton had.
Sure.
At the time,
what was he reading?
I think Newton is actually a bad example
because he actually
pushed the edge of innovation
quite a bit.
Yeah, like inventing calculus
and all that crazy stuff.
Yeah, so I think that's a bad example.
I think Tesla is also, like Nikola Tesla is also a bad example of that.
But in a lot of places that you look, you can ask: who were their contemporaries, and what were they reading at the time?
What were they thinking about?
Who were they talking to, exchanging correspondence with through letters at the time?
Sure. The collective imagination was often already on the cusp of what they
discovered.
So I think it's very much the case that Airflow was largely inspired by two or three products
internally at Facebook.
There used to be some other things; those were the things that emerged internally at Facebook from a bunch of experiments, right?
So maybe that was a fair motivation.
Those were the things that people internally decided to put their pipelines and DAGs into.
And then I took some of that stuff, remixed it with some of the things that I learned in Informatica and using other tools, and kind of remixed that into something that I thought was going to be immediately useful at Airbnb for my team.
And I knew that people coming out of Facebook were going to look for things like that.
So there's this idea: people talk about product-market fit in business, but there's project-community fit too, in open source.
So there's a timing thing too. If you were to build Airflow five years before, then probably, well, it's too early.
It's not the right thing.
It doesn't fit people's mental model.
It doesn't resonate just yet.
So there's always this timing aspect where you've got to
be early, but not too early.
So there's always some luck, some context,
some hard work too, and then, you know, community building too, which
is a whole different topic: how to get people interested and involved and excited, and get people to contribute.
So I think that's something that I figured out how to do in an okay way for both Airflow and Superset.
Well, thank you for sharing that.
That's just so fun to be able to talk to people who have, you know, sort of conceived of and built the tools that we use.
But as you point out, you know, we sort of stand on the shoulders
of lots of people who have done lots of cool things.
So Max, this has been an incredible conversation.
The episode flew by, but thank you for joining us.
We'd love to have you back on to dig into,
we could go for hours here,
but thank you so much for giving us some of your time.
Yeah, it was super fun.
I think we did scratch the surface.
We got a little deeper
in some. Yeah, I think that's good and fun, but there's still so much more to talk about. So happy
to come back on the show anytime. One thing that really struck me was, this may sound like a funny
conclusion, but Max has a very open mind about a lot of things, even BI, and he's building a company
in the BI space. I mean, he certainly has strong opinions, but I really loved his analogy of, you know, you sort of think you
get to the horizon, you climb a tree to look over the horizon, you realize like the horizon just
keeps moving. And I think that's really clear in the way that he approaches problems. You know,
he's trying to look for that frontier and he keeps a very open mind. And I just appreciated that a ton.
How about you?
Yeah, absolutely.
I think the way you put it makes me think that's probably a trait that an inventor needs to have in order to be an inventor.
Right.
Being in an environment that changes so rapidly, like what he described. Think about the early days at Facebook, right?
How the engineering was there, all the things, one project after the
other, building new technologies and building everything from scratch even if it
already existed.
So yeah, I think it makes total sense, and it's an amazing trait for both an inventor
and, I would say, also
for a founder.
So I'm really, really,
really looking forward
to seeing what's next
with Preset.
Me too.
All right.
Well, thanks for joining us
on the Data Stack Show
and we will catch you
on the next one.
We hope you enjoyed
this episode
of the Data Stack Show.
Be sure to subscribe
on your favorite podcast app
to get notified about new episodes every week. We'd also love your feedback. You can email me,
Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.