The Data Stack Show - 83: Closing the Gap Between Business Analytics and Operational Analytics With Max Beauchemin of Preset

Episode Date: April 13, 2022

Highlights from this week’s conversation include:
- Max’s career journey and role today (2:56)
- Hitting the limits of traditional BI (11:06)
- The most influential technology (14:34)
- Merging with BI and visualization (17:35)
- Two thoughts on real-time (21:02)
- Defining BI (24:53)
- How many have actually achieved self-serve BI (29:54)
- How preset.io fits in the BI architecture of today (32:36)
- How to use preset.io to expose analytics (35:23)
- The analytics process to power something like embedded (42:45)
- Opportunities that exist right now in the BI market (44:53)
- Commoditization in visualizations across business models (47:58)
- What it felt like to create data tooling (51:34)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Discussion (0)
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, one platform for all your customer data pipelines. Learn more at rudderstack.com. And don't forget, we're hiring for all sorts of roles. Welcome to the Data Stack Show. Costas, the guests that we have never cease to amaze me. And we're talking with Max from Preset today.
Starting point is 00:00:38 Not only has he worked at some of the biggest, most successful Silicon Valley companies, we're talking Ubisoft, Facebook, Airbnb, Lyft, but he also is the originator of several major open source projects in the data space, Airflow and Superset, which are part of the Apache Foundation. Pretty unbelievable. And what a privilege to talk to someone like Max. I'm super excited. One of the things that I really want to ask him about is, we'll see if I can sneak this in because maybe it's a little bit of a personal question, but starting two projects like that, I wonder what it feels like to be the inventor of something like Airflow. And I mean, that's just a really cool thing. And I think I tend to make that like a really grandiose thing in my mind, and maybe it was. But also, I think a lot of times inventors are just trying to solve a problem that really interests them. So that's what I'm going to ask. How about you?
Starting point is 00:01:36 Yeah, absolutely. And so I think it would also be awesome to ask him, what was the process? Like, how do you do that stuff, right? Like, how do you end up building something that has this kind of adoption by, like, the industry or the community or whatever out there, right? So, yeah, I'd love to hear that from him. I think it's going to be, like, super interesting. But I also want to ask him about BI. I mean, he's also a founder of a company that's, like, in the BI space.
Starting point is 00:02:04 Right now, the BI space is something that, I don't know, lately we don't talk that much about it. But, you know, it's one of the most fundamental parts of the data stack. So it would be awesome to hear from him, like what's the current state of the industry, what happened these past few years and what's next. All right, let's do it. Max, welcome to the Data Stack Show. We can't wait to chat with you. Hey, happy to be here. It's an honor to be on the show.
Starting point is 00:02:32 I just realized how many episodes there are on this show. Yeah, yeah, it's fun. It's been super fun. Okay, I don't know where to begin because your resume is just incredible. But I would love to hear about how you got started in the world of tech. And then tell us how that transitioned specifically into working on data stuff. So I started my career early 2000.
Starting point is 00:02:59 So right after the dot-com bust is when I started. I started early. So I did not finish my program in college. I didn't really go to college. I was lucky to strike an internship at a company called Ubisoft. That's well-known now, at least, a big video game company. And I joined, I did a little bit of web development early on during my internship. And then I had an opportunity to work on, you know, their first data team.
Starting point is 00:03:25 And then, of course, like the data landscape was very different back then. But what timeframe was that? Like 2001, 2002. Oh, wow. Yeah. Data. So that's a very different data landscape. Yeah.
Starting point is 00:03:36 The tech stack. Well, so what's interesting, one topic we can talk about for today could be, you know, how we're kind of reinventing the same stuff over and over with different parameters. But at the time we used SQL Server, so the Microsoft SQL Server stack. So there's SQL Server Analysis Services, SQL Server, the server itself. There was Reporting Services. I think that came a little bit after Office Web Components, and I think
Starting point is 00:04:05 Integration Services, the other one. So that was the stack that we had selected at the time. And then I was lucky enough to be part of the team that created the first data warehouse, the first kind of business intelligence team there. So before my time there, there were very few databases, just Excel files. Right. And then I worked on financial reporting, supply chain, like kind of your,
Starting point is 00:04:29 so retail type stuff, and less like your game analytics or like your kind of modern analytics. So that was really like counting dollars and units sold at the time, if you had like inventories, all that reporting. So very specialized team. So I worked there for quite a while,
Starting point is 00:04:45 worked at three different offices. I worked in Montreal, Paris, and San Francisco. So I traveled the world over this first decade of my career. And then soon after, I joined Yahoo, which was the birthplace of Hadoop at the time. It's kind of interesting times, right? So the Hadoop team would meet in the very office where I was at. So I had the opportunity to meet some of the early Hadoop folks. I read some of the early Pig scripts. If people are familiar with the language, the Pig language, it's a little bit of a funky, not-really-SQL kind of dataset language that's SQL-like in some ways. So I worked on some of that early stuff. And then I think the part that's really interesting for me is when I joined Facebook. It seemed like they were really kind of on the other side of what I would call data modernity,
Starting point is 00:05:35 like we were in this completely different phase at Facebook at the time. So it was 2012, where everything was getting rebuilt from the ground up on top of Hadoop and other things, right? There was this internal, kind of Cambrian explosion of data tools. So this hackathon culture of, build it if it doesn't exist. So they had rebuilt internally a lot of the things that existed in the market at the time, like from scratch, but also were building things that had never been built before, you know, in some of the spaces where we're building stuff now. I'll try to describe a little
Starting point is 00:06:14 bit more like what I mean by that, but essentially like people internally at Facebook had rebuilt, you know, dashboarding tool, data exploration tool, an in-memory real-time database, something that was a big inspiration for Airflow that's called DataSwarm. That was an internal tool. There was like multiple experiments in the DAG kind of data orchestration space too. And there was like all of these like kind of mutant little data tools too that, you know, some like data quality stuff, some data dictionary stuff, data graph, metadata browser things.
Starting point is 00:06:48 And so early, early and not so early versions of some of these things that we see really kind of emerge today on the market. So it was a really inspiring time. You know, for me, I was going from being kind of a BI engineer, data, data warehouse architect, to being like a software engineer and like building tools to enable more people to play with data. So very like data had been democratized at Facebook. People were like building all sorts of cool stuff.
Starting point is 00:07:17 And it was super fun to be there at that time. Really inspiring too. Absolutely. Okay. So what do you do today? I mean, you actually went to another couple of really amazing companies, but what do you do today? Yeah. I mean, I can keep going. I think it makes sense to do the transition. So right after, I joined Airbnb, and that's where I was missing a lot of the data tools that we had internally at Facebook
Starting point is 00:07:42 and I kind of brought there, along with others, this mentality of like, let's build some stuff, let's solve these problems in a new way, and if I'm not going to have something like Data Swarm here, I'm going to build something that's going to solve my problems and my team's problems as a data engineer. And that's what became Airflow. So I was like, I want to get involved in open source. I was always, you know, an admirer of people who had built open source projects, looking at the Linux kernel and other things, just being inspired by that.
Starting point is 00:08:14 I was like, oh, maybe I can try and get a shot. So I thought the timing was good to build something like Airflow. And then just decided to go with it. So I started building it actually between the two jobs before joining it. I was so excited. I'm like, I'm going to make it open source. I'm going to start working. Oh, wow. So, okay. So it wasn't like the initial birth was between jobs. It was in between jobs, but I knew I was joining Airbnb. They needed, we had talked about, spent a lot of time with the data team there,
Starting point is 00:08:47 talking to people, and it was clear that they needed something like that and that I would be enabled to build it. So I was like, I'm just going to get started in that month in between. I think I missed a vesting cycle there, like a three-month vesting cycle, by a few days doing that. Not a great thing. 2014 Airbnb was a really good time, a pretty decent time to join. But as a result, though, I got to put the project, like Airflow, I think it had a different name. Originally it was called Flux.
Starting point is 00:09:17 And I got to put it in open source under my GitHub. And as I joined, I'm like, well, this thing is already open source. I kept working on it. So I started working on a data mart. You know, my primary function was like to do the data engineering for core, what we call customer experience, CX, internally. And then I was building Airflow at the time and working with a small team of data engineers to build stuff. And we were kind of building Airflow at the same time as we were solving these data engineering challenges for them. And then, you know, after, I went for a brief time to Lyft, so I spent a year there.
Starting point is 00:09:52 Well, I also, while I was at Airbnb, I started Apache Superset. It's also very well known. So Superset is very much in the data visualization, exploration, dashboarding space. And the general idea there is, we were investing heavily in Presto and Apache Druid, or I think it was pre-Apache at the time. So Druid, the in-memory real-time database. Sure.
Starting point is 00:10:32 And the tools on the market, you know, Looker and, you know, Tableau and the tools that existed at the time, didn't work or didn't work well with the databases we were investing in. Let me ask you a specific question there, because I think it's fun for our audience. You know, they sort of cross a wide spectrum. And I think some of them probably hear that and they say, like, yes, I know, I get that, I had similar pain. But for a lot of people, it's like, man, you know, Looker and Tableau are so powerful. Like, how could you ever, you know, sort of reach the limit of those? What did that look like and feel like inside the company in terms of hitting the limits of traditional BI? I know you mentioned that it was like sort of database integration stuff, but like, could you explain that dynamic a little bit?
Starting point is 00:11:09 and had hired people from the Presto team or, you know, people who had used Presto in the past, and were investing heavily in this thing. And that's our ad hoc layer. And then, I think we had Tableau at the time, and we had to load stuff into extracts, which is, you know, like a subpar database, if you ask me, at least at the time. And there was this thing called the live mode, which would defer, kind of run, the heavy lifting on the database itself. But that didn't work very well for a variety of reasons I could expand on, but I won't.
Starting point is 00:11:40 And then Druid at the time didn't speak SQL. It had this funky kind of dimensional query interface and it just wouldn't talk to any BI tools at all. And there was no front end for it. So the very premise for Apache Superset was, I'm going to build something quick, it was a three-day hackathon project, to allow for exploration of Druid datasets.
Starting point is 00:12:02 So Druid, in-memory database, real-time, super fast, super fun database, heavily indexed, right? So just like blazing, blazing fast in real-time. And we had some real-time use cases, you know, internally at Airbnb. Sure. Tableau just wouldn't, you know, there was no socket to connect anything. So I'm like, oh, I'm going to build this thing. And then quickly, you know, I could go deeper in that story, but with Superset then, you know, over a weekend or, you know, at some point I was like, I got to make that work with Presto too.
Starting point is 00:12:34 This is fun. It's a cool tool. You know, you can explore the data. You can save charts. You can make dashboards. Let's make it work with Presto too. And, you know, and then that became much more ambitious over time because of internal adoption at Airbnb. People liked having a tool that was just like
Starting point is 00:12:53 very fast time to chart, very fast time to dashboard, provided that the datasets have been created. So if the premise is you have a data set that has all the metrics and dimensions you need to create a dashboard, then exploring, visualizing with the visual grammar, you know. Sure. And then being able to do very sophisticated things. I call it the Photoshop of data visualization, right? You can do powerful things. Whereas maybe Looker was all about, you know, the semantic layer, being able to replicate your business logic in this semantic layer.
Starting point is 00:13:41 So for us, it's just like, visualize the data set quickly. So that took off and worked really well in the team at Airbnb and elsewhere, and everywhere really. Amazing. Yeah, that's quite the story. So Max, I mean, you've been around for so long and you have seen so many different things
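As an aside for readers: the "funky kind of dimensional query interface" Max describes is Druid's pre-SQL native query language. Here is a rough sketch of what that looked like, with a hypothetical datasource and column names (not from the episode), next to the SQL shape that BI tools of the era expected a database to speak:

```python
import json

# Sketch of a Druid "native" query from the pre-SQL era: a JSON object
# POSTed to the broker instead of a SELECT statement. The datasource
# ("bookings") and its columns here are hypothetical.
native_query = {
    "queryType": "topN",
    "dataSource": "bookings",
    "dimension": "market",
    "metric": "events",
    "threshold": 10,
    "granularity": "all",
    "intervals": ["2015-01-01/2015-02-01"],
    "aggregations": [
        {"type": "longSum", "name": "events", "fieldName": "events"}
    ],
}

# Roughly the same question expressed as SQL, which tools like Looker
# and Tableau assumed they could send over a standard connector:
sql_equivalent = """
SELECT market, SUM(events) AS events
FROM bookings
GROUP BY market
ORDER BY events DESC
LIMIT 10
"""

print(json.dumps(native_query, indent=2))
```

Because no BI tool spoke this JSON dialect, Druid effectively had no front end, which is the gap the three-day hackathon version of Superset filled.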
Starting point is 00:13:59 like happening. So before we get deeper into what you're doing today and understand what's Preset and Superset, is there a technology that you have seen all these years that really surprised you in terms of how it changed, let's say, the landscape? Or to put it in a different way, from all that stuff that you have seen, from Hadoop to whatever we have today, right? What do you think is the most influential technology that has shaped what we have today? Yeah. And, you know, so first it's like, yeah, the gray beard to show for, you know, all the years of data, and all the, you know, hair lost from scratching my head for decades. But I would say a lot of what we're doing now is reinventing some of the things that existed, you know, a decade or two ago, based on new premises.
Starting point is 00:14:52 And then there's these cycles in software where there's some big shifts, right? The move to the cloud and the move to, like, distributed systems first and containerization, right? So once in a while you take a new set of premises and you have to rebuild everything on these premises. And then the pyramid of needs, like, maybe is flipped a little sideways
Starting point is 00:15:14 or upside down, right? I think one thing that's new in the last like five to 10 years, that did not really exist or exist well before, is streaming, streaming use cases. I think it's been cool to see, you know, different solutions emerge around data streaming, kind of streaming queries, streaming computation frameworks, things like Flink, you know, or
Starting point is 00:15:38 Spark Streaming. And, you know, on top of that, there's new semantics and new things around streaming that are interesting. There's people trying to bridge the two worlds and having these common languages to express both batch and streaming using those same frameworks. I would say that's been interesting, to see brand new technology emerge there. There's a question like, what does that mean in terms of visualization
Starting point is 00:16:04 for immediate use cases? And, you know, tons of thoughts there too. I think it also does relate to the chasm in analytics between operational data and business data, where operational data
Starting point is 00:16:18 is much more kind of timely in a lot of ways. So there are some things there that are really interesting to see too, like we've gotten really good
Starting point is 00:16:27 at operational streaming analytics too. You look at things like, you know, Datadog and Elastic and there's some cool stuff there too. That's actually
Starting point is 00:16:35 pretty interesting because you mentioned streaming and I never thought about the intersection of streaming and visualization. So I have to ask more now.
Starting point is 00:16:46 Yeah, I've seen all these, you know, very interesting platforms, like we have Kafka, we have Flink, we have like Spark Streaming. But how do they fit with BI and visualization? Like, is it needed, first of all? I mean, I start thinking right now about what it was called, the Lambda architecture, that was like a thing a couple of years ago, right?
Starting point is 00:17:10 Where you had the streaming layer that was more about notifications, like something just broke and you have to go and fix it, that kind of stuff. And then you have batch, obviously, where you have reporting, and reporting is usually the most common use case for visualization. But how do you see these two things merging, and how do you see visualization playing a role with streaming? Yeah, it's interesting. One immediate thought is, when you really think about the data that needs to be really fresh. Well, so first I would say there's latency and freshness. Talking about latency, if we describe it as how long it takes from the moment you run a query or ask a question to when you get the answer, I think that's infinitely
Starting point is 00:17:54 important, right? Like to be able to kind of dance with the data, and slice and dice, and be able to ask the next question as you go. I think that's transformative. An example that I keep giving there is, Google takes a fraction of a second to give you a result. If it took five seconds, think about the implication. If it took five seconds, 15 seconds, 30 seconds, a minute, 10 minutes to resolve a Google search, it would still be a wonder of the universe in terms of what it would allow people to do, but how you interact with it, how you engage with it while waiting for it to complete, is completely different. So there, what I'm trying to point to is latency is super important. Now talking about freshness, it's like, if you really need to know what
Starting point is 00:18:41 happened someplace in the past 30 seconds, and you're looking at a chart and refreshing and waiting for something to appear, like, you should be a bot. You're not doing the right thing with your time. No one should be looking at a dashboard non-stop, you know, waiting for an event to happen, like, now I'm gonna refund this person, click, you know, done, approve, yes, I've done my job. So I think operational stuff, then, in some cases is really great for automation and bots and things like taking action.
Starting point is 00:19:13 It's also good for troubleshooting, right? So there's an alert or something that happens in Datadog, your number of 500 errors, like, you know, peaks. And then you're like, what the heck is going on? You need to know live and now, like, what's happening in the past five minutes. So I think that
Starting point is 00:19:29 stuff is like super operational and very different from, is my business doing well? Like, you know, if you're thinking about, is my product doing well? There's some other use cases I think are interesting around streaming. It's like when you launch something, you want to make sure it's good, right? Like you launch a new product change or you release something, you might want to get a little closer to real time to make sure your launch is doing well. But other than that, for me, the bulk of the time, you know, I'm thinking like 90 days analytics, a lot of the high value questions are, you know, not things that happened in the past five minutes.
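Max's "you should be a bot" point can be sketched as a tiny watcher: instead of a human refreshing a chart, a script polls the metric and takes the action. In this sketch, `fetch_error_count` and `notify` are hypothetical stand-ins for a real metrics API and alerting channel:

```python
# Sketch of the "be a bot" idea: poll an operational metric and act on a
# threshold instead of watching a dashboard. The metric source and the
# notification channel are passed in as plain callables, so this stays
# independent of any particular monitoring product.
def check_once(fetch_error_count, notify, threshold=100):
    """Poll the 500-error count once; alert if it breaches the threshold."""
    count = fetch_error_count()
    if count > threshold:
        notify(f"500 errors spiked: {count} in the last 5 minutes")
        return True
    return False
```

In practice you would run something like this on a schedule (cron, Airflow, or the alerting built into a tool like Datadog itself) rather than keeping a human in the refresh loop.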
Starting point is 00:20:04 Yeah, yeah. When you were saying about sitting on top of a dashboard and reloading all the time, trying to see that something is happening, I cannot help myself but think of a first-time founder that just created the dashboard for signups or something like that, and being like, okay, where are the signups? Where are the signups? And, you know, you just need to go out there and create them and not look at the dashboard. That's what's going on with the signups.
Starting point is 00:20:35 So yeah, many times we get really, you know, into this, like, let's make everything more real time and faster and fresher or whatever, but real time is a very relative term, right? Not all use cases have the same definition of real time out there. So it makes a lot of sense. So, okay. I had one or two more thoughts to unpack on real time. So one way I've been thinking about it too,
Starting point is 00:21:04 because I've had a lot of conversations with, I call them streamers, people who are, you know, kind of streaming-first, and people are arguing like, oh, we should just get rid of those mountains of SQL and all the batch stuff and just rewrite everything in real time. That's going to be easy, right? Let's just do it. And I'm like, yeah, wait a minute. But when you think about it, what about freshness? Like, what do you need visibility into for things that happened in the past, you know, minute or two or five minutes or 10 minutes? What are really the metrics and dimensions and the level of accuracy that you need for these things? So when you really start looking into the use cases, and we've done that
Starting point is 00:21:43 internally at Lyft and Airbnb, and Lyft was much more of a real-time business. When you ask people, why do you need freshness? Why do you need to know about what happened in the past minute? Then you realize the requirements are not as complex. A handful of dimensions and metrics, and maybe you don't need to know how many exact bookings, and just clicks on the booking button, you know, is enough, right? So knowing this, I tend to say you don't even need, in a lot of cases, the Lambda architecture. It's just like, you have real-time requirements, you solve those with specialized tools. Then you have your business analytics and you solve that with the right set of tools. And then, you know, if they diverge and the numbers
Starting point is 00:22:30 are not exactly the same, you explain the difference, because we used different tools, different definitions for things, and you move on with your life instead of trying to bring two worlds together that are very far apart in reality. I think the other thing is, I think people lump a lot of stuff into the real-time versus batch debate. And there's, I think, a pretty clear separation between the analytics component and then the customer experience side of things. So there are certain things that need to be delivered in an application in real time, because a user is performing some action and there needs to be some sort of response, right? I mean, it's non-trivial to build that stuff, but even then,
Starting point is 00:23:09 not everything needs to be real time, which is interesting. And I'm just thinking about some of the companies that, you know, like e-commerce companies who are running thousands of tests, you know, for them, real time is like 15 minutes, right? I mean, a testing team can't really process results faster than that. Even at 15 minutes, it's pretty unbelievable. So definitely real time is such a relative term. Right. And you have to really identify where there's value in it and then what you're going to do to kind of support the use cases and what it's worth to you. And maybe though the tooling will converge, right? Like we've seen some of that convergence a little bit where the chasm between like business analytics
Starting point is 00:23:48 and operational analytics is not as wide as it used to be. With the rise of tools like these next generation databases, they can serve, you know, on both sides of the fence. And you see things like, you know, Superset becoming more and more used for operational analytics, and things like Grafana more used for business. So I think the
Starting point is 00:24:10 chasm is getting thinner over time. That's a good thing. And we used to have these very specialized databases, where the time series databases for real-time use cases were very different from the ones that really support OLAP use cases. And now that's kind of converging with things like, you know, Druid, ClickHouse, Pinot, right? These new next generation databases. Yep. Yep. It's very interesting actually, what you said about how things overlap,
Starting point is 00:24:38 like with Grafana, for example, and Superset. So I have a question. Let's talk a little bit about BI, okay, and visualization. And let's give some definitions, right? So what's BI? What's BI? Business intelligence, right? It sounds so...
Starting point is 00:24:56 Intelligence as... It's an aging term. You know, I think it was around when I started my career. So when, you know, no gray hair, more hair up here, you know, that term was already well established. So that's like 20-something years ago. I think the word intelligence comes from, you know, the way that the government thinks about intelligence, it's like intel, you know, data, insight, that
Starting point is 00:25:21 kind of stuff. And then as applied to business. So I guess it means it's the set of tools and best practices around analyzing and organizing and serving data. A huge trend, I would say, I'm not sure exactly, depending on where you look in the world, when that trend was most active, but data democratization is the trend that maybe caught on on top of business intelligence. So the general idea of, let's give access to more people to more data. It's usually business intelligence coupled with this idea,
Starting point is 00:25:58 like data warehousing, which is this practice of hoarding data, right? Of like, I'm going to take all the data that has anything to do with my business that lives everywhere in the world, and I kind of hoard it and bring it into this warehouse that becomes a little bit the library for the data in the organization. And typically you have
Starting point is 00:26:16 a BI tool, a business intelligence tool, that sits on top of the data warehouse. And then people can self-serve in this, I would call it general purpose but specialist, tool, right? So it's a tool that, you know, is generic in the sense that you can use a BI tool to query any type of data, you know, healthcare, business, products, whatever it might be.
Starting point is 00:26:40 And it's generally geared towards, you know, specialists, people who are trained. And we used to have much more specialists, people that are business intelligence professionals. And I think that's changing. We're seeing the rise of data literacy now. More people are more sophisticated with data and use data every day. So that means these tools are kind of changing. And, you know, maybe talking about some of the trends, I think BI was originally a little bit like a restaurant, to do an imperfect analogy, where you'd come and you'd get a menu, you could kind of order your report, your chart,
Starting point is 00:27:22 get it served to you. And then over time, it maybe changed to become a little bit more like a buffet, right? Like people can come in and self-serve and then, you know, have access to a wider variety of things and can assemble a meal for themselves. But that's the general idea, maybe, trying to describe BI. I don't think I'm doing a great job at it. Yeah, it's too much content to unpack here. So it's more than visualization, right?
Starting point is 00:27:48 It's not just visualization, but visualization is an important part. Yeah. And I call it the database user interface, right? So somewhere you have all of your data, and somewhere you have people with, you know, their visual cortex and their brain, and somehow you need to get that data into people's heads so that it becomes intelligence. So, yes, I think, you know, if you describe what these tools do, they expose your data sets in a way that hopefully people can self-serve to explore and visualize their data. There's usually a dashboarding component where you're able to gather an interactive set of
Starting point is 00:28:33 visualizations with some guardrails, so people can understand their data and interact with it in a safe or somewhat intuitive way. One thing that's interesting about BI is it's like any kind of data for any type of persona, you know, with any kind of background. So it becomes this tool that's, you know, not very specialized.
Starting point is 00:28:56 We don't have really clear personas or like a lot of standards. Very general purpose in that sense. Yeah. Max, I have a question. How many companies, so the goal of having self-serve BI is so appealing
Starting point is 00:29:12 and I think it's something that many companies are working towards. I mean, I know you can't actually estimate this, but what percentage of companies do you think actually achieve that? I mean, you've built sort of these platforms inside of really large companies, but even though we have all the tools to do this, it's still pretty hard to actually achieve that inside of a company where
Starting point is 00:29:36 you have a wide variety of stakeholders who can access sets that contain the information that is key to the business, combines data from other functions. Like, you know, it still seems like a pretty big challenge for most companies. It's huge, right? When you think about, you know, this, we call it like the data maturity curve, you know, and how, you know, different companies or individuals can distribute on that curve. I think, and then, you know, I have this view of the world
Starting point is 00:30:07 where I work in Silicon Valley, I'd very like data forward companies. So the answer is like, I probably don't know. But what I want to point out is, you know, the analytics process is extremely involved and maybe try to describe what I mean by the analytics process. But if the analytics process,
Starting point is 00:30:23 the process by which you, you know, you instrument, store, organize, transform your data so that it can be, you know, explored, visualized and consumed and acted upon. Like for, you know, if BI is that last layer of like consume, visualize and acted upon, so much stuff needs to happen first for that to be even possible now we're talking about you know data engineering and and you know and like and like and like having a data analyst data like data engineers having
Starting point is 00:30:59 you know systems in place that actually store the data and make it available. There's just like so much that needs to happen for that to be possible that I would say, you know, the world, I think like if we were to visualize companies on this like data maturity life cycle, like we would have like a huge amount of companies that are very like, very young at that to use like a you know not a generous but like a respectful term like they're just companies will just like suck at that you know really bad i think in general i think like in the past decade and in the next there's a migration of like everyone's going to become much better with it's just a matter of like survival at this point and you know and you know one thing is like people have been thinking like, oh, but data should be easy.
Starting point is 00:31:47 One day someone's going to fix it all and figure it all out. And we're going to solve data engineering. We're going to solve BI. It will be all done. And what we're realizing now is the problems, at least as complex and intricate and require specialists, the same way that software engineering has, we've accepted that software engineering is complicated, it's expensive. There's a bunch of specialists, there's a bunch of sub-disciplines. I think we're realizing that data is just as important. We're very far from like, you can have a team of five to ten people that are going to do data for your large, less-employed company. That's just, no, that's not going to cut it. Yeah.
Starting point is 00:32:31 Max, can you give us like a description of how the BI market looks today and where it fits in that? Yeah. I mean, the market is a gigantic market that's like extremely validated, you know, still has a foot, I would say there's like very big incumbents and they're like, I'm thinking like first wave BI. And when I say that, you know, I'm thinking like business objects, micro strategy, Cognos, things that are like very much like dinosaurs. You don't hear about them as much unless you work, I don't know, at a company maybe that made decision about their technology and their data stack, you know while ago i think like there's you know one thing that i think is a transformation that has not yet
Starting point is 00:33:11 happened that we're gonna see happen that you know we're really interested in you know at preset is i was talking about data democratization before where it's like bringing more people to the special place where you do data right so data democratization's like bringing more people to the special place where you do data right so data democratization was like give more access to more data to more people and i think like the real the question ahead of us is how do we bring instead of like how do how do we bring people to the data buffet is more how do we bring food everywhere in the world like how do we do kind of uber eats or how do we bring the right, you know, the right meal to everyone where they sit at on top of the buffet?
Starting point is 00:33:50 The buffet is great. I think the buffet is okay. It totally works for a bunch of use cases. But I think what we're going to see is, you know, analytics transcend the BI tool and the special purpose, the special tools like the bi tools and come out and be part of everyday experiences right so that means like in every app that you use on your phone and every sas tool that you buy there's going to be you know interactive analytics in context where it's most useful two on top of on top we're still the buffet there's also like people walking with or drug or dev everywhere would you like a you know a little bit of a side of analytics with whatever
Starting point is 00:34:29 you're doing right now and i think like we're thinking very actively about this at preset let's do like beyond you know there's embedded analytics i could get into which is like how you bring a dashboard and or you know charts and other contexts but the problem we're after is how do we enable the next, the generation of application builders to easily bring interactive analytics into things they're building today? And I think that's a really interesting question. I think it's still very, very hard if you're building a product today, you're building experiences to bring interactive analytics as far as these experiences. So we want to make it a lot easier for people to do that.
Starting point is 00:35:09 Okay. So how, let's say I'm building a product and I also need to expose some analytics right to my customers. How can I use Precert to do that? Yeah. So that's a complex and intricate question that also that has multiple components. I think like what's clear is that there should be multiple ways to do this. And there's a bunch of trade-offs as to how you want to do this. The most obvious one is what I was just referring to as embedded analytics. So that's, you know, you build a dashboard in a no code to in a no code tool, right?
Starting point is 00:35:44 By the way, BI is the original no-code tool. When you think about it, you're able to do these very, very complex things by drag and dropping things on the screen. But you go and you build a dashboard, you style it, you parameterize it, and you embed this dashboard inside your application, right? Maybe you have an analytics portion of your SaaS product that shows a dashboard. And with embedded analytics with preset, you're able to apply some role level security to say like, oh, I know it's this customer. Therefore, the dashboard will apply these filters so that they can see exactly the things that pertains to themselves. And there's no, like there's isolation.
Starting point is 00:36:27 So there's embedded analytics. There's another idea that we cater to at Preset that's, I call it white label BI. So it's the idea there. It's also not necessarily a new idea, but it's being able to have these prepackaged BI environments that are essentially, you know, if they're, you know, a superset instance that's preloaded with the datasets, dashboards, charts, queries that are relevant to that one customer. So you can say for each one of my customer, I'm going to create a superset sandbox with all the data assets that they need. And then they can come and self-serve and, you know, if you authorize them, they can write SQL and, you know, against the
Starting point is 00:37:09 data that you exposed to them. So that's the white label use case. And then another use case is more the component library. So, so there were, were a little bit in the infancy of this, but I think we're excited to expose the building blocks of preset and superset as component libraries for people to remix into the experiences they want to create. So there you can picture, you have a React library where you bring in some
Starting point is 00:37:37 charts, you bring in some controls, maybe a selection picker, a filtering control, a date range picker. And as the application developer, or as the engineer building the product, you create the exact experience that you want with these rich components that enable you to have these cross filtering and drill down these rich experiences that would be really prohibitive to build from scratch that are really easy to build if you have the right framework. Yeah, makes sense. And is this like as a problem,
Starting point is 00:38:13 is this related only to BI, the BI side, like something like preset or we still need to do work with our data warehouses or the storage layer, the query layer out there to enable these use cases. Or I can just have Snowflake just push all the data into Snowflake and then rely on Precept to do that. Like what's your experience?
Starting point is 00:38:38 Jason Cosperinopoulos Yeah. So, so I clearly this part of like, you know, earlier I talked about the analytics process and still, you know, there, there's no, there's no BI happening. There's no visualization happening unless you've gotten all these things right. You know, I think in this specific case you created the data sets that you need with the dimensions and the metrics that you want to expose in order to build a dashboard. So that's something that might be going back to how BI is evolving. I think BI is very monolithic in the sense that you would have the tool that includes
Starting point is 00:39:22 the data munging, the data transformation, the semantic layer, which this is a super loaded term. We can decide whether we want to unpack semantic layer or not, but you have all of these things that were part of a very of a platform like the Microsoft or Cognos or business objects. They would tend to be very monolithic.
Starting point is 00:39:43 And I think what we're seeing now is like we're, so say for me, the semantic layer belongs in the transform layer and that's, you know, dbt and airflow space. And then we don't necessarily, so superset and preset, you know, don't really actively solve that problem. We just say other people solve it much better than we could. And we just want to team up and integrate very well with them. Not sure if I'm answering the question too. I know I'm deriving quite a bit too,
Starting point is 00:40:12 but we're exploring a pretty complex space too. Yeah, no, you do. I mean, okay. I made also my question a little bit more specific to the Stonet layer, but as you said, it's not just one thing that has to be in each place in order for this to work. There are so many different things that need to happen. And you used to have the monolith of Cognos, but we don't.
Starting point is 00:40:38 I mean, we still have monoliths, but things are starting to break into smaller monoliths at least. And I think probably we see that with BI. Right? Like, I think it's one of the things that we see happening out there. It probably was one of the first markets to see that. But I think we also see that with even data warehousing, right? Like the whole idea of having the data lake where you have the storage on S3
Starting point is 00:41:07 and then you have a different query engine on top of that. And then you have, now you have a table engine. I mean, everything ranks in the lake. Yeah, warehouse, lake house, like special purpose databases, right? Like do you use the same database for real time or not? You know, it's gotta be like, you know, BigQuery with a BI engine and they're like real time option that
Starting point is 00:41:28 becomes a monolith that serves it all? Or for real time, you're going to use ClickHouse, Pnode, Druid, some of the databases are very specialized and use something else for your warehouse. So there's definitely, I think in the database space is really with all of the money we've seen kind of flowing in the snowflake place, like now people are looking to say like, oh, can we, maybe there's like some, some layers that we can kind of delaminate out of that and, you know, build some, some, some big businesses and some, some tools in some of these areas. I think like that, to me, I see like exploration visualization as like it needs to explode. Right. And at the database, I'm not as confident, you know, I'm not as sure about it. I
Starting point is 00:42:10 think like, you know, as a customer of a cloud data warehouse, I would like for the same database to do it all, you know, to be like a BigQuery or Hey, snowflake, this is a table that's like, is high availability and pitted to memory because I needed to be fast and but but I want to stay in the same warehouse. I don't want to go and purchase a bunch of tools. Yeah, right. Like people who are not in BI feel the same way about BI, you know, it's like, I want to put a fine one thing that does it all but then then you realize like that thing that does it all doesn't do anything right, you know? Yeah. Talking about like a little bit in the question, you know know what is the analytics process to power something like embedded or
Starting point is 00:42:51 or what i referred to as white label bi and uh for embedded it's pretty simple as long as you have you know we can apply the bi tool can or preset superset can apply role level security on the fly. So all is required is to say like this user is this customer ID, just apply a customer ID filter on all the queries that you run. It's a little bit more complicated than that, but it's like role level security. I think for white labeled, it's a little bit more intricate where you might want to create like a data mart for each customer. And really what I mean by that, you know, it can just be like a view layer, right? On top of, so say you have like five tables you want to expose to each one of your customers.
Starting point is 00:43:35 What you can do is create these schemas that have views that filter on a specific customer ID, and then you give them access with a service account to that specific schema that is limited to their data, but hopefully your schema, you have a universal schema that's the same for all your customers is all refresh, perhaps atomically, right? Like you refresh the whole thing every night or every hour, but, but you have these little islands or windows into the warehouse that are filtered and isolated just for them. And then you put a BI tool on top of that schema and you provide CAN reporting, CAN dashboard, and they can knock themselves out and go and push that further. Yeah, that's a super interesting thing.
Starting point is 00:44:23 One last question for me, and then I'll give the microphone to Eric to ask his question. So, can you share some things, some opportunities that you think there exist right now in the BI market, like things that you would like to see happen or you expect to see happening, or something to help our listeners go through, I don't know, maybe go
Starting point is 00:44:49 and build like a product out there, who knows? David PĂ©rez- Yeah. I mean, I really liked the idea of like bringing analytics everywhere. And the premise is like, people are more data literate than they used to be, right? Not only people expect to find a dashboard in every SaaS application, but they expect that and that becomes a requirement.
Starting point is 00:45:13 I think also like people, you know, in everything that they do, if you post, you know, a blog post on Medium or even like me participating in this podcast, I would expect to see a dashboard on how this podcast is doing, you know, in real time as we release it and be able to see like, who's listening to the podcast and, you know, what are the demographics? I think like we're starting to really expect more and more analytics everywhere and people are trained and they want
Starting point is 00:45:41 interactive analytics, not just like static too. And it's going to be real hard to go and build that, you know, with the building blocks that exist today. The building blocks being like charting libraries and, you know, data warehouse drivers, you know. So, you know, I think there's a real opportunity of thinking about like how do we enable people to bring analytics and all the experiences, you know, every day? I think that's interesting. So how is BI going to come out of its shell or is analytics going to come out of the shell and then, you know, outshine and kind of like ray out and be everywhere. So we're really interested in actually doing this, you know, a preset. And then I think like other trends may be beyond business intelligence it's like big topic for me has been thinking about how we take the learnings from some of the devops
Starting point is 00:46:34 practices and the devops movement and apply you know transform that and reapply that and reinvent that for the data people or you know and then and then, you know, the, the big thing that's really interesting too, is how, like, how is the modern data team evolving? Like, what do they do? Like, what are the roles, you know? And like, how do we then becomes like, how do we enable others to become better with data? So you become this vector to enable everyone to kind of self-serve.
Starting point is 00:47:04 Super fascinating. Well, well, two more questions for me, and one of them may take us down a little bit of a rabbit hole, but hopefully Brooks doesn't get too upset with us for going a little bit over. Just thinking about what you were saying on the sort of the frontiers in BI,
Starting point is 00:47:22 what do you think is going to be commoditized first? Or what do you see being commoditized? And the context behind that question is, you make such a good point in that the amount of work that needs to be done in order, like on the backend, in order to enable self-serve BI is immense. But also there are patterns around that.
Starting point is 00:47:43 So if you think about visualization, a lot of businesses can conform to a particular data model for the business, whatever, like a direct-to-consumer mobile app, and everyone has their different KPIs. But do you think that there will be a lot of commoditization in the actual visualizations as the data layer becomes more established and defined sort of across business models? I think so.
Starting point is 00:48:11 I mean, for me, I think we're trying to accelerate that in some way so we can really innovate too, right? So as you commoditize things, there's opportunity to go and further. Open source is a tidal wave
Starting point is 00:48:23 of commoditization. It's free, remixable. So I think clearly we're doing that in the BI space with Apache Superset first. And we also have freemium. So preset as a freemium offering on Apache Superset.
Starting point is 00:48:40 So today you can go and sign up for a free open source project and have it run for you up to five people for free. And I think the price point, too, is very competitive. So I think we're trying to accelerate data. Our mission is to make every team a data team, enable everyone to have the best tool to visualize, collaborate with data. So I think that's clearly happening. But beyond that, I think there's, you know, as we commoditize the consumption visualization
Starting point is 00:49:13 layer, there's opportunities to go and innovate. For us, a theme is like enabling people to bring analytics everywhere. That's one theme. One thing I've discovered too is like in the data world, as you walk to the horizon, then, you know, the horizon gets further. Like the universe, right? Ever expanding. That's it. There's no, and you're like, oh, you know, I climbed on the top of that tree and I saw how far the horizon is.
Starting point is 00:49:41 And by the time I get there, I will be at the edge of the world. And then you walk there, the horizon kind of moves with you. Yeah, yeah, yeah. Climbing a tree and you see you're not there quite yet. So I think like, you know, that on this run, you know, for one thing
Starting point is 00:49:56 I would really like to see is to see like all the data in the world or in your company being like instantly queryable, right? It'd be great if everything was in memory and you could ask any question and yeah all the answers but the moment we create an amazing dig the next generation in memory databases then people are like well i'm in a log board you know but it's like then you start hoarding. The house is, the more you hoard stuff, you never kind of get to fully solve any of the problem.
Starting point is 00:50:30 And maybe that's the beauty of it. Like how are we going to solve software engineering? It won't be solved. It will just continue to morph and grow and evolve. Yeah, no, I think that's a great perspective. Okay, last question, because we really are close to time here, but I'm interested to know,
Starting point is 00:50:48 so you've been really, I mean, you've started some projects that have become, you know, data tooling that are, you know, at least in whatever subset of the world, you know, that a lot of data engineers operate in. I sort of just go to tooling, which is pretty amazing.
Starting point is 00:51:06 And I'm interested to know, what did it feel like when you were creating those? Did it feel like you were just sort of solving a problem that was right in front of you? Or, you know, it's kind of, I guess I'm asking you this almost as like an inventor, right? Like, did you feel like you were inventing something or was it just a problem in front of you that happened to like, you know, solve like a pretty critical, like pervasive, you know, pain point? Yeah, I don't know. I think it's the history. The history of innovation is made of people kind of remixing things. So if you look at any of the great inventions in the history of humanity, not any, There are some really important exceptions to the rule,
Starting point is 00:51:45 but you look at what Isaac Newton had. Sure. At the time, what was he reading? I think Newton is actually a bad example because he actually pushed the head of innovation quite a bit.
Starting point is 00:52:00 Yeah, like inventing calculus and all that crazy stuff. Yeah, so I think that's a bad example. I think Tesla is also, like Nikola Tesla is also a bad example of that. But everywhere, in a lot of places that you look at, you look at any of the, who were their contemporaries and what were they reading at the time? What were they thinking about? Who were they talking to, exchanging correspondence with through letter at the time? Sure. Like the collective, you know, imagination was already on the cusp often of what they
Starting point is 00:52:31 discovered. So I think it's very much the case like Airflow was largely inspired by two or three products internally at Facebook. They just wanted to be some other things. Those were the things that emerged internally at Facebook from a bunch of experiments, right? So maybe it was a fair motivation. Those were the things that people internally decided to put their pipelines and bags into. And then I took some of that stuff, remixed that with some of the things that I learned, you know, in Informatica and like using other tools and kind of remixed that into something that I thought was going to be immediately useful for, you know, at Airbnb for my team.
Starting point is 00:53:14 And knew that, you know, people, you know, coming out of Facebook, we're going to look for things like that. So there's like this idea, like, you know, people talk about product market fit, you know, in business, but there's like project community fit too, an open source. So there's like a timing thing too of like, if you were to build Airflow five years before, then probably people, well, it's too early. It's not the right thing. People are not really, it doesn't fit people's mental model. It doesn't the right thing. People are not really, doesn't fit people's mental model. It doesn't resonate just yet. So, so there's always this kind of timing aspect where you gotta be early, but not too early.
Starting point is 00:53:51 So there's like always like, you know, some luck, some context, you know, some some hard work too, and then you know, community building too, which is a whole different topic, but like how to get people kind of interested and involved and excited about and get people to contribute. So I think that's something that I figured out how to do in an okay way for both Airflow and SuperSat. Well, thank you for sharing that. That's just so fun to be able to talk to people who have, you know, sort of conceived of and built the tools that we use. But as you point out, you know, we sort of stand on the shoulders of lots of people who have done lots of cool things.
Starting point is 00:54:28 So Max, this has been an incredible conversation. The episode flew by, but thank you for joining us. We'd love to have you back on to dig into, we could go for hours here, but thank you so much for giving us some of your time. Yeah, it was super fun. I think we did scratch the surface. We got a little deeper
Starting point is 00:54:45 in some. Yeah, I think that's good and fun, but there's still so much more to talk about. So happy to come back on the show anytime. One thing that really struck me was, this may sound like a funny conclusion, but Max has a very open mind about a lot of things, even BI, and he's building a company in the BI space. I mean, he certainly has strong opinions, but I really loved his analogy of, you know, you sort of think you get to the horizon, you climb a tree to look over the horizon, you realize like the horizon just keeps moving. And I think that's really clear in the way that he approaches problems. You know, he's trying to look for that frontier and he keeps a very open mind. And I just appreciated that a ton. How about you?
Starting point is 00:55:27 Yeah, absolutely. I think the way you put it, like, makes me think that probably that's a trait that an inventor needs to have in order to be an inventor. Right. So being like in an environment that changes so rapidly, like what he described. Like think about early days in Facebook, right? How the engineering was in there and like all the things that one project after the other and like building new technologies and building everything from scratch even if it already existed. So yeah, I think it makes total sense and it's an amazing trade for both an inventor
Starting point is 00:56:05 and I would say also for a founder. So I'm really, really, really looking forward to see what's next about with research. Me too. All right.
Starting point is 00:56:14 Well, thanks for joining us on the Data Stack Show and we will catch you on the next one. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe
Starting point is 00:56:22 on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, ericdodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
