The a16z Show - Data Alone Is Not Enough: The Evolution of Data Architectures

Starting point is 00:00:00 Hi, and welcome to the a16z podcast. I'm DOS. Data, data, data. It's long been a buzzword in the industry, whether big data, streaming data, data analytics, data science, even AI and machine learning. But data alone is not enough. It takes an entire system of tools and technology to extract value from data. A multi-billion dollar industry has emerged around these data tools and technologies, and with so much excitement and innovation in the space, it raises the question, how exactly do all these tools fit together? This podcast featuring Ali Ghodsi, the CEO and founder of Databricks, explores the evolution of data architectures,

Starting point is 00:00:39 including some quick history, where they're going, and a surprising use case for streaming data, as well as Ali's take on how he'd architect the picks and shovels that handle data end-to-end today. Joining Ali in this hallway-style jam is a16z general partner Martin Casado, who, with other a16z enterprise partners, just published a series of blueprints for the modern data stack. You can find that as well as other pieces on building AI businesses, the empty promise of data moats, and more at a16z.com slash ML Economics.

Starting point is 00:01:13 In this conversation, we start with Ali answering the question: how did we arrive at the set of data tools we have today? It kind of started in the 80s. Business leaders were kind of flying blind, not knowing how the business was doing, waiting for finance to close the books, and this data warehousing paradigm came about where they said, look, we have all this data in these operational data systems. Why don't we just get all that data and we take it out of all these

Starting point is 00:01:39 systems, transform it into a central place, let's call it the data warehouse, and then we can get business intelligence on that data. And it was just a major transformation, because now you could have dashboards, you could know how your product is selling by region, by SKU, by geography. And that itself has created at least a $20 billion market that has been around for quite a few decades now. But about 10 years ago, this technology started seeing some challenges. One, more and more data types like video and audio

Starting point is 00:02:10 started coming about, and there's no way you can store any of that in data warehouses. Second, they were on-prem big boxes that you had to buy, and they coupled storage and compute. It became really expensive to scale them up and down. And the third thing was people wanted to do more and more machine learning and AI on these datasets. They saw that we can ask future-looking questions: which of my customers is going to churn,

Starting point is 00:02:33 which of my products are going to sell, which campaigns should I be offering to who? So then the data lake came about 10 years ago. And the idea was, here's really cheap storage, dump all your data here, and you can get all those insights. And it turns out that when you just dump all your data in a central location, it's hard to make sense out of that data that's sitting there. And as a result, what people are doing now is they're taking subsets of that data and moving it into classic data warehouses in the cloud. So we've ended up with an architectural mess that's inferior to what we had in the 80s,

Starting point is 00:03:05 where we have data in two places, in the data lake and in the data warehouse, where the staleness and the recency is not great. In the last two, three years, there's some really interesting technological breakthroughs that actually now are enabling a new kind of design pattern. We refer to it as the lake house, and the idea is, what if you could actually

Starting point is 00:03:25 be able to do BI directly on your data lake? And what if you could do your reporting directly on your data lake? And you could do your data science and your machine learning straight up on the data lake. I would love to tease apart a few things that have led us here. You know, there is very clearly a large existing data warehouse market around BI and analytics. And, you know, it's typified by people using SQL on structured data. It seems like the ML/AI use case is a little bit different than the analytics use case, right? In the analytics use case, it's normally human beings that are looking at dashboards and making decisions,

Starting point is 00:03:56 whereas in the ML/AI use case, you're creating these models and those models are actually put into production and they're part of the product. They're doing pricing. They're doing fraud detection. They're doing underwriting, et cetera. The analytics market is an existing buying behavior and existing customer. And ML/AI is an emerging market. And so the core question is, are we actually seeing the emergence of multiple markets or is this one market? Well, there are big similarities and there are big differences. And let's start with the similarities. Roughly the same data is needed for both. There's no doubt when it comes to AI and machine learning, a lot of the secret sauce, how you get those really great results or predictions, comes with augmenting your data with additional metadata

Starting point is 00:04:40 that you have. In some sense, you have the same data and you're asking analytical questions; the only difference is one is backwards looking, one is future looking. But other than that, a lot of it is the same. And you want to do the same kind of things with the data. You want to sort of prepare it, want to have it so you can make sense of it. If you have structural problems with your data, that actually causes problems for machine learning as well. The difference today is that it's line of business that typically is doing AI and data science, or hardcore R&D, whereas data warehousing and BI oftentimes sits in IT. The users of the data warehouse and the BI tools are data analysts, business analysts. In the case of machine learning, we have data scientists, we have machine learning

Starting point is 00:05:17 engineers, we have machine learning scientists. The personas are different and it sits in a different place in the organization. And those people have different backgrounds and they have different requirements on the products they're using today. If you talk to some folks that come from the traditional analyst side, they'll say, you know, AI and ML is cool. But if you really look at what they're doing, they're just doing simple regressions. Why don't we just use the traditional model of data warehouses with SQL? And then we'll just extend SQL just to do basic regressions. And we'll have 99% of the use cases. Yeah, that's interesting that you ask, because we actually tried that at UC Berkeley.

Starting point is 00:05:53 There was a research project, and that research project tried to basically look at, is there a way we can take an existing relational model and augment it with machine learning? And after five years, what they realized is that it's actually really hard to bolt machine learning and data science on top of these systems. And the reason is a little bit technical.

Starting point is 00:06:11 It just has to do with the fact that these are iterative, recursive algorithms that continue improving the statistical measure until it reaches a certain threshold and then stop. That's hard to implement on top of data warehousing. So if you look at the papers that were published out of that project, the conclusion was we have to really hack it hard, and it's not going to be pretty.
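A minimal sketch of the kind of iterative loop described here, assuming plain gradient descent on a one-variable least-squares fit. The function name `fit_slope`, the learning rate, and the tolerance are illustrative assumptions, not anything taken from the Berkeley project.

```python
def fit_slope(xs, ys, learning_rate=0.01, tolerance=1e-9, max_iters=10_000):
    """Fit y ~ w * x by gradient descent, stopping once the loss stops improving."""
    w = 0.0
    previous_loss = float("inf")
    for iteration in range(1, max_iters + 1):
        # Mean squared error and its gradient with respect to w.
        errors = [w * x - y for x, y in zip(xs, ys)]
        loss = sum(e * e for e in errors) / len(xs)
        gradient = 2 * sum(e * x for e, x in zip(errors, xs)) / len(xs)
        # Stop when the improvement falls below the threshold.
        if previous_loss - loss < tolerance:
            return w, iteration
        previous_loss = loss
        w -= learning_rate * gradient
    return w, max_iters


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
slope, iterations = fit_slope(xs, ys)
```

Each pass depends on the result of the previous one, and the number of passes is not known up front, which is the part that does not map neatly onto a single declarative query.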

Starting point is 00:06:33 If you're thinking of the relational Codd model, as it was with SQL on top of it, it's not sufficient for doing things like deep learning and so on. Is the same statement true about going from something architected for AI and ML and then having it support more of a traditional analyst relational model? So interestingly, I think the answer is no, because there is now a widespread data science API. What has emerged as the lingua franca for the data scientist is data frames.

Starting point is 00:07:00 A data frame is essentially a way where you can take your data and you can turn it into tables and you can start doing queries on it. And that sounds a lot like SQL, but it's not because it's actually built with programming language support. So you can do that in programming languages like Python or R, which enables you to do data science. So now your data is in sort of tables. And that's great. It turns out you can now also build SQL on top of data frames.
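As a rough illustration of that idea, here is a toy sketch using only the Python standard library. The `Frame` class and its methods are hypothetical stand-ins for a real data frame library, and SQLite stands in for a SQL layer over the same rows; neither is how any particular product works.

```python
import sqlite3


class Frame:
    """Toy data frame: a list of row dicts with chainable, programmatic queries."""

    def __init__(self, rows):
        self.rows = list(rows)

    def where(self, predicate):
        return Frame(row for row in self.rows if predicate(row))

    def group_sum(self, key, value):
        totals = {}
        for row in self.rows:
            totals[row[key]] = totals.get(row[key], 0) + row[value]
        return totals

    def to_sql_table(self, connection, name):
        # Expose the same rows to a SQL engine (SQLite here, as a stand-in).
        columns = list(self.rows[0])
        connection.execute(f"CREATE TABLE {name} ({', '.join(columns)})")
        placeholders = ", ".join("?" for _ in columns)
        connection.executemany(
            f"INSERT INTO {name} VALUES ({placeholders})",
            [tuple(row[c] for c in columns) for row in self.rows],
        )


sales = Frame([
    {"region": "east", "sku": "A", "units": 3},
    {"region": "west", "sku": "A", "units": 5},
    {"region": "east", "sku": "B", "units": 2},
    {"region": "east", "sku": "A", "units": 4},
])

# Programmatic query: an ordinary Python function acts as the filter.
frame_result = sales.where(lambda row: row["sku"] == "A").group_sum("region", "units")

# The same question asked in SQL over the same rows.
connection = sqlite3.connect(":memory:")
sales.to_sql_table(connection, "sales")
sql_result = dict(connection.execute(
    "SELECT region, SUM(units) FROM sales WHERE sku = 'A' GROUP BY region"
))
```

Both paths answer the same question from the same table; the data frame path simply lets arbitrary host-language code take part in the query.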

Starting point is 00:07:27 So you can get sort of a marriage between the world of data science and machine learning and the world of BI and data analytics using data frames. I get what you're saying about the data warehouse, but there's a lot more than just the data sitting in a data warehouse. And I'm just trying to grok, like you still have this entire world of data and SQL. Is there a dissonance there, or do they stay two worlds, or like what happens? Well, so every enterprise we talk to, they have the majority of their data in the data lake today.

Starting point is 00:07:57 And then a subset of it goes into the data warehouse. So there's like two-step ETL that they do. The first ETL step is getting into the data lake and then there's a second ETL step that they use. So organizations are definitely paying a hefty price for this architectural redundancy. But the question is, do you really need two copies of it? And do you really have to maintain those two copies and keep them in sync? Are you going to have a world in which you have all of your data in the data lake,

Starting point is 00:08:21 and then you do your machine learning and data science on it, and then subsets of it move into a data warehouse again, and you clean it up and you put it in that structured form so you can do SQL and BI, or can we do it all in one place? Okay, so let's actually ask that specific question because even though AI/ML is a large market with a lot of value, there's a ton of existing workflow around BI.

Starting point is 00:08:45 So you've got all the dashboarding and all the tools that are based on SQL for a data warehouse. But then you also have folks that want to interact with the data very quickly, which is why they'll use something like a ClickHouse or a Druid in order to do that in OLAP. OLAP stands for online analytical processing. It's effectively a fast database that supports fast queries. And then you've got more traditional batch processing, which normally folks have thought of as Spark. What you're saying is that you think that you can combine all of these things in the same data lake, including OLAP query loads? Yes, I actually think you can get all the way there. The data lakes are blob stores, big, large, you know, cheap storage, but kind of data swamps.

Starting point is 00:09:28 Turns out there's some recent technological breakthroughs that show you how you can basically turn them into a structured relational storage system. And the way you do that is you build basically transactionality into these data lakes. Once you have that, you can now start adding things like schemas on top of them. Once you add schemas on top of them, you can add quality metrics. And once you have that, you can start reasoning about your data as structured data in tables instead of data that's just files. I get that, like putting structure on top of a blob store or whatever. But you still need a query layer, right? Building a query engine that's super fast, that can like respond to analytical queries, I mean, there's entire companies that do that. Yeah. So it turns out

Starting point is 00:10:10 there's two APIs you need. One is the data frame API that'll enable all the data science and machine learning, and then you can build a SQL layer on top of it. And there's nothing that really gets in the way of making this as performant as the state-of-the-art fastest MPP engines out there. So the same tricks that you apply to get speedups, you can apply now, because you're actually dealing with structured data. It feels like especially in data, there's always kind of the trend du jour that everybody's excited about, but like you're not ever really sure if the market's real or not.

Starting point is 00:10:40 And people have been saying this a lot for like real-time and streaming use cases. And so it's very clear that like people want to process data at different time speeds. Batch we know is a very large market, which is like, listen, you've got a bunch of data. You want to do a whole bunch of processing. And then, you know, it's stored somewhere else and you do some queries. And then more and more people are talking about streaming analytics. That means that actually as a stream comes in, you do the queries before like it hits a disk. I sit in pitches basically as a full-time job.

Starting point is 00:11:05 A lot of the things motivating the streaming use cases seem a little contrived. There's the latency and the speed and how fast you can get this stuff. That's one side of the equation and that's what everybody focuses on. And oftentimes when we ask the business leader, hey, so what kind of latency would be okay with you? Like, what are you okay with? They'll say, oh, I don't know, I mean, we want it to be super fast. Like every five minutes, every 10 minutes,

Starting point is 00:11:26 and you can accomplish that with batch systems. Or sometimes they say every half an hour. Half an hour is fine. But, you know, we want it to be as recent as half an hour. And then when you dig into like, wouldn't you want it to be even faster? It turns out that in streaming systems, the weakest link will dictate the latency. So there'll be some upstream process that has nothing to do with the system that you're putting in place.

Starting point is 00:11:47 And if that upstream link, if that's one place where you're loading the data in or something, if that's coming in every half an hour, then it doesn't matter how fast the rest is. I think the actual latency, this obsession with we need it in less than five milliseconds, for most use cases you don't have that need. There's another side of the equation which people don't focus on because it's harder to understand or explain, but it might be the biggest benefit out of the streaming systems, which is it takes care of all the data operations for you. So if you don't have a real-time streaming system,

Starting point is 00:12:18 you have to deal with things like, okay, so data arrives every day. I'm going to take it in here, I'm going to add it over there. Well, how do I reconcile? What if some of that data is late? I need to join two tables, but that table is not here. So maybe I'll wait a little bit and I'll rerun it again. And then maybe once a week, I rerun the whole thing from scratch just to make sure everything is consistent.
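One common way streaming engines automate this late-data bookkeeping is with event-time windows and a watermark. The speaker does not name a mechanism, so treat the following as a toy sketch under that assumption; `windowed_sums`, the window size, and the lateness allowance are all illustrative.

```python
from collections import defaultdict


def windowed_sums(events, window_size=10, allowed_lateness=5):
    """Sum values per event-time window, tolerating out-of-order arrival.

    `events` is an iterable of (event_time, value) pairs in ARRIVAL order.
    A window is finalized once the watermark (latest event time seen minus
    `allowed_lateness`) reaches the window's end.
    """
    open_windows = defaultdict(int)   # window start -> running sum
    finalized = {}                    # window start -> final sum
    too_late = []                     # events that arrived after finalization
    max_event_time = float("-inf")

    for event_time, value in events:
        window_start = (event_time // window_size) * window_size
        if window_start in finalized:
            too_late.append((event_time, value))
            continue
        open_windows[window_start] += value
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - allowed_lateness
        # sorted() copies the keys, so popping inside the loop is safe.
        for start in sorted(open_windows):
            if start + window_size <= watermark:
                finalized[start] = open_windows.pop(start)

    # End of input: flush whatever is still open.
    finalized.update(open_windows)
    return finalized, too_late


arrivals = [(1, 10), (12, 1), (7, 5), (18, 2), (3, 100), (25, 4)]
sums, dropped = windowed_sums(arrivals)
```

Here the event at time 7 arrives after the event at time 12 but is still counted in its window, while the event at time 3 arrives after its window was closed and is set aside instead of silently changing an already published result.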

Starting point is 00:12:36 In some sense, all the ETL that people are doing today and all the data processing that they're doing today could be simplified if you turned it into a streaming case, because the streaming engines take care of the operationalization for you. You don't have to worry anymore. Did this data arrive late? Are we still waiting on it? Is it inconsistent? They'll take care of all of that. So you think ultimately a large part of this becomes stream processing? What I'm kind of saying provocatively is that in some sense all of the batch data that's out there is a potential use case for streaming. I think the stream processing systems have been too complicated

Starting point is 00:13:11 to use, but actually under the hood, they take care of a lot of data ops that people are doing manually today. I would love to talk through what you actually think a modern data stack looks like. We talked to a whole bunch of folks, and it seems like there's definitely a best practices stack forming, but very, very few people know what it looks like. Let's say you get hired, Ali. You decide to have a new job as a VP of data for like whatever, and you were to build a data infrastructure that does both analytics and AI/ML. Like what product category, again,

Starting point is 00:13:46 not specific products, but product categories would you use, you know, end to end? If I get hired into a big company, I'll spend the next five years fighting political battles on who owns which part of the stack and which technology I got rid of, and whether this technology fits with that one and how. So there's a lot of org chart and human and process problems. But let's say I get in there and they say, yes, Ali has it his way. He's got all the juice. That's right. Exactly. So obviously, trying to do something on-prem makes absolutely no sense at this point. And when you're building that cloud native architecture,

Starting point is 00:14:17 don't try to replicate what you had in the past on-prem. Don't think of it as big clusters that are going to be shared by users. One big change that happens in the cloud that on-prem vendors don't think of often is that the networks in the cloud are invisible. So any two machines can communicate at full speed. And they can also communicate with the storage system, with the data lake, at full speed. This was not the case on-prem. Things like Hadoop and so on, they had to optimize where you put the data, and the

Starting point is 00:14:44 computation has to be close to the data and all that. So you move it into the cloud. Typically, you have data flowing in from some of your systems. That depends on what kind of business you're in. But you have IoT devices or you have something from your web apps. Sometimes it goes to streaming sort of queuing systems, whether it's a Kafka or whatever they're called. It goes into those. And from there, it lands in the data lake. Into a data lake. So you're saying the data goes directly into your data lake. That's the first landing place. If you don't do that, you're actually going to go back a decade or two in the evolution.

Starting point is 00:15:16 Because if you don't put it in a data lake, then you have to immediately decide what schema you're going to have. And that's hard to get right from the beginning. So the good news with the data lake is you don't have to decide the schema. Just dump it there. Step number two, you need to build a structural transactional layer on top of it so that you can actually make sense of it. And there are technologies for that. There's three, four of these that appeared roughly at the same time, and they all enable you to actually take your data lake and turn it into a lakehouse.
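To make step two concrete, here is a toy sketch of a transactional, schema-enforcing layer over plain files. It assumes an append-only commit log that decides which data files are visible; `ToyTable` and its methods are hypothetical, and real table formats are far more involved than this.

```python
import json


class ToyTable:
    """Toy transactional table over a 'blob store' (a dict of file name -> bytes).

    Data files are written first; they only become visible when a commit
    entry naming them is appended to the log.
    """

    def __init__(self, schema):
        self.schema = schema      # column name -> Python type
        self.blobs = {}           # stand-in for cheap object storage
        self.log = []             # ordered commits; each lists its data files

    def _validate(self, row):
        if set(row) != set(self.schema):
            raise ValueError(f"columns {sorted(row)} do not match schema")
        for column, expected in self.schema.items():
            if not isinstance(row[column], expected):
                raise ValueError(f"{column!r} should be {expected.__name__}")

    def append(self, rows):
        # Reject the whole batch before anything becomes visible.
        for row in rows:
            self._validate(row)
        file_name = f"part-{len(self.blobs):05d}.json"
        self.blobs[file_name] = json.dumps(rows).encode()
        self.log.append({"version": len(self.log), "add": [file_name]})

    def read(self, version=None):
        # Readers see only files referenced by commits up to `version`.
        commits = self.log if version is None else self.log[: version + 1]
        rows = []
        for commit in commits:
            for file_name in commit["add"]:
                rows.extend(json.loads(self.blobs[file_name]))
        return rows


table = ToyTable({"device": str, "reading": float})
table.append([{"device": "a", "reading": 1.5}])
table.append([{"device": "b", "reading": 2.0}])

rejected = False
try:
    table.append([{"device": "c", "reading": "oops"}])
except ValueError:
    rejected = True
```

Because visibility is decided by the log and not by which files happen to exist, a reader never sees a half-written batch, and older versions stay readable by replaying fewer commits.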

Starting point is 00:15:45 Step number three, you need some kind of interactive data science environment where you can actually start interactively working on your data and get insights out of it. Typically people have notebook-based solutions where they can iterate with notebooks. Typically there's things like Spark under the hood and they're interactively processing their data and getting insights from it. And that's really important because a lot of data science in organizations ends up not being advanced machine learning. It just ends up being, okay, so we have this data coming in from our products or from our devices or whatever it is. We have to massage it, get it in a good form, and we need to get some basic

Starting point is 00:16:19 insights out of it. If you want to get into the predictive game, you need a machine learning platform. And there are now these machine learning platforms that are emerging. Many of them are proprietary, and you find them inside the companies. You can read about them, but you can't get your hands on one. And there are lots of startups building them, right? And this is for operational ML. Well, I think this actually goes from training the ML model, so actually featurizing it, creating a model that can do the predictions, tracking the results, making sure that you can actually make them reproducible and you can start actually sort of reasoning about them, to moving it into production, which is the hardest part. Moving it into production where you can actually serve it inside products. So that's the job of the machine learning platform. And the machine learning platform in your world, the folks that work on it are the data scientists and data engineers both, right?

Starting point is 00:17:05 Yeah, so it's different organizations today, unfortunately. The serving part, the production part sometimes is owned by IT. And the creating of the models happens by data scientists that sit in line of business. And there is friction in those organizations because IT operates differently, at a different wavelength than the data scientists. But the machine learning platform needs to span both. If it doesn't, you're basically not going to get the full value out of the machine learning sort of work that you're doing. Can you talk a little bit to where like the data pipeline and DAG tools fit in on all this? Yeah, I mean, that would be the first step of this, right? I talked about training

Starting point is 00:17:40 immediately. But the really hardest part is to take that data that's now sitting in the data lake. And the first step of it is sort of building the pipelines that featurize it and get it in the right shape and form so that you can start doing machine learning on it. So that's step number one. And then, you know, after that, you start training the models. To orchestrate that automatically and make that workflow just happen, you need software that does that. That's definitely like the first mile in the ML platform. And if I want to like take my traditional BI dashboard and attach it to this, where does that attach? Yeah, good question.
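The orchestration piece mentioned a moment ago, running pipeline steps in dependency order, can be sketched with a small DAG. This is a toy illustration using the standard library's `graphlib`; the step names and the `run` helper are hypothetical and do not describe any particular pipeline tool.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on (names are hypothetical).
pipeline = {
    "land_raw_data": set(),
    "clean": {"land_raw_data"},
    "featurize": {"clean"},
    "train_model": {"featurize"},
    "refresh_dashboard": {"clean"},
}


def run(pipeline, tasks):
    """Run every task after all of its dependencies; return the order used."""
    order = list(TopologicalSorter(pipeline).static_order())
    for name in order:
        tasks[name]()
    return order


executed = []
# Placeholder tasks that only record that they ran.
tasks = {name: (lambda name=name: executed.append(name)) for name in pipeline}
order = run(pipeline, tasks)
```

A production orchestrator adds scheduling, retries, and monitoring on top, but the core contract is the same: no step starts before the steps it depends on have finished.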

Starting point is 00:18:13 So that's the last mile. BI itself typically uses something like JDBC, ODBC. And you know, to make that really fast and snappy and work on top of the data, you need some capability that makes that possible. In the past, your only option there has been put it in a data warehouse and then attach your BI tool to it. I'm claiming that with the lakehouse pattern that we're seeing and with some of those technological breakthroughs with those technologies I mentioned, you can connect your BI tool directly now on that data lake. To where? To the transactional layer that's built on top of it?

Starting point is 00:18:42 Yep. Yep. So if you have something like Delta Lake or if you have something like Iceberg or Hive ACID, you can connect it to those directly. If you didn't have any legacy problem, like it seems like doing a data lake makes a lot of sense. Is there a simple migration path to this? I think it's hard in the West. In Asia, it's easier because there's not lots of legacy.

Starting point is 00:19:01 It's harder in the West because the enterprises have, well, I have 40 years of technology that I've bought and I've installed, I have data in them, I've configured them. I need to make that work with what you're talking about, and how do you fit the two together? Whereas if you're building it from a clean slate, you can actually get it right more easily. So you're actually seeing more usage of data lakes for companies that aren't encumbered by legacy? Well, I mean, the companies that are really succeeding with this stuff, I mean, take an Uber, they're doing predictions and those predictions are the competitive advantage. You press, and within a second, it tells you what the price of the ride is.

Starting point is 00:19:34 It's basically simulated the ride. It basically knows what that meter is going to tell you after an hour ride with traffic conditions and everything. And it gives you exactly the right price. Can't overprice, can't underprice. It matches supply and demand of drivers with surge pricing. It can even put people in the same car to lower the cost. All of these are machine learning use cases. And those stacks, these are all companies that are 10 years old. They didn't exist before that. They don't have lots of legacy data warehouses and legacy systems. They built it custom for this use case, and it's a huge competitive advantage. Is this the durable stack that lasts for the next decade,

Starting point is 00:20:07 or is this converging on something that looks a little bit different that you can articulate from here? I can't predict the future, but I tell you, a few ingredients of it just makes sense long term. If I'm an enterprise and I'm sitting there as a CIO or someone that's picking the data strategy, I would make sure that whatever I'm building is multi-cloud. There's a lot of innovation happening between the different cloud vendors.

Starting point is 00:20:26 They have deep pockets, and there is sort of an arms race there. so make sure that you have something that's multi-cloud. The second thing I would do is as much as possible, try to base it on open standards and open-source technology, if possible. That gives you the biggest flexibility that if the space, again, changes, that you can sort of move things. Otherwise, you find yourself being locked in to a technology stack, the way you were locked into technologies from the 80s and 90s and 2000s.

Starting point is 00:20:52 Storing all your data, dumping it first in raw format into a data lake is something that's going to remain that way, because there's so much data that's being collected. You don't have time to figure out exact perfect schemas for it and what we're going to do with it. So either we dump it somewhere or we throw it away. And I think no one wants to be that employee that threw away the data, especially when it's so cheap to store it.

Starting point is 00:21:16 And the third thing I would do is I would make sure that this stack that you're building, the way you're laying it out, that it has machine learning and data science as first-class citizens. Because that's a way actually to take the data that we already have and get business insights out of it. Machine learning platforms didn't exist 15 years ago. That probably will change quite a bit. I think the exact shape of the machine learning platform,

Starting point is 00:21:38 I don't think will look exactly the way it is today. But many of the ingredients are right. Perfect. Thank you so much. I don't know if we're on a race to see who speaks faster, but I think you win. Thank you for having me. Yeah, I really appreciate you taking the time.

Starting point is 00:21:52 Always a pleasure.
