a16z Podcast - The Great Data Debate
Episode Date: March 24, 2022

Over a decade after the idea of "big data" was first born, data has become the central nervous system for decision-making in organizations of all sizes. But the modern data stack is evolving, and which infrastructure trends and technologies will ultimately win out remains to be decided. In this podcast, originally recorded as part of Fivetran's Modern Data Stack conference, five leaders in data infrastructure debate that question: a16z general partner and pioneer of software-defined networking Martin Casado; former CEO of Snowflake Bob Muglia; Michelle Ufford, founder and CEO of Noteable; Tristan Handy, founder of Fishtown Analytics and leader of the open source project dbt; and Fivetran founder George Fraser.

The conversation covers the future of data lakes, the new use cases for the modern data stack, data mesh and whether decentralization of teams and tools is the future, and how low we actually need to go with latency. And while the topic of debate is the modern data stack, the themes and differing perspectives strike at the heart of an even bigger question: how does technology evolve in complex enterprise environments? We're re-running this episode as part of a special report on Future.com, the Data50: the World's Top Data Startups, which covers the bellwether private companies across the most exciting categories in data, from AI/ML to observability and more.
Transcript
Over a decade after the idea of big data was first born, data has become the central nervous
system for decision-making in organizations of all sizes. But the modern data stack is evolving.
And which infrastructure trends and technologies will ultimately win out remains to be decided.
This episode from November 2020 brings together some of the industry's leading experts,
from Snowflake to Fivetran, dbt, a16z, and beyond, to debate the future of the modern data stack,
from data lakes versus data warehouses to analytics versus artificial intelligence and machine learning,
to SQL versus everything else, and more. For more on how the data space is evolving and who the key
up-and-coming players are, check out the inaugural Data 50 list at future.com/data50.
We highlight and analyze the world's bellwether private companies across the most exciting categories in
data, which in aggregate are valued today at more than $100 billion and have raised approximately
$14 billion in total capital. And now, on to today's episode. Hi, and welcome to the great data
debate. I'm Das, and this is the a16z Podcast. Today's episode is all about the debates happening
around the modern data stack, lakes versus warehouses, analytics versus artificial intelligence
and machine learning, SQL versus everything else, and more. For other content from our series on
modern data businesses, including example
blueprints we've shared and a podcast with
Databricks that traces the history and evolution
of modern data architectures,
please see a16z.com/modern-data.
The conversation in this podcast
was originally recorded as part
of the Modern Data Stack conference hosted by
Fivetran. It's a spirited
discussion with A16Z general partner
and pioneer of software-defined networking
Martin Casado, and four founders
who are building different parts of the modern
data stack. Bob Muglia,
the former CEO of Snowflake; Michelle Ufford, the founder and CEO of Noteable;
Tristan Handy, the founder of Fishtown Analytics, as well as the leader of the open source
project dbt; and the discussion's moderator, Fivetran founder George Fraser, whose voice is
the first you'll hear. All right. So I'm going to go ahead and kick this off with a spicy
topic, I think, at least spicy in this crowd, which is Data Lakes. So Data Lakes is a blurry
term used by different people to mean different things. But for the purposes of this discussion,
let's define data lakes as tabular data, so tables, rows and columns, stored in an open-source
file format like Parquet or ORC, in public cloud object storage like S3 or Google Cloud Storage. So in a
world where we have data warehouses that use object storage to store their data and give you
some of the advantages of data lakes, do data lakes still have a place?
Let's start with you, Martin. Does the data lake have a future?
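As a concrete picture of the pattern George just defined, here is a minimal sketch, assuming DuckDB is installed and AWS credentials are available in the environment, of querying tabular Parquet files that live directly in object storage. The bucket, path, and column names are hypothetical placeholders, not anything from the conversation.

```python
# Minimal sketch of the "data lake" pattern described above: tabular data
# stored as open-format Parquet files in cloud object storage, queried in
# place without first loading it into a warehouse. Bucket and columns are
# hypothetical placeholders.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")  # enables reading s3:// paths

top_customers = con.execute(
    """
    SELECT customer_id, SUM(amount) AS total_spend
    FROM read_parquet('s3://example-bucket/orders/*.parquet')
    GROUP BY customer_id
    ORDER BY total_spend DESC
    LIMIT 10
    """
).fetchall()

print(top_customers)
```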
One of the biggest fallacies we commit as an industry is we look at an architecture and we say,
oh, that can do all of these things, therefore it will be pushed into service to do all of
these things.
And that's just not how technology evolves.
We make decisions in the design space based on the primary use cases that technology is being
used for. And if you look at the use cases that data warehouses are being used for, they're largely
driven by analytics, which is a certain workflow. It's a certain query pattern. And if you look at where
data lakes are used, it's actually quite different. They tend to be more unstructured data, focused on operational
AI, compute intensive. And so if you look at the respective technologies, they're just
optimizing this massive design space for different use cases. Architecturally, sure, they can both do
what the other one does. But in the end, you've got products and companies
that are optimized around use cases.
And I think the operational AI use case is a larger one, and it's growing faster.
So actually, I think over time you can argue that it's the data lake that ends up
consuming everything, not the data warehouse.
You're just trying to provoke Bob, Martin.
He's succeeded.
You're watching Bob's face.
All right, Bob.
Let's hear from you.
The data lake doesn't have a future?
No.
I see these things very largely converging onto a relational, SQL-based model.
And five years from now, data is going to sit behind a SQL prompt, and SQL data warehouses
will replace data lakes from the perspective of storing structured and semi-structured data.
The cloud SQL data warehouses already do everything that is necessary.
And there really is no reason for people to have a separate data lake, except for historical precedent.
A lot of companies come from environments where they had a lot of semi-structured data in a Hadoop environment,
and having a data lake is a natural transition.
And in a sense, the data lake, which is really S3 storage, together with a wide variety of any tools you want to put on top of it, is a very generalized platform.
But over time, infrastructure evolves to take on more and more of the use cases.
SQL relational data warehouses have evolved to the point that for structured and semi-structured data, storage and query, they subsume all of what needs to be done pretty much today.
What remains is images, video, documents, PDFs.
Now, I don't call that unstructured data.
I think that's a misnomer.
There's no such thing as unstructured data.
All data has structure of some kind.
Structured data is tables, rows and columns.
Semi-structured data is like JSON.
It's hierarchical in its nature.
And I think there's a third category of data, which is what I call complex data.
Images, documents, videos, most things that are streaming fall
into this category. And more and more, machine learning can be applied to the contents of those
data sources to turn it into semi-structured data that can be used for building complex data
applications and for doing predictive analytics. So what's missing in the case of the data warehouse
today is the support for complex data. But that's going to come. That's called a feature. Can you
imagine if you could transact, fully transact all of these types of images, videos and things,
together with any source of semi-structured data in a data warehouse,
the applications that open up are remarkable,
and that's going to come in the next two to three years.
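To make Bob's point about complex data becoming semi-structured data concrete, here is a rough sketch of that flow: run a model over images and emit newline-delimited JSON that a cloud warehouse can ingest as a semi-structured column. The caption_image() function is a hypothetical stand-in for whatever vision model you would actually use, and the file paths are placeholders.

```python
# Rough sketch of the flow Bob describes: apply machine learning to
# "complex data" (here, images) and emit semi-structured records that a
# SQL warehouse can ingest. caption_image() is a hypothetical stand-in
# for a real vision model; the scans/ directory is a placeholder.
import json
from pathlib import Path

def caption_image(path: Path) -> dict:
    # Placeholder: a real implementation would call a vision model here.
    return {"labels": ["chest x-ray"], "caption": "frontal chest radiograph"}

records = []
for image_path in Path("scans").glob("*.png"):
    extracted = caption_image(image_path)
    records.append(
        {
            "source_file": str(image_path),
            "labels": extracted["labels"],    # semi-structured: a list
            "caption": extracted["caption"],  # semi-structured: free text
        }
    )

# Newline-delimited JSON loads cleanly into warehouse JSON/VARIANT columns.
with open("image_metadata.ndjson", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```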
I could see images being easily retrieved from the database,
but do you actually see all of the image processing
or the video processing taking place in the database as well?
Not with SQL. SQL can't do that.
So you'll use procedural logic in Python or something else to do that,
at least for now.
In the long run, relational will win, too.
But that's probably more like eight to ten years away.
I think we've been waiting for that for 40 years, Bob.
But look what's happened.
If you look over time, navigational and hierarchical databases in the 1980s got replaced with SQL.
OLAP got replaced with SQL over the last 10 years or so.
We replaced MapReduce with relational.
So all of these things, relational always wins.
Well, relational wins for the actual retrieval.
But what about the processing? The technology that you need to process images
is fundamentally different from what you need to retrieve data.
Tristan, what are your thoughts on this?
So I completely agree that SQL is going to dominate data
processing, at least a very large chunk of data processing. But there are different APIs that
the data lake and the data warehouse expose. So there's the file storage layer. And for a lot of
reasons, I believe that an organization will store their files one time. You will not have a data
warehouse copy of the file and the data lake copy of the file, which in some architectures today,
that's what you see. And so that requires you to have an open source file format that is shared
between your data warehouse use cases and your other use cases.
Above that, you have indexing and metadata that is a core part of the data warehouse,
but it's also a core part of the data lake.
I think those have to also start to converge so that different use cases can take advantage
of the same stuff.
And then you have the SQL prompt.
And maybe at the SQL prompt layer, the data warehouse dominates,
but I think you need to allow different access patterns as well,
because one closed source firm is never going to accomplish literally all data processing
use cases in the world.
All of these things should interoperate in an open source and an open format way.
But the issues of format have kind of gone away, because you can import any kind of format
and export into any kind of format very easily.
The question is, what are the operations that actually need to be performed against data that sits in a data lake?
And today, anything associated with complex data, the data warehouse can't help you.
And so there's a huge reason to have a data lake today.
In 2025, I don't think so.
I think that we really have five platforms being created globally,
Snowflake, Databricks, and then the three clouds.
Both Snowflake and Databricks,
while they will come from very different places,
Snowflake will always be SQL and declarative in its approach.
And Databricks certainly historically has been procedural and code-based.
So it's a version of SQL versus code in some senses.
And I think you'll see both companies
and pretty much everybody else in the industry
offering both within their platforms.
So you've got two technologies that start with different use cases,
somewhat different architectures,
but they're clearly going into a converged point,
which is you have some declarative something
and you have some procedural something
and whether one's on top or the other,
at the end of the day, they can both do both.
But in the meantime, you have this decade-long journey.
And in that decade-long journey,
there is an architecture that's optimized around use cases.
I mean, the amount of trade-offs and decisions you make
when building one of these systems is...
Yeah, like, TimescaleDB has very different characteristics than Snowflake.
And they are characteristics that are optimized for a workload.
Yeah, entire companies focusing on different points in the design space with different optimization parameters.
It's actually the use case that drives the technology because of all of the gravity around it.
And so, again, if it turns out that AI/ML and operational use is growing quicker, which it seems to be,
it seems that's more going to dictate the technology from an architectural standpoint.
Martin, you've said a couple of times now that the AI/ML space appears to be growing fast.
I've actually not heard that assertion before.
So broadly, two use cases, right?
There's the analytics use case, which is driven by queries and dashboarding.
The other one is creating a complex model from a data scientist and then serving that in production.
That does things like wait time prediction, that does things like fraud detection, that does things like dynamic pricing.
These are folks that are building complex models on existing data and then coming up with a bespoke way of serving that.
That is very clearly now turning into a pattern that's being served by a data lake.
Now, it's on a much smaller base, but if you actually look in the industry, it's a very
rapidly growing use case.
Michelle, you've spent time in both the data science community and the analytics community.
And notebooks in many ways are the place where these things sometimes come together.
I'm curious to hear your thoughts about how the two stacks have evolved and maybe they're
converging.
Maybe they're building each other's features and getting more similar.
But where does that take us?
Do we still have two stacks five years hence?
I think we're going to continue to see greater and greater specialization because we're not
going to have the ability or the budget to hire enough data scientists. And so those stacks
are going to continue to evolve and it's going to be specialized based upon what it is that they're
trying to do. The data lake will have a place: your images, your raw storage, all of those things
are probably going to remain in the data lake and have a home there for a long time to come.
I just think it's not going to look like how it looks today. Today it's just been a lack of
understanding around what data we really need to collect. We went from
one extreme to the other.
We weren't collecting any data,
now we're collecting everything
because we don't know what's valuable.
And the reality is that's not necessarily a good idea.
The movement of data,
I think we're going to see that stop.
But format is going to be really important.
We need that interop because reprocessing data at scale
is just, it's cost-prohibitive, it's time-prohibitive.
It's not something that we want to do if we can avoid it.
And I think you're going to see decentralization here.
At the lower levels where you've got either business units embedded
or you've got your product teams,
you've got your data science teams embedded in those product teams,
you're going to need a unifying layer at the very top in the form of technologies that make it easier for
everybody to query or be able to serve information. I think that the notebook is probably the best
suited for that because it does have the language agnostic approach. You can see the ability to
look at both data and code and have all of that context, the rich business context, the visualizations.
We're going to see that evolve as this modern data document. And we can use that as part of our
unifying layer, because your data scientists can then work with it, or your data analysts can work with SQL,
but we can at the end of the day really hide all of the code
and really get to what is the business implication
of these things that we're doing.
So this really brings us to the second major topic
that I wanted to discuss,
which is how do we bring the machine learning,
Python, Scala world,
and the analytics SQL BI tool world together?
There really are two stacks and two communities
who sync the exact same data sources
to Delta Lake and to Snowflake
simply for operational reasons.
There's not a fundamental technological reason,
but it's just the way the tooling has evolved.
It's too inconvenient to cross that boundary.
And there's essentially three visions of that world.
One is that you're going to put machine learning into SQL,
and probably BigQuery is the furthest along in pursuing this.
You basically create a bunch of UDFs that do your linear algebra stuff.
The other is more the Databricks vision,
where you put SQL into Python,
or SQL into Scala,
and you use data frames to do that.
And then there's maybe a third vision
where you use Arrow,
the interchange format,
and everything can just talk to each other,
and you can arrange it any way you want.
Which of these visions do you think is going to win?
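To make the third, interchange-format vision concrete, here is a minimal sketch, assuming DuckDB, PyArrow, pandas, and scikit-learn are installed, of a SQL engine handing its result to Python ML code as an Arrow table with no file round-trip. The database file, table, and column names are hypothetical.

```python
# Minimal sketch of the "interchange" vision: SQL does the data prep, and
# Arrow carries the result into Python ML code without a file round-trip.
# The database, table, and column names are hypothetical placeholders.
import duckdb
from sklearn.linear_model import LogisticRegression

con = duckdb.connect("analytics.db")

# SQL side: feature engineering expressed as a query.
features = con.execute(
    """
    SELECT churned, days_active, support_tickets, monthly_spend
    FROM customer_features
    """
).fetch_arrow_table()  # an Arrow table, ready to hand across the boundary

# Python side: train a model on the same in-memory columns.
df = features.to_pandas()
model = LogisticRegression(max_iter=1000)
model.fit(df[["days_active", "support_tickets", "monthly_spend"]], df["churned"])
```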
What I would like to see win is something like Arrow,
so that you have interop.
You're going to see machine learning moving into SQL
because you're going to have data engineers
who are perfectly capable
and have the need to do some anomaly detection
or some interesting regression.
It's within their
ability to do that. Feature engineering is just another data transformation for them. But they don't
have the same background in stats, and so they can only take it so far. And then you're going to
see on the other side of the spectrum, your data scientist, where they have all of this really
great math background, and they understand how to do more advanced deep learning. But they don't
have the technology skills, and SQL is the most successful language for working with data. And so you're
going to really see both of them really become capable of supporting both use cases. But ultimately,
you'll continue to see specialization here, where the things that you want to do if you're trying to do deep learning are just fundamentally different from the things you do if you're just trying to build predictive models.
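Here is a small illustration of the kind of anomaly detection Michelle describes data engineers doing as just another transformation, written entirely in SQL via DuckDB. The table, values, and three-sigma threshold are synthetic and chosen only for the example.

```python
# A toy version of anomaly detection done as a plain SQL transformation:
# flag days whose order count sits more than three standard deviations
# from the mean. The table and values are synthetic.
import duckdb

con = duckdb.connect()

# Thirty days of synthetic order counts with one obvious spike on day 20.
con.execute("""
    CREATE TABLE daily_orders AS
    SELECT DATE '2020-11-01' + i AS order_date,
           CASE WHEN i = 20 THEN 560 ELSE 120 + (i % 7) END AS order_count
    FROM (SELECT range AS i FROM range(30))
""")

anomalies = con.execute("""
    SELECT order_date, order_count
    FROM (
        SELECT *,
               (order_count - AVG(order_count) OVER ()) /
               NULLIF(STDDEV(order_count) OVER (), 0) AS z_score
        FROM daily_orders
    )
    WHERE ABS(z_score) > 3
""").fetchall()

print(anomalies)  # only the spike on 2020-11-21 should be flagged
```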
I think a lot about the Arrow version of the world, and I think that that will end up in the fullness of time dominating.
For the same reason that Martin has been talking about, the tools end up evolving to the personas that they serve and the use cases that they serve.
I want to do all the data prep and feature engineering, and then I want machine learning models to be trained on top of that.
People do that, certainly.
But the fact that the infrastructures to do those two different things are generally separate creates this big slowness.
It's a purely technical slowness.
And Arrow doesn't solve all of that.
Arrow certainly helps.
But there's dumb things like the servers that do those things are in different clouds.
And the interchange fees, what do you call them, interchange fees?
Egress charges.
Egress fees are expensive.
They're criminal.
They're not just expensive.
They're ridiculous.
Right.
As more people do this, it's going to become smoother.
They're going to become more localized.
At the end of the day, there's a reason why you've got multiple languages,
and it's not because one is Turing complete and the other isn't.
And the reason is because people build their entire workflows around languages
and all of the tools.
And so you're going to have a heterogeneous, fragmented system.
So therefore, you do need to have open interfaces.
Bob?
I'm a big believer at this time in the approach of having multiple systems that interact with common formats.
Arrow is a huge step forward for that, not just because it's an efficient
format, but because it provides a consistent in-memory layout for people to do advanced analytics
in their Spark environments. And it's the way the world is working right now, because most
customers actually have a data warehouse and an analytics platform separately, and they are connecting
them together. Now, I'm the radical, however. I'm going to continue to be the ultimate radical and
declare that the approach that we're taking today in terms of machine learning is still roughly
the approach of the internal combustion engine in the automobile. And the approach
that's happening now, where Arrow ties those predictive systems together with declarative
databases, that's really the creation of the hybrid, or sort of the Prius, era. Hybrid will dominate
for the next, say, three to five years. And you will see hybrid systems being built by every
major vendor. And so all of them will have a full predictive stack and a full declarative
relational stack built in using some kind of interface like that. But that's only until relational
actually solves the broader set of problems.
Does that mean that you'll be using SQL functions, predict X?
No.
Ironically, I think that while SQL will dominate well into the 2030s
for doing data modeling and data transformation,
there's another step beyond that, which is business modeling.
And that needs to be represented in a knowledge graph.
Knowledge graphs are how we'll do predictive analytics in the 2030s.
And what needs to happen is a whole new generation of data system
that's based on relational knowledge graphs to create that.
Michelle, you brought up a term earlier that I wanted to follow up on,
which is data mesh.
And I wonder if you could define that briefly for everyone,
because similar to data lakes versus data warehouses,
there's a question whether going forward that's more of a historical phenomenon
or an actual good architecture that we want to continue.
Data mesh is really a concept of decentralizing the data processing
and the ETL and analytics into each individual business unit and then having some sort of
unifying solution at the top. And to do this well requires having specialized data teams,
having specialized roles, having infrastructure as a service available to them for data processing,
and then having some sort of overarching standards for it, almost like a federation of your
data engineers, to ensure that all of your ETL is consistent. So that as you are trying to do
data retrieval in some sort of common query tool, you'll
have that familiarity that you need.
We are going to see things like Arrow really come to the forefront sooner rather than later.
I think customers are going to demand it because of all of the challenges that we're currently
having.
We've got all the cost of the storage and the processing.
Your teams that are trying to do the processing don't have the domain context that they need.
And so as a result, you have this back and forth, a lot of wasted time.
We've got a lot of data quality errors, and we process the data multiple times.
And so ultimately, we really want to take that body of knowledge and put the technology
where that body of knowledge lives.
The data mesh is an attempt to do that.
One part of what the data mesh folks are talking about
is how to organize and how to structure a team
to manage data across a large enterprise
with very disparate and important data sources.
That's very, very important.
There's some good ideas in data mesh for that.
Architecturally, data mesh has this sort of odd idea
that data is basically streaming,
and you can use facilities like Kafka to do transforms
as the data is in flight.
And I don't believe that.
I think that that is totally missing the fact that while there is streaming data,
and you can do quite a bit with data that's simply streaming,
in other words, append only data.
To me, another critical source of data is transactional data coming out of business systems.
The streaming-based solutions have no answer for that,
and they just sort of pretend that data consistency is unimportant.
And I don't understand that because I put data consistency at the top of the issues
that I think about when I think about managing data.
Mesh has historically been one of these terms
that conflates architecture with administrative domains
and it does this in service mesh,
and it did this in Wi-Fi meshes and mesh networking, et cetera.
I think Bob is exactly right,
which is there is a very real issue
with separate administration domains,
separate processing domains, separate access to tool sets.
That's very, very different than building
a fully distributed architecture,
which just tends to be hard and inefficient.
And it's often not the people that promote the mesh idea,
but when people hear the term mesh,
they default to full distribution,
which tends to be just a bad way to build systems.
Said like a networking guy.
Having seen this exact same thing happen in other domains for a couple of decades.
I think all of us are very technology-focused human beings.
And so when we think about data mesh, we tend to think about the architecture part of it.
Bob, I'm glad you pointed out the distributed teams and the people aspect of this.
I think my constant question for the data mesh is,
why can't you enable the distributed nature of what you're talking about with a unified
architecture? My preference is always to have one data set that is very clean and well understood,
that we do not have to move anywhere, that is performant alongside our large batch
analytical processing, which is also working with our data science. That's the nirvana. That's the
goal, is to just have one data storage and then having something that sits over top of it,
and each of those different things are specialized for each of the different use cases, but you have
one data store. I feel like the modern data stack keeps swallowing up more and more use
cases. It killed cubes a while ago. It's mostly killed Hadoop at this point. It keeps pulling more
use cases into its orbit because it's fundamentally so flexible and so capable of doing many
different things well enough that you don't really want to buy another system, build another
system for that one use case. What do you think are some of the most interesting, surprising,
significant use cases that may start to get pulled into the orbit of the modern data stack
in the next couple of years?
Complex data.
We now have all this very, very interesting stuff that's happening in predictive analytics.
And to me, we've gone from semi-structured data as being the most interesting data sources
to now having a wide variety of data sources.
I was talking to a company involved in the medical field yesterday and just the rich amount
of data that exists in the images and the doctor's notes.
And all of that is opaque
to our systems today. It will not be in five years. That will all become part of the modern
data stack. And to me, that's a gigantic transformation into the types of applications that
will be created in the years to come. My last job was I ran marketing for a company and I really
went deep into growth marketing. The problem that you run into there is that you're constantly
writing code to push data back and forth between systems because the different operational
systems do different things and you need the same data and all of them. But no one has yet
re-architected the systems in the modern data stack to just take all of the data that you've ingested
and now push it back out to your operational systems. But I think we're
at the beginning of that. What you're really talking about, Tristan, is the advent of the modern
data app, which basically is an operational application that autonomously can make decisions
for the business. And we've seen very few of those. I mean, very trivial examples so far, but boy, will they
be significant in the future.
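As a rough illustration of the "push it back out" pattern Tristan describes, here is a minimal sketch that takes rows already modeled in the warehouse and syncs them to an operational tool over its API. The endpoint, auth token, and payload shape are hypothetical, not any real product's API.

```python
# Rough sketch of the pattern Tristan describes: take rows already modeled
# in the warehouse and push them back out to an operational system over
# its API. The endpoint, token, and payload shape below are hypothetical
# placeholders, not any real product's API.
import requests

# Imagine these rows came out of a warehouse query, e.g. a model that
# scores each customer for the marketing team.
scored_customers = [
    {"email": "a@example.com", "lifetime_value": 4200, "churn_risk": 0.81},
    {"email": "b@example.com", "lifetime_value": 310, "churn_risk": 0.07},
]

for row in scored_customers:
    response = requests.post(
        "https://api.example-crm.com/v1/contacts",  # hypothetical endpoint
        headers={"Authorization": "Bearer YOUR_API_TOKEN"},
        json={
            "email": row["email"],
            "properties": {
                "lifetime_value": row["lifetime_value"],
                "churn_risk": row["churn_risk"],
            },
        },
        timeout=10,
    )
    response.raise_for_status()
```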
There's really two visions of the data app that I've seen.
One of them is the data app is a separate system, and you take the important data from your
data warehouse and you push it.
And then the other vision is the data app is just natively built to run on top of the
data warehouse.
I'm curious whether people have opinions about those two models and where they see that going.
It's really the same conversation we've been having about how these things are built.
A data app is predictive analytics that actually takes autonomous action.
It takes the data that would otherwise be presented to a person and instead leverages that to
actually take actions within the business. They're being built every which way today because there
are few good tools to build data apps. That will not be true in a few years. One of the things
that you run into when you try to build data applications and take action automatically is that
latency becomes incredibly important. Everybody in the ecosystem is battling this right now.
I think there's a lot of different visions of how we're going to crush the latency problem and
how low we need it to get. How low does the latency need to be? At what point do we have
most of the interesting use cases? People have dozens to hundreds or even thousands of
operational systems. More and more, they're SaaS applications. They're outside of your
organization. And they're always a source of truth now. They're the present. And a data warehouse
or a data lake is about historical or the past. And what does that latency need to be? Does it need
to be zero seconds? I don't think so. I mean, there are applications where zero seconds, or
instant, is required, mostly having to do with eventing and alerting of some sort. Most of the time,
if you can get it in a minute or two, you can leverage that data inside your historical system
with predictive analytics to begin to perform actions on it. This is a very complicated topic that
I think is very use case specific, but there tends to be serious tradeoffs that systems designers
make between latency and throughput. If you want higher throughput, you batch. And the reason that
you batch is that you don't have as many domain crossings. However, if you look at most systems,
you can make the trade-off, meaning you could do low latency in a data lake, and you could do
high throughput in a data warehouse or vice versa. These are not architectural limitations.
They just tend to be the trade-offs that were made as a result of serving whatever the primary
use case is. I've heard a number of these kinds of latency-throughput trade-off discussions,
and when you actually get down to the machine level, they are just a result of the trade-offs that are made
in the system going into it.
One of the interesting things that we see is that the point at which you start to have to
spend a lot more to get the latency lower is actually lower than people think.
I suspect you can get down into the 10-second range with still the sort of throughput-optimized
architecture.
Basically, the throughput-optimized architecture, I suspect, will go lower than we expect.
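A toy sketch of the batching knob Martin and George are describing: a buffer that flushes either when it fills or when the oldest record has waited past a latency budget. The class and numbers are illustrative only, not drawn from any of the systems discussed.

```python
# Toy illustration of the latency/throughput trade-off: batching amortizes
# per-write ("domain crossing") overhead, so bigger batches raise throughput
# while adding latency before a record becomes visible downstream.
# flush() is a stand-in for whatever downstream write you actually do.
import time

class MicroBatcher:
    def __init__(self, max_batch_size: int, max_latency_seconds: float):
        self.max_batch_size = max_batch_size
        self.max_latency_seconds = max_latency_seconds
        self.buffer = []
        self.oldest = None

    def add(self, record):
        if self.oldest is None:
            self.oldest = time.monotonic()
        self.buffer.append(record)
        # Flush when the batch is full OR the oldest record has waited too
        # long; those are the two ends of the trade-off.
        if (len(self.buffer) >= self.max_batch_size
                or time.monotonic() - self.oldest >= self.max_latency_seconds):
            self.flush()

    def flush(self):
        if self.buffer:
            print(f"writing batch of {len(self.buffer)} records")
            self.buffer = []
            self.oldest = None

# Tuned for throughput: large batches, up to ten seconds of added latency.
batcher = MicroBatcher(max_batch_size=10_000, max_latency_seconds=10.0)
for i in range(25_000):
    batcher.add({"event_id": i})
batcher.flush()  # drain whatever is left
```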
What do you imagine will happen with the serving layer?
So your website still needs to operate over that data.
Are you imagining that there's just going to be a caching layer or...
It's not going to be a separate system?
It depends on what the characteristics of the system need to be.
If something needs to be really low latency,
today's data warehouses are not always the right solution for it.
And so it just depends on the application.
Latencies will go down in these products,
but to Martin's point,
some of the architectural choices make the latency characteristics
of a Snowflake somewhat different than, for example,
the latency characteristics of a MemSQL.
One of the things that I would like to see more of in the future
is Lambda architectures but with off-the-shelf tools.
So my data flows into both a more streaming-like system and a more batch-like system so that I can get the best of both worlds.
You're making trade-offs when you build these systems.
As a user, I want to be able to choose and have both of them.
All right.
Well, we have one minute left.
I'd like to ask a yes or no question for everyone.
Will there emerge another major data platform alongside Snowflake, Databricks, Google, AWS, and Azure?
We'll start with you, Michelle.
Yes or no?
Yes.
Bob?
What's your time scale?
Oh, yeah.
Sorry, in the next five years.
Yes.
Yes.
But, you know, it may be relatively small relative to those guys.
Well, I said major.
That sounds like a...
But it's not like it was small five years ago.
Tristan?
I think no.
Martin?
Yes.
All right.
Thank you very much, everyone, for joining.
This has been a really fun conversation.
Really appreciate all of you being here.
I know our audience does as well.