The a16z Show - The Great Data Debate

Starting point is 00:00:00 Over a decade after the idea of big data was first born, data has become the central nervous system for decision-making in organizations of all sizes. But the modern data stack is evolving. And which infrastructure trends and technologies will ultimately win out remains to be decided. This episode from November 2020 brings together some of the industry's leading experts, from Snowflake to Fivetran, dbt, a16z and beyond, to debate the future of the modern data stack, from data lakes versus data warehouses to analytics versus artificial intelligence and machine learning, to SQL versus everything else, and more. For more on how the data space is evolving and who the key up-and-coming players are, check out the inaugural Data50 list on future.com/data50.

Starting point is 00:00:46 We highlight and analyze the world's bellwether private companies across the most exciting categories in data, which in aggregate are valued today at more than $100 billion and have raised approximately $40, $14 billion in total capital. And now, on to today's episode. Hi, and welcome to the great data debate. I'm Das, and this is the a16z podcast. Today's episode is all about the debates happening around the modern data stack: lakes versus warehouses, analytics versus artificial intelligence and machine learning, SQL versus everything else, and more. For other content from our series on modern data businesses, including example blueprints we've shared and a podcast with Databricks that traces the history and evolution of modern data architectures, please see a16z.com backslash modern data. The conversation in this podcast was originally recorded as part of the Modern Data Stack conference hosted by Fivetran.

Starting point is 00:01:43 It's a spirited discussion with a16z general partner and pioneer of software-defined networking, Martin Casado, and four founders who are building different parts of the modern data stack: Bob Muglia, the former CEO of Snowflake; Michelle Ufford, the founder and CEO of Noteable; Tristan Handy, the founder of Fishtown Analytics, as well as the leader of the open source project dbt; and the discussion's moderator, Fivetran founder George Fraser, whose voice is the first you'll hear. All right. So I'm going to go ahead and kick this off with a spicy topic, I think, at least spicy in this crowd, which is data lakes. So data lakes is a blurry term used by different people to mean different things. But for the purposes of this discussion,

Starting point is 00:02:25 let's define data lakes as tabular data, so tables, rows and columns, stored in an open source file format like Parquet or ORC in a public cloud object storage, like S3 or Google Cloud Storage. So in a world where we have data warehouses that use object storage to store their data and give you some of the advantages of data lakes,

Starting point is 00:02:53 do data lakes still have a place? Let's start with you, Martin. Does the data lake have a future? One of the biggest fallacies that we have as an industry is we look at an architecture and we're like, oh, that can do all of these things, therefore it will be pushed into service to do all of these things. And that's just not how technology evolves.

Starting point is 00:03:11 We make decisions in the design space based on the primary use cases that technology is being used for. And if you look at the use cases that data warehouses are being used for, they're largely driven by analytics, which is a certain workflow, it's a certain query pattern. And if you look at where data lakes,

Starting point is 00:03:28 it's actually quite different. They tend to be more unstructured data, focused on operational AI, compute intensive. And so if you look at the respective technologies, they're just being optimized in this massive design space for different use cases. Architecturally, sure, they can both do what the other one does. But in the end, you've got products and companies

Starting point is 00:03:47 optimized around use cases. And I think the operational AI use case is a larger one, and it's growing faster. So actually, I think over time you can argue that it's the data lake that ends up consuming everything, not the data warehouse. You're just trying to provoke Bob, Martin. He's succeeded. You're watching Bob's face. All right, Bob, let's hear from you. The data lake doesn't have a future?

Starting point is 00:04:10 No. I see these things very largely converging onto a relational SQL-based model. And five years from now, data is going to sit behind a SQL prompt, and SQL data warehouses will replace data lakes from the perspective of storing structured and semi-structured data. The cloud SQL data warehouses already do everything that is necessary. And there really is no reason for people to have a separate data lake, except for historical precedent. A lot of companies come from environments where they had a lot of semi-structured data in a Hadoop environment and having a data lake is a natural transition. And in a sense, the data lake, which is really

Starting point is 00:04:47 S3 storage, together with a wide variety of any tools you want to put on top of it, is a very generalized platform. But over time, infrastructure evolves to take on more and more of the use cases. SQL relational data warehouses have evolved to the point that for structured and semi-structured data, storage and query, they subsume all of what needs to be done pretty much today. What remains is images, video, documents, PDFs. Now, I don't call that unstructured data. I think that's a misnomer. There's no such thing as unstructured data. All data has structure of some kind. Structured data is tables, rows and columns. Semi-structured data is like JSON. It's hierarchical in its nature. And I think there's a third category

Starting point is 00:05:34 of data, which is what I call complex data. Images, documents, videos, most things that are streaming fall into this category. And more and more, machine learning can be applied to the contents of those data sources that turn it into semi-structured data that can be used for building complex data applications and for doing predictive analytics. So what's missing in the case of the data warehouse today is the support

Starting point is 00:05:57 for complex data. But that's going to come. That's called a feature. Can you imagine if you could transact, fully transact all of these types of images, videos and things together with any source of semi-structured data in a data warehouse? The applications that open

Starting point is 00:06:13 up are remarkable, and that's going to come in the next two to three years. I could see images being easily retrieved from the database, but do you actually see all of the image processing or the video processing taking place in the database as well? Not with SQL. SQL can't do that. So you'll use procedural logic in Python or something else

Starting point is 00:06:29 to do that at least for now. In the long run, relational will win too, but that's probably more like eight to ten years away. I think we've been waiting for that for 40 years, Bob. If you still need to be interesting. If you look over time, navigational and hierarchical databases in the 1980s got replaced with SQL. OLAP got replaced with SQL over the last 10 years or so.

Starting point is 00:06:49 We've replaced MapReduce with relational. So all of these things, relational always wins. Well, relational wins for the actual retrieval. But what about the processing, the technology that you need to process images is fundamentally different than you do to retrieve data. Tristan, what are your thoughts on this? So I completely agree that SQL is going to dominate data processing, at least a very large chunk of data processing.

Starting point is 00:07:12 But there's different APIs that the data lake and the data warehouse expose. So there's the file storage layer. And for a lot of reasons, I believe that an organization will store their files one time. You will not have a data warehouse copy of the file and the data lake copy of the file, which in some architectures today, that's what you see. And so that requires you to have an open source file format that is shared between your data warehouse use cases and your other use cases. Above that, you have indexing and metadata that is a core part of the data warehouse, but it's also a core part of the data lake. I think those have to also start to converge so that different use cases can take advantage of the same stuff.

Starting point is 00:07:53 And then you have the SQL prompt. And maybe at the SQL prompt layer, the data warehouse dominates, but I think you need to allow different access patterns as well. Because one closed source firm is never going to accomplish literally all data processing use cases in the world. All of these things should interoperate in an open source and an open format way. The issues of format have kind of gone away because you can input and output any kind of format and export into any kind of format very easily. The question is what are the operations that actually need to be performed against data that sits in a data lake? And today, anything associated with complex data, the data warehouse can't help you. And so there's a huge reason to have a data lake today.

Starting point is 00:08:30 In 2025, I don't think so. I think that we really have five platforms being created globally: Snowflake, Databricks, and then the three clouds. Snowflake and Databricks will come from very different places: Snowflake will always be SQL and declarative in its approach, and Databricks certainly historically has been procedural and code-based. So it's a version of SQL versus code in some senses. And I think you'll see both companies and pretty much everybody else in the industry offering both within their platforms.

Starting point is 00:08:59 So you've got two technologies that start with different use cases, somewhat different architectures, but they're clearly going into a converged point, which is you have some declarative something and you have some procedural something, and whether one's on top or the other, at the end of the day, they can both do both. But in the meantime, you have this decade-long journey.

Starting point is 00:09:18 And in that decade-long journey, there is an architecture that's optimized around use cases. I mean, the amount of trade-offs and decisions you make when building one of these systems is... Yeah, like, TimescaleDB has very different characteristics than Snowflake, and they are characteristics that are optimized for a workflow. Yeah, entire companies, focusing on different points in the design space, with different optimization parameters.

Starting point is 00:09:41 It's actually the use case that drives the technology because of all of the gravity around it. And so, again, if it turns out that AI/ML and operational use is growing quicker, which it seems to be, it seems that's more going to dictate the technology from an architectural standpoint. Martin, you've said a couple of times now

Starting point is 00:10:00 that the AI/ML space is appearing to grow faster. I've actually not heard that assertion before. So broadly two use cases, right? There's the analytics use case, which is driven by queries and dashboarding. The other one is creating a complex model from a data scientist and then serving that in production. That does things like wait time prediction, that does things like fraud detection, that does things like dynamic pricing. These are folks that are building complex models on existing data and then coming up with

Starting point is 00:10:27 a bespoke way of serving that. That is very clearly now turning into a pattern that's being served by a data lake. Now, it's on a much smaller base, but if you actually look in the industry, it's a very rapidly growing use case. Michelle, you've spent time in both the data science community and the analytics community. And notebooks in many ways are the place where these things sometimes come together. I'm curious to hear your thoughts about how the two stacks have evolved and maybe they're converging.

Starting point is 00:10:55 Maybe they're building each other's features and getting more similar. But where does that take us? Do we still have two stacks five years hence? I think we're going to continue to see greater and greater specialization because we're not going to have the ability or the budget to hire enough data scientists. And so those stacks are going to continue to evolve and it's going to be specialized based upon what it is that they're trying to do. The data lake will have a place. Your images, your blob storage, all of those things, they're probably going to remain in the data lake and have a home there for a long time to come.

Starting point is 00:11:23 I just think it's not going to look like how it looks today. Today it's just been a lack of understanding around what data do we really need to collect. We went from one extreme to the other. We weren't collecting any data. Now we're collecting everything because we don't know what's valuable. And the reality is that's not necessarily a good idea either. The movement of data, I think we're going to see that stop. But format is going to be really important. We need that interop because reprocessing data at scale is just, it's cost prohibitive, it's time prohibitive.

Starting point is 00:11:48 It's not something that we want to do if we can avoid it. And I think you're going to see decentralization here at the lower levels where you've got either business units embedded or you've got your product teams and you've got your data science teams embedded in those product teams. You're going to need a unifying layer at the very top in the form of technologies that make it easier for everybody to query or be able to serve information. I think that the notebook is probably the best suited for that because it does have the language agnostic approach.

Starting point is 00:12:14 You can see the ability to look at both data and code and have all of that context, that rich business context, the visualizations. We're going to see that evolve as this modern data document. And we can use that as part of our unifying layer because your data scientists can then work with or your data analysts can work with SQL, but we can at the end of the day, really hide all of the code and really get to what is the business implication of these things that they're doing. So this really brings us to the second major topic that I wanted to discuss,

Starting point is 00:12:41 which is how do we bring the machine learning, Python, Scala world, and the analytics SQL, BI tool world together. There really are two stacks and two communities who sync the exact same data sources to Delta Lake and to Snowflake simply for operational reasons. There's not a fundamental technological reason, but it's just the way the tooling has evolved. It's too inconvenient to cross that boundary. And there's essentially three visions of that world.

Starting point is 00:13:11 One is that you're going to put machine learning into SQL, and probably BigQuery is the furthest along in pursuing this. You basically create a bunch of UDFs that do your linear algebra stuff. The other is more the Databricks vision, where you put SQL into Python or SQL into Scala, and you use data frames to do that. And then there's maybe a third vision. where you use Arrow, the interchange format,

Starting point is 00:13:37 and everything can just talk to each other, and you can arrange it any way you want. Which of these visions do you think is going to win? What I would like to see win is something like Arrow so that you have that interop. You're going to see machine learning moving into SQL because you're going to have data engineers who are perfectly capable and have the need to do some anomaly detection

Starting point is 00:13:55 or some interesting regression. It's within their ability to do that. Feature engineering is just another big transformation for them. But they don't have the same background in stats, and so they can only take it so far. And then you're going to see on the other side of the spectrum, your data scientists, where they have all of this really great math background, and they understand how to do more advanced deep learning.

Starting point is 00:14:13 But they don't have the technology skills, and SQL is the most successful language for working with data. And so you're going to really see both of them really become capable of supporting both use cases. But ultimately, you'll continue to see specialization here where the things that you want to do if you're trying to do deep learning are just fundamentally different than the types of things that you're just trying to do predictive models.
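
A minimal sketch of the "machine learning in SQL" vision described above, where a model is exposed to SQL as a user-defined function. This is an illustration under stated assumptions, not how BigQuery or any vendor implements it: it uses SQLite from the Python standard library, and the table `visits`, the function `predict_wait`, and the model coefficients are all hypothetical.

```python
import sqlite3

# Hypothetical, hand-picked coefficients for a toy linear model. In practice
# these would come from a training step that is out of scope for this sketch.
INTERCEPT = 2.0
SLOPE = 0.5


def predict_wait_minutes(queue_length):
    """Score a single row with the toy linear model."""
    return INTERCEPT + SLOPE * queue_length


conn = sqlite3.connect(":memory:")
# Register the Python function as a one-argument scalar SQL function (a UDF).
conn.create_function("predict_wait", 1, predict_wait_minutes)

conn.execute("CREATE TABLE visits (store TEXT, queue_length INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("north", 4), ("south", 10)],
)

# The model is now callable from plain SQL, next to the data it scores.
rows = conn.execute(
    "SELECT store, predict_wait(queue_length) FROM visits ORDER BY store"
).fetchall()
print(rows)
```

The point of the pattern is that someone who only writes SQL can apply a model without leaving the query, while the model's internals stay in procedural code.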

Starting point is 00:14:33 I think a lot about the Arrow version of the world, and I think that that will end up in the fullness of time dominating. For the same reason that Martin has been talking about, the tools end up evolving to the personas that they serve and the use cases that they serve. I want to do all the data prep and feature engineering. And then I want machine learning models to be trained on top of that. People do that, certainly. But the fact that the infrastructures to do those two different things are generally separate creates this big slowness. It's a purely technical slowness. And Arrow doesn't solve all of that.

Starting point is 00:15:04 Arrow certainly helps, but there's dumb things like the servers that do those things are in different clouds. And the interchange fee, what do you call them interchange fees? Ingress charges. Ingress fees are expensive. They're criminal. They're not just expensive. They're ridiculous. Right.

Starting point is 00:15:21 As more people do this, it's going to become smoother. They're going to become more localized. At the end of the day, there's a reason why you've got multiple languages, and it's not because one is Turing complete and the other isn't. The reason is that people build their entire workflow around languages and all of the tools. And so you're going to have a heterogeneous, fragmented system. So therefore, you do need to have open interfaces. Bob?

Starting point is 00:15:43 I'm a big believer at this time in the approach of having multiple systems that interact with common formats. Arrow is a huge step forward for that, not just because it's an efficient format, but because it provides a consistent in-memory layout for people to do advanced analytics in their Spark environment. And it's the way the world is working right now because most customers actually have a data warehouse and an analytics platform separately, and they are connecting them together. Now, I'm the radical, however. I'm going to continue to be the ultimate radical and declare that the approach that we're taking today in terms of machine learning is still roughly the approach of the internal combustion engine in the automobile. And the approach that's happening where Arrow ties together those predictive systems together with declarative databases, that's really the creation of the hybrid, or sort of the Prius era. Hybrid will dominate for the next, say, three to five years,

Starting point is 00:16:36 and you will see hybrid systems being built by every major vendor. And so all of them will have a full predictive stack and a full declarative relational SQL stack built in using some kind of interface like that. But that's only until relational actually solves the broader set of problems. Does that mean that you'll be using SQL functions, predict X? No. Ironically, I think that while SQL will dominate well into the 2030s for doing data modeling and data transformation, there's another step beyond that, which is business modeling.

Starting point is 00:17:10 And that needs to be represented in a knowledge graph. Knowledge graphs are how we'll do predictive analytics in the 2030s. And what needs to happen is a whole new generation of data system that's based on relational knowledge graphs to create that. Michelle, you brought up a term earlier that I wanted to follow up on, which is data mesh. And I wonder if you could define that briefly for everyone, because similar to data lakes versus data warehouses, there's a question whether going forward that's more of a historical phenomenon or an actual good architecture that we want to continue. Data mesh is really a concept of decentralizing the data processing and the ETL and analytics into each individual business unit and then having some sort of unifying solution at the top. And to do this forward, for having specialized data teams, having specialized roles, having infrastructure as a service available to them for data processing, and then having some sort of overarching

Starting point is 00:18:07 standards for it, almost like a federation of data engineers, to ensure that all of your ETL is consistent, so that as you are trying to do data retrieval and some sort of common query tool, you'll have that familiarity that you need. We are going to see things like Arrow really come to the forefront. Sooner, rather than later, I think customers are going to demand it because of all of the challenges that we're currently having. You've got all the cost of the storage and the processing. Your teams that are trying to do the processing don't have the distant context that they need. And so as a result, you have this back and forth, a lot of wasted time,

Starting point is 00:18:41 got a lot of data quality errors in the data multiple times. And so ultimately, we really want to take that body of knowledge and put the technology where that body of knowledge lives. The data mesh is an attempt to do that. One part of what the data mesh folks are talking about is how to organize and how to structure a team to manage data across a large enterprise with very disparate and important data sources. That's very, very important.

Starting point is 00:19:05 There's some good ideas in data mesh for that. Architecturally, data mesh has this sort of odd idea that data is basically streaming, and you can use facilities like Kafka to do transforms as the data is in flight. And I don't believe that. I think that that is totally missing the fact that while there is streaming data,

Starting point is 00:19:25 and you can do quite a bit with data that's simply streaming, in other words, append only data. To me, another critical source of data is transactional data coming out of business systems. The streaming-based solutions have no answer for that, and they just sort of pretend that data consistency is unimportant. And I don't understand that because I put data consistency at the top of the issues

Starting point is 00:19:47 that I think about when I think about managing data. Mesh has historically been one of these terms that conflates architecture with administrative domains, and it did this in service mesh, and it did this in Wi-Fi meshes and mesh networking, et cetera. I think Bob is exactly right, which is there is a very real issue with separate administration domains,

Starting point is 00:20:04 separate processing domains, separate access to tool sets. That's very, very different to building a fully distributed architecture, which just tends to be hard and inefficient. And it's often not the people that promote the mesh idea, but when people hear the term mesh, they default to full distribution, which tends to be just a bad way to build systems.

Starting point is 00:20:19 Said like a networking guy. Having seen this exact same thing happen in other domains for a couple of decades. I think all of us are very technology-focused human beings. And so when we think about data mesh, we tend to think about the architecture part of it. Bob, I'm glad you pointed out the distributed teams and the people aspect of this. I think my constant question for the data mesh is, why can't you enable the distributed nature of what you're talking about with a unified architecture?

Starting point is 00:20:47 My preference is always to have one data set that is very clean and well understood, that we do not have to move anywhere, that is performant alongside our large batch analytical processing, which is also working with our data science. That's the nirvana. That's the goal, to just have one data storage and then having something that sits over top of it. And each of those different things are specialized in each of the different use cases, but you have one data store.

Starting point is 00:21:11 I feel like the modern data stack keeps swallowing up more and more use cases. It killed cubes a while ago. It's mostly killed Hadoop at this point. It keeps pulling more use cases into its orbit. because it's fundamentally so flexible and so capable of doing many different things well enough that you don't really want to buy another system, build another system for that one use case. What do you think are some of the most interesting, surprising, significant use cases that may start to get pulled into the orbit of the modern data stack in the next couple of years?

Starting point is 00:21:48 Complex data. We now have all this very, very interesting stuff that's happening in predictive analytics. And to me, we've gone from semi-structured data as being the most interesting data sources to now having a wide variety of data sources. I was talking to a company involved in the medical field yesterday and just the rich amount of data that exists in the images and the doctor's notes. And all of that is opaque to our systems today. It will not be in five years. That will all become part of the modern data stack. And to me, that's a gigantic transformation into the types of applications that will be created in the years

Starting point is 00:22:22 to come. My last job was I ran marketing for a company and I really went deep into growth marketing. The problem that you run into there is that you're constantly writing code to push data back and forth between systems because the different operational systems do different things and you need the same data in all of them. But no one has yet re-architected the systems to, in the modern data stack, just take all of the work that you've ingested and now push it back out to your operating systems, your operational systems. But I think we're at the beginning of that. What you're really talking about, Tristan, is the advent of the modern data app,

Starting point is 00:22:58 which basically is an operational application that autonomously can make decisions for the business. And we've seen very few of those. I mean, very trivial examples, but boy, will they be significant in the future. There's really two visions of the data app that I've seen. One of them is the data app is a separate system, and you take the important data from your data warehouse and you push it. And then the other vision is the data app is just natively built to run on top of the data warehouse.

Starting point is 00:23:25 I'm curious whether people have opinions about those two models and where they see that going. It's really the same conversation we've been having about how these things are built. A data app is predictive analytics that actually takes autonomous action. It takes the data that would otherwise be presented to a person and instead leverages that to actually take actions within the business. They're being built every which way today because there are few good tools to build data apps. That will not be true in a few years. One of the things that you run into when you try to build data applications and take action automatically is that latency becomes incredibly important.

Starting point is 00:24:01 Everybody in the ecosystem is battling this right now. I think there's a lot of different visions of how we're going to crush the latency problem and how low we need it to get. How low does the latency need to be? At what point do we have most of the interesting use cases? People have dozens to hundreds or even thousands of operational systems. More and more, they're SaaS applications. They're outside of your organization. And they're always a source of truth now.

Starting point is 00:24:25 They're the present. And a data warehouse or a data lake is about historical or the past. And what does that latency need to be? Does it need to be zero seconds? I don't think so. I mean, there are applications where zero seconds, where instant is required, mostly having to do with eventing and alerting of some sort.

Starting point is 00:24:42 Most of the time, if you can get it in a minute or two, you can leverage that data inside your historical system with predictive analytics to begin to perform actions on it. This is a very complicated topic that I think is very use case-specific, but there tend to be serious trade-offs that systems designers make

Starting point is 00:24:59 between latency and throughput. If you want higher throughput, you batch. And the reason that you batch is that you don't have as many domain crossings. However, if you look at most systems, you can make the trade-off, meaning you could do low-latency in a data lake and you could do high-throughput

Starting point is 00:25:17 in a data warehouse or vice versa. These are not architectural limitations. They just tend to be the tradeoffs that were made as a result of serving whatever the primary use case is. I've heard a number of these kind of latency throughput tradeoffs and you actually get down to machine level. They are just a result of the tradeoffs that are made on the system going into it.
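
The latency versus throughput trade-off described above can be sketched with a toy cost model. This is an assumption-laden illustration, not a measurement of any real system: it supposes each batch pays one fixed "domain crossing" cost plus a constant per-item cost, and the function name `batch_stats` and both default costs are hypothetical.

```python
def batch_stats(batch_size, crossing_cost_ms=10.0, per_item_ms=1.0):
    """Return (throughput in items per second, time in ms to finish one batch).

    Assumes one fixed crossing cost per batch plus a constant cost per item.
    """
    batch_time_ms = crossing_cost_ms + batch_size * per_item_ms
    throughput = batch_size / batch_time_ms * 1000.0
    return throughput, batch_time_ms


for size in (1, 10, 100):
    throughput, latency_ms = batch_stats(size)
    print(f"batch={size:>3}  throughput={throughput:7.1f}/s  latency={latency_ms:6.1f} ms")
```

Under these assumptions, larger batches spread the fixed crossing cost over more items, so throughput rises, but every item waits for the whole batch to finish, so latency rises too. The same machinery can be tuned toward either end, which is the sense in which this is a design choice and not an architectural limit.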

Starting point is 00:25:36 One of the interesting things that we see is that the point at which you start to have to spend a lot more to get the latency lower is actually lower than people think. I suspect you can get down into the 10-second range with still the sort of throughput optimized architecture. Basically, the throughput optimized architecture, I suspect, will go lower than we expect. What do you imagine will happen with the serving layer?

Starting point is 00:25:57 So your website still needs to operate over that data. Are you imagining that there's just going to be a caching layer or is that going to be a separate system? It depends on what the characteristics of the system need to be. If something needs to be really low latency, today's data warehouses are not always the right solution for that. And so it just depends on the application. Latencies will go down in these products, but to Martin's point, some of the architectural choices make the latency characteristics of a Snowflake somewhat different than, for example, the latency characteristics of a M-Squel.

Starting point is 00:26:27 One of the things that I would like to see more of in the future is Lambda architectures but with off-the-shelf tools. So my data flowing into a more streaming-like system and a more batch-like system so that I can get the best of both worlds. You're making trade-offs when you build these systems. As a user, I want to be able to choose and have both of them. All right. Well, we have one minute left. I'd like to ask a yes or no question for everyone. Will there emerge another major data platform alongside Snowflake, Databricks, Google, AWS, and Azure? We'll start with you, Michelle. Yes or no? Yes.

Starting point is 00:27:05 Bob? What's your time scale? Oh, yeah. Sorry, in the next five years. Yes. Yes. But, you know, it may be relatively small relative to those. Well, I said major. That sounds like a...

Starting point is 00:27:15 But Snowflake was small five years ago. Tristan? I think no. Martin? Yes. All right. Thank you very much, everyone, for joining. This has been a really fun conversation.

Starting point is 00:27:27 Really appreciate all of you being here. I know our audience does as well.
