a16z Podcast - The Great Data Debate
Episode Date: March 24, 2022

Over a decade after the idea of "big data" was first born, data has become the central nervous system for decision-making in organizations of all sizes. But the modern data stack is evolving, and which infrastructure trends and technologies will ultimately win out remains to be decided. In this podcast, originally recorded as part of Fivetran's Modern Data Stack conference, five leaders in data infrastructure debate that question: a16z general partner and pioneer of software-defined networking Martin Casado; former CEO of Snowflake Bob Muglia; Michelle Ufford, founder and CEO of Noteable; Tristan Handy, founder of Fishtown Analytics and leader of the open source project dbt; and Fivetran founder George Fraser.

The conversation covers the future of data lakes, the new use cases for the modern data stack, data mesh and whether decentralization of teams and tools is the future, and how low we actually need to go with latency. And while the topic of debate is the modern data stack, the themes and differing perspectives strike at the heart of an even bigger question: how does technology evolve in complex enterprise environments? We're re-running this episode as part of a special report on Future.com, the Data50: the World's Top Data Startups, which covers the bellwether private companies across the most exciting categories in data, from AI/ML to observability and more.
Transcript
Over a decade after the idea of big data was first born, data has become the central nervous
system for decision-making in organizations of all sizes. But the modern data stack is evolving.
And which infrastructure trends and technologies will ultimately win out remains to be decided.
This episode from November 2020 brings together some of the industry's leading experts,
from Snowflake to Fivetran, dbt, a16z, and beyond, to debate the future of the modern data stack,
from data lakes versus data warehouses to analytics versus artificial intelligence and machine learning,
to SQL versus everything else, and more. For more on how the data space is evolving and who the key
up-and-coming players are, check out the inaugural Data 50 list at future.com/data50.
We highlight and analyze the world's bellwether private companies across the most exciting categories in
data, which in aggregate are valued today at more than $100 billion and have raised approximately
$14 billion in total capital. And now, on to today's episode. Hi, and welcome to the great data
debate. I'm Das, and this is the a16z Podcast. Today's episode is all about the debates happening
around the modern data stack, lakes versus warehouses, analytics versus artificial intelligence
and machine learning, SQL versus everything else, and more. For other content from our series on
modern data businesses, including example
blueprints we've shared and a podcast with
Databricks that traces the history and evolution
of modern data architectures,
please see a16z.com/modern-data.
The conversation in this podcast
was originally recorded as part
of the Modern Data Stack conference hosted by
Fivetran. It's a spirited
discussion with A16Z general partner
and pioneer of software-defined networking
Martin Casado, and four founders
who are building different parts of the modern
data stack. Bob Muglia,
the former CEO of Snowflake; Michelle Ufford, the founder and CEO of Noteable;
Tristan Handy, the founder of Fishtown Analytics, as well as the leader of the open source
project dbt; and the discussion's moderator, Fivetran founder George Fraser, whose voice is
the first you'll hear. All right. So I'm going to go ahead and kick this off with a spicy
topic, I think, at least spicy in this crowd, which is Data Lakes. So Data Lakes is a blurry
term used by different people to mean different things. But for the purposes of this discussion,
let's define data lakes as tabular data, so tables, rows and columns, stored in an open-source
file format like Parquet or ORC, in public cloud object storage like S3 or Google Cloud Storage. So in a
world where we have data warehouses that use object storage to store their data and give you
some of the advantages of data lakes, do data lakes still have a place?
Let's start with you, Martin. Does the data lake have a future?
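As a concrete picture of the pattern George just defined, here is a minimal sketch, assuming DuckDB is installed and AWS credentials are available in the environment, of querying tabular Parquet files that live directly in object storage. The bucket, path, and column names are hypothetical placeholders, not anything from the conversation.

```python
# Minimal sketch of the "data lake" pattern described above: tabular data
# stored as open-format Parquet files in cloud object storage, queried in
# place without first loading it into a warehouse. Bucket and columns are
# hypothetical placeholders.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")  # enables reading s3:// paths

top_customers = con.execute(
    """
    SELECT customer_id, SUM(amount) AS total_spend
    FROM read_parquet('s3://example-bucket/orders/*.parquet')
    GROUP BY customer_id
    ORDER BY total_spend DESC
    LIMIT 10
    """
).fetchall()

print(top_customers)
```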
One of the biggest fallacies we commit as an industry is we look at an architecture and we say,
oh, that can do all of these things, therefore it will be pushed into service to do all of
these things.
And that's just not how technology evolves.
We make decisions in the design space based on the primary use cases that technology is being
used for. And if you look at the use cases that data warehouses are being used for, they're largely
driven by analytics, which is a certain workflow. It's a certain query pattern. And if you look at where
data lakes are used, it's actually quite different. They tend to be more unstructured data, focused on operational
AI, compute intensive. And so if you look at the respective technologies, they're just
optimizing this massive design space for different use cases. Architecturally, sure, they can both do
what the other one does. But in the end, you've got products and companies
that are optimized around use cases.
And I think the operational AI use case is a larger one, and it's growing faster.
So actually, I think over time you can argue that it's the data lake that ends up
consuming everything, not the data warehouse.
You're just trying to provoke Bob, Martin.
He's succeeded.
You're watching Bob's face.
All right, Bob.
Let's hear from you.
The data lake doesn't have a future?
No.
I see these things very largely converging onto a relational, SQL-based model.
And five years from now, data is going to sit behind a SQL prompt, and SQL data warehouses
will replace data lakes from the perspective of storing structured and semi-structured data.
The cloud SQL data warehouses already do everything that is necessary.
And there really is no reason for people to have a separate data lake, except for historical precedent.
A lot of companies come from environments where they had a lot of semi-structured data in a Hadoop environment,
and having a data lake is a natural transition.
And in a sense, the data lake, which is really S3 storage, together with a wide variety of any tools you want to put on top of it, is a very generalized platform.
But over time, infrastructure evolves to take on more and more of the use cases.
SQL relational data warehouses have evolved to the point that for structured and semi-structured data, storage and query, they subsume all of what needs to be done pretty much today.
What remains is images, video, documents, PDFs.
Now, I don't call that unstructured data.
I think that's a misnomer.
There's no such thing as unstructured data.
All data has structure of some kind.
Structured data is tables, rows and columns.
Semi-structured data is like JSON.
It's hierarchical in its nature.
And I think there's a third category of data, which is what I call complex data.
Images, documents, videos, most things that are streaming fall
into this category. And more and more, machine learning can be applied to the contents of those
data sources to turn it into semi-structured data that can be used for building complex data
applications and for doing predictive analytics. So what's missing in the case of the data warehouse
today is the support for complex data. But that's going to come. That's called a feature. Can you
imagine if you could transact, fully transact all of these types of images, videos and things,
together with any source of semi-structured data in a data warehouse,
the applications that open up are remarkable,
and that's going to come in the next two to three years.
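To make Bob's point about complex data becoming semi-structured data concrete, here is a rough sketch of that flow: run a model over images and emit newline-delimited JSON that a cloud warehouse can ingest as a semi-structured column. The caption_image() function is a hypothetical stand-in for whatever vision model you would actually use, and the file paths are placeholders.

```python
# Rough sketch of the flow Bob describes: apply machine learning to
# "complex data" (here, images) and emit semi-structured records that a
# SQL warehouse can ingest. caption_image() is a hypothetical stand-in
# for a real vision model; the scans/ directory is a placeholder.
import json
from pathlib import Path

def caption_image(path: Path) -> dict:
    # Placeholder: a real implementation would call a vision model here.
    return {"labels": ["chest x-ray"], "caption": "frontal chest radiograph"}

records = []
for image_path in Path("scans").glob("*.png"):
    extracted = caption_image(image_path)
    records.append(
        {
            "source_file": str(image_path),
            "labels": extracted["labels"],    # semi-structured: a list
            "caption": extracted["caption"],  # semi-structured: free text
        }
    )

# Newline-delimited JSON loads cleanly into warehouse JSON/VARIANT columns.
with open("image_metadata.ndjson", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```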
I could see images being easily retrieved from the database,
but do you actually see all of the image processing
or the video processing taking place in the database as well?
Not with SQL. SQL can't do that.
So you'll use procedural logic in Python or something else to do that,
at least for now.
In the long run, relational will win, too.
But that's probably more like eight to ten years away.
I think we've been waiting for that for 40 years, Bob.
But look what's happened.
If you look over time, navigational and hierarchical databases in the 1980s got replaced with SQL.
OLAP got replaced with SQL over the last 10 years or so.
We replaced MapReduce with relational.
So all of these things, relational always wins.
Well, relational wins for the actual retrieval.
But what about the processing? The technology that you need to process images
is fundamentally different from what you need to retrieve data.
Tristan, what are your thoughts on this?
So I completely agree that SQL is going to dominate data
processing, at least a very large chunk of data processing. But there are different APIs that
the data lake and the data warehouse expose. So there's the file storage layer. And for a lot of
reasons, I believe that an organization will store their files one time. You will not have a data
warehouse copy of the file and the data lake copy of the file, which in some architectures today,
that's what you see. And so that requires you to have an open source file format that is shared
between your data warehouse use cases and your other use cases.
Above that, you have indexing and metadata that is a core part of the data warehouse,
but it's also a core part of the data lake.
I think those have to also start to converge so that different use cases can take advantage
of the same stuff.
And then you have the SQL prompt.
And maybe at the SQL prompt layer, the data warehouse dominates,
but I think you need to allow different access patterns as well,
because one closed source firm is never going to accomplish literally all data processing
use cases in the world.
All of these things should interoperate in an open source and an open format way.
But the issues of format have kind of gone away, because you can import any kind of format
and export into any kind of format very easily.
The question is, what are the operations that actually need to be performed against data that sits in a data lake?
And today, anything associated with complex data, the data warehouse can't help you.
And so there's a huge reason to have a data lake today.
In 2025, I don't think so.
I think that we really have five platforms being created globally,
Snowflake, Databricks, and then the three clouds.
Both Snowflake and Databricks,
while they will come from very different places,
Snowflake will always be SQL and declarative in its approach.
And Databricks certainly historically has been procedural and code-based.
So it's a version of SQL versus code in some senses.
And I think you'll see both companies
and pretty much everybody else in the industry
offering both within their platforms.
So you've got two technologies that start with different use cases,
somewhat different architectures,
but they're clearly going into a converged point,
which is you have some declarative something
and you have some procedural something
and whether one's on top or the other,
at the end of the day, they can both do both.
But in the meantime, you have this decade-long journey.
And in that decade-long journey,
there is an architecture that's optimized around use cases.
I mean, the amount of trade-offs and decisions you make
when building one of these systems is...
Yeah, like, TimescaleDB has very different characteristics than Snowflake.
And they are characteristics that are optimized for a workload.
Yeah, entire companies focusing on different points in the design space with different optimization parameters.
It's actually the use case that drives the technology because of all of the gravity around it.
And so, again, if it turns out that AI/ML and operational use is growing quicker, which it seems to be,
it seems that's more going to dictate the technology from an architectural standpoint.
Martin, you've said a couple of times now that the AI/ML space appears to be growing fast.
I've actually not heard that assertion before.
So broadly, two use cases, right?
There's the analytics use case, which is driven by queries and dashboarding.
The other one is creating a complex model from a data scientist and then serving that in production.
That does things like wait time prediction, that does things like fraud detection, that does things like dynamic pricing.
These are folks that are building complex models on existing data and then coming up with a bespoke way of serving that.
That is very clearly now turning into a pattern that's being served by a data lake.
Now, it's on a much smaller base, but if you actually look in the industry, it's a very
rapidly growing use case.
Michelle, you've spent time in both the data science community and the analytics community.
And notebooks in many ways are the place where these things sometimes come together.
I'm curious to hear your thoughts about how the two stacks have evolved and maybe they're
converging.
Maybe they're building each other's features and getting more similar.
But where does that take us?
Do we still have two stacks five years hence?
I think we're going to continue to see greater and greater specialization because we're not
going to have the ability or the budget to hire enough data scientists. And so those stacks
are going to continue to evolve and it's going to be specialized based upon what it is that they're
trying to do. The data lake will have a place: your images, your raw storage, all of those things
are probably going to remain in the data lake and have a home there for a long time to come.
I just think it's not going to look like how it looks today. Today it's just been a lack of
understanding around what data we really need to collect. We went from
one extreme to the other.
We weren't collecting any data,
now we're collecting everything
because we don't know what's valuable.
And the reality is that's not necessarily a good idea.
The movement of data,
I think we're going to see that stop.
But format is going to be really important.
We need that interop because reprocessing data at scale
is just, it's cost-prohibitive, it's time-prohibitive.
It's not something that we want to do if we can avoid it.
And I think you're going to see decentralization here.
At the lower levels where you've got either business units embedded
or you've got your product teams,
you've got your data science teams embedded in those product teams,
you're going to need a unifying layer at the very top in the form of technologies that make it easier for
everybody to query or be able to serve information. I think that the notebook is probably the best
suited for that because it does have the language agnostic approach. You can see the ability to
look at both data and code and have all of that context, the rich business context, the visualizations.
We're going to see that evolve as this modern data document. And we can use that as part of our
unifying layer, because your data scientists can then work with it, or your data analysts can work with SQL,
but we can at the end of the day really hide all of the code
and really get to what is the business implication
of these things that we're doing.
So this really brings us to the second major topic
that I wanted to discuss,
which is how do we bring the machine learning,
Python, Scala world,
and the analytics SQL BI tool world together?
There really are two stacks and two communities
who sync the exact same data sources
to Delta Lake and to Snowflake
simply for operational reasons.
There's not a fundamental technological reason,
but it's just the way the tooling has evolved.
It's too inconvenient to cross that boundary.
And there's essentially three visions of that world.
One is that you're going to put machine learning into SQL,
and probably BigQuery is the furthest along in pursuing this.
You basically create a bunch of UDFs that do your linear algebra stuff.
The other is more the Databricks vision,
where you put SQL into Python,
or SQL into Scala,
and you use data frames to do that.
And then there's maybe a third vision
where you use Arrow,
the interchange format,
and everything can just talk to each other,
and you can arrange it any way you want.
Which of these visions do you think is going to win?
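To make the third, interchange-format vision concrete, here is a minimal sketch, assuming DuckDB, PyArrow, pandas, and scikit-learn are installed, of a SQL engine handing its result to Python ML code as an Arrow table with no file round-trip. The database file, table, and column names are hypothetical.

```python
# Minimal sketch of the "interchange" vision: SQL does the data prep, and
# Arrow carries the result into Python ML code without a file round-trip.
# The database, table, and column names are hypothetical placeholders.
import duckdb
from sklearn.linear_model import LogisticRegression

con = duckdb.connect("analytics.db")

# SQL side: feature engineering expressed as a query.
features = con.execute(
    """
    SELECT churned, days_active, support_tickets, monthly_spend
    FROM customer_features
    """
).fetch_arrow_table()  # an Arrow table, ready to hand across the boundary

# Python side: train a model on the same in-memory columns.
df = features.to_pandas()
model = LogisticRegression(max_iter=1000)
model.fit(df[["days_active", "support_tickets", "monthly_spend"]], df["churned"])
```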
What I would like to see win is something like Arrow,
so that you have interop.
You're going to see machine learning moving into SQL
because you're going to have data engineers
who are perfectly capable
and have the need to do some anomaly detection
or some interesting regression.
It's within their
ability to do that. Feature engineering is just another data transformation for them. But they don't
have the same background in stats, and so they can only take it so far. And then you're going to
see on the other side of the spectrum, your data scientist, where they have all of this really
great math background, and they understand how to do more advanced deep learning. But they don't
have the technology skills, and SQL is the most successful language for working with data. And so you're
going to really see both of them really become capable of supporting both use cases. But ultimately,
you'll continue to see specialization here, where the things that you want to do if you're trying to do deep learning are just fundamentally different from the things you do if you're just trying to build predictive models.
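Here is a small illustration of the kind of anomaly detection Michelle describes data engineers doing as just another transformation, written entirely in SQL via DuckDB. The table, values, and three-sigma threshold are synthetic and chosen only for the example.

```python
# A toy version of anomaly detection done as a plain SQL transformation:
# flag days whose order count sits more than three standard deviations
# from the mean. The table and values are synthetic.
import duckdb

con = duckdb.connect()

# Thirty days of synthetic order counts with one obvious spike on day 20.
con.execute("""
    CREATE TABLE daily_orders AS
    SELECT DATE '2020-11-01' + i AS order_date,
           CASE WHEN i = 20 THEN 560 ELSE 120 + (i % 7) END AS order_count
    FROM (SELECT range AS i FROM range(30))
""")

anomalies = con.execute("""
    SELECT order_date, order_count
    FROM (
        SELECT *,
               (order_count - AVG(order_count) OVER ()) /
               NULLIF(STDDEV(order_count) OVER (), 0) AS z_score
        FROM daily_orders
    )
    WHERE ABS(z_score) > 3
""").fetchall()

print(anomalies)  # only the spike on 2020-11-21 should be flagged
```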
I think a lot about the Arrow version of the world, and I think that that will end up in the fullness of time dominating.
For the same reason that Martin has been talking about, the tools end up evolving to the personas that they serve and the use cases that they serve.
I want to do all the data prep and feature engineering, and then I want machine learning models to be trained on top of that.
People do that, certainly.
But the fact that the infrastructures to do those two different things are generally separate creates this big slowness.
It's a purely technical slowness.
And Arrow doesn't solve all of that.
Arrow certainly helps.
But there's dumb things like the servers that do those things are in different clouds.
And the interchange fees, what do you call them, interchange fees?
Egress charges.
Egress fees are expensive.
They're criminal.
They're not just expensive.
They're ridiculous.
Right.
As more people do this, it's going to become smoother.
They're going to become more localized.
At the end of the day, there's a reason why you've got multiple languages,
and it's not because one is Turing complete and the other isn't.
And the reason is because people build their entire workflows around languages
and all of the tools.
And so you're going to have a heterogeneous, fragmented system.
So therefore, you do need to have open interfaces.
Bob?
I'm a big believer at this time in the approach of having multiple systems that interact with common formats.
Arrow is a huge step forward for that, not just because it's an efficient
format, but because it provides a consistent in-memory layout for people to do advanced analytics
in their Spark environments. And it's the way the world is working right now, because most
customers actually have a data warehouse and an analytics platform separately, and they are connecting
them together. Now, I'm the radical, however. I'm going to continue to be the ultimate radical and
declare that the approach that we're taking today in terms of machine learning is still roughly
the approach of the internal combustion engine in the automobile. And the approach
that's happening now, where Arrow ties those predictive systems together with declarative
databases, that's really the creation of the hybrid, or sort of the Prius, era. Hybrid will dominate
for the next, say, three to five years. And you will see hybrid systems being built by every
major vendor. And so all of them will have a full predictive stack and a full declarative
relational stack built in using some kind of interface like that. But that's only until relational
actually solves the broader set of problems.
Does that mean that you'll be using SQL functions, predict X?
No.
Ironically, I think that while SQL will dominate well into the 2030s
for doing data modeling and data transformation,
there's another step beyond that, which is business modeling.
And that needs to be represented in a knowledge graph.
Knowledge graphs are how we'll do predictive analytics in the 2030s.
And what needs to happen is a whole new generation of data system
that's based on relational knowledge graphs to create that.
Michelle, you brought up a term earlier that I wanted to follow up on,
which is data mesh.
And I wonder if you could define that briefly for everyone,
because similar to data lakes versus data warehouses,
there's a question whether going forward that's more of a historical phenomenon
or an actual good architecture that we want to continue.
Data mesh is really a concept of decentralizing the data processing
and the ETL and analytics into each individual business unit and then having some sort of
unifying solution at the top. And to do this well requires having specialized data teams,
having specialized roles, having infrastructure as a service available to them for data processing,
and then having some sort of overarching standards for it, almost like a federation of your
data engineers, to ensure that all of your ETL is consistent. So that as you are trying to do
data retrieval in some sort of common query tool, you'll
have that familiarity that you need.
We are going to see things like Arrow really come to the forefront sooner rather than later.
I think customers are going to demand it because of all of the challenges that we're currently
having.
We've got all the cost of the storage and the processing.
Your teams that are trying to do the processing don't have the domain context that they need.
And so as a result, you have this back and forth, a lot of wasted time.
We've got a lot of data quality errors, and we process the data multiple times.
And so ultimately, we really want to take that body of knowledge and put the technology
where that body of knowledge lives.
The data mesh is an attempt to do that.
One part of what the data mesh folks are talking about
is how to organize and how to structure a team
to manage data across a large enterprise
with very disparate and important data sources.
That's very, very important.
There's some good ideas in data mesh for that.
Architecturally, data mesh has this sort of odd idea
that data is basically streaming,
and you can use facilities like Kafka to do transforms
as the data is in flight.
And I don't believe that.
I think that that is totally missing the fact that while there is streaming data,
and you can do quite a bit with data that's simply streaming,
in other words, append only data.
To me, another critical source of data is transactional data coming out of business systems.
The streaming-based solutions have no answer for that,
and they just sort of pretend that data consistency is unimportant.
And I don't understand that because I put data consistency at the top of the issues
that I think about when I think about managing data.
Mesh has historically been one of these terms
that conflates architecture with administrative domains
and it does this in service mesh,
and it did this in Wi-Fi meshes and mesh networking, et cetera.
I think Bob is exactly right,
which is there is a very real issue
with separate administration domains,
separate processing domains, separate access to tool sets.
That's very, very different than building
a fully distributed architecture,
which just tends to be hard and inefficient.
And it's often not the people that promote the mesh idea,
but when people hear the term mesh,
they default to full distribution,
which tends to be just a bad way to build systems.
Said like a networking guy.
Having seen this exact same thing happen in other domains for a couple of decades.
I think all of us are very technology-focused human beings.
And so when we think about data mesh, we tend to think about the architecture part of it.
Bob, I'm glad you pointed out the distributed teams and the people aspect of this.
I think my constant question for the data mesh is,
why can't you enable the distributed nature of what you're talking about with a unified
architecture? My preference is always to have one data set that is very clean and well understood,
that we do not have to move anywhere, that is performant alongside our large batch
analytical processing, which is also working with our data science. That's the nirvana. That's the
goal, is to just have one data storage and then having something that sits over top of it,
and each of those different things are specialized for each of the different use cases, but you have
one data store. I feel like the modern data stack keeps swallowing up more and more use
cases. It killed cubes a while ago. It's mostly killed Hadoop at this point. It keeps pulling more
use cases into its orbit because it's fundamentally so flexible and so capable of doing many
different things well enough that you don't really want to buy another system, build another
system for that one use case. What do you think are some of the most interesting, surprising,
significant use cases that may start to get pulled into the orbit of the modern data stack
in the next couple of years?
Complex data.
We now have all this very, very interesting stuff that's happening in predictive analytics.
And to me, we've gone from semi-structured data as being the most interesting data sources
to now having a wide variety of data sources.
I was talking to a company involved in the medical field yesterday and just the rich amount
of data that exists in the images and the doctor's notes.
And all of that is opaque
to our systems today. It will not be in five years. That will all become part of the modern
data stack. And to me, that's a gigantic transformation into the types of applications that
will be created in the years to come. My last job was I ran marketing for a company and I really
went deep into growth marketing. The problem that you run into there is that you're constantly
writing code to push data back and forth between systems because the different operational
systems do different things and you need the same data and all of them. But no one has yet
re-architected the systems in the modern data stack to just take all of the data that you've ingested
and now push it back out to your operational systems. But I think we're
at the beginning of that. What you're really talking about, Tristan, is the advent of the modern
data app, which basically is an operational application that autonomously can make decisions
for the business. And we've seen very few of those. I mean, very trivial examples so far, but boy, will they
be significant in the future.
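As a rough illustration of the "push it back out" pattern Tristan describes, here is a minimal sketch that takes rows already modeled in the warehouse and syncs them to an operational tool over its API. The endpoint, auth token, and payload shape are hypothetical, not any real product's API.

```python
# Rough sketch of the pattern Tristan describes: take rows already modeled
# in the warehouse and push them back out to an operational system over
# its API. The endpoint, token, and payload shape below are hypothetical
# placeholders, not any real product's API.
import requests

# Imagine these rows came out of a warehouse query, e.g. a model that
# scores each customer for the marketing team.
scored_customers = [
    {"email": "a@example.com", "lifetime_value": 4200, "churn_risk": 0.81},
    {"email": "b@example.com", "lifetime_value": 310, "churn_risk": 0.07},
]

for row in scored_customers:
    response = requests.post(
        "https://api.example-crm.com/v1/contacts",  # hypothetical endpoint
        headers={"Authorization": "Bearer YOUR_API_TOKEN"},
        json={
            "email": row["email"],
            "properties": {
                "lifetime_value": row["lifetime_value"],
                "churn_risk": row["churn_risk"],
            },
        },
        timeout=10,
    )
    response.raise_for_status()
```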
There's really two visions of the data app that I've seen.
One of them is the data app is a separate system, and you take the important data from your
data warehouse and you push it.
And then the other vision is the data app is just natively built to run on top of the
data warehouse.
I'm curious whether people have opinions about those two models and where they see that going.
It's really the same conversation we've been having about how these things are built.
A data app is predictive analytics that actually takes autonomous action.
It takes the data that would otherwise be presented to a person and instead leverages that to
actually take actions within the business. They're being built every which way today because there
are few good tools to build data apps. That will not be true in a few years. One of the things
that you run into when you try to build data applications and take action automatically is that
latency becomes incredibly important. Everybody in the ecosystem is battling this right now.
I think there's a lot of different visions of how we're going to crush the latency problem and
how low we need it to get. How low does the latency need to be? At what point do we have
most of the interesting use cases? People have dozens to hundreds or even thousands of
operational systems. More and more, they're SaaS applications. They're outside of your
organization. And they're always a source of truth now. They're the present. And a data warehouse
or a data lake is about historical or the past. And what does that latency need to be? Does it need
to be zero seconds? I don't think so. I mean, there are applications where zero seconds, or
instant, is required, mostly having to do with eventing and alerting of some sort. Most of the time,
if you can get it in a minute or two, you can leverage that data inside your historical system
with predictive analytics to begin to perform actions on it. This is a very complicated topic that
I think is very use case specific, but there tends to be serious tradeoffs that systems designers
make between latency and throughput. If you want higher throughput, you batch. And the reason that
you batch is that you don't have as many domain crossings. However, if you look at most systems,
you can make the trade-off, meaning you could do low latency in a data lake, and you could do
high throughput in a data warehouse or vice versa. These are not architectural limitations.
They just tend to be the trade-offs that were made as a result of serving whatever the primary
use case is. I've heard a number of these kinds of latency-throughput trade-off discussions,
and when you actually get down to the machine level, they are just a result of the trade-offs that are made
in the system going into it.
One of the interesting things that we see is that the point at which you start to have to
spend a lot more to get the latency lower is actually lower than people think.
I suspect you can get down into the 10-second range with still the sort of throughput-optimized
architecture.
Basically, the throughput-optimized architecture, I suspect, will go lower than we expect.
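A toy sketch of the batching knob Martin and George are describing: a buffer that flushes either when it fills or when the oldest record has waited past a latency budget. The class and numbers are illustrative only, not drawn from any of the systems discussed.

```python
# Toy illustration of the latency/throughput trade-off: batching amortizes
# per-write ("domain crossing") overhead, so bigger batches raise throughput
# while adding latency before a record becomes visible downstream.
# flush() is a stand-in for whatever downstream write you actually do.
import time

class MicroBatcher:
    def __init__(self, max_batch_size: int, max_latency_seconds: float):
        self.max_batch_size = max_batch_size
        self.max_latency_seconds = max_latency_seconds
        self.buffer = []
        self.oldest = None

    def add(self, record):
        if self.oldest is None:
            self.oldest = time.monotonic()
        self.buffer.append(record)
        # Flush when the batch is full OR the oldest record has waited too
        # long; those are the two ends of the trade-off.
        if (len(self.buffer) >= self.max_batch_size
                or time.monotonic() - self.oldest >= self.max_latency_seconds):
            self.flush()

    def flush(self):
        if self.buffer:
            print(f"writing batch of {len(self.buffer)} records")
            self.buffer = []
            self.oldest = None

# Tuned for throughput: large batches, up to ten seconds of added latency.
batcher = MicroBatcher(max_batch_size=10_000, max_latency_seconds=10.0)
for i in range(25_000):
    batcher.add({"event_id": i})
batcher.flush()  # drain whatever is left
```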
What do you imagine will happen with the serving layer?
So your website still needs to operate over that data.
Are you imagining that there's just going to be a caching layer or...
It's not going to be a separate system?
It depends on what the characteristics of the system need to be.
If something needs to be really low latency,
today's data warehouses are not always the right solution for it.
And so it just depends on the application.
Latencies will go down in these products,
but to Martin's point,
some of the architectural choices make the latency characteristics
of a Snowflake somewhat different than, for example,
the latency characteristics of a MemSQL.
One of the things that I would like to see more of in the future
is Lambda architectures but with off-the-shelf tools.
So my data flows into both a more streaming-like system and a more batch-like system so that I can get the best of both worlds.
You're making trade-offs when you build these systems.
As a user, I want to be able to choose and have both of them.
All right.
Well, we have one minute left.
I'd like to ask a yes or no question for everyone.
Will there emerge another major data platform alongside Snowflake, Databricks, Google, AWS, and Azure?
We'll start with you, Michelle.
Yes or no?
Yes.
Bob?
What's your time scale?
Oh, yeah.
Sorry, in the next five years.
Yes.
Yes.
But, you know, it may be relatively small relative to those guys.
Well, I said major.
That sounds like a...
But it's not like it was small five years ago.
Tristan?
I think no.
Martin?
Yes.
All right.
Thank you very much, everyone, for joining.
This has been a really fun conversation.
Really appreciate all of you being here.
I know our audience does as well.