The a16z Show - The Great Data Debate
Episode Date: March 24, 2022
Over a decade after the idea of "big data" was first born, data has become the central nervous system for decision-making in organizations of all sizes. But the modern data stack is evolving, and which infrastructure trends and technologies will ultimately win out remains to be decided.
In this podcast, originally recorded as part of Fivetran's Modern Data Stack conference, five leaders in data infrastructure debate that question: a16z general partner and pioneer of software-defined networking Martin Casado; former CEO of Snowflake Bob Muglia; Michelle Ufford, founder and CEO of Noteable; Tristan Handy, founder of Fishtown Analytics and leader of the open source project dbt; and Fivetran founder George Fraser.
The conversation covers the future of data lakes, the new use cases for the modern data stack, data mesh and whether decentralization of teams and tools is the future, and how low we actually need to go with latency. And while the topic of debate is the modern data stack, the themes and differing perspectives strike at the heart of an even bigger question: how does technology evolve in complex enterprise environments?
We're re-running this episode as part of a special report on Future.com, the Data50: the World's Top Data Startups, which covers the bellwether private companies across the most exciting categories in data, from AI/ML to observability and more.
Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.
Transcript
Over a decade after the idea of big data was first born, data has become the central nervous system
for decision-making in organizations of all sizes. But the modern data stack is evolving.
And which infrastructure trends and technologies will ultimately win out remains to be decided.
This episode from November 2020 brings together some of the industry's leading experts,
from Snowflake to Fivetran, dbt, a16z and beyond, to debate the future of the modern data stack,
from data lakes versus data warehouses to analytics versus artificial intelligence and machine learning,
to SQL versus everything else, and more. For more on how the data space is evolving and who the key
up-and-coming players are, check out the inaugural Data50 list at future.com/data50.
We highlight and analyze the world's bellwether private companies across the most exciting categories in data,
which in aggregate are valued today at more than $100 billion and have raised approximately $40,
$14 billion in total capital. And now, on to today's episode.
Hi, and welcome to the great data debate. I'm Das, and this is the a16z Podcast. Today's
episode is all about the debates happening around the modern data stack, lakes versus warehouses,
analytics versus artificial intelligence and machine learning, SQL versus everything else, and more.
For other content from our series on modern data businesses, including example blueprints we've shared and a podcast with Databricks that traces the history and evolution of modern data architectures, please see a16z.com slash modern data.
The conversation in this podcast was originally recorded as part of the Modern Data Stack conference hosted by Fivetran.
It's a spirited discussion with a16z general partner and pioneer of software-defined networking, Martin Casado, and four founders who are building different parts
of the modern data stack: Bob Muglia, the former CEO of Snowflake; Michelle Ufford, the founder and CEO
of Noteable; Tristan Handy, the founder of Fishtown Analytics, as well as the leader of the open source
project dbt; and the discussion's moderator, Fivetran founder George Fraser, whose voice is the first
you'll hear. All right. So I'm going to go ahead and kick this off with a spicy topic,
I think, at least spicy in this crowd, which is data lakes. So data lake is a blurry term
used by different people to mean different things.
But for the purposes of this discussion,
let's define data lakes as tabular data,
so tables, rows and columns,
stored in an open source file format like Parquet or ORC
in a public cloud object storage,
like S3 or Google Cloud Storage.
So in a world where we have data warehouses
that use object storage to store their data
and give you some of the advantages of data lakes,
do data lakes still have a place?
Let's start with you.
Martin, does the data lake have a future?
One of the biggest fallacies that we commit as an industry
is we look at an architecture and we're like,
oh, that can do all of these things,
therefore it will be pushed into service to do all of these things.
And that's just not how technology evolves.
We make decisions in the design space
based on the primary use cases that technology is being used for.
And if you look at the use cases
that data warehouses are being used for,
they're largely driven by analytics,
which is a certain workflow,
it's a certain query pattern.
And if you look at what data lakes are used for,
it's actually quite different.
They tend to be more unstructured data,
focused on operational AI, and compute intensive.
And so if you look at the respective technologies,
they're just being optimized in this massive design space
for different use cases.
Architecturally, sure, they can both do what the other one does.
But in the end, you've got products and companies
optimized around use cases.
And I think the operational AI use case is a larger one, and it's growing faster.
So actually, I think over time you can argue that it's the data lake that ends up consuming everything, not the data warehouse.
You're just trying to provoke Bob, Martin.
He's succeeded.
You're watching Bob's face.
All right, Bob, let's hear from you.
Does the data lake have a future?
No.
I see these things very largely converging onto a relational SQL-based model.
And five years from now, data is going to sit behind
a SQL prompt, and SQL data warehouses will replace data lakes from the perspective of storing
structured and semi-structured data. The cloud SQL data warehouses already do everything that is
necessary. And there really is no reason for people to have a separate data lake, except for historical
precedent. A lot of companies come from environments where they had a lot of semi-structured data in a Hadoop
environment and having a data lake is a natural transition. And in a sense, the data lake, which is really
S3 storage, together with a wide variety of any tools you want to put on top of it, is a very
generalized platform. But over time, infrastructure evolves to take on more and more of the use
cases. SQL relational data warehouses have evolved to the point that for structured and
semi-structured data, storage and query, they subsume all of what needs to be done pretty much
today. What remains is images, video, documents, PDFs.
Now, I don't call that unstructured data. I think that's a misnomer. There's no such thing as unstructured
data. All data has structure of some kind. Structured data is tables, rows and columns.
Semi-structured data is like JSON. It's hierarchical in its nature. And I think there's a third category
of data, which is what I call complex data. Images, documents, videos, most things that are streaming
fall into this category. And more and more, machine learning can be applied to the contents of those
data sources to turn them into
semi-structured data that can be used
for building complex data applications
and for doing predictive analytics.
So what's missing in the case
of the data warehouse today is the support
for complex data.
But that's going to come. That's called a
feature. Can you imagine
if you could transact, fully
transact all of these types of
images, videos and things together with
any source of semi-structured data in a data warehouse?
The applications that open
up are remarkable, and that's going to come in the next
two to three years.
I could see images being easily retrieved from the database,
but do you actually see all of the image processing
or the video processing taking place in the database as well?
Not with SQL.
SQL can't do that.
So you'll use procedural logic in Python or something else
to do that at least for now.
In the long run, relational will win too,
but that's probably more like eight to ten years away.
I think we've been waiting for that for 40 years, Bob.
If you still need to be interesting.
If you look over time,
navigational and hierarchical databases in the 1980s got replaced with SQL.
OLAP got replaced with SQL over the last 10 years or so.
We've replaced MapReduce with relational.
So all of these things, relational always wins.
Well, relational wins for the actual retrieval.
But what about the processing? The technology that you need to process images
is fundamentally different from what you need to retrieve data.
Tristan, what are your thoughts on this?
So I completely agree that SQL is going to dominate data processing,
at least a very large chunk of data processing.
But there are different APIs that the data lake and the data warehouse expose.
So there's the file storage layer.
And for a lot of reasons, I believe that an organization will store their files one time.
You will not have a data warehouse copy of the file and the data lake copy of the file,
which in some architectures today, that's what you see.
And so that requires you to have an open source file format that is shared between your data warehouse use cases and your other use cases.
Above that, you have indexing and metadata that is a core part of the data warehouse, but it's also a core part of the data lake.
I think those have to also start to converge so that different use cases can take advantage of the same stuff.
And then you have the SQL prompt.
And maybe at the SQL prompt layer, the data warehouse dominates, but I think you need to allow different access patterns as well.
Because one closed source firm is never going to accomplish literally all data processing use cases in the world.
All of these things should interoperate in an open source and an open format way.
The issues of format have kind of gone away because you can input and output any kind of format and
export into any kind of format very easily. The question is what are the operations that actually
need to be performed against data that sits in a data lake? And today, anything associated with complex
data, the data warehouse can't help you. And so there's a huge reason to have a data lake today.
In 2025, I don't think so. I think that we really have five platforms being created globally:
Snowflake, Databricks, and then the three clouds. Snowflake and Databricks
come from very different places.
Snowflake will always be SQL and declarative in its approach,
and Databricks certainly historically has been procedural and code-based.
So it's a version of SQL versus code in some senses.
And I think you'll see both companies
and pretty much everybody else in the industry offering both within their platforms.
So you've got two technologies that start with different use cases,
somewhat different architectures,
but they're clearly going into a converged point,
which is you have some declarative something
and you have some procedural something,
and whether one's on top or the other,
at the end of the day, they can both do both.
But in the meantime, you have this decade-long journey.
And in that decade-long journey,
there is an architecture that's optimized around use cases.
I mean, the amount of trade-offs and decisions you make
when building one of these systems is...
Yeah, like, TimescaleDB has very different characteristics than Snowflake,
and they are characteristics that are optimized for a workflow.
Yeah, entire companies, focusing on different points in the design space,
with different optimization parameters.
It's actually the use case that drives the technology
because of all of the gravity around it.
And so, again, if it turns out that AI/ML
and operational use is growing quicker,
which it seems to be,
it seems that's more going to dictate the technology
from an architectural standpoint.
Martin, you've said a couple of times now
that the AI/ML space is appearing to grow faster.
I've actually not heard that assertion before.
So broadly two use cases, right?
There's the analytics use case, which is driven by queries and dashboarding.
The other one is creating a complex model from a data scientist and then serving that in production.
That does things like wait time prediction, that does things like fraud detection,
that does things like dynamic pricing.
These are folks who are building complex models on existing data and then coming up with
a bespoke way of serving that.
That is very clearly now turning into a pattern that's being served by a data lake.
Now, it's on a much smaller base, but if you actually
look in the industry, it's a very rapidly growing use case.
Michelle, you've spent time in both the data science community and the analytics community.
And notebooks in many ways are the place where these things sometimes come together.
I'm curious to hear your thoughts about how the two stacks have evolved and maybe they're
converging.
Maybe they're building each other's features and getting more similar.
But where does that take us?
Do we still have two stacks five years hence?
I think we're going to continue to see greater and greater specialization because we're
not going to have the ability or the budget to hire enough data scientists. And so those stacks
are going to continue to evolve, and it's going to be specialized based upon what it is that they're
trying to do. The data lake will have a place, your images, your blob storage, all of those things,
they're probably going to remain in the data lake and have a home there for a long time to come.
I just think it's not going to look like how it looks today. Today it's just been a lack of
understanding around what data do we really need to collect. We went from one extreme to the other.
We weren't collecting any data. Now we're collecting everything because we don't know what's valuable.
And the reality is that's not necessarily a good idea either.
The movement of data, I think we're going to see that stop.
But format is going to be really important.
We need that interop because reprocessing data at scale is just cost prohibitive,
it's time prohibitive.
It's not something that we want to do if we can avoid it.
And I think you're going to see decentralization here at the lower levels where you've
got either business units embedded or you've got your product teams and you've got your data science
teams embedded in those product teams.
You're going to need a unifying layer at the very top in the form of technologies that
make it easier for everybody to query or be able to serve information.
I think that the notebook is probably the best suited for that because it does have the language
agnostic approach.
You can see the ability to look at both data and code and have all of that context,
that rich business context, the visualizations.
We're going to see that evolve as this modern data document.
And we can use that as part of our unifying layer because your data scientists can then work
with or your data analysts can work with SQL, but we can at the end of the day, really hide
all of the code and really get to what is the business implication of
these things that they're doing.
So this really brings us to the second major topic that I wanted to discuss,
which is how do we bring the machine learning, Python, Scala world,
and the analytics SQL, BI tool world together.
There really are two stacks and two communities who sync the exact same data sources
to Delta Lake and to Snowflake simply for operational reasons.
There's not a fundamental technological reason,
but it's just the way the tooling has evolved.
It's too inconvenient to cross that boundary.
And there's essentially three visions of that world.
One is that you're going to put machine learning into SQL,
and probably BigQuery is the furthest along in pursuing this.
You basically create a bunch of UDFs that do your linear algebra stuff.
The other is more the Databricks vision,
where you put SQL into Python or SQL into Scala,
and you use data frames to do that.
And then there's maybe a third vision.
where you use Arrow, the interchange format,
and everything can just talk to each other,
and you can arrange it any way you want.
Which of these visions do you think is going to win?
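As an editorial aside, the first vision (machine learning pushed into SQL through user-defined functions) can be sketched with Python's built-in sqlite3 module. This is only a toy illustration, not how BigQuery or any vendor discussed here implements it: the `predict` function, the `features` table, and the model weights are all hypothetical names and values invented for the example.

```python
import sqlite3

# Hypothetical coefficients, as if a linear model had been fitted elsewhere.
INTERCEPT = 2.0
WEIGHTS = (0.5, 1.5)


def predict(x1: float, x2: float) -> float:
    """Linear model: intercept plus the dot product of weights and features."""
    return INTERCEPT + WEIGHTS[0] * x1 + WEIGHTS[1] * x2


def score_rows(rows):
    """Load (id, x1, x2) rows into an in-memory table and score them in SQL."""
    conn = sqlite3.connect(":memory:")
    # Register the Python function as a two-argument SQL function (a UDF).
    conn.create_function("predict", 2, predict)
    conn.execute("CREATE TABLE features (id INTEGER, x1 REAL, x2 REAL)")
    conn.executemany("INSERT INTO features VALUES (?, ?, ?)", rows)
    scored = conn.execute(
        "SELECT id, predict(x1, x2) FROM features ORDER BY id"
    ).fetchall()
    conn.close()
    return scored
```

The point of the sketch is the calling convention: the analyst stays at the SQL prompt, and the numerical work hides behind a function name.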
What I would like to see win is something like Arrow
so that you have that interop.
You're going to see machine learning moving into SQL
because you're going to have data engineers who are perfectly capable
and have the need to do some anomaly detection
or some interesting regression.
It's within their ability to do that.
Feature engineering is just another big transformation for them.
But they don't have the same background in stats,
And so they can only take it so far.
And then you're going to see on the other side of the spectrum,
your data scientists, where they have all of this really great math background,
and they understand how to do more advanced deep learning.
But they don't have the technology skills,
and SQL is the most successful language for working with data.
And so you're going to really see both of them really become capable of supporting
both use cases.
But ultimately, you'll continue to see specialization here
where the things that you want to do if you're trying to do deep learning
are just fundamentally different than the types of things
that you do if you're just trying to build predictive models.
I think a lot about the Arrow version of the world, and I think that that will end up in the fullness of time dominating.
For the same reason that Martin has been talking about, the tools end up evolving to the personas that they serve and the use cases that they serve.
I want to do all the data prep and feature engineering.
And then I want machine learning models to be trained on top of that.
People do that, certainly.
But the fact that the infrastructures to do those two different things are generally separate creates this big slowness.
It's a purely technical slowness.
And Arrow doesn't solve all of that.
Arrow certainly helps, but there's dumb things like the servers that do those things are in different clouds.
And the interchange fee... what do you call them, interchange fees?
Egress charges.
Egress fees are expensive.
They're criminal.
They're not just expensive.
They're ridiculous.
Right.
As more people do this, it's going to become smoother.
They're going to become more localized.
At the end of the day, there's a reason why you've got multiple languages, and it's not because one is
Turing complete and the other isn't.
The reason is that people build their entire workflow around languages and all of the tools.
And so you're going to have a heterogeneous, fragmented system.
So therefore, you do need to have open interfaces.
Bob?
I'm a big believer at this time in the approach of having multiple systems that interact with common formats.
Arrow is a huge step forward for that, not just because it's an efficient format,
but because it provides a consistent in-memory layout for people to do advanced analytics in their Spark environment.
And it's the way the world is working right now because most customers actually have a data warehouse and an analytics platform separately, and they are connecting them together.
Now, I'm the radical, however, I'm going to continue to be the ultimate radical and declare that the approach that we're taking today in terms of machine learning is still roughly the approach of the internal combustion engine in the automobile.
And the approach that's happening, where Arrow ties together those predictive systems with declarative databases, that's really the creation
of the hybrid, or sort of the Prius era.
Hybrid will dominate for the next, say, three to five years,
and you will see hybrid systems being built by every major vendor.
And so all of them will have a full predictive stack
and a full declarative relational SQL stack built in
using some kind of interface like that.
But that's only until relational actually solves the broader set of problems.
Does that mean that you'll be using SQL functions, predict X?
No.
Ironically, I think that while SQL will dominate well into the 2030s for doing data modeling and data transformation, there's another step beyond that, which is business modeling.
And that needs to be represented in a knowledge graph.
Knowledge graphs are how we'll do predictive analytics in the 2030s.
And what needs to happen is a whole new generation of data system that's based on relational knowledge graphs to create that.
Michelle, you brought up a term earlier that I wanted to follow up on, which is data mesh.
And I wonder if you could define that briefly for everyone, because similar to data lakes versus data warehouses, there's a question whether going forward that's more of a historical phenomenon or an actual good architecture that we want to continue.
Data mesh is really a concept of decentralizing the data processing and the ETL and analytics into each individual business unit and then having some sort of unifying solution at the top.
And it calls for having specialized data teams, having specialized roles, having infrastructure
as a service available to them for data processing, and then having some sort of overarching
standards for it, almost like a federation of data engineers, to ensure that all of your
ETL is consistent, so that as you are trying to do data retrieval and some sort of common
query tool, you'll have that familiarity that you need. We are going to see things like
Arrow really come to the forefront. Sooner rather than later, I think customers are going to demand it
because of all of the challenges that we're currently having.
You've got all the cost of the storage and the processing.
Your teams that are trying to do the processing don't have the domain context that they need.
And so as a result, you have this back and forth, a lot of wasted time,
and you get a lot of data quality errors in the data, multiple times over.
And so ultimately, we really want to take that body of knowledge
and put the technology where that body of knowledge lives.
The data mesh is an attempt to do that.
One part of what the data mesh folks are talking about is how to organize
and how to structure a team to manage data across a large enterprise
with very disparate and important data sources.
That's very, very important.
There's some good ideas in data mesh for that.
Architecturally, data mesh has this sort of odd idea
that data is basically streaming,
and you can use facilities like Kafka to do transforms
as the data is in flight.
And I don't believe that.
I think that that is totally missing the fact
that while there is streaming data,
and you can do quite a bit with data that's simply streaming,
in other words, append only data.
To me, another critical source of data
is transactional data coming out of business systems.
The streaming-based solutions have no answer for that,
and they just sort of pretend that data consistency is unimportant.
And I don't understand that
because I put data consistency at the top of the issues
that I think about when I think about managing data.
Mesh has historically been one of these terms
that conflates architecture with administrative domains
and it did this in service mesh,
and it did this in Wi-Fi meshes and mesh networking, et cetera.
I think Bob is exactly right,
which is there is a very real issue
with separate administration domains,
separate processing domains, separate access to tool sets.
That's very, very different
to building a fully distributed architecture,
which just tends to be hard and inefficient.
And it's often not the people that promote the mesh idea,
but when people hear the term mesh,
they default to full distribution,
which tends to be just a bad way to build systems.
Said like a networking guy.
Having seen this exact same thing happen in other domains
for a couple of decades.
I think all of us are very technology-focused human beings.
And so when we think about data mesh, we tend to think about the architecture part of it.
Bob, I'm glad you pointed out the distributed teams and the people aspect of this.
I think my constant question for the data mesh is, why can't you enable the distributed nature
of what you're talking about with a unified architecture?
My preference is always to have one data set that is very clean and well understood, that we do
not have to move anywhere, that is performant alongside our large batch
analytical processing, which is also working with our data science.
That's the nirvana.
That's the goal is to just have one data storage and then having something that sits
over top of it.
And each of those different things are specialized in each of the different use cases, but you
have one data store.
I feel like the modern data stack keeps swallowing up more and more use cases.
It killed cubes a while ago.
It's mostly killed Hadoop at this point.
It keeps pulling more use cases into its orbit.
because it's fundamentally so flexible and so capable of doing many different things well enough
that you don't really want to buy another system, build another system for that one use case.
What do you think are some of the most interesting, surprising, significant use cases
that may start to get pulled into the orbit of the modern data stack in the next couple of years?
Complex data.
We now have all this very, very interesting stuff that's happening in predictive analytics.
And to me, we've gone from semi-structured data as being the most interesting data sources to now having a wide variety of data sources.
I was talking to a company involved in the medical field yesterday and just the rich amount of data that exists in the images and the doctor's notes.
And all of that is opaque to our systems today.
It will not be in five years.
That will all become part of the modern data stack.
And to me, that's a gigantic transformation into the types of applications that will be created in the years
to come. My last job was I ran marketing for a company and I really went deep into
growth marketing. The problem that you run into there is that you're constantly writing code
to push data back and forth between systems because the different operational systems do different
things and you need the same data in all of them. But no one has yet re-architected
the systems to, in the modern data stack, just take all of the work that you've ingested and
now push it back out to your operational systems.
But I think we're at the beginning of that.
What you're really talking about, Tristan, is the advent of the modern data app,
which basically is an operational application that autonomously can make decisions for the business.
And we've seen very few of those.
I mean, very trivial examples, but boy, will they be significant in the future.
There's really two visions of the data app that I've seen.
One of them is the data app is a separate system,
and you take the important data from your data warehouse and you push it.
And then the other vision is that the data app
is just natively built to run on top of the data warehouse.
I'm curious whether people have opinions about those two models and where they see that going.
It's really the same conversation we've been having about how these things are built.
Data app is predictive analytics that actually takes autonomous action.
It takes the data that would otherwise be presented to a person and instead leverages that
to actually take actions within the business.
They're being built every which way today because there are few good tools to build data apps.
That will not be true in a few years.
One of the things that you run into when you try to build data applications and take action automatically is that latency becomes incredibly important.
Everybody in the ecosystem is battling this right now.
I think there's a lot of different visions of how we're going to crush the latency problem and how low we need it to get.
How low does the latency need to be?
At what point do we have most of the interesting use cases?
People have dozens to hundreds or even thousands of operational systems.
More and more, they're SaaS applications.
they're outside of your organization.
And they're always a source of truth now.
They're the present.
And a data warehouse or a data lake is about historical or the past.
And what does that latency need to be?
Does it need to be zero seconds?
I don't think so.
I mean, there are applications where zero seconds,
where instant is required,
mostly having to do with eventing and alerting of some sort.
Most of the time, if you can get it in a minute or two,
you can leverage that data inside your historical system
with predictive analytics to begin
to perform actions on it.
This is a very complicated topic
that I think is very use case-specific,
but there tends to be serious trade-offs
that systems designers make
between latency and throughput.
If you want higher throughput, you batch.
And the reason that you batch
is that you don't have as many domain crossings.
However, if you look at most systems,
you can make the trade-off,
meaning you could do low-latency in a data lake
and you could do high-throughput
in a data warehouse or vice versa.
These are not architectural limitations.
They just tend to be the tradeoffs that were made
as a result of serving whatever the primary use case is.
I've heard a number of these kinds of latency-throughput tradeoffs,
and when you actually get down to the machine level,
They are just a result of the tradeoffs
that are made on the system going into it.
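As an editorial aside, the batching trade-off described above can be made concrete with a small back-of-the-envelope model. This is a sketch under stated assumptions, not a measurement of any real system: the function name and all default costs (arrival interval, fixed cost per domain crossing, per-item cost) are hypothetical values chosen only for illustration.

```python
def batch_stats(
    batch_size: int,
    arrival_interval_s: float = 0.001,  # assumed gap between incoming items
    crossing_cost_s: float = 0.010,     # assumed fixed cost per domain crossing
    per_item_cost_s: float = 0.0001,    # assumed processing cost per item
):
    """Return (throughput in items/s, worst-case latency in s) for a batch size."""
    # Each batch pays the fixed crossing cost once, plus per-item work.
    batch_time = crossing_cost_s + batch_size * per_item_cost_s
    throughput = batch_size / batch_time
    # The first item in a batch waits for the rest to arrive, then for processing.
    fill_wait = (batch_size - 1) * arrival_interval_s
    worst_latency = fill_wait + batch_time
    return throughput, worst_latency


if __name__ == "__main__":
    for size in (1, 10, 100, 1000):
        tput, lat = batch_stats(size)
        print(f"batch={size:5d}  throughput={tput:9.1f}/s  worst latency={lat:.4f}s")
```

Under these assumed costs, larger batches amortize the crossing cost and raise throughput, while the first item in each batch waits longer. Neither end of the curve is an architectural limit; it is a tuning choice, which is the point being made in the discussion.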
One of the interesting things that we see is that
the point at which you start to have to spend a lot more
to get the latency lower is actually lower than people think.
I suspect you can get down into the 10-second range
with still the sort of throughput optimized architecture.
Basically, the throughput optimized architecture, I suspect,
will go lower than we expect.
What do you imagine will happen with the serving layer?
So your website still needs to operate over that data.
Are you imagining that there's just going to be a caching layer
or is that going to be a separate system?
It depends on what the characteristics of the system need to be.
If something needs to be really low latency,
today's data warehouses are not always the right solution for that.
And so it just depends on the application.
Latencies will go down in these products, but to Martin's point, some of the architectural choices make the latency characteristics of Snowflake somewhat different than, for example, the latency characteristics of MemSQL.
One of the things that I would like to see more of in the future is Lambda architectures but with off-the-shelf tools.
So my data would flow into a more streaming-like system and a more batch-like system so that I can get the best of both worlds. You're making trade-offs when you build these systems;
as a user, I want to be able to choose and have both of them.
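As an editorial aside, the Lambda architecture mentioned here can be sketched in a few lines: the same events feed a periodically recomputed batch view and an incrementally updated speed view, and queries merge the two. This is a minimal in-memory illustration, not an off-the-shelf tool; the class and method names are hypothetical.

```python
from collections import Counter


class LambdaCounts:
    """Toy Lambda architecture that counts events per key."""

    def __init__(self):
        self.master_log = []         # append-only record of every event
        self.batch_view = Counter()  # recomputed from the full log by run_batch
        self.speed_view = Counter()  # incremental counts since the last batch run

    def ingest(self, key):
        """Send one event down both paths: the log and the speed layer."""
        self.master_log.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        """Batch layer: recompute the view from scratch, then reset the speed layer."""
        self.batch_view = Counter(self.master_log)
        self.speed_view.clear()

    def query(self, key):
        """Serving layer: merge the batch result with the recent increments."""
        return self.batch_view[key] + self.speed_view[key]
```

A query is always current because the speed view covers whatever the last batch run has not yet absorbed, while the batch recomputation keeps the long-term result simple to correct and rebuild.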
All right. Well, we have one minute left.
I'd like to ask a yes or no question for everyone.
Will there emerge another major data platform alongside Snowflake, Databricks, Google, AWS, and Azure?
We'll start with you, Michelle. Yes or no?
Yes.
Bob?
What's your timescale?
Oh, yeah.
Sorry, in the next five years.
Yes.
Yes.
But, you know, it may be relatively small relative to those.
Well, I said major. That sounds like a...
But Snowflake was small five years ago.
Tristan?
I think no.
Martin?
Yes.
All right.
Thank you very much, everyone, for joining.
This has been a really fun conversation.
Really appreciate all of you being here.
I know our audience does as well.
