The a16z Show - a16z Podcast: The Storage Renaissance
Episode Date: March 21, 2017
As we enter a new era of distributed computing -- and of big data, in the form of machine and deep learning -- storage becomes (even more) important. It might not be sexy, but storage is what makes the internet and cloud computing go round and round: "Without storage, we wouldn't have databases; without databases, we wouldn't have big data; we wouldn't have analytics ... we wouldn't have anything because information needs to be stored, and it needs to be retrieved." This is especially complicated by the fact that more and more computing is happening at the edge, as with autonomous car sensing. Clearly, storage is important. But now it's also undergoing a renaissance as it becomes faster, cheaper, and more in-memory. What does this mean for all the big players in the storage ecosystem? For CIOs and IT departments? For any company competing on data, whether it's in analyzing it or owning it? And for that matter: What is data, really? Beyond the existential questions, this episode of the a16z Podcast -- with a16z partner Peter Levine; Alluxio (formerly Tachyon) founder and CEO Haoyuan Li (“HY”); and storage industry analyst Mike Matchett of The Taneja Group -- covers all this and more. It even tries to make storage, er, great again.
Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.
Transcript
Hi everyone, welcome to the a16z Podcast. I'm Sonal. Today's episode is all about storage. With the cost of
system memory decreasing, memory for both storage and compute will be the exact same thing. So as we enter a
new era of distributed computing, and what Peter has also argued in a popular deck is the end of the cloud,
how does storage evolve? How is this affected by trends in computing such as machine and deep learning?
Joining us to have this conversation today are HY, CEO and co-founder of Alluxio, formerly Tachyon,
which came out of the UC Berkeley AMPLab,
the birthplace of other industry-defining technologies
such as Spark and Mesos;
general partner Peter Levine,
who has funded memory-centric infrastructure companies
at every level of the Berkeley Data Analytics Stack,
the BDAS ("badass") stack;
and Mike Matchett, senior analyst at Taneja Group,
which covers everything related to big data,
compute, and storage.
Okay, so that's the intros.
To kick things off,
I just have to ask,
why should we care about storage?
I feel like it's a dark underbelly
of computing that no one really cares about.
Look, I mean, while storage may be the underbelly,
without storage computers wouldn't work.
And so it's one of the most important, you know,
compute, networking, and storage are the three fundamental elements
of what makes the entire internet work.
It makes cloud computing work.
And without storage, you wouldn't have databases.
And without databases, you wouldn't have big data.
You wouldn't have analytics.
You wouldn't have anything because information needs to be stored
and it needs to be retrieved.
So storage is hugely, hugely important.
And, you know, the interesting thing is I think we're in a very transformative period of time here where storage is undergoing a bit of a renaissance.
And I think it's going to transform how computing and applications work in the not too distant future.
When I got started in storage, I thought, hey, this is really the staid, the tried and true stuff.
Compute was where it was at. You know, there were all these advantages and advances happening:
client-server and then cloud and new chips coming along every year. But the more I got into storage,
the more I figured out that storage is really the most complex part of that equation. It takes a lot
of effort to protect data, to manage data. Data has gravity. It has momentum. It has weight and
history. So storage is really the critical piece to get right.
Wait, what do you mean when you say
that data has gravity and momentum?
Data has to live somewhere. You know, compute can be spun up in a
cloud. It's a little ephemeral. You can repeat it. You can spin up and down virtual machines.
But data actually has to have a footprint somewhere and that footprint has to be persisted and
protected and secured. And of course, in this case, made accessible or the data has no value.
Right. But how is that different than what we have right now, that it requires a new form of
storage? I'm going to use a phrase that I say a lot in the podcast, but I think it's actually
really true of our times, which is when we say there's a lot more data, that's a difference of
degree, not just kind. Why do we need a different type of solution? Why can't we just keep doing
the same things that we were doing before, but just do it bigger and better?
So I think this is like many other things in the world. When cell phones were first made, a cell phone
could just make a phone call, and now the cell phone is a different type of cell phone. A similar thing is also
happening in the storage industry as well. At the very beginning, storage was block devices:
just bits, raw data. Beyond the block devices, we had file systems, all different
types of file systems. Now we have blob storage, object storage, and in the meantime, we have so much
innovation in the open source area as well. And you have public cloud storage from several huge
vendors in the world like Amazon, Google, Microsoft, Alibaba, etc. You have different types of
storage solutions provided by traditional vendors like EMC, HPE, and IBM. They're providing
these new innovations.
I think that will pale in
comparison to what's going to happen over the next, let's say, decade here. When we think about
information and data, there's an entirely new phenomenon that's really just kicked in relative to
what is data.
What is data? That's a very existential question.
Well, up until right now,
computer data has largely been input by some human being typing on a keyboard, or a database
retrieving a record. That's the input of a human asking the computer something. That data largely
has been put in there through human fingers and through some human interaction. Fast forward to right now:
let's just talk about a self-driving car that has sensors. Those sensors are now inputting data
that's the world around us. And so there are completely new types of data. So what data is now far exceeds
human-input data. We are now collecting the world's information via sensors, and all of that
needs to be processed and stored. It will be literally orders of magnitude, in the exact
mathematical sense, orders of magnitude, more data that needs to be stored and processed. So that's
sort of point one on what's happening. Secondly, the mobile supply chain is influencing the data
center and influencing the cost curves in the data center for storage. You take a mobile phone and
you take the components of that mobile phone and put it in the data center, you have a very
inexpensive storage substrate that is far less expensive than the enterprise systems that we saw
in the past. And so the cost curves come way down. We will have much more in-memory data,
systems that literally live in real memory. And the notion
of disk drives and tape drives and even SSDs will all go away. I believe that there's a future
here where memory architecture is completely flattened. Computing has been built on slow and
cheap, and fast and expensive. And I believe that we're going to be fast and cheap.
I mean, it sounds like it'd be obvious, but what does fast and cheap really do for us when it comes
to the so-called storage renaissance?
Fast and cheap means that we can collect massive amounts of
information, put it in memory, not have to put it out to a disk drive and do all these backflips
to get data to work correctly. It's going to all be in memory and it will be very inexpensive.
And that's the renaissance in whether you call it storage, but more importantly, data and the
importance of data and the correlation between the volumes of data and the price curves in
the not too distant future for what I'll call storage, even though it's memory. And those
pieces coming together, that to me is the renaissance that's happening in computing.
I totally agree. And actually, just to add one more point to that: we actually should view
memory as a front tier of storage.
Exactly. It's a tier of storage. Exactly. I would argue that it is
the tier of storage, that over time there is no other storage. It's just memory.
Certainly,
we've been seeing the rise of memory-class storage already being talked about by vendors,
and bringing persistence to memory will completely overhaul how compute and storage are envisioned today,
because tomorrow data is going to live in the compute devices.
Those are going to be more Internet of Things devices and be far more distributed as well.
But I also want to temper that with the thought that we've also talked to a lot of these big storage vendors,
and they're forecasting that there just simply isn't going to be enough storage for all the data we're collecting
in a midterm horizon, like three to five years,
that we're creating so much data,
there won't be enough chips,
there won't be enough hard drives,
there won't be enough tape.
There simply isn't going to be enough storage out in the world,
which I do want to point out means
that there's still some opportunities for things
in storage management, for people to consider
how many copies of that data am I making?
Do I have to take the compute,
the processes out to where the data lives?
Do I have to bring the data centralized
and make copies of it,
or can I do something more optimized
with how I organize my architecture
and only store the data once,
only compute data once in one place.
Just to draw a fine point here,
we are entering a new world of distributed computing.
And if you think about the new world of distributed computing,
the data that gets collected in a self-driving car
or some endpoint is going to be...
The information will be processed at that endpoint.
It won't be transmitted back to a central storage pool.
The information will be curated
and then transmitted back.
I would be collecting massive amounts of information.
I mean, a self-driving car collects 10 gigabytes of data a mile, right?
Like some ridiculous amount of data.
You know, there's not enough storage on the planet to ever hold all that information.
So the curation is going to occur at the edge close to the compute and the quote-unquote storage
will be processed at the edge.
And then important information will come back to some
centralized data store. But all that computation at the edge and the storage of it and kind of
the permutations of it is exactly the renaissance that I believe needs to happen in storage,
even to process this stuff.
So another huge issue here is that it has made the ecosystem much
more complex than before. All the big enterprise companies will try different innovations.
They will have their existing storage and adopt new storage, forming very complex
systems, which makes it hard to manage, hard to consume, and in many cases not cost effective
as well.
And one thing we're seeing requested by customers, many big enterprises in the
world, is: how can they consume this data from different storage systems easily
and manage it efficiently?
So is this just because they have a hodgepodge of, like, all these different storage systems,
or is it that it's just buried in the same place but under a bunch of
different interfaces and tools, or, like, what's the problem really?
The data is stored in different storage systems.
Just to give you a very concrete example:
if you talk to one department, their data is stored in public cloud storage,
maybe in Amazon S3 or Google Cloud Storage.
Another department has some data stored in EMC storage or HPE storage.
And you have another department that says, "I have my own private cloud storage." Another group
wants to analyze the data inside the enterprise.
At the end of the day, why do people want data?
People want to use data to generate value,
which means using data to make decisions or facilitate making decisions.
The more data you have and can analyze, the better the results you normally get.
So this existing environment is very hard.
We're trying to tackle this by treating memory as a first-class citizen in the storage system.
We have a memory-centric architecture, and beyond that, we look at how to manage the data
and how to access the data from different storage systems
in the most effective manner.
Let's pause on that for a quick moment.
Why does the in-memory aspect matter?
So one side is, of course, about performance.
On performance, memory
is much faster than SSD or HDD.
And from the other perspective,
it's also about the cost.
The cost is decreasing very fast.
About every 18 months,
the cost decreases by 50%.
So those are the two points:
performance plus the cost,
which drives the capacity.
With those two points
together, now is the right time to build memory as a tier of storage.
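As a rough illustration of that cost curve, here is a minimal sketch of the arithmetic, assuming the price of memory really does halve every 18 months as stated in the conversation. The function name `projected_cost` and the starting price of 100 are hypothetical, chosen only to show the shape of the curve; this is not a forecast.

```python
def projected_cost(initial_cost: float, months: float,
                   halving_period_months: float = 18.0) -> float:
    """Cost after `months`, assuming it halves once per halving period.

    The 18-month default comes from the figure quoted in the conversation;
    it is an assumption, not a measured market rate.
    """
    return initial_cost * 0.5 ** (months / halving_period_months)


if __name__ == "__main__":
    # Hypothetical starting price of 100 (arbitrary units).
    for years in (0, 3, 6, 9):
        print(years, "years:", projected_cost(100.0, years * 12))
```

Under this assumption, the cost falls to a quarter of its starting value after three years (two halvings) and to one sixty-fourth after nine years (six halvings), which is the kind of decline that makes a large in-memory tier plausible.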
I think that machine learning is the application that unlocks much of the new in-memory systems.
I mean, machine learning is the next generation of big data.
And what is machine learning?
It's iterations over large, large, large data sets to come up with better ways to forecast and better ways to utilize information.
The only way to unlock the power of machine learning in a time-sensitive fashion
is to actually do these computations in memory.
Because if you have to, what's called, go out to the disk drive to get information or go out to an SSD,
the time to seek for that information and look for it is a huge penalty when you're dealing with massive amounts of information.
To the extent it's all in memory, I can operate on it very quickly and do many more
iterative sets in a shorter time frame, giving the results that might be needed in a machine
learning or AI environment. Unless we get to in-memory processing and in-memory data structures,
machine learning doesn't really work. Yeah. And so we have to come up with these ways of having
much more in-process, high-fidelity storage that is this new tier. And
this is exactly what's causing what I would argue is this renaissance to occur
over the next several years here.
I'll just add that supercomputing a couple years ago
was really inaccessible to most people.
And in a supercomputing environment,
every node is highly networked to every other node
because they needed to communicate
state and information between them.
When you had Hadoop and MapReduce,
they could partition certain categories of problems
and run them in parallel,
but not really a lot of machine learning algorithms;
they just don't work that way.
They require more of that interconnectedness
and that communication between nodes
and between memory and data sets
and lots of iterations on the same data.
So Spark and in-memory approaches to machine learning
really accelerate the opportunity
to create and apply machine learning algorithms
to just about every facet of human existence,
not to overstate the case,
but there really is a huge,
huge renaissance just from that alone coming.
So I find all this fascinating in terms of the evolution of computing.
But the question I have is, how does this actually affect people?
Like, how does it change for better or worse their work and their work practice?
Today, if people want to really see the global data, they move the data manually from one storage system to another to put it together and analyze it.
Even though the data could be in memory in the final storage,
because of this manual process, this process of moving data around,
first of all, it's very hard to manage.
Secondly, the whole process is very time-consuming.
It could be easily like weeks or even longer.
So that makes things much harder.
And data has less value.
So that's another huge issue we're seeing.
Yeah, I mean, you need a new abstraction layer
when there's an old world and you have a new world
and you don't need to be stuck in this old model
of how you, in this case, store and view your data.
But what does that mean on the design side,
on the interface side for people?
Exactly.
So the issue today is that data is really stored
in different data silos.
There are so many different types of storage,
and they have different types of interfaces,
which makes it very hard for the application level to consume easily.
That's a big issue.
Think of virtualization.
In the compute side,
we have a virtualization technology
to really virtualize the compute resource
and to be able to leverage the resource more efficiently.
Also, think of the internet protocol stack.
In the middle, you have the IP layer, which is really the narrow waist.
When you make innovations in the upper layers, you don't need to worry about the lower layers.
So similarly, from the storage ecosystem perspective, we should build a layer to abstract
different storage systems and then present a unified or standard API to the upper layer
with a global namespace.
From the user perspective,
they will be able to access
the data from different storage systems
very easily.
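To make the idea of an abstraction layer with a global namespace concrete, here is a minimal sketch. It is not Alluxio's API or any vendor's; every name in it (`Store`, `DictStore`, `UnifiedNamespace`, the mount prefixes, and the sample paths) is hypothetical. It assumes each backing system can be reduced to a simple `read(key)` call, mounts each one under a path prefix, and keeps an in-memory copy of anything that has been read.

```python
from typing import Dict, Protocol


class Store(Protocol):
    """Hypothetical minimal interface every backing storage system exposes."""

    def read(self, key: str) -> bytes: ...


class DictStore:
    """Stand-in for one real storage system (cloud bucket, on-prem array, ...)."""

    def __init__(self, objects: Dict[str, bytes]) -> None:
        self._objects = objects

    def read(self, key: str) -> bytes:
        return self._objects[key]


class UnifiedNamespace:
    """One read API over many stores, with a simple in-memory tier in front."""

    def __init__(self) -> None:
        self._mounts: Dict[str, Store] = {}
        self._memory: Dict[str, bytes] = {}  # in-memory tier, keyed by full path

    def mount(self, prefix: str, store: Store) -> None:
        self._mounts[prefix.rstrip("/")] = store

    def read(self, path: str) -> bytes:
        if path in self._memory:  # served from memory, no backend access
            return self._memory[path]
        # Try the longest mount prefix first so nested mounts resolve correctly.
        for prefix in sorted(self._mounts, key=len, reverse=True):
            if path.startswith(prefix + "/"):
                data = self._mounts[prefix].read(path[len(prefix) + 1:])
                self._memory[path] = data
                return data
        raise FileNotFoundError(path)


if __name__ == "__main__":
    ns = UnifiedNamespace()
    ns.mount("/cloud", DictStore({"sales/2017.csv": b"cloud data"}))
    ns.mount("/onprem", DictStore({"logs/app.log": b"on-prem data"}))
    print(ns.read("/cloud/sales/2017.csv"))
    print(ns.read("/onprem/logs/app.log"))
```

The application only ever sees paths such as `/cloud/...` or `/onprem/...`; which system actually holds the bytes is a detail of the mount table, much as the IP layer hides the network underneath it. A real system would also need writes, eviction, and consistency handling, none of which this sketch attempts.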
And this, coupled with the in-memory technology
as well as a smart algorithm
to intelligently move the data,
can solve a lot of issues for the users:
the performance issue, the cost issue,
the performance-cost issue,
and also the issue of unifying data
silos. So that's also one direction we're trying.
Is this something we're seeing beyond one
company? Like, what's sort of the broad industry-level view of all this?
There's certainly a big need
for unifying storage today. Most organizations have a hodgepodge, an organically grown
set of storage systems and applications stretching all the way back to their mainframes.
There are an awful lot of mainframes still out there, believe it or not. And when we look at what a
company's real assets are today, oftentimes, as I think, again, Peter's pointing out, it's in the
data that they have, but the data collectively and in aggregate, in order to be able to analyze it
and make predictions out of it, you want a cohesive whole. So solutions that can bring all
that data together in an integrated analysis and support actually pipelines of prediction and
feedback loops of analytics and visualization and really tighten the knot, if you will,
are going to be extremely valuable.
And those that can leverage the existing infrastructure and the existing data stores without upsetting
the apple cart are the ones that are going to be adopted for it.
How does it affect the IT department or the CIO?
Like, how should they think about this shift?
IT departments in general are facing a lot of change.
It started with this idea of cloud and treating their internal customers
as clients and seeing themselves as service providers. That's a big shift in culture and
approach and temperament. With infrastructure and the idea of being able to bring machine learning
and really leverage companies' data sets in a myriad of ways, they have to become experts in that
data, in those applications, because at the department level or division level, on the business
analyst side, you're going to find lots of people who know how to use Microsoft Excel.
So IT departments are really going to be looked at as the point of the sword in bringing these
advances to the company and not just being seen as reactive operators of infrastructure on the
back end. And that's a big shift.
Look, I mean, data is the lifeblood of any organization. And so it doesn't matter whether
you're a web developer or consumer products or health care organization.
Fundamentally, we are all becoming data-driven organizations.
Big data up until right now has really been a reactive process,
even querying a database.
Like I go look at what's happened in the past.
And the holy grail of all computing and the holy grail of what a CIO ultimately cares about
and what a business cares about is that I can predict the future.
If I know what's going to happen tomorrow, whether it's inventory,
health care, finance, and I know what's going to happen tomorrow or next week or next month,
and I can accurately predict that. That is the holy grail of computing.
And I'll even accelerate that. It's not about predicting tomorrow for a lot of people.
It's about predicting what's going to happen next in terms of what the user's going to click on.
Where does the car turn? How does that rocket land on its legs? That sort of thing.
It's becoming much more of a real-time proposition.
We're on the cusp now of moving from historical to future prediction, and it rides on the back of boring,
re-created storage for in-memory architectures and machine learning coming together to provide
this very unique and very interesting capability that, quite frankly, has not happened yet
in the history of computing.
I can't believe this. You've just made storage sexy again.
So one last note to wrap up. What comes next?
Well, I think we mentioned the storage-class memory that's coming out, Intel's 3D XPoint,
and some other folks making a very fast, close-to-the-chip memory that's actually storage.
So it'll be faster again than flash, maybe a little bit slower than today's DRAM,
but it's going to fill in that gap.
And that's going to be another interesting tier of storage.
It's going to change a lot of the way computing works.
And that's coming out already.
We're going to see that in the next couple of quarters.
I also would introduce in the longer term something to think about,
and that is there are companies today that should be,
from a contractual perspective, perhaps,
locking in their rights to data up and down their supply chain,
especially with the Internet of Things.
I think there's going to be some companies kind of surprised in a year or two
to find that they're not going to have access to the data
that they're going to really want to have
that's relevant to what they need to do to make the predictions
to optimize their business.
And they're going to call it a kind of data poverty or paucity.
And we're going to find companies that are forward-looking and data-rich.
And some companies that have suddenly discovered they're sitting on the outside a little bit and data poor.
Yeah, that's especially fascinating because we talk a lot on this podcast about the role of data
in building businesses, like whether it's data network effects or data in machine learning startups.
We just talk a lot about how data is increasingly an advantage in a lot of businesses.
If we can have faster access to data, easier management of data, a complete view of the data, and a smarter way of analyzing the data,
I think it's a very exciting moment, and many more innovations will come out along the way.
Okay, well, thank you guys for joining the a16z Podcast.
