The a16z Show - a16z Podcast: The Storage Renaissance

Starting point is 00:00:00 Hi everyone, welcome to the a16z Podcast. I'm Sonal. Today's episode is all about storage. With the cost of system memory decreasing, memory for both storage and compute will be the exact same thing. So as we enter a new era of distributed computing, and what Peter has also argued in a popular deck is the end of the cloud, how does storage evolve? How is this affected by trends in computing such as machine and deep learning? Joining us to have this conversation today are H.Y. Li, CEO and co-founder of Alluxio, formerly Tachyon, which came out of the UC Berkeley AMPLab, the birthplace of other industry-defining technologies such as Spark and Mesos,

Starting point is 00:00:33 general partner Peter Levine, who has funded memory-centric infrastructure companies at every level of the Berkeley Data Analytics Stack, the BDAS stack, and Mike Matchett, senior analyst at Taneja Group, which covers everything related to big data, compute, and storage. Okay, so that's the intros.

Starting point is 00:00:48 To kick things off, I just have to ask, why should we care about storage? I feel like it's a dark underbelly of computing that no one really cares about. Look, I mean, while storage may be the underbelly, without storage, computers wouldn't work. And so it's one of the most important, you know,

Starting point is 00:01:04 compute, networking, and storage are the three fundamental elements of what makes the entire internet work. It makes cloud computing work. And without storage, you wouldn't have databases. And without databases, you wouldn't have big data. You wouldn't have analytics. You wouldn't have anything because information needs to be stored and it needs to be retrieved.

Starting point is 00:01:22 So storage is hugely, hugely important. And, you know, the interesting thing is I think we're in a very transformative period of time here where storage is undergoing a bit of a renaissance. And I think it's going to transform how computing and applications work in the not too distant future. When I got started in storage, I thought, hey, this is really the staid, the tried and true stuff. Compute was where it was at. You know, there were all these advantages and advances happening: client-server and then cloud and new chips coming along every year. But the more I got into storage, the more I figured out that storage is really the most complex part of that equation. It takes a lot of effort to protect data, to manage data. Data has gravity. It has momentum. It has weight and

Starting point is 00:02:13 history. So storage is really the critical piece to get right. Wait, what do you mean when you say that data has gravity and momentum? Data has to live somewhere. You know, compute can be spun up in a cloud. It's a little ephemeral. You can repeat it. You can spin up and down virtual machines. But data actually has to have a footprint somewhere, and that footprint has to be persisted and protected and secured. And of course, in this case, made accessible, or the data has no value. Right. But how is that different than what we have right now, that it requires a new form of storage? I'm going to use a phrase that I say a lot in the podcast, but I think it's actually really true of our times, which is when we say there's a lot more data, that's a difference of

Starting point is 00:02:53 degree, not just kind. Why do we need a different type of solution? Why can't we just keep doing the same things that we were doing before, but just do it bigger and better? So I think this is like many other things in the world. Take the cell phone: at the beginning, a cell phone just made phone calls, and now the cell phone is a different type of cell phone. A similar thing is happening in the storage industry as well. At the very beginning, storage was block devices: just bits, raw data. Beyond block devices, we had file systems, all different types of file systems. Now we have blob storage, object storage, and in the meantime, we have so much innovation in the open source area as well. And you have public cloud storage from several huge

Starting point is 00:03:35 vendors in the world like Amazon, Google, Microsoft, Alibaba, etc. You have different types of storage solutions provided by traditional vendors like EMC, HPE, and IBM; they're providing these new innovations. I think that will pale in comparison to what's going to happen over the next, let's say, decade here. When we think about information and data, there's an entirely new phenomenon that's really just kicked in relative to what is data. What is data? That's a very existential question. Well, up until right now, compute data has largely been input by some human being typing on a keyboard or a database recovering a record. That's the input of a human asking the computer something. That data largely

Starting point is 00:04:25 has been put in there through human fingers and through some human interaction. Fast forward to right now, let's just talk about a self-driving car that has sensors. Those sensors are now inputting data that's the world around us. And so there are completely new types of data. So what data is now far exceeds the human-input data. We are now collecting the world's information via sensors, and all of that needs to be processed and stored. It will be literally orders of magnitude, in the exact mathematical sense, orders of magnitude, more data that needs to be stored and processed. So that's sort of point one on what's happening. Secondly, the mobile supply chain is influencing the data center and influencing the cost curves in the data center for storage. You take a mobile phone and

Starting point is 00:05:19 you take the components of that mobile phone and put it in the data center, you have a very inexpensive storage substrate that is far less expensive than the enterprise systems that we saw in the past. And so the cost curves come way down. We will have much more in-memory data, systems that literally live in real memory. And the notion of disk drives and tape drives and even SSDs will all go away. I believe that there's a future here where memory architecture is completely flattened. Computing has been built on slow and cheap, and fast and expensive. And I believe that we're going to be fast and cheap. I mean, it sounds like it'd be obvious, but what does fast and cheap really do for us when it comes

Starting point is 00:06:07 to the so-called storage renaissance? Fast and cheap means that we can collect massive amounts of information, put it in memory, not have to put it out to a disk drive and do all these backflips to get data to work correctly. It's going to all be in memory and it will be very inexpensive. And that's the renaissance in whether you call it storage, but more importantly, data and the importance of data and the correlation between the volumes of data and the price curves in the not too distant future for what I'll call storage, even though it's memory. And those pieces coming together, that to me is the renaissance that's happening in computing. I totally agree. And actually, just to add one more point to that: we should view

Starting point is 00:06:53 memory as a front tier of storage. Exactly. It's a tier of storage. Exactly. I would argue that it is the tier of storage, that over time there is no other storage. It's just memory. Certainly, we've been seeing the rise of memory-class storage already being talked about by vendors, and bringing persistence to memory will completely overhaul how compute and storage are envisioned today, because tomorrow data is going to live in the compute devices. Those are going to be more Internet of Things devices and be far more distributed as well. But I also want to temper that with the thought that we've also talked to a lot of these big storage vendors, and they're forecasting that there just simply isn't going to be enough storage for all the data we're collecting

Starting point is 00:07:41 in a midterm horizon, like three to five years, that we're creating so much data, there won't be enough chips, there won't be enough hard drives, there won't be enough tape. There simply isn't going to be enough storage out in the world, which I do want to point out means that there's still some opportunities for things

Starting point is 00:07:56 in storage management, for people to consider: how many copies of that data am I making? Do I have to take the compute, the processes, out to where the data lives? Do I have to bring the data centralized and make copies of it, or can I do something more optimized with how I organize my architecture

Starting point is 00:08:13 and only store the data once, only compute data once in one place. Just to draw a fine point here, we are entering a new world of distributed computing. And if you think about the new world of distributed computing, the data that gets collected in a self-driving car or some endpoint is going to be... The information will be processed at that endpoint.

Starting point is 00:08:36 It won't be transmitted back to a central storage pool. The information will be curated and then transmitted back. We will be collecting massive amounts of information. I mean, a self-driving car collects 10 gigabytes of data a mile, right? Like some ridiculous amount of data. You know, there's not enough storage on the planet to ever hold all that information. So the curation is going to occur at the edge close to the compute and the quote-unquote storage

Starting point is 00:09:04 will be processed at the edge. And then important information will come back to some centralized data store. But all that computation at the edge and the storage of it and kind of the permutations of it is exactly the renaissance that I believe needs to happen in storage, even to process this stuff. So another huge issue here is that it has made the ecosystem much more complex than before. All the big enterprise companies, they will try different innovations. They will have their existing storage and new storage, which forms a very complex system and makes it hard to manage, hard to consume, and in many cases is not cost effective

Starting point is 00:09:48 as well. And one thing we're seeing requested by customers, many big enterprises in the world, is: how can they consume this data from different storage systems easily and manage it efficiently? So is this just because they have a hodgepodge of like all these different storage systems, or is it that it's just buried in the same place but under a bunch of different interfaces and tools, or like what's the problem really? The data is stored in different storage systems.

Starting point is 00:10:15 Just to give you a very concrete example. If you talk to this department, their data is stored in public cloud storage, probably Amazon S3 or Google Cloud Storage. And another department, they have some data stored in EMC storage, HPE storage. And you have another department that says, I have my own private cloud storage. Another group, they want to analyze the data inside the enterprise. At the end of the day, why do people want data? People want to use data to generate value,

Starting point is 00:10:45 which means use data to make decisions or facilitate making decisions. The more data you have and can analyze, the better the result you normally get. So this existing environment is very hard. We're trying to tackle it by taking memory as a first-class citizen in the storage system. We have a memory-centric architecture, and beyond that, how to manage the data, or how to access the data from different storage systems in the most effective manner. Let's pause on that for a quick moment.

Starting point is 00:11:12 Why does the in-memory aspect matter? So one side is, of course, about performance. On performance, memory is much faster than SSD or HDD. And from the other perspective, it's also about the cost. The cost is decreasing very fast. It's about every 18 months.

Starting point is 00:11:29 The cost decreases by 50%. So those are the two points: performance plus cost, which drives capacity. With those two points together, now is the right time to build memory as a tier of storage. I think that machine learning is the application that unlocks much of the new in-memory systems. It's not so, I mean, machine learning is the next generation of big data.
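
The cost trend mentioned above (roughly a 50% drop about every 18 months) compounds quickly. As a rough worked example, assuming the halving rate stays constant, which the speakers do not claim, the relative cost after a given number of months could be sketched like this. The function name and the sample horizons are illustrative only.

```python
def relative_cost(months: float, halving_period_months: float = 18.0) -> float:
    """Cost relative to today after `months`, assuming cost halves every
    `halving_period_months` (the 18-month figure is the one quoted in the
    conversation; holding it constant is an assumption)."""
    return 0.5 ** (months / halving_period_months)


# Print the projected relative cost at a few illustrative horizons.
for years in (0, 1.5, 3, 6):
    fraction = relative_cost(years * 12)
    print(f"after {years} years: {fraction:.4f} of today's cost")
```

Under that assumption, the same memory capacity would cost a quarter of today's price after three years, which is the kind of curve behind the "memory as a tier of storage" argument.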

Starting point is 00:11:57 And what is machine learning? It's iterations over large, large, large data sets to come up with better ways to forecast and better ways to utilize information. The only way, so in order to unlock the power of machine learning in a time-sensitive fashion, is to actually do these computations in memory. Because if you have to what's called go out to the disk drive to get information or go out to an SSD, the time to seek for that information and look for it is a huge penalty when you're dealing with massive amounts of information. To the extent it's all in memory, I can operate on it very quickly and do many more iterative sets in a shorter time frame, giving the results that might be needed in a machine

Starting point is 00:12:45 learning or AI environment. Unless we get to in-memory processing and in-memory data structures, machine learning doesn't really work. Yeah. And so we have to come up with these ways of having much more in-process high-fidelity storage that is this new tier. And this is exactly what's causing, I would argue, this renaissance to occur over the next several years here. I'll just add that supercomputing a couple years ago was really inaccessible to most people. And in a supercomputing environment,

Starting point is 00:13:19 every node is highly networked to every other node because they needed to communicate state and information between them. When you had Hadoop and MapReduce, they could partition certain categories of problems and run them in parallel, but not really a lot of machine learning algorithms; they just don't work that way.

Starting point is 00:13:40 They require more of that interconnectedness and that communication between nodes and between memory and data sets and lots of iterations on the same data. So Spark and in-memory approaches to machine learning really accelerate the opportunity to create and apply machine learning algorithms to just about every facet of human existence,

Starting point is 00:14:05 not to overstate the case, but there really is a huge, huge renaissance just from that alone coming. So I find all this fascinating in terms of the evolution of computing. But the question I have is, how does this actually affect people? Like, how does it change for better or worse their work and their work practice? Today, if people want to really see the global data, they move the data manually from one storage to another storage to put them together to analyze it, even though the data could be in memory in the final storage,

Starting point is 00:14:38 but because of this manual process or this process of moving data around, first of all, it's very hard to manage. Secondly, the whole process is very time-consuming. It could be easily like weeks or even longer. So that makes things much harder. And data has less value. So that's another huge issue we're seeing. Yeah, I mean, you need a new abstraction layer

Starting point is 00:15:03 when there's an old world and you have a new world and you don't need to be stuck into this old model of how you, in this case, store and view your data. But what does that mean on the design side, on the interface side for people? Exactly. So the issue today is that data is really stored in different data silos.

Starting point is 00:15:19 There are so many different types of storage, and they have different types of interfaces, which makes the data very hard to consume at the application level. That's a big issue. Think of virtualization. On the compute side, we have virtualization technology to really virtualize the compute resource

Starting point is 00:15:33 and to be able to leverage the resource more efficiently. Also, think of the internet protocol stack. In the middle, you have the IP layer, which is really the narrow waist. When you make an innovation in the upper layer, you don't need to worry about the lower layer. So similarly, from the storage ecosystem perspective, we should build a layer to abstract different storage systems and then present a unified or standard API to the upper layer with a global namespace. From the user perspective,

Starting point is 00:16:06 they will be able to access the data from different storage systems very easily. And this, coupled with in-memory technology as well as smart algorithms to intelligently move the data, solves a lot of issues for the users: performance issues, cost issues,

Starting point is 00:16:28 performance-cost issues, and also the unification of data silos. So that's also one direction we're trying. Is this something we're seeing beyond one company? Like what's sort of the broad industry level view of all this? There's certainly a big need for unifying storage today. Most organizations have a hodgepodge, an organically grown set of storage systems and applications stretching all the way back to their mainframes. There's an awful lot of mainframes still out there, believe it or not. And when we look at what a company's real assets are today, oftentimes, as I think, again, Peter's pointing out, it's in the

Starting point is 00:17:06 data that they have, but the data collectively and in aggregate. In order to be able to analyze it and make predictions out of it, you want a cohesive whole. So solutions that can bring all that data together in an integrated analysis and actually support pipelines of prediction and feedback loops of analytics and visualization and really tighten the knot, if you will, are going to be extremely valuable. And those that can leverage the existing infrastructure and the existing data stores without upsetting the apple cart are the ones that are going to be adopted for it. How does it affect the IT department or the CEO?
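
The abstraction layer described earlier in the conversation (one standard API and a global namespace over many storage systems, with memory as the front tier) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Alluxio's actual API: the names `Backend`, `InMemoryBackend`, `UnifiedNamespace`, `mount`, and `read`, along with the sample paths and data, are all hypothetical.

```python
from typing import Dict, Protocol


class Backend(Protocol):
    """Hypothetical minimal interface every underlying store must offer."""

    def read(self, key: str) -> bytes: ...


class InMemoryBackend:
    """Stand-in for a real store (cloud object store, on-premises array, ...)."""

    def __init__(self, objects: Dict[str, bytes]) -> None:
        self._objects = dict(objects)
        self.reads = 0  # counts how often the slow store is actually hit

    def read(self, key: str) -> bytes:
        self.reads += 1
        return self._objects[key]


class UnifiedNamespace:
    """One global namespace over several backends, with a memory front tier."""

    def __init__(self) -> None:
        self._mounts: Dict[str, Backend] = {}
        self._cache: Dict[str, bytes] = {}  # unbounded cache, for brevity only

    def mount(self, prefix: str, backend: Backend) -> None:
        self._mounts[prefix.rstrip("/") + "/"] = backend

    def read(self, path: str) -> bytes:
        if path in self._cache:  # served from memory, no backend access
            return self._cache[path]
        # The longest matching mount prefix decides which backend owns the path.
        for prefix in sorted(self._mounts, key=len, reverse=True):
            if path.startswith(prefix):
                data = self._mounts[prefix].read(path[len(prefix):])
                self._cache[path] = data
                return data
        raise FileNotFoundError(path)


# Two departments keep data in different stores; the application sees one tree.
cloud = InMemoryBackend({"sales/2016.csv": b"q1,100\n"})
onprem = InMemoryBackend({"inventory.csv": b"widgets,42\n"})

ns = UnifiedNamespace()
ns.mount("/cloud", cloud)
ns.mount("/onprem", onprem)

first = ns.read("/cloud/sales/2016.csv")   # fetched from the backend
second = ns.read("/cloud/sales/2016.csv")  # served from the memory tier
inventory = ns.read("/onprem/inventory.csv")
```

The application code only ever sees paths in one tree, so swapping or adding a backend does not change it, which is the "narrow waist" idea. A real system would also need eviction, writes, consistency, and security, none of which this sketch attempts.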

Starting point is 00:17:43 Like, how should they think about this shift? IT departments in general are facing a lot of change. It started with this idea of cloud and treating their internal customers as clients and seeing themselves as service providers. That's a big shift in culture and approach and temperament. With infrastructure and the idea of being able to bring machine learning and really leverage companies' data sets in a myriad of ways, they have to become experts in that data, in those applications, because at the department level or division level on the business analyst side, you're going to find lots of people who know how to use Microsoft Excel.

Starting point is 00:18:26 So IT departments are really going to be looked at as the point of the sword in bringing these advances to the company and not just being seen as reactive operators of infrastructure on the back end. And that's a big shift. Look, I mean, data is the lifeblood of any organization. And so it doesn't matter whether you're a web developer or consumer products or health care organization, fundamentally, we are all becoming data-driven organizations. Big data up until right now has really been a reactive process, even querying a database.

Starting point is 00:19:03 Like I go look at what's happened in the past. And the holy grail of all computing and the holy grail of what a CIO ultimately cares about and what a business cares about is that I can predict the future. If I know what's going to happen tomorrow, whether it's inventory, health care, finance, and I know what's going to happen tomorrow or next week or next month, and I can accurately predict that. That is the holy grail of computing. And I'll even accelerate that. It's not about predicting tomorrow for a lot of people. It's about predicting what's going to happen next in terms of what the user's going to click on.

Starting point is 00:19:42 Where does the car turn? How does that rocket land on its legs? That sort of thing. It's becoming much more of a real-time proposition. The crux now of moving from historical to future prediction rides on the back of boring, recreated storage for in-memory architectures and machine learning coming together to provide this very unique and very interesting capability that, quite frankly, has not happened yet in the history of computing. I can't believe this. You've just made storage sexy again. So one last note to wrap up. What comes next?

Starting point is 00:20:19 Well, I think we mentioned the storage class memory that's coming out, Intel 3D XPoint, and some other folks making a very fast, close-to-the-chip memory that's actually storage. So it'll be faster again than flash, maybe a little bit slower than today's DRAM, but it's going to fill in that gap. And that's going to be another interesting tier of storage. It's going to change a lot of the way computing works. And that's coming out already. We're going to see that in the next couple of quarters.

Starting point is 00:20:51 I also would introduce in the longer term something to think about, and that is there are companies today that should be, from a contractual perspective, perhaps, locking in their rights to data up and down their supply chain, especially with the Internet of Things. I think there's going to be some companies kind of surprised in a year or two to find that they're not going to have access to the data that they're going to really want to have

Starting point is 00:21:15 that's relevant to what they need to do to make the predictions to optimize their business. And they're going to call it a kind of data poverty or paucity. And we're going to find companies that are forward-looking and data-rich. And some companies that have suddenly discovered they're sitting on the outside a little bit and data-poor. Yeah, that's especially fascinating because we talk a lot on this podcast about the role of data in building businesses, like whether it's data network effects or data in machine learning startups. We just talk a lot about how data is increasingly an advantage in a lot of businesses.

Starting point is 00:21:44 If we can have faster access to data, easier management of data, a complete view of the data, and a smarter way of analyzing the data, I think it's a very exciting moment, and many more innovations will come out along the way. Okay, well, thank you guys for joining the a16z Podcast.
