a16z Podcast: The Storage Renaissance

Episode Date: March 21, 2017

As we enter a new era of distributed computing -- and of big data, in the form of machine and deep learning -- storage becomes (even more) important. It might not be sexy, but storage is what makes the internet and cloud computing go round and round: "Without storage, we wouldn't have databases; without databases, we wouldn't have big data; we wouldn't have analytics ... we wouldn't have anything because information needs to be stored, and it needs to be retrieved." This is especially complicated by the fact that more and more computing is happening at the edge, as with autonomous car sensing. Clearly, storage is important. But now it's also undergoing a renaissance as it becomes faster, cheaper, and more in-memory. What does this mean for all the big players in the storage ecosystem? For CIOs and IT departments? For any company competing on data, whether it's in analyzing it or owning it? And for that matter: What is data, really? Beyond the existential questions, this episode of the a16z Podcast -- with a16z partner Peter Levine; Alluxio (formerly Tachyon) founder and CEO Haoyuan Li (“HY”); and storage industry analyst Mike Matchett of The Taneja Group -- covers all this and more. It even tries to make storage, er, great again.

Transcript
Starting point is 00:00:00 Hi everyone, welcome to the a16z Podcast. I'm Sonal. Today's episode is all about storage. With the cost of system memory decreasing, memory for both storage and compute will be the exact same thing. So as we enter a new era of distributed computing, and what Peter has also argued in a popular deck is the end of the cloud, how does storage evolve? How is this affected by trends in computing such as machine and deep learning? Joining us to have this conversation today are HY, CEO and co-founder of Alluxio, formerly Tachyon, which came out of the UC Berkeley AMPLab, the birthplace of other industry-defining technology such as Spark and Mesos; general partner Peter Levine, who has funded memory-centric infrastructure companies at every level of the Berkeley Data Analytics Stack, the BDAS stack; and Mike Matchett, senior analyst at Taneja Group, which covers everything related to big data, compute, and storage. Okay, so that's the intros. To kick things off, I just have to ask: why should we care about storage? I feel like it's a dark underbelly of computing that no one really cares about.
Starting point is 00:00:54 Look, I mean, while storage may be the underbelly, without storage computers wouldn't work. And so it's one of the most important; you know, compute, networking, and storage are the three fundamental elements of what makes the entire internet work and makes cloud computing work. And without storage, you wouldn't have databases; and without databases, you wouldn't have big data, you wouldn't have analytics, you wouldn't have anything, because information needs to be stored and it needs to be retrieved. So storage is hugely, hugely important. And, you know, the interesting thing is I think we're in a very important,
Starting point is 00:01:30 transformative period of time here, where storage is undergoing a bit of a renaissance, and I think it's going to transform how computing and applications work in the not-too-distant future. When I got started in storage, I thought, hey, this is really the staid, tried-and-true stuff; compute was where it was at, you know, with all these advantages and advances happening in client-server and then cloud, and new chips coming along every year. But the more I got into storage, the more I figured out that storage is really the most complex part of that equation. It takes a lot of effort to protect data, to manage data. Data has gravity. It has momentum. It has weight and history. So storage is really the critical piece to get right. Wait, what do you mean when you say that data has gravity and momentum?
Starting point is 00:02:20 Data has to live somewhere. You know, compute can be spun up in a cloud. It's a little ephemeral. You can repeat it, you can spin up and down virtual machines. But data actually has to have a footprint somewhere, and that footprint has to be persisted and protected and secured. And of course, in this case, made accessible, or the data has no value. Right. But how is that different than what we have right now, that it requires a new form of storage?
Starting point is 00:02:45 I'm going to use a phrase that I say a lot on the podcast, but I think it's actually really true of our times, which is: when we say there's a lot more data, that's a difference of degree, not just kind. Why do we need a different type of solution? Why can't we just keep doing the same things that we were doing before, but do them bigger and better? So I think this is like many other things in the world. Like when the cell phone was first made, a cell phone just made phone calls, and now the cell phone is a completely different type of device.
Starting point is 00:03:10 A similar thing is happening in the storage industry as well. At the very beginning, storage was block devices: the bits, just bits, raw data. Beyond block devices, we have file systems, different types of file systems. Now we have blob storage, object storage. And in the meantime, we have so much innovation in the open source area as well. And you have public cloud storage from several huge vendors in the world, like Amazon, Google, Microsoft, Alibaba, et cetera. You have different types of storage solutions provided by traditional vendors like EMC, HPE, IBM; they're providing these new innovations. I think that will pale in comparison to
Starting point is 00:03:54 what's going to happen over the next, let's say, decade here. When we think about information and data, there's an entirely new phenomenon that's really just kicked in relative to what data is. What is data? That's a very existential question. Well, up until right now, compute data has largely been input by some human being typing on a keyboard, or a database retrieving a record; that's the input of a human asking the computer something. That data largely has been put in there through human fingers and through some human interaction.
Starting point is 00:04:30 Fast forward to right now: let's just talk about a self-driving car that has sensors. Those sensors are now inputting data that's the world around us. And so there are completely new types of data. What is data now far exceeds the human-input data. We are now collecting the world's information via sensors, and all of that needs to be processed and stored. It will be literally orders of magnitude, and in the exact mathematical sense, orders of magnitude,
Starting point is 00:05:02 more data that needs to be stored and processed. So that's sort of point one on what's happening. Secondly, the mobile supply chain is influencing the data center, and influencing the cost curves in the data center for storage. You take a mobile phone, you take the components of that mobile phone and put them in the data center, and you have a very inexpensive storage substrate that is far less expensive than the enterprise systems that we saw in the past. And so the cost curves come way down. We will have much more in-memory
Starting point is 00:05:37 data, systems that literally live in real memory, and the notion of disk drives and tape drives and even SSDs will all go away. I believe that there's a future here where the memory architecture is completely flattened. Computing has been built on slow and cheap versus fast and expensive, and I believe that we're going to be fast and cheap. I mean, it sounds like it'd be obvious, but what does fast and cheap really do for us when it comes to the so-called storage renaissance? Fast and cheap means that we can collect massive amounts of information, put it in memory, and not have to put it out to a disk drive and do all these backflips to get data to work correctly. It's going to all be in memory, and it will be very inexpensive.
Starting point is 00:06:24 And that's the renaissance, whether you call it storage or, more importantly, data: the importance of data, and the correlation between the volumes of data and the price curves in the not-too-distant future for what I'll call storage, even though it's memory. And those pieces coming together, that to me is the renaissance that's happening in computing. I totally agree. And actually, just to add one more point to that: we actually should view memory as a front tier of storage. Exactly. It's a tier of storage. Exactly.
Starting point is 00:06:59 I would argue that it is the tier of storage, that over time there is no other storage. It's just memory. Certainly, we've been seeing the rise of memory-class storage already being talked about by vendors, and bringing persistence to memory will completely overhaul how compute and storage are envisioned today, because tomorrow data is going to live in the compute devices.
Starting point is 00:07:25 Those are going to be more Internet of Things devices and be far more distributed as well. But I also want to temper that with the thought that we've also talked to a lot of these big storage vendors, and they're forecasting that there simply isn't going to be enough storage for all the data we're collecting in a midterm horizon, like three to five years; that we're creating so much data, there won't be enough chips, there won't be enough hard drives, there won't be enough tape. There simply isn't going to be enough storage out in the world. Which, I do want to point out, means that there are still some opportunities for things in storage management, for people to consider: how many copies of that data am I making? Do I have to take the compute, the processes, out to where the data lives? Do I have to bring the data centralized and make copies of it? Or can I do something more optimized with how I organize my architecture, and only store the data once,
Starting point is 00:08:15 only compute data once, in one place? Just to draw a fine point here: we are entering a new world of distributed computing. And if you think about the new world of distributed computing, the data that gets collected in a self-driving car or some endpoint is going to be processed at that endpoint. It won't be transmitted back to a central storage pool. The information will be curated and then transmitted back. We'll be collecting massive amounts of information.
Starting point is 00:08:46 I mean, a self-driving car collects 10 gigabytes of data a mile, right? Like, some ridiculous amount of data. You know, there's not enough storage on the planet to ever hold all that information. So the curation is going to occur at the edge, close to the compute, and the quote-unquote storage will be processed at the edge, and then the important information will come back to some centralized data store. But all that computation at the edge, and the storage of it, and kind of the permutations of it, is exactly the renaissance that I believe needs to happen in storage even to process this stuff.
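To make that edge-curation idea concrete, here is a minimal Python sketch of the pattern being described: summarize locally, ship back only what matters. The sensor window, anomaly threshold, and upload stand-in are all hypothetical, not anything a particular car platform actually does.

```python
# Hypothetical sketch: curate raw sensor data at the edge, transmit only a summary.
from statistics import mean

def curate(readings, anomaly_threshold=3.0):
    """Reduce one window of raw readings to a compact summary.

    The bulk of the raw data is processed and discarded locally; only the
    summary (plus any anomalous raw samples) leaves the edge device.
    """
    avg = mean(readings)
    return {
        "count": len(readings),
        "mean": avg,
        "min": min(readings),
        "max": max(readings),
        "anomalies": [r for r in readings if abs(r - avg) > anomaly_threshold],
    }

def upload(summary):
    print("sending upstream:", summary)  # stand-in for the link back to the data center

window = [1.0, 1.2, 0.9, 14.7, 1.1]  # e.g., one window of wheel-speed or lidar samples
upload(curate(window))
```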
Starting point is 00:09:23 So another huge issue here is that this has made the ecosystem much more complex than before. All the big enterprise companies, they will try different innovations; they will have their existing storage and cloud storage, new storage, forming very complex systems,
Starting point is 00:09:41 and it makes it hard to manage, hard to consume, and in many cases, not cost-effective as well. And this is one thing we're seeing requested by the customers, many big enterprises in the world: how can they consume this data from different storage systems easily
Starting point is 00:10:01 and manage it efficiently. So is this just because they have a hodgepodge of all these different storage systems, or is it that it's just buried in the same place but under a bunch of different interfaces and tools, or what's the problem, really? The data is stored in different storage systems. Just to give you a very concrete example: if you talk to this department, their data is stored in a public cloud storage,
Starting point is 00:10:22 maybe in Amazon S3 or Google Cloud Storage. And another department, they have some data stored in EMC storage or HPE storage. And you have another department that says, I have my own private cloud storage. And another group, they want to analyze the data across the enterprise. At the end of the day, why do people want data? People want to use data to generate value, which means using data to make decisions, or to facilitate making decisions. The more data you can analyze, the better the result you normally get.
Starting point is 00:10:52 So this existing environment makes that very hard. What we're trying to tackle, or leverage, is taking memory as a first-class citizen in a storage system. We have a memory-centric architecture. And beyond that, how to manage the data, or how to access the data, from different storage systems in the most effective manner. Let's pause on that for a quick moment. Why does the in-memory aspect matter? So one side is, of course, about performance,
Starting point is 00:11:17 and on performance, memory is much faster than SSD or HDD. And from the other perspective, it's also about the cost. The cost is decreasing very fast: about every 18 months, the cost decreases by 50%. So that's two points, performance plus cost, which together unlock the capacity. Those two points together mean now is the right time to build memory as a tier of storage.
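As a back-of-the-envelope check on that cost claim, a halving every 18 months is just exponential decay. A tiny sketch, where the $10/GB starting price is an arbitrary placeholder:

```python
# Rough sketch of the cost curve described above: price halves every 18 months.
def memory_cost(per_gb_today, years):
    halvings = years / 1.5  # one halving per 18 months
    return per_gb_today * 0.5 ** halvings

for years in (0.0, 1.5, 3.0, 4.5):
    print(f"year {years}: ${memory_cost(10.0, years):.2f}/GB")  # $10/GB is a placeholder
# Three halvings over 4.5 years is an 8x drop: $10/GB -> $1.25/GB.
```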
Starting point is 00:11:57 I think that machine learning is the application that unlocks much of the new in-memory systems. I mean, machine learning is the next generation of big data. And what is machine learning? It's iterations over large, large, large data sets to come up with better ways to forecast and better ways to utilize information. So in order to unlock the power of machine learning in a time-sensitive fashion, you have to actually do these computations in memory. Because if you have to, what's called, go out to the disk drive to get information, or go out to an SSD, the time to seek for that information and look for it is a huge penalty when you're dealing with massive amounts of information. To the extent it's all in memory, I can operate on it very quickly and do many more iterative sets in a shorter time frame, giving the results that might be needed in a machine learning or AI environment. Unless we get to in-memory processing and in-memory data structures, machine learning doesn't really work. And so we have to come up with these ways of having much more in-process, high-fidelity storage that is this new tier. And this is exactly what's causing what I would argue is this renaissance, occurring over the next several years here.
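A small sketch of the penalty Peter is describing; the file path and the `step` body are placeholders, and the only point is where the I/O cost gets paid:

```python
# Sketch: iterative ML wants the working set resident in memory,
# so the storage I/O cost is paid once rather than on every pass.
import numpy as np

def load_from_storage(path):
    return np.load(path)  # stand-in for a slow read from disk, SSD, or object store

def step(data):
    data.mean()  # placeholder for one iteration of gradient descent or similar

def train_storage_bound(path, epochs):
    for _ in range(epochs):
        step(load_from_storage(path))  # full I/O (and seek) penalty every iteration

def train_in_memory(path, epochs):
    data = load_from_storage(path)  # one load; the data set then stays resident
    for _ in range(epochs):
        step(data)  # every subsequent pass runs at memory speed

np.save("batch.npy", np.random.rand(1_000_000))  # toy data set
train_in_memory("batch.npy", epochs=100)
```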
Starting point is 00:13:11 I'll just add that supercomputing a couple of years ago was really inaccessible to most people. And in a supercomputing environment, every node is highly networked to every other node, because they need to communicate state and information between them. When you had Hadoop and MapReduce, they could partition certain categories of problems and run them in parallel, but not really a lot of machine learning algorithms. They just don't work that way. They require more of that interconnectedness, and that communication between nodes and between memory and data sets: lots of iterations on the same data. So Spark and in-memory approaches to machine learning really accelerate the opportunity to create and apply machine learning algorithms to just about every facet of human existence. Not to overstate the case, but there really is a huge renaissance coming just from that alone.
Starting point is 00:14:05 So I find all this fascinating in terms of the evolution of computing. But the question I have is, how does this actually affect people? Like, how does it change, for better or worse, their work and their work practice? Today, if people want to really see the global data, they move the data manually from one storage system to another to put it together to analyze it, even though the data could be in memory in the final storage. But because of this manual process, this process of moving data around,
Starting point is 00:14:43 first of all, it's very hard to manage. Secondly, the whole process is very time-consuming. It could easily take weeks or even longer. So that makes things much harder, and the data has less value. So that's another huge issue we're seeing. Yeah, I mean, you need a new abstraction layer when there's an old world and you have a new world, and you don't need to be stuck in this old model of how you, in this case, store and view your data.
Starting point is 00:15:08 But what does that mean on the design side, on the interface side, for people? Exactly. So the issue today is that data is really stored in different data silos. There are so many different types of storage, and they have different types of interfaces, which makes it very hard for the application level to consume the data easily. That's a big issue. Think of virtualization.
Starting point is 00:15:28 On the compute side, we have virtualization technology to really virtualize the compute resources and be able to leverage those resources more efficiently. Also, think of the internet protocol stack. In the middle, you have the IP layer, which is really the narrow waist. When you make an innovation in the upper layer, you don't need to worry about the lower layer. So similarly, from the storage ecosystem perspective, we should build a layer to abstract the different storage systems
Starting point is 00:15:56 and then present a unique or standard API to the upper layer, with a global namespace. From the user perspective, they will be able to access the data from different storage systems very easily. And this, coupled with in-memory technology, as well as smart algorithms to intelligently move the data, solves
Starting point is 00:16:23 a lot of issues for the users: performance issues, cost issues, performance-cost issues, and also the unification of the silo issues. So that's also one direction we're trying. Is this something we're seeing beyond one company? Like, what's the broad industry-level view of all this? There's certainly a big need for unifying storage today. Most organizations have a hodgepodge, an organically grown set of storage systems and applications stretching all the way back to their mainframes. There are an awful lot of mainframes still out there, believe it or not. And when we look at what a company's real assets are today, oftentimes, as I think Peter's pointing out, it's in the data that they have. But to take the data collectively, and in aggregate, in order to be able to analyze it and make predictions out of it, you want a cohesive whole. So solutions that can bring all that data together in an integrated analysis, and support actual pipelines of prediction and feedback loops of analytics and visualization, and really tighten the knot, if you will, are going to be extremely valuable.
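In code, the narrow-waist, global-namespace layer described above looks roughly like the toy sketch below: one mount table, one API, many backends. This is an illustration of the general pattern only, not Alluxio's actual interface.

```python
# Toy sketch of a unified namespace over heterogeneous storage systems.
from abc import ABC, abstractmethod

class Store(ABC):
    """Minimal backend interface; real systems would wrap S3, HDFS, NFS, etc."""
    @abstractmethod
    def read(self, key: str) -> bytes: ...
    @abstractmethod
    def write(self, key: str, value: bytes) -> None: ...

class MemoryStore(Store):
    def __init__(self):
        self._data = {}
    def read(self, key):
        return self._data[key]
    def write(self, key, value):
        self._data[key] = value

class UnifiedNamespace:
    """One mount table, one API; callers never see which backend holds the data."""
    def __init__(self):
        self._mounts = {}  # path prefix -> backing store
    def mount(self, prefix, store):
        self._mounts[prefix] = store
    def _resolve(self, path):
        for prefix, store in self._mounts.items():
            if path.startswith(prefix):
                return prefix, store
        raise FileNotFoundError(path)
    def read(self, path):
        prefix, store = self._resolve(path)
        return store.read(path[len(prefix):])
    def write(self, path, value):
        prefix, store = self._resolve(path)
        store.write(path[len(prefix):], value)

ns = UnifiedNamespace()
ns.mount("/hot/", MemoryStore())  # a real deployment would also mount /s3/, /hdfs/, ...
ns.write("/hot/model.bin", b"\x00\x01")
print(ns.read("/hot/model.bin"))
```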
Starting point is 00:17:33 And those that can leverage the existing infrastructure and the existing data stores, without upsetting the apple cart, are the ones that are going to be adopted. How does it affect the IT department or the CIO? Like, how should they think about this shift? IT departments in general are facing a lot of changes. It started with this idea of cloud: treating their internal customers as clients, and seeing themselves as service providers. That's a big shift in culture and approach and temperament.
Starting point is 00:17:55 With infrastructure, and the idea of being able to bring machine learning and really leverage a company's data sets in a myriad of ways, they have to become experts in that data and in those applications, because at the department level or division level, on the business analyst side, you're going to find lots of people who know how to use Microsoft Excel. So IT departments are really going to be looked at as the point of the sword
Starting point is 00:18:30 in bringing these advances to the company, and not just being seen as reactive operators of infrastructure on the back end. And that's a big shift. I mean, data is the lifeblood of any organization. And so it doesn't matter whether you're a web developer, a consumer products company, or a healthcare organization. Fundamentally, we are all becoming data-driven organizations. Big data, up until right now, has really been a
Starting point is 00:18:56 reactive process, even querying a database: like, I go look at what's happened in the past. And the holy grail of all computing, and the holy grail of what a CIO ultimately cares about and what a business cares about, is that I can predict the future. If I know what's going to happen tomorrow, whether it's inventory, health care, or finance; if I know what's going to happen tomorrow or next week or next month, and I can accurately predict that, that is the holy grail of computing. And I'll even accelerate that. It's not about predicting tomorrow for a lot of people.
Starting point is 00:19:35 It's about predicting what's going to happen next, in terms of what the user is going to click on. Where does the car turn? How does that rocket land on its legs? That sort of thing. It's becoming much more of a real-time proposition. And we are at the cusp now of moving from historical analysis to future prediction, and it rides on the back of boring, recreated storage, with in-memory architectures and machine learning coming together to provide this very unique and very
Starting point is 00:20:07 interesting capability that, quite frankly, has not happened yet in the history of computing. I can't believe this. You've just made storage sexy again. So, one last note to wrap up: what comes next? Well, I think we mentioned the storage-class memory that's coming out, Intel 3D XPoint and some other folks making very fast, close-to-the-chip memory that's actually storage. So it'll be faster again than flash, maybe a little bit slower than today's DRAM, but it's going to fill in that gap. And that's going to be another interesting tier of storage. It's going to change a lot of the way computing works. And that's coming out already; we're going to see that in the next couple of quarters.
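For rough intuition on the gap being filled, here are coarse, order-of-magnitude access latencies for the tiers mentioned; these are approximations for illustration, not vendor specifications:

```python
# Very rough access latencies, in seconds (orders of magnitude only).
access_latency = {
    "DRAM":                  100e-9,  # ~100 nanoseconds
    "storage-class memory":    1e-6,  # ~1 microsecond; the gap 3D XPoint aims at
    "NVMe flash":            100e-6,  # ~100 microseconds
    "spinning disk":          10e-3,  # ~10 milliseconds
}
for tier, seconds in access_latency.items():
    print(f"{tier:<22} ~{seconds * 1e9:>12,.0f} ns")
```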
Starting point is 00:20:44 I also would introduce, in the longer term, something to think about, and that is: there are companies today that should be, from a contractual perspective perhaps, locking in their rights to data up and down their supply chain, especially with the Internet of Things. I think there are going to be some companies kind of surprised in a year or two to find that they're not going to have access to the data
Starting point is 00:21:13 that they're going to really want to have, that's relevant to what they need to do to make the predictions to optimize their business. And they're going to call it a kind of data poverty, or paucity. And we're going to find companies that are forward-looking and data-rich, and some companies that have suddenly discovered they're sitting on the outside a little bit, and data-poor. Yeah, that's especially fascinating, because we talk a lot on this podcast about the role of data in building businesses, like whether it's data network effects or data in machine learning startups. We just talk a lot about how data is increasingly an advantage in a lot of businesses. If we can have faster access to data, easier management of data, and a complete view
Starting point is 00:21:50 of the data, and a smarter way of analyzing the data, I think it's a very, very exciting moment, and many more innovations will come out along the way. Okay, well, thank you guys for joining the a16z Podcast.
