Storage Developer Conference - #82: Eusocial Storage Devices
Episode Date: December 10, 2018...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcasts. You are listening to SDC Podcast Episode 82. Let's go ahead and get started. I
got a bunch of slides here. I'm going to try to move through them, but I don't know that we're going to get through them all.
This is the first time I've done this sort of presentation with respect to UC Santa Cruz. This is a new sort of thing that happened over the summer: eusocial storage is now an incubator project inside the University of California, Santa Cruz.
And specifically, it's inside an organization called CROSS,
which is the Center for Research on Open Source Software.
This is an organization that was essentially founded by Sage Weil.
Sage is the author of Ceph.
He did his PhD thesis at Santa Cruz on Ceph
and consistent hashing and all these great things
that we see in scale-out.
And he took Ceph as a graduate research project.
He turned it into an open-source project,
eventually incubated it in a startup,
and then eventually got it sold to Red Hat.
And he wanted to see that success replicated again and again through the university system.
So he's actually the founder of CROSS, and the idea is to replicate that.
So we have research that's going on, education.
We have projects that are part of the research work.
And eventually those projects turn into incubations that hopefully create successful open source projects.
From there, it's up to the students and the people involved to take it beyond that. But it currently is funded by industry.
We have an industry advisory board that pays annual dues and essentially funds all the research and incubation for open source. And there are a number of projects inside this organization today.
It definitely has an open source focus.
Everything we do is open source.
So all the things that we're talking about today in eusocial storage will all be open source based. And currently we have, I forget the exact number, but Toshiba, Micron, Samsung, Seagate, Western Digital, and Huawei are all members funding this organization right now.
And I didn't introduce myself.
I am Philip Kufeldt.
I've been in the storage industry for about 25 years, working, starting at Veritas and working through a number of storage companies,
some my own and some big companies.
My last two gigs were at Toshiba and Huawei.
At Huawei, I was the director of storage standards, and at Toshiba, I created the KV Drive,
which is kind of an example of what we're going to be talking about today,
which is an autonomous storage device.
But as I was working with both of these companies, I had an opportunity to work with CROSS as well, and I saw that CROSS was this neat intersection between industry and research.
And my feeling is that moving forward with offloaded intelligent storage
devices is something that's going to take time. The industry is not going to say tomorrow, yes, this is what we're going to do. We're going to have to stage it in a strategic fashion over a number of years, combined with research that establishes the proof points of why it's a good idea. So we need the research.
We need the industry to provide comments and commentary
to guide and mold it in the direction that it needs to go in.
So I thought CROSS was a great place to put it.
So this summer, I took eusocial storage and put it in CROSS
and actually am now a member of the staff at UC Santa Cruz.
So, eusocial storage.
At CROSS, we had our first presentation at FAST '18. We published an article in USENIX ;login:.
If you want to go there, everything I'm talking about today is in that article.
I'm kind of redoing that article as part of this presentation.
And there's information about CROSS there as well. We also have a symposium coming up at the beginning of October that is a very interesting showing of all the work that's going on,
not only in CROSS, but also bringing in industry and other research organizations
to talk about interesting storage topics.
There will be a eusocial storage topic there, and we're going to focus on the strategic goal of eusocial storage,
which is in-storage compute,
and we're going to talk about the tradeoffs between the different models of in-storage compute.
So I encourage you to come out and take a look at it.
Okay, so half of this presentation is going to be some of the motivations
of why we think offloading is the right direction.
The industry right now is going in two different directions.
We have a lot of people looking at moving intelligence off of the storage devices and into the host, allowing the host to do closer management of the underlying media itself. You see that with Open-Channel and some of the other things going on in the industry.
At the same time, we're also looking at composability and disaggregation.
And those are going to be moving the data farther away,
which means doing data management across the fabric is going to get more expensive.
So we're going to look at the offloading argument and talk about why we think it's a good direction to go in.
And then we're going to talk about the sort of framework, the strategic framework, of what eusocial storage is.
So first of all, I've been using the term eusocial. For those who don't know what that means, eusociality is a social strategy used by insects and other organisms where you have highly specialized groups of individuals that perform specific tasks. They're focused, and that's what they do. You can think of this like ants or termites, those kinds of things. So we thought it was a good metaphor for eusocial storage. Can we devise a system where you have these autonomous, intelligent devices, each doing something specific to the media that it is, because every media is not created equal, but combining together to make a unified whole that provides an amazing amount of services?
And you'll hear me start using the term casts.
This is a term that we've decided to adopt from the eusocial aspect in animals.
This is a grouping of like-minded organisms.
We'll talk more about that in a minute.
So why eusocial storage?
I'm going to talk in depth about three different trends: public and private cloud; server offload, specifically disaggregation; and how the ways of the past
may not fit into the future.
All right.
So public cloud.
This is an area I don't hear a lot of people talking about as a reason for going in a particular technological direction.
Right now we have a
lot of public cloud storage offerings that are out there. There's a handful of big guys out there
providing cloud, and we always talk about how we're migrating our data and resources into these
public clouds. And yet, if you look at some of the marketing statistics that are out there,
even as of late, there was some 451 Research data that shows
that the migration isn't necessarily all one direction. We are not always taking our stuff
and moving it into the public cloud. We're actually doing both. We're migrating back to
the private clouds as well. And there's various reasons for that, you know, from security to
control to cost. There's a bunch of reasons why private cloud still makes sense today.
It's still a management problem, but it still exists.
And I think the big data centers took notice of that.
And you can see that because, for example, a year and a half or a couple of years ago, Azure Stack became available, which is a private cloud version of Azure, right?
So I can now deploy Azure Stack in my on-prem
or off-prem private area,
and then also be utilizing public cloud services as well,
and it provides for an easy migration.
So that's the strategy from the big cloud guys,
is that public or private, let's lock them into what we're doing.
It makes it easier to migrate back and forth.
Google partnered with Nutanix about 18 months ago.
There's even rumors now that maybe they're coming up with their own private cloud offering outside of Nutanix.
It just shows that the industry has taken note that there's not just one market.
It's not just public cloud and everybody's moving there.
There is private cloud as well.
And I think, you know, from my standpoint, one of the worst possible outcomes is that we end up five or ten years from now with five big cloud companies.
And we all do our business with those five cloud companies.
It's bad for the industry. It's bad for vendors. Vendors don't have the ability to get R&D to move the market forward, because they're all essentially servicing these five big customers instead of a wide customer base. But what's missing in the private cloud world, I think, is still an easy-to-use, easy-to-deploy, scalable, scale-out system
that allows you to deploy this in the private cloud without a lot of expense.
The management, the complexity, all of those things are some of the reasons why we move to public cloud. So I think that's one of the things that's still missing today: a good, easy-to-use system. There's lots of stuff out there. Don't get me wrong. But something that's easy to use and provides all the feature set that you get from a public cloud is another story completely.
Server offload.
This was a talk given by Fritz Kruger and Allen Samuels of Western Digital about two years back, and I think it's still relevant today. They were talking about how CPU bandwidth, actually not CPU bandwidth but rather RAM bandwidth, is becoming the bottleneck of tomorrow. Let me back up. The bandwidth of NAND, for example, is skyrocketing right now, and that's what this chart is showing. They basically went back historically and looked at storage bandwidth over time, then looked at network bandwidth over time, and also looked at DDR bandwidth over time.
And the rate at which the network and the storage bandwidth is increasing right now is far outstripping the DDR bandwidth at this point. It's not yet above it. That's why I can still put five or ten SSDs per proc into a system. But as these things get continually faster and faster, the number of SSDs that I can attach on a per-processor basis is going to get smaller and smaller. And when I say that, I mean that you're now hiding bandwidth.
I can add 30 SSDs to a system, but I'm not going to be able to attain
all of the bandwidth that those SSDs provide because they're going to be behind that
bottleneck of the DRAM because I can't get the data in and out
quick enough.
So that means that
essentially
the DMAs right now
are doing two different things.
When we're talking about I.O.,
you're talking about two different kinds of I.O.
There's the I.O. that clients are really trying to do.
I'm trying to read and write my data.
But then there's usually
management of that data
that's going on underneath the covers, right? Behind the file system, behind the volume management,
behind the scale-out for replication. All these things are data management that's going on, which means I'm going to be bringing data in and out of the server, consuming that DMA bandwidth, even though it's not for my primary purpose, which is use by the client application.
I'm going to get a little deeper into that picture in just a second.
Furthermore, just to show you, this was a talk given by Jae Do of Microsoft.
He's in the Microsoft Research Organization, and he basically provided a nice graphic showing
that server mismatch. The fact that I can layer up flash dies behind a flash controller attached
to PCIe that is then attached to the root complex that then has to talk to this CPU,
and I can easily stack up enough dies to give myself a 66x sort of ratio between the available bandwidth in the NAND dies
compared to what I can do at the CPU and in the memory.
And again, the real work is north and south.
It's moving data in and out.
You're going to hear me talk about north and south, east and west a lot today.
North and south is moving the data in and out for the client application.
But then there is data management,
and there's a north-south component to that
for doing translations, compaction, deduplication,
all of those kind of things,
as well as east and west traffic.
And that's where I'm going to be doing redundancy,
recovering, rebalancing, tiering, caching.
All of those things are moving data
from one device to the next.
So those are east and west type of transactions.
So let's look at an example of the north and south.
So here in this picture, I've got essentially a server that's running RocksDB.
I'm doing a key value store.
So that means my clients above the server want to get key value services,
and they want to be able to put and get values.
Unfortunately, for RocksDB to be implemented successfully,
it goes through a number of translation layers.
For example, it uses a file system to place its SST files.
That means I have to translate from a key value to which SST file it lives in.
So that's one of the jobs of the key value store.
But then when I get into the file system,
the file system has to translate between that file and an LBA.
And then the block layer passes it down to the device, and the LBA gets translated again to a physical address by the FTL.
So there's lots of translations that are going on in this stack.
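To make that translation stack concrete, here is a minimal, purely illustrative Python sketch; every mapping layer is simulated with a plain dict, and none of the names correspond to a real RocksDB, file system, or FTL interface.

    # All layer names and mappings below are hypothetical stand-ins.
    sst_index = {}      # key -> (sst_file, offset)      (key-value layer)
    file_extents = {}   # (sst_file, offset) -> LBA      (file system layer)
    ftl = {}            # LBA -> physical NAND page      (FTL inside the SSD)
    nand = {}           # physical page -> bytes         (the media itself)

    next_lba = 0
    next_page = 0

    def kv_put(key, value):
        global next_lba, next_page
        # 1. Key-value layer: decide which SST file (and offset) holds the key.
        placement = ("sst_0001", len(sst_index))
        sst_index[key] = placement
        # 2. File system layer: map (file, offset) to a logical block address.
        file_extents[placement] = next_lba
        # 3. FTL: map the LBA to a physical NAND page.
        ftl[next_lba] = next_page
        # 4. The data finally lands on the media.
        nand[next_page] = value
        next_lba += 1
        next_page += 1

    def kv_get(key):
        # The same chain of translations happens again on the read path.
        placement = sst_index[key]
        lba = file_extents[placement]
        return nand[ftl[lba]]

    kv_put("movie:1234", b"...bytes...")
    print(kv_get("movie:1234"))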
In addition, there's extra I.O. for metadata and things of that nature going on as part of that data path. And because of the characteristics of a key value store, where you keep things in sorted order, there's garbage collection and sorting that has to go on, and that causes files to be read in and written out, back and forth, in addition to the key value operations
that the client is providing.
So, for example, RocksDB is a LevelDB type of implementation, which means that I have different levels in which my data is stored, and that gives me a way of efficiently sorting data
and that gives me a way of efficiently sorting data
as it comes into the system
and not re-sorting data that I don't need to.
And as data fills up a layer, then I have to move it up to another layer,
which causes garbage collection and other things of that nature.
That's when you start to see this reading in of data and writing out of data,
back and forth, not for the actual client,
but to do the data management of the key value store.
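As a rough, back-of-the-envelope illustration of that effect, the numbers below are assumptions chosen only to show how compaction multiplies the traffic crossing the DMA; they are not measurements of any real system.

    # Illustrative arithmetic only; the figures are assumed, not measured.
    client_writes_gb = 100           # what the application actually wrote
    levels = 5                       # levels in a leveled LSM tree
    rewrites_per_level = 1.0         # each byte read and rewritten ~once per level

    # Every compaction pass pulls data back into host memory and writes it
    # out again, so both directions of the DMA are consumed.
    compaction_read_gb = client_writes_gb * levels * rewrites_per_level
    compaction_write_gb = client_writes_gb * levels * rewrites_per_level

    total_dma_traffic_gb = client_writes_gb + compaction_read_gb + compaction_write_gb
    print(f"Client writes:         {client_writes_gb} GB")
    print(f"Compaction traffic:    {compaction_read_gb + compaction_write_gb} GB")
    print(f"Total through the DMA: {total_dma_traffic_gb} GB")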
But it also does things like scrubbing.
Is my data correct?
Periodically, it's going out and looking at the data
to see if it checks some matches.
Reading in the data, writing that data back out.
So that's consuming this DMA bandwidth,
which we've already talked about is a bottleneck.
I'm consuming it not for work on behalf of the client, but for storage and data management work. With scale-out, the situation gets a little worse, because you've got to do the same things in scale-out,
but now I'm doing it not only north and south,
moving data between the device and the host,
but I'm also moving data between the host
and other hosts and other media.
So I not only move data in and out, north and south,
but now east and west.
I move data from clients to media
to other scale-out servers for redundancy.
I also do it for recovery
when I'm recovering erasure-coded data
and things of that nature.
When it's time to rebalance,
when you add a new system with a bunch of storage in it
into a scale-out environment,
you want to then redistribute the data
across to all the available disks
so that you're taking advantage of the capacity and access.
So you have rebalancing that goes on.
And then, typically, a lot of the scale-out systems are providing a single quality of service. In other words, I have a scale-out for my disks. I have a scale-out for my SSDs. I have different scale-outs for different
purposes. And then I have to migrate data when it's time to migrate it for information lifecycle
management type of things, right? I put it on the SSD scale-out to start with because that's when
everybody's hitting it and it needs to be fast. But over time, I'm going to age it and put it
over somewhere else. Again, consuming that bandwidth of the DMA to get it into the host
and then back out to some other server and to some other scale out.
So scale out is a strong candidate for offloading,
as are the key value stores.
And here's just a picture of what I was talking about: all the different internal copies and things like that that are going on between the different systems and across the network. And again, if you start thinking about disaggregation, which we're going to talk more about in the next section, these are network transactions as well. Every one of these north and south is a network. Every one of the east and west is a network.
Here's an example of what it might look like if you offloaded.
First of all, with offloading, you can get rid of the translation layers,
which I think is a key important piece.
You don't need file systems and blocks.
Since my client is talking key value,
it's talking about objects, then I can just shuttle it down to the device itself as an object,
and there doesn't need to be any translation on the host at all. And because the key value database is running next to the media itself, I can get better efficiencies between the key value database and the underlying media technology. I can tailor them specifically for that particular media.
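A minimal sketch of what that offloaded path could look like from the host's point of view, assuming a hypothetical wire format and a device that accepts key-value operations over a plain TCP port; nothing here is a defined protocol, it only shows the shape of the interaction.

    import json
    import socket

    def device_put(device_addr, key, value):
        """Send a key-value put straight to an autonomous storage device.
        The framing and message format are invented for illustration."""
        msg = json.dumps({"op": "PUT", "key": key, "value": value}).encode()
        with socket.create_connection(device_addr) as s:
            s.sendall(len(msg).to_bytes(4, "big") + msg)
            # The device decides how the object maps onto its media; nothing
            # on the host translates keys to files, LBAs, or pages.
            # (Simplified framing: a real client would loop on recv.)
            resp_len = int.from_bytes(s.recv(4), "big")
            return json.loads(s.recv(resp_len))

    # Example (hypothetical device address):
    # device_put(("192.0.2.10", 9000), "movie:1234", "...")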
All right. Okay, so I've got some history lessons here, and I don't think most of you really even care about that.
You know, but just suffice it to say
that disaggregation has been with us
since I started my computer science career
back in the 80s, right?
We were doing NFS way back when,
and the obvious reasons for that
were, you know, to have that consolidated in,
to get rid of some of the physical limitations
of a server plus disk
is to prevent me from having data copies everywhere.
There's limited pathways.
A server can only have so many disks
directly attached to it
before you're either running out of bandwidth
or you're running out of hardware.
And availability.
I can make the disks available,
but I can't make them available with regard to the server
unless I pair up another server and move things around.
But by disaggregating, I can focus everything into a single location,
and I focus my management, I focus my availability problems into one place,
and I have access to it from anywhere.
So if my clients die or fail or do whatever, it's all right.
I just put another box in.
I can still get at my storage.
So those were some of the motivations for why we went to NAS. Scale-out was the same thing. It was disaggregated as well. The idea was, again, I have thousands and thousands of client applications running in the cloud.
I don't want to tie them to a physical box. That would be a mistake, because that means my storage has to be ephemeral, meaning I only use it for that session and then throw it away. If I have needs for persistent data, I have to separate it from the host so that my client application can move from cloud server to cloud server.
So that's why scale-out systems were created: to essentially meet the demands
of those thousands and thousands of client applications
and give me an underlying technology
that allows me to continue to scale out capacity and access
over time by adding more and more nodes.
But I still end up with this fan-in, fan-out architecture
where I have servers talking through some sort of a head to the media.
So I fan in to the guy that's managing the media,
and then back out to the actual media itself.
Scale-out's no different.
So if I have a Ceph environment where I've got a server with 30 drives attached to it,
all of my countless thousands of clients are all fanning into that OSD,
and then he basically fans it out to the drives.
So that's a limitation,
and it's a bad limitation
because it forces you to make hardwired decisions
at implementation time.
You have to figure out what quality of service
you want to provide.
I'm going to provide HDD-level quality of service,
but for cost reasons, I'm going to trap some of that hard drive throughput
behind a server to reduce the cost,
meaning instead of having 10 drives or 15 drives
where I can definitely access all of that throughput through my 10GbE link, I'm going to put 20, 30, 60 drives behind it with the idea that it's okay that I'm trapping some of
that throughput behind that NAS server, but my cost point is lower, right? So we make these
decisions, and no matter what happens in the evolution of our cloud, if those assumptions don't hold true after I make my purchase decision, I can't change it. I'm stuck. My hardware is fixed in the way that I've chosen to architect it. So it's a hard limitation that offloading and disaggregation go toward fixing, meaning that I should have a finer-grained unit of allocation
that allows me to determine how I organize my system.
So here's how scale-out disaggregates.
And I just talked about some of the problems.
Basically, there's a single software component that manages the individual media,
and I stack up these pairs on a single host through a single or multiple Ethernet connection.
So each OSD is like its own little scale-out device, right,
that can either run on this system or wherever the storage is directly attached.
And it's the direct attach that means that I have to run on this server.
If the drive wasn't directly attached, I could run it anywhere.
Kinetic was another disaggregation mechanism.
This was a way that the industry was starting to move in this disaggregation process.
It was a key value API.
It used TCP/IP as the fabric between the two, and had the basic get, put, and delete.
So back to that picture of a key value store,
you could have implemented RocksDB inside of a Kinetic device
and provided that key value disaggregation
and got rid of the file system and volumes
that were in there.
The problems were that it was a unique protocol.
It was brand new.
Nobody had seen it, nobody had heard of it,
and it required applications to be rewritten for it.
And I know this because I implemented one of these devices.
We implemented a key value device at Toshiba,
and the biggest headwind that we faced was around,
how do I use it?
That's a great thing, but now what software uses it?
And at that time, nobody did.
And when people rewrote their applications to use it,
because we had a handful of development programs in place,
they tended to just modify their existing functionality
and leverage the pieces and parts of Kinetic
that were there to match what they were already doing.
So, for example, if you look at Ceph, and we go back to that picture somewhere, it's managing a local disk, and so it was just modified to do puts and gets instead of reads and writes. But it has east and west traffic there, too, because it's talking from one OSD to the next
to do replication and other things.
Kinetic had peer-to-peer operations inside of it, where I could have let the drive deal
with the copies and the replication and the data movement, but nobody took advantage of
those.
And the real power of Kinetic was that it provided a rich tool set
that if you rethought and reimagined your application,
you could get more out of it
by utilizing and offloading not only north and south,
but east and west as well.
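Here is a hedged sketch of that east-west offload idea, with invented class and method names rather than the actual Kinetic API: the host asks the source drive to copy an object directly to a peer instead of pulling it through its own memory.

    # Hypothetical classes; only the shape of the peer-to-peer offload matters.
    class AutonomousDrive:
        def __init__(self, name):
            self.name = name
            self.objects = {}

        def put(self, key, value):
            self.objects[key] = value

        def peer_copy(self, key, peer):
            # The data moves drive-to-drive over the fabric; it never crosses
            # the host's DMA or memory bus.
            peer.put(key, self.objects[key])

    primary = AutonomousDrive("drive-a")
    replica = AutonomousDrive("drive-b")

    primary.put("movie:1234", b"...")
    primary.peer_copy("movie:1234", replica)   # replication without host involvement
    print(list(replica.objects.keys()))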
But unfortunately, there was never multi-vendor support, because Toshiba dropped the product and the software failed to create any colonies out there.
So although it's still being worked on,
it's less of an issue today.
The big one today is NVMe disaggregation.
We're seeing this.
As part of my last job at Huawei,
I worked with NVMe and the Fabrics teams,
and we're now separating, disaggregating across different types of fabrics.
We have InfiniBand and Fibre Channel and RDMA and iWARP,
all these different strategies for disaggregating.
But in addition to that, what's going to come out by the end of the year,
at least the first drafts are almost done,
looking for ratification hopefully in the next six months,
is TCP, and that's being driven by Facebook right now.
It's still a block mechanism, which means that it's not really conducive to scale-out types of environments.
You still need somebody to manage.
Whenever you have a block,
that means you have an address
that you have to manage, right?
I have to know where my data is
and I have to manage that device.
And it gets very difficult to scale out
at the block level.
The big advantage to scale out
is that we use key values
and keys can be consistently hashed.
I can use an algorithm to determine
what system it's on.
I don't have to go talk to somebody to figure out,
oh, it's on this system,
and this is the translation to that.
I can actually just say, this key,
I know exactly where it lives, and go talk to it. And then that device is responsible for returning the value.
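A minimal consistent-hashing sketch of that idea follows; real systems such as Ceph's CRUSH are far more sophisticated, and the virtual-node scheme here is only illustrative.

    import bisect
    import hashlib

    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    class HashRing:
        def __init__(self, devices, vnodes=64):
            # Place several virtual points per device around a hash ring so
            # the keyspace spreads evenly across devices.
            self._ring = sorted(
                (_hash(f"{dev}-{i}"), dev) for dev in devices for i in range(vnodes)
            )
            self._points = [p for p, _ in self._ring]

        def locate(self, key: str) -> str:
            # Walk clockwise from the key's hash to the first device point;
            # any client computes the same answer with no lookup service.
            idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
            return self._ring[idx][1]

    ring = HashRing(["drive-01", "drive-02", "drive-03", "drive-04"])
    print(ring.locate("movie:1234"))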
So it's in the right direction because it's disaggregating,
and TCP could be a game changer
because it could provide for a day
when we see generic TCP ports on our devices,
which is a fundamental piece for offloaded autonomous devices: having a network port.
One of the things that you can do if you have these disaggregated devices (I have Kinetic written down there, but you could substitute NVMe right now; this is an older slide) is that once you do separate them out, the OSDs, those little pieces of software that manage the disk, can run anywhere. So I can now hyper-converge this world by just scheduling my OSDs inside my application cloud. And whenever there's a failure of an application server, they can move with it to another place, another location.
So they can be scheduled
and move around without causing
failure problems.
Alright.
So what is this suggesting?
We moved from aggregated to SAN to scale-out. If we truly disaggregate and we truly put intelligence in the devices, we can get to basically a crossbar instead of a fan-in, fan-out. If I have the ability to essentially take devices and have them participate as intelligent devices that are doing all the data management, taking client requests north and south but then dealing with it internally, I can create an environment where every client can connect to whatever drive it needs to talk to, and whatever drive can talk to whatever drive it needs to talk to for availability, rebalancing, and stuff like that.
And so that's what eusocial is. Eusocial storage is trying to move in this direction.
Now I caution you that this is like an end goal.
This is strategic, long range.
This is not something that tomorrow is going to happen, right?
This is something where we have to plot a path: where are the big pain points, how do we address those pain points, and then start walking in that direction. And that's what we're doing at UC right now: we're trying to plot that path. We're trying to see where in the industry there are pain points that we can address, where we can actually empirically show that this is a direction you want to take right now.
And we walk in that direction.
Next one will be another step and another step.
But eventually, getting to this and in-store compute is the final goal.
So to provide for an overall storage system, these devices need to have all the things that we see in modern scale-out systems today, right? They need to have data availability. They need to scale capacity.
They need to scale access, which means, you know, every time I add another device, I'm getting more throughput.
They also need to define, and this is something I don't see in the existing scale-out products today, lines and classes of service. So think of a class of service as a big bucket of the features and characteristics of a storage system.
So an HDD has a particular performance line of service, right?
It's got high latency, but it's got certain throughput numbers.
And that's a line of service, those performance characteristics.
I then pair it up with data availability.
I can replicate these three times
and get three points of replication.
These are all things that I can build up
into a class of service
that is my cold storage class of service
or my warm storage class of service.
An entire system needs to be aware of these classes of services
and then provide configurable mechanisms to allow users to use them.
In other words, when I write an object, where should it go?
Well, this is the new hit movie, and I need to make sure it's on something super fast, so it's going to go into the class of service that is my persistent memory class. But over time, nobody's going to look at it anymore. It needs to
go somewhere else. And all of that data management as it winds its way through these media needs to
be definable and then have the devices be able to move that storage around as needed.
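Purely as an illustration, a class of service of the kind being described might be expressed as configuration like the following; none of these fields or values are a defined eusocial API, they just sketch the idea of pairing a performance line of service with availability and lifecycle rules.

    # Hypothetical configuration; field names and values are invented.
    classes_of_service = {
        "hot": {
            "line_of_service": {"media": "persistent-memory", "target_latency_us": 10},
            "availability": {"replicas": 3},
            "lifecycle": {"demote_to": "warm", "after_days_idle": 7},
        },
        "warm": {
            "line_of_service": {"media": "ssd", "target_latency_us": 200},
            "availability": {"replicas": 2},
            "lifecycle": {"demote_to": "cold", "after_days_idle": 90},
        },
        "cold": {
            "line_of_service": {"media": "smr-hdd", "target_latency_ms": 20},
            "availability": {"erasure_code": "8+3"},
            "lifecycle": {"demote_to": None, "after_days_idle": None},
        },
    }
    print(classes_of_service["hot"]["lifecycle"])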
So, eusocial storage.
It must have north and south access to allow clients to do actual I.O., and it must have east and west scaling out to provide redundancy,
the quality of service, and lifecycle management.
Since we're essentially talking about adding computational resources into these devices,
at some point, you can conceive of them
doing more than just the data management.
And there is a direct need for that.
When you start looking at edge
and edge computing that is going on,
where our data collection may be outside
of our main compute facilities,
having the ability to scale back the amount of data
we have to send back for further processing is a huge win.
So you can think of cameras attached to storage devices
that maybe need to do filtering of the faces
and only sending faces back,
or actually locating individual faces
before sending back alerts and other things of that nature.
There's a myriad of examples of where, when you get out to the edge, having the ability
to do general compute is an important feature, especially in the world of genomics and other
things like that.
There's lots of places where I can show you this.
And we have an example at the symposium.
We're going to be looking at a database that's doing this, that's distributing the queries
out to all of the devices as well.
So what is eusocial storage?
First of all, don't think hardware.
This is a software abstraction.
Think of this more along the lines of the next evolution of cylinder-track-block mapping
to LBA to key value to the next thing.
It is strictly an API abstraction,
and that allows hardware builders and architects
to build the hardware that matches their particular media.
And so when I think of autonomous devices, with an SSD it might be an SSD with an Ethernet device on it, but when I'm talking about hard drives, it might be three or four hard drives together behind a controller that makes them look like a single key value store.
It all depends on the cost and price point
of what you're trying to achieve as an architect.
So eusocial storage is not trying to make those decisions for you at all.
It's trying to say, here is the rich set of characteristics
that your device can do, right?
So it's a standard object protocol
that disaggregates. It's mechanism-based; policy is all configured.
Cluster operations,
and the cluster operations
are for distributing out the configuration
because we need to know about who our neighbors are.
It is not for creating cluster operations for clients to leverage. So there's no transactional system that I'm
talking about here. I'm talking about strictly managing which devices are participating and
updating it as failures occur or as devices are created. Peer-to-peer operations that allow for copies and data movement
Data integrity mechanisms: being able to essentially checksum your data and make sure it's all good and not rotting away over time. It's an abstraction that provides for improved failure domains. My failure domains become smaller. I don't have a server with 20 drives attached to it where, if the server fails, I lose 20 drives. The failure domain becomes whatever that unit is that is the autonomous storage device itself. And ultimately it would support in-store compute.
So, no
restrictions on media type, form factor,
capacity, components, or fabric type.
That's a decision left up to the hardware architects.
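As a sketch of the kind of API surface being described, the abstract interface below separates the north-south object operations from the east-west cluster, peer-to-peer, and integrity operations; all of the names are hypothetical, since defining the real API is exactly the work described later in this talk.

    from abc import ABC, abstractmethod

    class EusocialDevice(ABC):
        # North-south object operations: what clients actually use.
        @abstractmethod
        def put(self, key: bytes, value: bytes) -> None: ...

        @abstractmethod
        def get(self, key: bytes) -> bytes: ...

        @abstractmethod
        def delete(self, key: bytes) -> None: ...

        # Cluster operations: only for distributing configuration, so every
        # device knows who its neighbors are as members come and go.
        @abstractmethod
        def update_cluster_map(self, cluster_map: dict) -> None: ...

        # East-west peer-to-peer operations for copies and data movement
        # (redundancy, rebalancing, lifecycle management).
        @abstractmethod
        def copy_to_peer(self, key: bytes, peer_address: str) -> None: ...

        # Data integrity: checksum/scrub data so it is not rotting over time.
        @abstractmethod
        def scrub(self, start_key: bytes, end_key: bytes) -> bool: ...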
So what does it look like?
It looks a lot like scale-out systems, right?
So if I was to look at a Ceph box and take that picture of the server with a bunch of OSDs
and strip away the server, it kind of looks like that, right?
Each device is its own object store manager, and it participates with all of the other ones to
provide a unified global object store. So that means that we have these monitors that monitor
the configuration and changes that occur in the configuration, updating all of the participants as that configuration changes.
But we also add the new thing, which is the cast.
The cast is the unit of scale-out.
Scale-out occurs within a cast.
So I can have a scale-out of SSDs, a scale-out of hard drives,
a scale-out of SMR drives.
I can have different qualities of service that have their own scale-out capabilities. And what that allows me to do is that now I have essentially uniform qualities of service that I can manage objects through. And I can do that just by having a defined configuration that everybody shares.
Again, it's an object API, and just like
Ceph, you need things like virtual block devices, because applications still write blocks. There are still applications that use file systems; there are file systems available as well.
So what does some of the
information lifecycle management look like?
So if you look, that's the configuration.
The cluster map is the configuration that we're sharing,
and we make sure that it's consistent between all the members,
and it defines who my neighbors are,
but it also can define a class of service.
For example, here's a content delivery network.
We've got essentially five different casts here,
and for a content delivery network,
we're going to use three of them. And when I put a key value, I tag it with the strategy I want for this object. This object is a CDN object, so the device looks in the cluster map and sees that the first element in the CDN graph is cast3, and so that's where it writes the data.
And then over time, there are triggers that are definable
that tell you when it's time to move,
whether it's age-based or access-based or things of that nature.
Then it gets essentially moved by the device itself
without the client's knowledge.
It just migrates it.
Just like in scale-out, we rebalance without telling the client, because ultimately it's the consistent hashing that tells it where the object actually lives. So the same mechanism for rebalancing and recovery can be used for doing lifecycle management.
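Here is a small, made-up sketch of that CDN lifecycle example: a cluster map with a handful of casts, a policy that walks an object through three of them, and helper functions a device might use to decide where an object starts and where it moves next. The layout and field names are invented for illustration only.

    # Hypothetical cluster map; nothing here is a defined format.
    cluster_map = {
        "casts": {
            "cast1": {"media": "nvme-ssd"},
            "cast3": {"media": "ssd"},
            "cast4": {"media": "hdd"},
            "cast5": {"media": "smr-hdd"},
        },
        "policies": {
            # The CDN policy walks objects through three of the casts over time.
            "cdn": [
                {"cast": "cast3", "move_after_days_idle": 30},
                {"cast": "cast4", "move_after_days_idle": 180},
                {"cast": "cast5", "move_after_days_idle": None},  # final stage
            ],
        },
    }

    def initial_cast(policy_name: str) -> str:
        # A new object tagged with this policy lands in the first cast listed.
        return cluster_map["policies"][policy_name][0]["cast"]

    def next_cast(policy_name: str, current: str):
        # Devices consult the same shared map when a trigger fires; consistent
        # hashing within the target cast then tells everyone where the object
        # lives, so clients never need to be told about the move.
        stages = cluster_map["policies"][policy_name]
        for i, stage in enumerate(stages[:-1]):
            if stage["cast"] == current:
                return stages[i + 1]["cast"]
        return None

    print(initial_cast("cdn"))         # cast3
    print(next_cast("cdn", "cast3"))   # cast4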
What about in-store compute?
So I would not suggest that every eusocial device is an in-store compute device.
Maybe it is, maybe it isn't.
But because of the price point,
I might want devices that are more expensive,
that do more things for me,
and that would be part of my quality of service that I would define.
My class of service here would be
a high-end in-store compute device,
and it would live in its own cast.
And I could scale out those kind of devices together
with each other.
And when it comes time to do a function, again using the consistent hashing function, it essentially will send a function and parameters out to all of the members that have that particular object or object range. So, for example, in a project that's ongoing at UC called SkyhookDB, they objectize the tables in a database,
in a Postgres database, and they're sharding on rows.
And each object has its own set of rows.
And so I can then take a query that says, you know, select all the rows in this table that have some column less than five, or something like that.
I can then figure out those objects that that table is contained in,
because it's going to be contained in a set of objects,
and then push essentially that select down to all of the devices in parallel
and have them select on the portions of that table that they each have.
So that's one of the powers of this in-store compute model: I can essentially associate a certain amount of compute and algorithms with a given object and have that run against that object at any time.
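A minimal sketch of that pushdown idea, with an invented in-memory layout standing in for devices that each hold one object's worth of table rows; in the offloaded model the filtering below would run inside each device, so only matching rows ever cross the fabric.

    from concurrent.futures import ThreadPoolExecutor

    # Pretend each "device" holds one object containing a shard of the table.
    devices = [
        {"rows": [(1, 4), (2, 9), (3, 2)]},   # object 0: (id, value) rows
        {"rows": [(4, 7), (5, 1), (6, 5)]},   # object 1
        {"rows": [(7, 3), (8, 8), (9, 0)]},   # object 2
    ]

    def run_on_device(device, predicate):
        # In the offloaded model this filter runs inside the device itself.
        return [row for row in device["rows"] if predicate(row)]

    def pushdown_select(predicate):
        # Ship the same predicate to every shard in parallel, gather results.
        with ThreadPoolExecutor() as pool:
            partials = pool.map(lambda d: run_on_device(d, predicate), devices)
        return [row for part in partials for row in part]

    # "select * where value < 5", evaluated in parallel on every shard:
    print(pushdown_select(lambda row: row[1] < 5))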
Okay, so one of the things I'm very cognizant of, because I come from industry, from a company that built these devices, is that there needs to be room for open source and proprietary.
If you're going to get investment from the big media players,
there has to be a way that everything can't be open source.
So the devices, what you're seeing there in purple, is kind of the space where companies could provide new products, new devices that speak eusocial.
They don't have to be open.
As a matter of fact, there's all sorts of ways to get competitive advantage by essentially doing smart algorithmic matches
between key value and object stores
to the underlying media.
There's lots of cool things that can be done in that space
when you're right next to the media
and doing things locally.
But there should also be an open source version of it.
So if I wanted to create this with a Raspberry Pi and a device, there should be an open source project
that allows me to do that. And that's what we're doing at UC Santa Cruz, is we're doing that
open source project. We're just starting on it. It's a long way away.
So that's kind of eusocial storage. What are we doing at UC Santa Cruz?
We have, I actually have two grad students now.
Woo-hoo!
I have minions.
So the first thing, like I said, is that this is a strategic view.
This is a five-year view type of thing.
The first thing is, how do we get the industry to believe that offloading is the right direction to go in? We're already looking at it; we're seeing movements in that direction. Samsung has a key value device, and there are other people looking at disaggregation and key value as well. But every time we bring this up in public forums, the first thing we get hit with is, why? Show me. Not just these slides, which are nice, pretty language, but what is the actual empirical evidence that says this is the right direction to go in? So that is actually the first
project that we're engaging in right now is the offload evaluation. We're trying to construct
an empirical model that allows us to compare very disparate systems.
Like if I wanted to show the value of a Raspberry Pi with a hard drive attached to it versus
a big Intel Xeon with 30 hard drives attached to it, how do I compare these two environments?
What is the metric I use? It gets very difficult to show apple-to-apple comparisons.
And so that's what we're trying to do with that first project.
And I'll give you kind of an overview of what we're doing.
I'd love for you to beat me up and tell me I'm, like, wacky
and what's wrong with our strategy.
I only think I have about five minutes left.
And then, finally, the full API definition.
This is talking about a key value store.
What are the APIs for that first key value
store? What should they look like? There's already work going on
in the Object Drive TWG at SNIA. I encourage everybody
to participate in that as well. There's
work going on in NVMe with key
value. A bunch of people are looking at this.
We want to do it.
We want to follow what all that
work is and not reinvent the wheel, but we also
want to look at it with these broader goals in mind.
In other words, what are the other things
we're going to need down the road
and make sure that those things
are moving in that same direction?
Okay.
And then the in-store compute is the other piece.
So the presentation that we're going to be giving
at the symposium is going to talk about
everything from canned functionality.
I conceive of a key value store as canned functionality.
I'm offloading compute that runs on the host onto the device.
That's step one of in-store compute is having a key value store.
There could be compression, encryption, and a bunch of other things that are canned sort of functionality,
to the far end of the spectrum, which is general purpose compute,
where I'm actually allowing people to write whatever they want to write.
And we see examples of that in the industry today.
And we're going to have a representative from NGD Systems give a presentation on a container-based approach,
where I actually deliver a container to my storage device and let it do whatever it needs to do. And then there's a middle ground, an interpreted language, where I'm maybe limiting some of the functionality, but I'm minimizing some of the management overhead, from managing tool chains and debug environments and all sorts of other stuff that needs to go on, down to just the interpreted language itself. So that's in-store compute.
All right, with the last four minutes or so that I have,
I'll talk about this offload evaluation.
So this is the fundamental problem, right?
This is what I described in my presentation,
that we have data management software,
we have media management firmware that runs on the device.
The client I.O. is provided by the client,
but underneath of it all, between the actual media and the server,
is client I.O. plus data management I.O., right?
How do I actually tell you that the right is better than the left?
What is the number I provide you to say from a cost perspective,
from a space perspective,
from a watts perspective,
that this is actually a better approach.
So what we need to do
is we need to define a unit of work.
This is my conclusion anyway.
We need to find a unit of work
that can be equally done in either environment.
It should be independent of the environment
that it's performed on. In other words, it should be able to be equally run on one
versus the other. And equally is an interesting word. It just means that the entire task can be
completed basically. Some environments would be capable of doing multiple units. For example,
if I had 30 drives attached to a four-way Xeon with a ton of memory,
I should be able to do many of these work units.
But a Raspberry Pi may not be able to do
but a fraction of a work unit.
But that's okay because the cost profiles,
the power profiles, and the space profiles
are vastly different between those two environments.
And so when I start looking at it
in terms of dollars per work unit or kilowatt-hours per work unit, this is where I get my apples-to-apples comparison, right? It's because I can now look at these work units as compared to cost and power.
All right?
The problem is, what is this work unit?
What does it look like?
First of all, we need to simplify this.
You can get really complicated really quick.
We're going to start with this research
by just looking at north and south.
We're not going to worry about the other network traffic.
Everything that happens with the network traffic
is just more wins for disaggregation.
And we're going to be looking initially at local traffic, not disaggregated. And we're going to avoid trying to get lost in the details of cycles and particular platform architectures and things of that nature.
So how do we do that?
We introduced the notion of an MBWU, which is essentially a media-based work unit.
And it's some load over time.
It's workload dependent.
But we define it in terms of only the media being the bottleneck.
In other words, how much work can we get out of a particular piece of media?
So there will be a different MBWU for a hard drive, for a Seagate hard drive, for a Western Digital hard drive, for a Samsung SSD. There will be different MBWUs for those that allow you to do these comparisons.
But that is the thing that remains consistent between these environments. If I'm comparing an Intel box using an SSD from Toshiba against a Raspberry Pi using that same SSD, the media becomes the uniform piece of the puzzle.
So if I make the work unit in terms of that media,
then it is directly comparable.
What our plan is right now is to take some key value store. We're using RocksDB, and we're using a known
benchmark of YCSB. We're taking a consistent SSD
and we're going to measure what the unrestricted throughput
of that YCSB benchmark is
to that particular media.
And we want to see that they're not hitting any memory bottlenecks,
any CPU bottlenecks, any other bottlenecks other than the media device itself.
So theoretically, the throughput underneath this,
this throughput here should match what this device is able to do by itself, right?
The maximum it's able to do
is the maximum I see out of the bottom of this, right?
And once I know that I have that,
that becomes my unit of work.
Let's say it's 10,000 transactions per second.
That becomes my unit of work.
I can now figure out how many of these devices I can add onto a server before I start seeing that curve go from a linear progression to a flat progression. And then I know how many MBWUs that server is able to do. On the other side, it's whatever fraction of that 10,000 transactions per second it can actually do.
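As an illustration of how that comparison might be expressed, the sketch below uses entirely made-up numbers: a reference MBWU for the chosen SSD, and two hypothetical platforms scored in MBWUs, dollars per MBWU, and watts per MBWU.

    # All figures are assumptions for illustration, not measurements.
    mbwu_tps = 10_000   # media-limited YCSB throughput of the reference SSD

    platforms = {
        # name: (sustained YCSB transactions/sec, cost in dollars, power in watts)
        "xeon-server-30-ssd": (240_000, 18_000, 900),
        "rpi-plus-1-ssd":     (6_500,      120,   8),
    }

    for name, (tps, dollars, watts) in platforms.items():
        mbwus = tps / mbwu_tps   # can be a fraction of a work unit
        print(f"{name:22s} {mbwus:6.2f} MBWU  "
              f"{dollars / mbwus:8.1f} $/MBWU  "
              f"{watts / mbwus:6.1f} W/MBWU")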
Anyway, that's the work we're doing at UC Santa Cruz. If you have any questions or anything, come find me.
I'd love to talk to you about it and see if there's any interest in what we're doing
and maybe help by participating.
So thank you.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending an email to developers-subscribe
at snia.org. Here you can ask questions and discuss this topic further with your peers
in the storage developer community. For additional information about the Storage Developer Conference,
visit www.storagedeveloper.org.