Storage Developer Conference - #106: Container Attached Storage (CAS) with openEBS
Episode Date: August 19, 2019...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org
slash podcasts. You are listening to SDC Podcast, Episode 106.
Today, this talk will be about container-attached storage.
A little bit about my background.
I've been in storage for roughly 10 years or something.
Mostly worked on software-defined storage systems on a variety of platforms.
To be honest, I was actually looking into doing something other than storage, and then containers re-happened, with the emphasis on re-happened,
because it's not really new.
And, you know, here I am talking here today to you guys.
So before I continue, I want to point out that a central component throughout this talk is containers and Kubernetes.
So a lot of what I'm going to say today won't work without Kubernetes.
This is not about PVs and PVCs, so don't worry about that.
But it is an integral part.
And the genesis of the OpenEBS project was around the idea of providing a platform for stateful workloads in a cloud-native environment. And if you look into how these applications are built,
we believe that it justifies rethinking certain things
in the way that we do them in storage.
But before we go into that, oh, hold on.
Next slide's not coming.
There you go.
As we started to develop on OpenEBS,
we noticed that there are some things happening at the same time.
And there is this movement in parallel, let's say.
And one thing is the people. So the actual end user that uses the storage in cloud-native environments.
And these people are the ones that develop, configure, deploy, and operate these systems in production.
The software itself, obviously, it's not the same as it was 10 years ago,
but there are certain things built into these cloud-native applications from the get-go
that may result in doing things a little bit differently, again, from a storage aspect.
And lastly, hardware, obviously. There's a lot of talk about these type of things, NVMe and SSD,
you know, during this conference. And so, you know, we really want to try to figure out how
we can exploit these new things. So, you know, combining all this together, for me personally,
it made me a little bit more excited about,
hey, we can actually build something nice.
So one thing we learned as we started to talk to these people,
and one thing that stood out specifically,
is that applications have changed
and somebody forgot to tell storage.
And when I say apps, I implicitly mean this client-server model that is almost default.
So you access something on your phone that's the client and the service runs in the cloud.
And storage people, and I am not an exception to the rule, but we tend to think from the device up,
right? So we have our precious non-volatile whatever, and that's the coolest thing out there,
and we try to make that available to the end user, but we really try to invert that and see,
okay, what is it actually that you guys need in this cloud-native environment, and what can we do to help achieve that job or
goal.
So let's look a little bit into these things that have changed. I could put up
a huge list here, but one of the most important things is that cloud-native applications are
distributed systems themselves.
And they are not afraid to use distributed algorithms like Paxos or Raft.
Nobody flinches if they use these applications or these protocols in the core of their applications.
There is this rise of new cloud-native languages, Metaparticle, Ballerina.
And if you look at how these languages are built,
they kind of import containers
and then connect containers with the syntax.
And it's kind of weird when you initially see it,
but when you're developing these type of applications, it might actually make sense.
And the other thing is that the applications are designed to fail,
so this is a built-in thing.
They assume failure, and it's not just across nodes, as I said,
but also across DCs, regions, and even providers,
and it needs to be multi-platform, multi-cloud,
multi-hypervisor, multi-anything.
Databases, you know, there are a lot of these
NoSQL or key-value store databases out there,
and they all have, you know, some form of load distribution,
scale-out, sharding, or what have you.
And the final piece that we observed
is that data sets of the individual containers themselves
are relatively small.
So you will certainly see every now and again
a container that has a petabyte in terms of storage,
but in general, they are relatively small.
And this is because the applications themselves
are distributed by nature, right?
So you don't end up with these heavy storage capacity
type-like workloads.
So to sum that up a little bit:
data availability and performance
are no longer exclusively controlled at the storage layer,
where typically you had this monolithic application
that was running hunky-dory on your system,
and then you virtualized it, and then, okay, that kind of worked.
And if you wanted to scale the performance of your application,
what you typically did is you add more shelves to it, right?
Or you did a forklift upgrade by swapping the head
and changing it for another one.
With a scale-out system, you would add nodes,
or you would add nodes with faster media.
But all in all, it was controlled at the storage layer,
and if your storage layer went down, you had a huge problem.
So there was a lot of money involved in building these reliable storage systems.
So software-defined came along, didn't
really change all that much. Software-defined storage, I mean. Perhaps what it changed is the
cost model a little bit, where the initial hit on buying the system was a little bit less, but
keeping them afloat was a lot harder because all the cool things that the vendors handled,
like blinking an LED, did not always work.
So the other aspect is the people.
As I mentioned, the DevOps persona,
so OpenEBS is very much targeted at these people.
They deliver fast and frequently,
and rumor has it if you work at GitHub your first day,
they tell you you will actually write something
that goes in production today.
So there's a little bit of pressure if you come there.
I don't know if it's true, of course, and it could be just a typo fix.
It doesn't really matter, but you really do deploy something to production.
CI/CD pipelines is something else.
Every commit kicks off a complete build.
Pretty sure you guys do that one way or the other yourselves.
Blue-green deployments is something that is very common these days.
I'll go over an example of what that means.
Software delivery, obviously, has changed a lot.
I usually call it a tarball on steroids, so it's not just the binary,
but also the execution environment defined in YAML that comes along with the application.
And declarative intent is also a thing that you see in these cloud-native environments.
And everything is version-controlled,
and the whole infrastructure is version-controlled
through things like GitOps and ChatOps and whatnot.
And Kubernetes, you know,
I think we could consider it to be
the unified cross-cloud control plane.
So it basically allows you to go
from private to public to different clouds.
Kubernetes is basically the cornerstone of it all.
So hardware storage trends.
This is probably something that you guys
perhaps even know better than I do.
But one of the things that we observed
is that storage vendor limitations,
whichever storage vendor it is,
bubble up into the application design.
And that makes it really hard to write code that is optimal
if you move it from one cloud to the other
or from private to public or vice versa
or even have this hybrid model
where you have something private and something public.
So there was a lot of friction we saw when talking to these people.
Very rudimentary things.
Don't do CIs when I do backup.
Well, don't do backups while I do CIs.
These typical things create a lot of friction.
Storage administrators go bananas.
All those workloads coming left, front, right, and center.
They need LUNs every day, every minute.
And one of the things that stood out to me is one of our friends
who was working at a website that allows you to book hotels.
If you think of the top two, you probably have a 50% chance of hitting the right one.
And he basically said, you know, we're done with these storage systems. They're too complex. They have their own manual. It's all this, it's all that. We simply
use direct-attached storage with NVMe because nothing goes faster than that. A persistent DIMM?
Sorry. Yeah, yeah, yeah. So increasing core count also creates new challenges. So, well,
maybe not so much challenges, but concurrency primitives are built into languages these days.
So as these developers write these concurrent programs that are falling in their lap, they don't have to spawn a thread.
They just use channels or something like that.
And so they create very concurrent software, maybe not even intentionally, but that requires also some stuff at the
background. So this picture, I think you guys have seen it. I haven't collected this data myself.
This actually lives in the GitHub repo. I find it very interesting, but it clearly shows the trend
that the core count goes up, which ties back into these concurrency primitives in languages,
and the frequency creeps up a little bit,
but doesn't change all that much.
The cool thing is, though, is that that is a perfect match.
If you look at how NVMe at the core works
with those queue pairs and all these type of things, RDMA,
and a 1U server,
I don't know if the amount's still correct.
I grabbed this picture off the Internet,
but there's actually a physical box in the lobby,
if you have seen it.
But, you know, it's kind of amazing
that this lives in just one 1U box, right?
So the other thing is that with containers,
you can actually control very granularly
which container gets access to what.
So I believe you could put 24 cores
or 48 cores in this system.
You can actually, through containers,
decide what resources should be dedicated to storage
and what should be dedicated to the application.
So it is very flexible.
Makes it also very hard, to be honest.
But one does not simply create a new storage system.
As I mentioned, I've been in storage for 10 years,
so I know how very, very hard it is.
So we didn't really want to create a new storage backend per se.
And we also are not talking here about having found this new B-tree++++ algorithm that's even faster than all the previous ones.
But we do see an opportunity to innovate in this space if we keep the focus on cloud-native applications.
So the question that we asked ourselves is,
what if storage for container-native applications
was itself container-native?
So actually put the storage controller in a container
and make that, you know, like a first-class citizen
of the DevOps persona toolbox.
So it kind of looks like DAS, right, but not really.
So we thought about CAS and then referred to it
as container attached storage.
I picked up the SNIA dictionary, I think is what it's called.
CAS is not in there.
So maybe next year, who knows.
But I'm well aware of the fact that it's also used for content addressable storage,
but that's a completely different ballgame.
So based on these conversations and these limitations that we found by talking to these people,
we put some design constraints on the system that we wanted to build.
And one thing that we did not want to do is build another distributed storage system.
There is a plethora of distributed storage systems. Building these systems is really, really hard. As I mentioned, cloud-native applications are distributed systems themselves,
so putting distributed on distributed is an operational nightmare waiting to happen.
And obviously, what's most important is small is the new big, because those data sets are
relatively small. Cloud-native versus cloud-washed, I want to point out this a little bit. What it
means to us is that you don't just pick up your application, put it in a container, and say that
you're cloud-native. There's more to it than that. So the idea is that we have a per-workload
storage system using declarative intent by the developer,
so the developer has full control, no more friction between the departments and these type of things.
Reduce the blast radius, so if one of the microservices goes down,
not the whole application goes down, or even worse, the whole department of the company goes down.
Obviously, it runs in containers for containers, so this implies that it has to be in user space.
It's not a cluster of storage, rather a cluster of storage instances. So, small but
fundamental difference. So, how does this look a little bit? So, on the left-hand side is what you
typically had Kubernetes 101 when it initially started out. There was no state. So, these were
all, let's say, web servers, and you could scale them up and scale them down.
There's actually a command called scale,
and you enter a number and, let's say, five,
and it would spin up five instances and so forth.
So now we want to basically do the same thing,
but stateful.
Again, based on Kubernetes, leveraging Kubernetes,
through YAML, which the developers use to declare their state.
And then we would provision next to the actual container
an additional container that contains your data.
So there's a lot of contains in that sentence.
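To make that a bit more concrete, here is a minimal sketch of the developer-side declaration; the names, image, and paths are purely illustrative rather than an actual OpenEBS manifest:

```yaml
# Illustrative only: a stateful workload declaring the storage it wants.
apiVersion: v1
kind: Pod
metadata:
  name: my-database                    # hypothetical application pod
spec:
  containers:
  - name: db
    image: postgres:11                 # example image, not from the talk
    volumeMounts:
    - name: data
      mountPath: /var/lib/postgresql/data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: my-database-claim     # the claim satisfied by the storage containers
```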
So the challenges and the solutions that we think we solve, to sum them up.
So small working sets, as I mentioned,
and this allows us to keep the data local.
Local doesn't necessarily mean
that's physically on the same node,
but ideally that's what you want to achieve.
Ephemeral, so some workloads that are stateful
don't necessarily need to be preserved
if the container is removed.
And these type of storage systems are also very familiar.
And in our case, the storage lives as long as the pod.
So if you clean up the pod, the data is gone.
You don't have to do that, by the way, but that's an option.
Scale by N, just add more N containers.
Mobile workloads, we can follow the workloads.
Obviously, there is some inertia involved.
You cannot just instantly put data from one node on the other node.
But as I mentioned, the working set size is relatively small.
That makes it feasible to do some form of replication in the background
and then move the workload when it's done.
The DevOps persona is responsible for the operation,
so it's just another microservice to them.
Cloud lock-in.
Some say that cloud is the new proprietary.
Some people go all ballistic on AWS and use everything AWS has to offer,
and then they want to move to Azure, and then they figure out,
oops, I've locked myself into the Amazon infrastructure.
So it's important to be able to move your workloads across these clouds.
Per workload optimization, obviously, again, declarative intent.
I don't need compression because I'm storing images
or I want to have this, I want to have that, you name it.
So a little bit zooming in, how this looks.
So if you put this in perspective,
you have the application, which has a persistent volume claim,
and we basically insert the logic that makes that data persistent.
On the right-hand side, we zoom in a little bit, and we'll go over the details as to what they mean.
But the general idea is that everybody gets their own storage controller,
and everybody is happy, so to speak. So as I mentioned, Kubernetes is an integral part of our solution.
And we did not want to integrate but build on top.
A lot of storage solutions, through CSI, and there was a talk about this yesterday,
basically duct-tape their storage system through a plug-in
and make it available in Kubernetes.
But you lose all the flexibility and the mobility
if you do that.
So we use the operator framework
and the Kubernetes native API
to construct a storage control plane
because, after all, there is no storage controller
that we can talk to, right?
All of that lives in the containers themselves.
And we use a thing called CRDs,
custom resources. It doesn't really matter, but it's basically a value that you can put a callback
on. If it changes, you execute some logic. So you can do things like failover and failure
detection that way. We obviously also need to keep track of allocation, the status, the usual
things that you need to do in a storage system.
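As a rough illustration of what such a custom resource could look like (the group, kind, and fields below are my own sketch, not OpenEBS's actual schema), a replica object might carry state that a controller watches and reacts to:

```yaml
# Hypothetical custom resource; names and fields are illustrative.
apiVersion: example.openebs.io/v1alpha1
kind: VolumeReplica
metadata:
  name: pvc-1234-replica-0
spec:
  capacity: 5G
  node: node-2                  # where this replica is supposed to live
status:
  phase: Degraded               # a watcher sees this change and can trigger failover
```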
Another part that is very important is the node device manager. I guess, since we're here at a SNIA event, it kind of looks like Redfish/Swordfish-type
inventory management, but it is important to figure out where you put your data,
what devices are available, and what type of storage pools and classes you want to have and build using this information. The other part is visual topology and end-to-end testing, which
is a tool that we developed out of a necessity, because how do I verify that my workload that
runs on my storage system actually works on this persistent cloud storage system? So we needed to verify that.
And all of these pieces themselves, by the way,
are microservices again.
So we really use the microservice paradigm
when building these type of systems.
So a little bit more of how this works.
So when a developer writes something,
I want a persistent volume and yada, yada, yada,
that basically eventually gets executed and sent to the Kubernetes API server,
where the API server or the controller basically does an upcall to the OpenEBS provisioner.
And this is where we inflate the volume that says how big it should be and whatnot.
We inflate that to a set of microservices.
So that's basically, there's a lot more to it, obviously.
It also depends on what type of storage you are actually using.
But in a nutshell, this is what it is.
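In other words, the developer-facing piece is roughly a persistent volume claim like the one below; the storage class name is an assumption for illustration rather than the exact class an OpenEBS install provides:

```yaml
# Illustrative claim that the provisioner would inflate into
# a controller plus replica microservices.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-database-claim
spec:
  storageClassName: openebs-standard   # assumed class name
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
```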
Visualization, so this is WeaveScope.
It has nothing to do with OpenEBS itself, but we contributed some code to it that allows you to see the topology, let's say, of your microservice and how it is connected to storage.
So you could say that it's kind of like a server enclosure management for the cloud, but not really.
But you get the idea.
So the other thing I mentioned is blue-green deployment, so what that means.
And I put a little star at the Kubernetes service,
and I need to explain this a little bit.
So a service is basically the thing that you connect to
when you talk to a service that runs in the cloud.
And it is this service that routes your request
to the right container, right?
And I'll come back to this later, but it is important to
note that you basically talk to the service and the service knows, hey, this request needs to go
to this microservice. And by default, these services are just simple iptables rules. So, you know,
we really don't want to do that in storage, obviously, so I'll get back to that. But the idea is that if I want to upgrade, let's say, the controller that is on version 1.2,
I spin up this new version of that container, and then I basically swap it in and out.
So you can do online rolling upgrades, and this is what they call blue-green,
because if it suddenly starts to fail, I swap it back, right?
It's a very simple model, actually, and, you know, kind of makes sense.
And it's compared to the more traditional failover of two heads doing SCSI reservations
and importing the devices and things like that.
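A minimal sketch of how that swap can be declared, assuming the controller runs as an ordinary Kubernetes Deployment; the image name and tags are illustrative:

```yaml
# Illustrative controller Deployment: bumping the image tag and re-applying
# rolls the new version in; rolling back swaps the old one back.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pvc-1234-ctrl
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pvc-1234-ctrl
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # start the new controller first
      maxUnavailable: 0    # only then take the old one down
  template:
    metadata:
      labels:
        app: pvc-1234-ctrl
    spec:
      containers:
      - name: controller
        image: example/volume-controller:1.3   # was 1.2 before the upgrade
```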
So, based on these services, other companies spun up projects like Istio
and Linkerd, you may want to look into those, which basically augment what the default service from Kubernetes does at a generic service level.
But we do this at storage level, and instead of calling it a service mesh, we call it a data mesh.
Because imagine that these nodes that are one way or the other talking to each other,
a container-attached volume is a service consisting out of a controller
and a replica somewhere.
You don't know where it is.
You don't really care either.
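To picture what "a volume is a service" means, here is an illustrative sketch of the kind of service object that could front a volume's controller; in practice the data mesh augments this rather than relying on plain iptables routing, and the selector and port are assumptions:

```yaml
# Illustrative service fronting a container-attached volume's controller.
apiVersion: v1
kind: Service
metadata:
  name: pvc-1234-ctrl-svc
spec:
  selector:
    app: pvc-1234-ctrl    # routes to wherever the controller pod currently runs
  ports:
  - name: iscsi
    port: 3260            # standard iSCSI port, assuming an iSCSI front end
    targetPort: 3260
```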
So there's this list from Mr. Peter Deutsch, if I pronounce his name correct,
the fallacies of distributed computing.
I basically condensed it to the only constant is change.
So what we need to do is dynamically reconfigure and find the optimal path between these microservices that process your storage I.O.
So rescheduling is one thing.
Kubernetes will, without a doubt, stop your controller and spin it up somewhere else.
And so you need to handle these type of situations.
Transient failures, so, for example, a node has some issues connecting to the network or whatever,
you really want to make sure that you detect those
and can act on these type of issues.
And default Kubernetes services
don't provide any functionality like that.
The other thing is that you don't know
if you're dealing with a VM or bare metal,
and it is really, really important to figure out what type of connections I could use.
So, for example, there is this project from Intel
called Clear Containers.
I think it's now called Kata Containers.
It doesn't really matter.
But it's basically a stripped-down hypervisor,
and you only run your container in that VM.
So it won't allow you to run Windows or anything like that,
but just allows you to start up a small binary,
and that then could connect with virtio, let's say.
So you need to know all this.
So dynamic reconfiguration, really integral, important part.
So how does this look?
And this is a little bit of an example that I put up on the slide
to hopefully get the point across.
So you have this, and the developer doesn't write this himself: on the right-hand side, that's the YAML that defines how I want to
connect, ideally, my controller and my replicas or my application or whatever it is. And the developer
doesn't have to write all of these things every time he wants to use a volume. These get
pooled into storage classes and whatnot. So this gets set up once, and you basically pick a class, and then you get these types
of features. The connection types and the state are reflected in Kubernetes itself through
those CRDs. So you can actually observe, hey, my controller at this point is using NVMe over Fabric to talk to the other systems.
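As a rough sketch of the class a developer picks, with a provisioner name and parameters that are my own illustration rather than the real OpenEBS schema:

```yaml
# Illustrative storage class; provisioner and parameter names are assumptions.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-standard
provisioner: example.openebs.io/provisioner   # hypothetical provisioner name
parameters:
  replicaCount: "3"      # how many replicas to keep
  compression: "off"     # per-workload policy, e.g. for already-compressed images
  transport: "auto"      # let the data mesh pick iSCSI, NVMe over Fabric, or virtio
```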
So iSCSI, NBD, encryption, also very important in the cloud.
Kubernetes has these constructs of Kubernetes secrets where they can store keys.
So we can use Kubernetes to grab those keys and encrypt the data as it goes over the wire,
which obviously is an integral part of public clouds.
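A hedged sketch of that piece: the key material lives in an ordinary Kubernetes secret, and the volume's class or claim would reference it by name (the names and reference mechanism here are assumptions):

```yaml
# Illustrative secret holding key material for encrypting data over the wire.
apiVersion: v1
kind: Secret
metadata:
  name: pvc-1234-wire-key
type: Opaque
stringData:
  key: replace-with-real-key-material   # stringData avoids hand-encoding base64
```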
So to put it in a little bit more perspective, a very, very simple example.
So imagine that this is the case.
I have two nodes.
The controller is spun up in one node, and the replica or the target needs to be on the
other node.
So how do we find each other?
We do not want to put all the logic of the
controller and the replica being able to find one another in the code. That would make it very,
very difficult, right? Because the permutations are limitless. And so basically what it does is
that it contacts the service mesh, which is an integral part of the OpenEBS operator, and says,
I need to connect to this thing, but I don't know how I can connect to it, because, you know, how should I know? So then the operator talks to the node where
the replica needs to be, says, well, based on what I found through the node device manager, you should
be able to, in this case, for example, spin up an iSCSI target, and then the replica says, okay, I'm
done, you can move on, and then the other side opens the initiator, and then a data path exists, and we can actually transfer data.
I couldn't find a faster car than this, but, you know, it's besides the point.
So, you know, and so why is this?
I mean, this is a very simple example, right?
But imagine the case where this replica was on the same node.
At this point, I would not want to use NVMe or NVMe over Fabric or something like that,
but I probably want to stick to virtio, for example. It depends, of course, but you get the idea, right? Because this can change at any given point in time. So the idea is that storage just
fades away as a concern from the developer perspective.
And a developer can focus on developing applications, and the storage admin can go back and do whatever he was doing without worrying about all the demands that the developer has for a particular workload.
So let's talk a little bit about the implementation.
So as I mentioned, we started this out as, you know, is it feasible?
So we have several data engines, as we call it.
Jiva was the first one.
It's basically the primordial soup that we use to figure out,
does this actually stick and fly?
So, you know, throw it against the wall and see if it sticks idea.
And it actually turns out that people really like
how easy it is.
It is really easy to set up.
You don't have to do anything.
You just define that you want a volume,
and boom, off you go.
And it's also very instrumental for us
to find cloud-native use cases.
As I mentioned, it's just a microservice,
so we can plug these things in and out.
The biggest problem, however, for us to solve
at this point or at that point was,
okay, how do we do user space I.O.?
Because user space I.O. is not necessarily about performance for us, even though that's a nice bonus.
But the problem is that on public clouds, you have different kernels, and you don't really want to taint the kernel.
If you spin up a GKE instance, you get a very minimalistic kernel, and they removed all the drivers.
There is no iSCSI initiator, for example.
So that's a big problem, right?
So we basically said, okay, we'll just, well, not just.
It's a lot of work, but we'll just do it in user space.
And obviously the tainting is something that you also don't want to get into
because, you know, will Red Hat actually help you if you have a kernel panic
and they see that you tainted the kernel with your own kernel module?
So that was another reason to do that.
Also a side note, and maybe this is a surprise, I don't know,
performance in the public cloud is not yet the biggest concern.
So, you know, very low microsecond latency workloads you don't see in the cloud yet.
I'm pretty sure that will come, but as of today,
that's not really necessary. So we don't need 10 million IOPS for a particular workload, let's say.
In any case, to do this IO and user space, we thought of this idea of an input-output container,
IOC for short. The idea was based on observing patterns in Kubernetes itself, where you basically
come to the conclusion that you have these microservices that constantly talk to one another.
So instead of doing I.O. through the kernel, why don't we do I.O. through microservices like,
you know, they seem to do all the time. Again, the Node device manager, instrumental here to figure
out, you know, what type of devices do I want to use, what type of devices can I use.
And the IOC is a daemon set, which is another construct of Kubernetes
that spins up on all the nodes that you select through label selectors.
But you can forget about that.
The thing is that the daemon set is actually grabbing the devices
and making them available through the different type of fabrics that you want to support.
So iSCSI, NVMe, network block device, whatever.
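A rough sketch of such a daemon set; the image, node label, and privileges shown are illustrative assumptions rather than the actual OpenEBS manifests:

```yaml
# Illustrative daemon set for the I/O container (IOC): one instance per
# selected node, with access to that node's block devices.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: openebs-ioc
spec:
  selector:
    matchLabels:
      app: openebs-ioc
  template:
    metadata:
      labels:
        app: openebs-ioc
    spec:
      nodeSelector:
        example.openebs.io/storage-node: "true"   # the label selector mentioned above
      containers:
      - name: ioc
        image: example/openebs-ioc:latest
        securityContext:
          privileged: true        # needed to claim raw devices and huge pages
        volumeMounts:
        - name: devices
          mountPath: /dev
      volumes:
      - name: devices
        hostPath:
          path: /dev
```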
This is based on SPDK.
I won't go into SPDK.
It's been talked about a lot.
You either love it or hate it, I suppose.
But the general idea is that it uses user space IO to DMA
straight into the device by leveraging huge pages.
So you don't necessarily have to do this with SPDK.
There are other frameworks out there that more or less do the same.
So this is not a logical diagram,
but it basically visualizes how everything connects together.
So you have the IOC that has ownership of the devices.
There is a sidecar container, which is another pattern.
You can forget about that,
but this basically allows us to interact with it.
The data mesh that talks to the target and the replica in this particular case
so that they can find each other.
And the I.O. completely flows in user space, obviously, except for network I.O.,
which we have tried as well.
But I'll come back to that a little bit later.
So the other thing that we did is, we said, okay.
So what if we could do virtio straight from the container?
So a lot of applications are getting rebuilt.
And would it not be easy or cool or nice or whatever you want to call it
if, instead of calling io.read or io.write or whatever language you're using,
the implementation actually uses virtio in the back end, right? So instead of going through a block device, you actually use virtio directly.
So we implemented this, and it's kind of like a cloud-native I/O SDK, perhaps, if you want to call
it that. We implemented SCSI and block in Go, because that's
the language that is used a lot
in the cloud-native landscape.
And the primary reason is that I guess
there is this notion of, you know,
if persistent DIMMs and whatever you want to call them
are so fast, why do we need file system abstractions
to begin with?
Why won't I just use my database
and write to the device directly, right?
But there was no libvirtio, unfortunately,
so we created one.
So we did a small test,
SPDK in full polling mode.
SPDK comes with an FIO plugin.
So what we did is we developed this virtio library,
developed this virtio FIO plugin,
and then wanted to use virtio to talk to SPDK,
which constitutes our IOC.
So we had some expectations.
I think it's always good for when you do an experiment
to figure out what you more or less can expect.
We expected that performance would drop
because SPDK uses poll mode drivers
that constantly loop in a tight loop to figure out if there's work to do.
We did not want to do that because if we did, all the containers using OpenEBS would spin at 100% CPU and nobody would look at us.
So that was not an option.
So we used eventfd.
It's an interesting syscall.
I'd never heard of it because I came from a different platform.
But anyways, we expected that the result
was roughly around what the kernel would do on this device.
And it's a little bit of quid pro quo, right?
We were okay with dropping, let's say,
20% performance if it actually worked.
So the results initially were not that good.
For SPDK, yes.
For us, no.
So 490,000 IOPS, and the kernel was close to that, but we did not get anywhere near.
So we looked into adaptive and hybrid polling.
Adaptive polling is what it's called in the QEMU space.
Hybrid polling is what the kernel calls it.
And we started to play with this adaptive polling.
So if there is a lot of work to do, we do a lot of work.
If there's not so much work to do, we sleep a little bit, let's say.
So this has an impact on latency, so it's configurable in a sense.
But once we reached the five-millisecond sleep time,
we basically got results that actually outperformed the kernel as well.
So we were pretty pleased with that,
and that basically allowed us to continue and explore other possibilities using this methodology.
So, yeah, okay, I went through this a little bit already.
So SPDK can outperform the kernel,
but again, this was not our goal.
Our goal was to do storage in containers,
and because of that, we have to be in user space.
So it's feasible to do.
This eventfd thing, it has a huge impact on performance.
I was actually stunned to see by how much.
This was around the time frame of Spectre and Meltdown,
so I don't know if I had a kernel that had a partial fix or whatever,
but I was surprised to see that.
And basically the lesson that we learned in general is that we need to implement an I.O. path that is single-threaded and lockless.
So you can do complete asynchronous I.O. without any blocking syscalls and process all the I.O. that way.
And that basically gives the best results.
So other things that we looked into, because obviously not everybody will rewrite their application using virtio
or whatever
protocol or SDK
emerges as the standard.
So we have support for iSCSI,
NBD, TCMU, which is
a subsystem in the Linux
kernel that basically memory-maps
the SCSI commands into a
user-space I.O. region that you can then read
to process the I.O.
But to really keep up with the low latency devices,
we really need to move to things like NVMe over Fabric and whatnot.
As I mentioned, we also looked into networking.
So if we switch to iSCSI, let's say, then we're in user space,
but then we would still break into the kernel to do TCP IP.
So VPP-VCL is a BSD socket-ish type-like system,
and it's basically almost the same,
but it is like SPDK for networking.
So DPDK obviously is for networking,
but it doesn't provide like a full-blown TCP stack, right?
So it has packet processing, but not so much for TCP.
So we started to do another experiment where we basically replaced all the socket calls
and all the read calls where appropriate to use the VPP-VCL calls,
and we saw a huge increase because of that.
Now, I do have to say that this increase might also be actually due to the poor quality of that code in that time frame, but it did dramatically increase the performance.
The downside of this is that, like SPDK and DPDK and thus also VPP, VCL, they require their own core.
So then we at least needed two cores to do this full polling mode,
and that didn't really look like an option.
So at this point, we stepped away from that.
But if we ever encounter a situation where people really, really want to do it, it's available.
The other thing I found recently, well, a couple of months ago, is Microsoft FreeFlow.
It basically does the same thing as VPP, VCL, but uses a different approach.
But one thing that these things have in common is that they use the LD_PRELOAD trick.
Some call it a hack.
I think it's an elegant solution.
But anyway, to intercept the syscalls and then offload it to whatever their implementation was.
And that brings me to one of my last slides, actually,
is that we get a lot of questions from people saying,
hey, I want to do file.
Well, no, we do block.
Yeah, but I want to do the same thing with file.
And the problem there, obviously, is that you pull in the whole file system,
and this whole latency, and there have been a lot of talks already
about barriers in file systems and so forth.
So we know what these problems are,
so I'll skip over them.
But the idea is that you could use the same approach
with an LD preload trick
where you basically intercept the read and write calls.
And for databases, this is actually very feasible
because the thing is that because it's in a container,
it's a very isolated environment,
so the developer clearly controls exactly how this system works.
And these write-ahead logging and these compactions,
they are very typical I.O. patterns,
so you don't have to implement all the things
that you could do on a file system.
So the library is mounted in a namespace,
and that's propagated into the kernel.
They call this mount propagation.
And then you could actually do I.O.,
intercept the things there,
and put it over the wire,
for example, over virtio or something else.
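A hedged sketch of how that mount propagation piece can look in a pod spec; the paths and names are illustrative:

```yaml
# Illustrative snippet: the volume mount uses mount propagation so that mounts
# created on the host side become visible inside the application container.
apiVersion: v1
kind: Pod
metadata:
  name: app-with-intercepted-io
spec:
  containers:
  - name: app
    image: example/database:latest
    volumeMounts:
    - name: shared
      mountPath: /data
      mountPropagation: HostToContainer   # new host mounts propagate into the pod
  volumes:
  - name: shared
    hostPath:
      path: /var/openebs/shared           # hypothetical host path
```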
Yeah, so the other thing is CI/CD testing,
so end-to-end testing, if you want to do this.
This is, again, what we refer to as Litmus.
So, yeah, that's basically, I think, roughly it for this slide.
So summary.
So what we try to do with OpenEBS is bring advanced storage features to individual container workloads,
cloud-native, not cloud-washed,
and using the same paradigms that these developers are accustomed to
so it feels like a natural fit for them.
The other thing is that we do I.O. handling in the IOC.
Controlling the release cadence, again, that's the most important thing for us.
We want to be able to upgrade the I.O. path if we see fit
and not reboot the kernel or patch the kernel or things like that.
Declarative provisioning and protection policies.
Obviously, if you do storage, you want to have a form of backup.
No matter how cloudish you are, always make sure you have a backup.
Remove the friction between teams.
Multi-cloud from public to private and vice versa.
And as I mentioned, it's not a scale-out storage system,
but it's a... Yes.
So with that, are there any questions?
Okay, well, thank you for your time,
and enjoy the conference.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list
by sending an email to developers-subscribe Thank you.