Storage Developer Conference - #38: SPDK - Building Blocks for Scalable, High Performance Storage Applications

Episode Date: March 28, 2017

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast. You are listening to SDC Podcast Episode 38. Today we hear from Benjamin Walker, software engineer with Intel Corporation, as he presents SPDK, Building Blocks for Scalable, High Performance Storage Applications, from the 2016 Storage Developer Conference.
Starting point is 00:00:50 This is the agenda. So we're going to cover first what the Storage Performance Development Kit is, and very briefly how we got started. I think that's less interesting to this crowd than some of the other things. We're going to talk about our NVMe polled mode user space driver, which was really the starting point of the whole thing. We're going to talk about our NVMe over Fabrics target. We're going to talk about where we're going and the future things we're doing, and then some information about how to get involved.
Starting point is 00:01:28 So first thing we're going to cover is what is the Storage Performance Development Kit. Before I explain what it is, let me outline the problem. And this is a problem I'm sure you're all very familiar with. But a number of years ago, the I/O per second of a hard drive was about 500, something like that. These are all read numbers, 4K reads. Latency was somewhere around 2 milliseconds. And that persisted for a decade or more. Then we got SATA NAND SSDs.
Starting point is 00:02:03 These could do 25,000 IOPS. I saw some go up to 50,000 occasionally. You're talking about maybe 100 microseconds of latency. More recently, we get the NVMe SSDs. These don't really improve latency because we're still talking about NAND media, the same backing media, but we get huge bandwidth numbers. Particularly
Starting point is 00:02:26 we can do much, much more in parallel. And that's really the difference between NVMe and SATA SSDs. NVMe SSDs can do wildly more work in parallel. And the NAND itself is highly parallel. So the real
Starting point is 00:02:42 problem we saw, and this was two or three years ago, the real problem we saw was that we knew there was a new SSD coming. We knew the numbers were going to be much better. So we went and figured out what is the cost of the software running on top of these SSDs that we thought we were going to be building sometime in the distant future, 2017. It's 2017 now, so it is the distant future. So we wanted to make sure the software was ready when the device came to market, and SPDK is really our attempt to solve that, or at least start the solution to that problem. So SPDK is fundamentally a set of
Starting point is 00:03:25 C libraries, user space static libraries with headers. It's nothing magical. We will call parts of it a driver. That's weird, to call a user space library a driver. It's a driver in the sense
Starting point is 00:03:41 that, yes, it controls the hardware directly. It reads and writes registers, it maps a BAR, but it's also just a C library. It's not part of the kernel, nothing like that. All the code is completely open source. It is three-clause BSD licensed, so very permissive. You can take it and pretty much do whatever you want. It's all on GitHub. We try to document as thoroughly as we can. The driving principles behind SPDK are that
Starting point is 00:04:13 it is user space, as I mentioned. It is completely lockless, at least through the I/O path. Sometimes we will implement code that has locks in it, when it does not need to be performant and it would be convenient for you. But all of the I/O paths, anything that needs to be performant, are completely lockless. We do not rely on any interrupts from the SSD. It is polled mode only. That works out because it is both far more efficient, but also because we don't have interrupts in user space. It can do millions of IOPS per core in terms of 4K reads and writes to SSDs, which I'll get into a bit. And it was really intended always for Intel Optane SSDs
Starting point is 00:04:59 before we even had the name Optane three years ago. That's what we were doing this whole time. We released it before we announced Optane because it does benefit NAND SSDs, but the purpose of this project was for Optane. I'm going to talk a little bit about, and I don't know if that blue is going to come through or not, but there's
Starting point is 00:05:22 four blue boxes up there. I'm going to talk a little bit about the different types of things that go into SPDK. Again, it's a collection of C libraries. The idea is that you can take the ones you want, mix and match, integrate them into your application. We have a number of example applications and supported applications as well, official applications. But they're all based on these composable libraries. So we break the libraries up for organizational sake into four different categories. The first of which is hardware drivers.
Starting point is 00:05:57 These are the things that actually talk to the hardware. They map the PCI address space of a PCI device into a user space process, talk directly to the BAR, they do MMIO, those sorts of things. The next type of library or component we call storage services. These are the things right above the drivers. They're doing something like abstracting away the, you know, the most common one is a block device abstraction layer, where you can use a common protocol to do reads and writes and it will forward the I/O to the NVMe driver or the kernel or something like that. These are simple abstractions on top of the base drivers.
Starting point is 00:06:39 And I'll go through examples of all these. Then we have libraries and applications even that are for implementing storage protocols. There are two of those right now, iSCSI and NVMe over Fabrics. These are ways to export block devices over the network. And then the fourth type of library are things designed to be used on the client side. So libraries for writing an NVMF initiator or something like that.
Starting point is 00:07:11 All right, so let's fill in some of these big boxes. The two drivers we have released right now are, first, an NVMe SSD driver. It is NVM Express 1.2.1 compliant. We stay up to date with the spec. We are in regular contact with the spec writers, work very closely with them, give a lot of feedback, report a lot of bugs. So all of the features of the spec are essentially supported by SPDK and will continue to be indefinitely.
Starting point is 00:07:48 We also have a driver for what Intel calls its Quick Data Technology, sometimes called IOAT. Also, the internal code name was Crystal Beach. This is a DMA engine present on some Xeon platforms. And so this gives you user space access directly to the DMA engine on the platform, and you can do asynchronous copies and fills. And it has other features as well,
Starting point is 00:08:14 but we really only expose copy and fill. And again, polled mode access makes that fairly efficient. Right now, the biggest thing we've just released is support for NVMe over Fabrics. This is both a library that you could incorporate into your application, your
Starting point is 00:08:36 stack, as well as an application that uses the library and just implements an NVMe over Fabrics target. We also, a number of years ago, wrote an iSCSI target based on top of the NVMe driver, and we just recently open sourced it. So this component is actually
Starting point is 00:08:57 one of the oldest things, but it's the most recent thing we've open sourced. But it's an iSCSI target based on all the SPDK principles. There's a SCSI library which does all sorts of SCSI processing. There's an iSCSI library which does the network requests and things like that. There's a block device abstraction layer which is a common block interface. It has operations like read, write, flush, trim,
Starting point is 00:09:29 and you can do those and it will forward them to different backends. It could be the kernel's libaio; our NVMe driver is obviously the one that's most important to us; but we have a malloc backend as well. And I'll get into this a little bit in a bit. So let's go into how SPDK got started. I just have one slide here I'd like to cover very quickly. And that really requires a discussion of DPDK, the Data Plane Development Kit. I think DPDK was started nine or ten years ago at this point. This is a toolkit, much like SPDK, designed for networking, network technologies, so that basically vendors could build switches
Starting point is 00:10:23 and networking products on standard Intel silicon. They don't need specialized platforms anymore. And it uses a lot of the same concepts that we're going to use in SPDK. It is free and open source, of course, but it is polled mode, user space drivers, zero copy through the stack, all these sorts of things. SPDK does depend on DPDK, and I just want to talk about that a little bit. We don't depend on this huge DPDK project.
Starting point is 00:10:56 We depend on a very tiny subset of DPDK, mostly around the part that abstracts out how you unbind devices from the kernel and rebind them to user space, because we support Linux and FreeBSD and it's different. We also use their lockless ring, and they have a really slick buffer pool, and that's it. So those things could all be taken out of DPDK and copied into SPDK, and then we wouldn't depend on DPDK anymore. So we share basically the core parts, the things that figure out how to map huge pages and all these sorts of core operations that you need to write a user space driver. But we don't actually use any of their NIC drivers or any of the networking part of it.
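For context, here is a minimal sketch, not from the talk, of the DPDK lockless ring that SPDK borrows. It assumes DPDK is installed and hugepages are already configured; the ring name, size, and EAL setup are illustrative choices only.

```c
/*
 * Minimal sketch (not from the talk) of the DPDK lockless ring that SPDK
 * reuses. Assumes DPDK is installed and hugepages are configured; the ring
 * name, size, and flags here are illustrative choices only.
 */
#include <stdio.h>
#include <rte_eal.h>
#include <rte_ring.h>

int main(int argc, char **argv)
{
    /* The EAL maps hugepages and sets up DPDK's shared memory. */
    if (rte_eal_init(argc, argv) < 0) {
        fprintf(stderr, "EAL init failed\n");
        return 1;
    }

    /* Single-producer/single-consumer ring with 1024 slots on NUMA socket 0. */
    struct rte_ring *ring = rte_ring_create("io_ring", 1024, 0,
                                            RING_F_SP_ENQ | RING_F_SC_DEQ);
    if (ring == NULL) {
        return 1;
    }

    int value = 42;
    void *obj = NULL;

    rte_ring_enqueue(ring, &value);           /* lockless enqueue */
    if (rte_ring_dequeue(ring, &obj) == 0) {  /* lockless dequeue */
        printf("dequeued %d\n", *(int *)obj);
    }
    return 0;
}
```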
Starting point is 00:11:36 Feel free to ask questions. You guys can jump in any time. Just stop me, or I'll just keep going and going and going. Okay, so next I'm going to talk about our NVMe driver. I think this is probably the most interesting component we've had for a while, until we released the NVMe over Fabrics target. The NVMe driver supports the 1.2.1 spec now, we've just added that, and we will continue to follow the specification
Starting point is 00:12:16 very closely. It is a C library with a single header, completely and thoroughly documented, statically linked into your application. You
Starting point is 00:12:33 use it by basically you initialize the device by giving it a PCI BDF, and it will go make sysfs calls on Linux and say, unbind your driver, bind myself, map the BAR, run through the NVMe spec initialization process, and then it gives you simple functions that you can call, like create an I/O queue or submit a read, you know, things like that.
Starting point is 00:13:01 It is entirely asynchronous and it's zero copy and it's lockless. So you give it basically a buffer and you say, go read into my buffer, and you give it a function pointer that you want it to call when it's done, and then you call a function to say, are you done yet? Are you done yet? Are you done yet? And when it's done, it'll call your function and your data will be in the buffer you gave it, and it gets DMA'd from the device into that buffer directly. There are no copies.
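As a concrete picture of that submit-and-poll flow, here is a compressed sketch written against the public SPDK NVMe API of recent releases. It is not code from the talk, and exact signatures (probe, qpair allocation, environment setup) have shifted across SPDK versions, so treat it as illustrative rather than definitive.

```c
/*
 * Illustrative sketch only (not from the talk): the submit/poll flow the
 * speaker describes, written against the public SPDK NVMe API of recent
 * releases. Exact signatures have changed across SPDK versions.
 */
#include <stdio.h>
#include <stdbool.h>
#include "spdk/env.h"
#include "spdk/nvme.h"

static struct spdk_nvme_ctrlr *g_ctrlr;
static bool g_done;

static void read_complete(void *arg, const struct spdk_nvme_cpl *cpl)
{
    /* Called from spdk_nvme_qpair_process_completions(), never from an interrupt. */
    g_done = true;
}

static bool probe_cb(void *ctx, const struct spdk_nvme_transport_id *trid,
                     struct spdk_nvme_ctrlr_opts *opts)
{
    return true;  /* claim every NVMe device the probe finds */
}

static void attach_cb(void *ctx, const struct spdk_nvme_transport_id *trid,
                      struct spdk_nvme_ctrlr *ctrlr,
                      const struct spdk_nvme_ctrlr_opts *opts)
{
    g_ctrlr = ctrlr;  /* remember the first attached controller */
}

int main(void)
{
    struct spdk_env_opts opts;

    spdk_env_opts_init(&opts);                 /* hugepages, PCI access, etc. */
    if (spdk_env_init(&opts) < 0) {
        return 1;
    }

    /* Claim unbound NVMe devices and run NVMe init; attach_cb fires per controller. */
    if (spdk_nvme_probe(NULL, NULL, probe_cb, attach_cb, NULL) != 0 || g_ctrlr == NULL) {
        return 1;
    }

    struct spdk_nvme_ns *ns = spdk_nvme_ctrlr_get_ns(g_ctrlr, 1);
    struct spdk_nvme_qpair *qp =
        spdk_nvme_ctrlr_alloc_io_qpair(g_ctrlr, NULL, 0);   /* one qpair per thread */

    /* DMA-able buffer from hugepage-backed memory: pinned and physically contiguous. */
    void *buf = spdk_dma_zmalloc(4096, 4096, NULL);

    /* Asynchronous, zero-copy 4K read of LBA 0: returns immediately, never blocks. */
    spdk_nvme_ns_cmd_read(ns, qp, buf, 0 /* LBA */, 1 /* block count */,
                          read_complete, NULL, 0);

    while (!g_done) {
        /* Poll: any completions found here invoke the matching callbacks inline. */
        spdk_nvme_qpair_process_completions(qp, 0 /* no bound */);
    }

    printf("read finished\n");
    spdk_dma_free(buf);
    spdk_nvme_ctrlr_free_io_qpair(qp);
    spdk_nvme_detach(g_ctrlr);
    return 0;
}
```

The point to notice is the loop at the end: the completion callback only ever runs inside that process-completions call, on the caller's own thread.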
Starting point is 00:13:27 Go ahead. When this user space driver is in use, how does it coexist with the kernel NVMe driver? So they can coexist, just not loaded on the same device at the same time. So if I have 8 NVMe...
Starting point is 00:13:48 Well, so this is just a C library linked into an application. So this is only running when your application is running. So the kernel module is always loaded. But the kernel NVMe module driver is a generic one that binds for any vendor. Yeah. How do you get rid of the NVMe driver, which is existing? Yeah. Yeah, but... Oh, in Linux, in
Starting point is 00:14:09 sysfs, you can tell it, you know, stop, unbind from this device by PCI bus device function, and then say don't bind to this one again. And the kernel will just leave it alone. And it just shows up as an unknown PCI device. And then we use either UIO or VFIO in the kernel to say
Starting point is 00:14:25 we want to map the BAR of this device. And with UIO, we put a placeholder driver in so the kernel knows somebody owns it. And VFIO just has an ioctl and you say, I want this one. And then the kernel knows somebody owns this. VFIO is even nicer. They've added
Starting point is 00:14:42 a lot of great features where when you say, I want to take this device, it'll do a security check, make sure you have permission, so you can run as a non-privileged user while you're doing this. And then also, it's tied into the IOMMU on Xeon platforms, so that the IOMMU will understand when you map that device into your user space process that you can only DMA within your process.
Starting point is 00:15:06 Because when you program the NVMe device, which has a DMA engine on it, you give it physical addresses, which could be anywhere, right? But the IOMMU will limit you to only DMAing inside your process. Go ahead. So this is a user process
Starting point is 00:15:21 configuring the IOMMU, and the device is going to DMA into your memory. How does the user process handle that? Yeah, so there's challenges around pinning the memory, right, so it doesn't disappear. Right now we solve that by requiring all the data be allocated from huge pages. And huge pages are static and don't move. So that's a quick and easy solution. It also actually makes the performance a lot better, which is a nice side effect. So that's one way to do it.
Starting point is 00:15:55 There are other strategies, some of which would require kernel support. But yeah, basically we say don't kill the process. But it's allocated out of huge pages, so even if the DMA happens, even if it's continuing to happen after the process has ended, it's not corrupting your system. And the IOMMU is preventing you from DMAing over critical kernel structures or something. Yeah. Go ahead. So I think that when I looked at the code, there was just a standard, normal lock
Starting point is 00:16:32 around writing to the admin queue. Yeah. And also around its completion queue. Only on the admin path. So the admin path, those operations are not performance sensitive, and there's only one admin queue globally.
Starting point is 00:16:54 Yeah, but I think the idea is you dedicate a core to the device, right? So I'm thinking... Well, so we actually access one device simultaneously from many different cores,
Starting point is 00:17:13 each one having their own I/O queue. And then each of those threads, or cores, for us those are the same thing, you pin one thread to a core, each of those may want to do an admin operation at some time, and so they coordinate with a lock. And it's because admin stuff is not on the I/O path. We thought it would be more convenient
Starting point is 00:17:32 to make the admin operations thread safe, because they all have to operate on the same queue. I/O queues, you create one per thread, and you only use it from that thread, and there's no lock. Okay, so this is just a quick table, which I'm sure you've all read by now, of features we support, which is kind of nice.
Starting point is 00:17:52 This actually makes it a fantastic testing platform for new SSD features because, again, it's user space. If it crashes, you don't blow up. Go ahead. Do you support DIF that is in line with the data block? So where you have the data and the PI information basically right there with it?
Starting point is 00:18:13 I would have to look at the header to see if we've added that. I believe we have. I believe we have. I could go confirm. The problem is I don't know if we have a device that can do that, like in our lab, you know. So it's a practical concern, right? So even if I claim it works, and even if the code is there, I don't believe that's
Starting point is 00:18:37 something that's been tested. Yeah, inline, that's been tested, I'm pretty sure. But if you move the PI out, right, I don't know if we... I don't know. I'd have to ask our validation team to see. My question was, do you support where PI is with the block, not separate? We should support both is what I'm saying. And I'm not confident that we've tested either. It's more likely that we have tested inline as opposed to
Starting point is 00:19:06 separated, but I can only think of one device that we would have access to that supports that, and I don't know if that test has been run. But if it doesn't work, we'll fix it. You're asking, is it asynchronous or non-blocking? Can you explain a little more, like expand on what you mean by asynchronous or non-blocking? Yeah, so the general flow for an I/O is you call a function, let's say it's called read,
Starting point is 00:19:45 and you give it an LBA start, a length, a buffer that is the size of however many blocks you want to read. And you also give it a callback, a function pointer. And you call that function. That function will never block under any circumstances.
Starting point is 00:20:02 It may fail. If you try to do too many operations at once, it may come back and say, you need to try this again, I couldn't do it. It will never block. Then you call another function that says, check for completions. And you can bound it on how many it will complete at a time.
Starting point is 00:20:18 And you're supposed to poll that function. You keep calling it. And when it finds completions, it will match them up with the callbacks you gave originally when you submitted them, and it will call your callback. Is there a poll semantics where you can
Starting point is 00:20:34 post some sort of descriptors for the submissions, and then when it returns, you get the vector of the ones that have completed? Can you summarize the question? Oh yeah, I apologize.
Starting point is 00:20:49 The question is, is there something like a select, you know, where you can say poll all of these? The completions coming off of the NVMe queue get reordered. The completions don't come in the right order, so we just call whatever callbacks they match up to
Starting point is 00:21:05 right when you poll. So there's no blocking. There's no, you know, like, call me when something happens. No, you have to constantly poll, and it just calls the functions as we find them, basically in a loop. So I wonder what I'm supposed to do. Am I supposed to spin around until my read is finished,
Starting point is 00:21:26 or am I supposed to do something else, like issue new reads? Yeah, so the question there is, I submit a read. I have to call my poll function to wait for it to finish. What am I supposed to do? Should I sit in a busy loop
Starting point is 00:21:39 and wait for it to finish, or should I go do something else? New reads. Yeah, and so the answer is never spin in a busy loop. Don't do that. A lot of people do that, and that's absolutely not our intention. Go do something else productive, right? You know, you have to sometimes rethink how you're doing your application, but you need to be thinking in like a pipeline. You know, like, I submitted this, so now I can go do something else.
Starting point is 00:22:06 Keep going. Do as much as you can. Get that queue depth up. Submit as many reads as you possibly can. Don't just sit and wait for completions unless you really have nothing else to do.
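To make the pipeline idea concrete, here is a small sketch, not from the talk, that assumes the namespace, queue pair, and hugepage buffers were set up as in the earlier example. QUEUE_DEPTH, the per-slot context struct, and the resubmit-forever behavior are illustrative choices, not SPDK requirements.

```c
/*
 * Sketch of the "keep the queue depth up" pattern; assumes ns, qpair, and
 * buffers were created as in the earlier example. QUEUE_DEPTH and the
 * endless resubmission are illustrative choices only.
 */
#include <stdint.h>
#include "spdk/nvme.h"

#define QUEUE_DEPTH 128  /* illustrative; NAND devices generally want a deep queue */

struct io_ctx {
    struct spdk_nvme_ns    *ns;
    struct spdk_nvme_qpair *qpair;     /* all slots share this thread's one qpair */
    void                   *buf;       /* one 4K hugepage-backed buffer per slot */
    uint64_t                next_lba;
};

static void io_done(void *arg, const struct spdk_nvme_cpl *cpl)
{
    struct io_ctx *io = arg;

    /* Runs inside spdk_nvme_qpair_process_completions(): resubmit right away
     * so the device always has QUEUE_DEPTH commands in flight. */
    spdk_nvme_ns_cmd_read(io->ns, io->qpair, io->buf, io->next_lba++, 1,
                          io_done, io, 0);
}

void run_pipeline(struct io_ctx io[QUEUE_DEPTH])
{
    /* Prime the pipeline: submit the full queue depth up front. */
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        spdk_nvme_ns_cmd_read(io[i].ns, io[i].qpair, io[i].buf,
                              io[i].next_lba++, 1, io_done, &io[i], 0);
    }

    for (;;) {
        /* Poll; completions found here call io_done(), which resubmits.
         * Any other useful work this thread has belongs in this loop too. */
        spdk_nvme_qpair_process_completions(io[0].qpair, 0);
    }
}
```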
Starting point is 00:22:24 What feeds the poll? You only get the callback in response to the poll. There are no interrupts. Oh. No other threads? There's no threads in SPDK, only your threads. When you call that check for completions, that looks at it. So you decide when it checks for completions. So SPDK will never spawn a background thread for you in any of our applications ever.
Starting point is 00:22:57 Right? And we'll never use an interrupt. So we only do things in response to your request. That's why we're just a C library, right? We have no threading model. We're not enforcing anything like that. You decide when the work happens. And you can put bounds on check completions.
Starting point is 00:23:10 You can say, only complete five, if you're worried about quality of service and things like that. Go ahead. Are you asking if you can message pass? No, no, malicious attacks. With DMA direct to physical memory, is it possible to
Starting point is 00:23:38 overwrite something in some way you don't have permission to? If you're running as root, yeah. If you're using VFIO as a non-privileged user, the IOMMU will stop you from DMAing outside your process. Yeah. So the hardware on Intel Xeon platforms, if you set it up correctly, and this is some extra work,
Starting point is 00:23:58 you know, it's not how our examples run, the hardware is capable of preventing that device, that PCI device, its DMA engine, from writing to addresses outside of its process space. So if somebody doesn't do that...
Starting point is 00:24:16 If you don't set all that up, and it takes considerable effort, it can DMA wherever it wants. All right, let me get back. So, here's a quick performance slide. This is just sort of comparing Linux
Starting point is 00:24:33 libaio on one core, this is all for one core, versus SPDK on one core. We're opening up a block device directly with libaio. We're doing asynchronous reads and writes. It's sort of a similar model with libaio as it is with SPDK. You poll for completions, these sorts of things.
Starting point is 00:24:52 But it just shows you the magnitude of the performance difference. On a single core, and I'm sorry if that gray doesn't come through very well, but on a single core, the kernel in our testing gets between 300,000 and 400,000 4K random reads per second using libaio on an Intel P3700 SSD. SPDK, as we add SSDs, just adds the max IOPS of that SSD. And it keeps going. Now, I should warn you that it does not go past 8 on a single core. If you add another core, it'll go to 16. But good luck finding PCI lanes.
Starting point is 00:25:36 So there's some trouble measuring this scaling too far because I can't get enough PCI lanes. But definitely up to 8 it's just linear, and then we finally hit the CPU bottleneck. About 300 nanoseconds. Total.
Starting point is 00:25:58 Total. Round trip. This is round trip. So we have one core with one queue pair per SSD, and there's no benefit to doing more queue pairs for an SSD in terms of performance. So we have 8 queue pairs,
Starting point is 00:26:20 and we're trying to submit 128 I/Os per queue pair. So we submit them all, and then we're polling them all, and then every time we find a completion, we submit another, and it takes eight of those to finally get to the point where the CPU is not just busy waiting. But if you want to take advantage of that reduced latency, you do have to sit there and poll, right? You do.
Starting point is 00:26:44 You do. So your there and pull, right? You do. You do. So your application presumably would do more work than just submit a pre-generated read, right? So you would have other things you'd have to figure out. I mean, you'd have to work through the whole architecture of your application, right, to figure out, you know, okay, you freed me from blocking. Now I've got to go find something productive to do. Okay, so this is the overhead,
Starting point is 00:27:06 and we measured this not too long ago. It's kernel 4.7 RC1. Again, you see like the 300 nanosecond software latency. I'm subtracting out the device. I'm subtracting out the PCI flight time. The Linux kernel, you know, we're at 5,500 nanoseconds, which is not bad. You know, 5.5 microseconds, it's not terrible. So these are some of the things we're doing to be fast. Some of these things the kernel will do over time, I have no doubt. The kernel
Starting point is 00:27:40 will get faster. For instance, I know they're implementing polling and they already have it in NVMe. A lot of these techniques are going to be employed in the kernel, and the kernel will get better. Some of them you can't do in an operating system kernel because they're specific things. You know, you're making assumptions. With SPDK, your process owns the whole device.
Starting point is 00:28:12 You set that up ahead of time. So there's no locks. There's other things like that that you just can't do in an operating system kernel. Now the price is you have to design your application for that. Which is definitely significant optimization work. But the kernel will get faster than this, I have no doubt.
Starting point is 00:28:31 Okay, so let's talk about NVMF, because I think that's what a lot of people want to hear, and I don't want to run out of time. But before I talk about NVMF, I'm going to talk about iSCSI. And this will lead into it clearly. We wrote the iSCSI target about three years ago using all these techniques, just to prove if this was reasonable, to prove if this would help,
Starting point is 00:28:55 and all these sorts of lockless polled mode things. And it turns out it does help significantly. So here's a comparison of Linux LIO versus SPDK. The kernel here, the LIO kernel, is using 32 CPU cores to do 2.25 million 4K random
Starting point is 00:29:16 reads. Every time I say IOPS, it's 4K random reads. SPDK is taking 21 cores to do 2.8 million. That's better, right? But that's not dramatic. So we're using non-blocking TCP sockets.
Starting point is 00:29:36 We're pinning iSCSI connections. When a new connection is established, we look up what device it's for, and we migrate it to that core, and then we only use it from that core, so we can do all sorts of modifying of state completely locklessly. Every device has one NVMe hardware queue that's on the same core as the sockets, the connections that are trying to use it, so that you can pull requests off the TCP stack, translate to NVMe, and dump them on the queue, all without a lock, inline, just keep going.
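A rough, hypothetical sketch of that run-to-completion design is below; apart from the one SPDK polling call, the names are invented stand-ins for the structure the speaker describes, not the actual SPDK iSCSI target code.

```c
/*
 * Hypothetical sketch of the per-core run-to-completion loop described above.
 * Apart from spdk_nvme_qpair_process_completions(), none of these names are
 * SPDK APIs; the types and helpers are invented stand-ins.
 */
#include <stddef.h>
#include "spdk/nvme.h"

#define MAX_DEVICES 8

/* Hypothetical types and helpers standing in for the iSCSI target internals. */
struct iscsi_pdu { int device_id; };
struct connection { struct connection *next; };

struct iscsi_pdu *read_pdu_nonblocking(struct connection *c);
void submit_nvme_io(struct spdk_nvme_qpair *qp, struct iscsi_pdu *pdu);

struct core_ctx {
    struct connection *connections;              /* connections pinned to this core */
    struct spdk_nvme_qpair *qpairs[MAX_DEVICES]; /* one NVMe queue per device, owned by this core */
};

void core_event_loop(struct core_ctx *core)
{
    for (;;) {
        /* Drain the non-blocking TCP sockets for new iSCSI PDUs. */
        for (struct connection *c = core->connections; c != NULL; c = c->next) {
            struct iscsi_pdu *pdu;
            while ((pdu = read_pdu_nonblocking(c)) != NULL) {
                /* Translate SCSI to NVMe and submit on this core's own qpair,
                 * inline, with no handoff to another thread and no lock. */
                submit_nvme_io(core->qpairs[pdu->device_id], pdu);
            }
        }

        /* Poll every qpair owned by this core; completions become iSCSI
         * responses written back to the same sockets, still lock-free. */
        for (int i = 0; i < MAX_DEVICES; i++) {
            spdk_nvme_qpair_process_completions(core->qpairs[i], 0);
        }
    }
}
```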
Starting point is 00:30:13 It turns out that of these 21 cores, more than 70% of their work is doing TCP. We could have maybe thought about a user space polled mode TCP stack. I don't know how much benefit we get there. I think that's a viable strategy for improvement, people could think about that, but we kind of just left it there for now. So if you work out the math here and you say, how many IOPS per core is this? It turns out SPDK got about 2x. You know, that's great.
Starting point is 00:30:54 It's not incredible. But again, most of that time left is TCP. So this is setting me up for why NVM Express over Fabrics. So first of all, it does eliminate a layer of translation from SCSI to NVMe, because we can send the same NVMe command, or close to the same NVMe command with a modified SGL, over the fabric as we actually send to the device.
Starting point is 00:31:23 That overhead is tiny. The real advantage here is RDMA. The RDMA cards effectively offload TCP. So for iWARP, it really is TCP. For RoCE, it's UDP. And that's responsible for a huge performance increase. So here's what we get with NVMe over Fabrics. And I apologize for the max at 1.3 million or whatever.
Starting point is 00:31:59 That's just the fastest NIC we had. So they're both going to cap the NIC. But the kernel took 11 cores and SPDK took one. And you can make the kernel do it a little bit more efficiently than that if you really disable cores. Some of the performance testers have done that. But that's not a realistic scenario. You have to be able to use those cores for something else to actually gain value. So disabling them is not saying anything. If you turn it back on, it takes 11,
Starting point is 00:32:30 and I'm sure they'll get a little bit better over time. So, any questions on the NVMe over Fabrics? Go ahead. Do you have an NVMe over Fabrics initiator? I have it on a later slide.
Starting point is 00:32:50 But we don't have it written. How many connections did you have? How many? What does it say? It's a lot.
Starting point is 00:33:10 This is using the in-kernel initiator for both. It's the same one. We just spin up as many initiators as we need. Our target should be able to do the full performance over a single connection, the way it's set up,
Starting point is 00:33:24 because for us, it's all being processed on one core. You can use more cores, you know, it's just we only need one to saturate the NIC. But since we're all on one core, there's no real difference for SPDK if you have 10 connections or one. We're looping over the list of connections, right?
Starting point is 00:33:46 So it doesn't ultimately matter. It's a lot like the internals of an SSD, where typically it doesn't matter how many queue pairs you have. It doesn't change the max performance of that SSD. The queue pairs are really for the client. Okay, so I'm going to go through some of the other stuff in SPDK because we're running
Starting point is 00:34:06 a little bit short on time. One thing we have is a block device abstraction layer, which came out of our iSCSI target. This is a generic block layer that really is just taking what look like bios from the kernel or something like that, a little different structure. But you can do read, write, trim, flush. You can submit them asynchronously to this thing and it will translate them to the appropriate thing underneath. So we have three modules open source underneath right now: our NVMe driver, Linux libaio, and a RAM disk.
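Conceptually, a layer like this is a dispatch table of asynchronous block operations over pluggable backends. The following is a hypothetical illustration; the type and function names are invented for this sketch and are not the SPDK bdev API.

```c
/*
 * Hypothetical illustration (the names are not the SPDK bdev API) of how a
 * block device abstraction layer dispatches a common set of operations to
 * interchangeable backends.
 */
#include <stdint.h>

typedef void (*io_completion_cb)(void *cb_arg, int status);

/* Each backend (NVMe driver, Linux libaio, RAM disk) fills in this table. */
struct block_backend_ops {
    int (*read)(void *ctx, void *buf, uint64_t offset_blocks,
                uint64_t num_blocks, io_completion_cb cb, void *cb_arg);
    int (*write)(void *ctx, const void *buf, uint64_t offset_blocks,
                 uint64_t num_blocks, io_completion_cb cb, void *cb_arg);
    int (*flush)(void *ctx, io_completion_cb cb, void *cb_arg);
    int (*unmap)(void *ctx, uint64_t offset_blocks, uint64_t num_blocks,
                 io_completion_cb cb, void *cb_arg);   /* trim */
};

struct block_device {
    const struct block_backend_ops *ops;  /* which backend this device forwards to */
    void *backend_ctx;                    /* e.g. an NVMe namespace plus a qpair */
    uint32_t block_size;
};

/* The protocol layers (iSCSI, NVMe over Fabrics) only ever see this generic call. */
static inline int bdev_read(struct block_device *bdev, void *buf,
                            uint64_t offset_blocks, uint64_t num_blocks,
                            io_completion_cb cb, void *cb_arg)
{
    return bdev->ops->read(bdev->backend_ctx, buf, offset_blocks, num_blocks,
                           cb, cb_arg);
}
```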
Starting point is 00:35:11 All right, I'm going to talk about the future a little bit. Excuse me, it's catching up to me. All right, so I showed this picture a little bit earlier, and I'm going to add some things to it. We're thinking about, in the second half of this year, doing an NVMF initiator with the same principles.
Starting point is 00:35:33 So it would be user space, polled mode, based on RDMA and on our NVMe driver, of course. It will have the same interface as our local NVMe driver. It's the same strategy the kernel initiator currently uses: they just swapped out the PCI part underneath their NVMe driver and put in a fabric. It'll be the same header file.
Starting point is 00:35:59 So you can use a remote device and a local device with SPDK using the same API. We're going to put the block device abstraction layer into our NVMe over Fabrics target. I think that's already there. Excuse me. I believe that's already there. I think it went out just a week or two ago. So it'll be similar to iSCSI.
Starting point is 00:36:27 And then we've got some performance issues when you scale to large numbers of devices, 32, 64 SSDs, polling in the loop, especially on platforms with small CPU caches. We know how to solve this. We need to leverage some of the RDMA features a little bit better. But basically, it's adding fixed latency based on the number of SSDs when you're looping over the list and polling, because we're missing the cache too much, the CPU cache too much.
Starting point is 00:36:56 But we have a clear line of sight how to fix that. So we'll get that in the second half of this year. And then as we get more RDMA devices to test with, and higher performance RDMA devices, we'll continue to make sure this scales up. Okay, so the next thing we're adding, which I'd like to talk briefly about, and we'll see how much interest there is in this, I'm calling a blob store. And this, I don't know, we'll see about the name. But most people say, so if you work for like a SAN vendor or whatever and you're exporting iSCSI devices today,
Starting point is 00:37:37 with SPDK it's obvious how you'd use that, right? If you're a database author or something like that, a key value store, you say, oh, you're so much more CPU efficient than the kernel. That's great. I need a file system. So what do you do? So most applications want some level of file semantics. That's just a fact. There's no way around that. Fortunately, most applications need a tiny fraction of what a file system can do.
Starting point is 00:38:10 And that was a realization we had. Particularly databases and key value stores, heavily optimized storage applications, are using as little from the file system as they can get away with. Because file systems are slow. So they use a small number of files. They use a flat hierarchy.
Starting point is 00:38:29 They don't care about permissions. They own the whole disk. Something like that, right? So you can't use a kernel file system on top of a user space driver, of course. The other problem is POSIX. And POSIX is slow and there's nothing you can do about it
Starting point is 00:38:47 except for make the media so fast that it's like RAM, which we're trying to do. But on an NVMe device, where you have 3 microseconds of flight time, POSIX is just not going to work at high performance, because it's blocking
Starting point is 00:39:03 and assumes certain copies and all these sorts of things. So we really need to move away from that, and fortunately, many databases already have. They're using libaio. Okay, so I'm calling this thing that I would like to build a blob store. And I'm choosing the word blob with some thought, despite that I don't like the way it sounds. I don't want to call it a file system. It's not anywhere close to a file system.
Starting point is 00:39:36 I don't want to call them objects, because objects is taking on a life of its own. That sort of assumes like a get, put, delete interface. It's not going to have that. This thing is going to be, and we're working on this now, it's going to be asynchronous, polled mode, lockless, event-driven, just like the rest of
Starting point is 00:39:55 SPDK. Essentially, you allocate a blob on the disk, and it's a block allocator. You can read and write blobs in units of blocks, logical blocks. You can delete them, you can resize them, whatever you need to do, and they are persistent. You shut down, you come back up, it finds them all again. They have a name, you don't even give it the name, they have a GUID essentially, you say I need a new blob, and it gives you
Starting point is 00:40:23 a GUID. And then you say make my blob this many blocks, and it does it. And it turns out this is probably the lowest level abstraction that you need on top of a block device. It's the lowest level abstraction that I can possibly fathom that would be valuable to someone. And this is actually probably enough of an abstraction to port a number of databases on top of SPDK. This is very similar to BlueStore in functionality. It's probably enough to port RocksDB on it. It's probably enough to port MySQL's InnoDB engine, which is using libaio direct to block.
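Since the blob store was still future work at the time of this talk, here is a hypothetical sketch of the semantics just described, asynchronous and callback-driven like the rest of SPDK; every name in it is invented for illustration and is not an SPDK API.

```c
/*
 * Hypothetical sketch of the blob store semantics described above; the
 * feature was future work at the time of this talk, so every name here is
 * invented for illustration, not an SPDK API.
 */
#include <stdint.h>

typedef struct { uint8_t bytes[16]; } blob_id_t;     /* GUID-like identifier */
typedef void (*blob_op_cb)(void *cb_arg, int status);

struct blob_store;

/* Allocate a new blob; the store picks the ID and returns it via callback. */
void blobstore_create_blob(struct blob_store *bs,
                           void (*cb)(void *cb_arg, blob_id_t id, int status),
                           void *cb_arg);

/* Grow or shrink a blob, in units of logical blocks. */
void blobstore_resize_blob(struct blob_store *bs, blob_id_t id,
                           uint64_t num_blocks, blob_op_cb cb, void *cb_arg);

/* Asynchronous, polled-mode block I/O within a blob. */
void blobstore_read(struct blob_store *bs, blob_id_t id, void *buf,
                    uint64_t offset_blocks, uint64_t num_blocks,
                    blob_op_cb cb, void *cb_arg);
void blobstore_write(struct blob_store *bs, blob_id_t id, const void *buf,
                     uint64_t offset_blocks, uint64_t num_blocks,
                     blob_op_cb cb, void *cb_arg);

/* Delete a blob; blobs are persistent until explicitly deleted. */
void blobstore_delete_blob(struct blob_store *bs, blob_id_t id,
                           blob_op_cb cb, void *cb_arg);
```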
Starting point is 00:41:06 But this is intended to be a framework, rather, for building higher-order services if you need them. You could build a mostly POSIX-compliant file system on top of this if you wanted to go slower, for instance, if your system needed it. Now, I don't know how close to POSIX you get before this is not worth using and you should just use the kernel. There's some cutoff. I don't know where that
Starting point is 00:41:32 is. But everybody will find that, I'm sure, at some point. So you may have questions about that. That'll be second half of this year. Excuse me as I lose my voice in the middle of my talk. Okay, so the last thing I want to talk about is vhost-scsi, which is another little side project we have that I think is pretty cool.
Starting point is 00:42:05 This is essentially my last slide here. We are implementing right now a mechanism to take the virtio-scsi I/Os out of QEMU and route them directly to a user space process using vhost-scsi, where that user space process is accessing
Starting point is 00:42:22 the device using SPDK. And this centralizes all the I/O work, essentially, from a set of VMs onto a single storage service running on the hypervisor, all in user space. So then a lot of companies can innovate there in user space, do the proprietary stuff in that one process. It also would be possible to forward to our NVMF initiator, for instance, from that user space process.
Starting point is 00:42:54 That would be another choice. Instead of going to the local NVMe device, go to a remote one. Since my voice is not cooperating... I made it. Join us at spdk.io. We'd like to be an open source community. Join our mailing list. We're active on there.
Starting point is 00:43:24 Please send us feedback, report bugs. We're open to contributions. With that, I'll take questions. Go ahead. Do you support multiple different NICs?
Starting point is 00:43:41 For the iSCSI stuff? So for iSCSI, for SPDK, we just use the kernel TCP/IP stack. So whatever the kernel supports, right, which is everything. If you were to do a user space TCP/IP stack on top of like DPDK or something, yeah, you'd have to worry about does DPDK have a driver for my NIC? You know, but the one we're using is just the kernel for the TCP. But again, it's 70% of the CPU. So it's the bottleneck.
Starting point is 00:44:16 Go ahead. I have two comments. Okay. About the blob stores, very interesting thing. And I know at least two teams that have been trying to invent these things, you know. But I think this is really what they should consider doing at this point, inventing blobs and so on. And yeah, if it's coming, it's very cool. And the second thing is, I will rephrase my first point.
Starting point is 00:44:45 Can you give us some recommendations based on your own experience about the saturation point? Like, should a thread issue as many reads and writes as my transactional logic allows? Where does the system saturate? Yeah, so the question... There it goes. Okay, so the question was,
Starting point is 00:45:12 can I give some advice or feel about where the saturation point is, for how many I/Os you really need to submit? And it totally depends on the device, right, which is sort of unfortunate.
Starting point is 00:45:24 For many devices, the answer is as many I/Os as you can, right? You really need to get to max queue depth to saturate these things. The Optane line of devices saturates at much, much lower queue depths, you know, four or eight or something, I don't know, those are the quoted numbers. So it's a huge advantage to that. You know, this next generation of media makes it much easier to hit the full performance of the device
Starting point is 00:45:48 because you can do it at such low queue depth. But for today's NAND devices, even the fastest ones, 128, 256, whatever you can throw at it, they need all the queue depth they can get. So according to your experience, it can sound strange, but the saturation point could be much lower?
Starting point is 00:46:12 Random things happen. Yeah, it can also depend on the I/O size. And also, you know, depending on background work that the device is doing, it's really tricky to benchmark writes on NAND, because the device is always cleaning up in the background
Starting point is 00:46:30 and making them appear better than they are. So it's a real challenge to get accurate write saturation numbers. To piggyback on that, what kind of instrumentation metrics do we have from the library, like the number of reads of certain sizes, the distribution, when did they complete? Yeah, so the toolkit is still a bit immature
Starting point is 00:46:57 in terms of full instrumentation. We have this tracing library, which will, you know, you can set up trace points and log with accurate timestamps at any point in the whole code. There's a number of those in the IO path. We are currently in talks with the VTune team to provide a full GUI solution on tracing latency using their existing disk IO tool that works with the kernel.
Starting point is 00:47:23 I don't know where that will go. I'm not making any promises or announcements. Right, but that would be my ideal. Say that again for me. For the virtualization example, why not use DPDK to handle the data plane, like with the NICs? Yeah, the vhost-scsi solution is based on what DPDK did in this space with their vhost-net. It's the same idea, where they route the I/O to a user space process to do virtual switching.
Starting point is 00:48:20 So we're doing the same thing they are. Anything else? All right, then I think we're good. Oh, never mind. Well, I'll just talk to you. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list
Starting point is 00:48:41 by sending an email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic further with your peers in the developer community. For additional information about the Storage Developer Conference, visit storagedeveloper.org.
