Storage Developer Conference - #122: 10 Million I/Ops From a Single Thread
Episode Date: March 30, 2020...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts.
You are listening to SDC Podcast, episode 122.
I'm Ben Walker, technical lead from Intel.
I've been one of the core maintainers of the SPDK project.
And I'm going to spend the next 50 minutes talking about some work we did earlier this year
to optimize our NVMe driver to get to a little over 10 million 4K random reads driven from a single thread.
All right, so I always include an overview slide of SPDK in my talks,
but it's probably not necessary to do anymore since SPDK has been mentioned a million times already today.
But it's a user space block storage stack.
It's kind of like what you'd find in an operating system.
But it's all in user space, and it's designed to be high performance
and is intended for use with NVMe drives specifically.
So you won't find legacy features like I/O reordering and things like that,
the kind of thing you'd do on hard drives.
We cut that all out.
So it has a user-space NVMe driver.
That's the part we're talking about today.
It's open source.
It has the BSD 3-clause license.
It's on GitHub.
There's our website.
The project is crazy, insanely active at this point.
We do quarterly releases, so the last release I pulled the stats,
and there were 1,200 commits from 56 different committers.
A good portion are from Intel,
but I think we're quickly approaching...
well, over 25% are not Intel committers anymore.
Hundreds of commits in every release
are from outside sources.
1,200 commits in three months is like 20 per workday.
And that's not patch submissions.
That's the ones that made it.
So this thing is flying.
Okay, so I'll just get right to it.
The big numbers.
This is a single thread
directly to the SPDK NVMe driver
using one NVMe queue pair per device, 4K random
reads at a queue depth of 128 to each device. So 10.39 million IOPS.
What the talk is really going to be about is how. I'm not just going to show you the
numbers and say we did it and then gloat for like 50 minutes, right? The talk is
about how we did it. So then, the background: this is sort of the high-level
overview of the system specs. It's an Intel Xeon Cascade Lake server platform with 21 P4600 1.6 terabyte SSDs.
So it costs a pretty penny.
So just some background before we go into all this.
NVMe drives are getting really fast.
And you thought they were fast before, right?
When we transitioned from SATA to NVMe,
we thought, these drives are so fast, this is great.
No, they're going to do that again, right?
Really, really fast.
To the point where software,
which was the bottleneck before
and was the problem SPDK was solving before...
even SPDK is going to struggle
to keep up with how fast these drives are.
Especially with Gen 4 and Gen 5 PCIe,
the bandwidth is going to be really tough
for software to deal with.
And I just want to set the stage by saying
a single storage server doing tens of millions of IOPS
is normal now.
That should be the baseline going forward.
Okay, so this talk is based on a blog post, actually, that some of you may want to pull
up, read over if you have your laptop open, that we did in May of this year. The blog
post is much longer. It goes into all sorts of detail, has all the system specs and what
compiler we used and, you know, everything. So if you have specific questions,
I'm not going to remember off the top of my head
all of the details, but they're there.
For this talk, I'll only be covering
three of the techniques from that post,
just because it's too much otherwise.
And there are other areas of active research
that I'm also not going to cover.
But I don't want to give the impression that we're done.
There are things we are still looking at to make this better.
And some of them are quite promising.
Okay, so the way the talk is going to go is I'm going to go into a little bit of background about NVMe,
just because we have to know that to have the discussion about performance.
And I'm sure, you know, with this conference, people are generally familiar.
So we'll go pretty quick.
And then I'm going to go through the three techniques that I'm going to cover.
And I'll pause after each one, and we can do a couple of questions. We can't go on forever, but if there are questions, we can cover them
right after I cover the techniques so it's still fresh in our mind.
Okay, so just as a reminder, NVMe queues consist of two arrays, effectively, in host memory.
The submission queue and the completion queue.
And they're treated as circular rings.
And where you put commands or pull commands off depends on what we call our doorbells.
They're really indexes into this ring.
And the doorbells live in the PCIe bar, so writing to the doorbell
is an MMIO, but the
queues themselves live in host memory.
And so I've drawn these
to scale.
The submission queue entries are 64
bytes, and the
completion queue entries are 16 bytes.
So there's four completion queue entries
in the same space
as a submission queue entry.
This is important later.
All right, and so to submit a command, you build a 64-byte command, and you put it at the end of the submission queue.
The end of the submission queue is pointed to by SQ Tail.
So you copy the command in, and you write SQ Tail,
which is an MMIO, and that tells the device, hey, I put something at the end of the submission
queue. The device DMAs that 64 byte command down, processes it, does whatever you told
it to do. Then it will post a completion into the completion queue. And the way we detect that that completion
arrived is that in every completion queue entry there is a phase bit. And so the first
time you pass through the queue, you're looking for that phase bit to flip from zero to one.
That means that there's a new completion there. And the second time you pass through that
array you look for it to flip from one back to zero.
You just keep doing that.
We poll on that bit, looking for it to flip.
When we're done, we tell the device we've consumed the entry by writing CQ Head.
The device told us that it consumed the submission queue entry when it gave us the completion.
So it tells us its new value of SQ Head in that completion queue entry.
And then it just continues on like this.
So that's how we're submitting an I/O on a queue pair.
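To make those mechanics concrete, here is a hypothetical, simplified queue pair in C. The structure and field names are invented for this sketch; it is not SPDK's actual driver code, just the spec-level flow described above, with memory barriers and error handling left out.

```c
#include <stdint.h>

/* Hypothetical, simplified queue pair -- illustrative, not SPDK's structures. */
struct nvme_cmd { uint8_t bytes[64]; };          /* 64-byte submission queue entry */
struct nvme_cpl {                                /* 16-byte completion queue entry */
    uint32_t cdw0;
    uint32_t rsvd;
    uint16_t sqhd;                               /* device reports its SQ head here */
    uint16_t sqid;
    uint16_t cid;                                /* command identifier we chose */
    uint16_t status;                             /* bit 0 is the phase bit */
};

struct nvme_qpair {
    struct nvme_cmd   *sq;                       /* submission queue, host memory */
    struct nvme_cpl   *cq;                       /* completion queue, host memory */
    volatile uint32_t *sq_tdbl;                  /* SQ tail doorbell, in the PCIe BAR */
    volatile uint32_t *cq_hdbl;                  /* CQ head doorbell, in the PCIe BAR */
    uint16_t           sq_tail;
    uint16_t           cq_head;
    uint16_t           num_entries;
    uint8_t            phase;                    /* phase value we expect, starts at 1 */
};

static void handle_completion(const struct nvme_cpl *cpl)
{
    (void)cpl;  /* look up our context by cpl->cid and call back into the app */
}

static void nvme_submit(struct nvme_qpair *qp, const struct nvme_cmd *cmd)
{
    qp->sq[qp->sq_tail] = *cmd;                  /* copy the 64-byte command in */
    if (++qp->sq_tail == qp->num_entries) {
        qp->sq_tail = 0;                         /* wrap the ring */
    }
    /* (a real driver needs a memory barrier here) */
    *qp->sq_tdbl = qp->sq_tail;                  /* MMIO write: "new entry at the tail" */
}

static void nvme_poll(struct nvme_qpair *qp)
{
    for (;;) {
        struct nvme_cpl *cpl = &qp->cq[qp->cq_head];
        if ((cpl->status & 1) != qp->phase) {
            break;                               /* phase bit hasn't flipped: nothing new */
        }
        handle_completion(cpl);
        if (++qp->cq_head == qp->num_entries) {
            qp->cq_head = 0;
            qp->phase ^= 1;                      /* look for the opposite phase next lap */
        }
    }
    *qp->cq_hdbl = qp->cq_head;                  /* MMIO write: "consumed up to here" */
}
```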
Okay.
So I just wanted to point out one thing that was very smart of the original spec authors.
And that's that they put the phase bit, the thing you poll on, into the completion.
And so now as I poll in our driver, I'm checking that phase bit for completions.
And when I find it, that completion has already been pulled into my CPU cache.
I've already got all the rest of the data about it ready for me.
If I had to poll on some other register that told me which slot finished or something,
I would have had to do a load, a CPU load on that to find the phase bit or the completion index change,
look at the value, and then go load something else.
So that's a chained operation. But instead,
here I just poll on the spot where the completion actually arrives, and I can mathematically calculate
where that's going to be. So it's more efficient. Good job. Okay, so a little bit about how SPDK works.
SPDK works by assigning NVMe queue pairs to threads.
So SPDK is a user space driver.
It takes a device and it dedicates it to your process.
And that means the lifetime of the driver is the same as the lifetime of your application.
So the lifetime of our queues can match the lifetime of your threads.
This is different than in an operating system
where the driver is loaded at the beginning of time
and your applications may come or go.
The queues need to outlast your application.
So typically, operating systems
are going to assign queue pairs to cores.
In Linux, it's gotten significantly more complex
over the last couple of years,
but traditionally that's how they do it. SPDK just assigns them to threads. We're inside your
application anyway. We take no locks around the queue pairs. Just use one queue pair per thread.
So that's a major performance benefit. We also disable interrupts, and we do everything via polling.
And the reason we disable interrupts is that interrupts are incredibly expensive to handle
because they swap out whatever you were doing,
which you may be right in the middle of,
flush your whole cache out,
force you to handle something else,
and then you have to restore all your state afterward.
That takes a long time.
What we do instead is we poll,
and we poll on our own time.
So when your application has a lull in its activity,
usually applications are cyclical.
You do some work, you call a function, it returns,
then you poll, right?
And it's more friendly to the cache, right? You're probably evicting things from the cache that
you were done with anyway. So these are the two most critical things. Well, we're in user
space too, so it avoids syscalls. But these are sort of like the base things about why SPDK is fast, and then we're going to go into all the other lower level reasons.
Just as a brief background, and I didn't know what other slide to put this on, it's not
really related to this, but for this benchmark and for everything we're talking about, the
code will be compiled with the O2 optimization level. So optimizations are on, but not cranked up.
We do have link time optimization enabled,
but we do not have profile-guided optimization enabled.
Profile-guided optimization does help,
but we don't feel that it is fair to use in a real benchmark.
Because what profile-guided optimization does,
and we support this in SPDK,
is you compile with it instrumented,
you run your workload or whatever,
it outputs a bunch of information and you recompile,
and it tries to optimize for what you just did.
And so we're running a benchmark,
and so that's just kind of like cheating, right?
It's not really relevant to the real world.
So we don't use that when we report numbers.
We do support it.
It's all built into the tool chain.
But we do use link time optimization,
which allows function inlining across compilation units.
And that's because in SPDK,
we like to put all of our code in nice,
separated, clean modules with
interfaces between them. But to get good performance, you often need to inline functions. And so
we need to turn on LTO to stop paying a price for our nicely organized modular code. Right?
This sort of is our get-out-of-jail-free card. And LTO really does make a difference. It's like between 5% and 10%.
Okay, so at a high level, keys to performance, how to write fast code.
It all boils down to don't let the CPU stall.
Right?
And all these techniques that we're going to cover are just ways to get the data to the CPU at the right time.
So things you do to do that is no cross thread coordination.
You can't take locks.
Right?
You need to have all your threads running independently.
Pull instead of interrupt so that you can do it at opportune times for you.
Minimize MMIO.
So when you write the doorbell registers, that's a memory-mapped I/O, and
that means it has to send a message out on the PCI bus to the NVMe device. It varies
wildly by platform. All the platforms can be quite different. But in general, the instruction
goes off to a queue, and it sits in a queue, and the PHY is serializing it out on the PCI bus.
And that queue, I mean, they can come in very different sizes depending on the platform.
But if it fills up and you try to do another write, your CPU blocks.
Right?
So NVMe has been very good about not requiring you to do any MMIO reads.
That's a major, major improvement over AHCI.
Reads are blocking operations.
Your CPU issues the read all the way to the device at the PCI bus.
It comes all the way back, and your CPU is just sitting there waiting for that round-trip time,
like three microseconds or four microseconds.
Writes in PCIe are posted, which means you just send them out.
And then you keep going, right?
Which is great.
NVMe only requires writes.
A major revolution in storage interface design.
Except if you're out of space in the queue, right?
And then your write blocks.
So you only hit that at pretty extreme levels of performance.
Then you have to get the right things
into the CPU cache at the right time.
And that typically means getting the CPU
to be able to speculate ahead appropriately.
And so you have to organize your data structures
so that it can speculate
Which is tricky
And then also don't touch cache lines that you don't have to
Right, don't do a load and wait on something if you didn't have to
And one way you can avoid doing this is pack your data structures so you touch the fewest cache lines you can
One thing we do all throughout SPDK is we'll look at a data structure
and we'll organize the hot path and the cold path data in the structure.
And we'll put all the things that are touched in the hot path in that data structure
close by each other.
And all the things in the cold path somewhere else at the end, usually.
Right?
And so then you minimize the number of cache lines you touch
as you walk through that hot path.
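To make that concrete, here's a hypothetical example of that hot/cold split. The structure and its fields are invented for illustration; they're not taken from SPDK.

```c
#include <stdint.h>

/* Hypothetical structure, organized so the fields touched on every I/O
 * (the hot path) sit together at the front, and rarely-used fields
 * (setup, teardown, statistics) are pushed to the end. The hot path
 * then touches as few cache lines as possible. */
struct io_channel {
    /* --- hot path: touched on every submission and completion --- */
    void      *qpair;
    uint64_t   outstanding_ios;
    void     (*complete_cb)(void *ctx);
    void      *cb_ctx;

    /* --- cold path: touched rarely --- */
    char       name[64];
    uint64_t   total_errors;
    uint64_t   created_at_tsc;
};
```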
All right. So, number one: minimize MMIO.
So at a high level, SPDK applications are all async, poll-mode things.
Your thread is sitting there in some while loop and it's doing some work, sending a read,
doing some work, et cetera.
And then it's polling for completions.
This polls for completions.
So these are like submits and this is check for completions.
It doesn't block.
It just checks.
If it finds a completion, it calls the callback function,
which is one of the arguments to these functions.
So it'll call, you know, callback function and callback function too
when it finds it.
So at a high level, that's what's happening in all the SPDK programs.
It's the typical event loop style programming
that so many server side applications
use these days. So in a naive implementation of submitting commands in this API, you would
do an MMIO right here to submit and an MMIO right here to submit. And then for each completion
you find, you would do an MMIO to tell the device you consumed the completion queue entry.
So that would be four here.
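A minimal sketch of that event-loop pattern, using the SPDK NVMe API. This assumes the controller has already been probed and that the namespace, queue pair, and a DMA-able buffer were set up elsewhere, and that both I/O contexts share one queue pair on this thread; return codes and error handling are ignored for brevity, and the comments mark where the naive MMIO writes would land.

```c
#include <stdbool.h>
#include "spdk/nvme.h"

struct io_ctx {
    struct spdk_nvme_ns    *ns;
    struct spdk_nvme_qpair *qpair;   /* both contexts assumed to share one qpair */
    void                   *buf;     /* DMA-able buffer allocated elsewhere */
    uint64_t                lba;
    bool                    done;
};

static void read_done(void *arg, const struct spdk_nvme_cpl *cpl)
{
    struct io_ctx *ctx = arg;

    if (spdk_nvme_cpl_is_error(cpl)) {
        /* real code would handle the error; a benchmark typically resubmits here */
    }
    ctx->done = true;
}

static void event_loop(struct io_ctx *a, struct io_ctx *b)
{
    /* Each submit copies a 64-byte command into the submission queue.
     * Naively, each of these is also an MMIO doorbell write. */
    spdk_nvme_ns_cmd_read(a->ns, a->qpair, a->buf, a->lba, 1, read_done, a, 0);
    spdk_nvme_ns_cmd_read(b->ns, b->qpair, b->buf, b->lba, 1, read_done, b, 0);

    while (!a->done || !b->done) {
        /* Non-blocking check of the completion queue; invokes read_done()
         * for anything that finished. Naively this adds an MMIO write per
         * consumed completion -- hence the "four" in the example above. */
        spdk_nvme_qpair_process_completions(a->qpair, 0);
    }
}
```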
Okay, trick one.
In that poll call, when you go to process completions, process all the completions that you find, every completion queue entry that you find.
And then when you get to the end, when you run out, you find one whose phase bit hasn't flipped.
Then ring the doorbell just once with the maximum value.
Pretty simple.
Don't ring it every time you find one.
Everyone knows this trick. Every
driver, they all do this.
When a command is submitted in those read commands I had in the previous loop, copy
the command in the submission queue entry slot for each one, but don't ring the doorbell.
Instead, ring the doorbell when they poll. Now that means the
first time you poll, it'll just submit a batch of commands. It won't give you your completions.
But you're in an event loop style. You're just sitting there spinning, doing work and
spinning, doing work and spinning, right? You're polling all the time. So you're now,
you've just figured out a clever way to batch the submissions transparently to the user.
So if you turn that off,
this benchmark gets 2.89 million IOPS.
You turn that on, 10.39 million.
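Conceptually, the change looks something like this, reusing the hypothetical struct nvme_qpair and nvme_poll() from the earlier sketch. Again, this is illustrative rather than SPDK's actual code.

```c
/* Trick two: copy the command in, but defer the doorbell write. */
static void nvme_submit_delayed(struct nvme_qpair *qp, const struct nvme_cmd *cmd)
{
    qp->sq[qp->sq_tail] = *cmd;          /* same 64-byte copy as before */
    if (++qp->sq_tail == qp->num_entries) {
        qp->sq_tail = 0;
    }
    /* No MMIO here. The device doesn't know about the command yet. */
}

/* The doorbell is rung once, the next time the application polls, which
 * flushes the whole batch of queued submissions with a single MMIO. */
static void nvme_poll_and_flush(struct nvme_qpair *qp)
{
    *qp->sq_tdbl = qp->sq_tail;          /* one MMIO covers every queued command */
    nvme_poll(qp);                       /* then drain completions as before */
}
```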
Yeah, so... Yeah, go ahead
In SPDK
How often you poll is up to you
Right, so that would be up to your application
To limit itself
You have to poll frequently enough
Right
We don't see, when
we're very, very busy in these benchmarks, it's polling very frequently, you know, every
microsecond or two, right? It's not really adding any significant amount of latency.
Go ahead.
Can you elaborate on user poll?
Because when I do a write, I just do a write.
I don't necessarily poll it.
I mean, users don't poll.
Well, when you're using SPDK,
you have to structure your application as an event loop.
And so you're in a loop at the top level.
Yeah, and you are submitting commands,
and then as part of that event loop,
every time through, you poll once.
Right?
That's the typical model.
So you're polling every time you get through your loop.
You're processing work off of probably a queue of things to do, and then you're polling.
Processing work off of a queue of things to do, which submits I/O.
You're polling, et cetera.
You're just going around and around and around.
I don't know if you know this or not, but do you know sort of what the average batch submission sizes there were in this?
Yeah, between 30 and 50.
Wow. I think that was his question about latency. That first one had to wait until you got 30
in the batch before you sent it. Yeah, the overhead of submitting a command
through the NVMe driver,
like the submit path, I think, is what,
like 300 nanoseconds?
Yeah, so multiply that by 30.
That's what we're waiting.
Not very much.
Yeah.
Yeah. Yeah.
Absolutely.
Yeah.
And actually, when you create the queue pair in SPDK,
you get to choose whether it delays.
It's actually off by default.
So by default, it will ring the doorbell.
It'll submit it right away,
because that's the safe behavior.
But if your program is structured in such a way that you can tolerate this, you can turn it on.
Okay, so one more.
Oh, and I also wanted to mention, too, that this is how io_uring works.
So you submit a bunch of commands, and then you do a syscall, which is effectively ring the doorbell slash poll for completions. It's all one. Just like
SPDK.
OK. And so, trick three. When a completion is
posted, don't write the completion queue head doorbell to
tell the device that we consumed
the completion. Because the queues are big. 1024 maybe is a common size. And the NVMe
driver, all the submissions are going through the NVMe driver. So we can just do the math.
They're one-to-one mapped. Every submission gets one completion. So you know when the device needs
new free completion queue slots, right? All the submissions are coming through our software.
So we just count. And, you know, you technically only have to free up completion queue slots once per trip around the ring.
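A conceptual sketch of that idea follows. This is the unmerged behavior being described, not something SPDK does today, and it assumes an extra unacked_cpls counter added to the hypothetical queue pair from the earlier sketches.

```c
/* Trick three: because every submission flows through the driver, we can
 * count consumed completions and only write the CQ head doorbell once
 * enough slots have piled up -- e.g. half the ring -- instead of on
 * every poll. */
static void nvme_release_cq_slots(struct nvme_qpair *qp, uint16_t consumed)
{
    qp->unacked_cpls += consumed;        /* hypothetical counter on the queue pair */

    if (qp->unacked_cpls >= qp->num_entries / 2) {
        *qp->cq_hdbl = qp->cq_head;      /* one MMIO frees all of those slots at once */
        qp->unacked_cpls = 0;
    }
}
```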
But I do want to point out, we're finding like 30 to 50
using the other two techniques.
And so this doesn't really help the benchmark,
because we're already batching.
So effectively, in this particular workload,
this would probably help more at lower queue depth.
We don't actually
So Jim wrote a patch to do this
We did not merge it
Primarily because SPDK has to work
not just according to the NVMe spec,
but also in the real world.
And so we were concerned, and still are,
that there are drives that need that kick
of the completion queue head doorbell, right?
Because I've never seen anybody do this before,
and so it's a risk, right?
Even though it's spec-compliant,
we're still hesitant to move forward.
Okay, any questions on this one?
Otherwise, we'll move on to the next strategy.
Go ahead. Probably
It depends on how busy your core is
If your core is idle
You're not going to be able to measure the difference anyway
Because you're polling so frequently
If your core is busy doing something
heavily CPU-bound that actually takes a measurable amount of time,
and what you really care about is latency and not bandwidth, then yeah, you would probably turn this
optimization off. Yeah.
If you're getting back around the loop quickly,
it doesn't actually matter.
You know, if your loop takes 500 nanoseconds,
it doesn't matter, right?
But if your loop takes a millisecond,
it matters.
Okay.
All right, one more.
So in the SPDK NVMe driver, we don't dictate your threading model.
It's just a C library with functions you call, and we assume you have some loop.
Right?
Our benchmark has a loop that will poll all the devices and then jump around, poll all
the devices and jump, you know.
We have higher layers up the stack that begin to make more assumptions about your threading model, although without trying
to strictly force you into one. And in some
of those we have ways to, like, register what
we call pollers, that every time through the loop
it'll just run every poller, you know, and some
frameworks to do that.
But at this low level it's just a C library.
It's completely passive.
It only does things when you call it. All right, go ahead.
So in SPDK, the application directly creates the queue pairs, and you must create one per thread
that you want to submit I/O from.
You can't share them across threads.
So this is just using a single thread.
If you use two threads,
I can't build a system that can do this,
but it would scale perfectly linearly.
They're entirely independent of each other.
If I could build a system that would get
21 or 22 million IOPS, it would get it with two cores.
Okay, next one. The title of this one was, all right, so we're going to talk about the
completion path. So this is just sort of the general steps of that process completions function call.
So basically it resumes at the entry where it last left off.
But basically for each completion queue entry, CQE, check if the phase flipped.
If it didn't flip, we're done.
We don't have any completions.
Or we ran out of completions. If it did flip, we stuff...
The only way to get back to your context
when you submit an I/O in NVMe
is that they give you an integer
that you put in the submission queue entry,
and they give you back that integer in the completion,
the CID.
So in order to get back to our context,
which is, you know, contains things like the callback
function and callback arguments that we're going to call back to the user application
with, we have to use that CID to get there.
And so what we do is we allocate an array of these structures we call trackers, one
to one with the slots in the queue.
And well, there's the same number as there are in the queue.
They're not necessarily direct mapped.
But when the command completes, the CID is just the offset into this array.
So we get our tracker.
And then the tracker has a pointer to our request object.
The trackers are large because we have DMA-able memory for the PRP lists in them.
And so we have a separate structure for the requests, because
we allow you to queue up more, in software
queue up, more requests than there are slots in
the queue, just for convenience. And so the requests
are smaller structures which you can have a lot
deeper queue depth on. And so the tracker points
at the request. The request then we use to
call the callback function, and we pass your argument
in. All
right? So that was, I think, our original implementation. And this is pretty similar
to how almost every other driver in the world is going to work.
Actually, so I want to just walk through the problem with this code. This is an example
of what's called a data-dependent load or a chain of data-dependent loads. And so if
you follow through what the CPU has to do in order to execute this code, it has to first
dereference the CQE, a load instruction. When that arrives from memory to cache, or into the register, really, and gets operated on, to get this value,
it can then compute the location of this offset here, issue another load.
When that load gets back to the CPU, it then can calculate this address here of the request, issue another load, right?
And then when finally that load gets back, you can figure out the location of this function
and this callback argument and make the function call. So it's a chain of three, right? You
can't speculate the next load until you complete the previous load, because the value that you need to load
is in that previous one.
Don't do that.
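In code, the chain looks roughly like this. The tracker and request structures are sketched from the description above, not copied from SPDK.

```c
#include "spdk/nvme.h"

/* Sketched structures: the tracker is large (it holds DMA-able PRP list
 * memory) and points at a smaller request that carries the callback. */
struct nvme_request {
    spdk_nvme_cmd_cb cb_fn;
    void            *cb_arg;
};

struct nvme_tracker {
    struct nvme_request *req;
    /* ... PRP list and other per-slot DMA memory ... */
};

static void complete_one(const struct spdk_nvme_cpl *cpl,
                         struct nvme_tracker *trackers)
{
    /* Load 1: the completion queue entry itself (gives us the CID).
     * Load 2: the tracker, whose address depends on the CID.
     * Load 3: the request, whose address depends on the tracker.
     * Each address depends on the previous load's value, so the CPU
     * cannot start the next load until the previous one returns. */
    struct nvme_tracker *tr  = &trackers[cpl->cid];
    struct nvme_request *req = tr->req;

    req->cb_fn(req->cb_arg, cpl);
}
```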
Okay, so what we noticed,
I think actually Jim noticed,
was that these are our pseudocode of our structures.
This NVMe tracker,
if we looked at how the data was laid out
in terms of cache lines,
we actually had empty space,
padded out space in this tracker.
And so we just said,
let's copy the function and the argument in there.
So the request has the function and the argument.
So when we submit the request,
we first have to obtain a argument. So when we submit the request, we first have to
obtain a tracker, and then we just, instead of
just copying only our pointer to ourselves here, we
also copy the, this pointer and this pointer. It's
all in cache, so the copies then are real
fast. You know, once you've touched one thing in
the cache line, touching other things is not a
problem. And then when it completes, now we skip a step.
Right? So before we would have to access the, we would do, dereference the tracker to get
the request, the request here. But now we've gone straight from the CID, one data dependent
load, and then once we get the tracker we can just immediately call the callback function.
So we've eliminated it.
We've gone from a chain of two data-dependent loads to just one.
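Sketched out, the change looks something like this, again with illustrative field names rather than SPDK's literal structures.

```c
#include "spdk/nvme.h"

struct nvme_request {
    spdk_nvme_cmd_cb cb_fn;
    void            *cb_arg;
};

struct nvme_tracker {
    struct nvme_request *req;
    spdk_nvme_cmd_cb     cb_fn;   /* copied here at submit time... */
    void                *cb_arg;  /* ...into space that used to be padding */
    /* ... PRP list and other per-slot DMA memory ... */
};

static void submit_side(struct nvme_tracker *tr, struct nvme_request *req)
{
    tr->req    = req;
    tr->cb_fn  = req->cb_fn;      /* the cache line is already hot, so the */
    tr->cb_arg = req->cb_arg;     /* extra copies are essentially free     */
}

static void completion_side(const struct spdk_nvme_cpl *cpl,
                            struct nvme_tracker *trackers)
{
    struct nvme_tracker *tr = &trackers[cpl->cid];  /* one data-dependent load */

    tr->cb_fn(tr->cb_arg, cpl);   /* no trip through the request anymore */
}
```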
And we got 500,000 IOPS improvement.
So any questions on that one?
Good?
Okay.
All right. Last one.
I promise.
This is again in the completion path
So it's that same scenario we were in before
I've already got it optimized to skip the request
And we're processing along the completion queue array, and we say we're gonna process
index zero completion queue entry. Well, we go check completion queue entry one, which is one ahead of us, here, the plus one.
And we say, is it done?
Is this one also done?
If so, prefetch its associated tracker.
Oh, and then also go get the one after it, too.
The completion queue entry, not the tracker.
And I've got a picture of how this works out.
And then continue as normal. We're building a pipeline here. So if I draw the completion queue entries in the tracker
array, where blue is loads where we have to wait, and green are prefetches.
First we read the one we were actually going to process.
Then we read the next one.
And I'll talk about why that's okay in just a minute.
Then we prefetch the next one's tracker,
which we don't know where.
I just picked a box.
They're not necessarily mapped one-to-one.
And then we prefetch the next one in the array.
And then we repeat.
So,
once we've prefetched this one,
we go figure out this one's tracker,
say it's, you know, over here.
We call its completion callback. Then we come back through the loop and process this one. We read this one,
but we've already prefetched, so now that one's fast.
We prefetch this one and we prefetch whatever its
tracker is. Let's say it's that one. Right? And,
and then we go to process this one's tracker,
which we already prefetched here.
So we've got a pipeline going.
So this is tricky.
You're basically speculating out two completion queue entries ahead and one tracker ahead, always. So, in addition to this being already incredibly complex, there's a couple of additional subtleties here.
And that is, the completion queue entries, remember from earlier, are 16 bytes.
And a cache line is 64 bytes. So let's draw the cache line boundary here.
The first four.
When I touched the first one, I touched four.
I've got four sitting there now.
I can only pull things into the cache in 64-byte granularity.
I touched one, I've touched four.
Got them in cache. So touching the next one is, you know, free. Right? Doesn't cost me anything. And if you count up the
ways, you know, the different offsets in which I could touch one, there's only one time where
actually reading one CQE ahead triggers an extra cache line read, right, when I'm on
the boundary. So three out of four
times that was free anyway and I got to kick off my prefetches.
Now a lot of the times of course these prefetches don't do anything because I'm prefetching
a cache line that I've already got from the previous one, but whatever, it doesn't hurt
anything.
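A rough sketch of that pipelined loop, building on the tracker-with-callback layout from the previous technique. It's simplified; the real code has more bookkeeping around limits and wrap-around, and the tracker structure here is still hypothetical.

```c
#include "spdk/nvme.h"

struct nvme_tracker {
    spdk_nvme_cmd_cb cb_fn;
    void            *cb_arg;
    /* ... request pointer, PRP list, etc. ... */
};

static void process_completions(struct spdk_nvme_cpl *cq, uint16_t num_entries,
                                uint16_t *head_p, uint8_t *phase_p,
                                struct nvme_tracker *trackers)
{
    uint16_t head  = *head_p;
    uint8_t  phase = *phase_p;

    for (;;) {
        struct spdk_nvme_cpl *cpl = &cq[head];

        if (cpl->status.p != phase) {
            break;                                  /* phase hasn't flipped: done */
        }

        /* Peek one entry ahead. Three times out of four it sits in the same
         * 64-byte cache line we just pulled in, so this costs nothing extra. */
        uint16_t next = (uint16_t)((head + 1 == num_entries) ? 0 : head + 1);
        struct spdk_nvme_cpl *next_cpl = &cq[next];
        uint8_t next_phase = (next == 0) ? (uint8_t)(phase ^ 1) : phase;

        if (next_cpl->status.p == next_phase) {
            /* It has already completed: prefetch its tracker and the entry
             * after it, so both are in cache by the next trip around. */
            __builtin_prefetch(&trackers[next_cpl->cid], 0, 3);
            __builtin_prefetch(&cq[(next + 1 == num_entries) ? 0 : next + 1], 0, 3);
        }

        /* Process the current entry as before. */
        struct nvme_tracker *tr = &trackers[cpl->cid];
        tr->cb_fn(tr->cb_arg, cpl);

        if (++head == num_entries) {
            head = 0;
            phase ^= 1;
        }
    }

    *head_p  = head;
    *phase_p = phase;
}
```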
OK. We're coming to a close here, so I'll start taking questions. I just want to add
that there's even more complexity in processing this completion queue that we're still working on.
Because, again, it's 16-byte entries.
And so every time through the loop, we retouch that 16-byte entry.
It might be better for us to do a 64-byte move onto the stack one time.
And then just use it from there.
The reason for that is that when the device wants
to write a phase bit update, it can only do so in 64 byte increments on PCIe. So with
DDIO it's going to try to stick a 64 byte cache line in, right? And so when the device
tries to write it, it steals the cache line to the PCIe root complex. Then it does the update to it, flips the phase bit, puts it in a queue
to get serialized back out to write. If we read it while it's in that queue, it gets
stolen back to the CPU and we check it and say, oh, there's no update here. And when
it finally gets to the front of the queue
on the PCIe side to get written out,
it realizes, oh, my cache line was stolen,
and it blocks the whole PCIe queue.
Right?
So minimizing the number of times we read,
we poll those completions,
not by necessarily polling less,
but by polling in 64-byte chunks might
be better. Making the completion queue entry 64 bytes, even though it wastes space, might
have been a better choice. But these are things we're still actively investigating.
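Just to make the shape of that idea concrete, here is a purely illustrative sketch. This is one of the things still being investigated, not something in SPDK.

```c
#include <string.h>
#include "spdk/nvme.h"

/* Copy a whole 64-byte cache line of completion queue entries (four 16-byte
 * CQEs) onto the stack in one shot, then examine the entries from the local
 * copy. That way the polling loop touches the device-written cache line once
 * per line instead of once per entry. Returns the offset of `head` within
 * the copied line. */
static unsigned copy_cqe_line(const struct spdk_nvme_cpl *cq, uint16_t head,
                              struct spdk_nvme_cpl out[4])
{
    uint16_t line_start = head & (uint16_t)~3u;   /* align down: 4 entries = 64 bytes */

    memcpy(out, &cq[line_start], 4 * sizeof(*out));
    return head & 3;
}
```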
So with that, I will take questions. I'll get to you.
I mean, the interrupts are always off.
They're permanently off.
Right?
Yeah, like the software has to be in an event loop
where it's submitting I/O and then polling for completions.
And then when it finds completions, it just resubmits more I/O.
You know, it does more work.
Right, so it never...
It never needs interrupts to, like, kick it and go.
It's always going.
I hope that answered.
Okay.
Okay.
In the benchmark, the way it works is for each device, it loops through
and it just submits however much queue depth we want,
128 in this case, to each device.
And then the completion function that it passes
with each of those submissions,
when it polls and then finds completions,
that function gets called
and they just resubmit themselves.
Right? Writing a benchmark that's fast enough to get to 10.39 million IOPS is quite a challenge.
A lot of the patches to hit this number were fixes to the benchmark.
It's not FIO.
FIO.
FIO caps around like 2.7 million per core.
And you can't even see SPDK in the trace.
It's all FIO code. I think probably.
You know, I mean, SPDK doesn't set an interval.
Your application calls a function to poll whenever you want to poll.
That is definitely going to depend on the average latency to complete the IO operation.
So you could implement something like the kernel's hybrid polling if you want to measure the average time
and start polling at 75% of the time
or something like that.
Generally, SPDK just,
you know, the simplest applications
just poll all the time.
Right?
If there's IO outstanding, they poll.
You know, they're not blocking,
but every time through the loop,
they'll check if there's
anything done.
Well, so SPDK doesn't dictate your application's threading model, at least at this layer.
So we can't decide when to do things, right?
We're a C library.
You call us, right? So even if we could figure out when we thought would be the best time to poll, we can't make you do it.
Right?
Your code has to call SPDK.
Right?
So we could probably give you hints or something.
You know, what the average.
I mean, we have stats, like latency stats and things like that, which will tell you the average I/O completion time. You could use that to figure out when the best time to poll would be.
Yep.
What is the average I/O completion time?
I'm not sure if I reported that in the blog post or not. I don't actually know.
Not for the P4600. Yeah, I, so on the blog post I also run the same thing with, like, 20 Optane drives.
And the latency is real good.
But I don't recall.
I mean, it's basically whatever the spec sheet of the drive is.
You know, the software is just doing whatever the... The software is such a small component of the overall time
that you basically get the exact characteristics of the hardware.
Especially at this queue depth.
Yeah, you didn't have anything for the P4600.
You had the 11 million for the 4K random read,
and that was 57 microseconds.
57 microseconds on Optane's at 11 million IOPS per thread.
Yeah, so all of that's the same.
The one I'm reporting is purely random read.
If we were to actually rerun the benchmark with Optane with a mixed workload, it might actually be faster
because you get bidirectional bandwidth.
But for NAND SSDs, the writes are going to slow it down.
Yeah. directional bandwidth. But for NAND SSDs, the writes are going to slow it down. Yeah?
Yeah. Well, there's a few things.
One, we're going to get PCIe Gen 4 and Gen 5.
Right? Real soon.
So these things are going to get a lot faster.
Two is, this is the bottom of the stack.
There's a much bigger storage stack that is fully featured with logical volumes and thin provisioning and snapshots and clones.
We need to optimize the whole thing.
And so I'm only benchmarking the bottom.
Right?
The whole thing is fast, but not like this.
Right?
You know, as soon as you add, like, our BDEV,
our block device abstraction layer,
I think we drop to seven or eight million on a single thread.
And, you know, the more you add, et cetera. You have to just keep optimizing the whole thing.
The other thing is, why burn cores?
You can sell them.
A lot of these companies are doing cloud hosting and they're selling the cores as VMs.
They monetize the cores.
Don't burn them on storage.
It's a value proposition.
You mentioned how you think about... In the one I showed, it was 21 P4600 SSDs.
Oh, there's one submission queue per NVMe device.
Yeah.
I have a question on that.
Is it one thread, well, one queue pair per thread, is that an SPDK limitation? Because NVMe, you allow up to, it supports like 64K queue pairs.
Oh, in the SPDK NVMe driver, you can create as many queue pairs on any number of
threads that you want, that the drive allows.
And you can put them all on one thread if you want. You just can't use a queue pair
on more than one thread at a time. And it turns out all these drives that we were testing
only need one queue pair to get their full performance.
Because then you get the head-of-line blocking when your write is out of space.
It's a random read test.
So it doesn't matter in this case.
But yeah, you could create a queue for reads, a queue for writes.
That's a smart thing to do sometimes.
You can create them on different threads.
You can, whatever you want. No. No, I have 21 queues in one thread.
All 21.
Yeah.
Yeah.
So it just loops through 21 times
and calls process completions.
Just round robin.
Yeah, so we submit all the,
we fill up all the queues to whatever queue depth we want,
which is 128 in this case.
Round robin. And then we poll them
all round robin. And whenever we find a
completion, we just immediately submit another I/O.
So we're just going around polling
all 21 over and over and over again.
There's no extra smarts than that.
No.
I'm sure there will be, right?
You know, I think...
I mean, this number is obviously ridiculous, right, for right now.
Like, you can't get this out on the network, you know.
But it's coming, right?
We're going to get PCIe Gen 4.
We're going to get Gen 5.
We have faster and faster fabrics.
You know, 400 gig Ethernet is coming.
So we're trying to stay ahead of the curve, right?
You know, databases can generate a lot of I/O.
They'll figure it out.
Yeah, I just realized I've been forgetting to repeat all the questions.
But are we going to do something similar with NVMe over TCP?
In fact, we have been.
Zia, who's sitting here, and I and others have been working on NVMe over TCP optimizations for the last two months,
trying our very, very best to get the performance up.
And it'll, yeah, I mean, maybe we'll do a talk next year,
I don't know.
But yeah, that's a very active area.
There's a lot of people working on RDMA transport
from multiple companies, trying to make that fast.
We're a little over a million IOPS per core there.
TCP has a ways to go.
But, yeah.
Okay.
Two questions.
You said it wasn't FIO.
Is it any... Do you know what benchmark that was?
Oh, everything is open source.
Oh, okay.
It's all on...
You know, if you go to spdk.io,
there'll be a link to our GitHub.
We have a benchmark tool that a lot of people use, honestly, because, I mean, you can use
it with kernel devices and everything like that.
So it's just called perf.
Okay.
And, yeah, it's available.
The other thing is you said it's one thread to all 21 drives.
So if you create 21 queue pairs and 21 threads, wouldn't the number be 21X, though?
Is that fair?
You would get the same performance.
Okay.
You would just be wasting cores.
I'm using 21 SSDs in parallel from a single thread.
I mean, it's not actually CPU bound.
No, no, no.
One thread.
Right?
But it just polls one,
polls the next one,
polls the next one,
polls the next one,
polls the next one,
polls the next one around
all on one thread.
But it's using 21 drives
from one thread.
So if you use more threads, it doesn't help because it's not CPU bound.
Okay, I think I'm out of time.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending
an email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic
further with your peers in the storage developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.