Storage Developer Conference - #122: 10 Million I/Ops From a Single Thread
Episode Date: March 30, 2020...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts.
You are listening to SDC Podcast, episode 122.
I'm Ben Walker, technical lead from Intel.
I've been one of the core maintainers of the SPDK project.
And I'm going to spend the next 50 minutes talking about some work we did earlier this year
to optimize our NVMe driver to get to a little over 10 million 4K random reads driven from a single thread.
All right, so I always include an overview slide of SPDK in my talks,
but it's probably not necessary to do anymore since SPDK has been mentioned a million times already today.
But it's a user space block storage stack.
It's kind of like what you'd find in an operating system.
But it's all in user space, and it's designed to be high performance
and is intended for use with NVMe drives specifically.
So you won't find legacy features like I/O reordering and things like that,
the kind of thing you'd do on hard drives.
We cut that all out.
So it has a user-space NVMe driver.
That's the part we're talking about today.
It's open source.
It has the BSD 3-clause license.
It's on GitHub.
There's our website.
The project is crazy, insanely active at this point.
We do quarterly releases, so the last release I pulled the stats,
and there were 1,200 commits from 56 different committers.
A good portion are from Intel,
but I think we're quickly approaching...
well, over 25% are not Intel committers anymore.
Hundreds of commits in every release
are from outside sources.
1,200 commits in three months is like 20 per workday.
And that's not patch submissions.
That's the ones that made it.
So this thing is flying.
Okay, so I'll just get right to it.
The big numbers.
This is a single thread
directly to the SPDK NVMe driver
using one NVMe queue pair per device, 4K random
reads at a queue depth of 128 to each device. So 10.39 million IOPS.
What the talk is really going to be about is how. I'm not just going to show you the
numbers and say we did it and then gloat for like 50 minutes, right? The talk is
about how we did it. So then, the background: this is sort of the high-level
overview of the system specs. It's an Intel Xeon Cascade Lake server platform with 21 P4600 1.6 terabyte SSDs.
So it costs a pretty penny.
So just some background before we go into all this.
NVMe drives are getting really fast.
And you thought they were fast before, right?
When we transitioned from SATA to NVMe,
we thought, these drives are so fast, this is great.
No, they're going to do that again, right?
Really, really fast.
To the point where software,
which was the bottleneck before
and was the problem SPDK was solving before...
even SPDK is going to struggle
to keep up with how fast these drives are.
Especially with Gen 4 and Gen 5 PCIe,
the bandwidth is going to be really tough
for software to deal with.
And I just want to set the stage by saying
a single storage server doing tens of millions of IOPS
is normal now.
That should be the baseline going forward.
Okay, so this talk is based on a blog post, actually, that some of you may want to pull
up, read over if you have your laptop open, that we did in May of this year. The blog
post is much longer. It goes into all sorts of detail, has all the system specs and what
compiler we used and, you know, everything. So if you have specific questions,
I'm not going to remember off the top of my head
all of the details, but they're there.
For this talk, I'll only be covering
three of the techniques from that post,
just because it's too much otherwise.
And there are other areas of active research
that I'm also not going to cover.
But I don't want to give the impression that we're done.
There are things we are still looking at to make this better.
And some of them are quite promising.
Okay, so the way the talk is going to go is I'm going to go into a little bit of background about NVMe,
just because we have to know that to have the discussion about performance.
And I'm sure, you know, with this conference, people are generally familiar.
So we'll go pretty quick.
And then I'm going to go through the three techniques that I'm going to cover.
And I'll pause after each one, and we can do a couple of questions. We can't go on forever, but if there are questions, we can cover them
right after I cover the techniques so it's still fresh in our mind.
Okay, so just as a reminder, NVMe queues consist of two arrays, effectively, in host memory.
The submission queue and the completion queue.
And they're treated as circular rings.
And where you put commands or pull commands off depends on what we call our doorbells.
They're really indexes into this ring.
And the doorbells live in the PCIe bar, so writing to the doorbell
is an MMIO, but the
queues themselves live in host memory.
And so I've drawn these
to scale.
The submission queue entries are 64
bytes, and the
completion queue entries are 16 bytes.
So there's four completion queue entries
in the same space
as a submission queue entry.
This is important later.
All right, and so to submit a command, you build a 64-byte command, and you put it at the end of the submission queue.
The end of the submission queue is pointed to by SQ Tail.
So you copy the command in, and you write SQ Tail,
which is an MMIO, and that tells the device, hey, I put something at the end of the submission
queue. The device DMAs that 64 byte command down, processes it, does whatever you told
it to do. Then it will post a completion into the completion queue. And the way we detect that that completion
arrived is that in every completion queue entry there is a phase bit. And so the first
time you pass through the queue, you're looking for that phase bit to flip from zero to one.
That means that there's a new completion there. And the second time you pass through that
array you look for it to flip from one back to zero.
You just keep doing that.
We poll on that bit, looking for it to flip.
When we're done, we tell the device we've consumed the entry by writing CQ Head.
The device told us that it consumed the submission queue entry when it gave us the completion.
So it tells us its new value of SQ Head in that completion queue entry.
And then it just continues on like this.
So that's how we're submitting an I/O on a queue pair.
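To make those mechanics concrete, here is a hypothetical, simplified queue pair in C. The structure and field names are invented for this sketch; it is not SPDK's actual driver code, just the spec-level flow described above, with memory barriers and error handling left out.

```c
#include <stdint.h>

/* Hypothetical, simplified queue pair -- illustrative, not SPDK's structures. */
struct nvme_cmd { uint8_t bytes[64]; };          /* 64-byte submission queue entry */
struct nvme_cpl {                                /* 16-byte completion queue entry */
    uint32_t cdw0;
    uint32_t rsvd;
    uint16_t sqhd;                               /* device reports its SQ head here */
    uint16_t sqid;
    uint16_t cid;                                /* command identifier we chose */
    uint16_t status;                             /* bit 0 is the phase bit */
};

struct nvme_qpair {
    struct nvme_cmd   *sq;                       /* submission queue, host memory */
    struct nvme_cpl   *cq;                       /* completion queue, host memory */
    volatile uint32_t *sq_tdbl;                  /* SQ tail doorbell, in the PCIe BAR */
    volatile uint32_t *cq_hdbl;                  /* CQ head doorbell, in the PCIe BAR */
    uint16_t           sq_tail;
    uint16_t           cq_head;
    uint16_t           num_entries;
    uint8_t            phase;                    /* phase value we expect, starts at 1 */
};

static void handle_completion(const struct nvme_cpl *cpl)
{
    (void)cpl;  /* look up our context by cpl->cid and call back into the app */
}

static void nvme_submit(struct nvme_qpair *qp, const struct nvme_cmd *cmd)
{
    qp->sq[qp->sq_tail] = *cmd;                  /* copy the 64-byte command in */
    if (++qp->sq_tail == qp->num_entries) {
        qp->sq_tail = 0;                         /* wrap the ring */
    }
    /* (a real driver needs a memory barrier here) */
    *qp->sq_tdbl = qp->sq_tail;                  /* MMIO write: "new entry at the tail" */
}

static void nvme_poll(struct nvme_qpair *qp)
{
    for (;;) {
        struct nvme_cpl *cpl = &qp->cq[qp->cq_head];
        if ((cpl->status & 1) != qp->phase) {
            break;                               /* phase bit hasn't flipped: nothing new */
        }
        handle_completion(cpl);
        if (++qp->cq_head == qp->num_entries) {
            qp->cq_head = 0;
            qp->phase ^= 1;                      /* look for the opposite phase next lap */
        }
    }
    *qp->cq_hdbl = qp->cq_head;                  /* MMIO write: "consumed up to here" */
}
```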
Okay.
So I just wanted to point out one thing that was very smart of the original spec authors.
And that's that they put the phase bit, the thing you poll on, into the completion.
And so now as I poll in our driver, I'm checking that phase bit for completions.
And when I find it, that completion has already been pulled into my CPU cache.
I've already got all the rest of the data about it ready for me.
If I had to poll on some other register that told me which slot finished or something,
I would have had to do a load, a CPU load on that to find the phase bit or the completion index change,
look at the value, and then go load something else.
So that's a chained operation. But instead,
here I just poll on the spot where the completion actually arrives, and I can mathematically calculate
where that's going to be. So it's more efficient. Good job. Okay, so a little bit about how SPDK works.
SPDK works by assigning NVMe queue pairs to threads.
So SPDK is a user space driver.
It takes a device and it dedicates it to your process.
And that means the lifetime of the driver is the same as the lifetime of your application.
So the lifetime of our queues can match the lifetime of your threads.
This is different than in an operating system
where the driver is loaded at the beginning of time
and your applications may come or go.
The queues need to outlast your application.
So typically, operating systems
are going to assign queue pairs to cores.
In Linux, it's gotten significantly more complex
over the last couple of years,
but traditionally that's how they do it. SPDK just assigns them to threads. We're inside your
application anyway. We take no locks around the queue pairs. Just use one queue pair per thread.
So that's a major performance benefit. We also disable interrupts, and we do everything via polling.
And the reason we disable interrupts is that interrupts are incredibly expensive to handle
because they swap out whatever you were doing,
which you may be right in the middle of,
flush your whole cache out,
force you to handle something else,
and then you have to restore all your state afterward.
That takes a long time.
What we do instead is we poll,
and we poll on our own time.
So when your application has a lull in its activity,
usually applications are cyclical.
You do some work, you call a function, it returns,
then you poll, right?
And it's more friendly to the cache, right? You're probably evicting things from the cache that
you were done with anyway. So these are the two most critical things. Well, we're in user
space too, so it avoids syscalls. But these are sort of like the base things about why SPDK is fast, and then we're going to go into all the other lower level reasons.
Just as a brief background, and I didn't know what other slide to put this on, it's not
really related to this, but for this benchmark and for everything we're talking about, the
code will be compiled with the O2 optimization level. So optimizations are on, but not cranked up.
We do have link time optimization enabled,
but we do not have profile-guided optimization enabled.
Profile-guided optimization does help,
but we don't feel that it is fair to use in a real benchmark.
Because what profile-guided optimization does,
and we support this in SPDK,
is you compile with it instrumented,
you run your workload or whatever,
it outputs a bunch of information and you recompile,
and it tries to optimize for what you just did.
And so we're running a benchmark,
and so that's just kind of like cheating, right?
It's not really relevant to the real world.
So we don't use that when we report numbers.
We do support it.
It's all built into the tool chain.
But we do use link time optimization,
which allows function inlining across compilation units.
And that's because in SPDK,
we like to put all of our code in nice,
separated, clean modules with
interfaces between them. But to get good performance, you often need to inline functions. And so
we need to turn on LTO to stop paying a price for our nicely organized modular code. Right?
This sort of is our get-out-of-jail-free card. And LTO really does make a difference. It's like between 5% and 10%.
Okay, so at a high level, keys to performance, how to write fast code.
It all boils down to don't let the CPU stall.
Right?
And all these techniques that we're going to cover are just ways to get the data to the CPU at the right time.
So things you do to do that is no cross thread coordination.
You can't take locks.
Right?
You need to have all your threads running independently.
Pull instead of interrupt so that you can do it at opportune times for you.
Minimize MMIO.
So when you write the doorbell registers, that's a memory-mapped I/O, and
that means it has to send a message out on the PCI bus to the NVMe device. It varies
wildly by platform. All the platforms can be quite different. But in general, the instruction
goes off to a queue, and it sits in a queue, and the PHY is serializing it out on the PCI bus.
And that queue, I mean, they can come in very different sizes depending on the platform.
But if it fills up and you try to do another write, your CPU blocks.
Right?
So NVMe has been very good about not requiring you to do any MMIO reads.
That's a major, major improvement over AHCI.
Reads are blocking operations.
Your CPU issues the read all the way to the device at the PCI bus.
It comes all the way back, and your CPU is just sitting there waiting for that round-trip time,
like three microseconds or four microseconds.
Writes in PCIe are posted, which means you just send them out.
And then you keep going, right?
Which is great.
NVMe only requires writes.
A major revolution in storage interface design.
Except if you're out of space in the queue, right?
And then your write blocks.
So you only hit that at pretty extreme levels of performance.
Then you have to get the right things
into the CPU cache at the right time.
And that typically means getting the CPU
to be able to speculate ahead appropriately.
And so you have to organize your data structures
so that it can speculate
Which is tricky
And then also don't touch cache lines that you don't have to
Right, don't do a load and wait on something if you didn't have to
And one way you can avoid doing this is pack your data structures so you touch the fewest cache lines you can
One thing we do all throughout SPDK is we'll look at a data structure
and we'll organize the hot path and the cold path data in the structure.
And we'll put all the things that are touched in the hot path in that data structure
close by each other.
And all the things in the cold path somewhere else at the end, usually.
Right?
And so then you minimize the number of cache lines you touch
as you walk through that hot path.
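To make that concrete, here's a hypothetical example of that hot/cold split. The structure and its fields are invented for illustration; they're not taken from SPDK.

```c
#include <stdint.h>

/* Hypothetical structure, organized so the fields touched on every I/O
 * (the hot path) sit together at the front, and rarely-used fields
 * (setup, teardown, statistics) are pushed to the end. The hot path
 * then touches as few cache lines as possible. */
struct io_channel {
    /* --- hot path: touched on every submission and completion --- */
    void      *qpair;
    uint64_t   outstanding_ios;
    void     (*complete_cb)(void *ctx);
    void      *cb_ctx;

    /* --- cold path: touched rarely --- */
    char       name[64];
    uint64_t   total_errors;
    uint64_t   created_at_tsc;
};
```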
All right. So, number one: minimize MMIO.
So at a high level, SPDK applications are all async, poll-mode things.
Your thread is sitting there in some while loop and it's doing some work, sending a read,
doing some work, et cetera.
And then it's polling for completions.
This polls for completions.
So these are like submits and this is check for completions.
It doesn't block.
It just checks.
If it finds a completion, it calls the callback function,
which is one of the arguments to these functions.
So it'll call, you know, callback function and callback function too
when it finds it.
So at a high level, that's what's happening in all the SPDK programs.
It's the typical event loop style programming
that so many server side applications
use these days. So in a naive implementation of submitting commands in this API, you would
do an MMIO right here to submit and an MMIO right here to submit. And then for each completion
you find, you would do an MMIO to tell the device you consumed the completion queue entry.
So that would be four here.
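A minimal sketch of that event-loop pattern, using the SPDK NVMe API. This assumes the controller has already been probed and that the namespace, queue pair, and a DMA-able buffer were set up elsewhere, and that both I/O contexts share one queue pair on this thread; return codes and error handling are ignored for brevity, and the comments mark where the naive MMIO writes would land.

```c
#include <stdbool.h>
#include "spdk/nvme.h"

struct io_ctx {
    struct spdk_nvme_ns    *ns;
    struct spdk_nvme_qpair *qpair;   /* both contexts assumed to share one qpair */
    void                   *buf;     /* DMA-able buffer allocated elsewhere */
    uint64_t                lba;
    bool                    done;
};

static void read_done(void *arg, const struct spdk_nvme_cpl *cpl)
{
    struct io_ctx *ctx = arg;

    if (spdk_nvme_cpl_is_error(cpl)) {
        /* real code would handle the error; a benchmark typically resubmits here */
    }
    ctx->done = true;
}

static void event_loop(struct io_ctx *a, struct io_ctx *b)
{
    /* Each submit copies a 64-byte command into the submission queue.
     * Naively, each of these is also an MMIO doorbell write. */
    spdk_nvme_ns_cmd_read(a->ns, a->qpair, a->buf, a->lba, 1, read_done, a, 0);
    spdk_nvme_ns_cmd_read(b->ns, b->qpair, b->buf, b->lba, 1, read_done, b, 0);

    while (!a->done || !b->done) {
        /* Non-blocking check of the completion queue; invokes read_done()
         * for anything that finished. Naively this adds an MMIO write per
         * consumed completion -- hence the "four" in the example above. */
        spdk_nvme_qpair_process_completions(a->qpair, 0);
    }
}
```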
Okay, trick one.
In that poll call, when you go to process completions, process all the completions that you find, every completion queue entry that you find.
And then when you get to the end, when you run out, you find one whose phase bit hasn't flipped.
Then ring the doorbell just once with the maximum value.
Pretty simple.
Don't ring it every time you find one.
Everyone knows this trick. Every
driver, they all do this.
When a command is submitted in those read commands I had in the previous loop, copy
the command in the submission queue entry slot for each one, but don't ring the doorbell.
Instead, ring the doorbell when they poll. Now that means the
first time you poll, it'll just submit a batch of commands. It won't give you your completions.
But you're in an event loop style. You're just sitting there spinning, doing work and
spinning, doing work and spinning, right? You're polling all the time. So you're now,
you've just figured out a clever way to batch the submissions transparently to the user.
So if you turn that off,
this benchmark gets 2.89 million IOPS.
You turn that on, 10.39 million.
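Conceptually, the change looks something like this, reusing the hypothetical struct nvme_qpair and nvme_poll() from the earlier sketch. Again, this is illustrative rather than SPDK's actual code.

```c
/* Trick two: copy the command in, but defer the doorbell write. */
static void nvme_submit_delayed(struct nvme_qpair *qp, const struct nvme_cmd *cmd)
{
    qp->sq[qp->sq_tail] = *cmd;          /* same 64-byte copy as before */
    if (++qp->sq_tail == qp->num_entries) {
        qp->sq_tail = 0;
    }
    /* No MMIO here. The device doesn't know about the command yet. */
}

/* The doorbell is rung once, the next time the application polls, which
 * flushes the whole batch of queued submissions with a single MMIO. */
static void nvme_poll_and_flush(struct nvme_qpair *qp)
{
    *qp->sq_tdbl = qp->sq_tail;          /* one MMIO covers every queued command */
    nvme_poll(qp);                       /* then drain completions as before */
}
```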
Yeah, so... Yeah, go ahead
In SPDK
How often you poll is up to you
Right, so that would be up to your application
To limit itself
You have to poll frequently enough
Right
We don't see, when
we're very, very busy in these benchmarks, it's polling very frequently, you know, every
microsecond or two, right? It's not really adding any significant amount of latency.
Go ahead.
Can you elaborate on user poll?
Because when I do a write, I just do a write.
I don't necessarily poll it.
I mean, users don't poll.
Well, when you're using SPDK,
you have to structure your application as an event loop.
And so you're in a loop at the top level.
Yeah, and you are submitting commands,
and then as part of that event loop,
every time through, you poll once.
Right?
That's the typical model.
So you're polling every time you get through your loop.
You're processing work off of probably a queue of things to do, and then you're polling.
Processing work off of a queue of things to do, which submits I/O.
You're polling, et cetera.
You're just going around and around and around.
I don't know if you know this or not, but do you know sort of what the average batch submission sizes there were in this?
Yeah, between 30 and 50.
Wow. I think that was his question about latency. That first one had to wait until you got 30
in the batch before you sent it. Yeah, the overhead of submitting a command
through the NVMe driver,
like the submit path, I think, is what,
like 300 nanoseconds?
Yeah, so multiply that by 30.
That's what we're waiting.
Not very much.
Yeah.
Yeah. Yeah.
Absolutely.
Yeah.
And actually, when you create the queue pair in SPDK,
you get to choose whether it delays.
It's actually off by default.
So by default, it will ring the doorbell.
It'll submit it right away,
because that's the safe behavior.
But if your program is structured in such a way that you can tolerate this, you can turn it on.
Okay, so one more.
Oh, and I also wanted to mention, too, that this is how io_uring works.
So you submit a bunch of commands, and then you do a syscall, which is effectively ring the doorbell slash poll for completions. It's all one. Just like
SPDK.
OK. And so, trick three. When a completion is
posted, don't write the completion queue head doorbell to
tell the device that we consumed
the completion. Because the queues are big. 1024 maybe is a common size. And the NVMe
driver, all the submissions are going through the NVMe driver. So we can just do the math.
They're one-to-one mapped. Every submission gets one completion. So you know when the device needs
new free completion queue slots, right? All the submissions are coming through our software.
So we just count. And, you know, you technically only have to free up completion queue slots once per trip around the ring.
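A conceptual sketch of that idea follows. This is the unmerged behavior being described, not something SPDK does today, and it assumes an extra unacked_cpls counter added to the hypothetical queue pair from the earlier sketches.

```c
/* Trick three: because every submission flows through the driver, we can
 * count consumed completions and only write the CQ head doorbell once
 * enough slots have piled up -- e.g. half the ring -- instead of on
 * every poll. */
static void nvme_release_cq_slots(struct nvme_qpair *qp, uint16_t consumed)
{
    qp->unacked_cpls += consumed;        /* hypothetical counter on the queue pair */

    if (qp->unacked_cpls >= qp->num_entries / 2) {
        *qp->cq_hdbl = qp->cq_head;      /* one MMIO frees all of those slots at once */
        qp->unacked_cpls = 0;
    }
}
```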
But I do want to point out, we're finding like 30 to 50
using the other two techniques.
And so this doesn't really help the benchmark,
because we're already batching.
So effectively, in this particular workload,
this would probably help more at lower queue depth.
We don't actually
So Jim wrote a patch to do this
We did not merge it
Primarily because SPDK has to work
not just according to the NVMe spec,
but also in the real world.
And so we were concerned, and still are,
that there are drives that need that kick
of the completion queue head doorbell, right?
Because I've never seen anybody do this before,
and so it's a risk, right?
Even though it's spec-compliant,
we're still hesitant to move forward.
Okay, any questions on this one?
Otherwise, we'll move on to the next strategy.
Go ahead. Probably
It depends on how busy your core is
If your core is idle
You're not going to be able to measure the difference anyway
Because you're polling so frequently
If your core is busy doing something
heavily CPU-bound that actually takes a measurable amount of time,
and what you really care about is latency and not bandwidth, then yeah, you would probably turn this
optimization off. Yeah.
If you're getting back around the loop quickly,
it doesn't actually matter.
You know, if your loop takes 500 nanoseconds,
it doesn't matter, right?
But if your loop takes a millisecond,
it matters.
Okay.
All right, one more.
So in the SPDK NVMe driver, we don't dictate your threading model.
It's just a C library with functions you call, and we assume you have some loop.
Right?
Our benchmark has a loop that will poll all the devices and then jump around, poll all
the devices and jump, you know.
We have higher layers up the stack that begin to make more assumptions about your threading model, although without trying
to strictly force you into one. And in some
of those we have ways to, like, register what
we call pollers, that every time through the loop
it'll just run every poller, you know, and some
frameworks to do that.
But at this low level it's just a C library.
It's completely passive.
It only does things when you call it. All right, go ahead.
So in SPDK, the application directly creates the queue pairs, and you must create one per thread
that you want to submit I/O from.
You can't share them across threads.
So this is just using a single thread.
If you use two threads,
I can't build a system that can do this,
but it would scale perfectly linearly.
They're entirely independent of each other.
If I could build a system that would get
21 or 22 million IOPS, it would get it with two cores.
Okay, next one. The title of this one was, all right, so we're going to talk about the
completion path. So this is just sort of the general steps of that process completions function call.
So basically it resumes at the entry where it last left off.
But basically for each completion queue entry, CQE, check if the phase flipped.
If it didn't flip, we're done.
We don't have any completions.
Or we ran out of completions. If it did flip, we stuff...
The only way to get back to your context
when you submit an I/O in NVMe
is that they give you an integer
that you put in the submission queue entry,
and they give you back that integer in the completion,
the CID.
So in order to get back to our context,
which is, you know, contains things like the callback
function and callback arguments that we're going to call back to the user application
with, we have to use that CID to get there.
And so what we do is we allocate an array of these structures we call trackers, one
to one with the slots in the queue.
And well, there's the same number as there are in the queue.
They're not necessarily direct mapped.
But when the command completes, the CID is just the offset into this array.
So we get our tracker.
And then the tracker has a pointer to our request object.
The trackers are large because we have DMA-able memory for the PRP lists in them.
And so we have a separate structure for the requests, because
we allow you to queue up more, in software
queue up, more requests than there are slots in
the queue, just for convenience. And so the requests
are smaller structures which you can have a lot
deeper queue depth on. And so the tracker points
at the request. The request then we use to
call the callback function, and we pass your argument
in. All
right? So that was, I think, our original implementation. And this is pretty similar
to how almost every other driver in the world is going to work.
Actually, so I want to just walk through the problem with this code. This is an example
of what's called a data-dependent load or a chain of data-dependent loads. And so if
you follow through what the CPU has to do in order to execute this code, it has to first
dereference the CQE, a load instruction. When that arrives from memory to cache, or into the register, really, and gets operated on, to get this value,
it can then compute the location of this offset here, issue another load.
When that load gets back to the CPU, it then can calculate this address here of the request, issue another load, right?
And then when finally that load gets back, you can figure out the location of this function
and this callback argument and make the function call. So it's a chain of three, right? You
can't speculate the next load until you complete the previous load, because the value that you need to load
is in that previous one.
Don't do that.
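In code, the chain looks roughly like this. The tracker and request structures are sketched from the description above, not copied from SPDK.

```c
#include "spdk/nvme.h"

/* Sketched structures: the tracker is large (it holds DMA-able PRP list
 * memory) and points at a smaller request that carries the callback. */
struct nvme_request {
    spdk_nvme_cmd_cb cb_fn;
    void            *cb_arg;
};

struct nvme_tracker {
    struct nvme_request *req;
    /* ... PRP list and other per-slot DMA memory ... */
};

static void complete_one(const struct spdk_nvme_cpl *cpl,
                         struct nvme_tracker *trackers)
{
    /* Load 1: the completion queue entry itself (gives us the CID).
     * Load 2: the tracker, whose address depends on the CID.
     * Load 3: the request, whose address depends on the tracker.
     * Each address depends on the previous load's value, so the CPU
     * cannot start the next load until the previous one returns. */
    struct nvme_tracker *tr  = &trackers[cpl->cid];
    struct nvme_request *req = tr->req;

    req->cb_fn(req->cb_arg, cpl);
}
```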
Okay, so what we noticed,
I think actually Jim noticed,
was that these are our pseudocode of our structures.
This NVMe tracker,
if we looked at how the data was laid out
in terms of cache lines,
we actually had empty space,
padded out space in this tracker.
And so we just said,
let's copy the function and the argument in there.
So the request has the function and the argument.
So when we submit the request,
we first have to obtain a argument. So when we submit the request, we first have to
obtain a tracker, and then we just, instead of
just copying only our pointer to ourselves here, we
also copy the, this pointer and this pointer. It's
all in cache, so the copies then are real
fast. You know, once you've touched one thing in
the cache line, touching other things is not a
problem. And then when it completes, now we skip a step.
Right? So before we would have to access the, we would do, dereference the tracker to get
the request, the request here. But now we've gone straight from the CID, one data dependent
load, and then once we get the tracker we can just immediately call the callback function.
So we've eliminated it.
We've gone from a chain of two data-dependent loads to just one.
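Sketched out, the change looks something like this, again with illustrative field names rather than SPDK's literal structures.

```c
#include "spdk/nvme.h"

struct nvme_request {
    spdk_nvme_cmd_cb cb_fn;
    void            *cb_arg;
};

struct nvme_tracker {
    struct nvme_request *req;
    spdk_nvme_cmd_cb     cb_fn;   /* copied here at submit time... */
    void                *cb_arg;  /* ...into space that used to be padding */
    /* ... PRP list and other per-slot DMA memory ... */
};

static void submit_side(struct nvme_tracker *tr, struct nvme_request *req)
{
    tr->req    = req;
    tr->cb_fn  = req->cb_fn;      /* the cache line is already hot, so the */
    tr->cb_arg = req->cb_arg;     /* extra copies are essentially free     */
}

static void completion_side(const struct spdk_nvme_cpl *cpl,
                            struct nvme_tracker *trackers)
{
    struct nvme_tracker *tr = &trackers[cpl->cid];  /* one data-dependent load */

    tr->cb_fn(tr->cb_arg, cpl);   /* no trip through the request anymore */
}
```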
And we got 500,000 IOPS improvement.
So any questions on that one?
Good?
Okay.
All right. Last one.
I promise.
This is again in the completion path
So it's that same scenario we were in before
I've already got it optimized to skip the request
And we're processing along the completion queue array, and we say we're gonna process
index zero completion queue entry. Well, we go check completion queue entry one, which is one ahead of us, here, the plus one.
And we say, is it done?
Is this one also done?
If so, prefetch its associated tracker.
Oh, and then also go get the one after it, too.
The completion queue entry, not the tracker.
And I've got a picture of how this works out.
And then continue as normal. We're building a pipeline here. So if I draw the completion queue entries in the tracker
array, where blue is loads where we have to wait, and green are prefetches.
First we read the one we were actually going to process.
Then we read the next one.
And I'll talk about why that's okay in just a minute.
Then we prefetch the next one's tracker,
which we don't know where.
I just picked a box.
They're not necessarily mapped one-to-one.
And then we prefetch the next one in the array.
And then we repeat.
So,
once we've prefetched this one,
we go figure out this one's tracker,
say it's, you know, over here.
We call its completion callback. Then we come back through the loop and process this one. We read this one,
but we've already prefetched, so now that one's fast.
We prefetch this one and we prefetch whatever its
tracker is. Let's say it's that one. Right? And,
and then we go to process this one's tracker,
which we already prefetched here.
So we've got a pipeline going.
So this is tricky.
You're basically speculating out two completion queue entries ahead and one tracker ahead, always. So, in addition to this being already incredibly complex, there's a couple of additional subtleties here.
And that is, the completion queue entries, remember from earlier, are 16 bytes.
And a cache line is 64 bytes. So let's draw the cache line boundary here.
The first four.
When I touched the first one, I touched four.
I've got four sitting there now.
I can only pull things into the cache in 64-byte granularity.
I touched one, I've touched four.
Got them in cache. So touching the next one is, you know, free. Right? Doesn't cost me anything. And if you count up the
ways, you know, the different offsets in which I could touch one, there's only one time where
actually reading one CQE ahead triggers an extra cache line read, right, when I'm on
the boundary. So three out of four
times that was free anyway and I got to kick off my prefetches.
Now a lot of the times of course these prefetches don't do anything because I'm prefetching
a cache line that I've already got from the previous one, but whatever, it doesn't hurt
anything.
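A rough sketch of that pipelined loop, building on the tracker-with-callback layout from the previous technique. It's simplified; the real code has more bookkeeping around limits and wrap-around, and the tracker structure here is still hypothetical.

```c
#include "spdk/nvme.h"

struct nvme_tracker {
    spdk_nvme_cmd_cb cb_fn;
    void            *cb_arg;
    /* ... request pointer, PRP list, etc. ... */
};

static void process_completions(struct spdk_nvme_cpl *cq, uint16_t num_entries,
                                uint16_t *head_p, uint8_t *phase_p,
                                struct nvme_tracker *trackers)
{
    uint16_t head  = *head_p;
    uint8_t  phase = *phase_p;

    for (;;) {
        struct spdk_nvme_cpl *cpl = &cq[head];

        if (cpl->status.p != phase) {
            break;                                  /* phase hasn't flipped: done */
        }

        /* Peek one entry ahead. Three times out of four it sits in the same
         * 64-byte cache line we just pulled in, so this costs nothing extra. */
        uint16_t next = (uint16_t)((head + 1 == num_entries) ? 0 : head + 1);
        struct spdk_nvme_cpl *next_cpl = &cq[next];
        uint8_t next_phase = (next == 0) ? (uint8_t)(phase ^ 1) : phase;

        if (next_cpl->status.p == next_phase) {
            /* It has already completed: prefetch its tracker and the entry
             * after it, so both are in cache by the next trip around. */
            __builtin_prefetch(&trackers[next_cpl->cid], 0, 3);
            __builtin_prefetch(&cq[(next + 1 == num_entries) ? 0 : next + 1], 0, 3);
        }

        /* Process the current entry as before. */
        struct nvme_tracker *tr = &trackers[cpl->cid];
        tr->cb_fn(tr->cb_arg, cpl);

        if (++head == num_entries) {
            head = 0;
            phase ^= 1;
        }
    }

    *head_p  = head;
    *phase_p = phase;
}
```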
OK. We're coming to a close here, so I'll start taking questions. I just want to add
that there's even more complexity in processing this completion queue that we're still working on.
Because, again, it's 16-byte entries.
And so every time through the loop, we retouch that 16-byte entry.
It might be better for us to do a 64-byte move onto the stack one time.
And then just use it from there.
The reason for that is that when the device wants
to write a phase bit update, it can only do so in 64 byte increments on PCIe. So with
DDIO it's going to try to stick a 64 byte cache line in, right? And so when the device
tries to write it, it steals the cache line to the PCIe root complex. Then it does the update to it, flips the phase bit, puts it in a queue
to get serialized back out to write. If we read it while it's in that queue, it gets
stolen back to the CPU and we check it and say, oh, there's no update here. And when
it finally gets to the front of the queue
on the PCIe side to get written out,
it realizes, oh, my cache line was stolen,
and it blocks the whole PCIe queue.
Right?
So minimizing the number of times we read,
we poll those completions,
not by necessarily polling less,
but by polling in 64-byte chunks might
be better. Making the completion queue entry 64 bytes, even though it wastes space, might
have been a better choice. But these are things we're still actively investigating.
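Just to make the shape of that idea concrete, here is a purely illustrative sketch. This is one of the things still being investigated, not something in SPDK.

```c
#include <string.h>
#include "spdk/nvme.h"

/* Copy a whole 64-byte cache line of completion queue entries (four 16-byte
 * CQEs) onto the stack in one shot, then examine the entries from the local
 * copy. That way the polling loop touches the device-written cache line once
 * per line instead of once per entry. Returns the offset of `head` within
 * the copied line. */
static unsigned copy_cqe_line(const struct spdk_nvme_cpl *cq, uint16_t head,
                              struct spdk_nvme_cpl out[4])
{
    uint16_t line_start = head & (uint16_t)~3u;   /* align down: 4 entries = 64 bytes */

    memcpy(out, &cq[line_start], 4 * sizeof(*out));
    return head & 3;
}
```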
So with that, I will take questions. I'll get to you.
I mean, the interrupts are always off.
They're permanently off.
Right?
Yeah, like the software has to be in an event loop
where it's submitting I/O and then polling for completions.
And then when it finds completions, it just resubmits more I/O.
You know, it does more work.
Right, so it never...
It never needs interrupts to, like, kick it and go.
It's always going.
I hope that answered.
Okay.
Okay.
In the benchmark, the way it works is for each device, it loops through
and it just submits however much queue depth we want,
128 in this case, to each device.
And then the completion function that it passes
with each of those submissions,
when it polls and then finds completions,
that function gets called
and they just resubmit themselves.
Right? Writing a benchmark that's fast enough to get to 10.39 million IOPS is quite a challenge.
A lot of the patches to hit this number were fixes to the benchmark.
It's not FIO.
FIO.
FIO caps around like 2.7 million per core.
And you can't even see SPDK in the trace.
It's all FIO code. I think probably.
You know, I mean, SPDK doesn't set an interval.
Your application calls a function to poll whenever you want to poll.
That is definitely going to depend on the average latency to complete the IO operation.
So you could implement something like the kernel's hybrid polling if you want to measure the average time
and start polling at 75% of the time
or something like that.
Generally, SPDK just,
you know, the simplest applications
just poll all the time.
Right?
If there's IO outstanding, they poll.
You know, they're not blocking,
but every time through the loop,
they'll check if there's
anything done.
Well, so SPDK doesn't dictate your application's threading model, at least at this layer.
So we can't decide when to do things, right?
We're a C library.
You call us, right? So even if we could figure out when we thought would be the best time to poll, we can't make you do it.
Right?
Your code has to call SPDK.
Right?
So we could probably give you hints or something.
You know, what the average.
I mean, we have stats, like latency stats and things like that, which will tell you the average I/O completion time. You could use that to figure out when the best time to poll would be.
Yep.
What is the average I/O completion time?
I'm not sure if I reported that in the blog post or not. I don't actually know.
Not for the P4600. Yeah, I, so on the blog post I also run the same thing with, like, 20 Optane drives.
And the latency is real good.
But I don't recall.
I mean, it's basically whatever the spec sheet of the drive is.
You know, the software is just doing whatever the... The software is such a small component of the overall time
that you basically get the exact characteristics of the hardware.
Especially at this queue depth.
Yeah, you didn't have anything for the P4600.
You had the 11 million for the 4K random read,
and that was 57 microseconds.
57 microseconds on Optane's at 11 million IOPS per thread.
Yeah, so all of that's the same.
The one I'm reporting is purely random read.
If we were to actually rerun the benchmark with Optane with a mixed workload, it might actually be faster
because you get bidirectional bandwidth.
But for NAND SSDs, the writes are going to slow it down.
Yeah. directional bandwidth. But for NAND SSDs, the writes are going to slow it down. Yeah?
Yeah. Well, there's a few things.
One, we're going to get PCIe Gen 4 and Gen 5.
Right? Real soon.
So these things are going to get a lot faster.
Two is, this is the bottom of the stack.
There's a much bigger storage stack that is fully featured with logical volumes and thin provisioning and snapshots and clones.
We need to optimize the whole thing.
And so I'm only benchmarking the bottom.
Right?
The whole thing is fast, but not like this.
Right?
You know, as soon as you add, like, our BDEV,
our block device abstraction layer,
I think we drop to seven or eight million on a single thread.
And, you know, the more you add, et cetera. You have to just keep optimizing the whole thing.
The other thing is, why burn cores?
You can sell them.
A lot of these companies are doing cloud hosting and they're selling the cores as VMs.
They monetize the cores.
Don't burn them on storage.
It's a value proposition.
You mentioned how you think about... In the one I showed, it was 21 P4600 SSDs.
Oh, there's one submission queue per NVMe device.
Yeah.
I have a question on that.
Is it one thread, well, one queue pair per thread, is that an SPDK limitation? Because NVMe, you allow up to, it supports like 64K queue pairs.
Oh, in the SPDK NVMe driver, you can create as many queue pairs on any number of
threads that you want, that the drive allows.
And you can put them all on one thread if you want. You just can't use a queue pair
on more than one thread at a time. And it turns out all these drives that we were testing
only need one queue pair to get their full performance.
Because then you get the head-of-line blocking when your write is out of space.
It's a random read test.
So it doesn't matter in this case.
But yeah, you could create a queue for reads, a queue for writes.
That's a smart thing to do sometimes.
You can create them on different threads.
You can, whatever you want. No. No, I have 21 queues in one thread.
All 21.
Yeah.
Yeah.
So it just loops through 21 times
and calls process completions.
Just round robin.
Yeah, so we submit all the,
we fill up all the queues to whatever queue depth we want,
which is 128 in this case.
Round robin. And then we poll them
all round robin. And whenever we find a
completion, we just immediately submit another I/O.
So we're just going around polling
all 21 over and over and over again.
There's no extra smarts than that.
No.
I'm sure there will be, right?
You know, I think...
I mean, this number is obviously ridiculous, right, for right now.
Like, you can't get this out on the network, you know.
But it's coming, right?
We're going to get PCIe Gen 4.
We're going to get Gen 5.
We have faster and faster fabrics.
You know, 400 gig Ethernet is coming.
So we're trying to stay ahead of the curve, right?
You know, databases can generate a lot of I/O.
They'll figure it out.
Yeah, I just realized I've been forgetting to repeat all the questions.
But are we going to do something similar with NVMe over TCP?
In fact, we have been.
Zia, who's sitting here, and I and others have been working on NVMe over TCP optimizations for the last two months,
trying our very, very best to get the performance up.
And it'll, yeah, I mean, maybe we'll do a talk next year,
I don't know.
But yeah, that's a very active area.
There's a lot of people working on RDMA transport
from multiple companies, trying to make that fast.
We're a little over a million IOPS per core there.
TCP has a ways to go.
But, yeah.
Okay.
Two questions.
You said it wasn't FIO.
Is it any... Do you know what benchmark that was?
Oh, everything is open source.
Oh, okay.
It's all on...
You know, if you go to spdk.io,
there'll be a link to our GitHub.
We have a benchmark tool that a lot of people use, honestly, because, I mean, you can use
it with kernel devices and everything like that.
So it's just called perf.
Okay.
And, yeah, it's available.
The other thing is you said it's one thread to all 21 drives.
So if you create 21 queue pairs and 21 threads, wouldn't the number be 21X, though?
Is that fair?
You would get the same performance.
Okay.
You would just be wasting cores.
I'm using 21 SSDs in parallel from a single thread.
I mean, it's not actually CPU bound.
No, no, no.
One thread.
Right?
But it just polls one,
polls the next one,
polls the next one,
polls the next one,
polls the next one,
polls the next one around
all on one thread.
But it's using 21 drives
from one thread.
So if you use more threads, it doesn't help because it's not CPU bound.
Okay, I think I'm out of time.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending
an email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic
further with your peers in the storage developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.