Storage Developer Conference - #164: Enabling Asynchronous I/O Passthru in NVMe-Native Applications
Episode Date: March 8, 2022...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts.
You are listening to SDC Podcast, Episode 164.
Hi, everyone, and welcome to our session. Today, Kanchan, Simon, and myself are going to talk a little bit about some of the work that we're doing in Linux mainline
to enable a new high-performance I/O path that we're calling asynchronous I/O passthru.
We're obviously going to cover what we mean by this, and what are the use cases we're after,
and then we're going to deep dive into the changes that we're making in Linux and in
other places in the storage stack to enable this path.
Before we get started, I want to make clear that while some of the parts of this work
are already upstream, things change very quickly.
There are still a number of elements
that are under discussion
in the Linux kernel mailing list.
So things, again, might change.
And I also want to make very clear
that some of the work that we're presenting here
is a community effort.
So, you know, we presented some of these ideas,
applied some of the feedback, got a lot of conversations into the mailing list.
And just to name a few of the people that have been involved directly into this work are Jens, Christoph, and Keith.
There has been many others.
So, you know, if you're interested in the topic, I would encourage you to subscribe to the mailing list, follow the conversation, and participate in the things that are still under debate.
We are dividing the presentation into three main parts.
First, I'm going to talk about the work we're doing in NVMe to enable this in-kernel passthru I/O path: what is the motivation, what are the things that we want to cover, and how it relates
to the rest of the I/O paths that are already available in Linux. Then Kanchan is going
to talk about asynchronous ioctls, which is very necessary work for this NVMe in-kernel
pass-through to be able to perform at scale.
And then Simon at the end,
he's going to wrap everything up and he's going to talk to you
about how you use this path
and the work that we are doing in xNVMe
to provide this common storage API
for all these paths to come together.
And he's going to use the opportunity
to guide you through a demo
and give you some performance numbers
on real-world applications.
So,
if you look at Linux today,
you traditionally have
two main ways of submitting I/O.
You have your file I/O,
which is very convenient and very easy
from an application perspective. If you want more control, then applications would rely on
the raw block interface. This is very convenient if you want this control. You see the whole LBA address space. You can control a lot of the I/O properties:
whether you want asynchronous or synchronous, whether you want
to do direct I/O and implement your own cache or leverage the kernel
page cache, queue depth, priorities, all that.
And a number of applications have been very successful at leveraging
this for implementing all sorts of things on top, like object stores, key values, things like that.
For NVMe specifically, the block device, as you know, is namespace granularity, so it fits very well.
The block layer is fantastic.
But when you're building this common block abstraction,
it comes with some natural limitations.
So just to name a few,
I believe we're all aware that there are a number of these data protection schemes,
like protection information,
DIF/DIX combinations, that are not supported.
So something that is used in enterprise environments,
this 4K plus 64 bytes of metadata in a contiguous buffer,
is something that the block layer does not support today.
So these people basically, when they plug the devices,
they don't see the block device come up.
Or sometimes even worse, there is this inconsistent interface where sometimes the block device comes up, but it comes with capacity zero, which means you cannot use it.
So it's not really useful, right?
We also have constraints for new device types. To give an example, in ZNS,
the block layer puts some constraints that make a lot of sense for the kernel,
which is having power-of-two zone sizes
or enforcing the use of the append command.
If you don't want to use these features,
then the block device will not come up.
I'm sure there are other reasons,
but I believe this has had an impact
on explaining the rise of SPDK.
Because SPDK got a number of things very right.
They understood that there are these applications
that want a very fast, low-latency I/O path
they have control over.
Because a number of applications
can build this domain specific,
you can call it domain specific block layers
or domain specific abstractions
because the application knows the workload.
I don't know,
you know that you're writing at 32K,
you can optimize for that.
You know that the application
is respecting your maximum data transfer size.
You don't need to implement splits,
you know, these type of things.
I think there's no discussion that SPDK paved the way
for all these low-latency storage stacks.
Today we see io_uring matching this performance,
and it's great, but the applications are already there,
most probably because they did a lot of work already on SPDK.
And at the end of the day, I think when you put this all together,
SPDK got very right that there are a number of people
that want to do this end-to-end optimization.
So either because you want to keep it vendor-specific,
because you control your whole storage stack, or because you want to do some type of POC before you bring it to the technical working group in NVMe or to your vendor.
So it allows for this fast innovation, where you can do a number of things, create your command, speak NVMe to your device, and then go all the way down.
However, you know, the more popular SPDK becomes, the more generic it becomes. And then we start having this redundancy on some of the features that are traditionally
the operating system responsibility, which now are being moved to user space in order
to cover this.
So the main question here is,
can we bring this functionality to an IO path
that goes inside of the kernel
so that we can reuse all the kernel functionality,
you know, your security, your, you know, whether you want to do cgroups,
whether you want to, you know, to use whatever containers, you know.
Avoid
the redundancy that comes from putting a lot
of this logic in user space.
And this is where the NVMe
generic device comes in place.
What it is
is pretty simple.
When you get a new namespace,
it comes
typically as a
block device, you know, your
nvmeXnY: nvme0n1, nvme0n2, nvme0n3, etc. Now you will see also a char device coming
on the side, very similar to the char device that comes for the controller, like nvme0 and nvme1.
This device will always be available so if some of the features of that NVMe namespace
are not supported by the block layer,
the block device will be rejected
or come up as capacity zero or as read-only.
But the char device will always be there.
And this device allows you to send ioctls
directly into the driver,
meaning that you have this communication
where you can submit an NVMe command
that you form in user space,
and the device will understand that.
And the path goes entirely through the kernel.
If you want to compare that to what it looks like in SPDK,
it is very similar in terms that if you're using the lowest
SPDK API for the NVMe driver, you will be speaking NVMe.
But the namespace will, or actually the whole device,
will not be detached from the kernel.
The PCIe device will not be detached and put into user space.
It will still be within the kernel.
This gives you some interesting use cases
because then, you know, you can think of
a normal server with several drives.
Some of them, if they have several namespaces,
can have like a normal file system put into them.
And a namespace in the same device can be used as a pass-through without having to touch the whole device.
So you have much more flexibility in that sense.
It also gives you a common stack for implementing these kind of security policies where you don't need to have
some part of these policies in user space,
some part of these policies within kernel space.
So this is the main target.
SCSI has had something similar for quite a long time.
For UFS, to the best of my knowledge,
this doesn't exist.
But if this is a use case that becomes interesting,
especially after the asynchronous ioctls get merged,
we can look into enabling that.
For NVMe, we merged this in 5.13.
There's support for the ioctl I/O path.
There's basic support in nvme-cli,
and we're having some patches upstream
in the next week or so
for having all this support enabled.
What does it look like
if you want to consume the NVMe generic device?
So as I mentioned before,
now you are not only going to see
your /dev/nvme0n1, 0n2, 0n3 as the different namespaces on the block device, you're going to see this ng, from generic, with the same numbering.
This is coming from nvme-cli. Now you can see the patches here. That's what I mentioned before. This is the part that is missing for nvme list,
and we're posting the patches soon.
If you want to use this
from the programming interface,
this looks like a typical ioctl.
So you would form the NVMe command
as a normal NVMe ioctl, the same one that exists today.
This means, obviously, that performance is not very good, because it's a synchronous interface.
Kanchan will talk about what we're doing to use io_uring and make this scale and actually be usable for the I/O path.
Good thing about using the NVMe driver in the kernel
is that it is pretty simple for us to enable fabrics
because it is automatic.
Just as it is, when it is enabled as a block interface,
the device is up.
When it is not, we need to enable the pass-through controller.
This is something that is easy to do.
And when it is done, all the functionality comes basically for free,
which is very convenient.
So this is it for the NVMe generic device. Kanchan will take over,
and he will talk about asynchronous ioctls. If you have any questions about this part in particular,
please ask through Slack or reach out to us afterwards.
Thank you, Javier. So now let's delve into the catch-all system call that has been around for so many years.
It's called I/O control or, in short, ioctl.
Many operations which do not fit well into the existing system calls are done by ioctl.
But despite the fact that it has been massively used for doing a variety of stuff,
ioctl remains synchronous to this date.
And we are going to talk about turning it asynchronous
with the help of io_uring.
But before that, let's touch base with io_uring.
It came into the kernel in 2019,
and provides a scalable infra for doing async I/O,
for file and for network as well.
Unlike Linux AIO, it allowed doing async buffered I/O too.
And at this point, lots of system calls beyond read
and write have already found their async variant
within the io_uring infra.
Making directories, link, and symlink
are a few recent examples of that.
If we look at the communication
between the user and kernel part of io_uring,
the backbone here is a pair of ring buffers
that the kernel creates and user space maps.
And that allows further interactions to happen
with a reduced number of system calls and memory copies.
And the programming model after we set up the ring
would be to pick a free submission queue entry,
SQE from the submission queue,
fill it up with the relevant opcode
and all the other information
that the operation may require. And we can fill up more such SQEs, make a bunch of them.
And once we are done doing that, we can submit all these entries by doing a single system call,
which is called io_uring_enter. And at some later point in time, we can reap the completion by
looking at the CQ ring.
So if the tail and head of the CQ are not at the same position, we have got a completion.
And that's about it, to keep it simple. A lot more is possible, like elimination of the
submission system call by enabling SQ offload, and doing the completion without relying on interrupts. But in this slide,
I am only scratching the surface of it.
Far more has been said on io_uring in a much better way.
For example, the talk "Faster IO through io_uring" by Jens
would be excellent to give an ear to.
And now we go into how io_uring can deliver an async
ioctl.
First things first: in the current and the next slide,
I'm talking about the work that Jens has done to wire up
the async ioctl.
Let's start with the user interface part of it.
What does it look like to an io_uring user?
So the feature currently is called uring
command, and it is triggered by passing this new opcode, called IORING_OP_URING_CMD.
Apart from this new opcode, there is this new type of submission queue entry, and it's called
command SQE, or CSQE in short. As you can see in the figure here,
the CSQE is of the same size as a regular SQE, that's 64 bytes.
But the CSQE is special in the sense that it has got 40 bytes
of free space, which is what we refer to as the payload
in subsequent slides.
And this payload can be used by the application
to store the ioctl command itself.
So it's going to save the allocation cost
if the ioctl command happens to be smaller than 40 bytes.
And if the command is larger,
the application can allocate it some other way
and put the pointer to the command
inside the CSQE payload area.
And of course, these are the two ways
that I mentioned over here,
but other ways to use the payload are also possible.
And coming back:
once we are done setting up the CSQE and its payload,
we send it down the usual way and we reap the completion in the usual way.
It is to be noted that io_uring does not peek into the payload or enforce many rules around it.
It merely passes it down to the ioctl provider, which can have its own custom logic
to process the payload.
And here we look at the internal communication model
between io_uring and the ioctl provider,
which could be a file system or could be a driver.
So io_uring starts by fetching the CSQE that is supplied and sets up an
internal container called io_uring command, and that container is cardinal for all the further
communication. The second part is, whoever is the ioctl provider can participate in this business of making the ioctl async by
implementing this new file operation called uring command. So io_uring essentially submits
the ioctl to the provider by calling this particular method, and when the provider takes
charge, it does whatever needs to be done for submission and returns without blocking.
It can return the results instantly,
or it can say that it requires some more time
to send the result.
And in that case, whenever it has the result,
it is going to pass it back to io_uring
by calling another helper,
which is currently called io_uring_cmd_done.
You can see all this in the sequence diagram here.
And once the result reaches io_uring,
it posts this particular result into the CQE,
and we are done.
Moving on, we look at employing the async ioctl
for the NVMe pass-through interface.
So as we see in the figure here,
we have a bunch of abstractions stacked up
on the storage device,
and each one has its own purpose.
With the pass-through interface,
the kernel is exposing a path
which kind of cuts through all these layers,
all these abstractions.
And this can be useful to try out some new features,
especially in NVMe,
where we know that features are emerging fast.
And without this,
all these features basically take some time
to move up the ladders of abstraction over here.
At times, when it is about building a file
or user interface on top of a new device feature,
some sort of opaqueness needs to be designed in,
so that the interface
can be reused in the future and allows future extension.
But pass-through skips all that,
and the feature can be used readily.
But the problem is that the pass-through travels
via a blocking ioctl, and that makes it almost useless
for fast devices like NVMe.
So the endeavor here is to build a new pass-through interface
which is scalable,
and by combining it with the io_uring advancements, we do imagine a much more useful interface.
So here, if we look at the NVMe ioctl, it is about forcing sync behavior over async.
The NVMe interface is naturally async:
one submits the command and can go back to one's business,
post submission.
It's the driver which implements
the sync behavior by putting the submitter into a blocking wait.
In the new scheme of things,
I'm talking about the uring command here.
So for the NVMe char device,
we add the uring command handler,
and that does nothing more than decoupling
the completion from the submission.
At a high level, that's what it is.
And there won't be any blocking wait.
So one of the problems worth mentioning here
is what I refer to as the async update to user memory.
It's a general problem if an ioctl command has some field
which needs to be updated during the I/O completion.
Because such fields are in user space,
they cannot be touched
while we are running in interrupt context,
and typically the completion of the I/O
arrives in interrupt context.
So to solve that, if you look at the sequence over here,
the NVMe driver sets up a callback to do all that update,
and it supplies that particular callback function
to io_uring. Then io_uring
sets up something called a task work, which is the kernel's mechanism to schedule a work into
a user-space task, and with that, the callback that the driver supplied gets executed in the task context, and we can do the update.
Now let's look at the example over here. So we can see that we open the char device
ng0n1, and then we allocate a pass-through command structure. We set it up in a regular way.
But what we do is that we put this particular command
inside the command SQE payload.
And once we have done that,
we could submit it in a regular way.
And then we can read the completion
and that's about it.
That would be a way to do
the reading from the device
from this particular new interface.
Some tidbits for the ZNS users,
if we look at it.
So it is possible to do an async zone reset.
Currently, the zone reset is possible
only via the zone management command,
and the zone management command is sent by ioctl.
And the other useful thing could be
to send the zone append command at a higher queue depth.
So currently, zone append is usable inside the kernel,
but it's not exposed to user space.
And this could be one of the ways to do that.
While having the interface async is the first thing to do, it doesn't have to be the only thing. Since NVMe is talking to io_uring directly, we seem to have room
for adding some more goodness to the pass-through interface, and to the async ioctl in general.
So io_uring has a bunch of features, beyond being async, I mean, to make the I/O faster.
And some of those are listed in this table.
We have registered files,
which is about reducing the cost of acquiring
and releasing the file references inside the kernel.
And there is SQPOLL,
which enables the application
to submit the I/O without doing a system call at all.
And we have fixed buffers and async polling,
which we will cover in detail anyway in the further slides.
The first two features in this table,
the registered files and SQPOLL,
become available with the infra
that we discussed in previous slides,
while the other two features
require some new plumbing, both in io_uring and in
the driver.
So we start with the fixed buffer support for uring
pass-through.
How can we go about doing that?
But before that, let's look at what a fixed buffer is.
So the last, or maybe the second to last, thing to do
before we submit the I/O to the NVMe device
is making the buffers DMA-ready.
And for that to happen,
we've got to pin the buffers in memory.
Generally, this is a per-I/O cost:
we pin the buffer for the I/O and unpin it once the I/O is done.
With fixed buffers, this per-I/O work is not done.
In place of that, the application pins the buffer once
and goes about reusing these pinned, or pre-mapped, buffers subsequently.
For this support, we add a new opcode in io_uring,
which is similar to the original uring command,
but it acts on the fixed buffer instead.
So it's called uring command fixed.
And the application uses this opcode to submit the I/O
and tells which pre-mapped buffer is to be used, by its index.
The buf_index is shown in the figure over here.
On the NVMe side, the driver stops doing the per-I/O pin and unpin,
and it talks to io_uring to obtain the pre-mapped buffer.
Now we look at the second feature, which is about pass-through polling.
But before that, let's take a cursory look
at the infra that the kernel has for I/O polling.
So I/O polling became relevant
particularly after the emergence of low-latency storage,
when the interrupt, the classical means to indicate the completion of I/O, started to become a non-trivial software overhead for such
devices. So in the current polling model, the device stops generating the interrupts, while the
submitter actively checks for the completion. Originally, the kernel had the sync polling infra,
where the application submits the I/O,
and after that, it starts spinning on the CPU,
looking for the completion actively.
That was referred to as classical polling.
Obviously, at times it became heavy-handed on the CPU,
and people added the option of hybrid polling.
With that, the application puts itself to sleep
for some time while looking for completion.
The system calls to do this kind of sync polling
are preadv2 and pwritev2 with the RWF_HIPRI option.
And then came async polling, with io_uring.
So it basically boils down to this fundamental question:
what choices do we have after submitting an I/O,
when we have already disabled the interrupts?
We can spin; that's one thing to do.
And we can sleep and spin; that's what we do with hybrid polling.
That would be the second option. With async polling,
we get this third option of doing something useful, like submitting
another I/O, or maybe some other app-specific
stuff. And that's possible because polling has been decoupled
from submission.
In order to use async polling in io_uring,
we need to set the ring up with a flag here,
and then all the I/O we do via that ring
becomes polled I/O. Now we look at how you can wire up the polling support in the uring pass-through.
So the figure on the left shows the per-core queues at the block layer. These are referred
to as sctx, the software context. And these are mapped to the hardware context,
hctx in short.
The hctx corresponds to the NVMe SQ and CQ pair.
Now, from the device side,
the NVMe protocol allows creating an NVMe CQ without interrupts.
For such completion queues,
the device is only going to post the CQE, and it is not
going to generate the interrupt after that. So here in the figure, the green CQs are interrupt-disabled,
while the blue ones are interrupt-enabled, and the same color coding applies for the hctx that's sitting above it. Now, when we look at what needs to be done
to enable polling for the uring command:
for the submission, we need to do two things.
We need to ensure that the I/O is placed on
a polled hctx, the green one.
And once we
place it,
so we are showing that
I/O placement within the hctx
with the orange color,
that is the cookie for the command.
So the cookie basically tells
the hctx
and the command
ID.
The second thing to do would be to remember this cookie, because the completion will be done at a later point in time.
So we store this cookie in the io_uring command
structure itself, and that's about submission.
And when the caller decides to do the polling for the completion,
we pick the corresponding io_uring command.
We obtain the cookie from it,
and we send it to something called blk_poll,
which implements the infra of classical polling as well as hybrid polling.
So it takes a cookie as an input,
and it uses it to identify the relevant hctx
and the corresponding NVMe CQ.
And then it starts polling that particular NVMe CQ.
Once it finds the completion, the job is over.
Now, if we take a step back and look at the table
comparing the paths, it looks much better
than before. Yes, we still have a gap, because
recently one more advancement has been added to the bunch, and that's
called bio cache, or bio recycling.
It's a transparent optimization
as far as user space is concerned,
because nothing extra needs to be done.
Nothing extra needs to be specified from user space.
And that is what makes it different
from the rest of the features.
We'll be having this feature added
as we move along this path.
And now, all said and done,
the whole NVMe pass-through interface, whether we talk about the new
io_uring one or the existing ioctl one, expects the application to speak the
native language. That means the application needs to pass the NVMe commands to talk to the NVMe device.
So I will pass it
over to Simon, who is going to talk about what we can do
about that part, how best to consume this interface,
and how this interface performs.
Thank you, Kanchan.
Yeah, and I will specifically be talking about three things.
And the first is one way of consuming the char device,
and specifically why you might want to do it in that way.
And second, we'll be giving you a demo of how the different tooling around this usage works,
such that you can try it out yourself.
It will be quite pragmatic.
It includes tools such as FIO for evaluating performance,
such that you can go ahead and do that yourself.
And third, well, that's about evaluating the performance of the char device compared to what's currently available today and other ways of shipping pass-through commands to your device.
All right, starting with one. We'll go back here, looking at this figure again,
because one way of consuming the char device
and the async pass-through path provided in these new patches,
well, that is to use the example that Kanchan showed you:
construct the CSQE and send that down the io_uring path
as you would normally do.
However, what we want to do is also do stuff like FIO.
We want to do a performance evaluation of the interface.
We also want to look a bit of the tooling
and how you can talk with your NVMe devices on the command line.
So we actually want to experiment with this interface,
this new system interface, through a bunch of different application and tooling interfaces.
And to enable that, well, then we use xNVMe.
And xNVMe is the smallest amount of software abstraction that you need in order to encapsulate a bunch of different operating system and system interfaces into
a single programmable API.
And by doing so, the different tools already provided by xNVMe, such as the command
line tool for sending administrative commands and shipping other I/O commands through the
command line, well, we can just reuse that.
And also for the performance evaluation, we can use the xNVMe fio I/O engine when we do that performance evaluation.
And then we just need to implement that small part inside of the xNVMe runtime,
and we can then do this comparative study.
And we'll be looking at the tooling on how to do that.
So just think of XMME as that command line library
or command library where you can change your IO path
without changing a single line of code.
It has the synchronous interface, buffer management,
and also a queuing-based interface with callbacks.
So you submit your command by putting it into a queue.
It completes and invokes your callback function.
So that's sort of the general interface of xNVMe
with a bunch of tooling on top.
So let's go to two and see how that works.
So let's jump into the tooling.
And this is a brief overview of what we'll be looking at.
First off, on the command line tooling part, let's just jump right into device enumeration and looking at the specifics of ZNS.
As we mentioned, there is an append command that we really want to be able to ship over an asynchronous interface.
So here we enumerate our devices.
We can see we have now one using the generic char device.
There's also another one using the regular block interface.
And when we want to ship our append command,
then you have the same application,
and then you instrument the XMV runtime
to use different backends.
So now the EMA1 is using NVMe driver octals
and the queues and callbacks
are just emulation of that interface.
And then we get a result from that.
We can then shift that over
to using the new IOU ring command interface
on the generic chart device
and ship the command over that instead.
So this is just to show you how that tooling can be shifted.
And if you want to, you can also utilize the SPDK
to ship your commands through that as well.
I think it was just there.
Yeah.
And the difference here is that now you're telling it to send it down to a PCIe address.
And since that does not have the information about which namespace it is, which the other tools do here or the system interface does here, well, then we provide that to the tool.
And then we ship it over that. So that's just to show how you can shift these things without changing your code. And the most interesting
part is the tooling needed to do the performance evaluation. Here we use fio
and the xNVMe fio I/O engine. It has a fio job
that looks like this, doing some 4K random reads.
And yeah, the way you instrument that is telling it the file name,
telling it to go to the generic device here, and to use the io_uring command interface.
And then you just send that away.
And it works.
This is all the regular FIO instrumentation needed to run FIO.
And if you want to experiment with this
yourself, you can also have a look at how you would shift this over using the
SPDK NVMe driver instead. And that would look something like this.
What you should be aware of here is just that this colon is a special character in fio,
so you need to do an escape of that.
And again, there's no information about the specific namespace, so we provide that to the tool as well.
And then you can run that.
So it's the same fio engine, the same implementation
in xNVMe; it's just instrumentation of that xNVMe library. And that's what we'll be using now,
going through the performance numbers. So let's have a look at that. Yeah, and so that was the tooling part.
Now, let's look at the more interesting thing, which really is the performance part, saying
because we said that with NVMe pass-through, what we had before was an interface that wasn't
scalable.
We've mentioned that,
and performance was poor. So, what we mean by that, let's demonstrate it here. We're running fio.
Well, as you increase your queue depth, you would expect that you're hiding some latency,
and you can then get a higher throughput as you're processing those commands.
But since everything is happening in a blocking, synchronous fashion, you have to wait for everything to complete, which also accumulates the latency of doing each command. One way around that is to then implement a thread pool, and such a thread pool is available in xNVMe.
You just switch that XMME async.
But what you can see then is that at queue depth 1
it's slower than just being synchronous,
because you're paying a bunch of resource cost
and context switching in setting up your thread pool,
which makes it slower.
However, it does scale.
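In fio terms, selecting that thread-pool fallback is again a one-option change; a sketch, assuming the engine exposes xNVMe's thread-pool backend as xnvme_async=thrpool (option name is an assumption here):

```ini
; Emulate async I/O over the blocking ioctl path with a thread pool.
; Option names follow the fio xnvme engine; values are illustrative.
[global]
ioengine=xnvme
xnvme_async=thrpool   ; user-space threads ship commands over ioctls
filename=/dev/ng0n1
rw=randread
bs=4k
iodepth=16

[thrpool-randread]
size=1G
```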
So today, if we didn't have the
io_uring command, what we would do is set up a thread pool to ship our commands over the
ioctl interface, but with this incurred cost of having a thread pool and using those resources
on the user-space side. So that's really what we want to improve. We want something that's scalable
and efficient. So let me show you how things might look
in the future, when this interface gets upstream in the kernel. Well, what we have here is something that's more efficient at queue depth 1
than using the NVMe driver ioctl, plus it's scaling. So we're actually getting the best of both worlds.
We're getting the ability to send more than read and write,
and we're getting efficient scalability by using the io_uring infrastructure. And we have all of that flexibility of sending any command down that we want.
So that's great.
But another way of doing this today would be to use something like the SPDK NVMe driver,
as you just saw with the xNVMe tooling, where we could switch over and use that driver.
So what does that look like today?
Well, that is better.
So we are achieving and hitting this peak here,
which is the device's max IOPS for 4K I/O.
And we basically have a lower overhead at queue depth 1,
so we hit that spot sooner.
So there are still some things to improve in terms of reducing latency of the
user space to kernel interface.
And Kanchan has talked a lot about these different optimizations: fixed
buffers, different kinds of polling,
and the other things we can do in that exchange between user space and
kernel space.
And let's look at what happens when we enable one of those.
So what we do here is reduce that gap a bit further.
We're going from the blue line here up to the magenta,
and we're approaching what we get from being entirely in user space
by enabling completion polling inside the kernel.
Yeah, and there are, of course, other things you can do to lower the overhead, such as fixed buffers,
that are not being done here. So what we're doing here is purely enabling that polling logic.
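In the fio job, turning on that polling path is typically a one-line change; a sketch, assuming the engine's hipri flag maps to io_uring-style completion polling (the flag name mirrors fio's io_uring engine and is an assumption here):

```ini
; Same pass-through job with kernel completion polling enabled.
[global]
ioengine=xnvme
xnvme_async=io_uring_cmd
filename=/dev/ng0n1
hipri=1        ; poll for completions instead of waiting on interrupts
rw=randread
bs=4k
iodepth=16

[polled-randread]
size=1G
```

Note that polled completions burn CPU on the polling thread, which is the trade you make for the latency reduction shown on the magenta line.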
Yeah, so I think that's the main takeaway. With the char device and the async interface on top of it for pass-through commands,
we came from a world that wasn't scalable.
That's your synchronous ioctls.
Then you can try to do that in user space by adding a thread pool,
but it's not very efficient.
With io_uring, we get that scalability with the io_uring
pass-through interface. And we can also enable optimizations such as polling to make it even
better. So that's the main takeaway. If there's one slide you take away, please let it be this one.
Now, as mentioned, on the upstreaming and ecosystem side of things, we have the NVMe generic char device, the ng devices popping up in the Linux kernel.
It's available today, and it's been available since 5.13.
And the async ioctl path, the way that we can actually ship these commands other than read and write, is still an ongoing upstreaming
effort, and today you can go grab that from GitHub; the link is here. The features included
are the async NVMe pass-through, the one we just used to send append
commands over, and the polling feature that improved our latency and our IOPS scaling.
And xNVMe is also readily available on GitHub, and currently it supports a lot of different I/O paths.
For traditional I/O there are POSIX sync and POSIX AIO; on Linux, libaio and io_uring.
It supports the char device, the generic NVMe device. It supports
the NVMe driver ioctls on Linux.
And it supports the SPDK
NVMe driver. And
on Windows, there's support for
doing some of these NVMe driver
encapsulations over ioctls as
well. Plus, for your traditional I/O,
it does I/O Completion Ports. And
there is similar support on FreeBSD,
wrapping the driver ioctls and doing POSIX AIO
for your regular read and write commands.
So all of the emulation, the thread pooling,
and the NVMe driver encapsulation
are basically running on all of these different platforms.
And now the experimental support
for the async ioctls
is also available,
such that you can go
and experiment with
what we hope to be
the future of an
async ioctl pass-through path
for Linux.
Yeah, so the last thing
we have to say here
is basically that,
yeah, please do talk with us.
You're very welcome to join our Discord channel.
The name is put up here.
You can put that into Discord and find us and come hang out and talk.
We all hang out on that Discord channel on a daily basis.
You're also very welcome to email us with any questions you might have,
whether it's specifics to the kernel infrastructure,
things to improve,
we can of course always discuss that on the mailing list,
or if it's regarding XNVMe and how to use it
and how we might see getting this
into other open source projects out there.
And so the very last thing to say is that
we hope that you will take a moment to rate the
session since your feedback is quite important to us. But I will leave you on this slide and
say thank you for listening. Thanks for listening. If you have questions about the material presented
in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with your peers in the storage developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.