Storage Developer Conference - #164: Enabling Asynchronous I/O Passthru in NVMe-Native Applications

Episode Date: March 8, 2022

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, episode 164. Hi, everyone, and welcome to our session. Today, Kanchan, Simon, and myself are going to talk a little bit about some of the work that we're doing in Linux mainline to enable a new high-performance I/O path that we're calling asynchronous I/O passthrough. We're obviously going to cover what we mean by this, and what are the use cases we're after,
Starting point is 00:01:03 and then we're going to deep dive into the changes that we're making in Linux and in other places in the storage stack to enable this path. Before we get started, I want to make clear that while some parts of this work are already upstream, things change very quickly. There are still a number of elements that are under discussion on the Linux kernel mailing list. So things, again, might change.
Starting point is 00:01:33 And I also want to make very clear that some of the work that we're presenting here is a community effort. So, you know, we presented some of these ideas, applied some of the feedback, and got a lot of conversations going on the mailing list. Just to name a few of the people that have been involved directly in this work: Jens, Christoph, and Keith. There have been many others. So, you know, if you're interested in the topic, I would encourage you to subscribe to the mailing list, follow the conversation, and participate in the things that are already or still under debate.
Starting point is 00:02:09 We are dividing the presentation in three main parts. First, I'm going to talk about the work we're doing in NVMe to enable this in-kernel pass-through I/O path. What is the motivation? What are the things that we want to cover? How does it relate to the rest of the I/O paths that are already available in Linux? Then Kanchan is going to talk about asynchronous ioctls, which is very necessary work for this NVMe in-kernel pass-through to be able to perform at scale. And then Simon, at the end, is going to wrap everything up and talk to you about how you use this path
Starting point is 00:02:52 and the work that we are doing in XMVME to provide this common storage API for all these paths to come together. And he's going to use the opportunity to guide you through a demo and give you some performance numbers on real-world applications. So,
Starting point is 00:03:13 if you look at Linux today, you traditionally have two main ways of submitting I.O. You have your file I.O., which is very convenient and very easy from an application perspective. If you want more control, then applications would rely on the raw block interface. This is very convenient if you want this control. You see the whole LBA address space. You can control a lot of the IA properties whether you want asynchronous or synchronous or whether you want
Starting point is 00:03:51 to do direct IA and implement your own cache or leverage the kernel page cache, QDev, priorities, all that. And a number of applications have been very successful at leveraging this for implementing all sorts of things on top, like object stores, key values, things like that. For NVMe specifically, the block device, as you know, is namespace granularity, so it fits very well. The block layer is fantastic. But when you're building this common block abstraction, it comes with some natural limitations.
Starting point is 00:04:33 So just to name a few, I believe we're all aware that there is a number of these data protection schemes, like protection information, defix combinations that are not supported. So something that is used in enterprise environments, this 4K plus 64 in a contiguous buffer, something that BlockLayer does not support today. So these people basically, when they plug the devices,
Starting point is 00:05:01 they don't see the block device come up. Or sometimes even worse, there is this inconsistent interface where sometimes the block device comes up, but it comes with capacity zero, which means you cannot use it. So it's not really useful, right? We also have constraints for new device types. To give an example in CNS, the block layer puts some constraints that make a lot of sense for the kernel, which is having a power of two sound sizes or enforcing the use of the append command. If you don't want to use these features,
Starting point is 00:05:40 then the block device would not come up. I'm sure there are other reasons, but I believe this has had an impact on explaining the rise of SPDK. Because SPDK got a number of things very right. They understood that there is these applications that want a very fast, low latency IO path they have control over.
Starting point is 00:06:07 Because a number of applications can build this domain specific, you can call it domain specific block layers or domain specific abstractions because the application knows the workload. I don't know, you know that you're writing at 32K, you can optimize for that.
Starting point is 00:06:23 You know that the application is respecting your maximum data transfer size. You know that the application is respecting your maximum data transfer size. You don't need to implement splits, you know, these type of things. I think there's no discussion that SPDK paves the way for all these low latency storage stacks. Today we see IOU ring matching this performance
Starting point is 00:06:42 and it's great, but the applications are ready, most probably because they did a lot of work already on SPTK. And at the end of the day, I think when you put this all together, SPTK got very right that there is a number of people that want to do this end-to-end optimization. So either because you want to keep it as a vendor-specific because you control your whole storage stack, or because you want to do some type of POCs before you bring it to the technical working group in NVMe or to your vendor. So it allows for this fast innovation, where you can do a number of things, create your command, speak NVMe to your device, and then go all the way down.
Starting point is 00:07:31 However, you know, the more popular SPDK becomes, the more generic it becomes. And then we start having this redundancy on some of the features that are traditionally the operating system responsibility, which now are being moved to user space in order to cover this. So the main question here is, can we bring this functionality to an IO path that goes inside of the kernel so that we can reuse all the kernel functionality, you know, your security, your, you know, whether you want to do cgroups,
Starting point is 00:08:08 whether you want to, you know, to use whatever containers, you know. Avoid the redundancy that comes from putting a lot of this logic in user space. And this is where the NVMe generic device comes in place. What it is is pretty simple.
Starting point is 00:08:33 When you get a new namespace, it comes typically as a block device. You know, your NVMe X and Y, 0 and 1 and 2, 3 3 etc. Now you will see also a car device coming on the side very similar to the car device that comes for the controller like NVMe 0 and NVMe 1. This device will always be available so if some of the features of that NVMe namespace
Starting point is 00:09:06 are not supported by the block layer, the block device will be rejected or come up as capacity zero or as read-only. But the card device will always be there. And this device would allow you to send IOC tiles directly into the driver, meaning that you have this communication where you can submit an NVMe command
Starting point is 00:09:28 that you form in user space, and the device will understand that. And the path goes entirely through the kernel. If you want to compare that to what it looks in the SPDK, it is very similar in terms that if you're using the lowest SPDK API for the NVMe driver, you will be speaking NVMe. But the namespace will, or actually the whole device, will not be detached from the kernel.
Starting point is 00:10:03 The PCIe device will not be detached and put into user space. It will still be within the kernel. You know, the PCI device will not be detached and put into user space. It will still be within the kernel. This gives you some interesting use cases because then, you know, you can think of a normal server with several drives. Some of them, if they have several namespaces, can have like a normal file system put into them. And a namespace in the same device can be used as a pass-through without having to touch the whole device.
Starting point is 00:10:35 So you have much more flexibility in that sense. It also gives you a common stack for implementing these kind of security policies where you don't need to have some part of these policies in user space, some part of these policies within kernel space. So this is the main target. SCSI has had something similar for quite a long time. UFS, to the best of my knowledge, this doesn't exist.
Starting point is 00:11:00 But if this is a use case that becomes interesting, especially after the asynchronous Ioctals get merged, we can look into enable that. For NVMe, we merged this in 5.13. There's support for the Ioctal IO. There's basic support for NVMe CLI, and we're having some patches upstream in the next week or so
Starting point is 00:11:26 for having all this support enabled. How does it look like if you want to consume the NVMe-generated device? So as I mentioned before, now you not only are going to see your dev NVMe 0 and 1 and 2 and 3 as the different namespaces on the block device, you're going to see this NG from generic and the same numeracy. This is coming from NBME CLI. Now you can see the patches here. That's what I mentioned before. This is the part that is missing for the list on NVMe and we're posting the patches soon.
Starting point is 00:12:09 If you want to use this from the programming interface, this looks like a typical IOC tool. So you would form the NVMe command as a normal NVMe IOC tool, the same one that is existing today. This means obviously that performance is not very well good because it's a synchronous interface. CanSan will talk about what we're doing to use IODuring and make this scale and actually be usable for the IOPath. Good thing about using the NVMe driver in the kernel
Starting point is 00:12:55 is that it is pretty simple for us to enable fabrics because it is automatic. Just as it is, when it is enabled as a block interface, the device is up. When it is not, we need to enable the password controller. This is something that is easy to do. And when it is done, all the functionality comes basically for free, which is very convenient.
Starting point is 00:13:23 So this is it for the mimi generic device um canton will take over and he will talk about asynchronous iactos if you have any questions about this part in particular you know please drive through slack or reach out to us afterwards thank you javier so now let's delve into the catch-all system call that has been around since so many years. It's called IO-Control or in short, IOctel. Many operations which do not fit well into the existing system calls are done by IOctel. But despite the fact that that has been massively used for doing a variety of stuff, IOctel remains synchronous to this date.
Starting point is 00:14:06 And we are going to talk about turning it synchronous with the help of iUring. But before that, let's touch base with iUring. It came into kernel in 2019, provides a scalable infra for doing async IO, both for file and for network as well. Unlike the Linux AIO, it allowed doing async buffer I-O too. And at this point, lots of system call beyond read
Starting point is 00:14:37 and write, I mean, have already found their async variant within the I-O ring infra. Making directories, link and symlink are the few recent examples of that. If we look at the communication between the user and kernel part of the Iuring, so the backbone here is a pair of ring buffer that kernel creates and uses space maps.
Starting point is 00:15:04 And that allows further transition to happen with a reduced number of system calls and memory copies. And the programming model after we set up the ring would be to pick a free submission queue entry, SQE from the submission queue, fill it up with relevant opcode and all the other information that the operation may require. And we can fill up more such SQEs, make a bunch of it.
Starting point is 00:15:32 And once we are done doing that, we could submit all these entries by doing a single system call, which is called IuringEnter. And at some later point of time, we can read the completion by looking at the CQ ring. So if the tail and head of the CQ are not at the same position, we have got a completion. And that's about it to keep it simple. A lot more is possible, like elimination of submission system call by enabling SQ offload and doing the completion without relying on interrupts. But then in this slide, I am only scratching the surface of it. Far more has been said on Iuring in a much better way.
Starting point is 00:16:14 For example, this stock faster IO through Iuring by GENZ would be excellent to give year two. And now we go into howUring can deliver an async Ioptal. First thing first in the current and the next slide, I'm talking about the work that Jens has done to wire up async Ioptal. Let's start with the user interface part of it.
Starting point is 00:16:41 How does it look like to iUring user? So the feature currently is called Uring command and it is triggered by passing this new opcode called Ioring op Uring command. Apart from this new opcode, there is this new type of submission queue entry and it's called command SQE or CSQE in short. As you can see in the figure here, the CSQE is of same size as regular SQE, that's 64 bytes. But CSQE is spatial in the sense that it has got 40 bytes of free space, and which is what we refer as payload
Starting point is 00:17:20 in subsequent slides. And this payload can be used by application to store Ioptel command itself. So it's going to save the allocation cost if the Ioptel command happens to be smaller than 40 bytes. And if the command is larger, application can allocate it by some other way and put the pointer of the command
Starting point is 00:17:44 inside the CSQE payload army. And of course, these are two ways that I mentioned over here, but other ways to use the payload are also possible. And coming back to, once we are done setting up the CSQE and its payload, we send it down the usual way and we leave the completion in the usual way. It is to be noted that Iuring does not peek into the payload or enforces many rules around it.
Starting point is 00:18:19 It merely passes it down to the IQL, and which can have its own custom logic to process the payload. And here we look at the internal communication model between the I-O-Ring and the I-Octal provider, which could be a file system or could be a driver. So I-O-Ring starts by fetching the CSQE that is supplied and sets up an internal container called Iuring command and that container is cardinal for all the further communication. The second part is whoever is the Iuchtel provider, it can participate in this business of making the ioctl-lessing by
Starting point is 00:19:07 implementing this new file operation called uring command. So iuring essentially submits the ioctl to the provider by calling this particular method and when provider takes charge it does whatever needs to be done for submission and returns without blocking. It can return the results instantly, or it can say that it requires some more time to send the result. And for that case, it's whenever it has the result, it is going to pass it back to the Iuring
Starting point is 00:19:40 by calling another helper, which is currently called Iuring command then. You could see all this in the sequence diagram here. And once the result reaches to the Iuring, it does post this particular result into the CSQE and we are done. Moving on, we look at employing the async Iuctal for the NVMe pass-through interface.
Starting point is 00:20:06 So as we see in the figure here, we have bunch of abstractions stacked up on the storage device, and each one is having its own purpose. With the pass-through interface, Kernel is exposing a path, which kind of cuts through all these layers, all these abstractions.
Starting point is 00:20:25 And this can be useful to try out some new features, especially in NVMe, we know that the features are emerging fast. And without this, all these features basically take some time to move up the letters of abstractions over here. At times when it is about building a file or user interface on top of a new device feature,
Starting point is 00:20:52 some sort of opaqueness need to be designed so that the interface would be, can be reused in future and it allows future extension. But pass-through skips all that. And the feature can be, it can be used readily. But the problem is that the pass-through travels via a blocking Ioptel and that makes it almost useless for fast devices like NVMe.
Starting point is 00:21:22 So endeavor here is to build a new pass-through interface which is scalable and by combining it with the IONIQ advancement we do imagine a much more useful interface. So here if we look at the NVMe Iopt, it is about forcing the sync behavior over a sync. The NVMe interface is naturally a sync, who submits the command and it can go back to its business, post the submission. It's the driver which implements
Starting point is 00:21:57 the sync behavior by putting the submitter going to a blocking bit. In the new scheme of things, I'm talking about the UDIN command here. So for the NVMe care device, we add the UDIN command handler, and that does nothing more than decoupling the completion from submission. At high level, that's what it is.
Starting point is 00:22:22 And there won't be any blocking weight. So one of the problem which is worth to mention here is that which I refer as a sync update to user memory. So it's a general problem if a ioctl command has some field which need to be updated during the IO completion. Because such fields are in user space and they cannot be touched while we are running in the IO completion. Because such fields are in user space and they cannot be touched while we are running in the interrupt contest.
Starting point is 00:22:47 And typically the completion of the IO arrive in interrupt contest. So to solve that, if you look at the sequence over here, NVMe driver sets up a callback to do all that update. And that particular callback function, it supplies to the IO ring ring. Then IE ring sets up something called a task work which is kernel's mechanism to to schedule a work into into a user space task and with that the callback that driver supplied gets executed in the task context and we could do the update.
Starting point is 00:23:30 Now let's look at the example over here. So we could see that we open the care device ng0 and 1 and then we allocate a pass-through command structure. We set it up in a regular way. But what we do is that we put this particular command inside the command SQE payload. And once we have done that, we could submit it in a regular way. And then we can read the completion and that's about it.
Starting point is 00:24:09 That would be a way to do the reading from the device from this particular new interface. Some of the tidbits for the GNS users, if we look at it. So it is possible to do the async zone reset. Currently the zone reset is possible only via the zone management command
Starting point is 00:24:35 and zone management command is sent by IFTL. And the other useful thing could be to send the zone append command at higher QD in. So currently Jonah pen is usable inside the kernel, but it's not exposed to the user space. And this could be one of the way to do that. While having the interface async is the first thing to do, it doesn't have to be the only thing. Since NVMe is talking to iEuring directly, we seem to have room for adding some more goodness in the pass-through interface and in async-aroptal in general.
Starting point is 00:25:30 So Ioring has a bunch of features beyond async I mean, to make the IO faster. And some of those are listed in this table. We have register file, which is about reducing the cost of acquiring and releasing the file references inside the kernel. And there is sqpol, which enables application to submit the IO without doing system call at all.
Starting point is 00:25:51 And we have fixed buffer and async polling which we will cover in detail anyway in the further slides. The first two feature in this table, the register file and the sqPol, they become available with the infra that we discussed in previous slides. While the other two feature, they require some new plumbing, both at the U-ring and then
Starting point is 00:26:13 the driver. So we start with the fixed buffer support for U-ring pass-through. How can we go about doing that? But before that, let's look at what is fixed buffer. So the last, or maybe the second last thing to do, before we submit the IO to the NVMe device is making the buffers DMA ready.
Starting point is 00:26:40 And for that to happen fine, we got to pin the buffers in memory. Generally, this is a per IO cost. We pin the buffer for the IO and unpin it once the IO is done. With fixed buffer, this per IO work is not done. In place of that, application pins the buffer once and goes about reusing these pinned or pre-mapped buffers subsequently. For this support, we add a new output in the Iuring,
Starting point is 00:27:10 which is similar to the original Iuring command, but it acts on the fixed buffer instead. So it's called Iuring command fixed. And application uses this code to submit the IO and tells which T-MAP buffer to be used by its index. Buff index is shown in the figure over here. On the NVMe side, driver stops doing the per IO pin and unpin and it talks to IO ring toing to obtain the prepend buffer.
Starting point is 00:27:51 Now we look at the second feature, which is about pass-through polling. But before that, let's take a cursory look at the info that Kernel has for IO polling. So IO polling became relevant, particularly after the emergence of low latency stories. And the interrupt, the classical mean to indicate the emergence of low latency storage and the interrupt the the classical mean to indicate the completion of i o it started to become a non-trivial software overhead for such devices so so in the current polling model uh device stops generating the interrupts while the
Starting point is 00:28:20 submitter activity checks for the completion uh Originally the kernel had the sync polling infra when application summits the IO. And after that, it starts spinning on the CPU looking for the competition actively. That was referred as the classical polling. Obviously at times it became heavy handed on the CPU and people added the option of hybrid polling. With that application puts itself to sleep
Starting point is 00:28:48 while for some time while looking for completion. The system called to do this kind of sync polling RP read V2, P write V2 with the high cry option. And then came a sync polling with the AIU option. And then came async polling with the IO urine. So it basically boils down to this fundamental question that what all choices do we have after submitting an IO while we have already disabled the interrupts? We can spin, that's one thing to do.
Starting point is 00:29:24 And we can sleep and spin, that's what we do with the hybrids. We can spin. That's one thing to do. And we can sleep and spin. That's what we do with the hybrid polling. That would be the second option. With the async polling, we get this third option of doing something useful like submitting another IO or maybe some other app-specific stuff. And that's possible because polling has been decoupled from submission. In order to use the async polling in our hearing, we need to set it up with a flag here.
Starting point is 00:29:53 And then all the IO we do wire that ring that becomes polled IO. diode. Now we look at how can you wire up the polling support in the Euling pass-through. So figure in the left, it shows the per-core cues at the block layer. These are referred as the SCTX software context. And these are mapped to the hardware context, at CTX in short. That CTX corresponds to the NVMe SQ and CQ pair. Now, from device side, NVMe protocol allows to create NVMe CQ without interrupts.
Starting point is 00:30:42 For such completion queues, device is only going to post the CQE and not going to generate the interrupt after that. So here in the figure the green CQs are interrupt disabled while the blue ones are interrupt enabled and the same color coding applies for the HTTX that CTX that's sitting above it. Now, when we look at what needs to be done to enable the polling for urine command, for the submission, we need to do two things. We need to ensure that the IO is placed on it, hold that CTX, the green one.
Starting point is 00:31:23 And once we place it, so we are showing that IO placement within that CTX with the orange color. And that is the cookie for the command. So cookie basically tells
Starting point is 00:31:40 that CTX and the command ID. While the second thing to do would be to remember this cookie because completion would be done at a later point of time. So we store this cookie into the Iuring command structure itself and that's about submission. And when caller decides to do the polling for the completion, we pick the corresponding Iuring command.
Starting point is 00:32:08 We obtain the cookie from it. And we send it to something called BLK poll, which implements the infra of classical polling as well as hybrid polling. So it takes a cookie as an input and it uses it to identify the relevant HTTX and the corresponding NVMe CQ. And then it starts polling that particular NVMe CQ. Once it finds the completion, the job is over.
Starting point is 00:32:44 Now, if you take a step back and look at the table comparing the parts, it looks much better than before. Yes, we still have a gap because recently one more advancement has got added into the bunch and that's called biocache or biorecycling. It's a transparent optimization as far as user space is concerned, because nothing extra needs to be done.
Starting point is 00:33:12 Nothing extra needs to be specified from user space. And that is what makes it different from the rest of the feature. We'll be having this feature added as we move along this path. And now all said and done um that the the whole nvme pass-through interface whether we talk about the the new one the new iuring one or the the existing uh the existing i octal one it expects application to speak in
Starting point is 00:33:40 native language uh that means the application need to pass the NVMe commands to talk to NVMe device. So I will pass it over to Simon, who is going to talk about what can we do about that part, how best to consume this interface and how this interface performs. Thank you, Kanchan. Yeah, and I will specifically be talking about three things. And the first is one way of consuming the car device,
Starting point is 00:34:19 and specifically why you might want to do it in that way. And second, we'll be giving you a demo of how the different tooling around this usage works, such that you can try it out yourself. It will be quite pragmatic. It includes tools such as FIO for evaluating performance, such that you can go ahead and do that yourself. And third, well, that's about evaluating the performance of the car device compared to what's currently available today and other ways of shipping pass-through commands to your device. All right, starting with one. We'll go back here looking at this figure again,
Starting point is 00:35:07 because one way of consuming the car device and the async pass-through path provided in these new patches, well, that is to use the example that Kenshin showed you, that is construct the HKE and send that down the IOU ring path as you would normally do. However, what we want to do is also do stuff like FIO. We want to do a performance evaluation of the interface. We also want to look a bit of the tooling
Starting point is 00:35:36 and how you can talk with your NVMe devices on the command line. So we actually want to experiment with this interface, this new system interface, through a bunch of different application and tooling interfaces. And to enable that, well, then we use XenVimi. And XenVimi is the smallest amount of software abstraction that you need in order to encapsulate a bunch of different operating system and system interfaces into a single programmable API. And by doing so, then the different tools already provided by XNVMe, such as the command line tool for sending administrative commands and shipping other IO commands through the
Starting point is 00:36:19 command line, well, we can just reuse that. And also for the performance evaluation, we can use the XMME FIO IO engine when we do that performance evaluation. And then we just need to implement that small part inside of the XMME runtime. And we can then do have this comparative study. So that is sort of a quick overview of either you can like go down and do the low level coding yourself or you can use something like XMME. And we'll be looking at the tooling on how to do that. So just think of XMME as that command line library or command library where you can change your IO path
Starting point is 00:36:54 without changing a single line of code. It has the synchronous interface, buffer management, and also a queuing-based interface with callbacks. So you submit your command by putting it into a queue. It completes and invokes your callback function. So that's sort of the general interface of XenVMe with a bunch of tooling on top. So let's go to two and see how that works.
Starting point is 00:37:23 So let's jump into the tooling. And this is a brief overview of what we'll be looking at. First off, on the command line tooling part, and let's just jump right into device enumeration and looking at the specifics of CNS. As we mentioned, there was an append command that we really want to be able to ship over an asynchronous interface. So here we enumerate our devices. We can see we have now one using the generic chart device. There's also another one using the regular block interface. And when we want to ship our append command,
Starting point is 00:38:00 then you have the same application, and then you instrument the XMV runtime to use different backends. So now the EMA1 is using NVMe driver octals and the queues and callbacks are just emulation of that interface. And then we get a result from that. We can then shift that over
Starting point is 00:38:20 to using the new IOU ring command interface on the generic chart device and ship the command over that instead. So this is just to show you how that tooling can be shifted. And if you want to, you can also utilize the SPDK to ship your commands through that as well. I think it was just there. Yeah.
Starting point is 00:38:51 And the difference here is that now you're telling it to send it down to a PCIe address. And since that does not have the information about which namespace it is, which the other tools do here or the system interface does here, well, then we provide that to the tool. And then we ship it over that. So that's just to show how you can shift these things without changing your code. And for the most interesting part are the tooling needed to do the performance evaluation. Here we use FIO and the XMME FIO IO engine. It has a FIO job that looks like this, doing some 4K random reads. And yeah, the way you instrument that is telling it the file name, tell it to go to the generic device here and use the IOU link command interface.
Starting point is 00:39:34 And then you just send that away. And it works. This is all the regular FIO instrumentation needed to run FIO. And if you want to experiment with this yourself, you can also have a look at how you would shift this over using the SPDK NVMe driver instead. And that would look something like this. Where what you should be aware about here is just that this colon is a special child in FIO. So you need to do an escape of that.
Starting point is 00:40:19 And again, there's no information about the specific namespace. So we provide that to the tool as well. And then you can run that. So it's the same file engine, same implementation in XNVMe. It's just instrumentation of that XNVMe library. And that's what we'll be using now in the and going through the performance numbers. So let's have a look at that. Yeah, and so that was the tool part. Now, let's look at the more interesting thing, which really is the performance part, saying because we said that with NVMe pass-through, what we had before was an interface that wasn't scalable.
Starting point is 00:41:04 We've mentioned that, and performance was poor. So what we mean by that, let's demonstrate it here. We're running FIO. Well, as you increase your queue depth, you would expect that you're hiding some latency, and you can then get a higher throughput as you're processing those commands. But since everything is happening in a blocking synchronous fashion, then you have to wait for everything to complete, which also accumulates the latency of doing each command. to then implement a thread pool. And such a thread pool is available in XMME. You just switch that XMME async. But what you can see then is that you're spending at QDEF 1, it's slower than just being synchronous
Starting point is 00:41:55 because you're paying a bunch of resource cost and context shifting in setting up your thread pool, which makes it slower. However, it does scale. So today, in case we didn't have IOU ring command, what we would do is then set up a thread pool to ship our commands over the Ioctl interface, but having this incurred cost of having a thread pool and using those resources on the user space side. So that's really what we want to improve. We want something that's scalable
Starting point is 00:42:25 and efficient. So let me show you how things are looking today. So with or how things might look in the future when this interface gets upstream in the kernel. Well, what we have here is something that's more efficient than at QDev1, than using the NVMe drive I octal, plus it's scaling. So we're actually getting the best of both worlds. We're getting the ability to send more than read and write, and we're getting efficient scalability by using the IOU ring infrastructure. And we have all of that flexibility of sending any command down that we want. So that's great. But other ways of also doing this today would be to use using something like the SPDK NVMe driver, as you just saw with the XNVMe tooling where we could switch over and use that driver.
Starting point is 00:43:24 So what does that look today? Well, that is better. So we are achieving and hitting this peak here, which is the device max IOPS for 4K IO. And we're basically just having a QDeth 1 lower overhead, and we hit that spot sooner. So there are still some things to improve in terms of reducing latency of the user space to kernel interface.
Starting point is 00:43:51 And Kenshin has talked a lot about these different optimizations of fixed buffers, different kinds of polling, and these different things we can do in that exchange between user space and kernel space. And let's look at what happens when we enable one of those. So what we do here is to reduce that gap a bit further. So we're going from the blue line here up to the magenta, and we're approaching what we get from being entirely in user space
Starting point is 00:44:19 by enabling complete impaling inside of the kernel. Yeah, and there are, of of course other things that you can do to lower the overhead with fixed buffers that's not being done here. So what we're doing here is purely just enabling that polling logic. Yeah, so I think that's the main takeaway. So with the car device and the async interface on top of it for pass-through commands, we came from a world that wasn't scalable. That's your synchronous IOCTOs. Then you can try and do that in user space by adding a thread pool, but it's not very efficient.
Starting point is 00:45:02 With IOU-Ring, we get that scalability with the IOU Ring password interface. And we can also enable optimization such as polling to make it even better. So that's the main takeaway. If all you're seeing, then please digest this. Now, as mentioned on the upstreaming and ecosystem side of things, we have the NVMe generic device that is the NG0 popping up in the Linux kernel. It's available today and it's been available since 5.13. And the asynchroctal path, the way that we can actually ship these commands, which are not other than read and write that's still an ongoing upstreaming effort and today you can go grab that from github the link is here and the features included here are the async nvme pass through the one we just used to send append commands over to send append
Starting point is 00:46:00 commands over and the polling feature here that improved our latency and our IOPS scaling. And XenVMe is also readily available on GitHub, and currently it supports a lot of different IO paths, where traditional IO are POSIX, POSIX AIO, Linux is LIB AIO, IO URING. It supports the car device, the generic NVMe device. It supports the NVMe driver, IOctals on Linux. And it supports the SPDK NVMe driver. And on Windows, there's support for
Starting point is 00:46:33 doing some of these NVMe driver encapsulations for IOctals as well. Plus for your traditional IO, it does IO control ports. And there are also similar support on FreeBSD, revving IO tools and doing POSIX AIO for your regular read and write commands.
Starting point is 00:46:52 So all of the emulation, the thread pooling, the NVMe driver encapsulation are basically running on all of these different platforms. And now the experimental support for the async IOC tools are also available, such that you can go and experiment with
Starting point is 00:47:08 what we hope to be the future of an IOC, an async IOC pass-through path for Linux. Yeah, so the last thing we have to say here is basically that, yeah, please do talk with us.
Starting point is 00:47:25 You're very welcome to join our Discord channel. The name is put up here. You can put that into Discord and find us and come hang out and talk. We all hang out on that Discord channel on a daily basis. You're also very welcome to email us with any questions you might have, whether it's specifics to the kernel infrastructure, things to improve, we can of course always discuss that on the mailing list,
Starting point is 00:47:51 or if it's regarding XNVMe and how to use it and how we might see getting this into other open source projects out there. And so the very last thing to say is that we hope that you will take a moment to rate the session since your feedback is quite important to us. But I will leave you on this slide and say thank you for listening. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snea.org.
Starting point is 00:48:32 Here you can ask questions and discuss this topic further with your peers in the storage developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
