Storage Developer Conference - #87: Latest developments with NVMe/TCP

Episode Date: March 4, 2019

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, Episode 86. This talk is really designed to give an overview of what's coming with NVMe over Fabrics 1.1, focusing on NVMe TCP. If we'll have
Starting point is 00:00:54 some more time, we're going to cover one more aspect, which is discovery enhancements. But the current plan of record is just to focus on NVMe TCP. All right, so I'll start with a recap of NVMe over Fabrics, just to align everyone so that we're on the same page. I'm sure everyone knows the history and what NVMe over Fabrics is. You've probably heard it a million times before. But still. So three years ago, NVMe over Fabrics delivered on the promise of extending NVMe and providing distance. We'd seen pre-standard NVMe over RoCE prototypes at conferences and whatnot. Later in 2014, and if I'm getting the dates wrong, that's out of my own recollection.
Starting point is 00:01:58 We have enough people in the room that can correct me. But later in 2014, NVM Express formed the initiative to standardize NVMe over Fabrics. In 2015, NVMe.org formed what's called the Fabrics Linux Driver Task Force, which was designed to really implement the Linux NVMe over Fabrics driver, to sort of converge on a single implementation where we had multiple efforts, align everyone on the same page, and really deliver on the implementation. We had a rough patch starting out, converging on a code base,
Starting point is 00:02:56 which, as you will find later, is something that happens to be recurring. From there, we developed pretty similarly to how Linux itself is developed. We had a private Linux kernel tree hosted in GitLab that was open to all the members of the Fabrics Linux Driver Task Force. And during the development, a lot of the heavy lifting that was designed to make the NVMe driver really a proper subsystem in Linux was contributed even before the NVMe over Fabrics driver itself. So a lot of the work that was being done in the Fabrics Linux Driver Task Force activity
Starting point is 00:03:45 was propagated to upstream Linux as the efforts went on. It took just over a year, or a year and a half, for the spec to get ratified, oftentimes driven from the implementation, trying to find out what works and what doesn't work. So in October 2016, in kernel 4.8, we saw the NVMe over Fabrics driver support, the host and target stacks, merged into the Linux kernel. So what happened since then? We're two years on since then.
Starting point is 00:04:36 A lot has happened, and this is really a bullet view of a lot of what happened. So the NVMe over Fabrics stack, as a young stack in Linux, went through a lot of hardening and stability fixes. And as the community grew and the user base expanded, a lot of fixes have been made. Also, we've seen contributions such as tracing support, to get finer instrumentation of the code and see where we can improve or debug. An additional transport, Fibre Channel, which came pretty close after the initial merge of NVMe over Fabrics, which included the RDMA and the loop drivers.
Starting point is 00:05:25 A lot of activity around that. A lot of great work from James Smart, from Broadcom now. I haven't seen him around this time. Tool chain enhancements. So we have a lot of flexibility about how we want to establish a Fabrics controller, so we pretty much enhanced nvme-cli, which is the tool chain for the NVMe host stack.
Starting point is 00:05:54 A lot of arguments, such as proper queue sizing, how many IO queues you would want, a way to control the host NQN, keep-alive timeouts, and a lot of other things. In terms of compliance, we also added UUID support to the Linux NVMe target implementation, and it is also correctly supported in the host stack. In terms of enhancements, this is really a minor view.
Starting point is 00:06:31 If you attended the last talk from Martin, you probably saw more. This covers two years of development, so these are really just the highlights. TCG Opal support, sorry for the typo. IO polling mode, which later became a hybrid polling mode option, and was also bucketized. Thanks to Stephen Bates, is he here? I don't know, just wanted to give a shout-out.
Starting point is 00:07:05 And also the most recent development on ANA, which is a newly ratified TP that's implemented in the native NVMe multipath code in Linux. If you haven't heard of it, you should give it a try. It's awesome. And all sorts of other activities that I'm for sure missing, but this is sort of what happened since then.
Starting point is 00:07:41 So all this was Fabrics 1.0. Now, Fabrics 1.1: the scope revolves around four activities in general. The top-priority item was the TCP transport binding, also called NVMe over TCP. Discovery enhancements, or by its more formal name, the dynamic resource enumeration TP. And two other small TPs: one is the SQ flow control disable mode, which allows implementations that don't necessarily care
Starting point is 00:08:14 about SQ head pointer updates to be relieved of this constraint. If we'll have time, we'll talk a bit more about that, the problems it comes to solve and what other problems it's creating, but it's not the focus here. And traffic-based keep-alive, which I think was published today on the NVM Express spec webpage. And we'll shortly see the Linux driver implementation, once I get a chance to send the patches,
Starting point is 00:08:54 which have been ready for a long time. And the Fabrics Linux Driver Task Force took it on itself to implement the TPs, prototype them, and both drive the spec and provide early or fast adoption into Linux, which is a known early adopter of new technologies. By the way,
Starting point is 00:09:24 feel free to stop me at any time. Just raise your hand and you can ask questions. I'll spot you and answer. So just a couple of motivations for NVMe over TCP. First and foremost, it's ubiquitous, which is not to be taken lightly. It runs everywhere and on everything. It's probably the most widely deployed transport out there. So in some way, it would just make sense to also run NVMe on it. In terms of ecosystem, it's well understood and maintained by all the major vendors that operate either private or public clouds, or vendors that contribute to and run high-scale data centers.
Starting point is 00:10:32 It is high performance. It can deliver excellent performance in terms of scalability across CPU cores, and it can also achieve low latency. We'll touch a bit more on that. But I often look at how the storage ecosystem evolves and at all the technologies that evolve with it,
Starting point is 00:11:02 and whenever they look familiar, it's usually because the networking stack solved a lot of these problems 10 years before they were introduced into storage. So a lot of NVMe concepts, in the way that it works, are concepts the networking stack has already implemented, and TCP is no different in that respect. In terms of scale, it's pretty much proven to scale. It's well-suited for longer distances and large-scale deployments. And as I said, in terms of the ecosystem, RFCs keep popping up, and
Starting point is 00:11:48 TCP enhancements and improvements keep being propagated and pushed, either within Linux or in the other major operating systems. Okay, so ratification status, sort of where we are. We're in, I think, the final stage in terms of voting, or one stage before that. The technical proposal is in its 30-day member review. After that, it will undergo integration into the existing NVMe over Fabrics spec. Once that's done, it will go to board approval. And once the board approves, it should be available online. I captured some of the companies that are active, a couple of the vendors and companies that are actively involved, and also some of the supporters that helped make it happen. In terms of the standard, the guideline is really about simplicity and efficiency,
Starting point is 00:13:06 not to try and bite off more than we can chew. And that's it. Other questions? Okay. So... I actually have a question. What port number do you use? We actually allocated...
Starting point is 00:13:32 That's a good question. We allocated an IANA port number. Before, in NVMe over Fabrics, we allocated a port number for NVMe over Fabrics, 4420, which now, from the enhancements, is allowed to be used only for iWARP and RoCE. For NVMe over TCP discovery, because IANA doesn't want different port numbers for the same thing for TCP versus RDMA in general,
Starting point is 00:14:03 we have what's called an NVMe discovery port number, which is used by NVMe over TCP, and that's 8009. So it has a default port. And in the TP, we actually added a clarification section that explains that you should not use the same port number for iWARP and TCP on the same network. Well, iSER solves this itself by logging in and negotiating the capabilities, but you don't do that? No, we don't do that.
Starting point is 00:14:48 We don't fall back into... Oh, I have just a second. It's okay. I'll go read. We don't fall back into RDMA. Sorry, I don't have battery, and it's going to die at some point. Sorry about that. All right. So, no, we don't log in with TCP and then fall back into RDMA. We have default ports both for RDMA and TCP.
Starting point is 00:15:42 iWARP being an RDMA technology on top of TCP, there is a clarification section that you should not implement a dual mode, because TCP and iWARP cannot share the same port space. In terms of the model, or the association model, we basically didn't need to do much. It was just: do whatever NVMe does, and try not to screw it up. So in terms of mapping, we map every NVMe queue that the controller creates
Starting point is 00:16:16 into its own bidirectional TCP connection. The benefit is that we don't have any controller-wide sequencing like you can find in other TCP-based implementations, and no controller-wide reassembly constraints before the target really submits I/O to its backend. One important thing to understand is that when an NVMe TCP association, or a controller representation, is formed, we first establish the TCP connection. At this point it sort of floats, and the binding to an NVMe controller comes when the Fabrics Connect that follows is issued on that TCP connection. So we sort of connect a bunch of TCP connections, and only then do we form a controller out of them. And basically, from the TCP perspective,
Starting point is 00:17:26 an admin queue or an IO queue, these are just messaging layers on top of TCP. You can think of NVMe TCP as just another application on top of TCP. What it really is, is all about framing the messaging model on top of the stream. Unlike RDMA, for example, but much like Fibre Channel, the TCP transport binding does need to capture and establish a wire format,
Starting point is 00:18:12 just because the data transfer and the messaging are not given to us like they are in RDMA. So every NVMe over Fabrics capsule or data is encapsulated in what we call a PDU, not a new name; it's a protocol data unit. You can see the structure. It consists of a header, which is optionally protected by a header digest. The header itself is split into an 8-byte common header that is shared by all PDU types, whether it's a data transfer, a ready-to-transfer, or a command or response capsule. And we have a variable-length PDU header that, for example, for a Fabrics command, contains the SQE itself. After that, we have an optional pad field
Starting point is 00:19:15 that's designed to satisfy alignment constraints, mostly to accommodate hardware implementations, hardware offloads. After that comes the data part, if the capsule itself carries any data. And optionally, this data is protected with a data digest. Basically, in its most general form, this is the protocol data unit that every data transfer or capsule transfer is formed with.
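To make the framing concrete, here is a minimal sketch of that 8-byte common header in C. The field names are modeled on the Linux prototype driver's definitions and are an assumption on my part; the exact names in the ratified spec may differ.

```c
#include <linux/types.h>

/*
 * Sketch of the 8-byte common header shared by all NVMe/TCP PDU types,
 * as described above. Names are illustrative, modeled on the Linux driver.
 */
struct example_nvme_tcp_hdr {
	__u8	type;	/* PDU type: ICReq, CapsuleCmd, C2HData, R2T, ... */
	__u8	flags;	/* e.g. whether header/data digests are present */
	__u8	hlen;	/* length of the PDU header (common + PDU-specific part) */
	__u8	pdo;	/* PDU data offset: where data starts, after the optional pad */
	__le32	plen;	/* total PDU length: header + pad + data + digests */
};
```

The variable-length, PDU-specific header (for example, the SQE of a command capsule) immediately follows these 8 bytes.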
Starting point is 00:19:55 For the PDU types, we have nine, if I recall correctly, nine PDU types in the protocol. We have what we call ICReq and ICResp, which are the connection initialization PDUs. Basically, what we do is establish the TCP byte stream, which is the three-way handshake, and following that comes the NVMe TCP initialize-connection message exchange, where the parameters that are relevant for that connection are negotiated. This is before the NVMe over Fabrics association is formed. Once we have the NVMe TCP connection established,
Starting point is 00:20:37 then we move on to the Fabrics Connect. After that, we have an NVMe over Fabrics association established. And then, if it's an admin queue, the initialization process can continue. If it's an IO queue, the controller is ready for IO.
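Put as code, per-queue setup follows that order. The sketch below uses standard kernel socket calls (sock_create_kern, kernel_connect, sock_release); the ICReq/ICResp helper and the function names are hypothetical stand-ins, and the Fabrics Connect itself is issued afterwards by the NVMe core on top of the queue.

```c
#include <linux/net.h>
#include <linux/in.h>
#include <linux/socket.h>
#include <net/net_namespace.h>
#include <net/sock.h>

/* Stub for the ICReq/ICResp exchange described above; the real negotiation
 * of digests, data alignment and maximum R2Ts is omitted here. */
static int example_send_icreq_and_wait_icresp(struct socket *sock)
{
	return 0;
}

/* Hypothetical sketch: TCP connection first, NVMe/TCP connection second,
 * and only then does the NVMe core bind the queue with a Fabrics Connect. */
static int example_tcp_queue_connect(struct sockaddr_storage *traddr,
				     int addrlen, struct socket **sockp)
{
	struct socket *sock;
	int ret;

	ret = sock_create_kern(&init_net, traddr->ss_family, SOCK_STREAM,
			       IPPROTO_TCP, &sock);
	if (ret)
		return ret;

	/* TCP three-way handshake */
	ret = kernel_connect(sock, (struct sockaddr *)traddr, addrlen, 0);
	if (ret)
		goto err;

	/* NVMe/TCP ICReq -> ICResp parameter negotiation */
	ret = example_send_icreq_and_wait_icresp(sock);
	if (ret)
		goto err;

	*sockp = sock;
	return 0;	/* the queue now "floats" until Fabrics Connect binds it */
err:
	sock_release(sock);
	return ret;
}
```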
Starting point is 00:21:10 We also have the H2C and C2H TermReq PDUs. I'll explain what H2C and C2H are: they are directions; that's how we annotate direction in the protocol. H2C stands for host-to-controller, and C2H stands for controller-to-host. So these are connection termination PDUs. They are only used when stuff goes wrong. So if, for example, a protocol error happens, a sequencing error happens, or a digest error happens, we then use the TermReq, which has a pretty well-defined set of rules about what exactly you do to handle the error. We also have the command capsule and response capsule, which are the NVMe over Fabrics capsules.
Starting point is 00:21:55 We have host-to-controller data and controller-to-host data, which are the solicited or unsolicited data PDUs that transfer data. And we have the R2T, which is a ready-to-transfer, just like in iSCSI, for those of us that know it. It basically solicits host-to-controller data transfer. It allows the target to actually control, to be able to back-pressure, a host,
Starting point is 00:22:26 depending on its internal resources, depending on the implementation. So a controller can basically, in NVMe over Fabrics, not just TCP, expose what's called maximum in-capsule data. So every write's data can come either in-capsule or solicited. That's the way we kind of define it in NVMe TCP. So usually you would find controllers that, for small data transfers, will allow the command capsule to carry the data within the capsule, unsolicited. Otherwise, they will solicit the data
Starting point is 00:23:15 with an R2T PDU. Any other questions on the types? Shouldn't be too complicated. In terms of the IO flow, that's, I guess, the classic RPC model; every storage transport that you would find probably works the same way, or at least shares the model.
Starting point is 00:23:44 For a read, we have a command capsule PDU where the host tells the controller, I want to read data from this LBA, for this number of LBAs. It uses a transport-specific SGL, just because the scatter-gather duty is really on the host side, so the controller doesn't have anything to do with it, unlike RDMA, for example. We actually share the transport-specific SGL with something that was driven from Fibre Channel,
Starting point is 00:24:20 which works exactly the same way. Once the controller gets that, and fetches the data from disk or has it already, it will issue a set of controller-to-host data PDUs, which it is free to slice and dice however it feels like; it has all the freedom in the world. The host really has a committed buffer for that data read. And after this sequence, it will issue a response capsule. The transport, TCP, promises in-order delivery, so it can be pipelined.
Starting point is 00:25:06 And that's basically a read. For a write, we have the command capsule, which can carry in-capsule data; but if it doesn't, then the controller would issue a ready-to-transfer PDU with the number of bytes it's willing to accept and the offset into the command data buffer, the virtual buffer that exists in the host.
Starting point is 00:25:34 In response to that, the host would issue a set of host-to-controller data PDUs. And once that sequencing is done, the controller would send a response capsule PDU. I mean, if you think of other examples, for a variety of transports, it should work the same way.
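Tying the PDU types to the two flows just described, a hedged summary in code form might look like the enumeration below. The opcode values follow my reading of the TP and the prototype driver, so verify them against the ratified specification; the comments note where each PDU shows up.

```c
/* The nine NVMe/TCP PDU types and the exchanges they appear in (sketch). */
enum example_nvme_tcp_pdu_type {
	nvme_tcp_icreq		= 0x0,	/* host: connection initialization request */
	nvme_tcp_icresp		= 0x1,	/* controller: connection initialization response */
	nvme_tcp_h2c_term	= 0x2,	/* host: terminate on protocol/sequencing/digest error */
	nvme_tcp_c2h_term	= 0x3,	/* controller: terminate on protocol/sequencing/digest error */
	nvme_tcp_cmd		= 0x4,	/* command capsule: SQE plus optional in-capsule data */
	nvme_tcp_rsp		= 0x5,	/* response capsule: completes both reads and writes */
	nvme_tcp_h2c_data	= 0x6,	/* write: solicited host-to-controller data */
	nvme_tcp_c2h_data	= 0x7,	/* read: controller-to-host data, sliced as the controller likes */
	nvme_tcp_r2t		= 0x9,	/* write: controller solicits data with an offset and length */
};
```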
Starting point is 00:26:11 Is it zero copy? Well, that depends. What was your question? The question was, is the data zero copy? So it depends on which side you are considering and what implementation you are running. So if you have an offload device that can support direct DMA into host memory, that will be a different driver, obviously. So it can do zero-copy receive and zero-copy send.
Starting point is 00:26:39 The Linux networking stack supports zero-copy send, so that's not an issue. So I guess the answer is that it depends. Yeah. So if you have an offload device that does the termination of NVMe over TCP on top of TCP,
Starting point is 00:27:03 which would mean you also have an offload of the entire TCP stack, or at least a partial offload, then the driver would receive a buffer from the host. It will associate it with some tag that we have; in NVMe TCP, that tag is the command ID. We kind of reuse whatever NVMe already has, and it will be associated with the host buffer or scatter buffer. So an offload implementation could directly DMA it, so we won't need the data copy. From the networking perspective,
Starting point is 00:27:39 I'm sorry, from the Linux stack perspective, the receive will be copied; the receive will have a data copy involved. But the transmission will not have a data copy. All right, so NVMe TCP in terms of Linux. In 2017, the Fabrics Linux Driver Task Force took on developing the NVMe TCP driver, as already mentioned. At the time, we had two existing prototypes, one from Lightbits and one from Solarflare. The task force examined both codebases, and we quickly converged on one codebase to start with,
Starting point is 00:28:28 moving forward, so that everyone can be aligned. As the spec evolved and new contributions came from different vendors, coming in at different stages, the code went through quite a bit of drifting, I mean, tweaking. Things have changed. And also, this created sort of a feedback loop back into the spec, to bring new ideas
Starting point is 00:28:56 from the prototype into the specification itself. In terms of the code, it's in solid shape. It's open to all the members of NVMe.org. Once the spec is ratified, it will be submitted to Linux in close proximity. And shortly after that, well, that depends on the community feedback, but I expect that shortly after that,
Starting point is 00:29:21 it will find its way into the kernel. From there, it will find its way into the OS distributions. Some of the driver design guidelines. So when I first sat down to design the stack and the driver itself, a couple of design guidelines became clear as worth doing. One would be a single per-CPU reactor thread, which is actually a private bound workqueue in Linux, that would really be the sole owner of a set of queues, so that we don't have to share state between CPUs. That's something that's like a primary motive
Starting point is 00:30:14 that keeps coming back in NVMe. So in terms of how you write a TCP transport, it was clear that we're going to have to do that. We want to keep context switches to a bare minimum. So in the IO path, we have one context switch, which some might say is one too many, but we still have one to allow us not to have a big critical section
Starting point is 00:30:43 within the Linux socket operations. NVMe queues are spread among these reactor threads, which handle both sends and receives. We never block on I/O, so we never use blocking I/O operations on the network. Basically, every send onto the wire is with MSG_DONTWAIT, and we rely on ourselves to pick up from wherever we left off whenever we get EAGAIN, or a reject from the socket once contention hits. This allows us, first, to multiplex
Starting point is 00:31:30 between connections on the same thread, and also never to block, or never be scheduled out to sleep. Aggressively avoid data copies. So with incoming data, it's not possible today in Linux to not copy data. We get a raw datagram from the network, which has a five-tuple in it.
Starting point is 00:31:55 From that, we resolve the NVMe queue. We look at the command ID. We know what buffers are associated, and from there we copy the data. But for sends, from the TX perspective, we never copy data, not even headers. So I don't think we copy any byte that we don't have to. We aggressively reuse common interfaces. For example, we don't need,
Starting point is 00:32:28 from a TCP application perspective, scatterlists, so we work directly on the bio_vecs. We use the common iov_iter interfaces to do the data placement, or the data copy itself, and we don't have to maintain all that state around us. If you look at implementations throughout the kernel, you'll find that that's often not the case, but Linux strives to place common logic behind APIs and interfaces. We like interfaces a lot, so we use those interfaces as much as we can.
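As a rough illustration of the two points above, the non-blocking sends and the iov_iter-based placement could look like the following sketch. kernel_sendmsg, iov_iter_bvec and skb_copy_datagram_iter are real kernel interfaces, but the surrounding function names are hypothetical, and the iov_iter_bvec signature shown is the post-4.20 one, so treat this as a sketch rather than the driver's actual code.

```c
#include <linux/net.h>
#include <linux/socket.h>
#include <linux/uio.h>
#include <linux/skbuff.h>
#include <linux/bvec.h>

/* Send a PDU header without blocking; on EAGAIN we simply remember where we
 * stopped and let the reactor thread retry later (sketch). */
static int example_send_pdu_hdr(struct socket *sock, void *hdr, size_t len)
{
	struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL };
	struct kvec iov = { .iov_base = hdr, .iov_len = len };
	int ret;

	ret = kernel_sendmsg(sock, &msg, &iov, 1, len);
	if (ret == -EAGAIN)
		return 0;	/* socket is full: resume from here next time */
	return ret;
}

/* Place received data straight into the request's bio_vec through an
 * iov_iter, so no private scatterlist state needs to be maintained. */
static int example_recv_data(struct sk_buff *skb, int offset,
			     struct bio_vec *bvec, unsigned int nr_segs,
			     size_t len)
{
	struct iov_iter iter;

	iov_iter_bvec(&iter, READ, bvec, nr_segs, len);
	return skb_copy_datagram_iter(skb, offset, &iter, len);
}
```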
Starting point is 00:33:18 In terms of receiving, we do it either in softirq or in the same reactor context, which is basically a loop that does send, receive, or both. That's, again, really to minimize overhead. If it's softirq, we don't pay the price of a context switch to the reactor thread, so it's handled directly from softirq. Or if it's in the reactor thread itself, it's in band with the other operations when we have stuff in the pipe. Also, atomic operations, those are a big no-no if you want to achieve high parallelism. So we only have a single atomic operation in the IO path, which is basically to get a new IO into the queue itself.
Starting point is 00:34:04 So the queue is basically an abstraction, a bunch of struct requests that are waiting to be transmitted. They're already prepared, lined up for NVMe TCP, and just need to be delivered. Also, one important aspect that is constantly being tuned is the fairness and budgeting mechanism between NVMe queues, which allows us not to abuse or hog the CPU for a single connection, and to always make sure that they're all treated fairly.
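The budgeting idea can be sketched as a bounded work loop per queue: process sends and receives until either nothing is pending or a small time budget runs out, then requeue so other queues bound to the same CPU get their turn. All names and the 100-microsecond budget below are hypothetical; the actual driver's io_work loop differs in detail.

```c
#include <linux/workqueue.h>
#include <linux/jiffies.h>

/* Hypothetical per-queue context; the real driver's structures differ. */
struct example_queue {
	struct work_struct	io_work;
	struct workqueue_struct	*wq;	/* CPU-bound workqueue, one reactor per core */
};

/* Stubs standing in for the real send/receive processing; they would return
 * true while there is still work pending on this queue. */
static bool example_try_send(struct example_queue *q) { return false; }
static bool example_try_recv(struct example_queue *q) { return false; }

static void example_io_work(struct work_struct *w)
{
	struct example_queue *q = container_of(w, struct example_queue, io_work);
	unsigned long deadline = jiffies + usecs_to_jiffies(100);	/* budget */
	bool pending;

	do {
		/* interleave TX and RX so neither direction starves */
		pending = example_try_send(q);
		pending |= example_try_recv(q);
	} while (pending && time_before(jiffies, deadline));

	/* Budget exhausted but more work remains: requeue ourselves so other
	 * queues bound to this CPU get their turn instead of being starved. */
	if (pending)
		queue_work(q->wq, &q->io_work);
}
```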
Starting point is 00:34:48 or the best I could produce. I'm not sure if it's the most understandable one, but we'll walk through it. This is the block layer. So have user space, file system, and block MQ. We have two components in the host stack. We have the NVMe core and the fabrics library that the core is sharing between PCI
Starting point is 00:35:14 and all the transports. Fabrics is relevant only for the transports that look like NVMe over Fabrics. From user space, we have handles: a misc device into NVMe over Fabrics to connect or issue discovery log pages (well, no, actually just to connect and disconnect). And we have the char device, which is the controller representation, against which we either issue admin operations
Starting point is 00:35:46 or read the various log pages. We have the PCI driver, which has existed for a long time. The PCI stack is not that fat; it just needed to fill up the gap. We have NVMe over RDMA, which sits on the RDMA stack and the HCA driver, and runs on either InfiniBand or Ethernet. We have NVMe TCP, which is our new one. And we have the Fibre Channel driver. And this is the stack. I mean, when I looked at this diagram, I kind of remembered the famous poster of Linux,
Starting point is 00:36:27 so I didn't know if it was simple or not. But this is the stack. The fact is that writing this driver sort of gave me at least some confidence that the subsystem is in good shape, because not a lot of modifications were needed, or existing code affected, by adding this driver. So we have a proper subsystem in Linux
Starting point is 00:36:55 where a lot of the core functionality exists in the core layer. But still, if you open up the code, you will find a lot of similarities between NVMe RDMA and NVMe TCP. Maybe it's because I've been involved in both, but they do, as IP-based transports, share a lot of code.
Starting point is 00:37:18 Also in error recovery, we still have plenty of room for improvement there. And this is the target stack, which, aside from one enhancement, literally didn't need any code modifications to adjust for the TCP transport; it was just a natural fit. We have, again, the NVMe core and the discovery subsystem, where the core basically
Starting point is 00:37:49 plugs into an ops structure, for RDMA, TCP, and Fibre Channel, and we also have the loop driver, which is local access to the NVMe target. And on the back end we either have a block interface or a file interface into what's usually an NVMe PCIe device, but not necessarily. The file system backend: if Chaitanya is here, another shout-out to Chaitanya. He's on a roll today.
Starting point is 00:38:21 That's basically it. The message is that not a lot of code modifications were needed for NVMe over TCP. The stack is in pretty good shape. Another metric is the lines-of-code count. This is for 4.19, rebased on the private GitLab tree with the TCP code included. We see that 35% of the host stack is the core code.
Starting point is 00:38:56 We see that TCP and RDMA are roughly the same size. Fibre Channel's bigger. Fibre Channel's always bigger. I'm not sure why. We also have PCI, which is growing features and capabilities every day. Thank you. I'll expedite the process. And in the target stack, things look surprisingly, or not surprisingly, the same. RDMA and TCP are roughly the same size. Core has 35%, and FC is the biggest again. In terms of data digest, one thing that was sort of a caveat:
Starting point is 00:39:39 because we use common interfaces and iterators in our implementation, when we want to update the data digest on the fly, we sort of don't have any way to do that, because the actual placement and the copy itself are kind of hidden from us behind layers. And you don't want to do that after you've copied all the data, because that would pollute the cache. If you're already copying the data, it's hot in the cache, and that's when you want to calculate the CRC on it, if you care about the digest.
Starting point is 00:40:14 So what we did is basically add what's called an SKB copy-and-hash datagram iterator, which gets a pre-initialized ahash request and pretty much updates it on the fly as we go. This way, basically, we can achieve that. And we also added an interface for other consumers that implement data digests. iSCSI, unfortunately, does everything differently.
Starting point is 00:40:44 So it doesn't use the interface; it wasn't easy to convert it to that interface. So that was good, and it helped sort of minimize the lines of code we had in the driver itself.
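In code, the receive-side copy with the on-the-fly digest then boils down to a call like the sketch below. skb_copy_and_hash_datagram_iter() is the helper just described; the wrapper name and the per-queue ahash setup shown in the comment are assumptions for illustration.

```c
#include <crypto/hash.h>
#include <linux/skbuff.h>
#include <linux/uio.h>

/*
 * Sketch: copy received payload into the destination iov_iter while updating
 * a CRC32C digest in the same pass, so the data is hashed while it is still
 * hot in the cache. The ahash request is assumed to be set up once per queue,
 * roughly like:
 *
 *	struct crypto_ahash *tfm = crypto_alloc_ahash("crc32c", 0, 0);
 *	struct ahash_request *req = ahash_request_alloc(tfm, GFP_KERNEL);
 *	crypto_ahash_init(req);
 *
 * Digest finalization and error handling are omitted.
 */
static int example_recv_data_with_ddgst(struct sk_buff *skb, int offset,
					struct iov_iter *iter, int len,
					struct ahash_request *req)
{
	return skb_copy_and_hash_datagram_iter(skb, offset, iter, len, req);
}
```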
Starting point is 00:41:19 One other thing that I learned, or we learned, is that there are some surprises when you implement a TCP application, or in this case not necessarily TCP specifically. So as I mentioned, headers are never copied. They're preallocated from the memory allocator, from the slab, and are also zero-copied when sent to the network, like any other buffer in NVMe TCP. However, when the queue depth gets high and the network gets congested, the TCP stack can coalesce multiple headers together.
Starting point is 00:41:46 On the other hand, kernel hardening, which is a debug feature in Linux, will always panic the kernel when you try to do a user copy from a memory buffer that crosses a kmalloc object. The documentation says it's a heuristic to catch an attempt to exploit the kernel. So just when we were happy with our implementation,
Starting point is 00:42:13 we also found that DHCP discovery used to sometimes crash, crash the kernel. What we found is that user space programs are allowed to use packet filters and read data quite often. They can do it with BPF JITs or TAP devices, network interface tapping. And as I said, dhclient was the culprit. But that was really an issue that was related to virtualization,
Starting point is 00:42:55 which shouldn't really have happened, but the conclusion we came to is that every user space application can basically use packet filters. So slab objects are not really something that you can send to the network, although we actually shopped around and asked, and no one could really answer why slab pages are not allowed to be sent to the network, because they do use reference counting.
Starting point is 00:43:26 But the solution is to use the page-frag API. That's what network drivers use to allocate SKBs on the Rx path. So NVMe TCP basically converted: instead of using the memory allocator, it uses the page-frag allocator, which is, first of all, not a slab object, so we don't have the hardening issue. But one good benefit is that every queue has its own cache, so basically, when you do a sendpage to the network and a reference is taken on the page, which is a kref, we don't have shared pages among different queues that are associated with different CPUs. So that actually relieved some of the state that was shared between cores. It's completely unrelated to NVMe over TCP itself, just a surprise that we learned.
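The allocation switch can be sketched as below: each queue keeps its own struct page_frag_cache and carves PDU headers out of it with page_frag_alloc(), so the buffers are page memory rather than slab objects, and header pages are never shared across queues. The structure and function names other than the page-frag API itself are illustrative.

```c
#include <linux/gfp.h>
#include <linux/mm_types.h>

/* One header cache per queue, so header pages are never shared between the
 * CPUs that different queues are bound to. */
struct example_queue_hdrs {
	struct page_frag_cache	pf_cache;
};

/* Carve a PDU header out of page-frag memory instead of the slab. The buffer
 * can be handed to the network stack for zero-copy send without tripping
 * hardened usercopy, and is released with page_frag_free() once the
 * transmission completes. */
static void *example_alloc_pdu_hdr(struct example_queue_hdrs *q, size_t hdr_len)
{
	return page_frag_alloc(&q->pf_cache, hdr_len, GFP_KERNEL);
}
```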
Starting point is 00:44:17 In terms of features: so again, the driver should perform, it should go fast and scale very well. It has zero-copy transmission. It supports header and data digests. In terms of CPU affinity
Starting point is 00:44:44 and the assignment of I/O threads, that's also taken care of. NVMe TCP, which I realize now I didn't mention, has a TLS enhancement, which is an optional feature, so it can actually run within TLS. That's future work. We have two approaches: one would be to trampoline into user space, the other to port the TLS handshake into the kernel. Polling mode I/O: as I said, we have an inherent context switch in the I/O path, so currently polling I/O is not supported, but nothing prevents it from being supported. In terms of automatic receive flow steering,
Starting point is 00:45:32 based on the five-tuple of the TCP connection, that's also something that we plan. We just need to figure out the atomicity of adding steering rules to the NIC tables. And also out-of-order data transfer: currently, for a given command, the data delivery is in order; supporting out-of-order delivery is probably Fabrics 1.2 material. I'll skip the TLS section just because of lack of time,
Starting point is 00:45:55 I'll skip the TLS section just because of lack of time so I just wanted to provide some quick performance because I know a lot of people are interested in performance figures. This is a comparison for the TCP compared to RDMA. So obviously that shouldn't come as a big surprise. TCP has a networking stack, unlike RDMA, that's offloaded. But the tax you pay for a 4K random read at QDEPTH1, which is a metric, not the metric,
Starting point is 00:46:35 is roughly around a 50-microsecond penalty. And it's important to understand that TCP is not one-size-fits-all. It solves a problem for specific use cases. Sometimes it makes sense, sometimes it does not. And in terms of the tail latency at queue depth 1, there's not a noticeable difference at all. In terms of the implementation itself, we see very good scaling, a linear scaling of IOPS across threads. Just to emphasize, this represents a multi-threaded application that does sync IO, like a lot of the common databases
Starting point is 00:47:17 that you would think of, which use a file system or raw block devices on top. And we see that from one thread to eight threads or 16 threads, which is over the CPU count, we get linear scaling of performance, and latency does not get affected. So the parallelism of the lockless design with minimal shared state is able to be achieved. Of course, the driver can scale a lot more in terms of IOPS,
Starting point is 00:47:52 but that's something that we feel is a representative workload for common applications. In terms of the host CPU utilization, this graph is the number of cores depending on the block size. And this is how many cores it takes to saturate a 25-gig link on a 2-gigahertz Intel Xeon with a Mellanox ConnectX-4 Lx at a high queue depth.
Starting point is 00:48:31 We see that, obviously, for reads versus writes, reads contain a data copy, so reads will always be higher. And we see that 4K IOs are pretty much the most intensive in terms of the stack itself. As the block size increases, we see the NIC offloads sort of help us, the segmentation offload and the receive offload, where for large block sizes we're at either one or two CPUs or less.
Starting point is 00:49:08 So that's also a metric. But we see that definitely some of the NIC features can help us. And this is the whole stack. For the target stack, it's usually half of that, because this is not only NVMe TCP; it's also the block layer, blk-mq, VFS, and FIO itself. So, Saheem, is that the 512? Yeah, the 512.
Starting point is 00:49:36 Yeah, I couldn't understand it as well. I think it's a measurement error. I know it was, I guess it was, I couldn't understand it. I couldn't figure it out as well. Why is it? But from 64 to, well, we need to consider that that's a long distance.
Starting point is 00:49:55 I just got lazy and didn't run all the block sizes. But for 512 bytes, the data copy is not the determining factor, because the TCP stack sees large buffers and the stack does not
Starting point is 00:50:13 have to process a lot of SKBs. Thank you very much for listening. Thanks for listening. Here you can ask questions and discuss this topic further with your peers in the Storage Developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
