Storage Developer Conference - #32: Performance Implications of Libiscsi RDMA Support

Episode Date: February 14, 2017

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcast. You are listening to SDC Podcast Episode 32. Today we hear from Roy Shterman, Software Engineer with Mellanox, as he presents Performance Implications of Libiscsi RDMA Support from the 2016 Storage Developer Conference. I'm Roy Shterman, storage engineer at Mellanox. Here with me is Sagi Grimberg.
Starting point is 00:00:57 This project has been done in cooperation with my university, I just graduated, so with Ben Gurion University, worth mentioning. So what I'm going to talk about in this project are common deployments of an iSCSI initiator from a virtual machine. Actually, I'm going to introduce libiscsi, which is a user space iSCSI library which is available in QEMU and other applications. I'm going to talk about iSER, which is an extension of iSCSI over RDMA, which we implemented in this project inside libiscsi. We're going to show some performance results of our solution compared to different iSCSI solutions, iSCSI deployments from a
Starting point is 00:01:53 virtual machine, especially comparing user space implementations versus kernel, and the iSER transport versus TCP. We're also going to talk about some challenges we had in user space when trying to... My microphone is off. So I think I can talk without it. Yeah, I think maybe it's good. So, as I said, I'm going to talk about some challenges we encountered with registering, or actually mapping, memory: user space has its own problems there. So actually, what are the common deployments of the iSCSI initiator from a virtual machine? We have the common deployment, which is using the kernel modules of iSCSI, the transport over TCP/IP. You can actually connect from the hypervisor
Starting point is 00:03:50 into an iSCSI target to get the iSCSI device in the iProvisor and pass it to the virtual machine using a newSCSI library. LibreSCSI has integration inside QNU. When using also VR9 or VRK and VR9 modules, you can get the iSCAT device inside the virtual machine. And we'll talk later about the benefits of it. And we have another solution that we, out of this project scope, which is a single-root I-O virtualization, which has scalability issues and we didn't compare our solution to it. So the only missing part of this puzzle is the answer implementation inside the interspace, which we don't have.
Starting point is 00:04:58 Caused all kinds of disruptions. I think we're not sure if it's dead or not. 1, 2, 1-2. All right, can you stick around? OK. This is live as well. Hold it for a second. Sorry. It's not our fault.
Starting point is 00:05:53 So as I was saying, the missing part of this table was the iSER implementation inside user space, which we actually chose to implement inside libiscsi. What is libiscsi? Libiscsi is an iSCSI initiator user space implementation. [inaudible] ... iSER is defined as an extension of iSCSI; iSER is actually using a different transport than the TCP one. iSER is using RDMA for the data movement. And the important thing about iSER is that it is transparent to the user, except for a flag on the session; the iSCSI layer is still the iSCSI layer in iSER. [inaudible]
Starting point is 00:10:12 For RoCE, which is RDMA over Converged Ethernet, we have a third option. So what are the iSER advantages? Actually, iSER is using all the tools I talked about today: it's zero copy, it has reliability, and it eliminates the TCP/IP processing, which matters a lot for CPU efficiency.
Starting point is 00:10:49 [inaudible] I'll show you in a few slides the iSCSI PDU encapsulation inside iSER. The second flow is the write command flow, which is pretty much the same as the read command flow, except that in an iSER write command you can put inline data inside the command; you can send immediate data, which can give lower latency than doing more round trips. So now I will go back to the table we saw before. We actually had motivation to implement iSER in libiscsi, because a big advantage the RDMA stack has over TCP is that
Starting point is 00:12:43 it can access directly to the hardware from user space. Okay? It doesn't have the sockets API which is handled by the kernel. It has direct access to the hardware resources, to the RDMA context. So what we actually tried to do was to get the Libby SCSI, which supported only by TCP transport, and to create a transport obstruction so we could integrate our ISER implementation inside of it without taking any effects of the existing Libby Libby's cause the library and all the application that are using it yeah also also we try to do it as transparent as possible to the users so it will not have to change we will see it in few slides so we created the user space networking API which is bypassing the kernel as I said we separated the user space networking API, which is bypassing the kernel, as I said.
Starting point is 00:13:47 We separated the data and control planes in iSER for optimization. We are planning to add nice and easy integration in QEMU; I will talk about it later, it's already in progress. And we tried and got high performance for our implementation. In this slide you can see the modifications we did inside libiscsi to add the transport support, the transport abstraction for iSER or TCP handling, and this is actually the transport API, which includes... okay, yeah.
Starting point is 00:14:33 okay so so actually TCP and ISER has different API's that the control and data plane has the similar functions, okay? It's the iSCSI flow, but somewhere along the road, it should be separated into TCP implementation or into ICER implementation because they have different semantics. And I have a slide about it. Here you can see we try to centralize the transport-specific code. We try to create a transport layer that will have the API inside of it. So we could plug in ICER as easily as possible.
Starting point is 00:15:32 Also, we are planning to add iSER support inside libiscsi in QEMU; libiscsi already has support in QEMU as of today, and also in libvirt, which is a virtualization library that can be used to manage virtual machines. We moved the polling logic, which in the case of TCP is the socket polling logic and in the case of iSER is the RDMA completion queue polling logic, into the transport layer, because there is no big difference: it's pretty much the same to poll a TCP socket or an RDMA completion queue. You get basically the same semantics.
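To make that concrete, a sketch of the two polling paths (function shapes are assumed; the verbs call is the real libibverbs API):

```c
#include <poll.h>
#include <infiniband/verbs.h>

/* TCP: wait for readable data on the connection's socket. */
static int tcp_poll_once(int fd, int timeout_ms)
{
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        return poll(&pfd, 1, timeout_ms);
}

/* iSER: reap one work completion from the RDMA completion queue. */
static int iser_poll_once(struct ibv_cq *cq)
{
        struct ibv_wc wc;
        return ibv_poll_cq(cq, 1, &wc);  /* >0 means a completion arrived */
}
```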
Starting point is 00:16:20 We also moved the IO vectors into the transport layer because in the deprecated, it's not yet deprecated, but in the old API, you first added the packet data unit of the iSCSI layer and only then you added the IO vectors inside of it. ICER can't work like this because it needs to have the IO vectors the minute you want to send the message compared to TCP. So merging process is in progress, should be available in the next few weeks,
Starting point is 00:16:57 maybe a bit more. Excuse me. Yeah. I was looking at the slides. So you mentioned about polling logic, right? So is it like the application has the possibility to choose whether it wants to choose interrupt driven or polling logic? Or is it the driver?
Starting point is 00:17:16 No, actually the polling logic is coming from the QEMU asynchronous IO block driver, so it's independent of iSER or TCP. I just mentioned that we moved all the polling logic into our transport layer inside libiscsi to avoid code duplication. And so basically all the IO threads are originated in QEMU. So basically each IO has... QEMU has a set of IO threads.
Starting point is 00:17:52 For each IO, one IO thread is assigned to poll for it. So it's not interrupt driven, never; it's never interrupt driven at the moment. I know that there are patches to integrate QEMU events to also work in an interrupt mode, but it's not available. So it's not directly connected to either the TCP or iSER implementation; it's just the way QEMU works. So here you can see the libiscsi configuration inside the libvirt XML file. Each virtual machine needs to have its own XML file. You can see here the additional attributes for iSER.
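For illustration, a network-disk stanza along these lines; the standard libvirt iSCSI form is shown, and the extra iSER transport attribute is an assumption of how such a flag could surface (check the libvirt documentation for the final syntax):

```xml
<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <!-- transport='iser' is illustrative of the additional attribute -->
  <source protocol='iscsi' name='iqn.2016-09.com.example:storage/1'>
    <host name='192.168.1.100' port='3260' transport='iser'/>
  </source>
  <target dev='sda' bus='scsi'/>
</disk>
```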
Starting point is 00:18:39 the additional attributes of ICER. All the other configurations doesn't involve any of the transport layer, okay? It can be changed, but the changes are going to be like very small, okay? Nothing to add to the user configuration anyhow. So after we talked about the implementation and the motivation, let's go on to the experiments and results we've done. We actually performed our measurements over a Mellanox ConnectX4 on both target and initiator. We use the common TGT user space iSCSI target, okay, which is very commonly used and very convenient to use. We use RAM devices as our backing store and the target devices so we can check only the network protocol and not the problems that are in the target side. Our benchmarking tool was a flexible I.O.,
Starting point is 00:19:46 A comparison has been done versus the three other options I mentioned before: the libiscsi TCP implementation, and both kernel implementations, that is, the iSER and TCP implementations using the kernel modules and not the libiscsi library. In this slide you can see that for both TCP cases the bottleneck is probably not the NIC, okay? It's probably somewhere in the network stacks. You can see here the iSER-attached device. The iSER-attached device is actually an iSER connection that we've done.
Starting point is 00:20:46 We've connected through the hypervisor and passed it to the virtual machine, as I mentioned before, as one of the common deployments. The VM is using virtio to talk to the hypervisor, which then uses iSER. Yeah, the connection is: the VM is using virtio-blk; it doesn't even know that this is an iSER device. It gets the IO into the hypervisor, and then the hypervisor is using the iSER modules in the kernel to execute the IO. As you can see here, this is the... okay, see, there is a typo.
Starting point is 00:21:20 This is the ISER LibreSCSI, okay? This is not the ISER attached device. Yeah, I don't know why it says, you can see by the colors. This is the ISER Libre SCSI okay this is not the Icer attached device yeah I don't know why it says you can see by the colors this is the Icer Libre SCSI which gets great performance from virtual machine 180k IOPS for a single core yeah you can you can probably understand that the core was fully utilized okay so we'll see in a minute that you can aggregate more IOPS by using more cores or more VMs or whatever you want to use. What was using CPU? What?
Starting point is 00:21:57 You said you're saturating one core. What was using the CPU? You said this was an offload technology, right? The CPU wasn't going to be involved with the transfer? No, no. The offload technology is the RDMA offload, which is inside the hypervisor. But, okay,
Starting point is 00:22:14 the FIO application that runs in the virtual machine was... There's still an IO stack. There's the QMU IO stack. And the virt IO stack. You's the QMU I-O stack. Yeah, and the Virta I-O stack. You need to post the work request. You need to track it.
Starting point is 00:22:28 You need to pull the completion queue. No, actually the... The I-O queue here on your chart was on the same adapter, the same machine. Same setup as ICER, exactly. All the difference is software. So I have a question here. So in between the time 6 to the time 9, ICER, exactly. Actually... The difference is software. The... So I have a question here.
Starting point is 00:22:49 So in between the time six to the time nine, the difference is calling the other one? Yeah. Yeah. It's the VIRTIO overhead. The VIRTIO from the guest into the host. That's the overhead. Actually... And also just another question.
Starting point is 00:23:06 So the virtio on the... on the 6x configuration, is the iSER there implemented in hypervisor user space or hypervisor...? Hypervisor kernel, because you don't have an implementation of iSER in user space that is commonly used by... I don't know.
Starting point is 00:23:30 For example, from the Levi-Scassi in the hypervisor... Yeah, but it doesn't make sense to go into the hypervisor and get it to go back to user space, you know, for... It doesn't make... No, no. Are you suggesting... We are using using Li by scasi inside the hypervisor okay when when you are executing the IO in the VM you're getting into QEMU okay and then QEMU actually can split it into two ways
Starting point is 00:23:57 weekly by scasi it can operate the IO without accessing the kernel without extra context which if you are using the ICER kernel model so you need go you אי-או, בלי היכרון, בלי חדש קונטקסט, אם אתם עושים את המודל של אי-סויינג, אז צריך לנסות לנסות לגביר את האי-או בלי המודל של אי-סויינג. doing in the user space of the hypervisor. And then you will see a clear value of not going to the hypervisor. But that's great. And one more comment to Chuck. There is another aspect of consuming CPU, but we'll get to that once we talk about memory registration in user space, which is very challenging.
Starting point is 00:24:40 A follow-up question: what was the bottleneck? I'm not sure where the bottleneck was, but I'm guessing it's somewhere in the network stacks. We could have probably tuned the TCP stack a bit more, I guess. But we wouldn't have seen the same performance as RDMA, just because we have all the overhead of TCP, processing the transport.
Starting point is 00:25:19 When we were speaking about TCP overhead, it's not like the adapter is being stuck. No, no. It's a software. That's a good question. I think the right answer is that, yes, the CPU. Yeah, for this benchmark, Roy said it's not. It wasn't.
Starting point is 00:25:37 But I guess it was the TCP stack wasn't perfectly tuned. Yeah, but we got like 80, yeah. Probably, and the nozzle, or other things that we couldn't tune. Yes? So can you guys tell us the difference between the 6s and 9x? What is the difference between them?
Starting point is 00:25:58 So what is the difference? Like the 9x is the Libby SCSI? Yeah, what are the parameters that you use with 9x versus 6x? The same parameters exactly. Just the configuration is different. The 9x is the Libby SCSI user space implementation and the 6x is the kernel ICER implementation. Basically, there is a vert IO that the guest is talking to the hypervisor and that's the difference. In the LibreStatic case, there is a Verdeo protocol but it's very light because it's right into the same actually go to ICER as well, through ICER, but for PCI
Starting point is 00:26:46 pass-through of block devices, you have the Verdeo protocol, which goes down to the hypervisor vhost, and then it's dispatching the command into the kernel ICER implementation. The first thing is, it actually should be ICER. Yeah, yeah. There is a typo here in the 181K. It's LibIscasi ISA. You can see.
Starting point is 00:27:13 So in the LibIscasi, both TCP, did you run it with the TCP in user space driver or TCP in user space? The LibreSCSI TCP is already in user space because LibreSCSI was written in user space. So the full stack in user space above TPK, above what? Above SOC. It's not completely in user space because once you write into the socket, you need to copy the data. Within the kernel. That's what I'm saying, right? There are formal drivers in user space and then you can eliminate those of the context which those that you know.
Starting point is 00:27:51 Well, if you have a DPDK-based driver that uses polling mode and has a user space TCP IP stack. Yeah. Well, you can get this probably some of the benefits although not everything because well for direct buffer assignments it's hard for initiator mode because the NIC is getting raw data graphs from the wire it doesn't know like an RDMA where what's the R key there is no R key in raw internet data graphs so it's not a it's not an easy task, basically. So we assume that the content is stored again on the device as the operating system?
Starting point is 00:28:29 Yeah, when using the Sockets API. Yeah, exactly. So next slide can show the bandwidth versus different block sizes. We use a block size from 1K till 128K. Iodept in this case was 64K to get a good bandwidth. As you can see also here TCP from both implementations is pretty much the same. you can see a little advantage to Libe-SCSI TCP. You can see Icer-attached device is eight times better than TCP's implementation and you
Starting point is 00:29:13 can also here witness the Icer-Libe-SCSI which gets five and a half gigabytes per second which is almost saturating the wire. OK? What's your bandwidth to FDR? It was 56. FDR. Yeah, FDR. 56. So you mentioned that it's a single isopetri connection. It's a single core to a single isopetri connection
Starting point is 00:29:43 to a single logical unit on the target side? Yeah, it was mapped one-to-one from the initiator side to target. Here you can see the latency checks we did also comparing the three other solutions. Also here you can see the ISO LiBase CASI is getting an average of 60.43 microseconds per IO. We used 1K block size. We could use a lower block size, but we also checked it and it was pretty much the same. And you can see when we took a higher Iodept, we got a linearization of the latency. But the slope is much better in ISER LibID SCSI than in ISER attached device.
Starting point is 00:30:44 Also here, the TCP has the disadvantage of the processing, like I talked before. Here you can see, again, latency versus block size, which you can also witness the benefits of both ICER, but you can see small advantage to LibreSCSI ICER over the common use. I just want to talk about this one. This is a we try to get an option of a scalability issue from different virtual machines to create multiple connections on Libby SCSkazi ICER and the other solutions to see how it
Starting point is 00:31:26 aggregates over the same host. And as we can see, a bigger slope is better scalability. We can see the ICER-Libby-Skazi slope is getting good results compared to the different ICER where TCP is somewhere, dashing somewhere near them. This is the IOPs of the same graph and now I'll give Sege to explain about the RDMA memory registration from user space issue. Okay, so basically Roy enumerated all the benefits he gets from RDMA, which is zero copy, which is everything is great. But it doesn't come necessarily for free in terms of really the application aspect and how you program your application. Basically it means that now the device or the ARNIC needs to know every address of every buffer that you're going to do networking from now on. And this buffer also needs to be pinned into memory. It cannot be swapped out once it's exposed to RDMA.
Starting point is 00:32:39 It cannot be migrated to another NEMU socket. Basically it's pinned wherever it is, it will stay there. So it's rather limiting, but it's still workable. So basically, pinning and also providing the RNICs all the mappings of your networking buffers is called memory registration in RDMA. And because it's a slow operation, okay, in firm user space, basically registering memory, it's a slow operation. You can use it in a data path if you want to get good results. You can use it in the data path. And also, if you want to do it correctly, you pre-register from all your buffers. If you want a remote peer to access your buffers, you need to pass it with remote access permissions. Is that actually like pass registration?
Starting point is 00:33:41 Yeah, okay, so if you attended the last talk I gave as part of the NGMIO 5 Revit, I referred to fast memory registration. That's available in privileged user modes or in the kernel modes that is aware of the physical mappings. So for user space application, you're not aware of the physical translation of your buffers. So you can't, you don't have the option to do fast memory registration. Okay? Do you recommend using FRMR? Yeah, we can't use... We don't have FRMR in user space. We can't... There were attempts in the past to implement FRMR or expose FRMR interface from the kernel to user space.
Starting point is 00:34:25 But they never really succeeded. But basically what we have is what we're stuck with. So applications usually need to preregister all their networking buffers up front and then use those buffers to do RDMA transfer or data transfers. The problem is with midlayers. So midlayers like OpenMPI and SMEM in the HPC use cases or our Levi-Skadi ICER implementation is that we don't own the buffers. The GES-VM basically does an I operation to its own buffer
Starting point is 00:35:01 and then we as a midlayer see that buffer and we don't know, basically it's not registered and we need now to perform registration. So basically for each data transfer to do a memory registration which is operation we wouldn't have been able to get high IIs. So what do meat layers usually do? Or what are the options for them to solve? One of them is preregistration of the entire application memory space. Basically pin all the guest memory space into RAM.
Starting point is 00:35:41 And then basically every buffer that comes from the guest VM is registered and we know that and they already get access to it. But it's pretty limited because it's, you know, all the 1GB or 2GB that the VM is getting is pinned into RAM space and it's not very scalable when you want to stack VMs on the same IP provided. Another would be to implement a pool of buffers and then modify the application to, instead of using the POSIX allocation API, to use the transport or the mid layer to also supply the buffers, much like DPDK. That's the way DPDK works, but it requires modifying
Starting point is 00:36:26 the applications, and we wanted to move away from that. Another method that exists in the HPC use cases is a pin-down cache. Basically, it's a registration and caching of registrations on the fly. Basically, you get a buffer. If it's not registered, you register it. Hopefully, if you get spatial and temporal locality,
Starting point is 00:36:52 you'll find the same buffers over and over again. So in the steady state, you should be able to get enough remappings. But there are a lot of complications when you get partial overlaps. Then you need to decide what to do, whether to tear down the last registration and perform a bigger new one or just register the soft part that is not already registered so it's not the
Starting point is 00:37:12 most friendly logic but it's doable yes Iran so there is also a place in between right so you say when you use this caching scheme, so this pin-down cache. So you can also say that... We don't use it, but some use it. Yeah, when you implement that. When you implement the pin-down cache, then you can say, when I'm missing, when I have a miss, I don't have to register because registration cost a lot, I can copy.
Starting point is 00:37:42 Yeah. So copy on a. OK, so that's come closer than what we did. So basically you can say, OK, the buffer is small enough. I'm willing to pay the overhead of data copy. But if the buffer is large, then I wouldn't want to copy like 64k. So I'll register it.
Starting point is 00:38:02 So that's another approach we're writing here. But the really interesting one is Pageable RDMA, which is called on-demand paging. I think two years ago we added this capability to the stack to support basically paging from I.O. devices. So what basically it allows, it allows applications to trust with the hardware and the kernel drivers to handle page faults and just work seamlessly as if all my memory was registered while it's not very much. So basically if the device needs to support a secondary translation table, which should resemble the OS translation table, page table.
Starting point is 00:38:51 And we use basically the device once it tries to access a page which is not present, which is not mapped with the device. It can raise the page fold. The kernel driver gets that page fold, handles the page fold, basically restores the page mapping, and then tells the hardware to continue. And again, from the other side, the OS, the operating system, can invalidate a page if it wants to swap it out or migrate it to a different NUMA socket. So we hooked into the MMU notifier callouts
Starting point is 00:39:28 that are available in the MS subsystem. So basically, we get an MMU notification that this page is about to be invalidated. So we sync it with the hardware. And then once the hardware attempts to access this page, which is invalidated, it will raise a page fault. And then our driver, or the kernel driver, or kernel
Starting point is 00:39:45 driver, can handle it and restore it back in, restore the mapping, and tell it to continue. So this concept is. What happens to the data that's incoming on the wire while all these pages fall? OK, so, OK, I didn't want to get in that deep. But OK, so it depends on your transport connection. So if it's a, what's it called, unreliable data
Starting point is 00:40:09 grant, the incoming packet that cannot be serviced is just dropped. Or it's implementation specific. The device can hold some emergency internal buffer that can store the data grant and let the kernel take care of the page faults and then DMA to memory or you can drop. If it's a reliable connection it just sends back to the peer a receive not ready non-acknowledge and then the peer should retry.
Starting point is 00:40:42 In the case of iWARP the page isry. Yeah exactly. So the upper layer protocol has to deal with the R&R doesn't it? No, no. Well it's implemented in hardware. The only thing that the upper layer should know is to program the QP to give some number of retries for R&R. What do you mean? You're laughing. Well, I know that NMS over in side is using paging technology, then you would need to be ready to accept or get that receive not ready R&R nox. Yeah, you willLT has to be aware of this. What is the purpose of setting it to zero? That means that there is no latency.
Starting point is 00:41:52 It's dropped and we drop the connection. We figure out that we did something wrong. It's a bug. Yeah, but so is it for the? No, as soon as that happens, you know that something is not right in your client or server. It's for software, you know, it's not something that happens in normal practice. Yes, so let's assume I would call R and R not number two is something that could be handled by
Starting point is 00:42:30 hardware because it is being, handling on demand trading. Would you accept that? Would you think that then the ULP is not, doesn't have to be aware of that? The problem is there are a lot of ULP implementations out there that already set it to zero. Well, I think that setting it to zero is, I don't know if it's correct, because you don't know if the other side is using shared receive queue, for example. So if this server is serving like 100 clients
Starting point is 00:43:00 and it uses shared receive queue to scale in order to scale, you might get receive not ready if that server is packed and if it's not posting receive work request in time. So it's not necessarily something that went wrong, but you can say that in my application or my implementation it's not acceptable for me. I would like to tear down the connection and then restore it. Sure enough. Yes, it could be also a decision to make also on the on demand paging problem, right? You could have a decision when I get an on demand paging. And now I want to- You don't know from the pure side,
Starting point is 00:43:38 you don't know that the other side is- Yeah, but you could do what you could if you don't the doctor, right? It's like, it's a decision that you make on the domestic latency that you want to achieve, right? There's a distinction between some spit data transfer getting page fault versus some spit data like shared CQ scenario, using the other way around, right? Well, yeah, there's different supports. And for reliable connection, which I think all our storage ULP drivers use, reliable connection, it will always result in. Sorry. So ISO with ODP. So OK,, so now we established that the application can register huge memory regions which are very lightweight, nothing is allocated, okay, and none of the pages are actually pinned. I will start to flowing in the kernel will handle page fault so you get a ramp up and then you know if
Starting point is 00:44:45 In the steady state you don't see a lot of page fault. So the application or The guest can basically register the entire memory space as a single Memory key but for storage you don't want to expose all... You can't really send this memory key to a remote target to do RDA because it's exposing the entire memory space of the guest. Okay, so what we need to do is basically use memory windows. Okay, so this is a technology that allows... If the application has a very big buffer that has local permissions, it can open a memory window
Starting point is 00:45:27 within this buffer and upgrade the permissions for remote access and then send the R key back to the target to do RDMA. What is on-demand paging? So it's basically the pageability of RDMA. The capability of the device to generate page faults which are handled by a kernel and in foundation. So basically, iSERC cannot send this, the entire memory space R key to a target to do RDMA because it's somewhat of a security violation. It needs to basically for each IO get this offset and buffer of this specific transaction,
Starting point is 00:46:21 open a memory window on it, which is a fast fast operation we can do it in the data plane and then send its R key to the initiator and then the to the target and then the target will already made only has permissions to access this buffer alone okay so this is the plan and this is what we tend to do. Currently we don't have support for pageable memory windows. So basically once the driver sees a page fault, it needs to know it's a memory window and recover the page translations from there and not a native memory key. So this support is still ongoing. But we have initial results, and it looks promising. And this is the way that we think that it should work. And this is the way we think it should be implemented.
Starting point is 00:47:20 Hopefully other devices will embrace this technology. And actually, we that, we actually think that also for NVMe technology, we also see the benefits of using page faults and not pinning the entire guest memory space for the VM to access the NVMe device. So we have a few thoughts on this area. Okay, summary and conclusion. So basically what we did is that with the alternative ISO implementation, we have the kernel, we have now one in user space. We integrated the driver into QMU which is still in the works and is driven by Roy and Ronnie and Palo of these folks. We were
Starting point is 00:48:13 able to get pretty good performance scaling compared to the native TCP stack and Ver.io pass through ICER devices from the kernel. But again, the kernel implementation has disadvantage of memory footprint because again, Pageable RDMA is still not working with memory windows. So we can't use that at the moment. So basically what we do is we do a copy for small transfers or register large transfers. What can be done next? First of all, adding the paging support
Starting point is 00:48:59 for memory windows so iSERV can use it efficiently and push it into the open source device chassis. We have plenty of room for performance optimizations in the idle path. We also have stability improvements, especially error-prone paths that are not fully resilient right now and we would like we would love to have some specific ICER unit tests and device that leads up so these are areas that if you're interested you're welcome to contribute so this project was conducted under the supervision and guidance of Dr. Shlomo Greenberg from Ben Gurion University. And I want to give special thanks to Roni Isalberg, which helped me a lot during the way to merge it inside the LibreScasi open source. And also helped me with QMU integration of
Starting point is 00:50:05 ISER. Thank you. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers dash subscribe at
Starting point is 00:50:23 sneha.org. Here you can ask questions and discuss this topic further with your peers in the developer community. For additional information about the Storage Developer Conference, visit storagedeveloper.org.
