Storage Developer Conference - #102: Achieving 10-Million IOPS from a single VM on Windows Hyper-V
Episode Date: July 15, 2019...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, episode 102.
Okay, let's get started. Good afternoon. Today we are going to present Achieving 10-Million IOPS from a Single VM on Windows Hyper-V. I'm Liang, from the Microsoft Azure Performance team. And this is Danyu, my buddy, from the Microsoft Windows Server Performance team. We have been working together on Hyper-V VM storage performance for probably more than 10 years, actually, since Hyper-V was first created at Microsoft.
So for today's agenda, I will first cover some motivation, the performance challenges, and the issues we solved. We are also going to review the Hyper-V storage path and the optimizations we have done. And Danyu is going to walk you through the performance configurations, settings, and performance data, and the new NVMe polling model we have been working on for Windows.
So if you attended this morning's session, Amber from Intel, actually, she described a very interesting story. She said NVMe has developed so fast, hardware has developed so fast, and performance has become so incredible, but because the software stack remains almost constant, the end user cannot really see the performance benefit brought by the hardware improvement. So as a software guy, I'm very sorry to hear that.
And actually, that's the exact reason
Danyu and I are working together
to improve our software stack
so we can bring the best performance experience
to our customers.
So you may ask, why 10 million IOPS? Why do we need 10 million IOPS? Obviously, we have seen strong customer demand for these high-transaction workloads. Artificial intelligence, big data, machine learning, databases, all this stuff, they want high throughput, and all of it is highly transaction-based.
Also, the cloud market is growing. You see all these major cloud vendors; they have seen their revenue double or even triple year over year. And it's also a very competitive market, so every cloud vendor wants to bring the best-performance VM SKUs to their customers. Obviously we also want every VM SKU release to have high performance, in terms of high throughput and low latencies.
In the meantime, hardware technologies during the past 10 years have advanced very rapidly. That includes much faster storage today and much more advanced processors compared to, like, 10 years ago. In today's market, most of these cloud vendors have what are called storage-optimized VM SKUs. Amazon, Azure, Oracle, all of them. And all of these VM SKUs can handle IOPS at million-IOPS levels.
So let's first talk about the storage technology advancement which enables this high IOPS from the storage perspective. We have attended several sessions that talked about the shift of the data center SSD storage interface from traditional SAS and SATA to PCIe NVMe. Actually, what's interesting is that the Seagate people also discussed their thinking, or planning, to move the HDD interface from traditional SAS and SATA to the NVMe interface as well. So that's very interesting.
And if you look at the table below, here we compiled the theoretical protocol bandwidth and the actual IOPS for SATA SSDs, SAS SSDs, and PCIe NVMe SSDs. SATA SSD, we know, is capped at about 600 megabytes per second, and in reality the fastest SATA SSD today delivers around 100K IOPS per device, even for enterprise SKUs. For 12 Gb SAS, the theoretical bandwidth is about 1,200 megabytes per second, and in reality the most top-tier enterprise SAS SSDs deliver around 250K IOPS, no more than that. But the PCIe NVMe SSD we are using is Gen 3 x4, which is very close to four gigabytes per second and can deliver close to one million IOPS for PCIe x4. So you see that's a huge jump compared to the traditional SATA and SAS SSDs in terms of performance.
So let's first talk about this NVMe. NVMe, like Amber mentioned, is really designed from the ground up, so it can capitalize on high IOPS and low latency based on its internal built-in parallelism. What does this internal built-in parallelism really mean? If you look at this graph, traditionally, like SAS and SATA, they only have a single command queue. But for NVMe, the situation is quite different: it supports a large number of I/O queues. Very commonly we see 32, 64, 128, or even 256 in today's market. These multiple I/O queues enable the device to scale its I/O initiation and completion across a number of CPUs for high throughput. As perf guys, we really know, actually, if you look at the I/O path, most performance bottlenecks come from either the I/O completion or the I/O initiation path. Those are the two most important bottlenecks for I/O.
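To make that per-CPU scaling concrete, here is a minimal C sketch of the idea, using purely illustrative types and names (this is not driver code from Windows or from any NVMe vendor): each CPU owns its own submission/completion queue pair, so initiation and completion never serialize on a single shared queue.

    #include <stdint.h>

    #define MAX_QUEUE_PAIRS 128          /* devices commonly offer 32-256 I/O queues */

    typedef struct {
        uint32_t sq_tail;                /* submission queue tail (doorbell index)   */
        uint32_t cq_head;                /* completion queue head (doorbell index)   */
        /* ... queue memory, MSI-X vector targeting the owning CPU, etc. ...         */
    } nvme_queue_pair;

    static nvme_queue_pair queue_pairs[MAX_QUEUE_PAIRS];

    /* Pick the queue pair owned by the issuing CPU: no cross-CPU locking on the
     * submit side, and the completion (interrupt or poll) lands back on the same
     * CPU, which is exactly how the initiation and completion paths scale out.   */
    static nvme_queue_pair *nvme_pick_queue(uint32_t cpu_id)
    {
        return &queue_pairs[cpu_id % MAX_QUEUE_PAIRS];
    }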
In the previous slide,
you said that you have one million IOPS.
Yes.
Question is, are these reads or writes?
Reads.
And how big are the I/Os you are talking about? 4K?
4K.
Yeah, all of this, obviously, is measured with 4K.
Yeah, so obviously we are talking about 4K random reads. The write performance on this kind of NAND flash is not as good as reads.
So let's spend some time to review the Windows Hyper-V storage path. We know that today Microsoft Azure is powered by Hyper-V, so all these VMs actually run on top of Hyper-V. We have, like, three modes. This is very similar even in the Linux world; KVM is very similar. We have the emulation path, which engages the hypervisor constantly, so it's pretty slow. We also have the most frequently used path, the paravirtualized, or what we call synthetic, path, where we use an enlightened driver to bypass the emulation and minimize hypervisor involvement. And today we also have direct hardware assignment. That means we can assign a PCIe device directly to a VM, for best-performance reasons.
And from the VM storage backend mode perspective, we also have file-based, that is, for Hyper-V we frequently use a virtual hard disk image, either VHD or VHDX. File-based means you have to sit on the host file system, so the host file system adds all of its overhead. We also have physical-disk-based modes; that means we can bypass the host file system. That includes SCSI pass-through, which means the physical disk is presented to the VM as a SCSI disk. And then we also have PCIe pass-through. We have this new feature in Windows Server 2016 to do PCIe direct assignment, and that has better performance compared with the file-based mode.
We want to review the storage performance work we have done during the past, like, 10 years. First is our journey to million-IOPS VMs. You know, it's very interesting: like six years ago, we came here, probably to the same conference room, and announced Hyper-V had become the first hypervisor to achieve more than one million IOPS from a single VM. That was based on Windows Server 2012. A year later, in October 2013, Microsoft announced we had achieved more than two million IOPS from a single VM, and we showed a public demo at TechEd Europe 2013. All these one-million-IOPS and two-million-IOPS demos used similar configurations. They were based on 64 SCSI pass-through disks, which I just mentioned, so that means they used the traditional paravirtualized path. And every pass-through disk was backed by SATA SSDs.
So we just talked about how storage has advanced very quickly. Now a single PCIe Gen 3 x4 device can deliver close to one million IOPS. And today, if you look at all these cloud vendors, for the best-performance VM SKUs they still heavily use local storage. So we said, oh, we want to go beyond one million, two million, three million IOPS. That is very natural work for us. But the experiments we have done actually showed the throughput of a single VM is going to be capped at around three million IOPS if we continue using the traditional paravirtualized path. And I'm going to explain why next.
So if you look at this graph, this is the Hyper-V paravirtualized storage path. On the left side, that's the VM side, you'll see we have an enlightened driver; it's called storvsc. Actually, we also have a Linux solution, and that is already part of Linux. Then you cross the boundary over the VMBus driver and land on the host. We have the storvsp driver, then we go through some parsers, then we send the I/O to the physical disk. So look at this graph: every storage I/O goes through the I/O initiation and completion path twice. Once in the VM, and then we have to do the same again on the host. Also, you will notice that the contention between the root virtual processors and the VM virtual processors is going to present a very big performance bottleneck, because you have the host stack and the VM stack, and they contend for CPU cycles.
We have some existing mitigations. We have Hyper-V minroot, we have CPU groups. What these features do is isolate the root VPs from the VM virtual processors so they will not contend with each other. Like I said, we found this Hyper-V VM is going to cap at three million IOPS based on the traditional paravirtualized path. So where does this overhead come from? We did some experiments.
So if you look at this table here, this is the Windows Hyper-V storage path CPU overhead breakdown. One caveat here is that this was measured using 1-million-IOPS traffic, 4K random reads. So the data may be different depending on the OS or the traffic, but generally it matches or scales linearly. If you look at it, the first item is the guest component overhead; that's roughly 20%. This overhead comes from the Hyper-V guest enlightenment, as we said, storvsc here. This enlightenment is made aware: oh, I'm running in a virtualization environment. And obviously we have a guest OS storage stack, because the VM is also an OS. So combining them together, the guest storage stack takes 20% of the CPU cycles away.
Now, we just mentioned that every I/O has to cross the boundary from VM to host and then come back from host to VM. To cross the boundary, we use a shared buffer mechanism over the VMBus, and that also takes away 20% of the CPU cycles.
Of course, the biggest part comes from the host. The first one is the Hyper-V host components. We just mentioned we have the storvsp driver, which is mainly responsible for I/O dispatch and completion. We also have other parsers, like the VHDMP parser for a file-based image. And we mentioned that for file-based, we have host file system overhead, NTFS for example; that's roughly 10%. And of course the host storage stack, that means the disk driver, your storport, the miniport; that is about 10%. So you put all this together and the host takes away 40% of the CPU cycles.
Another one is, of course, the hypervisor. Under this one-million-IOPS load, that's about 20% overhead, and for most of this 20% overhead, interrupt delivery is predominant. We are going to explain more about how we mitigate both of these.
So obviously, this table shows us that CPU cost is typically the bottleneck for these high-IOPS, million-IOPS-or-above workloads. So we want to mitigate that; we want to save CPU cycles as much as possible. There are two different directions. New storage virtualization technologies will help mitigate the host as well as the VM-to-host overhead; that's the storage virtualization part. And the other part is new processor virtualization technologies, like APICv or posted interrupts, which are going to help mitigate the hypervisor overhead of dealing with interrupt delivery to virtual processors.
What does that mean?
It means the CPU cycles. So, say, for a normal Skylake CPU, right? It's like 2.7 gigahertz. So how many CPU cycles are you going to spend for these I/Os? Generally, you do the calculation: you take your total CPU cycles per second, divided by your I/Os per second. So for every I/O, you get how many CPU cycles you consume.
On the left-hand side, it's also like CPU cycles, right?
Yes, yes.
Okay.
So that's roughly how we get these calculations.
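As a rough worked example of that arithmetic (the 192-logical-processor, 2.7 GHz, 10-million-IOPS numbers come up later in this talk; the calculation itself is just the division described above):

    #include <stdio.h>

    int main(void)
    {
        double total_cycles_per_sec = 192.0 * 2.7e9;  /* 192 LPs at 2.7 GHz */
        double iops                 = 10.0e6;         /* target I/O rate    */

        /* ~51,840 CPU cycles available per I/O, to be shared across the guest
         * stack, VMBus, the host stack, and the hypervisor.                  */
        printf("cycle budget per I/O: %.0f\n", total_cycles_per_sec / iops);
        return 0;
    }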
Okay. Is that clear so far? Okay.
In the lowest row, there is a...
Yeah. So the traditional paravirtualized path has this kind of inherent overhead, and it cannot avoid this kind of contention between root VPs and VM VPs. So traditional Hyper-V SCSI pass-through is not sufficient to provide maximum performance. Now you may ask, hey, a natural response is: why don't we shift to storage SR-IOV? Yes. Storage SR-IOV today is still not there yet. Just like Amber mentioned in today's session, in the NVMe spec we do have the virtualization enhancement sections dealing with SR-IOV support, but some key parts are not there yet. For example, the lack of resource control: how do you get rid of the noisy-neighbor issues? If you have multiple VMs, and one VM is doing 4K random writes while another VM user is doing 4K random reads, the VM doing the 4K random reads is going to suffer significantly because of the other VM's 4K random writes. So we need some kind of resource control mechanism in the NVMe spec to handle this kind of thing, and that is not there yet today.
Also, if you look at the different SSD hardware vendors, they all come up with their own implementations; they're trying to fill in the missing details for themselves. I'll give you a few examples. I don't want to name names, but I see different SSD vendors with quite different implementations. Say, how many virtual functions? Most of them just choose a fixed number of virtual functions. But from a cloud vendor perspective, we want flexibility: we may want 16, 32, or even eight. So that's the kind of implementation that's available in the industry today. All of this makes software stack support for SR-IOV really difficult. That's the reality.
So let's talk about Hyper-V Discrete Device Assignment. That's a PCIe pass-through technology: essentially, we can assign a PCIe device directly to a VM. This is still an experimental feature; we first introduced it in Windows Server 2016. This feature obviously has a performance benefit, because it allows the VM user to access the I/O queues directly. That brings a significant performance gain compared with the traditional paravirtualized path. But it has very serious security concerns, because by exposing the admin queue to the VM user, a malicious VM user could do something pretty bad, and that could make your host and the other VMs running on the same host suffer. So that's the reason this remains an experimental feature, the DDA part. So we want a secured direct storage hardware access solution in cloud VMs, and operating with PCIe NVMe is best suited for this purpose.
And how do we do that, how do we make it safe? We need to filter out the unsafe admin commands while we still allow the VM user to access the I/O queues directly. So I'm just talking about this solution generally; if you have questions, I'll refer you to some of the announcements made just this week. For the software solution, we can do it either at the hypervisor layer, or we can use, like, a dedicated host filter driver to intercept these admin requests, and we can take actions accordingly. Another one, also very common, is a hardware solution: people can use an FPGA or a customized ASIC to filter out these unsafe admin commands. And there are other solutions, like, for example, the BMC: we can restrict device management, like firmware update, to go through the BMC out-of-band path only.
So that's something interesting.
But...
Excuse me.
Can you explain a little bit about what you're looking for here? As far as, say, there was a customized solution, the idea is that you would not want any admin commands to be able to be processed while there are virtual I/O commands being done?
So basically, the hardware solution and the software solution are largely doing the same thing. You can refer to the Amazon Nitro architecture; I'm from Microsoft, but the principle is the same. What the hardware does is expose a fake NVMe device to the VM, so the admin command is actually first intercepted by either the hardware or the software. If it's, say, a firmware update command, they will just cancel it.
So this assumes that you don't have SR-IOV?
No.
This is if you're using...
Exactly. As I mentioned, today storage SR-IOV is not there yet, and we need a solution now for best performance.
But you're not using the assignment, so you're assigning the entire device?
Yes. Yes, we are not partitioning anything; we just assign the entire device to the VM.
I can go into a little bit of detail. Because everything is mapped as a page, right? So you can just intercept this page mapping request. Then you can say, oh, it's the admin queue, and we just trap it. You don't need anything more than that.
Yeah.
Can you give me an example of an unsafe NVMe command?
Yeah, firmware update.
Firmware update?
Firmware update, yes. You definitely don't want a malicious user to do a firmware update. Who knows what security concerns come with that? You don't want them to do that.
Yeah.
Exactly. Exactly.
Yeah.
So that's, yeah.
Is there a way to lock down the admin queue for certain commands?
No, that's not available. We considered something like expanding the doorbell registers. But today, you know, it's very interesting: the spec supports a customized doorbell stride size, but in reality every vendor just uses a zero stride. So we can't do that. Otherwise, we could just separate the admin doorbell into a different page, and then the issue would be solved. But we can't.
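To make the filtering idea concrete, here is a minimal C sketch of the opcode allow/deny check that a host-side filter (software or hardware) might apply to each guest-issued admin submission entry. The structure and the allow/deny policy are illustrative assumptions, not a real Microsoft or vendor interface; only the opcode values come from the NVMe specification.

    #include <stdint.h>
    #include <stdbool.h>

    /* NVMe admin opcodes (from the NVMe specification) */
    #define NVME_ADMIN_FIRMWARE_COMMIT   0x10
    #define NVME_ADMIN_FIRMWARE_DOWNLOAD 0x11
    #define NVME_ADMIN_FORMAT_NVM        0x80

    typedef struct {
        uint8_t opcode;   /* command dword 0, bits 7:0                 */
        /* ... rest of the 64-byte submission queue entry ...          */
    } nvme_admin_cmd_t;

    /* Return true if the guest-issued admin command may be forwarded
     * to the device; false means the filter drops or fails it.        */
    static bool admin_cmd_allowed(const nvme_admin_cmd_t *cmd)
    {
        switch (cmd->opcode) {
        case NVME_ADMIN_FIRMWARE_COMMIT:
        case NVME_ADMIN_FIRMWARE_DOWNLOAD:
        case NVME_ADMIN_FORMAT_NVM:
            return false; /* unsafe: affects the whole device and other tenants */
        default:
            return true;  /* e.g. Identify, Get Log Page, queue management       */
        }
    }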
Okay, so we talked about the storage overhead. Let's pause a moment and review the interrupts. I just mentioned the hypervisor overhead: most of the hypervisor overhead is in dealing with interrupt delivery. We know that in a virtualization environment, delivering an interrupt to a VM is much more expensive than in a bare-metal environment. The reason is that we have this VM exit overhead, in other words, hypervisor intercepts. When doing virtual interrupt delivery, we frequently have to suspend the VM, exit to the hypervisor to update some data structures, and then come back and resume that VM.
So, for example, there are several typical sources of this overhead. The first one is APIC register access. There are several very common ones, like the interrupt request register, the interrupt command register, or the end-of-interrupt register. If you want to generate an interrupt, deliver an interrupt, or signal interrupt completion, you need to write these MSRs, and that kind of register update causes a VM exit, or hypervisor intercept. Another one is the IPI, the inter-processor interrupt. That makes things worse, because both the sending virtual processor and the receiving virtual processor need to exit.
On the right part, that is when an external device, like an I/O device, generates an interrupt. It goes through the IOMMU, goes through the hypervisor, and is finally delivered to the virtual processors. That will also go through VM exits.
So we copied the Windows Hyper-V hypervisor virtual processor performance counters here, just to show you how big this hypervisor intercept overhead can become. This was measured under roughly 4.5 million IOPS of traffic. If you look at the hypervisor runtime CPU, that's 20%. That means the hypervisor, dealing with all these interrupt-related hypervisor intercepts, takes 20% of the total CPU time. And the total intercepts for 4.1 million IOPS of traffic is 9.2 million hypervisor intercepts per second. You can see how busy the hypervisor is. The hardware interrupt rate is 2.2 million per second; obviously we have some kind of interrupt coalescing mechanism here, so for roughly every two I/Os we have one interrupt. But overall, the performance counters we used here just show, for this I/O-intensive workload, how expensive dealing with interrupt delivery becomes.
So that's the issue we have to solve.
So virtual interrupt delivery overhead is actually a well-known issue in the industry, and both Intel and AMD are working very hard to solve or mitigate it. If you look at the Intel server processor roadmap, from Nehalem and Ivy Bridge through Haswell and Broadwell to Skylake: APIC virtualization has been available starting from Ivy Bridge, and posted interrupt support was made available from Broadwell. For AMD, from Barcelona through the later generations to the latest EPYC, which is based on the Zen microarchitecture, AVIC support was made available on EPYC.
Yeah, so I will briefly go through the Intel one; I don't want to go into too much detail here. Intel's virtualization technology advancements include two parts. We just saw that we wanted to mitigate the issues related to APIC register access, and also mitigate the cost of external interrupt delivery. So Intel has APIC virtualization, which allows the guest to access APIC registers directly from a virtual APIC page, so you don't incur a VM exit. And posted interrupts, posted interrupts were actually available in Xen a while ago, but for Windows this is still new. Posted interrupts enable external interrupts to be delivered directly to VM virtual processors, which reduces hypervisor involvement. We choose the term "reduces" because this does not get rid of VM exits completely; for example, for an IPI, on the sending CPU we still need a VM exit for security reasons.
So at Microsoft, Danyu and I worked with the Hyper-V team and other teams to introduce and enable both this posted interrupt support and APICv support, starting from Windows Server 2019. That was announced this week.
AMD has very similar technology, but with a different implementation. It's called AVIC; that's the term AMD uses for APIC virtualization. AVIC allows most APIC accesses and interrupt delivery to go into the guest, also with reduced VM exits. One thing we need to remind you of here, because they have a different implementation: for posted interrupt support on AMD platforms, they use a different mechanism. They have to constantly update a data structure called the guest virtual APIC (GA) log table. The experiments that Danyu and I did uncovered that, for very high IOPS traffic, this GA logging mechanism can also incur very significant performance overhead. But this is just tied to the AMD implementation.
So with that, I'm going to hand over to Danyu. He's going to walk you through the rest of the sections.
Thanks, Liang, for a great overview of the past 10, 20 years of storage advancement and I/O virtualization advancement. This is the foundation of our work to deliver a 10-million-IOPS platform. Without that, our work would be impossible.
So this slide shows two tables. The first table shows the physical machine configuration: what kind of machine and what kind of storage subsystem we used to build up this platform. The lower table shows the virtual machine configuration. First, for the system, we are using a commodity HPE DL560 Gen10 server. This is a commodity server; you can buy it on the market. It is a four-socket server with Intel Skylake 8168 processors. Overall, the four-socket server delivers 192 logical processors at 2.7 GHz. These processors provide enough CPU power to deliver the 10 million IOPS that we need.
For the storage subsystem, we are using Intel PCIe Gen 3 half-length AIC P4608 NVMe devices. This NVMe form factor has a very interesting design: they kind of stack two NVMe SSDs into one form factor, so that one card can deliver 1.4 million IOPS, and a single PCIe slot only needs PCIe Gen 3 x8 to deliver that kind of IOPS. So with that, you can see that on the hardware side we have all the power to deliver more than 10 million IOPS in theory.
Now let's look at the virtual machine. We created a VM with 192 virtual processors. The reason we created this full-size VM is for a fair comparison between bare metal, root, and the VM, so that we have the same size of machine and we can see how much we can get on the IOPS side. We enable virtual NUMA in the VM so that we can better map the devices onto the NUMA topology. The host OS and the guest OS are both using Windows Server 2019, with all the optimizations and features we delivered on this new platform.
Now, this slide shows the test tools and experiment settings we used for the I/O benchmarking. I know FIO is very popular in this world, but on the Windows side we have a tool called DiskSpd. That is also open source; all the source code and binaries are on GitHub, so if you want, you can download the source code and build it yourself. This tool uses all Windows APIs, so it is closer to what a customer will use in the real world.
Actually, tomorrow one of my colleagues, Daniel Pearson, also coming from the Windows Server performance team, is going to have a talk on DiskSpd: how the tool has advanced, with some demos of how to use it and all the new features delivered there. So I recommend people go to that talk as well.
For the experiment settings, we use each NVMe device as a raw disk, using raw I/O, not going through an NTFS file system in the middle. We use one DiskSpd instance for every disk. For every disk, we use eight I/O queues, affinitized to eight VM VPs, and 128 queue depth per thread, so overall every SSD gets 1024 queue depth. That queue depth is enough to drive the physical capability of the SSDs. We use 4K random reads aligned to 4K boundaries, as unbuffered I/O, so it actually goes to the disks.
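For reference, a single-disk DiskSpd invocation along these lines would approximate those settings. This is an illustrative sketch, not the exact command used in the talk; the target (#1, physical drive 1), the duration, and the affinity list are placeholders.

    diskspd.exe -b4K -r -w0 -o128 -t8 -a0,1,2,3,4,5,6,7 -Sh -d60 -L #1

Here -b4K/-r/-w0 select 4K random reads, -t8 -o128 give eight threads at 128 outstanding I/Os each (1024 per SSD), -a affinitizes those threads to eight VPs, and -Sh disables caching so the I/O actually hits the disk. One such instance would run per NVMe device.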
For the performance comparison, we compare three different configurations: bare metal and the root, which both have direct access to the hardware, and also the VM. We'll see how that looks in the next slide.
So here are the performance results we delivered on this platform. One thing I want to mention here is that we are using the traditional interrupt mode in the NVMe driver. This means all I/Os are delivered by interrupts. In the next few slides I'm going to talk about a different, experimental mode we introduced in 2019, the polling mode, and how interrupt mode and polling mode compare. But here, all the results are delivered in the traditional NVMe interrupt mode.
On the X axis there are six different configurations we tried, and the Y axis is the IOPS throughput that we delivered. The first one is bare metal. You can see that on bare metal, we can deliver 11 million IOPS on this platform. So it's really good: 11 million IOPS. But now, if we move to the VM on the paravirtualized synthetic path, the traditional VM path, with 192 VPs, a very large VM with all the CPUs, we drop from 11 million IOPS down to 2.5 million IOPS. So we only use a fraction of the hardware power here. As Liang mentioned just earlier, we did the experiments, we tried different configurations, different tweaking; no matter what we tried, it could not go beyond 3 million IOPS. So there is a very big gap here.
Now, how about we move to the root? The root here has direct access to the hardware. Interestingly enough, we only delivered 7.2 million IOPS. The reason for that, as Liang mentioned earlier, is that on the root, even though the OS has direct access to the hardware, the interrupt delivery still has a huge overhead. So this is the result without posted interrupt support; that's on 2016, where we hadn't enabled that support yet.
Now, if we move to the VM with DDA, Discrete Device Assignment, the throughput further drops, from 7.2 million to 4.5 million. The reason for this drop is that on the root you have a one-to-one VP to logical processor mapping, but in the guest VM you don't have that mapping. That introduces some additional overhead in the hypervisor. For example, you have to grab a partition lock when you do the inter-processor interrupt delivery. Another example is that when there is an API call to query performance counters, that becomes a syscall instead of a user-mode API. So we did some optimizations in 2019 at the hypervisor level to get rid of some of that overhead, to try to minimize it.
The last two rows show what we get with all the optimizations in on 2019. This is the root with posted interrupts enabled. Now the throughput increases from about 7.2 million to 10 million. That's fairly close to what bare metal can deliver; it's only a fraction off, less than a 10% difference. Now, the most interesting part is the last one: in the VM, we can deliver exactly the same throughput as the root. That's 10 million IOPS from a single VM. So that's a very, very amazing number achieved from a single VM today.
Now let's talk about another mode: NVMe polling. We have seen the platform can deliver very good performance in interrupt mode, so why do we consider polling? The main reason for polling is that it helps us achieve close to bare-metal performance on platforms that don't have efficient APICv and posted interrupt support. As I mentioned, not every platform has posted interrupt support, and if you don't have that, it means you have very high interrupt delivery overhead. With constant polling there are no interrupts, so you avoid all the interrupt overhead. And there is another reason: some NVMe devices don't have enough I/O queues, which means they don't have enough MSI-X interrupts. If, for example, you have only eight MSI-X interrupts but you have 192 virtual processors in the VM, you have to figure out how to balance those interrupts across the virtual processors, and you can easily run into these mapping problems. We have seen in reality that this becomes a very big performance concern in a virtualized environment; it can cap your throughput significantly.
Now, if you compare polling with interrupts, polling of course provides much lower latencies and more consistent I/O tail latencies. You don't need to wait for the I/O completion interrupt to come into the VM, then go to the DPC, and then go to I/O completion; you constantly poll the completion queues, so there's very low latency. But of course, on the other hand, you are going to incur some higher CPU costs if your queue depth is pretty low, for those low-queue-depth workloads. So there's a balance here.
In the Linux world, polling was already introduced in kernel 4.4, which is a while back, and in 4.10 there was already a hybrid polling mode introduced. But Windows is far behind. So far, none of the Windows drivers support polling mode: none of the inbox drivers, none of the third-party drivers. We just don't have that available today. So Liang and I have been working with other people on the storage software stack, and we experimented with some polling support by modifying the current NVMe driver on Windows Server 2019. So we'll see how that looks.
Before we go to the results, I want to go a little bit deeper into how we implemented this NVMe polling on the Windows side. Look at this graph; it's a very simplified picture of the polling implementation on our side. You have a user app; it sends I/O initiations to the submission queue. Then, on the NVMe driver side, it queues the polling DPCs and constantly polls the I/O completion queues. It continues to poll the I/O completion queues until it finds there are no pending I/Os left, and then it leaves the polling loop. A few points I would make here. The first one: the interrupt is disabled for every polling-mode queue. We also support a mix of polling queues and interrupt queues, because interrupts still make sense for some large I/Os; you don't want constant polling for one big I/O, for example. So that's the way to do it: you can split the I/O completion queues into two modes, some configured as polling queues and some as interrupt queues, so you get the benefit of both. The third one is that there are two modes in Windows, the traditional DPC mode or the threaded DPC mode. By default we use threaded DPC mode so the system can be more responsive; these are passive-level DPCs, so all other interrupts can come in. And the DPCs are scheduled on the I/O-initiating processors, so if your application scales out across all the processors, the DPCs scale nicely along with the I/Os and we get more balanced CPU usage. Another one is that we don't have dedicated polling threads. We queue DPCs only when there are outstanding I/Os; if there are no outstanding I/Os, there's no polling cost. So that's one of the nice designs in this one.
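As a rough C sketch of that flow, assuming hypothetical helper types and routines (NVME_POLL_QUEUE, NvmeProcessCompletions): this is not the actual modified Windows NVMe driver, just an illustration of the queue-the-DPC-and-poll-until-drained design described above.

    #include <ntddk.h>

    typedef struct _NVME_POLL_QUEUE {
        KDPC          PollDpc;         /* initialized elsewhere with
                                          KeInitializeThreadedDpc(&PollDpc,
                                          NvmePollDpcRoutine, Queue), so the
                                          routine runs at PASSIVE_LEVEL        */
        volatile LONG PendingIos;      /* outstanding commands on this queue   */
        PVOID         CompletionQueue; /* hypothetical handle to the NVMe CQ   */
    } NVME_POLL_QUEUE;

    /* Hypothetical helper: reap completed CQ entries (phase-bit check,
     * doorbell update) and return how many completed.                         */
    ULONG NvmeProcessCompletions(PVOID CompletionQueue);

    /* Called on the I/O-initiating processor right after a command is placed
     * in the submission queue. Queuing the DPC on the current CPU is what
     * spreads polling across all initiating processors; if no I/O is ever
     * submitted, no DPC is queued, so there is no idle polling cost.          */
    VOID NvmeSubmitPolled(NVME_POLL_QUEUE *Queue)
    {
        InterlockedIncrement(&Queue->PendingIos);
        KeInsertQueueDpc(&Queue->PollDpc, NULL, NULL); /* no-op if already queued */
    }

    /* Threaded DPC routine: poll the completion queue until it drains.
     * Interrupts are disabled on polled queues, so this is their only
     * completion path.                                                        */
    VOID NvmePollDpcRoutine(PKDPC Dpc, PVOID Context, PVOID Arg1, PVOID Arg2)
    {
        NVME_POLL_QUEUE *Queue = (NVME_POLL_QUEUE *)Context;
        UNREFERENCED_PARAMETER(Dpc);
        UNREFERENCED_PARAMETER(Arg1);
        UNREFERENCED_PARAMETER(Arg2);

        for (;;) {
            ULONG done = NvmeProcessCompletions(Queue->CompletionQueue);
            if (done != 0) {
                InterlockedExchangeAdd(&Queue->PendingIos, -(LONG)done);
            }
            if (Queue->PendingIos == 0) {
                break;   /* no pending I/O left: leave the polling loop */
            }
        }
    }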
So these are the results we got on this platform. This is bare metal in interrupt mode. Now, if we enable the NVMe polling mode, we increase by about 10%, to 12 million IOPS. One thing I want to mention here is that this platform has very efficient posted interrupt support; that's why we did not see a lot of improvement on this platform, but even so, we see more than 10% throughput improvement. On the VM side, we improved from 10 million IOPS to 11 million IOPS. Liang and I have done some experiments on other platforms that don't have efficient posted interrupt support and where the NVMe devices have limited I/O queues. On those platforms we have seen huge improvements, because of the benefits we just talked about earlier.
Now, for the conclusions, there are a few points Liang and I want to make. The first one is that I/O-optimized SKUs are very critical for all cloud providers. We saw very strong customer demand, and all the cloud providers are starting to provide those I/O-optimized SKUs: 3 million, 4 million IOPS, that's what customers actually want. With that demand, the traditional paravirtualized path apparently cannot meet the need. It has already hit its limit; it cannot go any further on the paravirtualized path. So we need direct device assignment, which can bypass all the paravirtualization overhead, and also the latest I/O virtualization enhancements; those are both keys to achieving near-native performance from a VM. As Amber mentioned earlier, the software is the bottleneck right now, so we still have a long journey to go to reduce the overhead. This is just the start of the whole journey.
The last one is that Windows Hyper-V can provide 10 million IOPS from a single VM, and we are doing that on modern commodity hardware. Now, we really want to thank Intel and HPE; they were generously supportive of this work and provided the hardware and SSDs. Without their support it would be impossible for us to complete this work, so we really appreciate it.
So, any questions? Okay. Okay, thanks to everyone.
Thanks for
listening. If you have questions
about the material presented in this
podcast, be sure and join
our developers mailing list
by sending an email
to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with your peers in the storage
developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.