Storage Developer Conference - #102: Achieving 10-Million IOPS from a single VM on Windows Hyper-V
Episode Date: July 15, 2019...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, episode 102.
Okay, let's get started. Good afternoon. Today we are going to present Achieving 10-Million IOPS from a Single VM on Windows Hyper-V. I'm Liang, from the Microsoft Azure Performance team. And this is Danyu, my buddy, from the Microsoft Windows Server Performance team. We have been working together on Hyper-V VM storage performance for probably more than 10 years, actually, since Hyper-V was first created at Microsoft.
So for today's agenda, I will first cover some motivation, the performance challenges, and the issues we solved. We are also going to review the Hyper-V storage path and the optimizations we have done. And Danyu is going to walk you through the performance configurations, settings, and performance data, and the new NVMe polling model we have been working on for Windows.
So if you attended this morning's session, Amber from Intel, actually, she described a very interesting story. She said NVMe has developed so fast, hardware has developed so fast, and performance has become so incredible, but because the software stack remains almost constant, the end user cannot really see the performance benefit brought by the hardware improvement. So as a software guy, I'm very sorry to hear that.
And actually, that's the exact reason
Danyu and I are working together
to improve our software stack
so we can bring the best performance experience
to our customers.
So you may ask, why 10 million IOPS? Why do we need 10 million IOPS? Obviously, we have seen strong customer demand for these high-transaction workloads. Artificial intelligence, big data, machine learning, databases, all this stuff, they want high throughput, and all of it is highly transaction-based.
Also, the cloud market is growing. You see all these major cloud vendors; they have seen their revenue double or even triple year over year. And it's also a very competitive market, so every cloud vendor wants to bring the best-performance VM SKUs to their customers. Obviously we also want every VM SKU release to have high performance, in terms of high throughput and low latencies.
In the meantime, hardware technologies during the past 10 years have advanced very rapidly. That includes much faster storage today and much more advanced processors compared to, like, 10 years ago. In today's market, most of these cloud vendors have what are called storage-optimized VM SKUs. Amazon, Azure, Oracle, all of them. And all of these VM SKUs can handle IOPS at million-IOPS levels.
So let's first talk about the storage technology advancement which enables this high IOPS from the storage perspective. We have attended several sessions that talked about the shift of the data center SSD storage interface from traditional SAS and SATA to PCIe NVMe. Actually, what's interesting is that the Seagate people also discussed their thinking, or planning, to move the HDD interface from traditional SAS and SATA to the NVMe interface as well. So that's very interesting.
And if you look at the table below, here we compiled the theoretical protocol bandwidth and the actual IOPS for SATA SSDs, SAS SSDs, and PCIe NVMe SSDs. SATA SSD, we know, is capped at about 600 megabytes per second, and in reality the fastest SATA SSD today delivers around 100K IOPS per device, even for enterprise SKUs. For 12 Gb SAS, the theoretical bandwidth is about 1,200 megabytes per second, and in reality the most top-tier enterprise SAS SSDs deliver around 250K IOPS, no more than that. But the PCIe NVMe SSD we are using is Gen 3 x4, which is very close to four gigabytes per second and can deliver close to one million IOPS for PCIe x4. So you see that's a huge jump compared to the traditional SATA and SAS SSDs in terms of performance.
So let's first talk about this NVMe. NVMe, like Amber mentioned, is really designed from the ground up, so it can capitalize on high IOPS and low latency based on its internal built-in parallelism. What does this internal built-in parallelism really mean? If you look at this graph, traditionally, like SAS and SATA, they only have a single command queue. But for NVMe, the situation is quite different: it supports a large number of I/O queues. Very commonly we see 32, 64, 128, or even 256 in today's market. These multiple I/O queues enable the device to scale its I/O initiation and completion across a number of CPUs for high throughput. As perf guys, we really know, actually, if you look at the I/O path, most performance bottlenecks come from either the I/O completion or the I/O initiation path. Those are the two most important bottlenecks for I/O.
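To make that per-CPU scaling concrete, here is a minimal C sketch of the idea, using purely illustrative types and names (this is not driver code from Windows or from any NVMe vendor): each CPU owns its own submission/completion queue pair, so initiation and completion never serialize on a single shared queue.

    #include <stdint.h>

    #define MAX_QUEUE_PAIRS 128          /* devices commonly offer 32-256 I/O queues */

    typedef struct {
        uint32_t sq_tail;                /* submission queue tail (doorbell index)   */
        uint32_t cq_head;                /* completion queue head (doorbell index)   */
        /* ... queue memory, MSI-X vector targeting the owning CPU, etc. ...         */
    } nvme_queue_pair;

    static nvme_queue_pair queue_pairs[MAX_QUEUE_PAIRS];

    /* Pick the queue pair owned by the issuing CPU: no cross-CPU locking on the
     * submit side, and the completion (interrupt or poll) lands back on the same
     * CPU, which is exactly how the initiation and completion paths scale out.   */
    static nvme_queue_pair *nvme_pick_queue(uint32_t cpu_id)
    {
        return &queue_pairs[cpu_id % MAX_QUEUE_PAIRS];
    }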
In the previous slide,
you said that you have one million IOPS.
Yes.
Question is, are these reads or writes?
Reads.
And how big are the I/Os you are talking about? 4K?
4K.
Yeah, all of this, obviously, is measured with 4K.
Yeah, so obviously we are talking about 4K random reads. The write performance on this kind of NAND flash is not as good as reads.
So let's spend some time to review the Windows Hyper-V storage path. We know that today Microsoft Azure is powered by Hyper-V, so all these VMs actually run on top of Hyper-V. We have, like, three modes. This is very similar even in the Linux world; KVM is very similar. We have the emulation path, which engages the hypervisor constantly, so it's pretty slow. We also have the most frequently used path, the paravirtualized, or what we call synthetic, path, where we use an enlightened driver to bypass the emulation and minimize hypervisor involvement. And today we also have direct hardware assignment. That means we can assign a PCIe device directly to a VM, for best-performance reasons.
And from the VM storage backend mode perspective, we also have file-based, that is, for Hyper-V we frequently use a virtual hard disk image, either VHD or VHDX. File-based means you have to sit on the host file system, so the host file system adds all of its overhead. We also have physical-disk-based modes; that means we can bypass the host file system. That includes SCSI pass-through, which means the physical disk is presented to the VM as a SCSI disk. And then we also have PCIe pass-through. We have this new feature in Windows Server 2016 to do PCIe direct assignment, and that has better performance compared with the file-based mode.
We want to review the storage performance work we have done during the past, like, 10 years. First is our journey to million-IOPS VMs. You know, it's very interesting: like six years ago, we came here, probably to the same conference room, and announced Hyper-V had become the first hypervisor to achieve more than one million IOPS from a single VM. That was based on Windows Server 2012. A year later, in October 2013, Microsoft announced we had achieved more than two million IOPS from a single VM, and we showed a public demo at TechEd Europe 2013. All these one-million-IOPS and two-million-IOPS demos used similar configurations. They were based on 64 SCSI pass-through disks, which I just mentioned, so that means they used the traditional paravirtualized path. And every pass-through disk was backed by SATA SSDs.
So we just talked about how storage has advanced very quickly. Now a single PCIe Gen 3 x4 device can deliver close to one million IOPS. And today, if you look at all these cloud vendors, for the best-performance VM SKUs they still heavily use local storage. So we said, oh, we want to go beyond one million, two million, three million IOPS. That is very natural work for us. But the experiments we have done actually showed the throughput of a single VM is going to be capped at around three million IOPS if we continue using the traditional paravirtualized path. And I'm going to explain why next.
So if you look at this graph, this is the Hyper-V paravirtualized storage path. On the left side, that's the VM side, you'll see we have an enlightened driver; it's called storvsc. Actually, we also have a Linux solution, and that is already part of Linux. Then you cross the boundary over the VMBus driver and land on the host. We have the storvsp driver, then we go through some parsers, then we send the I/O to the physical disk. So look at this graph: every storage I/O goes through the I/O initiation and completion path twice. Once in the VM, and then we have to do the same again on the host. Also, you will notice that the contention between the root virtual processors and the VM virtual processors is going to present a very big performance bottleneck, because you have the host stack and the VM stack, and they contend for CPU cycles.
We have some existing mitigations. We have Hyper-V minroot, we have CPU groups. What these features do is isolate the root VPs from the VM virtual processors so they will not contend with each other. Like I said, we found this Hyper-V VM is going to cap at three million IOPS based on the traditional paravirtualized path. So where does this overhead come from? We did some experiments.
So if you look at this table here, this is the Windows Hyper-V storage path CPU overhead breakdown. One caveat here is that this was measured using 1-million-IOPS traffic, 4K random reads. So the data may be different depending on the OS or the traffic, but generally it matches or scales linearly. If you look at it, the first item is the guest component overhead; that's roughly 20%. This overhead comes from the Hyper-V guest enlightenment, as we said, storvsc here. This enlightenment is made aware: oh, I'm running in a virtualization environment. And obviously we have a guest OS storage stack, because the VM is also an OS. So combining them together, the guest storage stack takes 20% of the CPU cycles away.
Now, we just mentioned that every I/O has to cross the boundary from VM to host and then come back from host to VM. To cross the boundary, we use a shared buffer mechanism over the VMBus, and that also takes away 20% of the CPU cycles.
Of course, the biggest part comes from the host. The first one is the Hyper-V host components. We just mentioned we have the storvsp driver, which is mainly responsible for I/O dispatch and completion. We also have other parsers, like the VHDMP parser for a file-based image. And we mentioned that for file-based, we have host file system overhead, NTFS for example; that's roughly 10%. And of course the host storage stack, that means the disk driver, your storport, the miniport; that is about 10%. So you put all this together and the host takes away 40% of the CPU cycles.
Another one is, of course, the hypervisor. Under this one-million-IOPS load, that's about 20% overhead, and for most of this 20% overhead, interrupt delivery is predominant. We are going to explain more about how we mitigate both of these.
So obviously, this table shows us that CPU cost is typically the bottleneck for these high-IOPS, million-IOPS-or-above workloads. So we want to mitigate that; we want to save CPU cycles as much as possible. There are two different directions. New storage virtualization technologies will help mitigate the host as well as the VM-to-host overhead; that's the storage virtualization part. And the other part is new processor virtualization technologies, like APICv or posted interrupts, which are going to help mitigate the hypervisor overhead of dealing with interrupt delivery to virtual processors.
What does that mean?
It means the CPU cycles. So, say, for a normal Skylake CPU, right? It's like 2.7 gigahertz. So how many CPU cycles are you going to spend for these I/Os? Generally, you do the calculation: you take your total CPU cycles per second, divided by your I/Os per second. So for every I/O, you get how many CPU cycles you consume.
On the left-hand side, it's also like CPU cycles, right?
Yes, yes.
Okay.
So that's roughly how we get these calculations.
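As a rough worked example of that arithmetic (the 192-logical-processor, 2.7 GHz, 10-million-IOPS numbers come up later in this talk; the calculation itself is just the division described above):

    #include <stdio.h>

    int main(void)
    {
        double total_cycles_per_sec = 192.0 * 2.7e9;  /* 192 LPs at 2.7 GHz */
        double iops                 = 10.0e6;         /* target I/O rate    */

        /* ~51,840 CPU cycles available per I/O, to be shared across the guest
         * stack, VMBus, the host stack, and the hypervisor.                  */
        printf("cycle budget per I/O: %.0f\n", total_cycles_per_sec / iops);
        return 0;
    }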
Okay. Is that clear so far? Okay.
In the lowest row, there is a...
Yeah. So the traditional paravirtualized path has this kind of inherent overhead, and it cannot avoid this kind of contention between root VPs and VM VPs. So traditional Hyper-V SCSI pass-through is not sufficient to provide maximum performance. Now you may ask, hey, a natural response is: why don't we shift to storage SR-IOV? Yes. Storage SR-IOV today is still not there yet. Just like Amber mentioned in today's session, in the NVMe spec we do have the virtualization enhancement sections dealing with SR-IOV support, but some key parts are not there yet. For example, the lack of resource control: how do you get rid of the noisy-neighbor issues? If you have multiple VMs, and one VM is doing 4K random writes while another VM user is doing 4K random reads, the VM doing the 4K random reads is going to suffer significantly because of the other VM's 4K random writes. So we need some kind of resource control mechanism in the NVMe spec to handle this kind of thing, and that is not there yet today.
Also, if you look at the different SSD hardware vendors, they all come up with their own implementations; they're trying to fill in the missing details for themselves. I'll give you a few examples. I don't want to name names, but I see different SSD vendors with quite different implementations. Say, how many virtual functions? Most of them just choose a fixed number of virtual functions. But from a cloud vendor perspective, we want flexibility: we may want 16, 32, or even eight. So that's the kind of implementation that's available in the industry today. All of this makes software stack support for SR-IOV really difficult. That's the reality.
So let's talk about Hyper-V Discrete Device Assignment. That's a PCIe pass-through technology: essentially, we can assign a PCIe device directly to a VM. This is still an experimental feature; we first introduced it in Windows Server 2016. This feature obviously has a performance benefit, because it allows the VM user to access the I/O queues directly. That brings a significant performance gain compared with the traditional paravirtualized path. But it has very serious security concerns, because by exposing the admin queue to the VM user, a malicious VM user could do something pretty bad, and that could make your host and the other VMs running on the same host suffer. So that's the reason this remains an experimental feature, the DDA part. So we want a secured direct storage hardware access solution in cloud VMs, and operating with PCIe NVMe is best suited for this purpose.
And how do we do that, how do we make it safe? We need to filter out the unsafe admin commands while we still allow the VM user to access the I/O queues directly. So I'm just talking about this solution generally; if you have questions, I'll refer you to some of the announcements made just this week. For the software solution, we can do it either at the hypervisor layer, or we can use, like, a dedicated host filter driver to intercept these admin requests, and we can take actions accordingly. Another one, also very common, is a hardware solution: people can use an FPGA or a customized ASIC to filter out these unsafe admin commands. And there are other solutions, like, for example, the BMC: we can restrict device management, like firmware update, to go through the BMC out-of-band path only.
So that's something interesting.
But...
Excuse me.
Can you explain a little bit about what you're looking for here? As far as, say, there was a customized solution, the idea is that you would not want any admin commands to be able to be processed while there are virtual I/O commands being done?
So basically, the hardware solution and the software solution are largely doing the same thing. You can refer to the Amazon Nitro architecture; I'm from Microsoft, but the principle is the same. What the hardware does is expose a fake NVMe device to the VM, so the admin command is actually first intercepted by either the hardware or the software. If it's, say, a firmware update command, they will just cancel it.
So this assumes that you don't have SR-IOV?
No.
This is if you're using...
Exactly. As I mentioned, today storage SR-IOV is not there yet, and we need a solution now for best performance.
But you're not using the assignment, so you're assigning the entire device?
Yes. Yes, we are not partitioning anything; we just assign the entire device to the VM.
I can go into a little bit of detail. Because everything is mapped as a page, right? So you can just intercept this page mapping request. Then you can say, oh, it's the admin queue, and we just trap it. You don't need anything more than that.
Yeah.
Can you give me an example of an unsafe NVMe command?
Yeah, firmware update.
Firmware update?
Firmware update, yes. You definitely don't want a malicious user to do a firmware update. Who knows what security concerns come with that? You don't want them to do that.
Yeah.
Exactly. Exactly.
Yeah.
So that's, yeah.
Is there a way to lock down the admin queue for certain commands?
No, that's not available. We considered something like expanding the doorbell registers. But today, you know, it's very interesting: the spec supports a customized doorbell stride size, but in reality every vendor just uses a zero stride. So we can't do that. Otherwise, we could just separate the admin doorbell into a different page, and then the issue would be solved. But we can't.
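To make the filtering idea concrete, here is a minimal C sketch of the opcode allow/deny check that a host-side filter (software or hardware) might apply to each guest-issued admin submission entry. The structure and the allow/deny policy are illustrative assumptions, not a real Microsoft or vendor interface; only the opcode values come from the NVMe specification.

    #include <stdint.h>
    #include <stdbool.h>

    /* NVMe admin opcodes (from the NVMe specification) */
    #define NVME_ADMIN_FIRMWARE_COMMIT   0x10
    #define NVME_ADMIN_FIRMWARE_DOWNLOAD 0x11
    #define NVME_ADMIN_FORMAT_NVM        0x80

    typedef struct {
        uint8_t opcode;   /* command dword 0, bits 7:0                 */
        /* ... rest of the 64-byte submission queue entry ...          */
    } nvme_admin_cmd_t;

    /* Return true if the guest-issued admin command may be forwarded
     * to the device; false means the filter drops or fails it.        */
    static bool admin_cmd_allowed(const nvme_admin_cmd_t *cmd)
    {
        switch (cmd->opcode) {
        case NVME_ADMIN_FIRMWARE_COMMIT:
        case NVME_ADMIN_FIRMWARE_DOWNLOAD:
        case NVME_ADMIN_FORMAT_NVM:
            return false; /* unsafe: affects the whole device and other tenants */
        default:
            return true;  /* e.g. Identify, Get Log Page, queue management       */
        }
    }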
Okay, so we talked about the storage overhead. Let's pause a moment and review the interrupts. I just mentioned the hypervisor overhead: most of the hypervisor overhead is in dealing with interrupt delivery. We know that in a virtualization environment, delivering an interrupt to a VM is much more expensive than in a bare-metal environment. The reason is that we have this VM exit overhead, in other words, hypervisor intercepts. When doing virtual interrupt delivery, we frequently have to suspend the VM, exit to the hypervisor to update some data structures, and then come back and resume that VM.
So, for example, there are several typical sources of this overhead. The first one is APIC register access. There are several very common ones, like the interrupt request register, the interrupt command register, or the end-of-interrupt register. If you want to generate an interrupt, deliver an interrupt, or signal interrupt completion, you need to write these MSRs, and that kind of register update causes a VM exit, or hypervisor intercept. Another one is the IPI, the inter-processor interrupt. That makes things worse, because both the sending virtual processor and the receiving virtual processor need to exit.
On the right part, that is when an external device, like an I/O device, generates an interrupt. It goes through the IOMMU, goes through the hypervisor, and is finally delivered to the virtual processors. That will also go through VM exits.
So we copied the Windows Hyper-V hypervisor virtual processor performance counters here, just to show you how big this hypervisor intercept overhead can become. This was measured under roughly 4.5 million IOPS of traffic. If you look at the hypervisor runtime CPU, that's 20%. That means the hypervisor, dealing with all these interrupt-related hypervisor intercepts, takes 20% of the total CPU time. And the total intercepts for 4.1 million IOPS of traffic is 9.2 million hypervisor intercepts per second. You can see how busy the hypervisor is. The hardware interrupt rate is 2.2 million per second; obviously we have some kind of interrupt coalescing mechanism here, so for roughly every two I/Os we have one interrupt. But overall, the performance counters we used here just show, for this I/O-intensive workload, how expensive dealing with interrupt delivery becomes.
So that's the issue we have to solve.
So virtual interrupt delivery overhead is actually a well-known issue in the industry, and both Intel and AMD are working very hard to solve or mitigate it. If you look at the Intel server processor roadmap, from Nehalem and Ivy Bridge through Haswell and Broadwell to Skylake: APIC virtualization has been available starting from Ivy Bridge, and posted interrupt support was made available from Broadwell. For AMD, from Barcelona through the later generations to the latest EPYC, which is based on the Zen microarchitecture, AVIC support was made available on EPYC.
Yeah, so I will briefly go through the Intel one; I don't want to go into too much detail here. Intel's virtualization technology advancements include two parts. We just saw that we wanted to mitigate the issues related to APIC register access, and also mitigate the cost of external interrupt delivery. So Intel has APIC virtualization, which allows the guest to access APIC registers directly from a virtual APIC page, so you don't incur a VM exit. And posted interrupts, posted interrupts were actually available in Xen a while ago, but for Windows this is still new. Posted interrupts enable external interrupts to be delivered directly to VM virtual processors, which reduces hypervisor involvement. We choose the term "reduces" because this does not get rid of VM exits completely; for example, for an IPI, on the sending CPU we still need a VM exit for security reasons.
So at Microsoft, Danyu and I worked with the Hyper-V team and other teams to introduce and enable both this posted interrupt support and APICv support, starting from Windows Server 2019. That was announced this week.
AMD has very similar technology, but with a different implementation. It's called AVIC; that's the term AMD uses for APIC virtualization. AVIC allows most APIC accesses and interrupt delivery to go into the guest, also with reduced VM exits. One thing we need to remind you of here, because they have a different implementation: for posted interrupt support on AMD platforms, they use a different mechanism. They have to constantly update a data structure called the guest virtual APIC (GA) log table. The experiments that Danyu and I did uncovered that, for very high IOPS traffic, this GA logging mechanism can also incur very significant performance overhead. But this is just tied to the AMD implementation.
So with that, I'm going to hand over to Danyu. He's going to walk you through the rest of the sections.
Thanks, Liang, for a great overview of the past 10, 20 years of storage advancement and I/O virtualization advancement. This is the foundation of our work to deliver a 10-million-IOPS platform. Without that, our work would be impossible.
So this slide shows two tables. The first table shows the physical machine configuration: what kind of machine and what kind of storage subsystem we used to build up this platform. The lower table shows the virtual machine configuration. First, for the system, we are using a commodity HPE DL560 Gen10 server. This is a commodity server; you can buy it on the market. It is a four-socket server with Intel Skylake 8168 processors. Overall, the four-socket server delivers 192 logical processors at 2.7 GHz. These processors provide enough CPU power to deliver the 10 million IOPS that we need.
For the storage subsystem, we are using Intel PCIe Gen 3 half-length AIC P4608 NVMe devices. This NVMe form factor has a very interesting design: they kind of stack two NVMe SSDs into one form factor, so that one card can deliver 1.4 million IOPS, and a single PCIe slot only needs PCIe Gen 3 x8 to deliver that kind of IOPS. So with that, you can see that on the hardware side we have all the power to deliver more than 10 million IOPS in theory.
Now let's look at the virtual machine. We created a VM with 192 virtual processors. The reason we created this full-size VM is for a fair comparison between bare metal, root, and the VM, so that we have the same size of machine and we can see how much we can get on the IOPS side. We enable virtual NUMA in the VM so that we can better map the devices onto the NUMA topology. The host OS and the guest OS are both using Windows Server 2019, with all the optimizations and features we delivered on this new platform.
Now, this slide shows the test tools and experiment settings we used for the I/O benchmarking. I know FIO is very popular in this world, but on the Windows side we have a tool called DiskSpd. That is also open source; all the source code and binaries are on GitHub, so if you want, you can download the source code and build it yourself. This tool uses all Windows APIs, so it is closer to what a customer will use in the real world.
Actually, tomorrow one of my colleagues, Daniel Pearson, also coming from the Windows Server performance team, is going to have a talk on DiskSpd: how the tool has advanced, with some demos of how to use it and all the new features delivered there. So I recommend people go to that talk as well.
For the experiment settings, we use each NVMe device as a raw disk, using raw I/O, not going through an NTFS file system in the middle. We use one DiskSpd instance for every disk. For every disk, we use eight I/O queues, affinitized to eight VM VPs, and 128 queue depth per thread, so overall every SSD gets 1024 queue depth. That queue depth is enough to drive the physical capability of the SSDs. We use 4K random reads aligned to 4K boundaries, as unbuffered I/O, so it actually goes to the disks.
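For reference, a single-disk DiskSpd invocation along these lines would approximate those settings. This is an illustrative sketch, not the exact command used in the talk; the target (#1, physical drive 1), the duration, and the affinity list are placeholders.

    diskspd.exe -b4K -r -w0 -o128 -t8 -a0,1,2,3,4,5,6,7 -Sh -d60 -L #1

Here -b4K/-r/-w0 select 4K random reads, -t8 -o128 give eight threads at 128 outstanding I/Os each (1024 per SSD), -a affinitizes those threads to eight VPs, and -Sh disables caching so the I/O actually hits the disk. One such instance would run per NVMe device.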
For the performance comparison, we compare three different configurations: bare metal and the root, which both have direct access to the hardware, and also the VM. We'll see how that looks in the next slide.
So here are the performance results we delivered on this platform. One thing I want to mention here is that we are using the traditional interrupt mode in the NVMe driver. This means all I/Os are delivered by interrupts. In the next few slides I'm going to talk about a different, experimental mode we introduced in 2019, the polling mode, and how interrupt mode and polling mode compare. But here, all the results are delivered in the traditional NVMe interrupt mode.
On the X axis there are six different configurations we tried, and the Y axis is the IOPS throughput that we delivered. The first one is bare metal. You can see that on bare metal, we can deliver 11 million IOPS on this platform. So it's really good: 11 million IOPS. But now, if we move to the VM on the paravirtualized synthetic path, the traditional VM path, with 192 VPs, a very large VM with all the CPUs, we drop from 11 million IOPS down to 2.5 million IOPS. So we only use a fraction of the hardware power here. As Liang mentioned just earlier, we did the experiments, we tried different configurations, different tweaking; no matter what we tried, it could not go beyond 3 million IOPS. So there is a very big gap here.
Now, how about we move to the root? The root here has direct access to the hardware. Interestingly enough, we only delivered 7.2 million IOPS. The reason for that, as Liang mentioned earlier, is that on the root, even though the OS has direct access to the hardware, the interrupt delivery still has a huge overhead. So this is the result without posted interrupt support; that's on 2016, where we hadn't enabled that support yet.
Now, if we move to the VM with DDA, Discrete Device Assignment, the throughput further drops, from 7.2 million to 4.5 million. The reason for this drop is that on the root you have a one-to-one VP to logical processor mapping, but in the guest VM you don't have that mapping. That introduces some additional overhead in the hypervisor. For example, you have to grab a partition lock when you do the inter-processor interrupt delivery. Another example is that when there is an API call to query performance counters, that becomes a syscall instead of a user-mode API. So we did some optimizations in 2019 at the hypervisor level to get rid of some of that overhead, to try to minimize it.
The last two rows show what we get with all the optimizations in on 2019. This is the root with posted interrupts enabled. Now the throughput increases from about 7.2 million to 10 million. That's fairly close to what bare metal can deliver; it's only a fraction off, less than a 10% difference. Now, the most interesting part is the last one: in the VM, we can deliver exactly the same throughput as the root. That's 10 million IOPS from a single VM. So that's a very, very amazing number achieved from a single VM today.
Now let's talk about another mode: NVMe polling. We have seen the platform can deliver very good performance in interrupt mode, so why do we consider polling? The main reason for polling is that it helps us achieve close to bare-metal performance on platforms that don't have efficient APICv and posted interrupt support. As I mentioned, not every platform has posted interrupt support, and if you don't have that, it means you have very high interrupt delivery overhead. With constant polling there are no interrupts, so you avoid all the interrupt overhead. And there is another reason: some NVMe devices don't have enough I/O queues, which means they don't have enough MSI-X interrupts. If, for example, you have only eight MSI-X interrupts but you have 192 virtual processors in the VM, you have to figure out how to balance those interrupts across the virtual processors, and you can easily run into these mapping problems. We have seen in reality that this becomes a very big performance concern in a virtualized environment; it can cap your throughput significantly.
Now, if you compare polling with interrupts, polling of course provides much lower latencies and more consistent I/O tail latencies. You don't need to wait for the I/O completion interrupt to come into the VM, then go to the DPC, and then go to I/O completion; you constantly poll the completion queues, so there's very low latency. But of course, on the other hand, you are going to incur some higher CPU costs if your queue depth is pretty low, for those low-queue-depth workloads. So there's a balance here.
In the Linux world, polling was already introduced in kernel 4.4, which is a while back, and in 4.10 there was already a hybrid polling mode introduced. But Windows is far behind. So far, none of the Windows drivers support polling mode: none of the inbox drivers, none of the third-party drivers. We just don't have that available today. So Liang and I have been working with other people on the storage software stack, and we experimented with some polling support by modifying the current NVMe driver on Windows Server 2019. So we'll see how that looks.
Before we go to the results, I want to go a little bit deeper into how we implemented this NVMe polling on the Windows side. Look at this graph; it's a very simplified picture of the polling implementation on our side. You have a user app; it sends I/O initiations to the submission queue. Then, on the NVMe driver side, it queues the polling DPCs and constantly polls the I/O completion queues. It continues to poll the I/O completion queues until it finds there are no pending I/Os left, and then it leaves the polling loop. A few points I would make here. The first one: the interrupt is disabled for every polling-mode queue. We also support a mix of polling queues and interrupt queues, because interrupts still make sense for some large I/Os; you don't want constant polling for one big I/O, for example. So that's the way to do it: you can split the I/O completion queues into two modes, some configured as polling queues and some as interrupt queues, so you get the benefit of both. The third one is that there are two modes in Windows, the traditional DPC mode or the threaded DPC mode. By default we use threaded DPC mode so the system can be more responsive; these are passive-level DPCs, so all other interrupts can come in. And the DPCs are scheduled on the I/O-initiating processors, so if your application scales out across all the processors, the DPCs scale nicely along with the I/Os and we get more balanced CPU usage. Another one is that we don't have dedicated polling threads. We queue DPCs only when there are outstanding I/Os; if there are no outstanding I/Os, there's no polling cost. So that's one of the nice designs in this one.
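As a rough C sketch of that flow, assuming hypothetical helper types and routines (NVME_POLL_QUEUE, NvmeProcessCompletions): this is not the actual modified Windows NVMe driver, just an illustration of the queue-the-DPC-and-poll-until-drained design described above.

    #include <ntddk.h>

    typedef struct _NVME_POLL_QUEUE {
        KDPC          PollDpc;         /* initialized elsewhere with
                                          KeInitializeThreadedDpc(&PollDpc,
                                          NvmePollDpcRoutine, Queue), so the
                                          routine runs at PASSIVE_LEVEL        */
        volatile LONG PendingIos;      /* outstanding commands on this queue   */
        PVOID         CompletionQueue; /* hypothetical handle to the NVMe CQ   */
    } NVME_POLL_QUEUE;

    /* Hypothetical helper: reap completed CQ entries (phase-bit check,
     * doorbell update) and return how many completed.                         */
    ULONG NvmeProcessCompletions(PVOID CompletionQueue);

    /* Called on the I/O-initiating processor right after a command is placed
     * in the submission queue. Queuing the DPC on the current CPU is what
     * spreads polling across all initiating processors; if no I/O is ever
     * submitted, no DPC is queued, so there is no idle polling cost.          */
    VOID NvmeSubmitPolled(NVME_POLL_QUEUE *Queue)
    {
        InterlockedIncrement(&Queue->PendingIos);
        KeInsertQueueDpc(&Queue->PollDpc, NULL, NULL); /* no-op if already queued */
    }

    /* Threaded DPC routine: poll the completion queue until it drains.
     * Interrupts are disabled on polled queues, so this is their only
     * completion path.                                                        */
    VOID NvmePollDpcRoutine(PKDPC Dpc, PVOID Context, PVOID Arg1, PVOID Arg2)
    {
        NVME_POLL_QUEUE *Queue = (NVME_POLL_QUEUE *)Context;
        UNREFERENCED_PARAMETER(Dpc);
        UNREFERENCED_PARAMETER(Arg1);
        UNREFERENCED_PARAMETER(Arg2);

        for (;;) {
            ULONG done = NvmeProcessCompletions(Queue->CompletionQueue);
            if (done != 0) {
                InterlockedExchangeAdd(&Queue->PendingIos, -(LONG)done);
            }
            if (Queue->PendingIos == 0) {
                break;   /* no pending I/O left: leave the polling loop */
            }
        }
    }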
So these are the results we got on this platform. This is bare metal in interrupt mode. Now, if we enable the NVMe polling mode, we increase by about 10%, to 12 million IOPS. One thing I want to mention here is that this platform has very efficient posted interrupt support; that's why we did not see a lot of improvement on this platform, but even so, we see more than 10% throughput improvement. On the VM side, we improved from 10 million IOPS to 11 million IOPS. Liang and I have done some experiments on other platforms that don't have efficient posted interrupt support and where the NVMe devices have limited I/O queues. On those platforms we have seen huge improvements, because of the benefits we just talked about earlier.
Now, for the conclusions, there are a few points Liang and I want to make. The first one is that I/O-optimized SKUs are very critical for all cloud providers. We saw very strong customer demand, and all the cloud providers are starting to provide those I/O-optimized SKUs: 3 million, 4 million IOPS, that's what customers actually want. With that demand, the traditional paravirtualized path apparently cannot meet the need. It has already hit its limit; it cannot go any further on the paravirtualized path. So we need direct device assignment, which can bypass all the paravirtualization overhead, and also the latest I/O virtualization enhancements; those are both keys to achieving near-native performance from a VM. As Amber mentioned earlier, the software is the bottleneck right now, so we still have a long journey to go to reduce the overhead. This is just the start of the whole journey.
The last one is that Windows Hyper-V can provide 10 million IOPS from a single VM, and we are doing that on modern commodity hardware. Now, we really want to thank Intel and HPE; they were generously supportive of this work and provided the hardware and SSDs. Without their support it would be impossible for us to complete this work, so we really appreciate it.
So, any questions? Okay. Okay, thanks to everyone.
Thanks for
listening. If you have questions
about the material presented in this
podcast, be sure and join
our developers mailing list
by sending an email
to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with your peers in the storage
developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.