Storage Developer Conference - #54: Bridging the Gap Between NVMe SSD Performance and Scale Out Software
Episode Date: August 8, 2017...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast.
You are listening to episode 54 of the SDC podcast.
Today we hear from Anjaneya "Reddy" Chagam, Principal Engineer, Intel,
as he presents Bridging the Gap Between NVMe SSD Performance and Scale-Out
Software from the 2016 Storage Developer Conference.
Good afternoon, everyone.
Can you hear me okay down there?
So welcome to the Bridging the Gap Between NVMe SSD Performance and Scale-Out Software.
My name is Reddy Chagam.
I'm the principal engineer and chief SDS architect from Intel Data Center Group.
I have two co-presenters with me. I will let Swaroop and Murali introduce themselves.
Sure, thanks. My name is Swaroop Datta. I manage the product management team at VMware in the storage and availability business unit.
Thanks, Swaroop. My name is Murali Rajagopal.
I'm a storage architect in the office of CTO at VMware.
Fantastic.
Okay, so we are going to cover two broad scale-out software stacks. I am specifically going to focus on the Ceph side, and then Swaroop and Murali are going to focus on the VMware vSAN stack. The theme of this presentation is what exactly we are doing within the context of NVMe SSDs to speed up the software side of these two stacks, and we'll also give a flavor of the planned work going forward. How many of you know the NVMe protocol,
heard about it, familiar with the concept? Okay, so a few. Okay, so I'm going to give a very high-level view into what exactly an NVMe SSD is and how this plays out for the following discussion on Ceph. So if you look at NVMe, essentially it is the standard software interface between the host and PCIe-attached SSD media. Because it is a standard software interface, you can plug in an NVMe-attached SSD from any vendor and it should work without changing the host software.
So that's the number one thing.
It uses queuing as the foundational construct for communicating with the media. There is a set of admin queues for setting up your SSD, as well as submission queues and completion queues for the I/O. So you use the queuing construct as a way to communicate with the media in a standard way.
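To make that queuing model concrete, here is a minimal, purely illustrative Python sketch of a submission/completion queue pair. It is not tied to any real NVMe driver API; the command fields and the in-memory "device" are assumptions chosen only to show the submit-then-reap-completions pattern the protocol is built around.

```python
from collections import deque

class QueuePair:
    """Toy model of one NVMe submission/completion queue pair."""
    def __init__(self, depth=64):
        self.depth = depth
        self.sq = deque()   # submission queue: host -> device
        self.cq = deque()   # completion queue: device -> host

    def submit(self, opcode, lba, length, cid):
        # Host places a command entry on the submission queue.
        if len(self.sq) >= self.depth:
            raise RuntimeError("submission queue full")
        self.sq.append({"cid": cid, "opcode": opcode, "lba": lba, "len": length})

    def device_process(self):
        # Stand-in for the controller: drain the SQ, post completions to the CQ.
        while self.sq:
            cmd = self.sq.popleft()
            self.cq.append({"cid": cmd["cid"], "status": 0})

    def poll_completions(self):
        # Host reaps completion entries (a real driver also rings doorbells).
        done = list(self.cq)
        self.cq.clear()
        return done

qp = QueuePair()
qp.submit(opcode="read", lba=0, length=8, cid=1)
qp.submit(opcode="write", lba=128, length=8, cid=2)
qp.device_process()
print(qp.poll_completions())   # completions for cid 1 and 2
```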
Of course, it is attached to PCIe lanes. That means it has direct connectivity from the CPU to the NVMe media; there are no intermediate components like a host bus adapter in the path. So that's a key point. And because it is attached to PCIe lanes, you can take advantage of the bandwidth that PCIe gives you.
So with PCIe Gen 3, one lane can give you roughly 1 gigabyte per second of bandwidth. Most NVMe SSD cards out there are x4 or x8. A x4 PCIe SSD card typically gives you roughly half a million IOPS for reads and somewhere around 200K IOPS on the write side. With an Optane SSD you can get to 1 million IOPS with a x8 PCIe slot. So you can get a significant amount of throughput and performance out of NVMe SSDs, because they are directly attached and you can scale based on the number of lanes rather than being limited by an intermediate controller. The other thing is that you don't need a host bus adapter like a SAS controller, so it gives you a cost reduction as well as a power reduction, in addition to the performance. If you look at a Broadwell CPU, you get 40 PCIe lanes per socket. A bunch of them will be shared with networking, but you still get enough PCIe lanes to drive the bandwidth that is needed on the compute as well as the networking side.
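As a rough sanity check on those numbers, here is a small Python sketch that computes theoretical PCIe Gen 3 bandwidth per link width. The ~1 GB/s-per-lane figure is the usable rate quoted in the talk, so treat the outputs as upper bounds rather than measured results.

```python
# Approximate usable PCIe Gen 3 bandwidth per lane (GB/s), as quoted in the talk.
GEN3_GB_PER_SEC_PER_LANE = 1.0

def pcie_bandwidth_gbps(lanes, per_lane=GEN3_GB_PER_SEC_PER_LANE):
    """Theoretical one-direction bandwidth for a PCIe link of the given width."""
    return lanes * per_lane

for lanes in (1, 4, 8, 16):
    print(f"x{lanes}: ~{pcie_bandwidth_gbps(lanes):.0f} GB/s")

# A 4 KiB random-read workload that saturated a x4 link would top out around:
io_size_bytes = 4 * 1024
max_iops_x4 = pcie_bandwidth_gbps(4) * 1e9 / io_size_bytes
print(f"x4 ceiling at 4 KiB I/O: ~{max_iops_x4 / 1e6:.1f} M IOPS (bus limit, not device limit)")
```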
There are a few different form factors. M.2 is really meant for the boot use case, as a boot drive. Then there is the small form factor, which is U.2, and then there is an add-in card.
So you essentially can use different form factors based on the system configs that you are after
and the density and price points that you are really looking for.
Okay?
All right.
So that's kind of the brief overview of what NVMe protocol is about, what is NVMe
SSD and what are the benefits.
So I'm going to directly jump into Ceph side.
How many of you heard about Ceph, know Ceph?
Okay, reasonable number.
So Ceph is open source, scale-out storage software. It is built on the foundation of an object interface, which is called RADOS.
So, that is the foundation layer.
You can actually take the RADOS foundation layer, deploy on a pool of standard high volume servers.
So, you essentially can take a bunch of machines, put Ceph software on top of it,
and you essentially get a scale-out storage stack
that is protected beyond a host failure. So you can have host failure, rack failure, you can
essentially make sure that the data is protected. So it gives you the scale-out property, and it also gives you durability: you can have either multiple copies or erasure coded pools, and you can mix and match them. There are all kinds of interesting things you can do to protect your data, and lots of features around failure handling; when a piece of media goes bad, Ceph should detect it automatically and re-replicate the data to ensure high availability in the cluster. So anything and everything you can think of in scale-out storage software, you see those properties in Ceph as well.
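To illustrate the durability trade-off just mentioned, here is a small Python sketch comparing usable capacity under 3-way replication versus a hypothetical 4+2 erasure coded pool; the cluster size and pool parameters are made-up examples, not numbers from the talk.

```python
def usable_replicated(raw_tb, copies):
    """Usable capacity of a replicated pool: raw capacity divided by the copy count."""
    return raw_tb / copies

def usable_erasure_coded(raw_tb, k, m):
    """Usable capacity of a k+m erasure coded pool: k data chunks out of k+m total."""
    return raw_tb * k / (k + m)

raw_tb = 400  # hypothetical raw cluster capacity in TB
print(f"3x replication : {usable_replicated(raw_tb, 3):.0f} TB usable")
print(f"EC 4+2         : {usable_erasure_coded(raw_tb, 4, 2):.0f} TB usable")
# Replication costs more capacity but keeps the I/O path simple and fast;
# erasure coding trades CPU and latency for better space efficiency.
```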
It's an open source software primarily supported by Red Hat, lots of community
contributions from big storage vendors including Intel, Samsung, SanDisk,
Mellanox, lots of service providers as well contributing to the community work.
It's popular for block storage within the context of OpenStack deployments. So wherever you see an OpenStack deployment, Ceph tends to be the choice of block storage backing your virtual machines.
It provides three different constructs typically,
which are very important for lots of customers.
So you can use it as a shared file system, which is called CephFS; you can use it as a distributed block layer, which is called RBD; or you can use it through the object interface, which is the RADOS Gateway, with S3-compatible as well as Swift-compatible REST APIs. So it gives you those three interfaces on top of the common foundation, which is the object layer.
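As a quick illustration of the block (RBD) path just described, here is a short sketch using the standard python-rados and python-rbd bindings to create and touch an image. The pool name, image name, and config path are assumptions for the example, and exact binding details may vary by Ceph release.

```python
import rados
import rbd

# Connect to the cluster using the usual client config/keyring (path is an assumption).
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('rbd')                        # an existing RADOS pool
    try:
        rbd.RBD().create(ioctx, 'demo-image', 4 * 1024**3)   # 4 GiB block image
        with rbd.Image(ioctx, 'demo-image') as image:
            image.write(b'hello from librbd', 0)             # write at offset 0
            print(image.read(0, 17))
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```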
And from an NVMe workloads perspective, we look at three different groupings of workloads. If you look at this chart, there is capacity on the X axis and performance on the Y axis, and three types of workloads. One is high-IOPS, low-latency workloads, which are typically database-type workloads. Then you are looking at throughput-optimized things like content delivery networks, VDI, cloud DVR, and so on; big data workloads are at the top. And then there are the archival, capacity-type workloads, which are really object-based workloads. We look at NVMe relevance across all three of these workload spectrums, as opposed to looking at NVMe SSDs just for the low-latency workloads, which is very important to remember. If you see the trend, what's happening is that there are NVMe SSDs that are low-latency media, like an Optane SSD that will give you a million IOPS at less than 10 microseconds of latency, and you're really looking at those for the low-latency workloads. And then on the capacity side, lots of vendors out there, including Intel with 3D NAND based media, are really looking at capacity-oriented workloads: mostly content delivery, archive, backup, and object-based workloads. So think of it as NVMe SSDs having relevance across the board, as opposed to just the low-latency workloads. So if you
were to zoom into the Ceph architecture: the top portion is the Ceph client and the bottom portion is the Ceph storage nodes. If you look at the top portion, the way we are looking at NVMe SSDs today is that Ceph does caching, specifically for the block workloads; it uses DRAM as the caching layer, and we are looking at extending that with NVMe SSDs so we can give much more cache real estate on the client nodes and bring in the value prop of NVMe SSDs. So that's the focus area on the client side.
And then on the back end, in the storage nodes where you have Ceph, there are two different configurations today. One is actually production ready, which is called the FileStore backend, and the other one is BlueStore, which is currently in tech preview mode. Both of these configurations can take advantage of flash, using NVMe SSDs to speed up your writes and to serve as a read cache to speed up your reads. So we see both scenarios as pretty relevant for the storage nodes as well. That's the intersection point when you look at where NVMe SSDs make sense within the context of Ceph. So from a consumption perspective, as an end user,
you are really looking at three different configurations. Again just
looking at NVMe SSDs and what kinds of configurations we typically see in end-customer deployments. Standard is where you essentially have a mix of NVMe SSDs paired up with hard disk drives, and you use the NVMe SSDs to speed up your writes as well as to cache and service the reads. So it's meant for both write and read caching, paired with the hard disk drives. Typically, with one high-endurance NVMe SSD (the P3700 is Intel's high-endurance NVMe SSD card), you can pair up with sixteen 4 TB hard disk drives. That's kind of the ratio, based on the benchmarking.
The next one is the better configuration, which is balanced with the best TCO. There you are looking at a combination of NVMe SSDs with SATA SSDs as a way to get lower-latency types of SKUs. Normally you pair up one NVMe SSD with around six SATA SSDs, and you use the NVMe SSD to speed up your writes while the reads get serviced from the SATA SSDs. So that's the breakdown for the better configuration. And the best configuration is essentially all NVMe. Based on the current testing that we have done and all the optimizations, we are really looking at around four NVMe SSDs per node as the design point; beyond that, Ceph will not scale. We are looking at lots of things to optimize it, but currently that's the recommended configuration. For all of this to work, you really need to make sure that the CPU has enough power, irrespective of which configuration you choose, and that you have enough networking bandwidth. So you do need a balanced system config from a compute, storage, and networking perspective so that it can scale and be optimized.
Yeah?
So you said that Ceph does not scale beyond four NVMe SSD devices per node. Why is that?
Yeah, let me go over the details and give some glimpse into where the challenges are. So the question is why Ceph is not scaling beyond four NVMe SSDs. If you look at the Ceph design point, it started 10 years ago, and SSDs in those days were pretty much unheard of, so most of the design thinking was: I need to scale across nodes, because my hard disk drives give me limited throughput, and the way I can scale is to have a scale-out design pattern across many nodes. So it is really designed for hard disk drives, and then things evolved in terms of speeding up the reads and writes using NVMe SSDs as an enhancement. Now we are looking at lots of bottlenecks in terms of threading that really need to be optimized, so that's the focus: the software needs to be optimized. There are lots of threads; it was designed for hard disk drives, so certain things that are not a big deal there become a problem with NVMe SSDs, where your software has to be really lean and the path should be extremely narrow. That's the area that needs work.
What about FileStore and BlueStore? I thought that for SSDs you would be using BlueStore, which is basically optimized for flash, so you would not run into those scaling limitations.
Yeah, so if you look at the Ceph data path, the way I look at it is: there is a networking section, where the I/O comes into the networking layer in Ceph; there is the core data path that makes the decisions about where the data is distributed, how you protect the data, and how you group and shard it, and all of that logic is in the middle layer; and then the bottom layer is the core I/O to the media. BlueStore essentially comes in at the bottom.
So with the BlueStore changes, you are bypassing the file system and using the raw media directly. So you are speeding up the I/O to the media, and the layer above it will be optimized. There is also networking optimization that is pretty close, with the RDMA and XIO messenger patterns, so you can get at the networking stack as well. The piece that really needs work is in the middle.
In RADOS?
RADOS, yeah. It's called the OSD layer. There are lots of choke points where everything waits on, let's say, PG locks and certain other serialization points. Those choke points need to be eliminated for us to really speed things up for NVMe SSDs. Good question.
So I'm going to spend some time on the current state of NVMe SSD performance. For the performance work, we focused on MySQL as the key benchmark. The reason is that it is very popular in the OpenStack context I talked about, where Ceph is popular, and lots of customers are looking at MySQL as one of their key workloads. So we picked it for its relevance as well as its popularity. We also do lots of synthetic benchmarks, random I/O as well as sequential I/O, reads and writes with different sizes, to look at where the optimization points are. So we looked at the synthetic benchmarks as well as MySQL as a way to assess the current performance and where the bottlenecks are. The system config that I'm going to talk about is a five-node Supermicro cluster.
It has 10 slots: you can put six NVMe SSDs on one NUMA node and four on the other, so six plus four is what you have with the Supermicro chassis. In our testing we played around with various configurations and settled on four NVMe SSDs per node, but each NVMe SSD is partitioned into four logical regions, and we have a Ceph OSD managing each one of them. That's the only way to really get scaling with the current configuration: partition each NVMe device into four, and have each region managed by one object storage daemon (OSD), which is essentially the critical Ceph component that manages storage to the media. On the client side, per node, we used four Docker containers. Two Docker containers are the clients actually running the benchmark, which is called SysBench, and two of them are doing the work, hosting the MySQL database server.
With that, there are certain things we have to optimize when you look at NVMe SSDs and NUMA system configs. The key thing, if you look at the right-hand side, is that we have partitioned things in such a way that your compute, memory, NVMe SSDs, and networking are all attached to the same NUMA node. So you are not going across the inter-socket link to get the job done when the I/O comes in. That is a very, very critical point when you want to take advantage of NVMe SSDs: how they are physically laid out in the system, which is NUMA-aware partitioning. That's one key thing. The second thing is that you need to make sure the interrupts for your NVMe and NIC devices land on the same processor, so soft IRQ balancing and everything needs to stay on the same NUMA node, not the other one, so that you're not incurring the cross-socket overhead. And then, for a given node, because we had four NVMe SSDs, each one partitioned into four subsections, we ended up with 16 OSDs. So you really need a fairly high-end Xeon E5 SKU, around an E5-2690 or beyond, with networking that is at least dual 10 GbE, but preferably 40 GbE or beyond.
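As a rough sketch of the kind of NUMA alignment being described, the snippet below reads a device's NUMA node from sysfs and pins the current process (for example an OSD wrapper) to that node's CPUs. The sysfs paths and device name are assumptions about a typical Linux system, not something prescribed by Ceph; in practice this is often done with numactl or systemd CPUAffinity instead.

```python
import os

def numa_node_of(block_dev="nvme0n1"):
    """Best-effort lookup of the NUMA node an NVMe namespace hangs off (path is an assumption)."""
    path = f"/sys/block/{block_dev}/device/device/numa_node"
    with open(path) as f:
        return int(f.read().strip())

def cpus_of_node(node):
    """Parse the cpulist for a NUMA node, e.g. '0-13,28-41' -> [0, 1, ..., 41]."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpus = []
        for part in f.read().strip().split(","):
            lo, _, hi = part.partition("-")
            cpus.extend(range(int(lo), int(hi or lo) + 1))
        return cpus

if __name__ == "__main__":
    node = numa_node_of("nvme0n1")
    os.sched_setaffinity(0, cpus_of_node(node))   # pin this process to that node's CPUs
    print(f"pinned to NUMA node {node}: {sorted(os.sched_getaffinity(0))}")
```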
Go ahead.
So when you partitioned the NVMe device into the four pieces, did you put the journals on the same disk, or would you have a different disk for the journal?
Yeah, excellent question. The testing here is focused on the BlueStore backend, not FileStore. You don't need a journal with BlueStore; with FileStore, you need a journal. With BlueStore, essentially the metadata and the actual data sit on the same partition, so it's using the same partition.
Yeah, yeah, you can do that.
But the problem is, if you look at the P3700, you're getting around 400K read IOPS (4K random reads) and around 200K random write IOPS. If you put one OSD on it, you're probably going to get at most maybe 10 to 20K IOPS out of the device. So if you really want to maximize device throughput, you have to slice it into four and put more OSDs on it. But that's not the long-term direction we want. We really want to address the performance so that one OSD can manage one NVMe SSD and drive its full throughput, but we are not there yet. That's the reason it's a workaround, more than anything else, to squeeze the maximum performance out of the NVMe SSD; if you put just one OSD on it, there is no way you can get that performance back.
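A quick back-of-the-envelope sketch of the workaround being described, using the ballpark numbers from the talk (they are approximate, and the per-OSD ceiling varies a lot with hardware and tuning):

```python
# Approximate figures quoted in the talk.
device_read_iops = 400_000      # P3700 4K random read capability
per_osd_iops_ceiling = 20_000   # roughly what a single OSD could drive at the time

def osds_needed(device_iops, per_osd_iops):
    """How many OSDs it would take to saturate one device at these ceilings."""
    return device_iops / per_osd_iops

print(f"OSDs to saturate the device: ~{osds_needed(device_read_iops, per_osd_iops_ceiling):.0f}")
print(f"4 OSDs per device extract:   ~{4 * per_osd_iops_ceiling:,} IOPS "
      f"({4 * per_osd_iops_ceiling / device_read_iops:.0%} of the device)")
# In practice CPU and memory per OSD become the limit well before the device does,
# which is why four slices per device was chosen as the practical design point.
```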
Okay, so the first chart is all about random read and write performance with five nodes and four NVMe SSDs per node. We can get to 1.4 million IOPS at close to one millisecond latency, and you can stretch it all the way up to 1.6 million IOPS at roughly 2.2 milliseconds of latency. That's really pretty good when you compare it with all-flash system configs and the price that you pay here: you are talking about open source software, purchasing standard high-volume servers, and putting in your own SSDs from your preferred vendor, so you have a lot of flexibility in the system config. With open source software you can get beyond a million IOPS in a five-node cluster, which is fairly amazing. It doesn't mean it is completely optimized from an efficiency perspective, in terms of whether it can take full advantage of the media throughput, but this is roughly on par with what you normally see from most storage stacks, including storage appliances out there. And if you look at the database performance, we can get to about 1.3 million queries per second with 20 clients driving the SQL workload, each one running eight threads. So you're looking at 160 threads hitting this five-node cluster, and you can get to around 1.3 million queries per second.
So this is the 100% random read case. In the interest of time, I touched on just these two charts.
There are a lot of details in the backup as well.
The notion is that you can actually take NVMe SSDs,
you can optimize the system config for the NUMA nodes,
you can actually get a reasonable amount of performance
and acceptable latency, even for the database workloads,
to give you a flavor that it is ready for deploying low latency workloads.
Now, let me briefly touch on what's going on in the community, at a fairly high level. Caching work is not done yet on the client side, so we are looking to the client-side caching work to speed things up quite a bit. The caching pattern is going to be crash consistent, ordered write-back, and shared on a compute node. We are looking at those properties as a way to speed up VDI workloads; ephemeral compute instances can benefit immensely from that kind of property. That one is still in progress; we are looking at sometime next year for a production-ready caching solution that can expand beyond DRAM and take advantage of the NVMe SSDs on the compute nodes. The second item is compression, which is currently in progress, and we are looking at dedupe as an extension to it, as a way to optimize the actual storage capacity for the flash-based backends, because without dedupe the value prop quickly goes away on flash. So dedupe is a very, very critical component from an efficiency perspective. That work is in internal design review, and we are looking to upstream it in the community. Long tail latency: I didn't touch upon the long tail latency;
I talked about the average latency, which is around one millisecond or so. The goal is to make sure that the long tail latency beyond the 99th percentile is tolerable, and not way beyond the few-milliseconds range. So the goal is to optimize the long tail latency performance as well, which is absolutely needed for the low-latency workloads. And then we touched upon this one: yes, we can take an NVMe SSD, which is a very performance-oriented device, and carve out four slices as the optimal design point, but that also adds lots of threads, lots of memory, lots of networking, and lots of compute overhead. The goal is to go back to the design pattern and ask what we can do to optimize the storage stack in the OSD to really take advantage of the NVMe SSD
throughput in a much more efficient way. There are two things we are looking at from a broad focus perspective. User-mode implementation is the theme we are pushing in the Ceph implementation. There is user-mode networking: instead of doing lots of context switching, the goal is to do user-mode networking using the Data Plane Development Kit (DPDK) as the foundation. It's very popular for comms workloads, and we are trying to use it for storage workloads. There is a talk in this room right after this session, at 3 o'clock I think, right, Ben? Ben is going to talk about the Storage Performance Development Kit (SPDK): how you optimize device access using SPDK, which is essentially built on top of the DPDK foundation but aimed at storage workloads, what we really need to move into user space, and how we optimize it. So the goal is to take advantage of the existing work that is happening, move as much as possible into the user-mode stack, and optimize and address the efficiency problem that we currently have.
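Purely to illustrate the idea behind that user-mode, polled approach (and not SPDK's actual API), here is a toy Python sketch: completions are reaped by busy-polling a queue in the application's own context rather than sleeping on an interrupt or making syscalls. The "device" and its service time are made up.

```python
import time
from collections import deque

class ToyDevice:
    """Fake device that completes each submitted I/O after a fixed service time."""
    def __init__(self, service_us=10):
        self.service_us = service_us
        self.inflight = deque()   # (tag, completion_deadline) in submission order

    def submit(self, tag):
        self.inflight.append((tag, time.perf_counter() + self.service_us / 1e6))

    def poll(self):
        # Return whatever has completed by now; no interrupts, no blocking.
        done, now = [], time.perf_counter()
        while self.inflight and self.inflight[0][1] <= now:
            done.append(self.inflight.popleft()[0])
        return done

dev = ToyDevice()
for i in range(1000):
    dev.submit(i)

completed = 0
while completed < 1000:          # busy-poll the completion path on a dedicated core
    completed += len(dev.poll())
print("all completions reaped by polling")
```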
And then the last one is more to do with the persistent memory integration.
So we are looking at the Linux DAX extensions as well as AppDirect,
which is the direct integration to the media.
Those are the two constructs as a way to integrate the BlueStore with the persistent memory.
So that can bring in the persistent memory value prop for the specific use cases.
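For flavor, here is a minimal sketch of the DAX-style access model: memory-mapping a file that lives on a DAX-mounted filesystem and storing to it directly, with loads and stores instead of read/write syscalls. The mount point and file name are assumptions, and real persistent-memory code would also handle cache flushing and failure atomicity (for example via libpmem), which this toy skips.

```python
import mmap
import os

PMEM_FILE = "/mnt/pmem0/bluestore-demo.bin"   # assumes a DAX-mounted fs at /mnt/pmem0
SIZE = 4096

# Create and size the backing file, then map it into the address space.
fd = os.open(PMEM_FILE, os.O_CREAT | os.O_RDWR, 0o600)
os.ftruncate(fd, SIZE)
with mmap.mmap(fd, SIZE) as m:
    m[0:16] = b"metadata-record\x00"   # store directly into the mapping
    m.flush()                          # msync; real pmem code would use CLWB/non-temporal stores
    print(m[0:15])
os.close(fd)
```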
So this gives you the broad spectrum of what is going to happen between now and next year to optimize more and more and take advantage of the NVMe SSD bandwidth.
So for the VMs, how do we make sure we bind them to a particular socket? How is that being controlled? How is the data flowing in?
So with VMs, there are essentially two different ways of integrating when it comes to Ceph. One is QEMU, user-mode virtualization, which is the dominant implementation when it comes to OpenStack. There is a back-end driver for QEMU called librbd, which is based on the user-mode stack, so we can do the optimization on the user-mode side. There is also a kernel-mode driver; for container frameworks, if you want to instantiate containers that are backed by Ceph, you have to use the kernel RBD driver. Most of our focus is essentially the user-mode QEMU layer, so the whole caching effort I'm talking about is really on the user-mode side. There are lots of kernel-mode, device-mapper-based caching solutions you can use, flashcache, dm-cache, and so on, and there is also Intel's caching software. You can pair up with any one of them in kernel mode and take advantage of it, but our focus is user mode when it comes to the client.
Yeah, you do have to optimize. The way it works in Ceph is that it's not just about having a powerful system config on the storage node. You also need to make sure that the client side has enough bandwidth and is also NUMA optimized; otherwise you are going to run into bottlenecks on the client side before the storage node. So you do have to consider both sides of the configuration, but predominantly the biggest bottleneck is on the NVMe SSD and OSD side. If you don't do the NUMA optimization there, everything will fall apart. That's the critical piece; it doesn't mean you don't do it on the client side, but that's the critical piece.
Okay, so we're gonna run out of time.
I want to make sure I give enough time
for the vSAN side of the discussion.
So I will hand it over to Swaroop
to cover the VMware vSAN.
All right, thanks, Reddy.
So I'm going to describe
what we are doing with vSAN and NVMe.
So this is that part of the presentation.
So how many of you are aware of VMware's vSAN?
Okay.
Okay, so let me give you a quick overview.
So we introduced virtual SAN, vSAN as we call it internally,
in March of 2014.
That was the first release of vSAN.
Essentially what vSAN does is it provides cluster storage across multiple server nodes.
And the beauty of vSAN is it's embedded within the vSphere kernel itself,
so it's not a virtual storage appliance running on top of ESX.
And once you have server nodes, and these can be any server nodes, right?
That's how we differentiate against some of our competitors where you can buy servers
either from HP, Dell, Cisco, Lenovo, Fujitsu, whoever is your favorite vendor.
The only thing we ask is you buy a bunch of SSDs, a bunch of HDDs or SSDs,
and essentially vSAN will club the SSDs and HDDs from the different servers
and make it into a giant data store.
There's no concept of LUNs, no concept of volumes.
There is no traditional SAN fabric, SAN fabric switching.
None of that exists.
Basically the VM IO comes into the vSAN storage stack and
it goes into the devices underneath.
And in terms of availability, we provide replicas for
each of the VMDK objects which are persisted on the vSAN data store.
So your entire host may go down, but the replica is available on another node, and vSAN is able to retrieve the data and serve it back to the VM. This all forms a kind of new industry paradigm that is coming, called hyper-converged infrastructure.
Hyper-converged is the key term there, because all the I/O is being processed within the kernel, within the hypervisor itself. The hyper-converged infrastructure consists of the compute stack, the networking stack, and the storage stack, and all of this is layered with a pretty comprehensive management control plane on top of it. That's how hyper-converged infrastructure is defined, and it's different from what you may have heard of as converged infrastructure, where the hypervisor may not play as critical a part as it plays in hyper-converged infrastructure.
So that's kind of the basic difference between HCI and CI.
So we saw an overview; I already talked about it. This is completely software-defined, so there is no proprietary hardware component. We do not use any specific ASICs or FPGAs for dedupe, compression, encryption, and so on; these are all standard components which ship with every Intel server out there. It's a distributed scale-out architecture, so you essentially have vSAN embedded within the kernel on each node, and they all communicate over a 10 GbE link between them.
And it's integrated well within the vSphere platform. Since we are VMware, we ensure interop with every component out there, whether it's vMotion, Storage vMotion, HA, DRS, or vRealize Operations, which is our management suite; all of them are integrated well with vSAN. And it's a policy-driven control plane: right from the design, we built it around per-VM policies. What does that mean? On a per-VM basis you can set, hey, what should the performance look like? What should the availability look like? Should I make two replicas of the VMDK, or three replicas, et cetera? Those are some of the policies you can set, in addition of course to the capacity policies, all on a per-VM basis. It's very different from traditional storage, where most of your policies are LUN based or volume based. Virtual Volumes, which is another initiative, another product at VMware, also now provides per-VM policies for traditional storage, but for vSAN we essentially built the product from the ground up with per-VM policies in mind.
So vSAN operates in two modes: we have something called hybrid mode, and we have an all-flash mode. What I mean by that is that we have two tiers within vSAN. One is the caching tier, where we have a read cache and a write buffer, and the capacity tier is essentially the persistence store for vSAN. For the caching tier, we dictate that you have to have a flash device there, so it could be SATA SSDs, PCIe-based SSDs, or, as we are now seeing with a lot of customers, NVMe-based SSDs. For the capacity tier, you have a choice of using HDDs, which makes it hybrid mode, or you can use all flash, which is essentially SSDs in the capacity tier as well. If you ask me what the adoption curve for all-flash versus hybrid looks like with vSAN: if you had asked me last year, I would have said all-flash is probably 10% of our adoption, but this year it's more than 40 to 50%. And even there, with NVMe, last year around June or July, when we were talking to Intel, we didn't have any NVMe device on the HCL. This year we have about 30 to 40 NVMe devices on the HCL, there are a lot of customers who since the beginning of the year have been putting NVMe in the caching tier, and now we have more than 10 accounts who have NVMe across both the caching and the capacity tiers. So we have seen some real evidence of how NVMe is being adopted in the industry. I forgot to mention the performance. I think we are one of the few vendors who actually state performance pretty explicitly: it's 40K IOPS per node for the hybrid mode and about 100K IOPS per host with all-flash, and if you start putting in devices like PCIe SSDs or NVMe SSDs, sub-millisecond latencies become very, very common.
So this is how current vSAN all-flash looks. I mentioned the tier-one caching; tier two is all about data persistence. In all-flash mode, the caching SSDs are high performance, high endurance, and very high throughput for caching the writes. We do not do any read caching in all-flash mode, while the tier-two data persistence layer is more read intensive, lower endurance, and generally less expensive than the tier-one caching. Most of our customers use either SATA or SAS SSDs in the capacity tier, and in the caching tier they were using SAS and SATA SSDs, but now it's generally becoming either PCIe SSDs or NVMe SSDs; the Intel P3700s are becoming pretty common there, with the P3510 on tier two. We have space efficiency in 6.2, which is the latest release we have for vSAN. Earlier, before we introduced dedupe and compression, what we had to do was essentially make replicas of the object, and an object for us is essentially a VMDK, so we create multiple replicas based on the policy the user has set. In 6.2, which we introduced in March of this year, our latest release, we introduced dedupe and compression and erasure coding, which we loosely use to mean RAID 5 and RAID 6. Depending on the workload, we have seen about 2x to 8x savings. If you're doing, for example, VDI full clones, that's about 7x to 10x savings. If you're doing something like Oracle single-instance databases, compression is what takes effect there, and we are seeing anywhere from 5% to 25-30% compression ratios with Oracle.
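To make the space-efficiency trade-off concrete, here is a small, hypothetical Python sketch comparing the raw capacity consumed by a VMDK under mirroring versus the RAID-5/RAID-6 style erasure coding described above. The VMDK size is made up, and real overheads also depend on the dedupe and compression ratios, which vary by workload.

```python
def raw_consumed(vmdk_gb, scheme):
    """Raw capacity consumed for one protection scheme (overhead factors are illustrative)."""
    overhead = {
        "mirror-ftt1": 2.0,    # 2 full replicas (RAID-1 style, tolerates 1 failure)
        "mirror-ftt2": 3.0,    # 3 full replicas (tolerates 2 failures)
        "raid5-3+1":  4 / 3,   # erasure coding, tolerates 1 failure
        "raid6-4+2":  6 / 4,   # erasure coding, tolerates 2 failures
    }[scheme]
    return vmdk_gb * overhead

vmdk_gb = 100  # hypothetical VMDK size
for scheme in ("mirror-ftt1", "mirror-ftt2", "raid5-3+1", "raid6-4+2"):
    print(f"{scheme:12s}: {raw_consumed(vmdk_gb, scheme):.0f} GB raw for {vmdk_gb} GB of data")
```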
Performance, of course: if you're using all-flash, in terms of IOPS we are about four times higher than hybrid vSAN. Again, hybrid vSAN means HDDs are used in the persistence tier. And when you start using devices like NVMe, sub-millisecond latency response times are what you could
expect. We support almost all applications out there. When we started out with vSAN 1.0 about two years back, since everybody knows that with a storage 1.0 product you have to establish credibility and make sure the industry believes it is an enterprise product, we kept the number of use cases very restricted: VDI, DR target, a few dev/test use cases. But two years down the line, we have opened it up for any use case out there. So there are customers deploying things like Oracle RAC, MSCS, Exchange, SQL; every application out there is being deployed on vSAN. And we have some pretty interesting use cases. For example, Lufthansa: their aircraft have about 300,000 sensors, and each sensor sends data back to their main data center, where they run analytics that spit out a kind of report on how the flight did and what are some of the things to take care of, and within an hour the technician has to address all the concerns in this report and get the flight back up and running. This is being done on their A380s and so on. Some very interesting cases. There are also defense sectors where we find it being deployed in particular form factors. It works very well there because in some of these defense sectors they want particular form factors; they do not want an appliance form factor, which may not fit into a particular space, for example. So vSAN works beautifully for them because they can choose particular hardware, just slap the vSAN software on top of it, and it runs. So some very interesting use cases out there. And I forgot to mention, we have about 5,000 customers, so it's growing pretty rapidly; it's one of the fastest-growing storage products I have worked on, at least. So how is the market evolving? This is an IDC graph, and the IDC
predictions kind of align with how we are seeing NVMe adoption growing within vSAN. We see that SATA SSDs in particular are going to start trickling down, while PCIe-based SSDs and SAS SSDs will probably occupy more than 60 to 70% of the market. And by 2018, what IDC projects is that about 70% of the client SSD market will be NVMe based.
So what are the benefits of vSAN with NVMe?
So I won't go through the overview, Reddy did a good job about that.
But it's ideal for the caching tier.
We are getting a lot of customers who are using NVMe today with vSAN and we highly recommend
that.
There are a few customers who are beginning to use it
even in the persistent store also.
We have NVMe devices certified for the vSAN caching tier
and specifically for all flash configurations.
You can also use it for the hybrid model.
It works perfectly for that.
And what we are doing specifically on the roadmap, which I will go through in a little more detail, is enhancing both the ESXi storage stack and the vSAN storage stack to make them effective enough to take real advantage of NVMe and some of the NVMe 1.2 features which are also coming up. We'll talk about that, and Murali will also go into some more detail in that regard. This is how our HCL looks: the Intel P3700s and S3510s are all listed there. This is a chart which Intel recently published. I know this chart may
be a bit convoluted, but let me see if I can explain it. The bar graphs you see correspond to capacity, while the line graphs correspond to IOPS. Intel did it in four modes. One is a pure performance deployment, where you really care about performance and nothing else. Or you may go to the other extreme, where you do not really care about performance but you want to make your deployment highly, highly available. Or you can have it capacity-based, where you want large capacities within the cluster. Or you may say, I want a balance between performance and capacity. As you can see from the configuration that was used, it's about 1.2 million IOPS for the performance deployment, and of course, as you trade availability off against performance, your performance tapers off quite a bit. But what this graph essentially shows is that with NVMe, the kind of dollar-per-IOPS and dollar-per-gigabyte you get is pretty remarkable.
In fact, we get a lot of customers who ask us,
Hey, isn't NVMe too expensive?
But these are the kinds of graphs we have been publicizing along with Intel to show that the amount of performance you get for the cost of NVMe is pretty remarkable, and it's pretty encouraging for customers to use NVMe. We also did a raw comparison, taking an 8-node vSAN cluster and comparing it with an 8-node NVMe-based cluster, and these are the IOPS we were getting with these configurations. What we also observed in terms of latencies was that on the SSD side we had about 1.5 milliseconds of latency, while with NVMe we were getting 1 millisecond latencies or slightly less than that.
So what's coming next?
What's the future for NVMe within vSAN?
So let me go through some of the ideas out there. The Intel Optane SSDs are the ones we are working on closely with Intel; they show some pretty impressive reductions in latency, approximately 10x compared to NAND SSDs, so that's a direct benefit to vSAN if you're going to use them in the caching tier or even in the capacity tier. We have done some measurements along with Intel on application performance, the ESXi application performance: it's about two and a half times faster than NAND PCI Express. And these are raw numbers; we haven't really done any software optimizations yet. This is just taking out the SAS and SATA SSDs, plugging in NVMe SSDs, and these are the numbers we are getting.
There are a lot of software optimizations which we have started to work on. So today we have a comparison of how NVMe performs without any ESXi storage stack optimizations, which is what you get today; the gray bars show that. But we are also prototyping a lot of optimizations within the ESXi storage stack. Reddy mentioned how you can parallelize across an NVMe device, and those are some of the optimizations we are introducing within the storage stack; as you can see, the changes within the ESXi storage stack are getting us quite a few benefits. Of course the prototype is not released, these are all internal results as of now, but I thought I would show you what one can aspire to as they start using NVMe with vSAN and perhaps even with ESXi. So I won't go through this, but let's look at
what the next-generation hardware offers with NVMe and also with NVDIMMs, or persistent memory. There are four areas we are focusing on. One is high-speed NVMe, which we hope to leverage as-is and which will provide performance benefits to vSAN right away. There are ESXi storage stack enhancements we are looking at, as I mentioned on the previous slide. We are also looking at the tiering: I mentioned that we have a two-tier architecture today, and there is no reason, especially with NVDIMMs and so on coming into the market whenever they do, why we couldn't collapse this into a single-tier all-flash architecture. What that would mean is using NVDIMMs for just the metadata while using SAS, SATA, or NVMe SSDs for the persistent store. Or we can still keep it as a two-tier architecture where you have the metadata in NVDIMMs, the write cache on NVMe, and SAS and SATA SSDs as your persistent store.
We are also looking at RDMA over Ethernet to boost network transfers. As the device latencies start becoming smaller and smaller, the network latencies will start becoming the bottleneck, and that is why we are looking at RDMA to help address the networking side as well. Right now it's not a challenge for us, but once these device latencies start shrinking further, the network latencies will start becoming an issue. So with that, I will hand it over to Murali to go over some of the native driver stack and some of the technical details.
So I think we're going to run out of time here, but what I wanted
to share is a snapshot of where we stand in terms of NVMe support today. So first
let's look at our driver stack and where we play in terms of NVMe or complementary technologies. There are fundamentally two places where we do drivers. One is in the I/O stack, that is, NVMe for the PCIe space as well as NVMe over Fabrics; those are the two fundamental drivers. Along with that we have drivers for RDMA, such as a RoCE driver. That's the driver stack on the I/O side today.
The other place is something we call virtual NVMe. Essentially that sits below a guest operating system and allows the guest operating system to use its native NVMe driver, so that it can use this virtual device to funnel in I/Os that are natively NVMe end to end.
So those are the places we play. And what I want to share is our future plans. So far we have support for the NVMe 1.0e spec, and that's supported both inbox as well as with an async driver. In the future, what we're going to do is try to support the 1.2 feature set; I can't speak to the exact timetable on this one, but it's coming soon. That deals with multiple namespaces and multiple queues, so it actually addresses the performance things that Swaroop was talking about. We will also have a driver for NVMe over Fabrics, and we'll address RoCE as the fabric to start with, followed by other transports as our customers see a need.
And we have a plan for the virtual NVMe device I talked about, the one that sits below the guest operating system; that should be very interesting for other operating systems that sit inside VMs. And finally, at some point we plan to have a completely native NVMe stack. In other words, it will coexist with the existing stack for a while, but the outlook is that at some point we're going to transition completely to an NVMe stack. So that's the outlook for the coming years.
And you can read about this.
You can actually download the drivers today.
We have drivers for the 5.5 as well as the 6.0 ESXi platforms.
Also, we have a project where we're inviting device vendors to contribute to the 1.2 driver that's coming soon. There's information in the links here; you can log in there and find more details.
And I'll be happy to talk to you after this session.
So that's fundamentally that.
I think we'll leave some room for questions.
Yeah.
I mean, we can do questions for maybe two or three minutes.
We probably have to make way for the next speaker.
So any questions on the vSAN side?
I know we had a few during the talk.
Do you have any idea how much CPU or server overhead there is for this software stack?
For vSAN in general? What we advertise is less than 10%; we call it the vSAN tax. That's typically what we see. Now, for dedupe and compression you also have to incur some additional CPU, and that is typically, again, less than 10% of the tax you have already paid for vSAN without dedupe and compression. So let's say you have 100 cores: you need 10 more cores for vSAN. For dedupe and compression, another 10 percent of that, so 10 percent of 10, is one additional core, so 11 cores in total if you turn on dedupe and compression. So you have to go through a kind of TCO calculation to figure out: okay, I'm going to enable vSAN on this, how many more CPUs do I need to put in the cluster? So that's how it works.
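A tiny sketch of that sizing arithmetic, using the rule-of-thumb percentages quoted above (they are guidance, not a substitute for an actual TCO exercise):

```python
def vsan_core_overhead(total_cores, vsan_tax=0.10, dedupe_tax=0.10):
    """Extra cores for vSAN, plus dedupe/compression as a fraction of the vSAN tax."""
    vsan_cores = total_cores * vsan_tax
    dedupe_cores = vsan_cores * dedupe_tax
    return vsan_cores, dedupe_cores

vsan_cores, dedupe_cores = vsan_core_overhead(100)
print(f"vSAN: ~{vsan_cores:.0f} cores, dedupe/compression: ~{dedupe_cores:.0f} core, "
      f"total overhead ~{vsan_cores + dedupe_cores:.0f} cores")
```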
Do you have a mix of thin and thick provisioning in your environment?
Yes, yes. The default is thin, but you can toggle between thin and thick if you would like to.
In terms of consistency, is it an AP system or a CP system, in CAP terms? Eventual consistency or strong consistency?
So this is all strong consistency. Essentially, as soon as the write happens, once the writes are replicated to all the replicas, that's when the acknowledgment is sent back to the client. So even though it's an object-based system, and I know other object-based systems may be eventually consistent, with vSAN it's strongly consistent.
Any other questions?
So in the hybrid model, the caching tier is used for both read caching and write buffering. In the all-flash model, the caching tier is only for write buffering; the reads go directly to the capacity tier.
So you have the write buffer on each of the nodes, and it is replicated?
It's the write buffer on each of the nodes, yes.
All right. Thank you. Thanks for your time. Appreciate it.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list
by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further
with your peers in the developer community.
For additional information about the Storage Developer Conference, visit storagedeveloper.org.