Storage Developer Conference - #30: Bridging the Gap Between NVMe SSD Performance and Scale Out Software
Episode Date: December 8, 2016...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC
Podcast. Every week, the SDC Podcast presents important technical topics to the developer
community. Each episode is hand-selected by the SNIA Technical Council from the presentations
at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcast.
You are listening to SDC Podcast Episode 30.
Today we hear from Anjaneya "Reddy" Chagam, Principal Engineer with Intel,
Swaroop Dutta, Director of Product Management at VMware,
and Murali Rajagopal, Storage Architect, VMware,
as they present Bridging the Gap Between NVMe SSD Performance and Scale-Out Software from
the 2016 Storage Developer Conference.
So welcome to the Bridging the Gap Between NVMe SSD Performance and Scale-Out Software.
My name is Reddy Chagam.
I'm the Principal Engineer and chief SDS architect
from Intel Data Center Group.
I have two co-presenters with me.
I will actually let Swaroop and Murali introduce themselves.
Sure, thanks.
My name is Swaroop Dutta.
I manage the product management team at VMware
in the storage and availability business unit.
Thanks, Swaroop. My name is Murali Rajagopal.
I'm a storage architect in the Office of the CTO at VMware.
Fantastic.
Okay, so we are going to cover two broad scale-out software stacks.
I am specifically going to focus on the Ceph side.
Swaroop and Murali are going to focus on the VMware vSAN stack. The theme of this presentation is what exactly we are doing within the context of NVMe SSDs to speed up the software side of these two stacks, and we'll also give a flavor of the planned work going forward.
How many of you know the NVMe protocol, have heard about it, are familiar with the concept? Okay, so a few. So I'm going to give a very high-level view into what exactly an NVMe SSD is and how this plays out for the following discussion with Ceph.
If you look at NVMe, essentially it is the standard software interface between the host and PCIe-attached SSD media. Because it's a standard software interface, you can plug in a PCIe-attached NVMe SSD from any vendor and it should work without changing the host software. So that's the number one thing.
It uses queuing as the foundational construct for communicating with the media: there is a set of admin queues for setting up the SSD, and there are submission queues and completion queues, so you use the queuing construct to talk to the media in a standard way.
Of course, it is attached to PCIe lanes, which means there is direct connectivity from your CPU to the NVMe media; there is no intermediate chipset or moving component like a host bus adapter in between. That's a key point: because the device sits on PCIe lanes, you can take advantage of the bandwidth those lanes give you. With PCIe Gen 3, one lane gives you roughly one gigabyte per second of bandwidth, and most NVMe SSD cards out there are x4 or x8. An x4 PCIe SSD card typically gives you roughly half a million IOPS for reads and somewhere around 200K IOPS on the write side. With the Optane SSD you can get to 1 million IOPS with an x8 PCIe slot. So there is a significant amount of throughput and performance you can get out of NVMe SSDs, because they are directly attached, and you can scale based on the number of lanes rather than being limited by other interfaces.
The other thing is, obviously, you don't need a host bus adapter like a SAS controller and so on.
So it essentially gives you a cost reduction as well as power reduction plus performance.
So if you look at the Broadwell CPU, per socket you can actually get 40 PCIe lanes.
A bunch of them will be shared with the networking.
So you essentially will get enough number of PCIe lanes to drive the bandwidth that
is needed on the compute as well as the networking.
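As a rough illustration of the lane math just described, here is a minimal Python sketch; the per-lane figure and the 40-lane budget are the approximate numbers quoted in the talk, and the x8 NIC assumption is purely illustrative:

```python
# Rough PCIe Gen 3 bandwidth estimate for NVMe SSDs, using the
# ~1 GB/s-per-lane figure quoted above (approximate, before protocol overhead).
GEN3_GB_PER_SEC_PER_LANE = 1.0

def pcie_bandwidth_gb_s(lanes: int) -> float:
    """Approximate one-direction bandwidth for a Gen 3 slot with `lanes` lanes."""
    return lanes * GEN3_GB_PER_SEC_PER_LANE

for lanes in (4, 8):
    print(f"x{lanes} slot: ~{pcie_bandwidth_gb_s(lanes):.0f} GB/s")

# Per-socket budget mentioned for Broadwell: 40 lanes, part of which is shared
# with networking (an x8 NIC is assumed here just for illustration).
print("Lanes left for storage with an x8 NIC on a 40-lane socket:", 40 - 8)
```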
Few different form factors.
M.2 is really meant for boot type of use case, boot drive.
And then in the small form factor, which is U.2,
and then there is an add-in card.
So you essentially can use different form factors
based on the system configs that you're after
and the density and price points
that you're really looking for.
All right, so that's kind of the brief overview
of what NVMe protocol is about, what is NVMe SSD, and what are the benefits.
So I'm going to directly jump into Ceph side.
How many of you heard about Ceph, know Ceph?
Okay, reasonable number.
Okay, so Ceph is open-source, scale-out storage software.
It is built on the foundation of object interface,
which is called RADOS.
So that is the foundation layer.
You can actually take the RADOS foundation layer,
deploy on a pool of standard high volume servers.
So you can take a bunch of machines, put the Ceph software on top of them, and you get a scale-out storage stack whose data is protected beyond a host failure: you can have a host failure or a rack failure and still make sure the data is protected. So it gives you the scale-out property, and it also gives you durability; you can either have multiple copies or erasure-coded pools, and you can mix and match all kinds of interesting things to protect your data.
Lots of features relate to failure handling: when something goes bad, when a piece of media goes bad, you should be able to detect it and automatically re-replicate the data to ensure high availability in the cluster.
So anything and everything that you can think of in the scale-out software, you see those properties in Ceph software as well.
It's an open source software,
primarily supported by Red Hat, lots of community
contributions from big storage vendors including Intel, Samsung, SanDisk, Mellanox,
lots of
service providers as well contributing to the community work.
It's popular for block storage within the context of OpenStack deployments; wherever you see an OpenStack deployment, Ceph tends to be the choice for the block storage backing your virtual machines.
It provides three different constructs that are very important for lots of customers. You can use it as a shared file system, which is called CephFS. You can use it as a distributed block layer, which is called RBD. Or you can use it as an object interface, which is the RADOS Gateway: REST APIs, S3-compatible as well as Swift-compatible. So it gives you those three access methods on top of the foundation, which is the object layer.
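To make the RADOS and RBD layers concrete, here is a minimal sketch using Ceph's Python bindings (assuming the python-rados and python-rbd packages are installed, a reachable cluster, a default ceph.conf path, and a pool named 'rbd'; the object and image names are illustrative). CephFS and the RADOS Gateway are accessed separately, via a mount or S3/Swift clients.

```python
import rados
import rbd

# Connect to the cluster through librados (the RADOS object layer).
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

ioctx = cluster.open_ioctx('rbd')  # assumes a pool named 'rbd' exists
try:
    # Object interface: write and read a raw RADOS object.
    ioctx.write_full('hello-object', b'stored via librados')
    print(ioctx.read('hello-object'))

    # Block interface: create a 1 GiB RBD image and write to it.
    rbd.RBD().create(ioctx, 'demo-image', 1 * 1024**3)
    with rbd.Image(ioctx, 'demo-image') as image:
        image.write(b'stored via librbd', 0)
finally:
    ioctx.close()
    cluster.shutdown()
```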
From the NVMe workloads perspective, we look at three different groupings of workloads. If you look at this chart, there is capacity on the X axis and performance on the Y axis, and three types of workloads. One is high-IOPS, low-latency workloads, which are typically database-type workloads. Then there are throughput-optimized workloads, things like content delivery networks, VDI, cloud DVR, and so on; big data workloads are at the top. And then when it comes to archival, capacity-type workloads, those are really object-based workloads.
We look at NVMe relevance across all three of these workload spectrums, as opposed to looking at NVMe SSDs just for the low-latency workloads, which is very important to remember. If you see the trend, there are NVMe SSDs that are low-latency media, like the Optane SSD, which will give you a million IOPS at less than 10 microseconds of latency, so you're really looking at those for the low-latency workloads. And then on the capacity side, lots of vendors out there, including Intel, have 3D NAND-based media really aimed at capacity-oriented workloads, mostly content delivery, archive, backup, object-based workloads. So think of it as NVMe SSDs having relevance across the board, as opposed to only the low-latency end.
If you were to zoom into the Ceph architecture, the top portion is the Ceph client and the bottom portion is the Ceph storage nodes.
So if you look at the top portion, the way we are looking at it is this: today, Ceph does caching on the client, specifically for block workloads, using DRAM as the caching layer. We are looking at extending that with NVMe SSDs so we can give much bigger cache real estate on the client nodes and bring in the value prop of NVMe SSDs.
So that's the focus area on the client side. Then on the back end, in the storage nodes where you have Ceph, there are two different configurations today. One is production-ready, which is the FileStore backend, and the other one is BlueStore, which is currently in tech preview mode. Both configurations can take advantage of flash, using NVMe SSDs to speed up your writes and to speed up your reads with a read cache. So we see both scenarios as pretty relevant for the storage nodes as well. That's the intersection point when you look at where NVMe SSDs make sense within the context of Ceph.
So from the consumption perspective as an end-user you are really looking at
three different configurations. Again just looking at the NVMe SSD and what
kind of configurations that we typically see in the end customer deployments.
Standard is where you essentially have a mix of NVMe SSDs paired with hard disk drives, using the NVMe SSDs both to speed up your writes and as a cache to service reads. So it's meant for both write and read caching paired with hard disk drives. Typically, one high-endurance NVMe SSD (the P3700 is Intel's high-endurance NVMe SSD card) can be paired with 16 4TB hard disk drives; that's roughly the ratio based on the benchmarking.
The next one is the better configuration,
which is kind of balanced with the best TCO.
You are looking at a combination of NVMe SSDs
with SATA SSDs as a way to get all flash,
low latency type of SKUs.
Normally, you pair one NVMe SSD with around six SATA SSDs; you use the NVMe SSD to speed up your writes, and the reads get serviced from the SATA SSDs. So that's the breakdown for the better configuration. The best configuration is essentially everything all NVMe. Based on the current testing that we have done and all the optimizations,
we are really looking at around four NVMe SSDs per node
as the design endpoint.
Beyond that, Ceph will not scale.
And we are looking at lots of things to optimize it,
but currently that's the recommended configuration.
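As a quick sketch of the per-node ratios just described, here is a small Python helper; the 1:16, 1:6, and four-per-node figures come from the talk, while the function names and the example calls are purely illustrative:

```python
# Per-node device counts implied by the ratios quoted above.
def standard_node(nvme_count: int) -> dict:
    # One high-endurance NVMe (e.g. a P3700) fronting ~16 x 4TB HDDs.
    return {"nvme": nvme_count, "hdd_4tb": nvme_count * 16}

def better_node(nvme_count: int) -> dict:
    # One NVMe write accelerator per ~6 SATA SSDs.
    return {"nvme": nvme_count, "sata_ssd": nvme_count * 6}

def best_node() -> dict:
    # All-NVMe; ~4 NVMe SSDs per node is the current practical ceiling.
    return {"nvme": 4}

print("standard:", standard_node(1))
print("better:  ", better_node(2))
print("best:    ", best_node())
```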
For all these things to work,
you really need to make sure that the CPU has enough power,
irrespective of what configuration you choose, and then you have enough networking bandwidth.
So you do need to consider that it's a balanced system config from a compute, storage,
and networking perspective to make sure that it can scale and stay optimized.
So you said that Ceph does not scale beyond four SSD devices; what is the reason for that?
Yeah, let me go over the details and I can, you know,
give some glimpse into where the challenges are.
So the question is, why is Ceph not scaling beyond four NVMe SSDs? If you look at Ceph's design point, it started about 10 years ago, and SSDs in those days were pretty much unheard of. So most of the design theme has been: I need to scale across nodes, because my hard disk drives give me limited throughput, and the way I scale is with a scale-out design pattern across many nodes. So it is really designed for hard disk drives. Then things evolved in terms of speeding up reads and writes using NVMe SSDs as an enhancement, and now we are looking at lots of bottlenecks, in terms of threading, that we really need to optimize. That's the focus: the software needs to be optimized. There are lots of threads; with a design aimed at hard disk drives, certain inefficiencies are not a big deal, but with NVMe SSDs your software has to be really lean and the I/O path extremely narrow, and that's the area that needs work.
So from your previous slide, you're actually running on both the FileStore and the BlueStore. I thought that for SSDs you would be using BlueStore, which is basically optimized for flash, and then you would not have to run through those variations?
Yeah, so if you look at the Ceph data path, the way I look at it is this: there is a networking section, where the I/O comes into the networking layer in Ceph; there is the core data path that really makes the decisions about where the data is distributed, how you protect the data, and the whole mechanism of how you group and shard the data, and that logic is in the middle layer; and then the bottom layer is the core I/O to the media.
BlueStore comes in at the bottom. With the BlueStore changes, you are bypassing the file system and directly using the raw media, so you are speeding up the I/O to the media, and the layer above it will be optimized. There is networking optimization in progress with RDMA and the XIO messenger pattern, so you can get at the networking stack as well. The piece that really needs work is in the middle, the OSD layer. There are lots of choke points where everything waits on, say, PG locks, and those choke points need to be eliminated for us to really speed things up for NVMe SSDs.
Good question.
So I'm going to spend some time on the current state of NVMe SSD performance. For the performance work we focused on MySQL as the key benchmark. The reason is that it is very popular: when I talked about OpenStack, where Ceph is popular, lots of customers are looking at MySQL as one of the key workloads, so we ended up picking it because of its relevance as well as its popularity. We also do lots of synthetic benchmarks, random I/O as well as sequential I/O, reads and writes with different sizes, to see where the optimization points are. So we used the synthetic benchmarks as well as MySQL to assess the current performance and where the bottlenecks are.
The system config that I'm going to talk about is a five-node Supermicro cluster. It has ten slots: you can put six NVMe SSDs on one NUMA node and four on the other, so six plus four is what you have with the Supermicro. In the testing that we did, we played around with every other configuration and settled on four NVMe SSDs per node, but each NVMe SSD is partitioned into four logical partitions, and we have a Ceph OSD managing each one of them. That's the only way to really get the scaling with the current configuration: partition your NVMe SSD into four, and then have each partition managed by one object storage daemon, the OSD, which is essentially the critical component that manages the I/O to the media.
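A minimal sketch of that slicing step, just to illustrate the layout: this only prints example GPT partitioning commands; the device name, the use of parted, and the four equal slices are assumptions for illustration, and the actual OSD provisioning tooling used in the testing is not shown.

```python
# Print parted commands that would carve one NVMe device into four equal
# partitions, one per Ceph OSD, as described above. Illustrative only.
def partition_plan(device, slices=4):
    cmds = [f"parted -s {device} mklabel gpt"]
    step = 100 // slices
    for i in range(slices):
        start, end = i * step, (i + 1) * step
        cmds.append(f"parted -s {device} mkpart osd{i} {start}% {end}%")
    return cmds

for cmd in partition_plan("/dev/nvme0n1"):
    print(cmd)
```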
On the client side, per node we used four Docker containers. Two containers are the clients that actually run the benchmark, which is called sysbench, and two of them do the work, hosting the MySQL database servers.
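For reference, here is a sketch of what one of those sysbench client containers might run; the flags follow sysbench 1.0's OLTP test conventions, and the host names, credentials, table counts, and thread counts are placeholders, not the exact parameters used in the testing.

```python
import subprocess

# Illustrative sysbench OLTP run against one of the MySQL containers.
mysql = {"host": "mysql-node1", "user": "sbtest", "password": "sbtest", "db": "sbtest"}

common = [
    "sysbench", "oltp_read_write",
    f"--mysql-host={mysql['host']}",
    f"--mysql-user={mysql['user']}",
    f"--mysql-password={mysql['password']}",
    f"--mysql-db={mysql['db']}",
    "--tables=8", "--table-size=1000000",
]

subprocess.run(common + ["prepare"], check=True)                           # load the tables
subprocess.run(common + ["--threads=8", "--time=300", "run"], check=True)  # 8 threads per client
```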
Okay, so with that, there are certain things we have to optimize when you look at NVMe SSDs and NUMA system configs.
The key thing is, if you look at the right-hand side, we have partitioned things in such a way that your compute, memory, NVMe SSDs, and networking are attached to the same NUMA node, so you are not going across the inter-socket link to get the job done when the I/O comes in. That is a very critical consideration when you really want to take advantage of NVMe SSDs: how they are physically laid out in the system, which is NUMA-aware partitioning. So that's one key thing.
The second thing is that you need to make sure your NVMe and NIC devices are pinned to the same processor: interrupts and all of that. Soft IRQ balancing and everything needs to be on the same NUMA node, not the other one, so that you're not incurring that overhead. And then, for a given NUMA node, because we had four NVMe SSDs each partitioned into four subsections, we ended up having 16 OSDs, so you really need a fairly high-end Xeon E5 SKU, around an E5-2690 or beyond, and networking that is at least dual 10 GbE, but preferably 40 GbE or beyond.
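Here is a small sketch of the kind of NUMA check and pinning being described, assuming a Linux sysfs layout and the numactl tool; the controller names and OSD IDs are illustrative, and a real deployment would also pin IRQs and the NIC as mentioned above.

```python
from pathlib import Path

def nvme_numa_node(controller):
    """Read which NUMA node a PCIe NVMe controller hangs off (Linux sysfs)."""
    return int(Path(f"/sys/class/nvme/{controller}/device/numa_node").read_text())

# Launch each OSD bound to the CPU and memory of the device's NUMA node,
# so I/O never crosses the inter-socket link.
for ctrl, osd_id in [("nvme0", 0), ("nvme1", 4)]:
    node = nvme_numa_node(ctrl)
    print(f"numactl --cpunodebind={node} --membind={node} "
          f"ceph-osd -i {osd_id}  # keep OSD {osd_id} local to {ctrl}")
```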
When you partitioned the NVMe, did you put the journals on the same disk, or would you have a different disk for the journals?
Yeah, excellent question.
So the testing here is focusing on the BlueStore,
not the Filestore backend.
So you don't need a journal with the BlueStore.
With the FileStore you need a journal. With the BlueStore, essentially the metadata and the actual data sit on the same partition, so it's using the same partition.
Yeah, you can do that, but the problem is, if you look at the P3700, you're getting around 400K 4K random read IOPS and around 200K random write IOPS from the device. If you put one OSD on it, you're probably going to get at most maybe 10 to 15K IOPS, or maybe 20K IOPS, out of the device. So if you really want to optimize the device throughput, you have to slice it into four and put more OSDs on it. That's not the long-term direction we want; we really want to address the performance so that one OSD can manage one NVMe SSD and drive the throughput, but we are not there yet. That's the reason: it's a workaround more than anything else, to squeeze the maximum performance out of the NVMe SSD. If you put just one OSD, there is no way you can get that performance.
Okay, so the first chart is all about random read and write performance with five nodes and four NVMe SSDs per node. We can get to 1.4 million IOPS at close to one millisecond latency, and then you can stretch it all the way up to 1.6 million IOPS at roughly 2 to 2.2 milliseconds latency. That's pretty good when you compare it with all-flash system configs and the price that you pay.
Here you are talking about open-source software: you purchase a standard high-volume server, put in your own SSDs from your preferred vendor, and so on, so you have a lot of flexibility in your system config. With open-source software you can get beyond a million IOPS in a five-node cluster, which is fairly amazing. It doesn't mean it is completely optimized from an efficiency perspective, in terms of whether it can take full advantage of the media throughput, but this is roughly on par with what you normally see from most storage stacks, including storage appliances out there. And if you look at the database performance, we can get to 1.3 million queries per second
with 20 clients driving the SQL workload,
each one having eight threads.
So you're really looking at 160 threads
hitting this five-node cluster.
And this is the 100 percent random read case. In the interest of time, I just touched on these two charts; there are a lot more details in the backup slides. The point is that you can take NVMe SSDs, optimize the system config for the NUMA nodes, and get a reasonable amount of performance at acceptable latency even for database workloads, to give you a flavor that it is ready for deploying low-latency workloads.
Now, briefly and at a fairly high level, here is what's going on in the community. Caching work is not done yet on the client side, so we are looking for the caching work on the client side to speed things up quite a bit. The caching pattern is going to be crash-consistent, ordered write-back, shared on a compute node. We are looking at those properties as a way to really speed up the VDI workloads; ephemeral compute instances can benefit immensely from that kind of property. That one is still in progress; we are looking at sometime next year for a production-ready caching solution that can expand beyond DRAM and take advantage of the NVMe SSDs on the compute nodes.
The second item is compression, which is currently in progress, and we are looking at dedupe as an extension to it, as a way to optimize the actual storage capacity for flash-based backends, because without dedupe the value prop quickly goes away with a flash-based backend. So dedupe is a very critical component from an efficiency perspective. That work is in internal design review, and we are looking to upstream it in the community.
Long tail latency: I didn't touch on long tail latency; I talked about the average latency, which is around one millisecond or so. The goal is to make sure the long tail latency beyond the 99th percentile is tolerable and not beyond the few-milliseconds range, so the goal is to optimize the long tail latency performance as well, which is absolutely needed for low-latency workloads.
And then we touched upon this one: yes, we can take the NVMe SSD, which is a performance-oriented device, and carve out four slices as the current optimal design point, but that also adds lots of threads, lots of memory, lots of networking, and lots of compute overhead. The goal is to go back to the design and ask what we can do to optimize the storage stack in the OSD to take advantage of the NVMe SSD throughput in a much more efficient way.
There are two things that we are looking at from a broad focus perspective. User-mode implementation is the theme that we are pushing in the Ceph implementation. There is user-mode networking: instead of doing lots of context switching, the goal is to do user-mode networking using the Data Plane Development Kit (DPDK) as the foundation. It's very popular for comms workloads, and we are trying to use it for storage workloads.
And there is a talk in this room right after this session and I think it is 3 o'clock,
right Ben?
So Ben is actually talking about the Storage Performance Development Kit (SPDK). He's going to talk about how you optimize device access using SPDK,
which is essentially built on top of DPDK foundation,
but looking at the storage workloads,
what are the things that we really need
to move into the user space, and how do we optimize it.
So the goal is to take advantage of the existing work that is actually happening, move that
into the user mode stack as much as possible, optimize and address the efficiency problem
that we currently have.
And then the last one is more to do with the persistent memory integration.
So we are looking at the Linux DAX extensions as well as AppDirect which is the direct integration
to the media.
Those are the two constructs as a way to integrate the BlueStore with the persistent memory.
So that can bring in the persistent memory value prop for the specific use cases.
So this kind of gives you the broad spectrum of what is going to
happen between now and next year to really optimize it more and more to take
advantage of the NVMe SSD bandwidth. So with VMs, there are essentially two different ways of integrating when it comes to Ceph. One is QEMU; user-mode virtualization is the dominant implementation when it comes to OpenStack, and there is a backend driver for QEMU called librbd, which is based on the user-mode stack. So we can do the optimization on the user-mode side. There is also a kernel-mode driver for container frameworks: if you want to instantiate containers that are backed by Ceph, you will have to use the kernel RBD driver. Most of our focus is the user-mode QEMU layer, so the whole caching effort I'm talking about is really on the user-mode side. There are lots of kernel-mode, device-mapper-based caching solutions that you can use, things like bcache, flashcache, and dm-cache, and there is an Intel caching software. You can pair up with any one of them in kernel mode and take advantage of it, but our focus is user mode when it comes to the client.
Yeah, you do have to optimize. The way it works in Ceph is that it's not just about having a powerful system config on the storage node.
You also need to make sure
that the client side has enough bandwidth
and it is also NUMA optimized.
Otherwise, you are going to run into bottlenecks
on the client side compared to the storage node.
So you do have to comprehend both sides of the configuration, but predominantly the biggest bottleneck is on the NVMe SSD and I/O side: if you don't do the NUMA optimization there, everything will fall apart, so that's the critical component. It doesn't mean you don't do it on the client side, but that's the critical piece. Okay, we're going to run out of time and I want to make sure I give enough time for the vSAN side of the discussion, so I will hand it over to Swaroop to cover VMware vSAN.
All right, thanks, Reddy. So I'm going to describe what we are doing with vSAN and NVMe; this is that part of the presentation.
So how many of you are aware of VMware's vSAN?
Okay, so let me give you a quick overview.
So we introduced Virtual SAN, vSAN as we call it internally, in March of 2014.
That was the first release of vSAN. Essentially what vSAN does is it provides cluster storage across multiple server nodes.
And the beauty of vSAN is it's embedded within the vSphere kernel itself,
so it's not a virtual storage appliance running on top of ESX.
Once you have server nodes, and these can be any server nodes, which is how we differentiate against some of our competitors, you can buy servers from HP, Dell, Cisco, Lenovo, Fujitsu, whoever is your favorite vendor. The only thing we ask is that you buy a bunch of SSDs and a bunch of HDDs or SSDs, and essentially vSAN will club the SSDs and HDDs from the different servers together into one giant datastore. There's no concept of LUNs, no concept of volumes, no traditional SAN fabric or SAN fabric switching; none of that exists. Basically, the VM I/O comes into the vSAN storage stack and goes to the devices underneath. In terms of availability, we provide replicas for each of the VMDK objects persisted on the vSAN datastore. So your entire host may go down, but the replica is available on another node, and vSAN is basically able to retrieve the data and serve it back to the VM.
So this all forms a new industry paradigm that is coming, called hyper-converged infrastructure. The "hyper-converged" part is important because all the I/O is being processed within the kernel, within the hypervisor itself. Hyper-converged infrastructure consists of the compute stack, the networking stack, and the storage stack, all layered with a pretty comprehensive management control plane on top. That's what hyper-converged infrastructure is defined as, and it's different from what you may have heard of as converged infrastructure, where the hypervisor does not play as critical a part as it plays in hyper-converged infrastructure. That's the basic difference between HCI and CI.
So that was the overview. As I already said, this is completely software-defined, so there is no proprietary hardware component. We do not use any specific FPGA for dedupe, compression, encryption, etc.
These are all standard components which are shipped with every Intel server out there.
It's a distributed scale-out architecture: you essentially have vSAN embedded within the kernel on each node, and they all communicate over a 10 GbE link between them. And it's integrated well within the vSphere platform; since we are VMware, we ensure interop with every component out there, whether it's vMotion, Storage vMotion, HA, DRS, or vRealize Operations, which is our management suite. All of them are integrated well with vSAN. And it's a policy-driven control plane: right from the start, we designed it around per-VM policies. What does that mean? On a per-VM basis you can set, hey, how should the performance look?
What should the availability look like?
Should I make two replicas of the VMDK?
Should I make three replicas, et cetera?
And those are some of the policies which you can set in addition to, of course,
the capacity policies which you can set on a per VM basis.
It's very different from traditional storage,
where most of your policies are LUN-based or volume-based.
Virtual Volumes, which is another initiative, another product from VMware, also now brings per-VM-based policies to traditional storage. But for vSAN, we built the product from the ground up with per-VM policies in mind.
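To make the per-VM policy idea concrete, here is a small illustrative sketch of what such a policy amounts to. This is not the actual SPBM API; the class and field names are simplified stand-ins, and only the replica math mentioned above (two or three copies of the VMDK) is taken from the talk.

```python
from dataclasses import dataclass

@dataclass
class VsanVmPolicy:
    """Simplified stand-in for a per-VM vSAN storage policy."""
    failures_to_tolerate: int = 1   # FTT: how many host/disk failures to survive
    stripe_width: int = 1           # how many capacity devices each replica spans
    space_reservation_pct: int = 0  # 0 = thin provisioned

    def mirror_copies(self) -> int:
        # With mirroring, tolerating N failures means N + 1 full copies of the VMDK.
        return self.failures_to_tolerate + 1

gold = VsanVmPolicy(failures_to_tolerate=2, stripe_width=2)
print(f"FTT={gold.failures_to_tolerate} -> {gold.mirror_copies()} replicas per object")
```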
So vSAN operates in two modes: we essentially have something called the hybrid mode and an all-flash mode. What I mean by that is that we have two tiers within vSAN. One is the caching tier, where we have a read cache and a write buffer, and the capacity tier is essentially the persistent store for vSAN. For the caching tier we dictate that you have to have a flash device
out there so it could either be SSDs, PCIe based SSDs or now we are getting a lot of
customers who are beginning to use NVMe based SSDs. For the capacity tier we essentially ask that you either have a choice of using HDDs
so that makes it in the hybrid mode or you can use all flash which is essentially SSDs
also in the capacity tier. If you ask me what is the adoption curve for all flash versus hybrid
with vSAN, if you had asked me last year I would have said the all flash is probably 10%
of our adoption but this year if you ask me as to what the adoption is it's more than 40 to 50%
right? And even there, with NVMe: last year, around June or July when we were talking to Intel, we didn't have any NVMe devices on the HCL.
This year we have about 30 to 40 NVMes on the HCL and there are a lot of customers from the beginning of the year
who have been putting NVMe in the caching tier and now we have about more than 10 plus accounts
who essentially have NVMe across both the caching and the capacity tier. So we are seeing some
real evidence of how NVMe is being adopted within the industry. I
forgot to mention the performance. I think we are one of the few vendors who actually state performance pretty explicitly: it's 40K IOPS per node for the hybrid mode and about 100K IOPS per host with all-flash, and if you start putting in devices like PCIe SSDs or NVMe SSDs, sub-millisecond latencies become very common.
So current all flash, vSAN all flash, this is how it looks
I mentioned about the tier 1 caching. The tier two is all about data persistence.
If you're in an All Flash mode,
the SSDs are high performance, high endurance,
very high throughput for caching the writes.
We do not do any kind of read caching in an All Flash mode,
but the tier two data persistence is more read intensive,
lower endurance, is generally less expensive than the tier one
caching. Most of our customers use like either SATA SSDs out here or even SAS SSDs and out
here they were using SAS and SATA SSDs but now it's generally becoming either PCIe SSDs
or the NVMe SSDs and especially the Intel P3700s are becoming
pretty common out there with the P3510 on the tier twos.
We have space efficiency in 6.2, which is the latest release we have for vSAN. Earlier, before we introduced dedupe and compression, what we had to do was essentially make replicas of the object; an object for us is essentially a VMDK, so we create multiple objects based on the policy the user has set. In 6.2, which we introduced in March of this year, our latest release, we introduced dedupe and compression and erasure coding, which we loosely use for RAID 5 and RAID 6, and depending on the workload we have seen about 2x to 8x savings. If you're doing, for example, VDI full clones, that's about 7x to 10x savings. If you're doing something like Oracle single-instance databases, compression is what takes effect there, and we are seeing anywhere from 5% to 25% or 30% compression ratios with Oracle.
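A quick sketch of why RAID-5/6 erasure coding saves space relative to mirroring: the 3+1 and 4+2 layouts are the usual way vSAN's RAID-5 and RAID-6 are described, but treat the exact factors here as illustrative rather than a statement of the product's internals.

```python
def mirrored_raw(usable_gb, ftt):
    """Raw capacity consumed when tolerating `ftt` failures with full mirrors."""
    return usable_gb * (ftt + 1)

def erasure_coded_raw(usable_gb, data, parity):
    """Raw capacity with a data+parity erasure-coded layout (e.g. 3+1 or 4+2)."""
    return usable_gb * (data + parity) / data

vmdk_gb = 100
print("FTT=1 mirror :", mirrored_raw(vmdk_gb, 1), "GB raw")                     # 200 GB
print("RAID-5 (3+1) :", round(erasure_coded_raw(vmdk_gb, 3, 1), 1), "GB raw")   # ~133 GB
print("FTT=2 mirror :", mirrored_raw(vmdk_gb, 2), "GB raw")                     # 300 GB
print("RAID-6 (4+2) :", erasure_coded_raw(vmdk_gb, 4, 2), "GB raw")             # 150 GB
```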
Performance, of course, if you're using all flash,
in terms of IOPS, we are about four times higher
than the hybrid vSAN.
So again, the hybrid vSAN is HDDs used in the persistence
tier.
And when you start using devices like NVMe,
sub-millisecond latency response times
is what you could expect.
We support almost all applications out there. When we started out with vSAN 1.0 about two years back, since everybody knows that with a 1.0 storage product you have to establish credibility and make sure the industry believes it is an enterprise product, we kept the number of use cases very restricted: VDI, DR target, a few dev/test use cases. But two years down the line we have opened it up for any use case out there, so there are customers deploying things like Oracle RAC, MSCS, Exchange, SQL; every application out there is being deployed on vSAN. And we have some pretty interesting use cases. For example, Lufthansa: their aircraft have about 300,000
sensors and each of the sensor is sending data back to their main data center
where they are running some analytics,
and it spits out like kind of a report on how the flight did,
what are some of the things to take care of,
and within an hour, the technician has to address all the concerns with this report,
get the flight back up and running, right?
This is being done on their A380s, etc.
So some very interesting cases. There are defense sectors where vSAN is being deployed in particular form factors. It works very well because in some of these defense sectors they want particular form factors; they do not want an appliance form factor that may not fit into a particular space, for example.
So vSAN works beautifully for them because they can choose a particular hardware,
just slap the vSAN software on top of it, and it runs.
So some very interesting use cases out there.
And I forgot to mention we have about 5,000 customers so it's
growing pretty rapidly. It's one of the fastest-growing storage products I have worked on, at least. So how is the market evolving? This is an IDC graph, and the IDC
predictions kind of align with how we are seeing the NVMe adoption growing within
vSAN.
We see that the SATA SSDs especially are going to start trickling down, while the PCIe-based SSDs and the SAS SSDs will probably occupy more than 60 to 70% of the market. And by 2018, what IDC projects is that 70% of the client SSD market will be NVMe-based.
So what are the benefits of vSAN with NVMe? So I won't go through the overview.
Reddy did a good job about that. But it's ideal for the caching tier. We are getting a lot of
customers who are using NVMe today with vSAN and we highly recommend that. There are a few customers who are
beginning to use it even in the persistent store also. We have NVMe
devices certified for the vSAN caching tier and specifically for all flash
configurations. You can also use it for the hybrid mode; it works perfectly for that. What we are doing specifically on the roadmap, and I will go through it in a little more detail,
is we are enhancing both the ESXi storage stack and the vSAN storage stack
to make it effective enough to take some of the real advantages of NVMe
and some of the NVMe 1.2 features which are also coming up.
So we'll talk about that, and Murali will also go into some more detail in that regard.
This is what our HCL looks like: the Intel P3700s, the S3510s, all listed there.
This is a chart which Intel recently published.
I know this chart may be a bit convoluted, but let me see if I can explain it.
The bar graphs which you see correspond to the capacity. You may have a performance deployment where you really care about performance and nothing else. Or you may go to the other extreme, where you do not really care about the performance but you want to make your deployment highly, highly available. Or you can have it capacity-based, where you want large capacities within that cluster. Or you may say, I want a slight balance between performance and capacity. As you see, this is a configuration where the performance deployment got about 1.2 million IOPS, and of course, as you trade performance off against availability, your performance tapers off quite a bit.
But what this graph essentially shows is, with NVMe,
the kind of dollars per IOPS and dollars per gigabyte which you get is pretty remarkable.
In fact, we get a lot of customers who ask us,
hey, isn't NVMe too expensive?
But these are kind of the graphs which we have been publicizing along with Intel
to show that the amount of performance you get at the cost of NVMe is pretty remarkable, and it's pretty encouraging for customers to use NVMe.
Also, raw comparisons.
So we did a raw comparison, taking an 8-node vSAN cluster and comparing it with an 8-node NVMe-based cluster, and these are the IOPS we are getting with these configurations. What we also observed in terms of latencies was that on the SSD side we had about 1.5 milliseconds, and with NVMe we were getting 1 millisecond latencies
or slightly less than that.
So what's coming next?
What's the future for NVMe within vSAN?
So let me go through some of the ideas out there.
So the Intel Optane SSDs are the ones we are working on closely with Intel: some pretty impressive reductions in latency, approximately 10x compared to NAND SSDs, so that's a direct benefit to vSAN if you're going to use them
in the caching tier or even within the capacity
tier.
We have done some measurements along with Intel with the application performance, the
ESXi application performance.
It's about two and a half times faster than the NAND PCI Express.
And these are raw numbers; we haven't really done any software optimizations. This is just taking out the SAS and SATA SSDs, plugging in NVMe SSDs, and these are the numbers which we are getting.
There are a lot of software optimizations which we have started to work on. So here we did a comparison of how NVMe performs without any ESXi storage stack optimizations,
which is what it is today.
The gray bars are what you get,
but we are also prototyping a lot of optimizations within the ESXi storage stack.
So Reddy mentioned about how you can parallelize the NVMe across multiple lanes, and those
are some of the optimizations which we are introducing within the storage stack.
And as you can see, the changes within the ESXi storage stack are getting us quite a few benefits. Of course, the prototype is not released and these are all internal results as of now, but I thought I would show you what one can aspire to as they start using NVMe with vSAN and perhaps even with ESXi. So I won't go through all of this, but let's look at what the
next generation hardware offers with NVMe and also with NVDIMMs, or
persistent memory. So there are four areas which we are focusing on. One is
the high speed NVMe which we hope to leverage as it is and that will provide performance benefits to vSAN right away.
There are ESXi storage stack enhancements which we are looking at, just as I mentioned in the previous slide.
We are also looking at the tiering. I mentioned that we have a two-tier architecture today, and there is no reason, especially with NVDIMMs and the like coming into the market whenever that happens, that we couldn't collapse this into a single-tier all-flash architecture. What that would mean is using NVDIMMs for just the metadata while using SAS, SATA, or NVMe SSDs for the persistent store.
Or we can still keep it as a two-tier architecture
where you have the metadata in NVDIMMs,
you have the write cache as NVMe,
and the SAS and SATA SSDs as your persistent store.
We are also looking at RDMA over Ethernet
to boost network transfers.
As the device latencies for these
start becoming smaller and smaller,
the network latencies will start becoming the bottleneck.
And that is why we are looking at RDMA to help address the networking challenges also
because right now it's not a challenge for us,
but once these device latencies start shrinking down and down,
the network latencies will start becoming an issue for us.
So with that, I will give it to Murali to go over some
of the native driver stack and some of the technical details.
So I think we're going to run out of time here,
but what I wanted to share is a snapshot of where we stand
in terms of NVMe support today.
So first let's look at our driver stack and where we play in terms of NVMe or complementary technologies.
So there are fundamentally two places that we actually
do drivers.
One is in the IO stack.
That is NVMe for the PCIe space, as well as NVMe over Fabrics. Those are the two fundamental drivers. Along with that, we have drivers for the RDMA piece, like a RoCE driver; that sits right in the I/O driver stack. The other place is something we call virtual NVMe. Essentially, that sits below a guest operating
system and that allows a guest operating system to have a native NVMe driver so that you can use this virtual device
to funnel in IOs that are completely natively NVMe.
So those are the places that we play.
And what I want to share is our future plans.
So far we have been supporting the NVMe protocol. Currently we have support for the 1.0e spec, and that's supported both in the inbox driver as well as the downloadable driver.
In future, what we're going to do is try to support
the 1.2 feature set.
So I can't speak about the exact time table on this one,
but it's coming soon.
That deals with multiple namespaces and multiple queues, so it actually addresses the performance points that Swaroop was talking about. We will also have a driver for NVMe over Fabrics, and we'll address RoCE as the fabric to start with, followed by other transports as our customers see a need. And finally, I talked about the virtual NVMe device which sits below the guest operating system; we have a plan to deliver that.
That should be very interesting for other operating systems which sit inside VMs. And finally at some point we plan to have a completely native NVMe stack. In other words, it will coexist with SCSI for a while,
but the outlook is that at some point, we're going to completely transition into a NVMe stack.
So that's the outlook for the coming years.
And you can read about this.
You can actually download the drivers today.
We have drivers for the 5.5 as well as the 6.2 ESXi platforms. There is also a project that will be inviting device vendors to contribute to the 1.2 driver; that's coming soon. The information is in the links here; you can log in there, find more details, and I'll be happy to talk to you after this session. So that's fundamentally it. I think we should leave some room for questions; we can do questions for maybe two or three minutes, and then we have to give it up for the next speaker. Any questions on the vSAN side? I know we had a few during the talk.
Do you have any idea how much CPU or server overhead there is for this software stack?
On vSAN in general? What we advertise is less than 10%; we call it the vSAN tax, and that's typically what we have. Now, for dedupe and compression you also have to incur some additional CPU, and that is typically, again, less than 10% of the tax you have already paid for vSAN without dedupe and compression. So let's say you have 100 cores: you need 10 more cores for vSAN. For dedupe and compression, another 10% of that, 10% of 10, is one additional core, so 11 cores in general if you turn on dedupe and compression.
And is it statically assigned or is it done through vSAN? So you have to go through a kind of a TCO calculation, right, to basically figure out,
okay, I'm going to enable vSAN on this.
What are the, you know, how much more CPUs do I need to put within the cluster?
So that's how it works.
Do you have a mix of thin and thick provisioning in your environment?
Yes. So the default is thin, but you can toggle between thin and thick if you would like to.
On consistency: is vSAN an AP system or a CP system? An AP system versus a CP system, in the CAP sense of consistency versus availability?
Eventual consistency or the strong consistency?
So this is all strong consistency, right?
So essentially the write, as soon as the write happens,
when the writes are replicated to all the replicas,
that's when the acknowledgement is sent back to the client.
So even though it's an object-based system, so I know that other object-based systems may be eventually consistent, but with vSAN, it's strongly consistent.
Any other questions? The caching tier that you have, is it for read cache?
So in the hybrid model, the caching tier is both for read cache and write buffering.
In the all-flash model, the caching tier is only for write buffering.
The reads happen directly from the capacity tier. Sorry?
The data is... Yes.
So you have the write buffer on each of the nodes, and it is replicated? Write buffers on all nodes?
Yes, it's a write buffer on each of the nodes, yes.
All right. Thank you.
Thanks for your time.
Appreciate it.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list
by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further
with your peers in the developer community.
For additional information about the Storage Developer Conference,
visit storagedeveloper.org.