Storage Developer Conference - #54: Bridging the Gap Between NVMe SSD Performance and Scale Out Software

Episode Date: August 8, 2017

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast. You are listening to episode 54 of the SDC Podcast. Today we hear from Anjaneya Chagam, Principal Engineer, Intel, as he presents Bridging the Gap Between NVMe SSD Performance and Scale-Out Software from the 2016 Storage Developer Conference.
Starting point is 00:00:54 Good afternoon, everyone. Can you hear me okay down there? So welcome to Bridging the Gap Between NVMe SSD Performance and Scale-Out Software. My name is Reddy Chagam. I'm a principal engineer and chief SDS architect in the Intel Data Center Group. I have two co-presenters with me; I will actually let Swaroop and Murali introduce themselves. Sure, thanks. My name is Swaroop Datta. I manage the product management team at VMware in the storage and availability business unit. Thanks, Swaroop. My name is Murali Rajagopal.
Starting point is 00:01:26 I'm a storage architect in the Office of the CTO at VMware. Fantastic. Okay, so we are going to cover two broad scale-out software stacks. I am specifically going to focus on the Ceph side, and then Swaroop and Murali are going to focus on the VMware vSAN stack. The theme of this presentation is what exactly we are doing within the context of NVMe SSDs to speed up the software side of these two stacks, and we'll also give a flavor of
Starting point is 00:02:00 the planned work going forward. How many of you know the NVMe protocol, heard about it, familiar with the concept? Okay, so a few. Okay, so I'm going to give a very high-level view into what exactly an NVMe SSD is and how this plays out for the following discussion with Ceph. So if you look at NVMe, essentially it is the standard software interface between the host and the PCIe-attached SSD media. Because it's a standard software interface, you can actually plug in any NVMe-attached SSD from any vendor and it should be able to work without
Starting point is 00:02:45 changing the host software. So that's the number one thing. It uses queuing as the foundation construct, as the way to communicate with the media. There are a set of admin queues to do the setup of your SSD, and there are submission queues and completion queues. So you use the queuing construct as a way to communicate with the media in a standard way. Of course, it is attached to PCIe lanes. So it means that it has direct connectivity from your CPU to the NVMe media.
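To make the queuing idea concrete, here is a toy Python sketch of a submission/completion queue pair with a doorbell. It is purely illustrative: the names and structures are invented for this sketch and do not reflect the real NVMe command or ring-buffer layout.

```python
# Toy model of the NVMe queue-pair concept: a submission queue (SQ),
# a completion queue (CQ), and a doorbell. Purely illustrative; the real
# protocol uses fixed-size rings in host memory plus MMIO doorbell writes.
from collections import deque

class ToyQueuePair:
    def __init__(self, depth=1024):
        self.depth = depth
        self.sq = deque()          # submission queue entries (commands)
        self.cq = deque()          # completion queue entries (statuses)

    def submit(self, opcode, lba, nblocks):
        if len(self.sq) >= self.depth:
            raise RuntimeError("submission queue full")
        self.sq.append({"opcode": opcode, "lba": lba, "nblocks": nblocks})

    def ring_doorbell(self):
        # Pretend the controller drains the SQ and posts completions.
        while self.sq:
            cmd = self.sq.popleft()
            self.cq.append({"cmd": cmd, "status": 0})   # 0 = success

    def reap(self):
        while self.cq:
            yield self.cq.popleft()

qp = ToyQueuePair()
qp.submit("read", lba=0, nblocks=8)
qp.submit("write", lba=128, nblocks=8)
qp.ring_doorbell()
for cqe in qp.reap():
    print(cqe["cmd"]["opcode"], "completed with status", cqe["status"])
```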
Starting point is 00:03:21 There are no intermediate, you know, chipset components in the middle like a host bus adapter. So that's a key point. Because it is attached to the PCIe lanes, you essentially can take advantage of the bandwidth that you get out of PCIe lanes. With PCIe Gen 3, one lane can actually give you 1 gigabyte per second of bandwidth, and most of the NVMe SSD cards out there have x4 or x8 lanes. An x4-lane PCIe SSD card typically gives you reasonably half a million IOPS per
Starting point is 00:03:55 second from reads, and somewhere around 200K IOPS on the write side. With the Optane SSD you can actually get to 1 million IOPS with an x8 PCIe lane slot. So there is a significant amount of throughput and performance you can actually get out of NVMe SSDs, because of the fact that it is directly attached, and you can scale based on the number of lanes; it's not limited by, you know, certain aspects. The other thing is obviously you don't need a host bus adapter like a SAS controller and so on, so it essentially gives you cost reduction as well as power reduction, plus performance. So if you look at the Broadwell CPU, per socket you can actually get 40 PCIe
Starting point is 00:04:47 lanes. A bunch of them will be shared with the networking. So you essentially get enough PCIe lanes to drive the bandwidth that is needed on the compute as well as the networking side. There are a few different form factors. M.2 is really meant for the boot type of use case, the boot drive. Then there is the small form factor, which is U.2, and then there is an add-in card. So you can use different form factors based on the system configs that you are after and the density and price points that you are really looking for.
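As a back-of-the-envelope check on the lane math quoted above (roughly 1 GB/s of usable bandwidth per PCIe Gen3 lane), a small sketch; the per-device IOPS numbers in the comment are the ballpark figures from the talk, not measurements made here.

```python
# Back-of-the-envelope PCIe Gen3 math using the figures quoted in the talk:
# roughly 1 GB/s of usable bandwidth per lane, with x4 and x8 add-in cards.
GEN3_GBPS_PER_LANE = 1.0          # usable GB/s per Gen3 lane (approximate)

def max_bandwidth_gbps(lanes):
    return lanes * GEN3_GBPS_PER_LANE

def iops_bound_by_bandwidth(lanes, io_size_kb=4):
    # Upper bound on 4 KB IOPS if the PCIe link were the only limit.
    return max_bandwidth_gbps(lanes) * 1e9 / (io_size_kb * 1024)

for lanes in (4, 8):
    print(f"x{lanes}: ~{max_bandwidth_gbps(lanes):.0f} GB/s, "
          f"link-limited 4KB IOPS ~{iops_bound_by_bandwidth(lanes) / 1e6:.1f}M")

# Typical device numbers quoted in the talk are well below the link limit:
# ~500K random read / ~200K random write IOPS for an x4 NAND NVMe SSD,
# and ~1M IOPS for an x8 Optane SSD.
```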
Starting point is 00:05:15 Okay? All right. So that's kind of the brief overview of what the NVMe protocol is about, what an NVMe SSD is, and what the benefits are. So I'm going to directly jump into the Ceph side. How many of you have heard about Ceph, know Ceph? Okay, reasonable number.
Starting point is 00:05:40 So Ceph is an open source, scale-out storage software stack. It is built on the foundation of an object interface, which is called RADOS. That is the foundation layer. You can take the RADOS foundation layer and deploy it on a pool of standard high-volume servers. So you essentially take a bunch of machines, put the Ceph software on top, and you get a scale-out storage stack that is protected beyond a host failure. You can have a host failure or a rack failure and still make sure the data is protected. So it gives you the scale-out property, and it also gives you durability: you can either have multiple copies or erasure coded pools,
Starting point is 00:06:23 you can mix and match, all kinds of interesting things you can do to protect your data lots of you know features that are related to things like when something goes bad a media goes bad you should be able to detect it automatically replicate it to ensure that there is a high availability in the cluster so anything and everything that you can think of in the scale-out software, you see those properties in Ceph software as well. It's an open source software primarily supported by Red Hat, lots of community contributions from big storage vendors including Intel, Samsung, SanDisk, Mellanox, lots of service providers as well contributing to the community work.
Starting point is 00:07:04 It's popular for block storage within the context of OpenStack deployments. So wherever you see an OpenStack deployment, Ceph tends to be the choice for the block storage backing your virtual machines. It typically provides three different constructs, which are very important for lots of customers. You can use it as a shared file system, which is called CephFS; you can use it as a distributed block layer, which is called RBD; or you can also use it as the
Starting point is 00:07:34 object interface, which is essentially the RADOS Gateway REST APIs, S3 compatible as well as Swift compatible. So it gives you those three access methods using the foundation, which is the object layer. From the NVMe workloads perspective, we kind of look at three different groupings of workloads. If you look at this, there is capacity on the X axis and performance on the Y axis, and three types of workloads. One is high-IOPS, low-latency workloads, which are typically database type of workloads. Then you are looking at throughput-optimized things like content delivery networks, VDI, cloud DVR and so on; big data workloads are at the top. And then when it comes to the archival, capacity type of workloads, those are
Starting point is 00:08:26 really object based workloads. And we look at NVMe relevance across this whole spectrum of workloads, as opposed to really looking at NVMe SSDs just for the low latency workloads, which is very important to remember. Because if you see the trend, what's happening is there are NVMe SSDs that are low latency types of media out there, like the Optane SSD that will give you a million IOPS at less than 10 microsecond latency. So you're really looking at that for the low latency workloads. And then on the capacity side, lots of vendors out there, including Intel with 3D NAND based media, are really looking at capacity
Starting point is 00:09:06 oriented workloads: mostly content delivery, archive, backup, object based workloads. So think of this as NVMe SSDs having relevance across the board, as opposed to really just the low latency piece. If you were to zoom into the Ceph architecture, the top portion is the Ceph client and the bottom portion is the Ceph storage nodes. If you look at the top portion, the way we are looking at NVMe SSDs is: today Ceph does caching, specifically for the block workloads, and it uses DRAM as the caching layer. We are looking at extending that with NVMe SSDs so we can actually give a much better cache real estate on the client nodes. And we can bring in the value prop of NVMe SSDs.
Starting point is 00:09:52 So that's the focus area on the client side. And then on the back end, in the storage nodes where you have Ceph, there are two different configurations today. One is actually production ready, which is called the FileStore back end, and the other one is BlueStore, which is currently in tech preview mode. Both these configurations can take advantage of flash to speed up your writes using NVMe SSDs and to speed up your reads with a read cache. So we see both scenarios as pretty relevant for the storage nodes as well. That's kind of the intersect point when you look at where NVMe SSDs make sense within
Starting point is 00:10:27 the context of Ceph. So from the consumption perspective, as an end user you are really looking at three different configurations, again just looking at NVMe SSDs and what kind of configurations we typically see in end customer deployments. Standard is where you essentially have a mix of NVMe SSDs paired up with hard disk drives, and you can use the NVMe SSDs as a way to speed up your writes as well as a cache to service the reads. So it's meant for both write and read caching,
Starting point is 00:11:07 paired with the hard disk drives. Typically what we look for is one high endurance NVMe SSD; the P3700 is the Intel high endurance NVMe SSD card. You can typically pair that up with 16 four-terabyte hard disk drives. So that's kind of the ratio, based on the benchmarking. The next one is the better configuration, which is kind of balanced with the best TCO.
Starting point is 00:11:35 You are looking at a combination of NVMe SSDs with SATA SSDs as a way to get to an all-flash, low latency type of SKU. Normally you pair up one NVMe SSD with around six SATA SSDs, and you use the NVMe SSD as a way to speed up your writes while the reads get serviced from the SATA SSDs. So that's the breakdown for the better configuration. And the best configuration is essentially everything all NVMe. Based on the current testing that we have done and all the optimizations, we are really looking at around four NVMe SSDs per node as the design point; beyond that Ceph will not scale. We are looking at lots of things to optimize it, but currently that's the recommended configuration. For all these things to work, you really need
Starting point is 00:12:23 to make sure that the CPU has enough power, irrespective of what configuration you choose, and that you have enough networking bandwidth. So you do need to consider that: it's a balanced system config from a compute, storage and networking perspective, to make sure that it can scale and stay optimized. Yeah. So you said that Ceph does not scale beyond four SSD devices per node; why is that? Yeah, let me go over the details and I can, you know, give some glimpse into where the challenges are. So, historically, the question is why is Ceph not scaling beyond four NVMe SSDs.
Starting point is 00:13:07 There are lots of reasons. If you look at the Ceph design point, it started 10 years ago, and SSDs in those days were pretty much unheard of. So the main design theme has been: I need to scale across nodes, because my hard disk drives give me limited throughput, and the way I can scale is to have a scale-out design pattern across many nodes. So it is really designed for hard disk drives,
Starting point is 00:13:39 and then things evolved in terms of speeding up the reads and writes using NVMe SSDs as an enhancement. Now we are looking at, okay, there are lots of bottlenecks in terms of threading that we really need to look at and optimize. So that's the focus: the software needs to be optimized. There are lots of threads; it's designed for hard disk drives, so certain things you can get away with and it's not a big deal, but with NVMe SSDs your software has to be really lean and the path should be extremely narrow, and that's the area that needs work. On the FileStore and the BlueStore: yeah, so I thought that for SSDs you would be using
Starting point is 00:14:25 BlueStore, which is basically optimized for flash, and then you would not run into those scalability limitations? Yeah, so if you look at the Ceph data path, the way I look at it is: there is a networking section, the I/O coming into the networking layer in Ceph; there is the core data path that really makes the decision of where the data is distributed, how you protect the data, and the whole mechanism of how you group the data and how you shard the data, and that whole logic is actually in the middle layer; and then the bottom layer is the core I/O to the media. So BlueStore is essentially coming in at the bottom. So with the BlueStore changes,
Starting point is 00:15:08 you are bypassing the file system, so it is directly using the raw media. So you are speeding up the I/O to the media, and the layer above it will be optimized. There is networking optimization that is pretty close; with the RDMA and XIO messenger pattern you can actually address the networking stack as well.
Starting point is 00:15:27 The piece that really needs work is in the middle. In the RADOS layer? RADOS, yeah. It's called the OSD layer. That really requires work. There are lots of choke points where everything waits on, let's say, PG locks; certain choke points where everything waits.
Starting point is 00:15:45 So those choke points need to be eliminated for us to really speed up for the NVMe SSDs. Good question. So I'm going to spend some time on what the current state is when you look at NVMe SSD performance. For the performance work we focused on MySQL as the key benchmark, the reason being that it is very popular: when I talked about OpenStack, where Ceph is popular, lots of customers are looking at MySQL as one of the key workloads. So we ended up picking that because of the relevance as well as popularity. And we also do lots of synthetic benchmarks, random I/O as well as sequential I/O, reads and writes with
Starting point is 00:16:32 different sizes, to look at where the optimization points really are. So we looked at the synthetic benchmarks as well as MySQL as a way to assess the current performance and where the bottlenecks are. The system config that I'm going to talk about is a five-node Supermicro cluster. This actually has 10 slots, where you can put six NVMe SSDs on one NUMA node and four on the other, so six plus four is what you have with the Supermicro. But in the testing that we did, we played around with every other configuration and we settled on four NVMe SSDs, with each NVMe SSD partitioned into four logical
Starting point is 00:17:23 regions, and then we have a Ceph OSD managing each one of them. That's the only way to really get the scaling with the current configuration: partitioning your NVMes into four and then having each one managed by one object storage daemon, the OSD, which is essentially the critical Ceph component that manages the storage down to the media. On the client side, per node, we used four Docker containers. Two Docker containers are really the clients that are actually running the benchmark, which is called sysbench, and then two of them are really doing the work, hosting the MySQL database server.
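To picture the layout being described (four NVMe devices per node, each carved into four pieces with one OSD per piece), here is a small sketch that just prints the resulting mapping. The device names and partition numbering are assumptions for illustration; the actual partitioning and OSD creation would be done with your usual deployment tooling.

```python
# Illustrative only: print the NVMe-to-OSD layout described in the talk
# (4 NVMe SSDs per node, each split into 4 pieces, one Ceph OSD per piece,
# i.e. 16 OSDs per storage node). Device names are placeholders.
NODES = ['node%d' % i for i in range(1, 6)]           # 5-node cluster
NVME_PER_NODE = 4
SLICES_PER_NVME = 4

osd_id = 0
layout = {}
for node in NODES:
    for dev in range(NVME_PER_NODE):
        for part in range(1, SLICES_PER_NVME + 1):
            layout[f"osd.{osd_id}"] = (node, f"/dev/nvme{dev}n1p{part}")
            osd_id += 1

print("total OSDs:", len(layout))                     # 80 across the cluster
print("OSDs per node:", len(layout) // len(NODES))    # 16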
Starting point is 00:18:00 So with that, there are certain things we have to optimize when you look at the NVMe SSDs and the NUMA system configs. The key thing is, if you look at the right hand side, we have partitioned it in such a way that your compute, memory, NVMe SSDs and networking are attached to the same NUMA node. So you are not going across the inter-socket link to get the job done when the I/O is coming in. That is a very, very critical component when you really want to take advantage of
Starting point is 00:18:35 NVMe SSDs: how they are physically partitioned in the system, which is NUMA aware partitioning. So that's one key thing. The second thing is you need to make sure that your NVMe and NIC devices are really on the same processor; the interrupts and all that stuff, the soft IRQ balancing and everything, need to be on the same NUMA node, not on the other one, so that you're not incurring that overhead. And then, for a given NUMA node, because we had four NVMe SSDs, each one partitioned into four subsections, we ended up having 16 OSDs. So you really need a fairly high-end Xeon E5 SKU, looking at around a 2690 or beyond, which is typically what you need, with networking that is at least dual 10 gig, but preferably 40 gig or beyond.
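A hedged sketch of the NUMA pinning idea just described: keep each OSD process on the CPUs of the socket its NVMe device hangs off, and steer the device interrupts to the same socket. The CPU ranges, PIDs and IRQ number below are illustrative placeholders; on a real system you would read them from lscpu and sysfs rather than hard-coding them.

```python
# Sketch of NUMA-aware pinning: keep each OSD on the CPUs of the socket its
# NVMe SSD is attached to, and steer the device IRQs to the same socket.
# The CPU ranges, PIDs and IRQ numbers here are illustrative placeholders.
import os

NUMA_CPUS = {0: set(range(0, 16)), 1: set(range(16, 32))}   # example topology

def pin_process(pid, numa_node):
    # pid 0 means "the calling process"; a real OSD PID would go here.
    # Intersect with CPUs that actually exist so the demo runs anywhere.
    cpus = NUMA_CPUS[numa_node] & os.sched_getaffinity(0)
    os.sched_setaffinity(pid, cpus)

def pin_irq(irq, numa_node):
    # Steer a device interrupt to the local socket (needs root).
    cpus = sorted(NUMA_CPUS[numa_node])
    with open(f"/proc/irq/{irq}/smp_affinity_list", "w") as f:
        f.write(f"{cpus[0]}-{cpus[-1]}")

pin_process(0, 0)                 # demo: pin this process to NUMA node 0
# pin_irq(120, 0)                 # e.g. an NVMe queue IRQ local to node 0
print(os.sched_getaffinity(0))    # show the resulting CPU set
```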
Starting point is 00:19:28 Go ahead. So back when you did that, when you partitioned the NVMe into the four pieces, did you put the journals on the same disk, or would you have a different disk for the journal?
Starting point is 00:19:53 Yeah, excellent question. So the testing here is focusing on the BlueStore, not the Filestore back end, and you don't need a journal with BlueStore. With the Filestore, you need a journal. So with BlueStore, essentially the metadata and the actual data are sitting on the same partition, so it's using the same partition. Yeah, yeah, you can do that.
Starting point is 00:20:26 But the problem is, if you look at the P3700, you're kind of getting around 400K read IOPS, 4K random read IOPS, and around 200K random write. If you put one OSD on it, you're probably going to get at most maybe 10 to 15K, or maybe 20K, IOPS out of the device. So if you really want to optimize the device throughput,
Starting point is 00:20:52 you have to slice them into four and then put more OSDs on them. But that's not the long-term direction we want to be in. We really want to address the performance and make sure that one OSD can manage one NVMe SSD and drive the throughput, but we are not there yet. That's the reason why it's kind of a workaround, more than anything else, to squeeze the maximum performance out of the NVMe SSD; if you put just one OSD, there is no way you can get that performance. Okay, so here is the first chart. It is all about random read and write performance
Starting point is 00:21:30 with five nodes, four NVMe SSDs per node. I can actually get to 1.4 million IOPS at closer to one millisecond latency, and then you can stretch it all the way up to 1.6 million IOPS at roughly around 2.2 milliseconds latency. That's really pretty good when you compare with all-flash system configs and the price that you pay here: you are talking about open source software, purchasing a standard high volume server, putting in your own SSDs of your preferred choice, your preferred vendor and so on. So you have a lot of flexibility to put together your own system config with open source software, and
Starting point is 00:22:18 you can actually get beyond a million IOPS in a five-node cluster. That is fairly amazing. It doesn't mean it is completely optimized from an efficiency perspective, whether it can take full advantage of the media throughput, but this is somewhat at parity with what you normally see with most of the storage stacks, including storage appliances out there. And if you look at the database performance, we can actually get to 1.3 million queries per second with 20 clients driving the SQL workload, each one having eight threads. So you're really looking at 160 threads hitting this five node cluster, and you can get to 1.3 million queries per second, you know, somewhere around that range.
Starting point is 00:23:11 So this is the 100% random read case. In the interest of time I touched upon just the two charts; there are a lot of details in the backup as well. The notion is that you can take NVMe SSDs, you can optimize the system config for the NUMA nodes, and you can get a reasonable amount of performance and acceptable latency, even for the database workloads, to give you a flavor that it is ready for deploying low latency workloads.
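Working backwards from the numbers just quoted, a rough sanity-check of the arithmetic (these are not additional measurements): 1.4 million 4K random read IOPS over a 5-node, 4-NVMe-per-node cluster, and 1.3 million queries per second from 20 sysbench clients with 8 threads each.

```python
# Rough sanity-check arithmetic on the quoted results (not new measurements).
cluster_iops = 1.4e6          # 4K random read IOPS at ~1 ms average latency
nodes, nvme_per_node = 5, 4

print("per node:", cluster_iops / nodes, "IOPS")                    # ~280K
print("per NVMe:", cluster_iops / (nodes * nvme_per_node), "IOPS")  # ~70K
# Each P3700 can do ~400K random read IOPS on its own, which is why the talk
# stresses that the OSD software path, not the media, is the current limit.

qps, clients, threads_per_client = 1.3e6, 20, 8
print("per sysbench thread:", qps / (clients * threads_per_client), "QPS")
```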
Starting point is 00:23:44 Now, we kind of briefly touched upon what's going on in the community. Fairly high level. Caching work is not done yet on the client side. So we are looking for the caching work on the client side to really speed up quite a bit. And the caching pattern is, it's going to be crash consistent,
Starting point is 00:24:04 ordered write-back, shared on a compute node. We are looking at those properties as a way to really speed up VDI workloads; ephemeral compute instances can immensely benefit from that kind of property. So that one is still in progress; we are looking at sometime next year for a production-ready caching solution that can expand beyond DRAM and take advantage of the NVMe SSDs on the compute nodes. The second one is compression; that work is currently in progress,
Starting point is 00:24:36 and we are looking at dedupe as an extension to it, as a way to optimize the actual storage capacity for the flash-based backends, because without dedupe the value prop quickly goes away with flash-based backends. So dedupe is a very, very critical component from an efficiency perspective. For that work, internal design reviews and vetting are going on, and we are looking at upstreaming this in the community. Long tail latency: I didn't touch upon the long tail latency, I talked about the average latency, which is around one millisecond or so. The goal is to make sure that the long tail latency beyond the 99th percentile is tolerable and not completely beyond the few-milliseconds
Starting point is 00:25:24 range. So the goal is to make sure we optimize the long tail latency performance as well, which is absolutely needed for the low latency workloads. And then we touched upon this one: yes, we can take the NVMe SSD, which is a pretty performance-oriented SSD, and carve out four slices as an optimal design point, but that also is going to add lots of threads,
Starting point is 00:25:50 lots of memory, lots of networking, and lots of compute overhead. The goal is to go back to the design pattern of: what can we do to optimize the storage stack in the OSD to really take advantage of the NVMe SSD throughput in a much more efficient way? There are two things that we are looking at from a broad focus perspective. User mode implementation is the theme that we are pushing in the Ceph implementation. There is user mode networking: instead of doing lots of context switching, the goal is to do user mode networking, which uses the Data Plane Development Kit as the foundation. It is very popular for the comms workloads, and we are trying to use that for the storage workloads. And there
Starting point is 00:26:37 is a talk in this room right after this session, and I think it is at 3 o'clock, right, Ben? So Ben is actually talking about the Storage Performance Development Kit. He's going to talk about how you optimize the device access using the Storage Performance Development Kit, which is essentially built on top of the DPDK foundation, but looking at the storage workloads: what are the things that we really need to move into user space, and how do we optimize it.
Starting point is 00:27:10 So the goal is to take advantage of the existing work that is actually happening, move that into the user mode stack as much as possible, optimize and address the efficiency problem that we currently have. And then the last one is more to do with the persistent memory integration. So we are looking at the Linux DAX extensions as well as AppDirect, which is the direct integration to the media. Those are the two constructs as a way to integrate the BlueStore with the persistent memory. So that can bring in the persistent memory value prop for the specific use cases.
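On the persistent memory point: with a DAX-capable filesystem, a file can be mapped and accessed with loads and stores, bypassing the page cache. Below is a minimal sketch assuming a DAX mount at /mnt/pmem (an assumption, not something from the talk); real App Direct code would normally use PMDK/libpmem for proper cache flushing and power-fail ordering, which this toy skips.

```python
# Minimal DAX idea: map a file on a DAX-mounted filesystem and write to it
# directly. Assumes /mnt/pmem is an fsdax mount; real App Direct code would
# use PMDK/libpmem for correct flush/fence handling.
import mmap, os

path = "/mnt/pmem/demo.bin"                 # assumption: DAX-mounted fs
size = 4096

fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
os.ftruncate(fd, size)
buf = mmap.mmap(fd, size)                   # load/store access, no page cache
buf[0:5] = b"hello"
buf.flush()                                 # msync; libpmem would use clwb/sfence
buf.close()
os.close(fd)
```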
Starting point is 00:27:46 So this kind of gives you the broad spectrum of what is going to happen between now and next year to really optimize it more and more to take advantage of the NVMe SSD bandwidth. ...to make sure that for the VM, we bind it to a particular socket? How is that being controlled? How is data flowing in? So with the VMs, essentially there are two different ways of integrating when it comes to Ceph. One is QEMU; user mode virtualization is the dominant implementation when it comes to OpenStack, so there is a back-end driver for QEMU which is called librbd, which is
Starting point is 00:28:30 based on the user mode stack, so we can essentially do the optimization on the user mode side. There is also a kernel mode driver for the container framework, so if you want to instantiate containers that are really backed by Ceph, you will have to use the kernel RBD. Most of our focus is essentially the user mode QEMU layer, so the whole caching stuff that I'm talking about is really looking at the user mode side. There are lots of kernel mode, device mapper based caching options that you can use, like dm-cache, flashcache, bcache, and there is an Intel caching software. So you can pair up with any one of them in kernel mode and take advantage
Starting point is 00:29:07 of it, but our focus is the user mode when it comes to the client. Yeah, you do have to optimize. So the way it works is, in Ceph it's not just about having a powerful system config on the storage node. You also need to make sure that the client side has enough bandwidth and is also NUMA optimized. Otherwise you are going to run into bottlenecks on the client side compared to the storage node. So you do have to comprehend both sides
Starting point is 00:29:46 of the configurations, but predominantly the biggest bottleneck is on the NVMe SSDs and the I/O side. If you don't do the NUMA optimization there, everything will fall apart. So that's a critical component. It doesn't mean you don't do it on the client side, but that's the critical component. Okay, so we're going to run out of time.
Starting point is 00:30:03 I want to make sure I give enough time for the vSAN side of the discussion. So I will hand it over to Swaroop to cover the VMware vSAN. All right, thanks, Reddy. So I'm going to describe what we are doing with vSAN and NVMe. So this is that part of the presentation.
Starting point is 00:30:27 So how many of you are aware of VMware's vSAN? Okay. Okay, so let me give you a quick overview. So we introduced virtual SAN, vSAN as we call it internally, in March of 2014. That was the first release of vSAN. Essentially what vSAN does is it provides cluster storage across multiple server nodes. And the beauty of vSAN is it's embedded within the vSphere kernel itself,
Starting point is 00:30:59 so it's not a virtual storage appliance running on top of ESX. And once you have server nodes, and these can be any server nodes, right? That's how we differentiate against some of our competitors where you can buy servers either from HP, Dell, Cisco, Lenovo, Fujitsu, whoever is your favorite vendor. The only thing we ask is you buy a bunch of SSDs, a bunch of HDDs or SSDs, and essentially vSAN will club the SSDs and HDDs from the different servers and make it into a giant data store. There's no concept of LUNs, no concept of volumes.
Starting point is 00:31:39 There is no traditional SAN fabric, no SAN fabric switching; none of that exists. Basically the VM I/O comes into the vSAN storage stack and it goes to the devices underneath. And in terms of availability, we provide replicas for each of the VMDK objects which are persisted on the vSAN datastore. So your entire host may go down, but the replica is available on another node, so vSAN basically is able to
Starting point is 00:32:12 retrieve the data and serve it back to the VM. So this all forms a kind of new industry paradigm which is coming; it's called hyper-converged infrastructure. The hyper-converged part is important because all the I/O is being processed within the kernel, within the hypervisor itself. The hyper-converged infrastructure consists of the compute stack, the networking stack, the storage stack, and all of this is layered with a pretty comprehensive management control plane on top of it, right? And so that's what hyper-converged infrastructure
Starting point is 00:32:57 is defined as, and it's different from what you may have heard of as converged infrastructure, where the hypervisor may not play as critical a part as it plays in hyper-converged infrastructure. So that's kind of the basic difference between HCI and CI. So we saw an overview; I already talked about it. This is completely software-defined. So there is
Starting point is 00:33:25 no proprietary hardware component. We do not use any specific ASIC or FPGA for dedupe or compression, encryption, etc.; these are all standard components which are shipped with every Intel server out there. It's a distributed scale-out architecture, so you essentially have vSAN embedded within the kernel on each node, and they all are communicating over a 10 GigE link between them. And it's integrated well within the vSphere platform. Since we are VMware, we ensure the interop
Starting point is 00:34:00 with every component out there, whether it's vMotion, Storage vMotion, HA, DRS, vRealize Operations, which is our management suite; all of them are integrated well with vSAN. And it's a policy-driven control plane, so right from the start we actually designed it to be on a per-VM policy basis. What does that mean? On a per-VM basis, you can set, hey, what should the performance look like?
Starting point is 00:34:30 What should the availability look like? Should I make two replicas of the VMDK? Should I make three replicas, et cetera? Those are some of the policies which you can set, in addition to, of course, the capacity policies, which you can also set on a per-VM basis. It's very different from traditional storage, where most of your policies are LUN based or volume based. Virtual Volumes, which is another initiative, another product from VMware, also now solves per-VM based policies for traditional storage. But for vSAN, we essentially built the product from the ground up with the per-VM policies in mind.
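To make the per-VM policy idea concrete, a small sketch of how a mirroring policy translates into raw capacity: with failures-to-tolerate (FTT) = n under mirroring, vSAN keeps n+1 copies of each object. The numbers are illustrative only.

```python
# Illustrative: raw capacity consumed by a VMDK under a per-VM mirroring
# policy, where failures-to-tolerate (FTT) = n means n + 1 replicas.
def raw_capacity_gb(vmdk_gb, failures_to_tolerate):
    replicas = failures_to_tolerate + 1
    return vmdk_gb * replicas

for ftt in (1, 2):
    print(f"100 GB VMDK, FTT={ftt}: {raw_capacity_gb(100, ftt)} GB raw")
# FTT=1 -> 200 GB raw (2 replicas), FTT=2 -> 300 GB raw (3 replicas).
```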
Starting point is 00:35:06 So vSAN kind of operates in two modes: we essentially have something called the hybrid mode, and we have an all-flash mode. What I mean by that is we have two tiers within vSAN. One is the caching tier, where we have a read cache and a write buffer, and the capacity tier is essentially the persistent store for vSAN. For the caching tier, we dictate that you have to have a flash device out there, so it could either be SSDs, PCIe based SSDs, or now we are getting a lot
Starting point is 00:35:57 of customers who are beginning to use NVMe based SSDs. For the capacity tier, you essentially have a choice of using HDDs, which makes it the hybrid mode, or you can use all flash, which is essentially SSDs also in the capacity tier. If you ask me what the adoption curve is for all flash versus hybrid with vSAN: if you had asked me last year, I would have said all flash is probably 10% of our adoption. But this year, if you ask me what the adoption is, it's more than 40 to 50%, right?
Starting point is 00:36:34 And even there, with NVMe, we are seeing: last year, June or July, when we were talking to Intel, we didn't have any NVMe device on the HCL. This year we have about 30 to 40 NVMes on the HCL, and there are a lot of customers from the beginning of the year who have been putting NVMe in the caching tier, and now we have more than 10-plus accounts who essentially have NVMe across both the caching and the capacity tier. So we
Starting point is 00:37:05 have seen some real evidence of how NVMe is being adopted within the industry. I forgot to mention the performance. I think we are one of the few vendors who actually state performance pretty explicitly: it's 40K IOPS per node for the hybrid mode and about 100K IOPS per host with all flash, and if you start putting in devices like PCIe SSDs or NVMe SSDs, sub-millisecond latencies become very, very common. So current vSAN all flash, this is how it looks. I mentioned the tier one caching; the tier two is all about data persistence. If you're in an all flash mode,
Starting point is 00:37:49 the caching SSDs are high performance, high endurance, very high throughput for caching the writes. We do not do any kind of read caching in the all flash mode, but the tier two data persistence is more read intensive, lower endurance, and generally less expensive than the tier one caching. Most of our customers use either SATA SSDs or even SAS SSDs in the capacity tier, and in the caching tier they were using SAS and SATA SSDs, but now it's generally becoming either PCIe SSDs or NVMe SSDs, and especially the Intel P3700s are
Starting point is 00:38:27 becoming pretty common out there, with the P3510 on the tier twos. We have space efficiency in 6.2, which is the latest release we have for vSAN. Earlier, before we introduced dedupe and compression, what we had to do was essentially make replicas of the object; an object for us you can essentially consider as a VMDK, so we are creating multiple objects based on the policy the user has set. 6.2 we introduced in March of this year; that was our latest release. We introduced dedupe and compression and erasure coding, which is essentially what we loosely use for RAID 5 and RAID 6. And depending on the workload, we have seen about 2x to 8x savings. If you're doing, for example, VDI full clones, that's about 7x to 10x savings. If you're doing Oracle single instance databases, et cetera,
Starting point is 00:39:32 compression is what would take effect there, and we are seeing anywhere from 5% to 25% or 30% compression ratios with Oracle. Performance: of course, if you're using all flash, in terms of IOPS we are about four times higher than hybrid vSAN. Again, hybrid vSAN is with HDDs used in the persistence tier. And when you start using devices like NVMe, sub-millisecond latency response times are what you could expect.
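A rough sketch of why the 6.2 space-efficiency features matter: RAID 5/6 erasure coding cuts the mirroring overhead, and dedupe/compression is applied on top. The overhead factors are the standard 3+1 and 4+2 layouts; the dedupe ratio used in the example is the ballpark VDI figure from the talk, and the simple "divide then multiply" model is only an approximation of how the savings compose.

```python
# Rough illustration of vSAN 6.2 space efficiency: protection overhead for
# mirroring vs. erasure coding, with an approximate dedupe ratio on top.
OVERHEAD = {
    "RAID-1, FTT=1": 2.0,     # 2 full replicas
    "RAID-1, FTT=2": 3.0,     # 3 full replicas
    "RAID-5, FTT=1": 4 / 3,   # 3 data + 1 parity
    "RAID-6, FTT=2": 6 / 4,   # 4 data + 2 parity
}

def raw_needed_gb(logical_gb, layout, dedupe_ratio=1.0):
    # Approximation: shrink the data by the dedupe/compression ratio,
    # then apply the protection overhead.
    return logical_gb / dedupe_ratio * OVERHEAD[layout]

print(raw_needed_gb(1000, "RAID-1, FTT=1"))                  # 2000 GB
print(raw_needed_gb(1000, "RAID-5, FTT=1"))                  # ~1333 GB
print(raw_needed_gb(1000, "RAID-5, FTT=1", dedupe_ratio=7))  # VDI-like, ~190 GB
```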
Starting point is 00:40:10 started out with vSAN 1.0 about two years back, we kind of since everybody knows that a storage 1.0 product you have to establish the credibility, make sure that it is, make sure that the industry kind of believes that it is an enterprise product so we kind of had that it is an enterprise product. So we kind of had the number of use cases very restricted. So we kind of had VDI, DR target, a few DevTest use cases out there. But two years down the line, we have opened it out for any use case out there. So there are customers who are deploying things like, you know,
Starting point is 00:40:46 Oracle RAC, MSCS, Exchange, SQL; every application out there is being deployed on vSAN. And we have some pretty interesting use cases. For example, Lufthansa: their aircraft have about 300,000 sensors, and each of the sensors is sending data back to their main data center, where they are running some analytics, and it spits out kind of a report on how the flight did,
Starting point is 00:41:14 what are some of the things to take care of, and within an hour the technician has to address all the concerns in this report and get the flight back up and running. This is being done on their A380s, etc. Some very interesting cases. There are defense sectors where we find it's being deployed in particular form factors. It works very well because in some of these defense sectors they want particular form factors; they do not want
Starting point is 00:41:45 an appliance form factor which may not fit into a particular space, for example. So vSAN works beautifully for them, because they can choose particular hardware, just slap the vSAN software on top of it, and it runs. So some very interesting use cases out there. And I forgot to mention, we have about 5,000 customers, so it's growing pretty rapidly; it's one of the fastest growing storage products I have worked on, at least. So how is the market evolving? This is an IDC graph, and the IDC predictions kind of align with how we are seeing NVMe adoption growing within vSAN. We see that especially the SATA SSDs are going to kind of start trickling down, while the PCIe-based SSDs and the SAS SSDs will probably occupy more than 60 to 70% of the market.
Starting point is 00:42:46 And by 2018, what IDC projects is that 70% of the client SSD market will be NVMe based. So what are the benefits of vSAN with NVMe? I won't go through the overview; Reddy did a good job on that. But it's ideal for the caching tier. We are getting a lot of customers who are using NVMe today with vSAN, and we highly recommend that. There are a few customers who are beginning to use it
Starting point is 00:43:27 even in the persistent store also. We have NVMe devices certified for the vSAN caching tier and specifically for all flash configurations. You can also use it for the hybrid model. It works perfectly for that. And what we are doing specifically in the roadmap, and I will go through it in a little bit more detail is we are enhancing both the ESXi storage stack and the vSAN storage stack to make it effective enough to take some of the real advantages of NVMe and some of
Starting point is 00:44:02 the NVMe 1.2 features which are also coming up. So we'll talk about that, and Murali will also go into some more detail in that regard. This is how our HCL looks: Intel P3700s, the S3510s, all being listed there. This is a chart which Intel recently published. I know this chart may be a bit convoluted, but let me see if I can explain it. The bar graphs which you see correspond to capacity, while the line graphs correspond to IOPS. And Intel kind of did it in four modes. One is purely a performance deployment, where you really care about performance and nothing else. Or you may go to the other extreme, where you do not really care about the performance but you want to make your deployment highly, highly available.
Starting point is 00:45:04 Or you can have it capacity-based, where you want large capacities within that cluster. Or you may say, I want a slight balance between performance and capacity. As you see, this is the configuration which was used, about 1.2 million IOPS for the performance deployment. And of course, as you do a trade-off with availability
Starting point is 00:45:29 compared to performance, your performance, of course, tapers off quite a bit. But what this graph essentially shows is, with NVMe, the kind of dollar per IOPS and dollar per gig which you get is pretty remarkable. In fact, we get a lot of customers who ask us, hey, isn't NVMe too expensive? But these are the kind of graphs which we have been publicizing along with Intel
Starting point is 00:45:57 to show that the amount of performance which you get, at the cost at which NVMe comes, is pretty remarkable, and it's pretty encouraging for customers to use NVMe. Also, we did a raw comparison of taking an 8-node vSAN cluster and comparing it with an 8-node NVMe cluster; these are the kind of IOPS which we are getting, and these are some of the configurations which we have. What we also observed in terms of latencies was that on the SSD side we had about 1.5 millisecond latency, and with NVMe we were getting 1 millisecond latencies or slightly less than that. So what's coming next?
Starting point is 00:46:52 What's the future for NVMe within vSAN? Let me go through some of the ideas out there. The Intel Optane SSDs are the ones we are working on closely with Intel: some pretty impressive reductions in latency, approximately 10x compared to NAND SSDs, so that's kind of a direct benefit to vSAN, right, if you're going to use it in the caching tier or even within the capacity tier. We have done some measurements, along with Intel, of the application performance, the ESXi application performance; it's about two and a half times faster than NAND PCI Express.
Starting point is 00:47:43 And these are like raw numbers; we haven't really done any software optimizations. This is just taking out the SAS and SATA SSDs, plugging in NVMe SSDs, and these are the numbers which we are getting. There are a lot of software optimizations which we have started to work on.
Starting point is 00:48:06 So we did a comparison of how NVMe performs without any ESXi storage stack optimizations, which is what it is today. The gray bars are what you get, but we are also prototyping a lot of optimizations within the ESXi storage stack. Reddy mentioned how you can parallelize NVMe across multiple lanes, and those are some of the optimizations which we are introducing within the storage stack, and as you can see, making these changes within the ESXi storage stack is getting us quite a few benefits. Of course the prototype is not released; these are
Starting point is 00:48:51 all internal results as of now, but I thought I would show you what one can aspire to as they start using NVMe with vSAN, and perhaps even with ESXi. So I won't go through this, but let's look at what the next generation hardware offers with NVMe and also with NVDIMMs, or persistent memory. There are four areas which we are focusing on. One is the high speed NVMe, which we hope to leverage as it is and which will provide performance benefits to vSAN right away. There are ESXi storage stack enhancements which we are looking at, just as I mentioned on the previous slide.
Starting point is 00:49:36 We are also looking at, I mentioned to you that we have a two-tier architecture today, and there is no reason, especially with NVDIMMs, et cetera, coming into the market whenever they come, that we couldn't kind of collapse this into a single-tier all-flash architecture. What that would mean is using NVDIMMs for just the metadata while using SAS, SATA, or NVMe SSDs for the persistent store. Or we can still keep it as a two-tier architecture where you have the metadata in NVDIMMs,
Starting point is 00:50:11 you have the write cache as NVMe, and the SAS and SATA SSDs as your persistent store. We are also looking at RDMA over Ethernet to boost network transfers. As the device latencies start becoming smaller and smaller, the network latencies will start becoming the bottleneck, and that is why we are looking at RDMA to help address the networking challenges as well; right now it's not a challenge for us, but once these device latencies start shrinking down and down, the network latencies will start becoming an issue for us. So with that, I will give it to Murali to go over some of the native driver stack and some of the
Starting point is 00:50:55 technical details. So I think we're going to run out of time here, but what I wanted to share is a snapshot of where we stand in terms of NVMe support today. First let's look at our driver stack and where we play in terms of NVMe or complementary technologies. There are fundamentally two places where we actually do drivers. One is in the I/O stack; that is NVMe for the PCIe space as well as NVMe over Fabrics. Those are the two fundamental drivers. Along with that we have drivers for RDMA, which is like a RoCE driver. That's the driver stack right now. The other place is something we call virtual NVMe. Essentially that sits below a guest operating system and allows the guest operating system to have a native NVMe driver, so that it can use this virtual device to funnel in I/Os that are completely natively NVMe.
Starting point is 00:51:51 So those are the places where we play. And what I want to share is our future plans. So far we have support for the 1.0e spec, and that's supported both inbox as well as in an async driver. In the future, what we're going to do is try to support the 1.2 feature set; I can't speak about the exact timetable on this one, but it's coming soon. That deals with multiple namespaces, multiple queues, so this actually addresses the performance thing that Swaroop was talking about. We'll have a driver for NVMe over Fabrics, and we'll address RoCE to start with, the RoCE fabric, followed by other transports as our customers see a need. And finally, we have a plan to, I talked about the virtual NVMe,
Starting point is 00:52:52 the one that sits below the guest operating system. That should be very interesting for other operating systems which sit inside VMs. And finally, at some point, we plan to have a completely native NVMe stack. In other words, it will coexist with the existing stack for a while, but the outlook is that at some point we're going to completely transition to an NVMe stack. So that's the outlook for the coming years. And you can read about this; you can actually download the drivers today.
Starting point is 00:53:23 We have drivers for the 5.5 as well as the 6.0 ESXi platforms. Also, we have a project where we're inviting the device vendors to contribute to the 1.2 driver that's coming soon. And there's information in the links here. You can log in there and you can find more details, and I'll be happy to talk to you after this session. So that's fundamentally that. I think we'll leave some room for questions. Yeah. I mean, we can do questions maybe for two, three minutes. We probably have to give way to the next speaker.
Starting point is 00:54:02 So any questions on the vSAN side? I know we had a few during the talk. Do you have any idea how much CPU overhead, or server overhead, there is for this software stack? On vSAN in general? So what we advertise is less than 10%, right? We call it the vSAN tax; that's typically what we have. Now for dedupe and compression you also have to incur some additional CPU, right? And that is typically, again, less than 10% of the tax you have already paid with vSAN without dedupe and compression. So let's say you have 100 cores: you need 10 more cores for vSAN, and for dedupe and compression another 10 percent of that, which is 10 percent of 10, so one additional core; 11 cores in general if you turn on dedupe and compression.
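The core-count arithmetic above, spelled out in code (just restating the speaker's example, not an official sizing formula):

```python
# The vSAN CPU "tax" example above, spelled out.
total_cores = 100
vsan_tax = 0.10                       # ~10% of host cores for vSAN
dedupe_tax = 0.10                     # ~10% on top of the vSAN share

vsan_cores = total_cores * vsan_tax                  # 10 cores
dedupe_cores = vsan_cores * dedupe_tax               # 1 more core
print("vSAN only            :", vsan_cores, "cores")
print("vSAN + dedupe/compress:", vsan_cores + dedupe_cores, "cores")   # 11
```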
Starting point is 00:54:53 So you have to go through kind of a TCO calculation, right, to basically figure out: okay, I'm going to enable vSAN on this; how many more CPUs do I need to put within the cluster? So that's how it works. Do you have a mix of thin and thick provisioning in your environment? Yes, yes. So the default is thin, but you can toggle between thin and thick if you would like to. In terms of consistency, is it an AP system or a CP system?
Starting point is 00:55:43 An AP system, as in the CAP theorem: AP or CP? Consistency versus availability, eventual consistency or strong consistency? So this is all strong consistency, right? Essentially, as soon as the write happens, when the writes are replicated to all the replicas, that's when the acknowledgment is sent back to the client. So even though it's an object-based system,
Starting point is 00:56:10 so I know that other object-based systems may be eventually consistent, but with vSAN, it's strongly consistent. Any other questions? So in the hybrid model, the caching tier is both for read cache and write buffering. In the all-flash model, the caching tier is only for write buffering; the reads happen directly from the capacity tier.
Starting point is 00:57:06 So you have the write buffer on each of the nodes, right, and it is replicated? It's the write buffer on each of the nodes, yes. All right. Thank you. Thanks for your time. Appreciate it. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org.
Starting point is 00:57:39 Here you can ask questions and discuss this topic further with your peers in the developer community. For additional information about the Storage Developer Conference, visit storagedeveloper.org.
