Storage Developer Conference - #30: Bridging the Gap Between NVMe SSD Performance and Scale Out Software

Episode Date: December 8, 2016

...

Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast. You are listening to SDC Podcast Episode 30. Today we hear from Anjaneya "Reddy" Chagam, Principal Engineer with Intel; Swaroop Dutta, Director of Product Management at VMware; and Murali Rajagopal, Storage Architect, VMware,
as they present Bridging the Gap Between NVMe SSD Performance and Scale-Out Software from the 2016 Storage Developer Conference. So welcome to Bridging the Gap Between NVMe SSD Performance and Scale-Out Software. My name is Reddy Chagam. I'm a Principal Engineer and the chief SDS architect in the Intel Data Center Group. I have two co-presenters with me; I'll let Swaroop and Murali introduce themselves.
Sure, thanks. My name is Swaroop Dutta. I manage the product management team at VMware in the Storage and Availability business unit. Thanks, Swaroop. My name is Murali Rajagopal. I'm a storage architect in the Office of the CTO at VMware. Fantastic.
Okay, so we are going to cover two broad scale-out software stacks. I am specifically going to focus on the Ceph side, and Swaroop and Murali are going to focus on the VMware vSAN stack. The theme of this presentation is what exactly we are doing within the context of NVMe SSDs to speed up the software side of these two stacks, and we'll also give a flavor of the planned work going forward. How many of you know the NVMe protocol — heard about it, familiar with the concept? Okay, so a few. Then I'm going to give a very high-level view into what exactly an NVMe SSD is and how it plays into the following discussion on Ceph.

If you look at NVMe, it is essentially the standard software interface between the host and PCIe-attached SSD media. Because it's a standard software interface, you can plug in a PCIe-attached NVMe SSD from any vendor and it should work without changing the host software. That's the number one thing. Second, it uses queuing as the foundational construct for talking to the media: there is a set of admin queues for setting up the SSD, plus submission queues and completion queues, so you use the queuing construct to communicate with the media in a standard way.

Of course, it is attached to PCIe lanes, which means there is direct connectivity from the CPU to the NVMe media — no intermediate chipset component like a host bus adapter in the path. That matters because, being attached directly to the PCIe lanes, you can take advantage of the bandwidth those lanes provide. With PCIe Gen 3, one lane gives you roughly one gigabyte per second of bandwidth, and most NVMe SSD cards out there are x4 or x8. x4 NVMe SSD cards typically give you roughly half a million IOPS for reads and somewhere around 200K IOPS on the write side; with an Optane SSD in an x8 PCIe slot you can actually get to a million IOPS. So you can get a significant amount of throughput and performance out of NVMe SSDs because they are directly attached, and you can scale based on the number of lanes (the arithmetic is sketched in the example below). The other thing is that you don't need a host bus adapter such as a SAS controller, so you get a cost reduction and a power reduction in addition to the performance.

If you look at a Broadwell CPU, per socket you can get 40 PCIe lanes. A bunch of them will be shared with networking, but you still get enough PCIe lanes to drive the bandwidth that is needed for compute as well as networking. There are a few different form factors: M.2 is really meant for boot-drive use cases; there is the small form factor, which is U.2; and then there is an add-in card. So you can use different form factors based on the system configs you're after and the density and price points you're really looking for. All right, that's the brief overview of what the NVMe protocol is about, what an NVMe SSD is, and what the benefits are. Now I'm going to jump directly into the Ceph side.
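To make the lane arithmetic above concrete, here is a small back-of-the-envelope sketch in Python. The per-lane bandwidth and the device figures are just the round numbers quoted in the talk; nothing below is a measurement.

```python
# Rough PCIe/NVMe arithmetic using the round numbers quoted in the talk:
# ~1 GB/s usable per PCIe Gen3 lane, x4/x8 devices, and ballpark device IOPS.
GEN3_GB_PER_LANE = 1.0  # approximate usable bandwidth per Gen3 lane, GB/s

def link_bandwidth_gb_s(lanes: int) -> float:
    """Peak host link bandwidth for an NVMe device with the given lane count."""
    return lanes * GEN3_GB_PER_LANE

def link_iops_ceiling(bandwidth_gb_s: float, io_size_kib: int = 4) -> int:
    """Upper bound on 4 KiB IOPS the link itself could carry (devices are usually lower)."""
    return int(bandwidth_gb_s * 1e9 / (io_size_kib * 1024))

if __name__ == "__main__":
    for lanes in (4, 8):
        bw = link_bandwidth_gb_s(lanes)
        print(f"x{lanes} link: ~{bw:.0f} GB/s, ~{link_iops_ceiling(bw):,} 4 KiB IOPS at line rate")
    # Ballpark device numbers from the talk, for comparison:
    print("typical x4 NAND NVMe SSD: ~500K random read IOPS, ~200K random write IOPS")
    print("Optane-class SSD in an x8 slot: ~1M IOPS")
```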
How many of you have heard about Ceph, know Ceph? Okay, a reasonable number. So Ceph is open-source, scale-out storage software. It is built on the foundation of an object interface called RADOS — that is the foundation layer. You can take that RADOS foundation layer and deploy it on a pool of standard high-volume servers.

So you can take a bunch of machines, put the Ceph software on top, and you get a scale-out storage stack that is protected beyond a host failure — you can have a host failure or a rack failure and still make sure the data is protected. So it gives you the scale-out property, and it also gives you durability: you can have multiple copies or erasure-coded pools, and you can mix and match — all kinds of interesting things you can do to protect your data.
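As a concrete (if minimal) illustration of that RADOS object foundation, here is a sketch using the python-rados client bindings. The pool name and ceph.conf path are placeholders for whatever a given cluster uses, and whether the pool is replicated or erasure-coded is a property of the pool itself — the client code looks the same either way.

```python
# Minimal python-rados sketch: connect to a Ceph cluster, write one object into a
# pool, and read it back. The replication / erasure-coding protection described in
# the talk happens server-side, based on how the pool was created.
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")  # assumes a reachable cluster
cluster.connect()
try:
    ioctx = cluster.open_ioctx("mypool")  # hypothetical pool name
    try:
        ioctx.write_full("hello-object", b"stored and protected by RADOS")
        print(ioctx.read("hello-object"))
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```

The block, file, and object interfaces mentioned next (RBD, CephFS, and the RADOS Gateway) are all layered on top of this same object API.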
Lots of features are related to things like: when something goes bad — a media device goes bad — you should be able to detect it and automatically re-replicate to ensure there is high availability in the cluster. So anything and everything you can think of in scale-out software, you see those properties in Ceph as well. It's open-source software, primarily supported by Red Hat, with lots of community contributions from big storage vendors including Intel, Samsung, SanDisk, and Mellanox, and lots of service providers contributing to the community work as well. It's popular for block storage within the context of OpenStack deployments, so wherever you see an OpenStack deployment, Ceph tends to be the choice for the block storage backing your virtual machines.

It provides three different constructs that are very important for lots of customers. You can use it as a shared file system, which is called CephFS. You can also use it as a distributed block layer, which is called RBD. Or you can use it through the object interface, which is the RADOS Gateway — REST APIs, S3-compatible as well as Swift-compatible. So it gives you those three access methods on top of the foundation, which is the object layer.

From the NVMe workloads perspective, we look at three different groupings of workloads. If you look at the chart, there is capacity on the x-axis and performance on the y-axis, and three types of workloads. One is high-IOPS, low-latency workloads, which are typically database-type workloads. Then you are looking at throughput-optimized workloads — things like content delivery networks, VDI, cloud DVR, and so on; big data workloads sit at the top. And then there are the archival, capacity-type workloads, which are really object-based workloads. We look at NVMe relevance across all three of these spectrums, as opposed to looking at NVMe SSDs just for the low-latency workloads — which is very important to remember, because if you look at the trend, there are NVMe SSDs that are low-latency types of media out there — an Optane SSD will give you a million IOPS at less than 10 microseconds of latency — so you're really looking at those for the low-latency workloads. And then on the capacity side, lots of vendors out there, including Intel, have 3D NAND-based media aimed at capacity-oriented workloads: mostly content delivery, archive, backup, object-based workloads. So think of it as NVMe SSDs having relevance across the board,
not just the low-latency segment. So if you zoom into the Ceph architecture, the top portion is the Ceph client and the bottom portion is the Ceph storage nodes. If you look at the top portion, the way we are looking at NVMe SSDs is this: today Ceph does caching, specifically for block workloads, using DRAM as the caching layer, and we are looking at extending that with NVMe SSDs so we can give much better cache real estate on the client nodes and bring in the value prop of NVMe SSDs. That's the focus area on the client side. Then on the back end, in the storage nodes where you have Ceph, there are two different configurations today. One is production ready, which is called the FileStore back end, and the other one is BlueStore, which is currently in tech preview. Both configurations can take advantage of flash — NVMe SSDs to speed up your writes, and a read cache to speed up your reads — so both scenarios are pretty relevant for the storage nodes as well. That's the intersection point where NVMe SSDs make sense within the context of Ceph.

From the consumption perspective, as an end user you are really looking at three different configurations — again, just looking at NVMe SSDs and the kinds of configurations we typically see in end-customer deployments. Standard is where you have a mix of NVMe SSDs paired with hard disk drives, and you use the NVMe SSDs to speed up your writes as well as to cache and service the reads. So it's meant for both write and read caching paired with the hard disk drives. Typically, one high-endurance NVMe SSD — the P3700 is the Intel high-endurance NVMe SSD card — can be paired with around sixteen 4 TB hard disk drives; that's the ratio based on the benchmarking. The next one is the better configuration, which is balanced with the best TCO. There you are looking at a combination of NVMe SSDs with SATA SSDs as a way to get all-flash, low-latency types of SKUs. Normally you pair one NVMe SSD with around six SATA SSDs, use the NVMe SSD to speed up your writes, and the reads get serviced from the SATA SSDs. And the best configuration is essentially everything NVMe. Based on the current testing we have done and all the optimizations, we are really looking at around four NVMe SSDs per node as the design endpoint — beyond that, Ceph will not scale. We are looking at lots of things to optimize that, but currently that's the recommended configuration. For all of these to work, you really need to make sure the CPU has enough power, irrespective of which configuration you choose, and that you have enough networking bandwidth. So you do need a balanced system config from a compute, storage, and networking perspective to make sure it can scale and stay optimized.

Yeah — so you said that Ceph does not scale beyond four NVMe SSD devices; what's the reason for that? Yeah, let me go over the details and give some glimpse into where the challenges are. So the question is why Ceph is not scaling beyond four NVMe SSDs. If you look at Ceph's design point, it started ten years ago, and SSDs in those days were pretty much unheard of.
So most of the design theme has been: I need to scale across nodes, because my hard disk drives give me limited throughput, and the way I scale is with a scale-out design pattern across many nodes. So it is really designed for hard disk drives, and then things evolved in terms of speeding up the reads and writes using NVMe SSDs as an enhancement. Now we are looking at the fact that there are lots of bottlenecks in terms of threading that we really need to optimize — that's the focus. The software needs to be optimized; there are lots of threads. It was designed for hard disk drives, so certain things you could get away with and it was not a big deal, but with NVMe SSDs your software has to be really lean and the path should be extremely narrow, and that's the area that needs work.

So from your previous slide, you're actually on both the FileStore and the BlueStore? I thought that for SSDs you would be using BlueStore, which is basically optimized for flash, and then you would not run into those variations. Yeah — so if you look at the Ceph data path, the way I look at it, there is a networking section, the I/O coming into the networking layer in Ceph; then there is the core data path that really makes the decisions about where the data is distributed, how you protect the data, and the whole mechanism of how you group the data and how you shard the data — that whole logic is in the middle layer; and then the bottom layer is the core I/O out to the media. BlueStore comes in at that bottom layer. With the BlueStore changes you are bypassing the file system, so it is directly using the raw media — you are speeding up the I/O to the media, and the layer above it will be optimized. There is a networking optimization that is pretty close, with RDMA and the XIO messenger pattern, so you can get at the networking stack as well. The piece that really needs work is in the middle — the OSD layer. There are lots of choke points where everything waits on, let's say, PG locks, and
those choke points need to be eliminated for us to really speed up on NVMe SSDs. Good question. So I'm going to spend some time on the current state when you look at NVMe SSD performance. For the performance work we focused on MySQL as the key benchmark, the reason being that it is very popular — when I talked about OpenStack, where Ceph is popular, lots of customers are looking at MySQL as one of the key workloads — so we picked it because of the relevance as well as the popularity. We also do lots of synthetic benchmarks: random I/O as well as sequential I/O, reads and writes with different sizes, to see where the optimization points are. So we looked at the synthetic benchmarks as well as MySQL as a way to assess the current performance and where the bottlenecks are.

The system config I'm going to talk about is a five-node Supermicro cluster. It actually has ten slots: you can put six NVMe SSDs on one NUMA node and four on the other, so six plus four is what you have with the Supermicro. But in the testing we did, we played around with every configuration and settled on four NVMe SSDs per node, with each NVMe SSD partitioned into four logical partitions, and a Ceph OSD managing each one of them. That's the only way to get real scaling with the current configuration: partition your NVMe SSD into four and have each partition managed by one object storage daemon, which is essentially the critical Ceph component that manages the storage media. On the client side, per node we used four Docker containers: two containers are the clients actually running the benchmark, which is called sysbench, and two of them are really doing the work — hosting the MySQL database server.

Okay, so with that, there are certain things we have to optimize when you look at NVMe SSDs and NUMA system configs. The key thing is, if you look at the right-hand side, we have partitioned the system in such a way
that your compute, memory, NVMe SSDs, and networking are attached to the same NUMA node, so you are not going across the inter-socket link to get the job done when the I/O comes in. That is a very critical component when you really want to take advantage of NVMe SSDs — how they are physically partitioned in the system, which is NUMA-aware partitioning. That's one key thing. The second thing is you need to make sure your NVMe and NIC device interrupts are pinned to the same processor — soft-IRQ balancing and everything needs to be on the same NUMA node, not the other one, so that you're not incurring that overhead. And then, for a given NUMA node, because we had four NVMe SSDs each partitioned into four subsections, we ended up having 16 OSDs, so you really need a fairly high-end Xeon E5 SKU — around an E5-2690 or beyond is typically what you need — with networking that is at least dual 10 gig, but preferably 40 gig or beyond.

When you partitioned the NVMe into four, did you put the journals on the same disk, or would you have a different disk for the journals? Yeah, excellent question. The testing here is focusing on the BlueStore back end, not the FileStore back end. So you don't need a journal with BlueStore — with FileStore you need a journal. With BlueStore, essentially the metadata and the actual data sit on the same partition. Yeah, you can do that, but the problem is, if you look at the P3700, you're getting around 400K 4K random read IOPS and around 200K random write IOPS. If you put one OSD on it, you're probably going to get at most maybe 10 to 15K, or maybe 20K, IOPS out of the device. So if you really want to optimize the device throughput, you have to slice it into four and put more OSDs on it. That's not the long-term direction we want to be in — we really want to address the performance so that one OSD can manage one NVMe SSD and drive the throughput — but we are not there yet, so it's a workaround more than anything else to squeeze the maximum performance out of the NVMe SSD. If you put just one OSD, there is no way you can get that performance back. Okay, so here is the first chart: it is all about random read and write performance with five nodes and four NVMe SSDs per node.
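A quick aside before the numbers: the four-way split falls straight out of the per-OSD ceiling just described. Here is a rough back-of-the-envelope sketch, using the ballpark figures quoted in the talk (illustrative round numbers, not measurements from this script):

```python
# Why one NVMe SSD gets carved into four OSD partitions, using the talk's rough
# numbers: a P3700-class device can do ~400K 4K random read IOPS, while a single
# OSD on this codebase tops out around 10-20K IOPS.
DEVICE_READ_IOPS = 400_000
PER_OSD_IOPS = 20_000        # optimistic single-OSD figure from the talk
OSDS_PER_DEVICE = 4          # the compromise the presenters settled on

single_osd_utilization = PER_OSD_IOPS / DEVICE_READ_IOPS
four_osd_iops = OSDS_PER_DEVICE * PER_OSD_IOPS

print(f"one OSD uses only ~{single_osd_utilization:.0%} of the device's read IOPS")
print(f"four OSDs on four partitions: ~{four_osd_iops:,} IOPS per device")
print(f"the device still leaves ~{DEVICE_READ_IOPS - four_osd_iops:,} IOPS on the table, "
      f"which is why the longer-term goal is one efficient OSD per NVMe SSD")
```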
I can actually get to 1.4 million IOPS at close to one millisecond latency, and you can stretch it all the way up to 1.6 million IOPS at roughly 2 to 2.2 milliseconds of latency. That's pretty good when you compare with all-flash system configs and the price that you pay: here you are talking about open-source software, a standard high-volume server, and SSDs of your preferred choice from your preferred vendor. So you have a lot of flexibility to put together your own system config with open-source software, and you can get beyond a million IOPS in a five-node cluster — that is fairly amazing. It doesn't mean it is completely optimized from an efficiency perspective, in terms of whether it can take full advantage of the media throughput, but it is somewhat at parity with what you normally see from most storage stacks, including storage appliances out there. And if you look at the database performance, we can get to 1.3 million queries per second with 20 clients driving the SQL workload, each one having eight threads — so you're really looking at 160 threads hitting this five-node cluster — and you get to 1.3 million queries per second, somewhere around that range. This is the 100% random read case. In the interest of time I just touched on these two charts; there are a lot of details in the backup as well. The notion is that you can take NVMe SSDs, optimize the system config for the NUMA nodes, and get a reasonable amount of performance and acceptable latency even for database workloads — to give you a flavor that it is ready for deploying low-latency workloads.

Now, briefly, what's going on in the community, at a fairly high level. Caching work is not done yet on the client side, so we are looking for the caching work on the client side to speed things up quite a bit. The caching pattern is going to be a crash-consistent, ordered, write-back cache shared on a compute node. We are looking at those properties as a way to really speed up VDI workloads; ephemeral compute instances can benefit immensely from that kind of property. That one is still in progress — we are looking at sometime next year for a production-ready caching solution that can expand beyond DRAM and take advantage of NVMe SSDs on the compute nodes. The second one is compression, which is currently in progress, and we are looking at dedupe as an extension to it, as a way to optimize the actual storage capacity for flash-based back ends — because without dedupe the value prop quickly goes away with a flash-based back end, so dedupe is a very critical component from an efficiency perspective. Internal design reviews for that work are going on, and we are looking to upstream it in the community. Long-tail latency: I didn't touch on the long-tail latency — I talked about the average latency, which is around one millisecond or so. The goal is to make sure that the long-tail latency beyond the 99th percentile is tolerable and not more than a few milliseconds. So the goal is to optimize the long-tail latency as well, which is absolutely needed for low-latency workloads. And then we touched upon this one: yes, we can take an NVMe SSD, which is a very performance-oriented device, and carve out four slices as an optimal design point, but that also adds lots of threads, lots of memory, lots of networking, and lots of compute overhead. The goal is to go back to the design pattern and ask what we can do to optimize the storage stack in the OSD to really take advantage of the NVMe SSD
throughput in a much more efficient way. There are two things we are looking at from a broad focus perspective. User-mode implementation is the theme we are pushing in the Ceph implementations. There is user-mode networking: instead of doing lots of context switching, the goal is to do user-mode networking using the Data Plane Development Kit (DPDK) as the foundation — very popular for comms workloads, and we are trying to use it for storage workloads. And there is a talk in this room right after this session — I think it is at 3 o'clock, right, Ben? Ben is going to talk about the Storage Performance Development Kit (SPDK). He's going to talk about how you optimize device access using SPDK, which is essentially built on top of the DPDK foundation but looks at storage workloads: what are the things we really need to move into user space, and how do we optimize them. So the goal is to take advantage of the existing work that is happening, move as much as possible into the user-mode stack, optimize, and address the efficiency problem we currently have. And then the last one is more to do with persistent-memory integration. We are looking at the Linux DAX extensions as well as App Direct, which is the direct integration to the media — those are the two constructs for integrating BlueStore with persistent memory.
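For readers unfamiliar with the DAX model mentioned here, the sketch below is a loose, minimal illustration (it is not Ceph or BlueStore code). On a filesystem mounted with the DAX option on a persistent-memory device, an mmap() of a file maps the media directly into the address space, so loads and stores bypass the page cache; the mount point used below is hypothetical.

```python
# Loose illustration of DAX-style access: map a file and write to it through the
# mapping. On a DAX-capable mount this is direct access to the persistent media;
# on an ordinary filesystem the same code still runs, it just goes via the page cache.
import mmap
import os

PATH = "/mnt/pmem/dax-demo.bin"   # hypothetical DAX-mounted location

fd = os.open(PATH, os.O_CREAT | os.O_RDWR, 0o600)
os.ftruncate(fd, 4096)            # reserve one page
buf = mmap.mmap(fd, 4096)         # load/store access through the mapping
buf[0:12] = b"hello, pmem "
buf.flush()                       # msync; on real pmem think cache-flush + fence semantics
buf.close()
os.close(fd)
```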
So that can bring the persistent-memory value prop to those specific use cases. This gives you the broad spectrum of what is going to happen between now and next year to optimize Ceph more and more to take advantage of NVMe SSD bandwidth.

Now, with VMs there are essentially two different ways of integrating when it comes to Ceph. One is QEMU — user-mode virtualization is the dominant implementation when it comes to OpenStack — and there is a back-end driver for QEMU called librbd, which is based on the user-mode stack. So we can do the optimization on the user-mode side. There is also a kernel-mode driver for container frameworks: if you want to instantiate containers that are backed by Ceph, you will have to use the kernel RBD driver. Most of our focus is the user-mode QEMU layer, so the whole caching effort I'm talking about is really looking at the user-mode side. There are lots of kernel-mode, device-mapper-based caching options you can use — bcache, flashcache, dm-cache — and there is also Intel caching software, so you can pair up with any one of them in kernel mode and take advantage of it, but our focus is the user mode when it comes to the client.

And yes, you do have to optimize the client too. The way it works in Ceph, it's not just about having a powerful system config on the storage node — you also need to make sure that the client side has enough bandwidth and that it is also NUMA-optimized. Otherwise, you are going to run into bottlenecks on the client side instead of the storage node. So you do have to consider both sides of the configuration, but predominantly the biggest bottleneck is on the NVMe SSD and I/O side — if you don't do the NUMA optimization there, everything falls apart, so that's the critical component. It doesn't mean you don't do it on the client side, but that's the critical piece. Okay, we're going to run out of time, so I want to
Starting point is 00:30:02 make sure I give enough time for the vSAN side of the discussion. So I will hand it over to Swaroop to cover the VMware vSAN. All right, thanks, Reddy. So I'm going to describe how, what we are doing with vSAN and NVMe. So this is that part of the presentation. So how many of you are aware of VMware's vSAN? Okay, so let me give you a quick overview. So we introduced Virtual SAN, vSAN as we call it internally, in March of 2014. That was the first release of vSAN. Essentially what vSAN does is it provides cluster storage across multiple server nodes. And the beauty of vSAN is it's embedded within the vSphere kernel itself,
Starting point is 00:30:58 so it's not a virtual storage appliance running on top of ESX. And once you have server nodes and these can be any server nodes right that's how we differentiate against some of our competitors where you can buy servers either from HP, Dell, Cisco, Lenovo, Fujitsu whoever is your favorite vendor the only thing we ask is you buy a bunch of SSDs a bunch of HDDs or SSDs, and essentially vSAN will club the SSDs and HDDs from the different servers and make it into a giant data store. There's no concept of LUNs, no concept of volumes, there is no traditional SAN fabric,
SAN fabric switching — none of that exists. Basically, the VM I/O comes into the vSAN storage stack and goes to the devices underneath. In terms of availability, we provide replicas for each of the VMDK objects persisted on the vSAN datastore, so your entire host may go down, but the replica is available on another node, and vSAN is able to retrieve the data and serve it back to the VM. So this all forms kind of a new industry paradigm that is coming, called hyper-converged infrastructure. The "hyper" part is important because all the I/O is being processed within the kernel, within the hypervisor itself. Hyper-converged infrastructure consists of the compute stack, the networking stack, and the storage stack, and all of this is layered with a pretty comprehensive management control plane on top of it. So that's how hyper-converged infrastructure is defined, and it's different from what you may have heard of as converged infrastructure, where the hypervisor does not play as critical a part as it plays in hyper-converged infrastructure. That's the basic difference between HCI and CI.

So, an overview — I already talked about it: this is completely software-defined, so there is no proprietary hardware component. We do not use any specific FPGA for dedupe or compression, encryption, etc.; these are all standard components which are shipped with every Intel server out there. It's a distributed scale-out architecture, so you essentially have vSAN embedded within the kernel on each node, and they all communicate over a 10 GigE link between them. And it's integrated well within the vSphere platform — since we are VMware, we ensure that the interop
Starting point is 00:33:59 with every component out there, whether it's vMotion, storage vMotion, HA, DRS, vRealize operations, which is our management suite, all of them are integrated well with vSAN. And it's a policy-driven control plane. So right from the design up, we actually designed it to be on a per VM policy. What does that mean? You can, on a per VM basis, you can set, hey, how should the performance look like? What should the availability look like? Should I make two replicas of the VMDK? Should I make three replicas, et cetera? And those are some of the policies which you can set in addition to, of course,
the capacity policies, which you can set on a per-VM basis. It's very different from traditional storage, where most of your policies are LUN-based or volume-based. Virtual Volumes, which is another initiative, another product from VMware, also now solves per-VM-based policies for traditional storage; but with vSAN we built the product from the ground up with the per-VM policies in mind.

So vSAN operates in two modes: we essentially have something called the hybrid mode, and we have an all-flash mode. What I mean by that is we have two tiers within vSAN. One is the caching tier — we have a read cache and a write buffer — and the capacity tier is essentially the persistent store for vSAN. For the caching tier we dictate that you have to have a flash device there, so it could be SSDs or PCIe-based SSDs, or — now we are getting a lot of customers who are beginning to use them — NVMe-based SSDs. For the capacity tier you have a choice: you can use HDDs, which makes it the hybrid mode, or you can use all flash, which is essentially SSDs in the capacity tier as well. If you ask me what the adoption curve is for all-flash versus hybrid with vSAN: if you had asked me last year, I would have said all-flash is probably 10%
Starting point is 00:36:26 of our adoption but this year if you ask me as to what the adoption is it's more than 40 to 50% right and even there with NVMe we are seeing like last year July or August when we were talking to Intel we didn't have any NVMe device on the HCL last year, June or July. This year we have about 30 to 40 NVMes on the HCL and there are a lot of customers from the beginning of the year who have been putting NVMe in the caching tier and now we have about more than 10 plus accounts who essentially have NVMe across both the caching and the capacity tier. So we are seeing some real evidence of how NVMe is being adopted within the industry. I forgot to mention about the performance. I think we are one of the few vendors
who actually state performance numbers pretty explicitly: it's 40K IOPS per node for the hybrid mode and about 100K IOPS per host with all-flash, and if you start putting in devices like PCIe SSDs or NVMe SSDs, sub-millisecond latencies become very, very common.

So this is how the current vSAN all-flash configuration looks. I mentioned the tier-1 caching; tier two is all about data persistence. In an all-flash configuration, the caching SSDs are high-performance, high-endurance, very high-throughput devices for caching the writes — we do not do any read caching in all-flash mode — while the tier-two persistence layer is more read-intensive, lower-endurance, and generally less expensive than the tier-one caching devices. Most of our customers used to use SAS or SATA SSDs out here, but now it's generally becoming either PCIe SSDs or NVMe SSDs, and especially the Intel P3700s are becoming pretty common in the caching tier, with the P3510 in tier two.
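To put some rough numbers on the two-tier layout just described, here is a small sizing sketch. The mirroring overhead follows from the replica math discussed earlier (FTT+1 copies of each object); the "cache tier ≈ 10% of anticipated consumed capacity" figure is a commonly cited VMware sizing rule of thumb, used here only as an illustrative default, and the cluster in the example is hypothetical.

```python
# Rough vSAN-style sizing arithmetic: usable capacity under mirroring, plus a
# cache-tier estimate based on the ~10% rule of thumb.
def usable_capacity_tb(raw_tb: float, failures_to_tolerate: int = 1) -> float:
    """Usable capacity when objects are mirrored with FTT+1 replicas (RAID-1)."""
    return raw_tb / (failures_to_tolerate + 1)

def cache_tier_tb(consumed_tb: float, ratio: float = 0.10) -> float:
    """Suggested flash cache size given anticipated consumed capacity."""
    return consumed_tb * ratio

if __name__ == "__main__":
    nodes, capacity_ssds_per_node, ssd_tb = 8, 4, 1.6   # hypothetical all-flash cluster
    raw = nodes * capacity_ssds_per_node * ssd_tb
    usable = usable_capacity_tb(raw, failures_to_tolerate=1)
    print(f"raw {raw:.1f} TB -> ~{usable:.1f} TB usable at FTT=1 (mirroring)")
    print(f"suggested cache tier across the cluster: ~{cache_tier_tb(usable):.1f} TB")
```

Erasure coding (the RAID-5/6 option mentioned next) changes the overhead factor from 2x to roughly 1.33x or 1.5x, which is where part of the space-efficiency gain comes from.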
We have space efficiency as of 6.2, which was the latest release of vSAN. Earlier, before we introduced dedupe and compression, what we had to do was essentially make a replica of the object — and an object for us is essentially a VMDK, so we create multiple objects based on the policy the user has set. In 6.2, which we introduced in March of this year, our latest release, we added dedupe and compression, and erasure coding — which, loosely, we use for RAID 5 and RAID 6 — and depending on the workload we have seen about 2x to 8x savings. So
Starting point is 00:39:22 if you're doing for example VDI full clones that's about 7x to 10x savings. So if you're doing, for example, VDI full clones, that's about 7x to 10x savings. If you're doing like Oracle single instance databases, et cetera, compression is what would take effect there. And we are seeing anywhere from 5% to 25% to 30% of compression ratios with Oracle. Performance, of course, if you're using all flash, in terms of IOPS, we are about four times higher
than the hybrid vSAN. Again, hybrid vSAN is with HDDs used in the persistence tier. And when you start using devices like NVMe, sub-millisecond latency response times are what you can expect.

We support almost all applications out there. When we started out with vSAN 1.0 about two years back — everybody knows that with a storage 1.0 product you have to establish credibility and make sure the industry believes it is an enterprise product — we kept the number of use cases very restricted: VDI, DR target, a few dev/test use cases. But two years down the line we have opened it up for any use case, so there are customers deploying things like Oracle RAC, MSCS, Exchange, SQL — every application out there is being deployed on vSAN — and we have some pretty interesting use cases. For example, Lufthansa: their aircraft have about 300,000 sensors, and each sensor sends data back to their main data center, where they run analytics that spit out a report on how the flight did and what are some of the things to take care of, and within an hour the technician has to address all the concerns in that report and get the flight back up and running. This is being done on their A380s, etc. So some very interesting cases. There are, you know, defense sectors where vSAN is being deployed in particular form factors. It works very well there because in some of these defense sectors they want particular form factors — they do not want an appliance form factor that may not fit into a particular space, for example. So vSAN works beautifully for them because they can choose particular hardware, just slap the vSAN software on top of it, and it runs. So some very interesting use cases out there. And I forgot to mention we have about 5,000 customers, so it's growing pretty rapidly — it's one of the fastest-growing storage products I have worked on.

So how is the market evolving? This is an IDC graph, and the IDC predictions align with how we are seeing NVMe adoption growing within vSAN. We see that the SATA SSDs are going to start trickling down, while the PCIe-based SSDs and the SAS SSDs will probably occupy more than 60 to 70% of the market. And by 2018, what IDC projects is that 70% of the client SSD market will be NVMe-based.
Starting point is 00:43:04 So what are the benefits of vSAN with NVMe? So I won't go through the overview. Reddy did a good job about that. But it's ideal for the caching tier. We are getting a lot of customers who are using NVMe today with vSAN and we highly recommend that. There are a few customers who are beginning to use it even in the persistent store also. We have NVMe devices certified for the vSAN caching tier and specifically for all flash configurations. You can also use it for the hybrid model that it works perfectly for that and what we are doing specifically in the roadmap, and I will go through it in a little bit more detail, is we are enhancing both the ESXi storage stack and the vSAN storage stack
to make them effective enough to take advantage of some of the real benefits of NVMe and some of the NVMe 1.2 features which are also coming up. So we'll talk about that, and Murali will also go into some more detail in that regard. This is how our HCL looks — Intel P3700s, the S3510s, all listed there.

This is a chart which Intel recently published. I know this chart may be a bit convoluted, but let me see if I can explain it. The bar graphs you see correspond to the capacity side, and the configurations range from a performance deployment, where you really care about performance and nothing else; to the other extreme, where you do not really care about performance but you want to make your deployment highly, highly available; to a capacity-based deployment, where you want large capacities within the cluster; or a slight balance between performance and capacity. As you see, this is a configuration which delivered about 1.2 million IOPS for the performance deployment, and of course as you trade off availability against performance, your performance tapers off quite a bit. But what this graph essentially shows is that with NVMe, the kind of dollar-per-IOPS and dollar-per-gig you get is pretty remarkable.
Starting point is 00:45:46 In fact, we get a lot of customers who ask us, hey, isn't NVMe too expensive? But these are kind of the graphs which we have been publicizing along with Intel to show that the amount of performance which you get at the cost which NVMe is, it's pretty remarkable and it's pretty encouraging for customers to use NVMe. Also, raw comparisons. So we did a raw comparison of taking an 8-node vSAN cluster, comparing it with an 8-node NVMe cluster and these are kind of the IOPS which we are getting and these are some of the
Starting point is 00:46:31 configuration which we have. What we also observed in terms of latencies were on the SSD side we had about 1.5 millisecond latency, and in NVMe, we were getting 1 millisecond latencies or slightly less than that. So what's coming next? What's the future for NVMe within vSAN? So let me go through some of the ideas out there. So the Intel Optane SSDs, they are the ones which we are working with Intel closely, some pretty impressive reduction in latencies about approximately 10x compared
to NAND SSDs, so that's kind of a direct benefit to vSAN if you're going to use it in the caching tier or even within the capacity tier. We have done some measurements along with Intel on application performance — ESXi application performance — and it's about two and a half times faster than NAND PCI Express. And these are raw numbers: we haven't really done any software optimizations; this is just taking out the SAS and SATA SSDs, plugging in NVMe SSDs, and these are the numbers we are getting. There are a lot of software optimizations which we have started to work on. So we did a comparison of how NVMe performs without any ESXi storage stack optimizations, which is what it is today — the gray bars are what you get — but we are also prototyping a lot of optimizations within the ESXi storage stack. Reddy mentioned how you can parallelize NVMe across multiple lanes, and those are some of the optimizations we are introducing within the storage stack. And as you can see, the performance benefit we get from making these changes within the ESXi storage stack is quite significant. Of course the prototype is not released — these are all internal results as of now — but I thought I would show you what one can aspire to as they start using NVMe with vSAN and perhaps even with ESXi.

So I won't go through this, but let's look at what the next generation of hardware offers with NVMe and also with NVDIMMs, or persistent memory. There are four areas we are focusing on. One is the high-speed NVMe devices, which we hope to leverage as-is, and that will provide performance benefits to vSAN right away. There are ESXi storage stack enhancements which we are looking at, just as I mentioned in the previous slide.
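The parallelism being referred to here comes from NVMe's multiple submission/completion queue pairs — typically one pair per CPU — rather than a single shared request path. The sketch below is only a toy, conceptual model of that per-CPU queue idea (it is not ESXi or driver code):

```python
# Toy model of per-CPU submission queues, the NVMe-style alternative to funnelling
# every I/O through one shared, lock-protected path.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class IoRequest:
    lba: int
    blocks: int
    issuing_cpu: int

class MultiQueueDispatcher:
    """One submission queue per CPU, mirroring the NVMe queue-pair model."""
    def __init__(self, num_queues: int):
        self.num_queues = num_queues
        self.queues = defaultdict(list)

    def submit(self, req: IoRequest) -> int:
        qid = req.issuing_cpu % self.num_queues   # stay on the submitter's queue: no cross-CPU lock
        self.queues[qid].append(req)
        return qid

if __name__ == "__main__":
    dispatcher = MultiQueueDispatcher(num_queues=4)
    for i in range(8):
        qid = dispatcher.submit(IoRequest(lba=i * 8, blocks=8, issuing_cpu=i % 4))
        print(f"request {i} -> submission queue {qid}")
```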
We are also looking at — I mentioned that we have a two-tier architecture today, and with NVDIMMs coming into the market, whenever that happens, there is no reason we could not collapse this into a single-tier all-flash architecture. What that would mean is using NVDIMMs for just the metadata while using SAS, SATA, or NVMe SSDs for the persistent store. Or we can keep it as a two-tier architecture where you have the metadata in NVDIMMs, the write cache on NVMe, and the SAS and SATA SSDs as your persistent store. We are also looking at RDMA over Ethernet to boost network transfers: as the device latencies get smaller and smaller, the network latencies will start becoming the bottleneck, and that is why we are looking at RDMA to help address the networking side — right now it's not a challenge for us, but once these device latencies start shrinking further, the network latencies will start becoming an issue. So with that, I will give it to Murali to go over some of the native driver stack and some of the technical details.

I think we're going to run out of time here, but what I wanted to share is a snapshot of where we stand in terms of NVMe support today. So first let's look at our driver stack and where we play in terms of NVMe and complementary technologies. There are fundamentally two places where we actually do drivers. One is in the I/O stack: that is the NVMe driver for the PCIe space, as well as NVMe over Fabrics. Those are the two fundamental drivers.
Along with that, we have drivers for the RDMA piece, like a RoCE driver; that sits right in the I/O driver stack. The other place is something we call virtual NVMe. Essentially, that sits below a guest operating system and allows the guest operating system to have a native NVMe driver, so you can use this virtual device to funnel in I/Os that are completely, natively NVMe. So those are the places where we play.

And what I want to share is our future plans. So far we have been supporting the NVMe protocol — currently we have support for the 1.0e spec, and that's supported both inbox and via the downloadable driver. In the future, we're going to support the 1.2 feature set. I can't speak about the exact timetable on this one, but it's coming soon, and that deals with multiple namespaces and multiple queues, so it actually addresses the performance things that Swaroop was talking about. We will have a driver for NVMe over Fabrics, and we'll address RoCE as the fabric to start with, followed by other transports as our customers see a need. And finally, I talked about the virtual NVMe device which sits below the guest operating system — that should be very interesting for other operating systems which sit inside VMs. And at some point we plan to have a completely native NVMe stack. In other words, it will coexist with SCSI for a while, but the outlook is that at some point we're going to completely transition to an NVMe stack. So that's the outlook for the coming years.

And you can read about this — you can actually download the drivers today. We have drivers for the 5.5 as well as the 6.2 ESXi platforms. There is also a project where we'll be inviting the device vendors to contribute to the 1.2 driver; that's coming soon. The information is in the links here — you can log in there and find more details — and I'll be happy to talk to you after this session. So that's fundamentally it; I think we should leave some room for questions. We can do questions for maybe two or three minutes, and then we probably have to give up the room for the next speaker. Any questions on the vSAN side? I know we had a few during the talk.
Do you have any idea how much CPU overhead there is for this software stack? For vSAN in general? What we advertise is less than 10% — we call it the vSAN tax. Now, with dedupe and compression you do incur some additional CPU, and that is typically, again, less than 10% of the tax you have already paid for vSAN without dedupe and compression. So let's say you have 100 cores: you need 10 more cores for vSAN, and for dedupe and compression another 10% of that — 10% of 10, so one additional core — so 11 cores in general if you turn on dedupe and compression. And is it statically assigned, or is it done through vSAN? You have to go through a kind of TCO calculation to figure out, okay, I'm going to enable vSAN on this — how many more CPUs do I need to put in the cluster? So that's how it works.

Do you have a mix of thin and thick provisioning in your environment? Yes — the default is thin, but you can toggle between thin and thick if you would like to.

In terms of consistency, is it an AP system or a CP system — eventual consistency or strong consistency? This is all strong consistency: essentially, as soon as the write happens, when the writes are replicated to all the replicas, that's when the acknowledgement is sent back to the client.
So even though it's an object-based system — I know that other object-based systems may be eventually consistent — with vSAN it's strongly consistent. Any other questions?

The caching tier that you have, is it a read cache? In the hybrid model, the caching tier is both a read cache and a write buffer. In the all-flash model, the caching tier is only for write buffering; the reads go directly to the capacity tier.

So you have the write buffer on each of the nodes, and it is replicated — write buffers on all nodes? Yes, it's a write buffer on each of the nodes, yes. All right. Thank you. Thanks for your time. Appreciate it.

Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe@snia.org. Here you can ask questions and discuss this topic further with your peers in the developer community. For additional information about the Storage Developer Conference, visit storagedeveloper.org.
