Storage Developer Conference - #139: Use Cases for NVMe-oF for Deep Learning Workloads and HCI Pooling

Episode Date: February 2, 2021

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, episode 139. Hello and welcome to the Virtual Storage Developer Conference 2020. My name is Nishant Lodha and I'm responsible for Emerging Technologies at Marvell with special focus on NVMe and NVMe over Fabrics.
Starting point is 00:00:58 And today what I would like to do is talk to you a little bit about some new and emerging use cases for NVMe over Fabrics, both around machine learning and deep learning, as well as a relatively new concept for hyper-converged infrastructure pooling. I'd also like to give you perspective on the background and motivation, a little bit of background about NVMe and NVMe over Fabrics, briefly discuss the different use cases in the industry, both existing as well as emerging, for NVMe over Fabrics, and how the different NVMe over Fabrics technologies, be it Ethernet-based like RDMA and TCP, or traditional Fiber Channel-based, apply to a lot of these use cases.
Starting point is 00:01:50 And then we'll quickly focus on a specific topic of how NVMe over Fabrics has the potential to accelerate deep learning. And I'll take you through the lifecycle of the deep learning process, from how data is collected at the edge, through data preparation and big data, and then finally machine learning and deep learning, and how NVMe and NVMe over Fabrics is going to play a crucial role in accelerating this AI/ML pipeline. And then finally, towards the end, I will describe the different hyper-converged infrastructures that are in the market today, what the key challenge is with respect to these HCI deployments, and how NVMe over Fabrics, as a proven, mature technology, can deliver pooling and bring more efficiencies to hyper-converged infrastructures. So let's start off by talking a little bit about NVMe over Fabrics as a background.
Starting point is 00:02:55 A lot of you guys are already familiar with NVMe as a concept. Take high-speed media, put it on top of a low-latency, high-performance bus like PCI Express, and use a brand-new, extremely low-latency stack that the NVMe group defined. And what that is enabling is high-performance storage and a new paradigm for storage. By some analysts' estimates, NVMe, and especially NVMe over Fabrics with all-flash arrays, is expected to be a dominant interconnect for a lot of these primary storage applications within the next few years. Well, NVMe over Fabrics is exactly that: it allows you to use an industry-standard mechanism to scale out NVMe all across your data center.
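To make that idea concrete for developers, here is a minimal sketch of what "scaling out NVMe" looks like from a Linux host, driving the standard nvme-cli tool from Python. The IP address, port, and NQN below are placeholders (not values from the talk), the TCP transport is used only for simplicity, and it assumes nvme-cli plus the relevant NVMe-oF kernel modules are present; it is not tied to any particular vendor's array.

```python
# Hedged sketch: discover and attach a remote NVMe-oF subsystem with nvme-cli.
# Target address and NQN are hypothetical placeholders.
import subprocess

TARGET_ADDR = "192.168.0.50"                      # hypothetical target IP
TARGET_NQN = "nqn.2020-01.example:subsystem-01"   # hypothetical subsystem NQN

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Local, PCIe-attached NVMe namespaces.
run(["nvme", "list"])

# Ask the remote target what it exports, then connect over TCP port 4420.
run(["nvme", "discover", "-t", "tcp", "-a", TARGET_ADDR, "-s", "4420"])
run(["nvme", "connect", "-t", "tcp", "-a", TARGET_ADDR, "-s", "4420", "-n", TARGET_NQN])

# The remote namespace now shows up alongside local ones as /dev/nvmeXnY.
run(["nvme", "list"])
```

The same flow works with other transports (for example `-t rdma`), which is exactly the choice of fabric the talk turns to next.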
Starting point is 00:03:46 Now, in order to scale the bandwidth, the high IOPS, and the low latency characteristics of NVMe through a fabric, what you really need is a highly efficient fabric. And when I talk about efficiencies, it is not just about performance, right? And I'll take you through a bunch of different use cases and how the value of the different fabrics plays into those use cases. So like I said, you know, if you really want to scale NVMe all across the data center, you need a real fabric and a real network. The challenge has been that there are plenty of different fabrics out there in the market. From a standards perspective, the first standard that was done was for NVMe over RDMA, with RoCE v2, its TCP variant iWARP, or InfiniBand for
Starting point is 00:04:40 that matter; it got standardized way back in 2016. A lot of traction initially, especially around NVMe over RoCE v2. Then came Fiber Channel. There's a lot of investment in data centers by customers who have invested in a Fiber Channel SAN, and Fiber Channel continues to move forward with the new FC-NVMe standard. There are already products out there supporting FC-NVMe, both in terms of HBAs as well as storage arrays. But more recently, sometime in
Starting point is 00:05:14 late 2018, a new standard was developed by the NVM Express group called NVMe over TCP. And what that is, is it allows you to extend and reach NVMe using a standard TCP network, right? It's interesting and also confusing, right? A, you have a bunch of different options for fabrics, from Fiber Channel, which is also low latency and high performance. There's definitely RoCE v2, iWARP, and InfiniBand, which are extremely low latency, very efficient fabrics. And then there is NVMe over TCP, and TCP is sometimes not considered high performance and more considered general purpose. So we'll briefly talk about recent innovations around NVMe over TCP and what has been done by the industry as well as Marvell to help accelerate NVMe over TCP
Starting point is 00:06:07 and make it more suitable for the variety of use cases, whether it comes to hyper-converged infrastructures, deep learning, and such. With that, let's first look at a bunch of different use cases, especially around RDMA. As you realize, you know, RDMA comes in many different variants. There is RoCE v1, which is literally InfiniBand over Ethernet. And there's RoCE v2, which is the routable version, which uses UDP. And then there is iWARP. And if I broadly look at all the different use cases for RDMA, they span a lot of different things, but most of them focus on transporting storage across the wire, right? Starting from the very top right here, there's definitely
Starting point is 00:06:53 disaggregated storage with NVMe over Fabrics. We spoke about that. That's the standard defined by the NVM Express group, with a bunch of products out there, especially with NVMe over RoCE v2. There are legacy use cases of RDMA, especially iSCSI Extensions for RDMA. Haven't seen a whole lot of deployment, but still pockets of that. Microsoft has made some big investments in RDMA, starting with Windows Server 2012, especially having the RDMA variant of SMB, called SMB Direct, run on top of Microsoft Windows operating systems. There's work done within VMware, for example, with para-virtualized RDMA, which provides a standard para-virtualized RDMA interface to virtual machines that require direct access to the RDMA fabric. And then specific to deep learning,
Starting point is 00:07:46 there is a technology that NVIDIA and many others have collaborated to build, which is GDR, or GPUDirect RDMA, which allows GPUs across multiple different nodes to talk to each other via an extremely low latency fabric, again, which is RDMA. And I think GPUDirect most often uses RoCE v2 as a fabric. There are a bunch of different file systems, for example, Ceph and NFS, that have been extended to work on top of RDMA. And then going back to Microsoft Windows, there are things like
Starting point is 00:08:19 VM migration that have the capability to run on top of RDMA. Pretty neat set of use cases that have broadly emerged for RDMA, which is not just NVMe over Fabrics, but a lot of them are storage centric. And I wanted to use this to provide a little bit of color, because whether you look at hyper-converged infrastructures or deep learning workloads, it is not one thing that delivers acceleration and efficiency for these workloads. End of the day, a lot of compute, storage, and networking technologies have to come together to deliver such high performance workloads
Starting point is 00:08:57 like deep learning, machine learning, and even hyper-converged infrastructures. Like I mentioned, one of the first standards that got done was NVMe over RDMA. And today, when a lot of people think of NVMe over Fabrics, they imagine it to be NVMe over RDMA. But there have been some challenges with NVMe over RDMA. And I want to make this point to all of you guys that no one size fits all. If you just look at RDMA as a technology, and more specifically RoCE v2, it has some excellent use cases that allow it to deliver low latency performance. But end of the day, realize that a lot of these
Starting point is 00:09:40 use cases can be limited because RoCE v2, even after all the enhancements that have been done to RDMA, and more specifically RoCE v2, still somewhat requires a lossless network in order to function efficiently. Although there have been technologies like, you know, Data Center Quantized Congestion Notification that allow, you know, RoCE to become more resilient, so to speak, the underlying transport still expects that it will have access to a lossless Ethernet network. And, you know, a lot of enterprise data centers
Starting point is 00:10:21 may not have the skill set to deliver a lossless network. And even if they have the skill set to deliver a lossless network, setting it up one time is perhaps an easier task, but maintaining a lossless network could be challenging. And what happens is that if there is loss within the network, there is potential for retransmits, there's potential for congestion to build up, because there is no end-to-end congestion control in a RoCE v2 fabric. Like I said, once again, there are things like DCQCN that allow you to manage congestion a little bit better, but it's not automatic. It has a bunch of options that you need to tune and manage. And, you know, it's just not for everyone. And as we walk through the rest of the presentation,
Starting point is 00:11:10 I want to provide color as to the different use cases, even beyond deep learning and hyper-converged, where specific fabrics make a lot of sense and some fabrics do not make a lot of sense. Even within RDMA, the camp remains divided, with two different types of RDMA technologies, both RoCE as well as iWARP. These two technologies do not talk to each other, do not interoperate with each other, which sometimes can create islands within the data center. Also, one of the big challenges is that if your customer is deploying NVMe over Fabrics,
Starting point is 00:11:50 specifically NVMe over RoCE v2, let's say there is an all-flash storage array that has NVMe over RoCE v2 capabilities, you have to understand that RDMA is an end-to-end network where you need host-side network interfaces to also support RoCE v2. And a lot of the networking devices that have shipped in servers, whether it's 10 gig or 25 gig, may not have RoCE v2 capabilities and can sometimes require an upgrade. But don't get me wrong on this one. If a customer and a use case require extraordinarily high performance and absolute low latency,
Starting point is 00:12:33 and the network itself is not too big and generally contained, NVMe over RoCE is a great choice. And especially for workloads like deep learning, which are very latency sensitive and very high performing, NVMe over RoCE is a great choice. But we cannot take that principle and apply it broadly across enterprise data centers. Like I said, RoCE v2 is great, works beautifully at a small scale, in very contained, very well managed use cases, which are classic for deep learning workloads, where anywhere from a couple of nodes to no more than like 50, 60 nodes are working together to form this deep learning cluster, which is generally managed and dedicated for a single-purpose task. With that, let's talk about the different use cases by fabric, right? And I talk about the three most popular fabrics. On the very left is definitely NVMe over RDMA,
Starting point is 00:13:40 which we have been talking about. There is FC-NVMe, or NVMe over Fiber Channel, which is also a standard. And like I said, there's a bunch of investment in the past, especially by enterprise data center guys, in Fiber Channel. They're likely to continue to take their traditional SCSI-based Fiber Channel forward by running NVMe over Fiber Channel. And at the very end here is NVMe over TCP, which is a relatively new standard.
Starting point is 00:14:10 But if you walk with me through the different set of applications, right: if you're looking at machine learning, high-performance computing, and distributed databases like Cassandra and MongoDB, which require access to a high-performance fabric, but at the same time have the skill set, either via software or via the environments they are deployed in or the scale at which they are deployed, such that the complexity that comes with RDMA can be managed, then NVMe over RDMA with RoCE v2 is a great choice for those HPC and AI/ML workloads. But if you look at enterprise databases,
Starting point is 00:14:55 Oracle databases, SAP HANA, virtualized environments, SQL Server, these are extremely high performance, but they are also very business critical, right, in which reliability is the key. And that is where technologies like FC-NVMe shine, which allow customers to, A, leverage their existing infrastructure, and B, deliver something that has been tried and tested for decades for delivering storage across the wire, which is Fiber Channel, now extended with FC-NVMe. Then there are all the other applications you can imagine, right? There could even be applications that have traditionally
Starting point is 00:15:38 run on NVMe over RoCE, but they have the potential to run on NVMe over TCP, because there are scenarios where the applications are the same but the right answer depends on the customer's skill set, the environment, their culture, their networks, how old and legacy their networks are, and what kind of interoperability they need. NVMe over TCP is the catch-all technology, where the key is simplicity. Because with NVMe over TCP, you don't need to invest in a lossless network. You don't need to worry about congestion; your entire internet already works on top
Starting point is 00:16:19 of TCP. And simplicity is the key. And the decision point there is to balance performance with simplicity. That's what NVMe over TCP is about. So that is a quick look at what the different fabrics are and how they apply to different workloads. Enterprise applications, choose Fiber Channel. If it is high-performance HPC and AI/ML applications, choose NVMe over RoCE v2. Small container environments that need high performance will do very well with RoCE v2. For all other applications, choose NVMe over TCP. So stepping back a little bit here, let me talk about NVMe over TCP, since it's a relatively new standard, with interest from the hyperscalers as well, some of them RDMA-based, some standard TCP-based.
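Before going deeper on NVMe over TCP, here is a toy Python sketch that captures the fabric-selection rule of thumb just given. The function name, inputs, and the node-count cutoff are illustrative assumptions for this talk's guidance, not an official sizing or selection tool.

```python
# Hedged sketch of the "which fabric?" rule of thumb from the talk.
def suggest_nvmeof_transport(business_critical: bool,
                             has_fc_san: bool,
                             latency_sensitive: bool,
                             can_run_lossless_ethernet: bool,
                             node_count: int) -> str:
    """Return a suggested NVMe-oF transport for a workload profile."""
    if business_critical and has_fc_san:
        # Existing Fibre Channel investment, reliability first.
        return "FC-NVMe"
    if latency_sensitive and can_run_lossless_ethernet and node_count <= 64:
        # Small, contained, well-managed cluster that needs the lowest latency.
        return "NVMe over RoCE v2"
    # The catch-all: simplicity on any IP network.
    return "NVMe over TCP"

if __name__ == "__main__":
    print(suggest_nvmeof_transport(False, False, True, True, 16))    # NVMe over RoCE v2
    print(suggest_nvmeof_transport(True, True, True, False, 8))      # FC-NVMe
    print(suggest_nvmeof_transport(False, False, False, False, 200)) # NVMe over TCP
```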
Starting point is 00:17:32 And the key reason why NVMe over TCP has become a standard, driven by all of these industry stalwarts, is that it allows NVMe over Fabrics to be deployed in any IP data center environment. IP is everywhere, and NVMe over TCP can run on standard TCP/IP and provide you access to your NVMe storage across any network, without the skill set required for a lossless network, and without the additional investment and cost, for example, of implementing a new Fiber Channel SAN. However, there are some challenges with NVMe over TCP that I briefly mentioned, right? TCP is a general-purpose networking stack that is not specifically designed per se for NVMe
Starting point is 00:18:22 or for high-performance storage. And what that brings about is that, although TCP does not have the challenges that come with RDMA, performance can potentially suffer. Now, a bunch of different industry leaders are doing things to accelerate NVMe over TCP to make it more suitable for transporting NVMe across the wire, right? The goal of some of these technologies, and probably a little bit more color on the next slide as to how those benefits actually translate into numbers and performance, is that various industry leaders, including Marvell, are working towards accelerating NVMe over TCP. And some approaches, for example: there is work being done in the Linux community
Starting point is 00:19:10 to better align the stack, pinning specific CPU cores, as well as queues, directly to hardware interfaces on the networking devices, to bring better performance and better predictability to NVMe over TCP. And there are some industry players, for example, like Marvell, who are building offloads for NVMe over TCP.
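To see the kind of queue-to-core alignment being discussed, here is a small, hedged Python sketch that reads the Linux block layer's (blk-mq) sysfs entries for an NVMe namespace and prints which CPUs service each hardware queue. The device name is a placeholder, and this only inspects the mapping; it is not the kernel or vendor tuning work itself.

```python
# Hedged sketch: show blk-mq hardware-queue to CPU mapping for an NVMe device.
import glob
import os

def queue_cpu_map(block_device: str = "nvme0n1") -> dict:
    mapping = {}
    for qdir in sorted(glob.glob(f"/sys/block/{block_device}/mq/*")):
        cpu_list_path = os.path.join(qdir, "cpu_list")
        if os.path.isfile(cpu_list_path):
            with open(cpu_list_path) as f:
                mapping[os.path.basename(qdir)] = f.read().strip()
    return mapping

if __name__ == "__main__":
    for queue, cpus in queue_cpu_map().items():
        print(f"queue {queue} -> CPUs {cpus}")
```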
Starting point is 00:19:35 And the goals of a lot of these technologies are, end of the day, to bring the simplicity of TCP to these deployments, but try and achieve performance as close as possible to that of an RDMA-based fabric, in essence, to give customers the best of both worlds, right? With that, let's look at how one of these technologies for offloading NVMe over TCP could potentially work. Just look at the stack here on the right-hand side. If you were looking at any standard software-driven NVMe over TCP, there would be the NVMe stack up there, certainly with IO applications on top of that; there's the NVMe over TCP layer right below
Starting point is 00:20:25 it. And then there is the TCP/IP stack, which is traditionally running within your compute resources on your server. And there are standard L2 drivers. The beauty of NVMe over TCP is you can plug in any networking device, any NIC, from your old 1G to your 10G NICs to newer high-speed NICs, and do software-based NVMe over TCP. But like I said earlier, the TCP stack is general purpose and may not provide the performance that you need. That's why one piece of work that has been done was to offload NVMe over TCP onto a standard commodity NIC. And with an offload model, the diagram looks somewhat like the one on the left-hand side here, where everything, whether it is the NVMe over TCP layer as well as the TCP/IP stack, is now managed by the NIC.
Starting point is 00:21:16 And a lot of these NICs, which have traditionally delivered storage offloads like iSCSI and others, have now been transformed to deliver acceleration and offloads for NVMe over TCP. You have two big advantages here, right? A, instead of using the general-purpose TCP/IP stack in the host, you're now using a specialized storage stack, which is embedded in the 10, 25, 50, 100 gig NIC that can sit in either of these servers, within hyper-converged infrastructures or in all-flash arrays. And number two, there is the need to free up compute resources so that those compute resources can run applications, they can run virtual machines, they can run deep learning frameworks, they can do the job for which they were designed and not get burdened with just
Starting point is 00:22:10 processing IO. And some of these advantages of offload can be seen in the chart here. Let's focus on the chart on the left-hand side. What you see is that there are three different lines on this chart, and the chart compares the latency of a 4 kilobyte IO block, just working with null media here, for a queued, multi-threaded application, from an end-to-end latency perspective. The three lines show what a software TCP implementation would look like and what latency you would get; that is the line at the very top. The line at the very bottom, which is the greenish line, is how NVMe over RoCE v2 would perform in terms of 4 kilobyte read latencies. And the line right above it, the blue line, is one of the offload solutions in which NVMe over TCP is offloaded to the NIC. One thing is very clear from this chart: if
Starting point is 00:23:13 your customers use a software NVMe over TCP solution, not only is their average latency high, but their tail latency can be quite bad. And tail latency, plotted here on the X-axis, is how long the last 10 or 20 percent of the IOs take. See, imagine an application, let's say it's a database or a hyper-converged infrastructure, that is looking through its networking interface to access remote NVMe. You would not want some of those IO requests to, for example, complete in 100 microseconds,
Starting point is 00:23:55 while some take 200 microseconds, right? That's not predictable. That's not consistent performance, right? And what you see with a software TCP solution is that there is significantly high tail latency, again, going back to using a generic stack within the server for NVMe over TCP.
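To make the average-versus-tail distinction concrete, here is a short Python sketch that computes p50/p99/p99.9 from a synthetic set of completion latencies. The numbers are made up purely to illustrate the shape of the problem described above; they are not the measurements behind the chart.

```python
# Hedged sketch: how average vs. tail latency is typically summarized.
import random
import statistics

random.seed(42)
# Simulate 10,000 4 KB read completions (microseconds): mostly fast,
# with a small fraction of slow stragglers forming the "tail".
latencies_us = [random.gauss(100, 10) for _ in range(9500)] + \
               [random.gauss(220, 30) for _ in range(500)]
latencies_us.sort()

def percentile(sorted_values, p):
    index = min(len(sorted_values) - 1, int(p / 100 * len(sorted_values)))
    return sorted_values[index]

print(f"average : {statistics.mean(latencies_us):6.1f} us")
print(f"p50     : {percentile(latencies_us, 50):6.1f} us")
print(f"p99     : {percentile(latencies_us, 99):6.1f} us")
print(f"p99.9   : {percentile(latencies_us, 99.9):6.1f} us")
```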
Starting point is 00:24:18 There is no doubt, looking at the green line at the very bottom, that NVMe over RoCE provides the absolute lowest latency in these environments. Again, going back to my conversation earlier about use cases, if you require the absolute lowest latency and have the skill set to deliver a lossless network, NVMe over RoCE is a great choice. And we'll talk about its use cases for deep learning in a second here. But if you do offload NVMe over TCP down to the NIC, and these are offloads in commodity, regular NICs, you'll see, as the goal I stated earlier is,
Starting point is 00:24:52 that the performance is pretty close to that of an NVMe over RDMA fabric. And again, giving you back the best of both worlds, which is delivering latency that is good enough, but without the complexities of managing congestion, delivering a lossless network, and creating islands within your data center. Let's look at the chart here on the right-hand side. I mentioned the cost of IOs in a previous conversation here. The chart here on the right-hand side compares two things: a non-offloaded NVMe over TCP solution
Starting point is 00:25:34 with an offloaded NVMe over TCP solution. As you see, this is a regular 25-gig networking interface. Whether you choose software or hardware, considering the power of today's x86 compute resources especially, all these interfaces can easily achieve line rate at 8 kilobyte block sizes, hitting 25 gigabits or close to 25 gigabits per second. But look at that dark blue, purple line that is plotted on the secondary Y axis, which showcases the cost of doing these IOs, right? On this specific server, over half of the CPU cores have been
Starting point is 00:26:14 simply burned, in the software TCP case, for delivering IO. That is not what hyper-converged applications, not what enterprise applications, want. They want CPU resources to be free, not to run IO, but to run actual business logic. If you offload the same thing with a NIC which has NVMe over TCP offload, CPU utilization falls significantly, to literally one third of that value, leaving you headroom.
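As a rough, back-of-the-envelope illustration of that headroom argument, here is a tiny Python calculation. The 32-core server size is an assumption; only the "over half the cores" and "one third of that value" figures come from the talk.

```python
# Hedged sketch: cores reclaimed by offloading NVMe/TCP (illustrative numbers).
cores_per_server = 32                                 # assumed server size
software_tcp_io_share = 0.50                          # "over half of the CPU cores"
offload_tcp_io_share = software_tcp_io_share / 3.0    # "one third of that value"

cores_for_io_sw = cores_per_server * software_tcp_io_share
cores_for_io_hw = cores_per_server * offload_tcp_io_share
freed = cores_for_io_sw - cores_for_io_hw

print(f"software NVMe/TCP : {cores_for_io_sw:.1f} cores spent on IO")
print(f"offloaded NVMe/TCP: {cores_for_io_hw:.1f} cores spent on IO")
print(f"headroom reclaimed: {freed:.1f} cores per server for business logic")
```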
Starting point is 00:27:06 And as we talk about the different use cases, from machine learning and deep learning to HCI, I'll provide you color on how this headroom is extremely valuable for customers who are looking to run business logic, whether it is the AI/ML frameworks that need to complete training in time, whether it is to share resources and bring cost efficiencies, or for hyper-converged infrastructures where NVMe over Fabrics is bringing about a new way to merge the worlds of external storage and hyper-converged storage. With that background, just a quick summary, right? A bunch of different fabrics are available, from Fiber Channel to RDMA to TCP, for NVMe over Fabrics. Customers have a choice. The choice has to be determined by a bunch of different things: primarily by use cases; number two, by the skill set of the customer deploying those environments; as well as by the previous investments of that customer. Now, talking about AI/ML use cases, there's a couple of things about
Starting point is 00:28:01 these use cases that we should talk about before we get into the material for NVMe over Fabrics. Number one, these use cases and these deployments are generally net-new. A lot of these use cases did not exist in the past and are now being deployed. Number two, these are high-performance workloads on very expensive infrastructure, which, A, needs to be utilized to its maximum potential, and B, needs to deliver the cost efficiencies that the customer needs because of the amount of investment they are pouring into this infrastructure. Interesting news: recently, I think just a few weeks ago, a company called OpenAI announced a new deep learning system called GPT-3.
Starting point is 00:28:48 If you have not looked at it, look at its capabilities. It is pretty shocking, the kind of natural language expertise that the system has. It is getting pretty close to what humans can do. I haven't touched and felt it directly, but just from reading the bunch of press on it, you can see it has the ability to, for example, for you developers, take a design spec and write code out of that, or complete half-written documents,
Starting point is 00:29:18 from the most complex legal language to casually written language. And this system is absolutely the latest and greatest. But the point that I'm making here is that, you know, the system is complex. It has a huge internal neural network that actually enables these capabilities. And with such complex systems, you require the scale and you require the performance that NVMe brings to the table in order to train, maintain, and infer out of these kinds of systems, for example.
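To give a feel for the scale involved, here is a small, hedged Python calculation using the publicly reported 175-billion-parameter size of GPT-3 (a figure from OpenAI's publications, not from this talk) and an assumed 2 bytes per parameter.

```python
# Rough scale math for a GPT-3-class model. The parameter count is the publicly
# reported figure; bytes-per-parameter is an assumption (e.g., FP16 weights).
params = 175e9
bytes_per_param = 2
weights_bytes = params * bytes_per_param

print(f"weights alone: ~{weights_bytes / 1e12:.2f} TB")
# Checkpoints, optimizer state, and the training corpus multiply this several
# times over, which is why a single node's local flash quickly stops being enough.
```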
Starting point is 00:30:06 Let's do a little bit of conceptual chat here and get that thing out of the way. When I look at machine learning and deep learning, I look at two types of workloads in general. Depicted here on the top is definitely training, where you actually take tagged and untagged data and run it through your machine learning and deep learning set of algorithms to eventually build out a trained model. Most training workloads are certainly very compute and storage intensive, very GPU intensive. I see them running both in the cloud with specialized accelerators, including GPUs, and also within on-prem data centers, primarily using GPUs for acceleration.
Starting point is 00:30:39 If you have looked at the most popularly talked about reference architecture, which is the NVIDIA DGX series, you will see that the training nodes, the deep learning nodes, are heavy on GPU resources, with a lot of NVMe in the mix, as well as high-speed networking IO in the mix. And inferencing, at the very bottom, is when you use that trained model and look at data being generated from all your different sources to do analysis, result gathering, and classification, right? You've built a model that can identify objects, and then you use that model to actually identify objects during inferencing. As you would imagine, inferencing is more real-time, more latency sensitive. We see a mix of specialized accelerator devices, especially low-power devices, and because of the need to get responses faster, you will see more and more of this inferencing happen near to where the data is generated, which is at the edge. For the rest of the presentation, I would like to focus more on the deep learning, the training side of things, which is very compute and storage intensive, typically done in data centers, where you see NVMe and NVMe over Fabrics as technologies that have the potential to have a stronghold. Before we jump in, again, a little bit more background here, and I mentioned this in the previous slide: if you look at data, it's important to understand that deep learning and machine learning is not, you know, the only piece of the lifecycle of what people traditionally call machine learning and deep learning.
Starting point is 00:32:46 If you start here from the very left-hand side, data is generated from a bunch of devices at the edge. It could be IoT or other edge devices, and all of these are then pulled together in distributed edge data centers all around. And a lot of that data is then aggregated, cleaned up, worked upon, and churned through big data systems for all your MapReduce algorithms. And that data eventually is what feeds into the machine learning and the deep learning tasks that run on these, you know, GPU-heavy and NVMe-heavy servers. There is a lot of work that happens by data scientists to actually adjust the model, adjust the weights; they work on a smaller data set to fine-tune the model before it is actually
Starting point is 00:33:38 deployed in a deep learning setup to actually build a model. And even within the deep learning stage, the process is iterative, where models are built, verified, and rebuilt, and things like that, right? So the point that I want to make here is that it's important to understand that the life cycle of machine learning is beyond deep learning.
Starting point is 00:34:01 It involves collecting data, cleaning up data, associating data and tags to it, running it through your MapReduce and big data algorithms to fine-tune the data, and then feeding it into the machine learning. And certainly, all around is your data analytics that helps you make better decisions next time around in how you do things, right? And this is important. I'll talk about this a little bit more in the slide: you can break down the stages of the AI/ML pipeline into three large groups, right? Here on the left-hand side is definitely data preparation, right?
Starting point is 00:34:39 Then there's data science. And then finally, there is deep learning. For all of these stages, you will see that NVMe plays a strong role across the life cycle of machine learning and deep learning. You will see that NVMe continues to be something that you absolutely need in order to do a more efficient job, whether it is data prep, data science, or deep learning. Now, NVMe over Fabrics may not apply to all of these stages today. For example, if you look at data prep, there is a lot of data ingest and transformation. It's compute and storage intensive, but I don't see a lot of NVMe over Fabrics there, just because the models that are used for the transformation of data are more clustered
Starting point is 00:35:24 file systems and more distributed kinds of architectures. But I expect that in the future, as the scale of that grows and as new models come in, as well as companies that deliver data preparation as a service, where there is shared infrastructure and tenants and the things
Starting point is 00:35:44 that we are used to in our virtualized and cloud world, we'll see the ability to scale and share cost efficiencies also bring NVMe over Fabrics into the data preparation side of things. So, data science, I briefly talked about this. This is where data scientists actually work on refining the models, the weights, and things like that, in order to build a set of parameters that will eventually be deployed. Data scientists typically work on smaller-sized data sets, because it's, again, an iterative process as they adjust the weights, do the math, so to speak. A lot of the data science work happens
Starting point is 00:36:33 on shared infrastructures. It could be virtual infrastructures, where data scientists do not need dedicated access, whether it is to GPUs or to storage. And data science could very well be happening on a virtual machine that is shared with the rest of the enterprise infrastructure, which is where the cost efficiencies of NVMe over Fabrics come in: being able to pool your storage resources together and carve out a piece of that, when required, for a data scientist, who then uses that storage to help
Starting point is 00:37:09 pull in the different models and the data sets to actually do what they need to do. Deep learning is extremely high performance. This is the iterative training and model generation process, with highly parallel GPUs. The data sets that go into deep learning are now the full-blown data sets. And you will see that the size of these data sets really varies. For example, I was recently reading about a model that was built for some insurance agencies to predict the collision probability for the kinds of vehicles and drivers that they are insuring.
Starting point is 00:37:50 That requires millions and millions of small files that need to be pumped into the deep learning nodes. I think some estimate was like 5 million different files every few hours that need to be pumped into that deep learning infrastructure. Compare and contrast that, for example, with medical imaging, right? A lot of these medical imaging files and data sets that need to be fed into the deep learning workloads could be, you know, gigabytes or terabytes in size, right? For how many of these workloads do you think these data sets can be held in local storage, or in RAM, right?
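To put rough numbers on that small-file example, here is a hedged Python back-of-the-envelope calculation. The 5-million-file figure is from the talk; the three-hour window and the average file size are assumptions made only to show the order of magnitude.

```python
# Hedged sketch: sustained ingest implied by "5 million files every few hours".
files = 5_000_000
window_hours = 3          # assumed interpretation of "every few hours"
avg_file_kib = 64         # assumed small-file size

files_per_second = files / (window_hours * 3600)
mib_per_second = files_per_second * avg_file_kib / 1024

print(f"{files_per_second:,.0f} files/s sustained just to land the data")
print(f"~{mib_per_second:,.0f} MiB/s of ingest, before the training nodes "
      f"re-read those files every epoch")
```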
Starting point is 00:38:28 They quickly overflow, and it needs a shared fabric to be able to, you know, deliver the scale and the performance that just cannot be met by a single node to which all the storage is captive. And in the next slide, I'll talk about a concept people call the GPU storage bottleneck, and how NVMe over Fabrics removes some of those bottlenecks. And end of the day, the goal is different for all of these stages, right? For example, if you look at data science,
Starting point is 00:39:07 the goal of deploying NVMe over Fabrics is cost efficiency on a shared infrastructure. The goal of deploying NVMe over Fabrics for deep learning is to remove the GPU storage bottlenecks. And end of the day, the model has to be built in a relatively short amount of time, because again, it's an iterative process. You don't want it to take weeks and weeks to build the model, and then you adjust the weights and go back again for weeks and weeks to build another model. Accelerating training times,
Starting point is 00:39:36 as well as bringing cost efficiencies all around, is extremely critical. Now, I would say that, considering that deep learning technology is still relatively new, there hasn't been a lot of work and focus from customers on making it more cost-efficient. But I'll provide some more color on the next slide. Let's talk about the GPU storage bottleneck, so to speak. If you look at AI/ML data sets, and I mentioned some of these are either so large in number, with millions of small files, or extremely large in size, for example the medical imaging data sets, they just cannot fit into the GPU server's local storage capacity. This is what a lot of industry analysts now call the GPU storage bottleneck, which means the data set that the GPU needs to work on
Starting point is 00:40:37 can no longer be easily brought as close as it wants and locally stored on its compute node, right? What that brings about is that a lot of the time, these deep learning and machine learning nodes, especially the GPUs, end up waiting for storage resources. They end up waiting for data sets to be pulled in from the network into local storage before they can get access to them. What that does is impede the training
Starting point is 00:41:06 times. By one estimate that I was recently made aware of, average GPU utilization in these deep learning workloads is less than 50%. Now realize that GPU servers are almost three times more expensive than a typical server. These are pricey, high-performance, specialized components. You would not want them to run at less than 50% utilization. You wouldn't want them to wait on storage resources to become available. What NVMe over Fabrics does is provide these GPUs high-performance, low-latency, direct access to a shared and elastic pool of storage that is, A, accessible via a high-speed fabric, typically Ethernet or InfiniBand-based, and B, more easily managed: new data sets can be pulled into it, and you're no longer restricted by the local storage capacity of the deep learning node. All the benefits that we have always talked about for external storage, the ability to pool resources, the cost efficiencies, the backups, the ability to share, all of that is being brought about by NVMe over Fabrics for deep learning. The key difference is that with NVMe over Fabrics, you get transactional performance
Starting point is 00:42:49 as well as latency pretty close to that of local flash, which brings you, again, the best of both worlds, which means, A, you get performance pretty close to that of local flash, and B, you are no longer limited by whatever number of PCIe slots you have and how many local NVMe drives you can cram into a single deep learning node. And at the end of the day, NVMe over Fabrics allows people like data scientists and HPC researchers to just get more out of
Starting point is 00:43:24 their applications. When their applications run faster, they can make faster decisions. They can adjust their weights and their networks much more efficiently than they would with limited access to storage resources captive to a single deep learning node. Before we talk about this, it's important to understand that it's not
Starting point is 00:43:51 one technology like NVMe over Fabrics that will form the heart of deep learning storage. Deep learning is accelerated by a set of different concepts and technologies. And see, this doesn't apply just to machine learning and deep learning, but to any workload, right? You don't want to focus all your energy on optimizing the compute of that workload only to have the bottleneck simply shift to the network. And if you optimize the network, the bottleneck then shifts to the storage.
Starting point is 00:44:27 A lot of work has happened in deep learning on optimizing compute, right? For example, you see high-performance x86 or other architecture processors in deep learning nodes. You see GPUs with teraflops of capability in these deep learning nodes. You see networking like a bunch of different 100 gig NICs
Starting point is 00:44:54 that show up in these deep learning nodes. As for storage, you don't want to leave it unoptimized and have all the other investments that you made in optimizing the network and compute not give you the best bang for your buck. And that's where it is important to look at compute, networking, as well as storage optimizations. And all said and done, it looks like storage optimization has not been a big focus for a lot of these deep learning infrastructures,
Starting point is 00:45:30 and NVMe over Fabrics is finally bringing that into the picture. Now let's talk about the other use cases of RDMA within deep learning nodes. And this will inform what comes up in the next few slides. There's a technology that I briefly mentioned, which is GDR, or GPUDirect RDMA. And I think GDR has also been extended recently to also run for NVMe over Fabrics, which is something to look out for as you, as developers, enhance your own NVMe over Fabrics stack. What GDR, or GPUDirect, does is allow basically two GPUs on two different nodes, for example, as I show in the picture here, between server one and server two,
Starting point is 00:46:10 to have the GPUs' memory directly mapped through these RDMA-enabled NICs, so that these two GPUs which sit across two different nodes can now talk to each other via a low latency, high performance fabric like RDMA. So, a couple of things here, right? First of all, you know, typically one compute unit with four or eight GPUs is just not enough to complete your training workloads in time. You need the scale of multiple different deep learning nodes, and you don't want the connectivity between the deep learning nodes to become a bottleneck. And that's why GPUDirect. GPUDirect has excellent integration with NVIDIA NCCL 2.0,
Starting point is 00:46:55 so it automatically helps you decide what is the most optimized path to take when you need to access GPU resources from one server to another, right? Beautifully integrated into the upper-layer AI/ML stacks.
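For developers, here is a minimal, hedged sketch of how a training job typically engages NCCL (and, underneath it, RoCE/GPUDirect paths when they are available) using PyTorch's distributed module. It assumes a launcher has set the usual RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT environment variables, CUDA-capable GPUs are present, and the NIC name below is a placeholder; it is not NVIDIA's or VMware's own tooling.

```python
# Hedged sketch: multi-node collective over NCCL; NCCL picks RDMA paths when
# it finds them. Requires PyTorch with CUDA and a distributed launcher.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # steer NCCL to the right NIC (placeholder name)

def main():
    dist.init_process_group(backend="nccl")          # env:// init from launcher-set variables
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # A toy all-reduce: gradients in real training travel the same path.
    t = torch.ones(1024, device="cuda") * dist.get_rank()
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: sum of ranks = {t[0].item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```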
Starting point is 00:47:22 Also, there is one other technology. I picked this image from Bitfusion, which has been a relatively recent acquisition by VMware. And what Bitfusion as a technology does is allow customers to pool GPU resources and share them over an RDMA-enabled network. As you see in this picture here, you can have AI researchers, data scientists, and developers who require intermittent access to GPU resources get a slice of what they need at the time they need it, without requiring you to dedicate expensive GPU resources to a single AI researcher for a period of time in which they might not be fully utilized. So now let's put the couple of things that we talked about all together. What you see in a deep learning node is that technologies like RDMA, especially RoCE v2, are pretty prevalent. For example, RoCE v2 is being used for GPUDirect between different deep learning nodes. RoCE v2 is being deployed for extending GDR, or GPUDirect, for NVMe over Fabrics or for NVMe. And with technologies like Bitfusion, there is an RDMA-enabled network that allows you to share GPU resources across your pools of AI researchers and data
Starting point is 00:48:48 scientists, as shown in this diagram. There is a prevalence of RDMA-enabled technologies in these deep learning nodes. These deep learning nodes have a bunch of different 100-gig NICs which already have RDMA capabilities. So it is natural to extend that to NVMe over Fabrics for these deep learning nodes. No longer do you need to have captive storage on deep learning nodes. You can extend the reach of NVMe all across your deep learning nodes by having a storage pool. And unlike many other enterprise applications, you may not need an intelligent and sometimes
Starting point is 00:49:35 more expensive all-flash array to be that storage pool. You could simply have an EBOF, an Ethernet Bunch of Flash, which provides you just a bunch of dumb flash accessible through an Ethernet interface to your different set of deep learning nodes. The RDMA fabric, like I said, is already prevalent, already available in these deep learning nodes, provides excellent scalability for these convolutional neural networks, and gives you the performance as well as, you know, the latency envelope of access to remote storage, right? So once again, to summarize for machine learning,
Starting point is 00:50:15 deep learning, the key takeaway here is that networking and compute, especially GPUs and high-performance processors, are already very well optimized for deep learning, but not enough focus has been given to the storage piece of that, right? The infrastructure for deep learning already comprises a low latency RoCE v2 transport.
And it makes a lot of sense to extend those existing capabilities by removing captive storage from the deep learning nodes, aggregating it in a bunch-of-flash or all-flash array, and delivering near-local performance with the scalability and the cost efficiency that any external storage brings to the table. That goes back to my conversation about picking the right fabric for the right use case. These are small, contained deep learning infrastructures, RDMA is already prevalent there, and NVMe over Fabrics as a capability is already available in the typical Linux or VMware-like operating systems that are run either for deep learning or for the shared infrastructures for data scientists. It makes a lot of sense to extend that and optimize storage.
Starting point is 00:51:37 Okay. Now let's talk about some hyper-converged infrastructures, or HCIs. A quick overview of some of these popular technologies. There's definitely VMware vSAN, or Virtual SAN. It's simple hyper-converged storage. You take a bunch of different compute nodes, each one of them with its own storage, cluster them together over an Ethernet network, and voila, you've got a highly distributed, hyper-converged infrastructure. Standard x86 servers, standard networking components.
The important thing to realize is that the networking components within each one of these nodes cluster together, usually over a standard L2 network in the case of VMware vSAN, to deliver the inter-node connectivity. Another such example comes from Microsoft and is called Azure Stack HCI; it used to be called Storage Spaces Direct, or S2D. It's a similar model to VMware vSAN, and it allows you to kind of cluster together pools of these compute-plus-storage nodes over a network, right? The key difference with the Microsoft solution is that it already incorporates an RDMA-enabled network, with choices of either RoCE or iWARP, to cluster these nodes together, right? So whether you look at VMware vSAN or you look at Storage Spaces Direct, what is the number one challenge there? Let's look at that. The number one challenge is that, with any hyper-converged infrastructure, compute and storage scale together. Since storage is generally captive to that compute server, if you run out of storage and you no longer have room to fit additional drives in each one of your nodes, you need to obtain another compute unit to get more storage, whether you need more compute or not.
Which brings us to the concept of NVMe over Fabrics, in which you now try and bring the two worlds together. You have the hyper-converged world, but you allow a networking component, like a 10, 25, 50, or 100 gig NIC, in one of those servers, or in many of those servers, to connect to an EBOF, a dumb bunch of flash. A lot of these flash EBOFs can be daisy-chained if you want to scale. And what this allows is that it allows you to bring dedicated flash, without requiring additional compute investments, into the hyper-converged infrastructure.
Starting point is 00:54:19 Now you have the ability to scale your storage without having to scale compute. And the reason that is possible is using NVMe over Fabrics to bring the two worlds together. Now, the important thing to note here is that NVMe over TCP makes a lot of sense for a lot of these environments: A, because of the availability of TCP stacks in a lot of these hyper-converged infrastructures, which is where I expect this to happen in the future, and B, because of the need by customers for a simple network, because hyper-converged infrastructures can scale to multiple different nodes, you know, like I mentioned, 64 nodes and higher, and they require the scale and simplicity of NVMe over TCP. And as I talked about early on, there are solutions from the industry, including Marvell, to accelerate NVMe over TCP. You get performance similar to that of RDMA without requiring a lossless network.
Starting point is 00:55:37 So once again, to kind of summarize for HCI workloads: there is a need to scale storage without scaling compute. And technologies like an Ethernet Bunch of Flash, hooked together through NVMe over Fabrics, can bring dumb storage into the mix, which can then be managed, compressed, secured, and deduped using the software mechanisms of the hyper-converged infrastructure, bringing the best of those two worlds together.
Starting point is 00:56:00 It is imperative to use a simple fabric like NVMe over TCP for such workloads, because they can scale and they need to work across a wide range of environments, and it makes a great choice for those kinds of environments. Going back, I talked about this slide at the very beginning. I just want to close up with this thought that there are a bunch of different fabrics, and there are emerging use cases for all of these fabrics. But if you are running a high-performance AI/ML workload, you already have RoCE v2 in the mix for your deep learning nodes,
Starting point is 00:56:38 for your shared infrastructures, for your data scientists. Leverage that by using NVMe over RDMA and optimize the storage. If you have stuck with Fiber Channel, with your investment in Fiber Channel, continue down that line. There's nothing wrong with extending the life of the billions of dollars of Fiber Channel infrastructure already invested in. It works great for business-critical use cases. Continue that forward. For all the other things, whether it's HCI workloads, whether it is environments where your customers cannot deliver a lossless network, where the balance of performance and cost is critical,
Starting point is 00:57:16 use NVMe over TCP. Excellent choices all around. What matters is the use case and the customer skill set that applies to it. With that, thank you so much for listening to my talk. I wish I were in front of you, talking to you guys, seeing your faces smiling, and maybe hearing that thunderous applause towards the end, hopefully. But it will be virtual this time. I hope to see you on the Slack channels for any questions. Looking forward to seeing you guys in person, perhaps next year. Thank you.
Starting point is 00:57:54 Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic further with your peers in the storage developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
