Grey Beards on Systems - 109: GreyBeards talk SmartNICs & DPUs with Kevin Deierling, Head of Marketing at NVIDIA Networking
Episode Date: November 18, 2020
We decided to take a short break (of sorts) from storage to talk about something equally important to the enterprise, networking. At (virtual) VMworld a month or so ago, Pat made mention of developing... support for SmartNIC-DPUs and even porting vSphere to run on top of a DPU. So we thought it best to go …
Transcript
Hey everybody, Ray Lucchesi here with Matt Leib.
Welcome to the next episode of Greybeards on Storage podcast,
a show where we get Greybeards Storage bloggers to talk with system vendors and other experts
to discuss upcoming products, technologies, and trends affecting data centers today. We have with us here today Kevin Deierling, Head of Marketing at NVIDIA Networking.
So Kevin, why don't you tell us a little bit about yourself and what NVIDIA is doing with their smart NICs?
Yeah, great to be here. Thanks a lot. Yeah, I joined the NVIDIA networking team through the acquisition of Mellanox Technologies, which happened in April of this year. So now we see the marriage of AI and accelerated computing with intelligent
networking because, of course, at Mellanox, we're the leader in high-performance networking.
And that was first with our InfiniBand and HPC heritage, and now more with the intelligent
networking on both our adapters and our DPU, which is what is incorporated into SmartNICs,
for example.
It seems like, you know, Mellanox had been doing, you know, SmartNIC types of development
for quite a while and have been rolling out their own technology in this space, right?
I mean, the Bluefield and all that stuff.
Absolutely.
Yeah. So, you know, people don't realize it, but NVIDIA is now a leader in Ethernet networking, as well as InfiniBand for high
performance computing and AI networking. So at Mellanox, we entered the space really at 10 gig,
but more importantly, at the higher speeds. So 25 gig, 40 gig, 50 gig, 100 gig.
And we did that with our ConnectX family of adapters, which are really, really smart NICs.
And then we introduced our Bluefield devices, which are our DPUs, our data processing units.
And so together, we have really an unmatched portfolio of networking gear.
And then on top of that, we have switches as well, both for Ethernet and InfiniBand.
So we have an end-to-end platform that is now integrated as part of the AI offering of these platforms for all different types of workloads.
Yeah, I was virtually at VMworld here a couple of weeks back, I guess a month or so ago,
I guess now. And Pat was talking about, you know, almost the porting of vSphere to
NVIDIA DPUs. Is that, I don't understand what's going on there. Is this like, why would they,
you know, why would they rewrite their hypervisor to work on a smart NIC?
Yeah, I think that you got that absolutely right. And it was actually two parts of the
announcement. The first part was really bringing AI to all businesses, because all businesses
will become AI enabled, and every workload will be accelerated.
And the other part of that,
so was to bring the DPU into the fold and the DPU,
this is our Bluefield, is really about offloading,
accelerating and isolating.
And the key thing there is that you can take all of the
data center services that used to run on the x86,
now you can run those on the DPU. And the beautiful thing here is you get those benefits
that I talked about. So because it's running on a separate processor, it accelerates everything
with the hardware that we have on chip, and it isolates that from the application processing
domain. So there's really goodness
across the board. VMware recognized that with Pat and Jensen on stage at VMworld. And so now
we're executing towards that vision. Yeah, that's bizarre. So from an acceleration perspective,
I had always heard that TCP offload was something that a lot of NICs did, but Bluefield takes this further.
Is Bluefield the right answer?
Maybe I should ask.
Can we talk about Bluefield or should we be talking about NVIDIA Smart NICs?
What's the branding here?
No, Bluefield DPUs is how we refer to them.
So that's the right way to refer to it.
And that DPU functionality, it really complements the CPU
and the GPU. So we talk about the trinity of the data center. And Jensen was talking about the fact
that the data center is the new unit of computing and that the workloads that we're trying to solve
now with these AI problems are so large, they don't fit in a computer anymore, that it's the
entire data center that actually is going to be running programs. And the CPU is great for running applications. The GPU is great
for running accelerated applications, which are highly parallel. And the DPU, or the data
processing unit, which is how we refer to our Bluefield, is great at things that are data intensive.
And so all the accelerators that are built into the DPU
are really important. And so what sort of accelerators are built into the DPU? Maybe
that's a question I wanted to ask. Yeah. Well, you talked about TCP IP acceleration,
and certainly a ton of the workloads are involved with TCP IP. And we do the basic accelerations there. But we do things like load balancing
and firewalls and stateful accelerations, and we run NAT, and we accelerate all of these things,
so network address translation. And when you look at a data center, there's potentially hundreds
of servers that the workload is going to be carved up into. And we need to steer the data
efficiently to the right node. We need to sometimes trickle data out of a, for example,
if you're streaming video, you may have 10,000 users and you're actually supplying data to each
of them, but at a very controlled pace. And we can do all of that with packet pacing, encryption, compression. There's just a
ton of things that we can accelerate with hardware that the data processing unit is way better at
than if you try to run software-defined networking, software-defined security,
and software-defined storage on the CPU. Yeah, yeah. RDMA is part of that as well? Absolutely. So RDMA is the remote
direct memory access, and that really goes to the heritage of Mellanox with our InfiniBand
technology. So that was part of InfiniBand from really 20 years ago when we first started.
The HPC world adopted RDMA as a way to build supercomputers. And if you look at the
problems that are being solved today in AI, they're really very similar to a supercomputer.
So a big cloud data center looks exactly like a giant supercomputer. And more and more today,
the problems that are being solved for scientific computing and other things are really AI problems as well. So the largest supercomputers in the
world are leveraging GPUs and now DPUs and accelerated computing. Yeah, I was talking to
somebody, it might have been Argonne or something like that. They were working on a brand new
supercomputer installation and stuff. And they had almost as many GPUs as they had CPUs in a configuration, primarily to support neural networking, AI, machine learning, deep learning kinds of stuff. So yeah, it's pretty impressive what's going on in the HPC space to incorporate AI sorts of workloads and things.
Yeah, I think my only question is why only almost as many GPUs? We should have
two or ten times more GPUs.
I don't know what the numbers were, but it was pretty impressive. And Oak Ridge, which was the
prior supercomputing capital of the world as of last year or the year before,
had plenty of GPUs as well. Absolutely. And I think what's interesting at some level,
the CPU, depending on the workload, is really sort of an adjunct to the GPU. And it's designed
to keep that GPU very busy. And now with the DPU, part of that function is actually freed off of the CPU and running on the DPU. We can do things, for example, like GPU direct storage, what we call GDS. And then we can stream the data directly to the GPU and not have to bother the CPU. So what this is really all about
is as we get massively parallel, we start to hit Amdahl's law, where things become synchronized.
If you have a thousand machines and every once in a while you need to go back to one
machine before you can do something, you have a thousand really powerful machines waiting for more data.
We can address that with the DPU, and there are a lot of cool new technologies
that we've implemented in the DPUs that allow us to really overcome Amdahl's law.
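For readers who want to see the numbers behind that Amdahl's law point, here is a minimal, illustrative Python sketch (my own, not NVIDIA code) showing how even a tiny serialized fraction of a job caps the speedup you can get from a big cluster, which is exactly why pushing synchronization and data movement into the DPU and the network matters:

```python
# Illustrative only: Amdahl's law says the serial (synchronized) part of a job
# limits speedup no matter how many GPUs/nodes you throw at the parallel part.
def amdahl_speedup(parallel_fraction: float, n_workers: int) -> float:
    """Maximum speedup when (1 - parallel_fraction) of the work stays serialized."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)

for p in (0.99, 0.999):      # 1% vs 0.1% of the job is serialized
    for n in (100, 1000):    # cluster sizes
        print(f"parallel={p:.3f}, nodes={n}: speedup ~{amdahl_speedup(p, n):.0f}x")

# With 1% serial work, 1,000 nodes only deliver ~91x; shrink the serial part
# (offload it to the DPU/network) and the speedup climbs back toward the node count.
```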
I've been writing off and on for years in my RayOnStorage blog about hardware innovation
versus software innovation. Most of the people have been beating me over the head saying software is going to solve all the problems.
We don't need hardware innovation, et cetera, et cetera.
I said, you guys are nuts.
It's amazing what you can do with hardware innovation if it's properly targeted, properly focused.
And the types of speed ups that you're talking about couldn't be done in software alone.
Yeah, I think, you know, Marc Andreessen made the famous statement that software is eating the world.
And in one sense, he was right. But at the end of the day, software has to run on something. And part of the joy about joining NVIDIA is that, in particular, Jensen and really the entire team understand
that the really critical things happen at that interface between hardware and software.
And so if you are stovepiped, you're in your little enclave only doing software,
or you're in your little enclave only looking at hardware,
then your design space is limited. But if you actually transcend that and start innovating
across the hardware-software boundary, that's where huge breakthroughs can happen. And it's
really exciting to be part of NVIDIA where they realize this. And that's what we're doing. We're
innovating across the hardware software boundaries. So I
think it's very prescient that you realize that. Yeah. Well, it's been a slog. It hasn't been easy.
Most of the people keep beating me over the head saying it's not happening, but I have been a
proponent of hardware innovation for a long time. And all you had to do was wait, right?
Don't tell me that. I don't want to wait. You know what I find so interesting is
how much seems to have happened in how short a period of time. The prevalence of GPU
in the machine learning and AI space just over the last three years has really changed the landscape.
And that's not even touching upon the value of these smart NICs and the way that the ability
to offload data against what is transpiring in your traditional sort of processor space onto the network card,
I think is such an incredible development. Yeah, I totally agree. I think sometimes you hit that
sort of the tipping point where something that was impossible previously suddenly becomes possible.
And once that happens, you'll see that explosion of progress. And in
particular, I think some of the machine learning algorithms that have been developed, where before,
you know, it was only a decade ago, I was at a DNA sequencing company, and we were trying to figure
out the information from the DNA. And we were trying to actually look at the physics and the biochemistry and figure everything out from first principles. And eventually, we realized that it was just too complicated. And we threw what wasn't even called machine learning yet at it. And we let the computer write the program because humans couldn't do it. And once that happened, we had this giant breakthrough. Unfortunately, nobody knew what machine learning was yet. So our investors didn't understand it. If we'd known,
we just needed to wait a year and call it machine learning and we could have
raised a ton of money. So with that in mind, and for our audience's edification, maybe you might be willing to apply a definition to the differentiation between
AI and machine learning. Yeah, that's a tough thing for me to do. That's not my core expertise.
So I'm going to give the layman's version of that, and we'll get one of our experts to do it better. Sometimes that's the best. Yeah,
sometimes it is. Sometimes it's somebody that doesn't have a deep understanding,
but at least from my exposure at NVIDIA, AI is really leveraging these neural network models
that are modeled after the brain, the way neural networks work, so that instead of doing
really a procedural language, you're actually using a bunch of connections and then
iterating on the model so that you can run that through. But the model itself is being developed
by human beings that understand the way these models work, and then they iterate over time
to refine the models. Yeah. So I would say deep learning is a subset of machine learning and
machine learning is a subset of AI. Machine learning can apply to things like statistical
inferencing that you can do with data analytics and things of that nature. Whereas deep learning
is a neural net technology, like you said, Kevin, that, you know, where they, you know, scientists build a structure
of a model, and then they run this data through it over and over again. And every time, every step
that the data comes in, there's some minor modifications to weights throughout the neural
network that occur to make it more correct. And over time, as you get more and more
data through this thing, it gets better and better.
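As a concrete aside to the weight-adjustment loop Ray just described, here is a deliberately tiny, hypothetical Python sketch (not any NVIDIA framework) of a single weight being nudged a little on every example; real deep learning does the same thing across millions of weights on GPUs:

```python
# Toy illustration: each incoming example nudges the weights slightly, so the
# model gets "more correct" as more data flows through it.
import random

w, b = 0.0, 0.0          # model weights, start uninformed
lr = 0.05                # learning rate: size of each small correction

for step in range(2000):             # a stream of training data
    x = random.uniform(-1, 1)
    y = 3.0 * x + 1.0                # the hidden "truth" the model is learning
    err = (w * x + b) - y            # how wrong the current prediction is
    w -= lr * err * x                # minor modification to the weights...
    b -= lr * err                    # ...after every single example

print(f"learned w={w:.2f}, b={b:.2f} (target 3.00, 1.00)")
```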
Exactly right.
And I think that's the key is that you're no longer writing the program yourself. You're creating a framework and feeding it with data.
And that data needs to be curated so that you understand what it is.
This is a cat.
I'm going to show you another cat and another cat and another cat.
And, you know, for a long time, people were looking at cats as sort of the way to train all these models. But now it's amazing. With machine learning, if you feed it enough data,
we have superhuman capabilities. So better than humans in certain domains, right? Certain domains. Yeah.
Yeah. In certain domains. But it's cool. I mean, natural language processing, visual recognition.
Translation.
Self-driving cars, looking at medical images for x-rays.
It's just so exciting, all the different fields that are going to be impacted by AI and what it's going to do for humanity.
It's really going to improve our lives. And so, you know, the NVIDIA GPU technology, because it's so massively parallelized, was
a natural fit for deep learning neural nets, because you've got these, oh, I don't know,
thousands of different nodes on this neural net, all of which need to have a separate
weight connected to it.
And, you know,
there's floating point operations that occur. And it's just that by having it be parallelizable,
if that's even the right word or something like that, you know, these things can be happening
much better than just a single serial CPU or even a quad core or eight core kind of environment. Yeah, you're so right. Who knew that
gaming and all that would lead to really this orthogonal AI vector of innovation? You know,
I wasn't here when NVIDIA did that, when they really pivoted to take on this new market. I've heard people say that, you know, it was just serendipity that they realized that.
But from what I've seen, nothing is serendipitous at NVIDIA.
It's a very thoughtful company.
I'm sure.
And I will tell you that when I look at all the innovation vectors that are in our new Ampere GPUs, whether it's the things you talked
about with the floating point formats or the different types of neural networks that are
facilitated or the memory interfaces. It is a highly tuned machine for AI, for inferencing,
for training. It's really fantastic. This is not an accident that NVIDIA found itself at the
forefront of AI. Well, the other thing that AI, machine learning, and deep learning brought about was the need for data.
More data is better, and the more data that you can apply to these sorts of solutions, the better.
So that's where the networking, I guess, kicks into this framework. By having that data be available to the GPU in a timely fashion, it can make this whole operation
of training and inferencing much better, I guess, right? I think you nailed it. So I think Jensen
realized with the Mellanox acquisition the importance of the network. And I think he was
ahead of other people in realizing that, because if the data center is the new unit of computing, it's driven by the fact that you just
can't fit all of the data into a single computer, nor can you fit all of the model even. And so
as you scale out into a data center scale of computing, because that's how much data you have,
and that's how much compute you need, suddenly the network becomes the bottleneck, because the weights and the data
need to be shared between the machines all the time, constantly.
The data becomes the limiting factor. So if you look at our DGX boxes, they actually have eight
of our 200 gig, actually nine of our 200 gig adapters inside.
So that's the NVIDIA AI machine, right?
The appliance?
That's right.
That's right.
The DGX is really an AI appliance.
And we have partners that build similar boxes.
So we have a lot of different partners that build those boxes based on the
DGX. So we have something called the HGX that they can incorporate for their platforms.
And then we can scale those out into pods and super pods. And the networking is just a critical
component of that, both for just the raw latency and throughput that we have, but also for some
of the offloads. So what we call in-network computing, where we're really processing the data as it moves.
And this is addressing directly the problem of serialization that Amdahl's law highlights.
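To make the "processing data as it moves" idea concrete, here is a small conceptual Python sketch (my own illustration, not NVIDIA's SHARP or DOCA code) comparing a naive reduction, where every node ships its full partial result to one server, with an in-network reduction, where switches combine partial results hop by hop so far less data arrives at the root:

```python
# Conceptual sketch of in-network computing: combine partial results inside the
# network instead of hauling every node's data to a single server.
def host_reduce(node_payloads):
    """Naive: every node sends its payload to one root, which adds them all up."""
    values_into_root = sum(len(p) for p in node_payloads)
    total = [sum(col) for col in zip(*node_payloads)]
    return total, values_into_root

def in_network_reduce(node_payloads, radix=4):
    """Switches combine payloads level by level; the root only sees one payload."""
    level = list(node_payloads)
    while len(level) > 1:
        level = [
            [sum(col) for col in zip(*level[i:i + radix])]   # combine inside the switch
            for i in range(0, len(level), radix)
        ]
    return level[0], len(level[0])

payloads = [[1] * 1024 for _ in range(64)]     # 64 nodes, 1,024 partial sums each
_, naive = host_reduce(payloads)
_, innet = in_network_reduce(payloads)
print(f"values arriving at the root: naive={naive}, in-network={innet}")
```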
So it brings up the question of, so what sort of processing are you doing with the data in a NIC?
I mean, obviously, all the offload capabilities and the acceleration of data
transfer makes a lot of sense, but this is beyond that. You know, it used to be store and forward
kinds of things, but now you're store, process, and forward kind of stuff. So is it really a
general purpose computing environment? I mean, in a smart NIC? I mean, I've heard of people putting hypervisors on these things, right?
Yeah.
So that's what VMware and Pat Gelsinger announced with Jensen at VMworld was the hypervisor
running on the DPU, the SmartNIC.
But it's even more than that, because we're actually, as you said, processing the data
as it moves.
So it's in the switches themselves.
And so with our technology, it was a great story that my counterpart Gilad Shainer tells
about a pizza delivery and that it takes half an hour to get your pizza and you can build more
ovens. You can go parallel, but at the end of the day, you still take the pizza and you have
to drive it to the person's
home and you can build a lot more pizzas, but it still takes half an hour. But put the pizza oven
in the truck. Suddenly you can get your pizza and instead of half an hour, you can get it in
20 minutes or 15 minutes. So that's really what we're doing now,
is we're processing the data as it's moving through the network. And it also reduces a ton
of network traffic too, is another benefit that we get. So it's a very cool thing. So when you
think about that the data center is the new unit of computing, it really is. It's the entire data
center that's processing ones and zeros as they're flying across the network at 200 gigabits per second speed.
You mentioned this data center as a new unit of computing a number of times, Kevin.
So, you know, the cloud, right?
The cloud has been, you know, to some extent, accelerating software innovation.
But I had never really understood that they had some hardware innovation going on as well.
Are these smart NICs starting to be embedded in cloud services as well?
Absolutely. So it's kind of below the radar screen to you, but they were the first ones to realize the benefits of all the accelerators that we have.
Because if you think about a cloud vendor, what they're selling to you is virtual machines. You're going to buy a bunch
of virtual machines. And meanwhile, they have a whole bunch of policies that they need to support
like any data center. And instead of doing that as hardware appliances, like firewalls and load
balancers, which is just too expensive for cloud scale, they do it in software. And what they found was running it on the x86 was simply
too expensive. And too high latency and all that stuff. Yeah. So if you're selling virtual machines
and you're consuming, say, 30% of your CPU to run software-defined networking and software-defined
security and software-defined storage, if you can offload that
onto the DPU, the SmartNIC, then suddenly you have 30% more CPU cores that you can sell to your
customers. So for the cloud vendors, it drops right down to their bottom line. And now the
enterprise is realizing that. And that's where VMware, for example, and other partners are coming
into play because they can take advantage of this and it pays for itself. You get 30% more compute
capacity out of every server. And on top of that, you get this isolation, you get better security
than you can get when you're running things in the same domain as an untrusted application.
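As a rough worked example of that 30% point (my numbers, purely illustrative), the arithmetic looks like this:

```python
# Back-of-the-envelope: if software-defined networking, security, and storage
# consume ~30% of each host's cores, offloading them onto the DPU hands those
# cores back to revenue-generating (or application) workloads.
servers = 1000
cores_per_server = 64
infra_overhead = 0.30        # fraction of cores running infrastructure stacks on the x86

cores_total = servers * cores_per_server
cores_reclaimed = int(cores_total * infra_overhead)

print(f"total cores in the fleet:   {cores_total}")
print(f"cores freed by DPU offload: {cores_reclaimed}")
print(f"extra sellable capacity:    {cores_reclaimed / cores_total:.0%} of the fleet")
```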
Yeah. So the security aspect wasn't necessarily
that obvious as a benefit for SmartNICs, but it is one of the pillars of the technology, right?
Absolutely. So I think as some of these attacks have really become more sophisticated and more
nefarious, so things like the Spectre and Meltdown side channel attacks, which are really,
really complex. They're looking at things that are happening, not in the programs themselves,
but in the hardware, in the CPU. Suddenly, if you're running your security policy
on the same processor, you're inviting people right into the middle of your cloud. You're inviting them into
your data center. That perimeter security model, the firewalls at the edge of your data center,
doesn't work anymore because you're telling people, hey, go install whatever you want right in the
middle of my data center. If you can take over the x86 and take over the security policy, then you can take over the data center. And so now what
we've realized is that putting the security policies and storage access policies and data
access policies into an isolated security domain, which is the DPU, suddenly now there really is
separation between the application processing domain and the security processing
domain. It's kind of, I'm just trying to figure out, you know, because these sorts of things,
processing data, you know, doing things like accelerated data transfers and offloading the
processor with all this stuff takes compute, takes memory, takes networking, it takes all this stuff.
Are you embedding all these sorts of capabilities in these smart NICs?
Absolutely.
So like ARM processors and stuff?
Absolutely.
So our DPU has eight super powerful ARM processors, the A72 multi-gigahertz processor.
So this is really a powerful engine.
But what we're not doing is actually taking the exact same software-defined functions that were running on the x86 CPU and just moving them to the ARM processor. Because frankly, that wouldn't
really accomplish what we need to do. x86 is a really good processor. So it's the combination
of those ARM processors with all of the acceleration for networking, for storage,
and for security. It's the acceleration combined with the programmable engines that really makes
this a powerful DPU SmartNIC. And the density of the data center that's being leveraged to support
these algorithms has got to be far less when placing this quantity of DPU against the CPU
and even the GPU for that matter. Yeah, the real key is efficiency. And so we like to talk
about total data center efficiency because that's the whole key. If you can do more with less,
so less boxes, less power, you know, less everything. And the way to do that is do
everything efficiently. And the CPU is really good at single-threaded application processing. And the GPU
is really, really good at highly parallel accelerated processing. And the DPU is really
good at compressing data and looking at the packet information and deciding where to steer it.
And maybe sometimes recognizing, hey, this is a denial of service attack. We're going to put a
firewall or an
intrusion detection system in every single server because that's where the attacks are. They're now
in the center of your data center, not at the perimeter. They're not just coming from north-south
traffic; they can potentially be east-west. And when we do that, then you're just getting way
more performance than if you're looking at every packet, detecting a DDoS attack, and then dropping
the packet. What's the point? It feels a little like the micro-segmentation that VMware has
introduced with NSX, which, if you ask me, is one of the key components of leveraging that technology
ever. Absolutely. Absolutely. So NSX is, you know, the fantastic control plane that
implements micro segmentation, where really you're putting security around every application.
And if you try to do that in hardware, it's just cost prohibitive, and even if you could, you don't
see the individual virtual machines and applications that you're protecting.
And now what we're doing with the DPU is we can actually offload that from the x86 processor and run that on the DPU.
So with that in mind, I wonder what the software interface looks like and the ability to deploy these kind of security rules across an entire, say, cluster, is it as globally recognized as an NSX environment might be?
Yeah, I think that's a great question.
So I think that's exactly what we're drilling into with our deep R&D connections with VMware.
Ultimately, the key here is it shouldn't be any different which processor it's running on, whether it's running on the DPU or the CPU.
It should all look and feel the same. And that's the goal that we're executing towards. It's almost like you're creating a new data center architecture here.
I mentioned CPU and GPUs.
I understand that.
I understand networking.
But the DPU is another entity altogether.
I mean, are you going to be able to gradually introduce DPU functionality to the data center?
Obviously, the VMware connection is going to make a big
difference here as well, I guess. I don't know what I'm asking here.
No, I think you're asking the right questions. And I think you're right that VMware is really
a critical partner to making this all work in a way that is very familiar to the guys that are managing enterprises today. And so the fact is that if you can go look at your VMware instance
and vSphere and manage things in the same way that you always have,
again, that's what we're doing here.
There's not going to be any difference because it's running on an x86 or a DPU.
So if it's a CPU or a DPU, it is all going to run the same. And so that's what we're
looking forward to is working tightly with our folks at VMware so that it manages and the look
and the feel and the way you configure all of your security rules across your entire data center
is managed the same way as it would be if it was running on a CPU.
So VMware is the lion's share of virtualization out there,
but there are a few others out there as well.
Are you guys working with them as well?
Yeah, I think there's a ton of folks on the Linux side,
and they were part of our announcements as well
that we made at GTC.
So the Red Hats and the Canonicals
are all running on top of our DPUs as well.
And really there, we have a huge investment in the open source community,
certainly on many of our HPC applications.
But now we're starting to see the AI workloads and many other workloads.
And there you've got a whole other set of ecosystem partners
that we're working with to do micro-segmentation there.
So it's a new DPU world then.
That's what you're envisioning.
And quite frankly, NVIDIA is not the only guy out there anymore.
I mean, there's at least a couple of startups and others that are looking at this technology
as well, right?
Yeah, there are.
I think what really differentiates us is the fact that we're committed to a roadmap, that we have a platform, and that we have a software API called DOCA, which is not going to change over time.
Things that you wrote yesterday will work on future versions of DOCA,
and new things that will be developed will work.
And so with that DOCA commitment, it's very much like our CUDA.
So CUDA is the application framework that we built for the GPUs,
and there's just a ton of different vertical workloads that
run on top of our CUDA infrastructure, our software infrastructure. Oh, God, yeah. I've been
using CUDA for quite a while, both in the, sorry to say, you know, a blockchain kind of solution,
as well as AI workstations and stuff like that. CUDA made a big... Bitcoin's up to $10,000 again.
Well, luckily I wasn't doing Bitcoin, of course.
Yeah, but CUDA's fantastic because it doesn't matter what the GPU is that's underneath it.
We're making massive wholesale architectural acceleration changes, but the CUDA layer on top stays the same so that all your software investments just keep running on top of CUDA as you're getting these underlying hardware acceleration
benefits from the architectural changes we're making.
And you see DOCA as doing pretty much the same sorts of things for SmartNIC DPUs, I guess.
Exactly.
So DOCA is going to do the exact same thing for all the data center services.
And so as we implement new accelerations from generation to generation, and that goes to the roadmap that I talked about, which is both on the DPU itself, but also on an integrated platform.
So the other thing that differentiates us is AI, because frankly, that's the driving force today within the compute world and really in business
is that companies that are becoming AI-enabled will succeed
and those who are lagging
and not taking advantage of their data won't.
And what we've done is we've combined the DPU with our GPU.
And so initially it's what we call our Bluefield 2X
and we showed a whole roadmap going through Bluefield 3X.
And even today, we can deliver the performance of 128 CPU cores.
Looking forward, that's just going to explode once we combine that with the GPU.
It's going to be thousands of CPU core equivalents on one card. One of the things about CUDA is that, although it might have been an
NVIDIA solution at first, there are other, God forbid, other GPU vendors out there that are also
plugging into the CUDA ecosystem, I'd call it. You see the same sort of thing with DOCA?
Yeah, certainly we can see that people will try to adopt our frameworks and take
advantage of that. Really, we like the competition as long as we just keep running faster. And part of that
is achieving scale so that we can just continue to invest. So one of the things I love is looking at
all of the different workloads, whether it's medical imaging,
which is our Clara framework, or Merlin, which is our recommendation engine, or something like
Aerial, which is our 5G stack for doing actually the radio, software-defined radios. There's just
so many different things that we're developing on top of CUDA. And we've got millions of developers for our GPUs.
And being able to take advantage of that now and do the same thing with data center services,
that's what's exciting, is that we can pull all these different things together
and start to see something. And if we're investing in enabling our partners,
we think they're going to continue to work with us and be part of the journey that we're making to accelerate not just the AI, but now also all the data center services.
God, you're trying to take over the data center here.
Well, I don't think it's take over the data center. There's a ton of things that we are not doing that we rely on partners. And that's
what I was talking about with the ecosystem. So whether that's security partners or software partners like
VMware or all our friends in the Linux community, micro-segmentation companies like GuardiCore that does
something similar for Linux to what NSX does with VMware. Then, you know, there's storage boxes. We have a ton of storage partners
that we're working with to validate their platforms to be able to do GPU direct storage.
All the server vendors,
we work with all of the major server vendors to incorporate.
We do reference architectures.
We have something called NVIDIA certified
so that we will go run a whole bunch of tests
that test security and GPU direct storage, and that they can run all of the AI workloads that we have in our NGC container platform.
So a data center takes a community, that's for sure.
We're not trying to take it all over.
We just want to accelerate it. Can I ask about the DPU-connected storage and what that actually looks like?
Is it dedicating an IP stream towards the storage, leveraging RDMA or one of those protocols, a la, forgive me for saying this, iSCSI? Or how are you connecting
your storage directly to the NIC? Yeah, so you're absolutely right. It's this GPU direct storage,
which we call GDS. In the old days, you would have to get data from the storage box and then run it up into the CPU, probably running a
protocol like iSCSI, which effectively runs over standard software TCP, UDP protocols.
And then you would copy it from CPU memory and then put it to the GPU.
So if the GPU wanted memory, it actually had to go into the CPU's memory subsystem and then cross the PCI
interface between the CPU and the GPU. Today, what we can do is with a smart NIC or a DPU,
we can actually take data directly from the storage device, never landing it in the CPU's memory, and drop it directly into
the GPU's memory.
And GPUs can talk directly to each other.
And the way they do that is in fact using the technology you talked about, the RDMA.
And we can do that with file systems, so NFS over RDMA.
We can do it with block storage with things like NVMe over fabrics.
So we have the ability now to make a direct connection from the data that's in a storage box into the GPU memory without using any CPU cycles and without the data passing through the CPU subsystem.
And that's just a huge performance gain because we're moving so much data and processing so much data.
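Here is a small conceptual sketch (my own model, not the actual GPU direct storage or cuFile API) of the difference Kevin is describing, simply counting the copies each path makes:

```python
# Conceptual model only: the legacy path bounces data through CPU memory,
# while the direct path lets the DPU/NIC DMA data straight into GPU memory.
from dataclasses import dataclass, field

@dataclass
class Path:
    name: str
    hops: list = field(default_factory=list)

    def copy(self, src: str, dst: str) -> None:
        self.hops.append(f"{src} -> {dst}")

def legacy_read(path: Path) -> None:
    """iSCSI-style path: NIC lands data in CPU DRAM, CPU copies it to the GPU."""
    path.copy("storage target", "CPU DRAM bounce buffer")
    path.copy("CPU DRAM bounce buffer", "GPU memory (across PCIe)")

def direct_read(path: Path) -> None:
    """GPU-direct-style path: data is placed in GPU memory via RDMA, no CPU touch."""
    path.copy("storage target", "GPU memory (RDMA)")

for fn in (legacy_read, direct_read):
    p = Path(fn.__name__)
    fn(p)
    print(f"{p.name}: {len(p.hops)} copies -> {p.hops}")
```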
Yeah, I could see massive effect of this.
I mean, along the lines of what we've seen with NVMe, in terms of storage-related bandwidth
utilization, just dropping that window down drastically from a latency perspective.
Yeah, you're absolutely right. And the latency
is a strange thing. People look at the average latency, but in these scale-out
things it's actually the tail that matters. There was a great paper that Google wrote called The Tail at Scale, and it's not your
average latency. It's that 99.9%. One out of a thousand times, if you're dealing with something in
software, the CPU is off taking an interrupt, who knows what it's doing. And once in a while,
you know, your average latency is one millisecond, but once in a while, it's 25 milliseconds. And
if you have a thousand nodes and that happens one out of a thousand times, that means that it
happens every time. You get
your worst case latency rather than the average latency. And if we do everything in hardware and
we just deliver the data directly to the GPU memory and we're never in software and we're
using RDMA to move the data and do everything, then what happens is it's not just that it's
faster on average, it's faster and it's predictable.
It's deterministic latency across all thousand nodes.
And that's the huge thing that in the scale-out processing really brings to bear.
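To put rough numbers on that tail-latency argument (illustrative figures, borrowing the 1 ms average and 25 ms hiccup from above):

```python
# "Tail at Scale" arithmetic: if each node is slow only 0.1% of the time, a
# request that fans out across 1,000 nodes almost always waits on a slow one.
p_slow_single = 0.001        # one in a thousand requests hits a 25 ms software hiccup
nodes = 1000

p_all_fast = (1 - p_slow_single) ** nodes
p_at_least_one_slow = 1 - p_all_fast

avg_ms, tail_ms = 1.0, 25.0
expected_ms = p_at_least_one_slow * tail_ms + p_all_fast * avg_ms

print(f"chance the whole fan-out dodges the tail: {p_all_fast:.1%}")           # ~36.8%
print(f"chance at least one node is slow:         {p_at_least_one_slow:.1%}")  # ~63.2%
print(f"effective completion time: ~{expected_ms:.0f} ms vs {avg_ms:.0f} ms average")
```

So roughly two out of three fan-outs hit the worst-case latency, which is why deterministic, hardware-path latency matters more than a good average.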
It's not just latency, right? It's bandwidth too.
I mean, you have the capability of increasing the bandwidth of the network as well, right?
Absolutely. So the bandwidth there, you know,
we're still limited by physics and just how fast we can drive these SerDes. But today we're
shipping 200 gigabits per second. And as I talked about, we've got nine of those in one of our DGX
boxes. So you're talking about 1.8 terabits per second. So things that just weren't possible
previously in terms of moving large data sets between nodes,
suddenly with that kind of bandwidth between nodes, you can just move a ton of data.
And bandwidth and latency are obviously related in terms of when you're trying to move a gigabyte of data between two boxes.
We can do that in a really short period of time now. So from a storage perspective, the major acceleration capabilities are the GPU direct capability that you're adding to the NIC, as well as the RDMA offload for NVMe and that sort of thing, right?
Absolutely. And so RDMA is a key part of GPU direct storage.
And then on top of that is the NVMe that you mentioned. NVMe in and of itself gives much,
much faster storage access by really eliminating a ton of bus protocol transactions. Instead of
using back and forth handshaking across a bus, you just stick everything in memory, a whole bunch of commands that say, hey, do all this stuff.
And then you point the device, our DPU, to that location in memory and say, go do all this stuff.
And we read everything from memory and then we push the data out. Offloading all of the storage access to the DPU provides the
offload, accelerate, and isolate benefits that I talked about before. In the past, we've talked about,
I would call it PCIe extension kinds of capabilities, which were all pretty much
proprietary hardware kinds of things. But nowadays, it seems like they're doing a lot
of this over Ethernet. Is the SmartNIC enabling some of that as well? Yeah, I think, you know,
it's interesting. I've been around a long time. So I remember the very first PCI Express specs
when it replaced some of the things that were happening on the parallel bus. And the concern
was it was going to wipe out things like InfiniBand and Ethernet, but it really never has because PCI Express does something
really well, which is a local memory-based access, but it doesn't have that notion of,
hey, I have a million connections to a million different things. Those are processes or people.
It just doesn't scale that way.
It's just a giant flat memory space.
And so Ethernet and also InfiniBand has that ability to multiplex
thousands of different communication threads
and demultiplex them on the other end
and run that across the network
and do so reliably.
So I've lost a packet.
Nothing breaks. We resend in hardware. Today, all of our RoCE, which is our RDMA over Converged Ethernet, works in a ZTR mode, which is Zero Touch RoCE, we call it, where you don't have
to configure the network. You don't do anything. The hardware itself takes care of it. So unlike TCP/IP, where you've got software that's doing timeouts and then retransmission,
all of that's happening in hardware with both InfiniBand and with RoCE now.
And so it just runs way faster. Well, this has been great. So Matt,
any last questions for Kevin before we go? I really don't, Ray. Thanks for asking.
Okay. Kevin, anything you'd like to say to our listening audience before we close? No, it's just great to talk to you guys. I love
when you get it and you're asking good questions that are going to help us drive forward. We're
looking forward to partnering with all of our industry partners here and changing the world
for better with AI because that's really what it's about. Okay. Well, this has been great.
Thank you very much, Kevin,
for being on our show today.
Thank you.
Great to be here.
And that's it for now.
Bye, Matt.
Bye, Ray.
And bye, Kevin.
Bye-bye.
Until next time.
Next time, we will talk to
another system storage technology person.
Any questions you want us to ask, please let us know.
And if you enjoy our podcast, tell your friends about it.
Please review us on Apple Podcasts, Google Play, and Spotify, as this will help get the word out. Thank you.