Grey Beards on Systems - 144: Greybeard talks AI IO with Subramanian Kartik & Howard Marks of VAST Data
Episode Date: March 28, 2023. Sponsored by VAST Data. Today we talked with VAST Data's Subramanian Kartik (@phyzzycyst), Global Systems Engineering Lead, and Howard Marks (@DeepStorage@mastodon.social, @deepstoragenet), former GreyBeards co-host and now Technologist Extraordinary & Plenipotentiary at VAST. Howard needs no introduction to our listeners, but Kartik does. Kartik has supported a number of customers implementing AI apps at VAST and prior …
Transcript
Hey everybody, Ray Lucchesi here. Welcome to another sponsored episode of the Greybeards
on Storage podcast, a show where we get Greybeards bloggers together with storage and system vendors
to discuss upcoming products, technologies, and trends affecting the data center today.
This Greybeards on Storage episode is brought to you today by Vast Data, and now it is my
great pleasure to introduce Subramanian Kartik, Global Systems Engineering Lead, and Howard
Marks, former co-host, now Technologist Extraordinary and Plenipotentiary, both at Vast Data.
So Kartik and Howard, why don't you tell us a little bit about yourselves and what's new
at Vast?
Full disclosure, Vast paid me recently to write a paper for them on AI I/O. Okay, take it away. All right. Thank you, Ray. I appreciate the
introduction and appreciate the chance to be on your podcast. Just a little bit about myself,
folks. I've been at VAST data now for three years. Prior to that, about 20 years at EMC and Dell.
And before that, I was an academic. I used to be a particle physicist
and did a lot of research work in that area before I came into the industry. And it's been
an amazing ride here at VAST over the last three years. We've grown at a pace which is hard to
imagine. And we continue to go and take more and more market. And we're innovating like there's
no tomorrow. Over to you, Howard.
Well, as long-term listeners to this podcast may recognize,
I was co-host of Greybeards on Storage in my days as an independent storage analyst,
but I've now turned to the dark side
where I run product evangelism
and explaining how things work at Vast Data,
hence my title,
Technologist Extraordinary and Plenipotentiary.
So about this paper I recently wrote for Vast Data, all about AI I/O,
I spent a lot of time with Kartik talking about what I/O looks like
for deep learning, neural network training and inferencing, and that sort of stuff.
So why is AI IO so difficult to understand, Kartik? Maybe you can give us a brief
understanding there. Sure. I mean, I'm happy to give you my perspective here, Ray.
Honestly, it's not difficult to understand in the sense that if you look at how AI processing is done, just to give you some context, there are two sort of main processing phases, shall we say, for AI models.
One is called training.
This is when you expose a model to data that it hasn't seen before.
And by looking at more and more data, it refines the model.
And it measures its success by looking at the outcome,
that is, its prediction, as people say, versus what it's expected to predict.
And once we reach a point where that prediction is optimally matched
to what we expect, at that point, the
trained model is put into the next phase, which is called inference. Inference is when you show
the model something it hasn't seen before and then have it make a prediction. This virtuous cycle
keeps continuing. Sometimes the models keep getting refined, et cetera. So the IO characteristics of this process are
dominated in the training phase, interestingly, by a lot of reading. Because you start with a
lot of data and all you really do to the data is you read a lot. And then you write out a very
small amount. That small amount is typically the model that actually gets refined.
We're talking about like the neural network at that point, right? Exactly. I'm talking about a neural network model. This is more for deep learning,
but actually this also holds for any other kind of model. You read in a lot of data,
and then you write out the model. It's almost like programming with data, wouldn't you say, Kartik?
It is in some sense. I mean, the interesting thing about deep learning especially is that the model itself figures out what the features are that it is supposed to look for instead of the human being providing the features.
So more like programs programming themselves with data.
Exactly.
I guess.
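To make the read-heavy, write-light pattern Kartik describes concrete, here is a minimal PyTorch sketch. The dataset, model, and checkpoint path are illustrative placeholders, not anything from the episode: the loop re-reads the whole dataset every epoch, and the only write is a small checkpoint at the end.

```python
# A minimal sketch (not from the episode) of the read-heavy, write-light I/O
# pattern of training: stream a large dataset in, write only a small set of
# model weights back out. Dataset, model, and checkpoint path are illustrative.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for many gigabytes of training samples read from shared storage.
dataset = TensorDataset(torch.randn(100_000, 128), torch.randint(0, 10, (100_000,)))
loader = DataLoader(dataset, batch_size=256)

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):                # every epoch re-reads the whole dataset
    for features, labels in loader:   # reads dominate the I/O profile
        optimizer.zero_grad()
        loss = loss_fn(model(features), labels)
        loss.backward()
        optimizer.step()

# The only write: a checkpoint that is tiny relative to everything that was read.
torch.save(model.state_dict(), "model_checkpoint.pt")
```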
My favorite example is they trained a model to try and identify cancerous and benign tumors.
And the model concluded that rulers led to cancerous tumors because all the photos of cancerous tumors were from medical books where
there was a ruler in the picture for skin. Yeah, yeah. That's interesting. There's quite a lot of
things of that nature in AI history and things like that. They're getting better at this,
Howard. I'll say that much. It's not like that. So in a nutshell, there's a lot of data that needs to be processed in one form or another to construct a neural network model.
And, you know, the question is, you know, is that data something that gets read quickly or is it read slowly or is it read in some fashion that it just gets brought into memory and then left there? You know, those are the sorts of questions that lead to some mystification
or mystery with respect to AI workloads and that sort of thing.
Good questions.
I mean, these days, like the talk of the town is, you know,
basically GPU intensive models, commonly known as neural networks
or also called deep learning.
The reading patterns for this are such that the GPUs tend to get saturated usually before the IO subsystem
gets saturated, especially for training. It could be the opposite for inference,
where the IO subsystem is taxed more because by that time the model is already trained.
So the computationally intensive part of that is done. But the one outstanding characteristic
for AI is not only that it reads a lot, and there's a lot of data that's input, but it's also
very random in its nature of reads. So it's random reads that dominate AI environments more than
anything else. In contrast to something like HPC, which would be sequential and large sequential
blocks being read and things of that nature. In the case for AI, deep learning neural networks
processing, it's all random reads? Is that what you're saying?
Yeah. So you're spot on. Traditional high-performance computing was built for a
large amount of data in and even more sometimes written out. And the IO patterns were large block
sequential. AI training is characterized by small to medium block random read IOs.
Reads are 95% plus of the workloads that you see.
And the media that supplies those reads should be able to support random IO.
This is where something like Flash would be of significant benefit,
wouldn't you say, Howard?
Yeah, we certainly think so.
I might say we're a bit biased, but yeah.
Yeah, yeah, yeah.
Exactly, exactly.
Well, you know, it's serious types of IO.
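As an illustration of why those reads come out random, here is a hedged PyTorch-style sketch; the directory of small JPEGs and the mount path are assumptions for illustration only. The loader shuffles sample indices every epoch, so the storage sees a stream of small reads in effectively random order.

```python
# A hedged sketch of why training I/O skews toward small random reads: the
# DataLoader shuffles sample indices every epoch, so each __getitem__ hits a
# different small file in effectively random order. Layout and mount path
# below are assumptions, not anything named in the episode.
from pathlib import Path
from torch.utils.data import Dataset, DataLoader

class ImageFolderBytes(Dataset):
    def __init__(self, root: str):
        self.files = sorted(Path(root).glob("*.jpg"))   # e.g. ~1.3M small files

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # One small read per sample; decode/augment steps omitted for brevity.
        return self.files[idx].read_bytes()

def identity_collate(batch):
    # Keep the list of raw byte strings as-is; no tensor stacking here.
    return batch

loader = DataLoader(
    ImageFolderBytes("/mnt/datasets/imagenet/train"),   # hypothetical mount
    batch_size=256,
    shuffle=True,                      # randomizes which files are read
    num_workers=8,                     # many workers issuing reads in parallel
    collate_fn=identity_collate,
)

for raw_images in loader:              # storage sees mostly small random reads
    pass                               # training step would go here
```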
Yeah, the other question I have is a lot of these models,
when you look at something like the MLPerf, which is ML Commons,
benchmarks and things of that nature, they're throwing lots of GPUs at training.
It's not like it's a one-off training,
one GPU to train a model that sometimes it's on the order of,
God forbid, a thousand GPUs or something like that.
Oh, yes.
In that sort of situation, does that boost the data bandwidth or not?
It depends on the class of the problem which we're looking at.
When you need a large number of GPUs, it typically means that the model is complex and needs lots of computation connected with it. And sometimes the model size itself is so large that it won't fit
in GPU memory and you need to be able to spread it out across multiple GPUs, in many cases,
multiple GPU servers. Extreme examples of this would be some of the large language models
that are so popular in the press these days,
such as GPT-3 or 3.5, which is the backend for ChatGPT.
And those kind of models often require a thousand or more GPUs to train
across a very, very large number of servers to be able to do this.
Something like GPT-3.5 is on the order of 75 billion neural network nodes or parameters?
More like GPT-3 was 175 billion parameters in it. The model itself is in the hundreds of gigabytes in size.
Too big to fit in a single GPU's memory.
You're going to have to spread it out.
And there's so much computation involved
that multiple GPU servers will work in cooperation
and GPUs from one server will communicate to GPUs
from another server just to exchange information
as the processing goes on.
And that's just the training.
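As a rough illustration of "spreading the model out" across GPUs, here is a hedged sketch using PyTorch's FullyShardedDataParallel, which shards parameters, gradients, and optimizer state across ranks. The model is a stand-in, and the torchrun launch is an assumption; the episode does not name any particular framework.

```python
# A minimal sketch (not VAST- or NVIDIA-specific) of sharding a model across
# many GPUs with PyTorch FSDP. Assumes launch via
# `torchrun --nproc_per_node=<gpus> this_script.py`, which sets the usual
# rank/world-size environment variables.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")            # GPUs exchange data over NCCL
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# Stand-in for a model too large to train comfortably on a single GPU.
model = torch.nn.Sequential(*[torch.nn.Linear(4096, 4096) for _ in range(64)]).cuda()

# FSDP shards parameters, gradients, and optimizer state across all ranks,
# gathering shards on demand during the forward and backward passes.
model = FSDP(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 4096, device="cuda")
loss = model(x).sum()
loss.backward()                                     # gradients exchanged between ranks
optimizer.step()
dist.destroy_process_group()
```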
Now, something like that in inferencing would be quite extraordinary, right?
I mean, you might have to.
Inferencing is interesting.
Inferencing, because the model is fully trained, you can actually embed the model in something quite a bit smaller to do inference.
I mean, think, for example, of something like autonomous driving.
You train the model with a ton of HD video and LiDAR data, thousands of GPUs, but the
final model is small enough to fit in an automobile.
It doesn't have any highly specialized hardware in it.
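To sketch that contrast, here is a minimal, hedged example of the inference side: the expensive work already happened during training, so serving is just loading a relatively small checkpoint and running forward passes. The file name and tiny model mirror the illustrative training sketch earlier, not any real deployment.

```python
# A minimal sketch of inference on an already-trained model: load a small
# checkpoint, disable gradients, and run forward passes on unseen inputs.
# Checkpoint name and architecture are illustrative only.
import torch
from torch import nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
model.load_state_dict(torch.load("model_checkpoint.pt", map_location="cpu"))
model.eval()                              # no gradients, no optimizer state

with torch.no_grad():
    sample = torch.randn(1, 128)          # one input the model has never seen
    prediction = model(sample).argmax(dim=1)
    print(prediction.item())
```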
Some of these automobiles are almost like data
centers on wheels anymore, Kartik.
These days? Oh, yeah.
You're showing our age, Ray.
I guess.
But, you know, it's
got, you know, a terabyte of storage and it's sitting there with probably, I don't know, eight or nine CPU chips with multiple cores each or something.
It's crazy.
We both remember when that was a whole data center, but most of our listeners don't.
I do too.
Yes.
I'm old enough to qualify for that.
So one of the things that was sort of striking, which kind of was the, I would say almost the leading idea behind the paper, we were looking at the NVIDIA DGX reference architecture papers.
And so they've got, oh, I don't know, six or seven different storage vendors have produced DGX reference architectures.
And they all have, I'm not sure, I think it's a RetinaNet, no, a ResNet, yeah, a ResNet-50 training run that they show the performance
of their storage with. And across all six of these, it almost looks exactly the same. I mean, whether
you're using one DGX or four DGXs, what, 32, maybe 64 different GPUs or something like that.
The performance of the storage looked exactly the same.
What does that tell you, Howard?
It tells me that there's a lot of compute going on and the DGXs aren't just sucking
data as fast as they can.
So the storage isn't the bottleneck.
Yeah, storage is not the bottleneck in this case, which is kind of interesting when you think about 64 GPUs
sitting there, 40 gig or 80 gig of memory each, you know,
churning on this, on, you know, what, ResNet-50.
I'm not sure that's a very sophisticated image recognition model.
I have no idea what the size of it is, but.
Oh, it's a relatively small one.
You know, it's, it was a very famous one, though.
I mean, this was the initial, what they call ImageNet Challenge.
It consists of about 1.3 million images, roughly 115K each.
So they're pretty small images, which is why, like Howard said,
basically the fact that any vendor you look at can do pretty much the same tells you that that particular benchmark is not IO-bound.
Yeah, yeah, yeah.
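Some rough arithmetic makes the point concrete; the per-GPU training rate below is an assumed figure for illustration, not one quoted in the episode.

```python
# Back-of-the-envelope arithmetic for why the ResNet-50 / ImageNet reference
# benchmarks look storage-agnostic. All numbers are illustrative assumptions;
# per-GPU throughput in particular varies widely by GPU generation.
images = 1_300_000
avg_image_bytes = 115 * 1024                 # ~115 KB per image, per the discussion
dataset_bytes = images * avg_image_bytes
print(f"dataset size            ~ {dataset_bytes / 1e9:.0f} GB")      # ~153 GB

gpus = 64
images_per_sec_per_gpu = 2_500               # assumed ResNet-50 training rate
required_read_bw = gpus * images_per_sec_per_gpu * avg_image_bytes
print(f"required read bandwidth ~ {required_read_bw / 1e9:.1f} GB/s") # ~18.8 GB/s

# Roughly 19 GB/s across 64 GPUs is comfortably within reach of the all-flash
# systems in those reference architectures, so the GPUs, not the storage,
# set the pace.
```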
But there are plenty of other models out there, some of which are IO-bound.
And there are plenty of other protocols out there as well that different systems can utilize with NVIDIA GPUs and things of that nature that boosts
some of these workloads up considerably.
And also remember, we're saying that all of these vendors whose all-flash systems were
fast enough.
Right, right.
Yeah, exactly.
So it's interesting, all the reference architectures that NVIDIA has produced so far for any vendor, all of them happen to be on all-flash systems.
Yeah.
And for the reasons we talked about.
Right, because of the random small reads and things of that nature.
Yeah, yeah, yeah, yeah, yeah. So we talked about a little bit in the paper, a new protocol that's recently come out called, I think, NVIDIA GPU Direct Storage.
Yeah.
And what does that do for this sort of workload and that sort of thing? Yeah, so what NVIDIA noticed was as their GPUs got more and more powerful
and they were put on systems with a moderate amount of CPU,
it was difficult to keep the GPUs busy.
I mean, their appetite is voracious.
Even though they're compute-bound, effectively,
it's still problematic to keep the GPUs busy.
Well, GPUs have more memory bandwidth to their onboard memory than the CPUs do.
Exactly. And so what becomes the problem is, as you RDMA off of multiple 100 gigabit per second Ethernet cards into CPU memory to get to GPU memory, the CPU memory became a bottleneck. And GPUDirect allows the NIC to RDMA directly into GPU memory, bypassing the CPU memory.
So it goes from storage direct to GPU without touching the CPU at all?
Right.
And the CPU sort of coordinates the process, no doubt,
but everything else is moving data from one memory,
storage memory to GPU memory or vice versa, I guess, right?
Yeah, because otherwise, if you look at the data path,
you'd have to move the data from the storage to CPU and system memory and then move it from system memory to GPU memory.
In many cases, that's an unnecessary step if you're not going to actually modify the data in memory.
And you're limited then by the backplane bandwidth between system memory and the GPU subsystem.
Instead of that, why not cut out the middleman? Just cut directly to GPUs.
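For a sense of what that data path looks like from application code, here is a hedged sketch using the RAPIDS kvikio Python bindings to NVIDIA cuFile (GPUDirect Storage). The library choice and the file path are assumptions for illustration; the episode only discusses the protocol.

```python
# A hedged sketch of a GPUDirect Storage read using the RAPIDS `kvikio`
# bindings to NVIDIA cuFile. Library choice and path are illustrative; when
# GDS is available, the read lands in GPU memory without bouncing through
# host (CPU) memory first.
import cupy
import kvikio

# Destination buffer allocated directly in GPU memory.
gpu_buffer = cupy.empty(256 * 1024 * 1024, dtype=cupy.uint8)   # 256 MB

# Hypothetical training shard on a GDS-enabled (e.g. RDMA NFS) mount.
f = kvikio.CuFile("/mnt/vast/train/shard-0001.bin", "r")
bytes_read = f.read(gpu_buffer)        # DMA from storage into GPU memory
f.close()

print(f"read {bytes_read} bytes straight into GPU memory")
```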
So does GPU Direct work with,
obviously it works with Ethernet RDMA
as well as InfiniBand RDMA.
Is it a file services protocol
or is it a block services protocol?
I'm just trying to understand the nature of it.
So under the covers,
you're going to be moving data
from storage to a memory address in the GPUs.
So whatever this is,
it needs to be a direct memory access operation.
Hence, the storage subsystem
has to support either DMA or RDMA.
Right.
Now, at VAST, we implement this with NFS because that's what we do.
We expose everything with NFS and RDMA is the underlying transport for the NFS, as opposed
to what people usually see, which is TCP.
Other vendors have native clients, which are DMA clients, and they can use DMA as well
to get the data into the GPUs.
Right. But in essence, it's basically a very low-level hardware transfer of data from storage into GPU memory.
More than that, GPUDirect is a file protocol. AI uses files, not block.
Yeah, yeah, I was going to say that. Whether that's a parallel file system or a NAS like us.
Yeah, so most of the training data nowadays is all S3 objects or files and things of that nature.
Isn't that the way they're structured?
Yeah, see, the thing is that you need every GPU server in your GPU farm to be able to see the same data.
That means it has to be some shared storage.
And file gives you a very clean way to be able to see all the data.
Right.
No matter how big your GPU farm is.
So file is the preferred exposure.
Yeah.
Either through a parallel file system or through NAS of some kind or objects.
You're right.
Now, for things like prediction or classifications and things of that nature, typically the training
data has got, you know, the data itself, which is an image, and then maybe the actual classification,
which might be objects that are detected in the image and that sort of thing.
Are those spread across lots and lots of directories,
or is it typically one or two directories?
What does that structure look like in a file system?
So it's not so much directories that matters.
It's mount points.
It's a single namespace that matters from an operational standpoint.
Typically, each server should be
able to do one mount, and that one mount should contain all the data. Now that mount may have subdirectories and
folders and stuff like that. That's fine. But the key thing is what you don't want to be
dealing with is the headache of managing multiple mounts on the same client.
So the ability to have large mounts, large namespace,
and keep in mind, some of the data sets could be multiple petabytes in size.
You want to be able to expose all those petabytes without having to do this and deliver the full performance of your storage to that mount.
Right.
Now, some storage has problems with a single mount point
and providing high-performance bandwidth and things of that nature.
But apparently, VAST doesn't have this problem.
What's VAST's secret recipe for being able to sustain high performance
to small numbers of mounts and small numbers of directories?
The basic problem is TCP.
If you mount an NFS mount point over TCP,
that's until just recently one TCP session.
And therefore you have the bandwidth-delay product problem: you can only
have so much data in flight, and that limits the
bandwidth of that mount point to about 2.5 gigabytes per second. So
about two years ago the nconnect mount option made it into most common Linux distributions.
And that lets you specify, as a mount option, nconnect equals four.
And instead of using one TCP session, the NFS client will use four TCP sessions.
And so that boosts the total available bandwidth from 2.5 gigabytes per second to 10 gigabytes per second.
Then we have to attack the latency problem.
So instead of running NFS over TCP,
we run NFS over RDMA,
and that reduces the time it takes to process each request and the effective bandwidth
because the system is spending less time sitting there doing nothing. Finally, we've made our own addition to the NFS client
that allows us to spread the multiple connections that nconnect specifies over multiple source
destination IP address pairs.
So different NICs, different nodes on the scale-out storage system.
And so with the combination of all of those,
we can saturate multiple 100 gigabit per second connections
from the storage to a single client.
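To put rough numbers on that single-session ceiling, here is an illustrative calculation; the window size, round-trip time, and the example mount command in the comments are assumptions chosen to approximate the figures quoted above, not measurements from the episode.

```python
# Illustrative bandwidth-delay product arithmetic for the single-TCP-session
# NFS ceiling described above. Window size and round-trip time are assumed
# values picked to roughly reproduce the ~2.5 GB/s figure from the discussion.
window_bytes = 256 * 1024          # assumed effective window per TCP session
rtt_seconds = 100e-6               # assumed ~100 microsecond round trip

per_session = window_bytes / rtt_seconds
print(f"one TCP session ~ {per_session / 1e9:.1f} GB/s")         # ~2.6 GB/s
print(f"nconnect=4      ~ {4 * per_session / 1e9:.1f} GB/s")     # ~10.5 GB/s

# A hypothetical client-side mount combining the options discussed
# (server export and mount point are made up for illustration):
#   mount -t nfs -o vers=3,nconnect=4,proto=rdma vast-vip:/train /mnt/vast
# proto=rdma attacks the per-request latency, and the multipath client patch
# described above spreads those nconnect sessions across NIC/IP pairs.
```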
Yeah, this is often something that catches people by surprise, right?
Most people say NFS, like Howard said,
straight up NFS over TCP is limited to 2.5 gigabytes a second.
We've been able to benchmark with GPU direct, to be fair,
175 gigabytes a second, single mount point, single client.
What? Over NFS? A single client? What?
Over NFS3.
What, like 64 nodes or what?
That client was a DGX A100.
Correct.
Oh, my God.
The server had 16 compute nodes?
The storage system had 16 compute nodes.
Correct.
But it's still one client accessing one.
Exactly.
That client had 8 NICs, you know, 8 200-gig
NICs. So there's plenty of bandwidth
going in, but we were able to line saturate
the sucker. Huh.
Huh. Huh. That's very
impressive for a storage system these days.
Yeah. I can remember
when, you know, 10 gigabytes per second for
block was good, let alone file.
You're dating us all.
Right.
I know, I know.
Okay, I'll stop there.
You know, it also is revelatory about the old saw that block is fast
and file is slow.
Yeah, yeah.
Block was fast and file was slow when processors were 20 megahertz and you had two cores.
And network bandwidth was 10 megabit.
Right.
If you have enough compute horsepower and you have enough network bandwidth, NAS can be just as fast as you need.
With a few tweaks here and there.
Oh, yeah.
You need to be clever.
Everything's about clever software.
Yeah, and, you know,
the way, I mean,
we contributed all the code
for these enhancements
we made to open source.
Oh, good, good.
Trying to get NFS maintainers
to make it part of the Linux kernel
so the entire community
can benefit from it,
not just us.
Right, right, right.
Well, okay.
This has been great, guys.
Kartik and Howard, is there anything you'd like to say to our listening audience before we close?
I'd just like to point out that AI is just one of the many things you can do with universal storage.
The name is hyperbolic, but just a little.
And that's the vast data storage system, right?
Yep.
Yeah.
Yeah. Yeah.
To wrap up here, Ray, first of all, thank you for having me on the podcast.
We are absolutely committed at VAST to being the best platform for AI and for virtually everything else out there.
The reason is because we feel that we solve some core problems which every serious AI practitioner
is going to face. One of them is performance. We talked about that. The other is scale.
Our contention-free architecture with no east-west traffic or cache coherency allows us to scale
to very large namespaces. And the third is we are the most affordable such system for all flash
due to our unique ability to use low-cost consumer-grade NAND
as the substrate for solid-state.
And lastly, it is easy to use.
These systems are super easy to administer.
When you're talking about this much storage and this much data, ease of use starts to become key.
Operational stability and resilience are second to none in the industry from how we operate.
Okay.
These, I think, are the reasons why we are perfect for AI.
All right.
Well, this has been great.
Karthik and Howard, thanks again for being on our show today.
Always a pleasure, Ray.
You need to, Ray.
Thank you so much. And thanks again to Vast Data for sponsoring this podcast. That's it for now.
Bye, Kartik. Bye, Howard. Take care, Ray. Until next time.
Next time, we will talk to another system storage technology person. Any questions you want us to
ask, please let us know. And if you enjoy our podcast, tell your friends about it. Please
review us on Apple Podcasts, Google Play, and Spotify, as this will help get the word out. Thank you.