Grey Beards on Systems - 144: Greybeard talks AI IO with Subramanian Kartik & Howard Marks of VAST Data
Episode Date: March 28, 2023. Sponsored by VAST Data. Today we talked with VAST Data's Subramanian Kartik (@phyzzycyst), Global Systems Engineering Lead, and Howard Marks (@DeepStorage@mastodon.social, @deepstoragenet), former GreyBeards co-host and now Technologist Extraordinary & Plenipotentiary at VAST. Howard needs no introduction to our listeners, but Kartik does. Kartik has supported a number of customers implementing AI apps at VAST and prior …
Transcript
Hey everybody, Ray Lucchesi here. Welcome to another sponsored episode of the Greybeards
on Storage podcast, a show where we get Greybeards bloggers together with storage and system vendors
to discuss upcoming products, technologies, and trends affecting the data center today.
This Greybeards on Storage episode is brought to you today by Vast Data, and now it is my
great pleasure to introduce Subramanian Kartik, Global Systems Engineering Lead, and Howard
Marks, former co-host, now Technologist Extraordinary and Plenipotentiary, both at Vast Data.
So Kartik and Howard, why don't you tell us a little bit about yourselves and what's new
at Vast?
Full disclosure, Vast paid me recently to write a paper for them on AI I/O. Okay, take it away. All right. Thank you, Ray. I appreciate the
introduction and appreciate the chance to be on your podcast. Just a little bit about myself,
folks. I've been at VAST data now for three years. Prior to that, about 20 years at EMC and Dell.
And before that, I was an academic. I used to be a particle physicist
and did a lot of research work in that area before I came into the industry. And it's been
an amazing ride here at VAST over the last three years. We've grown at a pace which is hard to
imagine. And we continue to go and take more and more market. And we're innovating like there's
no tomorrow. Over to you, Howard.
Well, as long-term listeners to this podcast may recognize,
I was co-host of Greybeards on Storage in my days as an independent storage analyst,
but I've now turned to the dark side
where I run product evangelism
and explaining how things work at Vast Data,
hence my title,
Technologist Extraordinary and Plenipotentiary.
So about this paper I recently wrote for Vast Data, all about AI I/O,
I spent a lot of time with Kartik talking about what I/O looks like
for deep learning, neural network training and inferencing, and that sort of stuff.
So why is AI IO so difficult to understand, Kartik? Maybe you can give us a brief
understanding there. Sure. I mean, I'm happy to give you my perspective here, Ray.
Honestly, it's not difficult to understand in the sense that if you look at how AI processing is done, just to give you some context, there are two sort of main processing phases, shall we say, for AI models.
One is called training.
This is when you expose a model to data that it hasn't seen before.
And by looking at more and more data, it refines the model.
And it measures its success by looking at the outcome,
that is, its prediction, as people say, versus what it's expected to predict.
And once we reach a point where that prediction is optimally matched
to what we expect, at that point, the
trained model is put into the next phase, which is called inference. Inference is when you show
the model something it hasn't seen before and then have it make a prediction. This virtuous cycle
keeps continuing. Sometimes the models keep getting refined, et cetera. So the IO characteristics of this process are
dominated in the training phase, interestingly, by a lot of reading. Because you start with a
lot of data and all you really do to the data is you read a lot. And then you write out a very
small amount. That small amount is typically the model that actually gets refined.
We're talking about like the neural network at that point, right? Exactly. I'm talking about a neural network model. This is more for deep learning,
but actually this also holds for any other kind of model. You read in a lot of data,
and then you write out the model. It's almost like programming with data, wouldn't you say, Kartik?
It is in some sense. I mean, the interesting thing about deep learning especially is that the model itself figures out what the features are that it is supposed to look for instead of the human being providing the features.
So more like programs programming themselves with data.
Exactly.
I guess.
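To make the read-heavy, write-light pattern Kartik describes concrete, here is a minimal PyTorch sketch. The dataset, model, and checkpoint path are illustrative placeholders, not anything from the episode: the loop re-reads the whole dataset every epoch, and the only write is a small checkpoint at the end.

```python
# A minimal sketch (not from the episode) of the read-heavy, write-light I/O
# pattern of training: stream a large dataset in, write only a small set of
# model weights back out. Dataset, model, and checkpoint path are illustrative.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for many gigabytes of training samples read from shared storage.
dataset = TensorDataset(torch.randn(100_000, 128), torch.randint(0, 10, (100_000,)))
loader = DataLoader(dataset, batch_size=256)

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):                # every epoch re-reads the whole dataset
    for features, labels in loader:   # reads dominate the I/O profile
        optimizer.zero_grad()
        loss = loss_fn(model(features), labels)
        loss.backward()
        optimizer.step()

# The only write: a checkpoint that is tiny relative to everything that was read.
torch.save(model.state_dict(), "model_checkpoint.pt")
```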
My favorite example is they trained a model to try and identify cancerous and benign tumors.
And the model concluded that rulers led to cancerous tumors because all the photos of cancerous tumors were from medical books where
there was a ruler in the picture for skin. Yeah, yeah. That's interesting. There's quite a lot of
things of that nature in AI history and things like that. They're getting better at this,
Howard. I'll say that much. It's not like that. So in a nutshell, there's a lot of data that needs to be processed in one form or another to construct a neural network model.
And, you know, the question is, you know, is that data something that gets read quickly or is it read slowly or is it read in some fashion that it just gets brought into memory and then left there? You know, those are the sorts of questions that lead to some mystification
or mystery with respect to AI workloads and that sort of thing.
Good questions.
I mean, these days, like the talk of the town is, you know,
basically GPU intensive models, commonly known as neural networks
or also called deep learning.
The reading patterns for this are such that the GPUs tend to get saturated usually before the IO subsystem
gets saturated, especially for training. It could be the opposite for inference,
where the IO subsystem is taxed more because by that time the model is already trained.
So the computationally intensive part of that is done. But the one outstanding characteristic
for AI is not only that it reads a lot, and there's a lot of data that's input, but it's also
very random in its nature of reads. So it's random reads that dominate AI environments more than
anything else. In contrast to something like HPC, which would be sequential and large sequential
blocks being read and things of that nature. In the case for AI, deep learning neural networks
processing, it's all random reads? Is that what you're saying?
Yeah. So you're spot on. Traditional high-performance computing was built for a
large amount of data in and even more sometimes written out. And the IO patterns were large block
sequential. AI training is characterized by small to medium block random read IOs.
Reads are 95% plus of the workloads that you see.
And the media that supplies those reads should be able to support random IO.
This is where something like Flash would be of significant benefit,
wouldn't you say, Howard?
Yeah, we certainly think so.
I might say we're a bit biased, but yeah.
Yeah, yeah, yeah.
Exactly, exactly.
Well, you know, it's serious types of IO.
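As an illustration of why those reads come out random, here is a hedged PyTorch-style sketch; the directory of small JPEGs and the mount path are assumptions for illustration only. The loader shuffles sample indices every epoch, so the storage sees a stream of small reads in effectively random order.

```python
# A hedged sketch of why training I/O skews toward small random reads: the
# DataLoader shuffles sample indices every epoch, so each __getitem__ hits a
# different small file in effectively random order. Layout and mount path
# below are assumptions, not anything named in the episode.
from pathlib import Path
from torch.utils.data import Dataset, DataLoader

class ImageFolderBytes(Dataset):
    def __init__(self, root: str):
        self.files = sorted(Path(root).glob("*.jpg"))   # e.g. ~1.3M small files

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # One small read per sample; decode/augment steps omitted for brevity.
        return self.files[idx].read_bytes()

def identity_collate(batch):
    # Keep the list of raw byte strings as-is; no tensor stacking here.
    return batch

loader = DataLoader(
    ImageFolderBytes("/mnt/datasets/imagenet/train"),   # hypothetical mount
    batch_size=256,
    shuffle=True,                      # randomizes which files are read
    num_workers=8,                     # many workers issuing reads in parallel
    collate_fn=identity_collate,
)

for raw_images in loader:              # storage sees mostly small random reads
    pass                               # training step would go here
```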
Yeah, the other question I have is a lot of these models,
when you look at something like the MLPerf, which is ML Commons,
benchmarks and things of that nature, they're throwing lots of GPUs at training.
It's not like it's a one-off training,
one GPU to train a model that sometimes it's on the order of,
God forbid, a thousand GPUs or something like that.
Oh, yes.
In that sort of situation, does that boost the data bandwidth or not?
It depends on the class of the problem which we're looking at.
When you need a large number of GPUs, it typically means that the model is complex and needs lots of computation connected with it. And sometimes the model size itself is so large that it won't fit
in GPU memory and you need to be able to spread it out across multiple GPUs, in many cases,
multiple GPU servers. Extreme examples of this would be some of the large language models
that are so popular in the press these days,
such as GPT-3 or 3.5, which is the backend for ChatGPT.
And those kind of models often require a thousand or more GPUs to train
across a very, very large number of servers to be able to do this.
Something like GPT-3.5 is on the order of 75 billion neural network nodes or parameters?
More like GPT-3 was 175 billion parameters in it. The model itself is in the hundreds of gigabytes in size.
Too big to fit in a single GPU's memory.
You're going to have to spread it out.
And there's so much computation involved
that multiple GPU servers will work in cooperation
and GPUs from one server will communicate to GPUs
from another server just to exchange information
as the processing goes on.
And that's just the training.
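As a rough illustration of "spreading the model out" across GPUs, here is a hedged sketch using PyTorch's FullyShardedDataParallel, which shards parameters, gradients, and optimizer state across ranks. The model is a stand-in, and the torchrun launch is an assumption; the episode does not name any particular framework.

```python
# A minimal sketch (not VAST- or NVIDIA-specific) of sharding a model across
# many GPUs with PyTorch FSDP. Assumes launch via
# `torchrun --nproc_per_node=<gpus> this_script.py`, which sets the usual
# rank/world-size environment variables.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")            # GPUs exchange data over NCCL
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# Stand-in for a model too large to train comfortably on a single GPU.
model = torch.nn.Sequential(*[torch.nn.Linear(4096, 4096) for _ in range(64)]).cuda()

# FSDP shards parameters, gradients, and optimizer state across all ranks,
# gathering shards on demand during the forward and backward passes.
model = FSDP(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 4096, device="cuda")
loss = model(x).sum()
loss.backward()                                     # gradients exchanged between ranks
optimizer.step()
dist.destroy_process_group()
```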
Now, something like that in inferencing would be quite extraordinary, right?
I mean, you might have to.
Inferencing is interesting.
Inferencing, because the model is fully trained, you can actually embed the model in something quite a bit smaller to do inference.
I mean, think, for example, of something like autonomous driving.
You train the model with a ton of HD video and LiDAR data, thousands of GPUs, but the
final model is small enough to fit in an automobile.
It doesn't have any highly specialized hardware in it.
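To sketch that contrast, here is a minimal, hedged example of the inference side: the expensive work already happened during training, so serving is just loading a relatively small checkpoint and running forward passes. The file name and tiny model mirror the illustrative training sketch earlier, not any real deployment.

```python
# A minimal sketch of inference on an already-trained model: load a small
# checkpoint, disable gradients, and run forward passes on unseen inputs.
# Checkpoint name and architecture are illustrative only.
import torch
from torch import nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
model.load_state_dict(torch.load("model_checkpoint.pt", map_location="cpu"))
model.eval()                              # no gradients, no optimizer state

with torch.no_grad():
    sample = torch.randn(1, 128)          # one input the model has never seen
    prediction = model(sample).argmax(dim=1)
    print(prediction.item())
```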
Some of these automobiles are almost like data
centers on wheels anymore, Kartik.
These days? Oh, yeah.
You're showing our age, Ray.
I guess.
But, you know, it's
got, you know, a terabyte of storage and it's sitting there with probably, I don't know, eight or nine CPU chips with multiple cores each or something.
It's crazy.
We both remember when that was a whole data center, but most of our listeners don't.
I do too.
Yes.
I'm old enough to qualify for that.
So one of the things that was sort of striking, which kind of was the, I would say almost the leading idea behind the paper, we were looking at the NVIDIA DGX reference architecture papers.
And so they've got, oh, I don't know, six or seven different storage vendors have produced DGX reference architectures.
And they all have, I'm not sure, I think it's a RetinaNet, no, a ResNet, yeah, a ResNet-50 training run that they show the performance
of their storage with. And across all six of these, it almost looks exactly the same. I mean, whether
you're using one DGX or four DGXs, what, 32, maybe 64 different GPUs or something like that.
The performance of the storage looked exactly the same.
What does that tell you, Howard?
It tells me that there's a lot of compute going on and the DGXs aren't just sucking
data as fast as they can.
So the storage isn't the bottleneck.
Yeah, storage is not the bottleneck in this case, which is kind of interesting when you think about 64 GPUs
sitting there, 40 gig or 80 gig of memory each, you know,
churning on this, on, you know, what, ResNet-50.
I'm not sure that's a very sophisticated image recognition model.
I have no idea what the size of it is, but.
Oh, it's a relatively small one.
You know, it's, it was a very famous one, though.
I mean, this was the initial, what they call ImageNet Challenge.
It consists of about 1.3 million images, roughly 115K each.
So they're pretty small images, which is why, like Howard said,
basically the fact that any vendor you look at can do pretty much the same tells you that that particular benchmark is not IO-bound.
Yeah, yeah, yeah.
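Some rough arithmetic makes the point concrete; the per-GPU training rate below is an assumed figure for illustration, not one quoted in the episode.

```python
# Back-of-the-envelope arithmetic for why the ResNet-50 / ImageNet reference
# benchmarks look storage-agnostic. All numbers are illustrative assumptions;
# per-GPU throughput in particular varies widely by GPU generation.
images = 1_300_000
avg_image_bytes = 115 * 1024                 # ~115 KB per image, per the discussion
dataset_bytes = images * avg_image_bytes
print(f"dataset size            ~ {dataset_bytes / 1e9:.0f} GB")      # ~153 GB

gpus = 64
images_per_sec_per_gpu = 2_500               # assumed ResNet-50 training rate
required_read_bw = gpus * images_per_sec_per_gpu * avg_image_bytes
print(f"required read bandwidth ~ {required_read_bw / 1e9:.1f} GB/s") # ~18.8 GB/s

# Roughly 19 GB/s across 64 GPUs is comfortably within reach of the all-flash
# systems in those reference architectures, so the GPUs, not the storage,
# set the pace.
```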
But there are plenty of other models out there, some of which are IO-bound.
And there are plenty of other protocols out there as well that different systems can utilize with NVIDIA GPUs and things of that nature that boosts
some of these workloads up considerably.
And also remember, we're saying that all of these vendors whose all-flash systems were
fast enough.
Right, right.
Yeah, exactly.
So it's interesting, all the reference architectures that NVIDIA has produced so far for any vendor, all of them happen to be on all-flash systems.
Yeah.
And for the reasons we talked about.
Right, because of the random small reads and things of that nature.
Yeah, yeah, yeah, yeah, yeah. So we talked about a little bit in the paper, a new protocol that's recently come out called, I think, NVIDIA GPU Direct Storage.
Yeah.
And what does that do for this sort of workload and that sort of thing? Yeah, so what NVIDIA noticed was as their GPUs got more and more powerful
and they were put on systems with a moderate amount of CPU,
it was difficult to keep the GPUs busy.
I mean, their appetite is voracious.
Even though they're compute-bound, effectively,
it's still problematic to keep the GPUs busy.
Well, GPUs have more memory bandwidth to their onboard memory than the CPUs do.
Exactly. And so what becomes the problem is, as you RDMA off of multiple 100 gigabit per second Ethernet cards into CPU memory to get to GPU memory, the CPU memory became a bottleneck. And GPUDirect allows the NIC to RDMA directly into GPU memory, bypassing the CPU memory.
So it goes from storage direct to GPU without touching the CPU at all?
Right.
And the CPU sort of coordinates the process, no doubt,
but everything else is moving data from one memory,
storage memory to GPU memory or vice versa, I guess, right?
Yeah, because otherwise, if you look at the data path,
you'd have to move the data from the storage to CPU and system memory and then move it from system memory to GPU memory.
In many cases, that's an unnecessary step if you're not going to actually modify the data in memory.
And you're limited then by the backplane bandwidth between system memory and the GPU subsystem.
Instead of that, why not cut out the middleman? Just cut directly to GPUs.
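For a sense of what that data path looks like from application code, here is a hedged sketch using the RAPIDS kvikio Python bindings to NVIDIA cuFile (GPUDirect Storage). The library choice and the file path are assumptions for illustration; the episode only discusses the protocol.

```python
# A hedged sketch of a GPUDirect Storage read using the RAPIDS `kvikio`
# bindings to NVIDIA cuFile. Library choice and path are illustrative; when
# GDS is available, the read lands in GPU memory without bouncing through
# host (CPU) memory first.
import cupy
import kvikio

# Destination buffer allocated directly in GPU memory.
gpu_buffer = cupy.empty(256 * 1024 * 1024, dtype=cupy.uint8)   # 256 MB

# Hypothetical training shard on a GDS-enabled (e.g. RDMA NFS) mount.
f = kvikio.CuFile("/mnt/vast/train/shard-0001.bin", "r")
bytes_read = f.read(gpu_buffer)        # DMA from storage into GPU memory
f.close()

print(f"read {bytes_read} bytes straight into GPU memory")
```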
So does GPU Direct work with,
obviously it works with Ethernet RDMA
as well as InfiniBand RDMA.
Is it a file services protocol
or is it a block services protocol?
I'm just trying to understand the nature of it.
So under the covers,
you're going to be moving data
from storage to a memory address in the GPUs.
So whatever this is,
it needs to be a direct memory access operation.
Hence, the storage subsystem
has to support either DMA or RDMA.
Right.
Now, at VAST, we implement this with NFS because that's what we do.
We expose everything with NFS and RDMA is the underlying transport for the NFS, as opposed
to what people usually see, which is TCP.
Other vendors have native clients, which are DMA clients, and they can use DMA as well
to get the data into the GPUs.
Right. But in essence, it's basically a very low-level hardware transfer of data from storage into GPU memory.
More than that, GPUDirect is a file protocol. AI uses files, not block.
Yeah, yeah, I was going to say that. Whether that's a parallel file system or a NAS like us.
Yeah, so most of the training data nowadays is all S3 objects or files and things of that nature.
Isn't that the way they're structured?
Yeah, see, the thing is that you need every GPU server in your GPU farm to be able to see the same data.
That means it has to be some shared storage.
And file gives you a very clean way to be able to see all the data.
Right.
No matter how big your GPU farm is.
So file is the preferred exposure.
Yeah.
Either through a parallel file system or through NAS of some kind or objects.
You're right.
Now, for things like prediction or classifications and things of that nature, typically the training
data has got, you know, the data itself, which is an image, and then maybe the actual classification,
which might be objects that are detected in the image and that sort of thing.
Are those spread across lots and lots of directories,
or is it typically one or two directories?
What does that structure look like in a file system?
So it's not so much directories that matters.
It's mount points.
It's a single namespace that matters from an operational standpoint.
Typically, each server should be
able to do one mount, and that one mount should contain all the data. Now that mount may have subdirectories and
folders and stuff like that. That's fine. But the key thing is what you don't want to be
dealing with is the headache of managing multiple mounts on the same client.
So the ability to have large mounts, large namespace,
and keep in mind, some of the data sets could be multiple petabytes in size.
You want to be able to expose all those petabytes without having to do this and deliver the full performance of your storage to that mount.
Right.
Now, some storage has problems with a single mount point
and providing high-performance bandwidth and things of that nature.
But apparently, VAST doesn't have this problem.
What's VAST's secret recipe for being able to sustain high performance
to small numbers of mounts and small numbers of directories?
The basic problem is TCP.
If you mount an NFS mount point over TCP,
that's until just recently one TCP session.
And therefore you have the bandwidth-delay product problem: you can only
have so much data in flight, and that limits the
bandwidth of that mount point to about 2.5 gigabytes per second. So
about two years ago the nconnect mount option made it into most common Linux distributions.
And that lets you specify, as a mount option, nconnect equals four.
And instead of using one TCP session, the NFS client will use four TCP sessions.
And so that boosts the total available bandwidth from 2.5 gigabytes per second to 10 gigabytes per second.
Then we have to attack the latency problem.
So instead of running NFS over TCP,
we run NFS over RDMA,
and that reduces the time it takes to process each request and the effective bandwidth
because the system is spending less time sitting there doing nothing. Finally, we've made our own addition to the NFS client
that allows us to spread the multiple connections that nconnect specifies over multiple source
destination IP address pairs.
So different NICs, different nodes on the scale-out storage system.
And so with the combination of all of those,
we can saturate multiple 100 gigabit per second connections
from the storage to a single client.
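To put rough numbers on that single-session ceiling, here is an illustrative calculation; the window size, round-trip time, and the example mount command in the comments are assumptions chosen to approximate the figures quoted above, not measurements from the episode.

```python
# Illustrative bandwidth-delay product arithmetic for the single-TCP-session
# NFS ceiling described above. Window size and round-trip time are assumed
# values picked to roughly reproduce the ~2.5 GB/s figure from the discussion.
window_bytes = 256 * 1024          # assumed effective window per TCP session
rtt_seconds = 100e-6               # assumed ~100 microsecond round trip

per_session = window_bytes / rtt_seconds
print(f"one TCP session ~ {per_session / 1e9:.1f} GB/s")         # ~2.6 GB/s
print(f"nconnect=4      ~ {4 * per_session / 1e9:.1f} GB/s")     # ~10.5 GB/s

# A hypothetical client-side mount combining the options discussed
# (server export and mount point are made up for illustration):
#   mount -t nfs -o vers=3,nconnect=4,proto=rdma vast-vip:/train /mnt/vast
# proto=rdma attacks the per-request latency, and the multipath client patch
# described above spreads those nconnect sessions across NIC/IP pairs.
```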
Yeah, this is often something that catches people by surprise, right?
Most people say NFS, like Howard said,
straight up NFS over TCP is limited to 2.5 gigabytes a second.
We've been able to benchmark with GPU direct, to be fair,
175 gigabytes a second, single mount point, single client.
What? Over NFS? A single client? What?
Over NFS3.
What, like 64 nodes or what?
That client was a DGX A100.
Correct.
Oh, my God.
The server had 16 compute nodes?
The storage system had 16 compute nodes.
Correct.
But it's still one client accessing one.
Exactly.
That client had 8 NICs, you know, 8 200-gig
NICs. So there's plenty of bandwidth
going in, but we were able to line saturate
the sucker. Huh.
Huh. Huh. That's very
impressive for a storage system these days.
Yeah. I can remember
when, you know, 10 gigabytes per second for
block was good, let alone file.
You're dating us all.
Right.
I know, I know.
Okay, I'll stop there.
You know, it also is revelatory about the old saw that block is fast
and file is slow.
Yeah, yeah.
Block was fast and file was slow when processors were 20 megahertz and you had two cores.
And network bandwidth was 10 megabit.
Right.
If you have enough compute horsepower and you have enough network bandwidth, NAS can be just as fast as you need.
With a few tweaks here and there.
Oh, yeah.
You need to be clever.
Everything's about clever software.
Yeah, and, you know,
the way, I mean,
we contributed all the code
for these enhancements
we made to open source.
Oh, good, good.
Trying to get NFS maintainers
to make it part of the Linux kernel
so the entire community
can benefit from it,
not just us.
Right, right, right.
Well, okay.
This has been great, guys.
Kartik and Howard, is there anything you'd like to say to our listening audience before we close?
I'd just like to point out that AI is just one of the many things you can do with universal storage.
The name is hyperbolic, but just a little.
And that's the vast data storage system, right?
Yep.
Yeah.
Yeah. Yeah.
To wrap up here, Ray, first of all, thank you for having me on the podcast.
We are absolutely committed at VAST to being the best platform for AI and for virtually everything else out there.
The reason is because we feel that we solve some core problems which every serious AI practitioner
is going to face. One of them is performance. We talked about that. The other is scale.
Our contention-free architecture with no east-west traffic or cache coherency allows us to scale
to very large namespaces. And the third is we are the most affordable such system for all flash
due to our unique ability to use low-cost consumer-grade NAND
as the substrate for solid-state.
And lastly, it is easy to use.
These systems are super easy to administer.
When you're talking about this much storage and this much data, ease of use starts to become key.
Operational stability and resilience are second to none in the industry from how we operate.
Okay.
These, I think, are the reasons why we are perfect for AI.
All right.
Well, this has been great.
Karthik and Howard, thanks again for being on our show today.
Always a pleasure, Ray.
You need to, Ray.
Thank you so much. And thanks again to Vast Data for sponsoring this podcast. That's it for now.
Bye, Kartik. Bye, Howard. Take care, Ray. Until next time.
Next time, we will talk to another system storage technology person. Any questions you want us to
ask, please let us know. And if you enjoy our podcast, tell your friends about it. Please
review us on Apple Podcasts, Google Play, and Spotify, as this will help get the word out. Thank you.