Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 07x02: Building an AI Training Data Pipeline with VAST Data
Episode Date: June 10, 2024

Model training seriously stresses data infrastructure, but preparing that data to be used is a much more difficult challenge. This episode of Utilizing Tech features Subramanian Kartik of VAST Data discussing the broad data pipeline with Jeniece Wnorowski of Solidigm and Stephen Foskett. The first step in building an AI model is collecting, organizing, tagging, and transforming data. Yet this data is spread around the organization in databases, data lakes, and unstructured repositories. The challenge of building a data pipeline is familiar to most businesses, since a similar process is required in analytics, business intelligence, observability, and simulation, but generative AI applications have an insatiable appetite for data. These applications also demand extreme levels of storage performance, and only flash SSDs can meet this demand. A side benefit is the improvement in power consumption and cooling versus hard disk drives, and this is especially true as massive SSDs come to market. Ultimately the success of generative AI will drive greater collection and processing of data on the inferencing side, perhaps at the edge, and this will drive AI data infrastructure further.

Hosts:
Stephen Foskett, Organizer of Tech Field Day: https://www.linkedin.com/in/sfoskett/
Jeniece Wnorowski, Datacenter Product Marketing Manager at Solidigm: https://www.linkedin.com/in/jeniecewnorowski/

Guest:
Subramanian Kartik, Ph.D., Global Systems Engineering Lead at VAST Data: https://www.linkedin.com/in/subramanian-kartik-ph-d-1880835/

Follow Utilizing Tech
Website: https://www.UtilizingTech.com/
X/Twitter: https://www.twitter.com/UtilizingTech

Follow Tech Field Day
Website: https://www.TechFieldDay.com
LinkedIn: https://www.LinkedIn.com/company/Tech-Field-Day
X/Twitter: https://www.Twitter.com/TechFieldDay

Tags: #UtilizingTech, #Sponsored, #AIDataInfrastructure, #AI, @SFoskett, @TechFieldDay, @UtilizingTech, @Solidigm
Transcript
Model training seriously stresses data infrastructure, but preparing that data to be used is a much more difficult challenge.
This episode of Utilizing Tech features Subramanian Kartik from VAST Data discussing the broad data pipeline with Jeniece Wnorowski from Solidigm and myself.
The first step in building an AI model is collecting, organizing, tagging, and transforming data.
And once the model is trained, you've got to present that data to be used. That's what we're talking about on this episode
of Utilizing Tech. Welcome to Utilizing Tech, the podcast about emerging technology from Tech Field
Day, part of the Futurum Group. This season is presented by Solidigm and focuses on AI data
infrastructure.
I'm your host, Stephen Foskett, organizer of the Tech Field Day event series.
And joining me today as co-host is Jeniece Wnorowski from Solidigm.
Welcome to the show.
Thank you for having us, Stephen.
So, as you know, we've been planning this whole season out to have lots of great guests from Solidigm partners, from industry
partners, from people who really know quite a lot about the data challenges of AI. But I think that
sometimes people get a little bit wrapped around the axle thinking about feeding GPUs. They think
about training data. They think about high performance. They think that's the only thing
that matters for AI infrastructure. They forget storage. They forget compute. They forget
inferencing. They forget everything else. It's all about the GPU. But in storage too, you know,
you can't train models unless you've got data, right? Lots and lots of data. And everyone knows that, right?
That is the headline every day, all day,
when it comes to AI, loads and loads of data.
And to your earlier point, right?
It's all about the GPU, the compute performance,
but we are really excited today
to have one of our very best partners here on the show.
And he's going to talk a lot about why data is so important all the way through,
not just looking at how do you feed the GPU or the compute,
but the overall data pipeline.
And it doesn't just stop at training, right?
You have to look at the whole thing, the whole kit and caboodle,
all the way through to inferencing and then some.
So we're delighted to have Subramanian Kartik join us from VAST Data.
Welcome, Kartik.
Thank you, Jeniece.
Hi, folks.
Just by way of introducing myself, I'm Kartik.
I work for Vast Data.
I've been working with these guys for four years.
Prior to that, I was at EMC and Dell.
And before that, I used to be a particle physicist.
Strange change of careers over here.
Last two years, I've been obsessed with anything connected with generative AI, as all you guys know.
So super happy to be here and to be able to contribute whatever I can.
So we've talked a lot about the value of data to generative AI, especially when it comes to model training, but also transfer
learning and fine tuning, and of course, actual inferencing and generation. But there's a lot more
to it than that. I mean, it's one of those things, I think, where people don't realize, you know,
when you're going to go to that training, you know, what data are you going to use and where does that data come from? That's a much, much bigger process than just turning it on and saying,
okay, go train, right? Yeah, you're 100% right. People tend to think of generative AI, and AI in general for that matter, as something where you consume data
for the GPU subsystems that you have.
And they take a first look at it and say,
hey, let's look at, say, ChatGPT.
You know, when like GPT-3 was first trained,
that's 175 billion parameter model.
It was trained on a little over 500 gigabytes of data.
We're like, oh, well, I don't really need that much data.
But the training ran for months
and cost like a whole bunch of money.
So it was like, oh, okay, GPUs are important.
No question, GPUs are important.
They need data
and they have a lot of performance requirements
due to checkpointing and other things like that
while the training is going on. But they forget
that there's actually a whole bunch of heavy lifting that happened before
the 500 plus gigabytes of tokens were created.
That was distilled from petabytes and petabytes
of data all on the internet to build these foundation
models, which we all know and love, like Llama and, you know, PaLM and others like that.
What they don't realize is that that data actually fits as part of a pipeline.
And this is one takeaway I'd like to leave for our customers or for anybody
listening is that pipeline encompasses all the way from the raw data to inference.
And every one of these is data intensive as we go along, not just a little bit under the GPUs.
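A minimal sketch of the checkpointing Kartik mentions above, assuming a generic PyTorch model and optimizer and a hypothetical flash-backed mount at /mnt/checkpoints; this is illustrative, not any vendor's actual training code. The point is that each checkpoint serializes the full model and optimizer state, a multi-gigabyte burst write for large models, and that burst is where much of the storage pressure during training comes from.

```python
import torch

def save_checkpoint(model, optimizer, step, path="/mnt/checkpoints"):
    # Serialize full model + optimizer state. For large models this is a
    # burst of many gigabytes written to shared storage, which is why
    # checkpointing, not the input data, drives the write requirement.
    state = {
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }
    torch.save(state, f"{path}/ckpt_{step:08d}.pt")

# Inside a training loop, a checkpoint is typically taken every N steps:
# if step % 1000 == 0:
#     save_checkpoint(model, optimizer, step)
```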
Now, this situation becomes even more complicated when you look at how enterprise is going to
ultimately use it. Because keep in mind, these big models are trained by people who are professional model trainers. Enterprise now wants to incorporate their internal wisdom, their big corpora of data,
into the input stream for training these models. This is a data wrangling problem, which is
enormous. And here you literally have customers with tens, if not hundreds of petabytes of data who are now trying to figure out how do I get a grip on this and how do I actually make these fit in the whole AI pipeline start to end?
So Kartik, on that note, I mean, everyone's kind of looking at the overall data pipeline.
And I think VAST in particular, right, we've worked with you guys for over 10 years, and you've always continued to innovate and do things a little bit different.
So can you tell us how Vast might be looking at, you know, AI with the data pipeline and how it's different from, you know, maybe other types of, you know, platforms in the industry kind of, you know, looking at the overall pipeline?
That's a great question, Jeniece.
Vast is a data platform company, not just a storage company.
Of course, at the heart of what we do is the most scalable multi-protocol storage subsystem on Earth.
We are extremely high performing.
We have all the necessary certifications from NVIDIA: BasePOD, SuperPOD, and now most recently NCP as well.
However, what distinguishes us is not only that we are high performing
and we scale well, but we also expose our data
through other modalities than just file and object protocol.
We also expose ourselves as a table.
So now we are open to analyzing data using other tooling,
such as Spark or Trino or Dremio on top with native tabular
structures within us, which is crucial for the kind of data crunching you need to actually take
the raw data which is currently sitting in large data lakes on Hadoop or Iceberg or MinIO or object
stores or something like that. We want to be able to corral that and to be able to give the transformation platform
to convert these into the things
that can be actually input into a model.
Now, we do this with an exceptional degree of security
and governance and controls across the whole thing.
And probably most importantly,
because we're able to consolidate the data connected to the entire pipeline on a single platform, we eliminate copying and moving of data, we're able to shorten the pipeline, and we extend this to a global
namespace. That alone makes us the perfect fit for the AI world that's emerging right now.
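A minimal sketch of the kind of data-lake-to-training-corpus transformation described here, using PySpark as one example of the tooling mentioned (Spark, Trino, Dremio). The paths, column names, and filtering rules are hypothetical placeholders, not VAST's actual API.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("prep-training-corpus").getOrCreate()

# Raw documents sitting in a data lake (hypothetical object-store path).
raw = spark.read.parquet("s3a://datalake/raw_documents/")

# Basic curation: drop short records and exact duplicates, keep provenance
# columns so the curated set remains governable and traceable.
cleaned = (
    raw.filter(F.length("text") > 200)
       .dropDuplicates(["checksum"])
       .select("doc_id", "source_system", "text")
)

# Write the curated corpus back as a table that training jobs can stream from.
cleaned.write.mode("overwrite").parquet("s3a://curated/training_corpus/")
```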
Yeah, I would imagine that many people listening could kind of get their heads around this. I mean,
think about the data that you have, think about how it's organized
or probably disorganized, and then think about how much work it would take me to locate,
identify, tag, organize, consolidate, you know, basically get all that data ready to even start doing any kind of an AI
model training or retraining or fine tuning. Think about all of those tasks. Think about all of that
data. And throughout it all, of course, it's not just going to be one person. It's going to be an
entire team. It's going to be a disparate team from multiple parts of the organization, maybe even multiple organizations. It's going to have many
different data types, many different data sources. And all of that has to come together
in a nice organized fashion and then be presented to the model. That's incredibly difficult. If
you've ever sat there and watched that little bar grow on your laptop as you're copying a file, well, multiply that by a thousand X when you're talking about enterprise data.
It's incredibly difficult to move this.
And so it's really not overstating it to say that what you just described, Kartik, by having all of this data on a unified platform with a unified namespace, well, that's pretty transformative, wouldn't you say?
Talk to us a little bit about how that really works in customers'
environments when they're preparing data to train a model.
Yeah, so that's a fantastic way to describe the problem. I think you nailed it there, Stephen. In the enterprise, there's multiple levels of problems to solve.
First is all the data which was created over the last 30 years is potentially going to be something that we're going to want to train models with.
This data is in data silos all over the place.
Some of them are in data lakes.
Some of them are in data warehouses. Some
of them are in the cloud, in Snowflake or Databricks or something like that. Some are on-prem,
some are off-prem. Secondly, very few people actually understand what data they have or what
the semantic meaning is of that data to the business processes which are core to them. There is no ontological model which typically exists in these large enterprises,
and people are scrambling to build one.
So you'll hear a lot of talk about things like knowledge graphs and data fabrics
and data meshes and things like that.
These are all basically efforts to get a grip in simple terms on where is the data,
what is the data, and what use is it to me.
Then comes the actual task of culling that data and making it available to be able to train
larger models like this, which will then give rise to business use cases to move forward.
So the philosophy most people are adopting right now is to say,
I need a fundamental transformation of enterprise architecture
to get what we call AI ready.
Now, the silos of information that stuff lies on
is often infrastructure that was built 10 years ago.
It's sitting on 10 gigabit networks.
And there are petabytes and petabytes of data.
I mean, how am I ever going to get a handle on this?
How am I ever going to move it, et cetera?
So part of this transformation,
which we think is going to be the biggest transformation
in infrastructure in history over the next five years or so,
we anticipate people will spend about $2.8 trillion
in this transformation, is to start to
aggregate the data to understand the meaning of the data, and then to prepare it and decide how
I'm going to vector it towards a variety of models, which are then going to be able to transform
how my business operates. At the end of the day, training is important, like we said, but
frankly, money doesn't get made in training. It's a money sink. It's all in the inference.
You got to get it in the hands of the end users, and that's what needs to happen.
And all this needs to be done in a secure, highly governed fashion, as transparency,
copyright, and intellectual property issues all start to become more and more important. So there's a lot of work that the traditional enterprise has to do to get there.
But it all starts by understanding the data and starting to consolidate it in modern infrastructure.
This is where Solidigm comes in.
They've got the best mousetrap over there to provide the underlying storage, which is high
performance and really adds an advantage here.
Yeah, so I have a couple of questions. One: you talked a lot, Kartik, about enterprise, and I just wanted to get your thoughts a little bit
more about HPC. You know, we've worked with you over the years, you know, high performance
computing is, you know, obviously turning more toward what is now AI.
But you don't really go backwards, right?
You don't look at AI and go to HPC.
So just love to get your point on how do your partners in the HPC space tackle this AI data pipeline?
And is it different from enterprise customers?
So AI is just a subset of HPC. I should probably be a little more careful than that. People will
take umbrage at me calling what we're doing right now AI. AI is actually a very broad field.
What we're talking about here specifically is generative AI, which is neural network based. And especially large models and how they operate here.
There's absolutely no reason that anyone should believe
that large language models are the only kind of computation
that's going to go on.
It is shocking that 95% of analytics that goes on
in most enterprises is actually good old-fashioned
structured data.
And it's standard machine
learning algorithms like linear regression, logistic regression, Bayesian forest. That's
the kind of stuff that gets done. Now, when I translate this to the lens of an HPC operator,
and many of them are customers of ours, we are widely deployed across most of the national labs, most of the large HPC centers, et cetera,
they are seeing that the traditional classic HPC cluster,
one that had a lot of compute nodes
and high-performance parallel file systems under it,
is giving way to heavily GPU-accelerated codes as well, right?
Not just GPUs, could be DPUs, could be FPGAs.
Some kind of coprocessor acceleration
is getting more and more ubiquitous in here.
So they too are transforming.
One of the big things they're doing
is they're moving away
from traditional hard drive based technologies
to solid state technologies to be able to do this
because the IO patterns for AI
tend to be more random read dominated. And at this point, it becomes an IOPS game.
And that, unfortunately, becomes a constraint for hard drives because of the mechanical construction they have.
And NAND can do a heck of a lot better job at being able to deliver this.
So they're all looking at high performance, all flash namespaces to be able to deliver the kind of performance they need to handle this
unusual mix of workloads: traditional HPC simulation and MPI jobs with high-throughput,
large-block, sequential read and write patterns, contrasting with
these heavy random IO intensive workloads at the same time.
For the combination of these,
the perfect platform is a completely solid state platform.
That transformation is well underway.
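A toy illustration (not a benchmark) of the workload contrast described above: AI data loaders tend to issue many small reads at random offsets, which turns storage into an IOPS problem, while classic HPC simulation I/O streams large sequential blocks. The file path and sizes below are placeholders.

```python
import os
import random

PATH, BLOCK, COUNT = "/mnt/data/shard.bin", 4096, 10_000  # hypothetical data shard

def random_reads(path=PATH):
    # Many small reads at random offsets: IOPS-bound, where HDDs suffer.
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    for _ in range(COUNT):
        offset = random.randrange(0, size - BLOCK)
        os.pread(fd, BLOCK, offset)
    os.close(fd)

def sequential_reads(path=PATH, chunk=8 * 1024 * 1024):
    # Large streaming reads: bandwidth-bound, the classic HPC pattern.
    with open(path, "rb") as f:
        while f.read(chunk):
            pass
```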
The only hindrance to this is economics.
They say, oh, I like Flash,
but oh my God, it's too expensive.
I can't afford it.
This is where Solidigm comes in and gives us very affordable, dense NAND
with very high capacity.
And some of the secret sauce that we've developed in conjunction with Solidigm has allowed us to increase the endurance of those systems to a point where it's starting to become affordable.
And then the data services to shrink this data, again, bend the cost curve to a point where it becomes something which is, you know, which enterprise can use, which is very, very tenable
for what they want to build at this point.
But also for high-performance computing shops,
they also are moving in exactly the same direction.
Pipelines are the same.
At the end of the day, it's data guys.
It goes through successive processing steps and successive refinements,
and ultimately what emerges is either great science or great business.
Either way, we want to be right in the middle of it.
This is what we do.
In a way, Kartik, there's an analogy here to the GPU question.
One of the things that we talked about all last season on Utilizing Tech
was the fact that when it comes to training,
you have to keep those GPUs fed or you're not making maximum use
of that incredible hardware investment.
It seems to me that the same question is true of solid state.
You're buying solid state storage, not for capacity,
but for a combination of capacity and performance.
And so it's a good idea to make sure
that the way you're using it
is able to leverage that performance. And that means
that you have an intelligent storage engine, you have an intelligent storage approach that's able
to make the best use of this incredible resource that's literally orders of magnitude faster than
what you get on spinning media. And I think that today we've gotten to the point where, and maybe
you can tell me if you agree with this, we've gotten to the point where primary storage really is flash, full stop. It is
flash. And disks have a place, but it's not in primary storage. It's in secondary applications.
It's in data protection. It's in archives. It's in the areas that don't need any kind of performance. Is that what you're seeing in the market?
Yeah, to a large extent, yes.
There's another aspect to that. Yes, people like flash for multiple reasons.
One is it is high performance.
Secondly, the power and form factor are much more manageable at scale as drive sizes become bigger. But probably as important is the,
you know, what I would describe as the dependability
of the actual performance itself.
So it's not just raw performance.
It seems counterintuitive,
but training actually is a GPU bound problem.
You actually do very little IO.
Input data sets and the output models are pretty small. That's not where you get hammered for IO.
Where you get hammered for IO is actually during checkpointing. So absolutely, you need to have
systems that are very high performance. And that's true. But what's crucial about all-flash namespaces, as many, many of our customers have found out, is that you no longer have to worry about whether the right data is at the right place at the right time.
So as you know well, Stephen, we eschew the idea of tiering just for that reason.
We believe that AI and what is emerging, now we're going to go into multimodal models.
Data volumes are in the petabytes.
I just talked to a customer who's going to buy a couple of thousand Blackwell GPUs with 60 petabytes of storage.
That's a lot. That's a hell of a lot.
I can guarantee you that there's some very, very dense content in there that they intend to analyze.
Key thing is the nature of AI says that you cannot predict what you're going to need when. When the GPU subsystems need it, they need to go and find it. And at that point,
that data can't be stuck on a slow object store somewhere else or on another disk-based tier
somewhere else, because that's a buzzkill. It just, you know, all of a sudden you go, whoomp,
my latency went from one millisecond to 200 milliseconds or 800 milliseconds.
You know, what am I going to do?
So the uniformity and the predictability of the workload,
I think, is as important.
Now, the other reason why I think flash is ultimately going to reign in this space
is that Solidigm is introducing bigger and bigger drives.
You're going to be reaching limits as to how much you can do with mechanical drives.
Right now, people are shipping 20 terabyte drives, maybe 24 terabyte drives.
You know, the long-awaited promise of HAMR still has not materialized.
In the meantime, Solidigm is charging along: 60 terabyte drives, 120 terabyte drives.
The footprint, from a capacity, floor space, and power perspective, is going to be significantly lower
in these kinds of environments than what we had in disk-based
environments. I think just that delta alone is going to basically eliminate hard drives.
They're just not power efficient enough, not space efficient enough, and not performant enough.
They're going to get squeezed out on the low end with tape and on the high end with all flash-based
systems. And this is something I've seen in conference
after conference that I've been in,
that that's really what's gonna happen in the end.
Yeah, thank you for saying that, Kartik,
because that was gonna be one of my questions for you was,
what's your opinion on the space and the power consumption?
Obviously, this is a really hot topic.
Can you comment on any of your partners or customers
that you're working with that have really been able to reap the benefits of feeding that GPU and keeping performance up, all the while bringing down the overall heat and cooling?
Yes, we've been humbled and privileged to be the de facto standard for a variety of people in the AI space.
On one side, we have several SuperPOD deployments which are going on with NVIDIA.
As you know, that's their flagship offering from a single tenant perspective.
But we are also the standard for a large number of the new tier of cloud service providers.
I call them AI cloud service providers, AI CSPs.
These differ from the tier one cloud providers
such as AWS and Azure and Google
in the sense that they are not general purpose.
They are built from the ground up with the intense power footprint,
the RDMA networks to be able to do communication
between the GPUs and high-performance storage,
specifically to be able to tackle the challenges of generative AI itself.
So in this space, we have many.
CoreWeave is a great customer of ours.
They're deploying over 100,000 GPUs globally.
We routinely see production jobs running huge, huge training over there.
But equally interesting to us is some of their
customers came to them and said, I'm having data wrangling issues. Can you help me here?
Can you help me actually make sense and do the pre-processing as well? Because you start with
a large amount of data and you're going to condense it down to a small amount. In many, many cases,
that's a very CPU-led workload as opposed to a GPU-led workload.
Again, to your earlier question about HPC, I don't think these are different.
These are actually blending together to form one common infrastructure
where both CPU as well as GPU and other forms of coprocessors
will coexist in the same data pipeline.
So that's also going to very much be there.
Lambda is another huge customer of ours.
They're also very much in the same boat.
They also have a lot of GPUs and their intent is to offer GPUs as a service. So here again,
security and governance become super critical, right? Because if you're in a multi-tenant
environment, guess what? You're going to have to keep complete logical separation of data
and you're going to have to be able to encrypt the data,
encrypt the traffic connected with this
to be able to protect the data within that kind of domain,
to build FedRAMP-capable infrastructures,
IL-5, IL-6 level protection
to build zero-trust architecture.
All of those are things that we excel at,
and that's really where we go much more than anything else.
So these people consume data at scale, and they're going to start consuming data at a much larger scale.
So far, we were looking at text-based LLMs.
There's a whole class of large-scale vision problems.
There's a whole class of large-scale multimodal problems and the huge sucking sound of all the enterprise data
getting filtered into these generative models,
which is going to drive data volumes through the roof.
Along with that will rise the need for stability,
dependability, predictable performance,
scale, security, and governance.
All of those will start to come into the forefront.
And so
we're, like I said, really fortunate. Working with these customers has taught us a lot.
We now understand what the real requirements are for how to actually go to market with this.
And this helps us innovate continuously with ourselves, with our partners. You know,
we've just reached an OEM agreement with Supermicro. You may have heard that.
We've already had a deep relationship with HPE,
with GreenLake for File.
They have different hardware stack,
but it doesn't matter to us.
We are a software offering,
but we want that kind of diversity
to be able to support all kinds of environments.
But the core problem that we want to solve
is the customer's data platform problem.
How do you build something which can take all data, analyze it in any modality you like, be it through a database engine or through PyTorch,
and then ultimately take it through to inference and fundamentally change how they operate their business?
That's a really good point that you make there with the inferencing question as well, because one of the things I think we're going to see
is a proliferation of data on the, well, rubber meets the road side of things.
On the edge type of thing. Yeah, your model's trained,
everything's ready to go. Now let's turn the data fire hose on that end. And that's just going to cause even more data to be collected, to be processed, to be stored, to be acted on.
And it's going to drive up the demands for at the edge, in the cloud, in the data center, everywhere.
And I think that that's natural.
I think all of us can see that because, you know, ChatGPT is lovely, but it's text. Okay, now let's throw images at
that. Let's throw documents at that. Let's throw video, streaming video, multiple cameras. Now what kind of
data requirements do we have? It just grows and grows and grows. And I think that's,
you know, it's good because of course, these are new applications that are hopefully going to be
productive and profitable for people.
But it's also a challenge because we're going to need to be able to handle that kind of data, which, again, you need to have intelligent infrastructure and you need to have high performance and you need to worry about environmental impact and all these things.
I couldn't agree with you more. Inference has its own peculiar set of technical and business challenges,
which are, interestingly enough, not present in training. On the technical side, you know,
training, I can hog a GPU. It's all mine, mine, mine. Inference is not that way. In a multi-tenant
edge inference use case, my edge GPUs or whatever accelerators are doing inference need to have access to multiple models.
I may need to load and unload these models depending on who is exactly doing the inference.
Worse, if I'm doing retrieval augmented generation or what they call RAG,
and people have vector databases which codify the internal things that the company does,
then those again would start to move into the edge.
How do I do this rapidly?
How do I do this in a governed way?
The EU AI Act was passed, as you know, a couple of months ago.
It mandates that any of the data that is used for queries,
especially for high-risk applications, needs to be preserved for eternity.
All the queries, all the keys have to be preserved for eternity.
And these are very difficult technical problems as well as governance problems to solve.
And we believe we have a superior mousetrap in that sense to be able to do this.
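A minimal sketch of the retrieval-augmented generation flow with query auditing that Kartik describes. The in-memory vector index, the chunk identifiers, and the audit-log path are hypothetical; the point is that each query and the keys of the retrieved chunks can be preserved for the kind of retention the EU AI Act discussion implies.

```python
import json
import time
import numpy as np

def retrieve(query_vec, index, chunk_ids, k=4):
    # Cosine similarity against an in-memory vector index (rows = chunk embeddings).
    sims = index @ query_vec / (np.linalg.norm(index, axis=1) * np.linalg.norm(query_vec))
    top = np.argsort(-sims)[:k]
    return [chunk_ids[i] for i in top]

def answer_with_audit(query_text, query_vec, index, chunk_ids, audit_path="rag_audit.jsonl"):
    hits = retrieve(query_vec, index, chunk_ids)
    # Preserve the query and the retrieved keys for later audit / retention.
    with open(audit_path, "a") as f:
        f.write(json.dumps({"ts": time.time(), "query": query_text, "retrieved": hits}) + "\n")
    return hits  # the retrieved chunks would be passed to the model as context
```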
Well, thank you so much, Kartik, for joining us and talking
about this. I think you've really opened up my eyes and those of our audience as well
to the greater question. Because again, it's very easy, to my point at the beginning, it's very easy
to focus on how do I feed training? How do I keep these GPUs active? And forget that the data pipeline starts way before that and continues way after that.
And the volume of data is just absolutely incredible.
So thank you so much for joining us here today.
If people are interested in this, where can they find you?
Where can they continue the conversation?
Yeah, the easiest way, let's start by going to vastdata.com. There's a massive wealth of information there about the customers we work with and how our technology works, including a great white paper. You can also reach out to me directly, Kartik, K-A-R-T-I-K, and I'll be more than happy to respond to you guys.
But anyone from VAST, contact anyone from VAST locally,
and they'll be able to help you as well,
and they'll be able to guide you to me and to some of my colleagues.
That's the easiest way to do it.
So thank you for listening to Utilizing Tech,
focused on AI data infrastructure.
The Utilizing Tech podcast series is available in your favorite podcast application
as well as on YouTube.
If you enjoyed this discussion,
please do give us a rating and a nice review
in your podcast application of choice.
It's always great to hear from you.
This podcast was brought to you by Solidigm
as well as Tech Field Day,
home of IT experts from across the enterprise,
now part of the Futurum Group.
For show notes and more episodes,
head over to our dedicated website,
UtilizingTech.com, or find us on X/Twitter and Mastodon @UtilizingTech.
Thanks for listening. And we will see you next week.