Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 07x02: Building an AI Training Data Pipeline with VAST Data
Episode Date: June 10, 2024

Model training seriously stresses data infrastructure, but preparing that data to be used is a much more difficult challenge. This episode of Utilizing Tech features Subramanian Kartik of VAST Data discussing the broad data pipeline with Jeniece Wnorowski of Solidigm and Stephen Foskett. The first step in building an AI model is collecting, organizing, tagging, and transforming data. Yet this data is spread around the organization in databases, data lakes, and unstructured repositories. The challenge of building a data pipeline is familiar to most businesses, since a similar process is required in analytics, business intelligence, observability, and simulation, but generative AI applications have an insatiable appetite for data. These applications also demand extreme levels of storage performance, and only flash SSDs can meet this demand. A side benefit is the improvement in power consumption and cooling versus hard disk drives, and this is especially true as massive SSDs come to market. Ultimately the success of generative AI will drive greater collection and processing of data on the inferencing side, perhaps at the edge, and this will drive AI data infrastructure further.

Hosts:
Stephen Foskett, Organizer of Tech Field Day: https://www.linkedin.com/in/sfoskett/
Jeniece Wnorowski, Datacenter Product Marketing Manager at Solidigm: https://www.linkedin.com/in/jeniecewnorowski/

Guest:
Subramanian Kartik, Ph.D., Global Systems Engineering Lead at VAST Data: https://www.linkedin.com/in/subramanian-kartik-ph-d-1880835/

Follow Utilizing Tech
Website: https://www.UtilizingTech.com/
X/Twitter: https://www.twitter.com/UtilizingTech

Follow Tech Field Day
Website: https://www.TechFieldDay.com
LinkedIn: https://www.LinkedIn.com/company/Tech-Field-Day
X/Twitter: https://www.Twitter.com/TechFieldDay

Tags: #UtilizingTech, #Sponsored, #AIDataInfrastructure, #AI, @SFoskett, @TechFieldDay, @UtilizingTech, @Solidigm
Transcript
Model training seriously stresses data infrastructure, but preparing that data to be used is a much more difficult challenge.
This episode of Utilizing Tech features Subramanian Kartik from VAST Data discussing the broad data pipeline with Jeniece Wnorowski from Solidigm and myself.
The first step in building an AI model is collecting, organizing, tagging, and transforming data.
And once the model is trained, you've got to present that data to be used. That's what we're talking about on this episode
of Utilizing Tech. Welcome to Utilizing Tech, the podcast about emerging technology from Tech Field
Day, part of the Futurum Group. This season is presented by Solidigm and focuses on AI data
infrastructure.
I'm your host, Stephen Foskett, organizer of the Tech Field Day event series.
And joining me today as co-host is Jeniece Wnorowski from Solidigm.
Welcome to the show.
Thank you for having us, Stephen.
So, as you know, we've been planning this whole season out to have lots of great guests from Solidigm partners, from industry
partners, from people who really know quite a lot about the data challenges of AI. But I think that
sometimes people get a little bit wrapped around the axle thinking about feeding GPUs. They think
about training data. They think about high performance. They think that's the only thing
that matters for AI infrastructure. They forget storage. They forget compute. They forget
inferencing. They forget everything else. It's all about the GPU. But in storage too, you know,
you can't train models unless you've got data, right? Lots and lots of data. And everyone knows that, right?
That is the headline every day, all day,
when it comes to AI, loads and loads of data.
And to your earlier point, right?
It's all about the GPU, the compute performance,
but we are really excited today
to have one of our very best partners here on the show.
And he's going to talk a lot about why data is so important all the way through,
not just looking at how do you feed the GPU or the compute,
but the overall data pipeline.
And it doesn't just stop at training, right?
You have to look at the whole thing, the whole kit and caboodle,
all the way through to inferencing and then some.
So we're delighted to have Subramanian Kartik join us from VAST Data.
Welcome, Kartik.
Thank you, Jeniece.
Hi, folks.
Just by way of introducing myself, I'm Kartik.
I work for Vast Data.
I've been working with these guys for four years.
Prior to that, I was at EMC and Dell.
And before that, I used to be a particle physicist.
Strange change of careers over here.
Last two years, I've been obsessed with anything connected with generative AI, as all you guys know.
So super happy to be here and to be able to contribute whatever I can.
So we've talked a lot about the value of data to generative AI, especially when it comes to model training, but also transfer
learning and fine tuning, and of course, actual inferencing and generation. But there's a lot more
to it than that. I mean, it's one of those things, I think, where people don't realize, you know,
when you're going to go to that training, you know, what data are you going to use and where does that data come from? That's a much, much bigger process than just turning it on and saying,
okay, go train, right? Yeah, you're 100% right. People tend to think of generative AI, and AI in general for that matter, as something where you consume data
for the GPU subsystems that you have.
And they take a first look at it and say,
hey, let's look at, say, ChatGPT.
You know, when like GPT-3 was first trained,
that's 175 billion parameter model.
It was trained on a little over 500 gigabytes of data.
We're like, oh, well, I don't really need that much data.
But the training ran for months
and cost like a whole bunch of money.
So it was like, oh, okay, GPUs are important.
No question, GPUs are important.
They need data
and they have a lot of performance requirements
due to checkpointing and other things like that
while the training is going on. But they forget
that there's actually a whole bunch of heavy lifting that happened before
the 500 plus gigabytes of tokens were created.
That was distilled from petabytes and petabytes
of data all on the internet to build these foundation
models, which we all know and love, like Llama and, you know, PaLM and others like that.
What they don't realize is that that data actually fits as part of a pipeline.
And this is one takeaway I'd like to leave for our customers or for anybody
listening is that pipeline encompasses all the way from the raw data to inference.
And every one of these is data intensive as we go along, not just a little bit under the GPUs.
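A minimal sketch of the checkpointing Kartik mentions above, assuming a generic PyTorch model and optimizer and a hypothetical flash-backed mount at /mnt/checkpoints; this is illustrative, not any vendor's actual training code. The point is that each checkpoint serializes the full model and optimizer state, a multi-gigabyte burst write for large models, and that burst is where much of the storage pressure during training comes from.

```python
import torch

def save_checkpoint(model, optimizer, step, path="/mnt/checkpoints"):
    # Serialize full model + optimizer state. For large models this is a
    # burst of many gigabytes written to shared storage, which is why
    # checkpointing, not the input data, drives the write requirement.
    state = {
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }
    torch.save(state, f"{path}/ckpt_{step:08d}.pt")

# Inside a training loop, a checkpoint is typically taken every N steps:
# if step % 1000 == 0:
#     save_checkpoint(model, optimizer, step)
```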
Now, this situation becomes even more complicated when you look at how enterprise is going to
ultimately use it. Because keep in mind, these big models are trained by people who are professional model trainers. Enterprise now wants to incorporate their internal wisdom, their big corpora of data,
into the input stream for training these models. This is a data wrangling problem, which is
enormous. And here you literally have customers with tens, if not hundreds of petabytes of data who are now trying to figure out how do I get a grip on this and how do I actually make these fit in the whole AI pipeline start to end?
So Kartik, on that note, I mean, everyone's kind of looking at the overall data pipeline.
And I think VAST in particular, right, we've worked with you guys for over 10 years, and you've always continued to innovate and do things a little bit different.
So can you tell us how Vast might be looking at, you know, AI with the data pipeline and how it's different from, you know, maybe other types of, you know, platforms in the industry kind of, you know, looking at the overall pipeline?
That's a great question, Jeniece.
Vast is a data platform company, not just a storage company.
Of course, at the heart of what we do is the most scalable multi-protocol storage subsystem on Earth.
We are extremely high performing.
We have all the necessary certifications from NVIDIA: BasePOD, SuperPOD, and now most recently NCP as well.
However, what distinguishes us is not only that we are high performing
and we scale well, but we also expose our data
through other modalities than just file and object protocol.
We also expose ourselves as a table.
So now we are open to analyzing data using other tooling,
such as Spark or Trino or Dremio on top with native tabular
structures within us, which is crucial for the kind of data crunching you need to actually take
the raw data which is currently sitting in large data lakes on Hadoop or Iceberg or MinIO or object
stores or something like that. We want to be able to corral that and to be able to give the transformation platform
to convert these into the things
that can be actually input into a model.
Now, we do this with an exceptional degree of security
and governance and controls across the whole thing.
And probably most importantly,
because we're able to consolidate the data connected to the entire pipeline on a single platform, we eliminate copying and moving of data, we're able to shorten the pipeline, and we extend this to a global
namespace. That alone makes us the perfect fit for the AI world that's emerging right now.
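A minimal sketch of the kind of data-lake-to-training-corpus transformation described here, using PySpark as one example of the tooling mentioned (Spark, Trino, Dremio). The paths, column names, and filtering rules are hypothetical placeholders, not VAST's actual API.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("prep-training-corpus").getOrCreate()

# Raw documents sitting in a data lake (hypothetical object-store path).
raw = spark.read.parquet("s3a://datalake/raw_documents/")

# Basic curation: drop short records and exact duplicates, keep provenance
# columns so the curated set remains governable and traceable.
cleaned = (
    raw.filter(F.length("text") > 200)
       .dropDuplicates(["checksum"])
       .select("doc_id", "source_system", "text")
)

# Write the curated corpus back as a table that training jobs can stream from.
cleaned.write.mode("overwrite").parquet("s3a://curated/training_corpus/")
```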
Yeah, I would imagine that many people listening could kind of get their heads around this. I mean,
think about the data that you have, think about how it's organized
or probably disorganized, and then think about how much work it would take me to locate,
identify, tag, organize, consolidate, you know, basically get all that data ready to even start doing any kind of an AI
model training or retraining or fine tuning. Think about all of those tasks. Think about all of that
data. And throughout it all, of course, it's not just going to be one person. It's going to be an
entire team. It's going to be a disparate team from multiple parts of the organization, maybe even multiple organizations. It's going to have many
different data types, many different data sources. And all of that has to come together
in a nice organized fashion and then be presented to the model. That's incredibly difficult. If
you've ever sat there and watched that little bar grow on your laptop as you're copying a file, well, multiply that by a thousand X when you're talking about enterprise data.
It's incredibly difficult to move this.
And so it's really not overstating it to say that what you just described, Kartik, by having all of this data on a unified platform with a unified namespace, well, that's pretty transformative, wouldn't you say?
Talk to us a little bit about how that really works in customers'
environments when they're preparing data to train a model.
Yeah, so that's a fantastic way to describe the problem. I think you nailed it there, Stephen. In the enterprise, there's multiple levels of problems to solve.
First is all the data which was created over the last 30 years is potentially going to be something that we're going to want to train models with.
This data is in data silos all over the place.
Some of them are in data lakes.
Some of them are in data warehouses. Some
of them are in the cloud, in Snowflake or Databricks or something like that. Some are on-prem,
some are off-prem. Secondly, very few people actually understand what data they have or what
the semantic meaning is of that data to the business processes which are core to them. There is no ontological model which typically exists in these large enterprises,
and people are scrambling to build one.
So you'll hear a lot of talk about things like knowledge graphs and data fabrics
and data meshes and things like that.
These are all basically efforts to get a grip in simple terms on where is the data,
what is the data, and what use is it to me.
Then comes the actual task of culling that data and making it available to be able to train
larger models like this, which will then give rise to business use cases to move forward.
So the philosophy most people are adopting right now is to say,
I need a fundamental transformation of enterprise architecture
to get what we call AI ready.
Now, the silos of information that stuff lies on
is often infrastructure that was built 10 years ago.
It's sitting on 10 gigabit networks.
And there are petabytes and petabytes of data.
I mean, how am I ever going to get a handle on this?
How am I ever going to move it, et cetera?
So part of this transformation,
which we think is going to be the biggest transformation
in infrastructure in history over the next five years or so,
we anticipate people will spend about $2.8 trillion
in this transformation, is to start to
aggregate the data to understand the meaning of the data, and then to prepare it and decide how
I'm going to vector it towards a variety of models, which are then going to be able to transform
how my business operates. At the end of the day, training is important, like we said, but
frankly, money doesn't get made in training. It's a money sink. It's all in the inference.
You got to get it in the hands of the end users, and that's what needs to happen.
And all this needs to be done in a secure, highly governed fashion, as transparency,
copyright, and intellectual property issues all start to become more and more important. So there's a lot of work that the traditional enterprise has to do to get there.
But it all starts by understanding the data and starting to consolidate it in modern infrastructure.
This is where Solidigm comes in.
They've got the best mousetrap over there to provide the underlying storage, which is high
performance and really adds an advantage here.
Yeah, so I have a couple of questions. One: you talked a lot, Kartik, about enterprise, and I just wanted to get your thoughts a little bit
more about HPC. You know, we've worked with you over the years, you know, high performance
computing is, you know, obviously turning more toward what is now AI.
But you don't really go backwards, right?
You don't look at AI and go to HPC.
So just love to get your point on how do your partners in the HPC space tackle this AI data pipeline?
And is it different from enterprise customers?
So AI is just a subset of HPC. I should probably be a little more careful than that. People will
take umbrage at me calling what we're doing right now AI. AI is actually a very broad field.
What we're talking about here specifically is generative AI, which is neural network based. And especially large models and how they operate here.
There's absolutely no reason that anyone should believe
that large language models are the only kind of computation
that's going to go on.
It is shocking that 95% of analytics that goes on
in most enterprises is actually good old-fashioned
structured data.
And it's standard machine
learning algorithms like linear regression, logistic regression, Bayesian forest. That's
the kind of stuff that gets done. Now, when I translate this to the lens of an HPC operator,
and many of them are customers of ours, we are widely deployed across most of the national labs, most of the large HPC centers, et cetera,
they are seeing that the traditional classic HPC cluster,
one that had a lot of compute nodes
and high-performance parallel file systems under it,
is giving way to heavily GPU-accelerated codes as well, right?
Not just GPUs, could be DPUs, could be FPGAs.
Some kind of coprocessor acceleration
is getting more and more ubiquitous in here.
So they too are transforming.
One of the big things they're doing
is they're moving away
from traditional hard drive based technologies
to solid state technologies to be able to do this
because the IO patterns for AI
tend to be more random read dominated. And at this point, it becomes an IOPS game.
And that, unfortunately, becomes a constraint for hard drives because of the mechanical construction they have.
And NAND can do a heck of a lot better job at being able to deliver this.
So they're all looking at high performance, all flash namespaces to be able to deliver the kind of performance they need to handle this
unusual mix of workloads: traditional HPC simulation and MPI jobs with high-throughput,
large-block, sequential read and write patterns, contrasting with
these heavy random IO intensive workloads at the same time.
For the combination of these,
the perfect platform is a completely solid state platform.
That transformation is well underway.
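A toy illustration (not a benchmark) of the workload contrast described above: AI data loaders tend to issue many small reads at random offsets, which turns storage into an IOPS problem, while classic HPC simulation I/O streams large sequential blocks. The file path and sizes below are placeholders.

```python
import os
import random

PATH, BLOCK, COUNT = "/mnt/data/shard.bin", 4096, 10_000  # hypothetical data shard

def random_reads(path=PATH):
    # Many small reads at random offsets: IOPS-bound, where HDDs suffer.
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    for _ in range(COUNT):
        offset = random.randrange(0, size - BLOCK)
        os.pread(fd, BLOCK, offset)
    os.close(fd)

def sequential_reads(path=PATH, chunk=8 * 1024 * 1024):
    # Large streaming reads: bandwidth-bound, the classic HPC pattern.
    with open(path, "rb") as f:
        while f.read(chunk):
            pass
```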
The only hindrance to this is economics.
They say, oh, I like Flash,
but oh my God, it's too expensive.
I can't afford it.
This is where Solidigm comes in and gives us very affordable, dense NAND
with very high capacity.
And some of the secret sauce that we've developed in conjunction with Solidigm has allowed us to increase the endurance of those systems to a point where it's starting to become affordable.
And then the data services to shrink this data, again, bend the cost curve to a point where it becomes something which is, you know, which enterprise can use, which is very, very tenable
for what they want to build at this point.
But also for high-performance computing shops,
they also are moving in exactly the same direction.
Pipelines are the same.
At the end of the day, it's data guys.
It goes through successive processing steps and successive refinements,
and ultimately what emerges is either great science or great business.
Either way, we want to be right in the middle of it.
This is what we do.
In a way, Kartik, there's an analogy here to the GPU question.
One of the things that we talked about all last season on Utilizing Tech
was the fact that when it comes to training,
you have to keep those GPUs fed or you're not making maximum use
of that incredible hardware investment.
It seems to me that the same question is true of solid state.
You're buying solid state storage, not for capacity,
but for a combination of capacity and performance.
And so it's a good idea to make sure
that the way you're using it
is able to leverage that performance. And that means
that you have an intelligent storage engine, you have an intelligent storage approach that's able
to make the best use of this incredible resource that's literally orders of magnitude faster than
what you get on spinning media. And I think that today we've gotten to the point where, and maybe
you can tell me if you agree with this, we've gotten to the point where primary storage really is flash, full stop. It is
flash. And disks have a place, but it's not in primary storage. It's in secondary applications.
It's in data protection. It's in archives. It's in the areas that don't need any kind of performance. Is that what you're seeing in the market?
Yeah, to a large extent, yes.
There's another aspect to that. Yes, people like flash for multiple reasons.
One is it is high performance.
Secondly, the power and form factor are much more manageable at scale as drive sizes become bigger. But probably as important is the,
you know, what I would describe as the dependability
of the actual performance itself.
So it's not just raw performance.
It seems counterintuitive,
but training actually is a GPU bound problem.
You actually do very little IO.
Input data sets and the output models are pretty small. That's not where you get hammered for IO.
Where you get hammered for IO is actually during checkpointing. So absolutely, you need to have
systems that are very high performance. And that's true. But what's crucial about all-flash namespaces, as many, many of our customers have found out, is that you no longer have to worry about whether the right data is at the right place at the right time.
So as you know well, Stephen, we eschew the idea of tiering just for that reason.
We believe that AI and what is emerging, now we're going to go into multimodal models.
Data volumes are in the petabytes.
I just talked to a customer who's going to buy a couple of thousand Blackwell GPUs with 60 petabytes of storage.
That's a lot. That's a hell of a lot.
I can guarantee you that there's some very, very dense content in there that they intend to analyze.
Key thing is the nature of AI says that you cannot predict what you're going to need when. When the GPU subsystems need it, they need to go and find it. And at that point,
that data can't be stuck on a slow object store somewhere else or on another disk-based tier
somewhere else, because that's a buzzkill. It just, you know, all of a sudden you go, whoomp,
my latency went from one millisecond to 200 milliseconds or 800 milliseconds.
You know, what am I going to do?
So the uniformity and the predictability of the workload,
I think, is as important.
Now, the other reason why I think flash is ultimately going to reign in this space
is that Solidigm is introducing bigger and bigger drives.
You're going to be reaching limits as to how much you can do with mechanical drives.
Right now, people are shipping 20 terabyte drives, maybe 24 terabyte drives.
You know, the long-awaited promise of HAMR still has not materialized.
In the meantime, Solidigm is charging along: 60 terabyte drives, 120 terabyte drives.
The footprint, from a capacity, floor space, and power perspective, is going to be significantly lower
in these kinds of environments than what we had in disk-based
environments. I think just that delta alone is going to basically eliminate hard drives.
They're just not power efficient enough, not space efficient enough, and not performant enough.
They're going to get squeezed out on the low end with tape and on the high end with all flash-based
systems. And this is something I've seen in conference
after conference that I've been in,
that that's really what's gonna happen in the end.
Yeah, thank you for saying that, Kartik,
because that was gonna be one of my questions for you was,
what's your opinion on the space and the power consumption?
Obviously, this is a really hot topic.
Can you comment on any of your partners or customers
that you're working with that have really been able to reap the benefits of feeding that GPU and keeping performance up, all the while bringing down the overall heat and cooling?
Yes, we've been humbled and privileged to be the de facto standard for a variety of people in the AI space.
On one side, we have several SuperPOD deployments which are going on with NVIDIA.
As you know, that's their flagship offering from a single tenant perspective.
But we are also the standard for a large number of the new tier of cloud service providers.
I call them AI cloud service providers, AI CSPs.
These differ from the tier one cloud providers
such as AWS and Azure and Google
in the sense that they are not general purpose.
They are built from the ground up with the intense power footprint,
the RDMA networks to be able to do communication
between the GPUs and high-performance storage,
specifically to be able to tackle the challenges of generative AI itself.
So in this space, we have many.
CoreWeave is a great customer of ours.
They're deploying over 100,000 GPUs globally.
We routinely see production jobs running huge, huge training over there.
But equally interesting to us is some of their
customers came to them and said, I'm having data wrangling issues. Can you help me here?
Can you help me actually make sense and do the pre-processing as well? Because you start with
a large amount of data and you're going to condense it down to a small amount. In many, many cases,
that's a very CPU-led workload as opposed to a GPU-led workload.
Again, to your earlier question about HPC, I don't think these are different.
These are actually blending together to form one common infrastructure
where both CPU as well as GPU and other forms of coprocessors
will coexist in the same data pipeline.
So that's also going to very much be there.
Lambda is another huge customer of ours.
They're also very much in the same boat.
They also have a lot of GPUs and their intent is to offer GPUs as a service. So here again,
security and governance become super critical, right? Because if you're in a multi-tenant
environment, guess what? You're going to have to keep complete logical separation of data
and you're going to have to be able to encrypt the data,
encrypt the traffic connected with this
to be able to protect the data within that kind of domain,
to build FedRAMP-capable infrastructures,
IL-5, IL-6 level protection
to build zero-trust architecture.
All of those are things that we excel at,
and that's really where we go much more than anything else.
So these people consume data at scale, and they're going to start consuming data at a much larger scale.
So far, we were looking at text-based LLMs.
There's a whole class of large-scale vision problems.
There's a whole class of large-scale multimodal problems and the huge sucking sound of all the enterprise data
getting filtered into these generative models,
which is going to drive data volumes through the roof.
Along with that will rise the need for stability,
dependability, predictable performance,
scale, security, and governance.
All of those will start to come into the forefront.
And so
we're, like I said, really fortunate. Working with these customers has taught us a lot.
We now understand what the real requirements are for how to actually go to market with this.
And this helps us innovate continuously with ourselves, with our partners. You know,
we've just reached an OEM agreement with Supermicro. You may have heard that.
We've already had a deep relationship with HPE,
with GreenLake for File.
They have different hardware stack,
but it doesn't matter to us.
We are a software offering,
but we want that kind of diversity
to be able to support all kinds of environments.
But the core problem that we want to solve
is the customer's data platform problem.
How do you build something which can take all data, analyze it in any modality you like, be it through a database engine or through PyTorch,
and then ultimately take it through to inference and fundamentally change how they operate their business?
That's a really good point that you make there with the inferencing question as well, because one of the things I think we're going to see
is a proliferation of data on the, well, rubber meets the road side of things.
On the edge type of thing. Yeah, your model's trained,
everything's ready to go. Now let's turn the data fire hose on that end. And that's just going to cause even more data to be collected, to be processed, to be stored, to be acted on.
And it's going to drive up the demands for at the edge, in the cloud, in the data center, everywhere.
And I think that that's natural.
I think all of us can see that because, you know, ChatGPT is lovely, but it's text. Okay, now let's throw images at
that. Let's throw documents at that. Let's throw video, streaming video, multiple cameras. Now what kind of
data requirements do we have? It just grows and grows and grows. And I think that's,
you know, it's good because of course, these are new applications that are hopefully going to be
productive and profitable for people.
But it's also a challenge because we're going to need to be able to handle that kind of data, which, again, you need to have intelligent infrastructure and you need to have high performance and you need to worry about environmental impact and all these things.
I couldn't agree with you more. Inference has its own peculiar set of technical and business challenges,
which are, interestingly enough, not present in training. On the technical side, you know,
training, I can hog a GPU. It's all mine, mine, mine. Inference is not that way. In a multi-tenant
edge inference use case, my edge GPUs or whatever accelerators are doing inference need to have access to multiple models.
I may need to load and unload these models depending on who is exactly doing the inference.
Worse, if I'm doing retrieval augmented generation or what they call RAG,
and people have vector databases which codify the internal things that the company does,
then those again would start to move into the edge.
How do I do this rapidly?
How do I do this in a governed way?
The EU AI Act was passed, as you know, a couple of months ago.
It mandates that any of the data that is used for queries,
especially for high-risk applications, needs to be preserved for eternity.
All the queries, all the keys have to be preserved for eternity.
And these are very difficult technical problems as well as governance problems to solve.
And we believe we have a superior mousetrap in that sense to be able to do this.
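A minimal sketch of the retrieval-augmented generation flow with query auditing that Kartik describes. The in-memory vector index, the chunk identifiers, and the audit-log path are hypothetical; the point is that each query and the keys of the retrieved chunks can be preserved for the kind of retention the EU AI Act discussion implies.

```python
import json
import time
import numpy as np

def retrieve(query_vec, index, chunk_ids, k=4):
    # Cosine similarity against an in-memory vector index (rows = chunk embeddings).
    sims = index @ query_vec / (np.linalg.norm(index, axis=1) * np.linalg.norm(query_vec))
    top = np.argsort(-sims)[:k]
    return [chunk_ids[i] for i in top]

def answer_with_audit(query_text, query_vec, index, chunk_ids, audit_path="rag_audit.jsonl"):
    hits = retrieve(query_vec, index, chunk_ids)
    # Preserve the query and the retrieved keys for later audit / retention.
    with open(audit_path, "a") as f:
        f.write(json.dumps({"ts": time.time(), "query": query_text, "retrieved": hits}) + "\n")
    return hits  # the retrieved chunks would be passed to the model as context
```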
Well, thank you so much, Kartik, for joining us and talking
about this. I think you've really opened up my eyes and those of our audience as well
to the greater question. Because again, it's very easy, to my point at the beginning, it's very easy
to focus on how do I feed training? How do I keep these GPUs active? And forget that the data pipeline starts way before that and continues way after that.
And the volume of data is just absolutely incredible.
So thank you so much for joining us here today.
If people are interested in this, where can they find you?
Where can they continue the conversation?
Yeah, the easiest way, let's start by going to vastdata.com. There's a massive wealth of information there about the customers we work with and how our technology works, including a great white paper. You can also reach out to me directly, Kartik, K-A-R-T-I-K, and I'll be more than happy to respond to you guys.
But anyone from VAST, contact anyone from VAST locally,
and they'll be able to help you as well,
and they'll be able to guide you to me and to some of my colleagues.
That's the easiest way to do it.
So thank you for listening to Utilizing Tech,
focused on AI data infrastructure.
The Utilizing Tech podcast series is available in your favorite podcast application
as well as on YouTube.
If you enjoyed this discussion,
please do give us a rating and a nice review
in your podcast application of choice.
It's always great to hear from you.
This podcast was brought to you by Solidigm
as well as Tech Field Day,
home of IT experts from across the enterprise,
now part of the Futurum Group.
For show notes and more episodes,
head over to our dedicated website,
UtilizingTech.com, or find us on X/Twitter and Mastodon @UtilizingTech.
Thanks for listening. And we will see you next week.