In The Arena by TechArena - Riding the AI Data Pipeline with VAST Data, a Data Insights Series podcast with Solidigm

Episode Date: April 29, 2024

TechArena host Allyson Klein is joined by Solidigm’s Jeniece Wnorowski as they continue to explore the rapid data innovation fueling today’s computing. In today’s episode, they chat with VAST Data’s Global VP of Systems Engineering, Subramanian Kartik, as he describes how his team has delivered a breakthrough data platform for the AI era.

Transcript
Starting point is 00:00:00 Welcome to the Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Allyson Klein. Now, let's step into the arena. Welcome to Tech Arena. My name is Allyson Klein, and this is a Data Insights podcast. And that means I have my co-host with me, Jeniece Wnorowski. Welcome to the program, Jeniece. How are you doing? Well, thank you, Allyson. I'm doing great. I'm so excited to be here today. Jeniece, you have been traveling all over the world, and I know that you've been talking to a lot of folks about data and the data pipeline. We are in for a fantastic episode today. Tell us who is coming on the program to talk to us. Yeah, I have been traveling a lot, namely
Starting point is 00:01:01 events, and I've just been blown away by the work that our special guest has been doing with the team, and that is VAST Data. So joining us today is Subramanian Kartik, who is the Global Vice President of Systems Engineering for VAST. And welcome to the show, Kartik. Well, thank you, Jeniece. Much appreciated and fantastic to be with you on another podcast over here. This is great. I remember the last recording we did, I thoroughly enjoyed it. So looking forward to our conversation today. Excellent. So Kartik, I am so excited to talk to you about what you've been doing with the VAST team, but why don't we just get started? VAST has been on the program before, but you've been getting incredible traction in the
Starting point is 00:01:51 market for driving the AI data pipeline to new scale and performance. Can you give us a sense of what is shaping this market and how VAST is making progress? Yeah, so this market is explosive, as you know. Since the introduction of ChatGPT, there's been this, shall I say, irrational exuberance in the market around anything connected with generative AI. And yes, it absolutely clearly has enormous promise. It's still in the first innings, in my opinion. So a lot of the activity we are seeing is really partly in the enterprise and partly in the cloud as well.
Starting point is 00:02:39 There's a new breed of cloud service providers who are specialized in running these kind of workloads for generative AI, which usually people think of as large language models, but there are other kinds of generative AI workloads as well. And the large language models require a large number of GPUs with very sophisticated networking and storage connected to them. And therefore, the hyperscalers, there's a shortage of these things. So this market has exploded. We are heavy participants in that explosion. And we are essentially feeding the frenzy, throwing gasoline on the fire over there.
Starting point is 00:03:18 On the flip side is enterprise. Enterprise is very tentative right now. They are still trying to define what they want to do, trying to understand what sort of impact this will have on their business. Are they going to make money? Are they going to save money? Or are they going to stay out of jail?
Starting point is 00:03:37 Hopefully a combination of all these three. And so while their steps are tentative, we think that in the next year, two years or so, this wave has just begun. We've only scratched the surface. The expansion and explosion is on its way at this point. Got it. So with that expansion and explosion, can you tell us, Kartik, a little bit about how Vast does this differently from other storage models?
Starting point is 00:04:08 Sure. For starters, we are not a storage company, as you guys know. We're a data platform company. So, of course, we're known for our highly scalable, highly performant, online at all times, strongly secure platform, which we really love. And then we expose ourselves to file and object protocols. But gone quite a bit beyond that and said, why should data only be looked at from one set of lenses? Why not, you know, structured data as a table?
Starting point is 00:04:41 And so we've introduced functions around the database in our system. This allows for data lake technologies, typically things like Trino and Spark, et cetera, to work natively off of us. We're continuing to move forward and producing concepts like a global namespace. So any data is visible anywhere to any GPU farm. In fact, I just got off a call with a customer showing them how to take a large, long-running training job and move it from one data center to another data center
Starting point is 00:05:15 without losing that data within minutes. And this is a stunt which is very, very difficult to pull off. So probably the most important thing is we look at data as a pipeline rather than something just static. It's a constantly flowing stream of data. And that there are different modalities to be able to analyze the data. Some of them are with GPUs and some of them are with more traditional technologies like CPUs.
Starting point is 00:05:43 We are the only ones who can do the entire pipeline, core to cloud to edge, as well as through all the different types of data. This is what's been the heart of our success in this market. So we're now proudly the standard for a large number of the largest of the cloud service providers in the tier two space. We're sure and we are making a run with the hyperscalers as well. Now, Kartik, there's been a lot of parallels drawn between traditional high performance computing platforms and AI training clusters. While you were talking, I could kind of piece together parts of the answer to this question. But for the audience, how does these systems work similarly?
Starting point is 00:06:31 And then how do you see the differences between what enterprises and large cloud players are doing with their AI training and what traditional HPC technical computing clusters would perform. So both share similarities. Both are forms of high-performance computing, obviously. More traditional HPC environments still rely on a large number of compute elements which are distributed, tens, thousands of nodes, 10,000 nodes, or even more. And they cooperatively work to solve certain types of problems. elements which are distributed, tens, thousands of nodes, 10,000 nodes, or even more. And they cooperatively work to solve certain types of problems. That market has been around for over 20 years and is pretty mature.
Starting point is 00:07:19 And the primary workloads that they did was what we often call HPC simulations. Large amounts of data are ingested, crunched cooperatively between many, many CPUs, and then that produces an output, which is useful. So oil and gas, energy, competition, fluid dynamics, all these are very common workloads there. Those codes were optimized mainly for large datasets and large-scale sequential reads and sequential writes. Large block sequential reads and sequential writes was what really dominates that.
Starting point is 00:07:50 The other form of accelerated computing, which we are in the era that we're in right now, uses other coprocessors such as GPUs and FPGAs and other things like that. And there the workloads are very, very different. They're very read intensive, yes, but random read-intensive. So solid-state is a necessary component of the media that needs to underlie that because these technologies are not able to provide the IO performance that's needed for stuff like this. Their scaling and availability characteristics are also somewhat different as well. These systems have to be highly shared systems and are very highly available. Cannot take an outreach for anything.
Starting point is 00:08:34 Many of the HPC clusters, and we are, by the way, extremely active in high performance peering as well. Some of the largest HPC clusters in the world run on us. But they are usually relatively homogenous sets of workloads that are running in AI. These are strongly multi-tenant environments, have a secure environment, classified environments, et cetera. So they have a different level of requirements
Starting point is 00:08:59 compared to what you would see in an HPC world. But in a nutshell, yeah, need heavy random read, heavy IO characteristics, especially for things like checkpointing. And therefore, it mandates large-scale all-slash systems. So Kartik, with that sophistication, right,
Starting point is 00:09:23 these workloads are pretty complex. Where in your mind are customers in terms of sophistication and implementing your systems? Yeah, as I mentioned, right now the people who are doing the most active work here are the actual model builders themselves. So all of us have heard of models like GPT-3 and GPT-4. These require enormous amounts of data as well as training to get built. Or LAMA for meta, MISTRA, for example. And these are all people who are on the leading edge of research in here. Clearly, there are many private sector people, too, who are doing a lot of work here.
Starting point is 00:10:13 And, you know, large autonomous driving companies, so absolutely, drug discovery companies are doing this as well. Traditional brick-and-mortar enterprise is more tentative. Like I mentioned earlier, they're still identifying use cases. So they tend to start with pre-trained models, which they would get from Hugging Face or something like that, and then expose that to their internal data through things like retrieval, augmented generation, or RAG, and then be able to do inference with something like that. This is going to morph because of a regulatory climate
Starting point is 00:10:50 that's changing extremely rapidly. The European Union has already passed the AI Act that is mandating that certain business sectors and certain types of data, you have to preserve data for a long time. You need reproducibility many months after the fact. So you need to know what data went in, input training, what the outputs are, so they can repudiate anything anyone alleges about them. U.S. as well. We've all recently seen the new law proposed by Adam Schiff in the Senate, which is requiring that everybody declare any copyrighted information
Starting point is 00:11:29 that they may have use for training. This means now it's no longer just a GPU game. It's a governance game. And we're going to have to have compliance archives and controls in place to be able to work with this. We think that over time, people will be training their own models, probably not huge ones, but smaller ones.
Starting point is 00:11:50 We may see the emergence of more specialized AI models dominate over highly general model like ChatGPT as things go on. So despite the fact that it's tentative, we're seeing spending pick up and interest pick up quite a lot in the enterprise. The cloud guys, of course, are just going berserker at this point. They're buying GPUs like they're going out of time, literally tens to hundreds of thousands at a time. Now, I've been following Vast on the tech arena for the last couple of years. And in
Starting point is 00:12:20 fact, you guys are one of my first guests on this platform. You made some really exciting announcements around collaboration with NVIDIA and Supermicro lately. Can you help unpack those and talk a little bit about what these new collaborations with the industry leaders in AI are helping with delivering new capability to your customers? Absolutely. So I've had the privilege of working with NVIDIA now for over four years. All the initial testing we did with GPU direct storage for high-performance RDMA networks, as well as BasePath and SuperPath certification, things that I was deeply involved in all the way through. One of the interesting things about VAST that a lot of people don't realize is
Starting point is 00:13:08 even though we do storage and we're a full data platform company, we're not a hardware company. We're a completely software company. So the hardware stack under us can be very varied. And we've been fortunate. We've been partnering at SolidArm for so long. You guys are anchor suppliers for us for the dense land that we need to make our systems affordable and high-performing.
Starting point is 00:13:35 And there are other form factors as well, which we are exploring. So the Supermicro partnership that we announced, our re-announced, our GTC, is one of those things over here which is, we believe, super important. Prior to this, the shelves that actually held our Dentsland, which were made by our contract manufacturers, tended to be somewhat specialized and this book in the sense that even though they were built out of out of widely available industry components that required special assembly and care with super micro what we looked at was to say can we use a fully totally generic off-the-shelf industry standard server instead to be the foundation for BAST. And this is really what we did over the last few months or so. So we take a server with 12 disk shells, have some storage class memory and some dense NAND, and voila, now you've got a building block for VAST.
Starting point is 00:14:36 We did another thing which is very interesting. As you know, we are very containerized in our architecture. Both our front-end nodes, which handle protocols, as well as the back-end nodes, which handle the media, are all essentially Docker containers. So we decided to co-locate them on the same server that we have the storage shelves on. So we essentially eliminated a whole layer
Starting point is 00:14:58 of server architecture in this mix. And that allows us to have a very highly hyper-converged setup, which has extremely good scale properties. We think this is a fantastic offer for people in the cloud space. It's built for scale. It's built for high performance. It's built for ease. Probably most importantly, it's built also for low form factor
Starting point is 00:15:24 and for low power, which are increasingly critical in this space. Got it. Kartik, you mentioned your work with Solidigm and having worked with Solidigm for a while now. And obviously the foundation of your architecture, a big portion of that foundation is the you know, the data and the media. But can you tell us a little bit more about what type of drives you're using with SolidIne and how those help you? Yeah, we go to SolidIne because you guys make solid, I guess, dense NAND systems.
Starting point is 00:16:02 So we started out with the U.2 form factor QLC technology, which you had introduced, because one of the key design elements in our platform was the goal to forever kill disk drives and go to completely solid-state media. But we knew that the most whiz-bang technology and the work would not be worth it if it costs three times or four times as much. So we had to normalize the cost curve.
Starting point is 00:16:33 So going with DenseLand was a major step forward for us. What really changed the game was we figured out, along with Intel and you guys, how to create a flash translation layer which would allow us to extend the endurance for these drives to well beyond what you would normally expect. We were able to extend the endurance to beyond 10 years. That suddenly catapults Dense NAND into the arena of being viable for enterprise workloads. This was a huge, huge move for us. This let us bend the cost curve significantly, along with some of the other software features we have,
Starting point is 00:17:09 such as large-scale reduction of data. A combination of the two now makes us not just fast, scalable, and online operations and performance, but also affordable, which is a key element of what we do here. We have continued that partnership with zinogenes we moved on from u.2 now we're using google form factors eagerly awaiting other things that we're going to be doing together lots and lots of demand for even more density larger larger drives you know we are we started with 15 terabyte drives. Now we are about to introduce 60 terabyte drives.
Starting point is 00:17:47 They're in heavy demand, though. I got to tell you, everybody's buying them up like they're going out of style, which is good for you guys. So this is excellent. So, yeah, that's what we're working towards. Awesome. Yeah, we really appreciate the collaboration. I can speak on behalf of our team over the years. And we are, like you said, also really excited for the future.
Starting point is 00:18:10 And we started talking about how, you know, traveling all over and I'm seeing VAST everywhere. Your booths are always packed with people interested in your technology. But can you tell us, for this audience here, where can folks go to learn more about your solutions? Fantastic. First place to start is to go to our website, vastdata.com. You'll find there a lot of very interesting material on what kind of industry sectors, what kind of solutions we offer,
Starting point is 00:18:42 ranging all the way from high-performance computing to life sciences to media entertainment, and of course, the ubiquitous AI, which is almost so horizontal. And along some solutions which people are often surprised to find us in, like in the backup and recovery space, where we act as a target, an all-flash target for backup systems. One might ask why, but that's because our restore speeds are like blindingly fast. And in this day of ransomware, that seems to be as much a concern rather than full environment recovery. In fact, it's a concern more than just a single file recovery or a single
Starting point is 00:19:15 directory recovery. So those are all the things you can learn about us from that perspective. You can also learn about us from a data platform perspective. What's the buzz all about when we say we can expose data as a table? What kind of problems can we solve with that? And how do we plug in into and refactor Hadoop environments or other kinds of data lake environments like Spark and Impala and Hive
Starting point is 00:19:41 or the tools that are used over there? All of that stuff also is clear. A deeper architectural understanding of what Vast is, how it operates. We have a fantastic white paper. Easy to find, vast.com slash white paper. Once you go there, it's a long but easy read. And it'll give you a full detailed exposition of what makes us really good. And do not forget to look up all the customer testimonials over here.
Starting point is 00:20:07 We have some marquee customers in every one of these sectors. Many of them have great videos they've recorded, like, you know, the solution briefs, which are associated there, white papers, all of that stuff is public. Next step, of course, contact someone from VAST. You know, if your appetite is not whetted, trust me, we're just waiting to engage with you and we'll be able to provide you one-on-one assistance in anything you like, far deeper, dives, drill downs, design workshops, all of that stuff as we go around the world.
Starting point is 00:20:51 Well, Kartik, thank you so much for taking time out of your day to talk with Janice and I and share your vision for the data pipeline. It was so cool. I've been following VAST and the incredible solutions that you've been delivering to market. So it's a real pleasure having you on the program. Thanks for being here. As always. I'll catch you next time. Thanks for joining the Tech Arena. Subscribe and engage at our website, thetecharena.net. All content is copyright by the Tech Arena. Thank you. you
