Grey Beards on Systems - 171: GreyBeards talk Storage.AI with Dr. J Metz, SNIA Chair and Technical Director, AMD

Episode Date: October 3, 2025

SNIA's Storage.AI community is focused on addressing AI data challenges through industry standards. Dr. J discussed some of the current SNIA work going on to address AI I/O challenges and identified more to come. The standards process is not the only way forward, but it can often help the industry advance. Listen to the podcast to learn more.

Transcript
Starting point is 00:00:00 Hey, everybody. Ray Lucchesi here. Jason Collier here. Welcome to another sponsored episode of the GreyBeards on Storage podcast, the show where we get GreyBeards bloggers together with storage and system vendors to discuss upcoming products, technologies, and trends affecting the data center today. We have with us here today Dr. J Metz, chair and chief executive of SNIA and technical director at AMD. Dr. J has been on the show many times before. He just got back from the SNIA Storage Developer Conference in California. And we also want to hear about SNIA's new Storage.AI initiative.
Starting point is 00:00:46 So Dr. J, why don't you tell us a little bit about yourself, how SDC went last week, and what SNIA's Storage.AI is all about. Yeah. Did you just say I'm a chief executive? Because you're the chair and chief executive of SNIA. Is that not true? Okay, I just misheard that. For some reason, I thought you were, I thought I had taken Lisa Su's place at AMD. I wasn't sure. I was not. No, no, no, no, no. That would probably be an unwise decision. Yeah, so I am the chair of the SNIA board of directors. I've been in that role since 2020 and have been working with the SNIA organization since probably 2013, 2014, something like that. So a little over 10 years. And I'm also a technical director for AMD, and I work with, you know, advanced networking and storage internally. And so I kind of bridge between the two of them.
And yeah, we've started a new project inside of SNIA, and it's called Storage.AI, although the URL was a bit expensive. They want more than 50 grand for the URL. So we said, well, let's go with snia.ai and see if that kind of helps. We'll have to figure that one out. And then, yeah, we had the developer conference last week, and it was incredibly successful, very, very popular. The content was amazing. The enthusiasm, the energy. It was a big deal. It was a lot of fun. Huh.
Starting point is 00:02:23 So they tell me that your keynote on Storage.AI was well received. Well, nobody threw vegetables, so I'll take it. Well, that's good. That's good. Yeah, I mean, I think where people are, it's a developer conference, right? So a lot of people have been asking about how the storage and the data and the memory are all supposed to work together in AI workloads. And once you start peeling back the covers, it's very nuanced and very complicated and not intuitive at all.
And where we currently sit in the AI workload is that the data is not in the same place as the accelerators, like the GPUs and the DPUs and that sort of thing. You have to bring it in over the network to do that. And so how you do that is what Storage.AI is all about. And the developers were very interested in it. So my plan at the keynote was to paint the picture. This is how it should work. This is how we are going to approach this in an open, vendor-neutral, technology-neutral fashion.
Starting point is 00:03:28 We're going to try and solve the workload problem rather than just create a feature. And it was well received, as far as I've been told. So I'm very excited about it. So how would you characterize some of the challenges that storage has in an AI environment? Now, are we talking about training and inferencing both, or just training, or just inferencing, or how does this work? That's a very good point. That's a very good question, because in both of those situations, the workload isn't just one workload. So let's just take training as an example, right?
Starting point is 00:04:01 It's actually many workloads put together. You've got to ingest the data. You've got to pre-process the data. You have to do the pre-modeling, and you have to do forward passes and backward passes along the way. You've got to checkpoint, and so on and so forth. Each and every one of these tasks is a step in the process, its own workload. And when I say its own workload, I mean it's got different characteristics for storage, for data, for data movement, for memory, and so on.
And to say that we've got a "storage problem," if you can imagine the finger quotes, really undersells the significance of the amount of work necessary to get from point A to point B. Are you trying to solve the whole data pipeline here? I mean, this is significant. Not all at once.
Starting point is 00:04:52 So the thing is, let's look at it from an accelerator perspective, a GPU perspective, right? A GPU really only cares that it's got the data, right? It doesn't care how it gets the data. It just needs to work on the data. However, the workload cares how it gets the data. So if you need to get the data and your GPUs are connected via, you know, a very tightly coupled network or a scale-out network, you know,
then depending upon where those GPUs actually are, if you don't have the data there, you have to get the data in there. And when you trace how the data gets in and out of these different systems, you start to realize that there's an awful lot of extra work that has to go on, because it's not indigenous to those accelerator networks, right? So once you start to look at that closely, you start to realize that I don't have to swallow the ocean. If I improve just some of these aspects, right, I don't have to do it all at once. But if I can improve the data-to-memory transfer, if I can improve data movement over a remote memory access protocol, if I can improve the way that the GPUs communicate with the storage devices themselves, I don't have to use them
Starting point is 00:06:04 all together at the same time, but I can increase the efficiency by just, you know, starting off with one and then adding things as I go along. So you don't have to do everything all at once, but you do have to do something, because as we start to get into these really, really large systems, it's non-trivial, right? I mean, we're getting to the point where it used to be that people would say, well, there are only five different companies who are doing this in the first place, right? That's just not quite correct. Yeah. And, you know, for the company that I work for, when we start talking about mid-tier systems, a mid-tier system is between 5,000 GPUs and 100,000 GPUs, right? That's a mid-tier system.
Starting point is 00:06:51 Jesus. Yep. And so then, you know, obviously, in my other life I do the networking stuff. So networking is what a lot of people want to talk about first, which is what we've done with something like Ultra Ethernet. But when it comes to the storage, it's always sort of assumed. It's like the high school physics experiment: assume a frictionless environment. In this case, it's assume the data is there where you need it, when you need it.
And that's an inaccurate assumption, right? Because the data is not there. You have to get it there. And when you start to trace how that works, even a little bit of improvement goes a long way, because the more GPUs you have, the more difficult it is to get the data where it needs to be for the right purpose. It effectively becomes a data tractability problem, right? And you have to do this within a certain time period, right? Because these things break. That's why we do checkpointing: the higher the scale you go, the more of a guarantee that something is going to break. You have to checkpoint to get back to that point, et cetera, et cetera. Yeah, the checkpointing itself is serious stuff as well. You know, it's not trivial itself. So all of this stuff, right, all of this stuff. If I can improve the checkpointing and the checkpoint reloading, just on that alone, if I could do that better, yeah, I've already made everybody's life better.
Starting point is 00:08:21 the interfacing if you're going into an existing you know organization that has you know you know 50 years worth of data on traditional storage like how do you do interfacing with that into this new paradigm as well when you're talking this massive scale as well it's a there's a There's a lot of intriguing characteristics that need to be addressed when you're going through doing these large, large implementations. Absolutely. The other thing I noticed about the Storage.A.I consortium is there's, I don't know what the number is, but like 100 companies that are participating companies, just about every organization in IT is there. Not just the storage guys. I mean, you know, it's more than just storage.
Starting point is 00:09:03 Well, just to help, you know, create the taxonomy a little bit. So Snea is the organization, and we've got about 200 different companies that are involved in Snea. And because of the fact that it's an initiative inside of Snea, a community inside of Snea, everybody's welcome to participate, right? And when we did the announcement, we were very fortunate to have more than a dozen companies in a very short period of time pledged their support for this. And we've got new companies coming in as well to do this. you know, storage companies, hyper-scalers, vendors like my own, and so on.
Starting point is 00:09:37 So it's a broad cross-section of the industry who desperately want to have an open, standardized, neutral ecosystem that can be, you know, applied on over and over on a regular basis. So, yeah, this is a, you know, it's very rewarding to have to see. wide amount of participation across the industry and that's surprised the heck out of me quite frankly I mean most of the storage initiatives in the past have not been that widely uh widely participated maybe I'm wrong about that I'm not an active participant in any of these but this is something you're absolutely right and and a lot of it has to do with perception in a lot of ways some of it has to do with some there's actuals and there's there's a perception both of which are valid and correct but what we've been doing now over the last you know a couple of years is
Starting point is 00:10:36 we've been saying look these are very big problems it's so big in fact that it's kind of silly that any one organization could even begin to think about solving it on its own so what snea's been doing is we've we've really redesigned the way that we interact with our member companies. So we say, you want to join Snea because you like energy efficiency. Great. How about security? Great.
Starting point is 00:11:03 How about AI? Fantastic. You're in. You get to do whatever you want to do, right? Which is a new way of thinking about how these kinds of organizations work. So that gives us the ability to solve a particular series of problems that are workload specific to AI. But we also know that we don't do other things, right? For example, we don't handle things like PCIE.
Starting point is 00:11:27 So we have a relationship with PCI sick. We don't handle a lot of the implementation stuff. So we have a relationship with OCV. We don't handle the networking stuff. So we have a relationship with UEC and I, AAA, and so on, right? And the good news is that maybe 10 years ago or 20 years ago, there was a bit of a fight with regards to the fiefdoms, you know. Well, why are they doing that over there?
Starting point is 00:11:49 They should be doing it over here. Why is it going off in the Linux Foundation? what it could be what we're looking to do here and the the demographics of people has evolved to say, well, I have enough work to do on my own. I'm having a hard enough time trying to find the participants to do this. So why don't I work with these organizations and against them? Right, yeah. Exactly.
Starting point is 00:12:09 And so we've got this kind of multi-tiered approach of inside the groups, inside the organization, inside the industry. So back to storage.AI. So, I mean, you mentioned checkpointing and reloading, which is obviously a significant amount of activity during training, training itself. But the data pipeline, like I said, spans six or seven different work steps. And then you actually get to inferencing and there's a whole different world and inferencing and stuff like that. I mean, how I just don't know where to start. quite frankly and that's the challenge what and what you do is you start with first principles right so what you do is you deal with the nature of the problem of data and data locality
Starting point is 00:13:00 and data movement and data protection those things are you going to be universal those things are going to be uh always top of mind for a lot of people and so what we do is you say okay so how does how does an accelerator not just a GPU but we got TPUs and we've got other types of startups and who are doing their own different architectures, how does an accelerator first get its data? How does it process data? How does it work with data? And so on, right, those are first principles. And then we say, all right, so what are the things that we currently do for moving data from one memory location to another memory location? And does it matter whether it's high bandwidth memory or does it matter if it's, you know, DRAM or does it matter if it's some
Starting point is 00:13:40 sort of tiered search? Well, the answer to that is yes. Right. So what do we do in those regards to help make things more efficient. We use one type of solution versus another to get where we want to go. And as it turns out, you can. So for instance, Snea has a standard called SDXI, which stands for smart data accelerator interface. Effectively, it was a technology driven by improved virtualization memory movement. So let's say you've got, and Jason knows this really, really well because he's he's dealt with these 17 layers of abstraction for orchestration and containers and that kind of stuff so what sdx i does is if you've got these kinds of environments and i've got an application sitting inside a
Starting point is 00:14:23 container since sitting inside of a virtualized environment sitting inside an orchestrated you know abstraction layer and so on and so on sounds like my nas box yeah exactly um the application can call an sdxi function of the hardware and have the hardware do the data movement from one memory region to another in a in a privileged fashion, right? So it was a virtualization element. I could do things like zero out memory. I could do, you know, wiping out data localities,
Starting point is 00:14:50 or I can just, you know, move from one container to a storage virtualization container and so on and so forth. It's a virtualization solution, but it's a data mover. Dr. Jay, it seems a bit low level. I mean, I understand the need for that sort of standard and where it belongs, but, God, the AI problem is so broad. Well, so is that the same problem that has to exist in AI? And the answer is yes.
Starting point is 00:15:16 If I have a GPU that needs to have a shared memory buffer with a CPU, why can't I use that technology to move data from one memory buffer to the other memory buffer? Yeah, I agree. I completely agree. But the GPUs are trying to get the data out of the storage buffer, which is a storage device on a PCI bus or, you know, it's a storage on a fabric. or something like that. Exactly.
Starting point is 00:15:43 So how do you make that better? And by applying the storage, the smart data initiative, smart data mover initiative? The SDXI, yeah. In some cases, yes, in some cases, no. Right? So you use the right tool for the job.
Starting point is 00:16:00 And I certainly don't want to give the impression that there's a one solution for everything because that's not the case. But what we're doing here is, you also have to remember where we are currently. Where we are currently is that all data has to be brought through an external network into a memory buffer that the CPU orchestrates, at which point the CPU is the triage of where the data needs to go, and always has to be involved, the kernel always has to get involved, and then it gets moved inside of a GPU, and that's all before the GPU can actually do the work, right? So all that stuff has to be brought in from an external solution.
Starting point is 00:16:40 However, some people want to do this with local storage, some people want to do this with remote storage, some people want to do this with a mix, and they all do in some sort of proprietary form right now. So if you want to be able to have the same semantics across the board from an accelerator perspective, how do you do that? Well, that's what storage.ai is looking to address. I mean The challenge with I mean those proprietary initiatives exist today I mean our friends
Starting point is 00:17:15 at some of the GPU accelerators have very specific functionality to get data from a storage device on the PCI bus directly to the GPU and back again and almost across the fabric as well I mean, how does something like the Storage.A.I. initiative interface with that or are you, are you intending to try to augment some of that stuff? Or where does it belong, I guess? Well, I guess you have to you have to parse the assumptions that are made, right?
Starting point is 00:17:47 You assume that it's PCIE, but what if it's not, right? You assume that this is a remote storage, what if it's not? Or you assume it's a local storage, what if it's not, right? And so as we start to do this, we start to realize that some of these things are going to have to be more targeted and more specific to solving parts of the workload problems, which are a number. When we say this is, it may be a small part, small in the grand scheme of like a million endpoints, right? So you're still going to be replicating this exact process millions of times during an epoch. And so it's what may look like a really tiny, small piece of the puzzle compounds over time. Yeah.
Starting point is 00:18:28 And that's where we're trying to say. So if we want to have, if we want to do a storage access, right, from an accelerator. And you're right. we've got proprietary solutions for doing this they're so proprietary in fact that they don't even work across different models of the same company's GPUs right you've got different architectures and so if you wanted to do this universally or if you want to create some sort of you know migration pattern over time then uh and not do a forklift upgrade every single time you're going to have to have some sort of of baseline and that's not counting the model that you have to have for no pun intended but the but the
Starting point is 00:19:07 structure that you have to have on the data on the NAND itself. What are the expectations of the data structures? And how do you handle that? Some of these things are going to be protocols, some of them are going to be APIs. Some of them are going to be reference architectures and guidance. But that's the whole point, right? Depending upon where these things are going to fit, you're going to have different velocities of the projects as well. You're going to see a lot of these people coming together to try to make this work. And that's what's storage.AI is supposed to do, give them the audience to have these conversations that then communicate with the technical working groups, whether they be inside of SNIA or outside of
Starting point is 00:19:48 Snea, to get the work done. So this is a checks and balance is kind of an approach. Huh, huh, huh. Yeah, and my challenge to some extent is to try to, there's just so much to get a hold of. I mean, you talk about on the storage.A.I. page, a number. of initiatives that are current and ongoing. There are certainly a couple of initiatives that are there that are yet to be, not funded, but initiated, I guess, is the right word or something like that. And then there's the proprietary stuff that's out there that everybody's trying to work with as well. It doesn't appear to be, I don't know how you incorporate that, I guess, is the game. Well, as a standards organization, we can only incorporate what our members
Starting point is 00:20:36 contribute. And I think one of the things that's kind of critical is that in any technology, you have room for proprietary solutions as well as open ones, right? And that's kind of where that push and pull, that give and take, you know, the accordion kind of collapses and expands. That tension of open and private actually helps drive things forward in a lot of ways. So I think it's probably a really good caveat to say that it is not a panacea. We are not trying to solve all the world's problems, right? And we are not trying to say that everybody should donate all of their, their IP into a standards organization.
Starting point is 00:21:21 That's not what we're saying at all. And it's not going to happen anyway, so it would be foolish to do so. But what we are saying is that when you get back to first principles, there are some things that the industry can agree on that will help make everybody's life better. And I think that that's where that is and where those lines are, given the fact that this is a problem that is, as you point out, really involved. There's a number of places that we can start and add it along the way. And ultimately, I think that that's probably one of the things that's going to make this very successful,
Starting point is 00:21:52 which is it's kind of got a guaranteed future. It seems like it's, you know, and you talk about the midter organizations with 100,000 GPS. use i mean the large guys million million plus yeah it's insane what they're doing yeah and when you look at too what they've got um from you know all the way from from the large down to the small you find that they're very heterogeneous environments especially when it comes to storage as well right so it's like how do you think about that you know the interoperability of all of those components to basically address you know the kind of the the the AI focused uh uh work that they're trying to get done with it
Starting point is 00:22:34 I think the enterprise is struggling to try to understand, you know, the inferencing side of this, which is, you know, give me the models and I'll take them and run with them. But I still have intense I owe requirements for inferencing alone if I'm on to scale it up and such. Yeah, and when you look at it, I mean, really, a lot of the AI is a really, it's a scale-out problem, even from, you know, the inferencing and the, you know, inferencing and training both. And like you said, the enterprises, I think we're going to see a lot of the enterprises focus on inference level workloads, whereas the big guys are going to be, you know, there's going to be like, you know, probably, you know, dozen or two dozen companies that are going to develop models that are going to be, you know, explicitly trained to basically help certain verticalized industries. And I think those verticalized industries are going to take that and use things like the retrieve log minute generation and components like that
Starting point is 00:23:27 when they're, you know, utilizing that to reference, you know, LLMs and things like that. But all of those different storage requirements, even when you think about it, you know, you think about your standard old, you know, singular NFS-level environment, you know, and you're trying to basically feed that into a thousand GPU cluster. Yeah, it's a tell you where your bottleneck's going to be. It's so, you know, there are a lot of intriguing, intriguing problems that they're going to have to be solved with this. And, you know, I think interoperability is going to be a big key to making it move forward. There's also the aspect of, of who wants what, right? So you're very,
Starting point is 00:24:10 you're very right about the fact that, you know, enterprise customers, enterprise vendors, enterprise institutions, they've got a certain level that they're comfortable with. They're usually used to not having to build an engineering team of 50 people to, code the software to run GPUs, right? That's not what they want. And ultimately, there's, that's one level. That's one level of the, uh, the stratosphere that needs to have a level of standardization. They love standardization because of the fact that they, they know what they're going to get to a certain degree. And then at the same time, there's a, there's an even lower level where people want to be able to know exactly how to program the end, the unbolted memory.
Starting point is 00:24:54 They want to know exactly how to program their HBM. They want to I know exactly how to handle the importation of checkloading data in an asymmetrical accelerator environment. All of these things are very low level, but there needs to be a place to have the conversation because different people are going to come up with different ways of doing this. It goes a little bit deep now, but once you start getting into how the performance works, how the migration works, all of these things get down the rabbit hole very quickly. And so as a result, you know, what Snea is allowing us to do is to have both of those
Starting point is 00:25:36 conversations, both from the enterprise level and in the really, really deep, you know, I want to know the nanosecond this bit moves, right, those conversations. And that's the place where that can happen in both sides. And Jason mentioned RAG, which is sort of a standardized database environment, but it's not. You know, it's, it's, it's, uh, it's, uh, it's, uh, it's, it's, uh, it's becoming more more important as, as, as, uh, as, as, as the enterprise starts to realize what it really needs to do with this stuff. Yeah.
Starting point is 00:26:06 Yeah. Yeah. Yeah. Yeah. And well, but that's, you know, that, see, we laugh at that, but, but what, what Jason just said has so many more implications than people realize, right? So, so the thing is that for, for, for every. bit that we need to move into a vector database gets amplified, not twice, not three times,
Starting point is 00:26:30 but we're talking hundreds of times, right? So the amplification problem of all the data that gets moved for every bit that a GPU actually works on is causing a lot of grief, right? And so because of that, and it's all because of the fact that there's these inefficiencies compound over time. When you say application, I think of storage data footprint. print, but you're actually talking data access? Is that how I understand that? Now, I'm talking about the actual data that's created to move, to get a GPU to actually work on a bit, right?
Starting point is 00:27:05 So if I've got, if I've got a bit in a object storage, I can't process an object storage natively inside of a GPU, I've got to convert it. And when I convert it, I create extra bits. And so every extra bit that I create, I have to move. Every extra bit that I move, I create another bit. And so that amplification of writing because of all the different places it has to go and all the copying that needs to be done, I get incredible amounts of amplification that just overwhelms the amount of capacity and storage networking that needs to be done. And with all of those come other implications. I've played around with multiple different vector databases.
Starting point is 00:27:39 And when I actually take a piece of content and effectively loaded into the vector database, it's usually about 10x the size of the original content. Well, that's the embedding that has to go on from a token by token basis. I mean, the embedding is probably, you know, 256 vectors, if not more, you know, per token. So, yeah, I understand how that would play out. So if you don't right size your system, right? Because you're only looking at the models, the weights, the states, that kind of stuff, right? Which, which in of itself is significant because people say, I've got a $5 billion parameter model that I can run on my laptop.
Starting point is 00:28:20 Yes. And, well, if we start moving into like 70 billion and 400 billion and one trillion and two trillion, those are the, and 10 trillion is what we're, let the last number that I've been, you know, kind of talking about. It's, it's not, it's a nonlinear increase in data requirements. And so because of that exponential, the other side, you start getting into effectively an ability where you have so much data necessary to create. It's a lot. ability to iterate over this process, that you can't actually move forward. Don't even get me started on power requirements for all this stuff. Yeah, yeah, and that's the other, you know, elephant in the room. Yeah, yeah, yeah. Well, yeah, yeah, to some extent, the storage, the storage is, is part of that power equation. And obviously, the networking is part of the power equation, but the GPUs are obviously a big part of the power equation.
Starting point is 00:29:14 You know, one thing I noticed that was not part of the company participants. I don't see, I don't see NVIDIA. Maybe I shouldn't mention the vendor specifically, but I mean, they're sort of the elephant in this room from a consumer of data perspective. So obviously every one of our, you know, every one of the participant organizations want to work well in that space with those with those accelerators. And there are other accelerators. Obviously, MD and others have. equal acceleration capabilities in very specific domains. I guess how do you get something like a company that's outside the participant organization
Starting point is 00:30:03 to adopt your standards? I guess that's the question I have. Well, those are two different questions. So the way that standards work is that you have a standard and you get to implement it whether you, if you want to or if you don't want to, you don't have to, right? So, and even if you have a multi-phrased document, you don't have to implement all of a standard. And as it happens, you know, Nvidia is a member of CN. So whether they wish to participate in the storage.a.i or,
Starting point is 00:30:37 or in the community or inside of the technical working groups, they're already, you know, heavily participating in a couple of the groups, specifically at a level yeah yeah yeah well yeah it's that's what that's what uh seea does um now how far they want to go is is like every other member of the of the group they can they can make that determination and since it's relatively new you know nobody has actually put forth any kind of um you know intellectual property at all like they've talked about what they want to do which is great but nobody's actually um had the opportunity to put anything forward because we're still in the kind of the formation stages of the, you know, the organizational elements,
Starting point is 00:31:18 you know, the T's that cross, the eyes that get dotted and stuff. But there's, I think it would be a mistake to assume that, you know, I'm not a good mind reader, so I don't know what they're going to do, but I do know that they did presentations about AI at the, this senior developer conference too. So I'm hopeful that they'll participate, but like anybody else, I can't. We were talking about some of the storage.a.I substandard organization or activities. We mentioned smart data initiative, object drive initiatives in there, NBM programming model is in there. Computational storage is in there.
Starting point is 00:32:03 I mean, how do those, some of that's obviously been going on for quite a while within SNA. And some of it, you know, there are a couple of them that aren't even. don't even have an activity at all going on. I guess how does this all get pulled together under the storage. AI framework, if that's the right term? Yeah, sure. So the way that Sniya is organized is we've got things called communities, and we've got things called technical working groups.
Starting point is 00:32:30 So if you think about the technical working groups as a very specific siloed group that works on the specifications or standards or, you know, requirements, then you can think of the community as an abstraction that goes across these different technical working groups that are thematic, right? So in this particular case, it's an AI theme. We could have a memory theme that touches the hardware, the software, you know, we could have a security theme and energy efficiency theme, that kind of thing. So basically the communities then talk about what needs to be done, what kind of things are going to be useful, how we're going to work with external organizations. And then the technical working groups talk about how, right? How can this be done?
Starting point is 00:33:17 How can this be put into place? Now, there's a couple of legal reasons why this is broken out in these different places, right? So one of the legal rules is that you have to be very specific about how the intellectual property gets, you know, contributed and that sort of thing. So this is kind of a, you know, one of the reasons why people join these organizations like C in the first place is because you get these legal protections. So what the storage. AI itself does is it creates the environment for the overarching community to talk about the AI issues. And then we have the chairs of the individual technical working groups who are members of these communities as well. And we say, hey, look, we want to do this security thing. I'm just
Starting point is 00:33:59 picking something up here. And so the chair of the security group says it's security technical working group. We call them Twigs, technical working group. The chair of the security tweaks says, okay, I'll take that back to my, my technical working group. Of course, obviously people are going to probably be participating from the group as well. And they say, this is what the storage. I think because what we need to do is going to affect what goes on in these other groups. So in storage.ai is a community? Is that how you?
Starting point is 00:34:29 Yes. Oh, I got you. It's a community that talks about requirements. It's a community that talks about the issues and what needs to be done. And then the actual IP work, you were talking about into actual property earlier, that gets done in a twig. Right, right, right, right. So wouldn't it be a good first step to just kind of identify the challenges that AI presents to the data world, I guess? I mean, you mentioned checkpointing, right?
Starting point is 00:35:00 Checkpointing is a big issue. It's a significant issue during training of large models and the larger of the model, the worst it is. is and no more GPUs and buy the worse it is i mean that that that's understandable but these other you know these other aspects of uh of where data lies during various phases of the data pipeline the rag and all the other stuff uh they all have very specific challenges in my mind and so part of this is an education problem yeah which fortunately is one of those areas that seeing it excels at. Yeah, it's not just education.
Starting point is 00:35:38 It's an identification to some extent. It's, it's, it's, it's, it predates education. Education assumes you have an idea of what, what you're trying to do and where you want to go with or what it should look like. We're not even there yet. I mean, obviously storage, storage has been in the world for, for ages. And most of us have lived through scars and scars of, of, of, of, of, of, of, of, uh, years of it, but in an AI environment, the tables have turned.
Starting point is 00:36:09 Well, so to that end, I think what you've done is you put your finger on one of the more higher order communication problems, right? So when up until very recently, we kind of compartmentalized compute networking and storage into very broad categories, but there were kind of, it was kind of an airlock between them, right? And that kind of changed about 15 years ago when people started talking about converged networking and different types of storage networks and, you know, that could, you know, could run on non-traditional networking and that kind of stuff, right? You know, the whole, you know, data center ethernet movement and so on. And then once we started getting into the mid-2010s,
Starting point is 00:36:53 it changed again. Because now all of a sudden, we were looking at non-volatile memory. And and moving away from the limitations that happened from spinning drives. And there's a lot of implications there that we didn't think about when we were creating these protocols, right? So as opposed to a protocol, now we have an interface.
Starting point is 00:37:13 And an interface has a completely different way of influencing technology than a protocol does. And then we started looking at the workloads, right? So we were talking about object, file, block, and then key value was an interesting science experiment for a while. And so, but that's, those are, we're just getting finer and finer granularity as we move along. But generally, if I had a, a host that I wanted to communicate with a storage environment, I was going to choose block, file, or object. And that was it. And we were
Starting point is 00:37:47 pretty much done. And how did you choose? You choose, you chose basic bond on the workload. You chose basically on, you know, am I doing transactions in a database or am I doing large systems that need to be searched? Those were the kinds of criteria for choosing those. In AI, we go further in the granular part. Now we're talking about, you know, moving from really, really low latency memory to slightly low latency memory to low latency, persistence storage and so on and so forth. So as we moved into these different areas, the granularity has gotten finer and finer tuned. And in this particular case, each of these different iterations means that we've got to have
Starting point is 00:38:28 the ability to identify what's going on and then, what the actual impact of that is. That's where we are right now in the stages. And so that's what we were talking about at the SDC with all the different presentations, which is to identify what is the impact of these iterations. Then the next step is to say, well, okay, these are going to cause some problems
Starting point is 00:38:45 because we got some smart people out here are figuring this out. So how do we solve these problems before they become unsolvable? And so that's where we are. We're right now at the point where we're identifying these in educating others about what needs to be done and they need to talk together, right?
Starting point is 00:39:02 My performance people need to talk with my design people. My design people need to talk with my security people and so on and so forth. So that's where we are right now. And then in the future, we'll be able to say, this is what we need to do to solve this problems and the reason why. And so it's a long-term situation. Yeah, but I understand all that, but the solution needs is today.
Starting point is 00:39:28 okay so you're not going to get it i know well uh you get it in a proprietary fashion to some extent it's not a complete solution it's not it's not you know everybody's solution and everybody's you kind of don't i'm sorry i think you kind of don't because because right now in a proprietary solution they realize that there are limits in those proprietary solutions that require coordination amongst a large number of companies, right? And that's what those proprietary solutions are proposing right now is because they can't solve those problems by themselves. And that's what we're doing with storage.ai.
Starting point is 00:40:05 You can't solve these problems by yourself because those problems exist right now. And because of the fact that people are expecting things to change faster than they're used to, that everybody's racing to kind of come up with these basic solutions. But that's where flaws get into play, right? That's where scams get involved. That's where people start losing trust and faith.
Starting point is 00:40:25 because of the fact that these fly-by-night operations with proprietary solutions. And I'm not naming names here. I'm just talking generically. But you get some startup over it. It was like, I got a solution to all these different problems. When in truth, they're just blowing smoke. So you need to have, excuse me, I'll listen to my voice a little bit. You need to have the ability to recognize that some of these problems are going to be easily solved now,
Starting point is 00:40:50 you know, by just having a little bit of work. And then sometimes we need to get much. bigger ones that are going to be better in the long run. But the other question to ask is, what's the alternative? These problems have come up now. They're already needed to be solved right now. They're not going to be solved right now. And unless somebody comes up with some really magic way of doing this, and that's not really a strategy, the only really way that I can think of doing this is to get all these great minds together and start working on the problem together. I no doubt that's the right solution in the long run.
Starting point is 00:41:23 The challenges, the proprietary solutions that exist are coming out are point solutions. They solve a little bit of the puzzle, but they don't solve the big challenges across the board. That's right. I mean, to a large extent, you know, storage has kind of grown up from a proprietary protocol bus associated kind of environment to be more fabric-based. And that sort of came out from a standard's activity as much as anything. But it's something that emerges over time. It takes a long time to get there. And in the end, I mean, everybody's happy that, you know,
Starting point is 00:42:04 the storage fabric exists and it's well known. And Ethan, it exists and is well known. But it's, yeah, they all kind of start from a proprietary perspective and sort of become standardized. over time. I'm not sure what I'm saying, but it seems to me that the proprietary solutions that are existing today are point solutions. And ultimately, you want to create an overarching architecture that adopts some of these, the better ideas than these proprietary solutions and becomes a standard. But the challenges at the time it takes. Well, yes, that is a very
Starting point is 00:42:43 big challenge, but I would like to push back a little bit on a presumption that I heard, and if I was wrong about this, you can go ahead and correct me. But the way I see it is that the progress isn't necessarily a linear march from a proprietary into a standardized format. The way I have seen things and experienced things over time is that it's something of an evident flow, right? That we kind of worry back and forth, that we have a starting point, a proprietary solution that eventually does become standardized. I think S-CON to FICA, the fiber channel is a really good example of that. And then we've got standards that kind of move back into proprietary. And I think some of the AI solutions, UEC is a really good example of that. And so we
Starting point is 00:43:25 kind of move back and forth in this ebb and flow of proprietary and standardize and open and and so on. And over time, people will always use a technology, in a way that it wasn't originally designed. Like I mentioned the SDXI thing earlier. That's a perfect example, right? We're talking about using it for AI, even though it was not developed. But the other part of it too is that you're,
Starting point is 00:43:53 and this is why the storage data is so important. Not all AI workloads, even for training, even for inference are created equal. I can have an LLM, we talk about LLMs all the time, but I have a different pattern if I'm talking about a CNN or a GN. I have a different pattern depending upon you know, what the nature of those models are.
Starting point is 00:44:17 The RAG stuff is a really good example. How we incorporate these data back into a model? How do we iterate over the models themselves? How precise are we looking to get, right? And then we have the data flow of communication, right? Are we talking about all to all, all reduced? Are we talking about, you know, a client server peer to peer? How do all these things affect the way that the workload gets accomplished?
Starting point is 00:44:42 And so then we have to figure out what happens when things go wrong, right? What do we do when problems come up and they're going to come up more and more and more frequently, right? How do we handle those kinds of things? So all of these different issues are going to resist a one-size-fits-all architecture or reference design or specification or standard. And that's why I'm constantly going back to the first principles concept. I need to move data, I need to protect data, I need to give you back the correct data bit that you asked for when you asked for it. That's a first principle. And I need to make sure that I am consistent with the way that it works between different processor types, as well as different storage types, as well as different networking types.
Starting point is 00:45:29 That's what we're looking to accomplish, not necessarily reinvent, you know, Claude or GROC or, you know, chat GPT or anything like that. That's not what we're doing. Well, yeah, I mean, that's a different challenge, quite frankly. I mean, our challenge is to try to get the, you know, like you say, get the data bit to the right place, the right time in the fashion that it wants it and move on from there. I mean, I do that, you know, gazillion times as a second as much as needs to be done. But yeah, yeah. Yeah. Well, I mean, so to your point, one of the things that I hear a lot is,
Starting point is 00:46:08 this kind of generic well somebody needs to do x or somebody needs to do y right someone should think of doing z right uh or for our friends in canada and england z so once once you start to realize that that there are a very very very few companies or organizations that can do quote unquote something you start to realize that if you have the ability to do something this is what it is right so for the sake of doing something snia is is organizing the work that currently exists leveraging the stuff that currently exists even if it wasn't originally designed for this particular problem and then repurposing it for the workload because it's appropriate as well as adding in new things creating the different environments with other organizations to do it so that's the something
Starting point is 00:46:57 and we are the they we are people who are the they that are doing something to make people's lives better. And if people don't like it, they don't want to do that. And I'm not saying you're saying that. If they don't like that approach, they are perfectly free to do a proprietary solution or create a consortium to do it on their own. That's nothing wrong with that. We just happen to think that this approach, given the experience of the people involved, given the, you know, the brain power, which is way beyond my own, you know, believe that this is the way to do things. and the organization that is already structured legally to be able to handle those things. So this is the day, we are the day, and this is the something.
Starting point is 00:47:38 Yeah, yeah, yeah. All right, well, I think we're going to have to leave it there. Jason, any last questions for Dr. Jay before we calls? No, good discussion. It's really going to be interesting to see what plays out in storage.ai. There's a ton of interest in it, and it's going to be a really, really fun community to follow. wide participation. Dr. Jay, is there anything like to say
Starting point is 00:48:03 to our list in the audience before we close? Well, I think that if the audience is really interested in some of the things that I've talked about, the Senea Developer.org site is where you can go find these presentations, these brilliant presentations that, as you can tell, I did not do. But we also have Snea video on the YouTube channel so you can see past presentations about all those things
Starting point is 00:48:28 that these things are. X-I has been around for a while, so you can learn more about that stuff. Obviously, Snea.aI is the place where people can go to find out more about this particular initiative. And if they don't join, sorry, if they don't already belong, they can join. And we highly encourage as much participation as possible in this. So please do take a look at it. And I can be reached at, you know, on LinkedIn or to the Twitters at Dr.J. and ETZ. So that's me. All right. This has been great, Dr. Jay. Thank you very much for being on
Starting point is 00:49:06 our show today. I'm very happy to be here. Thank you so much. And that's it for now. Bye, Dr. Jay. Bye, Jason. See you, Ray. Until next time. Next time, we will talk to less system storage technology person. Any questions you want us to ask, please let us know. And if you enjoy our podcast, tell your friends about it. Please review us on Apple Podcasts. Google Play and Spotify, as this will help get the word out.
