Grey Beards on Systems - 175: GreyBeards talk Accelerated Object with SNIA TWG CoChairs, Jason Goldschmidt, DELL Distinguished Eng. & Nick Connolly, ARM Principal Eng.

Starting point is 00:00:00 Hey, everybody, Ray Lucchesi here with Keith Townsend. Welcome to another sponsored episode of the GreyBeards on Storage podcast, a show where we get GreyBeards bloggers together with storage and system vendors to discuss upcoming products, technologies, and trends affecting the data center today. We have with us here today Nick Connolly, Arm Principal Software Engineer and SNIA Accelerated Object I/O co-chair, and Jason Goldschmidt, Dell Distinguished Engineer and SNIA Accelerated Object I/O co-chair. I was at the SNIA Storage AI conference late last month, and Jason,

Starting point is 00:00:40 and another presented on their work developing accelerated object standards. So Nick and Jason, why don't you tell us a little bit about yourselves and what's up at SNIA and its accelerated object working group? Sure thing. Hi, Ray and Keith. My name's Jason Goldschmidt. I work at Dell Technologies. My title is Distinguished Engineer.

Starting point is 00:01:03 I have been working in the storage industry for 20-plus years. A lot of my time has been focused on network and storage protocols, as well as cloud storage, and now, as the hot focus for everyone, storage for AI. I became involved in SNIA this past fall when I presented at the SNIA Developer Conference in San Jose about some emerging work in RDMA-accelerated, S3-compatible object storage. And as a result of that presentation, there was a good amount of interest,

Starting point is 00:01:49 and a number of SNIA member companies came together and decided to form a technical working group to look at this particular area of how to accelerate object I/O in the age of AI workloads. And Nick? Hi, I'm Nick. I've been involved with storage for years. And it's fascinating to see the changes that are happening due to AI workloads. I remember when I started in storage, if you got 5 megabytes a second out of a disk, you were quite happy, you were doing good. And then if you've got a RAID controller, maybe you got 45 megabytes a second, you know, that's good.

Starting point is 00:02:33 But now you're talking 45 gigabytes a second, and that's on a 400 gig network connection. So what happens when we go to 1.6? It's a completely different world. So I'm really happy to be involved with accelerating object storage, because I think there's lots of things changing with storage, and to be able to use RDMA and technologies like that to deliver data faster to the workload is really an important area. So I got a couple of questions to start off. But first I wanted to mention some statistics. I did some research on AWS S3 object storage. They have hundreds of exabytes of objects, roughly 500 trillion objects. If I do my math correctly, that's about 200

Starting point is 00:03:22 megabytes per object, which is a pretty sizable file. And they're doing, gosh, 200 million object requests a second. And roughly a petabyte a second across 123 availability zones. So that's a lot of data and it's a lot of bandwidth and it's a lot of parallelism. Where do accelerated objects fit in the AI workload

Starting point is 00:03:54 these days? I guess I can talk about training and inferencing and that sort of thing. I don't know who wants to start on that, but I thought the stats were kind of impressive. Yeah, the scale of object storage, especially S3 is quite impressive.
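As a rough check on the object-size estimate quoted above, the average size is simply total bytes divided by object count. The sketch below assumes "hundreds of exabytes" means somewhere between 100 and 500 EB in decimal units; those bounds are an assumption for illustration, not figures from the episode. Under that assumption the average comes out in the range of hundreds of kilobytes to about a megabyte per object, which is well below the 200 megabytes mentioned in conversation.

```python
EXABYTE = 10**18  # decimal exabyte, in bytes

# "roughly 500 trillion objects", as quoted in the episode
OBJECT_COUNT = 500 * 10**12


def average_object_size(total_bytes: int, object_count: int) -> float:
    """Mean object size in bytes."""
    return total_bytes / object_count


# Assumed lower and upper bounds for "hundreds of exabytes" (hypothetical)
for total_eb in (100, 500):
    size_bytes = average_object_size(total_eb * EXABYTE, OBJECT_COUNT)
    print(f"{total_eb} EB total -> {size_bytes / 1e3:,.0f} KB per object on average")
```

An average says nothing about the distribution: a store like this can hold many tiny objects alongside very large ones, and the large ones are where the transfer-cost discussion later in the episode applies.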

Starting point is 00:04:11 And it really highlights how folks have thought about object storage for a long time, that it's a place to put lots and lots of unstructured data. And that's how it's been used. So in many cases, it's been viewed as backup or scratch space or logging, these non-performant requirement workloads that S3 provided a very simple interface, very simple provisioning, low metadata, large object count methods to developers and data scientists

Starting point is 00:04:53 and whomever had the requirement for storing and retrieving data. With AI, the use of object storage has really changed. Oh, yeah? Right. So all of the interest in why someone might use object instead of file, right, still persists. S3-compatible object storage is easy to use, right? API-based.

Starting point is 00:05:26 It can be provisioned from a client. It can be embedded in software. It allows the user of the storage to be somewhat divorced from the administrative process. Files typically involve mount points, involve ahead-of-time provisioning. With object, we think about, oh, I'm going to create a bucket. I'm going to start creating objects. I'm going to read back those objects when I need them. I might do some lifecycle management of those objects.

Starting point is 00:06:00 And all that can be done with code or CLI at the user level, not involving admins whatsoever. Right. It doesn't involve kernel drivers. It doesn't involve, you know, administrative access on the system that is operating on that storage. So these are the motivators of why folks really prefer object over file, right? The reduced metadata that's attached to it and just simply the scale of the number of objects versus, say, the number of files that you might have in a directory. Yeah.

Starting point is 00:06:39 That continues. Right. My problem, right? Sure. Exactly. So that continues to be true in the AI world, right? Where a tremendous amount of data is being produced, is being ingested, right, is being used for the AI workloads across training and inference, right?

Starting point is 00:07:05 You have RAG. All of these things are producing large amounts of data and ingesting large amounts of data. What changes with AI is the massive requirement for performance, for low latency, for high throughput, because the GPUs that do the work are very expensive. And having an idle GPU costs money. So. But, you know, from my perspective, object storage has a long, long, long, long heritage.

Starting point is 00:07:48 And it's been looking for a killer app forever, in my mind. I thought data lakes was the answer to that, but then AI emerged. And it was almost a perfect storm, as one vendor said it, for data. The data explosion has just gone bonkers in my mind. Excuse my technical terms. So, I mean, in a training environment, you know, a lot of this data used to be or was or is being staged on local storage and then being ingested by GPUs in order to speed up, you know,

Starting point is 00:08:25 GPU utilization. Where does something like accelerated objects fit in, you know, supplying that sort of data in a different form or a different fashion, I guess? I think that's interesting. I've heard two approaches on this for the staging of the data. One is that, yes, you take your data, transform it, stage it, and then use it for training. The counter argument is that actually requires more storage because you've essentially duplicated your data and there is another model which is streaming in but in a

Starting point is 00:09:03 training context you're also creating checkpoints and logs. On a very large training environment, your chance of any node going down is high. You want to be able to preserve those as logs and checkpoints, and it seems that S3 object is the preferred route for those. Yeah, so, yeah, to some extent, the checkpointing is a function of how many nodes. If you've got 10 nodes, you could probably checkpoint, you know, once every shift or something like that.

Starting point is 00:09:39 If you've got a thousand nodes, it's probably once every 20 minutes or 10 minutes or something like that, because chances are something is going down. So are checkpoints, you know, first question, are they serialized? Is all the GPU activity pretty much frozen while a checkpoint is written out? I mean, are you aware of that?

Starting point is 00:09:57 Again, there seem to be multiple approaches. There's the, take the checkpoint straight from the GPU RAM out to storage. There's the other approach of freeze everything, copy it to RAM and then put it back to storage. More recently, there are things like copying the data to other nodes within the cluster, so that if you lose a node, you can then retrieve the data from the other node. But you're still going to want to take a checkpoint for archival purposes or if you need to roll the model back. So there's still a need to put those out to more persistent storage. Yeah, yeah, yeah, yeah, yeah.
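The earlier point, that checkpoint frequency depends on how many nodes are in the cluster, can be sketched with Young's approximation for the checkpoint interval. Nobody in the conversation cites that formula; it is brought in here purely as an illustration. It assumes independent node failures, so that cluster mean time between failures (MTBF) is the per-node MTBF divided by the node count, and it estimates the interval as the square root of twice the checkpoint write time multiplied by the cluster MTBF. The per-node MTBF and the checkpoint write time below are made-up numbers.

```python
import math


def cluster_mtbf_seconds(node_mtbf_s: float, nodes: int) -> float:
    """Cluster MTBF, assuming independent, identically distributed node failures."""
    return node_mtbf_s / nodes


def young_interval_seconds(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's approximation: interval ~ sqrt(2 * checkpoint cost * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


# Hypothetical inputs, chosen only to show the scaling
NODE_MTBF_S = 50_000 * 3600.0  # assumed per-node MTBF of 50,000 hours
CHECKPOINT_COST_S = 60.0       # assumed time to write one checkpoint

for nodes in (10, 1000):
    mtbf = cluster_mtbf_seconds(NODE_MTBF_S, nodes)
    interval = young_interval_seconds(CHECKPOINT_COST_S, mtbf)
    print(f"{nodes:>5} nodes: checkpoint roughly every {interval / 60:.0f} minutes")
```

Under this model, 100 times more nodes shortens the interval by a factor of 10, and a faster checkpoint write (a smaller `CHECKPOINT_COST_S`) also makes more frequent checkpoints affordable, which is where storage throughput enters the picture.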

Starting point is 00:10:44 Go ahead. It would be helpful to kind of understand, from a system perspective, the importance of moving the pipeline, whether we're talking about getting data off disk, or getting it, regardless of, you know, kind of what protocol it's stored in, into the GPU, and the importance of RDMA in that and reducing the overall latency. And I think that's the first goal, right,

Starting point is 00:11:17 is to reduce the latency. Because I think some people are thinking, you know what, even if I could get the speeds of S3 up to traditional block-level storage, there's the problem of traversing the OS stack and getting to that. So can we talk to kind of the levels of the challenges and where this fits into solving some of those latency challenges? Yeah, absolutely. So some of the motivators around RDMA use within AI, and what we're seeing, is deployment models for AI where it's RDMA everywhere, built into the full system deployment,

Starting point is 00:12:11 multiple nodes, fast interconnects inside and outside of AI servers. The use of RDMA really has two benefits. One is, of course, the zero-copy, memory-to-memory nature of RDMA. And so that is going to reduce latency because it's going to reduce those data copies that are associated with going through the OS stack or going from CPU-based memory into GPU memory, right, that host-to-device copy can add latency. But additionally, what RDMA brings is not just reduced latency, but reduced resource utilization. So it's two pieces of resource. One is that host CPU resource that's being involved in the data movement when RDMA is not in place,

Starting point is 00:13:11 and the other is the memory bandwidth contention between the CPU and the GPU or the device and the GPU, which needs to be part of the consideration. Additionally, GPUs need to perform work during those data copies or data offloading for the host-to-device and device-to-host transfers. By being able to move data directly into GPU memory in a zero-copy manner, it reduces that GPU utilization. It keeps that host CPU utilization very low while performing the whole operation with lower latency. So this has been well observed in networking, right? So GPU-to-GPU networking, node-to-node networking.

Starting point is 00:14:11 The way to, you know, you get an instantaneous boost when you switch from using TCP/IP to using RDMA, and the magic happens of bringing up GPU utilization. So extremely important in training. What about in, you know, I think maybe a year, year and a half ago, I remember having the discussions about the impact of storage performance on inference. And people were pretty dismissive of it. Up until, I think, we've started to realize that this really matters, and for agentic overall. So can you talk to kind of the architectural considerations for when you're training versus when you're doing inference? And specifically, I think the interest has come when it comes to agentic, when we have to do tool calls. So I had a fun stat this week that in the past year, inferencing workloads have grown by 320 times from where they were last year.

Starting point is 00:15:25 And the growth of inference deployments and inference workloads is tremendous, right? Moving very quickly and outpacing training workloads. Yeah, and I call that industrialization of AI. It's gone beyond the hype cycle now. It's starting to actually be used. Inference is about business value. Yeah. Right.

Starting point is 00:15:52 And so that's why we're seeing so much of it, right? It's, how do you take your trained data and then actually turn it into business value? That's what inference does. So inferencing and inference acceleration have really gone hand in hand with storage, with storage technology. It's based on this idea that GPU memory is limited. GPU memory has a small space for storing the computed KV values, the KV cache. Once that space is exhausted, it means that values in that cache have to be evicted.

Starting point is 00:16:38 If they're not found in that cache, it means the GPU needs to recompute data that it's already done. So if we can offload that data to some other form of storage, we get an instant boost if the transfer time from that storage to GPU memory is quick enough to provide a value over recomputing those KV values. And this is where external storage, like S3 or file or block, can come in to provide massive amounts of KV data when you're talking about very, very large inference environments where having local disk or local memory is insufficient to store all of that data for the need of how many GPUs and how many users are deployed within that environment performing inference operations. Yeah, the challenge, and I understand all that, and the KV cache offload, as you get more and more and more activity on a particular GPU, more and more concurrent threads or more and more tokens in context, the KV cache grows and can grow considerably.
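The trade-off described above, fetching offloaded KV values versus recomputing them, can be sketched as a simple comparison of two time estimates. The per-token size uses the common accounting of one key and one value vector per layer per KV head. The model shape, the prefill rate, and the latency figure are all hypothetical; only the 45 GB/s bandwidth echoes a number mentioned earlier in the episode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical transformer dimensions, for illustration only."""
    layers: int
    kv_heads: int
    head_dim: int
    bytes_per_value: int  # 2 for fp16/bf16


def kv_bytes_per_token(m: ModelShape) -> int:
    # One key vector and one value vector per layer, per KV head
    return 2 * m.layers * m.kv_heads * m.head_dim * m.bytes_per_value


def fetch_seconds(num_bytes: int, bandwidth_bytes_per_s: float, latency_s: float) -> float:
    """Time to pull cached KV data back from external storage."""
    return latency_s + num_bytes / bandwidth_bytes_per_s


def recompute_seconds(tokens: int, prefill_tokens_per_s: float) -> float:
    """Time for the GPU to rebuild the KV values from the original tokens."""
    return tokens / prefill_tokens_per_s


def should_fetch(tokens: int, m: ModelShape, bandwidth_bytes_per_s: float,
                 latency_s: float, prefill_tokens_per_s: float) -> bool:
    size = tokens * kv_bytes_per_token(m)
    return (fetch_seconds(size, bandwidth_bytes_per_s, latency_s)
            < recompute_seconds(tokens, prefill_tokens_per_s))


shape = ModelShape(layers=32, kv_heads=8, head_dim=128, bytes_per_value=2)
tokens = 100_000
print(f"KV cache per token: {kv_bytes_per_token(shape) / 1024:.0f} KiB")
print(f"KV cache for {tokens:,} tokens: {tokens * kv_bytes_per_token(shape) / 1e9:.1f} GB")
print("fetch at 45 GB/s:", should_fetch(tokens, shape, 45e9, 1e-3, 10_000.0))
print("fetch at 1 GB/s:", should_fetch(tokens, shape, 1e9, 1e-3, 10_000.0))
```

With these assumed numbers the fast link wins and the slow link loses, which is the sense in which offload only pays off when the transfer is quick enough. A real scheduler would also account for queueing, partial cache hits, and the cost of occupying the GPU during recompute.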

Starting point is 00:17:59 You know, when you're only running one thread on a system, it's relatively straightforward to keep most of that in HBM and maybe in CPU memory. But when you're running 100 or 1,000 threads on a stack of 8 GPUs, it becomes a real concern because those threads are operating concurrently. and the swapping from one thread to another is happening as GPUs become idle and doing that swapping, it's almost like virtual memory to some extent.

Starting point is 00:18:28 I mean, you're trying to, you know, bring in the context for a particular thread, and then you're going to try to save it off to go on to the next thread. Is that what's happening? Essentially, yeah. What's interesting is that there is a slightly different semantic

Starting point is 00:18:47 in terms of storage, because this is stuff that can be recalculated. And we've traditionally thought of storage as something that's permanent, but actually if it lasts for a month and you lose it, maybe the dynamics are different. Maybe that's quite acceptable. Coming back on something earlier,

Starting point is 00:19:07 it's not just the performance, the latency. One of the restricting factors is the power cost and the cooling. And if you can move that data at lower power cost, then that's really an advantage. Yeah. So the other thing, the other challenge with objects in general and S3 in particular is that, you know,

Starting point is 00:19:27 it's an IP protocol. It's got a lot of, it's very chatty. Yep. It's not exactly what you consider a high performance protocol, although obviously with enough threads and enough concurrency,

Starting point is 00:19:39 you can generate a lot of throughput. The latency for them to establish a connection for S3 is not trivial, I guess, I'd say. I would agree. But go back to the comment you made earlier about the size of the objects and the amount of data that's being moved.

Starting point is 00:19:59 These are not 4K transfers, typically. These are going to be large. And that's where the setup costs start to be lower compared with the overall I/O. The size of the data dictates that if you're doing something like RDMA, it becomes a significant advantage because you're reducing the copies, number one. And number two, you're operating at line speeds almost, I guess. The bandwidth you can offer in that environment is significant.
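The argument above, that a fixed setup cost matters less as transfers get larger, can be put into a small model: the fraction of wall-clock time spent actually moving data is the transfer time divided by setup time plus transfer time. The 1 ms setup cost below is a made-up figure, and the 45 GB/s link speed reuses the number mentioned earlier in the conversation.

```python
def link_efficiency(size_bytes: int, setup_s: float, bandwidth_bytes_per_s: float) -> float:
    """Fraction of total time spent moving data rather than setting up."""
    transfer_s = size_bytes / bandwidth_bytes_per_s
    return transfer_s / (setup_s + transfer_s)


SETUP_S = 1e-3    # hypothetical per-request setup cost (connection, signaling)
BANDWIDTH = 45e9  # 45 GB/s, as quoted for a 400 gig connection

for label, size in (("4 KiB", 4 * 1024), ("1 MiB", 1024**2), ("1 GiB", 1024**3)):
    eff = link_efficiency(size, SETUP_S, BANDWIDTH)
    print(f"{label:>6}: {eff * 100:.3f}% of the time is spent on the wire")
```

In this toy model a 4 KiB request is almost entirely setup, while a gigabyte-scale object spends nearly all of its time at line rate, which matches the intuition that large checkpoint and KV cache objects are a good fit for this kind of transfer.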

Starting point is 00:20:32 So there is a place, obviously, in training for S3 over RDMA. And there obviously is a place in inferencing for KV cache. But, I mean, there's logging, there's other things going on in inferencing where object storage plays a part, wouldn't you say? Yeah, absolutely. There's a large amount of data generation and also data storage, right? We know about the techniques used in RAG to provide context for an inferencing session. That data needs to be stored somewhere, and object provides a good use case for that data. Yeah. So the reality is that's where the data is at. I mean, it's not just, you know, as I think

Starting point is 00:21:34 about my workflow and I'm building a system in any of the cloud providers, my first choice for building the, you know, I have 8,000 blog posts and video pieces of content and tweets loaded onto the disk. I'm not going to put that on an NFS share or some other type of file system. It is a cloud app, so I'm going to read it preferably as an S3 object. There's, you know, kind of the practicalness: no one's building file systems to host their cloud-based apps. So as I'm consuming all of these logs and keeping all this data that I didn't think I would ever touch, now AI inferencing gives me the ability to actually touch it from a RAG perspective.

Starting point is 00:22:39 I would say, you know, it's not normal, I guess, for having objects being the back end for a database, but I'm certainly familiar with a couple of vendors that do that sort of thing. And so, I mean, yeah, yeah, you know, it can have the raw data for sure. And then when you're doing, you know, RAG processing and converting it to a vector database, even the vector database itself can be sizable enough that maybe it belongs on object. Is that what you're, I mean, is that what you guys are seeing in the field? Yes, particularly vector databases. You know, one of the new advancements that came out in the last year or so from AWS, and that has been making its way to on-prem object servers, is S3 Tables.

Starting point is 00:23:31 You know, this idea of having structured data that's backed by objects is becoming very on-trend in comparison to, you know, how we normally view databases as being block storage deployments. You know, AWS announced a couple of years ago S3 Vectors. So, well, they, and we're seeing storage vendors in general do

Starting point is 00:23:57 this, where we're taking our existing data, vectorizing it, and having that data, that vector database, be part of the S3 service so I can consume it directly without having a second layer of, you know, vectorization.

Starting point is 00:24:14 So I am able to natively use these calls. So are we that advanced, where we're consuming something like S3 Vectors directly into our data pipeline to get this data set directly to the GPUs, or are we kind of one level removed today? I think what I see to some extent is that with S3 Vectors, the database is moving the data from the object store directly to the GPU via some vector search request, I guess.

Starting point is 00:24:56 I mean, that's ultimately what you want to see, how you got there, whether you can reduce a lot of the operating system overhead, which is, you know, what RDMA is all about to a large extent. So I guess with that said, the optimization is, anytime I make a storage request from the GPU to the underlying S3-based storage, RDMA may become my preferred method of making that request. So no matter the pipeline or workflow, whether we're talking about KV cache or we're loading data via a RAG process, that call to get data into the GPU

Starting point is 00:25:44 is benefited by being over RDMA. Yeah, absolutely. I mean, that's what I would see as the ultimate goal for accelerated object storage. I mean, it just begs the question to some extent. I'm assuming we're talking RoCE here, RoCE version one, but there's RoCE version two, which sort of operates across the Internet. Is something like this viable for, you know, RoCE version two? That's an interesting question.

Starting point is 00:26:16 I mean, in the accelerated object group, what we're looking at doing is starting with RoCE v1. But there is a whole realm of technologies, like Ultra Ethernet, that potentially have a role to play in the future. Yeah, yeah, yeah. Yeah, this is a little mind-twisty because there's the accessing the stores, you know, kind of making the calls to retrieve the storage.

Starting point is 00:26:52 And then there's kind of the protocol that those requests use from a networking perspective. I guess the best way to describe it is north-south calls. Those calls themselves might be able to use RDMA, depending on the underlying infrastructure. So I guess, turtles all the way down: RDMA all the way through the entire process, from the network to the actual storage calls. And this is a big place where the technical working group in SNIA is looking to make a contribution: the signaling that is needed between your S3-compatible clients and your endpoint in order to indicate, I would like the data to be transferred out of band with

Starting point is 00:27:47 this different protocol, right? And then our first step is looking at RDMA reads and writes. And that exchange of metadata, that signaling, defining that in a way that can be implemented and delivered into products and built in an interoperable way, that is one of the main goals of what the technical working group is looking to produce. Yeah, because from my kind of developer lens, I know that I should be using RDMA when possible. I don't have the technical capability to do the low-level optimizations to have that happen. And as I go, not from cloud provider to cloud provider, but generically, I want to say, hey, here's my S3 back end from vendor X. I want to load that into my GPUs running on vendor Y's hardware.

Starting point is 00:28:54 And that hardware may be a CPU, GPU, whatever accelerator I choose to use. I don't want to get into the specification details of that. I don't want to recreate that wheel. So that's the work of SNIA. Yeah, that's right. Yeah, so providing that. Yeah. So one thing, and I probably should have mentioned this earlier, S3 is not really an official standard.

Starting point is 00:29:20 It's sort of a, it's a byproduct of how AWS has been able to, you know, dominate the environment. I mean, local object storage has always kind of had its own protocol. And over time, they adopted S3 as the dominant protocol. So how does, I guess the question really is, how does something like SNIA's TWG standard for accelerated object work itself on top of some protocol which is not standard? That is a beautiful question that we are, to some extent, reckoning with. I think the best thing we can do is say, you know, S3 exists.

Starting point is 00:30:01 There is a fairly common understanding of how it operates, and this is a specification to layer on top of that in order to be able to provide the hook for RDMA transfer. But no, we're not building on a ratified standard. It is definitely a more ambiguous environment. And I guess the next-level question after that: what's the right level of abstraction, you know, kind of across the spectrum of technical implementations?

Starting point is 00:30:35 Where does SNIA start and stop? It starts and stops everywhere, I think. I mean, with respect to RDMA for objects, I don't know what the protocol would look like, but I would assume there's some sort of metadata flag that's specified or used in the GETs and PUTs that indicates RDMA. Is that how you see this happening?

Starting point is 00:30:59 I guess maybe that's in the process of being developed. That's the rough idea, yes. The extra metadata passed across in order to enable the RDMA transfer. But there are all sorts of interesting corner cases to resolve on that, such as, you know, what checksums mean in that context, things like that. Yeah, sure. Yeah.
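The rough idea above, extra metadata on a GET that asks for the payload to be delivered out of band, can be illustrated with a toy model. Everything in the sketch is hypothetical: the `RdmaHint` fields, the fallback rule, and the use of CRC32 are invented for illustration and are not taken from any SNIA draft or vendor product. The "RDMA write" is simulated by copying bytes into a buffer that stands in for registered client memory.

```python
import zlib
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class RdmaHint:
    """Hypothetical descriptor a client attaches to a GET request."""
    buffer_addr: int  # stand-in for a registered memory address
    rkey: int         # stand-in for an RDMA remote key
    length: int       # size of the client's receive buffer


@dataclass
class GetResponse:
    length: int                    # object size in bytes
    crc32: int                     # checksum of the object payload
    transferred_out_of_band: bool  # True if data went "via RDMA"
    inline_body: Optional[bytes]   # set only on in-band fallback


class ToyEndpoint:
    """Toy object endpoint; not a real S3 server."""

    def __init__(self, objects: Dict[str, bytes], rdma_capable: bool = True):
        self.objects = objects
        self.rdma_capable = rdma_capable

    def get(self, key: str, hint: Optional[RdmaHint],
            client_memory: Dict[int, bytearray]) -> GetResponse:
        data = self.objects[key]
        crc = zlib.crc32(data)
        if hint is not None and self.rdma_capable and hint.length >= len(data):
            # Simulated RDMA write: place bytes directly in the client's buffer
            client_memory[hint.buffer_addr][:len(data)] = data
            return GetResponse(len(data), crc, True, None)
        # Fallback: return the payload in the ordinary response body
        return GetResponse(len(data), crc, False, data)


def client_get(endpoint: ToyEndpoint, key: str, buf_size: int) -> Tuple[bytes, bool]:
    addr = 0x1000  # arbitrary stand-in address
    memory = {addr: bytearray(buf_size)}
    resp = endpoint.get(key, RdmaHint(addr, rkey=42, length=buf_size), memory)
    if resp.transferred_out_of_band:
        payload = bytes(memory[addr][:resp.length])
    else:
        payload = resp.inline_body
    # The client verifies the checksum after the data has landed in memory
    if zlib.crc32(payload) != resp.crc32:
        raise ValueError("checksum mismatch")
    return payload, resp.transferred_out_of_band


endpoint = ToyEndpoint({"ckpt/step-100": b"checkpoint-bytes"})
print(client_get(endpoint, "ckpt/step-100", buf_size=64))
print(client_get(endpoint, "ckpt/step-100", buf_size=4))
```

The checksum handling hints at one of the corner cases mentioned: when the payload bypasses the normal response body, something still has to say which bytes were checksummed and who verifies them, and in this sketch that job falls to the client after the buffer is filled.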

Starting point is 00:31:24 How do checksums work at all in this environment when you're doing memory-to-memory kinds of work? Yeah. And so something like this would have to be implemented in spec, from a GPU perspective. Somebody like an NVIDIA or AMD or whoever is doing the accelerator would have to support this new protocol, new metadata. Is that how you see this? Obviously, there's a vendor side of this as well. So there's a client side and there's a vendor side.

Starting point is 00:31:51 Both of them have to agree on the standards, I guess. Yes. Yeah. I mean, there are some early-stage products coming up to the market that have some form of this implemented. Our goal is to provide an interoperable standard that vendors can adopt to give a much more universally accessible solution. And SNIA would provide, I don't know what I call it,

Starting point is 00:32:22 plugfest is not the right word, but I think it's similar to this, for these sort of vendor and client solutions to test that they're following the standard. Is that how this works? That's certainly what's been done in the past with Fibre Channel and iSCSI and things of that nature. I know that, but you would have to have object storage now, not block storage and file storage and things of that nature. Yeah.

Starting point is 00:32:49 But it's certainly something that we're talking about and would hope to see happen. Yeah, yeah, yeah. The other surprising thing to me is that this just started last November. As far as I can tell, the TWG started last November after the Storage Developer Conference. And you're already talking. I know at least one vendor, and possibly two, that already support something in this space. It's not standard, quote unquote,

Starting point is 00:33:16 but it's using the protocol. It's using RoCE to transfer objects. Yeah, I've seen outputs from this and the early test results are, like, insane. Like, I think one vendor claimed, one software vendor that used one of these solutions claimed like a 17x increase in performance. It was like having 17x the number of GPUs.

Starting point is 00:33:46 Yeah, I mean, when you start talking inferencing KV cache offload, it becomes pretty impressive. The training side, it's harder to assess the speedup, but it certainly can be impressive as well. You guys obviously are not at the point where you have performance data. Do you? I should ask. No, no, no, not yet.

Starting point is 00:34:10 That's right. Yeah. Yeah. Yeah. And I think this all wraps up in just the pace at which AI has been advancing, right? How many things did not exist six weeks ago that we now talk about constantly? Very, very short period of time. And that's what we're seeing here also.

Starting point is 00:34:33 So there are many different vendors who have gone to market with solutions. We, with many different vendor member companies of SNIA, are saying, look, we may be competitors, but interoperability is important to our customers, is important to supporting a new protocol and its advancement and innovation. So that's where SNIA and the technical working group get involved. I guess a question to follow on to that is, can a standards working group operate at the speed necessary to follow what's going on in the industry?

Starting point is 00:35:19 I mean, it's a tough call. I know SNIA obviously tried to optimize and increase their throughputness with respect to this sort of thing. But standards operate at a different speed. Standards working groups, let's call it. I was on one call a while back that had 50 people on it. I couldn't handle it. I had to get off because it was, you know, it was that impressive.

Starting point is 00:35:45 It had that many, you know, interested parties. Yeah, I think that is an issue. But I think if you look at what's happening in SNIA, there is a very definite focus around AI. There's the Storage AI initiative, which encompasses a number of working groups. And the event that you went to recently was a day focused totally on AI.

Starting point is 00:36:12 I think the pressure in the market is such that things do move perhaps faster than they have previously. Yeah, yeah, yeah, yeah. Getting back to a question that Keith had asked specifically about agentic workloads. In my mind, agentic workloads,

Starting point is 00:36:31 obviously AI inferencing is building and generating lots of context, which means lots of KV stores, which means the deeper you go into this tool use, AI use, you know, different steps or different phases in the process, the more context

Starting point is 00:36:55 matters. And so key-value cache offload becomes a critical component to something like that, in my mind. Is that how you see things? Yeah, absolutely. We know that models are growing their maximum context length. It's becoming more typical that models support a million. Some are supporting 10 million tokens of context length per session. So these are extremely large amounts of data when you consider that those tokens are represented for that model as a file or object or space within memory. And of course, multiplied by many users, it becomes incredibly important to figure out the data management story for this, even if we're thinking about it as in some ways being

Starting point is 00:38:00 ephemeral, right? This is a cache, right? If this data is lost, the GPUs can recompute it. There is a cost to that, right? And so what we're noticing in KV cache is this tiering approach. Right. Between volatile memory, long-term storage, memory-like interfaces with some data protection. All of these are creating this ability to expand that total context-saving space in ways that haven't existed before. Right, right, right, right. Having the bandwidth and lower-latency abilities of something like S3 over RDMA makes this more viable to a large extent, and more performant. Well, guys, this has been great.

Starting point is 00:39:08 We could probably talk for another couple of hours on what's going on in this industry. But Keith, any last questions for Nick or Jason before we leave? No, I'm going to have to feed this whole session to AI to help me. Don't do it. Well, maybe go ahead and do it. This has been one of the meatier ones. I appreciate it, guys. I'm starting to completely grok it, but this is neat.

Starting point is 00:39:35 Okay. Nick or Jason, is there anything you'd like to say to our listening audience before we close? I would say come and get involved at the Accelerated Object Storage Working Group. And don't miss out on SDC in... Storage Developer Conference, right? Yep. And coming up in the fall, is that true? Yep, September 28 to 30th.

Starting point is 00:39:56 Okay. All right, well, this has been great, Nick and Jason. Thanks again for being on our show today. Thank you so much for having us. Thank you. And that's it for now. Bye, Nick, bye, Jason, and bye, Keith. Bye, Ray.

Starting point is 00:40:09 Until next time. Next time, we will talk to another system storage technology person. Any questions you want us to ask, please let us know. And if you enjoy our podcast, tell your friends about it. Please review us on Apple Podcasts, Google Play, and Spotify, as this will help get the word out.

Jason Goldschmidt and Nick Connolly, co-chairs of SNIA's Accelerated Object TWG, discussed the importance of S3 over RDMA for AI processing. SNIA's work addresses the industry's need for faster data transfer to improve GPU utilization during model training and inferencing.
