Grey Beards on Systems - 175: GreyBeards talk Accelerated Object with SNIA TWG CoChairs, Jason Goldschmidt, DELL Distinguished Eng. & Nick Connolly, ARM Principal Eng.
Episode Date: June 1, 2026
Jason Goldschmidt and Nick Connolly, co-chairs of SNIA's Accelerated Object TWG, discussed the importance of S3 over RDMA for AI processing. SNIA's work addresses the industry's need for faster data transf...er to improve GPU utilization during model training and inferencing.
Transcript
Hey, everybody, Ray Lucchesi here with Keith Townsend.
Welcome to another sponsored episode of the GreyBeards on Storage podcast,
a show where we get GreyBeards bloggers together with storage
and system vendors to discuss upcoming products, technologies, and trends affecting the data center today.
We have with us here today Nick Connolly, Arm principal software engineer
and SNIA Accelerated Object I/O co-chair, and Jason Goldschmidt, Dell Distinguished Engineer
and SNIA Accelerated Object I/O co-chair.
I was at SNIA's Storage AI conference late last month, and Jason
and another presenter spoke about their work developing accelerated object standards.
So Nick and Jason, why don't you tell us a little bit about yourselves and what's up at SNIA
and its Accelerated Object working group?
Sure thing.
Hi, Ray and Keith.
My name's Jason Goldschmidt.
I work at Dell Technologies.
My title is Distinguished Engineer.
I have been working in the storage industry for 20-plus years.
A lot of my time has been focused on
network and storage protocols, as well as cloud storage,
and now, as the hot focus for everyone, storage for AI.
I became involved in SNIA this past fall
when I presented at the SNIA Developer Conference in San Jose
about some emerging work in RDMA-accelerated, S3-compatible
object storage. As a result of that presentation, there was a good amount of interest,
and a number of SNIA member companies came together and decided to form a technical working group
to look at this particular area: how to accelerate object I/O in the age of AI workloads.
And Nick?
Hi, I'm Nick.
I've been involved with storage for years.
And it's fascinating to see the changes that are happening due to AI workloads.
I remember when I started in storage, if you got 5 megabytes a second out of a disk, you were quite happy, you were doing good.
And then if you got a RAID controller, maybe you got 45 megabytes a second, and, you know, that was good.
But now you're talking 45 gigabytes a second, and that's on a 400-gig
network connection. So what happens when we go to 1.6? It's a completely different world.
So I'm really happy to be involved with accelerating object storage, because I think
there are lots of things changing with storage, and being able to use RDMA and technologies like
that to deliver data faster to the workload is really an important area.
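A quick back-of-envelope check of the numbers mentioned above, as a minimal Python sketch. It assumes "400 gig" means a 400 Gb/s link and "1.6" means a 1.6 Tb/s link; the function name and the idea of carrying the same efficiency forward are illustrative assumptions, not something stated in the conversation.

```python
def raw_gigabytes_per_second(link_gigabits_per_second: float) -> float:
    """Convert a link speed in Gb/s to its raw ceiling in GB/s (8 bits per byte)."""
    return link_gigabits_per_second / 8.0


# Assumption: "400 gig" is a 400 Gb/s network link.
raw_400g = raw_gigabytes_per_second(400)   # raw ceiling of the link in GB/s
efficiency = 45.0 / raw_400g               # fraction of the ceiling that 45 GB/s represents

# Assumption: "1.6" is a 1.6 Tb/s (1600 Gb/s) link running at the same efficiency.
raw_1600g = raw_gigabytes_per_second(1600)
projected_1600g = raw_1600g * efficiency

print(f"400 Gb/s link: {raw_400g:.0f} GB/s raw, 45 GB/s is {efficiency:.0%} of that")
print(f"1.6 Tb/s link: {raw_1600g:.0f} GB/s raw, about {projected_1600g:.0f} GB/s at the same efficiency")
```

Under these assumptions, 45 GB/s is already close to the ceiling of a 400 Gb/s link, and a 1.6 Tb/s link would raise that ceiling roughly fourfold.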
So I've got a couple of questions to start off. But first I wanted to mention some
statistics. I did some research on AWS S3 object storage. They have hundreds of
exabytes of objects, roughly 500 trillion objects. If I do my math correctly, that's about 200
megabytes per object, which is a pretty sizable file. And they're doing, gosh, 200 million
object requests a second, and roughly a petabyte
a second across 123
availability zones. So that's
a lot of data and it's a lot of bandwidth
and it's a lot of parallelism.
Where does accelerated objects
fit in the AI workload
these days? I guess I can talk about
training and inferencing
and that sort of thing. I don't know who
wants to start on that, but I thought
the stats were kind of impressive.
Yeah, the scale
of object storage, especially
S3 is quite impressive.
And it really highlights how folks have thought about object storage for a long time,
that it's a place to put lots and lots of unstructured data.
And that's how it's been used.
So in many cases, it's been viewed as backup or scratch space or logging,
these workloads without performance requirements,
where S3 provided a very simple interface,
very simple provisioning, low metadata,
and large object counts to developers and data scientists
and whoever had the requirement for storing and retrieving data.
With AI, the use of object
storage has really changed.
Oh, yeah?
Right.
So all of the reasons why someone might use object instead of file, right, still persist.
S3-compatible object storage is easy to use, right?
It's API-based.
It can be provisioned from a client.
It can be embedded in software.
It allows the user of the storage to be somewhat divorced from the administrative process.
Files typically involve mount points and ahead-of-time provisioning.
With object, we think about, oh, I'm going to create a bucket.
I'm going to start creating objects.
I'm going to read back those objects when I need them.
I might do some lifecycle management of those objects.
And all that can be done with code or CLI at the user level, not involving admins whatsoever.
Right.
Doesn't involve kernel drivers.
It doesn't involve, you know, administrative access on the system that is operating on that storage.
So these are the motivators for why folks really prefer object over file, right:
the reduced metadata that's attached to it, and just simply the scale of the number of objects
versus, say, the number of files that you might have in a directory.
Yeah.
That continues.
Right.
My problem, right?
Sure.
Exactly.
So that continues to be true in the AI world, right?
Where a tremendous amount of data is being produced, is being ingested, right?
is being used for the AI workloads across training and inference, right?
You have RAG.
All of these things are producing large amounts of data and ingesting large amounts of data.
What changes with AI is the massive requirement for performance, for low latency, for high throughput,
because the GPUs that do the work are very expensive.
And having an idle GPU costs money.
So.
But, you know, from my perspective,
object storage has a long, long, long, long heritage.
And it's been looking for a killer app forever, in my mind.
I thought data lakes were the answer to that,
but then AI emerged.
And it was almost a perfect storm, as one vendor said, for data.
The data explosion has just gone bonkers, in my mind.
Excuse my technical terms.
So, I mean, in a training environment, you know, a lot of this data used to be or was or is being staged on local storage
and then being ingested by GPUs in order to speed up, you know,
GPU utilization.
Where does something like accelerated objects fit in, you know, supplying that sort of data
in a different form or a different fashion, I guess?
I think that's interesting.
I've heard two approaches on this for the staging of the data.
One is that, yes, you take your data, transform it, stage it, and then use it for training.
The counter argument is that this actually requires more storage, because you've essentially
duplicated your data, and there is another model, which is streaming it in. But in a
training context you're also creating checkpoints and logs. In a
very large training environment, the chance of any node going down is high, so you want
to be able to preserve those logs and checkpoints, and it seems that S3 object
is the preferred route for those.
Yeah, so, to some extent,
the checkpointing is a function of how many nodes you have.
If you've got 10 nodes, you could probably checkpoint,
you know, once every shift or something like that.
If you've got a thousand nodes, it's probably once every 20 minutes
or 10 minutes or something like that,
because chances are something is going down.
So, checkpoints, you know, first question:
are they serialized?
Is all GPU activity pretty much frozen
while a checkpoint is written out?
I mean, are you aware of that?
Again, there seem to be multiple approaches.
There's the approach of taking the checkpoint straight from GPU RAM out to storage.
There's the other approach of freezing everything, copying it to RAM, and then putting it out to storage.
More recently, there are things like copying the data to other nodes within the cluster,
so that if you lose a node, you can then retrieve the data from the other node.
But you're still going to want to take a checkpoint for archival purposes or if you need to roll the model back.
So there's still a need to put those out to more persistent storage.
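The link between node count and checkpoint frequency discussed above can be made concrete with Young's approximation, a standard rule of thumb that is swapped in here for illustration and was not mentioned in the conversation. The sketch assumes independent node failures, and every number in it (node MTBF, checkpoint cost) is a hypothetical input chosen only to show the trend.

```python
import math


def cluster_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """Approximate cluster MTBF, assuming nodes fail independently."""
    return node_mtbf_hours / node_count


def young_checkpoint_interval_hours(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """Young's approximation: interval = sqrt(2 * checkpoint cost * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


# Hypothetical inputs, chosen only to illustrate the trend.
NODE_MTBF_HOURS = 10_000.0     # assumed mean time between failures for one node
CHECKPOINT_COST_HOURS = 0.05   # assumed time to write one checkpoint (3 minutes)

for nodes in (10, 1000):
    mtbf = cluster_mtbf_hours(NODE_MTBF_HOURS, nodes)
    interval = young_checkpoint_interval_hours(CHECKPOINT_COST_HOURS, mtbf)
    print(f"{nodes} nodes: cluster MTBF {mtbf:.0f} h, checkpoint about every {interval:.1f} h")
```

With these assumed inputs, going from 10 to 1,000 nodes shortens the suggested interval by a factor of ten, which is also why a faster checkpoint write path matters more as clusters grow.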
Yeah, yeah, yeah, yeah, yeah.
Go ahead.
It would be helpful to kind of understand, from a system perspective,
the importance of moving the pipeline,
whether we're talking about getting data off disk
or, you know, regardless of what protocol
it's stored in, into the GPU,
and the importance of RDMA in that and in reducing the overall latency.
And I think that's the first goal, right,
is to reduce the latency.
Because I think some people are thinking, you know what,
even if I could get the speeds of S3 up to traditional block-level storage,
there's the problem of traversing the OS stack and getting to that.
So can we talk to kind of the levels of the challenges
and where this fits in to solving some of those latency challenges?
Yeah, absolutely.
So, some of the motivators around RDMA use within AI: what we're seeing is deployment models for AI where it's RDMA everywhere, built into the full system deployment,
with multiple nodes and fast interconnects inside and outside of AI servers.
The use of RDMA really has two benefits.
One is, of course, the zero-copy, memory-to-memory nature of RDMA.
And so that is going to reduce latency, because it's going to reduce those data copies
that are associated with going through the OS stack or going from CPU-based memory into GPU memory,
right, that host-to-device copy can add latency. But additionally, what RDMA brings is not just
reduced latency, but reduced resource utilization. So it's two pieces of resource. One is the
host CPU resource that's being involved in the data movement when RDMA is not in place,
and the memory bandwidth contention between the CPU and the GPU, or the device and the GPU,
that needs to be part of the consideration. Additionally, GPUs need to perform work during those
data copies or data offloading, for the host-to-device and device-to-host transfers.
By being able to move data directly into GPU memory in a zero-copy manner, it reduces that
GPU overhead. It keeps that host CPU utilization very low while performing the whole
operation with lower latency.
So this has been well observed in networking, right?
So GPU-to-GPU networking, node-to-node networking.
You know, you get an instantaneous boost when you switch from using TCP/IP to using RDMA, and the magic happens of bringing up GPU utilization.
So extremely important in training.
What about in, you know, I think maybe a year, year and a half ago, I remember having the discussions about the impact of storage performance on inference.
And people were pretty dismissive of it.
Up until, I think, we started to realize that this really matters, and agentic overall.
So can you talk to kind of the architectural considerations for when you're training versus when you're doing inference?
And specifically, I think the interest has come when it comes to agentic, when we have to do tool calls.
So I had a fun stat this week: in the past year, inferencing workloads have grown by 320 times from where they were last year.
And the growth of inference deployments and inference workloads is tremendous, right?
Moving very quickly and outpacing training workloads.
Yeah, and I call that industrialization of AI.
It's gone beyond the hype cycle now.
It's starting to actually be used.
Inference is about business value.
Yeah.
Right.
And so that's why we're seeing so much of it, right?
It's how do you take your training data and then actually turn it into business value?
That's what inference does.
So inferencing and inference acceleration have really gone hand in hand with storage,
with storage technology.
It's based on this idea that GPU memory is limited.
GPU memory has a small space for storing the computed KV values, the KV cache.
Once that space is exhausted, it means that values in that cache have to be evicted.
If they're not found in that cache, it means the GPU needs to recompute data that it has already computed.
So if we can offload that data to some other form of storage, we get an instant boost if the
transfer time from that storage to GPU memory is quick enough to provide value over
recomputing those KV values. And this is where external storage, like S3 or file or block, can come in to
provide massive amounts of KV data when you're talking about very, very large inference environments
where local disk or local memory is insufficient to store all of that data, given how many GPUs and how many users are deployed within that environment performing inference operations.
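The trade-off described above, fetching offloaded KV data versus recomputing it, can be sketched as a simple break-even check. This is a minimal illustration, not a model of any real system: the function names are hypothetical, and the sizes, bandwidths, latencies, and recompute time are assumed values.

```python
def fetch_seconds(kv_bytes: float, bandwidth_bytes_per_s: float, latency_s: float) -> float:
    """Time to bring offloaded KV data back: fixed latency plus size over bandwidth."""
    return latency_s + kv_bytes / bandwidth_bytes_per_s


def offload_beats_recompute(kv_bytes: float, bandwidth_bytes_per_s: float,
                            latency_s: float, recompute_s: float) -> bool:
    """Offload only pays off when fetching is faster than recomputing on the GPU."""
    return fetch_seconds(kv_bytes, bandwidth_bytes_per_s, latency_s) < recompute_s


# Assumed example values, for illustration only.
KV_BYTES = 10e9        # 10 GB of cached KV values for one session
RECOMPUTE_S = 2.0      # assumed time for the GPU to recompute those values

fast_path = offload_beats_recompute(KV_BYTES, 40e9, 0.001, RECOMPUTE_S)  # 40 GB/s, 1 ms latency
slow_path = offload_beats_recompute(KV_BYTES, 1e9, 0.005, RECOMPUTE_S)   # 1 GB/s, 5 ms latency

print("fast path worth offloading:", fast_path)
print("slow path worth offloading:", slow_path)
```

With these assumed numbers, the fast path wins and the slow path loses, which is the sense in which transfer time decides whether external storage adds value.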
Yeah, the challenge, and I understand all that, and the KV cache offload: as you get more and more and more activity on a particular GPU, more and more concurrent
threads or more and more tokens in context, the KV cache grows and can grow considerably.
You know, when you're only running one thread on a system, it's relatively straightforward to
keep most of that in HBM and maybe in CPU memory.
But when you're running 100 or 1,000 threads on a stack of 8 GPUs, it becomes a real
concern because those threads are operating concurrently,
and the swapping from one thread to another
is happening as GPUs become idle.
And doing that swapping,
it's almost like virtual memory to some extent.
I mean, you're trying to, you know,
bring in the context for a particular thread,
and then you're going to try to save it off
to go on to the next thread.
Is that what's happening?
Essentially, yeah.
What's interesting is
that there is a slightly different semantic
in terms of storage,
because this is stuff that can be recalculated.
And we've traditionally thought of storage
as something that's permanent,
but actually if it lasts for a month and you lose it,
maybe the dynamics are different.
Maybe that's quite acceptable.
Coming back to something earlier,
it's not just the performance, the latency.
One of the restricting factors is the power cost and the cooling.
And if you can move that data at lower power cost,
then that's really an advantage.
Yeah.
So the other thing,
the other challenge with objects in general
and S3 in particular is that, you know,
it's an IP protocol.
It's got a lot of,
it's very chatty.
Yep.
It's not exactly what you consider
a high performance protocol,
although obviously with enough threads
and enough concurrency,
you can generate a lot of throughput.
The latency to
establish a connection for S3
is not trivial, I guess, I'd say.
I would agree.
But go back to the comment you made earlier
about the size of the objects
and the amount of data that's being moved.
These are not 4K transfers, typically.
These are going to be large.
And that's where the setup costs
start to be lower compared with the overall I/O.
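The amortization argument above can be put in numbers: what share of the total request time is setup, as the object size grows. This is a minimal sketch; the 1 ms setup cost and 10 GB/s bandwidth are assumptions for illustration, and `setup_fraction` is a hypothetical helper.

```python
def setup_fraction(setup_s: float, size_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Fraction of total request time spent on setup rather than moving data."""
    transfer_s = size_bytes / bandwidth_bytes_per_s
    return setup_s / (setup_s + transfer_s)


# Assumed values, for illustration only.
SETUP_S = 0.001         # 1 ms to establish the request
BANDWIDTH = 10e9        # 10 GB/s of usable bandwidth

small = setup_fraction(SETUP_S, 4 * 1024, BANDWIDTH)   # a 4 KiB transfer
large = setup_fraction(SETUP_S, 1024 ** 3, BANDWIDTH)  # a 1 GiB object

print(f"4 KiB transfer: setup is {small:.1%} of the total time")
print(f"1 GiB object:   setup is {large:.1%} of the total time")
```

Under these assumptions, setup dominates a 4 KiB transfer but is only around one percent of a 1 GiB transfer.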
The size of the data dictates it.
If you're doing something like RDMA, it becomes a significant advantage because you're reducing the copies, number one.
And number two, you're operating at line speeds almost, I guess.
The bandwidth you can offer in that environment is significant.
So there is a place, obviously, in training for S3 over RDMA.
And there obviously is a place in inferencing for KV cache.
But, I mean, there's logging,
there's other things going on in inferencing where object storage plays a part, wouldn't you say?
Yeah, absolutely. There's a large amount of data generation and also data storage, right? We know about
the techniques used in RAG to provide context for an inferencing session. That data needs to be
stored somewhere and object is a good use case. It provides a good use case for that data.
Yeah. So the reality is that's where the data is at. I mean, as I think
about my workflow, and I'm building a system in any of the cloud providers, my first choice for
building the, you know, I have 8,000 blog posts and video pieces of content and tweets
loaded onto disk. I'm not going to put that on an NFS share or some
other type of file system. It is a cloud app, so I'm going to read it preferably
as an S3 object. There's, you know, kind of the practicality: no one's
building file systems to host their cloud-based apps.
So as I'm consuming all of these logs and keeping all this data that I didn't think I would
ever touch, now AI inferencing gives me the ability to actually touch it from a RAG perspective.
I would say, you know, it's not normal, I guess, to have objects be the back end for
a database, but I'm certainly familiar with a couple of vendors that do that sort of thing.
And so, I mean, yeah, you know, it can have the raw data for sure. And then when you're
doing, you know, RAG processing and converting it to a vector database, even the vector database
itself can be sizable enough that maybe it belongs on object. I mean,
is that what you guys are seeing in the field?
Yes, particularly vector databases. You know, we see
new advancements that came out in the last year or so from AWS, and that have been making
their way to on-prem object servers, like S3 Tables.
You know, this idea of having structured data that's backed by objects is becoming very on-trend
in comparison to, you know, how we normally view databases as being block
storage deployments.
You know,
AWS announced a couple of years ago
S3 Vectors. So,
well, they, and we're
seeing storage vendors in general do
this, where we're taking
our existing data, vectorizing
it, and having that data, that vector
database, be part
of the S3 service so I can
consume it directly without having a
second layer of,
you know, vectorization.
So I am
able to natively use these calls.
So are we that advanced, where we're consuming something like
S3 Vectors directly into our data pipeline to get this data set directly to the GPUs,
or are we kind of one level removed today?
I think what I see to some extent is that with S3 Vectors,
the database is moving the data from the object store directly to the GPU via some vector search
request, I guess.
I mean, that's ultimately what you want to see. How you get there is whether you can reduce
a lot of the operating system overhead, which is, you know, what RDMA is all about to a large
extent.
So I guess with that said,
the optimization is, anytime I make a storage request from the GPU to the underlying S3-based storage,
RDMA may become my preferred method of making that request.
So no matter the pipeline or workflow, whether we're talking about KV cache
or we're loading data via a RAG process, that call to get what the GPU needs
is benefited by being over RDMA.
Yeah, absolutely.
I mean, that's what I would see as the ultimate goal for accelerated object storage.
I mean, it just begs the question to some extent.
I'm assuming we're talking RoCE here, RoCE version one,
but there's RoCE version two, which sort of operates across the Internet.
Is something like this viable for, you know, RoCE version two?
That's an interesting question.
I mean, in Accelerated Object,
what we're looking at doing is starting with RoCE v1.
But there is a whole realm of technologies like Ultra Ethernet
that potentially have a role to play in the future.
Yeah, yeah, yeah.
Yeah, this is a little mind-twisty
because there's accessing
the storage, you know, kind of making the calls to retrieve the storage.
And then there's kind of the protocol that carries those requests from a networking perspective.
I guess the best way to describe it is north-south calls.
Those calls themselves might be able to use RDMA, depending on the underlying infrastructure.
So I guess, turtles all the way down: RDMA all the way through the entire process, from the network
to the actual storage calls.
And this is a big place where the technical working group in SNIA is looking to make a contribution:
the signaling that is needed between your S3-compatible clients and your endpoint
in order to indicate, I would like the data to be transferred out of band with
this different protocol, right? And our first step is looking at RDMA reads and writes.
And that exchange of metadata, that signaling, defining that in a way that can be implemented and delivered
into products and built in an interoperable way, that is one of the main goals of what the technical
working group is looking to produce.
Yeah, because from
my kind of developer lens, I know that I should be using RDMA when possible.
I don't have the technical capability to do the low-level optimizations to have that happen.
And as I go, not from cloud provider to cloud provider, but generically, I want to say, hey, here's my S3 back end from vendor X.
I want to load that into my GPUs running on vendor Y's hardware.
And that hardware may be a CPU, GPU, whatever accelerator I choose to use.
I don't want to get into the specification details of that.
I don't want to recreate that wheel.
So that's the work of SNIA.
Yeah, that's right.
Yeah, so providing that.
Yeah.
So one thing, and I probably should have mentioned this earlier: S3 is not really an official standard.
It's sort of a byproduct of how AWS has been able to, you know, dominate the environment.
I mean, local object storage has always kind of had its own protocol.
And over time, they adopted S3 as the dominant protocol.
So I guess the question really is, how does something like SNIA's TWG standard
for accelerated object work on top of some protocol which is not a standard?
That is a beautiful question that we are, to some extent,
reckoning with.
I think the best thing we can do is say, you know, S3 exists.
There is a fairly common understanding of how it operates,
and this is a specification to layer on top of that in order to be able to provide the hook
for RDMA transfer.
But yes, we're not building on a ratified standard.
It is definitely a more ambiguous environment.
And I guess that next level question, after that,
what's the right level of abstraction, you know,
kind of across the spectrum of technical implementations?
Where does SNIA start and stop?
It starts and stops everywhere.
I think.
I mean, with respect to S3 over RDMA,
I don't know what the protocol would look like,
but I would assume there's some sort of metadata flag
that's specified or used in the GETs and PUTs that indicates RDMA.
Is that how you see this happening?
I guess maybe that's in the process of being developed.
That's the rough idea, yes.
The extra metadata is passed across in order to enable the RDMA transfer.
But there are all sorts of interesting corner cases
to resolve on that,
such as, you know, what checksums mean in that context, things like that.
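The rough idea above, extra metadata on a GET or PUT that asks for an out-of-band RDMA transfer, could look something like the following sketch. Everything here is hypothetical: the header names are invented for illustration and do not come from any SNIA draft, AWS API, or shipping product, and no real RDMA or HTTP library is used. The address, length, and remote key fields mirror what a peer generally needs in order to target a registered RDMA memory region.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical header names, invented for this sketch only.
HDR_ADDR = "x-example-rdma-addr"
HDR_LEN = "x-example-rdma-length"
HDR_RKEY = "x-example-rdma-rkey"


@dataclass(frozen=True)
class RdmaBuffer:
    """Client-side description of a registered memory region the server may target."""
    address: int   # virtual address of the registered buffer
    length: int    # size of the buffer in bytes
    rkey: int      # remote key granting access to the buffer


def rdma_request_headers(buffer: RdmaBuffer) -> Dict[str, str]:
    """Build the extra metadata a client would attach to an otherwise normal request."""
    return {
        HDR_ADDR: hex(buffer.address),
        HDR_LEN: str(buffer.length),
        HDR_RKEY: str(buffer.rkey),
    }


def choose_transfer_path(headers: Dict[str, str]) -> str:
    """Server-side decision: use RDMA only if all the metadata is present."""
    if all(name in headers for name in (HDR_ADDR, HDR_LEN, HDR_RKEY)):
        return "rdma"
    return "inline"  # fall back to returning the object in the normal response body


example_buffer = RdmaBuffer(address=0x7F0000000000, length=1 << 20, rkey=42)
print(choose_transfer_path(rdma_request_headers(example_buffer)))
print(choose_transfer_path({}))
```

One reason to sketch it this way is that a client or server that does not understand the extra metadata can simply fall back to the ordinary path, which is one plausible way to layer on top of existing S3 behavior. Corner cases such as checksums are deliberately left out.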
Yeah, sure.
Yeah.
How do checksums work at all in this environment when you're doing memory-to-memory kinds of work?
Yeah.
And so something like this would have to be implemented in spec, from a GPU perspective.
Somebody like an Nvidia or AMD or whoever is doing the accelerator would have to support
this new protocol, new metadata.
Is that how you see this?
Obviously, there's a vendor side of this as well.
So there's a client side and there's a vendor side.
Both of them have to agree on the standards, I guess.
Yes.
Yeah.
I mean, there are some early-stage products coming to the market
that have some form of this implemented.
Our goal is to provide an interoperable standard
that vendors can adopt to give a much more universally accessible solution.
And SNIA would provide, I don't know what to call it,
plugfest is not the right word, but I think it's similar to this,
for these sort of vendor and client solutions to test that they're following the standard.
Is that how this works?
That's certainly what's been done in the past with Fibre Channel and
iSCSI and things of that nature.
I know that, but you would have to have object storage now, not block storage and file storage
and things of that nature.
Yeah.
But it's certainly something that we're talking about and would hope to see happen.
Yeah, yeah, yeah.
The other surprising thing to me is that this just started last November.
As far as I can tell, the TWG started last November after the Storage Developer Conference.
And you're already talking.
I know at least one vendor and possibly two
that already support something in this space.
It's not standard, quote unquote,
but it's using the protocol.
It's using RoCE to transfer objects.
Yeah, I've seen outputs from this
and the early test results are like insane.
Like the, I think one vendor claimed,
one software vendor that used one of these solutions
claimed like a 17x increase in performance.
It was like having 17x the number of GPUs.
Yeah, I mean, when you start talking inferencing KV cache offload,
it becomes pretty impressive.
The training side, it's harder to assess the speed up,
but it certainly can be impressive as well.
You guys obviously are not at the point where you have performance data.
Do you?
I should ask.
No, no, no, not yet.
That's right.
Yeah.
Yeah.
Yeah.
And I think this all wraps up in just the pace at which AI has been advancing, right?
How many things did not exist six weeks ago that we now talk about constantly?
Very, very short period of time.
And that's what we're seeing here also.
So there are many different vendors who have gone to market with solutions.
We, along with many different member companies of SNIA, are saying, look, we may be
competitors, but interoperability is important to our customers, and it is important to supporting
a new protocol and its advancement and innovation.
So that's where SNIA and the technical working group get involved.
I guess a question to follow on to that is,
can a standards working group operate at the speed necessary
to follow what's going on in the industry?
I mean, it's a tough call.
I know SNIA has obviously tried to optimize and increase their
throughput with respect to this sort of thing.
But standards operate at a different speed.
Standards working groups, let's call it.
I was on one call a while back that had 50 people on it.
I couldn't handle it.
I had to get off because it was, you know, it was that impressive.
It had that many, you know, interested parties.
Yeah, I think that is an issue.
But I think if you look at what's happening in SNIA,
there is a very definite focus around AI.
There's the Storage AI initiative,
which encompasses a number of working groups.
And the event that you were at recently
was a day focused totally on AI.
I think the pressure in the market
is such that things have to move perhaps faster
than they have previously.
Yeah, yeah, yeah, yeah.
Getting back to a question
that Keith had asked specifically about agentic workloads.
In my mind,
with agentic workloads,
obviously AI
inferencing is
building and generating lots of context,
which means lots of KV stores,
which means the deeper you go into this tool use,
AI use, you know,
different steps or different phases in the process,
the more context
matters. And so key-value cache offload becomes a critical component of something like that,
in my mind. Is that how you see things?
Yeah, absolutely. We know that models are growing their
maximum context length. It's becoming more typical that models support a million. Some are
supporting 10 million tokens of context length per session.
So these are extremely large amounts of data when you consider how those tokens are represented
for that model, as a file or object or space within memory.
And of course, multiplied by many users, it becomes incredibly important to figure out the data
management story for this, even if we're thinking about it as in some ways being
ephemeral, right? This is a cache, right? If this data is lost, the GPUs can recompute it.
There is a cost to that, right? And so what we're noticing in KV cache is this tiering approach.
Right.
Between volatile memory, long-term storage, and memory-like interfaces with some data protection,
all of these are creating this ability to expand that total context-saving space in ways that haven't existed before.
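To give a feel for why million-token contexts turn into a data management problem, here is a minimal sizing sketch using the usual estimate for a transformer KV cache: two tensors (keys and values) per layer, per KV head, per token. The model shape below is a hypothetical example, not any specific model discussed in the conversation.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache added per token: keys and values for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int, tokens: int) -> int:
    """Total KV cache size for one session holding the given number of tokens."""
    return kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value) * tokens


# Hypothetical model shape, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM, FP16_BYTES = 32, 8, 128, 2

per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, FP16_BYTES)
for tokens in (1_000_000, 10_000_000):
    total = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, FP16_BYTES, tokens)
    print(f"{tokens:,} tokens: {per_token / 1024:.0f} KiB per token, {total / 1e9:.0f} GB per session")
```

For this assumed shape, a single million-token session is already well beyond the memory of one GPU, before multiplying by the number of concurrent users, which is where tiering across memory and storage comes in.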
Right, right, right, right.
Having the bandwidth and lower-latency abilities of something like S3 over RDMA makes this more viable to a large extent, and more performant.
Well, guys, this has been great.
We could probably talk for another couple of hours on what's going on in this industry.
But Keith, any last questions for Nick or Jason before we leave?
No, I'm going to have to feed this whole session to AI to help me.
Don't do it.
Well, maybe go ahead and do it.
This has been one of the meatier ones.
I appreciate it, guys.
I'm starting to completely grok it, but this is neat.
Okay.
Nick or Jason, is there anything you'd like to say to our listening audience before we close?
I would say come and get involved at the Accelerated Object Storage Working Group.
And don't miss out on SDC in...
Storage Developer Conference, right?
Yep.
And coming up in the fall, is that true?
Yep, September 28 to 30th.
Okay.
All right, well, this has been great, Nick and Jason.
Thanks again for being on our show today.
Thank you so much for having us.
Thank you.
And that's it for now.
Bye, Nick, bye, Jason, and bye, Keith.
Bye, Ray.
Until next time.
Next time, we will talk to
another system storage technology person.
Any questions you want us to ask, please let us know.
And if you enjoy our podcast, tell your friends about it.
Please review us on Apple Podcasts, Google Play, and Spotify, as this will help get the word out.
