Grey Beards on Systems - 163: GreyBeards talk Ultra Ethernet with Dr J Metz, Chair of UEC steering committee, Chair of SNIA BoD, & Tech. Dir. AMD
Episode Date: April 2, 2024. Dr J Metz (@drjmetz, blog) has been on our podcast before, mostly in his role as SNIA spokesperson and BoD Chair, but this time he's here discussing some of his latest work on the Ultra Ethernet Consortium (UEC) (LinkedIn: @ultraethernet, X: @ultraethernet). The UEC is a full stack re-think of what Ethernet could do for …
Transcript
Hey everybody, Ray Lucchesi here.
Jason Collier here.
Welcome to another sponsored episode of the Greybeards on Storage podcast,
a show where we get Greybeards bloggers together with storage and system vendors
to discuss upcoming products, technologies, and trends affecting the data center today. We have with us here today Dr. Jay Metz, Technical Director in System Design at AMD and Chair of SNIA's Board of Directors.
Dr. Jay has been on our show before and has been involved with the Ultra Ethernet Consortium since the get-go.
So, Dr. Jay, why don't you tell us a little bit about yourself
and what the UltraEthernet Consortium is all about,
and what does it have to do with HPC and AI?
Sure. Well, thanks a lot for having me.
It's always a pleasure to sit on the graybeards and chat.
I've got a little gray hair myself.
So, yeah, I am the chair of the steering committee
for the UltraEthernet Consortium,
as well as the chair for SNIA, as you mentioned.
The Ultra Ethernet Consortium basically came out of stealth mode in July of last year. We're designing a full-stack solution, tuning Ethernet for AI and HPC workloads. And our approach is effectively to push the boundaries
of the performance of the network, not just at the physical layer with the speeds and feeds,
but also at the upper levels as well and make them all aligned and tuned for specific types
of workloads. So it's a rather ambitious project. It's got incredible growth in terms of the number
of members. In the last four months, we've gone from 10 to 60 different members, and we're getting new members every
week. So it's a good problem in the industry where you've got a lot of passionate people
working on-
A lot of interest in this stuff.
Yeah, exactly.
Yeah, yeah, yeah, yeah. So I mean, how do HPC and AI workloads kind of differ
from your standard Ethernet activity? I mean, it seems like it's all very similar.
Well, it's the 95/5 rule, right? You wind up with, you know, 95% of the way there being,
you know, this nice linear growth in pain. And then the last 5% of getting to where you want it to be costs an awful lot more.
So as we've been pushing the envelope when it comes to these workloads, certain things
have become obvious.
One is we've got these boundaries.
We've got compute boundaries, we've got memory boundaries, we've got networking boundaries,
scale, right?
And when you're talking about these types of workloads, you have far more
equipment to solve a particular problem than you may have had for any typical type of traditional
Ethernet environment. So let me give you an example. In a traditional general purpose
Ethernet network, what we call inside of
Ultra Ethernet, you know, very creatively, network number one, it's your multi-tenant environment. It's
where you've got your virtual machines that are on a particular node and you're
trying to connect, you know, through the network into a storage environment. That's a typical
kind of data center type of environment. When you're talking about AI though,
and you're talking about HPC,
you're talking about a specific type of workload
that goes beyond any one particular node,
which means that the network becomes far more important
because oftentimes the last bit back
doesn't necessarily come from the same place
that the first bit back does.
So that tail latency across the number of different nodes that are responding can have
a huge impact on the performance of your workload. And as we start to get really
fine-tuned in this, we go both very small, meaning inside of the node itself, where you're
talking about what memory node needs to talk to what other memory node, what buffers are talking to what buffers, and at the very large scale, because
you wind up with hundreds of thousands of nodes conceivably. And so we're trying to address the
minutiae between that really small margin for error and the very large scale that can have
unintended consequences in performance. Who has 100,000 servers in this world?
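To make the tail-latency point concrete, here is a minimal Python sketch; the node counts and the latency distribution are hypothetical, but they illustrate how a synchronous step only finishes when the slowest responder answers, so at scale the stragglers, not the average node, set the pace.

    import random

    # Hypothetical distribution: a node usually answers in ~10 microseconds,
    # but one response in a thousand is a 100-microsecond straggler.
    def node_latency_us():
        return random.gauss(10, 1) if random.random() > 0.001 else 100.0

    def step_time_us(num_nodes):
        # A tightly coupled step cannot complete until the LAST reply arrives.
        return max(node_latency_us() for _ in range(num_nodes))

    for n in (8, 1_000, 100_000):
        print(f"{n:>7} nodes -> {step_time_us(n):6.1f} us per step")
    # At 100,000 nodes a straggler shows up in nearly every step,
    # so tail latency, not average latency, dominates the workload.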
I actually want to point out one thing that's interesting.
So I've been working a bit on, you know, kind of our MI300X pieces.
And one of the things that most people don't know
is that those go in this OAM form factor inside a server
where there are eight of those MI300 cards.
And each one of those things has its own interface and its own IP address.
So when you're talking about a node, a node is not just the x86
with, like, these PCIe cards in there.
Each one of those connections into a GPU also has basically an external connection,
be it InfiniBand or some type of Ethernet fabric.
Yeah, I mean, the HPC world has always been, you know,
an InfiniBand-intensive environment.
I mean, I've seen some lately.
I think the one down in Los Alamos might be doing an Ethernet solution,
but very unusual to see Ethernet be the only network in an HPC environment.
Well, it all depends on what you're trying to accomplish too, right?
Because over the last few years,
the number one, I think the top three,
are Ethernet-based in the Top500, right?
No way.
Yeah.
So, you know, Frontier is one of the ones that's, you know, huge.
El Capitan is one that's huge off the top of my head.
I mean, the thing is that Ethernet is a viable solution for these types of environments.
And also remember that not every type of workload is tightly coupled, right?
You've got different types of HPC workloads.
You've got tightly coupled, you've got the embarrassingly parallel HPC workloads.
And when you start to add in the scale up, you know, networks for accelerators like GPUs,
you know, you've got these types of environments that have similar but not quite equivalent
requirements as well. So all of these different types of environments
have a bit of an overlap in the Venn diagram, and there's no logical reason why Ethernet couldn't be
that platform. So talk to me a little bit about the difference between tightly coupled and parallel
versus scale kind of environments. I mean, I guess I'm familiar with parallel because that seems to
be the world that we're living in today, but tightly coupled seems almost like a real-time-ish kind of discussion.
That's a very good way of looking at it. So let's say you've got a workload
that is doing massive calculations that requires another workload somewhere else to be done,
right? Genomics is a really good example.
Sequencing, those weather pattern analyses and the like.
Where when you've got these,
in order for you to actually complete your task,
you need to wait for somebody else to finish their task.
That's a tightly coupled workload.
Embarrassingly parallel, by way of comparison,
the example that I like to give is, you remember the SETI program, you know, the search for extraterrestrial intelligence.
I had friends using that stuff.
Right. So, you know, way back in the 90s, I remember loaning out the compute power on my computer at home to the SETI project, because they would effectively crunch the numbers and send them back in to the SETI project. And that's embarrassingly parallel. That's an embarrassingly
parallel workload, where you're not dependent upon my job getting done in order for you to do
your job. Right. So, I mean, we still call that HPC, even though we tend to
think of it in terms of, you know, the weather pattern type of workloads and so on.
But they're all basically on that same spectrum.
And so you wind up with these different types of requirements in order to accomplish that.
For our purposes here, we're really talking a lot about the tightly coupled ones,
which are the kind of, we have one problem that we're trying to solve.
That one problem runs for months, if not years.
And the network becomes a huge hindrance if you're not doing it right.
Or it can be a major boon if you are.
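A rough way to picture the two ends of that spectrum, as a hypothetical Python sketch (the function names are made up for illustration): embarrassingly parallel work units never wait on each other, while tightly coupled steps end at a global synchronization point where everyone waits for the slowest peer.

    # Embarrassingly parallel (SETI-style): work units are independent,
    # so they can be crunched in any order, on any node, at any time.
    def embarrassingly_parallel(work_units, crunch):
        return [crunch(unit) for unit in work_units]

    # Tightly coupled (weather/genomics-style): every step needs results
    # from all the other nodes before anyone can start the next step.
    def tightly_coupled(state, compute_step, exchange_with_peers, num_steps):
        for _ in range(num_steps):
            local = compute_step(state)
            # Global synchronization point: everyone waits for the slowest peer,
            # which is why the network matters so much more here.
            state = exchange_with_peers(local)
        return state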
Yeah. So, I mean, AI training to a large extent seems more embarrassingly
parallel these days, especially when you're getting to, you know, thousands of servers and
tens of thousands of GPUs and that sort of stuff. Is that a reasonable statement?
I think you're definitely right. I mean, so, you know, what I think people don't necessarily
realize is that you've got a kind of a tag team one-two punch when it comes to AI, right?
So when you're dealing with these transformers and you've got, you know, like, for example, GPT, right?
You know, what happens is you're dealing with the processing and ingestion of data into a format that you can then run model processing on.
And you transform that data into, you know,
a different model and an architecture that requires, you know,
summation, and then eventually you've got the generation.
So those two things though are not the same.
So if you've got summaries and summation inside of
your AI programs in order to
get to your ChatGPTs eventually, that's a parallel type of process. That's your GPU,
that's your computation-bound, that's your matrix-to-matrix multiplication. That's the kind of stuff
that you need to do in parallel. And then if you're going to be generating that text, well, that's the serial, CPU-bound and memory-bound kind of application.
They work together to ultimately create the end result.
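The serial half of that, token-at-a-time generation, can be sketched with a toy Python example; the "model" below is a made-up stand-in rather than any real framework, but it shows why generation is a dependency chain rather than one big parallel matrix multiply.

    import random

    # Toy stand-in for a model: predicts the next token from everything so far.
    def predict_next(tokens):
        return (sum(tokens) + random.randint(0, 9)) % 100

    # Generation is inherently serial: token N depends on tokens 0..N-1,
    # so it tends to be latency- and memory-bound rather than compute-bound.
    def generate(prompt, max_new_tokens=5):
        tokens = list(prompt)
        for _ in range(max_new_tokens):
            tokens.append(predict_next(tokens))  # needs the full history each time
        return tokens

    print(generate([1, 2, 3]))
    # Training, by contrast, batches huge matrix multiplications across many
    # GPUs in parallel and synchronizes gradients over the network each step.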
Yeah, yeah, yeah, yeah.
So that's a lot of network.
Yeah, yeah.
So it's a lot of network, and it's very similar to this tightly coupled solution we were talking about earlier, because you're doing one computation to get one token out, and then you're doing the next computation based on that token and everything that went before it.
Yeah, that's right.
I guess there's both aspects of this, too.
Yeah, and a lot of that generation, too, it's like the tweaking of the parameters, right?
And then when you think about it, you know, one of the things they always talk about is, like, 70 billion parameters being, you know, pumped into, like, the ChatGPT model. When you
think about it, GPT is like a trillion-ish kind of numbers, I know it's more than that now,
but it was, like, back in the old days. Yeah, yeah. But that's like having a mixing board
with, you know, a trillion different knobs on it. Right. And,
basically, you know, a lot of it is having to figure out which tweaks to make, right, and then
rerunning that stuff in parallel. So there's a lot of, yeah, you're right. You're right. And,
but not only that, you've got a lot of data that goes back and forth.
Right. So the thing is that the data at rest is not the data that is going to be processed for the training, nor is it going to be processed that way for the generation. The data has to be mutated. It has to be modified. It has to be transformed, hence the word, into a usable format. And then that has to happen over and over and over again before it ever gets to an end user or a usable result.
And that is what we're trying to address.
Because as you point out, yeah, there are a lot of similarities between HPC tightly coupled relationships and AI tightly coupled relationships.
But there's enough of a difference to prevent you from having a one-size-fits-all.
And that's what UEC is attempting to accomplish, which is we create these different profiles that provide the tuning from the physical layer all the way through the software
layer for these different types of workloads, which doesn't currently exist in the OSI model for
Ethernet. So you have different characteristics that you're able to tune the network for, almost on, like, a real-time basis?
Is that what you're saying, Dr. Jay?
Real-time is probably a little bit of a bridge too far at the moment.
I mean, you know, generally speaking, each of these different workloads happens to have certain types of networking characteristics, right?
For some workloads, you may need to have, you know, a completely
free-for-all, best effort, right?
Unreliable, unordered, you get there when you get there.
That's not usually the case in these types of workloads, but let's just say for the sake
of argument that that's our starting point.
Sometimes, for some types of workloads, HPC in particular, if you're talking RDMA, you need to have reliable ordered delivery.
There are some workloads that don't have to be ordered.
They can be reliable, but they don't have to be ordered.
So you have reliable unordered delivery.
And then, of course, you've got different types of nuances of the different types of workloads that go beyond that.
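As a small illustrative summary of those delivery modes (the labels below are descriptive, not official UEC profile names), the idea is that a workload profile asks only for the guarantees it actually needs:

    # Descriptive labels only -- not official UEC terminology.
    DELIVERY_MODES = {
        "best_effort":        dict(reliable=False, ordered=False),  # "you get there when you get there"
        "reliable_unordered": dict(reliable=True,  ordered=False),  # retransmit losses, any arrival order
        "reliable_ordered":   dict(reliable=True,  ordered=True),   # classic RDMA-style delivery
    }

    def cheapest_mode(needs_reliability, needs_ordering):
        # Pick the least restrictive mode that still satisfies the workload.
        for name, mode in DELIVERY_MODES.items():
            if mode["reliable"] >= needs_reliability and mode["ordered"] >= needs_ordering:
                return name

    print(cheapest_mode(needs_reliability=True, needs_ordering=False))  # -> reliable_unordered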
Each of these has a fingerprint on the network, right?
They have different types of impacts.
They can create congestion issues.
So for example, one of the things that happens
in very large scale RDMA based approaches
is that the ability to do multi-pathing
is rather restricted, because you have to keep it constrained to the flow level.
And so there are a lot of different ways of managing congestion in order to have it.
But it's a one-size-fits-all.
It winds up sacrificing some of the efficiency of the network in order to create the reliability that goes along with that workload. So what we're looking to do is say, look, you can do that. That's
a possibility. But as we get these larger and larger and larger systems, right,
you've heard about, you know, GPT-4, you've heard about, you know, the new
requirements; you know, I think OpenAI was talking about, what, a $7 trillion environment, right? I think that's very ambitious. You know, but people are talking about
really large systems. And if you do that, you don't want to waste the network by effectively
restricting it to individual flows. So we're looking at doing what we call packet spraying,
which is sort of, you know, multi-pathing on steroids, so to speak, because now
you've got the transport layer handling a lot of that heavy lifting to address, you know,
the congestion notification, the congestion mitigation, as well as the packet order delivery
and the security elements. So our goal is, you know, to effectively get tenfold or more the number of endpoints on this particular network, up to a million endpoints, and have maximum efficiency across the network.
So, like I said, it's a rather ambitious project, but it's designed specifically to help ease some of the financial burden that comes along with it when you have wasted capacity due to inefficiency.
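A hypothetical sketch of the difference between classic per-flow hashing, which pins a whole RDMA flow to one path, and packet spraying, where any packet can take any path and the transport layer deals with reordering and congestion; the path and packet counts are made up:

    import random

    PATHS = 8  # hypothetical number of equal-cost paths between two endpoints

    def per_flow_path(flow_id):
        # Classic flow hashing: every packet of a flow rides the same path,
        # so one big flow can saturate a single link while others sit idle.
        return hash(flow_id) % PATHS

    def sprayed_path():
        # Packet spraying: each packet may take a different path; reordering
        # and loss are handled by the transport layer, not avoided by the switch.
        return random.randrange(PATHS)

    packets = [("flow-A", seq) for seq in range(10_000)]
    flow_load  = [0] * PATHS
    spray_load = [0] * PATHS
    for flow_id, _seq in packets:
        flow_load[per_flow_path(flow_id)] += 1
        spray_load[sprayed_path()] += 1
    print("per-flow hashing:", flow_load)   # all 10,000 packets on one path
    print("packet spraying :", spray_load)  # roughly even across all 8 paths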
A million endpoints sounds obscene.
It is the same reaction I had, Ray.
Even if there's eight endpoints per server,
that's still, you know, a hundred... yeah.
Jesus Christ. Be careful, Ray.
You're going to start sounding like we only need
640K of memory.
I've been there.
That's why we're
greybeards, right?
I think we've all been there, my friend.
Yeah, yeah. So, I mean, I always thought
the problems with, you know, AI and HPC and stuff like that were all storage-based, not network-based.
You're saying that the network can be, in some of these situations, a critical bottleneck, depending on how it's configured and how it's used.
And the current Ethernet technologies, I guess I'd call it, aren't up to snuff to handle these
larger scale environments. Is that what you're saying? Well, yes and... not quite. So I think if
you think in a traditional three-tiered environment, which I think a lot of people in our
boat, the graybeards, think in, because you've got compute, network, and storage.
What is happening, though, is a movement away from that model, that three-tiered model, where the compute is now at the storage.
The network is now at the compute, and the storage is now at the network.
We are now talking about breaking apart the access to the memory.
The memory is now directly connected into the network.
The data movers that are being handled have to be much more flexible.
The integration of the acceleration and the actual processing,
because you've got the different bindings that go along with the actual workloads.
We were talking about computational bound versus memory bound, for instance.
We are breaking apart the model
into finer granularity,
and we're placing it at different parts of the network.
So a lot of these collectives
that we currently have inside of,
what would be inside of a host,
they want to put inside of the network, right?
So in-network collectives is a huge deal.
You know, the computational capability of being next to the data, right? So for instance,
I mentioned earlier that you have to restructure the data so that it can be
manipulated for these kinds of transformers. That's not native, right? That's not native
at the actual data repositories themselves. So you've got to
manipulate the data as quickly as you can, because you're going to be going back to that well often.
So computational storage, computational processing, well, compute, once you get to that point, why not
have that same functionality, that logical block functionality that you put inside of a pretty
diagram? Why don't you have that at the network level? Why don't you put that into a DPU? Why don't you put that into a SmartNIC, right?
Why not have that kind of environment there?
All this stuff almost exists today.
I mean, obviously, NVIDIA with the BlueField and all this stuff.
I mean, so the GPU has the intelligence or DPU has the intelligence to do a lot of network
functionality at the network node.
You mentioned in-network collectives.
You want to define that term?
Basically, the in-network collectives are some of the processing, like you've
heard of NCCL and RCCL, "nickel" and "rickle", those kinds of collective libraries, right?
They handle some of the processing.
It is now being discussed to be able to have some of that
processing done inside of the network, so that you're not actually traversing the entire network
to handle some of these. I have to confess, I am not a software expert. I can and have
screwed up Hello World. But what I do know is that they are a major part of the workload and where that actually gets done is part of the conversation.
And so it is an environmental construct that winds up being discussed as to where does it have to be put.
And it does underscore the major point that I was trying to make, which is that the traditional compute, network, and storage model no longer applies; as you pointed out, it's already being done. And so you want to have a protocol that is
tuned to be able to handle that kind of granularity.
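For anyone who hasn't met collectives before, here is a tiny host-side all-reduce in Python, the kind of operation NCCL/RCCL-style libraries provide; it only illustrates the traffic pattern (every node's data has to be combined and redistributed), which is what an in-network collective would move into the switches instead. The values are made up.

    # Hypothetical per-node gradients; in real training these are huge tensors.
    node_gradients = {
        "node0": [0.1, 0.2, 0.3],
        "node1": [0.4, 0.1, 0.0],
        "node2": [0.2, 0.2, 0.2],
    }

    def all_reduce_sum(per_node):
        # Host-based collective: every node's data crosses the fabric so that
        # every node ends up holding the same element-wise sum.
        summed = [sum(values) for values in zip(*per_node.values())]
        return {node: summed for node in per_node}

    print(all_reduce_sum(node_gradients))
    # An in-network collective performs the summation in the switches, so the
    # full per-node data does not have to traverse the whole fabric repeatedly.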
Yeah, I think a piece to highlight on that, when you think about, you know,
the AI construct, I mean, the way it's being processed now is we're using
a crowbar to jack existing tools in there, right? We're kind of forcing them in. And I think
one of the things really that, you know, Ultra Ethernet is trying to do is basically kind
of take, like, the good parts of the technologies we've already got and the ways in which
we can adapt those to work better with
kind of this new paradigm that we're seeing with
the generative AI style of models.
Yeah.
Yeah.
Computational storage has also been around
for a while now.
And then lots of new functionality is moving upward and that sort of stuff.
But,
you know,
you talk about transforming the data or, or, you know, I'll call it encoding or embedding the data into some other format and stuff like that.
That can be done today outboard or it can be done inboard, possibly, without too much of a problem.
I guess the problem is you have to move the data around a lot more.
What do you mean offboard or inboard?
I'm not really sure I'm following, to be honest.
Yeah, in the host or out in the storage, I guess, is a better way to say
that, you know. Okay. But here's the question, right: what's the cost of moving it into
the host? And therein lies the issue: it is extremely costly, because we're not talking
a couple of terabytes, we're talking a couple of exabytes for some of these things.
And so actually there's a really great presentation that was done by Los Alamos about not having to move the data across the network in order to handle their workload.
So what used to take nine months now took less than a month because they're only working on one workload. And by not having to move that data,
by doing that processing internally
before it actually has to get moved,
they've managed to save an incredible amount of time.
And that's not insignificant, right?
Yeah, yeah, yeah.
Dr. J, all that's available today,
or it was available back in 2022,
or even before that in computational storage.
Where does the Ultra Ethernet solution fit,
what does it do on top of that, I guess, is the question.
Well, I mean, the fact that you can do it is not the same thing
as the fact that you can do it in an open and standard way.
Ah, okay.
Okay, okay, I got you.
So you're trying to standardize this sort of additional complexity that's emerging with new hardware technologies to make it more available.
That's right. And the other part of it, too, is that this is an exceptionally hard problem to solve.
Extremely difficult problem to solve because you have to align everything from the physical layer all the way up into the software layer. So for instance, just a quick example, let's assume that we've got some new special
optical physical layer that is going to bring us to 1.6 terabit a second coming out of the host.
It's already being discussed. And I want to do this, but it's going to cost me a single
watt of extra power, a single watt. I have a million nodes. Now I've just increased my power by 1 million watts, right?
For a lot of operations, that's a non-starter, right? So we've got this butterfly
effect that we have to try to address, because of the fact that, you know, one change is not going
to be as simple as saying, I'm going to kick it over the fence and say, now the link layer has to deal with this extra
watt. And then the link layer has to kick that over the fence and say, now the transport layer
has to do it. Now the software layer has to do it. Especially if you've got these, you know, these
integration problems, like you've got storage, you have to deal with, you've got management,
you have to deal with. So doing this in a proprietary fashion is possible, but you want as many people as you can possibly get into the same room to talk
about how what somebody wants to solve is going to affect what they're going to be doing on their
end. And that is why the UltraEthernet Consortium is so popular. That's why we've moved from 10
companies in November to 60 companies in March, right? I mean, it is a very big problem that a lot of people are trying to get
into the same room to have that conversation.
It is unrealistic and probably unfair to expect any one particular company
to solve all of those problems at scale.
Yeah, I mean, you look at the DPUs.
It's proprietary.
The computational storage is proprietary to each one of the vendors that supplies that capability.
So by having the Ultra Ethernet Consortium sort of spearhead a more standardized approach to these sorts of things? Is that where you guys are going?
I mean, so these sorts of things kind of roll out in phases.
I mean, what's your roadmap for UEC, I guess, if I had to ask the right question?
Well, I mean, it's a great question.
It's one of the ones we get the most comments on.
I think one of the things that, you know, it is such a major problem that people want
to have a solution yesterday.
And I think one of the funny things that I don't think people even realize is that, you know, before the fall of '22, everybody was talking about HPC.
That was the big thing that they were trying to deal with. By the spring of '23, all of a sudden it was AI, AI
everywhere. You know, ChatGPT, you know, version three changed the game. It was a watershed moment
for the industry. But I don't think people realize it wasn't all that long ago. And so we've been
working on this pretty diligently and we've been really fast by any kind of standards-based approach, right?
So most standards take five, six, seven years to get a version 1.0 specification out the door.
We are on track for getting our 1.0 by the end of this year, which would be just a smidge over
a year, right? So we've got a general outline in a recent blog, like this week, as a matter of fact,
a couple of days ago, as we talk about this today, on the ultraethernet.org website
that talks about, you know, on track to 1.0, what it's going to entail, how it works, what the,
you know, the details are about, well, details, depends on what your level is, right?
From the perspective of those who are developing it, it's a rather high level, but for those who are brand new to it, it may be actually kind of detailed.
But the idea is that we want to get a backwards compatible approach to handling Ultra Ethernet
transports and corresponding software out the door, which can be implemented in a greenfield
environment.
I think one of the things that's important to note though, is that when people start to develop these types of
networks, they're not general purpose networks. They're networks for specific workloads and they
tend to be rather greenfield. But you also want to be able to use the existing equipment that's
available at the time. So we don't want to do a wholesale forklift upgrade for UltraEthernet.
On the contrary, we want to be able to integrate into the existing products that vendors have, the testing equipment that people can use, and so on.
And that's when we're starting with storage specifics for Ultra Ethernet,
with management techniques, with performance, you know, tuning, and compliance and certification
processes. That's all going to be done post-1.0. So the services that go around that capability,
in addition to more advanced congestion approaches.
That's going to be post 1.0.
But we're on track for being able to have those capabilities by the end of this year, 2024.
So you're saying that UEC 1.0 could be potentially implemented
on current networking hardware?
Is that what you're saying?
That is the intent.
That is the intent.
Whoa.
Now, we're starting off at a rather high level, right? So we're not talking about 10 gig Ethernet, right? I mean, that's
what I'm saying: this is a tuned Ethernet, a stack-oriented approach, and we are
talking about, you know, at the very least 100 gig or higher, probably closer to 200 or 400 gig, in terms of being able to get, you know, the true advantages of
what Ultra Ethernet is going to wind up doing. But I'm not going to be prescriptive about how
people want to use it. If they want to kick the tires with, you know, the lower speeds
in order to find out how it works, I'm more than happy for them to do that. Bizarre.
I don't think it's bizarre.
I mean, I don't want to tell you my work is bizarre.
No, no, no.
It's just that for me to see, okay, I've barely got one gig in my network and servers at home, let alone 200, 400 gig, which is,
we're talking serious, serious-size, complex environments here.
Like I said, it's ambitious. But you're right, I mean, you're absolutely right. I'm not sure most enterprises
have a hundred gig or higher yet today. So this is almost outside the enterprise.
Well, actually, one of the interesting things is, like, enterprises, for a lot of this technology,
they're looking at it, when they do their AI initiative, it's almost looking at it like it's
an appliance, right? Like they're buying something, and integrated into that is where a
lot of that connectivity is starting. But then it's, how do you tie that into the overall enterprise?
And this is where I think, you know, that Ethernet tie-in is going to be big. Yeah, yeah, yeah.
So, yeah, there might be like a backbone there that ties into some massive AI appliance of some type and plugs its capabilities into the rest of the enterprise.
Yeah.
And I think, like I said at the very beginning, we've broken down our approaches into three different types of networks.
And I think this is a good time to revisit that because we're not talking the general purpose Ethernet here.
We're not talking about where you're going to run your VMs, where you're going to run your services.
We're talking about a workload specific network and,
and it's a scale out network to start with.
And so we may have to come up with something more creative than network number
two, but that's what we're focusing on at the moment. And then, you know,
then you have the scale up network that goes along with those accelerators,
which is something on the back burner at the moment.
But obviously if you're going to have to subdivide these into very specific kind of sub workloads, right, that's
got to be integrated as well. So, you know, there's a long term goal here. But I do want to
be very clear that for anybody who wants to put in an Ultra Ethernet network, it's designed for a workload, right?
A workload type.
And it's not supposed to be, you know, this mixed traffic,
you know, general purpose network
that people might be more familiar with inside of an enterprise.
Not to say that enterprises can't use it
because they're moving into that, you know,
that modeling approach as well. They're very interested in the AI, you know, what that can do for them.
But I want to be, I just want to be very clear that we're not talking about replacing
Ethernet as a general purpose protocol.
Yeah, I mean, that plays to the HPC and AI discussion points, right? I mean,
these are the guys that have these very specific workloads.
Like you mentioned, Los Alamos has one workload.
They operate for months at a time.
And seeing a nine-fold increase is significant in that.
But, you know, AI is a little bit more diverse than that.
I mean, you know, we have experimentation.
We have training, we have
RAG, we have inferencing. We've got a lot more flexibility of workload characteristics in your
typical AI environment today. Yep. And that's why we work on these different profiles, right? Because the way that we're handling these different nuances for the workloads is more than just how they're ordered, right?
There's an entire semantic language that needs to be addressed.
And that's a core part of what makes UEC, UEC. We have an integrated packet delivery system
that goes along with the semantic understanding
of the operations to be able to handle those AI nuances.
Yeah, yeah, yeah, yeah, yeah.
Interesting.
I mean, ultimately at the end of the day,
what we're looking to do and where I think the different companies are excited about, you know, participating is that it offers up a wide variety of opportunities for building upon a standard basis for solving this problem while at the same time having your magic
sauce as well. Right. So, I honestly think that's one of the reasons why.
It was actually very surprising to us how popular it was, how quickly it got that way. We
didn't realize this at first. But, you know, it's since gotten to be that way, and we're very happy about having all the interest, to be sure.
But I think it's kind of telling that, you know, this has tapped into a particular need inside of the industry for a lingua franca for how to approach the networking capabilities.
So the other big, you know, changeover from a technology perspective is CXL. Do you see CXL being
a specific profile of a workload kind of thing in this regard?
Not particularly, because CXL was never really designed to be at this level, right?
So CXL was originally designed to be able to create an environment
that fed endpoints back into a particular host.
And so it was all about feeding into a particular host.
And then it got expanded in later versions to be able to have multi-host inside of a PCIe-based fabric that allowed for different types of endpoints to be able to communicate with other types of endpoints and share memory.
So the memory pooling and the memory sharing was a big part of what came out in CXL 3.0, especially in a hierarchical switched environment that was all effectively
on that PCIe based layer, right? That IO layer. But it was never designed to be row scale or
data center scale. And so it is not necessarily completely independent of what UEC is doing.
You could conceivably have memory pooled environments at the endpoint that could wind up with ultra
ethernet fabric endpoints that communicate across the different nodes or racks or rows
or scale, whatever it is.
So I don't see them as necessarily competing against one another, but you could have a CXL environment
inside of a host or even, you know,
a multi-chassis environment
that could wind up being a fabric endpoint
or multi-fabric endpoint for the ultra-Ethernet.
It's just, it is a concept that is a possibility,
but not a focal point for either CXL or UEC at this point.
Yeah, yeah, yeah, yeah.
So it comes down to very profile-specific activities.
So from my perspective, I see this as,
depending on the number of profiles that you have,
if your workload fits into one of these profiles,
then the UEC is the answer to your prayers.
Is that how I would see that?
Well, I guess, I mean, you know, the thing is that, if you're going to come up with a brand new requirement, right, the way that we have approached this is that we're open to suggestions,
right? So if somebody comes up with a profile that we haven't thought of, I mean,
mostly these types of things are somewhat mutually exclusive. We know the ordered or
unordered, for instance, reliable or unreliable, that's kind of a binary situation. You don't
come across too many third options in that environment. But if somebody were to, and it was applicable, I'm sure that somebody would want to come in and suggest a draft that could accommodate it. Otherwise, it would be a proprietary solution that they may want to run on their own. But as an open organization,
UltraEthernet is open to hearing about
other potential problems that need to be addressed.
And all we really need is for a good persuasive argument
as to why it needs to be done,
and the group could approve a draft to work on it.
So I can't really think, you know, in those terms
at the moment, because nobody's done such a thing, but the mechanism is there for them to do it.
Yeah, yeah. Well, I mean, you know, obviously HPC has got a very specific,
you know, specific workload that they're trying to manage and work with. And, you know, AI training and AI inferencing
and some of the transformer stuff appears to be very,
very specific to its networking characteristics.
But the advantage of something like Ethernet today
is that it pretty much can handle everything.
It may not be able to handle everything optimally, but it can handle any workload you want to put on top of it.
That sort of flexibility is not necessarily intrinsic in the UEC approach, I guess.
That's not necessarily a bad thing.
The big problem that we've had with Ethernet,
and I'm really glad you brought this up,
because the flexibility that Ethernet has had
has allowed people to create layers of abstraction
that have hit the performance problem, right?
So the more you can abstract something,
the more you have to deal with the performance issues.
What we're trying to do is say, okay, hold on a second, let's not just simply add another
layer of abstraction into the mix, because once you do that, you are
defeating the very purpose here of trying to solve the problem for these workloads, which are
performance-centric. So what we're looking to do is kind of reclaim some of that performance
capability that the stack will give us by preventing the need to create an abstraction
layer. Because ultimately, at the end of the day, that's self-defeating. So in this particular case,
what Ethernet flexibility allows us to do is come back from the edge of simply adding in
additional software stacks, you know, stacks that you might find in Kubernetes
or containers or virtualization and that kind of stuff, because you get further and further
away from the hardware.
And then that costs more money in terms of efficiency and power and cooling and so on.
We really want to streamline the lower levels because of the fact that that's what the workloads are going to be requiring in order to make that ratio of efficiency
and performance to cost. And I can assure you there are definitely breakpoints in where Ethernet
is today. And we've hit a number of them when I'm going through, you know, setting up these
compute clusters, even
things like, you know, ARP caching and stuff, from basically hardware
and software places where, you know, from a technical perspective, there
are some details of things that just break when you run it at this level, when you've
got, you know, 10 400-gig, you know, Ethernet interfaces per system that you're dealing
with.
And now you're dealing with that across a big clustered system.
You find some interesting breakpoints along the way.
Yeah, yeah, yeah, I'm sure, I'm sure.
And so by getting closer to the hardware, you increase the performance,
you're able to handle more and more nodes, more and more endpoints,
and optimize workloads to handle the network better, I guess.
I guess that's how I'm reading it.
Well, as you know, in storage, I mean, the closer you get to the wire,
the more rigid the architecture has to be.
And what we're trying to do is do that,
but not sacrifice the flexibility, because each workload has its own personality.
And so that's where all that work in the transport layer is becoming really effective.
Yeah, I mean, the challenge there, of course, is always trying to find the optimal line of where you provide the abstraction and where you go deep to the hardware kind of
thing. I guess that's not the right phraseology, but there's a line here between abstraction and
non-abstraction that you're moving down, I guess. That is correct. Yeah. And we also understand that we're not trying
to swallow the ocean, right? The workloads that we've got are very specific. They're solving a
particular problem, addressing those needs for those problems. And to that end, we have defined ourselves into the approach that we're
looking to take that could be expanded later on. But I think it would also be rather
disingenuous of us to say that we're going to be everything for everybody. We're not a panacea.
We have a very specific problem or series of problems that we're trying to address. It's an ambitious enough project to begin with. And I don't want to give the impression that this is going to necessarily replace anything and everything you've ever thought of when it comes to networking, because that would just be silly. That's not our intent. That's not our goal. And that's certainly the wrong thing to take away from any conversation about Ultra Ethernet.
Right, right, right, right, right.
Getting back to the roadmap, you mentioned, oh,
year-end-ish kind of numbers, and that your expectation is fully that current hardware might, with proper software and firmware changes,
be able to support this.
Do you have like a plug fest kind of thing?
I mean, you know, the old days of compatibility testing.
We will be doing compatibility testing.
So we've established four new work groups
in Ultra Ethernet, which include a compliance and test group
and a performance and debug group.
And the work that will eventually wind up being a hackathon or a plug fest or connectivity
type of events is in the plans.
It's in the works.
That'll probably be much more fleshed out towards the end of the year when 1.0 comes
out where the testing plans become much more codified.
But it is one of those things that is part and parcel of how we intend to
roll out Ultra Ethernet safely for the industry to adopt.
Yeah, yeah, yeah. Great, great. Well, Dr. Jay, this has been mind-opening for me.
I'm not necessarily a networking expert, but I certainly understand the complexities of
some of these workloads that we're dealing with these days. I do my best.
And the cluster node
counts seem to be extraordinarily large, but
that's a different story. Jason, any last
questions for Dr. J before we leave? No, but one comment
I do have: when we talked about, you know, we started off with
the million, kind of the million endpoints. When you think about that, and you talk about the density,
I was just talking about where we've got, you know, 10 400-gig connections per system. Now you're
looking at a hundred thousand systems, and it takes a million endpoints to connect a hundred thousand
systems, which still sounds like a lot, but not to a hyperscaler out there.
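Jason's arithmetic, written out with the same round, hypothetical numbers:

    nics_per_system = 10        # 10 x 400 GbE interfaces per node (hypothetical)
    systems = 100_000
    endpoints = nics_per_system * systems
    print(f"{endpoints:,} fabric endpoints")                                   # 1,000,000
    print(f"{endpoints * 400 / 1e6:,.0f} Pb/s of raw line rate in aggregate")  # 400 Pb/s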
Somebody that has 10 400-gig connections to a node
is beyond my comprehension.
I'll have to show it to you sometime, right?
I'd like to buy a gaggle of them sometime as well.
Buy a Ferrari, it's cheaper.
Yeah.
That's just the power.
Oh, yeah.
They're about 10,000
watts under full load.
Alright.
Dr. J, anything you'd like to
say or listen to the audience before we close?
Just a couple
of pitches for where they
can get more information, if you don't mind.
Sure, go right ahead.
What I wanted to say was that
there is a website, ultraethernet.org, where we have a very good white paper that describes the
intent of the organization and what we're hoping to accomplish. A new blog was released about the
1.0 specification that's also available on that website. We do happen to update the Twitter
account, @ultraethernet, and the LinkedIn, @ultraethernet. And of course, I can be
found on both of those locations as just me. So on Twitter, it's @drjmetz, and of course,
J Metz on LinkedIn. And anybody can ask any questions that they like. I'm more than happy
to answer questions along those veins.
All right.
Well, listen, this has been great, Dr. J.
Thank you very much for being on our show today.
I'm very happy to be here.
Thank you so much for inviting me.
And that's it for now.
Bye, Dr. J.
Bye, Jason.
Bye, Ray.
Until next time.
Next time, we will talk to another system storage technology person.
Any questions you want us to ask, please let us know.
And if you enjoy our podcast, tell your friends about it.
Please review us on Apple Podcasts, Google Play, and Spotify, as this will help get the word out. Thank you.