Grey Beards on Systems - 163: GreyBeards talk Ultra Ethernet with Dr J Metz, Chair of UEC steering committee, Chair of SNIA BoD, & Tech. Dir. AMD
Episode Date: April 2, 2024. Dr J Metz (@drjmetz, blog) has been on our podcast before, mostly in his role as SNIA spokesperson and BoD Chair, but this time he's here discussing some of his latest work on the Ultra Ethernet Consortium (UEC) (LinkedIn: @ultraethernet, X: @ultraethernet). The UEC is a full stack re-think of what Ethernet could do for …
Transcript
Hey everybody, Ray Lucchesi here.
Jason Collier here.
Welcome to another sponsored episode of the Greybeards on Storage podcast,
a show where we get Greybeards bloggers together with storage and system vendors
to discuss upcoming products, technologies, and trends affecting the data center today. We have with us here today Dr. Jay Metz, Technical Director in System Design at AMD and Chair of SNIA's Board of Directors.
Dr. Jay has been on our show before and has been involved with the Ultra Ethernet Consortium since the get-go.
So, Dr. Jay, why don't you tell us a little bit about yourself
and what the UltraEthernet Consortium is all about,
and what does it have to do with HPC and AI?
Sure. Well, thanks a lot for having me.
It's always a pleasure to sit on the graybeards and chat.
I've got a little gray hair myself.
So, yeah, I am the chair of the steering committee
for the UltraEthernet Consortium,
as well as the chair for SNIA, as you mentioned.
The Ultra Ethernet Consortium basically came out of stealth mode in July of last year. We're designing a full-stack solution, tuning Ethernet for AI and HPC workloads. And our approach is effectively to push the boundaries
of the performance of the network, not just at the physical layer with the speeds and feeds,
but also at the upper levels as well and make them all aligned and tuned for specific types
of workloads. So it's a rather ambitious project. It's got incredible growth in terms of the number
of members. In the last four months, we've gone from 10 to 60 different members, and we're getting new members every
week. So it's a good problem in the industry where you've got a lot of passionate people
working on-
A lot of interest in this stuff.
Yeah, exactly.
Yeah, yeah, yeah, yeah. So I mean, how do HPC and AI workloads kind of differ
from your standard Ethernet activity? I mean, it seems like it's all very similar.
Well, it's the 95/5 rule, right? You wind up with, you know, 95% of the way there being,
you know, this nice linear growth in pain. And then the last 5% of getting to where you want it to be costs an awful lot more.
So as we've been pushing the envelope when it comes to these workloads, certain things
have become obvious.
One is we've got these boundaries.
We've got compute boundaries, we've got memory boundaries, we've got networking boundaries,
scale, right?
And when you're talking about these types of workloads, you have far more
equipment to solve a particular problem than you may have had for any typical type of traditional
Ethernet environment. So let me give you an example. In a traditional general purpose
Ethernet network, what we call inside of
Ultra Ethernet, you know, very creatively, network number one, it's your multi-tenant environment. It's
where you've got your virtual machines that are on a particular node and you're
trying to connect, you know, through the network into a storage environment. That's a typical
kind of data center type of environment. When you're talking about AI though,
and you're talking about HPC,
you're talking about a specific type of workload
that goes beyond any one particular node,
which means that the network becomes far more important
because oftentimes the last bit back
doesn't necessarily come from the same place
that the first bit back does.
So that tail latency across the number of different nodes that are responding can have
a huge impact on the performance of your workload. And as we start to get really
fine-tuned in this, we go both very small, meaning inside of the node itself, where you're
talking about what memory node needs to talk to what other memory node, what buffers are talking to what buffers, and at the very large scale, because
you wind up with hundreds of thousands of nodes conceivably. And so we're trying to address the
minutiae between that really small margin for error and the very large scale that can have
unintended consequences in performance. Who has 100,000 servers in this world?
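To make the tail-latency point concrete, here is a minimal Python sketch; the node counts and the latency distribution are hypothetical, but they illustrate how a synchronous step only finishes when the slowest responder answers, so at scale the stragglers, not the average node, set the pace.

    import random

    # Hypothetical distribution: a node usually answers in ~10 microseconds,
    # but one response in a thousand is a 100-microsecond straggler.
    def node_latency_us():
        return random.gauss(10, 1) if random.random() > 0.001 else 100.0

    def step_time_us(num_nodes):
        # A tightly coupled step cannot complete until the LAST reply arrives.
        return max(node_latency_us() for _ in range(num_nodes))

    for n in (8, 1_000, 100_000):
        print(f"{n:>7} nodes -> {step_time_us(n):6.1f} us per step")
    # At 100,000 nodes a straggler shows up in nearly every step,
    # so tail latency, not average latency, dominates the workload.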
I actually want to point out one thing that's interesting.
So I've been working a bit on, you know, kind of our MI300X pieces.
And one of the things that most people don't know
is that those go in this OAM form factor inside a server
where there are eight of those MI300 cards.
And each one of those things has its own interface and its own IP address.
So when you're talking about a node, a node is not just the x86
with, like, these PCIe cards in there.
Each one of those connections into a GPU also has basically an external connection,
be it InfiniBand or some type of Ethernet fabric.
Yeah, I mean, the HPC world has always been, you know,
an InfiniBand-intensive environment.
I mean, I've seen some lately.
I think the one down in Los Alamos might be doing an Ethernet solution,
but very unusual to see Ethernet be the only network in an HPC environment.
Well, it all depends on what you're trying to accomplish too, right?
Because over the last few years,
the number one, I think the top three,
are Ethernet-based in the Top500, right?
No way.
Yeah.
So, you know, Frontier is one of the ones that's, you know, huge.
El Capitan is one that's huge off the top of my head.
I mean, the thing is that Ethernet is a viable solution for these types of environments.
And also remember that not every type of workload is tightly coupled, right?
You've got different types of HPC workloads.
You've got tightly coupled, you've got the embarrassingly parallel HPC workloads.
And when you start to add in the scale up, you know, networks for accelerators like GPUs,
you know, you've got these types of environments that have similar but not quite equivalent
requirements as well. So all of these different types of environments
have a bit of an overlap in the Venn diagram, and there's no logical reason why Ethernet couldn't be
that platform. So talk to me a little bit about the difference between tightly coupled and parallel
versus scale kind of environments. I mean, I guess I'm familiar with parallel because that seems to
be the world that we're living in today, but tightly coupled seems almost like a real-time-ish kind of discussion.
That's a very good way of looking at it. So let's say you've got a workload
that is doing massive calculations that requires another workload somewhere else to be done,
right? Genomics is a really good example.
Sequencing, those weather pattern analyses and the like.
Where when you've got these,
in order for you to actually complete your task,
you need to wait for somebody else to finish their task.
That's a tightly coupled workload.
Embarrassingly parallel, by way of comparison,
the example that I like to give is, you remember the SETI program, you know, the search for extraterrestrial intelligence.
I had friends using that stuff.
Right. So, you know, way back in the 90s, I remember loaning out the compute power on my computer at home to the SETI project, because they would effectively crunch the numbers and send them back in to the SETI project. And that's embarrassingly parallel. That's an embarrassingly
parallel workload, where you're not dependent upon my job getting done in order for you to do
your job. Right. So, I mean, we still call that HPC, even though we tend to
think of it in terms of, you know, the weather pattern type of workloads and so on.
But they're all basically on that same spectrum.
And so you wind up with these different types of requirements in order to accomplish that.
For our purposes here, we're really talking a lot about the tightly coupled ones,
which are the kind of, we have one problem that we're trying to solve.
That one problem runs for months, if not years.
And the network becomes a huge hindrance if you're not doing it right.
Or it can be a major boon if you are.
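A rough way to picture the two ends of that spectrum, as a hypothetical Python sketch (the function names are made up for illustration): embarrassingly parallel work units never wait on each other, while tightly coupled steps end at a global synchronization point where everyone waits for the slowest peer.

    # Embarrassingly parallel (SETI-style): work units are independent,
    # so they can be crunched in any order, on any node, at any time.
    def embarrassingly_parallel(work_units, crunch):
        return [crunch(unit) for unit in work_units]

    # Tightly coupled (weather/genomics-style): every step needs results
    # from all the other nodes before anyone can start the next step.
    def tightly_coupled(state, compute_step, exchange_with_peers, num_steps):
        for _ in range(num_steps):
            local = compute_step(state)
            # Global synchronization point: everyone waits for the slowest peer,
            # which is why the network matters so much more here.
            state = exchange_with_peers(local)
        return state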
Yeah. So, I mean, AI training to a large extent seems more embarrassingly
parallel these days, especially when you're getting to, you know, thousands of servers and
tens of thousands of GPUs and that sort of stuff. Is that a reasonable statement?
I think you're definitely right. I mean, so, you know, what I think people don't necessarily
realize is that you've got a kind of a tag team one-two punch when it comes to AI, right?
So when you're dealing with these transformers and you've got, you know, like, for example, GPT, right?
You know, what happens is you're dealing with the processing and ingestion of data into a format that you can then run model processing on.
And you transform that data into, you know,
a different model and an architecture that requires, you know,
summation, and then eventually you've got the generation.
So those two things though are not the same.
So if you've got summaries and summation inside of
your AI programs in order to
get to your ChatGPTs eventually, that's a parallel type of process. That's your GPU,
that's your computation-bound, that's your matrix-to-matrix multiplication. That's the kind of stuff
that you need to do in parallel. And then if you're going to be generating that text, well, that's the serial, CPU-bound and memory-bound kind of application.
They work together to ultimately create the end result.
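The serial half of that, token-at-a-time generation, can be sketched with a toy Python example; the "model" below is a made-up stand-in rather than any real framework, but it shows why generation is a dependency chain rather than one big parallel matrix multiply.

    import random

    # Toy stand-in for a model: predicts the next token from everything so far.
    def predict_next(tokens):
        return (sum(tokens) + random.randint(0, 9)) % 100

    # Generation is inherently serial: token N depends on tokens 0..N-1,
    # so it tends to be latency- and memory-bound rather than compute-bound.
    def generate(prompt, max_new_tokens=5):
        tokens = list(prompt)
        for _ in range(max_new_tokens):
            tokens.append(predict_next(tokens))  # needs the full history each time
        return tokens

    print(generate([1, 2, 3]))
    # Training, by contrast, batches huge matrix multiplications across many
    # GPUs in parallel and synchronizes gradients over the network each step.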
Yeah, yeah, yeah, yeah.
So that's a lot of network.
Yeah, yeah.
So it's a lot of network, and it's very similar to this tightly coupled solution we were talking about earlier, because you're doing one computation to get one token out, and then you're doing the next computation based on that token and everything that went before it.
Yeah, that's right.
I guess there's both aspects of this, too.
Yeah, and a lot of that generation, too, it's like the tweaking of the parameters, right?
And then when you think about it, you know, one of the things they always talk about is, like, 70 billion parameters being, you know, pumped into, like, the ChatGPT model. When you
think about it, GPT is like a trillion-ish kind of numbers, I know it's more than that now,
but it was, like, back in the old days. Yeah, yeah. But that's like having a mixing board
with, you know, a trillion different knobs on it. Right. And,
basically, you know, a lot of it is having to figure out which tweaks to make, right, and then
rerunning that stuff in parallel. So there's a lot of, yeah, you're right. You're right. And,
but not only that, you've got a lot of data that goes back and forth.
Right. So the thing is that the data at rest is not the data that is going to be processed for the training, nor is it going to be processed that way for the generation. The data has to be mutated. It has to be modified. It has to be transformed, hence the word, into a usable format. And then that has to happen over and over and over again before it ever gets to an end user or a usable result.
And that is what we're trying to address.
Because as you point out, yeah, there are a lot of similarities between HPC tightly coupled relationships and AI tightly coupled relationships.
But there's enough of a difference to prevent you from having a one-size-fits-all.
And that's what UEC is attempting to accomplish, which is we create these different profiles that provide the tuning from the physical layer all the way through the software
layer for these different types of workloads, which doesn't currently exist in the OSI model for
Ethernet. So you have different characteristics that you're able to tune the network for, almost on, like, a real-time basis?
Is that what you're saying, Dr. Jay?
Real-time is probably a little bit of a bridge too far at the moment.
I mean, you know, generally speaking, each of these different workloads happens to have certain types of networking characteristics, right?
For some workloads, you may need to have, you know, a completely
free-for-all, best effort, right?
Unreliable, unordered, you get there when you get there.
That's not usually the case in these types of workloads, but let's just say for the sake
of argument that that's our starting point.
Sometimes, for some types of workloads, HPC in particular, if you're talking RDMA, you need to have reliable ordered delivery.
There are some workloads that don't have to be ordered.
They can be reliable, but they don't have to be ordered.
So you have reliable unordered delivery.
And then, of course, you've got different types of nuances of the different types of workloads that go beyond that.
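As a small illustrative summary of those delivery modes (the labels below are descriptive, not official UEC profile names), the idea is that a workload profile asks only for the guarantees it actually needs:

    # Descriptive labels only -- not official UEC terminology.
    DELIVERY_MODES = {
        "best_effort":        dict(reliable=False, ordered=False),  # "you get there when you get there"
        "reliable_unordered": dict(reliable=True,  ordered=False),  # retransmit losses, any arrival order
        "reliable_ordered":   dict(reliable=True,  ordered=True),   # classic RDMA-style delivery
    }

    def cheapest_mode(needs_reliability, needs_ordering):
        # Pick the least restrictive mode that still satisfies the workload.
        for name, mode in DELIVERY_MODES.items():
            if mode["reliable"] >= needs_reliability and mode["ordered"] >= needs_ordering:
                return name

    print(cheapest_mode(needs_reliability=True, needs_ordering=False))  # -> reliable_unordered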
Each of these has a fingerprint on the network, right?
They have different types of impacts.
They can create congestion issues.
So for example, one of the things that happens
in very large scale RDMA based approaches
is that the ability to do multi-pathing
is rather restricted, because you have to keep it constrained to the flow level.
And so there are a lot of different ways of managing congestion in order to have it.
But it's a one-size-fits-all.
It winds up sacrificing some of the efficiency of the network in order to create the reliability that goes along with that workload. So what we're looking to do is say, look, you can do that. That's
a possibility. But as we get these larger and larger and larger systems, right,
you've heard about, you know, GPT-4, you've heard about, you know, the new
requirements; you know, I think OpenAI was talking about, what, a $7 trillion environment, right? I think that's very ambitious. You know, but people are talking about
really large systems. And if you do that, you don't want to waste the network by effectively
restricting it to individual flows. So we're looking at doing what we call packet spraying,
which is sort of, you know, multi-pathing on steroids, so to speak, because now
you've got the transport layer handling a lot of that heavy lifting to address, you know,
the congestion notification, the congestion mitigation, as well as the packet order delivery
and the security elements. So our goal is, you know, to effectively get tenfold or more the number of endpoints on this particular network, up to a million endpoints, and have maximum efficiency across the network.
So, like I said, it's a rather ambitious project, but it's designed specifically to help ease some of the financial burden that comes along with it when you have wasted capacity due to inefficiency.
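A hypothetical sketch of the difference between classic per-flow hashing, which pins a whole RDMA flow to one path, and packet spraying, where any packet can take any path and the transport layer deals with reordering and congestion; the path and packet counts are made up:

    import random

    PATHS = 8  # hypothetical number of equal-cost paths between two endpoints

    def per_flow_path(flow_id):
        # Classic flow hashing: every packet of a flow rides the same path,
        # so one big flow can saturate a single link while others sit idle.
        return hash(flow_id) % PATHS

    def sprayed_path():
        # Packet spraying: each packet may take a different path; reordering
        # and loss are handled by the transport layer, not avoided by the switch.
        return random.randrange(PATHS)

    packets = [("flow-A", seq) for seq in range(10_000)]
    flow_load  = [0] * PATHS
    spray_load = [0] * PATHS
    for flow_id, _seq in packets:
        flow_load[per_flow_path(flow_id)] += 1
        spray_load[sprayed_path()] += 1
    print("per-flow hashing:", flow_load)   # all 10,000 packets on one path
    print("packet spraying :", spray_load)  # roughly even across all 8 paths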
A million endpoints sounds obscene.
It is the same reaction I had, Ray.
Even if there's eight endpoints per server,
that's still, you know, a hundred... yeah.
Jesus Christ. Be careful, Ray.
You're going to start sounding like we only need
640K of memory.
I've been there.
That's why we're
greybeards, right?
I think we've all been there, my friend.
Yeah, yeah. So, I mean, I always thought
the problems with, you know, AI and HPC and stuff like that were all storage-based, not network-based.
You're saying that the network can be, in some of these situations, a critical bottleneck, depending on how it's configured and how it's used.
And the current Ethernet technologies, I guess I'd call it, aren't up to snuff to handle these
larger scale environments. Is that what you're saying? Well, yes and... not quite. So I think if
you think in a traditional three-tiered environment, which I think a lot of people in our
boat, the graybeards, think in, because you've got compute, network, and storage.
What is happening, though, is a movement away from that model, that three-tiered model, where the compute is now at the storage.
The network is now at the compute, and the storage is now at the network.
We are now talking about breaking apart the access to the memory.
The memory is now directly connected into the network.
The data movers that are being handled have to be much more flexible.
The integration of the acceleration and the actual processing,
because you've got the different bindings that go along with the actual workloads.
We were talking about computational bound versus memory bound, for instance.
We are breaking apart the model
into finer granularity,
and we're placing it at different parts of the network.
So a lot of these collectives
that we currently have inside of,
what would be inside of a host,
they want to put inside of the network, right?
So in-network collectives is a huge deal.
You know, the computational capability of being next to the data, right? So for instance,
I mentioned earlier that you have to restructure the data so that it can be
manipulated for these kinds of transformers. That's not native, right? That's not native
at the actual data repositories themselves. So you've got to
manipulate the data as quickly as you can, because you're going to be going back to that well often.
So computational storage, computational processing, well, compute, once you get to that point, why not
have that same functionality, that logical block functionality that you put inside of a pretty
diagram? Why don't you have that at the network level? Why don't you put that into a DPU? Why don't you put that into a SmartNIC, right?
Why not have that kind of environment there?
All this stuff almost exists today.
I mean, obviously, NVIDIA with the BlueField and all this stuff.
I mean, so the GPU has the intelligence or DPU has the intelligence to do a lot of network
functionality at the network node.
You mentioned in-network collectives.
You want to define that term?
Basically, the in-network collectives are some of the processing, like you've
heard of NCCL and RCCL, "nickel" and "rickle", those kinds of collective libraries, right?
They handle some of the processing.
It is now being discussed to be able to have some of that
processing done inside of the network, so that you're not actually traversing the entire network
to handle some of these. I have to confess, I am not a software expert. I can and have
screwed up Hello World. But what I do know is that they are a major part of the workload and where that actually gets done is part of the conversation.
And so it is an environmental construct that winds up being discussed as to where does it have to be put.
And it does underscore the major point that I was trying to make, which is that the traditional compute, network, and storage model no longer applies; as you pointed out, it's already being done. And so you want to have a protocol that is
tuned to be able to handle that kind of granularity.
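For anyone who hasn't met collectives before, here is a tiny host-side all-reduce in Python, the kind of operation NCCL/RCCL-style libraries provide; it only illustrates the traffic pattern (every node's data has to be combined and redistributed), which is what an in-network collective would move into the switches instead. The values are made up.

    # Hypothetical per-node gradients; in real training these are huge tensors.
    node_gradients = {
        "node0": [0.1, 0.2, 0.3],
        "node1": [0.4, 0.1, 0.0],
        "node2": [0.2, 0.2, 0.2],
    }

    def all_reduce_sum(per_node):
        # Host-based collective: every node's data crosses the fabric so that
        # every node ends up holding the same element-wise sum.
        summed = [sum(values) for values in zip(*per_node.values())]
        return {node: summed for node in per_node}

    print(all_reduce_sum(node_gradients))
    # An in-network collective performs the summation in the switches, so the
    # full per-node data does not have to traverse the whole fabric repeatedly.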
Yeah, I think a piece to highlight on that, when you think about, you know,
the AI construct, I mean, the way it's being processed now is we're using
a crowbar to jack existing tools in there, right? We're kind of forcing them in. And I think
one of the things really that, you know, Ultra Ethernet is trying to do is basically kind
of take, like, the good parts of the technologies we've already got and the ways in which
we can adapt those to work better with
kind of this new paradigm that we're seeing with
the generative AI style of models.
Yeah.
Yeah.
Computational storage has also been around
for a while now.
And then lots of new functionality is moving upward and that sort of stuff.
But,
you know,
you talk about transforming the data or, or, you know, I'll call it encoding or embedding the data into some other format and stuff like that.
That can be done today outboard or it can be done inboard, possibly, without too much of a problem.
I guess the problem is you have to move the data around a lot more.
What do you mean offboard or inboard?
I'm not really sure I'm following, to be honest.
Yeah, in the host or out in the storage, I guess, is a better way to say
that, you know. Okay. But here's the question, right: what's the cost of moving it into
the host? And therein lies the issue: it is extremely costly, because we're not talking
a couple of terabytes, we're talking a couple of exabytes for some of these things.
And so actually there's a really great presentation that was done by Los Alamos about not having to move the data across the network in order to handle their workload.
So what used to take nine months now took less than a month because they're only working on one workload. And by not having to move that data,
by doing that processing internally
before it actually has to get moved,
they've managed to save an incredible amount of time.
And that's not insignificant, right?
Yeah, yeah, yeah.
Dr. J, all that's available today,
or it was available back in 2022,
or even before that in computational storage.
Where does the Ultra Ethernet solution fit,
what does it do on top of that, I guess, is the question.
Well, I mean, the fact that you can do it is not the same thing
as the fact that you can do it in an open and standard way.
Ah, okay.
Okay, okay, I got you.
So you're trying to standardize this sort of additional complexity that's emerging with new hardware technologies to make it more available.
That's right. And the other part of it, too, is that this is an exceptionally hard problem to solve.
Extremely difficult problem to solve because you have to align everything from the physical layer all the way up into the software layer. So for instance, just a quick example, let's assume that we've got some new special
optical physical layer that is going to bring us to 1.6 terabit a second coming out of the host.
It's already being discussed. And I want to do this, but it's going to cost me a single
watt of extra power, a single watt. I have a million nodes. Now I've just increased my power by 1 million watts, right?
For a lot of operations, that's a non-starter, right? So we've got this butterfly
effect that we have to try to address, because of the fact that, you know, one change is not going
to be as simple as saying, I'm going to kick it over the fence and say, now the link layer has to deal with this extra
watt. And then the link layer has to kick that over the fence and say, now the transport layer
has to do it. Now the software layer has to do it. Especially if you've got these, you know, these
integration problems, like you've got storage, you have to deal with, you've got management,
you have to deal with. So doing this in a proprietary fashion is possible, but you want as many people as you can possibly get into the same room to talk
about how what somebody wants to solve is going to affect what they're going to be doing on their
end. And that is why the UltraEthernet Consortium is so popular. That's why we've moved from 10
companies in November to 60 companies in March, right? I mean, it is a very big problem that a lot of people are trying to get
into the same room to have that conversation.
It is unrealistic and probably unfair to expect any one particular company
to solve all of those problems at scale.
Yeah, I mean, you look at the DPUs.
It's proprietary.
The computational storage is proprietary to each one of the vendors that supplies that capability.
So by having the Ultra Ethernet Consortium sort of spearhead a more standardized approach to these sorts of things? Is that where you guys are going?
I mean, so these sorts of things kind of roll out in phases.
I mean, what's your roadmap for UEC, I guess, if I had to ask the right question?
Well, I mean, it's a great question.
It's one of the ones we get the most comments on.
I think one of the things that, you know, it is such a major problem that people want
to have a solution yesterday.
And I think one of the funny things that I don't think people even realize is that, you know, before the fall of '22, everybody was talking about HPC.
That was the big thing that they were trying to deal with. By the spring of '23, all of a sudden it was AI, AI
everywhere. You know, ChatGPT, you know, version three changed the game. It was a watershed moment
for the industry. But I don't think people realize it wasn't all that long ago. And so we've been
working on this pretty diligently and we've been really fast by any kind of standards-based approach, right?
So most standards take five, six, seven years to get a version 1.0 specification out the door.
We are on track for getting our 1.0 by the end of this year, which would be just a smidge over
a year, right? So we've got a general outline in a recent blog, like this week, as a matter of fact,
a couple of days ago, as we talk about this today, on the ultraethernet.org website
that talks about, you know, on track to 1.0, what it's going to entail, how it works, what the,
you know, the details are about, well, details, depends on what your level is, right?
From the perspective of those who are developing it, it's a rather high level, but for those who are brand new to it, it may be actually kind of detailed.
But the idea is that we want to get a backwards compatible approach to handling Ultra Ethernet
transports and corresponding software out the door, which can be implemented in a greenfield
environment.
I think one of the things that's important to note though, is that when people start to develop these types of
networks, they're not general purpose networks. They're networks for specific workloads and they
tend to be rather greenfield. But you also want to be able to use the existing equipment that's
available at the time. So we don't want to do a wholesale forklift upgrade for UltraEthernet.
On the contrary, we want to be able to integrate into the existing products that vendors have, the testing equipment that people can use, and so on.
And that's when we're starting with storage specifics for Ultra Ethernet,
with management techniques, with performance, you know, tuning, and compliance and certification
processes. That's all going to be done post-1.0. So the services that go around that capability,
in addition to more advanced congestion approaches.
That's going to be post 1.0.
But we're on track for being able to have those capabilities by the end of this year, 2024.
So you're saying that UEC 1.0 could be potentially implemented
on current networking hardware?
Is that what you're saying?
That is the intent.
That is the intent.
Whoa.
Now, we're starting off at a rather high level, right? So we're not talking about 10 gig Ethernet, right? I mean, that's
what I'm saying: this is a tuned Ethernet, a stack-oriented approach, and we are
talking about, you know, at the very least 100 gig or higher, probably closer to 200 or 400 gig, in terms of being able to get, you know, the true advantages of
what Ultra Ethernet is going to wind up doing. But I'm not going to be prescriptive about how
people want to use it. If they want to kick the tires with, you know, the lower speeds
in order to find out how it works, I'm more than happy for them to do that. Bizarre.
I don't think it's bizarre.
I mean, I don't want to tell you my work is bizarre.
No, no, no.
It's just that for me to see, okay, I've barely got one gig in my network and servers at home, let alone 200, 400 gig, which is,
we're talking serious, serious-size, complex environments here.
Like I said, it's ambitious. But you're right, I mean, you're absolutely right. I'm not sure most enterprises
have a hundred gig or higher yet today. So this is almost outside the enterprise.
Well, actually, one of the interesting things is, like, enterprises, for a lot of this technology,
they're looking at it, when they do their AI initiative, it's almost looking at it like it's
an appliance, right? Like they're buying something, and integrated into that is where a
lot of that connectivity is starting. But then it's, how do you tie that into the overall enterprise?
And this is where I think, you know, that Ethernet tie-in is going to be big. Yeah, yeah, yeah.
So, yeah, there might be like a backbone there that ties into some massive AI appliance of some type and plugs its capabilities into the rest of the enterprise.
Yeah.
And I think, like I said at the very beginning, we've broken down our approaches into three different types of networks.
And I think this is a good time to revisit that because we're not talking the general purpose Ethernet here.
We're not talking about where you're going to run your VMs, where you're going to run your services.
We're talking about a workload specific network and,
and it's a scale out network to start with.
And so we may have to come up with something more creative than network number
two, but that's what we're focusing on at the moment. And then, you know,
then you have the scale up network that goes along with those accelerators,
which is something on the back burner at the moment.
But obviously if you're going to have to subdivide these into very specific kind of sub workloads, right, that's
got to be integrated as well. So, you know, there's a long term goal here. But I do want to
be very clear that for anybody who wants to put in an Ultra Ethernet network, it's designed for a workload, right?
A workload type.
And it's not supposed to be, you know, this mixed traffic,
you know, general purpose network
that people might be more familiar with inside of an enterprise.
Not to say that enterprises can't use it
because they're moving into that, you know,
that modeling approach as well. They're very interested in the AI, you know, what that can do for them.
But I want to be, I just want to be very clear that we're not talking about replacing
Ethernet as a general purpose protocol.
Yeah, I mean, that plays to the HPC and AI discussion points, right? I mean,
these are the guys that have these very specific workloads.
Like you mentioned, Los Alamos has one workload.
They operate for months at a time.
And seeing a nine-fold increase is significant in that.
But, you know, AI is a little bit more diverse than that.
I mean, you know, we have experimentation.
We have training, we have
RAG, we have inferencing. We've got a lot more flexibility of workload characteristics in your
typical AI environment today. Yep. And that's why we work on these different profiles, right? Because the way that we're handling these different nuances for the workloads is more than just how they're ordered, right?
There's an entire semantic language that needs to be addressed.
And that's a core part of what makes UEC, UEC. We have an integrated packet delivery system
that goes along with the semantic understanding
of the operations to be able to handle those AI nuances.
Yeah, yeah, yeah, yeah, yeah.
Interesting.
I mean, ultimately at the end of the day,
what we're looking to do and where I think the different companies are excited about, you know, participating is that it offers up a wide variety of opportunities for building upon a standard basis for solving this problem while at the same time having your magic
sauce as well. Right. So, I honestly think that's one of the reasons why.
It was actually very surprising to us how popular it was, how quickly it got that way. We
didn't realize this at first. But, you know, it's since gotten to be that way, and we're very happy about having all the interest, to be sure.
But I think it's kind of telling that, you know, this has tapped into a particular need inside of the industry for a lingua franca for how to approach the networking capabilities.
So the other big, you know, changeover from a technology perspective is CXL. Do you see CXL being
a specific profile of a workload kind of thing in this regard?
Not particularly, because CXL was never really designed to be at this level, right?
So CXL was originally designed to be able to create an environment
that fed endpoints back into a particular host.
And so it was all about feeding into a particular host.
And then it got expanded in later versions to be able to have multi-host inside of a PCIe-based fabric that allowed for different types of endpoints to be able to communicate with other types of endpoints and share memory.
So the memory pooling and the memory sharing was a big part of what came out in CXL 3.0, especially in a hierarchical switched environment that was all effectively
on that PCIe based layer, right? That IO layer. But it was never designed to be row scale or
data center scale. And so it is not necessarily completely independent of what UEC is doing.
You could conceivably have memory pooled environments at the endpoint that could wind up with ultra
ethernet fabric endpoints that communicate across the different nodes or racks or rows
or scale, whatever it is.
So I don't see them as necessarily competing against one another, but you could have a CXL environment
inside of a host or even, you know,
a multi-chassis environment
that could wind up being a fabric endpoint
or multi-fabric endpoint for the ultra-Ethernet.
It's just, it is a concept that is a possibility,
but not a focal point for either CXL or UEC at this point.
Yeah, yeah, yeah, yeah.
So it comes down to very profile-specific activities.
So from my perspective, I see this as,
depending on the number of profiles that you have,
if your workload fits into one of these profiles,
then the UEC is the answer to your prayers.
Is that how I would see that?
Well, I guess, I mean, you know, the thing is that, if you're going to come up with a brand new requirement, right, the way that we have approached this is that we're open to suggestions,
right? So if somebody comes up with a profile that we haven't thought of, I mean,
mostly these types of things are somewhat mutually exclusive. We know the ordered or
unordered, for instance, reliable or unreliable, that's kind of a binary situation. You don't
come across too many third options in that environment. But if somebody were to, and it was applicable, I'm sure that somebody would want to come in and suggest a draft that could accommodate it. Otherwise, it would be a proprietary solution that they may want to run on their own. But as an open organization,
UltraEthernet is open to hearing about
other potential problems that need to be addressed.
And all we really need is for a good persuasive argument
as to why it needs to be done,
and the group could approve a draft to work on it.
So I can't really think, you know, in those terms
at the moment, because nobody's done such a thing, but the mechanism is there for them to do it.
Yeah, yeah. Well, I mean, you know, obviously HPC has got a very specific,
you know, specific workload that they're trying to manage and work with. And, you know, AI training and AI inferencing
and some of the transformer stuff appears to be very,
very specific to its networking characteristics.
But the advantage of something like Ethernet today
is that it pretty much can handle everything.
It may not be able to handle everything optimally, but it can handle any workload you want to put on top of it.
That sort of flexibility is not necessarily intrinsic in the UEC approach, I guess.
That's not necessarily a bad thing.
The big problem that we've had with Ethernet,
and I'm really glad you brought this up,
because the flexibility that Ethernet has had
has allowed people to create layers of abstraction
that have hit the performance problem, right?
So the more you can abstract something,
the more you have to deal with the performance issues.
What we're trying to do is say, okay, hold on a second, let's not just simply add another
layer of abstraction into the mix, because once you do that, you are
defeating the very purpose here of trying to solve the problem for these workloads, which are
performance-centric. So what we're looking to do is kind of reclaim some of that performance
capability that the stack will give us by preventing the need to create an abstraction
layer. Because ultimately, at the end of the day, that's self-defeating. So in this particular case,
what Ethernet flexibility allows us to do is come back from the edge of simply adding in
additional software stacks, you know, stacks that you might find in Kubernetes
or containers or virtualization and that kind of stuff, because you get further and further
away from the hardware.
And then that costs more money in terms of efficiency and power and cooling and so on.
We really want to streamline the lower levels because of the fact that that's what the workloads are going to be requiring in order to make that ratio of efficiency
and performance to cost. And I can assure you there are definitely breakpoints in where Ethernet
is today. And we've hit a number of them when I'm going through, you know, setting up these
compute clusters, even
things like, you know, ARP caching and stuff, from basically hardware
and software places where, you know, from a technical perspective, there
are some details of things that just break when you run it at this level, when you've
got, you know, 10 400-gig, you know, Ethernet interfaces per system that you're dealing
with.
And now you're dealing with that across a big clustered system.
You find some interesting breakpoints along the way.
Yeah, yeah, yeah, I'm sure, I'm sure.
And so by getting closer to the hardware, you increase the performance,
you're able to handle more and more nodes, more and more endpoints,
and optimize workloads to handle the network better, I guess.
I guess that's how I'm reading it.
Well, as you know, in storage, I mean, the closer you get to the wire,
the more rigid the architecture has to be.
And what we're trying to do is do that,
but not sacrifice the flexibility, because each workload has its own personality.
And so that's where all that work in the transport layer is becoming really effective.
Yeah, I mean, the challenge there, of course, is always trying to find the optimal line of where you provide the abstraction and where you go deep to the hardware kind of
thing. I guess that's not the right phraseology, but there's a line here between abstraction and
non-abstraction that you're moving down, I guess. That is correct. Yeah. And we also understand that we're not trying
to swallow the ocean, right? The workloads that we've got are very specific. They're solving a
particular problem, addressing those needs for those problems. And to that end, we have defined ourselves into the approach that we're
looking to take that could be expanded later on. But I think it would also be rather
disingenuous of us to say that we're going to be everything for everybody. We're not a panacea.
We have a very specific problem or series of problems that we're trying to address. It's an ambitious enough project to begin with. And I don't want to give the impression that this is going to necessarily replace anything and everything you've ever thought of when it comes to networking, because that would just be silly. That's not our intent. That's not our goal. And that's certainly the wrong thing to take away from any conversation about Ultra Ethernet.
Right, right, right, right, right.
Getting back to the roadmap, you mentioned, oh,
year-end-ish kind of numbers, and that your expectation is fully that current hardware might, with proper software and firmware changes,
be able to support this.
Do you have like a plug fest kind of thing?
I mean, you know, the old days of compatibility testing.
We will be doing compatibility testing.
So we've established four new work groups
in Ultra Ethernet, which include a compliance and test group
and a performance and debug group.
And the work that will eventually wind up being a hackathon or a plug fest or connectivity
type of events is in the plans.
It's in the works.
That'll probably be much more fleshed out towards the end of the year when 1.0 comes
out where the testing plans become much more codified.
But it is one of those things that is part and parcel of how we intend to
roll out Ultra Ethernet safely for the industry to adopt.
Yeah, yeah, yeah. Great, great. Well, Dr. Jay, this has been mind-opening for me.
I'm not necessarily a networking expert, but I certainly understand the complexities of
some of these workloads that we're dealing with these days. I do my best.
And the cluster node
counts seem to be extraordinarily large, but
that's a different story. Jason, any last
questions for Dr. J before we leave? No, but one comment
I do have: when we talked about, you know, we started off with
the million, kind of the million endpoints. When you think about that, and you talk about the density,
I was just talking about where we've got, you know, 10 400-gig connections per system. Now you're
looking at a hundred thousand systems, and it takes a million endpoints to connect a hundred thousand
systems, which still sounds like a lot, but not to a hyperscaler out there.
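Jason's arithmetic, written out with the same round, hypothetical numbers:

    nics_per_system = 10        # 10 x 400 GbE interfaces per node (hypothetical)
    systems = 100_000
    endpoints = nics_per_system * systems
    print(f"{endpoints:,} fabric endpoints")                                   # 1,000,000
    print(f"{endpoints * 400 / 1e6:,.0f} Pb/s of raw line rate in aggregate")  # 400 Pb/s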
Somebody that has 10 400-gig connections to a node
is beyond my comprehension.
I'll have to show it to you sometime, right?
I'd like to buy a gaggle of them sometime as well.
Buy a Ferrari, it's cheaper.
Yeah.
That's just the power.
Oh, yeah.
They're about 10,000
watts under full load.
Alright.
Dr. J, anything you'd like to
say or listen to the audience before we close?
Just a couple
of pitches for where they
can get more information, if you don't mind.
Sure, go right ahead.
What I wanted to say was that
there is a website, ultraethernet.org, where we have a very good white paper that describes the
intent of the organization and what we're hoping to accomplish. A new blog was released about the
1.0 specification that's also available on that website. We do happen to update the Twitter
account, @ultraethernet, and the LinkedIn, @ultraethernet. And of course, I can be
found on both of those locations as just me. So on Twitter, it's @drjmetz, and of course,
J Metz on LinkedIn. And anybody can ask any questions that they like. I'm more than happy
to answer questions along those veins.
All right.
Well, listen, this has been great, Dr. J.
Thank you very much for being on our show today.
I'm very happy to be here.
Thank you so much for inviting me.
And that's it for now.
Bye, Dr. J.
Bye, Jason.
Bye, Ray.
Until next time.
Next time, we will talk to another system storage technology person.
Any questions you want us to ask, please let us know.
And if you enjoy our podcast, tell your friends about it.
Please review us on Apple Podcasts, Google Play, and Spotify, as this will help get the word out. Thank you.