Grey Beards on Systems - 173: GreyBeards Year End 2025 podcast
Episode Date: December 29, 2025. Once again in 2025, AI was all the news. We are seeing some cracks in the NVIDIA moat, but it's early yet. Broadcom's VMware moves and sky-high DRAM pricing round out the year for us. Listen to our year-end podcast to learn more.
Transcript
Hey everybody, Ray Lucchesi here with Keith Townsend.
Jason Collier here.
Welcome to another sponsored episode of the GreyBeards on Storage podcast, a show where we get GreyBeards bloggers together with storage and systems vendors to discuss upcoming products, technologies, and trends affecting the data center today.
This is our annual year-end podcast, where we discuss the year's technology trends and what to look for next year. Keith and Jason, what would you like to talk about?
You know what? How can we not talk about the AI bubble? Like, this is crazy.
Yeah, AI is going crazy. I mean, the whole agenda is driving it: more GenAI, more AI in the enterprise, and stuff like that. It is going insane.
I don't know if I'd make the argument that we're seeing AI actually in the enterprise,
but we are seeing it on the training side.
Yeah.
We are seeing it on the enterprise side.
So it's been one of those things where we're seeing an increase, and we're actually seeing kind of a shift from training to inferencing models. We've seen a lot of folks establishing what I'd call a three-tier model, where they've got an LLM running as a base, then there's a smaller language model sitting on top of it, and then they're doing retrieval-augmented generation on top of that. And that's starting to feed into a lot of the agentic AI for enterprise types of programs.
Yeah, so it's been interesting.
We have definitely seen a shift in folks going from looking specifically at doing training to the point of, hey, there are certain established large language models out there that we can run on top of vLLM, Ollama, or other types of platforms, and then start to develop applications on top of those that are using them as the base.
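For a rough sense of that pattern in practice, here's a minimal sketch of calling an open model served locally over HTTP; the Ollama endpoint, port, and model name are illustrative assumptions, not anything named on the show:

```python
# Minimal sketch: call a locally hosted open model through Ollama's HTTP API.
# The model name and port are assumptions for illustration only.
import json
import urllib.request

def ask_local_model(prompt: str, model: str = "llama3") -> str:
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",   # Ollama's default endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(ask_local_model("Summarize last quarter's incident reports in one paragraph."))
```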
So you're saying a full deployment then of, I would call it the AI stack,
but I mean, it's more than just that, right?
Yeah.
And a good example, you know, of kind of how that works.
You know, the LLM, like, people are just going out and taking the big ones, taking, say, GPT-OSS or DeepSeek or whatever you want to put in as the LLM. Actually, we're also seeing a combination of people using many different types underneath that, depending on the application. Certain LLMs support those a little bit better.
But then they're also going through and looking at more specialized models.
This is where we're seeing training in the enterprise: they're training those smaller language models that are very, what I'd say, vertical, application-specific. So things like manufacturing, legal, medical, right? Say if it's legal, they're taking the LexisNexis database, training on that, and then also training on local law, depending on the city, the county, the state.
And you've seen all this happening on premises, not necessarily in the cloud, or a little bit of both?
A little bit of both.
And I'd say a lot of those are kind of the tier-two folks, think of, you know, your CoreWeaves, Vultrs, and companies like that. They're doing a lot of the training for those.
But then also we're starting to see it in the enterprise, you know, some of the Fortune 500s that are wanting to deploy, you know, some of those bigger stacks.
And they're actually wanting to do, you know, run a lot of the inferencing stack.
So I'd say those tier twos are getting a lot of training business.
And then the enterprises themselves are running a lot of the inferencing stuff.
on-prem.
On-prem.
Keith and I were talking with Articulate a couple months back,
and they've got a lot of, I would say,
vertical-specific models that they're selling as a service, actually, right?
Yeah, so very much that motion of saying,
okay, we're going to take a smaller large language model,
train it specifically on a domain,
or build one from scratch,
depending on the domain and the application.
And then run it in a dedicated inferencing cluster, whether that's on-prem or in the public cloud.
But the idea is pretty well established.
And we're starting to see, I don't know, Jason, if we're going to actually see the inference flip next year.
One of my predictions, or my only prediction, for 2026 is that we're going to see a shakeout of all the valuation bubbling that's happening with these smaller AI companies, on whether or not inferencing is going to be adopted at a pace that justifies the frothiness of this investment.
Yeah, I mean, AI startup valuations are going bonkers.
Going through the roof. I saw one that was, I don't know, a couple months old, and valued at half a billion dollars.
It almost makes me want to go back into AI in a big way.
Yeah, I think OpenAI, as of this recording, just got an investment this week or last week that valued them at $900 billion.
This is a company that's burning through what, $10 billion or more, at whatever pace they're doing it.
Yeah, yeah.
My overall guess, and the reality, is that the training piece on those large language models, the really ginormous ones those guys work on, just the sheer amount of compute and the cost associated with it is enormous. I think the last time it was roughly estimated, it was something like $10 billion in training costs, in basically the power, cooling, and compute resources necessary to train one of the latest GPTs. Yeah, GPT-4, GPT-5, I don't remember which. Right, right. But it's a lot of resources, and when you think about it, an enterprise isn't going to have that many resources to do it.
So this is where I think the big companies are going to be the hyperscalers and the likes of OpenAI, Grok with xAI, Microsoft, Google with Gemini, IBM with Granite.
Those are going to be the types of companies that are going to train those LLMs.
and I would guess, and here's my prediction,
I think there are probably going to be about 12 to 24 of those companies
that are going to be training those models.
Yeah, I question how many of those models the industry actually needs.
Yeah.
Because, Jason, I'm seeing that motion that you mentioned,
which is you take an open source, a large open source general-purpose model, put a smaller model on top of it, some RAG associated with it, some data optimization,
and that's what agentic AI workloads are being built on, not these huge models that are
great general purpose or general knowledge models, but not necessarily great for, you know,
these structured tasks within a specific domain.
Absolutely.
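To make that RAG layer concrete, here's a toy sketch of the retrieval step, with made-up documents and a word-count stand-in for embeddings; a real deployment would use a trained embedding model and a vector store:

```python
# Toy retrieval-augmented generation step: rank domain documents against a
# query and prepend the best matches to the prompt. The word-count "embedding"
# is a stand-in; real systems use a trained embedding model and a vector DB.
import math
from collections import Counter

DOCS = [
    "Maintenance bulletin: replace spindle bearings every 2,000 hours.",
    "Legal memo: county zoning ordinance 14-2 limits warehouse height.",
    "Runbook: restart the inference pods before rotating TLS certificates.",
]

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_prompt(question: str, k: int = 2) -> str:
    q = embed(question)
    ranked = sorted(DOCS, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]
    context = "\n".join(ranked)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("How often do we replace spindle bearings?"))
```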
Yeah, especially when you look at it. I think there was some statistic that 60% of the knowledge used to train one of the models, and I don't remember which one it was, but 60% of it came from Reddit threads. And I'm just like, wow, that's some really trustworthy information right there.
That would explain my experience with Claude.
Yeah.
Yeah, that's, that's, that's bizarre.
I think the other thing that seems to be occurring to a large extent is that we're starting to see a little crack in the NVIDIA armor with respect to GPUs. I mean, Google has, forever really, been talking about TPUs, and their TPUs are more energy efficient, and it's almost like they're starting to commercialize that as a solution rather than just use it internally as the hyperscaler. But also, in MLPerf, we saw that AMD GPUs are starting to make a splash there.
Yeah, I mean, I think on the GPU side, definitely the proliferation. NVIDIA's had a really good head start with it, right? They've got a big head start on that. And I'd say the Google TPU is great, and what AMD's doing with the Instinct components is also good, and I know there's been a lot of investment by both companies to really up their software game. I think that GPU proliferation is really coming from both of those folks upping their software game. Google's really good at doing software, right? But I'd say the thing they're a little bit behind on with the TPU is that it's very domain specific, and it has been very specific to running on Google's cloud for a very long time. So how does that translate to on-prem, or is it just going to be running in the cloud? And then it also has a pretty small developer ecosystem at this point, right? The reason NVIDIA has had such success with that GPU proliferation is the fact that CUDA will run on almost anything. I mean, it'll run on a college kid's graphics card, right?
Yeah.
He's going to be playing World of Warcraft on it one day, and then running an AI PhD thesis on it the next, right? Yeah, yeah. There's a strong argument for, like, the RTX 3000-series cards, which are a couple of generations old, because they still have plenty of RAM, high-speed memory, and the tensor cores are fast enough.
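As a rough illustration of why those older consumer cards still matter, here's a sketch (assuming PyTorch with CUDA is installed) that reports whatever card is present and runs an FP16 matmul, the kind of operation the tensor cores accelerate:

```python
# Sketch: confirm a consumer NVIDIA card is visible and run an FP16 matmul on it,
# the kind of work the tensor cores accelerate. Assumes PyTorch with CUDA installed.
import torch

if torch.cuda.is_available():
    dev = torch.device("cuda:0")
    print("Found:", torch.cuda.get_device_name(dev),
          "with", torch.cuda.get_device_properties(dev).total_memory // 2**20, "MiB")
    a = torch.randn(4096, 4096, dtype=torch.float16, device=dev)
    b = torch.randn(4096, 4096, dtype=torch.float16, device=dev)
    c = a @ b                      # runs on tensor cores where the hardware supports FP16
    torch.cuda.synchronize()
    print("Result shape:", tuple(c.shape))
else:
    print("No CUDA-capable GPU visible; falling back to CPU would be much slower.")
```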
Jason, I agree with you. When this whole TPU story started to hit the investment market, I scratched my head thinking, yeah, AMD is probably way ahead of Google when it comes to enabling software outside of its ecosystem. I always pictured the TPU as being tied to Vertex, and moving it away from Vertex...
I don't see it as tied to Vertex, but I mean, they've got their own software stack associated with it and stuff like that. But I've used it in the past for some deep learning activities. Again, mostly in Google Cloud and stuff like that.
But, yeah, that's my point: it's tied to, when I say Vertex, I mean the Google services. Yeah. So they abstract it away. You're going to get curated models, et cetera, and you might be using their TPUs underneath. But that's way different than, say, going to Hugging Face, pulling down a quantized 16-bit model, and running it directly on a TPU. That's a whole different level of software enablement.
Yeah, whereas you can just pull it down and run it on, like, any NVIDIA graphics card right now, right? And honestly, it wasn't until AMD came out with the 7900s, the RX 7900, as a consumer card, that you had the capability of running the ROCm stack on it, right?
And it's one of those things, like I said: the best way to predict how a lot of these applications are going to be built in the future is, well, if you give it to the kids, guess what? Those kids are going to graduate college, and then 10 years later they're the director of software engineering, and what are they going to use? What they used in college, right? And having been able to do this for the last 10 years on pretty much any NVIDIA card that popped out the door, they already know the stack, right? They know the stack. They know how it's built.
And that's been, I'd say, a big factor. When you were talking about the AI bubble as well, you look at NVIDIA and its valuation, and it's like, you've got all those kids that are now directors of software engineering going out and building a lot of these kind of AI platform tools. They're building on what they knew. And NVIDIA was the first mover in that game.
When you look at the AI bubble, like, NVIDIA has a market capitalization larger than the entire pharmaceutical industry.
Yeah, yeah. Well, you know, it pays to be first, especially when everybody in the world wants your technology. So, Keith, you mentioned something about what's going on with Apple?
Yeah, so Apple released RDMA support over Thunderbolt, which, uh,
for the Mac Studio, with, you know, up to, I think it goes up to a terabyte of RAM, I can't remember.
Oh God, yeah, yeah, on the Studio.
Okay. But imagine, you know, it's shared RAM. It's shared RAM between the GPU and the processors, the GPU and the regular system. But this is an amazing hack. One, it's super fast. I want to say it's like 833 gigabytes a second, something in that range.
So they've seeded a bunch of YouTubers with a cluster of four of these, and they're getting
50% of the performance of an H200 in what basically equates to a $40,000 cluster.
But the problem is still, you know, I've watched these, and these are pretty good system admins and AI enthusiasts, and the problem is software. If the model isn't optimized for MLX, it just simply doesn't run. And MLX is Apple's, you know, proprietary framework for AI.
And, you know, we all go back to the same problem. Yeah, we don't want NVIDIA to win inference like it has training, but software has been the barrier. Yeah.
Yeah, and it's been their, I don't know, moat, right? The deep moat that surrounds their whole world to some extent. And it's been a long time coming. I mean, ROCm's made significant efforts to try to be more complete and such, and you can see that in the MLPerf numbers and stuff like that.
Yeah. And a big part of that, too, is HIP, basically HIP and hipSYCL and stuff like that. That's been a real driver in being able to take existing CUDA code and move it over. But then I think it's also the developments in those software stacks around enablement for the higher-level languages, right?
So there you're talking, like, basically PyTorch?
Yeah, I mean, that's where primarily most of the AI development is being done, with PyTorch, at this point.
So, you know, I know AMD's put a lot of effort into really pushing the ROCm and PyTorch communities to pull together.
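One practical upshot of that ROCm and PyTorch work is that the ROCm build of PyTorch reuses the torch.cuda API, so existing scripts run largely unchanged; a quick check like this sketch, assuming PyTorch is installed, reports which backend you're actually on:

```python
# Sketch: the ROCm build of PyTorch reuses the torch.cuda API, so existing
# CUDA-oriented scripts run on AMD GPUs without code changes. This just reports
# which backend the installed PyTorch build is using.
import torch

if torch.version.hip is not None:
    backend = f"ROCm/HIP {torch.version.hip}"
elif torch.version.cuda is not None:
    backend = f"CUDA {torch.version.cuda}"
else:
    backend = "CPU-only build"

print("PyTorch backend:", backend)
print("GPU visible:", torch.cuda.is_available())
```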
And then also, when you're thinking about spanning it out, the RDMA and connecting things: when you've got all these large language models, there's only so much you can fit in a single system, but a lot of these models are now actually spanning across systems, and that's what the RDMA is really being utilized for. You can take and do model splitting across multiple systems.
Yeah, so this whole model parallelization, right? You can split the model at different layers, you can split the data, you can do both. I mean, it almost takes both when you're talking trillion-parameter models, right?
Yeah, absolutely. I mean, you have to be able to split it, and then you get into all kinds of other fun things: synchronization, networking. And it's the same stuff we used to have when we were doing NUMA, when we were doing multi-core stuff, right?
Yeah, yeah, only across the network now.
Across the network, right. So there's like a thousand and one problems multiplied by a thousand.
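Here's a toy version of that layer splitting, assuming a box with two GPUs; the point is just that activations have to hop across the device boundary at the split, and across machines that hop becomes the RDMA traffic being described:

```python
# Toy pipeline split: first half of a model on GPU 0, second half on GPU 1.
# Activations are copied across the device boundary each forward pass; across
# machines, that copy becomes an RDMA transfer. Assumes two CUDA/ROCm GPUs.
import torch
import torch.nn as nn

class SplitModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Sequential(nn.Linear(4096, 1024)).to("cuda:1")

    def forward(self, x):
        h = self.stage0(x.to("cuda:0"))
        return self.stage1(h.to("cuda:1"))   # the cross-device hop

model = SplitModel()
out = model(torch.randn(8, 1024))
print(out.shape, out.device)
```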
So, yeah. You know, people keep asking me, are we in an AI bubble? Is it going to end? What's going to happen? It's a tough thing to talk about, but I think we're starting to see some evidence that things are not nearly as rosy as they seem.
And when these big companies start going into, you know,
the debt market to try to fund their data centers.
And these data centers are huge.
I mean, talk about billions of dollars, you know.
And it's, it's a question of whether the market is there for it.
I don't know.
I have had a lot of conversations about this. I was actually having a conversation with a senator and an executive director at the Department of Energy about it, because the reality is, I mean, there's such a power constraint, too, for this.
Oh, God, yeah.
I mean, an example like the Helios rack that's going to have, like, 72 GPUs, and it's in basically the OCP ultra-wide form factor. I mean, it's going to draw over 250,000 watts of power,
and that's just one rack. Now imagine a data center full of those things, right? We're talking about the megawatts and gigawatts that are going to be pulled into these AI data centers. So there's a big concern about, one, how do you build it, how do you power it, how do you cool it? Right, right. And those are all liquid-cooled, pretty much all liquid-cooled systems.
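The back-of-the-envelope math is what makes people nervous; in this quick sketch, the 250 kW per rack comes from the discussion above, while the rack count and PUE are assumptions for illustration:

```python
# Back-of-the-envelope power math. The 250 kW/rack figure comes from the
# discussion above; the rack count and PUE are assumptions for illustration.
rack_kw = 250          # ~250,000 W per GPU rack
racks = 200            # hypothetical hall
pue = 1.3              # assumed power usage effectiveness (cooling overhead)

it_mw = rack_kw * racks / 1000
total_mw = it_mw * pue
print(f"IT load: {it_mw:.0f} MW, facility load at PUE {pue}: {total_mw:.0f} MW")
```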
From a power perspective, too, this is very interesting. Actually, I found out that Purdue University started the very first accredited program on basically building and maintaining SMRs.
Hmm. So that's the small modular reactors, and it's been in the news lately too, you know, they're reopening Three Mile Island, right?
Oh yeah. And Microsoft bought the whole thing, right?
I know. Yeah. They've got a spinoff company that's basically doing it.
But it's going to require a lot of power. I mean, I think the estimates I heard were that we're at 30% of the capacity we need to be at by 2030.
All right.
Well, we could talk AI for an hour and a half, I think, maybe longer.
The other topic of interest I thought might be, you know, what's happening with VMware?
I've been a tech field day guy for, God, I'd say not quite 20 years, but close on.
and VMware has always been really, really active in tech field day events.
And lately, I've seen Broadcom networking, but I haven't seen VMware.
What's going on there?
Yeah, I actually did a buyer room on this where I, you know, get a bunch of customers
and put them in a room with a sponsoring vendor and then have a conversation around
what these customers are doing.
And the insights that I'm getting from these customers are pretty clear.
There are three types of VMware customers. There are VMware customers who are exceptionally hesitant to stay VMware customers. There are VMware customers who have just said, hey, you know what, I am going to bite the bullet and not worry about finding or replatforming hypervisors and just pay the vig (I don't know if that's only a New York and Chicago thing, to say vig), pay the tax, the Broadcom tax, to stay a VMware customer via kind of the same IBM Z type of setup. And then there's the third customer: the customer who is spending $100 million a year in cloud, and they see VCF as a way to eventually save 20% of that spend on-prem. So why not give Broadcom five million bucks a year?
That's the customer that Broadcom is now focused on. And I don't know if that's the typical Tech Field Day content consumer.
This is Hock Tan calling up the CTO of one of these, as Jason alluded to, cheaply priced from a stock perspective Fortune 50 pharma customers and convincing them to put all their R&D on premises on a VCF cloud.
Right, right, right.
I mean, VCF's still a package that can be, you know, quite easy to use.
Oh, no, it's not easy to use. And it's not just a package. It is the package.
Right.
If you want to buy VMware vSphere today by itself, you can't.
Yeah.
And that was the Tech Field Day audience, the audience that wanted to buy vSphere. If you were to install your bits, what you're enabled for today from Broadcom, it is going to, by default, install the entire VCF suite. You may not use it, but it will install it, because that is the product.
I thought they had a smaller license for just the vSphere solution. I think it was called vSphere Foundation or something to that effect.
They just announced that they've sunsetted that for Europe; it's no longer available in Europe.
So that's the indication of where this is.
Broadcom wants the VCF customer. If you're stuck and you want to continue to just operate your vSphere estate the way you have in the past, okay, fine. You're going to buy VCF, but you'll get to continue to use just vSphere if you want. But you're going to buy VCF.
Everything. Yeah, yeah, yeah, yeah.
Well, I mean, and the value really is there. I mean, if you're running a VCF stack and bought into the whole thing, you could save money rather than running on a hyperscaler someplace.
That's the theory.
Especially at that level of expenditure.
Yeah, that's the theory.
The insights that I've gotten from both the folks deploying the software and these executives is that it takes an entire reimagining and transformation of your staff, because now you have to run a cloud. And running the cloud means that network, storage, compute, security, those silos can't be silos anymore. They have to operate as one team, and that is a much, much more difficult challenge than how to install and get a cloud running. It is operating the cloud that has been the barrier to adoption.
Yeah, I heard somebody make a funny joke on a podcast the other day. They said you can't just take a guy in an IT t-shirt, put a DevOps t-shirt on him, and have him figure it out, right? And like you said, that whole cloud shift is a shift in fundamental skills and a shift in, honestly, the way a lot of things are run. It's interesting.
I was actually at KubeCon. Was that back in November? I think it was back in November, early December, something like that. And I was talking to Stu at Red Hat, right? It's been really, really interesting. They're also seeing that shift: folks looking at how they take those virtualized workloads, and from conversations with him, a lot of the folks are looking at really re-orchestrating things into a more cloud-native style of fashion as far as their app workloads. So they are getting more DevOps-centric in the way that they're looking at it. But then they're also looking at things like OpenShift, where you can effectively utilize KubeVirt to do virtual machine conversion on top of it as well, to move over to a different platform.
Yeah, yeah. Well, I mean, Red Hat's been a player in that space for quite a while. You mentioned the IBM Z model. It's sort of interesting that Broadcom is taking VMware into that mode of operation, where you lock customers in and you continue to provide value to a certain segment of the market, and you're not really growing your actual overall market. You're not trying to add customers, you're just basically trying to retain current customers and put them in a headlock, right?
Yeah, well, yeah.
And it's working. The renewal rates are exceptionally high. Like, they're not getting small businesses, so when you see renewal rates between 70 and 80 percent, people are like, that's horrible for software. But that 20 to 30 percent that are not renewing are the customers that Broadcom doesn't want.
Yeah, yeah. And you can see that from basically
their channel strategy as well. I mean, what they did after the acquisition, I think they cut like 3,000 channel partners. And then even with some of the bigger channel partners, they ended up dissecting relationships they had with Fortune 500s and took all the Fortune 500s direct.
Yeah, yeah. Sounds like IBM more and more, actually.
Yeah, sounds like IBM more and more. It's interesting how those plays keep rolling out left and right. All right, well, I think we've played out the VMware discussion. The other thing that's of interest lately is the price of hardware, which seems to be going through the roof. DRAM is being significantly affected here.
Yeah, I think this harks back to the original conversation. We focused on these growing data centers, and these growing data centers, you're putting something inside of them.
Yeah, but it's compute, it's power, it's cooling. I mean, there's lots of stuff in there, but...
Yeah, but those components are the same components that I need. You know, I'm looking at my NVIDIA Spark that's on my desk now, and I made the argument that the Strix was a better value at nearly half the cost. Because of the price of the components, primarily the RAM, the Strix is now the same price as, if not more than, the NVIDIA Spark.
Yeah, yeah. And it's DRAM pricing. I mean, DRAM pricing is going up because, like,
You think about one of these standard AI boxes that's going in a data center.
I mean, there's the stuff that's on the truck today, and then think about the stuff that's coming next year. But the stuff that's on the truck today, you've got a system that's probably got a terabyte and a half of RAM in it just for the CPU component. And then each one of those video cards is probably going to have anywhere from about 96 to 192 gig of HBM on it. And a lot of the retooling, I mean, that's basically where the margin dollars are and where the markup is in that industry, right? Because you're going to buy one of those AI systems that consumes like 4U of space and has, when you add it all up, like three or four terabytes' worth of memory in that thing, all the way from the DRAM, the LPDDR, the HBM. All of those components are going in there, but they're going into a really, really high-margin product.
but it's going into a really, really high-margin product.
And because one of those boxes is going to sell for like half a million dollars.
And then you've got a data center full of them.
And you got probably, you know, like you got a minimum of eight of them in a rack,
and then you got a data center full of racks, right?
And so, but it's an exceptionally high-margin value product.
So if you were a memory manufacturer, where are you going to spend your production time? Like, what are you going to fab? You're going to fab the stuff that's going to make you more margin dollars.
But usually that, you know, that drives a vicious spiral.
So certainly, as volume production starts to go up, those prices should start coming down again someday, eh?
Well, where's the volume of production? Yeah, there's no more volume of production. That's the problem. Like, you know, our friends at Solidigm are probably producing as many of their big drives, their 122-terabyte drives...
Yeah, yeah.
They're probably producing as many of those as they can sell.
Yeah, yeah.
Yeah, it's going to be interesting, because honestly, I think this opens up an interesting market opportunity for ways in which you could potentially do some type of memory tiering technology from a software perspective.
I've been talking about memory tiering for a decade now. Well, this might be a driver.
So we're seeing this in the enthusiast community, EXO with these Mac clusters and these Linux clusters: memory sharing. They're loading these big models and using basically memory sharing technologies to do that.
I see. Sharing that memory between the GPU and the CPU makes it a little bit more effective, more efficient, I guess.
Yeah, and I've seen a few startups as well that are actually doing memory tiering from basically faster NAND and flash components, utilizing special kernel stuff they can do to effectively take a system with, like, 64 gig of RAM and basically make it 128 gig by doing intelligent hot-blocking of what's going into memory and what goes onto SSD.
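Here's a crude sketch of that hot/cold idea: keep recently used blocks in a small in-memory tier and spill the rest to disk. Real products do this in the kernel with much smarter placement; this is just the shape of it:

```python
# Crude two-tier store: a bounded in-memory "DRAM" tier with LRU eviction to an
# on-disk tier, mimicking hot-blocking between RAM and SSD. Illustration only;
# real tiering happens in the kernel/driver with page-level heuristics.
from collections import OrderedDict
import shelve

class TieredStore:
    def __init__(self, hot_capacity: int, path: str = "cold_tier.db"):
        self.hot = OrderedDict()            # fast tier (RAM)
        self.cold = shelve.open(path)       # slow tier (SSD-backed)
        self.capacity = hot_capacity

    def put(self, key: str, value):
        self.hot[key] = value
        self.hot.move_to_end(key)
        while len(self.hot) > self.capacity:
            old_key, old_val = self.hot.popitem(last=False)
            self.cold[old_key] = old_val     # demote coldest block to disk

    def get(self, key: str):
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key]
        value = self.cold[key]               # miss: promote back to the hot tier
        self.put(key, value)
        return value

store = TieredStore(hot_capacity=2)
for i in range(5):
    store.put(f"block{i}", bytes(16))
print(store.get("block0"))                   # served from the cold tier, then promoted
```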
And this harks back to the old skill that's become new again: networking. Like, you need ultra-low-latency, high-bandwidth networking for any of this to work.
Yeah, one of the things that impressed me the most about that Spark box is the fact that it's got a ConnectX-7 with two 200-gig ports on it.
Yeah, and the last thing I read, my friends at Kamiwaza told me this, is that NVIDIA is limiting the OS, the Spark OS on the boxes, to clustering two boxes. But I've seen, again, enthusiasts take four of them, and basically it's hardware. You can throw your own OS on it and go at it. This is basically a 5070 with 96 gig of RAM in each box.
Yeah.
And I've also seen, it was Patrick Kennedy, he actually took one of those. You can get these MikroTik switches now that are, like, QSFP56, and you can hook, like, five of those Sparks up together.
Yeah, those, that switch is like 500 bucks.
It's 500 bucks.
Yeah.
Yeah, 500 bucks for 200 gig, which...
Yeah.
That's crazy.
That's nuts.
Thank God.
I think I need to have one in my basement.
You may have one in your basement.
I wouldn't be surprised if they start putting these in home routers.
I don't know.
It's just crazy.
Jesus, Jesus.
Yeah.
So the other topic of discussion was, you know, where is Ethernet and where is InfiniBand and all that stuff? I don't know, Jason, you must have gone to Supercomputing this year?
Actually, I did not.
Oh.
But a bunch of people on my team did.
So what's the story on the networking these days?
Well, it is interesting.
I think there are definitely some more open initiatives out there that are starting to make themselves present in the industry. One of those is UALink. AMD kind of kicked off UALink, and what's actually happened is a lot of the other folks in the industry, like hyperscalers and folks like that, have actually gotten on board with the whole UALink initiative. The whole point of it is to have more of an open standard for interconnection of in-rack systems. So, kind of a way where you can use, like, a PCIe fabric, but then hook it together. It really kind of started out as us open-sourcing components of the Infinity Fabric to do interconnection of some of these components, and it's actually gotten quite a bit of traction. So with that, it's basically a more open version of kind of what InfiniBand's been providing.
So, yeah, Infinity Fabric was your interconnect within the chip. And that's how interconnected these things have gotten, you know.
And I mean, the Infinity Fabric, when you look at our MI300-based systems, they're hooked into these OAM boards where there are eight of those Instinct GPUs on a single board, but they're all interconnected via that Infinity Fabric.
And, yeah, start talking about Infinity Fabric, and you could put, like, a memory tiering solution within that structure and stuff like that now.
Now we're talking real performance.
Yeah, and that's not the only option.
So think about if someone gets serious about the thing that consumer machines are starting to come with,
Thunderbolt 5. What is Thunderbolt 5's bandwidth, something like 280 gigabits per second? Somewhere up there; I know it's over 200 gigabits per second. Imagine if there was a Thunderbolt 5 networking switch.
Yeah, right, right. Now we're talking enterprise class.
Yeah, yeah. So an interesting one to check out: ualinkconsortium.org. They've got the 1.0 specification up there. But yeah, it's all about low-latency, high-bandwidth interconnect, and like I said, it's primarily for communication and switching in the AI compute pods. That's actually what's going to be driving that Helios, that big Helios system we're putting out with the 72 GPUs.
Good. And that's a 4U solution?
No, that's a rack. It's in the OCP wide form factor.
So, I mean, the thing's literally, I think, three feet wide.
Ah, almost as wide as I am.
Yeah, but you know what?
I just need two more racks.
I have my rack. I took the rack when I closed down the data center earlier this year, or last year, I forget when it was.
And I couldn't get rid of the rack, so it's sitting in my garage.
I think I got a use case for it now.
You're also going to need a reinforced floor, because I can tell you the thing weighs about as much as a Ford F-350.
Well, I have an F-250 in the garage.
I'll just pull that out.
Pull it out.
You just mount it in there.
Oh, and don't mind the fact that you need a quarter of a megawatt to run it.
That might be the real problem.
You can have an SMR in the backyard maybe.
Yeah, that's right.
Well, they actually have the micro SMRs as well. Those will push about 10 megawatts, and you can fit three of them on the back of a semi-trailer.
You know, I was just about to say I was on the road trip the other day.
I think I was headed to Tennessee.
I saw one with three on the back of the semi.
So absolutely, it was probably heading to Memphis,
and you can tell.
Yeah, you know where that's going.
You know where that's going.
But my wife was like, what is that?
And I'm like, oh, I know exactly what that is.
Why would you need a generator that big?
Yeah.
Yeah, basically to power the Tesla Megapacks that are powering Colossus.
Yeah, yeah, that's bizarre.
All right, gents, I think we're about running out of time here.
It's been great.
Any last comments from anybody?
Well, it's been, I think, a very interesting 2025. AI has definitely made a lot of fundamental shifts in the way that we've been thinking about infrastructure, data centers, interconnects. It's been a phenomenal year of, I think, a lot of innovation from a lot of different companies. And I can't wait until 2026.
It's going to be an exciting year.
Yeah, certainly not stopping there.
Keith?
Yeah, I'm amazed at how quickly this has evolved.
Not just from a training perspective and how good the models have gotten, but how consumable AI has become for enterprises. We're just now understanding the failure modes and the architectural patterns that will scale. 2026, I think, is going to be a really, really interesting year for AI adoption.
Yeah, yeah. And then there's always the whole
discussion about AGI and what happens down the road from that. So all right, gents, this has been
great. Thanks, thanks for being on the show. Thanks, Ray. And that's it for now. Bye, Keith. Bye,
Jason. Bye, Ray. See you, Ray. Until next time. Next time, we will talk to another system storage technology person. Any questions you want us to ask, please let us know. And if you enjoy our podcast, tell your friends about it. Please review us on Apple Podcasts, Google Play, and Spotify,
as this will help get the word out.
Thank you.
