Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 08x06: HPC Technology Transfer with Los Alamos National Laboratory
Episode Date: May 5, 2025

Much of what we take for granted in the IT industry was seeded from HPC and the national labs. This episode of Utilizing Tech features Gary Grider, HPC Division Leader at Los Alamos National Labs, discussing leading-edge technology with Scott Shadley of Solidigm and Stephen Foskett. The Efficient Mission Centric Computing Consortium (EMC3) is working to bring technologies like sparse memory access and computational storage to life. These technologies are designed for today's massive-scale data sets, but Moore's Law suggests that this scale might be coming soon to AI applications and beyond. The goal of the national labs is to work 5-10 years ahead of the market to lay the foundations for what will be needed in the future. Specific products like InfiniBand, Lustre, pNFS, and more were driven forward by these labs as well. Some promising future directions include 3D chip scaling, analog and biological computing, and quantum chips.

Guest: Gary Grider, HPC Division Leader at Los Alamos National Labs

Hosts: Stephen Foskett, President of the Tech Field Day Business Unit and Organizer of the Tech Field Day Event Series; Jeniece Wnorowski, Head of Influencer Marketing at Solidigm; Scott Shadley, Leadership Narrative Director and Evangelist at Solidigm

Follow Tech Field Day on LinkedIn, on X/Twitter, on Bluesky, and on Mastodon. Visit the Tech Field Day website for more information on upcoming events. For more episodes of Utilizing Tech, head to the dedicated website and follow the show on X/Twitter, on Bluesky, and on Mastodon.
Transcript
Much of what we take for granted in the IT industry was seeded from HPC and the national labs.
This episode of Utilizing Tech features Gary Grider, HPC division leader at Los Alamos National Labs,
discussing leading-edge technology with Scott Shadley of Solidigm and myself.
Welcome to Utilizing Tech, the podcast about emerging technology from Tech Field Day, part of the Futurum Group. This season is presented by Solidigm
and focuses on new applications like AI and the Edge
and other related technologies.
I'm your host, Stephen Foskett,
organizer of the Tech Field Day event series.
Joining me from Solidigm as my co-host this week
is my good old friend, Scott Shadley.
Welcome to the show, Scott.
Hey, Stephen, great to be here.
Glad to continue the wonderful series of conversations
we're having with some of these great folks in the industry.
Yeah, you know, that's been the fun part of this whole season
is that we've been able to invite in people from companies
that are doing some cool stuff, of course,
but also people who are actually practitioners
doing very cool things out in the world.
And I appreciate you opening up your Rolodex
and inviting on some of your friends.
Absolutely. One of the cool things about today
that I think is really exciting is that when we think
about what's going on in this industry,
we don't really realize how much the labs
come into play, because there are a bunch of labs
around the country that you hear about from time to time.
But in this particular instance, we're bringing along a very interesting story from one of those labs.
So absolutely. So Scott has invited Gary Grider of the HPC division over at Los Alamos National
Labs to join us today to talk about the ways that HPC has driven compute, AI, edge, all
sorts of things forward. Welcome to the show, Gary. Tell us a
little bit about yourself. What are you doing over there?
Well, I'm the division leader for high performance computing
at the lab. And I've been at the lab since 89. So I've been there
for a while. Our division basically is responsible for all
the high-performance computing machines and buildings
they live in and all the way up to the application.
The applications are spread across the laboratory.
So our portfolio is pretty large.
We've got many, many tens of megawatts of computing power
and lots of machines and networks and storage.
So that's it.
And I think that, as I said a second ago,
it does seem in many ways, at least to me,
that the HPC space has driven the state of the art
forward in ways that I think people don't quite understand.
It reminds me a little about the space race,
and they talk about Tang and Velcro and pressurized pens and things like that. The same thing happens
in HPC. In fact, you know, your phone or your laptop these days has NUMA
technology that in my mind came from the HPC world and the same is true of so
many other elements. You're on the inside of that. Do you see that? Do you see
things that are coming and how they're going to affect the rest of computing?
Well, we certainly see things we think are coming. And we certainly can look backwards and say that there's an awful lot of what's in computing today,
at least scalable computing, that came from Department of Energy labs and other scientific organizations.
And oftentimes we are wrong.
Sometimes we think something that's going to happen, you know, it doesn't.
But more usually it happens.
It just doesn't happen when we think it's going to.
You know, a good example of that was, you know, back in the early 2000s, we were working on high-performance shutters for optics because
you couldn't flash a laser fast enough, so you have to put shutters on it to make it
go really fast.
We were funding a lot of work in that area, and we thought, oh, everybody's going to be
watching movies at home.
The world is going to need all this bandwidth, and so we need to get
high bandwidth fiber going. And we funded a lot of work in the area. We thought it would
be instantly used by everybody that wants to watch a movie. And then it didn't happen
for a decade. And then a decade later, everybody is watching movies. But they did it in a completely
different way. They decided to cache all this stuff locally to you and all kinds of fancy things.
And so in some ways, some technology
that we've been working on is still sitting there, waiting.
Another example is microchannel cooling.
You've got this 2.5D integration going on
in the silicon industry,
and it's headed towards 3D integration.
And the question is, if you have 3D chips,
how do you cool them?
And long ago, back in the late 90s,
we funded a bunch of work called microchannel cooling, which
is drilling microchannels through silicon
and cooling it from the inside, which is kind of the only way
once it gets thick enough.
And that technology is still sitting on the shelf
waiting for the economy to catch up to it
and the need for 3D to actually occur.
And once the economics is right,
I assume that technology will be used.
So I could go on and on about stuff we've funded
and some stuff has made it and some stuff hasn't.
Yeah, I find that very interesting, Gary. And I appreciate kind of that history lesson of things like that.
One of the reasons that I thought it would be fun to bring you into this
particular conversation was we got engaged through one of the initiatives
that you're working on that's active today in some of the space that ties
directly into what we're doing.
And it'd be great to hear a little bit about EMC3 and what you're doing
there and what it's bringing to the market
that we're dealing with today.
And I've participated in it
across a couple of organizations now.
So it's kind of cool to see that that's still moving forward
and you've got some really unique innovations
that are relevant to the AI era right now
that have come out of that.
Yeah, that's true.
So Efficient Mission Centric Computing Consortium
is what EMC3 stands for.
And really, it was a way for us to sort of pseudo formalize
our partnerships with industry and with other using
organizations for high-performance computing
and AI-oriented technologies.
And where it came from really was the fact that, you know, oftentimes science sites,
and that isn't just the national labs, that's also the energy companies and, you know,
the aerospace companies and people that do science on computers, and not just information science.
We often aren't served well because the larger community,
the cloud community, the big AI factories and things like that
are the big dogs.
And they sort of decide what technology is going to be for us.
So it was really sort of banding together
a bunch of organizations whose needs aren't
covered very well by the mass, you know, large-scale sales,
and trying to present a market to a bunch of sympathetic industry partners that might partner with us
to, you know, make some of that stuff happen.
And a good example of that is sparse memory access.
So, you know, if you think about how GPUs work, they gulp in a whole bunch of vectors,
and then they gulp in a whole bunch of other vectors, and then they do a mass multiply of all the elements
of those vectors in parallel.
If your problem maps to that kind of a solution, then a GPU is really, really nifty.
However, if your problem doesn't map to that, you're kind of left in the cold because there's
not really a lot of good ways to access the memory.
And so one of our projects ongoing for the last many years
has been to work on trying to push lookup tables down
to close to the memory subsystem so that you don't waste
cache line bandwidth and you don't waste cache area
in the processor storing indices for lookups
and things like that that you don't want to do.
And it's interesting because we're not the only people
that see this.
A good example of somebody that sees it is Amazon.
So Amazon, when you log on to Amazon,
you get a bunch of ads in front of your face.
And the way they decide what ads to put in front of your face
is they take all the products that they sell and all
the stuff that you've bought, and they put it
on two dimensions of a graph.
And it ends up being a super sparse data structure.
And they want to compare that super sparse data
structure with other people that have similar sparse data
structures.
And so that whole thing is really a sparse workload.
And another really simple example is just a table join.
So you take Oracle, and they have a dense table
and a sparse table, and they want to join it together
on a common key.
And the way you do that is through a lookup table.
And so there's all kinds of examples of sparse access
to memory in the world.
And it's not being
served very well by industry today, and so we've been working on trying to make that
happen.
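To make Gary's sparse-access point concrete, here is a minimal NumPy sketch of our own (an illustration, not EMC3 or Los Alamos code): a scattered gather out of a large in-memory table, plus the same pattern used as a key-based join. Every scattered row dragged back to the host is the cache-line and bus traffic that a lookup engine sitting close to memory would keep local, returning only the gathered result.

```python
# Minimal sketch (not EMC3 code): the sparse gather / lookup pattern described above.
import numpy as np

rng = np.random.default_rng(0)

# A large "memory-resident" table of feature vectors (dense storage).
table = rng.standard_normal((100_000, 64)).astype(np.float32)

# A sparse workload touches a scattered handful of rows, not contiguous blocks.
wanted_rows = rng.choice(table.shape[0], size=256, replace=False)

# Host-side gather: each scattered row costs a full cache line (or more) of
# traffic just to pull 64 floats -- the movement a near-memory lookup engine
# would avoid by doing the gather in place.
gathered = table[wanted_rows]                      # shape (256, 64)

# The same pattern underlies a key-based table join: build a lookup from
# key -> row, then probe it with the sparse side's keys.
lookup = {int(key): i for i, key in enumerate(rng.permutation(table.shape[0]))}
probe_keys = rng.choice(table.shape[0], size=256, replace=False)
joined = table[[lookup[int(k)] for k in probe_keys]]

print(gathered.shape, joined.shape)
```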
And probably the other big project that's going on, that we realize needs to have some
work, is pushing processing down near storage.
So just like sparse access is pushing access down close to memory, pushing stuff down close
to storage is also important.
And the naysayers out there might say, well, that's been tried a few times.
And it's certainly true.
I've certainly tried it a few times myself and funded a lot of work in the area.
But where we really see it coming to a head in real need is if you have a lot of holdings. At my site, maybe we have
three or four or five exabytes of data that we hold.
And it maybe represents, I don't know, maybe a trillion files and 100 billion directories
or something.
So it's a large mass of data and it's text and images and output from simulations and all kinds of things.
And you couldn't ever train on four exabytes of data. Today, people train on petabytes of data,
so it's three orders of magnitude off. So how are you ever going to do training, or better yet,
how are you ever going to do inference against all of that data or major parts of it?
And so the only way is to have rich indexes
and be able to subset that data very quickly
so that you can just get the vectors that you need
and build yourself a training set
or, more likely, build yourself a RAG
that your model looks at to enhance its answering
capabilities.
And so if we get to the point to where you're asking questions
of something that large, then you
need to push the lookups, the similarity lookups,
as close to the devices as possible
so that you're not moving data back and forth, because that's the only way you'll get low latency to get an
answer to a question you're asking, you know, of your holdings. And so we see
a time when this is going to really, actually be necessary. Otherwise you won't
be able to accomplish what you want to do. So that's really sort of the long
game. There's certainly a lot of shorter-term wins
you get out of it, but longer term,
if you don't have a way to get something from all your data,
then why the hell are you keeping it?
And finally, there's mechanisms for doing that.
So we need to enable that and pushing the compute
near the storage, at least some parts of it,
reductions at least, may be necessary.
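As a toy illustration of that computational-storage idea (a sketch under our own assumptions, with a hypothetical StorageDevice class standing in for a device-side engine; this is not how Los Alamos or any shipping product implements it), the difference is whether the host pulls every byte back to filter it, or the reduction runs next to the data and only the needed subset crosses the wire:

```python
# Minimal sketch (hypothetical): pulling raw data vs. pushing a reduction near storage.
import numpy as np

rng = np.random.default_rng(1)

class StorageDevice:
    """Stand-in for a device holding many embedding vectors plus a small index."""
    def __init__(self, vectors: np.ndarray):
        self.vectors = vectors                        # the raw holdings
        self.norms = np.linalg.norm(vectors, axis=1)  # a tiny "rich index"

    def read_all(self) -> np.ndarray:
        # Conventional path: ship every byte to the host and filter there.
        return self.vectors

    def query_top_k(self, probe: np.ndarray, k: int) -> np.ndarray:
        # Computational-storage path: do the similarity reduction in place
        # and return only the k rows the host actually needs.
        scores = self.vectors @ probe / (self.norms * np.linalg.norm(probe) + 1e-9)
        return self.vectors[np.argsort(scores)[-k:]]

dev = StorageDevice(rng.standard_normal((100_000, 128)).astype(np.float32))
probe = rng.standard_normal(128).astype(np.float32)

subset = dev.query_top_k(probe, k=32)   # 32 vectors cross the wire, not 100,000
print(subset.shape)
```

The point is only the shape of the interaction: the host sees 32 candidate vectors for its training set or RAG context instead of the whole holdings, and the hard part in practice is building indexes rich enough for the device-side reduction to find the right 32.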
I mean, that's exactly it. It's a great example
of how being able to think outside of the market box, right?
The one beautiful thing about the work
that you're doing and enabling is that it's somewhat ignoring
the big guys, right? Because they do have their bowl
and their whip and whatever carrot-and-stick type of stuff they're doing. But it's really interesting
to see that there can be a lot that can be accomplished if you step aside and look at it
from truly what does it need versus what does this person think they want. And that's one reason
I really like the ability to be engaged with the work that you're doing there and things like that.
So, I mean, it's really fun to think about what's been going on at that lab for so long.
Like you said, you've been there quite a number of years, which is awesome.
And we appreciate all that work that you're doing there.
But I recently went to the Chiplet Summit in January, where it was mentioned that they're
now in the process of trying to migrate all of the cool nuclear blast data that was generated at the lab into a consumable form to do
exactly what you were just describing, which is take petabytes of information you can't recreate
and be able to train on it and use it for useful future data.
Yeah, it's not just the nuclear test data, but it's all the subcrit tests.
And there's tons of surveillance on the weapons over the years.
We tear them apart, and we try to figure out
what's wrong with them.
They're old.
They've been rolled around on trucks.
They've been rattling around in subs for decades.
So yeah, there's a ton of data.
And it's of all kinds. Yeah.
So one thing, Stephen, you may not know about Gary as I've talked with him over the years
is I have been at GCC and I actually met a few of Gary's coworkers there because he has
done a great job of launching a few people out of the lab into the industry because of
the hard work and effort that they put in there.
So not only is he helping drive the market, he's actually helping the careers of quite a few individuals as well.
Yeah, we have an enormous student program. It's really cool. We have 1200 summer students
a year at the lab, which is a pretty large program, 300 post-docs, and about three-fourths
of the laboratory staff comes from its student programs. And, of course, we don't hang on to all of those students and people.
So they disappear into the ether and come back eventually.
And we notice that they're at some company helping us vicariously.
It is interesting, isn't it, that so much of what Gary's talking about, both in terms of technology, but also,
as you said, in terms of skills is increasingly applicable outside the lab and HPC environment
because, you know, I guess we're all probably old enough to remember when terabytes was
a lot of data.
And now, you know, you've got terabyte-size SD cards. Scott, if you had said, I'm going to be handing you a 100-plus terabyte SSD,
even just 10 years ago, I would have laughed at you because that would have seemed like a
ludicrous amount of data. And yet that's a real, not consumer, but a real product, a real thing that you can buy.
And the same is true of some of these other things, Gary.
It strikes me that, as you were alluding to
in your conversation about, for example,
the sparse memory and computational storage,
the scale of that data sounds absolutely incredible.
But when we're looking at what's happening
in the AI space today, we are seeing people saying,
hey, how can we train a model with maybe not quite
that large a data set, but a very large data set.
And we're gonna have to get to that point
probably eventually.
And as you said as well,
it's on the inferencing side as well.
If we want AI to be able to process all the whatever data,
then we need to be able to build systems that will be able to scale
and allow that AI to efficiently access that kind of data.
So what you're doing really is what's going to be happening probably five years from now, right?
Would you agree to that?
I think so, yeah. I mean, we aren't used to working only five years ahead. That's sort of the last decade in my career kind of a thing.
It used to feel like more like 10 or 15 years ahead, and now it feels more like five or less.
It is kind of hilarious when I go to AI conferences and they talk about the woes of having
to checkpoint the state of their training job
because it runs for a long, long time.
And I have to laugh because we had
to do checkpointing 25 years ago and have been checkpointing
ever since and actually still at scales bigger than they are
from a synchronous point of view.
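For listeners who have not run into it, checkpointing just means periodically persisting a job's state so a run that lasts days or weeks can resume after a failure instead of starting over. A minimal single-process sketch of the pattern (real HPC and AI checkpointing is parallel, coordinated across many nodes, and hits the storage system far harder) might look like this:

```python
# Minimal checkpoint/restart sketch (single process; illustrative only).
import os
import pickle

CKPT = "state.ckpt"

def load_state():
    """Resume from the last checkpoint if one exists, else start fresh."""
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "accum": 0.0}

def save_state(state):
    """Write to a temp file, then atomically swap it in so a crash mid-write
    never corrupts the last good checkpoint."""
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT)

state = load_state()
for step in range(state["step"], 1_000):
    state["accum"] += step * 0.001        # stand-in for a unit of real work
    state["step"] = step + 1
    if state["step"] % 100 == 0:          # checkpoint periodically
        save_state(state)

print(state["step"], round(state["accum"], 3))
```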
And so it's kind of interesting: a lot of the things we've
done, they're using or have to use. All the interconnects,
this InfiniBand and Ultra Ethernet and all
that kind of stuff, have their heritage in parallel computing
in the early 2000s.
And in fact, we were heavily involved
in the invention of InfiniBand.
It was a consortium between a couple of DOE labs, Bank
of America, and Oracle.
Isn't that an odd set of people to work together?
But Oracle was trying to do this thing called a parallel database
at the time.
And Bank of America, well, they were
trying to buy a parallel database
at the time. And they all said,
well, we need this interconnect.
And in the 90s, we had funded a bunch of proprietary
interconnects for every vendor known to mankind.
And everybody was tired of having proprietary interconnects
and special APIs to access them
for all the functions they had.
And so InfiniBand was sort of our attempt
at trying to get a common high-performance interconnect that
had a common layer, now called OFED, that lives in Linux
and that you use to do this large-scale computing.
And so many, many other things that we did in the past
come around again and are used by others.
And what's really nifty right now is what's happening in the data space.
So, you know, tools like Lustre, which I went to DOE to get the money to build,
and pNFS, which we started in 2002 at the University of Michigan.
And there's all these things that we did for parallel storage.
And it's really cool to see, you know, something other than HPC sites
starting to use those tools.
Yeah, I think it's interesting, Gary, to your point.
Because we mentioned that this series is focused
on kind of today's AI and where it's going with the edge
and things like that.
And we've been talking high performance.
And you've given us some really cool things.
I think it's one of those.
There's always the next shiny object.
And from my point of view,
you're working on not even today's shiny object
or next shiny object,
you're on the shiny, shiny object, right?
So when we all think we're squirreling
to solve today's problems,
you've already solved them in ways
that you're waiting for the rest of the world
to catch up on.
So it's kind of a unique ability
to have a perspective like that.
And I truly appreciate the fact that you and your team
and the work that you're doing there
has given us as consumers and our partners in that space,
the opportunity to work through and catch up, if you will,
in some aspects of it.
And it just shows the value
of having this kind of forward-looking aspect of it.
Because when you look at a company like Solidigm,
we have an R&D budget and we're thinking,
we have a three-year roadmap, we have a five-year roadmap,
and we don't tend to think much beyond that.
And when something pops up, it's like, oh, but you're like, to your point, you're laughing
about checkpointing.
It started to make me laugh.
It's like, I have an older brother who works at a DOE lab as well, in
Idaho, where, you know, nuclear power started as far as energy generation goes.
And I hear stories from him all the time.
And it's like, this is a kind of cool and innovative way to think about it.
And I really do appreciate getting the chance to have you on here to talk about some
of it. So what are your thoughts, Stephen?
Yeah, I'm with you, man. It is really cool. And that's why, Gary, I really want to kind
of put it to you. What are you working on now that is in the back of your mind and you're
looking at that and you're saying, that right there is gonna be important pretty soon.
What are the tools, the technologies, the concepts
that you think are gonna be driving computing in the future?
The only other thing, and it's much what I've talked about,
but thinking bigger and longer:
if you look at Moore's law
and Dennard scaling and things like that,
we ran out of gas a while ago.
And we were at 2D for a long, long time.
And now, 2.5D is everywhere.
And when you get to 3D, once you solve all the problems
that we've solved some of already, but not all for sure,
you're out of Ds.
And so what do you do next?
And so what that means, which is an ugly way to think about it,
but what it means is you no longer get any of these wins
on reductions of size or anything.
The only way to get more computing
is to buy more computing, which is a different situation
than we've been in in a really, really long time.
We've always been able to shrink and add more.
And when that stops, when you're at the end of that road,
what do you do?
And so there's a fair amount of effort going on to try to figure out, okay, well, what
do you do when you run out of Ds and you're done, right?
You don't have any more shrinks you can do because you're at atomic scale and you've
already used up all three dimensions and all you're doing now is just buying more and more,
you know, covering the planet with, you know, 3D silicon, what do you do? And so there's a fair amount
of effort going on to try to figure out what that is. And, you know, we had an early
quantum system at the lab, and it was interesting: the infrastructure cost three times as much
as the machine. But, you know, that's not exactly an answer. It's part of an answer.
There's analog that looks pretty interesting. And so we have a fair amount of effort going
on into looking at analog. And what would that get for us? And it's really mostly about
picojoules per something, right? Because not only will we cover the planet with silicon,
but we'll also use up all the energy
at the pace we're going at, right?
If you run out of Ds.
And so somebody has to stem the tide there.
And it feels like analog or biological computing
is where we have to go.
So we have a fair amount of effort going on
in both of those areas.
Our Center for Nanotechnology is looking at interfaces between silicon and bio,
you know, circuitry, which, you know, may be part of the answer.
So that's the problem, I think for all of us, you know, at some point is what
happens when we run out of Ds.
That's a very unique perspective.
I think it's interesting because there are things like, in SNIA,
the DNA efforts that are going on there to use DNA to do archival storage and stuff like that.
But what do you do to get it back, and that kind of thing?
I really do appreciate that perspective because I am very curious what the next D is,
because I've been on the same train with you about,
we don't have anything really in the market
to replace NAND and DRAM,
the way we have used them in the past
to replace other products.
So you start getting beyond that and into the quantum space,
and is that really where you're gonna get solutions?
But that only solves one piece of that big problem, right?
So really do appreciate those insights.
If I could jump in on that too,
just to be clear for our audience.
So Scott, maybe you're taking this for granted, I don't know, maybe you're ahead of the times,
but memory and flash, you know, are ahead, in my opinion, of compute, of general compute
in terms of implementing 3D chip architectures.
I mean, you guys, especially in the storage space, you're stacking, you're stacking them tall. I don't know if people know this, but I
mean we're talking about skyscraper style, you know, of chips with, you know,
over a hundred levels. That is not what we're seeing yet in the compute space.
Gary, do you think, is that what you're talking about when you're talking about
3D? Do you think that we're gonna have, you know, processors scaling like that?
Yes, it's happening.
And that's a big, you know, as you say, that's something that hasn't yet come to computing,
but it is, and it's going to.
And like I said, I mean, memory and flash are certainly the fields that are setting
the stage for that.
Yeah, it's interesting.
The one question I usually get is, what about photonics on chip?
And we've done work in photonics on chip for, I don't know, 20 years.
And it's ready.
It's just that the economics isn't there yet.
It's just like this microchannel cooling stuff, right?
It's technology that's sitting there waiting for a problem
that it can uniquely solve.
And it could be that that's one step between us and full 3D.
Because if you can get the kind of bandwidths
that that promises, and you can move things apart by inches
instead of millimeters, one could imagine spreading
a workload out using...
So it could be that finally that technology will actually take off and have an economic
reason to exist because it's still too early for microchannel cooling and full 3D, but
both are probably going to happen.
The stock market question, of course, is when, right?
Well, I mean, this has been truly insightful. I've actually learned even more just by having a few minutes
of chatting time with you.
So it looks like next time we see each other
in another event as we continue to cross paths,
I'm gonna have to definitely sit down with you
for some more conversation and things like that.
And Stephen, to your point, yeah,
when we start stacking the processing chips,
when we can get them cooled properly
and getting through-silicon vias versus wire bonding, that's one step, but then we have to
get beyond that too.
So I'll hand it back over to you, Stephen.
Yeah, absolutely.
Because those are some of the challenges, just to translate this into plain nerd from
deep nerd.
When you start stacking chips on top of each other, you
have to figure out ways of powering those chips. You have to figure out ways
of, you know, basically distributing power on a bus kind of throughout that stack.
You have to figure out ways of addressing and communicating with those chips.
I mean, my background is in urban planning, and it literally is the same as making skyscrapers.
You have to think about elevator sky lobbies.
You have to think about emergency exits
and water and fire and electricity
and all those things when you make a tall skyscraper.
It's the same with chips.
You have to think about how am I gonna power the ones on top?
How am I gonna get the data in and out?
And for something as uniform as a flash chip,
that has been a little not easy, but a little more doable.
And then for compute, it's just, wow, let's see what happens there.
And then, Gary, the other things that you mentioned too,
I've seen some very early analog AI processors,
for example, I was talking to a company
that's doing an analog AI processor,
it mostly kind of works, but it's pretty cool.
You know, that's for sure.
And if they can get it to really work, that would be awesome.
You know, so much to think about.
Gary, is there someplace that people can continue
this conversation, can learn more about your work
and about EMC3?
Best place probably is to just go to LinkedIn.
You can find me on LinkedIn for sure.
Excellent, and thank you so much for joining us.
I will say that there is actually,
on the Los Alamos website also,
a little bit more about the Efficient
Mission Centric Computing Consortium,
if I can read it off properly.
And so people could do that.
And you mentioned students coming in as well.
Is this something that people can get involved in?
Sure, it's mostly for other using sites and industry partners, but like I said,
we have a huge student program, the largest student program of all the DOE labs by a factor of four.
It's really big. And so we hire tons of students every year. So we really encourage the computer
science and computer engineering
and electrical engineering and mechanical engineering and physics and material science
and all that kind of stuff. But we hire an awful lot of students. And actually,
not just at Los Alamos, but across the DOE labs' student programs, you know,
we're talking about tens of thousands of students a year. So, you know, I highly encourage people, if they want to work on interesting problems,
to come as students.
Yeah, absolutely.
My oldest studied computer science and some of their friends actually went to work for
the national labs.
Others went to work for freaking Facebook and stuff.
But, you know, I mean, to each their own. And so it's pretty cool that people can get involved in some of this cutting-edge research.
Well, thank you so much for joining us. Scott, what's going on with you and with Solidigm lately?
You know, we're still having fun. We've got solidigm.com/AI.
Some really cool innovations were introduced at GTC, so go take a look at that if you haven't
already.
And we're continuing to just put things forward, keeping a nice solid focus on that.
Me personally, I'm having a lot of fun having these kinds of conversations.
If you feel like checking in or following me around, it's scott.shadley on LinkedIn or smshadley on
both the former Twitter and now Bluesky.
Excellent. And as for me, you'll find me as SFoskett on most social
media, including LinkedIn, as Gary and Scott both said.
I'd love to hear from you. And we recently had our AI Infrastructure
Field Day event. So if you're interested in sort of the infrastructure
underneath AI, maybe check that out.
Thank you for listening to this episode of the Utilizing Tech podcast series.
You can find this podcast in your favorite applications as well as on YouTube.
If you enjoyed this discussion, please do give us a rating, a nice review.
It really does help people to find us.
This podcast is brought to you by Solidigm and by Tech Field Day, part of the Futurum Group.
For show notes and more episodes, head over to our dedicated website, utilizingtech.com,
or find us on X/Twitter, Bluesky, and Mastodon at Utilizing Tech.
Thanks for listening, and we will see you next week.