Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 08x06: HPC Technology Transfer with Los Alamos National Laboratory
Episode Date: May 5, 2025

Much of what we take for granted in the IT industry was seeded from HPC and the national labs. This episode of Utilizing Tech features Gary Grider, HPC Division Leader at Los Alamos National Labs, discussing leading-edge technology with Scott Shadley of Solidigm and Stephen Foskett. The Efficient Mission Centric Computing Consortium (EMC3) is working to bring technologies like sparse memory access and computational storage to life. These technologies are designed for today's massive-scale data sets, but Moore's Law suggests that this scale might be coming soon to AI applications and beyond. The goal of the national labs is to work 5-10 years ahead of the market to lay the foundations for what will be needed in the future. Specific products like InfiniBand, Lustre, pNFS, and more were driven forward by these labs as well. Some promising future directions include 3D chip scaling, analog and biological computing, and quantum chips.

Guest: Gary Grider, HPC Division Leader at Los Alamos National Labs

Hosts: Stephen Foskett, President of the Tech Field Day Business Unit and Organizer of the Tech Field Day Event Series; Jeniece Wnorowski, Head of Influencer Marketing at Solidigm; Scott Shadley, Leadership Narrative Director and Evangelist at Solidigm

Follow Tech Field Day on LinkedIn, on X/Twitter, on Bluesky, and on Mastodon. Visit the Tech Field Day website for more information on upcoming events. For more episodes of Utilizing Tech, head to the dedicated website and follow the show on X/Twitter, on Bluesky, and on Mastodon.
Transcript
Much of what we take for granted in the IT industry was seeded from HPC and the national labs.
This episode of Utilizing Tech features Gary Grider, HPC division leader at Los Alamos National Labs,
discussing leading-edge technology with Scott Shadley of Solidigm and myself.
Welcome to Utilizing Tech, the podcast about emerging technology from Tech Field Day, part of the Futurum Group. This season is presented by Solidigm
and focuses on new applications like AI and the Edge
and other related technologies.
I'm your host, Stephen Foskett,
organizer of the Tech Field Day event series.
Joining me from Solidigm as my co-host this week
is my good old friend, Scott Shadley.
Welcome to the show, Scott.
Hey, Stephen, great to be here.
Glad to continue the wonderful series of conversations
we're having with some of these great folks in the industry.
Yeah, you know, that's been the fun part of this whole season
is that we've been able to invite in people from companies
that are doing some cool stuff, of course,
but also people who are actually practitioners
doing very cool things out in the world.
And I appreciate you opening up your Rolodex
and inviting on some of your friends.
Absolutely. One of the cool things about today
that I think is really exciting is that when we think
about what's going on in this industry,
we don't really realize how much the labs
come into play, because there are a bunch of labs
around the country that you hear about from time to time.
But in this particular instance, we're bringing along a very interesting story from one of those labs.
So absolutely. So Scott has invited Gary Grider of the HPC division over at Los Alamos National
Labs to join us today to talk about the ways that HPC has driven compute, AI, edge, all
sorts of things forward. Welcome to the show, Gary. Tell us a
little bit about yourself. What are you doing over there?
Well, I'm the division leader for high performance computing
at the lab. And I've been at the lab since 89. So I've been there
for a while. Our division basically is responsible for all
the high-performance computing machines and buildings
they live in and all the way up to the application.
The applications are spread across the laboratory.
So our portfolio is pretty large.
We've got many, many tens of megawatts of computing power
and lots of machines and networks and storage.
So that's it.
And I think that, as I said a second ago,
it does seem in many ways, at least to me,
that the HPC space has driven the state of the art
forward in ways that I think people don't quite understand.
It reminds me a little about the space race,
and they talk about Tang and Velcro and pressurized pens and things like that. The same thing happens
in HPC. In fact, you know, your phone or your laptop these days has NUMA
technology that in my mind came from the HPC world and the same is true of so
many other elements. You're on the inside of that. Do you see that? Do you see
things that are coming and how they're going to affect the rest of computing?
Well, we certainly see things we think are coming. And we certainly can look backwards and say that there's an awful lot of what's in computing today,
at least scalable computing, that came from Department of Energy labs and other scientific organizations.
And oftentimes we are wrong.
Sometimes we think something that's going to happen, you know, it doesn't.
But more usually it happens.
It just doesn't happen when we think it's going to.
You know, a good example of that was, you know, back in the early 2000s, we were working on high-performance shutters for optics because
you couldn't flash a laser fast enough, so you have to put shutters on it to make it
go really fast.
We were funding a lot of work in that area, and we thought, oh, everybody's going to be
watching movies at home.
The world is going to need all this bandwidth, and so we need to get
high bandwidth fiber going. And we funded a lot of work in the area. We thought it would
be instantly used by everybody that wants to watch a movie. And then it didn't happen
for a decade. And then a decade later, everybody is watching movies. But they did it in a completely
different way. They decided to cache all this stuff locally to you and all kinds of fancy things.
And so in some ways, some technology
that we've been working on is still sitting there, waiting.
Another example is microchannel cooling.
You've got this 2.5D integration going on
in the silicon industry,
and it's headed towards 3D integration.
And the question is, if you have 3D chips,
how do you cool them?
And long ago, back in the late 90s,
we funded a bunch of work called microchannel cooling, which
is drilling microchannels through silicon
and cooling it from the inside, which is kind of the only way
once it gets thick enough.
And that technology is still sitting on the shelf
waiting for the economy to catch up to it
and the need for 3D to actually occur.
And once the economics is right,
I assume that technology will be used.
So I could go on and on about stuff we've funded
and some stuff has made it and some stuff hasn't.
Yeah, I find that very interesting, Gary. And I appreciate kind of that history lesson of things like that.
One of the reasons that I thought it would be fun to bring you into this
particular conversation was we got engaged through one of the initiatives
that you're working on that's active today in some of the space that ties
directly into what we're doing.
And it'd be great to hear a little bit about EMC3 and what you're doing
there and what it's bringing to the market
that we're dealing with today.
And I've participated in it
across a couple of organizations now.
So it's kind of cool to see that that's still moving forward
and you've got some really unique innovations
that are relevant to the AI era right now
that have come out of that.
Yeah, that's true.
So Efficient Mission Centric Computing Consortium
is what EMC3 stands for.
And really, it was a way for us to sort of pseudo formalize
our partnerships with industry and with other using
organizations for high-performance computing
and AI-oriented technologies.
And where it came from really was the fact that, you know, oftentimes science sites,
and that isn't just the national labs, that's also the energy companies and, you know,
the aerospace companies and people that do science on computers, and not just information science.
We often aren't served well because the larger community,
the cloud community, the big AI factories and things like that
are the big dogs.
And they sort of decide what technology is going to be for us.
So it was really sort of banding together
a bunch of organizations whose needs aren't
covered very well by the mass, you know, large-scale sales,
and trying to present a market to a bunch of sympathetic industry partners that might partner with us
to, you know, make some of that stuff happen.
And a good example of that is sparse memory access.
So, you know, if you think about how GPUs work, they gulp in a whole bunch of vectors,
and then they gulp in a whole bunch of other vectors, and then they do a mass multiply of all the elements
of those vectors in parallel.
If your problem maps to that kind of a solution, then a GPU is really, really nifty.
However, if your problem doesn't map to that, you're kind of left in the cold because there's
not really a lot of good ways to access the memory.
And so one of our projects ongoing for the last many years
has been to work on trying to push lookup tables down
to close to the memory subsystem so that you don't waste
cache line bandwidth and you don't waste cache area
in the processor storing indices for lookups
and things like that that you don't want to do.
And it's interesting because we're not the only people
that see this.
A good example of somebody that sees it is Amazon.
So Amazon, when you log on to Amazon,
you get a bunch of ads in front of your face.
And the way they decide what ads to put in front of your face
is they take all the products that they sell and all
the stuff that you've bought, and they put it
on two dimensions of a graph.
And it ends up being a super sparse data structure.
And they want to compare that super sparse data
structure with other people that have similar sparse data
structures.
And so that whole thing is really a sparse workload.
And another really simple example is just a table join.
So you take Oracle, and they have a dense table
and a sparse table, and they want to join it together
on a common key.
And the way you do that is through a lookup table.
And so there's all kinds of examples of sparse access
to memory in the world.
And it's not being
served very well by industry today, and so we've been working on trying to make that
happen.
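To make Gary's sparse-access point concrete, here is a minimal NumPy sketch of our own (an illustration, not EMC3 or Los Alamos code): a scattered gather out of a large in-memory table, plus the same pattern used as a key-based join. Every scattered row dragged back to the host is the cache-line and bus traffic that a lookup engine sitting close to memory would keep local, returning only the gathered result.

```python
# Minimal sketch (not EMC3 code): the sparse gather / lookup pattern described above.
import numpy as np

rng = np.random.default_rng(0)

# A large "memory-resident" table of feature vectors (dense storage).
table = rng.standard_normal((100_000, 64)).astype(np.float32)

# A sparse workload touches a scattered handful of rows, not contiguous blocks.
wanted_rows = rng.choice(table.shape[0], size=256, replace=False)

# Host-side gather: each scattered row costs a full cache line (or more) of
# traffic just to pull 64 floats -- the movement a near-memory lookup engine
# would avoid by doing the gather in place.
gathered = table[wanted_rows]                      # shape (256, 64)

# The same pattern underlies a key-based table join: build a lookup from
# key -> row, then probe it with the sparse side's keys.
lookup = {int(key): i for i, key in enumerate(rng.permutation(table.shape[0]))}
probe_keys = rng.choice(table.shape[0], size=256, replace=False)
joined = table[[lookup[int(k)] for k in probe_keys]]

print(gathered.shape, joined.shape)
```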
And probably the other big project that's going on, that we realize needs to have some
work, is pushing processing down near storage.
So just like sparse access is pushing access down close to memory, pushing stuff down close
to storage is also important.
And the naysayers out there might say, well, that's been tried a few times.
And it's certainly true.
I've certainly tried it a few times myself and funded a lot of work in the area.
But where we really see it coming to a head in real need is if you have a lot of holdings. At my site, maybe we have
three or four or five exabytes of data that we hold.
And it maybe represents, I don't know, maybe a trillion files and 100 billion directories
or something.
So it's a large mass of data and it's text and images and output from simulations and all kinds of things.
And you couldn't ever train on four exabytes of data. Today, people train on petabytes of data,
so it's three orders of magnitude off. So how are you ever going to do training, or better yet,
how are you ever going to do inference against all of that data or major parts of it?
And so the only way is to have rich indexes
and be able to subset that data very quickly
so that you can just get the vectors that you need
and build yourself a training set
or, more likely, build yourself a RAG
that your model looks at to enhance its answering
capabilities.
And so if we get to the point to where you're asking questions
of something that large, then you
need to push the lookups, the similarity lookups,
as close to the devices as possible
so that you're not moving data back and forth, because that's the only way you'll get low latency to get an
answer to a question you're asking, you know, of your holdings. And so we see
a time when this is going to really, actually be necessary. Otherwise you won't
be able to accomplish what you want to do. So that's really sort of the long
game. There's certainly a lot of shorter-term wins
you get out of it, but longer term,
if you don't have a way to get something from all your data,
then why the hell are you keeping it?
And finally, there's mechanisms for doing that.
So we need to enable that and pushing the compute
near the storage, at least some parts of it,
reductions at least, may be necessary.
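As a toy illustration of that computational-storage idea (a sketch under our own assumptions, with a hypothetical StorageDevice class standing in for a device-side engine; this is not how Los Alamos or any shipping product implements it), the difference is whether the host pulls every byte back to filter it, or the reduction runs next to the data and only the needed subset crosses the wire:

```python
# Minimal sketch (hypothetical): pulling raw data vs. pushing a reduction near storage.
import numpy as np

rng = np.random.default_rng(1)

class StorageDevice:
    """Stand-in for a device holding many embedding vectors plus a small index."""
    def __init__(self, vectors: np.ndarray):
        self.vectors = vectors                        # the raw holdings
        self.norms = np.linalg.norm(vectors, axis=1)  # a tiny "rich index"

    def read_all(self) -> np.ndarray:
        # Conventional path: ship every byte to the host and filter there.
        return self.vectors

    def query_top_k(self, probe: np.ndarray, k: int) -> np.ndarray:
        # Computational-storage path: do the similarity reduction in place
        # and return only the k rows the host actually needs.
        scores = self.vectors @ probe / (self.norms * np.linalg.norm(probe) + 1e-9)
        return self.vectors[np.argsort(scores)[-k:]]

dev = StorageDevice(rng.standard_normal((100_000, 128)).astype(np.float32))
probe = rng.standard_normal(128).astype(np.float32)

subset = dev.query_top_k(probe, k=32)   # 32 vectors cross the wire, not 100,000
print(subset.shape)
```

The point is only the shape of the interaction: the host sees 32 candidate vectors for its training set or RAG context instead of the whole holdings, and the hard part in practice is building indexes rich enough for the device-side reduction to find the right 32.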
I mean, that's exactly it. It's a great example
of how being able to think outside of the market box, right?
The one beautiful thing about the work
that you're doing and enabling is that it's somewhat ignoring
the big guys, right? Because they do have their bowl
and their whip and whatever carrot-and-stick type of stuff they're doing. But it's really interesting
to see that there can be a lot that can be accomplished if you step aside and look at it
from truly what does it need versus what does this person think they want. And that's one reason
I really like the ability to be engaged with the work that you're doing there and things like that.
So, I mean, it's really fun to think about what's been going on at that lab for so long.
Like you said, you've been there quite a number of years, which is awesome.
And we appreciate all that work that you're doing there.
But I recently went to the Chiplet Summit in January, where it was mentioned that they're
now in the process of trying to migrate all of the cool nuclear blast data that was generated at the lab into a consumable form to do
exactly what you were just describing, which is take petabytes of information you can't recreate
and be able to train on it and use it for useful future data.
Yeah, it's not just the nuclear test data, but it's all the subcrit tests.
And there's tons of surveillance on the weapons over the years.
We tear them apart, and we try to figure out
what's wrong with them.
They're old.
They've been rolled around on trucks.
They've been rattling around in subs for decades.
So yeah, there's a ton of data.
And it's of all kinds. Yeah.
So one thing, Stephen, you may not know about Gary as I've talked with him over the years
is I have been at GCC and I actually met a few of Gary's coworkers there because he has
done a great job of launching a few people out of the lab into the industry because of
the hard work and effort that they put in there.
So not only is he helping drive the market, he's actually helping the careers of quite a few individuals as well.
Yeah, we have an enormous student program. It's really cool. We have 1200 summer students
a year at the lab, which is a pretty large program, 300 post-docs, and about three-fourths
of the laboratory staff comes from its student programs. And, of course, we don't hang on to all of those students and people.
So they disappear into the ether and come back eventually.
And we notice that they're at some company helping us vicariously.
It is interesting, isn't it, that so much of what Gary's talking about, both in terms of technology, but also,
as you said, in terms of skills is increasingly applicable outside the lab and HPC environment
because, you know, I guess we're all probably old enough to remember when terabytes was
a lot of data.
And now, you know, you've got terabyte-size SD cards. Scott, if you had said, I'm going to be handing you a 100-plus terabyte SSD,
even just 10 years ago, I would have laughed at you because that would have seemed like a
ludicrous amount of data. And yet that's a real, not consumer, but a real product, a real thing that you can buy.
And the same is true of some of these other things, Gary.
It strikes me that, as you were alluding to
in your conversation about, for example,
the sparse memory and computational storage,
the scale of that data sounds absolutely incredible.
But when we're looking at what's happening
in the AI space today, we are seeing people saying,
hey, how can we train a model with maybe not quite
that large a data set, but a very large data set.
And we're gonna have to get to that point
probably eventually.
And as you said as well,
it's on the inferencing side as well.
If we want AI to be able to process all the whatever data,
then we need to be able to build systems that will be able to scale
and allow that AI to efficiently access that kind of data.
So what you're doing really is what's going to be happening probably five years from now, right?
Would you agree to that?
I think so, yeah. I mean, we aren't used to working only five years ahead. That's sort of the last decade in my career kind of a thing.
It used to feel like more like 10 or 15 years ahead, and now it feels more like five or less.
It is kind of hilarious when I go to AI conferences and they talk about the woes of having
to checkpoint the state of their training job
because it runs for a long, long time.
And I have to laugh because we had
to do checkpointing 25 years ago and have been checkpointing
ever since and actually still at scales bigger than they are
from a synchronous point of view.
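For listeners who have not run into it, checkpointing just means periodically persisting a job's state so a run that lasts days or weeks can resume after a failure instead of starting over. A minimal single-process sketch of the pattern (real HPC and AI checkpointing is parallel, coordinated across many nodes, and hits the storage system far harder) might look like this:

```python
# Minimal checkpoint/restart sketch (single process; illustrative only).
import os
import pickle

CKPT = "state.ckpt"

def load_state():
    """Resume from the last checkpoint if one exists, else start fresh."""
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "accum": 0.0}

def save_state(state):
    """Write to a temp file, then atomically swap it in so a crash mid-write
    never corrupts the last good checkpoint."""
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT)

state = load_state()
for step in range(state["step"], 1_000):
    state["accum"] += step * 0.001        # stand-in for a unit of real work
    state["step"] = step + 1
    if state["step"] % 100 == 0:          # checkpoint periodically
        save_state(state)

print(state["step"], round(state["accum"], 3))
```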
And so it's kind of interesting: a lot of the things we've
done, they're using or have to use. All the interconnects,
this InfiniBand and Ultra Ethernet and all
that kind of stuff, have their heritage in parallel computing
in the early 2000s.
And in fact, we were heavily involved
in the invention of InfiniBand.
It was a consortium between a couple of DOE labs, Bank
of America, and Oracle.
Isn't that an odd set of people to work together?
But Oracle was trying to do this thing called a parallel database
at the time.
And Bank of America, well, they were
trying to buy a parallel database
at the time. And they all said,
well, we need this interconnect.
And in the 90s, we had funded a bunch of proprietary
interconnects for every vendor known to mankind.
And everybody was tired of having proprietary interconnects
and special APIs to access them
for all the functions they had.
And so InfiniBand was sort of our attempt
at trying to get a common high-performance interconnect that
had a common layer, now called OFED, that lives in Linux
and that you use to do this large-scale computing.
And so many, many other things that we did in the past
come around again and are used by others.
And what's really nifty right now is what's happening in the data space.
So, you know, tools like Lustre, which I went to DOE to get the money to build,
and pNFS, which we started in 2002 at the University of Michigan.
And there's all these things that we did for parallel storage.
And it's really cool to see, you know, something other than HPC sites
starting to use those tools.
Yeah, I think it's interesting, Gary, to your point.
Because we mentioned that this series is focused
on kind of today's AI and where it's going with the edge
and things like that.
And we've been talking high performance.
And you've given us some really cool things.
I think it's one of those.
There's always the next shiny object.
And from my point of view,
you're working on not even today's shiny object
or next shiny object,
you're on the shiny, shiny object, right?
So when we all think we're squirreling
to solve today's problems,
you've already solved them in ways
that you're waiting for the rest of the world
to catch up on.
So it's kind of a unique ability
to have a perspective like that.
And I truly appreciate the fact that you and your team
and the work that you're doing there
has given us as consumers and our partners in that space,
the opportunity to work through and catch up, if you will,
in some aspects of it.
And it just shows the value
of having this kind of forward-looking aspect of it.
Because when you look at a company like Solidigm,
we have an R&D budget and we're thinking,
we have a three-year roadmap, we have a five-year roadmap,
and we don't tend to think much beyond that.
And when something pops up, it's like, oh, but you're like, to your point, you're laughing
about checkpointing.
It started to make me laugh.
It's like, I have an older brother who works at a DOE lab as well, in
Idaho, where, you know, nuclear power started as far as energy generation goes.
And I hear stories from him all the time.
And it's like, this is a kind of cool and innovative way to think about it.
And I really do appreciate getting the chance to have you on here to talk about some
of it. So what are your thoughts, Stephen?
Yeah, I'm with you, man. It is really cool. And that's why, Gary, I really want to kind
of put it to you. What are you working on now that is in the back of your mind and you're
looking at that and you're saying, that right there is gonna be important pretty soon.
What are the tools, the technologies, the concepts
that you think are gonna be driving computing in the future?
The only other thing, and it's much what I've talked about,
but thinking bigger and longer:
if you look at Moore's law
and Dennard scaling and things like that,
we ran out of gas a while ago.
And we were at 2D for a long, long time.
And now, 2.5D is everywhere.
And when you get to 3D, once you solve all the problems
that we've solved some of already, but not all for sure,
you're out of Ds.
And so what do you do next?
And so what that means, which is an ugly way to think about it,
but what it means is you no longer get any of these wins
on reductions of size or anything.
The only way to get more computing
is to buy more computing, which is a different situation
than we've been in in a really, really long time.
We've always been able to shrink and add more.
And when that stops, when you're at the end of that road,
what do you do?
And so there's a fair amount of effort going on to try to figure out, okay, well, what
do you do when you run out of Ds and you're done, right?
You don't have any more shrinks you can do because you're at atomic scale and you've
already used up all three dimensions and all you're doing now is just buying more and more,
you know, covering the planet with, you know, 3D silicon, what do you do? And so there's a fair amount
of effort going on to try to figure out what that is. And, you know, we had an early
quantum system at the lab, and it was interesting: the infrastructure cost three times as much
as the machine. But, you know, that's not exactly an answer. It's part of an answer.
There's analog that looks pretty interesting. And so we have a fair amount of effort going
on into looking at analog. And what would that get for us? And it's really mostly about
picojoules per something, right? Because not only will we cover the planet with silicon,
but we'll also use up all the energy
at the pace we're going at, right?
If you run out of Ds.
And so somebody has to stem the tide there.
And it feels like analog or biological computing
is where we have to go.
So we have a fair amount of effort going on
in both of those areas.
Our Center for Nanotechnology is looking at interfaces between silicon and bio,
you know, circuitry, which, you know, may be part of the answer.
So that's the problem, I think for all of us, you know, at some point is what
happens when we run out of Ds.
That's a very unique perspective.
I think it's interesting because there are things like, in SNIA,
the DNA efforts that are going on there to use DNA to do archival storage and stuff like that.
But what do you do to get it back, and that kind of thing?
I really do appreciate that perspective because I am very curious what the next D is,
because I've been on the same train with you about,
we don't have anything really in the market
to replace NAND and DRAM,
the way we have used them in the past
to replace other products.
So you start getting beyond that and into the quantum space,
and is that really where you're gonna get solutions?
But that only solves one piece of that big problem, right?
So really do appreciate those insights.
If I could jump in on that too,
just to be clear for our audience.
So Scott, maybe you're taking this for granted, I don't know, maybe you're ahead of the times,
but memory and flash, you know, are ahead, in my opinion, of compute, of general compute
in terms of implementing 3D chip architectures.
I mean, you guys, especially in the storage space, you're stacking, you're stacking them tall. I don't know if people know this, but I
mean we're talking about skyscraper style, you know, of chips with, you know,
over a hundred levels. That is not what we're seeing yet in the compute space.
Gary, do you think, is that what you're talking about when you're talking about
3D? Do you think that we're gonna have, you know, processors scaling like that?
Yes, it's happening.
And that's a big, you know, as you say, that's something that hasn't yet come to computing,
but it is, and it's going to.
And like I said, I mean, memory and flash are certainly the fields that are setting
the stage for that.
Yeah, it's interesting.
The one question I usually get is, what about photonics on chip?
And we've done work in photonics on chip for, I don't know, 20 years.
And it's ready.
It's just that the economics isn't there yet.
It's just like this microchannel cooling stuff, right?
It's technology that's sitting there waiting for a problem
that it can uniquely solve.
And it could be that that's one step between us and full 3D.
Because if you can get the kind of bandwidths
that that promises, and you can move things apart by inches
instead of millimeters, one could imagine spreading
a workload out using...
So it could be that finally that technology will actually take off and have an economic
reason to exist because it's still too early for microchannel cooling and full 3D, but
both are probably going to happen.
The stock market question, of course, is when, right?
Well, I mean, this has been truly insightful. I've actually learned even more just by having a few minutes
of chatting time with you.
So it looks like next time we see each other
in another event as we continue to cross paths,
I'm gonna have to definitely sit down with you
for some more conversation and things like that.
And Stephen, to your point, yeah,
when we start stacking the processing chips,
when we can get them cooled properly
and getting through-silicon vias versus wire bonding, that's one step, but then we have to
get beyond that too.
So I'll hand it back over to you, Stephen.
Yeah, absolutely.
Because those are some of the challenges, just to translate this into plain nerd from
deep nerd.
When you start stacking chips on top of each other, you
have to figure out ways of powering those chips. You have to figure out ways
of, you know, basically distributing power on a bus kind of throughout that stack.
You have to figure out ways of addressing and communicating with those chips.
I mean, my background is in urban planning, and it literally is the same as making skyscrapers.
You have to think about elevator sky lobbies.
You have to think about emergency exits
and water and fire and electricity
and all those things when you make a tall skyscraper.
It's the same with chips.
You have to think about how am I gonna power the ones on top?
How am I gonna get the data in and out?
And for something as uniform as a flash chip,
that has been a little not easy, but a little more doable.
And then for compute, it's just, wow, let's see what happens there.
And then, Gary, the other things that you mentioned too,
I've seen some very early analog AI processors,
for example, I was talking to a company
that's doing an analog AI processor,
it mostly kind of works, but it's pretty cool.
You know, that's for sure.
And if they can get it to really work, that would be awesome.
You know, so much to think about.
Gary, is there someplace that people can continue
this conversation, can learn more about your work
and about EMC3?
Best place probably is to just go to LinkedIn.
You can find me on LinkedIn for sure.
Excellent, and thank you so much for joining us.
I will say that there is actually,
on the Los Alamos website also,
a little bit more about the Efficient
Mission Centric Computing Consortium,
if I can read it off properly.
And so people could do that.
And you mentioned students coming in as well.
Is this something that people can get involved in?
Sure, it's mostly for other using sites and industry partners, but like I said,
we have a huge student program, the largest student program of all the DOE labs by a factor of four.
It's really big. And so we hire tons of students every year. So we really encourage the computer
science and computer engineering
and electrical engineering and mechanical engineering and physics and material science
and all that kind of stuff. But we hire an awful lot of students. And actually,
not just at Los Alamos, but across the DOE labs' student programs, you know,
we're talking about tens of thousands of students a year. So, you know, I highly encourage people, if they want to work on interesting problems,
to come as students.
Yeah, absolutely.
My oldest studied computer science and some of their friends actually went to work for
the national labs.
Others went to work for freaking Facebook and stuff.
But, you know, I mean, to each their own. And so it's pretty cool that people can get involved in some of this cutting-edge research.
Well, thank you so much for joining us. Scott, what's going on with you and with Solidigm lately?
You know, we're still having fun. We've got solidigm.com/AI.
Some really cool innovations were introduced at GTC, so go take a look at that if you haven't
already.
And we're continuing to just put things forward, keeping a nice solid focus on that.
Me personally, I'm having a lot of fun having these kinds of conversations.
If you feel like checking in or following me around, it's scott.shadley on LinkedIn or smshadley on
both the former Twitter and now Bluesky.
Excellent. And as for me, you'll find me as SFoskett on most social
media, including LinkedIn, as Gary and Scott both said.
I'd love to hear from you. And we recently had our AI Infrastructure
Field Day event. So if you're interested in sort of the infrastructure
underneath AI, maybe check that out.
Thank you for listening to this episode of the Utilizing Tech podcast series.
You can find this podcast in your favorite applications as well as on YouTube.
If you enjoyed this discussion, please do give us a rating, a nice review.
It really does help people to find us.
This podcast is brought to you by Solidigm and by Tech Field Day, part of the Futurum Group.
For show notes and more episodes, head over to our dedicated website, utilizingtech.com,
or find us on X/Twitter, Bluesky, and Mastodon at Utilizing Tech.
Thanks for listening, and we will see you next week.