@HPC Podcast Archives - OrionX.net - @HPCpodcast-88: Mike Heroux of Sandia on ECP, HPC Software
Episode Date: August 19, 2024
Dr. Mike Heroux joins us to discuss HPC software in general and the Exascale Computing Project (ECP) software efforts in particular. Topics include performance vs. portability and maintainability..., heterogeneous hardware, the impact of AI on workloads and tools, the emergence of Research Software Engineer as a needed role and a career path, the convergence of commercial and HPC software stacks, and what's on the horizon.
Audio: https://orionx.net/wp-content/uploads/2024/08/088@HPCpodcast_MIke-Heroux_ECP_HPC-Software_20240819.mp3
Transcript
For us, performance is a very high priority.
And so we pay attention to any new architecture features that would help with performance.
Take a step back and say, okay, not only are these AI processors the best way of getting
performance for a reasonable cost, but they offer us new ways of considering
how we might formulate the problems that we're solving.
Within the software ecosystem that we've constructed already,
we can handle a variety of heterogeneity
in software libraries and tools and applications
as we go forward.
From OrionX in association with InsideHPC, this is the @HPCpodcast.
Join Shaheen Khan and Doug Black as they discuss supercomputing technologies and the applications,
markets, and policies that shape them.
Thank you for being with us.
Hi, everybody.
It's Doug Black with the @HPCpodcast with Shaheen Khan of OrionX.net.
And Shaheen, today we have a special guest, Mike Heroux.
He is a senior scientist at Sandia National Laboratories,
and he's scientist-in-residence at St. John's University in Minnesota.
He's been with both of those organizations for more than 25 years. Now,
directly relevant to us, Mike is the Exascale Computing Project's Director of Software
Technologies. And while that project has been completed, Mike's software work has continued
to receive funding. And by the way, earlier in his career, Mike was with SGI and Cray. His focus is on all aspects of scalable scientific and engineering
software for parallel computing architectures. So Mike, welcome. Thank you very much, Doug. Happy
to be here. Great to be with you. So you've worked on supercomputing software for decades.
Share with us some thoughts, if you would, on what's distinctive about developing applications for
supercomputers. Yeah, I think probably it's not a difference in a kind of binary sense,
but it's a difference in the degree of attention we pay to the computer architecture and the devices
and the connection of those devices to each other for the purposes of getting
as much performance as possible out of the computer. Certainly people who develop mainstream
software don't want their software to run poorly or slowly, but for us, performance is a very high
priority. And so we pay attention to any new architecture features that are being added to a system that would help with performance. We pay attention to algorithms that could run faster or be more robust in the presence of parallelism and offer new ways of solving problems that couldn't have been solved before. And we also care even about reconsidering problem
formulation. Maybe an algorithm that was really good at giving an answer for a lower performance
type machine could be replaced by something that has more parallelism inherent, more concurrency
possible. So we pay a lot of attention to performance issues, again, more so than I think the average mainstream software
developer would have. Certainly in some other fields, low latency is important and things
like that, but it is a hallmark of the kind of work that we do. And because we care about those
things and because new systems that offer better performance usually offer it through some kind of architectural
innovation or a software innovation.
We have to adapt our software.
Not only do we have to get it to compile on a new computer system, usually also it has
a new compiler or a new version of the compiler that may be a little bit buggy itself.
We're also considering new algorithms and
considering new problem formulations. And so all of those things lead to, I think, why we often
call this the bleeding edge, in that a person who works in the high-performance computing field,
if they want to go from point A to point B and they see a path to get it through their own work,
they're often disrupted because there
are impediments that arise due to software or architecture features that they didn't foresee.
And so it may take a really long time to get from A to B because we are working in such an evolving
and disruptive type of environment. Okay, so getting directly into exascale, you began with the Exascale
Computing Project in 2017. And we know a core element of ECP's strategy was the generation of
capable, usable, and useful exascale-class systems. So what were some of the key challenges of your project work for ECP in developing code
for systems as massive in scale as Frontier, or all three systems?
So there were a few things.
First of all, the notion of having a capable, useful, usable exascale-class system, it
has several components to it.
One is we've seen supercomputers built over the years that were really good at getting
a good LINPACK number.
And so the LINPACK benchmark is a really small piece of software.
It's very compute-friendly relative to, say, bandwidth and latency and even the amount
of memory you might put on a node.
It's very forgiving.
And so you can have a system that
you build that gets really good LINPACK performance results and shows up high in the TOP500,
but then may not be so useful in a broader sense. And it may not be useful because the computer
architecture, the bandwidth and the latencies and the amount of provisioned memory you have might
not be sufficient for other kinds of applications. And that's certainly one aspect of it. But the other is that the software
that is needed to make the system usable by lots of different applications isn't present,
and there hasn't been a budget for creating that software. It also can be that the IO subsystem,
the things that you need to get data into the
computer and off of the computer when you're done doing computation, may also not be sufficiently
robust and fast enough to handle the data rates that are needed.
And then the overall usability of the system, will it stay up long enough for you to get
useful work done in between checkpoints? And so the system itself
has to be really reliable and robust and be up and running on a regular basis. So these are some
of the things that go into this concept of a capable or usable system. The role of software
technology, the effort that I led for ECP, we had roughly 250 scientists working on the libraries and tools
within the software technology area. The distinction between that work and the work
that was being done in the application space is our work was intended to be reusable software,
stuff that other people used, not just ourselves, and could be used in a way that was scalable in the
sense that many people could use it. And so we had the role of building libraries and tools that
would help the application teams. If it's a performance tool, it would help them get insight
into performance bottlenecks, things that might be inhibiting their scaling. If it was a build tool,
those kinds of things. If it was something like a mathematics
library, mathematical library, like a linear solver or an FFT, those were kinds of libraries
we did. If it was in the IO space, it might be an IO library that gets data onto or into or off of
a system, or it could be a data compression library. Because even though we built a scalable system with robust IO capabilities,
the rate at which we could bring data into the machine and push it back out, relative to compute,
is always getting worse just because of how the markets behave.
And so we invested in data compression capabilities that reduce the amount of data
that you had to store and still have high fidelity in
that data. And then the software ecosystem itself, we worked on container software capabilities.
We invested a lot in Spack, which is a package management system that has become very popular
in the high performance computing world. And then we had the kind of the collection of all
our capabilities into something called E4S, which is this curated
portfolio of software that we use to support and deliver all the capabilities that we produced
in the software technology area out to the users, out to the world that wanted this kind of software.
Mike, you pointed out some of the gap in performance, IO versus processing. What is the outlook in various roadmaps?
Are we going to continue to see such a big gap between processing speed and memory
access, albeit HBM is trying to address a little bit of that, all the way to IO, which
has always been a problem?
Yeah, I don't see any kind of disruption that would make that kind of work easier to do. Of course, we see now in our hardware, our GPUs are really AI processors.
And so they present to us opportunities for low precision hardware arithmetic.
And that helps with data transfer, performance, things like that.
But then you also have to have algorithms that can tolerate lower precision arithmetic. And that's an area of study. And so to the extent that we can revisit
the implementation of our algorithms or consider new algorithms that can operate at lower precision,
we can make some progress in that space simply by having fewer bits to transfer
per numerical result. So that's one way.
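To make that concrete, here is a minimal, illustrative sketch of the lower-precision idea (a hypothetical toy example, not ECP software): the vectors are stored and moved in 32-bit float, so half the bytes of a double-precision version are transferred per numerical result, while the accumulation is carried in 64-bit double to protect accuracy.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Illustrative mixed-precision dot product: the data are stored in float
// (4 bytes per value moved from memory) while the running sum is carried
// in double to limit round-off growth.
double dot_mixed(const std::vector<float>& x, const std::vector<float>& y) {
    double sum = 0.0;  // high-precision accumulator
    for (std::size_t i = 0; i < x.size(); ++i) {
        sum += static_cast<double>(x[i]) * static_cast<double>(y[i]);
    }
    return sum;
}

int main() {
    const std::size_t n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 1.0f);
    // Half the memory traffic of an all-double dot product,
    // with the final result still accumulated in double.
    std::printf("dot = %.1f\n", dot_mixed(x, y));
    return 0;
}
```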
But beyond that, I don't see us necessarily getting at this growing gap of memory performance versus compute performance, and IO performance versus memory versus
compute performance.
I don't see us fundamentally changing the challenges in that area. A related topic is heterogeneity.
The HPC community is eager to try whatever can make things go fast, and we'll figure it out. And we
have sufficient technical depth to do it. When it goes over to the commercial enterprise side,
the appetite for that is a little bit less, but then we are seeing the advent of AI
drag them into the accelerator world in a way that exposes that complexity. But there's more
heterogeneity where that came from, including emergence of quantum computing, let alone
accelerators of 20 different varieties. What are the labs doing? What is the Exascale project doing
to bring all of that in and simplify it?
So a few things, and ECP accelerated this, but more than a decade ago, we foresaw the need to
be able to write portable high concurrency software, meaning that we needed to be able
to express concurrency at a loop level in a way that could be compiled to existing heterogeneous systems and to future systems.
So to be very concrete, I'll use one example.
There is an ecosystem built around a project called RAJA at Lawrence Livermore National Laboratory.
There's another one that started off at Sandia. I was part of the original efforts, but it's also expanded now to include
lots of other laboratories and other organizations well beyond Sandia, and it's become a community
software project. And so I'll talk about Kokkos because it's the one that I know best and
seems to be more broadly used. But the basics of Kokkos are it allows you to express a parallel
for loop, a do loop in Fortran, a parallel reduce, which is like a dot product,
and then a parallel scan algorithm, which is like a dot product with intermediate results.
It allows you to compute in parallel algorithms that are otherwise not possible to do in parallel.
That's really cool computer science stuff. But anyway, it does those three kinds of loops and
more elaborate versions of them in a way that can be expressed by the programmer,
say a computer scientist or a chemist or a physicist who's writing code in a way that exposes the parallelism,
exposes the concurrency, but doesn't tie it to a specific type of compute processor.
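To make the three patterns concrete, here is a minimal sketch of parallel_for, parallel_reduce, and parallel_scan in Kokkos (a hypothetical toy example, assuming a standard Kokkos installation; the same source can be built against CUDA, HIP, SYCL, OpenMP, or serial backends depending on how Kokkos was configured):

```cpp
#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int n = 1000;
        Kokkos::View<double*> x("x", n);

        // parallel_for: the parallel counterpart of a C for loop / Fortran do loop.
        Kokkos::parallel_for("fill", n, KOKKOS_LAMBDA(const int i) {
            x(i) = 1.0;
        });

        // parallel_reduce: combine per-iteration contributions, e.g. a dot product.
        double dot = 0.0;
        Kokkos::parallel_reduce("dot", n, KOKKOS_LAMBDA(const int i, double& sum) {
            sum += x(i) * x(i);
        }, dot);

        // parallel_scan: a prefix sum, keeping the intermediate partial results.
        Kokkos::View<double*> prefix("prefix", n);
        Kokkos::parallel_scan("scan", n,
            KOKKOS_LAMBDA(const int i, double& partial, const bool final_pass) {
                partial += x(i);
                if (final_pass) prefix(i) = partial;
            });

        std::printf("dot = %f\n", dot);
    }
    Kokkos::finalize();
    return 0;
}
```

None of these loops names a specific processor; the backend chosen at build time decides where they run.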
And in particular with ECP, NVIDIA GPUs were around.
We used them all the time to do development, and we made sure our software worked on them.
But we were also targeting GPUs, accelerators from AMD and Intel. And we had to write our codes
in a way that was portable across those three target architectures, in addition to using ARM CPUs, vectorizing CPUs,
and with an eye toward the future to data flow machines that are emerging in various settings.
And so a lot of the work of ECP, especially in my area in the software technology with
reusable libraries and tools, was done with this mind that we need to have portable performance
across all these different systems. And so we've done that. The software stack that we provide
is portable across all of these different architectures and is performance portable
and allows that, if you have a particular hotspot and you need to get the most out of that on a particular GPU device,
you can do it in a way that only that special kernel needs to be written in a custom way.
And the rest of your code that needs to work pretty well in parallel can still be done
using these portability layers. And so that's a big part of what we've done so far. So that's
part of the heterogeneity story is that portability
across the existing GPU architectures from NVIDIA, AMD, and Intel is a big part of it. ARM processors,
emerging data flow types of processors. Now you mentioned Quantum. Quantum is a very different
type of processing device. From our perspective, at least from a software stack perspective,
we view it as another attached device with a different instruction set architecture.
Architecturally, in terms of software, we already know how to handle that. We already handle GPUs
as discrete devices. So it's not a fundamentally brand new thing in terms of software architecture.
Of course, what is truly unique is the kinds of algorithms
that you want to implement for a quantum device.
And then what is the programming language that you're going to use?
And how do you compile that code?
That's all emerging as we go.
But it's not a fundamentally different software architecture.
And so we're confident that within the software ecosystem that we've constructed already,
we can handle a variety of heterogeneity in software libraries and tools and applications as we go forward. So Mike, looking ahead, we hear about OLCF-6, the next generation of leadership
systems at the labs. I assume there's a lot of direct relevance in the work you did for ECP
with OLCF-6, but maybe not. Or are there areas where new work will have to be done?
I'm sure there'll be new work that has to be done simply because even if it's along the same path as
what Frontier was, there's always new work to be done, in part because a lot of the performance will probably
come from increased demands for parallelism, because we're still limited by
the speed of light and our latencies aren't improving in any real way. And so more concurrent
execution, more parallelism is the way that we get performance from these machines. It'll be
really interesting though, because the emergence of AI means that we have opportunities in the scientific computing area
to take a step back and say, okay, not only are these AI processors the best way of getting
performance for a reasonable cost for our traditional double precision and even single precision
computations, but they offer us new ways of considering how we might formulate the problems
that we're solving to take advantage of inference engines as core components of how we do scientific
discovery. So we may, and in fact, this work's been going on for some time. It's not brand new, but I think it will grow. And so what we will see with OLCS6, because I don't know any details
about it, but presumably it's going to have some kind of very rich AI type processor in it. And
we will be able to utilize those to build up novel approaches to solving scientific problems using deep learning formulations
of the problems that we're trying to solve. What comes with that is how do you know that your
answer is right, all these kinds of things. But part of that is just building a level of comfort
with AI capabilities. And part of it is building in more validation and verification types of
approaches so we can detect if our inference engine is not doing a good job or has gone off the rails.
But those are some of the ways that I see things changing going forward.
The other thing that I see changing going forward is the distinction between what's an on-prem, on-premises, on-site computer versus what's a cloud system. Say, the difference between
AWS and Frontier. Right now, those software stacks and the interfaces that are used to do something
on AWS versus Frontier are pretty different. But I think over time, especially in the next
generation of large-scale systems, again, I have no special knowledge, but I just see that there are market forces making this not only possible, but really
essential: the APIs that we're using on-prem should be, to the user, nearly
indistinguishable from what we use in the cloud. Or at least we want to move in that direction
because our users, the application teams that are trying to solve scientific problems, are going to expect that if they do
something in the cloud and they have a workflow, they have a set of scripts and workflow management
tools, that they're also going to be able to use those tools on a system at a leadership computing
facility in a way that is nearly identical, as nearly
identical as possible, and that the difference between an application team using something
in the cloud versus on site should be as minimal as possible.
Do you think, Mike, that would present some challenges in what you alluded to early in
our conversation, the fervor for performance,
almost to the exclusion of portability and maintainability?
That's a really good point. However, at the same time, we have seen use of containers,
and we don't see a degradation in performance. And in fact, while maybe the latest GPUs are hard to come by in a cloud environment, GPUs are there and the performance is real.
And we see our application teams going on to these cloud-based services and realizing really good performance.
And so we can take our software and we can compile and run it in the cloud in a way that doesn't compromise performance,
especially compared to how things might have been in the past.
Have there been any actual tests to see?
Here's how I'm using cloud XYZ,
and here is what I would do on-prem in a traditional, HPC-oriented, performance-first
kind of a model.
Yeah, I can't point to anything specifically, but I've seen a
lot of anecdotal evidence that a person can expect pretty good performance from a cloud-based system
if they provision it ahead of time with a good network setup. Yeah. And in fact...
Yeah, the config has to be there. Yes, exactly. The configuration has to be there.
In fact, and maybe we'll get to this, but one of the things that's emerged at the very
end of ECP is this High Performance Software Foundation, HPSF, which I view as a very important
signal of the relevance of high-performance computing to the broader industry and a non-DOE,
non-governmental marketplace, and that they care enough about high-performance
scientific software that they're interested in having a foundation that supports open-source
software focused on high performance. And so I think we're seeing a market trend that introduces
both the demand for high performance and the tools and libraries that
can provide that for the user community. So I foresee that the distinction between
cloud and on-prem in terms of performance and capabilities, the distinction will disappear
over time, because there is so much demand for high performance.
Mike, why don't we get into that for a moment.
We know ECP has been completed, but you received additional funding for the
work that you're doing.
Could you specifically talk about that project work?
Yeah, sure, sure.
Yeah, the project is called PESO.
It's a bit of a tortured acronym, and it stands for Partnering for Scientific Software Ecosystem
Stewardship Opportunities.
And so what the Office of Science, the Advanced Scientific Computing Research Office, decided to do as the Exascale project was completing was they certainly were committed to stewarding and advancing the ecosystem that was developed under ECP. And so that means the libraries and tools,
the applications that were developed, and then the software stack, which is E4S along with a very
capable set of features from SPAC, all of which were developed during the timeframe of ECP.
And so there is an effort going on. Now, this effort is much smaller than what ECP was. ECP was,
it was, I think, roughly $1.8 billion, the often-quoted total project budget. The efforts
that we're talking about here are in the tens of millions total of budget, so much smaller budget.
But at the same time, we're trying to organize ourselves to leverage other activities that are
going on in the scientific ecosystem.
For example, HPSF is now up and running. I mentioned E4S, one of the products that is
supported by the PESO project. So I'm the PI, along with Lois McInnes, of the PESO project.
E4S is our primary product, but we do other things as well. But remember, E4S is a scientific software stack.
It contains programming models, things like support for MPI, for Kokkos, for RAJA, for
the math libraries, for HPCToolkit and other kinds of performance tools, IO libraries like
ADIOS.
And now all of the teams who are developing these products are getting funding from other
sources, but we pull it all together and we do quarterly releases of E4S.
So we curate the specific versions of those libraries and tools, and we make for a robust
software stack that application teams and library development teams can depend upon as being reliable, robust, portable, available on all the
leadership class systems, available in container environments, available in the cloud on AWS
and Google Cloud. And so we're working on this full set of available software that can be used. And
so that's what we're doing in post-ECP. Would we like a bit more money for doing that? Yes.
But we're able to make reasonable progress even as it is.
Also part of this effort is something called CAS, the Consortium for the Advancement of
Scientific Software.
That is an umbrella organization that's pulling together the Peso team and a bunch of other
teams that are doing libraries and tools work under a single aggregate
umbrella. And so we're evolving the approach that we're using to develop and support scientific
software. And exactly how this organization or this collection of organizations is going to go
forward is itself still evolving. We're still establishing the charter and the bylaws for how the CAS, how this
consortium will work, but we're making reasonable progress and we'll keep plugging away at it.
We hope for a large mission-driven type of project coming to DOE, maybe around AI for science and
things like that. We see that in the news. And so I think if something like that
comes to the laboratories, then I think that gives us additional resources to further sustain and
expand the software stack as we go. But we'll see. We'll see how it goes. Right now, we've got
funding. We have people. We have the opportunity to carry things forward for the next several years. Peso itself
is a five-year project and we're working on it every day. That's excellent. I think the
software side of HPC in general is just a gap that continues to be a gap and needs
attention. So it's great to see it. Where do you think, including the projects that you are leading, as well as what the rest of the industry and the community is doing,
where do you think we are in terms of that kind of portability and performance across multiple GPUs?
Is that a solved problem at this point? Or is this still a problem if somebody's code is really optimized for a GPU and then you
want to move it to another? Yeah, we're not anywhere close to all the way there for sure.
One thing I didn't mention is OpenMP. OpenMP has this so-called target offload functionality.
It's still being made rigorous in terms of its capabilities. There are some codes
from within ECP that relied upon it and had some struggle, especially towards the end. But it still
remains a viable approach. Of course, there's OpenACC. There are other approaches. There's CUDA,
there's HIP, there's SYCL. There are these kinds of vendor-connected approaches to doing parallel computing.
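For comparison with the Kokkos style, here is a minimal sketch of the directive-based target offload approach mentioned above, applied to a simple axpy loop (illustrative only; actual offload behavior depends on compiler support, and a production code would add error handling and tuning clauses):

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    const double a = 2.0;
    std::vector<double> x(n, 1.0), y(n, 1.0);
    double* xp = x.data();
    double* yp = y.data();

    // OpenMP target offload: map the arrays to an attached device (e.g. a GPU),
    // distribute the loop across its teams and threads, and map y back to the host.
    #pragma omp target teams distribute parallel for \
        map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (int i = 0; i < n; ++i) {
        yp[i] += a * xp[i];
    }

    std::printf("y[0] = %f\n", yp[0]);  // expect 3.0
    return 0;
}
```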
Again, I think Kokkos is a really nice model for getting performance portability in a way
that's pretty practical, but it's also heavily templated C++.
Not everybody wants to build that awareness of C++ syntax.
So there are other programming languages like Julia, for example, where there
are quite a few libraries that run well on GPUs, but we're really far away from having portability
in a whole bunch of software. But it's an essential thing, I think, going forward.
During ECP, we'd hear from people who are running a compute center, say for weather prediction or
other types of really important
computations. But their software stack is still focused on the CPU. They don't have an investment
in the redesign that would be required to take advantage of GPUs. And so they may have a two
megawatt compute center and they can't really increase their power budget. And so from generation to generation, the software stack is still CPU, so they can't get to the accelerated architecture, and I think they're stuck
in this low-GPU, CPU type of performance model that's going to be really hard to get away from
without a big investment in a redesign of their software stack, something like what ECP did. I
mean, ECP, you can think of it as just a massive lift of a software ecosystem that DOE cares about.
It's not even the whole ecosystem, but a significant piece of it from a CPU to GPU type of architectures.
And now that we have that small subset of DOE software working well on GPUs, the next generation of GPUs comes along.
We tweak the software, and it's going to make use of those
newer GPUs. We're on a new commodity performance curve and energy efficiency curve. And it's not
just at the high end. It's at the desk side as well, because the GPU is in the desk side system
all the way up to the very high end. That sounds like something AI might actually be able to help with. Yeah, I think so.
There's a real possibility that we can get some assistance from AI to transform the software stack.
Of course, people ask me, should I move to GPUs or not?
And my advice has always been, don't do it until you absolutely have to.
Because the longer you wait, the more stable the ecosystem will be around you to use, and the more tools that will be available to you to do the process of transformation.
So my answer to almost everybody is don't do it.
Yeah, it's a nuanced advice too.
It's don't do it because you are going to have to do it.
I'm doing it, but don't do it.
Mike, what about the experience of seeing software in operation on Frontier, this incredibly powerful system?
Do you have any anecdotes or just experiences to share this software that actually got running, applications running?
Yeah, well, I'd love to be able to tell you a story from the trenches.
But I've said often that I used to write software back not too many years ago, real supercomputing software. And then I wrote software by email and telephone calls to my postdocs and junior staff, right, never typing a semicolon. And for the past seven years, I've been talking to people who talk to people who write software. And so take what I
say with the caution that it deserves. But yeah, I mean, it's just, it's
delightful to see all of the, everything come together, right? This was a massive effort
involving a thousand plus people to produce the systems and produce the software, produce the applications that sat
on top of it, and then the crown jewel of these scientific results from it. And to see it all
come together is really, truly amazing. And there are some beautiful scientific results that have
come from this effort. There will be a lot more, right? In fact, I mean, kind of the tagline of the post-ECP era
is that it's the exascale era, right? We built these machines, we built the software, we built
the applications. There are a lot more applications that we want to get to effectively use these
systems, but we are in the exascale era. And for the next decade or so, we should be able to see
lots and lots of scientific breakthroughs
that come as the fruit of the efforts to build these systems and the software ecosystems
that run on them.
Mike, you mentioned data flow systems, or data flow processing.
Can you say more about the status of that and
where you see that helping?
Yeah, I can say a little bit.
Again, I'm not going to say anything that's under non-disclosure agreement or anything, right?
But I do see viable systems that have a data flow design,
meaning that data get injected
into a network of connected processors
and the data live within that network
and flow through it as computations are performed on it. So you can
think of it as like a transformation engine that takes data in, keeps it in, but spreads it across
and does lots of different operations on it before it finally leaves the networked collection
of devices. And there's a fair bit of evidence that this style of architecture offers a lot of
promise for some really important problems.
And so I think I see that as being the next wave of devices that are still pretty general purpose,
not totally general purpose, but they're going to address some important problems that we haven't been able to do as well as we'd like using AI-type processors. Those can do really well with their own problems,
but they're really targeting AI,
not the kinds of problems
that data flow architectures could solve.
So I'm really excited about data flow.
I think it's, again, a nice new front
and a set of new capabilities in the HPC space.
I mean, certainly for latency hiding,
it's a great model.
That's exactly right. Those are exactly the kinds of problems where latency is a really big deal.
Right. And if you have sufficient data to just stream through, then you also can absorb the upfront hits.
But then that starts making it sound seriously like FPGAs and other configurable approaches.
So do you see all of that kind of blending together
into some established model?
I do.
And I think it's the software.
So the compilers are going to be really important
as a part of this.
Anything I've seen, it's not just the hardware,
it's the compiler.
And the compiler's ability to reorganize
the flow of computation and data movement
is an important element to these types
of devices. One thing we talked about was the growing overlap between the commercial enterprise
software stack and mode of operations and processes and the HPC world. And in our pre-call,
I was saying that in the early days, you compiled, you linked,
maybe there was a math library and you were done. And now it's like a vast universe of software and
containers and GitHub and Spack. And even when Make came about, it was like, really? I don't
really need it. I can just do it. So we've come a long way. Can you speak to how that is impacting HPC software? We had a nice chat
about research software engineer as a title, as a career path that has emerged and the impact of
that on skillset, specialization, career path, organizational structure, process, all of that,
that is now more relevant in HPC than it ever was before.
How's that changing things?
Yeah, well, I think the success of HPC as a collection of enabling technologies is what
leads to this increase in complexity and increase in all the different tool sets and why it's
become more complicated and in fact, truly complex.
I mean, the behavior is more than the sum of its parts.
There is complexity we can't predict until we hit it.
And I just think that's a natural part of what it means to be successful as we go forward.
And so I don't see it as a bad thing or somehow we're doing things wrong.
In fact, it is just a sign that we're doing things right to some extent. Now, can we do a better job? Yeah, I think we can do a better job of managing
our complexity, of leveraging knowledge gained in the larger marketplaces that have more money than
we do to invest in software and can maybe take a step back and do a better job at design.
Because we're often fighting, we're racing against the clock to produce these new systems,
to get performance from them. We have to go quickly rather than invest in things that are
for sustainability. I think we've done a pretty good job with ECP. Many of the team
members, people who worked on the Exascale Computing Project,
remarked that this was the first time they felt like they could work on their software and the testing
and make sure it was really robust as a part of the work they were doing to make it available to their
users. And so we hope that kind of availability of time and funding will continue as we go forward
post-ECP so that we can make better products, better tools, and make them available to more
people.
And that's certainly part of the ambition for E4S.
What we're trying to do with E4S is provide this curated stack so that for the users of E4S,
it's simply available.
In many instances, it's already installed for them.
And so they can just point to it and build their application on top of it. If they want to rebuild
and customize some library configurations that they want to use, they can do that as well.
And then we also provide things like build caches that will allow you to suck in a library that was
built once before for you.
And if you need to rebuild it again, well, the binary for that build is already available.
And we see compile and link times go down by a factor of 10.
So there are lots of ways that we're able to improve the user's experience by focusing
on this curated software stack, along with what the vendors provide and along with what
application teams
pull in from other software providers or the open source community.
So we made a lot of progress.
ECP changed the culture, I think, for DOE in terms of investments in scientific software,
viewing software as a facility, not as just something that we do on the way to delivering a computer system.
I think that attitude about scientific software is persisting beyond ECP.
That's excellent. That's excellent. Let's touch on AI just a little bit more. It is bringing
its own software stack to the fore, right? And even some of what appeared to be an app is now becoming its own sort of part
of the tool chain, including LLMs and SLMs and other forms. To what extent, I mean, you mentioned
that HPC guys having enabled it are now actually using it too. I think it's going beyond precision.
How do you see all of that evolving and being internalized by the HPC community?
Yeah. So first of all, I'm jealous because of the amount of funding that's available for producing
that software. There's one statistic I saw from IDC where the estimated investment in AI
high-end systems would be $300 billion a year this coming year versus $10 billion for traditional modeling and simulation.
So a factor of 30, right?
To me, that says a few things.
One is if the AI community is interested in solving a problem, they're going to do it better, faster, and a lot more expensive than what we can do in the scientific computing community.
And so we should just let them do that.
Well, let them, they're going to do it anyway, right?
But we should respect that they're going to do that.
And we should then try to leverage that as best we can.
So wherever we can leverage the investments of the broader AI community
in the scientific community, that's a good thing, right?
And then we should look for where are the gaps that they're not paying attention to, either because it's an opportunity cost.
They just have other things that are higher priority or something they don't care as much about as we do.
We need to invest in those spaces and then leverage what they're trying to do.
I view the emergence of AI as being this dominant market capability or potential as a really positive thing.
I think the high performance computing community has always been parasitic in what
it does,
after the Cray days, the Cray vector multiprocessors, right?
Since then, we've utilized mass market components and pulled them together in a way that makes
for high performance computing.
So clusters were that way. GPUs were that way.
GPUs that are really AI devices are that way.
And so we've always taken what the broader computing community has produced
and said we can take that and synthesize it and use it as the foundation for what we do.
And also, we're a healthy parasite.
If you want to carry this analogy a little further, we're a healthy parasite in that
we do something helpful to the host that we're connecting to or attached to.
And that we say, if you tweak this little bit of a feature in your system, you could
get better performance from it too.
We've noticed, right?
We're trying to get high performance out of your devices.
If you organize your memory this way, or if you rearrange or change some aspect of your system or improve your compilers
in this way, you too can get better performance. And so I don't know, I view that relationship as
being something very interesting and valuable going forward. I'm excited about the advent of AI
as this huge market-driven set of activities.
Some more reason why the focus on performance is a good thing.
Yeah.
Because we are willing to do anything to get it.
Yes, that's right. Yeah, yeah, yeah. Exactly.
Mike, I might suggest the term symbiotic as opposed to parasitic. I don't know, just a thought.
Okay. All right. All right. I don't know, I like the word parasitic.
It gets people's attention when I say it.
But in a good... you mean in a good way though?
Yeah, absolutely. Yes, I mean it in a good way.
Yes, it does have that quality. It does get better attention, for sure.
Excellent. So one other question, and I know we're pushing against the time,
but if you have time, we can. Yeah, I have time.
It's SC24. Where can we catch up with all of this at SC24? And if you have any particular
programs that you want to highlight, we'd love to know about it.
Yeah. So the Department of Energy will have its booth this year again at SC24.
And so you'll be able to get a lot of information about the things that I've talked about that
are DOE-related at that booth.
You will also see quite a few tutorials and papers that are related to the work that the
Exascale Computing Project has done.
We sent out an invitation to our teams at the end of the
Exascale Computing Project to contribute to some special issues of The International Journal of
High Performance Computing Applications. It's a journal that focuses on HPC. The response to that
was tremendous. We had more than 40 articles contributed to that, all from the Exascale
Computing Project teams. So that shows you the volume of work that has come out of the Exascale
Computing Project. So you will see in the technical program at SC lots of papers related to the work
that had been done under the Exascale project. You'll see it in the tutorials.
You will see it in the workshops as well.
And you'll also see things that are maybe non-traditional,
like we mentioned RSEs.
There's the HPC RSE workshop that's going to be at SC24.
It's been there for a few years.
Certainly the people who organize that don't think of themselves as ECP people, but I believe that the Exascale project helped reinforce both the importance of that kind of work and
also fed in skilled people to that community and that effort will persist.
And so in addition, of course, at SC24, you'll see a lot of the impact of the work that the
Exascale project did in the scientific results
and in the beautiful graphics that other people will be displaying in their booths on the exhibit
floor. So I think you'll see a lot of impact from what ECP did at SC24. Well, we're certainly seeing
a lot of news coming out of the Office of Science about these applications in action.
So, yeah, we'll certainly look forward to that at SC in November.
And I just, in general, see, well, as we all know,
I believe AI is a subset of HPC, albeit a really big one.
But still, it's HPC skill set, HPC infrastructure,
HPC workflow, HPC culture, right?
Yeah, it is.
And having enabled it, we are now using it, because that was the idea all along.
But when I look at the SC24 conference, the core of it is still HPC, with AI kind of a quick second, right?
Would you see it that way?
Yeah, I think that's true, but I think that's changing. So again, within E4S, we actually have a fairly robust stack of AI libraries and tools. We package things like TensorFlow,
Horovod, the distributed layer that sits on top of the device layers, which is hard to install.
And so the fact that we curate those
libraries and tools into an integrated stack, it can be meaningful to some people, right? And so
I think what we'll see as time goes on, that the distinction between what's AI and what's HPC,
it'll blend. I'm not sure the distinction will be as easy to perceive as we go forward. And especially as scientists start to
learn how to use AI approaches for scientific discovery. Interesting. You just gave me an idea,
but do you remember those old commercials of Will It Blend?
Vinegar and water? No, they actually had an actual blender.
And then I don't know whether it was,
I think the guy who did it was being paid to do it or something,
but he would throw anything in the blender.
And the question was, will it blend?
So he would buy a brand new iPhone and throw it in there and see.
Oh my God.
I don't think I've seen that.
It was pretty funny.
I bet it was. Yeah.
Anyway, one more thing I need to take away from this.
Yeah, there you go. There you go. Yeah. I don't know. To me, I think HPC and AI will blend like bananas and strawberries in a smoothie. It'll be better than either one alone, I think.
Excellent. Excellent analogy.
Yeah. Very good.
Good stuff. All right. Thank you, Mike. Pleasure. Pleasure to have this conversation. Look forward to catching up with you in person at SC. And I'm still working out what my next arrangements are. They're almost in place, but I will be moving into the kind of consulting phase of my career.
I will still be involved in the PESO project, that's for sure.
But it won't be as a Sandia staff member.
I see.
Well, congratulations.
That's amazing.
Yeah, I'm looking forward to it.
I've been at Sandia for more than 26 years.
And so it's time to work a little less and still be engaged
and do only the things that I want to do. You get to be more picky.
Nice. I get to be picky. Exactly. Congratulations on that. Yeah, thanks. Okay. All right, Mike,
thanks so much for your time. You've been a great guest. Yeah, my pleasure. Thanks for
the opportunity. I really appreciate it. Take care. Thanks a lot.
That's it for this episode of the @HPCpodcast.
Every episode is featured on InsideHPC.com and posted on OrionX.net.
Use the comment section or tweet us with any questions or to propose topics of discussion.
If you like the show, rate and review it on Apple Podcasts or wherever you listen.
The @HPCpodcast is a production of OrionX in association with InsideHPC.
Thank you for listening.