@HPC Podcast Archives - OrionX.net - @HPCpodcast-70: Paul Messina – Journey to Exascale
Episode Date: October 17, 2023
With the annual observance of Exascale Day on October 18th, we were delighted to get a chance to discuss the journey to Exascale with Dr. Paul Messina, who led the Accelerated Strategic Computing Initiative (ASCI) program from 1998 to 2000 and was the first director of the Exascale Computing Project (ECP) from 2015 until late 2017.
Audio: https://orionx.net/wp-content/uploads/2023/10/070@HPCpodcast_Paul-Messina_Journey-to-Exascale_20231017.mp3
Transcript
And so the expertise that's been gained by all those people will pervade the nation's
scientific computing for the next decade or two. That to me is a fantastic accomplishment.
The four themes were applications and algorithms, and I believe I had Geoffrey Fox leading that
part of the initial workshops; device technology; architecture, which Seymour Cray led;
and software technology, which was led by Ken Kennedy.
What a cast. Beautiful.
Yeah, it was just a wonderful experience.
Now, there has been concern for national security reasons in the United States,
as you mentioned, about sharing technologies.
From OrionX in association with InsideHPC, this is the @HPC podcast. Join Shaheen Khan and Doug
Black as they discuss supercomputing technologies and the applications, markets, and policies that
shape them. Thank you for being with us. Hello, everyone. Shaheen, great to be with you again.
Delighted to be here. What a special guest. I'm really proud to have this opportunity. Yeah, as we're looking
at Exascale and the annual observance of Exascale Day on October 18th, we have with us Paul Messina,
really one of the HPC community's luminaries, with a distinguished career dating back to the early 70s. After earning his
PhD at the University of Cincinnati, Paul joined Argonne National Lab in 1973. He was involved in
building a programming language for the original Cray-1. At Caltech, he was the director of the
Center for Advanced Computing Research. In 1999 and 2000, he led the Accelerated Strategic Computing Initiative,
ASCI, at NNSA. And among other things, he was the first director of the Exascale Computing
Project starting in 2015, where he stayed until late 2017, fulfilling a two-year commitment that
Paul made when he came on board at ECP. He is currently consulting on various projects.
Paul, welcome. We're delighted to have you with us today.
Thank you, Doug. My pleasure.
Okay. So looking at Exascale, the Exascale Computing Project, a project that has really
just been completed, as I understand it. Let's start by talking about your overall assessment of the
American Exascale effort. You were involved in the initial planning stages,
and you've seen, we've seen, at least the first system, Frontier, come to fruition.
Would you say the Exascale project in this country has been a success? Indeed, I would. I feel the American Exascale project has been very successful. As I'm sure
you're aware, from the beginning, we took a holistic approach to how the project should
evolve so that it would involve concurrently the applications development for running on Exascale,
developing the software infrastructure for Exascale systems that would serve the
applications' needs, and working with companies to improve their commercial products, not one-off things for us, but their
future commercial products so that they could achieve Exascale. And so this holistic approach,
I believe, has been very successful. As you mentioned, the Frontier system is operational
and is an Exascale system indeed. And there are real important applications running on it now. Two other exascale systems
are well underway. The Aurora system at Argonne National Laboratory and El Capitan at Lawrence
Livermore National Laboratory will soon be operational as well. And one of
the important things, in my view, that the project has achieved, in addition to having an operational system already, is, well, one, the human resources. Well over a thousand people were involved and
have been involved in the ECP, from many different institutions: from 16 national
labs, from something like 35 different universities, and I don't remember how many industry partners. And so the expertise that's been gained by all those people will pervade the
nation's scientific computing for the next decade or two.
That to me is a fantastic accomplishment.
Secondly, the software stack that has been developed by the ECP is an important accomplishment
in my view for two reasons.
One, it works on the existing Exascale
system and is working on the early versions of Aurora and El Capitan. But in addition,
it's a software stack that's meant to support HPC basically throughout the capability range,
not just on Exascale. And that, too, is an accomplishment that should help the nation and the world in
dealing with HPC because it's a problem if you have to switch your software environment when you
get to the highest levels of capability. So having a software stack that can be used on,
I'll say, mid-size HPC systems as well as the fastest is a great accomplishment.
Then the goals of the architecture and the component
technology have also been achieved in terms of reasonable, in quotes, power consumption,
given the fact that Dennard scaling ended long ago and Moore's law has been tapering off for
some time. And yet we have Frontier, whose peak consumption is only around 23 megawatts, compared to what it would have been if we hadn't partnered with multiple computer companies as part of the ECP, through the PathForward funding that we provided.
Paul, tell us about the initial Exascale vision.
When did your attention start being directed to this area of Exascale? And I'm
interested in the concerns or doubts regarding whether a practical, usable, and affordable system at
this scale could be stood up. What were the key challenges on your mind?
Well, so I first started paying attention to Exascale in 2007. And it's because three other people had organized town hall meetings in 2007
to look into Exascale. So Rick Stevens, Thomas Zacharia, and Horst Simon from Argonne, Oak Ridge,
and Berkeley Laboratories. So they deserve the credit for getting that started. In late 2007 and then in 2008, I started discussing the Exascale
vision with both them and the people at DOE headquarters, Steve Binkley and Barb Helland
and Michael Strayer in particular. So what were the concerns that we had at the time? Well,
certainly, again, the power consumption, because if one did the simple arithmetic of using component
technology available at that time, say 2008, and how many components you would need to achieve
exascale just as a peak, it was a huge number, talking about gigawatts, completely unrealistic
for a real system. So that was a concern. Now, I will say that, since I was heavily involved in planning the petaflops workshops to explore the issues, I
felt that it would be feasible in about 10 years to achieve the petaflops goals with a reasonable
power structure. So that's not to say that in 2008 we were not concerned about power for exascale,
but I and others who worked with me on petaflops felt that, although it would take
a lot of effort, it could be reached.
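[Editor's note: a back-of-the-envelope version of the arithmetic described above, as an illustrative sketch only. The efficiency figures below are rough assumptions (on the order of 0.5 gigaflops per watt for leading systems around 2008, versus roughly 50 gigaflops per watt for a Frontier-class system), not numbers quoted in the episode.]

```python
# Illustrative back-of-envelope only; the efficiency figures are rough assumptions.
PEAK_TARGET = 1e18             # 1 exaflop/s, peak

GFLOPS_PER_WATT_2008 = 0.5     # ballpark for 2008-era leadership systems (assumed)
GFLOPS_PER_WATT_FRONTIER = 50  # ballpark for a Frontier-class system (assumed)

def power_megawatts(gflops_per_watt: float) -> float:
    """Power needed to reach the peak target at a given efficiency, in megawatts."""
    watts = PEAK_TARGET / (gflops_per_watt * 1e9)
    return watts / 1e6

print(f"Naive 2008 extrapolation: ~{power_megawatts(GFLOPS_PER_WATT_2008):,.0f} MW")      # ~2,000 MW, i.e. gigawatts
print(f"Frontier-class efficiency: ~{power_megawatts(GFLOPS_PER_WATT_FRONTIER):,.0f} MW")  # ~20 MW
```

Under these assumed figures, a peak exaflop with 2008-era technology lands in the gigawatt range, while Frontier-class efficiency brings it down to tens of megawatts, consistent with the roughly 23 megawatts mentioned earlier.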
The second one was, as often the case, the software and algorithms: would we need to completely redo algorithms with the massive
increase in parallelism that would be needed? Because we knew even then that it wouldn't be a matter of having really
fat nodes whose individual CPUs were extremely fast. No, the way we would achieve exascale would
be by having millions of operations going on concurrently. So could we achieve that level of
parallelism with algorithms, taking into account the interconnect, the internal network, and I/O as well?
Those were big challenges.
However, around that time, in the late aughts, 2008, 2009, people started looking at the implementations
of what was then the leading programming model for parallel computing,
namely the message passing interface, MPI, and were able to demonstrate million-way parallelism
with high efficiency with appropriately crafted implementations. So it wasn't the parallel programming model that was the issue;
it was the implementations.
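[Editor's note: a minimal sketch of the MPI programming model being referred to, written here with the mpi4py Python bindings purely for brevity; production exascale codes typically use MPI from C, C++, or Fortran. The point it illustrates is that the source code is written the same way whether it runs on 4 ranks or on millions, so delivering high efficiency at extreme scale falls to the MPI implementation underneath.]

```python
# Minimal illustration of the MPI model (run with e.g.: mpiexec -n 4 python sketch.py).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's id, 0 .. size-1
size = comm.Get_size()   # total number of ranks launched

# Each rank owns its own slice of the (notional) global problem.
local = np.arange(rank * 1000, (rank + 1) * 1000, dtype=np.float64)
partial_sum = local.sum()

# A collective reduction combines the partial results across all ranks.
# The call is written identically whether 'size' is 4 or in the millions;
# achieving high efficiency at scale is the job of the MPI library underneath.
total = comm.allreduce(partial_sum, op=MPI.SUM)

if rank == 0:
    print(f"{size} ranks computed a global sum of {total:.0f}")
```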
So there was some confidence of being able to achieve exascale reasonably, but definitely a recognition of the
goals being quite ambitious. It's definitely been, in my view, a smashing success and an indication
of how this country can get really big things done at that level. But then that also puts it
in the context of geopolitics, and especially as advanced technologies in general and supercomputers in particular become matters
of national security. One of my concerns has been its impact on scientific collaboration
and open communication and sort of chilling that process. But I know, of course, that the system is
also resilient in terms of advancing science. I'd love to get your perspective on it.
My approach has always been that collaboration is important.
Almost everything that I've had a part in accomplishing relied on collaboration. And
international collaboration is very common in the scientific world and even in the technology world.
A lot of great advances have taken place that way. Think back to the 1970s. The Transputer
was a British development. And somewhat later, I guess a couple of decades later, some Israeli researchers and companies
developed highly efficient optical switches. Or collaboration with Japan on some software issues
for AI in the 80s and 90s, as another example. So definitely, I have both been involved in collaborations with
many countries and feel that they're important. Now, there has been concern for national security
reasons in the United States, as you mentioned, about sharing technologies. And in most cases,
I have felt, and I have said to government, that there's no hiding these things. Early on
in parallel computing, I met with some people in parts of the government who said to me, well,
we have to restrict the highest performing systems from being exported. And I replied,
well, that really won't help because people will take commodity parts and put together highly parallel systems and achieve
the capability they need that way. And their comeback was, well, parallel computing is just
too complicated and other countries won't do it. And I basically laughed. Now, to their credit,
a couple of years later, the same people came back to me and said, yeah, you were right. Actually, we saw in a physics research lab in China that they had this, you know, large
Beowulf running and getting things done.
You know, they were able to not only build a system, but more importantly, to program it so that
they could get their work done.
So, you know, there are some exceptions, you know, some things I do think need to be guarded.
But much of the advancement is going to be open, or so easily deducible that
it can be deduced from trends in technology by people in any part of the world.
I think the thing to do is usually to try to run faster.
By that, I mean to advance fast the new technologies.
And that's how the U.S. has been successful for five decades that I know of.
Don't look back, just keep running. Keep running, that's absolutely right. In that vein, what is
your view on the top 500 list and the fact that it seems like more and more people aren't
participating, but then some people are? Well, the top 500 list has been useful in having real numbers,
even though their applicability is, of course, narrow, but nevertheless, having real numbers
on certainly the peak speed, power consumption, and, by the way, the performance on High Performance
LINPACK, that one benchmark program. The participation has always been voluntary. It's up to an institution
or a manufacturer to decide whether they want to submit data to be included in the Top500,
or the top 100 for that matter. I suspect that there have always been institutions that didn't submit
but could have ranked quite well. So 10, 20 years ago, I heard rumors of commercial
companies in the United States that had installed systems that would have ranked very high in the
top 500, but chose not to because they didn't want their competitors to know that they had that
capability. More recently, I think you were alluding to the belief that China has perhaps a couple of exascale systems, and China has decided, for whatever reason, to not yet submit the LINPACK numbers for consideration in the Top500 list.
Is that a problem?
I don't think so. Now, going forward, as you're, I'm sure, well aware, there are several other benchmarks that have been gaining traction and poke at different parts of the performance spectrum.
The Graph500 is one, the Green500 is another.
And those are important considerations because they do reflect on things that matter for the high-performance computing ecosystem.
Now, for the Exascale Computing Project, as you no doubt know,
all its measurements of performance are about application speed on Exascale systems
as a ratio of what the speed was on predecessor systems that were pre-Exascale.
So all the measures of speed are based on real, very complex applications
running 50 times faster approximately
than they used to run on the predecessor systems.
We very deliberately did not say
that we would measure success by LINPACK
because it's just too narrow.
Now, for the project to do that, it's feasible.
Now, for the world to do that, it's pretty complicated.
Which applications do you use to measure relative performance?
So I think simpler benchmarks will continue to be useful and used.
But more and more, I think we'll be able to look at applications
and how much faster they're running on real systems.
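[Editor's note: a simplified sketch of the application-level speedup measure described above, with made-up numbers. In practice, each ECP application team defined its own figure of merit, roughly useful work per unit time, and compared it against a baseline measured on a pre-exascale system.]

```python
# Hypothetical numbers; real ECP figures of merit are defined per application.
def figure_of_merit(work_units: float, wall_seconds: float) -> float:
    """Application-defined useful work completed per second of wall-clock time."""
    return work_units / wall_seconds

baseline = figure_of_merit(work_units=1.0e6, wall_seconds=3600.0)  # measured on a pre-exascale system
exascale = figure_of_merit(work_units=5.0e7, wall_seconds=3600.0)  # same science problem, exascale system

ratio = exascale / baseline
print(f"Figure-of-merit ratio: {ratio:.0f}x versus the pre-exascale baseline")  # 50x in this made-up case
```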
When I've been to Oak Ridge, to the OLCF, I see, for example, the plumbing
and everything that enables that system, the supercomputing center infrastructure.
Is that a fertile area to talk about, or should we just focus on the system itself?
No, I think it is a fertile area, not that I have a lot of detail to add,
but certainly one of the things that I learned a long
time ago is that, if you will, the plumbing matters. So when I first heard Seymour Cray
talk about the Cray-1 in 1976, you know, he came to Argonne to give a talk. And one of the
things that he spent quite a bit of time on was the issue of cooling. So that's what I meant by plumbing. Now,
of course, at that time, it was not liquid cooling. But soon after, of course, the Cray-2
was liquid cooled. Yes, when I have visited Oak Ridge and Los Alamos, I look at that part of the
infrastructure. At Argonne, since I was there for quite a while, as the infrastructure needed to
continually be upgraded
for the next machine, I'm well aware that there are a lot of issues. In the liquid cooling that's
prevalent now, there are typically in the early months issues with the quality of the water.
Things are pretty finicky, and so it's easy to have problems. The electrical distribution, not only the capacity to be able to
have 30 megawatts, 60 megawatts of power coming into the room, but then the distribution.
Circuit breakers often have to be custom designed, and there are failures early on. To my
recollection, in all the major facilities that I've been involved in, there have been power distribution issues in the first few months or the first year. So to get back to your question,
yes, getting that infrastructure in place to be able to operate an exascale system is quite an
achievement. It is not just a matter of, you know, ordering from the electric company, hey, give me
another 30 or 60 megawatts in this room.
And it's not a matter of just having pipes that are of big enough diameter to carry the cooling water
to distribute all the heat. And with these things that sound like very routine infrastructure, at the
level that one needs for exascale, and before that, actually, for petascale,
there are glitches. So for El Capitan, for example, Livermore has completed all the
infrastructure that's needed for El Capitan. It's in place. And they're very proud of that,
and they should be because it's a big deal. And they got it done ahead of schedule and
under budget, which, of course, makes it even nicer. That's fantastic. Paul, there are a couple of threads that come through in what you describe.
And when I look at everything that you've led, one of them is this view that the approach needs
to be holistic. The other one is collaboration, co-creation, co-design. The next one is a
systematic approach to go through all of these.
I think that a lot of this started with the C3P project back at Caltech, which, in my view,
presented the first green light that there is a good distance that we can go. Now,
let's go do that. Would you take us through that and maybe some of the subsequent green
lights that had to be approached?
I would be glad to, Shaheen. Certainly the C3P project at Caltech, which was led by Geoffrey Fox and funded by the Department of Energy, at that time through the Applied Mathematical Sciences
program. C3P was indeed a groundbreaker. So in what way? Well, one was that it involved a new architecture that was developed largely by Chuck Seitz,
another professor at Caltech, but with very active involvement by Geoffrey Fox, who was
a physicist, but who had deep insight into the architecture needs for some key physics calculations that he was involved in.
So it was co-designed to some extent at the hardware level. But more importantly, Geoffrey
was able to involve professors in many other parts of Caltech who had very challenging applications
and get them to participate in C3P. And so there was this collaborative effort and a
co-design effort because the system software was being developed. You know, there was no previous
model to use. The system software was being developed by Geoffrey and his group within C3P,
as well as Chuck Seitz's group, but taking into account the application needs. I, of course,
was following those advancements from my position at Argonne National Laboratory at the time.
But I really felt that that was quite amazing.
And especially because the architecture that Geoffrey was using was a distributed memory architecture.
Now, in those days, we're talking about early 1980s, shared memory was thought to be the way to go.
The Denelcor HEP machine, for example, which was a beautiful design by Burton Smith, was a shared memory machine.
And there was the Encore and the Sequent shared memory machines.
And so here was this outlier which, as we know now, turned out to be the way of the future, distributed memory, which made programming more difficult.
And people were worried about scaling capability because of speeds and feeds,
as well as algorithmic issues.
But yet, here was this C3P effort that was turning out results, papers.
They turned out so many technical reports and journal articles.
So very, very impressive.
I organized a workshop at Argonne on parallel computing in mid-1986, I guess.
And of course, one of the people I invited was Geoffrey Fox.
At that time, I was pushing the Department of Energy Mathematical Sciences program to fund
large parallel systems for applications. I should mention that in the
early 80s, I established the Advanced Computing Research Facility at Argonne, which had half a
dozen different parallel computer architecture systems in place. And its goal was to experiment
with parallel computing with different hardware architectures, and it was open to the world.
And so many people were experimenting with that. And I felt that those results, as well as the
ones I saw from C3P, justified leaping to parallel computing architectures for production systems,
not just as experimental systems. So my presentation at the workshop I organized was along those lines. And Geoffrey
Fox invited me afterwards to consider moving to Caltech because he said Caltech was thinking of
skipping the vector architecture generations and moving to parallel computing for very high-end
scientific applications. And so I did move to Caltech. And shortly after that, I worked with the Intel Delta, 512 nodes, and I forget the
number of gigaflops, but something like 14 gigaflops of speed. But to do that, I was able to
put together a collaboration, so one of the recurring themes, a collaboration of, well,
four federal agencies to provide the funding and 13 institutions to work on this project with me and put in place
the Intel Touchstone Delta and use it for high-end applications that would outstrip what could be
done on the fastest vector machines at the time. And so I had national laboratories, Intel
Corporation, and some universities working together on this project and doing co-design.
So we worked with Intel and applications people and computer scientists to come up with the
system software and algorithms for the Delta.
So that experience is very much based on co-design and collaboration.
So the Exascale Computing Project in the planning for that certainly took advantage of those experiences, as well as the planning for petaflops that I was involved in in the mid-1990s.
So, in fact, I was looking at some of the background for ECP in preparation for this interview, and I came across an email that I sent in September 2008 to Steve Binkley and Barb Helland.
They had asked me about whether I'd be interested in helping with planning for Exascale.
And I said, basically, yes.
And here I'm quoting.
I said, my preliminary thoughts are that after a few more of the applications-oriented workshops
had been held, we would organize some technology-focused workshops that take as input the results of the applications workshops.
The general approach would be similar to the one we used for the petaflops workshops of
the mid-1990s.
Now, end quote. What was the approach of the petaflops workshops?
We had four themes going on in the workshops; co-design was the concept. And so the
four themes were applications and algorithms, and I believe I had Geoffrey Fox leading that part
of the initial workshops; device technology; architecture, which Seymour Cray led, but it
involved all the luminaries in computer architecture at the time; and software technology, which was led by Ken Kennedy.
So what a cast.
Beautiful.
Yeah, it was great.
It was just a wonderful experience as I was cruising between all four of these focus areas
during the workshop.
And as I went into the architecture one at some point, Seymour said,
OK, let's pause and let's bring Paul up to
date as to what we've been thinking so far. Because when I detected that there were some
issues to be resolved, I would then send messengers from one of the other focus areas to interact,
to have essentially real-time co-design going on. So collaboration and co-design was something that the community had been doing for some time.
And so the ECP planning from very early on took those things into account.
And I believe that the success of the ECP, of the U.S. Exascale project, is largely due to that approach, the holistic approach, the collaborative and co-design approach, which had been successful. And now it's really a standard way of proceeding; really anything
with a little bit of complexity really demands it, doesn't it? It does indeed. Yes, long gone
are the days, fortunately, of people saying, you know, build it and they will come. You know,
that has occasionally been successful, but I think it's no longer viable in HPC.
Assuming you can even build it by yourself because, you know, now the scope of technologies and skill set and everything that is required just totally exceeds the capabilities of almost any organization.
Isn't that the case?
Oh, indeed.
And that's certainly the way that I approached a lot of projects in the past. If I and my colleagues came up with a vision, then we knew that within our own institutions,
we didn't necessarily have all the expertise.
And so we would cast about and try to identify people who did and determine whether they
were interested in working with us and shared our vision.
And whenever we did, the projects tended to be successful
working together. Paul, I'd like to ask you about HPC out at the edge and how scientific instruments
and other devices that are not the supercomputer proper might increasingly be acting as HPC or
mini-supers or micro-supers. Do you see that? And how's that developing in your view?
Yeah, edge computing, broadly defined, I think is a very exciting development. And so I view it
as having quite a spectrum. So at Argonne National Laboratory, for example, there's the Advanced
Photon Source, which produces extremely bright X-rays that are used to examine and develop new materials,
including biological ones. Just amazing resolution. So that's a data source. But as is common for such
accelerator data sources, one has to do some very quick, I guess, sifting of data to get the useful
information out. And so you need processing power very close to this data source.
That, of course, is the way it's been for particle physics accelerators, such as the
Large Hadron Collider at CERN in Europe.
You know, incidentally, I helped that project.
I provided, I guess, a blueprint for how to deal with the data issues of the LHC.
Back in 2002, I was asked to help with that.
So one aspect of edge computing is being very close to a data source, because the data just comes very fast.
And so although the calculations that are done are not terribly complex, they need to
be done very quickly on lots and lots of data.
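[Editor's note: a hypothetical sketch of the kind of quick sifting described here, a cheap test applied to every detector frame right at the data source so that only potentially interesting frames travel onward for deeper analysis. The threshold criterion and the simulated frames are illustrative stand-ins, not part of any actual APS workflow.]

```python
# Hypothetical edge-side data reduction: keep only frames with a strong enough signal.
import numpy as np

def sift(frames: np.ndarray, threshold: float = 4.0) -> list:
    """Return only the frames whose peak value exceeds the threshold (illustrative criterion)."""
    return [frame for frame in frames if frame.max() > threshold]

rng = np.random.default_rng(seed=0)
burst = rng.normal(size=(1000, 64, 64))   # stand-in for a burst of detector frames

kept = sift(burst)
print(f"forwarding {len(kept)} of {len(burst)} frames for deeper analysis")
```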
Then there's the idea of edge computing being smarter and being able to
take input from the environment, maybe multiple inputs, and draw inferences on those inputs to
come up with an action plan, if you will. And, you know, so that comes into detecting things like
fires through remote sensing, taking into account, you know, the wind direction and speed, and therefore
coming up with a plan of where to apply the limited resources for containing the fires.
So that would be one crude example. Obviously, self-driving cars, this is very much the case.
Yes.
That one has to take multiple inputs and then come up with an action plan, if you will, as to what the car
should do. And it does get into HPC because, well, as we have all been saying for some time,
the smartphone in our pockets is much faster than the Cray-1 of 1976. So we are dealing with HPC,
but distributed and in a different way of distribution from what
has traditionally been known as distributed computing. So I think it's quite an interesting
area. And I think there will be more and more weaving in of edge computing into the ecosystem
of larger scale computing. So to get back to the example that I mentioned about the Advanced Photon Source
needing to do some very quick calculation on its input, then also at Argonne, we have put in place
a very high-speed connection to the current supercomputers in the Argonne Leadership
Computing Facility. So that will include the Aurora Exascale System. And to be able to do more complex near real-time analysis of the
experimental data is something that can be very useful. In one case a few years ago, when this
was done early in the experiment, it was detected that the data was not quite right. And people who
were running the experiment only had access to the photon source for something like 48 hours.
So what you don't want to happen is that you gather nonsensical data for 48 hours, take it home, and then find out you've achieved nothing.
And now you have to wait two years for the next time you have access to the accelerator.
So in this case, we were able to, within a few hours, determine what turned out to be a loose cable.
That was fixed.
And so the experiment could
successfully proceed for the rest of the time it had access to that beam line. So there are also,
you know, those aspects of that. That's a great story. Yeah. Maybe we can conclude with getting
your perspective on the future, including quantum computing. And I'm sure you look at quantum
computing and you say, I've seen this movie before, but it's promising. The whole thing, not just quantum: what is next after exa? Well, what is next after exa?
Multidimensional. And predictions of the future are always risky, but so what? It's silly not to
do some prediction. So quantum computing, yeah, I certainly do believe will play a role, probably a
fairly important role. Clearly it will not take over computing and be the only way of computing.
And I don't think too many people are saying that.
I hope not anyway.
I've been aware of quantum computing since the 1980s.
I was fortunate in that at Argonne, a physicist called Paul Benioff did some of the theoretical work on that.
Secondly, I was on an advisory committee for
Los Alamos National Laboratory, and, again in the 1980s, the late 80s, I heard presentations on the promise
of quantum computing by staff at Los Alamos. They were looking at that back then. So yes,
these things take decades, but quantum computing will be important.
I see it crudely as a super accelerator for certain kinds of calculation, you know, mated to a more conventional computer.
You know, you've already asked me about edge computing.
So I see this ecosystem having more and more components to it.
You know, we started out with an ecosystem that was simply a single computer, and then data became
more important, and so it became a single computer with very large memories and very large data storage
capability. Then we started adding connections to multiple computers, to experiments,
then, within a computer, different kinds of accelerators. So I think that will continue,
but you're going to have more and
more, I guess, components that form the ecosystem. One of the things that the ECP has accomplished
is the software stack that I believe is useful for mid-scale HPC as well as exascale. And that's
an important part of the ecosystem. And if that software stack can, in many cases, also
contribute to edge computing, and I believe it can, then that again is an expansion of the ecosystem
of computing writ large. So another more technology-oriented aspect of computing is optical.
That has been a very long time coming. And I know, in the mid-90s, when looking at
petascale, I involved people that I thought were at the forefront of optical computer design. And
they told me, forget it; you know, if your target is 10 years, we won't be ready in 10 years.
It's further off. And they were right. And it was great that they were candid about that,
instead of saying, oh, yeah, we can contribute.
But yeah, they're all right.
But, you know, optical networks are a big deal.
Keren Bergman's work, I think, is very exciting.
And optical computers, I think, will also play a role.
I don't know how far in the future, but, you know, I definitely think that they will be part of the toolkit that we will have in the future. So very complex,
interrelated technologies and components, and hopefully largely tied together by a software
stack that can be used up and down the capability and across technologies as well.
Brilliant. All right. Paul, thanks so much for joining us today. We've been with Paul Messina,
the original director of the Exascale Computing Project. Thanks so much. Thank you, Paul. What a delight. You're welcome. It's been a pleasure.
All right. Take care. Really, really enjoyed this, Paul. And I really look forward to
having you again if you have some time for us. In the meantime, enjoy Annapolis and all that
you're doing. That's great. Thank you very much. Be glad to talk every time.
Thanks a lot. Take care.
That's it for this episode of the @HPC podcast. Every episode is featured on InsideHPC.com and posted on OrionX.net. Use the comment section or tweet us with any questions or to propose topics of discussion. If you like the show, rate and review it on Apple Podcasts or wherever you listen. The @HPC podcast is a production of OrionX in association with InsideHPC. Thank you for listening.