@HPC Podcast Archives - OrionX.net - @HPCpodcast-70: Paul Messina – Journey to Exascale
Episode Date: October 17, 2023
With the annual observance of Exascale Day on October 18th, we were delighted to get a chance to discuss the journey to Exascale with Dr. Paul Messina, who led the Accelerated Strategic Computing Initiative (ASCI) program from 1998 to 2000 and was the first director of the Exascale Computing Project (ECP) from 2015 until late 2017.
Audio: https://orionx.net/wp-content/uploads/2023/10/070@HPCpodcast_Paul-Messina_Journey-to-Exascale_20231017.mp3
Transcript
And so the expertise that's been gained by all those people will pervade the nation's
scientific computing for the next decade or two. That to me is a fantastic accomplishment.
The four themes were applications and algorithms, and I believe I had Geoffrey Fox leading that
part of the initial workshops; device technology; architecture, which Seymour Cray led;
and software technology, which was led by Ken Kennedy.
What a cast. Beautiful.
Yeah, it was just a wonderful experience.
Now, there has been concern for national security reasons in the United States,
as you mentioned, about sharing technologies.
From OrionX in association with InsideHPC, this is the @HPC podcast. Join Shaheen Khan and Doug
Black as they discuss supercomputing technologies and the applications, markets, and policies that
shape them. Thank you for being with us. Hello, everyone. Shaheen, great to be with you again.
Delighted to be here. What a special guest. I'm really proud to have this opportunity. Yeah, as we're looking
at Exascale and the annual observance of Exascale Day on October 18th, we have with us Paul Messina,
really one of the HPC community's luminaries, with a distinguished career dating back to the early 70s. After earning his
PhD at the University of Cincinnati, Paul joined Argonne National Lab in 1973. He was involved in
building a programming language for the original Cray-1. At Caltech, he was the director of the
Center for Advanced Computing Research. In 1999 and 2000, he led the Accelerated Strategic Computing Initiative,
ASCI, at NNSA. And among other things, he was the first director of the Exascale Computing
Project starting in 2015, where he stayed until late 2017, fulfilling a two-year commitment that
Paul made when he came on board at ECP. He is currently consulting on various projects.
Paul, welcome. We're delighted to have you with us today.
Thank you, Doug. My pleasure.
Okay. So looking at Exascale, the Exascale Computing Project, a project that has really
just been completed, as I understand it. Let's start by talking about your overall assessment of the
American Exascale effort. You were involved in the initial planning stages,
and you've seen, we've seen, at least the first system, Frontier, come to fruition.
Would you say the Exascale project in this country has been a success? Indeed, I would. I feel the American Exascale project has been very successful. As I'm sure
you're aware, from the beginning, we took a holistic approach to how the project should
evolve so that it would involve concurrently the applications development for running on Exascale,
developing the software infrastructure for Exascale systems that would serve the
applications' needs, and working with companies to improve their commercial products, not one-off things for us, but their
future commercial products so that they could achieve Exascale. And so this holistic approach,
I believe, has been very successful. As you mentioned, the Frontier system is operational
and is an Exascale system indeed. And there are real important applications running on it now. Two other exascale systems
are well underway. The Aurora system at Argonne National Laboratory and El Capitan at Lawrence
Livermore National Laboratory will soon be operational as well. And one of
the important things, in my view, that the project has achieved, in addition to having an operational system already, is, well, one, the human resources. Well over a thousand people were involved and
have been involved in the ECP, from many different institutions: from 16 national
labs, from something like 35 different universities, and I don't remember how many industry partners. And so the expertise that's been gained by all those people will pervade the
nation's scientific computing for the next decade or two.
That to me is a fantastic accomplishment.
Secondly, the software stack that has been developed by the ECP is an important accomplishment
in my view for two reasons.
One, it works on the existing Exascale
system and is working on the early versions of Aurora and El Capitan. But in addition,
it's a software stack that's meant to support HPC basically throughout the capability range,
not just on Exascale. And that, too, is an accomplishment that should help the nation and the world in
dealing with HPC because it's a problem if you have to switch your software environment when you
get to the highest levels of capability. So having a software stack that can be used on,
I'll say, mid-size HPC systems as well as the fastest is a great accomplishment.
Then the goals of the architecture and the component
technology have also been achieved in terms of reasonable, in quotes, power consumption,
given the fact that Dennard scaling ended long ago and Moore's law has been tapering off for
some time. And yet we have Frontier, whose peak consumption is only around 23 megawatts, compared to what it would have been if we hadn't partnered with multiple computer companies as part of the ECP, through the PathForward funding that we provided.
Paul, tell us about the initial Exascale vision.
When did your attention start being directed to this area of Exascale? And I'm
interested in the concerns or doubts regarding whether a practical, usable, and affordable system at
this scale could be stood up. What were the key challenges on your mind?
Well, so I first started paying attention to Exascale in 2007. And it's because three other people had organized town hall meetings in 2007
to look into Exascale. So Rick Stevens, Thomas Zacharia, and Horst Simon from Argonne, Oak Ridge,
and Berkeley Laboratories. So they deserve the credit for getting that started. In late 2007 and then in 2008, I started discussing the Exascale
vision with both them and the people at DOE headquarters, Steve Binkley and Barb Helland
and Michael Strayer in particular. So what were the concerns that we had at the time? Well,
certainly, again, the power consumption, because if one did the simple arithmetic of using component
technology available at that time, say 2008, and how many components you would need to achieve
exascale just as a peak, it was a huge number, talking about gigawatts, completely unrealistic
for a real system. So that was a concern. Now, I will say that, since I was heavily involved in planning the petaflops workshops to explore the issues, I
felt that it would be feasible in about 10 years to achieve the petaflops goals with a reasonable
power structure. So that's not to say that in 2008 we were not concerned about power for exascale,
but I and others who worked with me on petaflops felt that, although it would take
a lot of effort, it could be reached.
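[Editor's note: a back-of-the-envelope version of the arithmetic described above, as an illustrative sketch only. The efficiency figures below are rough assumptions (on the order of 0.5 gigaflops per watt for leading systems around 2008, versus roughly 50 gigaflops per watt for a Frontier-class system), not numbers quoted in the episode.]

```python
# Illustrative back-of-envelope only; the efficiency figures are rough assumptions.
PEAK_TARGET = 1e18             # 1 exaflop/s, peak

GFLOPS_PER_WATT_2008 = 0.5     # ballpark for 2008-era leadership systems (assumed)
GFLOPS_PER_WATT_FRONTIER = 50  # ballpark for a Frontier-class system (assumed)

def power_megawatts(gflops_per_watt: float) -> float:
    """Power needed to reach the peak target at a given efficiency, in megawatts."""
    watts = PEAK_TARGET / (gflops_per_watt * 1e9)
    return watts / 1e6

print(f"Naive 2008 extrapolation: ~{power_megawatts(GFLOPS_PER_WATT_2008):,.0f} MW")      # ~2,000 MW, i.e. gigawatts
print(f"Frontier-class efficiency: ~{power_megawatts(GFLOPS_PER_WATT_FRONTIER):,.0f} MW")  # ~20 MW
```

Under these assumed figures, a peak exaflop with 2008-era technology lands in the gigawatt range, while Frontier-class efficiency brings it down to tens of megawatts, consistent with the roughly 23 megawatts mentioned earlier.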
The second one was, as often the case, the software and algorithms: would we need to completely redo algorithms with the massive
increase in parallelism that would be needed? Because we knew even then that it wouldn't be a matter of having really
fat nodes whose individual CPUs were extremely fast. No, the way we would achieve exascale would
be by having millions of operations going on concurrently. So could we achieve that level of
parallelism with algorithms, taking into account the interconnect, the internal network, and I/O as well?
Those were big challenges.
However, around that time, in the late aughts, 2008, 2009, people started looking at the implementations
of what was then the leading programming model for parallel computing,
namely the message passing interface, MPI, and were able to demonstrate million-way parallelism
with high efficiency with appropriately crafted implementations. So it wasn't the parallel programming model that was the issue;
it was the implementations.
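[Editor's note: a minimal sketch of the MPI programming model being referred to, written here with the mpi4py Python bindings purely for brevity; production exascale codes typically use MPI from C, C++, or Fortran. The point it illustrates is that the source code is written the same way whether it runs on 4 ranks or on millions, so delivering high efficiency at extreme scale falls to the MPI implementation underneath.]

```python
# Minimal illustration of the MPI model (run with e.g.: mpiexec -n 4 python sketch.py).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's id, 0 .. size-1
size = comm.Get_size()   # total number of ranks launched

# Each rank owns its own slice of the (notional) global problem.
local = np.arange(rank * 1000, (rank + 1) * 1000, dtype=np.float64)
partial_sum = local.sum()

# A collective reduction combines the partial results across all ranks.
# The call is written identically whether 'size' is 4 or in the millions;
# achieving high efficiency at scale is the job of the MPI library underneath.
total = comm.allreduce(partial_sum, op=MPI.SUM)

if rank == 0:
    print(f"{size} ranks computed a global sum of {total:.0f}")
```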
So there was some confidence of being able to achieve exascale reasonably, but definitely a recognition of the
goals being quite ambitious. It's definitely been, in my view, a smashing success and an indication
of how this country can get really big things done at that level. But then that also puts it
in the context of geopolitics, and especially as advanced technologies in general and supercomputers in particular become matters
of national security. One of my concerns has been its impact on scientific collaboration
and open communication and sort of chilling that process. But I know, of course, that the system is
also resilient in terms of advancing science. I'd love to get your perspective on it.
My approach has always been that collaboration is important.
Almost everything that I've had a part in accomplishing relied on collaboration. And
international collaboration is very common in the scientific world and even in the technology world.
A lot of great advances have taken place that way. Think back to the 1970s. The Transputer
was a British development. And somewhat later, I guess a couple of decades later, some Israeli researchers and companies
developed highly efficient optical switches. Or collaboration with Japan on some software issues
for AI in the 80s and 90s, as another example. So definitely, I have both been involved in collaborations with
many countries and feel that they're important. Now, there has been concern for national security
reasons in the United States, as you mentioned, about sharing technologies. And in most cases,
I have felt, and I have said to government, that there's no hiding these things. Early on
in parallel computing, I met with some people in parts of the government who said to me, well,
we have to restrict the highest performing systems from being exported. And I replied,
well, that really won't help because people will take commodity parts and put together highly parallel systems and achieve
the capability they need that way. And their comeback was, well, parallel computing is just
too complicated and other countries won't do it. And I basically laughed. Now, to their credit,
a couple of years later, the same people came back to me and said, yeah, you were right. Actually, we saw in a physics research lab in China that they had this, you know, large
Beowulf running and getting things done.
You know, they were able to not only build a system, but more importantly, to program it so that
they could get their work done.
So, you know, there are some exceptions, you know, some things I do think need to be guarded.
But much of the advancement is going to be open, or so easily deducible that
it can be deduced from trends in technology by people in any part of the world.
I think the thing to do is usually to try to run faster.
By that, I mean to advance fast the new technologies.
And that's how the U.S. has been successful for five decades that I know of.
Don't look back, just keep running. Keep running, that's absolutely right. In that vein, what is
your view on the top 500 list and the fact that it seems like more and more people aren't
participating, but then some people are? Well, the top 500 list has been useful in having real numbers,
even though their applicability is, of course, narrow, but nevertheless, having real numbers
on certainly the peak speed, power consumption, and, by the way, the performance on High Performance
LINPACK, that one benchmark program. The participation has always been voluntary. It's up to an institution
or a manufacturer to decide whether they want to submit data to be included in the Top500,
or the top 100 for that matter. I suspect that there have always been institutions that didn't submit
but could have ranked quite well. So 10, 20 years ago, I heard rumors of commercial
companies in the United States that had installed systems that would have ranked very high in the
top 500, but chose not to because they didn't want their competitors to know that they had that
capability. More recently, I think you were alluding to the belief that China has perhaps a couple of exascale systems, and China has decided, for whatever reason, to not yet submit the LINPACK numbers for consideration in the Top500 list.
Is that a problem?
I don't think so. Now, going forward, as you're, I'm sure, well aware, there are several other benchmarks that have been gaining traction and poke at different parts of the performance spectrum.
The Graph500 is one, the Green500 is another.
And those are important considerations because they do reflect on things that matter for the high-performance computing ecosystem.
Now, for the Exascale Computing Project, as you no doubt know,
all its measurements of performance are about application speed on Exascale systems
as a ratio of what the speed was on predecessor systems that were pre-Exascale.
So all the measures of speed are based on real, very complex applications
running 50 times faster approximately
than they used to run on the predecessor systems.
We very deliberately did not say
that we would measure success by LINPACK
because it's just too narrow.
Now, for the project to do that, it's feasible.
Now, for the world to do that, it's pretty complicated.
Which applications do you use to measure relative performance?
So I think simpler benchmarks will continue to be useful and used.
But more and more, I think we'll be able to look at applications
and how much faster they're running on real systems.
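[Editor's note: a simplified sketch of the application-level speedup measure described above, with made-up numbers. In practice, each ECP application team defined its own figure of merit, roughly useful work per unit time, and compared it against a baseline measured on a pre-exascale system.]

```python
# Hypothetical numbers; real ECP figures of merit are defined per application.
def figure_of_merit(work_units: float, wall_seconds: float) -> float:
    """Application-defined useful work completed per second of wall-clock time."""
    return work_units / wall_seconds

baseline = figure_of_merit(work_units=1.0e6, wall_seconds=3600.0)  # measured on a pre-exascale system
exascale = figure_of_merit(work_units=5.0e7, wall_seconds=3600.0)  # same science problem, exascale system

ratio = exascale / baseline
print(f"Figure-of-merit ratio: {ratio:.0f}x versus the pre-exascale baseline")  # 50x in this made-up case
```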
When I've been to Oak Ridge, to the OLCF, I see, for example, the plumbing
and everything that enables that system, the supercomputing center infrastructure.
Is that a fertile area to talk about, or should we just focus on the system itself?
No, I think it is a fertile area, not that I have a lot of detail to add,
but certainly one of the things that I learned a long
time ago is that, if you will, the plumbing matters. So when I first heard Seymour Cray
talk about the Cray-1 in 1976, you know, he came to Argonne to give a talk. And one of the
things that he spent quite a bit of time on was the issue of cooling. So that's what I meant by plumbing. Now,
of course, at that time, it was not liquid cooling. But soon after, of course, the Cray-2
was liquid cooled. Yes, when I have visited Oak Ridge and Los Alamos, I look at that part of the
infrastructure. At Argonne, since I was there for quite a while, as the infrastructure needed to
continually be upgraded
for the next machine, I'm well aware that there are a lot of issues. In the liquid cooling that's
prevalent now, there are typically in the early months issues with the quality of the water.
Things are pretty finicky, and so it's easy to have problems. The electrical distribution, not only the capacity to be able to
have 30 megawatts, 60 megawatts of power coming into the room, but then the distribution.
Circuit breakers often have to be custom designed, and there are failures early on. To my
recollection, in all the major facilities that I've been involved in, there have been power distribution issues in the first few months or the first year. So to get back to your question,
yes, getting that infrastructure in place to be able to operate an exascale system is quite an
achievement. It is not just a matter of, you know, ordering from the electric company, hey, give me
another 30 or 60 megawatts in this room.
And it's not a matter of just having pipes that are of big enough diameter to carry the cooling water
to distribute all the heat. And with these things that sound like very routine infrastructure, at the
level that one needs for exascale, and before that, actually, for petascale,
there are glitches. So for El Capitan, for example, Livermore has completed all the
infrastructure that's needed for El Capitan. It's in place. And they're very proud of that,
and they should be because it's a big deal. And they got it done ahead of schedule and
under budget, which, of course, makes it even nicer. That's fantastic. Paul, there are a couple of threads that come through in what you describe.
And when I look at everything that you've led, one of them is this view that the approach needs
to be holistic. The other one is collaboration, co-creation, co-design. The next one is a
systematic approach to go through all of these.
I think that a lot of this started with the C3P project back at Caltech, which, in my view,
presented the first green light that there is a good distance that we can go. Now,
let's go do that. Would you take us through that and maybe some of the subsequent green
lights that had to be approached?
I would be glad to, Shaheen. Certainly the C3P project at Caltech, which was led by Geoffrey Fox and funded by the Department of Energy, at that time through the Applied Mathematical Sciences
program. C3P was indeed a groundbreaker. So in what way? Well, one was that it involved a new architecture that was developed largely by Chuck Seitz,
another professor at Caltech, but with very active involvement by Geoffrey Fox, who was
a physicist, but who had deep insight into the architecture needs for some key physics calculations that he was involved in.
So it was co-designed to some extent at the hardware level. But more importantly, Geoffrey
was able to involve professors in many other parts of Caltech who had very challenging applications
and get them to participate in C3P. And so there was this collaborative effort and a
co-design effort because the system software was being developed. You know, there was no previous
model to use. The system software was being developed by Geoffrey and his group within C3P,
as well as Chuck Seitz's group, but taking into account the application needs. I, of course,
was following those advancements from my position at Argonne National Laboratory at the time.
But I really felt that that was quite amazing.
And especially because the architecture that Geoffrey was using was a distributed memory architecture.
Now, in those days, we're talking about early 1980s, shared memory was thought to be the way to go.
The Denelcor HEP machine, for example, which was a beautiful design by Burton Smith, was a shared memory machine.
And there was the Encore and the Sequent shared memory machines.
And so here was this outlier which, as we know now, turned out to be the way of the future, distributed memory, which made programming more difficult.
And people were worried about scaling capability because of speeds and feeds,
as well as algorithmic issues.
But yet, here was this C3P effort that was turning out results, papers.
They turned out so many technical reports and journal articles.
So very, very impressive.
I organized a workshop at Argonne on parallel computing in mid-1986, I guess.
And of course, one of the people I invited was Geoffrey Fox.
At that time, I was pushing the Department of Energy Mathematical Sciences program to fund
large parallel systems for applications. I should mention that in the
early 80s, I established the Advanced Computing Research Facility at Argonne, which had half a
dozen different parallel computer architecture systems in place. And its goal was to experiment
with parallel computing with different hardware architectures, and it was open to the world.
And so many people were experimenting with that. And I felt that those results, as well as the
ones I saw from C3P, justified leaping to parallel computing architectures for production systems,
not just as experimental systems. So my presentation at the workshop I organized was along those lines. And Geoffrey
Fox invited me afterwards to consider moving to Caltech because he said Caltech was thinking of
skipping the vector architecture generations and moving to parallel computing for very high-end
scientific applications. And so I did move to Caltech. And shortly after that, I worked with the Intel Delta, 512 nodes, and I forget the
number of gigaflops, but something like 14 gigaflops of speed. But to do that, I was able to
put together a collaboration, so one of the recurring themes, a collaboration of, well,
four federal agencies to provide the funding and 13 institutions to work on this project with me and put in place
the Intel Touchstone Delta and use it for high-end applications that would outstrip what could be
done on the fastest vector machines at the time. And so I had national laboratories, Intel
Corporation, and some universities working together on this project and doing co-design.
So we worked with Intel and applications people and computer scientists to come up with the
system software and algorithms for the Delta.
So that experience is very much based on co-design and collaboration.
So the Exascale Computing Project in the planning for that certainly took advantage of those experiences, as well as the planning for petaflops that I was involved in in the mid-1990s.
So, in fact, I was looking at some of the background for ECP in preparation for this interview, and I came across an email that I sent in September 2008 to Steve Binkley and Barb Helland.
They had asked me about whether I'd be interested in helping with planning for Exascale.
And I said, basically, yes.
And here I'm quoting.
I said, my preliminary thoughts are that after a few more of the applications-oriented workshops
had been held, we would organize some technology-focused workshops that take as input the results of the applications workshops.
The general approach would be similar to the one we used for the petaflops workshops of
the mid-1990s.
Now, end quote. What was the approach of the petaflops workshops?
We had four themes going on in the workshops; co-design was the concept. And so the
four themes were applications and algorithms, and I believe I had Geoffrey Fox leading that part
of the initial workshops; device technology; architecture, which Seymour Cray led, but it
involved all the luminaries in computer architecture at the time; and software technology, which was led by Ken Kennedy.
So what a cast.
Beautiful.
Yeah, it was great.
It was just a wonderful experience as I was cruising between all four of these focus areas
during the workshop.
And as I went into the architecture one at some point, Seymour said,
OK, let's pause and let's bring Paul up to
date as to what we've been thinking so far. Because when I detected that there were some
issues to be resolved, I would then send messengers from one of the other focus areas to interact,
to have essentially real-time co-design going on. So collaboration and co-design was something that the community had been doing for some time.
And so the ECP planning from very early on took those things into account.
And I believe that the success of the ECP, of the U.S. Exascale project, is largely due to that approach, the holistic approach, the collaborative and co-design approach, which had been successful. And now it's really a standard way of proceeding; really anything
with a little bit of complexity really demands it, doesn't it? It does indeed. Yes, long gone
are the days, fortunately, of people saying, you know, build it and they will come. You know,
that has occasionally been successful, but I think it's no longer viable in HPC.
Assuming you can even build it by yourself because, you know, now the scope of technologies and skill set and everything that is required just totally exceeds the capabilities of almost any organization.
Isn't that the case?
Oh, indeed.
And that's certainly the way that I approached a lot of projects in the past. If I and my colleagues came up with a vision, then we knew that within our own institutions,
we didn't necessarily have all the expertise.
And so we would cast about and try to identify people who did and determine whether they
were interested in working with us and shared our vision.
And whenever we did, the projects tended to be successful
working together. Paul, I'd like to ask you about HPC out at the edge and how scientific instruments
and other devices that are not the supercomputer proper might increasingly be acting as HPC or
mini-supers or micro-supers. Do you see that? And how's that developing in your view?
Yeah, edge computing, broadly defined, I think is a very exciting development. And so I view it
as having quite a spectrum. So at Argonne National Laboratory, for example, there's the Advanced
Photon Source, which produces extremely bright X-rays that are used to examine and develop new materials,
including biological ones. Just amazing resolution. So that's a data source. But as is common for such
accelerator data sources, one has to do some very quick, I guess, sifting of data to get the useful
information out. And so you need processing power very close to this data source.
That, of course, is the way it's been for particle physics accelerators, such as the
Large Hadron Collider at CERN in Europe.
You know, incidentally, I helped that project.
I provided, I guess, a blueprint for how to deal with the data issues of the LHC.
Back in 2002, I was asked to help with that.
So one aspect of edge computing is being very close to a data source, because the data just comes very fast.
And so although the calculations that are done are not terribly complex, they need to
be done very quickly on lots and lots of data.
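[Editor's note: a hypothetical sketch of the kind of quick sifting described here, a cheap test applied to every detector frame right at the data source so that only potentially interesting frames travel onward for deeper analysis. The threshold criterion and the simulated frames are illustrative stand-ins, not part of any actual APS workflow.]

```python
# Hypothetical edge-side data reduction: keep only frames with a strong enough signal.
import numpy as np

def sift(frames: np.ndarray, threshold: float = 4.0) -> list:
    """Return only the frames whose peak value exceeds the threshold (illustrative criterion)."""
    return [frame for frame in frames if frame.max() > threshold]

rng = np.random.default_rng(seed=0)
burst = rng.normal(size=(1000, 64, 64))   # stand-in for a burst of detector frames

kept = sift(burst)
print(f"forwarding {len(kept)} of {len(burst)} frames for deeper analysis")
```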
Then there's the idea of edge computing being smarter and being able to
take input from the environment, maybe multiple inputs, and draw inferences on those inputs to
come up with an action plan, if you will. And, you know, so that comes into detecting things like
fires through remote sensing, taking into account, you know, the wind direction and speed, and therefore
coming up with a plan of where to apply the limited resources for containing the fires.
So that would be one crude example. Obviously, self-driving cars, this is very much the case.
Yes.
That one has to take multiple inputs and then come up with an action plan, if you will, as to what the car
should do. And it does get into HPC because, well, as we have all been saying for some time,
the smartphone in our pockets is much faster than the Cray-1 of 1976. So we are dealing with HPC,
but distributed and in a different way of distribution from what
has traditionally been known as distributed computing. So I think it's quite an interesting
area. And I think there will be more and more weaving in of edge computing into the ecosystem
of larger scale computing. So to get back to the example that I mentioned about the Advanced Photon Source
needing to do some very quick calculation on its input, then also at Argonne, we have put in place
a very high-speed connection to the current supercomputers in the Argonne Leadership
Computing Facility. So that will include the Aurora Exascale System. And to be able to do more complex near real-time analysis of the
experimental data is something that can be very useful. In one case a few years ago, when this
was done early in the experiment, it was detected that the data was not quite right. And people who
were running the experiment only had access to the photon source for something like 48 hours.
So what you don't want to happen is that you gather nonsensical data for 48 hours, take it home, and then find out you've achieved nothing.
And now you have to wait two years for the next time you have access to the accelerator.
So in this case, we were able to, within a few hours, determine what turned out to be a loose cable.
That was fixed.
And so the experiment could
successfully proceed for the rest of the time it had access to that beam line. So there are also,
you know, those aspects of that. That's a great story. Yeah. Maybe we can conclude with getting
your perspective on the future, including quantum computing. And I'm sure you look at quantum
computing and you say, I've seen this movie before, but it's promising. The whole thing, not just quantum: what is next after exa? Well, what is next after exa?
Multidimensional. And predictions of the future are always risky, but so what? It's silly not to
do some prediction. So quantum computing, yeah, I certainly do believe will play a role, probably a
fairly important role. Clearly it will not take over computing and be the only way of computing.
And I don't think too many people are saying that.
I hope not anyway.
I've been aware of quantum computing since the 1980s.
I was fortunate in that at Argonne, a physicist called Paul Benioff did some of the theoretical work on that.
Secondly, I was on an advisory committee for
Los Alamos National Laboratory, and, again in the 1980s, the late 80s, I heard presentations on the promise
of quantum computing by staff at Los Alamos. They were looking at that back then. So yes,
these things take decades, but quantum computing will be important.
I see it crudely as a super accelerator for certain kinds of calculation, you know, mated to a more conventional computer.
You know, you've already asked me about edge computing.
So I see this ecosystem having more and more components to it.
You know, we started out with an ecosystem that was simply a single computer, and then data became
more important, and so it became a single computer with very large memories and very large data storage
capability. Then we started adding connections to multiple computers, to experiments,
then, within a computer, different kinds of accelerators. So I think that will continue,
but you're going to have more and
more, I guess, components that form the ecosystem. One of the things that the ECP has accomplished
is the software stack that I believe is useful for mid-scale HPC as well as exascale. And that's
an important part of the ecosystem. And if that software stack can, in many cases, also
contribute to edge computing, and I believe it can, then that again is an expansion of the ecosystem
of computing writ large. So another more technology-oriented aspect of computing is optical.
That has been a very long time coming. And I know, in the mid-90s, when looking at
petascale, I involved people that I thought were at the forefront of optical computer design. And
they told me, forget it; you know, if your target is 10 years, we won't be ready in 10 years.
It's further off. And they were right. And it was great that they were candid about that,
instead of saying, oh, yeah, we can contribute.
But yeah, they're all right.
But, you know, optical networks are a big deal.
Keren Bergman's work, I think, is very exciting.
And optical computers, I think, will also play a role.
I don't know how far in the future, but, you know, I definitely think that they will be part of the toolkit that we will have in the future. So very complex,
interrelated technologies and components, and hopefully largely tied together by a software
stack that can be used up and down the capability and across technologies as well.
Brilliant. All right. Paul, thanks so much for joining us today. We've been with Paul Messina,
the original director of the Exascale Computing Project. Thanks so much. Thank you, Paul. What a delight. You're welcome. It's been a pleasure.
All right. Take care. Really, really enjoyed this, Paul. And I really look forward to
having you again if you have some time for us. In the meantime, enjoy Annapolis and all that
you're doing. That's great. Thank you very much. Be glad to talk every time.
Thanks a lot. Take care.
That's it for this episode of the @HPC podcast. Every episode is featured on InsideHPC.com and posted on OrionX.net. Use the comment section or tweet us with any questions or to propose topics of discussion. If you like the show, rate and review it on Apple Podcasts or wherever you listen. The @HPC podcast is a production of OrionX in association with InsideHPC. Thank you for listening.