@HPC Podcast Archives - OrionX.net - @HPCpodcast-88: Mike Heroux of Sandia on ECP, HPC Software

Episode Date: August 19, 2024

Dr. Mike Heroux joins us to discuss HPC software in general and the Exascale Computing Project (ECP) software efforts in particular. Topics include performance vs. portability and maintainability, heterogeneous hardware, the impact of AI on workloads and tools, the emergence of the Research Software Engineer as a needed role and a career path, the convergence of commercial and HPC software stacks, and what's on the horizon.

Audio: https://orionx.net/wp-content/uploads/2024/08/088@HPCpodcast_MIke-Heroux_ECP_HPC-Software_20240819.mp3

Transcript
Starting point is 00:00:00 For us, performance is a very high priority. And so we pay attention to any new architecture features that would help with performance. Take a step back and say, okay, not only are these AI processors the best way of getting performance for a reasonable cost, but they offer us new ways of considering how we might formulate the problems that we're solving. Within the software ecosystem that we've constructed already, we can handle a variety of heterogeneity in software libraries and tools and applications
Starting point is 00:00:42 as we go forward. From OrionX in association with InsideHPC, this is the AtHPC podcast. Join Shaheen Khan and Doug Black as they discuss supercomputing technologies and the applications, markets, and policies that shape them. Thank you for being with us. Hi, everybody. It's Doug Black with the AtHPC podcast with Shaheen Khan of OrionX.net. And Shaheen, today we have a special guest, Mike Heroux.
Starting point is 00:01:11 He is a senior scientist at Sandia National Laboratories, and he's scientist-in-residence at St. John's University in Minnesota. He's been with both of those organizations for more than 25 years. Now, directly relevant to us, Mike is the Exascale Computing Project's Director of Software Technologies. And while that project has been completed, Mike's software work has continued to receive funding. And by the way, earlier in his career, Mike was with SGI and Cray. His focus is on all aspects of scalable scientific and engineering software for parallel computing architectures. So Mike, welcome. Thank you very much, Doug. Happy to be here. Great to be with you. So you've worked on supercomputing software for decades.
Starting point is 00:02:00 Share with us some thoughts, if you would, on the distinctions of developing applications for supercomputers? Yeah, I think probably it's not a difference in a kind of binary sense, but it's a difference in the degree of attention we pay to the computer architecture and the devices and the connection of those devices to each other for the purposes of getting as much performance as possible out of the computer. Certainly people who develop mainstream software don't want their software to run poorly or slowly, but for us, performance is a very high priority. And so we pay attention to any new architecture features that are being added to a system that would help with performance. We pay attention to algorithms that could run faster or be more robust in the presence of parallelism and offer new ways of solving problems that couldn't have been solved before. And we also care even about reconsidering problem formulation. Maybe an algorithm that was really good at giving an answer for a lower performance
Starting point is 00:03:12 type machine could be replaced by something that has more parallelism inherent, more concurrency possible. So we pay a lot of attention to performance issues, again, more so than I think the average mainstream software developer would have. Certainly in some other fields, low latency is important and things like that, but it is a hallmark of the kind of work that we do. And because we care about those things, and because new systems that offer better performance usually offer it through some kind of architectural innovation or a software innovation, we have to adapt our software. Not only do we have to get it to compile on a new computer system, usually also it has
Starting point is 00:03:57 a new compiler or a new version of the compiler that may be a little bit buggy itself. We're also considering new algorithms and considering new problem formulations. And so all of those things lead to, I think, why we often call this the bleeding edge, in that a person who works in the high-performance computing field, if they want to go from point A to point B and they see a path to get it through their own work, they're often disrupted because there are impediments that arise due to software or architecture features that they didn't foresee. And so it may take a really long time to get from A to B because we are working in such an evolving
Starting point is 00:04:38 and disruptive type of environment. Okay, so getting directly into Exascale, you began with the Exascale computing project in 2017. And we know a core element of ECP's strategy was the generation of capable or usable and useful Exascale class systems. So what were some of the key challenges of your project work for ECP and developing code for a system just as massive in scale as Frontier or all three systems? So there were a few things. First of all, the notion of having a capable or useful usable Exascale class system, it has several components to it. One is we've seen supercomputers built over the years that were really good at getting
Starting point is 00:05:27 a good LINPACK number. And so the LINPACK benchmark is a really small piece of software. It's very compute-friendly relative to, say, bandwidth and latency and even the amount of memory you might put on a node. It's very forgiving. And so you can have a system that you build that gets really good LINPACK performance results and shows up high in the TOP500, but then may not be so useful in a broader sense. And it may not be useful because the computer
Starting point is 00:05:58 architecture, the bandwidth and the latencies and the amount of provisioned memory you have might not be sufficient for other kinds of applications. And that's certainly one aspect of it. But the other is that the software that is needed to make the system usable by lots of different applications isn't present, and there hasn't been a budget for creating that software. It also can be that the IO subsystem, the things that you need to get data into the computer and off of the computer when you're done doing computation, may also not be sufficiently robust and fast enough to handle the data rates that are needed. And then the overall usability of the system, will it stay up long enough for you to get
Starting point is 00:06:42 useful work done in between checkpoints. And so the system itself has to be really reliable and robust and be up and running on a regular basis. So these are some of the things that go into this concept of a capable or usable system. The role of software technology, the effort that I led for ECP, we had roughly 250 scientists working on the libraries and tools within the software technology area. The distinction between that work and the work that was being done in the application space is our work was intended to be reusable software, stuff that other people used, not just ourselves, and could be used in a way that was scalable in the sense that many people could use. And so we had the role of building libraries and tools that
Starting point is 00:07:32 would help the application teams. If it's a performance tool, it would help them get insight into performance bottlenecks, things that might be inhibiting their scaling. If it was a build tool, those kinds of things. If it was something like a mathematics library, mathematical library, like a linear solver or an FFT, those were kinds of libraries we did. If it was in the IO space, it might be an IO library that gets data onto or into or off of a system, or it could be a data compression library. Because even though we built a scalable system with robust IO capabilities, the rate at which we could retrieve data into the machine and push it back out, that ratio is always getting worse just because of how the markets behave.
Starting point is 00:08:17 And so we invested in data compression capabilities that reduce the amount of data that you had to store and still have high fidelity in that data. And then the software ecosystem itself, we worked on container software capabilities. We invested a lot in Spack, which is a package management system that has become very popular in the high performance computing world. And then we had the collection of all our capabilities into something called E4S, which is this curated portfolio of software that we use to support and deliver all the capabilities that we produced in the software technology area out to the users, out to the world that wanted this kind of software.
Starting point is 00:09:00 Mike, you pointed out some of the gap in performance, IO versus processing. I think that has led to looking for ways around it. What's the outlook of the various roadmaps? Are we going to continue to see such a big gap between processing speed and then memory access, albeit HBM is trying to address a little bit of that, all the way to IO, which has always been a problem? Yeah, I don't see any kind of disruption that would make that kind of work easier to do. Of course, we see now in our hardware, our GPUs are really AI processors. And so they present to us opportunities for low precision hardware arithmetic. And that helps with data transfer, performance, things like that. But then you also have to have algorithms that can tolerate lower precision arithmetic. And that's an area of study. And so to the extent that we can revisit
Starting point is 00:10:10 the implementation of our algorithms or consider new algorithms that can operate at lower precision, we can make some progress in that space simply by having fewer bits to transfer per numerical result. So that's one way. But beyond that, I don't see us necessarily getting at this kind of growing trend of memory performance versus compute performance and IO performance versus memory performance versus compute performance. I don't see us fundamentally changing the challenges in that area. A related topic is heterogeneity. HPC community is eager to try whatever can make things go fast and we'll figure it out. And we have sufficient technical depth to do it. When it goes over to the commercial enterprise side,
Starting point is 00:11:01 the appetite for that is a little bit less, but then we are seeing the advent of AI drag them into the accelerator world in a way that exposes that complexity. But there's more heterogeneity where that came from, including the emergence of quantum computing, let alone accelerators of 20 different varieties. What are the labs doing? What is the Exascale project doing to bring all of that together and simplify it? So a few things, and ECP accelerated this, but more than a decade ago, we foresaw the need to be able to write portable high concurrency software, meaning that we needed to be able to express concurrency at a loop level in a way that could be compiled to existing heterogeneous systems and to future systems.
Starting point is 00:11:50 So to be very concrete, I'll use one example. There is an ecosystem built around a project called RAJA at Lawrence Livermore National Laboratory. There's another one, Kokkos, that started off at Sandia. I was part of the original efforts, but it's also expanded now to include lots of other laboratories and other organizations well beyond Sandia, and it's become a community software project. And so I'll talk about Kokkos because it's the one that I know best and seems to be more broadly used. But the basics of Kokkos are it allows you to express a parallel for loop, a do loop in Fortran, a parallel reduce, which is like a dot product, and then a parallel scan algorithm, which is like a dot product with intermediate results.
Starting point is 00:12:31 It allows you to compute in parallel algorithms that are otherwise not possible to do in parallel. That's really cool computer science stuff. But anyway, it does those three kinds of loops, and more elaborate versions of them, in a way that it can be expressed by the programmer, say a computer scientist or a chemist or a physicist who's writing code, in a way that exposes the parallelism, exposes the concurrency, but doesn't tie it to a specific type of compute processor. And in particular with ECP, NVIDIA GPUs were around. We used them all the time to do development and we made sure our software worked on them. But we were also targeting GPUs, accelerators from AMD and Intel. And we had to write our codes
Starting point is 00:13:18 in a way that was portable across those three target architectures, in addition to using ARM CPUs, vectorizing CPUs, and with an eye toward the future to data flow machines that are emerging in various settings. And so a lot of the work of ECP, especially in my area in the software technology with reusable libraries and tools, was done with this in mind: that we need to have portable performance across all these different systems. And so we've done that. The software stack that we provide is portable across all of these different architectures and is performance portable, and it allows that, let's say you have a particular hotspot and you need to get the most out of that on a particular GPU device, you can do it in a way that only that special kernel needs to be written in a custom way.
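(For readers unfamiliar with the Kokkos patterns Mike describes, here is a minimal, hypothetical sketch of the three parallel patterns, parallel_for, parallel_reduce, and parallel_scan, written against the public Kokkos C++ API. The array names and sizes are illustrative and not taken from any ECP code; the point is that the same source can be built for CUDA, HIP, SYCL, OpenMP, or serial back ends depending on how Kokkos is configured.)

```cpp
// Hypothetical sketch (not from ECP code): the three Kokkos patterns described above.
#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int N = 1000000;
    Kokkos::View<double*> x("x", N), y("y", N);

    // parallel_for: an element-wise loop, like a Fortran do loop.
    Kokkos::parallel_for("init", N, KOKKOS_LAMBDA(const int i) {
      x(i) = 1.0;
      y(i) = 2.0;
    });

    // parallel_reduce: a dot product collapsed to a single scalar.
    double dot = 0.0;
    Kokkos::parallel_reduce("dot", N, KOKKOS_LAMBDA(const int i, double& sum) {
      sum += x(i) * y(i);
    }, dot);

    // parallel_scan: prefix sums, the "intermediate results" Mike mentions.
    Kokkos::View<double*> prefix("prefix", N);
    Kokkos::parallel_scan("scan", N,
        KOKKOS_LAMBDA(const int i, double& partial, const bool is_final) {
      partial += x(i);
      if (is_final) prefix(i) = partial;
    });

    printf("dot = %f\n", dot);  // the scalar reduction result is available on the host here
  }
  Kokkos::finalize();
  return 0;
}
```

Note that nothing in the sketch names a device; the back end is chosen when Kokkos is built, which is what makes the "only the hot kernel is custom" approach Mike describes practical.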
Starting point is 00:14:12 And the rest of your code that needs to work pretty well in parallelism can still be done using these portability layers. And so that's a big part of what we've done so far. So that's part of the heterogeneity story is that portability across the existing GPU architectures from NVIDIA, AMD, and Intel is a big part of it. ARM processors, emerging data flow types of processors. Now you mentioned Quantum. Quantum is a very different type of processing device. From our perspective, at least from a software stack perspective, we view it as another attached device with a different instruction set architecture. Architecturally, in terms of software, we already know how to handle that. We already handle GPUs
Starting point is 00:14:55 as discrete devices. So it's not a fundamentally brand new thing in terms of software architecture. Of course, what is truly unique is the kinds of algorithm that you want to implement for a quantum device. And then what is the programming language that you're going to use? And how do you compile that code? That's all emerging as we go. But it's not a fundamentally different software architecture. And so we're confident that within the software ecosystem that we've constructed already,
Starting point is 00:15:32 we can handle a variety of heterogeneity in software libraries and tools and applications as we go forward. So Mike, looking ahead, we hear about OLCF6, the next generation of leadership systems at the labs. I assume there's a lot of direct relevance in the work you did for ECP with OLCF6, but maybe not. Or are there areas where new work will have to be done? I'm sure there'll be new work that has to be done simply because even if it's along the same path as what Frontier was, there's always new work to be done, in part because a lot of the performance will probably come from increased demands for parallelism, because we really have, we're still limited by the speed of light and our latencies aren't improving in any real way. And so more concurrent execution, more parallelism is the way that we get performance from these machines. It'll be
Starting point is 00:16:23 really interesting though, because the emergence of AI means that we have opportunities in the scientific computing area to take a step back and say, okay, not only are these AI processors the best way of getting performance for a reasonable cost for our traditional double precision computations and even single precision computations, but they offer us new ways of considering how we might formulate the problems that we're solving to take advantage of inference engines as core components of how we do scientific discovery. So we may, and in fact, this work's been going on for some time. It's not brand new, but I think it will grow. And so what we will see with OLCF6, because I don't know any details about it, but presumably it's going to have some kind of very rich AI type processor in it. And we will be able to utilize those to build up novel approaches to solving scientific problems using deep learning formulations
Starting point is 00:17:27 of the problems that we're trying to solve. What comes with that is how do you know that your answer is right, all these kinds of things. But part of that is just building a level of comfort with AI capabilities. And part of it is building in more validation and verification types of approaches so we can detect if our inference engine is not doing a good job or has gone off the rails. But those are some of the ways that I see things changing going forward. The other thing that I see changing going forward is that I think the distinction between what's an on-prem, on-premises, on-site computer versus what's a cloud system. Say the difference between AWS and Frontier. Right now, those software stacks and the interfaces that are used to do something on AWS versus Frontier are pretty different. But I think over time, especially in the next
Starting point is 00:18:21 generation of large-scale systems, again, I have no special knowledge, but I just see that there are market forces making this not only possible, but really an essential thing, is that the APIs that we're using on-prem should be to the user nearly indistinguishable from what we use in the cloud. Or at least we want to move in that direction because our users, the application teams that are trying to do, solve scientific problems are going to expect that if they do something in the cloud and they have a workflow, they have a set of scripts and workflow management tools, that they're also going to be able to use those tools on a system at a leadership computing facility in a way that is nearly identical, as nearly identical as possible, and that the difference between an application team using something
Starting point is 00:19:12 in the cloud versus on site should be as minimal as possible. Do you think, Mike, that would present some challenges in what you alluded to early in our conversation, the fervor for performance, almost to the exclusion of portability and maintainability. That's a really good point. However, at the same time, we have seen use of containers, we don't see a degradation in performance. And in fact, while maybe the latest GPUs are hard to come by in a cloud environment, GPUs are there and performance is real. And we see our application teams going on to these cloud-based services and realizing really good performance. And so we can take our software and we can compile and run it in the cloud in a way that doesn't compromise performance,
Starting point is 00:20:06 especially compared to how things might have been in the past. Have there been any actual tests to see, here's how I'm doing using cloud XYZ, and here is what I would do on-prem in a traditional, HPC-oriented, performance-first kind of a model? Yeah, I can't point to anything specifically, but I've seen a lot of anecdotal evidence that a person can expect pretty good performance from a cloud-based system if they provision it ahead of time with a good network setup. Yeah. And in fact...
Starting point is 00:20:37 Yeah. If the config has to be there. Yes, exactly. The configuration has to be there. In fact, and maybe we'll get to this, but one of the things that's emerged at the very end of ECP is this High Performance Software Foundation, HPSF, which I view as a very important signal of the relevance of high-performance computing to the broader industry and a non-DOE, non-governmental marketplace, and that they care enough about high-performance scientific software that they're interested in having a foundation that supports open-source software focused on high performance. And so I think we're seeing a market trend that introduces both the demand for high performance and the tools and libraries that
Starting point is 00:21:27 can provide that for the user community. So I foresee that the distinction between cloud and on-prem in terms of performance and capabilities, the distinction will disappear over time, because there is so much demand for high performance. Mike, tell us, why don't we get into that for a moment. We know the ECP has been completed, but you received additional funding for the work that you're doing. Could you specifically talk about that project work? Yeah, sure, sure.
Starting point is 00:21:56 Yeah, the project is called PESO. It's a bit of a tortured acronym, and it stands for Partnering for Scientific Software Ecosystem Stewardship Opportunities. And so what the Office of Science, the Advanced Scientific Computing Research Office, decided to do as the Exascale project was completing was they certainly were committed to stewarding and advancing the ecosystem that was developed under ECP. And so that means the libraries and tools, the applications that were developed, and then the software stack, which is E4S along with a very capable set of features from Spack, all of which were developed during the timeframe of ECP. And so there is an effort going on. Now, this effort is much smaller than what ECP was. ECP was, I think, roughly $1.8 billion; that was the often-quoted total project budget. The efforts
Starting point is 00:22:53 that we're talking about here are in the tens of millions total of budget, so much smaller budget. But at the same time, we're trying to organize ourselves to leverage other activities that are going on in the scientific ecosystem. For example, HPSF is now up and running. I mentioned E4S, one of the products that is supported by the PESO project. So I'm the PI, along with Lois McInnes, of the PESO project. E4S is our primary product, but we do other things as well. But remember, E4S is a scientific software stack. It contains programming models, things like support for MPI, for Kokkos, for RAJA, for the math libraries, for HPCToolkit and other kinds of performance tools, IO libraries like
Starting point is 00:23:38 ADIOS. And now all of the teams who are developing these products are getting funding from other sources, but we pull it all together and we do quarterly releases of E4S. So we curate the specific versions of those libraries and tools, and we make for a robust software stack that application teams and library development teams can depend upon as being reliable, robust, portable, available on all the leadership class systems, available in container environments, available in the cloud on AWS and Google Cloud. And so we're working on this full set of available software that can be used. And so that's what we're doing in post-ECP. Would we like a bit more money for doing that? Yes.
Starting point is 00:24:25 But we're able to make reasonable progress even as it is. Also part of this effort is something called CASS, the Consortium for the Advancement of Scientific Software. That is an umbrella organization that's pulling together the PESO team and a bunch of other teams that are doing libraries and tools work under a single aggregate umbrella. And so we're evolving the approach that we're using to develop and support scientific software. And exactly how this organization or this collection of organizations is going to go forward is itself still evolving. We're still establishing the charter and the bylaws for how the CASS, how this
Starting point is 00:25:06 consortium will work, but we're making reasonable progress and we'll keep plugging away at it. We hope for a large mission-driven type of project coming to DOE, maybe around AI for science and things like that. We see that in the news. And so I think if something like that comes to the laboratories, then I think that gives us additional resources to further sustain and expand the software stack as we go. But we'll see. We'll see how it goes. Right now, we've got funding. We have people. We have the opportunity to carry things forward for the next several years. PESO itself is a five-year project and we're working on it every day. That's excellent. I think this attention on the software side of HPC in general is just a gap that continues to be a gap and needs
Starting point is 00:26:00 attention. So it's great to see it. Where do you think, including the projects that you are leading, as well as what the rest of the industry and the community is doing, where do you think we are in terms of that kind of portability and performance across multiple GPUs? Is that a solved problem at this point? Or is this still a problem if somebody's code is really optimized for a GPU and then you want to move it to another? Yeah, we're not anywhere close to all the way there for sure. I do think that I didn't mention OpenMP. OpenMP has this so-called target offload functionality. It's still being made rigorous in terms of its capabilities. There are some codes from within ECP that relied upon it and had some struggle, especially towards the end. But it still remains a viable approach. Of course, there's OpenACC. There are other approaches. There's CUDA,
Starting point is 00:26:59 there's HIP, there's SYCL. There are all of these kinds of vendor-connected approaches to doing parallel computing. Again, I think Kokkos is a really nice model for getting performance portability in a way that's pretty practical, but it's also heavily templated C++. Not everybody wants to build that awareness of C++ syntax. So there are other programming languages like Julia, for example, where there are quite a few libraries that run well on GPUs, but we're really far away from having portability in a whole bunch of software. But it's an essential thing, I think, going forward. During ECP, we'd hear from people who are running a compute center, say for weather prediction or
Starting point is 00:27:43 other types of really important computations. But their software stack is still focused on the CPU. They don't have an investment in the redesign that would be required to take advantage of GPUs. And so they may have a two megawatt compute center and they can't really increase their power budget. And so from generation to generation, the software stack is still CPU, so they can't get to the accelerated architecture, and I think they're stuck in this low-GPU, CPU type of performance model that's going to be really hard to get away from without a big investment in a redesign of their software stack, something like what ECP did. I mean, ECP, you can think of it as just a massive lift of a software ecosystem that DOE cares about. It's not even the whole ecosystem, but a significant piece of it, from a CPU to a GPU type of architecture.
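(A brief aside on the "target offload" functionality Mike mentioned a moment ago: the sketch below is a hypothetical, minimal example of that style, in which a plain C++ loop is annotated with OpenMP directives so the compiler can map it onto a GPU. It is illustrative only, not code from ECP or from any weather center; variable names are made up, and it assumes a compiler with OpenMP 4.5 or later offload support.)

```cpp
// Hypothetical OpenMP target-offload sketch: a dot product run on an attached GPU.
#include <cstdio>
#include <vector>

int main() {
  const int N = 1 << 20;
  std::vector<double> x(N, 1.0), y(N, 2.0);
  double dot = 0.0;
  double* xp = x.data();
  double* yp = y.data();

  // Copy the arrays to the device, distribute the loop across GPU teams and
  // threads, and reduce the partial sums back into 'dot' on the host.
  #pragma omp target teams distribute parallel for reduction(+:dot) \
      map(to: xp[0:N], yp[0:N]) map(tofrom: dot)
  for (int i = 0; i < N; ++i) {
    dot += xp[i] * yp[i];
  }

  printf("dot = %f\n", dot);
  return 0;
}
```

When no offload-capable device or compiler is available, the target region falls back to running on the host, which is part of why this directive-based path stays attractive for teams that cannot afford a full rewrite of their stack.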
Starting point is 00:28:54 And now that we have that small subset of DOE software working well on GPUs, the next generation of GPUs comes along. We tweak the software, but it's going to make use of those newer GPUs. We're on a new commodity performance curve and energy efficiency curve. And it's not just at the high end. It's at the desk side as well, because the GPU is in the desk side system all the way up to the very high end. That sounds like something AI might actually be able to help with. Yeah, I think so. There's a real possibility that we can get some assistance from AI to transform the software stack. Of course, as I tell people, should I move to GPUs or not? And my advice has always been, don't do it until you absolutely have to.
Starting point is 00:29:40 Because the longer you wait, the more stable the ecosystem will be around you to use, and the more tools that will be available to you to do the process of transformation. So my answer to almost everybody is don't do it. Yeah, it's nuanced advice too. It's don't do it because you are going to have to do it. I'm doing it, but don't do it. Mike, what about the experience of seeing software in operation on Frontier, this incredibly powerful system? Do you have any anecdotes or just experiences to share about this software that actually got running, applications running? Yeah, well, I'd love to be able to tell you a story from the trenches.
Starting point is 00:30:23 But I've said often that I used to write software back not too many years ago, real supercomputing software. And then I wrote software by email and telephone calls to my postdocs and junior staff, right, never typing a semicolon. And for the past seven years, I've been talking to people who talk to people who write software. And so take what I say with a little bit of, with the caution that it deserves. But yeah, I mean, it's just, it's delightful to see all of the, everything come together, right? This was a massive effort involving a thousand plus people to produce the systems and produce the software, produce the applications that sat on top of it, and then the crown jewel of these scientific results from it. And to see it all come together is really, truly amazing. And there are some beautiful scientific results that have come from this effort. There will be a lot more, right? In fact, I mean, kind of the tagline of the post-ECP era is that it's the exascale era, right? We built these machines, we built the software, we built
Starting point is 00:31:31 the applications. There are a lot more applications that we want to get to effectively use these systems, but we are in the exascale era. And for the next decade or so, we should be able to see lots and lots of scientific breakthroughs that come as the fruit of the efforts to build these systems and the software ecosystems that run on them. Mike, you mentioned data flow systems, or data flow processing. Can you say more about the status of that and where you see that helping?
Starting point is 00:32:02 Yeah, I can say a little bit. Again, I'm not going to say anything that's under non-disclosure agreement or anything, right? But I do see viable systems that have a data flow design, meaning that data get injected into a network of connected processors and the data live within that network and flow through it as computations are performed on it. So you can think of it as like a transformation engine that takes data in, keeps it in, but spreading it across
Starting point is 00:32:31 and doing lots of different operations on it before it finally leaves the networked collection of devices. And there's a fair bit of evidence that this style of architecture offers a lot of promise for some really important problems. And so I think I see that as being the next wave of type of devices that are still pretty general purpose, not totally general purpose, but they're going to address some important problems that we haven't been able to do as well as we'd like using AI type processors, which can do really well with those problems, but they're really targeting AI, not the kinds of problems that data flow architectures could solve.
Starting point is 00:33:12 So I'm really excited about data flow. I think it's, again, a nice new front and a set of new capabilities in the HPC space. I mean, certainly for latency hiding, it's a great model. That is exactly right. Those are exactly the kinds of problems where latency is a really big deal. Right. And if you have sufficient data to just stream through, then you also can absorb the upfront hits. But then that starts making it sound seriously like FPGAs and other configurable approaches.
Starting point is 00:33:42 So do you see all of that kind of blending together into some established model? I do. And I think it's the software. So the compilers are going to be really important as a part of this. Anything I've seen, it's not just the hardware, it's the compiler.
Starting point is 00:33:57 And the compiler's ability to reorganize the flow of computation and data movement is an important element to these types of devices. One thing we talked about was the growing overlap between the commercial enterprise software stack and mode of operations and processes and the HPC world. And in our pre-call, I was saying that in the early days, you compiled, you linked, maybe there was a math library and you were done. And now it's like a vast universe of software and containers and GitHub and Spack. And even when Make came about, it was like, really? I don't
Starting point is 00:34:37 really need it. I can just do it. So we've come a long way. Can you speak to how that is impacting HPC software? We had a nice chat about research software engineer as a title, as a career path that has emerged and the impact of that on skillset, specialization, career path, organizational structure, process, all of that, that is now more relevant in HPC than it ever was before. How's that changing things? Yeah, well, I think the success of HPC as a collection of enabling technologies is what leads to this increase in complexity and increase in all the different tool sets and why it's become more complicated and in fact, truly complex.
Starting point is 00:35:25 I mean, the behavior is more than the sum of its parts. There is complexity we can't predict until we hit it. And I just think that's a natural part of what it means to be successful as we go forward. And so I don't see it as a bad thing or somehow we're doing things wrong. In fact, it is just a sign that we're doing things right to some extent. Now, can we do a better job? Yeah, I think we can do a better job of managing our complexity, of leveraging knowledge gained in the larger marketplaces that have more money than we do to invest in software and can maybe take a step back and do a better job at design. Because we're often fighting, we're racing against the clock to produce these new systems,
Starting point is 00:36:10 to get performance from them. We have to go quickly rather than invest in things that are for sustainability. I think we've done a pretty good job with ECP. With ECP, many of the team members, people who worked on the Exascale Computing Project, remarked, saying that this is the first time I felt like I could work on my software and the testing and make sure it was really robust as a part of the work I was doing to make it available to my users. And so we hope that kind of availability of time and funding will continue as we go forward post-ECP so that we can make better products, better tools, and make them available to more people.
Starting point is 00:36:52 And that's certainly part of the ambition for E4S. What we're trying to do with E4S is provide this curated stack so that, for the users of E4S, it's simply available. In many instances, it's already installed for them. And so they can just point to it and build their application on top of it. If they want to rebuild and customize some library configurations that they want to use, they can do that as well. And then we also provide things like build caches that will allow you to suck in a library that was built once before for you.
Starting point is 00:37:25 And if you need to rebuild it again, well, the binary for that build is already available. And we see compile and link times go down by a factor of 10. So there are lots of ways that we're able to improve the user's experience by focusing on this curated software stack, along with what the vendors provide and along with what application teams pull in from other software providers or the open source community. So we made a lot of progress. ECP changed the culture, I think, for DOE in terms of investments in scientific software,
Starting point is 00:37:59 viewing software as a facility, not as just something that we do on the way to delivering a computer system. I think that attitude about scientific software is persisting beyond ECP. That's excellent. That's excellent. Let's touch on AI just a little bit more. It is bringing its own software stack to the fore, right? And even some of what appeared to be an app is now becoming its own sort of part of the tool chain, including LLMs and SLMs and other forms. To what extent, I mean, you mentioned that HPC guys having enabled it are now actually using it too. I think it's going beyond precision. How do you see all of that evolving and being internalized by the HPC community? Yeah. So first of all, I'm jealous because of the amount of funding that's available for producing
Starting point is 00:38:53 that software. There's one statistic I saw from IDC where the estimated investment in AI high-end systems would be 300 billion a year this coming year versus $10 billion for traditional modeling and simulation. So a factor of 30, right? To me, that says a few things. One is if the AI community is interested in solving a problem, they're going to do it better, faster, and a lot more expensive than what we can do in the scientific computing community. And so we should just let them do that. Well, let them, they're going to do it anyway, right? But we should respect that they're going to do that.
Starting point is 00:39:29 And we should then try to leverage that as best we can. So wherever we can leverage the investments of the broader AI community in the scientific community, that's a good thing, right? And then we should look for where are the gaps that they're not paying attention to, either because it's an opportunity cost. They just have other things that are higher priority or something they don't care as much about as we do. We need to invest in those spaces and then leverage what they're trying to do. I view the emergence of AI as being this dominant market capability or potential as a really positive thing. I think the high performance computing community has always been parasitic in what
Starting point is 00:40:10 it does. After the Cray days, the Cray vector multiprocessors, right? Since then, we've utilized mass market components and pulled them together in a way that makes for high performance computing. So clusters were that way. GPUs were that way. GPUs that are really AI devices are that way. And so we've always taken what the broader computing community has produced and said we can take that and synthesize it and use it as the foundation for what we do.
Starting point is 00:40:39 And also, we're a healthy parasite. If you want to carry this analogy a little further, we're a healthy parasite in that we do something helpful to the host that we're connecting to or attached to. And that we say, if you tweak this little bit of a feature in your system, you could get better performance from it too. We've noticed, right? We're trying to get high performance out of your devices. If you organize your memory this way, or if you rearrange or change some aspect of your system or improve your compilers
Starting point is 00:41:10 in this way, you too can get better performance. And so I don't know, I view that relationship as being something very interesting and valuable going forward. I'm excited about the advent of AI as this huge market-driven set of activities. Some more reason why the focus on performance is a good thing. Yeah. Because we are willing to do anything to get it. Yes, that's right. Yeah, yeah, yeah. Exactly. Mike, I might suggest the term symbiotic as opposed to parasitic. I don't know, just a thought.
Starting point is 00:41:42 Okay. All right. All right. I don don't know i like the word parasitic it gets people's attention when i say it but in a good you mean in a good way though yeah absolutely yes i mean it in a good way yes yes yes it does have that quality it does get better attention for sure excellent so one other question and I know we're pushing against the time, but if you have time, we can. Yeah, I have time. It's SC24. Where can we catch up with all of this at SC24? And if you have any particular programs that you want to highlight, we'd love to know about it. Yeah. So the Department of Energy will have its booth this year again at SC24.
Starting point is 00:42:25 And so you'll be able to get a lot of information about the things that I've talked about that are DOE related at that booth. You will also see quite a few tutorials and papers that are related to the work that the Exascale Computing Project has done. We sent out an invitation to our teams at the end of the Exascale Computing Project to contribute to some special issues of the International Journal of High Performance Computing Applications. It's a journal that focuses on HPC. The response to that was tremendous. We had more than 40 articles contributed to that, all from the Exascale
Starting point is 00:43:05 Computing Project teams. So that shows you the volume of work that has come out of the Exascale Computing Project. So you will see, in the technical program at SC, lots of papers related to the work that had been done under the Exascale project. You'll see it in the tutorials. You will see it in the workshops as well. And you'll also see things that are maybe non-traditional, like we mentioned RSEs. There's the HPC RSE workshop that's going to be at SC24. It's been there for a few years.
Starting point is 00:43:43 Certainly the people who organize that don't think of themselves as ECP people, but I believe that the Exascale project helped reinforce both the importance of that kind of work and also fed skilled people into that community, and that effort will persist. And so in addition, of course, at SC24, you'll see a lot of the impact of the work that the Exascale project did in the scientific results and in the beautiful graphics that other people will be displaying in their booths on the exhibit floor. So I think you'll see a lot of impact from what ECP did at SC24. Well, we're certainly seeing a lot of news coming out of the Office of Science about these applications in action. So, yeah, we'll certainly look forward to that at SC in November.
Starting point is 00:44:32 And I just, in general, see, well, as we all know, I believe AI is a subset of HPC, albeit a really big one. But still, it's HPC skill set, HPC infrastructure, HPC workflow, HPC culture, right? Yeah, it's HPC skillset, HPC infrastructure, HPC workflow, HPC culture, right? Yeah, it is. And having enabled it, we are now using it because that was the idea all along the thing. But when I look at the SC24 show conference, it's still the core of it is HPC with AI kind of a quick second, right? Would you see it that way?
Starting point is 00:45:09 Yeah, I think that's true, but I think that's changing. So again, within E4S, we actually have a fairly robust stack of AI libraries and tools. We package things like TensorFlow, Horovod, the distributed layer that sits on top of the device layers, which is hard to install. And so the fact that we curate those libraries and tools into an integrated stack, it can be meaningful to some people, right? And so I think what we'll see as time goes on is that the distinction between what's AI and what's HPC will blend. I'm not sure the distinction will be as easy to perceive as we go forward. And especially as scientists start to learn how to use AI approaches for scientific discovery. Interesting. You just gave me an idea, but do you remember those old commercials of Will It Blend?
Starting point is 00:46:00 Vinegar and water? No, they actually had an actual blender. And then I don't know whether it was, I think the guy who did it was being paid to do it or something, but he would throw anything in the blender. And the question was, will it blend? So he would buy a brand new iPhone and throw it in there and see. Oh my God. I don't think I've seen that.
Starting point is 00:46:24 It was pretty funny. I bet it was. Yeah. Anyway, one more thing I need to take away from this. Yeah, there you go. There you go. Yeah. I don't know. To me, I think HPC and AI will blend like bananas and strawberries in a smoothie. It'll be better than either one alone, I think. Excellent. Excellent analogy. Yeah. Very good. Good stuff. All right. Thank you, Mike. Pleasure. Pleasure to have this conversation. Look forward to catching up with you in person at SC. And I'm still working out what my next arrangements are. They're almost in place, but I will be moving into the kind of consulting phase of my career. I will still be involved in the PESO project, that's for sure.
Starting point is 00:47:12 But it won't be as a Sandia staff member. I see. Well, congratulations. That's amazing. Yeah, I'm looking forward to it. I've been at Sandia for more than 26 years. And so it's time to work a little less and still be engaged and do only the things that I want to do. You get to be more picky.
Starting point is 00:47:32 Nice. I get to be picky. Exactly. Congratulations on that. Yeah, thanks. Okay. All right, Mike, thanks so much for your time. You've been a great guest. Yeah, my pleasure. Thanks for the opportunity. I really appreciate it. Take care. Thanks a lot. That's it for this episode of the At HPC Podcast. Every episode is featured on InsideHPC.com and posted on OrionX.net. Use the comment section or tweet us with any questions or to propose topics of discussion. If you like the show, rate and review it on Apple Podcasts or wherever you listen. The At HPC Podcast is a production of OrionX in association with Inside HPC.
Starting point is 00:48:10 Thank you for listening.
