Big Ideas Lab - El Capitan

Episode Date: November 19, 2024

For decades, it was an ambitious dream: to create a supercomputer powerful enough to tackle humanity's most complex problems. Now, that dream is a reality. On November 18, 2024, El Capitan made history as the world’s fastest supercomputer, surpassing two exaflops of speed. Join us as we explore how this monumental achievement is set to redefine national security, revolutionize scientific research, and spark breakthroughs that could change the world as we know it.

Transcript
Starting point is 00:00:00 For decades, the U.S. Department of Energy has been pursuing a bold vision. A system powerful enough to tackle the greatest challenges facing humanity. Fears of a serious new threat to U.S. nation. Russia has begun major nuclear weapons exercises. The World Health Organization has declared a global public health emergency. The electricity system is undergoing a once-in-a-century transformation. What does the energy of the future look like? That vision? Exascale computing.
Starting point is 00:00:33 Exa is a Greek prefix meaning 10 to the 18th, and an exascale system nominally is about how many calculations it can perform per second. The U.S. government and the National Nuclear Security Administration's tri-labs, Lawrence Livermore, Los Alamos, and Sandia National Laboratories, needed a machine capable of operating on a scale never attempted before. You can think of it as a billion billion. And so just the sheer number of calculations that you can perform in a fixed amount of time is beyond anything that we've been able to do in the past. The NNSA labs needed a computer that
Starting point is 00:01:15 could simulate nuclear reactions to the tiniest detail, discover new materials, boost energy, advance inertial confinement fusion, and meet the nation's evolving national security demands. But building a machine of this magnitude required vision, and a willingness to gamble on the unknown. More than a decade of work went into building something that would push the boundaries of what was possible. Capable of performing more than two quintillion calculations per second at its peak. And now that vision is a reality.
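To put "a billion billion" in perspective, here is a quick back-of-the-envelope comparison. The two-exaflop figure comes from the episode; the laptop speed is an assumed round number for illustration, not something quoted here.

$$
2~\text{exaFLOPS} = 2\times 10^{18}~\tfrac{\text{calculations}}{\text{second}},
\qquad
\frac{2\times 10^{18}~\text{calculations}}{10^{11}~\tfrac{\text{calculations}}{\text{second}}~\text{(assumed fast laptop)}}
= 2\times 10^{7}~\text{seconds} \approx 230~\text{days}.
$$

In other words, the work El Capitan can finish in a single second at peak would keep a fast consumer laptop busy for the better part of a year.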
Starting point is 00:02:00 This machine doesn't just turn on. It awakens. Piece by piece, system by system. Each one coming to life in perfect synchronization. And then, it happens. The future has arrived. Welcome to the world, El Capitan. We expect El Capitan to offer more total compute capability
Starting point is 00:02:32 than any previously built system. Welcome to the Big Ideas Lab, your weekly exploration inside Lawrence Livermore National Laboratory. Hear untold stories, meet boundary-pushing pioneers, and get unparalleled access inside the gates. From national security challenges to computing revolutions, discover the innovations that are shaping tomorrow, today. On November 18, El Capitan was officially launched at supercomputing's biggest showcase, the SC Conference, where it was announced as the world's fastest supercomputer.
Starting point is 00:03:22 At a peak speed of more than two exaflops, El Capitan is not just a technological marvel, but a machine that holds the future of national security, scientific research, and breakthrough innovations in its hands. El Capitan is one of the first exascale systems deployed in the United States. It is the third in a series that the United States has been developing and is the first of these exascale systems to be deployed for the National Security Mission. Rob Neely is the Associate Director
Starting point is 00:03:55 for Weapons Simulation and Computing at Lawrence Livermore National Laboratory. The immense computational power of El Capitan and its unclassified companion system Tuolumne holds the potential to solve some of humanity's biggest challenges. From fusion energy to climate modeling to renewable energy research to breakthroughs in drug discovery and earthquake simulation. However, at its core, El Capitan was designed with a singular mission to ensure the safety,
Starting point is 00:04:27 security and reliability of the U.S. nuclear stockpile. For the United States to maintain confidence in our nuclear stockpile, prior to 1992, if we wanted to understand if a change or a new design worked, we would go off to Nevada, drill a big hole in the ground, put the weapon down there and set it off. And that's called underground testing. And that's how things worked for decades. We stopped doing that in 1992 under the Clinton administration. And that left us with the big question, how are we going to retain confidence in our nuclear stockpile? And so that really spearheaded a big push in the United States to use supercomputing and modeling and simulation as one leg of a new program called science-based stockpile
Starting point is 00:05:20 stewardship designed to make sure we could retain our confidence in these weapons. With new global threats emerging and a Cold War-era arsenal still in play, ensuring that the U.S. maintains its nuclear deterrence and its competitive advantage over its adversaries has become one of the nation's most critical challenges. For the first time in decades, we're designing new weapons that are similar but safer, higher performing. So that national security mission is core to a lot of what we do and what we plan to use El Capitan for.
Starting point is 00:05:56 This mission is not new. It's one Lawrence Livermore National Laboratory has been working on since its founding in the 1950s. The nice thing about D. Lee Labs is they do make these long-term investments in science and technology that we think we're going to need for the mission so they can take 20 and 30 years to come to fruition, which is a really interesting work environment for us. Teresa Bailey is the Associate program director for the weapons
Starting point is 00:06:25 simulation and computing computational physics program at Lawrence Livermore National Laboratory. Her job is to oversee the development of a wide range of modeling and simulation tools that can be run on Lawrence Livermore National Laboratory's high-performance computers. She points out that El Capitan in many ways represents the culmination of decades of research and development, bringing to life the vision of the Accelerated Strategic Computing Initiative, or ASCII, that was established over 25 years ago. The ASCII program was designed to deliver modeling and simulation tools aimed at stockpile stewardship using high performance computers so that we would never have to go back to nuclear testing.
Starting point is 00:07:10 So El Capitan really represents that end product for the original vision of ASCI. Supercomputing is bringing to bear as much computational power as we can assemble to solve the hardest problems that are out there. Bronis de Supinski is the Chief Technology Officer for Livermore Computing. We do modeling and simulation of a variety of processes. Most of them are related to stockpile stewardship, but climate science, things like that,
Starting point is 00:07:44 to do those kinds of simulations in ways that actually model things close to reality, it takes quite a lot of computation, and so it takes much more computing capability than you have in, say, your laptop or your phone. Today, supercomputers far surpass the computational power of any device you have at home. What truly sets them apart is their precision and interconnectedness. Supercomputers are designed to have thousands of compute nodes work together to run simulations that mimic reality with incredible accuracy, which requires immense computational power to achieve the 64-bit double precision calculations necessary
Starting point is 00:08:26 for reliable scientific results. Mathematically, real numbers are infinite precision. In a computer, you have to choose some finite precision. The fidelity with which you're representing that infinitely precise number in the computer is limited by the number of bits you devote to it. And so getting an accurate answer depends on how many bits you use for it. Think about it this way. When you write down the number 3, you're not just writing 3. In precise terms it's 3.00000000, continuing infinitely.
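To make the finite-precision point concrete, here is a minimal, self-contained C++ sketch — illustrative only, not code from the laboratory — that accumulates the same sum in 32-bit and 64-bit floating point. The exact digits printed vary by compiler and platform, but the 32-bit result drifts visibly while the 64-bit result stays accurate to many more places.

```cpp
#include <cstdio>

// Sum 0.1 ten million times in single and double precision.
// Mathematically the answer is exactly 1,000,000, but 0.1 cannot be
// represented exactly in binary floating point, so rounding error
// accumulates -- far more in 32-bit than in 64-bit arithmetic.
int main() {
    const long n = 10'000'000;

    float  sum32 = 0.0f;   // 32-bit: roughly 7 decimal digits of precision
    double sum64 = 0.0;    // 64-bit: roughly 16 decimal digits of precision
    for (long i = 0; i < n; ++i) {
        sum32 += 0.1f;
        sum64 += 0.1;
    }

    std::printf("exact  : 1000000\n");
    std::printf("32-bit : %.6f\n", sum32);  // noticeably off
    std::printf("64-bit : %.6f\n", sum64);  // off only in the last digits
    return 0;
}
```

Scale that same effect up to quintillions of operations per second, and the reason national-security simulations insist on 64-bit arithmetic becomes clear.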
Starting point is 00:09:08 Computers, however, have finite precision, meaning they have to cut off those trailing zeros at some point. In scientific computation, every extra decimal place of precision can be the difference between a simulation that is reliable and one that isn't. Doing large simulations requires fairly significant precision. So most of our computations require 64-bit computations. 64-bit precision stores each number using 64 binary digits, giving roughly 15 to 16 decimal digits of accuracy and enabling the highly accurate calculations needed for complex simulations,
such as those used in nuclear weapons research. In national security, where the smallest margin of error can have critical consequences, close enough simply isn't an option. This is why Lawrence Livermore National Laboratory and the NNSA have been relentlessly focused on hitting that exascale computing target. But achieving this level of technological advancement requires more than just improving current capabilities. It requires holding a vision so bold and far-reaching that the path forward may not always be immediately
Starting point is 00:10:22 clear. Back in about 2008, there was a seminal paper that was released by DARPA, the Defense Advanced Research Projects Agency, foreshadowing the difficulties that the computing industry was gonna have reaching this exascale target. So for decades, computers had been getting a thousand times faster,
Starting point is 00:10:43 approximately every decade or so built on first just smaller transistors the more things you could pack on a chip then by parallel computing putting more of these chips together into a single system but getting to exascale very early on almost 15 years ago it was recognized this was going to be a challenge like we haven't addressed before. So we started thinking about these systems long before we decided what the systems would actually be because we knew there was going to be a lot of research needed to be able to utilize these systems effectively.
Starting point is 00:11:19 This DARPA report foreshadowed the immense challenges on the horizon. It highlighted key issues like power, memory, and system resiliency. Fast forward ahead to about 2015-2016, the United States funded something called the Exascale Computing Project, which was really about the research needed to develop the software and the applications that would ultimately run on these machines. And it also funded some research for companies like AMD and Intel and Nvidia and HPE, big players in the supercomputing industry, to help them develop technology faster so that we could deploy those at the laboratory sooner for our mission. So all this was happening about six, seven years ago.
Starting point is 00:12:05 And at that time is when we began thinking about what's our next system going to be? What's the NNSA's exascale system going to be? As the exascale computing project took shape, it became clear that the path forward would require new solutions for both software and hardware. One of the original challenges we had when thinking about exascale computing was really around the power requirements
Starting point is 00:12:30 of these computers. Historically, the earliest supercomputers were just the earliest computers, right? Over time though, they became dominated by something called vector systems, so that's a way of computing a bunch of things at the same time in parallel. There's kind of limitations on that. And so over time, we moved to networked systems of CPUs, which is the standard way of what people use in their laptops. And so for a long time,
Starting point is 00:12:58 we were building systems with CPUs. That's how for decades, we've been getting faster and faster performance on these supercomputers. But if you drew sort of a straight line on where we knew technology was going in the late 2000s out to 10 years later or so, the amount of power it was going to require to field one of these systems, we were going to have to think about putting a nuclear reactor next to the building because it was in the hundreds of megawatts, which the operational costs for that were more than the Department of Energy even was willing to accept. And so a lot of the initial challenges and a lot of the initial research was how can
Starting point is 00:13:38 we continue to ride this wave of improved computer performance without expanding the amount of energy and power that's going to be required. The power challenge was immense. Exascale systems like El Capitan would require a completely new approach to energy efficiency, pushing computing experts to explore new ways to design and build these machines. In order to get more parallelism, we moved to processors that are used to drive the graphics on your screen, so GPUs. In 2018, Lawrence Livermore National
Starting point is 00:14:12 Laboratory launched Sierra, a groundbreaking supercomputer that combined CPUs with GPUs, making it one of the first large-scale systems to use this integrated heterogeneous approach. Sierra delivered 125 petaflops at its theoretical peak, roughly one-eighth of the computational performance of exascale. Part of what we were able to do between the community and our vendor partners like folks at Nvidia and AMD and Intel were to make these graphics processing chips suitable for scientific computing. It was really scientific computing and partnership
Starting point is 00:14:52 with companies that helped us recognize that, yes, we could do this. This could become the basis for the next generation of supercomputers. And it's going to be something like that technology that's gonna be required to get us to exascale computing in a power budget that we can manage. Sierra's design was a huge leap forward,
Starting point is 00:15:11 but the shift to GPUs introduced a new challenge. Many of the existing codes weren't built to run on GPUs. These codes had been designed for CPU-based systems, and adapting them wasn't a simple task. These aren't just little codes that you can rewrite over and over again. They're big codes. There are sometimes millions of lines of code
Starting point is 00:15:36 coming together. So to make these big shifts in algorithmic type takes a lot of upfront thought and research to make sure it has the payoff that we need. Imagine being tasked with translating a complex manual into a different language. Except this manual isn't just a few pages. It's millions of lines long. And every detail is critical. Even the smallest error could derail the entire process.
Starting point is 00:16:10 This was the challenge Lawrence Livermore National Laboratory's developers faced when adapting CPU-based codes to run efficiently on GPUs. To overcome this, the team implemented RAJA and Umpire, coding tools that simplify and streamline the process of adapting and using the codes. These tools, first used for Sierra, sped up the work for El Capitan dramatically, reducing code implementation time and pushing the exascale transition forward. The next pivotal step toward a fully functional exascale machine came with the introduction of AMD's next-generation processors, known as APUs, or Accelerated Processing Units.
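Before the episode turns to the APU hardware, it may help to see the shape of the portability-layer idea behind tools like RAJA. The sketch below is hypothetical — it is not RAJA's actual API, and the policy names are invented for illustration — but it shows the core trick: the loop body is written once, and an execution policy chooses how and where it runs.

```cpp
#include <cstddef>
#include <vector>
#include <iostream>

// Hypothetical execution-policy tags, standing in for the real policies a
// library like RAJA provides (sequential, OpenMP, GPU back ends, ...).
struct sequential_policy {};
struct openmp_policy {};

// The loop body is written once as a lambda; the policy decides how it runs.
template <typename Body>
void forall(sequential_policy, std::size_t n, Body body) {
    for (std::size_t i = 0; i < n; ++i) body(i);
}

template <typename Body>
void forall(openmp_policy, std::size_t n, Body body) {
    #pragma omp parallel for  // runs on CPU threads if OpenMP is enabled
    for (long long i = 0; i < static_cast<long long>(n); ++i) body(i);
}

int main() {
    const std::size_t n = 1'000'000;
    std::vector<double> x(n, 1.0), y(n, 2.0), z(n);

    // The physics kernel (here, a trivial vector operation) is written exactly once.
    auto kernel = [&](std::size_t i) { z[i] = x[i] + 2.0 * y[i]; };

    // Swapping the policy retargets the same kernel without rewriting it --
    // the essence of moving millions of lines of code to new hardware.
    forall(sequential_policy{}, n, kernel);
    forall(openmp_policy{},     n, kernel);

    std::cout << "z[0] = " << z[0] << "\n";  // prints 5
    return 0;
}
```

The real libraries add GPU execution policies, with Umpire handling the allocation and movement of memory across the different memory spaces, but the write-once, choose-the-target-later structure is the idea the episode describes.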
Starting point is 00:16:56 These chips combine both CPUs and GPUs into a single hardware package, making them more efficient and easier to program. This invention marked a major leap forward in computing technology, not just for the lab but for the world. The APU was an innovation that AMD came up with, one of our partners in El Capitan, to basically integrate the idea of the CPU and the GPU all on a single package. Sierra had GPUs in it but they were really completely separate from the CPUs. They were separate memory and one of the complications of using those systems and using accelerated computing in general was that the programmer now had
Starting point is 00:17:39 to make some explicit decisions about when to move data between the CPU and the GPU and when to transfer control of the program from one type of device to another. The APU now gets rid of one of those complications. That makes the system more efficient from an energy standpoint, because you're not doing those useless movements of data, and it also makes it easier to program, because you don't have to program that movement of data. That's technically a big advantage.
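Here is a similarly hypothetical sketch of the data-movement difference described above. The "device" functions are stand-ins defined inside the example itself, not real vendor APIs; on a discrete-GPU machine they would correspond to explicit runtime calls, while on an APU the copies simply disappear.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>
#include <iostream>

// Hypothetical stand-ins for vendor runtime calls. In this sketch "device
// memory" is ordinary host memory so the example compiles and runs anywhere.
double* device_alloc(std::size_t n) { return new double[n]; }
void copy_to_device(double* d, const double* h, std::size_t n) { std::memcpy(d, h, n * sizeof(double)); }
void copy_to_host(double* h, const double* d, std::size_t n)   { std::memcpy(h, d, n * sizeof(double)); }
void device_free(double* d) { delete[] d; }

// Stands in for a GPU kernel.
void scale_on_device(double* data, std::size_t n, double a) {
    for (std::size_t i = 0; i < n; ++i) data[i] *= a;
}

int main() {
    const std::size_t n = 8;
    std::vector<double> host(n, 1.0);

    // Discrete GPU (Sierra-style): the programmer orchestrates every copy.
    double* dev = device_alloc(n);
    copy_to_device(dev, host.data(), n);  // explicit host -> device transfer
    scale_on_device(dev, n, 2.0);
    copy_to_host(host.data(), dev, n);    // explicit device -> host transfer
    device_free(dev);

    // APU (El Capitan-style): CPU and GPU share one memory, so the same
    // buffer is handed straight to the kernel -- no copies to write,
    // schedule, or get wrong, saving both programmer effort and energy.
    scale_on_device(host.data(), n, 2.0);

    std::cout << host[0] << "\n";  // prints 4 after both steps
    return 0;
}
```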
Starting point is 00:18:10 El Capitan is made up of tens of thousands of these APUs, each one linked together to create a vast system capable of calculations on a scale never before seen. The way these large supercomputers are assembled is you have the basic unit of compute that's called a node and a node in our case is actually already made up of multiple APUs. Then you take nodes and you assemble those into blades and then blades get assembled into like a commercial grade refrigerator sized rack that sits on the floor and weighs a lot.
Starting point is 00:18:47 And then we assemble those racks together on the order of about a hundred of them for El Capitan to make the entire system. One of the biggest challenges with exascale computing wasn't just designing the machine, but building the infrastructure to support it. At Lawrence Livermore National Laboratory, they had to overhaul the entire electrical and cooling infrastructure, doubling their capacity to handle the immense energy demands of El Capitan. A new utility yard was built, supplying enough energy to power tens of thousands of homes,
Starting point is 00:19:34 just to ensure the supercomputer could run at full capacity without interruption. As part of something called the Exascale Computing Facility Modernization Project, we deployed a significant increase in the electrical infrastructure to our main data center. And so that took us from 45 megawatts to 85 megawatts. So we're essentially 2x the energy that we can deliver to the computer floor. Now El Capitan is not gonna use all 85 of those megawatts
Starting point is 00:20:10 but it's gonna use somewhere around 30 of those at any given time. That power is enough to supply around 30,000 homes. The extra energy capacity of the Livermore Computing Facility ensures they can sustain existing supercomputers alongside El Capitan. Despite its substantial energy requirements, El Capitan is one of the most energy efficient supercomputers ever built in terms of performance per watt. But all that power generates heat. A tremendous amount of it. Liquid cooling is required to keep these systems from literally melting because they run at sometimes over a thousand watts.
Starting point is 00:20:53 So you think about how hot a hundred watt light bulb can get. Magnify that now by 10, 20, 50 times. That's how much heat you're trying to dissipate in a very small package in one of these nodes of a super computer. And then multiply that by the tens of thousands of nodes that make up these systems. That's a lot of heat that you've got to try to make sure you can get rid of. And liquid cooling is the idea that you bring in cool water, you then run water across cold plates, it dissipates some of the heat away, goes out the other side of the water, you then run water across cold plates, it dissipates some of
Starting point is 00:21:26 the heat away, goes out the other side of the rack, and then eventually, through heat exchangers, goes out to a cooling tower, and then that water is cooled and the cycle repeats. At full operation, El Capitan will cycle through 5 to 8 million gallons of water every 24 hours to keep its systems cool and running efficiently. Building El Capitan required more than just cutting-edge technology. It took a coordinated effort across multiple organizations. Years of collaboration between Lawrence Livermore National Laboratory, the Department of Energy and NNSA, and private industry were essential in overcoming the immense technical challenges of exascale computing.
Starting point is 00:22:13 Back around the time when we were first starting to talk about exascale and recognizing the challenges, we created a term that stuck called co-design, which was really the idea that we're going to have to take this from a standard customer-client relationship with these companies to something much more collaborative. We need to understand more about the long-distance roadmaps of these companies so that we can begin to angle our research and our applications development toward what their roadmaps are. But probably more importantly, these companies really need to understand where the bottlenecks are in our applications so that they can think about how to design their
Starting point is 00:22:54 hardware in ways that are going to best address our concerns and our needs. Co-design emerged as a way to blur the lines between hardware and software development, bringing together experts from both fields to work side by side. This deeper level of collaboration, often involving clearances for security-sensitive work, allowed teams to quickly identify and address the most critical challenges, speeding up progress in ways that wouldn't have been possible under a standard customer-supplier relationship. Being innovative is really critical to doing new things, taking new approaches. But if you have a completely new approach all on your own, you're not going to get much done
Starting point is 00:23:35 because big things take lots of people. Together, they've built something truly extraordinary. El Capitan is not only faster and more powerful, but is also able to tackle problems that were once deemed intractable. So there's a class of problems that are big 3D problems that we want to run at high resolution. We've been studying this class of problem for years, since the beginning of ASCI. And the first time we took it out for a spin, it took like half of our biggest supercomputer and it took over a month to run it. And in 2015, we checked again and it took maybe 20% of our supercomputer and it took
Starting point is 00:24:22 a little less than a month. Then we took that same calculation out for a spin on Sierra and all of a sudden it took 3.3 days. Whoa! Okay, that is like game-changing, right? Think about it. Think about what you can turn around in 3.3 days as opposed to a month. Ten different types of those calculations. Think about if you're designing something, how that changes what you can do, right? It's just night and day. I get 10 shots as a designer to make a choice in a month.
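The arithmetic behind that "10 shots" figure, using the run times quoted above:

$$
\frac{30~\text{days}}{3.3~\text{days per run}} \approx 9\text{–}10~\text{runs per month},
$$

versus a single run per month before — roughly an order of magnitude more design iterations in the same calendar time.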
Starting point is 00:24:58 That's incredible. Oh, and by the way, that 3.3 days took less than 10% of Sierra. That was like, this is going to be tractable on El Capitan. We need to continue pushing. We need to get to higher mesh resolutions and do a better job with the physics. And that is our goal. Our goal is a reasonable turnaround time for a medium to high resolution, full physics calculation in three dimensions
Starting point is 00:25:27 that we have never been able to do before. It's an open question whether or not El Capitan will be the fastest computer in the world for one year, two years, maybe three years. We can't predict that right now just because everybody's always working to build faster and faster computers. El Capitan's achievement as the world's fastest supercomputer isn't just about speed. It's about what that speed can accomplish. It signals a new era of computational capability that will tackle some of the biggest challenges facing humanity,
Starting point is 00:26:05 whether it's understanding complex physical phenomena or advancing national security. We have a series of problems that are just going to challenge the entire scale of the machine. There are problems, I can imagine, they're big problems, they're things that no one has ever dreamed really trying. We have laboratory directed research and development projects that have put things in place where like, if we could do something massive, we could solve this problem. And there are a few of them that over time using El Capitan,
Starting point is 00:26:37 we will get the codes aligned and arranged and go after those problems as well. They're probably not gonna be the first thing we try, but over the life of the machine, we will take a shot. I'm very certain of that. El Capitan's journey did not begin yesterday, and it won't end tomorrow. Decades of planning, innovation, and collaboration
Starting point is 00:26:57 have led to this moment. And now, even as it comes online, the lab is already looking to the future. We always are planning ahead as much as possible and we're already starting to think about what the next system in the 2030 timeframe is going to be. And we want to make sure we can begin to stand it up and we're anticipating it's also going to be a very power hungry system while still keeping El Capitan running, because it will of course be being used for the mission during that time. So we sized our facility to be able to support multiple exascale systems at one time during that overlap period when we're at end of life for one
Starting point is 00:27:37 system and beginning of life for the next system. So that was one of our big challenges was making sure we have the infrastructure, the power and the cooling required to field these systems. Building world-changing technology requires looking far ahead, anticipating the limits of today's capabilities and constantly pushing the boundaries. The team at Livermore isn't just solving problems. They're planning for challenges that may not even exist yet. What's the next big thing? That is the billion dollar question.
Starting point is 00:28:13 I think taking a step back and looking at the numerical methods that we can apply on these machines and looking at different ways to run sensitivities or to understand how we not just get one solution but a solution plus ensembles of answers or getting gradients of the solution is probably what we should be doing once we get through the El Capitan challenge. Hardware advances are uncertain. The machine learning market is driving hardware changes that are complex for our codes to deal with. They don't need the precision we need,
Starting point is 00:28:59 and so the hardware that vendors are creating takes that into account to sell to that market. So that's going to be a challenge for us. We're looking at both technology slowing down and prices going up, and we're very worried that for the same dollars we're not going to get a lot more compute than we have in El Capitan. So we're thinking about how we can make the systems get more work done for the same compute capability. Because of that, the future of the architectures is unclear.
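For context on the precision gap driving that concern: a floating-point format's accuracy is set by how many bits it devotes to the fraction, so machine epsilon (the gap between 1 and the next representable number) differs enormously across formats. The 16-bit formats favored by machine-learning hardware give up most of the digits that 64-bit scientific calculations rely on:

$$
\varepsilon_{\text{fp64}} = 2^{-52} \approx 2.2\times10^{-16},\qquad
\varepsilon_{\text{fp32}} = 2^{-23} \approx 1.2\times10^{-7},\qquad
\varepsilon_{\text{fp16}} = 2^{-10} \approx 9.8\times10^{-4}.
$$

That is why hardware tuned for AI can shed precision that simulation codes cannot.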
Starting point is 00:29:33 So for the code teams, I think thinking about new types of calculations we can do, new types of numerical methods we can employ, because we do have huge compute, is what we'll do in the short term. The real power of El Capitan isn't just in the numbers it can crunch today, but in the new frontiers it opens for tomorrow. This is a signal of the United States' continuing leadership in high performance computing, that we can continue to do something that is the best in the world. And that alone makes El Capitan interesting. And that alone is one reason to be proud of what we're doing here at our national laboratories with U.S. industry to do something that is the best in the world. But of course, a fast computer
Starting point is 00:30:26 that doesn't actually solve any of humanity's problems, that's not terribly interesting. So it's not enough for us to just be the fastest in the world. It has to be for a purpose. As the team at Livermore has shown, reaching for the pinnacle is just the beginning. The pursuit of what comes next, anticipating future limits and pushing past them. That's the enduring mission. Thanks for listening.
