Big Ideas Lab - Supercomputing
Episode Date: September 24, 2024

In the 1960s, a quiet revolution was taking shape behind the scenes, one that would eventually touch every corner of our modern world. As Cold War tensions rose and the future grew more uncertain, scientists embarked on a mission to harness a new kind of power: high performance computing. What began as an effort to solve complex nuclear challenges would soon transform the way we think, work, and live. From early machines like the UNIVAC-1, nicknamed an "oversized toaster," to today's supercomputers capable of simulating entire worlds, this episode explores how Lawrence Livermore National Laboratory's relentless innovations have transformed technology, and how they continue to shape the future of our everyday lives.

Big Ideas Lab is a Mission.org original series. Executive Produced and Written by Lacey Peace. Sound Design, Music Edit and Mix by Daniel Brunelle. Story Editing by Daniel Brunelle. Audio Engineering and Editing by Matthew Powell. Narrated by Matthew Powell. Video Production by Levi Hanusch.

Guests featured in this episode (in order of appearance):
Michael McCoy, Former Director of LLNL's Advanced Simulation and Computing Program
Rob Neely, Associate Director for Weapon Simulation and Computing at LLNL
Becky Springmeyer, Division Leader for Livermore Computing

Brought to you in partnership with Lawrence Livermore National Laboratory.
Transcript
It is the fast, reliable, and tireless performance of a variety of arithmetic and logical operations
that gives the computer its great utility and power.
It was a sunny morning in July 1960.
Inside a newly constructed building at Lawrence Livermore National Laboratory,
a team of engineers and scientists bustled around a colossal machine.
This wasn't just any machine. It was the UNIVAC Livermore Advanced Research Computer,
also called the UNIVAC LARC. It was the most advanced computer of its time.
The UNIVAC LARC had originally been designed and built in a Philadelphia factory,
and it had taken a crew of 40 installers and five 18-wheelers
three weeks to trek across the country through winter conditions to deliver it to the lab.
But delivery was just the first step of an even longer journey.
Now, after four months of assembly, installation, and calibration, it was finally
ready for testing. The computer room was filled with anticipation as engineers and installers
gathered around the 135,000-pound, eight-foot-tall machine. As the clock ticked closer to the moment
of truth, the central console, a maze of flashing
lights and switches, became the focus of attention.
Then, with a nod from the lead engineer, the UNIVAC LARC roared to life.
This behemoth, a marvel of engineering and computational power, marked the dawn of a new era in computing, the era of supercomputers.
It was a monster not only in size, but also with a vision for the
future and the advancement of humankind. You are a backpacker walking on a ridge with two chasms.
The chasm on the left is having a big idea that has too much risk and you're going to fail. The chasm on the right is being so afraid of failure that you can't make progress.
So you've got this three-foot ridge that you're walking on with a backpack.
You take a misstep on either side and you're in the drink.
It's all over.
That's what we did.
We did that for 25 years.
Step into the extraordinary world of supercomputers, the invisible giants shaping our world.
Welcome to the Big Ideas Lab, your weekly exploration inside Lawrence Livermore National Laboratory.
Hear untold stories, meet boundary-pushing pioneers, and get unparalleled access inside the gates.
From national security challenges to computing revolutions, discover the innovations that are shaping tomorrow, today.
From today's perspective, it's hard to imagine a world without computers.
Computing technology is in everything.
Our phones, laptops, TVs, thermostats, cars, even our coffee pots. But before the 1940s, the word computer meant something entirely different, something a lot more human.
Before there were electronic computers, a computer was literally a room full of usually women calculating things by hand.
That's Rob Neely, the Associate Director for Weapon Simulation and Computing
at Lawrence Livermore National Laboratory.
We're a nuclear weapons lab, and the mission here has always been rooted
in the nuclear weapons stockpile.
And that was true from day one of this laboratory.
So the computers were used to give the weapons designers of the time
insight into really how to design these new weapons.
In the 1940s, computers advanced in several significant ways,
transitioning from electromechanical systems to electronic designs,
which laid the groundwork for the next stage of computing technology.
By the time the lab was founded in 1952, it was quickly identified that the most advanced computers would be a prominent part of its work. The thought leaders for this new laboratory recognized the importance of computing to what they wanted to do to design the next generation of nuclear weapons.
Edward Teller, even before those doors opened, got permission to buy a computer for the weapons
program. Michael McCoy was a lifetime employee at Lawrence Livermore and the former director
of Lawrence Livermore National Laboratory's Advanced Simulation and Computing Program.
It was a UNIVAC Remington Rand machine, probably weighed tons with tubes and must have been almost impossible to use.
But that machine had maybe a thousand words of memory. You could store a thousand different numbers on it, very small, and would do a few thousand calculations per second.
And those machines didn't have compilers on them.
You couldn't sit down and write code saying A equals B plus C and expect the computer to do it.
You had to do it in something called machine language, which was very cumbersome.
This is the face of a UNIVAC, the fabulous electronic machine, which we have borrowed.
This was the UNIVAC-1, a predecessor to the UNIVAC-LARC that we met at the beginning of the episode.
Engineers at the time lovingly referred to the UNIVAC-1 as an oversized toaster.
This is not a joke or a trick. It's an experiment. We think it's going to work.
This computer was famously used by CBS to predict the 1952 presidential election, which
Eisenhower won by a landslide. It was obviously evident that we should have had nerve
enough to believe the machine in the first place.
It was right. We were wrong. Next year, we'll believe it.
Univac predicts now with, again, odds of 100 to 1 that Eisenhower will be re-elected president. It's very hard to program. You literally had to move wires, kind of like a switchboard operator.
So it was very rudimentary, but it gave physicists insight into equations that you would use to model the effects that we were trying to get out of a nuclear weapon. At the time, larger yield, bigger bang. But we didn't rely on it completely.
We would always be grounded in experiments and tests and what we call our UGT or underground
test history.
What you could do with those primitive computers was using the data that you got from these
underground tests or whatnot, you could take small excursions away from that design and get
a sense of what that would do. It's not like it would predict. It would give you insight
as to what to do next. And so the laboratories needed these computers, even though they had underground tests and scientists with intuition, in order to accelerate progress.
It wasn't until the late 50s and early 60s that the lab saw computer speeds, storage capabilities, and problem-solving abilities begin to progress
in a powerful direction. At the time, supercomputing and computers and computing,
it was all one thing, right? There was only a handful of companies that made computers.
And whether you were using them at the National Laboratory or using them for the Census Bureau or
using them for airline reservations.
They were all the same kinds of systems.
So computers and supercomputers were really one and the same up until about the early 60s.
And that's where the term supercomputer began to emerge.
This is when computers like the Univac Lark,
which held the title as the fastest computer until 1962, enter the scene.
From there, year after year, these machines progressed, standing out as true supercomputers
compared to their commercial counterparts.
So there was an era, sort of a second era of computing that I would call the first supercomputing
era.
And so there was a couple decades there where supercomputing really was its own line
and was very specialized for the kind of mission we do here. While the commercial computer market
was developing smaller, more portable computers, supercomputer development was headed in the
opposite direction. Supercomputers kept growing in size and power, occupying their own buildings
with reinforced floors, specialized systems for cooling and power, and dedicated teams working around the clock just to maintain
them.
It is the size and complexity that would continue to differentiate supercomputing forever.
I would define a supercomputer as, it's a moving definition, of course, because if you
look at the amount of computing power in a supercomputer of the 70s or 80s, that's easily in your iPhone now or perhaps even in your microwave oven.
But the way I think of supercomputers is it's the technology that is the cutting edge for its time.
It's not something that's going to fit in your pocket. We have learned to make war by unlocking the atom.
To make peace, we must limit our use of that power.
In the 1980s and 90s, the geopolitical landscape began to shift.
Mr. Gorbachev, tear down this wall.
Now, what changed?
The United Nations General Assembly voted overwhelmingly
to adopt the Comprehensive Nuclear Test Ban Treaty.
What changed is the end of testing in the 90s.
That no nuclear weapons will be detonated anywhere on the face of the earth.
And the SALT talks, the collapse of the Soviet Union, all of those events.
And the world began talking, a lot of people in the U.S. began talking about the peace dividend.
And so the laboratory budgets were cut enormously.
The historic and revolutionary transformation of the Soviet Union
and the liberation of its people. The end of the Cold War in the 1990s saw a reduction in the
urgency surrounding nuclear weapons, leading to a decline in the arms race between the United States
and the Soviet Union. This trend started with the SALT II Treaty in 1979 and culminated with the Comprehensive Test Ban Treaty, or CTBT,
in the 1990s, which aimed to halt nuclear testing globally. The United States conducted its last
nuclear test in 1992. At some point, though, the country understood that this stockpile,
all the various weapons systems that are in the stockpile, were aging.
In the past, it didn't matter because you would swap them out and develop new weapons and replace
them. Now you had to keep a weapon going for 40, 50, 60 years. The question was, how do you do that?
Well, weapons age like people. We don't age symmetrically.
That makes it a more complicated 3D problem. Also, if you can't test, you have no basis of
knowing whether whatever design or what you've done is useful. It might even be deleterious. So what happened was there was a need
for what we call predictive simulation. We had to move from primitive 1D and maybe
one-dimensional, two-dimensional codes to three-dimensional codes. We had to move from very simple physics models to very complex physical models. Because
we had to have confidence in the outcomes, we would need a 100 teraflop computer, or 10 to
the 14th operations per second, in order to model an entire nuclear device, some kind of system at entry-level
resolution. It would be a 3D calculation with some primitive physics in it, but enough to convince us
that we could do these simulations, but not nearly enough resolution to rely upon.
So that was the goal of the ASCI program.
In 1995, the Accelerated Strategic Computing Initiative, or ASCI, was established.
This program included three labs, Sandia National Laboratories, Los Alamos National Laboratory,
and of course, Lawrence Livermore National Laboratory.
The launch of this program highlighted the government's recognition of the growing
demand for more advanced computers to bolster national security.
However, embracing the idea was merely the initial phase.
The real challenge lay in translating this vision into action. According
to the U.S. Department of Energy, the primary goal of ASCI is to create the high-confidence
simulation capabilities needed to integrate fundamental science, experiments, and archival
data into the stewardship of the actual weapons in the stockpile. At the time, this mission meant building a computer
that could fully model a nuclear device.
In order to advance, there's three things that have to happen.
First of all, the computers have to advance in the manner that we hope,
and we're very dependent on industry for that.
The second is our codes and our ability to analyze the results coming from the codes has to advance commensurate with the former.
Every time the technology changed, the codes, the dog would wag the tail of the codes and the codes would have to get rewritten.
And finally, we need the infrastructure.
We need the computer building and the rest of it. These machines had to fit on primitive computer floors and required enormous amounts of power.
We have a very large building that has a machine room floor where we keep a lot of the computers.
That's Becky Springmeier, division leader for Livermore Computing.
Our main machine room floor has 48,000 square feet,
which is more than a couple of football fields across.
You can imagine that it's very noisy in the computer room floor.
There's a dull roar, so we have certain rules that if you're close to certain computers long enough,
you need to wear ear protection.
You talk about multidisciplinary.
There's physicists developing the codes. There's engineers developing codes, with different emphases. There's computer scientists making sure the underlying software on the computers actually works. There's people doing the procurements, dealing with the vendors over costs and technology.
And then there's the infrastructure.
If you don't advance all three of those together simultaneously and in a coordinated way across all of these disciplines, you are a dead duck.
And I'll tell you, the Litter Board does not tolerate dead ducks.
This has been a dream of American leaders going back to Presidents Eisenhower and Kennedy
Banning all nuclear tests for all time
After the U.S. ceased nuclear testing in 1992, the government was faced with a problem
How would they maintain the existing weapons stockpile without physical testing?
For the answer, they started the Accelerated
Strategic Computing Initiative, or ASCI, and turned to national labs for guidance.
The ASCI program was tasked with designing a computer that could simulate a nuclear test.
And in order to do that, they would need at least a 100 teraflop computer.
So what exactly does that mean? Let's break it down. FLOPS is an acronym for
floating point operations per second. A floating point number is a number with a decimal, allowing
for a wide range of values between whole numbers. A floating point operation is a mathematical
calculation, like addition, subtraction, division, or multiplication
between floating point numbers. It's just a fancy way of saying doing math with numbers that have
decimals. So a teraflop, tera meaning trillion, is a trillion floating point operations per second.
And a 100 teraflop computer, like the one Livermore hoped to build in the 1990s,
is a computer that performs 100 trillion floating point operations per second.
For comparison, the average home PC at the time could achieve around 100 million floating point operations per second,
which is only 0.0001% of the computational capability that ASCI hoped to house in their supercomputer.
This hypothetical 100 teraflop computer would provide the power needed for creating highly accurate weapon simulations.
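The arithmetic above can be checked in a few lines of Python. The figures are the episode's round numbers, not precise benchmarks:

```python
# Illustrative flop arithmetic using the episode's round numbers.
TERA = 10**12
MEGA = 10**6

asci_goal_flops = 100 * TERA   # the 100-teraflop ASCI target
home_pc_flops = 100 * MEGA     # ~100 megaflops for a home PC of the era

# What percentage of the ASCI goal could a home PC deliver?
percent = home_pc_flops / asci_goal_flops * 100
print(percent)  # 0.0001, the fraction quoted in the episode
```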
I don't think anybody argued that we could do something different at 100 teraflops.
That wasn't the hard part.
The hard part was, how did we get to 100 teraflops?
That's Michael McCoy, the former director of Lawrence Livermore National Laboratory's Advanced Simulation and Computing Program.
Fortunately for Michael and his team at Livermore, it was the late 90s, and computing was about
to undergo a
massive leap in progress that could just land them at that goal. This third era of computing you might
call massively parallel computing, where we were taking more or less off-the-shelf chips, putting
those together in a system so that you had hundreds of these processors all working together as a single system.
That was a sea change in computing, moving from these scalar computers to massively parallel
computers. There were multiple problems, right? The first was running huge numbers of processors in parallel. It was moving to tens of thousands to hundreds of thousands of processors.
When up till then, we might have been running at 50 to 60 processors.
Second, there was the issue of data management and visualization.
That was key because you're going to be generating all of this data.
Where do you put it?
How do you look at it?
Another was how do you develop codes, physics codes,
that can run on a parallel computer?
If you're writing a code to run on a single processor,
that's relatively straightforward.
If you're writing a code to run on a computer that has a fairly limited operating system that doesn't really know much
about running in parallel, that's an entirely different problem.
Step by step, Michael and his team began chipping away at each of these obstacles.
And in the fall of 1998, the lab hit its first major milestone.
They installed the ASCI Blue Pacific supercomputer, a computer that could
achieve a 3.9 teraflop performance. We are now breaking the speed barrier
when it comes to computing power and computing time.
A year after its installation, the Blue Pacific performed the first ever three-dimensional
simulation of an exploding nuclear weapon.
It's fast.
It would take a person using one of these handheld calculators 63,000 years to make the number of calculations that this new
computer makes in one second. Michael's team had successfully achieved their first ASCI milestone
and quickly moved on to their next project, ASCI White. However, the team's success was
short-lived, as they were quickly humbled by the intricacy and complexity of ASCI White.
Failure of even the smallest component could mean failure of the entire system.
They're so complex. I recall with an early system, we were working with IBM,
and we could not get the machine to be stable. It was the ASCI White machine, a 10-teraflop system.
And it was like an $80 million system.
It was the second system we sited as part of the ASCI program.
And it went on for weeks or months.
And finally, working with IBM, we identified that an interposer,
which is a little part, it's between two other active parts of the system, was failing.
And there's thousands of these interposers on the machine, several on each node.
And it was going to be a monumentally complex and difficult thing to address this problem.
We've had other similar experiences with almost every machine we've sited.
We often get a serial number one machine is the first of its kind, and we have to figure
things out about it as we go.
And usually and often that happens when we start to scale it up and we see things that
they didn't see in the factory that no one has seen before.
Part of what our operations team does is monitor the systems 24-7.
And sometimes things go wrong.
Nodes will go down.
And so we have people watching them around the clock. Eventually, ASCI White reached its full speed of 12.3 teraflops.
The ASCI Purple system was the next iteration after ASCI White.
It was this project that the team believed could finally surpass the 100 teraflop threshold.
But there was a major problem.
The power consumption of the Purple machine was growing without bounds.
It wasn't just the computers not working as they were designed that caused issues.
There were also design challenges with them working exactly how they were designed.
We could handle it on the new building we were building, but what comes next?
If we stay on this path, what happens next?
Will we even be able to site a computer that requires more power than the city of San Francisco?
And so we began thinking about what are low power consuming solutions that we might explore.
As the computers grew in size, so too did their power consumption. It was a logical,
linear progression, but Michael knew that they were reaching a point where they would have to
think differently. Enter Blue Gene L. We built the Blue Gene machine. We blew the Earth Simulator, which was the big
Japanese machine that had been number one at the time, out of the water with this 360 teraflop
computer. $16 million machine blew a $400 million machine in Japan out of the water. We took number one at supercomputing. That was fun.
That machine, we upgraded it from 360 teraflops to 500 teraflops. That machine stayed number one
for seven or eight cycles before the rest of the planet could catch up. Now, in all honesty, this machine had a limited scope of inquiry.
It wasn't like the Earth Simulator that it supplanted, or ASCI White or Purple, that could be used for nuclear weapons calculations.
But it suggested that there were other approaches consuming far less power that should be followed and could be followed with positive results.
While the computers were being designed to do simulated experiments, each computer iteration itself was its own experiment.
They tested different levels of power consumption, structural designs, and code types. After they broke the 100-teraflop milestone with the Blue Gene machine,
competing labs around the world exceeded Lawrence Livermore National Laboratory's record.
They lost the title of world's most powerful supercomputer.
When we were trying to get to terascale computing, that was a big deal. And now, you know, on a
cell phone, you can do things that the giant computers did in my day when I was first doing homework on a CDC 7600.
There's a massive amount of change in computers over time, and they get more and more capable.
But with their next project, Sequoia, a third-generation blue-gene machine, they achieved a far more sophisticated version of the original Blue
Gene.
Sequoia's power grew at an exponential rate, exceeding the computational limits that had
seemed impossible just a few years before, showcasing how foundational Blue Gene was
in advancing the development of high-speed computers.
Sequoia earned the title of World's Fastest Supercomputer in June 2012, when it demonstrated a 16.32 petaflop performance.
For perspective, one petaflop is 1,000 times faster than one teraflop.
Not only was it the fastest computer, but it also earned the title of the world's most energy efficient computer.
But the race for most powerful computer never ends.
So that's really been the last two or three decades in the post-testing era is how can we
make these codes so accurate that we have enough confidence to say to the president,
our stockpile is safe and secure. So all of that requires us to have more confidence in the simulations that we're
running. So if you think about like a digital photo, you know, a low resolution looks kind of
blotchy, but you can kind of see what's going on. High resolution, much more accuracy. There's an
analogy when we do simulations, we discretize things into small pieces and the finer resolution
you can use, the more accuracy you'll
get. That takes more memory and more compute. Six years later, in 2018, the lab launched the
supercomputer Sierra. Sierra had a 125 petaflop theoretical peak performance and was more than
six times faster than Sequoia. It's actually the unique design architecture
configured specifically to run artificial intelligence
that is the breakthrough.
It's called Sierra,
a very different kind of supercomputer than ever before.
We've been talking about artificial intelligence for decades.
These custom-designed chips and the classified software
are creating detailed computer simulations to a
level never seen before. They can study biology all the way down to the individual atoms, the
hydrogens, all the way down there. Cancer, HIV, traumatic brain injuries are just a few topics
that scientists are eager to get started on. In just two decades, the computing teams at Lawrence Livermore had gone from a 50 gigaflop
peak capacity computer to now a 125 petaflop capacity, a 2.5-million-fold increase.
And there were even faster speeds anticipated ahead.
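A quick sketch, using the episode's round numbers, shows where the 2.5-million figure comes from, and why the finer simulation resolution described above keeps driving demand. The mesh sizes here are made up purely for illustration:

```python
GIGA, PETA = 10**9, 10**15

# Peak-speed jump described above: ~50 gigaflops to Sierra's 125 petaflops.
increase = (125 * PETA) / (50 * GIGA)
print(increase)  # 2500000.0, the "2.5 million" increase

# Resolution drives demand: halving the cell size of a 3D mesh
# multiplies the cell count (and roughly the memory) by 2**3 = 8.
def cell_count(cells_per_axis):
    return cells_per_axis ** 3

print(cell_count(200) / cell_count(100))  # 8.0
```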
All of this is culminating in the need for better codes,
better physics, and that all requires faster computers if we want to be able to get answers
in a tractable amount of time to help us deliver on the mission. So that's really what's driving us
to now exascale computing with the arrival of El Capitan. Exa is the important piece of this,
and that's a Greek prefix that means 10 to the 18th.
The reason Exascale was such a big deal was that back right around the time of the first
petascale system, we started looking at technology trends, and it became clear to everybody pretty
quickly that getting that next factor of 1,000 was going to be a lot harder than the previous couple
factors of a thousand were. So for a long time, people have probably heard of Moore's law, right?
It's the approximately the doubling of speed of processors every year and a half or so. It's
technically more related to how many transistors you can fit in a given area. But for a long time,
as long as we could shrink computing smaller and
smaller, we could assume it would get faster and faster. And we just rode that curve for a while.
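As a rough sanity check on that curve, the doubling cadence Rob describes implies each factor-of-1,000 leap takes about a decade and a half:

```python
import math

# ~10 doublings give a factor of 1,000; at one doubling every
# 18 months, that works out to roughly 15 years per 1,000x leap.
doublings = math.log2(1000)
years = doublings * 1.5
print(round(doublings, 2), round(years, 1))  # 9.97 14.9
```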
Well, that's beginning to hit its limits. We've started looking at how parallel can we go? How
big can these systems be made? But now we're running into power constraints, right? You might
need a nuclear power plant literally to run a supercomputer if we weren't going to do something dramatically different to
make these systems more power efficient. So exascale was deemed early on, as soon as people
began thinking about it as, this is not just going to be a typical turn of the crank like we've been
doing. Not that it was easy in the past, right? Every one of these
leaps of a factor of a thousand in computing speed was a lot of work, but Exascale was going
to be viewed as a particularly difficult challenge. Exascale is the next grand step
in the supercomputing evolution. It's not only 1,000 times faster than Petascale,
but a many times more complicated system to build. Imagine you've got a really powerful gaming system, right? Maybe your teenager is really
into playing first-person games and wants a really beefy system. Maybe it's even water-cooled,
but it's got a really hefty graphical processing unit on it and a state-of-the-art CPU and lots of memory. So that's what most people think of
as like a really powerful computer. Now imagine you take something like that and replicate it
10,000 times and put that all in a very dense packaging on a very high speed network so that
now you can take a problem that you're trying to solve.
We're not running games now.
We're running simulations.
You're now taking that problem and distributing it across these 10,000 or so very powerful
individual nodes and writing the kinds of software that we do to take advantage of all
that different parallelism, all the GPUs working in concert.
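The idea of splitting one problem across many nodes and combining the partial results can be sketched on a single machine with Python's multiprocessing module. The integral and worker counts here are toy stand-ins, not anything Livermore actually runs:

```python
from multiprocessing import Pool

# Toy stand-in for a physics kernel: integrate x**2 over a sub-interval
# with the midpoint rule. Each "node" handles one slice of the domain.
def integrate_chunk(bounds, steps=100_000):
    a, b = bounds
    dx = (b - a) / steps
    return sum((a + (i + 0.5) * dx) ** 2 * dx for i in range(steps))

def parallel_integrate(workers=4):
    # Domain decomposition: split [0, 1] into one slice per worker.
    edges = [i / workers for i in range(workers + 1)]
    chunks = list(zip(edges[:-1], edges[1:]))
    with Pool(workers) as pool:
        partials = pool.map(integrate_chunk, chunks)
    # Combine partial results, as real codes do over the high-speed network.
    return sum(partials)

if __name__ == "__main__":
    print(round(parallel_integrate(), 6))  # close to 1/3, the exact integral
```

Real simulation codes do the same thing at vastly larger scale, usually with MPI across thousands of nodes rather than processes on one machine.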
That's really what's cutting edge. That's what's complicated. And that's some of the things
we do best here at Livermore. El Capitan will be the lab's first exascale computer.
It is projected to be released and functional in late 2024. Over the years, the computing
challenges have gotten even bigger and more complex to solve, which raises the question: is there a stopping point?
Scientific discovery is not about the destination. It's about the journey.
It's like life. And there are mileposts in that journey that are seminal and that elevate simulation and change the way we live.
But anyone asking, are we done now, is an exascale computer enough, doesn't have a clue
in terms of understanding the process of scientific discovery and evolution. It's an endless journey, but each step along that journey
improves prospects, delivers to the country, increases our security, helps us to develop drugs
that are essential, helps with genome sequencing, helps us to model earthquakes and couple them to structures
to see if they'll collapse. All those things emanate from breakthrough calculations that
are done at the DOE laboratories. There's no done. We're never done. Once we're done, we're dead.
Science is like a human life. It's a journey. You're only done when you're dead.
In the relentless quest for
scientific advancement, Lawrence Livermore National Laboratory is on the brink of a monumental leap
forward. Exascale supercomputing. The stakes are high and the challenge is daunting in this
extraordinary journey to unlock exascale supercomputing: a realm of computational power so immense,
it has the potential to revolutionize our understanding of the universe.
The thing about big ideas is that lots of people have big ideas.
The problem is execution, and that is the critical element.
And I would say that Livermore is known for its persistence.
We have a big idea.
We are given the mandate to pursue it.
We face tidal waves of opposition over the years.
Technical, political, fiscal, all of them.
And at the end, the laboratory, the scientists prevail.
So it's that persistence that is the key to differentiate between extraordinary institutions and ordinary institutions.
The journey towards unparalleled computational heights is far from over. At the time of writing this episode, and right now as you are
listening to it, the team at Livermore is hard at work in pursuit of the next big idea,
achieving exascale computing. Major challenges lie ahead, but as Michael emphasizes, the lab's
unwavering commitment will endure.
It is this perseverance that will keep them at the forefront, continually setting them apart as a beacon of innovation in the ever-evolving landscape of scientific exploration.

Lawrence Livermore National Laboratory invites you to join our diverse team of professionals
where opportunities abound for engineers, scientists, IT experts, welders, administrative
and business professionals, and more. At Lawrence Livermore National Laboratory,
your contributions are not just jobs, they're a chance to make an impact, from strengthening U.S. security to leading the charge in revolutionary energy solutions and expanding the boundaries of scientific knowledge.
Our culture at the lab values collaboration, innovation, and a relentless pursuit of excellence. We're committed to nurturing your professional journey within a supportive workspace and offering a comprehensive benefits package designed to ensure your well-being and secure
your future.
Seize the opportunity to help solve something monumental.
Dive into Lawrence Livermore National Laboratory's wide variety of job openings at
llnl.gov forward slash careers, where you can also learn more about our application process.
This is your chance to join a team dedicated to a mission that matters.
Make your mark,
visit L L N L dot gov forward slash careers today to discover the roles
waiting for you.
Remember your expertise might just be the spotlight of our next podcast
interview.
Don't delay. Uncover the myriad of opportunities available at Lawrence Livermore National Laboratory.
Thank you for tuning in to Big Ideas Lab.
If you loved what you heard, please let us know by leaving a rating and review.
And if you haven't already, don't forget to hit the follow or subscribe button in your podcast app to keep up with our latest episodes.
Thanks for listening.