Big Ideas Lab - El Capitan
Episode Date: November 19, 2024
For decades, it was an ambitious dream: to create a supercomputer powerful enough to tackle humanity's most complex problems. Now, that dream is a reality. On November 18, 2024, El Capitan made history as the world's fastest supercomputer, surpassing two exaflops of speed. Join us as we explore how this monumental achievement is set to redefine national security, revolutionize scientific research, and spark breakthroughs that could change the world as we know it.
Transcript
For decades, the U.S. Department of Energy has been pursuing a bold vision.
A system powerful enough to tackle the greatest challenges facing humanity.
Fears of a serious new threat to U.S. national security.
Russia has begun major nuclear weapons exercises.
The World Health Organization has declared a global public health emergency.
The electricity system is undergoing a once-in-a-century transformation.
What does the energy of the future look like?
That vision? Exascale computing.
Exa is a Greek prefix meaning 10 to the 18th, and an exascale system nominally is about how many calculations it can perform per second. The U.S. government and the National Nuclear Security Administration's Tri-Labs,
Lawrence Livermore, Los Alamos, and Sandia National Laboratories,
needed a machine capable of operating on a scale that had never been done before.
You can think of it as a billion billion.
And so just the sheer number of calculations
that you can perform in a fixed amount of time
is beyond anything that we've been able to do in the past.
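For concreteness, the arithmetic behind "a billion billion" and the two-exaflop headline figure works out as follows:

```latex
\[
\underbrace{10^{9}}_{\text{a billion}} \times \underbrace{10^{9}}_{\text{a billion}}
  = 10^{18}\ (\text{``exa''}),
\qquad
1~\text{exaflop} = 10^{18}~\text{floating-point operations per second}.
\]
\[
\text{El Capitan's peak} > 2~\text{exaflops} = 2\times 10^{18}~\text{operations per second}.
\]
```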
The NNSA labs needed a computer that
could simulate nuclear reactions to the tiniest detail,
discover new materials, boost energy,
advance inertial confinement fusion, and meet the
nation's evolving national security demands.
But building a machine of this magnitude required vision, and a willingness to gamble on the
unknown.
More than a decade of work went into building something that would push the boundaries of what was possible. Capable of performing more than two quintillion calculations per second at its peak.
And now that vision is a reality.
This machine doesn't just turn on.
It awakens.
Piece by piece, system by system.
Each one coming to life in perfect synchronization.
And then, it happens.
The future has arrived.
Welcome to the world, El Capitan.
We expect El Capitan to offer more total compute capability
than any previously built system.
Welcome to the Big Ideas Lab,
your weekly exploration inside
Lawrence Livermore National Laboratory.
Hear untold stories, meet boundary-pushing pioneers, and get unparalleled access inside the gates.
From national security challenges to computing revolutions, discover the innovations that are shaping tomorrow, today.
On November 18, El Capitan was officially launched at supercomputing's biggest showcase,
the SC Conference, where it was announced as the world's fastest supercomputer.
At a peak speed of more than two exaflops, El Capitan is not
just a technological marvel, but a machine that holds the future of
national security, scientific research, and breakthrough innovations in its hands.
El Capitan is one of the first exascale systems deployed in the United States.
It is the third in a series that the United States
has been developing and is the first of these exascale systems
to be deployed for the National Security Mission.
Rob Neely is the Associate Director
for Weapons Simulation and Computing at Lawrence Livermore
National Laboratory.
The immense computational power of El Capitan
and its unclassified companion system Tuolumne
holds the potential to solve some of humanity's biggest challenges.
From fusion energy to climate modeling to renewable energy research
to breakthroughs in drug discovery and earthquake simulation.
However, at its core, El Capitan was designed with a singular mission to ensure the safety,
security and reliability of the U.S. nuclear stockpile.
For the United States to maintain confidence in our nuclear stockpile, prior to 1992, if
we wanted to understand if a change or a new design worked, we would go off to Nevada, drill a big hole in the ground, put the weapon down there and set it off.
And that's called underground testing. And that's how things worked for decades.
We stopped doing that in 1992 under the Clinton administration.
And that left us with the big question, how are we going to retain confidence in our nuclear stockpile?
And so that really spearheaded a big push in the United States to use supercomputing
and modeling and simulation as one leg of a new program called science-based stockpile
stewardship designed to make sure we could retain our confidence in these weapons.
With new global threats emerging and a Cold War-era arsenal still in play, ensuring that
the U.S. maintains its nuclear deterrence and its competitive advantage over its adversaries
have become some of the nation's most critical challenges.
For the first time in decades, we're designing new weapons that are similar but safer, higher
performing.
So that national security mission is core to a lot of what we do and what we plan to
use El Capitan for.
This mission is not new.
It's one Lawrence Livermore National Laboratory has been working on since its founding in
the 1950s.
The nice thing about DOE labs is they do make these long-term investments in science
and technology that we think we're going to need for the mission so they can take 20
and 30 years to come to fruition, which is a really interesting work environment for
us.
Teresa Bailey is the Associate Program Director for the Weapons
Simulation and Computing Computational Physics Program at Lawrence Livermore
National Laboratory. Her job is to oversee the development of a wide range
of modeling and simulation tools that can be run on Lawrence Livermore
National Laboratory's high-performance computers. She points out that El Capitan
in many ways represents the culmination of decades of research and development,
bringing to life the vision of the Accelerated Strategic Computing Initiative, or ASCI, that was established over 25 years ago.
The ASCI program was designed to deliver modeling and simulation tools aimed at stockpile stewardship using high
performance computers so that we would never have to go back to nuclear testing.
So El Capitan really represents that end product for the original vision of ASCI.
Supercomputing is bringing to bear as much computational power as we can assemble to solve the hardest problems
that are out there.
Bronis de Supinski is the Chief Technology Officer
for Livermore Computing.
We do modeling and simulation of a variety of processes.
Most of them are related to stockpile stewardship,
but also climate science, things like that.
To do those kinds of simulations in ways
that actually model things close to reality takes quite a lot of computation, and so it takes
much more computing capability than you have in, say, your laptop or your phone. Today supercomputers
far surpass the computational power of any device you have at home.
What truly sets them apart is their precision and interconnectedness.
Supercomputers are designed to have thousands of compute nodes work together to run simulations
that mimic reality with incredible accuracy, which requires immense computational power
to achieve the 64-bit double precision calculations necessary
for reliable scientific results.
Mathematically, real numbers are infinite precision.
In a computer, you have to choose some finite precision.
The fidelity with which you're representing that infinitely precise number in the computer
is limited by the number of bits you devote to it.
And so getting an accurate answer depends on how many bits you use for it.
Think about it this way. When you write down the number 3, you're not just
writing 3. In precise terms it's 3.00000000, continuing infinitely.
Computers, however, have finite precision,
meaning they have to cut off those trailing zeros at some point.
In scientific computation, every extra decimal place of precision
can be the difference between a simulation that is reliable and one that isn't.
Doing large simulations requires fairly significant precision.
So most of our computations require 64-bit computations.
64-bit precision means each number is stored using 64 binary digits, enabling
the highly accurate calculations needed for complex simulations,
such as those used in nuclear weapons research.
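To make the precision point concrete, here is a small illustrative C++ snippet (a toy example, not lab code): it accumulates the same sum in 32-bit and 64-bit floating point, and the 32-bit result drifts visibly away from the exact answer while the 64-bit result does not.

```cpp
#include <cstdio>

int main() {
    // Add 0.1 ten million times; the exact answer is 1,000,000.
    const int n = 10000000;

    float  sum32 = 0.0f;  // 32-bit: roughly 7 decimal digits of precision
    double sum64 = 0.0;   // 64-bit: roughly 16 decimal digits of precision

    for (int i = 0; i < n; ++i) {
        sum32 += 0.1f;
        sum64 += 0.1;
    }

    // The 32-bit sum is off by a large, visible margin; the 64-bit sum is not.
    std::printf("32-bit result: %f\n", sum32);
    std::printf("64-bit result: %f\n", sum64);
    return 0;
}
```

Small rounding errors like that one, compounded over billions of operations in a long-running simulation, are why most of the computations described here require 64-bit arithmetic.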
In national security, where the smallest margin of error can have critical consequences, close
enough simply isn't an option.
This is why Lawrence Livermore National Laboratory and the NNSA have been relentlessly focused on
hitting that exascale computing target.
But achieving this level of technological advancement requires more than just improving
current capabilities.
It requires holding a vision so bold and far-reaching that the path forward may not always be immediately
clear. Back in about 2008, there was a seminal paper
that was released by DARPA,
the Defense Advanced Research Projects Agency,
foreshadowing the difficulties
that the computing industry was gonna have
reaching this exascale target.
So for decades, computers had been getting
a thousand times faster,
approximately every decade or so,
built first on just smaller transistors, the more things you could
pack on a chip, then on parallel computing, putting more of these chips
together into a single system. But getting to exascale, very early on, almost
15 years ago, it was recognized this was going to be a challenge like we hadn't addressed before.
So we started thinking about these systems long before we decided what the systems would
actually be because we knew there was going to be a lot of research needed to be able
to utilize these systems effectively.
This DARPA report foreshadowed the immense challenges on the horizon.
It highlighted key issues
like power, memory, and system resiliency.
Fast forward ahead to about 2015-2016, the United States funded something called the
Exascale Computing Project, which was really about the research needed to develop the software
and the applications that would ultimately run on these machines. And it also funded some research for companies like AMD and Intel and Nvidia and HPE,
big players in the supercomputing industry, to help them develop technology faster so that we
could deploy those at the laboratory sooner for our mission. So all this was happening about six, seven years ago.
And at that time is when we began thinking about
what's our next system going to be?
What's the NNSA's exascale system going to be?
As the exascale computing project took shape,
it became clear that the path forward
would require new solutions for both software and hardware.
One of the original challenges we had when thinking about
exascale computing was really around the power requirements
of these computers.
Historically, the earliest supercomputers were just
the earliest computers, right?
Over time though, they became dominated by something called
vector systems, so that's a way of computing a bunch of
things at the same time in parallel.
There's kind of limitations on that. And so over time, we moved to networked systems of
CPUs, which is the standard way of what people use in their laptops. And so for a long time,
we were building systems with CPUs.
That's how for decades, we've been getting faster and faster performance on these supercomputers.
But if you drew sort of a straight line on where we knew technology was going in the
late 2000s out to 10 years later or so, the amount of power it was going to require to
field one of these systems, we were going to have to think about putting a nuclear reactor
next to the building because it was in the hundreds of megawatts, which the operational costs for that were more than
the Department of Energy even was willing to accept.
And so a lot of the initial challenges and a lot of the initial research was how can
we continue to ride this wave of improved computer performance without expanding the
amount of energy and power that's
going to be required.
The power challenge was immense.
Exascale systems like El Capitan would require a completely new approach to energy efficiency,
pushing computing experts to explore new ways to design and build these machines.
In order to get more parallelism, we moved to processors that are
used to drive the graphics on your screen, so GPUs. In 2018, Lawrence Livermore National
Laboratory launched Sierra, a groundbreaking supercomputer that combined CPUs with GPUs,
making it one of the first large-scale systems to use this integrated heterogeneous approach.
Sierra delivered 125 petaflops at its theoretical peak, roughly one-eighth of the computational
performance of exascale.
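The "one-eighth" comparison is just the unit conversion:

```latex
\[
8 \times 125~\text{petaflops} = 1000~\text{petaflops} = 1~\text{exaflop},
\qquad\text{so}\qquad
125~\text{petaflops} = \tfrac{1}{8}~\text{exaflop}.
\]
```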
Part of what we were able to do between the community and our vendor partners like folks
at Nvidia and AMD and Intel were to make these graphics processing chips
suitable for scientific computing.
It was really scientific computing and partnership
with companies that helped us recognize that,
yes, we could do this.
This could become the basis
for the next generation of supercomputers.
And it's going to be something like that technology
that's gonna be required to get us to exascale computing
in a power budget that we can manage.
Sierra's design was a huge leap forward,
but the shift to GPUs introduced a new challenge.
Many of the existing codes weren't built to run on GPUs.
These codes had been designed for CPU-based systems,
and adapting them wasn't a simple task.
These aren't just little codes
that you can rewrite over and over again.
They're big codes.
There are sometimes millions of lines of code
coming together.
So to make these big shifts in algorithmic type
takes a lot of upfront thought and research to make sure it has the
payoff that we need.
Imagine being tasked with translating a complex manual into a different language.
Except this manual isn't just a few pages.
It's millions of lines long.
And every detail is critical. Even the smallest error could derail the entire process.
This was the challenge Lawrence Livermore National Laboratory's developers
faced when adapting CPU-based codes to run efficiently on GPUs. To overcome
this, the team implemented RAJA and Umpire,
coding tools that simplify and streamline the process of adapting and using the codes.
These tools, first used for Sierra, sped up the work for El Capitan dramatically,
reducing code implementation time and pushing the exascale transition forward.
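To give a flavor of what a portability layer buys you, here is a minimal sketch written against RAJA's public C++ interface (the function, array name, and loop are invented for illustration, and real multiphysics codes are vastly larger): the loop body is written once, and only the execution policy changes between a CPU build and a GPU build.

```cpp
#include "RAJA/RAJA.hpp"

// Illustrative only: scale every entry of a field array by a constant factor.
void scale_field(double* field, double factor, int n)
{
  // The physics of the loop is expressed once, as a lambda...
  auto body = [=] RAJA_HOST_DEVICE (int i) { field[i] *= factor; };

  // ...and the execution policy decides where it runs.
#if defined(RAJA_ENABLE_HIP)
  // GPU backend (e.g., AMD GPUs), 256 threads per block.
  RAJA::forall<RAJA::hip_exec<256>>(RAJA::RangeSegment(0, n), body);
#else
  // Plain sequential CPU fallback.
  RAJA::forall<RAJA::seq_exec>(RAJA::RangeSegment(0, n), body);
#endif
}
```

Umpire plays the analogous role for memory: it gives the code one way to request host or device allocations rather than hard-coding a particular vendor's API.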
The next pivotal step toward a fully functional exascale machine came with the introduction
of AMD's next generation processors, known as APUs, or Accelerated Processing Units.
These chips combine both CPUs and GPUs into a single hardware package, making them more
efficient and easier to program. This invention marked a major leap forward in computing technology, not just for the
lab but for the world.
The APU was an innovation that AMD came up with, one of our partners in El Capitan, to
basically integrate the idea of the CPU and the GPU all on a single package. Sierra had
GPUs in it, but they were really completely separate from the CPUs. They
had separate memory, and one of the complications of using those systems,
and using accelerated computing in general, was that the programmer now had
to make some explicit decisions about when to move data between the CPU and
the GPU and when to transfer control of the program from one type of device to another.
The APU now gets rid of one of those complications.
That makes the system more efficient from an energy standpoint,
because you're not doing those useless movements of data,
and it also makes it easier to program, because you
don't have to program that movement of data. That's technically a big advantage.
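A rough sketch of the difference being described, in generic HIP-style C++ (illustrative only; the kernel and function names are invented, and this is not El Capitan's actual software): on a discrete-GPU system like Sierra, the programmer allocates separate device memory and copies data in and out around every kernel, while on a unified-memory APU the CPU and GPU can, in principle, work on the same allocation.

```cpp
#include <hip/hip_runtime.h>
#include <vector>

// Illustrative kernel: scale an array on the GPU.
__global__ void scale(double* x, double a, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= a;
}

// Discrete GPU (Sierra-style): separate memories, explicit copies each way.
void run_discrete(std::vector<double>& data, double a)
{
  const int n = static_cast<int>(data.size());
  double* dev = nullptr;
  hipMalloc(&dev, n * sizeof(double));
  hipMemcpy(dev, data.data(), n * sizeof(double), hipMemcpyHostToDevice);
  scale<<<(n + 255) / 256, 256>>>(dev, a, n);
  hipMemcpy(data.data(), dev, n * sizeof(double), hipMemcpyDeviceToHost);
  hipFree(dev);
}

// Unified-memory APU (conceptually): CPU and GPU share one address space,
// so the staging copies above disappear.
void run_unified(std::vector<double>& data, double a)
{
  const int n = static_cast<int>(data.size());
  scale<<<(n + 255) / 256, 256>>>(data.data(), a, n);
  hipDeviceSynchronize();
}
```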
El Capitan is made up of tens of thousands of these APUs, each one linked
together to create a vast system capable of calculations on a scale never before
seen. The way these large supercomputers are assembled is you have
the basic unit of compute that's called a node and a node in our case is actually
already made up of multiple APUs. Then you take nodes and you assemble those
into blades and then blades get assembled into like a commercial grade
refrigerator sized rack that
sits on the floor and weighs a lot.
And then we assemble those racks together on the order of about a hundred of them for
El Capitan to make the entire system.
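As a mental model of that hierarchy (the per-level counts below are placeholders for illustration; only the roughly one hundred racks comes from the description above):

```cpp
// Illustrative containment hierarchy only; the per-level counts are
// placeholders, not El Capitan's actual configuration.
struct APU    { /* CPU cores, GPU compute units, and shared memory on one package */ };
struct Node   { APU   apus[4];    };  // a node is already made up of multiple APUs
struct Blade  { Node  nodes[2];   };  // nodes are assembled into blades
struct Rack   { Blade blades[64]; };  // blades fill a refrigerator-sized rack
struct System { Rack  racks[100]; };  // on the order of a hundred racks form the machine
```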
One of the biggest challenges with exascale computing wasn't just designing the machine, but building the infrastructure to support it.
At Lawrence Livermore National Laboratory, they had to overhaul the entire electrical
and cooling infrastructure, doubling their capacity to handle the immense energy demands
of El Capitan.
A new utility yard was built, supplying enough energy
to power tens of thousands of homes,
just to ensure the supercomputer could run at full capacity
without interruption.
As part of something called the Exascale Computing Facility
Modernization Project, we deployed a significant increase in the electrical infrastructure
to our main data center.
And so that took us from 45 megawatts to 85 megawatts.
So we're essentially 2x the energy that we can deliver to the computer floor.
Now El Capitan is not gonna use all 85 of those megawatts
but it's gonna use somewhere around 30 of those
at any given time.
That power is enough to supply around 30,000 homes.
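That comparison is back-of-the-envelope arithmetic, assuming an average household draw on the order of one kilowatt:

```latex
\[
30~\text{MW} = 30{,}000~\text{kW}
\quad\Longrightarrow\quad
\frac{30{,}000~\text{kW}}{\approx 1~\text{kW per home}} \approx 30{,}000~\text{homes}.
\]
```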
The extra energy capacity of the Livermore Computing Facility
ensures they can sustain existing supercomputers alongside El Capitan. Despite its substantial energy requirements, El Capitan
is one of the most energy efficient supercomputers ever built in terms of performance per watt.
But all that power generates heat. A tremendous amount of it. Liquid cooling is required to keep these systems from literally melting because they run at
sometimes over a thousand watts.
So you think about how hot a hundred watt light bulb can get.
Magnify that now by 10, 20, 50 times.
That's how much heat you're trying to dissipate in a very small package in one
of these nodes of a super computer.
And then multiply that by the tens of thousands of nodes that make up these systems.
That's a lot of heat that you've got to try to make sure you can get rid of.
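For a rough sense of scale (illustrative numbers, not an official specification): a kilowatt-class node multiplied across tens of thousands of nodes means the waste heat is measured in megawatts.

```latex
\[
\sim\!1~\text{kW per node} \times \sim\!10{,}000~\text{nodes} \approx 10~\text{MW of heat to carry away, continuously}.
\]
```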
And liquid cooling is the idea that you bring in cool water, you then run water across cold
plates, it dissipates some of the heat away, goes out the other side of the rack, and then eventually, through heat
exchangers, goes out to a cooling tower, and then that water is cooled and the cycle repeats.
At full operation, El Capitan will cycle through 5 to 8 million gallons of water every 24 hours
to keep its systems cool and running efficiently.
Building El Capitan required more than just cutting-edge technology. It took a
coordinated effort across multiple organizations. Years of collaboration
between Lawrence Livermore National Laboratory, the Department of Energy and NNSA,
and private industry were essential in overcoming the immense technical challenges of exascale computing.
Back around the time when we were first starting to talk about exascale and recognizing the challenges,
we created a term that stuck called co-design, which was really the idea that we're going to have to take
this from a standard customer-client relationship with these companies to something much more
collaborative. We need to understand more about the long-distance roadmaps of these companies so
that we can begin to angle our research and our applications development toward what their roadmaps
are. But probably
more importantly, these companies really need to understand where the bottlenecks
are in our applications so that they can think about how to design their
hardware in ways that are going to best address our concerns and our needs.
Co-design emerged as a way to blur the lines between hardware and software
development, bringing together experts from both fields to work side by side.
This deeper level of collaboration, often involving clearances for security-sensitive
work, allowed teams to quickly identify and address the most critical challenges, speeding
up progress in ways that wouldn't have been possible with a standard customer-supplier
relationship. Being innovative is really critical to doing new things, taking new approaches.
But if you have a completely new approach all on your own, you're not going to get much done
because big things take lots of people. Together, they've built something truly extraordinary.
El Capitan is not only faster and more powerful, but is
also able to tackle problems that were once deemed intractable. So there's a
class of problems that are big 3D problems that we want to run at high
resolution. We've been studying this class of problem for years, since the beginning of ASCI.
And the first time we took it out for a spin, it took like half of our biggest supercomputer
and it took over a month to run it.
And in 2015, we checked again and it took maybe 20% of our supercomputer and it took
a little less than a month. Then we took that same calculation out for a spin on Sierra and all of a sudden it took 3.3
days. Whoa! Okay, that is like game-changing, right? Think about it. Think
about what you can turn around in 3.3 days as opposed to a month. Ten
different types of those calculations.
Think about if you're designing something,
how that changes what you can do, right?
It's just night and day.
I get 10 shots as a designer to make a choice in a month.
That's incredible.
Oh, and by the way, that 3.3 days
took less than 10% of Sierra.
That was like, this is going to be tractable on El Capitan.
We need to continue pushing.
We need to get to higher mesh resolutions and do a better job with the physics.
And that is our goal.
Our goal is a reasonable turnaround time for a medium to high resolution, full physics calculation in three dimensions
that we have never been able to do before.
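The "ten shots" figure above is simply the ratio of turnaround times:

```latex
\[
\frac{\sim\!30~\text{days}}{3.3~\text{days per run}} \approx 9\text{--}10~\text{runs in the time a single run used to take}.
\]
```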
It's an open question whether or not El Capitan will be the fastest computer in the world
for one year, two years, maybe three years.
We can't predict that right now just because everybody's
working always to build faster and faster computers.
El Capitan's achievement as the world's fastest supercomputer isn't just about speed. It's
about what that speed can accomplish. It signals a new era of computational capability that
will tackle some of the biggest challenges facing humanity,
whether it's understanding complex physical phenomena or advancing national security.
We have a series of problems that are just going to challenge the entire scale of the
machine. There are problems, I can imagine, they're big problems, they're things that
no one has ever really dreamed of trying. We have laboratory directed research and development projects
that have put things in place where like,
if we could do something massive,
we could solve this problem.
And there are a few of them that over time using El Capitan,
we will get the codes aligned and arranged
and go after those problems as well.
They're probably not gonna be the first thing we try,
but over the life of the machine, we will take a shot.
I'm very certain of that.
El Capitan's journey did not begin yesterday,
and it won't end tomorrow.
Decades of planning, innovation, and collaboration
have led to this moment.
And now, even as it comes online,
the lab is already looking to the future.
We always are planning ahead as much as possible and we're already starting to think about what the next system in the 2030 timeframe is going to be.
And we want to make sure we can begin to stand it up and we're anticipating it's also going to be a very power hungry system while still keeping El Capitan running,
because it will of course
still be in use for the mission during that time. So we sized our facility to be able to support
multiple exascale systems at one time during that overlap period when we're at end of life for one
system and beginning of life for the next system. So one of our big challenges was making sure we have the infrastructure, the
power and the cooling required to field these systems.
Building world-changing technology requires looking far ahead, anticipating the limits
of today's capabilities and constantly pushing the boundaries.
The team at Livermore isn't just solving problems.
They're planning for challenges that may not even exist yet.
What's the next big thing?
That is the billion dollar question.
I think taking a step back and looking at the numerical methods that we can apply on
these machines and looking at different ways to run sensitivities or to
understand how we not just get one solution but a solution plus ensembles of answers or getting
gradients of the solution is probably what we should be doing once we get through the El Capitan challenge.
Hardware advances are uncertain.
The machine learning market is driving hardware changes
that are complex for our codes to deal with.
They don't need the precision we need,
and so the hardware that vendors are creating
takes that into account to sell to that market.
So that's going to be a challenge for us.
We're looking at both technology slowing down and prices going up
and very worried that for the same dollars we're not going to get a lot more compute
than we have in El Capitan.
So we're thinking about how can we make the systems get more work done for the same compute capability.
Because of that, the future of the architectures is unclear.
So for the code teams, I think thinking about new types
of calculations we can do, new types of numerical methods
we can employ, because we do have huge compute, is what we'll do in the short term.
The real power of El Capitan isn't just in the numbers it can crunch today, but in the new frontiers it opens for tomorrow.
This is a signal of the United States' continuing leadership in high-performance computing, that we can continue to do something that is the best in the world.
And that alone makes El Capitan interesting.
And that alone is one reason to be proud of what we're doing here
at our national laboratories with U.S. industry to do something that is the best in the world. But of course, a fast computer
that doesn't actually solve any of humanity's problems, that's not terribly interesting.
So it's not enough for us to just be the fastest in the world. It has to be for a purpose.
As the team at Livermore has shown, reaching for the pinnacle is just the beginning.
The pursuit of what comes next, anticipating future limits and pushing past them.
That's the enduring mission. Thanks for listening.