Big Compute - How Supercomputing Touches the World(s)
Episode Date: December 22, 2020
From the case on your phone to rovers on Mars to vaccines -- supercomputers have played a role in just about everything around us. And many of those projects have rolled through... one of the biggest supercomputing centers in the world -- the Texas Advanced Computing Center (TACC). In this episode, we talk to undercover superhero Dan Stanzione, executive director of TACC, about the many discoveries and innovations his supercomputers have had a role in, and what it’s like to oversee it all. Whether it be Rommie Amaro’s recent COVID-19 breakthroughs or assisting emergency responders after a hurricane, Dan and the TACC are making a real difference behind the scenes of society.
Transcript
If this podcast is getting too serious, I need to stop.
Hello, everyone.
I'm Ernest DeLeon.
And I'm Jolie Hales.
And welcome to the Big Compute Podcast.
Here we celebrate innovation in a world of virtually unlimited compute, and we do it
one important story at a time.
We're talking about those stories behind scientists and engineers who are embracing the power
of high-performance computing to better the lives of all of us.
From the products we use every day to the technology of tomorrow, high-performance computing
plays a direct role in making it all happen, whether people know it or not. So for this episode, we're going to do something a little bit
different than we usually do. Yes, because we had the opportunity to talk to a who's who
in supercomputing. And rather than just listen to my voice explain a topic today, we thought that
we'd let our expert do most of the talking.
Yeah, considering he's worked with high-performance computing projects in
every category and doesn't have just one focus.
Right. People we've talked to recently specialize in an aspect of COVID,
or they specialize in tsunamis. Whereas Dan, since he's a who's who in supercomputing,
has pretty much worked with all of that.
Though, as I understand it, before we even get going, Ernest, you have something you
want to ask me first?
Yes.
Jolie, do you love Mars?
Mars like the planet or the candy bar?
Or wait, is that a candy bar company?
I think it's actually both.
Now that you bring that up.
I was primarily referencing the red planet.
Do I love Mars?
I've never been.
I hear it's wonderful in the winter.
I'm sure it would be really cool.
So yes, yes, Ernest, I love Mars.
Tell me more about Mars.
I think that's where we're going.
Have you seen the movie The Martian?
I have seen the movie The Martian.
This will come as quite a shock to my crewmates.
And to NASA.
And to the entire world.
But I'm still alive.
Surprise.
So let me tell you a little story.
Okay, I like stories.
In the movie The Martian, the protagonist Mark Watney ends up stranded on Mars due to an unfortunate series of catastrophic events on the Martian surface.
Who's the actor that played that?
Wasn't it Matt Damon?
It was Matt Damon.
That's right.
Cool.
Sorry to interrupt.
No worries.
Just got to get that picture in my head, especially if it's of Matt Damon.
And Andy Weir is the author of the novel, if anyone wants to read it.
I've read the novel. It's excellent.
Oh.
So I won't ruin the movie for you, or the novel, which I highly recommend you read before you see the movie.
Oh, too late.
But let's just say that Mark Watney is in a pretty bad spot.
Now, you can either accept that, or you can get to work.
The telecom equipment that he has is unusable at a certain point, and his communication with Earth is now dead silent.
Oh, that's so uncomfortable.
Yes. After some time, Mark Watney realizes that there is one other place
that has telecom equipment he could maybe use. And I emphasize the maybe here. So he sets out
to find NASA's original Mars rover, Sojourner, but more importantly, the lander named Pathfinder.
Pathfinder has the telecom equipment Mark needs to reestablish communication with Earth.
Unfortunately, the original Pathfinder and Sojourner team ceased communicating with Earth back in September of 1997.
Is that a real thing?
That's a real thing.
And, of course, this movie takes place even in the future from now.
So you're talking even more decades between when this original thing went offline and, you know, the movie timeline. And we'll get into that here in a second.
Mars rovers have a long and storied history, starting with Pathfinder and Sojourner and continuing with Spirit, Opportunity, and Curiosity.
Breaking news this morning: Curiosity is a smashing success. The NASA
Mars rover touched down this morning right there on the red planet,
a daring mission with more to come.
The landing capped a journey
that lasted more than eight months
and covered more than 350 million miles.
The original Pathfinder and Sojourner team
had a total service life of 83 sols.
Let me interject here and note that a sol
is equal to one rotation of Mars, similar to an Earth day.
On Mars, however, that rotation takes approximately 24 hours, 39 minutes, and 35 seconds by Earth time measurements.
Wait, wait, wait, wait, wait.
Because I know sol is sun in Spanish.
That's what I know of a sol.
Okay, so a sol is a measurement of time?
Right.
And it's equal to one rotation of Mars.
Right.
So like here on Earth, we call that a day or a day-night cycle, right?
Yes.
24 hours.
However, Mars takes-
A little bit longer.
A little bit longer.
Okay.
Right.
So 83 sols is like 83 days and change.
Exactly.
Okay.
So-
That's not very much.
No, and there's a reason for that, right?
So the design of these things has evolved over time.
The Spirit rover landed on Mars on January 4th, 2004
and sent data back to NASA for over six years.
Whoa.
It sent its last communication on March 22nd of 2010.
And that was the first Mars rover?
No, the first was Sojourner.
This is the second one, yes.
Gotcha.
The Opportunity rover landed on the Martian surface
on January 25th, 2004,
and thus far it has surpassed all previous service lives
at 5,352 sols,
which is approximately 5,498 Earth days, which is almost 15 Earth years or eight
Martian years.
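Those sol-to-Earth-day conversions are simple arithmetic. Here's a minimal sketch, assuming the sol length quoted above (24 h 39 min 35 s); the function name is just illustrative:

```python
# Convert Martian sols to Earth days, using the sol length quoted above:
# one sol is about 24 hours, 39 minutes, 35 seconds of Earth time.

SOL_SECONDS = 24 * 3600 + 39 * 60 + 35   # 88,775 seconds per Martian sol
EARTH_DAY_SECONDS = 24 * 3600            # 86,400 seconds per Earth day

def sols_to_earth_days(sols: float) -> float:
    """Return the Earth-day equivalent of a span measured in sols."""
    return sols * SOL_SECONDS / EARTH_DAY_SECONDS

# Pathfinder's 83 sols is "83 days and change" (about 85.3 Earth days),
# and Opportunity's 5,352 sols comes out near the ~5,498 Earth days quoted.
print(round(sols_to_earth_days(83), 1))
print(round(sols_to_earth_days(5352)))
```

Each sol is only about 2.7% longer than an Earth day, which is why sols and days stay close until a mission runs for years.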
What?
Yes.
How come it lasted so long?
Well, again, the technology is getting better and better as these things evolve.
When this little rover landed, the objective was to have it be able to move 1,100 yards and survive for 90 days on Mars,
90 sols. And instead, here we are 14 years later after 28 miles of travel, and today we get to
celebrate the end of this mission. Opportunity sent its last communication to Earth on June 10th, 2018. So
you're seeing a common thread here. The original three units are no longer in operation. The
Curiosity rover landed on Mars on August 6th, 2012. And as of December 2020, when we were recording
this podcast, over eight Earth years later, the Curiosity rover is still in service and still sending data back to Earth.
Interesting. So Curiosity is actually in use today.
And then the other rovers are just hanging out on Mars, not doing anything now?
Right. They are currently defunct.
Huh. I guess I always pictured us taking them back to Earth, but that wouldn't make sense to do so.
Right. We can't retrieve them.
Yeah.
Now, keep in mind that Curiosity is not the longest serving rover.
It's eight years and Opportunity was almost 15 years in Earth time.
So a little over half, but it's still in service.
Mm-hmm. So imagine for a minute, if NASA had the technology
to extend the lifespan of the rover significantly, right? So not eight years, not 15 years, but let's
say 50 years or a hundred years. Instead of Mark Watney risking his life to go find a dead rover
and lander, NASA could have located the nearest functional rover to Mark and redirected
it to go to him before he lost his original telecom equipment as a precautionary measure.
Yes. So extending the distance traveled and overall service life of future Mars rovers
is just one of the many problems that NASA scientists are trying to solve
with the help of supercomputers.
Oh, I see where you're going with this.
Yes.
Okay.
More specifically, the supercomputers at the Texas Advanced Computing Center, or TACC for short.
Much of the computational heavy lifting is done within supercomputers at the TACC, but NASA is looking into onboard high-performance computing for rovers, and the TACC is helping lead the charge.
This isn't the first time the TACC has surfaced in recent stories about supercomputing, some on this very podcast.
I was going to say, I remember talking about the TACC. We talked about it on one of our COVID episodes.
Yes. So we wanted to delve a little deeper into what the TACC is and why it is so important.
Enter Dan. Is it Dan Stanzione? Is that how you say your name?
That is correct. There's a broad family split between Stanzione, Stanzione, and the Italian
Stanzione, but I'm a Stanzione. Yes. I love it when there are family disputes on how to spell
or pronounce last names.
My name's been butchered my entire life, so one day I'll share that story.
Suspense.
Dan is a pretty interesting guy.
When asked about what he does for fun outside of work, he said...
There's stuff beyond work in life? I wasn't aware of that.
I know, isn't that crazy?
And you know, it's election day, so today it's reloading election results over and over again.
Hitting the reload button on various and sundry websites over and over again.
I bet our listeners can tell when we recorded this with Dan.
But no, I'm pretty committed to this stuff.
Spend some time on a boat here and there when I'm not doing this or, you know, living in Texas.
I obviously watch a lot of football.
What is your title and what are your responsibilities?
The titles are long and sound more important than they probably are, but I'm the Associate
Vice President for Research at the University of Texas at Austin. And at universities,
if you're an Associate Vice President, that means you have a good parking spot. And that's really
the only purpose of the title. But then beyond that, I'm the Executive Director of the Texas
Advanced Computing Center. We're in a center here of about 180 people who are involved in building really big computers
and then finding ways to use large computers to do science and engineering work for both
the University of Texas and for other people doing unclassified research all around the
country and the world.
Having a good parking spot definitely means something.
I didn't have one at my last company. And let's just say I had a couple people hit and run on my car because it
was parked on the street. That was fun. Man, that sucks. I know. The TACC is doing some really great
work in the world today. But before we delve into that, what does a day in the life of Dan look
like? I'm sort of a scientist who's never
had to pick a field. One day we might be working on hurricane forecasts and impacts on buildings
and structures and doing simulation for that. The next day it's genomics and drug discovery.
When you're involved in computing, you're involved in sort of every facet of science and engineering
around the world. A lot of it is the less exciting part of keeping everybody paid and writing
reports, but the fun parts are getting to work with the really cool scientists that we work
with all around the world and getting to design and build the next generation of the world's
biggest machines. That is so interesting. I mean, I'm curious, what did you study to land you in
this place? Because it sounds like now you're probably very well versed in everything from physics to chemistry
to, you know, life sciences. You probably know at least a little bit about all of it at this point.
A little bit is probably the functional phrase there. So I have enough to talk to people in
those things. By training, I'm an electrical engineer as an undergrad for my bachelor's
degree. And then my graduate degrees were actually in computer engineering. I wanted to
design and build really fast computer chips. The sort of what I do now was not a thing
that really existed when I was going through school. And I just sort of got sucked into it.
By the time I was doing my PhD, I was working with a genomics center and the sort of chemistry
and material center being the computing guy for the scientists. And, you know, it started out
just something I did in grad school and then started doing projects with the folks here at TACC and started doing more and more.
And I came over here as the deputy director in early 2009 and took over as director in 2014.
It's always interesting to see where people land and what they go through to get to where they are
in their careers, because it's it's never what we'd think.
Never. Even in my case where it's very close, it's not, you know, what I originally thought.
It's crazy. Like my mentor when I was getting my master's in film.
Peter Jackson.
How did you know? He does have an Academy Award in screenwriting for writing The Sting.
But the whole way that he launched his career is he got in a car accident
with somebody who worked for a film company. And then he ended up giving a script to that person
and that launched his career. So he was a great mentor. But when it came to actual like, how do
you jumpstart your career advice? It was not very useful because I couldn't figure out who to crash
into. Yeah, it was literally luck. Yeah.
Dan also spoke about the evolution of supercomputing from something that only government and military really had access to because of the cost to where it is today.
This just wasn't a thing.
You know, the National Science Foundation started investing in this in sort of the mid 80s.
The rise of microprocessors and then the cloud has just made it much more accessible to so many more researchers and types of researchers that these centers have sort of grown and spread.
So it's been an interesting time.
What do you enjoy most about what you do? You know, we get to work with some really fantastic scientists around the world and do fascinating, fascinating things.
You know, I think in almost any job, the thing that is most rewarding is the relationships and the people that you get to deal with.
Graduate fellowship deadline for NSF was last week, and I was writing reference letters for just some remarkable young people who've done amazing things we've gotten to work with.
And, you know, you learn about their lives and stuff like that.
But at the same time, we get to do really impactful science.
And I mean, we're not the ones necessarily out there doing it, but we're making sure it can be done.
We've assisted on Nobel Prizes. We've assisted on, you know, really some groundbreaking discoveries, some that are more sort of theoretical and basic science, some that are just fun, like better spaceship engines.
And, you know, this year it's been a huge amount of work around COVID vaccines and the structure of the virus and things that are going to have in the very short term, you know, real impact on people's lives. And I get to play with really big computers and show them off to thousands of people
every year. So there's lots of things that you enjoy in this kind of job, but it's got a huge
variety. And like I said, in the end, it's the people that make it worth doing. It's always about
the people we work with, isn't it? Hands down, absolutely. But for our listeners, why don't we go into more
detail on what the TACC is? Like, what do they do? Good idea. So TACC, or the Texas Advanced Computing
Center, we are, in my humble opinion, the best of the academic supercomputing centers and one of the
largest in the world. We run the largest university-based supercomputer in the world,
and we run a bunch of other computing and data systems for folks, but really we exist to help people do scalable things for the challenges
that we face in science and society, right? If you're working on a problem and almost every
scientific problem has a computational piece now, whether it's simulation or data or AI that you're
dealing with, and eventually you're going to scale off your laptop and that's where we get involved. And it's the hardware and the people that make that happen.
Got it. And I already know the answer to this question, but I'm going to ask it anyway for
our listeners. Where are you located? We are here in Austin, Texas. We're part of the
University of Texas at Austin. We actually live at the J.J. Pickle Research Campus. So we're about
eight or nine miles north of downtown Austin, but we really serve users all around the country and around the world.
This might be a dumb question, but are the supercomputers actually physically located
there as well?
We are essentially one of the cloud providers for academic supercomputing.
So most of our users don't ever actually see or touch the machines, but we have the actual
physical data centers and physical machines in the building I'm sitting in right now where
we can supply about 10 million watts of power to keep them going. How and when was the TACC founded?
It was founded in 2001. And we really got on the map when we won the Ranger system,
which was one of the big National Science Foundation systems that was the number four
machine in the world when it first came up. And that was in 2008. And that's when TACC sort of
became one of the real leaders in providing things, not just here at UT, but around the country.
Now, Dan, we interview a lot of undercover superheroes on this podcast, and we think you may know one, Rommie Amaro.
Oh, yeah. Rommie has been our biggest user the last six months because of all of her COVID work.
And you see her enthusiasm and energy and her ability to change the world. Who wouldn't want to help people like that?
That's mighty big praise from the executive director of the TACC.
Yeah, I could totally understand that. I mean, after talking to her, we were like, that woman is amazing. I totally agree. And I mean, props to you for helping her out and getting
the research off the ground as quickly as you did. I mean, we did the math on the show and the amount
of supercomputing resources,
like you said, that she was using,
it was quite the chunk of compute power.
And we started those runs at pretty large scale
the last week of February.
That was our first COVID-related research project.
And by mid-March, it was the largest one we were running.
So, and there's been 50 something others since then.
But we got started at scale quickly, largely because again, it's back to the people in the relationships, right? I already knew
Rommie and I already knew what she did and she's used our systems for years. So she knew how to
get on and be effective right away doing what we're doing. And, you know, rather than going
through some complicated bureaucratic process, you know, when COVID was becoming what was obviously
going to be a big thing. She sent an email.
And she said, hey, Dan, I don't know what you guys are up to, but, you know,
this virus, this virus is looking pretty serious.
I think we probably need to do something with it.
And I said, I know you, I know the work you're doing is great.
So we can make that happen today.
It was amazing.
I mean, it was amazing to have that level of support,
but that was really key in sort of getting time to solution very quickly for this effort.
We were off and running. And, you know, again, we've done 60 something other projects with different people since in the COVID space.
But you get off the ground quickly because the infrastructure is in place and the relationships are in place and the knowledge and the training are in place to make all of this stuff happen.
And that's why we could start so quickly on that work.
And I was so grateful for them. The output of Rommie's work has been the
input to some other work more upstream in sorting through billions of possible compounds for good
drug candidates. It was done by some people at Argonne National Labs and the University of
Chicago and Rutgers and University College London, and just a whole bunch of other people around the
world. Took that sort of basic structural work that Rommie did and built it into this AI-enabled vaccine discovery pipeline. The 4 billion compounds we
started with, they handed about 30 off to medicinal chemists to start fabricating and start clinical
trials on by July or August. And so getting that work done early was key. And she's kept discovering
new things. I'm sure she told you about, you know, figuring out that the spikes on the coronavirus sort of wrap themselves in a sheet of sugar and then, you know,
they're going to get into a cell. Right. And all of that. And that's what helps it hide from the
immune system because it just looks like sugar molecules. Right. And so, you know, none of that
is stuff that we knew in January and we know it all now. Also, you said 60 different projects
that TACC is involved in when it comes to COVID-19
right now. Did you say 60, 6-0? Yes, I think that's about right. It changes a little bit every
week. It might be 59, it might be 63, but we're in that neighborhood of different projects we've
supported. Some of them are at the sort of structural level, like Rommie, where we're working
at the molecular level, you know, and that's 20 or so projects, 25, 30 in that realm. We have another
10 or 15 that are more on the sort of human side of that, right? You know, you're modeling
societies or using cell phone data to figure out how much people are interacting and where,
but there's other pieces of that, right? You know, what are the causes? What are the projected
spread? How do the aerosols spread on a plane? You know, all sorts of things like that. And then we
have some that are sort of in the middle, looking at the genomic scale, figuring out the sort of evolutionary history of the virus,
which helps, you know, if you know what the virus is, it's related to, that gives you some insights
and treatments, but also the people it infects, right? We know beyond any doubt at this point
that it affects different people differently. You know, there's a lot of sort of preexisting
conditions that feature into that, but a lot of it is also genetic, right?
You know, what strings of genome do you have that somebody else doesn't have that make you more or less vulnerable, right?
And the more we can understand that, again, we might be able to isolate and build therapies based on that or figure out who actually needs different kinds of vaccines to do a more personalized approach.
And we actually started in March with a number of the other supercomputing centers and the cloud providers through the White House and the Office of Science and Technology Policy.
We put together the COVID-19 HPC consortium.
Yeah. And so at this point, more than half of our projects have come through that mechanism where they write to the consortium and then they get stuck here at TACC or at the San Diego Supercomputing Center, or maybe on Amazon or Microsoft. So it's interesting to think about how
you have multiple projects that are using data
that was collected from other projects.
So like Rommie Amaro's work, right?
A lot of the data that she's been able to gather
is now being fed into the supercomputers
for different research.
And I mean, that's pretty cool.
It feels like the supercomputing world now
is allowing us kind of this time machine forward.

From supersonic jets to personalized medicine, industry leaders everywhere are accelerating innovation with unprecedented speed and efficiency by using Rescale, the intelligent control plane that allows you to run any collaboration on hybrid cloud. Rescale empowers IT leaders to
deliver high-performance computing as a service with software automation with incredible security,
architecture, and financial controls. And as a proud sponsor of the Big Compute podcast,
Rescale would especially like to say thank you to all of the scientists and engineers out there who are working to make a difference for all of us. Rescale, powering science and engineering breakthroughs.
Learn how you can modernize HPC at rescale.com slash bcpodcast.
I really love to hear about the great work that scientists and researchers are doing around the world with supercomputers.
And also love to hear about the tech specs of the machines that they are working on.
I remember when we were talking to Dan, things got pretty technical and it was so fascinating.
Absolutely.
So I asked Dan, what are the names, sizes, and the tech specs of the various machines that occupy the TACC?
We have about 15 different production platforms at this point.
The biggest one right now is Frontera, and that's our sort of leadership class system.
It's actually a collection of several different kinds of systems, but the biggest piece is an Intel Xeon-based,
a little over 8,000 compute nodes that can do about 40 petaflops with about 425,000 Xeon cores that make that up.
There's also about 1,000 GPUs attached in various subsystems focused more on the machine learning
side of things. It has about 50 petabytes of fast file systems from data direct networks.
The network comes from Mellanox. We have a 200 gigabit InfiniBand interconnect for it.
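As a quick back-of-envelope check on those Frontera figures, here's a sketch using only the rounded numbers from the conversation (not official specs):

```python
# Back-of-envelope arithmetic on the Frontera numbers quoted above.
# All inputs are the rounded figures from the conversation, not vendor specs.

peak_flops = 40e15        # ~40 petaflops of peak compute
total_cores = 425_000     # ~425,000 Xeon cores
compute_nodes = 8_000     # ~8,000 compute nodes

cores_per_node = total_cores / compute_nodes        # roughly 53 cores per node
gflops_per_core = peak_flops / total_cores / 1e9    # roughly 94 GFLOPS per core

print(f"~{cores_per_node:.0f} cores/node, ~{gflops_per_core:.0f} GFLOPS/core")
```

Numbers in that range are plausible for a dual-socket Xeon node of that era, which is a useful sanity check when specs are quoted from memory.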
And then Dell was our integrator who put all the servers together. Although on the GPU side, we also used Green
Revolution for some cooling systems. We use Cool IT for the water cooled parts to the chips. We're
using very high powered chips. So we just pump liquid directly across them at this point. And
the GPU nodes we immersed in mineral oil. So IBM and NVIDIA and a whole bunch of companies were
involved in doing all of that. But that's sort of
our newest and sort of largest system. It debuted at fifth in the world. It's been about a year and
a half, so it's dropped down to about eight in the world at this point. They age just like any
other computer does, but we'll run that one for another four years or so. Our other large scale
system is Stampede 2, which is also an Intel-based supercomputer with about 6,000 nodes. It has a mix of Xeons and what were called Knights Landing cores, the
Xeon Phis, so it also has around 400,000 cores. It's about a 20 petaflop machine,
about 30 petabytes of disk. Frontera does a few dozen of the largest scale
projects, so people get very large allocations, mostly running bigger jobs.
Stampede's sort of our broader mission machine. It's a couple of years older, but it has more
like 3000 projects on it. So, you know, that one has 15, 18,000 users competing for time on it.
And so those are sort of our traditional supercomputers. We have other machines
that have different missions. Chameleon is our sort of cloud test bed, but that one is where
we focus more on computer science research. We have a whole host of storage systems and data intensive computing systems.
We have some more for interactive use. We have some more for visualization. We try and just
provide that sort of whole computing ecosystem that we think you need for modern science and
engineering. Awesome. So one of the things, not surprising to me, but I know it's been
surprising to a lot of people in this industry, is the rise of the ARM processor and ARM
supercomputers. Like Fugaku in Japan. Like Fugaku in Japan. And I'm curious,
what are TACC's plans right now? What are you looking at in terms of ARM and the future and
kind of the deprecation of the traditional x86 platform over, you know,
obviously a very long period of time. Yeah. So it will not surprise you that we have some of
those chips along with a whole host of other things. And certainly the AMD chips, the NVIDIA
GPUs and other ones. There's a number of interesting things about ARM, but specifically the Fugaku
machine and what my colleague Satoshi Matsuoka has been able to do there is they had a very long-term partnership to really purpose design a chip with Fujitsu for this
sort of big national supercomputer.
So that machine was many years in the planning and design because most ARM chips, the kind
that are in your cell phones and things like that are conventional processors.
They have some differences, but fundamentally they work just like the AMD or Intel processors
you have in your laptops and your servers.
But architecturally, they're the same.
But what makes Fugaku unique is that ARM chip they built with Fujitsu is not only a very nice processor,
but it has a bunch of very high bandwidth memory that is integrated directly on the package.
So you don't have to go off the pins to a separate memory chip somewhere else on the motherboard.
That gives you much higher bandwidth.
Right. So it has the memory bandwidth that a GPU would have.
Interesting. So the RAM is on die.
I believe it is actually stacked on there.
Oh, not necessarily on die, but it's on package.
What does that mean? RAM on die? What does that mean?
That means, like, so in a traditional computer, you know, you have a motherboard, and then you have a CPU, and then you have RAM sticks, and then you have, you know, a hard drive or whatever else.
The new chip from Apple has unified memory and it's on die.
And what that means is the RAM is now in the CPU.
Oh, interesting.
Instead of a stick that you plug in, it actually comes on the CPU. It's inside the CPU.
So now you don't have to have that latency of, like, leaving the CPU socket, going across the motherboard, hitting the RAM stick, and then coming back.
Okay.
And in Apple's case, they put the GPU in the CPU also, so you don't have a separate GPU anymore. Like, everything is all in one. Entire computers in one chip.
That's crazy.
Now, to be fair, they had already been doing this, sort of, in iPads and iPhones, but this was like another evolutionary step, because even on those, I believe the RAM was still separate, but it was soldered onto the motherboard or the PCB.
But now they just rolled it all in so that the one chip handles everything.
But are there any supercomputers that actually have the RAM on die or wouldn't they mostly
be separate pieces?
They would almost all be separate pieces.
But Dan did note right here that in the case of the Fugaku one, it's actually stacked.
So what they did is like they have the CPU at the bottom layer and then they put the
RAM on, they stacked a layer on top of it.
Okay, so it's like 3D instead of it being flat. And what's the advantage of doing that?
It's just extremely fast and available.
Less latency?
Less latency. It's much faster. But then there's a downside in that, uh, you can only fit so much in that space.
Whereas, like, with a traditional computer, you can put a terabyte or two terabytes of RAM on a node.
There's no way you can fit that much on a CPU, whether it's stacked or on die doesn't matter.
Thanks for letting me know.
And I remember you asking him that.
And I also remember not quite understanding what you guys were talking about.
It's also very dependent on the software, right?
Like if the software is written to take advantage of the architecture, it could be a lot faster
than traditional computing where the software isn't as optimized, right?
You know, I believe the memory bandwidth on Fugaku is something like a terabyte a second
per node, right?
So for a single CPU socket, right?
Which is about five times better than we can get out of a current,
you know, sort of mainline CPU socket.
But they're seeing some fantastic performance
out of that on some applications.
Now, the downside of that, of course,
is that in that particular design,
because they've squeezed all the memory onto the chip,
they have a lot more bandwidth,
but they have a lot less capacity.
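To make that bandwidth trade-off concrete, here's a small sketch using the rough figures from the conversation: roughly 1 TB/s per Fugaku node versus roughly a fifth of that for a mainline CPU socket. The working-set size is a hypothetical buffer, purely for illustration:

```python
# Time to stream a fixed working set once at the two memory bandwidths
# quoted above. Figures are conversational estimates, not vendor specs.

fugaku_bw = 1e12        # ~1 TB/s per node (on-package stacked memory)
mainline_bw = 0.2e12    # ~200 GB/s, about "five times" less, as quoted

working_set = 100e9     # a hypothetical 100 GB buffer to stream once

t_fugaku = working_set / fugaku_bw      # seconds on the high-bandwidth node
t_mainline = working_set / mainline_bw  # seconds on the mainline socket

print(f"high-bandwidth node: {t_fugaku:.2f} s, mainline socket: {t_mainline:.2f} s")
print(f"bandwidth-bound speedup: {t_mainline / t_fugaku:.1f}x")
```

For a purely bandwidth-bound kernel, that factor flows straight through to runtime, which is why the trade against capacity is worth making for some applications.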
So looking to the future,
what are the TACC's plans
in terms of new supercomputer designs? You might imagine, you know, we're always designing
the next machine. And right now we're planning what the National Science Foundation will call
the Leadership Class Computing Facility. It will ultimately replace Frontera and some of our other
infrastructure in sort of the 2024, 2025 timeframe, Congress permitting. That's a whole other story.
You know, so we're in design for that
machine now, and we're sort of taking apart the application space. And it's really more than
applications, right? You have to understand the field of science, but you also have to understand
the method they're using, right? But when you change algorithms, that can be a big deal for
some of these. And we're sort of mapping that to the chip space across GPUs and CPUs and now TPUs
and all these sort of coarse-grained arrays that people are
building for AI chips and field programmable gate arrays, but also how can we get enough
memory bandwidth to it? How can we get enough network bandwidth to it, right? Do we want to
have a single type of chip on a node? Do we want to have heterogeneous nodes with a mixture of
accelerators and conventional processors, right? And trying to find that right mix for a few years
out is what we're spending a lot of our time on now.
So I know we have a ton of cool stuff
coming down the pipe right now in the world of silicon.
While Intel held a seemingly insurmountable grasp
on the x86 chip market for years,
AMD has now surpassed them, as has Apple
with its new Apple Silicon.
Although I did use AMD back in the day
when they, for a brief period, had more
powerful processors, but then they fell behind. And of course I used Intel until now when AMD
released the more powerful processors. The irony is I waited so long for these things that I just
jumped ship entirely and went with Apple Silicon. And you're a pretty diehard Apple person, and I'm pretty diehard not.
And we'll leave it at that. And yet we coexist. I know, somehow we manage to live on the same planet.
Mars. Mars? Yep. Yeah, it's one of the things that I love to talk about in general, is just
the feedback loop that happens, right? So, we've had the ideas for innovation long before the technology was
produced. And that was mainly a function of material science and the ability to engineer
these things. And now, especially in this sector, there's a feedback loop because the same
supercomputing machines that are used to run the simulations for chip design and the simulations for material science that go into building these chips are being powered by the generation that was produced before.
So the faster we advance our computing ability and AI with predictive analysis, as well as material science in general, it just creates this kind of vortex where it starts going
faster and faster and faster. And part of it may not move as fast as another, but each of these
feed into themselves at some point. Yeah, I think that's absolutely true. I've studied a little
of this and it's often we're inventing the mathematics and the algorithms for these things
decades before they actually become useful, right? The fundamentals of digital circuits are Boolean algebra, right? And George Boole designed Boolean algebra in 1854,
and I don't think he had Intel chips in mind. The math leads the application sometimes by 50
or 100 years. You need stronger computers to create stronger computers. That's really interesting.
Exactly. And that's an excellent way to boil it down, I think. So one of the things we'd love to
discuss here on the Big Compute podcast is cloud computing.
More specifically, the intersection of traditional on-premise supercomputing and cloud HPC.
You know, in many ways, I think our overarching stance on that is just simply that we are the cloud, but we are a very special kind of cloud for academic scientific research, right?
So in some ways, we're a bit of a specialized cloud,
but we're tuned both in our hardware and in our support and in our software stack around
scientific simulation and AI and analytics. So which means for services that aren't the thing
that we're good at, we tend to rely on commercial clouds, right? But there have been impacts of just
the sort of ubiquitous adoption of the commercial cloud that I think have been good for us and technologies that we've transferred back and in places
where we partner.
So in the Frontera project, we have bridges to the major commercial cloud providers, and
we see our users using sort of both and doing crazy hybrid things in a good way.
So one of them is climate simulation, right?
We do a bunch of massive climate forecasts
of very high resolution on Frontera
that take millions of processor hours.
And then they dump out a vast amount of data.
They do the simulation piece on Frontera
and then they push the analytics piece to Azure.
And we push the data back and forth and publish it.
I think cloud has changed the sort of expectations
of users and the usage model.
Part of it is that we see less tolerance to wait in batch queues and more demand for interactive things.
But the other part is this sort of shift to almost ubiquitous but persistent web services.
And so, you know, we wrap RESTful APIs on top of our supercomputers now, right?
We still have tons of traditional users who come in, log into a Unix command line, run batch jobs, and work in that environment. But we have more and more where
the supercomputer is just sort of an automated resource living behind an API for data processing
for the Large Hadron Collider, for some robotic phenotyping work we do. There's just so many
different things where it's becoming sort of HPC as a service. And that has grown out of innovations that have happened in the cloud space.
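As an editorial aside, here's what that "supercomputer behind an API" idea might look like in miniature. This is not TACC's actual API; every endpoint-style name and field below is invented purely to sketch the shape of an HPC-as-a-service job submission.

```python
import json
import uuid

# Illustrative sketch of "HPC as a service": a thin wrapper that turns a
# REST-style job submission into a record a batch scheduler could act on.
# All names and fields are invented for illustration, not TACC's real API.

JOBS = {}  # stands in for the center's job database

def submit_job(app, inputs, nodes=1, wallclock_minutes=60):
    """Accept a job request the way a REST endpoint might, and queue it."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {
        "app": app,               # e.g. a registered simulation code
        "inputs": inputs,         # paths or URLs to input data
        "nodes": nodes,
        "wallclock_minutes": wallclock_minutes,
        "status": "QUEUED",
    }
    return job_id

def job_status(job_id):
    return JOBS[job_id]["status"]

job = submit_job("storm-surge-model", {"track": "hurricane_track.nc"}, nodes=128)
print(json.dumps(JOBS[job], indent=2))
```

The point is that a data pipeline, say for the Large Hadron Collider, can drive the machine programmatically without anyone ever logging into a Unix command line.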
You know, at this point, I think we're one of what will be a set of sort of boutique
niche specialty clouds that will exist within the context of the larger clouds, but will
be linked together by services.
And a lot of our users will use them sort of synergistically, I would hope.
You know, it's sort of not an either or.
And I think whenever it becomes an either or, it's a sort of silly conversation, right?
It's a both.
Yeah.
And I love to hear that, by the way. That TACC has taken the approach of kind of bringing
the two worlds together is really going to make a huge difference in terms of not just
the usability and the extensibility of the services you offer, but the ability for others to
interact and engage with the data coming out of there in a larger context, right? A global context.
Yeah. There will be specialties of the commercial cloud that we want to use, right? I mean,
you know, image tagging, right? Things like that, where there's already a service,
you know, language processing. There's some of these where we can just use a cloud service as
part of a larger scientific workflow or use the cloud as the virtual desktop interface to let people get to the supercomputing resources.
There's so many places where I think collaboration is not only possible, it's just the right thing to do.
I love Dan's vision here that TACC is the cloud, just a special kind of cloud, and that interoperability with commercial clouds is just the right thing to do. I'm curious about the process that researchers go through in order to be able to access
TACC supercomputing. So if I'm a scientist or researcher, I need access to supercomputing.
What do I do to get it? And is it completely complimentary?
There's a few ways to do it. And so for our big National Science Foundation supported machines, there is a
process in conjunction with the NSF. And there's another project called XSEDE that does allocations
across the NSF supported supercomputing centers. So if you're a scientist, you write a proposal
for time, you know, in addition to your proposal for funding, and you say, all right, I have this
project, and I need X computing resources to do it.
And an independent review panel gets together and sort of looks at the suitability and recommends different machines for them to go on.
And so essentially the NSF pays us to build the machine and operate it.
And then we make an amount of time available on that machine every quarter.
And then this sort of neutral third party project comes through and hands out those chunks of time to the users. And in that model, we're not charging the users directly. Right.
All the time is paid for by the NSF in support of what they're doing.
So it is free to them. But it is not free, as I like to remind them.
It costs many millions of dollars. Right. For the time.
We do also, you know, we have some Texas funded machines that are
a similar process, but open to Texas researchers. And then, yeah, the sort of third fallback
mechanism is, yes, some people just pay us for time, right? When they can't get time through
another process, corporate partners, and then some academic researchers or labs who want dedicated
time will just come in and buy a chunk of time to make sure they have it. So they don't have
to go through that proposal process every time they need more time. So we're all about undercover superheroes
and we consider you to be one of those, Mr. Dan Stanzione, because you're helping all of these
projects move forward and these projects are changing the world. So what are the most memorable
or interesting projects that TACC has been a part of? There are so many, and very few of the things we do
change the world directly.
We're letting other people do that, right?
And this being 2020, you can't talk about anything
without talking about COVID, right?
And the fact that we've been able to help world-class people
who do incredible science really fight back
against this pandemic and do things that
probably less than a year
from when we run the computation will turn into therapies or policies or vaccines that will have a
sort of immediate impact. And I sort of call that segment of what we do urgent computing. But the
more I've thought about it, the more I realized that a huge amount of what we do is urgent
computing, right? We just started a new computational oncology partnership with MD Anderson in various kinds of cancer research, doing sort of personalized
dosage levels for particle and proton therapy, the sort of most advanced kind of radiation,
instead of just sort of eyeballing it. And, you know, to people who have cancer,
that's no less urgent than the work we do for COVID. And, you know, this year we've had 12 major Atlantic hurricanes.
We've run simulations for all of those. We do the storm surge models that lead to the evacuation
orders. You know, we run a ton of that stuff and that, you know, within days of doing it
becomes part of people's lives. We see the longer term stuff too. When a hurricane comes through,
we have teams that we work with from universities around the country that go out in the field and
start taking pictures right afterwards, particularly buildings and all the structures
that are damaged. And then we bring all that data in and do analysis, one to make, you know,
to sort of understand and make building codes better. And there's a ton of what's called
vorticity, these little vortexes that form right where a roof corner is. And if they can lift
off the corner, because it's not tacked down right, then water can get into the house and everything
else bad happens. Right. So we're starting to look at, you know, potential AI methods
where you could go and say there's been an earthquake in a city,
you have 50,000 buildings you need to go inspect, right?
Which are the priorities, right?
That's a great AI problem because if we can just help the first responders
to prioritize, these are the buildings you really need to go look at
before you can let people get back in, then that's a huge help. We did some work, you know, when a hurricane hits Galveston,
we have enough data that we can do a model of the storm surge. We can overlay a GIS model of
all the houses in Galveston and all the buildings. And we know what the height of their foundations
are. And we know that the electrical outlets are 18 inches above the foundation height. And we can
model everyone that would have been inundated to the level of electrical outlets,
which means FEMA has to go in and inspect
before they can let people go back
into those homes or buildings.
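The Galveston inundation check Dan describes boils down to a simple comparison per building: did the modeled surge reach the electrical outlets, 18 inches above the foundation? Here's a minimal sketch of that logic; the building data is made up for illustration.

```python
# Illustrative sketch of the Galveston-style inundation check: flag buildings
# where storm surge reached the electrical outlets, which sit 18 inches above
# the foundation. Building data here is invented for illustration.

OUTLET_OFFSET_FT = 18 / 12  # outlets are 18 inches above the foundation

buildings = [
    {"id": "A", "foundation_ft": 2.0},
    {"id": "B", "foundation_ft": 5.5},
    {"id": "C", "foundation_ft": 1.0},
]

def needs_inspection(building, surge_ft):
    """True if surge water reached outlet height, so FEMA must inspect first."""
    return surge_ft >= building["foundation_ft"] + OUTLET_OFFSET_FT

surge_ft = 4.0  # modeled storm surge height at this location
flagged = [b["id"] for b in buildings if needs_inspection(b, surge_ft)]
print(flagged)  # → ['A', 'C']
```

The real workflow runs this comparison against a GIS model of every structure in the city, fed by a storm surge simulation rather than a single hand-set number.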
There's new battery technologies
that ultimately will really change the world.
There's first observations of gravitational waves
that were predicted in 1915
and actually first observed in 2015.
Or the Higgs boson discovery
had an enormous amount of computation, right?
The sort of subatomic structure of the universe.
Food production.
How do we do hybrids of corn to increase yield per acre, right?
There's such an incredible variety of things we get to play a part in
that it's just wonderful.
And so my favorite one usually is the one I've worked on in the last few weeks.
For the layperson, my mom's a pianist and a semi-pro pickleball player, right? She doesn't do supercomputing. She doesn't even
know what a supercomputer is. What would you tell somebody like my mom who has no idea what's going
on behind the scenes if you were to describe what supercomputing is doing for her? What would you say
to somebody like that? When you think about what supercomputers can
do, right? If you look at your cell phone, there's the amount of computation that went into the design
of the case so that when you drop it, it bounces and doesn't crack and rattle the motherboard, of
the chips that are inside it and how they work, of the materials that make up the batteries so that,
you know, we can now take this tiny little device that fits in your pocket and have it run for days while wirelessly communicating.
There's computational design in all these things.
When you have a problem that's too big to solve on any other computer, that becomes a supercomputing problem.
When you have a problem where you just have too many of them. We need to look at every possible design for how we reduce the drag on a wing for an airplane to make it
more fuel efficient. You can't build a million 747 prototypes. You do most of that computationally.
And then finally, when you have deadlines, right? There's a hurricane. It's two days from reaching
the coast. I can't spend six days figuring out where it's going computationally, right?
When we can sort of predict how things move, right, with Newton's laws of motion, or if I can build a mathematical model of a physical process in the universe,
if I want to ask a question of that model, I need to run a simulation, right? That's what a
simulation is, and that's what we do in computing: ask questions of these models that we can build.
Or when I have instruments that are looking at the sky and I need to process data, or read all the data off a genome sequencer,
right? That's analytics. And we're doing that computationally. And then finally now where we
have these huge corpuses of data, like everything everyone has ever tweeted or every cell phone
position of every person in the country, or, you know, any of these other vast data sets,
the genomic data for everyone who's ever had COVID, right? And we want to crunch through all
that data and understand it. And we can't, for things like genomics, build the mathematical model. We can build the statistical model of
those. And that's basically what we call artificial intelligence at this point,
is asking questions of these sets of data. And supercomputers do all those things too.
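Dan's definition of simulation, asking a question of a mathematical model, can be shown in a few lines. Here's a minimal sketch using Newton's laws: given a launch speed and angle, where does a projectile land? The numbers are arbitrary; the point is that we step the equations forward in time instead of throwing the real object.

```python
import math

# A minimal example of "asking a question of a model": given Newton's laws,
# where does a projectile launched at 20 m/s and 45 degrees land?
g = 9.81                    # gravity, m/s^2
v = 20.0                    # launch speed, m/s
angle = math.radians(45)

x, y = 0.0, 0.0
vx, vy = v * math.cos(angle), v * math.sin(angle)
dt = 0.001                  # time step, seconds

while y >= 0.0:             # integrate until it returns to the ground
    x += vx * dt
    y += vy * dt
    vy -= g * dt            # gravity decelerates the vertical velocity

# analytic answer for comparison: v**2 * sin(2*angle) / g, about 40.8 m
print(f"Landing distance: {x:.1f} m")
```

A hurricane forecast or a climate run is this same idea, with vastly more complicated physics and billions of grid points instead of one.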
I heard JPL was working with TACC on Mars rover tech.
We have some work doing data crunching with Jet Propulsion Lab, where we're looking at
orbital insertion for future Mars missions and how you compute that.
It's not classified, but it's considered sensitive code and data that we deal with.
You know, it's export controlled and restricted because in this case, we're looking at Mars,
but there's only so many ways to insert a big piece of metal into the atmosphere from
orbit.
And you want to restrict how well distributed that information is.
So we have to protect a lot of the data around that stuff.
But I know we're doing some, you know, what are good orbits and good landing vectors for
getting into the atmosphere with JPL.
What are you most looking forward to as you stare down the future path of technology advances,
specifically in supercomputing?
For us, it's not just the technology, it's the science we're going to do with it.
I mean, with the merger of sort of AI into the scientific workflow and the potential for that, I think,
you know, we're going to do problems that we haven't even thought of yet at scales that we
can't dream of. But for me, the next big step is building on this LCCF machine that comes after
Frontera, which will involve us building a bigger data center and everything that goes with it and
a bigger training program and picking that next set of technologies that are going to work. So that's going to be an awfully daunting process,
but an exciting one as well. Well said, Dan. What an excellent way to put a cap on the meat
of this episode. Hear, hear. Where can our listeners find you on the internet? We're on
Twitter and Facebook and all the usual places you would expect to find us. If you go to the website
and you want to opt into one of our mailing lists, you can get our annual magazine, Texas Scale.
It's like Exascale,
but everything's a little bigger in Texas.
So that's cute.
Some of our coverage of the science stories
and the new machines that are coming down the pipe.
We will be sure to link some additional information
in the show notes for this podcast.
Thank you so much, Dan.
Thank you for being an undercover superhero.
Thanks very much, Jolie and Ernest. Appreciate the time and the coverage of this.
Thanks again for listening to the latest episode of the Big Compute Podcast.
Believe me when I tell you that I love recording these for our listeners, and I know Jolie does
too. This seriously is such a treat for me. I love learning about all of these interesting
subjects and all of this technology that I hadn't been exposed to. And I love sharing it with all of you. And if anyone out there wants
to help us get the word out about the Big Compute podcast, you can leave us a five-star review and
follow us on Apple Podcasts or your favorite... Or Google Podcasts. I was about to say it.
Just making sure. Or your podcatcher of choice.
Yes.
And if you have any ideas of what we should talk about on our next podcast episodes, feel free to send those in at bigcompute.org, where you can also find a lot of great information.
Anything to get you down the rabbit hole of the awesomeness of supercomputing.
Until next time.
Adios.