ACM ByteCast - Torsten Hoefler - Episode 74
Episode Date: September 16, 2025
In this episode of ACM ByteCast, Bruke Kifle hosts 2024 ACM Prize in Computing recipient Torsten Hoefler, a Professor of Computer Science at ETH Zurich (the Swiss Federal Institute of Technology), where he serves as Director of the Scalable Parallel Computing Laboratory. He is also the Chief Architect for AI and Machine Learning at the Swiss National Supercomputing Centre (CSCS). His honors include the Max Planck-Humboldt Medal, an award for outstanding mid-career scientists; the IEEE CS Sidney Fernbach Award, which recognizes outstanding contributions in the application of high-performance computers; and the ACM Gordon Bell Prize, which recognizes outstanding achievement in high-performance computing. He is a member of the European Academy of Sciences (Academia Europaea), a Fellow of IEEE, and a Fellow of ACM. In the interview, Torsten reminisces about his early interest in using multiple computers to solve problems faster and about building large cluster systems in graduate school that were later turned into supercomputers. He also delves into high-performance computing (HPC) and its central role in simulation and modeling across all modern sciences. Bruke and Torsten cover the various requirements that power HPC, the intersection of HPC and recent innovations in AI, and his key contributions in popularizing 3D parallelism for training AI models. Torsten highlights challenges, such as AI’s propensity to cheat, as well as the promise of turning reasoning models into scientific collaborators. He also offers advice to young researchers on balancing academic learning with industry exposure. We want to hear from you!
Transcript
This is ACM ByteCast, a podcast series from the Association for Computing Machinery,
the world's largest educational and scientific computing society.
We talk to researchers, practitioners, and innovators who are at the intersection of computing research and practice.
They share their experiences, the lessons they've learned, and their own visions for the future of computing.
I am your host, Bruke Kifle.
Today we're exploring the very exciting area and convergence of high-performance computing
and its applications in domains like climate modeling, quantum physics simulation, and large-scale
AI training, where breakthroughs in supercomputer architecture, programming, and algorithms
are powering the next generation of scientific discovery and artificial intelligence.
Our next guest is Professor Torsten Hoefler, Director of the Scalable Parallel
Computing Lab at ETH Zurich and Chief Architect for AI and Machine Learning at CSCS, Switzerland's National Supercomputing Centre.
This year, he received the ACM Prize in Computing, one of the field's highest honors, for his foundational
contributions to both high-performance computing and the ongoing AI revolution.
Professor Hoefler's work spans MPI standards, advanced parallelism techniques, and the design of
networks that power modern supercomputers, delivering orders of magnitude speedups for large-scale
AI and scientific workloads. He's a Fellow of ACM and IEEE, a Gordon Bell Prize winner, and a leader in
making performance benchmarking more rigorous and reproducible. Professor Torsten, welcome to ACM ByteCast.
Thank you very much, Bruke. You have such an amazing career journey and, of course, very impressive
contributions to the field. Can you tell us about your personal journey into the field of computing
and maybe specifically high-performance computing, maybe some key inflection points over the course
of your life that really drew you into this area? Well, this could be an arbitrarily long story,
but let me try to make it reasonably short. So actually, I was always interested in mathematics,
and then it turns out that my performance for solving mathematical equations was very limited.
I realized early on that with the help of computers, we could actually do better in that sense.
I could use computers as my own accelerators, in the sense that they can solve
all kinds of simple operations, like adding two 64-bit floating-point numbers, thousands of
times faster than I can. And so when I was younger, I was getting very fascinated by
using multiple computers to solve a task. So for example, in my teenage years, I started building
a cluster under my bed
where my mother was not super excited
about it but I was more excited
and the cluster was built out of old
machines like literally
486 machines that was a long time ago
that my school was throwing
away at the time. I used software called
MOSIX to put them together
into a single operating-system image
essentially such that I could run
an application distributed over
these eight nodes that were
sitting under my bed. So that was kind of the
start of my distributed
memory career in some sense.
So a distributed-node-connecting career.
Well, and then I went through the normal process in some sense going forward, and I studied
computer science.
I actually wanted to study chemistry and then I wanted to study physics.
And then I realized that what's unifying both chemistry and physics is actually the use
of mathematical models, the use of simulations, at least at the university I was at.
And then I thought, hey, you may want to get to the root of this, because if you understand the root, you can later easily understand the application of it.
So I thought I should go to computer science as a foundation for modern science at the time.
That's when I started my career in the ACM field, so to say.
And then I went on to my master's thesis, and my master's thesis was in a group actually building computers, like larger clusters.
The group I was in was actually building one of the first commodity off-the-shelf cluster systems, at the university in Chemnitz.
So it's quite small.
I believe we did have the first system that was made out of standard midi-tower cases on IKEA racks, connecting hundreds of those into a supercomputer.
So that was quite innovative at the time.
Now I ended up helping to build some of the largest machines on the planet.
So that's quite cool.
Wow, that's super, super exciting.
The origin story dating back to your time as a kid, I think, is certainly one that's quite exciting.
As you think about, I love how you capture this idea of, you know, your initial interest being in
chemistry and physics, but coming to this understanding of, you know, the foundation of modern
science really going down to solving material or mathematical problems. As you fast-forward, you know,
after your master's and maybe into your PhD and some of your early work, what problem or
what set of problems were you really trying to solve early on
that you think shaped the rest of your work in this field?
That's interesting.
That's a nice question.
So I started by looking at something that unfortunately turned out to be quite
useless in the end.
Specifically, my master's thesis was inventing a new MPI barrier algorithm.
So as I mentioned, I was in that group that built that supercomputer,
and they found that specifically MPI barrier was very slow
on that machine and that we should optimize it.
So I spent six months optimizing barrier
algorithm. I actually still, I believe
that I still hold the record
for the best MPI barrier algorithm
on Infiniband, because that was
the interconnect at the time. It was also quite modern.
It was great. I had a great time
only to find out after I graduated
when I talked to some real scientists
that you wouldn't actually use
barrier. So I
optimized something that
a very small number of people,
not zero, but close to zero,
actually cared about.
But that taught me how to build high-performance networking systems,
how to build collective operations on those systems,
so group communication operations, how to build MPI systems.
And from there on, I was able to broaden up
and actually do things that matter later,
but it's quite funny that barrier is a nearly useless operation in practice.
So if you catch yourself using MPI barrier in your application,
you should double check if you really need it.
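For listeners who have never seen MPI code, here is a minimal sketch of what a barrier looks like, written with the mpi4py bindings purely as an illustration (it is not code from the episode). The comments mark the point Torsten makes: a collective such as an all-reduce already synchronizes the data you actually depend on, so the explicit barrier is very often redundant.

```python
# Minimal illustration of MPI_Barrier via the mpi4py bindings; run with
# e.g. `mpiexec -n 4 python barrier_sketch.py`. Illustrative sketch only.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank does some local work.
local_result = rank * rank

# A barrier forces every rank to wait here until all ranks have arrived.
comm.Barrier()

# In most real applications the barrier above is unnecessary: the collective
# below already enforces the only synchronization the data dependency needs.
total = comm.allreduce(local_result, op=MPI.SUM)

if rank == 0:
    print(f"Sum of squares over {comm.Get_size()} ranks: {total}")
```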
Very interesting. That's certainly a topic that I'd love to revisit later in this conversation, less so the MPI barriers and more, you know, how we ensure that some of our work in the field of research is practical. But before we get into that, you know, I think we've definitely used a couple key terms that are certainly uttered in the computing industry but might be unclear to many. And so for those who are less familiar, you know, high-performance computing often runs very quietly in the background.
But it powers some of the most important advances in science and technology, and this has never been more true than it is today.
So how would you explain what HPC is and why it's so essential today, particularly in some of the areas that you're looking at around climate modeling and AI?
That's also a longer story, so let me try to make this short.
So HPC, high performance computing, is the field of computing where you require high performance, as the name suggests.
And then there is a subfield of HPC that is called supercomputing.
And these are then essentially the fastest computers we have on the planet.
These are two different entities.
Many equate them, but I would say it's different because HPC is broader.
Like, HPC is really anything that requires high performance.
As I mentioned, when I was a young student, more than 20 years ago,
oh my God, you could already see that all these domain scientists
were relying less and less on real, physical experimentation.
They were going more and more on simulation-based experimentation.
So they had a model of their chemistry process or their physics device,
and then they would simulate it in a computer to accelerate the experimental process.
So basically the idea there was just to accelerate.
You could evaluate tens of thousands of material combinations,
and you wouldn't need to mix them in physical form.
But then you take the top 10 and you mix those only.
So that was called material screening, for example, at the time.
So that was computer-driven.
And since then, the last 20 years, this has just intensified.
So the field of computing has actually taken over pretty much all modern sciences.
So if you go to most modern scientists today and ask them what their major tool is to make progress in their science,
they will tell you simulation and modeling.
You can see this in several big prizes, like Nobel Prizes, being awarded for work in the field of computing very recently, actually, to computer scientists.
There was this running joke that the only guarantee you would
get as a computer scientist is that you would never get a Nobel Prize, because there
is no Nobel Prize in computer science. But now there's a new running joke, and the new running
joke goes like, well, when is the time when computer scientists will get all Nobel Prizes?
Essentially two Nobel Prizes last year went to computer scientists, which is very shocking.
But it's not shocking in the sense that it's unexpected because as I mentioned, more and
more sciences rely on computing as their foundation. And so now if you think this further, high performance
computing is enabling these sciences to move faster, right? Because high performance computing,
all we do is we create cheaper, faster computing systems to solve very large-scale problems,
like simulations of proteins, new medication, planes, ships, simulation of the atmosphere,
simulation of the weather, and training models like ChatGPT, driving AI. These are all high-performance
computing tasks. And we, the community of high-performance computing, make this faster.
and that means, let's say, we accelerate the speed of those computers by a factor of two,
or we reduce the cost by a factor of two.
It doesn't matter.
It's pretty much equivalent because we are mostly cost bound.
Then we directly accelerate human progress by a factor of two, if you think about it.
If you believe that human progress is driven by science and engineering.
So that's an interesting observation to some extent that many people made recently,
and then the field of high performance computing became suddenly very important in the public.
Everybody talks about supercomputers today.
Like big companies, Microsoft was one of the first companies talking about we are building a supercomputer.
Meta builds supercomputers now.
Meta, I mean, a company that runs social networking.
Tesla, a car company, builds supercomputers now.
Google, an advertisement company, a search company, builds supercomputers.
Every big tech company builds supercomputers.
Countries build supercomputers.
I mean, the field is exploding.
And it's all about solving big problems
faster. All these companies build supercomputers because they want to accelerate AI methods.
But AI science is going to be the next big step, I believe, accelerating science with these
AI and computing techniques, as we've been doing traditionally with high-performance computing.
Does that make sense in general? It makes a lot of sense. And I think you talk about direct,
tangible, practical impact, right? Being able to accelerate computing by 2x, translating to a 2x,
You know, improvement in society and humanity's progress, I think is quite exciting. But, you know, when you talk about building supercomputers, obviously you've been instrumental in designing systems like Blue Waters and now Alps. What exactly does it take to design a supercomputer that's truly optimized end to end? And of course, I know we're timebound. And obviously, I don't expect you to get into the full details. But is it a hardware problem? Is this a software problem? Is this a hybrid? What are some of the key critical components of
actually designing a supercomputer that brings the benefits that you've described.
Unfortunately, it's all of the above, and it's hard to compress, but I'll do my best.
So I'll address each of those a little bit.
So first of all, it is a hardware problem.
These supercomputers are rather large, physically large, and actually they're comparable
in size to these big megadata centers that you can see from a satellite.
So some of them form really physically large footprint facilities.
like the Alps computer and the Blue Waters machine
they fit in a single building
it's a couple of hundred square meters
it's not super big
but the larger industry machines
they can get very very big
But now, what makes a supercomputer?
It is a collection of servers.
So literally, take your standard server,
but make it a high-performance server:
buy an expensive GPU, essentially,
a very high-performance GPU,
and buy an expensive CPU
with expensive memory,
or high-performance memory,
it doesn't necessarily have to be expensive,
but they often tend to be somewhat
pricey. Then you put them into a single machine, and then you replicate this machine, let's say, 10,000
times, because one of those machines is not a supercomputer; it's not fast enough, it's not a high-performance
machine. So you replicate it 10,000 times, you buy 10,000 of those, and then what you have to do is
you have to figure out how to connect those machines to solve a single problem. So you connect
them with a network, but it's not the network that we are talking over right now, or the internet;
it is a network that is actually significantly faster.
So in some sense, many of these supercomputer networks that we run today,
they have more throughput in a single room than the whole internet combined has on the planet.
They can transport more bytes per second, and since we are on ByteCast here,
they can cast more bytes per second than the entire internet can,
but they're concentrated in a single room.
So they have extremely fast communication.
And that's then a supercomputer.
So there are many, many details that we can
talk about, like what the topology of the connection looks like, what the technology is,
whether it's Ethernet or InfiniBand, and yada, yada, but at the high level, that's a supercomputer.
And these accelerators that we are talking about, that many people talk about, these GPGPUs,
general purpose GPU is a very funny acronym because GPU means graphics processing unit,
and then you make it a general purpose graphics processing unit. That makes little sense,
but the idea is basically to say that it used to be a GPU, and now it's a more general
purpose accelerator. These things, they drive modern workloads like modern computational science
as well as modern AI. And the network connects those together. So the CPU plays a minor role
actually today in these supercomputers. It's mostly we call this accelerated computing. So that's
the hardware. Now that you have the hardware, what is at least as important is you need to
build, you need to program that machine. And now you need a programming system that allows you to
program 10,000 computers to solve a single problem, or even more, 100,000 computers.
Actually, we scale up to a million. So we have tried a million endpoints. And that is then where
MPI comes in, the Message Passing Interface. This is a programming abstraction where we
program these 10,000 computers like they were individuals connected via a messaging system where
you can send a message and receive a message and do these collective operations that I mentioned.
And that allows the programmer to orchestrate an overall high-performance system that
works on a single task, like training ChatGPT, for example.
And then you need algorithms on top of this, distributed algorithms,
parallel algorithms, that enable you to use this programming system that are then
encoded in this programming system, which eventually executes on the hardware.
So these are the three levels.
Hardware, middleware programming, and the algorithmic level.
I somehow started working on all of those because performance, if you mess up any of those
three levels, your performance will be bad.
So you have to have a holistic view.
You cannot say, oh, I'm provably performant on level two.
I provably have a good hardware.
No, if you run a bad algorithm and good hardware, you're still bad.
So it's not modular.
You cannot modularize performance.
You can modularize correctness, for example.
You can modularize security, but you cannot modularize performance,
which is a little bit annoying, but that's how high performance computing goes.
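To make the middleware level a bit more concrete, here is a hedged sketch of the message-passing model described above: the same program runs on every node, each process learns its own rank, exchanges point-to-point messages with a neighbor, and takes part in a collective operation. Again, mpi4py is used purely as an illustration; the ring pattern and buffer sizes are made up for the example.

```python
# Sketch of the message-passing programming model: the same program runs on
# every node, and the processes coordinate by rank. Illustrative only;
# launch with e.g. `mpiexec -n 8 python mpi_sketch.py`.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Point-to-point: pass a small buffer to the next rank in a ring.
right = (rank + 1) % size
left = (rank - 1) % size
send_buf = np.full(4, rank, dtype=np.float64)
recv_buf = np.empty(4, dtype=np.float64)
comm.Sendrecv(send_buf, dest=right, recvbuf=recv_buf, source=left)

# Collective: every rank contributes a partial result, all get the global sum.
partial = np.array([float(rank)])
total = np.empty(1)
comm.Allreduce(partial, total, op=MPI.SUM)

if rank == 0:
    print("from left neighbor:", recv_buf, "| global sum:", total[0])
```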
That's extremely helpful and I think provides a really good overview.
As you think about the biggest opportunities for further accelerating the performance of these supercomputers,
where is the biggest opportunity?
Is it at the hardware layer?
Is this a middleware software layer?
Where are the biggest advancements that we've seen maybe in the past five or so years?
And where do you think we'll see some of the biggest advancements in the coming few years?
We have actually seen, I mean, the recent past has mostly focused on AI models.
This is overshadowing everything,
with NVIDIA being the biggest company in the world
in terms of market capitalization,
with growth that is unprecedented for any company,
and all they make is compute accelerators
but they work on the full stack
Jensen Huang often says that the company is actually
not a hardware company, it's a software company
because they work at the full stack.
Let me just highlight changes and revolutions
at each of those three layers.
So the first layer at the hardware layer,
we have really, we went from normal
CPU-based compute to accelerated compute in the last five years, very massively.
Today, more than 99% of the operational capability of a single node, or actually of the full
supercomputer, comes from accelerators, comes from GPUs and not CPUs anymore.
Well, it may be 95% in some cases, but in many systems it's actually exceeding 99%.
So that is astonishing.
We switched very much to accelerated compute.
Then at the middleware layer, we adopted new programming models to enable this revolution to be more
open. For example, deep learning training was driven by Python frameworks. The most famous one
these days is PyTorch. And then all these parallelisms: FSDP, tensor parallelism, or operator
parallelism as I often call it, and all these sequence parallelisms; there are about six or seven different
forms of parallelism that are all implemented in PyTorch. So that wasn't there, well, it was very
small five years ago, but now it's dominating the market. So it was an extremely important revolution
as well, together with NCCL, the collective communication library, or the CCLs in general.
Then at the algorithm level, there were so many breakthroughs that I don't even want to start
enumerating them because we started from transformers a long time ago, but we basically went
to all kinds of different innovations on these transformers.
One very well-known innovation that actually disrupted the stock market quite a bit was
DeepSeek, as we've seen at the beginning of the year.
What was it, $500 billion or up to a trillion dollars,
depending on how you count the loss in U.S. stock values.
But this was mainly an exercise in optimization.
If you actually read what the Chinese colleagues did,
they used FP8 training and invented the technique
of multi-head latent attention,
which is an optimization of the KV-cache mechanism.
And they did a couple more optimizations,
mainly on the communication side.
And that was just great.
And they used MoE, mixture of experts, very effectively,
but that also existed before.
At all these levels, you need to innovate to make a difference.
That's what I meant.
It's not, you cannot make it modular.
You have to see the holistic view across the whole stack.
And each of those enables breakthroughs that are, that were unthinkable years before the
breakthrough happened.
And I believe in the future this will continue.
I have a hard time predicting what exactly will happen in the future.
But I'm very, very optimistic that we will have several breakthroughs at all of these levels.
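As a rough illustration of what these collective communication libraries are used for, the sketch below averages gradients across data-parallel ranks with an all-reduce issued through torch.distributed on the NCCL backend. It is a simplified stand-in for what DDP or FSDP do internally, not their actual implementation; the torchrun launch environment and the toy model are assumptions made for the example.

```python
# Hedged sketch: a gradient all-reduce over the NCCL backend, the kind of
# collective that CCLs provide for data-parallel training. Assumes processes
# are started with torchrun (which sets RANK, WORLD_SIZE, MASTER_ADDR, ...).
import os
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """Sum every gradient across ranks, then divide by the number of replicas."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

if __name__ == "__main__":
    dist.init_process_group(backend="nccl")             # NCCL drives GPU collectives
    local_rank = int(os.environ.get("LOCAL_RANK", 0))   # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()          # toy model replica
    data = torch.randn(32, 1024, device="cuda")         # this rank's shard of data

    loss = model(data).square().mean()                  # toy objective
    loss.backward()
    average_gradients(model)                            # roughly what DDP automates
    dist.destroy_process_group()
```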
Hmm. Super, super insightful. I think that's, yeah, that's quite exciting.
ACM ByteCast is available on Apple Podcasts, Google Podcasts, Podbean, Spotify, Stitcher, and TuneIn.
If you're enjoying this episode, please subscribe and leave us a review on your favorite platform.
We touched on a couple interesting things on the hardware side. You know, you covered MPI.
One of your key contributions has also been, and actually, as we talk about AI, popularizing
3D parallelism for actually training these AI models.
Can you maybe help us understand what that means and why it's so effective?
Yeah, that was very simple, very interesting.
Well, in some sense, that happened very early on, that was 2017, 18, end of 2017, actually.
When you look at these models, that at the time, most people were using simple data parallelism.
So how does data parallelism work?
You would replicate your model to,
let's again work with these 10,000 accelerators,
you would replicate your model essentially 10,000 times,
and then you would run different data
through each of your replicas of the model.
And then each of these replicas would learn from the data,
and what you would do is you would essentially globally average
either the gradients, the update to the weights,
or the weights themselves,
that's now a technical detail.
And then each of these models would be updated
after that operation.
So that's great.
That's a very, very effective,
very cheap method to implement.
This is what everybody used.
But what happened then, in about 2017,
at various companies,
I was at the time,
actually slightly later, on a sabbatical
at Microsoft, helping them build
lots of infrastructure,
and you can now guess
what that was used for
in the OpenAI collaboration.
But the issue was
that we couldn't store a model anymore
on a single accelerator
because the models got too big.
at the time the models were getting bigger and bigger and bigger
and they simply exceeded in size the memory capacity of an accelerator
Well, that's bad. If your model doesn't fit, you can't replicate it,
so now we would have to split the model across multiple accelerators.
And the easy, first option is: well, each of those models consists of multiple layers,
that's a very simple observation, so you could now put
the first half of the layers on accelerator one
and the second half of the layers on accelerator two. Okay,
that is called pipeline parallelism.
You quickly realize that during training
this is actually complicated,
because you would have to go backwards
during the so-called backward pass
and update the weights using the previously stored activations.
So it's again a technical detail,
so you have a forward and a backward pipeline.
I don't want to go into too much detail.
You can watch lots of talks
that I have on YouTube online that explain that.
But that is one dimension, so to say.
You have one dimension that is
the model replicas,
let's say this is the vertical
dimension. And then we have the horizontal dimension where we now cut each replica into multiple
pieces. Now if you have 10 replicas and you cut each replica into 10 pieces, so each piece has
one tenth of the layers, then you already employ 100 accelerators, 10 times 10. That's 2D
parallelism. What we realized then very quickly in many internal projects, which we didn't
publish at the time, but it doesn't matter, was that, well, if you actually do this, your training
would simply be slower,
because you know
for each replica you would need
to go through multiple layers
and if each set of layers
is on a different accelerator
you would now not make the whole thing
faster, you would actually make it slightly slower
in terms of latency
you would make it faster in terms of throughput
because you can pipeline it
like while the first accelerator
is idle it can already process
the next data point
while the second accelerator processes
the first data point so it's really like pipelining
so that gets you a higher throughput
but also higher latency
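As a back-of-the-envelope illustration of that throughput-versus-latency trade-off, here is a tiny sketch with made-up numbers (not figures from the episode): with S pipeline stages and B micro-batches, the whole batch finishes in roughly (S + B - 1) stage-times, while a single sample still has to cross every stage, so its latency is never better than running the unsplit model, and in practice a bit worse once communication is added.

```python
# Back-of-the-envelope model of pipeline parallelism; all numbers are made up
# for illustration, and real systems add communication and bubble overheads.

def pipeline_times(stages: int, micro_batches: int, full_model_time: float):
    """Return (per-sample latency, total batch time) for an S-stage pipeline."""
    stage_time = full_model_time / stages                  # ideal balanced split
    latency = stages * stage_time                          # a sample crosses all stages
    total = (stages + micro_batches - 1) * stage_time      # pipeline fill + drain
    return latency, total

S, B, T = 10, 64, 1.0        # 10 stages, 64 micro-batches, 1.0 s full forward pass
lat, total = pipeline_times(S, B, T)
serial_total = B * T         # one hypothetical big accelerator, one sample at a time

print(f"per-sample latency: {lat:.2f} s (no better than the unsplit model)")
print(f"pipelined batch:    {total:.2f} s vs serial {serial_total:.2f} s "
      f"-> ~{serial_total / total:.1f}x throughput")
```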
So now the question is, how do I get the latency down?
Well, I actually get it down if I now parallelize each of these layers.
So remember, we have 10 model replicas.
Each of these 10 model replicas is distributed to 10 different machines layer-wise.
And now on each of those 10 machines, we cut it again in 10, but we now parallelize each layer.
And that is often called tensor parallelism or operator parallelism,
that now allows us to be faster again
because if I solve each layer with 10 accelerators
this could in theory be 10 times as fast
in practice it's usually slower
because of Amdahl's law and all kinds of overheads,
but that would give you the third dimension,
and then you would have 10 times 10 is 100,
times 10 is 1,000 accelerators employed
in a logical three-dimensional communication pattern
because you only communicate
with your neighbors in each dimension
That is quite nice for designing systems, that insight.
And that's what many people called a three-dimensional parallelism.
Today it's actually somewhat outdated because today there are more dimensions that people added.
So today we are talking about four or five-dimensional parallelism.
Actually, I know a total of six, but some people argue the sixth one.
So you can extend this model into more dimensions.
And that was the basic idea of how to parallelize deep learning workloads
or distribute them and parallelize them.
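One way to picture the three-dimensional layout described above is to place every rank on a logical (data, pipeline, tensor) grid. The sketch below, with made-up sizes of 10 x 10 x 10, only computes which model replica, pipeline stage, and layer shard a given rank would own; frameworks such as Megatron-LM or DeepSpeed do equivalent bookkeeping before wiring up the actual communication groups.

```python
# Hedged sketch: mapping flat ranks onto a logical (data, pipeline, tensor) grid.
# The sizes are made up for illustration; real frameworks do this bookkeeping.
from dataclasses import dataclass

DATA_PARALLEL = 10      # number of model replicas
PIPELINE_PARALLEL = 10  # layer groups (pipeline stages) per replica
TENSOR_PARALLEL = 10    # shards of each layer (tensor/operator parallelism)
WORLD_SIZE = DATA_PARALLEL * PIPELINE_PARALLEL * TENSOR_PARALLEL  # 1,000 ranks

@dataclass
class Coord:
    replica: int  # which copy of the model (data-parallel dimension)
    stage: int    # which slice of layers (pipeline dimension)
    shard: int    # which slice within a layer (tensor dimension)

def coord_of(rank: int) -> Coord:
    """Decompose a flat rank id into its 3-D grid coordinates."""
    shard = rank % TENSOR_PARALLEL
    stage = (rank // TENSOR_PARALLEL) % PIPELINE_PARALLEL
    replica = rank // (TENSOR_PARALLEL * PIPELINE_PARALLEL)
    return Coord(replica, stage, shard)

# Each rank only talks to its neighbors along one dimension at a time:
#   data dim:     gradient averaging with ranks sharing (stage, shard)
#   pipeline dim: activations to/from the adjacent stage in the same replica
#   tensor dim:   partial results with ranks sharding the same layer
for rank in (0, 123, 999):
    print(rank, coord_of(rank))
```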
As we think about the benefits of this, obviously, you know, you're accelerating performance, but are there
cost benefits? Is this more energy efficient? What drives all these large organizations
to actually pursue these advancements? And as you think about the real-world implications beyond just
scaling up the performance of compute, are there other side benefits of these advancements as well?
Well, it's partially a performance benefit. Yeah, it is a performance benefit, but it's often not a cost
benefit. What you gain from
parallelizing is
capability. So as I mentioned,
the model would simply not fit on a single
accelerator. So you must parallelize
it to have a larger model.
And the major progress in
recent years in AI came from
larger models. There's this very famous
Kaplan et al. paper from OpenAI
on the scaling laws of models.
And that basically says that as you
increase your model size, you will
get more and more capability.
So that was the hope at the time. We now know
it's true to some extent
and so we needed to go
parallel to enable larger models
that was the first one
so it's an opportunity
that it enables,
an experiment
that we were really able to run,
or that many were able to run,
and so that was important
and then secondary was of course
after you have enabled
that capability
you needed to get that capability
as fast as you can
because what you also know
is that there is competition in the field
you basically want to be faster than your
competition, and then you use high-performance computing, now that you can load and train the
large model, to make the training as fast as you can, so that it only takes you two months and not 10
years. These are actually realistic numbers: typically it takes about two months, but without many
optimizations it would easily take you 10 years, and of course in 10 years, I mean, you'll be
retired, also I'll be retired, and so it doesn't matter anymore. So you also need to make this
progress at a certain rate. And very often both of those come at the cost of, well, cost.
The more you parallelize, usually the more expensive it gets.
So you have to find an interesting sweet spot.
By parallelization, you will very rarely save costs.
You will improve performance.
You will improve capability, but then it usually costs you more.
But the nice thing is, even in theory, you can very often prove that your cost overhead is only a logarithmic multiplier.
It's not linear or worse.
Super interesting.
Okay.
Very, very insightful.
You know, I think, you know, you mentioned in some ways how this is a core breakthrough
out of necessity, right, to be able to support these large-scale models we had to make
advancements in how we think about the 3D parallelism. Can you maybe share a story where
your research and system design has really enabled a very interesting scientific or AI
breakthrough? A specific story. I mean, yes. So we had this, that's now on the modeling
side rather than the AI things; there are many NDAs that I have to be careful with there. On the scientific
modeling side, we are actually partnering
with a wonderful team at ETH Zurich
simulating
heat dissipation in transistors.
The whole field of computing
is driven by essentially by Moore's
law, or has been driven the last
40 years by Moore's Law, which
roughly says that every 18
months, the number of transistors
that you have at your disposal
somewhat doubles and everything else
is constant. The cost is constant. The energy
consumption is constant. I mean, if you assume
Dennard scaling as well. So basically, you get a
free doubling every 18 months. Well, the 18 months has shifted, it's now getting longer.
This was due to the fact that elements got smaller and smaller and smaller.
And on the transistor side, what happened then is they got so small that now quantum effects play a role.
As you know, a transistor works by switching between letting electrons through and not letting electrons through, like a zero or one state.
Either a high resistance or a low resistance, based on the electrical signal on
one of the contacts.
So now, even if it's supposed to be closed,
unfortunately, some electrons tunnel through.
That is called the leakage energy,
and that is a huge problem,
because they don't really close well
if they get very small, because electrons tunnel through.
And so we worked together with the scientist
Mathieu Luisier and his team,
who is an expert in simulating these effects,
to simulate a realistic size
set of transistors
in order to build better transistors in the future
because these transistors,
you can now imagine you can build them in different shapes,
and these shapes have very interesting trade-offs
about the loss energy,
the functionality of the transistor,
so how many electrons you need to actually close the gate
and to open the gate,
so that's your power consumption, right?
It gets quite complex in the details,
and I don't want to go into too much of the details.
You can read this yourself.
And so what we helped with there is make this application
enable the largest transistor simulation run
that we have ever been able to do,
or that scientists worldwide have ever been able to do,
and improve the performance of this run
by about 100 times, 98 times or something.
And so that was a breakthrough
that enabled many manufacturers of transistors
to build better designs,
to simulate better designs for their transistors.
That was 2019.
That's when we got the Gordon Bell Prize for this,
which is one of the highest awards in the field.
Yes, that was a breakthrough that we enabled,
I always joke and say,
well, we use transistors to make better transistors in the future.
And much of that credit, of course, goes to Mathieu Luisier, who is the actual scientist.
I have only dangerous half knowledge in how transistors actually work.
He knows that for sure.
But my team contributed the performance optimization, which partially enabled that breakthrough.
It's always two things, right?
You need the science case, and you need the performance to enable the science case.
So it's really beneficial to work together with scientists in our field.
Interesting. Okay.
Super, super insightful.
And maybe as you look ahead, what do you believe will be some of the most important shifts or innovations in HPC and scalable parallel computing that will redefine how we solve complex scientific problems, such as the ones that you've cited, but also real world problems and AI?
Or perhaps how do you think about some of the foundational assumptions of HPC that perhaps we need to rethink as we think about solving the next sort of wave of scientific and real world problems?
In HPC, the field has been for a very long time driven by computational science.
Literally, these big computers, since I can think, were built to solve big modeling and simulation problems.
And only recently, around 2019, this all started to switch to AI.
AI as a workload on high-performance computing systems did not really exist in a significant manner before 2019, let's say,
maybe 2018, I don't want to pin down the exact number, but recently.
And so they were all designed to solve these high-performance computing problems, and that was great.
And one of the differentiating factors is what you needed to solve these modeling and simulation problems.
For example, if you want to predict the climate on Earth in 30 years, you need 64-bit floating-point precision, FP64.
Now with a new revolution of AI, we realize that actually for AI systems, you don't need that much precision.
Because interestingly, if you look at biological neural systems, like my brain, for example,
my brain can differentiate something in the 20s at different voltage levels.
And if you think about this, something in the 20s voltage levels is 4.6 bits of precision.
So my brain runs at less than 5-bit precision.
So why would we run AI models at 64-bit precision?
Of course we don't.
As I mentioned earlier, DeepSeek pioneered the use of 8-bit in training.
So we are now using 8-bit precision in many training and inference systems.
Many still use 16, but I think we are going to 8-bit.
So that's now a problem for traditional HPC,
because the AI field is 95% of the revenue of NVIDIA and the other companies that make hardware.
So they will naturally focus on where the money is.
So they will naturally focus on low precision computations.
And HPC needs high precision computation.
So I think what we need to invent at this point is how we deal with
that in the modeling and simulation field.
And I have to be careful, I made a mistake.
I shouldn't have said HPC needs this,
because AI is a field within HPC.
Modeling and simulation needs this.
So I think you really need to innovate
to not leave the scientific simulations behind.
But then the other opportunity is,
and this is very important,
and this is going to launch a whole lot of different works.
I mean, in my group, I'm working on this.
Many people work on this; Jack Dongarra
works on this, and he leads that field of mixed precision.
So that's already well underway.
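To make the precision discussion a bit more concrete, here is a hedged sketch of generic mixed-precision training in PyTorch: the matrix multiplications run in bfloat16 inside an autocast region while the master weights and gradients stay in float32. This is a simplified, generic example, not DeepSeek's FP8 recipe, which relies on specialized kernels that are not shown here.

```python
# Hedged sketch of mixed-precision training: low-precision matmuls, fp32 weights.
# Generic illustration only; FP8 training needs specialized kernels not shown.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(
    torch.nn.Linear(512, 2048), torch.nn.GELU(), torch.nn.Linear(2048, 512)
).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(64, 512, device=device)
target = torch.randn(64, 512, device=device)

for step in range(10):
    optimizer.zero_grad()
    # Low-precision region: activations and matmuls run in bfloat16 here,
    # while the parameters themselves remain stored in float32.
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        loss = torch.nn.functional.mse_loss(model(x), target)
    loss.backward()      # gradients land back in the fp32 parameter dtype
    optimizer.step()
```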
A much bigger possibility I'm actually seeing
to revolutionize modeling and simulations
is the use of AI techniques themselves.
So AI for some of this.
But that is very dangerous.
You have to be very careful.
You can easily get fooled
because these AI models,
they always learn how to cheat really well.
So they take all the information you give them
in kind of weird ways to make their predictions.
And sometimes you can catch them cheating.
Like, for example, if you're fascinated by a climate model that runs stably for a very long time,
you may realize later that, well, if you feed the date of the prediction into the model,
it may not actually predict the future, but just predict the most likely weather on the 15th of January,
consistently, for the next 100 years.
That is not a useful prediction, because it does not take into account how the CO2 concentration in the atmosphere changes over 100 years,
or whatever you want to predict;
you just get a likely sample
for the 15th of January.
So that's something very dangerous.
And so physics-based simulations
do not have that problem
because they often work by first principles.
You can prove that your prediction
will have certain properties,
while with AI models you typically can't prove that.
And so that's a challenge.
You need to figure out
how to use these data-driven methods
to reliably predict
physical simulations
or physical processes.
And that's going to be super exciting as well.
I'm mega-hyped about this.
There are many, many approaches today, physics-informed neural networks and many data-driven approaches,
but I think there's a lot to be done.
And then the third one, which is, in some sense, the most meaningful one, but also, well, the biggest one in some sense, is: can we use LLMs and these artificial intelligence reasoning models that we have right now and turn them into scientists?
So can they replace our reasoning, proving, hypothesizing process?
And I strongly believe they can at least augment it
and help us in interesting ways
So I always like to talk about constructive hallucination
in that context, because if I explain what I do as a scientist:
I make an educated guess,
and I call this a hypothesis,
and then I go and prove this educated guess,
and once I have proven it, I have extended the knowledge
that we have as humans.
But it started from the educated guess,
because if I hadn't guessed,
I would not have
been able to come up with the hypothesis,
the research hypothesis, and I wouldn't be able to prove it.
So,
many models can actually do this
educated guessing as a form of
constructive hallucination.
You could train a model to make an educated guess,
because they have a lot of knowledge;
it's well within the
interpolation space, so it's well within their knowledge
base to make this guess. And then we just need to convince them to prove it. So that's
a big part of the scientific process. And so we're doing a lot of research in this area,
using reasoning models to help with that process, to augment scientists. So in some sense, this is
then, well, if you manage to do this, these models can invent simulations. They can do anything
that humans have invented to some extent in the past. Then it gets interesting. And so
these are the three tiers: moving scientific simulation forward through using AI techniques,
replacing parts of scientific simulations
with AI and data-driven techniques,
and then enabling AI to do all of the above.
Super insightful, yeah.
I think it certainly speaks to the important role
that the technologies that supercomputers
and high-performance computing are enabling
will also have reverse effects on the field itself.
So I think that's quite exciting.
Your work, and you touched on it earlier,
going back to your early days
and some of your contributions with
MPI barrier. You talked about this idea of doing work that has reached into the real world
and practical impact. And so with your long-spanning career and some of your contributions,
not just in academic research, but in actually building infrastructure and real-world impact
that has enabled very, very tangible use cases, how do you generally think about bridging theory
and practice in your work? And perhaps what advice would you have for younger researchers
entering this space?
That's a great question.
So I have learned over time.
So I'm sitting in a weird spot because as I mentioned, I started with mathematics.
I'm kind of a hobby mathematician.
But then I realized that it's much more fun for me to build systems that actually work
and make a difference.
So somehow I'm an engineer now.
But I'm an engineer who really appreciates math and models trying to develop deep understanding.
In my understanding, mathematics is nothing but a simplification of the world,
with models that enable us to think much more cleanly and clearly about what the world does.
So many people think math is complicated, but it's actually a simplification of what's going on
in engineering. So I'm an engineer, and then as an engineer, you have to be very careful
that you engineer systems that are actually useful. And I learned later in my career,
not as a student, and maybe a little bit late, that it's very important to stay connected to
the real world. And so this is when I started spending time in industry myself.
I would give this as a recommendation to many.
I would not do it too early because the problem is, of course, by definition, industry,
they have to generate revenue for their stakeholders.
So by definition, at industry, what you have to do is you have to support the mission
of the company you're at.
By definition in academia, what you do is, first of all, while you're in your educational
process, you build your own knowledge, you educate yourself, mostly a selfish process;
while you're learning, you're ingesting all that knowledge
to later be a better industry player in some sense,
or be an academic.
And the question is, when you make this decision,
do I want to go to industry or to academia,
you have to be very conscious about this.
Because during your studies,
you were funded by society,
either your parents early on or later with stipends or whatnot,
you're funded by society to improve yourself.
But then when you make the decision to continue in academia
and to educate other people
and do open, blue-sky research,
or to go to industry
you have to be very conscious
because when you go to industry
typically as I mentioned
you're there to generate a profit
for the stakeholders including yourself
in academia what you're doing
is you're helping other people
to be educated
and not necessarily generating a profit
So now,
I wouldn't say that one of them
is better than the other,
I'm not even sure.
I like a combination of both,
because in order to generate profit,
you need to do something that's societally relevant.
In academia, you can very easily get lost and do something that's not relevant for anybody,
but it's fun.
Yeah, this is very easy.
And so now I think there's a very fine line that I myself chose to work on
in between academia and industry.
So basically what I always do is I go to industry and watch them
solve real world problems and pick the hardest problems out of there
and then move them into an academic context to solve them.
and then when they're solved,
I try to bring them back to benefit society.
It's an interesting mix of the two,
and I would recommend everybody should or could find that mix.
It really depends on the person.
I mean,
I know many people who are perfectly happy
developing theorems all day long
that may or may not be relevant.
Many of those turn out to be extremely relevant later,
but at the moment they develop them,
they may not be relevant.
But then later, 50 years later,
they get the touring award for it.
So this is always possible, absolutely.
And then there are people who are
happier with immediate feedback, like myself, where I'm seeing, okay, I architected that system.
That's one of the largest systems on the planet. That is great, and I can be proud of it.
So it really depends on what you want to achieve and what you want to get as
feedback from society. I think those are extremely, extremely valuable bits of advice.
For one, it's certainly a personal journey and the kind of gratification that you seek in
the real world impact of your scientific contributions may vary person to person.
I think great, great pieces of advice on, you know, staying close to industry, perhaps not too early in your career, but at some point.
And I think your, you know, approach of identifying the most compelling or hardest problems, bringing them back to an academic sort of research environment, achieving those breakthroughs and then ensuring that they are able to see real world impact by, you know, hopefully bridging that, that research to practice, I think, extremely, extremely valuable.
So I think that leaves us with really, really valuable takeaways for our listeners and a great point for us to wrap up on.
And I think through this conversation, certainly we've uncovered that, you know, high performance computing isn't really just about speed.
Though while that is one aspect, you know, it's really about unlocking, you know, entirely new possibilities in science and in society.
And so as systems grow more complex and more impactful, work like yours, Professor Torsten, will ensure that they stay fast, efficient,
and trustworthy.
And so thank you for your work
and thank you for joining us on ByteCast.
Thank you very much for inviting me, Bruke.
It was fun.
ACM ByteCast is a production of the Association
for Computing Machinery's Practitioner Board.
To learn more about ACM and its activities,
visit acm.org.
For more information about this and other episodes,
please visit our website at learning.acm.org.
slash B-Y-T-E-C-A-S-T.
That's learning.acm.org
slash bytecast.