ACM ByteCast - Torsten Hoefler - Episode 74
Episode Date: September 16, 2025
In this episode of ACM ByteCast, Bruke Kifle hosts 2024 ACM Prize in Computing recipient Torsten Hoefler, a Professor of Computer Science at ETH Zurich (the Swiss Federal Institute of Technology), where he serves as Director of the Scalable Parallel Computing Laboratory. He is also the Chief Architect for AI and Machine Learning at the Swiss National Supercomputing Centre (CSCS). His honors include the Max Planck-Humboldt Medal, an award for outstanding mid-career scientists; the IEEE CS Sidney Fernbach Award, which recognizes outstanding contributions in the application of high-performance computers; and the ACM Gordon Bell Prize, which recognizes outstanding achievement in high-performance computing. He is a member of the European Academy of Sciences (Academia Europaea), a Fellow of IEEE, and a Fellow of ACM. In the interview, Torsten reminisces about his early interest in using multiple computers to solve problems faster and about building large cluster systems in graduate school that were later turned into supercomputers. He also delves into high-performance computing (HPC) and its central role in simulation and modeling across all modern sciences. Bruke and Torsten cover the various requirements that power HPC, the intersection of HPC and recent innovations in AI, and his key contributions in popularizing 3D parallelism for training AI models. Torsten highlights challenges, such as AI’s propensity to cheat, as well as the promise of turning reasoning models into scientific collaborators. He also offers advice to young researchers on balancing academic learning with industry exposure. We want to hear from you!
Transcript
This is ACM ByteCast, a podcast series from the Association for Computing Machinery,
the world's largest educational and scientific computing society.
We talk to researchers, practitioners, and innovators who are at the intersection of computing research and practice.
They share their experiences, the lessons they've learned, and their own visions for the future of computing.
I am your host, Bruke Kifle.
Today we're exploring the very exciting area and convergence of high-performance computing
and its applications in domains like climate modeling, quantum physics simulation, and large-scale
AI training, where breakthroughs in supercomputer architecture, programming, and algorithms
are powering the next generation of scientific discovery and artificial intelligence.
Our next guest is Professor Torsten Hoefler, Director of the Scalable Parallel
Computing Lab at ETH Zurich and Chief Architect for AI and Machine Learning at CSCS, Switzerland's National Supercomputing Centre.
This year, he received the ACM Prize in Computing, one of the field's highest honors, for his foundational
contributions to both high-performance computing and the ongoing AI revolution.
Professor Hoefler's work spans MPI standards, advanced parallelism techniques, and the design of
networks that power modern supercomputers, delivering orders of magnitude speedups for large-scale
AI and scientific workloads. He's a Fellow of ACM and IEEE, a Gordon Bell Prize winner, and a leader in
making performance benchmarking more rigorous and reproducible. Professor Torsten, welcome to ACM ByteCast.
Thank you very much, Bruke. You have such an amazing career journey and, of course, very impressive
contributions to the field. Can you tell us about your personal journey into the field of computing
and maybe specifically high-performance computing, maybe some key inflection points over the course
of your life that really drew you into this area? Well, this could be an arbitrarily long story,
but let me try to make it reasonably short. So actually, I was always interested in mathematics,
and then it turns out that my performance for solving mathematical equations was very limited.
I realized early on that with the help of computers, we could actually do better in that sense.
I could use computers as my own accelerators, in the sense that they can solve
all kinds of simple operations, like adding two 64-bit floating-point numbers, thousands of
times faster than I can. And so when I was younger, I was getting very fascinated by
using multiple computers to solve a task. So for example, in my teenage years, I started building
a cluster under my bed
where my mother was not super excited
about it but I was more excited
and the cluster was built out of old
machines like literally
486 machines that was a long time ago
that my school was throwing
away at the time. I used software called
MOSIX to put them together
into a single operating-system image
essentially such that I could run
an application distributed over
these eight nodes that were
sitting under my bed. So that was kind of the
start of my distributed
memory career in some sense.
So a distributed-node-connecting career.
Well, and then I went through the normal process in some sense going forward, and I studied
computer science.
I actually wanted to study chemistry and then I wanted to study physics.
And then I realized that what's unifying both chemistry and physics is actually the use
of mathematical models, the use of simulations, at least at the university I was at.
And then I thought, hey, you may want to get to the root of this, because if you understand the root, you can later easily understand the application of it.
So I thought I should go to computer science as a foundation for modern science at the time.
That's when I started my career in the ACM field, so to say.
And then I went on to my master's thesis, and my master's thesis was in a group actually building computers, like larger clusters.
The group I was in was actually building one of the first commodity off-the-shelf cluster systems, at the university in Chemnitz.
So it's quite small.
I believe we did have the first system that was made out of standard midi-tower cases on IKEA racks, connecting hundreds of those into a supercomputer.
So that was quite innovative at the time.
Now I ended up helping to build some of the largest machines on the planet.
So that's quite cool.
Wow, that's super, super exciting.
The origin story dating back to your time as a kid, I think, is certainly one that's quite exciting.
As you think about, I love how you capture this idea of, you know, your initial interest being in
chemistry and physics, but coming to this understanding of, you know, the foundation of modern
science really going down to solving material or mathematical problems. As you fast-forward, you know,
after your master's and maybe into your PhD and some of your early work, what problem or
what set of problems were you really trying to solve early on
that you think shaped the rest of your work in this field?
That's interesting.
That's a nice question.
So I started by looking at something that unfortunately turned out to be quite
useless in the end.
Specifically, my master's thesis was inventing a new MPI barrier algorithm.
So as I mentioned, I was in that group that built that supercomputer,
and they found that specifically MPI barrier was very slow
on that machine and that we should optimize it.
So I spent six months optimizing barrier
algorithm. I actually still, I believe
that I still hold the record
for the best MPI barrier algorithm
on Infiniband, because that was
the interconnect at the time. It was also quite modern.
It was great. I had a great time
only to find out after I graduated
when I talked to some real scientists
that you wouldn't actually use
barrier. So I
optimized something that
a very small number of people,
not zero, but close to zero,
actually cared about.
But that taught me how to build high-performance networking systems,
how to build collective operations on those systems,
so group communication operations, how to build MPI systems.
And from there on, I was able to broaden up
and actually do things that matter later,
but it's quite funny that barrier is a nearly useless operation in practice.
So if you catch yourself using MPI barrier in your application,
you should double check if you really need it.
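For listeners who have never seen MPI code, here is a minimal sketch of what a barrier looks like, written with the mpi4py bindings purely as an illustration (it is not code from the episode). The comments mark the point Torsten makes: a collective such as an all-reduce already synchronizes the data you actually depend on, so the explicit barrier is very often redundant.

```python
# Minimal illustration of MPI_Barrier via the mpi4py bindings; run with
# e.g. `mpiexec -n 4 python barrier_sketch.py`. Illustrative sketch only.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank does some local work.
local_result = rank * rank

# A barrier forces every rank to wait here until all ranks have arrived.
comm.Barrier()

# In most real applications the barrier above is unnecessary: the collective
# below already enforces the only synchronization the data dependency needs.
total = comm.allreduce(local_result, op=MPI.SUM)

if rank == 0:
    print(f"Sum of squares over {comm.Get_size()} ranks: {total}")
```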
Very interesting. That's certainly a topic that I'd love to revisit later in this conversation, less so the MPI barriers and more, you know, how we ensure that some of our work in the field of research is practical. But before we get into that, you know, I think we've definitely used a couple key terms that are certainly uttered in the computing industry but might be unclear to many. And so for those who are less familiar, you know, high-performance computing often runs very quietly in the background.
But it powers some of the most important advances in science and technology, and this has never been more true than it is today.
So how would you explain what HPC is and why it's so essential today, particularly in some of the areas that you're looking at around climate modeling and AI?
That's also a longer story, so let me try to make this short.
So HPC, high performance computing, is the field of computing where you require high performance, as the name suggests.
And then there is a subfield of HPC that is called supercomputing.
And these are then essentially the fastest computers we have on the planet.
These are two different entities.
Many equate them, but I would say it's different because HPC is broader.
Like, HPC is really anything that requires high performance.
As I mentioned, when I was a young student, more than 20 years ago,
oh my God, you could already see that all these domain scientists
were relying less and less on real, physical experimentation.
They were going more and more on simulation-based experimentation.
So they had a model of their chemistry process or their physics device,
and then they would simulate it in a computer to accelerate the experimental process.
So basically the idea there was just to accelerate.
You could evaluate tens of thousands of material combinations,
and you wouldn't need to mix them in physical form.
But then you take the top 10 and you mix those only.
So that was called material screening, for example, at the time.
So that was computer-driven.
And since then, the last 20 years, this has just intensified.
So the field of computing has actually taken over pretty much all modern sciences.
So if you go to most modern scientists today and ask them what their major tool is to make progress in their science,
they will tell you simulation and modeling.
You can see this in several big prizes, like Nobel Prizes, being awarded for work in the field of computing very recently, actually, to computer scientists.
There was this running joke that the only guarantee you would
get as a computer scientist is that you would never get a Nobel Prize, because there
is no Nobel Prize in computer science. But now there's a new running joke, and the new running
joke goes like, well, when is the time when computer scientists will get all Nobel Prizes?
Essentially two Nobel Prizes last year went to computer scientists, which is very shocking.
But it's not shocking in the sense that it's unexpected because as I mentioned, more and
more sciences rely on computing as their foundation. And so now if you think this further, high performance
computing is enabling these sciences to move faster, right? Because high performance computing,
all we do is we create cheaper, faster computing systems to solve very large-scale problems,
like simulations of proteins, new medication, planes, ships, simulation of the atmosphere,
simulation of the weather, and training models like ChatGPT, driving AI. These are all high-performance
computing tasks. And we, the community of high-performance computing, make this faster.
and that means, let's say, we accelerate the speed of those computers by a factor of two,
or we reduce the cost by a factor of two.
It doesn't matter.
It's pretty much equivalent because we are mostly cost bound.
Then we directly accelerate human progress by a factor of two, if you think about it.
If you believe that human progress is driven by science and engineering.
So that's an interesting observation to some extent that many people made recently,
and then the field of high performance computing became suddenly very important in the public.
Everybody talks about supercomputers today.
Like big companies, Microsoft was one of the first companies talking about we are building a supercomputer.
Meta builds supercomputers now.
Meta, I mean, a company that runs social networking.
Tesla, a car company, builds supercomputers now.
Google, an advertisement company, a search company, builds supercomputers.
Every big tech company builds supercomputers.
Countries build supercomputers.
I mean, the field is exploding.
And it's all about solving big problems
faster. All these companies build supercomputers because they want to accelerate AI methods.
But AI science is going to be the next big step, I believe, accelerating science with these
AI and computing techniques, as we've been doing traditionally with high-performance computing.
Does that make sense in general? It makes a lot of sense. And I think you talk about direct,
tangible, practical impact, right? Being able to accelerate computing by 2x, translating to a 2x,
You know, improvement in society and humanity's progress, I think is quite exciting. But, you know, when you talk about building supercomputers, obviously you've been instrumental in designing systems like Blue Waters and now Alps. What exactly does it take to design a supercomputer that's truly optimized end to end? And of course, I know we're timebound. And obviously, I don't expect you to get into the full details. But is it a hardware problem? Is this a software problem? Is this a hybrid? What are some of the key critical components of
actually designing a supercomputer that brings the benefits that you've described.
Unfortunately, it's all of the above, and it's hard to compress, but I'll do my best.
So I'll address each of those a little bit.
So first of all, it is a hardware problem.
These supercomputers are rather large, physically large, and actually they're comparable
in size to these big megadata centers that you can see from a satellite.
So some of them form really physically large footprint facilities.
like the Alps computer and the Blue Waters machine
they fit in a single building
it's a couple of hundred square meters
it's not super big
but the larger industry machines
they can get very very big
But now, what makes a supercomputer?
It is a collection of servers.
So literally, take your standard server,
but make it a high-performance server:
buy an expensive GPU, essentially,
a very high-performance GPU,
and buy an expensive CPU
with expensive memory,
or high-performance memory,
it doesn't necessarily have to be expensive,
but they often tend to be somewhat
pricey. Then you put them into a single machine, and then you replicate this machine, let's say, 10,000
times, because one of those machines is not a supercomputer; it's not fast enough, it's not a high-performance
machine. So you replicate it 10,000 times, you buy 10,000 of those, and then what you have to do is
you have to figure out how to connect those machines to solve a single problem. So you connect
them with a network, but it's not the network that we are talking over right now, or the internet;
it is a network that is actually significantly faster.
So in some sense, many of these supercomputer networks that we run today,
they have more throughput in a single room than the whole internet combined has on the planet.
They can transport more bytes per second, and since we are on ByteCast here,
they can cast more bytes per second than the entire internet can,
but they're concentrated in a single room.
So they have extremely fast communication.
And that's then a supercomputer.
So there are many, many details that we can
talk about, like what the topology of the connection looks like, what the technology is,
whether it's Ethernet or InfiniBand, and yada, yada, but at the high level, that's a supercomputer.
And these accelerators that we are talking about, that many people talk about, these GPGPUs,
general purpose GPU is a very funny acronym because GPU means graphics processing unit,
and then you make it a general purpose graphics processing unit. That makes little sense,
but the idea is basically to say that it used to be a GPU, and now it's a more general
purpose accelerator. These things, they drive modern workloads like modern computational science
as well as modern AI. And the network connects those together. So the CPU plays a minor role
actually today in these supercomputers. It's mostly we call this accelerated computing. So that's
the hardware. Now that you have the hardware, what is at least as important is you need to
build, you need to program that machine. And now you need a programming system that allows you to
program 10,000 computers to solve a single problem, or even more, 100,000 computers.
Actually, we scale up to a million. So we have tried a million endpoints. And that is then where
MPI comes in, the Message Passing Interface. This is a programming abstraction where we
program these 10,000 computers like they were individuals connected via a messaging system where
you can send a message and receive a message and do these collective operations that I mentioned.
And that allows the programmer to orchestrate an overall high-performance system that
works on a single task, like training ChatGPT, for example.
And then you need algorithms on top of this, distributed algorithms,
parallel algorithms, that enable you to use this programming system that are then
encoded in this programming system, which eventually executes on the hardware.
So these are the three levels.
Hardware, middleware programming, and the algorithmic level.
I somehow started working on all of those because performance, if you mess up any of those
three levels, your performance will be bad.
So you have to have a holistic view.
You cannot say, oh, I'm provably performant on level two.
I provably have a good hardware.
No, if you run a bad algorithm and good hardware, you're still bad.
So it's not modular.
You cannot modularize performance.
You can modularize correctness, for example.
You can modularize security, but you cannot modularize performance,
which is a little bit annoying, but that's how high performance computing goes.
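To make the middleware level a bit more concrete, here is a hedged sketch of the message-passing model described above: the same program runs on every node, each process learns its own rank, exchanges point-to-point messages with a neighbor, and takes part in a collective operation. Again, mpi4py is used purely as an illustration; the ring pattern and buffer sizes are made up for the example.

```python
# Sketch of the message-passing programming model: the same program runs on
# every node, and the processes coordinate by rank. Illustrative only;
# launch with e.g. `mpiexec -n 8 python mpi_sketch.py`.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Point-to-point: pass a small buffer to the next rank in a ring.
right = (rank + 1) % size
left = (rank - 1) % size
send_buf = np.full(4, rank, dtype=np.float64)
recv_buf = np.empty(4, dtype=np.float64)
comm.Sendrecv(send_buf, dest=right, recvbuf=recv_buf, source=left)

# Collective: every rank contributes a partial result, all get the global sum.
partial = np.array([float(rank)])
total = np.empty(1)
comm.Allreduce(partial, total, op=MPI.SUM)

if rank == 0:
    print("from left neighbor:", recv_buf, "| global sum:", total[0])
```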
That's extremely helpful and I think provides a really good overview.
As you think about the biggest opportunities for further accelerating the performance of these supercomputers,
where is the biggest opportunity?
Is it at the hardware layer?
Is this a middleware software layer?
Where are the biggest advancements that we've seen maybe in the past five or so years?
And where do you think we'll see some of the biggest advancements in the coming few years?
We have actually seen, I mean, the recent past has mostly focused on AI models.
This is overshadowing everything,
with NVIDIA being the biggest company in the world
in terms of market capitalization,
with growth that is unprecedented for any company,
and all they make is compute accelerators
but they work on the full stack
Jensen Huang often says that the company is actually
not a hardware company, it's a software company
because they work at the full stack.
Let me just highlight changes and revolutions
at each of those three layers.
So the first layer at the hardware layer,
we have really, we went from normal
CPU-based compute to accelerated compute in the last five years, very massively.
Today, more than 99% of the operational capability of a single node, or actually of the full
supercomputer, comes from accelerators, comes from GPUs and not CPUs anymore.
Well, it may be 95% in some cases, but in many systems it's actually exceeding 99%.
So that is astonishing.
We switched very much to accelerated compute.
Then at the middleware layer, we adopted new programming models to enable this revolution to be more
open. For example, deep learning training was driven by Python frameworks. The most famous one
these days is PyTorch. And then all these parallelisms: FSDP, tensor parallelism, or operator
parallelism as I often call it, and all these sequence parallelisms; there are about six or seven different
forms of parallelism that are all implemented in PyTorch. So that wasn't there, well, it was very
small five years ago, but now it's dominating the market. So it was an extremely important revolution
as well, together with NCCL, the collective communication library, or the CCLs in general.
Then at the algorithm level, there were so many breakthroughs that I don't even want to start
enumerating them because we started from transformers a long time ago, but we basically went
to all kinds of different innovations on these transformers.
One very well-known innovation that actually disrupted the stock market quite a bit was
DeepSeek, as we've seen at the beginning of the year.
What was it, $500 billion or up to a trillion dollars,
depending on how you count the loss in U.S. stock values.
But this was mainly an exercise in optimization.
If you actually read what the Chinese colleagues did,
they used FP8 training and invented the technique
of multi-head latent attention,
which is an optimization of the KV-cache mechanism.
And they did a couple more optimizations,
mainly on the communication side.
And that was just great.
And they used MoE, mixture of experts, very effectively,
but that also existed before.
At all these levels, you need to innovate to make a difference.
That's what I meant.
It's not, you cannot make it modular.
You have to see the holistic view across the whole stack.
And each of those enables breakthroughs that are, that were unthinkable years before the
breakthrough happened.
And I believe in the future this will continue.
I have a hard time predicting what exactly will happen in the future.
But I'm very, very optimistic that we will have several breakthroughs at all of these levels.
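As a rough illustration of what these collective communication libraries are used for, the sketch below averages gradients across data-parallel ranks with an all-reduce issued through torch.distributed on the NCCL backend. It is a simplified stand-in for what DDP or FSDP do internally, not their actual implementation; the torchrun launch environment and the toy model are assumptions made for the example.

```python
# Hedged sketch: a gradient all-reduce over the NCCL backend, the kind of
# collective that CCLs provide for data-parallel training. Assumes processes
# are started with torchrun (which sets RANK, WORLD_SIZE, MASTER_ADDR, ...).
import os
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """Sum every gradient across ranks, then divide by the number of replicas."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

if __name__ == "__main__":
    dist.init_process_group(backend="nccl")             # NCCL drives GPU collectives
    local_rank = int(os.environ.get("LOCAL_RANK", 0))   # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()          # toy model replica
    data = torch.randn(32, 1024, device="cuda")         # this rank's shard of data

    loss = model(data).square().mean()                  # toy objective
    loss.backward()
    average_gradients(model)                            # roughly what DDP automates
    dist.destroy_process_group()
```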
Hmm. Super, super insightful. I think that's, yeah, that's quite exciting.
ACM ByteCast is available on Apple Podcasts, Google Podcasts, Podbean, Spotify, Stitcher, and TuneIn.
If you're enjoying this episode, please subscribe and leave us a review on your favorite platform.
We touched on a couple interesting things on the hardware side. You know, you covered MPI.
One of your key contributions has also been, and actually, as we talk about AI, popularizing
3D parallelism for actually training these AI models.
Can you maybe help us understand what that means and why it's so effective?
Yeah, that was very simple, very interesting.
Well, in some sense, that happened very early on, that was 2017, 18, end of 2017, actually.
When you look at these models, that at the time, most people were using simple data parallelism.
So how does data parallelism work?
You would replicate your model to,
let's again work with these 10,000 accelerators,
you would replicate your model essentially 10,000 times,
and then you would run different data
through each of your replicas of the model.
And then each of these replicas would learn from the data,
and what you would do is you would essentially globally average
either the gradients, the update to the weights,
or the weights themselves,
that's now a technical detail.
And then each of these models would be updated
after that operation.
So that's great.
That's a very, very effective,
very cheap method to implement.
This is what everybody used.
But what happened then, in about 2017,
at various companies,
I was at the time,
actually slightly later, on a sabbatical
at Microsoft, helping them build
lots of infrastructure,
and you can now guess
what that was used for
in the OpenAI collaboration.
But the issue was
that we couldn't store a model anymore
on a single accelerator
because the models got too big.
at the time the models were getting bigger and bigger and bigger
and they simply exceeded in size the memory capacity of an accelerator
Well, that's bad. If your model doesn't fit, you can't replicate it,
so now we would have to split the model across multiple accelerators.
And the easy, first option is: well, each of those models consists of multiple layers,
that's a very simple observation, so you could now put
the first half of the layers on accelerator one
and the second half of the layers on accelerator two. Okay,
that is called pipeline parallelism.
You quickly realize that during training
this is actually complicated,
because you would have to go backwards
during the so-called backward pass
and update the weights using the previously stored activations.
So it's again a technical detail,
so you have a forward and a backward pipeline.
I don't want to go into too much detail.
You can watch lots of talks
that I have on YouTube online that explain that.
But that is one dimension, so to say.
You have one dimension that is
the model replicas,
let's say this is the vertical
dimension. And then we have the horizontal dimension where we now cut each replica into multiple
pieces. Now if you have 10 replicas and you cut each replica into 10 pieces, so each piece has
one tenth of the layers, then you already employ 100 accelerators, 10 times 10. That's 2D
parallelism. What we realized then very quickly in many internal projects, which we didn't
publish at the time, but it doesn't matter, was that, well, if you actually do this, your training
would simply be slower,
because you know
for each replica you would need
to go through multiple layers
and if each set of layers
is on a different accelerator
you would now not make the whole thing
faster, you would actually make it slightly slower
in terms of latency
you would make it faster in terms of throughput
because you can pipeline it
like while the first accelerator
is idle it can already process
the next data point
while the second accelerator processes
the first data point so it's really like pipelining
so that gets you a higher throughput
but also higher latency
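As a back-of-the-envelope illustration of that throughput-versus-latency trade-off, here is a tiny sketch with made-up numbers (not figures from the episode): with S pipeline stages and B micro-batches, the whole batch finishes in roughly (S + B - 1) stage-times, while a single sample still has to cross every stage, so its latency is never better than running the unsplit model, and in practice a bit worse once communication is added.

```python
# Back-of-the-envelope model of pipeline parallelism; all numbers are made up
# for illustration, and real systems add communication and bubble overheads.

def pipeline_times(stages: int, micro_batches: int, full_model_time: float):
    """Return (per-sample latency, total batch time) for an S-stage pipeline."""
    stage_time = full_model_time / stages                  # ideal balanced split
    latency = stages * stage_time                          # a sample crosses all stages
    total = (stages + micro_batches - 1) * stage_time      # pipeline fill + drain
    return latency, total

S, B, T = 10, 64, 1.0        # 10 stages, 64 micro-batches, 1.0 s full forward pass
lat, total = pipeline_times(S, B, T)
serial_total = B * T         # one hypothetical big accelerator, one sample at a time

print(f"per-sample latency: {lat:.2f} s (no better than the unsplit model)")
print(f"pipelined batch:    {total:.2f} s vs serial {serial_total:.2f} s "
      f"-> ~{serial_total / total:.1f}x throughput")
```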
So now the question is, how do I get the latency down?
Well, I actually get it down if I now parallelize each of these layers.
So remember, we have 10 model replicas.
Each of these 10 model replicas is distributed to 10 different machines layer-wise.
And now on each of those 10 machines, we cut it again in 10, but we now parallelize each layer.
And that is often called tensor parallelism or operator parallelism,
that now allows us to be faster again
because if I solve each layer with 10 accelerators
this could in theory be 10 times as fast
in practice it's usually slower
because of Amdahl's law and all kinds of overheads,
but that would give you the third dimension,
and then you would have 10 times 10 is 100,
times 10 is 1,000 accelerators employed
in a logical three-dimensional communication pattern
because you only communicate
with your neighbors in each dimension
That is quite nice for designing systems, that insight.
And that's what many people called a three-dimensional parallelism.
Today it's actually somewhat outdated because today there are more dimensions that people added.
So today we are talking about four or five-dimensional parallelism.
Actually, I know a total of six, but some people argue the sixth one.
So you can extend this model into more dimensions.
And that was the basic idea of how to parallelize deep learning workloads
or distribute them and parallelize them.
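One way to picture the three-dimensional layout described above is to place every rank on a logical (data, pipeline, tensor) grid. The sketch below, with made-up sizes of 10 x 10 x 10, only computes which model replica, pipeline stage, and layer shard a given rank would own; frameworks such as Megatron-LM or DeepSpeed do equivalent bookkeeping before wiring up the actual communication groups.

```python
# Hedged sketch: mapping flat ranks onto a logical (data, pipeline, tensor) grid.
# The sizes are made up for illustration; real frameworks do this bookkeeping.
from dataclasses import dataclass

DATA_PARALLEL = 10      # number of model replicas
PIPELINE_PARALLEL = 10  # layer groups (pipeline stages) per replica
TENSOR_PARALLEL = 10    # shards of each layer (tensor/operator parallelism)
WORLD_SIZE = DATA_PARALLEL * PIPELINE_PARALLEL * TENSOR_PARALLEL  # 1,000 ranks

@dataclass
class Coord:
    replica: int  # which copy of the model (data-parallel dimension)
    stage: int    # which slice of layers (pipeline dimension)
    shard: int    # which slice within a layer (tensor dimension)

def coord_of(rank: int) -> Coord:
    """Decompose a flat rank id into its 3-D grid coordinates."""
    shard = rank % TENSOR_PARALLEL
    stage = (rank // TENSOR_PARALLEL) % PIPELINE_PARALLEL
    replica = rank // (TENSOR_PARALLEL * PIPELINE_PARALLEL)
    return Coord(replica, stage, shard)

# Each rank only talks to its neighbors along one dimension at a time:
#   data dim:     gradient averaging with ranks sharing (stage, shard)
#   pipeline dim: activations to/from the adjacent stage in the same replica
#   tensor dim:   partial results with ranks sharding the same layer
for rank in (0, 123, 999):
    print(rank, coord_of(rank))
```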
As we think about the benefits of this, obviously, you know, you're accelerating performance, but are there
cost benefits? Is this more energy efficient? What drives all these large organizations
to actually pursue these advancements? And as you think about the real-world implications beyond just
scaling up the performance of compute, are there other side benefits of these advancements as well?
Well, it's partially a performance benefit. Yeah, it is a performance benefit, but it's often not a cost
benefit. What you gain from
parallelizing is
capability. So as I mentioned,
the model would simply not fit on a single
accelerator. So you must parallelize
it to have a larger model.
And the major progress in
recent years in AI came from
larger models. There's this very famous
Kaplan et al. paper from OpenAI
on the scaling laws of models.
And that basically says that as you
increase your model size, you will
get more and more capability.
So that was the hope at the time. We now know
it's true to some extent
and so we needed to go
parallel to enable larger models
that was the first one
so it's an opportunity
that it enables,
an experiment
that we were really able to run,
or that many were able to run,
and so that was important
and then secondary was of course
after you have enabled
that capability
you needed to get that capability
as fast as you can
because what you also know
is that there is competition in the field
you basically want to be faster than your
competition, and then you use high-performance computing, now that you can load and train the
large model, to make the training as fast as you can, so that it only takes you two months and not 10
years. These are actually realistic numbers: typically it takes about two months, but without many
optimizations it would easily take you 10 years, and of course in 10 years, I mean, you'll be
retired, also I'll be retired, and so it doesn't matter anymore. So you also need to make this
progress at a certain rate. And very often both of those come at the cost of, well, cost.
The more you parallelize, usually the more expensive it gets.
So you have to find an interesting sweet spot.
By parallelization, you will very rarely save costs.
You will improve performance.
You will improve capability, but then it usually costs you more.
But the nice thing is, even in theory, you can very often prove that your cost overhead is only a logarithmic multiplier.
It's not linear or worse.
Super interesting.
Okay.
Very, very insightful.
You know, I think, you know, you mentioned in some ways how this is a core breakthrough
out of necessity, right, to be able to support these large-scale models we had to make
advancements in how we think about the 3D parallelism. Can you maybe share a story where
your research and system design has really enabled a very interesting scientific or AI
breakthrough? A specific story. I mean, yes. So we had this, that's now on the modeling
side rather than the AI things; there are many NDAs that I have to be careful with there. On the scientific
modeling side, we are actually partnering
with a wonderful team at ETH Zurich
simulating
heat dissipation in transistors.
The whole field of computing
is driven by essentially by Moore's
law, or has been driven the last
40 years by Moore's Law, which
roughly says that every 18
months, the number of transistors
that you have at your disposal
somewhat doubles and everything else
is constant. The cost is constant. The energy
consumption is constant. I mean, if you assume
Dennard scaling as well. So basically, you get a
free doubling every 18 months. Well, the 18 months has shifted, it's now getting longer.
This was due to the fact that elements got smaller and smaller and smaller.
And on the transistor side, what happened then is they got so small that now quantum effects play a role.
As you know, a transistor works by switching between letting electrons through and not letting electrons through, like a zero or one state.
Either a high resistance or a low resistance, based on the electrical signal on
one of the contacts.
So now, even if it's supposed to be closed,
unfortunately, some electrons tunnel through.
That is called the leakage energy,
and that is a huge problem,
because they don't really close well
if they get very small, because electrons tunnel through.
And so we worked together with the scientist
Mathieu Luisier and his team,
who is an expert in simulating these effects,
to simulate a realistic size
set of transistors
in order to build better transistors in the future
because these transistors,
you can now imagine you can build them in different shapes,
and these shapes have very interesting trade-offs
about the loss energy,
the functionality of the transistor,
so how many electrons you need to actually close the gate
and to open the gate,
so that's your power consumption, right?
It gets quite complex in the details,
and I don't want to go into too much of the details.
You can read this yourself.
And so what we helped with there is make this application
enable the largest transistor simulation run
that we have ever been able to do,
or that scientists worldwide have ever been able to do,
and improve the performance of this run
by about 100 times, 98 times or something.
And so that was a breakthrough
that enabled many manufacturers of transistors
to build better designs,
to simulate better designs for their transistors.
That was 2019.
That's when we got the Gordon Bell Prize for this,
which is one of the highest awards in the field.
Yes, that was a breakthrough that we enabled,
I always joke and say,
well, we use transistors to make better transistors in the future.
And much of that credit, of course, goes to Mathieu Luisier, who is the actual scientist.
I have only dangerous half knowledge in how transistors actually work.
He knows that for sure.
But my team contributed the performance optimization, which partially enabled that breakthrough.
It's always two things, right?
You need the science case, and you need the performance to enable the science case.
So it's really beneficial to work together with scientists in our field.
Interesting. Okay.
Super, super insightful.
And maybe as you look ahead, what do you believe will be some of the most important shifts or innovations in HPC and scalable parallel computing that will redefine how we solve complex scientific problems, such as the ones that you've cited, but also real world problems and AI?
Or perhaps how do you think about some of the foundational assumptions of HPC that perhaps we need to rethink as we think about solving the next sort of wave of scientific and real world problems?
In HPC, the field has been for a very long time driven by computational science.
Literally, these big computers, since I can think, were built to solve big modeling and simulation problems.
And only recently, around 2019, this all started to switch to AI.
AI as a workload on high-performance computing systems did not really exist in a significant manner before 2019, let's say,
maybe 2018, I don't want to pin down the exact number, but recently.
And so they were all designed to solve these high-performance computing problems, and that was great.
And one of the differentiating factors is what you needed to solve these modeling and simulation problems.
For example, if you want to predict the climate on Earth in 30 years, you need 64-bit floating-point precision, FP64.
Now with a new revolution of AI, we realize that actually for AI systems, you don't need that much precision.
Because interestingly, if you look at biological neural systems, like my brain, for example,
my brain can differentiate something in the 20s at different voltage levels.
And if you think about this, something in the 20s voltage levels is 4.6 bits of precision.
So my brain runs at less than 5-bit precision.
So why would we run AI models at 64-bit precision?
Of course we don't.
As I mentioned earlier, DeepSeek pioneered the use of 8-bit in training.
So we are now using 8-bit precision in many training and inference systems.
Many still use 16, but I think we are going to 8-bit.
So that's now a problem for traditional HPC,
because the AI field is 95% of the revenue of NVIDIA and the other companies that make hardware.
So they will naturally focus on where the money is.
So they will naturally focus on low precision computations.
And HPC needs high precision computation.
So I think what we need to invent at this point is how we deal with
that in the modeling and simulation field.
And I have to be careful, I made a mistake.
I shouldn't have said HPC needs this,
because AI is a field within HPC.
Modeling and simulation needs this.
So I think you really need to innovate
to not leave the scientific simulations behind.
But then the other opportunity is,
and this is very important,
and this is going to launch a whole lot of different works.
I mean, in my group, I'm working on this.
Many people work on this; Jack Dongarra
works on this, and he leads that field of mixed precision.
So that's already well underway.
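To make the precision discussion a bit more concrete, here is a hedged sketch of generic mixed-precision training in PyTorch: the matrix multiplications run in bfloat16 inside an autocast region while the master weights and gradients stay in float32. This is a simplified, generic example, not DeepSeek's FP8 recipe, which relies on specialized kernels that are not shown here.

```python
# Hedged sketch of mixed-precision training: low-precision matmuls, fp32 weights.
# Generic illustration only; FP8 training needs specialized kernels not shown.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(
    torch.nn.Linear(512, 2048), torch.nn.GELU(), torch.nn.Linear(2048, 512)
).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(64, 512, device=device)
target = torch.randn(64, 512, device=device)

for step in range(10):
    optimizer.zero_grad()
    # Low-precision region: activations and matmuls run in bfloat16 here,
    # while the parameters themselves remain stored in float32.
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        loss = torch.nn.functional.mse_loss(model(x), target)
    loss.backward()      # gradients land back in the fp32 parameter dtype
    optimizer.step()
```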
A much bigger possibility I'm actually seeing
to revolutionize modeling and simulations
is the use of AI techniques themselves.
So AI for some of this.
But that is very dangerous.
You have to be very careful.
You can easily get fooled
because these AI models,
they always learn how to cheat really well.
So they take all the information you give them
in kind of weird ways to make their predictions.
And sometimes you can catch them cheating.
Like, for example, if you're fascinated by a climate model that runs stably for a very long time,
you may realize later that, well, if you feed the date of the prediction into the model,
it may not actually predict the future, but just predict the most likely weather on the 15th of January,
consistently, for the next 100 years.
That is not a useful prediction, because it does not take into account how the CO2 concentration in the atmosphere changes over 100 years,
or whatever you want to predict;
you just get a likely sample
for the 15th of January.
So that's something very dangerous.
And so physics-based simulations
do not have that problem
because they often work by first principles.
You can prove that your prediction
will have certain properties,
while with AI models you typically can't prove that.
And so that's a challenge.
You need to figure out
how to use these data-driven methods
to reliably predict
physical simulations
or physical processes.
And that's going to be super exciting as well.
I'm mega-hyped about this.
There are many, many approaches today, physics-informed neural networks and many data-driven approaches,
but I think there's a lot to be done.
And then the third one, which is, in some sense, the most meaningful one, but also, well, the biggest one in some sense, is: can we use LLMs and these artificial intelligence reasoning models that we have right now and turn them into scientists?
So can they replace our reasoning, proving, hypothesizing process?
And I strongly believe they can at least augment it
and help us in interesting ways
So I always like to talk about constructive hallucination
in that context, because if I explain what I do as a scientist:
I make an educated guess,
and I call this a hypothesis,
and then I go and prove this educated guess,
and once I have proven it, I have extended the knowledge
that we have as humans.
But it started from the educated guess,
because if I hadn't guessed,
I would not have
been able to come up with the hypothesis,
the research hypothesis, and I wouldn't be able to prove it.
So,
many models can actually do this
educated guessing as a form of
constructive hallucination.
You could train a model to make an educated guess,
because they have a lot of knowledge;
it's well within the
interpolation space, so it's well within their knowledge
base to make this guess. And then we just need to convince them to prove it. So that's
a big part of the scientific process. And so we're doing a lot of research in this area,
using reasoning models to help with that process, to augment scientists. So in some sense, this is
then, well, if you manage to do this, these models can invent simulations. They can do anything
that humans have invented to some extent in the past. Then it gets interesting. And so
these are the three tiers: moving scientific simulation forward through using AI techniques,
replacing parts of scientific simulations
with AI and data-driven techniques,
and then enabling AI to do all of the above.
Super insightful, yeah.
I think it certainly speaks to the important role
that the technologies that supercomputers
and high-performance computing are enabling
will also have reverse effects on the field itself.
So I think that's quite exciting.
Your work, and you touched on it earlier,
going back to your early days
and some of your contributions with
MPI barrier. You talked about this idea of doing work that has reached into the real world
and practical impact. And so with your long-spanning career and some of your contributions,
not just in academic research, but in actually building infrastructure and real-world impact
that has enabled very, very tangible use cases, how do you generally think about bridging theory
and practice in your work? And perhaps what advice would you have for younger researchers
entering this space?
That's a great question.
So I have learned over time.
So I'm sitting in a weird spot because as I mentioned, I started with mathematics.
I'm kind of a hobby mathematician.
But then I realized that it's much more fun for me to build systems that actually work
and make a difference.
So somehow I'm an engineer now.
But I'm an engineer who really appreciates math and models trying to develop deep understanding.
In my understanding, mathematics is nothing but a simplification of the world,
with models that enable us to think much more cleanly and clearly about what the world does.
So many people think math is complicated, but it's actually a simplification of what's going on
in engineering. So I'm an engineer, and then as an engineer, you have to be very careful
that you engineer systems that are actually useful. And I learned later in my career,
not as a student, and maybe a little bit late, that it's very important to stay connected to
the real world. And so this is when I started spending time in industry myself.
I would give this as a recommendation to many.
I would not do it too early because the problem is, of course, by definition, industry,
they have to generate revenue for their stakeholders.
So by definition, at industry, what you have to do is you have to support the mission
of the company you're at.
By definition in academia, what you do is, first of all, while you're in your educational
process, you build your own knowledge, you educate yourself, mostly a selfish process;
while you're learning, you're ingesting all that knowledge
to later be a better industry player in some sense,
or be an academic.
And the question is, when you make this decision,
do I want to go to industry or to academia,
you have to be very conscious about this.
Because during your studies,
you were funded by society,
either your parents early on or later with stipends or whatnot,
you're funded by society to improve yourself.
But then when you make the decision to continue in academia
and to educate other people
and do open, blue-sky research,
or to go to industry
you have to be very conscious
because when you go to industry
typically as I mentioned
you're there to generate a profit
for the stakeholders including yourself
in academia what you're doing
is you're helping other people
to be educated
and not necessarily generating a profit
So now,
I wouldn't say that one of them
is better than the other,
I'm not even sure.
I like a combination of both,
because in order to generate profit,
you need to do something that's societally relevant.
In academia, you can very easily get lost and do something that's not relevant for anybody,
but it's fun.
Yeah, this is very easy.
And so now I think there's a very fine line that I myself chose to work on
in between academia and industry.
So basically what I always do is I go to industry and watch them
solve real world problems and pick the hardest problems out of there
and then move them into an academic context to solve them.
and then when they're solved,
I try to bring them back to benefit society.
It's an interesting mix of the two,
and I would recommend everybody should or could find that mix.
It really depends on the person.
I mean,
I know many people who are perfectly happy
developing theorems all day long
that may or may not be relevant.
Many of those turn out to be extremely relevant later,
but at the moment they develop them,
they may not be relevant.
But then later, 50 years later,
they get the touring award for it.
So this is always possible, absolutely.
And then there are people who are
happier with immediate feedback, like myself, where I'm seeing, okay, I architected that system.
That's one of the largest systems on the planet. That is great, and I can be proud of it.
So it really depends on what you want to achieve and what you want to get as
feedback from society. I think those are extremely, extremely valuable bits of advice.
For one, it's certainly a personal journey and the kind of gratification that you seek in
the real world impact of your scientific contributions may vary person to person.
I think great, great pieces of advice on, you know, staying close to industry, perhaps not too early in your career, but at some point.
And I think your, you know, approach of identifying the most compelling or hardest problems, bringing them back to an academic sort of research environment, achieving those breakthroughs and then ensuring that they are able to see real world impact by, you know, hopefully bridging that, that research to practice, I think, extremely, extremely valuable.
So I think that leaves us with really, really valuable takeaways for our listeners and a great point for us to wrap up on.
And I think through this conversation, certainly we've uncovered that, you know, high performance computing isn't really just about speed.
Though while that is one aspect, you know, it's really about unlocking, you know, entirely new possibilities in science and in society.
And so as systems grow more complex and more impactful, work like yours, Professor Torsten, will ensure that they stay fast, efficient,
and trustworthy.
And so thank you for your work
and thank you for joining us on ByteCast.
Thank you very much for inviting me, Bruke.
It was fun.
ACM ByteCast is a production of the Association
for Computing Machinery's Practitioner Board.
To learn more about ACM and its activities,
visit acm.org.
For more information about this and other episodes,
please visit our website at learning.acm.org.
slash B-Y-T-E-C-A-S-T.
That's learning.acm.org
slash bytecast.