@HPC Podcast Archives - OrionX.net - @HPCpodcast-92: Torsten Hoefler on Age of Computation

Episode Date: November 15, 2024

A lively discussion about the Age of Computation, Ultra Ethernet, datacenter power and cooling, the creative process for AI, model certainty for AI, AI and emergent behavior, and other HPC topics.

Audio: https://orionx.net/wp-content/uploads/2024/11/092@HPCpodcast_SP_Torsten-Hoefler_Age-of-Computation_20241114.mp3

Transcript
[00:00:00] Are you attending SC24 in Atlanta? Drop by booth 2201 on the show floor of the Georgia World Congress Center to see Lenovo's latest Neptune liquid cooling infrastructure for the next decade of technology, as well as their new large language model systems. Learn more at lenovo.com/neptune. Now, you're right, our golden age starts, and high performance is going to be one of the major things we have to achieve in order to enter this age. We basically made AI computations, both training and inference, a thousand times more energy
[00:00:40] efficient than they were before, and much of that is in production. Ethernet at the bottom, although it's a flood that keeps rising, and now you have Ultra Ethernet, InfiniBand, PCIe, and then things like OpenCAPI or NVLink, and now Ultra Accelerator Link. These mega data center providers, these people building very large AI systems, they want Ethernet, they want uniformity, they want to build systems that they're used to. From OrionX in association with InsideHPC, this is the @HPC podcast. Join Shaheen Khan and Doug Black as they discuss supercomputing technologies and the applications, markets, and policies that shape them. Thank you for being with us. Hi, everyone. I'm Doug Black of Inside HPC.
[00:01:26] With me is Shaheen Khan of OrionX.net. It's a great pleasure for us to welcome Torsten Hoefler. Torsten is a professor at ETH Zurich, where he directs the Scalable Parallel Computing Laboratory, and he is also the Chief Architect for Machine Learning at the Swiss National Supercomputing Center, as well as a consultant for Microsoft on large-scale AI and networking. He is also one of the rising young luminaries in the HPC and AI community. So, Torsten, welcome. Thank you. Great to be with you. So, I know you have a particular interest in Ultra Ethernet and the implications it holds for HPC and AI. What are some of the more interesting new developments in that area that you're looking at?
[00:02:12] Yeah, Ultra Ethernet is amazing. But here I have to say that I'm very biased, because I'm one of the people who helped to kick it off originally, in my role as consultant at Microsoft. And here I'm also the co-chair of one of the biggest working groups, the transport working group, with about 700 people subscribed to the mailing list, which is amazing, though regular attendance is more like 100 people, which is still a lot. So that's a disclaimer. And I personally think Ultra Ethernet is going to define the future of large-scale networking for AI as well as HPC, because it flattens the market in some sense. So it enables all the providers, many different companies. So far, we have 83
Starting point is 00:02:51 different member companies, including NVIDIA, that can now build a product that is compatible among different companies and enables mega data center providers to deploy extremely large systems, systems of the scale that we have seen with 100,000 GPUs, but even much larger, easily go to a million endpoints with all the addressing modes that we have. Is UltraEthernet, is it at the same performance level as InfiniBand, or is it still working toward that? It is designed to be at least as good as InfiniBand. So it has all the modern features that you would need on top of Ethernet, for example, packet spraying, which allows for some form of adaptive routing, scalable endpoints. So it's actually much more scalable than InfiniBand is today, or Rocky for that extent.
Starting point is 00:03:37 And it also has security built in as a first class citizen, which was also added later to actually InfiniBand as well as Rocky. So in some sense, Rocky is kind of the InfiniBand for Ethernet networks, but it inherits all the goodness and also the problems of InfiniBand. So I believe it could be much better. First in our pre-call, you were also referencing the big AI providers and cloud providers and how they have a bias in favor of standardization uniformity, and that basically requires Ethernet. So to what extent are they driving these new emerging standards? They're very much driving it in the sense that they're the customers. They decide what they buy. And if you look at the largest deployed AI clusters in the recent past, like Metasphere cluster, the XAI cluster that Elon
Starting point is 00:04:27 Musk's company deployed, and many other clusters of that scale, so extreme scale, they're all using Ethernet. None of those uses InfiniBand. And that is just a sign that these mega data center providers, these people building very large AI systems, they want Ethernet, they want uniformity, they want to build systems that they're used to, that they've been deploying for the last couple of years, last couple of decades, actually. And it's very similar at Microsoft. So you just want to build things that you know how to maintain, and that have proven to exist for a long time and have proven to work for a long time. Furthermore, Ethernet right now is the dominating interconnect. So it's about 600 million ports per year that are being deployed.
Starting point is 00:05:08 That is about a thousand ports per minute, if you think about it. So all of this hinges on the expectation that UltraEthernet will match or at least get close to and maybe even exceed performance by InfiniBand. Is that what to be expected? Well, that's what we were certainly hoping, but the future will show. So we will release the spec hopefully soon. Once the spec is released, various vendors will go off and have software as well as hardware implementations of the spec. And then we can benchmark those in the field and we'll see. Yeah. In fact, Doug and I were commenting on AMD's recent announcement when their networking was ultra-Ethernet compliant, even before the specification is released.
Starting point is 00:05:52 That's certainly adventurous because the specification will still change in various aspects. So I'm not sure if that can be compatible. Well, it is a very programmable device. So we figured that they think it's sufficiently programmable to match the specification the way it seems to be device. So we figured that they think it's sufficiently programmable to match the specification the way it seems to be going. So we didn't fault them too much, but it was an indication that ultra-Ethernet is coming fast and furious, and they want to allude to that. Absolutely.
Starting point is 00:06:16 And yeah, programmable devices can definitely adopt to the changes in the spec, yeah. Now, what are the timelines? What we've seen with PCIe, for example, like four comes in, and then five was announced some years ago, they were talking about six or seven, and five is just starting to kind of happen. How long if you have an organization, which is basically a direct democracy with a lot of players, and the number of players has been growing very quickly. So I believe we had seven founding companies. And now, as I mentioned, it's 83. And we have 700 people on the mailing list,
Starting point is 00:06:56 or 750 even, I think, on the working group I'm co-chairing with Karen Schramm from Broadcom. And if you're running meetings with 70 to 100 people in the meeting, then things don't move super fast because you need to take everybody along. But we hope to get this done relatively soon. And by relatively soon, I mean, hopefully early next year, the original announcement was end of this year. There will be some announcement at SC, but hopefully soon.
Starting point is 00:07:22 Right, right. Can I ask possibly a dumb question? We follow silicon photonics, yeah, optical IO, sorry. And we've had guests on and so forth. How would that play within Ethernet? Is that just a whole different thing or would it be incorporated within ultra Ethernet or how would that work?
Starting point is 00:07:42 Oh, it's absolutely compatible. So I mean, this is not a dumb question. It's a very good question. The question is, where does your transceiver sit? There are all kinds of silicon photonics pieces which are, for example, CPO, co-packaged optics is one of those. There are various vendors
Starting point is 00:07:58 like Broadcom who have very strong offerings for co-packaged optics in switches or also in NICs in the Ethernet world and at the end it's just essentially how you implement your signaling layer like you can do this optics you can do this over silicon photonics if you have your transceiver on chip or close to chip in a cpo setting or you can do this in your in your pluggable optics cable essentially in the plug itself so yes very relevant so when I think of the
Starting point is 00:08:25 interconnect, starting from Ethernet at the, let's say, bottom, although it's a flood that keeps rising, and now you have ultra Ethernet, and then you have InfiniBand, PCIe, and then things like OpenCAPI or NVLink, and now ultra accelerator link, and then you go on top of the chip and chiplet UCIE, that entire spectrum, do we need every member of that set? Or are we expecting that a couple of them will emerge as dominant? And really, if I do it on the chip, on the rack, and in the data center, I'm kind of done? I think that's an excellent question, actually. So many of those are widening their scope. I mean, Ethernet is widening the scope, certainly. So with UltraEthernet, Ethernet is going from the traditional data center interconnect medium performance or low performance to an extremely high performance setting. So to compete in HPC
Starting point is 00:09:19 with all these proprietary interconnects. So it may wipe out many of those if it's successful. And then the next question is, well, will it actually go towards more rack scale local deployments? And UltraEthernet is designed to be a data center interconnect. So it will cover cable lengths up to 150, 250 meters in a single data center. It will also go down to shorter range. So you can also deploy it within the rack. But then the physics changes. And as the cables go shorter, you will have lower latency and you will probably have higher reliability,
Starting point is 00:09:57 even though that requires good plug. If you go to electrical, then everything changes again. But let's just assume there is a different set of requirements. And so far, UltraEthernet is tuned for the large-scale deployment in large-scale data centers. It's relatively easy to adopt it to rack scale, and then it can actually compete with local interconnects such as NB-Link or things that are in discussion in UA-Link. And eventually, I actually believe much of that will be Ethernet. So many systems today are deployed with Ethernet at the rack scale already. And that seems like a natural
Starting point is 00:10:30 movement just given the dominance of Ethernet in that area. But then if you go on chip, like very short range on chip or on package, like you mentioned UCI, the chiplet interconnect, then I would say Ethernet may not be the right format. Because in Ethernet, you have pretty large headers, you have pretty large packets, it may not be the right format, even though I'm not sure. So we may be able to tweak Ethernet to be the right format in this area. I wouldn't get against it for sure. But right, current version definitely is not. So is Ethernet kind of emerging like a comment about Fortran that I don't know what the language of the future is, but it's going to be called Fortran? Absolutely, absolutely.
Starting point is 00:11:11 Interconnect of the future will be Ethernet regardless, but you're right, it is a pretty heavy protocol. Yep. And I remember another joke in the old days was that it doesn't matter what the hardware is, Ethernet was never going to do better than one megabyte per second. Well, certainly that has been. Well, we have exceeded that. Right. Now, Tarsten, I know recently you gave a lecture that's on YouTube. We'll provide a link in the little blurb I write to accompany this conversation. But the topic is the age of computation and
Starting point is 00:11:42 fascinating presentation on your part, where you kind of, you know, where we are now in the overall, I guess, progress march of mankind, starting with the Stone Age, but in the emerging era that we're in, which you're calling the age of computation. Could you explain what that means?
Starting point is 00:12:00 Yeah, the observation is that we have lived in the technology age, basically. And actually, the fundamental observation is very interesting. With the development of humanity, and in that talk, I started the Bronze Age. At some point, human strength became less relevant or nearly irrelevant. That was at the end of the Industrial Revolution when we had trains and tanks and cars and steam engines and engines. So human strength became irrelevant. And then we switched to attack in the technological age that just recently started. We looked at attacking intelligence. So what happened there is that we went from the atomic age, the age of energy discovery, to through various ages that are a rapid succession of the internet age and most recently the data age
Starting point is 00:12:45 where the internet enabled us to collect all the human data in central places and run analysis on this. And my claim is that we're coming towards the end of the data age. Because first of all, the amount of human generated data is relatively small compared to the amount
Starting point is 00:13:01 of machine generated data. And second, you can see this also in the development of companies like the company NVIDIA, which is the second largest company at this point. When I recorded that talk, it was the third largest, but as of last week, it's the second largest. And in fact, it's only 1.7% of Apple, which is the largest company in terms of total market capitalization. So it'll probably surpass Apple relatively soon, the remaining 1.7%. And this company is a pure accelerator company. It doesn't provide many storage products. It just provides acceleration. That's the main business
Starting point is 00:13:36 that this company has. So somehow, we are running out of this data to learn from these new AI models. And now we need to look at, we really need to look at synthetic data. We need to look at ways for these models to play with themselves, to think about things and to invoke themselves recursively, basically, or iteratively, depending on how you define it. And that's the age of computation. So now, really computation, who has the most computation as a society, as a single individual even, will have a significant benefit over other societies and individuals. You can see today, if you know how to use AI systems, which are computation driven, like these are computational input output systems,
Starting point is 00:14:14 you have a benefit over people who don't know how to use this. And the same applies to countries. So now what's going to happen is that we will have a race for who has, who develops the most advanced and biggest computational capability. And maybe eventually human creativity and human intelligence will be made obsolete as a differentiating factor between humans, as is today human strength. Like it doesn't matter if I'm very strong or not,
Starting point is 00:14:38 I can operate a machine. And maybe later we will be going towards creativity and intelligence. And that's the age of computing. We achieve that with computation. Lenovo will be joining HPC enthusiasts and influencers at SC24 in Atlanta at the Georgia World Congress Center. Lenovo invites you to attend, visit, and experience.
Starting point is 00:14:59 Attend the new Lenovo Innovation Forums, where you will learn about Gen AI, liquid cooling, and more. Register at bit.ly slash lenovo forums that's bit.ly forward slash lenovo forums visit booth 2201 for interactive demos and over 20 booth theater sessions featuring lenovo experts partners and customers experience lenovo's latest liquid cooling and large language model infrastructure could i suggest a possible name change for the age of computation? Okay. How about the age of HPC?
Starting point is 00:15:34 Because it's all enabled by HPC. And we've talked about this, Shaheen, you and I have, and Torsten, that we often hear we're in the AI era, but it's all enabled by HPC. And HPC so often gets subsumed under the current rage of the day in technology. Yeah, absolutely. However, I would not rename it because the age of HPC has been going on for the last couple of decades already. I mean, even in these previous ages, HPC played a very significant role. And after all, the C in HPC stands for computing, which is very close to computation. And we could
Starting point is 00:16:12 also call it the age of computing. I guess that's a minor difference. And so really, now you're right. Our golden age starts and high performance is going to be one of the major things we have to achieve in order to enter this age. Absolutely. I agree. NVIDIA, which is all, as you say, it's an accelerator company. That's been their entire focus. Yeah. And this is how they've risen to the top.
Starting point is 00:16:34 So obviously, yes. per rack electrical power and how data centers have gone from being measured by square feet or square meters to megawatts and gigawatts. This also implies that whoever has the most computation also needs to have the most energy because I'm sort of observing that we've gone from 10 kilowatts per rack to now 150, going to 500, maybe even going to a megawatt per rack. How does that play with kind of the, I guess it's the geopolitical conclusion from whoever has the most chips wins kind of a thing? Yeah. Yeah. I mean, one interesting observation is you would say that as if it was scary if i built racks with a higher energy capacity actually these racks they're much more efficient than racks with a lower energy capacity because you are a lower
Starting point is 00:17:32 power consumption because you reduce the distance between the compute elements and as you reduce the distance you improve the efficiency however of course you will probably just deploy more so there's jevons paradox whenever you make something, you will just find bigger demand for it. So absolutely, that's going to happen. Use more of it. Yeah, exactly. Exactly. The interesting view here is that, I mean, this will happen anyway.
Starting point is 00:17:54 And what we HPC people can do is we can contribute to making this more efficient. And so here I'm extremely proud of our achievements, like my group's achievements and my own achievements in the past couple of years, where we basically made AI computations, both training and inference, a thousand times more energy efficient than they were before. And much of that is in production. So I have a talk where I explain how this factor of 1000 times is achieved. And that is extremely important for the future of our energy consumption, because many people say that this is a problem, this high performance computing, but no, we are the solution. We make these devices more efficient such that we actually use less energy.
Starting point is 00:18:32 And then, well, of course, we can use more of them as we just discussed. But again, this will happen anyway. As you can see from all the mega data center providers, literally all of them, Google, Microsoft, and AWS, they have announced that they're betting big time on nuclear energy, actually rebooting plans, partially rebooting plans that were supposed to be shut down just to deliver that energy in the near term, in the next couple of years, because it's the only way you can get it. Yeah, it does require a different political and social mindset to recognize that these computations aren't, quote, wasted. Yes.
Starting point is 00:19:09 They're not just generating heat. Because we had that debate a couple of years ago. I was joking that should we be thanking AI for taking the energy blame away from Bitcoin? But we had that discussion like a few years ago that is this all? So with AI is a little bit different, but what do we do to sort of recognize or project that the value of this is really for humanity and not just for the big companies that do it? The value distribution is an interesting question. But many of these big companies are actually freely sharing their weights and their models,
Starting point is 00:19:43 like Meta, for example. I believe the Grok model is open. And so there is a very interesting debate to be had here, what that means. But I believe we all already benefit from it in so many different ways. I mean, I use AI models. Of course, I'm consulting for Microsoft, so I use Copilot every single day. So most of my emails are written by that model or supported by that model and it's quite nice and it makes me really much more productive it makes me nicer whenever and so it's it has a huge added benefit right let's talk a little bit about cooling part of it too
Starting point is 00:20:19 because it seems to me that liquid cooling went from just sort of something exotic and interesting and maybe that we use during mainframes and hopefully no longer because we're not using ECL anymore. We went to CMOS and now it's come back with vengeance and it's now mandatory. What do you see going on there that people need to keep on radar? I mean, the liquid cooling is an enabler for these denser racks that you mentioned before. And it's, as you say, it's absolutely mandatory to have 100 plus kilowatt racks, you can simply not air cool, you cannot have an airstream fast enough to air cool those. And we will go further with liquid cooling. And there are many new ideas where you move the liquid cooling closer to the chip or even on the chip or even immerse the chip
Starting point is 00:21:03 in liquid with this immersive cooling strategies that enable us to build smaller and smaller enclosures, which really reduces the distance. And reducing the distance is absolutely key for connectivity. The price goes down, the energy consumption goes down significantly if you can reduce these distances. So efficiency goes up. And cooling is probably one of the most important challenges we have today to build these large-scale systems. So
Starting point is 00:21:30 in Microsoft, I know a significant number of people just looking at cooling because it's absolutely crucial. Now, on the data end of things, with computers generating, Torsten, we've certainly heard about data hallucinations. And I believe part of your talk was LLMs literally talking to each other, working in some facsimile of a collaborative manner. But how do you make sure that the new data that LLMs are generating is not false data, hallucinatory, or whatever? Yes, I know. There are many approaches. What you allude to is the graph of thoughts work that I was presenting. So the idea here is that you have multiple language models or even the same language model reasoning in multiple steps
Starting point is 00:22:14 through multiple invocations of itself or other models where it communicates through prompts, very much like we humans communicate. And the cool idea here is actually that while we, as we are talking here, we communicate, we exchange ideas. I take thoughts from my mind and inject it into your mind, and hopefully you come up with better thoughts and so on. And we do this at an extremely low bandwidth. We here have a couple of words per second, which is a couple of bytes per second. Language models, they can communicate at megabytes, gigabytes, and soon terabytes per second. Language models, they can communicate at megabytes, gigabytes, and soon terabytes per second. That's a billion times faster than what we are doing here. So that's an interesting thing to think about, an interesting fact to think about. So why don't you ask about hallucinations?
Starting point is 00:22:55 So first of all, I also believe we humans hallucinate, and that is called the creative process. So when I design something that's not there, I apply a creative process and then I check it whether it's useful. The unfortunate thing in language models, this check kind of is missing. The language models, they tell you something, they don't think about it. They just have to output the next token. So they are forced to communicate with you immediately. It's like you would force me to utter every single thought I have. And there would be a whole lot of junk that I would have to utter if I would be forced to do so. So really, this graph of thoughts is a way to give the language model a chance to think about things. So not immediately output the next token to you, but use these tokens in an internal process, and maybe not output them to the user,
Starting point is 00:23:41 but give them later as a condensed result to the user. So we can see traces of this in the OpenAI's O1 or Strawberry model, where the language model argues a little bit with itself. So now still, that doesn't mean that the language model understands what's the hallucination or what is not. So here we have another very nice research work that we call CheckEmbed. And I would recommend you look this up because it's really cool, Actually, it's very simple. How can I show or how can I prompt a language model and try to understand whether it's hallucinating? Well, very much like we would do it with humans. So you ask the exact same question in different words. Like if you do a questionnaire, for example, and you want to see if humans pay attention, you ask the same question with different words. And then you
Starting point is 00:24:23 see if they're just clicking random results or not. So there's a whole lot of theory about this. You can do the same thing with language models. So you ask the same question with slightly different input, and then you look at the output. How would you compare different outputs? Because a language model will design different text replies for each of those inputs. How do you now compare those efficiently?
Starting point is 00:24:44 And this is where Check and bet comes in. What we do is we take each of these output texts and we embed it using the same or another language model into the high dimensional space, into the high dimensional vector space where these language models work. And then we get vectors for each of those outputs. And now if these vectors are close to each other, because these vectors, as we know from the early works on embeddings, they kind of encode the meaning of a text. So now if all of these different outputs that were prompted by the same question with different formulations, if all of them are close in the resulting vector space, then the model is very unlikely to hallucinate, because after all, it has given you similar answers for different formulations. However,
Starting point is 00:25:24 if the model is very unsure, and it's forced to output something, if you look at the cross, I could go more technical. If you look at the last cross entropy, they are very often, these entropies are of the selection probability is very uncertain when the model hallucinates. For example, the next token is with 51% and a bear and with 49% a cat, right, then the model is not very sure as opposed to 99% a bear and 1% a cat or something like this. And that you can measure in these vectors, because now, if the model hallucinates, it'll give you different answers, because there is a noise component in this for very similar prompts. And if that happens, you can see that in the vectors
Starting point is 00:26:06 because the distance between these vectors is relatively large. And you can now programmatically analyze whether your model hallucinates or not by just looking at the output vectors. And of course, we have a paper on this and scientific evaluation. And so, but that's kind of the high level intuition, high level human intuition. And I can do the same thing to you. I want to see if you're lying to me, for example. I ask you the same question on day one and maybe a very similar question three days later. So now you need to remember what you told me on day one or you will
Starting point is 00:26:33 give me a different answer. And then I have detected a problem. So this basically covers the space where consistency is a proxy for correctness. Yes. For certainty, not for correctness, for certainty. For certainty. Exactly. That's right. Whether the model is certain or not. There is another way now to check whether a model is certain or not. So you could ask the model, if you're asking mathematical questions, for example, it's quite nice because the model can check itself. Like I'm very often doing when I'm writing a book together with a friend. And if I do a proof, I use Wolfram Alpha to help me to understand whether what I proved is actually correct. The language model can use
Starting point is 00:27:10 tools like model checkers or Wolfram Alpha as well to check itself. And it can talk to itself using these tools like we do. And it's quite nice. This actually really works in practice. Right, right, right. Torsten, the term emergent abilities in LLMs, is that kind of what we're talking about only in a more controlled way? Yes. I was first exposed to that idea about a year ago at a HPC user forum. And when you first hear it, that an LLM draws conclusions that it wasn't asked to draw, it struck me as eerie and maybe frightening. But as I think you pointed out, Shaheen, it's simply that we don't understand yet how this is done. It's not some sort of a mysterious.
Starting point is 00:27:49 My view was that the information content of the data that we supply to AI exceeds our ability to understand it. So if AI is doing something that looks emergent to me, it's because I just didn't know what was in my data. Not that it came up with something brand new. It just did a better job of managing and extracting information from the data. Is that true? Or do we think that it actually truly net new comes up with brand new ideas that aren't just a permutation of existing ones or extraction of information that wasn't accessible to humans? Well, this is a fascinating question and I'm baffled by this. So I would immediately agree with you what you said from a mathematical perspective, because after all these pre-trained
Starting point is 00:28:35 large language models, they learn the statistical distribution of language from a lot of examples. So what is the most likely next token, which is really a representation of a word or a piece of a word, given the previous tokens? This is what they do. It's a very simple statistic somehow with extremely large computation. So these models, they have up to multiple hundreds of billions of parameters, or actually trillions of parameters these days, like LAMA405B has 450 billion parameters. It operates in a 450 billion dimensional optimizations. This is absolutely crazy if you think about it. But now, the interesting thing is that there's actually proof that these models generate new knowledge that is
Starting point is 00:29:17 definitely not in the training data. So for example, there are various papers that show that these models can be used in a loop, like I just mentioned, to prove mathematical theorems that were open for a decade. And these models, they build hypotheses, they use the proof assistant to check if their hypotheses are correct. And if they're incorrect, they refine their hypothesis or change it or discard it. And they build new hypotheses, just like we humans do. And this is a fascinating thing. I think nobody can really explain how that follows from simply predicting the next token. Sometimes I'm wondering if I'm a language model, like,
Starting point is 00:29:50 how does my brain, am I predicting just the next token? Of course, I speak English that makes sense to others. So somehow there's something to it. So I don't know. But there is evidence that language models create new things. Interesting. Well, along those lines, it is also true that language models have been effective surprisingly more than their inventors thought it would, right? Isn't that true? Yes. Oh, absolutely.
Starting point is 00:30:17 So like, is there any, is there an understanding of why that is? Why is it that it works as well as it does? If we had that, we would probably be able to optimize them a whole lot. So far, I don't know. More scale, bigger scale, like they seem to generate new capabilities or have emerging capabilities with bigger scale. And now more iterative invocation. So not necessarily scale now, but now we invoke these language models with their own outputs again and resemble more of a human thinking process. Right. So along those lines, another question for me was, how come there are so many matrix multiplies in nature? That I'm not sure, actually. I would really blame Jack Dongara for this.
Starting point is 00:30:59 Yes, that's right. So he rightfully got the Turing Award because he made matrix multiply fast. And that was really the basis for much of the development, if you ask me. That was definitely inspired, yes. It happened because this forced these models to use matrix multiplies, and they happen to be very effective. However, we don't know if that is the best way. And it's dense matrix multiplication of all matrix multiplication. So we all know that our brains are not densely connected.
Starting point is 00:31:30 We all know that our brains are operating very differently from matrix multiplication. Even though you can model parts of the brain with sparse matrix multiplication, but at the end, you can probably model any physical system with sparse matrix multiplication because it's extremely powerful as a computation model. So that is a great question. And I don't know the answer. My personal theory is that we just found a good engineering solution
Starting point is 00:31:50 with these dense matrix multiplications to an information representation problem. It's very much like planes don't flap with their wings. We found an engineering solution for using the aerodynamics of air or of foils to fly. And we have wheels and there are not many animals with wheels. And so in this AI space, this may be the right representation for knowledge, but we don't know. And this is a very good question. Yeah. The flapping wing basically got decoupled into
Starting point is 00:32:19 his fixed wing and an engine. Yes. And it worked. Exactly. So let me ask you then, there was a Nobel Prize awarded for physics, and there was a controversy on what was so physics and what was awarded. Is the Nobel Committee redefining basic science by pulling computer science into it? Or is it just a continuation of multidisciplinary? Where do you land on that? Was this award really for physics or was it just an excuse to reward AI because it's important? Or is it really the insight that, oh my God, everything is a matrix multiply? Well, I'm actually on the side of the age of computation again. So computation is going to drive human development at least in the near future and probably even more in the far future.
Starting point is 00:33:05 And this is the beginning of this, as we are seeing. Computation has been used to discover physical phenomena and also chemical phenomena to an extent that the committee handing out the biggest award in those fields gave it to essentially computational scientists. And in the good old days, it happened. The saying was, as a computer scientist, you would never get a Nobel Prize. And now we may get all the Nobel Prizes soon. I love that a lot. I had a joke talk about the literature Nobel Prize, because that may soon be going to AI.
Starting point is 00:33:37 Torsten, can I ask, I know we only have a few more minutes, but looking ahead, there has been discussion lately about the limitations of LLMs. And I'm curious your thoughts about what's next in AI at the big picture level, the next model, if you will, and what sort of computation might be for the thing that comes after LLMs? Yeah, that is a great question. So I believe there will be an ecosystem of agents. People call those agents now, but these are really LLMs invoked in loops, in complex loops, LLMs that may have
Starting point is 00:34:11 been given different personalities. So for example, there was a very interesting paper on a software engineering company run by LLMs. And so one LLM was the tester and the other one was the designer. The other one was the coder. And so they were interacting in a loop, talking to each other. So the tester was trying to break what the coder coded based on the designer's input. There was even a marketing LLM that tried to design a web page for the thing. And it worked reasonably well. And so this is where we are going, I believe.
Starting point is 00:34:42 And it's really more computation of LLMs. The big question that I'm not sure how to answer is how they will scale, whether we will scale them much bigger than they are today. So today we are in the low trillion parameter models at the very high end. However, for all practical purposes, these less than a hundred billion, like the 70 billion models, they seem to be doing reasonably well. And the problem with these very large models is that they get extremely expensive to use. So you need a very large GPU farm to just fire up a model and get a reply for these very large models. So here is going to be a very interesting resource question. How many resources can I afford and how many can I get?
Starting point is 00:35:25 Perfect. Maybe one final question is just a follow-up from the panel discussion that was at SC23, I believe, about the future of supercomputing. And I remember your portion of it. I think it was very consistent with the conversation we've had today about the age of computation and also networking. What is your current perspective on how the future architectures will look like and where we will see advances? I think that's a very open-ended question. It is, yes, yes.
Starting point is 00:35:57 Well, the high-level view is going to be more parallelism. That seems to be fundamental. We don't scale single-core performance. We need more parallelism, more specialization in that parallelism to AI style workloads. So smaller data types, sparsity, I really believe sparsity will make a difference. It's very hard. It's much harder than exploiting smaller data types, but we'll make some progress there. We will build bigger systems because we have larger workloads and we have to worry more about the efficiency of those systems at the end.
Starting point is 00:36:26 So really, we are at the pole position with HPC to contribute to this development and also benefit from it. So high-performance computing, the high-performance simulation community will also benefit from the AI development if they can express parts of their problem in an AI context. And here I'm working very hard on making that happen as well in the AI for Science initiative on various extents. But that we have to talk about another time. Perfect, yes.
Starting point is 00:36:52 All right. Well, thank you, Thorsten. Always a treat. Really appreciate you making time. And I know it's evening your time. Thanks for being flexible about that. Oh, that was super fun. I'm super happy to do this again.
Starting point is 00:37:03 We absolutely will hold you to that. We'll take you out on that. Thanks so much. Thank you so much. Take care. Wonderful, wonderful. That's it for this episode of the At HPC podcast. Every episode is featured on InsideHPC.com and posted on OrionX.net. Use the comment section or tweet us with any questions or to propose topics of discussion.
Starting point is 00:37:24 If you like the show, rate and review it any questions or to propose topics of discussion. If you like the show, rate and review it on Apple Podcasts or wherever you listen. The At HPC Podcast is a production of OrionX in association with Inside HPC. Thank you for listening.
