@HPC Podcast Archives - OrionX.net - @HPCpodcast-92: Torsten Hoefler on Age of Computation
Episode Date: November 15, 2024
A lively discussion about the Age of Computation, Ultra Ethernet, datacenter power and cooling, the creative process for AI, model certainty for AI, AI and emergent behavior, and other HPC topics.
Transcript
Are you attending SC24 in Atlanta?
Drop by booth 2201 on the show floor of the Georgia World Congress Center
to see Lenovo's latest Neptune liquid cooling infrastructure
for the next decade of technology,
as well as their new large-language model systems.
Learn more at lenovo.com slash Neptune.
Now, you're right, our golden age starts and high performance is going to be one of the major things we have to achieve in order to enter this age.
We basically made AI computations, both training and inference, a thousand times more energy
efficient than they were before and much of that is in production.
Ethernet at the bottom, although it's a flood that keeps rising, and now you have ultra-Ethernet,
InfiniBand, PCIe, and then things like OpenCAPI or NVLink and now ultra-accelerator link.
That these mega data center providers, these people building very large AI systems,
they want Ethernet, they want uniformity, they want to build systems that they're used to.
From OrionX in association with InsideHPC, this is the AtHPC podcast. Join Shaheen Khan and Doug
Black as they discuss supercomputing technologies and the applications, markets, and policies that
shape them. Thank you for being with us. Hi, everyone. I'm Doug Black of Inside HPC.
With me is Shaheen Khan of OrionX.net. It's a great pleasure for us to welcome Torsten Hoefler.
Torsten is a professor at ETH Zurich, where he directs the Scalable Parallel Computing Laboratory,
and he is also the Chief Architect for Machine Learning at the Swiss
National Supercomputing Center, as well as a consultant for Microsoft on large-scale AI
and networking. He is also one of the rising young luminaries in the HPC AI community. So,
Torsten, welcome. Thank you. Great to be with you. So, I know you have a particular interest
in Ultra Ethernet and the implications it holds for
HPC and AI. What are some of the more interesting new developments in that area that you're looking at?
Yeah, UltraEthernet is amazing. But here I have to say that I'm very biased because I'm one of
the people who helped to kick it off originally in my role as consultant at Microsoft. And here
I'm also the co-chair of one of the
biggest working groups with about 700 people subscribed to the mailing list, which is amazing.
But regular attendance is more about 100 people, which is still a lot, which is a transport working
group. So that's a disclaimer. And I personally think UltraEthernet is going to define the future
of large-scale networking for AI as well as HPC because it flattens the market in
some sense. So it enables all the providers, many different companies. So far, we have 83
different member companies, including NVIDIA, that can now build a product that is compatible
among different companies and enables mega data center providers to deploy extremely large systems,
systems of the scale that we have seen with 100,000 GPUs, but even much larger, easily go to a million endpoints with all the addressing
modes that we have. Is UltraEthernet, is it at the same performance level as InfiniBand,
or is it still working toward that? It is designed to be at least as good as InfiniBand. So it has
all the modern features that you would need on top of Ethernet,
for example, packet spraying, which allows for some form of adaptive routing, scalable
endpoints. So it's actually much more scalable than InfiniBand is today, or RoCE for that matter.
And it also has security built in as a first-class citizen, which was only added later
to InfiniBand as well as RoCE. So in some sense, RoCE is kind of the InfiniBand for Ethernet networks, but it inherits all the goodness
and also the problems of InfiniBand. So I believe it could be much better.
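To make "packet spraying" a bit more concrete, here is a minimal Python sketch under stated assumptions: the Packet class, path names, and reordering step are invented for illustration and are not UEC code. The point is simply that packets of one flow are spread across several equal-cost paths instead of being pinned to one, and put back in order at the receiver.

```python
from dataclasses import dataclass
import itertools

@dataclass
class Packet:
    flow_id: int
    seq: int          # per-flow sequence number used to restore order
    payload: bytes

def spray(packets, paths):
    """Spread packets of one flow round-robin over equal-cost paths,
    instead of pinning the whole flow to a single path (classic ECMP)."""
    path_cycle = itertools.cycle(paths)
    return [(next(path_cycle), pkt) for pkt in packets]

def reorder(received):
    """Receiver side: packets may arrive out of order because the
    paths have different delays, so restore order by sequence number."""
    return [pkt for _, pkt in sorted(received, key=lambda x: x[1].seq)]

# Toy example: 8 packets of one flow sprayed over 4 paths.
flow = [Packet(flow_id=1, seq=i, payload=b"x") for i in range(8)]
sprayed = spray(flow, paths=["path-A", "path-B", "path-C", "path-D"])
in_order = reorder(sprayed)
assert [p.seq for p in in_order] == list(range(8))
```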
First, in our pre-call, you were also referencing the big AI providers and cloud providers
and how they have a bias in favor of standardization and uniformity, and that basically
requires Ethernet. So to what extent are they driving these new emerging standards?
They're very much driving it in the sense that they're the customers. They decide what they buy.
And if you look at the largest deployed AI clusters in the recent past, like Meta's cluster, the xAI cluster that Elon
Musk's company deployed, and many other clusters of that scale, so extreme scale, they're all using
Ethernet. None of those uses InfiniBand. And that is just a sign that these mega data center
providers, these people building very large AI systems, they want Ethernet, they want uniformity,
they want to build systems that
they're used to, that they've been deploying for the last couple of years, last couple of decades,
actually. And it's very similar at Microsoft. So you just want to build things that you know
how to maintain, and that have proven to exist for a long time and have proven to work for a
long time. Furthermore, Ethernet right now is the dominating interconnect. So it's about 600 million ports per year that are being deployed.
That is about a thousand ports per minute, if you think about it.
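The back-of-the-envelope arithmetic behind that figure, using only the 600-million-ports-per-year number quoted above, is a quick check:

```python
ports_per_year = 600_000_000          # Ethernet ports deployed per year (figure quoted above)
minutes_per_year = 365 * 24 * 60      # roughly 525,600 minutes in a year
ports_per_minute = ports_per_year / minutes_per_year
print(round(ports_per_minute))        # ~1142, i.e. on the order of a thousand ports per minute
```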
So all of this hinges on the expectation that Ultra Ethernet will match, or at least get close to, and maybe even exceed the performance of InfiniBand.
Is that what's to be expected?
Well, that's what we were
certainly hoping, but the future will show. So we will release the spec hopefully soon.
Once the spec is released, various vendors will go off and have software as well as hardware
implementations of the spec. And then we can benchmark those in the field and we'll see.
Yeah. In fact, Doug and I were commenting on AMD's recent announcement that their networking is Ultra Ethernet compliant, even before the specification is released.
That's certainly adventurous because the specification will still change in various
aspects. So I'm not sure if that can be compatible.
Well, it is a very programmable device. So we figured that they think it's sufficiently
programmable to match the specification the way it seems to be going.
So we didn't fault them too much, but it was an indication that ultra-Ethernet is coming
fast and furious, and they want to allude to that.
Absolutely.
And yeah, programmable devices can definitely adapt to the changes in the spec, yeah.
Now, what are the timelines?
What we've seen with PCIe, for example, is that four comes in, then five was announced some years ago, they were talking about six or seven, and five is just starting to happen. How long will this take? Well, if you have an organization, which is basically a direct democracy
with a lot of players,
and the number of players has been growing very quickly.
So I believe we had seven founding companies.
And now, as I mentioned, it's 83.
And we have 700 people on the mailing list,
or 750 even, I think, on the working group
I'm co-chairing with Karen Schramm from Broadcom.
And if you're running meetings
with 70 to 100 people in the meeting,
then things don't
move super fast because you need to take everybody along. But we hope to get this done relatively
soon. And by relatively soon, I mean, hopefully early next year, the original announcement was
end of this year. There will be some announcement at SC, but hopefully soon.
Right, right.
Can I ask possibly a dumb question?
We follow silicon photonics,
yeah, optical IO, sorry.
And we've had guests on and so forth.
How would that play within Ethernet?
Is that just a whole different thing or would it be incorporated within ultra Ethernet
or how would that work?
Oh, it's absolutely compatible.
So I mean, this is not a dumb question. It's a very
good question. The question is, where does
your transceiver sit?
There are all kinds of silicon photonics pieces
which are, for example,
CPO, co-packaged optics is
one of those. There are various vendors
like Broadcom who have very strong offerings
for co-packaged optics
in switches or also in NICs in
the Ethernet world. And at the end,
it's essentially just how you implement your signaling layer. You can do this over optics, you
can do this over silicon photonics if you have your transceiver on chip or close to the chip in a CPO
setting, or you can do this in your pluggable optics cable, essentially in the plug itself. So
yes, very relevant. So when I think of the
interconnect, starting from Ethernet at the, let's say, bottom, although it's a flood that keeps
rising, and now you have ultra Ethernet, and then you have InfiniBand, PCIe, and then things like
OpenCAPI or NVLink, and now ultra accelerator link, and then you go on top of the chip and chiplet UCIE,
that entire spectrum, do we need every member of that set? Or are we expecting that a couple of
them will emerge as dominant? And really, if I do it on the chip, on the rack, and in the data center,
I'm kind of done? I think that's an excellent question, actually.
So many of those are widening their scope. I mean, Ethernet is widening the scope, certainly. So with UltraEthernet, Ethernet is going from the traditional data center interconnect medium
performance or low performance to an extremely high performance setting. So to compete in HPC
with all these proprietary interconnects. So it may wipe out many of those if it's successful. And
then the next question is, well, will it actually go towards more rack scale local deployments?
And UltraEthernet is designed to be a data center interconnect. So it will cover cable lengths up to
150 to 250 meters in a single data center. It will also go down to shorter ranges.
So you can also deploy it within the rack.
But then the physics changes.
And as the cables go shorter, you will have lower latency
and you will probably have higher reliability,
even though that requires good plugs.
If you go to electrical, then everything changes again.
But let's just assume there is a different set of requirements.
And so far, UltraEthernet is tuned for the large-scale deployment in large-scale data centers.
It's relatively easy to adapt it to rack scale, and then it can actually compete with local
interconnects such as NVLink or things that are in discussion in UALink. And eventually,
I actually believe much of that will be Ethernet. So many
systems today are deployed with Ethernet at the rack scale already. And that seems like a natural
movement just given the dominance of Ethernet in that area. But then if you go on chip, like very
short range on chip or on package, like you mentioned UCIe, the chiplet interconnect, then
I would say Ethernet may not be the right format. Because
in Ethernet, you have pretty large headers, you have pretty large packets, it may not be the right
format, even though I'm not sure. So we may be able to tweak Ethernet to be the right format in
this area. I wouldn't bet against it, for sure. But the current version definitely is not.
So is Ethernet kind of emerging like that old comment about Fortran, that I don't know what the
language of the future is, but it's going to be called Fortran? Absolutely, absolutely.
Interconnect of the future will be Ethernet regardless, but you're right, it is a pretty
heavy protocol. Yep. And I remember another joke in the old days was that it doesn't matter
what the hardware is, Ethernet was never going to do better than one megabyte per second.
Well, certainly that has been.
Well, we have exceeded that.
Right.
Now, Torsten, I know recently you gave a lecture that's on YouTube. We'll provide a link in the
little blurb I write to accompany this conversation. But the topic is the age of computation and
fascinating presentation on your part,
where you kind of, you know, describe
where we are now in the overall,
I guess, progress march of mankind,
starting with the Stone Age,
up to the emerging era that we're in,
which you're calling the age of computation.
Could you explain what that means?
Yeah, the observation is that
we have lived in the technology age, basically.
And actually, the fundamental observation is very interesting. With the development of humanity, and in that talk I started with the Bronze Age, at some point human strength became less relevant or nearly irrelevant. That was at the end of the Industrial Revolution, when we had trains and tanks and cars and steam engines.
So human strength became irrelevant.
And then, in the technological age that just recently started,
we switched to attacking intelligence.
So what happened there is that we went from the atomic age, the age of energy discovery,
through various ages in rapid succession: the internet age and, most recently, the data age,
where the internet enabled us to collect
all the human data in central places
and run analysis on this.
And my claim is that we're coming towards
the end of the data age.
Because first of all,
the amount of human generated data
is relatively small compared to the amount
of machine generated data.
And second, you can see this also in the
development of companies like NVIDIA, which is the second largest company at this point.
When I recorded that talk, it was the third largest, but as of last week, it's the second
largest. And in fact, it's only 1.7% behind Apple, which is the largest company in terms of total
market capitalization. So it'll probably surpass
Apple relatively soon, closing the remaining 1.7%. And this company is a pure accelerator company. It
doesn't provide many storage products. It just provides acceleration. That's the main business
that this company has. So somehow, we are running out of this data to learn from these new AI models.
And now we need to look at, we really need to look at
synthetic data. We need to look at ways for these models to play with themselves, to think about
things and to invoke themselves recursively, basically, or iteratively, depending on how you
define it. And that's the age of computation. So now, really computation, who has the most
computation as a society, as a single individual even, will have a significant benefit
over other societies and individuals. You can see today, if you know how to use AI systems,
which are computation driven, like these are computational input output systems,
you have a benefit over people who don't know how to use this. And the same applies to countries.
So now what's going to happen is that we will have a race for who has, who develops the most
advanced and biggest computational capability.
And maybe eventually human creativity
and human intelligence will be made obsolete
as a differentiating factor between humans,
as is today human strength.
Like it doesn't matter if I'm very strong or not,
I can operate a machine.
And maybe later we will be going towards
creativity and intelligence.
And that's the age of computing.
We achieve that with computation.
Lenovo will be joining HPC enthusiasts and influencers at SC24 in Atlanta at the Georgia
World Congress Center.
Lenovo invites you to attend, visit, and experience.
Attend the new Lenovo Innovation Forums, where you will learn about Gen AI, liquid cooling,
and more.
Register at bit.ly
slash lenovo forums. That's bit.ly forward slash lenovo forums. Visit booth 2201 for interactive
demos and over 20 booth theater sessions featuring Lenovo experts, partners, and customers. Experience
Lenovo's latest liquid cooling and large language model infrastructure. Could I suggest a possible name change for the age of computation?
Okay.
How about the age of HPC?
Because it's all enabled by HPC.
And we've talked about this, Shaheen, you and I have, and Torsten, that we often hear we're in the AI era, but it's all enabled by HPC.
And HPC so often gets subsumed under the current rage of the day in technology.
Yeah, absolutely.
However, I would not rename it because the age of HPC has been going on for the last
couple of decades already.
I mean, even in these previous ages, HPC played a very significant role. And
after all, the C in HPC stands for computing, which is very close to computation. And we could
also call it the age of computing. I guess that's a minor difference. And so really, now you're
right. Our golden age starts and high performance is going to be one of the major things we have to
achieve in order to enter this age.
Absolutely. I agree.
NVIDIA, which is, as you say, an accelerator company.
That's been their entire focus.
Yeah.
And this is how they've risen to the top.
So obviously, yes. Let me ask about per-rack electrical power, and how data centers have gone from being measured by square feet
or square meters to megawatts and gigawatts. This also implies that whoever has the most
computation also needs to have the most energy because I'm sort of observing that we've gone
from 10 kilowatts per rack to now 150, going to 500, maybe even going to a megawatt per
rack. How does that play with kind of the, I guess it's the geopolitical conclusion from
whoever has the most chips wins kind of a thing? Yeah. Yeah. I mean, one interesting observation
is, you say that as if it were scary. If I build racks with a higher energy capacity, actually these racks
are much more efficient than racks with a lower energy capacity, because you get lower
power consumption as you reduce the distance between the compute elements, and as you reduce the
distance, you improve the efficiency. However, of course, you will probably just deploy more. So
there's Jevons' paradox: whenever you make something more efficient, you will just find bigger demand for it.
So absolutely, that's going to happen.
Use more of it.
Yeah, exactly.
Exactly.
The interesting view here is that, I mean, this will happen anyway.
And what we HPC people can do is we can contribute to making this more efficient.
And so here I'm extremely proud of our achievements, like my group's achievements and my own achievements
in the past couple of years,
where we basically made AI computations, both training and inference, a thousand times more energy efficient than they were before. And much of that is in production. So I have a talk where
I explain how this factor of 1000 times is achieved. And that is extremely important for
the future of our energy consumption, because many people say that this is a problem, this high performance computing,
but no, we are the solution.
We make these devices more efficient such that we actually use less energy.
And then, well, of course, we can use more of them as we just discussed.
But again, this will happen anyway.
As you can see from all the mega data center providers, literally all of them, Google,
Microsoft, and AWS, they have announced that they're betting big time on
nuclear energy, actually rebooting plants, partially rebooting plants that were supposed to be shut down,
just to deliver that energy in the near term, in the next couple of years, because it's the
only way you can get it. Yeah, it does require a different political and social mindset to recognize that these computations aren't, quote, wasted.
Yes.
They're not just generating heat.
Because we had that debate a couple of years ago.
I was joking that should we be thanking AI for taking the energy blame away from Bitcoin?
But we had that discussion a few years ago: is this all wasted? So with AI it's a little bit different, but what do we do to sort of recognize or project
that the value of this is really for humanity and not just for the big companies that do
it?
The value distribution is an interesting question.
But many of these big companies are actually freely sharing their weights and their models,
like Meta, for example.
I believe the Grok model
is open. And so there is a very interesting debate to be had here, what that means. But I believe we
all already benefit from it in so many different ways. I mean, I use AI models. Of course, I'm
consulting for Microsoft, so I use Copilot every single day. So most of my emails are written by
that model or supported by that model, and it's
quite nice. It makes me really much more productive; it even makes me nicer. And so
it has a huge added benefit. Right. Let's talk a little bit about the cooling part of it too,
because it seems to me that liquid cooling went from being something exotic and interesting that maybe we used in the mainframe era, and hopefully no longer needed because we're not using ECL anymore.
We went to CMOS, and now it's come back with a vengeance, and it's now mandatory.
What do you see going on there that people need to keep on radar?
I mean, the liquid cooling is an enabler for
these denser racks that you mentioned before. And as you say, it's absolutely mandatory:
100-plus-kilowatt racks you simply cannot air cool; you cannot have an airstream fast
enough to air cool those. And we will go further with liquid cooling. And there are many new ideas
where you move the liquid cooling closer to the chip or even on the chip or even immerse the chip
in liquid with these immersion cooling strategies that enable us to build smaller and smaller enclosures,
which really reduces the distance.
And reducing the distance is absolutely key for connectivity.
The price goes down, the energy consumption goes down significantly if you can reduce
these distances.
So efficiency goes up.
And cooling is probably
one of the most important challenges we have today to build these large-scale systems. So
at Microsoft, I know a significant number of people just looking at cooling because it's
absolutely crucial. Now, on the data end of things, with computers generating data,
Torsten, we've certainly heard about data hallucinations.
And I believe part of your talk was LLMs literally talking to each other, working in some facsimile of a collaborative manner.
But how do you make sure that the new data that LLMs are generating is not false data, hallucinatory, or whatever?
Yes, I know.
There are many approaches.
What you allude to is the graph of thoughts work that I was presenting. So the idea here is that you have multiple language models or even the same language model reasoning in multiple steps
through multiple invocations of itself or other models where it communicates through prompts,
very much like we humans communicate. And the cool idea here is actually that while we, as we are
talking here, we communicate, we exchange ideas. I take thoughts from my mind and inject it into
your mind, and hopefully you come up with better thoughts and so on. And we do this at an extremely
low bandwidth. We here have a couple of words per second, which is a couple of bytes per second.
Language models, they can communicate at megabytes, gigabytes, and soon terabytes per
second. That's a billion times faster than what we are doing here. So that's an interesting thing
to think about, an interesting fact to think about. So now, you asked about hallucinations.
So first of all, I also believe we humans hallucinate, and that is called the creative
process. So when I design something that's not there, I apply a creative process and then I check whether
it's useful. The unfortunate thing in language models is that this check is kind of missing. The
language models, they tell you something, they don't think about it. They just have to output
the next token. So they are forced to communicate with you immediately. It's like you would force
me to utter every single thought I have. And there would be a whole lot of junk that I would have to utter if I would be forced to do so. So really, this graph of thoughts is a way to give the language
model a chance to think about things. So not immediately output the next token to you,
but use these tokens in an internal process, and maybe not output them to the user,
but give them later as a condensed result to the user. So we can see traces of this
in OpenAI's O1 or Strawberry model, where the language model argues a little bit with itself.
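As a rough sketch of the iterative-invocation idea described here, assuming a hypothetical call_llm function (this is not the actual Graph of Thoughts implementation): the model drafts an answer, critiques its own intermediate thoughts, and only a condensed, refined result is returned to the user.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API; assumed, not a real library call."""
    raise NotImplementedError

def answer_with_reflection(question: str, rounds: int = 2) -> str:
    # First draft: the model is NOT forced to show this to the user.
    draft = call_llm(f"Answer the question:\n{question}")
    for _ in range(rounds):
        # The model critiques its own intermediate "thoughts" ...
        critique = call_llm(
            f"Question: {question}\nDraft answer: {draft}\n"
            "List any errors or unsupported claims in the draft."
        )
        # ... and refines the draft using that critique.
        draft = call_llm(
            f"Question: {question}\nDraft: {draft}\nCritique: {critique}\n"
            "Rewrite the draft, fixing the issues above."
        )
    # Only the condensed, refined result reaches the user.
    return draft
```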
So now still, that doesn't mean that the language model understands what's the hallucination or what
is not. So here we have another very nice research work that we call CheckEmbed. And I would
recommend you look this up because it's really cool. Actually, it's very simple. How can I prompt a language model and try to
understand whether it's hallucinating? Well, very much like we would do it with humans. So you ask
the exact same question in different words. Like if you do a questionnaire, for example, and you
want to see if humans pay attention, you ask the same question with different words. And then you
see if they're just clicking random results or not.
So there's a whole lot of theory about this.
You can do the same thing with language models.
So you ask the same question with slightly different input, and then you look at the
output.
How would you compare different outputs?
Because a language model will design different text replies for each of those inputs.
How do you now compare those efficiently?
And this is where CheckEmbed comes in. What we do is we take each of these output texts and we embed it using
the same or another language model into the high dimensional space, into the high dimensional
vector space where these language models work. And then we get vectors for each of those outputs.
And now if these vectors are close to each other, because these vectors, as we know from
the early works on embeddings, they kind of encode the meaning of a text. So now if all of these
different outputs that were prompted by the same question with different formulations, if all of
them are close in the resulting vector space, then the model is very unlikely to hallucinate,
because after all, it has given you similar answers for different formulations. However,
if the model is very unsure, and it's forced to output something,
then, to go a bit more technical,
if you look at the cross entropy of the last layer,
the selection probabilities are very often very uncertain when the model hallucinates.
For example, if the next token is a bear with 51% and a cat with 49%, then the model
is not very sure, as opposed to 99% a bear and 1% a cat or something like this. And that you can
measure in these vectors, because now, if the model hallucinates, it'll give you different
answers, because there is a noise component in this for very similar prompts. And if that happens, you can see that in the vectors
because the distance between these vectors is relatively large.
And you can now programmatically analyze whether your model hallucinates or not
by just looking at the output vectors.
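A minimal sketch of that intuition in Python, assuming hypothetical call_llm and embed functions and an illustrative threshold (this reflects the idea described above, not the published CheckEmbed code): ask paraphrases of the same question, embed the answers, and treat low mutual similarity of the resulting vectors as a sign of uncertainty.

```python
import numpy as np
from itertools import combinations

def call_llm(prompt: str) -> str: ...      # assumed chat-completion API (placeholder)
def embed(text: str) -> np.ndarray: ...    # assumed embedding model (placeholder)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def certainty_score(paraphrases):
    """Ask the same question in different words, embed each answer,
    and return the average pairwise cosine similarity of the answers.
    High similarity -> answers agree -> the model is likely certain.
    Low similarity  -> answers drift  -> treat the output with suspicion."""
    vecs = [embed(call_llm(p)) for p in paraphrases]
    sims = [cosine(a, b) for a, b in combinations(vecs, 2)]
    return sum(sims) / len(sims)

# Usage sketch (questions and threshold are illustrative only):
# score = certainty_score([
#     "When was the first exascale system deployed?",
#     "In what year did the first exascale supercomputer go into service?",
#     "Name the year the first exascale machine became operational.",
# ])
# flag_as_uncertain = score < 0.85
```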
And of course, we have a paper on this and scientific evaluation.
And so, but that's kind of the high level intuition, high level human intuition.
And I can do the same thing to you.
I want to see if you're lying to me, for example. I ask you the same question on day one and maybe a very similar
question three days later. So now you need to remember what you told me on day one or you will
give me a different answer. And then I have detected a problem. So this basically covers
the space where consistency is a proxy for correctness. Yes. For certainty, not for correctness,
for certainty. For certainty. Exactly. That's right. Whether the model is certain or not.
There is another way now to check whether a model is certain or not. So you could ask the model,
if you're asking mathematical questions, for example, it's quite nice because the model can
check itself. Like I very often do when I'm writing a book together with a friend.
And if I do a proof, I use Wolfram Alpha
to help me to understand whether what I proved is actually correct. The language model can use
tools like model checkers or Wolfram Alpha as well to check itself. And it can talk to itself
using these tools like we do. And it's quite nice. This actually really works in practice.
Right, right, right. Torsten, the term emergent abilities in LLMs, is that kind of what we're talking about only in a more controlled way?
Yes.
I was first exposed to that idea about a year ago at an HPC User Forum.
And when you first hear it, that an LLM draws conclusions that it wasn't asked to draw, it struck me as eerie and maybe frightening.
But as I think you pointed out, Shaheen, it's simply that we
don't understand yet how this is done. It's not some sort of mysterious thing.
My view was that the information content of the data that we supply to AI exceeds our ability to
understand it. So if AI is doing something that looks emergent to me, it's because I just didn't
know what was in my data. Not that it came up with
something brand new. It just did a better job of managing and extracting information from the data.
Is that true? Or do we think that it actually comes up with truly net-new, brand-new ideas
that aren't just a permutation of existing ones or extraction of information that wasn't accessible to humans?
Well, this is a fascinating question and I'm baffled by this. So I would immediately agree
with what you said from a mathematical perspective, because after all these pre-trained
large language models, they learn the statistical distribution of language from a lot of examples.
So what is the most likely next token, which is really a representation of a word or a
piece of a word, given the previous tokens? This is what they do. It's a very simple statistic
somehow with extremely large computation. So these models, they have up to multiple hundreds
of billions of parameters, or actually trillions of parameters these days; Llama 405B has 405
billion parameters. It operates in a 405-billion-
dimensional optimization space. This is absolutely crazy if you think about it. But now, the
interesting thing is that there's actually proof that these models generate new knowledge that is
definitely not in the training data. So for example, there are various papers that show that
these models can be used in a loop, like I just mentioned,
to prove mathematical theorems that were open for a decade. And these models, they build hypotheses,
they use the proof assistant to check if their hypotheses are correct. And if they're incorrect,
they refine their hypothesis or change it or discard it. And they build new hypotheses,
just like we humans do. And this is a fascinating thing. I think nobody can really explain
how that follows from
simply predicting the next token. Sometimes I'm wondering if I'm a language model, like,
how does my brain, am I predicting just the next token? Of course, I speak English that makes
sense to others. So somehow there's something to it. So I don't know. But there is evidence that
language models create new things. Interesting.
Well, along those lines, it is also true that language models have been surprisingly more effective
than their inventors thought they would be, right?
Isn't that true?
Yes.
Oh, absolutely.
So like, is there any, is there an understanding of why that is?
Why is it that it works as well as it does?
If we had that, we would probably be able to optimize them a whole lot. So far, I don't know.
More scale, bigger scale, like they seem to generate new capabilities or have emerging
capabilities with bigger scale. And now more iterative invocation. So not necessarily scale
now, but now we invoke these language models with their own outputs again and resemble more of a human thinking process.
Right. So along those lines, another question for me was, how come there are so many matrix multiplies in nature?
That I'm not sure, actually. I would really blame Jack Dongarra for this.
Yes, that's right. So he rightfully got the Turing Award because he made matrix multiply fast.
And that was really the basis for much of the development, if you ask me.
That was definitely inspired, yes.
It happened because this forced these models to use matrix multiplies, and they happen
to be very effective.
However, we don't know if that is the best way.
And it's dense matrix multiplication, of all matrix multiplications.
So we all know that our brains are not densely connected.
We all know that our brains are operating very differently from matrix multiplication.
You can model parts of the brain with sparse matrix multiplication,
but at the end, you can probably model any physical system with sparse matrix multiplication
because it's extremely powerful as a computation model.
So that is a great question.
And I don't know the answer.
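To make the dense-versus-sparse distinction concrete, here is a small illustrative NumPy/SciPy sketch (the size and the 95% sparsity are arbitrary choices, not anything from the conversation): the same linear layer computed with a dense weight matrix, as in today's transformer matmuls, and with a sparse version that stores only a fraction of the weights.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
d = 1024

# Dense linear layer: every one of the d*d weights participates,
# which is what transformer matrix multiplications do today.
W_dense = rng.standard_normal((d, d)).astype(np.float32)
x = rng.standard_normal(d).astype(np.float32)
y_dense = W_dense @ x

# Sparse variant: zero out 95% of the weights and store only the rest,
# closer in spirit to a sparsely connected, brain-like model.
mask = rng.random((d, d)) < 0.05
W_sparse = sparse.csr_matrix(W_dense * mask)
y_sparse = W_sparse @ x

# Storage comparison: dense weights vs. roughly 5% of them kept.
print(W_dense.nbytes, W_sparse.data.nbytes)
```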
My personal theory is that
we just found a good engineering solution
with these dense matrix multiplications
to an information representation problem.
It's very much like planes don't flap with their wings.
We found an engineering solution
for using the aerodynamics of air or of foils to fly.
And we have wheels and there are not many animals with
wheels. And so in this AI space, this may be the right representation for knowledge, but we don't
know. And this is a very good question. Yeah. The flapping wing basically got decoupled into
a fixed wing and an engine. Yes. And it worked. Exactly. So let me ask you then, there was a Nobel Prize awarded
for physics, and there was a controversy on what was so physics and what was awarded. Is the Nobel
Committee redefining basic science by pulling computer science into it? Or is it just a
continuation of multidisciplinary science? Where do you land on that? Was this award really for physics or was it just an
excuse to reward AI because it's important? Or is it really the insight that, oh my God,
everything is a matrix multiply? Well, I'm actually on the side of the age of computation again. So
computation is going to drive human development at least in the near future and probably even
more in the far future.
And this is the beginning of this, as we are seeing.
Computation has been used to discover physical phenomena and also chemical phenomena
to an extent that the committee handing out the biggest award in those fields
gave it to essentially computational scientists.
And in the good old days,
the saying was that as a computer scientist, you would never get a Nobel Prize. And now we may get all the Nobel Prizes soon.
I love that a lot.
I had a joke talk about the literature Nobel Prize, because that may soon be going to AI.
Torsten, can I ask, I know we only have a few more minutes, but looking ahead,
there has been discussion lately about the limitations of LLMs.
And I'm curious your thoughts about what's next in AI at the big picture level, the next
model, if you will, and what sort of computation might be needed for the thing that comes after LLMs?
Yeah, that is a great question.
So I believe there will be an ecosystem of agents.
People call those
agents now, but these are really LLMs invoked in loops, in complex loops, LLMs that may have
been given different personalities. So for example, there was a very interesting paper
on a software engineering company run by LLMs. And so one LLM was the tester and the other one
was the designer. The other one was the coder.
And so they were interacting in a loop, talking to each other.
So the tester was trying to break what the coder coded based on the designer's input.
There was even a marketing LLM that tried to design a web page for the thing.
And it worked reasonably well.
And so this is where we are going, I believe.
And it's really more computation of LLMs.
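A minimal sketch of such a role-based loop, again assuming a hypothetical call_llm function (the roles and stopping rule are invented for illustration, not taken from the paper mentioned above): a designer, a coder, and a tester LLM pass artifacts back and forth until the tester is satisfied.

```python
def call_llm(role: str, prompt: str) -> str:
    """Placeholder: one chat-completion call, with `role` applied as a
    system-style persona ('designer', 'coder', 'tester'). Assumed, not real."""
    raise NotImplementedError

def build_feature(spec: str, max_rounds: int = 5) -> str:
    # The designer writes a design, the coder implements it.
    design = call_llm("designer", f"Write a short design for: {spec}")
    code = call_llm("coder", f"Implement this design:\n{design}")
    for _ in range(max_rounds):
        # The tester tries to break what the coder produced.
        report = call_llm("tester", f"Design:\n{design}\nCode:\n{code}\n"
                                    "Find bugs, or answer exactly PASS.")
        if report.strip() == "PASS":
            break
        # The coder fixes the reported issues and the loop continues.
        code = call_llm("coder", f"Fix these issues:\n{report}\nin this code:\n{code}")
    return code
```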
The big question that I'm not sure how to answer is how they will scale, whether we will scale them
much bigger than they are today. So today we are in the low trillion parameter models at the very
high end. However, for all practical purposes, these sub-hundred-billion models, like the 70
billion parameter models, seem to be doing reasonably well. And the problem with
these very large models is that they get extremely expensive to use. So you need a very large GPU
farm to just fire up a model and get a reply for these very large models. So here is going to be a
very interesting resource question. How many resources can I afford and how many can I get?
Perfect.
Maybe one final question is just a follow-up from the panel discussion that was at SC23, I believe, about the future of supercomputing.
And I remember your portion of it.
I think it was very consistent with the conversation we've had today about the age of computation and also networking.
What is your current perspective on what future architectures will look like
and where we will see advances?
I think that's a very open-ended question.
It is, yes, yes.
Well, the high-level view is going to be more parallelism.
That seems to be fundamental.
We don't scale single-core performance.
We need more parallelism, more specialization of that parallelism toward AI-style workloads. So smaller
data types, sparsity, I really believe sparsity will make a difference. It's very hard. It's
much harder than exploiting smaller data types, but we'll make some progress there.
We will build bigger systems because we have larger workloads and we have to worry more about
the efficiency of those systems at the end.
So really, we are at the pole position with HPC to contribute to this development and
also benefit from it.
So high-performance computing, the high-performance simulation community will also benefit from
the AI development if they can express parts of their problem in an AI context.
And here I'm working very hard on making that happen as well
in the AI for Science initiative, to various extents.
But that we have to talk about another time.
Perfect, yes.
All right.
Well, thank you, Torsten.
Always a treat.
Really appreciate you making time.
And I know it's evening your time.
Thanks for being flexible about that.
Oh, that was super fun.
I'm super happy to do this again.
We absolutely will hold you to that.
We'll take you up on that.
Thanks so much.
Thank you so much. Take care.
Wonderful, wonderful.
That's it for this episode of the At HPC podcast.
Every episode is featured on InsideHPC.com and posted on OrionX.net.
Use the comment section or tweet us with any questions or to propose topics of discussion.
If you like the show, rate and review it on Apple Podcasts or wherever you listen.
The At HPC Podcast is a production of OrionX in association with Inside HPC.
Thank you for listening.