Microsoft Research Podcast - NeurIPS 2024: The co-evolution of AI and systems with Lidong Zhou
Episode Date: December 17, 2024
Just after his NeurIPS 2024 keynote on the co-evolution of systems and AI, Microsoft CVP Lidong Zhou joins the podcast to discuss how rapidly advancing AI impacts the systems supporting it and the opportunities to use AI to enhance systems engineering itself.
Learn more:
Verus: A Practical Foundation for Systems Verification | Publication, November 2024
SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation | Publication, July 2024
BitNet: Scaling 1-bit Transformers for Large Language Models | Publication, October 2023
Transcript
Welcome to the Microsoft Research Podcast,
where Microsoft's leading researchers bring you to the cutting edge.
This series of conversations showcases the technical advances being pursued at
Microsoft through the insights and experiences of the people driving them.
I'm Eliza Strickland, a Senior Editor at IEEE Spectrum,
and your guest host for a special edition of the podcast.
Joining me today in the Microsoft booth at the 38th
annual Conference on Neural Information Processing Systems, or NeurIPS, is Lidong Zhou. Lidong is a
Microsoft Corporate Vice President, Chief Scientist of the Microsoft Asia-Pacific Research and
Development Group, and Managing Director of Microsoft Research Asia. Earlier today, Lidong
gave a keynote here at NeurIPS
on the co-evolution of AI and systems engineering.
Lidong, welcome to the podcast.
Thank you, Eliza. It's such a pleasure to be here.
You said in your keynote that progress in AI
is now outpacing progress in the systems supporting AI.
Can you give me some concrete examples
of where the current infrastructure is struggling to keep up?
Yeah, so actually we have been working on supporting AI from the infrastructure perspective.
And I can say, you know, there are at least three dimensions where it's actually posing a lot of
challenges. One dimension is that the scale of the AI systems that we have to support,
you know, you've heard about the scaling laws in AI, which demand even higher scale every so often.
And when we scale, as I mentioned in the talk this morning,
every time you scale the system,
you actually have to rethink how to design the system,
develop a new methodology, revisit all the assumptions,
and it becomes very challenging for the community
to keep up.
And the other dimension is if you look at AI systems, it's actually a whole stack kind of design.
You have to understand not only the AI workload, the model architecture,
but also the software and also the underlying hardware.
And you have to make sure they are aligned to deliver the best performance.
And the third dimension is the temporal dimension,
where you really see accelerated growth and the pace
of innovation in AI, and not actually only in AI,
but also in the underlying hardware.
And that puts a lot of pressure on how fast we innovate
on the system side, because we really have to keep up
in that dimension as well.
So all those three dimensions add up.
It's becoming a pretty challenging task for the whole system community.
I like how in your talk you proposed a marriage between systems engineering and AI.
What does this look like in practice and how might it change the way we approach both fields?
Yeah, so I'm actually a big fan of the systems community and the AI community
working together to tackle some of the most challenging problems. Of course, you know,
we have been working on systems that support AI,
but now increasingly we're seeing opportunities
where AI can actually help developers to become more productive
and develop systems that are better in many dimensions
in terms of efficiency, in terms of reliability,
in terms of trustworthiness.
So I really want to see the two communities work together
even more closely going forward.
You know, I talk about sort of the three pillars, right?
There's efficiency, there's trust,
and there's also the infusion of the two.
Those are three ambitions that we are actually working on,
and we see very encouraging early results
that makes us believe that there's much more to be achieved
going forward with the two communities working together. You mentioned the challenge of
scaling. I think everyone at NeurIPS is talking about scaling. And you've highlighted efficiency
as a key opportunity for improvement in AI. What kind of breakthroughs in systems engineering or
new ideas in systems engineering could help AI achieve greater efficiencies? Yeah, that's another great question.
I think there are a couple of aspects to efficiency.
So this morning I talked about some of the innovations in model architecture.
So our researchers have been looking into BitNet, which essentially tries to use one bit, or actually a ternary representation, for the weights in those AI models rather than
using FP16 and so on. And that potentially creates a lot of opportunities for efficiency and energy
gains. But that cannot be done without rethinking the software and even the hardware stack,
so that those innovations in the model architecture can actually deliver end-to-end benefits.
And that's one of the dimensions where we see the co-innovation of AI and underlying system
to deliver some efficiency gains for AI models, for example.
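As a rough illustration of the ternary-weight idea Lidong describes, here is a minimal sketch in NumPy. It is not the BitNet paper's exact algorithm; the scaling recipe and the absmean threshold here are simplified assumptions, but it shows why ternary weights open efficiency opportunities: the weight side of the matrix multiply reduces to additions and subtractions.

```python
import numpy as np

def ternary_quantize(W: np.ndarray):
    """Quantize a weight matrix to {-1, 0, +1} with a single scale factor,
    in the spirit of BitNet's ternary weights (illustrative sketch only)."""
    scale = np.mean(np.abs(W)) + 1e-8          # absmean scaling factor
    Wq = np.clip(np.round(W / scale), -1, 1)   # snap each weight to -1, 0, or +1
    return Wq.astype(np.int8), float(scale)

def ternary_matmul(x: np.ndarray, Wq: np.ndarray, scale: float):
    """Matrix multiply against ternary weights; since Wq holds only
    -1, 0, +1, the weight side needs no floating-point multiplies."""
    return (x @ Wq) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4)).astype(np.float32)
Wq, s = ternary_quantize(W)
x = rng.normal(size=(2, 4)).astype(np.float32)
approx = ternary_matmul(x, Wq, s)   # low-precision result
exact = x @ W                       # full-precision reference
```

Realizing the energy gains end-to-end, as Lidong notes, requires the software and hardware stack to actually exploit the multiply-free structure, which a NumPy sketch like this does not.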
But there's another dimension which I think is also very important. With all the AI
infrastructure that we build to support AI, there's actually huge room for improvement as well.
And this is where AI can actually be utilized to solve some of the very challenging system
problems for optimization, for reliability, for trustworthiness. And I use some of the examples in my talk,
but this is a very early stage. I think the potential is much larger going forward.
Yeah, it's interesting to think about how GPUs and large language models are so intertwined at
this point. You can't really have one without the other. And you said in your talk, you sort of see
the need to decouple the architectures and the hardware.
Is that right?
Yes.
Yeah.
So this is, you know, a very systems type of thinking, where you really want to decouple some of the elements so that they can evolve and innovate independently.
And this gives more opportunities, you know, larger design space for each field. And what we
are observing now, which is actually very typical in a relatively mature field,
where we have GPUs dominating the hardware landscape, and all the model
architectures have to be designed and, you know, proven very efficient on GPUs, and
that limits the design space for model architecture. And similarly, you
know, if you look at hardware, it's very hard for hardware innovations to happen,
because now you have to show that new hardware is actually great for all the
models that have been optimized for GPUs. So I think, you know,
from system perspective, it's actually possible if you design the right abstraction between the AI and the hardware,
it's possible for these two domains to actually evolve separately and have a much larger design space to find the best solution for both.
And when you think about systems engineering, are there ways that AI can be used to optimize your own work?
Yes, I think there are two examples that I gave this morning.
One is, you know, in systems, there's what we call a holy grail of systems research,
because we want to build trustworthy systems that people can depend on.
And one of the approaches is called verified systems.
And this has been a very active research area in systems,
because there have been a lot of advancements in formal methods
and in, you know, how we can infuse formal methods into building real systems.
But it's still very hard for the general systems community,
because you really have to understand how formal methods work and so on.
And so it's still not quite within reach. When you build mission-critical systems,
you want them to be completely verified, so you don't have to do a lot of testing to show that there
are no bugs. You'll never be able to show there are no bugs with testing.
Sorry, could I pause you for one moment? Could you define formal verification for our listeners, just in case they don't know?
Yeah, that's a good point.
I think the easy way to think about this is formal verification uses mathematical logic
to describe, say, a program.
And you can represent some properties in math, essentially, in logic.
And then you can use a proof to show that the program has
certain properties that you desire. A very simple, preliminary form
of formal verification is, you know, assertions in the program, right?
We say, assert a is not equal to zero, and that's a very simple form of logic
that must hold.
And then, you know, the proof system can also be much more complicated.
You can talk about more advanced properties of programs, their correctness, and so on.
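The assertion-style property Lidong mentions looks like this in code. Python is used purely for illustration (the function and names here are hypothetical); the key distinction is that a runtime assertion is only checked for the inputs you actually run, whereas a verifier such as Verus proves statically that the property holds for every possible input.

```python
def safe_divide(total: float, count: int) -> float:
    # The simplest, most preliminary form of the properties formal
    # verification deals with: "count is not equal to zero."
    # Testing can only check this for inputs we happen to try;
    # a formal proof shows it holds for all inputs.
    assert count != 0, "precondition violated: count must be nonzero"
    return total / count

print(safe_divide(10.0, 4))  # 2.5
```

In a verified system, this precondition would be stated as a specification (e.g., a `requires` clause) and discharged by the proof system at build time, so the check never needs to run at all.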
So I think the opportunity we're seeing is that with the help of AI,
I think we are on the verge of providing the capability of building verified systems,
at least for some of the mission-critical pieces of systems.
And that would be a very exciting area for systems and AI to tackle together.
And I think we're going to see a paradigm shift in systems
where some pieces of system components will actually be implemented using AI.
What is interesting is,
you know, systems are generally deterministic,
because when you look at a traditional computer system,
you want to know that it's actually acting as you expect.
But AI, you know, it can be stochastic, right?
And it might not always give you the same answer.
So how you combine these two
is another area where I see a lot of opportunities
for breakthroughs.
Yeah, yeah.
I wanted to back up in your career a little bit
and talk about the concept of gray failures.
Because you were really instrumental in defining
this concept, which for people who don't know,
gray failures are subtle and partial failures
in cloud-scale systems.
They can be very difficult to detect
and can lead to major problems.
I wanted to see if you're still thinking about gray failures in the context of AI and systems.
Are gray failures having an impact on AI today?
Yes, definitely.
So when we were looking at cloud systems, we realized the...
So in systems, we developed a lot of mechanisms for reliability.
And when we look at cloud systems, when they reach a certain scale,
a lot of the methodology we developed in systems for reliability actually no longer applies.
And one of the reasons is we have those gray failures.
And then we move to looking at AI infrastructure.
The problem is actually even worse.
Because what we realize is there's a lot of built-in redundancy at every level, like the GPUs, the memory, or all the communication
channels. And because of that built-in redundancy, sometimes the system
experiences failures, but they're being masked by the
redundancy, and that makes it very hard for us to actually
maintain the system, debug the system, or do troubleshooting. And for AI
infrastructure, what we have developed is a very different approach using
proactive validation rather than reactive repair. And this is actually a
paper that we wrote recently in USENIX ATC that talks about how we approach reliability in AI infrastructure,
where the same concept happens to apply in a new meaning.
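A toy sketch of the proactive-validation idea: run benchmarks across the fleet before a job starts, and flag nodes whose masked degradation would otherwise surface mid-training as a gray failure. All names and the threshold here are hypothetical, and SuperBench's real criteria are far more elaborate.

```python
import statistics

def proactive_validation(node_benchmarks: dict, tolerance: float = 0.1) -> list:
    """Flag nodes whose benchmark score falls more than `tolerance`
    below the fleet median -- catching degraded hardware proactively,
    before a training job hits it, rather than repairing reactively
    after a failure. Illustrative sketch only."""
    median = statistics.median(node_benchmarks.values())
    return [node for node, score in node_benchmarks.items()
            if score < median * (1 - tolerance)]

flagged = proactive_validation(
    {"node-a": 100.0, "node-b": 99.0, "node-c": 82.0, "node-d": 101.0})
# node-c underperforms the fleet and would be quarantined before training starts
```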
I like that, yeah.
So tell me a little bit about your vision for where AI goes from here.
You talked a little bit in your keynote about AI-infused systems.
What would that look like?
Yeah, so I think AI is going to transform almost everything,
and that includes systems.
That's why I'm so happy to be here to learn more from the AI community.
But I also believe that for every domain that AI is going to transform,
you really need the domain expertise
and sort of the combination of AI and that particular domain.
And the same for systems.
So when we look at what we call AI-infused systems,
we really see the opportunity when there are a lot of hard system challenges can be addressed by AI.
But we need to define the right interface between the system and AI
so that we can leverage the advantage of both.
AI is creative, it can come up with solutions that people might not think of,
but it's also a little bit random sometimes.
It could give you wrong answers. But systems are very grounded and very deterministic.
So we need to figure out what is the design
paradigm that we need to develop so that we can get the best of both worlds.
Makes sense. In your talk you gave an example of OptiFlow. Could you tell our
listeners a bit about that? Yeah this is a pretty interesting project that is
actually done in Microsoft Research Asia jointly with the Azure team,
where we look at collective communication,
which is a major part of AI infrastructure.
And it turns out there's a lot of room for optimization.
It was initially done manually.
So an expert has to take a look at the system
and look at the different configurations
and do all kinds of experiments.
And it takes about two weeks to come up with a solution.
This is why I say the productivity is becoming a bottleneck
for our AI infrastructure because people are in the loop.
They have to develop solutions.
And it turns out that this is a perfect problem for AI,
where AI can actually come up with various solutions
and can actually develop good system insights based on the observations from the system.
And so OptiFlow, what it does is it comes up with the algorithm, or the schedule of communications,
for different collective communication primitives.
And it turns out to be able to discover algorithms
that are much better than the default ones
for different settings.
And it's giving us the benefits of both productivity
and efficiency.
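To make "a schedule of communications for a collective" concrete, here is the classic ring allreduce schedule, the kind of default baseline that a system like OptiFlow searches beyond. This is a standard textbook algorithm, not OptiFlow's discovered schedules, which are not described in the conversation.

```python
def ring_allreduce_schedule(n: int):
    """Communication schedule for a classic ring allreduce over n ranks.
    The data is split into n chunks; each entry of a step is a send
    (src_rank, dst_rank, chunk_index). Total: 2*(n-1) steps."""
    steps = []
    # Phase 1: reduce-scatter. After n-1 steps, each rank holds the
    # fully reduced version of exactly one chunk.
    for t in range(n - 1):
        steps.append([(r, (r + 1) % n, (r - t) % n) for r in range(n)])
    # Phase 2: allgather. Circulate the reduced chunks so every rank
    # ends up with all n of them.
    for t in range(n - 1):
        steps.append([(r, (r + 1) % n, (r + 1 - t) % n) for r in range(n)])
    return steps

schedule = ring_allreduce_schedule(4)  # 6 steps, 4 concurrent sends each
```

An automated search can explore alternatives to this fixed ring, e.g., schedules tuned to a specific topology's link bandwidths, which is the optimization space where human experts previously spent those two weeks.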
And you said that this is in production today, right?
Yes, it is in production.
That's exciting.
So thinking still to the future,
how might the co-evolution of AI and systems
change the skills needed for future computer scientists?
Yeah, that's a very deep question. As I mentioned, I think being fluent in AI is very important,
but I also believe that domain expertise is probably undervalued in many ways. And I see a lot of need for this interdisciplinary kind of education
where someone not only understands AI and what AI technology can do,
but also understands a particular domain very well.
And those are the people who will be able to figure out
the future for that particular domain with the power of AI.
And I think for our students,
certainly it's no longer sufficient
for you to be an expert in a very narrow domain.
I think we see a lot of fields sort of merging together.
And so you have to be an expert in multiple domains
to see new opportunities for innovations.
So what advice would you give to a high school student
who's just starting out and thinks,
I want to get into AI?
Yeah, I mean, certainly there's a lot of excitement about AI,
and it would be great for high school students
to actually have firsthand experience.
And I think it's their world in the future
because they probably can imagine
a lot of things from scratch.
I think they probably have the opportunity
to disrupt all the things that we take for granted today.
So I think just use their imagination
and I don't think we have really good advice
for the young generation.
It's going to be their creativity and their imagination.
And AI is definitely going to empower them
to do something that's going to be amazing.
Something that we probably can't even imagine.
Right. Yeah.
I think so.
I like that.
So as we close, I'm hoping you can look ahead
and talk about what excites you most
about the potential of AI and systems working together,
but also if you have any concerns, what concerns you most?
Yeah, I think in terms of AI and systems, I'm
certainly pretty excited about what we can do together, you know, with a
combination of AI and systems. There are a lot of low-hanging fruits, and there
are also a lot of potential grand challenges that we can actually take on.
I mentioned a couple in this morning's talk.
And certainly, you know, we also want to look at the risks that could happen,
especially when we have systems and AI starting to evolve together.
And this is also in an area where having some sort of trust foundation is very important so we can have some assurance
of the kind of system or AI system that we're going to build.
And this is actually fundamental in how we think about trust in systems.
And I think that concept can be very useful for us to guard against unintended consequences and unintended issues.
Well, Lidong Zhou, thank you so much for joining us on the podcast.
I really enjoyed the conversation.
It's such a pleasure, Eliza.
And to our listeners, thanks for tuning in.
If you want to learn more about research at Microsoft,
you can check out the Microsoft Research website at microsoft.com slash research.
Until next time.