Microsoft Research Podcast - NeurIPS 2024: The co-evolution of AI and systems with Lidong Zhou

Episode Date: December 17, 2024

Just after his NeurIPS 2024 keynote on the co-evolution of systems and AI, Microsoft CVP Lidong Zhou joins the podcast to discuss how rapidly advancing AI impacts the systems supporting it and the opportunities to use AI to enhance systems engineering itself.

Learn more:
Verus: A Practical Foundation for Systems Verification | Publication, November 2024
SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation | Publication, July 2024
BitNet: Scaling 1-bit Transformers for Large Language Models | Publication, October 2023

Transcript
Starting point is 00:00:00 Welcome to the Microsoft Research Podcast, where Microsoft's leading researchers bring you to the cutting edge. This series of conversations showcases the technical advances being pursued at Microsoft through the insights and experiences of the people driving them. I'm Eliza Strickland, a Senior Editor at IEEE Spectrum, and your guest host for a special edition of the podcast. Joining me today in the Microsoft booth at the 38th annual Conference on Neural Information Processing Systems, or NeurIPS, is Lidong Zhou. Lidong is a
Starting point is 00:00:34 Microsoft Corporate Vice President, Chief Scientist of the Microsoft Asia-Pacific Research and Development Group, and Managing Director of Microsoft Research Asia. Earlier today, Lidong gave a keynote here at NeurIPS on the co-evolution of AI and systems engineering. Li Dong, welcome to the podcast. Thank you, Eliza. It's such a pleasure to be here. You said in your keynote that progress in AI is now outpacing progress in the system supporting AI.
Starting point is 00:00:59 Can you give me some concrete examples of where the current infrastructure is struggling to keep up? Yeah, so actually we have been working on supporting AI from the infrastructure perspective. And I can say, you know, there are at least three dimensions where it's actually posing a lot of challenges. One dimension is that the scale of the AI systems that we have to support, you know, you heard about the scaling law in AI and, you know, demanding even higher scale every so often. And when we scale, as I mentioned in the talk this morning, every time you scale the system,
Starting point is 00:01:34 you actually have to rethink how to design the system, develop a new methodology, revisit all the assumptions, and it becomes very challenging for the community to keep up. And the other dimension is if you look at AI systems, it's actually a whole stack kind of design. You have to understand not only the AI workload, the model architecture, but also the software and also the underlying hardware. And you have to make sure they are aligned to deliver the best performance.
Starting point is 00:02:01 And the third dimension is the temporal dimension, where you really see accelerated growth and the pace of innovation in AI, and not actually only in AI, but also in the underlying hardware. And that puts a lot of pressure on how fast we innovate on the system side, because we really have to keep up in that dimension as well. So all those three dimensions add up.
Starting point is 00:02:23 It's becoming a pretty challenging task for the whole system community. I like how in your talk you proposed a marriage between systems engineering and AI. What does this look like in practice and how might it change the way we approach both fields? Yeah, so I'm actually a big fan of system community and AI community work together to tackle some of the most challenging problems. Of course, you know, we have been working on systems that support AI, but now increasingly we're seeing opportunities where AI can actually help developers to become more productive
Starting point is 00:02:55 and develop systems that are better in many dimensions in terms of efficiency, in terms of reliability, in terms of trustworthiness. So I really want to see the two communities work together even more closely going forward. You know, I talk about sort of the three pillars, right? The efficiency, there's trust, there's also the infusion of the two.
Starting point is 00:03:16 There are three ambitions that we are actually working on and we see very encouraging early results that makes us believe that there's much more to be achieved going forward with the two communities working together. You mentioned the challenging of scaling. I think everyone at NeurIPS is talking about scaling. And you've highlighted efficiency as a key opportunity for improvement in AI. What kind of breakthroughs in systems engineering or new ideas in systems engineering could help AI achieve greater efficiencies? Yeah, that's another great question. I think there are a couple of aspects to efficiency.
Starting point is 00:03:53 So this morning I talked about some of the innovations in model architecture. So our researchers have been looking into binnets, which is essentially try to use one bit or actually using a ternary representation for the weights in all those AI models rather than using FP16 and so on. And that potentially creates a lot of opportunities for efficiency and energy gains. But that cannot be done without rethinking about the software and even the hardware stack so that, you know, those innovations that you have in the model architecture can actually have the end-to-end benefits. And that's one of the dimensions where we see the co-innovation of AI and underlying system to deliver some efficiency gains for AI models, for example. But there's another dimension which I think is also very important. With all the AI
Starting point is 00:04:46 infrastructure that we build to support AI, there's actually huge room for improvement as well. And this is where AI can actually be utilized to solve some of the very challenging system problems for optimization, for reliability, for trustworthiness. And I use some of the examples in my talk, but this is a very early stage. I think the potential is much larger going forward. Yeah, it's interesting to think about how GPUs and large language models are so intertwined at this point. You can't really have one without the other. And you said in your talk, you sort of see the need to decouple the architectures and the hardware. Is that right?
Starting point is 00:05:26 Yes. Yeah. So this is always, you know, like very system type of thinking where, you know, really want to decouple some of the elements so that they can evolve and innovate independently. And this gives more opportunities, you know, larger design space for each field. And what we are observing now, which is actually very typical in a relatively mature field, where we have GPUs that are dominating in the hardware land and all the model architecture has to be designed and, you know, proving very efficient on GPUs, and that limits the design space for model architecture. And similarly, you
Starting point is 00:06:07 know, if you look at hardware, it's very hard for hardware innovations to happen because now you have to show that those hardwares are actually great for all the models that are have been actually optimized for GPUs. So I think, you know, from system perspective, it's actually possible if you design the right abstraction between the AI and the hardware, it's possible for these two domains to actually evolve separately and have a much larger design space to find the best solution for both. And when you think about systems engineering, are there ways that AI can be used to optimize your own work? Yes, I think there are two examples that I gave this morning. One is, you know, in systems, there's this what we call a holy grail of system research,
Starting point is 00:06:57 because we want to build trustworthy systems that people can depend on. And one of the approach is called the verified systems. And this has been a very active research area in systems, depend on and one of the approach is called the verified systems and this is a being a very active research area in systems because there are a lot of advancements informal methods and in you know how we can infuse the formal message into building real systems but it's still very hard for the general system community because you know you really have to understand how formal methods work and so on. And so it's still a lot within reach. When you build mission-critical systems,
Starting point is 00:07:31 you want to be completely verified. So you don't have to do a lot of testing to show that there are no bugs. You'll never be able to show there's no bugs with testing. Sorry, could I pause you for one moment? Could you define formal define formal verification for our listeners just in case they don't know? Yeah, that's a good point. I think the easy way to think about this is formal verification uses mathematical logic to describe, say, a program. And you can represent some properties in math, essentially, in logic. And then you can use a proof to show that the program has
Starting point is 00:08:05 certain properties that you desire and simple form in a very preliminary form of formal verification is you know this assertions in the program right we say say don't assert a is not equal to zero and that's a very simple form of logic that must hold. And then, you know, the proof system is also much more complicated. You have talked about more advanced properties of programs, their correctness and so on. So I think the opportunity we're seeing is that with the help of AI, I think we are on the verge of providing the capability of building verified systems,
Starting point is 00:08:46 at least for some of the mission-critical pieces of systems. And that would be a very exciting area for systems and AI to tackle together. And I think we're going to see a paradigm shift in systems where some pieces of system components will actually be implemented using AI. Which is interesting is, you know, system is generally deterministic because when you look at the traditional computer system, you want to know that it's actually acting as you expect it.
Starting point is 00:09:16 But AI, you know, it can be stochastic, right? And it might not always give you the same answer. But how you combine these two, which is another area where I see a lot of opportunities for breakthroughs. and it might not always give you the same answer. But how you combine these two, which is another area where I see a lot of opportunities for breakthroughs. Yeah, yeah. I wanted to back up in your career a little bit
Starting point is 00:09:32 and talk about the concept of gray failures. Because you were really instrumental in defining this concept, which for people who don't know, gray failures are subtle and partial failures in cloud-scale systems. They can be very difficult to detect and can lead to major problems. I wanted to see if you're still thinking about gray failures in cloud scale systems. They can be very difficult to detect and can lead to major problems. I wanted to see if you're still thinking about great failures in the context of you're thinking about AI and systems.
Starting point is 00:09:51 Are great failures having an impact on AI today? Yes, definitely. So when we were looking at cloud systems, we realized the... So in systems, we developed a lot of mechanisms for reliability. And when we look at the cloud systems, when they reach a certain scale, a lot of methodology we develop in systems for reliability actually no longer applies. So one of the reasons is we have those great failures. And then we move to looking at AI infrastructure.
Starting point is 00:10:19 The problem is actually even worse. Because what we realize is there's a lot of built-in redundancy at every level like GPUs, memory or all the communication channels and because of those building redundancies sometimes the system is experienced failures but they're not that being masked because of the redundancies and that makes very hard for us to actually maintain the system, debug the system, or do troubleshooting. And for AI infrastructure, what we have developed is a very different approach using proactive validation rather than reactive repair. And this is actually a
Starting point is 00:11:01 paper that we wrote recently in USENIX ATC that talks about how we approach reliability in AI infrastructure, where the same concept happens to apply in a new meaning. I like that, yeah. So tell me a little bit about your vision for where AI goes from here. You talked a little bit in your keynote about AI-infused systems. What would that look like? Yeah, so I think AI is going to transform almost everything, and that includes systems.
Starting point is 00:11:30 That's why I'm so happy to be here to learn more from the AI community. But I also believe that for every domain that AI is going to transform, you really need the domain expertise and sort of the combination of AI and that particular domain. And the same for systems. So when we look at what we call AI-infused systems, we really see the opportunity when there are a lot of hard system challenges can be addressed by AI. But we need to define the right interface between the system and AI
Starting point is 00:12:04 so that we can leverage the advantage of both. AI is creative, we come up with solutions that people might not think of but it's also a little bit random sometimes. I could give you wrong answers but systems are very grounded and very deterministic. So we need to figure out what is the design paradigm that we need to develop so that we can get the best of both worlds. Makes sense. In your talk you gave an example of OptiFlow. Could you tell our listeners a bit about that? Yeah this is a pretty interesting project that is
Starting point is 00:12:39 actually done in Microsoft Research Asia jointly with the Azure team, where we look at collective communication, which is a major part of AI infrastructure. And it turns out there's a lot of room for optimization. It was initially done manually. So expert has to take a look at the system and look at the different configurations and do all kinds of experiments.
Starting point is 00:13:04 And it takes about two weeks to come up with a solution. This is why I say the productivity is becoming a bottleneck for our AI infrastructure because people are in the loop. They have to develop solutions. And it turns out that this is a perfect problem for AI, where AI can actually come up with various solutions and can actually develop good system insights based on the observations from the system. And so Optiflow, what it does is it comes up with the algorithm or the schedule of communications
Starting point is 00:13:39 for different collective communication primitives. And it turns out to be able to discover algorithms that's much better than the default one or for different settings. And it's giving us the benefits of the productivity, also efficiency. And you said that this is in production today, right? Yes, it is in production.
Starting point is 00:14:00 That's exciting. So thinking still to the future, how might the co-evolution of AI and systems change the skills needed for future computer scientists? Yeah, that's a very deep question. As I mentioned, I think being fluent in AI is very important, but I also believe that domain expertise is probably undervalued in many ways. And I see a lot of needs for this interdisciplinary kind of education where someone not only understands AI and what AI technology can do, but also understands a particular domain very well.
Starting point is 00:14:42 And those are the people who will be able to figure out the future for that particular domain with the Power of AI. And I think for our students, certainly it's no longer sufficient for you to be an expert in a very narrow domain. I think we see a lot of fields sort of merging together. And so you have to be an expert in multiple domains to see new opportunities for innovations.
Starting point is 00:15:10 So what advice would you give to a high school student who's just starting out and thinks, I want to get into AI? Yeah, I mean, certainly there's a lot of excitement of AI and it would be great for high school students to actually know, to have the firsthand experience. And I think it's their world in the future because they probably can imagine
Starting point is 00:15:29 a lot of things from scratch. I think they probably have the opportunity to disrupt all the things that we take for granted today. So I think just use their imagination and I don't think we have really good advice for the young generation. It's going to be their creativity and their imagination. And AI is definitely going to empower them
Starting point is 00:15:49 to do something that's going to be amazing. Something that we probably can't even imagine. Right. Yeah. I think so. I like that. So as we close, I'm hoping you can look ahead and talk about what excites you most about the potential of AI and systems working together,
Starting point is 00:16:03 but also if you have any concerns, what concerns you most? Yeah, I think in potential of AI and systems working together but also if you have any concerns what concerns you most? Yeah I think in terms of AI systems I'm certainly pretty excited about what we can do together you know with with a combination of AI and systems. There are a lot of low-hanging fruits and there are also a lot of potential grand challenges that we can actually take on. I mentioned a couple in this morning's talk. And certainly, you know, we also want to look at the risks that could happen, especially when we have system AI start to evolve together. And this is also in an area where having some sort of trust foundation is very important so we can have some assurance
Starting point is 00:16:49 of the kind of system or AI system that we're going to build. And this is actually fundamental in how we think about trust in systems. And I think that concept can be very useful for us to guard against unintended consequences and unintended issues. Well, Li Dongzhao, thank you so much for joining us on the podcast. I really enjoyed the conversation. It's such a pleasure, Eliza. And to our listeners, thanks for tuning in. If you want to learn more about research at Microsoft,
Starting point is 00:17:20 you can check out the Microsoft Research website at microsoft.com slash research. Until next time.
