Computer Architecture Podcast - Ep 20: The Tech Transfer Playbook – Bridging Research to Production with Dr. Ricardo Bianchini, Microsoft

Episode Date: June 17, 2025

Dr. Ricardo Bianchini is a Technical Fellow and Corporate Vice President at Microsoft Azure, where he leads the team responsible for managing Azure's compute workload, server capacity, and datacenter infrastructure with a strong focus on efficiency and sustainability. Before joining Azure, Ricardo led the Systems Research Group and the Cloud Efficiency team at Microsoft Research (MSR). He created research projects in power efficiency and intelligent resource management that resulted in large-scale production systems across Microsoft. Prior to Microsoft, he was a Professor at Rutgers University, where he conducted research in datacenter power and energy management, cluster-based systems, and other cloud-related topics. Ricardo is a Fellow of both the ACM and IEEE.

Transcript
Starting point is 00:00:00 Hi, and welcome to the Computer Architecture Podcast, a show that brings you closer to cutting-edge work in computer architecture and the remarkable people behind it. We are your hosts. I'm Suvinay Subramanian. And I'm Lisa Hsu. Our guest for this episode was Dr. Ricardo Bianchini, who is a technical fellow and
Starting point is 00:00:18 corporate vice president at Microsoft Azure, where he leads the team responsible for managing Azure's compute workload, server capacity, and data center infrastructure, with a strong focus on efficiency and sustainability. Before joining Azure in 2022, Ricardo led the Systems Research Group and the Cloud Efficiency Team at Microsoft Research. During his tenure at MSR,
Starting point is 00:00:40 he created research projects in power efficiency and intelligent resource management that resulted in large scale production systems across Microsoft. Prior to joining Microsoft in 2014, he was a professor at Rutgers University where he conducted research in data center power and energy management, cluster-based systems and other cloud related topics.
Starting point is 00:00:59 Ricardo is a fellow of both the ACM and the IEEE. Incredibly, this is our five-year anniversary episode, and we want to thank all our listeners for sticking with us all these years. So Ricardo is a very special guest indeed for this milestone episode, and we were really excited to talk to him about transitioning research into product,
Starting point is 00:01:19 which he has been doing consistently at Microsoft for years. Now, leading a large group where he has teams that perform research, as well as teams that are responsible for running the business and keeping the lights on, Ricardo joined us to talk about his formulas for hiring, building relationships, and collaborating to create and maintain a true research to product pipeline. A quick disclaimer that all views shared on the show are the opinions of individuals and do not are the opinions of individuals
Starting point is 00:01:45 and do not reflect the views of the organizations they work for. So Ricardo, welcome to the podcast. We're so excited to have you here. Yeah, thanks for having me. I'm super excited. Yeah. Well, we're glad to be able to talk to you today. So let us know what's going on.
Starting point is 00:02:07 What's getting you up in the morning these days? What's getting me up in the morning? I get excited about going to work every day and working with great people, being able to do some technical work. These days I have a fairly large group, but I still have time to do some thinking and sort of work, do technical work with folks. Really strong people I work with, so it's always exciting to do that.
Starting point is 00:02:34 In the last couple of years, you've made a little bit of a transition from being in MSR to being in Azure, and your group reflects a wide breadth of people and activities and goals. So why don't you tell us a little bit about what you're up to these days and a little more depth. So today I lead the Azure Compute Efficiency and Capacity team. And you can think of it as things sort of the work related to managing our workloads,
Starting point is 00:03:07 deciding where to place it, how to move it, tracking our capacity, making sure that we have enough capacity in all capacity pockets. I'm also responsible for our intelligence platform. You may remember this, Lisa, from your time at Microsoft Resource Central and other systems that feed intelligence to other parts of the control plane.
Starting point is 00:03:34 Also, resource management, resource over subscription, and harvesting, power harvesting, power over subscription. And generally sort of innovation in the efficiency space. So how to improve the cost efficiency of Azure through sort of innovation and software hardware and data center sort of co-design or cross optimization. Yeah I do remember so for full, when I was at Microsoft, I worked in the group that Ricardo now leads. And I remember when I joined it,
Starting point is 00:04:11 because I had always thought of a system as, the scope of the system that I thought about throughout my career just kept expanding from like the caching system to like the memory system to like the system system into a whole server. And then now, and then when I got to Azure, I was like, oh, the system to the memory system, to the system system into a whole server. Then when I got to Azure, I was like, the system is the data center. The data center is this enormous thing.
Starting point is 00:04:31 I remember learning the term control playing, which is basically the OS for this data center. Where are the VMs going? Where do we want to cluster them together? Where do we want to spread them apart? Or when do we want to move them? All that stuff. And so I remember thinking of it as like, oh, this is basically the operating system
Starting point is 00:04:48 for this data spreader. And then Resource Central being this really cool thing that is essentially the brain for deciding how to be smart about all the decisions with respect to resources on this thing. Yes. Don't you also have Research Group within Azure as well that came with you? Yeah. So when I moved from Microsoft Research to Azure, I actually brought the researchers and the research engineers that were already
Starting point is 00:05:18 working on projects that we had in conjunction with Azure, like Resource Central and the Power Efficiency Project. So I brought over, I think it was 17 people or so, or 16 people. And they're still sort of part of the group, but now they're less than 10% of the whole org. But it's still sort of super exciting to have them there. Because like I said, innovation is really important to us. And they are sort of working with the product teams to come up with those innovations. Yeah. So I feel like you are relatively unique in our field in that you've kind of hit the
Starting point is 00:06:02 motherload in terms of true transition of research into production. Like now you're running like a pretty high level production team that is very customer facing as well as, but the start came from having a research team that was actually doing tech transfer. And that's kind of what people in our research field always want to do, but that last mile is super hard. We've had guests talk about, like we had Bill Daly on one of our earliest episodes, or he was like, that last mile is super hard. And you get the paper and you can be done in
Starting point is 00:06:35 academia, but like getting it into production is, I mean, I know I'm going to say super hard again, but that's because what I remember is like, it was very... Oh yeah, it is pretty difficult, yeah. Yeah, maybe you can talk a little bit about your experience on how to make that happen. It is tricky. There are multiple reasons why it's tricky. So oftentimes, you'll sort of be working on some research that is so much more advanced compared to where the product is, the corresponding product is. So there's that gap. And you want to transfer your research and you've got to somehow bridge that gap or get
Starting point is 00:07:17 to some point where the product is advanced enough that you can build on top. So that's a big challenge. Another challenge is that when you have some piece of research, when you're just writing a paper or doing a prototype, it's fairly well isolated. You set yourself up in such a way that it's pretty self-contained. But then when you get to like a real system, an enormous system like Azure, there's so many other
Starting point is 00:07:50 dependencies and things that affect your work and your work affects other things. So it's really hard to also sort of stitch these things together nicely, and it involves so many other teams. So it's also very difficult to sort of align everybody and so on. So there's a number of challenges, but the bottom line is that if you're not willing to sort of go through this process, like your likelihood that something that some research that you've
Starting point is 00:08:25 developed, that you've created and that you've worked on is going to have a much lower chance of being adopted. So the way I always set up my research group was to be ready for that kind of thing. So I hired with that in mind, I defined the research projects with that in mind. We can get into more details on these things if you like. But I just, we initially wanted to make the point that what you're saying is definitely true. And there are also aspects of my experience that are a little different than what you
Starting point is 00:08:58 normally see. In the sense that oftentimes you see researchers move to a product group, but to still do research, to still not have a production responsibility always. There are occasions in which this happens as well, but oftentimes it's just the research group that they'll lead. So one thing that sort of makes my life a little easier is that because I own part of production as well, sort of, it makes it a little easier in that sense. If I were just a research group embedded in the product team, it would be harder in a
Starting point is 00:09:40 sense because it would have at least one more person that I would need to convince to get some of that stuff in production. Got it. Yeah, I think that was a good articulation of why the research to production path is tricky and you touched upon a few different dimensions in your response there. Maybe we can double click on a few of these things. You talked about the process of taking things from research to production and you started off by highlighting that there might be a wide gap between where your research is and the state of what the production system is.
Starting point is 00:10:10 So how do you go about, number one, sort of pacing yourself in terms of how do you think about where is the right point to intercept the ideas in your research into the production system? How do you understand the biggest, maybe, pain points on the production side? And how do you work with them? Because they have biggest maybe pain points on the production side and how do you work with them because they have a set of problems on the production side and how do you find the Venn diagram intersection of the problems that your address is and what might be relevant, important, interesting and top of mind for someone in the production team. Yeah, so let me start with what I mentioned, which was you have to be careful about how you hire.
Starting point is 00:10:47 And I should sort of say this from the get-go. If we've had any measure of success at all, it has been because of the folks that I hired. I hired an amazing team of researchers and research engineers that make the rest of all of us look really good. So that is, I should give that credit because that's where it all comes from. I'm just the lucky guy who managed to find those people. So there's that component. Now, how did I hire? What was I looking for? So one thing that I always look for is folks that have the right, they're excited about the things, the right things from my perspective. They're excited,
Starting point is 00:11:34 not just in terms of doing research that is sort of super high quality and cutting edge, but also they're interested in deploying that research for millions of people to use. Without that second piece, I'm not as interested. So I always try to look for people who would be good fits culturally to what we were trying to do. That's number one. The other thing too that's really important is I give a lot of importance to the engineering side of research. In other words, I focus a lot on finding excellent research engineers who would be willing to say when there's a gap between the product and where our research is to make that investment,
Starting point is 00:12:26 to bridge that gap, to help bridge that gap. So finding the right folks who are willing to put that investment in so that we're now ready to deploy our research is really important. So those two are two sort of main factors. And they need to know from the get-go that, like I mentioned, all of this is challenging. It's not going to be, oh, let me write my paper and take off. This is a recipe for disaster because the product team will never work with you again because you're just not committed to the group's overall success.
Starting point is 00:13:05 So those are the key aspects in terms of forming a team that's going to be able to do this. Now in terms of after you have the team, how do you sort of figure out how to do your work in such a way that you're more likely to be successful in tech transfer. So what I always suggest to them is let's think about the North Star that we want to get to, the North Star research that we want to do. But as we're planning this path from where we are to the North Star, let's figure out some offshoots that we can deploy as we go along. And that is a really important thing because it keeps everybody excited and getting promoted and those things.
Starting point is 00:13:53 And people don't have to wait five years to have the first outcome of their work. But it's tricky, though. You have to find offshoots that you can deploy without getting too far off your main path. And you constantly have to adjust where things are and where you're going as you go along. So that's another very important aspect of this trajectory. So these are some of the main ways that we think about how to set ourselves up for success, both in terms of the culture, in terms of the
Starting point is 00:14:33 people we hire, and how we organize our projects. Does that make sense? Yeah, no, I think that's a good encapsulation of both the ingredients. So starting with the people and the culture that you bring about in the team, then the following aspect, which is how do you pick the right projects and how do you paste those projects so that you have these offshoots, you have near-term landings and logical conclusions, I guess, are milestones you can track that give you that sense of, okay, my work is actually translating into some impact. I can see some clear milestones and markers by which I can pace myself on the research to production transfer as well. There's one extra point actually,
Starting point is 00:15:09 which is you mentioned the pain points, right? Understanding whenever you're trying to work with the product team, be it Azure Compute or some other product, it's important to understand their pain points. What are the things that really worry them? What are the things that they feel are not ideal in terms of where they are and where they're going, usually in the short term, that people don't think sort of in the product
Starting point is 00:15:40 teams too far ahead. And what I basically usually say is that understanding is really important when defining the research that you're going to do. Not because you will necessarily try to address their pain points, but rather because it's important information. Because without that information, you will do your research in a vacuum.
Starting point is 00:16:04 And then one day when you try to go transfer that technology to the product team, the product team will say, man, we're completely in this other space here. There's no way that we can come to where you are or for you to come to where we are because we've diverged a long time ago. So having that information to be able to make informed decisions about where you're going. If you decide to disregard their pain points, well, you've decided it in an intelligent manner, in an informed manner, not because of ignorance.
Starting point is 00:16:40 You just didn't know. I think that's a very pertinent point, which is you want to understand the context behind their problems, what are their current pain points, because solving or picking the right problem is about 50% of the battle, as some people might say. So yeah, so you have a context on the pain points in a production team, and obviously they have certain aspects or certain dimensions of the problem that are near term, and maybe they have a window into, okay, these might become problems further down. And as you said, they might not have a vantage point
Starting point is 00:17:09 or interest in pursuing things that are too far out because maybe the space changes very rapidly as well. So within this space, how do you think about what's the right timeline of problems that you want to actually tackle within a research setting? And number two, how do you sort of couple that with the right partnerships on the production side? How do you sort of set up this partnership
Starting point is 00:17:27 so that you have that feedback loop going, so that you have the right context? The context keeps evolving, especially in current times, the space evolves fairly rapidly. So how do you figure out, okay, what's the right timeline at which I want to tackle certain problems relative to where the production team is today or where the production teams are currently.
Starting point is 00:17:46 And then the next step is, of course, how do you set up these collaborations and partnerships so that they are also invested in this, involved in this, and you have the right feedback loop so that you understand, are you on the right trajectory? Are you still solving the most important problems? Has something changed on the other side of the landscape that will need you to also shift directions in terms of what you're pursuing? So how do you think about those dimensions? Right. Yeah, you touched on a critical piece of the puzzle, right, which is partnerships,
Starting point is 00:18:13 right? Sort of identifying the right people, people who are more interested in innovation, more interested in sort of thinking longer term on the product side, is super critical. I was super lucky that I had a partner in the product teams that, and Marcus, Lisa, your former boss, he was a great partner throughout for me and we worked great together. We had the same interests, complementary skills, but the same interests in working together to do these things, to advance and bring innovations to Azure. So identifying the right partner in the product team is very critical.
Starting point is 00:19:14 The other piece is also how to work with those partners, at least in my experience. And everything I'm speaking about here is my own experience. If you ask somebody else who has had research transferred to products, they might have different perspectives on things. I'm simply offering my own. But in my experience, the way to interact with the product team and those partners is there's no point in coming in and saying, oh, here's the five, 10 year plan. They will care very little for the five, 10 year plan because they care about the one year plan.
Starting point is 00:19:56 So what I usually have found most useful is to do things sort of incrementally, right? And say, oh, here's the, forget, I'll keep my North Star and my paths to myself. And I'll talk about what is the next step and focus on that. And then as you work on this next step, you sort of develop trust and you develop a good working relationship and so on and after you accomplish or you're close to accomplishing this first step then you start discussing the next step and so on. There's very little point in sort of scaring them off saying oh here's what we're gonna be doing five years from now. They'll say, no, forget it.
Starting point is 00:20:45 You're nuts. Let me focus on my problem right now. So this goes to what you're asking in terms of how you stage things within those partnerships. Yeah, so I think it might be worth saying that you talked about the types of people that you hired for. You want somebody who's going to be a great researcher, has curiosity, has that kind of mind that can try and solve problems that have not been solved before.
Starting point is 00:21:14 But at the same time, in the Venn diagram, somebody who is not just interested in pursuing ideas but is interested in building things and making sure that they're actually deployed and used. So that's like a, but it would be a good production level engineer and then diagramming them, that's already hard. And then this third piece that you basically were saying is you need someone who can sort of read the room and communicate and build trust with other teams. So that's like a really tough thing to find.
Starting point is 00:21:44 And so I can see how you say you've hired great people because getting all three is hard, which is probably why this thing that I kept saying was super hard. Like you've managed to be quite successful with. And of course, Marcus is wonderful. You guys had an amazing partnership to watch that in play was like, wow, this is a very, very functional relationship. And that's amazing. So maybe we can ask a little bit now, now that you've been there a while, very functional relationship and that's amazing. So maybe we can ask a little bit now, now that you've been there a while, you've had your time in MSR, you've had your time to transition over into Azure and being on both sides, what would you say, I probably shouldn't ask you to pick a favorite child, but I'm kind of asking you to pick a favorite child. What is
Starting point is 00:22:22 one within the more impactful projects that you've brought to bear that you just feel really proud of? I have already mentioned Resource Central. That was a really interesting one because it was a very early project in terms of using ML or AI for systems. If I'm not mistaken, it might be one of the first, or if not the first, in terms of cloud platforms and sort of introducing these capabilities in production. So I have been very proud of it, and it's still going strong.
Starting point is 00:23:04 It's got more than 20 scenarios that it feeds predictions for and so on. So it's sort of exciting to see how it developed from an idea in front of a whiteboard to eventually becoming something that's critical to Azure. That's one. Another one that has been, we have had enormous amount of success with is the Power Efficiency Project.
Starting point is 00:23:37 Really when we started the project back in 2016, honestly, Microsoft was not in a good spot in terms of the ability to manage power and so on. So we brought it from, in collaboration with the other teams, of course, I'm not talking about just my team, but we work very closely with the folks who build and design and operate the data center. That division works very closely with us. The folks who do hardware, that sort of design the hardware and so on.
Starting point is 00:24:15 So it's a broad sort of coalition of people working in this space. But nevertheless, we were able to make sort of enormous changes to how things were done. And now we're in a much better space in terms of the ability to recover power that was going underutilized, the ability to sort of do very targeted power capping, for example, when it's necessary to do. So we, for example, have per-VM power capping that was introduced from the research team and then moved to production. We have many other things that run power rebalancing and use of reserve power capacity. And so a number of different efforts that turned out really great for Microsoft.
Starting point is 00:25:11 So those two are, I would say, the two main ones that come to mind. We have many other things, of course, but those two are sort of very dear to me because they've lasted a long time. Yeah, those two are. We keep innovating. Within those two projects, we keep introducing new ideas and new systems. Those are really good examples, I think. Yeah, yeah, yeah, for sure.
Starting point is 00:25:37 Those are quite mature. The fact that they're still around and adding value and continuing to add even more marginal value is a testament to that. And I just wanted to mention, like, kind of bring Resource Central specifically first back to like all the things you were saying about context and all that is when I came on and I learned about Resource Central, I remember thinking that it was like very thoughtfully designed. So you could easily imagine a project like Resource Central going in two totally different directions depending on the execution. So Resource Central, the
Starting point is 00:26:12 basic idea is it pulls a lot of operational data from what is happening in the data center and it uses that to feed ML so that future decisions that have to be made, you can ask Resource Central, like, should I put this here? Should I put this VM there? Should I? All sorts of questions that it can now, it feeds. So you can imagine as a research project where it's like, oh, what if we grabbed a whole bunch of information
Starting point is 00:26:38 and made some decisions off of it? Where you do that in a vacuum such that when it comes time, yes, maybe in theory, you can get a lot of inputs to this thing, and you can make a lot of decisions, but you've built it in such a way that you actually can't then integrate it into the real system. If you present it fully formed and without context on how the architecture of everything is at the end, then it's kind of like, okay, that's a great paper. But the fact that it was sort of thoughtfully designed from the beginning with an understanding of where it could potentially sit in an actual architectural workflow.
Starting point is 00:27:17 So I'll harken it back to some computer, like classical computer architecture stuff, which is I remember as a grad student reading a paper, or reading papers where people would talk about making decisions in like the last level cache, the L3 cache, based off of the program counter. Meaning you have to get program counter information all the way down into the L3, which is not really, I mean, people maybe have figured out ways to kind of fake it, but like you're not going to pass that many bits all the way down to the
Starting point is 00:27:52 L3 in order to help you feed your decisions. So it was one of those things where like in theory, that's great, but like you actually can't get that information. So anyway, taking it back to that, we're thinking like intellectually, you could imagine taking something like a PC and having that help you make decisions, replacement decisions at the L3, but in practicality, you actually can't get that information all the way down there very easily. So something like Resource Central, very similar, you could have all sorts of intellectual thoughts on all the things that you might want to feed that information, feed Resource Central.
Starting point is 00:28:22 But if you don't have a good pipe, then you might as well not put it in. It just seemed like Resource Central was built in such a way that all the inputs are actually feasibly feedable. And then the outputs, so the decision making part of the pipeline was also feasible. And I just remember thinking like, that is well done. Yeah. And so that speaks to the context that you were talking about before. Right. I think we made a couple of key decisions in Resource Central that really enabled it to flourish quickly. Sort of, I think, it has to do with what you're talking about, which is defining exactly what was the right level of abstraction, which other parts of the control plane and other parts of Azure are able to interact with Resource
Starting point is 00:29:14 Central. We had to, because we wanted to apply it in a number of different scenarios, widely different scenarios, we needed to define a set of abstractions and a level of interfacing with Resource Central that was low enough that it would be useful in all of those scenarios. Because if you raise the abstraction too much, it would become too tied to each of the scenarios. So for example, Resource Central provides predictions of expected blackout time for a live migration. It doesn't try to say, oh, this is how you should live migrate or this is where you should put the VM or anything like that.
Starting point is 00:30:00 It simply gets asked, what is the expected blackout time for this VM? It replies with a prediction. All the smarts about what to move and how to move it and so on, is all higher level in the light migration engine. Similarly, the VM allocator asks Resource Central for a prediction of the lifetime, how long a VM is going to live. And it factors that information into its decision about where to place and how much time to
Starting point is 00:30:36 spend on it. So again, Resource Central makes no decisions about how to place a VM. It simply gets asked, what is the prediction for the lifetime of this VM? And it gives back a prediction. You see what I'm saying? So we define the abstraction and the level of interaction that's low enough so that it can be applicable to any scenario very quickly.
Starting point is 00:31:02 And it doesn't interfere, there's the separation of concerns with all of these different scenarios. So that was a critical decision that we made early on that, like you said, made it so much easier to integrate. Of course, there are extra complexities because we didn't want Resource Central to be on the critical path. So when making calls to Resource Central, we made sure that whenever we could, we made sort of parallel calls to Resource Central so that if Resource Central did not reply in time,
Starting point is 00:31:36 it wouldn't slow down the critical path for the allocator or for the light migration engine and so on. So there were extra complexities that we had to deal with, but this was a critical way to sort of integrate it into the rest of Azure. And these days we apply it even to other services. They're not even part of the control plane. So now we have version of Resource Central and Ring 1 that we call it,
Starting point is 00:32:07 where it runs in regular VMs and so on, so that things that are not in the control plane, services that are not in the control plane can also use it. Yeah, that makes a ton of sense, and that does seem like a really, really critical decision. And again, I kind of want to hammer home, like if you hired researchers who were really interested in the idea, can we use sort of production metadata to inform further control plane production decisions?
Starting point is 00:32:41 That as a purely intellectual exercise does not necessitate making that kind of a call and that kind of abstraction decision early on. Because if you're focused on just like, can we do it and can we publish a paper that shows we can make a difference, then you don't need to think about that yet. But because you started with the explicit goal early on of, we wanna be research, but we also want to make sure we do tech transfer by sort of folding that ethos in early. Then you make that call early, and that sort of paves the way for you to be effective. As you, like, so you've still answered the question,
Starting point is 00:33:20 can we make, can, but you still answered the intellectual question, but you didn't paint yourself into a corner where you couldn't then leverage it in a production context? That's right and you were touching on something else that is so critical in sort of the ability to think through how to integrate research into production which is one thing that researchers and research engineers even don't normally think about is simplicity is king because you can't have PhD students and folks with PhDs maintain code in Azure or in any other production system. This is just not a viable approach.
Starting point is 00:34:09 You need to define things and scope them in such a way that there are these simple pieces that can be deployed. I often joke that the day that I realized that I was decent at my job was when I could look at a paper and say, oh, this piece here, this 30% of the paper, I can actually deploy the rest, sort of intellectual exploration that's super necessary, that advances knowledge. that's super necessary, that advances knowledge, but it might not be sort of deployable right away. So understanding this transition, understanding what is the piece that is more easily deployable and starting with that, I think is critical. And the other important thing too is to realize that in sort of the research that we do and
Starting point is 00:35:08 so on, we don't feel like we always have to transfer 100% of it. Because if you think about it, a lot of the time that we spend in research is to try to squeeze the last every little bit of goodness of any idea. But in production, that's not necessary. Something that is good and simple is much better than something that is maybe a little better, it's perfect, but it's complex. So if you're able to get 70% of the goodness of something,
Starting point is 00:35:41 that's a win, that's a major win. Forget the extra 30%, that extra 30% will oftentimes introduce complexity that might make the whole approach be invaluable for a production team. Yeah, especially at a hyperscaler like Microsoft. It's so large and academic papers don't account for things like data center tech time or data
Starting point is 00:36:07 center tech cost. It just is like a graph of goodness. And so that last 30% matters in an academic paper. But as you say, if complexity makes it so that other costs that are not accounted for in the paper are accounted, then it becomes infeasible. So maybe this would be a good interesting time to slightly pivot to, speaking of costs that are not necessarily accounted for, carbon costs. So historically, those have not really been accounted for.
Starting point is 00:36:37 And so I know your group is starting to look at that. We've talked a little bit about Resource Central and all the power work which is relatively mature and has a lot of impact. Maybe now we can redirect a little bit to stuff that's slightly less mature and ongoing. Yeah. So, the way I think about the carbon space and sustainability is twofold. From one perspective, efficiency and all the work that we do to better utilize servers, better utilize data centers have a direct impact on scope three emissions, right? Or embodied carbon, because if we improve utilization of the infrastructure,
Starting point is 00:37:23 we buy fewer servers. We build fewer data centers. So that reduces the amount of embodied carbon that we put out there. So that's one perspective. And that is the direct benefit. There are other benefits. When you do that, you also happen to improve scope two emissions as well and even scope one because of transportation and other factors.
Starting point is 00:37:48 So doing efficiency work has in itself a pretty broad sustainability benefit. The other way to look at it too is there are things that we can also do that are beyond just efficiency work. There are things like carbon aware scheduling of work, either in time or in space. You can sort of decide to run certain AI training, for example, is a delay in sensitive workload that you might decide to run during a time that the grid mix is more favorable, they're more renewables. There's a lot of batch inference workload that can be run that way as well. And because AI inference is like a SaaS workload, software as a service that workload, oftentimes you can move requests geographically, right?
Starting point is 00:38:54 To take advantage of more renewables and so on. So there are aspects of carbon awareness that go beyond efficiency as well. So we are working in that space, working on things like some good methodologies, right, for carbon accounting, both Scope 2 and Scope 3 carbon accounting is one example. And feeding that information back so that customers of Azure can see that information and make decisions for themselves in terms of the carbon footprint of their workloads. So we're definitely working in that space and in many other areas too.
Starting point is 00:39:38 This is just one example. You touched upon a few different themes here. The first one is, I guess, the importance of metrics overall. And this came up even in the context of our discussion with our prior guest, Carol Jean Wu, who talked about in the context of sustainable AI or accounting for carbon footprint and so on, just having visibility into the data is a huge step forward. The other part that you talked about very briefly was in the context of developing solutions for that intersect with AI and carbon efficiency or power efficiency. It touches like multiple regions of the
Starting point is 00:40:10 stack. So for example, you talked about how AI workloads could be moved between different geographical regions and that's a theme that's come up in some of Google's papers from my colleagues here as well, where you could move a training job to a location that has access to, let's say nighttime wind energy. And so your carbon emissions are correspondingly lower there. So can you talk a little bit about both of these themes
Starting point is 00:40:33 in terms of metrics and data and any efforts in this particular space from your group very broadly into getting more data out there for either researchers to play with or otherwise. The second part was, how do you think about sort of co-designing across multiple layers of the stack, going all the way up to the data center, energy grid, and interactions?
Starting point is 00:40:52 So you make a good point, right, that Google has had some work on this that has been really interesting. We're looking at those kinds of things as well. But starting with the data issue, right, that you brought up and Carol mentioned that too, this is something that I think a lot about. Like today for scope three, for example, there are not, there's no sort of agreed upon methodology
Starting point is 00:41:20 for quantifying these things. So it's very difficult to compare across cloud providers, for example. And even if we were to settle on life cycle analysis as the way to, or as the right methodology for accounting for this, right, then life cycle analysis basically looks at the entire lifetime of equipment and from supply chain and all of these pieces and during the use of the equipment, all the carbon emissions throughout.
Starting point is 00:41:53 So, even if we were to all agree that that is the approach, there's data quality problems, there is sort of inability to get certain data from different vendors. And so there are boundary conditions that would have to be defined very carefully and so on so that we are able to compare across different vendors and different providers. So this doesn't exist at all today. We're going to have to, as an industry, sort of work together with academia and other folks to define what is the right methodology and what are the right boundary conditions. So if you look at things like PUE, for example, or power usage effectiveness,
Starting point is 00:42:41 that was a great way to be able to compare things. A very simple formula that is pretty well defined, although there are still issues with it. At least there was a way for everyone to be able to compare their efficiency in terms of the use of power, comparing sort of the IT power to other overheads and so on. So this doesn't exist today for scope three emissions. So that's something that it's gonna have to be addressed. So on data, that's the data quality and so on,
Starting point is 00:43:21 that's the main thing I worry about. On the other piece, sort of in terms of efforts from my group, like I mentioned, we are sort of exploring, not exploring, we're actually collecting data and sort of generating models and so on to surface them through our different tools, the Azure portal and internal tools as well, to surface what are the actual scope-to
Starting point is 00:43:55 emissions of different deployments of VMs. And we're also sort of working with other teams on geographical distribution of inference requests and things like that to maximize the use of green electricity. So those two are two examples of things we're looking into. And we have other, we already have infrastructure that's able to deploy VMs during off-peak hours, for example, that we can leverage and so forth. So those are some of the things we're working into.
Starting point is 00:44:29 Besides all of the efficiency work that I mentioned before. No, that sounds like a really broad slate of problems and directions to pursue. Maybe this is a good time to sort of wind the clocks back and talk about your trajectory on how you got to Microsoft, what got you interested in compute architecture and computer systems. Tell us a little, tell our listeners a little bit about how you got into this particular space.
Starting point is 00:44:54 Yeah, okay, let me, let me go back a little bit. I was, sorry, I was born in Brazil, right, in Rio and went to college over there. And at some point during college, I wasn't taking it super seriously. And I had some issues in my family. I lost my dad sort of during my college years, and that threw me off completely. But at some point, my college years and that threw me off completely. But at some point during that time, I actually had this good friend and his dad was a Stanford professor, and had been a Stanford professor for a while and that was really exciting to me and so on. The notion of doing research and So, the notion of doing research and sort of tackling problems that nobody knew the answer to and so on really got me excited and made me become a good student and finally
Starting point is 00:45:55 decide for computer science. Yeah, sort of to do research, right? To do a PhD in computer science. And sort of my trajectory during the PhD was a little strange because I wanted to do research, to do a PhD in computer science. And sort of my trajectory during the PhD was a little strange because I wanted to do computer architecture, but my advisor was not in computer architecture. So I had to sort of fend for myself
Starting point is 00:46:16 and learn a lot of things. So I worked on sort of parallel machines at the time and cache coherence and so on. And then over time, I became more and more interested in software and things that sort of bridge the gap between software and hardware. So I started working on software DSM or distributed shared memory, and then eventually cluster level systems and so on. So after finishing my PhD,
Starting point is 00:46:54 I went back to Brazil and was there for several years, and then decided to come back to the US to be a professor at Rutgers University. And during that time, that's very early on, I started getting interested in power and energy and data centers. And so my group and I sort of wrote one of the first few papers on in this space and kept working on it over a period of time until at some point David Tannehaus, who used to be a corporate VP here at Microsoft, reached out to me asking whether I would want to come and work on efficiency problems in Microsoft. And it was a great timing because I felt like I had been in
Starting point is 00:47:47 academia for a while and I needed the new challenge. So I came over to Microsoft, initially a two sister organization to Microsoft Research, but then that organization got dissolved so I moved to MSR. And was that an MSR for almost eight years, I think. And working at MSR was really great. I had great support from my managers. I'm not going to name them all here because I'll sure forget one or another and they'll be mad at me. But sort of managers and collaborators. And sort of during that time, we started working with different product teams, including Azure, Azure being the main one.
Starting point is 00:48:36 And then sort of at some point, my predecessor was leaving, Marcus Fontore, that I mentioned before, was leaving Marcus Fonteur that I mentioned before was leaving Microsoft. So his boss sort of asked me to whether I wanted to move to Azure. And that's when it happened. And that transition from MSR to Azure was in July of 2022. So that's basically my trajectory from sort of college. This is probably more than what you had hoped for, but that's not what we've got here. No, I think origin stories are important. And we always like to touch on them during the podcast because not everybody comes to where they are the same way and not everybody
Starting point is 00:49:23 comes in a straight line. And so I think it's important because our listenership, I believe skews young, a lot of students. And so I think these stories are important. You don't have to have decided at the age of five that you love computers and like that's the straight shot all the way. So yeah, I think the fact that you figured you wanted academia starting in college
Starting point is 00:49:49 and wanted to do research and then found yourself now leading a very large production and research organization in a gigantic company is kind of an unusual path in and of itself, I suppose. I guess nowadays there's plenty of people who are former academics who are in industry, but maybe not as many who are running large production teams. Yeah, definitely. I also joke about the fact that I've had every possible job almost.
Starting point is 00:50:25 Because in Brazil, even before, during my college years, I worked at a startup there. So I've worked for companies, I've done research and now I do production. So I've had the pretty broad exposure to different things. And to be honest, like this today, I think is the happiest I've been in terms of all the things that I get to do on a daily basis. I really love my group. I love the work that we do. I love the people I work with. Our management chain is fantastic.
Starting point is 00:51:03 So I'm really happy right now about everything. Sort of the scope of the group is all about things that I enjoy doing and so on. So it took a while to get here, but I really like it. Well, that's wonderful. Congratulations on winning at life. That's not easy to be able to say that. Yeah, no. And I have to say, I've been so lucky. I've had some bad breaks in my life, but sort of work-wise, I've been really lucky to work with amazing people and sort of enjoy it a lot. Yeah, yeah, for sure.
Starting point is 00:51:50 And so this notion of luck is very interesting. I just read an article about luck where it was like, is luck an actual thing? It's actually an understudied topic in science about luck because it's so mystical. And the article was kind of saying that luck can be perceived as, it is partially how things are perceived. Like, are you lucky or do you just view your life in a lucky lens because all the good things that happen to you, you cast it to luck? I don't know, but certainly I think luck does often find people who are prepared.
Starting point is 00:52:27 It's like if an opportunity falls in your lap and you're unprepared to take it, whether because you're unable to or can't manage to or whatever, then luck favors the prepared. That's a saying for a reason. So I guess that having been someone who has been blessed with, as you say, luck, but I would also say good preparation and hard work, maybe you can share some insights or advice to our audience as well. If you had to give one piece of advice or two or whatever, some brief amount of advice to our listeners, what would you say? Yeah, I can talk a little about how... So I work with a bunch of sort of younger people
Starting point is 00:53:11 who are starting their careers and so on, and they will often sort of ask, oh, how do I deal with this issue or how do I deal with that issue and everything else. I think one piece of advice that I always give is communicate. Oftentimes people come to me sort of worried about their relationship with their manager or how they're working with their colleagues or those kinds of scenarios. The advice that I always give is talk it out. Be upfront, be honest, go in sort of in a way, sort of go into a conversation in an honest and upfront way to try to solve the issues. I think people find that when you do behave that way, sort of all defenses sort of go away and people are sort of more willing to have empathy and accommodate and sort of work together.
Starting point is 00:54:15 And this is what I do as well. Sometimes I might feel or sometimes I feel naive in a way I might feel or sometimes I feel that naive in a way because like I keep saying, oh, this is all simple. Let's just talk it out. Let's find compromise. Let's find areas where we can agree and so on. And it has worked well for me. It might perhaps not work for everybody, but this is the advice that I always give.
Starting point is 00:54:44 Yeah, I think so two things. One is that Microsoft itself, I think, is one of the kindest companies, sort of culturally, that I ever worked for. I know it's not the same. I mean, it hasn't been the same company the whole during its entire 50 year existence. But for the time that I was there, it was a very kind and empathetic environment. And so it's the kind of place that probably does foster the ability to have this kind of talk it out type of solution. So I can see how that might not be as useful in other places
Starting point is 00:55:21 where people do get out instead of talking out, But I always say Microsoft is great and I actually think that it's great because Microsoft was a very like a generally functional work environment in my opinion. But I think the other thing, what you were saying is like when you're doing your research you want to find context. You want to solve problems given a certain context. And if you view interpersonal work relationships as yet another research problem to be solved, like you want to find context to help you solve the problem. So like communication is what allows you to extract context,
Starting point is 00:56:01 not just technical context, but interpersonal context. So if you just like think, oh, this is another thing that I have to solve, how do I find out information? Talking it out is the way. Right. So, yeah. That's a really interesting observation. I had never thought about it that way. Maybe that is what I do. Maybe I bring this research perspective or research approach to addressing problems
Starting point is 00:56:27 that is helpful. I don't know. That's a good point. But what you said is right too, that Microsoft's culture today at least, at least in terms of the space and the teams that I work with and so on is sort of more conducive to that kind of thing than other places I've heard about. So maybe it's a good match in a sense. But if I had to be always duking it out, it probably gets tiring at some point. I'm sure. I wouldn't enjoy it. I suppose people find their homes where they find their homes. So given that you've, as you've just mentioned, have been in every possible job there is, the thing with production stuff, we've alluded to it a little bit during our conversations where
Starting point is 00:57:26 the production has a certain role, which is they have to keep the lights on, they have to keep everything going, and they have to ship. So very naturally, they don't usually want anything to distract them from their number one mission and their reason for existing, which is to produce products. And so it is a little bit of a different ethos than research, which is pushing boundaries, exploring. Maybe you can talk a little bit about, do you have a different approach for how you deal with your production-side employees versus your research-side employees?
Starting point is 00:58:03 And now that you potentially have to hire for, you talked a little bit about how you hire for the research side, but do you hire differently for the production side? So maybe compare and contrast a little bit the difference between leading a production versus leading research. Yeah, there are differences there. The way you manage a research team has to be different than a production team. The production team is very comfortable with very structured, in fact, not just comfortable, but it needs a very, very good structure of execution, a very good structure of scoping.
Starting point is 00:58:47 And everything needs to be really well defined, especially for the more sort of early in career folks that sort of work in the team. Whereas for a research team, like trying to impose that kind of structure doesn't make any sense. It's a recipe for losing them all. You still have to manage them basically the same way as if they were sort of in Microsoft Research or they were in academia to a large extent. You have to give them freedom, you have to give them
Starting point is 00:59:27 the ability to explore. You can't be saying, oh, what is, when are you going to deliver this and when are you going to deliver that? That's not the way to interact. Whereas on the production side, it's all about that. It's like, oh, what are we going to accomplish this semester? What are the things that we're going to cut? What are the things that we're going to prioritize? So it's very, very structured, very, very focused on execution, executing well, especially for a group like mine, where the, so the company depends on us for increasing gross margins, for being able to recover enough power, for a number of things, for sort of becoming more efficient and buying less infrastructure and so on.
Starting point is 01:00:25 So then there's a lot of pressure on us to deliver on this. So it has to be like super well structured and with targets, with metrics, with like things that we track over time to make sure that we're not sort of falling behind or we're not deviating from targets that we set for ourselves. So all of that is super critical on the production side and not really a thing for research. Yeah, maybe I can flip the script a little bit. So early on, we talked about from a researcher's vantage point, like how should they be empathetic towards the production games goals and pain points and so on.
Starting point is 01:01:06 And like now that you sort of manage both a production and a research org, like all under the same organization, from the other side, like how do you communicate to production teams on the value of research explorations, things that are necessarily, as you said, unstructured, can sometimes seemingly hit dead ends or you might learn something out of it but may not translate into something actionable or clearly something that increases your gross margins or revenue or any other thing that you can measure. So how do you do that handshake on the other side of the EL as well?
Starting point is 01:01:36 That's a great question because there is oftentimes, right, especially for companies or groups that don't have that culture of interacting with researchers, there might be this bias. Oh, those people are not working on real things. They're just enjoying whatever the sexy thing is that they're working on, but they have no accountabilities and so on. So there's a lot of this type of bias that we do need to work on and sort of highlight to
Starting point is 01:02:13 the production teams all the goodness that the research team can bring. For example, so I often will mention, look, if there are sort of good ideas that you've had but you're worried that they're too risky and so on, a research team is a great team to go de-risk things, right? To go explore, to go really dig deep into ideas and into new directions that the production team is not either able or willing to do at a certain, at any point in time. So that's one way to sort of lower these barriers. Another way is to highlight the importance of innovation and how maybe a product that's running behind a competitor can sort of leapfrog the competitor
Starting point is 01:03:08 by introducing a new feature or a new idea, a new approach to things that the competitor doesn't have. Or maybe it's a way to become so much more efficient than the competitor, and this will make a big difference in terms of everybody's, as a stockholder and so on, everybody's compensation. So that's another approach, highlighting the importance of innovation. And the third one is to create a culture where there is close collaboration, right? After people sort of understand what researchers do and how they can contribute better, they tend to sort of lower these barriers and eliminate these biases because they say, oh, I see,
Starting point is 01:04:03 they are willing to work with us on things that matter to us as well. And they'll be here for the long haul. They're not gonna take off after they get their paper written. So it takes some work, but after you've created this culture, then it sort of is a lot smoother, actually, than you would think. Yeah, and I would actually say one thing is that creating the culture often involves,
Starting point is 01:04:33 in some ways, performance metrics too, where if you build a research team where you're purely evaluated on the number of papers you produce, or your like age index or something like that, evaluated on the number of papers you produce or your like age index or something like that. Yeah, let's let's say then then then it becomes then the researchers themselves are not incentivized to behave in the way that you've laid out as a way to be successful text transfer. And so you do have to sort of provide a culture where like you will be rewarded for helping the production team with their problems in order to build trust. And then the production team sees your value and then you can work together on bigger and
Starting point is 01:05:11 bigger issues. And same thing on the production side where it's just like, you ship this and if you deviate, then you're hosed versus you are willing to accommodate and innovate and make the product better, then that is also rewarded. So it feels like, I don't know, it just occurred to me that like performance metrics are a part of building this culture as well. Yeah, definitely. You're exactly right.
Starting point is 01:05:39 If you reward just for the behavior that you don't want. Guess what? You're gonna get what you don't want. So you really have to target, in terms of the incentives, they have to be aligned with the culture that you want to develop. And in our case, I can't complain at all. The relationship between the two sides of my org, the research and the production side,
Starting point is 01:06:05 like couldn't be better. They work extremely well together. And I think this also comes from sort of the management chain being supportive. And now I have a manager to the research team and he gets along great with the other people on my leadership team. So this relationship has been really good for a long time.
Starting point is 01:06:34 That's awesome. Ricardo, this has been such an interesting conversation. I think we've addressed so many different topics, research, production, specific projects that you've worked on, advice, your career path. So it's been a really enjoyable conversation. Thank you so much for joining us today. We've really had a good time talking with you.
Starting point is 01:06:55 Yeah. Thank you so much for having invited me. I had tremendous fun chatting with you two and I hope we can make an interesting episode of this conversation. Absolutely. Like echoing Lisa, this is a fantastic distillation of many years of hard-won wisdom, straddling both production and production. So thank you so much for sharing that with us. And to our listeners, thank you for being with us on the Computer Architecture podcast. Till next time, it's goodbye from us.
