Computer Architecture Podcast - Ep 20: The Tech Transfer Playbook – Bridging Research to Production with Dr. Ricardo Bianchini, Microsoft
Episode Date: June 17, 2025
Dr. Ricardo Bianchini is a Technical Fellow and Corporate Vice President at Microsoft Azure, where he leads the team responsible for managing Azure's compute workload, server capacity, and datacenter infrastructure with a strong focus on efficiency and sustainability. Before joining Azure, Ricardo led the Systems Research Group and the Cloud Efficiency team at Microsoft Research (MSR). He created research projects in power efficiency and intelligent resource management that resulted in large-scale production systems across Microsoft. Prior to Microsoft, he was a Professor at Rutgers University, where he conducted research in datacenter power and energy management, cluster-based systems, and other cloud-related topics. Ricardo is a Fellow of both the ACM and IEEE.
Transcript
Hi, and welcome to the Computer Architecture Podcast,
a show that brings you closer to
cutting-edge work in computer architecture
and the remarkable people behind it.
We are your hosts. I'm Suvinay Subramanian.
And I'm Lisa Hsu.
Our guest for this episode was Dr. Ricardo Bianchini,
who is a technical fellow and
corporate vice president at Microsoft Azure,
where he leads the team responsible for managing
Azure's compute workload, server capacity, and data center infrastructure,
with a strong focus on efficiency and sustainability.
Before joining Azure in 2022,
Ricardo led the Systems Research Group and the Cloud Efficiency Team
at Microsoft Research.
During his tenure at MSR,
he created research projects in power efficiency
and intelligent resource management that resulted in large scale production systems
across Microsoft.
Prior to joining Microsoft in 2014,
he was a professor at Rutgers University
where he conducted research in data center power
and energy management, cluster-based systems
and other cloud related topics.
Ricardo is a fellow of both the ACM and the IEEE.
Incredibly, this is our five-year anniversary episode,
and we want to thank all our listeners
for sticking with us all these years.
So Ricardo is a very special guest indeed
for this milestone episode,
and we were really excited to talk to him
about transitioning research into product,
which he has been doing consistently at Microsoft for years.
Now, leading a large group where he has teams that perform research,
as well as teams that are responsible for running the business and
keeping the lights on, Ricardo joined us to talk about his formulas for hiring,
building relationships, and collaborating to create and
maintain a true research to product pipeline.
A quick disclaimer that all views shared on the show are the opinions of
individuals and do not reflect the views of the organizations they work for.
So Ricardo, welcome to the podcast.
We're so excited to have you here.
Yeah, thanks for having me.
I'm super excited.
Yeah.
Well, we're glad to be able to talk to you today.
So let us know what's going on.
What's getting you up in the morning these days?
What's getting me up in the morning?
I get excited about going to work every day and working with great people, being able
to do some technical work.
These days I have a fairly large group, but I still have time to do some thinking
and sort of work, do technical work with folks.
Really strong people I work with,
so it's always exciting to do that.
In the last couple of years,
you've made a little bit of a transition
from being in MSR to being in Azure,
and your group reflects a wide breadth of people and activities and goals.
So why don't you tell us a little bit about what you're up to these days in a little
more depth.
So today I lead the Azure Compute Efficiency and Capacity team.
And you can think of it as sort of the work related to managing our workloads,
deciding where to place it,
how to move it, tracking our capacity,
making sure that we have enough capacity in all capacity pockets.
I'm also responsible for our intelligence platform.
You may remember this, Lisa,
from your time
at Microsoft: Resource Central and other systems
that feed intelligence to other parts of the control plane.
Also, resource management, resource oversubscription,
and harvesting, power harvesting, power oversubscription.
And generally sort of innovation in the efficiency space. So
how to improve the cost efficiency of Azure through sort of innovation and
software hardware and data center sort of co-design or cross optimization.
Yeah, I do remember. So for full disclosure, when I was at Microsoft,
I worked in the group that Ricardo now leads.
And I remember when I joined it,
because I had always thought of a system as,
the scope of the system that I thought about
throughout my career just kept expanding
from like the caching system to like the memory system
to like the system system, into a whole server.
Then when I got to Azure,
I was like, the system is the data center.
The data center is this enormous thing.
I remember learning the term control plane,
which is basically the OS for this data center.
Where are the VMs going?
Where do we want to cluster them together?
Where do we want to spread them apart?
Or when do we want to move them?
All that stuff.
And so I remember thinking of it as like, oh, this is basically the operating system
for this data center.
And then Resource Central being this really cool thing that is essentially the brain
for deciding how to be smart about all the decisions with respect to resources
on this thing.
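To make the "operating system for the data center" analogy a bit more concrete, here is a minimal, hypothetical sketch of the kind of placement decision such a control plane makes. The names, fields, and scoring weights are invented for illustration and are not Azure's actual allocator logic.

```python
# Hypothetical sketch of a control-plane placement decision: given a VM
# request, score candidate hosts and pick one. Real allocators weigh many
# more signals; names and weights here are invented for illustration.
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    free_cores: int
    free_gib: int
    vms_in_same_group: int  # VMs of this tenant/group already on this host

@dataclass
class VmRequest:
    cores: int
    gib: int
    prefer_spread: bool  # spread for fault tolerance vs. cluster for locality

def score(host: Host, req: VmRequest) -> float:
    # A host that cannot fit the VM is never eligible.
    if host.free_cores < req.cores or host.free_gib < req.gib:
        return float("-inf")
    # Prefer tight packing to keep utilization high...
    packing = -(host.free_cores - req.cores)
    # ...and either spread a tenant's VMs apart or cluster them together.
    affinity = -host.vms_in_same_group if req.prefer_spread else host.vms_in_same_group
    return packing + 10 * affinity

def place(hosts: list[Host], req: VmRequest) -> Host:
    best = max(hosts, key=lambda h: score(h, req))
    if score(best, req) == float("-inf"):
        raise RuntimeError("no host has enough capacity")
    return best

if __name__ == "__main__":
    hosts = [Host("h1", 16, 64, 2), Host("h2", 8, 32, 0)]
    print(place(hosts, VmRequest(cores=4, gib=16, prefer_spread=True)).name)  # -> h2
```

The real system weighs far more signals (capacity pockets, fault domains, maintenance windows, predictions from systems like Resource Central), but the shape of the decision, score candidates and pick one, is the same.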
Yes.
Don't you also have a research group within Azure as well that came with you?
Yeah. So when I moved from Microsoft Research to Azure,
I actually brought the researchers and the research engineers that were already
working on projects that we had in conjunction with Azure,
like Resource Central and the Power Efficiency Project.
So I brought over, I think it was 17 people or so, or 16 people.
And they're still sort of part of the group, but now they're less than 10% of the whole org.
But it's still sort of super exciting to have them there.
Because like I said, innovation is really important to us. And they are sort of working
with the product teams to come up with those innovations.
Yeah. So I feel like you are relatively unique in our field in that you've kind of hit the
motherlode in terms of true transition of research into production.
Like now you're running like a pretty high level production team that is very customer
facing as well as, but the start came from having a research team that was actually doing
tech transfer.
And that's kind of what people in our research field always want to do, but that last mile
is super hard.
We've had guests talk about, like we had Bill Dally on one of our earliest episodes,
where he was like, that last mile is super hard. And you get the paper and you can be done in
academia, but like getting it into production is, I mean, I know I'm going to say super hard again,
but that's because what I remember is like, it was very... Oh yeah, it is pretty difficult, yeah.
Yeah, maybe you can talk a little bit about your experience on how to make that happen.
It is tricky. There are multiple reasons why it's tricky. So oftentimes, you'll
sort of be working on some research that is so much more advanced compared to where the corresponding
product is.
So there's that gap.
And you want to transfer your research and you've got to somehow bridge that gap or get
to some point where the product is advanced enough that you can build on top.
So that's a big challenge.
Another challenge is that when you have some piece of research,
when you're just writing a paper or doing a prototype,
it's fairly well isolated.
You set yourself up in such a way that it's pretty self-contained.
But then when you get to
like a real system, an enormous system like Azure, there's so many other
dependencies and things that affect your work and your work affects other things.
So it's really hard to also sort of stitch these things together nicely, and it involves so many other teams.
So it's also very difficult to sort of align everybody
and so on.
So there's a number of challenges,
but the bottom line is that if you're not willing
to sort of go through this process,
like the research that you've
developed, that you've created and that you've worked on is going to have a much lower chance
of being adopted.
So the way I always set up my research group was to be ready for that kind of thing.
So I hired with that in mind, I defined the research projects with that in mind.
We can get into more details on these things if you like.
But I just, we initially wanted to make the point that what you're saying is definitely
true.
And there are also aspects of my experience that are a little different than what you
normally see.
In the sense that oftentimes you see researchers move to a product group, but to still do research,
to still not have a production responsibility always.
There are occasions in which this happens as well, but oftentimes it's just the research
group that they'll lead.
So one thing that sort of makes my life a little easier is that because I own part of
production as well, sort of, it makes it a little easier in that sense.
If I were just a research group embedded in the product team, it would be harder in a
sense because there would be at least one more person that I would need to convince to get some of that stuff in production.
Got it.
Yeah, I think that was a good articulation of why the research to production path is
tricky and you touched upon a few different dimensions in your response there.
Maybe we can double click on a few of these things.
You talked about the process of taking things from research to production and you started
off by highlighting that there might be a wide gap
between where your research is and the state of what the production system is.
So how do you go about, number one, sort of pacing yourself in terms of how do you think about
where is the right point to intercept the ideas in your research into the production system?
How do you understand the biggest pain points on the production side? And how do you work with them? Because they have a set of problems on the production side,
and how do you find the Venn diagram intersection of the problems that your research addresses and what
might be relevant, important, interesting and top of mind for someone in the production team.
Yeah, so let me start with what I mentioned, which was you have to be careful about how
you hire.
And I should sort of say this from the get-go.
If we've had any measure of success at all, it has been because of the folks that I hired.
I hired an amazing team of researchers and research engineers that make the rest of all
of us look really good.
So that is, I should give that credit because that's where it all comes from. I'm just the
lucky guy who managed to find those people. So there's that component. Now, how did I hire?
What was I looking for? So one thing that I always look for is folks that have
the right interests; they're excited about the right things from my perspective. They're excited,
not just in terms of doing research that is sort of super high quality and cutting edge,
but also they're interested in deploying that research for millions of
people to use. Without that second piece, I'm not as interested. So I always try to
look for people who would be good fits culturally to what we were trying to do. That's number
one. The other thing too that's really important is I give a lot of importance to the engineering
side of research.
In other words, I focus a lot on finding excellent research engineers who would be willing to
say when there's a gap between the product and where our research is to make that investment,
to bridge that gap, to help bridge that gap.
So finding the right folks who are willing to put that investment in so that we're now
ready to deploy our research is really important.
So those two are two sort of main factors.
And they need to know from the get-go that, like I mentioned, all of this is challenging.
It's not going to be, oh, let me write my paper and take off.
This is a recipe for disaster because the product team will never work with you again
because you're just not committed to the group's overall success.
So those are the key aspects in terms of forming a team that's going to be able to do this.
Now in terms of after you have the team, how do you sort of figure out how to do your work
in such a way that you're more likely to be successful in tech transfer.
So what I always suggest to them is
let's think about the North Star that we want to get to, the North Star research that we want to do.
But as we're planning this path from where we are to the North Star,
let's figure out some offshoots that we can deploy as we go along. And that is a
really important thing because it keeps everybody excited and getting promoted and those things.
And people don't have to wait five years to have the first outcome of their work. But it's tricky,
though. You have to find offshoots that you can deploy
without getting too far off your main path.
And you constantly have to adjust where things are
and where you're going as you go along.
So that's another very important aspect of this trajectory.
So these are some of the main ways that we think about how to
set ourselves up for success, both in terms of the culture, in terms of the
people we hire, and how we organize our projects. Does that make sense?
Yeah, no, I think that's a good encapsulation of both the ingredients. So
starting with the people and the culture that you bring about in the team, then the following aspect, which is how do you pick the right projects and how do you pace those
projects so that you have these offshoots, you have near-term landings and logical conclusions,
I guess, that are milestones you can track and give you that sense of, okay, my work is actually translating into some impact.
I can see some clear milestones and markers by which I can pace myself on the research
to production transfer as well.
There's one extra point actually,
which is you mentioned the pain points, right?
Understanding whenever you're trying to work
with the product team, be it Azure Compute
or some other product,
it's important to understand their pain points.
What are the things that really worry them?
What are the things that they feel are not ideal in terms of where they are and where
they're going, usually in the short term, because people in the product
teams don't think too far ahead. And what I basically usually say is that understanding
is really important when defining the research
that you're going to do.
Not because you will necessarily try to address
their pain points, but rather because
it's important information.
Because without that information,
you will do your research in a vacuum.
And then one day when you try to go transfer that technology to the product team,
the product team will say, man, we're completely in this other space here.
There's no way that we can come to where you are or for you to come to where we are
because we've diverged a long time ago. So having that information to be able to make informed decisions
about where you're going.
If you decide to disregard their pain points,
well, you've decided it in an intelligent manner,
in an informed manner, not because of ignorance.
You just didn't know.
I think that's a very pertinent point, which
is you want to understand the context behind their problems, what are their current pain points, because
solving or picking the right problem is about 50% of the battle, as some people might say.
So yeah, so you have a context on the pain points in a production team, and obviously they have
certain aspects or certain dimensions of the problem that are near term, and maybe they have
a window into, okay, these might become problems further down.
And as you said, they might not have a vantage point
or interest in pursuing things that are too far out
because maybe the space changes very rapidly as well.
So within this space, how do you think about
what's the right timeline of problems
that you want to actually tackle within a research setting?
And number two, how do you sort of couple that
with the right partnerships on the production side?
How do you sort of set up this partnership
so that you have that feedback loop going,
so that you have the right context?
The context keeps evolving, especially in current times,
the space evolves fairly rapidly.
So how do you figure out, okay, what's the right timeline
at which I want to tackle certain problems
relative to where the production team is today
or where the production teams are currently.
And then the next step is, of course, how do you set up these collaborations and partnerships
so that they are also invested in this, involved in this, and you have the right feedback loop
so that you understand, are you on the right trajectory?
Are you still solving the most important problems?
Has something changed on the other side of the landscape that will need you to also shift
directions in terms of what you're pursuing?
So how do you think about those dimensions?
Right. Yeah, you touched on a critical piece of the puzzle, right, which is partnerships,
right? Sort of identifying the right people, people who are more interested in innovation, more interested in sort of thinking longer term
on the product side, is super critical.
I was super lucky that I had a partner in the product teams, Marcus, Lisa, your former boss;
he was a great partner throughout for me and we worked great together.
We had the same interests, complementary skills,
but the same interests in working together to do these things,
to advance and bring innovations to Azure.
So identifying the right partner in the product team is very critical.
The other piece is also how to work with those partners, at least in my experience.
And everything I'm speaking about here is my own experience. If you ask somebody else who has had research transferred to products, they might have different
perspectives on things.
I'm simply offering my own.
But in my experience, the way to interact with the product team and those partners is there's no point in coming in and saying,
oh, here's the five, 10 year plan.
They will care very little for the five, 10 year plan because they care about the one
year plan.
So what I usually have found most useful is to do things sort of incrementally, right? And say, oh, here's the, forget, I'll
keep my North Star and my paths to myself. And I'll talk about what is the next step
and focus on that. And then as you work on this next step, you sort of develop trust
and you develop a good working relationship and so on and
after you accomplish or you're close to accomplishing this first step
then you start discussing the next step and so on. There's very little point in
sort of scaring them off saying oh here's what we're gonna be doing five
years from now. They'll say, no, forget it.
You're nuts.
Let me focus on my problem right now.
So this goes to what you're asking
in terms of how you stage things within those partnerships.
Yeah, so I think it might be worth saying
that you talked about the types of people that you hired for.
You want somebody who's going to be a great researcher, has curiosity, has that kind of
mind that can try and solve problems that have not been solved before.
But at the same time, in the Venn diagram, somebody who is not just interested in pursuing
ideas but is interested in building things and making sure that they're actually
deployed and used.
So someone who would also be a good production-level engineer, and then Venn-diagramming those two,
that's already hard.
And then this third piece that you basically were saying is you need someone who can sort
of read the room and communicate and build trust with other teams.
So that's like a really tough thing to find.
And so I can see how you say you've hired great people because getting all three is hard, which is probably
why this thing that I kept saying was super hard is something you've managed to be quite successful
with. And of course, Marcus is wonderful. You guys had an amazing partnership; to watch
that in play was like, wow, this is a very, very functional relationship. And that's amazing.
So maybe we can ask a little bit now, now that you've been there a while, you've had your time in MSR, you've had your time to
transition over into Azure and being on both sides, what would you say, I probably shouldn't
ask you to pick a favorite child, but I'm kind of asking you to pick a favorite child. What is
one of the more impactful projects that you've
brought to bear that you just feel really proud of?
I have already mentioned Resource Central. That was a really interesting one because it was
a very early project in terms of using ML or AI for systems.
If I'm not mistaken, it might be one of the first,
or if not the first, in terms of cloud platforms
and sort of introducing these capabilities in production.
So I have been very proud of it, and it's still going strong.
It's got more than 20 scenarios that it feeds predictions for
and so on.
So it's sort of exciting to see how it developed
from an idea in front of a whiteboard
to eventually becoming something that's critical to Azure.
That's one.
Another one that we have had an enormous amount of success with is the Power Efficiency
Project.
Really when we started the project back in 2016, honestly, Microsoft was not in a good spot in terms of the ability to
manage power and so on.
So we brought it from, in collaboration with the other teams, of course, I'm not talking
about just my team, but we work very closely with the folks who build and design and operate
the data center.
That division works very closely with us.
The folks who do hardware,
that sort of design the hardware and so on.
So it's a broad sort of coalition
of people working in this space.
But nevertheless, we were able to make sort of enormous changes
to how things were done. And now we're in a much better space in terms of the ability
to recover power that was going underutilized, the ability to sort of do very targeted power capping, for example, when it's necessary to do.
So we, for example, have per-VM power capping that was introduced from the research team
and then moved to production.
We have many other things around power rebalancing and use of reserve power capacity. And so a number of different efforts have turned out really great for Microsoft.
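As a rough illustration of what targeted, per-VM power capping with priorities can look like, here is a small hypothetical sketch; the thresholds, priority scheme, and trimming policy are assumptions for illustration, not Microsoft's actual implementation.

```python
# Hypothetical sketch of targeted, per-VM power capping: when a rack's draw
# exceeds its budget, cap the lowest-priority VMs first and only by as much
# as needed. Names, priorities, and floors are assumptions.
from dataclasses import dataclass

@dataclass
class Vm:
    name: str
    watts: float      # power draw currently attributed to this VM
    min_watts: float  # floor below which this VM should not be capped
    priority: int     # 0 = lowest (e.g., harvested/batch), higher = more protected

def plan_caps(vms: list[Vm], rack_budget_watts: float) -> dict[str, float]:
    """Return a per-VM power cap (watts) that brings the rack under budget."""
    caps = {vm.name: vm.watts for vm in vms}
    overage = sum(vm.watts for vm in vms) - rack_budget_watts
    # Trim the lowest-priority VMs first, and only by as much as needed.
    for vm in sorted(vms, key=lambda v: v.priority):
        if overage <= 0:
            break
        trim = min(vm.watts - vm.min_watts, overage)
        caps[vm.name] = vm.watts - trim
        overage -= trim
    if overage > 0:
        # Capping alone is not enough; escalate instead (e.g., draw on reserve
        # power or migrate VMs) rather than pushing VMs below their floors.
        raise RuntimeError(f"cannot meet budget; {overage:.0f} W still over")
    return caps

if __name__ == "__main__":
    rack = [Vm("batch-1", 300, 100, 0), Vm("web-1", 250, 200, 2), Vm("db-1", 200, 180, 3)]
    print(plan_caps(rack, rack_budget_watts=600))  # caps batch-1 to 150 W, leaves the rest alone
```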
So those two are, I would say, the two main ones that come to mind.
We have many other things, of course, but those two are sort of very dear to me
because they've lasted a long time.
Yeah, those two are.
We keep innovating.
Within those two projects, we keep introducing new ideas and new systems.
Those are really good examples, I think.
Yeah, yeah, yeah, for sure.
Those are quite mature.
The fact that they're still around and adding value and continuing to add even more marginal
value is a testament to that.
And I just wanted to mention, like, kind of bring Resource Central specifically first back to like
all the things you were saying about context and all that is when I came on and I learned about
Resource Central, I remember thinking that it was like very thoughtfully designed. So you could
easily imagine a project like Resource Central going
in two totally different directions depending on the execution. So Resource Central, the
basic idea is it pulls a lot of operational data from what is happening in the data center
and it uses that to feed ML so that future decisions that have to be made, you can ask
Resource Central, like, should I put this here?
Should I put this VM there?
Should I?
All sorts of questions that it can now feed answers to.
So you can imagine as a research project where it's like,
oh, what if we grabbed a whole bunch of information
and made some decisions off of it?
Where you do that in a vacuum such that when it comes time,
yes, maybe in theory, you can get a lot of inputs to this thing,
and you can make a lot of decisions, but you've built it in such a way that you actually can't then integrate it into the real system.
If you present it fully formed and without context on how the architecture of everything is at the end, then it's kind of like, okay, that's a great paper.
But the fact that it was sort of thoughtfully designed from the beginning
with an understanding of where it could potentially sit
in an actual architectural workflow.
So I'll harken it back to some computer,
like classical computer architecture stuff,
which is I remember as a grad student reading a paper,
or reading papers where people would talk about making decisions in like
the last level cache, the L3 cache, based off of the program counter. Meaning you have
to get program counter information all the way down into the L3, which is not really,
I mean, people maybe have figured out ways to kind
of fake it, but like you're not going to pass that many bits all the way down to the
L3 in order to help you feed your decisions.
So it was one of those things where like in theory, that's great, but like you actually
can't get that information.
So anyway, taking it back to that, we're thinking like intellectually, you could imagine taking
something like a PC and having that help you make decisions, replacement decisions at the L3, but in practicality,
you actually can't get that information all the way down there very easily.
So something like Resource Central, very similar, you could have all sorts of intellectual thoughts
on all the things that you might want to feed that information, feed Resource Central.
But if you don't have a good pipe, then you might as
well not put it in. It just seemed like Resource Central was built in such a way that all the
inputs are actually feasibly feedable. And then the outputs, so the decision making part of the
pipeline was also feasible. And I just remember thinking like, that is well done. Yeah. And so
that speaks to the context that you were talking about before.
Right. I think we made a couple of key decisions in Resource Central that really enabled it to flourish quickly. Sort of, I think, it has to do with what you're talking about,
which is defining exactly what was the right level of abstraction,
which other parts of the control plane and other parts of Azure are able to interact with Resource
Central. We had to, because we wanted to apply it in a number of different scenarios, widely
different scenarios, we needed to define a set of abstractions and a level of interfacing
with Resource Central that was low enough that it would be useful in all of those scenarios.
Because if you raise the abstraction too much, it would become too tied to each of the scenarios. So for example, Resource Central provides predictions
of expected blackout time for a live migration.
It doesn't try to say, oh, this is how you should
live migrate or this is where you should put the VM
or anything like that.
It simply gets asked, what is the expected blackout time for this VM?
It replies with a prediction.
All the smarts about what to move and how to move it and so on,
is all higher level, in the live migration engine.
Similarly, the VM allocator asks
Resource Central for a prediction of the lifetime,
how long a VM is going to live.
And it factors that information into its decision about where to place and how much time to
spend on it.
So again, Resource Central makes no decisions about how to place a VM. It simply gets asked, what is the prediction
for the lifetime of this VM?
And it gives back a prediction.
You see what I'm saying?
So we define the abstraction and the level of interaction
that's low enough so that it can be applicable
to any scenario very quickly.
And it doesn't interfere, there's the separation
of concerns
with all of these different scenarios. So that was a critical decision that we made early on
that, like you said, made it so much easier to integrate. Of course, there are extra complexities
because we didn't want Resource Central to be on the critical path. So when making calls to Resource Central,
we made sure that whenever we could,
we made sort of parallel calls to Resource Central
so that if Resource Central did not reply in time,
it wouldn't slow down the critical path for the allocator
or for the live migration engine and so on.
So there were extra complexities that we had to deal with,
but this was a critical way to sort of integrate it
into the rest of Azure.
And these days we apply it even to other services.
They're not even part of the control plane.
So now we have a version of Resource Central in Ring 1, as we call it,
where it runs in regular VMs and so on,
so that things that are not in the control plane,
services that are not in the control plane can also use it.
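As a rough sketch of the two patterns Ricardo describes, a deliberately narrow prediction interface and calls that stay off the critical path, here is a hypothetical example. The class, method names, timeout value, and fallback behavior are assumptions, not Resource Central's actual API.

```python
# Hypothetical sketch: a prediction service with a narrow interface (it only
# answers "what is the expected value of X for this VM?"), queried in parallel
# with other allocation work, with a timeout and a fallback so a slow or
# missing answer never blocks the allocator.
import concurrent.futures

class PredictionClient:
    """Narrow interface: answers 'what is the expected value of X for this VM?'."""

    def predict_lifetime_minutes(self, vm_id: str) -> float:
        ...  # e.g., an RPC to the prediction service; stubbed out in this sketch

    def predict_blackout_ms(self, vm_id: str) -> float:
        ...  # expected blackout time if this VM were live-migrated

_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)  # shared, long-lived pool

def allocate_vm(vm_id: str, client: PredictionClient, timeout_s: float = 0.05) -> str:
    # Issue the prediction in parallel with the rest of the allocation work.
    future = _POOL.submit(client.predict_lifetime_minutes, vm_id)
    candidate_hosts = enumerate_candidate_hosts(vm_id)  # proceeds regardless
    try:
        lifetime = future.result(timeout=timeout_s)
    except Exception:
        lifetime = None  # prediction timed out or failed: fall back to a default policy
    # The allocator owns the decision; the prediction is only one input to it.
    return choose_host(candidate_hosts, expected_lifetime_minutes=lifetime)

def enumerate_candidate_hosts(vm_id: str) -> list[str]:
    return ["host-a", "host-b"]  # placeholder

def choose_host(hosts: list[str], expected_lifetime_minutes) -> str:
    # Placeholder policy: short-lived VMs could be packed differently than long-lived ones.
    return hosts[0]
```

The key point is visible in the structure: the predictor only answers questions, the caller always has a fallback, and the allocator or live migration engine keeps all the decision-making smarts.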
Yeah, that makes a ton of sense,
and that does seem like a really, really critical decision.
And again, I kind of want to hammer home, like if you hired researchers who were really
interested in the idea, can we use sort of production metadata to inform further control
plane production decisions?
That as a purely intellectual exercise does not necessitate making that kind of a call
and that kind of abstraction decision early on. Because if you're focused on just like, can we do
it and can we publish a paper that shows we can make a difference, then you don't need to think
about that yet. But because you started with the explicit goal early on of, we wanna be research, but we also want to make sure
we do tech transfer by sort of folding that ethos in early.
Then you make that call early,
and that sort of paves the way for you to be effective.
So you've still answered
the intellectual question, but you didn't paint yourself
into a corner where you couldn't then leverage it in a production context? That's right and you were
touching on something else that is so critical in sort of the ability to think through
how to integrate research into production which is one thing that researchers and research
engineers even don't normally think about is simplicity is king because you can't have
PhD students and folks with PhDs maintain code in Azure or in any other production system.
This is just not a viable approach.
You need to define things and scope them in
such a way that there are
these simple pieces that can be deployed.
I often joke that the day that I realized that I was decent at my job was when I could look at a
paper and say, oh, this piece here, this 30% of the paper, I can actually deploy; the rest is
sort of intellectual exploration that's super necessary, that advances knowledge, but it might not be sort of deployable right away. So understanding this transition, understanding what is the piece that is more easily deployable
and starting with that, I think is critical. And the other important thing too is to realize that in sort of the research that we do and
so on, we don't feel like we always have to transfer 100% of it.
Because if you think about it, a lot of the time that we spend in research is to try to
squeeze every last little bit of goodness out of any idea.
But in production, that's not necessary.
Something that is good and simple is much better
than something that is maybe a little better,
or even perfect, but is complex.
So if you're able to get 70% of the goodness of something,
that's a win, that's a major win.
Forget the extra 30%, that extra 30%
will oftentimes introduce complexity
that might make the whole approach unviable
for a production team.
Yeah, especially at a hyperscaler like Microsoft.
It's so large and academic papers don't account
for things like data center tech time or data
center tech cost.
It just is like a graph of goodness.
And so that last 30% matters in an academic paper.
But as you say, if complexity makes it so that other costs that are not accounted for
in the paper come into play, then it becomes infeasible.
So maybe this would be a good time to slightly pivot to, speaking of costs that
are not necessarily accounted for, carbon costs.
So historically, those have not really been accounted for.
And so I know your group is starting to look at that.
We've talked a little bit about Resource Central and all the power
work which is relatively mature and has a lot of impact.
Maybe now we can redirect a little bit to stuff that's slightly less mature and ongoing.
Yeah.
So, the way I think about the carbon space and sustainability is twofold. From one perspective, efficiency and all the work that we do to better utilize servers,
better utilize data centers have a direct impact on scope three emissions, right?
Or embodied carbon, because if we improve utilization of the infrastructure,
we buy fewer servers.
We build fewer data centers.
So that reduces the amount of embodied carbon that we put out there.
So that's one perspective.
And that is the direct benefit.
There are other benefits.
When you do that, you also happen to improve scope two emissions as well and even scope
one because of transportation and other factors.
So doing efficiency work has in itself a pretty broad sustainability benefit.
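As a back-of-the-envelope illustration of the scope 3 point (the numbers here are made up, not Microsoft's): if the number of servers needed scales roughly inversely with average utilization, then

$$\text{servers needed} \propto \frac{\text{demand}}{\text{utilization}}, \qquad \frac{N_{\text{new}}}{N_{\text{old}}} = \frac{U_{\text{old}}}{U_{\text{new}}} = \frac{0.5}{0.6} \approx 0.83,$$

so raising average utilization from 50% to 60% would shrink the fleet, and roughly proportionally its embodied (scope 3) carbon, by about 17%.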
The other way to look at it too is there are things that we can also do that are beyond
just efficiency work.
There are things like carbon aware scheduling of work, either in time or in space.
You can sort of decide to run certain work. AI training, for example, is a delay-insensitive workload that you might decide to run during a time
that the grid mix is more favorable, when there are more renewables.
There's a lot of batch inference workload that can be run that way as well.
And because AI inference is like a SaaS workload, software as a service, oftentimes you can move requests geographically, right?
To take advantage of more renewables and so on. So there are aspects of carbon awareness that go beyond efficiency as well. So we are working in that space,
working on things like some good methodologies, right,
for carbon accounting,
both Scope 2 and Scope 3 carbon accounting is one example.
And feeding that information back
so that customers of Azure can see that information and make
decisions for themselves in terms of the carbon footprint of their workloads.
So we're definitely working in that space and in many other areas too.
This is just one example.
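As a hypothetical sketch of carbon-aware scheduling in time and space, here is what choosing a slot for a delay-tolerant job based on forecast grid carbon intensity might look like; the regions, numbers, and constraints are invented for illustration and do not reflect Azure's actual scheduler.

```python
# Hypothetical sketch: given forecast carbon intensity (gCO2/kWh) per region
# and hour, pick the greenest feasible slot for a delay-tolerant job.
def pick_slot(
    forecast: dict[str, list[float]],  # region -> hourly carbon intensity forecast
    allowed_regions: set[str],         # regions close enough to the data and the users
    deadline_hours: int,               # how long the job can be deferred
) -> tuple[str, int]:
    """Return (region, start_hour) minimizing forecast carbon intensity."""
    best = None  # (region, hour, intensity)
    for region in allowed_regions:
        for hour, intensity in enumerate(forecast[region][:deadline_hours]):
            if best is None or intensity < best[2]:
                best = (region, hour, intensity)
    if best is None:
        raise ValueError("no feasible slot before the deadline")
    return best[0], best[1]

if __name__ == "__main__":
    forecast = {
        "west-us": [420, 410, 380, 350],   # evening hours dominated by gas
        "north-eu": [210, 150, 120, 140],  # windy night, much cleaner grid
    }
    # A batch job deferrable by 4 hours lands in the greenest region and hour.
    print(pick_slot(forecast, {"west-us", "north-eu"}, deadline_hours=4))  # -> ('north-eu', 2)
```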
You touched upon a few different themes here.
The first one is, I guess, the importance of metrics overall.
And this came up even in the context of our discussion with our prior guest, Carole-Jean Wu, who talked,
in the context of sustainable AI and accounting for carbon footprint and so on, about how just having
visibility into the data is a huge step forward. The other part that you talked about very briefly
was in the context of developing solutions that intersect with AI and
carbon efficiency or power efficiency. It touches like multiple regions of the
stack. So for example, you talked about how AI workloads could be moved between
different geographical regions and that's a theme that's come up in some of
Google's papers from my colleagues here as well, where you could move a training
job to a location that has access to,
let's say nighttime wind energy.
And so your carbon emissions
are correspondingly lower there.
So can you talk a little bit about both of these themes
in terms of metrics and data
and any efforts in this particular space
from your group very broadly into getting more data out there
for either researchers to play with or otherwise.
The second part was, how do you think about sort of co-designing
across multiple layers of the stack,
going all the way up to the data center, energy grid,
and interactions?
So you make a good point, right, that Google
has had some work on this that has been really interesting.
We're looking at those kinds of things as well.
But starting with the data issue, right,
that you brought up and Carol mentioned that too,
this is something that I think a lot about.
Like today for scope three, for example,
there's no sort of agreed-upon methodology
for quantifying these things.
So it's very difficult to compare
across cloud providers, for example.
And even if we were to settle on life cycle analysis as the way to, or as the
right methodology for accounting for this, right, then life cycle analysis
basically looks at the entire lifetime of equipment, from the supply
chain and all of these pieces through the use of the equipment, all the carbon emissions
throughout.
So, even if we were to all agree that that is the approach, there's data quality problems,
there is sort of inability to get certain data from different vendors.
And so there are boundary conditions that would have to be defined very carefully and
so on so that we are able to compare across different vendors and different providers.
So this doesn't exist at all today.
We're going to have to, as an industry, sort of work together
with academia and other folks to define what is the right methodology and what are the right
boundary conditions. So if you look at things like PUE, for example, or power usage effectiveness,
that was a great way to be able to compare things.
A very simple formula that is pretty well defined,
although there are still issues with it.
At least there was a way for everyone to be able to compare their efficiency
in terms of the use of power, comparing sort of the IT power to other overheads and so on.
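For reference, the formula he is alluding to is simply

$$\mathrm{PUE} = \frac{\text{total facility energy}}{\text{IT equipment energy}},$$

so a PUE of 1.0 would mean every joule goes to the IT equipment, while a PUE of 1.5 means 50% overhead for cooling, power delivery, and so on.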
So this doesn't exist today for scope three emissions.
So that's something that's gonna have to be addressed.
So on data, that's the data quality and so on,
that's the main thing I worry about.
On the other piece, sort of in terms of
efforts from my group, like I mentioned,
we are sort of exploring, not exploring,
we're actually collecting data and
sort of generating models and so on to surface them
through our different tools,
the Azure portal and internal tools as well, to surface what are the actual scope 2
emissions of different deployments of VMs.
And we're also sort of working with other teams on geographical distribution of inference requests and things
like that to maximize the use of green electricity.
So those two are two examples of things we're looking into.
And we have other, we already have infrastructure that's able to deploy VMs during off-peak
hours, for example,
that we can leverage and so forth.
So those are some of the things we're working on.
Besides all of the efficiency work
that I mentioned before.
No, that sounds like a really broad slate of problems
and directions to pursue.
Maybe this is a good time to sort of wind the clocks back
and talk about your trajectory on how
you got to Microsoft, what got you interested in computer architecture and computer systems.
Tell us a little, tell our listeners a little bit about how you got into this particular space.
Yeah, okay, let me, let me go back a little bit. I was, sorry, I was born in Brazil, right, in Rio and went to college over there. And at some point during
college, I wasn't taking it super seriously. And I had some issues in my family. I lost
my dad sort of during my college years, and that threw me off completely. But at some point during that time,
I actually had this good friend and his dad was a Stanford professor,
and had been a Stanford professor for a while and that was really exciting to me and so on.
The notion of doing research and sort of tackling problems that nobody knew the answer to and so on really got me excited and made me become a good student and finally
decide for computer science.
Yeah, sort of to do research, right?
To do a PhD in computer science.
And sort of my trajectory during the PhD
was a little strange because I wanted
to do computer architecture, but my advisor was not
in computer architecture.
So I had to sort of fend for myself
and learn a lot of things.
So I worked on sort of parallel machines at the time
and cache coherence and so on.
And then over time, I became more and more interested in software
and things that sort of bridge the gap between software and hardware.
So I started working on software DSM or distributed shared memory,
and then eventually cluster level systems and so on.
So after finishing my PhD,
I went back to Brazil and was there for several years,
and then decided to come back to the US to be
a professor at Rutgers University.
And during that time, that's very early on, I started getting interested in power and
energy and data centers.
And so my group and I sort of wrote one of the first few papers in this space and kept working on it over a period of time until at some point
David Tennenhouse, who used to be a corporate VP here at Microsoft, reached out to me asking whether
I would want to come and work on efficiency problems at Microsoft. And it was great timing because I felt like I had been in
academia for a while and I needed a new challenge. So I came over to
Microsoft, initially to a sister organization of Microsoft Research, but
then that organization got dissolved, so I moved to MSR. And I was at MSR for almost eight years, I think.
And working at MSR was really great.
I had great support from my managers.
I'm not going to name them all here because I'm sure I'll forget one or another and they'll be mad at me.
But sort of managers and collaborators.
And sort of during that time, we started working with different product teams, including Azure, Azure being the main one.
And then sort of at some point, my predecessor, Marcus Fontoura, that I mentioned before, was leaving Microsoft.
So his boss sort of asked me whether I wanted to move to Azure.
And that's when it happened.
And that transition from MSR to Azure was in July of 2022.
So that's basically my trajectory from sort of college. This is probably more than what you
had hoped for, but that's what we've got.
No, I think origin stories are important. And we always like to touch on them during
the podcast because not everybody comes to where they are the same way and not everybody
comes in a straight line.
And so I think it's important because our listenership, I believe skews young, a lot
of students.
And so I think these stories are important.
You don't have to have decided at the age of five that you love computers and like that's
the straight shot all the way.
So yeah, I think the fact that you figured
you wanted academia starting in college
and wanted to do research and then found yourself now
leading a very large production and research organization
in a gigantic company is kind of an unusual path
in and of itself, I suppose.
I guess nowadays there's plenty of people who are former academics who are in industry,
but maybe not as many who are running large production teams.
Yeah, definitely.
I also joke about the fact that I've had every possible job almost.
Because in Brazil, even before, during my college years, I worked at a startup there.
So I've worked for companies, I've done research and now I do production.
So I've had pretty broad exposure to different things. And to be honest, today, I think, is the happiest I've been in terms of
all the things that I get to do on a daily basis.
I really love my group.
I love the work that we do.
I love the people I work with.
Our management chain is fantastic.
So I'm really happy right now about everything.
Sort of the scope of the group is all about things that I enjoy doing and so on.
So it took a while to get here, but I really like it.
Well, that's wonderful. Congratulations on winning at life. That's not easy to be able
to say that.
Yeah, no. And I have to say, I've been so lucky. I've had some bad breaks in my life,
but sort of work-wise, I've been really lucky to work with amazing people and sort of enjoy it a lot.
Yeah, yeah, for sure.
And so this notion of luck is very interesting.
I just read an article about luck where it was like, is luck an actual thing?
Luck is actually an understudied topic in science because it's so mystical.
And the article was kind of saying that luck can be perceived as, it is partially how things
are perceived.
Like, are you lucky or do you just view your life in a lucky lens because all the good
things that happen to you, you cast it to luck?
I don't know, but certainly I think luck does often find people who are prepared.
It's like if an opportunity falls in your lap and you're unprepared to take it, whether
because you're unable to or can't manage to or whatever, then it passes you by. Luck favors the prepared.
That's a saying for a reason.
So I guess that having been someone who has been blessed with, as you say, luck, but I
would also say good preparation and hard work, maybe you can share some insights or advice
to our audience as well.
If you had to give one piece of advice or two or whatever, some brief amount of advice
to our listeners, what would you say? Yeah, I can talk a little about how... So I work with a bunch of sort of younger people
who are starting their careers and so on, and they will often sort of ask, oh, how do I deal
with this issue or how do I deal with that issue and everything else. I think one piece of advice that I always give is communicate.
Oftentimes people come to me sort of worried about their relationship with their manager
or how they're working with their colleagues or those kinds of scenarios. The advice that I always give is talk it out.
Be upfront, be honest, go in sort of in a way, sort of go into a conversation in an
honest and upfront way to try to solve the issues.
I think people find that when you do behave that way, sort of all defenses sort of go away and people are
sort of more willing to have empathy and accommodate and sort of work together.
And this is what I do as well. Sometimes I feel naive in a way,
because like I keep saying, oh, this is all simple.
Let's just talk it out.
Let's find compromise.
Let's find areas where we can agree and so on.
And it has worked well for me.
It might perhaps not work for everybody,
but this is the advice that I always give.
Yeah, I think so two things.
One is that Microsoft itself, I think, is one of the kindest companies,
sort of culturally, that I ever worked for.
I know it's not the same.
I mean, it hasn't been the same company during its entire 50-year existence.
But for the time that I was there, it was a very kind and empathetic
environment. And so it's the kind of place that probably does foster the ability to have this
kind of talk it out type of solution. So I can see how that might not be as useful in other places
where people duke it out instead of talking it out. But I always say Microsoft is great, and I actually think that it's great because
Microsoft was a very like a generally functional work environment in my
opinion. But I think the other thing, what you were saying is like when you're
doing your research you want to find context. You want to solve problems given a certain context.
And if you view interpersonal work relationships
as yet another research problem to be solved,
like you want to find context to help you solve the problem.
So like communication is what allows you to extract context,
not just technical context, but interpersonal context.
So if you just like think,
oh, this is another thing that I have to solve, how do I find out information? Talking it out is
the way.
Right.
So, yeah.
That's a really interesting observation. I had never thought about it that way. Maybe
that is what I do. Maybe I bring this research perspective or research approach to addressing problems
that is helpful. I don't know. That's a good point. But what you said is right too, that
Microsoft's culture today at least, at least in terms of the space and the teams that I work with and so on is sort of more conducive to that kind
of thing than other places I've heard about.
So maybe it's a good match in a sense.
But if I had to be always duking it out, it probably gets tiring at some point.
I'm sure. I wouldn't enjoy it. I suppose people find their homes where they find their homes.
So given that you've, as you've just mentioned, have been in every possible job there is,
the thing with production stuff, we've alluded to it a little bit during our conversations where
the production has a certain role, which is they have to keep the lights on, they have to
keep everything going, and they have to ship. So very naturally, they don't usually want anything
to distract them from their number one mission and their reason for existing, which is to
produce products.
And so it is a little bit of a different ethos than research, which is pushing boundaries,
exploring.
Maybe you can talk a little bit about, do you have a different approach for how you
deal with your production-side employees versus your research-side employees?
And now that you potentially have to hire for, you talked a little bit about how you
hire for the research side, but do you hire differently for the production side?
So maybe compare and contrast a little bit the difference between leading a production
versus leading research.
Yeah, there are differences there.
The way you manage a research team has to be different than a production team.
The production team is very comfortable with very structured, in fact, not just comfortable,
but it needs a very, very good structure of execution, a very good structure of scoping.
And everything needs to be really well defined, especially
for the more sort of early in career folks that sort of work
in the team.
Whereas for a research team, like trying to impose that kind
of structure doesn't make any sense.
It's a recipe for losing them all.
You still have to manage them basically the same way as if they were sort of in Microsoft Research
or they were in academia to a large extent. You have to give them freedom, you have to give them
the ability to explore. You can't be saying, oh, what is, when are you going to deliver
this and when are you going to deliver that? That's not the way to interact. Whereas on
the production side, it's all about that. It's like, oh, what are we going to
accomplish this semester? What are the things that we're going to cut? What are the things that we're
going to prioritize? So it's very, very structured, very, very focused on execution, executing well, especially for a group like mine, where the company
depends on us for increasing gross margins, for being able to recover enough power, for
a number of things, for sort of becoming more efficient and buying less infrastructure and
so on.
So then there's a lot of pressure on us to deliver on this.
So it has to be like super well structured and with targets, with
metrics, with like things that we track over time to make sure that we're not
sort of falling behind or we're not deviating from targets that we set for
ourselves. So all of that is super critical on the production side and not really a thing
for research.
Yeah, maybe I can flip the script a little bit. So early on, we talked about from a researcher's
vantage point, like how should they be empathetic towards the production teams' goals and pain points and so on.
And like now that you sort of manage both a production and a research org, like all
under the same organization, from the other side, like how do you communicate to production
teams on the value of research explorations, things that are necessarily, as you said,
unstructured, can sometimes seemingly hit dead ends or you might learn something out
of it
but may not translate into something actionable or clearly something that increases your gross
margins or revenue or any other thing that you can measure.
So how do you do that handshake on the other side of the aisle as well?
That's a great question because there is oftentimes, right, especially for companies or groups
that don't have that culture of interacting
with researchers, there might be this bias.
Oh, those people are not working on real things.
They're just enjoying whatever the sexy thing is
that they're working on,
but they have no accountabilities and so on.
So there's a lot of this type of bias that we do need to work on and sort of highlight to
the production teams all the goodness that the research team can bring. For
example, so I often will mention, look, if there are sort of good ideas that you've
had but you're worried that they're too risky and so on, a research team is a great team
to go de-risk things, right?
To go explore, to go really dig deep into ideas and into new directions that the production team is either not able or not
willing to do at any given point in time. So that's one way to sort of
lower these barriers. Another way is to highlight the importance of innovation
and how maybe a product that's running behind a competitor can sort of leapfrog the competitor
by introducing a new feature or a new idea, a new approach to things that the competitor
doesn't have.
Or maybe it's a way to become so much more efficient than the competitor, and this will make a big difference in terms of everybody's, as
a stockholder and so on, everybody's compensation.
So that's another approach, highlighting the importance of innovation.
And the third one is to create a culture where there is close collaboration, right?
After people sort of understand what researchers do and how they can contribute better, they
tend to sort of lower these barriers and eliminate these biases because they say, oh, I see,
they are willing to work with us
on things that matter to us as well.
And they'll be here for the long haul.
They're not gonna take off after they get their paper written.
So it takes some work,
but after you've created this culture,
then it sort of is a lot smoother, actually, than you would think.
Yeah, and I would actually say one thing is that creating the culture often involves,
in some ways, performance metrics too, where if you build a research team where you're purely
evaluated on the number of papers you produce, or your h-index or something like that,
then the researchers themselves are not incentivized to behave in the way that you've laid out as a way to do successful tech
transfer. And so you do have to sort of provide a culture where like you will be
rewarded for helping the production team with their problems in
order to build trust.
And then the production team sees your value and then you can work together on bigger and
bigger issues.
And same thing on the production side where it's just like, you ship this and if you deviate,
then you're hosed versus you are willing to accommodate and innovate and make the product
better,
then that is also rewarded.
So it feels like, I don't know, it just occurred to me that like performance metrics are a part of building this culture as well.
Yeah, definitely.
You're exactly right.
If you reward just the behavior that you don't want, guess what?
You're gonna get what you don't want.
So you really have to target, in terms of the incentives,
they have to be aligned with the culture
that you want to develop.
And in our case, I can't complain at all.
The relationship between the two sides of my org,
the research and the production side,
like couldn't be better.
They work extremely well together.
And I think this also comes from sort of
the management chain being supportive.
And now I have a manager for the research team
and he gets along great with the other people
on my leadership team.
So this relationship has been really good for a long time.
That's awesome.
Ricardo, this has been such an interesting conversation.
I think we've addressed so many different topics, research, production, specific projects
that you've worked on, advice,
your career path.
So it's been a really enjoyable conversation.
Thank you so much for joining us today.
We've really had a good time talking with you.
Yeah.
Thank you so much for having invited me.
I had tremendous fun chatting with you two and I hope we can make an interesting episode of this conversation.
Absolutely.
Like echoing Lisa, this is a fantastic distillation of many years of hard-won wisdom, straddling
both research and production.
So thank you so much for sharing that with us.
And to our listeners, thank you for being with us on the Computer Architecture podcast. Till next time, it's goodbye from us.