Microsoft Research Podcast - 111 - Auto ML and the future of self-managing networks with Dr. Behnaz Arzani
Episode Date: March 18, 2020
Dr. Behnaz Arzani is a senior researcher in the Mobility and Networking group at MSR, and she feels your pain. At least, that is, if you’re a network operator trying to troubleshoot an incident in a... datacenter. Her research is all about getting networks to manage themselves, so your life is as pain-free as possible. On today’s podcast, Dr. Arzani tells us why it’s so hard to identify and resolve networking problems and then explains how content-aware, or domain-customized, auto ML frameworks might help. She also tells us what she means when she says she wants to get humans out of the loop, and reveals how a competitive streak and a comment from her high school principal set her on the path to a career in high tech research. https://www.microsoft.com/research
Transcript
Humans are great at innovating and building stuff.
But when it comes to figuring out what went wrong and how it went wrong and fixing things,
it's much better to have automation do that than humans do that because we take our sweet time with things.
And we also don't have the mental power to process so much data that's out there all at once.
Machines are much better at doing that.
You're listening to the Microsoft Research Podcast, a show that brings you closer to
the cutting edge of technology research and the scientists behind it. I'm your host, Gretchen
Huizinga. Dr. Behnaz Arzani is a senior researcher in the mobility and networking group at MSR,
and she feels your pain.
At least that is, if you're a network operator trying to troubleshoot an incident in a data
center.
Her research is all about getting networks to manage themselves so your life is as pain-free
as possible.
On today's podcast, Dr. Arzani tells us why it's so hard to identify and resolve networking
problems, and then explains how content-aware or domain-customized AutoML frameworks might help. She also tells us what
she means when she says she wants to get humans out of the loop, and reveals how a competitive
streak and a comment from her high school principal set her on the path to a career in
high-tech research. That and much more on this episode of the Microsoft Research Podcast.
Behnaz Arzani, welcome to the podcast. Hey, thanks for having me.
I like to start situating, and you're a senior researcher at MSR and you work in mobility and networking.
But that falls under the bigger umbrella of systems and networking.
So to kick off our conversation, give us the elevator pitch version of what you all are up to.
What's the big goal of mobility and networking and how does it fit into the broader ecosystem of systems and networking?
Right. So I guess mobility and networking, as the name suggests, goes into two parts: mobility
and networking. So half of our group is doing things like IoT research, edge research,
things that have to do with mobile phones and things like that, or like devices. And what
people like me do are more of the networking aspect. So every distributed system that is out there has an underlying network.
And so our job is to try to figure out how to operate those networks properly
and how to make those networks work the best way possible.
So what would you say the big audacious goal of systems and networking
or mobility and networking is, sort of writ large?
So I think every person you ask
is going to give you a different answer on that. My particular take on this is that we're the
infrastructure behind a lot of the systems that you see out there, from the web that you access
to the storage systems that you use to everything else. And so our job is to make this as seamless
as possible. So you shouldn't even know that these networks are there.
You should just use it and expect that they work properly.
Well, let's talk about what gets you up in the morning and where you situate yourself as a researcher.
What's your big goal, Behnaz, as a scientist?
And what do you want to be known for at the end of your career?
Yeah, so when I think about this, and this was mainly after doing many, many internships in Azure networking and seeing what operators have to deal with every day.
And to me, it seems like the worthy goal or what I really want to achieve is something where the life of an operator, a network operator, is as painless as possible.
Because it can get painful on days, especially if there's, you know, something broke and they have to figure out what happened. It can be a nightmare. And what I would like to see is that they don't have to do that.
And we will get into how you're going about that shortly. Let's start with one of the things that
is kind of interesting to me. Many people I've talked to on this podcast emphasize the importance
of keeping humans in the loop. But you suggest in
some ways and for some problems, we actually need to get humans out of the loop, or at least
you question why after so many years, you still have so many humans trying to figure so much of
this all out. So when you say, get humans out of the loop, what do you mean? And then how does it
play out in the work you're doing? Right. So I think it depends on what you get the humans out of the loop of. I think, to me,
humans are great at innovating and building stuff. But when it comes to figuring out what went wrong
and how it went wrong and fixing things, it's much better to have automation do that than humans do
that because we take our sweet time with things.
And we also don't have the mental power to process so much data that's out there all at once.
Machines are much better at doing that.
And what I keep seeing is that we as humans are very, very inefficient.
And what that causes is that our customers often are in pain while humans are trying to figure things out like that.
And so why do we have to do that
after so many years of having networks out there?
And I think it's because this particular problem
is just a really, really hard problem to solve.
And so I find that both exciting and hard.
And so it's a challenge that's worth pursuing.
Hasn't it gotten more complicated
rather than saying, you know, we should have
had this figured out by now. It's like, well, the internet threw us a lot more problems. The cloud
has thrown us a bigger problem. How would you answer that? A lot of this has to do with scale.
So we just have more and more things, and the bigger things are, the harder they are to handle.
But our processing capabilities have also increased. So
that's one piece of good news. The other thing is, like, if you think of things like cloud, yes,
they did, like, throw a curveball in introducing something new, but also they added a little bit
of structure. So when you think about, like, a cloud network, it's much more symmetric and much
easier to reason about compared to something like the internet, which is just a hodgepodge of devices connected to each other with arbitrary topologies. Like,
if you look at the stuff we did in 007, for example, what we really used was the fixed,
nice structure that cloud networks actually have. So that actually helped?
That actually helps. Interesting. Okay. Because, you know, you just think the bigger it is,
the more messy it is, but you're actually saying it's added a layer of structure to help iron out some of the problems. All right, well, let's talk a little bit more in detail about those kinds of problems. Data center diagnosis is hard. There's lots of incidents, lots of different kinds of incidents, incidents with whole life cycles. Why is this so hard? And what are some specific research
projects you have going to make it not so hard? Yeah. So if you go back to what I was saying
earlier, like the network is really the underlying infrastructure of a lot of distributed systems.
So there's a lot of dependency on a well-functioning network. But the problem is also
that when something goes wrong, how do you know if it's the network that's problematic or if there's other layers of the infrastructure that may be problematic?
A very simple example of this is all of the VMs we operate in Azure are dependent on our storage systems, for example, because they have a virtual hard drive that has to access those storage systems.
So they have to go over the network, but there's also that storage system itself that can
fail. It's also the VM virtual hard disk that might fail. There's a lot of different failure
scenarios. The host may fail, the server may fail, everything. And so what we see often is that, well,
the first step is to just locate who the expert is that needs to look at this problem. And often
it's the case that because there's so many different levels of expertise, like the storage people, the storage operator knows really well how storage works,
but he may not know anything about how a network works, right? And so he doesn't know how to look
at this network and to determine whether this network is healthy or not, right? So you really
need a network operator to get engaged at that point. But at the same time, you need to first know that you need the network operator to engage at that point.
It's kind of like a chicken and egg problem.
You don't know what you don't know.
Yeah. So the projects we're working on right now, in the case of the storage example that I gave,
I think in 2016, we had a NetPoirot project that dealt with that.
Right now, we are looking at a project called Scouts, which its goal is basically to say,
if each individual team provides an abstraction that basically says, is it likely that my team
is probably going to be responsible for this problem? So my expertise is needed or not.
That way, at least a storage operator, when it sees that the storage system is failing, can know,
oh, I need a networking expert or I need a host expert, an SDN expert. What type of expert
do I need to help me with figuring this out? So the problem upstream is diagnosing where the
problem is. And you want to do that quickly so that you can address the issue. And the problem
has been that with a human, it takes way too long to even figure out who to blame. Right.
So what is it that machines can do to help us out here? I think the observation we had, at least, and there's a lot of work still remaining to be done, but the observation is, well, we see enough examples. Like, if I'm the networking team in Microsoft, we've seen examples of failures happen in the past.
And we collect a lot of data from our own infrastructure. So the idea is, can we learn
from past failures whether this is probably going to be caused by a networking problem or a physical
networking problem, for example, and basically use machine learning to identify whether this
problem is likely due to this team's infrastructure failing. So do these failures present themselves in a certain way
that would be a pattern detection thing
that would be really good for machines to work on?
In certain cases, yes.
So for example, in the case of physical networking,
that turns out to be really true.
It's more complicated when you have, for example,
something like a software load balancer is a lot more complicated
because it has a lot more dependencies
and its failures are also more complex.
So for certain teams, this is easier.
But the nice thing is also that for these teams, these are often the first ones to get blamed anyway because all of the teams depend on them.
Right.
So it's kind of like a win-win situation.
You might want to build similar things for the teams that you can build this for.
And then hope that this would simplify the problem to an extent that it makes the life of operators easier.
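[To make the Scouts idea concrete, here is a minimal sketch of what a per-team "scout" could look like: a binary classifier over an incident's monitoring data that answers "is my team's expertise likely needed?" The feature set, the synthetic data, and the choice of a gradient-boosted model are illustrative assumptions, not the actual Scouts implementation.]

```python
# Minimal sketch of a per-team "scout": a binary classifier that answers
# "is this team's expertise likely needed for this incident?"
# Feature names, data, and the model choice are illustrative assumptions,
# not the actual Scouts implementation.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical monitoring features summarizing the team's telemetry at the
# time of the incident (e.g., link error counters, retransmit rates, ...).
n_incidents, n_features = 5000, 20
X = rng.normal(size=(n_incidents, n_features))
# Label: 1 if this team (say, physical networking) was ultimately responsible.
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=n_incidents) > 1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scout = GradientBoostingClassifier().fit(X_train, y_train)

# At incident time, any team can query the scout before paging an expert.
p_team_responsible = scout.predict_proba(X_test[:1])[0, 1]
print(f"P(this team's expertise is needed) = {p_team_responsible:.2f}")
```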
Okay. Well, talk a little bit more about the life cycle of a problem, or an incident, we'll call it, because we all recognize there are incidents that are going to happen and there will be a lot of them when you have this giant scale.
What do you mean when you say life cycle of an incident?
Well, so an incident starts when some monitoring infrastructure picks up that some anomaly is happening and something is not operating as it should. And so an incident is created. A lot of
the times we also have automation that goes and checks and knows how to fix it. So that's the good
case. That's like the best case scenario. But in some cases, when automation also fails to solve
the problem, we have humans that are called basically to try to resolve it. And basically the first step that that human takes
is to figure out who to call to help.
And also they get together and try to figure out,
okay, which part of the system went wrong?
How do we fix it?
And the first step is actually mitigating the problem.
Meaning, for example, if I have a software load balancer
that's problematic, I'll redirect all of my traffic
to a different software load balancer
while I'll figure out what's going on with this load balancer, right? And then they go out,
proceed to fix it and resolve the issue. I'm having a visual of an ER doc. You know,
you triage and you say, you know, is he breathing? Is he bleeding? Start one, stop the other.
Right. And then we can move on to what's really the problem.
Exactly. Yeah, that's basically what happens.
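[As a rough illustration of the life cycle just described, here is a toy sketch of the flow from detection to resolution; the function names are placeholders, not real Azure tooling.]

```python
# Toy rendering of the incident life cycle: monitoring raises an incident,
# automation gets first crack, and only when it fails does a human triage,
# mitigate (e.g., shift traffic away), and then fix the underlying issue.
from dataclasses import dataclass

@dataclass
class Incident:
    description: str
    mitigated: bool = False
    resolved: bool = False

def handle(incident, auto_repair, find_responsible_team, mitigate, root_cause_and_fix):
    """Walk one incident through the life cycle described above."""
    # Best case: automation already knows how to fix this class of problem.
    if auto_repair(incident):
        incident.resolved = True
        return incident
    # Otherwise a human is paged; the first (slow) step is finding the right team.
    team = find_responsible_team(incident)   # the step Scouts tries to speed up
    # Stop the customer pain first, e.g., drain traffic off a bad load balancer.
    mitigate(incident, team)
    incident.mitigated = True
    # Only then root-cause and actually fix the underlying issue.
    root_cause_and_fix(incident, team)
    incident.resolved = True
    return incident
```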
Well, let's talk a little bit more about automation for a second and this trend towards AutoML, or automated machine learning.
And it's one line of research that seems really promising. And there's some specific branches of it. You refer to them as content-aware
ML or domain-customized AutoML frameworks. So talk somewhat generally about the work that's
going on in ML and then tell us how you are instantiating it in the world of networks and
distributed systems. I mean, I think this came up when I was the only one in our group
who knew a little bit about networking and machine learning.
And I had 30 different teams in Azure asking me to build a machine learning model
that does something, whatever that was.
And it felt like the pattern that I was going through each time was very, very similar.
And so it felt like I should be able to replicate my brain somehow so that I'm not needed in that process. And I didn't know at the time,
when I researched it, I found that AutoML is actually a thing in the machine learning
communities. I didn't know that. And then when I looked at those, what I found is that a lot of
them try to do anything and everything. Or they're customized to domains that are very, very popular,
things like video analytics, like natural language processing, things like that are always needed, not necessarily
something for networking. So my friend and I, Bita Rouhani from Doug Burger's group, started to
look at, well, what happens if you just dump networking data into these systems? Like,
just let's see how well they do. And they did abysmally bad. The state of the art was like terrible.
And so we looked at it and said, okay, why is that the case? And what we found was that, well,
there's simple domain customizations that we could do even on the input, not anything to the machine
learning, but just how we present the data that would significantly boost their accuracy. And so
the idea was, well, actually operators are really good at that part. Like they really know their
data. They really know things about the data that the AutoML frameworks don't know. So is there a
way to bridge this gap? Is there a way to provide that domain knowledge without knowing anything
about ML? Maybe like somehow the AutoML framework knows what information it needs and queries for
that information from the user and the user provides that information, and then we use that to generate more customized ML model as part of those AutoML frameworks.
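[A minimal sketch of that idea: the operator supplies simple, networking-specific transformations of the input, and a generic model search runs on the transformed data. Here scikit-learn's GridSearchCV stands in for a real AutoML framework, and the transformations are assumptions for illustration only.]

```python
# Sketch of "domain-customized AutoML": the operator encodes what they know
# about their data as an input transformation; a generic model search runs
# on the transformed features. GridSearchCV is only a stand-in for AutoML.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def operator_domain_knowledge(X):
    """What the operator knows: counters are heavy-tailed (log-scale them),
    and per-link measurements are comparable across symmetric links
    (normalize each row by its total traffic). Illustrative assumptions."""
    X = np.log1p(np.abs(X))
    row_sums = X.sum(axis=1, keepdims=True) + 1e-9
    return X / row_sums

search = Pipeline([
    ("domain", FunctionTransformer(operator_domain_knowledge)),
    ("automl", GridSearchCV(              # stand-in for an AutoML search
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [50, 200], "max_depth": [3, None]},
        cv=3)),
])

# Hypothetical data: per-incident link counters, labeled congestion / not.
rng = np.random.default_rng(1)
X = rng.exponential(scale=10.0, size=(600, 12))
y = (X[:, 0] > X[:, 1]).astype(int)
search.fit(X, y)
print("best params:", search.named_steps["automl"].best_params_)
```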
So this sounds a lot like Patrice Simard's work in machine teaching,
which is similar to this domain-specific ML, right?
I mean, it's similar and yet different. I think the nice thing about networking is actually,
even though the types of problems we tackle are very, very diverse, they fall into a very limited set of categories.
Things like congestion control, like diagnosis, like traffic engineering.
Like I can count them on my hand how many broad problem topics we tackle.
And so because of that, it's much easier to provide a networking-specific abstraction for these systems than it is for
any generic problem.
And again, like, for example, a network has a specific structure.
You always have underlying topology.
Like, there are things we know, right, where for generic problems, we might not know those
specific topics.
And I think that's where, like, our take on the problem is different in the sense that
we want to exploit network-specific domain knowledge that you can almost quantify, right?
You can use a structure for them as opposed to a generic problem.
So you're providing a domain expert in networking with machine learning tools,
and they don't necessarily have to be a machine learning expert to be able to use these tools to make the whole thing.
Yeah, and I should preface that we don't know how to do this.
Like, I'm just like giving you the idea of this is what we want to do.
We don't know how to do it yet.
Okay.
Go in there a little bit.
You don't know how to do this.
So this is like...
An idea.
Okay.
Where are you with the idea?
How far have you pushed on it?
So what we did initially was just to verify
this hypothesis that domain knowledge actually helps AutoML systems. And we were successfully
able to demonstrate that. What we're doing now is take one specific area in networking that's
very, very well structured, but yet rich in problems, specifically congestion control.
So within congestion control, you might have a lot of different problems. What is the best
congestion control protocol for me to use at any given point in time,
given different objective functions that I have? Or can I design an ML-based congestion control
protocol? And a lot of different other questions that we have a whole list of. And our idea is,
how do we build a domain-customized AutoML framework for congestion control specifically?
So it's not even for networking, just for this very, very tiny domain within networking. And we're exploring whether
we can do that. Okay. Thank you for the word hypothesis. It was the one I was searching for
and couldn't find five minutes ago. You have a paper that just got accepted at the conference
for Networked Systems Design and Implementation, NSDI, this year.
And you call it PrivateEye.
And it deals with scalable and privacy-preserving compromise detection in the cloud.
What problem is this work addressing and what's promising about your approach to it?
So the problem was really when we talked to Azure operators,
one of the things they mentioned is we have these really good compromise detection systems that are very, very effective that customers can use, but they don't
want to use. Or I don't know if like don't want to use might be a strong word. It might be that
they are hesitant to use. And the reason for that seems to be that they're concerned about their
privacy, how much data they want to share with Microsoft, and also taking on a third-party code.
So basically, Microsoft would have to maintain that compromise detection system for them,
and a lot of customers aren't comfortable with that.
So we looked at this, and the idea was, well, we still need to protect all of our customers,
even though they don't necessarily want to use these systems.
So how do we do this without needing our customers' permission to do so? And the observation was, and this is not a new
observation, a lot of researchers have made this observation in the past, which is, well, network
behavior changes when a VM is compromised. So can we use that change to basically say whether a VM
is likely to be compromised or not, and then go from there.
The other observation, which is unique to this paper, was though that we do have these compromise detection systems that are very, very effective, and they're running on at
least our first-party VMs.
And these are VMs that run things like Bing, like SQL, like services that we have.
And some of our customers are also opting in to use them.
So what they do is provide a constant stream of detections of compromised VMs
that they've seen. And we can use those as sort of quote unquote labels to learn, okay, this is
what compromise looks like. And this is what changes it induces in the network behavior of
these VMs. Putting these two on top of each other, we were like, okay, maybe we can do something
that's privacy preserving compromise detection that operates at data center scale. And then scale
is also a hard thing here. So the paper goes into a lot of trouble of explaining, for example, how do I ensure that I
can run at this massive scale without sacrificing too much on accuracy, without having to use things
like IP addresses, which right now, with GDPR, are difficult to use, because GDPR says that if a
customer wants to, they can contact you and say that you have to delete this and you have 24 hours to do so and so on.
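[Here is a rough sketch of the idea as described: per-VM, privacy-preserving aggregates of network behavior (no IP addresses), labeled by the verdicts of the first-party compromise detectors, feeding an ordinary classifier. The features, data, and model are assumptions for illustration, not the PrivateEye system itself.]

```python
# Sketch: learn what "compromised" looks like from coarse, privacy-preserving
# per-VM network aggregates, using first-party detector verdicts as labels.
# Feature set and model choice are illustrative assumptions only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(2)
n_vms = 10_000

# Hypothetical per-VM aggregates over a time window: number of distinct
# peers, fraction of outbound flows, mean flow size, connection fan-out, ...
features = rng.gamma(shape=2.0, scale=1.0, size=(n_vms, 6))
# Labels: 1 if a first-party detection system flagged the VM as compromised.
labels = (features[:, 0] * features[:, 3] > 6).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(features, labels, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(classification_report(y_te, model.predict(X_te)))
```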
Wow.
So this sounds like it is also in the sort of early stages
of how might we do this.
Yes.
I mean, the paper basically goes and demonstrates
that we can theoretically do this.
And my experience with the other Scout project
kind of says that there's a whole nine yards between we think we can do this and we deploy it and we say, oh, this came up.
So we have to handle this and then this other thing.
So like what I found actually very interesting is from research paper to actually deployment, things can change 180 degrees. Like, you might just completely change the approach you use just because new constraints come up in deployment that you hadn't thought about when you were doing the prototype version, which is basically what the paper usually is.
But you're moving forward, and this paper is kind of the beginning of the exploration, and then you're going to try to scale it up and see where it breaks?
Yeah, it probably will break. I'm going to be honest about that.
Yeah, but that's what research is about, right?
Well, you just referred to Scout again. And let's preface this by saying collaboration is so essential to research now that Microsoft Research even has an award for it.
And you recently won this.
You and your team recently won it for this project called Scout.
So tell us about the project.
What is Scout?
And why did it win MSR's Collaboration Award?
That's interesting.
So I started Scouts two years ago with an intern of mine who was from Princeton
and we basically first started to think about,
okay, is this even a problem?
Like the first step is like, how bad is this problem?
And is this really a problem?
Define this problem.
Meaning, is it really the case that people find it hard
to blame a team for a problem?
We did that investigation and said, yeah, apparently it is hard.
So let's try to solve this problem.
Wait, wait, wait.
So for me, it's easy to blame.
Okay, let's just like level set here.
You're saying it's hard to blame a team or it's hard to prove the blame?
Well, everybody points the finger at the other one.
So that's basically what happens.
Because also like, you know, people have limited time.
So they do a superficial check.
And if everything seems healthy, it's like, nope, not me, your turn.
And so they keep passing the ball around until somebody figures out what's going on.
And that is very, very inefficient.
And we basically just demonstrated that that's the case. All right.
Then we were like, OK, how do we solve this problem?
And so we went about at least doing a prototype version of the Scout,
which is basically a paper we submitted and fine.
But then we were like, okay, can we actually deploy this?
And there's this really cool project in the systems group going on called Resource Central,
which has to do with a framework to deploy machine learning models in production.
So that's where Riccardo Bianchini came in and said, well, we have this really cool framework. Why don't you guys take advantage of this and use this to deploy the
system? So they helped us to basically deploy the first version of a scout. And then the physical
networking team in Azure was the first team that we targeted to build a scout for. And they helped
us with insights on what they knew about the network, the data they collected, helped us figure
out, okay, you did a great job here. You did a sucky job here. We hate you for it. And all of those different things. So like they provided
us with really good feedback. And this is an ongoing collaboration. So we found that scouts
do really well in certain cases, but they suck at cases where operators get actually angry about.
Like we can classify really, really hard problems. When it comes to the easy stuff,
we sometimes make mistakes.
And it turns out operators don't like it when you get things wrong that they would have gotten right.
You can't have everything, though.
Come on, guys.
Well, here's a question, though.
Is the easy stuff problem like the person who's proofreading a paper and gets all of the small print right, but the headline has a massive error in it.
Yeah, pretty much. Yeah. So, for example, one example we saw is there was an incident
where the title of the incident said, this Arista switch is experiencing a problem. So a switch
is basically the purview of the physical networking team. It's actually saying this is what the
problem is. Our very cool scout said, this is not a physical networking
issue. And I was like, okay, why? Turns out that that particular incident was a transient problem.
So that meant that there was a blip. And that blip really didn't register in the monitoring
data that we had. The machine learning model thought things are fine. Nothing's bad. But
because we didn't have the contextual information. And this goes back to the need for context, right?
Yeah. Like, we didn't have the contextual information, and so we got that one wrong.
And what we learned from that is, well, we need to have some form of contextual features as part of our feature set.
Now, if you look at our prototype version, this really didn't register to us as an important problem because our accuracy was so high. We had like 98% true positive, 97% true negative.
But in that 3%, we had these very, very simple mistakes
that operators are very unforgiving about
because it's like, it's saying it in the title.
So how do you fix that?
Well, so it actually ends up being a relatively simple fix because, again, like it's in the title.
You just use information from the title as part of the features that you're using.
For us, the original hypothesis was use the monitoring data as God.
Basically, what does the data show?
But also it's the fact that, you know, there are some incidents that turn out to be non-problems,
but they're still an incident and somebody has to still go and look.
So it's important to basically have the context
from the incident itself as well
as part of your feature set.
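[A minimal sketch of adding that context: vectorize the incident title alongside the monitoring counters, so a phrase like "Arista switch ... experiencing problem" can influence the prediction even when the telemetry looks clean. The column names and example data are hypothetical.]

```python
# Sketch: combine contextual features from the incident itself (the title
# text) with monitoring counters in one scout feature set.
# Column names and example incidents are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

incidents = pd.DataFrame({
    "title": ["Arista switch T0 experiencing problem",
              "Storage latency spike in cluster A",
              "ToR link flaps on rack 7",
              "VM allocation failures in region X"],
    "link_errors": [120.0, 0.0, 340.0, 2.0],
    "retransmit_rate": [0.02, 0.30, 0.05, 0.01],
})
# 1 = physical networking team responsible, 0 = someone else.
y = np.array([1, 0, 1, 0])

features = ColumnTransformer([
    ("context", TfidfVectorizer(), "title"),          # words from the title
    ("telemetry", "passthrough", ["link_errors", "retransmit_rate"]),
])

scout = Pipeline([("features", features),
                  ("clf", LogisticRegression(max_iter=1000))])
scout.fit(incidents, y)
print(scout.predict(incidents[:1]))
```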
Okay.
So I want to sort of weave back in
some of the things we've talked about.
My understanding of what you want to accomplish here
is a self-managing network using AutoML frameworks and having as little human slowdown in the process as possible.
Right.
You don't want humans completely out of the loop.
Not yet. I don't think that's possible.
I mean, ideally, you would want to. And I think that's like kind of the holy grail.
Do you foresee a future where that would be possible?
Somebody that I really admire once told me when you build systems, you have to ask yourself, can it take me to Mars?
And I think that's pretty much what we fail to do when building networks, at least recently. Because, there's maybe, I don't know. But when I look at a lot of the work I have done, a lot of the work my peers have done, I think we
never really ask ourselves that question, which is why we're in the mess we're in.
Maybe over time, as we start to ask that question more, it will take us to Mars. All right, then.
Because, you know, we are the same people that built actual things that took us to the moon. And those did not need operators
to manage them. Yeah. Well, okay. But so having the contextual input for these systems to identify
the stupid errors, can you do that and make that happen with a machine without having the human
context provided?
The right answer to that question is I don't know. These are things that we're experimenting with, that we're trying, but who knows? All right. Let's bring this all together.
Your big goal is to get to networks that can manage themselves, and we're not there yet.
So what would you say are the big open problems in the field that, if solved,
would get us closer to the network
equivalent of self-driving cars? So I think there's a couple of things. One is what data do we actually
need from the network to be able to do this? I think that's still an open problem. But the problem
is not that we don't know what exact data we need. It's like what data we need and how to efficiently
collect it. Like how do we collect it without actually breaking the network while doing so? I think that's, like, there's a lot of work going on,
and we see paper after paper on this topic, but we really don't know what is the necessary and
sufficient data set to be able to do this. That's one. The control loops that we need
to then use this data to do self-driving networks, the equivalent of self-driving
cars,
I don't think they're in place yet.
And we don't even have the mechanisms to then implement that control loop yet.
I also think that we've been bogged down
by just how to get the network to work in the first place.
And a lot of the papers that we see,
like for example, the traffic engineering papers
we see have to do with that.
And so I think it's just,
we haven't had time yet to fully explore the other side of things. And networks themselves, like you started
talking about the Internet of Things, and Donald Kossmann was recently on the podcast and talked
about 9 billion things, as it were, and trying to think of how you even wrap your brain around how you
would manage that kind of a network. Right. Luckily, that's out of my area of expertise.
I work on data center networks. If I get that work, I'm happy. That I would talk to somebody
else about. There's the finger pointing over there. I'm data centers. That's your business.
Well, we've talked about what gets you up in the morning, and it's a lot that gets you up in the morning. Now I want to know what keeps you up at night.
And I often joke with some researchers that their entire career is about what keeps them up at night,
technically speaking. That said, is there anything about your work outside the fact that it's
important to get your work right that keeps you up at night, metaphorically? And if so,
how are you dealing with it?
So I guess, and this is more recent,
I don't think it's been the case for the past,
like, whatever years it is,
like after we started actually deploying the Scout
and getting it to be used in production,
like my worry is, again, will it get us to Mars,
for lack of a better word,
in the sense of how trustworthy is it?
When is it going to break again?
How long is
it going to last as is? When is the next time that somebody is going to yell at me because I got a
simple thing wrong? So I think reliability of machine learning systems for networking
and how hands-free can they actually be is something that keeps me up at night because
it seems to me, at least with our experience,
that there's some level of hand-holding that's needed over time. And that worries me because
what does that actually mean? Does it mean that you always need somebody babysitting
these types of systems? And that's not necessarily the best thing that you would want.
Yeah. You got me thinking so deeply right now about the preferred future and humans out, humans in. And if we could ever really get to a full representation of data center problems.
Right. Who knows? That's why you're working here. Well, it's story time, Behnaz. Tell us about yourself.
How did you get started in the high-tech life?
And how did you end up at Microsoft Research?
I heard the word internships, plural, earlier on.
I had a very messy way of getting to where I'm at.
In high school, I loved physics.
And I liked circuits and electrical systems.
So I went into electrical engineering as my bachelor's degree.
And still I liked electrical engineering a lot.
Like I was the circuits person, the analog circuits person.
How do you analyze them? At that time,
they would teach us things about BJTs and such.
And then that was where I wanted to go.
And a friend of mine said,
you'll never find a job in electrical engineering.
Really?
Yeah.
Especially in analog circuits, which I was particularly good at.
And so I was like, OK, what's the next thing I'm good at?
And that was probability and networking.
And I'm like, OK, that's what I'm going to do.
And actually, the first few classes that I sat in, because there was an electrical engineering analog circuits class that was in parallel to the digital signal processing class that we had to take if we weren't in the communications major.
And I would sit in that class and I was like, I have to be in that other class.
And then I started to fall in love with it.
I was like, I really, really like networking.
So I applied to a networking PhD, again, in electrical engineering.
And then my advisor just left my school.
So I had to find a new advisor.
And that was in the computer science department.
And that's how I became a computer scientist, completely by accident.
Circuitous route.
Yes.
And then I loved it.
So like most of it is like I accidentally stumbled into where I'm at.
And then I ended up falling in love with it.
Where did it all start?
I mean, where was this taking place? And who are you working with?
So I was at the University of Pennsylvania.
I was working with Roch Guérin when I started,
and then Roch left to become the chair of computer science at Washington University in St. Louis.
So then I moved to computer science to work with my new advisor,
Boon Thau Loo, who's still there, and I did networking.
And then I think my first internship was in 2015.
In Azure networking.
Oh, Azure.
Not Microsoft Research, no. And then again, I loved it. So I came back for a second time.
And then I applied for a postdoc here. I did a postdoc. And then I applied for
full-time jobs. And then the rest is history.
So postdoc in Azure or postdoc in Microsoft Research?
Microsoft Research.
Okay.
I don't think Azure has postdocs.
I actually fought for a postdoc in Azure
and they said that we don't have such a thing.
Right?
Tell me a little bit though about the back and forth
between Azure and Microsoft Research.
So the way it happens in MSR is very different
than the way it happens in Azure. So
when I was an Azure intern, I talked to Azure people every day, 24 hours a day. So I knew about
all the problems that were going on. I knew what people's pain points are because they were sitting
next to me. Here in MSR, people come to us and say, I need this problem solved. Or we solve a
problem like, hey, we solved this problem. Do you actually need this that we did? And so it's very, very different, I would say, the dynamic of going from
an idea in MSR to actually deploying it in production. And it's much, much harder than
if you come up with the idea when you're sitting in Azure and deploying it in Azure. But it's
amazing how easily it gets done. It's amazing
like how fun the collaborations are and so on. From your position now, where do you see yourself
in the future? Staying in research? I prefer to not think about that type of thing. I like to be
the person who does things while they're fun. And once they're not fun, stop doing them and move
on to the next thing. So I have no idea how to answer that question. All right. Well, tell us
something that we might not know about you. Maybe it impacted your life or career, a life-defining
moment or some personal characteristic. But maybe it's just something interesting that would give
us some context about you outside the lab. Okay. Not something I'm really proud of, but I'm a very,
very competitive person. So I always attribute me getting to where I am to a friend of mine in high
school, where our principal would come and say, learn from this person, this person is great. And I was like, I can do better.
And it's sad but true that the reason I'm here is because of a competition with another person in high school. Otherwise, I would not get into college. I think I would not get to where I am.
Okay, so let me clarify. There was an actual person that your principal said,
be like that person. I was like, no, I'm going to be better than that person. Oh my gosh. I'd like to meet that principal. Well, before we go,
I want to give you the opportunity to talk to some version of your grad school self.
Assuming you'd listen to you, what advice would you give yourself if you could go back and give yourself advice? The advice I would think is it's okay to be nitpicky.
Like, I think one thing that I found frustrating as a PhD student
was how much one of my advisors, Roch,
wanted us to be very, very meticulous about making sure
about every single detail about something before we made a conclusion.
And it took a long time to do.
It was a lot of pain.
And I've now learned to appreciate that.
And so what I would say is it's hard now, but it's such good advice.
Behnaz Arzani, thank you for joining us today.
Thank you.
It's been so much fun.
Yeah, I know. Thanks.
To learn more about Dr. Behnaz Arzani and the latest in networking research, visit Microsoft.com
slash research.