Microsoft Research Podcast - 043 - All About Automated Machine Learning with Dr. Nicolo Fusi

Episode Date: September 26, 2018

You may have heard the phrase, necessity is the mother of invention, but for Dr. Nicolo Fusi, a researcher at the Microsoft Research lab in Cambridge, MA, the mother of his invention wasn’t so much necessity as it was boredom: the special machine learning boredom of manually fine-tuning models and hyperparameters that can eat up tons of human and computational resources, but bring no guarantee of a good result. His solution? Automate machine learning with a meta-model that figures out what other models are doing, and then predicts how they’ll work on a given dataset. On today’s podcast, Dr. Fusi gives us an inside look at Automated Machine Learning – Microsoft’s version of the industry’s AutoML technology – and shares the story of how an idea he had while working on a gene editing problem with CRISPR/Cas9 turned into a bit of a machine learning side quest and, ultimately, a surprisingly useful instantiation of Automated Machine Learning - now a feature of Azure Machine Learning - that reduces dependence on intuition and takes some of the tedium out of data science at the same time.

Transcript
Starting point is 00:00:00 So we cast it again as a machine learning problem, but we had multiple models interacting and tuning each model separately was a complete nightmare. During that process, I decided, surely somebody must have thought about something. They had, but the problem is that a lot of the state of the art was working only for tuning a few hyperparameters at a time. What we were trying to do was really tune thousands.
Starting point is 00:00:30 You're listening to the Microsoft Research Podcast, a show that brings you closer to the cutting edge of technology research and the scientists behind it. I'm your host, Gretchen Huizinga. You may have heard the phrase necessity is the mother of invention. But for Dr. Nicolo Fusi, a researcher at the Microsoft Research Lab in Cambridge, Massachusetts, the mother of his invention wasn't so much necessity as it was boredom. The special machine learning boredom of manually fine-tuning models and hyperparameters that can eat up tons of human and computational resources, but bring no guarantee of a good result. His solution? Automate machine learning with a meta-model that figures out what other models are doing, and then predicts how they'll work on a given dataset. On today's podcast, Dr. Fusi gives us an inside look at automated machine learning, Microsoft's version of the
Starting point is 00:01:22 industry's AutoML technology, and shares the story of how an idea he had while working on a gene editing problem with CRISPR-Cas9 turned into a bit of a machine learning side quest, and ultimately, a surprisingly useful instantiation of automated machine learning, now a feature of Azure machine learning, that reduces dependence on intuition
Starting point is 00:01:44 and takes some of the tedium out of data science at the same time. That and much more on this episode of the Microsoft Research Podcast. Nicolo Fusi, welcome to the podcast. Thank you. It's great to be here. So you lead the automated machine learning efforts at Microsoft Research in Cambridge, Massachusetts. I really want to wade into the technical weeds with you in a bit, but for right now in broad strokes, tell us about your work. What gets you up in the morning?
Starting point is 00:02:22 Yeah, it's interesting because my background is in machine learning, and I got very excited about problems in computational biology. And so I did a lot of work in computational biology. And then during that work, I kind of figured, oh, there are so many machine learning problems that you can solve that are interesting and apply to a wide range of things. And so I kind of went back a little bit in machine learning. So most recently, as you said, I'm working on automated machine learning, which is a field where the goal is to kind of automate as much of the machine learning process as possible. That goes from data preparation to model criticism, for instance, once you come up with a model. So drill in a little bit on this idea of computational biology. So computational biology is an enormous field. There are many kind of different people doing different things from proteomics to genomics.
Starting point is 00:03:11 Some of them are using mathematical tools. Some of them are more probabilistic or statistical in nature. So my slice of this world was using machine learning and statistics to kind of investigate molecular mechanism. And in particular, I was working on genetics mostly. I also worked on functional genomics, but genetics was the most formative part of my training. So we talked a little bit about your interest in machine learning, computational
Starting point is 00:03:38 biology and medicine, in fact. So those are three sort of divergent paths. Well, there's some overlaps on Venn diagram, but how did those all come together for you? I started in machine learning and, you know, machine learning, you can do either applied work, you pick a problem, you apply machine learning to it. It always comes with its own set of challenges. Or you can pick something more theoretical and maybe advance the way people do modeling or infer the parameters of a model, for instance. And when it came to starting my PhD, I had a choice of problems from both fields. And due to personal circumstances, I felt like I needed to do something that had an effect on
Starting point is 00:04:15 human health. But I also thought that human health was too far from machine learning. And in some sense, to this day, I still think that if you want to do medicine with machine learning, I think you need to do medicine with machine learning, I think you need to stop somewhere in between first, like kind of break up your journey. And I think you probably should break up your journey at the molecular level, which is where computational biology comes in. So I started working on solving questions in computational biology using machine learning with the goal of later kind of going from computational biology to medicine,
Starting point is 00:04:49 again, using machine learning. That's fascinating. Before we launch into your specific work in the field, let's talk a little more generally about automated machine learning. Forbes magazine had an article where the author claimed that it was set to become the future of AI. Is that overstatement? Well, in general, in AI right now, there is a lot of, this is the future of AI. Is that overstatement? Well, in general, in AI right now, there is a lot of this is the future of AI. That is the future of AI. It would be great as somebody who does a lot of AutoML research if AutoML was single handedly the future of all of AI. But I think it's going to be a huge component. And I think more than AutoML, it's probably going to be meta-learning if one has to put
Starting point is 00:05:19 names on fields because meta-learning is learning about learning. So in some sense, I think we have developed over time a good set of kind of base models or base algorithms. And now we are starting to move up the hierarchy and kind of combine classes and families of models into meta models. And that kind of incorporate all that is going on underneath them. So in some sense, I agree with the Forbes article in the sense that we need to move one level up the hierarchy, but there is a lot more work to be done at the base.
Starting point is 00:05:54 Let's drill in a little bit on why automated machine learning is such a big deal. Yes. Okay. And perhaps by way of comparison. So you alluded earlier to the traditional machine learning workflow. Yeah. Tell us what that looks like and what an automated machine learning workflow looks like and why it's different and why it matters.
Starting point is 00:06:12 In my mind, the traditional machine learning workflow, which is also the data science workflow, people use different names. You start with some question and you define what kind of data do I need to answer that question? What kind of metrics measure my success? And what's the closest numerically computable metric that I can pair and I can measure to see whether my model is doing well? And then there is a lot of data cleaning.
Starting point is 00:06:35 And then eventually you start the modeling phase. And the modeling phase involves transforming features, changing different models, tuning different parameters. And every time you go down one path, you pick a way to transform your features, you pick one model, you pick a set of parameters, and then you test it, and then you go back. And you maybe try a different hypothesis, gather more data.
Starting point is 00:06:55 You keep doing this loop over and over again. And then at the end, you basically produce one model that maybe you deploy, maybe you inspect to see whether the predictions are correct or fair or stuff like that. The goal of automated machine learning is to automate as much of this as possible. I don't think we'll ever be able to automate the phrasing the question or deciding the metric because that's what the human should be doing, really. But the goal is to kind of remove as much of the high dimensional thinking with many
Starting point is 00:07:22 options that are not always clear, that really kind of slows down the process for humans. I've heard it described as the drudge work of data science, the fine tuning of the models and the parameters. Explain that a little bit more about how, is it basically a trial and error? It is a lot of a trial and error because it's really high dimensional space. So depending on which value you set one hyperparameter to, all the values for another hyperparameter completely change meaning or change the scale at which they are relevant. And so it becomes a really difficult problem because I've done it, right?
Starting point is 00:07:59 It's extremely boring, not a good use of time. If you do kind of parameter sweeps, which is a lot of what the industry is doing, you kind of waste a ton of computational resources and you have no guarantee of finding anything good. That's sad. It's bleak. Yeah. So how are you tackling that? So there are different techniques and AutoML is a very exciting research area. There has been like AutoML competitions, AutoML workshops, symposiums at NIPS. It's really very exciting. The broad idea is let's try to use a model to kind of figure out what other models are doing. So in that sense, it's kind of meta
Starting point is 00:08:36 learning because you're trying to predict how different models react when you change their parameters. And you use that model to guide you through a series of experiments. All right. So you would have to have a base of model experiments for this uber model to judge, right? Exactly. To learn from. To learn from. Better word. Yes. According to legend, I use that term loosely, you were using machine learning and getting mired in the drudge work of data science and basically thinking there's got to be an app for this or something. Yeah. And you set out to fix the problem for yourself. It wasn't like I'm going to go create this auto ML thing for the world.
Starting point is 00:09:28 It's like I got to solve my own problem. And you kind of did it covertly. Tell us that story. Yeah. So I was working on CRISPR gene editing. Okay. It was a joint collaboration between Microsoft Research and the Broad Institute. At the Broad Institute, the lead investigator was John Dench.
Starting point is 00:09:44 And at Microsoft Research was Jennifer Liskartan, who's now at Berkeley, and me. And basically we got this data, we figured out the question, we did all the deciding which metric we want to optimize. And then we spent six months maybe of our own time, almost full time, trying different ways to slice and dice the model space,
Starting point is 00:10:02 the parameter space. It was just exhausting. What question were you trying to answer? In this work, we were trying to investigate and build a predictive model of the, of target activity in CRISPR-Cas9. So CRISPR is a gene editing system. It allows you to mute a gene that you don't want to be expressed. For instance, this gene is causing a disease, I want to shut it down. So you can do that in previous work we basically figured out again a machine learning model to predict given the many ways you can edit
Starting point is 00:10:29 the gene there because you can do it in different ways what's the most successful edit and in this follow-up work we were investigating the issue of given that i want to perform this edit what's the likelihood that i mess up something else in the genome that I didn't want to touch? You can imagine if I want to remove a gene or silence a gene that was causing a disease, I don't want to suddenly give you a different disease because an intended edit happened somewhere else. And so we cast it again as a machine learning problem, but we had multiple models interacting and tuning each model separately was a complete nightmare. And during that process, I decided, surely somebody must have thought about something. And, you know, they had, but the problem is that a lot of the state of the art was working only for tuning a few hyperparameters at the time. What we were trying to do was really tune thousands.
Starting point is 00:11:19 Right. And so it would have taken ages. And so I kind of had an idea while working on CRISPR. So it was kind of like a side project. It kind of worked and I was suspicious because it was a weird approach that was not supposed to work. It was really a hack. And so I kind of kept it quiet, kind of used it to inform my own experiments, but I didn't advertise it. Okay.
Starting point is 00:11:41 So the best and brightest minds all over the world are trying to tackle this problem. Like you say, there's other companies working on it. They've got symposiums, they've got competitions. And here you are in your lab working on a CRISPR-Cas9 gene problem and you come up with this. Tell us what it is? So it's something that actually is used already by, you know, Netflix, Amazon, we probably use it somewhere in the company to recommend things. Right. So the idea was ultimately deciding which algorithm and which set of hyperparameters to use for a given problem. You're kind of trying to recommend a series of things and then I evaluate them and I tell you how well they worked. and then you kind of update your beliefs about what's going to work and what's not going to work. And that is similar to movies, right? You watch a movie, you rate it, and then they learn more about you, your tastes and so on. The good news for us is that we don't actually rely on the human watching the movie. We can force
Starting point is 00:12:40 the execution of a machine learning pipeline. We can just tell, execute this thing. And they perform, let's say, well or not so well, depending on which dataset they're exposed to. So you can now gather a corpus of experiments that help you guide the selection of what to do in the new dataset. So that's the meta-learning aspect. You have this meta-model that knows and can predict how individual base models will perform when shown some given data.
Starting point is 00:13:07 Okay. So implementing this. Yeah. So because I was the first user, I didn't want just something that you could, you know, academically show, oh, you know, we beat random because random is a strong baseline for AutoML, surprisingly, like picking a model at random. Right. I wanted to encapsulate this work into something I could use into a library.
Starting point is 00:13:26 So we worked on a first version of a toolkit you could just deploy on your data. And we started using it for our own stuff. And then we were working on this. And in the summer at Microsoft, you have the Hackathon, which is this huge initiative. Everybody kind of takes part.
Starting point is 00:13:41 And a lot of teams were looking for data scientists. And it was crazy because on the machine learning mailing list was, oh, I'm working on, you know, these accessibility problems. Is there anybody who knows machine learning who can help me out with this data analysis question? And so we thought, okay, so if nobody's responding to that email, maybe we should just blast out an email to the entire mailing list saying, we have 50 spots, so we can give you an API key that we designed
Starting point is 00:14:05 in a bad way. And if you want to use something that kind of finds a model automatically, you phrase the question and we kind of search your model space for you and give you a pipeline you can use at the end. Just give you a Python object. You can ask for predictions. OK, what was the response? The response was crazy.
Starting point is 00:14:22 So we were a research team of two people specifically. So we didn't have a lot of systems behind us. So it was a single machine serving this meta brain model that kind of figures out what to do. And by our calculations, we could only accommodate 50 teams. So we got 150 requests within the first week. Unbelievable. And so we stretched our resources a bit thin to let people use it. So you did accommodate the 150.
Starting point is 00:14:50 We did accommodate the 150 in the end. Okay. It was a lot of CPU time that went into that. And a lot of like last minute, oh, the service is down. Can you reboot? Something that we'd never experienced. Well, okay. So this is the hackathon.
Starting point is 00:15:03 And it sounds to me like the scenario you're painting is something that would help validate your research and actually help the people that are doing projects within the hackathon itself. It was kind of a win-win because we saw some real-world usage of our tool, and it was very limited. We could only do classification, not regression problems, just because we started with that. Sure. And people came and they started saying, oh, could you add this base learner or this pre-processing method? And it was very useful to us. What did you say?
Starting point is 00:15:30 No, I can't. Well, our answer was always, oh, it's just, you know, it's just us. But one day... It's just research. It's just research, yes. Well, let's go there for a minute. There's been a big announcement at Ignite.
Starting point is 00:15:47 Very exciting. Can you talk about that? So, yes, it's a very exciting announcement. It took a ton of work from a lot of very smart people. So it's a joint collaboration at this point between MSR, you know, we did the original kind of proof of concept and a huge amount of work went in from people within Azure. So it's being released as kind of like a automated ML library or SDK that you can use on your data and it's in public preview. That's super exciting. You know, we have a very good collaboration and a very good ability to now transfer what are technically complex ideas. You know, this technology transfer was not like a small, simple model that you could just write quickly. And we had to think about, is the probability distribution calibrated? Are the
Starting point is 00:16:32 choices that we make based on that information correct in most cases for most datasets? What's the cost on runtime on our servers to satisfy the demand that this thing will likely have? So it was a lot of engineering work. And I think we figured a lot of that out. And so we are able to transfer a lot of the research into product much more quickly. Well, who's the customer for this right now? Uh, we struggled a little bit because in the early validation phase, let's call it the research prototype, let's give it away phase, we got a lot of
Starting point is 00:17:03 different kind of people approaching us. So data scientists, they are interested in because they want to save time. They don't care in a lot of cases what the final model is. They just want a good model and they don't want to spend ages just running parameter sweeps. So data scientists are one set of individuals who could be interested. Developers. Sometimes developers now are tasked with including intelligence in their applications.
Starting point is 00:17:27 And if you don't know what to do, this kind of solution gives you a good model in a short amount of time relative to the size of your data. So you can just use it. And then there is a lot of kind of business analysts, buyers from companies. They have to make data-driven decisions
Starting point is 00:17:44 and they would benefit from good predictions and this tool would give them good predictions. So going back to your comment about data scientists being a customer, why wouldn't a data scientist be a little bit worried that this AutoML might be taking over their job? Yeah, I get asked that question a lot. I created it for myself, not intending to replace myself, but just kind of as a tool for me to use.
Starting point is 00:18:13 The metaphor I use for this, it's kind of like using a word editor. It doesn't replace the role of a writer. It just makes the writer that much more effective because you don't have to cancel with a pencil your old text and just rewrite it from scratch. You can just erase one letter, for instance. And that's what AutoML does. Like if you change, let's say, the way you featurize your data, you don't have to start from scratch with the tuning.
Starting point is 00:18:38 You just set an AutoML run going and you just move on. Would it democratize it to the point where non-data scientists could become data scientists? I think it's a possibility because in some sense you observe this meta model kind of reasoning about your data and you see the thinking process, if you can call it thinking, you see which models it starts out with and then it sees how it evolves. And so you can actually learn things about your data by observing the process, observing how different metrics are, you know, sometimes you want to maximize accuracy, but maybe looking at something like an area under the RC curve
Starting point is 00:19:13 is informative because you can now see how the probabilities are changing. So I think you can learn from AutoML, and I think it can become kind of like some training wheels if you're starting out. Let's go back to CRISPR for a minute, since this all started with how to make it easier to decide where to edit a gene. Yeah. You made a website called CRISPR.ML that provides bioscientists with free tools they can use to make CRISPR gene edits. Basically, yes. Tell us about the site.
Starting point is 00:19:43 Why did you start it and how's it going? That's a great question. So it's crispr.ml. That's the URL. We had to really, really resist the crazy hype to not call it crispr.ai, which was available. And we just said, no, it's ml, it's not ai. We decided not to feed the hype around the ai. That was self-restraint on steroids. I know, I know. It took a lot of us. So we had done work on the on-target problem, which was how do you find the optimal
Starting point is 00:20:14 for some notion of optimality? How do you find the best edit to perform to make sure that you disable a gene that you want to disable? And that was giving you predictions. It was a tool that you could use in Python. You could just unload it from GitHub. We just gave it away, liberally licensed and so on. And a lot of startups incorporated that in their tools, selling it. A lot of institutions were using it. The off-target stuff, which was how do you make sure that you don't have unintended
Starting point is 00:20:40 edits somewhere else in the genome, that was much more computation intensive to run as something you could download. So if you were interested in running it for your own problem of interest, you would have to wait a long time. And so we decided, why don't we just pre-compute everything for a human genome? Which took an exorbitant amount of CPU time on Azure,
Starting point is 00:21:01 but we could just pre-populate a giant database table and then search it almost instantly. And that's what the website does. You can put in the gene you want to edit and you get a list of possible guides with a score that tells you how likely the edit is to be successful and how likely each of target is to happen. And a global score that tells you, broadly speaking, bad, bad, bad of targets here. Don't touch this guy. Don't do it. Yeah.
Starting point is 00:21:30 So it both tells you what's a good place to go and tells you avoid these places because lots of bad stuff could happen. In some sense, it tells you which place to edit. And if you choose to edit this, these other spots on the genome might be edited
Starting point is 00:21:46 and maybe you don't care. So maybe your experiment is very narrowly focused on a given genes. So maybe you don't care, but in therapeutic applications, you want zero targets pretty much all the time. Yeah. Yeah.
Starting point is 00:21:58 That leads me into a question that I ask all of the guests on the podcast. Is there anything that keeps you up at night about what you're doing? Yeah, gene editing, AutoML, and AI, what can go wrong? Right? No, I'm not easily kept awake at night. I can sleep anywhere.
Starting point is 00:22:20 But, you know, there are things that concern me and that try to move my work towards addressing them as a first priority. Maybe the main thing on my mind right now is I know that AutoML or what we call AutoML is a very good way to predict. So it's a very strong supervised machine learning method and it can be applied to all kinds of data. And I want to make sure that as we build the capability to generate better and better predictors, we are also thinking of ways to make sure that the predictions are well explained, that the biases are auditable and visible to the person who's deploying these systems. So we are spending a lot of time now thinking how fairness and all these themes that
Starting point is 00:23:05 are mentioned a lot in this podcast are addressed because it's a very powerful tool. And if you apply it in the wrong way, you're going to have exorbitant amounts of bias. Tell us a bit about yourself, Nicolo. What's your background? How did you get interested in what you're doing? And how did you end up at Microsoft Research? It's a good story. So I will not start from when I was a baby. I will start directly from university.
Starting point is 00:23:43 I did my university in Milan, close to home. Well, at home, basically. And then I attended some advanced course in statistics, and I thought it was fascinating. But I was always kind of like more of a computer science person. And so I figured computer science plus statistics. At the time, the answer was machine learning. I think to this day, probably there is a lot. Same answer. Same answer for most people.
Starting point is 00:24:06 And so I decided to kind of do a summer internship somewhere. Anybody would take me. You know, this is kind of hashtag rejection stories. I must've sent 30 or 40 emails to everybody in Europe to say, can you please, like, I will come for free, like for a summer, it was kind of like my summer vacation that year. And I think the response I got was from Neil Lawrence, who's now also a podcast host as part of other things, Talking
Starting point is 00:24:31 Machines. And he wrote his email in broken Italian because he speaks a little bit of Italian, but he can speak it, but he cannot write it. If he hears this, I hope he's okay with that. And he says, sure, come over. And I went there for a summer. The goal was to kind of build up my CV to do grad school somewhere. So I wanted to do some research, you know, get a feeling. And after that summer,
Starting point is 00:24:55 I basically decided, no, I'm staying here. I want to do a PhD right here. I'm coming back in four months. And I basically kind of closed up all my stuff, like finished my exams at home, just like my thesis and just packed up and went to the UK. So Neil had a choice of projects. And because I wanted to have an impact on health, I kind of chose the molecular biology inspired projects. And I started working on that. And it was one of the
Starting point is 00:25:23 best, you know, three years of my life. It was a lot of fun. I learned a lot. Yeah. And you got your PhD. I got my PhD. Well, in the last year, I traveled a lot during my PhD because I spent some time at Max Planck Institute in Tumingen.
Starting point is 00:25:37 I spent some times at UCLA, at the Institute of Pure and Applied Mathematics. There was a program where the idea, I think, to represent it correctly was to kind of combine some mathematically-minded people and some biology-slash-medicine-minded people to see what kind of collaborations arise. It was an incredible program for, like, I think three or four months during which I met this group of Microsoftees in LA who were doing statistical genomics. And that included my longtime collaborator, Jennifer Liskartan. And I started an internship there the year after.
Starting point is 00:26:09 And then at the end of the internship, they say, hey, do you want to join? So again, once more, I went back, I packed everything up and I said, sure. And I joined Microsoft Research in LA, which was this remote site. It was five of us all kind of working in health. So there's not actually a lab in LA. It was a rented office that used to be a steakhouse on the UCLA campus. Very, very unofficial. If you've ever been to a Microsoft building, you see all these, you know, machines that include beverages like sodas and so on.
Starting point is 00:26:39 We had to have a standing order from supermarkets delivering us the sodas. Because you didn't have a place to... We really have facilities it was just us and our network was ethernet cables running everywhere that's hilarious did you at least have some signage i don't think so it was probably you know those um plastic signs that you can kind of get at any office store we had those but we didn't have a sign that said mic, but we were in the address book. So sometimes Microsoft salespeople would come to our office intending to access the corporate network, but they didn't understand that we were not on the corporate network even.
Starting point is 00:27:15 Oh, you were that off target. Off the grid. Listen, so then how did you come to be at Cambridge? Did you go straight from LA to the Cambridge lab? Yes. So I think after a couple of years in L.A., different people moved in different places. And Jen Liskarden came to Boston. And I heard this Boston Lab is incredible.
Starting point is 00:27:36 And so at NIPS, I met with Jennifer Chase, who's the lab director. I was like, I need to visit there and check it out. And I joined, I think, three years ago now. So you moved up all your stuff again. I moved up all my stuff again. I said, I love it here. I'm just moving. As we close, I like to ask all my guests
Starting point is 00:27:54 to look to the future a little. And it's not like predictions of the future, but more just sort of as you look at the landscape, what are the exciting challenges, hard problems that are still out there for young researchers who might be yourself a few years ago trying to decide, where do I want to land? Where do I want to pack all my stuff up and move to? That's a great question. And one I spend a lot of time thinking about because we have a lot of interns.
Starting point is 00:28:21 The quality of the students right now is exceptional. So even maybe a first year PhD student has an incredible amount of experience very often. And they're asking you, where should I direct my career? And it's hard to give advice. But I think the area where I expect the most improvement and interesting work to be done is probably the area of making decisions given predictions. So I think a lot of machine learning is focused correctly so on giving good predictions. We are now kind of topped out performance on a lot of tasks that were considered
Starting point is 00:28:54 very hard, image recognition. In 2012, I think it was, or 2013, was really our task. And now we are kind of like, we can achieve great performance, great top one, top 5% performance. But I think the gap now is, okay, we got good predictions. How do we make decisions with those predictions? And I think with that, you need to have a notion of uncertainty and well-calibrated uncertainty. So you need to be certain the correct percentage of the time and then uncertain the rest. And, you know, self-driving cars and all these things will need the notion of raising a red flag and saying, I don't know what's going on. I need to not make a decision right now.
Starting point is 00:29:34 Please intervene. You need the notion of low confidence prediction. Please do something else. Don't use me for your decision and beyond. But in general, there is this notion that you need uncertainty. And we are in a decent spot for quantifying uncertainty, but there is a lot more work that needs to be done to have safe, robust machine learning systems. So that's a fruitful line of inquiry for somebody who's interested in this.
Starting point is 00:30:01 Yes. I think, you know, as a student, you need to kind of imagine like skit shooting. You need to shoot ahead of your target. If you're entering the game now and you're trying to just maximize predictive accuracy, where predictive accuracy is basically like a root miss square, minimizing a root miss square or maximize accuracy. I think it's about time to do machine learning if that's your objective. But I think thinking more end to end, what is the end goal of this machine learning system is going to be a much more interesting area in the future.
Starting point is 00:30:31 Niccolo Fusi, thank you so much for joining us today on the podcast. I'm enlightened, and it was so much fun. Thanks for having me. It was fun. To learn more about Dr. Niccolo Fusi and the latest research in automated machine learning, visit Microsoft.com slash research.
