Microsoft Research Podcast - 063 - Competing in the X Games of machine learning with Dr. Manik Varma
Episode Date: February 13, 2019

If every question in life could be answered by choosing from just a few options, machine learning would be pretty simple, and life for machine learning researchers would be pretty sweet. Unfortunately, in both life and machine learning, things are a bit more complicated. That's why Dr. Manik Varma, Principal Researcher at MSR India, is developing extreme classification systems to answer multiple-choice questions that have millions of possible options and help people find what they are looking for online more quickly, more accurately and less expensively. On today's podcast, Dr. Varma tells us all about extreme classification (including where in the world you might actually run into 10 or 100 million options), reveals how his Parabel and Slice algorithms are making high quality recommendations in milliseconds, and proves, with both his life and his work, that being blind need not be a barrier to extreme accomplishment.
Transcript
In 2013, I thought there is no way we can learn 10 million or 100 million classifiers.
And even if you could learn them, where would we store them?
And even if you could store them, how would we make a prediction in a millisecond?
And so I just turned away from one versus all approaches and we tried developing trees
and embeddings.
But today, we've actually managed to overcome all of those limitations. And the key trick is to go from linear time training and prediction to log time training and prediction.
You're listening to the Microsoft Research Podcast, a show that brings you closer to the cutting edge of technology research and the scientists behind it.
I'm your host, Gretchen Huizinga. If every question in life could be answered by
choosing from just a few options, machine learning would be pretty simple, and life
for machine learning researchers would be pretty sweet. Unfortunately, in both life and machine
learning, things are a bit more complicated. That's why Dr. Manik Varma, Principal Researcher
at MSR India, is developing extreme classification systems to answer multiple choice questions that
have millions of possible options and help people find what they're looking for online
more quickly, more accurately, and less expensively. On today's podcast, Dr. Varma tells us all about
extreme classification, including where in the world you might actually run into 10 or 100 million options, reveals how his Parabel and Slice algorithms are making high-quality recommendations in milliseconds, and proves with both his life and his work that being blind need not be a barrier to extreme accomplishment.
That and much more on this episode of the Microsoft Research Podcast.
Manik Varma, welcome to the podcast.
Thanks, Gretchen. The pleasure is entirely mine.
So you're a principal researcher at MSR India and an adjunct professor of computer science at IIT
Delhi. In addition to your work in computer science, you're a physicist, a theoretician,
an engineer, and a mathematician. You were a Rhodes Scholar at Oxford and a university scholar when
you did your doctoral work there. And you were a postdoc at MSRI Berkeley. And you're
blind. I'm not going to call you Superman, but I would really love to know what has inspired you
to get to this level of accomplishment in your life. What gets you up in the morning?
I guess it's a combination of my hopes and desires on the one hand, and I guess fears as well on the
other. So I guess hopes and desires. I hope every day I get up, I learn something new.
That's one of the best feelings I've had. And that's what's driven me all this while. And
actually, that's why I'm in research, because I get to ask new questions and learn about new
things throughout my career. So that's fantastic. And the other hope and desire that I have that
drives me is to build things and build things that will help millions of people around the world. And I guess there's some fears lurking behind that as well. I've been worried about the imposter syndrome all my life. And yeah, so I guess the best way to tackle that is actually to try and do things and get things out there and have people use them and be happy with them. So I guess that's the fear that's driving the hopes and desires.
So Manik, all of this has been without the use of your eyes. How have you gone about all this?
Right. So I have a condition where the cells in my retina deteriorate over time.
So in about 2009, I think I started using a blind stick and then I lost the ability
to read papers, then recognize faces, stuff like that. But it makes life very interesting,
right? I go up to my wife and ask, who are you? That's the secret to a happily married life.
Well, so you haven't been blind since birth?
No. Well, it's a hereditary condition, but nobody else in my family has it.
So it started deteriorating from the time I was in school and I lost the ability to do some things in school.
But it's only over the last, let's say, 10 years where it's become really significant.
But again, it's been no big deal also, right?
It's only in the last decade where I've had to think about it.
And that's, I think, because of my kids, right?
I want to set them an example that they can, because this is hereditary, there's a probability
they might have it.
And if they do, I don't want them to feel that they can't do something, right?
So if they want, they can go climb Mount Everest or become the best historian or the best scientist or whatever. As long as they set their minds to it, they can do
that. We're going to talk about this relatively new area of machine learning research called
extreme classification in a minute. But let's start a bit upstream of your current work first.
Give us a quick review of what we might call the machine learning
research universe, as it were, focusing on a bit of the history and the types of classification
that you work with. And this is just to give us a framework for how extreme classification
is different from other methodologies and why we need it. So if you look at the history of
machine learning, then the most well-studied problem is binary classification, where we learn to answer yes-no questions involving uncertainty.
And then the community realized that actually there are many high-impact applications out there in the real world that are not just simple yes-no questions, right?
They're actually multiple-choice questions.
And this leads to the field of multi-class
classification and then after that the community realized that there's some high impact applications
that are not only multiple choice but they also have multiple correct answers and this led to the
establishment of the area of multi-label classification so just to take examples of all
of these if you ask the question,
is Gretchen speaking right now or not, then that's binary classification. Whereas if you
turn this into a multiple choice question, such as who's speaking right now, is it Gretchen or
Manik or Satya Nadella? So that's a multiple choice question; that'll be multi-class classification.
But now suppose you threw a cocktail party, and you invited all the top machine learning and AI scientists over there. And then you took a short clip and asked who's speaking
in this clip. That becomes a multi-label classification problem because now multiple
people could be speaking at the same time. And so if you have L choices in front of you, then in multi-class, the output space is L-dimensional, or order L. But if you have multi-label, then the output space is two to the power L, because every person may or may not be speaking. So you go from two choices in binary classification to tens, hundreds, and thousands of choices in multi-class and multi-label learning. And if you looked at the state of the art in about 2012, the largest multi-label data set out there had about 5,000 labels.
And I remember all my colleagues running their algorithms for weeks on big clusters to try and solve this problem, because 2 to the power 5,000 is way more than the number of atoms in the universe.
So it's a hard problem.
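[Editor's note: a quick back-of-the-envelope sketch in Python, not from the episode, showing why the multi-label output space explodes; the 5,000-label figure is the one mentioned above.]

```python
# With L labels, multi-class has L possible answers, but multi-label has
# 2**L possible label subsets, since each label can be on or off.
L = 5000
multi_class_outputs = L
multi_label_outputs = 2 ** L                 # Python big ints handle this exactly

ATOMS_IN_UNIVERSE = 10 ** 80                 # common rough estimate
print(multi_label_outputs > ATOMS_IN_UNIVERSE)   # True: 2**5000 is about 10**1505
```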
So this has been a problem for quite some time.
It's not brand new, right?
But it's getting bigger.
Is that the issue?
Right.
And actually, that's how extreme classification got started.
So as I mentioned, in 2012, the largest publicly available multi-label data set had about 5000
labels.
But then in 2013, we published a paper which exploded the number of labels being considered in a multi-label classifier from 5,000 to 10 million.
Wow.
And that really changed the nature of the game.
So the motivating application was to build a classifier that could be used as a tool by advertisers that would predict which Bing queries would lead to a click on the ad or
the document.
And you can well imagine from the context of the application that this is a really important
problem from both the research as well as the commercial perspective.
And so many sophisticated natural language processing, machine learning, information
retrieval techniques have been developed in the literature to solve this problem.
But unfortunately, none of these were working
for our ads team.
They had billions of ads for which
all these sophisticated approaches were not
making good quality predictions.
And so we decided to go back to the drawing board.
We set aside all of these approaches
and simply reformulated the problem
as a multi-label classification problem, where
we would take the document as input,
and then we would treat each of the top queries on Bing as a label.
So we took the top 10 million monetizable queries on Bing,
and now you just learn a classifier that will predict
for this particular document or ad
which subset of top 10 million Bing queries will lead to a click on the ad.
Top 10 million.
Yeah, so from 5,000 to 10 million. This was just a very different and completely new way of looking at the problem.
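[Editor's note: a minimal sketch of the reformulation described above, on toy data; the scikit-learn one-vs-rest setup, the tiny label count, and the random features are illustrative assumptions, not the production Bing system.]

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))        # 200 "documents" with 20 features (toy)
Y = rng.integers(0, 2, size=(200, 10))    # 10 "top queries" treated as labels (toy)

# One binary classifier per query-label: will this document get a click
# from that query? Predict the subset of matching queries for a new doc.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
print(clf.predict(X[:1]))                 # 0/1 vector over the 10 query-labels
```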
And it took us two years to build this system, run the experiments, publish our results and
check everything out. But once the results came in, we found that our approach was much better than all these
traditional approaches. So the number of ads for which we were making good quality recommendations
went up from about 60% for the Bing system to about 95, 98% for us. And the quality of our
recommendations also improved a lot. And so that led to the establishment of the area of extreme
classification, which deals
with multi-class and multi-label problems in extremely large label spaces, say millions or
billions of labels. And I think that's exactly why extreme classification grew to be a whole new
research area in itself. And that's because I think fundamentally new research questions arise
when you go from, let's say, 100 labels to 100 million
labels. Let me just give you a couple of examples, if you'll permit me the time. Yes, please.
The whole premise in supervised machine learning is that there's an expert sitting out there who we can give our data to, and he or she will label the data with ground truth: what's the right answer, what's the right prediction to make for this point. But unfortunately, in extreme classification, there is no human being who can go through a list of 100 million labels to tell you what are the right predictions to make for this data point. So even the most fundamental machine learning techniques, such as cross-validation, might go for a toss at the extreme scale.
And you'll have missing labels in your test set, in your validation set, in your training set.
And this is like a fundamental difference that you have with traditional classification, where a human being could go through a list of 100 labels and mark out the right subset. Another really interesting question is the whole
notion of what constitutes a good prediction changes when you go from 100 labels to let's
say 100 million. When you have 100 labels, you need to go through that list of 100 and say,
okay, which labels are relevant, which labels are irrelevant. But when you have 100 million,
nobody has the time or patience to go through it. So you need to give your top five best predictions very quickly.
And you need to have them ranked with your best prediction at the very top and then the worst one at the bottom.
And you need to make sure that you handle this missing labels problem.
Because some of the answers that you predict might not have been marked by the expert.
So all of this changes as you go from one scale
to the next scale. Let's talk for a second about how extreme classification can be applied in other areas besides advertising.
Tell us a little bit about the different applications in this field and where you think extreme classification is going.
I think one of the most interesting questions that came out of our research was: when or where in the world will you ever have 10 million or 100 million labels to choose from? If you just think about it for a minute, a hundred million is a really large number. Just to put it in context, to see how big a number this is: when I was doing my PhD in computer vision, the luminaries in the field would get up and tell the community that, according to Biederman's counts, there are only 60,000 object categories in the world. So none of the classical visual problems would make the cut.
And even if you were to pick up a fat Oxford English dictionary, it would have somewhere
around half a million words in it. So many traditional NLP problems might also not make
the cut. And over the last five years, people have actually found very high impact applications
of extreme classification.
And so, for example, one of them is the reformulation of well-known problems in machine learning, like ranking and recommendation, which are critical for our industry.
So suppose you wanted to, for instance,
design a search engine, right?
You can treat each document on the web as a label, and now, when a query comes in, you can learn a classifier that will take the query's feature vector as input and predict which subset of documents on the web are relevant to this particular query. And so then you can show those documents, and you can rank them on the strength of the classifier's probabilities, and you can reformulate ranking as a classification problem. And similarly, think about recommendation. So suppose you were to go on to a retailer's website. They have product catalogs that run into the millions, if not hundreds of millions, and so no human being can go through the entire catalog to find what they're looking for, and therefore recommendation becomes critical for helping users find the things they're looking for. And now you can treat each
of the 100 million products that the retailer is selling as a particular category, and you learn a
classifier that takes the user's feature vector as input, and simply predicts which subset of
categories are going to be of interest to the user. And you recommend the items corresponding to those categories to the user.
And so you can reformulate ranking and recommendation as extreme classification.
And sometimes this can lead to very large performance gains as compared to traditional
methods such as collaborative filtering or learning to rank or content-based methods.
And so that's what extreme classification is really good for.
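[Editor's note: a toy sketch of recommendation-as-classification as described above — score every label with a one versus all linear classifier and return the top few, ranked; the shapes and random weights are illustrative assumptions.]

```python
import numpy as np

def recommend_top_k(x, W, k=5):
    # W holds one weight vector per label (row); rank labels by score
    # and return the k best, best first.
    scores = W @ x
    top = np.argsort(-scores)[:k]
    return top, scores[top]

rng = np.random.default_rng(0)
W = rng.standard_normal((1_000, 64))   # 1,000 "products" x 64 user features (toy)
x = rng.standard_normal(64)            # one user's feature vector
labels, scores = recommend_top_k(x, W)
print(labels, scores)
```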
Let's talk about results for a minute.
How do we measure the quality of any machine learning classification system?
I imagine there are some standard benchmarks,
but if it's like any other discipline,
there are trade-offs and you can't have everything.
So tell us, how should we think about measurement?
What kinds of measurements are you looking at?
And how does extreme classification help there?
So the axes along which we measure the quality of a solution are training time, prediction
time, model size, and accuracy.
So let's look at these one by one.
And they're all linked also.
So if you look at training time, this is absolutely critical for us.
We cannot afford to let our extreme classifiers take months or years to train. So think of the number of markets in which Bing operates all over the world. If it were to take you, let's say, two days to train an extreme classifier, then every time you wanted to deploy in a new market, that would be two days gone. Every time you wanted to run an experiment where you change a feature, two days gone. Every
time you want to tune hyperparameters, two days gone. And again, when I'm saying two days, this
is probably on a cluster of a thousand cores, right? So how many people or how many groups
have access to those kinds of clusters? Whereas if you could bring your training time down
to let's say 17 minutes of a single core
of a standard desktop,
now you can run millions of experiments
and anyone can run these experiments
on their personal machines.
And so the speed with which you can run experiments
and improve the quality of your product
and your algorithm,
there's a sea change in that.
Yeah.
So training time becomes very important. But as you pointed out, right, you want to maintain accuracy. So you can't say, oh, I'll cut down my training time, but then I'll make really bad predictions. That's not acceptable. And then the second thing is model size. So if your classifier is going to take, let's say, a terabyte to build its model, then your cost of making recommendations will go up. You'll need to buy more expensive hardware. And again, the number of machines that have a terabyte is limited, so the number of applications you can deal with in one go is limited. So again, you want to bring your model size down to a few gigabytes so that it can fit on standard hardware, and anybody can use this for making predictions. And again, there's a trade-off between model size and prediction time, right? You can always trade speed for memory. But now the question is, how can you bring down your model size to a few gigabytes without hitting your prediction time too much? And the reason prediction time is important is because, again, at the scale of Bing, we might have 100 billion documents that we need to process regularly.
So if your prediction time is more than a few milliseconds per document, then there is no way you can make 100 billion predictions in a reasonable amount of time.
So your solution simply will not fly.
And all of this, as I said, is tied to accuracy because people will not suffer poor recommendations.
No, I won't.
Well, so circling back, how does extreme classification impact your training time, model size, speed, accuracy, all of that?
If you look at the papers that we are publishing now, like Parabel and Slice, we can train on that same data set in, as I said, 17 minutes of a single
core of a standard machine. And so we've really managed to cut down training time, let's say by
10,000 times over the span of five years. We've managed to reduce our model size from a terabyte to a few gigabytes.
Our prediction time is a few milliseconds per test point.
And we managed to increase accuracy from about 19% precision at one to about 65% today.
So precision at one is if you just look at the top prediction that was made, was that correct or not?
So on a standard benchmark data set, it was 19% in 2013, and we managed to increase that to 65% today. So there's been a remarkable improvement in almost all metrics over the last five, six years.
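[Editor's note: a small sketch of the precision-at-k metric defined above; the example labels are made up.]

```python
def precision_at_k(predicted, relevant, k):
    # predicted: labels ranked best first; relevant: the ground-truth label set.
    # precision@1 just asks whether the single top prediction is correct.
    return len(set(predicted[:k]) & set(relevant)) / k

print(precision_at_k([42, 7, 19], {42, 100}, k=1))   # 1.0: top prediction correct
print(precision_at_k([42, 7, 19], {42, 100}, k=3))   # 0.33...: one of three correct
```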
You just mentioned Parabel and Slice, so let's talk about those right now. First of all, Parabel. It's a hierarchical algorithm for extreme classification. Tell us about it and how it's advanced both the literature and machine learning research in the field in general.
So for the last five years,
we've been trying out lots of different approaches to extreme classification. We tried out trees,
we tried out embeddings and deep learning, and we looked at one versus all approaches.
And in a one versus all approach, you have to learn a classifier per label.
And then you use all of them at prediction time to see which ones have the highest scores.
And then that's the classes or the labels you recommend to the user.
And in 2013, I thought there is no way we can learn 10 million or 100 million classifiers. And even if you could learn them, where would we store them? And even if you could store them, how would we make a prediction in a millisecond? And so I just turned away from one versus all approaches, and we tried developing trees and embeddings. But today, we've actually managed to overcome all of those limitations. And the key trick is to go from linear time training and prediction to log time training and prediction. And originally, I thought this is not possible, right? Because you have to learn one classifier per label. So if you have 100 million labels, you have to have 100 million classifiers, so your training time has to be linear
in the number of labels. But with Parabel, we managed to come up with a log time training and a log time prediction
algorithm.
And the key is to learn a hierarchy over your labels.
So each node in the hierarchy inherits only about half of its parent's labels.
So there's an exponential reduction as you go down the hierarchy.
And that lets you make predictions in logarithmic time. Now, the trick in Parabel was, how do you make sure that the training is also logarithmic time? Because at the leaves of the hierarchy, you will have, let's say, one label in each leaf node, and so you have to have at least as many leaves as labels, so that brings you back to linear costs. But somehow Parabel gets around this. You'll need to read the paper in order to get the trick. It's a really cute trick, and it lets you get away with log time training. And so that's why Parabel manages to get this 10,000 times speedup in training and prediction, and a reduction in model size, and the accuracy is great. It's a really cool algorithm.
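[Editor's note: a toy sketch, not the actual Parabel code, of why a balanced label hierarchy gives logarithmic prediction: each node's classifier routes the point to one child, roughly halving the candidate labels; the routing function below is a stand-in assumption for the trained node classifiers.]

```python
def predict_via_tree(x, num_labels, route_left):
    # Walk from the root to a leaf; each step keeps about half the labels.
    lo, hi, steps = 0, num_labels, 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        lo, hi = (lo, mid) if route_left(x, lo, mid, hi) else (mid, hi)
        steps += 1
    return lo, steps

label, steps = predict_via_tree(None, 10_000_000, lambda *args: True)
print(steps)   # ~23: about log2(10 million) nodes visited, not 10 million
```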
More recently, you've published a paper on what you've called the Slice algorithm.
What's the key innovation in Slice?
How is it different from or how does it build on Parable?
Right.
So Slice is also a one versus all algorithm where you learn a classifier per label.
It also has log time training and prediction, but it's based on a completely different principle.
So in Parabel, our approach was to learn a hierarchy.
So you keep splitting the label space in half
as you go down each node.
Slice has a very different approach to the problem.
It says, okay, let me look at my entire feature space.
And now if I look at a very small region
in the feature space,
then only a small number of labels
will be active in this region
So now, when a new user comes in, if I can find out very quickly which region of feature space he or she belongs in, then I can just apply the classifiers in that region. And that's the key approach that Slice takes to cutting down training time and prediction time from linear to logarithmic.
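[Editor's note: a toy sketch of the shortlist-then-score idea described above, not the Slice code; Slice uses an approximate nearest-neighbour index so the shortlist step is sub-linear, whereas the brute-force search below, plus all shapes and sizes, are illustrative assumptions.]

```python
import numpy as np

rng = np.random.default_rng(1)
n_labels, dim = 10_000, 64
label_centroids = rng.standard_normal((n_labels, dim))  # one point per label (toy)
W = rng.standard_normal((n_labels, dim))                # one classifier per label

def shortlist_and_score(x, shortlist_size=50, k=5):
    # Find the region of feature space the user falls in (here: nearest
    # label centroids), then apply only the classifiers active there.
    dists = np.linalg.norm(label_centroids - x, axis=1)
    shortlist = np.argsort(dists)[:shortlist_size]
    scores = W[shortlist] @ x
    return shortlist[np.argsort(-scores)[:k]]

print(shortlist_and_score(rng.standard_normal(dim)))    # top-5 label ids
```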
And it's about to appear in WSDM this month, and it scales to 100 million labels. I mean, if you look at ImageNet, right, that was a thousand classes with a thousand training points per class, and now we have a hundred million classes. So we used it for the problem of recommending related searches on Bing. So when you go and type in a query on a search engine, the search engine will often recommend
other queries that you could have asked that might have been more helpful to you, or that
will highlight other facets of the topic.
And so for obvious queries, the recommendations are pretty standard and everyone can do a
good job.
The real fight is in the tail for these queries
that are rare, that we don't know how to answer well. Can you still make good quality recommendations?
And there, getting even a 1% lift in the success rate is a big deal. Like, it takes
dedicated time and effort to do that. And we managed to get a 12% lift in performance with
Slice in the success rate. So that's like really satisfying.
Yeah. And you had some other percentages that were pretty phenomenal too in other areas. Can
you talk about those a little? Right. So we also saw increases across the board in trigger coverage,
suggestion density, and so on. So trigger coverage is the number of queries for which you could make
recommendations. And we saw a 52% increase in that. 52. Yeah, that's right. That was amazing.
Statistically significant on steroids.
And then the suggestion density is the number of recommendations you make per query.
And there was a 33% increase in that as well.
So, yeah, we had pretty significant lifts.
And I'm very glad to say Slice has made billions of recommendations so far.
And people are really happy. It's really improved the success rate of people asking queries on Bing.
That's cool. Speaking of where people can
find stuff, I imagine a number of listeners would be interested to plunge in here. Where can they
get this? Is it available? Where are the resources? So we maintain an extreme classification repository,
which makes it really easy for researchers and practitioners who are new to the area to come in and get started.
If you go to Bing, or your favorite search engine, and search for the extreme classification repository, you can find it very easily. And there you'll find not just our code,
but you'll find data sets, you'll find metrics on how to evaluate the performance of your own
algorithm, you'll find benchmark results showing you what everybody else has achieved in the field so far.
You'll find publications.
And if you look at my homepage,
you'll also find a lot of talks.
So you can go and look at the recordings
to explore more about whatever area fascinates you.
And all of this is publicly available,
freely available to the academic community.
So people can come in and explore whatever area of extreme classification they like.
So, Manik, we've talked about the large end of extreme classification, but there's another
extreme that lies at the small end of the spectrum, and it deals with really, really,
really tiny devices. Tell us a bit about the work you've done with what you call resource-efficient
ML. Yeah, that's the only other project I'm working on. And that's super cool too, right?
Because for the first time in the world,
we managed to put machine learning algorithms
on a microcontroller that's smaller than a grain of rice.
Think of the possibilities that opens up, right?
So you can now implant these things in the brains of people
who might be suffering from seizures
to predict the onset of a seizure.
You could put them all over a farm to try and do precision agriculture: tell you where to water, where to put fertilizer, and where to put pesticide, and all of that. The applications are completely endless, especially once you start thinking about the Internet of Things: any number of applications in the medical domain, in smart cities, smart housing. So in 2017, we put two classical machine learning algorithms based on trees and nearest neighbors, called Bonsai and ProtoNN, onto this microcontroller. It has only two kilobytes of RAM, 32 kilobytes of read-only flash memory, and no support in hardware for floating point operations, and yet we managed to get our classifiers to run on it.
And then last year, we released two recurrent neural network algorithms called FastGRNN and EMI-RNN.
And again, all of these are publicly available from GitHub.
So if you go to github.com slash Microsoft slash EdgeML, you can download all these algorithms
and play with them and use them for your own applications.
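[Editor's note: a toy sketch of the kind of fixed-point arithmetic an FPU-less microcontroller forces on you — an illustrative assumption, not the EdgeML code: scale values to integers, do the dot product in integer arithmetic, divide the scale back out.]

```python
SCALE = 1 << 8                       # 8 fractional bits of fixed-point precision

def to_fixed(xs):
    return [int(round(v * SCALE)) for v in xs]

def fixed_dot(w_fx, x_fx):
    # Integer multiply-accumulate only; one rescale at the end.
    return sum(wi * xi for wi, xi in zip(w_fx, x_fx)) // SCALE

w = to_fixed([0.5, -1.25, 2.0])      # classifier weights, quantized
x = to_fixed([1.0, 0.5, -0.25])      # input features, quantized
print(fixed_dot(w, x) / SCALE)       # -0.625, matching the float dot product
```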
So while we're on the topic of your other work,
you said that was the only other project you're working on, but it isn't.
And maybe they're tied together, but I've also heard you're involved in some rather
extraterrestrial research. Can you tell us about the work you're doing with SETI?
Yeah, so they're related. Apparently, some of these algorithms that we've been developing could have many applications in astronomy and astrophysics. So if you look at our telescopes right now, they're receiving data at such a high rate that it's not possible to process all of this data, or even store all of this data. So because the algorithms we've developed
are so efficient, if we could put them on the telescope itself, it could help us analyze all
types of signals that we are receiving, including perhaps our search for extraterrestrial intelligence.
So that's a really cool project being run out of Berkeley. But there are also lots of other
applications because if you're trying to put something on a satellite, I'm told by my astronomer friends that
the amount of energy that an algorithm can consume is very limited because energy is at a premium
out there in space. And so things that are incredibly energy efficient or will have very
low response time are very interesting to the astronomers. So, Manik, I ask all my guests
some form of the question, is there anything that keeps you up at night? You're looking at some
really interesting problems, big ones and small ones. Is there anything you see that could be of
concern? And what are you doing about it? Extreme classification touches people. People use it to find things they're
looking for. And so they reveal a lot of personal information. So we have to be extremely careful
that we behave in a trustworthy fashion where we make sure that everybody's data is private to them.
And this is not just at the individual level,
but also the corporate level, right?
So if you're a company that's looking to come to Microsoft
and leverage extreme classification technology,
then again, your transaction history and your logs and stuff,
we'll make sure that those are private and you can trust us
and we won't share them with anybody else.
And because we're making recommendations,
there are all these issues about fairness,
about transparency, about explainability.
And these are really important research challenges and ones that we are thinking of at the extreme
scale.
At the beginning of the podcast, we talked briefly about your CV and it's phenomenally
impressive.
But tell us a little bit more about yourself. How did
your interest in physics and theory and engineering and math shape what you're doing at Microsoft
Research? Yeah, so because I've been exposed to all of these different areas, it helps me really
appreciate all the different facets of the problem that we're working on. So the theoretical underpinnings are extremely important. And then I've come to realize how important the mathematical and
statistical modeling of the problem is. And then once you have built the model, then engineering
a really good quality solution, how to do that, what kind of approximations to make, so that you start from theoretically well-founded principles but then make good engineering approximations that will help you deliver a world-class solution. And so it helps me look at all of these aspects of the problem and try and tackle them holistically.
So what about the physics part?
Actually, to tell you the truth, I'm a miserable physicist. I completely failed. Yeah, I'm not very good at physics, unfortunately, which is why I switched.
You know, I gotta be honest, I'll bet your bad at physics is way better than my bad at physics. As we close, I'd like to ask you, as I ask all my guests, to offer some parting advice or inspiration to listeners, many of whom are aspiring researchers in the field of AI and machine learning.
What big, interesting problems await researchers just getting into the field right now?
Yeah, we've been working on it for several years, but we've barely scratched the surface.
I mean, there's so many exciting and high impact problems
that are still open and need good quality solutions, right? So if you're interested in
theory, even figuring out what are the theoretical underpinnings of extreme classification,
how do we think about generalization error, and how do we think about the complexity of a problem? That would be such a fundamental contribution to make. If you're interested in engineering or in the machine learning side of things, then how do you come up with algorithms that bring down your training and prediction costs and model size from linear to logarithmic? So can we have an algorithm that has log time training, log time prediction, and a log-size model, and yet is as accurate as a one versus all classifier that has linear costs? And if you're interested in deep learning, then how can you do deep learning at the extreme scale? How do you learn at the scale where you have a hundred million classes to deal with? How do you personalize extreme classification? If you wanted to build something that was personalized for each user, how would you do that?
And if you're interested in applications, then how do you take extreme classification and apply it to all the really cool problems that there are in search, advertising, recommendation, retail, computer vision? All of these problems are open, and we're looking for really talented people
to come and join us and work with us on this team.
And location is no bar.
No matter where you are in the world, we're happy to have you.
Level is no bar.
As long as you're passionate about having impact on the real world and reaching millions of users, we'd love to have you with us.
So we've talked about going from 100 to 100 million labels. What's it going to take to go from 100 million to a billion labels?
I want to go to an infinite set of labels. That would be the next extreme in extreme classification. And that's really important, again, for the real world,
because if you think about applications on Bing or search or recommendation, you have new documents
coming on the web all the time, or you have new products coming into a catalog all the time.
And currently, our classifiers don't deal with that.
Thankfully, we managed to cut down our training costs so that you can just retrain the entire
classifier periodically when you have a batch of new items come in. But if you have new items
coming in at a very fast rate, then you need to be able to deal with them from the get go. And so
we now are starting to look at problems where there's no limit on the number
of classes that you're dealing with. Well, I imagine the theory and the math and then the
engineering to get us to that level is going to be significant as well. But it'll be a lot of fun.
You should come and join us. Come on, it'll be fun, he said.
Manik Varma, it's been an absolute delight talking to you today.
Thanks for coming on the podcast.
Thank you, Gretchen.
As I said, the pleasure is entirely mine.
To learn more about Dr. Manik Varma and how researchers are tackling extreme problems
with extreme classification, visit Microsoft.com slash research.