Microsoft Research Podcast - 078r - Machine teaching with Dr. Patrice Simard
Episode Date: January 15, 2020
This episode originally aired in May, 2019. Machine learning is a powerful tool that enables computers to learn by observing the world, recognizing patterns and self-training via experience. Much like... humans. But while machines perform well when they can extract knowledge from large amounts of labeled data, their learning outcomes remain vastly inferior to humans when data is limited. That’s why Dr. Patrice Simard, Distinguished Engineer and head of the Machine Teaching group at Microsoft, is using actual teachers to help machines learn, and enable them to extract knowledge from humans rather than just data. Today, Dr. Simard tells us why he believes any task you can teach to a human, you should be able to teach to a machine; explains how machines can exploit the human ability to decompose and explain concepts to train ML models more efficiently and less expensively; and gives us an innovative vision of how, when a human teacher and a machine learning model work together in a real-time interactive process, domain experts can leverage the power of machine learning without machine learning expertise.
Transcript
When Patrice Simard was on the podcast in May of last year,
he gave us a primer course on the science and philosophy of machine teaching,
an innovative way to train ML models with limited data and make customized AI more accessible.
Whether you heard Patrice last year,
or you're just tuning in for your first lesson on bespoke AI,
I know you'll enjoy Episode 78 of the Microsoft Research Podcast, Machine Teaching.
A lot of people have thought that the key to AI is the learning algorithm.
And I actually don't believe it's the learning algorithm.
I think teaching is what makes the difference.
So from a philosophical standpoint, I believe that machine learning algorithm
is almost the easy part,
is the part that you can locally optimize.
Teaching is the part that you have to optimize
at a global level, at a societal level.
And I think that may actually be the key to AI
the same way it was the key to human development.
You're listening to the Microsoft Research Podcast,
a show that brings you closer to the cutting edge
of technology research and the scientists behind it.
I'm your host, Gretchen Huizinga.
Machine learning is a powerful tool
that enables computers to learn by observing the world,
recognizing patterns, and self-training via experience. Much like humans. But while machines perform well when they can extract knowledge
from large amounts of labeled data, their learning outcomes remain vastly inferior to
humans when data is limited. That's why Dr. Patrice Simard, Distinguished Engineer and
Head of the Machine Teaching Group at Microsoft, is using actual teachers to help machines learn and
enable them to extract knowledge from humans rather than just data.
Today, Dr. Simard tells us why he believes any task you can teach to a human,
you should be able to teach to a machine,
explains how machines can exploit the human ability to decompose
and explain concepts to train ML models more efficiently and less expensively,
and gives us an innovative vision of how, when a human teacher and a machine learning model work together in a real-time interactive process, domain experts can leverage the power of machine
learning without machine learning expertise. That and much more on this episode of the Microsoft
Research Podcast.
Patrice Simard, welcome to the podcast. Thank you, this is a pleasure to be here.
I have to start a little differently than I normally do
because you and I are talking
at a literal transition point for you.
Till recently, I would have introduced you as Distinguished Engineer,
Research Manager, and Deputy Managing Director of Microsoft Research.
But you're just moving along with a stellar team from Microsoft Research to Microsoft Office, right?
Yes, this is correct.
Well, we're going to talk about that in a minute.
But first, I always like to get a general sense of what my guests do for a living
and why. Sort of in broad strokes, what are the problems you're trying to solve in general and
what gets you up in the morning, makes you want to come to work? I want to do innovation. I think
this is where I have my background and talent. So I am completely irreverent to the established wisdom. And I like to go and try to solve problems.
And since I want to change things, I want to have an impact in terms of change,
then I pick the problem and try to reformulate it and solve it in a different way or change the question.
I think this is usually the best way to have an impact.
So that irreverence, is that something you've had since you were young?
Is it part of your sort of DNA?
Yes.
And the issue there was that I was never really that good in classrooms.
And I always somehow misunderstood the question, answered a different question.
And since I didn't do my
homework very well, I never knew what the standard answer was. And so I kept changing the problem.
And that was not very successful in class. But when I moved to research, then changing the question
was actually part of the job. And so I got far more successful after I got past the scholar program.
That's actually hilarious because the school system rewards people who get answers
right but over in research, you want to turn things on their head.
Yes, that's right.
I mean, changing the question is far more useful in research than coming up with a different
answer or slightly better answer to an existing question.
You know, that's just a perfect setup for this podcast
because the entire topic is turning a particular discipline
or field on its head a little bit.
So let's set the stage for our conversation today
by operationalizing the term machine teaching.
Some people have said it's just another way of saying machine learning,
but you characterize it as basically the next new field for both programming and machine learning.
So tell us, how is machine teaching a new paradigm?
And what's so different about it that it qualifies as a new field?
Okay, so I'm going to characterize machine learning as extracting knowledge from data.
Machine teaching is different,
and the problem it tries to solve is, what if you start and you have no data? And then the task is
about extracting knowledge from the teacher. And this is very similar to programming. Programming
is about extracting knowledge from the programmer. So this is where the two fields are very close.
And it's a very different paradigm, because now it's all about expressivity, recognizing what
the teacher meant. And because you focus on the teacher, this is why HCI is so important. HCI is
human-computer interaction. And so programming and teaching are absolutely the epitome of
human communicating with computers. Listen, I want to ask, because I'm still a little fuzzy,
when you say you have no data, but you have a teacher,
is this teacher a human?
Is this teacher a person?
Yes, yes, yes.
So explain what that looks like with no data.
Is the teacher giving the data to the machine?
Yes.
So let me give you a simple example.
Good.
Imagine I want to teach you how to buy a car.
So I want to give you my personal feeling for how you buy a good car.
So I could bring you to the parking lot and point to good cars and bad cars.
And at some point I may ask you, what is a good car?
And you may say, oh, it's all the cars for which the second digit of the license plate is even.
And that may fit the
data perfectly. And obviously, this is not what I expected. But this is not the way we do it human
to human. So the way we do it human to human is I will tell you that you should look at the price,
you should look at the crash test, you should look at the gas mileage, maybe you should buy electric.
And these are features. They are what questions to
ask to basically have the right answer about what a good car and a bad car are. And that's very
different. It's a little bit like Socrates teaching by asking the right question, as opposed to
enumerating positive and negative for the task. So when human teach other human, they teach in a
very different way than they teach a machine. Now, if you have millions and millions of labels,
then the task is about extracting the knowledge from that data. But if you start with no data,
then you find out that labels are not efficient at all. And this is not the way humans teach
other humans. So there must be another language. And the other language is what I call machine teaching.
This is like a programming language.
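To make the license-plate point concrete, here is a minimal sketch of how a meaningless rule can fit a handful of labels perfectly while a feature-based rule captures what the teacher meant. The cars, the price and crash-test thresholds, and the function names are all hypothetical, chosen only for illustration.

```python
from typing import NamedTuple


class Car(NamedTuple):
    plate: str        # license plate, digits only for simplicity
    price: int        # price in dollars
    crash_stars: int  # crash-test rating, 1 to 5
    good: bool        # the teacher's label


# A tiny "parking lot" of labeled examples (made-up data).
labeled_cars = [
    Car("1234", 18000, 5, True),
    Car("5410", 22000, 4, True),
    Car("9372", 60000, 5, False),
    Car("2750", 15000, 2, False),
]


def plate_rule(car):
    """Spurious hypothesis: a car is good if the second plate digit is even."""
    return int(car.plate[1]) % 2 == 0


def teacher_rule(car):
    """Hypothesis built from the teacher's features (thresholds are assumptions)."""
    return car.price <= 30000 and car.crash_stars >= 4


def fits(rule, cars):
    """True if the rule reproduces every label in the list."""
    return all(rule(car) == car.good for car in cars)


# Both rules fit the four labeled cars equally well.
print(fits(plate_rule, labeled_cars), fits(teacher_rule, labeled_cars))

# They disagree on an expensive, unsafe car whose plate happens to match.
new_car = Car("8641", 75000, 1, False)
print(plate_rule(new_car), teacher_rule(new_car))
```

With only labels, nothing distinguishes the two rules; naming the features (price, crash test) is what rules out the license-plate explanation.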
And just to give you an idea of how natural it is,
what I see happen over and over in industry
is that when people want to build a new machine learning model,
they start by collecting a whole bunch of data.
They write labeling directives, and then they outsource it, and then they get back their 50,000 labels, and then
they have a machine learning algorithm try to extract that knowledge from those
labels. But this is ironic because the labeling directives contain all the
information to do the labeling. So imagine now that the labeling directives
could be inputted directly into the system. Now, when you look at
the labeling directives, they are features. They are saying, oh, this is a cooking recipe because
it has a list of ingredients. So if we can make that the teaching language, then we can skip the
middleman and get the machine to do the right thing. That's exactly the word I was going to
use is the middleman of the labelers, right? Drilling in a little, teachers are typically more expensive in terms of hours and so on. So
what's the business model here? Except for the fact that you're missing the middleman,
which usually marks up the price. How is it more efficient or less expensive?
Okay. So this is exactly what happened with programming.
At first, the programmers were scientists
that would program in assembly code.
And the name of the game in those days was performance,
and the biggest machine you could get
and the fastest machine you could get.
And over the years, the field has
evolved to allow more and more people to program.
And the cost became really the programmer.
And so we wanted to scale with the number of programmers.
This was The Mythical Man-Month and, you know, how to reduce the cost of programmers,
how to scale a single task to multiple programmers.
And if you look at the literature for programming,
it moved from a literature of performance to a literature of productivity. And I believe that machine learning is still a literature of performance.
Generalization is everything. And if you have a lot of data, this is the right thing. Basically,
what makes the difference is the machine learning algorithm and how many GPUs you're going to
put on it. And this is what deep learning is. And I've worked in that field for many
years and I absolutely love that game, but I believe that we are changing. We are at a turning point
where productivity and the teacher's time becomes more and more important. And for custom problems,
you don't have a choice. You cannot extract the knowledge from the data because you don't have
the data. It's too custom. Or maybe it changes too fast. And in that case, the more efficient way
to communicate the knowledge is through features,
through schema, through other constraint.
And I'm not sure we yet know what language it will be.
It will still evolve. As a former teacher myself, albeit teaching teens, not machines, I'm intrigued by your
framework of what we in education call decomposition or deconstructing down into
smaller concepts to help people understand and then scaffolding or using building blocks
to support
underlying knowledge to build up for future learning. Talk about how those concepts would
transfer from human teaching to machine teaching. So in philosophy, there's a movement called
behaviorist that says that with stimuli response, you can teach everything. And of course you can't,
you won't be able to learn very complex things if you just do stimuli response. Well, in machine learning, I find that very often machine learning
experts are what I would call machine learning behaviorists. And basically they believe that
with a very large set of input-label pairs, they can teach anything. And it turns out...
To a machine.
To a machine, right. And if you have a tremendous amount of labels
and you have a tremendous amount of computation,
you can go very far.
But there are tasks that you will never be able to do.
If I were to give you a scenario and ask you to write a book,
you could fill buildings of scenario garbage and scenario book.
You will not be able to learn that task.
The space of functions that are bad is too big
compared to the space of function that actually fulfills the desired goal.
And because of that, there's absolutely no way you can select the right function from the original space in a time that's less than the length of the universe or something.
But strangely enough, everyone in this building can perform that task.
So we must have learned it somehow.
Interesting, yeah.
Right? And the way we've learned it is we learn about characters, we learn about words,
we learn about sentences, we learn about paragraphs, we learn about chapters,
but we also learn about tenses, we learn about metaphors, we learn about character development,
we learn about…
Sarcasm.
Sarcasm, right? So with all these things that we've learned in terms of skills,
we were able to go to the next stage
and learn new skills on top of the previous skills.
And in machine learning, there's this thing that we call
the hypothesis space, which is the space of functions
from which we're looking to find the right function.
If the hypothesis space is too big,
then it's too hard to filter it down to get the right function.
But if the space
of function is small enough, then with a few labeled examples, you can actually find the
right function. And the decomposition allows you to break the problem into this smaller subspace.
And once you've learned the subtask, then you can compose it and keep going and build on
top of previous skills. And now you can do very complex tasks,
even though each of the subtasks was simple. So it is really more similar to how
humans teach humans than the traditional model of how we teach machines to learn.
Yes. And decomposition is also the hallmark of programming. So the art of programming is
decomposition. And if you don't get it right,
you refactor and we've developed all these design patterns to do programming right. And I believe
that there will be a complete one-to-one correspondence. There will be the design
patterns, the teaching language will correspond to the programming language. I will even say that
what corresponds to the assembly language is the machine learning models, and they should be interchangeable.
And if they are interchangeable, then the teachers are interchangeable, which is exactly what you want in terms of productivity, because you know how it works.
The person that starts the project may not be the person that ends the project or maintains the project.
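The hypothesis-space argument above can be illustrated with a toy count. This is a sketch under stated assumptions, not anything from the episode: the inputs are three binary features, the "large" space is every boolean function on them, the "small" space contains only functions of the form "the label equals feature i", and the target concept is made up.

```python
from itertools import product

# All 8 possible inputs made of three binary features.
INPUTS = list(product([0, 1], repeat=3))


def target(x):
    """Hypothetical concept the teacher has in mind: feature 0 is on."""
    return x[0]


# Large hypothesis space: every truth table on 3 bits (2**8 = 256 functions).
large_space = [dict(zip(INPUTS, outputs)) for outputs in product([0, 1], repeat=8)]

# Small hypothesis space: "the label equals feature i" for i in 0..2.
small_space = [{x: x[i] for x in INPUTS} for i in range(3)]


def consistent(space, labeled):
    """Return the hypotheses that agree with every labeled example."""
    return [h for h in space if all(h[x] == y for x, y in labeled)]


# Two labeled examples supplied by the teacher.
labeled = [(x, target(x)) for x in [(1, 0, 0), (0, 1, 1)]]

remaining_large = len(consistent(large_space, labeled))
remaining_small = len(consistent(small_space, labeled))
print(remaining_large, remaining_small)
```

Two labels leave 64 candidates in the large space but pin down a single function in the small one, which is the sense in which decomposition into small subspaces makes a few labeled examples sufficient.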
There's been a lot of talk within the computer science community about software 2.0 and the democratization of skills that usually require a significant amount of expertise.
But you've suggested that those terms don't do full justice to the kinds of things you're trying to describe or define.
So tell us why.
So if you say software 2.0, it's an evolution thing.
Right.
And I believe that we need something far more radical
in the way we view the problem.
So to me, software is something that's intentional.
Very often, people in software 2.0,
they think about dealing with large amount of label data.
And label data can be collected in a non-intentional way.
So for instance, for click prediction,
the label data is whether you got a click
or you didn't get a click, and you collect that automatically.
So it's not really intentional.
When you write a program, it's very intentional.
You decompose it.
You write each of the function with a purpose, right?
And I think when you teach and you decompose the problem into subpart, the labels are intentional, and they are costly because they come from a human.
And now we need to manage that.
So if I decompose the problem, I want to be able to share the part that I've decomposed. And now
if I'm going to share it, I need to version it. And this is a discipline that is very well known
in programming. So how do we manage as efficiently as possible knowledge that was created intentionally?
And now it will need to be maintained.
It will need to be versioned.
It will need to be shared.
It will need to be specified.
And now all the problems of sharing software will translate to sharing and maintaining models that are taught by humans.
So the parallel is very, very direct.
What about this word democratization?
You and I talked before, and I actually inserted the word into our conversation.
You go, I don't really like that word.
Yes.
So I started using that word at the beginning.
And I felt like the problem with the word is that everyone wants to democratize.
And I want machine teaching to be more ambitious.
So let's think about the guarantee you have when you program.
If I ask you to code a function and you're a programmer,
you will tell me, yes, I can do it.
And then I'll say, well, how long is it going to take?
And you're going to say, let's say three months.
And your estimate may be off.
People say that we're usually off by a factor of three.
But you were able to know that it was possible.
And you can actually even specify the performance of your program.
And you can say how long it's going to take.
Right now, we don't have that kind of guarantee when we talk about machine learning.
And the strange thing is that we have all these tools of structural risk minimization that gives us guarantee on the accuracy given
some distribution, but we don't really have guarantee on what the final performance is
going to be before we have the data.
And yet, in programming, we can have those guarantees.
So what's different?
Right?
So we have to think about the problem of teaching differently.
And if you start thinking about the problem of teaching in terms of decomposition, then you'll be able to reason
exactly the same way that you reason for programming.
Programming.
We actually do this when you teach human, right?
Yeah.
So if you wanted to teach a concept to a person,
you would say, okay, this person doesn't know about this
and this and this, so I'm going to first have to teach
those sub-skills, and then I'll be able to build
on top of these skills.
So the task of decomposition is a task that's very important for teaching.
We humans do it all the time.
We programmers do it all the time.
The teacher has spent a lot of time decomposing the problem into these sub-field and sub-skills.
And then laying the foundation.
Laying the foundation and building on top of the foundation and testing at each
level whether you got the skill right.
And that testing is exactly what I want to make a discipline in machine teaching.
Which is just, you know, music to my ears as a former teacher.
I don't think you ever stop being a teacher.
I said, they'll know I'm dead when the red pen falls out of my hand.
Well, let's get a little more granular and specific.
So you've developed a machine
learning tool called PICL, which is spelled P-I-C-L, not like the pickle that you put on a
burger. And it stands for Platform for Interactive Concept Learning. And it allows people, this is
key, it allows people without ML expertise to build ML classifiers and extractors. Yes. Tell
us more about PICL.
What is it, how does it work, and why is it cool?
Okay, so I believe that if you want to do machine teaching,
you need a few elements.
And if it's okay, I'm going to describe the three elements that I believe are essential.
It is okay.
All right, so the first thing is that
you're going to need a machine learning algorithm,
and the machine learning algorithm has to be able to
find the right function from the hypothesis space
if there is such a function that fits the data.
So that's the first requirement.
And if we have that requirement, we can even interchange machine learning algorithms.
It's not super important which one we're going to use.
The second element I call teaching completeness.
And what I mean by that is that if the teacher can distinguish
two examples of two different classes,
you should be able to give the machine a feature that
will distinguish these two classes.
At the same time, you need to be able to compose functions
in a way that you can always bring the function that you
want into the hypothesis space.
Now, it may take several iterations.
You may have to create sub-models
that are fairly simple or complex, you can always decompose, but eventually you have to be able to
bring the function that you want in the hypothesis space. And if that's possible, then I call the
system teaching complete. The last thing is that you need to have access to an almost inexhaustible pool of useful examples.
Imagine I want to build a classifier for gardening, and I decide that the subconcept
botanical garden is important to decide whether it's gardening or not. Maybe if it's gardening,
it talks about plants, but if it's botanical garden, I say, well, it's more about entertainment
than gardening, but I need to be able to distinguish this. So now I need a subclassifier to decide whether
it's botanical garden or not, and for that I need to find examples for this. So if my sampling space
has an infinite supply of all the subconcepts that I may ever learn, then basically I have the
ability to find all the examples that I need that are relevant to the task that I want.
So people tend to think that, oh, but this is a very hard requirement because how do you get an infinite supply of data?
But here is the key is that that data doesn't need to be labeled.
Because if I have access to this pool of unlabeled data, I can query it.
I can use my classifier to search it. I can combine multiple
classifiers and say, well, this classifier has to be very positive and this one has to be very
negative. And I can sample with that range, right? So if I have the right tool to discover the
example that I need, I'm good enough. So these are the three requirements. And if you have these
three requirements, then it's all about finding the right combination of labels, features, decomposition, so that you can achieve your task.
And this is what PICL does for text.
So right now, PICL only works on text.
And so in PICL, you can take documents and classify them, but you can also extract schema from the document.
So you can find addresses, you can find the job history from
a resume, you can find menus, you can extract product, you can extract quotes from email,
all these things that humans can do very easily. So this is the vision of machine teaching is that
anything you can teach to a non-expert human, you should be able to teach to a machine. Right.
And hopefully, we have the language
to do that easily.
Now, let's be honest.
Not all humans are good teachers.
And I believe that not all humans
will be good machine teachers.
It takes some education and familiarity
with the tool, PICL.
But hopefully, we can get better at this.
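The third requirement, querying an unlabeled pool by combining classifier scores, might look roughly like the sketch below. It is not PICL's API: the keyword-based scorers, the function names, the thresholds, and the four-document pool are all assumptions made for illustration.

```python
def keyword_score(doc, keywords):
    """Stand-in classifier: fraction of the given keywords present in the document."""
    words = set(doc.lower().split())
    return len(words & keywords) / len(keywords)


def gardening_score(doc):
    """Hypothetical 'gardening' classifier."""
    return keyword_score(doc, {"plants", "soil", "pruning", "seeds"})


def botanical_garden_score(doc):
    """Hypothetical 'botanical garden' subclassifier."""
    return keyword_score(doc, {"botanical", "tickets", "visitors", "exhibit"})


def query_pool(pool, min_gardening, max_botanical):
    """Select unlabeled documents that score high on one classifier and low on the other."""
    return [
        doc
        for doc in pool
        if gardening_score(doc) >= min_gardening
        and botanical_garden_score(doc) <= max_botanical
    ]


# Unlabeled pool (made-up documents).
pool = [
    "pruning plants and sowing seeds in rich soil",
    "botanical garden exhibit draws visitors who admire rare plants",
    "tickets for the botanical exhibit sold out to visitors",
    "stock prices fell sharply today",
]

candidates = query_pool(pool, min_gardening=0.5, max_botanical=0.0)
print(candidates)
```

None of the documents in the pool carries a label; the query only narrows down which ones are worth showing to the teacher next.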
I'm delighted by some of the phrases you use,
and maybe they're not unique to you,
but I find them a little provocative,
and I like that.
One of them is something you call ML litter.
It sounds bad.
What is it and what can be done about it?
Ah, okay.
I'm going to tell you a story that I've seen repeated over and over.
Sure.
So the story goes as follows.
Some program manager decide
they're going to use machine learning
to solve a particular task. So they collect a bunch of data and then they write the labeling directions
and they send that to get labeled. It comes back and they look at it and it's not exactly what they
meant with the labeling direction. So they change it a little bit, they send it back, they do a
couple of iterations, then they finally get a data set that they are happy with. They consult with some machine learning expert who will recommend deep learning or support
vector machine or boosted decision tree, and they will decide what parameters to use, k-fold
validation, k equals five, all these hyperparameters.
And they will have some engineers that will do feature engineering and code some features.
And then they finally build a model
that they are happy with, they deploy it,
and it's a catastrophe because cooking recipes
are confused with numerical recipe
and they're missing important subset of recipes.
So they go back, they do the iteration,
they collect more data, it gets labeled, blah, blah, blah.
And eventually they get a function
that does exactly what they want.
And they're super happy, and it's deployed, and everything is fine.
Six months later, the distribution has changed.
The semantic of what is a recipe, what's not a recipe, or whatever they were trying to
do has changed.
The features that are available are not the same.
There are new features.
Some features are no longer available.
And then they go back and they look for their machine learning expert.
That machine learning expert now is at Facebook or Google or Amazon.
Moved on.
Right.
So you have models that are no longer reproducible, experts that are not the original expert.
And so you can't reproduce the model.
You can't answer the question.
And someone asks, well, can I remove that model?
And you remove the model.
And then suddenly, everyone screams
because they're using that model as a feature for another model.
But they don't even know if that feature is performing to spec.
And so you find all these models that are not defunct.
They're still running.
No one knows whether they're performing to spec.
And no one dare remove them.
And this is what I call ML litter.
And the amount of resources that's wasted on this is enormous.
Now, some people have identified that problem.
There's a famous paper from the Google team
on the technical debt that has identified these problems.
But I haven't seen a lot of solutions for how to deal with this.
Yeah, I was going to ask, what do we do with it?
So I actually think that machine teaching
is bringing the discipline of programming to machine learning.
So something as simple as using source control and including in the source control the data.
Now, I mean the software 2.0, according to my definition, which is the intentional data.
I don't think we need to version data that is collected automatically.
But everything that is created by a human with a purpose should be versioned.
And in the same check-in, in the same group of things that you save together, you should
include the label, the features, the schema, everything that's relevant to reproducing
that model.
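One way to picture a single check-in that covers labels, features, and schema is to derive one version identifier from all of them together. This is only a sketch: the function name, the field names, and the sample recipe data are hypothetical, and a real team would more likely rely on its existing source control system than on a hand-rolled hash.

```python
import hashlib
import json


def checkin_fingerprint(labels, features, schema):
    """Hash everything a human authored for one model into a single version id."""
    payload = json.dumps(
        {"labels": labels, "features": features, "schema": schema},
        sort_keys=True,  # stable ordering so identical content gives an identical hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Intentional, human-created artifacts for a hypothetical recipe classifier.
labels = {"doc-001": "recipe", "doc-002": "not_recipe"}
features = ["has_ingredient_list", "mentions_oven_temperature"]
schema = {"recipe": ["title", "ingredients", "steps"]}

v1 = checkin_fingerprint(labels, features, schema)

# Changing a single label produces a different version of the whole bundle.
relabeled = dict(labels, **{"doc-002": "recipe"})
v2 = checkin_fingerprint(relabeled, features, schema)

print(v1 != v2)
```

Because the identifier covers the whole bundle, a model tagged with it can be traced back to the exact labels, features, and schema that produced it.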
And if you do this and you bring all the discipline about decomposition and documentation and
versioning, then suddenly
that solves that problem. You will always be able to reproduce your models. If you bring all the
discipline and design patterns that we've learned from programming, then I believe that will solve
the problem of ML litter.
I do want to ask you about this interesting program you told me about.
And this program is called the Rotation Program.
And it seems to me like a novel way to ensure what I would call organizational hybrid vigor,
to use an agricultural or animal husbandry term. What is it? Why do you think it's a good thing?
And what results have you seen since you implemented it?
Okay, so Microsoft has this organization called Microsoft Research,
where the primary goal is to move the theory forward, to design new principle and innovate
in all the fields. In the product group, you have very different imperatives in terms of producing
value. And the question is, how do you transfer from one organization to the other? And how do you lubricate producing value and doing innovation?
And this is not a simple answer.
So what I wanted to do is to help the product group with the recruiting of research talent.
I wanted the researchers to learn about the reality of product group.
And you don't want to force people
to do a move that they don't want to make. At the same time, you want to provide the right
incentives. So the rotation program is for people above a certain level, they have the options to
do a rotation in a product group with an interesting constraint, which is that they
cannot come back for two years. And that constraint means
that when they do that jump, they jump in the deep water and they have to basically swim for two years
and then they have the option to come back. And so two years is so that they have two review cycle,
but it's also so that we can hire a postdoc during the same time. So it all fits perfectly.
And what we find out is that sometimes people come back, sometimes they
don't come back. Every time they do it at first, they say, you know, I hold to this contract,
well, like it's my dear life. And after two years, they say, well, you know, I'm totally happy.
They've given me a lot of resources. I'm having a huge amount of impact. And suddenly they
absolutely are not worried about the future. And it's very, very different.
So that's the benefit for the product teams and the person who goes.
What's the benefit for Microsoft Research?
So for Microsoft Research, there are many benefits.
So first, having an impact is always very satisfying.
The people that do the jump become advocates of research and collaborators from the product
group so that you can have both
a theoretical impact in the academic community and have an impact in the real world. Basically,
the world becomes your customer. It creates movement across the org. So it brings fresh mind.
Fresh legs.
Exactly. And so this is good in terms of diversity. It's good in terms of basically stirring the pot a little bit.
You're going into Microsoft Office. Tell us about what prompted that move. What's the goal? Who are you taking
from here? Who are you getting over? What's the deal? Yeah, so I started the machine teaching
effort about seven years ago. And at the time, I had the choice of doing it in a product or doing
it in Microsoft Research. And to be frank, I worried that if it was done in a product group,
it would be hard to protect it from the imperative of delivering value.
Right, immediately.
Immediately, yes.
So I wanted to have a little bit of breathing room.
And so I created this team in Microsoft Research.
And now it's been seven years, and I believe that there's not really
any question of the value. We can actually both deliver value and continue the investment in
innovation, and we can do that almost anywhere in the company. So why Office? So why Office?
We started doing this in Azure, and now the group that started in Azure has moved to Office.
And basically, I'm rejoining a group that I have influenced in the past.
And we are going to do both Azure and Office.
And those are the two main product of Microsoft.
You started here 20 years ago, you said.
And your career has been anything but linear.
You've been back and forth already.
The rotation program is not new to you. So tell us about your path, your journey, your story. All right. So after my PhD,
I started at Bell Labs in a pretty famous group. This is the group of Yann LeCun, Vladimir Vapnik,
Yoshua Bengio, Léon Bottou, and Yoshua and Yann just got the Turing Award. So I stayed there for eight years.
And then I came to Microsoft Research.
And when I moved to Microsoft,
I did something very strange.
I looked at the address book
and I looked at everyone that had the title architect,
scheduled one-on-ones
and tried to find out what problems needed to be solved
in the product group.
It was a very bold kind of move.
And I started creating relationships. And after a
while, I had two groups that were providing me with engineers to help them because they wanted
to have more of my bandwidth. And I told them, well, the best way to have more of my bandwidth
is to provide me with engineers, and then I will help with your product. So that's how I started.
And then I thought that, you know, Microsoft is the
document company because at the time, 95% of all documents were created on Microsoft software and
we didn't have a document group. So I said, we should have a document group. And then the answer
came back and said, yeah, you should create it. And I said, well, I'm a machine learning person.
I'm not a document person. But after six months, I thought it makes no sense.
So I created that group.
And then more groups were put under me.
And eventually I was asked to start Live Labs Research.
And this is when I left Microsoft Research and created Live Labs Research under Gary Flake, who was basically creating Live Labs.
So I created this team.
And I moved to AdCenter
as a group program manager.
And I can tell you, I am not qualified for that position.
And it was really crazy.
And after six months, I sort of fired myself.
I said, I cannot do this.
And I became a chief scientist again.
But after three more years of that,
I decided to come back to Microsoft Research
to do machine teaching.
And now I'm about to go again out of Microsoft Research to try to have an impact.
All right. Machine learning is full of promise, and machine teaching seems to be a promising direction therein.
So we might call everything we've talked about up to now, what could possibly go right.
But we always have to ask the what could possibly go
wrong question. At least I do. Not because I'm a pessimist, but because I think we need to think
about unintended consequences and the power of what we're doing. So given the work you do, Patrice,
and its potential to put machine learning tools in the hands of non-experts.
Oh, God. Do you want to go there?
Is there anything that concerns you?
Anything that keeps you up at night? Oh yeah. I like to think of myself as someone that thinks
strategically and I feel like it's kind of my job to imagine everything that can go wrong.
That's good.
Yes. So many things can go wrong. The first thing is, 30 years ago, we had expert systems and, you know, the first definition of AI.
And what happened is that we had this giant system with lots of rules, and we didn't have a good way to decompose the problem into simpler problems, and it didn't work out.
We also have now deep learning, and again,
there's no decomposition, and the complexity is such that we don't
understand what's going on inside. And I think it's far more successful than
where we were 30 years ago, and this is why we have something different today.
And I'm trying to say we should be in between. Before, it was all features, no
labels, and now with deep learning,
it's sort of kind of all labels, no features.
And I'm advocating that we should be in between.
And this is where machine teaching is.
We should express things not just with labels,
but with features. And we should do it in a way that's disciplined
and deliberate like we do for programming.
Okay, what if I'm wrong?
I don't believe this is the case, but of course,
I'm worried that I might be pulling a whole bunch of people in a direction that is not the right
direction. At the same time, to be honest, I really truly believe that this is the way to go.
So I have the fortitude to overcome those doubts, but it's something that always keeps me up at
night. The other question that you're asking is more
philosophical. A lot of people have thought that the key to AI is the learning algorithm.
And I actually don't believe it's the learning algorithm. I think teaching is what makes the
difference. So from a philosophical standpoint, I believe that machine learning algorithm
is almost the easy part,
is the part that you can locally optimize.
Teaching is the part that you have to optimize
at a global level, at a societal level.
And I think that may actually be the key to AI
the same way it was the key to human development.
At the end of every podcast,
I give my guests the proverbial last word.
So here's your chance to give some advice
or inspiration to emerging researchers
who might be interested in teaching machines.
What would you say to them?
Okay.
What I tell people
when they ask me for career advice,
I always tell them, optimize for growth.
So challenge yourself. Don't be afraid of failure. Failure is growth. So that's the general advice for researchers and for
people doing their career. For machine teaching, I believe machine teaching is an incredible field
right now because first, it's at the intersection of three fields that are very different. So when
you are at the intersection of multiple fields, it's usually a very fertile ground for doing all sorts of new things.
I also believe that we are at a phase transition where the field of machine learning, which is
super popular, right, is about to transit to something different. And when you are at the
time where it transits, it's the most exciting thing possible.
So I think now it is a fantastic opportunity to create a new field.
I don't know where it's going to go, but it's very, very exciting.
Patrice Simard, thank you so much for coming on the podcast today.
All right. Thank you. It was my pleasure.
To learn more about Dr. Patrice Simard and the science of teaching machines,
visit microsoft.com/research.