Microsoft Research Podcast - Abstracts: NeurIPS 2024 with Weizhu Chen
Episode Date: December 6, 2024
Next-token prediction trains a language model on all tokens in a sequence. VP Weizhu Chen discusses his team's 2024 NeurIPS paper on how distinguishing between useful and "noisy" tokens in pretraining can improve token efficiency and model performance.
Transcript
Welcome to Abstracts,
a Microsoft Research podcast that puts
the spotlight on world-class research in brief.
I'm Amber Tingle.
In this series, members of the research community at Microsoft give us
a quick snapshot or a podcast abstract of their new and noteworthy papers.
Our guest today is Weizhu Chen.
He is vice president of Microsoft Gen AI and co-author of a paper called Not All Tokens Are What You Need for Pre-Training.
This paper is an oral presentation at the 38th Annual Conference on Neural Information Processing Systems,
also known as NeurIPS, which is happening this week in Vancouver.
We do thank you for joining us today on Abstracts.
Thank you for having me, Amber.
So let's start with a brief overview of your paper.
In a couple sentences, tell us about the problem your research addresses and,
more importantly, why the research community and beyond should know about this work.
So my team at Microsoft GenAI, we are working on model training. One of the things
we do in pretraining, we realized the importance of the data. And we found that
when we look at the data at the level of each token, some tokens
are more important than others.
That's one.
The other one is that some tokens are very, very hard to predict during
the pretraining.
For example, if someone sees the text "Weizhu," what's the next token?
It could be "Chen," or it could be any last name.
It's very hard to predict, and if we force the language model to focus on these hard-to-predict
tokens, it's going to confuse the language model. There are so many
examples like this, for example, the serial number on your UPS package.
So the focus of this paper is to identify which tokens are more
important for the language model to learn, because the other tokens may just be noise.
How can we discriminate between them, which is a good token and which is a noisy token?
Basically, we try to understand the dynamics of the tokens.
How did you conduct this research?
We do a lot of work in model
training, including pretraining and post-training. On the pretraining side,
the most important thing to us is the data. We try to understand how we can leverage
the existing data and how we can create much more data as well. Data is one of the most important things
for building a better foundation model.
So we try to understand how much more we can get from the data.
And an important part of working with the data
is data filtering.
In the previous literature, we do data filtering at the page level.
For example, we build a classifier to decide that this page is more important than that one,
and that page is noise, because there is so much noisy data on
the web, so we keep only the best data for the pretraining
corpus. Then we went further and thought, maybe this is not
fine-grained enough. Even within a page we want to keep,
some tokens are more important than others, and
some tokens are just noise; if you put that data into pretraining, it's going to hurt
the model quality. So that is the motivation we were thinking about.
And what were your major findings?
Our major finding is basically that this works so well.
It's so important that we are able to identify the best tokens from the corpus,
make those available for training, and ask the model during pretraining to ignore the tokens
we don't want to get into the model itself. That is one.
The second thing is that data is still a very important thing.
If you're able to figure out a better way
to build better data, you're most likely able to build a much better foundation model.
The third thing is that this is also connected to a lot of other existing work, like
data synthesis, like distillation, like data filtering.
A lot of these things are already connected together,
and you can associate this work with a lot of the other work we
are doing, like distillation.
You can think about it this way: for this work, we also build a model, what we call the reference model,
to identify which tokens are more important than others, and we
try to understand the discrepancy between the reference model and the running model in their predictions on each token.
So you can also think of it as a kind of distillation from the reference model to
the existing model as well.
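The token-selection step Chen describes, comparing the reference model's per-token loss against the running model's, can be sketched roughly as follows. This is an illustrative sketch, not the paper's released code: the function name `select_tokens`, the `keep_ratio` parameter, and the example loss values are all hypothetical. The idea is that tokens where the running model's loss far exceeds the reference model's are worth learning, while tokens that are hard for both models (like a serial number) are excluded from the loss.

```python
def select_tokens(current_losses, reference_losses, keep_ratio=0.6):
    """Keep the tokens whose excess loss (current minus reference) is
    largest; the rest are masked out of the training loss."""
    excess = [c - r for c, r in zip(current_losses, reference_losses)]
    k = max(1, int(len(excess) * keep_ratio))
    # Indices of the k tokens with the highest excess loss.
    ranked = sorted(range(len(excess)), key=lambda i: excess[i], reverse=True)
    keep = set(ranked[:k])
    # Mask: 1.0 for tokens that contribute to the loss, 0.0 otherwise.
    return [1.0 if i in keep else 0.0 for i in range(len(excess))]

# Token 2 has a high current loss but also a high reference loss
# (hard for everyone, like an unpredictable serial number), so it is
# dropped; tokens the running model can still learn from are kept.
current = [2.1, 0.9, 5.0, 1.8, 0.4]
reference = [0.5, 0.7, 4.9, 0.6, 0.3]
mask = select_tokens(current, reference, keep_ratio=0.6)  # [1.0, 1.0, 0.0, 1.0, 0.0]
```

In a real training loop, the per-token losses would come from a cross-entropy loss computed without reduction, and the mask would weight each token's contribution before averaging.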
Let's talk a little bit about real-world impact.
Who benefits most from this work?
And how significant is this within your discipline
and even downstream for people using applications?
This is very fundamental work,
because, as I shared a little bit before, we build the data, and if we
can build the data much better, we're able to build a much better foundation model.
A better model is able to benefit so many different kinds
of applications.
This is also going to help us build much better small language models,
which we can serve even on the edge side, on the client side, or in coding scenarios.
So we are going to see huge impact from this kind of work if we are
able to benefit from building much better training data.
Are there any unanswered questions or
unsolved problems in this area? What's next on your research agenda?
Yeah, I think that is a very good question.
There are definitely a lot of things
about how to build better data that are still unsolved
in the literature.
Especially because when you do pretraining,
the most important part is the data,
but the data is very limited.
How we can make better use of the existing,
limited data is a big challenge,
because we can increase the model by 10x, but it's super hard to increase the data by
10x, especially when we want high-quality data.
The other direction is, even given the data, how can you identify, as in this work, the
importance of each token to build a much better model?
I think all these things are very connected together.
To me, data is oxygen.
There are still so many things
we are able to do with the data,
whether building small language models
or large models.
Data is oxygen. I love that.
So other than that being a key takeaway, is there one other thing you'd like our listeners to walk away from this conversation knowing?
I would say focus more on this kind of data, and focus more on how we can get more from the data.
It is a very important thing.
And the other thing is,
we are working on something very exciting. Feel free to come join us if you are interested in this area.
Well, Weizhu Chen, thank you for joining us today. We really appreciate it.
Thank you. Thank you for having me.
And thanks to our listeners for tuning in. If you'd like to read the full paper, you may find a link at aka.ms/abstracts.
You can also find the paper on arXiv and on the NeurIPS conference website.
I'm Amber Tingle from Microsoft Research, and we hope you'll join us next time on Abstracts.