Microsoft Research Podcast - Abstracts: NeurIPS 2024 with Weizhu Chen
Episode Date: December 6, 2024
Next-token prediction trains a language model on all tokens in a sequence. VP Weizhu Chen discusses his team's 2024 NeurIPS paper on how distinguishing between useful and "noisy" tokens in pretraining can improve token efficiency and model performance.
Transcript
Welcome to Abstracts,
a Microsoft Research podcast that puts
the spotlight on world-class research in brief.
I'm Amber Tingle.
In this series, members of the research community at Microsoft give us
a quick snapshot or a podcast abstract of their new and noteworthy papers.
Our guest today is Weizhu Chen.
He is vice president of Microsoft Gen AI and co-author of a paper called Not All Tokens Are What You Need for Pre-Training.
This paper is an oral presentation at the 38th Annual Conference on Neural Information Processing Systems,
also known as NeurIPS, which is happening this week in Vancouver.
We do thank you for joining us today on Abstracts.
Thank you for having me, Amber.
So let's start with a brief overview of your paper.
In a couple sentences, tell us about the problem your research addresses and,
more importantly, why the research community and beyond should know about this work.
So my team at Microsoft GenAI, we are working on model training. One of the things
we do in pretraining, we realized the importance of the data. And we found that
when we look at the data at the level of each token, some tokens
are more important than others.
That's one.
The other one is that some tokens are very, very hard to predict during
the pretraining.
For example, if someone sees the text "Weizhu," what's the next token?
It could be "Chen," or it could be any last name.
It's very hard to predict, and if we force the language model to focus on these hard-to-predict
tokens, it's going to confuse the language model. There are so many
examples like this, for example, the serial number on your UPS package.
So the focus of this paper is to identify which tokens are more
important for the language model to learn, because the other tokens may just be noise.
How can we discriminate between them, which is a good token and which is a noisy token?
Basically, we try to understand the dynamics of the tokens.
How did you conduct this research?
We do a lot of work in model
training, including pretraining and post-training. On the pretraining side,
the most important thing to us is the data. We try to understand how we can leverage
the existing data and how we can create much more data as well. Data is one of the most important things
for building a better foundation model.
So we try to understand how much more we can get from the data.
And an important part of working with the data
is data filtering.
In the previous literature, we do data filtering at the page level.
For example, we build a classifier to decide that this page is more important than that one,
and that page is noise, because there is so much noisy data on
the web, so we keep only the best data for the pretraining
corpus. Then we went further and thought, maybe this is not
fine-grained enough. Even within a page we want to keep,
some tokens are more important than others, and
some tokens are just noise; if you put that data into pretraining, it's going to hurt
the model quality. So that is the motivation we were thinking about.
And what were your major findings?
Our major finding is basically that this works so well.
It's so important that we are able to identify the best tokens from the corpus,
make those available for training, and ask the model during pretraining to ignore the tokens
we don't want to get into the model itself. That is one.
The second thing is that data is still a very important thing.
If you're able to figure out a better way
to build better data, you're most likely able to build a much better foundation model.
The third thing is that this is also connected to a lot of other existing work, like
data synthesis, like distillation, like data filtering.
A lot of these things are already connected together,
and you can associate this work with a lot of the other work we
are doing, like distillation.
You can think about it this way: for this work, we also build a model, what we call the reference model,
to identify which tokens are more important than others, and we
try to understand the discrepancy between the reference model and the running model in their predictions on each token.
So you can also think of it as a kind of distillation from the reference model to
the existing model as well.
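The token-selection step Chen describes, comparing the reference model's per-token loss against the running model's, can be sketched roughly as follows. This is an illustrative sketch, not the paper's released code: the function name `select_tokens`, the `keep_ratio` parameter, and the example loss values are all hypothetical. The idea is that tokens where the running model's loss far exceeds the reference model's are worth learning, while tokens that are hard for both models (like a serial number) are excluded from the loss.

```python
def select_tokens(current_losses, reference_losses, keep_ratio=0.6):
    """Keep the tokens whose excess loss (current minus reference) is
    largest; the rest are masked out of the training loss."""
    excess = [c - r for c, r in zip(current_losses, reference_losses)]
    k = max(1, int(len(excess) * keep_ratio))
    # Indices of the k tokens with the highest excess loss.
    ranked = sorted(range(len(excess)), key=lambda i: excess[i], reverse=True)
    keep = set(ranked[:k])
    # Mask: 1.0 for tokens that contribute to the loss, 0.0 otherwise.
    return [1.0 if i in keep else 0.0 for i in range(len(excess))]

# Token 2 has a high current loss but also a high reference loss
# (hard for everyone, like an unpredictable serial number), so it is
# dropped; tokens the running model can still learn from are kept.
current = [2.1, 0.9, 5.0, 1.8, 0.4]
reference = [0.5, 0.7, 4.9, 0.6, 0.3]
mask = select_tokens(current, reference, keep_ratio=0.6)  # [1.0, 1.0, 0.0, 1.0, 0.0]
```

In a real training loop, the per-token losses would come from a cross-entropy loss computed without reduction, and the mask would weight each token's contribution before averaging.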
Let's talk a little bit about real-world impact.
Who benefits most from this work?
And how significant is this within your discipline
and even downstream for people using applications?
This is very fundamental work,
because, as I shared a little bit before, we build the data, and if we
can build the data much better, we're able to build a much better foundation model.
A better model is able to benefit so many different kinds
of applications.
This is also going to help us build much better small language models,
which we can serve even on the edge side, on the client side, or in coding scenarios.
So we are going to see huge impact from this kind of work if we are
able to benefit from building much better training data.
Are there any unanswered questions or
unsolved problems in this area? What's next on your research agenda?
Yeah, I think that is a very good question.
There are definitely a lot of things
about how to build better data that are still unsolved
in the literature.
Especially because when you do pretraining,
the most important part is the data,
but the data is very limited.
How we can make better use of the existing,
limited data is a big challenge,
because we can increase the model by 10x, but it's super hard to increase the data by
10x, especially when we want high-quality data.
The other direction is, even given the data, how can you identify, as in this work, the
importance of each token to build a much better model?
I think all these things are very connected together.
To me, data is oxygen.
There are still so many things
we are able to do with the data,
whether building small language models
or large models.
Data is oxygen. I love that.
So other than that being a key takeaway, is there one other thing you'd like our listeners to walk away from this conversation knowing?
I would say focus more on this kind of data, and focus more on how we can get more from the data.
It is a very important thing.
And the other thing is,
we are working on something very exciting. Feel free to come join us if you are interested in this area.
Well, Weizhu Chen, thank you for joining us today. We really appreciate it.
Thank you. Thank you for having me.
And thanks to our listeners for tuning in. If you'd like to read the full paper, you may find a link at aka.ms/abstracts.
You can also find the paper on arXiv and on the NeurIPS conference website.
I'm Amber Tingle from Microsoft Research, and we hope you'll join us next time on Abstracts.