Microsoft Research Podcast - Abstracts: July 29, 2024
Episode Date: July 29, 2024
A lack of appropriate data, decreased model performance, and other obstacles have made it difficult to expand the input language models can receive. Li Lyna Zhang introduces LongRoPE, a method capable of extending context windows to more than 2 million tokens.
Transcript
Welcome to Abstracts,
a Microsoft Research podcast that puts
the spotlight on world-class research in brief.
I'm Dr. Gretchen Huizinga.
In this series,
members of the research community at Microsoft give us
a quick snapshot or a podcast abstract
of their new and noteworthy papers.
My guest today is Dr. Li Lyna Zhang, a senior researcher at Microsoft Research.
Dr. Zhang is co-author of a paper called LongRoPE:
Extending LLM Context Window Beyond 2 Million Tokens.
This paper was featured at this year's International Conference on Machine Learning, or ICML.
Li, thanks so much for joining us today on Abstracts.
Thank you for having me.
So let's start with a brief overview of your paper.
Tell us about the issue your research addresses and why it matters.
Okay, so this paper is about how to effectively extend the context window of large language models beyond 2 million tokens.
Why is this important?
Because enabling longer input contexts can improve LLM capabilities.
Right now, some LLMs can only handle a limited context window of 4K tokens,
which is about 10 pages in a book.
With our method, we can push the LLM context window to over 2 million tokens.
That means you can put all seven Harry Potter books into the LLM and ask any question about
the story.
Another important thing is that our method is super-efficient. It requires minimal
changes to the LLM architecture, and most existing optimizations can be reused. Therefore,
our method can be easily applied in real production.
So it sounds like what you're working on is improving the memory span of artificial intelligence or large language models.
So what's already been done in this field, and what unique contributions does your work bring?
Well, there has been a lot of work on building long-context LLMs.
For example, pre-training with an efficient model architecture, using RAG, and extending the context window
with RoPE positional interpolation.
Our approach uses the last technique.
Let me briefly explain it.
RoPE stands for Rotary Position Embedding, which encodes token position information
for transformer models. When we pre-train an LLM, we set a context window size, and all token positions have
a predefined range of RoPE values.
Extending to a longer context window introduces new token positions that can be outside this
predefined range, thus leading to out-of-distribution issues and making
fine-tuning difficult.
RoPE positional interpolation solves this by downscaling position embeddings to fit within
the pre-trained range.
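[Editor's note: for readers who want to see the idea in code, here is a minimal Python sketch of RoPE with uniform positional interpolation. It is an illustration under simple assumptions, not the paper's implementation; the function names, shapes, and the single linear rescaling factor are made up for demonstration.]

import torch

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    # scale = 1.0 reproduces the pre-training RoPE; scale < 1.0 downscales
    # positions so an extended context fits inside the pre-trained position range.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return (positions.float() * scale)[:, None] * inv_freq[None, :]

def apply_rope(x, positions, scale=1.0):
    # x: (seq_len, dim) query or key vectors; rotate each even/odd pair of features.
    angles = rope_angles(positions, x.shape[-1], scale=scale)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Uniform positional interpolation: extend a model pre-trained at 4K to 8K by
# squeezing positions 0..8191 into the pre-trained 0..4095 range.
queries = torch.randn(8192, 64)
rotated = apply_rope(queries, torch.arange(8192), scale=4096 / 8192)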
However, position embeddings like RoPE exhibit non-uniform information entropy
in transformer models.
Existing approaches do not effectively handle
these non-uniformities during RoPE interpolation,
leading to information loss
and limiting the context window size.
Our method addresses this challenge.
Therefore, it can achieve the longest context window
size.
Okay, so Li, how would you describe the methodology you used for this work, and how did you go
about conducting the research?
Okay, so our method interpolates the RoPE positional embedding.
It has three main steps.
First, we introduce an efficient evolutionary search algorithm to perform non-uniform RoPE
positional interpolation.
Second, we propose a progressive context window extension strategy.
It begins by searching for a 256K length on the pre-trained LLM and fine-tuning it at
this length. Then, based on the fine-tuned
256K LLM, we do a second search for new RoPE interpolations to achieve a 2048K context window
size. Finally, since long-context LLMs drop performance at their original context window, we readjust the non-uniform positional interpolation at a 4K length
to recover the short context window performance.
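[Editor's note: the sketch below shows what an evolutionary search over per-dimension RoPE rescale factors could look like. It is a deliberate simplification, not the released LongRoPE code; the population size, the mutation rule, and the eval_fn you would plug in (for example, perplexity on long documents) are all assumptions.]

import random

def evolutionary_search(num_dims, eval_fn, population=16, generations=10):
    # Each candidate assigns one rescale factor per RoPE dimension, in (0, 1].
    def mutate(factors):
        return [min(1.0, max(0.05, f * random.uniform(0.9, 1.1))) for f in factors]

    pop = [[random.uniform(0.1, 1.0) for _ in range(num_dims)]
           for _ in range(population)]
    best = min(pop, key=eval_fn)  # eval_fn: lower is better, e.g. long-document perplexity
    for _ in range(generations):
        pop = [mutate(best) for _ in range(population)]  # explore around the current best
        challenger = min(pop, key=eval_fn)
        if eval_fn(challenger) < eval_fn(best):
            best = challenger
    return best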
Let's talk about findings.
Tell us how things worked out for you and what you found as a result of your experiments.
Yeah, our study verified two important non-uniformities in LLM context window extension.
We identified that lower RoPE dimensions and initial token positions require less interpolation because they contain crucial and high-frequency information.
Higher RoPE dimensions require more interpolation because they encode sparse and low-frequency information.
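[Editor's note: the snippet below makes the two non-uniformities concrete. The cutoff for the initial positions and the way the scale varies with the dimension index are example values chosen for illustration; in LongRoPE the actual factors come from the search described above.]

def nonuniform_position(position, dim_index, num_dims,
                        extension_ratio=8.0, keep_first=64):
    # Initial token positions carry crucial high-frequency information,
    # so they are left uninterpolated.
    if position < keep_first:
        return float(position)
    # Lower RoPE dimensions get less interpolation (scale close to 1);
    # higher, low-frequency dimensions get more (smaller scale).
    dim_frac = dim_index / max(1, num_dims - 1)
    scale = 1.0 / (1.0 + (extension_ratio - 1.0) * dim_frac)
    return position * scale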
So work in the lab is always interesting, but deployment in real-world settings is often
another story. If everything is successful, Li, who benefits most from your LongRoPE research?
Well, our work significantly improves LLMs' capabilities to handle long contexts in real-world applications, such as long-context retrieval, code debugging, and even multimodal LLM applications.
Moreover, our method achieves this with minimal modifications to the RoPE positional embedding.
Therefore, it can be widely applied to production.
We have integrated LongRoPE
into the Microsoft Phi-3 128K family,
which are the first long-context LLMs in their class.
Before LongRoPE, Phi models had only a 2K context window.
So who is your primary user? I think any users who want to use
long context can be our audience. So it's a wide audience. Yeah, it's a wide audience.
It's about now that I always ask the golden nugget question. If you wanted to leave our
listeners with one key takeaway from this research, what would it be?
Well, if there's one key takeaway from our work, it must be our key finding that non-uniformities in rotary position embedding are crucial for LLM context window extension.
And if you want to build a high-quality long-context LLM, LongRoPE is all you need
to know.
Talk about what's left to do in this field in terms of open questions and outstanding
challenges.
What's next on your research agenda, Li?
So far, there are still a couple of big questions in this field.
First, it's challenging to achieve both strong long and short capabilities at the same time.
Although we have managed to recover some of the short-context performance for long-context LLMs, it has not recovered 100%.
We are trying different approaches to close these gaps. Second, we want to figure out how we can use this long-context LLM to solve more challenging tasks,
and then we can push this model to work harder and smarter for us.
Well, Li Lyna Zhang, thanks for joining us today. And to our listeners, thanks for tuning in.
If you want to read this paper, you can find a link at aka.ms/abstracts, or you can find it on arXiv.
See you next time on Abstracts.