Microsoft Research Podcast - Abstracts: July 29, 2024
Episode Date: July 29, 2024
A lack of appropriate data, decreased model performance, and other obstacles have made it difficult to expand the input language models can receive. Li Lyna Zhang introduces LongRoPE, a method capable of extending context windows to more than 2 million tokens.
Transcript
Welcome to Abstracts,
a Microsoft Research podcast that puts
the spotlight on world-class research in brief.
I'm Dr. Gretchen Huizinga.
In this series,
members of the research community at Microsoft give us
a quick snapshot or a podcast abstract
of their new and noteworthy papers.
My guest today is Dr. Li Lyna Zhang, a senior researcher at Microsoft Research.
Dr. Zhang is co-author of a paper called LongRoPE:
Extending LLM Context Window Beyond 2 Million Tokens.
This paper was featured at this year's International Conference on Machine Learning, or ICML.
Li, thanks so much for joining us today on Abstracts.
Thank you for having me.
So let's start with a brief overview of your paper.
Tell us about the issue your research addresses and why it matters.
Okay, so this paper is about how to effectively extend the context window of large language models beyond 2 million tokens.
Why is this important?
Because enabling longer input contexts can improve LLM capabilities.
Right now, some LLMs can only handle a limited context window of 4K tokens,
which is about 10 pages in a book.
With our method, we can push the LLM context window to over 2 million tokens.
That means you can put all seven Harry Potter books into the LLM and ask any question about
the story.
Another important thing is that our method is super-efficient. It requires minimal
changes to the LLM architecture, and most existing optimizations can be reused. Therefore,
our method can be easily applied in real production.
So it sounds like what you're working on is improving the memory span of artificial intelligence or large language models.
So what's already been done in this field, and what unique contributions does your work bring?
Well, there has been a lot of work on building long-context LLMs.
For example, pre-training with an efficient model architecture, using RAG, and extending the context window
with RoPE positional interpolation.
Our approach uses the last technique.
Let me briefly explain it.
RoPE stands for Rotary Position Embedding, which encodes token position information
for transformer models. When we pre-train an LLM, we set a context window size, and all token positions have
a predefined range of RoPE values.
Extending to a longer context window introduces new token positions that can be outside this
predefined range, thus leading to out-of-distribution issues and making
fine-tuning difficult.
RoPE positional interpolation solves this by downscaling position embeddings to fit within
the pre-trained range.
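[Editor's note: for readers who want to see the idea in code, here is a minimal Python sketch of RoPE with uniform positional interpolation. It is an illustration under simple assumptions, not the paper's implementation; the function names, shapes, and the single linear rescaling factor are made up for demonstration.]

import torch

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    # scale = 1.0 reproduces the pre-training RoPE; scale < 1.0 downscales
    # positions so an extended context fits inside the pre-trained position range.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return (positions.float() * scale)[:, None] * inv_freq[None, :]

def apply_rope(x, positions, scale=1.0):
    # x: (seq_len, dim) query or key vectors; rotate each even/odd pair of features.
    angles = rope_angles(positions, x.shape[-1], scale=scale)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Uniform positional interpolation: extend a model pre-trained at 4K to 8K by
# squeezing positions 0..8191 into the pre-trained 0..4095 range.
queries = torch.randn(8192, 64)
rotated = apply_rope(queries, torch.arange(8192), scale=4096 / 8192)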
However, position embeddings like RoPE exhibit non-uniform information entropy
in transformer models.
Existing approaches do not effectively handle
these non-uniformities during RoPE interpolation,
leading to information loss
and limiting the context window size.
Our method addresses this challenge.
Therefore, it can achieve the longest context window
size.
Okay, so Li, how would you describe the methodology you used for this work, and how did you go
about conducting the research?
Okay, so our method interpolates the RoPE positional embedding.
It has three main steps.
First, we introduce an efficient evolutionary search algorithm to perform non-uniform RoPE
positional interpolation.
Second, we propose a progressive context window extension strategy.
It begins by searching for a 256K length on the pre-trained LLM and fine-tuning it at
this length. Then, based on the fine-tuned
256K LLM, we do a second search for new RoPE interpolations to achieve a 2048K context window
size. Finally, since long-context LLMs drop performance at their original context window, we readjust the non-uniform positional interpolation at a 4K length
to recover the short context window performance.
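[Editor's note: the sketch below shows what an evolutionary search over per-dimension RoPE rescale factors could look like. It is a deliberate simplification, not the released LongRoPE code; the population size, the mutation rule, and the eval_fn you would plug in (for example, perplexity on long documents) are all assumptions.]

import random

def evolutionary_search(num_dims, eval_fn, population=16, generations=10):
    # Each candidate assigns one rescale factor per RoPE dimension, in (0, 1].
    def mutate(factors):
        return [min(1.0, max(0.05, f * random.uniform(0.9, 1.1))) for f in factors]

    pop = [[random.uniform(0.1, 1.0) for _ in range(num_dims)]
           for _ in range(population)]
    best = min(pop, key=eval_fn)  # eval_fn: lower is better, e.g. long-document perplexity
    for _ in range(generations):
        pop = [mutate(best) for _ in range(population)]  # explore around the current best
        challenger = min(pop, key=eval_fn)
        if eval_fn(challenger) < eval_fn(best):
            best = challenger
    return best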
Let's talk about findings.
Tell us how things worked out for you and what you found as a result of your experiments.
Yeah, our study verified two important non-uniformities in LLM context window extension.
We identified that lower RoPE dimensions and initial token positions require less interpolation because they contain crucial and high-frequency information.
Higher RoPE dimensions require more interpolation because they encode sparse and low-frequency information.
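[Editor's note: the snippet below makes the two non-uniformities concrete. The cutoff for the initial positions and the way the scale varies with the dimension index are example values chosen for illustration; in LongRoPE the actual factors come from the search described above.]

def nonuniform_position(position, dim_index, num_dims,
                        extension_ratio=8.0, keep_first=64):
    # Initial token positions carry crucial high-frequency information,
    # so they are left uninterpolated.
    if position < keep_first:
        return float(position)
    # Lower RoPE dimensions get less interpolation (scale close to 1);
    # higher, low-frequency dimensions get more (smaller scale).
    dim_frac = dim_index / max(1, num_dims - 1)
    scale = 1.0 / (1.0 + (extension_ratio - 1.0) * dim_frac)
    return position * scale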
So work in the lab is always interesting, but deployment in real-world settings is often
another story. If everything is successful, Li, who benefits most from your LongRoPE research?
Well, our work significantly improves LLMs' capabilities to handle long contexts in real-world applications, such as long-context retrieval, code debugging, and even multimodal LLM applications.
Moreover, our method achieves this with minimal modifications to the RoPE positional embedding.
Therefore, it can be widely applied to production.
We have integrated LongRoPE
into the Microsoft Phi-3 128K family,
which are the first long-context LLMs in their class.
Before LongRoPE, Phi models had only a 2K context window.
So who is your primary user? I think any users who want to use
long context can be our audience. So it's a wide audience. Yeah, it's a wide audience.
It's about now that I always ask the golden nugget question. If you wanted to leave our
listeners with one key takeaway from this research, what would it be?
Well, if there's one key takeaway from our work, it must be our key finding that non-uniformities in rotary position embedding are crucial for LLM context window extension.
And if you want to build a high-quality long-context LLM, LongRoPE is all you need
to know.
Talk about what's left to do in this field in terms of open questions and outstanding
challenges.
What's next on your research agenda, Li?
So far, there are still a couple of big questions in this field.
First, it's challenging to achieve both strong long and short capabilities at the same time.
Although we have managed to recover some of the short-context performance for long-context LLMs, it has not recovered 100%.
We are trying different approaches to close these gaps. Second, we want to figure out how we can use this long-context LLM to solve more challenging tasks,
and then we can push this model to work harder and smarter for us.
Well, Li Lyna Zhang, thanks for joining us today. And to our listeners, thanks for tuning in.
If you want to read this paper, you can find a link at aka.ms/abstracts, or you can find it on arXiv.
See you next time on Abstracts.