Microsoft Research Podcast - Abstracts: October 9, 2023

Starting point is 00:00:00 Welcome to Abstracts, a Microsoft Research podcast that puts the spotlight on world-class research in brief. I'm Dr. Gretchen Huizenga. In this series, members of the research community at Microsoft give us a quick snapshot or a podcast abstract of their new and noteworthy papers.

Starting point is 00:00:24 Today, I'm talking to Dr. Shen Zhang, a senior researcher at Microsoft Research. Dr. Zhang is co-author of a paper called Universal NER, Targeted Distillation from Large Language Models for Open Named Entity Recognition, and you can read this paper now on Archive. Shen Zhang, thanks for joining us on Abstracts. Thanks for having me. So in a few sentences, give us a brief introduction or overview of the issue or problem that your research addresses and why we should care about it. Sure. Well, our research

Starting point is 00:00:58 addresses the challenge of efficiently replicating the capabilities of large language models for targeted application. Particularly, we focus on NAMNTT recognition or NER. And people should care because this work aims to create more cost-effective and transparent models that can recognize a wide range of NTT types across various domains, which is crucial for knowledge extraction and has numerical practical applications.

Starting point is 00:01:30 So how does your approach, your particular approach, build on or differ from what's been done previously in this field? Well, our approach builds on the idea of instruction tuning, which is used to fine-tune language models to follow human instructions. However, unlike existing work that focuses on tuning models into replicas of large language models in every aspect, we propose a method called mission-focused instruction tuning, where we train a smaller model to specifically excel in a broad application class, such as open information instruction. And in our case study, we focus

Starting point is 00:02:14 on name entity recognition, NER, and we demonstrate how targeted distillation from large language models can maximize the capabilities for this application. At the same time, the smaller model, the student model, also preserves generalizability across different semantic types and domains. This approach differs from previous work also because we emphasize the importance of increasing the diversity of input data and generating more comprehensive coverage of antitypes, which ultimately leads to better performance in the targeted application. Okay, and in the paper you talk about student models trailing the original large language models by large margins in what you call downstream applications.

Starting point is 00:03:03 Give me an example of what downstream application looks like. Yeah, so we here specifically focus on name entity recognition, that is, identifying name entities in the written text. So there's various types of name entities. So the canonical ones like person, geographic location, organization, and the people have various needs. They can go beyond those core screen types. They can go into very fine-grained types like athlete, a politician, and even finer-grained types. And you cannot predefine what types will be considered in your task.

Starting point is 00:03:44 That's why we care about this universal concept of NAM entity recognition. Well, let's talk about methodology for a bit. What kind of research methodology did you use, and how did you conduct this research? We developed a general recipe for targeted distillation from large-language models. And in this case, we applied to OpenNER. And our methodology consists of two main steps, data construction and mission-focused instruction tuning.

Starting point is 00:04:13 For data construction, we sampled inputs from a large corpus across diverse domains. And then we use a large language model, ChatGPT, to annotate anti-dimensions and their associated anti-types in the sampled inputs. This process allowed us to create a data set with a wide coverage of anti-types. For mission-focused instruction tuning,

Starting point is 00:04:37 we fine-tune smaller models using our constructed data set in a conversational style format. For each anti-pe in the output, we transform it into a natural language query and tune the model to generate structured outputs that contain all entities of that type in the input passage. We also incorporate negative sampling to account for antitypes not mentioned in that passage. And besides these two main steps, our research also involved assembling the largest to date and the most diverse NER benchmark for evaluation. We compared the performance of our targeted distillation approach with other state-of-the-art

Starting point is 00:05:20 models to demonstrate the effectiveness of our methodology. Okay. So you talk about NER as a case study, and you had 43 datasets and nine domains. Give me an example of some of those domains that you pulled from. Yeah. So one very, you know, typical domain is like news, right? We read news every day, and the news mentioned about people, events, and location. So that's like a very common domain. And there are other very interesting domains like code.

Starting point is 00:05:54 People also write code. And the computer can understand the code, but the person would also want to understand the code in some different way. So if you have code-specific name entity recognition capability, that would be awesome for some people that want to understand what's happening in the code. Right. And you mentioned programming or code, but I also see

Starting point is 00:06:17 in the paper biomedicine on one kind of complex and academic end and social media on another. So those are wildly different domains that you pulled from. Did you do that for a reason, that spectrum of different kinds of data? Yes. The reason is that, you know, for some high-value domain like biomedicine, it's quite expensive to annotate some data to train a model like that. So traditionally, people will have to hire an expert to do that. That is quite expensive and not scalable.

Starting point is 00:06:59 And here in the universal NIR paper, we propose a way to distill that specific domain knowledge from the large language model. So the whole process is automatic. And the result model, you can see, it does pretty well and maybe equally well on the model that based on, you know, human expert annotated corpus. So after all this, a research paper presents findings. I imagine you had some interesting discoveries in this study. What were your major findings? Yes, our major findings were that the targeted distillation approach, specifically here, the universal NER model we developed,

Starting point is 00:07:39 it achieved state-of-the-art performance in name antirecognition across a wide range of antitypes and domains. And when we compare to other models like APACA, Vicuna, and Instruct-UIE, Universal NER significantly outperformed them in terms of F1 score. This demonstrated the effectiveness of mission-focused instruction tuning for creating more cost-effective and transparent models that can excel in targeted applications,

Starting point is 00:08:06 such as open AR. So let's talk a little bit more about real-world impact. We've already discussed a little bit about that. But how would you say, based on these findings, that this impacts the real world and how people will use this? Yeah, absolutely. I would say our work is very significant in terms of real-world impact. Because first of all, NER is a fundamental task in natural language processing, and it plays a crucial role in knowledge extraction, information retrieval, and data mining. And by developing a more cost-effective and transparent model like Universal NER, which can recognize a wide range of antitypes and domains, we enable better performance in this downstream application. And like I said, this is particularly important in high-value

Starting point is 00:08:59 domains such as biomedicine, where specialized expertise is required for annotation and the new antitypes keep emerging. Our approach can help save time and resources for effectively recognizing these new antitypes without the need for extensive annotated data. And secondly, our work can have a broader impact as it represents a general recipe for targeted distillation from large language models. And this approach can be applied to other application classes, such as open relation extraction. And this allows researchers and the practitioner to create much smaller models that can be

Starting point is 00:09:42 more efficient and transparent while maintaining high performance in their targeted tasks. If there was one thing you want our listeners to take away from this work and you could distill that into a short take, what would it be? One key takeaway from our work is that targeted distillation from large language models using our mission-focused instruction tuning can lead to more cost-effective and transparent models that excel in a broader application class. And our application demonstrates that it is possible to harness the capabilities of large language models and distill them into much smaller models that not only maintain

Starting point is 00:10:27 general liability across semantic types and domains, but also surpass the performance of their larger counterparts in the targeted application. And this opens up new avenues for research and practical application in various fields, making knowledge extractions and natural language processing tasks more efficient and accessible. It sounds very promising, and it sounds like you're excited about it. Yeah, I'm pretty excited. Well then, tell us, given this new vista that you've opened up with this universal NER, what unanswered questions or unsolved problems still remain in this area?

Starting point is 00:11:11 And what's next on your research agenda? Yeah, our work demonstrates the effectiveness of targeted simulation for open NER, but several unanswered questions remain. And I would say the first one is adapting the approach to other application classes. Our method is a general recipe for targeted distillation, and it would be interesting to explore its effectiveness in other broader application classes, such as open relation extraction. And the second one is handling label conflicts and the dataset-specific definition.

Starting point is 00:11:48 So in our work, we propose a dataset-specific instruction tuning template to address label conflicts. But more research is needed to better understand and develop methods for harmonizing discrepancies in label definition across datasets. And the last one is exploring more efficient data construction methods. We use ChatGPT for data construction, but alternative approaches could be explored to generate more diverse and comprehensive datasets for mission-focused instruction tuning.

Starting point is 00:12:26 And as for our research agenda, we plan to continue exploring targeted distillation techniques and apply them to other application classes, as well as investigate ways to improve data construction for better performance and efficiency in real-world tasks. Sounds like you got your work cut out for you. Yes. Shen Zhang, thanks for joining us today. And to our listeners, thanks for tuning in.

Starting point is 00:12:52 If you're interested in learning more about this paper, you can find a link at aka.ms forward slash abstracts, or you can read the paper on Archive. See you next time on Abstracts.

Your Ad Here

Microsoft Research Podcast - Abstracts: October 9, 2023

There aren't comments yet for this episode. Click on any sentence in the transcript to leave a comment.