Microsoft Research Podcast - Abstracts: December 6, 2023

Episode Date: December 6, 2023

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Xing Xie, a Senior Principal Research Manager at Microsoft Research Asia, joins host Dr. Gretchen Huizinga to discuss “Evaluating General-Purpose AI with Psychometrics.” As AI capabilities move from task-specific to more general purpose, the paper explores psychometrics, a subfield of psychology, as an alternative to traditional methods for evaluating model performance and for supporting consistent and reliable systems.

Read the paper: Evaluating General-Purpose AI with Psychometrics

Transcript
Starting point is 00:00:00 Welcome to Abstracts, a Microsoft Research podcast that puts the spotlight on world-class research in brief. I'm Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot or a podcast abstract of their new and noteworthy papers.
Starting point is 00:00:23 Today, I'm talking to Dr. Xing Xie, a senior principal research manager at Microsoft Research. Dr. Xie is co-author of a vision paper on large language models called Evaluating General-Purpose AI with Psychometrics, and you can find a preprint of this paper now on arXiv. Xing Xie, thank you for joining us on Abstracts. Yes, thank you. It's my pleasure to be here. So in a couple of sentences, tell us what issue or problem your research addresses and why people should care about it.
Starting point is 00:00:55 Yeah, in a sense, actually we are exploring the potential of psychometrics to revolutionize how we evaluate general purpose AI. Because AI is advancing at a very rapid pace, traditional evaluation methods face significant challenges, especially when it comes to predicting a model's performance in unfamiliar scenarios. And these methods also lack a robust mechanism to assess their own quality.
Starting point is 00:01:21 Additionally, in this paper, we delve into the complexity of directly applying psychometrics to this domain and underscore several promising directions for future research. We believe that this research is of great importance. As AI continues to be integrated into novel application scenarios, it could have significant implications for both individuals and society at large. It's crucial that we ensure their performance is both consistent and reliable. Okay, so I'm going to drill in a little bit.
Starting point is 00:01:52 In case there's people in our audience that don't understand what psychometrics is, could you explain that a little bit for the audience? Yeah, psychometrics could be considered a subdomain of psychology. Psychology broadly studies everything about humans, but psychometrics was specifically developed to study how we can better evaluate intelligence. We could call this general intelligence,
Starting point is 00:02:17 but it's human intelligence. So there are actually a lot of methodologies and approaches for how we develop this kind of test and what tests we need to carry out. Previous AI was designed for specific tasks, like machine translation or summarization. But now, as people are already aware from the rapid progress in big models, in large language models,
Starting point is 00:02:40 AI can actually be considered as solving some kind of general-purpose tasks. Sometimes we call it few-shot learning, or sometimes we call it zero-shot learning. We don't need to train a model before we bring new tasks to it. So this raises a question: how do we evaluate this kind of general-purpose AI? Because traditionally, we evaluate AI using specific benchmarks, specific datasets, and specific tasks. This seems to be unsuitable for this new general-purpose AI.
Starting point is 00:03:15 So how does your approach build on and/or differ from what's been done previously in this field? Yeah, we actually see that a lot of effort has been invested in evaluating the performance of these new large language models. But we see a significant portion of these evaluations are test-specific. They're still test-specific. And also, frankly speaking, they are easily affected by changes. That means even a slight alteration to a test could lead to substantial drops in performance. So our methodology differs from these approaches
Starting point is 00:03:51 in that rather than solely testing how AI performs on those predetermined tasks, we are keen on evaluating those latent constructs, because we believe that pinpointing these latent constructs is very important. It's important in forecasting AI's performance in evolving and unfamiliar contexts. We can use an example like game design. With humans, even if an individual has never worked on game design, and it's just a whole new task for them, we might still confidently infer their potential if we know they possess the essential latent constructs or abilities which are important for game design,
Starting point is 00:04:32 for example, creativity, critical thinking, and communication. So this is a vision paper, and you're making a case for using psychometrics as opposed to regular traditional benchmarks for assessing AI. So would you say there was a methodology involved in this as a research paper? And if so, how did you conduct the research for this? What was the overview of it? As you said, this is a vision paper. So instead of describing a specific methodology, we are collaborating with several experienced psychometrics researchers.
Starting point is 00:05:05 Collectively, we explore the feasibility of integrating psychometrics into AI evaluation and deciding which concepts are viable and which are not. In February this year, we hosted a workshop on this topic. Over the past months, we have engaged in numerous discussions, and the outcome of these discussions is articulated in this paper. And additionally, actually, we are also in the middle of drafting another paper. In that paper, we will apply the insights from this paper to devise a rigorous methodology for assessing the latent capabilities of the most cutting-edge large language
Starting point is 00:05:43 models. When you do a regular research paper, you have findings. And when you did this paper and you workshopped it, what did you come away with in terms of the possibilities for what you might do on assessing AI with psychometrics? What were your major findings? Yeah, our major findings can be divided into two areas. First, we underscored the significant potential of psychometrics. This includes exploring how these metrics can be utilized to enhance predictive accuracy and guarantee test quality. Second, we also draw attention to the new challenges that arise when directly applying these principles to AI. For instance, test results could be misinterpreted
Starting point is 00:06:27 as assumptions verified for human tests might not necessarily apply to AI. Furthermore, capabilities that are essential for humans might not hold the same importance for AI. Another notable challenge is the lack of a consistently defined population of AI, especially considering their rapid evolution. But this population is essential for traditional psychometrics. We need to have a population of humans to verify either the reliability or the validity of a test.
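To make that point concrete, here is a minimal sketch, not from the paper or the episode, of one classic psychometric reliability measure, Cronbach's alpha. It is defined over variation across a population of test-takers; the score matrix below is invented purely for illustration, with each row standing in for one AI model (or one run of a model) and each column for one test item.

```python
# Minimal illustration (hypothetical data): Cronbach's alpha, a standard
# psychometric reliability estimate, is computed from score variation
# across a *population* of test-takers. Rows = AI models (or runs),
# columns = test items; values are made up for this example.
import numpy as np

scores = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1],
])

k = scores.shape[1]                               # number of test items
item_var_sum = scores.var(axis=0, ddof=1).sum()   # item variances across the population
total_var = scores.sum(axis=1).var(ddof=1)        # variance of total scores across the population
alpha = (k / (k - 1)) * (1 - item_var_sum / total_var)
print(f"Cronbach's alpha: {alpha:.2f}")
```

With a single model and a single run there is no variance to work with, which is one way to see why the lack of a well-defined AI population is a challenge for applying psychometrics directly.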
Starting point is 00:07:00 But for AI, this becomes a challenge. Based on those findings, how do you think your work is significant in terms of real-world impact at this point? We believe that our approach will signal the start of a new era in the evaluation of general-purpose AI, shifting from earlier task-specific methodologies to a more rigorous scientific method. Fundamentally, there is an urgent demand to establish a dedicated research domain focusing solely on AI evaluation. We believe psychometrics will be at the heart of this domain.
Starting point is 00:07:34 Given AI's expanding role in society and its growing significance as an indispensable assistant, this evaluation will be crucial. I think one missing part of current AI evaluation is how we can make sure the tests, the benchmarks, or these evaluation methods themselves are scientific. Actually, previously I used the example of game design. Suppose in the future, as a lot of people are discussing, language model agents, AI agents, could be used not only to write code but also to develop software by collaborating among different agents. Then, what kind of capabilities, or what we call
Starting point is 00:08:18 latent constructs, should these AI models have before they can succeed at game design or any other software development? For example, creativity, critical thinking, and communication, because these could be important when there are multiple AI models communicating with each other and checking the outputs of other models. Are there other areas that you could say, hey, this would be a relevant application of having AI evaluated with psychometrics instead of the regular benchmarks because of the generality of intelligence? We are mostly interested in the research domain itself, because a lot of researchers have started to leverage AI for their own research, for example, not only for writing papers, not only for generating some ideas, but maybe they could use AI models for more tasks in the whole pipeline of research. So this may require AI to have some underlying capabilities, like, as we have
Starting point is 00:09:20 said, critical thinking: how AI can define new ideas, how it checks whether these ideas are feasible, how it proposes creative solutions, and how AI models work together on research. This could be another domain. So if there was one thing that you want our listeners to take away from this work, what would it be? Yeah, I think the one takeaway I want to say is we should be aware of the vital importance of AI evaluation.
Starting point is 00:09:51 We are still far from achieving a truly scientific standard. So we still need to work hard to get that done. Finally, what unanswered questions or unsolved problems remain in this area? What's next on your research agenda that you're working on? Yeah, actually, there are a lot of unanswered questions, as highlighted in the later part of this paper. Ultimately, our goal is to adapt psychometric theories and techniques to fit AI contexts. So we have discussed with our collaborators in both AI and psychometrics some examples of how we can develop guidelines, extended theories,
Starting point is 00:10:32 and techniques to ensure a rigorous evaluation that prevents misinterpretation, and how we can best evaluate assistant AI and the dynamics of AI-human teaming. This actually was particularly proposed by one of our collaborators in the psychometrics domain. And how do we evaluate the values of general-purpose AI and ensure their alignment with human objectives?
Starting point is 00:10:57 And how can we employ semi-automatic methods to develop psychometric theories and tests with the help of general-purpose AI? That means we use AI to solve this problem itself. This is also important because, you know, psychometrics or psychology has developed over hundreds or maybe thousands of years to arrive at all the techniques we have today. But can we shorten that period? Can we allow AI to speed up this development? Would you say there's wide agreement in the AI community that this is a necessary direction to head? This is only starting. I think there are several papers
Starting point is 00:11:35 discussing how we can apply some part of psychology or some part of psychometrics to AI, but there's no systematic discussion or thinking along this line. So I don't think there's agreement, but there are already initial thoughts and initial perspectives shown in the academic community. Well, Xing Xie, thanks for joining us today
Starting point is 00:11:58 and to our listeners, thank you for tuning in. If you're interested in learning more about this paper, you can find a link at aka.ms forward slash abstracts, or you can find a preprint of the paper on arXiv. See you next time on Abstracts.
