Microsoft Research Podcast - Abstracts: December 6, 2023
Episode Date: December 6, 2023

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Xing Xie, a Senior Principal Research Manager at Microsoft Research Asia, joins host Dr. Gretchen Huizinga to discuss "Evaluating General-Purpose AI with Psychometrics." As AI capabilities move from task-specific to more general purpose, the paper explores psychometrics, a subfield of psychology, as an alternative to traditional methods for evaluating model performance and for supporting consistent and reliable systems.

Read the paper: Evaluating General-Purpose AI with Psychometrics
Transcript
Welcome to Abstracts,
a Microsoft Research podcast that puts
the spotlight on world-class research in brief.
I'm Dr. Gretchen Huizinga.
In this series,
members of the research community at Microsoft give us
a quick snapshot or a podcast
abstract of their new and noteworthy papers.
Today, I'm talking to Dr. Xing Xie, a senior principal research manager at Microsoft Research.
Dr. Xie is co-author of a vision paper on large language models called Evaluating General-Purpose AI with Psychometrics,
and you can find a preprint of this paper now on arXiv.
Xing Xie, thank you for joining us on Abstracts.
Yes, thank you.
It's my pleasure to be here.
So in a couple sentences, tell us what issue or problem your research addresses and why
people should care about it.
Yeah, in a sense, actually we are exploring the potential of psychometrics to revolutionize
how we evaluate general purpose AI.
Because AI is advancing at a very rapid pace,
traditional evaluation methods face significant challenges,
especially when it comes to predicting a model's performance
in unfamiliar scenarios.
And these methods also lack a robust mechanism
to assess their own quality.
Additionally, in this paper, we delve
into the complexity of directly applying psychometrics
to this domain and underscore several promising directions for future research.
We believe that this research is of great importance.
As AI continues to be integrated into novel application scenarios, it could have significant
implications for both individuals and society at large.
It's crucial that we ensure
their performance is both consistent and reliable. Okay, so I'm going to drill in a little bit.
In case there are people in our audience who don't understand what psychometrics is, could you explain that a little bit? Yeah, psychometrics could be considered a subdomain of psychology. Basic psychology studies everything about humans, but psychometrics was specifically developed to study how we can better evaluate intelligence. We could also call this general intelligence, but here it's human intelligence. So there are actually a lot of methodologies and approaches for how we develop these kinds of tests and which tests we need to carry out.
Previous AI was designed for specific tasks, like machine translation or summarization. But now, as people are already aware from the progress in big models, in large language models, AI can currently be considered as solving general-purpose tasks. Sometimes we call it few-shot learning, or sometimes zero-shot learning: we don't need to train a model before we bring new tasks to it. So this raises a question: how do we evaluate this kind of general-purpose AI? Because traditionally, we evaluate AI using specific benchmarks, specific datasets, and specific tasks, and this seems unsuitable for this new general-purpose AI.
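(To make the zero-shot versus few-shot distinction concrete, here is a minimal illustrative sketch; it is not from the paper or the episode, and the sentiment task and example prompts are invented purely for illustration.)

```python
# A minimal sketch contrasting zero-shot and few-shot prompting.
# The sentiment-classification task and prompts below are invented for
# illustration; in practice these strings would be sent to whatever large
# language model is being evaluated, with no task-specific training.

# Zero-shot: only an instruction and the new input, no worked examples.
zero_shot_prompt = (
    "Classify the sentiment of this review as positive or negative:\n"
    "Review: 'The battery died after two days.'\n"
    "Sentiment:"
)

# Few-shot: a handful of solved examples precede the new input, so the
# model can pick up the task format without any parameter updates.
few_shot_prompt = (
    "Review: 'Great screen, fast shipping.' Sentiment: positive\n"
    "Review: 'Stopped working after a week.' Sentiment: negative\n"
    "Review: 'The battery died after two days.' Sentiment:"
)

print(zero_shot_prompt)
print()
print(few_shot_prompt)
```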
So how does your approach build on and or differ from what's been done previously in this field?
Yeah, we actually see that a lot of effort has been invested in evaluating the performance of these new large language models. But we see that a significant portion of these evaluations are still test-specific. And also, frankly speaking, they are easily affected by changes; that means even a slight alteration to a test could lead to substantial drops in performance. So our methodology differs from these approaches in that rather than solely testing how AI performs on predetermined tasks, we are keen on evaluating latent constructs, because we believe that pinpointing these latent constructs is very important for forecasting AI's performance in evolving and unfamiliar contexts.
We can use an example like game design. With humans, even if an individual has never worked on game design and it's a whole new task for them, we might still confidently infer their potential if we know they possess the essential latent constructs or abilities that are important for game design, for example, creativity, critical thinking, and communication.
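(As a rough illustration of what measuring a latent construct can look like in practice, here is a minimal sketch of a classical one-factor model; it is not the paper's methodology, and the test-takers, items, scores, and the use of scikit-learn's FactorAnalysis are assumptions made purely for illustration.)

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Made-up scores for six test-takers (rows) on four test items (columns)
# that are all intended to tap the same latent construct, say "creativity".
scores = np.array([
    [0.82, 0.75, 0.90, 0.78],
    [0.40, 0.35, 0.48, 0.42],
    [0.65, 0.70, 0.60, 0.68],
    [0.91, 0.88, 0.95, 0.85],
    [0.55, 0.50, 0.58, 0.52],
    [0.30, 0.28, 0.35, 0.33],
])

# A one-factor model treats each observed item score as a noisy indicator
# of a single underlying ability and estimates that ability per test-taker.
fa = FactorAnalysis(n_components=1, random_state=0)
latent_ability = fa.fit_transform(scores).ravel()

print("Estimated latent ability per test-taker:", np.round(latent_ability, 2))
print("Item loadings on the latent factor:", np.round(fa.components_.ravel(), 2))
```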
So this is a vision paper, and you're making a case for using psychometrics as opposed to traditional benchmarks for assessing AI.
So would you say there was a methodology involved in this as a research paper?
And if so, how did you conduct the research for this?
What was the overview of it?
As you said, this is a vision paper.
So instead of describing a specific methodology, we are collaborating with several experienced
psychometrics researchers.
Collectively, we explore the feasibility of integrating psychometrics into AI evaluation
and decide which concepts are viable and which are not.
In February this year, we hosted a workshop on this topic.
Over the past months, we have engaged in numerous discussions,
and the outcome of these discussions is articulated in this paper. And additionally, we are also in the middle of drafting another paper. In that paper, we will apply the insights from this paper to devise a rigorous methodology for assessing the latent capabilities of these large language models. When you do a regular research paper, you have findings.
And when you did this paper and you workshopped it, what did you come away with in terms of
the possibilities for what you might do on assessing AI with psychometrics?
What were your major findings?
Yeah, our major findings can be divided into two areas.
First, we underscored the significant potential of psychometrics.
This includes exploring how psychometrics can be utilized to enhance predictive accuracy and guarantee test quality.
Second, we also draw attention to the new challenges that arise when directly applying these principles to AI. For instance, test results could be misinterpreted, as assumptions verified for human tests might not necessarily apply to AI.
Furthermore, capabilities that are essential for humans
might not hold the same importance for AI.
Another notable challenge is the lack of a consistently defined population of AI, especially considering their rapid evolution.
But this population is essential for traditional psychometrics.
We need to have a population of humans to verify either the reliability or the validity of a test.
But for AI, this becomes a challenge.
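(To illustrate why a defined population matters for verifying test quality, here is a minimal sketch of Cronbach's alpha, a standard internal-consistency reliability statistic from psychometrics; it is not from the paper, and the test-takers, items, and scores are invented for illustration.)

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Internal-consistency reliability of a test (Cronbach's alpha).

    item_scores: rows are test-takers (the population), columns are items.
    """
    k = item_scores.shape[1]                         # number of items
    item_vars = item_scores.var(axis=0, ddof=1)      # per-item variance
    total_var = item_scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Invented scores for a small "population" of five test-takers on four items.
# The statistic is only defined over such a population, which is exactly the
# difficulty raised here for rapidly evolving AI models.
scores = np.array([
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [5, 5, 4, 4],
    [3, 3, 3, 4],
    [1, 2, 2, 1],
], dtype=float)

print(f"Cronbach's alpha: {cronbach_alpha(scores):.2f}")
```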
Based on those findings, how do you think your work is significant in terms of real-world
impact at this point?
We believe that our approach will signal the start of a new era in the evaluation of general-purpose AI, shifting from earlier task-specific methodologies to a more rigorous scientific method.
Fundamentally, there is an urgent demand to establish a dedicated research domain focusing solely
on AI evaluation.
We believe psychometrics will be at the heart of this domain.
Given AI's expanding role in society and its growing significance as an indispensable
assistant, this evaluation will be crucial. I think one missing part of current AI evaluation is how we can make sure the tests, the benchmarks, or these evaluation methods of AI themselves are scientific.
Actually, previously I used the example of game design. Suppose in the future, as a lot of people are discussing, language model agents, AI agents, could be used not only to write code but also to develop software by collaborating among different agents. Then, what kind of capabilities, or what we call latent constructs, should these AI models have before they can succeed in game design or any other software development? For example, creativity, critical thinking, and communication, because these could be important when there are multiple AI models communicating with each other and checking the outputs of other models. Are there other areas where you could say, hey, this would be a relevant application of having AI evaluated with psychometrics instead of the regular benchmarks because of the generality of intelligence?
We are mostly interested in maybe doing research, because a lot of researchers have started to leverage AI for their own research, not only for writing papers or generating some ideas, but maybe they could use AI models for more tasks in the whole pipeline of research. So this may require AI to have some underlying capabilities, like, as we have said, critical thinking: how AI can define new ideas, how they check whether these ideas are feasible, how they propose creative solutions, and how they work together on research. This could be another domain.
So if there was one thing that you want our listeners to take away from this work, what
would it be?
Yeah, I think the one takeaway I want to say is
we should be aware of the vital importance of AI evaluation.
We are still far from achieving a truly scientific standard.
So we still need to work hard to get that done.
Finally, what unanswered questions or unsolved problems remain in this area?
What's next on your research agenda that you're working on?
Yeah, actually, there are a lot of unanswered questions, as highlighted in the later part of this paper.
Ultimately, our goal is to adapt psychometric theories and techniques to fit AI contexts.
So we have discussed with our collaborators in both AI and psychometrics some examples of how we can develop guidelines, extended theories, and techniques to ensure a rigorous evaluation that prevents misinterpretation, and how we can best evaluate assistant AI and the dynamics of AI-human teaming. This was actually particularly proposed by one of our collaborators in the psychometrics domain. And how do we evaluate the values of general-purpose AI and ensure their alignment with human objectives? And how can we employ semi-automatic methods to develop psychometric theories and tests with the help of general-purpose AI?
That means we use AI to solve this problem by themselves. This is also important because, you know, psychometrics and psychology have developed over hundreds or maybe thousands of years to arrive at all the techniques we have today. But can we shorten that period? Can we allow AI to speed up this development? Would you say there's wide agreement in the AI community
that this is a necessary direction to head? This is only starting. I think there are several papers
discussing how we can apply some part of psychology or some part of psychometrics to AI,
but there's no systematic discussion or thinking along this line. So I don't think
there's agreement, but there are already initial thoughts and initial perspectives shown in the academic community.
Well, Xing Xie,
thanks for joining us today
and to our listeners, thank you for
tuning in. If you're interested in learning
more about this paper, you can find a link
at aka.ms/abstracts, or you can find a preprint of the paper on arXiv.
See you next time on Abstracts.