Microsoft Research Podcast - Abstracts: October 23, 2023
Episode Date: October 23, 2023

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Andy Gordon, a Partner Research Manager, and Carina Negreanu, a Senior Researcher, both at Microsoft Research, join host Dr. Gretchen Huizinga to discuss “Co-audit: Tools to help humans double-check AI-generated content.” This paper brings together current understanding of generative AI performance to explore the need and context for tools to help people using the technology find and fix mistakes in AI output.
Transcript
Welcome to Abstracts,
a Microsoft Research podcast that puts
the spotlight on world-class research in brief.
I'm Dr. Gretchen Huizinga.
In this series,
members of the research community at Microsoft give us
a quick snapshot or a podcast abstract
of their new and noteworthy papers.
Today, I'm talking to Dr. Andy Gordon, a partner research manager,
and Dr. Carina Negreanu, a senior researcher, both at Microsoft Research.
Drs. Gordon and Negreanu are coauthors of a paper called
“Co-audit: Tools to Help Humans Double-Check AI-Generated Content.”
And you can read a preprint of this paper now on arXiv.
Andy Gordon, Carina Negreanu, thanks for joining us on Abstracts.
Great to be here.
Likewise.
Let's start with you, Andy. In a few sentences, describe the issue or problem your paper addresses
and why people should care about it.
Well, generative AI is amazing. Things like Bing Chat or ChatGPT, all these things powered by large language models.
Totally amazing.
But it's really important for everyone to remember
that these AIs can make mistakes.
For example, you ask when your favorite actor got married
and the model says the year, but gets it wrong.
Or you ask for some Python code
and it works on positive numbers,
but occasionally you give it negative numbers
and it goes wrong. Another example: you get a summary of some text, and it's great, but unfortunately it misses one of the important points. Or, thinking about images, you ask the AI for a portrait of a character, and there's some glitch and it produces a hand with six fingers. So as users, we need to get into the habit of carefully checking AI outputs for mistakes.
And we refer to that as audit in the sense of a systematic review.
Coming to the paper, it's about what we call co-audit.
And that's our term for any tool support that helps the human audit the AI output.
And some examples of co-audit are tools that can help check for hallucinations, like when the actor's date of birth is wrong, or to check Python code to find some errors, or show how a summary has been constructed to help people find errors.
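To make the Python example Gordon mentions concrete, here is a minimal, hypothetical sketch (not from the paper) of the kind of mistake he describes, together with the simple spot-check an audit step would add:

```python
# A hypothetical sketch of AI-generated code that works on positive
# numbers but quietly goes wrong on negative ones.

def digit_sum(n):
    """Sum the decimal digits of n (as an AI might generate it)."""
    total = 0
    while n > 0:          # bug: the loop never runs when n < 0
        total += n % 10
        n //= 10
    return total

# An "audit" step: spot-check outputs beyond the cases the prompt mentioned.
print(digit_sum(123))     # 6, as expected
print(digit_sum(-123))    # 0 -- the check exposes the mistake
```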
Carina, let's talk to you. What related research does this paper build on and how does your work add to it? So there was no direct work on co-audit before us. We're just introducing it. But there has been a lot of research that either motivates the need for co-audit or provides relevant framing for it, or even early examples of what we started thinking of as co-audit. So as you're probably aware, there has been a really great effort in the last few years to assess the quality of generations by large language models across a real multitude of tasks, and currently we use this body of work as motivation for our research. It basically shows there really is a need for this kind of work, and we hope that in future we can also use it to benchmark co-audit tools that we are going to produce in our wider community.
But the idea of dealing with errors has been a key part of research
on human-AI interaction for ages.
And there have been some really cool guidelines that came out recently,
especially from Amershi in 2019 on human-AI interactions
that are concerned with this part of the world.
And more recently, Glassman had a really cool paper
about conversational frameworks for human-AI communication
that basically links these concepts to psychology.
And in our work, as you can read in our paper,
we are trying to basically frame co-audit
within her framework, and we find that it's a natural fit.
But before we started formally defining co-audit and writing this paper, our group had built co-audit tools in the code-generation space. One such tool is GAM, which is grounded abstraction matching, where we basically help users learn how to effectively communicate with large language models so that they both understand what the large language model understands them to be asking and also get good feedback back. We also have ColDeco,
which is a spreadsheet tool for inspecting and verifying calculated columns without the user needing to view the underlying code produced by the large language models. But really, any tool that focuses on debugging or basically getting information back from human-generated content is useful here. So even early debugging tools like FxD are very important here, as we learn how people use these kinds of tools, and we try to basically apply the same concepts in the context of LLM-generated content. So basically, we are building on top of work that helps understand the needs and challenges that end-user programmers have when working in this space and trying to extrapolate them to co-audit tools for LLM-generated content. Well, Andy, how would you describe the research
approach you used or your methodology for this paper? And how did it come about?
Great question, Gretchen. And it was actually quite an unusual methodology for us. So as
Carina says, we'd been looking at co-audit in a very specific setting of spreadsheet computations.
And we began to realize that co-audit was really important for any kind of AI generated output.
And we started to see other people doing research
that was doing the same sort of thing we were doing,
but in different settings.
So for example, there was a paper,
they were generating bits of Python
and they were deliberately showing
multiple pieces of code after they'd been generated
to kind of nudge the human user
to make a decision about which one was better.
I mean, it's really important to get
people to think about the outputs. And this was a nice trick. So we thought, look, this is actually
quite an important problem and MSR should step up and sort of gather people. So we organized a
workshop inside Microsoft in the spring and got folks together to share their perspectives on
co-audit. And then since then, we've reflected
on those discussions and tried to kind of pull them together in a more coherent sense
than the sort of whiteboards and sticky notes that we produced back then. And so that's produced
this paper. I think one of the key things that we learned in that process that we hadn't been
thinking about before was that co-audit really complements prompt engineering. So you hear a lot about prompt engineering, and it's the first part of what
we call the prompt response audit loop. And this is related to what Carina was saying about Elena
Glassman's work about AI human interaction. So the first step is you formulate a prompt.
For example, you ask for Python code.
That's the first step.
The second step is we wait for the response from the AI.
And then the third step is that we need to inspect the response.
That's the audit part.
Decide if it meets our needs or if there is a mistake.
And if that's the case, we need to repeat again.
So that's this loop, the prompt response audit loop.
And prompt engineering, they're the
tools and techniques that you use in that first step to create the prompt. So for example, some
tools will automatically include a data context in a prompt if you're trying to create some Python
to apply to a table in a spreadsheet or something like that. And then, dually, co-audit, those are the tools and techniques we have to help the human audit the response in the third step of this loop. And that's like these tools I've been mentioning that show maybe two or three candidate pieces of code to be used.
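As an illustration of the loop Gordon describes, here is a minimal sketch in Python. All of the function names (build_prompt, call_model, audit_response) are hypothetical placeholders, not an API from the paper; the stubs only exist to make the control flow runnable.

```python
def build_prompt(task, feedback=None):
    # Step 1: prompt engineering -- fold any earlier audit feedback
    # (and, in real tools, data context) into the prompt.
    return task if feedback is None else f"{task}\nPlease fix: {feedback}"

def call_model(prompt):
    # Step 2: stand-in for a call to a large language model.
    return f"<model output for: {prompt!r}>"

def audit_response(response):
    # Step 3: co-audit -- in a real tool, this is where a person,
    # helped by co-audit features, accepts the output or explains
    # what is wrong. This stub simply accepts everything.
    return True, None

def prompt_response_audit_loop(task, max_rounds=3):
    feedback = None
    for _ in range(max_rounds):
        prompt = build_prompt(task, feedback)
        response = call_model(prompt)
        ok, feedback = audit_response(response)
        if ok:
            return response   # the human accepts the output
    return None               # give up after too many rounds

print(prompt_response_audit_loop("Write Python to sum a column"))
```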
Carina, let's move over to what kinds of things you came away with, your takeaways or your findings from this workshop. Talk about that and how you chose to articulate them in the paper. So as part of our research, we found that basically one co-audit tool does not fit all needs, which in a way was great because we have a bigger field to explore, but in other ways it's a bit daunting, as it means you have to think of many things. And one thing that
really came to light is that even though we can't build something that
fits everything, we can build a set of principles that we think are important. So really, we wrote
our paper around those 10 principles that we identified throughout the workshop and are trying to promote them as things people should think about when they start going on the journey of building co-audit tools. So one of the examples is that we really think that
we should think about grounding outputs.
So for example, by citing reliable sources,
similar to what Bing Chat does today,
we think that's a really valuable, important principle
that people should follow and they should think about
what that means in the context of their co-audit tool.
In the case of Bing, it's quite simple, as it's like factual references, but if it becomes referencing code, that becomes more tricky, but still super interesting going forward.
We also propose that co-audit tools should have the capability to prioritize the user's attention to the most likely errors, as we need to be mindful of the user's cognitive effort and have a positive cost-benefit. Basically, if we flood users with different errors and flags, it might be too problematic and adoption might be quite difficult going forward.
And finally, this is something that really comes close to the core of our research area in spreadsheets. It's about thinking beyond text. So we know visuals are so important in how we explain things, in how we teach in schools, in how we teach in universities. So how do we include them in the co-audit process going forward?
I think that's going to be a really interesting challenge.
And we hope we're going to see some interesting work in that space.
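As a concrete illustration of the prioritization principle Negreanu describes, here is a small, hypothetical Python sketch; the flags and likelihood scores are invented for illustration and are not from the paper.

```python
# Rank candidate issues by estimated likelihood of being a real error
# and surface only the top few, to respect the user's cognitive effort
# rather than flooding them with every flag.

candidate_flags = [
    {"cell": "C4", "issue": "formula ignores blank rows",   "likelihood": 0.85},
    {"cell": "D9", "issue": "possible off-by-one in range",  "likelihood": 0.60},
    {"cell": "B2", "issue": "unusual but valid constant",    "likelihood": 0.10},
    {"cell": "E7", "issue": "mixed currency units",          "likelihood": 0.75},
]

def prioritize(flags, budget=3):
    """Show at most `budget` flags, highest estimated likelihood first."""
    return sorted(flags, key=lambda f: f["likelihood"], reverse=True)[:budget]

for flag in prioritize(candidate_flags):
    print(f"{flag['cell']}: {flag['issue']} (likelihood {flag['likelihood']:.0%})")
```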
Yeah. Well, principles are one thing, Andy,
but how does this paper contribute to real world impact?
We talked about that a bit at the beginning.
Who benefits most from this tool?
That is a great question,
Gretchen. And actually, that was a question that we talked about at the workshop. We think that some application areas are going to benefit more than others. So co-audit really matters
when correctness really matters and when mistakes have bad consequences. So in terms of application area, that's areas like maybe finance
or technology development or medicine. But you asked particularly about who, and we think some
people will benefit more from co-audit than others. And we found this really striking example,
I guess it's an anecdotal example that someone was posting on social media.
A professor was teaching a class using generative AI tools for the first time to generate code.
And he found some evidence that people who have low self-confidence with computers can be
intimidated by generative AI. So he would find that some of the class were really confident
users and they would ask it, you know, generate some Python to do such and such. And it would
come back with code with, you know, a bunch of mistakes in it. And the confident users were
happy just to swat that away. They were even a little arrogant about it. Like, this is a stupid
computer, they were saying.
But Gretchen, he found that a lot of his students who were less confident with computers were quite intimidated by this, because it was very confidently just saying, oh, look, all this code
is going to work. And they kind of got a bit stuck. And some of them were scrolling around this code trying to understand how it worked, when in fact it was just really broken. So he thought this was pretty bad, that these able students who were just
less confident were being intimidated and were making less good use of the generative AI. Now
that is an example, that is an anecdote from social media from a reputable professor. But we looked
into it and there are peer-reviewed studies that show a similar effect in the literature. So I'd say we need co-audit tools that will encourage these less confident
users to question when the AI is mistaken rather than getting stuck. And I think otherwise,
they're not going to see the benefits of the generative AI.
Well, Carina, sometimes I like to boil things down to a nugget
or a beautiful takeaway. So if there's one thing you want our listeners to take away from this work,
this paper, what would it be? I think that what this study has taught us is that really we need
significantly more research. So basically a good co-audit experience can really be the element that makes it or breaks it in how we incorporate this technology safely into our day-to-day lives.
But to make this happen, we need people from the field working towards the same goal.
It's really interdisciplinary work, and I don't think we can do it by isolating into groups the way we're researching now.
So I would urge our listeners to think about how they could
contribute in this space and reach out with feedback and questions to us. We are more than
open to collaboration. Really, we are just starting this journey and we'd love to see this area
become a research priority going forward in 2024. Well, Andy, as an opportunity to give some
specificity to Carina's call for help, what potential pitfalls have you already identified that represent ongoing research challenges in this field? And what's next on your and potentially others' research agendas in this field?
Well, one point, and I think Carina made this point, is that co-audit techniques will themselves never be perfect.
I mean, we're saying that language models are never going to be perfect.
Mistakes will come through.
But the co-audit techniques themselves won't be perfect either.
So sometimes a user who is using the tools will still miss some mistakes.
So, for example, at the workshop, we thought about security questions and co-audit tools themselves.
And we were thinking, for instance, about maybe deliberate attacks on a generative AI.
There's various techniques that people are talking about at the moment where you might sort of poison the inputs that generative AI models pick up on.
And in principle, co-audit tools could help users realize that there are deliberate mistakes that have been engineered by the attacker.
So that's good.
But on the other hand, security always becomes an arms race. And so once we do have a good tool that could detect those kinds of mistakes, the attackers will then start to engineer around the co-audit tools, trying to make them less effective. So that will be an ongoing problem, I think. And on the other hand, we'll find that if co-audit tools are giving too many warnings, users will start to ignore them, and there'll be a sort of under-reliance on co-audit tools.
And of course, if they give too few,
users will miss the mistakes.
So an interesting balance needs to be struck.
And also, we don't expect there's
going to be one overarching co-audit experience.
But we think there'll be many different realizations.
And so as Carina says, we hope that common lessons can be learned. And that's why we want to keep documenting this space in general and building a research community. So I echo what Carina was saying. If you're listening and you think that what you're working on is co-audit, do reach out.
Well, Andy Gordon, Carina Negreanu, thanks for joining us today. And to our listeners,
thanks for tuning in.
If you're interested in learning more about this paper and this research, you can find a link at aka.ms forward slash abstracts,
or you can read the preprint on arXiv.
See you next time on Abstracts. Thank you.