Microsoft Research Podcast - Abstracts: October 23, 2023
Episode Date: October 23, 2023

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Andy Gordon, a Partner Research Manager, and Carina Negreanu, a Senior Researcher, both at Microsoft Research, join host Dr. Gretchen Huizinga to discuss “Co-audit: Tools to help humans double-check AI-generated content.” This paper brings together current understanding of generative AI performance to explore the need and context for tools to help people using the technology find and fix mistakes in AI output.
Transcript
Welcome to Abstracts,
a Microsoft Research podcast that puts
the spotlight on world-class research in brief.
I'm Dr. Gretchen Huizinga.
In this series,
members of the research community at Microsoft give us
a quick snapshot or a podcast abstract
of their new and noteworthy papers.
Today, I'm talking to Dr. Andy Gordon, a partner research manager,
and Dr. Carina Negreanu, a senior researcher, both at Microsoft Research.
Drs. Gordon and Negreanu are coauthors of a paper called
“Co-audit: Tools to Help Humans Double-Check AI-Generated Content.”
And you can read a preprint of this paper now on arXiv.
Andy Gordon, Carina Negreanu, thanks for joining us on Abstracts.
Great to be here.
Likewise.
Let's start with you, Andy. In a few sentences, describe the issue or problem your paper addresses
and why people should care about it.
Well, generative AI is amazing. Things like Bing Chat or ChatGPT, all these things powered by large language models.
Totally amazing.
But it's really important for everyone to remember
that these AIs can make mistakes.
For example, you ask when your favorite actor got married
and the model says the year, but gets it wrong.
Or you ask for some Python code
and it works on positive numbers,
but occasionally you give it negative numbers
and it goes wrong. Another example: you get a summary of some text, and it's great, but unfortunately it misses one of the important points. Or, thinking about images, you ask the AI for a portrait of a character, and there's some glitch and it produces a hand with six fingers. So as users, we need to get into the habit of carefully checking AI outputs for mistakes.
And we refer to that as audit in the sense of a systematic review.
Coming to the paper, it's about what we call co-audit.
And that's our term for any tool support that helps the human audit the AI output.
And some examples of co-audit are tools that can help check for hallucinations, like when the actor's date of birth is wrong, or to check Python code to find some errors, or show how a summary has been constructed to help people find errors.
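To make the Python example Gordon mentions concrete, here is a minimal, hypothetical sketch (not from the paper) of the kind of mistake he describes, together with the simple spot-check an audit step would add:

```python
# A hypothetical sketch of AI-generated code that works on positive
# numbers but quietly goes wrong on negative ones.

def digit_sum(n):
    """Sum the decimal digits of n (as an AI might generate it)."""
    total = 0
    while n > 0:          # bug: the loop never runs when n < 0
        total += n % 10
        n //= 10
    return total

# An "audit" step: spot-check outputs beyond the cases the prompt mentioned.
print(digit_sum(123))     # 6, as expected
print(digit_sum(-123))    # 0 -- the check exposes the mistake
```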
Carina, let's talk to you. What related research does this paper build on and how does your work add to it? So there was no direct work on co-audit before us. We're just introducing it. But there has been a lot of research that either motivates the need for co-audit or provides relevant framing for it, or even early examples of what we started thinking of as co-audit. So as you're probably aware, there has been a really great effort in the last few years to assess the quality of generations by large language models across a real multitude of tasks, and currently we use this body of work as motivation for our research. It basically shows there really is a need for this kind of work, and we hope that in future we can also use it to benchmark co-audit tools that we are going to produce in our wider community.
But the idea of dealing with errors has been a key part of research
on human-AI interaction for ages.
And there have been some really cool guidelines that came out recently,
especially from Amershi in 2019 on human-AI interactions
that are concerned with this part of the world.
And more recently, Glassman had a really cool paper
about conversational frameworks for human-AI communication
that basically links these concepts to psychology.
And in our work, as you can read in our paper,
we are trying to basically frame co-audit
within her framework, and we find that it's a natural fit.
But before we started formally defining co-audit and writing this paper, our group had built co-audit tools in the code-generation space. One such tool is GAM, which is grounded abstraction matching, where we basically help users learn how to effectively communicate with large language models so that they both understand what the large language model understands them to be asking and also get good feedback back. We also have ColDeco,
which is a spreadsheet tool for inspecting and verifying calculated columns without the user needing to view the underlying code produced by the large language models. But really, any tool that focuses on debugging or basically getting information back from human-generated content is useful here. So even early debugging tools like FxD are very important here, as we learn how people use these kinds of tools, and we try to basically apply the same concepts in the context of LLM-generated content. So basically, we are building on top of work that helps understand the needs and challenges that end-user programmers have when working in this space and trying to extrapolate them to co-audit tools for LLM-generated content. Well, Andy, how would you describe the research
approach you used or your methodology for this paper? And how did it come about?
Great question, Gretchen. And it was actually quite an unusual methodology for us. So as
Carina says, we'd been looking at co-audit in a very specific setting of spreadsheet computations.
And we began to realize that co-audit was really important for any kind of AI generated output.
And we started to see other people doing research
that was doing the same sort of thing we were doing,
but in different settings.
So for example, there was a paper,
they were generating bits of Python
and they were deliberately showing
multiple pieces of code after they'd been generated
to kind of nudge the human user
to make a decision about which one was better.
I mean, it's really important to get
people to think about the outputs. And this was a nice trick. So we thought, look, this is actually
quite an important problem and MSR should step up and sort of gather people. So we organized a
workshop inside Microsoft in the spring and got folks together to share their perspectives on
co-audit. And then since then, we've reflected
on those discussions and tried to kind of pull them together in a more coherent sense
than the sort of whiteboards and sticky notes that we produced back then. And so that's produced
this paper. I think one of the key things that we learned in that process that we hadn't been
thinking about before was that co-audit really complements prompt engineering. So you hear a lot about prompt engineering, and it's the first part of what
we call the prompt response audit loop. And this is related to what Carina was saying about Elena
Glassman's work about AI human interaction. So the first step is you formulate a prompt.
For example, you ask for Python code.
That's the first step.
The second step is we wait for the response from the AI.
And then the third step is that we need to inspect the response.
That's the audit part.
Decide if it meets our needs or if there is a mistake.
And if that's the case, we need to repeat again.
So that's this loop, the prompt response audit loop.
And prompt engineering, they're the
tools and techniques that you use in that first step to create the prompt. So for example, some
tools will automatically include a data context in a prompt if you're trying to create some Python
to apply to a table in a spreadsheet or something like that. And then, dually, co-audit, those are the tools and techniques we have to help the human audit the response in the third step of this loop. And that's like these tools I've been mentioning that show maybe two or three candidate pieces of code to be used.
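As an illustration of the loop Gordon describes, here is a minimal sketch in Python. All of the function names (build_prompt, call_model, audit_response) are hypothetical placeholders, not an API from the paper; the stubs only exist to make the control flow runnable.

```python
def build_prompt(task, feedback=None):
    # Step 1: prompt engineering -- fold any earlier audit feedback
    # (and, in real tools, data context) into the prompt.
    return task if feedback is None else f"{task}\nPlease fix: {feedback}"

def call_model(prompt):
    # Step 2: stand-in for a call to a large language model.
    return f"<model output for: {prompt!r}>"

def audit_response(response):
    # Step 3: co-audit -- in a real tool, this is where a person,
    # helped by co-audit features, accepts the output or explains
    # what is wrong. This stub simply accepts everything.
    return True, None

def prompt_response_audit_loop(task, max_rounds=3):
    feedback = None
    for _ in range(max_rounds):
        prompt = build_prompt(task, feedback)
        response = call_model(prompt)
        ok, feedback = audit_response(response)
        if ok:
            return response   # the human accepts the output
    return None               # give up after too many rounds

print(prompt_response_audit_loop("Write Python to sum a column"))
```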
Carina, let's move over to what kinds of things you came away with, your takeaways or your findings from this workshop. Talk about that and how you chose to articulate them in the paper. So as part of our research, we found that basically one co-audit tool does not fit all needs, which in a way was great because we have a bigger field to explore, but in other ways it's a bit daunting, as it means you have to think of many things. And one thing that
really came to light is that even though we can't build something that
fits everything, we can build a set of principles that we think are important. So really, we wrote
our paper around those 10 principles that we identified throughout the workshop and are trying to promote them as things people should think about when they start going on the journey of building co-audit tools. So one of the examples is that we really think that
we should think about grounding outputs.
So for example, by citing reliable sources,
similar to what Bing Chat does today,
we think that's a really valuable, important principle
that people should follow and they should think about
what that means in the context of their co-audit tool.
In the case of Bing, it's quite simple, as it's like factual references, but if it becomes referencing code, that becomes more tricky, but still super interesting going forward.
We also propose that co-audit tools should have the capability to prioritize the user's attention to the most likely errors, as we need to be mindful of the user's cognitive effort and have a positive cost-benefit. Basically, if we flood users with different errors and flags, it might be too problematic and adoption might be quite difficult going forward.
And finally, this is something that really comes close to the core of our research area in spreadsheets. It's about thinking beyond text. So we know visuals are so important in how we explain things, in how we teach in schools, in how we teach in universities. So how do we include them in the co-audit process going forward?
I think that's going to be a really interesting challenge.
And we hope we're going to see some interesting work in that space.
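As a concrete illustration of the prioritization principle Negreanu describes, here is a small, hypothetical Python sketch; the flags and likelihood scores are invented for illustration and are not from the paper.

```python
# Rank candidate issues by estimated likelihood of being a real error
# and surface only the top few, to respect the user's cognitive effort
# rather than flooding them with every flag.

candidate_flags = [
    {"cell": "C4", "issue": "formula ignores blank rows",   "likelihood": 0.85},
    {"cell": "D9", "issue": "possible off-by-one in range",  "likelihood": 0.60},
    {"cell": "B2", "issue": "unusual but valid constant",    "likelihood": 0.10},
    {"cell": "E7", "issue": "mixed currency units",          "likelihood": 0.75},
]

def prioritize(flags, budget=3):
    """Show at most `budget` flags, highest estimated likelihood first."""
    return sorted(flags, key=lambda f: f["likelihood"], reverse=True)[:budget]

for flag in prioritize(candidate_flags):
    print(f"{flag['cell']}: {flag['issue']} (likelihood {flag['likelihood']:.0%})")
```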
Yeah. Well, principles are one thing, Andy,
but how does this paper contribute to real world impact?
We talked about that a bit at the beginning.
Who benefits most from this tool?
That is a great question,
Gretchen. And actually, that was a question that we talked about at the workshop. We think that some application areas are going to benefit more than others. So co-audit really matters
when correctness really matters and when mistakes have bad consequences. So in terms of application area, that's areas like maybe finance
or technology development or medicine. But you asked particularly about who, and we think some
people will benefit more from co-audit than others. And we found this really striking example,
I guess it's an anecdotal example that someone was posting on social media.
A professor was teaching a class using generative AI tools for the first time to generate code.
And he found some evidence that people who have low self-confidence with computers can be
intimidated by generative AI. So he would find that some of the class were really confident
users and they would ask it, you know, generate some Python to do such and such. And it would
come back with code with, you know, a bunch of mistakes in it. And the confident users were
happy just to swat that away. They were even a little arrogant about it. Like, this is a stupid
computer, they were saying.
But Gretchen, he found that a lot of his students who were less confident with computers were quite intimidated by this, because it was very confidently just saying, oh, look, all this code
is going to work. And they kind of got a bit stuck. And some of them were scrolling around this code trying to understand how it worked, when in fact it was just really broken. So he thought this was pretty bad, that these able students who were just
less confident were being intimidated and were making less good use of the generative AI. Now
that is an example, that is an anecdote from social media from a reputable professor. But we looked
into it and there are peer-reviewed studies that show a similar effect in the literature. So I'd say we need co-audit tools that will encourage these less confident
users to question when the AI is mistaken rather than getting stuck. And I think otherwise,
they're not going to see the benefits of the generative AI.
Well, Carina, sometimes I like to boil things down to a nugget
or a beautiful takeaway. So if there's one thing you want our listeners to take away from this work,
this paper, what would it be? I think that what this study has taught us is that really we need
significantly more research. So basically a good co-audit experience can really be the element that makes it or breaks it in how we incorporate this technology safely into our day-to-day lives.
But to make this happen, we need people from the field working towards the same goal.
It's really interdisciplinary work, and I don't think we can do it by isolating into groups the way we're researching now.
So I would urge our listeners to think about how they could
contribute in this space and reach out with feedback and questions to us. We are more than
open to collaboration. Really, we are just starting this journey and we'd love to see this area
become a research priority going forward in 2024. Well, Andy, as an opportunity to give some
specificity to Carina's call for help, what potential pitfalls have you already identified that represent ongoing research challenges in this field? And what's next on your and potentially others' research agendas in this field?
Well, one point, and I think Carina made this point, is that co-audit techniques will themselves never be perfect.
I mean, we're saying that language models are never going to be perfect.
Mistakes will come through.
But the co-audit techniques themselves won't be perfect either.
So sometimes a user who is using the tools will still miss some mistakes.
So, for example, at the workshop, we thought about security questions and co-audit tools themselves.
And we were thinking, for instance, about maybe deliberate attacks on a generative AI.
There's various techniques that people are talking about at the moment where you might sort of poison the inputs that generative AI models pick up on.
And in principle, co-audit tools could help users realize that there are deliberate mistakes that have been engineered by the attacker.
So that's good.
But on the other hand, security always becomes an arms race. And so once we do have a good tool that could detect those kinds of mistakes, the attackers will then start to engineer around the co-audit tools, trying to make them less effective. So that will be an ongoing problem, I think. And on the other hand, we'll find that if co-audit tools are giving too many warnings, users will start to ignore them, and there'll be a sort of under-reliance on co-audit tools.
And of course, if they give too few,
users will miss the mistakes.
So an interesting balance needs to be struck.
And also, we don't expect there's
going to be one overarching co-audit experience.
But we think there'll be many different realizations.
And so as Carina says, we hope that common lessons can be learned. And that's why we want to keep documenting this space in general and building a research community. So I echo what Carina was saying. If you're listening and you think that what you're working on is co-audit, do reach out.
Well, Andy Gordon, Carina Negreanu, thanks for joining us today. And to our listeners,
thanks for tuning in.
If you're interested in learning more about this paper and this research, you can find a link at aka.ms forward slash abstracts,
or you can read the preprint on arXiv.
See you next time on Abstracts. Thank you.