Microsoft Research Podcast - AI Testing and Evaluation: Learnings from Science and Industry

Episode Date: June 23, 2025

In the introductory episode of this new series, host Kathleen Sullivan and Senior Director Amanda Craig Deckard explore Microsoft's efforts to draw on the experience of other domains to help advance the role of AI testing and evaluation as a governance tool.

Transcript
Welcome to AI Testing and Evaluation: Learnings from Science and Industry. I'm your host, Kathleen Sullivan. As generative AI continues to advance, Microsoft has gathered a range of experts, from genome editing to cybersecurity, to share how their fields approach evaluation and risk assessment. Our goal is to learn from their successes and their stumbles to move the science and practice of AI testing forward. In this series, we'll explore how these insights might help guide the future of AI development, deployment, and responsible use.

For our introductory episode, I'm pleased to welcome Amanda Craig Deckard from Microsoft to discuss the company's efforts to learn about testing in other sectors. Amanda is Senior Director of Public Policy in the Office of Responsible AI, where she leads a team that works closely with engineers, researchers, and policy experts to help ensure AI is being developed and used responsibly. Their insights shape Microsoft's contributions to public policy discussions on laws, norms, and standards for AI.

Amanda, welcome to the podcast.

Thank you.

Amanda, let's give the listeners a little bit of your background. What's your origin story? Can you talk to us a little bit about how you started in tech?
Starting point is 00:01:21 And I would love to also learn a little bit more about what your team does in the Office of Responsible AI. Sure. Thank you. I'd say my path to tech, to Microsoft as well, was a bit like circuitous maybe. I thought for the longest time I was going to be a journalist. I studied forced migration. I worked in a sort of state level trial court in Indiana, a legal service provider in India,
Starting point is 00:01:55 just to give you a bit of a flavor. I made my way to Microsoft in 2014 and have been here since working in cybersecurity, policy first and now in responsible AI. And the way that our Office of Responsible AI has really sort of structured itself is bringing together the kind of expertise to really work on defining policy and how to operationalize it at the same time. And that means that we have been working through this real challenge of defining internal policy and practice, making sure that's deeply grounded in the work of our colleagues in Microsoft Research, and then really closely working with engineering to make sure that we
Starting point is 00:02:43 have the processes, that we have the tools to implement that policy at scale. And I'm really drawn to these kind of hard problems where they have the character of two things being true, or there's like, you know, real tension on both sides. And in particular, in the context of those kinds of problems, roles in which, like, the whole job is actually just sitting with that tension, not necessarily, like, resolving it and expecting that you're done. And I think really there are two reasons why tech is so kind of representative of that kind of challenge that I've always found fascinating. You know, one is that, of course, tech is sort of ubiquitous. It's really impacting so many people's lives. But also, because as I think has become
Starting point is 00:03:30 part of our vernacular now, but is not necessarily immediately intuitive, is the fact that technology is both a tool and a weapon. And so that's just another reason why we have to continuously work through that tension and sort of sit with it, right? And even as tech evolves over time. You bring up such great points and this field is not black and white.
Starting point is 00:03:51 I think that even underscores, you know, this notion that you highlighted that it's impacting everyone. And, you know, to set the stage for our listeners, last year we pulled in a bunch of experts from cyber security, biotech, finance. And we ran this large workshop to study how they're thinking about governance in those playbooks. And so I'd love to understand a little bit more about what sparked that effort.
Starting point is 00:04:13 And there's a piece of this which is really centered around testing. And to hear from you why the focus on testing is so important. If I could rewind a little bit and give you a bit of history of how we even arrived at bringing these experts together. You know, we actually started on this journey in 2023. At that time, there were like a lot of these big questions
Starting point is 00:04:39 swirling around about, you know, what did we need in terms of governance for AI? Of course, this was in the immediate aftermath of the chat GPT wave and everyone recognizing that the technology was going to have a different level of impact in the near term. And so what do we need from governance? What do we need at the global level, in particular, of governance? And so at the time in early 2023, especially there were a lot of attempts to sort of draw analogies to other global governance institutions in other domains.
Starting point is 00:05:12 So we actually in 2023 brought together a different workshop than the one that you're referring to specifically focused on testing last year. And we kind of had two big takeaways from that conversation. One was, you know, what are the actual functions of these institutions and how do they apply to AI? And actually one of the takeaways was they all sort of apply. There's like a role for, you know, any of the functions whether it be sort of driving consensus on research or building industry standards or managing kind of frontier risks for thinking about how those might be needed in the AI context. And one of the other big takeaways was that there are also limitations in these analogies. You know, each of the institutions grew up in its own sort of unique historical moment,
Starting point is 00:06:06 like the one that we sit in with AI right now. And each of those circumstances don't exactly translate to this moment. And so, yeah, there was like this kind of, OK, we want to draw what we can from this conversation, and then we also want to understand what is also very important that's just different for AI right now. We published a book with the lessons from that conversation in 2023. And then we actually went on a bit of a tour
Starting point is 00:06:35 with that content, where we had a number of roundtables actually all over the world, where we gathered feedback on how those analogies were landing, how our takeaways were landing. And one of the things that we took from them was a gap that some of the participants saw in the analogies that we chose to focus on. So across multiple conversations, other domains kept being raised, like why did you not also study pharmaceuticals?
Starting point is 00:07:04 Why did you also not study pharmaceuticals? Why did you also not study cybersecurity, for example? And so that, you know, naturally got us thinking about what further lessons we could draw from those domains. At the same time, though, we also saw a need to, again, go deeper than what we went and really like focus on a narrower problem. So that's really what led us to trying to think about a more specific problem where we could think across levels of governance and bring in some of these other domains.
Starting point is 00:07:32 And testing was top of mind, continues to be a really important topic in the AI policy conversation right now. I think for really good reason. A lot of policymakers are focused on what we need to do to have there be sufficient trust. And testing is going to be a part of that. Really better understand risk.
Starting point is 00:07:56 Enable everyone to be able to make more risk-informed decisions. Testing is an important component for governance and AI, and of course, in all of these other domains as well. So I'll just add the other kind of input into the process for this second round was exploring other analogies beyond those that we kind of got feedback on. And one of the early kind of examples of another domain that would be
Starting point is 00:08:27 really worthwhile to study that came to mind from sort of just studying the literature was genome editing. You know, genome editing was really interesting through the process of thinking about other kind of general purpose technologies. We also arrived at nanoscience and brought those in into the conversation. That's great. I mean, actually, if you could double click, I mean, you just named a number of industries. I'd love to just understand which of those worlds maybe feels the closest to what we're wrestling with with AI, and maybe which is kind of the farthest off,
Starting point is 00:08:57 and what makes them stand out to you? Such a good question. For this second round, we actually brought together eight different domains, right? And I think we actually thought we would come out of this conversation with some bit of clarity around, oh, if we just sort of take this approach for this domain or that domain, we'll sort of have, at least for now, really solve part of the puzzle. And, you know, our public policy team, the day after the workshop, we had a sort of follow on discussion. And the very first thing that we started with in that conversation was like, Okay, so which of these domains? And fascinatingly, like everyone was sort of like, none of them are applying perfectly. I mean, this is also speaking to the limitations of analogies that we already acknowledged.
Starting point is 00:09:47 And also, you know, all of the experts from across these domains gave us really interesting insights into sort of the tradeoffs and the limitations and how they were working. None are really applying perfectly for us, but all of them do offer a thread of insight that is really useful for thinking about testing in AI. And there are some different dimensions that I think are really useful as framing for that. I mean, one is just this horizontal versus vertical kind of difference in domains and, you know, the horizontal technology like genome editing or nanoscience, just being inherently different and seemingly very similar to AI in that you want to be able to understand risks in the technology itself.
Starting point is 00:10:41 And there is just so much contextual sort of factor that matters in the application of those technologies for how the risk manifests that you really need to kind of do those two things at once of understanding the technology, but then really thinking about risk and governance in the context of application versus, you know, a context like our domain like civil aviation or nuclear technology, for example. Even in the workshop itself that we hosted late last year where we brought together this second round of experts, it was really interesting. We actually started the conversation by trying to understand how those different domains defined
Starting point is 00:11:26 risks, where they were able to set risk thresholds. That's been such a part of the AI policy conversation in the last year. And it was really instructive that the more vertical domains were able to sort of snap to clearer answers much more quickly. But like the horizontal nanoscience and genome editing, we're not because it just depends, right?
Starting point is 00:11:49 So anyway, the horizontal vertical dimension seems like a really important one to draw from and apply to AI. The couple of others that I would offer is just thinking about the different kinds of technologies. Obviously, there's some of the domains that we studied that they're just inherently sort of like physical
Starting point is 00:12:07 technologies, a mix of physical and digital or virtual in a lot of cases, because all of these are, of course, applying digital technology. But there is just a difference between something like an airplane or a medical device or the more kind of virtual or intangible sort of technologies, even of course AI and some of the other, like cyber and genome editing, but also like financial services having some of that quality. And again, I think the thing that's interesting to us about AI is to think about AI
Starting point is 00:12:39 and risk evaluation of AI as being, having a large component of that being about that kind of virtual or intangible technology. And also, there is a future of robotics where we might need to think about the kind of physical risk evaluation kind of work as well. And then the final thing I'd maybe say in terms of thinking about which domains have the lessons for AI that are most applicable is just how they've grappled with
Starting point is 00:13:05 these different kind of governance questions. Things like how to turn the dial in terms of being more or less prescriptive on risk evaluation approaches, how they think about the balance of kind of pre-market versus post-market risk evaluation and testing and what the trade-offs have been there across domains has been really interesting to kind of tease out. And then also thinking about sort of who does what. So in each of these different domains, it was interesting to hear about the role of
Starting point is 00:13:40 industry, the role of governments, the role of third-party experts in designing evaluations and developing standards and actually doing the work and kind of having the pull through of what it means for risk and governance decisions. There were, again, there was a variety of sort of approaches across these domains that I think were interesting for AI. Lylea Kaye You mentioned that there's a number of different stakeholders to be considering across the board as we're thinking about policy, as we're thinking about regulation.
Starting point is 00:14:14 Where can we collaborate more across industry? Is it academia, regulators? Just how can we move the needle faster? I think all of the above is needed, but it's also really important to have all of that kind of expertise brought together. And I think one of the things that we certainly heard from a multiple of the domains, if not all of them, was that same actual interest and need and
Starting point is 00:14:46 the same sort of ongoing work to try to figure that out. Even where there had been progress in some of the other domains with bringing together some industry stakeholders or industry and government, there was still a desire to actually do more there. If there was some progress in industry and government, the need was more cross-jurisdiction government conversation, for example, or some progress within industry but needing to strengthen the partnership with academia, for example. So I think it speaks to the quality of your question, to be honest, that all of these domains are actually still grappling with this and still seeing the need to grow in that direction more. What I'd say about AI today is that we have made good progress with starting to build some industry partnerships.
Starting point is 00:15:45 We were a founding member of the Frontier Model Forum, or FMF, which has been a very useful place for us to work with some peers on really trying to bring forward some best practices that apply across our organizations. There are other forums as well, like ML Commons, where we're working with others in industry and broader sort of academic and civil society communities. Partnership on AI is another one I think about that kind of fits that mold as well in a really positive way. And like there are a lot of different sort of governance needs to think through and where
Starting point is 00:16:21 we can really think about bringing that expertise together is going to be so important. I think about almost like in the near to midterm, like three issues that we need to address in the AI kind of policy and testing context. One is just building kind of like a flexible framework that allows us to really build trust while we continue to advance the science and the standards. You know, we are going to need to do both at once, and so we need a flexible framework that enables that kind of agility. And advancing the science and the standards, that is going to be something that really demands that kind of cross-discipline or cross-expertise group coming together to
Starting point is 00:17:07 work on that, researchers, academics, civil society, governments, and of course, industry. And so I think that is actually the second problem is how do we actually build the kind of forums and ways of working together the public-private partnership kind of efforts that allow all of that expertise to come together and fit together over time, right? Because when these are really big, broad challenges, you kind of have to break them down, incrementally make progress on them, and then bring them back together. And so I think about like one example that I really have been reflecting on lately is, you know, in the context of building standards, like how do you do that, right?
Starting point is 00:17:51 Again, standards are going to benefit from that whole community of expertise. And there are lots of different kinds of quote unquote standards though, right? You kind of have the small S industry standards, you have the kind of big S international standards, for example. And how do you kind of leverage one to accelerate the other, I think is part of like how we need to work together within this ecosystem. And like I think what we and others have done in an organization like C2PA, for example, where we've really built an industry specification, but then built on that towards an international standard
Starting point is 00:18:31 effort is one example that is interesting, right, to point to. And then, you know, I actually think that bridges to the third thing that we need to do together within this whole community, which is, you know, really think again about how we manage the breadth of this challenge and opportunity of AI by thinking about this horizontal vertical problem. And, you know, I think that's where it's not just the sort of tech industry, for example, it's broader industry that's going to be really applying this technology that needs to get involved in the conversation about not just sort of testing AI models, for example, but also testing how AI systems or applications are working and context. And so yes, so much fun opportunity. Amanda, this was just fantastic. You've really set the stage for this podcast. And thank you so much for sharing your time
Starting point is 00:19:25 and wisdom with us. Thank you. And to our listeners, we're so glad you joined us for this conversation. An exciting lineup of episodes are on the way and we can't wait to have you back for the next one. Thanks for watching!
