Microsoft Research Podcast - AI Testing and Evaluation: Reflections

Episode Date: July 21, 2025

In the series finale, Amanda Craig Deckard returns to examine what Microsoft has learned about testing as a governance tool. She also explores the roles of rigor, standardization, and interpretability in testing and what’s next for Microsoft’s AI governance work.

Show notes: https://www.microsoft.com/en-us/research/podcast/ai-testing-and-evaluation-reflections/

Transcript
Starting point is 00:00:00 Welcome to AI Testing and Evaluation, Learnings from Science and Industry. I'm your host, Kathleen Sullivan. As generative AI continues to advance, Microsoft has gathered a range of experts from genome editing to cybersecurity to share how their fields approach evaluation and risk assessment. Our goal is to learn from their successes and
Starting point is 00:00:24 their stumbles to move the science and practice of AI testing forward. In this series, we'll explore how these insights might help guide the future of AI development, deployment, and responsible use. For our final episode of the series, I'm thrilled to once again be joined by Amanda Craig-Deckard, Senior Director of Public Policy
Starting point is 00:00:47 at Microsoft's Office of Responsible AI. Amanda, welcome back to the podcast. Thank you so much. In our intro episode, you really helped set the stage for this series, and it's been great because since then we've had the pleasure of speaking with governance experts about genome editing, pharma, medical devices, cybersecurity, and we've also gotten to spend some time with our own Microsoft responsible AI leaders and hear reflections from them. And here's
Starting point is 00:01:16 what stuck with me, and I'd love to hear from you on this as well. Testing builds trust. Context is shaping risk. And every field is really thinking about striking its own balance between pre-deployment testing and post-deployment monitoring. So drawing on what you've learned from the workshop and the case studies, what headline insights do you think matter the most for AI governance? It's been really interesting to learn
Starting point is 00:01:44 from all these different domains. And there are lots of really interesting takeaways. I think a starting point for me is actually pretty similar to where you landed, which is just that testing is really important for trust. And it's also really hard to figure out exactly how to get it right, how to make sure that you're addressing risks, that you're not constraining innovation, that you are recognizing that a lot of the industry that's impacted is really different.
Starting point is 00:02:18 You have small organizations, you have large organizations, and you want to enable that opportunity that isn't enabled by the technology across the board. And so it's just difficult to kind of get all of these dynamics right, especially when, you know, I think we heard from other domains, testing is not some sort of like, oh, simple thing, right? There's not this linear path from like A to B where you just test the one thing and you're done.
Starting point is 00:02:41 It's complex, right? Testing is multi-stage. There's a lot of testing by different actors. There are a lot of different purposes for which you might test. As I think it was Dan Carpenter who talked about, it's not just about testing for safety, it's also about testing for efficacy
Starting point is 00:03:01 and building confidence and the right dosage for pharmaceuticals, for example. And that's across the board for all of these domains, right? That you're really thinking about the performance of the technology, you're thinking about safety, you're trying to also calibrate for efficiency. And so those trade-offs, every expert shared that navigating those is really challenging and also that there were real impacts to early choices and the sort of governance of risk in these different domains and the development of the testing sort of expectations and that in some cases this have been difficult to
Starting point is 00:03:37 reverse which also just layers on that complexity and that difficulty in a different way. So that's the super high level takeaway. But maybe if I could just quickly distill like three takeaways that I think really are applicable to AI and a bit more of a granular way. One is about how is the testing exactly used for what purpose? And the second is what emphasis there is on this pre versus post deployment testing and monitoring. And then the third is how rigid versus adaptive
Starting point is 00:04:13 the sort of testing regimes or frameworks are in these different domains. So on the first, how is testing used? So is testing something that impacts market entry, for example? Or is it something that impacts market entry, for example, or is it something that might be used more for informing how risk is evolving in the domain and how broader risk management strategies might need to be applied? We have examples like the pharmaceutical or medical device industry, the experts with
Starting point is 00:04:42 whom you spoke, that's really testing, There is a pre-deployment requirement. So that's one question. The second is this emphasis on pre versus post-deployment testing and monitoring. And we really did see across domains that in many cases there is a desire for both pre and post-deployment sort of testing and monitoring,
Starting point is 00:05:06 but also that naturally in these different domains, a degree of emphasis on one or the other had evolved and that had a real impact on governance and tradeoffs. And the third is just how rigid versus adaptive these testing and evaluation regimes or frameworks are in these different domains. We saw in some domains the testing requirements were more rigid, as you might expect in more of the pharmaceutical or medical devices industries, for example. And in other domains, there was this more sort
Starting point is 00:05:43 of adaptive approach to how testing might get used. So, for example, in the case of our other general purpose technologies, you spoke with AltaSharow on genome editing, and in our case studies, we also explored this in the context of nanotechnology. In those general purpose technology domains, there is more emphasis on downstream or application context testing that is more sort of adaptive to the use scenario of the technology and, you know, having that work in conjunction with testing more at the kind of
Starting point is 00:06:21 level of the technology itself. I want to double click on a number of the things you just talked about. But actually, before we go too much deeper, a question on if there's anything that really surprised you or challenged maybe some of your own assumptions in this space from some of the discussions that we had over the series. Yeah, you know, I know I've already just mentioned this pre versus post-deployment testing and monitoring issue, but it was something that was very interesting to me and in some ways surprised me or made me just realize something that I hadn't fully connected before about how these sort of regimes might evolve in different contexts and why.
Starting point is 00:07:03 And in part, I couldn't help but bring the context I have from cybersecurity policy into this kind of processing of what we learned and reflection. Because there was a real contrast for me between the pharmaceutical industry and the cybersecurity domain when I think about the emphasis on pre versus post deployment monitoring. And on the one hand, we have in the pharmaceutical domain, I think about the emphasis on pre versus post-deployment monitoring. On the one hand, we have in the pharmaceutical domain a real emphasis that has developed around pre-market testing.
Starting point is 00:07:35 There is also an expectation in some circumstances in the pharmaceutical domain for post-deployment testing as well. As we learned from our experts in that domain, there has naturally been a real kind of emphasis on the pre-market portion of that testing. And in reality, even where post-market monitoring is required and post-market testing is required, it does not always actually happen. And the experts really explained that part of it is just the incentive structure around the emphasis around the testing as a pre-market sort of entry requirement
Starting point is 00:08:13 and also just the resources that exist among regulators. There's limited resources, right? And so there are just choices and trade-offs that they need to make in their own sort of enforcement work. And then on the other hand, you know, in cybersecurity, I never thought about the kind of emphasis on things like coordinated vulnerability disclosure and bug bounties that have really developed in the cybersecurity domain, but it's a really important part of how we secure technology and enhance cybersecurity over time, where
Starting point is 00:08:49 we have these norms that have developed, where security researchers are doing really important research. They're finding vulnerabilities in products. And we have norms developed where they report those to the companies that are in a position to address those vulnerabilities. And in some cases, those companies actually pay through bug bounties, the researchers, and perhaps in some ways, the role of coordinated vulnerability disclosure and bug bounties has
Starting point is 00:09:17 evolved the way that it has because there hasn't been as much emphasis on the pre-market testing across the board, at least, in the context of software. And so you look at those two industries, and it was interesting to me to study them to some extent in contrast with each other as this way that the incentives and the resources that need to be applied to testing sort of evolved to address where there's kind of more or less emphasis. It's a great point. I mean, I think what we're hearing and what you're saying is just exactly this choice. Like, is there a binary choice between focusing on pre-deployment testing
Starting point is 00:09:54 or post-deployment monitoring? And, you know, I think our assumption is that we need to do both. I'd love to hear from you on that. Oh, absolutely. I think we need to do both. I'm very persuaded by this inclination always that there's value in trying to really do it all in a risk management context. And also, we know one of the principles of risk management is you have to prioritize because there are finite resources. And I think that's where we get to this challenge and really thinking deeply, especially as we're in the early days of AI governance, and we need to be very thoughtful about trade-offs
Starting point is 00:10:34 that we may not want to be making, but we are, because again, these are finite choices and we kind of can't help but put our finger on the dial in different directions with our choices, that it's going to be very difficult to have sort of equal emphasis on both. And we need to invest in both, but we need to be very deliberate about the roles of each and how they complement each other and who does which and how we use what we learn from pre versus post deployment testing and monitoring.
Starting point is 00:11:05 Maybe just spending a little bit more time here. A lot of attention goes into testing models upstream, but risk often shows up once they're wired into real products and workflows. How much does deployment context change the risk picture from your perspective? Yeah, such an important question. I really agree that there has been a lot of emphasis to date on sort of testing models
Starting point is 00:11:28 upstream, the AI model evaluation. And it's also really important that we bring more attention into evaluation at the system or application level. And I actually see that in governance conversations, this is actually increasingly raised, this need to have system level evaluation. We see this across regulation. We also see it in the context of just organizations trying to put in governance requirements
Starting point is 00:12:01 for how their organization is going to operate and deploy this technology. And there's a gap today in terms of best practices around system level testing, perhaps even more than model level evaluation. And it's really important because in a lot of cases, the deployment context really does impact the risk picture, especially with AI, which is a general purpose technology. And we really saw this in our study of other domains that represented general purpose technology.
Starting point is 00:12:34 So in the case study that you can find online on nanotechnology, there's a real distinction between the risk evaluation and the governance of nanotechnology in different deployment contexts. So the chapter that our expert on nanotechnology wrote really goes into incredibly interesting detail around you know deployment of nanotechnology in the context of like chemical applications versus consumer electronics versus pharmaceuticals versus construction and how the way that nanoparticles are basically delivered in all those different deployment contexts,
Starting point is 00:13:14 as well as what the risk of the actual use scenario is just varies so much. And so there's a real need to do that kind of risk evaluation and testing in the deployment context. And this difference in terms of risks and what we learned in these other domains, where there are these different approaches to trying to really think about and gain efficiencies and address risks
Starting point is 00:13:39 at a horizontal level versus taking a real sector by sector approach. And to some extent, it seems like it's level versus taking a real sector by sector approach. And to some extent, it seems like it's more time intensive to do that sectoral deployment specific work. And at the same time, perhaps there are efficiencies to be gained by actually doing the work in the context in which you have a better understanding of the risk that can result from really deploying this technology.
Starting point is 00:14:07 And ultimately, really what we also need to think about here is probably in the end, just like pre and post-deployment testing, you need both. Not probably, certainly. So effectively, we need to think about evaluation at the model level and the system level as being really important. And it's really important to get system evaluation right so that we can actually get trust in this technology in deployment context. So we enable adoption in low and in high risk deployments in a way that
Starting point is 00:14:41 means that we've done risk evaluation in each of those contexts in a way that really makes sense in terms of the resources that we need to apply and ultimately we are able to unlock more applications of this technology in a risk-informed way. That's great. I mean I couldn't agree more. I think these contexts, the approaches are so important for trust and adoption and I'd love to hear from you. What do we need to advance AI evaluation and testing in our ecosystem? What are some of the big gaps that you're seeing and what role can different stakeholders play in filling them?
Starting point is 00:15:16 And maybe an add-on actually, is there some sort of network effect that could 10x our testing capacity? Absolutely. So there's a lot of work that needs to be done, and there's a lot of work in process to really level up our whole evaluation and testing ecosystem. We learned across domains that there is really a need to advance our thinking and our practice in three areas.
Starting point is 00:15:47 Rigor of testing, standardization of methodologies and processes, and interpretability of test results. So what we mean by rigor is that we are ensuring that what we are ultimately evaluating in terms of risks is defined in a scientifically valid way and we are able to measure against that risk in a scientifically valid way. By standardization, what we mean is that there's really an accepted and well understood and again a scientifically valid methodology for doing that testing and for actually producing artifacts out of that testing that are meeting those standards.
Starting point is 00:16:34 And that sets us up for the final portion on interpretability, which is like really the process by which you can trust that the testing has been done in this rigorous and standardized way and that then you have artifacts that result from the testing process that can really be used in the risk management context because they can be interpreted, right? We understand how to apply weight to them for our risk management decisions, we actually are able to interpret them in a way that perhaps they inform other downstream risk mitigations that address the risks that we see through the testing results, and that we actually understand what limitations apply to the test results and why they may or may not be valid in certain deployment
Starting point is 00:17:23 contexts, for example, and especially in the context of other risk mitigations that we need to apply. So there's a need to advance all three of those things, rigor, standardization, and interpretability, to level up the whole testing and evaluation ecosystem. And when we think about what actors should be involved in that work, really everybody, which is both complex, still orchestrate, but also really important.
Starting point is 00:17:53 And so, you know, you need to have the entire value chain involved and really advancing this work. You need the model developers, but you also need the system developers and deployers that are really engaged in advancing the science of evaluation and and advancing how we are using these testing artifacts in the risk management process. When we think about what could actually 10x our testing capacity, that's the dream, right? We all want to accelerate our progress in this space. I think we need work across all three of those areas of rigorous standardization and interpretability, but I think one that will really help accelerate our progress across the board is that standardization
Starting point is 00:18:39 work because ultimately you're going to need to have these tests be done and applied across so many different contexts. And ultimately, while we want the whole value chain engaged in the development of the thinking and the science and the standards in this space, we also need to realize that not every organization is necessarily going to have the capacity to kind of contribute to developing the ways that we create and use these tests. And there are going to be many organizations that are going to benefit from there being standardization of the methodologies and the artifacts that they can pick up and use.
Starting point is 00:19:20 One thing that I know we've heard throughout this podcast series from our experts in other domains, including Timo and the medical devices context and Kieran in the cybersecurity context, is that there's been a recognition as those domains have evolved that there's a need to calibrate our sort of expectations for different actors in the ecosystem and really understand that small businesses, for example, just cannot apply the same degree of resources that others may be able to do testing and evaluation and risk management. And so the benefit of having standardized approaches
Starting point is 00:19:56 is that those organizations are able to kind of integrate into the broader supply chain ecosystem and apply their own kind of risk management practices in their own context in a way that is more efficient. And finally, the last stakeholder that I think is really important to think about in terms of partnership across the ecosystem to really advance the whole testing and evaluation work that needs to happen is government partners, right? And thinking beyond the value chain, the AI supply chain,
Starting point is 00:20:27 and really thinking about public-private partnership that's going to be incredibly important to advancing this ecosystem. I think there's been real progress already in the AI evaluation and testing ecosystem in the public-private partnership context. We have been really supportive of the work of the international network of AI safety and security
Starting point is 00:20:54 institutes in the Center for AI Standards and Innovation that all allow for that kind of public-private partnership on actually testing and advancing the science and best practices around standards. And there are other innovative kind of partnerships as well in the ecosystem. Singapore has recently launched their global AI assurance pilot findings. And that effort really paired application deployers
Starting point is 00:21:22 and testers so that consequential impacts that deployment could really be tested. And that's a really fruitful sort of effort that complements the work of these institutes and centers that are more focused on evaluation at the model level, for example. And in general, I think that there's just really a lot of benefits for us thinking expansively about what we can accomplish through deep, meaningful public-private partnership in this space. I'm really excited to see where we can go from here with building on partnerships across AI supply chains and with governments and public-private partnerships.
Starting point is 00:22:01 I couldn't agree more. This notion of more engagement across the ecosystem and value chain is super important for us and informs how we think about the space completely. If you could invite any other industry to the next workshop, maybe quantum safety, space tech, even gaming, who's on your wish list? And maybe what are some of the things you'd want to go deeper on? This is something that we really welcome feedback on. If anyone listening has ideas about other domains that would be interesting to study,
Starting point is 00:22:31 I will say I think I shared at the outset of this podcast series, the domains that we added and this round of our effort in studying other domains actually all came from feedback that we received from folks we'd engaged with our first study of other domains actually all came from feedback that we received from folks we'd engaged with our first study of other domains and multilateral sort of governance institutions. And so we're really keen to think about what other domains could be interesting to study. And we are also keen to go deeper building on what we learned in this round of effort going forward.
Starting point is 00:23:05 One of the areas that I am particularly really interested in is going deeper on what sort of transparency and information sharing about risk evaluation and testing will be really useful to share in different contexts. So across the AI supply chain, what is the information that's going to be really meaningful to share between developers and deployers of models and systems and those that are ultimately using this technology and particular deployment contexts? And I think that we could have much
Starting point is 00:23:41 to learn from other general purpose technologies like genome editing and nanotechnology and cybersecurity, where we could learn a bit more about the kinds of information that they have shared across the development and deployment life cycle and how that has strengthened risk management in general, as well as provided a really strong feedback loop around testing and evaluation, what kind of testing is most useful to do, at what point in the lifecycle, and what artifacts are most useful to share as a result of that testing and evaluation work.
Starting point is 00:24:19 I'll say as Microsoft, we have been really investing in how we are sharing information with our various stakeholders. We also have been engaged with others in industry and reporting what we've done in the context of the Hiroshima AI process or HAPE reporting framework. This is an effort that is really just in its first round of really exploring how this kind of reporting can be really additive to risk management understanding. And again, I think there's real opportunity here to look at this kind of reporting and understand, you know, what's valuable for stakeholders and where's
Starting point is 00:25:01 their opportunity to go further and really informing value chains and policymakers and the public about AI risk and opportunity? And what can we learn again from other domains that have done this kind of work over decades to really refine that kind of information sharing? It's really great to hear about all the advances that we're making on these reports. I'm guessing a lot of the metrics in there are technical, but socio-technical impacts, jobs, maybe misinformation, well-being are harder to score. What new measurement ideas are you excited about?
Starting point is 00:25:37 And do you have any thoughts on who needs to pilot those? Yeah, it's an incredibly interesting question that I think also just speaks to the breadth of sort of testing and evaluation that's needed at different points along the AI life cycle and really not getting lost in one particular kind of testing or another pre or post deployment and thinking expansively about the risks that we're trying to address through this testing. For example, even with the UK's AI Security Institute that has just recently launched a new program, a new team that's focused on societal resilience research, I think it's going to be a really important area from a socio-technical impact perspective to bring
Starting point is 00:26:20 some focus into as this technology is more widely deployed, are we understanding the impacts over time as different people in different cultures adopt and use this technology for different purposes? And I think that's an area where there really is opportunity for greater public-private partnership in this research because we all share this long-term interest in ensuring that this technology is really serving people and we have to understand the impacts so we understand what adjustments we can actually pursue sooner upstream to address those impacts and make sure that this technology is really going to work for all of us and in a way that
Starting point is 00:27:04 is consistent with the societal values that we want. So, Amanda, looking ahead, I would love to hear just what's going to be on your radar, what's top of mind for you in the coming weeks? Well, we are certainly continuing to process all the learnings that we've had from studying these domains. It's really been a rich set of insights that we want to make sure we fully take advantage of. And I think these hard questions and real opportunities to be thoughtful in these early
Starting point is 00:27:36 days of AI governance are not sort of going away or being easily resolved soon. And so I think we continue to see value in really learning from others, thinking about what's distinct in the AI context, but also what we can apply in terms of what other domains have learned. Well, I mean, it has been such a special experience for me to help illuminate the work of the Office of Responsible AI and our team in Microsoft Research. And it's just really special to see all of the work that we're doing to help set the standard for responsible development and deployment of AI. So thank you for joining us today. Thanks for your reflections and discussion. And to our listeners, thank you so much for joining us for the series.
Starting point is 00:28:21 We really hope you enjoyed it. To check out all of our episodes, visit aka.ms slash AI testing and evaluation. If you want to learn more about how Microsoft approaches AI governance, you can visit microsoft.com slash RAI. See you next time. Thanks for watching!
