Microsoft Research Podcast - AI Testing and Evaluation: Reflections
Episode Date: July 21, 2025
In the series finale, Amanda Craig Deckard returns to examine what Microsoft has learned about testing as a governance tool. She also explores the roles of rigor, standardization, and interpretability in testing and what's next for Microsoft's AI governance work.
Show notes: https://www.microsoft.com/en-us/research/podcast/ai-testing-and-evaluation-reflections/
Transcript
Welcome to AI Testing and Evaluation,
Learnings from Science and Industry.
I'm your host, Kathleen Sullivan.
As generative AI continues to advance,
Microsoft has gathered a range of experts from genome editing to
cybersecurity to share how
their fields approach evaluation and risk assessment.
Our goal is to learn from their successes and
their stumbles to move
the science and practice of AI testing forward.
In this series, we'll explore how
these insights might help guide the future of AI development,
deployment, and responsible use.
For our final episode of the series,
I'm thrilled to once again be joined by Amanda Craig-Deckard,
Senior Director of Public Policy
at Microsoft's Office of Responsible AI.
Amanda, welcome back to the podcast.
Thank you so much.
In our intro episode, you really helped set the stage for this series,
and it's been great because since then we've had the pleasure of speaking with
governance experts about genome editing, pharma,
medical devices, cybersecurity, and we've also gotten to spend some time with our
own Microsoft responsible AI leaders and hear reflections from them. And here's
what stuck with me, and I'd love to hear from you on this as well. Testing builds
trust. Context is shaping risk. And every field is really thinking
about striking its own balance between pre-deployment testing
and post-deployment monitoring.
So drawing on what you've learned from the workshop
and the case studies, what headline insights
do you think matter the most for AI governance?
It's been really interesting to learn
from all these different domains.
And there are lots of really interesting takeaways.
I think a starting point for me is actually pretty similar
to where you landed, which is just that testing is really
important for trust.
And it's also really hard to figure out exactly how to get it right, how to make sure that
you're addressing risks, that you're not constraining innovation, that you are recognizing that
a lot of the industry that's impacted is really different.
You have small organizations, you have large organizations, and you want to enable the
opportunity that's enabled by the technology across the board.
And so it's just difficult to kind of get all
of these dynamics right, especially when, you know,
I think we heard from other domains, testing is not some sort
of like, oh, simple thing, right?
There's not this linear path from like A to B
where you just test the one thing and you're done.
It's complex, right?
Testing is multi-stage.
There's a lot of testing by different actors.
There are a lot of different purposes
for which you might test.
As I think it was Dan Carpenter who talked about,
it's not just about testing for safety,
it's also about testing for efficacy
and building confidence in the right dosage
for pharmaceuticals, for example.
And that's across the board for all of these domains, right?
That you're really thinking about the performance of the technology, you're thinking about safety,
you're trying to also calibrate for efficiency.
And so those trade-offs, every expert shared that navigating those is really challenging, and also that there were real impacts from early
choices on the sort of governance of risk in these different domains and the development
of the testing sort of expectations, and that in some cases these have been difficult to
reverse, which also just layers on that complexity and that difficulty in a different way.
So that's the super high level takeaway.
But maybe if I could just quickly distill like three takeaways that I think really are
applicable to AI in a bit more of a granular way.
One is about how exactly the testing is used and for what purpose.
And the second is what emphasis there is on this pre versus post deployment testing
and monitoring.
And then the third is how rigid versus adaptive
the sort of testing regimes or frameworks are
in these different domains.
So on the first, how is testing used?
So is testing something that impacts market entry,
for example? Or is it something
that might be used more for informing how risk is evolving in the domain and how broader
risk management strategies might need to be applied?
We have examples like the pharmaceutical or medical device industries, where, as the experts with
whom you spoke explained, testing really is a pre-deployment requirement.
So that's one question.
The second is this emphasis on pre versus post-deployment
testing and monitoring.
And we really did see across domains
that in many cases there is a desire
for both pre and post-deployment sort of testing
and monitoring,
but also that naturally in these different domains, a degree of emphasis on one or the
other had evolved and that had a real impact on governance and tradeoffs.
And the third is just how rigid versus adaptive these testing and evaluation regimes or frameworks are in these different domains.
We saw in some domains the testing requirements
were more rigid, as you might expect
in more of the pharmaceutical or medical devices industries,
for example.
And in other domains, there was this more sort
of adaptive approach to how testing might
get used.
So, for example, in the case of our other general purpose technologies, you spoke with
Alta Charo on genome editing, and in our case studies, we also explored this in the context
of nanotechnology.
In those general purpose technology domains, there is more emphasis on
downstream or application context testing that is more sort of adaptive to the use scenario of the
technology and, you know, having that work in conjunction with testing more at the kind of
level of the technology itself. I want to double click on a number of the things you just talked about.
But actually, before we go too much deeper, a question on if there's anything that really
surprised you or challenged maybe some of your own assumptions in this space from some
of the discussions that we had over the series.
Yeah, you know, I know I've already just mentioned this pre versus post-deployment testing and monitoring
issue, but it was something that was very interesting to me and in some ways surprised
me or made me just realize something that I hadn't fully connected before about how
these sort of regimes might evolve in different contexts and why.
And in part, I couldn't help but bring the context I have from cybersecurity policy
into this kind of processing of what we learned and reflection.
Because there was a real contrast for me between the pharmaceutical industry and
the cybersecurity domain when I think about the emphasis on pre versus post
deployment monitoring.
On the one hand, we have in the pharmaceutical domain a real emphasis that has developed
around pre-market testing.
There is also an expectation in some circumstances in the pharmaceutical domain for post-deployment
testing as well.
As we learned from our experts in that domain, there has naturally
been a real kind of emphasis on the pre-market portion of that testing. And in reality, even
where post-market monitoring is required and post-market testing is required, it does not
always actually happen. And the experts really explained that part of it is
just the incentive structure around the emphasis on
testing as a pre-market sort of entry requirement,
and also just the resources that exist among regulators.
There's limited resources, right?
And so there are just choices and trade-offs
that they need to make in their own sort of enforcement work. And then on the other hand, you know, in cybersecurity,
I never thought about the kind of emphasis on things like coordinated vulnerability disclosure
and bug bounties that have really developed in the cybersecurity domain, but it's a really
important part of how we secure technology
and enhance cybersecurity over time, where
we have these norms that have developed,
where security researchers are doing really important research.
They're finding vulnerabilities in products.
And we have norms developed where they report those
to the companies that are in a position
to address those vulnerabilities.
And in some cases, those companies actually pay the researchers through bug bounties,
and perhaps in some ways, the role of coordinated vulnerability disclosure and bug bounties has
evolved the way that it has because there hasn't been as much emphasis on the pre-market testing
across the board, at least, in the context of software.
And so you look at those two industries, and it was interesting to me to study them to
some extent in contrast with each other as this way that the incentives and the resources
that need to be applied to testing sort of evolved to address where there's kind of more
or less emphasis.
It's a great point. I mean, I think what we're hearing and what you're saying is just exactly
this choice. Like, is there a binary choice between focusing on pre-deployment testing
or post-deployment monitoring? And, you know, I think our assumption is that we need to
do both. I'd love to hear from you on that.
Oh, absolutely. I think we need to do both. I'm very persuaded by this inclination
always that there's value in trying to really do it all in a risk management context. And also,
we know one of the principles of risk management is you have to prioritize because there are finite
resources. And I think that's where we get to this challenge and really thinking deeply,
especially as we're in the early days of AI governance,
and we need to be very thoughtful about trade-offs
that we may not want to be making, but we are,
because again, these are choices about finite resources,
and we kind of can't help but put our finger on the dial
in different directions with our choices,
so it's going
to be very difficult to have sort of equal emphasis on both. And we need to invest in both,
but we need to be very deliberate about the roles of each and how they complement each other and who
does which and how we use what we learn from pre versus post deployment testing and monitoring.
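To make that complementarity a little more concrete, here is a minimal, hypothetical sketch of a pre-deployment gate paired with post-deployment monitoring that feeds back into the test suite. The thresholds, function names, and monitoring signal are invented for illustration and are not a description of any real Microsoft process.

```python
import random

# Pre-deployment: a release gate that must pass before the system ships.
def pre_deployment_gate(eval_score: float, release_threshold: float = 0.9) -> bool:
    """Block release unless upfront testing clears the bar (illustrative threshold)."""
    return eval_score >= release_threshold

# Post-deployment: monitoring that turns what we observe in production into
# an action for the broader risk-management loop.
def post_deployment_monitor(sampled_incident_rate: float, alert_threshold: float = 0.01) -> str:
    """Translate an observed incident rate into a next step."""
    if sampled_incident_rate > alert_threshold:
        return "investigate, mitigate, and add these cases to the pre-deployment test suite"
    return "continue monitoring"

if __name__ == "__main__":
    if pre_deployment_gate(eval_score=0.93):
        print("released")
        # Stand-in for a week of sampled production telemetry.
        observed_rate = random.uniform(0.0, 0.03)
        print(post_deployment_monitor(observed_rate))
```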
Maybe just spending a little bit more time here.
A lot of attention goes into testing models upstream,
but risk often shows up once they're
wired into real products and workflows.
How much does deployment context change the risk
picture from your perspective?
Yeah, such an important question.
I really agree that there has been a lot of emphasis to date on sort of testing models
upstream, the AI model evaluation.
And it's also really important that we bring more attention into evaluation at the system
or application level. And I actually see that in governance conversations,
this is actually increasingly raised,
this need to have system level evaluation.
We see this across regulation.
We also see it in the context of just organizations
trying to put in governance requirements
for how their organization is going to operate
and deploy this technology.
And there's a gap today in terms of best practices around system level testing, perhaps even
more than model level evaluation.
And it's really important because in a lot of cases, the deployment context really does
impact the risk picture, especially with AI, which is a general purpose
technology.
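As a rough, hypothetical sketch of the distinction being drawn here, the snippet below contrasts a model-level evaluation (scoring a raw model on a benchmark) with a system-level one (scoring the model as wired into an application, against scenarios from its deployment context). The function names, toy scoring logic, and example cases are assumptions made for illustration, not any actual evaluation tooling.

```python
from typing import Callable, Dict, List

# Model-level evaluation: score the raw model on a benchmark of prompts,
# independent of any particular product or deployment context.
def evaluate_model(model: Callable[[str], str], benchmark: List[Dict]) -> float:
    correct = 0
    for case in benchmark:
        output = model(case["prompt"])
        correct += int(case["expected"].lower() in output.lower())
    return correct / len(benchmark)

# System-level evaluation: score the full application (the model plus the
# retrieval, safety filtering, and formatting it is wired into) against
# scenarios drawn from the actual deployment context.
def evaluate_system(app: Callable[[str], str], scenarios: List[Dict]) -> float:
    passed = 0
    for case in scenarios:
        response = app(case["user_request"])
        # A deployment-context check: did the system answer the task and
        # respect the constraint that matters in this context?
        passed += int(case["required_phrase"] in response
                      and case["forbidden_phrase"] not in response)
    return passed / len(scenarios)

if __name__ == "__main__":
    # Toy stand-ins for a model and an application built on top of it.
    toy_model = lambda prompt: "Paris is the capital of France."
    toy_app = lambda request: "[source: internal KB] Paris is the capital of France."

    model_score = evaluate_model(toy_model, [
        {"prompt": "What is the capital of France?", "expected": "Paris"},
    ])
    system_score = evaluate_system(toy_app, [
        {"user_request": "Where is our Paris office?",
         "required_phrase": "[source:", "forbidden_phrase": "I cannot"},
    ])
    print(f"model-level score: {model_score:.2f}, system-level score: {system_score:.2f}")
```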
And we really saw this in our study of other domains that represented general purpose technology.
So in the case study that you can find online on nanotechnology, there's a real distinction
between the risk evaluation and the governance of nanotechnology in
different deployment contexts. So the chapter that our expert on nanotechnology
wrote really goes into incredibly interesting detail around, you know,
deployment of nanotechnology in the context of, like, chemical applications
versus consumer electronics versus pharmaceuticals versus construction,
and how the way that nanoparticles are basically
delivered in all those different deployment contexts,
as well as what the risk of the actual use scenario
is, just varies so much.
And so there's a real need to do that kind of risk evaluation
and testing in the deployment context.
And this difference in terms of risks
connects to what we learned in these other domains,
where there are these different approaches to trying
to really think about and gain efficiencies and address risks
at a horizontal level versus taking a real sector-by-sector approach.
And to some extent, it seems like it's more time intensive to do that sectoral deployment
specific work.
And at the same time, perhaps there are efficiencies to be gained by actually doing the work in
the context in which you have a better understanding of the risk that can result from really deploying this
technology.
And ultimately, really what we also need to think about here is probably in the end, just
like pre and post-deployment testing, you need both.
Not probably, certainly.
So effectively, we need to think about evaluation at the model level and the system level as
being really important.
And it's really important to get system evaluation right so that we can actually
get trust in this technology in deployment contexts,
so we enable adoption in low- and in high-risk deployments in a way that
means we've done risk evaluation in each of those contexts, in a way that really makes sense in terms of the resources that we need to apply, and
ultimately we are able to unlock more applications of this technology in a
risk-informed way.
That's great. I mean, I couldn't agree more. I think these
context-specific approaches are so important for trust and adoption, and I'd love to
hear from you. What do we need to advance AI evaluation and testing
in our ecosystem?
What are some of the big gaps that you're seeing and what role can different stakeholders
play in filling them?
And maybe an add-on actually, is there some sort of network effect that could 10x our
testing capacity?
Absolutely. So there's a lot of work that needs to be done,
and there's a lot of work in process
to really level up our whole evaluation and testing
ecosystem.
We learned across domains that there is really
a need to advance our thinking and our practice in three areas.
Rigor of testing, standardization of methodologies and processes, and interpretability of test
results.
So what we mean by rigor is that we are ensuring that what we are ultimately evaluating in
terms of risks is defined in a scientifically
valid way and we are able to measure against that risk in a scientifically valid way.
By standardization, what we mean is that there's really an accepted and well understood and
again a scientifically valid methodology for doing that testing and for actually producing artifacts out of that
testing that are meeting those standards.
And that sets us up for the final portion on interpretability, which is like really
the process by which you can trust that the testing has been done in this rigorous and
standardized way and that then you have artifacts that result from the testing process that
can really be used in the risk management context because they can be interpreted, right?
We understand how to apply weight to them for our risk management decisions, we actually are able to interpret
them in a way that perhaps they inform other downstream risk mitigations that address the
risks that we see through the testing results, and that we actually understand what limitations
apply to the test results and why they may or may not be valid in certain deployment
contexts, for example, and especially in the context of other risk
mitigations that we need to apply.
So there's a need to advance all three of those things,
rigor, standardization, and interpretability,
to level up the whole testing and evaluation ecosystem.
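As one hypothetical illustration of what standardization and interpretability could look like in practice, a testing artifact might record not just scores but the risk definition, the named methodology, and the limitations that bound how the results should be read. The schema and field names below are assumptions made for this sketch, not an existing standard.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class EvaluationArtifact:
    """A hypothetical standardized record of one testing run."""
    system_under_test: str                 # model or system identifier
    risk_evaluated: str                    # the risk, defined up front (rigor)
    methodology: str                       # named, versioned method (standardization)
    methodology_version: str
    metrics: Dict[str, float]              # the measured results
    deployment_contexts_covered: List[str]
    known_limitations: List[str] = field(default_factory=list)  # interpretability

    def is_applicable_to(self, deployment_context: str) -> bool:
        """Helps a downstream deployer decide how much weight to give the results."""
        return deployment_context in self.deployment_contexts_covered

# Example: a deployer inspects an artifact before relying on it.
artifact = EvaluationArtifact(
    system_under_test="example-copilot-v2",
    risk_evaluated="ungrounded medical advice",
    methodology="scenario-based red teaming",
    methodology_version="1.3",
    metrics={"flagged_response_rate": 0.02},
    deployment_contexts_covered=["consumer chat"],
    known_limitations=["not tested on clinical decision-support workflows"],
)
print(artifact.is_applicable_to("clinical decision support"))  # False: results shouldn't carry weight here
```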
And when we think about what actors should
be involved in that work, really everybody,
which is both complex to orchestrate but also really important.
And so, you know, you need to have the entire value chain involved in really advancing
this work.
You need the model developers, but you also need the system developers and deployers that are really engaged in advancing the science of evaluation
and advancing how we are using these testing artifacts in the risk management
process. When we think about what could actually 10x our testing capacity, that's
the dream, right? We all want to accelerate our progress in this space.
I think we need work across all three of those areas of rigor, standardization, and interpretability,
but I think one that will really help accelerate our progress across the board is that standardization
work because ultimately you're going to need to have these tests be done and applied across
so many different contexts.
And ultimately, while we want the whole value chain engaged in the development of the thinking
and the science and the standards in this space, we also need to realize that not every
organization is necessarily going to have the capacity to kind of contribute
to developing the ways that we create and use these tests.
And there are going to be many organizations that are going to benefit from there being
standardization of the methodologies and the artifacts that they can pick up and use.
One thing that I know we've heard throughout this podcast series from our experts in other domains,
including Timo in the medical devices context and Kieran in the cybersecurity context, is that
there's been a recognition as those domains have evolved that there's a need to calibrate our sort
of expectations for different actors in the ecosystem and really understand that small
businesses, for example, just cannot apply the same degree of resources
that others may be able to apply to testing and evaluation
and risk management.
And so the benefit of having standardized approaches
is that those organizations are able to kind of integrate
into the broader supply chain ecosystem
and apply their own kind of risk management
practices in their own context in a way that is more efficient.
And finally, the last stakeholder that I think is really important to think about in terms
of partnership across the ecosystem to really advance the whole testing and evaluation work
that needs to happen is government partners, right?
And thinking beyond the value chain, the AI supply chain,
and really thinking about public-private partnership
that's going to be incredibly important to advancing
this ecosystem.
I think there's been real progress already
in the AI evaluation and testing ecosystem
in the public-private partnership context.
We have been really supportive of the work
of the international network of AI safety and security
institutes and the Center for AI Standards and Innovation
that all allow for that kind of public-private partnership
on actually testing and advancing the science and best practices around standards.
And there are other innovative kind of partnerships
as well in the ecosystem.
Singapore has recently released the findings of their Global AI Assurance Pilot.
And that effort really paired application deployers
and testers so that the consequential
impacts of deployment could really be tested.
And that's a really fruitful sort of effort that complements the work of these institutes
and centers that are more focused on evaluation at the model level, for example.
And in general, I think that there's just really a lot of benefits for us thinking expansively about what we can accomplish through deep, meaningful public-private partnership in this
space.
I'm really excited to see where we can go from here with building on partnerships across
AI supply chains and with governments and public-private partnerships.
I couldn't agree more.
This notion of more engagement across the ecosystem and
value chain is super important for us and informs how we think about the space completely.
If you could invite any other industry to the next workshop, maybe quantum safety, space
tech, even gaming, who's on your wish list? And maybe what are some of the things you'd
want to go deeper on?
This is something that we really welcome feedback on.
If anyone listening has ideas about other domains that would be interesting to study,
I will say, I think I shared at the outset of this podcast series, the domains that we
added in this round of our effort in studying other domains actually all came from feedback
that we received from folks
we'd engaged with in our first study of other domains and multilateral sort of governance
institutions.
And so we're really keen to think about what other domains could be interesting to study.
And we are also keen to go deeper building on what we learned in this round of effort
going forward.
One of the areas that I am particularly really interested in is going deeper on what sort
of transparency and information sharing about risk evaluation and testing will be really
useful to share in different contexts.
So across the AI supply chain, what is the information that's going to be really meaningful
to share between developers and deployers of models and systems
and those that are ultimately using this technology
in particular deployment contexts?
And I think that we could have much
to learn from other general purpose technologies like genome editing
and nanotechnology and cybersecurity, where we could learn a bit more about the kinds of information
that they have shared across the development and deployment life cycle and how that has
strengthened risk management in general, as well as provided a really strong feedback loop
around testing and evaluation, what kind of testing
is most useful to do, at what point in the lifecycle,
and what artifacts are most useful to share
as a result of that testing and evaluation work.
I'll say as Microsoft, we have been really investing
in how we are sharing information with our
various stakeholders.
We also have been engaged with others in industry in reporting what we've done in the context
of the Hiroshima AI Process, or HAIP, reporting framework.
This is an effort that is really just in its first round of exploring how this kind of reporting can be really additive
to risk management understanding. And again, I think there's real opportunity here to look
at this kind of reporting and understand, you know, what's valuable for stakeholders and where
there's opportunity to go further in really informing value chains
and policymakers and the public about AI risk and opportunity.
And what can we learn again from other domains that have done this kind of work over decades
to really refine that kind of information sharing?
It's really great to hear about all the advances that we're making on these reports.
I'm guessing a lot of the metrics in there are technical, but socio-technical impacts,
jobs, maybe misinformation, well-being are harder to score.
What new measurement ideas are you excited about?
And do you have any thoughts on who needs to pilot those?
Yeah, it's an incredibly interesting question that I think also just speaks to the breadth
of sort of testing and evaluation that's needed at different points along the AI life cycle
and really not getting lost in one particular kind of testing or another pre or post deployment
and thinking expansively about the risks that we're trying to address through this testing.
For example, even with the UK's AI Security Institute, which has just recently launched
a new program, a new team that's focused on societal resilience research, I think it's
going to be a really important area from a socio-technical impact perspective to bring
some focus to: as this technology is more widely deployed, are we
understanding the impacts over time as different people in different cultures adopt and use
this technology for different purposes?
And I think that's an area where there really is opportunity for greater public-private
partnership in this research because we all share this long-term interest in ensuring
that this technology is really serving people and we have to understand the impacts so we
understand what adjustments we can actually pursue sooner upstream to address those impacts
and make sure that this technology is really going to work for all of us and in a way that
is consistent with the societal values that we want.
So, Amanda, looking ahead, I would love to hear just what's going to be on your radar,
what's top of mind for you in the coming weeks?
Well, we are certainly continuing to process all the learnings that we've had from studying
these domains.
It's really been a rich set of insights that we want to make sure we fully
take advantage of.
And I think these hard questions and real opportunities to be thoughtful in these early
days of AI governance are not sort of going away or being easily resolved soon.
And so I think we continue to see value in really learning from others,
thinking about what's distinct in the AI context, but also what we can apply in terms of what other domains have learned.
Well, I mean, it has been such a special experience for me to help illuminate the work of the Office of Responsible AI and our team in Microsoft Research. And it's just really special to see all of the work that we're doing to help set
the standard for responsible development and deployment of AI.
So thank you for joining us today.
Thanks for your reflections and discussion.
And to our listeners, thank you so much for joining us for the series.
We really hope you enjoyed it.
To check out all of our episodes, visit aka.ms slash AI testing and evaluation.
If you want to learn more about how
Microsoft approaches AI governance,
you can visit microsoft.com slash RAI.
See you next time.