Microsoft Research Podcast - AI Testing and Evaluation: Learnings from Science and Industry
Episode Date: June 23, 2025
In the introductory episode of this new series, host Kathleen Sullivan and Senior Director Amanda Craig Deckard explore Microsoft's efforts to draw on the experience of other domains to help advance the role of AI testing and evaluation as a governance tool.
Transcript
Welcome to AI Testing and Evaluation,
Learnings from Science and Industry.
I'm your host, Kathleen Sullivan.
As generative AI continues to advance,
Microsoft has gathered a range of experts from genome editing to
cybersecurity to share how
their fields approach evaluation and risk assessment.
Our goal is to learn from their successes and
their stumbles
to move the science and practice of AI testing forward.
In this series, we'll explore how these insights
might help guide the future of AI development,
deployment and responsible use.
For our introductory episode,
I'm pleased to welcome Amanda Craig Deckard from
Microsoft to discuss the company's efforts
to learn about testing in other sectors.
Amanda is Senior Director of
Public Policy in the Office of Responsible AI,
where she leads a team that works closely with engineers,
researchers, and policy experts to help
ensure AI is being developed and used responsibly.
Their insights shape Microsoft's contribution
to public policy discussions on laws, norms,
and standards for AI.
Amanda, welcome to the podcast.
Thank you.
Amanda, let's give the listeners a little bit
of your background.
What's your origin story?
Can you talk to us a little bit about maybe how you started
in tech?
And I would love to also learn a little bit more
about what your team does in the Office of Responsible AI.
Sure.
Thank you.
I'd say my path to tech, and to Microsoft as well, was a bit circuitous.
I thought for the longest time I was going to be a journalist.
I studied forced migration.
I worked in a state-level trial court in Indiana and at a legal service provider in India,
just to give you a bit of a flavor. I made my way to Microsoft in 2014 and have been
here since, working first in cybersecurity policy and now in responsible AI.
And the way that our Office of Responsible AI has really sort of structured itself is
bringing together the kind of expertise to really work on defining policy and how to
operationalize it at the same time. And that means that we have been working through this real challenge of defining
internal policy and practice, making sure that's deeply grounded in the work of our colleagues
in Microsoft Research, and then really closely working with engineering to make sure that we
have the processes, that we have the tools to implement that policy at scale.
And I'm really drawn to these kinds of hard problems, the ones where two things are true at once, where there's real tension on both sides.
And in particular, in the context of those kinds of problems, I'm drawn to roles in which the whole job is actually
just sitting with that tension, not necessarily resolving it and expecting that you're
done. And I think there are really two reasons why tech is so representative of that
kind of challenge, which I've always found fascinating. One is that, of course, tech is
ubiquitous. It's impacting so many people's lives.
The other, which I think has become part of our vernacular now but is not necessarily immediately
intuitive, is the fact that technology is both a tool and a weapon.
And so that's just another reason
why we have to continuously work through that tension
and sit with it, right?
Even as tech evolves over time.
You bring up such great points, and this field is not black and white.
I think that even underscores this notion that you highlighted that it's impacting
everyone.
And to set the stage for our listeners, last year we pulled in a bunch of experts
from cybersecurity, biotech, and finance,
and we ran this large workshop to study
how they're thinking about governance in those playbooks.
And so I'd love to understand a little bit more
about what sparked that effort.
And there's a piece of this which is really
centered around testing,
and I'd love to hear from you why the focus on testing
is so important.
If I could rewind a little bit and give you a bit of history
of how we even arrived at bringing these experts together.
You know, we actually started on this journey in 2023.
At that time, there were like a lot of these big questions
swirling around about, you know, what did we need
in terms of governance for AI?
Of course, this was in the immediate aftermath of the ChatGPT wave and everyone recognizing
that the technology was going to have a different level of impact in the near term.
And so what do we need from governance?
What do we need at the global level, in particular, of governance?
And so at the time in early 2023, especially there were a lot of attempts
to sort of draw analogies to other global governance institutions in other domains.
So in 2023 we actually brought together a different workshop than the one you're
referring to, which was specifically focused on testing last year. And we had two big takeaways from that conversation. One was,
what are the actual functions of these institutions and how do they apply to AI?
And actually one of the takeaways was they all sort of apply. There's a
role for any of the functions, whether it be driving consensus
on research, building industry standards, or managing frontier risks, when thinking about how those might be
needed in the AI context. And one of the other big takeaways was that there are also limitations
in these analogies. You know, each of the institutions grew up in its own sort of unique historical moment,
like the one that we sit in with AI right now.
And those circumstances don't exactly translate to this moment.
And so, yeah, there was this sense of, OK, we want to draw what we can from
this conversation, and then we also want to understand what is very important
that's just different for AI right now.
We published a book with the lessons from that conversation
in 2023.
And then we actually went on a bit of a tour
with that content, where we had a number of roundtables
actually all over the world, where we gathered feedback
on how those analogies were landing, how our
takeaways were landing.
And one of the things that we took from them was a gap that some of the participants saw
in the analogies that we chose to focus on.
So across multiple conversations, other domains kept being raised: why did you not also
study pharmaceuticals? Why did you not also
study cybersecurity, for example? And so that naturally got us thinking about what
further lessons we could draw from those domains. At the same time, though, we also saw a need to,
again, go deeper than we had and really focus on a narrower problem. So that's really what
led us to trying
to think about a more specific problem
where we could think across levels of governance
and bring in some of these other domains.
And testing was top of mind; it continues
to be a really important topic in the AI policy conversation
right now.
I think for really good reason.
A lot of policymakers are focused on what
we need to do to establish sufficient trust,
and testing is going to be a part of that:
helping us better understand risk
and enabling everyone to make more risk-informed
decisions.
Testing is an important component of governance in AI,
and of course in all of these other domains
as well.
So I'll just add that the other input into the process for this second round was exploring
other analogies beyond those that we got feedback on.
And one of the early examples of another domain that would be
really worthwhile to study, which came to mind from studying the literature, was genome
editing. Genome editing was really interesting through the process of thinking
about other general-purpose technologies. We also arrived at nanoscience and brought those
into the conversation.
That's great. I mean, actually, if you could double-click,
I mean, you just named a number of industries.
I'd love to just understand which of those worlds
maybe feels the closest to what we're wrestling with with AI,
and maybe which is kind of the farthest off,
and what makes them stand out to you?
Such a good question.
For this second round,
we actually brought together eight different domains, right?
And I think we actually thought we would come out of this conversation with some bit of clarity around, oh, if we just take this approach from this domain or that domain, we'll have, at least for now, really solved part of the puzzle. And, you know, our public policy team had a follow-on discussion the day after the workshop.
And the very first thing that we started with in that
conversation was, okay, so which of these domains? And fascinatingly, everyone's
answer was that none of them apply perfectly. I mean, this is also speaking to the limitations of analogies that we already acknowledged.
And also, you know, all of the experts from across these domains gave us really interesting
insights into sort of the tradeoffs and the limitations and how they were working.
None are really applying perfectly for us, but all of them do offer a thread of insight
that is really useful for thinking about testing in AI.
And there are some different dimensions that I think are really useful as framing for that.
I mean, one is just this horizontal versus vertical difference in domains. A horizontal technology like genome
editing or nanoscience is just inherently different, and seemingly very similar to AI,
in that you want to be able to understand risks in the technology itself,
and yet there are so many contextual factors in
the application of those technologies that matter for how the risk manifests. You really need
to do those two things at once: understand the technology, but then really
think about risk and governance in the context of application. That's in contrast to
a vertical domain like civil aviation or nuclear technology, for example.
Even in the workshop itself that we hosted late last year where we brought together this
second round of experts, it was really interesting.
We actually started the conversation by trying to understand how those different domains defined
risks and where they were able to set risk thresholds.
That's been such a part of the AI policy conversation
in the last year.
And it was really instructive that the more vertical domains
were able to sort of snap to clearer answers much more
quickly.
But the horizontal domains, nanoscience and genome
editing, were not, because it just depends, right?
So anyway, the horizontal vertical dimension
seems like a really important one to draw from and apply
to AI.
The couple of others that I would offer
is just thinking about the different kinds
of technologies.
Obviously, some of the domains
that we studied are just inherently physical
technologies, or a mix of physical and digital or virtual
in a lot of cases, because all of these are, of course,
applying digital technology.
But there is just a difference between something
like an airplane or a medical device
and the more virtual or intangible sort of technologies, of course AI, but also some of the others, like cyber and genome editing,
and even financial services having some of that quality.
And again, I think the thing that's interesting to us about AI
is to think about risk evaluation of AI as having a large component that is about
that kind of virtual or intangible technology.
And also, there is a future of robotics
where we might need to think about that kind of physical risk
evaluation work as well.
And then the final thing I'd maybe say in terms of thinking
about which domains have the lessons for AI that
are most applicable is just how they've grappled with
these different kinds of governance questions.
Things like how to turn the dial in terms of being more or less prescriptive on risk
evaluation approaches, how they think about the balance of pre-market versus post-market
risk evaluation and testing,
and what the trade-offs have been there across domains have been really interesting to
tease out.
And then also thinking about sort of who does what.
So in each of these different domains, it was interesting to hear about the role of
industry, the role of governments, the role of third-party experts in designing
evaluations and developing standards and actually doing the work and kind of having the pull
through of what it means for risk and governance decisions.
Again, there was a variety of approaches across these domains that
I think were interesting for AI.
You mentioned that there are a number of different
stakeholders to be considering across the board as we're thinking about policy, as we're
thinking about regulation.
Where can we collaborate more across industry?
Is it academia, regulators?
Just how can we move the needle faster?
I think all of the above is needed,
but it's also really important to have all of that kind
of expertise brought together.
And I think one of the things that we certainly
heard from multiple of the domains, if not all of them,
was that same interest and need and
the same ongoing work to try to figure that out.
Even where there had been progress in some of the other domains with bringing together
some industry stakeholders or industry and government, there was still a desire to actually do more there.
If there was some progress with industry and government, the need was for more cross-jurisdiction
government conversation, for example; or if there was some progress within industry, there was a need to strengthen
the partnership with academia. So
I think it speaks to the quality of your question, to be honest, that all of these domains are actually still grappling with this and still seeing the need to grow in that direction more.
What I'd say about AI today is that we have made good progress with starting to build some industry partnerships.
We were a founding member of the Frontier Model Forum, or FMF, which has been a very
useful place for us to work with some peers on really trying to bring forward some best
practices that apply across our organizations.
There are other forums as well, like ML Commons, where we're working with others in industry
and broader sort of academic and civil society communities.
Partnership on AI is another one I think about that kind of fits that mold as well in a really
positive way.
And there are a lot of different governance needs to think through, and where
we can really bring that expertise together is going to be so important.
I think about, in the near to mid term,
three issues that we need to address in the AI
policy and testing context.
One is just building kind of like a flexible framework
that allows us to really build trust while we continue to advance the science and the standards.
You know, we are going to need to do both at once, and so we need a flexible framework that enables that kind of agility.
And advancing the science and the standards is going to be something that really demands that kind of cross-discipline, cross-expertise group coming together to
work on it: researchers, academics, civil society, governments, and of course, industry.
And so I think the second problem is actually how we build the kinds
of forums and ways of working together, the public-private partnership kinds of efforts, that allow all
of that expertise to come together and fit together over time, right?
Because when these are really big, broad challenges, you kind of have to break them down, incrementally
make progress on them, and then bring them back together.
And so I think about like one example that I really have been reflecting on lately is,
you know, in the context of building standards, like how do you do that, right?
Again, standards are going to benefit from that whole community of expertise.
And there are lots of different kinds of quote-unquote standards, though, right?
You have the small-s industry standards,
and you have the big-S international standards, for example. And how you leverage
one to accelerate the other, I think, is part of how we need to work together within
this ecosystem. And I think what we and others have done in an organization like C2PA,
for example, where we've really built
an industry specification but then built on that toward an international standards
effort, is one interesting example to point to.
And then, you know, I actually think that bridges to the third thing that we need to
do together within this whole community, which is to really think again about how we manage the breadth of this challenge and opportunity of AI by thinking about this horizontal-versus-vertical problem.
And I think that's where it's not just the tech industry, for example; it's the broader industry that's going to be applying this technology that needs to get involved in the conversation about not just testing AI models,
for example, but also testing how AI systems or applications are working in context. And
so, yes, so much opportunity.
Amanda, this was just fantastic. You've really set the stage for this podcast. And thank
you so much for sharing your time
and wisdom with us.
Thank you.
And to our listeners, we're so glad you joined us
for this conversation.
An exciting lineup of episodes is on the way,
and we can't wait to have you back for the next one. Thanks for watching!