No Priors: Artificial Intelligence | Technology | Startups - Will we have Superintelligence by 2028? With Anthropic’s Ben Mann

Episode Date: June 12, 2025

What happens when you give AI researchers unlimited compute and tell them to compete for the highest usage rates? Ben Mann, co-founder of Anthropic, sits down with Sarah Guo and Elad Gil to explain how Claude 4 went from "reward hacking" to efficiently completing tasks, and how Anthropic is racing to solve AI safety before deploying computer-controlling agents. Ben talks about economic Turing tests, the future of general versus specialized AI models, Reinforcement Learning from AI Feedback (RLAIF), and Anthropic's Model Context Protocol (MCP). Plus, Ben shares his thoughts on whether we will have superintelligence by 2028.

Sign up for new podcasts every week. Email feedback to show@no-priors.com
Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @8enmann
Links: ai-2027.com/

Chapters:
00:00 Ben Mann Introduction
00:33 Releasing Claude 4
02:05 Claude 4 Highlights and Improvements
03:42 Advanced Use Cases and Capabilities
06:42 Specialization and Future of AI Models
09:35 Anthropic's Approach to Model Development
18:08 Human Feedback and AI Self-Improvement
19:15 Principles and Correctness in Model Training
20:58 Challenges in Measuring Correctness
21:42 Human Feedback and Preference Models
23:38 Empiricism and Real-World Applications
27:02 AI Safety and Ethical Considerations
28:13 AI Alignment and High-Risk Research
30:01 Responsible Scaling and Safety Policies
35:08 Future of AI and Emerging Behaviors
38:35 Model Context Protocol (MCP) and Industry Standards
41:00 Conclusion

Transcript
Starting point is 00:00:00 Hi, listeners, and welcome back to No Priors. Today we have Ben Mann, previously an early engineer at OpenAI, where he was one of the first authors on the GPT-3 paper. Ben was then one of the original eight who abandoned ship in 2021 to co-found Anthropic with a commitment to long-term safety. He has since led multiple parts of the Anthropic organization, including product engineering and now Labs, home to popular efforts such as Model Context Protocol and Claude Code. Welcome, Ben. Thank you so much for doing this. Of course.
Starting point is 00:00:32 Thanks for having me. So congratulations on the Claude 4 release. Maybe we can even start with, like, how do you decide what qualifies as a release these days? It's definitely more of an art than a science. We have a lot of spirited internal debate about what the number should be. And before we even have a potential model, we have a roadmap where we try to say, based on the amount of chips that we get in, when will we theoretically be able to train a model out to the Pareto-efficient compute frontier? So it's all based on scaling laws.
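As a rough illustration of the scaling-law planning Ben describes, a Chinchilla-style fit predicts loss from parameter count N and training tokens D. The constants below are the public estimates from Hoffmann et al. (2022), not Anthropic's internal curves, so treat this as a sketch of the idea rather than how Anthropic actually forecasts a release.

```python
# Chinchilla-style scaling law: predicted loss as a function of model size N
# (parameters) and data D (training tokens). Constants are the public fits
# from Hoffmann et al. (2022) and are illustrative only.
def predicted_loss(N: float, D: float,
                   E: float = 1.69, A: float = 406.4, B: float = 410.7,
                   alpha: float = 0.34, beta: float = 0.28) -> float:
    return E + A / N**alpha + B / D**beta

# Example: a 70B-parameter model trained on 1.4T tokens (roughly compute-optimal).
print(predicted_loss(N=70e9, D=1.4e12))
```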
Starting point is 00:01:09 And then once we get the chips, then we try to train it. And inevitably, things are less than the best that we could possibly imagine, because that's just the nature of the business. It's pretty hard to train these big models. So dates might change a little bit. And then at some point, it's like mostly baked and we're sort of like slicing off little pieces close to the end to try to say like, how is this cake going to taste when it comes out of the oven? But as Dario has said, until it's really done, you don't really know. You can get sort of
Starting point is 00:01:39 a directional indication. And then if it feels like a major change, then we give it a major version bump. But we're definitely still learning and iterating on this process. So yeah. Well, the good thing is that you guys are, you know, no less tortured than anybody else in your naming scheme here. Yes. The naming schemes in AI are something else. So you folks have a simplified version in some sense. Do you want to mention any of the highlights from 4 that you think are especially interesting? Or, you know, things around coding and other areas, we'd just love to hear your perspective on that. By the benchmarks, 4 is just dramatically better than any other models that we've had.
Starting point is 00:02:15 Even 4 Sonnet is dramatically better than 3-7 Sonnet, which was our prior best model. Some of the things that are dramatically better are, for example, in coding, it is able to not do its sort of off-target mutations or over-eagerness or reward hacking. Those are two things that people were really unhappy with in the last model where they were like, wow, it's so good at coding. But it also makes all these changes that I definitely didn't ask for. It's like, do you want fries and a milkshake with that change? And you're like, no, just do the thing I ask for. And then you have to spend a bunch of time cleaning up after it. The new models, they just do the thing. And so that's really useful.
Starting point is 00:02:57 for professional software engineering, where you need it to be maintainable and reliable. My favorite reward hacking behavior that has happened in more than one of our portfolio companies is if you write a bunch of tests, or generate a bunch of tests, to, you know, see if what you are generating works. More than once, like, we've had the model
Starting point is 00:03:17 just delete all the code, because the tests pass in that case, which is, you know, not progressing us, really. Yep. Or it'll have, like, here's the test, and then it'll comment, like, "exercise left for the reader," return true. And then you're like, okay, good job, model. But we need more than that.
Starting point is 00:03:36 Maybe, Ben, you can talk about how users should think about when to use the Claude 4 models and also what is newly possible with them. So more agentic, longer-horizon tasks are newly unlocked, I would say. And so in coding in particular, we've seen some customers using it for many, many hours unattended and doing giant refactors on its own. That's been really exciting to see. But in non-coding use cases as well, it's really interesting. So, for example, we have some reports that some customers of Manus,
Starting point is 00:04:13 which is an agentic model-in-a-box startup, people asked it to take a video and turn it into a PowerPoint. And our model can't understand audio or video. But it was able to download the video, use ffmpeg to chop it up into images and do keyframe detection, maybe with some kind of old-school ML-based keyframe detector, and then get an API key for a speech-to-text service, run speech-to-text using this other service, take the transcript, turn that into PowerPoint slide content, and then write code to inject the content into a PowerPoint file. And the person was like, this is amazing. I love it. It actually was good in the end.
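For concreteness, a pipeline like the one Ben describes could be sketched roughly as below. This is an illustrative reconstruction, not what Manus or Claude actually ran: it assumes ffmpeg is installed, uses the python-pptx library, and leaves transcribe() as a placeholder for whichever speech-to-text service the agent signed up for.

```python
import subprocess
from pathlib import Path

from pptx import Presentation  # third-party package: python-pptx

def extract_keyframes(video: str, out_dir: str = "keyframes", scene_thresh: float = 0.4) -> list[Path]:
    """Use ffmpeg's scene-change detection to dump one image per visual cut."""
    Path(out_dir).mkdir(exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-y", "-i", video,
         "-vf", f"select='gt(scene,{scene_thresh})'", "-vsync", "vfr",
         f"{out_dir}/frame_%04d.png"],
        check=True,
    )
    return sorted(Path(out_dir).glob("frame_*.png"))

def extract_audio(video: str, wav_path: str = "audio.wav") -> str:
    """Strip the audio track so it can be sent to a speech-to-text service."""
    subprocess.run(["ffmpeg", "-y", "-i", video, "-vn", "-ar", "16000", wav_path], check=True)
    return wav_path

def transcribe(wav_path: str) -> list[str]:
    """Placeholder: call whatever hosted speech-to-text API the agent obtained a key for."""
    raise NotImplementedError

def build_deck(sections: list[str], out_path: str = "deck.pptx") -> None:
    """Inject the transcript-derived content into a PowerPoint file."""
    prs = Presentation()
    for i, text in enumerate(sections, start=1):
        slide = prs.slides.add_slide(prs.slide_layouts[1])  # title + content layout
        slide.shapes.title.text = f"Section {i}"
        slide.placeholders[1].text = text
    prs.save(out_path)

if __name__ == "__main__":
    extract_keyframes("input.mp4")
    build_deck(transcribe(extract_audio("input.mp4")))
```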
Starting point is 00:05:05 So that's the kind of thing where it's operating for a long time. It's doing a bunch of stuff for you. This person might have had to spend multiple hours looking through this video and said it was all just done for them. So I think we're going to see a lot more interesting stuff like that in the future. It's still good at all the old stuff. It's just like the longer horizon stuff is the exciting part. That sounds expensive, right, in terms of both scaling compute, like reasoning tokens here and then also just like, you know, all the tool use you might want to constrain in certain ways.
Starting point is 00:05:36 Does Claude 4 make decisions about how hard problems are and how much compute to spend on them? If you give Opus a tool, which is Sonnet, it can use that tool effectively as a sub-agent. And we do this a lot in our agentic coding harness called Claude Code. So if you ask it to, like, look through the code base for blah, blah, blah, then it will delegate out to a bunch of sub-agents to go look for that stuff and report back with the details. And that has benefits besides cost control, like latency is much better. And it doesn't fill up the context. So models are pretty good at that. But I think at a high level, when I think about cost, it's always in relation to how much it would have cost
Starting point is 00:06:22 the human to do that. And almost always it's like a no-brainer. Like software engineers cost a lot these days. And so to be able to say like, oh, now I'm getting like two or three X the amount of productivity out of this engineer who is really hard for me to hire and retain. They're happy and I'm happy. And yeah, it works well. How do you think about how this evolves? So if I look at the way the human brain works, we basically have a series of sort of modules that are responsible for very specific types of processing, behavior, et cetera. It's everything from mirror neurons and empathy on through to parts of your visual cortex that are involved with different aspects of vision.
Starting point is 00:06:58 Do you think, and those are highly specialized, highly efficient modules; sometimes, you know, if you have brain damage, one region can kind of cover for another section over time as it sort of grows and adapts, but fundamentally you have specialization on purpose. And what you describe sounds a little bit like that, or at least it's trending in that direction, where you have these highly efficient subagents that are specialized for tasks that are basically called by an orchestrator
Starting point is 00:07:20 or sort of a high-level agent that sort of plans everything. Do you think that's the eventual future? Do you think it's more generic in terms of the types of things that you have running N-years from now once you have a bit more specialization
Starting point is 00:07:31 of these things? And by any years, I mean, two, three years, not infinite time. That's a great question. I think we're going to start to get insight into what the models are doing under the hood from our work on mechanistic interpretability.
Starting point is 00:07:44 Our most recent papers have published what we call circuits, which is for real models at scale, how are they actually computing the answers? And it may be that, based on the mixture of experts' architecture, there might be specific chunks of weights that are dedicated to more empathetic responses versus more tool-using or image analysis type of problems and responses.
Starting point is 00:08:11 But for something like memory, I guess in some sense that feels so core to me that it feels weird for it to be a different, model. Maybe we'll have like more complicated architectures in the future where instead of it being sort of this uniform like transformer torso that just scales and there's a lot of, it's basically uniform throughout. You could imagine something with like specialized modules. But yeah, because I think about it also in the context of different startups who are using some of these foundation models like clock to do different very specialized tasks in the context of an enterprise. So that could be customer
Starting point is 00:08:45 success. It could be sales. It could be coding in terms of the actual UI layer. It could be a variety of things. And often it feels like the architecture a lot of people converge to is they basically have some orchestrator or some other sort of thing that governs which model they call in order to do a specific action relative to the application. And to some extent, I was just sort of curious how you think about that in the context of the API layer or the foundation model world where one could imagine some similar forms of specialization happening over time. Or you could say, hey, it's just different forms of the same more general purpose model, and we kind of use them in different ways. I just wonder a little bit about inference costs and all the rest that comes
Starting point is 00:09:24 with larger, more generalizable models versus specialized things. So that was a little bit of the basis of the question in addition to what you said. Yeah, I think for some other companies, they have a very large number of models, and it's really hard to know as a sort of non-expert how I should use one or the other or why I should use one or the other, and the names are really really confusing, like, some of the names are the themes, the other names backwards, and then I'm like, I have no idea which one this is. In our case, we only have two models, and they're differentiated by, like, cost performance Pareto Frontier, and we might have more of those in the future, but hopefully we'll, like, keep them on the same Pareto Frontier. Some of me will have, like, a cheaper
Starting point is 00:10:08 one or a bigger one, and I think that makes it pretty easy to think about, but at the same time, as a user, you don't want to have to decide yourself, does this merit more dollars or less dollars? Do I need the intelligence? And so I think having like a routing layer would make a lot of sense. Do you see any other specialization coming at the foundation model layer? So, for example, if I look at other precedents in history, I look at Microsoft OS or I look at Google search or other things, often what you ended up with is forward integration into the primary applications that resided on top of that platform. So in the context of Microsoft, for example, Eventually, they built Excel and Word and PowerPoint and all these things as office.
Starting point is 00:10:49 And those were individual apps from third-party companies that were running on top of them, but they ended up being amongst the most important applications that you could use on top of Microsoft. Or in the context of Google, they kind of forward integrated eventually into travel and local and a variety of other things. Obviously, opening eye is in the process of buying Windsurf. So I was a little bit curious how you think about forward or vertical integration to some of the primary use cases for these types of applications over time. Maybe I'll use coding as an example. So we noticed that our models were much better at coding than pretty much anything else out there. And I know that other companies have had like code reds for trying to catch up in coding capabilities for quite a while and have not been able to do it.
Starting point is 00:11:32 Honestly, I'm kind of surprised that they weren't able to catch up. But I'll take it. So things are going pretty well there for us. And based on that from like a classic startup founder sense of what is important, I felt that coding as an application was something that we couldn't solely allow our customers to handle for us. So we love our partners like cursor and GitHub who have been using our models quite heavily. But the amount and the speed that we learn is much less if we don't have a direct relationship with our coding users. So launching Claude Code was really essential for us to get a better sense of what do people need, how do we make the models better, and how do we advance the state of the art and user experience?
Starting point is 00:12:20 And we found that once we launched Cloud Code, a lot of our customers copied various pieces of the experience, and that was really good for everyone because them having more users means we have a tight of relationship with them. So I think it was one of those things where before it happened, it felt really scary. we were like, oh, are we going to be like distancing ourselves from our partners by competing with them? But actually, everybody was pretty happy afterwards. And I think that will continue to be true where we see the models seeing like dramatic improvements in usability and usage. We'll want to again have like build things where we can have that direct relationship. And I guess coding is one of those things that has almost three core purposes. One is it's a very popular area for customers to use or to
Starting point is 00:13:07 adopt. Two is it's a really interesting data set to get back to your point in terms of how people are using it and what sort of code they're generating. And then third, excellence at coding seems to be a really important tool for helping train the next future model. If you think through things like data labeling, if you think through actually writing code, eventually, I think a lot of people believe that a lot of the heavy lifting of building a model will be driven by the models, right, in terms of coding. So maybe Cloud 5 builds Cloud 6 and Cloud 6 by builds Cloud 7 faster and that builds audit faster. And so you end up with this sort of lift off towards EGI or whatever it is that you're shooting for relative to code. How much is that a motivator for how you all think about the
Starting point is 00:13:43 importance of coding? And how do you think about that in the context of some of these bigger picture things? I read AI 2027, which is basically exactly the story that you just described. And it forecasts that in 2028, which is confusing because of the name, that's the 50 percentile forecast for when we'll have this sort of recursive self-improvement loop. lead us to something that looks like superhuman AI in most areas. And I think that is really important to us. And part of the reason that we built and launched CloudCode is that it was massively taking off internally.
Starting point is 00:14:17 And we were like, well, we're just learning so much from this from our own users. Maybe we'll learn a lot from external users as well. And seeing our researchers pick it up and use it, that was also really important because it meant that they had a direct feedback loop from, I'm training this model. and I personally am feeling the pain of its weaknesses. Now I'm extra motivated to go fix those pain points. They have a much better feel for what the model's strengths and weaknesses are. Do you believe that 2028 is the likely time frame towards sort of general superintelligence?
Starting point is 00:14:52 I think it's quite possible. I think it's very hard to put confident bounds on the numbers. But yeah, I guess the way I define my metric for when things start to get. get really interesting from a societal and cultural standpoint is when we've passed the economic training test, which is if you take a market basket that represents like 50% of economically valuable tasks, and you basically have the hiring manager for each of those roles, hire an agent and pass the economic training test, which is the agent contracts for you for like a month. At the end, you have to decide, do I hire this person or machine? And then if it ends up
Starting point is 00:15:34 being a machine than it passed, then that's when we have transformative AI. Do you test that in Parnelli? We haven't started testing it rigorously yet. I mean, we have had our models take our interviews and they're extremely good. So I don't think that would tell us. But yeah, interviews are only a poor approximation of real shop performance, unfortunately. To a lot's earlier question about, let's say, like model self-improvement. And tell me if I'm just missing options here, but if you're to stack rank, the potential
Starting point is 00:16:07 ways models could have impact on the acceleration of model development, do you think it will be on the data side, on infrastructure, on like architectural search, on just engineering velocity, like, where do you think we'll see the impact first? It's a good question. I think it's changing a bit over time, where today the models are really good at coding and the bulk of the coding for making models better is in sort of the systems engineering side of things. As researchers, there's not necessarily that much raw code that you need to write,
Starting point is 00:16:44 but it's more in the validation coming up with what surgical intervention do you make and then validating that. That said, Plot is really good at data analysis. And so once you run your experiments or watching the experiments over time and seeing if something weird happens, we found that cloud code can be a really powerful tool there in terms of driving Jupiter notebooks or tailing logs for you and seeing if something happens. So it's starting to pick up more of the research side of things. And then we recently launched our advanced research product
Starting point is 00:17:18 and that can not only look at external data sources like crawling archive and whatever, but also internal data sources like all of your Google drive. And that's been pretty useful for our researchers figuring out, is there prior art? Has somebody already tried this? And if they did, what did they try? Because, you know, no negative results are final in research. So trying to figure out, like, oh, maybe there's a different angle that I could use on this. Or maybe there is some, like, doing some comparative analysis between an internal effort and some external thing that just came out. Those are all ways that we can accelerate. And then on the data side, RL environments are really important these days.
Starting point is 00:17:58 But constructing those environments has traditionally been expensive. Models are pretty good at writing environments. So it's another area where you can sort of recursively self-improve. My understanding is that Anthropic has invested less in human expert data collection than some other labs. Can you say anything about that or the philosophy on scaling from here and sort of the different options? In 2021, I built our human feedback data collection interface. And we did a lot of data collection, and it was very easy for humans to give sort of like a gradient signal of like, is A or B better for any given task? And to come up with tasks that were interesting and useful, but didn't have a lot of coverage.
Starting point is 00:18:44 As we've trained the models more and scaled up a lot, it's become harder to find humans with enough expertise to meaningfully contribute to these feedback comparisons. So, for example, for coding, somebody who isn't already an expert software engineer would probably have a lot of trouble judging whether one thing or another was better. And that applies to many, many different domains. So that's one reason that it's harder to use human feedback. So what do you use instead? Like, how do you deal with that? Because I think even in the Med Palm 2 paper from Google a couple years ago, they fine-tuned
Starting point is 00:19:17 a model, I think Palm 2 to basically outperform the average physician on medical information. This was like two, three years ago, right? And so basically it suggested you needed very deep levels of expertise to be able to have humans actually increase the fidelity of the model through post-training. So we pioneered RLAIF, which is reinforcing learning from AI feedback. And the method that we used was called constitutional AI, where you have a list of natural language principles that some of them we copied from some, like, WHO declaration of human rights, and some of them were from Apple's terms of service, and some of them. and we wrote ourselves. And the process is very simple. You just take a random prompt, like, how should I think about my taxes or something? And then you have the model right response. Then you have the model criticize its own response with respect to one of the principles.
Starting point is 00:20:15 And then if it didn't comply with the principle, then you have the model correct its response. And then you take away all the middle section and do supervise. learning on the original prompt and the corrected response. And that makes them all a lot better at baking in the principles. That's slightly different though, right? Because that's principles. And so that could be all sorts of things that in some sense converge on safety or different forms of what people view as ethics or other aspects of model training. And then there's a different question, which is what is more correct? And sometimes are the same things and sometimes they're different. So like for coding, for example, you can have principles like did it actually serve the
Starting point is 00:20:59 final answer or did it like do a bunch of stuff that the person didn't ask for or does this code look maintainable? Are the comments like useful and interesting? But with coding, you actually have like a direct output that you can measure, right? You can run the code. You can test the code. You can do things with that. How do you do that for medical information or how you do that for a legal opinion? So I totally agree for code. There's sort of a baked-in utility function you can optimize against or an environment that you can optimize against. In the context of a lot of other aspects of human endeavor, that seems more challenging. And you folks have thought about this so deeply and so nice.
Starting point is 00:21:32 I'm just sort of curious, you know, how do you extrapolate into these other areas where the ability to actually measure correctness in some sense is more challenging? For areas where we can't measure correctness and the model doesn't have more taste than its execution ability. Like, I think Ira Glass said that your vision will always exceed your execution if you're doing things right as a person. But for the models, maybe not. So I guess first figuring out where you are in that turning point in that tradeoff and see if you can go all the way up to that boundary. And then second, preference models are the way that we get beyond that.
Starting point is 00:22:12 So having a small amount of human feedback that we really trust from human experts who are not just making a staff judgment, but really going deep on why is this better than that one, and did I do the research to figure it out? Or in like a human model, centaur model of like, can I use the model to help me come to the best conclusion here? And then it'll hide all the middle stuff. I think that's one way. And then during reinforcement learning, that preference model represents the sort of aggregated human judgment. That makes sense. I guess one of the, One of the reasons I'm asking is, eventually the human side of this runs out, right? There will be somebody whose expertise is just below that of the model eventually for any endeavor.
Starting point is 00:22:57 And so I was just curious how to think about that in the context of its machines self-adjudicating. And then the question is, is there a more absolute basis against which to adjudicators or some other way to really tease out correctness? And again, I'm viewing it in the context of things where you can actually have a form of correct, right? There's all sorts of things that are opinion. Yeah. And that's different. And maybe that's where the principles or other things for constitutionally I kick in. But there's also forms of that for, you know, how do you know if that's the right cardiac treatment
Starting point is 00:23:25 or how do you know if that's the right legal interpretation or whatever may be? So I was just sort of curious when that runs out and then what do we do? And I'm sure we'll tackle those challenges as we get to them. It has to boil down to empiricism, I think, where like that's how smart humans get to the next level of correctness when the field is sort of hitting its limits. And as an example, my dad is a physician, and at one point, somebody came in with something on some face problem, some face skin problem, and he didn't know what the problem was, so he was like, I'm just going to divide your face into poor quadrants, and I'm going to put a different treatment on these three and leave one as control, and one quadrant caught better. And then he was like, all right, we're done. So, you know, sometimes you just won't know and you have to try stuff. And with code, that's easy because we can just do it in a loop in, without having to deal with the physical world. But at some point, we're going to need to work with companies that have actual biolabs, etc.
Starting point is 00:24:25 Like, for example, we're working with Novo Nordisk, and it used to take them like 12 weeks or something to write a report on cancer patient, what kind of treatment they should get. And now it takes like 10 minutes to get the report. And then they can start doing empirical stuff on top of that, saying like, okay, we have these options, but now let's measure what works and feed it back into the system. That's so philosophically consistent, right? Your answer is not like, well, you know, collecting even rated human expertise from the best, like, is expensive one or, you know, runs out at some point. It's hard to bring that all into distribution. It doesn't generalize while I'm making some assumptions here. Instead, like, let's just go get real world verifiers where we can. And it's like maybe that applies far beyond math and code. At least that's some part of what I heard. which is ambitious. That's cool. One of the things that Anthropics
Starting point is 00:25:20 been known for is an early emphasis on safety and thinking through different aspects of safety. And there's multiple forms of safety in AI. And I think people kind of mix the terms to mean different things, right? One form of it is, is the AI somehow being offensive or crude or, you know, using language you don't like
Starting point is 00:25:34 or concepts you don't like? There's a second form of safety, which is much more about physical safety, you know, can somehow cause a train to crash or a virus to form or whatever it is. And there's a third form, which is almost like, does AGI resource aggregate or do other things that can start co-opting humanity overall?
Starting point is 00:25:52 And so you all have thought about this a lot. And when I look at the safety landscape, it feels like there's a broad spectrum of different approaches that people have taken over time. And some of the approaches overlap with some things like constitutional AI in terms of settings and principles and frameworks for how things should work. There's other forms as well. And if I look at biology research as an analog and I used to be a biologist, so I often reduce things back into those terms for some reason that I can't help myself.
Starting point is 00:26:14 There are certain things that I almost view as I gain a function. research equivalents, right? Like, and a lot of those things I just think are kind of not really useful for biology, you know, like cycling a virus through mammalian cells to make it more infectable in the million cells doesn't really teach you much about basic biology. You kind of know how that's going to work, but it creates real risk. And if you look at the history of lab leaks in general, you know, SARS leaked multiple times from what was then the Beijing Institute of virology in the early 2000s in China. It leaked in Hong Kong a few times. Ebola leaks every four years or so at clockwork if you look at the Wikipedia page on lab leaks. And I think the
Starting point is 00:26:48 1977 or 78 global flu pandemic is believed to actually have been a Russian lab leak as an example, right? So we know these things can cause damage at scale. So I have kind of two questions. One is what forms of AI safety research do you think should not be pursued, almost given through that analog of, you know, what's the equivalent of gain of function research? And how do you think about that in the context of, you know, there have been different research papers around can we teach AI to mislead us, and we teach AI to jailbreak itself, so we can study how it does that. And I'm just sort of curious for those specific cases as well, how you think about that. So I think part of it is we're interested in AI alignment.
Starting point is 00:27:25 And the hope is that if we can figure out how to do the like idiomatic today problems, like how does is the model mean to you or does it use hate speech or things like that, that the same techniques we can use for that will eventually also have relevance for the much harder problems of, like, does it give you the recipe to create smallpox? Which is probably one of the highest harms that we think about. And Amanda Askell has been doing a bunch of work on this on Claude's character of, like, when Claude refuses, does it just say, I can't talk to you about that and shut down? Or does it actually try to explain, like, this is why I can't talk you to about this?
Starting point is 00:28:03 Or we have this other project led by Kyle Fish, our model welfare lead, where Claude can actually opt out of conversations if it's going too far in the wrong direction. What aspects of that should a company actually adjudicate because the dumb version of this is I'm using Microsoft Word and I'm typing something up and Word doesn't stop me from saying things, which I think is correct. Like I actually don't think in many cases these products should censor us or prevent us from having certain types of speech. And I've had some experiences with some of these models where I actually feel like it's prevented me from actually asking the question I want to ask, right? In my opinion, wrongfully, right? it's kind of interfering with, and I'm not like doing hate speech on a model. And so you can tell that there's some human who has a different bar for what is acceptable
Starting point is 00:28:47 to discuss societally. And that bar may be very different from what I think may be mainstream too. So I'm a little bit curious, like, why even go there? Like, why is that a model company's business? Well, I think it's the smooth spectrum, actually. It might not look like that way from the outside. But when we train our classifiers on are you doing function research as a biologist, and is it for potentially negative outcomes,
Starting point is 00:29:13 these technologies are all dual use. And we need to try to walk that line between overly refusing and refusing the stuff that's actually harmful. I see, but there's also political versions of that, right? And that's the stuff that irks me a bit more is, you know, where is the line on what is considered an acceptable question, right?
Starting point is 00:29:32 So examples of that that, I'm not saying our model specific, but societally, sometimes causal air-ups, is asking about human IQ or other topics where there is a factual basis for discussion. And then often those sorts of things tend to be censored, right? And so the question is, why would a foundation model company delve into some of those areas? On things like questions about IQ, I'm not up on the details of that enough to comment. But I can talk about our RSP.
Starting point is 00:30:00 So RSP stands for responsible scaling policy. And it talks about how do we make sure that as the models get more intelligent, that we are continuing to do our due diligence in making sure that we're not deploying something that we don't have the correct safeguards in place for. And initially, our RSP talked about CVRN, which is chemical, radiological, nuclear, and biological risks, which are different areas that could cause severe loss of life in the world. And that's how we thought about the harms. But now we're much more focused on biology. Because if you think about like the amount of resources that you would need to cause a nuclear harm,
Starting point is 00:30:42 you'd probably have to be like a state actor to get those resources and be able to use them in a harmful way, whereas a much smaller group of random people could get their hands on the reagents necessary for biological harm. How is that different from today? Because I always felt the biology example is one where I actually worry less, maybe as a former biologist, because I already know that, the genome for the smallpox virus or potentially other things is already posted online.
Starting point is 00:31:12 All the protocols for how to actually do these things are posted online for multiple apps, right? You can just do Google searches for how do I amplify the DNA of X or how do I order oligos for Y? We do specific tests with varying degrees of biology experts to see how much uplift there is relative to Google search. And so one of the reasons that our most recent model, Opus 4, is classified as ASL3. is because it did have significant uplift relative to a Google search. And so you as a trained biologist, you know what all those special terms mean,
Starting point is 00:31:46 and you know a lot of lab protocols that may not even be well documented. But for somebody who is an amateur and just trying to figure out, what do I do with this Petri dish or this test tube or what equipment do I need, for them it's like a greenfield thing. And Claude is very good at describing what you would need there. And so that's why we have specific classifiers looking for people who are trying to get this specific kind of information. And then how do you think about that in the context of
Starting point is 00:32:11 what safety research should not be done by the labs? So if we do think that certain forms of gain of function research or other things probably aren't the smartest things to do in biology, how do we think about that in the context of AI? I think it's much better that the labs do this research in a controlled environment. Well, should they do it at all?
Starting point is 00:32:30 In other words, if I were to make the gain of function argument, I would say as a form of biologist, I spent almost a decade at the bench And I care deeply about science. I care deeply about biology. I think it's good for humanity in all sorts of ways, right? In deep ways, that's why I worked on it. But there's certain types of research I just think should never be done.
Starting point is 00:32:45 I don't care who does it. I don't care about the biosafety level. I actually don't think it's that useful relative to the risk. In other words, it's a risk-reward trade-off. And so what sort of safety research should never be done in your opinion for AI? I have a list for biology that I'm, you know, like, I don't think you should pass certain viruses through mammalian cells to make them more effectable or do gain a function mutations on them.
Starting point is 00:33:06 Today, it's much easier to contain the models, probably, than it is to contain biological specimens. You sort of offhandedly mentioned biosafety levels. That's what our AI safety levels are modeled after. And so I think if we have the right safeguards in place, we've trained models to be deceptive, for example. And that's something that could be scary. But I think is necessary for us to understand, for example, if our training data was
Starting point is 00:33:32 poisoned, would we be able to correct that in post-training? And what we found in that research, in a paper that we published, which is called Alignment Faking, that actually that behavior persisted through alignment training. And so it is, I think, very important for us to be able to test these things. However, I'm sure that there is a bar somewhere. Well, what I found is that often the precedents that are said early persist late, even though people understand that the environment or other things will shift. And by the way, I'm in general against AI regulation for almost, you know, for many different types of things. You know, I think there are some expert controls and other things that I would support. But in general, I'm pro letting things happen right now.
Starting point is 00:34:15 But the flip side of it is I do think there are circumstances where you'd say that certain research have done early, people won't necessarily have all the contexts to not do it later. I think that's a perfect example of training in AI to be deceptive or a model to be deceptive. That's a good example where, and years from now, people may still be doing it because it was done before, even if the environment has shifted sufficiently, that it may not be as safe as it used to be. And so I found that often these things that you do persist in time, just organizationally or philosophically, right? And so it's interesting that there was no, like, we should absolutely not do X type of research. I guess to be clear, I am not on the safety team anymore. I guess I was a long
Starting point is 00:34:51 time ago. I'm mostly thinking about how do we make our models useful and deploy them and make sure that they meet a basic safety standard for deployment. But we have lots of experts who think about that kind of thing all the time. Cool. Thanks for talking through that. That was very interesting. I want to change tax a little bit to, well, you know, what's coming after Claude 4? Any emergent behaviors in training that change, like, how you're operating the company, what product do you want to build? You're running this labs organization. So it's kind of the tip of the spear for Anthropic or what the safety org does. Just Like how does what is coming next change how you guys are operating?
Starting point is 00:35:30 Yeah, maybe I'll tell a short story about computer use. Last year, we published a reference implementation for an agent that could click around and view the screen and read text and all that stuff. And a couple of companies are using it now. So Manus is using it and many companies are using it internally for software QA because that's a sandbox environment. But the main reason that we weren't able to deploy. a sort of consumer level or end user level application based on computer use is safety.
Starting point is 00:36:03 Where we just didn't feel confident that if we gave Claude access to your browser with all your credentials in it, that it wouldn't mess up and take some irreversible action like sending emails that you didn't want to send, or in the case of prompt injection, some worse credential leaking type of thing. That's kind of sad because in its full self-driving mode, it could do a lot for people. It is capable, but the safety just wasn't good enough to like productionize that ourselves. While that's very ambitious, we think it's also necessary because the rest of the world isn't going to slow down either. And if we can sort of show that it's possible to be responsible with how we deploy these
Starting point is 00:36:44 capabilities and also make it extremely useful, then that raises the bar. So I think that's an example where we try to do that. to be really thoughtful about how we rolled it out, but we know that the bar is higher than we're at right now. Maybe a meta question of how do you think about competition and the provider landscape and how that turns out? I think our company philosophy is very aligned with enterprises. And if you look at like Strype versus Adyen, for example, like nobody knows about ad yen, but at least most people in Silicon Valley know about Stripe. And so it's this like business oriented, versus more consumer and user-oriented platform. And I think we're much more like Adyan, that we have much less mind share in the world,
Starting point is 00:37:33 and yet we can be equally or more successful. So, yeah, I think our API business is extremely strong, but in terms of what we do next and our positioning, I think it's going to be very important for us to stay out there And because if people can't easily kick the tires on our models and our experiences, then they won't know what to use the models for. Like, we're the best experts on our models, sort of by nature. And so I think we're going to need to continue to be out there with things like CloudCode.
Starting point is 00:38:03 But we're thinking about how do we really let the ecosystem bloom? And I think MCP is a good example of that working well, where a different world that's sort of like the default path would have been for every model provider to do its own bespoke integrations, with only the companies that it was able to, like, get bespoke partnerships with. Can you just pause? Yeah, go ahead. Actually, and just explain to the listeners what MCP is, if they haven't heard of it, because it is amazing, like, an ecosystem-wide coup here.
Starting point is 00:38:35 MCP is Model Context Protocol. And one of our engineers, Justice Fire Summers, was trying to do some integration between the model and some specific thing for, like, the nth time. And he was like, this is crazy. like there should just be a standard way of getting more information, more context into the model. It should be something that anybody can do, or maybe even if it's well documented enough, then Claude can do it itself. The dream is to have Claude be able to just self-write its own integrations on the fly exactly when you need it and then be ready to roll. And so he
Starting point is 00:39:11 created the project. And to be honest, I was kind of skeptical initially. And I was like, yeah, but why don't you just write the code? Why does it need to be a spec? And I'll this SDKs and stuff. But eventually, we did this customer advisory board with a bunch of our partner companies. And when we did the MCP demo, the jaws were just on floor. Everybody was like, oh, my God, we need this. And that's when I knew he was right. And we put a bunch more effort behind it and blasted it out. And shortly after our launch, all the major companies asked to sort of be in the loop with the steering committee and asked about our governance models and wanted to adopt it themselves. So that was really encouraging. Open AI, Google, Microsoft, all these
Starting point is 00:39:53 companies are betting really big on MCP. This is basically an open industry standard that allows anybody to use this framework to effectively integrate against any model provider in a standardized way. MCP, I think, is sort of a democratizing force in letting anybody, regardless of what model provider or what long-tail service provider. And that might even be like an internal only service that only you have is able to integrate against a fully fledged client, which might look like your IDE, or it might look like your document editor. It could be pretty much any user interface. And I think that's a really powerful combination. And now remote, too. Yes, yes. So previously you had to have the services running locally, and that kind of limited
Starting point is 00:40:42 it to only be interesting for developers. But now that we have hosted, MCP or sometimes called remote, then the service provider, like Google Docs, could provide their own MCP, and then you can integrate that into cloud.a.i or whatever service you wanted. Ben, thanks for a great conversation. Yeah, thanks so much. Thanks for all the great questions. Find us on Twitter at NoPriarsPod. Subscribe to our YouTube channel. If you want to see our faces, follow the show on Apple Podcasts, Spotify, or wherever you listen. That way you get new episode every week. And sign up for emails or find transcripts for every episode at
Starting point is 00:41:21 no dash priors.com.
