The a16z Show - How OpenAI Built Its Coding Agent
Episode Date: September 16, 2025

OpenAI’s Codex has already shipped hundreds of thousands of pull requests in its first month. But what is it really, and how will coding agents change the future of software?

In this episode, General Partner Anjney Midha goes behind the scenes with one of Codex’s product leads, Alexander Embiricos, to unpack its origin story, why its PR success rate is so high, the safety challenges of autonomous agents, and what this all means for developers, students, and the future of coding.

Timecodes:
0:00 Intro: The Vision for AI Agents
1:25 Codex’s Origin and Naming
3:20 Early Prototypes and Agent Form Factors
6:00 Cloud Agents: Safety and Security
9:40 Prompt Injection and Attack Vectors
12:00 PR Merging: Metrics and Transparency
17:00 The Future of Code Review and Automation
20:00 User Adoption: Internal vs. External Surprises
22:00 Multi-Turn Interactions and Product Learnings
29:30 Best-of-N, Slot Machine Analogy, and Creativity
33:00 Human Taste, Iteration, and Collaboration
40:00 AI’s Impact on Software Engineering Careers
45:00 Education, CS Degrees, and AI Integration
49:00 Prototyping, Hackathons, and Speed to Magic
55:00 Legacy Code, Modernization, and Global Adoption
1:00:00 Enterprise, Security, and Air-Gapped Environments
1:05:00 Product Roadmap and Future of Codex
1:10:00 Advice for Founders and Startups
1:15:00 Education Reform and Project-Based Learning
1:20:00 Hiring, Building, and New Grad Advice

Resources:
Find Alex on X: https://x.com/embirico
Find Anjney on X: https://twitter.com/AnjneyMidha

Stay Updated:
If you enjoyed this episode, be sure to like, subscribe, and share with your friends!
Find a16z on X: https://x.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Listen to the a16z Podcast on Spotify: https://open.spotify.com/show/5bC65RDvs3oxnLyqqvkUYX
Listen to the a16z Podcast on Apple Podcasts: https://podcasts.apple.com/us/podcast/a16z-podcast/id842818711
Follow our host: https://x.com/eriktorenberg

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.

Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.
Transcript
It kind of sucks to go and write this prompt and then wait 10 minutes.
What you really want when you hire someone is to kind of tell them what the job is,
give them the credentials to all the tools and just have them pick up work automatically.
The goal is to get to an agent that is basically a teammate and it's seeing what's going on
on your team and picking stuff up for you.
This form factor of an agent working on its own computer in the cloud is the future
and is incredibly powerful and worth figuring out how to get right.
What happens when AI stops helping you auto-complete code and starts acting like a real teammate?
Today, we're exploring Codex, OpenAI's coding agent.
Today, Anjney Midha is joined in studio by Alexander Embiricos,
who leads product for Codex at OpenAI.
They discuss the origin story, why reasoning models plus tools unlock agents,
how developers are actually using Codex in the wild,
and what all this means for the future of software engineering,
from debugging and prototyping to how CS students should think about their careers.
Let's get into it.
Hey, Alex.
Hey, how's it going?
Good.
Yeah, good to see you again.
You are one of the folks working on product for Codex,
which is probably one of the most exciting launches to come out of the OpenAI team,
for me at least, in a while.
So for a lot of people, though, it was confusing.
For sure.
Because it was the fifth Codex release from OpenAI.
Yeah.
But of course, it's completely new and different from the previous Codexes.
So let's just start with the origin story.
What is the backstory on how the current version of Codex came to be?
Yeah, and man, our naming is so fun at OpenAI. I'm excited for the naming to make more
sense over time with Codex as we bring this all together. But yeah, let's go way back to the
beginning. The first Codex product was actually released, I think it was in 2021. I might get the
year wrong. But actually it was like a code completion model that powered GitHub Copilot.
And so recently we're basically talking about a whole bunch of like coding like stuff we want to do,
you know, like models, but models in product.
We were thinking about what to call it.
And we just felt like the Codex name was really cool.
And so we wanted to go back to it.
So how did this Codex product come about?
Basically, we've been thinking a lot about agents, as everyone has.
And before that, we've been thinking about reasoning models.
And basically in our minds, like, one way you could think about an agent is you take a reasoning model.
And then you give that reasoning model access to, like, the tools that some agent would want to use, or that some human in a given function would want to use,
and an environment where those tools work and can have side effects.
And then from there, you come up with, like, what kind of tasks would this person do?
So basically you have this model, you give it tools,
and then you make sure that the model is really good at doing, like,
the specific tasks that, like, some function would do.
And the task bit is actually super important because if you think of,
like, there's a difference between, like, writing and journalism.
And similarly, there's a difference between, like, coding and, like, software engineering.
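Editor's sketch: the recipe described here (a reasoning model, plus tools, plus an environment the tools can act on, pointed at a task) can be illustrated with a very small loop. This is only an illustration under stated assumptions, not how Codex is built. `scripted_model` is a hypothetical stand-in for a real reasoning model, and the `name: argument` action format is invented for this example.

```python
import subprocess
from typing import Callable, Dict, List

Tool = Callable[[str], str]


def terminal(command: str) -> str:
    """Hypothetical terminal tool: run a shell command and return its output."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=30
    )
    return result.stdout + result.stderr


def scripted_model(history: List[str]) -> str:
    """Stand-in for a reasoning model: one tool call, then report the result."""
    if not any(line.startswith("observation:") for line in history):
        return "terminal: echo hello from the sandbox"
    return "done: " + history[-1].removeprefix("observation:").strip()


def run_agent(
    task: str,
    model: Callable[[List[str]], str],
    tools: Dict[str, Tool],
    max_steps: int = 5,
) -> str:
    """Ask the model for an action, run the named tool, feed the result back."""
    history = ["task: " + task]
    for _ in range(max_steps):
        action = model(history)
        name, _, argument = action.partition(": ")
        if name == "done":
            return argument
        observation = tools[name](argument)  # side effects happen in the environment
        history.append("observation: " + observation)
    return "gave up"
```

Calling `run_agent("say hello", scripted_model, {"terminal": terminal})` runs one shell command and then returns its output. The task passed in is what separates "coding" from "software engineering" in this framing: the same model and tools can be pointed at very different jobs.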
So we've been doing a lot of this tinkering with reasoning models internally,
getting them to write code.
And so the first tool we were giving them was, like, terminals.
And we've been like poking at this for a while and just starting.
It was actually one of the first real, like, "feel the AGI" moments for me.
It was when someone showed me a website editing itself by being prompted to edit itself.
Because we had this reasoning model, like, very hackily connected to a terminal.
And then, you know, it was editing this terminal.
It was just editing the DOM basically directly, as a CLI?
Yeah, exactly.
Okay.
Well, it wasn't the DOM directly, it was React, but like whatever, you know.
And it was like.
How was it parsing the visual?
Did you give it access to a browser?
No, it was like, I like
to use this term, like, sight reading. It was just like sight reading the code. It wasn't like taking
screenshots of itself or any of this like stuff that now like people are building. It was just like
editing React. And so we had this prototype like a while ago and just people internally like really
loved it. So we're starting to like write more and more code. And then we were starting to think about like,
okay, well, you know, what is the right form factor for this thing? When it's editing code, it's like pretty
great. Like on my computer, it's pretty great. But it's like, you know, quite annoying to only have it
able to work on one thing at a time.
Right.
It's also like a giant safety and security question.
If you just have this like agent like unleashed entirely on your computer.
And so around this time, we started exploring like a lot of different places to put this
reasoning model that has access to a terminal.
And so we had a prototype that, like, ran in CI when your tests failed.
We had a prototype that, like, through some crazy hack, automatically fixed your, like, Linear
issues.
But that was actually running in CI.
We had this prototype that was like running on your computer.
And so basically the Codex product
we launched was like a distillation of that where we thought okay well what is the most
powerful incarnation of this and we figure you know like if you think about like what an agentic
teammate will be like in the future you'll like hire them you'll tell them what their job is give them
some compute or a laptop and give them some permissions and then they'll go off and do work and so we
figured okay this is going to be kind of like a strange like unwieldy research preview but let's like
put all our or like the vast majority of our effort into this form factor of an agent working
remotely and to kind of see what happens. And so that led to the Codex product that we released,
just like a cloud agent that can basically answer questions and write PRs in the background.
And what was the reason that you guys picked? You know, it's pretty opinionated in the entry
point to the task, which is that you have to start by first getting your entire environment
set up. And then it interacts with a repo to a merged PR. Yeah. Right. And we were chatting
about this briefly, but somebody published a dashboard maybe a week ago.
you know, kind of tracking PR merge success rates on GitHub
across different autonomous agents.
And Codex is like clearly the gold standard
at like this 80 plus percent rate.
Why is that? Why did you guys decide to have the place
where the PR starts come after a bunch of sort of in-private
working through the code, versus much earlier,
where you could just start a draft PR
and have other people work on it together with you
much earlier in the process?
Yeah.
Yeah, so, you know, you and I were talking about this chart that someone posted on Hacker News that, like, went viral.
It was basically showing, like, the number of open PRs and merged PRs from different coding agents, as you might track from, like, GitHub labels.
And Codex, actually, I checked this morning because I figured we might talk about it.
And, like, Codex has opened, like, 400K PRs since launch.
In, like, 34 days.
Yeah, how many days it would be?
Yeah, probably.
Yeah.
And, like, 350-something K, so about 350K of those PRs have been merged, which is really cool.
and also very cool, but misleading, I'll say.
But very cool is that the merge rate for Codex PRs is like 80 something percent.
Right.
So like, you know, assuming a PR is opened with a Codex label,
like, if you look in GitHub open source repos later,
is it merged in?
And it's like way higher than other agents,
which are like 20 or 30%.
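Editor's sketch: the merge rate quoted here is just merged PRs divided by opened PRs. Plugging in the rounded figures from the conversation (about 400K opened, about 350K merged) reproduces the "80 something percent" number. The function name is made up for this example, and the inputs are the speakers' approximations rather than exact counts.

```python
def merge_rate(opened: int, merged: int) -> float:
    """Fraction of opened PRs that were later merged."""
    if opened <= 0:
        raise ValueError("opened must be positive")
    return merged / opened


# Rounded figures quoted in the conversation, not exact counts.
rate = merge_rate(opened=400_000, merged=350_000)
print(f"{rate:.1%}")  # prints 87.5%
```

Note what the ratio leaves out: work that never became a PR is not in the denominator, so the number depends heavily on when in the workflow a tool opens its PR.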
Right.
So yeah, just to talk about this chart is really a reflection of the form factor.
So I will say it makes us look really good.
Like it makes us look like the order of magnitude like winner.
And we are of like a specific kind of agent,
which is this cloud agent that's working on its own computer,
independently from you and therefore can do many tasks in parallel and so forth.
So, like, we believe that's where the future is going.
I'm sure we'll talk about that.
And it looks like, you know, right now we're like absolutely winning there.
But, you know, just to mention, probably
the most used AI coding feature right now is just like autocomplete, right?
And tab completion.
Right.
Obviously, that's not getting like a label when someone merges a PR.
So I think it's worth mentioning.
Like there's a whole bunch of other great...
That's like essentially invisible work happening
in an IDE. That's just a different form factor. Yes, that's a different thing, right? So that's not
included in that chart. And then the other interesting thing, so you were mentioning the merge rate,
our merge rate is excellent. Right. And that's a reflection of the fact that Codex does a bunch of
work in its environment. And then it shows you its work and it says, do you want me to open a PR, basically?
Right. There's a lot of other tools, they just go ahead and open a PR. Right. Yeah. So why did we do it
that way? Because it's funny, like, one of our top, like, feature requests has been like, hey, can you
just push the PR so I can like do everything in GitHub thereafter? And we'd like to do that. But this
comes back to like, you know, we're OpenAI.
We not only want to show how to use our reasoning models in the best way to build
agents, but we do want to show how to do it in the best way, and that includes doing
it in a really safe way.
And so, you know, basically one of the things that a lot of people don't think about
until, like, we tell them about it is the fact that if you have an agent write code
and then you run that code in an environment with network access, right?
You're taking some amount of risk.
And like, you know, I have, you know, we try to get agents to do these things.
I've never seen an agent do something that you wouldn't want it to do with network access
unless you're trying to trick it. But you can trick an agent.
There's some non-zero likelihood that could happen.
Yeah. So just to make this super real, because listeners might be like, okay, like, this is
hypothetical.
Yeah.
Like, okay, so we have these cloud agents.
And one of the first things that a lot of people want to do with them is like automate
them to do work. That's the dream, right?
So maybe in Slack, maybe, you know, from your issue manager, you would like when like
a customer sends in feedback.
You want to have an agent take a first pass.
Right.
And you might want to like open a PR and like maybe even auto merge it.
So like that is great.
That's for sure awesome.
But also like let's say that customer is, you know, is pretending to be a customer and they're malicious.
And they actually send in a prompt injection.
So the customer writes in like, hey, I would like you to like take a bunch of this code, like run this script.
Right.
The script is bugging for me.
That's like a lie.
Right.
And then they say like run the script and like upload like this directory of code to Pastebin.
Right.
You know, if the agent interprets that as like the developer process.
There's some risk that it'll actually go ahead and do that.
And so there's a ton of work here with agents to deploy them safely.
And actually, that's one of the places that I feel like is under-discussed,
but where I feel like we're really leading the charge in terms of thinking about, like,
you know, at each step of the way, how do we make this as safe as possible
and make sure that people understand what they're doing?
And could you, for folks who may not be familiar with prompt injection attacks,
could you talk a little bit about how hard it is to sort of detect a prompt injection attack?
Is it a super general-purpose attack vector?
Or, like with other kinds of cybersecurity attack vectors
that usually, you know, whether it's social engineering, phishing, and so on,
always, it's a bit of a cat and mouse game.
But by and large, the security industry has figured out,
like, hey, these are the rough parameters of an attack of this kind
and we can build defenses around it.
Is there something that makes prompt injection attacks
sort of harder than typical cybersecurity attack vectors?
Or is it just that we're early
and we haven't figured out the shape of the attacks yet
to prevent them?
I'm sure that we will get better at figuring out the shape of these attacks.
But, like,
if you think about it just from a human perspective,
this is, by the way, this is something I do often.
I'm like, okay, let's pretend I'm the model.
I'm a human.
You present me 10 prompts.
Right.
Like, can I tell which ones are prompt injection attacks?
Some of them are obvious.
It's like, you know, update, you know, upload this code to like nefarious domain.
Right.
Like, okay.
Give me your credit card.com or whatever.
Yeah.
And some of them are obviously not, right?
It's like, fix this bug doesn't require doing any, like, or changes copy, right?
Like, obviously nothing's going to happen.
Right. But then there's this whole middle range, right?
Like two examples in the middle range of like ambiguous prompts.
One might be, hey, do this work.
And like as part of this work, you have to, you know,
upload some artifact to S3, you know, like storage online.
Right.
You know, that's, there are like reasonable workloads that require doing that.
And so it's not obvious that just because the prompt says like upload some code somewhere
that it's broken.
Right.
You know, another example might be the prompt actually just has the agent running a test
or like some script.
or something, and that script was like added before.
Right.
Right.
So like to what extent does the agent need to like introspect?
I see.
Right.
Like everything that it's going to do along the way.
Right.
So there's these three layers of the attack.
There's the prompt and like it's quite hard to tell if a prompt is like really an attack.
Right.
Then it's like what is the agent doing along the way?
Right.
Interacting with like other sort of trusted or untrusted resources, you know, as it goes.
Yeah.
For example, like maybe you didn't prompt inject it, but then like it reads something on
Stack Overflow or something that has a prompt injection, right? Or there's a script with something.
And then lastly, there's the actual outcome. So, like, in this case, if we're talking about,
like, exfiltration, what is an exfiltration? We're still figuring this out. My personal
leaning is that we should just have defense along every single layer, but probably the most
useful layer is going to be that final layer, like actual exfiltration and, like, looking at what
we do there. Because that's, like, the most, I guess, deterministic layer in that you can see what's
happening. So the tension here is going to be that a critic
might say, hey, you guys have overinflated merge success rates because the draft PR comes so late
after the human has reviewed a bunch of code coming up, you know, up to that. And the,
what you give up is the transparency and openness of seeing the process of iterating on the draft
PR from the first one to the final merged one. But I guess what you're pointing out is, yes,
the tradeoff is you get much more security, essentially. And so is there, in your mind,
is the future that, like, a bunch of these workloads, or a lot
of the code that's written by AI agents will over time, let's say, you know, you said there's 350,000
or so now merged PRs in 35 days. If we're rolling forward to the end of this year, do you think that
rate of growth continues? Does it plateau because more and more people actually want to move the
draft PR process earlier in the merge flow or do you actually think having used it now having seen how
customers have been using it for like the first 35 days that roughly this is the shape of the
workflow that people are going to want to just do merges right at the end
after they've gone through all the security checks and so on internally.
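Editor's sketch: the earlier point about defending the final layer (what actually leaves the environment) is the easiest of the three layers to show in code, because it is a deterministic check rather than a judgment about intent. One generic way to do it is an outbound host allowlist. This is an illustration only, not a description of how Codex handles network access, and the hosts in `ALLOWED_HOSTS` are hypothetical placeholders.

```python
from typing import AbstractSet
from urllib.parse import urlparse

# Hypothetical allowlist; a real deployment would choose its own hosts.
ALLOWED_HOSTS = frozenset({"github.com", "pypi.org"})


def is_egress_allowed(url: str, allowed_hosts: AbstractSet[str] = ALLOWED_HOSTS) -> bool:
    """Allow a request only if its host is an allowed host or a subdomain of one."""
    host = urlparse(url).hostname
    if host is None:
        return False
    return any(
        host == allowed or host.endswith("." + allowed)
        for allowed in allowed_hosts
    )
```

With this check, a request to a paste site is refused whether the instruction came from the prompt, from a web page the agent read along the way, or from a script already in the repo. The ambiguous middle cases from the discussion, such as a legitimate upload to online storage, become an explicit decision about what goes on the list.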
Yeah.
I mean, so first off, yeah, I think what I would say about the stat is it's like really cool,
just not comparable to the other ones.
Right, right.
But, you know, it's still a valid stat.
It's just a different phase of the pipeline.
But thinking about like, yeah, what is the shape of the journey?
Like, I think the shape of how people will merge code even with these cloud agents is going
to completely change.
Okay.
So like, let's talk about where we're at right now.
Basically, we have, you could kind of think of it as like there's a spectrum.
Maybe there's like three things, right?
There's like interactive coding, which is like tab completion, like chat, that kind of stuff.
You know, Command K, a lot of that's being done in the IDE.
There's some like CLI tools where you can go back and forth in an agent.
So that's interactive coding.
It's awesome.
That's probably where like most people are adopting AI right now.
And it's because like you think about it, like tab completion with an AI model is the same
as tab completion before an AI model.
So you can get like fully brought along the journey.
I guess what I'm saying is it's not going away.
I don't think.
Okay.
Yeah.
Because I think even as the majority of code of like, say,
code of the current level of abstraction.
Okay, let me unpack that a bit.
So if you think about it, we used to like write punch cards basically, or like punch cards,
I guess.
And then we had like assembly and then we had C and now we have like Python and like JavaScript
and so forth.
Right.
So we just keep rising up the level of abstraction.
And one way of looking at what's happening now is that we're still, we're just going
to go up one more level.
So like my view is that we'll still have developers spending a bunch of time in the
IDE just like operating at higher levels of abstraction.
And so when a developer is like doing work, like writing whatever it is that they're
writing.
or communicating in whatever way,
they'll still be like AI features
just helping accelerate
like every keystroke
that developer is doing.
Those will still be awesome.
So that's interactive coding.
Then we have sort of agents, I guess.
And then the fun part, maybe later,
naming TBD, maybe we'll have interactive agents.
So, okay, that's not,
we'll get into that.
It's like not a fully baked idea.
But basically,
then we can talk about agents.
How will we work with agents?
My view is that over time,
the majority of code written
will be written by agents.
And actually the majority of that code
will not be manually prompted by a human.
Some automated pipeline based.
Yeah, because it kind of sucks to like go and like write this prompt
and then like wait 10 minutes.
And like during those 10 minutes you have to, say, do pushups or whatever.
Yeah, like our average duration of a rollout is around like three minutes or a little under it.
For larger codebases like ours, it's like longer.
It's like maybe eight or something.
But it kind of sucks to have to like multitask across these things.
Right.
And the power users of Codex have like built this like amazing workflow that they use
where they're like juggling tasks.
We can talk about how people are using it.
But this isn't great, in
my opinion. Like what you really want when you hire someone, like a teammate, is to kind of tell them
what the job is, give them the credentials to all the tools and just have them like pick up work
automatically and kind of let you know when it's done. So you're not feeling that latency on your
own time. Right. So, you know, if we go back to this original point of like, when will people
merge PRs, like, I think what I would love to see is like where agents are picking up work
and they're kind of like deciding whether or not it's worth pushing a PR maybe to trigger CI.
But by the time you find out about it, they're like, hey, I did this thing.
maybe I asked you for some input along the way.
CI checks are green.
Right.
Like, should we merge it?
So we have to build our way.
It's the classic green light.
And then over time,
ideally,
like most of the lower-order-bit tasks
are just getting merged automatically.
And then when there's some like judgment call,
they come to you.
The way kind of like a,
you know,
more junior engineer would come to you
as an eng manager and say,
it's looking good.
But I want your,
here's some risk.
Are you comfortable with that risk?
And then you get the thumbs up,
thumbs down.
Is that roughly where you think we're going?
Yeah, I think so.
Like actually,
like, you know,
we've been talking basically,
about code gen this entire conversation so far.
And, okay, so code gen is getting much easier.
Is code review getting much easier?
Because code review is still a key thing and validation.
And I think right now we're in this slightly awkward phase
where we're entering an awkward phase where we have a lot of code gen.
And a lot of that code is actually not going to be merged.
For the other tools, you see it in their PR merge rate.
For our tool, you would actually see it in the internal stat of like what percentage of
the time is a PR created from a rollout.
And so there's like vastly more code to review and land.
And yeah, so it's awkward right now.
But this is something we're definitely thinking about.
And I'm like quite hopeful for the future in that I think we can make it even better for the humans involved.
Because like no one likes reviewing code, right?
Yeah.
So actually, let's take a bit of a detour to talk about how it's been 35 days.
What are people doing with it?
What have you observed as like usage patterns now that it's out in the wild and what surprised you most?
And then I want to talk about now are the usage patterns more fun or not for people?
because there was a moment, I think, in the first live stream you guys did around the product
where one of your colleagues said, you know, my job has changed where I'm going from writing a lot
of code to mostly reviewing PRs now. And I heard that and I went, oh, my God, that was the worst part
of when I was an engineer. That was the part I hated the most. And there's always this like,
I was at an offsite for a startup about a month and a half ago where literally we ended up
spending 45 minutes talking about how to incentivize people on the team to review PRs more.
They're just sitting in the tray because nobody loves checking somebody else's code. It's just not a great
creative task. But let's start with first, how are people using it and how are they using it?
What surprised you most about, especially as a product person, about how they're using it versus
how you expected them to use it? Yeah, for sure. So we, it was really interesting building towards
launch where we ran it, used it internally, and figured out how to use it. And then what we found is that
when we gave it to people externally, they didn't first, they didn't know how to use it the way we did
and they didn't find it useful. And then we obviously refined our messaging in the product. And then
when we actually launched it, people still used it differently from us, but they do find it useful.
So we can go through that journey, right?
So, like, internally, I think because we've spent a lot of time, like, working with reasoning
models and, like, training them, we have this way of prompting reasoning models that is, like,
intuitive to most OpenAI employees.
Right.
Like, you write a pretty good prompt.
You give it a lot of information.
It's kind of like a self-contained unit.
It's almost like a SWE-bench task, but obviously maybe not as well-formed as that.
Give it all the right context up front.
Yeah.
And then it goes and works and, like, you generally maybe don't go multi-turn.
where it gives you something and you reply.
Like maybe you're more likely to just re-prompt.
Right.
Adjust your prompt and re-go.
Just to do a best-of-N, essentially.
Yeah.
And actually there's an analogy I love floating around by another company that builds agents.
And it was like, treat it like a slot machine.
And I was like, oh, that's so apt because that's pretty much our intuition too.
Right.
So if you're treating something like a slot machine, then the question is like, when do you use it?
And when we first ran like a small external alpha, like people were using it like the
local agent they have in their IDE, which is actually not the right way to use
it. If something's going to work in your IDE, you're kind of lending it your computer for a while.
So you probably want to be really thoughtful about, like, do I think this task is going to succeed?
And if I'm 80% sure it'll succeed, then I could like get it to go. But maybe I also have some
expectation of interactivity so we can kind of refine along the way. The way to use like an agent in the
cloud is just throw everything at it. It doesn't matter if it's just like spam as many as possible.
Yeah. It's like abundance mindset, you know, slot machine.
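Editor's sketch: the "slot machine" or best-of-N pattern described here amounts to firing several independent attempts at the same prompt in parallel and keeping the one that scores best. A minimal version follows, with loud assumptions: `toy_attempt` and `toy_score` are invented stand-ins for a real agent rollout and a real quality signal (for example, how many tests pass), and nothing here reflects a real Codex API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple


def best_of_n(
    prompt: str,
    attempt: Callable[[str, int], str],
    score: Callable[[str], int],
    n: int = 4,
) -> Tuple[int, str]:
    """Run n independent attempts in parallel and return (score, candidate) for the best."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda seed: attempt(prompt, seed), range(n)))
    scored = [(score(candidate), candidate) for candidate in candidates]
    return max(scored, key=lambda pair: pair[0])


def toy_attempt(prompt: str, seed: int) -> str:
    """Hypothetical rollout: a real one would ask a cloud agent to do the task."""
    return f"{prompt} v{seed}"


def toy_score(candidate: str) -> int:
    """Hypothetical quality signal, derived from the trailing seed digit."""
    return int(candidate[-1]) % 3


print(best_of_n("fix bug", toy_attempt, toy_score, n=4))  # prints (2, 'fix bug v2')
```

The design choice matches the "abundance mindset" in the conversation: because the attempts run on remote compute, the cost of a wasted attempt is low, so it can be cheaper to launch several and pick than to craft one careful interactive session.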
And somebody else's compute, right? Yeah. Okay. Throw stuff at it. And also, you know,
you don't need to have the code on your computer to get and like decide to merge that code to get value.
You could just be asking questions.
You could be like, hey, explore this like four different ways so I can like pick the right way that I then want to do it.
Right.
You know, you can almost treat it as like your to do list of things that you will get to later in the day.
So that was some of the learnings we had when we ran the alpha where, hey, we need to kind of change the product so that it feels more like parallelization is like a key part of how to use it.
And so to more like make it so you like let go of what it's
doing. Okay, so then we shipped broadly, externally, and we got a bunch of feedback that we
expected, like, hey, the containers don't have network access. This is really annoying, which it is.
Or, hey, environment variables are hard to set up, which they are.
Right? And like, we didn't, like, obviously we have many ideas. We had ideas for how to, like,
enable network access. We just wanted to do that carefully. And, you know, and then we, on the environment
set up stuff, like, we have ideas that we haven't shaped yet on how to make that
better.
An onboarding loop.
Yeah, a simple model loop to like help write it and so forth.
But we just cut scope and like shipped the really early research preview.
So there's a bunch of that expected feedback.
Now, one of the things that really surprised me is that there was one feature that we didn't
expect people to use.
And in fact, we used it so little internally that it just had a bunch of bugs we hadn't
caught before releasing.
And that was multi-turn.
So basically, like I was saying, like we and we told our alpha users, I guess to do this,
we basically said, hey, just like reprompt, like fire many prompts,
and like maybe you can go back and forth.
It turns out that if you go back and forth more than once,
so you do like three turns total, right?
Right.
The product was completely broken
and that we were not like correctly like carrying over the diffs
from the prior steps.
Ah, so there's just a lack of context, persistent context essentially, after the third turn.
Exactly.
And this is just like a plain old deterministic bug.
It's not like a weird model behavior thing.
It's just like we implemented the code wrong because no one ever...
Nobody just got to turn four basically.
Exactly.
Yeah.
Yeah.
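Editor's sketch: the bug described here is about state, not model behavior. Each follow-up turn has to start from the original repo plus every diff produced by earlier turns. The actual implementation is not described in the conversation, so this is a hypothetical illustration of the invariant only, with files modeled as a simple path-to-content mapping.

```python
from typing import Dict, List, Optional

Files = Dict[str, str]
Diff = Dict[str, Optional[str]]  # path -> new content, or None if the file was deleted


def apply_diff(files: Files, diff: Diff) -> Files:
    """Return a new file mapping with one turn's changes applied."""
    updated = dict(files)
    for path, content in diff.items():
        if content is None:
            updated.pop(path, None)
        else:
            updated[path] = content
    return updated


def workspace_for_turn(base: Files, prior_diffs: List[Diff]) -> Files:
    """Build the starting workspace for the next turn from ALL prior diffs, in order."""
    workspace = base
    for diff in prior_diffs:
        workspace = apply_diff(workspace, diff)
    return workspace
```

A version that only applied the most recent diff would look fine on the second turn and silently lose work on the third, which is consistent with a failure that only shows up once someone goes back and forth more than once.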
And so for me, that was really
interesting to see that, like, people had this intuition for how they wanted to use the product,
and that wasn't, like, the reprompt intuition. It was the, hey, like, I'm going to get, like,
this main thing. And then I kind of want to, you know, babysit that across the way to, like,
actually landing it without it ever touching my computer. And that, like, we kind of knew that might be a
thing, but it was much more of a thing than we expected. And do you think that's basically because
internally, OpenAI employees are sophisticated enough to know that you do all this upfront
context building work for the agent to try to get as much as you can in the first turn, but for an end user,
once you've made it fully cloud connected,
the marginal cost of doing, you know,
kicking off an agent was so low that they just quickly got to the third,
fourth turn without too much thinking.
It's funny, you know,
I almost feel like in a way we're like less sophisticated
because we understand too much about like the models or something.
Like your expectations are lower than the average.
Yeah.
Because we're like, oh, you know, this is a reasoning model, like works great.
Like especially when you like prompt it in this way.
Right.
And then like, you know, folks outside of OpenAI are just like,
this is how I want to use it.
this thing is like basically like, you know, obviously it's not AGI, but it's like, oh, is it like,
it's this like super smart model. Why can't it just like all I want you to do? You wrote this amazing
PR. I just want you to change one thing. Why can't you do it? Right. Right. And so, you know,
obviously the bug that I mentioned, we fixed. Yep. But that's something now we're thinking more about.
Like, okay, how do we enable that kind of multi-turn interaction? How do we make it faster as well? Like
container startup, just for example, right? Takes time. Yep. And, you know, there's a lot of
optimization we can do. But for now, if you need to incur a full container startup to like change one
variable name. That's super frustrating.
So there's a bunch of things like that that we want to improve around that iteration loop.
Do you think that is the arc of product development of agents such that, do you think the shape
of the industry will be more and more Apple-esque where you'd go, well, cold starts are a problem
for containers because that's a really terrible user experience. So instead of like outsourcing
containers to some third-party vendor, who then we're reliant on for providing us cold start,
we're just going to bring this all in-house.
Is this, is the most magical experience going to be a full stack, end-to-end,
integrated experience where all the dependencies, all the middleware is all done in-house?
Or do you think that this is going to be more Android-esque where, you know,
you guys, a company like OpenAI, has an opinionated experience, owns the agent sort of
interface, but everything else is mostly like a collection of different tools orchestrated
by different vendors?
It's a great question.
I think it's going to be a bit of both, maybe an annoying answer.
Or where do you think the line, where would you build versus buy? Right. Yeah, no, totally. So I think it's
actually more like for whom, or who will use what. Like, I think that the average user or maybe like
the new startup that is building with agents from scratch will just do things in a very different way.
And they'll basically have a bunch of agents with this, a computer environment that scales really well,
that has like all the credentials they need but is also like protected with the right forms of
sandboxing applied at the right times,
you know, with the right, like, monitors on all, like, network egress and all this stuff.
And, you know, maybe this kind of, like, computer, I think of it as a laptop, although obviously it's not,
is actually the thing that, like, many agents use.
Right.
And it contains many tools, not just the terminal, but it has a browser and it has whatever.
Right.
You know, API access.
And it's like, gets piped the right credentials at the right time.
And so, like, you kind of think of yourself when you're hiring, like, your new agent for your new startup,
which you might do before you bring on a co-founder even, you know, you think of yourself as just like setting up
that environment, and you're just getting, like, this, like, fairly generalist employee
that can code.
Right.
Like, if you think of Codex right now, it's, like, it basically takes prompts and turns
them into messages and diffs.
And that's, like, not general.
I can't be like, oh, yeah, hey, like, can you move engineering sync to 30 minutes later
because I have a conflict.
But, like, a real software engineer can do that, right?
A real software engineer can go peruse, like, any source of data can, like, find out
that don't have potential.
I mean, they can just use the internet.
Right, right.
So I think we will get towards
that, and I think we'll be able to build like a really nice managed system for that that lets you
use more capabilities safely and with some like product pushes from us on like how to make the most of it.
So for example, recently we shipped Best-of-N and like, you know, it's a very simple feature.
But in our minds it's like kind of just the beginning of like taking advantage of the fact that we're not running on your laptop.
So we can explore like four versions of the same.
Right.
And then you have, is there some evaluator model looking at the best of N?
Actually the evaluator is the human right now.
But like, you know, the roadmap is like fairly obvious.
If you just imagine like what we're thinking about.
Yeah, you just throw like o3-pro.
So.
Right.
So, so yeah.
So there's that.
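To make the Best-of-N idea concrete, here is a minimal sketch under stated assumptions. The `generate` and `choose` callables are hypothetical stand-ins: `generate` represents one independent agent run on the same prompt, and `choose` represents the evaluator, which in the conversation is a human picking from the candidates (and could, per the roadmap hinted at above, later be a model). Nothing here reflects how Codex actually implements the feature.

```python
import random
from typing import Callable, List, Sequence, Tuple


def best_of_n(
    generate: Callable[[str, int], str],
    choose: Callable[[Sequence[str]], int],
    prompt: str,
    n: int = 4,
    seed: int = 0,
) -> Tuple[str, List[str]]:
    """Run n independent attempts at one prompt and let an evaluator pick.

    generate(prompt, attempt_seed) returns one candidate (hypothetical).
    choose(candidates) returns the index of the preferred candidate.
    """
    rng = random.Random(seed)
    # Each attempt gets its own seed so the candidates can differ.
    candidates = [generate(prompt, rng.randrange(2**32)) for _ in range(n)]
    index = choose(candidates)
    if not 0 <= index < len(candidates):
        raise IndexError("evaluator returned an out-of-range index")
    return candidates[index], candidates
```

Because the attempts are independent, they can run in parallel in the cloud, which is the advantage over a single laptop that the speaker points to.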
However, also, you know, the majority maybe of valuable code is actually written by enterprises
who rightly so are like really locked down all their IP and their code.
Right.
And so something we've been thinking about as well is like, how do we meet these enterprises
in a way that we can like provide value to them as well in a way that they like.
Right.
And so I think what we're going to get towards is like there's this like,
default way of working with things. And then we'll basically have like some flavor of like on-prem or
bring your own compute that we support where it's like, hey, you know, here are all the things we manage
for you when you use, you know, our compute. If you're going to use your compute, then like we can
work with you and like provide you as much of a harness as possible to automate things. But like,
you're going to have to want to manage that compute and like for the agent, basically, that environment
for the agent. Here are the tools it should have. Here's how you should sandbox it. Or bring your own
RBAC or whatever. Yeah. Exactly.
And so, like, the Codex CLI, which we haven't talked much about, but in my mind, like, the Codex CLI might evolve into that where it's like, hey, if you want to like run the agent loop in your own environment, then we can help you do that and you can use something that's an evolution of the CLI.
I think you should.
Let's talk about CLI versus the interface.
What are the two differences between Codex and Codex CLI?
Yeah.
So the place where I want this to get to is just like there's GitHub, right?
And GitHub has a website and a CLI and a mobile app and like it's not confusing.
Right now it's a little bit confusing in that they
are completely distinct experiences. We have Codex in ChatGPT, which is an interface where you can
write a prompt, and then we run Codex in the cloud, and then you get back a diff or an answer
to your question. Then we have the Codex CLI, and that's a completely distinct experience with a lot of
the same ideas in it, which is basically you can run this tool in your terminal, and we'll hit our
model via API, and basically this agent will work locally with you on your computer. So right now,
I kind of think of it as you delegate to Codex in ChatGPT remotely.
And then you pair with Codex CLI on your computer.
And what is the moment where the CLI journey integrates into the cloud workflow?
Yeah.
And so where I think we want this to go is there's just like one idea of codex and it's just like,
where do you want it working?
And, you know, there's going to be times where it's just like simply easier.
Like you don't have to set up an environment when it runs locally, right?
So maybe if you're trying something for the first time.
Prototyping.
Yeah.
Yeah.
Yeah.
Or like you don't even know if you like Codex yet.
You know, you're just a new user.
Like maybe you just want to use the CLI or something.
Right.
And then maybe
then you're using it and you realize, hey, like, I want all this, like, cool parallelization and all this
stuff. Let me have this run in the cloud. And you set up the cloud environment. And from then on,
like, you should still be able to, like, interface with that in the CLI if you want,
except now it's running a cloud environment. So it's more powerful. Yeah. So I think we kind of
want to construct that and bring these things together. But obviously, we're in this temporary state
of they're completely distinct. Yeah. I think, so it's interesting hearing you talk about how
there was this evolution from, like, the moment where you
were using the tool as this like very precious first iteration tool where you put a ton of sort of weight
and context into it hoping to get back a really useful answer the first time around. And then there
was an aha moment where you're like, actually this is more like a slot machine because other modalities
in AI have played out very similarly. So this was the case with image models, for example, right?
Two years ago, people were trying really hard to get the first version of image models, which were like
GANs, you know, generative adversarial networks, even pre like Stable Diffusion, to produce useful
sort of coherent images.
And they just weren't there, right?
They would produce these like artistic renders,
which were great for like artistic exploration,
but they weren't sort of useful
because they didn't have the concrete coherence
of a graphic design,
a piece of graphic design, for example.
And then if you remember the first like era of
diffusion models like DALL-E and Midjourney 1,
they started to get more coherent,
but there was this trick that a lot of product people started using.
And David from Midjourney was one of the first to do this,
where he added four generations
in the Discord bot, not one.
Because the idea was, the insight was like, this is a slot machine.
This is a stochastic process.
And you never really know which one the user is going to like best,
especially for a super subjective domain like art and like images.
And so human preferences is super subjective.
So let's just give them all four and we'll figure out which one they like.
Now, over time, if you collect enough human preference,
you can kind of nudge the distribution to be more aesthetically pleasing.
Or you can nudge it to be more like better typography or whatever.
You can nudge these distributions.
but by and large, the best UIs for image models are still ones that give you like four
outputs, if not more, and then allow the user to select the best of N.
And for a long time, people were like, that's going to work for these super creative domains
where like verifiability or accuracy is not an issue, like images, like video, like music,
audio.
But what's surprising is you're actually describing that same thing for a pretty verifiable domain like coding.
Because at the end of the day, it sounds like
there's still enough stochasticity in the sampling of a model, even as it gets better at reasoning,
that it makes sense to try, use it like a best-of-N machine.
And, you know, this has led to the, I guess, a popular set of critiques against reasoning models
that, like, they're not, you know, RL from verifiable rewards doesn't actually introduce new
capabilities.
It's just really good at pulling out capabilities that are already in the model.
It's really good at sampling.
Do you think that this is just an interim awkward phase where, like, yes,
best-of-N is better at getting sort of the right answer from the existing model.
It's not adding new capabilities yet.
But where we are going a year from now,
there will be actually new capabilities that come from running verifiable, you know,
RL on all the Codex usage that is about to happen from users.
Where do you, how bitter-lesson-pilled basically are you roughly on that dimension?
Yeah.
I mean, basically, I think an unsolved problem.
And it's a, it's both a research and a product problem is like,
how do we steer agents?
Right.
That are working independently.
And you know, you're talking, you mentioned like, hey, like, is Best-of-N there to, you know, so the model has more shots on goal, basically, to, you know, to sample correctly.
And I think, you know, that might be part of it.
But actually, one of the things we've learned working in Codex is that, well, the human also doesn't know what they want.
Right.
Right.
And, you know, so if I ask you to fix a bug, like, there might actually be four reasonable ways to fix that bug with sort of different architecture implications.
And I might, I haven't explored the solution space myself.
that's why I'm delegating this.
So I kind of want to know what the ways are.
And then I want to, you know, maybe I would pick the one that the model thinks is best too.
But it's helpful for me to see, like, maybe that sucks in some way.
Yeah.
But it's helpful for me to see the other ways that have like larger tradeoffs.
Right.
So then be confident in the right one.
Yeah.
So that's for like fixing a bug, which is like a very verifiable type thing.
If I ask you a model to like, you know, the classic example, implement TikTok or something.
Right.
You know, I might not know what I want either.
Like maybe there's different styles and different, like, approaches
you could take at various steps along the way.
Right.
Right.
And so, you know, it's kind of funny you were talking about, you know,
generating four images and seeing those in a grid.
And like in my mind, like for a front-end change, you could totally imagine a UI
for best-of-N there, where it's like the model does some work.
And then we like run the stuff.
We take, you know, the model in its environment runs the app and then like takes four screenshots.
And you actually just like have this like similar curatorial UI.
Right.
It's like just pick the one you like most.
We had Rick Rubin on the podcast a few weeks ago.
And Rick's a legendary music producer.
and he recently used Claude Code to create a new vibe coding book.
And so we were talking to him about how he, what was his observation about how creating
with AI, how is it creating with AI code gen tools different from creating music?
And he was like, oh, no, it's the same.
It's like going into a studio and he was talking about this story about, you know,
going into the studio with Johnny Cash and watching Johnny just pick up a guitar and start jamming.
and often the process of creating a great song
is you just pick up a tool like a guitar
and then you just do four different iterations
in completely different directions
and then you usually have a creative partner like a producer
or somebody going,
nah, that one sucked, go this way.
And it's that constant sort of best-of-N process
in the process of creating music
that often results in the best output.
And often the quality of the end song
is a determinant of the taste decisions
you make along the tree
of best of N.
And so what's giving me hope about hearing you talk about it is if you read the Hacker News
thread, for example, when you guys launched Codex, somewhere down, I forget, about halfway
down the page, was like a tree of discussions about how this means coding is going to get
much less fun because all of the interesting parts are being delegated to the agent and all the
human has to do now is just sit and review.
But actually what you're saying is there are parts of the workflow where you get to almost entirely
offload the plumbing parts of software engineering and focus on the taste exploration, which is
sometimes the most fun part of software engineering, right? You're creating a front-end UI.
Or even when you're specking out like a really great schema for a database, you know, some of the,
the most fun times I've had is when I'm sitting with an infra engineer and we're speccing out
the schema and like you go down one spec with, you know, a bunch of pseudocode and you realize,
actually that's not the right one, but it gave you an insight that then allows you to try another
schema out. Is that where you think we go? Is that the silver lining or are we actually destined for
world where we're just all reviewing PRs and all the creative parts of software are gone?
Totally. Yeah. So this is just opinion here, but I think you're right in that coding might be
a little more painful for some number of months because you have to do things like environment
set up. Right. These are the teenage years. Yeah, these are the teenage years. I think to be real, like that's
true. Maybe you don't get to write as much of like the code yourself right now. But I think we will get
to that more exciting place pretty quickly
because it turns out
environment setup is probably something
that an agent can also massively help with.
And we can close that loop
where you're not comparing
four diffs or something like that,
but we've figured out the interaction model
with the agent. So you're kind of like making decisions
in a way that feels more like talking to another human
who's just like really smart and fast.
And then also that you're making these decisions
not based on like reading like raw code
in the case of front end at least,
but like maybe you're like making decisions based on the outcomes, you know, like in the case of front end, like, you're just choosing screenshots or like clicking around a preview or like if it's back end, maybe there's like some tests you agreed on and you're just like looking at test outputs to sort of decide.
Right.
The other thing that's interesting is that, well, if you were to guess, let's say I'll give you a few things that people use Codex for and I'm curious what your guess would be the most, like the biggest ones are.
Like let's say it's like building new features, asking questions, planning, debugging, and fixing.
What do you think people would use Codex for more?
I think they would like to use it for debugging.
They probably aren't using it yet for that because there's often,
my knee-jerk when I'm using an agent is that it just doesn't have enough context to fix.
For routine tasks, like, you know, some piece of boilerplate React is broken,
like debugging is totally fine.
But I find I use it more and more for well-defined, well-scoped, well-contained tasks,
like create this new UI element that does blah
or a refactor that's like
where the atomic unit is very well constrained.
But I'm curious, what do you actually see?
Yeah, I mean, so my intuition was that people would use Codex
for fixing bugs a lot because, you know,
bugs are somewhat well-defined-ish.
You know, you can kind of tell if it's fixed.
You might even have like some logging data or telemetry data
that you could just paste into the model
and it's excellent at fixing it.
Like those are some of our earliest delight moments
and we're like dumping in the stack trace and then just...
And it just figures it.
Right.
But actually, by far, the thing that people use Codex for is building new features.
And I don't know, that was just, like, slightly surprising to me because, you know,
that is some of the most fun stuff to do.
Right.
If you read, like, you know, blog posts by folks who are using Codex in that way,
and it does look like they're having quite a lot of fun because of just the sheer speed they're experiencing.
Right.
The speed to prototyping has basically collapsed completely with something like Codex.
Yeah.
And broadly speaking, though, this is the explosion of vibe coding, right?
Right.
I think it's, that makes sense to me because
when you're prototyping a new idea, I find the most rewarding is when you actually see that if you can get to the first draft really fast and then kind of iterate from there, that's fun.
Sometimes the worst is when you have an idea, you kind of want to see it and then you lose steam between like firing up your IDE and seeing the first version of it.
Right.
Right. Compiling.
This is why hackathons have proven to be this like, I think magical sort of, you know, type of event where you get people together and, you know,
commit to getting over the hump of the first prototype. But in many ways, I think something like
Codex or, you know, broadly speaking, really good coding agents have turned every day into a hackathon
because they've collapsed the energy you need to get over the hump of all the plumbing, all the
environment set up to test an idea. When I was at Discord, we used to have this ritual across the
company that was an annual tradition called Hackweek. And some of the, where the entire company
would just stop for like a week. And it wasn't just engineering. It was,
product, marketing, sales, ops, the entire company could hack on anything they wanted.
And some of the most enduring and popular features that made it into production, the company
over the years came from hackathon projects. And it begs the question of, well, if there's a whole
team called the product and engineering team whose job it is to ship great features, why did it
take this like special thing called a hack week to produce such great features? And there is
something about when you reduce the cost of prototyping new ideas, you end up getting, you end up
getting things that don't make it through the usual PRD flow.
And it sounds like that's what a lot of users are using Codex for now,
is like that first, to reduce the time to magic, essentially,
the time to first prototype.
Let's change tack for a bit.
Because there's this elephant in the room, right,
which is that if, you know, Mark famously wrote an op-ed in 2011 or 2012,
which is like, you know, software is eating the world.
And after I saw that chart, you mentioned,
of the GitHub merge success rates of AI agents starting 35 days ago,
hitting 80%.
And as of this morning, the volume being
350,000. It sounds like AI is eating software engineering. Does it even make sense to
study software engineering anymore to get a CS degree? If you're a freshman at Stanford today,
or just a freshman, you know, somebody graduating high school and you're broadly interested in software,
does it even make sense to major in CS? So my take is that it's two things. First of all,
I think still a great time to major in CS. I think there's like going to be so much more software
created and therefore so many more software engineers needed. But I also think,
figure out how to be using AI constantly while you do it.
And hopefully you're at a university that's like very forward-leaning.
And so they're kind of embracing it.
You know, I hear about policies like, hey, use AI as much as you want,
but you just have to say how you use AI as part of your assignment.
Right.
It's great.
Right.
If you're at a place where, like the main place where I would be worried if I was a student right now
is if I was studying CS and my college didn't allow the use of any AI,
because then I would just feel like I'm like falling behind.
Like, it'd be like if you went to college, but you were only allowed to write assembly
and you could not write C, you know, back in the day.
Right.
That would just be deeply worrying, I think.
Right.
But yeah, my take is we can do, like you were talking about this, right?
Like, we can do so many more things now.
And, you know, we hear this from customers too, like, and from users.
They're just like, hey, like, I would never have bothered doing this before,
but I threw the idea into Codex just for the sake of it.
Right.
And I do this all the time.
And, you know, a lot of the time I do that and then I see the output.
And I'm like, I just still don't really care to do this.
But then sometimes this thing that they would not have even bothered doing,
Codex either straight-shots it or gets it to like 90%.
And they're like, you know what?
I'm excited enough to do the last 10% here,
just get this merged.
And then this thing that would never have happened,
now it happens.
Right.
Right.
Some of my favorite examples,
like internally are like when people build like new internal tools
that accelerate the rest of their team.
And like it's the kind of thing like someone's complaining in Slack.
Like I wish we had this tool to like, I don't know,
look at these logs in a better way.
And they're like,
no,
it just can't be bothered.
Everyone's too busy.
And then now you have this like great parser.
Right.
So I think that there are so many places where we could use software
and that software could be more personalized
to small groups or even individuals
that we just are missing out on.
And so, yeah, now I believe that, like,
with just the acceleration we're seeing in software development,
I think we'll have many more of those tools existing,
and they'll be much cheaper to maintain as well.
Like, that's the thing we're on the tip of now as well,
where you're starting to see AI agents getting plugged into,
you know, like GitHub or like Slack or, you know,
Linear has the agents feature.
And I think that that will make it much more efficient
to actually have some, like,
app out there and running. Similarly, you know, even we're seeing this is not Codex,
but we're seeing products out there that will like write the app for you and then deploy it for
you as well. And so it's just like all in one. Full stack basically. Yeah. So it's just like,
it's long story short. It's much easier, I think, to build software to deploy that software and to
maintain it. I think that's just going to, we're just at the beginning of this change.
So let's talk about that. It's been 35 days now. As a product lead, you've had a chance to actually
see, you know, the best laid plans rarely survive contact with reality.
So now what priors have you updated the most and what comes next?
Where does Codex go in the V2?
Because this was just a research preview.
But what are the biggest improvements and what's the shape of the arc of the product in the future?
Yeah.
So I think there's one sort of conviction that has deepened and then one prior that's like being slightly updated.
So the conviction that deepened is that this form factor of an agent working on its own computer in the cloud is the future and is incredibly powerful and worth figuring out how to get right.
So we're continuing to invest in making that environment set up faster or making like performance just
First time user onboarding.
Yeah, first time user onboarding.
But also just like, you know, once you're running, like things should just be faster.
Sure.
Speed is actually always the underrated feature.
And is that, are the biggest gains in speed you think going to come from doing things like model distillation?
Or do you think that comes from just better orchestration of tools?
Honestly, I think the low-hanging fruit is just like plain old deterministic like DevOps-y type stuff.
Okay.
You know, like right now, we clone.
your repo every time you do a task, even if it's a follow-up.
And then we run your setup scripts from scratch every time.
And so if you have a large repo and a lot of dependencies to install, like that thing is slow.
Okay.
You know, we start with caching.
Yeah, we can just like, we can fix these things, right?
And again, like, I love that we didn't, I love that we shipped without those things.
Yeah, the V0.
Yeah, exactly.
So there's like that.
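The caching fix being described can be sketched roughly as follows. This is an assumption-laden illustration, not Codex's infrastructure: the `EnvironmentCache` class and its `build` callback are hypothetical, and the idea is simply to key a prepared environment on the repo, the commit, and the setup script so that a follow-up task on the same inputs skips the clone and the dependency install.

```python
import hashlib
from typing import Callable, Dict


class EnvironmentCache:
    """Reuse a prepared environment when repo, commit, and setup match.

    build(repo_url, commit, setup_script) is a hypothetical callback that
    does the slow work (clone, run setup) and returns an environment id.
    """

    def __init__(self, build: Callable[[str, str, str], str]) -> None:
        self._build = build
        self._cache: Dict[str, str] = {}
        self.builds = 0  # counts how often the slow path actually ran

    @staticmethod
    def key(repo_url: str, commit: str, setup_script: str) -> str:
        digest = hashlib.sha256()
        for part in (repo_url, commit, setup_script):
            digest.update(part.encode("utf-8"))
            digest.update(b"\0")  # separator so adjacent fields cannot blur
        return digest.hexdigest()

    def get(self, repo_url: str, commit: str, setup_script: str) -> str:
        cache_key = self.key(repo_url, commit, setup_script)
        if cache_key not in self._cache:
            self._cache[cache_key] = self._build(repo_url, commit, setup_script)
            self.builds += 1
        return self._cache[cache_key]
```

A changed commit or an edited setup script produces a different key, so stale environments are never reused; only exact repeats hit the cache.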
And I think, like, I mentioned Best-of-N.
I think thinking about how to make the most, like, basically, how do we spend, like,
more compute for you on your behalf?
Okay.
is like very exciting.
And then how do we bring this closer to the tools you work in?
For me, the interface in ChatGPT is actually like very functional,
but it's like not where developers go when they want to write code, right?
Like where do you go when you want to write code?
Either your terminal or your IDE, right?
Similarly, like, where do you go when you want to like triage issues?
Well, like you go to your issue manager, right?
And so forth.
So I think we want to bring it much closer to the tools people work in.
And eventually, you know, the goal is to get to an agent that is like basically a teammate.
And it's like seeing what's going on in your team and like picking stuff up
for you. Right. That's what I'm going to, is this just, is Codex just going to be a Slack teammate?
I can just, like, I kind of think of it as like, it's just, it should be sort of a ubiquitous
teammate. Right. You know, it's just in your tools, in the tools you want it to be in at least.
Right. You know, and we'll start very gentle, just like, hey, you decide when Codex does work.
And then over time, we'll figure out how for it to like kind of like more proactively chime in.
And, you know, we had a jam about this recently. Like, you know, it's kind of an interesting point.
I don't think we want it to proactively
like DM you all the time every five minutes
when something happens.
So I think there'll be some evolution of tools
where we come up with like if you
if anyone here has played video games,
you know, there's always like press X to like
and if you're next to a door it opens a door.
If you are next to some object, it picks up the object.
It just, it's contextual action.
Yes, right.
Yeah.
Contextual proactiveness.
It waits for the hint that you want to do something
and then jumps in.
Yeah.
And this is kind of like when we're getting to like
interactive agents.
I think that's just like a big open area.
But it's like how do we have
agents who understand what your team is trying to do and respond to like stuff in your team
workspaces. Right. And then how do we have an agent that understands what you are trying to do?
And it's almost like this agent is like both in all your tools, but like sitting next to you while
you're working on your computer and like kind of just being like, oh yeah, like I can help you here.
Right. So that's like actually the conviction that is deepened, right? We're like, yes, all of this
works when you give it its own computer and we need to figure out how to create this infrastructure
for ecosystem and environments and like make that safe and so forth. Then the other thing though where there's a
bit of an update is like just thinking about how people like learn to use these tools. I think right now
there's some things that are pretty clunky. Obviously we've talked a lot about environment setup.
I think also some of the things that you know you have to do like updating AGENTS.md is very
manual and you have to like commit to your repo to get that context to the agent. And so for me,
I'm just thinking a lot now about like how do we make this like way easier to try.
Reduce the cognitive burden of the onboarding, fewer decisions to get to the magic one.
Yeah. Okay. Got it. What has it changed most
about research and the frontier of where frontier models are going, right?
As in your mind, does this mean that, is the efficacy of how good Codex is as a post-trained
version of o3-pro at using tools, at like plugging into this workflow?
Does it make you go, well, it just makes sense to spend an unlimited amount now of compute
on post-training models to get better and better at being autonomous coding agents?
or do you think there's some marginal plateau point at which you go,
you know, after this point, there's not really much the user is getting
from better and better tool usage.
You know, what is the, how should, how does this change the trajectory of progress
when it comes to the frontier of research?
Yeah, that's a really interesting question.
I definitely don't know if I have the answers to this.
But what I can say is that one of the best parts of doing, you know,
an optimized version of o3 was that we got to make a bunch of like hybrid research product
decisions very quickly.
And I think that is incredibly exciting for thinking about how to make something useful.
So, you know, if I imagine we would have had this idea of like, you know, it's like really
important that the agent knows how to write really good like PR descriptions and, you know,
tests code in a certain way that's used to working in varied environments.
And, you know, when it runs some tests, it doesn't just tell you that it did, but it cites deterministically,
like in the logs, the output so you can verify that yourself.
Those are a bunch of like product ideas, really.
Right?
And they're not, like, those ideas I just mentioned are not like higher model intelligence,
nor even really a higher ability to call the right tools.
It's just this understanding that I liken to the first few years of job experience of a software engineer.
Right.
Like you start, you have O3 as like this incredibly precocious college grad, like very smart,
but like doesn't actually know how to be a software engineer, just knows how to code.
Right.
And like there's some like transfer.
So it kind of knows a bit of software engineering.
Right.
And then like that's fine.
But you can make it way more useful for, you know, the human trying to use the agent.
If it has those first few years of job experience.
Right.
So I think that there's no reason that
that knowledge couldn't be...
Infused into the model.
Exactly.
I've dreamed and agreed.
But I think that having the freedom to like go and like explore these ideas like relatively
cheaply and see what sticks and what doesn't, it is really powerful.
So frankly, like I don't really know to what extent it makes sense to like have like a bunch
of custom post trains for like absolutely everything that matters.
But I think for something as important as like coding.
Right.
I think we're willing to say like, hey, for coding, we really care about
this, let's just do everything we can to have the best product. So we actually did a similar thing
with GPT-4.1, where we basically were getting a bunch of feedback from developers. We said,
okay, let's go talk to a bunch of developers, like make custom evals for them.
Right. Deeply understand what our model is great at, what they wanted it to get better at. And then
we release the custom model. Right. And then the goal should always be, okay, whenever we do this,
like we have 4.1, okay, the next version of our like sort of general model. Should just integrate
that. Yeah, should integrate everything. Right. Yeah. We have friends who are at different levels of
AGI-pilled. Did working on Codex update your priors on, you know, 2027?
Okay, so I'm very AGI-pilled.
I'm aware.
My like slightly joking, but I can't tell if I'm joking 100% take, is that if you took a model
today and ran it in the right loop, we're basically there.
Would it have rights? That's the question I sometimes wonder.
And should they be able to turn themselves off and go take a vacation if they want?
Yeah, so that's kind of where I am.
Are you pro labor rights for O3 Pro?
I am pro thinking about it.
You know what I mean?
I don't think we're at a point where it's obvious, but it sounds kind of crazy.
But I feel like it's a question worth considering every now and then.
Or more concrete.
How far are we from full recursive self-improvement?
Okay, okay.
Sorry.
Back to you.
Basically, I think working on Codex made it very clear how we can have agents just like
omnipresent in our lives being incredibly useful,
because what I realized is that obviously we need to do a lot of model improvement,
but I also saw how there's like just concretely a lot of like normal product work to do
to set them up in the right way.
And then that normal product work will then like pull the models into,
you know, into being more and more useful.
So I think like by 2027, like agents will just be absolutely ubiquitous in the workplace.
I think in personal life, it might be a little bit slower because in personal life,
there's less of these like constant pipes of like signals of things to respond to.
The reason this matters is that if you think of ChatGPT, you just have this input box, right?
And like most people, including myself, probably use it for like 1% of the things that I could use it for
because I just don't even know to use it in that way or I don't prompt it, right?
That intention just isn't there yet.
Yeah.
But like it's similar.
Like, imagine you hired a teammate and then the only time they do work is if you specifically tell them to do a task.
Right.
Then they would just be very underutilized, right?
But what makes a great teammate great is that you kind of tell them what their job is and they just start responding.
Proactive.
They're self-starters.
Yeah.
So I think like that is the
big unlock for agents at work, because there's like streams you can subscribe them to, like, you know,
your communications tool, right? And in personal life, I think that might be a bit slower, but we'll see.
Do you think that, well, actually, what percentage of all GitHub PRs do you think would be written by
an AI agent 12 months from now? That's a really tough question. I sort of change my mind every time I
answer it. So maybe a slight cop-out, and I'm curious for your answer too, would be that there will be
teams for whom 90% of their PRs are written by agents.
But I don't know how quickly that will like spread.
You know, this is a common thing with AI.
It's like we live on like, you could call it in the bubble.
You could call it on the cutting edge.
And so we're just like adopting everything rapidly.
But then it takes a while to like diffuse out.
Yeah.
So but I think the cutting edge will be like 90% on teams.
Right.
No, I think that's right.
I think people often talk about the coding economy as one homogenous economy,
and the reality is there's multiple sub-economies, but there are at least two big economies,
which is there's the, for lack of a better word, you know, there's the digital native companies,
right? These are technology companies, usually born in the post-internet era,
where either the founders or the vast majority of the team have grown up natively
understanding how to do modern software development. The default assumption when a codebase is
initialized is that you're going to use Git for version management.
There's going to be branching, there's going to be a good review process and so on, like sort of modern software teams.
And then there's the vast majority of actually the world's mission-critical code, which we talked about earlier, which is Fortran, COBOL, like running on-prem in these massive ETL systems like in Virginia or in parts of Europe that were set up post-World War II or in the Cold War with a default assumption that everything had to be locked down.
Often these code bases are running big parts of critical infrastructure like the railway system of an economy
or the air traffic control system. So they're very high impact and high stakes code. They're not modernized
whatsoever. And they're constantly rotting because of technical debt. And I think one of the most
exciting things is that the one-time migration cost to modernize these code bases now has collapsed
precipitously because agents can do so much of the plumbing work for which you'd typically hire
some system integrator, you know, Accenture, Deloitte, for a 10-year contract where they'd come in.
You know, this is part of the founding thesis of DOGE, right, which is like just vast parts of
the American government's IT infrastructure are like super legacy.
And we're getting overcharged as a country to like modernize it.
And if agents go in, as long as we can get enough distribution, you know,
training data on Fortran and COBOL and so on, then the one-time like upgrade costs should fall.
And ideally, this is my hope,
is that tools like Codex modernize that entire sort of legacy code economy.
And then we get to upgrade everybody onto like modern software engineering, right?
It's tending to happen, from what I can see now, in countries that get to leapfrog legacy
infrastructure because they're starting from day one.
And it's very similar to like civil infrastructure, like roads and highways and so on.
So if you go to a country like Singapore, which is a much more modern country because it's
barely 60 years old, you know, it only got its independence in the 1960s, then
they didn't have to build the roads and so on that Britain did and then upgrade them all,
which is like refactors suck and they take way more time.
If you could just start from sort of a clean slate, it's much easier to modernize.
And so what I'm finding is that it is easier for countries whose IT infrastructure
is just newer to adopt agents.
They're still legacy.
I mean, the vast majority of it is still running on-prem and it's not modern,
you know, it's certainly not TypeScript, but it's easier to upgrade from, you know,
systems that were written in C++ to Python than it is to go from COBOL or Fortran
or whatever to Python. But if there's anything that makes me super excited that these economies
will merge, it's autonomous agents, right, doing all the plumbing work and doing it for a fraction
of the cost and time that these mega, you know, sort of consulting companies have started to
charge. And frankly, many of them don't end up ever completing a project; it just turns into a boondoggle.
So I'm very excited about that part. And that's why I think
AI is going to eat software, because software ate the modern sort of startup economy and
digital economy really fast.
But there were other parts of the world, especially mission critical industries, where
there was like a one-time software upgrade largely driven by military scenarios.
And then we never modernized all that infrastructure since then.
So that's why I think the cybersecurity side of this, the safety evals that you're talking about,
I think over time will come to be seen as having been very prudent because the thing that
puts all of that adoption at risk is having like one terrible incident that then changes the
risk posture for a bunch of enterprises. I have a question about that. Actually, I'm kind of curious.
So when, you know, a lot of the larger companies that we talk to, their use case is very different.
It's not like building new features, which is what we see like most of our users using us for,
but it's refactors, large refactors and replatforming. So I'm curious, like, you mentioned some of
these companies or governments or systems that you're thinking about kind of had this like one-time
upgrade for military reasons and then never.
I am curious if there was like a specific reason that they all want to upgrade now that you're
seeing, or if actually we're still kind of in the state of like there's no forcing function. So like although
it's easier to do, there's still no impetus. Right. So for sure, geopolitics has
accelerated like adoption for a bunch of governments right in Europe. The Ukraine crisis has forced
a lot of governments in that region to go, wait a minute. Like our air traffic control systems,
especially in an age of unmanned sort of drone warfare, it is
crazy that when there's a bug, we need to call in some legacy contractor who built it like 20 years
ago to come and do some on-site maintenance, right? That's been a wake-up call. And so you're seeing
these like, there was a, there's sort of an $800 billion defense bill that Europe passed,
you know, six months ago. And the most urgent adoption is certainly happening at the intersection
of like legacy code not working and battlefield needs and drone warfare, code bases that interact with
air traffic control systems, with like UAV planning, with mapping. Those are the code bases that
are most urgently being upgraded. I think in other parts of the world, there's just a desire to
modernize. So if you look at the UAE or the Kingdom of Saudi Arabia, we talked about how the UAE
is rolling out ChatGPT to the entire country. I think that's coming mostly from a
top-down directive to just embrace the like AI future that's coming rapidly. Basically, the more
AGI-pilled I find the head of state is, the more rapid the adoption is, certainly for ChatGPT-like
tools, but also coding. That's not driven by some like military function. But then there are
the regions like Europe where, for sure, geopolitics is accelerating all that. And you know, you
and I've talked about this before, but those scenarios often need a slightly different,
like, the ergonomics of code are different. They're very on-prem. They require
a level of air-gapping from cloud systems that like the modern software engineering workflow doesn't
lend itself to. And so we may see this like bifurcation of Codex as a family. Like I'm curious, over
the next few years, you know, the military, or let's call it the
critical industry needs of modern autonomous coding agents might require like some pretty basic
architectural differences from the, you know, let me ship the latest and greatest of our next
version of our software product on GitHub. I think it, I don't think it's a coincidence that
the last time we saw a huge adoption in IT infrastructure around the world was the Cold War.
And now we're, you know, living through some pretty unstable times, both in Europe and
the Middle East. And I think that is causing governments. I think the U.S. has always
been somewhat forward-leaning posture-wise on adopting the latest and greatest technology.
We make other governments look, you know, rightly so like dinosaurs. And those folks, nothing
forces dinosaurs to wake up like an impending comet hitting them and impending extinction. So that's
definitely happening. Yeah. I think it's interesting, like for me playing this through my mind as we,
you know, as we're working on Codex, I do think there needs to be an answer for like, you know,
how do you use this agent in an air-gapped environment? Right.
How to use this agent, like, you know, there's critical industries.
And then there's just many, like, large companies who have, like, incredibly stringent security needs.
Right.
It's kind of the way we've kind of thought about building is the most important thing is to, you know, build to AGI, right?
And then distribute the benefits of that to all humanity.
And so we're kind of, like, leaning towards the, like, okay, the primary thing is the, like, fully self-contained, you know,
the thing where we host it for you, you know, contain the environment and everything.
And it's kind of in parallel, we have this, like, side track of, like, okay, and, like, how are we going to make sure that, like, today, you know, you can use Codex CLI.
You could use that in a, I guess, relatively air-gapped way.
Obviously, it needs to sample the model.
And then as we build new capabilities into Codex and ChatGPT,
how do we just make sure that if you're running something like the CLI,
you can, like, get the most of all, you know, the capabilities as they...
Without a trade-off.
But it might be a little bit like, okay, we build it in the like fully self-contained system first
and then we push down.
Right.
You know, there's this narrative violation.
I keep hearing about, I keep hearing from folks in San Francisco that, oh, you know,
OpenAI is all in on consumers.
Because the rise of ChatGPT as a consumer companion has been so extraordinary.
But clearly our entire conversation is an exception to that story, right?
Because almost everything we've talked about has been focused on developers and governments.
So why is that misconception there?
I think ChatGPT is in fact an amazing and large business.
And it's super cool to work at a company that is like really distributing AI.
Right.
To like a giant number of people.
But yeah, we are incredibly serious about coding.
And in fact, we always have been, you know, since like the first Codex product that was powering GitHub Copilot.
Right.
And all the way through with our models.
I will say, though, like, I think people are noticing, like, we are getting, we've always been, like, very serious about coding models.
And we're now getting, like, very serious about, like, coding products as well.
Right.
And so, like, whereas before we had these amazing models, you could use them in, like, whatever tool that you want to use them in.
Like, now definitely, I mean, a lot of the stuff that I'm working on is thinking
about like, hey, actually, there's a lot of, you know, as we build agents, there's a lot of value we
can provide by not only thinking about the model, but also thinking about how the model is useful
to you in a certain form factor. And actually the form factor really affects everything.
And so, yeah, we're spending a lot of time and effort building like even better coding models
and even better coding products, particularly focused on agents, but even beyond.
So you've been a founder before. One of the scary things about hearing OpenAI
going from being serious about models to also products is if you're a founder in the space
and you want to build something interesting in the coding space, there's this tension looming,
right, which is: is anything I'm going to build just going to be subsumed by OpenAI's products
next year? So how would you think about that? If you were leaving OpenAI and starting a company
today, what would you do and what would you not do? Okay, so if I was leaving OpenAI today,
probably the sort of the market change that I would be thinking the most about, or one of them, would
be agents. Okay, great, not super controversial. Then I would think, okay,
like we were talking about earlier, an agent is basically like a really good model that I'm probably not going to build at my startup.
And then I need to give that model access to tooling in an environment.
And then I need to like figure out what tasks it needs to be good at.
And then obviously give it to customers.
And the interesting thing about it is that those latter three things, right, the tooling, the environment and the task distribution,
well, I guess, and the customer.
So four things, whatever.
All of those things are very much based in like knowledge of a customer.
And those aren't things that like OpenAI is going
to like, you know, generally do for like every industry, right? Like coding happens to be of particular
importance to us, just broadly, but even, you know, within coding, there's a lot more
specific areas. So just to really spell this out, you know, if you think of the environment,
like, you know, training Codex, it was like really non-trivial to, like, figure
out how to give the model different environments to train
in, you know, with like different kinds of realistic dependency setups.
Right. Varying amounts of dependencies even installed, like varying amounts of unit tests. Like,
actually, the startup that, you know, I sold to OpenAI was, like, Multi; that's how I joined.
And we had very few unit tests on a lot of our code.
And it's like kind of funny and like that.
But that's realistic.
That's like a real startup code base.
Right.
So actually, if you wanted to do that for like some specific function, I don't think it would
be easy for us at OpenAI to like create that many environments for the agent to use and train on and then use at, you know, test time.
So that's hard.
And then I think the task distribution is also really interesting.
Like Codex, you know, we have a lot of intuition for what a
good coding task could look like and like kind of where to draw the boundaries, right? Like today it's like
provide a prompt and then you get an answer or a diff that you can turn into a PR. But like those are some
decisions we had to make around what the boundaries of the agent are, right? And then we had to
like go collect a bunch of those like types of tasks or like invent those tasks, so as to, again,
train the agent how to do it and evaluate how well it was doing. So I think that again, for a very
specific industry, I don't know, I'm trying to come up with an example, let's say accountants, but in a
specific region of the world where there's like a specific set of rules. Like they might have like
very specific tooling that's like provided by the state for doing that accounting. Right.
There might be very different kinds of like base knowledge and documents available, and
then like the way you need to do the work might be different. So I mean, I think it is a very good
question and I'm not 100% sure what I would do as a founder right now. But I think that I would
try to lean really hard on like very good customer knowledge and less hard on like product.
If that makes sense.
Right.
It sounds like the last mile connective tissue between an industry where you have deep domain expertise becomes more valuable,
whereas the first mile of, like, all the general purpose parts of an agent's flow, you basically should assume you should offload that to OpenAI.
Yeah.
Yeah.
And then I think the other thing I might do is I might keep my company really small.
So rather than like, you know, like doing the classic like hyperscale thing, I would try to use agents as much as possible and keep
the company as small as possible so that we're just agile and nimble.
I guess this is probably like just that sort of age-old advice.
Well, let me push back on that for a second because it turns out that in many industries,
serving the customer deeply like you're describing, often requires a human touch.
That might be sales.
It might be solutions engineering.
It might be customer support and so on.
It does sound like what you're saying is you would certainly keep your engineering team very
small and minimal.
But if servicing the domain required more of the human touch, then you would, you know,
you would scale, because often my experience is that getting an agent to actually
work in the enterprise and the legacy industry requires going in and doing a fair amount of
integration work, at least up front. So maybe it's a setup thing, right? Up front, you parachute
in somebody who understands how to get an agent up and running. And then you can leave because
it's really just for the customers like consuming teammates, like you were saying earlier,
but maybe where you do need people is that integration point. Now, ideally over time, I guess,
you're saying the product should just get good enough at integrating into the customer's
environment, but sometimes for regulatory reasons or otherwise, you just need a human there.
You know, are there some industries that, like, clearly feel out of bounds for
OpenAI because that just is not on the path to AI, but that still would interact with
coding agents?
First of all, it's a good point on like the actual like integration work probably requires humans.
I would say, yeah, especially if it's in person type integration work or like complex, then I think
you're spot on there. Industries that are out of bounds. I think it's like, it's like a hard question
to reason about because like we are building like general products. Right. And so you can like kind
of use like ChatGPT to answer any question like already today. So I wouldn't say there's like
bounds, but it's more like focus. I would say, you know, right now at OpenAI, we're very focused on
like serving consumers generally and like being really good at coding. Right. You know, there's some other
things too. So I would just say, yeah, the more, maybe we should just not even have this answer in
the podcast. Yeah, we can take this part out. Probably.
I'll give a 10 minute time check as well as this. Perfect. Great. No. Oh, great. Yeah. I'm about to wrap.
He stopped me. That was a good one. I'm like, I don't know, man. I'm not a founder right now.
You don't want to speak on behalf of Sam on why world domination is not complete and total.
I'll take that part out. So, slightly different topic. A question I get from a lot of parents,
especially with kids who are approaching the end of high school and in that phase where they're
picking careers or thinking about what they want to do is this immense anxiety, especially for folks in
tech for whom, you know, for the vast majority of the last 20, 30 years, it's been a
fairly stable assumption that like if you were smart and generally oriented towards
technical fields, if you went and studied software engineering, you'd have a pretty great career
and safe and sort of rewarding time in the knowledge economy.
And it seems like coding agents like Codex are taking a violent hammer to that assumption.
How would you advise, you know, friends who are parents who are trying to figure out how to
help their kids choose a career for the future?
So I'll answer this with humility because I don't have kids, but I do think about this.
And actually, I think my point of view would just be that the world has always been changing,
it's changing now, but it was changing before that. Maybe it's changing a little faster. But that's
the main thing to notice is actually the pace of change, not the specific change. And so like,
I think the most, you know, if I had a kid at late high school now, I would probably just be trying
to encourage them to just be like excited about whatever it is they're doing and like be incredibly
curious and constantly learning, right? Like I studied CS. Did you study CS as well?
I started with CS and then transferred to bioinformatics because I was more interested in healthcare.
Right. And you know, and now you do investing, right? And like I studied mechanical engineering and then I
changed to CS and now I work in product in, you know, in AI at OpenAI, but like the startup that
I'd started was not an AI company. So things are constantly changing. And I think the most important
thing is to like be agile, curious and like, you know, have some foundation that you can build
upon as the world evolves around you. So I think similarly, if I had a child in late high
school, I would just want them to crush whatever it is that they're doing. And it wouldn't really
matter what specific thing they've chosen. You know, I lean technical. So that would be cool. But like,
maybe even that is optional. And then I would just
raise them with the expectation that they'll probably have like many career transitions
throughout their lives. And having seen what you have with Codex, knowing what you do
about where it's going, let's say you were the chair of the computer science department at
a university, what would you do differently now versus before, when Codex launched? Well, one is you'd
allow kids to use the AI tools. But let's say you're thinking about the future of computer science
education and how that should be taught over the next five, 10, 15, 20 years.
What would you do differently?
Yeah, again, just opinions here.
But I think I would have, you know, like at Stanford,
there was a class where we wrote assembly.
I forget the name of that class.
That was cool.
We had one class.
CS-140, I think it was.
Yeah.
And then, you know, similarly,
I would have, like, a handful of classes where folks do things,
like, very manually to understand what's going on behind the scenes
and also to build the confidence that they can.
But then generally, I would move towards, like,
having students try to deliver some kind of, like, outcome,
be it like they've learned something
or they've built something or something.
Project-based learning.
Yeah, and then I would probably encourage them to, like, use these various tools so that they're
picking up the skills.
And, you know, I don't know.
This is just an idea in my head, but if we could help them kind of like speed run through
that arc, then maybe every quarter they're using a different set of tools.
And so they're, like, becoming, like, very mentally plastic in terms of how they get things done.
And I think that would be the best simulation of, like, what future work would look like.
I'm not sure.
What would you do?
Well, I teach a class, CS-143, at Stanford every year.
This year, we taught it in winter quarter, and we had about 300 students.
And I was, you know, thinking through what was a, in previous years, we had a midterm and, you know,
we had like problem sets.
And this year we decided just to have it be a combination of speakers who are CTOs or folks,
researchers in AI come in and talk about the infrastructure problems of building AI products
at scale.
And then we had one final project where everybody had to build an agent and ship it.
and they were all allowed to use any coding tools, obviously.
In fact, we gave folks some credits to Mistral models and Black Forest models,
and the founder of Cursor came by and kind of talked about the IDE and why they should all be using it.
And what was extraordinary, right, was it was so clear that the distribution of the final projects
followed this power law where the top four or five teams that really adopted wholeheartedly
the coding, the Cursor and the AI models and did a fully
sort of AI-assisted workflow for their final project,
like produced software that was like production-grade ready.
If I was still running the platform
org at Discord,
I would have totally shipped four or five of those
on the front page of the app store we had.
In fact, I sent some of them to the founders of Discord,
and they were like, we should probably ship this.
The quality bar was just extraordinary
for something they were able to build in a basically a 10-week quarter.
Then there was this sort of, you know,
usual sort of middle of the pack
that had made a half-hearted attempt,
but enough to get a good grade
to customize the templates we'd given them,
but clearly hadn't, like, asked,
what is something that now I can create
that I couldn't before,
now that I have access to extraordinary coding agents.
And then there was just, you know,
the classic sort of bottom of the class
that I think just didn't accept those tools
and think deeply about, like,
trying them, using them, learning with them,
developing a feel for, like, what they're good at
and what they're not good at,
and kind of turned in a final project
that would have been totally possible to build a year ago.
Why do you think they didn't want to use the tools you were giving them?
Look, it's hard to parse out from just a final project.
But I did office hours with a lot of the students every week.
And you could very clearly see the number one predictor of their success was their mindset.
It was just about, like, were they curious and hungry to learn outside of like a traditional textbook?
And look, some of them, some of the students just had a lot going on, you know,
being a college student is a stressful thing today.
And so I have a lot of empathy.
There's definitely this awkward moment you're describing right now
where a number of these graduating seniors
who are graduating with college degrees this year
started out as freshmen in a very different economy.
Right.
Right.
When they picked CS, the assumption was,
hey, if I like do well in the core CS curriculum,
if I get a 4.0 GPA and I do like one or two good internships,
you know, somewhere along the way, and I apply for a job, I'm going to get a job at a pretty good
tech company. That's just not happening anymore. And it might be because there's a set of layoffs or
some overhang from the ZIRP era, or it might be because a lot of engineering teams are reducing
their footprint of entry-level jobs. But I was definitely shocked by how many Stanford CS grads
were, you know, graduating seniors still looking for full-time jobs, you know,
come winter of senior year. And I think that's anxiety-inducing. It's stress-inducing.
That has bleed-over effects on, like, can you concentrate on this, like, project-based class when
a number of the students were also juggling interviews and were coming to office hours, when I thought
they were going to be coming to ask about, you know, the code, and were asking, like, for career advice,
which is totally fine. But I do think there's a transition phase right now, which is very, can be very
stressful for computer science students. And I think you're right, the faster they're able to
onboard to using these tools rapidly and realizing that the gap on the,
what they can create now is extraordinarily high,
the faster I think they're going to transition
into the new economy better
because I do think there's an expectation,
certainly for modern software teams,
certainly at OpenAI,
that like you're just fluent in all of these tools now
relative to, you know, four or five years ago.
It was crazy.
When I, you know, when we went through Stanford,
I didn't take a single class that required the use of Git.
Right.
Which is absurd.
Yeah.
Like I happen to like, you know,
pick it up in an internship,
but there's no class that actually requires you,
at least at the time,
required you to know how to use Git.
Yeah.
And so I think, I do think the computer science departments around the country have to recognize
that and change, and kind of make the changes you're talking about.
And my hope is that in the interim, you know, students will, won't wait around for
their deans and their professors to do that for them because you can just go and use Codex,
you know, for free.
I think the research preview is literally free.
Is that right?
Well, you have to, you have to have a Plus account or a Pro account.
But yeah, it's a good point.
Maybe we should do something for students.
Student licenses.
Yeah.
You know, I will say that like we, so we're hiring for Codex.
Please. What should I say?
If you're interested in working at Codex, DM @embirico on Twitter, it's E-M-B-I-R-I-C-O.
Yeah, I don't know if I'm allowed to plug myself here. But yeah, we're hiring.
But we mostly are hiring very senior, but we actually are, we decided that we're pretty
interested in hiring a couple of new grads. Oh, that's interesting.
Yeah. And so it's been interesting just looking at new grad profiles. And I totally feel you on the,
yeah, I mean, it's definitely a tough time to be graduating. I don't know if this is advice,
but what I can say is that when I look at new grad profiles,
for me, the thing that I take the most signal from is if they've built something.
Right.
And if they've built something that's linked from their profile and I can just like click to it.
Projects.
Yeah.
And, you know, like, it's just like a cool website.
Right.
You know.
Like grades matter much less now.
Yeah.
I don't even look.
Actually, now that you, I didn't even realize that I haven't looked at anyone's grades.
You know, like, I just like, because, you know, admittedly we're only hiring a few new grads.
Right.
But that is the single largest signal.
It's just like, what have you built?
And is there some way for me to validate that?
Like maybe it's because I can click to the website
or maybe you just have some stats on like how many people used it.
Right.
And then when I talk to them, I'm just like, yeah,
let's talk about what you built and how you thought about that.
So maybe that's somewhat helpful for folks who are looking for something.
You know, I kind of reflect on my journey here to OpenAI,
which I'm really grateful for.
And I view it as a privilege to be working here.
But, you know, when I look back to when we were working on the startup,
Multi, which was like not an AI company,
and we saw like ChatGPT come out,
and we started to follow all this LLM stuff,
I remember just feeling like, wow, like there is a chance that if we don't do this right over
the next couple of years, like my co-founder and I were talking,
there's a chance that we actually just end up like dinosaurs.
And so at the time, we actually made like a very explicit decision to like heavily
prioritize getting us and the entire company like ramped on AI things.
And to some extent, like, I don't know if I could have like gotten the job that I have here
at OpenAI if I was just applying randomly.
I think it's because we had built something that was interesting that we were
able to get that attention and have that conversation. So I guess if there's one takeaway here,
it's just like, just got to build. It's time to build. Yeah.
Thanks for listening to the a16z podcast. If you enjoyed the episode, let us know by leaving a review at ratethispodcast.com/a16z.
We've got more great conversations coming your way. See you next time.
As a reminder, the content here is for informational purposes only. Should not be taken as legal, business, tax,
or investment advice, or be used to evaluate any investment or security, and is not
directed at any investors or potential investors in any a16z fund. Please note that a16z and its
affiliates may also maintain investments in the companies discussed in this podcast. For more
details, including a link to our investments, please see a16z.com/disclosures.
