How I AI - Claude Opus 4.6 vs. GPT-5.3 Codex: How I shipped 93,000 lines of code in 5 days

Starting point is 00:00:00 Welcome back to How I AI. I'm Claire Vo, product leader and AI obsessive, here on a mission to help you build better with these new tools. Today we're going to bring you up to date on all the new coding model releases from OpenAI and Anthropic. In case you missed it, last week OpenAI released Codex, their desktop app for AI engineering, and the new model GPT-5.3 Codex, try saying that five times fast, and Anthropic released their response, Opus 4.6 and Opus 4.6 Fast. If you're new here, then you don't know, but when these new models come out, I put them through their paces. I test them. I test them side by side on the same task, and I'm going to give you my opinion about where they do well, where they fall apart, and which one goes where in my

Starting point is 00:00:49 AI engineering stack. Spoiler alert, I've shipped more code in the last five days than I think I have in the last month, so I think these are pretty fabulous models, but they do have their quirks, they do have their strengths, and sometimes they go off the rails. Let's get to it. This episode is brought to you by WorkOS. AI has already changed how we work. Tools are helping teams write better code, analyze customer data, and even handle support tickets automatically. But there's a catch. These tools only work well when they have deep access to company systems. Your co-pilot needs to see your entire code base. Your chatbot needs to search across internal docs. And for enterprise buyers, that raises serious security concerns. That's why these apps face

Starting point is 00:01:34 intense IT scrutiny from day one. To pass, they need secure authentication, access controls, audit logs, the whole suite of enterprise features. Building all that from scratch, it's a massive lift. That's where WorkOS comes in. WorkOS gives you drop-in APIs for enterprise features, so your app can become enterprise-ready and scale upmarket faster. Think of it like Stripe for enterprise features. OpenAI, Perplexity, and Cursor are already using WorkOS to move faster and meet enterprise demands. Join them and hundreds of other industry leaders at WorkOS.com. Start building today. Okay, to start, I like to pick a task when I'm evaluating new models that's pretty ambitious, something I definitely wouldn't want to do by hand, and is consistent enough that I can

Starting point is 00:02:25 actually compare the pros and cons of each model side by side. And I picked a task that I choose often when comparing these models, which is redesign my marketing site. I think all these models are pretty good at one-shotting kind of a landing page or a marketing page, a simple app. I don't feel like that's a practical evaluation criterion for these new models. I like to take a code base that's relatively complex or at least established and compare side by side how these models work inside these codebases. So I took my ChatPRD homepage marketing site. It's got lots of pages. It's got a blog. It's got the How I AI workflows on there. It's not a simple app, even though it's just kind of like a content front end. And I want to bring it up to my 2026 ambitions, which are all about the enterprise.

Starting point is 00:03:12 So while this, you know, website looks great, it's cute, it's got nice colors, it's definitely more focused on the kind of PLG individual user workflow. And I want to up-level this as we sell more to enterprise customers. So I'm going to have these models duke it out and see which one does the better job. And I'm going to test these in order of when they came out. So the first thing that came out in our very busy week last week was Codex. Now Codex, as I said, is OpenAI's desktop app for coding. And before we get into it, I want to show off some of the things that I think make Codex unique. First of all, Codex is focused around Git primitives. Now, if you don't know, or you're not technical, or you're a new software engineer, you've probably run into some concepts of Git as you've gotten

Starting point is 00:03:59 started vibe coding, but I just want to walk through a couple of things that might be useful for you to know. The first thing is the idea of a Git repository. That is basically a whole code base that represents an app or a project. Git repositories are represented over here in Codex as projects. You can see I have different repositories here that I'm working on, including my ChatPRD website, the WWW website. Then in your repo you can start working on new types of code, and there are kind of two ways you can take code and make it contained so that when you edit it, it doesn't break your production website. The first way, which I use a lot, is branches. Branches are little, as they say, branches of your code that you can make changes to, commit, and then ultimately decide to merge into production. There's also

Starting point is 00:04:47 the concept of worktrees. These are full copies of your code base that you would use, or an agent would use, to make changes. And one of the benefits of worktrees versus branches is that you can have many of them going at the same time on the same machine. And so if you're working with a lot of agents, you could give each agent its own worktree to work on, and they can do a lot of work in parallel without running into each other or causing issues. If you want to learn more about worktrees, definitely watch our episode with Alex from OpenAI on Codex, the terminal app, where he goes through how he uses worktrees on a daily basis to kick off his agentic work. And then up in the top right, you can see we have a Git diff panel.
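To make the "one worktree per agent" idea concrete, here is a minimal Python sketch. It only builds the `git worktree add` commands rather than running them. The directory layout (`<repo>-<agent>` next to the repo) and the `agent/<name>` branch naming are assumptions for illustration, not something the Codex app is known to do.

```python
from pathlib import Path
from typing import List


def worktree_commands(repo: Path, agents: List[str]) -> List[List[str]]:
    """Build one `git worktree add` command per agent (hypothetical naming scheme)."""
    commands = []
    for agent in agents:
        # Each agent gets its own sibling directory and its own branch,
        # so parallel edits never share a checkout.
        worktree_dir = repo.parent / f"{repo.name}-{agent}"
        branch = f"agent/{agent}"
        commands.append(
            ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(worktree_dir)]
        )
    return commands


if __name__ == "__main__":
    # Example repo path and agent names are made up.
    for cmd in worktree_commands(Path("/tmp/www"), ["redesign", "pricing"]):
        print(" ".join(cmd))
```

Each command could then be passed to `subprocess.run(cmd, check=True)` inside a real repository. Keeping the command construction separate from execution makes the naming scheme easy to inspect before anything touches disk.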

Starting point is 00:05:31 A diff is, again, the difference between what you had and what you have now. You'll see red is code that was removed. Green is code that was added. You can see up here the count of lines changed, either added or removed. And then you can create pull requests from Codex. Pull requests are kind of a signal to your team that says this code that I'm working on is ready to be part of the main production branch. Can you pull it in?
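The added and removed counts that a diff panel shows can be reproduced from `git diff --numstat`, which prints one tab-separated `added<TAB>removed<TAB>path` row per file (with `-` in place of counts for binary files). A small sketch, assuming that output has already been captured as a string; the file names in the sample are made up:

```python
from typing import Dict


def summarize_numstat(numstat: str) -> Dict[str, int]:
    """Total the per-file rows of `git diff --numstat` output."""
    files = added = removed = 0
    for line in numstat.splitlines():
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # skip blank or malformed rows
        files += 1
        plus, minus, _path = parts
        # Binary files report "-" instead of a number; count the file only.
        if plus.isdigit():
            added += int(plus)
        if minus.isdigit():
            removed += int(minus)
    return {"files": files, "added": added, "removed": removed, "net": added - removed}


if __name__ == "__main__":
    sample = "12\t3\tapp/page.tsx\n0\t40\tapp/old.css\n-\t-\tpublic/logo.png"
    print(summarize_numstat(sample))
```

The same totals, summed over merged pull requests, are the kind of numbers quoted later in the episode (lines added, lines removed, net new lines).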

Starting point is 00:05:55 I'm requesting it. And often that's where your CI/CD pipeline and your pre-deploy or pre-production checks go, and where your team with their human eyes tends to look at your code. And you can see here as I'm talking through this, Codex has put these concepts up front and center. And I think that's because they're trying to appeal to two audiences. One, they're just trying to appeal to, you know, let-the-tokens-go, highly empowered, use-all-the-agents software engineers that are doing a lot of things at once on their local machine and need to be able to benefit from these concepts of Git worktrees, local and cloud agents,

Starting point is 00:06:33 all that kind of stuff. The second thing is I think this is actually a really good framework for folks that are less technical to learn the concepts of Git. I have always said you should invest in the GitHub Desktop experience. It is a version of this. It's what I use all the time to manage my work across branches and across files. I could work in the command line tool for GitHub. I just think it's nice to be able to see your changes and really know what's going on.

Starting point is 00:07:00 And so Codex has brought some of these visual concepts, UI concepts of Git into the Codex app. So it's nice if you're learning. The second thing that you'll see in Codex that is a little new and unique compared to other apps is the concept of bringing skills up as a first-class citizen. So if you are new, skills are sort of a packaged set of prompts, instructions, reference files, and code that can be called by an agent to kind of consistently execute a task over time. If you want to be like really cheap, it's like a bundled prompt. And you can see here that OpenAI and Codex have given skills a home and they've given them icons and they've given them buttons. And I have to say, I love this. If you watched my early episode when skills first came out, I was so exasperated that skills were like a zip file that you had to upload

Starting point is 00:07:53 somewhere or put in your repository. This just makes it a much more visual experience to add skills to your code base or to your system and refer to them over time. I also like that OpenAI shipped a bunch of recommended skills that a lot of people could benefit from so you can get your mind wrapped around what kind of skills would benefit your AI work. The final thing that I think OpenAI put kind of front and center in Codex that's interesting is this concept of automations. So automations are basically tasks that can run on a schedule. You can see here when you create a new automation, you give it a name, you say what project it needs to run on. You basically run a prompt. It's not that fancy. And then you give it a schedule. And again, like skills, OpenAI has shipped a bunch of out-of-the-box automations. Now, my reaction here was, I'm already doing a lot of this stuff. You know, I'm a little ahead of the curve when it comes to some of the automations around my codebase. So I've solved these problems, but I think everybody should solve these problems.
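An automation as described here is just four things: a name, a project, a prompt, and a schedule. The sketch below models that shape in Python to show how little is involved. The class, its field names, and the fixed-interval schedule are all hypothetical; this is not the Codex app's actual format or API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Automation:
    """Hypothetical model of a scheduled prompt (not Codex's real schema)."""
    name: str
    project: str       # which repository the prompt runs against
    prompt: str        # the instruction handed to the agent
    every: timedelta   # assumed schedule: a simple fixed interval
    last_run: Optional[datetime] = None

    def is_due(self, now: datetime) -> bool:
        # Due if it has never run, or if a full interval has elapsed.
        return self.last_run is None or now - self.last_run >= self.every


if __name__ == "__main__":
    triage = Automation(
        name="daily-bug-triage",
        project="www",
        prompt="Summarize bugs opened since the last run and suggest owners.",
        every=timedelta(days=1),
    )
    print(triage.is_due(datetime.now()))
```

A real scheduler would also need to record `last_run` after each execution and handle failures; the point here is only that the core object is a prompt plus a timer.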

Starting point is 00:08:54 So if you're looking for inspiration on what kind of automations would benefit your codebase, the Codex recommended automations are a really good place to start and get some inspiration. But let's get to actually writing code. Now, I have to say one caveat, which is I ran this process using GPT-5.2 Codex, which was the recommended model when this app came out. Now very quickly they came out with 5.3, and we'll see that towards the end of the episode, but I do want to call out this is a slightly older version of the model, though I think the family of models, given my experience, has very similar output. So I would say I would probably

Starting point is 00:09:32 get the same experience with 5.3. Now, what is the test case we're going to do? As I like to do, we are going to redesign the ChatPRD site. Last time when some models came out, we redesigned a page. But we've been pitched that these new models are more independent, can do more long-running tasks, can handle more. And so I want it to take an existing code base and redesign the whole thing. And I'm going to trust these very smart models to do it without too much prompting. And so that was my test case. I wanted to take this homepage and this website, which is lovely, but it's very PLG focused, and make it more polished, more up-leveled for an enterprise audience. And so I started that in Codex and I gave it a pretty high-level prompt, but I thought it could go

Starting point is 00:10:25 with it, which was, as I said, "optimize the marketing site in this repo for PLG plus enterprise. You can create new pages, redesign templates, et cetera, to make it the highest quality marketing site I could have," and then I listed a bunch of sites that I really like. If you're on this list, I think you have a nice website. Now, here's where it immediately disappointed me, and I'm sad to say it, but it did. One of the things that I've noticed about the GPT-5.x Codex models is they are so literal. They are so literal. And so they follow instructions very well.

Starting point is 00:11:00 And I know that is, in many instances, a feature, not a bug. You want your model to follow your instructions explicitly, but you don't want it to follow them blindly. And that's what I found. I found that the Codex app harness plus the Codex models were just too literal to do greenfield or creative broad work on my behalf. It will do high quality coding work. I will get to that soon. But your ability to tell these models like, hey, go and do X, I often found that with a combination of it,

Starting point is 00:11:38 being too literal and not pushing me to the next step, not actually saying, are you ready for me to build, meant that it was much more painful and slower to get work done with these models. And this is really ironic because the 5.3 model is actually pretty fast. And so it should feel faster to code with it. But the actual back-and-forth experience conversationally was really challenging. You'll see some of that here. So I said, redesign the website. We went back and forth on how to use the Figma skill.

Starting point is 00:12:07 It didn't actually pick it up well, so I just gave up on that. And then I asked it to redesign the page, and it did it. Now here's where example number one of being too literal came in. I had told it I wanted to redesign the marketing site for a combination of product-led growth and enterprise. Basically, I wanted a marketing site that would be friendly to users, but it would also help our sales team bring in valid leads. And it built it and literally had explicit references to PLG and

Starting point is 00:12:37 enterprise in the copy. It was like, if you're here for product-led growth, click here and sign up. If you are here as an enterprise customer, click here and talk to sales. It was so explicit. And this was my perpetual cycle with Codex on this redesign. We went back and forth. I gave it some design help. I asked it to design a couple things on styling. At some point, I said the design's okay, but it could be better, take more inspiration from the sites I offered, make the copywriting top tier. I've spent $2 million on it. You can see some of my desperate prompting here, just trying to figure out what is the unlock. Is it a technical spec unlock? Is it a, you know, find reference content unlock? Is it an identity unlock for these models? I couldn't figure it out. So I kept trying

Starting point is 00:13:25 and what was really funny is, every time I would say something, it would overfit to my prompt. So when it gave me a website that I generally liked, but I said, hey, can you add more about integrations? Our enterprise customers really like integrations. It made the entire page about integrations. If I said, hey, I want to focus a little more on enterprise, it would make the entire page about enterprise. It really didn't have that nuance of what goes where and how to build a balanced experience. It was really overfitting to my last prompt. And, you know, I was saying like we don't need to list exactly everything. I was trying to give it explicit examples. And then it put a long list of all those examples. It was just having a really hard time editing itself. And then I'm going to

Starting point is 00:14:08 give my favorite example of Codex being way too literal, which is, you know, it created something that I thought was fine, but it was a lot of images and not a lot of content. And I said, hey, I like a more content-dense site like Hex. Hex, you have a lovely site. I think you did a really nice job. I just want more copy on there because I think I want to be more technical, more detailed, more precise about what the value of my product is. And after two prompts, it literally made the headline "a dense product workflow for AI-powered teams." And I was like, oh, I mean, I made the, like, face-palm emoji face. It's like, why in the world would you say that our product has a dense workflow?

Starting point is 00:14:51 I asked for a content-dense site. I didn't say make our content all about how dense our product is. So I just had a really tough time with Codex and GPT-5.2 Codex on this particular task. We eventually got there and I would say the output was okay. So this was the before, and this was the after from Codex. I really liked this headline that came up. It was like one of the things somewhere buried down on the page that I thought was great. It eventually got overwritten by my content-dense headline.

Starting point is 00:15:25 I thought some of the headlines were like kind of interesting. It looks pretty nice. It pulled some interesting graphics from our repo. It put placeholders in here. You know, I think this is okay. It kind of didn't quite fit our design aesthetic. And what I was more frustrated by than the sort of literal nature of the GPT model, which I've kind of gotten used to, this is like not something that's new to me, is that it really only redesigned this homepage and the enterprise page. So I had asked it to redesign the whole site. And it really did not do that. And so again, this, like, sort of "it can do long-running tasks and take on ambitious things": it just took a lot more work from me to get it

Starting point is 00:16:11 to even get to this two-page redesign, which I thought was okay, not great. Now, the code is great. It's fine. It's not terrible. It's certainly faster and better than what I would have done myself. That being said, I think we can do a lot better. So speaking of doing a lot better, let's go over to my friend Opus. Now, again, spoiler alert, y'all, I love, I love her. I love Opus. And I will caveat by saying I found a place where I really love Codex. So we're going to come back. But as soon as I started getting my hands on Opus, I was just really happy. But it didn't start off perfect. So let's talk about where it went well and where it kind of went off the rails. So again, I started with the same prompt: optimize the ChatPRD marketing site in this repo for PLG and enterprise. You can create new pages,

Starting point is 00:17:02 redesign things, et cetera. Again, I put this content-dense framework in here. I had just come off that bad experience. I wanted to see what it did. And I will say Opus 4.6 was just a lot better at planning for itself so that it could execute a long-running task. So it did its exploration of our code base and reference marketing sites. It used Cursor plan mode to do a plan, and then it started building the components. Now, I have to give kudos to Cursor. I'm still a Cursor girl. Yes, I could have tested Opus 4.6 in Claude Code. I am sure there are optimizations there. I just, hand to God, think that Cursor does a good job of building harnesses for all of these models. I think the combination of like planning and to-dos and exploration and the question tool,

Starting point is 00:17:50 I just tend to get good results. So there is this open question of was it the model or was it the Codex harness, you know, in the desktop app, that is not as mature as Cursor. Which one caused that bad experience? I'm not sure. But using Opus 4.6 in the Cursor desktop app was quite nice. Okay. So it's building. It's building. It's building. It's building. It's building. It goes. It runs a build. It gives me a summary. I am very pleased with the independent nature of this model. I'm about to hire her. She can go run my marketing site. You are now my marketing engineer. Except the copy was great, but the design was not. It was terrible. And unfortunately, I didn't commit this at this point. So I lost the design,

Starting point is 00:18:32 but it just did not look good. It did not look sophisticated. I was like, I'm going back home to Codex. What are we doing here? It was terrible. So again, I did my desperate prompting here. I wanted it to look like I spent a million dollars on my design with the best agencies out there. Here are some colors. Let's see if I said, oh, I said, I want you to develop a unique and modern front-end visual style. This is Tailwind indigo AI slop. If you know, you know. And it agreed with me. It was like, you're right. I gave you generic Tailwind slop.

Starting point is 00:19:02 Let me rebuild. And it rebuilt. And it was so lovely. And so we went back and we, you know, it integrated our design system. It gave me an outline of what it did in terms of design. We had to go back and forth on build. But eventually I got something lovely. Here was the before and the after was like this.

Starting point is 00:19:26 I love this so much. We're probably going to ship this in the next day or two, hopefully live when the episode goes live. It still matches our brand aesthetic, but just looks so much nicer. It has our colors. She is pink. It uses some of our graphics instead of placeholders. It calls out some numbers, which is really great for selling the value proposition. It highlights the reviews. And then, you know, instead of what Codex was doing, which was making like very blunt statements about enterprise, it was like 100% security, all this stuff, it gives a really nice kind of value-proposition-oriented view of what would be nice for enterprise, and it redesigned our enterprise page as well. So once I got exactly what I liked, I asked it, okay, let's take these

Starting point is 00:20:10 styles and go ahead and redesign the rest of the site to bring it up to matching. And it did a really good job. It kept everything consistent. It redesigned our pricing page. It's working on our How I AI page to make sure we're matching some of the designs. I think this looks really nice and I was super happy with the output. And this is going to be my meta assessment of Opus 4.6 versus the GPT-5.x models, which is that Opus 4.6 is really good at kind of generative, broad, greenfield work. You want it to implement a new feature. It will go implement a new feature. You want to completely redesign your site. It will completely redesign your site. I was really, really pleased with my experience on this model and we're probably going to ship this live. Now this is a much more front end

Starting point is 00:20:58 focused, design-oriented task. I like this task because we can literally say, okay, what did I start with before? What did, you know, Opus come up with? And then even compare that directly to what Codex came up with, which I can refresh and show you here. I can do a side by side. And you can see with your eyes, you can read all the words and really make a decision about where these models do well. But that is not enough to assess whether or not these are good models or bad models, whether I like them, whether I'm going to use them or not. And as I go into the next workflow where I found both models to be super useful, I'm going to admit something that is a little scary and maybe impressive, which is I asked Devin today: how much code have I merged in

Starting point is 00:21:41 to GitHub in the last five days? I need to fix my Devin workspace. But if you go into it, in the last five days, I have merged 44 PRs containing 98 commits across 1,088 files. I have added 92, almost 93,000 lines of code. I have removed 87,000 lines of code. We've added 5,000 net new lines. We have released our 1, 2, 3, 4, 5 MCP integrations. We've completely overhauled one of our big components. We've completely refactored our components folder, and we have shipped and fixed a bunch of bugs. We have done a lot. And none of this is in the web app. This is all in our core application, which is quite complicated and much more complicated than our marketing site. And I did all of this with now my two pals on my team,

Starting point is 00:22:33 Opus 4.6 and Codex 5.3. So I did find a place that these two operate really well, and I am going to talk you through it. As I mentioned earlier, one of the big features that I released recently on ChatPRD was a bunch of MCP connectors. So now from our chat, you can look at what's happening in GitHub. You can look at what's happening in Linear. You can look at what's happening in Granola. And you bring all that into your product work. And this is one of probably two dozen tools that we now have available in the ChatPRD app. And we were displaying them all in different ways. All our tools were different. They were individual components. Our code was super, super messy. And so one of the things that I kicked off in Opus was a refactor into a reused component that I wanted to be

Starting point is 00:23:21 able to add to, remove from, customize, but have some shared code. I just knew the way we were doing this wasn't great. And so I started off an Opus 4.6 task to refactor how we use our tool components. So let's talk about how I actually rebuilt these components and where I used these different models. So first I opened up Cursor, and honestly this might be the secret sauce in some of these experiences. I opened up Cursor. I built a plan with Opus 4.6 using plan mode. I kicked it off and I went back and forth with 4.6 on how to build this. And you can see here, I got this lovely, like, sort of extensible tool component where I could add different things in, give them different links or give it different copy and language as it went through. It built a bunch of really nice front-end components for me.
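The refactor described here, many one-off tool components collapsed into one shared component with per-tool customization, is essentially a registry pattern. The episode does not show the actual code (which is front-end code), so the following Python sketch is purely illustrative: the tool IDs, labels, and function names are all invented.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ToolDisplay:
    """Per-tool copy; everything else is shared rendering logic."""
    label: str
    running_text: str
    done_text: str


# Hypothetical registry: adding a tool means adding one entry, not a new component.
TOOL_DISPLAYS: Dict[str, ToolDisplay] = {
    "github_search": ToolDisplay("GitHub", "Searching GitHub", "Searched GitHub"),
    "linear_issues": ToolDisplay("Linear", "Reading Linear issues", "Read Linear issues"),
}

# Fallback so an unregistered tool still renders something sensible.
DEFAULT_DISPLAY = ToolDisplay("Tool", "Running tool", "Ran tool")


def render_tool_call(tool_id: str, done: bool, link: Optional[str] = None) -> str:
    """Shared renderer: look up the tool's copy, then apply common formatting."""
    display = TOOL_DISPLAYS.get(tool_id, DEFAULT_DISPLAY)
    text = display.done_text if done else display.running_text
    line = f"[{display.label}] {text}"
    return f"{line} ({link})" if link else line


if __name__ == "__main__":
    print(render_tool_call("github_search", done=False))
    print(render_tool_call("linear_issues", done=True, link="https://example.com/issue"))
```

The design choice is the one named in the episode: shared code for the common behavior, with a small customization surface per tool, so two dozen tools do not mean two dozen divergent components.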

Starting point is 00:24:10 and I think, honestly, they look lovely. So as we saw before, you get these lovely tool calls here. They look nice across all of our different kinds of tool calls, whether you're creating documents. I'm just really happy with this experience. Now I'm ready to push this code to production. Here is where our friend Codex comes back to play now, and this is where I love to use Codex.

Starting point is 00:24:35 I went back into Codex and I said, I've redesigned tool usage in this index. It's gone through several rounds of feedback. Can you review the architecture and performance and see if you have any feedback we should consider before shipping? We're looking for something scalable but customizable and we don't want to overfit in any direction. And it went through and searched and identified a couple of high-impact issues. It prioritized those issues for me. It asked me questions.

Starting point is 00:25:02 I said one is intentional. Two is the edge case. And it asked me if I wanted it to implement any of the polish. I said yes. And it polished it. It passed our AI BugBot code review and we shipped this to production. And now this is my flow. So this was a very, again, kind of front-end focused, component-focused workflow.
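The flow just described (one model builds, a second model reviews, the findings go back to the builder until the reviewer has nothing left) can be sketched as a small loop. In this Python sketch the two models are stand-in callables; the function names, the `max_rounds` cap, and the fake builder and reviewer are assumptions for illustration, not any vendor's API.

```python
from typing import Callable, List, Tuple

# Stand-ins for the two roles: a builder takes the task plus prior feedback,
# a reviewer returns a list of issues (empty means "ship it").
Builder = Callable[[str, List[str]], str]
Reviewer = Callable[[str], List[str]]


def build_review_loop(
    task: str, build: Builder, review: Reviewer, max_rounds: int = 3
) -> Tuple[str, List[str]]:
    """Alternate build and review until the reviewer is satisfied or rounds run out."""
    feedback: List[str] = []
    draft = ""
    for _ in range(max_rounds):
        draft = build(task, feedback)
        feedback = review(draft)
        if not feedback:
            break
    # Any feedback still present here is unresolved and needs a human decision.
    return draft, feedback


if __name__ == "__main__":
    def fake_build(task: str, feedback: List[str]) -> str:
        return task + "".join(f" +fix:{item}" for item in feedback)

    def fake_review(draft: str) -> List[str]:
        return [] if "fix:edge case" in draft else ["edge case"]

    print(build_review_loop("tool refactor", fake_build, fake_review))
```

In the episode there is also a human triage step between review and rebuild (marking one finding as intentional and another as a real edge case); a fuller version would filter `feedback` through that step before the next build.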

Starting point is 00:25:21 We just, you know, for the technical folks out there, we are completely replatforming our vector stores. It was a huge, huge, huge thing. It touched 50 files. It was really hard to do without kind of doing a huge, huge PR. It required like, I don't know, probably 30 rounds of feedback on this thing. And GPT-5.3 Codex was so lovely. Love it for code review, architectural review, and finding edge cases. And what I found is you could ask Opus 4.6 to build something. It would build something 80 to 90% done or good. You'd ask Codex to find everything wrong with it.

Starting point is 00:26:02 It would find all the things that were wrong with it. And then you take it back to Opus. And Opus would be like, oh, yeah, yeah, bro, you're right. I really missed that thing, I better fix it. And so I do think, I'm going to give Codex some love here, I think it's the better software engineer, technically. Opus is kind of the software engineer that you want on your team, though. It actually builds stuff. And so what I've been saying to people about GPT-5.3 Codex is it really replicates the principal software engineer experience in that you will fight them tooth and nail to build anything for you, but they are more than happy to tear apart someone else's code. So if you are looking for a principal engineer on your team to pair with your eager product

Starting point is 00:26:46 engineer of Opus 4.6, definitely, definitely use Codex. And I kind of feel like I can't live without Codex reviewing my code now. So I'm quite happy with this experience. Again, BugBot, which I use from Cursor, does a lot of review of our PRs. It's also run on the Codex model. So I think it's a really good eagle-eye reviewer. It's just too hard to get out of the gate building new product. So I really like this flow and I highly recommend that folks replicate it. I think it's really useful. To conclude our episode, I just want to give a quick nod to Opus 4.6 Fast. If you have not heard, Opus 4.6 Fast is Opus 4.6, but fast.

Starting point is 00:27:27 You can select it here. It's their most powerful model, but fast, and it is expensive. Six times the price. I think it's $150 per million output tokens, something like that. I actually used Opus 4.6 Fast a lot, and now I've got to go look at how much I'm spending. So what I will say is, while I have consumed the tokens, I am floating through an infinite ocean of tokens. I embrace a token abundance mindset. I'm starting to spend a lot of money on models, which at the end of the day is super, super high ROI. Again, if we're looking at this,

Starting point is 00:28:07 how expensive would it be for me to ship 44 PRs, really, really huge features? It would take months of time, tons of people. We probably also wouldn't get it to perfect quality. And so I am really bullish that this is a worthwhile investment for my team, but don't mess around with 4.6 Fast unless you're ready to pay the bill. And so I just think we're all going to start looking at where does this fit from a personality perspective, where does this fit from a capability perspective? And then where does it fit from a budget perspective?
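For the budget question, a back-of-the-envelope calculation helps. The sketch below uses only the figures mentioned in the episode, both of which were given as approximations ("six times the price", "$150 per million output tokens, something like that"), so treat the constants as assumptions rather than a price list. It also ignores input tokens entirely.

```python
def output_cost_usd(output_tokens: int, usd_per_million: float) -> float:
    """Cost of a given number of output tokens at a per-million-token price."""
    return output_tokens / 1_000_000 * usd_per_million


# Assumed prices, taken from the speaker's approximate figures (not verified).
FAST_USD_PER_MILLION = 150.0
# "Six times the price" implies the standard rate is one sixth of the fast rate.
STANDARD_USD_PER_MILLION = FAST_USD_PER_MILLION / 6


if __name__ == "__main__":
    tokens = 2_000_000  # hypothetical output volume for one large task
    print("fast:    ", output_cost_usd(tokens, FAST_USD_PER_MILLION))
    print("standard:", output_cost_usd(tokens, STANDARD_USD_PER_MILLION))
```

Running the same task estimate at both rates before picking a model is a cheap way to avoid the surprise bill the episode warns about.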

Starting point is 00:28:41 And as my friend Cody at Sentry said, if you're playing between 4.6 and 4.6 Fast, don't pick the wrong task or you're going to get a bill that you're not happy with. So that's today's model-focused episode of How I AI. I've compared Opus 4.6, Codex, GPT-5.3 Codex, and Opus 4.6 Fast. What I found: you want to use Opus 4.6

Starting point is 00:29:11 for your product and feature work, being creative and creating high quality designs. You want Codex catching all your bugs, advising on your architecture, and really writing exceptional, high quality, hardened code. Both of these models have a place in your stack. I still love Cursor for using them. I'm still a multi-model girl, but I think they do well in either the Codex desktop app, Claude Code, or wherever you like to get your AI-generated code. That is today's episode of How I AI. I'm looking forward to hearing your feedback about what your favorite model is and where you're using it, and we will see you next week. Thanks so much for watching.

Starting point is 00:29:45 If you enjoyed this show, please like and subscribe here on YouTube or, even better, leave us a comment with your thoughts. You can also find this podcast on Apple Podcasts, Spotify, or your favorite podcast app. Please consider leaving us a rating and review, which will help others find the show. You can see all our episodes and learn more about the show at howiaipod.com. See you next time.
