The AI Daily Brief: Artificial Intelligence News and Analysis - How to Use /Goal to Do More With AI

Starting point is 00:00:00 Today on the AI Daily Brief, a primer on using the slash goal primitive in Codex and Claude Code, and how to use it to level up your use of AI. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. All right, friends, quick announcements before we dive in. First of all, thank you to today's sponsors, Robots and Pencils, Section, Superintelligent, and Blitzy. To get an ad-free version of the show, go to patreon.com slash AI Daily Brief, or you can subscribe on Apple Podcasts. And if you want to learn more about sponsoring the show, send us a note at sponsors at aidailybrief.ai. Today we're talking about something that a lot of power users of AI are incredibly excited about, which is slash goal. So let's dive

Starting point is 00:00:45 in. Today we're doing another very operator-centric episode. Recently, I did a show about Codex maxing, effectively a set of tips and best practices on how to get the most out of OpenAI's Codex. Now, in many ways, while that episode was specific to Codex itself, a lot of the interaction patterns you could also follow in other harnesses like Claude Code. The Codex maxing piece was built off of a blog post by OpenAI's Jason Liu. Jason wrote up about nine techniques or interaction patterns that he had discovered allowed him to get the most out of Codex, not just for coding, but for other types of knowledge work as well. And some of those tips represented fairly different types of patterns. One of them, for example, is the idea

Starting point is 00:01:28 of durable threads or mono threads, where instead of using some sort of infrastructure like a project, where you have multiple threads all related to the same topic that share a memory base, you instead use a single thread, relying on the harness's compaction tool to make sure it always preserves the relevant context. You also saw in that Codex maxing post a number of ideas about how to effectively reduce the latency between the human providing guidance to the model and the model getting things done. I think in some ways, in fact, that you could kind of summarize the overarching direction of what Jason was exploring as a way to move past the turn-based paradigm of AI. In other words, the standard way of interacting with chatbots that we've all gotten used to

Starting point is 00:02:06 over the last few years, where you give it a prompt, wait for it to do a thing, review the thing it did, develop and provide your feedback, and wait again for the next thing that it does. By using features of Codex like the side panel, where you can inspect artifacts as they're being built, voice input to give feedback more freely, with a lot of additional context because you're talking through it, steering to insert that feedback even as Codex is still working, and some other features like remote control and heartbeats to make sure that this can happen even when you're not sitting at your desk,

Starting point is 00:02:33 what it all amounts to is a new, more parallel way of working with agents through these harnesses like Codex. Now, when it comes to Codex, however, there has been one feature that has been lurking in a lot of the conversation throughout the month. It is, in fact, one of those features that, once introduced, becomes normalized across the whole competitor set, with other companies adopting it even if they weren't the first to do it.

Starting point is 00:02:55 I'm talking, of course, about slash goal. Back at the beginning of May, the Codex team's Tibo wrote, slash goal might be the most consequential thing we have shipped in Codex. The value of good instructions has never been higher. Pavel Hearn explained, you state the outcome, the model loops, self-evaluates, and stops when it's done. Now, this idea of looping is a key part of this. You might remember how we talked about the Ralph Wiggum loop,

Starting point is 00:03:18 which is basically an early hack-it-yourself version of this, one that figured out a way to get an agent that you initiate on a problem to keep working against that problem over and over, without human steering having to be involved, effectively extending the window of how long it can work without your immediate interaction. OpenAI co-founder Andrej Karpathy, who is now at Anthropic, also has been spending a lot of time with looping, such as his auto research loop. At one point, he said,

Starting point is 00:03:42 LLMs are exceptionally good at looping until they meet specific goals. Don't tell it what to do, give it success criteria, and watch it go. Pavel concluded his tweet, the skill that wins is engineering the intent: why it matters, strategic context, and how the success will be measured, so the agent can make better autonomous decisions. Now, over the next couple weeks,

Starting point is 00:04:01 people really started to click in on goal. Gregor Zunich writes, slash goal is one of the best things OpenAI ever shipped. Alex Finn wrote, slash goal is the most underrated feature in AI right now. Ollie Lemon called it basically autopilot for complex AI tasks. Trying to describe how slash goal worked for non-technical folks, he wrote: One, you type slash goal and describe the end result you want. Two,

Starting point is 00:04:24 the AI starts working. Three, after every step, it checks itself. Am I done yet? Four, if no, it keeps going. Five, if yes, it stops and tells you. And honestly, people found so much utility so fast with this that just a couple weeks later, Claude Code shipped the same feature, and in recognition that it was better to participate in a new primitive rather than trying to own it, they did the super smart and mature thing of just calling it slash goal in Claude Code as well. Microsoft's Nicholas Bustamante wrote, I'm glad to see slash goal becoming the new primitive for long-running tasks. The model does not naturally persist across turns, context windows, sandboxes,

Starting point is 00:05:02 process crashes, or days of work, so it needs the help of the harness. He continued, I also love how simple it is. An initializer agent turns fuzzy user intent into durable workspace structure with a plan.md file. Then worker agents make bounded progress against that structure, and a judge agent decides whether the stated completion condition is actually met or it will keep running. Once again, the abstraction is moving up the stack. In 2024, you wrote your own while loop. In 2025, you wrote prompt files and hooks, i.e. Ralph Wiggum. In 2026, the loop is becoming a product primitive. Shawn Wang, aka swyx, wrote that this represented an increased level of autonomy,

Starting point is 00:05:38 from slash skill, which was preset prompts, to slash plan, which was human-refined inputs, to slash goal, which was AI-evaluated outputs. Now, as this new primitive has taken hold, lots of people have started to try to write guides and tip documents. One of those came from the OpenAI developers themselves, and that formed the basis for the guide which we'll be going through for the rest of the show, how to use slash goal. This is not comprehensive, and it honestly still slants more technical than I was trying to get to, but hopefully, especially for those of you who are thinking about how to apply this to

Starting point is 00:06:08 knowledge work as opposed to just actual software engineering tasks, you'll feel a little bit more like you have a handle on this once we're through. Now, as you might imagine, I did use Codex to build this presentation, so if you see any lingering meta text, i.e. where it converts instructions into marketing copy, or it's just in general a little overly verbose, I'm placing all of the credit for that squarely at the feet of 5.5 in Codex itself. Now let's start by defining the difference between a prompt and a goal. And an important point here is that slash goal is not a bigger prompt. It's a fundamentally different type of thing. In summarizing the OpenAI guide and some other primers I gave it, the way that Codex described a goal was as a finish line contract:

Starting point is 00:06:50 what should be true, how success should be checked, and what has to stay intact along the way. If a prompt involves asking for a result, the harness and model combo doing the immediate work, the harness and model reporting that work and waiting for your feedback, then repeat, a goal is instead a continuous loop that, one, works towards the durable objective that you've given it, two, checks current evidence against the finish line as it's defined, and three, determines whether to continue, whether the task is complete, or whether to stop because it's honestly blocked. And part of the recognition behind slash goal is that there are lots of types of work that are sequential in a way where the work can't know its next step until the last step has taught it something.
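
To make that loop concrete, here is a minimal Python sketch of the pattern as described: work a step, check evidence against the finish line, then decide whether to continue, declare completion, or stop as blocked. This is an illustration only, not how Codex or Claude Code implement slash goal. The names `Goal`, `Verdict`, `judge`, `run_goal`, and `fix_one_test`, and the step budget standing in for "honestly blocked," are all assumptions made up for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Verdict(Enum):
    CONTINUE = "continue"
    COMPLETE = "complete"
    BLOCKED = "blocked"


@dataclass
class Goal:
    """Hypothetical finish-line contract with the three parts named above."""
    outcome: str                              # what should be true
    is_met: Callable[[dict], bool]            # how success is checked
    constraints_hold: Callable[[dict], bool]  # what has to stay intact


def judge(goal: Goal, evidence: dict, steps_left: int) -> Verdict:
    """Compare current evidence against the finish line."""
    if goal.is_met(evidence) and goal.constraints_hold(evidence):
        return Verdict.COMPLETE
    if steps_left == 0:
        # Assumption: an exhausted step budget stands in for "blocked".
        return Verdict.BLOCKED
    return Verdict.CONTINUE


def run_goal(goal: Goal, do_step: Callable[[dict], dict], max_steps: int = 10):
    """Loop: do one unit of work, gather evidence, judge, repeat."""
    evidence: dict = {}
    for used in range(1, max_steps + 1):
        evidence = do_step(evidence)
        verdict = judge(goal, evidence, max_steps - used)
        if verdict is not Verdict.CONTINUE:
            return verdict, used
    return Verdict.BLOCKED, max_steps


def fix_one_test(evidence: dict) -> dict:
    """Toy worker: each step fixes one failing test (starts from three)."""
    failing = evidence.get("failing_tests", 3)
    return {"failing_tests": max(failing - 1, 0), "lint_clean": True}


goal = Goal(
    outcome="All tests pass and lint stays clean",
    is_met=lambda e: e["failing_tests"] == 0,
    constraints_hold=lambda e: e["lint_clean"],
)
verdict, steps = run_goal(goal, fix_one_test)
print(verdict.value, steps)
```

The point of the sketch is that the "keep going" decision lives in `judge`, not with the human: the user supplies the finish line once, and the loop re-checks it after every step.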

Starting point is 00:07:28 Now, because it's from their developers, the OpenAI document centers on tasks like profiling, patching, benchmarking, reproducing flaky tests, migrations, bug hunts, and research audits, with the common thread being that although each has a specific target, the path to get there changes as Codex gathers evidence. If you didn't have a system like slash goal, you'd be sitting there waiting to see what it said after each intermediate step, only to say something like keep going, now check this, now rerun that. Slash goal effectively pushes that keep going button for you. Now, despite all these examples being about coding, goals can apply to any objective that has and requires some sort of auditable persistence. So what does a type of work need to have to be a

Starting point is 00:08:08 good candidate for goals? First, it has to have a durable objective. In other words, the target should remain true across each turn. The target itself is not going to change over time. A second aspect of work that's good for goals is an uncertain path to success, one where Codex or Claude Code may need to inspect, compare, rerun, revise, or investigate before knowing what the next best move is. Finally, that objective needs to have really strong, clear finish-line evidence, where completion is not dependent on vibes, but instead on tests, sources, artifacts, citations, basically some sort of proof that is inspectable by the AI, where it can self-judge successfully if it's actually done. Simply put, a goal defines completion for a particular

Starting point is 00:08:52 body of work. And by using slash goal in Codex or Claude Code, you're engaging in a particular type of work where you've shifted from telling the AI what to do to instead telling the AI what you want to have done when you're through. And while goals are a way to increase autonomy, they're not about cutting out the user entirely. They are still highly user controlled. You define the outcome. The goal can be paused, resumed, cleared, or completed. Basically, life cycle authority stays bounded to the user and to the system and evidence that the user has provided. There's a set of commands, including slash goal pause, slash goal resume, and slash goal clear, so that if a user finds the path that Codex is going down seems to be wrong, or the rubric for success

Starting point is 00:09:34 needs to change, they can intervene without having to throw away everything that's been done so far. Now, one pattern we talked about when we were talking about Codex maxing was the importance of durable threads, or what some people have called the monothread pattern, where instead of a project with a shared set of memory, the unit of context is the thread. That's how goals work as well. The thread itself is where everything accumulates. This is not taking advantage of global memory or project instructions more broadly. The objective itself travels within that specific thread.
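
One way to picture that lifecycle is as a small state machine attached to a single thread. The Python sketch below is purely illustrative and is not the Codex or Claude Code implementation. The class `GoalThread`, the `GoalState` values, and the transition table are hypothetical, and the choice to keep accumulated notes after a clear is an assumption based on the point above that intervening shouldn't throw away what's been done.

```python
from enum import Enum
from typing import List


class GoalState(Enum):
    ACTIVE = "active"
    PAUSED = "paused"
    COMPLETED = "completed"
    CLEARED = "cleared"


# Hypothetical transition table: (command, current state) -> next state.
_ALLOWED = {
    ("pause", GoalState.ACTIVE): GoalState.PAUSED,
    ("resume", GoalState.PAUSED): GoalState.ACTIVE,
    ("complete", GoalState.ACTIVE): GoalState.COMPLETED,
    ("clear", GoalState.ACTIVE): GoalState.CLEARED,
    ("clear", GoalState.PAUSED): GoalState.CLEARED,
}


class GoalThread:
    """One thread carrying one objective plus everything learned so far."""

    def __init__(self, objective: str) -> None:
        self.objective = objective
        self.state = GoalState.ACTIVE
        self.notes: List[str] = []  # progress accumulates in the thread

    def record(self, note: str) -> None:
        """Work only proceeds while the goal is active."""
        if self.state is not GoalState.ACTIVE:
            raise RuntimeError(f"goal is {self.state.value}, not active")
        self.notes.append(note)

    def command(self, name: str) -> None:
        """Apply a user lifecycle command such as 'pause' or 'clear'."""
        key = (name, self.state)
        if key not in _ALLOWED:
            raise ValueError(f"cannot {name} a goal that is {self.state.value}")
        # Assumption: notes are kept on every transition, including clear.
        self.state = _ALLOWED[key]


thread = GoalThread("Audit this memo claim by claim")
thread.record("claim 1 checked against provided sources")
thread.command("pause")   # user steps in to rethink the rubric
thread.command("resume")  # work continues in the same thread
```

Every state change here comes from a user command, which mirrors the idea that lifecycle authority stays with the user while the thread holds the accumulated context.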

Starting point is 00:10:09 One thing I keep seeing in enterprise AI: companies hedging across every cloud, every model, every framework, or paying a GSI for a pilot that never ends. The teams actually shipping, they've picked a lane and they move fast. That's one of the reasons I like today's sponsor, Robots and Pencils. They've gone all in on AWS. They're an advanced tier and AWS pattern partner, and they ship production AI co-workers in 45 days. That's led to them doing some of the more interesting work I've seen on AI co-workers. And by that I'm not talking about chatbots. I'm talking about actual agentic systems that sit inside a business architecture and do real work. That kind of focus matters if you're

Starting point is 00:10:43 an enterprise leader trying to get something real into production or an AWS rep trying to move a customer from interested to deployed. Request an AI briefing at robotsandpencils.com. One conversation with Robots and Pencils and you'll know. Here's a harsh truth. Your company is probably spending thousands or millions of dollars on AI tools that are being massively underutilized. Half of companies have AI tools, but only 12% use them for business value. Most employees are still using AI to summarize meeting notes. If you're the one responsible for AI adoption at your company, you need Section. Section is a platform that helps you manage AI transformation across your entire organization. It coaches employees on real

Starting point is 00:11:20 use cases, tracks who's using AI for business impact, and shows you exactly where AI is and isn't creating value. The result: you go from rolling out tools to driving measurable AI value. Your employees move from meeting summaries to solving actual business problems, and you can prove the ROI. Stop guessing if your AI investment is working. Check out Section at sectionai.com. That's S-E-C-T-I-O-N-A-I dot com. OpenAI and Anthropic are both launching enterprise AI consulting efforts because everyone is realizing that the challenge isn't the capabilities of AI, the challenge is getting individuals and the organization actually ready to use it. The truth, though, is that all the forward-deployed engineers in the world

Starting point is 00:11:59 aren't going to help you if you don't actually have a coherent strategy based on an understanding of your actual AI readiness. Superintelligent's Maturity Maps give you a chance to see where you stand relative to the industry on deployment depth, systems integration, data access, outcomes, people, and governance. And from there, our customized AI planning assessments can help you figure out what you need to do to improve your readiness and how to sequence it. Go take your own maturity maps quiz at besuper.ai and send us a note if you want to go deeper. Weekends are for vibe coding.

Starting point is 00:12:29 It has never been easier to bring a passion project to life, so go ahead and fire up your favorite vibe coding tool. But Monday is coming, and before you know it, you'll be staring down a maze of microservices, a legacy COBOL system from the 1970s, and an engineering roadmap that will exist well past your retirement party. That's why you need Blitzy, the first autonomous software development platform designed for enterprise-scale codebases. Deploy at the beginning of every sprint and tackle your roadmap 500% faster.

Starting point is 00:12:53 Blitzy's agents ingest your entire codebase, plan the work, and deliver over 80% autonomously. Validated, end-to-end tested, premium quality code at the speed of compute. Months of engineering compressed into days. Vibe code your passion projects on the weekend. Bring Blitzy to work on Monday. See why Fortune 500s trust Blitzy for the code that matters at blitzy.com. That's B-L-I-T-Z-Y dot com.

Starting point is 00:13:19 Now, writing a good goal is more than just having an outcome, although that's part of it. When it comes to the outcome itself, it's really important that evidence can decide success or completion. Evidence can be tests, citations, matrices, logs, rubrics, artifacts, but there's more to writing a good goal as well. A good goal prompt is going to provide boundaries like which files, tools, or data can be used, and it's likely going to explain things like when the harness should actually stop and explain that no defensible path remains. OpenAI's tip document says that the strongest goals usually define six things. The outcome, or what should be true when the work is done, the verification surface, which is the test, benchmark, report, artifact, command

Starting point is 00:13:59 output, or source material that proves it, the constraints, in other words, what must not regress while Codex works, the boundaries, which files, tools, and resources Codex can use, the iteration policy, how Codex should decide what to try next after each attempt, and the blocked stop condition, or when Codex should actually stop. But what about scope? How broad or narrow should a goal be? Early experiments do suggest that there is sort of a Goldilocks zone, where you can be too narrow, i.e. fix this one line, or you can be too broad, i.e. improve the whole system, with the challenge of being too narrow being that even if that's the thing that you actually want to change,

Starting point is 00:14:35 it doesn't give the system enough flexibility to discover where the real issue is, especially if it's in some related dependency or upstream in some way. Whereas on the other end of the spectrum, if it's too broad, it's much harder to provide the kind of concrete evidence that's going to allow Codex or Claude Code to know if it's actually successfully accomplished the task. Just right is obviously in between those two extremes. Relatedly, defining the output artifact can be the difference between a successful goal run or not. A weak artifact, in the same way as prompting too loosely, can produce underwhelming results. If your slash goal artifact is write docs for this feature, the inspectable output of the work might not actually provide the

Starting point is 00:15:12 best evidence surface, as opposed to a stronger artifact goal like produce a docs page that explains the life cycle, command surface, and two examples. Verify that the page builds locally and all referenced commands match current CLI behavior. Now, you're probably noticing that a lot of this terminology is still really anchored in the realm of developers. Well, how do we start to figure out what types of other non-software engineering knowledge work might be a good fit for the slash goal primitive? One of the ways to think about it is that when the output is not just an answer but an audit, that might be a good place for a goal. A good non-coding goal is going to produce a ledger of what was checked, what was supported, what was contradicted,

Starting point is 00:15:51 what was weak, and what remains unknown. If that's the type of output that is valuable for your task, it might be a good fit for slash goal, even if it's not a coding task. Now, one of the interesting things as you branch from software engineering to knowledge work is how to think about where the definition of success comes from. Broadly speaking, there are two paths. In some cases, there will be an externally definable rubric. That could be existing published criteria, official docs, a third-party data set, an existing set of logs or transcripts, or some project-specific document like RFP questions. In many cases, however, and this is where it starts to get really blurry, as I was thinking about different projects that were going to be a good

Starting point is 00:16:32 fit for slash goal, I noticed that sometimes I, as the user, needed to provide the rubric. And I think that this is going to be one of the most common patterns in those types of knowledge work use cases, where the user supplies the criteria for success. Think about, for example, hiring criteria. It's not going to be some external source of what you should be looking for. It's going to be you articulating in ways that are knowable by the AI and can be tested against by the AI. What are the hiring criteria that matter to you? A similar example is vendor scorecards. You're not looking to some external standard for what the vendor should be, at least not entirely. You're probably looking for the AI to mirror what you specifically or your company specifically

Starting point is 00:17:09 are looking for in the vendor. Same can be true for editorial standards, lead qualification rules, investment diligence priorities, etc. In fact, you can almost work backwards from here and notice that when you have a knowledge work task that implicitly comes with some rubric or criteria of success, that might be a good place to look to see if it is a good fit for the goal primitive. Now, for the sake of this particular episode, I'm not going all the way through an entire use case, but I did want to provide a set of examples that I think might be good fits or good areas to look as you are thinking about how you can experiment. So 10 areas of knowledge work that I think might be a good fit for slash goal include

Starting point is 00:17:49 literature reviews, market landscapes, vendor evaluations, due diligence, claim audits, policy research, interview synthesis, timeline reconstruction, spreadsheet audits, and even strategy memos, if the goal of the work is to take a whole bunch of messy inputs and put them into a more structured format. Double-clicking on three examples, claim audits strike me as a really clear fit, that even if that's not a use case for you, hopefully gives you some more insight into the type of structure you're looking for. So imagine a prompt slash goal, audit this memo claim by claim.

Starting point is 00:18:24 Verify each claim against the provided sources and reputable external sources, which, by the way, you'd probably want to provide. End with a table labeling each claim as supported, contradicted, partially supported, or unverified, with citations and uncertainty notes. So you're seeing here that output of an audit trail. You're in that Goldilocks zone where you're articulating well enough what you want as the output, and it works because every conclusion the AI makes can be traced back to evidence. Now, what about a market landscape? Isn't that just sort of a normal AI research question? Well, imagine that the goal is create a market landscape for X market, verified by cited company pages, filings, analyst reports, pricing pages,

Starting point is 00:19:05 and product docs, and with a comparison table, confidence levels, and gaps where evidence was unavailable. So what takes this out of the realm of a general research project and into the realm of a slash goal project is that idea of moving to an audit as the process and output. The artifact that you're trying to go for is a comparison table that shows you what can be verified, what's inferred, and where the evidence runs out. Similarly, a slash goal-shaped literature review is one where you're living with complexity and diversity, highlighting rather than flattening conflicting evidence and disagreement. Imagine a goal. Provide an evidence-backed literature review on X topic. Build a source matrix

Starting point is 00:19:41 covering methods, sample sizes, findings, limitations, and conflicts. End with confirmed themes, disputed findings, and open questions. Basically, this pattern is going to work wherever evidence can be inventoried and presented in complete form. My suspicion, though, is that a lot more of the way that knowledge workers are going to use this, at least in the short term, is in this area where there are user-provided rubrics. Whereas a prompt can be good for a single pass, slash goal can execute an entire review process. So something that might be well-suited for a prompt as opposed to a goal would be review these five applications against this rubric, cite evidence and suggest interview

Starting point is 00:20:17 questions. It's a small set of inputs, straightforward criteria with one comparative read. Slash goal would allow that to become the architecture for an entire process that involved extracting evidence, applying the rubric, checking consistency, revisiting borderline cases, flagging missing information, and producing a continuously updated document as more entries come in. Still, it's really important to note that as you start to dig into this, not every task will end up making sense to be a goal. There will be lots and lots of times, perhaps even the majority of times, when the traditional

Starting point is 00:20:49 interaction pattern is completely sufficient for what you're trying to achieve. Sometimes that will be because the outcome objective is small enough, but other times it will be because the criteria for success won't be as clean or definable as the slash goal primitive needs to do a good job. And this is why Jason's tips about codex maxing remain important even in the slash goal era, because a lot of times you're not going to want to be as fully disconnected from the process as slash goal allows you to be. Effectively, there is a spectrum of interaction autonomy between you and the harness with different methods, making sense for different types of things you're trying to achieve. Goal is a really great tool to begin to play with, and I think it is worth spending some time

Starting point is 00:21:27 experimenting, even if it's with something outside the mainstream of your work, just to get a sense and a feel for what it can achieve for you and what it requires of you. As we get a little bit deeper into this paradigm, remember we're only a couple weeks after it's been fully introduced now, I'm sure we're going to have a lot more examples of how and where it is both working and not working in and around non-coding use cases and knowledge work, and so at some point I'll come back and do an update based on all of that. For now, though, that's going to do it for today's episode of the AI Daily Brief. Hope this one is helpful.

Starting point is 00:21:56 I'm excited to see where your goals lead you. Appreciate you listening or watching, as always, and until next time, peace.
