Latent Space: The AI Engineer Podcast - The Agent Reasoning Interface: o1/o3, Claude 3, ChatGPT Canvas, Tasks, and Operator — with Karina Nguyen of OpenAI
Episode Date: February 1, 2025
Sponsorships and tickets for the AI Engineer Summit are selling fast! See the new website with speakers and schedules live! If you are building AI agents or leading teams of AI Engineers, this will be... the single highest-signal conference of the year for you, this Feb 20-22nd in NYC.
We’re pleased to share that Karina will be presenting OpenAI’s closing keynote at the AI Engineer Summit. We were fortunate to get some time with her today to introduce some of her work, and hope this serves as nice background for her talk!
There are very few early AI careers that have been as impactful as Karina Nguyen’s. After stints at Notion, Square, Dropbox, Primer, the New York Times, and UC Berkeley, she joined Anthropic as employee ~60 and worked on a wide range of research/product roles for Claude 1, 2, and 3. We’ll just let her LinkedIn speak for itself:
Now, as Research manager and Post-training lead in Model Behavior at OpenAI, she creates new interaction paradigms for reasoning interfaces and capabilities, like ChatGPT Canvas, Tasks, SimpleQA, streaming chain-of-thought for o1 models, and more via novel synthetic model training.
Ideal AI Research+Product Process
In the podcast we got a sense of what Karina has found works for her and her team to be as productive as they have been:
* Write PRD (Define what you want)
* Funding (Get resources)
* Prototype Prompted Baseline (See what’s possible)
* Write and Run Evals (Get failures to hillclimb)
* Model training (Exceed baseline without overfitting)
* Bugbash (Find bugs and solve them)
* Ship (Get users!)
We could turn this into a snazzy viral graphic but really this is all it is. Simple to say, difficult to do well. Hopefully it helps you define your process if you do similar product-research work.
Show Notes
* Our Reasoning Price War post
* Karina LinkedIn, Website, Twitter
* OSINT visualization work
* Ukraine 3D storytelling
* Karina on Claude Artifacts
* Karina on Claude 3 Benchmarks
* Inspiration for Artifacts / Canvas from early UX work she did on GPT-3
* “i really believe that things like canvas and tasks should and could have happened like 2 yrs ago, idk why we are lagging in the form factors” (tweet)
* Our article on prompting o1 vs Karina’s Claude prompting principles
* Canvas: https://openai.com/index/introducing-canvas/
  * We trained GPT-4o to collaborate as a creative partner. The model knows when to open a canvas, make targeted edits, and fully rewrite. It also understands broader context to provide precise feedback and suggestions. To support this, our research team developed the following core behaviors:
    * Triggering the canvas for writing and coding
    * Generating diverse content types
    * Making targeted edits
    * Rewriting documents
    * Providing inline critique
  * We measured progress with over 20 automated internal evaluations. We used novel synthetic data generation techniques, such as distilling outputs from OpenAI o1-preview, to post-train the model for its core behaviors. This approach allowed us to rapidly address writing quality and new user interactions, all without relying on human-generated data.
* Tasks: https://www.theverge.com/2025/1/14/24343528/openai-chatgpt-repeating-tasks-agent-ai
* Agents and Operator
  * What are agents? “Agents are a gradual progression of tasks: starting with one-off actions, moving to collaboration, and ultimately fully trustworthy long-horizon delegation in complex envs like multi-player/multiagents.” (tweet)
  * tasks and canvas fall within the first two, and we are def. marching towards the third—though the form factor for 3 will take time to develop
  * Operator/Computer Use Agents
  * https://openai.com/index/introducing-operator/
* Misc:
  * Andrew Ng
  * Prediction: Personal AI Consumer playbook
  * ChatGPT as generative OS
Timestamps
* 00:00 Welcome to the Latent Space Podcast
* 00:11 Introducing Karina Nguyen
* 02:21 Karina's Journey to OpenAI
* 04:45 Early Prototypes and Projects
* 05:25 Joining Anthropic and Early Work
* 07:16 Challenges and Innovations at Anthropic
* 11:30 Launching Claude 3
* 21:57 Behavioral Design and Model Personality
* 27:37 The Making of ChatGPT Canvas
* 34:34 Canvas Update and Initial Impressions
* 34:46 Differences Between Canvas and API Outputs
* 35:50 Core Use Cases of Canvas
* 36:35 Canvas as a Writing Partner
* 36:55 Canvas vs. Google Docs and Future Improvements
* 37:35 Canvas for Coding and Executing Code
* 38:50 Challenges in Developing Canvas
* 41:45 Introduction to Tasks
* 41:53 Developing and Iterating on Tasks
* 46:27 Future Vision for Tasks and Proactive Models
* 52:23 Computer Use Agents and Their Potential
* 01:00:21 Cultural Differences Between OpenAI and Anthropic
* 01:03:46 Call to Action and Final Thoughts
Transcript
Alessio [00:00:04]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel, and I'm joined by my usual co-host, Swyx.
swyx [00:00:11]: Hey, and today we're very, very blessed to have Karina Nguyen in the studio. Welcome.
Karina [00:00:15]: Nice to meet you.
swyx [00:00:16]: We finally made it happen. We finally made it happen. First time we tried this, you were working at a different company, and now we're here. Fortunately, you had some time, so thank you so much for joining us. Yeah, thank you for inviting me. Karina, your website says you lead a research team in OpenAI, creating new interaction paradigms for reasoning interfaces and capabilities like ChatGPT Canvas, and most recently, ChatGPT Tasks. I don't know, is that what we're calling it?
Streaming chain of thought for o1 models and more via novel synthetic model training. What is this research team?
Karina [00:00:45]: Yeah, I need to clarify this a little bit more. I think it changed a lot since the last time we launched. So we launched Canvas, and it was the first project. I was a tech lead, basically, and then I think over time I was trying to refine what my team is, and I feel like it's at the intersection of human-computer interaction, defining what the next interaction paradigms might look like with some of the most recent reasoning models, as well as actually trying to come up with novel methods, how to improve those models for certain tasks if you want to. So for Canvas, for example, one of the most common use cases is basically writing and coding. And we're continually working on, okay, how do we make Canvas coding go beyond what is possible right now? And that requires us to actually do our own training and come up with new methods of synthetic data generation. The way I'm thinking about it is that my team is very full stack, from training models all the way up to deployment and making sure that we create novel product features that are coherent with what you're doing. So we're really working on that.
swyx [00:02:08]: So it's, it's a lot of work to do right now. And I think that's why I think it's such a great opportunity. You know, how could something this big work in like an industrial space and in the things that we're doing, you know, it's a really exciting time for us. And it's just, you know, it's a lot of work, but what I really like about working in digital space is the, you know, the visual space is always the best place to stay. It's not just the skill sets that need to be done.
Alessio [00:02:17]: Like we have, like, a lot of things to be done, but like, we've got a lot of different, you know, things to come up with.
I know you have some early UX prototypes with GPT-3 as well, and kind of like maybe how that has informed the way you build products.
Karina [00:02:32]: I think my background was mostly like working on computer vision applications for like investigative journalism. Back when I was like at school at Berkeley, I was working a lot with like the Human Rights Center and like investigative journalists from various media. And that's how I learned more about like AI, like with vision transformers. And at that time, I was working with some of the professors at Berkeley AI Research.
swyx [00:03:00]: There are some Pulitzer Prize winning professors, right, that teach there?
Karina [00:03:04]: No, so it was mostly like reporting for like teams like the New York Times, like the AP, Associated Press. So it was like all in the context of like the Human Rights Center. Got it. Yeah. So that was like in computer vision. And then I saw Chris Olah's work around, you know, like interpretability from Google. And that's how I found out about like Anthropic. And at that time, I was just like, I think it was like the year when like Ukraine's war happened. And I was like trying to find a full-time job. And it was kind of like all got distracted. It was like kind of like spring. And I was like very focused on like figuring out like what to do. And then my best option at that time was just like continue my internship at the New York Times and convert to like full-time. At the New York Times, it was just like working on like mostly like product engineering work around like R&D prototypes, kind of like storytelling features on the mobile experience. So it kind of like storytelling experiences. And like at that time, we were like thinking about like how do we employ like NLP techniques to like scrape some of the archives from the New York Times or something. But then I always wanted to like get into like AI.
And like I knew OpenAI for a while, like since I was like, and I was like, I don't know, I don't know. So I kind of like applied to Anthropic just on the website. And I was rejected the first time. But then at that time, they were not hiring for like anything like product engineering or front-end engineering, which was something I was like, at that time, I was like interested in. And then there was like a new opening at Anthropic was like kind of like you are front-end engineer. And so I applied. And that's how my journey began. But like the earlier prototypes was mostly like I used like CLIP.
swyx [00:05:13]: We'll briefly mention that the Ukrainian crisis actually hit home more for you than most people because you're from the Ukraine and you moved here like for school, I guess. Yeah.
Karina [00:05:23]: Yeah.
swyx [00:05:23]: We'll come back to that if it comes up. But then you joined Anthropic, not just as a front-end engineer. You were the first. Is that true? Designer? Yeah.
Karina [00:05:32]: Yes. I think like I did both product design and front-end engineering together. And like at that time it was like pre-ChatGPT. It was like, I think August 2022. And that was a time when Anthropic really decided to like do more product-y related things. And the vision was like, we need to like fund research and like building product is like the best way to like fund safety research, which I find quite admirable. So the really first product that Anthropic built was like Claude in Slack. And it was sunsetted not long after, but like it was like one of the first, I think I still come back to that idea of like Claude operating inside some of the organizational workplace like Slack and something magical in there. And I remember we built like ideas like summarize the thread, but you can like imagine having automated like ways of like, maybe Claude should like summarize multiple channels every week, custom for what you like or for what you want.
And then we built some like really cool features. Like this. So we could like tag Claude and then ask to summarize what happened in the thread. So just like new ideas, but we didn't quite double down because you could like imagine like Claude having access to like the files or like Google Drive that you can upload and just connectors, like connections in the Slack. Also the UX was kind of constraining at that time. I was thinking like, oh, we wanted to do this feature, but like the Slack interface kind of constrained us from doing that. And we didn't want to like be dependent on the platform, like Slack. And then after like ChatGPT came out, I remember the first two weeks, my manager made me this challenge, like, can I like reproduce kind of like a similar interface in like two weeks? And one of the early mistakes being in engineering is like, I said yes; instead I should have said like, you know, it's double, 2x the time. Sure. Um, and this is how like Claude.ai was kind of like born.
swyx [00:07:39]: Oh, so you actually wrote Claude.ai? Yeah. As your first job. Yeah.
Karina [00:07:43]: Like, I think like the first like 50,000 lines of code without any reviews at that time, because there's no one, um, yeah, it was like very small team. It was all like six, seven team who we were called like deployment team. Yeah.
swyx [00:07:59]: Oh my, I actually interviewed for, uh, at Anthropic around that time. I got, I was given Claude in Sheets and that was my other form factor. I was like, oh yeah, this needs to be in a table so we can, we can just copy paste and just span it out. Uh, which is kind of cool. The other rumor that, um, we might as well just mention this, um, Raza Habib from Humanloop, uh, often says that, uh, you know, there was some, there's some version of ChatGPT in Anthropic, like you had the chat interface already, like you had Slack, why not launch a web UI? Like basically like how did, how did OpenAI beat Anthropic to ChatGPT basically?
Um, well, it seems kind of obvious to have it.
Karina [00:08:35]: I think the ChatGPT model itself came out way before we decided to like launch Claude 2 necessarily. And I think like at that time, Claude 1.3 had a lot of hallucinations actually. So I think there was like, one of the concerns is like, I don't think like the leadership was convinced, had the conviction that this is the model that you need to like, you want to like deploy or something. So it was a lot of discussions around, around that time. But Claude 1.3 was like, I don't know if you played with that, but it's like extremely creative and it was like really cool.
swyx [00:09:07]: Nice.
Alessio [00:09:08]: It's still creative. And you had a tweet recently where you said things like Canvas and Tasks could have happened two years ago, but they did not. Do you know why they did not? Was it too many researchers at the labs not focused on UX? Was it just not a priority for the labs?
Karina [00:09:24]: Yeah. I come back to that question a lot. I guess like I was working on something similar to like Canvas-y, but for Claude at that time in like 2023, it was the same similar idea of like Claude workspace where a human and a Claude could have like a shared workspace. Yeah. And that's Artifacts. Which is like a document. Right.
swyx [00:09:44]: No, no, no. This is Claude Projects.
Karina [00:09:46]: I don't know. I think it kind of evolved. I think like at that time I was like in product engineering team and then I switched to like research team and the product engineering team grew so much. They had their own ideas of like artifacts and like projects. So not necessarily, maybe they had, they looked at my like previous explorations, but like, you know, when I was exploring like Claude documents or like Claude workspace was like. Yeah. I don't think anybody was thinking about UX as much or like not many like researchers understood that.
And I think the inspiration actually for, I still have like all the sketches, but the inspiration was like from Harry Potter, like Tom Riddle's diary. That was an inspiration of like having Claude writing into the document or something and communicate back.
swyx [00:10:34]: So like in the movie you write a little bit and then it answers you. Yeah.
Karina [00:10:37]: Okay.
swyx [00:10:38]: Interesting.
Karina [00:10:39]: But that was like only in the context of like writing. I think Canvas like also serves like coding, one of the most common use cases. But yeah, I think like those, those ideas could have happened like two years ago. Just like maybe, I don't think it was like a priority at that time. It was like very unclear. I think like the AI landscape at that time was very nascent. If that makes sense. Like nobody, like, even when I would talk to like some of the designers at that time, like product designers, they were not even thinking about that at all. They did not have like AI in mind. And like, it's kind of interesting, except for one of my designer friends. His name is Jason Yuan. Yeah. Who was thinking about that.
swyx [00:11:19]: And Jason now is at New Computer. Yes. We'll have them on at some point. I had them speak at my first summit and you're speaking at the second one, which will be really fun. Nice. We'll stay on Anthropic for a bit and then we'll move on to more recent things. I think the other big project that you were, you were involved with was just Claude 3. Just tell us the story. Like, what was it like to launch one of the biggest launches of the year? Yeah.
Karina [00:11:39]: I think like I was, so Claude 3.
swyx [00:11:43]: This is Haiku, Sonnet, Opus all at once, right? Yes. Yeah.
Karina [00:11:46]: It was the Claude 3 family. I was a part of the post-training fine tuning team. We only had like, what, like 10, 12 people involved. And it was really, really fun to like work together as friends.
So yeah, I was mostly involved in like the Claude 3 Haiku post-training side and then evaluations, like developing new evaluations. And like literally writing the entire like model card. And I had a lot of fun. I think like the way you train the model is like very different, obviously. But I think what I've learned is that like you will end up with like, I don't know, like 70 models and every model will have its own like brain damage. And like, so it's just like, like kind of just bugs.
swyx [00:12:28]: Like personality wise or performance benchmarks?
Karina [00:12:31]: I think every model is very different. And I think like, it's like one of the interesting like research questions is like, how do you understand like the data interface? How do you understand the interactions as you like train the model? It's like, if you train the model on like contradictory data sets, how can you make sure that there won't be like any like weird like side effects? And sometimes you get like side effects. And like the learning is that you have to like iterate very rapidly and like have to like debug and detect it and like address it with like interventions. And actually some of the techniques from like software engineering are very like useful here. It's like, how do you- Yeah, exactly.
swyx [00:13:09]: So I really empathize with this because data sets, if you put in the wrong one, you can basically kind of screw up like the past month of training. The problem with this for me is the existence of YOLO runs. I cannot square this with YOLO runs. If you're telling me like you're taking such care about data sets, then every day I'm going to check in, run evals and do that stuff. But then we also know that YOLO runs exist. Yes. So how do you square that?
Karina [00:13:32]: Well, I think it's like dependent on how much compute you have. Right? So it's like, it's actually a lot of the questions for like researchers are like, how do you most effectively use the compute that you have?
And maybe you can have like two to three runs that are only like YOLO runs. But if you don't have the luxury of that, like you kind of need to like prioritize ruthlessly. Like what are the experiments that are most important to like run? Yeah. I think this is what like research management is basically. It's like, how do you-
swyx [00:14:04]: Funding efforts. Yeah. Yeah. Prioritizing.
Karina [00:14:07]: Take like research bets and make sure that you build the conviction in those bets rapidly such that if they work out, you like double down on them. Yeah.
swyx [00:14:15]: You almost have to like kind of ablate data sets too and like do it on the side channel and then merge it in. Yeah. It's kind of super interesting. Tell us more, like what's your favorite? So you, I have this in front of me, the model card. You say constructing this table was slightly painful. Just pick a benchmark and what's an interesting story behind one of them?
Karina [00:14:33]: I would say GPQA was kind of interesting. I think it was like the first, I think we were the first lab, like Anthropic was the first lab to like run.
swyx [00:14:42]: Oh, because it was like relatively new after NeurIPS? Yeah.
Karina [00:14:45]: Yeah. Okay. Published GPQA like numbers. And I think one of the things that we've learned, that I personally learned about like any evals, is like, some evals are like very like high variance. And like GPQA happened to be like a hugely like high variance evaluation. So like one thing that we did is like run it like five times and like take the average. But like the hardest thing about like the model card is like none of the numbers are like apples to apples. Yes. Will knows this. So you actually need to like go back to like, I don't know, like the GPT-4 model card and like read the appendix just to like make sure that like the settings are the same as the settings you're running too. So it's like never an apples to apples. Yeah.
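[A minimal sketch of the averaging Karina describes for high-variance evals, assuming each run of a benchmark yields a single accuracy number. The helper name `summarize_runs` and the five scores are hypothetical, made up for illustration; they are not GPQA results or numbers from the Claude 3 model card.]

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) over repeated eval runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate the spread")
    return mean(scores), stdev(scores)


# Hypothetical accuracies from five runs with identical prompts and settings.
runs = [0.48, 0.52, 0.50, 0.46, 0.54]
avg, spread = summarize_runs(runs)
print(f"mean accuracy = {avg:.3f}, std dev = {spread:.3f}")
```

[Reporting the spread next to the mean makes it easier to judge whether a gap between two models is larger than run-to-run noise.]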
But it's interesting how like, you know, when you market models as products, like customers don't necessarily know. Yeah. Like.
swyx [00:15:44]: They're just like, my MMLU is 99. What do you mean? Yeah, exactly. Why isn't there an industry standard harness, right? There's this Eleuther thing, which it seems like none of the model labs use. And then OpenAI put out simple-evals and nobody uses that. Why isn't there just one standard way everyone runs this? Because the alternative approach is you rerun your evals on their models. And obviously the numbers, your numbers will be lower. Yeah. And they'll be unhappy. So that's why you don't do that.
Karina [00:16:12]: I think it operates on an assumption that like the models, the next generation of the model or the model that you produce next is going to behave the same. So for example, like I think the way you prompt o1 or like Claude 3 is going to be very different from each other. I feel like there's a lot of like prompting that you need to do to get the evals to run correctly. So sometimes the model will just like output like new lines and the way it's parsed will be like incorrect or something. This has happened with like Stanford. I remember like when Stanford had this also like they were like running benchmarks. HELM? Yeah, HELM. And somehow like Claude was like always like not performing well. And that's because like the way they prompted it was kind of wrong. So it's like a lot of like techniques. Yeah. It's just like very hard because like nobody even knows.
swyx [00:17:00]: Has that gone away with chat models instead of, you know, just raw completion models?
Karina [00:17:05]: Yeah, I guess like each eval also can be run in a very different way. Sometimes you can like ask the model to output in like XML tags, but some models are not really good at XML tags. So it's like, do you change the formatting per model or like do you run the same format across all models? And then like the metrics themselves, right?
Like maybe, you know, accuracy is like one thing, but maybe you care about like some other metrics like F score or like some other like things. Yeah. It's like hard. I don't know.
Alessio [00:17:36]: And talking about o1 prompting, we just had an o1 prompting post on the newsletter, which I think was...
swyx [00:17:42]: Apparently it went viral within OpenAI. Yeah. I don't know. I got pinged by other OpenAI people. They were like, is this helpful to us? I'm like, okay. Oh, nice. Yeah.
Alessio [00:17:50]: I think it's like maybe one of the top three most read posts now. Yeah. Cool. And I didn't write it. Okay. Exactly.
swyx [00:17:57]: Anyway, go ahead.
Alessio [00:17:57]: What are your tips on o1 versus like Claude prompting or like what are things that you took away from that experience? And especially now, I know that with 4o for Canvas, you've done RL after on the model. So yeah, just general learnings on how to think about prompting these models differently.
Karina [00:18:12]: I actually think like o1, I did not even harness the magic of like o1 prompting. But like one thing that I found is that like, if you give o1 like hard, like constraints of like what you're looking for, basically the model will have a much easier time to like kind of like select the candidates and match like the candidate that most like fulfills the criteria that you gave. And I think there's a class of problems like this that o1 excels at. For example, if you have a question, like a bio question on like some, or like in chemistry, right? Like if you have like very specific criteria with the protein or like some of the chemical bindings or something like, then the model will be really good at like determining the exact candidate that will match the certain criteria.
swyx [00:19:04]: I have often thought that we need a new IFEval for this. Because this is basically kind of instruction following, isn't it? Yes.
But I don't think IFEval has like multi-step IFEval. Yeah. So that's what basically I use AI News for. I have a lot of prompts and a lot of steps and a lot of criteria and o1 just kind of checks through each kind of systematically. And we don't have any evals like that.
Karina [00:19:24]: Yeah.
Alessio [00:19:25]: Does OpenAI know how to prompt o1? I think that's kind of like the, that's the, you know, Sam is always talking about incremental deployments and kind of like getting, having people getting used to it. When you release a model, you obviously do all the safety testing, but do you feel like people internally know how to get a hundred percent out of the model? Or like, are you also spending a lot of time learning from like the outside on how to better prompt o1 and like all these things? Yeah.
Karina [00:19:50]: I certainly think that you learn so much from like external feedback too. Yeah. I feel like I don't fully know how people use like o1. I think like a lot of people use o1 for like really hardcore like coding questions. I feel like I don't fully know how to best use o1. You release the model. Except for like, I use o1 to just like do some like synthetic data explorations. But that's it.
Alessio [00:20:16]: Do people inside of OpenAI, once the model is coming out, do you get like a company-wide memo of like, hey, this is how you should try and prompt this? Yes. Especially for people that might not be close to it during development, you know, or I don't know if you can share anything, but I'm curious how internally these things kind of get shared.
Karina [00:20:34]: I feel like I'm like in my own little corner in like research. I don't really like to look at some of the Slack channels.
swyx [00:20:40]: It's very, very big.
Karina [00:20:41]: So I actually don't know if something like this exists. Probably. It might exist because we need to share to like customers or like, you know, like some of the guides. I'm like, how do you use this model?
So probably there is.
swyx [00:20:56]: I often say this. The reason that AI engineering can exist outside of the model labs is because the model labs release models with capabilities that they don't even fully know because you never trained specifically for it. It's emergent. And you can rely on basically crowdsourcing the search of that space or the behavior space to the rest of us. Yeah. So like, you don't have to know. That's what I'm saying. Yeah.
Karina [00:21:20]: I think like an interesting thing about like o1 is that like it's really for like, the average human, sometimes I don't even know whether the model like produced the correct output or not. Like it's really hard for me to like verify even like hard like STEM questions. If I'm not an expert, like I usually don't know. So it's like the question of like alignment is actually more important like for these like complex reasoning models, like how do we help humans to like verify the outputs of these models is quite important. And I feel like. Yeah. Like learning from external feedback is kind of cool.
swyx [00:21:56]: For sure. One last thing on Claude 3. You had a section on behavioral design. Yes. Anthropic's very famous for the HHH goals. What were your insights there? Or, you know, maybe just talk a little bit about what you explored. Yeah.
Karina [00:22:09]: I think like behavioral design is like a really cool. I'm glad that I made it like a section around this. And it's like really cool. I think like.
swyx [00:22:17]: Like you weren't going to publish one and then you insisted on it or what?
Karina [00:22:20]: I think like I just like put the section. Yeah. I think like I put the section inside it and like, yeah, Jared, my like one of my most favorite researchers like, yeah, that's cool. Let's, let's do that. I guess. Yeah. Like nobody had this like term like behavioral design necessarily for the models.
It's kind of like a new little field of like extending like product design into like the model design. Right. Like, so how do you create a behavior for the model in certain contexts? So for example, like in Canvas, right. Like one of the things that we had to like think about is like, okay, like now the model enters like a more collaborative environment, more collaborative context. So like what's the most appropriate behavior for the model to act like as a collaborator? Should it ask like more follow up questions? Should it like change? What should the tone be? Like what is the collaborator's tone? It's different from like a chat, like conversationalist versus like collaborator. So how do you shape the perspective? Like, you know, like the persona and the personality around that, it has like some philosophical questions too. Like, yeah. Behavioral. I mean, like, I guess like I can talk more about like the methods of like creating the personality. Please. It's the same thing as like you would create like a character in a video game or something. It's kind of like...
swyx [00:23:39]: Charisma, intelligence. Yeah, exactly. Wisdom.
Karina [00:23:42]: What are the core principles? Helpful, harmless, honest. Yeah. And obviously for Claude, this is much easier than I would say like for ChatGPT. For Claude, it's like baked in the mission, right? It's like honest, harmless, helpful. But the most complicated thing about the model behavior or the behavioral design is that sometimes two values would contradict each other. I think this happened in Claude 3. One of the main things that we were thinking about was like, how do we balance this like honesty versus like harmlessness or like helpfulness? And it's like, we don't want the model to always like refuse even like innocuous queries, like some like creative writing prompts, but also you don't want the model to be harmful or something. So it's like, there's always a balance between those two.
And it's more like art than science necessarily. And this is what data set craft is, it's like more of an art than a literal science. You can definitely do like empirical research on this, but it's actually like, like this is the idea of like synthetic data. Like if you look back to like the Constitutional AI paper, it's around like, how do you create completions such that you would agree to certain like principles that you want your model to agree on? So it's like, if you create the core values of the models, how do you decompose those core values into like specific scenarios or like, so how does the model need to express its honesty in a variety of kind of like scenarios? And this is where like generalization happens when you craft the persona of the model. Yeah.
swyx [00:25:22]: It seems like what you described, behavior modification or shaping, was like a side job that was done. I mean, I think Anthropic has always focused on it the first and the most. But now it's like every lab has sort of. It's like a vibes officer for you guys is Amanda, for OpenAI it's Roon, and then for Google, it's Steven Johnson and Raiza who we had on the podcast. Do you think this is like a job? Like, it's like a, like every, every company needs a tastemaker.
Karina [00:25:50]: I think the model's personality is actually the reflection of the company or the reflection of the people who create that model. So like for Claude, I think Amanda was doing a lot of like Claude character work and I was working with her at the time.
swyx [00:26:04]: But there's no team, right? Claude character work. Now there's a little bit of a team. Isn't that cool?
Karina [00:26:09]: But before that there was none. I think like actually it was Claude 3 where like, we kind of doubled down on the feedback from Claude 2. Like people, we didn't even like think, but like people said like Claude 2 is like so much better at like writing and like has a certain personality, even though it was like unintentional at all.
And we did not pay that much attention and didn't even know how to like productionize this property of the model being better. Like personality. And with Claude 3, we kind of like had to like double down because we knew that if we would launch like in chat, we wanted to like... Claude's honesty is like really good for like enterprise customers. So we kind of wanted to like make sure the hallucinations went, like, factuality would like go up or something. We didn't have a team until after like Claude 3, I guess. Yeah.

swyx [00:26:58]: I mean, it's, it's growing now. And I think anyway, everyone's taking it seriously.

Karina [00:27:00]: I think at OpenAI there was a team called Model Design. It's Joanne, the PM. She's leading that team and I work very closely with those teams; we were working on, like, the actual writing improvements that we did with ChatGPT last year. And then I was working on like this collaboration, like how do you make ChatGPT act like a collaborator for like Canvas? And then, yeah, we worked together on some of the projects.

swyx [00:27:25]: I don't think it's publicly known, his actual name other than Roon, but he's, he's mostly doxxed.

Alessio [00:27:32]: We'll beep it and then people can guess. Yeah. Do we want to move on to OpenAI and some of the recent work, especially you mentioned Canvas. So the first thing about Canvas is like, it's not just a UX thing. You have a different model in the backend, which you post-trained on o1-preview distilled data, which was pretty interesting. Can you maybe just run people through, you come up with a feature idea, maybe then how do you decide what goes in the model, what goes in the product and just that, that process? Yeah.

Karina [00:28:03]: I think the most unique thing about ChatGPT Canvas, what I really liked about that, was that it was also the team formed out of the air. So it was like July 4th or something... Wow. during the break.
Like on Independence Day.

swyx [00:28:17]: They just like, okay.

Karina [00:28:18]: I think it was, there was some like company break or something. I remember I was just like taking a break and then I was like pitching this idea to like Barret Zoph. Barret Zoph, yeah. Who was my manager at that time. Just like, I just want to like create this like Canvas or something. And I really didn't know how to like apply this, navigate OpenAI; it was like my first, like, I don't know, like first month at OpenAI and I really didn't know how to like navigate, how do I get product to work with me or like some of the ideas, like some of the things like this. So I'm really grateful for like actually Barret and Mira, who helped me to like staff this project basically. And I think that was really cool. And it was like this 4th of July and like Barret was like, yeah, actually, who's like an engineering manager is like, yeah, we should like staff this project with like five, six engineers or something. And then Karina can be a researcher on this project. And I think like, this is how the team was formed. This was kind of like out of the air. And so like, I didn't know anyone there at that time, except for Thomas Dimson. He did like the first like initial like engineering prototype of the canvas and it kind of like reshaped. But I think the first, we learned a lot on the way how to work together as product and research. And I think this is one of the first projects at OpenAI where research and product worked together from the very beginning. And we just made it like a successful project, in my opinion, because like designers, engineers, PM and research team were all together. And we would like push back on each other. Like if it doesn't make sense to do it on the model side, we would collaborate with like applied engineers to like make sure this is being handled on the applied side.
But the idea is you can only go so far with like a prompted baseline. Prompted ChatGPT was kind of like the first thing that we tried, was like a canvas as a tool or something. So how do we define the behavior of the canvas? But then like we've found like different like edge cases that we wanted to like fix, and the only way to like fix some of these edge cases is actually through post-training. So what we did was actually retrain the entire 4o plus our Canvas stuff. And there are like two reasons why we did this. The first one is that we wanted to ship this as a beta model in the dropdown menu. We could like rapidly iterate on users' feedback as we ship it and not go through the entire like integration process into like this like new one model or something, which took some time. Right. So from beta to like GA, it took, I think, three months. So we kind of wanted to like ship our own model with that feature to like learn from the user feedback very quickly. So that was like one of the decisions we made. And then with Canvas itself, we just like had a lot of like different like behavioral, it's again, like it's behavioral engineering. It's kind of like various behavioral craft around like when does Canvas need to write comments? When does it need to like update or like edit the document? When does it need to like rewrite the entire document versus like edit the very specific section the user asks about? And when does it need to like trigger the Canvas itself? It was one of those, those like behavioral engineering questions that we had. At that time, I was also working on like writing quality. So that was like the perfect way for us to like literally both teach the model how to use Canvas, but also like improve writing quality, since writing was like one of the main use cases for ChatGPT.
So I think that was like the reasoning around that.

swyx [00:31:55]: There's so many questions. Oh my God. Quick one. What does improved writing quality mean? What are the evals?

Karina [00:32:01]: What are the evals? Yeah. So the way I'm thinking about it is like we have two directions. The first direction is like, how do you improve the quality of the writing for the current use cases of ChatGPT? And most of the use cases are mostly like nonfiction writing. It's like email writing or like some of the, maybe blog posts, cover letters is like one of the main use cases. But then the second one is like, how do we teach the model to literally think more creatively or like write in a more creative manner such that it will like just create novel forms of writing. And I think the second one is like much more of a longer-term, like research question. While the first one is more like, okay, we just need to improve data quality for the writing use cases that the models are used for. It is a more straightforward question. Okay. But the way we evaluated the writing quality, so actually I worked with Joanne's team on the model design. So they had a team of like model writers and we would work together and it's just like a human eval. It's like an internal human eval where we would just like do that. Yeah. On the prompt distribution that we cared about, like we want to make sure that the models that we trained were always like better or something. Yeah.

swyx [00:33:20]: So like some test set of like a hundred prompts that you want to make sure you're good on. I don't know. I don't know how big the prompt distribution needs to be because you are literally catering to everyone. Right.

Karina [00:33:32]: Yeah. I think it was a much more opinionated way of like improving writing quality because we worked together with like model designers to like come up with like core principles of what makes this particular writing good. Like what does make email writing good?
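
As a rough illustration of the principle-based human eval described here, the sketch below scores each graded output against a checklist and compares a candidate model with a baseline over the same prompts. Everything in it is an assumption made for illustration: the criteria in `EMAIL_RUBRIC`, the `Grade` record, the model names, and the tie-counts-half win rate are hypothetical and are not taken from the episode or from OpenAI's internal tooling.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical rubric for email writing. These criteria are illustrative
# assumptions, not the rubric discussed in the episode.
EMAIL_RUBRIC: List[str] = [
    "states the purpose early",
    "tone fits the recipient",
    "no filler phrases",
    "ends with a clear ask",
]


@dataclass
class Grade:
    prompt_id: str          # which prompt from the eval set
    model: str              # e.g. "baseline" or "candidate" (hypothetical names)
    marks: Dict[str, bool]  # criterion -> pass/fail, filled in by a human grader


def rubric_score(grade: Grade, rubric: List[str]) -> float:
    """Fraction of rubric criteria the graded output satisfied."""
    return sum(grade.marks.get(c, False) for c in rubric) / len(rubric)


def win_rate(grades: List[Grade], rubric: List[str],
             candidate: str, baseline: str) -> float:
    """Share of prompts where the candidate beats the baseline; ties count half."""
    by_prompt: Dict[str, Dict[str, float]] = {}
    for g in grades:
        by_prompt.setdefault(g.prompt_id, {})[g.model] = rubric_score(g, rubric)

    wins, compared = 0.0, 0
    for scores in by_prompt.values():
        # Only compare prompts that both models were graded on.
        if candidate in scores and baseline in scores:
            compared += 1
            if scores[candidate] > scores[baseline]:
                wins += 1.0
            elif scores[candidate] == scores[baseline]:
                wins += 0.5
    return wins / compared if compared else 0.0
```

A grader would fill in one `Grade` per prompt and model, and `win_rate(grades, EMAIL_RUBRIC, "candidate", "baseline")` then gives a single number to track across training runs on the prompt distribution the team cares about.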
And we had to like craft like literally a rubric on like what makes it good and then make sure during the eval, we check the marks on this like rubric. Yeah.

swyx [00:33:58]: That's what I do. Yeah. That's what school teachers do. Yeah.

Karina [00:34:02]: Yeah. It's really funny.

swyx [00:34:03]: Like, yeah, that's exactly how we grade essays. Yes.

Karina [00:34:06]: Yeah.

Alessio [00:34:06]: I guess my question is when do you work the improvements back into the model? So the canvas model is better at writing. Why not just make the core model better too? So for example, I built this small podcasting thing for a podcast and I have the 4o API and I asked it to write a write-up about the episode based on the transcript. And then I've done the same in canvas. The canvas one is a lot better. Like the one from the raw 4o, it starts, "the podcast delves," and I was like, no, not "delves" in the third word. Why not put them back in 4o core or is there just like.

Karina [00:34:38]: I think we put it back in the core now.

Alessio [00:34:40]: Yeah. So like, so the 4o canvas now is the same as 4o. Yeah. You, you must've missed that update. Yeah. What's the, what's the, what's the process to, I think it's just like an AB test almost. Right. To me, it feels, I mean, I've only tried it like three times. But it feels the canvas, the canvas output feels very different than the API output.

Karina [00:35:01]: Yeah, yeah. I think like, there's always like a difference in the model quality. I would say like the original beta model that we released with canvas was actually much more creative than even right now when I use like 4o with canvas. I think it's just like the complexity of like the data and the complexity of the, it's kind of like versioning issues right here. It's like, okay, like your version 11 will be very different from like version eight, right?
It's like, even though like the stuff that you put in is like the same or something.

swyx [00:35:32]: It's a good time to, to say that I have used it a lot more than three times. I'm a huge fan of canvas. I think it is, um, yeah, like it's weird when I talk to my other friends, they, they don't really get it yet or they don't really use it yet. I think because it's maybe sold as like sort of writing help when really like it's kind of, it's the scratch pad. Yeah. What are the core use cases or like, yeah.

Karina [00:35:53]: Oh yeah. I'm curious. Literally draft.

swyx [00:35:54]: Drafting anything. Like I want to draft like copy for my conference that I'm running, like I'll put it there first and then I like, it'll just have the canvas up and I'll just say what I don't like about it and it changes. I will maybe edit stuff here and paste in. So, so for example, like I wanted to draft a brainstorm list of signs that you may be an NPC just for fun, just like a blog post for fun. Nice. And I was like, okay, I'll do 10 of these and then I want you to generate the next 10. So I wrote 10. I pasted it into ChatGPT. Okay. And it generated the next 10 and they all sucked, all horrible, but it also spun up the canvas with, with the blog post and I was like, okay, self-critique why your output sucks and then try again. And it, and it just kind of just iterates on the blog post with me as a writing partner and it is so much better than, I don't know, like intermediate steps. I was like, that would be my primary use case, literally drafting anything. I think the other way that I'll put it, I'm not putting words in your mouth. This is how I view what canvas is and why it's so important. It's basically an inversion of what Google Docs wants to do with Gemini.
It's like Google Docs on the main screen and then Gemini on the side, and right now what ChatGPT has done is do the chat thing first and then the docs on the side, but it's kind of like a reversal of, of what is the main thing. Like Google Docs starts with the canvas first that you can edit and whatever, and then you maybe sometimes call in the AI assistant, but ChatGPT, what you are now is you're kind of AI first with the side output being Google Docs.

Karina [00:37:22]: I think we definitely want to improve the writing use case in terms of like, how do we make it easier for people to format or like do some of the editing? I think there is still a lot of room for improvement, to be honest. I think another thing is like coding, right? I feel like one of the things that we'll be like doubling down on is actually like executing code inside the canvas. And there's a lot of questions like, how do you evolve this? It's kind of like an IDE for both. And I feel like this is where I'm coming from, is like ChatGPT evolves into this blank canvas. It's kind of like the interface which can morph itself into whatever you're trying, like the model should try to like derive your true intent and then modify the interface based on your intent. And then if you're like writing, it should become like the most powerful, like writing IDE possible. If it's like coding, it should become like a coding IDE or something.

swyx [00:38:14]: I think it's a little bit of an odd decision for me to call those two things the same product name, because they're basically two different UIs. Like one is code interpreter plus plus. The other one is canvas. Yes. I don't know if you have other thoughts on canvas.

Alessio [00:38:27]: No, I'm just curious, maybe some of the harder things. So when I was reading, for example, forcing the model to do targeted edits versus like full rewrite, it sounds like it was like really hard in the AI engineer mind.
Maybe sometimes it's like just pass one sentence in the prompt. It's just going to rewrite that sentence. Right. But obviously it's harder than that. What are maybe some of the like hard things that people don't understand from the outside in building products like this?

Karina [00:38:50]: I think it's always hard with any new like product feature, like Canvas or tasks or like any other new features, that you don't know how people would use this feature. And so how do you even like build evaluations that would simulate how people would use this feature? And it's always like really hard for us. Therefore, like we try to like lean on like iterative deployment in order to like learn from user feedback as much as possible. Again, it's like we didn't know that like code diffs were very difficult for a model, for example. Again, it's like, do we go back to like fundamentally improve like code diffs as a model capability, or do you like do a workaround where the model will just like rewrite the entire document, which yields like higher accuracy? And so those are like some of the decisions that we had to like make. Yeah. How do you like raise the bar on the product quality, but also make sure the model quality is also a part of it? And like, what kind of like trade-offs are you okay to make? Again, I think, I think this is like a new way of product development; it's more like product research, model training and like product development go like together hand in hand. This is like one of the hardest things, like defining the entire like model behaviors. I think just like, there's so many edge cases that might happen, especially when you like do canvas with like other tools, right? Like canvas plus DALL-E. Canvas plus search. If you like select a certain section and then like ask for search, like how do you build such evals? Like what kind of like features or like behaviors do you care the most about?
And this is how you build evals.

swyx [00:40:35]: You tested against every feature of ChatGPT? No. Oh, okay. I mean, I don't think there's that many that you can. Right. It will take forever.

Karina [00:40:44]: But it's the same. The decision boundary between like Python, ADA, advanced data analysis, versus canvas is one of the trickiest like decision boundary behaviors that we had to like figure out, like how do you derive the intent from the human user query? Yeah. And how do I say this? Deriving the intent, meaning does the user expect canvas or some other tool, and then like making sure that it's like maximally, like, what the intent was, is like actually still one of the hardest problems. Yeah. Especially with like agents, right? Like you don't want like agents to go for like five minutes and do something in the background and then come back with like some mid answer that you could have gotten from like a normal model or like the answers that you didn't even want because it didn't have enough context. It didn't like follow up correctly.

swyx [00:41:40]: You said the magic word. We have to take a shot every time you say it. You said agents.

swyx [00:41:46]: So let's move to tasks. You just launched tasks. What was that like? What was the story? I mean, it's, it's your, it's your baby. So

Karina [00:41:52]: Now that I have a team, I actually like, tasks was purely like my resident's project. I was mostly a supervisor. So I kind of like delegated a lot of things to my resident. His name is like Vivek. And I think this is like one of the projects where I learned management, I would say. Yeah. But it was really cool. I think it's a very similar model. I'm trying to replicate the canvas operational model. How do we operate with product people or like product applied orgs with research, and the same happened. I was trying to replicate like the methods and replicate the operational process with tasks. And actually tasks was developed in less than like two months.
So if canvas took like, I don't know, four months, then tasks took like two months. And I think again, like it's kind of a very similar process of like, how do we build evals? You know, some people like ask for like reminders in actual ChatGPT, even though they know it doesn't work. Yeah. So like there is some like demand or like desire from users to like do this. And actually I feel like tasks is like a simple feature; in my opinion, it's something that you would want from any model. Right. But then the magic is, like, actually, because the model is so general, it knows how to use search or like canvas or like create cypher. You know, you can modify stories and create Python puzzles; when coupled with tasks it actually becomes like really, really powerful. It was like the same ideas of like, how do we shape the behavior of the model? Again, we shipped it as like a beta model in the model dropdown. And then we are working towards like making that feature integrated in like the core model. So I feel like the principle is that like everything should be like in one model, but because of some of the operational difficulties, it's, it's much easier to like deploy it as a separate model first to like learn from the user feedback and then iterate very quickly and then improve into the core model basically. Again, this project was also like together from the very beginning; designers, engineers, researchers were working all together, and together with model designers, we were like trying to like come up with like evals, evaluations, and like testing and like bug bashing. And it's like a lot of cool like synergy.

swyx [00:44:12]: Evals, bug bashing. I'm trying to distill. Okay. I would love a canvas for this, to distill what the ideal product management or research management process is. Right. Start from like, do you have a PRD? Do you have a doc that like these, these things? Yes.
And then from PRD, you get funding maybe or like, you know, staffing resources, whatever. Yes. And then prototype maybe. Yeah. Prototype.

Karina [00:44:37]: I would say like prototype with a prompted baseline. Everything starts with like a prompted baseline. Yeah. And then like we craft like certain like evaluations that you want to like capture. Okay. You want to like measure progress at least for the model and then make sure that evals are good and make sure that the prompted baseline actually fails on those like evals, because then you have like something to like hill climb on. And then once you start iterating on the model training, it's actually very iterative. So like every time you train the model, you like look at the benchmark or like look at your evals and if it like goes up, it's like good. But then also you don't want to like, you want to make sure it's not like super overfitting. Like that's where you run on other evals, right? Like intelligence evals or something. And then like. Yeah.

swyx [00:45:20]: You don't want regressions on the other stuff. Right. Yes. Okay. Is that your job or is that like the rest of the company's job to do?

Karina [00:45:26]: I think it's mainly my like. Really? The job of the people who like.

swyx [00:45:30]: Because regressions are going to happen and you don't necessarily own the data for the other stuff.

Karina [00:45:34]: What's happening right now is that like you, basically you only like update your, your data sets, right? So it's like you compare on the baseline, you compare like the regressions on the baseline model.

swyx [00:45:47]: Model training and then bug bash. And that's, that's about it. And then ship.

Karina [00:45:50]: Actually, I did the course with Andrew Ng, who. Yes. There was like one little lesson around this. Okay.

swyx [00:45:57]: I haven't seen. Product research. You tweeted a picture with him and it wasn't clear if you were working on a course.
I mean, it looked like the standard course picture with Andrew Ng. Yes. Okay. There was a course with him. What was that like working with him?

Karina [00:46:08]: No, I'm not working with him. I just like, I just like did the course with him. Yeah. Yeah.

Alessio [00:46:11]: How do you think about the tasks? So I started creating a bunch of them. Like, do you see this as being, going back to like the composability, like composable together later? Like you're going to be able to schedule one task that does multiple tasks chained together. What's the vision?

Karina [00:46:27]: I would say task is like a foundational module, obviously to generalize to all sorts of like behaviors that you want. Like sometimes like I see like people have like three tasks.

Karina [00:46:41]: And right now I don't think like the model handles this very well. I think that ideally we learn from like the user behavior and ideally the model will just be more proactive in suggesting, like, oh, I can do this for you every day because I've observed that you do that every day or something. So it becomes more like a proactive behavior. I think right now you have to be more explicit, like, oh yeah, like every day, like remind me of this. But I think like, ideally the model will always think about you in the background and like kind of suggest, okay, like I noticed you've been reading some of these particular like Hacker News articles. Maybe I can try to suggest you like every day or something. So it's like, it's just like much more like of a natural like friend, I think.

swyx [00:47:35]: Well, there is an actual startup called Friend that is trying to do that. Oh, yes. We'll have, we'll interview Avi at some point. But like it sounds like the guiding principle is just what is useful to you. It's a little bit B2C, you know, is there any B2B push at all or you don't think about that?

Karina [00:47:51]: I personally don't think about that as much, but I definitely feel like B2B is cool.
Again, I come back to like Claude in Slack. It's like one of the, like the first like interfaces where like the model was operating inside your organization, right? It would be very cool for the model to like handle that. To like become like a productive member of your organization. And then even like process, like right now, like I'm thinking like processing like user feedback. I think it'd be very cool if the model would just like start doing this for us and like we don't have to hire a new person just for this or something. And like you have like very simple like data analysis or like data analytics on like how these features like.

swyx [00:48:36]: Do you do this analysis yourself? Or do you have a data science team that tells you insights?

Karina [00:48:40]: I think there are some data scientists. Okay.

swyx [00:48:43]: I've often wondered, I think there should be some startup or something that does automated data insights. Like I just throw you my data. You tell me. Yeah. Yeah, exactly. Cause that's what the data team at any company does. Right. Which is just give us your data. We'll like make PowerPoints. Yeah. Yeah.

Karina [00:48:59]: That'd be very cool.

swyx [00:49:00]: That's, I think that's a, that's a really good vision. You had thoughts on agents in general. There's some more proactive stuff. You actually had tweeted a definition. Which is kind of interesting.

Karina [00:49:09]: I did.

swyx [00:49:10]: Well, I'll read it out to you. You tell me. Okay. If you still agree with yourself. This is five days ago. Agents are a gradual progression of tasks, starting off with one-off actions, moving to collaboration. Ultimately fully trustworthy long horizon. I know it's, I know it's uncomfortable to have your tweets read to you. I have had this done to me. Ultimately fully trustworthy long-horizon delegation in complex environments like multiplayer, multi-agents. Tasks and canvas fall within the first two.
What is the third one?

Karina [00:49:34]: One of my weaknesses is like, I like writing long sentences. I feel like that's a good thing. Like I need to like learn how to.

swyx [00:49:39]: That's fine. That's fine. Is that your definition of agents? Like what are you looking for?

Karina [00:49:43]: I'm not sure if this is my definition of agents, but I feel like it's more like how I think it makes sense, right? Like I feel like for me to like trust an agent with my passwords or my credit card, I actually need to build trust with that agent that it will handle my tasks correctly and reliably. And the way I would go about this is how I would naturally like collaborate with other people. It's like, even if it's any project, right, like when we first come, like we don't even know each other. Like we don't know each other's like working style, like what I prefer, what do they prefer, how do they prefer to communicate, et cetera, et cetera. So like you spend like the first, like, I don't know, like two weeks to just like learn their style of working. And then like over time you adapt to their working style and then this is how you create the collaboration. And then like at the beginning you don't have much trust. So like how do you build more trust, especially like, it's the same thing as like with a manager, right? Like it's like, how do you build trust with your manager? What do they need to know about you? What do you need to know about them? Over time you build trust, and trust builds through collaboration, which is why I feel like building Canvas was kind of like the first steps towards like more collaborative agents. I think with humans, you need to show consistency. Yeah. Consistent effort to each other, like consistent effort that shows you care about each other, that you like work together very well or something. So consistency and like collaboration is like what creates trust.
And then I will naturally try to delegate tasks to a model because I know the model will not fail me or something. So it's kind of like building out like the intuition for the form factor of like new agents. Because sometimes I feel like a lot of researchers or like people in the AI community are like so into like, yeah, agents, delegate everything, like blah, blah, blah, but like on the way towards that, I think like collaboration is actually one of the main roadblocks or like milestones to get over. Because then you will learn some of the implicit preferences that would help towards like this full delegation model. Yeah.

swyx [00:51:55]: Trust is very important. I have an AGI working for me and I, we're, we're still working on the trust issues. Okay. Um, we are recording this just before the launch of Operator. The other side of agents that is very topical recently is computer use. Anthropic launched computer use recently. Um, you know, you're not saying this, but OpenAI is rumored to be working on things and like, a lot of labs are like exploring this, like sort of driving a computer generally. Um, how important is that for agents?

Karina [00:52:23]: I think it would be one of the core capabilities of agents. Yeah. Computer use, or agents using the desktop or like your computer, is like the delegation part. So like when you might want to like delegate to an agent to like order a book for me or like order a flight or like search for a flight and then order things. And I feel like this idea was flying around like for a long time, since at least like 2022 or something. And finally we are here. It's just like there's a lot of like lag between idea and like full execution, on the order of like two to three years.

swyx [00:53:01]: The vision models had to get better. Yeah. A lot better.

Karina [00:53:04]: The perception and something. But I think like it's really cool.
I feel like it has like implications for like consumers, definitely like delegation. But I guess again like I think like latency is like one of the most important factors here. It's like you want to make sure that the model correctly understands what you want. And then if it doesn't understand or if it doesn't know like the full context, it should like ask a follow-up question and then like use that to perform the task. Like the agent should know if it has enough information to complete the task with, like, maximal success or not. And I think this is like still an open kind of like research question, I feel like. Yeah. And the second idea is that like I think it also enables a new class of like research questions for like computer use agents. Like can we use it in RL? Right. Like this is kind of like a very cool like nascent area of like research.

swyx [00:53:59]: What's one thing that you think by the end of this year people will be using computer use agents a lot for?

Karina [00:54:05]: I don't know. It's really hard to predict. I'm trying to look for.

swyx [00:54:09]: Maybe for coding.

Karina [00:54:11]: I don't know.

swyx [00:54:11]: For coding?

Karina [00:54:12]: I think like right now like with Canvas we are thinking about like this paradigm of like real-time collaboration to like asynchronous collaboration. So it's like it would be cool if I can just delegate to a model like, okay, can you figure out like how to do this feature or something? And then the model can just like test out that feature in its own like virtual environment or something. I don't know. Like maybe this is a weird idea. Obviously, there will be a lot of consumer use cases like, hey, like shop for me or something.

swyx [00:54:43]: I was going to say, everyone goes to booking plane tickets. That's like the worst example because you only book plane tickets, what, two or three times a year?
Or like concert tickets.

Karina [00:54:50]: I don't know. Yeah.

swyx [00:54:51]: Concert tickets. Yeah.

Karina [00:54:51]: Like Taylor Swift.

swyx [00:54:52]: I want a Facebook Marketplace bot that just scrolls Facebook Marketplace for free stuff. Yeah. And then just go and get it. Yeah.

Karina [00:55:00]: I have a question. I don't know. What do you think?

swyx [00:55:01]: I have been very bearish on computer use because they're slow, they're expensive, they're imprecise, like the accuracy is horrible. Still, even with Anthropic's new stuff, I'm really waiting to see what OpenAI might do to change my opinions. And really what I'm trying to do is like Jan last year versus December last year, I changed a lot of opinions. What am I wrong about today? And computer use is probably one of them where I'm like, I don't think, I don't know if by end of the year we'll still be using them. Will my ChatGPT have? Like every GPT instance, will they, will they have a virtual computer? Maybe? I don't know. Coding? Yes. Because he, he invested in a company that does, does that for the, the code sandboxes there. There are a bunch of code sandbox companies. E2B is the name. But then like in browsers, yes. Computer use is like coding plus browsers, plus everything else. There's a whole operating system and it's very like, you have to be pixel precise. You have to OCR. Well, I think OCR is basically solved, but like pixel precise and like understand the UI of what you're operating. And like, I don't know if the models are, I don't know. There you go.

Karina [00:56:01]: Yeah. Yeah. Two questions. Like, do you think the progress of like mini models, like o3-mini or like o1-mini, I guess like it comes back to like Claude, Claude 3 Haiku, Claude Instant 1.2, like this like gradual progression of like small models becoming really powerful, which are also like very fast.
Like I'm sure like the computer use agents like would be able to like couple with like those like small models that will solve some of the latency issues, in my opinion. I think in terms of like other operating system, I think a lot about it these days, it's just like, if you're entering this like task oriented, like operating system or something, where also a generative OS, like in my opinion, like people in like few years will click on like websites way less. I want to see the plot of like website clicks over time. But then my prediction is like, it will click. It will go down and like people's access to the internet will be through the model's lens. Either you see what the model is doing or you don't see what the model is doing on the internet. Yeah.Alessio [00:57:10]: I think my personal benchmark for computer use this year is expense reports. So I have to do my expense report every month. But what you need to do. So for example, I expense a lunch, I have to go back on the calendar and see who I was having lunch with. Then I need to upload the receipt of the lunch and I need to tag the person. The expense report, blah, blah, blah. Yeah. It's very simple on a task by task basis. Yeah. But like you have to go to every app. Right. That I use. You have to go to like the, you know, Uber app. You have to go to the camera roll to get the photo of the receipt, all these things. It's not, you cannot actually do it today, but it feels like a tractable problem. You know that probably by the end of the year we should be able to do it.Karina [00:57:49]: Yeah. This reminds me of like the idea of you kind of want to show to computer use agents how you would want. How you want or how you like booking your flights. It's kind of like a few shot. Yeah.swyx [00:58:03]: Demonstration.Karina [00:58:04]: Demonstrations of like maybe there is more efficient way that you do things that the model should learn to do it in that way. 
And so it's kind of like, again, comes back to like personalized tasks too is like right now task is just like where you're like rudimentary, but in the future tasks should become like much more personalized for your preferences.swyx [00:58:27]: Okay. Well, we mentioned that. Oh, I'll also say that I think one takeaway I got from your, this conversation is that ChatGPT will have to integrate a lot more with my life. Like you, you, you will need my calendar. You will need my email. Yes. Like for sure. And maybe you use MCP. I don't know. Have you, have you looked at MCP?Karina [00:58:43]: No, I haven't.swyx [00:58:44]: It's good. It's got a lot of adoption. Okay.Alessio [00:58:47]: Anything else that we're forgetting about or like maybe something that people should use more? Yeah. I don't know. Before we wrap on like the open AI side of things.Karina [00:58:56]: I think. I think like search product is kind of cool, like ChatGPT search. I think this idea of like, you know, like right now I'm thinking a lot of us, like, you know, the magic of ChatGPT when it first came out, it was like, you know, you ask something, any like instruction, and then like, it would like follow the instruction that you gave to a model, right? Like write a poem and we'll give you a poem. But I think like the magic of the next generation of ChatGPT is like actually, and we're like, we're marching towards that. It's like, when you ask a question, it's not just a question. It's not just going to be in the text output. The ideal output might be like in some form of like a react app on the fly or something. So like, this is happening with like search, right? Like give me like Apple stock and then it gives you the chart and gives you like this like generative UI. And I feel like this is what I mean by like the evolution of ChatGPT becomes like more of a generative OS with a task orientation or something. So it's like, and then UI will adapt to what you like. 
So like, if you really like 3D, what do you like? If you really like 3D visualizations, I think the model should give you as much visualization as possible. Like, you know, if you really like certain way of like the UIs, like maybe you like round corners. I don't know. It's just like some color schemes that you're like, it's just like the UI becomes like more dynamic and like becomes like a custom, custom model, like personal model, right? Like from personal computer to like a personal model, I think. Yeah.swyx [01:00:20]: Takes overall, you are one of the rare few people, actually, maybe not that rare. To work at both OpenAI and Anthropic.Karina [01:00:28]: Not anymore. Yeah.swyx [01:00:31]: Cultural difference. What are general takes that people like only like you see?Karina [01:00:35]: I love both places. I think I've learned so much at Anthropic and I'm really, really grateful to the people and I'm still like friends with a lot of people there. And I was really sad when John left OpenAI because I came to OpenAI because I wanted to work with the most or something. What's he doing now? But I think it changed a lot. So I think like... When I first joined Anthropic, they were like, I don't know, 60, 70 people. When they left, they were like 700 like people. So it's like a massive like growth. OpenAI and Anthropic is different in terms of like more like maybe like product mindset. Maybe OpenAI is much more willing to take some of the product risks and explore different bets. And I think Anthropic is much more focused and they have... I think it's fine. Like they have to like prioritize, but they definitely double down on like enterprise might be more than like consumers or something. I don't know. It's just like some of the product mindsets might be different. I would say like research, I've enjoyed like both like research cultures, both at Anthropic and like OpenAI. I feel like they are more... 
On the daily basis, I feel like it's more similar than different.swyx [01:01:50]: I mean, no surprise.Karina [01:01:52]: Like how you run experiments is kind of like very similar. I'm sure the Anthropic...swyx [01:01:55]: I mean, you know, Dario used to be VP research, right? So he set the culture at OpenAI. So yeah, it makes sense. Maybe quick takes on people that you mentioned. Barrett, you mentioned Mira. Like what's one thing you learned from Barrett, Mira, Sam, maybe? Something like that. Like one lesson that you would share to others.Karina [01:02:13]: I wish I like worked with them way longer. I think what I've learned from Mira is actually her like interdisciplinary mindset. She's really good at like connecting dots. Between like product and like kind of balancing like product research and like create this like comprehensive, like coherent story. Because sometimes like there are like researchers who like really hate doing product and there are researchers who really love doing product. And it's like kind of dichotomy between two and also like safety is like a part of this process. So kind of, you kind of want to like create this coherent, like think from like systems perspective. Or like think about like bigger picture. And I think I learned a lot from her on that. I definitely feel like I have much more creative freedom at OpenAI. And that's because the environment that the leaders set like enables me to do that. So it's like if I have an idea, if I want.swyx [01:03:10]: Propose it. Yeah, exactly. On your first month.Karina [01:03:11]: There's like more like creative freedom and like resource reallocation. Especially in research is like being adaptable to like new technologies and like change your views based on that. Yeah. Like you know, I've seen a lot of like researches that are like based on like empirical results or kind of like change the research directions. 
I've seen a lot of like, sometimes I've seen researchers who would just like get stuck on the same directions for like two to three years and they would never like work out or something, but they would still be like stubborn. So it's like adaptability to like new directions and like new paradigms. It's kind of like one of those things that-Alessio [01:03:42]: This is a Barrett thing or this is a general culture thing?Karina [01:03:45]: A general kind of culture, I think. Cool.Alessio [01:03:46]: Yeah. And just to wrap up, we just usually have a call to action.Alessio [01:03:52]: Do you want people to give you feedback? Do you want people to join your team?Karina [01:03:56]: Oh yeah, of course. I'm definitely hiring for like research engineers who are like more product minded people. So it's like people who know how to train the models, but also like interested in like deploying into like the products and developing like new product features. I'm definitely looking for those archetypes of like research engineers or like research scientists. So yeah. If you're like looking for a job, if you're like interested in joining my team, I'm like really looking forward to that. I'm definitely happy to just reach out, I guess.swyx [01:04:24]: And then just like generally, what do you want people to do more of in the world, whether or not they work with you, like, you know, call to action as in like everyone should be doing this.Karina [01:04:32]: I think this is something that I tell to a lot of like designers is that like, I think people should like spend more time just like play around with the models. And the more you play with a model, the more creative ideas you'll get around like what kind of like new potential features of the products or like new kinds of things. Kind of like interaction paradigms that you might want to create with those models. 
I feel like we are bottlenecked by like human creativity on like completely changing the way we think about the internet or like some of the, the way you think about software, like AI right now is pushes us to like rethink everything that we've done before in my view. And I feel like not enough people are either double down on like those ideas or I'm just like not seeing a lot of like human creativity in this like. Interface design or like product design mindsets. So I feel like it'd be really great for people to just like do that. And especially right now it's like research, some research becomes like much more product oriented. So it's like you actually can train the models for the things that you want to do in a product or something. Yeah.swyx [01:05:41]: And you define the process now. Now this is my go-to for how to manage a process. I think it's pretty common sense, but it's nice to hear from you that cause you actually did it. That's nice. Thank you for driving innovation, interface design and the new models at OpenAI and Anthropic. And we're looking forward to what you're going to talk about in New York. Yeah.Karina [01:06:01]: Thank you so much for inviting me here. I hope my job will not be automated by the time.swyx [01:06:06]: Well, I hope you automate yourself and we'll do whatever else you want to do. That's it. Thank you. Awesome. Thanks. This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
Transcript
Welcome back. From Sam Altman to Satya Nadella, many people are saying that 2025 is the year of agents.
Since our podcast conversations about DeepSeek, the mainstream narrative has become obsessed with DeepSeek R1
and what it means to have a competitive open-weights reasoning model from China.
swyx wrote a viral blog post about the reasoning price war of January 2025,
and today, OpenAI has responded by slashing the price of
o1-mini from $12 per million tokens to $4.40,
and also released o3-mini in ChatGPT and to level 3 and above API users for the exact same price.
Given that o3-mini matches or exceeds o1, especially with medium or high reasoning effort,
this is an enormous leap in performance per dollar.
In the meantime, the rest of OpenAI has been busy shipping.
ChatGPT has slowly accelerated from shipping Canvas during the 12 days of Shipmas last month
to shipping recurring Tasks and, most recently, Operator, the hosted virtual agent response to Claude's
computer use.
We are very proud to host today's guest, Karina Nguyen, who was at Anthropic for the launch
of Claude 3 and wrote the first 50,000 lines of claude.ai before joining OpenAI to work on the
future of what she calls reasoning interfaces.
We are very proud to also announce that Karina will be the closing keynote speaker for the
second AI Engineer Summit in New York City from February 20th to 22nd.
This is the last call for applications for the AI leadership track for CTOs and VPs of AI.
If you are building agents in 2025, this is the single best conference of the year.
Our new website now lists out
speakers and talks from DeepMind, Anthropic, OpenAI, Meta, Jane Street, Bloomberg, BlackRock,
LinkedIn, and more. Look for more sponsor and attendee information at apply.aI.org
and see you there. Watch out and take care.
Hey, everyone. Welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel,
and I'm joined by my usual co-host, swyx. Hey, and today we're very, very blessed to have
Karina Nguyen in the studio. Welcome. Nice to meet you. We finally made it happen.
I finally made it happen.
First time we tried this, you were working in a different company, and now we're here.
Fortunately, you had some time.
So thank you for joining us.
Karina, your website says you lead a research team in OpenAI, creating new interaction
paradigms for reasoning interfaces and capabilities like ChatGPT Canvas and most recently ChatGPT
Tasks.
I don't know.
Is that what we're calling it?
Streaming chain of thought for o1 models and more via novel synthetic model training.
What is this research team?
Yeah, I need to like clarify this a little bit more.
I think it changed a lot, like since the last time we launched, so we launched Canvas and it was like the first, like, project that I was a tech lead, basically.
And then I think over time I was like trying to refine what my team is.
And I feel like it's at an intersection of like human computer interaction, defining what the next interaction paradigms might look like with some of the most recent like reasoning models as well as actually trying to come up with like novel methods, how to improve
those models for certain tasks if you want to.
So for Canvas, for example, one of the most common use cases is basically writing and coding.
And we're continually working on like, okay, like, how do we make Canvas coding go beyond
what is possible right now?
And like that requires us to actually do like RL training and like coming up with
like new methods of like synthetic data generation.
The way I'm like thinking about it is that like my team is going like very full stack,
from like training models all the way up to like deployment and like making sure that we create
novel like product features that are coherent with what ChatGPT can become. There are different
types of like features like Canvas, Tasks, but all those components, they compose together
to evolve ChatGPT into something completely new, I think, in the new year. It's evolving. I like to
tweet about that. It's like sort of modular. You can compose it with the stocks feature, the creative
writing feature. I forget what else. We have a list of other use cases, but we don't have to
go into that yet. Can we maybe go back to when you first started working with LLMs? I know you
have some early UX prototypes with GPT-3 as well, on kind of like maybe how that has informed
the way you build products. I think my background was mostly like working on computer vision applications
for like investigative journalism back when I was like at school at Berkeley and I was working a lot
with like the Human Rights Center and like investigative journalists from various media.
And that's how I learned more about like AI, like with vision transformers.
And at that time, I was working with some of the professors at Berkeley AI research.
There are some Pulitzer Prize winning professors, right, that they teach there.
No.
So it's mostly like was reporting for like teams like the New York Times like the AP Associated Press.
So it was like all in the context of like human rights center.
Got it.
Yeah. So that was like in computer vision. And then I saw Chris Olah's work around, you know, like interpretability from Google. And that's how I found out about like Anthropic. And at that time I was just like, I think it was like the year when like Ukraine's war happened. And I was like trying to find a full time job. And it was kind of like all got distracted. It was like kind of like spring. And I was like very focused on like, figuring
out, like, what to do. And then my best option at that time was just, like, continue my internship
at the New York Times and convert to, like, full-time. At the New York Times, it was just, like,
working on, like, mostly, like, product engineering work around, like, R&D prototypes, kind of, like,
storytelling features on the mobile experience. So, like, kind of, like, storytelling experiences. And,
like, at that time, you were, like, thinking about, like, how do you employ, like, NLP techniques
to, like, scrape some of the archives from the New York Times or something. But then I
always wanted to like get into like AI and like I knew OpenAI for a while like since I was like in
Berkeley and yeah so I kind of like applied to Anthropic just on the website and I was rejected the first time
but then at that time they were not hiring for like anything like product engineering like front-end engineering
which was something that was like at that time I was like interested in and then um there was like a new
opening at Anthropic that was like kind of like a front-end engineer and so I applied and that's how
my journey began. But like the earlier prototypes was mostly like, I used like CLIP for like fashion
recommendation search. So it was like one of those successful projects, I think. And I was like,
before even coming to Anthropic, I was like thinking maybe I should just like do my own startup.
But I feel like I didn't have like enough confidence and conviction in myself that I could do that.
But it was like one of the early like prototypes. And I think Twitter is a good platform to like for
side projects. That's fantastic. Especially for something visual. Yeah. Yeah. We'll briefly
mention that the Ukrainian crisis actually hit home more for you than most people because you're
from Ukraine and you moved here, like, for school, I guess. Yeah. Yeah, yeah. We'll come back to that
if it comes up. But then you joined Anthropic not just as a front-end engineer. You were the first, is that
true, designer? Yeah. Yes, I think like I did both product design and front-end engineering together, and
like at that time it was like pre-ChatGPT. It was like, I think, August '22.
And that was a time when Anthropic really decided to, like, do more product-y-related things.
And the vision was, like, we need to, like, fund research.
And, like, building product is, like, the best way to, like, fund safety research,
which I find it quite admirable.
So the really first product that Anthropic built was, like, Claude in Slack.
And it was sunsetted not long after, but, like, it was, like, one of the first.
I think I still come back to that idea of, like,
Claude operating inside some of the organizational workplace like Slack.
And it's something magical in there.
And I remember we built like ideas like summarize the thread.
But you can like imagine having automated like ways of like maybe Claude should like
summarize multiple channels every week custom for what you like or for what you want.
And then we built some like really cool features like tag Claude and then ask it
to summarize what happened in the thread, suggest like new ideas.
But we didn't quite double down because you could like imagine like Claude having access to like
the files or like Google Drive that you can upload in Slack, just connectors, like connections
in the Slack. Also the UX was kind of constraining. At that time I was thinking like, oh, we wanted to
do this feature, but like Slack interface kind of constrained us to like do that and we didn't want to
like be dependent on the platform like Slack. And then after like ChatGPT came out, I remember
the first two weeks. My manager gave me this challenge like can I like reproduce kind of like
similar interface in like two weeks. And one of the early mistakes being in engineering is like I
said yes. Instead I should have said like, you know, it's double, 2x the time. Sure. And this is how like
claude.ai was kind of like born. Oh, so you actually wrote claude.ai as your first job?
Yeah, like I think like the first like 50,000 lines of code.
Yeah, yeah.
Without any reviews at that time because there's no one.
Yeah, it was like very small team.
It was like six, seven people; we were called like the deployment team.
Yeah.
On mine, I actually interviewed for Anthropic around that time.
I was given Claude in Sheets.
Oh, cool.
And that was my other form factor.
I was like, oh, yeah, this needs to be in a table.
So we can just copy paste and just span it out, which is kind of cool.
The other rumor that we might as well just mention: Raza Habib from Humanloop
often says that, you know, there was some version of ChatGPT in Anthropic.
Like, you had the chat interface already.
Like, you had Slack.
Why not launch a web UI?
Like, basically, like, how did OpenAI beat Anthropic to ChatGPT, basically?
Well, at the time, it seems kind of obvious to have that.
I think the ChatGPT model itself came out way before we decided to, like, launch Claude 2 necessarily.
And I think, like, at that time, Claude 1.3 had a lot of hallucinations, actually.
So I think there was, like, one of the concerns is, like, I don't think, like,
the leadership was convinced, had a conviction that this is the model that you need to,
like, you want to, like, deploy or something.
So it was a lot of discussions around that time.
But Claude 1.3 was, like, I don't know if you played with that, but it's, like,
extremely creative.
And it was, like, really cool.
Nice.
It's still creative.
And you had a tweet
recently where you said things like Canvas and Tasks could have happened two years ago, but they did not.
Do you know why they did not? Was it too many researchers at the labs not focused on UX? Was it just
not a priority for the labs? Yeah, I come back to that question a lot. I guess like I was working
on something similar to like Canvas, but for Claude at that time in like 2023, it was the same
similar idea of like a Claude workspace where a human and Claude could have like a shared
workspace.
And that's Artifacts?
Right.
No, no, no.
This is Claude Projects.
I don't know.
I think it kind of evolved.
I think like at that time I was like in product engineering team and then I switched to like
research team and the product engineering team grew so much.
They had their own ideas of like Artifacts and like Projects.
So not necessarily.
Maybe they looked at my like previous explorations.
But like, you know, when I was exploring like Claude documents or like Claude workspace,
it was like, I don't think anybody was thinking about UX as much or like not many like
researchers understood that.
And I think the inspiration actually for, I still have like all the sketches, but the inspiration
was like from the Harry Potter like Tom Riddle.
Yeah, yeah, yeah, yeah.
That was an inspiration like having Claude writing into the document or something and
communicate back.
So like in the movie you write a little bit and then it answers you.
Yeah.
Okay.
Interesting.
But that was like.
And only in the context of, like, writing, I think Canvas is like more also serves like coding,
one of the most common use cases.
But yeah, I think like those ideas could have happened like two years ago.
Just like maybe I don't think it was like a priority at that time.
It was like very unclear.
I think like AI landscape at that time was very nascent.
If that makes sense.
Like nobody like even when I would talk to like some of the designers at that time, like product
designers, they were not even thinking about that at all.
did not have like AI in mind, and like it's kind of interesting, except for one of my
designer friends, his name is Jason Yuan. Yeah. Who was thinking about that. And Jason now is at New
Computer. Yes. We'll have him on at some point. I had him speak at my first summit and you're speaking
at the second one, which will be really fun. Nice. Still on Anthropic for a bit, and we'll move on to
more recent things. I think the other big project that you were involved with was just
Claude 3. Just tell us the story. Like, what was it like to launch one of the biggest launches of the year?
Yeah, I think like I was, so Claude 3.
This is Haiku, Sonnet, Opus, all at once, right?
Yes.
Yeah.
It was the Claude 3 family.
I was a part of the post-training team.
We only had like, what, like 10, 12 people involved.
And it was really, really fun to work together as friends.
So, yeah, I was mostly involved in like the Claude 3 Haiku post-training side and then evaluations,
like developing new evaluations and like literally writing the entire like model card and I had a lot of fun.
I think like the way you train the model is like very different obviously but I think what I've
learned is that like you will end up with like I don't know like 70 models and every model will have
its own like brain damage and like so it's like like like kind of just like personality wise or
performance benchmarks and I think like it's like one of the interesting like research questions
like how do you understand like the data interactions as you like train the model?
It's like if you train the model on like contradictory data sets,
how can you make sure that there won't be like any like weird like side effects?
And sometimes you get like side effects.
And like the learning is that you have to like iterate very rapidly and like have to like debug
and detect it and make like address it with like interventions.
And actually some of the techniques from like software engineering is very like,
useful here. It's like how do you debug code? Yeah, exactly. So I really empathize with this because
data sets, if you put in the wrong one, you can basically kind of screw up like the past month of
training. The problem with this for me is the existence of YOLO runs. I cannot square this with YOLO runs.
If you're telling me like you're taking such care about datasets, then every day I'm going to
check in, run evals and do that stuff. But then we also know that YOLO runs exist. So how do you square
that? Well, I think it's like dependent
on how much compute you have, right?
So it's like, it's actually a lot of questions
in, like, research are, like,
how do you most effectively use the compute that you have?
And maybe you can have, like, two to three runs
that are only, like, YOLO runs.
But if you don't have a luxury of that,
like, you kind of need to, like, prioritize ruthlessly
and, like, what are the experiments that are most important to, like, run?
Yeah.
I think this is what, like, research management is, basically.
is like how do you
funding efforts
yeah
prioritizing
take like research bets
and make sure that
you build the conviction
in those bets rapidly
such that if they work out
you like double down on them
yeah
you almost have to like kind of
ablate datasets too
and like do it on the side channel
and then merge it in
yeah it's kind of super interesting
tell us more like what's your favorite
So I have this in front of me,
the model card.
You say constructing
this table was slightly painful.
just pick a benchmark
and what's an interesting story behind one of them?
I would say GPQA was kind of interesting.
I think it was like the first, I think we were the first lab.
Like, Anthropic was the first lab to like run.
Oh, because it was like relatively new after NeurIPS?
Yeah, yeah.
Okay.
Published GPQA like numbers.
And I think one of the things that we've learned was that I personally learned about
that like any evals.
Like some evals are like very like high variance.
And like GPQA like happened to be like
a huge, like, high variance, like, evaluation.
So, like, one thing that we did is, like, run it, like,
five times and, like, take the average.
But, like, the hardest thing about, like, the model card is, like, none of the numbers
are, like, apples to apples.
So you actually need to, like, go back to, like, I don't know, like, the GPT-4 model card
and, like, read the appendix just to, like, make sure that, like, the settings were the same
as you're running the settings too.
So it's like never an apples to apples.
Yeah.
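A minimal sketch of the averaging idea described above: run a high-variance eval several times and report the mean together with the spread across runs. The function name and the five accuracy values are hypothetical, made up purely for illustration; they are not numbers from any model card.

```python
import statistics


def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) of per-run accuracies."""
    mean = statistics.mean(accuracies)
    # stdev needs at least two data points; report 0.0 for a single run.
    spread = statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0
    return mean, spread


# Hypothetical accuracies from five runs of the same eval (made-up values).
runs = [0.48, 0.52, 0.50, 0.46, 0.54]
mean, spread = summarize_runs(runs)
print(f"mean accuracy = {mean:.3f}, std across runs = {spread:.3f}")
```

Reporting the spread next to the mean makes it easier to tell whether a gap between two models is larger than the run-to-run noise.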
But it's interesting how like, you know,
when you market models as products,
like customers don't necessarily know.
Yeah.
They're just like, my MMLU is 99.
What do you mean?
Yeah, exactly.
Why isn't there an industry standard harness, right?
There's Eleuther's thing,
which it seems like none of the model labs use.
And then OpenAI put out simple-evals
and nobody uses that.
Why isn't there just,
like, one standard way everyone runs this?
Because the alternative approach is you rerun your e-vals on their models.
And obviously, your numbers will be lower.
Yeah.
And they'll be unhappy.
So that's why you don't do that.
I think it operates on an assumption that like the models, the next generation of the model
or the model that you produce next is going to behave the same.
So for example, like I think the way you prompt o1 or like Claude 3 is going to be
very different from each other.
I feel like there's a lot of like prompting that you need to do to get the evals to run correctly.
So sometimes the model will just like output like new lines.
And the way you parse it will be like incorrect or something.
This has happened with like Stanford, I remember.
Like when Stanford had this also like they were like running benchmarks.
Yeah, HELM.
Yeah.
And somehow like Claude was like always like not performing well.
And that's because like the way they prompted it was kind of wrong.
So it's like a lot of like techniques.
It's just like very hard because like nobody even knows.
Has that gone away with chat models instead of, you know, just raw completion models?
Yeah, I guess like each.
Structured output also can be run in a very different way.
Sometimes you can like ask the model to output in like XML tags,
but some models are not really good at XML tags.
And so it's like, do you change the formatting per model or like do you run the same format across all models?
And then like the metrics themselves, right?
like maybe, you know, accuracy is like one thing, but maybe you care about like some other
metrics, like, up school or like some other like things.
Yeah.
It's like hard.
I don't know.
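The parsing problems mentioned above (stray newlines, XML-tagged output) can be sketched as a small answer extractor for a multiple-choice eval. This is only an illustration under stated assumptions: the `<answer>` tag name, the A-D choice set, and the function name are hypothetical, not any lab's or harness's actual format.

```python
import re


def extract_choice(output):
    """Pull a multiple-choice letter (A-D) out of a raw model completion."""
    # Prefer an explicit <answer>...</answer> tag (hypothetical tag name).
    tagged = re.search(r"<answer>\s*([A-D])\s*</answer>", output, re.IGNORECASE)
    if tagged:
        return tagged.group(1).upper()
    # Otherwise strip leading/trailing whitespace and newlines, and accept
    # only a bare letter such as "B", "(B)" or "B."; anything else is unparsed.
    plain = re.fullmatch(r"\(?([A-D])\)?\.?", output.strip())
    return plain.group(1) if plain else None


print(extract_choice("\n\nB\n"))
print(extract_choice("Reasoning...\n<answer> c </answer>"))
print(extract_choice("I am not sure"))
```

Returning `None` instead of guessing keeps unparsed completions visible, so a formatting failure is not silently scored as a wrong answer.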
And talking about o1 prompting, we just had an o1 prompting post on the newsletter, which I think
apparently went viral within OpenAI.
Yeah.
I don't know.
I got pinged by other OpenAI people.
They were like, is this helpful to us?
I'm like, okay.
Oh, nice.
I think it's like maybe one of the top three most-read posts now.
Yeah.
And I didn't write it.
Exactly.
What are your tips on o1 versus like Claude prompting or like what are things that you took away from that experience?
And especially now I know that with 4o for Canvas, you've done RL after on the model.
So yeah, just general learnings
on how to think about prompting these models differently.
I actually think like o1, I did not even harness the magic of like o1 prompting.
But like one thing that I found is that like if you give o1 like hard like constraints,
of what you're looking for.
Basically, the model
will have a much easier time to, like,
kind of like, select the candidates and match,
like, the candidate that, like, most
fulfills the criteria that you gave.
And I think there's a class of problems like this
that o1 excels at.
For example, if you have a question, like,
a bio question on, like, some,
or, like, in chemistry, right?
Like, if you have, like, very specific criteria
with the protein or like some of the chemical bindings or something,
like then the model will be really,
it would be really good at like determining the exact candidate
that will match the certain criteria.
I have often thought that we need a new IFEval for this
because this is basically kind of instruction following, isn't it?
Yes.
But I don't think IFEval has like multi-step IFEval.
Yeah.
So that's what basically I use AI News for.
I have a lot of prompts and a lot of steps and a lot of criteria
and O1 just kind of checks through each kind of systematically.
And we don't have any evals like that.
Yeah.
Does OpenAI know how to prompt o1?
I think that's kind of like that.
You know, Sam is always talking about incremental deployments and kind of like getting, having people getting used to it.
When you release a model, you obviously do all the safety testing.
But do you feel like people internally know how to get 100% out of the model?
Or like are you also spending a lot of time learning from like the outside on how to better prompt o1 and like all these things?
Yeah, I certainly think that you learn so much from like external
feedback too on how people use like o1.
I think like a lot of people use o1 for like really hardcore like coding questions.
I feel like I don't fully know how to.
Yeah, you release the model.
o1, except for like I use o1 to just like do some like synthetic data explorations.
But that's it.
Do people inside of OpenAI, once the model is coming out, do you get like a company-wide
memo of like, hey, this is like how you should try
prompting this, especially for people that might not be close to it during development, you know,
or I don't know if you can share anything, but I'm curious how internally these things kind of get
shared. I feel like I'm like in my own little corner and like research. I don't really like
look at some of the Slack channels. It's very, very big. So I actually don't know if something like this
exists, probably. It might exist because we need to share to like customers or like, you know,
like some of the guides on like how to use this model.
So probably there is.
I often say this.
The reason that AI engineering can exist outside of the model labs is because the model labs
release models with capabilities that they don't even fully know because you never train
specifically for it.
It's emergent.
And you can rely on basically crowdsourcing the search of that space or the behavior space to
the rest of us.
Yeah.
So like you don't have to know.
Yeah.
Yeah.
I think like, um,
An interesting thing about, like, o1 is that, like, it's really, for, like, average human,
sometimes I don't even know whether the model, like, produced the correct output or not.
Like, it's really hard for me to, like, verify.
Even, like, hard, like, STEM questions.
I don't know.
If I'm not an expert, like, I usually don't know.
So it's, like, the question of, like, alignment is actually more important, like, for
these, like, complex reasoning models.
To, like, how do we help humans to, like, verify the outputs of these models?
It's quite important.
And I feel like, yeah, like learning from external feedback is kind of cool.
For sure.
One last thing on Claude 3.
You had a section on behavioral design.
Yes.
Anthropic is very famous for the HHH goals.
What were your insights there?
Or, you know, maybe just talk a little bit about what you explored.
Yeah.
I think like behavioral design is like a really cool.
I'm glad that I made it like a section around this.
And so like really cool.
I think like.
Like you weren't going to publish one and then you insisted on it or what?
I think I just like put the section inside it and like, yeah, Jared, like one of my most
favorite researchers, was like, yeah, that's cool. Let's let's do that, I guess. Yeah, like nobody had
this like term of like behavioral design necessarily for the models. It's kind of like a new
little field of like extending like product design into like the model design. Right.
So how do you create a behavior for the model in certain contexts? So for example, like in
canvas, right? Like, one of the things that we had to, like, think about is like, okay, like, now the
model enters, like, more collaborative environment, more collaborative context. So, like, what's the
most appropriate behavior for the model to act like as a collaborator? Should it ask, like, more
follow-up questions? Should it, like, change? What's the tone should be? Like, what is the
collaborator's tone? It's different from, like, a chat, like, conversationalist versus, like,
collaborator. So how do you shape the persona and the personality around that? It has like some
philosophical questions too. Like yeah, behavioral. I mean like I guess like I can talk more about like
the methods of like creating the personality. Please. It's the same thing as like you would create like
a character in a video game or something. It's kind of like. Charisma, intelligence, wisdom.
What are the core principles? Helpful, harmless. And obviously for Claude this was much easier
than I would say like for ChatGPT. For Claude it's like baked in, like, the
mission, right? It's like honest, harmless, helpful. But the most complicated thing about like the
model behavior or like the behavioral design is that like sometimes two values would contradict each other.
I think this happened in Claude 3. One of the main things that we were thinking about is like how
do we balance this like honesty versus like harmlessness or like helpfulness, as like we don't want the model to
always, like, refuse even to, like, innocuous queries, like, some, like, creative writing
prompts, but also you don't want the model to, like, be harmful or something.
So it's, like, there's always a balance between those two.
And it's more, like, art than the science, necessarily.
And this is what datasets craft is, like, more of an art than a literal science.
You can definitely do, like, empirical research on this.
but it's actually like, like, this is the idea of like synthetic data. Like, if you look back to the Constitutional AI paper,
it's around like how do you create completions such that you would agree to certain, like, principles that you want your model to agree on.
So it's like if you create the core values of the models, how do you decompose those core values into like specific scenarios?
So how does the model need to express its honesty in a variety
of, like, scenarios? And this is where like generalization happens when you craft the persona
of the model.
Yeah.
It seems like what you describe is behavior modification or shaping as a side job that was done
and done. I mean, I think Anthropic has always focused on it the first and the most.
But now it's like every lab has sort of a vibes officer.
For you guys, it's Amanda.
For OpenAI, it's Roon.
And then for Google, it's Steven Johnson and Raiza, who we have
had on the podcast. Do you think this is like a job? Like it's like a like every every company needs a
taste maker. I think the model's personality is actually the reflection of the company or the
reflection of the people who create that model. So like for Claude, I think Amanda was doing
a lot of like Claude character work and I was working with her at the time. But there was no team,
like Claude character team. Now there's a little bit of a team. Isn't that cool? But before that there was
none. I think like actually with Claude 3 we kind of doubled down on the feedback from
Claude 2. Like people, we didn't even like think, but like people said like Claude 2 is like so
much better at like writing and like has certain personality even though it was like unintentional
at all. And we did not pay that much attention. I didn't know even how to like productionize
this property of model being better like personality until like with Claude 3 we kind of like
had to like double down because we knew that we'd launch like in chat. We wanted to like, Claude
honesty is like really good for like enterprise customers. So we kind of wanted to like make
sure the hallucinations went down, like factuality would like go up or something. We didn't have a team
until or after like Claude 3, I guess. Yeah. I mean it's growing now and I think anyway everyone's
taking it seriously. I think that at OpenAI there is a team called model design. It's Joanne, the
PM. She's leading that team and I work very close with those teams. We were working on like
actually writing improvements that we did with ChatGPT last year.
And then I was working on like this collaboration.
Like how do you make ChatGPT act like a collaborator for like Canvas?
And then, yeah, we worked together on some of the projects.
I don't think it's publicly known his actual name other than Roon.
But he's mostly, he's mostly doxxed.
We'll beep it and then people can guess.
Do we want to move on to OpenAI and some of the recent work,
especially you mentioned Canvas?
So the first thing about Canvas is like it's not just a UX thing.
You have a different model in the backend, which you post-trained on o1-preview distilled data, which was pretty interesting.
Can you maybe just run people through?
You come out with a feature idea maybe.
Then how do you decide what goes in the model, what goes in the product and just that process?
Yeah.
I think the most unique thing about ChatGPT Canvas was that it was also the team formed out of the air.
So it was like July 4th or something during the break.
Like Independence Day.
They just like, okay.
I think there was some like company break or something.
And I remember I was just like taking a break.
And then I was like pitching this idea to like Barret Zoph, who was my manager at that time.
I was like, I just want to like create this like canvas or something.
And I really didn't know how to like navigate OpenAI.
It was like my first like, I don't know, like first month at OpenAI.
And I really didn't know how to, like, navigate, how do I get product to work with me?
Or, like, some of the ideas, like, some of the things like this. So I'm really grateful for, like, actually, Barret and Mira, who helped me to, like, staff this project, basically.
And I think that was really cool.
And it was, like, this 1st of July.
And, like, Barry, who's, like, an engineering manager, was, like, yeah, actually, we should, like, staff this project with, like, five, six engineers or something.
And then Karina can be, like, researcher on this project.
And I think like this is how the team was formed.
This was kind of like out of the air.
And so like I didn't know anyone there at that time, except for Thomas Dimson.
He did like the first like initial like engineering prototype of the canvas and it kind of like riffed off.
But I think we learned a lot on the way how to work together as product and research.
And I think this is one of the first projects at OpenAI where research and product worked together from the very
beginning. And what has made it like a successful project, in my opinion, is because like designers,
engineers, PM and research team were all together, and we would like push back on each other. Like if
like it doesn't make sense to do it on a model side, like we had to like collaborate with like
applied engineers to like make sure this is being handled on the applied side. But the idea is you can
go that far with like a prompted baseline. Prompted ChatGPT was kind of like the first
thing that we tried was like a canvas as a tool or something. So how do we define the behavior
of the canvas? But then like we found a bunch of like different like edge cases that we wanted
to like fix. And the only way to like fix some of the edge cases is actually through post training.
So we actually, what we did was actually retrain the entire 4o plus our canvas stuff. And this is like,
there are like two reasons why we did this. The first one is that we wanted to
ship this as a beta
model in the drop-down menu. We could, like, rapidly iterate on users' feedback as we ship it
and not go through the entire, like, integration process into, like, this, like, new one model
or something, which took some time, right? So from, like, from beta to, like, GA, it took, I think,
three months. So we kind of wanted to, like, ship our own model with that feature to, like,
learn from the user feedback very quickly. So that was, like, one of the decisions we made. And then
with canvas itself, we just had a lot of like different like behavioral, it's again, like,
behavioral engineering. It's like various behavioral craft around like when does canvas need to
write comment? When does it need to like update or like edit the document? When does it need to edit the
entire, like rewrite the entire document versus like edit very specific section of the user
asks? And when does it need to like trigger the canvas itself was one of those those like behavioral
engineering questions that we had. At that time, I was also working with, like, writing quality.
So that was, like, the perfect way for us to, like, literally both teach the model how to use canvas,
but also, like, improve writing quality if writing was, like, one of the main use cases for
ChatGPT. So I think that was, like, the reasoning around that. There's so many questions.
Oh, my God. Yeah. Quick one. What does improve writing quality mean? What are the evals?
What are the evals? Yeah. So the way I'm thinking about it is, like, we have
two various directions. The first direction is like how do you improve the quality of the writing
of the current use cases of ChatGPT? And those, most of the use cases are mostly like non-fiction
writings. It's like email writing or like some of the maybe blog posts, cover letters is like
one of the main use cases. But then the second one is like how do we teach the model to literally
think more creatively or like write in a more creative manner such that it will like,
like just create novel forms of writing.
And I think the second one is like much of a longer term, like research question,
while the first one is more like, okay, we just need to improve data quality for the writing
use cases that the models are used for.
It is more straightforward question.
But the way we evaluated the writing quality, so actually I worked with Joanne's team on
the model design.
So they had a team of like model writers and we would work together.
And it's just like a human eval.
It's like internal human eval where we just...
Always like that.
Yeah, on the prompt distribution that we cared about.
Like, we want to make sure that the models that we like used, that we trained were always like better or something.
Yeah.
So like some test set of like 100 prompts that you want to make sure you're good on.
I don't know how big the prompt distribution needs to be because you are literally catering to everyone.
Right.
Yeah.
I think it was much more opinionated way of like improving.
writing quality because we worked together with like model designers to like come up with like core
principles or what makes this particular writing good. Like what does make email writing good? And we had to
like craft like some of the literally like rubric on like what makes it good. And then make sure during
the eval we check the marks on this like rubric. Yeah. That's what I do. Yeah. That's what school.
It's like most common practice. Yeah. It's really funny. Like yeah. That's exactly how we grade essays.
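Editor's sketch: the rubric approach described above can be illustrated in a few lines of Python. This is only a minimal illustration under stated assumptions, not OpenAI's internal eval. The criteria, the weights, and the names `Criterion`, `EMAIL_RUBRIC`, `rubric_score`, and `win_rate` are all hypothetical, and the check marks are assumed to come from human graders.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str
    weight: float


# Hypothetical rubric for email writing; criteria and weights are illustrative only.
EMAIL_RUBRIC = [
    Criterion("states the ask in the first two sentences", 2.0),
    Criterion("tone matches the recipient", 1.0),
    Criterion("no filler phrases", 1.0),
]


def rubric_score(marks: dict, rubric: list) -> float:
    """Return the weighted fraction of rubric criteria a grader checked off."""
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if marks.get(c.name, False))
    return earned / total


def win_rate(candidate_scores: list, baseline_scores: list) -> float:
    """Fraction of prompts where the candidate model outscores the baseline."""
    wins = sum(c > b for c, b in zip(candidate_scores, baseline_scores))
    return wins / len(candidate_scores)
```

A grader fills in one `marks` dictionary per response; comparing per-prompt scores for a trained model against the baseline on the same prompt set gives a simple "is it better on the distribution we care about" number.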
Yeah, I guess my question is, when do you work the improvements back in the model?
So the Canvas model is better at writing.
Why not just make the core model better too?
So, for example, I built this small podcast thing for a podcast,
and I have the 4o API and I asked it to write a write-up about the episode based on the transcript,
and then I've done the same in Canvas.
The Canvas one is a lot better.
Like the one from the raw 4o starts, "The podcast delves..."
And I was like, no, I'm not, delve in the third word.
Why not put them back in 4o core or is there just like-
I think we put it back in the core.
Yeah, so like, so the 4o canvas now is the same as 4o?
Yeah, you must have missed that update.
Yeah, what's the, what's the process to?
But I think the models are still a little bit different.
It's just like an A/B test almost, right?
To me it feels, I mean, I've only tried it like three times,
but it feels the canvas, the canvas output feels very different than the API output.
Yeah, yeah.
I think like there's always like a difference in the.
model quality. I would say like the original beta model that we released with canvas was actually
much more creative than even right now when I use like 4o with canvas. I think it's just like
the complexity of like the data and the complexity of the it's kind of like versioning issues right here.
It's like okay like your version 11 will be very different from like version 8 right. It's like
even though like the stuff that you put in is like the same or something. It's a good time to say
that I have used it a lot more than three times.
I'm a huge fan of canvas.
I think it is, like, it's weird when I talk to my other friends.
They don't really get it yet or they don't really use it yet.
I think because it's maybe sold as like sort of writing help when really like it's kind of it's
the scratch pad.
Yeah, what are the core use cases or like yeah.
Oh yeah, I'm curious.
Literally drafting anything.
Like I want to draft like copy for my conference that I'm running.
Like I'll put it there first and then I like it'll just have the canvas up and I'll
say what I don't like about it and it changes. I will maybe edit stuff here and paste in. So,
so for example, like I wanted to draft a brainstorm list of reasons, of signs that you may be an
NPC just for fun, just like a blog post for fun. And I was like, okay, I'll do 10 of these and then I want
you to generate the next 10. So I wrote 10. I pasted it into ChatGPT and it generated
the next 10 and they all sucked. All horrible. But it also spun up the canvas with the blog post.
and I was like, okay, self-critique why your output sucks and then try again.
And it just iterates on the blog post with me as a writing partner.
And it is so much better than, I don't know, like intermediate steps.
I was like that would be my primary use case.
It's like literally drafting anything.
I think the other way that I'll put it, I'm not putting words in your mouth,
this is how I view what canvas is and why it's so important.
It's basically an inversion of what Google Docs wants to do with Gemini.
So Google Docs on the main screen and then Gemini on the side. And what ChatGPT
has done is do the chat thing first and then the docs on the side. But it's kind of like a
reversal of what is the main thing. Like Google Docs starts with the canvas first that you can edit
and whatever. And then you maybe sometimes you call in the AI assistance. But ChatGPT,
what you are now is you're kind of AI first with the side output being Google Docs.
I think we definitely want to improve like writing use case in terms of like how do we make it easier
for people to format or like do some of the editing.
I think there is still like a lot of room for improvement, to be honest.
I think the other thing is like coding, right?
I feel like one of the things that we're like doubling down is actually like executing
code inside the canvas.
Since there's a lot of questions like, how do you evolve this?
It's kind of like IDE for both.
And I feel like this is where I'm coming from.
It's like ChatGPT evolves into this blank interface which can morph itself into
whatever you try, like the model should try to like derive your true intent and then modify
the interface based on your intent. And then if you like writing, it should become like the most
powerful like writing IDE possible. If it's like coding, it should become like a coding IDE or something.
I think it's a little bit of an odd decision for me to call those two things the same product
name because they're basically two different UIs. Like one is code interpreter plus plus.
Yeah. Yeah.
that was canvas.
Yes.
I don't know if you have other thoughts on Canvas.
No, I'm just curious, maybe some of the harder things.
So when I was reading, for example, forcing the model to do targeted edits versus
like full rewrite, it sounds like it was like really hard.
In the AI engineer mind, maybe sometimes it's like just pass one sentence in the prompt,
it's just going to rewrite that sentence right.
But obviously it's harder than that.
What are maybe some of the like hard things that people don't understand from the outside
and building products like this?
I think it's always hard with any new like product feature, like canvas
or tasks or like any other new features.
Like you don't know how people would use this feature.
And so how do you even like build evaluations that would simulate how people would use
this feature?
And it's always like really hard for us.
Therefore like we try to like lean on to like iterative deployment in order to like learn
from user feedback as much as possible.
Again, it's like we didn't know that like code diffs were very difficult
for a model, for example.
Again, it's like, do we go back to, like, fundamentally improve, like, code diffs as a model
capability?
Or do you, like, do a workaround where the model will just, like, rewrite the entire document,
which yields, like, higher accuracy?
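Editor's sketch: one way to picture the trade-off between targeted edits and a full rewrite is an application-side fallback. The Python below is a hypothetical illustration, not how Canvas is implemented; `apply_targeted_edit` and `apply_edit` are made-up helper names, and the edit format (an exact `old` snippet plus its `new` replacement) is an assumption.

```python
from typing import Optional


def apply_targeted_edit(document: str, old: str, new: str) -> Optional[str]:
    """Apply a search-and-replace edit only if `old` matches exactly once.

    Zero matches (the model misquoted the text) or several matches
    (the edit is ambiguous) both return None.
    """
    if old and document.count(old) == 1:
        return document.replace(old, new)
    return None


def apply_edit(document: str, old: str, new: str, full_rewrite: str) -> str:
    """Prefer the targeted edit; fall back to the model's full rewrite."""
    edited = apply_targeted_edit(document, old, new)
    return edited if edited is not None else full_rewrite
```

The targeted path leaves the rest of the document untouched but fails whenever the quoted span is not reproduced exactly; the rewrite path always applies but regenerates everything, which is the accuracy versus precision choice described here.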
And so those are, like, some of the decisions that we had to, like, make as, yeah, how do you,
like, improve the bar of the product quality, but also make sure the model quality
is also a part of it, and like what kind of like trade-offs you're okay to do. Again, I think it's like
a new way of product development, it's more like product research. Model training and like product
development go like together hand-in-hand. This is like one of the hardest things, like defining
the entire like model behavior. I think just like there's so many edge cases that might happen,
especially when you like do canvas with like other tools, right? Like canvas plus DALL-E, canvas plus
search. If you like select a certain section and then like ask for search, like how do you build such
evals? Like what kind of like features or like behaviors that you care the most about? And this is how
you build evals. You test it against every feature of ChatGPT? Uh, no, I mean, I don't think there's that many
that you can, right? It will take forever. But it's the same as in, the decision boundary between like
Python ADA, Advanced Data Analysis, versus canvas is one of the most
trickiest, like, decision boundary,
behaviors that we had to, like, figure out.
Like, how do you derive the intent from the human user query?
Yeah.
And how do I say this?
Deriving the intent, meaning, does the user expect canvas or some other tool
and then, like, make sure that it's, like, maximally, like, the intent was,
is, like, actually still one of the hardest problems.
Yeah.
Especially with, like, agents, right?
Like, you don't want, like, agents to go.
for like five minutes and do something on the background and then come back with like some
mid answer that you could have gotten from like a normal model or like the answers that you
didn't even want because it didn't have enough context. So it didn't like follow up correctly.
You said the magic word. We have to take a shot every time you say it. You said agents.
So let's move to Tasks. You just launched Tasks. What was that like? What was the story? I mean, it's your
baby.
So now that I have a team, I actually like, Tasks was purely like my resident's project.
I was mostly a supervisor.
So I kind of like delegated a lot of things to my resident.
His name is like Vivek.
And I think this is like one of the projects where I learned management, I would say.
Yeah.
But it was really cool.
I think it's very similar.
I'm trying to replicate canvas operational model.
how do we operate with product people or like product applied orgs with research.
And the same happened.
I was trying to replicate the methods and replicate the operational process with tasks.
And actually tasks was developed in less than like two months.
So if Canvas took like, I don't know, four months, then tasks took like two months.
And I think, again, like it's kind of very similar process of like how do we build evals.
You know some people like ask for like reminders in actual ChatGPT, but then like obviously
Even though they know it doesn't work.
Yeah.
It doesn't know very much.
So like there is some like demand or like desire from users to like do this.
And actually I feel like task is like simple feature in my opinion.
It's something that you would want from any model, right?
But then the magic is like when actually because the model is so general, it knows how to
use search or like canvas or like create sci-fi.
stories and create Python puzzles, when coupled with tasks, it actually becomes like really,
really powerful. It was like the same ideas of like, how do we shape the behavior of the model?
Again, we shipped it as like a beta model in the model drop down. And then we are working
towards like making that feature integrated in like the core model. So I feel like the principles
that like everything should be like in one model. But because of some of the operational
difficulties, it's much easier to like deploy as a separate model first to like learn from
the user feedback and then iterate very quickly and then improve into the core model basically.
Again, this project was also like together from the very beginning:
designers, engineers, researchers were working all together and together with model designers
we were like trying to like come up with like evals, evaluations and like testing and like
bug bashing and it's like a lot of cool like synergy.
Evals, bug bashing.
I'm trying to distill, I would love a canvas for this,
distill what the ideal product management or research management process is, right?
Start from like, do you have a PRD?
Do you have a doc that like these these things?
Yes.
And then from PRD you get funding maybe or like, you know, staffing resources, whatever.
Yes.
And then prototype maybe.
Yeah, prototype.
I would say like prototype was prompted baseline.
Everything's
always prompted baseline. And then like we craft like certain like evaluations that you want to like capture.
Okay. You want to like measure progress at least for the model. And then make sure that evals are good. And make sure that the prompted baseline actually fails on those like evals. Because then you have like evals to like hill climb on. And then once you start iterating on the model training, it's actually very iterative. So like every time you train the model, you like look at the benchmarks or like look at your evals, and like if it goes up, it's like good. But
then also you don't want to like, you want to make sure it's not like super overfitting.
Like that's where you run on other evals, right?
Like intelligence evals and then like...
You don't want regressions on the other stuff.
Yes.
Okay.
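Editor's sketch: the loop described here (the prompted baseline should fail the target evals, training should raise them, and other evals should not regress) can be written down as a small check. This is a hedged illustration only; the eval names, scores, thresholds, and the functions `baseline_leaves_headroom` and `review_checkpoint` are hypothetical and not taken from the conversation.

```python
def baseline_leaves_headroom(baseline: dict, target_evals: list, ceiling: float = 0.9) -> bool:
    """A target eval is only useful for hill-climbing if the prompted baseline fails it."""
    return all(baseline[name] < ceiling for name in target_evals)


def review_checkpoint(baseline: dict, candidate: dict, target_evals: list, tolerance: float = 0.01) -> dict:
    """Compare a trained checkpoint to the prompted baseline.

    `improved` lists target evals that went up; `regressed` lists non-target
    evals that dropped by more than `tolerance`.
    """
    improved = [n for n in target_evals if candidate[n] > baseline[n]]
    regressed = [
        n for n in baseline
        if n not in target_evals and candidate[n] < baseline[n] - tolerance
    ]
    return {
        "improved": improved,
        "regressed": regressed,
        "ship_candidate": len(improved) == len(target_evals) and not regressed,
    }


# Made-up scores purely for illustration.
baseline = {"canvas_trigger": 0.55, "targeted_edit": 0.40, "general_knowledge": 0.80}
candidate = {"canvas_trigger": 0.85, "targeted_edit": 0.70, "general_knowledge": 0.75}
targets = ["canvas_trigger", "targeted_edit"]
report = review_checkpoint(baseline, candidate, targets)
```

With these numbers both target evals improve, but the checkpoint is held back because the unrelated `general_knowledge` eval dropped, which is the overfitting and regression check mentioned above.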
Is that your job or is that like the rest of the company's job to do?
I think it's mainly my like the job of the people who like...
Because regressions are going to happen and you don't necessarily own the data for the other stuff.
What's happening right now is that like you basically only like ablate your datasets, right?
So it's like you compare on the baseline.
You compare like the regressions on the baseline model.
Model training and then bug bash.
And that's about it.
And then ship.
Actually, I did the course with Andrew Ng who...
Yes.
There was like one little lesson around this.
Okay.
I haven't seen.
Product research.
You tweeted a picture with him and it wasn't clear if you were working on a course.
I mean, it looked like the standard course picture with Andrew Ng.
Yes.
Okay, there was a course with him.
Was that like working with him?
No, I'm not working with him.
Like, I just, like, did the course with him, yeah.
How do you think about the tasks?
So I started creating a bunch of them.
Like, do you see this as being, going back to, like, the composability, like,
composable together later?
Like, you're going to be able to schedule one task that does multiple tasks chained together.
What's the vision?
I would say task is like a foundational module.
Obviously, to generalize to all sorts of, like, behaviors that you want.
Like, sometimes, like, I see, like, people have, like, three tasks.
in one query. And right now I don't think like the model handles this very well. I think that ideally
we learn from like the user behavior and ideally the model will just be more proactive
in suggesting of like, oh, I can either do this for you every day because I've observed that
you do that every day or something. So it's like more becomes like a proactive behavior. I think
right now you have to be more explicit, like, oh, yeah, like, every day, like, remind me this.
But I think, like, the, ideally, the model will always think about you on the background and, like,
kind of suggests, okay, like, I noticed you've been reading some, this particular, like,
Hacker News articles.
Maybe I can try to suggest you all, like, every day or something.
So, like, it's just, like, much more, like, of a natural, like, friend, I think.
Well, there's an actual startup called friend that is trying to do that.
Yes.
We'll interview Avi at some point.
But it sounds like the guiding principle is just what is useful to you.
It's a little bit B2C.
You know, is there any B2B push at all or you don't think about that?
I personally don't think about that as much, but I definitely feel like B2B is cool.
Again, I come back to the Claude in Slack.
It's like one of the first like interfaces where like the model was operating inside your organization.
Right.
It would be very cool for the model to like handle.
to like become like a productive member of your organization and then either like even like even
process like I right now like I'm thinking like processing like user feedback I think it'd be very
cool if the model would just like start doing this for us and like we don't have to hire a new person
on this just for this or something and like you have like very simple like data analysis so like
data analytics or like how this feature is like do you do this analysis yourself or do you have a
data science team that tells you insights?
I think there are some data scientists.
Okay.
I've often wondered, I think there should be some startup or something that does automated
data insights.
Like I just throw you my data.
You tell me.
Yeah.
Yeah, exactly.
Because that's what the data team at any company does.
Right.
Which is just give us your data.
We'll like make PowerPoints.
Yeah.
Yeah, that'd be very cool.
That's, I think that's a really good vision.
You had thoughts on agents in general.
There's some more proactive stuff.
You actually had tweeted a definition, which is
kind of interesting?
I did.
Well, I'll read it out to you.
You tell me, you can
disagree with yourself.
This is five days ago.
Agents are a gradual progression of tasks, starting off with one-off actions, moving to collaboration, ultimately fully trustworthy long-horizon...
I know it's uncomfortable to have your tweets read to you.
I have had this done to me.
Ultimately, fully trustworthy long-horizon delegation in complex environments, like multiplayer, multi-agents. Tasks and canvas fall within the first two.
One of my weaknesses is, like, I like writing long sentences.
I feel like I need to like...
No, that's fine. That's fine. Is that your definition of agents? Like, what are you looking for?
I'm not sure if this is my definition of agents, but I feel like it's more like how I think. It makes sense, right?
Like, I feel like for me to like trust an agent with my passwords or my credit card, I actually need to build trust with that agent that it will handle my tasks correctly and reliably.
And the way I would go about this is how I would naturally like collaborate with other people.
Is it like we first, even with any project, right?
Like we first came, when we first come, like we don't even know each other.
Like we don't know how each other's like working style.
Like what I prefer, what do they prefer?
How do they prefer to communicate, et cetera, et cetera.
So like you spend like the first like, I don't know, like two weeks to just like learn their style of working.
And then like over time you adapt to their working style.
and then this is how you create the collaboration.
And then, like, at the beginning, you don't have much trust.
So, like, how do you build more trust?
Especially, like, it's the same thing as, like, with a manager, right?
It's like, how do you build trust with your manager?
What do they need to know about you?
What do you need to know about them?
Over time, as you build trust and trust builds either through collaboration,
which is why I feel like building canvas was kind of like the first steps
towards, like, more collaborative agents.
I think with humans, you need to show a consistent effort to each other, like consistent effort that you care about each other, that you like work together very well or something.
So consistency and like collaboration is like what creates trust.
And then I will naturally try to delegate tasks to a model because I know the model will not fail me or something.
So it's kind of like building out like the intuition for the form factor of like new agents.
Because sometimes I feel like a lot of researchers or like people in the AI community are like so into like, yeah, agents, delegate everything like blah, blah, but like on the way towards that, I think like collaboration is actually one of the main roadblocks or like milestones to get over.
Because then you will learn some of the implicit preferences that would help you, that would help towards like this full delegation model.
Yeah, trust is very important.
I have an AGI working for me and we're still working out the trust issues.
We are recording this just before the launch of Operator.
The other side of agents that is very topical recently is computer use. Anthropic launched computer use recently.
You're not saying this, but OpenAI is rumored to be working on things.
And there's a lot of labs that are exploring this, like, sort of drive a computer generally.
How important is that for agents?
I think it will be one of the core capabilities of agents.
Yeah, a computer using, or agents using desktop or like your computer is like the delegation part.
Like when you might want to like delegate an agent to like order a book for me or like order a flight or like search for a flight and then order things.
And I feel like this idea was flying around like for a long time since at least like 2022 or something.
And finally we are here.
And just like there's a lot of like lag between idea and like full execution, on the order of like two to three years.
The vision models had to get better.
Yeah.
A lot better.
Like perception and something.
But I think like it's really cool.
I feel like it has like implications for like consumers, definitely like delegations.
But I guess again, like I think like latency is like one of the most important factors here.
It's like you want to make sure that the model correctly understands what you want.
And then if it doesn't understand or if it doesn't know like full context,
it should like ask for a follow-up question and then like use that to perform the task.
Like the agent should know if it has enough information to complete the task at the maximal,
if it's a maximal success or not.
And I think this is like still an open kind of like research question if you like.
Yeah.
And the second idea is that like I think it also enables a new class of like research questions for like computer use agents, like, can we use it in RL?
Right?
Like, this is kind of like very cool, like, nascent area of like research.
What's one thing that you think by the end of this year, people will be using computer
use agents a lot for?
I don't know.
Maybe for coding.
I don't know.
For coding?
I think like right now, like with Canvas, we are thinking about like this paradigm of like
real time collaboration to like asynchronous collaboration.
like it would be cool if I can just delegate to a model like okay can you figure out like how to
do this feature or something and then the model can just like test out that feature in its own
like virtual environment or something I don't know like maybe this is a weird idea obviously there
will be a lot of use cases around like consumers consumer use cases like hey like shop for me or something
I was going to say, everyone goes to booking plane tickets. That's like the worst example, because you only book plane tickets what, two or three times a year? You know, like concert tickets, I don't know.
Yeah.
Concert tickets.
Yeah.
Like Taylor Swift.
I want a Facebook Marketplace bot that just scrolls Facebook Marketplace for free stuff.
Yeah.
And then just go and get it.
Yeah. I don't know.
What do you think?
I have been very bearish on computer use because they're slow.
They're expensive.
They're imprecise.
Like the accuracy is horrible.
Still, even with Anthropic's new stuff.
I'm really waiting to see what OpenAI might do to change my opinions.
And really what I'm trying to do is like Jan last year versus December last year.
I changed a lot of opinions.
what am I wrong about today?
And computer use is probably one of them
where I'm like, I don't think,
I don't know if by end of the year
we'll still be using them.
Will my ChatGPT have,
like every GPT instance,
will they have a virtual computer?
Maybe?
I don't know.
Coding, yes,
because he invested in a company
that does that.
The code sandboxes, there are a bunch of code sandbox companies.
E2B is the name.
But then, like, in browsers, yes.
Computer use is like coding plus browsers
plus everything else.
There's a whole operating system.
And it's very, like, you have to be pixel precise. You have to OCR. I think OCR is basically solved. But like pixel precise and like understand the UI of what you're operating. And like I don't know if the models are there yet.
I don't know. Yeah, yeah. Two questions. Like do you think the progress of like mini models like o3-mini or like o1-mini? I guess like it came back to like Claude, Claude 3 Haiku, Claude 1.2 Instant.
And this like gradual progression of like small models becoming really powerful, which are also like very fast.
Like I'm sure like the computer use agents like would be able to like couple with like those like
small models.
That would solve like some of the latency issues, in my opinion.
I think in terms of like the operating system, I think a lot about it.
It's just like we're entering this like task-oriented like operating system or something, or also a generative OS.
Like, in my opinion, like, people in like a few years will click on, like, websites
way less.
I want to see the plot of, like, website clicks over time.
But then my prediction is, like, it will go down and, like, people's access to the
internet will be through the model's lens.
Either you see what the model is doing or you don't see what the model's doing on the
internet.
Yeah.
I think my personal benchmark for computer use this year is expense reports.
So I have to do my expense report every month.
But what you need to do, so for example, I expense a lunch.
I have to go back on the calendar and see who I was having lunch with.
Then I need to upload the receipt of the lunch and I need to tag the person,
the expense report, blah, blah, blah.
It's very simple on a task-by-task basis.
Yeah.
But like you have to go to every app that I use.
You have to go to like the Uber app.
You have to go to the camera roll to get the photo of the receipt and all these things.
It's not, you cannot actually do it today.
but it feels like a tractable problem.
You know that probably by the end of the year we should be able to do it.
Yeah, this reminds me of like the idea of you kind of want to show to computer use agents
how you would want, how you want or how you like booking your flights.
It's kind of like a few shot.
Yeah, demonstrations of like maybe there is more efficient way that you do things
that the model should learn to do it in that way.
And so it's kind of like, again, comes back to like personalized tasks too.
It's like, right now a task is just like very like rudimentary.
But in the future, tasks should become like much more personalized for your preferences.
Okay.
Well, we mentioned that.
I'll also say that I think one takeaway I got from this conversation is that ChatGPT will have to integrate a lot more with my life.
Like you, you will need my calendar.
You'll need my email.
Yes.
Like for sure.
And maybe you use MCP.
I don't know.
Have you looked at MCP?
No, I haven't.
It's good.
It's got a lot of adoption.
Okay.
Anything else that we're forgetting about or like maybe something that people should use more.
Yeah, I don't know, before we wrap on like the OpenAI side of things.
I think like the search product is kind of cool, like ChatGPT search.
I think this idea of like, you know, like right now I'm thinking a lot about, like, you know, the magic of ChatGPT when it first came out was like, you know, you ask something, like an instruction, and then like it would like follow the instruction that you gave to the model, like "write a poem" and it would give you a poem.
But I think like the magic of the next generation of ChatGPT is like actually, and we're like marching towards that.
It's like when you ask a question, it's not just going to be in the text output.
The ideal output might be like in some form of like a React app on the fly or something.
So like this is happening with like search, right?
Like give me like Apple stock.
And then it gives you the chart.
And it gives you this like generative UI.
And I feel like this is what I mean by like, the evolution of ChatGPT becomes like more of a generative OS
with a task orientation or something.
So it's like, and then UI will adapt to what you like.
So like if you really like 3D visualizations,
I think the model should give you as much as visualization as possible.
Like, you know, if you really like a certain way of like the UIs, like maybe you like round corners.
I don't know.
It's just like some color schemes that you like.
It's just like the UI becomes like more dynamic and like becomes like a custom model, like a personal model, right? Like from personal computer to like a personal model, I think.
Yeah. Takes overall. You are one of the rare few people, actually maybe not that rare, to work at both OpenAI and Anthropic.
Not anymore, yeah.
Cultural differences, what are general takes that people like only like you see?
I love both places. I think I've learned so much at Anthropic and I'm really, really grateful to the people and I'm still like friends with a lot of people there.
And I was really sad when John left OpenAI, because I came to OpenAI because I wanted to work with him the most or something.
What's he doing?
What's he doing now?
But I think it changed a lot.
So I think like when I first joined Anthropic, they were like, I don't know, 60, 70 people.
When I left, there were like 700 like people.
So it's like a massive like growth.
OpenAI and Anthropic are different in terms of like more like, maybe, like product mindset. Maybe OpenAI is much more willing to take some of the product risks and explore different bets.
And I think Anthropic is much more focused.
And they have, I think it's fine.
Like they have to like prioritize.
But they definitely doubled down on like enterprise, maybe more than like consumers or something.
I don't know.
It's just like some of the product mindsets might be different.
I would say like research, I've enjoyed like both like research cultures, both at Anthropic and like OpenAI.
I feel like on the daily basis, I feel like it's more similar than different.
I mean, no surprise.
Like how you run experiments.
It's kind of like very similar.
I'm sure Anthropic, I mean, you know, Dario used to be VP of Research, right?
So he set the culture at OpenAI.
So yeah, that makes sense.
Maybe quick takes on people that you mentioned Barrett, you mentioned Mira.
Like what's one thing you learned from Barrett, Mira, Sam?
maybe something like that.
Like one lesson that you would share to others.
I wish I, like, worked with them way longer.
I think what I've learned from Mira is actually her, like, interdisciplinary mindset.
She's just really good at, like, connecting dots between, like, product and, like, kind of balancing, like, product and research and, like, creating this, like, comprehensive, like, coherent story.
Because sometimes, like, there are, like, researchers who, like, really hate doing product and there are researchers who really love doing product. And it's like kind of a dichotomy between the two, and also like safety is like a part of this process. So you kind of want to like create this coherent, like, think from like a systems perspective, like think about like the bigger picture. And I think I learned a lot from her on that.
I definitely feel like I have much more creative freedom at OpenAI. And that's because the environment that the leaders set like enables me to do that. So it's like, if I have an idea, if I want to propose it.
Yeah, exactly. There's like more like creative freedom and like resource reallocation. Especially in research, it's like being adaptable to like new technologies, or like changing your views based on like empirical results, or kind of like changing research directions. I've seen, like, sometimes I've seen researchers who would just like get stuck on the same directions for like two to three years and it would never like work out or something, but they would still be like stubborn. So it's like adaptability to like new directions and like new paradigms.
It's kind of like one of those things.
Is this a Barrett thing or is it a general culture thing?
Like general kind of culture, I think.
Cool.
Yeah.
And just to wrap up, we just usually have a call to action.
Founders usually want people to work at their companies.
Do you want people to give you feedback?
Do you want people to join your team?
Oh yeah.
Of course.
I'm definitely hiring for like research engineers who are like more product minded people.
So it's like people who know how to train the models.
but also like interested in like deploying into like the products and developing like new product
features.
I'm really looking for those archetypes of like research engineers or like research scientists.
So yeah, if you're like looking for a job, if you're like interested in joining my team,
I'm like really happy to just reach out, I guess.
And then just like generally what do you want people to do more of in the world, whether or not
they work with you?
Like, you know, call to action as in like everyone should be doing this.
I think this is something that I tell to a lot of like designers.
is that like, I think people should spend more time just like playing around with the models.
And the more you play with the model, the more creative ideas you'll get around like what kind of like new potential features of the products or like new kind of interaction paradigms that you might want to create with those models.
I feel like we are bottlenecked by like human creativity on like completely changing the way you think about the internet or like the way you think about software.
Like AI right now pushes us to like rethink everything that we've done before in my view.
And I feel like not enough people are either doubling down on like those ideas, or I'm just like not seeing a lot of human creativity in this like interface design or like
product design mindsets.
So I feel like it would be really great for people to just like do that.
And especially right now is like research.
Some research becomes like much more product-oriented.
So it's like you actually can train the models for the things that you want to do in
a product or something.
Yeah.
And you define the process now.
This is my go-to for how to manage a process.
I think it's pretty common sense.
But it's nice to hear that from you because you actually did it.
That's nice.
Thank you for driving innovation in interface design and the new models at OpenAI and Anthropic. And we're looking forward to what you're going to talk about in New York.
Yeah, thank you so much for inviting me here. I hope my job will not be automated by the time I come to New York.
Well, I hope you automate yourself.
Yeah, I hope so.
We'll do whatever else you want to do. That's it. Thank you.
Awesome. Thanks.
