The AI Daily Brief: Artificial Intelligence News and Analysis - The Next Step in Our Journey to AI Agents: Anthropic's Computer Use

Episode Date: October 24, 2024

Anthropic unveils a new feature enabling AI to use computers like humans, marking a significant advancement in AI autonomy. Explore how this capability allows for tasks such as navigating software, inputting information, and more. This breakthrough opens the door to new applications in business, tech, and beyond, driving the evolution toward fully autonomous AI agents. Concerned about being spied on? Tired of censored responses? AI Daily Brief listeners receive a 20% discount on Venice Pro. Visit https://venice.ai/nlw and enter the discount code NLWDAILYBRIEF. The AI Daily Brief helps you understand the most important news and discussions in AI. Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614 Subscribe to the newsletter: https://aidailybrief.beehiiv.com/ Join our Discord: https://bit.ly/aibreakdown

Transcript
Starting point is 00:00:00 Today on the AI Daily Brief, the next step towards our agentic future. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. To join the conversation, follow the Discord link in our show notes. Hello, friends, quick note before we dive in. Today, instead of our normal setup where we have headlines followed by the main episode, we are just doing a long extended main episode that is all about this big advance from Anthropic. I think you'll understand why as you dig in, but that is why you're not getting any headlines. Enjoy, and later this week we'll be back to more normal formats.
Starting point is 00:00:37 There have been two moments over the past few weeks where we really get a chance to step back and see the very beginnings of a branch off on the AI evolutionary tree. The first of those was, of course, OpenAI's o1. This was their reasoning model, and it wasn't just a bigger model than GPT-4o. It takes a fundamentally different approach to how it works. o1 basically has a built-in chain-of-thought approach that breaks down complex tasks into simpler steps and reasons through them sequentially before generating a response. This is why, by the way, one of the things that we learned when prompting
Starting point is 00:01:17 o1 as opposed to GPT-4o is that you don't need to add things like think step-by-step. Now, the net impact of o1's reasoning approach is that it's much better able to handle things like coding and math, and when it comes to business, it's better at things that have distinct right answers based on inputs. So it might not make a poem any better than GPT-4o, for example. But if you're trying to figure out the ideal arrangement of a banquet hall for a big convention, and you can give it all of the relevant inputs, it's going to be much better at figuring that out than, for example, GPT-4o.
Starting point is 00:01:51 And while the difference is subtle, the important thing, like I said, is that it represents this branch off of the LLM tree, where we are moving subtly but surely into a new reasoning era, which is, of course, in and of itself, the beginning of a new agentic era. OpenAI also recently shared their stages of artificial intelligence. Level 1 was chatbots, AI with conversational language. Level 2, which o1 represents, is reasoners with human-level problem solving. Level 3, which we're not at yet, is agents, or systems that can take actions.
Starting point is 00:02:23 And then levels 4 and 5 are basically what collections of agents can do and more advanced capabilities. So level 4 is innovators, AI that can aid in invention. And level 5 is organizations, AI that can do the work of a full organization. This is the relevant setup for what Anthropic recently announced, which is called computer use. Now, this is part of a larger announcement that also included some model updates, including an upgraded Claude 3.5 Sonnet, as well as a new model, Claude 3.5 Haiku, but there's no doubt that the main discussion and excitement around this was computer use. Developers, Anthropic writes, can now direct Claude to use computers the way people do
Starting point is 00:02:57 by looking at a screen, moving a cursor, clicking, and typing text. They write that Claude 3.5 Sonnet can now follow a user's commands to move a cursor around their computer screen, click on relevant locations and input information via a virtual keyboard. Basically, it emulates how people interact with their computers. Now, the version of this that they started exploring is purposefully very general. It is not use case specific. And that seems very intentional. They write, a vast amount of modern work happens via computers. Enabling AIs to interact directly with computer software in the same way people do will unlock a huge range of applications that simply aren't possible for the current
Starting point is 00:03:32 generation of AI assistants. Now, in a similar way to the idea that o1 was not just a bigger or better model, but a different approach, so too is computer use not a model innovation, but a capabilities innovation. Anthropic writes, over the last few years, many important milestones have been reached in the development of powerful AI, for example, the ability to perform complex logical reasoning and the ability to see and understand images. The next frontier, they argue, is computer use. AI models that don't have to interact via bespoke tools, but that instead are empowered to use
Starting point is 00:04:05 essentially any piece of software as instructed. Their announcement post also gave a little bit of the background around how this came together. One of the cool behind-the-scenes things is how this works. They write, When a developer tasks Claude with using a piece of computer software and gives it the necessary access, Claude looks at screenshots of what's visible to the user, then counts how many pixels vertically or horizontally it needs to move a cursor in order to click on the correct place.
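[Editor's sketch] The pixel-counting step described above can be illustrated with a small Python example. This is only a minimal sketch under stated assumptions, not Anthropic's implementation: the `Point` type, the `pixel_offset` and `clamp_to_screen` helpers, and the example coordinates are all hypothetical. It assumes the model has already located the cursor and the target in a screenshot and now needs the horizontal and vertical distance between them.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Point:
    """A screenshot coordinate (hypothetical helper type)."""
    x: int  # pixels from the left edge of the screenshot
    y: int  # pixels from the top edge of the screenshot


def pixel_offset(cursor: Point, target: Point) -> Tuple[int, int]:
    """Return (dx, dy): how far to move the cursor horizontally and vertically."""
    return target.x - cursor.x, target.y - cursor.y


def clamp_to_screen(point: Point, width: int, height: int) -> Point:
    """Keep a proposed click position inside the screenshot bounds."""
    return Point(
        min(max(point.x, 0), width - 1),
        min(max(point.y, 0), height - 1),
    )


# Assumed example: the cursor sits at (100, 200) and the button at (640, 360).
cursor = Point(100, 200)
button = clamp_to_screen(Point(640, 360), width=1280, height=720)
dx, dy = pixel_offset(cursor, button)
print(f"move cursor by dx={dx}, dy={dy}")
```

Clamping is included because a miscounted coordinate that falls outside the screenshot would otherwise produce a click that cannot land anywhere, which is one plausible way a pixel-counting mistake turns into a failed mouse command.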
Starting point is 00:04:27 They continue, training Claude to count pixels accurately was critical. Without this skill, the model finds it difficult to give mouse commands. So a big part of the secret of these capabilities is that it literally counts pixels. However, they also found that this training unlocked a lot of new capabilities that it wasn't specifically trained on. They say we were surprised by how rapidly Claude generalized from the computer-use training we gave it on just a few pieces of simple software, such as a calculator and a text editor. In combination with Claude's other skills, this training granted it the remarkable ability
Starting point is 00:04:57 to turn a user's written prompt into a sequence of logical steps and then take actions on the computer. We observed that the model would even self-correct and retry tasks when it encountered obstacles. Basically, in the same way that the ability to use a computer changes how we think, the ability to use a computer seems to change the way that the LLM quote-unquote thinks. Anthropic sums up this shift by saying, computer use is a completely different approach to AI development. Up until now, LLM developers have made tools fit the model,
Starting point is 00:05:24 producing custom environments where AIs use specifically designed tools to complete various tasks. Now we can make the model fit the tools. Claude can fit into the computer environments we all use every day. Our goal is for Claude to take pre-existing pieces of computer software and simply use them as a person would. Now, a couple of caveats to all of this. First, Anthropic is quite clear that this is very experimental at this stage and that it tends to be pretty error-prone. It also says that there are a bunch of actions that seem incredibly easy or effortless for people that are difficult to impossible for Claude's computer use.
Starting point is 00:05:56 Those include scrolling, dragging, and zooming. The framework in which they're releasing this is as an experiment. And right now it's only available through the API. This isn't something that a general user of Claude can just fire up and do. It's something that's going to require a developer to actually set up a specific application for. Overall, the tone from Anthropic is very much this is a first glimpse into the future, not a production-ready product. They even joke, even while recording these demos, we encountered some amusing moments.
Starting point is 00:06:23 In one, Claude accidentally stopped a long-running screen recording causing all footage to be lost. Later, Claude took a break from our coding demo and began to peruse photos of Yellowstone National Park. So we can say for the machines that at least it has good taste. Now, I should note that Anthropic also does get into some of the questions around safety. They did not, for example, train computer use on navigating the internet yet. They note that Claude 3.5 Sonnet, even with computer use, still remains at AI safety level 2, meaning that it, quote, doesn't require a higher standard of safety and security measures than those we currently have in place. They continue, when future models require AI safety level three or four safeguards,
Starting point is 00:06:57 because they present catastrophic risks, computer use might exacerbate those risks. We judge that it's likely better to introduce computer use now while models still only need AI safety level two safeguards. That means we can begin grappling with any safety issues before the stakes are too high, rather than adding computer use capabilities for the first time into a model with much more serious risks. For what it's worth, this sounds very similar to how OpenAI always talks about their approach to safety. In other words, iterative deployment that allows all of us to adapt to new capabilities in something of a more incremental way. Today's episode is brought to you by Venice. Venice is a private, uncensored generative AI app.
Starting point is 00:07:33 It accesses open source models to enable text, image, and code generation without the fear of being spied on or having your data exploited. Discuss anything with Venice without concern about it being monitored, sold, or given to advertisers and governments. Venice is different because your conversations and creations are kept securely within the browser, never stored or accessible by Venice. Unlike other AI apps, Venice won't tell you what's okay to say or not. Venice won't patronize you. It simply provides direct access to machine intelligence. No topics are off limits, no ideas are taboo. With Venice, you're in control of the AI as you should be. Pro subscriptions are available for $49 a year or $8 per month.
Starting point is 00:08:09 AI Daily Brief listeners receive a 20% discount on Venice Pro. Visit venice.ai slash NLW and enter the discount code NLWDAILYBRIEF. That's NLWDAILYBRIEF, all one word. Two things I feel qualified to talk about: organization and productivity apps, and AI tools, which is why I am very happy that today's episode is sponsored by Notion. Notion combines your notes, docs, and projects into one space that's simple and beautifully designed. And the new Notion AI has the capability of multiple AI tools built in, which means you can search, generate, analyze, and chat all inside
Starting point is 00:08:42 Notion. The new Notion AI is a single AI tool that does it all. Search across Notion and other apps, generate docs in your own style, analyze PDFs and images, and chat with you about anything. Notion is a perfect place to organize your tasks, track your habits, write beautiful docs, and collaborate with your team. The more content you add to Notion, the more Notion AI can personalize its responses for you. Basically, unlike generic chatbots, Notion AI already has the context of your work. There are also a bunch of great integrations. Notion uses AI knowledge from GPT-4 and Claude. And with AI connectors, which are now in beta, Notion AI can also search across Slack discussions, Google Docs, Sheets, and Slides, and more tools like GitHub and Jira are
Starting point is 00:09:17 coming soon. Notion is used by over half of Fortune 500 companies, but more important, it's used all day, every day by me. Try Notion for free when you go to notion.com slash AI Daily Brief. That's all lowercase letters, notion.com slash AI Daily Brief to try the powerful, easy to use Notion AI today. And when you use that link, you're supporting the show. Once again, that's notion.com slash AI Daily Brief. Today's episode is brought to you by Superintelligent. Every single business workflow and function is being remade and reimagined with artificial intelligence. There is a huge challenge, however, of going from the potential of AI to actually capturing that value. And that gap is
Starting point is 00:09:56 what Superintelligent is dedicated to filling. Superintelligent accelerates AI adoption and engagement to help teams actually use AI to increase productivity and drive business value. An interactive AI use case registry gives your company full visibility into how people are using artificial intelligence right now. Pair that with capabilities building content in the form of tutorials, learning paths, and a use case library, and Superintelligent helps people inside your company show how they're getting value out of AI, while providing resources for people to put that inspiration into action. The next three teams that sign up with 100 or more seats are going to get free embedded consulting. That's a process by which our Superintelligent team sits with your organization,
Starting point is 00:10:37 figures out the specific use cases that matter most to you, and helps actually ensure support for adoption of those use cases to drive real value. Go to besuper.ai to learn more about this AI enablement network. And now back to the show. Alex Albert, who is, I think, nominally the head of developer relations, but lists himself as the head of Claude Relations on Twitter, wrote a nice thread about how big a shift this represents. He writes, computer use is the first step towards a completely new form of human-computer interaction. In just a few years, the way we interface with computers will be completely different from today. Computer use allows AIs to use computers just as you would, no complex
Starting point is 00:11:13 abstractions or specific APIs, just pure visual understanding and interaction, exactly like how you use your computer. He gives an example video where he says Claude opens up claude.ai, copies the outputted website code into a new code file within VS Code, and then proceeds to fix a bug in the website, all with computer use. Alex continues, this is entirely different from how most agentic frameworks currently work. Today, most quote-unquote agents are a patchwork of multiple bespoke APIs glued together under the hood of some complex scaffolding. Alex says, I believe we'll be able to reach near-human-level performance in the next few years, if not much sooner. When that is reached, it means AIs can operate the basics of a computer just as well as an average person can.
Starting point is 00:11:51 At that point, we can start stringing together AIs doing tasks. Now, instead of an AI doing a simple task on a computer that would only take a human a couple of minutes, it will do that task, then move on to two more tasks. Suddenly, the AIs will be doing end-to-end tasks that would take humans hours and days. Read a 50-page research report to create a full executive summary and slide deck. Scan financial documents to build a DCF model. View a wireframe to ship a production-ready website. Alex continues,
Starting point is 00:12:15 combine this with a longer context window and increased chain of thought, and now you have the beginnings of the unhobbling of AI products. The pieces of the true agent puzzle are starting to fall into place. If you're developing on AI today, you need to be thinking about building complementary pieces to this reality, because it may come faster than most expect. AI entrepreneur Bindu Reddy writes, Computer Use API by Anthropic is an interesting take on agentic APIs. Agents are challenging because they have to talk to other systems, and most of these systems don't have good APIs. One potential solution is to use the computer use API, which allows the LLM to pretend to be a human operating a computer. The biggest issue with this approach is security, but it's not insurmountable.
Starting point is 00:12:52 Now, what about the use cases that are available right now? Obviously, there's a lot of talk about things in the future, but what can computer use do at this moment? Well, once again, Alex Albert writes, fun story from our time working on computer use. We held an engineering bug bash to make sure we found all the potential problems with the API. This meant bringing a handful of engineers in a room together for a few hours. We were hungry, so one of our engineers' first computer use requests was to ask Claude to navigate to DoorDash and order enough food to feed a group of people. About a minute later, we saw Claude decide to order us some pizzas. Alex Volkov, an AI evangelist with Weights & Biases and host of the ThursdAI podcast, writes,
Starting point is 00:13:24 mind officially blown once again. This computer use Claude demo from Anthropic didn't work for me, so I just asked it to fix itself so it did. Kalin KS writes, Anthropic recently released computer use API with which developers can direct Claude to use computers the way people do. He then shares a video about how he set up computer use with the firecrawl.dev API to gather information and fill out job applications. Michele Catasta, the president at Replit, writes, I can't tell you the last time I was so excited to see a new AI capability in action. We plugged in Claude computer use and Replit Agent as a human feedback replacement, and it just works. I feel like it won't take long until our agent will become fully autonomous.
Starting point is 00:14:00 Now, Replit had some advanced access to this so it's had a little bit more of a chance to experiment. To the extent that there was any skepticism, it was not about the long term but about the very short term. Joe Mueller, for example, responded to the announcement asking, is this just a faster horse? Developer Tony Gee writes, first look at Anthropic's Claude computer use demo. Feel like it's cool but kind of underwhelming and definitely not cost effective yet. 150,000 tokens just to visit and navigate through a couple sites. Either the models have to get much cheaper or your use case has to be insanely valuable. Or am I just missing something?
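[Editor's sketch] The cost concern above comes down to simple arithmetic, sketched here in Python. The only figure from the episode is the 150,000-token count. The price of $3.00 per million tokens is an assumed, illustrative rate, the `estimate_cost_usd` helper is hypothetical, and the sketch treats every token as billed at that single rate, which real pricing generally does not do.

```python
def estimate_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Estimate spend for one session, assuming a single flat per-token rate."""
    return tokens / 1_000_000 * usd_per_million_tokens


TOKENS_PER_SESSION = 150_000   # figure quoted in the episode
ASSUMED_PRICE_PER_MILLION = 3.00  # assumption for illustration only

per_session = estimate_cost_usd(TOKENS_PER_SESSION, ASSUMED_PRICE_PER_MILLION)
per_thousand_sessions = per_session * 1_000

print(f"one session: ${per_session:.2f}")
print(f"1,000 sessions: ${per_thousand_sessions:.2f}")
```

Under these assumptions a single browsing session costs well under a dollar, so the cost-effectiveness question is mostly about volume: the per-session figure only becomes significant when the same task is repeated many times.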
Starting point is 00:14:30 Others are concerned about just how much change we're going to go through. Elena Alfaro writes, The new computer mode by Anthropic is the beginning of an era where we humans will stop understanding software. The interfaces we have today will become useless. Now, I think it's a valid concern, but also an open question. How much and in what specific ways does it matter for us to understand interfaces?
Starting point is 00:14:50 And like I said, there is a ton of experimenting already happening. Stets and Blake writes, if this is the slowest and cheapest that agentic automation will ever be, we're in for a wild future. I tried a few different tasks, one involving administering my Facebook groups, posts and declining and accepting member requests, which it did great at. I tried booking a haircut. It seemed to work but hit API rate limit. Did the make yourself a website thing from the demo,
Starting point is 00:15:09 and it cooked. Ultimately, Chubby on X summed up a lot of the sentiment when they wrote, this is so huge. Today we entered the era of agents, first Microsoft, now much more so Anthropic. Today marks a new step in the direction of AGI. Your move, OpenAI. Anthropic is in the lead. Mackay Mooran writes, the dawn of truly agentic AI is upon us. While today we're seeing the early stages of APIs enabling models to interact with systems, we are rapidly approaching an era where intelligent and autonomous agents will be deeply integrated into our operating systems, serving as intelligent assistants that can truly understand and execute our intentions. I have long maintained that the future of computing isn't just about smarter algorithms,
Starting point is 00:15:43 more parameters or improved tuning. Rather, it's about intelligently automated systems that can actively engage with our digital environment. The recent advances in multimodal models and computer interaction capabilities are just the beginning of a new trend. As these models become more integrated into our operating systems, we will see a fundamental shift in how we interact with technology. This is both society-altering and inevitable. And so that is Anthropic's computer use. Still nascent, yes, still limited in what it can do, but as with o1, another branch off the evolutionary tree of LLMs and AI, and another step towards a very different agentic future. That's going to do it for today's AI Daily Brief. Appreciate you listening as always. And until next time, peace.
