The AI Daily Brief: Artificial Intelligence News and Analysis - ChatGPT Gets a Body with the Figure 01 Robot

Starting point is 00:00:00 Today on the AI Breakdown, we're looking at Figure and OpenAI's new collaborative, interactive robot that is blowing people's minds. Before that on the Brief, Google DeepMind releases a new generalist video game playing agent. The AI Breakdown is a daily podcast and video about the most important news and discussions in AI. Go to Breakdown.network for more information about our Discord, our newsletter, and our YouTube. Welcome back to the AI Breakdown Brief, all the AI headline news you need in around five minutes. The AI agent future is coming on quickly, and Google DeepMind has made another advancement in that area.

Starting point is 00:00:36 Yesterday they tweeted, introducing SIMA, the first generalist AI agent to follow natural language instructions in a broad range of 3D virtual environments and video games. It can complete tasks similar to a human and outperforms an agent trained in just one setting. We partnered with gaming studios to train SIMA, which stands for Scalable Instructable Multiworld Agent, on No Man's Sky, Teardown, Valheim, and others. These offer a wide range of distinct skills for it to learn, from flying a spaceship to crafting a helmet. SIMA needs only the images provided by the 3D environment and natural language instructions given by the user.

Starting point is 00:01:08 With mouse and keyboard outputs, it is evaluated across 600 skills, spanning areas like navigation and object interaction, such as turn left or chop down tree. We found SIMA agents trained on all of our domains significantly outperform those trained on just one world. When it faced an unseen environment, it performed nearly as well as the specialized agent, highlighting its ability to generalize to new spaces. Unlike our previous work, SIMA isn't about achieving high game scores. It's about developing embodied AI agents that can translate abstract language into useful actions, and using video games as sandboxes offers a safe, accessible way of testing them. The SIMA research builds towards more general AI that can understand and safely carry out instructions in both virtual and physical settings.

Starting point is 00:01:46 Such generalizable systems will make AI-powered technology more helpful and intuitive. So basically what you have here is an attempt to train a generalist AI agent and see how it compares, in the context of these virtual worlds in the form of games, to agents that are trained just specifically on one game. The broader goal of the research, as Google said, is about generalist agents that can navigate any virtual or physical world environment, not just those of video games. The Google DeepMind blog post has a lot more about how the process worked.

Starting point is 00:02:12 For example, they write, we used four research environments, including a new environment we built with Unity called the Construction Lab, where agents need to build sculptures from building blocks, which tests their object manipulation and intuitive understanding of the physical world. By learning from different gaming worlds, SIMA captures how language ties in with gameplay behavior. Our first approach was to record pairs of human players across the games in our portfolio, with one player watching and instructing the other.

Starting point is 00:02:33 We also had players play freely, then rewatch what they did and record instructions that would have led to their game actions. I think the big upshot and what Google DeepMind is excited about comes from the fact that, as they write, we show an agent trained on many games was better than an agent that learned how to play just one. In our evaluation, SIMA agents trained on a set of nine 3D games from our portfolio significantly outperformed all specialized agents trained solely on each individual one. What's more, an agent trained in all but one game performed nearly as well on that unseen game as an agent trained specifically on it, on average. Importantly, this ability to function in brand new environments highlights SIMA's ability

Starting point is 00:03:06 to generalize beyond its training. Google DeepMind has long been a fan of using this sort of gaming environment to figure out advances in AI, and this seems to be no exception. Next up on the Brief, what is looking like a fairly disastrous interview by OpenAI CTO Mira Murati with the Wall Street Journal. The piece was called OpenAI made AI videos for us. These clips are good enough to freak us out. Now, the article as written isn't damning. In fact, it's part of the coverage of Sora that reflects just how much people are blown away by it. As they wrote, when OpenAI began previewing videos made with the generative AI tool last month,

Starting point is 00:03:39 the internet understandably lost its mind. Other AI video technology has produced choppy low-resolution clips. These looked like something out of a nature documentary or big-budget film. Sora brings new intensity to the now familiar AI feelings loop, amazement about the capability followed by fear for society. And if you'd just read the WSJ piece, you'd probably come away pretty impressed and fairly thoughtful. However, if you actually watch the interview, particularly the question where the interviewer asks what data Sora was trained on, and specifically if it was trained on data from YouTube, Facebook, or Instagram, you would have a very different impression. The Verge writes, when pressed on what data OpenAI used to train Sora, Murati didn't get too specific and seemed to dodge the

Starting point is 00:04:17 question. Quote, I'm not going to go into the details of the data that was used, but it was publicly available or licensed data. Murati also said she isn't sure whether it uses videos from YouTube, Facebook, or Instagram. Ed Newton-Rex, the CEO of Fairly Trained, and the person who left Stability AI over their copyright policies, writes, disappointing to see OpenAI's CTO dodge questions about Sora's training data today. If a Gen AI company dodges questions about their training data and uses the phrase publicly available, it seems fair to assume their model was trained on copyrighted work without permission. Even if one doesn't agree with that, I think a lot of people are stunned that she

Starting point is 00:04:49 wasn't prepared for this question, that she didn't assume that this is going to be a question asked in basically every interview from here to forever. Given the fact that OpenAI is fighting multiple lawsuits on multiple fronts around copyrighted data, of course they're going to ask this question. You would have thought that everyone from the PR team to the executive team to the legal counsel would have figured out exactly what they were going to say to this question long before the interview ever took place. Interestingly, given that we talked about the EU AI Act recently, Luiza Jarovsky points out that this is potentially more than just a PR problem. Luiza writes,

Starting point is 00:05:20 The clip below could put OpenAI in trouble. Here's why. In case OpenAI's Sora is classified as a high-risk AI system per the EU AI Act, they will have to comply with transparency obligations such as informing users about training, validation, and testing datasets used. In the clip below, Mira Murati, OpenAI's CTO, cannot specify or exemplify the sources of data used to train Sora. If she were from the marketing department, this would be okay, but as a CTO,

Starting point is 00:05:41 this is a core aspect of the technology that can lead to legal liability, and she should be able to answer it in a straightforward way. Still, I think while this might be the conversation inside of our enfranchised AI community, by and large, what people are paying attention to when it comes to Sora are its incredible outputs. Lastly, today, another update around election-related policies for AI. Midjourney has changed its policies and is now blocking images of Biden and Trump as the U.S. election comes into focus. This was announced during an office hours event yesterday, where Midjourney CEO David Holz said that the company was starting to block requests for images of the two candidates, which is a reversal of a decision that they had made back in

Starting point is 00:06:19 February. When the Associated Press tried Trump and Biden shaking hands at the beach, they got a banned prompt detected warning, and after a second attempt got an abuse alert warning. PetaPixel tried to get around it by saying the 45th president of the United States with the 46th president of the United States holding hands, but also got the banned prompt message. Said Holz during the office hours, I don't really care much about political speech. That's not the purpose of Midjourney. It's not that interesting to me. That said, I also don't want to spend all my time trying to police political speech, so we're going to have to put our foot down on it a bit. Sounds to me like this is less about taking some deep principled stance, and more about just

Starting point is 00:06:51 not being distracted by what is going to be a very distracting conversation. Expect to see more of this type of decision happen as we get closer to November. For now, however, that is going to do it for today's AI Breakdown Brief. Up next, the main AI Breakdown. Today's podcast is brought to you by Plumb. If you're a startup building AI features for your customers, you're probably feeling the pain of hallucination, prompt testing, unstructured responses, subpar queries for embeddings, and of course, the mind-numbing process of general iteration and refinement when your engineers have to make every change by hand.

Starting point is 00:07:24 That's where Plumb comes in. Plumb is a no-code AI app builder designed for product teams who care about quality and speed. What is taking you weeks to hand code today can be done confidently in hours. Check out useplumb.com, U-S-E-P-L-U-M-B dot com, or reach out to me for early access. Welcome back to the AI Breakdown. This week has been weirdly filled with buzz about exciting things coming, and everyone has been trying to kind of figure out which of the crazy things that we keep seeing are those things that people were teasing. Was it Cognition's demo of their AI coder

Starting point is 00:07:58 tool Devin, or, as it increasingly seems, was it a demo of OpenAI's integration with the Figure humanoid robot? We're going to talk a little bit about that today. And to kick things off, I highly suggest that you go actually watch this video. For those listening to the podcast, I will leave a link in the show notes. For those of you watching the video, I will just actually show it. It's about two minutes long, but it's worth watching in full. This was a video of the Figure humanoid robot, the Figure 01, engaging in real-time conversation with a human, not being teleoperated, and figuring out how to engage in fast, low-level dexterous actions. But to understand the context for this, let's actually jump back a couple of weeks. Throughout the beginning of the

Starting point is 00:08:37 year, there were rumors that Figure was raising a massive round, and that was confirmed at the end of February when they announced a $675 million fundraise, which included participation from Jeff Bezos, Nvidia, Microsoft, Amazon, and more. As part of that announcement, Figure also shared a new partnership with OpenAI. Popular Science Magazine, which now goes by PopSci, wrote that the partnership marked, quote, one of the most significant examples yet of an AI software company working to integrate its tools into physical robots. Going on, they continued, Figure founder and CEO Brett Adcock described the partnership as a huge milestone for robotics. Eventually, Adcock hopes the partnership with OpenAI will lead to a robot that can work side by side with humans completing tasks and holding a conversation.

Starting point is 00:09:17 By working with OpenAI, creators of the world's most popular large language model, Adcock says Figure will be able to further improve the robot's semantic understanding, which should make it more useful in work scenarios. Now, overall, this AI robotics space is a huge one. Figure is competing with Tesla's Optimus humanoid robot and a number of other well-funded startups as well. 1X Technologies AS, which also received investment from OpenAI, recently raised $100 million. Agility Robotics is reportedly testing its robots in Amazon warehouses, and that's just the tip of the iceberg. This partnership isn't actually OpenAI's first flirtation with the integration of its software into robotics. For about a year between 2020 and 2021, OpenAI actually had a robotics

Starting point is 00:09:57 research team. In July of 2021, however, they announced that they were shutting down that team and shifting their focus. An OpenAI spokesperson said, after advancing the state of the art in reinforcement learning through our Rubik's Cube project and other initiatives, we decided not to pursue further robotics research and instead refocused the team on other projects. This is obviously a big shift. Given that OpenAI had started showing off robotics research in May 2017 and then again in October 2019, giving that research up was probably not an easy decision. Although given we know what happened next with ChatGPT, probably the right one. In any case, I think it puts more color on the fact that OpenAI decided to partner with Figure as a different approach to something that they had always cared

Starting point is 00:10:35 about. Now, this brings us back to the Figure 01 slash OpenAI demo from yesterday. Figure founder Brett Adcock writes, OpenAI plus Figure, conversations with humans on end-to-end neural networks. OpenAI is providing visual reasoning and language understanding. Figure's neural networks are delivering fast, low-level dexterous robot actions. Discussing the video, Brett writes, the video is showing end-to-end neural networks. There is no teleoperation. This was filmed at 1x speed and shot continuously. As you can see from the video, there's been a dramatic speed up of the robot. We are starting to approach human speed. Figure's onboard cameras feed into a large vision language model trained by OpenAI. Figure's neural nets also take images in at 10 hertz through cameras on the robot. The neural net is then

Starting point is 00:11:13 outputting 24 degrees of freedom actions at 200 hertz. In addition to building leading AI, Figure has also vertically integrated basically everything. We have hardcore engineers designing motors, firmware, thermals, electronics, middleware, OS, battery systems, actuator sensors, mechanical, and structures. Diving a little deeper on the technical side was Figure's Corey Lynch. He writes, we're now having full conversations with Figure 01 thanks to our partnership with OpenAI. A robot can describe its visual experience, plan future actions, reflect on its memory, and explain its reasoning verbally. Explaining what's happening in the video, he writes,

Starting point is 00:11:44 All behaviors are learned, not teleoperated, and run at normal speed. We feed images from the robot's cameras and transcribed text from speech captured by onboard microphones to a large multimodal model trained by OpenAI that understands both images and text. The model processes the entire history of the conversation, including past images, to come up with language responses, which are spoken back to the human via text to speech. The same model is responsible for deciding which learned closed-loop behavior to run on the robot to fulfill a given command, loading particular neural network weights onto the GPU and executing a policy. In just a few words then, Corey is explaining how complex this actually is.
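To make that pipeline a little more concrete, here is a minimal sketch of the dispatch pattern Corey describes: a high-level model reads the conversation history, produces a spoken reply, and picks which learned behavior to run, whose weights are then loaded on demand. Everything in it is an assumption for illustration. The names `Turn`, `choose_behavior`, `PolicyRegistry`, and `handle_utterance` are hypothetical, the keyword matching is a stand-in for the multimodal model, and the "policies" are toy functions rather than neural networks. It is not Figure's or OpenAI's code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# All names here are hypothetical; nothing below comes from Figure or OpenAI.
Action = List[float]                 # one 24-DOF action vector
Policy = Callable[[str], Action]     # maps an image id to an action


@dataclass
class Turn:
    speaker: str                     # "human" or "robot"
    text: str                        # transcribed speech or spoken reply
    image_id: Optional[str] = None   # camera frame attached to this turn


def choose_behavior(history: List[Turn]) -> Tuple[str, str]:
    """Stand-in for the multimodal model: returns (reply, behavior name)."""
    last = history[-1].text.lower()
    if "hungry" in last or "eat" in last:
        return "Sure thing.", "hand_over_apple"
    if "dishes" in last:
        return "I'll put them in the drying rack.", "place_dishes_in_rack"
    return "I'm not sure what to do.", "idle"


class PolicyRegistry:
    """Loads a behavior's 'weights' the first time it is requested."""

    def __init__(self, loaders: Dict[str, Callable[[], Policy]]):
        self._loaders = loaders
        self._loaded: Dict[str, Policy] = {}

    def get(self, name: str) -> Policy:
        if name not in self._loaded:
            # In a real system this is where weights would move to the GPU.
            self._loaded[name] = self._loaders[name]()
        return self._loaded[name]


def demo_registry() -> PolicyRegistry:
    """Toy policies that each return a constant 24-element action."""
    def constant(value: float) -> Callable[[], Policy]:
        return lambda: (lambda image_id: [value] * 24)
    return PolicyRegistry({
        "hand_over_apple": constant(0.1),
        "place_dishes_in_rack": constant(0.2),
        "idle": constant(0.0),
    })


def handle_utterance(history: List[Turn], registry: PolicyRegistry,
                     text: str, image_id: str) -> Tuple[str, str, Action]:
    """Append the human turn, pick a behavior, and run one policy step."""
    history.append(Turn("human", text, image_id))
    reply, behavior = choose_behavior(history)
    history.append(Turn("robot", reply))
    action = registry.get(behavior)(image_id)
    return reply, behavior, action
```

The point of the sketch is only the shape of the flow: the whole history goes in, a reply and a behavior choice come out, and the chosen behavior runs as its own closed-loop policy.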

Starting point is 00:12:17 There are numerous processes of translation to get from human language and the visual input of the cameras on the robot into actual activity. Going on, Corey writes, connecting Figure 01 to a large multimodal model gives it some interesting new capabilities. It can now describe its surroundings, use common sense reasoning when making decisions. For example, the dishes on the table like that plate and cup are likely to go into the drying rack next. Translate ambiguous high-level requests like I'm hungry to some context-appropriate behavior like hand the person an apple. Finally, it can describe why it executed a particular action in plain English. For example, it was the only edible item I could provide you with from the table.

Starting point is 00:12:51 Finally, Corey writes, let's talk about the learned low-level bimanual manipulation. All behaviors are driven by neural network visuomotor transformer policies mapping pixels directly to actions. These networks take in onboard images at 10 hertz and generate 24-DOF actions, wrist poses and finger joint angles, at 200 hertz. These actions serve as high-rate set points for the even higher rate whole-body controller to track. There is a useful separation of concerns. Internet pre-trained models do common-sense reasoning over images and text to come up with a high-level plan. Learned visuomotor policies execute the plan, performing fast reactive behaviors that are hard to specify manually, like manipulating a deformable bag in any position.
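The rate mismatch quoted here (images in at 10 hertz, 24-DOF actions out at 200 hertz) works out to 20 action steps per camera frame. The sketch below shows one possible way such a loop could be arranged, assuming the policy simply reuses the most recent frame until a new one arrives. That reuse strategy, the tick-based simulation, and the names `run_loop`, `toy_policy`, and `read_camera` are all assumptions for illustration. The source does not say how Figure's networks bridge the two rates.

```python
from typing import Callable, List

IMAGE_HZ = 10     # camera frames per second (from the quoted description)
ACTION_HZ = 200   # action set points per second (from the quoted description)
DOF = 24          # degrees of freedom per action (from the quoted description)
STEPS_PER_IMAGE = ACTION_HZ // IMAGE_HZ  # 20 action steps per frame


def toy_policy(latest_image: float, step_in_frame: int) -> List[float]:
    """Hypothetical stand-in for a learned policy: a small ramp per frame."""
    return [latest_image + 0.001 * step_in_frame] * DOF


def run_loop(num_ticks: int,
             read_camera: Callable[[int], float]) -> List[List[float]]:
    """Simulate num_ticks action steps, refreshing the image every 20 ticks.

    Ticks stand in for 5 ms intervals; no real-time scheduling is attempted.
    """
    setpoints: List[List[float]] = []
    latest_image = 0.0
    for tick in range(num_ticks):
        if tick % STEPS_PER_IMAGE == 0:
            # A new frame is available at the slower camera rate.
            latest_image = read_camera(tick)
        # An action is emitted on every tick, reusing the latest frame.
        setpoints.append(toy_policy(latest_image, tick % STEPS_PER_IMAGE))
    return setpoints
```

Running `run_loop(200, camera)` simulates one second: the camera callback fires 10 times and 200 set points come out, which a faster whole-body controller would then track.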

Starting point is 00:13:26 Meanwhile, a whole body controller ensures safe, stable dynamics. For example, maintaining balance. Of course, even beyond the demo itself, what's been interesting to watch is the response to the demo. Dan Fitzpatrick writes, This is an "I remember where I was when I first saw it" moment for me. I've been researching the embodiment of AI a lot recently and talking about them in my work. But this new robot from Figure and OpenAI is on another level. Bilawal Sidhu jokingly responded to Figure's tweet saying,

Starting point is 00:13:51 Hey, Figure 01, I'm headed out for a bit. Please clean the house, do the dishes, and take the dog for a walk. Feel free to watch a little Terminator 2 until I get back, but don't get any wild ideas, okay? Some people did try to call BS, claiming that the fact that the robot said, uh, and stuttered, suggested that the whole thing was faked, to which Tesla bot journal pointed out, you're wrong, you can prompt it to converse like how people typically speak. Put that in ChatGPT's custom instructions so you can have that in every conversation.

Starting point is 00:14:14 The demo probably used a variant of GPT-4 where the same tricks can be applied. Indeed, overall, the broad assumption that I'm seeing is that the demo is legitimate and that it really does mark a massive phase shift moment for AI and robotics. Quartz even went so far as to argue that OpenAI's new robot is, quote, way ahead of Elon Musk's Optimus. Tesla's Optimus is taking baby steps while OpenAI's Figure 01 is doing burnouts on the track. I'm sure that sort of headline won't inflame Elon at all, given that he's already in lawsuits with OpenAI. In any case, this is going to be an area that continues to see massive and rapid advancement, and so one to watch very closely.

Starting point is 00:14:49 For now, however, that is going to do it for today's AI breakdown. Until next time, peace.

