The AI Daily Brief: Artificial Intelligence News and Analysis - The Most Interesting AI Research This Week

Starting point is 00:00:00 Today on the AI Breakdown, we're looking at the most interesting research from the previous week. Big themes like multimodality and text to video are all over this thing. The AI Breakdown is a daily podcast and video about the most important news and stories in AI. Like, subscribe, and share, and go to Breakdown.com network for more information. Today we are looking at some of the most interesting AI research from the last week, and we're kicking off with a theme that has really defined the last few months, and that is AutoGPTs and, more broadly speaking, the interest that people have in AI agents that can actually go out and do tasks. Now, if you've listened to this show or watched this channel for any amount of time, you have inevitably heard of AutoGPT. In April, Significant

Starting point is 00:00:42 Gravitas introduced AutoGPT, and it quickly became the most interacted with project on GitHub. Other projects like it include BabyAGI, and basically, again, these tools were all designed to try to be an approach to an AI agent that could actually not only figure out how to solve problems, but to potentially spin up and create the other AI agents that could go out and do the tasks that were necessary to accomplish a particular goal. AssistGPT isn't exactly a one-to-one AutoGPT competitor or anything like that. Instead, it's an approach that tries to bring in another theme of this year, which is multimodality, or large language models that can also interact with visual-based tasks such as images, videos, and more, to try to create a more sophisticated

Starting point is 00:01:22 AI assistant. AssistGPT executes tasks using an approach called Plan, Execute, Inspect, and Learn (PEIL), which is what has reminded some people of AutoGPT. The planner uses natural language to plan the next steps based on the current progress of reasoning. It decides which tool or function should be used next to process the input data or move closer to the final answer. The executor carries out the tasks planned by the planner, which could involve processing visual data such as images or video, or it could involve other types of tasks such as searching for information or performing calculations. The inspector is a memory manager that helps the planner

Starting point is 00:01:54 by providing the appropriate visual information to the specific tool that needs it, which again could involve selecting the right images or videos or involve providing other types of data that the tool needs to perform its task, and the learner is designed to improve the system's performance over time. It allows the model to learn from its experiences and to discover the optimal solution to a task.
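Editor's note: the plan, execute, inspect, and learn loop described here can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the AssistGPT code: every name in it (Inspector, planner, executor, learner, run, TOOLS) is hypothetical, the planner is a hard-coded stand-in for what would be an LLM call, and the "learning" step is assumed to be nothing more than saving finished traces for later reuse.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# All names below are hypothetical; this is not AssistGPT's actual API.

@dataclass
class Inspector:
    """Memory manager: keeps inputs and intermediate results by name."""
    memory: Dict[str, str] = field(default_factory=dict)

    def store(self, name: str, value: str) -> None:
        self.memory[name] = value

    def fetch(self, name: str) -> str:
        return self.memory[name]


# Toy tools standing in for captioning, search, calculation, and so on.
TOOLS: Dict[str, Callable[[str], str]] = {
    "caption": lambda data: f"caption of {data}",
    "answer": lambda data: f"answer based on {data}",
}


def planner(question: str, history: List[Tuple[str, str]]) -> Optional[Tuple[str, str]]:
    """Stand-in for the LLM planner: returns (tool, input name), or None when done."""
    if not history:
        return ("caption", "image_1")   # first look at the image
    if len(history) == 1:
        return ("answer", "step_1")     # then answer from the caption
    return None                          # nothing left to do


def executor(tool: str, data: str) -> str:
    """Runs the tool the planner chose on the data the inspector supplied."""
    return TOOLS[tool](data)


def learner(history: List[Tuple[str, str]], examples: List[List[Tuple[str, str]]]) -> None:
    """Assumed learning step: keep the finished trace as a future example."""
    examples.append(list(history))


def run(question: str, inspector: Inspector, max_steps: int = 5) -> List[Tuple[str, str]]:
    history: List[Tuple[str, str]] = []
    for step in range(1, max_steps + 1):
        plan = planner(question, history)
        if plan is None:
            break
        tool, input_name = plan
        result = executor(tool, inspector.fetch(input_name))
        inspector.store(f"step_{step}", result)  # make the result available to later steps
        history.append((tool, result))
    return history
```

To try it, store an input with inspector.store("image_1", "dog.jpg") and call run("What is in the image?", inspector). The point of the design is that the planner only ever sees names and summaries, while the inspector hands the actual data to whichever tool needs it.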

Starting point is 00:02:11 This learning process could involve adjusting the way the planner, executor, and inspector work based on the results of the previous tasks. Now, the paper doesn't specifically mention the ability of AssistGPT to interact with or create other AI agents to execute tasks. However, the concept of AssistGPT involves integrating LLMs with various tools, which could potentially include other AI models or agents. Next up, we have Rerender A Video.

Starting point is 00:02:34 This is one that people like Matt Wolfe got really excited about. So as Matt puts it, it takes an input video and re-renders it with your prompt, without the flicker or weirdness we get from other current models. For those of you who are listening and not watching, you can see how it takes a video and then can re-render it based on a variety of different styles. So a colorful impasto painting, a Ghibli cartoon, a Starry Night by Van Gogh. Basically, all of these text prompts change the nature of the video to look like the text prompt. This research presents a novel framework for translating text into video. The framework consists of two parts,

Starting point is 00:03:09 keyframe translation and full video translation. Keyframe translation uses a modified diffusion model to generate keyframes from text with constraints applied to ensure coherence in shapes, textures, and colors across frames, and then full video translation fills in the gaps between keyframes using a method that matches patches and blends frames from the keyframes, which ensures style and texture consistency over time. One of the big things we've seen over the last couple months is that all of that excitement and energy that came to the text-to-image space last year coming into this year (the difference between Midjourney 2 and the Midjourney 5 that we have now) is starting to make it to video as well. Runway Gen-1 had this sort of video

Starting point is 00:03:47 re-skinning capacity, and then Runway Gen-2 has an entire video generation suite where you can actually go text to video in four-second snippets. The creative, artistic, and business implications of this sort of text-to-video approach are enormous, and people are kind of salivating to get their hands on anything that can do this type of video modification or generation. Next up we have Meta's Voicebox. As Min Choi sums it up, it's an AI that can create high-quality audio in six languages with noise removal, content editing, style conversion, and more, all without specific training. Min writes, unlike traditional speech AI that needs specific training for each task, Voicebox learns from raw audio and transcriptions.

Starting point is 00:04:28 It's based on flow matching and outperforms current models in speed, intelligibility, and audio similarity. Unlike old models that require carefully prepared data, Voicebox can learn from varied speech data without needing careful labeling. It's trained on diverse data at a much larger scale, making it versatile and adaptable. Voicebox was trained on 50,000 hours of recorded speech and can perform tasks including text-to-speech synthesis, cross-lingual style transfer, speech denoising and editing, and more. Now, there are just a ton of potential use cases for this.

Starting point is 00:04:56 With just a two-second-long audio input sample, Voicebox can match the sample's audio style and use it for text-to-speech generation. That could potentially be used to provide speech capabilities for people who are unable to speak, to allow individuals to customize the voices used by non-player characters or virtual assistants, or to have themselves saying things in a different language in the same style as they normally would in whatever their home language is. And again, it could also be totally transformational in how fast and easy it is to clean up audio, taking out background noises or imperfections, and things like that. This is Voicebox, a new AI foundation model for speech that does some pretty awesome things.

Starting point is 00:05:32 If you give it text, it can read it in a bunch of different styles. Penelope Porcupine and Sammy Sloth danced gracefully in the treetops. Penelope Porcupine and Sammy Sloth danced gracefully in the treetops. And you can use it to fix background noise too, kind of like an eraser, but for audio. Sammy and Penelope's heartwarming friendship inspires joy. Sammy and Penelope's heartwarming friendship inspires joy. We think this is probably the most versatile speech-generative model out there. This is still a research project, but I think that we're going to be able to build

Starting point is 00:06:11 a lot of interesting things with tools like this. Meta puts out so many different AI models at this point that it's hard to keep track of them all, but this one really does strike me as something pretty significant. They claim it as the first ever generative AI speech model that can do tasks it wasn't specifically trained on. That is exactly the sort of transformation that led to all of the tools that we have today when it comes to things like text-to-image. Fascinatingly, however, Meta has decided not to release the Voicebox model or code publicly at this time due to the potential risks of misuse. As part of their efforts to mitigate that potential misuse, they've developed a classifier that can distinguish between authentic speech and audio generated with Voicebox, but they

Starting point is 00:06:49 still decided to share audio samples in a research paper instead of the actual code. For those of you who aren't worried about AI alignment until robots get bodies that are human-like, I have some bad news for you. A new research paper titled Agile Catching with Whole-Body MPC and Blackbox Policy Learning is all about a new approach to enabling robots to catch objects thrown at high speeds. The researchers here explored two different solution strategies: model predictive control (MPC) using accelerated constrained trajectory optimization, and reinforcement learning (RL) using zeroth-order optimization. Now, TLDR, the combination of these methodologies has led to some really big advancements. This sort of high-speed object catching is an extremely complex

Starting point is 00:07:30 task. It requires split-second decision making and precise mechanical control. There are, of course, commercial implications, giving robots the ability to do things where fast and accurate object handling is crucial, or, you know, robots might just become the goalies of the future. And speaking of the robots now being able to do a thing that only humans used to do, another interesting piece of research this week is a paper titled Can Language Models Teach Weaker Agents? Teacher Explanations Improve Students via Theory of Mind. In simple terms, the researchers were trying to understand if advanced AI models, specifically LLMs, can act as teachers to less advanced AI models, which are referred to here as weaker agents. They're interested in

Starting point is 00:08:05 whether these LLMs can improve the performance of weaker agents by providing them with explanations for their predictions. The researchers set up a student-teacher framework where the LLM, the teacher, provides explanations to the weaker agent, the student. However, they also set a limit on how much the teacher can communicate with the student to mimic real-world constraints where resources like time and computational power might be limited. The researchers explore four main questions. Can the teacher's intervention improve the student's predictions? When is it worth explaining a data point to the student? How can the teacher personalize explanations to better teach the student? Can the teacher's explanation improve the student's performance on future data that

Starting point is 00:08:37 hasn't been explained? TLDR, the researchers found that teacher LLMs can indeed improve the performance of the student agents. They also developed a theory of mind approach where the teacher builds two mental models of the student. The first model helps the teacher decide when to intervene, and the second model helps the teacher personalize explanations. Now, this obviously has huge implications for how we train future AI systems. It could be used not only for efficient AI training, but also help personalize AIs for specific use cases. It could be applied in scenarios where multiple AI systems need to work together, where the teacher AI could help improve the performance of the student AI, leading to a more effective collaboration. And there are some interesting implications

Starting point is 00:09:13 for AI alignment. The teacher-student framework could be used to train AI systems to better understand and mimic human values, where the teacher could be a model that has been trained to understand human values and then pass on that understanding to the student. It could personalize alignment for different quote-unquote students. But it also showed that this sort of AI alignment is important, as the study found that misaligned teachers can lower student performance by intentionally misleading them. One more on the theme of multimodality, another big theme of the year. Remember, OpenAI thought that they were going to bring a multimodal model to ChatGPT in 2024. In other words, a model that can interact with text, image, audio, and video inputs, not just

Starting point is 00:09:51 text inputs, but have been constrained by GPU access. Well, this new research, titled LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model, is an approach to multimodality that works with the Meta LLaMA model. They made more parts of the LLaMA model learnable, which means the model can adapt better to the task of following instructions. They introduced a new strategy for incorporating visual information into the model. Instead of feeding visual information into all the layers of the model, they only feed it into the early layers. This helps the model incorporate visual knowledge better. They train the model on two types of data, image-text pairs and instruction-following data, which helps the model get better at both

Starting point is 00:10:23 understanding images and following instructions. And during the use of the model, they incorporate additional expert models, like systems that can generate captions for images or recognize text in images, to enhance image understanding capabilities. They found that the new model, LLaMA-Adapter V2, is better at following instructions that involve both text and images and even performs well in chat interactions. So the research opens up new possibilities for AI applications, such as more interactive chatbots or educational tools. And TLDR, this is just one of the big themes going on in the space right now. Just to show a super quick example, let's upload an image of a Bernese Mountain Dog, as you can see, and type what breed of dog is in the picture,

Starting point is 00:11:01 and then run it. Around 10 seconds later, the dog in the picture is a Bernese Mountain Dog. Now, these guys are obviously far from the only team working on multimodality, but cool to see that research live and demos live as well. Last up, one of the biggest viral trends over the last week has been QR code art. AI-generated visual QR codes started popping up all over Reddit and then Twitter over the last week. But then the good people at Hugging Face went and put together a QR code AI art generator. Radamés Ajna says you only need the QR code content and a text-to-image prompt idea, or you can upload your image.

Starting point is 00:11:36 This is one that is really better seen than described. So if you are listening to this, I suggest you check out the YouTube video as well. All right, guys, that's going to do it for today's AI Breakdown. Hope you enjoyed this. Thanks to so many great research teams for sharing their knowledge. Exciting to see what's coming down the line. If you're enjoying the AI Breakdown, please like, subscribe, and share it, and check out the podcast version and the newsletter version.

Starting point is 00:11:58 Until next time, peace.
