The AI Daily Brief: Artificial Intelligence News and Analysis - The Most Interesting AI Research This Week
Episode Date: June 18, 2023
A Research Roundup including: -AssistGPT -LLaMA multimodal adapter -Meta Voicebox -Text-to-Video -LLM agent teaching weaker AIs -AI art QR code generator The AI Breakdown helps you understand the ...most important news and discussions in AI. Subscribe to The AI Breakdown newsletter: https://theaibreakdown.beehiiv.com/subscribe Subscribe to The AI Breakdown on YouTube: https://www.youtube.com/@TheAIBreakdown Join the community: bit.ly/aibreakdown Learn more: http://breakdown.network/
Transcript
Today on the AI Breakdown, we're looking at the most interesting research from the previous
week. Big themes like multimodality and text-to-video are all over this thing. The AI Breakdown
is a daily podcast and video about the most important news and stories in AI. Like, subscribe, and share,
and go to breakdown.network for more information. Today we are looking at some of the most
interesting AI research from the last week and we're kicking off with a theme that has really
defined the last few months and that is AutoGPTs and more broadly speaking the interest that people
have in AI agents that can actually go out and do tasks. Now, if you've listened to this show
or watched this channel for any amount of time, you have inevitably heard of AutoGPT. In April,
Significant Gravitas introduced AutoGPT, and it quickly became the most interacted with project on GitHub.
Other projects like it include BabyAGI, and basically, again, these tools were all designed
to try to be an approach to an AI agent that could actually not only figure out how to solve
problems, but to potentially spin up and create the other AI agents that could go out and do
the tasks that were necessary to accomplish a particular goal. AssistGPT isn't exactly a one-to-one
AutoGPT competitor or anything like that. Instead, it's an approach that tries to bring in
another theme of this year, which is multimodality, or large language models that can also interact
with visual-based tasks such as images, videos, and more to try to create a more sophisticated
AI assistant. AssistGPT executes tasks using an approach called Plan, Execute, Inspect, and
Learn (PEIL), which is what has reminded some people of AutoGPT. The planner uses natural language
to plan the next steps based on the current progress of reasoning. It decides which tool or
function should be used next to process the input data or move closer to the final answer. The
executor carries out the tasks planned by the planner, which could involve processing visual data
such as images or video, or it could involve other types of tasks such as searching for information
or performing calculations.
The inspector is a memory manager that helps the planner
by providing the appropriate visual information
to the specific tool that needs it,
which again could involve selecting the right images or videos
or involve providing other types of data
that the tool needs to perform its task,
and the learner is designed to improve the system's performance over time.
It allows the model to learn from its experiences
and to discover the optimal solution to a task.
This learning process could involve adjusting
the way the planner, executor, and inspector work
based on the results of the previous tasks.
Now, the paper doesn't specifically mention
the ability of AssistGPT to interact with or create other AI agents to execute tasks.
However, the concept of AssistGPT involves integrating LLMs with various tools, which could
potentially include other AI models or agents.
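To make the plan, execute, inspect, and learn loop a little more concrete, here is a minimal sketch in Python. It is not the AssistGPT implementation: the tool names, the hard-coded planner rule, and the `run` helper are all hypothetical stand-ins, and a real system would have an LLM do the planning and real vision models behind the tools.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    """What the inspector tracks: named inputs and intermediate results."""
    items: dict = field(default_factory=dict)


# Hypothetical toy tools; a real system would wrap vision or search models.
TOOLS = {
    "caption": lambda image: f"a photo of {image}",
    "count_words": lambda text: len(text.split()),
}


def planner(goal, memory, hints):
    """Pick the next (tool, input_name) step, or None when finished.

    This toy planner is a fixed rule that ignores goal and hints;
    the idea in the episode is that an LLM decides this in natural language.
    """
    if "caption" not in memory.items:
        return ("caption", "image")
    if "answer" not in memory.items:
        return ("count_words", "caption")
    return None


def inspector(memory, input_name):
    """Hand the chosen tool the piece of stored data it needs."""
    return memory.items[input_name]


def executor(tool_name, tool_input):
    """Run the tool the planner selected."""
    return TOOLS[tool_name](tool_input)


def learner(hints, trace, success):
    """Remember tool sequences that worked so later plans could reuse them."""
    if success:
        hints.append([tool for tool, _ in trace])
    return hints


def run(goal, image):
    memory = Memory({"image": image})
    hints, trace = [], []
    while (step := planner(goal, memory, hints)) is not None:
        tool, input_name = step
        result = executor(tool, inspector(memory, input_name))
        memory.items["caption" if tool == "caption" else "answer"] = result
        trace.append(step)
    learner(hints, trace, success="answer" in memory.items)
    return memory.items["answer"], hints
```

Calling `run("How many words are in the caption?", "dog")` walks through two planned steps (caption the image, then count the words) and records the successful tool sequence as a hint.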
Next up, we have Rerender A Video.
This is one that people like Matt Wolfe got really excited about.
So as Matt puts it, it takes an input video and re-renders it with your prompt, without the flicker
or weirdness we get from other current models.
For those of you who are listening and not watching, you can see how it takes a
video and then can re-render it based on a variety of different styles. So a colorful impasto
painting, a Ghibli cartoon, a starry night of Van Gogh. Basically, all of these text prompts
change the nature of the video to look like the text prompt. This research presents a novel
framework for text-guided video-to-video translation. The framework consists of two parts,
keyframe translation and full video translation. Keyframe translation uses a modified
diffusion model to generate keyframes from the text prompt, with constraints applied to ensure
coherence in shapes, textures, and colors across frames, and then full video translation fills in
the gaps between keyframes using a method that matches patches and blends frames from the keyframes,
which ensures style and texture consistency over time. One of the big things we've seen over the last
couple of months is that all of that excitement and energy that came to the text-to-image space
last year coming into this year (think of the difference between Midjourney 2 and the Midjourney 5 that
we have now) is starting to make it to video as well. Runway Gen-1 had this sort of video
re-skinning capacity, and then Runway Gen-2 has an entire video generation suite where you can actually
go text-to-video in four-second snippets. The creative, artistic, and business implications of this sort
of text-to-video approach are enormous, and people are kind of salivating to get their hands on anything
that can do this type of video modification or generation. Next up we have Meta's Voicebox. As Min Choi
sums it up, it's an AI that can create high quality audio in six languages with noise removal,
content editing, style conversion, and more, all without specific training.
Min writes, unlike traditional speech AI that needs specific training for each task,
Voicebox learns from raw audio and transcriptions.
It's based on flow matching and outperforms current models in speed,
intelligibility, and audio similarity.
Unlike old models that require carefully prepared data,
Voicebox can learn from varied speech data without needing careful labeling.
It's trained on diverse data at a much larger scale, making it versatile and adaptable.
Voicebox was trained on 50,000 hours of recorded
speech and can perform tasks including text-to-speech synthesis, cross-lingual style transfer,
speech denoising and editing, and more. Now, there are just a ton of potential use cases for this.
With just a two-second-long audio input sample, Voicebox can match the sample's audio style
and use it for text-to-speech generation. That could potentially be used to provide speech
capabilities for people who are unable to speak, to allow individuals to customize the
voices used by non-player characters or virtual assistants, to have themselves saying things
in a different language in the same style as they normally would in whatever their home
language is. And again, it could also be totally transformational in how fast and easy it is to
clean up audio, taking out background noises or imperfections, and things like that.
This is Voicebox, a new AI foundation model for speech that does some pretty awesome things.
If you give it text, it can read it in a bunch of different styles.
Penelope Porcupine and Sammy Sloth danced gracefully in the tree tops.
Penelope Porcupine and Sammy Sloth danced gracefully in the tree tops.
And you can use it to fix background noise too, kind of like an eraser, but for audio.
Sammy and Penelope's heartwarming friendship inspires joy.
Sammy and Penelope's heartwarming friendship inspires joy.
We think this is probably the most versatile speech-generative model
out there. This is still a research project, but I think that we're going to be able to build
a lot of interesting things with tools like this. Meta puts out so many different AI models at
this point that it's hard to keep track of them all, but this one really does strike me as something
pretty significant. They claim it as the first ever generative AI speech model that can do tasks
it wasn't specifically trained on. That is exactly the sort of transformation that led to all of
the tools that we have today when it comes to things like text to image. Fascinatingly, however,
Meta has decided not to release the Voicebox model or code publicly at this time due to the potential
risks of misuse. As part of their efforts to mitigate that potential misuse, they've developed a
classifier that can distinguish between authentic speech and audio generated with Voicebox, but they
still decided to share audio samples in a research paper instead of the actual code.
For those of you who aren't worried about AI alignment until robots get bodies that are human-like,
I have some bad news for you. A new research paper titled Agile Catching with Whole-Body MPC and
Blackbox Policy Learning is all about a new approach to enabling robots to catch objects thrown at
high speeds. The researchers here explored two different solution strategies: model
predictive control (MPC) using accelerated constrained trajectory optimization, and reinforcement learning
(RL) using zero-order optimization. Now, TLDR, the combination of these methodologies has led to
some really big advancements. This sort of high speed object catching is an extremely complex
task. It requires split second decision making and precise mechanical control. There are, of course,
commercial implications, giving robots the ability to do things where fast and accurate object
handling is crucial, or, you know, robots might just become the goalies of the future.
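For anyone wondering what zero-order optimization means in practice: the policy's parameters are improved using only the scores of trial runs, with no gradients involved. Here is a minimal sketch, assuming a made-up one-number "aim" parameter, a toy score function, and a plain random-search update. None of this is the method or the setup from the paper; it only illustrates the general idea.

```python
import random


def catch_score(aim):
    """Toy stand-in for one trial run: higher is better, best at aim = 2.0."""
    return -(aim - 2.0) ** 2


def random_search(score, start, steps=200, noise=0.5, seed=0):
    """Zero-order search: perturb the parameter, keep it only if the score improves."""
    rng = random.Random(seed)
    best, best_score = start, score(start)
    for _ in range(steps):
        candidate = best + rng.gauss(0.0, noise)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best


aim = random_search(catch_score, start=-3.0)
```

The search only ever calls `catch_score` and compares results, which is why this family of methods works when the thing being scored is a black box, such as a physical robot attempting a catch.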
And speaking of the robots now being able to do a thing that only humans used to do,
another interesting piece of research this week is a paper titled Can Language Models Teach Weaker
Agents? Teacher Explanations Improve Students via Theory of Mind. In simple terms, the researchers
were trying to understand if advanced AI models, specifically LLMs, can act as teachers
to less advanced AI models, which are referred to here as weaker agents. They're interested in
whether these LLMs can improve the performance of weaker agents by providing them with explanations
for their predictions. The researchers set up a student-teacher framework where the LLM, the teacher,
provides explanations to the weaker agent, the student. However, they also set a limit on how much
the teacher can communicate with the student to mimic real-world constraints where resources
like time and computational power might be limited. The researchers explore four main questions.
Can the teacher's intervention improve the student's predictions? When is it worth explaining
a data point to the student? How can the teacher personalize explanations to better teach
the student? Can the teacher's explanation improve the student's performance on future data that
hasn't been explained? TLDR, the researchers found that teacher LLMs can indeed improve the performance
of the student agents. They also developed a theory of mind approach where the teacher builds
two mental models of the student. The first model helps the teacher decide when to intervene,
and the second model helps the teacher personalize explanations. Now, this obviously has huge
implications for how we train future AI systems. It could be used not only for efficient AI
training, but also help personalize AIs for specific use cases. It could be applied in scenarios
where multiple AI systems need to work together, where the teacher AI could help improve the performance
of the student AI, leading to a more effective collaboration. And there are some interesting implications
for AI alignment. The teacher-student framework could be used to train AI systems to better
understand and mimic human values, where the teacher could be a model that has been trained to
understand human values and then pass on that understanding to the student. It could personalize
alignment for different quote-unquote students. But it also showed that this sort of AI alignment is
important as the study found that misaligned teachers can lower student performance by
intentionally misleading them. One more on the theme of multimodality, another big theme of the year.
Remember, OpenAI thought that they were going to bring a multimodal model to ChatGPT in 2024.
In other words, a model that can interact with text, image, audio, and video inputs, not just
text inputs, but they have been constrained by GPU access. Well, this new research, titled LLaMA-Adapter
V2: Parameter-Efficient Visual Instruction Model, is an approach to
multimodality that works with Meta's LLaMA model. They made more parts of the LLaMA model
learnable, which means the model can adapt better to the task of following instructions.
They introduced a new strategy for incorporating visual information into the model. Instead of
feeding visual information into all the layers of the model, they only feed it into the early layers.
This helps the model incorporate visual knowledge better. They train the model on two types of data,
image text pairs and instruction following data, which helps the model get better at both
understanding images and following instructions. And during the use of the model, they incorporate
additional expert models, like systems that can generate captions for images or recognize text
in images, to enhance image understanding capabilities. They found that the new model,
LLaMA-Adapter V2, is better at following instructions that involve both text and images and even
performs well in chat interactions. So the research opens up new possibilities for AI applications,
such as more interactive chatbots or educational tools. And TLDR, this is just one of the big
themes going on in the space right now. Just to show a super quick example, let's upload an image
of a Bernese Mountain Dog, as you can see, and type "What breed of dog is in the picture?",
and then run it.
Around 10 seconds later: "The dog in the picture is a Bernese Mountain Dog."
Now, these guys are obviously far from the only team working on multimodality, but cool
to see that research live and demos live as well.
Last up, one of the biggest viral trends over the last week has been QR code art.
AI-generated visual QR codes started popping up all over Reddit and then Twitter over the last week,
but then the good people at Hugging Face went and put together a QR code AI art generator.
Radamés Ajna says you only need the QR code content and a text-to-image prompt idea, or you can upload your image.
This is one that is really better seen than described.
So if you are listening to this, I suggest you check out the YouTube video as well.
All right, guys, that's going to do it for today's AI Breakdown.
Hope you enjoyed this.
Thanks to so many great research teams for sharing their knowledge.
Exciting to see what's coming down the line.
If you're enjoying the AI Breakdown, please like, subscribe, and
share it, check out the podcast version and the newsletter version.
Until next time, peace.
