The AI Daily Brief: Artificial Intelligence News and Analysis - Study: Reinforcement Learning from AI Feedback Performs As Well As Human Feedback

Episode Date: September 5, 2023

Today on The AI Breakdown, NLW looks at new research from Google that shows that reinforcement learning using artificial intelligence rather than human feedback could perform as well as RLHF. Before that on the Brief: the first AI pop singer gets a record deal; an AI-produced covid drug moves to phase 1 trials, and more. Today's Sponsor: Supermanage - AI for 1-on-1's - https://supermanage.ai/breakdown ABOUT THE AI BREAKDOWN The AI Breakdown helps you understand the most important news and discussions in AI.  Subscribe to The AI Breakdown newsletter: https://theaibreakdown.beehiiv.com/subscribe Subscribe to The AI Breakdown on YouTube: https://www.youtube.com/@TheAIBreakdown Join the community: bit.ly/aibreakdown Learn more: http://breakdown.network/

Transcript
Starting point is 00:00:00 Today on The AI Breakdown, we're looking at a new research paper that suggests that reinforcement learning using AI could perform as well as reinforcement learning using human feedback. Before that on the Brief, the first AI COVID drug heads to phase one trials. The AI Breakdown is a daily podcast and video about the most important news and discussions in AI. Go to breakdown.network for more information about our newsletter, our YouTube, and our Discord. Welcome back to The AI Breakdown Brief. All the AI headline news you need in around five minutes. We kick off today with a story that I think reflects something that we're going to hear a lot more about in the future, which is AI-generated pharmaceuticals. Fox News reports that a new AI-generated COVID drug has entered phase one clinical trials.
Starting point is 00:00:47 So in terms of the drug itself, this is an alternative to Paxlovid. In other words, it is an oral medication that you take after you've gotten COVID, not a vaccine that prevents it in the first place. The company behind the treatment is called Insilico Medicine, which is based both in Hong Kong and in New York City. The drug is referred to as ISM-3312, and in their early trials, the company says it has some advantages over Paxlovid. That includes being effective against variants that are resistant to Paxlovid, being more stable and working for a longer period of time, and not having the Paxlovid rebound, which, as Fox describes, is when patients recover from COVID and test negative, but then test positive again a short time later. Here's how the article sums up the process of creation for Insilico.
Starting point is 00:01:28 Quote, to create its new drug, Insilico's research team first used its target discovery platform, PandaOmics, to identify the target protein within the coronavirus. Next, it used its in-house generative chemistry platform, Chemistry42, to generate new molecules that would attack the protein as a means of treating COVID and other coronaviruses. A phase one study means that ISM-3312 is now being evaluated in healthy volunteers, and the expectation is that the results of the trials will be released at the end of this year. Said the company's CEO, generative AI offers us a powerful tool for accelerating the drug discovery process and allows us to quickly identify new solutions that we hope can provide more potent
Starting point is 00:02:05 defenses against mutating COVID strains and prevent another pandemic. Now, this is definitely a theme we're seeing more and more. A couple weeks ago, Vox wrote a piece, AI-discovered drugs will be for sale sooner than you think. Among other things, the piece discusses just how much faster AI models can be than traditional drug discovery processes, which typically can take up to 12 years or more. Next up, we move to a follow-up from last week. In the wake of approvals from the Chinese government, the companies that have received those approvals
Starting point is 00:02:34 are mass releasing a huge number of AI models and applications. Baidu is among the companies that have released their LLMs following last week's announcement, and Reuters reports that 360 Security Technology and iFlytek have just launched their own AIs, including voice recognition technology. In our next story, unions in Las Vegas are gearing up for a battle around AI. NPR's All Things Considered covered how quickly AI is infiltrating Sin City, noting that recent studies have suggested that between 38 and 65% of jobs in Las Vegas could be automated by 2035. Many commentators believe that these shifts are fairly inevitable.
Starting point is 00:03:12 John Restrepo, principal at RCG Economics, said, wherever the resort industry can replace their workers and not affect productivity, profits, or the customer experience, wherever they can do that with artificial intelligence, they will. The question is, how do you factor in and how do you adapt your economic development strategy, your community strategy, your resiliency strategy to accommodate a world where certain jobs no longer exist? Now, one of the answers is, of course, to fight back. As NPR writes, unions in Las Vegas are closely watching the changes. The largest union in Nevada, the Culinary Union, represents 60,000 service and hospitality workers in Las Vegas and Reno. Later this year, it hopes to have a new negotiated contract that includes protections against
Starting point is 00:03:50 AI replacing jobs. Ted Pappageorge, who's the secretary-treasurer of the union, told NPR, quote, we had a huge fight about tech in our previous contract, we're going to have the same fight this time around. Promising bellicose action, he said, we'd like to say we're going to be able to get to an agreement, but if we have to, we're going to have a big fight and do whatever it takes, including a strike on technology. Now, obviously, we've been following the writers' strike and the Screen Actors Guild strike in Hollywood, and part of the arguments of the unions involved in those strikes is that
Starting point is 00:04:17 this fight would be coming for other workers as well. The fact that the Culinary Union in Las Vegas is gearing up for just such a battle suggests that that may be an accurate assessment. Moving back to the world of AI and the consumer internet, a new plugin for ChatGPT is getting people excited in a way that most haven't. That plugin is Canva, the popular image creation and design tool, and in some ways it gives ChatGPT a new, light, multimodal feel. Decrypt writes, tapping into the broad and rapidly evolving social marketing space,
Starting point is 00:04:47 OpenAI has unveiled a Canva plugin for its popular chatbot ChatGPT. The strategic move aims to make the process of creating visuals such as logos, banners, and more, even more simple for businesses and entrepreneurs. Now, speaking of AI-generated content, a new report suggests that 90% of online content will be AI-generated by just 2026. The report comes from Europol and is called Facing Reality? Law Enforcement and the Challenge of Deepfakes. Now, the report is actually pretty interesting even if you're not in law enforcement. And I think even if you see that 90% of online content being synthetically generated by 2026 as perhaps a bit hyperbolic, it shows at least the scale and
Starting point is 00:05:27 magnitude of the problem in many people's estimation. At the same time, it doesn't seem totally insane, if only because content that is created by AI is so much cheaper and faster to produce than content that's created by humans. The question is whether this type of shift is completely inevitable, and also what it will do for the value of content that isn't created by AI. Lastly, on that theme of content created by AI, Noonoouri is a completely AI-created influencer who has 400,000 followers on Instagram and has previously done fashion and lifestyle brand deals, and has now officially signed with Warner Music, which Noonoouri's team claims makes her the first AI pop singer. Forbes sums up the appeal for a company like Warner.
Starting point is 00:06:07 Noonoouri won't get worn out from touring and promoting her music, and she can be restyled in seconds to keep in step with changing teen trends, and if she makes it to superstar status, she won't start making diva-like demands or demand an enormous pay raise. Given that ABBA's virtual tour is now making $2 million a week, which is a show produced entirely with virtual avatars and not the musicians themselves, it feels like this is the beginning, not the end of a trend. Anyways, friends, that is going to do it for today's AI Breakdown Brief. I'll be back soon with the main AI Breakdown.
Starting point is 00:06:37 Before we get into the main AI Breakdown, I want to tell you about today's sponsor, Supermanage. If you work in a professional setting, you probably have some version of a one-on-one meeting, either with the people that work for you or the people that you work with. Unfortunately, all too often, those one-on-one meetings become glorified catch-up calls. Don't you wish you could jump right to the stuff that really matters? That's where Supermanage comes in. Supermanage AI magically distills your team's public Slack channels into a real-time brief on
Starting point is 00:07:06 any employee, any time. Catch up on contributions, work in progress, challenges they're facing, sentiment, everything you need to show up ready for a truly meaningful conversation, and it's completely free. Visit supermanage.ai/breakdown today to start making the most of your one-on-ones. And thanks again to Supermanage for sponsoring The AI Breakdown. Welcome back to The AI Breakdown. Today we are digging into some new research that I think has some fairly interesting implications for the development of foundational models.
Starting point is 00:07:39 Now, we're getting a little bit wonky in this one, but I actually think it's very significant and touches on not just model design and training processes, but AI ethics as well. Over the weekend, I started to notice people posting about research from Google called RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback. Now, at core here is a concept called reinforcement learning from human feedback. RLHF is one of the key techniques by which we align large language models with human preferences. Now, Hugging Face did a great blog post last December explaining this concept. The entire blog post is worth a read, but here's how they saw ChatGPT explain RLHF. And keep in mind, this was an explain it like I'm 5 example. ChatGPT says,
Starting point is 00:08:24 Imagine you have a robot named Rufus who wants to learn how to talk like a human. Rufus has a language model that helps him understand words and sentences. First, Rufus will say something using his language model. For example, he might say, I am a robot. Then a human will listen to what Rufus said and give him feedback on whether it sounded like a natural sentence a human would say. The human might say, that's not quite right, Rufus. Humans don't usually say, I am a robot. They might say, I'm a robot, or I am a machine. Rufus will take this feedback and use it to update his language model. He will try to say the sentence again, using the new information he received from the human. This time he might say, I'm a robot. The human will listen again and give Rufus more feedback.
Starting point is 00:08:59 This process will continue until Rufus can say sentences that sound natural to a human. Now, as this paper from Google points out, reinforcement learning from human feedback is one of the, quote, key drivers of success in modern conversational language models like ChatGPT and Bard. By training with reinforcement learning, language models can be optimized on complex, sequence-level objectives that are not easily differentiable with traditional supervised fine-tuning. The goal of this study was to compare reinforcement learning from human feedback versus reinforcement learning from artificial intelligence feedback. They write, in this work, we directly compare RLAIF against RLHF on the task of
Starting point is 00:09:38 summarization. Given the text and two candidate responses, we assign a preference label using an off-the-shelf LLM. We then train a reward model on the LLM preferences with a contrastive loss. Finally, we fine-tune a policy model with reinforcement learning using the reward model to provide rewards. Now, fascinatingly, what they found is that RLAIF achieved comparable performance to RLHF measured in two different ways. First, compared to a supervised fine-tuning baseline that didn't have a reinforcement learning process, the results of both RLAIF and RLHF were preferred by humans a significant portion of the time. For RLAIF, it was preferred 71% of the time. For RLHF, it was preferred 73% of the time, which the researchers say is basically statistically equal.
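For readers who want to see what training a reward model on preference labels can look like, here is a minimal sketch in plain Python. It assumes the pairwise logistic (Bradley-Terry) loss, which is one common choice for this kind of objective; whether it matches the paper's exact "contrastive loss" is an assumption, and the function name and the scores below are hypothetical.

```python
import math

def pairwise_preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Pairwise logistic loss: -log(sigmoid(reward_preferred - reward_rejected)).

    The loss is small when the reward model scores the preferred summary
    higher than the rejected one, and large when it ranks them the wrong way.
    """
    margin = reward_preferred - reward_rejected
    # log(1 + exp(-margin)) equals -log(sigmoid(margin)); branch to avoid overflow
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical reward-model scores for two candidate summaries of the same text.
# The "preferred" summary is whichever one the labeler (here, an LLM) picked.
loss_when_model_agrees = pairwise_preference_loss(2.0, 0.5)
loss_when_model_disagrees = pairwise_preference_loss(0.5, 2.0)
```

In a real setup the two scores would come from a neural reward model and the loss would be minimized by gradient descent over many labeled pairs; the trained reward model then supplies the rewards for the reinforcement learning step described above.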
Starting point is 00:10:24 Now, when compared head-to-head, RLHF and RLAIF were both preferred 50% of the time. In other words, they were preferred an equivalent amount of times by human judges. Now, there are a couple reasons why this is significant. As the researchers put it, quote, The need for high-quality human labels is an obstacle for scaling up RLHF, and one natural question is whether artificially generated labels can achieve comparable results. So first, there is just simply a question of scale. Any process that involves humans is inherently going to be less scalable, more time-consuming, more challenging, and more costly, than a process where the humans can be replaced by artificial intelligence.
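To make the "statistically equal" point about the 71% and 73% win rates concrete, here is a rough sketch of a two-proportion z-test in Python. The episode does not say which test the researchers used or how many human judgments were collected, so the test itself and the sample size of 1,000 per method are purely hypothetical assumptions for illustration.

```python
import math

def two_proportion_z(wins_a: int, n_a: int, wins_b: int, n_b: int) -> float:
    """z statistic for the difference between two win rates, using pooled variance."""
    p_a = wins_a / n_a
    p_b = wins_b / n_b
    pooled = (wins_a + wins_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / std_err

# HYPOTHETICAL counts: 1,000 human judgments per method (not from the source).
# 710 wins corresponds to the 71% RLAIF figure, 730 wins to the 73% RLHF figure.
z = two_proportion_z(710, 1000, 730, 1000)

# 1.96 is the usual two-sided threshold at the 5% significance level.
difference_is_significant = abs(z) > 1.96
```

Under these made-up sample sizes the two-point gap falls well inside the range expected from chance alone; with far more judgments the same gap could become significant, which is why the actual sample size matters.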
Starting point is 00:10:59 Now, a second piece, though, which isn't really discussed in this paper as much, is the human cost and AI ethics questions of the actual reinforcement learning process. There have been a number of articles about this challenge that have come out over the last month or two. One in The Guardian was called, It's destroyed me completely: Kenyan moderators decry toll of training of AI models. The Guardian piece tells the story of Mophat Okinyi,
Starting point is 00:11:22 who, as part of a previous job with ChatGPT, would view up to 700 text passages a day, many of which, he said, depicted graphic sexual violence. From The Guardian, quote, He recalls he started avoiding people after having read texts about rapists and found himself projecting paranoid narratives onto people around him. Then last year, his wife told him he was a changed man and left. Slate's Alex Kantrowitz wrote a piece about this in May. It was called The horrific content a Kenyan worker had to see while training ChatGPT and reads,
Starting point is 00:11:49 this type of work has been crucial for bots like ChatGPT and Google's Bard to function and to feel so magical. But the human cost of the effort has been widely overlooked. In a process called reinforcement learning from human feedback or RLHF, bots become smarter as humans label content, teaching them how to optimize based on feedback. AI leaders, including OpenAI's Sam Altman, have praised the practice's technical effectiveness, yet they rarely talk about the costs some humans pay to align the AI systems with our values. Now, a big question in both of these pieces is what OpenAI and the contracting company, Sama, that they worked with, did or didn't provide these Kenyan workers, which is a really important set of questions to ask, but the broader point is just
Starting point is 00:12:24 to acknowledge that reinforcement learning in the context of LLMs is not without cost. Now, other companies have tried other approaches for exactly these concerns, both around scalability as well as around the harm to individuals who are involved in the RLHF process. One of the more notable of these is Anthropic's constitutional approach, which they explain as, quote, giving language models explicit values determined by a constitution, rather than values determined implicitly via large-scale human feedback. Of RLHF, they write, this process has several shortcomings. First, it may require people to interact with disturbing outputs. Second, it does not scale efficiently. As the number of responses increases or the models produce more complex responses, crowd workers will find it difficult
Starting point is 00:13:02 to keep up with or fully understand them. Third, reviewing even a subset of outputs requires substantial time and resources, making this process inaccessible for many researchers. So given all of this, shouldn't we just switch now to reinforcement learning from artificial intelligence entirely? Well, even the Google researchers who produced this paper caution against running that far that fast. They write, while this work highlights the potential of RLAIF, we note some limitations on these findings. First, this work only explores the task of summarization, leaving an open question about generalizability to other tasks. Second, we did not estimate whether LLM inference is advantageous versus human labeling in terms of monetary costs. Additionally, there remain many interesting open questions, such as whether
Starting point is 00:13:42 RLHF combined with RLAIF can outperform a single approach alone, how well using an LLM to directly assign rewards performs, whether improving AI labeler alignment translates to improved final policies, and whether using an LLM labeler the same size as the policy model can further improve the policy, i.e. whether a model can self-improve. We leave these questions for future work, and we hope that this paper motivates further research in the area of RLAIF. Now, even beyond just having comparable results, the paper also pointed out that, quote, RLAIF appears less likely to hallucinate than RLHF. The hallucinations in RLHF are often plausible, but are inconsistent with the original text, meaning that the upside is potentially not just scalability and replacing the human cost of the reinforcement
Starting point is 00:14:23 learning, but that there actually could be benefits as well in terms of model performance. Still, to the extent that Google's goal was to provoke more interest in further exploring this area, I think it's certainly going to be successful in that. In many ways, it's also part of a larger conversation around what parts of the AI model training process can be done by AI versus need human input. One big topic of conversation throughout the year has been the concern around model collapse, which is something that might theoretically happen as AI trains on more AI-generated content. However, recently, when Llama 2 was released, one of the things that Meta noted was that the model that was trained on synthetic data actually outperformed models that
Starting point is 00:14:58 were trained on non-synthetic data, bringing up a lot of questions about how much our assumptions about model collapse actually hold true. Now, if you want to take it from the biggest implications perspective, it's really just how little we know about these models and about how to improve them, and for as advanced as we are, it feels very much like we are at the most infantile stages of figuring these things out. In that context, I find it worthwhile to approach everything with a certain base level of humility and an appreciation that the next research paper that comes out might fundamentally change our understanding about how the entire space works. For now, though, some really interesting stuff to chew on with some really big potential benefits. I will, of course,
Starting point is 00:15:36 include a link to the research and the news in the show notes, and I hope you take some time to go check it out. Thanks as always for listening or watching, and until next time, peace.
