The AI Daily Brief: Artificial Intelligence News and Analysis - Study: Reinforcement Learning from AI Feedback Performs As Well As Human Feedback
Episode Date: September 5, 2023
Today on The AI Breakdown, NLW looks at new research from Google that shows that reinforcement learning using artificial intelligence rather than human feedback could perform as well as RLHF. Before that on the Brief: the first AI pop singer gets a record deal; an AI-produced COVID drug moves to phase 1 trials, and more. Today's Sponsor: Supermanage - AI for 1-on-1's - https://supermanage.ai/breakdown ABOUT THE AI BREAKDOWN The AI Breakdown helps you understand the most important news and discussions in AI. Subscribe to The AI Breakdown newsletter: https://theaibreakdown.beehiiv.com/subscribe Subscribe to The AI Breakdown on YouTube: https://www.youtube.com/@TheAIBreakdown Join the community: bit.ly/aibreakdown Learn more: http://breakdown.network/
Transcript
Today on the AI breakdown, we're looking at a new research paper that suggests that reinforcement learning using AI could perform as well as reinforcement learning using human feedback.
Before that on the brief, the first AI COVID drug heads to phase one trials.
The AI breakdown is a daily podcast and video about the most important news and discussions in AI.
Go to Breakdown.network for more information about our newsletter, our YouTube, and our Discord.
Welcome back to the AI breakdown brief.
All the AI headline news you need in around five minutes.
We kick off today with a story that I think reflects something that we're going to hear a lot more about in the future, which is AI-generated pharmaceuticals.
Fox News reports that a new AI-generated COVID drug has entered phase one clinical trials.
So in terms of the drug itself, this is an alternative to Paxlovid.
In other words, it is an oral medication that you take after you've gotten COVID, not a vaccine that prevents it in the first place.
The company behind the treatment is called Insilico Medicine, which is based both in Hong Kong and in New York City.
The drug is referred to as ISM-3312, and in their early trials, the company says it has some advantages over Paxlovid.
That includes being effective against variants that are resistant to Paxlovid, being more stable
and working for a longer period of time, and not having the Paxlovid rebound, which, as Fox describes,
is when patients recover from COVID and test negative, but then test positive again a short time later.
Here's how the article sums up the process of creation for Insilico.
Quote, to create its new drug, Insilico's research team first used its target
discovery platform PandaOmics to identify the target protein within the coronavirus.
Next, it used its in-house generative chemistry platform, Chemistry42, to generate new
molecules that would attack the protein as a means of treating COVID and other coronaviruses.
A phase one study means that ISM-3312 is now being evaluated in healthy volunteers, and the
expectation is that the results of the trials will be released at the end of this year.
Said the company's CEO, generative AI offers us a powerful tool for accelerating the drug discovery
process and allows us to quickly identify new solutions that we hope can provide more potent
defenses against mutating COVID strains and prevent another pandemic.
Now, this is definitely a theme we're seeing more and more.
A couple weeks ago, Vox wrote a piece, AI discovered drugs will be for sale sooner than you
think.
Among other things, the piece discusses just how much faster AI models can be than traditional
drug discovery processes, which typically can take up to 12 years or more.
Next up, we move to a follow-up from last week.
In the wake of approvals from the Chinese government, the companies that have received those approvals
are mass releasing a huge number of AI models and applications.
Baidu is among the companies that have released their LLMs following last week's announcement,
and Reuters reports that 360 Security Technology and iFlytek have just launched their own AIs,
including voice recognition technology.
In our next story, unions in Las Vegas are gearing up for a battle around AI.
NPR's All Things Considered covered how quickly AI is infiltrating
Sin City, noting that recent studies have suggested that between 38 and 65% of jobs in Las Vegas
could be automated by 2035. Many commentators believe that these shifts are fairly inevitable.
John Restrepo, principal at RCG Economics, said, wherever the resort industry can replace their
workers and not affect productivity, profits, or the customer experience, wherever they can do that
with artificial intelligence, they will. The question is, how do you factor in and how do you
adapt your economic development strategy, your community strategy, your resiliency strategy to
accommodate a world where certain jobs no longer exist. Now, one of the answers is, of course, to fight back.
As NPR writes, unions in Las Vegas are closely watching the changes. The largest union in Nevada,
the Culinary Union, represents 60,000 service and hospitality workers in Las Vegas and Reno.
Later this year, it hopes to have a new negotiated contract that includes protections against
AI replacing jobs. Ted Pappageorge, who's the secretary-treasurer of the union, told NPR, quote,
we had a huge fight about tech in our previous contract, we're going to have the same
fight this time around.
Promising bellicose action, he said, we'd like to say we're going to be able to get to an agreement,
but if we have to, we're going to have a big fight and do whatever it takes, including
a strike on technology.
Now, obviously, we've been following the writers' strike and the Screen Actors Guild strike in
Hollywood, and part of the arguments of the unions involved in those strikes is that
this fight would be coming for other workers as well.
The fact that the Culinary Union in Las Vegas is gearing up for just such a battle
suggests that that may be an accurate assessment.
Moving back to the world of AI and the consumer internet,
a new plugin for ChatGPT is getting people excited in a way that most haven't.
That plugin is Canva, the popular image creation and design tool,
and in some ways gives ChatGPT a new, light, multimodal feel.
Decrypt writes, tapping into the broad and rapidly evolving social marketing space,
OpenAI has unveiled a Canva plugin for its popular chatbot ChatGPT.
The strategic move aims to make the process of
creating visuals such as logos, banners, and more, even more simple for businesses and entrepreneurs.
Now, speaking of AI generated content, a new report suggests that 90% of online content
will be AI generated by just 2026. The report comes from Europol and is called Facing Reality?
Law Enforcement and the Challenge of Deepfakes. Now, the report is actually pretty interesting
even if you're not in law enforcement. And I think even if you see that 90% of online content
being synthetically generated by 2026 as perhaps a bit hyperbolic, it shows at least the scale and
magnitude of the problem in many people's estimation. At the same time, it doesn't seem totally insane,
if only because content that is created by AI is so much cheaper and faster to produce than content
that's created by humans. The question is whether this type of shift is completely inevitable,
and also what it will do for the value of content that isn't created by AI.
Lastly, on that theme of content created by AI,
Noonoouri is a completely AI-created influencer who has 400,000 followers on Instagram and has previously done fashion and lifestyle brand deals,
and has now officially signed with Warner Music, which Noonoouri's team claims makes her the first AI pop singer.
Forbes sums up the appeal for a company like Warner.
Noonoouri won't get worn out from touring and promoting her music, and she can be restyled in seconds to keep in step with changing teen trends,
and if she makes it to superstar status, she won't start making diva-like demands or demand
an enormous pay raise.
Given that ABBA's virtual tour is now making $2 million a week, which is a show produced
entirely with virtual avatars and not the musicians themselves, it feels like this is the
beginning, not the end of a trend.
Anyways, friends, that is going to do it for today's AI breakdown brief.
I'll be back soon with the main AI breakdown.
Before we get into the main AI breakdown, I want to tell you about today's sponsor,
Supermanage.
If you work in a professional setting, you probably have some version of a one-on-one meeting, either
with the people that work for you or the people that you work with.
Unfortunately, all too often, those one-on-one meetings become glorified catch-up calls.
Don't you wish you could jump right to the stuff that really matters?
That's where Supermanage comes in.
Supermanage AI magically distills your team's public Slack channels into a real-time brief on
any employee, any time.
Catch up on contributions, work in progress, challenges they're facing, sentiment, everything you
need to show up ready for a truly meaningful conversation, and it's completely free.
Visit supermanage.ai/breakdown today to start making the most of your one-on-ones.
And thanks again to Supermanage for sponsoring the AI breakdown.
Welcome back to the AI breakdown.
Today we are digging into some new research that I think has some fairly interesting
implications for the development of foundational models.
Now, we're getting a little bit wonky in this one, but I actually think it's very
significant and touches on not just model design and training processes, but AI ethics as well.
Over the weekend, I started to notice people posting about research from Google called RLAIF:
Scaling Reinforcement Learning from Human Feedback with AI Feedback. Now, at core here is a concept
called reinforcement learning from human feedback. RLHF is one of the key techniques by which we align
large language models with human preferences. Now, Hugging Face did a great
blog last December explaining this concept. The entire blog post is worth a read, but here's how they saw
ChatGPT explain RLHF. And keep in mind, this was an explain it like I'm 5 example. ChatGPT says,
Imagine you have a robot named Rufus who wants to learn how to talk like a human. Rufus has a language
model that helps him understand words and sentences. First, Rufus will say something using his language
model. For example, he might say, I am a robot. Then a human will listen to what Rufus said and
give him feedback on whether it sounded like a natural sentence a human would say. The human might
say that's not quite right, Rufus. Humans don't usually say, I am a robot. They might say,
I'm a robot, or I am a machine. Rufus will take this feedback and use it to update his language
model. He will try to say the sentence again, using the new information he received from the human.
This time he might say, I'm a robot. The human will listen again and give Rufus more feedback.
This process will continue until Rufus can say sentences that sound natural to a human.
Now, as this paper from Google points out, reinforcement learning from human feedback is one of the,
quote, key drivers of success in modern conversational language models like
ChatGPT and Bard. By training with reinforcement learning, language models can be optimized
on complex, sequence-level objectives that are not easily differentiable with traditional
supervised fine-tuning. What the goal of this study was, was to compare reinforcement learning
from human feedback versus reinforcement learning from artificial intelligence feedback.
They write, in this work, we directly compare RLAIF against RLHF on the task of
summarization. Given the text and two candidate responses, we assign a preference
label using an off-the-shelf LLM. We then train a reward model on the LLM preferences with a
contrastive loss. Finally, we fine-tune a policy model with reinforcement learning using the
reward model to provide rewards. Now, fascinatingly, what they found is that RLAIF
achieved comparable performance to RLHF measured in two different ways. First, compared to a supervised
fine-tuning baseline that didn't have a reinforcement learning process, the results of both RLAIF
and RLHF were preferred by humans a significant portion of the time. For RLAIF, it was preferred 71% of the time.
For RLHF, it was preferred 73% of the time, which the researchers say is basically statistically equal.
Now, when compared head-to-head, RLHF and RLAIF were both preferred 50% of the time.
In other words, they were preferred an equivalent amount of times by human judges.
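As a rough illustration of the reward-model step described in the quote above, here is a minimal Python sketch of training on pairwise preferences. It is not the paper's implementation. The pairwise logistic loss is one common way to realize a contrastive loss over preference pairs and is an assumption here, as are the linear reward model, the two hand-made features per summary, the example data, and every function name. In RLAIF the preferred/rejected label for each pair would come from an off-the-shelf LLM; in RLHF it would come from a human.

```python
import math


def pairwise_loss(r_preferred, r_rejected):
    """Pairwise logistic loss: -log(sigmoid(r_preferred - r_rejected)).

    The loss is small when the preferred summary gets the higher reward.
    """
    return math.log1p(math.exp(-(r_preferred - r_rejected)))


def reward(weights, features):
    """Hypothetical linear reward model: a weighted sum of summary features."""
    return sum(w * x for w, x in zip(weights, features))


def train_reward_model(pairs, dim, lr=0.5, epochs=200):
    """Fit reward weights by gradient descent on the pairwise loss."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for preferred, rejected in pairs:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # Derivative of -log(sigmoid(margin)) with respect to margin.
            grad_scale = -(1.0 - 1.0 / (1.0 + math.exp(-margin)))
            for i in range(dim):
                weights[i] -= lr * grad_scale * (preferred[i] - rejected[i])
    return weights


# Made-up features per summary: [coverage of source, share of unsupported claims].
# Each tuple is (preferred summary, rejected summary) as chosen by the labeler.
pairs = [
    ([0.9, 0.1], [0.4, 0.5]),
    ([0.8, 0.0], [0.7, 0.6]),
    ([0.6, 0.2], [0.5, 0.3]),
]

weights = train_reward_model(pairs, dim=2)
for preferred, rejected in pairs:
    print(reward(weights, preferred) > reward(weights, rejected))
```

The trained reward function would then score the policy model's summaries during the reinforcement learning stage. Nothing in this sketch depends on whether the labels came from people or from a model, which is why swapping the labeler is possible in the first place.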
Now, there are a couple reasons why this is significant.
As the researchers put it, quote,
The need for high-quality human labels is an obstacle for scaling up RLHF, and one natural question is whether
artificially generated labels can achieve comparable results. So first, there is just simply a question of scale.
Any process that involves humans is inherently going to be less scalable, more time-consuming, more challenging, and more costly,
than a process where the humans can be replaced by artificial intelligence.
Now, a second piece, though, which isn't really discussed in this paper as much, is the human cost and AI ethics questions
of the actual reinforcement learning process.
There have been a number of articles about this challenge
that have come out over the last month or two.
One in The Guardian was called,
It's Destroyed Me Completely,
Kenyan moderators decry toll of training of AI models.
The Guardian piece tells the story of Mophat Okinyi,
who, as part of a previous job with ChatGPT,
would view up to 700 text passages a day,
many of which, he said, depicted graphic sexual violence.
From the Guardian, quote,
He recalls he started avoiding people after having read texts about rapists and found himself projecting paranoid narratives onto people around him.
Then last year, his wife told him he was a changed man and left.
Slate's Alex Kantrowitz wrote a piece about this in May.
It was called The Horrific Content a Kenyan Worker Had to See While Training ChatGPT and reads,
this type of work has been crucial for bots like ChatGPT and Google's Bard to function and to feel so magical.
But the human cost of the effort has been widely overlooked.
In a process called reinforcement learning from human feedback or RLHF, bots become
smarter as humans label content teaching them how to optimize based on feedback. AI leaders, including
OpenAI's Sam Altman, have praised the practice's technical effectiveness, yet they rarely talk about the
costs some humans pay to align the AI systems with our values. Now, a big question in both of these
pieces is what OpenAI and the contracting company, Sama, that they worked with, did or didn't provide
these Kenyan workers, which is a really important set of questions to ask, but the broader point is just
to acknowledge that reinforcement learning in the context of LLMs is not without cost. Now, other companies
have tried other approaches for exactly these concerns, both around scalability as well as around
the harm to individuals who are involved in the RLHF process. One of the more notable of these is
Anthropic's constitutional approach, which they explain as, quote, giving language models
explicit values determined by a constitution, rather than values determined implicitly via large-scale
human feedback. Of RLHF, they write, this process has several shortcomings. First, it may require
people to interact with disturbing outputs. Second, it does not scale efficiently, as the number of
responses increases or the models produce more complex responses, crowd workers will find it difficult
to keep up with or fully understand them. Third, reviewing even a subset of outputs requires substantial
time and resources, making this process inaccessible for many researchers. So given all of this,
shouldn't we just switch now to reinforcement learning from artificial intelligence entirely? Well,
even the Google researchers who produced this paper caution against running that far that fast. They write,
while this work highlights the potential of RLAIF, we note some limitations on these findings. First, this
work only explores the task of summarization leaving an open question about generalizability to other
tasks. Second, we did not estimate whether LLM inference is advantageous versus human labeling in terms of
monetary costs. Additionally, there remain many interesting open questions, such as whether
RLHF combined with RLAIF, can outperform a single approach alone, how well using an LLM to
directly assign rewards performs, whether improving AI labeler alignment translates to improved final
policies, and whether using an LLM labeler the same size as the policy model can further improve the
policy, i.e. whether a model can self-improve. We leave these questions for future work, and we hope that
this paper motivates further research in the area of RLAIF. Now, even beyond just having comparable
results, the paper also pointed out that, quote, RLAIF appears less likely to hallucinate than
RLHF. The hallucinations in RLHF are often plausible, but are inconsistent with the original text,
meaning that it's not just potentially scalability in replacing the human cost of the reinforcement
learning, but that there actually could be benefits as well in terms of model performance.
Still, to the extent that Google's goal was to provoke more interest in further exploring this area,
I think it's certainly going to be successful in that. In many ways, it's also part of a larger
conversation around what parts of the AI model training process can be done by AI versus need
human input. One big topic of conversation throughout the year has been the concern around
model collapse, which is something that might theoretically happen as AI trains on more
AI generated content. However, recently, when Llama 2 was released, one of the things that
Meta noted was that the model that was trained on synthetic data actually outperformed models that
were trained on non-synthetic data, bringing up a lot of questions about how much our assumptions
about model collapse actually hold true. Now, if you want to take it from the biggest implications
perspective, it's really just how little we know about these models and about how to improve them,
and for as advanced as we are, it feels very much like we are at the most infantile stages of
figuring these things out. In that context, I find it worthwhile to approach everything with a certain
base level of humility and an appreciation that the next research paper that comes out might
fundamentally change our understanding about how the entire space works. For now, though, some
really interesting stuff to chew on with some really big potential benefits. I will, of course,
include a link to the research and the news in the show notes, and I hope you take some time to go
check it out. Thanks as always for listening or watching, and until next time, peace.
