The AI Daily Brief: Artificial Intelligence News and Analysis - The Latest Developments in Superalignment
Episode Date: December 15, 2023
OpenAI's Superalignment team, launched this summer, has just published its first paper, on weak-to-strong generalization: using weaker models to supervise more advanced models as an analogy for humans trying to control superhuman AI. Before that on the Brief, Intel's latest in the AI chip race.
Today's Sponsors: Listen to the chart-topping podcast 'web3 with a16z crypto' wherever you get your podcasts or here: https://link.chtbl.com/xz5kFVEK?sid=AIBreakdown
ABOUT THE AI BREAKDOWN
The AI Breakdown helps you understand the most important news and discussions in AI.
Subscribe to The AI Breakdown newsletter: https://theaibreakdown.beehiiv.com/subscribe
Subscribe to The AI Breakdown on YouTube: https://www.youtube.com/@TheAIBreakdown
Join the community: bit.ly/aibreakdown
Learn more: http://breakdown.network/
Transcript
Today on the AI Breakdown, we're looking at the first research paper from OpenAI's Superalignment
team. Before that on the Brief, another entrant in the AI chip wars. The AI Breakdown is a daily
podcast and video about the most important news and discussions in AI. Go to breakdown.network
for more information about our YouTube, our newsletter, and our Discord. Welcome back to the AI
Breakdown Brief, all the AI headline news you need in around five minutes. One of the
dominant themes of 2023 has been the AI chip wars.
Really, this hasn't been much of a war.
Nvidia has just been the big winner in the entire space.
In fact, I have seen some people argue that when the history books are written,
they will see Nvidia as this year's big AI winner over OpenAI.
Now, I think that that assumes that OpenAI ultimately gets beat by some of the bigger labs
or gets absorbed into one of them,
as I don't think you can really make an argument that this rush was started by anything
other than OpenAI's ChatGPT, but it stands to make the point about just what a monster
year Nvidia has had.
Now, of course, the pressures of capitalism to increase GPUs beyond just Nvidia
have been working overtime this year, as big progress has been halted because people don't
have access to compute.
I mean, hell, OpenAI, arguably the leading lab in this space, had to turn off its ChatGPT
Plus subscriptions for some meaningful amount of time, only just turning them back on this
week because they didn't have enough access to GPUs.
Well, every company and their mother is trying to catch up and address this in some different
way.
Microsoft announced a chip effort.
Amazon is increasing its chip efforts.
Google is increasing its chip efforts.
And of course, the existing chip makers are ramping up as well.
AMD has launched a new chip that's meant to compete with Nvidia's H100 and their forthcoming H200.
And now, as of yesterday, Intel has announced the Gaudi 3.
The Gaudi 3 is specifically designed and focused on artificial intelligence.
CNBC writes,
While the company was light on details,
Gaudi 3 will compete with Nvidia's H100,
the main choice among companies that build huge farms of the chips to power AI applications,
and AMD's forthcoming MI300X when it starts shipping to customers
in 2024. Intel has been building Gaudi chips since 2019 when it bought a chip developer called
Habana Labs. Now, interestingly, in addition to the Gaudi 3 going after that data center
market, Intel is also clearly thinking about AI on personal devices. In fact, their CEO, Pat Gelsinger
said, we think the AI PC will be the star of the show for the upcoming year. Along those lines,
Intel also announced their Core Ultra chips, which are designed for Windows laptops and PCs,
as well as a fifth-generation Xeon server chip, both of which include a specialized AI
component called an NPU that is designed to run AI programs faster. Now, this is not even meant
to be powerful enough to do something like run ChatGPT locally without an internet connection.
But for the array of smaller AI tasks that are going to increasingly come to applications,
this should involve some increased performance. Now, whether it can keep up with Apple's M3 Max,
which is also explicitly designed to speak to this AI PC market, seems pretty unlikely,
but it's interesting just how many different companies are thinking in the same way.
2024, in other words, is the year that AI is coming to your phones and laptops.
Next up, more officials in D.C. are picking up the call, first echoed by people like Gary Gensler from the SEC,
that, as CNN Business puts it, AI is a danger to the financial system.
Basically, we just had a meeting of the Financial Stability Oversight Council or FSOC,
which is a group of regulators across departments in the U.S., created after the 2008 financial crisis
that's designed to understand, identify, and ward against future upcoming threats to financial stability.
FSOC's annual report came out this week, and AI was prominently featured.
Said the report, AI has the potential to spur innovation and drive efficiency,
but its use in financial services requires thoughtful implementation and supervision to manage potential risks.
Some of those risks that they warned about include cybersecurity concerns, compliance risks, and privacy issues.
They also flagged hallucinations, and something that SEC Chair Gary Gensler has talked about before:
a concern that if all trading starts to be done on the back of superintelligent bots,
it could mean that everyone trades in the exact same direction at once because they're all just piling into the same trades based on the same models.
Now, I think it's highly unrealistic to assume, even in a world of superintelligent trading bots,
that people would all be using the same bots that had the same models that had the same conclusions.
But that is something that is now actively being talked about in Washington, D.C.
Speaking of Washington, D.C., one refugee from that place, former Speaker of the House, Kevin McCarthy,
is, according to Axios, looking to get into the AI field.
Basically, McCarthy did a long exit interview with Axios and mentioned that AI is an interest area for him.
He said that he'd been friends with Elon Musk for a decade and he could imagine working with him.
In the interview, McCarthy said, I view AI as a positive.
AI is where California is going to come back.
The knowledge and capability of AI is streaming from California.
Now, given how deep and extensive the regulatory efforts around AI are going to be in the coming year,
I can imagine this being a pretty lucrative place for McCarthy.
One more vaguely related to Elon:
apparently Grimes, who is of course Elon's former partner and the mother of three of his children,
has built a new line of AI plush toys and named one Grok.
So apparently, Grimes worked with a toy company called Curio,
who themselves are working with OpenAI models,
and the idea is that these plush toys converse with the kids that own them
and learn their personalities over time.
Grimes provides the voice for all three of the initial plushies,
which are named Gabbo, Grem, and Grok.
They claim that Grok is a shortening of Grocket,
but obviously people are making the connection to Elon Musk's new AI chatbot, which is of course
called Grok. Lending some credence to their case that it's not related, the U.S. Patent and Trademark
Office said that Curio filed an application to trademark the name Grok on September 12th, and
xAI didn't apply for a trademark for Grok until October 27th. Anyway, this type of thing is going to be
an absolute Rorschach test for how people feel about AI. Some people will think it's amazing
that these toys are so much more sophisticated, engaging, and interactive. Others will view it
as literally an R.L. Stine, Night of the Living Dummy-style horror.
Over in the world of new AI models,
earlier this year, Google announced its AI music generator MusicFX,
but they've just updated it and made it more widely available.
This is an area that I'm super excited about just on a completely personal level
because I'm fascinated with music creation,
and so at some point I'll do a demo video testing out all of these new music generation
models that have come out recently.
Another new model that's getting attention is Alibaba's just-released I2VGen-XL.
This is a new image-to-video model, which is obviously part of a much broader trend towards
video creation that is really capturing people's attention right now.
You've got Pika and Runway as startups that are pushing in this area.
Stable Diffusion just released their first image-to-video models.
And overall, it feels like we're setting up for a 2024 that has a heck of a lot more people
creating short films and animations than have ever done so before.
Now, lastly, something that if you've spent any time on X or Twitter in the last 24 hours or so,
you've probably seen, there is a new tool called Kampi,
camera AI by FAL, which is basically a real-time deep fake. One demo video has someone transforming
themselves into Elon Musk in real time, with the fake Musk's face imitating their head movement
and facial expressions and copying them as they put on sunglasses. If you're watching this video
right now, you might be seeing me talk with George Clooney's face, which is pretty freaky.
Honestly, the resemblance is pretty uncanny. And although this might be an impressive model
or demonstration generally,
I think it's mitigated by the fact that I already look so much like Clooney.
But still, it's impressive technology and something it's going to be exciting to see how people play with.
Anyways, that is going to do it for today's AI Breakdown Brief.
Next up, the main AI Breakdown.
Quickly a brief word from today's sponsor.
As a listener of this show, I suspect you like to stay up to date on all things AI and tech,
which is why you have to check out the chart-topping podcast web3 with a16z crypto.
Produced by venture firm Andreessen Horowitz, web3 with a16z crypto is the perfect companion podcast to the AI Breakdown.
Web3 with a16z crypto is your definitive resource for the future of the internet,
whether you're interested in the convergence of AI and crypto or simply curious about what's next.
If you need a place to start, they recently released an excellent episode with Stanford cryptography professor Dan Boneh
and former Google X engineer Ali Yahya in conversation with host Sonal Chokshi about the intersection of AI and crypto.
From fighting deepfakes and proving humanity to large language models like ChatGPT,
they cover it all.
I highly recommend checking it out, especially if you'd like to learn more about how
AI and crypto will impact our everyday lives.
Beyond Crypto and AI, this show is for creators seeking more ways to truly own their work,
for business leaders trying to prepare for the future today, and for innovators exploring
trending tech topics.
Don't miss out.
Follow web3 with a16z crypto on Apple Podcasts, Spotify, or your favorite listening app.
Welcome back to the AI Breakdown.
Yesterday, Greg Brockman tweeted,
New direction for AI alignment:
weak-to-strong generalization.
Promising initial results.
We used outputs from a weak model,
a fine-tuned GPT-2,
to communicate a task to a stronger model,
GPT-4, resulting in intermediate GPT-3-level performance.
So what is Greg talking about?
Why is this important?
What does it have to do with the robots not killing us?
Well, to understand what's going on,
let's go back to the summer when OpenAI introduced their Superalignment team.
Their announcement blog post reads,
We need scientific and technical breakthroughs to steer and control AI systems much smarter than us.
To solve this problem within four years, we're starting a new team
and dedicating 20% of the compute we've secured to date to this effort.
So, basically, this is not a team that was focused on how to align the current crop of models.
This is a team that's focused on superintelligence.
Superintelligence, OpenAI argues, will be the most impactful technology humanity has ever invented
and could help us solve many of the world's most important problems.
But the vast power of superintelligence could also be very dangerous and could lead to the disempowerment
of humanity or even human extinction.
While superintelligence seems far off now, we believe it could arrive this decade.
Managing these risks will require, among other things, new institutions for governance and
solving the problem of superintelligence alignment.
How do we ensure AI systems much smarter than humans follow human intent?
Currently, we don't have a solution for steering or controlling a potentially superintelligent
AI and preventing it from going rogue.
Our current techniques for aligning AI, such as reinforcement learning from human feedback,
rely on humans' ability to supervise AI. But humans won't be able to reliably supervise AI systems
much smarter than us, and so our current alignment techniques will not scale to superintelligence.
We need new scientific and technical breakthroughs.
Now, even back then, they said that their goal was to build a roughly human-level automated
alignment researcher. They said we can then use vast amounts of compute to scale our efforts
and iteratively align superintelligence. To align the first automated alignment researcher,
we will need to, one, develop a scalable training method, two, validate the resulting model,
and three, stress test our entire pipeline. Now, there are a couple things that were notable
about this announcement at the time. One was the scale of its ambition. They had set a four-year
timeline to solve the core technical challenges of superintelligence alignment, which made
observant spectators kind of wonder how fast they think superintelligence is actually going
to arrive. But on top of that, they also showed up with resources to do so. That 20% of the
compute that they had secured to date commitment was certainly not nothing. Finally, they were bringing
in some big hitters. Notably, Ilya Sutskever, the co-founder and chief scientist of OpenAI, was going to be
leading this team, along with Jan Leike, who's the head of alignment. Now, Ilya leading the
superintelligence alignment team was part of why people thought that perhaps the whole OpenAI
board versus Sam Altman fight had to do with some technical breakthrough, given that Ilya had initially
not supported Sam. Now, at this stage, we don't currently know what Ilya's long-term status with
OpenAI is going to be. By all accounts, it is still up in the air. However, in spite of that,
this new announcement is the first time that we've seen what they've been working on. So,
the co-lead of the Superalignment team, Jan Leike, writes, super excited about our new research direction
for aligning smarter-than-human AI. We fine-tune large models to generalize from weak supervision,
using small models instead of humans as weak supervisors. Jan continues, for lots of important
tasks, we don't have ground truth supervision. Is this statement true? Is this code buggy? We want to
elicit the strong model's capabilities on these tasks without access to ground truth. This is pretty
central to aligning superhuman models. We find that large models generally do better than their
weak supervisor, a smaller model, but not by much. This suggests reward models won't be much better
than their human supervisors. In other words, RLHF won't scale. But even our simple technique can
significantly improve weak-to-strong generalization. This is great news. We can make measurable progress
on this problem today. I believe more progress in this direction will help us align superhuman models.
Now, Colin Burns takes this explanation a little bit further. And apparently, according to a shout
out from Sam Altman, it was actually Colin who inspired this line of research and convinced the whole
team to go down this path. Colin writes, humans won't be able to supervise models smarter than us.
For example, if a superhuman model generates a million lines of extremely complicated code, we won't
be able to tell if it's safe to run or not, if it follows our instructions or not, and so on.
This is a key difficulty of aligning superhuman models.
Unlike in most of machine learning, we will need to supervise models smarter than us.
Despite its importance, it's not obvious how to even begin to empirically study this issue.
We propose a simple analogy to study this problem today.
Can we use weak models to supervise strong models?
If we can learn superhuman reward models or safety classifiers from weak supervision,
that would be a huge advancement for superalignment.
So, editor's note here:
basically, in the analogy they're using, humans
are playing the role of the weak model, GPT-2, whereas superintelligence is taking the role
of the strong model, GPT-4.
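As a rough illustration of that setup, here is a toy Python sketch. It is not the paper's code and involves no GPT models: the "weak supervisor" is a hypothetical rule that mislabels every input whose integer part is a multiple of five, and the "strong student" is a hypothetical threshold classifier fitted to whatever labels it is handed. Every name and number below is an assumption made up for illustration.

```python
# Toy weak-to-strong experiment. Everything here is a made-up stand-in:
# no GPT models, just a noisy labeler and a threshold classifier.

def true_label(x):
    """Ground truth for the toy task: is x at least 50?"""
    return int(x >= 50)

def weak_supervisor(x):
    """Hypothetical weak model: wrong whenever int(x) is a multiple of 5."""
    label = true_label(x)
    return 1 - label if int(x) % 5 == 0 else label

def train_threshold_student(examples):
    """Hypothetical strong student: pick the integer threshold that best
    agrees with the given (x, label) pairs, then predict x >= threshold."""
    best_t, best_agree = 0, -1
    for t in range(101):
        agree = sum(int(x >= t) == y for x, y in examples)
        if agree > best_agree:
            best_t, best_agree = t, agree
    return lambda x: int(x >= best_t)

def accuracy(model, inputs):
    """Share of inputs on which the model matches ground truth."""
    return sum(model(x) == true_label(x) for x in inputs) / len(inputs)

train_inputs = list(range(100))               # inputs the supervisor labels
test_inputs = [x + 0.5 for x in range(100)]   # held-out inputs

# 1. Student trained only on the weak supervisor's (partly wrong) labels.
weak_to_strong = train_threshold_student(
    [(x, weak_supervisor(x)) for x in train_inputs])
# 2. Ceiling: the same student trained on ground-truth labels.
ceiling = train_threshold_student(
    [(x, true_label(x)) for x in train_inputs])

weak_acc = accuracy(weak_supervisor, test_inputs)
w2s_acc = accuracy(weak_to_strong, test_inputs)
ceiling_acc = accuracy(ceiling, test_inputs)
print(weak_acc, w2s_acc, ceiling_acc)
```

In this toy, the student cannot copy the supervisor's scattered mistakes, so it ends up more accurate than the supervisor that labeled its data. Whether large language models behave this way under weak supervision is the open question the research is testing.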
Now back to Colin, he writes,
intuitively this may be feasible because the strong model should already be very capable
at the key alignment relevant tasks we care about.
All the weak supervisor needs to do is elicit key capabilities that already exist within
the strong model.
We empirically test this setup and find that if we fine tune a strong pre-trained model
using weak model supervision, it consistently outperforms the weak model, usually by a large
margin. Generalization appears to be a promising approach to alignment. But directly fine-tuning a
big model to imitate a small model is suboptimal. Intuitively, we want to nudge the generalization
towards outputting what it internally knows. We test a simple method for doing this that makes
the strong model more confident in its predictions. Across a large number of data sets,
this simple method drastically improves weak-to-strong generalization performance. On our NLP tasks,
we can fine-tune GPT-4 using a GPT-2-level supervisor and attain performance close to GPT-3.5. There's still a
lot of work to be done in this setting. Our methods still don't always work well, for example,
performance isn't as good on our ChatGPT preference dataset, and our setup still has
disanalogies with the future alignment problems we care about. But we can make rapid iterative
empirical progress on this problem today. Our setup is simple, general, and easy to try out.
And there is still a huge amount of low-hanging fruit. Alignment feels more solvable than ever
before. Now, another member of the team, Leopold Aschenbrenner, tries to explain this in more
lay terms that I think are useful as well. Leopold writes,
intuitively, superhuman AI systems should know if they're acting safely, but can we summon such
concepts from strong models with only weak supervision?
Incredibly excited to finally share what we've been working on: weak-to-strong generalization.
Future AI systems will be able to do crazy complicated things, e.g. generate 1 million
lines of code.
We'll want to add side constraints to their behavior, like don't lie, or don't escape your server.
But how do we do that if we can't even fully understand what they're doing?
Normally, we train AI systems with human labels, but relative to AI systems much smarter than us,
humans will be weak supervisors.
Maybe we can supervise them on easy problems, but on hard problems we'll only be able to
provide incomplete or flawed labels.
But there's reason to be optimistic.
Concepts like, is this action dangerous, should already be saliently internally represented
by strong models.
Can we just get them to tell us what they know, even about the cases too difficult for us
to supervise directly?
We have a really neat setup to study this.
What happens when we use a small model to supervise a large model?
Will the strong model just imitate the weak supervisor, including its errors,
or will the strong model generalize beyond the underlying task or concept?
It turns out, we can often nudge deep learning's remarkable generalization properties to work in our favor.
Even when we use GPT-2, which can barely count to 10,
to supervise GPT-4, which aces high school tests,
we can often recover 80% of the performance we would get with perfect labels.
Now, this only works in some settings so far,
and naive weak supervision without methods is far from recovering full performance.
In that sense, we provide evidence that naively applying current alignment techniques like RLHF
will scale poorly to superhuman models.
But this feels like a super tractable problem.
Generalizing beyond weak supervisors is a widespread phenomenon, and we can drastically improve
generalization with simple methods.
There's tons of low-hanging fruit here.
There are exciting directions for future work.
Better methods.
Scientific understanding.
When and why do we see good generalization?
Analogous setups.
There are still important disanalogies between our setup and the future superalignment problem.
Can we fix it?
Perhaps what I find most exciting, we can make iterative empirical progress today on a core challenge of aligning future superhuman models.
Lots of prior alignment work has been stuck in theory or been empirical, but failed to confront core challenges head on.
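The "recover 80% of the performance" figure in that thread is most naturally read as a ratio: how much of the gap between the weak supervisor and a strong model trained on perfect labels the weakly supervised model manages to close. The paper refers to this as performance gap recovered. A minimal sketch of the arithmetic, using hypothetical accuracies chosen only so the result lands on 80%:

```python
def performance_gap_recovered(weak_acc, weak_to_strong_acc, ceiling_acc):
    """Fraction of the weak-to-ceiling gap closed by the weakly supervised model."""
    gap = ceiling_acc - weak_acc
    if gap <= 0:
        raise ValueError("ceiling accuracy must exceed weak supervisor accuracy")
    return (weak_to_strong_acc - weak_acc) / gap

# Hypothetical accuracies for illustration, not results from the paper.
pgr = performance_gap_recovered(
    weak_acc=0.60,            # weak supervisor alone
    weak_to_strong_acc=0.84,  # strong model trained on weak labels
    ceiling_acc=0.90,         # strong model trained on perfect labels
)
print(f"{pgr:.0%}")
```

A value of 0 would mean the strong model merely matches its weak supervisor, and a value of 1 would mean weak supervision did as well as perfect labels.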
But the question, of course, becomes how to get people to work on those problems.
Well, in addition to continuing their work with the superalignment team, OpenAI has also announced alongside Eric Schmidt,
a $10 million program they're calling Superalignment Fast Grants.
This is a $10 million pot, they say, for, quote, technical research on aligning superhuman AI systems,
including weak-to-strong generalization, interpretability, scalable oversight, and more.
Basically, this $10 million is divided into $100K to $2 million grants for academic labs,
nonprofits, and individual researchers.
For graduate students, they're sponsoring a one-year 150K OpenAI Superalignment Fellowship,
which includes 75K in stipend and 75K in compute and research funding.
And importantly, they say that no prior experience working on alignment is required.
They want this to be a pipeline in for researchers to start working on alignment.
Now, what are they looking to fund?
They say they're particularly interested in the following research directions.
One, weak-to-strong generalization, everything we've just been talking about.
Two, interpretability.
How can we understand model internals?
And can we use this to, e.g., build an AI lie detector?
Three, scalable oversight.
How can we use AI systems to assist humans in evaluating the outputs of other AI systems
on complex tasks?
And then finally, many other research directions, including but not limited to honesty,
chain-of-thought faithfulness, adversarial robustness, evals and testbeds, and more.
You can apply to get these grants until
February 18th. So how are people receiving this? The MIT Technology Review says,
unlike many of the company's announcements, this heralds no big breakthrough. In a low-key research
paper, the team describes a technique that lets a less powerful LLM supervise a more powerful one
and suggests that this might be a small step towards figuring out how humans might
supervise superhuman machines. Now, they also got commentary from a number of different AI
researchers, including Thilo Hagendorff, who said it is an interesting idea, but he told
MIT Technology Review he thinks that GPT-2 might be too dumb to be a good teacher.
Quote, GPT-2 tends to give nonsensical responses to any task that is slightly complex or requires
reasoning. Continuing, the MIT Technology Review writes, he also notes that this approach does
not address Ilya Sutskever's hypothetical scenario in which a superintelligence hides its true
behavior and pretends to be aligned when it isn't. Says Hagendorff, future superhuman models
will likely possess emergent abilities which are unknown to researchers. How can alignment
work in these cases? Still, even for the skeptics, the broad sense is that this is valuable
progress being made.
Now, OpenAI also on the same day dropped another research paper that they summed up as identifying
seven practices for keeping increasingly agentic AI systems safe and accountable as they
become more common and more capable.
Alongside this, they also announced research grants for a range of open questions in this area.
Now, I'm not going to get fully into this paper as well, but what strikes me as notable is that
this is two fairly significant research papers around questions of AI safety and AI alignment,
right at a time when people are speculating that GPT-4.5 is around the corner.
I would say only that it doesn't not make sense to me, that in advance of releasing the most
advanced model ever, you would drop all of the progress that you've made on safety and alignment
issues to try to make it feel as though those things were progressing at the same speed
that overall capabilities were. I think this is a little less tinfoil hat and a little more just
my PR and marketing brain observing something that I could see doing. I will note that after
Sam Altman called out Colin Burns for his work on this issue, when someone asked him,
GPT-4.5 leak legit or no, he said nah, to which everyone scrambled to figure out if he was denying
the pricing screenshot or that GPT-4.5 is coming this month. At the time of recording, still no answer
on that front, but the speculation and rumors live on. Overall, my favorite comment on this research
comes from Kirtik, who wrote, in my opinion, a better model to study this would be to have
government run by children age 10 to 15, with all the employees still being adults. I think there
are a fair few people out there who believe that in that circumstance, the results might be better
than what we have. But anyways, guys, that is the new superalignment research. I'm glad to see
things coming out of this section of OpenAI, and I'm looking forward to more experiments and
research to come. For now, that's going to do it for today's AI Breakdown. Until next time,
peace.
