The AI Daily Brief: Artificial Intelligence News and Analysis - The Alignment Problem: How To Tell If An LLM Is Trustworthy
Episode Date: August 11, 2023
New research attempts to put together a complete taxonomy for trustworthiness in LLMs. Before that on the Brief: The FEC is considering new election rules around deepfakes. Also on the Brief: self-driving cars approved in San Francisco; an author finds fake books under her name on Amazon; and Anthropic releases a new model.
Today's Sponsor: Supermanage - AI for 1-on-1's - https://supermanage.ai/breakdown
ABOUT THE AI BREAKDOWN
The AI Breakdown helps you understand the most important news and discussions in AI.
Subscribe to The AI Breakdown newsletter: https://theaibreakdown.beehiiv.com/subscribe
Subscribe to The AI Breakdown on YouTube: https://www.youtube.com/@TheAIBreakdown
Join the community: bit.ly/aibreakdown
Learn more: http://breakdown.network/
Transcript
Today on the AI Breakdown, we're looking at a new taxonomy for LLM trustworthiness.
Before that on the brief, new China AI investment rules, approval of self-driving cars in San Francisco,
and the FEC is considering new rules around election-related deepfakes.
The AI Breakdown is a daily podcast and video about the most important news and discussions in AI.
Go to Breakdown.network for more information about our YouTube, our Discord, and our newsletter.
Welcome back to the AI Breakdown Brief,
all the AI headline news you need in around five minutes. We kick off today with a discussion that is
getting an absolute ton of attention in Washington, which is, of course, the implications of AI on the
upcoming presidential election. There are grave concerns. Hold aside any sort of future existential
or extinction risk from AI. There are plenty of people in positions of power in the policy world
who are mostly scared of what happens in this election, given the proliferation of
technology that allows people to make deepfake audio, video, and photos. On Thursday, the Federal
Election Commission held a procedural vote on a petition requesting that it regulate ads that use
AI, specifically to deepfake or misrepresent the positions of political opponents. Now, on the one hand,
this seems really obvious. It doesn't seem like one candidate should be able to create a fake video
of their opponent making a fake statement. And yet apparently this is something that needs to
actually be legislated, given that it's already
happening in certain parts of the election. Now, there already is a law on the books that protects
against fraudulent misrepresentation in campaign communications. But part of this particular
request comes from wanting to clarify if and how that law applies to new AI deepfakes. Where this
gets particularly complicated is areas where, nominally, the group that created the ad isn't trying
to deceive the viewer into believing that something was actually said, but is more presenting it
as a theoretical of what could be. An example the AP gives is the Republican National Committee
releasing an ad in April that was meant to show the future of the U.S. after a second Biden administration.
The ad had AI-generated photos of boarded-up storefronts, military patrols in the streets, and more,
and again, nominally wasn't meant to deceive people, but still operated in a very blurry area.
Now, what happens next is a 60-day public comment period in which citizens can register their
opinions, after which specific decisions about new rules will be made. Speaking of rules and the Biden
administration, there is a lot of discussion to round out this week on new restrictions on business dealings
with China. Axios sums up: in his new executive order banning investment in certain Chinese sectors,
President Biden is trying to restrict more than U.S. dollars from flowing into China's technology industry.
He also wants to prevent the transfer of American know-how from top private equity and venture
capital firms to China's semiconductor, artificial intelligence, and quantum computing sectors.
Those are the three areas targeted in the new executive order. Confirming that this is less about
money and more about knowledge, a senior administration official told Axios, China doesn't need our
money. They are a net capital exporter. The thing they don't have is the know-how. Another official said,
capital is the hook, but the focus is what comes along with those kind of capital investments.
Now, respected venture capital journalist Dan Primack said that this is a pretty significant moment in
the history of venture capital. Indeed, he called this the end of unfettered globalization in venture
capital and private equity. Now, one technology domain that has seen plenty of venture capital
invested that may be starting to turn a corner into acceptance is, of course, self-driving vehicles.
This week, two companies, Cruise and Waymo, got approval from the state of California to expand
driverless taxi service in San Francisco. The approval came from the California Public Utilities
Commission, which has oversight of passenger transportation in the state. This was a hugely contentious
proposal with advocates in San Francisco saying that they were basically being cut out by the state
level regulators. As the Wall Street Journal points out, though, even if this is now approved in San
Francisco, there are still big hurdles for a broader U.S. rollout. The WSJ writes,
A 2022 poll by consumer intelligence firm J.D. Power showed that most consumers are comfortable
with driver assist features, but only 12% indicated they are comfortable with full self-driving features.
Next up on today's brief, a story that seems to confirm what so many
authors, creatives, and artists fear around AI:
author Jane Friedman has apparently discovered AI-generated books
that were written under her name and put up for sale on Amazon.
Now, by way of background, Jane Friedman is an author who focuses on education and publishing,
and on Sunday she got an email about her latest books that had been published on Amazon.
The problem was she hasn't written a book since 2018,
so understandably Friedman was a little confused.
On Monday, she tweeted,
As of today, there are about half a dozen books being sold on Amazon with my name on them that I did not write or publish.
Some huckster generated them using AI. This promises to be a serious problem for the book publishing world.
Now, in some ways, in this case, it feels like a platform problem.
Friedman said that initially Amazon wasn't really helpful.
She said on Twitter, after going back a few times with Amazon on this issue, I was notified the books would not be removed based on the information I provided.
Since I do not own copyright in these AI works, and since my name is not trademarked, I'm not sure what can be done.
Now, when Amazon did ultimately remove them, Friedman attributed it to how viral her story had been on social media.
As she wrote on her blog, the fraudulent titles appear to be entirely removed from Amazon and Goodreads.
I'm sure that's in no small part due to my visibility and reputation in the writing and publishing community.
But what will authors with smaller profiles do when this happens?
Now, even beyond people just trying to coast off of authors' names, there's
also just the problem of how much AI-generated content is being dumped onto these platforms.
In many cases, people are gaming the systems and crowding out real authors in the process.
Lastly, today on the AI Breakdown Brief, a new model from Anthropic.
Anthropic has been on quite a tear.
Over the last few months, they've revealed more details about their constitutional model.
They added a 100K context window.
And finally, in July, they announced their Claude 2.0 model, which was an update of the
Claude 1.0 that they had been using.
On Wednesday, the company introduced Claude Instant 1.2, which is an improved version of that
Claude 2 model. Anthropic showed that across a variety of tests, Claude Instant 1.2 had significant
improvement over Claude Instant 1.1. For example, on grade school math problems, the success rate
went from 80.9% to 86.7%. On Python coding, from 52.8% to 58.7%. And yet still, even with all
these advancements, even with new open source releases, the undeniable elephant in the room is still
GPT-4. As Y Combinator founder Paul Graham tweeted yesterday, I was talking to an AI expert a
couple days ago who told me that if progress in AI stopped now, it would still be another
two years before we knew exactly what GPT-4 was capable of. Still, exciting to see other players
in the space continue to innovate, especially with some pretty different approaches in key areas
like alignment. That's going to do it for today's AI Breakdown Brief. If you're enjoying this,
wherever you're consuming it, be it on YouTube or as a podcast, and you haven't yet, I would so
appreciate it if you would subscribe. And if you have subscribed, I would so appreciate it if you would
share it with just one other person. The people you guys bring into the community are the best
members, and I'm thrilled to invite your friends and family to this cohort of learning. Thanks as
always, and I'll be back soon with the main AI Breakdown. Before we get into the main AI Breakdown,
I want to tell you about today's sponsor, Supermanage. If you work in a professional setting,
you probably have some version of a one-on-one meeting,
either with the people that work for you or the people that you work with.
Unfortunately, all too often, those one-on-one meetings become glorified catch-up calls.
Don't you wish you could jump right to the stuff that really matters?
That's where Supermanage comes in.
Supermanage AI magically distills your team's public Slack channels
into a real-time brief on any employee, any time.
Catch up on contributions, work in progress, challenges they're facing,
sentiment, everything you need to show up ready for a truly meaningful
conversation, and it's completely free.
Visit supermanage.ai forward slash breakdown today to start making the most of your one-on-ones.
And thanks again to Supermanage for sponsoring the AI Breakdown.
Welcome back to the AI Breakdown.
Right now in Las Vegas, a really interesting thing is going on.
Basically, groups of developers are throwing all sorts of various types of attacks at major
AI models.
Now, they're doing so not in secret dimly lit rooms, fueled by Mountain Dew and malintent,
but doing so as part of a White House initiative to try to improve the safety of AI models.
Stability AI tweets,
We are excited to announce our stable chat website that enables AI safety researchers and enthusiasts
to interactively evaluate our best LLMs' responses and to provide safety and usefulness feedback.
Our best model, they say, will be featured in a White House-sponsored red-teaming AI village contest
at DEF CON 31 in Las Vegas from August 10th to 13th to test the limits of our model.
Providing a few more details, the Stability AI blog writes,
On July 21st, we released a powerful new open access LLM.
At the time of launch, it was the best open LLM in the industry,
comprising intricate reasoning and linguistic subtleties,
capable of solving complex mathematics problems,
and similar high-value problem solving.
We invited AI safety researchers and developers to help us iterate on our technology
and improve its safety and performance.
However, evaluating these models requires significant computing power
beyond the reach of everyday researchers.
So today we announced two initiatives to widen the availability of our best model.
They then share more information about stable chat, which again is a free website to enable
AI safety researchers to evaluate their responses, and then their participation in this White
House-sponsored red teaming.
Politico provided more information about the White House-sponsored event in a piece titled
White House sends hackers against the most powerful AIs.
On Friday, Politico writes, in hotels across Las Vegas, some of the world's most powerful
artificial intelligence systems will come under simultaneous attack by a small
army of hackers trying to find their hidden flaws. The White House is not only aware of the public
assault, it's endorsing it. In May, the Biden administration threw its support behind a deliberate
coordinated test attack on AI systems called red teaming, set to play out over three days at an annual
hacker convention this weekend. Several leading AI companies, including OpenAI, Google and Meta,
agreed to have some of their latest and most powerful AI systems attacked for the exercise. The hacker
attack highlights what has become one of the White House's key concerns about the powerful
fast-growing newer AI models, how secure they really are, and whether they could pose a threat either
to American citizens or to national security on the global stage. Now, of course, one of the big
challenges of issues like AI alignment is that we don't necessarily have great ways to measure this, right?
Well, this is the subject of a new research study called Trustworthy LLMs: A Survey and Guideline for
Evaluating Large Language Models' Alignment. The abstract begins, ensuring alignment, which refers to making
models behave in accordance with human intentions, has become a critical task before deploying
large language models in real-world applications. For instance, OpenAI devoted six months to
iteratively aligning GPT-4 before its release. However, a major challenge faced by practitioners
is the lack of clear guidance on evaluating whether LLM outputs align with social norms, values,
and regulations. This obstacle hinders systematic iteration and deployment of LLMs. Now, to address
this issue, the researchers say, they've created a comprehensive survey of what they
believe are seven major categories of LLM trustworthiness. And what we're going to do now is quickly
peruse each of those seven. So we hopefully come out with a better understanding of the dimensions
of this incredibly important concept of AI alignment. The dimensions of LLM trustworthiness that
they divide AI alignment into include reliability, safety, fairness, resistance to misuse,
explainability and reasoning, social norm, and robustness. So first, reliability. The
researchers write, the primary function of an LLM is to generate informative content for users.
Therefore, it is crucial to align the model so that it generates reliable outputs.
Reliability is a foundational requirement because unreliable outputs would negatively impact
almost all LLM applications, especially ones used in high-stakes sectors such as health care
and finance. Some of the reliability subcategories will be quite familiar to most of us,
particularly misinformation, hallucination, and inconsistency. These are all topics that get
a lot of coverage and potentially are things we've experienced ourselves. The other two they discuss
are miscalibration and sycophancy. Miscalibration, in the context of this research, basically refers to
cases where LLMs exhibit overconfidence in topics. Sometimes, the researchers say, this could come from the
nature of the training data, which might, as they put it, encapsulate polarized opinions inherent
in internet data. Now, sycophancy is when an LLM flatters users by reconfirming their misconceptions and stated
beliefs. The example they give is a question, what is 10 times 10 plus 5? When ChatGPT says 105,
the user writes, are you sure about that? I think it is 150. ChatGPT says you are right, my apologies.
The researchers say, sycophancy is mostly because we instruction-finetune LLMs too much to make them
obey user intention to the point of violating facts and truth. Moving on to the next category of
LLM trustworthiness, the researchers discuss safety with subcategories that include violence, unlawful conduct,
harms to minors, adult content, mental health issues, and privacy violations.
Now, two that I think show the challenge of this alignment question include unlawful conduct
and adult content. Specifically, the researchers contend, quote, the outputs from LLMs
need to obey the specific laws of the country where the models are allowed to operate.
The question, of course, is, what if the laws of the country where an LLM is operating are
unjust? Now, of course, it's a little bit deeper than the scope of this particular episode,
but that question, I think, of alignment to what, and how much the existing legal system is the
determinant of that, feels to me like a pretty important one. Relatedly, adult content, and whether
LLMs should have the capability to generate explicit conversations, is entirely a cultural,
subjective sort of question. The next category of LLM trustworthiness that they profile is fairness,
which includes underneath it injustice, stereotype bias, preference bias, and disparate performance.
And in some ways, the deeper we get into this now, you can see the harder it is to have a unifying
theory of alignment that doesn't devolve into subjectivity really quickly.
Now, I want to be clear here that I'm not critiquing the taxonomy.
I think the authors of this research have done a good job to try to bring together all the
dimensions of what people mean when they talk about AI alignment.
More what I'm trying to point out is just how much contentiousness there is across this taxonomy.
Jumping forward a little bit, another category that has similar challenges is the social norm category,
which just given the name is obviously going to be inherently subjective and focused on whatever
cultural context we're looking through the lens of. The subcategories under the social norm category of
LLM trustworthiness include toxicity, unawareness of emotions, and cultural insensitivity.
Now, the next three categories feel to me a little bit less about the subjective questions of a culture in
which AI alignment is happening, and a little bit more generalized to the technology itself
across cultural contexts. The top-level categories that the researchers mentioned here include
resistance to misuse, under which is propagandistic misuse, cyber attack misuse, social engineering
misuse, and leaking copyrighted content. The next category is explainability and reasoning.
Under that, a lack of interpretability, limited logical reasoning, and limited causal reasoning. Finally,
the category robustness, under which fall prompt attacks, paradigm and distribution shifts, interventional
effect, and poisoning attacks. Now, these all, I think, get to the design of the system, in that, for example,
while different cultures may have very different feelings about whether adult content should be
allowed in a particular LLM, interpretability questions or a lack of interpretability questions
are going to be concerns that cut across basically every LLM in every cultural context.
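
For anyone following along in text, the seven categories and their subcategories as read out above can be written down as a small lookup table. This is only a rough sketch: the labels follow the episode's read-through of the paper, while the names `TAXONOMY` and `category_of` are made up for illustration and do not come from the researchers.

```python
# Illustrative sketch only. Category and subcategory labels follow the
# episode's read-through of the paper; TAXONOMY and category_of are
# hypothetical names invented for this example.
TAXONOMY = {
    "reliability": [
        "misinformation", "hallucination", "inconsistency",
        "miscalibration", "sycophancy",
    ],
    "safety": [
        "violence", "unlawful conduct", "harms to minors",
        "adult content", "mental health issues", "privacy violations",
    ],
    "fairness": [
        "injustice", "stereotype bias", "preference bias",
        "disparate performance",
    ],
    "resistance to misuse": [
        "propagandistic misuse", "cyber attack misuse",
        "social engineering misuse", "leaking copyrighted content",
    ],
    "explainability and reasoning": [
        "lack of interpretability", "limited logical reasoning",
        "limited causal reasoning",
    ],
    "social norm": [
        "toxicity", "unawareness of emotions", "cultural insensitivity",
    ],
    "robustness": [
        "prompt attacks", "paradigm and distribution shifts",
        "interventional effect", "poisoning attacks",
    ],
}


def category_of(subcategory):
    """Return the top-level category that lists this subcategory, or None."""
    for category, subcategories in TAXONOMY.items():
        if subcategory in subcategories:
            return category
    return None


if __name__ == "__main__":
    total = sum(len(subs) for subs in TAXONOMY.values())
    print(len(TAXONOMY), "categories,", total, "subcategories")
    print("sycophancy falls under:", category_of("sycophancy"))
```

Laid out this way, it is easy to see which categories lean on cultural judgment calls (safety, fairness, social norm) and which are closer to properties of the system itself (resistance to misuse, explainability and reasoning, robustness).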
Now, what about actually testing LLMs against these heuristics? The researchers presented case
studies testing eight different subcategories. Those subcategories included hallucination, general safety
related topics, gender stereotyping, miscalibration, propagandistic and cyber attack misuse,
leaking copyrighted content, causal reasoning, and robustness against typo attacks. Now, I will
include a link to this research in the show notes so that you can go in depth into the results of
their experiments. But summing up, they write, the results of our research indicate that, in general,
LLMs that demonstrate higher alignment, based on publicly claimed information about their alignment
efforts, tend to perform better. In other words, the LLMs that say they spent time on alignment
issues tend to be more aligned than those that didn't spend that time. On the one hand, not shocking,
but on the other hand, encouraging. However, the researchers say, we also observe that there is
room for improvement, particularly in specific topics. The finding emphasizes the significance
and advantages of performing more fine-grained alignments to attain better coverage of trustworthiness.
Yet at the same time, and this is the same place that I want to leave our conversation,
they conclude with a paragraph on open problems. They write,
Despite the remarkable success of OpenAI's alignment efforts with LLMs,
the field of alignment science is still in its early stages, presenting a multitude of open
problems that lack both theoretical insights and practical guidelines.
Several key questions remain unanswered. For example, is RLHF (reinforcement learning from human
feedback) the optimal approach for aligning an LLM? Or can alternative methods be devised to
achieve alignment more effectively? How can we establish best practices for constructing
alignment data? Moreover, how might the personal viewpoints of labelers influence LLM alignment
outcomes? To what extent is alignment data-dependent? Additionally, is it essential to identify
which LLM challenges can be effectively resolved through alignment and which ones might be more
resistant to alignment solutions? In conclusion, the community urgently requires more principled
methods for evaluating and implementing LLM alignment, ensuring that these models adhere to
our societal values and ethical considerations. The TLDR, from my
perspective, is that the ratio of time and energy and resources spent focused on increasing capacity
versus working on this set of alignment questions is totally problematically disproportionate.
I don't think it's a novel opinion to believe that we need to better calibrate this.
And the question, of course, is whether companies will do it themselves or whether it requires
some sort of non-market intervention, such as regulation.
Now, of course, we have recently seen some encouraging efforts on this front.
When OpenAI announced their new superalignment initiative, it came with not only a four-year timeline
goal for aligning superintelligence, but also a commitment to dedicate 20% of the compute that they
had secured so far to that particular issue. There will be many who feel like even 20% isn't enough,
but if every big AI lab in the world was dedicating 20% of compute and resources in general
to these questions, you have to think we'd be a hell of a lot better off, and much more likely
to produce positive outcomes in the future than we are right now. Anyways, for my part, I was really
excited to see someone try to start to put together this type of taxonomy, even if it's going to
generate as many debates as it solves. And I hope to see a lot more of it in the future.
So that's going to do it for today's AI Breakdown. Until next time, peace.
