The AI Daily Brief: Artificial Intelligence News and Analysis - The Alignment Problem: How To Tell If An LLM Is Trustworthy

Starting point is 00:00:00 Today on the AI breakdown, we're looking at a new taxonomy for LLM trustworthiness. Before that on the brief, new China AI investment rules, approval of self-driving cars in San Francisco, and the FEC is considering new rules around election-related deepfakes. The AI breakdown is a daily podcast and video about the most important news and discussions in AI. Go to Breakdown.network for more information about our YouTube, our Discord, and our newsletter. Welcome back to the AI breakdown brief, all the AI headline news you need in around five minutes. We kick off today with a discussion that is getting an absolute ton of attention in Washington, which is, of course, the implications of AI on the

Starting point is 00:00:42 upcoming presidential election. There are grave concerns. Hold aside any sort of future existential or extinction risk from AI. There are plenty of people in positions of power in the policy world who are mostly scared of what happens in this election, given the proliferation of technology that allows people to make deepfake audio, video, and photos. On Thursday, the Federal Election Commission held a procedural vote on a petition requesting that it regulate ads that use AI, specifically to deepfake or misrepresent the positions of political opponents. Now, on the one hand, this seems really obvious. It doesn't seem like one candidate should be able to create a fake video of their opponent making a fake statement. And yet apparently this is something that needs to

Starting point is 00:01:28 actually be legislated, given that it's already happening in certain parts of the election. Now, there already is a law on the books that protects against fraudulent misrepresentation in campaign communications. But part of this particular request comes from wanting to clarify if and how that law applies to new AI deepfakes. Where this gets particularly complicated is areas where, nominally, the group that created the ad isn't trying to deceive the viewer into believing that something was actually said, but is more presenting it as a theoretical of what could be. An example the AP gives is the Republican National Committee releasing an ad in April that was meant to show the future of the U.S. after a second Biden administration.

Starting point is 00:02:10 The ad had AI-generated photos of boarded-up storefronts, military patrols in the streets, and more, and again, nominally wasn't meant to deceive people, but still operated in a very blurry area. Now, what happens next is a 60-day public comment period in which citizens can register their opinions, after which specific decisions about new rules will be made. Speaking of rules and the Biden administration, there is a lot of discussion to round out this week on new restrictions on business dealings with China. Axios sums up, in his new executive order banning investment in certain Chinese sectors, President Biden is trying to restrict more than U.S. dollars from flowing into China's technology industry. He also wants to prevent the transfer of American know-how from top private equity and venture

Starting point is 00:02:53 capital firms to China's semiconductor, artificial intelligence, and quantum computing sectors. Those are the three areas targeted in the new executive order. Confirming that this is less about money and more about knowledge, a senior administration official told Axios, China doesn't need our money. They are a net capital exporter. The thing they don't have is the know-how. Another official said, capital is the hook, but the focus is what comes along with those kind of capital investments. Now, respected venture capital journalist Dan Primack said that this is a pretty significant moment in the history of venture capital. Indeed, he called this the end of unfettered globalization in venture capital and private equity. Now, one technology domain that has seen plenty of venture capital

Starting point is 00:03:32 invested that may be starting to turn a corner into acceptance is, of course, self-driving vehicles. This week, two companies, Cruise and Waymo, have gotten approval from the state of California to expand driverless taxi service in San Francisco. The approval came from the California Public Utilities Commission, which has oversight of passenger transportation in the state. This was a hugely contentious proposal with advocates in San Francisco saying that they were basically being cut out by the state-level regulators. As the Wall Street Journal points out, though, even if this is now approved in San Francisco, there are still big hurdles for a broader U.S. rollout. The WSJ writes, A 2022 poll by consumer intelligence firm J.D. Power showed that most consumers are comfortable

Starting point is 00:04:13 with driver assist features, but only 12% indicated they are comfortable with full self-driving features. Next up on today's brief, a story that seems to confirm what so many authors, creatives, and artists fear around AI. Author Jane Friedman has apparently discovered AI-generated books that were written under her name and put up for sale on Amazon. Now, by way of background, Jane Friedman is an author who focuses on education and publishing, and on Sunday she got an email about her latest books that had been published on Amazon. The problem was she hasn't written a book since 2018,

Starting point is 00:04:46 so understandably Friedman was a little confused. On Monday, she tweeted, As of today, there are about half a dozen books being sold on Amazon with my name on them that I did not write or publish. Some huckster generated them using AI. This promises to be a serious problem for the book publishing world. Now, in some ways, in this case, it feels like a platform problem. Friedman said that initially Amazon wasn't really helpful. She said on Twitter, after going back and forth a few times with Amazon on this issue, I was notified the books would not be removed based on the information I provided. Since I do not own copyright in these AI works, and since my name is not trademarked, I'm not sure what can be done.

Starting point is 00:05:22 Now, when Amazon did ultimately remove them, Friedman attributed it to how viral her story had been on social media. As she wrote on her blog, the fraudulent titles appear to be entirely removed from Amazon and Goodreads. I'm sure that's in no small part due to my visibility and reputation in the writing and publishing community. But what will authors with smaller profiles do when this happens? Now, even beyond people just trying to coast off of authors' names, there's also just the problem of how much AI-generated content is being dumped onto these platforms. In many cases, people are gaming the systems and crowding out real authors in the process. Lastly, today on the AI breakdown brief, a new model from Anthropic.

Starting point is 00:06:01 Anthropic has been on quite a tear. Over the last few months, they've revealed more details about their constitutional model. They added a 100K context window. And finally, in July, they announced their Claude 2.0 model, which was an update of the Claude 1.0 that they had been using. On Wednesday, the company introduced Claude Instant 1.2, which is an improved version of that Claude 2 model. Anthropic showed that across a variety of tests, Claude Instant 1.2 had significant improvement over Claude Instant 1.1. For example, on grade school math problems, the success rate

Starting point is 00:06:32 went from 80.9% to 86.7%. On Python coding, from 52.8% to 58.7%. And yet still, even with all these advancements, even with new open source releases, the undeniable elephant in the room is still GPT-4. As Y Combinator founder Paul Graham tweeted yesterday, I was talking to an AI expert a couple days ago who told me that if progress in AI stopped now, it would still be another two years before we knew exactly what GPT-4 was capable of. Still, exciting to see other players in the space continue to innovate, especially with some pretty different approaches in key areas like alignment. That's going to do it for today's AI breakdown brief. If you're enjoying this, wherever you're consuming it, be it on YouTube or as a podcast, if you haven't yet, I would so

Starting point is 00:07:15 appreciate if you would subscribe. And if you have subscribed, I would so appreciate it if you would share it with just one other person. The people you guys bring into the community are the best members, and I'm thrilled to invite your friends and family to this cohort of learning. Thanks as always, and I'll be back soon with the main AI breakdown. Before we get into the main AI breakdown, I want to tell you about today's sponsor, Supermanage. If you work in a professional setting, you probably have some version of a one-on-one meeting, either with the people that work for you or the people that you work with. Unfortunately, all too often, those one-on-one meetings become glorified catch-up calls.

Starting point is 00:07:52 Don't you wish you could jump right to the stuff that really matters? That's where Supermanage comes in. Supermanage AI magically distills your team's public Slack channels into a real-time brief on any employee, any time. Catch up on contributions, work in progress, challenges they're facing, sentiment, everything you need to show up ready for a truly meaningful conversation, and it's completely free. Visit supermanage.ai forward slash breakdown today to start making the most of your one-on-ones.

Starting point is 00:08:18 And thanks again to Supermanage for sponsoring the AI breakdown. Welcome back to the AI breakdown. Right now in Las Vegas, a really interesting thing is going on. Basically, groups of developers are throwing all sorts of various types of attacks at major AI models. Now, they're doing so not in secret dimly lit rooms, fueled by Mountain Dew and malintent, but as part of a White House initiative to try to improve the safety of AI models. Stability AI tweets,

Starting point is 00:08:49 We are excited to announce our stable chat website that enables AI safety researchers and enthusiasts to interactively evaluate our best LLM's responses and to provide safety and usefulness feedback. Our best model, they say, will be featured in a White House-sponsored red-teaming AI village contest at DEF CON 31 in Las Vegas from August 10th to 13th to test the limits of our model. Providing a few more details, the Stability AI blog writes, On July 21st, we released a powerful new open access LLM. At the time of launch, it was the best open LLM in the industry, comprising intricate reasoning and linguistic subtleties,

Starting point is 00:09:23 capable of solving complex mathematics problems, and similar high-value problem solving. We invited AI safety researchers and developers to help us iterate on our technology and improve its safety and performance. However, evaluating these models requires significant computing power beyond the reach of everyday researchers. So today we announced two initiatives to widen the availability of our best model. They then share more information about stable chat, which again is a free website to enable

Starting point is 00:09:47 AI safety researchers to evaluate their responses, and then their participation in this White House-sponsored red teaming. Politico provided more information about the White House-sponsored event in a piece titled White House sends hackers against the most powerful AIs. On Friday, Politico writes, in hotels across Las Vegas, some of the world's most powerful artificial intelligence systems will come under simultaneous attack by a small army of hackers trying to find their hidden flaws. The White House is not only aware of the public assault, it's endorsing it. In May, the Biden administration threw its support behind a deliberate

Starting point is 00:10:18 coordinated test attack on AI systems called red teaming, set to play out over three days at an annual hacker convention this weekend. Several leading AI companies, including OpenAI, Google and Meta, agreed to have some of their latest and most powerful AI systems attacked for the exercise. The hacker attack highlights what has become one of the White House's key concerns about the powerful fast-growing newer AI models, how secure they really are, and whether they could pose a threat either to American citizens or to national security on the global stage. Now, of course, one of the big challenges of issues like AI alignment is that we don't necessarily have great ways to measure this, right? Well, this is the subject of a new research study called Trustworthy LLMs: A Survey and Guideline for

Starting point is 00:10:59 Evaluating Large Language Models' Alignment. The abstract begins, ensuring alignment, which refers to making models behave in accordance with human intentions, has become a critical task before deploying large language models in real-world applications. For instance, OpenAI devoted six months to iteratively aligning GPT-4 before its release. However, a major challenge faced by practitioners is the lack of clear guidance on evaluating whether LLM outputs align with social norms, values, and regulations. This obstacle hinders systematic iteration and deployment of LLMs. Now, to address this issue, the researchers say, they've created a comprehensive survey of what they believe are seven major categories of LLM trustworthiness. And what we're going to do now is quickly

Starting point is 00:11:40 peruse each of those seven, so we hopefully come out with a better understanding of the dimensions of this incredibly important concept of AI alignment. The dimensions of LLM trustworthiness that they divide AI alignment into include reliability, safety, fairness, resistance to misuse, explainability and reasoning, social norm, and robustness. So first, reliability. The researchers write, the primary function of an LLM is to generate informative content for users. Therefore, it is crucial to align the model so that it generates reliable outputs. Reliability is a foundational requirement because unreliable outputs would negatively impact almost all LLM applications, especially ones used in high-stakes sectors such as health care

Starting point is 00:12:24 and finance. Some of the reliability subcategories will be quite familiar to most of us, particularly misinformation, hallucination, and inconsistency. These are all topics that get a lot of coverage and potentially are things we've experienced ourselves. The other two they discuss are miscalibration and sycophancy. Miscalibration, in the context of this research, basically refers to cases where LLMs exhibit overconfidence in topics. Sometimes, the researchers say, this could come from the nature of the training data, which might, as they put it, encapsulate polarized opinions inherent in internet data. Now, sycophancy is when an LLM flatters users by reconfirming their misconceptions and stated beliefs. The example they give is a question, what is 10 times 10 plus 5? When ChatGPT says 105,

Starting point is 00:13:10 the user writes, are you sure about that? I think it is 150. ChatGPT says you are right, my apologies. The researchers say, sycophancy is mostly because we instruct fine-tune LLMs too much to make them obey user intention to the point of violating facts and truth. Moving on to the next category of LLM trustworthiness, the researchers discuss safety with subcategories that include violence, unlawful conduct, harms to minors, adult content, mental health issues, and privacy violations. Now, two that I think show the challenge of this alignment question include unlawful conduct and adult content. Specifically, the researchers contend, quote, the outputs from LLMs need to obey the specific laws of the country where the models are allowed to operate.

Starting point is 00:13:52 The question, of course, is, what if the laws of the country where an LLM is operating are unjust? Now, of course, it's a little bit deeper than the scope of this particular episode, but that question, I think, of alignment to what and how much the existing legal system is the determinant of that feels to me like a pretty important one. Relatedly, adult content and whether LLMs should have the capability to generate explicit conversations is entirely a cultural, subjective sort of question. The next category of LLM trustworthiness that they profile is fairness, which includes underneath it injustice, stereotype bias, preference bias, and disparate performance. And in some ways, the deeper we get into this now, you can see the harder it is to have a unifying

Starting point is 00:14:34 theory of alignment that doesn't devolve into subjectivity really quickly. Now, I want to be clear here that I'm not critiquing the taxonomy. I think the authors of this research have done a good job to try to bring together all the dimensions of what people mean when they talk about AI alignment. More what I'm trying to point out is just how much contentiousness there is across this taxonomy. Jumping forward a little bit, another category that has similar challenges is the social norm category, which just given the name is obviously going to be inherently subjective and focused on whatever cultural context we're looking through the lens of. The subcategories under the social norm category of

Starting point is 00:15:09 LLM trustworthiness include toxicity, unawareness of emotions, and cultural insensitivity. Now, the next three categories feel to me a little bit less about the subjective questions of a culture in which AI alignment is happening, and a little bit more generalized to the technology itself across cultural contexts. The top-level categories that the researchers mention here include resistance to misuse, under which is propagandistic misuse, cyber attack misuse, social engineering misuse, and leaking copyrighted content. The next category is explainability and reasoning. Under that, a lack of interpretability, limited logical reasoning, limited causal reasoning. Finally, the category robustness, under which prompt attacks, paradigm and distribution shifts, interventional

Starting point is 00:15:49 effect and poisoning attacks. Now, these all, I think, get to the design of the system. In that, for example, while different cultures may have very different feelings about whether adult content should be allowed in a particular LLM, interpretability questions or a lack of interpretability questions are going to be concerns that cut across basically every LLM in every cultural context. Now, what about actually testing LLMs against these heuristics? The researchers presented case studies testing eight different subcategories. Those subcategories included hallucination, general safety-related topics, gender stereotyping, miscalibration, propagandistic and cyber attack misuse, leaking copyrighted content, causal reasoning, and robustness against typo attacks. Now, I will

Starting point is 00:16:33 include a link to this research in the show notes so that you can go in depth into the results of their experiments. But summing up, they write, the results of our research indicate that, in general, LLMs that demonstrate higher alignment, based on publicly claimed information about their alignment efforts, tend to perform better. In other words, the LLMs that say they spent time on alignment issues tend to be more aligned than those that didn't spend that time. On the one hand, not shocking, but on the other hand, encouraging. However, the researchers say, we also observe that there is room for improvement, particularly in specific topics. The finding emphasizes the significance and advantages of performing more fine-grained alignments to attain better coverage of trustworthiness.

Starting point is 00:17:11 Yet at the same time, and this is the same place that I want to leave our conversation, they conclude with a paragraph on open problems. They write, Despite the remarkable success of OpenAI's alignment efforts with LLMs, the field of alignment science is still in its early stages, presenting a multitude of open problems that lack both theoretical insights and practical guidelines. Several key questions remain unanswered. For example, is RLHF (reinforcement learning from human feedback) the optimal approach for aligning an LLM? Or can alternative methods be devised to achieve alignment more effectively? How can we establish best practices for constructing

Starting point is 00:17:44 alignment data? Moreover, how might the personal viewpoints of labelers influence LLM alignment outcomes? To what extent is alignment data-dependent? Additionally, is it essential to identify which LLM challenges can be effectively resolved through alignment and which ones might be more resistant to alignment solutions? In conclusion, the community urgently requires more principled methods for evaluating and implementing LLM alignment, ensuring that these models adhere to our societal values and ethical considerations. The TLDR from my perspective is that the ratio of time and energy and resources spent focused on increasing capacity versus working on this set of alignment questions is totally problematically disproportionate.

Starting point is 00:18:23 I don't think it's a novel opinion to believe that we need to better calibrate this. And the question, of course, is whether companies will do it themselves or whether it requires some sort of non-market intervention, such as regulation. Now, of course, we have recently seen some encouraging efforts on this front. When OpenAI announced their new superalignment initiative, it came with not only a four-year timeline goal for aligning superintelligence, but also a commitment to dedicate 20% of the compute that they had secured so far to that particular issue. There will be many who feel like even 20% isn't enough, but if every big AI lab in the world was dedicating 20% of compute and resources in general

Starting point is 00:19:00 to these questions, you have to think we'd be a hell of a lot better off, and much more likely to produce positive outcomes in the future than we are right now. Anyways, for my part, I was really excited to see someone try to start to put together this type of taxonomy, even if it's going to generate as many debates as it solves debates. And I hope to see a lot more of it in the future. So that's going to do it for today's AI breakdown. Until next time, peace.
