The AI Daily Brief: Artificial Intelligence News and Analysis - GPTBot AI Data Controversy and the Remaining Challenges of LLMs

Episode Date: August 8, 2023

NLW reviews a recent research paper which provides a comprehensive overview of the state of LLM development and organizes remaining challenges and problems to be solved -- along with prospective solutions. Before that on the Brief: OpenAI launches GPTBot to crawl the web for AI training purposes; Zoom has to back off ToS changes that would allow them to collect data for AI training purposes. Plus a terrifying new cyberattack that can tell what you typed just by listening to your keystrokes. Read the paper: https://arxiv.org/abs/2307.10169 Today's Sponsor: Giskard - the testing framework for ML models - https://www.giskard.ai/ ABOUT THE AI BREAKDOWN The AI Breakdown helps you understand the most important news and discussions in AI. Subscribe to The AI Breakdown newsletter: https://theaibreakdown.beehiiv.com/subscribe Subscribe to The AI Breakdown on YouTube: https://www.youtube.com/@TheAIBreakdown Join the community: bit.ly/aibreakdown Learn more: http://breakdown.network/

Transcript
Starting point is 00:00:00 Today on the AI Breakdown, we're looking at all of the challenges that remain when it comes to LLMs. Before that on the Brief, Zoom's new terms of service and ChatGPT's GPTBot bring up questions of AI data training. The AI Breakdown is a daily podcast and video about the most important news and discussions in AI. Go to breakdown.network for more information about our Discord, our YouTube, and our newsletter. Welcome back to the AI Breakdown Brief, all the AI headline news you need in around five minutes. Today, we are starting with a conversation around data privacy and AI, and there are actually two different and interesting contexts for having this conversation today. The first is a big dust-up that started over the weekend around Zoom's terms of service.
Starting point is 00:00:46 Over the weekend, people started to notice that there was a big update in Section 10.2. Andrew Cote tweets, Zoom updates its terms of service to become the NSA 2.0. I'm in disbelief at this update because of how far sweeping it is, yet here we are. Direct quotes from Section 10.2. You consent to Zoom's access, use, collection, creation, modification, distribution, processing, sharing, maintenance, storage of, parentheses, the stuff you say in meetings for any purpose. You hereby unconditionally and irrevocably assign to Zoom and your end users to unconditionally and irrevocably assign to Zoom
Starting point is 00:01:20 all right, title, and interest to, including all proprietary rights. Does this strike anyone else as far-reaching, utterly insane, and totally unethical? There were lots of other people who reacted as well. Aric Toler tweets, canceling the Bellingcat Pro Zoom account and will migrate all our webinars and trainings to a new platform right away. Justine Bateman says, well, not using Zoom again
Starting point is 00:01:39 until these overreaching permissions are gone. Now, as it turns out, these terms actually were updated last March. A spokesperson told Vice, Zoom customers decide whether to enable generative AI features and separately whether to share customer content with Zoom for product improvement purposes.
Starting point is 00:01:54 A blog post published on Monday sought to further clarify exactly how these terms apply. Basically what they said is that if you use Zoom's generative AI features, which include things like transcription as well as chat composition, there is an opt-in setting for admins through which they can allow data from meetings to be used to, quote, improve the performance and accuracy of these AI services. The blog post states clearly, quote, we do not use audio, video, or chat content for training our models without customer consent. And what's more, they actually added a term to the document, just to be super clear. Notwithstanding the above, Zoom will not use audio, video, or chat customer
Starting point is 00:02:30 content to train our artificial intelligence models without your consent. Regardless of the changes that Zoom would ultimately make, this really struck a nerve, and it struck a nerve because it felt not like something necessarily unique to Zoom, but perhaps the type of thing that lurks in just about every weighty terms of service that we click agree to without even thinking about. LA Times tech columnist Brian Merchant writes, I am very glad to see the revolt against Zoom unfolding as a result of this TOS update, and I hope it inspires us to look harder at all the services we use that generate data for tech companies, which they might use to train their LLMs or to sell ads or to third-party data brokers. It is obviously an extremely heightened moment right now when it comes to AI and data,
Starting point is 00:03:10 and so perhaps unsurprisingly, one of the first cohorts of people to pick up on this Zoom fiasco were a number of folks who are intimately involved with the SAG-AFTRA and WGA strikes right now. Now, this wasn't the only context for a discussion about data and data privacy that happened yesterday. OpenAI announced its GPTBot. GPTBot is a web crawler that scrapes data from the entire public internet. As they write, Web pages crawled with the GPTBot user agent may potentially be used to improve future models and are filtered to remove sources that require paywall access,
Starting point is 00:03:42 are known to gather personally identifiable information, or have text that violates our policies. Allowing GPTBot to access your site can help AI models become more accurate, and improve their general capabilities and safety. So there are a few different things to parse out here. First is what the purpose of this bot is. It is designed to crawl the web to get information that can be used for training future OpenAI models.
Starting point is 00:04:03 That's the purpose and they're not trying to hide that at all. Second, OpenAI is drawing some guardrails around what information it will and won't collect. They're saying that it's filtering sites with paywall access, sites that have text that violates our policies, whatever that means. And it's also blocking out sites that are known to gather personally identifiable information. A couple things that stand out from that. One, as with any service, you have to trust that GPTBot is actually following the rules that OpenAI is publicly saying,
Starting point is 00:04:30 but second, even beyond that, there's just a lot of gray area, right? What is included, for example, in sites that have texts that violate OpenAI's policies? Which sites are known to gather personally identifiable information? Where's the list of those sites? How would one go about trying to get additional sites on that? The point here is not to rag on OpenAI, but simply to point out just how much subjectivity, trust, and simple human fallibility there is when it comes to the ingesting of this huge amount of online data. Now, the other interesting piece of this that has generated a lot of discussion is that alongside the announcement of GPTBot, OpenAI also shared how people maintaining websites can block access to it. They share a simple line of code that can be used to disallow the
Starting point is 00:05:09 GPTBot. They also even allow for customized GPTBot access, where webmasters can specify some parts of their site that are on limits to GPTBot and some parts that are off limits. On the one hand, people were enthusiastic that you can actually prevent OpenAI from scraping your website now. On the other hand, some are skeptical about why anyone would allow GPTBot to crawl their websites. AI entrepreneur Mark Tenenholtz writes, Most people don't block Google from crawling. Appearing in search results boosts traffic to your website. Unfortunately, ChatGPT does not, even if it's asked to cite sources.
Starting point is 00:05:41 I expect a lot of people to block GPTBot. And of course, the interesting question is, if OpenAI makes it so easy to block access to websites, what is the incentive for anyone to allow access? As that tweet we just read pointed out, there's no quid pro quo as there is with Google indexing, and so wouldn't it become just the default to block GPTBot rather than allow OpenAI access? Is the play simply to bet that people aren't going to take the proactive step of blocking it? It strikes me as a really interesting case study in the evolving space of AI data collection. A couple more on today's brief.
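The blocking mechanism described in this segment is a `robots.txt` rule addressed to the `GPTBot` user agent. As a minimal sketch, the snippet below uses Python's standard `urllib.robotparser` to check what a given rule set would permit. The `/blog/` and `/members/` paths and the `example.com` URLs are hypothetical placeholders, not anything from OpenAI's announcement, and `robots.txt` is a voluntary convention, so this only shows what a well-behaved crawler is asked to do.

```python
from urllib.robotparser import RobotFileParser

# Rule set that asks GPTBot to stay away from the whole site.
BLOCK_ALL = """\
User-agent: GPTBot
Disallow: /
"""

# Rule set that opens one section and closes another.
# The directory names are hypothetical, for illustration only.
PARTIAL = """\
User-agent: GPTBot
Allow: /blog/
Disallow: /members/
"""


def gptbot_can_fetch(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt text permits GPTBot to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("GPTBot", url)


if __name__ == "__main__":
    for name, rules in (("block all", BLOCK_ALL), ("partial", PARTIAL)):
        for path in ("/", "/blog/post", "/members/area"):
            url = "https://example.com" + path
            print(name, path, gptbot_can_fetch(rules, url))
```

Because the rules name only `GPTBot`, other crawlers such as search engine bots are unaffected, which is the asymmetry the tweet above is pointing at.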
Starting point is 00:06:14 Microsoft continues to push ever farther with its AI efforts, announcing that Bing Chat is now coming to mobile browsers. Bing Chat has had Android and iOS apps since late February. However, at the end of last month, they started opening up Bing Chat to Chrome and Safari desktop browsers, and now they're trying to bring it to third-party mobile browsers as well. The announcement was made in a blog post celebrating six months of the new AI-powered Bing. The company said they've seen over 1 billion chats and over 750 million images, and that bringing AI-powered Bing to third-party browsers on the web and mobile is, quote, the next step in the journey allowing Bing to showcase the incredible value of
Starting point is 00:06:49 summarized answers, image creation, and more to a broader array of people. They also said that multimodal visual search is coming. This feature, they write, leverages OpenAI models to let you input into chat with images, either a picture you've taken or one you found elsewhere, and prompt Bing Chat with related questions. Bing Chat can understand the context of an image, interpret it, and answer questions about it. For example, you can use visual search to ask Bing Chat about the architecture of a building you've taken a picture of, or take a picture of the contents of your fridge and ask for lunch ideas.
Starting point is 00:07:17 It seems everywhere you look these days, we are moving further into the world of multimodality. Lastly, today, one to scare us with just how many new types of attacks and scams we're going to have to deal with in this future. A group of British researchers have shared a paper around a homemade deep learning model that can determine what someone is typing simply by listening to their keystrokes, and doing so with 95% accuracy. The abstract reads, With recent developments in deep learning, the ubiquity of microphones and the rise in online services via personal devices,
Starting point is 00:07:49 acoustic side channel attacks present a greater threat to keyboards than ever. This paper presents a practical implementation of a state-of-the-art deep learning model in order to classify laptop keystrokes using a smartphone-integrated microphone. When trained on keystrokes recorded by a nearby phone, the classifier achieved an accuracy of 95%. When trained on keystrokes recorded using the video conferencing software
Starting point is 00:08:08 Zoom, an accuracy of 93% was achieved. Our results prove the practicality of these side-channel attacks via off-the-shelf equipment and algorithms. So basically, imagine you are on Zoom, and separately, you scrolled over to your bank's website and typed in your username and your password. With 93% accuracy, the model that these folks developed could actually figure out what your username and password was. Now, even more concerning is that the defensive tactics that are suggested by the researchers aren't particularly encouraging. Their suggestions include using randomized passwords featuring multiple cases. They suggested that if you are ever in a scenario where a recording might be made during a call,
Starting point is 00:08:46 that you should add randomly generated fake keystrokes. Lastly, they said, just use biometric logins. Gizmodo concludes, I think there's very little likelihood that most people are going to deploy fake typing noises or overhaul their entire typing style just on the offhand chance that it might throw off some sort of acoustic spy lurking nearby. Sure, biometrics are a good idea in general, though it doesn't cancel out the invasive potential that acoustic spying poses generally. I guess the best thing we can do is hope that this is mostly a hypothetical threat
Starting point is 00:09:11 and that there aren't too many lunatics out there who would actually try something like this. My friends, that is never a bet that I am interested in making. Anyways, guys, that is going to do it for today's AI Breakdown Brief. Thanks for listening or watching, and I'll be back soon with the main AI Breakdown. Hey guys, before we get to the main show, I want to tell you about today's sponsor, Giskard. Giskard is tackling one of the most important challenges in our new AI world, which is detecting hidden vulnerabilities in machine learning models. So what do I mean by hidden vulnerabilities?
Starting point is 00:09:43 Well, I mean things like performance bias, data leakages, spurious correlations, overconfidence issues, basically all the things that could negatively impact the performance of a machine learning model. Giskard is compatible with all Python frameworks including PyTorch, Hugging Face, LangChain, and more, and works for both tabular models and LLMs. It's an open source framework
Starting point is 00:10:04 that also has enterprise and hosted options, and it's quick and easy to install with just four lines of code. I really believe that this type of testing framework that Giskard offers is so important, as more and more ML models impact the applications we interact with every day. So to learn more and try it out, go to giskard.ai. That's G-I-S-K-A-R-D dot A-I. Thanks again to Giskard for sponsoring the show, and with that, let's get back to the episode. Welcome back to the AI Breakdown.
Starting point is 00:10:36 Today we're doing something a little bit different. We're going a lot more technical than we normally would, but I thought this was the perfect context. It can feel after experiencing something like ChatGPT for the first time that the mysteries of the universe have actually been unlocked. This revelatory experience is, I believe, why so many people have shifted so much attention to the generative AI space over the last six months.
Starting point is 00:11:01 However, as a recent paper points out, there are still a huge, huge array of problems that remain unsolved when it comes to building LLMs. In fact, to read this paper, one gets the impression that we might still be barely scratching the surface of how these large language models can be built and scaled for maximum impact. So what we're going to do today is a whistle-stop tour through this paper, which is called Challenges and Applications of Large Language Models, and involves contributions from researchers at University College London, the University of Cambridge, Meta AI, Stability AI, and other organizations as well.
Starting point is 00:11:38 The abstract reads, Large language models went from non-existent to ubiquitous in the machine learning discourse within a few years. Due to the fast pace of the field, it is difficult to identify the remaining challenges and already fruitful application areas. In this paper, we aim to establish a systematic set of open problems and application successes such that ML researchers can comprehend the field's current state more quickly and become productive. This was published about three weeks ago, so it is fairly up to date. The way that they organize their challenges is into three categories,
Starting point is 00:12:08 design, behavior, and science. Some challenges fit in multiple areas, such as high pre-training costs being both a design challenge as well as a science challenge. We're going to start with some of the challenges that might be the most familiar to the average listener, and then move into some of the ones that might not be as familiar. Let's start with hallucinations. The paper writes, The popularity of services like ChatGPT suggests that LLMs are increasingly used for everyday question answering. As a result, the factual accuracy of these models has become more significant. Unfortunately, LLMs often suffer from hallucinations, which contain inaccurate information that can be hard to detect due to the text's fluency.
Starting point is 00:12:43 Now, the paper points out that there are actually different types of hallucinations. They distinguish between intrinsic hallucinations, where the generated text logically contradicts the source content, and extrinsic hallucinations, where one cannot verify the output's correctness from the provided source, because the source content does not provide enough information to assess the output. So as they do with most sections, the researchers try to catalog efforts that are being made to solve these challenges. When it comes to hallucinations, one of the things that they point out is that there are efforts to better measure hallucinations, and to try to provide a framework that makes understanding improvement when it comes to hallucinations a bit more measurable. Now, the two big
Starting point is 00:13:20 categories of strategies to solve hallucinations that they point to, notably without changing the model architecture itself, are one, retrieval augmentation, or supplying the LLM with relevant sources, and two, decoding strategies. In each case, the paper points to a variety of different works in those areas. Next, let's turn to another well-known challenge, limited context length. The paper writes, Addressing everyday natural language processing tasks often necessitates an understanding of a broader context. For example, if the task at hand is discerning the sentiment in a passage from a novel or a segment of an academic paper, it is not sufficient to merely analyze a few words or sentences in isolation. The entirety of the input or context, which might encompass the whole section or even the complete document, must be considered.
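To make the retrieval-augmentation idea mentioned a moment ago concrete, here is a toy sketch: pick the stored passage that shares the most words with the question and place it in the prompt, so the model answers from a supplied source instead of from memory alone. The word-overlap scoring, the function names, and the prompt wording are all illustrative assumptions; the systems the paper surveys typically use learned retrievers rather than anything this simple.

```python
import re


def words(text: str) -> set[str]:
    """Lowercased set of alphanumeric words in text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents sharing the most words with the question."""
    query = words(question)
    ranked = sorted(documents, key=lambda d: len(query & words(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, documents: list[str], k: int = 1) -> str:
    """Assemble a prompt that puts the retrieved sources ahead of the question."""
    sources = "\n".join(f"- {doc}" for doc in retrieve(question, documents, k))
    return f"Answer using only these sources:\n{sources}\n\nQuestion: {question}"


if __name__ == "__main__":
    docs = [
        "GPTBot is a web crawler announced by OpenAI.",
        "Zoom updated section 10.2 of its terms of service.",
    ]
    print(build_prompt("What is GPTBot?", docs))
```

The resulting string would then be sent to whatever model is in use; that call is left out because it depends on the provider.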
Starting point is 00:14:03 Similarly, in a meeting transcript, the interpretation of a particular comment could pivot between sarcasm and seriousness, depending on the prior discussion in the meeting. Now, for anyone who has tried to use ChatGPT for something that involves a longer document than its context window can handle, you'll understand this particular challenge. It's one of the reasons that people got so excited earlier this year when in May Anthropic announced that Claude would be moving to a 100K context window. However, as the researchers point out, they find that, quote, while commercial closed API models often fulfill their promise, many open source models, despite claiming to perform well with longer contexts, exhibit severe performance degradation. Basically, what the researchers point out is that there's a difference between being able to technically deal with long inputs versus performing well with long inputs. The researchers then point to three efforts around context length,
Starting point is 00:14:53 including efficient attention mechanisms, length generalization, and even transformer alternatives as ways to address this issue. I think the transformer alternative section serves as a good reminder of just how much things that seem known today are actually highly dynamic. The researchers write, While transformers are the dominant paradigm in LLMs today due to their strong performance, several more efficient alternative architectures exist. Another challenge they identify that is familiar to most of us
Starting point is 00:15:19 is misaligned behavior. They write, the alignment problem refers to the challenge of ensuring that the LLM's behavior aligns with human values, objectives, and expectations, and that it does not cause unintended or undesirable harms or consequences. The researchers write that most of the existing alignment work can be characterized into either methods for detecting misaligned behavior, such as model evaluation and auditing, mechanistic interpretability or red teaming, or methods for aligning model behavior, such as pre-training with human feedback, instruction fine-tuning, or RLHF. Now, regular listeners of this show will know that this is a huge area of development.
Starting point is 00:15:53 So much so that, again, Anthropic was also able to make news when they announced a very different approach to the alignment issue with their constitutional model. Anthropic is basically trying to challenge the paradigm of human feedback models, arguing that they have some real problems, including scalability, of course, as well as requiring people to interact with disturbing outputs. Now, this is far from just a theoretical problem. Just about a week ago, the Guardian published an article called, Kenyan moderators decry toll of training of AI models.
Starting point is 00:16:22 Quote, it destroyed me completely. Employees describe the psychological trauma of reading and viewing graphic content. Plus, they point out low pay and abrupt dismissals. Now, the purpose of this video obviously isn't to get into what constitutional AI does differently, but more to point out that this is hardly a solved area, and that there is a lot of innovation in how people think about addressing this set of problems. But let's move more quickly through some of the other challenges, which are perhaps a little bit less familiar.
Starting point is 00:16:47 Outdated knowledge. As the researchers point out, factual information learned during pre-training can contain inaccuracies or become outdated with time. For instance, it might not account for changes in political leadership. However, retraining the model
Starting point is 00:16:59 with updated pre-training data is expensive, and trying to unlearn old facts and learn new ones during fine-tuning is non-trivial. The efforts that they point out trying to address this include modifying model parameters, preserving model parameters,
Starting point is 00:17:12 and retrieval augmented language modeling. Another challenge they point out is unfathomable datasets. And this is just a fancy way of saying that because these datasets are so huge, they're much bigger than the number of documents that human teams can manually quality-check. That creates a number of problems. Near duplicates in the training dataset, which degrade model performance, personally identifiable information that sneaks in and gets really hard to extract. And then there's things like pre-training domain mixtures, basically trying to get the right combination of information from different sources to produce the best results. In a world where
Starting point is 00:17:43 there's such a huge volume of documents that these models are being trained on, humans simply can't quality-check it all. Tokenizer reliance is another really fundamental challenge core to how these LLMs have been designed. Tokenization, they write, is the process of breaking a sequence of words or characters into smaller units called tokens, such that they can be fed into the model. However, as they point out, this introduces computational overhead, language dependence, low levels of human interpretability, and problematic linguistic variance. Simply put, the number of tokens needed to convey the same information can vary greatly across languages, which is something that has to be accounted for when it comes to API pricing policy so that certain languages aren't unfairly
Starting point is 00:18:24 costed higher. Another perhaps slightly more familiar problem is high pre-training costs. As the researchers write, the vast majority of the training costs go towards the pre-training process. Training a single LLM can require hundreds of thousands of compute hours, which in turn cost millions of dollars and consume energy amounts equivalent to that used by several typical U.S. families annually. Now, one of the big issues with these high pre-training costs is that it creates a situation where state-of-the-art results are, as the researchers write, essentially bought by spending massive computational resources. There are diminishing returns to how much more power can be thrown at these things, and so the risk is that it creates a set of haves and have-nots when it comes to
Starting point is 00:19:00 who has access to the most advanced models, or more specifically who has the ability to build the most advanced models. And to give you a sense of just how important this issue is, the announced CREATE AI Act, or Creating Resources for Every American To Experiment with Artificial Intelligence Act of 2023, explicitly establishes the National Artificial Intelligence Research Resource, which is a shared national research infrastructure that is designed to provide AI researchers and students from backgrounds that aren't Meta or Google the tools that they need to be able to experiment with the state of the art. Now, there are just a ton of additional problems that they identify here. High inference latency, which is a fancy way of saying that results are
Starting point is 00:19:39 slow. Prompt brittleness, which is something that anyone who has had to learn prompt engineering in order to figure out weird tweaks to get the model to perform what they want can understand. As the researchers define it, variations of the prompt syntax, often occurring in ways unintuitive to humans, can result in dramatic output changes. Another big problem is the indistinguishability between generated and human written texts. Right now, for all the talk of AI detectors, there isn't a universally agreed upon solution. The paper talks about post hoc detectors and watermarking and other strategies, but it's still a real challenge and one that has big political resonance. Now, importantly, in addition to just identifying the challenges in general, the paper
Starting point is 00:20:17 also discusses how they come to bear in specific applications. For example, when it comes to chatbots, an issue is maintaining coherence. Multi-turn interactions, they write, make chatbots easily, quote-unquote, forget earlier parts of the conversation or repeat themselves. That whole issue with limited context window isn't just a problem for people who are trying to read long financial reports. The researchers point out that the largest genomes have vastly longer DNA sequences than existing genomic LLMs' context windows can handle, constraining the types of genomes that can be successfully modeled using these approaches. Of course, there are issues of bias where the unbalanced views and opinions that show up in the training data end up skewing the LLMs towards biased human
Starting point is 00:20:57 behaviors as well. And so again, the point here in this 72-page research paper is to, firstly, remind us just how much work there still is to be done when it comes to improving large language models. We're talking about dozens and dozens of different domains, each of which has huge numbers of research teams specifically focused on those issues. But what about for us laypeople, people who aren't involved in the solving of these challenges? I believe that we are going to be increasingly asked to make policy based not just upon what exists now, but what might exist in the future. Given that, understanding exactly where LLMs' limitations and challenges lie
Starting point is 00:21:33 feels incredibly important. I think having all of this in one space available for people to have as a jumping off point is also an incredibly useful resource. I will obviously include a link in the show notes so that you can go look at this paper yourself, and I hope this gave you just a little bit of a different sense of how much work there really is in front of us for the entire space around LLMs. Thanks as always for hanging out with the AI breakdown. Until next time, peace.
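As a footnote to the tokenizer-reliance point from earlier in the episode, the sketch below illustrates how the same short phrase can need very different token counts depending on the language. It treats every UTF-8 byte as one token, which is an assumption made only to keep the example dependency-free: production models generally use learned subword vocabularies, so real counts will differ, though the imbalance across scripts tends to point the same way. The phrases are just sample text.

```python
def byte_tokens(text: str) -> list[int]:
    """Split text into byte-level 'tokens' (one token per UTF-8 byte)."""
    return list(text.encode("utf-8"))


# Roughly the same meaning ("thank you") in three scripts.
PHRASES = {
    "English": "thank you",
    "Japanese": "ありがとう",
    "Hindi": "धन्यवाद",
}


def token_counts(phrases: dict[str, str]) -> dict[str, int]:
    """Map each language label to its byte-level token count."""
    return {lang: len(byte_tokens(text)) for lang, text in phrases.items()}


if __name__ == "__main__":
    for lang, count in token_counts(PHRASES).items():
        print(f"{lang}: {count} byte-level tokens")
```

Under per-token pricing, a higher count for the same message translates directly into a higher bill, which is the fairness concern raised in the episode.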
