The AI Daily Brief: Artificial Intelligence News and Analysis - Bard vs. Bing vs. Claude vs. ChatGPT: The Right LLM For Every Task

Starting point is 00:00:00 Today on the AI Breakdown, we're looking at the state of LLM competition and asking which models are right for different tasks. The AI Breakdown is a daily podcast and video about the most important news and discussions in AI. Go to Breakdown.network for more information about our newsletter, Discord, and YouTube channel. One of the big announcements this week was that Anthropic was releasing its latest model, called Claude 2. Now, in some ways, Claude 2 was just catching up to GPT-4. They had very similar results on things like reasoning exams and the GREs, and Claude 2's coding was much improved, bringing it in line with GPT-4. But Claude 2 also offered some very different capabilities, particularly the cost and the

Starting point is 00:00:41 context window were something that made people really take notice. Google Bard also got a slew of updates, many of which served to improve its functionality in very clear day-to-day ways. So with all of that, it got me thinking about whether there is at this point a single dominant LLM or, alternatively, whether we're at a point where there are different use cases that make sense for different LLMs. It turns out I was not the only person to have this thought. Yesterday, Yam Peleg tweeted, Which model should you use? The AI Wars TL;DR. Long context tasks, Claude 2. Internet required tasks, use Bard. Hard reasoning tasks, use GPT-4. Anything with code, code

Starting point is 00:01:18 interpreter. Long essay plus internet, use Bing. And all are crazy good at this point. It is much, much closer. If you didn't try them lately, you should. You would probably be surprised by how much Bard and Claude improved. Night and day. So what we're going to do today is build off of this tweet and ask what the right LLM is for any given use case. And let's start where he started, with long context tasks. Context window refers to how many tokens, or how much data, can be fed into an LLM in one fell swoop. The longer the context window, the more context an LLM has in trying to help engage with a document or some other material. The average person has mostly been interacting with 4K and 8K context windows in GPT-3.5 and GPT-4.
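To make the idea of a context window concrete, here is a minimal sketch of checking whether a document fits in one. The words-per-token ratio is an assumed rule of thumb rather than an exact tokenizer count, and the function names and the reply reserve are hypothetical, chosen only for illustration.

```python
# Rough check of whether a document fits in a model's context window.
# WORDS_PER_TOKEN is an assumed rule of thumb, not an exact tokenizer count.
WORDS_PER_TOKEN = 0.75

# Window sizes in tokens; 4K and 8K are the sizes mentioned in the episode.
CONTEXT_WINDOWS = {"4K": 4_000, "8K": 8_000}


def estimate_tokens(text: str) -> int:
    """Estimate the token count from a whitespace word count."""
    words = len(text.split())
    return round(words / WORDS_PER_TOKEN)


def fits(text: str, window_tokens: int, reserve_for_reply: int = 500) -> bool:
    """True if the text, plus assumed room for a reply, fits in the window."""
    return estimate_tokens(text) + reserve_for_reply <= window_tokens


if __name__ == "__main__":
    document = "word " * 4_500  # hypothetical 4,500-word document
    for name, size in CONTEXT_WINDOWS.items():
        print(name, fits(document, size))
```

Under these assumptions, a 4,500-word document is estimated at about 6,000 tokens, so it would overflow a 4K window but fit in an 8K one. A real application would count tokens with the model's own tokenizer instead of a word ratio.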

Starting point is 00:01:58 And earlier this year, people started to get really excited about the move to a 32K context window for GPT-4. Certain API users had access to that longer window, and it greatly expanded the capabilities of the model, allowing it to process four to eight times as much information at once. As deepleaps.com put it at the beginning of May, one of the primary use cases for the GPT-4 32K model is the development of sophisticated Q&A chatbots for businesses. The expanded context window eliminates the need for complex embeddings and databases, enabling businesses to fit their entire dataset into the 32K prompt and use the API directly. The streamlined process could revolutionize chatbot functionality, making them more efficient

Starting point is 00:02:36 and versatile across industries. And yet, even as people were waiting for that 32K context window, Anthropic swooped in and blew that out of the water with a 100K context window for their Claude model. On May 11th, Anthropic announced, We've expanded Claude's context window from 9K to 100K tokens, corresponding to around 75,000 words. This means businesses can now submit hundreds of pages of material for Claude to digest and analyze, and conversations with Claude can go on for hours or even days. Now, as examples, they point to the fact that The Great Gatsby is about that long, but they also say, beyond just reading long texts,

Starting point is 00:03:11 Claude can help retrieve information from the documents that help your businesses run. You can drop multiple documents or even a book into the prompt and then ask Claude questions that require synthesis of knowledge across many parts of the text. Then again, that was with the earlier Claude model, which was significantly underpowered compared to GPT-4. However, with the launch of Claude 2, that has changed, and there's now more parity among the models, meaning that Anthropic's Claude 2 really does serve a hugely valuable purpose because of that longer context window. Bilawal Sidhu writes,

Starting point is 00:03:39 The 100K token context with improved reasoning is quite the combo. Uploaded hundreds of pages without breaking a sweat. A few things I tried. Drop a CSV from your course waitlist form and immediately analyze it. Drop a two-hour Zoom transcript and summarize the key points in a tweet thread format. Provide your planned teaching curriculum and refine it with student features. So of course, you see that the common thread here is that these are tasks that require the model to have the context of that bigger amount of information going in. More generally, Professor Ethan Mollick points out that Claude 2 is just very good at summarizing

Starting point is 00:04:09 documents. Now, that said, given that we are talking about what different LLMs are useful for and what they're not, there has been a significant sense that even with this new Claude 2 model, there are many hallucinations. Mollick again says, On the downside, don't use Claude for data. It hallucinates answers. Chris Kretz said something similar: Claude hallucinates a lot. But hey, at least it's friendly. Okay, so next up among Yam's contentions, we have internet required tasks, which he suggests using

Starting point is 00:04:36 Bard for. So at this point, most of these LLMs are connected to the internet. With ChatGPT, you have Browse with Bing, which at this point is rolled out for all users, not just paying users. So why might Bard be a better choice? Well, on the one hand, Bard is just natively on the internet. It's not set up in the same way that ChatGPT is, where the native version of it was trained on data that has a cutoff point. Instead, its whole purpose is to sit on top of the internet in the same way that Google search does. But even beyond that, a new set of updates also increases its viability for those use cases. First of all, the new rollouts make it available in Europe and Brazil, not just the US. Second, it's now available in something like 40 languages.

Starting point is 00:05:14 Third, they just added a number of new utility features, things like saving searches, sharing searches with friends, and pinning searches, all of which individually are very small but add up to a higher functionality product. But more than that, with this new update, Bard is officially multimodal. What that means is that an image can now be used to prompt the system. Kyrthana, a researcher at DeepMind, posted an image of a pug with a graduation hat and typed, What is happening in this image? Bard says the image shows a pug dog wearing a graduation cap on a leash.

Starting point is 00:05:45 The image is likely a celebration of the dog's graduation from obedience school or a service dog training program. Ethan Mollick again says Google Bard is surprisingly good at working with images. It appears to be combining a reverse image search with multimodal capability, i.e. the ability of the AI to see something. Now, importantly, this isn't just for novelty, like asking about a pug in a graduation cap. Joel Dean writes, Wow, Bard just converted a screenshot to code. This is so next level. Looking forward to these multimodal capabilities in ChatGPT. The prompt that Joel had used was, Are you able to convert this screen to Jetpack Compose? and then shared a screenshot from which

Starting point is 00:06:21 Bard was able to push out code, although Joel doesn't say how accurate that code was. Now, it's entirely possible that within the next six months, this sort of multimodality is total table stakes. However, as of right now, OpenAI has indicated that they've had to put broader multimodal rollouts on hold because of their lack of access to GPUs. It's one of the areas where the GPU shortage is showing up most profoundly. So for now, I would say that in addition to just using Bard for internet-required tasks, Bard is also the standout option for multimodal tasks that involve images. Now, Yam's next contention is that for harder reasoning tasks, you should use GPT-4. And on the one hand, I would say that this is broadly consensus, that people believe by and large

Starting point is 00:06:59 that GPT-4 remains ahead of all of its competitors when it comes to reasoning tasks. And on top of that, there's also some reasonable evidence. For example, when Claude 2 came out, they shared a number of comparisons. And while Claude did overtake ChatGPT in GRE writing and bar exams, the difference wasn't really statistically significant. And in terms of standard GREs, ChatGPT still won verbal, quantitative, and the medical exam. But I think the even more important part of the discussion right now, as it relates to ChatGPT and GPT-4, isn't so much GPT-4 and how far ahead it is on reasoning tasks.

Starting point is 00:07:30 Instead, what matters about ChatGPT most right now is the newly released code interpreter feature, which many are seeing as effectively GPT-4.5, even though it's not named that. Swyx from the Latent Space podcast made this point most loudly. On July 10th, he tweeted, Code interpreter equals GPT-4.5, or making GPT-4 1,000x better with one weird trick. Now, the one weird trick that he's referring to is the fact that code interpreter is not so much just a tool that can interpret code or that can look at data when you plug it into the model. Instead, it represents a fundamental addition to the model itself. In a blog post, Swyx shared a chart that he called the Road to AGI. And what he pointed out is that with each of the big leaps for GPT, from GPT-3 to

Starting point is 00:08:18 3.5, from 3.5 to GPT-4, and from GPT-4 to GPT-4 plus interpreter, there was an input of an additional aspect to the training. So with GPT-3, we got pre-training, but with GPT-3.5, we got pre-training and reinforcement learning from human feedback. Then the next additions to GPT-3.5 included plugins and user-defined functions. And then with GPT-4, we added into the mix a mixture of experts. So all of a sudden the model had not just pre-training and reinforcement learning from human feedback, but pre-training, a mixture of experts, and reinforcement learning from human feedback. From that point of view, code interpreter becomes not just, again, an application that sits on top of GPT-4, but a code sandbox, which allows GPT-4 to effectively fill in the gaps in its own model.
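The code sandbox idea can be sketched as a loop in which generated code is actually executed and any error is handed back for another attempt. This is only an illustration under stated assumptions, not how OpenAI's code interpreter is implemented: `fake_model` is a hypothetical stand-in for a real model call, and a plain subprocess is not a secure sandbox.

```python
import subprocess
import sys
from typing import Callable, Optional


def run_in_subprocess(code: str, timeout: float = 5.0) -> tuple[bool, str]:
    """Run Python source in a separate interpreter; return (ok, output_or_error).

    A subprocess is used for illustration only; it is NOT a secure sandbox.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode == 0:
        return True, result.stdout
    return False, result.stderr


def solve_with_retries(
    generate: Callable[[str, Optional[str]], str],
    task: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Ask for code, run it, and feed any error back until a run succeeds."""
    error: Optional[str] = None
    for _ in range(max_attempts):
        code = generate(task, error)
        ok, output = run_in_subprocess(code)
        if ok:
            return output
        error = output  # the next attempt gets to see what went wrong
    return None


def fake_model(task: str, error: Optional[str]) -> str:
    """Hypothetical model stub: buggy first try, fixed once it sees an error."""
    if error is None:
        return "print(1 / 0)"  # first attempt raises ZeroDivisionError
    return "print(sum(range(1, 11)))"  # corrected attempt


if __name__ == "__main__":
    print(solve_with_retries(fake_model, "add the numbers 1 to 10"))
```

The design point is the feedback edge: a model that can only write code never sees the traceback, while one attached to an execution environment can use the failure as input to its next attempt.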

Starting point is 00:09:03 Moritz Kremb expanded upon the same idea. He wrote, People haven't fully grasped the significance of the code interpreter. It's not just another plugin that does data analysis. In my opinion, it's actually GPT-4.5 masked as a plugin. Let me explain. ChatGPT was already able to produce code, but it wasn't able to run it. The code interpreter can. This small change makes a huge difference. This means that ChatGPT is no longer limited to being a passive assistant. It has now become active.

Starting point is 00:09:28 Two, iterative abilities. On top of running code, the code interpreter seems to have built-in iterative abilities. It recognizes when it's made a mistake and it corrects it by itself. It's more closely resembling an agent now. Three, different model. It also seems that the code interpreter is actually accessing a completely different model from GPT-4. Some people have reverse engineered this and are pretty sure. He actually references another tweet from Yam Peleg that says,

Starting point is 00:09:51 We highly (99%) suspect that the model is not the same model as GPT-4. The user interface accesses a completely different endpoint that also has additional parameters. Number four, Moritz points out, is multimodality. GPT-4 has multimodality built into it. This means that it understands not only text but also visuals and audio. However, this feature has not been activated for ChatGPT yet. With the code interpreter, OpenAI has made a step in the direction of enabling this feature. Because now there is a way to input anything, datasets, images, audio, into ChatGPT,

Starting point is 00:10:20 a prerequisite for multimodal functions. While the code interpreter doesn't yet understand an image, it can already take the image and manipulate it. To me, this is a fundamentally upgraded ChatGPT. Calling it the code interpreter and downplaying it as a GPT-4 plugin is not doing it justice. Now, the last piece of this puzzle that I wanted to mention is something that's very different. You can kind of tell with all of these different LLMs that I've just mentioned over the course

Starting point is 00:10:43 of this video, they're all sort of for professional or at least work-type use cases. It's research, it's coding, it's development, it's building. However, some people believe that that is not the be-all and end-all of what AIs can do. In many ways, the biggest proponent of this view is, of course, Inflection. Inflection is the company behind Pi, which stands for personal intelligence. When you go to heypi.com, the first window that comes up says, Hey there, great to meet you. I'm Pi, your personal AI. My goal is to be useful, friendly, and fun. Ask me for advice, for answers, or let's talk about whatever's on your mind. When Pi was first introduced, Mustafa Suleyman, who was also previously a founder at Google's

Starting point is 00:11:18 DeepMind, said, Many people feel that they just want to be heard, or they just want a tool that reflects back what they said to demonstrate they have actually been heard. And subsequent to that launch, what they've been doing is basically increasing the feature set to make it more interpersonal. About a week ago, Mustafa tweeted, Pi now has a voice. Call Pi and have a chat whilst taking a walk or doing the dishes. Interestingly, so far, the community hasn't really seemed to treat it like just a novelty. Last week, Robert Scoble shared a set of conversations saying,

Starting point is 00:11:44 Check out this chat I had with Pi, my new AI. This is incredible. In the conversation, you can really get a sense of how Pi is designed to be a good listener. And what's really interesting to me, and what Pi seems to do really well, is actually ask questions that move the conversation in a new direction. In other words, when we have a conversation, it's not just one person talking and another person nodding their head and saying, yeah, that's cool. It's two people actively interacting with one another such that

Starting point is 00:12:11 each changes the shape of the next thing that's going to be said. Now, to some extent, reading this still feels like you're reading an AI, but it does do that job of coming back with questions that really do end up pushing things forward. Now, as we wrap up, one thing that I think is worth noting: one company that didn't have a contender in here is, of course, Meta. However, Meta's Llama model has been absolutely integral to the explosion of open source alternatives, and it appears that they're on the verge of releasing a new Llama 2 model that will be commercially available. Swyx again said, This is the biggest change to the AI competitive landscape. The real threat to OpenAI isn't OpenAI but safer, but OpenAI but open. Finally, the last LLM that I'll mention is something that is a

Starting point is 00:12:51 different use case entirely, which is not an open LLM where an individual taps into the collective database, but instead personal LLMs that interact with the data that a specific person or company has given them access to. There are tons of examples of this. This is a very hot development area right now, but one that's been making some waves recently is called Quivr. Friend of the show, Emmett Homm writes, An AI-powered second brain is taking over GitHub. Quivr is a customizable second brain that lets you dump in any file, text, audio, video links, and chat with it via LLM.

Starting point is 00:13:21 I've only just started to play with Quivr, entering my note files into it and some other things to see what comes out. But I think that this is a trend you're going to see a lot more of. So, friends, we will wrap there. That's a rundown of how LLMs differ and what they're good for. Let me know what you think in the comments. And as always, I appreciate you listening or watching.
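As a recap, the pairings discussed in this episode can be written down as a simple lookup table. This is a minimal sketch: the task labels and the `pick_model` helper are hypothetical names chosen for illustration, and the pairings reflect the opinions quoted in the episode at the time of recording, not a benchmark.

```python
from typing import Optional

# Task-to-model suggestions as summarized in the episode.
# The keys are illustrative labels, not an official taxonomy.
RECOMMENDATIONS = {
    "long context": "Claude 2",
    "internet required": "Bard",
    "images": "Bard",
    "hard reasoning": "GPT-4",
    "code": "ChatGPT code interpreter",
    "long essay plus internet": "Bing",
    "personal conversation": "Pi",
}


def pick_model(task_type: str) -> Optional[str]:
    """Return the suggested model for a task label, or None if unlisted."""
    return RECOMMENDATIONS.get(task_type.strip().lower())


if __name__ == "__main__":
    for task in ("Long context", "Hard reasoning", "Poetry"):
        print(task, "->", pick_model(task))
```

Returning `None` for an unlisted task, rather than a default model, keeps the table from implying a recommendation the episode never made.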

Starting point is 00:13:41 Until next time, peace.
