The AI Daily Brief: Artificial Intelligence News and Analysis - Bard vs. Bing vs. Claude vs. ChatGPT: The Right LLM For Every Task
Episode Date: July 16, 2023
LLM competition ratchets up seemingly every week. At this point, the different design choices that models have made have led to different LLMs being better or worse suited for different tasks. NLW builds off a recent viral tweet about which LLMs are good for what tasks. ABOUT THE AI BREAKDOWN The AI Breakdown helps you understand the most important news and discussions in AI. Subscribe to The AI Breakdown newsletter: https://theaibreakdown.beehiiv.com/subscribe Subscribe to The AI Breakdown on YouTube: https://www.youtube.com/@TheAIBreakdown Join the community: bit.ly/aibreakdown Learn more: http://breakdown.network/
Transcript
Today on the AI Breakdown, we're looking at the state of LLM competition and asking which models are right for different tasks.
The AI Breakdown is a daily podcast and video about the most important news and discussions in AI.
Go to Breakdown.network for more information about our newsletter, Discord, and YouTube channel.
One of the big announcements this week was that Anthropic was releasing its latest model called Claude 2.
Now, in some ways, Claude 2 was just catching up to GPT4.
They had very similar results on things like reasoning exams
and the GREs, and Claude 2's coding was much improved, bringing it in line with GPT4.
But Claude 2 also offered some very different capabilities. In particular, the cost and the
context window were things that made people really take notice.
Google Bard also got a slew of updates, many of which served to improve its functionality
in very clear day-to-day ways.
So with all of that, it got me thinking about whether there is at this point a single dominant
LLM or, alternatively, whether we're at a point where there are different use cases that
make sense for different LLMs. It turns out I was not the only person to have this thought.
Yesterday, Yam Peleg tweeted, which model should you use? The AI wars TL;DR. Long-context tasks,
use Claude 2. Internet-required tasks, use Bard. Hard reasoning tasks, use GPT4. Anything with code, code
interpreter. Long essay plus internet, use Bing. And all are crazy good at this point. It is much,
much closer. If you didn't try them lately, you should. You would probably be surprised by how much
Bard and Claude improved. Night and day. So what we're going to do today is build off of this tweet
and ask what the right LLM is for any given use case. And let's start where he started with long
context tasks. Context window refers to how many tokens or how much data can be fed into an
LLM in one fell swoop. The longer the context window, the more context an LLM has in trying to
engage with a document or some other material. The average person has mostly been
interacting with 4K and 8K context windows in GPT3.5 and GPT4.
And earlier this year, people started to get really excited about the move to a 32K context window for GPT4.
Certain API users had access to that longer window and it greatly expanded the capabilities of the model,
allowing it to process four to eight times as much information at once.
As deepleaps.com put it at the beginning of May,
one of the primary use cases for the GPT4 32K model is the development of sophisticated Q&A chatbots for businesses.
The expanded context window eliminates the need for complex embeddings and databases,
enabling businesses to fit their entire dataset into the 32K prompt and use the API directly.
The streamlined process could revolutionize chatbot functionality, making them more efficient
and versatile across industries. And yet, even as people were waiting for that 32K context window,
Anthropic swooped in and blew that out of the water with a 100K context window for their
Claude model. On May 11th, Anthropic announced,
we've expanded Claude's context window from 9K to 100K tokens, corresponding to around 75,000 words.
This means businesses can now submit hundreds of pages of material for Claude to digest and analyze,
and conversations with Claude can go on for hours or even days.
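To put rough numbers on those context windows, here is a back-of-the-envelope sketch. It is not anything published by Anthropic or OpenAI: the 0.75 words-per-token ratio is simply what Anthropic's "100K tokens, around 75,000 words" figure implies, the window sizes are the round numbers mentioned in this episode, and the helper names `estimate_tokens` and `windows_that_fit` are made up for illustration. Real tokenizers will give different counts for real text.

```python
# Ratio implied by Anthropic's statement: 100K tokens ~ 75,000 words.
# This is an approximation only; actual token counts depend on the tokenizer.
WORDS_PER_TOKEN = 75_000 / 100_000  # 0.75

# Context window sizes mentioned in this episode, as round token counts.
CONTEXT_WINDOWS = {
    "4K": 4_000,
    "8K": 8_000,
    "32K": 32_000,
    "100K": 100_000,
}


def estimate_tokens(word_count: int) -> int:
    """Estimate how many tokens a text of `word_count` words uses (hypothetical helper)."""
    return round(word_count / WORDS_PER_TOKEN)


def windows_that_fit(word_count: int) -> list[str]:
    """Return the names of the context windows large enough for the text."""
    tokens = estimate_tokens(word_count)
    return [name for name, size in CONTEXT_WINDOWS.items() if tokens <= size]


if __name__ == "__main__":
    # The "four to eight times" jump from the 8K and 4K windows to 32K.
    print(CONTEXT_WINDOWS["32K"] / CONTEXT_WINDOWS["8K"])
    print(CONTEXT_WINDOWS["32K"] / CONTEXT_WINDOWS["4K"])
    # A 20,000-word document under the rough ratio above.
    print(estimate_tokens(20_000), windows_that_fit(20_000))
```

Under these assumptions, a 20,000-word document is too big for the 4K and 8K windows but fits in 32K or 100K, which is the practical difference being described here.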
Now, as an example, they point to the fact that The Great Gatsby is about that long,
but they also say, beyond just reading long texts,
Claude can help retrieve information from the documents that help your businesses run.
You can drop multiple documents or even a book into the prompt
and then ask Claude questions that require synthesis of knowledge across many parts of the text.
Then again, that was with the original Claude model, which was significantly underpowered compared to GPT4.
However, with the launch of Claude 2, that has changed, and there's now more parity among the models,
meaning that Anthropic's Claude 2 really does serve a hugely valuable purpose because of that longer
context window.
Bilawal Sidhu writes,
The 100K token context with improved reasoning is quite the combo. Uploaded hundreds of pages without breaking a sweat.
A few things I tried.
Drop a CSV from your course waitlist form and immediately analyze it.
Drop a two-hour Zoom transcript and summarize the key points in a tweet thread format.
Provide your planned teaching curriculum and refine it with student features.
So of course, you see that the common thread here is that these are tasks that require the
ability for the model to have the context of that bigger amount of information going in.
More generally, Professor Ethan Mollick points out that Claude 2 is just very good at summarizing
documents.
Now, that said, given that we are talking about what different LLMs are useful for and what
they're not, there has been a significant sense that even with this new Claude 2 model,
there are many hallucinations.
Mollick again says, on the downside, don't use Claude for data.
It hallucinates answers.
Chris Kretz said something similar, Claude hallucinates a lot. But hey, at least it's friendly.
Okay, so next up in Yam's list, we have internet-required tasks, which he suggests using
Bard for. So at this point, most of these LLMs are connected to the internet. With
ChatGPT, you have Browse with Bing, which at this point is rolled out for all users, not just paying
users. So why might Bard be a better choice? Well, on the one hand, Bard is just natively on
the internet. It's not set up in the same way that ChatGPT is, where the native version of it was
trained on data that has a cutoff point. Instead, its whole purpose is to sit on top of the
internet in the same way that Google search does. But even beyond that, a new set of updates also
increases its viability for those use cases. First of all, the new rollouts make it available
in Europe and Brazil, not just the US. Second, it's now available in something like 40 languages.
Third, they just added a number of new utility features, things like saving searches, sharing searches
with friends, and pinning searches, all of which individually are very small but add up to a higher
functionality product.
But more than that, with this new update, Bard is officially multimodal.
What that means is that an image can now be used to prompt the system.
Kyrthana, a researcher at DeepMind, posted an image of a pug with a graduation hat and typed
what is happening in this image.
Bard says the image shows a pug dog wearing a graduation cap on a leash.
The image is likely a celebration of the dog's graduation from obedience school or a service dog
training program.
Ethan Mollick again says Google Bard is surprisingly good at working with images.
It appears to be combining a reverse image search with multimodal capability, i.e. the ability of the
AI to see something. Now, importantly, this isn't just for novelty, like asking about a pug in a graduation cap.
Joel Dean writes, wow, Bard just converted a screenshot to code. This is so next level.
Looking forward to these multimodal capabilities in ChatGPT. The prompt that Joel had used was,
are you able to convert this screen to Jetpack Compose, and then he shared a screenshot from which
Bard was able to push out code, although Joel doesn't say how accurate
that code was. Now, it's entirely possible that within the next six months, this sort of multimodality
is total table stakes. However, as of right now, OpenAI has indicated that they've had to put
broader multimodal rollouts on hold because of their lack of access to GPUs. It's one of the areas
where the GPU shortage is showing up most profoundly. So for now, I would say that in addition to just
using Bard for internet-required tasks, Bard is also the standout option for multimodal tasks that
involve images. Now, Yam's next contention is that for harder reasoning tasks, you should use GPT4.
And on the one hand, I would say that this is broadly consensus, that people believe by and large
that GPT4 remains ahead of all of its competitors when it comes to reasoning tasks.
And on top of that, there's also some reasonable evidence.
For example, when Claude 2 came out, they shared a number of comparisons.
And while Claude did overtake ChatGPT in GRE writing and bar exams, the difference
wasn't really statistically significant.
And in terms of standard GREs, ChatGPT still won verbal, quantitative, and the medical exam.
But I think the even more important part of the discussion right now, as it
relates to ChatGPT and GPT4, isn't so much GPT4 and how far ahead it is on reasoning tasks.
Instead, what matters about ChatGPT most right now is the newly released code interpreter
feature, which many are seeing as effectively GPT 4.5, even though it's not named that.
Swyx from the Latent Space podcast made this point most loudly. On July 10th, he tweeted,
code interpreter equals GPT4.5, or making GPT4 1,000x better with one weird trick. Now, the
one weird trick that he's referring to is the fact that code interpreter is not so much just a tool
that can interpret code or that can look at data when you plug it into the model. Instead, it represents
a fundamental addition to the model itself. In a blog post, Swyx shared a chart that
he called the Road to AGI. And what he pointed out is that with each of the big leaps for GPT, from GPT3 to
3.5, from 3.5 to GPT4, and from GPT4 to GPT4 plus code interpreter, there was an additional
ingredient added to the training. So with GPT3, we got pre-training, but with GPT3.5, we got pre-training
and reinforcement learning from human feedback. Then the next additions to GPT3.5 included
plugins and user-defined functions. And then with GPT4, we added into the mix a mixture
of experts. So all of a sudden the model had not just pre-training and reinforcement learning from
human feedback, but pre-training, a mixture of experts, and reinforcement learning from human feedback.
In that point of view, code interpreter becomes not just, again, an application that sits on top
of GPT4, but a code sandbox, which allows GPT4 to effectively fill in the gaps in its own model.
Moritz Kremb expanded upon the same idea. He wrote, people haven't fully grasped the
significance of the code interpreter. It's not just another plugin that does data analysis.
In my opinion, it's actually GPT4.5 masked as a plug-in. Let me explain.
ChatGPT was already able to produce code, but it wasn't able to run it.
The code interpreter can.
This small change makes a huge difference.
This means that ChatGPT is no longer limited to being a passive assistant.
It has now become active.
Two, iterative abilities.
On top of running code, the code interpreter seems to have built-in iterative abilities.
It recognizes when it's made a mistake and it corrects it by itself.
It's more closely resembling an agent now.
Three, different model.
It also seems that the code interpreter is actually accessing a completely different model from GPT4.
Some people have reverse engineered this and are pretty sure.
He actually references another tweet from Yam Peleg that says,
We highly, 99%, suspect that the model is not the same model as GPT4.
The user interface accesses a completely different endpoint that also has additional parameters.
Number four, Moritz points out, is multimodality.
GPT4 has multimodality built into it.
This means that it understands not only text but also visuals and audio.
However, this feature has not been activated for ChatGPT yet.
With the code interpreter, OpenAI has made a step in the direction of enabling this feature.
Because now there is a way to input anything, data sets, images, audio, into ChatGPT,
a prerequisite for multimodal functions.
While the code interpreter doesn't yet understand an image, it can already take the image and
manipulate it.
To me, this is a fundamentally upgraded ChatGPT.
Calling it the code interpreter and downplaying it as a GPT4 plugin is not doing it justice.
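The "run it, notice the mistake, correct it" behavior described above can be sketched as a simple loop. To be clear, this is only an illustration of the general idea and not how OpenAI's code interpreter is actually built, which has not been described in this episode. The function names `run_code` and `solve_with_retries` are hypothetical, `fake_model` is a scripted stand-in for a real LLM call, and the bare `exec` here provides no real sandboxing.

```python
import traceback
from typing import Callable, Optional


def run_code(source: str) -> tuple[bool, str]:
    """Execute `source` and return (succeeded, output_or_error_text)."""
    namespace: dict = {}
    try:
        # NOTE: plain exec is NOT a sandbox; this is for illustration only.
        exec(source, namespace)
        return True, repr(namespace.get("result"))
    except Exception:
        return False, traceback.format_exc(limit=1)


def solve_with_retries(
    generate: Callable[[str, Optional[str]], str],
    task: str,
    max_attempts: int = 3,
) -> tuple[int, str]:
    """Ask for code, run it, and feed any error back until it works."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        source = generate(task, feedback)
        succeeded, output = run_code(source)
        if succeeded:
            return attempt, output
        feedback = output  # the error text becomes context for the next try
    raise RuntimeError(f"no working code after {max_attempts} attempts")


def fake_model(task: str, feedback: Optional[str]) -> str:
    """Scripted stand-in for an LLM: buggy first draft, fixed once it sees an error."""
    if feedback is None:
        return "result = sum(values) / len(values)"  # bug: `values` is undefined
    return "values = [2, 4, 6]\nresult = sum(values) / len(values)"


if __name__ == "__main__":
    print(solve_with_retries(fake_model, "average some numbers"))
```

The point of the sketch is the difference between a passive assistant, which would stop after writing the first draft, and an active one, which gets to see the error and try again.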
Now, the last piece of this puzzle that I wanted to mention is something that's very
different.
You can kind of tell with all of these different LLMs that I've just mentioned over the course
of this video, they're all sort of for professional or at least work-type use cases. It's research,
it's coding, it's development, it's building. However, some people believe that that is not the be-all
and end-all of what AIs can do. In many ways, the biggest proponent of this view is, of course,
Inflection. Inflection is the company behind Pi, which stands for personal intelligence.
When you go to heypi.com, the first window that comes up says, hey there, great to meet you. I'm Pi,
your personal AI. My goal is to be useful, friendly, and fun. Ask me for advice, for answers, or
let's talk about whatever's on your mind.
When Pi was first introduced, Mustafa Suleyman, who was also previously a founder at Google's
DeepMind said, many people feel that they just want to be heard or they just want a tool that
reflects back what they said to demonstrate they have actually been heard.
And subsequent to that launch, what they've been doing is basically increasing the feature
set to make it more interpersonal.
About a week ago, Mustafa tweeted, Pi now has a voice.
Call Pi and have a chat whilst taking a walk or doing the dishes.
Interestingly, so far, the community hasn't really seemed to treat it like just a novelty.
Last week, Robert Scoble shared a set of conversations saying,
check out this chat I had with Pi, my new AI.
This is incredible.
In the conversation, you can really get a sense of how Pi is designed to be a good listener.
And what's really interesting to me and what Pi seems to do really well
is actually ask questions that move the conversation into a new direction.
In other words, when we have a conversation, it's not just one person talking
and another person nodding their head and saying, yeah, that's cool.
It's two people actively interacting with one another such that
each changes the shape of the next thing that's going to be said. Now, to some extent, reading
this still feels like you're reading an AI, but it does do that job of coming back with questions
that really do end up pushing things forward. Now, as we wrap up, one thing that I think is worth
noting is that one company that didn't have a contender in here is, of course, Meta. However,
Meta's Llama model has been absolutely integral to the explosion of open source alternatives,
and it appears that they're on the verge of releasing a new Llama 2 model that will be commercially
available. Swyx again said, this is the biggest change to the AI competitive landscape. The real threat to
OpenAI isn't OpenAI but safer, but OpenAI but open. Finally, the last LLM that I'll mention is something that is a
different use case entirely, which is not an LLM that is open and which an individual taps into the
collective database, but instead personal LLMs that interact with the data that a specific person or company
has given it access to. There are tons of examples of this. This is a very hot development area right now,
but one that's been making some waves recently is called Quivr.
Friend of the show, Emmett Homm writes,
An AI-powered second brain is taking over GitHub.
Quivr is a customizable second brain that lets you dump in any file,
text, audio, video links, and chat with it via LLM.
I've only just started to play with Quivr, entering my note files into it
and some other things to see what comes out.
But I think that this is a trend you're going to see a lot more of.
So, friends, we will wrap there.
That's a list of how LLMs differ and what they're good for.
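For anyone who wants the recommendations from this episode in one place, here is a minimal sketch that writes them down as a lookup table. The first five entries paraphrase Yam Peleg's tweet, and the images and conversation entries reflect the additions made in this episode. The task labels and the `recommend` function are invented for this example, and the mapping is a snapshot of opinions as of July 2023, not a benchmark result.

```python
# Task type -> suggested tool, as discussed in this episode (July 2023).
# The keys are made-up labels for this example, not an official taxonomy.
RECOMMENDATIONS = {
    "long context": "Claude 2",
    "internet": "Bard",
    "hard reasoning": "GPT4",
    "code": "ChatGPT code interpreter",
    "long essay plus internet": "Bing",
    "images": "Bard",          # added in this episode, not from the tweet
    "conversation": "Pi",      # added in this episode, not from the tweet
}


def recommend(task_type: str) -> str:
    """Look up the suggested tool for a task label (hypothetical helper)."""
    key = task_type.strip().lower()
    if key not in RECOMMENDATIONS:
        known = ", ".join(sorted(RECOMMENDATIONS))
        raise KeyError(f"unknown task type {task_type!r}; known types: {known}")
    return RECOMMENDATIONS[key]


if __name__ == "__main__":
    for task in ("Long context", "images", "code"):
        print(task, "->", recommend(task))
```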
Let me know what you think in the comments.
And as always, I appreciate you listening or watching.
Until next time, peace.
