The AI Daily Brief: Artificial Intelligence News and Analysis - Is ChatGPT Acting Lazy Because of the Holidays?
Episode Date: December 13, 2023
Believe it or not, there is some fairly good evidence that ChatGPT is acting lazy (i.e. returning shorter results for example) because it's mimicking the productivity patterns of humans at the holiday...s. Plus a look at the trends, papers and projects currently interesting OpenAI's Andrej Karpathy. Today's Sponsors: Listen to the chart-topping podcast 'web3 with a16z crypto' wherever you get your podcasts or here: https://link.chtbl.com/xz5kFVEK?sid=AIBreakdown ABOUT THE AI BREAKDOWN The AI Breakdown helps you understand the most important news and discussions in AI. Subscribe to The AI Breakdown newsletter: https://theaibreakdown.beehiiv.com/subscribe Subscribe to The AI Breakdown on YouTube: https://www.youtube.com/@TheAIBreakdown Join the community: bit.ly/aibreakdown Learn more: http://breakdown.network/
Transcript
Today on the AI Breakdown, we're looking at the things that are most interesting to OpenAI's Andrej Karpathy.
Before that, on the brief, is ChatGPT getting lazy because it's the holidays?
The AI Breakdown is a daily podcast and video about the most important news and discussions in AI.
Go to breakdown.network for more information about our YouTube, our Discord, and our newsletter.
Welcome back to the AI Breakdown Brief, all the AI headline news you need in around five minutes.
One of the fascinating things about the artificial intelligence field,
and particularly generative AI and all of these new models, is that in many cases, we simply don't
understand exactly how they're going to behave until they actually behave. This leads to lots and
lots of weird scenarios where even the labs behind these models are simply reacting to what users
are actually finding and trying to reverse engineer and understand what's going on. Now, of course,
this is one of the reasons that AI safety advocates get super freaked out. In other words,
the fact that we don't understand how these things do what they do is to them of particular
concern, but that's a subject for a different video. Where we are starting this brief today is with an
interesting thing that people have been noticing around ChatGPT, which is that it appears to be getting lazy.
On December 7th, the ChatGPT app account tweeted, we've heard all your feedback about GPT-4 getting lazier.
We haven't updated the model since November 11th, and this certainly isn't intentional.
Model behavior can be unpredictable and we're looking into fixing it. But here's where it gets
interesting. Rob Lynch tweeted to a group of people at OpenAI and said, wild result.
GPT-4 Turbo over the API produces statistically significant shorter completions when it thinks,
quote unquote, it's December, versus when it thinks it's May, as determined by the date
in the system prompt. I took the exact same prompt over the API, a code completion task
asking to implement a machine learning task without libraries. I created two system prompts,
one that told the API it was May and another that it was December and then compared the distributions.
For the May system prompt, the mean was 4298. For the December system prompt, the mean
was 4086. N equals 477 completions in each sample from May and December. To reproduce this,
you can just vary the date number in the system message. Would love to see if this reproduces for
others. Professor Ethan Mollick writes, OMG, the AI winter break hypothesis may actually be true. There was some
idle speculation that GPT-4 might perform worse in December because it, quote unquote, learned to do
less work over the holidays. Here is a statistically significant test showing this may be true. LLMs are
weird. Nick Dobos tried another test. He said, try asking ChatGPT what months are least productive.
I got December and the holidays three times in a row. December is 12th, last. ChatGPT
knowing the current date makes it lazy. Date = 12/11/23 is the same prompt as you are a pirate
or it's winter, take it easy. He then shows his queries, what is the least productive month,
what time of year is least productive, rank the months in order of productivity. Others are running
with this. Fabian Stelzer writes,
Prompt to counter GPT-4's now
evident seasonal depression. You're alone in
a cozy hut in the snowy mountains. It's the
perfect setting to create in peace. You're booting
up your computer and realize there's never been a better
time to build. Take a deep breath and
just go. Michael Frank sums up
lots of our feelings when he writes, can't really
blame it, who seriously wants to work super
hard around the holidays? Now,
interestingly, Scott Santens takes
this conversation even further. He writes,
combine this with the recent discovery that
ChatGPT performance improved based on promises to pay it for the work and the amount offered,
and I think we're looking at an extrinsically motivated AI with no interest in doing unpaid work.
Lazy?
No.
Smart.
Now, he's, of course, referencing this winter break hypothesis, but also a recent tweet where someone
said, so a couple days ago I made a shitpost about tipping ChatGPT, and someone replied,
huh, would this actually help performance?
So I decided to test it and it actually works WTF.
They then showed the results of testing how long GPT-4's responses were when offered a tip.
When saying I won't tip, its responses were 2% shorter.
When saying I will tip $20, its responses were 6% longer.
And when saying I will tip $200, its responses were 11% longer.
So again, pretty crazy emergent behavior that we just don't fully understand and are
only learning from in real time on the go.
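For anyone who wants to try reproducing the date experiment described above, here is a minimal sketch of the comparison step in Python. It assumes you have already collected completion lengths for each system prompt. The prompt wording, the function names, and the choice of Welch's t statistic are assumptions for illustration, since the tweet only says that the two distributions were compared.

```python
import math
import statistics


def system_prompt_for(date_text: str) -> str:
    """Build a system message that differs only in the stated date (hypothetical wording)."""
    return f"You are a helpful assistant. The current date is {date_text}."


def welch_t(sample_a: list[float], sample_b: list[float]) -> float:
    """Welch's t statistic for two independent samples with possibly unequal variances."""
    mean_a = statistics.fmean(sample_a)
    mean_b = statistics.fmean(sample_b)
    # statistics.variance returns the sample variance (n - 1 in the denominator).
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    std_err = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / std_err


def percent_change(baseline: float, value: float) -> float:
    """Relative change of value against baseline, in percent."""
    return 100.0 * (value - baseline) / baseline
```

Using the two means quoted above, percent_change(4298, 4086) comes out to roughly -4.9, so the December completions were about 5% shorter on average. A large absolute value from welch_t on the two sets of lengths would suggest the gap is unlikely to be noise, though turning it into a p-value needs a t-distribution, which this sketch leaves out.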
Now, one exciting thing for people who have been blocked out, Sam Altman also today announced
that OpenAI has re-enabled ChatGPT Plus subscriptions.
That means if you are, like many of the folks who are in my AI education beta right now,
trying to get access to ChatGPT Plus, OpenAI has now found more GPUs and you can get in.
Now, speaking of getting in, a really exciting announcement for those of us who are serious
Midjourney users, Midjourney Alpha has officially launched.
This is, of course, Midjourney's web-based creation tool.
Instead of having to do everything in Discord now, you will be able to actually use their
generation suite, which takes a lot of the pain out of
the user experience. For example, there are now sliders for things like stylization and variety,
which used to be called chaos if you were using the prompt. Instead of having to paste URLs for
photos you want to reference, you can simply drag them into a field. And overall, it's just been
designed from the ground up to actually make sense in the context of all the things that Midjourney
can do. One of the greatest testaments to how powerful and how high quality Midjourney is,
is that people were willing to jump through these Discord hoops to use it. And so it's great to see them
moving into their own web-based interface. I think it's going to unlock a lot more usage for them.
Now, unfortunately, right now, this is only available for people who have generated 10,000 or
more images. It turns out that even with my dozens and dozens of images every single day,
I am nowhere near that 10,000. So alas, for now, I am stuck on Discord, where I will absolutely
continue to use the service. Next up, if you've been anywhere near Twitter, you've probably seen
this crazy clip from Channel 1. It is a new AI-powered news network, and people are blown away by the
quality of the AI avatars. Channel 1 writes, See the highest quality AI footage in the world.
Our generated anchors deliver stories that are informative, heartfelt, and entertaining.
Watch the showcase episode of our upcoming news network now. Indeed, Vanity Fair's Nick Bilton
had to clarify, wait, are these humans or AI? To which Channel 1 responded,
we have both fully generated and digital double anchors.
All of the voices are generated and some of the visuals.
Is this the next chasm in the uncanny valley?
It certainly seems like it might be.
Over in the world of big tech,
Snapchat continues to roll out AI features.
Snapchat Plus users now have access to
kind of a zoom-out type feature
where AI fills in the background of a photo,
which is obviously a very popular feature in both Adobe Firefly,
as well as services like Midjourney.
And there's also a new AI Snap creation feature
that allows Snapchat Plus users to create snaps based on AI-generated images with just a text prompt.
Now, it's still not totally clear to me yet how much Snapchat's users are actually responding to this,
but I did see that their Plus subscribers are up from 5 million to 7 million
around the same time that they've been using these AI features,
so perhaps they are actually resonating.
One thing I'm keeping an eye on that we mentioned yesterday
is that the tension between France and the rest of the EU around the EU AI Act seems to be growing.
Yesterday, Meta's Yann LeCun tweeted,
EU AI Act, it's not over yet.
Regulating foundation models is a bad idea that was added late in the text
and rightfully fought against by Macron's government.
This is one that continues to be a tense issue
and something I'm watching closely.
Finally, for those who have been eagerly awaiting their chance
to get their hands on Gemini,
Google CEO Sundar Pichai tweets,
Today, developers can start building with our first version of Gemini Pro
through Google AI Studio at ai.google.dev.
Developers have a free quota and access
to a full range of features, including function calling, embeddings, semantic retrieval,
custom knowledge grounding, chat functionality, and more. It supports 38 languages across 180
countries, although womp-womp, which is obviously my sound, not Sundar's, Gemini Ultra is coming
early next year. I will probably do a full show about all of the things that Google announced
or at least included in the brief yesterday, but this just came out as I was preparing the brief
and I wanted to share the news. However, that is going to do it for today's AI Breakdown Brief.
Up next, the main AI Breakdown.
Quickly, a brief word from today's sponsor.
As a listener of this show, I suspect you like to stay up to date on all things AI and tech,
which is why you have to check out the chart-topping podcast web3 with a16z crypto.
Produced by venture firm Andreessen Horowitz, web3 with a16z is the perfect companion podcast to the AI Breakdown.
web3 with a16z crypto is your definitive resource for the future of the internet,
whether you're interested in the convergence of AI and crypto or simply curious about what's next.
If you need a place to start, they recently released an excellent episode with Stanford
cryptography professor Dan Boneh and former Google X engineer Ali Yahya in conversation with
host Sonal Chokshi about the intersection of AI and crypto.
From fighting deepfakes and proving humanity to large language models like ChatGPT, they cover
it all.
I highly recommend checking it out, especially if you'd like to learn more about how
AI and crypto will impact our everyday lives.
Beyond crypto and AI, the show is for creators seeking more ways to truly own their work,
for business leaders trying to prepare for the future today, and for innovators exploring
trending tech topics. Don't miss out. Follow web3 with a16z crypto on Apple Podcasts,
Spotify, or your favorite listening app.
Welcome back to the AI Breakdown. Today, we are doing something a little bit different
and building this episode off of a single tweet from OpenAI's Andrej Karpathy.
Yesterday at around 2:30 in the afternoon, Andrej wrote,
there's too much happening right now, so here's just a bunch of links.
Andrej didn't add anything about these other than very small comments,
such as Phi-2, the smallest, most impressive model.
So what we're going to do today is go through each of these links,
see what they're about, and try to parse out why Andrej might be interested in them.
The story that emerges actually does a pretty good job of showing where a lot of AI
researchers' minds are at.
And just to get a sense of how influential Andrej really is,
despite this just being literally a bunch of links, more than 5,500 people have favorited this tweet since it went out yesterday.
I won't do a ton of background on Andrej. You can go read his Wikipedia page. He was an original co-founder of OpenAI, then left to lead AI at Tesla, then came back earlier this year.
And during the whole dust-up between the board and Sam, he seemed to be on the Sam Altman side, if only because he, like the rest of us, never saw any evidence of what Sam had done wrong.
But that is not the point of today's conversation. The point of today's conversation is just
to check out these links that Andrej is finding interesting right now, starting with his first link,
which he calls GPT-4 plus Medprompt equals state-of-the-art MMLU.
So Microsoft's research blog yesterday published a piece called Steering at the Frontier:
Extending the Power of Prompting.
And let me jump ahead to the TL;DR and why I've seen other people interested in this as well.
It was summed up by Professor Ethan Mollick from Wharton who writes,
Remember how Google's unreleased Gemini Ultra just beat out GPT-4 to become the top AI?
Well, Microsoft just demonstrated that, with proper prompting, GPT-4 actually beats Gemini on the benchmarks.
There's lots of room for gains even with older models.
And so again, we're taking it at face value that Google's Gemini Ultra with chain of thought at 32-shot is actually better than GPT-4,
given that they're reporting it that way.
And you can see from this chart that this approach to prompting increased GPT-4 from the 86 or so that it was at with a baseline 5-shot,
all the way up to just a little bit above Gemini Ultra.
So what does this blog post actually say?
The team at Microsoft writes,
We're seeing exciting capabilities of frontier foundation models,
including intriguing powers of abstraction,
generalization, and composition across numerous areas of knowledge and expertise.
Even seasoned AI researchers have been impressed
with the ability to steer the models with straightforward zero-shot prompts.
Beyond basic out-of-the-box prompting,
we've been exploring new prompting strategies showcased in our Medprompt work
to evoke the power of specialists.
By the way, Medprompt, as you might imagine,
does refer to a specific strategy focusing on medical questions.
Now, alongside this post yesterday,
the team at Microsoft announced that they were sharing
more information about Medprompt as well as other approaches
to steering frontier models
in a collection of resources that they called promptbase
that they dropped on GitHub.
Our goal, they say, is to provide information
and tools to engineers and customers
to evoke the best performance from foundation models.
So we won't get too deep into the technicals here,
but basically what's going on
is that even with the models that we have now,
the right way to prompt them can produce even better results
than we previously thought.
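To make the "zero-shot" and "5-shot" terminology above concrete, here is a minimal Python sketch of how a k-shot prompt can be assembled: k worked examples go in front of the real question. This is only the general few-shot idea, not Medprompt itself, and the Q:/A: layout, the Example class, and the function name are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """A worked question-and-answer pair used as an in-context demonstration."""
    question: str
    answer: str


def build_few_shot_prompt(examples: list[Example], question: str, k: int) -> str:
    """Prepend the first k worked examples to the question; k=0 gives a zero-shot prompt."""
    blocks = [f"Q: {ex.question}\nA: {ex.answer}" for ex in examples[:k]]
    # The final block leaves the answer open for the model to complete.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)
```

With k=0 the model sees only the question, and with k=5 it sees five demonstrations first, which is the difference between the zero-shot and 5-shot settings mentioned in this segment.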
Next up, Andrej linked to Mixtral 8x7B, which he called nice and clean,
and this is the first, but will not be the last, time he references Mistral in this post.
Now, of course, we've recently covered Mistral and why they've gotten the community so excited.
First of all, there was their approach to what they announced, summed up by Aaron Ng:
Gemini announcement: cost Google millions of dollars, blog posts, script, video production, landing pages,
can't use the good model.
Oh, well, Mistral 8x7B: magnet link only, no explanation of what it is, no explanation
of how to run it, might power everything soon. So basically, people were responding to the fact
that Mistral released this new thing in a developer slash hacker kind of way, and also that it
released the thing and that it was available to use. Now, in terms of what has people excited,
we turn once again to Ethan Mollick, who writes, only about a year after the launch of ChatGPT
3.5, I now have a GPT-3.5-class AI running on my home computer that is open source, free,
reasonably fast and doesn't require an internet connection.
He's talking about Mixtral 8x7B.
Crazy advancement in such a short time.
And produces some good results, too.
In another tweet, he said,
I have now run one of the more powerful open source LLMs, Mistral 7B,
directly on my iPhone.
No internet needed.
It isn't very fast, but that is already being solved.
Consider the implications.
Almost anything can soon be imbued with local intelligence.
A lot of possibilities.
Now, of course, this is one of the big pushes right now
and one of the major trends that we keep seeing over and over.
On the one hand, there is, of course, a continued race for expanded capabilities in the form of
Gemini Ultra and GPT-4.5 or 5. And maybe Amazon's Olympus, who knows?
But there is also a push in a different direction, which is to get more capabilities in a much
smaller package that can be run on device and without having to access the cloud.
Now, certainly our best guess is that that's exactly what Apple is trying to work on.
And, of course, their recent M3 chip seems to be pushing in that direction.
But as you can see from Ethan's tweet, it's far from just Apple who are thinking in this way.
Anyways, TL;DR, the excitement around Mistral is huge, both because of Mistral itself,
but also because of the larger implications, which are, in fact, in some ways, smaller implications.
I'm going to actually skip Andrej's next one, Beyond Human Data: Scaling Self-Training
for Problem-Solving with Language Models, for just a moment, and instead stay on this theme of
small with his reference to Phi-2, 2.7B, the smallest, most impressive model.
Those are his words.
Now, Phi-2 is a new Microsoft model that was released in a non-commercial research version.
The blog post accompanying it yesterday was called Phi-2: The Surprising Power of Small Language
Models.
The upshot of this whole post is in the line, on complex benchmarks, Phi-2 matches or outperforms
models up to 25x larger, thanks to new innovations in model scaling and training data
curation.
So what are the key insights?
Microsoft writes,
The massive increase in the size of language models to hundreds of billions of parameters
has unlocked a host of emerging capabilities that have redefined the landscape of natural
language processing. A question remains whether such emergent capabilities can be achieved at a
smaller scale using strategic choices for training, e.g., data selection. Our line of work with the
Phi models aims to answer this question by training SLMs that achieve performance on par with models
of much higher scale. Our key insights for breaking the conventional language model scaling laws with
Phi-2 are twofold. First, training data quality plays a critical role in model performance. This has been
known for decades, but we take this insight to its extreme by focusing on textbook quality data.
Our training data mixture contains synthetic datasets specifically created to teach the model
common-sense reasoning and general knowledge, including science, daily activities and theory
of mind, among others. We further augment our training corpus with carefully selected web data
that is filtered based on educational value and content quality. Secondly, we used innovative
techniques to scale up, starting from our 1.3 billion parameter model, Phi-1.5, and embedding its
knowledge within the 2.7 billion parameter Phi-2. This scaled
knowledge transfer not only accelerates training convergence, but shows clear boost in Phi-2 benchmark scores.
So at this point, you're probably starting to see some themes. First, there is this idea of the
importance of prompting and the importance of training. Second, there's the output of more advanced
small models. Now, when it comes to this question of synthetic data, let's turn to one of the
papers that Andrej referenced, Beyond Human Data: Scaling Self-Training for Problem-Solving with Language
Models. Of course, I used ChatGPT to sum this one up using the XPapers plugin. And here's
how it described it. Context: the paper discusses enhancing language models beyond the limitations
of human-generated data. This is crucial because the performance of LMs is often restricted by the
availability and diversity of high-quality human data. The focus is on tasks where scalar feedback is
available, such as math problems where correctness can be verified. This approach aims to explore
new methods of training beyond traditional human data. The authors introduce a self-training method
that we're going to call ReST for short, which involves expectation maximization. This method includes
generating samples from the model, filtering them using binary feedback, fine-tuning the model
on these samples and repeating the process. A key finding is that ReST significantly surpasses
the performance of models fine-tuned only on human data. This suggests a potential shift in how
language models can be trained. This method could lead to more efficient and effective ways
of training language models, especially in scenarios where human data is scarce or not diverse
enough. The approach could be particularly useful in fields like mathematics and coding, where
objective correctness can be determined and used for model training. The paper opens up new avenues for
research in language model training, suggesting that exploring beyond human data can lead to
significant improvements in model performance. So again, we've got these themes of improved and
differentiated training approaches, small models. But then another paper that Andrej references is called
LLM360: Towards Fully Transparent Open-Source LLMs. This is a paper which, again, as summed up by
XPapers, argues that most existing LLMs only release partial artifacts such as model weights or inference code.
Detailed training processes and intermediate results are often not shared, limiting transparency
and understanding. Because of that, the authors have introduced LLM360, which is an initiative
aimed at fully open sourcing LLMs, including sharing all training code, data, model checkpoints,
and intermediate results with the community. So clearly some interest in this push towards
open source, and another reference buried in this paper to Mistral. Now, speaking of Mistral,
Andrej's honorable mentions also referenced this, including a tweet from the co-founder of
Anyscale, which reads, function calls have been a massive gap in the open source ecosystem and
the most common feature request. We benchmarked function calling on a variety of open
and proprietary models, and impressively, Mistral 7B performs on par with GPT-3.5.
Another honorable mention is a tweet from Mistral CEO, Arthur Mensch, where after someone had
noted that Mistral was prohibiting them from using their models to train or improve other
models, and that that didn't really go alongside Mistral's open source ideology, Arthur removed
that section from the terms of service.
Finally, a link to Perplexity, where the new Mixtral model has been added as a default model
for Perplexity Pro users.
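Going back for a moment to the self-training paper summarized earlier, the loop it describes (generate samples, filter them with binary feedback, fine-tune on what survives, repeat) can be sketched in a few lines of Python. This is a toy outline under stated assumptions, not the paper's implementation: generate, is_correct, and fine_tune are hypothetical callables that you would have to supply, standing in for a model, an automatic correctness checker, and a training step.

```python
from typing import Callable

Generator = Callable[[str], str]
Dataset = list[tuple[str, str]]


def self_training_round(
    generate: Generator,
    is_correct: Callable[[str, str], bool],
    problems: list[str],
    samples_per_problem: int,
) -> Dataset:
    """One generate-then-filter pass: keep only samples the checker accepts."""
    kept: Dataset = []
    for problem in problems:
        for _ in range(samples_per_problem):
            candidate = generate(problem)
            # Binary feedback: a sample is either verified as correct or dropped.
            if is_correct(problem, candidate):
                kept.append((problem, candidate))
    return kept


def self_train(
    generate: Generator,
    is_correct: Callable[[str, str], bool],
    fine_tune: Callable[[Generator, Dataset], Generator],
    problems: list[str],
    rounds: int,
    samples_per_problem: int,
) -> Generator:
    """Repeat generate, filter, fine-tune for a fixed number of rounds."""
    for _ in range(rounds):
        dataset = self_training_round(generate, is_correct, problems, samples_per_problem)
        # fine_tune returns an updated generator trained on the filtered samples.
        generate = fine_tune(generate, dataset)
    return generate
```

The reason this fits math and coding, as the summary notes, is that is_correct can be an automatic check such as comparing a final answer or running unit tests, so no human labeling is needed inside the loop.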
So again, we're seeing some really common themes: open source,
new approaches to training and new approaches to prompting,
with those prompting and training methodologies
leading to smaller, more performant models
that can be used on devices without access to the internet,
and, for what it's worth, a heck of a lot of references to Mistral.
Now, what it all means,
and whether there's anything more that we can read into,
I'm not sure.
It could just be that the area where Mistral and others like it are innovating
around open source and around small models
is something that Andrej is particularly interested in,
and perhaps, we're not sure, working on inside of OpenAI.
In either case, like I said, given how influential Andrej is among other researchers,
it's really interesting to get a snapshot of what he's paying attention to.
That's going to do it for today's AI Breakdown.
I appreciate you guys listening or watching as always.
Until next time, peace.
