The AI Daily Brief: Artificial Intelligence News and Analysis - How Close Are We to Self-Improving AI?
Episode Date: November 19, 2024
Anthropic’s AI outperforms OpenAI in a new AI research competition, sparking discussions about self-improving AI and its future implications. Meanwhile, Google’s Gemini leaps to the top of benchmarking charts, surpassing GPT-4 in multiple domains except coding. Also explored: Are AI benchmarks saturated, and how should businesses leverage existing capabilities during this period of incremental advancements?
Brought to you by: Vanta - Simplify compliance - vanta.com/nlw
The AI Daily Brief helps you understand the most important news and discussions in AI. Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614 Subscribe to the newsletter: https://aidailybrief.beehiiv.com/ Join our Discord: https://bit.ly/aibreakdown
Transcript
Today on the AI Daily Brief, how Anthropic and OpenAI both perform in a test of AI that performs AI research.
Before that in the headlines, can ChatGPT out-diagnose doctors?
The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.
To join the conversation, follow the Discord link in our show notes.
Welcome back to the AI Daily Brief Headlines edition, all the daily AI news you need in around five minutes.
Here's a really interesting study published in Science Daily today.
Does AI improve doctors' diagnoses?
Study puts it to the test.
This study came out of UVA Health and took 50 physicians, of whom half were randomly assigned to use ChatGPT Plus to diagnose complex cases,
with the other half relying on more conventional methods, including using medical reference sites.
The researchers then compared the results to each other as well as to ChatGPT alone.
So what actually happened?
Well, doctors using ChatGPT Plus slightly outperformed the physicians using conventional methods.
Still, it was very close.
Diagnostic accuracy for the doctors using ChatGPT Plus was 76.3%,
while for the conventional-approach physicians it was 73.7%.
The ChatGPT test group also apparently reached their diagnoses slightly more quickly,
about 45 seconds faster.
That said, when ChatGPT Plus was tasked with making the same diagnoses alone,
its accuracy was more than 92%.
Does this mean that ChatGPT is unreservedly better
and that we should just be turning over everything to robot doctors?
Not necessarily.
This is a controlled setting,
and the researchers cautioned that in real life there are many other aspects
of clinical reasoning that come into play, especially, as they write, in determining downstream effects
of diagnoses and treatment decisions. Still, the fact that ChatGPT alone outperformed the doctors
using ChatGPT suggests to some that doctors need better training on how to use these tools.
Study lead Andrew Parsons said, our study shows that AI alone can be an effective and powerful
tool for diagnosis. We were surprised to find that adding a human physician to the mix
actually reduced diagnostic accuracy, though improved efficiency. These results likely mean we need
more formal training in how best to use AI.
Moving over to AI giant Nvidia, some concerning news for the company recently.
According to The Information, Nvidia has asked suppliers to change the design of server racks
multiple times to deal with an overheating issue.
The Blackwell GPUs overheat when connected together in server racks designed to hold up to 72 chips.
Nvidia refused to comment on whether an updated design has been finalized.
Still, this is extremely late in the production process to be making such major changes.
Reportedly, Nvidia hasn't alerted customers to any delays related to the redesign.
A company spokesperson told Reuters,
Nvidia is working with leading cloud service providers as an integral part of our engineering team and process.
The engineering iterations are normal and expected.
Basically, a non-answer denial.
This is, unfortunately for Nvidia, not the only issue with Blackwell.
In August, the company discovered a design fault that impacted manufacturing yields and delayed the release by at least a quarter.
CEO Jensen Huang has recently claimed the Blackwell units would begin shipping in Q4,
but with just six weeks remaining, Nvidia could be cutting it close to hit that target.
Over in adoption land, ESPN is testing an AI-generated sportscaster on their Saturday
college football show, SEC Nation. Named FACTS, the gen AI avatar is intended to promote,
quote, education and fun around sports analytics.
Writes The Verge, we haven't seen the avatar in action, but it sounds like a bot-ified version
of stats encyclopedia Howie Schwab, who was ESPN's first statistician.
ESPN had already gone deep on AI, adding AI-generated game recaps to their website back in
September. The feature was used to expand coverage of less followed sports like women's soccer
and lacrosse. Commentary at the time focused on the gaffes, including a failure to recognize the
occasion of a player's retirement game, as well as bland commentary. But that's sort of to be
expected as these things roll out. Anticipating backlash around this, ESPN made clear that the
avatar is absolutely not made to replace journalists or other talent, writing, FACTS is designed to test
innovations out in the market, and create an outlet for ESPN analytics data to be accessible
to fans in an engaging and enjoyable segment. Lastly, today, a fun one. A UK telco is
taking a novel approach to using AI to reduce fraud. Mobile phone carrier O2 has introduced a voice-enabled
chatbot they're calling the AI Granny to waste scammers' time. Trained to mimic an elderly woman,
the chatbot engages in rambling discussion, keeping scammers online for as long as possible.
Nicknamed Daisy, the AI granny can feed fake bank details to the scammers to keep them interested
while going on long tangents about knitting, the weather, or her cat. The chatbot isn't for use
by customers. It's being deployed directly on the phone network and used to answer calls from a
list of serial scam numbers. Introduced to mark International Fraud Awareness Week, O2 claims the chatbot
has kept numerous fraudsters on calls for 40 minutes at a time. The Verge's Tom Warren writes,
best use of AI yet, and he might not be wrong. That, however, is going to do it for today's
AI Daily Brief Headlines edition. Next up, the main episode. Today's episode is brought to you by
Vanta. Whether you're starting or scaling your company's security program, demonstrating top-notch
security practices, and establishing trust is more important than ever.
Vanta automates compliance for ISO 27001, SOC 2, GDPR, and leading AI frameworks like ISO 42001 and NIST AI Risk Management Framework,
saving you time and money while helping you build customer trust.
Plus, you can streamline security reviews by automating questionnaires and demonstrating your security posture with a customer-facing trust center all powered by Vanta AI.
Over 8,000 global companies like LangChain, Leela AI, and Factory AI use Vanta to demonstrate AI trust and prove security in real time.
Learn more at vanta.com/nlw.
That's vanta.com/nlw.
Today's episode is brought to you, as always, by Superintelligent.
Have you ever wanted an AI Daily Brief,
but totally focused on how AI relates to your company?
Is your company struggling with AI adoption,
either because you're getting stalled,
figuring out what use cases will drive value,
or because the AI transformation that is happening
is siloed at individual teams, departments, and employees
and not able to change the company as a whole?
Superintelligent has developed a new custom internal podcast product
that inspires your teams by sharing the best AI use cases
from inside and outside your company.
Think of it as an AI Daily Brief,
but just for your company's AI use cases.
If you'd like to learn more,
go to besuper.ai/partner
and fill out the information request form.
I am really excited about this product,
so I will personally get right back to you.
Again, that's besuper.ai/partner.
Welcome back to the AI Daily Brief.
Today we are discussing kind of the shape and texture of what state of the art looks like.
We've got a story about Google Gemini outperforming other models on the leaderboard.
And we're kicking off with this story about Anthropic and OpenAI in this AI research comparison.
But really, I want to take a step back and contextualize this in terms of how individuals and enterprises are thinking about AI right now.
Over the last couple of weeks, a huge part of the conversation has been dedicated to the question of whether AI models are plateauing,
whether there is a slowdown in the rate of performance improvement. It's why we talked about some
alternative scaling methods and what the labs are doing to try to deal with this. In many ways,
what I think we're going to see is if and as that plateau happens, the competition for model
supremacy is going to be about more than just sheer state-of-the-art performance. It's going to be
about product and user experience. It's going to be about customization and specification for
task. And it's going to be about access to particular data and knowledge of specific workflows
within the enterprise that make certain tools work better than others.
Basically, I think that we're about to see an expansion of the way that we think about the
competition for Gen AI supremacy.
And so that's just a little bit of context and background before we get into this.
The Information's headline reads, Anthropic Beat OpenAI in Test of AI That Performs AI Research.
Now, this came from independent researchers at Model Evaluation and Threat Research,
which is a nonprofit group, which is publishing later this week an evaluation of how
LLMs from both OpenAI and Anthropic performed when they were asked to solve
a set of seven AI research problems. This is more than just an idle test. As The Information
puts it, since the days of Alan Turing, AI developers have been captivated by the prospect of
AI powerful enough to improve itself. OpenAI has already developed an internal AI research
assistant tool to help its researchers work faster, a possible first step in the development of
AI that can conduct AI research on its own. Now, for AI safety advocates, self-improving AI is
an indicator of something else entirely. But the point is that people are very interested in this
question of whether AI can be used to improve AI. According to The Information, in five of the
seven tests that were run as part of this experiment, Claude Sonnet 3.5 outperformed o1-preview.
They also note that Claude won by what they call a wide margin in two of those seven tests.
Of the two that o1-preview won, one of those was also what they called decisive.
One thing for those who are trying to gauge how far along the path to AGI we are, The Information
also reports that both models were no match for the top human researchers who took the same
tests, who scored more than twice as high as the models on average. Claude was, quote,
basically as good as the average human researcher in two of the seven problems, and o1-preview
was about as good as an average researcher in another problem. So what are the types of problems?
The example they give, one of the problems involved writing code for a language model from
scratch without using division or exponents, which are usually essential for that task.
Another problem involves experimenting with traditional AI scaling laws, just like an employee
at OpenAI might do, but using only a small amount of computing power. The tests are in part
designed to give us a beacon and a benchmark for how far along AI development really is.
Again, The Information writes, these tests are designed to put human participants at a disadvantage.
That way, even if AI models catch up to humans on these tests, that would still mean the models
are less capable than top human researchers overall and would give the AI firms time to make
adjustments to improve their safety. So again, summing up for those keeping track at home,
AI still not as good as the top human researchers at AI research, but starting to, in certain cases,
match average human researchers.
Now, one other small thing from Anthropic while we're on the topic.
Anthropic has been pushing really hard to get away from the world of prompt engineering
and just build tools that help people improve their prompts automatically.
At the end of last week, they announced, quote,
the ability to improve prompts and manage examples directly in the Anthropic console.
These features, they say, make it easier to leverage prompt engineering best practices
and build more reliable AI applications.
The prompt improver allows developers to take existing prompts and leverage Claude
to automatically refine them using advanced prompt engineering techniques.
This is ideal for adapting prompts that were originally written for other AI models as well as for optimizing handwritten prompts.
So somewhat connected in the sense that increasingly we're seeing people ask the AI to help them use the AI.
Now one more story, which was from the end of last week as well. Google DeepMind's latest experimental model has leapt to the front of the benchmarking charts.
Known as Gemini Exp 1114, the model has undergone testing on crowdsourced benchmarking website Chatbot Arena over the past week.
It consistently scored better than ChatGPT-4o, jumping 40 ranks from the
previous Gemini models to the top of the leaderboard. It is now top-ranked in both technical and
creative domains, topping the charts for both math and creative writing. It also overtook
GPT-4o for the best vision model. The only category where it wasn't the best model was coding,
where it ranked number three behind GPT-4o and the o1 reasoning models. Notably, this is the first
time a Gemini model has taken the lead by this benchmarking standard. The model is currently
available as a preview on Google's AI Studio website. Logan Kilpatrick, the product lead at
Google AI Studio posted, Gemini, super-duper smart. Market
research on new model names. Referring to Sam Altman's habit of quickly snatching the limelight back,
scientist Casper Hansen wrote, what a great way to find out OpenAI will release o1 within 24 hours.
Professor Ethan Mollick wrote, why are people confused about which models are the best choice for hard
problems? I mean, don't the names GPT-4o-latest-20240903 and Gemini-Exp-1114 and o1-preview
preview make it obvious? Stop naming AI like files on my hard drive. As for the model itself, though,
he wrote, this was pretty impressive from the new Gemini model launch today.
I gave it one of my papers and asked it to review the tables and to comment on the methods.
It did a better job than previous Gemini Pro, though that wasn't bad.
Claude was close, but didn't zoom out as well.
The bigger picture, of course, is there are now multiple models that are remarkably good at
understanding complex academic papers and underlying quantitative methods.
Reading a paper like a PhD seems like a pretty impressive feat for us to take just in stride,
as of course AI can do that.
Part of what matters about that analysis, by the way, is that Ethan, among others,
has suggested that part of the reason that it looks like AI performance is slowing down
is that our benchmarks are just basically saturated at this point.
Once you get up in the 90s, there's just not that much room to run.
And part of the question is, do we need better benchmarks?
Still, overall, it's hard not to feel like we are in a more incremental improvement sort of time in the AI field.
I would suggest that rather than be concerned about this,
especially if you are trying to integrate AI into your business,
use this breather as a chance to actually figure out how to use what's already available,
which is so transformative in and of itself.
I have a feeling that we will not be in this sort of moment for very long
and that punctuated equilibrium will be back in no time flat.
For now, though, that's going to do it for today's AI Daily Brief.
Until next time, peace.
