The AI Daily Brief: Artificial Intelligence News and Analysis - How Close Are We to Self-Improving AI?

Starting point is 00:00:00 Today on the AI Daily Brief, how Anthropic and OpenAI both perform in a test of AI that performs AI research. Before that, in the headlines, can ChatGPT out-diagnose doctors? The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. To join the conversation, follow the Discord link in our show notes. Welcome back to the AI Daily Brief Headlines edition, all the daily AI news you need in around five minutes. Here's a really interesting study published in ScienceDaily today. Does AI improve doctors' diagnoses? Study puts it to the test.

Starting point is 00:00:35 This study came out of UVA Health and took 50 physicians, of whom half were randomly assigned to use ChatGPT Plus to diagnose complex cases, with the other half relying on more conventional methods, including using medical reference sites. The researchers then compared the results to each other as well as to ChatGPT alone. So what actually happened? Well, doctors using ChatGPT Plus slightly outperformed the physicians using conventional methods. Still, it was very close. Diagnostic accuracy for the doctors using ChatGPT Plus was 76.3%, while for the conventional-approach physicians it was 73.7%.

Starting point is 00:01:08 The ChatGPT test group also apparently reached their diagnoses slightly more quickly, about 45 seconds faster. That said, when ChatGPT Plus was tasked with making the same diagnoses alone, its accuracy was more than 92%. Does this mean that ChatGPT is unreservedly better and that we should just be turning over everything to robot doctors? Not necessarily. This is a controlled setting,

Starting point is 00:01:32 and in real life the researchers cautioned that there are many other aspects of clinical reasoning that come into play, especially, as they write, in determining downstream effects of diagnoses and treatment decisions. Still, the fact that ChatGPT alone outperformed the doctors using ChatGPT suggests to some that doctors need better training on how to use these tools. Study lead Andrew Parsons said, our study shows that AI alone can be an effective and powerful tool for diagnosis. We were surprised to find that adding a human physician to the mix actually reduced diagnostic accuracy, though improved efficiency. These results likely mean we need more formal training in how best to use AI.

Starting point is 00:02:05 Moving over to AI giant Nvidia, some concerning news for the company recently. According to The Information, Nvidia has asked suppliers to change the design of server racks multiple times to deal with an overheating issue. The Blackwell GPUs overheat when connected together in server racks designed to hold up to 72 chips. Nvidia refused to comment on whether an updated design has been finalized. Still, this is extremely late in the production process to be making such major changes. Reportedly, Nvidia hasn't alerted customers to any delays related to the redesign. A company spokesperson told Reuters,

Starting point is 00:02:37 Nvidia is working with leading cloud service providers as an integral part of our engineering team and process. The engineering iterations are normal and expected. Basically, a non-answer denial. This is, unfortunately for Nvidia, not the only issue with Blackwell. In August, the company discovered a design fault that impacted manufacturing yields and delayed the release by at least a quarter. CEO Jensen Huang has recently claimed the Blackwell units would begin shipping in Q4, but with just six weeks remaining, Nvidia could be cutting it close to hit that target. Over in adoption land, ESPN is testing an AI-generated sportscaster on their Saturday

Starting point is 00:03:08 college football show, SEC Nation. Named Facts, the gen AI avatar is intended to promote, quote, education and fun around sports analytics. Writes The Verge, we haven't seen the avatar in action, but it sounds like a botified version of stats encyclopedia Howie Schwab, who was ESPN's first statistician. ESPN had already gone deep on AI, adding AI-generated game recaps to their website back in September. The feature was used to expand coverage of less followed sports like women's soccer and lacrosse. Commentary at the time focused on the gaffes, including a failure to recognize the occasion of a player's retirement game, as well as bland commentary. But that's sort of to be

Starting point is 00:03:40 expected as these things roll out. Anticipating backlash around this, ESPN made clear that the avatar is absolutely not made to replace journalists or other talent, writing, Facts is designed to test innovations out in the market and create an outlet for ESPN analytics data to be accessible to fans in an engaging and enjoyable segment. Lastly, today, a fun one. A UK telco is taking a novel approach to using AI to reduce fraud. Mobile phone carrier O2 has introduced a voice-enabled chatbot they're calling the AI Granny to waste scammers' time. Trained to mimic an elderly woman, the chatbot engages in rambling discussion, keeping scammers online for as long as possible. Nicknamed Daisy, the AI Granny can feed fake bank details to the scammers to keep them interested

Starting point is 00:04:18 while going on long tangents about knitting, the weather, or her cat. The chatbot isn't for use by customers. It's being deployed directly on the phone network and used to answer calls from a list of serial scam numbers. Introduced to mark International Fraud Awareness Week, O2 claims the chatbot has kept numerous fraudsters on calls for 40 minutes at a time. The Verge's Tom Warren writes, best use of AI yet, and he might not be wrong. That, however, is going to do it for today's AI Daily Brief Headlines edition. Next up, the main episode. Today's episode is brought to you by Vanta. Whether you're starting or scaling your company's security program, demonstrating top-notch security practices and establishing trust is more important than ever.

Starting point is 00:04:56 Vanta automates compliance for ISO 27001, SOC 2, GDPR, and leading AI frameworks like ISO 42001 and NIST AI RMF, saving you time and money while helping you build customer trust. Plus, you can streamline security reviews by automating questionnaires and demonstrating your security posture with a customer-facing trust center, all powered by Vanta AI. Over 8,000 global companies like LangChain, Leela AI, and Factory AI use Vanta to demonstrate AI trust and prove security in real time. Learn more at vanta.com/nlw. That's vanta.com/nlw. Today's episode is brought to you, as always, by Superintelligent. Have you ever wanted an AI Daily Brief,

Starting point is 00:05:36 but totally focused on how AI relates to your company? Is your company struggling with AI adoption, either because you're getting stalled figuring out what use cases will drive value, or because the AI transformation that is happening is siloed at individual teams, departments, and employees and not able to change the company as a whole? Superintelligent has developed a new custom internal podcast product

Starting point is 00:05:57 that inspires your teams by sharing the best AI use cases from inside and outside your company. Think of it as an AI Daily Brief, but just for your company's AI use cases. If you'd like to learn more, go to besuper.ai/partner and fill out the information request form. I am really excited about this product,

Starting point is 00:06:15 so I will personally get right back to you. Again, that's besuper.ai/partner. Welcome back to the AI Daily Brief. Today we are discussing kind of the shape and texture of what state of the art looks like. We've got a story about Google Gemini outperforming other models on the leaderboard. And we're kicking off with this story about Anthropic and OpenAI in this AI research comparison. But really, I want to take a step back and contextualize this in terms of how individuals and enterprises are thinking about AI right now. Over the last couple of weeks, a huge part of the conversation has been dedicated to the question of whether AI models are plateauing,

Starting point is 00:06:51 whether there is a slowdown in the rate of performance. It's why we talked about some alternative scaling methods and what the labs are doing to try to deal with this. In many ways, what I think we're going to see is if and as that plateau happens, the competition for model supremacy is going to be about more than just sheer state-of-the-art performance. It's going to be about product and user experience. It's going to be about customization and specification for task. And it's going to be about access to particular data and knowledge of specific workflows within the enterprise that make certain tools work better than others. Basically, I think that we're about to see an expansion of the way that we think about the

Starting point is 00:07:29 competition for Gen AI supremacy. And so that's just a little bit of context and background before we get into this. The Information's headline reads, Anthropic beat OpenAI in test of AI that performs AI research. Now, this came from independent researchers at Model Evaluation and Threat Research, or METR, a nonprofit group that is publishing later this week an evaluation of how LLMs from both OpenAI and Anthropic performed when they were asked to solve a set of seven AI research problems. This is more than just an idle test. As The Information puts it, since the days of Alan Turing, AI developers have been captivated by the prospect of

Starting point is 00:08:01 AI powerful enough to improve itself. OpenAI has already developed an internal AI research assistant tool to help its researchers work faster, a possible first step in the development of AI that can conduct AI research on its own. Now, for AI safety advocates, self-improving AI is an indicator of something else entirely. But the point is that people are very interested in this question of whether AI can be used to improve AI. According to The Information, in five of the seven tests that were run as part of this experiment, Claude Sonnet 3.5 outperformed o1-preview. They also note that Claude won by what they call a wide margin in two of those seven tests. Of the two that o1-preview won, one of those was also what they called decisive.

Starting point is 00:08:40 One thing for those who are trying to gauge how far along the path to AGI we are: The Information also reports that both models were no match for the top human researchers who took the same tests, who scored more than twice as high as the models on average. Claude was, quote, basically as good as the average human researcher in two of the seven problems, and o1-preview was about as good as an average researcher in another problem. So what are the types of problems? In the example they give, one of the problems involved writing code for a language model from scratch without using division or exponents, which are usually essential for that task. Another problem involves experimenting with traditional AI scaling laws, just like an employee

Starting point is 00:09:13 at OpenAI might do, but using only a small amount of computing power. The tests are in part designed to give us a beacon and a benchmark for how far along AI development really is. Again, The Information writes, these tests are designed to put human participants at a disadvantage. That way, even if AI models catch up to humans on these tests, that would still mean the models are less capable than top human researchers overall and would give the AI firms time to make adjustments to improve their safety. So again, summing up for those keeping track at home, AI still not as good as the top human researchers at AI research, but starting to, in certain cases, match average human researchers.

Starting point is 00:09:47 Now, one other small thing from Anthropic while we're on the topic. Anthropic has been pushing really hard to get away from the world of prompt engineering and just build tools that help people improve their prompts automatically. At the end of last week, they announced, quote, the ability to improve prompts and manage examples directly in the Anthropic Console. These features, they say, make it easier to leverage prompt engineering best practices and build more reliable AI applications. The prompt improver allows developers to take existing prompts and leverage Claude

Starting point is 00:10:11 to automatically refine them using advanced prompt engineering techniques. This is ideal for adapting prompts that were originally written for other AI models as well as for optimizing handwritten prompts. So somewhat connected in the sense that increasingly we're seeing people ask the AI to help them use the AI. Now one more story, which was from the end of last week as well. Google DeepMind's latest experimental model has leapt to the front of the benchmarking charts. Known as Gemini-Exp-1114, the model has undergone testing on crowdsourced benchmarking website Chatbot Arena over the past week. It consistently scored better than GPT-4o, jumping 40 ranks from the previous Gemini models to the top of the leaderboard. It now leads in both technical and creative domains, topping the charts for both math and creative writing. It also overtook

Starting point is 00:10:52 GPT-4o for the best vision model. The only category where it wasn't the best model was coding, where it ranked number three behind GPT-4o and the o1 reasoning models. Notably, this is the first time a Gemini model has taken the lead by this benchmarking standard. The model is currently available as a preview on Google's AI Studio website. Logan Kilpatrick, the product lead at Google AI Studio, posted, Gemini, super-duper smart. Market research on new model names. Referring to Sam Altman's habit of quickly snatching the limelight back, scientist Casper Hansen wrote, what a great way to find out OpenAI will release o1 within 24 hours. Professor Ethan Mollick wrote, why are people confused about which models are the best choice for hard

Starting point is 00:11:29 problems? I mean, don't the names GPT-4o latest 202903 and Gemini-Exp-1114 and o1-preview make it obvious? Stop naming AI like files on my hard drive. As for the model itself, though, he wrote, this was pretty impressive from the new Gemini model launch today. I gave it one of my papers and asked it to review the tables and to comment on the methods. It did a better job than previous Gemini Pro, though that wasn't bad. Claude was close, but didn't zoom out as well. The bigger picture, of course, is there are now multiple models that are remarkably good at understanding complex academic papers and underlying quantitative methods.

Starting point is 00:12:00 Reading a paper like a PhD seems like a pretty impressive feat for us to just take in stride, as in, of course AI can do that. Part of what matters about that analysis, by the way, is that Ethan, among others, has suggested that part of the reason that it looks like AI performance is slowing down is that our benchmarks are just basically saturated at this point. Once you get up in the 90s, there's just not that much room to run. And part of the question is, do we need better benchmarks? Still, overall, it's hard not to feel like we are in a more incremental improvement sort of time in the AI field.

Starting point is 00:12:27 I would suggest that rather than be concerned about this, especially if you are trying to integrate AI into your business, use this breather as a chance to actually figure out how to use what's already available, which is so transformative in and of itself. I have a feeling that we will not be in this sort of moment for very long and that punctuated equilibrium will be back in no time flat. For now, though, that's going to do it for today's AI Daily Brief. Until next time, peace.
