Daybreak - This startup ranked AI models. They all landed in the danger zone

Starting point is 00:00:00 Since February this year, India's massive IT sector has been under some real stress. The launch of Claude Cowork triggered what has now been dubbed the SaaSpocalypse. The companies that existed to take over the dedicated, detailed work of upgrading systems, coding enterprise software and building out tech infrastructure for other businesses were suddenly hit with an existential question. Could they all be replaced by AI, and soon? Regardless of what your answer is, here's the thing. We all know that AI models come with some structural faults that desperately need fixing before we get to the next stage.

Starting point is 00:00:39 My colleague, Debanjali Biswas, spoke to the founder of a startup that tested the reasoning capabilities of several popular AI models. And she wrote about what they found in this edition of The Ken's exclusive subscriber-only newsletter called Make India Competitive Again. Her piece is titled "This startup ranked AI models. They all landed in the danger zone". And I'm going to be reading it out to you today. Welcome to Daybreak, a business podcast from The Ken. I'm your host, Rita Virgis, and every day of the week, my co-host, Nikas Sharma and I bring you one new story that is worth understanding and worth your time. Today is Friday, the 8th of May.

Starting point is 00:01:42 Indian IT stocks are having a rough time in the age of AI. February was the worst month since the global financial crisis of 2008. The latest shock came after Anthropic released two tools, Claude Code and Claude Cowork, that can perform complex programming and analysis tasks. Investors took the hint. If AI can do white-collar work faster and cheaper,

Starting point is 00:02:05 the economics of IT services might change dramatically. For me, the market panic was surprising. The idea that AI will replace humans has been floating around for years. What Claude did differently was position AI not just as a productivity tool, but as a competitor, particularly for knowledge workers like programmers and analysts. AI stopped being a crutch and started looking more like a colleague. And yet, there is a deeper problem lurking beneath the hype. Even the best AI models may not

Starting point is 00:02:37 reason the way people assume they do. A few weeks ago, I spoke with Satyam Dwivedi, an AI/ML researcher and founder of deep-tech startup Vaikhari AI. Over five months, his team tested several prominent large language models to measure their reasoning capabilities. The results, quite unsettlingly, are published on a public leaderboard that ranks leading AI systems on a simple skill: recognising when two concepts are related and when they are not. Publishing the results also serves a purpose for Vaikhari AI. The startup, just five months old, claims to be building tools that help companies test and diagnose AI models before deploying them in real-world applications. By releasing a leaderboard comparing models,

Starting point is 00:03:22 it is positioning itself as an independent evaluator of AI systems. That's an increasingly valuable role as companies race to build products on top of LLMs. Yet Dwivedi's investigation started almost accidentally. The former natural language understanding (NLU) engineer at Amazon was testing healthcare-oriented models such as MedGemma and ClinicalBERT for diagnostic applications when he noticed something odd. Dwivedi said that the models struggled to distinguish between coexistence, correlation and causation, three concepts that are easy to confuse but critically important in fields like medicine. He realized that companies building products on top of these models could inherit the same flaw.

Starting point is 00:04:09 Curious, he expanded the experiment beyond healthcare tools to general-purpose models. What began as a narrow evaluation became a broader research effort to measure whether AI models can correctly identify when relationships exist. Vaikhari evaluated models from major AI developers, including OpenAI, Anthropic and Google, along with several open-source and domestic systems. Almost every model tested, barring one each from Claude and Gemini, fell into what Vaikhari calls the danger zone: relatively low accuracy combined with high confidence. In other words,

Starting point is 00:04:47 the models were not just wrong. They were confidently wrong. The leaderboard plots models across two axes. How often they are correct and how confident they are in their answers. Many systems cluster in the worst quadrant, confidently presenting answers that turn out to be wrong. Even more worrying was a deeper structural flaw. When two concepts are genuinely related, AI models often perform well. But when the concepts are unrelated, the models tend to force a connection anyway. Put two unrelated ideas into a prompt

Starting point is 00:05:21 and the AI will usually invent a relationship rather than admit that none exists. Dwivedi believes part of the problem lies in how AI models are trained. Models rely heavily on training data and something called loss functions, which are mathematical mechanisms that reward correct outputs and penalize incorrect ones during training. But according to Dwivedi, current loss functions tend to push models toward finding relationships

Starting point is 00:05:46 between concepts even when none exist. In effect, the system is optimized to always produce an explanation. None of this will surprise anyone who has used ChatGPT or Gemini extensively. AI hallucinations are well documented. What is harder to grasp, though, is the scale of the problem when these systems move from casual use to real-world applications. Consumer-facing AI tools often include guardrails. If you ask Gemini about medical symptoms, for instance, it may display a warning telling users that the information is not medical advice. But these warnings typically exist only at the user interface layer.

Starting point is 00:06:26 Developers building applications on top of AI models usually access them through APIs, which are the back-end pipes that allow companies to integrate models into their own software. Those warning labels often disappear at that layer, meaning companies must build their own safety checks. And many do. Hospitals, financial institutions and SaaS platforms typically apply domain-specific data, validation layers and human review to AI-generated outputs.

Starting point is 00:06:54 Still, the underlying models remain the same. Dwivedi said that if the base model tends to infer relationships where none exist, that behavior can still influence downstream applications. The implications span industries. In finance, analysts increasingly use AI tools for summarization and pattern discovery. A model that confidently detects patterns between unrelated signals, say social media sentiment and stock movements, could subtly distort research.

Starting point is 00:07:23 In law, courts have already encountered AI-generated citations that simply do not exist. And in business software, even small errors can have real consequences. Humantic AI, a company that provides AI-powered sales research, cross-checks results across multiple models to reduce inaccuracies. Amarpreet Kalkat, the CEO, said that even small mistakes can make customers lose trust. He recalled using an AI system to suggest potential sales leaders for hiring. Two out of seven recommendations turned out to be entirely fictional. One candidate was described as the former chief revenue officer at Freshworks.

Starting point is 00:08:03 The person was real, but the title was not. Companies often respond by combining human review with AI automation. But as data volumes grow, manual oversight becomes increasingly difficult. Dwivedi said that it becomes a wild goose chase. The harder problem then is wrong connections. See, fabricated facts are relatively easy to spot. Incorrect relationships are much harder.

Starting point is 00:08:27 According to Vaikhari's research, AI models frequently get the facts right, but connect them incorrectly. Dwivedi explained that the issue isn't that the facts are wrong. It's that the connection between them is wrong. Consider a medical example. A terminal disease like skin cancer and a relatively harmless condition like sunburn may share overlapping symptoms.

Starting point is 00:08:49 When an AI system sees those shared characteristics, it may infer a stronger relationship between them than actually exists. Because the facts themselves are correct, the flawed reasoning can be difficult to detect, especially for non-experts. Part of the challenge lies in the way AI learns. Most models are trained on publicly available text from the internet. But many causal relationships exist only in human reasoning, not in written text. Ullas Nambiar, a partner at EY who works on AI transformation projects, explains the gap

Starting point is 00:09:21 with a simple example. Imagine someone posting on Reddit about taking an Uber to Bangalore airport and reaching in an hour. What the post might not mention is that it was a public holiday with unusually low traffic. Nambiar said, when AI learns from that data, it assumes Uber is the reason the trip took one hour. The missing context, or the real causal factor, never makes it into the training data. Nambiar added that you need richer constructs and taxonomies, because almost everything AI has been taught is from text that humans have created. But relationship-heavy details, such as "because X happened, Y followed", are in our mind and we don't

Starting point is 00:09:58 really write it down. The implications also shape the global AI race. See, on Vaikhari's leaderboard, most domestic systems appear well below the leading global models. And one reason is structural. Large global players like Google or OpenAI often focus on a handful of languages, primarily English and a few widely spoken global languages. This allows them to build deeper data sets with richer contextual information. India presents a different challenge. The country is highly multilingual, forcing domestic AI developers to prioritize coverage across many languages rather than depth within a few. Companies such as

Starting point is 00:10:37 Sarvam AI and BharatGen therefore face a trade-off: breadth over depth. Dwivedi believes one solution is to narrow the focus. He said that the way to improve Indian models is to concentrate on a few languages that cover the majority of the population. Ironically, global companies are already moving in that direction, expanding their operations in India while relying on the same depth-first strategy. Fixing these problems may require more data, far more than current models have access to. Better reasoning often demands richer contextual information: business data, private communications, operational records.

Starting point is 00:11:13 And that raises a trade-off. Nambiar said that even when AI providers say they won't use private data, certain insights eventually find their way into the system. In other words, improving AI models could require sacrificing some degree of privacy. The result is a paradox. AI systems today are often confidently wrong. But making them better may require opening the door to far more data collection. Which means the real challenge for the AI economy may not just be building

Starting point is 00:11:42 smarter models. It may be deciding how much we are willing to reveal in order to train them.
