Daybreak - This startup ranked AI models. They all landed in the danger zone
Episode Date: May 7, 2026

India's best AI models are confidently wrong. Not occasionally, but structurally. If you put two unrelated ideas into a prompt, the model will usually invent a connection rather than admit that none exists.

In this piece, The Ken's Debanjali Biswas traces what a five-month study of leading AI models from OpenAI, Anthropic, and Google actually found about how they reason. The results landed almost every model in what researchers are calling the "danger zone": high confidence and low accuracy.

This is a read aloud of Debanjali's original story, by Rachel Varghese, on Daybreak.

📖 Read the full story on The Ken: This startup ranked AI models. They all landed in the danger zone
Transcript
Since February this year, India's massive IT sector has been under some real stress.
The launch of Claude Cowork triggered what has now been dubbed the SaaSpocalypse.
The companies that existed to take over the dedicated, detailed work of upgrading systems,
coding enterprise software and building out tech infrastructure for other businesses
were suddenly hit with an existential question.
Could they all be replaced by AI and soon?
Regardless of what your answer is, here's the thing.
We all know that AI models come with some structural faults that desperately need fixing before we get to the next stage.
My colleague, Debanjali Biswas, spoke to the founder of a startup that tested the reasoning capabilities of several popular AI models.
And she wrote about what they found in this edition of The Ken's exclusive subscriber-only newsletter called Make India Competitive Again.
Her piece is titled "This startup ranked AI models. They all landed in the danger zone".
And I'm going to be reading it out to you today.
Welcome to Daybreak, a business podcast from The Ken.
I'm your host, Rachel Varghese, and every day of the week, my co-host, Snigdha Sharma, and I bring you one new story that is worth understanding and worth your time.
Today is Friday, the 8th of May.
Indian IT stocks are having a rough time in the age of AI.
February was the worst month since the global
financial crisis of 2008.
The latest shock came after Anthropic released two tools,
Claude Code and Claude Cowork,
that can perform complex programming and analysis tasks.
Investors took the hint.
If AI can do white-collar work faster and cheaper,
the economics of IT services might change dramatically.
For me, the market panic was surprising.
The idea that AI will replace humans
has been floating around for years.
What Claude did differently was position
AI not just as a productivity tool, but as a competitor, particularly for knowledge workers like
programmers and analysts. AI stopped being a crutch and started looking more like a colleague.
And yet, there is a deeper problem lurking beneath the hype. Even the best AI models may not
reason the way people assume they do. A few weeks ago, I spoke with Satyam Dwivedi, an AI/ML researcher
and founder of deep-tech startup Vaikhari AI. Over five
months, his team tested several prominent large language models to measure their reasoning
capabilities. The results, quite unsettlingly, are published on a public leaderboard that
ranks leading AI systems on a simple skill, recognising when two concepts are related and when
they are not. Publishing the results also serves a purpose for Vaikhari AI. The startup, just five months
old, claims to be building tools that help companies test and diagnose AI models before
deploying them in real world applications. By releasing a leaderboard comparing models,
it is positioning itself as an independent evaluator of AI systems. That's an increasingly
valuable role as companies race to build products on top of LLMs. Yet, Dwivedi's investigation
started almost accidentally. The former natural language understanding, or NLU, engineer at
Amazon was testing healthcare-oriented models such as MedGemma and ClinicalBERT for
diagnostic applications when he noticed something odd.
Dwivedi said that the models struggled to distinguish between coexistence, correlation and causation,
three concepts that are easy to confuse but critically important in fields like medicine.
He realized that companies building products on top of these models could inherit the same flaw.
Curious, he expanded the experiment
beyond healthcare tools to general purpose models.
What began as a narrow evaluation became a broader research effort to measure whether AI
models can correctly identify when relationships exist.
Vaikhari evaluated models from major AI developers, including OpenAI, Anthropic and Google,
along with several open source and domestic systems.
Almost every model tested, barring one each from Claude and Gemini, fell into what Vaikhari
calls the danger zone: relatively low accuracy combined with high confidence. In other words,
the models were not just wrong. They were confidently wrong. The leaderboard plots models across
two axes. How often they are correct and how confident they are in their answers. Many systems
cluster in the worst quadrant, confidently presenting answers that turn out to be wrong. Even more
worrying was a deeper structural flaw. When two concepts are genuinely related,
AI models often perform well.
But when the concepts are unrelated,
the models tend to force a connection anyway.
Put two unrelated ideas into a prompt
and the AI will usually invent a relationship
rather than admit that none exists.
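The two axes described above can be sketched in a few lines of Python. This is only an illustration, not Vaikhari's actual methodology: the `Answer` record, the `summarize` function, the zone names other than "danger zone", the 0.7 cut-offs and the sample data are all assumptions made up for this sketch. It also reports how often a model claims a relationship between pairs that are in fact unrelated.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    related: bool      # ground truth: are the two concepts actually related?
    predicted: bool    # the model's answer to "are these related?"
    confidence: float  # the model's stated confidence, between 0 and 1


def summarize(answers, acc_cut=0.7, conf_cut=0.7):
    """Place one model on an accuracy/confidence grid (hypothetical cut-offs)."""
    n = len(answers)
    accuracy = sum(a.predicted == a.related for a in answers) / n
    confidence = sum(a.confidence for a in answers) / n

    # Share of unrelated pairs where the model claimed a relationship anyway.
    unrelated = [a for a in answers if not a.related]
    forced = sum(a.predicted for a in unrelated) / len(unrelated) if unrelated else 0.0

    if accuracy < acc_cut and confidence >= conf_cut:
        zone = "danger zone"       # low accuracy, high confidence
    elif accuracy >= acc_cut and confidence >= conf_cut:
        zone = "reliable"
    elif accuracy >= acc_cut:
        zone = "underconfident"
    else:
        zone = "uncertain"
    return {"accuracy": accuracy, "confidence": confidence,
            "forced_connection_rate": forced, "zone": zone}


# Made-up sample: good on related pairs, poor on unrelated ones.
sample = (
    [Answer(related=True, predicted=True, confidence=0.9) for _ in range(4)]
    + [Answer(related=False, predicted=True, confidence=0.9) for _ in range(3)]
    + [Answer(related=False, predicted=False, confidence=0.8)]
)
report = summarize(sample)
print(report)
```

In this made-up sample the model gets every related pair right but invents a link for three of the four unrelated pairs, which is enough to pull it into the low-accuracy, high-confidence quadrant.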
Dwivedi believes part of the problem lies in how AI models are trained.
Models rely heavily on training data
and something called loss functions,
which are mathematical mechanisms that reward correct outputs
and penalize incorrect ones during
training. But according to Dwivedi, current loss functions tend to push models toward finding relationships
between concepts even when none exist. In effect, the system is optimized to always produce an
explanation. None of this will surprise anyone who has used ChatGPT or Gemini extensively.
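To make the term "loss function" concrete, here is a minimal sketch using cross-entropy, a loss commonly used when training language models. The label names and probabilities are invented for illustration. The second half shows one possible way a model can end up always naming a relationship: if the set of allowed answers has no "no relation" option, the probabilities must still add up to one across the relations on offer. That framing is an assumption of this sketch, not a description of Dwivedi's findings or of how any specific model was trained.

```python
import math


def cross_entropy(probs, correct):
    """Loss for one example: small when the correct label gets high probability."""
    return -math.log(probs[correct])


# Same true label ("causes"), two different model outputs (made-up numbers).
confident_right = {"causes": 0.9, "correlates": 0.1}
confident_wrong = {"causes": 0.1, "correlates": 0.9}

loss_right = cross_entropy(confident_right, "causes")  # about 0.105
loss_wrong = cross_entropy(confident_wrong, "causes")  # about 2.303


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores.values())
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Hypothetical label set with no "none" option: even near-zero evidence
# still yields a winning relationship, because something has to be picked.
weak_scores = {"causes": 0.1, "correlates": 0.0}
forced = softmax(weak_scores)
best = max(forced, key=forced.get)

print(round(loss_right, 3), round(loss_wrong, 3), best)
```

The confidently wrong output is penalized far more heavily than the confidently right one, which is the "reward and penalize" mechanism in action. The softmax half shows that with no way to abstain, the highest-scoring relationship wins even when the evidence for it is negligible.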
AI hallucinations are well documented. What is harder to grasp though is the scale of the problem
when these systems move from casual use to real world applications. Consumer facing AI tools
often include guardrails.
If you ask Gemini about medical symptoms, for instance,
it may display a warning telling users that the information is not medical advice.
But these warnings typically exist only at the user interface layer.
Developers building applications on top of AI models usually access them through APIs,
which are the back-end pipes that allow companies to integrate models into their own software.
Those warning labels often disappear at that layer, meaning companies
must build their own safety checks.
And many do.
Hospitals, financial institutions and SaaS platforms
typically apply domain-specific data,
validation layers and human review to AI-generated outputs.
Still, the underlying models remain the same.
Dwivedi said that if the base model tends to infer relationships where none exist,
that behavior can still influence downstream applications.
The implications span industries.
In finance, analysts
increasingly use AI tools for summarization and pattern discovery.
A model that confidently detects patterns between unrelated signals,
say social media sentiment and stock movements, could subtly distort research.
In law, courts have already encountered AI-generated citations that simply do not exist.
And in business software, even small errors can have real consequences.
Humantic AI, a company that provides AI-powered sales research,
cross-checks results across multiple models to reduce inaccuracies.
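A simple form of cross-checking is to accept an answer only when enough models agree on it. The sketch below illustrates that idea and is not a description of the company's actual system. The `cross_check` function, the `min_agree` threshold and the model names are hypothetical, and the answers are hard-coded strings rather than real API calls.

```python
from collections import Counter


def cross_check(answers, min_agree=2):
    """answers maps a model name to its answer; returns (answer, agreed)."""
    # Normalize lightly so trivial formatting differences do not block agreement.
    normalized = [a.strip().lower() for a in answers.values()]
    top, count = Counter(normalized).most_common(1)[0]
    if count >= min_agree:
        return top, True
    return None, False  # no consensus: route to human review instead


# Two of three hypothetical models agree on a job title.
agreeing = {
    "model_a": "VP of Sales",
    "model_b": "vp of sales ",
    "model_c": "Chief Revenue Officer",
}
# No two models agree, so nothing is accepted automatically.
conflicting = {
    "model_a": "VP of Sales",
    "model_b": "Head of Growth",
    "model_c": "Chief Revenue Officer",
}

print(cross_check(agreeing))
print(cross_check(conflicting))
```

Agreement between models reduces one-off fabrications, but it cannot catch an error that several models share, which is why such checks are usually paired with human review.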
Amarpreet Kalkat, the CEO, said that even small mistakes can make customers lose trust.
He recalled using an AI system to suggest potential sales leaders for hiring.
Two out of seven recommendations turned out to be entirely fictional.
One candidate was described as the former chief revenue officer at Freshworks.
The person was real, but the title was not.
Companies often respond by combining human review
with AI automation.
But as data volumes grow, manual oversight becomes increasingly difficult.
Dwivedi said that it becomes a wild goose chase.
The harder problem then is wrong connections.
See, fabricated facts are relatively easy to spot.
Incorrect relationships are much harder.
According to Vaikhari's research, AI models frequently get the facts right, but connect them
incorrectly.
Dwivedi explained that the issue isn't that the facts are wrong.
It's that the connection
between them is wrong.
Consider a medical example.
A terminal disease like skin cancer and a relatively harmless condition like sunburn may share
overlapping symptoms.
When an AI system sees those shared characteristics, it may infer a stronger relationship
between them than actually exists.
Because the facts themselves are correct, the flawed reasoning can be difficult to detect,
especially for non-experts.
Part of the challenge lies in the way AI learns.
Most models are trained on publicly available
text from the internet. But many causal relationships exist only in human reasoning, not in written
text. Ullas Nambiar, a partner at EY who works on AI transformation projects, explains the gap
with a simple example. Imagine someone posting on Reddit about taking an Uber to Bangalore
airport and reaching in an hour. What the post might not mention is that it was a public holiday
with unusually low traffic. Nambiar said, when AI learns from that data, it assumes Uber is the
reason the trip took one hour.
The missing context or the real causal factor never makes it into the training data.
Nambiar added that you need richer constructs and taxonomies, because almost everything
AI has been taught is from text that humans have created.
But relationship-heavy details, such as "because X happened, Y followed", are in our minds, and we don't
really write it down.
The implications also shape the global AI race.
See, on Vaikhari's leaderboard, most domestic systems appear well below the
leading global models. And one reason is structural. Large global players like Google or
OpenAI often focus on a handful of languages, primarily English and a few widely spoken
global languages. This allows them to build deeper data sets with richer contextual information.
India presents a different challenge. The country is highly multilingual, forcing domestic AI developers
to prioritize coverage across many languages rather than depth within a few. Companies such as
Sarvam AI and BharatGen therefore face a trade-off: breadth over depth.
Dwivedi believes one solution is to narrow the focus.
He said that the way to improve Indian models is to concentrate on a few languages
that cover the majority of the population.
Ironically, global companies are already moving in that direction,
expanding their operations in India while relying on the same depth-first strategy.
Fixing these problems may require more data, far more than current models have access to.
Better reasoning often demands richer contextual information, business data, private communications, operational records.
And that raises a trade-off.
Nambiar said that even when AI providers say they won't use private data,
certain insights eventually find their way into the system.
In other words, improving AI models could require sacrificing some degree of privacy.
The result is a paradox.
AI systems today are often confidently wrong.
But making them better may require opening the door to far more
data collection. Which means the real challenge for the AI economy may not just be building
smarter models. It may be deciding how much we are willing to reveal in order to train them.
