Daybreak - India’s AI still doesn’t speak India. Can it?
Episode Date: February 2, 2026

ChatGPT butchers Punjabi with spelling errors and Bollywood-style Hindi bleeding through. Hindi bots trained on newspapers miss dialects like Awadhi and Bhojpuri entirely, while Tamil AI ignores the rich variations between Kongu and Madurai speech.

Sure, Gurugram collected ₹200 crore in taxes using Hindi AI calls, but that's because Hindi dominates datasets. Most other languages remain stuck in translation hell. Private companies optimize for speed over nuance, government corpora like Bhashini sit underused, and multimodal data that captures tone and emotion is too expensive to build.

The result? AI is flattening India's 780 languages into sanitized, standardized versions that erase the very dialects it claims to serve.

Read the newsletter here. Find the Duolingo article here. Daybreak is produced from the newsroom of The Ken, India's first subscriber-only business news platform. Subscribe for more exclusive, deeply-reported, and analytical business stories.
Transcript
Hi, this is Rohin Dharmakumar.
If you've heard any of the Ken's podcasts, you've probably heard me, my interruptions, my analogies,
and my contrarian takes on most topics.
And you might rightly be wondering why am I interrupting this episode too?
It's for a special announcement.
For the last few months, Seetharaman, my colleague and the Ken's deputy editor, and I
have been working on an ambitious new podcast.
It's called Intermission.
We want to tell the secret sauce stories of India's greatest companies.
Stories of how they were born, how they fought to survive, how they built their
organizations and culture, how they managed to innovate and thrive over decades, and most
importantly, how they're poised today.
To do that, Sita and I have been reading books, poring over reports, going through financial
statements, digging up archives, and talking to dozens of people.
And if that wasn't enough, we also decided to throw video into the mix.
Yes, you heard that right. Intermission has also had to find its footing in the world of
multi-camera shoots in professional studios, laborious editing, and extensive post-production.
Sita and I are still reeling from the intensity of our first studio recording.
Intermission launches on March 23rd. To get an alert, as soon as we release our first episode,
please follow Intermission on Spotify and Apple Podcasts,
or subscribe to the Ken's YouTube channel.
You can find all of the links at the-ken.com slash IM.
With that, back to your episode.
Today you're in for a treat because we're doing something a little different.
I'm going to be reading out a really compelling edition
of one of the Ken's most popular subscriber-only newsletters, Make India Competitive Again,
written by my colleague, Inderpal Singh.
And it is titled, India's AI still doesn't speak India. Can it?
Now, in this newsletter, Inderpal asks a deceptively simple question. See,
the 2011 census in India recorded 121 distinct languages. Now, as both local and legacy AI companies
strive to reach populations in India that don't speak English or Hindi, which languages get
left out, and by extension, which groups of people?
Inderpal digs into how datasets in Indian languages mostly solve for either efficiency or quality.
But what ends up happening is that all of them flatten dialects and become overly standardized,
ultimately missing the whole point of India's linguistic reality.
Welcome to Daybreak, a business podcast from the Ken.
I'm your host, Rachel Berkes, and every day of the week, my co-host, Niktha Sharma and I
will bring you one new story that is worth understanding and worth your time.
Today is Monday, the 2nd of February.
Last week, I struck up a conversation with ChatGPT in my mother tongue, Punjabi.
And it wasn't great.
Instead of the immersive experience I hoped to have, the bot yanked me back to reality.
For one, it made basic spelling and pluralization errors and missed idiomatic meaning of certain phrases.
Second, its responses were peppered with the kind of Punjabi you typically hear in a Bollywood flick,
where Hindi words bleed into the Punjabi.
In fact, mastering the vernacular is the next big barrier for companies in the field
seeking to make a mark in India right now,
from OpenAI to the government-funded LLM initiative BharatGen,
to AI4Bharat, the AI research lab at IIT Madras.
Just in the last three months, the sector saw three significant developments.
In September, the government launched a beta version of Adi Vaani,
christened India's first AI-powered translator for tribal languages.
In November, photo-sharing app Instagram announced Meta AI voice translation for reels produced in five Indian languages:
Bengali, Telugu, Tamil, Kannada and Marathi.
In December, OpenAI rolled out a campaign promoting the use of ChatGPT in Indian languages.
This followed its release of INDQA, a benchmark to evaluate AI models' understanding
of Indian languages.
Now, these developments suggest momentum,
but the world of vernacular AI remains fragmented into two universes.
One, where datasets built by private players such as Pareto, Mercor and Welocalize,
which train models for the likes of Gemini, ChatGPT, and Perplexity,
solve for efficiency and are popular.
On the other hand, government datasets like Bhashini address accuracy and nuances,
but remain limited in their use.
These universes seldom cross over,
meaning datasets remain fragmented,
gaps between private and public corpora are wide,
and complexities of each language add to the chaos for AI companies
hoping to cater to millions of Indians for whom English isn't even the first language.
And that begs the question,
who's left out even amid this attempt at inclusion?
Earlier this year, the municipal corporation of Gurugram partnered with AI startup Sarvam for property tax collection.
Aditya Mutkal, who handles GTM, AI policy and government applications at Sarvam, said,
we used AI to remind people in Hindi to pay their tax.
They actually turned up and paid more than the previous year.
By August, the civic body reportedly collected rupees 200 crore in tax.
That's nearly three-fourths of its FY26 target. Half of it was thanks to AI bots calling people with the highest tax dues in real
time. The municipality of Manesar, a town half an hour from Gurugram, mopped up nearly
rupees 30 crore in just one month by replicating the Gurugram model. But this is the case with
Hindi, the most spoken language in India and the mother tongue of nearly 43% of the population.
In other words, a language that's better represented in datasets.
Anurag Shukla, a director at Brathe, a cultural and educational platform focused on Indian knowledge systems, wondered what happens when the Ministry of Agriculture wants to get in touch with farmers from, say, Andhra Pradesh.
They actually have to relay the message in the local language.
But it gets difficult if AI struggles with the intricacies of that language.
With non-English Indian languages, the possibilities are high but use cases remain few.
Shukla said that right now, most databases being built by the private sector for Indian languages are optimized on performance, basically speed and accuracy, not inclusivity.
And this results in AI being too standardized.
Take Hindi chatbots, for example.
Most Hindi AIs, according to Shukla, are trained on news articles from newspapers like Dainik Jagran or Bhaskar.
This standardized Hindi
misses the nuances of myriad dialects
such as Awadhi, Bhojpuri,
Braj, Bundeli, Khariboli and Haryanvi.
Shukla added that even if you look at the Hindi
that Duolingo teaches,
you'll see it's more formal and less colloquial.
Hold on, pause.
I just wanted to quickly tell you
that the Ken recently wrote about
Duolingo's troubles in an AI-disrupted market.
You can find that article linked in the show notes
along with this newsletter.
Now, back to the read aloud.
Take a language like Tamil, for instance, which also comes in many flavours:
Kongu Tamil, which is spoken in and around Coimbatore;
Madurai Tamil, which is spoken mostly in southern Tamil Nadu;
and the Tamil spoken by Brahmin communities,
which has a Sanskrit influence.
But datasets fail to account for this kind of heterogeneity.
Even Google's read-along feature has been pushing children towards one standard pronunciation,
flattening linguistically valid dialectal and accent variations.
G.N. Devy is the founder of the People's Linguistic Survey of India, a social movement in the early 2010s.
The movement documented around 780 languages spoken in the country.
Now, Devy speaks about shrinking vocabularies as AI adopts standardized approaches that miss the nuances of human tongues.
Money begets money, we know. But datasets beget datasets too.
Languages that are already resource-rich tend to get richer, according to Kalika Bali,
a linguist and senior principal researcher at Microsoft Research India.
And languages that lack data sets and basic digitization end up dying at the starting line of the AI cycle.
Shukla said that there are three kinds of corpora being built: speech, text and multimodal.
Speech consists of call center transcripts and controlled speech recordings.
Text corpus comes from news articles, textbooks, Wikipedia, government circulars and literary classics.
Multimodal contains a mix of audio and video input.
Shukla said, Multimodal is the kind of corpora that's been conventionally less focused on.
This is by far the most expensive and at once the most useful as it tends to capture the nuances of language,
emotion, tone and expression.
Video production and consumption in India, often in local languages, on platforms like YouTube
is among the highest in the world.
But AI developers cannot legally or technically use all of it.
Unlike text, where scraping for building datasets is relatively common,
video and its embedded audio and visual elements are legally complex and often protected
under copyright.
Structuring and labeling these to build datasets is also expensive.
M.J. Warsi, president of the Linguistic Society of India, said that AI struggles to account for dialectal variation, script diversity and cultural pragmatics in languages.
In Hindi or Bengali, forms of address such as aap, tum, and tu carry a social and pragmatic meaning that AI still struggles to model.
All three words mean you, but their usage varies depending on a person's relationship with another.
Warsi said that this is where multimodal datasets come into play.
They give AI the context it needs to understand.
Then there's also code mixing, where people routinely mix a spoken language with English in a single sentence,
making it difficult to train datasets.
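To make concrete why code mixing trips up models, here is a toy, illustrative sketch (not any company's actual pipeline, and the sample sentence is invented): a single utterance can interleave Devanagari and Latin script, so a per-word script check is often the very first preprocessing step before a model can even decide which language it is reading.

```python
# Toy sketch: detect Hindi-English code mixing by labelling each word
# with its Unicode script. Purely illustrative, not a production pipeline.

def script_of(word: str) -> str:
    """Label a word by the script of its first alphabetic character."""
    for ch in word:
        if '\u0900' <= ch <= '\u097F':   # Devanagari block (Hindi, Marathi, ...)
            return "devanagari"
        if ch.isascii() and ch.isalpha():
            return "latin"
    return "other"

def tag_code_mixing(sentence: str) -> list[tuple[str, str]]:
    """Tag each whitespace token with its script, exposing code mixing."""
    return [(w, script_of(w)) for w in sentence.split()]

# An invented code-mixed sentence: Hindi syntax carrying English words.
tags = tag_code_mixing("मुझे ये movie बहुत boring लगी")
scripts = {s for _, s in tags}
print(tags)
print("code-mixed" if {"devanagari", "latin"} <= scripts else "single-script")
```

Real systems go further, since romanized Hindi ("mujhe") defeats a script check entirely; but even this crude pass shows why a monolingual training corpus misrepresents how many Indians actually speak.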
Sorab Khatri, who is building Koko, an AI companion for spoken English that helps students transition from Indian languages to global English,
said, we found that existing Western models failed when a
child mixes Hindi syntax with English vocabulary. We're engaging with ecosystems like
AI4Bharat to understand the intent in Indian vernacular, using that as the bridge to teach
global English. Khatri is an ex-Amazon techie and is betting on internal voice data
to fine-tune models as he finds it more reliable than existing data sets. Globally, the European
Union faced something similar. It's now working to address limitations affecting low resource
official languages such as Irish, Maltese and Latvian by using the multilingual data generated
by EU institutions to contribute to their ecosystem of LLMs.
In contrast, India lacks this kind of institutional backbone for language data.
While the country has 22 scheduled languages, government and institutional communication
usually defaults to English and, to a lesser extent, Hindi, with far fewer official texts produced in languages
such as Assamese, Odia or Konkani.
This sharply limits the availability of high-quality parallel datasets,
one of the core ingredients that made the EU's approach viable.
So when AI systems erase the rich mosaic of dialects from Awadhi to Kongu Tamil,
they also end up reshaping linguistic identities,
pushing millions towards standardized versions of their own languages.
Until AI can speak to Indians in their own voices,
India's AI revolution will remain, quite literally, lost in translation.
Daybreak is produced from the newsroom of the Ken, India's first subscriber-focused business news platform.
What you're listening to is just a small sample of our subscriber-only offerings.
A full subscription offers daily long-form feature stories, newsletters and a whole bunch of premium podcasts.
To subscribe, head to the-ken.com and click on the red subscribe button at the top of the Ken website.
Today's episode was hosted and produced by my colleague Rachel Virgis and...
