Daybreak - India’s AI still doesn’t speak India. Can it?

Episode Date: February 2, 2026

ChatGPT butchers Punjabi with spelling errors and Bollywood-style Hindi bleeding through. Hindi bots trained on newspapers miss dialects like Awadhi and Bhojpuri entirely, while Tamil AI ignores the rich variations between Kongu and Madurai speech. Sure, Gurugram collected ₹200 crore in taxes using Hindi AI calls, but that's because Hindi dominates datasets. Most other languages remain stuck in translation hell. Private companies optimize for speed over nuance, government corpora like Bhashini sit underused, and multimodal data that captures tone and emotion is too expensive to build. The result? AI is flattening India's 780 languages into sanitized, standardized versions that erase the very dialects it claims to serve.

Read the newsletter here. Find the Duolingo article here.

Daybreak is produced from the newsroom of The Ken, India's first subscriber-only business news platform. Subscribe for more exclusive, deeply-reported, and analytical business stories.

Transcript
Starting point is 00:00:01 Hi, this is Rohin Dharmakumar. If you've heard any of the Ken's podcasts, you've probably heard me, my interruptions, my analogies, and my contrarian takes on most topics. And you might rightly be wondering why I'm interrupting this episode too. It's for a special announcement. For the last few months, Seetharaman Ganesh, my colleague and the Ken's deputy editor, and I have been working on an ambitious new podcast. It's called Intermission.
Starting point is 00:00:29 We want to tell the secret-sauce stories of India's greatest companies. Stories of how they were born, how they fought to survive, how they built their organizations and culture, how they managed to innovate and thrive over decades, and most importantly, how they're poised today. To do that, Seetha and I have been reading books, poring over reports, going through financial statements, digging up archives, and talking to dozens of people. And if that wasn't enough, we also decided to throw video into the mix.
Starting point is 00:01:01 Yes, you heard that right. Intermission has also had to find its footing in the world of multi-camera shoots in professional studios, laborious editing, and extensive post-production. Seetha and I are still reeling from the intensity of our first studio recording. Intermission launches on March 23rd. To get an alert as soon as we release our first episode, please follow Intermission on Spotify and Apple Podcasts, or subscribe to the Ken's YouTube channel. You can find all of the links at the-ken.com/im. With that, back to your episode.
Starting point is 00:01:45 Today you're in for a treat because we're doing something a little different. I'm going to be reading out a really compelling edition of one of the Ken's most popular subscriber-only newsletters, Make India Competitive Again, written by my colleague, Inderpal Singh. And it is titled, India's AI still doesn't speak India. Can it? Now, in this newsletter, Inderpal asks a deceptively simple question. See, the 2011 census in India recorded 121 distinct languages. Now, as both local and legacy AI companies strive to reach populations in India that don't speak English or Hindi, which languages get
Starting point is 00:02:28 left out, and by extension, which groups of people. Interpal digs into how data sets in Indian languages either solve mostly for efficiency or quality. But what ends up happening is that all of them heard dialects and become overly standardized, ultimately missing the whole point of linguistic reality. Welcome to Daybreak, a business podcast from the Ken. I'm your host, Rachel Berkes, and every day of the week, my co-host, Niktha Sharma and I will bring you one new story that is worth understanding and worth your time. Today is Monday, the 2nd of February.
Starting point is 00:03:03 Last week, I struck up a conversation with ChatGPT in my mother tongue, Punjabi. And it wasn't great. Instead of the immersive experience I hoped to have, the bot yanked me back to reality. For one, it made basic spelling and pluralization errors and missed the idiomatic meaning of certain phrases. Second, its responses were peppered with the kind of Punjabi you typically hear in a Bollywood flick, where Hindi words bleed into the Punjabi. In fact, that's the next big barrier for companies in the field seeking to make a mark in India right now,
Starting point is 00:03:57 master the vernacular. From Open AI to government-funded LLM initiative, Bharajan, to the AI Research Lab at IIT Madras, AI for India. Just in the last three months, the sector saw three significant developments. In September, the government launched a beta version of Adivani, Christianed India's first AI-powered translator for tribal languages. In November, photo-sharing app Instagram announced meta-AI voice translation for reels produced in five Indian languages, Bengali, Telugu, Tamil, Kannada and Marathi.
Starting point is 00:04:34 In December, OpenAI rolled out a campaign promoting the use of chat TPP in Indian languages. This followed its release of INDQA, a benchmark to evaluate AI models understanding, of Indian languages. Now, these developments suggest momentum, but the world of vernacular AI remains fragmented into two universes. One, where datasets built by private players such as Pareto, Mercor and V localized, which train models for the likes of Gemini, Chad GPT, and perplexity, solves for efficiency and are popular.
Starting point is 00:05:10 On the other hand, government data sets like Barshini address accuracy and nuances, but remain limited in their use. These universes seldom cross over, meaning datasets remain fragmented, gaps between private and public corpora are wide, and complexities of each language add to the chaos for AI companies hoping to cater to millions of Indians for whom English isn't even the first language. And that begs the question,
Starting point is 00:05:40 who's left out even amid this attempt at inclusion? Earlier this year, the municipal corporation of Guru Graham partnered with AI startup Sarvam for property tax collection. Aditya Mutkal, GTM, AI policy and government applications at Sarvam said, we used AI to remind people in Hindi to pay their tax. They actually turned up and paid more than the previous year. By August, the civic body reportedly collected rupees 200 crore in tax. That's nearly three-fourths of its FYT. 26 target. Half of it was thanks to AI bots calling people with the highest tax dues in real
Starting point is 00:06:20 time. The municipality of Manesar, a town half an hour from Gurugram 2, mopped up nearly rupees 30 crore in just one month by replicating the Gurugram model. But this is the case with Hindi, the most spoken language in India and the mother tongue of nearly 43% of the population. In other words, a language that's better represented in datasets. Anurag Shukla, a director at Brathe, a cultural and educational platform focused on Indian knowledge systems, wondered what happens when the Ministry of Agriculture wants to get in touch with farmers from, say, Andhra Pradesh. They actually have to relay the message in the local language. But it gets difficult if AI struggles with the intricacies of that language. With non-English Indian languages, the possibilities are high but use cases remain few.
Starting point is 00:07:11 Shukla said that right now, most databases being built by the private sector for Indian languages are optimized on performance, basically speed and accuracy, not inclusivity. And this results in AI being too standardized. Take Hindi chatpots, for example. Most Hindi AIs, according to Shukla, are trained on news articles from newspapers like Dheenik Jagran or Bhaskar. This standardized Hindi misses the nuances of myriad dialects such as Avdi, Bhuchpuri, Braj, Bundali, Khariboli and Haryanvi.
Starting point is 00:07:48 Shukla added that even if you look at the Hindi that Duolingo teaches, you'll see it's more formal and less colloquial. Hold on, pause. I just wanted to quickly tell you that the Ken recently wrote about Duolingo's troubles in an AI-disrupted market. You can find that article linked in the show notes
Starting point is 00:08:05 along with this newsletter. Now, back to the read aloud. Take a language like Tamil for instance, which also comes in many flavours. Kongu Tamil which is spoken in and around Koimbutur, Madurai Tamil which is spoken mostly in southern Tamil Nadu, and Tamil which is spoken by Brahman communities, which has a Sanskrit influence. But datasets fail to account for this kind of heterogeneity.
Starting point is 00:08:31 Even Google's read-along feature has been pushing children towards one standard pronunciation. flattening linguistically valid dialectal and accent variations. G.N. Devi is the founder of People's Linguistic Survey of India, a social movement in the early 2010s. The movement documented around 7 80 languages spoken in the country. Now, Davy speaks about the shrinking vocabularies as AI adopts standardized approaches, which misses the nuances of human tongues. Money begets money, we know. But datasets bigot datasets too. Languages that are already resource-rich tend to get richer, according to Kalika Bali, a linguist and senior principal researcher at Microsoft Research India.
Starting point is 00:09:16 And languages that lack data sets and basic digitization end up dying at the starting line of the AI cycle. Dharat Shukla said that there are three kinds of corpora being built. Speech, text and multimodal. Speech consists of call center transcripts and controlled speech recordings. Text corpus comes from news articles, textbooks, Wikipedia, government circulars and literary classics. Multimodal contains a mix of audio and video input. Shukla said, Multimodal is the kind of corpora that's been conventionally less focused on. This is by far the most expensive and at once the most useful as it tends to capture the nuances of language,
Starting point is 00:09:58 emotion, tone and expression. Video production and consumption in India, often in local languages, on platforms like YouTube is among the highest in the world. But AI developers cannot legally or technically use all of it. Unlike text, where scraping for building datasets is relatively common, video and its embedded audio and visual elements are legally complex and often protected under copyright. Structuring and labeling these to build datasets is also expensive.
Starting point is 00:10:30 M.J. Varsie, president of the Linguistic Society of India, said that AI struggles to account for dialectal variation, script diversity and cultural pragmatics in languages. In Hindi or Bengali, the forms of address such as A, Thum, and Thu carry a social and pragmatic meaning that AI still struggles to model. All three words mean new, but their usage varies depending on a person's relationship with another. Varsi said that this is where multimodal data sets come into play. They give AI the context it needs to understand. Then there's also code mixing, where people routinely mix a spoken language with English in a single sentence, making it difficult to train datasets. Sorab Khatri, who is building Koko, an AI companion for spoken English helping students transition from Indian languages to global English,
Starting point is 00:11:22 said that we found that existing Western models failed when a change. child mixes Hindi syntax with English vocabulary. We're engaging with ecosystems like AI for Bharat to understand the intent in Indian vernacular, using that as the bridge to teach global English. Khatri is an ex-Amazon techie and is betting on internal voice data to fine-tune models as he finds it more reliable than existing data sets. Globally, the European Union faced something similar. It's now working to address limitations affecting low resource official languages such as Irish, Maltese and Latvian by using the multilingual data generated by EU institutions to contribute to their ecosystem of LLNs.
Starting point is 00:12:08 In contrast, India lacks this kind of institutional backbone for language data. While the country has 22 scheduled languages, government and institutional communication usually defals to English and to a lesser extent Hindi, with far fewer official texts produced in languages such as Asamese, Odia or Konkini. This sharply limits the availability of high-quality parallel datasets, one of the core ingredients that made the EU's approach viable. So when AI systems erase the rich mosaic of dialects from Avati to Congo Tamil, they also end up reshaping linguistic identities,
Starting point is 00:12:47 pushing millions towards standardized versions of their own languages. Until AI can speak to Indians in their own voices, India's AI revolution will remain, quite literally, lost in translation. Daybreak is produced from the newsroom of the Ken, India's first subscriber-focused business news platform. What you're listening to is just a small sample of our subscriber-only offerings. A full subscription offers daily long-form feature stories, newsletters and a whole bunch of premium podcasts. To subscribe, head to the Ken.com and click on the red subscribe button on the topic of the Ken website. Today's episode was hosted and produced by my colleague Rachel Virgis and...
