ACM ByteCast - Partha Talukdar - Episode 52
Episode Date: April 23, 2024

In this episode of ACM ByteCast, Bruke Kifle hosts Partha Talukdar, Senior Staff Research Scientist at Google Research India, where he leads a group focused on natural language processing (NLP), and an Associate Professor at the Indian Institute of Science (IISc) Bangalore. Partha was previously a postdoctoral fellow at Carnegie Mellon University's Machine Learning Department and received his PhD in computer and information science from the University of Pennsylvania. He is broadly interested in natural language processing, machine learning, and making language technologies more inclusive. Partha is a co-author of a book on graph-based learning and the recipient of several awards, including the ACM India Early Career Researcher Award for combining deep scholarship of NLP, graphical knowledge representation, and machine learning to solve long-standing problems. He is also the founder of Kenome, an enterprise knowledge graph company with the mission to help enterprises make sense of big dark data. Partha shares how exposure to language processing drew him to languages with limited resources and NLP. He and Bruke discuss the role of language in machine learning and whether current AI systems are merely memorizing and reproducing data or are actually capable of understanding. He also talks about his recent focus on inclusive and equitable language technology development through multilingual-multimodal Large Language Modeling, including Project Bindi. They discuss current limitations in machine learning in a world with more than 7,000 languages, as well as data scarcity and how knowledge graphs can mitigate this issue. Partha also shares his insights on balancing his time and priorities between industry and academia, recent breakthroughs that were impactful, and what he sees as key future achievements for language inclusion.
Transcript
This is ACM ByteCast, a podcast series from the Association for Computing Machinery,
the world's largest educational and scientific computing society.
We talk to researchers, practitioners, and innovators
who are at the intersection of computing research and practice.
They share their experiences, the lessons they've learned,
and their own visions for the future of computing.
I am your host, Bruke Kifle.
As technology rapidly evolves, NLP stands at the forefront.
From GPT to Gemini and Llama, language technologies, or large language models as they're known,
are reshaping our intelligent systems at a rapid pace and transforming how we generate and interact with information.
Amidst this transformation, however, it's crucial to equally prioritize the vital importance of inclusivity in language technologies, considering their substantial impact on access to information
and opportunities. Our next guest, Partha Talukdar, is driving advancements in machine
learning and NLP while advocating for and ensuring more
inclusive and equitable language technologies. Partha is a senior staff research scientist at
Google Research India, where he leads a group focused on natural language processing. He's
also an associate professor at the Indian Institute of Science, Bangalore. Previously,
Partha was a postdoctoral fellow in the Machine Learning Department at Carnegie Mellon University.
He received his PhD in computer and information science from the University of Pennsylvania.
Partha is broadly interested in natural language processing, machine learning, and in making language technologies more inclusive.
Partha is a recipient of several awards, including an Outstanding Paper Award at ACL 2019 and the 2022 ACM India Early Career Researcher Award.
He is a co-author of a book on graph-based semi-supervised learning. Partha, welcome to ByteCast.
Hi, Bruke. Great to be here, and thanks for having me.
You know, you have such a remarkable and interesting career that spans both academia
and industry, having your undergraduate experience in India, coming
to the US for your graduate studies and postdoc, and then now returning.
I'm very interested to learn: what are some of the key points in that personal and
professional journey that led you into the field of computing, and also
motivated you to pursue your current field of study, language technologies?
Sure.
Yeah, I'm happy to chat about that. So I got into computer science, I mean, I had some exposure
to computer science during my school days, but it was really during undergrad,
when I took it up as my major. And especially the transition into NLP and AI
really happened during a summer fellowship
that I got at the Indian Institute of Science, where I actually have a faculty position
now, during the third year of my undergrad. I think it was around the summer of 2002.
I was working on networking technologies before then, but when I got that summer fellowship at IISc, which is what the Indian Institute of Science is
shortened to, I got exposure to language processing. It seemed really interesting.
And then one thing led to another, and 20-plus years after that, I'm still
working in language technologies now. I was really fascinated at that time by how we can extract information from language.
Interestingly, I was working on low-resource languages even then, and then veered off into doing other things in NLP.
And now at Google Research, I'm back to working on languages with limited resources and how we can make language technology more inclusive. So it has been a full circle that way, both in terms of geography
and in terms of the topics within NLP that I have covered in my research career so far.
And, you know, I think it must be interesting. Clearly, now, if you ask anybody about
ChatGPT or GenAI, I think it's the hottest topic of the year.
But I'm sure 20 years ago, when you initially made your journey into this field,
it was a pretty new domain and area. Now, we have these LLMs that are generating a ton of buzz.
And clearly, over the course of a decade or two, there's been a lot of transformation
in the AI space, primarily as a result of deep learning. So what are your thoughts on
language as a pathway to achieving artificial general intelligence, which of course
is the North Star or the goal? Mainly when we think about language playing a very pivotal
role in human cognition and communication, how have some of the advancements
that you've seen in NLP led you to think that there might be something here?
Right, yeah. Language, as you rightly said, is a central component of communication
and cognition, and it's super important. And recognizing this, right from the 1950s, there
has been work on language processing.
In fact, people early on thought that machine translation would be a solved problem in the
50s and 60s.
But of course, it took much, much longer than that.
And while we have made significant progress, even within translation there is a
lot more work to be done.
So the importance of language processing and NLP has been there all along.
But of course, language modeling, in fact large language modeling, making all this progress in recent years, has really brought it into the limelight. In terms of how I have seen the area transition during my research
career so far: earlier, you would have customized models for different tasks, even within NLP. Say, if
you were interested in information extraction, or machine translation, or parsing, you would have customized models
for each one of those tasks separately, right?
And even within NLP,
it was quite challenging to change topics,
because there was a lot of groundwork
that you had to do in order to build
the baseline systems
in the particular subtopic you were working on.
Things like that have seen a sea change now,
with one general-purpose pre-trained language model
that you could use either with some instructions
or with some fine-tuning.
You could make it do multiple tasks,
not only within language processing, but increasingly, with multimodal systems, you are able to go across
modalities as well, the same model working with, say, speech, images, and text. All of that
would have been very hard to predict back in the day, that within this short span of time we would
come to this kind of homogenization in terms of modeling. And I think that has happened
in stages. If you go all the way back, from rule-based methods to machine learning,
we started working with data-driven methods.
In that case, you were building different models for different tasks,
and also maybe using different types of algorithms, like SVMs and decision trees or CRFs. Then with neural
networks and deep learning, there was standardization in terms of the learning
algorithms. You would use deep learning for all of it, but for different tasks, you would still
use different models. Now with language models, there is homogenization
in terms of the model also. Now you have a
single model doing multiple things. So in stages, there has been more and more
standardization and homogenization across tasks and across modalities, which has enabled
researchers to move across these different problems, and also enabled the transfer
of knowledge across these different tasks.
So, lots of changes. And then going back to your question on
the importance of language: if we look at all the advances that we have been
celebrating, large language models have been at the forefront and have
demonstrated how this kind of self-supervised learning, done at scale, can
result in lots of interesting capabilities that would have been hard to anticipate a
few years back.
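[Editor's note: To make the "one model, many tasks" point concrete, here is a minimal sketch of a single instruction-tuned model handling tasks that once needed separate, customized systems. The model choice and prompts are illustrative assumptions, not from the episode.]

```python
# A minimal sketch of the "one model, many tasks" idea: a single
# general-purpose, instruction-tuned model replaces separate task-specific
# systems. Model choice and prompts are illustrative assumptions.
from transformers import pipeline

model = pipeline("text2text-generation", model="google/flan-t5-base")

tasks = [
    "Translate English to German: The weather is nice today.",
    "Summarize: Large language models are trained on web-scale text and "
    "can follow natural-language instructions to perform many tasks.",
    "Is this review positive or negative? I loved this film.",
]
for prompt in tasks:
    # The same frozen model handles translation, summarization, and
    # sentiment, steered only by the instruction in the prompt.
    print(model(prompt, max_new_tokens=40)[0]["generated_text"])
```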
You raise a very good point around this idea of generalizability, where we have single pre-trained foundational models that are basically able to serve multiple functions or use cases.
A big part of that is the ability to learn things not just for specific tasks, but to really capture intelligence, I guess, depending on how you define intelligence. But do you believe that,
with these LLMs, we're seeing a case of systematic learning? Or, as some schools of thought hold,
are these just giant databases? Are they actually exhibiting the learning
and reasoning capabilities that are essential to deliver on a lot of these
tasks, whether it be summarization, sentiment analysis, or Q&A.
What are your thoughts? So I think there is definitely acquisition of
knowledge of the world. But in addition to that,
it's not just memorization and reproduction of what is there in the data.
The ability to understand the intent of what users are saying,
and with those limited instructions
to try to fulfill a task, requires lots of reasoning
and generalization capability.
So I think there is no doubt about that.
Even if you ask it to write a piece of text
in the style of some author who may never have written about that particular topic.
Say I ask it to write about LLMs in the style of Shakespeare:
these language models will happily comply with that type of request, but nowhere in the pre-training or fine-tuning data
would those kinds of combinations be present.
So it is able to learn some patterns and apply them
in novel situations that were clearly not present in the training
data.
So there is definitely generalization happening, but to what
scale, and how those things are happening, are, I guess, topics of ongoing investigation. I guess we have very little
understanding of that as a community currently.
Yeah, very interesting. One thing that
your research area, or focus, or passion centers on is making language technologies more
inclusive and accessible for
low-resource languages and underrepresented communities. And of course, when you think
about how these technologies are used in the real world, in real products, there are serious
implications, whether it be the digital divide or unequal access to information and opportunities,
when we don't make these technologies inclusive.
So could you describe some of the current gaps or limitations
in NLP in the context of low-resource
or underrepresented languages?
And then, what are some of the ongoing efforts
to try and address some of these gaps?
Sure, yeah.
I'm very happy to talk about that.
That's a topic I'm really passionate about.
But Bruke, before going there,
I just want to mention something about the importance of language,
related to your first question.
Would it be okay to address one more point there?
Yes, yes, please.
Okay, yeah.
On that particular point:
one of the things, I guess,
which has resulted in this excitement is the ease of the interface between these AI models and humans,
which has happened through, again, natural language. Now,
all of these capabilities are available just by giving some instructions or
prompts. People without any expertise in computer
science or machine learning are able to deal with these language models and
have access to AI capabilities. I think that has really increased the scope,
has excited even non-experts, and has made those capabilities available to a broader mass. So to me, natural
language as an interface, and the recognition of that, has been a major
advance through these language models. That's one part. The other one is
how these language models have acquired all of this knowledge about the world, which has
resulted in these kinds of impressive capabilities.
That's also through large corpora, where humans over the years have documented
knowledge about the world. So both in terms of storage of knowledge of the
world around us, and as an interface to these AI models,
language has been a core enabler in both of these two aspects.
So now coming to your point about inclusivity. The question there is that
if we know, say, English and a handful of other languages, then the capabilities of these
AI and language models are available to us. But there is a vast majority of languages in
the world where these models don't do very well. Just to give you some idea:
currently, these language models may work well for, say, a few dozen languages.
Even if you look at language technologies more broadly, beyond language models, we have capabilities in maybe a few dozen languages across the world.
But in the world, we have more than 7,000 languages, and for the vast majority of those languages, beyond these few
dozen, we have no usable language technologies, be it in the form of speech technologies or
translation technologies. So it runs the risk that for those who have access
to these technologies, access to information and opportunities will become easier and easier,
while leaving behind a vast part of the world's population from having meaningful opportunities
to leverage these technologies. That seems not an ideal state to be in. So my research and my
group's work has been focused on how we can make these kinds of capabilities
available to speakers of a larger number of languages. Now, when I think of LLMs,
the notion of yin and yang comes to my mind.
As I mentioned, if we leave things as is, then this gap,
depending on the languages that you know, in the access and opportunities
to information that you have,
is going to grow.
But at the same time, I feel that language models are also the best tools that we have
at our disposal right now to reduce this language-based barrier.
But that's going to require some concentrated effort on our part to make these kinds of
capabilities and models more inclusive.
And that's where our research has been focused.
So one thing, I guess, is that when we are looking at this broader set of languages, we are talking
about diverse geographies, and people coming from different cultures. So one
thing that we need to understand is what their core needs are, what kinds
of language technologies could best serve their use cases. That's
one part. Then, if we look at scaling the existing capabilities,
be it translation or speech recognition and synthesis, to speakers of more languages, where it
makes sense, the lack of data is a challenge. Right now, the recipe we have
is that if you have data and compute, we have good methods to build these kinds of models,
which can have the very interesting capabilities we have all seen. But lots
of languages around the world don't have as good a representation on the web;
a lot of them are not well represented digitally.
That creates a challenge. So how we can work with communities of speakers across these
different geographies and cultures, to make technologies relevant to them while dealing with these data-sparsity
problems, is a central issue as we look at scaling these kinds of methods.
And at Google, we have this thousand-languages moonshot, where we are looking at building
language technologies for a thousand languages around the world.
And we are trying to address some of these issues in a systematic manner. I see. So the data issue is definitely a big problem. This idea of data
scarcity: we mentioned somewhere on the order of 7,000 languages, with maybe a small select few that
are probably well served by these existing models. How do we address this data-scarcity challenge at
scale? What are sustainable, scalable solutions?
Right, yes. That's a super important question. One part is the importance of representative data. When we are trying to build these models, it's important that,
whatever the end use cases are and whichever communities are going to use these kinds of models,
all the nuances
that are going to affect the user experience when they are using these models
are covered. Now, one example of that effort is an effort called Vaani, which Google is supporting
and the Indian Institute of Science is driving. There, the goal is to collect
representative speech data, or in fact, as we
call it, to collect the speech landscape of all of India. We are collecting image-prompted
speech data from all districts of India. And it's motivated by the fact that when a language
is spoken, there is variation across regions. Even for one language, in a multilingual society,
depending on what other languages are being spoken, there are variations in how that
language gets spoken. So in Project Vaani, we are taking a region-anchored approach
rather than a language-anchored approach, where we show people locally relevant images in a particular district. A district is like, say, a
county in the US. And then we ask users, or contributors, to describe those images in a
language of their choice. And we have been really amazed, when you give people this opportunity to
express themselves in a language of their choice rather than being prescriptive about it,
at the diversity of languages
that they use to describe, in this particular case, the images. We have had instances
where people used endangered tribal languages that we would never have thought about
collecting data from. So we are capturing all of this
diverse data, and we are also making all of it open source,
and 10% of it is being transcribed. Building these kinds of data ecosystems, where
multiple organizations, and even people who are passionate about building language technologies,
can pool resources to collect representative data that covers the
on-the-ground variations, I think is super important. And Vaani is one effort of that kind.
The second one is, as I said, when we are building these kinds of models and then deploying them,
making them available across cultures and geographies, there are variations in terms of local norms and the axes of harms that are
there in these different geographies. For example, if I take the case of India from a responsible
AI lens and contrast that with, say, the US: while we have shared axes of discrimination, say gender,
in India we have additional ones, say caste or regionality. So it's important that
we test our models along these kinds of region- and culture-specific dimensions and take necessary
mitigation steps, to make sure that these models have been tested and mitigated from
these fairness and bias perspectives. This is what we call recontextualized RAI:
responsible AI that has been recontextualized from the perspective of the target geography and culture.
It's really important that we work with communities who have historically been
at the receiving end of these kinds of societal biases, that we work with them to understand
their needs and what kinds of discrimination are there, because otherwise it might be very hard to
predict what kinds of issues exist if we don't have lived experience and on-the-ground knowledge about
these things. So, two points here, primarily. One is working with representative data,
and the other is finding scalable ways of working with communities to make sure that the responsible
AI aspects are covered. For that, we have an effort called Bindi, where we are trying to do
exactly that, using a complementary approach of scaling using LLMs while also working
with communities to learn from their rich and varied experiences, and then incorporating those
into the models. You know, I think that's such a great point. Beyond the data representation issue, which affects the usefulness of some of these models for certain communities, there is the importance of understanding region-specific biases. There has been a strong community around responsible AI, and I
think a lot of the fairness and ethics principles that we explored for classical machine
learning are now very different in the age of LLMs, right? We're thinking about new harms and
new responsible AI concerns. But I think the two solutions, one rooted in representative data, and two,
close collaboration with communities, ensure that in the many cases where we may not even be
aware of or understand the local or regional context, we still get it right. So I think that's a very, very great point that
you raised. ACM ByteCast is available on Apple Podcasts, Google Podcasts, Podbean, Spotify, Stitcher, and TuneIn.
If you're enjoying this episode, please subscribe and leave us a review on your favorite platform.
You know, one thing I want to touch on: you lead a group focused on NLP at Google Research India.
And we talked about some of the work that's happening around the inclusion work stream. What are some of the other key
projects or initiatives your team is currently working on that you're excited about?
Yeah, so a lot of our work, or pretty much all of our work, is centered around large language
models and how we can make them inclusive and responsible. I mentioned the thousand-languages moonshot
at Google; a lot of the work that we do is part of that initiative. Also, since we
are situated in India, we think of India as a microcosm of the Global South, and we try to take
inspiration from here and develop methods with the hope that if we are able
to build something that works here, it could be applicable more broadly in other
geographies and locales with similar characteristics. We have had some success
in that direction, and we want to do more. So inclusive and responsible LLMs, linguistically inclusive
LLMs, and doing that in a responsible way, has been our core focus. Initially, we started off
with text as the modality, but now we are also extending that to include speech as an additional
modality, and going more towards multimodal versions of these kinds of models.
Specifically in the Indian context, one of the problems we are looking at is how we can
build language models for 100-plus Indian languages. India is a very linguistically
diverse country. Based on the last census, which happened in 2011, we have a total of some 1,300-plus
languages.
We have about 120-plus languages which are spoken by 100,000-plus speakers
each, and 60-plus languages which are spoken by more than a million speakers each.
And in the constitution, 22 languages are officially recognized. Language technologies
right now are available for maybe around the top 10 of these languages in terms of
number of speakers. So, as you can see, there is a big gap: a large
number of speakers of these languages have either no usable language technologies
or no language technologies at all.
So we are looking at how we can build a speech-text model
for the speakers of all of these languages.
I think that's very exciting.
And you raised a really good point,
which is that India has a large population,
and not just in numbers, but also in diversity, linguistically,
as you've mentioned, and culturally.
And so being able to develop solutions that cater to such a wide and
diverse group can actually serve as a great model for scalable solutions that also serve
the broader global population. So I think it's quite exciting to be able to not only innovate
and work on advancements in this space,
but to do it in a context, a region, and a population
where you're able to get the same level of diversity
that you encounter when deploying these kinds of solutions
to larger groups.
So that's very exciting.
One thing that
I do want to call out is that, along with your co-authors, you received an Outstanding
Paper Award for your work on word sense disambiguation, which of course is topically
related to some of the work happening in LLMs. Can you describe the problem for those who
may be unfamiliar, and share what the key insight or set of
insights of your paper were? Sure, yeah. So word sense disambiguation is the problem that, in
many languages, including English, the same word can take different meanings depending on
the context in which it is used. For example, take the word "bank": it could mean a financial bank,
as in "I went to the bank to deposit a check," versus a river bank, as in "I took a nice
stroll along the bank." So the word sense disambiguation problem is: given a particular context,
how do you identify which sense a particular word in that context
is expressing?
Usually you have some pre-identified senses of the words,
and the word sense disambiguation problem is
how you can develop an algorithm or a method
which, given a word in a particular context,
can tell you, out of the n possible senses, which one is being expressed there. People had
looked at this problem as a problem of classification with discrete labels: given the
word "bank" in a particular sentence, which one of these, say, four senses is it expressing? The key idea in the
ACL 2019 paper was to think about those senses not as discrete labels, but in terms of
their embeddings. For example, if we could represent them as vectors in some vector space, then we could expand to new and unseen senses that were not seen
in the training data. Why was this important? Because when you are treating these senses
as discrete labels for classification, the senses that you had not seen in your
training data have no possibility of being predicted
at test time. Basically, if an unseen sense shows up
at test time, you have no hope of making that prediction. And also, for a given word,
some senses are more popular than others, and the popular senses tend to get
biased treatment from learning algorithms. So we had embedding-based methods to overcome
some of these problems. And we also showed how lexical knowledge in the form of, say,
WordNet, where you have word-to-word relationships, whether one word
is an antonym or synonym of another, and glosses, which are short examples
and definitions of these senses, can be used. Utilizing those kinds of supplementary resources and
knowledge, we demonstrated a way of making this word sense disambiguation problem
more robust, flexible, and extendable to new senses that may not have been seen at training time.
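[Editor's note: As a rough illustration of that embedding idea (not the paper's exact method, which built on WordNet sense definitions), here is a minimal sketch: score each candidate sense by the similarity between a contextual embedding of the sentence and an embedding of the sense's gloss. The model name and glosses below are illustrative assumptions.]

```python
# A minimal sketch of embedding-based WSD, assuming sentence-transformers
# is available: senses are vectors (gloss embeddings), not discrete labels,
# so a new sense can be added at test time just by embedding its gloss.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical WordNet-style glosses for two senses of "bank".
glosses = {
    "bank (financial)": "a financial institution that accepts deposits",
    "bank (river)": "sloping land beside a body of water",
}
sentence = "I took a nice stroll along the bank."

context_vec = model.encode(sentence, convert_to_tensor=True)
scores = {
    sense: float(util.cos_sim(context_vec,
                              model.encode(gloss, convert_to_tensor=True)))
    for sense, gloss in glosses.items()
}
# Pick the sense whose gloss embedding best matches the context.
print(max(scores, key=scores.get))  # expected: "bank (river)"
```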
I see. And when I think of use cases, I think specifically of machine translation. I've
observed many cases of this issue you're describing, where
the wrong word or the wrong context is used. Is this a use case where this technique can bring some improvements in quality?
Yes, yeah, definitely. That's a great example. This work was done in a pre-LLM world,
and now language models do a very good job of learning the meaning of
words in a contextual manner. But the possibility of utilizing these other
sources of lexical knowledge, like WordNet, and how we can still incorporate them in a language model,
is also an interesting question. Very interesting. On the topic of papers and publications, I also want
to quickly touch on a book that you authored a while back on graph-based
semi-supervised learning, which is a combination of two areas: semi-supervised
learning and graph-based learning. At the time, what were some of
the challenges that you observed in semi-supervised learning, or
opportunities, that led you to think about graph-based approaches? And now, in this
world of deep learning, LLMs, and generative AI, are there practical examples or use cases that you
see in the context of NLP? Sure, yeah, absolutely. Throughout my research career,
sparsity of data has been a common theme. We just talked at length about
linguistic inclusion and the data sparsity problems there. But even before that,
when I was looking at information extraction, and how we can bring more knowledge about the world
into machine learning algorithms, lack of data was again a recurring problem. Think
of being interested in learning about various types of entities and relationships,
be it people, mountains, diseases, or islands across the world, and what
relationships hold among them. If you think of doing this in a supervised learning
setup, where you provide training data for each and every type of knowledge,
then since there are so many different types of knowledge, you cannot provide lots of labeled examples for all of them.
You can only provide maybe a few examples of, say,
people, or capitals of countries, and so on, for different types of relations.
So the problem I looked at was how, given some small number of examples for large and
diverse types of knowledge, we can build machine learning models.
This is, again, a pre-LLM era I'm talking about. The observation there was that
doing all of this labeling with humans was a time-consuming process, so we could only get access
to a small number of labeled examples, but unlabeled data,
by which I mean corpora and documents on the web,
is available plentifully. So how could we utilize those kinds of unlabeled data,
combined with small amounts of labeled instances, to learn good models
for whatever end tasks we were interested in, say extraction or classification? That was the motivation
for semi-supervised learning, where you combine small amounts of labeled data with lots of unlabeled
data. Then the graph came into the picture. A graph is a very useful and versatile data structure.
We are all connected in one way or another, right? So networks and graphs provide
a very flexible way to represent knowledge about the world. Be it
one person connected with, or related to, another person or an institution in a social network,
or a biological network, or a transportation network, or knowledge graphs,
which capture knowledge about the world and its relationships, where the nodes
represent entities and the edges represent relationships among those entities. So that provided a flexible way
of representing various domains and world knowledge. That's the representation part.
And then, how we can do learning over those types of graphs with limited supervision
is how the semi-supervised learning part came about: how we could combine these two
pieces. That's where graph-based semi-supervised learning came into existence. People had looked at utilizing
graph-based semi-supervised learning for other problems, but some of our work, and that of other
researchers around that time, was among the first to apply those kinds of ideas within NLP. One of our initial
applications: say you give maybe five examples of watch manufacturers, right? Then,
given that data and access to the web, how could you significantly expand that list of
watch manufacturers and extract, say, a hundred others from the web,
given just those five examples? Those are some examples. And then subsequently,
this went into how we can build these knowledge graphs, the entity-relationship
graphs that I talked about. One big project that I was involved with during my postdoc
at CMU, led by Professor Tom Mitchell,
is a project called NELL, which stands for Never-Ending Language Learning, where the idea was to
build these kinds of knowledge graphs by reading web documents in a pretty much
self-supervised manner, and then to reuse this knowledge to improve the extractors, building
in a never-ending manner. That particular project ran for about
10 years, pretty much in a self-supervised manner, by applying these
kinds of semi-supervised learning ideas in a graph context. So that, I think, is one
concrete example of merging graphs and semi-supervised learning.
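[Editor's note: To make the graph-based semi-supervised learning idea concrete, here is a toy sketch of label propagation, a classic algorithm in this family (not necessarily the specific method from the book): a few labeled nodes spread their labels to unlabeled neighbors over the graph. The graph and labels are invented for illustration.]

```python
# A toy sketch of graph-based semi-supervised learning (label propagation):
# two labeled nodes spread their labels to the unlabeled rest of the graph.
import numpy as np

# Adjacency matrix of a 6-node graph: nodes 0-2 form one cluster, 3-5 another.
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

# Label distributions: node 0 is class 0, node 5 is class 1, rest unknown.
Y = np.zeros((6, 2))
Y[0, 0] = 1.0
Y[5, 1] = 1.0
labeled = {0, 5}

D_inv = np.diag(1.0 / A.sum(axis=1))  # row-normalization of the graph
F = Y.copy()
for _ in range(50):
    F = D_inv @ A @ F          # each node averages its neighbors' labels
    for i in labeled:
        F[i] = Y[i]            # clamp the known labels
print(F.argmax(axis=1))        # -> [0 0 0 1 1 1]
```

The same scheme scales to the settings described above by replacing the toy adjacency matrix with a similarity graph over, say, candidate entity phrases extracted from the web.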
Very interesting. I think it's cool to see
some of the practical applications, and also the benefit from the combination of two
methods of learning, semi-supervised and graph-based. So that's very awesome.
You know, I think we touched on a lot of interesting things throughout your research career. One thing that's actually quite impressive is that
you wear multiple hats. In addition to your research role at Google,
you are also an associate professor at the Indian Institute of Science. I think that's quite
interesting, being able to have a foot in both academia
and industry research. How do you balance your time and your priorities between these two roles?
Right, yeah. So I'm currently on leave from the university, so I'm not teaching on a
regular basis, but even after starting the position at Google, I continued advising
my PhD students,
who have graduated now.
But yeah, it was challenging.
But since I was working in roughly the same related areas, it wasn't
dragging me in different directions.
Also, my students were already towards the second half of their PhD journey,
so they were already quite independent.
That definitely helped in terms of managing the two sides.
And now we have these collaborative projects that I mentioned, like Vaani,
which is also with the same university.
So the engagements have morphed into different forms, but there is a strong
back and forth.
I see.
And I was actually going to touch on the benefits of working in both settings, but I think you alluded to the potential for collaborations between
academia and some of the work that's happening at Google Research. Do you find that
having a role in industry helps inform some of the research that happens in academia? Or is it vice
versa, where research in academia is helping push
some of the innovations in product? At least for you, which hat do you find inspiring or
informing more of your work? Yeah, no, that's a great question. In fact, I also had
a startup in between, and one of the primary reasons why I'm at Google is
that I wanted to see whether and how the research I'm doing is making any kind of
impact, or seeing use, in the real world.
Basically, to go the full path and understand how it's
getting used, what the drawbacks are, and then take
inspiration from there to inform the next set of research questions. I feel that's a
very productive way of making sure that you're working on important problems, and of having a strong
connection with industry, because many times industry is at the forefront
in terms of deploying products,
and of users using those products.
They get exposed to all of those problems.
So making those problems available to academic researchers,
and influencing them to work on them,
is definitely, I think, helpful.
That has been one of my motivations for doing
the startup and for being at Google, as I mentioned.
And even before this, during my PhD, I spent about a year at Google
across three different internships,
and that had a strong influence on my research trajectory.
So that way, I had exposure to,
and knowledge of, the benefits
of industry engagement.
And I have just continued to follow that even today.
I think that's a great point.
Of course, exploratory research is equally important,
and I think it's essential
for pushing the boundaries of science.
But thinking about grounding
some of the research we do in the
context of real-world problems or use cases can also be quite beneficial to ensuring
tangible short-term benefit as well. Yeah. And also, Bruke, you can
always take inspiration from those real-world use cases, and then you can
think in terms of the time scales
at which you want to solve them, right? Yes, yes. And then also, this could
be a matter of an individual researcher's taste, how much they want to be influenced by that.
Sometimes you want to solve that exact problem, but in other cases you want to keep that flavor in mind
but think about what could be a more general version of that problem, and then try to
address that in a more systematic way. So I think having good exposure is very valuable,
at least that's what I have found to be valuable, and depending on an individual researcher's taste,
they can decide how to
incorporate that into their research. Yes, yes, that's a great point. You know,
you introduced something very cool, that you had an entrepreneurial venture as well,
which was quite interesting. It seems like over the course of your career, you've spanned diverse
areas but still maintained a central focus on language technologies and NLP.
And I'm sure this past year, or two or three years, has been quite exciting. I feel like the
large language model advancements have been quite remarkable, but every day there seems to be
something new in the news. Could you highlight any recent breakthroughs or findings, in the
context of LLMs or multimodal models, where we're seeing some interesting work happening,
that you find particularly exciting or impactful? Yes, yeah. So I think in terms of
broader technological arcs, exactly the two things that you mentioned,
language models, and in particular multimodal models,
have opened up
lots of interesting possibilities,
both in terms of use cases
and in terms of additional research to be done.
So from a technical perspective,
it's an exciting time to be in. But of course, not everything is solved. As we
discussed at length today, how we can make these models more usable and helpful for a
broader set of diverse users and people from different backgrounds is, I think, still a very
open problem, and an exciting problem at the same time.
And what really excites me is that there is foundational research work to be done,
and if we are able to make progress on that, the possibilities for societal impact are massive.
So as a researcher, that really excites me. Beyond multimodal and
language models more broadly, the more recent advances around long-context models, where we are able to
specify a lot more contextual knowledge, as in, say, the Gemini 1.5 models, have also,
I think, opened up interesting avenues for exploration.
And I'm excited to explore those.
Very exciting.
And then, when you think about your primary passion and area of interest,
which is ensuring that language models and language technologies are inclusive and accessible,
what do you see as the next key achievements for the space, to
ensure that these technologies are more inclusive? Is it focusing on data? Are there innovations on
the modeling side? Are there other things that you perceive as important regarding
language representation, and also some of the fairness issues that we described in the context of LLMs? Yes, yeah. So I think we are thinking through three
dimensions. One is representative data, which we talked about; Vaani is an example of
that. Then there is Project Bindi, which is looking at fairness and responsible AI and the importance
of working with communities while leveraging the scale that
LLMs give. The third one, which we haven't talked about so much, is how we can build these
models in a more scalable and modular manner. Right now, the recipe for building these models
is that you have one monolithic model, and then you try to add more data and
extend its capabilities. But I'm not sure whether that's a highly scalable
approach. In order to overcome that, we have been working on a method called CALM, which is
looking at how we can develop models with different expertise independently,
and, in a post-hoc manner, compose these models to enable new capabilities.
With that, maybe I have a core model which is very good at doing, say,
reasoning tasks, math or numeric reasoning problems, but it works primarily well for
English and a few other languages. But if I want to make those capabilities available for,
say, Santali or Hausa, and I have a separate model with expertise in those
languages, how could we compose both of them together to enable reasoning capabilities in all of these
additional languages? We have had some initial promise and success in that direction,
and we are excited to follow up more on how, even in terms of modeling, we can
do this in a more scalable and modular way.
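[Editor's note: As a schematic sketch of that composition idea (not the actual CALM implementation; the dimensions and module names below are invented for illustration), one way to compose two frozen models post hoc is a small learned cross-attention "bridge" that lets the core model attend to the expert model's representations.]

```python
# A schematic sketch (not the actual CALM implementation) of composing two
# frozen models with a small learned "bridge": cross-attention lets a core
# reasoning model attend to hidden states from a language-expert model.
# Dimensions and module names are invented for illustration.
import torch
import torch.nn as nn

class CompositionBridge(nn.Module):
    def __init__(self, core_dim=768, expert_dim=512, heads=8):
        super().__init__()
        self.project = nn.Linear(expert_dim, core_dim)  # map expert states into core space
        self.cross_attn = nn.MultiheadAttention(core_dim, heads, batch_first=True)

    def forward(self, core_states, expert_states):
        expert = self.project(expert_states)
        attended, _ = self.cross_attn(core_states, expert, expert)
        return core_states + attended                   # residual combination

# Both base models stay frozen; only the bridge's few parameters would be
# trained, so new expertise can be attached without retraining either model.
core = torch.randn(1, 10, 768)     # hidden states from the core model
expert = torch.randn(1, 12, 512)   # hidden states from the language expert
bridge = CompositionBridge()
print(bridge(core, expert).shape)  # torch.Size([1, 10, 768])
```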
I see. So: focusing on representative data,
focusing on improvements in modeling,
and then, of course, community-based development,
working with communities on these solutions.
And then looking, more broadly, at
recontextualized responsible AI,
so that we are serving the end users
in a locally sensitive manner
that brings meaningful change to their lives.
Yes, yes.
I think this was a very interesting discussion.
I want to wrap up with one question.
As somebody who's had a very diverse career,
both as a researcher, as a professor, as an entrepreneur,
I'm sure you've had the chance to teach and engage with and mentor many students,
many junior colleagues who have gone on to become successful,
whether it be as researchers, as leaders in their own fields.
What are some of the skills and qualities that you look for
and try to cultivate in your students or mentees?
And then what advice
would you give to young aspiring engineers and scientists who want to make an impact in this
world? Right, yeah. So one is making sure the focus on quality is always there,
not compromising on quality and a high bar for some short-term
gains, even if it requires you to stay the course for a longer period. So I
think maintaining quality is one important thing. Another is making sure that you are
passionate about the problem that you're working on, and that you actually care about the
outcomes, because I think that's going to help you navigate the downturns that are bound to
happen when you're working on challenging problems. So identifying things that you really
care about, and making sure you are focusing on them, I think is important. Curiosity and drive, I think, have been
important ingredients for identifying good problems and
eventually doing good work. The importance of the question, identifying the right question to
address, is also extremely important. I tend to believe that even a suboptimal answer
to the right question is more valuable than an optimal answer to a suboptimal question.
So spending enough time making sure that you are working on problems that you
care about and that are impactful is, I think, important. And if it makes sense,
then seeing how this is going to be grounded
in the real world
and how it may help end users,
if that's a thing that you care about,
is something to think about early on,
along with how it's going to fit into the bigger picture,
not just looking at
the next incremental improvement that could be done.
Wow, I think those are all great pieces of advice.
Identify a question you're passionate about
or interested in solving.
Be curious, never compromise on quality
and where relevant, think about how your work ties
into society and sort of the larger context.
So I think those are all amazing pieces of advice for the next generation of makers and creators.
So with that, thank you so much, Partha.
I think this was a wonderful discussion.
And we look forward to the impactful work that you will continue to contribute in this growing, evolving technology landscape.
Thanks, Bruke. Great talking with you. And thanks for giving me this opportunity.
ACM ByteCast is a production of the Association for Computing Machinery's Practitioner Board.
To learn more about ACM and its activities, visit acm.org. For more information about this and other episodes,
please visit our website at learning.acm.org. That's learning.acm.org.