ACM ByteCast - Partha Talukdar - Episode 52

Episode Date: April 23, 2024

In this episode of ACM ByteCast, Bruke Kifle hosts Partha Talukdar, Senior Staff Research Scientist at Google Research India, where he leads a group focused on natural language processing (NLP), and an Associate Professor at the Indian Institute of Science (IISc) Bangalore. Partha was previously a postdoctoral fellow at Carnegie Mellon University's Machine Learning Department and received his PhD in computer and information science from the University of Pennsylvania. He is broadly interested in natural language processing, machine learning, and making language technologies more inclusive. Partha is a co-author of a book on graph-based learning and the recipient of several awards, including the ACM India Early Career Researcher Award for combining deep scholarship of NLP, graphical knowledge representation, and machine learning to solve long-standing problems. He is also the founder of Kenome, an enterprise knowledge graph company with the mission to help enterprises make sense of big dark data. Partha shares how exposure to language processing drew him to languages with limited resources and NLP. He and Bruke discuss the role of language in machine learning and whether current AI systems are merely memorizing and reproducing data or are actually capable of understanding. He also talks about his recent focus on inclusive and equitable language technology development through multilingual-multimodal Large Language Modeling, including Project Bindi. They discuss current limitations of machine learning in a world with more than 7,000 languages, as well as data scarcity and how knowledge graphs can mitigate this issue. Partha also shares his insights on balancing his time and priorities between industry and academia, recent breakthroughs that were impactful, and what he sees as key future achievements for language inclusion.

Transcript
Starting point is 00:00:00 This is ACM ByteCast, a podcast series from the Association for Computing Machinery, the world's largest educational and scientific computing society. We talk to researchers, practitioners, and innovators who are at the intersection of computing research and practice. They share their experiences, the lessons they've learned, and their own visions for the future of computing. I am your host, Bruke Kifle. As technology rapidly evolves, NLP stands at the forefront.
Starting point is 00:00:32 From GPT to Gemini and LLaMA, language technologies, or large language models as they're known, are reshaping our intelligent systems at a rapid pace and transforming how we generate and interact with information. Amidst this transformation, however, it's crucial to equally prioritize the vital importance of inclusivity in language technologies, considering their substantial impact on access to information and opportunities. Our next guest, Partha Talukdar, is driving advancements in machine learning and NLP while advocating for and ensuring more inclusive and equitable language technologies. Partha is a Senior Staff Research Scientist at Google Research India, where he leads a group focused on natural language processing. He's also an Associate Professor at the Indian Institute of Science, Bangalore. Previously,
Starting point is 00:01:21 Partha was a postdoctoral fellow in the Machine Learning Department at Carnegie Mellon University. He received his PhD in computer and information science from the University of Pennsylvania. Partha is broadly interested in natural language processing, machine learning, and in making language technologies more inclusive. He is a recipient of several awards, including an Outstanding Paper Award at ACL 2019 and the ACM India Early Career Researcher Award in 2022. He is a co-author of a book on graph-based semi-supervised learning. Partha, welcome to ByteCast. Hi, Bruke. Great to be here, and thanks for having me. You know, you have such a remarkable and interesting career that spans both academia and industry, having your undergraduate experience in India, coming
Starting point is 00:02:05 to the US for your graduate studies and postdoc, and then now returning. I'm very interested to learn what are some of the key points within that personal and professional career and journey that have led you into the field of computing, but also motivated you to pursue your field of study now with language technologies? Sure. Yeah, I'm happy to chat about that. Yeah, so I mean, I kind of got into computer science. I mean, I had some exposure to computer science during school days, but it was really during kind of like the undergrad,
Starting point is 00:02:36 where I took it up as my major, and especially transitioning into like an LPE and AI. That really happened during an internship that summer fellowship that I got at Indian Institute of Science where I actually have a faculty position now during the third year of my undergrad so I think it was about 2002 to summer basically I was working on like networking technologies before then but when I got that summer fellowship at IIC, which is what Institute of Science is shortened to, so I got exposure to language processing. It seemed like not really interesting. And I just kind of like, you know, one thing led to another 20 plus years after that, I'm still working in language technologies now. I mean, I was really fascinated at that time in terms of like how we can extract information from languages.
Starting point is 00:03:28 Interestingly, I was still working on like low resource languages and then kind of like veered off into doing other things in NLP. And now back in Google Research, I'm back into like, you know, working on languages with limited resources, how we can make it more inclusive. So it has been a full circle that way, both in terms of like, you know, geography, and also topics within NLP that I kind of like covered in my research career so far. And, you know, I think it must be interesting, clearly, now, if you ask anybody about chat GPT, or gen AI, I think it's the hottest topic of the year. But I'm sure 20 years ago, or when you initially made your sort of journey into this field, it was a pretty new domain and area. Now, we have these LLMs that are generating a ton of buzz. But clearly, over the course of a decade, two decades, there's been a lot of transformation
Starting point is 00:04:24 that we've been seeing in the AI space, primarily as a result of deep learning. So what are your thoughts on sort of language as a pathway to achieving artificial general intelligence, which of course is the North Star or the goal? You know, mainly when we think about language as a very pivotal role in human cognition and communication, How have some of the advancements that you've seen in LLP really led you to think that there might be something here? Right, yeah. Language as a central component, as you rightly said, like in terms of communication and cognition is super important. And recognizing this, like right from like 1950s, I mean, there has been like the work on language processing.
Starting point is 00:05:07 In fact, people early on thought that machine translation will be a solved problem in the 50s and 60s. But of course, it took much, much longer than that. While we have made significant progress, but still, even within translation, there is a lot more work to be done. So that way, I think the importance of language processing and NLP has been there all along. But of course, with language modeling, in fact, large language modeling, making all the progress in the recent years have really brought it to limelight in terms of like say how i have seen area like not transition during my research career so far so earlier it was like enough for say different tasks even within nlp let's say if
Starting point is 00:05:57 you're interested in like information extraction or machine translation or say parsing. So you would have like customized models for each one of those tasks separately, right? And then even within NLP, it was quite challenging to kind of like not change topics because there is like a lot more groundwork that you have to do in order to like now build kind of like the baseline systems in the particular subtopic that you're working on.
Starting point is 00:06:27 So things like that has seen like a sea change now with one like say general purpose pre-trained language model that you could either use like with some instructions or with some fine tuning, you could make it kind of like, you know, do multiple tasks, not only within language processing, but increasingly with multimodal systems, you are able to go across modalities as well. So the same model working with, say, speech, images, text. So all of those
Starting point is 00:06:59 would have been very hard to predict back in the day that within this short span of time we would like you know come to this kind of homogenization in terms of modeling and i think that has happened in stages so initially with neural networks and deep learning that there was homogenization i mean first with like you know machine learning i mean if you kind of like go all the way back like you know from like rulebased to machine learning, so we started working with data-driven methods. So in that case, maybe the algorithms were kind of the same, but you were building different models and also maybe different types of algorithms, like SVMs and decision trees or CRFs and all. Now with neural
Starting point is 00:07:46 networks and deep learning, there was standardization in terms of like the learning algorithms. So now you use like say deep learning for all of it, but for different tasks, you'd still like, you know, use like different models. Now with language models, now there is homogenization in terms of model also. So now you have like in a single model doing multiple things. So that way, in stages, there has been more and more standardization and homogenization across tasks and across modalities. So which has enabled for researchers to move across these different problems, and then also sharing of the transfer of knowledge across these different tasks
Starting point is 00:08:26 and learning so that way like you know lots of changes and then going back to your question on like another importance of language i mean if we look at all the like advances that we have been celebrating i mean language has been kind of like large language models have been at the forefront and have demonstrated the way of how these kind of self-supervised learning done at scale could result in lots of interesting capabilities that would have been hard to anticipate a few years back. You raise a very good point around this idea of generalizability, where we have single pre-trained foundational models that are basically able to serve multiple functions or multiple use cases. But I would presume a big part of that is this generalizability, the ability to learn things not just for specific tasks, but to really capture intelligence, I guess, depending on how you define intelligence. But do you believe that,
Starting point is 00:09:26 you know, with these LLMs, we're seeing a case of systematic learning? Or as some schools have thought, are these just giant databases? Are they actually exhibiting some of the learning capabilities, the reasoning capabilities that are basically essential to deliver on a lot of the tasks, whether it be summarization, sentiment analysis, Q&A. What are your thoughts? So I think definitely there is like, you know, I mean, acquisition of the knowledge of the world. But in addition to that, I think there is definitely, I mean, it's not just like, you know, memorization and reproducing what is there in the data, ability to understand intent of like, you intent of what the users are saying.
Starting point is 00:10:08 And with those limited instructions, trying to fulfill a task requires lots of reasoning and generalization capabilities. So I think there is no doubt about that. Even if you say ask to write like a piece of text in the form of say some author who may have never like, you know, written about in that particular style. So like I was trying to say, write about like say LLMs in the style of Shakespeare. So these like say language models will happily comply with that type of request, but nowhere in our say pre, pre-training or fine-tuning data,
Starting point is 00:10:46 like, you know, we would have those kind of combinations being present. So that way it's able to, like, you know, learn some patterns and apply that kind of like in novel situations, which was clearly not, like, you know, present in the training data. So that way there is, like, you know, definitely like generalization happening, but to what scale and how those things are happening are, I guess, like, you know, definite like generalization happening, but to what scale and how those things are happening are, I guess, like topics of ongoing investigation. I guess we have very little understanding on that as a community currently. Yeah, very interesting. One thing that you,
Starting point is 00:11:18 that your research area or focus or passion is on is around making language technologies more inclusive and accessible for low resource languages or underrepresented communities. And of course, when you think about how these technologies are used in the real world, in real products, there are serious implications, whether it be digital divide or the unequal access to information or opportunities, there are serious implications when we don't make these technologies inclusive. So could you describe some of the current gaps or limitations
Starting point is 00:11:50 when we say NLP in the context of low resource or underrepresented languages? And then what are some of the ongoing efforts to try and address some of these gaps? Sure, yeah. Now, I'm very happy to talk about that. And that's like in a topic I'm really passionate about. But Brooke, before going there,
Starting point is 00:12:09 I just want to mention about the importance of language, like about your first question. So would it be okay to address one more point there? Yes, yes, please. Okay, yeah. On that particular point, like, you know, I mean, one of the things, I guess, like, you know, which has resulted in this excitement is the ease of interface between these like you know ai models and humans
Starting point is 00:12:31 which has happened through like again kind of like in a natural language so now i think like you know all of these like capabilities just by giving some instructions right now or like you know some prompts people without having any like an expertise in computer science or machine learning are able to like you know deal with these language models and like you know access have access to ai capabilities i think that has really increased the scope and has excited even non-experts to and make those like capabilities available to a broader mass so to me natural language as an interface i think has been and recognition of that i think has been like a major advance through this like in a language model so that's one part and then the other one is like
Starting point is 00:13:16 how these language models have acquired all of this say the knowledge about the world which has resulted in this kind of like impressive capabilities I mean that's also through like large corpus where like NASA humans over the years have documented about like not a knowledge about the world so both in terms of storage of knowledge of like enough the world around us and then also as an interface language, interface, like, you know, to these AI models, I think, like, again, language has been like a core enabler in both of these two aspects. So now coming to your point about inclusivity. So I think there, the question is that now, if we know, say, English and a handful of other languages, then the capabilities of these
Starting point is 00:14:07 AI and language models are available to us. But there are like a vast majority of languages in the world where these models don't do very well. So just to give you some idea. So currently, like, you know, these language models may work well for, like, say, a few dozen languages. Or even if you look at, say, language technologies more broadly, even beyond, say, language models, we have capabilities in maybe like a few dozen languages across the world. But in the world, we have like, you know, more than 7000 languages there and for vast majority of those languages basically beyond these like in a few dozen we have no usable language technologies be it in the form of speech technologies translation technologies and so that way it runs the risk of like you know those who have access to these technologies the access to information and opportunities will become more and more easier
Starting point is 00:15:06 while leaving behind vast part of the world's population from having meaningful opportunities to leverage these technologies. So that seems not an ideal state to be in. So my research and my group's work has been focused on how we can make these kind of capabilities available to speakers of a larger number of languages. So now, when I think of LLMs, so like the notion of like an yin and yang kind of like, you know, comes to my mind. So as I mentioned that, like, so if we leave things like, you know, as is, then this gap between like, depending on the languages that you know, as is, then this gap between like, depending on the languages that you know, the access and opportunities to information that you have, depending on
Starting point is 00:15:50 a known language, I think that's going to grow. But at the same time, I feel that language models are also the best tools that we have at our disposal right now to reduce this language-based barrier that's there. But that's going to require some concentrated effort on our part to make these kind of Ligna capabilities and models more inclusive. And that's where our Ligna research has been focused on. So one, I guess, thing is that when we are looking at this broader set of languages like you know we are talking about like a diverse geographies people with like you know coming from different cultures so one
Starting point is 00:16:34 thing that we need to understand is like you know what are their core needs like you know what kind of language technologies they need that could like you know best serve their use cases i thought that's i think like you know one part then like say if we look at kind of like say scaling the existing capabilities be it like say translation or speech recognition synthesis to speakers of more languages where it like you know makes sense so the lack of data is a challenge so right now the recipes that we have is that if you have like data and compute we have good methods to build these kind of models which can have like you know very interesting capabilities that we all have seen but for lots of languages around the world they don't have as good a representation on the web,
Starting point is 00:17:26 or like a lot of them, the digital representation, like on the web, they are not well represented. So that creates a challenge. So how we can work with communities with speakers around these, like, you know, different geographies and cultures to make technologies relevant to them while dealing with these data sparsity problems is a central issue as we look at scaling these kind of methods. And at Google, we have this thousand languages moonshot where we are looking at building language technologies for thousand languages around the world. And we are trying to address some of these issues in a systematic manner. I see. So the data issue is definitely a big problem. This idea of data scarcity, we mentioned somewhere on the order of 7,000 languages, maybe a small select few that
Starting point is 00:18:15 are probably well served by these existing models. How do we address this data scarcity challenge at scale? What are sort of sustainable, scalable solutions? Right. Yes. That's a super important question. One is that, like, you know, the importance of representative data. Like, you know, when we are trying to build these models, it's important that whatever the end use cases and the communities that's going to use these types of models, it's important that like not all the nuances that are there, which is going to affect the user experience when they are using these models so that those parts are covered. Now, one example of that effort is an effort called Vani that Google is supporting and Indian Institute of Science is driving that program. So there the goal is to collect
Starting point is 00:19:03 representative speech data, or in fact, as we call it, collect the speech landscape of all of India. So we are collecting image-prompted speech data from all districts of India. And it's motivated by the fact that language, when it's spoken, there is a variation across regions. Like even one language in a multilingual society, depending on what other languages are being spoken, like there is variations in how that language gets spoken. So in Project Vani, that's why we are taking like a region anchored approach rather than a language anchored approach, where we show people locally relevant images in a particular district. A district is like, say, county in the US. And then we ask users or contributors to describe those images in a
Starting point is 00:19:53 language of their choice. And we are really amazed that when you give people this opportunity to express themselves in a language of their choice, rather than being prescriptive about it, the diversity of languages that they use to describe, say, in this particular case, the images. So we have had instances in where people use like an endangered tribal languages, and we'd have never thought about like, you know, collecting data from all of those languages. So that way, capturing all of these like diverse data, and then we are also making all of this data open source, and then 10% of that is being transcribed. So building these kind of data ecosystems where
Starting point is 00:20:32 multiple organizations and even people who are passionate about building language technologies, how we can pull in all of our resources to collect representative data, which covers the underground variations, I think is super important. And Vani is one effort of that kind. Second one is when we are, like I said, building these kind of models and then deploying them, making them available across cultures and geographies, there are variations in terms of local norms the axis of harms that are there in these different geographies for example if i take the case of india from a responsible ai lens and then like no contrast that with say us so while we have shared access of discrimination, let's say for like gender, but in India, we have like additional ones, let's say caste or regionality. So it's important that
Starting point is 00:21:33 we test our models in these kind of region and culture specific dimensions and take necessary mitigation steps. So to make sure that these models have been tested and mitigated from these fairness and bias perspectives, which we call as the recontextualized REI. So basically, the responsible AI, which have been recontextualized in the target geographies and culture perspective. So it's really important that we work with communities who have been historically like at the receiving end of this kind of societal biases that we work with them like understand their needs and what kind of discriminations are there because otherwise it might be very hard to predict what kind of like issues are there if we don't have lived experiences and underground knowledge about
Starting point is 00:22:26 these things. So two points here, primarily. So one is working with representative data, and then also finding scalable ways of working with communities to make sure that the responsible AI aspects are covered. So for that, we have an effort called Bindi, where we are trying to do exactly that using a complementary approach of doing scaling using LLMs, and then also working with communities to learn from their rich and varied experiences, and then incorporate those into the models. You know, I think that's such a great point. Beyond the data representation issue, which leads to the language or usefulness of some of these models for certain communities, I think even understanding the idea of region-specific biases, there's been a strong community around responsible AI. And I think a lot of the fairness principles, the ethics principles that we explored for classical machine learning are now very different in the age of LLMs, right? We're thinking about new harms and
Starting point is 00:23:37 new responsible AI concerns. But I think the two solutions, one being rooted around representational data, but then two, close collaboration with communities to ensure that, you know, most cases we may not even be aware or understand the local or regional context. So I think that's a very, very great point that you raised. ACM ByteCast is available on Apple Podcasts, Google Podcasts, Podbean, Spotify, Stitcher, and TuneIn. If you're enjoying this episode, please subscribe and leave us a review on your favorite platform. You know, one thing I want to touch on, you lead a group focused on NLP at Google Research India. And, you know, we talked about some of the work, you shared some of the work that's happening around the inclusive sort of work stream. What are some of the key other projects or initiatives your team is currently working on that you're excited about?
Starting point is 00:24:33 Yeah, so I mean, a lot of our work or pretty much all of our work is centered around large language models and how we can make them inclusive and responsible. So I mentioned about the thousand languages moonshot at Google. So a lot of the work that we do is part of that initiative. So we also, since we are situated in India, we think of India as a microcosm of the global South and try to take inspiration from here and try to develop methods with the hope that if we are able to build something that works here, that could be like an applicable more broadly in other geographies and locales with similar characteristics. And we have had some success in that direction and that we want to do more. So yeah, so inclusion and responsible LLMs, linguistically inclusive
Starting point is 00:25:26 LLMs, and doing that in a responsible way has been our core focus. And initially, we started off with text as the modality. But now we are also increasing that to include speech as an additional modality and going more towards multimodal versions of these kind of models. Specifically in the Indian context, one of the problems we are looking at is how we can build language models for 100 plus Indian languages. India is a very linguistically diverse country. So we have, based on last census which which happened in 2011, we have a total like some 1300 plus languages. And we have about like 120 plus languages, which are spoken by 100,000 plus speakers
Starting point is 00:26:14 each, and 60 plus languages, which are spoken by more than a million speakers each. And then in the constitution, 22 languages are officially recognized and language technologies right now are available maybe around for like say the top 10 of these languages in terms of like say number of speakers. So that way, as you can see, there is a big gap in terms of like the large number of speakers of these languages for whom either there is no usable or no language technologies at all. So we are looking at how we can build a speech text model for like the speakers of all of these languages.
Starting point is 00:26:55 I think that's very exciting. And you raised a really good point, which is India has a large population and not just in numbers, but also in diversity, linguistically, as you've mentioned, culturally. And so being able to develop solutions that are able to cater to such a wide and sort of diverse group can actually serve as a great model for scalable solutions that also serve the broader global population as well. So I think it's quite exciting to be able to not only innovate
Starting point is 00:27:27 and work on advancements in this space, but being able to do it in a context or a region or a population where you're able to get the same level of diversity that you're able to achieve when deploying these kind of solutions to larger groups. So that's very exciting. One thing that I do want to call out is, along with your co-authors, you did receive the Outstanding
Starting point is 00:27:49 Paper Award for your work around word sense disambiguation, which of course is topically related to some of the work that's happening in LLMs. Can you describe the problem for those who maybe are unfamiliar and maybe just share what the key insight or set of insights of your paper were? Sure. Yeah. So word sense disambiguation is the problem that in many languages, including in English, the same word could take different meaning depending on the context in which that particular word is being used. For example, if you take the word bank, it could mean like another financial bank. So I like now went to the bank to deposit a check versus like say river bank. So say I took a nice stroll along the bank. So that word sense is a big issue problem is given a particular context,
Starting point is 00:28:40 how do you identify which sense this particular, like a word in a particular context, like which sense is it basically expressing? So that's the word-sense disambiguation problem. So usually you have some pre-identified senses of the words, and the word-sense disambiguation problem is that how you can develop an algorithm or a method which given a word in a particular context can tell you out of like say
Starting point is 00:29:05 the n possible sense possibilities like you know which one is being expressed there so people had looked at this problem as a problem of classification with discrete labels so like you know given the words a bank in a particular sentence which one of these like say four senses is it being expressing here so the key idea in that in the acl 2019 paper was to think about those senses not as like say discrete labels but think in terms of like their embeddings for example if we could represent them as a vector in some vector space, then we could then expand to new and unseen senses that were not seen during training data. So why this was important? It was because when you are treating these senses as like say discrete labels for classification, so the senses that you had not seen in your training data, there was no possibility of predicting
Starting point is 00:30:06 those senses during test time. So basically, if you have an unseen sense that shows up during test time, you will have no hope of making that prediction. And then also for the words, some senses are more popular than the others. And then the popular senses tend to get like a more biased treatment by learning algorithms. So we had some like embedding based methods to overcome some of these problems. And we also showed how like a lexical knowledge in the form of like say word net, where like, you know, you have like word to word relationships in terms of whether one word is an antonym or synonym or you have like say glosses like which are like this short examples and definitions of these senses so utilizing those kind of other supplementary resources and knowledge we basically demonstrated a way of making this word sense disambiguation problem
Starting point is 00:31:05 more robust and flexible and extendable to new senses that may not have been seen during training time. I see. And when I think of use cases, I think specifically machine translation, I've primarily observed many cases where I've seen sort of this issue that you're describing where the wrong word or the wrong context is used. Is this a use case where this technique can sort of bring some improvements in quality? Yes, yeah, definitely. That's a great example. And then also this work was done in a pre-LLM world. And now with like another language models do a very good job in terms of like you know learning the meaning of words in a contextual manner but i think like you know the possibility of utilizing these other lexical knowledges like say word net and all how we can still incorporate them in a language model
Starting point is 00:31:58 is also an interesting question. Very interesting. On the topic of papers and publications, I also want to quickly touch on a book you authored a while back on graph-based semi-supervised learning, which combines two areas: semi-supervised learning and graph-based learning. At the time, what were some of the challenges you observed in semi-supervised learning, or the opportunities that led you to think about graph-based approaches? And now, in this world of deep learning, LLMs, and generative AI, are there practical examples or use cases that you see in the context of NLP? Sure. Yeah, absolutely. So throughout my research career,
Starting point is 00:32:46 sparsity of data has been a common theme. We talked at length about linguistic inclusion and the data sparsity problems there. But even before that, when I was looking at information extraction — how we can bring more knowledge about the world into machine learning algorithms — lack of data was again a recurring problem. Think of being interested in learning about various types of entities and relationships — say people, mountains, diseases, islands across the world — and what the relationships among them are. If you're thinking of doing this in a supervised learning setup, where you provide training data for each and every type of knowledge,
Starting point is 00:33:39 since there are so many different types of knowledge, you cannot provide lots of labeled examples for all of them. You can only provide maybe a few examples each of, say, people, or capitals of countries, and so on for different types of relations. So the problem I looked at was how, given some small number of examples for large and diverse types of knowledge, we can build machine learning models. Again, I'm talking about a pre-LLM era. The observation was that doing all of this labeling through humans was a time-consuming process, so we could only get access to a small number of labeled examples, but unlabeled data
Starting point is 00:34:26 is plentifully available — by that I mean, say, corpora and documents on the web. So the question was how we could combine those kinds of unlabeled data with a small amount of labeled instances to learn good models for whatever end tasks we were interested in — extraction, classification, and so on. That was the motivation for semi-supervised learning, where you combine small amounts of labeled data with lots of unlabeled data. Then graphs came into the picture. A graph is a very useful and versatile data structure — we are all connected in one way or another, right? Networks and graphs provide a very flexible way to represent knowledge about the world, be it, let's say,
Starting point is 00:35:19 one person connected with or related to another person or an institution in a social network, or a biological network, a transportation network, or knowledge graphs — knowledge about the world and relationships, where nodes represent entities and edges represent relationships among those entities. That provided a flexible way of representing various domains and world knowledge. So that's the representation part. And then, how we can learn over those types of graphs with limited supervision is how the semi-supervised learning part came about. So how we could combine these two pieces — that's where
Starting point is 00:36:05 graph-based semi-supervised learning came into existence. People had looked at utilizing graph-based semi-supervised learning for other problems, but our work, along with that of other researchers around that time, was among the first to apply those ideas within NLP. In some of our initial applications, you might give maybe five examples of watch manufacturers, and then, given that data and access to the web, significantly expand the list and extract a hundred others from the web — from just those five seed examples. Subsequently, that went into how we can build these knowledge graphs, which are the entity-relationship
Starting point is 00:36:57 graphs that I talked about. One big project I was involved with during my postdoc at CMU, led by Professor Tom Mitchell, is a project called NELL, which stands for Never-Ending Language Learning. The idea was to build this kind of knowledge graph by reading web documents in a pretty much self-supervised manner, and then using that accumulated knowledge to improve the extractors and keep reading, in a never-ending manner. That particular project ran for about ten years, pretty much in a self-supervised manner, by applying these kinds of semi-supervised learning ideas in a graph context. So yeah, I think that's one
Starting point is 00:37:43 concrete example of merging graphs and semi-supervised learning. Very interesting. It's cool to see some of the practical applications benefiting from the combination of two methods of learning, semi-supervised and graph-based. We've touched on a lot of interesting things throughout your research career. One thing that's quite impressive is that you wear multiple hats: in addition to your research role at Google, you are also an associate professor at the Indian Institute of Science. It's quite interesting being able to have a foot in academia but also
Starting point is 00:38:25 in industry and research. How do you balance your time and your priorities between these two roles? Right. So I'm currently on leave from the university, so I'm not teaching on a regular basis, but even after starting the position at Google I continued advising my PhD students, who have graduated now. It was challenging, but since I was working in roughly the same areas, it wasn't dragging me in different directions. Also, my students were already
Starting point is 00:38:58 Also, my students were already kind of towards the second half of their PhD journey. So they were already quite independent. So that definitely helped in terms of towards the second half of their, say, PhD journey. So they were already quite independent. So that definitely helped in terms of like managing the two sides. And then now we have these collaborative projects that I mentioned, like in Avani before. So that's also within the same university. So now the engagements have morphed into different types, but there is kind of like a strong back and forth. I see.
Starting point is 00:39:28 I was actually going to touch on some of the benefits of working in both settings, but I think you alluded to the potential for collaboration between academia and some of the work happening at Google Research. Do you find that having a role in industry helps inform some of the research that happens in academia? Or is it vice versa, where research in academia is helping push innovations in product? At least for you, which hat do you find inspires or informs more of your work? Yeah, that's a great question. In fact, I also had a startup in between, and one of the primary reasons I'm at Google is that I wanted to see whether and how whatever research I'm doing is making any kind of
Starting point is 00:40:15 impact or use in the real world — to go the full path and understand how it's getting used and what the drawbacks are, and then take inspiration from there to inform the next set of research questions. I feel that's a very productive way of making sure you're working on important problems, and of having a strong connection with industry, because industry is often at the forefront in deploying products and having users use those products.
Starting point is 00:40:51 The problems — they get exposed to all of those. So making those problems available to academic researchers, and encouraging them to work on them, is definitely helpful, I think. That has been one of my motivations for doing the startup and for being at Google, as I mentioned. Even before this, during my PhD, I spent about a year at Google across three different internships,
Starting point is 00:41:17 and that had a strong influence on my research trajectory. So I had exposure to, and knowledge of, the benefits of industry engagement, and I just continue to follow that even today. I think that's a great point. Of course, exploratory research is equally important, and I think it's essential
Starting point is 00:41:39 for pushing the boundaries of science. But grounding some of the research we do in the context of real-world problems or use cases can also be quite beneficial to ensuring tangible short-term benefit as well. Yeah. And also, Bruke, you can always take inspiration from those real-world use cases and then think in terms of, say, at what time scales you want to solve them. Right, yes. And this could
Starting point is 00:42:12 be down to an individual researcher's taste — how much they want to be influenced by that. Sometimes you want to solve that exact problem; in other cases, you want to keep its flavor in mind but think about what a more general version of the problem could be, and then try to address that in a more systematic way. So having good exposure is, I think, very valuable — at least that's what I have found — and depending on their taste, individual researchers can decide how to incorporate that into their research. Yes, yes, that's a great point. You mentioned something very cool: that you had an entrepreneurial venture as well,
Starting point is 00:42:55 which was quite interesting. It seems that over the course of your career you've spanned diverse areas while maintaining a central focus on language technologies and NLP. I'm sure these past few years have been quite exciting — the large language model advancements have been remarkable, and every day there seems to be something new in the news. Could you highlight any recent breakthroughs or findings — in the context of LLMs, multimodality, or other interesting work happening in the space — that you find particularly exciting or impactful? Yes. So in terms of broader technological arcs, exactly the two things that you mentioned,
Starting point is 00:43:46 language models and in particular multimodal models, have, I think, opened up lots of interesting possibilities, both in terms of use cases and of additional research to be done. From a technical perspective, it's an exciting time to be in. But of course, not everything is solved. As we discussed at length today, how we can make this technology more usable and helpful for a
Starting point is 00:44:18 broader set of diverse users and people from different backgrounds is, I think, still a very open problem, and an exciting one at the same time. What really excites me is that there is foundational research work to be done, and if you are able to make progress on it, the possibilities of societal impact are massive. As a researcher, that really excites me. Beyond multimodal and language models more broadly, more recent advances around long-context models — where we are able to specify a lot more contextual knowledge as part of the input, as in, say, the Gemini 1.5 models — have also, I think, opened up interesting avenues for exploration.
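As a rough illustration of the mechanic described here — fitting much more contextual knowledge directly into a model's input — the sketch below greedily packs reference documents into a fixed context budget. The budget figure, the example documents, and the use of whitespace word count as a stand-in for a real tokenizer are illustrative assumptions, not details from the episode; the point is only that a larger budget lets whole supplementary resources ride along in the prompt instead of being dropped.

```python
def pack_context(documents, token_budget):
    """Greedily pack whole documents into a context window.

    Word count via str.split() is a crude proxy for a real
    tokenizer (an assumption for illustration only).
    """
    packed, used = [], 0
    for doc in documents:
        cost = len(doc.split())  # rough token-count proxy
        if used + cost <= token_budget:
            packed.append(doc)
            used += cost
    return packed

# Hypothetical supplementary resources one might want in-context:
docs = [
    "glossary of senses for the word bank",
    "full WordNet glosses for every content word in the sentence",
    "parallel example sentences in a low-resource language",
]

# A small budget admits only a fraction of the knowledge;
# a long-context model (much larger budget) can take all of it.
print(pack_context(docs, 8))
print(pack_context(docs, 100))
```

With a tight budget only the first short document fits, while a generous budget admits every document — the difference a long-context model makes in this toy setting.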
Starting point is 00:45:07 And I'm excited to explore those. Very exciting. And when you think about your primary passion, ensuring that language technologies are inclusive and accessible, what do you see as the next key achievements for the space? Is it focusing on data? Are there innovations needed on the modeling side? Are there other things you perceive as important regarding language representation, and also the fairness issues we discussed in the context of LLMs? Yes. So I think we are thinking through three
Starting point is 00:45:50 dimensions. One is representative data, which we talked about — Vaani is an example of that. The second is Project Bindi, which looks at fairness and responsible AI and the importance of working with communities while leveraging the scale that LLMs give. The third, which we haven't talked about so much, is how we can build these models in a more scalable and modular manner. Right now, the recipe for building these models is to have one monolithic model and then try to add more data to extend its capabilities, but I'm not sure that's a highly scalable approach. To overcome that, we have been working on a method called CALM, which is
Starting point is 00:46:40 looking at how we can develop models with different expertise independently and then, in a post-hoc manner, compose those models to enable new capabilities. With that, I might have a core model that is very good at, say, reasoning tasks — math or numeric reasoning problems — but works well primarily for English and a few other languages. If I want to make those capabilities available for, say, Santali or Hausa, and I have a separate model with expertise in those languages, how could we compose the two together to enable reasoning capabilities in all of these additional languages? We have had some initial promise and success in that direction,
Starting point is 00:47:32 and we are excited to follow up on it further — even in terms of modeling, how we can do this in a more scalable and modular way. I see. So: focusing on representative data; improvements in modeling; and, of course, community-based development — working with communities on these solutions — while looking more broadly at recontextualizing responsible AI so that we are serving the end users
Starting point is 00:48:03 in a locally sensitive manner that brings meaningful change to their lives. Yes, yes. I think this was a very interesting discussion. I want to wrap up with one question. As somebody who has had a very diverse career — as a researcher, a professor, and an entrepreneur — I'm sure you've had the chance to teach, engage with, and mentor many students,
Starting point is 00:48:30 many junior colleagues who have gone on to become successful, whether as researchers or as leaders in their fields. What are some of the skills and qualities you look for and try to cultivate in your students and mentees? And what advice would you give to young, aspiring engineers and scientists who want to make an impact in this world? Right. So one is making sure the focus on quality is always there — not compromising on quality, or on a high bar, for some short-term
Starting point is 00:49:06 gains, even if that requires you to stay the course for a longer period. So maintaining quality is, I think, one important thing. Then, making sure that you are passionate about the problem you're working on and actually care about the outcomes, because that's going to help you navigate the downturns that are bound to happen when you're working on challenging problems. So identifying the things you really care about and making sure you focus on them is important. Curiosity and drive have been important ingredients for identifying good problems and, eventually, doing good work. The importance of the question — identifying the right question to
Starting point is 00:49:58 address — is also extremely important. I tend to believe that even a suboptimal answer to the right question is more valuable than an optimal answer to a suboptimal question. So spending enough time making sure you are working on problems that you care about and that are impactful is, I think, important. And if it makes sense, seeing how the work is going to be grounded in the real world and how it may help end users — if that's something you care about —
Starting point is 00:50:36 is something to think about early on, along with how it's going to fit into the bigger picture — not just looking at the next incremental improvement that could be made. Wow, those are all great pieces of advice. Identify a question you're passionate about or interested in solving. Be curious. Never compromise on quality,
Starting point is 00:50:59 and, where relevant, think about how your work ties into society and the larger context. Those are amazing pieces of advice for the next generation of makers and creators. With that, thank you so much, Partha. This was a wonderful discussion, and we look forward to the many impactful contributions you will continue to make in this growing, evolving technology landscape. Thanks, Bruke. Great talking with you, and thanks for giving me this opportunity. ACM ByteCast is a production of the Association for Computing Machinery's Practitioner Board.
Starting point is 00:51:38 To learn more about ACM and its activities, visit acm.org. For more information about this and other episodes, please visit our website at learning.acm.org. That's learning.acm.org.
