Microsoft Research Podcast - Abstracts: NeurIPS 2024 with Pranjal Chitale

Episode Date: December 6, 2024

Pranjal Chitale discusses the 2024 NeurIPS work CVQA. Spanning 31 languages and the cultures of 30 countries, this VQA benchmark was created with native speakers and cultural experts to evaluate model performance across diverse linguistic and cultural contexts.

Transcript
Starting point is 00:00:00 Welcome to Abstracts, a Microsoft Research podcast that puts the spotlight on world-class research in brief. I'm Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot or a podcast abstract of their new and noteworthy papers.
Starting point is 00:00:24 Today, I'm talking to Pranjal Chitale, a research fellow at Microsoft Research India. Pranjal is co-author of a paper called CVQA: Culturally Diverse Multilingual Visual Question Answering Benchmark, and this paper is an oral presentation at this week's 38th Annual Conference on Neural Information Processing Systems, or NeurIPS, in Vancouver, BC. Pranjal, thanks for joining us today on Abstracts. Hi, Gretchen. Thanks for having me. So, Pranjal, give us an overview of this paper in a couple of sentences. What problem are you trying to solve, and why should people care about it? So we are witnessing some exciting times as LLMs are rapidly evolving as tools for countless use
Starting point is 00:01:10 cases. While most of these LLMs were initially leveraged for natural language processing tasks, they have now expanded across languages and modalities. However, a major gap lies in the availability of multimodal data for non-English languages. Therefore, most multimodal models might not cover non-English languages at all, or might rely heavily on translations of the associated text in English-centric datasets so as to support multiple languages. The drawback of this approach is that it often misses the cultural nuances of local languages. Another reason why this is not optimal is that the images are mostly Western-centric and therefore
Starting point is 00:01:52 would not reflect the local culture of many regions well. So this kind of bias can skew these models towards a Western perspective, raising concerns about the inclusivity and safety of the content they generate when serving a global population of multicultural and multilingual users. Therefore, for a truly inclusive AI ecosystem, models must demonstrate cultural understanding to ensure that the generated content is safe and respectful for diverse communities. Evaluating cultural awareness, though, is extremely challenging because how to define culture itself is an unsolved problem.
Starting point is 00:02:33 However, in this work, we are trying to take a step towards having a proxy that could measure cultural understanding. Well, talk about how you did this. What methodology did you use for this paper, and what were your major findings? Now that we have defined our broader problem, it is important to decide the scope of our solution, because, as we discussed, culture is an umbrella term, so we need to define a smaller scope for this problem. We chose visual question answering, a multimodal task and one of the most critical ones for the scope of this work. So, recognizing the limitations of existing VQA benchmarks, which often rely on translations and lack cultural representation, we developed CVQA, a culturally diverse multilingual VQA benchmark.
Starting point is 00:03:21 CVQA spans 30 countries and 31 languages and has over 10,000 culturally nuanced questions which were crafted by native speakers and cultural experts. So our focus was on creating questions which required what we term cultural common sense to answer. For instance, with just the image it is not possible to answer the question; you need some awareness of the local culture to be able to answer it. So these questions drew inspiration from knowledge of local culture. One important aspect of this dataset is that we include both local-language and English variants of the same question to allow robust testing of models across linguistic
Starting point is 00:04:03 contexts. I would say the crux of this effort is that while most prior efforts were small in terms of language coverage, often specific to a language group or a single country, we wanted this to be a much larger, global-scale collaborative effort. So this covers 31 languages across 30 countries. To build CVQA, we worked with qualified volunteers from diverse age groups and genders, ensuring that the questions authentically represented their cultures. The images collected were ensured to be copyright-free, grounded in culture, and safe for work,
Starting point is 00:04:41 with strict guidelines to avoid images that reflect stereotypes or violate privacy. We also had 10 categories covering topics ranging from daily life, sports, and cuisine to the history of the region, giving a holistic view of each region's culture. Each question was crafted as a multiple-choice task with challenging answer options which required both the image and cultural knowledge to solve. We also employed a maker-checker approach to ensure quality and consistency. So you've created the benchmark, you've tested it. What were your major findings? Now that we have created the benchmark, the next step is to evaluate how these multimodal models perform on this benchmark.
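To make the item format described above concrete, here is a minimal, hypothetical sketch of what a single CVQA-style example might look like in Python. The field names, the four-option format, and the sample values are illustrative assumptions, not the paper's exact schema.

```python
from dataclasses import dataclass

@dataclass
class CVQAItem:
    """One hypothetical CVQA-style example (illustrative fields, not the official schema)."""
    image_path: str        # culturally grounded, copyright-free, safe-for-work image
    question_local: str    # question written in the local language
    question_english: str  # English variant of the same question
    options: list[str]     # challenging multiple-choice options (four assumed here)
    answer_index: int      # index of the correct option
    category: str          # one of ~10 topics, e.g. "Cuisine" or "Pop culture"
    country: str           # country whose culture the item is grounded in
    language: str          # local language of the question

# Illustrative item: the image alone is not enough; answering also requires
# "cultural common sense" about the depicted dish and its local context.
example = CVQAItem(
    image_path="images/local_dish.jpg",                       # hypothetical path
    question_local="<same question, in the local language>",  # placeholder
    question_english="During which festival is this dish traditionally prepared?",
    options=["Festival A", "Festival B", "Festival C", "Festival D"],
    answer_index=0,
    category="Cuisine",
    country="<country>",
    language="<local language>",
)
```

The evaluation sketch further below assumes items of this shape.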
Starting point is 00:05:26 So we benchmarked several state-of-the-art multimodal models, including both open-source offerings like CLIP, BLIP, and LLaVA-1.5 and proprietary offerings like GPT-4o and Gemini 1.5 Flash. What we observed is that there is a huge gap in performance when we compare these proprietary offerings with the open-source models. GPT-4o was the highest-performing model, with 75.4% accuracy on English prompts and 74.3% accuracy on local-language prompts. However, the story is completely different when we go to open-source models. These open-source models significantly lag behind the proprietary models, and one key finding about these open-source models is that they perform even worse when prompted in the native language compared to
Starting point is 00:06:17 prompting in English. This potentially highlights that these models lack multilingual understanding capabilities, which may be because multilingual training data is pretty scarce. Yeah. So LLaVA-1.5 turned out to be the best open-source model. One thing to note is that LLaVA-1.5 performs well across a large set of English VQA benchmarks, but when it comes to cultural understanding, it is a pretty weak model. Further, we also did some ablations to understand whether adding location-specific information to the textual prompts has an impact, but we identified that it does not result in any significant performance improvements.
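As a rough illustration of the evaluation just described, the sketch below scores a multimodal model on CVQA-style items (the `CVQAItem` shape sketched earlier) with both English and local-language prompts and reports accuracy for each. The `model` callable, the prompt layout, and the letter-based answer parsing are simplifying assumptions; the paper's actual evaluation harness may differ.

```python
from collections import defaultdict
from typing import Callable, Iterable

# Assumption: `model` takes an image path and a text prompt and returns the letter
# ("A"-"D") of the option it picks; real multimodal APIs differ in their interfaces.
Model = Callable[[str, str], str]

LETTERS = "ABCD"  # assumes four options per question

def build_prompt(item: "CVQAItem", use_english: bool) -> str:
    """Format one multiple-choice question, in English or in the local language."""
    question = item.question_english if use_english else item.question_local
    lines = [question]
    lines += [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(item.options)]
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)

def evaluate(model: Model, items: Iterable["CVQAItem"]) -> dict[str, float]:
    """Return accuracy under English prompts and under local-language prompts."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        gold = LETTERS[item.answer_index]
        for variant, use_english in (("english_prompt", True), ("local_prompt", False)):
            prediction = model(item.image_path, build_prompt(item, use_english))
            correct[variant] += int(prediction.strip().upper().startswith(gold))
            total[variant] += 1
    return {variant: correct[variant] / total[variant] for variant in total}
```

A similar aggregation keyed on `item.category` or `item.country` would give the kind of category-wise breakdown discussed next.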
Starting point is 00:06:59 Further, we also conducted a category-wise analysis. As we had mentioned, there are 10 categories to which these images belong. What we observed is that certain categories like people and everyday life consistently saw higher accuracy across a large set of models. This is likely due to the abundance of human activity data in training datasets. However, when it comes to niche categories like cooking and food or pop culture, which are much more challenging, especially in local languages, these models struggle. Therefore, these are the kinds of highly diverse cultural contexts which need improvement. How's this work going to make an impact outside the lab and in the real world? CVQA is significant because it addresses a fundamental gap in how we evaluate vision-language, or multimodal, models today.
Starting point is 00:07:47 While proprietary models are making impressive strides, open-source models, which are more accessible and easier to deploy, significantly lag behind in terms of cultural awareness and safety. So CVQA fills this gap and provides a much-needed benchmark to help us identify these gaps in the first place. To fix them, we first need to identify the gaps, and whether we are progressing or not can be captured by this benchmark. For the real world, this benchmark does have some far-reaching implications. Models which understand culture are not just technically better,
Starting point is 00:08:20 but they would create interactions which are far more engaging, natural, and safe for users from diverse backgrounds. So this benchmark offers entirely new axes for improvement: cultural awareness and linguistic diversity. Therefore, by probing a model's ability to handle culturally nuanced questions, CVQA ensures researchers and developers think beyond accuracy and also focus on cultural awareness and inclusivity before shipping these models into production. Pranjal, what are the unanswered questions or unsolved problems in this field, and what do you plan to do about them? So while CVQA makes some strides in addressing cultural and linguistic diversity, there is still much more to explore in this space.
Starting point is 00:09:03 So this dataset only covers 31 languages and cultures, but this is just a subset of the incredible diversity that exists globally. Many languages and cultures remain underrepresented, especially those that are endangered or have limited digital resources. So expanding CVQA to include more of these languages would be a natural next step. Secondly, CVQA focuses only on single-turn question-answer pairs. But in reality, human interaction is often multi-turn and conversational in nature. So a multi-turn version of CVQA could better simulate real-world use cases and challenge
Starting point is 00:09:40 models to maintain cultural and contextual awareness over extended dialogues. Another interesting area is personalization. It would be very interesting if we could teach models to adapt to a user's cultural background, preferences, or even regional nuances in real time. This remains a significant challenge, although this benchmark could help us move a step towards our broader goal. Well, Pranjal Chitale, this is super important research, and thank you for joining us today.
Starting point is 00:10:13 To our listeners, thanks for tuning in. If you're interested in learning more about this paper, you can find it at aka.ms forward slash abstracts. You can also find it on arXiv and on the NeurIPS website. And if you're at NeurIPS, you can also go hear about it. See you next time on Abstracts.
