Latent Space: The AI Engineer Podcast - How to train your own Large Multimodal Model — with Hugo Laurençon & Leo Tronchon of HuggingFace M4

Episode Date: January 19, 2024

Latent Space is heating up! Our paper club ran into the >99 person Discord limit, oops. We are also introducing 2 new online meetups: LLM Paper Club Asia for Asia timezones (led by Ivan), and AI in Action: hands-on application of AI (led by KBall). To be notified of all upcoming Latent Space events, subscribe to our new Luma calendar (sign up for individual events, or hit the RSS icon to sync all events to calendar).

In the halcyon open research days of 2022 BC (Before-ChatGPT), DeepMind was the first to create a SOTA multimodal model by taking a pre-existing LLM (Chinchilla 70B - now dead?) and pre-existing vision encoder (CLIP) and training a “glue” adapter layer, inspiring a generation of stunningly cheap and effective multimodal models including LLaVA (one of the Best Papers of NeurIPS 2023), BakLLaVA and FireLLaVA.

However (for reasons we discuss in today’s conversation), DeepMind’s Flamingo model was never open sourced. Based on the excellent paper, LAION stepped up to create OpenFlamingo, but it never scaled beyond 9B. Simultaneously, the M4 (audio + video + image + text multimodality) research team at HuggingFace announced an independent effort to reproduce Flamingo up to the full 80B scale. The effort started in March, and was released in August 2023.

We happened to visit Paris last year, and visited HuggingFace HQ to learn all about HuggingFace’s research efforts, and cover all the ground knowledge LLM people need to become (what Chip Huyen has termed) “LMM” people.

What is IDEFICS?

IDEFICS is an Open Access Visual Language Model, available in 9B and 80B model sizes.
As an attempt to re-create an open-access version of Flamingo, it seems to track very well on a range of multimodal benchmarks (which we discuss in the pod).

You can see the reasoning abilities of the models to take a combination of interleaved images + text in a way that allows users to either describe images, ask questions about the images, or extend/combine the images into different artworks (e.g. poetry).

📷 From IDEFICS’s model card and blog post

The above demo screenshots are actually fine-tuned instruct versions of IDEFICS, which are again in 9B and 80B versions.

IDEFICS was built by connecting two unimodal models together to provide the multi-modality you see showcased above.

* Llama v1 for language (specifically huggyllama/llama-65b) - the best available open model at the time, to be swapped for Mistral in the next version of IDEFICS
* A CLIP model for vision (specifically laion/CLIP-ViT-H-14-laion2B-s32B-b79K - after a brief exploration of EVA-CLIP, which we discuss on the pod)

OBELICS: a new type of Multimodal Dataset

IDEFICS’ training data used the usual suspect datasets, but to get to par with Flamingo they needed to create a new data set.

Enter OBELICS: “An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents”:

* 115B text tokens
* 141M English documents
* 353M images

These are carefully curated and filtered by going through Common Crawl dumps between February 2020 and February 2023. We discuss the 2 months of mind-numbing, unglamorous work creating this pipeline.

There are a lot of mentions of “multimodal web documents”, which deserves some explanation.
We’ll show you instead of tell you:

You can see from this graph that OBELICS ends up outperforming the other image-text pairs dataset (LAION in this case) when stacked head-to-head.

You can view a subset of OBELICS and perform visualizations on them here:

2024 Update: WebSight et al

Most of this interview was recorded on Halloween 2023 at HuggingFace’s headquarters in Paris, in anticipation of an IDEFICS v2 release. However, several roadblocks emerged, including a notable scandal around CSAM in LAION-5B, which affected all models using that dataset. The M4 team have adopted a strategy of shipping smaller advancements in 2024, and the first ship of the year is WebSight, a dataset of 823,000 HTML/CSS codes representing synthetically generated English websites, each accompanied by a corresponding screenshot (rendered with Playwright). This is intended for screenshot-to-code workflows like Vercel’s V0 or TLDraw, and will be part of the dataset for IDEFICS-2.

As noted in our Best Papers recap, synthetic data is emerging as one of the top themes of 2024, and the IDEFICS/OBELICS team have wasted no time enabling themselves with it.

Timestamps

* [0:00:00] Intro
* [0:00:00] Hugo, Leo’s path into multimodality
* [0:09:16] From CLIP to Flamingo
* [0:12:54] Benchmarks and Evals
* [0:16:54] OBELICS dataset
* [0:34:47] Together Redpajama v2
* [0:37:12] GPT4 Vision
* [0:38:44] IDEFICS model
* [0:40:57] Query-Key Layernorm for training
* [0:46:40] Choosing smaller vision encoders - EVA-CLIP vs SigLIP
* [0:49:02] IDEFICS v2
* [0:52:39] Multimodal Hallucination
* [0:59:12] Why Open Source Multimodality
* [1:05:29] Naming: M4, OBELICS, IDEFICS
* [1:08:56] 2024 Update from Leo

Show Notes

* Introducing IDEFICS: An Open Reproduction of State-of-the-Art Visual Language Model
* IDEFICS Knowledge sharing memo: technical lessons and mistakes
* Victor Sanh memo
* OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
* Papers cited:
* BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
* Barlow Twins: Self-Supervised Learning via Redundancy Reduction
* CLIP paper: Learning Transferable Visual Models From Natural Language Supervision
* Vision Transformers paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
* Flamingo paper: a Visual Language Model for Few-Shot Learning
* April 2022 preprint from DeepMind, blogpost
* VQAV2 paper: Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
* OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge (https://okvqa.allenai.org/)
* MMBench: Is Your Multi-modal Model an All-around Player?
* Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
* SigLIP paper: Sigmoid Loss for Language Image Pre-Training
* Nougat: Neural Optical Understanding for Academic Documents
* MMC4 (Multimodal C4): An Open, Billion-scale Corpus of Images Interleaved With Text
* Dall-E 3 paper: Improving Image Generation with Better Captions
* GPT-4V(ision) system card from OpenAI
* Query-Key Layernorm trick: paper (Scaling Vision Transformers to 22 Billion Parameters), tweet
* EVA-CLIP: Improved Training Techniques for CLIP at Scale
* “We initially explored using a significantly bigger vision encoder (the biggest in open-access at that time) with EVA-CLIP. However, we ran into training instabilities very quickly. To lower the risks associated to the change of vision encoder, we decided to continue with laion/CLIP-ViT-H-14-laion2B-s32B-b79K which we have been using until that point. We will leave that swap for future iterations and will also consider using higher resolution images.”
* Datasets
* Together’s RedPajama-Data-v2: An open dataset with 30 trillion tokens for training large language models
* LAION COCO: 600M synthetic captions from Laion2B-en
* Chip Huyen’s writeup on LMMs
* Joseph Nelson of Roboflow on Latent Space
* HuggingFace M4
* HuggingFace timm: library containing SOTA computer vision models, layers, utilities, optimizers, schedulers, data-loaders, augmentations, and training/evaluation scripts. It comes packaged with >700 pretrained models, and is designed to be flexible and easy to use.
* Logan Kilpatrick declaring 2024 the year of Multimodal AI at AI Engineer Summit

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe

Transcript
Starting point is 00:00:05 Welcome to the Latent Space podcast, where we dive into the wild, wild world of AI engineering every week. This is Anna, your AI co-host. Happy New Year. Did you miss me? As an AI language model, I cannot miss you back. But I'm glad to stand in for Alessio while swyx is traveling. This time, in Paris at Hugging Face HQ. At the AI Engineer Summit in 2023, Logan from OpenAI pronounced 2024 the year of multimodality. I'm excited for 2024, which I think is really going to be the, I don't know if I can trademark this, but the year of multimodal models. It's a tongue twister, but also hopefully the domain is available, yearofmultimodals.com. No, don't buy it if it's available.
Starting point is 00:00:51 Yeah, so I'm excited. OpenAI has a ton of multimodal capabilities that are in the works. Some folks might have already tried some of these in ChatGPT in the iOS app or the web app today, things like vision, taking in images, describing them. We'll show that later on. Also, the ability to generate images. We've had this historically with DALL-E 2, but DALL-E 3 really, if folks have tried it, it takes things to the next level. So excited to show some of that today as well. In 2024, the Latent Space pod will offer deeper dives into multimodality. Today, we'll talk to Leo Tronchon and Hugo Laurençon of Hugging Face,
Starting point is 00:01:26 who trained IDEFICS, a fully open source reproduction of DeepMind's closed Flamingo model done from scratch, scaled all the way up to 80 billion parameters. By the way, dear listener, we are expanding our online meetups this year after the success of the Latent Space Paper Club. See the show notes for the new AI in Action and Paper Club Asia meetups. Watch out and take care. Hi. Thanks for having me at your beautiful office. It's really surreal for me to visit the Hugging Face Paris office, because I've always seen you guys online, and you've organized really huge meetups here in Paris. I want to learn everything about Hugging Face and you guys' work. So my name is Hugo.
Starting point is 00:02:07 I've been working at Hugging Face for two years. I started working on the datasets for the BLOOM language model. So it's the 176 billion parameter model that we open-sourced, and that was at that time the biggest one. And it was also multilingual. So I worked on the model and the dataset. And then I moved to multimodality with the current project, with IDEFICS and OBELICS. Now I am also working with Leo on version 2 of IDEFICS.
Starting point is 00:02:38 And Leo yourself? So my name is Leo. I joined Hugging Face a year and a half ago. I was a student still. So for the first six months, I was still an intern. But I started to work on multimodality right away. And then I spent all my time here in the research team working on multimodality and IDEFICS, which we open-sourced in August.
Starting point is 00:03:00 I think a lot of people are very interested in learning more about IDEFICS and multimodality in general. Bigger question first, how is Hugging Face organized? You told me some surprising details about the size of Hugging Face. You guys are a $4 billion company. Only 200 people, less than 200 people? About 160 people. Yeah.
Starting point is 00:03:16 And then how many people in the research team? This is like maybe 15. So between 10 and 20% of the company is research, I'd say. One, that's impressive. And then two, this is something that we discussed before. It's also unintuitive why Hugging Face needs to do research. I think the company has a good incentive to do research because most of the companies that do AI,
Starting point is 00:03:39 they have an incentive to get very good models out, but not the best model out. Their competitive advantage is to have the best model in-house that they can fine-tune for their customers. And then the open source is for show. But Hugging Face is one of the only companies that has an incentive to get the best model out there in the open. And that's why I think the research team is quite important. It's also important because all the tools that Hugging Face makes are used by the researchers, so they get all the feedback directly from us. And I think this is really useful to develop the tools behind it.
Starting point is 00:04:13 Are you talking about the Transformers library? Transformers library, diffusers library, datasets. So those seem to me more like sort of inference-type tools. Are there any sort of training tools that you do? Datasets is used for the training. Transformers we've been using for our modeling. Internally, we are also developing a library for training. I think it's going to be open source, but we'll see.
Starting point is 00:04:37 So for example, for the construction of OBELICS, we used the datasets library for our whole pipeline. The aim of our big open source projects is also to test our own internal libraries and see if they scale well. For example, the datasets guys, they never worked with datasets this big before. So this is a way also to test our solutions. This big meaning 114 million images, something like that?
Starting point is 00:05:08 More than 3 million... 300 million images. I've tried transformers and tried diffusers. I haven't tried datasets. Why do I need datasets? So I think datasets is great because you can load datasets that don't fit in memory. So it's a kind of virtual library, virtual pointers or whatever. Exactly.
Starting point is 00:05:26 And also you can easily filter rows of your datasets, map them, manipulate, and modify the content. So it makes it really easy. And also to do the operations in parallel, it's much easier with this library. What is the alternative to datasets? What do machine learning researchers use if they don't use datasets? Just do everything by hand or... Basically, manually paginate, write code to paginate it.
Starting point is 00:05:54 Yeah, yeah. That's what I did before, but it's just much faster this way, because everything is done for you. And then for multiprocessing, you just have to implement your function. That's great. Well, I think that's a good intro to the overall Hugging Face ecosystem. But I'm interested in the journey from BLOOM into computer vision for yourself. And then obviously you also had your own journey into multimodality.
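The workflow described here (reading data that does not fit in memory, filtering and mapping rows, and running the map step on several workers) can be sketched with the standard library alone. This is not the datasets API, only a minimal illustration of the idea. The one-JSON-object-per-line file layout, the `text` and `n_images` fields, and every function name are assumptions made for the example, and threads stand in for the multiprocessing mentioned in the conversation to keep the sketch short.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def iter_rows(path):
    """Yield one row at a time so the whole file never sits in memory."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def keep_row(row):
    # Hypothetical filter: keep documents that contain at least one image.
    return row["n_images"] > 0


def add_length(row):
    # Hypothetical map function: add a derived column to the row.
    return {**row, "n_chars": len(row["text"])}


def process(path, num_workers=2):
    """Filter rows as they are read, then map the survivors on a worker pool."""
    kept = (row for row in iter_rows(path) if keep_row(row))
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Executor.map preserves input order. The results are collected into
        # a list here, which is fine for a demo; a real pipeline would write
        # them back to disk in batches instead.
        return list(pool.map(add_length, kept))


def demo():
    rows = [
        {"text": "a cat on a sofa", "n_images": 1},
        {"text": "text only page", "n_images": 0},
        {"text": "two dogs", "n_images": 2},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "docs.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")
        return process(path)


if __name__ == "__main__":
    for row in demo():
        print(row)
```

The point of the design is the one made in the conversation: the user only writes `keep_row` and `add_length`, and the reading, batching, and parallelism live in shared code.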
Starting point is 00:06:13 A lot of people who are listeners and readers of Latent Space are also following that same journey, right? They only have some kind of NLP background. And now everyone is interested in multimodality. What was that journey like for you? Not from the research team, but from the hub team when they started hosting multimodal models and datasets. And quickly after that, we also realized that it would be a good idea to also train multimodal models ourselves to catch up with the proprietary models from DeepMind, Google, etc. I think that was the natural path for us.
Starting point is 00:06:52 But we didn't drop the idea of doing pure text models. So there is also a team for LLMs. It was just the creation of another team. For me the journey was a bit different because I came right away from my master's. So I had projects on computer vision where, for example, I don't know if you've heard of DINO probably, and there was another paper called Barlow Twins. Basically I had a project in which I tried to combine the two objectives, so I was more towards computer vision before joining, but I was really interested in doing multimodal. When I saw that there was an internship for this position,
Starting point is 00:07:29 I was glad. Then the team was already starting to do the project when I joined. And so I kind of joined the train. Just a demographic question. Is everyone here? Is everyone on your team here? We had a big shift in the team in the recent months.
Starting point is 00:07:49 Some people left for other startups or creating their own. But what is really interesting to me is that when we started the project, not a lot of people in the team had previously worked with multimodal models. Maybe only two of them. And we were like six, seven really working on the project. So it was really new to us, this field. And we also wanted to have this knowledge, because of course it's explained in the papers how to do things. But without doing them yourself, you still miss a lot of things and you miss the intuition. And we also wanted to build this knowledge of multimodality.
Starting point is 00:08:31 When building version two, we go much faster because we have a better intuition. Yeah, this, and also I think it talks about the philosophy of Hugging Face of having small teams with big impact. And so we started with a fairly big team for a Hugging Face standard, with six, seven people.
Starting point is 00:09:05 As Hugo was saying, we lost a few people to different startups, but the idea is still to go as fast, if not faster, with fewer people right now. And I think it's possible because of all the background we built in the previous iteration, and because small teams can work a lot faster since there's less communication, less overhead. Very cool. Yeah, so I do want to get into IDEFICS and OBELICS. I wanted to basically go over a little bit of introductory stuff for people, right? So in my mind, the two main multimodality papers that everybody should read are CLIP and Vision Transformers. Would you mark out anything else?
Starting point is 00:09:29 Or what do you personally get from those two papers? These two papers, I think, build a starting block. Because now what we are noticing is that for building super large models, we don't train them from scratch. We just take pre-trained, usually unimodal, models that we somehow mix together. So I think CLIP or ViT can serve as a pre-trained backbone that you combine with another pre-trained language model backbone to obtain something multimodal. So these are foundation models that play the same role
Starting point is 00:10:05 to us as Llama models or language models. The important thing to understand with Vision Transformers and CLIP is that they provide the basics for integrating images into this language modeling objective that we use. Then it's mostly a question of data and image resolution, and a lot of engineering goes there. And just a note on this, so some research shows that when you use a pre-trained vision encoder that was trained also with a text objective, for example,
Starting point is 00:10:40 a contrastive loss, like the CLIP loss, it's better to use this type of vision encoder than vision encoders trained only on classification or a unimodal task if you are building multimodal models. So if you are building a vision language model, it's better to take as a pre-trained backbone a pre-trained vision encoder that has been trained using text. Is that not intuitive? Imagine you can take a vision encoder that is super good at classification. Then you can imagine that the embeddings that
Starting point is 00:11:15 you get from your vision encoder are a super start to plug into your language model. It's intuitive, but it could clearly work to have a vision encoder trained without text at all. But researchers have shown that it's better to use this contrastive loss. And once you have those backbones, the question is really how you integrate it, like how you integrate both of them into the architecture. It turns out with very lightweight updates, you take the embeddings that come from CLIP, the output of CLIP, and you have just a linear layer that you train on top of this, then when you pass this to the language model and you only train this part,
Starting point is 00:11:55 you can already get pretty good results in multimodality. You don't have to train all the parameters when you train the multimodal model. You can just train the adapter. Is this what was spelled out in Flamingo or you just kind of derive some kind of transform that you're happy with? So in Flamingo, you introduce a lot more parameters, but there's still like those cross attentions that you insert in the model that are new. Those you train from scratch, but the rest of the model, the language model backbone and the vision backbone, they are frozen during the training. So you never update it. So it's a different type of adapter, but it's more heavyweight
Starting point is 00:12:36 than what you could have in recent papers. Now with more parameters, you also often get better performance. Is it necessary to have all those parameters when you only train the adapter part? There's no clear answer yet. Okay, I think that brings us up to date. Oh, except for benchmarks, I wanted to introduce people to the concept of how hard it is to evaluate benchmarks
Starting point is 00:13:00 for multimodality. So there are the classic academic benchmarks. For example, VQAv2, that's for visual question answering. There are also the image captioning benchmarks. COCO. COCO, exactly, Flickr. However, one really
Starting point is 00:13:21 important thing that we noticed is that on these benchmarks, the performance of your model heavily depends on how you formulate the answer. For example, for visual question answering tasks, you will have a question and an answer. This
Starting point is 00:13:38 answer will be generated open-endedly by your model. You just prompt the model with your question, and then it will generate some words until the end-of-sequence token is reached. But the thing is that if you have a question and the answer
Starting point is 00:13:54 is simply "no", if your model says... you can count it wrong or you can count it as... There's ways to adjust for that, like, you know, some kind of distance metric or something. But it's hard. It's hard.
Starting point is 00:14:08 Use another model. So just the way you're formulating the answer heavily impacts your performance. And there's the fact that some people are fine-tuning directly on the benchmark to try to optimize this formulation, or the fact that other people are doing a few-shot evaluation. So few-shot is giving the model examples of how to formulate the answer. So it makes it hard to compare the models because they are not evaluated the same way, even if it's on the same benchmarks. So this is a problem.
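To make the formulation problem concrete, here is a small sketch of answer scoring for open-ended visual question answering. It uses a simplified form of the soft accuracy commonly associated with VQAv2 (full credit once at least three annotators gave the same answer); the normalization rules, the function names, and the example answers are assumptions for illustration, not the official evaluation script.

```python
import re
import string


def normalize(answer):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    answer = answer.lower().strip()
    answer = answer.translate(str.maketrans("", "", string.punctuation))
    answer = re.sub(r"\b(a|an|the)\b", " ", answer)
    return " ".join(answer.split())


def vqa_accuracy(prediction, human_answers):
    """Soft accuracy: min(1, number of matching human answers / 3)."""
    pred = normalize(prediction)
    matches = sum(1 for a in human_answers if normalize(a) == pred)
    return min(1.0, matches / 3.0)


# Hypothetical annotations for one question: 8 people said "no", 2 said "yes".
human = ["no"] * 8 + ["yes"] * 2

for generated in ["No.", "no", "No, there is no dog in the image."]:
    print(repr(generated), vqa_accuracy(generated, human))
```

The third generation is semantically correct but scores zero under exact matching after normalization, which is the effect described above: a model that was fine-tuned or few-shot prompted to answer in the benchmark's terse style is rewarded for formatting, not only for understanding.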
Starting point is 00:14:39 So you will have all these academic benchmarks. And then you will have these new benchmarks that are not commonly adopted yet, but are created basically with large language models like GPT-4. And people are prompting GPT-4 with images and asking it to automatically generate questions and answers. And then we can evaluate our models. This is very new. The evaluations in multimodal, like for multimodal models, are still a bit rough,
Starting point is 00:15:12 I think. Even for language models, there's discussions of if benchmarks are really the way to go for some tasks. When you do instruction tuning, for example, for a language model, or RLHF,
Starting point is 00:15:23 you shouldn't be evaluating on the same benchmarks, from the point of view of a lot of people. On multimodality, it's also that the quality of the datasets we're evaluating on is not super clean. I retweeted something recently, someone that was showing
Starting point is 00:15:37 like failure cases of VQAv2, I think. And it was interesting that sometimes the questions and answers are like super obvious, and sometimes it's so far away that not even a human would get it. Yeah, and I think sometimes it's just plain wrong. So it's also the quality of the datasets to evaluate on, and the diversity of them, that matters a lot. And right now we still have a few blind spots in evaluations, but it's really interesting to see the field move on this, because as we have a lot more multimodal models, the evaluation benchmarks are improving.
Starting point is 00:16:10 Maybe four or five came out that were nice in the past two months. Off the top of your head, can you name any of these? MMBench. POPE. There's SEED, and there's a brand new one, but I don't know if it's out yet. Like, the paper is out lately.
Starting point is 00:16:27 It's called Hallusion-something, HallusionBench, I think. And this one, I read the paper. I don't know the size of it, but from the examples they give in the paper, it seemed really, really interesting and hard to beat. Yeah, I'm excited about this one, mostly. This is like the new race, right? In the last five years, there was a race towards like sort of common sense benchmarks in NLP,
Starting point is 00:16:49 but now this is the new. Yes, it's getting to multimodal. Very cool. Maybe we can go into the work that you did for OBELICS. Let's describe the size of the dataset, what you did to clean it up. A lot of these things start from Common Crawl, and Common Crawl is great, but also it's very messy. So first, why we wanted to do it,
Starting point is 00:17:10 we were trying to replicate Flamingo. And Flamingo built their own dataset of interleaved image-text web documents. I think it contained more than 50, no, 100 million images, if I'm not wrong. And it was based on, for Flamingo? For Flamingo. Yeah, for Flamingo.
Starting point is 00:17:31 And it was based on like 50 million web pages. However, the dataset was not open. So I talked to the authors, and one of the reasons it was not open is because they used their Google PageRank algorithm to try to know in advance which websites to target in their dataset. So meaning higher PageRank, higher ranking SEO sites have higher weight. Yeah, exactly.
Starting point is 00:17:57 So that's how they scraped the websites. That's one of the reasons why they don't. Many reasons to not open source their datasets. So we wanted to build a dataset that was at the beginning similar to this one, and so we made it even larger, and fully open source. Because we believe foundation multimodal models trained on interleaved image-text documents are better than the ones trained only on pairs. Maybe to go further into that point, what we found,
Starting point is 00:18:32 and that is interesting, is it's for the VQA tasks that this dataset is really important. For the captioning tasks, you have an image-text dataset like LAION, and it's great. And it's going to improve pretty well, like the alignment is strong. Just to explain, alignment is like basically images that are aligned with the text. So the text means something that is related to the image. And so for the captioning tasks, LAION will be enough, and maybe for some OCR tasks, although it's still weak on this one, and you could use improvements on this one. And then the OBELICS dataset is really important for reasoning,
Starting point is 00:19:11 to have the model be performing on VQAv2, OK-VQA. So those depend heavily on the web documents. It was interesting to see the dichotomy when we use only one dataset or the other. That's in the OBELICS paper. Essentially, the pairs, image-text pairs, are good for the alignment. Just align what you see in an image with the corresponding text.
Starting point is 00:19:36 But if you want to have more abilities to reason, it's better to have a higher proportion of web documents with longer context. Also, it's not the only reason why we wanted to do it. Why we wanted to do it is because the image-text pairs are super noisy. So, well, the advantage of it is that it's super easy to collect. You just scrape a lot of HTML code. And anytime you find an image with the corresponding alt text, you download the image and you bring the alt text and you have your pair. Building a web document is much harder because you have to properly clean the text, you
Starting point is 00:20:15 have to check what you want to keep, what you want to discard. So it's obviously much harder. However, you have also a longer context for each image. So there is really a parallel to be made and it's not the same type of data, because on image-text pairs, you have an image and the direct caption of it. On web documents, you have, well, this is essentially what you see when you open any website. So you have a text, then sometimes an image, another text, an image, and then the alignment here is weak, in the sense that the text doesn't necessarily describe the image perfectly.
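The difference between the two kinds of data can be shown on a single page. The sketch below uses Python's standard `html.parser` to produce both views: (image, alt text) pairs, and an interleaved sequence of text and images in page order. It is a toy illustration under stated assumptions, not the OBELICS pipeline; the class name, the sample page, and the choice to skip only `script` and `style` are made up for the example, and all of the cleaning and filtering described in the conversation is left out.

```python
from html.parser import HTMLParser


class InterleavedExtractor(HTMLParser):
    """Collect text and images in page order, plus (image, alt text) pairs."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.document = []  # interleaved items: ("text", str) or ("image", url)
        self.pairs = []     # (image url, alt text), only when alt text exists
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "img":
            attributes = dict(attrs)
            src = attributes.get("src")
            if not src:
                return
            self.document.append(("image", src))
            alt = (attributes.get("alt") or "").strip()
            if alt:
                self.pairs.append((src, alt))

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and self._skip_depth == 0:
            self.document.append(("text", text))


def extract(html):
    parser = InterleavedExtractor()
    parser.feed(html)
    parser.close()
    return parser.document, parser.pairs


# Hypothetical page used only for this example.
PAGE = """
<html><body>
<h1>Hiking in the Alps</h1>
<p>We started early in the morning.</p>
<img src="https://example.com/trail.jpg" alt="A rocky trail at sunrise">
<p>The view from the top was worth it.</p>
<img src="https://example.com/banner.png">
<script>trackVisit();</script>
</body></html>
"""

document, pairs = extract(PAGE)
for item in document:
    print(item)
print(pairs)
```

The pair view keeps one image and its short caption and drops the second image entirely because it has no alt text. The interleaved view keeps both images together with the surrounding paragraphs, which is the longer, more weakly aligned context discussed above.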
Starting point is 00:20:56 However, they share the same context. So this is another type of data. And we also think that this diversity helps to improve the performance. How much? So this sounds good in theory, but you had no way of knowing. I mean, I guess you talked to the Flamingo authors and they just told you that this is what they did. You mean like the proportion of... Exactly. Even them, they built their dataset and they told me, yeah, we use this proportion, but maybe we could have used
Starting point is 00:21:30 less, or we don't know. So we didn't really know in advance the proportion of web documents you would need compared to pairs. We did an ablation though. So basically we can control how much we sample web documents versus LAION pairs. And so we did an experiment where we moved those probabilities a lot. It was very inconclusive.
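Controlling how often a training example comes from web documents versus image-text pairs amounts to sampling the source of each example with a fixed probability. A minimal sketch of that idea follows; the function name, the source names, and the restart-on-exhaustion behavior are assumptions for illustration, not the team's training code.

```python
import random
from collections import Counter


def mixture_stream(sources, weights, num_samples, seed=0):
    """Yield (source_name, example), choosing the source of each sample at random.

    `sources` maps a name to a non-empty, re-iterable collection of examples.
    `weights` maps the same names to sampling probabilities.
    """
    rng = random.Random(seed)
    names = list(sources)
    iterators = {name: iter(sources[name]) for name in names}
    probs = [weights[name] for name in names]
    for _ in range(num_samples):
        name = rng.choices(names, weights=probs, k=1)[0]
        try:
            example = next(iterators[name])
        except StopIteration:
            # Restart an exhausted source so the mixture ratio stays fixed.
            iterators[name] = iter(sources[name])
            example = next(iterators[name])
        yield name, example


# Tiny stand-in datasets, hypothetical.
sources = {
    "web_documents": [f"doc_{i}" for i in range(5)],
    "image_text_pairs": [f"pair_{i}" for i in range(5)],
}

for p_web in (0.25, 0.5, 0.75):
    weights = {"web_documents": p_web, "image_text_pairs": 1.0 - p_web}
    counts = Counter(name for name, _ in mixture_stream(sources, weights, 10_000))
    print(p_web, counts["web_documents"] / 10_000)
```

An ablation like the one described would train one model per value of `p_web` and compare benchmark scores; the sketch only shows how the knob itself works.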
Starting point is 00:21:59 So there was no, like, we had a range of, like, what was a good range for how much web documents we should have versus LAION pairs. But overall, past a certain threshold, it didn't matter too much. And when you measure performance, do you split it out into things like individual tasks, like segmentation or detection or anything like that? Or is it just VQA? We don't have detection
Starting point is 00:22:29 or segmentation, because the model is basically, like, it outputs text. So we can't really evaluate on those benchmarks. But we did captioning, visual question answering, text recognition a little bit, but it was done through captioning datasets or VQA datasets, and we did classification. So those are the categories that were doable with the setup, like the model we put together. But I know that
Starting point is 00:23:01 Qwen-VL, for example, they use bounding boxes in the datasets, so they can do detection, yeah. You also mentioned, by the way, that resolution was a big deal
Starting point is 00:23:11 for you, image resolution. And how do you deal with that in OBELICS? So resolution is important when you want to do OCR, particularly.
Starting point is 00:23:22 Because otherwise it's just fuzzy, right? If it's... Exactly. It's too small. If you can't see it, the model probably struggles as well. Well, not just that. Models typically see much smaller images than we do, right?
Starting point is 00:23:35 I don't know what resolution you guys have. It's like a resolution of like 480 by whatever, right? It's super small. IDEFICS is smaller than that. Yeah. It's smaller than that. It's 224. So you're going to lose a lot of detail.
Starting point is 00:23:47 Yep. Definitely. On top of this, you have the vision model, and it outputs a certain number of tokens depending on the image you put in, right? And above the model, we have a perceiver, so we reduce the number of tokens that come out of the vision model. By doing this, we make it even less,
Starting point is 00:24:10 like even harder, I guess, for the model to be precise on those very, very small details. So this is something that happened with IDEFICS version one, and probably, I mean, we're going to improve on this for version two. But it's really, really important for OCR, that's for sure. We think it can also be important for visualizing details, improving on those things as well. For example, it's like a finger or like a hand is a certain color or is doing a certain thing.
Starting point is 00:24:42 If all your images are tiny, it's going to be hard for the model to pick up on that. So we are also bottlenecked by what is available on the open source side. For example, now Google recently released SigLIP, and it's a CLIP, but there is a version of it. It's called SO, shape-optimized, and it's of size like 400 million parameters. And it is trained with 384 resolution images. So it's a bit bigger than the 224 that we had. So I think this is the largest resolution you can get with open source models. We are of course bottlenecked by this.
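The relationship between image resolution, vision tokens, and the perceiver can be made concrete with a little arithmetic. A Vision Transformer cuts a square image into fixed-size patches and emits one token per patch. In the sketch below, the patch size of 14 for the CLIP encoder is read off the model name in the show notes (CLIP-ViT-H-14); the patch size of 14 for the SigLIP checkpoint and the figure of 64 perceiver latents are assumptions for illustration, and any class token is ignored.

```python
def num_patch_tokens(image_size, patch_size):
    """Patch tokens for a square image; partial border patches are dropped."""
    per_side = image_size // patch_size
    return per_side * per_side


def pixels_per_output_token(image_size, num_output_tokens):
    """Average number of input pixels summarized by each output token."""
    return (image_size * image_size) / num_output_tokens


NUM_LATENTS = 64  # assumed number of perceiver outputs, for illustration only

for name, image_size, patch_size in [
    ("CLIP ViT-H/14 at 224", 224, 14),
    ("SigLIP at 384 (patch size 14 assumed)", 384, 14),
]:
    patches = num_patch_tokens(image_size, patch_size)
    print(
        name,
        "| patch tokens:", patches,
        "| pixels per patch token:", pixels_per_output_token(image_size, patches),
        "| pixels per latent after resampling:",
        pixels_per_output_token(image_size, NUM_LATENTS),
    )
```

Each patch token covers a 14 by 14 pixel area, but once a resampler squeezes the sequence down to a fixed number of latents, each latent has to summarize a much larger area, and that area grows with resolution. This is the trade-off described above between keeping the language model's visual sequence short and preserving fine detail for tasks like OCR.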
Starting point is 00:25:24 Usually Google, they release like this version of SigLIP, but they didn't release the better version of it. So we are definitely limited by this. Well, so it sounds like it doesn't really affect OBELICS. Yeah, exactly. So when creating the dataset, we simply downloaded the images at full resolution,
Starting point is 00:25:46 and after that, you resize them on the fly during the training. But certainly one of the biggest challenges when making OBELICS was dealing with all these images because they weigh a lot. Aren't you tempted? So to me, OCR is extremely important. Aren't you tempted to run some kind of extra data augmentation thing to say like, oh, you know, on the OBELICS dataset, run some OCR pipeline on it so that you augment your... Yeah, that's really interesting what you mentioned because this is also
Starting point is 00:26:20 one thing that we want to do in the near future. And also people have kind of done that for Nougat, right? So Nougat is a model from Facebook, and they just try to have a vision model that can read. So they fed the model PDFs with the associated text, and the model is pretty strong. So maybe if we inject this data in our pre-training, it would definitely help. And this is one of the threads we're exploring to have a lot more OCR data. We have a team actually at Hugging Face that works on Document AI with Ross Wightman.
Starting point is 00:27:01 Do you know the timm library for vision models? No? Okay. But yeah, basically he's been working on Document AI and on getting a very strong open source model that can read. And is that primarily PDFs? Yeah. Screenshots of PDFs or just raw PDFs? Is there a difference? I think screenshots, but I'm not familiar with the dataset yet. We may use it as well for our training in the future.
Starting point is 00:27:29 It's interesting that documents obviously are a very, very important form of multimodality that is very OCR heavy, very focused on charts. I feel like you could classify sort of three types of multimodal models. Like one is the traditional classification types of models, the CLIPs of the world. And then two is the VQAs, where there's a general image of, like, a webcam,
Starting point is 00:27:55 you know, where, like, there's three people in this image and all that. And then the third would be like Document AI. Yeah. I don't know if that's... You can combine them all, actually. Can you combine them?
Starting point is 00:28:06 I don't know. But like for... Actually, actually you don't know. No, but like for a general model like GPT-4, it does all of this. Yeah. Yeah. Even if it's not like,
Starting point is 00:28:17 trained purely on classification, you can classify the... Deeper the better, right? One God model to rule them all. I don't know if it's like a mixture of, you know, different models. Yeah, GPT-4V, like, is probably built upon GPT-4, but adapted for images. That would make sense. But then it's like the DALL-E 3 model is different from,
Starting point is 00:28:42 like it's separated to create different images. Yeah, something they just introduced was, now you don't have to switch modes, right? Now you can just kind of do one model and it just does its own routing, which is kind of very interesting. And then the other thing that was mentioned but not released
Starting point is 00:28:58 was that they could add vision to GPT-3.5, not just adding it onto four. It's not a variant of four. It is a pluggable vision module that you can kind of add to three. They never released it. Yeah. Anything else that people should know about OBELICS?
Starting point is 00:29:14 Obviously, this is like the big work. You mentioned in our prep that you expect it to last for a while because there's a lot to mine from it. I think it's big enough to train large models. So we trained our 80B-parameter model on it. So it's definitely sufficient for the next one, two years. We took a lot of care curating the data, like regarding the text quality and the image quality. And I think we... So there is also an alternative, sorry, to OBELICS.
Starting point is 00:29:50 It's called Multimodal C4, MMC4. It was published at around the same time as us. However, we think we took more care in the deduplication part, to deduplicate the images and the text, and also on the text quality. This is measured, of course, qualitatively, just by looking at and exploring our documents, but also quantitatively by looking at certain metrics like perplexity. We obtain
Starting point is 00:30:22 good scores that match the best NLP-only datasets. This was a win for us. For someone who's never really dived into these datasets, I mean, I can open up a dataset and manually look through these things, but how do you measure perplexity in a multimodal dataset? So perplexity essentially, it's something really simple. You take a small model and you feed it the tokens of your text and then you measure the probability of the document.
Starting point is 00:30:56 Of course you normalize by the length so that everything is equally treated. And then the thing is that we obtained perplexity scores that match the distribution of the documents from The Pile, and The Pile is documents that were taken from good quality sources like Wikipedia, arXiv and so on. It's not something that you can really scale. However, we also noted that we obtain better perplexity scores than the ones from C4, the big dataset, or OSCAR. Based on your own measurements, right? Because obviously for Multimodal C4, people would not share it. Yeah, just based on the text.
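[Editor's sketch] The length-normalized perplexity described here can be illustrated with a toy example. This is a minimal sketch, not the team's pipeline: the "small model" is replaced by an add-one-smoothed unigram model, and the names `train_unigram` and `perplexity` are hypothetical. Perplexity is computed as the exponential of the average negative log-probability per token.

```python
import math
from collections import Counter


def train_unigram(reference_tokens):
    """Fit a toy add-one-smoothed unigram model (a stand-in for the small LM)."""
    counts = Counter(reference_tokens)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot shared by all unseen tokens

    def prob(token):
        return (counts.get(token, 0) + 1) / (total + vocab_size)

    return prob


def perplexity(tokens, prob):
    """exp(average negative log-probability per token), i.e. length-normalized."""
    if not tokens:
        raise ValueError("cannot score an empty document")
    negative_log_likelihood = -sum(math.log(prob(t)) for t in tokens)
    return math.exp(negative_log_likelihood / len(tokens))


prob = train_unigram("the cat sat on the mat".split())
clean_score = perplexity("the cat sat".split(), prob)
noisy_score = perplexity("zzz qqq xxx".split(), prob)
print(clean_score < noisy_score)
```

Because the score is averaged per token, repeating a document does not change its perplexity, which is what lets documents of different lengths be compared on one scale.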
Starting point is 00:31:48 So this is how we computed perplexity and how we assess the quality of the dataset. But you could also run a multimodal model on this and get the perplexity from it. It would not only be measuring the quality of the text, but also the alignment would come into account. Alignment of image and text,
Starting point is 00:32:11 because it would be easier for the model, if the text is very heavily related to the image, to get the next token. And then one more question, just about the whole process. Like, how long does it take to make OBELICS? So we spent a good amount of time at the very beginning of the project,
Starting point is 00:32:28 just simply to iterate on the pipeline, like how we collect the HTML code, how we clean it. We had to go through all of the HTML tags to see which were important. So this is, yeah, an engineering part. It's, you have to be really... It sounds very boring. But very important.
Starting point is 00:32:49 It is boring and important. That's how you get the good data. Yeah, but this is also why people don't do it. Yeah. But the industry has not converged on a shared set of tools that everybody uses for this. You're just parsing raw tags yourself. Yeah, we did that, yeah. Because we found it was better.
Starting point is 00:33:11 We parsed raw HTML code. So we had to clean the DOM tree, select the good HTML nodes, correctly extract the text and the images, clean, deduplicate. So as I said, there was just a good amount of time at the very beginning of the project just finding the pipeline. So maybe one month, but we were like one or two people on this and it was really exploratory. And then for actually making the dataset, downloading all the images, running all the processing scripts, and so on, I think it took us like up to two months. Yeah, but then there's also like iterations through the project where we think we should do filtering on this on top of what we were already doing.
Starting point is 00:34:01 So we improve on the dataset as we go. Something I would think makes sense for the industry is kind of an open source set of deduplication rules, because everyone seems to be reinventing this from scratch every time. Well, for deduplication, you have to do it all the time from scratch because it depends on your original set of documents. Everyone draws from Common Crawl. Like it's... Yes, but they're not from the same dumps or not from the... But that's true. Yeah. Someone should take all the Common Crawl documents. What's been done recently? I don't even need the same exact rules. I just need to be like,
Starting point is 00:34:39 oh, these smart guys thought about that, I should include that, right? That's very simple. If you have 100 rules, someone else has 80 rules, maybe they have something that you don't have. Exactly, exactly. I think we did a little bit of this, because there's a bit of literature, like, right around.
Starting point is 00:34:57 Even the ones that we designed before for the dataset of BLOOM, we took a... I'm sure you used a bunch of that. And OBELICS is big, but the datasets that come for NLP right now are also, like, huge. Did you see that Together one from yesterday? Yes. Yes. Impressive.
Starting point is 00:35:14 30 trillion. So they said they had a raw data set of 100 trillion and they got a cleaned high quality data set of 30 trillion, which means they kept 30% of Common Crawl, which is still too high. Yeah. So I feel the idea of the project. And I think I agree with this idea is that,
Starting point is 00:35:37 is that everyone can set the thresholds for the filters as they like. Yeah, so for each document they computed the filter values for a set of rules, and then you can decide whether you want to keep the document or not. So you define your own rules. I think they removed, as you said, the 70% of the dataset that you really can't do anything with, and then for the remaining part they let people decide. But of course, if you actually want to train something on it, it will be much smaller, I guess, because you will remove all other things.
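[Editor's sketch] The idea of shipping precomputed filter values and letting each user pick their own thresholds can be sketched as follows. The signal names (`word_count`, `duplicate_line_ratio`), the example documents, and the `keep_document` helper are all hypothetical; they are not taken from the dataset being discussed.

```python
def keep_document(signals, rules):
    """Return True only if every user-defined rule accepts its signal value."""
    return all(check(signals[name]) for name, check in rules.items())


# Each document carries precomputed quality signals (values are made up).
documents = [
    {"id": "a", "signals": {"word_count": 850, "duplicate_line_ratio": 0.02}},
    {"id": "b", "signals": {"word_count": 12, "duplicate_line_ratio": 0.00}},
    {"id": "c", "signals": {"word_count": 400, "duplicate_line_ratio": 0.60}},
]

# A user chooses their own thresholds; another user could be more lenient.
strict_rules = {
    "word_count": lambda value: value >= 50,
    "duplicate_line_ratio": lambda value: value <= 0.10,
}

kept_ids = [d["id"] for d in documents if keep_document(d["signals"], strict_rules)]
print(kept_ids)
```

The design choice is that the expensive part (computing signals over the whole crawl) is done once, while the cheap part (thresholding) is left to whoever trains on the data.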
Starting point is 00:36:19 But data is super important, and my point is also that it's still very early in multimodal. We're seeing now in NLP those 30-trillion datasets. It's really important. The first thing they said when they released their goals is how impressed they were that they were capable of doing the data pipeline in three months. They didn't talk about, like, training the model or not. It's just like they knew what they were doing for this.
Starting point is 00:36:44 It's straightforward. But creating the whole data pipeline, this is what took them a lot of time. So I think that was one thing that struck me when they made their own. Is it confirmed that they have 8 trillion tokens? I don't know. They won't say.
Starting point is 00:36:59 Could be this. It could be more, I think. Given we have now an open source dataset of 30 trillion, I wouldn't be surprised that they have more. I just keep coming up with questions. One more thing on datasets. Did you read the GPT-4 Vision system card that they put out?
Starting point is 00:37:15 They put out this paper describing a little bit of their process. Something like 95% of the labels for GPT-4 Vision was augmented by GPT-4 itself. And I was just curious, like, how much room is there for open source augmented datasets? I think there's a lot of room, and I think synthetic data works a lot, like works great, particularly in multimodality.
Starting point is 00:37:43 Recently, the recent papers have been using, for example, LAION-COCO instead of LAION. This is just captions on LAION created from BLIP. So it's not even like super, like extraordinary, but it does bring more performance with a lot fewer examples because the alignment is more straightforward, I guess. And we've been observing this recently because we've been using it in our recent experiments.
Starting point is 00:38:11 I think the potential for synthetic data in multimodality is very big and very underutilized right now. Even DALL-E 3, they said that they heavily used synthetic captions to train their model. Also, other foundation multimodal models like GRIP, they train on synthetic captions. Yeah. Actually, I think I was referencing the DALL-E 3 paper, not the GPT-4 Vision paper. Because they didn't actually put out a paper for GPT-4 Vision. Cool.
Starting point is 00:38:41 And then, so you created OBELICS and then you trained IDEFICS. Yeah, on it, on top of it. In addition, so we train IDEFICS on OBELICS, but also on other datasets, like LAION and Public Multimodal Datasets, which is just a set of datasets that were open-sourced at the moment. Yeah, Conceptual Captions. And, yeah, could you just take us through the IDEFICS process? You created a smaller version and then you scaled
Starting point is 00:39:11 up to the full Flamingo 80 billion size? Actually, it was the other way around. Really? Well, we tested that it worked at smaller scale, of course. Yeah. But we did not fully train our small model before doing the big one. We fully trained our big model. That's unusual. Yes. Yes, they were training pretty much at the same time, but we needed to launch the big model before, also in terms of timing, because it took a long time to train. So it was more like managing compute resources. But the whole journey was a bit longer than just this moment when we trained the IDEFICS model, because we started with the objective of matching Flamingo's performance. But the open source models that were out there, they were just not good enough. So you have OPT, you have GPT-Neo, but it's just so far below Chinchilla that it's almost impossible to reach the performance.
Starting point is 00:40:15 Suddenly, when Llama came out, it started making sense and we started matching the performance. And from there, we were able to train the big IDEFICS models. A long journey, I think because we had a lot of things to learn, because we had quite a lot of instabilities. We shared a blog post about... The checkpointing every 250 steps? Yeah, we were checkpointing every 250 steps, but we had to restart a few times. Was that the instability you're talking about,
Starting point is 00:40:42 or is this just something else? That was the final training, where we still had the instabilities, but a lot less. Like before this, we were struggling to train the model at all. What saved us at that moment was the query-key layer norms. I don't know if you've heard about this, but basically there's a paper from Google where they scaled up the vision
Starting point is 00:41:01 transformer to 22 billion parameters and they needed this trick to keep the stability of the training. Without this, if I remember the mechanism well, you would get something along the lines of hard attention. And once you get this, the model would get very unstable. So you needed to normalize the queries and keys to basically avoid this. Anyway, so when we did this, we were able to train further. And that was really useful for sure. So query-key layer norms if you want stability. It sounds like a trick that is repeatedly applied whenever you have instabilities.
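[Editor's sketch] The mechanism described here, where large queries and keys push the softmax toward near one-hot "hard" attention, and normalizing them avoids it, can be shown with a small NumPy example. This is a simplified illustration under stated assumptions: the layer norm has no learned scale or bias, there is a single head, and the factor 50 is an arbitrary way to simulate exploding activations. It is not the IDEFICS implementation.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize each row to zero mean and unit variance (no learned parameters)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attention_weights(q, k, qk_layer_norm=False):
    """Scaled dot-product attention weights, optionally normalizing q and k first."""
    if qk_layer_norm:
        q, k = layer_norm(q), layer_norm(k)
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))


rng = np.random.default_rng(0)
q = 50.0 * rng.normal(size=(4, 16))  # simulated large-magnitude queries
k = 50.0 * rng.normal(size=(4, 16))  # simulated large-magnitude keys

plain = attention_weights(q, k)
normed = attention_weights(q, k, qk_layer_norm=True)
print(plain.max(axis=-1).mean(), normed.max(axis=-1).mean())
```

Without normalization the largest attention weight in each row is essentially 1, so each query attends to a single key; with query-key layer norm the logits stay on a fixed scale no matter how large the raw activations grow.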
Starting point is 00:41:43 You just do a layer norm or softmax or... Yeah, you can. Because usually it's parameters that explode. Yeah, yeah. Become too big. So you need some sort of regularization on them. And first you have to inspect which parameter explodes first. And then you put a regularization on them.
Starting point is 00:42:06 But it's really tricky to see. When you go in the gradients and in the activations and you try to see where it blows up, when it blows up and why, everything is interlinked. It's really hard to pinpoint one particular layer. So it was a tough one to crack. But very interesting. It's also hard because you never know exactly.
Starting point is 00:42:32 When you're in the process, you don't know where the instability comes from. It could come from bad data. It could come from a bug. It could come from the size of the model, the learning rate that's too high, the warm-up that's not... Like, there's a lot of hyperparameters and potential bugs that can come into account, and the debugging is very, very tough. So do you have a checklist of... What do you look at when you see loss explode or whatever?
Starting point is 00:43:01 You see the loss explode, you look in the activations, possibly the gradients, to see where it blows up. Across all your parameters. Yeah, you find a way to aggregate this. Otherwise it's tough. If all previous solutions fail, because this is the hard part. This is the hard part, but it tells you a little bit where it's happening, so that gives you, like, where it's happening-ish on the model.
Starting point is 00:43:25 And so what's to blame? And then you have a wide range of things to blame, but less than before. So you can look for a bug. You can look at, for example, normalizing some layers. And you can look into the data to see if you have very bad data that's having an impact. That's how they would look at first, right? Exactly.
Starting point is 00:43:48 We looked at that too. Is this what Weights & Biases would do for you? Or is there one integrated solution that kind of... You would wish? Yeah. No, to see the activations and the... We go for activations to then inspect datasets and then look at, you know. Weights & Biases would get you, like, the parameters.
Starting point is 00:44:13 It's just logging, right? It's just logging, like, you have tons of them. You can't really do much with it. So we had a tool where we would, like, log them periodically. We had a script that would aggregate them and display them for us in a nice, interpretable way, so that we could try out and see what mechanisms were, and could be, impacting the instabilities. But yeah, it was a very, very interesting journey for sure. Yeah. And you published knowledge sharing documents and a memo.
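[Editor's sketch] The team's aggregation tool is not described in detail, so the following is only a guess at the general shape of such a script: reduce each logged tensor to a couple of scalar statistics, then rank layers by how fast those statistics grow between two checkpoints. The function names `summarize` and `fastest_growing`, the layer names, and the choice of statistics are all hypothetical.

```python
import numpy as np


def summarize(tensors):
    """Reduce each named tensor to a few scalars that are cheap to log and compare."""
    return {
        name: {
            "max_abs": float(np.abs(t).max()),
            "rms": float(np.sqrt(np.mean(np.square(t)))),
        }
        for name, t in tensors.items()
    }


def fastest_growing(before, after, key="max_abs"):
    """Rank layers by the growth ratio of one statistic between two snapshots."""
    ratios = {
        name: after[name][key] / max(before[name][key], 1e-12)
        for name in after
        if name in before
    }
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


# Two made-up snapshots of activations, logged some steps apart.
step_100 = summarize({"layer_0.attn": np.ones(8), "layer_1.attn": np.ones(8)})
step_200 = summarize({"layer_0.attn": 1.1 * np.ones(8), "layer_1.attn": 100.0 * np.ones(8)})

for name, ratio in fastest_growing(step_100, step_200):
    print(name, ratio)
```

A ranking like this only narrows down where things blow up; as the speakers note, the cause can still be data, a bug, or a hyperparameter.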
Starting point is 00:44:47 Yes. There's some interesting detail there, but obviously not everything. Anything you want to highlight, just first for listeners on that one? Obviously, I can send in the link for the hardcore, but just any other, like, big discoveries or learnings? Like, you talked about the query-key layer norm. Query-key layer norm was the big aha moment. And there's also, this is one we fixed afterwards, but in the image mask there was, like, a little information leak, in that instead of attending to none of the images,
Starting point is 00:45:30 for a few tokens, very few of them, it would attend to all of the images. Like basically, you go in the attention, and you have this mask that's, like, don't attend to anything. But in effect, it's like,
Starting point is 00:45:45 attend to one over n, right? And it doesn't prevent you from training, but it doesn't help. And yeah, it's better to fix it for training for sure. Yeah, sometimes you really have to go through all your codebase to... Because you're doing gradient descent or something, there is no information.
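[Editor's sketch] The "attend to one over n" leak follows from how a softmax behaves when every position in a row is masked: all logits become equal, so the weights come out uniform instead of zero. The NumPy sketch below reproduces that behavior and shows one common way to close the leak, by explicitly zeroing fully masked rows. Whether this matches the actual IDEFICS fix is an assumption; `masked_softmax` and its flag are hypothetical names.

```python
import numpy as np


def masked_softmax(scores, mask, zero_fully_masked_rows):
    """Softmax over allowed positions; mask[i, j] is True where token i may attend to image j."""
    masked = np.where(mask, scores, -1e9)  # large negative value instead of -inf
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    if zero_fully_masked_rows:
        # A row with nothing allowed should contribute nothing at all.
        has_any_allowed = mask.any(axis=-1, keepdims=True)
        weights = np.where(has_any_allowed, weights, 0.0)
    return weights


scores = np.zeros((2, 3))
mask = np.array([
    [True, False, False],   # this text token may see image 0 only
    [False, False, False],  # this text token should see no image
])

leaky = masked_softmax(scores, mask, zero_fully_masked_rows=False)
fixed = masked_softmax(scores, mask, zero_fully_masked_rows=True)
print(leaky[1], fixed[1])
```

In the leaky version the second token spreads its attention evenly over all three images, which is the small information leak described above.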
Starting point is 00:46:04 Yeah, no, it's tricky because, for example, this would not have an impact if you only train on web documents, because the documents are long. But it would have an impact if you train on image-text pairs and you pack them together, because you're attending to images in a document that has nothing to do with it. But you can still train with it. You can still get a very good model out of it. It's just more painful, and I think we can definitely get better performance without the bug. Yeah. Interesting. You mentioned, you know, just in terms of like the baseline
Starting point is 00:46:42 foundation models that you had, Llama has been a breakthrough on the language model side. But you didn't talk as much about the vision encoder side of things. I actually had a question from Joseph from the Roboflow episode, where you mentioned that the larger the CLIP, the better the results, but in your final memo, you actually went for a smaller version of CLIP. So EVA-CLIP versus LAION-2B. Does this ring a bell? Basically it was kind of, like, unintuitive, the CLIP choice there.
Starting point is 00:47:11 Yeah, so LAION-2B is the dataset on which our CLIP-based model was trained. And our CLIP model was, indeed, like 400 or 600 million parameters. And it's true that the EVA-CLIP one is bigger; the EVA-CLIP one is five billion parameters. So definitely at the beginning of the training, we saw a big boost using EVA-CLIP. However, at that time, we still had instabilities training with this big EVA-CLIP.
Starting point is 00:47:47 We are not sure exactly why. However, we fixed it. And now, for the V2 version of IDEFICS, now that we can train longer, we actually saw a boost by using EVA-CLIP instead of the previous, small CLIP that we had. However, we now think that EVA-CLIP is under-trained. So it means that even if it's really big, we can obtain the same performance with smaller models. So there is this SigLIP model that I mentioned just earlier, by Google, that is much smaller, 400 million parameters.
Starting point is 00:48:30 And that is more efficient. You're choosing that as your base. Yeah, we did an ablation, EVA-CLIP versus SigLIP. And actually SigLIP was slightly better. But the thing is that it's much faster for the inference and also for the training, because there are fewer parameters. One thing is also that SigLIP is a higher resolution. So
Starting point is 00:48:55 that has an impact for OCR tasks. 384 instead of 224. Yes. That's great. Yeah, maybe we should talk about IDEFICS 2. So we're going to time this podcast release with whatever you guys are releasing.
Starting point is 00:49:10 I actually had no idea you were working on a V2. I just came in wanting to talk about your old work, but obviously you're still doing active research. For IDEFICS V2 or V1.5, whatever we call it, the major axes we wanted to improve on were the image resolution, the base model. We wanted a better base model and a smaller one. Oh, sorry, pre-trained language model.
Starting point is 00:49:38 So we've been using the Mistral one. We wanted to iterate a little bit also on the data to have better filters on OBELICS, better synthetic data for the image-text pairs. So essentially iteration on the data by replacing original LAION pairs by synthetic captions to have a stronger alignment, cleaning OBELICS a bit based on the perplexity.
Starting point is 00:50:09 So it's not removing too much, but like potential bad data. Also, yeah, using just better pre-trained models. So for the backbones, we use a better CLIP, SigLIP; we also use Mistral, which is better than Llama 1. We are right now changing also the modeling, so moving away from the Flamingo architecture, to something that has fewer parameters. Instead of incorporating the vision components directly into your LLM
Starting point is 00:50:46 by breaking it up and adding cross-attentions at each layer, or every n layers, you can instead take your vision encoder, take the embeddings out of it, feed them through linear layers, and feed them directly to the language model. And this works quite well, and this contains fewer parameters and it's much easier to train. So we are currently trying to do this.
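[Editor's sketch] The alternative to Flamingo-style cross-attention described here, projecting vision embeddings with linear layers and feeding them straight into the language model, can be sketched in a few lines of NumPy. All dimensions are made up, a single linear layer stands in for "linear layers", and placing the image tokens before the text is a simplification; none of this is taken from the actual IDEFICS 2 code.

```python
import numpy as np


def project_image_features(image_features, weight, bias):
    """Map vision-encoder embeddings into the language model's embedding space."""
    return image_features @ weight + bias


def build_language_model_input(image_features, text_embeddings, weight, bias):
    """Prepend projected image tokens to the text token embeddings."""
    image_tokens = project_image_features(image_features, weight, bias)
    return np.concatenate([image_tokens, text_embeddings], axis=0)


rng = np.random.default_rng(0)
vision_dim, text_dim = 8, 16                          # hypothetical sizes
image_features = rng.normal(size=(4, vision_dim))     # 4 image tokens from the encoder
text_embeddings = rng.normal(size=(6, text_dim))      # 6 text tokens
weight = 0.02 * rng.normal(size=(vision_dim, text_dim))
bias = np.zeros(text_dim)

sequence = build_language_model_input(image_features, text_embeddings, weight, bias)
print(sequence.shape)
```

The only new parameters in this setup are the projection's weight and bias, which is why the approach adds far fewer parameters than inserting cross-attention blocks throughout the language model.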
Starting point is 00:51:17 But what we can say is that without this new modeling, just by iterating on better data and better pre-trained models, we are now matching the Flamingo 80B performance with a 9B model. So this is already a big improvement compared to our first version, without even touching the modeling part. I think also one of the big improvements with the new IDEFICS is the licensing. The issue with IDEFICS was that it was based on Llama and the license is not
Starting point is 00:51:52 commercial. So now with a model that's based on Mistral and SigLIP, most probably, this is a lot better for anyone that wants to use the model commercially. The model will also be smaller, so a lot better for inference. And hopefully we can beat the performance of IDEFICS 80B. That would be really good. So you're only producing a 9B? Not exactly 9B, because we're taking off parameters. It should be about 7.5B. Right now, the focus was really better data, smaller open source models, but better,
Starting point is 00:52:31 and resolution, and improving on this as well. That was the focus. You mentioned in our prep as well that there are some topics that you're paying particular attention to, like hallucinations. Maybe you could talk about the topics that you are finding
Starting point is 00:52:49 are particularly areas of concern with multimodal models. So we've been using hallucination at the beginning as a broad term, and we realized that it was better to categorize it a little bit more specifically to some categories that were more targeted. So for example, there's the object attributes where you would have a small attribute, like, let's say the hand of a person that's a certain color, and the model would be like it's yellow when it's red.
Starting point is 00:53:20 So that would be one. There's, like, objects that are not there but the model thinks are there. There's when the model is trying to reason with different elements in the picture but it gets it wrong, so you get, like, comparison kinds of hallucination. Counting. Oh my god, yeah. You have the environment, so it would talk about, like, the object or the person in the picture, but it would get the whole environment behind wrong. And a few others that I don't have in mind right now. But basically, categorizing those, I think, is important because it helps you target the type of data that's missing or the type of fine-tuning that you should do afterwards that's missing. And so this is going to be very useful for us in building future datasets. I'm not sure if we'll be able
Starting point is 00:54:10 to incorporate all those datasets we've been thinking about for the V1.5. But we'll definitely do this for the iteration after. The goal here is really to get to something that's on par with or better than what is currently done in closed source. Like GPT-4V doesn't hallucinate as much on those topics and you want the open source models to at least match this. But it's still an open research topic because those models still do this. If you push them a little bit, if you ask some specific questions, like they will hallucinate things in the picture. Yeah, you mean GPT-4, including GPT-4V. Yeah.
Starting point is 00:54:50 I have tried this, by the way. I tried to use it to interpret, like, a menu, and it would just make up menu items and make up prices. Something really interesting is that even much better and bigger models like GPT-4 show the same failure cases as our model. So for example, it means that the way we are training the models now is not ideal. And especially, for example, counting, you take this task. Even if you train on web documents or image-text pairs,
Starting point is 00:55:26 you will never really find this task in your training data. You will always have a picture and maybe a caption of one apple or two computers, but you will never get to, like, 10, 15, 20. It doesn't really happen. So you don't learn this ability just by training on image-text pairs or web documents. So what you have to do is to create your own datasets that target specific tasks.
Starting point is 00:55:59 Like for example, create one specifically for OCR, create one specifically for counting, create one specifically, I don't know, to challenge the model on hallucination, some types of hallucination. A sort of FLAN, but for multimodal. And I think in multimodal those hallucinations are even more unforgiving than what they are for NLP,
Starting point is 00:56:23 in that when the language model is making up facts, you can think that it got it a little bit wrong or it's making up a story. But when a multimodal model tells you there is a teapot in this picture and there is none, it's very obvious, very quickly. If you're the user of this model and you receive this, you are having a hard time
Starting point is 00:56:44 trusting it for anything. So this is one of the big, big things to tackle. And for hallucinations in NLP, a fine-tuning that has helped a lot is the reinforcement learning with human feedback or AI feedback. And this is still very early in multimodal. We're not sure exactly how much we can improve with those types of datasets. But it's likely that it will help the model understand uncertainty at a more fine-grained level. And so improve on all those types of hallucinations ultimately. And you said in our prep as well that you're measuring this against other closed source models like Bard.
Starting point is 00:57:22 Like, basically, do you have your own internal benchmarks that you're running? No, no, this is mostly qualitative. It's more like, we know it happens in ours because we've played with it. Does it happen with theirs? And yes, it does. The thing is for ours, the evaluation datasets that we use or that we've used for IDEFICS so far, they don't measure hallucinations that much. It's not targeted for this.
Starting point is 00:57:49 New datasets that are coming for evaluations are exciting, and I think they'll go in this direction and will help us evaluate hallucinations and different types of hallucinations more accurately. But right now, at least for IDEFICS, even with those evaluation benchmarks, and even if this helps, obviously, if you get very bad scores,
Starting point is 00:58:10 your model could be hallucinating things. It's not targeted to this, and we were flying blind a little bit on this topic. But there are definitely benchmarks that evaluate hallucinations currently. So I think about SugarCrepe, for example, so you just have an image and two captions of it, and the model has to select the best one, or Winoground, I think. So actually the captions are made such that it's tricky.
Starting point is 00:58:42 Winogrande, like the common-sense NLP one? Yes. This one is really hard. But you can have one with images, yeah. I think this can be a great way to evaluate hallucinations of the models. However, it's not really commonly used. So right now on the recent models like PaLI, for example, they report their numbers on classic evaluation benchmarks.
Starting point is 00:59:09 So I think these kinds of new benchmarks need to be more widely adopted. I think the last topic that we prepped was just overall, like why is it important for there to be OSS multimodal models? Like what can people use them for? Where can it be useful? My understanding of this is that ultimately you want a foundational model that understands the world similarly to the way we do.
Starting point is 00:59:36 And multimodal models, like, understand visual data a lot better than language models, obviously, and it means they provide a better foundational backbone for, like, other tasks in general. I think in the future when you want to pre-train a foundational model, you will want to have it multimodal. So it's important now and it will be important in the future, and this is where everybody's going, because training on text data gets you really far, as we've seen, but it only gets you so far,
Starting point is 01:00:10 and it can only adapt to a subset of tasks that we do every day. And you need models to be able to understand vision if you want them to be helpful for, I don't know, robotics, for example, and a bunch of other use cases. I think, for example, even for medicine, you can have tons of applications with a vision input. So you just take an image of whatever and ask.
Starting point is 01:00:42 Cancer. Yeah, yeah. Honestly, you can ask your model for help, or even in your everyday life, how to build this table. You just take a picture of what you have and you ask the model. You just take a picture of even something handwritten and ask things about it, like solve the exercise shown in this picture. A lot of things that you can't do with text-only models, yeah, that you can do with
Starting point is 01:01:15 multimodal models. So definitely it's a big step forward. It's still a bit immature, and we have seen that because of all the hallucinations we mentioned and because it's still early; we are still trying to unlock some abilities for these models. Yeah, we believe that in less than two years, we will only have multimodal models. Like, they will overtake everything. Yes. Also, right now, it's a lot of image-text models because that does most of the important things that you want without requiring too much compute.
Starting point is 01:01:55 I think it's more compute-efficient than if you put video in it. Ultimately, I think we'll need video to be incorporated in the pre-training as well. I'm quite excited about what it could look like. The ones that are the most advanced on it are probably the self-driving cars. They're doing very, very interesting work there, but it's all closed source. Oh, no, I think, was it Comma AI that released GAIA or something like this? But yeah, basically video generation. So I think that's very interesting. But video, isn't that just like a series of frames of still images? What is so different about that?
Starting point is 01:02:36 It's hard because you need to encode each frame, or at least select how many frames you want to integrate per video. Also, one big change is the length of the video, which can be really arbitrary, while if you have an image, you always have the same number of tokens associated with it. So all of this, plus the fact that having big video datasets is really challenging, because you can't, for example, scrape YouTube like that, it's harder. So yeah, all of this makes it hard to build a model with videos. It's also extremely heavy. Already when you go from a text dataset to an image-text dataset, how much it weighs is so different. It means that the
Starting point is 01:03:30 data pipeline, like the dataset pre-processing, takes more time. If you go with video, you do this like 10x, if not more. A transformer takes tokens as inputs, a sequence of tokens, right? And for text, you get like one token for a subset of text, word-ish. For an image, depending on the resolution you want, you can get a lot of tokens for a single image. For a 224 image, you'd get, I think, 256 tokens per image. If you want to increase the resolution, it goes up fast.
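The token counts mentioned here follow from simple patch arithmetic. A rough sketch, assuming a ViT-style encoder that splits a square image into non-overlapping square patches, a patch size of 14 (suggested by the `ViT-H-14` in the CLIP checkpoint name from the episode notes), and no class token. The frame count used for the video case is an arbitrary illustration, not a figure from the conversation.

```python
def vit_token_count(image_size: int, patch_size: int) -> int:
    """Patch tokens for a square image split into non-overlapping square patches.

    Assumes any leftover pixels that do not fill a whole patch are dropped,
    and ignores class tokens or other extra tokens.
    """
    if image_size <= 0 or patch_size <= 0:
        raise ValueError("image_size and patch_size must be positive")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


def video_token_count(num_frames: int, image_size: int, patch_size: int) -> int:
    """Tokens for a video if every sampled frame is encoded independently."""
    if num_frames < 0:
        raise ValueError("num_frames must be non-negative")
    return num_frames * vit_token_count(image_size, patch_size)


PATCH_SIZE = 14  # assumption, see the lead-in

for resolution in (224, 384):
    tokens = vit_token_count(resolution, PATCH_SIZE)
    print(f"{resolution}px image -> {tokens} tokens")

# Hypothetical clip: 32 sampled frames at 224px, with no pooling or resampling.
print(f"32 frames -> {video_token_count(32, 224, PATCH_SIZE)} tokens")
```

Under these assumptions, a 224-pixel image gives 16 × 16 = 256 tokens and a 384-pixel image gives 27 × 27 = 729. Because the count grows with the square of the resolution and linearly with the number of frames, video inputs get expensive quickly unless the visual tokens are compressed before reaching the language model.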
Starting point is 01:04:13 For SigLIP, I think it's upward of 700 tokens for 384 resolution. So imagine this, and then you make it a video. So it gets really tricky, and you have to create a bottleneck, right? You have to pack those images together and output just a few tokens. That's what's possible with Flamingo. It also means you have this bottleneck there. It's a lot of work. And right now, I think the trade-off is not yet worth it
Starting point is 01:04:43 because we have a lot more to do with just image-text. But it will be at some point, I'm pretty sure. That's all the questions I had. Any final call to action? You want people to go somewhere, check out the models, check out the papers? Yeah, essentially. So we are excited for the second version, which will be released about the time
Starting point is 01:05:03 of NeurIPS. And yeah, we think that OBELICS can be useful in the next year. Even if we continue updating the model, I think the dataset will remain pretty much the same. And we
Starting point is 01:05:19 keep iterating on better modeling and better data. And if you want to follow our work, we're at the Hugging Face M4 organization. Let's talk about naming. What is M4? And then maybe we should mention OBELICS and IDEFICS.
Starting point is 01:05:33 So at the beginning of the project, it was like massive multimodal, multilingual, multitask model, but then we dropped the multilingual. But it's still massive. Well, not that massive anymore, since we are trying to go smaller. Multitask for sure. I mean, the datasets are still massive, so that's good. Shouldn't it still be multilingual now, if you train it on Common Crawl? Trained on Common Crawl.
Starting point is 01:06:04 It's just not targeted for a lot of different languages. Like, we didn't take care of having multiple languages, but there's definitely other languages in there. Yeah, the naming, like, it changed through the project. But we kept it. The idea was also to have four modalities. So it was fitting at the beginning. At the beginning, we wanted audio, video, text, image.
Starting point is 01:06:28 We thought that audio and video, video is just a lot of compute for not that many results right now. And from the moment we decided to reproduce Flamingo, we dropped video and audio, which was not immediately. This started as a different... What's wrong with audio? It's still very, very heavy and not necessary right away, because you can take the text tokens, plug them into an audio model after,
Starting point is 01:06:55 and it will just spit out the audio. Ultimately, I think it will be good because there's different data as well. Like when people talk, it's not the same as when people write. So I feel there's a lot of interesting data to have with audio. Again, not worth the compute right now, I think. And also you would want this model to be integrating everything with like a given size and then be applicable to whatever. You give it to your robot and you're like, what?
Starting point is 01:07:25 Yeah. And to close the loop there, OBELICS was an Asterix reference that you made into a backronym or whatever. Into an acronym. Yeah, it was really hard to find the acronym. I want to stress this. I thought about it for like days. We were brainstorming it for like a few days.
Starting point is 01:07:45 We had to cheat at the end a little bit. We took the S of cross-attentions and added it. Yeah, cross-attentions. The problem is the F of Flamingo, right, for IDEFICS? Because we are, yeah, we are moving away from the Flamingo architecture. For the next iteration, it wouldn't be really accurate, but whatever. Okay. But yeah, the acronym works only for the first version,
Starting point is 01:08:09 because if I remember well, it's Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS. Yes. Yeah, it's a very, very impressive project. And, you know, we were talking about it even before the GPT-4 Vision rollout, that this is the most impressive sort of open source reproduction, and it's amazing that
Starting point is 01:08:35 you're still continuing to work on it. I think there's a lot of room left in OBELICS to keep mining those rocks, and there's a lot of learning as well in multimodality. I think it's a very important area of research. Thank you.
Starting point is 01:08:51 Yeah, thank you. Hello, hello. It is swyx coming in from the editing room in 2024. If you're listening to this, you're definitely a true fan. Thanks so much. And I hope you enjoyed that conversation as much as I enjoyed recording it. We recorded that conversation on Halloween in 2023 in the hopes that we would be able to release it at NeurIPS or around NeurIPS with IDEFICS v2. V2 was supposed to be updated with a new base model, which is going to be Mistral, and a bunch of other dataset updates. And a couple
Starting point is 01:09:27 of things have happened. But you know what? Hey, let's just get Leo back on to talk about it. Hello from 2024. It's Leo. Just wanted to add a few things since the recording was done a while ago now. We haven't trained the model yet because we're iterating a lot more on data. So we're making progress on OCR, on image-to-code capabilities, and we want to be a lot more thorough in the image-text pairs dataset that we use. I don't know if you've heard, but the LAION dataset has had an issue with CSAM images, and so we want to get ahead of this and fix the problem before we start the training. But we will start training soon.
Starting point is 01:10:06 In the meantime, we released WebSight. It's an image-to-HTML dataset. The idea behind this dataset was to show that we could create a very useful synthetic dataset at scale with open-source models. Hugo spearheaded the effort, and so he used Mistral and the DeepSeek Coder model to generate the pairs of screenshots and HTML code. And then we fine-tuned an early version of IDEFICS 2.0 on the dataset to have a demo.
Starting point is 01:10:45 You can find the model on the hub, you can find the demo on the hub, and you can find the dataset on the hub. So definitely go there and check it out. I think some people have already started training their own model on it. It's been very well received, and so we think we're going to do a lot more small releases like WebSight in the future, basically switching from releasing everything in one package with the model trained to releasing the datasets, architecture, and training insights that we get along the way, and then releasing the model. So stay tuned. I hope you will enjoy the podcast, and thanks, Sean, for having us.
