Latent Space: The AI Engineer Podcast - How to train your own Large Multimodal Model — with Hugo Laurençon & Leo Tronchon of HuggingFace M4
Episode Date: January 19, 2024

Latent Space is heating up! Our paper club ran into >99 person Discord limits, oops. We are also introducing 2 new online meetups: LLM Paper Club Asia for Asia timezone (led by Ivan), and AI in Action...: hands-on application of AI (led by KBall). To be notified of all upcoming Latent Space events, subscribe to our new Luma calendar (sign up for individual events, or hit the RSS icon to sync all events to calendar).

In the halcyon open research days of 2022 BC (Before-ChatGPT), DeepMind was the first to create a SOTA multimodal model by taking a pre-existing LLM (Chinchilla 70B - now dead?) and pre-existing vision encoder (CLIP) and training a “glue” adapter layer, inspiring a generation of stunningly cheap and effective multimodal models including LLaVA (one of the Best Papers of NeurIPS 2023), BakLLaVA and FireLLaVA.

However (for reasons we discuss in today’s conversation), DeepMind’s Flamingo model was never open sourced. Based on the excellent paper, LAION stepped up to create OpenFlamingo, but it never scaled beyond 9B. Simultaneously, the M4 (audio + video + image + text multimodality) research team at HuggingFace announced an independent effort to reproduce Flamingo up to the full 80B scale:

The effort started in March, and was released in August 2023.

We happened to visit Paris last year, and visited HuggingFace HQ to learn all about HuggingFace’s research efforts, and cover all the ground knowledge LLM people need to become (what Chip Huyen has termed) “LMM” people. In other words:

What is IDEFICS?

IDEFICS is an Open Access Visual Language Model, available in 9B and 80B model sizes. As an attempt to re-create an open-access version of Flamingo, it seems to track very well on a range of multimodal benchmarks (which we discuss in the pod):

You can see the reasoning abilities of the models to take a combination of interleaved images + text in a way that allows users to either describe images, ask questions about the images, or extend/combine the images into different artworks (e.g. poetry).

📷 From IDEFICS’s model card and blog post

The above demo screenshots are actually fine-tuned instruct versions of IDEFICS, which are again in 9B and 80B versions.

IDEFICS was built by connecting two unimodal models together to provide the multi-modality you see showcased above.

* Llama v1 for language (specifically huggyllama/llama-65b) - the best available open model at the time, to be swapped for Mistral in the next version of IDEFICS
* A CLIP model for vision (specifically laion/CLIP-ViT-H-14-laion2B-s32B-b79K - after a brief exploration of EVA-CLIP, which we discuss on the pod)

OBELICS: a new type of Multimodal Dataset

IDEFICS’ training data used the usual suspect datasets, but to get to par with Flamingo they needed to create a new data set.

Enter OBELICS: “An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents”:

* 115B text tokens
* 141M English documents
* 353M images

These bullets are carefully curated and filtered by going through Common Crawl dumps between FEB 2020 - FEB 2023. We discuss the 2 months of mindnumbing, unglamorous work creating this pipeline:

There’s a lot of mentions of ‘multi-modal web documents’ which deserves some explanation.
We’ll show you instead of tell you:

You can see from this graph that OBELICS ends up outperforming the other image-text pairs dataset (LAION in this case) when stacked head-to-head.

You can view a subset of OBELICS and perform visualizations on them here:

2024 Update: WebSight et al

Most of this interview was recorded on Halloween 2023 at HuggingFace’s headquarters in Paris:

In anticipation of an IDEFICS v2 release. However, several roadblocks emerged, including a notable scandal around CSAM in LAION-5B, which affected all models using that dataset. The M4 team have adopted a strategy of shipping smaller advancements in 2024, and the first ship of the year is WebSight, a dataset of 823,000 HTML/CSS codes representing synthetically generated English websites, each accompanied by a corresponding screenshot (rendered with Playwright). This is intended for tasks like screenshot-to-code workflows like Vercel’s V0 or TLDraw, and will be part of the dataset for IDEFICS-2.

As noted in our Best Papers recap, synthetic data is emerging as one of the top themes of 2024, and the IDEFICS/OBELICS team have wasted no time enabling themselves with it.

Timestamps

* [0:00:00] Intro
* [0:00:00] Hugo, Leo’s path into multimodality
* [0:09:16] From CLIP to Flamingo
* [0:12:54] Benchmarks and Evals
* [0:16:54] OBELICS dataset
* [0:34:47] Together Redpajama v2
* [0:37:12] GPT4 Vision
* [0:38:44] IDEFICS model
* [0:40:57] Query-Key Layernorm for training
* [0:46:40] Choosing smaller vision encoders - EVA-CLIP vs SIG-GLIP
* [0:49:02] IDEFICS v2
* [0:52:39] Multimodal Hallucination
* [0:59:12] Why Open Source Multimodality
* [1:05:29] Naming: M4, OBELICS, IDEFICS
* [1:08:56] 2024 Update from Leo

Show Notes

* Introducing IDEFICS: An Open Reproduction of State-of-the-Art Visual Language Model
* IDEFICS Knowledge sharing memo: technical lessons and mistakes
* Victor Sanh memo
* OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
* Papers cited:
  * BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
  * Barlow Twins: Self-Supervised Learning via Redundancy Reduction
  * CLIP paper: Learning Transferable Visual Models From Natural Language Supervision
  * Vision Transformers paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
  * Flamingo paper: a Visual Language Model for Few-Shot Learning
    * April 2022 preprint from DeepMind, blogpost
  * VQAV2 paper: Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
  * OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge (https://okvqa.allenai.org/)
  * MMBench: Is Your Multi-modal Model an All-around Player?
  * Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
  * Sig-GLIP paper: Sigmoid Loss for Language Image Pre-Training
  * Nougat: Neural Optical Understanding for Academic Documents
  * MMC4 (Multimodal C4): An Open, Billion-scale Corpus of Images Interleaved With Text
  * Dall-E 3 paper: Improving Image Generation with Better Captions
  * GPT-4V(ision) system card from OpenAI
  * Query-Key Layernorm trick: paper (Scaling Vision Transformers to 22 Billion Parameters), tweet
  * EVA-CLIP: Improved Training Techniques for CLIP at Scale
    * “We intially explored using a significantly bigger vision encoder (the biggest in open-access at that time) with EVA-CLIP. However, we ran into training instabilities very quickly. To lower the risks associated to the change of vision encoder, we decided to continue with laion/CLIP-ViT-H-14-laion2B-s32B-b79K which we have been using until that point. We will leave that swap for future iterations and will also consider using higher resolution images.”
* Datasets
  * Together’s RedPajama-Data-v2: An open dataset with 30 trillion tokens for training large language models
  * LAION COCO: 600M synthetic captions from Laion2B-en
* Chip Huyen’s writeup on LMMs
* Joseph Nelson of Roboflow on Latent Space
* HuggingFace M4
* HuggingFace timm: library containing SOTA computer vision models, layers, utilities, optimizers, schedulers, data-loaders, augmentations, and training/evaluation scripts. It comes packaged with >700 pretrained models, and is designed to be flexible and easy to use.
* Logan Kilpatrick declaring 2024 the year of Multimodal AI at AI Engineer Summit

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
Transcript
Welcome to the Latent Space podcast, where we dive into the wild, wild world of AI engineering every week.
This is Anna, your AI co-host. Happy New Year. Did you miss me? As an AI language model, I cannot miss you back.
But I'm glad to stand in for Alessio while swyx is traveling. This time, in Paris at Hugging Face HQ.
At the AI Engineer Summit in 2023, Logan from OpenAI pronounced 2024 the year of multimodality.
I'm excited for 2024, which I think is really going to be the,
I don't know if I can trademark this, but the year of multimodal models.
It's a tongue twister, but also hopefully the domain is available, year of multimodals.com.
No, don't buy it if it's available.
Yeah, so I'm excited.
OpenAI has a ton of multimodal capabilities that are in the works.
Some folks might have already tried some of these in ChatGPT in the iOS app or the web
app today, things like vision, taking in images, describing them.
We'll show that later on. Also, the ability to generate images. We've had this historically
with DALL-E 2, but DALL-E 3 really, if folks have tried it, it takes things to the next level.
So excited to show some of that today as well. In 2024, the Latent Space pod will offer
deeper dives into multimodality. Today, we'll talk to Leo Tronchon and Hugo Laurençon of Hugging Face,
who trained IDEFICS, a fully open source reproduction of DeepMind's closed Flamingo model done from
scratch, scaled all the way up to 80 billion parameters. By the way, dear listener, we are expanding
our online meetups this year after the success of the Latent Space Paper Club. See the show notes
for the new AI in Action and Paper Club Asia meetups. Watch out and take care.
Hi. Thanks for having
me at your beautiful office. It's really surreal for me to visit the Hugging Face Paris office,
because I've always seen you guys online and organize really huge meetups here in Paris. I want to
learn everything about Hugging Face and your work.
So my name is Hugo.
I've been working at Hugging Face for two years.
I started working on the datasets for the BLOOM language model.
So it's the 176 billion parameter model that we open-sourced and that was at that time the
biggest one.
And it was also multilingual.
So I worked on the model and the dataset.
And then I moved to multimodality with the current project, with IDEFICS and OBELICS.
Now I am also working with Leo on version 2 of IDEFICS.
And Leo yourself?
So my name is Leo.
I joined Hugging Face a year and a half ago.
I was still a student.
So for the first six months, I was still an intern.
But I started to work on multimodality right away.
And then I spent all my time here in the research team working on multimodality and IDEFICS,
which we open-sourced in August.
I think a lot of people are very interested in learning
more about IDEFICS and multimodality in general.
Bigger question first, how is Hugging Face organized?
You told me some surprising details about the size of Hugging Face.
You guys are a $4 billion company.
Only 200 people, less than 200 people?
About 160 people.
Yeah.
And then how many people in the research team?
This is like maybe 15.
So between 10 and 20% of the company is research.
I'd say.
One, that's impressive.
And then two, this is something that we discussed before.
It's also unintuitive why Hugging Face needs to do research.
I think the company has a good incentive to do research because most of the companies that do AI,
they have an incentive to get very good models out, but not the best model out.
Their competitive advantage is to have the best model in-house that they can fine-tune for their customers.
And then the open source is for show.
But Hugging Face is one of the only companies that has an incentive to get the best model out there in the open.
And that's why I think the research team is quite important.
It's also important because all the tools that Hugging Face makes are used by the researchers,
so they get all the feedback directly from us.
And I think this is really useful to develop the tools behind it.
Are you talking about the Transformers library?
Transformers library, Diffusers library, Datasets.
So those seem to me more like inference-type tools.
Are there any sort of training tools that you do?
Datasets is used for the training.
Transformers we've been using for our modeling.
Internally, we are also developing a library for training.
I think it's going to be open source, but we'll see.
So for example, for the construction of OBELICS,
we used the Datasets library for our whole pipeline.
The aim of our big open source projects is also to test our own internal libraries
and see if they scale well.
For example, the Datasets team had never
worked with datasets this big before.
So this is a way also to test our solutions.
This big meaning 114 million images, something like that?
More than 300 million images.
I've tried Transformers and tried Diffusers.
I haven't tried Datasets.
Why do I need Datasets?
So I think Datasets is great because you can load
datasets that don't fit in memory.
So it's a kind of virtual library, virtual pointers or whatever.
Exactly.
And also you can easily
filter rows of your datasets, map them, manipulate and modify the content.
So it makes it really easy.
And also to do the operations in parallel, it's much easier with this library.
What is the alternative to Datasets?
What do machine learning researchers use if they don't use Datasets?
Just do everything by hand or...
Basically, manually paginate, write code to paginate it.
Yeah, yeah.
That's what I did before, but it's just much faster to the...
Because everything is done for you.
And then multi-processing, you just have to implement your function.
That's great.
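As a rough illustration of the idea described above (process rows one at a time, filter them, then map a function over them), here is a minimal plain-Python sketch. It is an analogy built on generators, not the Datasets library's API; the in-memory "file", the column names, and the helper functions are all hypothetical.

```python
import io
import json
from typing import Iterable, Iterator

# Hypothetical JSON Lines content standing in for a file too large to load at once.
FAKE_FILE = io.StringIO(
    '{"text": "a cat on a mat", "width": 640}\n'
    '{"text": "", "width": 32}\n'
    '{"text": "two dogs playing", "width": 512}\n'
)


def read_rows(handle: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed row at a time instead of reading the whole file."""
    for line in handle:
        if line.strip():
            yield json.loads(line)


def keep_row(row: dict) -> bool:
    """Filter step: drop rows with empty text or very small images."""
    return bool(row["text"]) and row["width"] >= 100


def add_length(row: dict) -> dict:
    """Map step: return a copy of the row with a derived column."""
    return {**row, "num_words": len(row["text"].split())}


def pipeline(handle: Iterable[str]) -> Iterator[dict]:
    """Chain read, filter and map lazily; nothing runs until rows are consumed."""
    return (add_length(row) for row in read_rows(handle) if keep_row(row))


# Only the rows that survive the filter are ever held in the result list.
rows = list(pipeline(FAKE_FILE))
```

The user only writes `keep_row` and `add_length`; the iteration logic is shared, which is the convenience the speakers describe.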
Well, I think that's a good intro to the overall Hugging Face ecosystem.
But I'm interested in the journey from BLOOM into computer vision for yourself.
And then obviously you also had your own journey into multimodality.
A lot of people who are listeners and readers of Latent Space are also following that same journey, right?
They only have some kind of NLP background.
And now everyone is interested in multimodality.
What was that journey like for you?
Not from the research team, but from the Hub team when they started hosting multimodal models and datasets.
And quickly after that, we also realized that it would be a good idea to also train multimodal models
ourselves, to catch up with the proprietary models from DeepMind, Google, etc.
I think that was the natural path for us.
But we didn't drop the idea of doing pure text
models. So there is also a team for LLMs. It was just the creation of another team.
For me the journey was a bit different because I came right away from my
master's. So I had projects on computer vision where, for example, I don't know if you've heard of
DINO, probably, and there was another paper called Barlow Twins. Basically I had a project in which I
tried to combine the two objectives, so I was more towards computer vision before joining, but
I was really interested in doing multimodal.
When I saw that there was an internship for this position,
I was glad.
Then the team was already starting to do the project
when I joined.
And so I kind of joined the train.
Just a demographic question.
Is everyone here?
Is everyone on your team here?
We had a big shift in the team in recent months.
Some people left for other startups or to create their own.
But what is
really interesting to me is that when we started the project, not a lot of people in the team
had previously worked with multimodal models. Maybe only two of them. And we were like six, seven really
working on the project. So it was really new to us, this field. And we also wanted to have this knowledge
because of course it's explained in the papers how to do things. But without doing them
yourself, you still miss a lot of things and you miss the intuition.
And we also wanted to build this knowledge of multimodality.
When building version two, we go much faster because we have a better intuition.
Yeah, this, and also I think it speaks to the philosophy of Hugging Face of having small teams
with big impact.
And so we started with a fairly big team by Hugging Face standards, with six, seven people.
As Hugo was saying, we lost a few people to different startups,
but the idea is still to go as fast, if not faster,
with fewer people right now.
And I think it's possible because of all the background we built
in the previous iteration,
and because small teams can work a lot faster
because there's less communication, less overhead.
Very cool.
Yeah, so I do want to get into IDEFICS and OBELICS.
I wanted to basically go over a little bit of introductory stuff for people, right?
So in my mind, the two main multimodality papers that everybody should read are CLIP and Vision Transformers.
Would you mark out anything else?
Or what do you personally get from those two papers?
These two papers, I think, form a starting block.
Because now what we are noticing is that for building super large models, we don't train them from scratch.
We just take pre-trained, usually unimodal, models that we somehow mix together.
So I think CLIP or ViT can serve as a pre-trained backbone
that you combine with another pre-trained language model backbone
to obtain something multimodal.
So these are foundation models that play the same role
for us as Llama models or other language models.
The important thing to understand with Vision Transformers
and CLIP is that they provide the basics for
integrating images into this language modeling objective that we use.
Then it's mostly a question of data and image resolution,
and a lot of engineering goes there.
And just a note on this, so some research shows that when you use a pre-trained vision
encoder that was also trained with a text objective, for example a
contrastive loss like the CLIP loss, it's better to use this type of vision
encoder than vision encoders trained only on classification or a unimodal task if you are building
multimodal models.
So if you are building a vision language model, it's better to take as a pre-trained backbone
a pre-trained vision encoder that has been trained using text.
Is that not intuitive?
Imagine you take a vision encoder that is super good at classification.
Then you can imagine that the embeddings that
you get from your vision encoder are a super start to plug into your language model.
So intuitively, it could clearly work to have a vision encoder trained without text at all.
But researchers have shown that it's better to use this contrastive objective.
And once you have those backbones, the question is really how you integrate them,
like how you integrate both of them into the architecture.
It turns out that with very lightweight updates, you take the embeddings that come from CLIP,
the output of CLIP, and you have just a linear layer that you train on top of this,
then you pass this to the language model and you only train this part,
and you can already get pretty good results in multimodality. You don't have to train
all the parameters when you train the multimodal model. You can just train the adapter.
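To make the "only train the adapter" point concrete, here is a small back-of-the-envelope sketch in Python. The parameter counts and embedding widths are illustrative round numbers chosen for this example (assumptions, not figures given in the conversation), and `Module` and `linear_params` are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass
class Module:
    name: str
    num_params: int
    trainable: bool


def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Parameter count of a single linear layer: weight matrix plus optional bias."""
    return d_in * d_out + (d_out if bias else 0)


# Assumed embedding widths, for illustration only.
VISION_DIM = 1280   # width of the vision encoder output
TEXT_DIM = 8192     # hidden size of the language model

modules = [
    # Assumed round-number sizes for the two frozen backbones.
    Module("vision_encoder", 630_000_000, trainable=False),
    Module("language_model", 65_000_000_000, trainable=False),
    # The only trained piece: a linear map from vision space to text space.
    Module("linear_adapter", linear_params(VISION_DIM, TEXT_DIM), trainable=True),
]

trainable = sum(m.num_params for m in modules if m.trainable)
total = sum(m.num_params for m in modules)
trainable_fraction = trainable / total

print(f"trainable parameters: {trainable:,}")
print(f"fraction of all parameters: {trainable_fraction:.6f}")
```

Under these assumed sizes the adapter is a tiny fraction of the full model, which is why this style of training is comparatively cheap.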
Is this what was spelled out in Flamingo or you just kind of
derive some kind of transform that you're happy with?
So in Flamingo, you introduce a lot more parameters, but it's still just those cross
attentions that you insert in the model that are new. Those you train from scratch, but the rest of
the model, the language model backbone and the vision backbone, are frozen during the training.
So you never update them. So it's a different type of adapter, but it's more heavyweight
than what you could have in recent papers. Now with more parameters, you also often
get better performance.
Is it necessary to have all those parameters
when you only train the adapter part?
There's no clear answer yet.
Okay, I think that brings us up to date.
Oh, except for benchmarks, I wanted to introduce people
to the concept of how hard it is to evaluate benchmarks
for multimodality.
So there are the classic academic benchmarks.
For example, VQAv2, which is for visual question answering.
There are also the
image captioning benchmarks.
COCO.
COCO, exactly, Flickr.
However, one really
important thing that we noticed
is that on these benchmarks
the performance of your model
heavily depends on how
you formulate the answer.
For example, for visual
question answering tasks, you will have
a question and an answer. This
answer will be generated
open-endedly by your model. You just
prompt the model with your question,
and then it will generate some words
until the end-of-sequence
token is reached.
But the thing is that
if you have a question and the answer
is simply no,
if your model says you can count it wrong
or you can count it as...
There's ways to adjust for that,
like, you know,
some kind of distance metric or something.
But it's hard.
It's hard.
Use another model.
So just the way you're formulating the answer
heavily impacts your performance.
And then there is the fact that some people are fine-tuning directly on the benchmark to try to optimize this formulation,
and the fact that other people are doing few-shot evaluation.
So few-shot means giving the model examples of how to formulate the answer.
So it makes it hard to compare the models because they are not evaluated the same way, even if it's on the same benchmarks.
So this is a problem.
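The scoring problem described above can be sketched in a few lines of Python. This is a hypothetical illustration, not the official metric of any benchmark: the example prediction, the `normalize` rules, and the "starts with the reference" rule in `relaxed_match` are all assumptions made for the sketch.

```python
import re
import string


def normalize(answer: str) -> str:
    """Lowercase, strip punctuation and collapse whitespace."""
    answer = answer.lower()
    answer = answer.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", answer).strip()


def exact_match(prediction: str, reference: str) -> bool:
    """Strict scoring: the generated string must equal the reference exactly."""
    return prediction == reference


def relaxed_match(prediction: str, reference: str) -> bool:
    """Lenient scoring: the normalized prediction must start with the reference words."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    return bool(ref_tokens) and pred_tokens[: len(ref_tokens)] == ref_tokens


# Hypothetical open-ended generation for a yes/no question.
prediction = "No, there is no dog in the picture."
reference = "no"

strict = exact_match(prediction, reference)     # False: the strings differ
relaxed = relaxed_match(prediction, reference)  # True: the answer starts with "no"
```

The same model output is counted wrong under one rule and right under the other, which is why scores from different evaluation setups are hard to compare.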
So you will have all these academic benchmarks.
And then you will have these new benchmarks that are not commonly adopted yet,
but are created basically with language models like GPT-4.
People are prompting GPT-4 with images and asking it to automatically generate questions and
answers.
And then we can evaluate our models.
This is very new.
The evaluations for multimodal models are still a bit rough,
I think.
Even for language models,
there's discussions of
if benchmarks are really the way to go
for some tasks.
When you do instruction tuning,
for example,
for a language model, or RLHF,
you shouldn't be evaluating
on the same benchmarks
from the point of view of a lot of people.
On multimodality,
it's also that the datasets
we're evaluating on are not super clean.
I retweeted something recently,
someone that was showing
like failure cases of VQAv2, I think.
And it was interesting
that sometimes the questions and answers are like super obvious and sometimes it's like so far away
that not even a human would get it. Yeah, and I think sometimes it's just plain wrong. So
it's also the quality of the datasets to evaluate on, and the diversity of them, that matters a lot.
And right now we still have a few blind spots in evaluations, but it's really interesting
to see the field move on this, because as we have a lot more multimodal models, the evaluation
benchmarks are improving.
Maybe four or five came out.
There were nice ones in the past two months.
Off the top of your head, can you name any of these?
MMBench.
POPE.
There's SEED, and there's a brand new one,
but I don't know if it's out yet.
Like, the paper came out recently.
It's called Hallusion-something, HallusionBench, I think.
And this one, I read the paper.
I don't know the size of it,
but from the examples they give on the paper,
it seemed really, really interesting and hard to beat.
Yeah, I'm excited about this one, mostly.
This is like the new race, right?
In the last five years, there was a race towards like sort of common sense benchmarks in NLP,
but now this is the new.
Yes, it's getting to multimodal.
Very cool.
Maybe we can go into the work that you did for OBELICS.
Let's describe the size of the dataset, what you did to clean it up.
A lot of these things start from Common Crawl, and Common Crawl is great,
but also it's very messy.
So first, why we wanted to do it:
we were trying to replicate Flamingo.
And Flamingo built their own dataset
of interleaved image-text web documents.
I think it contained more than 50,
no, 100 million images, if I'm not wrong.
And it was based on, for Flamingo?
For Flamingo.
Yeah, for Flamingo.
And it was based on like 50 million web pages.
However, the dataset
was not open.
So I talked to the authors, and one of the reasons it was not open is because they used their
Google PageRank algorithm to try to know in advance which websites to target in their
dataset.
So meaning higher PageRank, higher-ranking SEO sites have higher weight.
Yeah, exactly.
So that's how they scraped the websites.
That's one of the reasons why they don't.
Many reasons to not open source their datasets.
So we wanted to build a dataset that was at the beginning similar to this one,
and then we made it even larger, and fully open source.
Because we believe foundation multimodal models trained on interleaved image-text documents
are better than the ones trained only on pairs.
Maybe to go further into that point, what we found,
and that is interesting, is that it's for the VQA tasks that this
dataset is really important. For the captioning tasks, you have an image-text dataset like
LAION, and it's great. And it's going to improve pretty well, like the alignment is strong.
Just to explain, alignment is like basically images that are aligned with the text. So the text
means something that is related to the image. And so LAION will be enough for the captioning
tasks, and maybe for some OCR tasks, although it's still weak on this one, and you
could use improvements on this one.
And then the OBELICS dataset is really important for reasoning,
to have the model perform on VQAv2, OK-VQA.
So those depend heavily on the web documents.
It was interesting to see the dichotomy
when we use only one dataset or the other.
That's in the OBELICS paper.
Essentially, the pairs, image-text pairs,
are good for the alignment.
Just align what you see in an image with the corresponding text.
But if you want to have more ability to reason, it's better to have a higher proportion of web documents with longer context.
Also, it's not the only reason why we wanted to do it.
Another reason is that the image-text pairs are super noisy.
So, well, the advantage of them is that they're super easy to collect.
You just scrape a lot of HTML.
And any time you find an image with a corresponding alt
text, you download the image, you take the alt text, and you have your pair.
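The pair-collection recipe just described can be sketched with Python's standard `html.parser`. This is a toy illustration under stated assumptions: the HTML snippet and URLs are made up, the class name `ImageAltPairs` is hypothetical, and a real pipeline would also download and filter the images, which is skipped here.

```python
from html.parser import HTMLParser


class ImageAltPairs(HTMLParser):
    """Collect (image URL, alt text) pairs from <img> tags that have both."""

    def __init__(self) -> None:
        super().__init__()
        self.pairs: list[tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src")
        # Attribute values can be None or empty; treat both as "no alt text".
        alt = (attributes.get("alt") or "").strip()
        if src and alt:
            self.pairs.append((src, alt))


# Made-up page: one usable image, one with empty alt text, one with no alt at all.
HTML = """
<html><body>
<p>Our trip to the coast.</p>
<img src="https://example.com/beach.jpg" alt="A sandy beach at sunset">
<img src="https://example.com/spacer.gif" alt="">
<img src="https://example.com/logo.png">
</body></html>
"""

parser = ImageAltPairs()
parser.feed(HTML)
parser.close()
pairs = parser.pairs
```

Only the first image yields a pair. The surrounding paragraph text is thrown away, which is part of why such pairs are easy to collect but carry little context.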
Building a web document is much harder because you have to clean the text properly, you
have to check what you want to keep, what you want to discard.
So it's obviously much harder.
However, you also have a longer context for each image.
So there is really a parallel to be made, and it's not the same type of data, because with image-
text pairs, you have an image and the direct caption of it.
With web documents, you have, well, this is essentially what you see when you open any website.
So you have some text, then sometimes an image, another text, an image, and the alignment
here is weak, in the sense that the text doesn't necessarily describe the image perfectly.
However, they share the same context.
So this is another type of data.
And we also think that this diversity helps to improve the performance.
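For contrast with image-text pairs, here is a minimal sketch of what an interleaved document looks like as data: an ordered sequence of text blocks and images. It uses Python's standard `html.parser`; the HTML snippet, the class name `InterleavedDocument`, and the `("text", ...)` / `("image", ...)` item format are assumptions for illustration and say nothing about the actual OBELICS schema or cleaning rules.

```python
from html.parser import HTMLParser


class InterleavedDocument(HTMLParser):
    """Collect text blocks and images in the order they appear on the page."""

    def __init__(self) -> None:
        super().__init__()
        self.items: list[tuple[str, str]] = []

    def handle_data(self, data):
        # Collapse whitespace and skip blocks that are only whitespace.
        text = " ".join(data.split())
        if text:
            self.items.append(("text", text))

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.items.append(("image", src))


# Made-up article: the text talks around the images rather than captioning them.
HTML = """
<article>
<p>We hiked up before dawn.</p>
<img src="https://example.com/summit.jpg">
<p>The descent took twice as long.</p>
<img src="https://example.com/valley.jpg">
</article>
"""

parser = InterleavedDocument()
parser.feed(HTML)
parser.close()
document = parser.items
```

The result keeps the text, image, text, image order of the page. A real pipeline would need much more cleaning (navigation menus, scripts, ads), which is the hard part the speakers refer to.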
How much?
So this sounds good in theory, but you had no way of knowing.
I mean, I guess you talked to the Flamingo authors and they just told you that this is what they did.
You mean like the proportion? Exactly.
Even them, they built their dataset and they told me, yeah, we use this proportion, but maybe we could have used
less, or we don't know.
So we didn't really know in advance the proportion of
web documents you would need compared to pairs.
We did an ablation though.
So basically we can control how much we sample web documents
versus LAION pairs.
And so we did an experiment where we moved those probabilities a lot.
It was very inconclusive.
It was very inconclusive.
So we had a range of, like, what was a good range for
how many web documents we should have versus LAION pairs.
But overall, past a certain threshold, it didn't matter too much.
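The sampling knob described above can be sketched as a simple mixture: for each batch, draw the data source according to a probability. This is a hypothetical illustration in Python; the function name, the source labels, and the value 0.7 are assumptions and not the proportions used for IDEFICS.

```python
import random
from collections import Counter


def sample_sources(p_web_documents: float, num_batches: int, seed: int = 0) -> list[str]:
    """Pick a data source for each batch; web documents with probability p_web_documents."""
    rng = random.Random(seed)  # seeded so the schedule is reproducible
    return [
        "web_documents" if rng.random() < p_web_documents else "image_text_pairs"
        for _ in range(num_batches)
    ]


# Assumed mixture weight, purely for illustration.
schedule = sample_sources(p_web_documents=0.7, num_batches=10_000)
counts = Counter(schedule)
observed = counts["web_documents"] / len(schedule)

print(counts)
print(f"observed share of web documents: {observed:.3f}")
```

An ablation like the one mentioned would rerun training with different values of `p_web_documents` and compare benchmark scores.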
And when you measure performance, do you split it out into things like individual tasks,
like segmentation or detection or anything like that?
Or is it just VQA?
We don't have detection
or segmentation, because the model basically outputs text.
So we can't really evaluate on those benchmarks.
But we did captioning, visual question answering, text recognition a little bit,
but it was done through captioning datasets or VQA datasets, and we did classification.
So those are the three categories that were doable with the setup,
like the model we
put together.
But I know that
Qwen-VL, for example,
they use
bounding boxes in the
datasets,
so they can do
detection, yeah.
You also mentioned, by the way,
that resolution was a big deal
for you,
image resolution.
And how do you deal with that
in OBELICS?
So resolution
is important
when you want to do OCR,
particularly.
Because otherwise it's just fuzzy,
right?
If it's...
Exactly.
It's too small.
If you can't see it, the model probably struggles as well.
Well, not just that.
Models typically see much smaller images than we do, right?
I don't know what resolution you guys have.
It's like a resolution of like 480 by whatever, right?
It's super small.
IDEFICS is smaller than that.
Yeah.
It's smaller than that.
It's 224.
So you're going to lose a lot of detail.
Yep.
Definitely.
On top of this, you have the vision model,
and it outputs a certain number of tokens
depending on the image you put, right?
And above the model, we have a perceiver,
so we reduce the number of tokens that come out of the vision model.
By doing this, we make it even less,
like even harder, I guess, for the model to be precise
on those very, very small details.
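To see why resolution and token reduction interact, here is a small arithmetic sketch in Python. The 224 and 384 resolutions come from the conversation and the patch size of 14 from the encoder name in the show notes; the number of latents the perceiver keeps (64 here) is an assumption for illustration, and the count ignores any extra class token.

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of patch tokens a ViT-style encoder produces for a square image."""
    per_side = image_size // patch_size  # patches that fit along one side
    return per_side * per_side


PATCH_SIZE = 14    # from the "ViT-H-14" encoder name
NUM_LATENTS = 64   # assumed number of tokens kept after the perceiver

tokens_224 = num_patch_tokens(224, PATCH_SIZE)  # 16 * 16 = 256
tokens_384 = num_patch_tokens(384, PATCH_SIZE)  # 27 * 27 = 729

# How many patch tokens are squeezed into each latent, under the assumption above.
compression_224 = tokens_224 / NUM_LATENTS

print(tokens_224, tokens_384, compression_224)
```

Higher resolution gives the encoder more patches to work with, while a fixed number of latents means each latent has to summarize more of them, which is where fine detail such as small text can get lost.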
So this is something that happened with IDEFICS,
version one, and probably, I mean,
we're going to improve on this for version two.
But it's really, really important for OCR, that's for sure.
We think it can also be important for visualizing details, improving on those things as well.
For example, if a finger or a hand is a certain color or is doing a certain thing.
If all your images are tiny, it's going to be hard for the model to pick up on that.
So we are also bottlenecked by what is available on the open source side.
For example, Google recently released SigLIP, and it's like CLIP, but there is a version of it
called shape-optimized, which is of size like 400 million parameters.
And it is trained with 384 resolution images.
So it's a bit bigger than the 224 that we had.
So I think this is the largest resolution you can get with open source models.
We are of course bottlenecked by this.
Usually Google, they release like this version of SigLIP,
but they didn't release the better version of it.
So we are definitely limited by this.
Well, so
it sounds like it doesn't really affect OBELICS.
Yeah, exactly.
So when creating the dataset,
we simply downloaded the images at full resolution,
and after that, you resize them on the fly during the training.
But certainly one of the biggest challenges when making OBELICS was dealing with all these images
because they weigh a lot.
Aren't you tempted?
So to me, OCR is extremely important.
Aren't you tempted to run some kind of extra data augmentation thing to say like, oh, you know,
on the OBELICS dataset, run some OCR pipeline on it so that you augment your...
Yeah, that's really interesting what you mentioned because this is also
one thing that we want to do in the near future.
And also people have kind of done that for Nougat, right?
So Nougat is a model from Facebook, and they just tried to have a vision model that can read.
So they fed the model PDFs with the associated text, and the model is pretty strong.
So maybe if we inject this data in our pre-training, it would definitely help.
And this is one of the threads we're exploring to have a lot more OCR data.
We have a team actually at Hugging Face that works on Document AI with Ross Wightman.
Do you know the timm library for vision models? No? Okay.
But yeah, basically he's been working on Document AI and on getting a very strong open source model that can read.
And is that primarily PDFs?
Yeah.
Screenshots of PDFs or just raw PDFs?
Is there a difference?
I think screenshots, but I'm not familiar with the data set yet.
We may use it as well for our training in the future.
It's interesting that documents obviously are a very, very important form of multimodality
that is very OCR heavy, very focused on charts.
I feel like you could classify sort of three types of multimodal models.
Like one is the traditional classification types of models,
the CLIPs of the world.
And then two is the VQAs,
where there's a general image, like from a webcam,
you know, where it's like there's three people in this image and all that.
And then the third would be like Document AI.
Yeah.
I don't know if that's...
You can combine them all, actually.
Can you combine them?
I don't know.
But like for...
Actually, actually you don't know.
No, but like for a general model like GPT-4,
it does all of this.
Yeah.
Yeah.
Even if it's not, like,
trained purely on classification, you can classify the...
Deeper the better, right?
One God model to rule them all.
I don't know if it's like a mixture of, you know, different models.
Yeah, GPT-4V is
like probably built upon GPT-4, but adapted for images.
That would make sense.
But then it's like the DALL-E 3 model is different,
like it's separate, to create different images.
Yeah, something they just introduced was,
now you don't have to switch modes, right?
Now you can just kind of do one model
and it just does its own routing,
which is kind of very interesting.
And then the other thing that was
mentioned but not released
was that they could add vision to GPT-3.5,
not just adding it onto 4.
It's not a variant of 4.
It is a pluggable vision module
that you can kind of add to 3.
They never released it.
Yeah.
Anything else that people should know about OBELICS?
Obviously, this is like the big work.
You mentioned in our prep that you expect it to last for a while because there's a lot to mine from it.
I think it's big enough to train large models.
So we trained our 80B-parameter model on it.
So it's definitely sufficient for the next one or two years.
We spent a lot of care curating the data, like regarding the text quality and the image quality.
And I think we...
So there is also an alternative, sorry, to OBELICS.
It's called Multimodal C4, MMC4.
It was published at around the same time as ours.
However, we think we took more care in the deduplication part,
to deduplicate the images and the text,
and also on the text quality.
This is measured, of course, qualitatively,
just by looking at and exploring
our documents, but also quantitatively by looking at certain metrics like perplexity: we obtain
good scores that match the best NLP-only datasets.
This was a win for us.
For someone who's never really dived into these datasets, I mean, I can open up a dataset and
manually look through these things, but how does perplexity, how do you measure perplexity
in a multimodal data set?
So perplexity, essentially, is something really simple.
You take a small model, you feed it the tokens of your text, and then you measure
the probability of the document.
Of course you normalize by the length so that everything is equally treated.
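As a rough illustration of the measurement described here (a sketch, not the team's actual pipeline): given the per-token log-probabilities that some small language model assigns to a document, length-normalized perplexity is the exponential of the average negative log-likelihood. The log-probabilities below are made-up numbers, and the function name is hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Length-normalized perplexity from per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    # Average negative log-likelihood per token, so long and short
    # documents are treated on the same scale.
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Made-up log-probs standing in for the output of a small language model.
clean_doc = [math.log(0.5)] * 4
noisy_doc = [math.log(0.5), math.log(0.01), math.log(0.02)]

print(perplexity(clean_doc))
print(perplexity(noisy_doc))
```

A document the model finds predictable gets a low score; one full of unlikely tokens gets a high score, regardless of its length.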
And then the thing is that we obtained perplexity scores that match the distribution
of the documents from The Pile, and The Pile is documents that were taken from good-quality
sources like Wikipedia, arXiv and so on. It's not something that you can really scale.
However, we also noted that we obtain better perplexity scores than the ones from C4,
the big dataset, or OSCAR.
Based on your own measurements, right? Because obviously the
Multimodal C4 people would not share it.
Yeah, just based on the text.
So yeah, this is how we computed perplexity
and how we assessed the quality of the dataset.
But you could also run a multimodal model on this
and get the perplexity from it.
It would not only be measuring the quality of the text;
the alignment would also come into account.
Alignment of image and text,
because it would be easier for the model
to get the next token
if the text is very heavily related to the image.
And then one more question,
just about the whole process.
Like, how long does it take to make OBELICS?
So we spent a good amount of time at the very beginning of the project
just simply to iterate on the pipeline,
like how we collect HTML code, how we clean it.
We had to go through all of the HTML tags to see which were important.
So this is, yeah, an engineering part.
You have to be really...
It sounds very boring.
But very important.
It is boring and important.
That's how you get the good data.
Yeah, but this is also why people don't do it.
Yeah.
But the industry has not converged on a shared set of tools that everybody uses for this.
You're just parsing raw tags yourself.
Yeah, we did that, yeah.
Because we found it was better.
We parsed raw HTML code.
So we had to clean the DOM tree, select the good HTML nodes, correctly extract the text,
the images, clean, deduplicate.
So as I said, there was just a good amount of time at the very beginning of the project
just finding the pipeline.
So maybe one month, but we were like one or two people on this and it was really exploratory.
And then for actually making the data set, download all the images, do all the processing scripts, and so on, I think it took us like up to two months.
Yeah, but then there's also like iterations through the project where we think we should do filtering on this on top of what we were already doing.
So we improved on the dataset as we went.
Something I would think makes sense for the industry is kind of an open source set of deduplication
rules, because everyone seems to be reinventing this from scratch every time.
Well, for deduplication, you have to do it from scratch all the time, because it depends on your
original set of documents.
Everyone draws from Common Crawl. Like it's...
Yes, but they're not from the same dumps or not from the... But that's true.
Yeah. Someone should take all the Common Crawl documents.
What's been done recently? I don't even need the same exact rules. I just need to be like,
oh, these smart guys thought about that,
I should include that, right?
That's very simple.
If you have 100 rules, someone else has 80 rules,
maybe they have something that you don't have.
Exactly, exactly.
I think we did a little bit of this,
because there's a bit of literature around.
Even the ones that we designed before for the dataset of BLOOM,
we took a, I'm sure you used a bunch of that.
And OBELICS is big, but the datasets that come for NLP right now
are also, like, huge.
Did you see that Together one from yesterday?
Yes.
Yes.
Impressive.
30 trillion.
So they said they had a raw data set of 100 trillion
and they got a cleaned high quality data set of 30 trillion,
which means they kept 30% of Common Crawl,
which is still too high.
Yeah.
So I think the idea of the project,
and I agree with this idea, is that
everyone can set the thresholds for the filters. So for each document they computed,
no, not thresholds actually, filter values for a set of rules, and then you can decide
whether you want to keep the document based on them. So you define your own rules. I think they
removed, as you said, the 70% of the dataset that you really can't do anything
with, and then for the remaining part they let people decide.
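To make the idea concrete, here is a minimal sketch of filtering on precomputed per-document values with user-chosen thresholds. The signal names, values and bounds are hypothetical and are not taken from the dataset being discussed.

```python
def keep_document(signals, thresholds):
    """Return True if every listed signal lies within its (low, high) bounds."""
    for name, (low, high) in thresholds.items():
        value = signals[name]
        if value < low or value > high:
            return False
    return True


# Hypothetical precomputed filter values shipped alongside each document.
docs = [
    {"id": "a", "signals": {"perplexity": 120.0, "word_count": 800}},
    {"id": "b", "signals": {"perplexity": 950.0, "word_count": 900}},
    {"id": "c", "signals": {"perplexity": 200.0, "word_count": 12}},
]

# Each user picks their own bounds; these are arbitrary examples.
thresholds = {"perplexity": (0.0, 500.0), "word_count": (50, 100_000)}

kept = [d["id"] for d in docs if keep_document(d["signals"], thresholds)]
print(kept)
```

The point of the design is that the expensive part (computing the values) is done once, while the cheap part (choosing the cut-offs) is left to whoever trains on the data.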
But of course, if you actually want to train something on it,
it will be much smaller, I guess,
because you will remove all other things.
But data is super important, and my point is also that it's still very early in multimodal.
We're seeing now in NLP these 30-trillion-token datasets.
It's really important.
The first thing they said when they released their goals is how impressed they were
that they were capable of doing the data pipeline
in three months.
They didn't talk about, like, training the model or not.
It's just like they knew what they were doing for this.
It's straightforward.
But creating the whole data pipeline,
this is what took them a lot of time.
So I think that was one thing that struck me
when they made their own.
Is it confirmed that they have 8 trillion tokens?
I don't know.
They won't say.
Could be this.
It could be more, I think.
Given we now have an open source dataset of 30 trillion,
I wouldn't be surprised
that they have more.
I just keep coming up with questions.
One more thing on datasets.
Did you read the GPT-4 Vision system card that they put out?
They put out this paper describing a little bit of their process.
Something like 95% of the labels for GPT-4 Vision were augmented by GPT-4 itself.
And I was just curious, like, how much room is there for open source augmented
datasets?
I think there's a lot of room,
and I think synthetic data works a lot, like it works great,
particularly in multimodality.
Recently, the recent papers have been using, for example,
LAION-COCO instead of LAION.
This is just captions on LAION created from BLIP.
So it's not even like super, like, extraordinary,
but it does bring more performance with a lot fewer examples,
because the alignment is more straightforward, I guess.
And we've been observing
this recently because we've been using it in our recent experiments.
I think the potential for synthetic data in multimodality is very big and very underutilized
right now.
Even DALL-E 3, they said that they heavily used synthetic captions to train their model.
Also, other foundation multimodal models like GRIP, they train on synthetic captions.
Yeah.
Actually, I think I was referencing the DALL-E 3 paper, not the GPT-4 Vision paper.
Because they didn't actually put out a paper for GPT-4 Vision.
Cool.
And then, so you created OBELICS and then you trained IDEFICS.
Yeah, on it, on top of it.
In addition: so we trained IDEFICS on OBELICS, but also on other datasets,
like LAION and Public Multimodal Dataset, which is just a set of datasets that were
open-sourced at the moment.
Yeah, Conceptual Captions.
And, yeah, could you just take
us through the IDEFICS process? You created a smaller version and then you scaled
up to the full Flamingo 80 billion size?
Actually, it was the other way around.
Really?
Well, we tested that it worked at smaller scale, of course.
Yeah.
But we did not fully train our small
model before doing the big one. We fully trained our big model.
That's unusual.
Yes, they were training pretty much at the same time, but we needed to launch the big model first, also in terms of timing, because it took a long time to train.
So it was more like managing compute resources.
But the whole journey was a bit longer than just this moment when we trained the IDEFICS models, because we started with the objective of matching Flamingo's performance.
But the open source models that were out there were just not good enough.
So you have OPT, you have GPT-Neo, but it's just so far below Chinchilla that it's almost impossible to reach the performance.
Suddenly, when LLaMA came out, it started making sense and we started matching the performance.
And from there, we were able to train the big IDEFICS models.
A long journey, I think, because we had a lot of things to learn, because we had quite a lot of instabilities.
We shared a blog post about...
The checkpointing every 250 steps?
Yeah, we were checkpointing every 250 steps,
but we had to restart a few times.
Was that the instability you're talking about,
or this is just something else?
That was the final training,
where we still had the instabilities, but a lot less.
Like before this, we were struggling to train the model at all.
What saved us at that moment was the query-key layer norms.
I don't know if you've heard about this,
but basically there's a paper from Google
where they scaled up the vision
transformer to 22 billion parameters, and they needed this trick to keep the training stable.
Without this, if I remember the mechanism well, you would get something like hard attention.
And once you get this, the model would get very unstable.
So you needed to normalize the queries and keys to basically avoid this.
Anyway, so when we did this, we were able to train further.
And that was really useful for sure.
So query-key layer norms if you want stability.
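A minimal sketch of the mechanism, assuming single-head dot-product attention and a layer norm without the learnable scale and bias; this is an illustration with exaggerated made-up activations, not the IDEFICS implementation. Without normalization the logits become huge and the softmax collapses to a one-hot, "hard" attention; normalizing queries and keys keeps the logits bounded.

```python
import math


def layer_norm(vec, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learnable params)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_row(query, keys, qk_norm=False):
    """Attention weights of one query over all keys."""
    if qk_norm:
        query = layer_norm(query)
        keys = [layer_norm(k) for k in keys]
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(logits)


# Exaggerated activations, standing in for a run whose values are drifting up.
query = [40.0, -30.0, 20.0, -10.0]
keys = [
    [35.0, -25.0, 15.0, -5.0],
    [-20.0, 30.0, -10.0, 5.0],
    [10.0, 5.0, -15.0, 20.0],
]

plain = attention_row(query, keys)
normed = attention_row(query, keys, qk_norm=True)
print(plain)   # essentially one-hot
print(normed)  # still peaked, but every key keeps some weight
```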
It sounds like a trick that is repeatedly applied whenever you have instabilities.
You just do a layer norm or softmax or...
Yeah, you can.
Because usually it's parameters that explode.
Yeah, yeah.
Become too big.
So you need some sort of regularization on them.
And first you have to inspect which parameter explodes first.
And then you put a regularization on them.
But it's really tricky to see.
When you go in the gradients and in the activations
and you try to see where it blows up, when it blows up and why,
everything is interlinked.
It's really hard to pinpoint one particular layer.
So it was a tough one to crack.
But very interesting.
It's also hard because you never know exactly.
When you're in the process, you don't know where the instability comes from.
It could come from bad data.
It could come from a bug.
It could come from the size of the model, a learning rate that's too high, a warm-up that's not...
Like, there are a lot of hyperparameters and potential bugs that can come into play,
and the debugging is very, very tough.
So do you have a checklist of...
What do you look at when you see loss explode or whatever?
You see the loss explode, you look in the activations,
possibly the gradients, to see where it blows up.
Across all your parameters.
Yeah, you find a way to aggregate this.
Otherwise it's tough.
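One possible way to aggregate such logs, as a sketch only: record one summary number per layer at each logging step (here the maximum absolute activation) and look for the earliest step at which some layer crosses a threshold. The layer names, values and threshold are hypothetical, and this is not the tooling the team describes.

```python
def first_blowup(history, threshold):
    """Find the earliest logged step where any layer exceeds the threshold.

    history: list of (step, {layer_name: max_abs_activation}) in step order.
    Returns (step, layer_name) for the worst layer at that step, or None.
    """
    for step, stats in history:
        over = {name: value for name, value in stats.items() if value > threshold}
        if over:
            worst = max(over, key=over.get)
            return step, worst
    return None


# Hypothetical per-layer statistics logged periodically during training.
history = [
    (100, {"vision_proj": 3.1, "cross_attn.4": 5.0, "lm_head": 2.2}),
    (200, {"vision_proj": 3.3, "cross_attn.4": 48.0, "lm_head": 2.4}),
    (300, {"vision_proj": 90.0, "cross_attn.4": 900.0, "lm_head": 75.0}),
]

print(first_blowup(history, threshold=20.0))
```

By step 300 everything has blown up, which matches the point that everything is interlinked; looking at the earliest crossing narrows down where to start.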
If all previous solutions fail, because this is the hard part.
This is the hard part, but it tells you a little bit where it's happening,
so that gives you, like, where it's happening-ish in the model.
And so what's to blame?
And then you have a wide range of things to blame,
but fewer than before.
So you can look for a bug.
You can look at, for example, normalizing some layers.
And you can look into the data to see if you have very bad data that's having an impact.
That's where they would look first, right?
Exactly.
We looked at that too.
Is this what Weights & Biases would do for you?
Or is there one integrated solution that kind of...
You would wish?
Yeah.
No, to see the activations and the...
We go for activations to then inspect datasets and then look at, you know.
Weights & Biases would get you, like, the parameters.
It's just logging, right?
It's just logging, like, you have tons of them.
You can't really do much with it.
So we had a tool where we would, like, log them periodically.
We had a script that would aggregate them and display them for us in a
nice, like, interpretable way, so that we could try things out and see what
mechanisms were and could be impacting the instabilities. But yeah, it was a very,
very interesting journey for sure.
Yeah. And you published knowledge sharing documents and a memo.
Yes.
There's some interesting detail there, but obviously not everything.
Anything you want to highlight, just for listeners, on that one? Obviously, I can send in the link,
but just any other, like, big discoveries or learnings? Like, you talked about
the query-key layer norm.
The query-key layer norm was the big aha moment. And there's
also this one that we fixed afterwards: in the image mask,
there was like a little information leak, in that instead
of attending to none of the images,
for a few tokens,
very few of them, it would attend
to all of the images. Like basically,
you go in the attention, and you have
this mask that's like,
don't attend to anything.
But in effect, it's like,
attend to one over n, right?
And it doesn't
prevent you from training, but it doesn't
help. And
yeah, it's better to fix it for
training for sure.
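The "one over n" effect can be reproduced in a few lines. This is a sketch that assumes masking is done by filling masked positions with a large negative number before the softmax; the fix shown (returning all zeros for a fully masked row) is one possible remedy, not necessarily the one the team applied.

```python
import math

NEG = -1e9  # large negative fill value used in place of minus infinity


def masked_softmax(logits, mask):
    """Softmax where positions with mask == False are filled with NEG."""
    filled = [x if keep else NEG for x, keep in zip(logits, mask)]
    m = max(filled)
    exps = [math.exp(x - m) for x in filled]
    total = sum(exps)
    return [e / total for e in exps]


def masked_softmax_fixed(logits, mask):
    """Same, but a fully masked row attends to nothing at all."""
    if not any(mask):
        return [0.0] * len(logits)
    return masked_softmax(logits, mask)


logits = [2.0, -1.0, 0.5, 3.0]   # scores of one text token against 4 image slots
no_images = [False, False, False, False]

leaky = masked_softmax(logits, no_images)
fixed = masked_softmax_fixed(logits, no_images)
print(leaky)  # uniform 1/n over all images
print(fixed)  # no attention to any image
```

When every position is masked, all filled logits are equal, so the softmax is uniform and the token quietly attends to every image with weight 1/n.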
Yeah, sometimes you really have to go through all your codebase to...
Because you're doing gradient descent or something, there is no information.
Yeah, no, it's tricky because, for example, this would not have an impact
if you only train on web documents, because the documents are long.
But it would have an impact if you train on image-text pairs and you pack them together,
because you're attending to images in a document that has nothing to do with it.
But you can still train with
it. You can still get a very good model out of it. It's just, like, more
painful, and I think we can definitely get better performance
without the bug.
Yeah. Interesting. You mentioned, you know, just in terms of like the baseline
foundation models that you had, LLaMA has been a breakthrough on the language model side.
But you didn't talk as much about the vision encoder side of things. I actually had
a question from Joseph from the Roboflow episode,
about where you mentioned that the larger the CLIP, the better the results,
but in your final memo, you actually went for a smaller version of CLIP.
So EVA-CLIP versus LAION-2B.
Does this ring a bell?
Basically it was kind of unintuitive, the CLIP choice there.
Yeah, so LAION-2B is the dataset on which our CLIP-based model was trained.
And our CLIP model was, indeed, like 400 or 600 million parameters.
And it's true that the EVA-CLIP one is one of the biggest;
the EVA-CLIP one has five billion parameters.
So definitely at the beginning of the training,
we saw a big boost using EVA-CLIP.
However, at that time, we still had instabilities
training with this big EVA-CLIP.
We are not sure exactly why.
However, we fixed it.
And now, for the V2
version of IDEFICS, now that we can train longer, we actually saw a boost by using EVA-CLIP
instead of the previous CLIP, the small CLIP that we had. However, we now think that EVA-CLIP is under-trained.
So it means that even if it's really big, we can obtain the same performance with smaller models.
So there is this SigLIP model that I mentioned just earlier,
by Google, that is much smaller, 400 million parameters.
And that is more efficient.
You're choosing that as your base.
Yeah, we did an ablation, EVA-CLIP versus SigLIP.
And actually SigLIP was slightly better.
But the thing is that it's much faster for inference
and also for training, because there are fewer parameters.
One thing is also that SigLIP has a higher resolution.
So that has an impact for OCR tasks.
384 instead of 224.
Yes.
That's great.
Yeah, maybe we should talk about
IDEFICS 2.
So we're going to time this podcast release
with whenever you guys are releasing it.
I actually had no idea you were working on a V2.
I just came in wanting to talk about your old work,
but obviously you're still doing active research.
For IDEFICS V2 or V1.5, whatever we call it,
the major axes we wanted
to improve on were the image resolution and the base model.
We wanted a better base model and a smaller one.
Oh, sorry, pre-trained language model.
So we've been using the Mistral one.
We wanted to iterate a little bit also on the data,
to have better filters on OBELICS,
better synthetic data for
the image-text pairs.
So essentially iteration on the data by replacing original LAION pairs
by synthetic captions to have a stronger alignment,
cleaning OBELICS a bit based on perplexity.
So it's not removing too much, but like potential bad data.
Also, yeah, using just better pre-trained models.
So for the backbones, we use like a better CLIP,
SigLIP, and we use also Mistral, which is better than LLaMA 1.
We are right now also changing the modeling,
so moving away from the Flamingo architecture
to something that has fewer parameters.
Instead of incorporating the vision components directly into your LLM
by breaking it open and adding cross-attentions at each layer,
or every n layers, you can instead take your vision encoder,
take the embeddings out of it,
feed them through linear layers,
and feed them directly to the language model.
And this works quite well,
and this contains fewer parameters and it's much easier to train.
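As a toy sketch of this second design, assuming a single linear layer and tiny made-up dimensions (the real models use learned weights and far larger sizes): patch embeddings from the vision encoder are projected into the language model's embedding size and simply placed in front of the text token embeddings. All names and numbers here are hypothetical.

```python
def linear(x, weight, bias):
    """Apply y = x @ weight + bias. x: [n, d_in], weight: [d_in, d_out]."""
    columns = list(zip(*weight))  # d_out columns, each of length d_in
    return [
        [sum(xi * w for xi, w in zip(row, col)) + b for col, b in zip(columns, bias)]
        for row in x
    ]


def build_input_sequence(image_embeds, text_embeds, weight, bias):
    """Project image embeddings to the LM width and prepend them to the text."""
    visual_tokens = linear(image_embeds, weight, bias)
    return visual_tokens + text_embeds


# Toy values: 2 image patches of vision width 3, projected to an LM width of 2.
image_embeds = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
bias = [0.0, 0.1]
text_embeds = [[0.3, 0.3]]  # one text token already in LM embedding space

sequence = build_input_sequence(image_embeds, text_embeds, weight, bias)
print(len(sequence), sequence)
```

The language model itself is left untouched, and the only new parameters are the projection's weight and bias, which is where the parameter saving over per-layer cross-attention comes from.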
So we are currently trying to do this.
But what we can say is that without this new modeling,
just by iterating
on better data and better pre-trained models,
we are now matching the Flamingo 80B performance with a 9B model.
So this is already a big improvement compared to our first version,
without even touching the modeling part.
I think also one of the big improvements with the new IDEFICS is the licensing.
The issue with IDEFICS was that it was based on LLaMA, and the license is not
commercial. So now with a model that's based on Mistral and, most probably, SigLIP,
this is a lot better for anyone that wants to use the model commercially. The model will also be
smaller, so a lot better for inference. And hopefully we can beat the performance of IDEFICS 80B. That would be
really good.
So you're only producing a 9B?
Not exactly 9B, because we're taking off parameters. It should be
about 7.5B.
Right now, the focus was really better data,
smaller but better
open source models,
and resolution,
improving on this as well.
That was the focus.
You mentioned in our prep as well
that there are some topics that you're paying
particular attention to like hallucinations.
Maybe you could talk about the topics
that you are finding
are particularly areas of concern
with multimodal models.
So we've been using hallucination at the beginning as a broad term,
and we realized that it was better to categorize it a little bit more specifically
to some categories that were more targeted.
So for example, there's the object attributes where you would have a small attribute,
like, let's say the hand of a person that's a certain color,
and the model would be like it's yellow when it's red.
So that would be one. There's objects that are not there but the model thinks are
there. There's when the model is trying to reason with different elements in the picture but it gets it
wrong, so you get, like, a comparison kind of hallucination. Counting.
Oh my god, yeah.
You have the environment,
so it would talk about the object or the person in the picture but it would get the whole
environment behind wrong. And a few others that I don't have in mind
right now. But basically, categorizing those, I think, is important because it helps you target the type of
data that's missing or the type of fine-tuning that you should do afterwards. And so
this is going to be very useful for us in building future datasets. I'm not sure if we'll be able
to incorporate all those datasets we've been thinking about for the V1.5. But we'll definitely
do this for the iteration after.
The goal here is really to get to something that's on par with or better than what is currently done in
closed source. Like, GPT-4V doesn't hallucinate as much on those topics, and you want the open
source models to at least match this. But it's still an open research topic,
because those models still do this. If you push them a little bit, if you ask some specific
questions, they will hallucinate things in the picture.
Yeah, you mean GPT-4, including GPT-4V.
Yeah.
I have tried this, by the way.
I tried to use it to interpret, like, a menu,
and it would just make up menu items and make up prices.
Something really interesting is that even much better and bigger models
like GPT-4 show the same failure cases as our model.
So it means that the way we are training the models now is not ideal.
And take, for example, the counting task.
Even if you train on web documents or image-text pairs,
you will never really find this task in your training data.
You will always have a picture and maybe a caption of one apple or two computers,
but you will never get to like 10, 15, 20.
It doesn't really happen.
So you don't learn this ability just by training
on image-text pairs or web documents.
So what you have to do is to create your own datasets
that target specific tasks.
Like for example, create one specifically for OCR,
create one specifically for counting,
create one specifically, I don't know, to challenge the model
on hallucination, some types of hallucination.
A sort of FLAN, but for multimodal.
And I think in multimodal
those hallucinations are even more unforgiving
than what they are for NLP
in that when the language model is making up facts
you can think that it got a little bit wrong
or it's making up a story
but when a multimodal model tells you
there is a teapot in this picture and there is none
it's very obvious it's very obvious very quickly
if you're the user of this model
and you receive this you are having a hard time
trusting it for anything
so this is one of the big
big things to tackle. And for hallucinations in NLP, a fine-tuning that has helped a lot is
the reinforcement learning with human feedback or AI feedback. And this is still very early in multimodal.
We're not sure exactly how much we can improve with those types of datasets. But it's likely
that it will help the model understand uncertainty at a more fine-grained level,
and so improve on all those types of hallucinations ultimately.
And you said in our prep as well that you're measuring this against other closed source models like Bard.
Like, basically, do you have your own internal benchmarks that you're running?
No, no, this is mostly qualitative.
It's more like, does it, like, we know it happens in ours because we've played with it.
Does it happen with theirs?
And yes, it does.
The thing is, for ours, the evaluation datasets that we use, or that we've used for
IDEFICS so far, they don't measure hallucinations that much.
It's not targeted for this.
New datasets that are coming for evaluations are exciting,
and I think they will go in this direction
and will help us evaluate hallucinations
and different types of hallucinations more accurately.
But right now, at least for IDEFICS,
even with those evaluation benchmarks,
and even if this helps,
obviously, if you get very bad scores,
your model could be hallucinating things,
it's not targeted to this,
and we were flying blind
a little bit on this topic.
But there are definitely benchmarks that evaluate hallucinations currently.
So I think about SugarCrepe, for example, where you just have an image and two captions of it,
and the model has to select the best one, or Winoground, I think.
So actually the captions are made such that it's tricky.
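The evaluation format described here can be sketched as follows. The scoring function is a deliberately naive stand-in (word overlap with a set of tags representing the image), chosen only to show why such caption pairs are tricky; it is not how any real model or benchmark implementation scores captions, and the examples are invented.

```python
def pairwise_accuracy(examples, score):
    """Fraction of examples where the correct caption outscores the hard negative.

    examples: list of (image, correct_caption, hard_negative_caption).
    score: callable (image, caption) -> number, higher meaning a better match.
    """
    correct = sum(
        1 for image, pos, neg in examples if score(image, pos) > score(image, neg)
    )
    return correct / len(examples)


def toy_score(image_tags, caption):
    """Naive stand-in scorer: count caption words that appear among the tags."""
    return len(set(caption.lower().split()) & image_tags)


# Invented examples; each "image" is represented by a set of tags.
examples = [
    ({"dog", "grass"}, "a dog on grass", "a cat on grass"),
    ({"two", "apples"}, "two apples", "three apples"),
    ({"man", "horse"}, "a man riding a horse", "a horse riding a man"),
]

accuracy = pairwise_accuracy(examples, toy_score)
print(accuracy)
```

The last pair uses exactly the same words in a different order, so a scorer that ignores composition cannot separate the two captions and the example counts as a failure.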
Winogrande, like the common-sense NLP one.
Yes.
This one is really hard.
But you can have one with images, yeah.
I think this can be a great way to evaluate hallucinations of the models.
However, it's not really commonly used.
So right now the recent models like PaLI, for example, they report their numbers on classic
evaluation benchmarks.
So I think they need to be more widely adopted, these kinds of new benchmarks.
I think the last topic that we prepped was just overall,
like why is it important for there to be OSS multimodal models?
Like what can people use them for?
Where can it be useful?
My understanding of this is that ultimately you want
a foundational model that understands the world
similarly to the way we do.
And multimodal models, like, understand visual data
a lot better than language models, obviously,
but it means they provide a better foundational backbone for, like, other tasks in general.
I think in the future when you want to pre-train a foundational model, you will want to have it
multimodal.
So it's important now and it will be important in the future, and this is where everybody's
going, because training on text data gets you really far, as we've seen, but it only gets
you so far,
and it can only adapt to a subset of the tasks that we do every day.
And you need models to be able to understand vision
if you want them to be helpful for, I don't know, robotics, for example,
and a bunch of other use cases.
I think, for example, even for medicine,
you can have tons of applications with a vision input.
So you just take an image
of whatever and ask.
Panser.
Yeah, yeah.
Honestly, you can ask your model for help,
or even in your everyday life, how to build this table.
You just take a picture of what you have and you ask the model.
You just take a picture of even something handwritten,
ask things about it, like solve the exercise shown
in this picture. A lot of things that you can't do with text-only models, yeah, that you can do with
multimodal models. So definitely it's a big step forward. It's still a bit immature, and we have
seen that because of all the hallucinations we mentioned, and because it's
still early, we are still trying to unlock some abilities for these models. Yeah, we believe
that in less than two years, we will only have multimodal models.
Like they will overtake everything.
Yes.
Also, right now, it's a lot of image-text models because that does the most important things
that you want without requiring too much compute.
I think it's more compute-efficient than if you would put video in it.
Ultimately, I think we'll need video to be incorporated in the pre-training as well.
I'm quite excited about what it could look like.
Probably the ones that are the most advanced
on it are the self-driving cars. They're doing like very, very interesting work there,
but it's all closed source. Oh, no, I think, was it Comma AI that released GAIA or something
like this? But yeah, basically like video generation. So I think that's very interesting.
Isn't video just, like, a series of frames of still images? What is so different about that?
It's hard because you need to encode each frame, or at least select how many frames you want to integrate per video.
Also, one big change is the length of the video, which can be really arbitrary,
while if you have an image, you always have the same number of tokens associated to it.
So all of this, plus the fact that having big video datasets is
really challenging, because you can't, for example, scrape YouTube like that, it's harder. So
yeah, all of this makes it hard to build a model with videos.
It's also extremely heavy.
Already when you go from a text dataset to an image-text dataset,
how much it weighs is so different. It means that the
data pipeline, like the dataset pre-processing,
takes more time. If you go with video, you do this like 10x, if not more.
A transformer takes tokens as inputs, a sequence of tokens, right?
And for text tokens, you get like one token for a subset of text, or word-ish.
For an image, depending on the resolution you want, you can get
a lot of tokens for a single image.
For a 224 image, you'd get, I think, 256 tokens per image.
If you want to increase the resolution, it goes up fast.
For SigLIP, I think it's upward of 700 tokens for 384 resolution.
So imagine this, and then you make it a video.
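The numbers quoted here follow from simple arithmetic, sketched below under the assumption of a ViT-style encoder that cuts a square image into non-overlapping 14-pixel patches and emits one token per patch. The patch size of 14 matches the CLIP ViT-H-14 name; applying it to SigLIP, ignoring any extra special tokens, and the frame count for video are assumptions made for illustration.

```python
def num_image_tokens(resolution, patch_size=14):
    """Tokens for a square image split into non-overlapping square patches."""
    patches_per_side = resolution // patch_size
    return patches_per_side * patches_per_side


def num_video_tokens(num_frames, resolution, patch_size=14):
    """Tokens if every sampled frame is encoded like an independent image."""
    return num_frames * num_image_tokens(resolution, patch_size)


print(num_image_tokens(224))     # 16 x 16 patches
print(num_image_tokens(384))     # 27 x 27 patches
print(num_video_tokens(8, 384))  # 8 hypothetical frames at 384 resolution
```

The token count grows with the square of the resolution, and then linearly with the number of frames, which is why some form of compression becomes necessary for video.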
So it gets really tricky, and you have to create a bottleneck, right?
You have to pack those images together and output, like, a few tokens.
That's what's possible with Flamingo.
It also means you have like this bottleneck there.
It's a lot of work.
And right now, I think the trade-off is not yet worth it
because we have a lot more to do with just image text.
But it will be at some point, I'm pretty sure.
That's all the questions I had.
Any final call to action?
You want people to go somewhere, check out the models, check out the papers?
Yeah, essentially.
So we are excited for the second version,
which will be released about the time
of NeurIPS.
And yeah, we think that
OBELICS can be useful
in the next year.
Even if we continue
updating the model, I think the
data set will remain pretty much the same.
And we
keep iterating on better
modeling and better data.
And if you want to follow
our work, we're at the Hugging Face
M4 organization.
Let's talk about naming. What is M4?
And then maybe we should mention
OBELICS and IDEFICS.
I think at the beginning of the project it was massive, multimodal,
multilingual, multitask model, but then we dropped the multilingual. But it's still
massive. Well, not that massive anymore, since we are trying to go smaller.
Multitask for sure.
I mean the data sets are still massive so that's good.
It should still be multilingual, no?
If you train it on Common Crawl.
Trained on Common Crawl.
It's just not targeted for a lot of different languages.
Like, we didn't take care of having multiple languages,
but there's definitely other languages in there.
Yeah, the naming, like, it changed through the project.
But we kept it.
The idea was also to have four modalities.
So it was fitting at the beginning.
At the beginning, we wanted audio, video, text, image.
We thought that audio and video, video is just a lot of compute
for not that many results right now.
And from the moment we decided to reproduce Flamingo, we dropped video and audio,
which was not immediately.
This started as a different.
What's wrong with audio?
It's still very, very heavy and not necessary right away,
because you can take the text tokens, plug them into an audio model after,
and it will just spit out the audio.
Ultimately, I think it will be good because there's different data as well.
Like when people talk, it's not the same as when people write.
So I feel there's a lot of interesting data to have with audio.
Again, not worth the compute right now, I think.
And also you would want this model to be integrating everything with like a given size
and then be applicable to whatever.
You give it to your robot and you're like, what?
Yeah.
And to close the loop there, OBELICS was an Asterix reference that
you made into a backronym or whatever.
Into an acronym.
Yeah, it was really hard to find the acronym.
I want to stress this.
I thought about it for, like, days.
We were brainstorming it for like a few days.
We had to cheat at the end a little bit.
We took the S of cross attentions and added it to.
Yeah, cross attentions.
The problem is the F of Flamingo, right, for IDEFICS?
Because we are, yeah, but we are moving away from the Flamingo architecture.
For the next iteration, it wouldn't be really accurate, but whatever.
Okay.
But yeah, the acronym works only for the first version,
because if I remember well, it's Image-aware
Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS.
Yes.
Yeah, it's a very, very impressive project.
And, you know, we were talking about it even before the GPT-4 Vision rollout,
that this is the most impressive
sort of open source
reproduction and it's amazing that
you're still continuing to work on it.
I think there's a lot of room left in
OBELICS to keep
mining
those rocks, and there's a lot of
learning as well in multimodality. I think it's
a very important area of research.
Thank you.
Yeah, thank you.
Hello, hello. It is swyx coming in
from the editing room in 2024.
If you're listening to this,
you're definitely a true fan. Thanks so much. And I hope you enjoyed that conversation as much as I
enjoyed recording it. We recorded that conversation on Halloween in 2023 in the hopes that we would be
able to release it at NeurIPS or around NeurIPS with IDEFICS v2. V2 was supposed to be updated with a new
base model, which is going to be Mistral and a bunch of other dataset updates. And a couple
of things have happened. But you know what? Hey, let's just get Leo back
on to talk about it. Hello from 2024. It's Leo. Just wanted to add a few things since the
recording was done a while ago now. We haven't trained the model yet because we're iterating a lot more
on data. So we're making progress on OCR, on image-to-code capabilities, and we want to be a lot
more thorough in the image-text pairs dataset that we use. I don't know if you've heard, but the
LAION dataset has had an issue with CSAM images, and so we want to get ahead of this
and fix the problem before we start the training.
But we will start training soon.
In the meantime, we released WebSight.
It's an image-to-HTML dataset.
The idea behind this data set was to show that we could create
a very useful synthetic data set at scale with open-source models.
Hugo spearheaded the effort,
and so he used Mistral
and the DeepSeek Coder model to generate the pairs of screenshots and HTML code.
And then we fine-tuned an early version of IDEFICS 2.0 on the dataset to have a demo.
You can find the model on the hub, you can find the demo on the hub,
and you can find the dataset on the hub. So definitely go there and check it out.
I think some people have already started training their own model on it.
It's been very well received, and so we think we're going to do a lot more small releases like WebSight in the future,
basically switching from releasing everything in one package with the model trained to
releasing the datasets, architecture and training insights that we get along the way, and then releasing the model.
So stay tuned. I hope you will enjoy the podcast, and thanks, Shawn, for having us.
