No Priors: Artificial Intelligence | Technology | Startups - Mistral 7B and the Open Source Revolution With Arthur Mensch, CEO Mistral AI
Episode Date: November 9, 2023

Open source fuels the engine of innovation, according to Arthur Mensch, CEO and co-founder of Mistral AI. Mistral is a French AI company which recently made a splash by releasing Mistral 7B, the most powerful language model for its size to date, outperforming much larger models. Sarah Guo and Elad Gil sit down with Arthur to discuss why open source could win the AI wars, their $100M+ seed financing, the true nature of scaling laws, why he started his company in France, and what Mistral is building next.

Arthur Mensch is Chief Executive Officer and co-founder of Mistral AI. A graduate of École Polytechnique and Télécom Paris and holder of the Master Mathématiques Vision Apprentissage at Paris-Saclay, he completed his thesis in machine learning for functional brain imaging at Inria (Parietal team). He spent two years as a post-doctoral fellow in the Applied Mathematics department at ENS Ulm, where he carried out work in mathematics for optimization and machine learning. In 2020, he joined DeepMind as a researcher, working on large language models, before leaving in 2023 to co-found Mistral AI with Guillaume Lample and Timothée Lacroix.

Show Links: Arthur's LinkedIn | Mistral | Mistral 7B | Retro: Improving language models by retrieving from trillions of tokens | Chinchilla: Training Compute-Optimal Large Language Models

Sign up for new podcasts every week. Email feedback to show@no-priors.com. Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @ArthurMensch

Show Notes: (0:00) - Why he co-founded Mistral (4:22) - Chinchilla and Proportionality (6:16) - Mistral 7B (9:17) - Data and Annotations (10:33) - Open Source Ecosystem (17:36) - Proposed Compute and Scale Limits (19:58) - Threat of Bioweapons (23:08) - Guardrails and Safety (29:46) - Mistral Platform (31:31) - French and European AI Startups
Transcript
Open source AI models have completely changed the landscape of technology over the past year.
One tiny team of ex-DeepMind and Meta researchers in France has made a huge splash recently: Mistral.
This week, Elad and I are joined by Arthur Mensch, the CEO and co-founder of Mistral,
who recently released Mistral 7B, an Apache 2.0-licensed open-source model that has changed people's mental models about what can be done
with small models. Arthur, welcome to No Priors.
Thank you for inviting me. I'm very glad to be here.
Okay, so just six months ago, when we met, you were leaving DeepMind to start Mistral.
It takes real guts to look at the scale of dollars and compute that OpenAI and Google and
others have amassed and say, like, we want to play in this game too, and it's important we do.
Tell us about the inspiration to start.
Guillaume, Timothée and I were, I guess, pretty early in the field; we had been doing machine learning for 10 years.
And we knew where to start from and how to make a good model with a limited amount of compute and money.
Well, not so limited, but at least more limited than where we were coming from.
And so I think that's why we got started.
The various companies we were in moved in directions that we hadn't anticipated when we joined them.
And we decided that there was a very good opportunity to create something that would be a standalone company in Europe, focused on making AI better, on building frontier AI, and on open-source AI as a core value.
Maybe we can talk about each of those pieces.
So 10 years in machine learning before; you were a co-author on the Chinchilla scaling laws paper.
You worked on the sort of mixture-of-experts ideas early.
Can you talk a little bit about what your research directions were at DeepMind?
Yeah.
So I come from an optimization background.
So my focus has always been for the last 10 years to make algorithms more efficient
and to make better use of the data we have to build models with good predictive performance.
And so when I arrived at DeepMind, I joined the LLM team, which was
10 people at the time. And very quickly, I started to work on retrieval-augmented models,
with a paper called Retro that I co-authored with my friend Seb Borgeaud, who is still at DeepMind.
The point there was to use very large databases during pre-training so that we didn't force
knowledge into the model itself. And we would tell the model that it would have access to
an external memory anyway. And so it was working quite well. We could actually lower the
perplexity, let's say. That's what you work on when you make LLMs. There were some limitations
that I think the community has started to address quite well. And that was at the time when
retrieval methods weren't really mainstream, now they've become completely mainstream. So that's
the first project I did. I also worked on sparse mixtures of experts quite quickly, because
that was related to the topic of my postdoc, which was optimal transport. Optimal transport
is a setting where you have, I guess, tokens, and you need to assign them to devices,
and you need to make sure that there's some good assignment between the two
so that the devices don't see too many tokens. And as it turns out, optimal transport
is the mathematical framework to do that assignment correctly. And so I started
to work on introducing this into sparse mixtures of experts. And very quickly, we started to move
on to scaling laws.
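To make the token-to-expert assignment he describes a bit more concrete, here is a minimal sketch of balanced routing with a few Sinkhorn (optimal transport) iterations. It is purely illustrative, using a toy NumPy setup; it is not Mistral's or DeepMind's actual routing code, and the function names are made up for the example.

```python
# Illustrative sketch of optimal-transport-style routing for a sparse mixture of
# experts: a few Sinkhorn iterations push the token-to-expert assignment toward
# equal load per expert. Toy NumPy setup, not actual production routing code.
import numpy as np

def sinkhorn_routing(router_logits, n_iters=10, temperature=1.0):
    """router_logits: (n_tokens, n_experts). Returns a soft assignment matrix
    whose rows each distribute one unit of token mass and whose columns are
    pushed toward an equal share of the total load."""
    n_tokens, n_experts = router_logits.shape
    plan = np.exp(router_logits / temperature)
    for _ in range(n_iters):
        plan /= plan.sum(axis=1, keepdims=True)   # each token sends 1 unit in total
        plan /= plan.sum(axis=0, keepdims=True)   # normalize expert loads...
        plan *= n_tokens / n_experts              # ...then rescale to n_tokens/n_experts each
    return plan

rng = np.random.default_rng(0)
assignment = sinkhorn_routing(rng.normal(size=(16, 4)))  # 16 tokens, 4 experts
print(assignment.sum(axis=0))                            # roughly 4 units of load per expert
```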
So how do you actually take the method that is working at a certain scale and try to predict
how that will evolve with the scale, the number of experts, the amount of data you see?
And so that's work I've done with many colleagues as well, on how you adapt the scaling
laws for dense models to a setting where you want to predict the performance
not only in relation to the size of the model, but also to the number of experts.
I guess that was the second thing I worked on, and then, connectedly, I worked on Chinchilla,
which is, I think, a major paper in the history of LLMs, also with Seb, Jordan, Laurent, and many other people.
Basically, the story was that everybody was training models on too few tokens because of the paper from 2020,
that happened to be not very well executed.
And so what we observed is that you could actually correct that.
And so instead of training very large models on very few tokens,
you should actually grow the number of tokens as you grow the size of the model,
which if you think about it makes a lot of sense
because you don't want to have infinite size model looking at a finite number of tokens.
And similarly, you don't want to have a finite size model looking at an infinite number of tokens.
There must be some proportionality. Yeah. Yeah, exactly. And that's something
that we showed empirically. And I think that's something that was adopted very fast because
it was like a pure win. For the same amount of compute, you would get a model that would be
better, but also a model that will be four times cheaper to serve. So that was definitely
a gain. And as it turns out, we didn't go far enough. That's what we did at Mistral. We
realized that there was also a lot of opportunity in actually compressing models more.
I mean, we've seen with Llama that it was actually possible.
What we showed in Mistral 7B is that we were definitely far away from the limit of compression.
We somehow corrected that by making a model very small, super cheap to serve, super fast, running on your MacBook Pro, but still good enough to be useful.
And so that's one of the first achievements we made in the company.
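As a rough illustration of the proportionality he describes, here is a back-of-envelope sketch using commonly cited Chinchilla rules of thumb (roughly 20 training tokens per parameter, and about 6·N·D training FLOPs). The constants are approximations for illustration, not Mistral's numbers; training a small model on far more tokens than this rule suggests is exactly the compression trade-off he mentions.

```python
# Back-of-envelope Chinchilla-style proportionality: the compute-optimal token
# count grows linearly with parameter count. The 20 tokens/parameter ratio and
# the C ~= 6*N*D FLOP rule are common approximations, used here only to illustrate.
def compute_optimal(n_params):
    tokens = 20 * n_params           # approximate compute-optimal token count
    flops = 6 * n_params * tokens    # standard training-FLOPs approximation
    return tokens, flops

for n in (7e9, 70e9):
    tokens, flops = compute_optimal(n)
    print(f"{n/1e9:.0f}B params -> ~{tokens/1e12:.2f}T tokens, ~{flops:.1e} FLOPs")
```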
Yeah, I think a lot of people were really impressed when Mistral came out with the 7B model because, A, you did it very quickly.
You know, you did it in a matter of a few months, but B, if you look at the cost of actually running these models,
obviously there's the training side of it in terms of actually building the model, but then there's
inference. And so much of the cost, day to day, if you're a user of OpenAI or a user of Llama
or a user of Mistral's models, is how much it costs to actually run the models, the inference. And that's often
driven by the size of the model. And so I think by coming out with these small models that were
very performant, it really made a huge change to how people thought about what was possible.
Is there anything that you can share in terms of where you think this performance is going to go
or how you think about the sizes of models, both in terms of smaller, more performant models,
as well as, do you folks plan to build something very large, more like a GPT-4 or GPT-5 equivalent over time?
Yeah, sure.
So I think what we've seen is that in 2021 and 2022, you had a few companies that were really
focusing on pushing the performance of models.
And if you want to push the performance, the pure performance of models, you don't
care about inference because you're not going to use the model, you're just going to see whether
they're good or not. And that's really for scientific purposes. But then when you start thinking
about deployment and enabling downstream applications, then you need to think about what it is going
to cost in runtime. So you're not only worried about the upfront payment you need to make to get
the model, but you're also worried about the runtime. And so I think the coefficient that you put
between the inference cost and the training cost is really business dependent.
And as a company that intends to have a valid business model,
we think a lot about inference costs.
We think that it's super important to get to a regime where inference is super cheap
so that you can run agents, so that you can basically use LLMs everywhere, for all of your use cases,
and you're not blocked by cost, which is the case for the largest models currently.
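For a sense of why model size dominates serving cost, here is a rough single-stream decoding estimate under the common assumption that generation is memory-bandwidth bound. The bandwidth and precision figures are assumptions for illustration, not benchmarks of any particular machine.

```python
# Rough single-stream decoding estimate: if generation is memory-bandwidth bound,
# each new token requires streaming the model weights once, so throughput scales
# inversely with model size. The figures below are illustrative assumptions.
def tokens_per_second(n_params, bytes_per_param, bandwidth_gb_s):
    weight_bytes = n_params * bytes_per_param
    return bandwidth_gb_s * 1e9 / weight_bytes

print(tokens_per_second(7e9, 2, 400))    # ~29 tok/s: 7B in 16-bit on ~400 GB/s laptop memory
print(tokens_per_second(70e9, 2, 400))   # ~2.9 tok/s: a 10x larger model is ~10x slower
```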
So that's definitely something we had in mind,
and we knew that we could make a 7-billion-parameter model very good.
That's for sure.
This is definitely not the end of the story.
Now, the question is, do we train bigger models?
And the answer is obviously yes.
There's still a limit to what a certain model size can do.
This limit was, I think, underestimated.
But if you want to get to more reasoning capabilities,
you do need to move into larger models.
The other thing about moving to larger models is
that it enables you to train smaller models that are better,
through a variety of techniques like distillation or synthetic data generation.
So these two things are quite related.
If you want to make very strong, small models, you do need to have bigger models.
And we are indeed training larger models for sure.
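A minimal sketch of the distillation route he mentions, in which a small student is trained to match a larger teacher's softened output distribution. The toy random logits stand in for real model outputs; this illustrates the general technique, not Mistral's training recipe.

```python
# Minimal knowledge-distillation sketch: a small "student" is trained to match the
# softened output distribution of a larger "teacher". Toy random logits stand in
# for real model outputs; this illustrates the idea, not any particular recipe.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions, then penalize KL(teacher || student).
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2

student_logits = torch.randn(4, 32000, requires_grad=True)  # 4 positions, toy vocabulary
teacher_logits = torch.randn(4, 32000)                       # would come from the larger model
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()                                              # gradients flow to the student only
print(loss.item())
```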
Can you tell us about your approach to data and annotations?
Because we kind of talked about the other two dimensions.
Yeah, so we've talked about compute, and obviously data
is super critical.
So we work from the open web.
Well, we do a lot of work.
I think we do a good job at getting some good data.
The data quality is really what makes the model good.
I mean, data and algorithms, obviously, but data is super important.
We put a lot of focus on that.
And I think we do have a very good data set, that's for sure.
Data annotation is, I guess, another topic.
It's not related to pre-training.
When you pre-train a model, you really want to have the purest knowledge,
the purest quality of data.
When you want to align your model and instruct it, to ask it to follow instructions,
which is useful for many use cases because it makes it steerable,
you do need to have a certain amount of
human-produced annotations, or potentially machine-produced annotations.
And so that's something that we started working on.
We're not the top experts in the world in making
good instruction fine-tuned models. We're definitely ramping up, and the team is getting better and
better at that. One of the things we haven't talked about is the fact that you guys are an open-source
company, which is very, very different from the other sort of labs working at the state of the
art today. Why is that important? So if you look at, if you look back at the history of
machine learning in the last 10 years, it went very fast. I mean, we went from a
poor cat-and-dog detector to something that basically looks human-intelligent.
And it's useful to remember how that happened.
It happened because you had many academic labs.
You had many industrial labs actually spending more money on different problems.
And there was like full communication, almost full transparency until 2020.
Like whatever was done in whatever lab, even in a competing lab, was actually published
at NeurIPS, was published at ICML, and every six months we would all gather and get new
ideas. Ideas would circulate and everybody would build on top of the work of others. And that's
the way we went from something, well, potentially interesting to something very interesting.
But then the issue is that around 2020, some companies started to be quite ahead in some
fields and realized that some value could be accrued. And then at that point,
opacity made it back into the field.
And I think that's a cycle we've observed in software already,
a cycle between openness and closeness.
We are observing it again.
We think that it's too early,
and we think it's really damaging for the science
to actually move into such an opaque regime,
where you have a few companies basically doing the same thing,
just not communicating about it,
spending billions on compute doing exactly the same thing,
And where really the technology we're looking at is not working completely yet.
So still, it doesn't reason well.
Memory mechanisms are not very well understood.
Causality mechanisms are not well understood.
It's not super steerable.
There's a lot of biases.
I mean, it's incomplete.
There's many things to be done.
We still need to invent new techniques.
And how are you going to invent new techniques if nobody is speaking about it?
When, in order to invent new techniques, you still need to spend
some large amount of money to actually try things at scale,
and the few companies that have the money to spend
now refuse to communicate.
That's something that we deeply regretted,
and that's something that we are trying to change,
because we do have some substantial amount of money
to actually spend on compute.
We do have some good ideas.
We know that there's a big community
that is waiting for AI players, for open-source AI players, to appear,
and we're very grateful that Meta is moving in that direction.
By doing what we do, by being much more open about the technology we create,
we want to steer the community into a regime where things just work better,
where things are safer because they are put under more scrutiny.
And really our intention there is to take that position and to change the rules of the game
because we don't think that this is moving into a proper direction.
Yeah, it's very interesting because if you look at the current discourse,
the really big tech companies are claiming that open-source AI is dangerous,
and it feels like really a form of regulatory capture, right?
They want regulators to step in so that they can constrain innovation and kind of control an
industry. And, you know, the reality is if you look at things like global health, global
equity, open source is one of the biggest potential ways for all of humanity to benefit from
this technology in a way that's transparent and open and people can really understand and
see. How do you approach safety and policy and thinking about, you know, the right way
to think about safety in the context of open source?
So I think we approach it from a very pragmatic point of view.
So the question is: is open-sourcing, today, a model like the ones we make,
a dangerous thing?
Is it actually enabling bad actors to misuse the model?
Is it giving them marginal capacity, like extra marginal capacities in pursuing their bad endeavours?
I think the answer to this question is no. That's my conclusion. We've seen, well, a lot of ideas around bioweapons,
around, I don't know, nuclear terrorism and the like. And it's very interesting, because if you
actually assume good faith in these arguments, and I think in many cases people are in good
faith, and that's an assumption that we always make, and if you try to go down their arguments,
which is what we did, you realize that there is really nothing to them.
Nothing shows that an LLM is actually marginally better than a search engine for finding knowledge on topics that would enable bad uses.
And the other thing is that it's not even proven, and it's very likely not the case, that knowledge is the bottleneck for the actual misuse that we're talking about.
So we have two things.
In order to demonstrate that open sourcing large language models is actually unsafe,
you need to demonstrate that they provide a marginal improvement over a web search engine,
and that knowledge is the bottleneck for creation.
And in both cases, the answer to these two questions is no.
And so that means that we believe that we can open source models today,
and that actually it's the best way of putting things under the highest scrutiny
so that we are ready for potential new generations of models that could be super intelligent.
And in that case, I think we can rediscuss these premises.
But today, we're really talking about the compression of knowledge that is widely available on the web.
And so, marginally speaking, we're not creating anything that is more dangerous than before.
So I think there's really a trade-off there;
there's a dynamic conversation to be had.
That's what we discussed at the AI Safety Summit.
For sure, this needs to be revisited as model capabilities build up.
But today, banning open source, preventing it from happening, is really a way,
well, to enforce regulatory capture, even if the actors that would benefit from it don't
want it to happen.
But by design, if you actually ban small actors from doing things in the most
efficient way, which is open source, you do facilitate the life of the larger incumbents.
And that's something that would be, I guess, detrimental to Mistral's life, for sure.
What do you make of the arbitrary sort of compute and scale limits proposed?
That's interesting. I don't exactly know how they came up with this threshold.
It's a high threshold by any standard, because if you compute it, if you make the bad-faith
assumption that this is float64, it comes out to approximately $300 million of compute for a
single run. So that's high. That's not something we can even afford, and that we won't
be able to afford for the coming years. So it's high. And it's very arbitrary, because who tells
you that beyond 10 to the 26 you end up with bad capacities, that models start
to see the emergence of bad behaviors? That's definitely
not proven. Relating capabilities to scale is also very approximate, in the sense that it really
depends on the data. The data set is super important. I mean, there's a focus on bioweapons.
So let's say we want to prevent models from generating chemical compounds, because we think
it's an enabler of bad behaviors, which, as I've said, we don't think is the case. But if you do
want to prevent that, well, you do need to adapt your compute flop budget to the data set that
you are working on. As it turns out, that's what they did, because they actually made a specific
flop budget for biology, I think. So you can see the bioweapon narrative building on. But this is
completely arbitrary. We should really focus on capabilities and not pre-market conditions. And I think
that's, I mean, to some extent, there's still consensus around that. So everybody knows that
it's imperfect. It's a proxy, which is maybe fairly correlated. But we definitely need to
agree on how we measure capabilities, and agree on what capabilities we deem
dangerous. And I think we don't agree with one another on that topic. But these really
should be the judge, and obviously not pre-market conditions or the number of flops that you use.
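For a sense of where an estimate like the one Arthur gestures at comes from, here is an order-of-magnitude sketch of the cost of a 1e26-FLOP training run. The throughput, utilization, and price numbers are assumptions chosen only for illustration; realistic estimates vary by a large factor depending on hardware, precision, and efficiency, which is part of why such thresholds feel arbitrary.

```python
# Order-of-magnitude sketch of what a 1e26-FLOP training run might cost. The
# per-GPU throughput, utilization, and hourly price are illustrative assumptions,
# not quotes; real estimates vary by a large factor.
def training_run_cost(total_flops, flops_per_gpu_s, utilization, dollars_per_gpu_hour):
    gpu_seconds = total_flops / (flops_per_gpu_s * utilization)
    return gpu_seconds / 3600 * dollars_per_gpu_hour

# Assuming ~1e15 dense FLOP/s per accelerator, 40% utilization, $2 per GPU-hour:
print(f"${training_run_cost(1e26, 1e15, 0.4, 2.0):,.0f}")   # ~$139 million under these assumptions
```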
Do you know why there's such a focus on bioweapons? I ask this as someone who worked for almost a decade as a
biologist. And, you know, when I look at some of the complexity of building viruses, or some of
the complexity of actually doing these things, I'm surprised that there's so much focus in the
community on that specific example. Do you have any sense of the origins of why people keep
bringing that up? Because it's actually hard to translate; it's not some digital thing
that you manipulate. I think it's a very interesting question. Honestly, I don't have the answer.
it's almost epistemology at that point.
So how did this idea appear
and how did it get amplified by the policy people,
and how did it actually end up being heard
by national security?
And I think it started somehow
with GPT-4, in an annex or something,
like on page 46,
I'm not exactly sure of these numbers,
but where they generated some chemical compound,
and then they added a small
remark saying that, okay, maybe that's not the direction we want to take.
We don't want to have a model that reasons about chemical compounds.
The chemical compound in question wasn't dangerous, but they made this observation,
which is definitely expected.
If you train on biology articles, you're definitely going to be able to produce some
chemical compounds.
So, an expected observation.
And then somehow people built things on top of it.
Nothing was observed.
No scientific study in proper form was published.
But then policy papers started to cite non-scientific papers, arguing that these were
scientific evidence, that the bioweapon narrative was actually true.
And then policy papers started to cite the other policy papers.
And all of a sudden you end up with like 50 papers saying that, for sure, bioweapons are going
to blow us up.
And this is what the policymakers read at the end.
I think that's how we ended up where we are today.
So there's some deconstruction to be made.
Unfortunately, I think the open source community hasn't been vocal enough because they
didn't see it coming.
But right now, this is changing and I'm very glad that it is.
I think it's mimetic.
You have the factor of, like, the world just went through the COVID-19 pandemic.
Yeah, I mean, the COVID trauma for sure played a role in that narrative.
I mean, that's definitely a trauma.
30 million people died.
That's definitely something we don't want to see recur.
I don't think AI is going to be the one triggering the next pandemic.
It's always going to be climate change.
That's the way it was, and that's probably where the focus should be,
instead of focusing on hypothetical, unproven biological risks induced by
token generators.
If bioweapons are not a pragmatic concern in the visible future, there are real concerns around
guardrails, about what we want our AI models to actually generate.
Like, how do you think about that?
Yeah.
So I think this is a very valid concern.
Models can output any kind of text.
And in many cases, you don't want it to output any kind of text.
So when you build an application, you need to think about the guardrails you want to put on the model output, and potentially also on the input.
So you do need to have a system that filters inputs that are not valid or that you deem illegal, and outputs that are not valid or that you deem illegal.
So the way you do it, in our mind, is that you create a modular architecture that the application maker can use, which means you provide the raw model,
so the model that hasn't been altered to ban some of its output space.
And then you propose new filters on top of that that can detect the output that we don't want.
So it can be pornography, it can be hateful speech.
These things you want to ban when you have a chatbot, for instance.
But these things, you don't want to ban from the raw model,
because if you want to use the raw model to do moderation, for instance,
you want your model to know about this stuff.
So really assuming that the model should be well-behaved is, I think, a wrong assumption.
You need to make the assumption that the model should know everything.
And then on top of that, have some modules that moderate and guardrail the model.
So that's the way we approach it.
And it's a way of empowering the application maker in making a well-guarded application.
And we think that it's our responsibility to make very good modules that allow guardrailing the model correctly.
It's part of the platform, and we think the way it should work is that there should be some
healthy competition in that domain, with different startups working on guardrailing the models.
And the way you make this healthy competition is not by trusting a couple of companies to do
their own safety; it's rather to ask application makers to comply
with some rules. So a chatbot should not output hateful speech. And so that means that now the
application makers need to find a good guardrailing solution. And now you have a competition
where there's some economic interest in providing the best guardrailing solution. And so that's
the way we think the ecosystem should work. And that's the way we position ourselves. That's the way
we build the platform with modular filters and modular mechanisms to control the model.
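A minimal sketch of the modular architecture he describes: the raw model stays unaltered, and the application maker composes input and output filters around it. The model callable and the `violates_policy` check are placeholders invented for the example; in a real system the filter would itself be a moderation model or service.

```python
# Sketch of the modular guardrail pattern described above: the raw model is left
# untouched, and the application maker wraps it with input and output filters.
# `raw_model` and `violates_policy` are placeholders, not a real API.
def violates_policy(text: str) -> bool:
    # Placeholder check; a real deployment would call a moderation classifier here.
    blocked_terms = ("example_slur", "example_banned_topic")
    return any(term in text.lower() for term in blocked_terms)

def guarded_generate(raw_model, prompt: str) -> str:
    if violates_policy(prompt):
        return "Request refused by the input filter."
    completion = raw_model(prompt)
    if violates_policy(completion):
        return "Response withheld by the output filter."
    return completion

# Usage with a stand-in model:
print(guarded_generate(lambda p: "Mistral 7B is an Apache 2.0-licensed model.",
                       "Tell me about Mistral 7B."))
```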
It's great that you folks are being so thoughtful about that. I think when people talk about
safety, they really talk about three topics, and sometimes they talk past each other. One is this
sort of moderation or text-based risk. And so that's the risk of hateful content, illegal content,
bias, et cetera. There's a second class, which we talked about already, which is physical risk.
It's things like bioweaponry or the ability of AI to help derail a train or interfere somehow.
And then third is like existential or species risk.
And that's when people start talking about AGI and new forms of life and, you know.
Resource competition.
Resource competition or aggregation or things like that.
So first of all, I think it's very important to address these three things separately, and to hammer home that solutions exist for the first one.
For the second one, there's no evidence that it actually exists at this point, and no evidence that it will exist in
the near future. The third point, I think, is very philosophical. Obviously, if you can make
a system of arbitrary complexity, it can start doing anything that you don't want it to do.
We are not at a stage where the model has arbitrary complexity. And so this is very abstract to me.
I think that, still, I mean, we'll move on to a world with agents and AIs interacting with one
another, and that will create a lot of complexity. But anticipating
that this complexity will necessarily lead to a collapse,
we call it a collapse in machine learning when suddenly everything stops working
because, I don't know, you fall into a local minimum,
well, it's unclear to me that complexity leads to a collapse.
Usually complexity leads to doing nothing because there's no self-organization
and no willpower to build something.
So I'm not too worried about existential risk.
Obviously, this is a dynamic conversation.
If we can make a model that is increasingly intelligent,
then maybe you're at a singularity level.
There's no evidence whatsoever that we are on the way to doing that,
to making that happen.
So I think it's a very open conversation we should have.
I personally don't see the scientific evidence,
and as a scientist, I trust only what I can see.
Sure.
And then I guess you mentioned agents,
which I think is an area of a lot of activity right now.
It feels like a number of things that are related to agents are still a little ways out in the future.
In other words, it feels like an area with enormous promise, but it's still quite early.
Are there any big technological innovations or things that you're working on that you think will really help expedite a world that moves more toward agent-based use in a broader sense?
I think making models smaller is definitely a way to make agents work, because one problem you have with agents is that, if you run an agent on GPT-4,
you're going to run out of money very quickly.
And so if you divide the cost of compute by a hundred, well,
you can start
to build more interesting things.
What we see with agents is mode collapse.
So not very interesting mode collapse.
They start repeating themselves and they fall into loops.
So definitely there's some research to be made there.
There's some research to be made on making models more
capable of reasoning and making them more capable
of adapting the amount of compute they spend
to the difficulty of the task.
And this can somehow be solved with agents.
So it's definitely an avenue of research that we're exploring.
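As a toy illustration of the mode-collapse failure he describes, here is a sketch of an agent loop that stops when recent actions start repeating. A real agent framework tracks much richer state; only the loop-detection idea matters here, and the names are hypothetical.

```python
# Toy illustration of the agent "mode collapse" failure described above: stop the
# loop once the agent's recent actions start repeating. A real agent framework
# would track richer state; only the loop-detection idea matters here.
from collections import deque

def run_agent(step_fn, max_steps=50, window=6):
    recent = deque(maxlen=window)
    for _ in range(max_steps):
        action = step_fn(list(recent))      # the agent proposes its next action
        if recent and list(recent).count(action) >= window // 2:
            return f"stopped early: agent is looping on {action!r}"
        recent.append(action)
    return "finished without detecting a loop"

# A degenerate agent that always proposes the same action triggers the check:
print(run_agent(lambda history: "search('mistral 7b')"))
```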
Yeah.
Going back to Mistral, you know,
one of the things you've talked about a little bit is the platform
that you've been building around the models that you train.
Can you tell us a little bit more about that and some of the directions that's heading in?
Yeah.
So we know that hosting models isn't easy.
We know there's a lot of work to be done on the inference side to make serving efficient.
There's a lot of work to be done on the training side
because you do need to come up with architectures
that are memory efficient, for instance.
That's something Mistral 7B is good at,
because it has a sparse attention mechanism
that makes it more memory-efficient.
So there's some work that you can do on the training side,
but in order to reap all the benefits of a good model,
you do need to work a lot on the inference part.
And so we are actively working on that part
to make it as efficient as possible
to build a platform that will be very cost-efficient.
And so you do need to have a good platform,
well, with good code, good inference code.
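To give a rough sense of the memory efficiency he refers to, here is a back-of-envelope comparison of KV-cache size with and without a windowed (sparse) attention pattern. The dimensions are illustrative, loosely in the style of a 7B model, and the window size is an assumption chosen for the example.

```python
# Back-of-envelope KV-cache size: with windowed (sparse) attention, each layer only
# keeps keys and values for the last `window` positions instead of the full sequence.
# Dimensions are illustrative, loosely 7B-model-like; they are assumptions, not specs.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, window=None, bytes_per_value=2):
    effective_len = min(seq_len, window) if window else seq_len
    return 2 * n_layers * n_kv_heads * head_dim * effective_len * bytes_per_value  # 2x for K and V

full = kv_cache_bytes(32, 8, 128, seq_len=32_000)
windowed = kv_cache_bytes(32, 8, 128, seq_len=32_000, window=4_096)
print(f"full attention: {full / 1e9:.2f} GB, windowed: {windowed / 1e9:.2f} GB per sequence")
```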
The other thing that you can propose to customers
is the fact that you do time sharing across customers.
So when you want to play around with a model,
if you want to make it completely safe,
you should spin it up on an instance of a cloud provider.
But if you just want to play around with it,
you can access an API.
It's going to be less costly, because just a single H100 can serve hundreds of customers.
So I think there's some demand, a lot of demand for experimentation and APIs,
and that's something that we started to build alongside the self-hosted platform
that we direct to other enterprise customers.
Your team is based in France.
You have said before that you think there's an opportunity for a really important AI company
that is French and European and serving the world.
I don't know if that was a mainstream point of view
before the early success of Mistral.
Can you talk about why you think that might work?
I think one very strong point of Europe in that domain is talent.
As it turns out, France, the UK, and Poland are very good at training mathematicians,
and as it turns out, mathematicians are very good at making AI.
Which means that there are a lot of French people and English people and Polish people in AI.
And many of them want to stay in Europe.
Their family is there.
The food is better.
You have many advantages.
I can't list them.
It would be too long.
And so obviously, we've been seeing the emergence of an AI ecosystem in London, I think very much
thanks to DeepMind, and then in Paris, also thanks to DeepMind and to Meta, which set up a lab there,
and to a lot of entrepreneurs who started to come back. So today we have, I think, hundreds of
startups in Paris. This is not the level of Silicon Valley, obviously, but we're starting to have
an ecosystem in place, with investors, and with operators investing as well. So the same kind of
flywheel that made San Francisco and the Bay Area successes is starting to spin in France,
and I'm very glad that we are participating in it.
This has been a great conversation, Arthur.
I always find you inspiring.
I'm very grateful to be an investor.
Thanks for doing this.
Well, thank you for having me
and looking forward to seeing you soon.
Find us on Twitter at No Prior's Pod.
Subscribe to our YouTube channel
if you want to see our faces, follow the show
on Apple Podcasts, Spotify, or wherever you listen.
That way you get a new episode every week.
And sign up for emails or find transcripts for every episode
at no-priors.com.