a16z Podcast - Safety in Numbers: Keeping AI Open
Episode Date: December 11, 2023

Arthur Mensch is the co-founder of Mistral and the co-author of DeepMind's pivotal 2022 "Chinchilla" paper. In September 2023, Mistral released Mistral-7B, an advanced open-source language model that... has rapidly become the top choice for developers. Just this week, they introduced a new mixture of experts model, Mixtral, that's already generating significant buzz among AI developers. As the battleground around large language models heats up, join us for a conversation with Arthur as he sits down with a16z General Partner Anjney Midha. Together, they delve into the misconceptions and opportunities around open source; the current performance reality of open and closed models; and the compute, data, and algorithmic innovations required to efficiently scale LLMs.

Resources:
Find Arthur on Twitter: https://twitter.com/arthurmensch
Find Anjney on Twitter: https://twitter.com/anjneymidha
Learn more about Mistral: https://mistral.ai
Learn why we invested in Mistral: https://a16z.com/announcement/investing-in-mistral/

Stay Updated:
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://twitter.com/stephsmithio

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.
Transcript
I think the battle is for the neutrality of the technology.
This is the story of humanity, making knowledge access more fluid.
Basically in 2021, every paper made this mistake.
It means that we're only trusting the team of large companies to figure out ways of addressing these problems.
All of the people that joined us as well deeply regretted, because we think that we're definitely not at the end of the story.
As it turns out, if you look at the history of software, the only way we did software collaboratively is really
open source, so why change the recipe?
Scaling laws.
These underpin the success of large language models today,
but the relationship between datasets, compute,
and the number of parameters was not always clear.
But in 2022, a pivotal paper came out,
often referred to as chinchilla,
that changed the way that many people in the research community
thought about that very calculus,
demonstrating that datasets were actually more important
than just the sheer size of the model.
One of the key authors behind that paper was Arthur Mensch, who was working at DeepMind at the time.
Now, earlier this year, Arthur banded together with Guillaume Lample and Timothée Lacroix, two researchers at Meta who worked on the release of Llama, and together they founded a new company, Mistral.
Together, this team has been hard at work releasing Mistral 7B in September, a state-of-the-art open source model that very quickly became the go-to for developers.
And they just released, as in the last few days, a new mixture of experts model that they're calling Mixtral.
So today you'll get to hear directly from Arthur as he sits down with a16z general
partner, Anjney Midha, as the battleground for large language models heats up, to say
the least.
Together, they discuss the many misconceptions around open source and the war being waged on
the industry.
Plus, they'll discuss the current performance reality of open and closed models and whether
that gap will close with time.
Plus, the kind of compute, data, and algorithmic innovation
required to keep scaling LLMs efficiently.
Now, it's really rare to have someone at the frontier of this kind of research
be so candid about what they're building and why.
So I hope you come out of this episode as excited about the future of open source as I did.
Enjoy.
As a reminder, the content here is for informational purposes only,
should not be taken as legal, business, tax, or investment advice,
or be used to evaluate any investment or security,
and is not directed at any investors or potential investors,
in any a16z fund.
Please note that A16Z and its affiliates
may also maintain investments
in the companies discussed in this podcast.
For more details, including a link to our investments,
please see a16z.com slash disclosures.
All right, why don't we start with the founding team story?
If we flash back to a few years ago,
labs are building foundation models
and the consensus across the research community
was that the size of these models
was what mattered most.
How many million or billion parameters
went into the model
seemed to be the primary debate
that people were having.
But you had a hunch
that the role of data mattered more.
Could you just give us the backstory
on the Chinchilla paper you co-wrote?
What were the key takeaways in the paper
and how was it received?
Yeah, so I guess the backstory is that in 2019, 2020,
people were relying a lot on the paper
called Scaling Laws for large language models.
That was advocating
for basically scaling infinitely the size of models
and keeping a number of data points rather fixed.
So it was saying that if you had four times the amount of compute,
you should be mostly multiplying your model size by 3.5,
and then maybe your data size by 1.2.
And so a lot of work was actually done on top of that.
So in particular at DeepMind, when I joined, there was a project called Gopher,
and there was a misconception there.
There was also a misconception on GPT-3.
And basically, in 2021, every paper made this mistake. And at the end of 2021, we started to realize there were some issues when scaling up. And as it turns out, we turned back to the mathematical paper that was actually talking about scaling laws, and it was a bit hard to understand. And we figured out that actually, if you thought about it a bit more from a theoretical perspective, and if you looked at the empirical evidence we had, it didn't really make sense to actually grow the model size faster than the data size.
And we did some measurements. And as it turned out, what was actually true was actually what we expected, which is, in common words: if you multiply your compute capacity by four, you should multiply the model size by two and the data size by two. That's approximately what you should be doing, which is good, because if you move everything to infinity, everything remains consistent. So you don't have a model which is infinitely big or a model which is infinitely small, with infinite compression or close to zero compression. So it really makes sense. And as it turns out, it's really what you observe. And so that's how we trained Chinchilla, and that's how we wrote the Chinchilla paper.
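(To make that rule of thumb concrete, here is a minimal sketch, assuming the common C ≈ 6·N·D training-FLOPs approximation and a roughly 20-tokens-per-parameter ratio in the spirit of the Chinchilla result; the compute budgets are purely illustrative.)

```python
# Minimal sketch of the compute-optimal rule of thumb described above:
# quadrupling the compute budget roughly doubles both parameters and tokens.
# Assumes training FLOPs C ~ 6 * N * D and D ~ 20 * N (Chinchilla-style ratio).

def compute_optimal(c_flops: float, tokens_per_param: float = 20.0):
    """Return (params N, tokens D) for a training FLOP budget C."""
    # C = 6 * N * D and D = tokens_per_param * N
    #   => C = 6 * tokens_per_param * N**2
    #   => N = sqrt(C / (6 * tokens_per_param))
    n = (c_flops / (6.0 * tokens_per_param)) ** 0.5
    return n, tokens_per_param * n

if __name__ == "__main__":
    for c in (1e21, 4e21):  # quadruple the compute budget...
        n, d = compute_optimal(c)
        print(f"C={c:.0e}: params ~ {n / 1e9:.1f}B, tokens ~ {d / 1e9:.0f}B")
    # ...and both the parameter count and the token count roughly double.
```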
At the time, you were at DeepMind and your co-founders were at Meta. What's the backstory around how you three ended up coming together to form Mistral after the compute-optimal scaling laws work that you just described?
So we've known each other for a while, because Guillaume and I were in school together, and Timothée and I were in the same master's program in Paris. Basically, we had very parallel careers.
Timothée and I actually worked together again when I was doing a postdoc in mathematics.
And then I joined DeepMind as Guillaume and Timothée went on to become permanent researchers at Meta.
And so we continued doing this.
I was doing large language models in between 2020 and 2023.
Guillaume and Timothée were working on solving mathematical problems with large language models,
and they realized they had to have stronger models.
And on my side, I was mostly working in a small team at DeepMind.
So we did some very interesting work on Retro, which is a paper doing retrieval for large language models.
We did Chinchilla.
Then I was in the team doing Flamingo,
which is actually one of the good ways of doing a model that can see things.
And I guess when ChatGPT came out,
we knew from before that the technology was very much game-changing,
but it was a signal that there was a strong opportunity for building a small team,
focusing on a different way of distributing the technology.
So we're doing things in a more open-source manner,
which was not the direction that Google at least was taking.
And so we had this opportunity.
Then we left our companies at the beginning of the year and created the team, which started to work on the fifth of June on recreating the entire stack and training our first models.
And if I recall correctly, right before they left, Tim and Guillaume had started to work on Llama over at Meta. Could you describe that project and how it was related to the Chinchilla scaling laws work you'd done?
So Llama was a small-team reproduction of Chinchilla, at least in its approach to parameterization and all of these things. It was one of the first papers that established that you
could go beyond the chinchilla scaling laws. So chinchilla scaling laws tell you what you should be
training if you want to have an optimal model for a certain compute cost at training time. But if you take
into account the fact that your model should also be efficient at inference time, you probably want
to go far beyond the chinchilla scaling law. So it means you want to over-train the model. So train
on more tokens than would be optimal for performance. But the reason why you do that is that you
actually compress models more. And then when you do inference,
you end up having a model which is much more efficient for a certain performance.
So by spending more time during training, you spend less time during inference, and so you save
cost. That was something we observed at Google also, but the Llama paper was the first to establish
it in the open, and it opened a lot of opportunities.
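(As a rough, hypothetical illustration of that trade-off, with made-up model sizes and token counts rather than anything from the episode, assuming training cost of about 6·N·D FLOPs and inference cost of about 2·N FLOPs per generated token:)

```python
# Illustrative break-even calculation: over-train a smaller model past the
# Chinchilla-optimal point so that serving it is cheaper per token.
# Assumes train FLOPs ~ 6 * N * D, inference FLOPs ~ 2 * N per generated token,
# and that the over-trained small model reaches comparable quality.

def train_flops(n_params: float, tokens: float) -> float:
    return 6 * n_params * tokens

def infer_flops_per_token(n_params: float) -> float:
    return 2 * n_params

big_n, big_d = 13e9, 260e9        # hypothetical compute-optimal model (~20 tokens/param)
small_n, small_d = 7e9, 2_000e9   # hypothetical smaller model, heavily over-trained

extra_training = train_flops(small_n, small_d) - train_flops(big_n, big_d)
saving_per_token = infer_flops_per_token(big_n) - infer_flops_per_token(small_n)

print(f"extra training cost:     {extra_training:.2e} FLOPs")
print(f"saving per served token: {saving_per_token:.2e} FLOPs")
print(f"break-even after ~{extra_training / saving_per_token:.1e} generated tokens")
# Beyond the break-even point, every additional token served makes the
# over-trained smaller model the cheaper choice overall.
```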
Yep. I remember both the impact of the Chinchilla scaling laws work on multiple labs,
realizing just how suboptimal their training setups were, and then the subsequent impact of Llama being
dramatic on the industry, in realizing how to be much more efficient about inference.
So let's fast forward to today.
It's December 2023.
We'll get to the role of open source in a bit.
But let's just level set on what you've built so far.
A couple months ago, you released Mistral 7B, which was a best-in-class dense model.
And this week, you're releasing a new mixture of experts model.
So just tell us a little bit more about Mixtral and how it compares to other models.
Yeah, so Mixtral is our new model, based on a technology that wasn't released in open source before.
It's called sparse mixture of experts, and it's quite simple.
You take all of the dense layers of your transformer and you duplicate them.
You call these layers expert layers.
And then what you do is that for each token that you have in your sequence, you have a router mechanism,
so just a very simple network that decides which expert should be looking at which token.
And so you send all of the tokens to their experts.
and then you apply the experts
and you get back the output
and you combine them
and then you go forward in the network
and you have eight experts per layer
and you execute only two of them
so what it means at the end of the day
is that you have a lot of parameters on your model
you have 46 billion parameters
but the thing is that the number of parameters
that you execute is much lower than that
because you only execute two branches out of eight
and so at the end of the day
you only execute 12 billion parameters per token.
And this is what counts for latency and throughput and for performance.
So you have a model which has the cost of a 12-billion-parameter network, but performance that is much higher than what you could get, even by compressing data a lot, with a 12-billion-parameter dense transformer.
Sparse mixture of experts allows you to be much more efficient at inference time and also much
more efficient at training time.
So that's the reason why we chose to develop it very quickly.
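(For readers who want to see the routing mechanism concretely, here is a toy sketch of a sparse mixture-of-experts layer with 8 experts and top-2 routing in plain PyTorch. It illustrates the general technique, not Mixtral's actual implementation; the dimensions, the absence of load balancing, and the per-expert loop are simplifications.)

```python
# Toy sparse mixture-of-experts layer: 8 expert MLPs, a small router picks the
# top 2 per token, and only those 2 experts run. Parameters grow ~8x for these
# layers while per-token compute grows only ~2x, which is the decoupling
# discussed in the episode. Illustrative only; not Mixtral's real code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # scores each token per expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                       # x: (batch, seq, d_model)
        tokens = x.reshape(-1, x.shape[-1])     # (n_tokens, d_model)
        scores = self.router(tokens)                          # (n_tokens, n_experts)
        top_w, top_idx = scores.topk(self.top_k, dim=-1)      # pick 2 experts per token
        top_w = F.softmax(top_w, dim=-1)                      # mixing weights for the 2
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (top_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue  # no token routed to this expert in this batch
            out[token_ids] += top_w[token_ids, slot].unsqueeze(-1) * expert(tokens[token_ids])
        return out.reshape(x.shape)

layer = SparseMoELayer()
print(layer(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```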
Just for folks who are listening who might not be familiar with sort of state-of-the-art
architecture and language models.
Could you just describe the difference between dense models, which have been your primary architecture to date, and mixture of experts?
Intuitively, what are the biggest differences?
So they are very similar, except for what we call the dense layers.
In the dense transformer, you alternate between an attention layer and a dense layer, generally.
That's the idea.
In a sparse mixture of experts, you take the dense layer and you duplicate it several times.
And so that's where you actually increase the number of parameters.
So you increase the capacity of the model without increasing the cost. So that's a way of decoupling what the model can remember, the capacity of the network, from its cost at inference time.
If you had to describe the biggest benefits for developers as a result of that inference efficiency, what are those?
It's cost and latency. Usually that's what you look at when you're a developer. You want something which is cheap and you want something which is fast. But generally speaking, the trade-off is strictly favorable in using Mixtral compared to
using a 12-billion dense model. And the other way to think about it is that if you want to use a model which is as good as Llama 2 70B, you should be using Mixtral, because Mixtral is actually on par with Llama 2 70B while being approximately six times cheaper, or six times faster for the same price.
Could you talk just a little bit about why it's been so challenging for research labs and research teams to really get the mixture of experts architecture right?
For a while now, folks have known that the dense model architectures can be slow, they're expensive,
and they're difficult to scale.
And so for a while, people have been looking for an alternative architecture that could be, like you were saying, cheaper, could be faster, could be more efficient.
But it's taken a while for folks to figure this out.
What were some of the biggest challenges you had to figure out to get the MoE model right?
Well, there are basically two challenges.
The first one is that you need to figure out how to train it correctly from a mathematical perspective.
The other challenge is to train it efficiently, so how to actually use the hardware as efficiently as possible.
You have new challenges coming from the fact that you have tokens flying around from one expert to another; that creates some communication constraints, and you need to make it fast.
And then on top of that, you also have new constraints that apply when you deploy the model to do inference efficiently.
And that's also the reason why we released an open source package based on vLLM, so that the community can take this code and modify it and see how that works.
Well, I'm definitely excited to see what the community does with the MoE.
let's talk about open source, which is an approach and a philosophy that's permeated
all the work you've been doing so far. Why choose to tackle the space with an open source approach?
Well, I guess it's a good question. The answer is that it's partly ideological and partly pragmatic. We have grown with the field of AI: in 2012, we were detecting cats and dogs, and in 2022, we were actually generating text that looked human. So we really made a lot of progress.
And if you look at the reason why we made all of this progress,
most of it is explainable by the free flow of information.
You had academic labs.
You had very big industry-backed labs communicating all the time about the results
and building on top of each other's results.
And that's the way we significantly improved the architectures and training techniques.
We just made everything work as a community.
And all of a sudden in 2020, with GPT-3, this tide reversed,
and companies started to be more opaque
about what they were doing
because they realized
there was actually a very big market
and so they took this approach
and all of a sudden in 2022,
on the important aspects of AI
and on LLMs, which are really the hottest topic
and the most promising one,
and beyond Chinchilla, there was
basically no communication at all,
and that's something that I as a researcher
and Timothée and Guillaume
and all of the people that joined us as well
deeply regretted, because we think
that we're definitely not at the end of the story.
We need to invent new things.
There's no reason to stop now,
because the technology is effectively good,
but not yet working completely well enough.
And so we believe that it's still the case
that we should be allowing the community
to take the models and make them their own.
And that's some ideological reason
why we went into that.
The other reason is that we are talking to developers.
Developers want to modify things and have deep access.
A very good model is a good way of engaging with this community and addressing their needs,
so that the platform we are building is also going to be used by them.
So that's also like a business reason.
Obviously, as a business, we do need to have a valid monetization approach at some point.
But we've seen many businesses build open core approaches and have a very strong open source community
and also a very good offer of services.
And that's what we want to build.
That resonates.
The early days of deep learning were largely driven by a bunch of open collaboration between
researchers from different labs who would often publish all their work and share them at
conferences. Transformers, famously, was published openly to the entire research community,
but that has definitely changed. As a clarifier, do you see a difference between open and open
source as viewed by the community, or are those two things the same in your mind?
Yes. So I think there are several levels of open sourcing in AI. We offer the weights and we offer
the inference code. That's like the end product, which is already super usable. So it's already a very
big step forward compared to closed APIs because you can modify it and you can look at what's
happening under the hood, look at activations and all. So you have interpretability and the possibility
of modifying the model to adapt it to some editorial tone, to adapt it to proprietary data, to
adapt it to some specific instructions, which is something that is actually much harder to do
if you only have access to a closed-source API. And that's something that also goes with our approach
to the technology, which is to say, the pretrained model should be neutral, and we should
then empower our customers to take these models
and just put their editorial approaches,
their instruction, their constitution,
if you want to talk like Anthropic, into the models.
That's the way we approach the technology.
We don't want to pour our own biases into the pre-trained model.
On the other hand, we want to enable the developers
to control exactly how the model behaves
and what kind of biases it has, what kind of biases it doesn't have.
So we really take this modular approach,
and that goes very well with the fact that we release open-weight models.
Could you just ground us in the reality of where these models are today, just to give
people a sense of where in the timeline we are, is open source really a viable competitor to
proprietary closed models, or is there a performance gap?
So Mixtral has similar performance to GPT-3.5. So that's a good grounding. Internally, we have stronger models that are in between 3.5 and 4, that are basically the second or third best models in the world. So really, we think that the gap is closing. The gap is approximately six months at this point. And the reason why it's six months is that it actually goes faster if you do open source things, because you get the community to modify the model and suggest very good ideas that can then be consolidated by us, for instance, and we just go faster because of that.
So it has always been the case that open source ends up going faster, and that's the reason
why the entire internet runs on Linux. I don't see why it would be any different for AI.
Obviously, there are some constraints that are slightly different, because the infrastructure cost to train a model is quite high. It costs a lot of money. But I really think that we'll converge to a setting where you have proprietary models and the open source models are just as good.
Yeah, so let's talk about that a little bit more. How are you seeing people use and innovate on the open source models?
I think we've seen several categories of usage. There are a few companies that know how to strongly fine-tune models to their needs.
So they took Mistral 7B, had a lot of human annotations, had a lot of proprietary data, and just modified Mistral 7B so that it solved their task just as well as GPT-3.5, but for a lower cost and with a higher level of control.
We've also seen, I think, very interesting community efforts in adding capabilities.
So we saw a context length extension to 128K that worked very well.
Again, it was done in the open, so the recipe was available,
and this is something that we were able to consolidate.
We've seen some effort around encoders, image encoders,
to make it a visual language model.
A very actionable thing that we saw is, I think, that Hugging Face first did direct preference optimization on top of Mistral 7B and made a much stronger model than the instruct model we proposed at the early release. And it turned out it's actually
a very good idea to do it. And so that's something that we've consolidated as well. So generally
speaking, the community is super eager to just take the model and add new capabilities, put it
on a laptop, put it on an iPhone. I saw Mistral 7B on a stuffed parrot as well. So fun things,
useful things, but generally speaking it has been super exciting to see the research community take
a hold of our technology. And with Mixtral, which is a new architecture, I think we are also
going to see much more interesting things, because in the interpretability field, and also in the safety
field, as it turns out, you have a lot of things to do when you have deep access to an open
model. And so we're really eager to engage with the community. Safety is an important piece
to talk about. The immediate reaction of a lot of folks is to deem open source less safe
than closed models.
How would you respond to that?
I think we believe that it's actually not the case
for the current generation.
Models that we are using today
are not much more than just a compression
of whatever is available on the internet.
So it does make access to knowledge more fluid,
but this is the story of humanity,
making knowledge access more fluid.
So it's no different than inventing the printing press,
where we apparently had a similar debate.
I wasn't there, but that was the debate we had.
So we are not making the world any less
safe by providing more interactive access to knowledge.
So that's the first thing.
Now, the other thing is that you do have immediate risks of misuse of large language
models.
And you do have them for open source models, but also for closed models.
And so the way you do address these problems and come up with countermeasures is to know
about them.
So you need to know about breaches, basically.
And that's the same way in which you need to know about breaches on operating systems
and on networks.
And so it's no different for AI. Putting models
under the highest level of scrutiny
is the way of knowing how they can be misused,
and it's a way of coming up
with countermeasures.
And I think a good example of that
is that it's actually super easy
to exploit an API,
especially if you have fine-tuning access,
to make GPT-4 behave in a very bad way.
And since it's the case,
and it's always going to be the case,
it's super hard to be adversarially robust.
It means that we're only trusting
the team of large companies
to figure out ways
of addressing these problems.
Whereas if you do open-sourcing,
you trust the community.
The community is much larger.
And so if you look at the history of software in cyber security and operating systems,
that's the way we made the system safe.
And so if we want to make the current AI system safe and then move on to a next generation
that potentially will be even stronger, and then we can have this discussion again,
well, you do need to do open sourcing.
So today, we think that open sourcing is a safe way of developing AI.
Yeah, I think it's not widely understood that when you have thousands or hundreds of thousands of people able to red team models because they're open source, the likelihood that you'll detect biases and built-in breaches and risks is just dramatically higher. And if you were talking to policymakers, how would you advise them? How do you think they should be thinking about regulating open source models, given that the safest way to battle-harden software and tools is often to put them out in the open?
We've been saying
precisely this, that the current technology is not dangerous. On the other hand, it can be misused.
On the other hand, the fact that we are effectively making them stronger means that we need to
monitor what's happening. The best way of empirically monitoring software performances is through
open source. So that's what we've been saying. There's been some effort to try to come up with
very complex governance structure where you would have several companies talking together,
having some safe space, some safe sandbox for a red teamer that would be potentially independent.
So things that are super complex.
But as it turns out, if you look at the history of software,
the only way we did software collaboratively is through open source.
So why change the recipe today, when the technology we're looking at is actually nothing else
than a compression of the internet?
So that's what we've been saying to the regulators generally.
Another thing we've said to the regulators is that they may want to enforce that AI products
need to be safe.
If you have a diagnosis assistant,
you want it to be safe.
Well, in order to monitor and to evaluate
whether it's actually safe,
you need to have some very good tooling.
And the tooling requires having access to LLMs.
And if you only access LLMs through closed-source APIs,
you're a bit in troubled waters,
because it's hard to be independent in that setting.
So we think that independent controllers of product safety
should have access to very strong open-source models
and should own the technology.
And if open source
LLMs were to fail
relative to closed source models,
why would that be?
I guess the regulation burden
is potentially one thing
that could make it harder
to release open source models.
It's also, generally speaking,
a very competitive market.
And I think in order for open source models
to be widely adopted,
they need to be as strong as closed source models.
They have a little advantage
because you do have more control,
and so you can do fine-tuning
and make performance jump a lot
on a specific task, because you have deep access.
But really, at the end of the day,
developers look at performance and latency.
And so that's why we think that as a company,
we need to be very much on the frontier
if we want to be relevant.
Given the complexity of frontier models
and foundation models in these systems,
there are just tons of misconceptions that folks have
about these models.
But let's just take a step back
and look at the battle that's raging between folks pushing for closed source systems versus
open source systems.
What do you think is at stake here?
What do you think the battle is really for?
I think the battle is for the neutrality of the technology.
A technology, by essence, is something neutral.
You can use it for bad purposes.
You can use it for good purposes.
If you look at what the LLM does, it's not really different from a programming language.
It's actually used very much as a programming language by the application makers.
And so there's a strong confusion made between what we call a model and what we call an application.
And so a model is really the programming language of AI application.
So if you talk to all of the startups doing amazing products with generative AI,
they're using LLMs just as a function.
And on top of that, you have a very big system with filters, with decision making, with control flow,
and all of these things.
What you want to regulate, if you want to regulate something, is the system.
The system is the product.
So, for instance, a healthcare diagnosis assistant is an application.
You want it to be non-biased.
You want it to take good decisions, even under high pressure.
So you want its statistical accuracy to be very high.
And so you want to measure that.
And it doesn't matter if it uses a large language model under the hood.
What you want to regulate is the application.
And the issue we had and the issue we're still having now is,
we hear a lot of people saying we should regulate the tech,
so we should regulate the function, the mathematics behind it.
But really, you never use a large language model itself.
You always use it in an application in a way with a user interface.
And so that's the one thing you want to regulate.
And what it means is that companies like us,
like foundational model companies,
will obviously make the model as controllable as possible
so that the applications on top of it can be compliant, can be safe.
We'll also build the tools that allow you to measure the compliance and the safety of the application
because that's super useful for the application makers.
It's actually needed.
But there's no point in regulating something that is neutral in itself, that is just a mathematical tool.
I think that's the one thing that we've been hammering a lot, which is good.
But there's still a lot of effort, I guess, in making this strong distinction,
which is super important to understand what's going on.
Regulating apps, not math, seems like the right direction that a lot of folks who understand
the inner workings of these models and how they're actually implemented in reality are advocating
for. What do you think is the best way to clear up this misconception for folks who maybe don't
have technical backgrounds, don't actually understand how the foundation models work and how the
scaling laws work? So I've been using a lot of metaphors to make it understood, but large language
models are like programming languages, and so you don't regulate programming languages.
You regulate malware.
You ban malware.
We've also been actively vocal about the fact that pre-market conditions like flops,
the number of flops that you do to create a model, is definitely not the right way of
measuring the performance of a model.
We're very much in favor of having very strong evaluations.
As I've said, this is something that we want to provide to our customers, the ability to
evaluate our models in their application.
And so I think this is a very strong thing that we've been stressing.
We want to provide the tools for application makers to be compliant.
That's something we have been saying.
And so we find it a bit unfortunate that we haven't been heard everywhere
and that there's still a big focus on the tech,
probably because things are not completely well understood
because it's a very complex field and it's also a very fast-moving field.
But eventually I think I'm very optimistic that we'll find a way
to continue innovating while having safe products,
but also high level of competition on the foundational model layer.
Let's channel your optimism a little bit.
There are very few people who have the ground-level understanding of scaling laws like
you, Guillaume, and Tim and your team. When you step back and look at the entire space
of language modeling, in addition to open source, what are the key differentiators that
you see in the next wave of cutting-edge models? Things like self-play, process reward
models, the uses of synthetic data.
If you had to conjecture, what do you think some of the most exciting or important breakthroughs will be in the field going forward?
I guess it's good to start with diagnosis.
So what is not working that well?
So reasoning is not working that well.
And it's super inefficient to train a model.
If you compare the training process of a large language model to the brain, you have a factor of, I think, 100,000.
So really, there's some progress to be made in terms of data efficiency.
So I think the frontier is increasing data efficiency, increasing reasoning capabilities.
So adaptive compute is one way.
And to increase data efficiency, you do need to work on coming up with very high quality data.
Many new techniques still need to be invented.
But that's really where the bottleneck is.
Data is the one important thing.
And the ability of the model to decide how much compute it wants to allocate to a certain
problem is definitely on the frontier as well.
So these are things that we're actively looking at.
You know, this is a raging debate, right?
And we've talked about this a few times before, which is, can models actually reason today?
Do they actually generalize out of distribution?
What's your take on it?
And what would convince you that models are actually capable of multi-step, complex reasoning?
Yeah, it's very hard, because you train on the entirety of human knowledge,
and so you have a lot of reasoning traces in there.
So it's hard to say whether they reason or not, or whether they do retrieval of reasoning
and it just looks like reasoning.
I guess at the end of the day,
what matters is whether it works or not,
and on many simple reasoning tasks it does.
So we can call it reasoning.
It doesn't really matter if they reason like we do.
We don't even know how we reason.
So we are not going to know about how machines reason anytime soon.
Yeah, it's a raging debate.
The way you do evaluate that is to try to be as out of distribution as possible, like working on mathematics.
It's not something I've ever done,
but it's something that Timothée and Guillaume are very sensitive to,
because they've been doing it for a while when they were at Meta.
That's probably one way of measuring whether you have a very good model or not.
And actually, we're starting to see some very good mathematicians, I'm thinking of Terence Tao, that are using large language models for some things.
Obviously, not the high-level reasoning, but for some part of their proofs.
And so I think we will move up into abstraction.
And the question is, where does that stop?
We do need to find new paradigms, and we're actually looking for them.
And we've talked a lot about developers so far.
If you had to channel your product view and just conjecture on these advances in scaling laws,
in representation learning, in teaching the models to reason, faster, better, cheaper:
what will these advances mean for end users in terms of how they consume, how they program,
and how they generally work with models?
What we think is that, fast-forward five years, everybody will be using their own specialized models
within parts of complex applications and systems.
And so for all of the software stack, developers will be looking at latency.
So they will want to have, for any specific task of the system,
they will want to have the lowest cost and lowest latency.
And the way you make that happen is that you will ask for the task,
ask for user preferences, ask for what you want the model to do,
and you try to make the models as small as possible,
and as suited to the task as possible.
And so I think that's the way we'll be evolving on the developer space.
I also think that generally speaking, the fact that we have access to large language models
is going to completely reform the way we interact with machines,
and the internet of five years from now is going to be much different, so much more interactive.
This is already unlocked.
It's just about making very good applications with very fast systems, with very fast models.
So, yeah, very exciting times ahead.
So what would those interaction modalities look like?
Instead of just navigating the internet, you will be asking questions and discussing with machines.
You will probably have a large language model
doing some form of reasoning under the hood,
but looking at your intention
and figuring out how it can address your needs.
So it's going to be much more interactive.
It's going to be much closer to human-like conversation
because as it turns out,
the best way to interact with something
and find knowledge is to have a real discussion.
We haven't found a better way to transmit information.
So I expect that in five years' time
we'll be talking a bit more to machines
than to the internet,
and every content provider will need to adapt to these new paradigms.
I think there's a lot of space for high-quality content
to be well-identified as human-created or human-edited
and for generative AI to help a user navigate through that knowledge.
And generally speaking, I think the access to knowledge
and the enrichment of what we know is going to be much better in the next five years.
What do you think changes the most when we go from interacting
with one large frontier model
to interacting maybe with a team
of small models that are working together,
like a swarm of Mistrals?
Yeah, so that's very interesting.
And I think in games, for instance,
it's going to be fascinating.
We've seen some very good applications.
You do need to have small models,
because you want to have swarms of them,
and it starts to be a bit costly
if they're too big.
But having them interact is just going to make
pretty complex systems,
and interesting systems to observe
and use. So we have a few
friends making applications
in the enterprise space, with different
personas playing different roles,
relying on the same language model,
but with different prompts and different fine-tuning.
And I think that's going to be quite interesting as well.
As I've said, complex applications in three years' time
are just going to use different LLMs for different parts,
and that's going to be quite exciting.
Well, what's your call to action?
To builders, researchers, folks who are excited about the space,
what would you ask them to do?
I would take Mistral Models and try to build amazing applications.
It's not that hard.
The stack is starting to be pretty clear, pretty efficient.
You only need a couple of GPUs.
You can even do it on your MacBook Pro if you want.
It's going to be a bit hot, but it's good enough to do interesting applications.
Really, the way we do software today is very different from the way we did it last year.
And so I'm really calling application makers to action because we are going to try to enable them to build as fast as possible.
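(As one concrete starting point, here is a minimal, assumed example using the Hugging Face transformers library and the public mistralai/Mistral-7B-Instruct-v0.1 checkpoint; it is not an official Mistral recipe, and the prompt and settings are placeholders.)

```python
# Minimal sketch: load Mistral 7B Instruct with the transformers library and generate.
# Assumes the public Hugging Face checkpoint, the accelerate package for device_map,
# and a GPU with roughly 15 GB of memory for bfloat16 weights; quantized variants
# fit on smaller hardware.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "[INST] Write a haiku about open-source AI. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```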
On that note, thank you so much for finding the time to talk with us today.
And we'll put a link to the Mixtral model in the show notes
so people can go find it and play around with it.
If you like this episode, if you made it this far,
help us grow the show.
Share with a friend, or if you're feeling really ambitious,
you can leave us a review at ratethispodcast.com slash a16z.
You know, candidly, producing a podcast can sometimes feel like
you're just talking into a void.
And so if you did like this episode,
if you liked any of our episodes,
please let us know.
We'll see you next time.