No Priors: Artificial Intelligence | Technology | Startups - Mistral 7B and the Open Source Revolution With Arthur Mensch, CEO Mistral AI
Episode Date: November 9, 2023

Open source fuels the engine of innovation, according to Arthur Mensch, CEO and co-founder of Mistral AI. Mistral is a French AI company which recently made a splash by releasing Mistral 7B, the most powerful language model for its size to date, outperforming much larger models. Sarah Guo and Elad Gil sit down with Arthur to discuss why open source could win the AI wars, their $100M+ seed financing, the true nature of scaling laws, why he started his company in France, and what Mistral is building next.

Arthur Mensch is Chief Executive Officer and co-founder of Mistral AI. A graduate of École Polytechnique and Télécom Paris and holder of the Master Mathématiques Vision Apprentissage at Paris-Saclay, he completed his thesis in machine learning for functional brain imaging at Inria (Parietal team). He spent two years as a post-doctoral fellow in the Applied Mathematics department at ENS Ulm, where he carried out work in mathematics for optimization and machine learning. In 2020, he joined DeepMind as a researcher, working on large language models, before leaving in 2023 to co-found Mistral AI with Guillaume Lample and Timothée Lacroix.

Show Links: Arthur's LinkedIn | Mistral | Mistral 7B | Retro: Improving language models by retrieving from trillions of tokens | Chinchilla: Training Compute-Optimal Large Language Models

Sign up for new podcasts every week. Email feedback to show@no-priors.com. Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @ArthurMensch

Show Notes: (0:00) - Why he co-founded Mistral (4:22) - Chinchilla and Proportionality (6:16) - Mistral 7B (9:17) - Data and Annotations (10:33) - Open Source Ecosystem (17:36) - Proposed Compute and Scale Limits (19:58) - Threat of Bioweapons (23:08) - Guardrails and Safety (29:46) - Mistral Platform (31:31) - French and European AI Startups
Transcript
Open source AI models have completely changed the landscape of technology over the past year.
One tiny team of ex-DeepMind and Meta researchers in France has made a huge splash recently: Mistral.
This week, Elad and I are joined by Arthur Mensch, the CEO and co-founder of Mistral,
who recently released Mistral 7B, an Apache 2.0-licensed open-source model that has changed people's mental models about what can be done
with small models. Arthur, welcome to No Priors.
Thank you for inviting me. I'm very glad to be here.
Okay, so just six months ago, when we met, you were leaving DeepMind to start Mistral.
It takes real guts to look at the scale of dollars and compute that OpenAI and Google and
others have amassed and say, like, we want to play in this game too, and it's important we do.
Tell us about the inspiration to start.
Guillaume, Timothée and I were, I guess, pretty early in the field; we had been doing machine learning for 10 years.
And we knew where to start from and how to make a good model with a limited amount of compute and money.
Well, not so limited, but at least more limited than where we were coming from.
And so I think that's why we got started.
The various companies we were in moved in directions that we hadn't anticipated when we joined them.
And we decided that there was a very good opportunity to create something that would be a standalone company in Europe, focused on making AI better, on building frontier AI, and on open-source AI as a core value.
Maybe we can talk about each of those pieces.
So 10 years in machine learning before; you were a co-author on the Chinchilla scaling laws paper.
You worked on the sort of mixture-of-experts ideas early.
Can you talk a little bit about what your research directions were at DeepMind?
Yeah.
So I come from an optimization background.
So my focus has always been for the last 10 years to make algorithms more efficient
and to make better use of the data we have to build models with good predictive performance.
And so when I arrived at DeepMind, I joined the LLM team, which was
10 people at the time. And very quickly, I started to work on retrieval-augmented models,
with a paper called Retro that I co-authored with my friend Seb Borgeaud, who is still at DeepMind.
The point there was to use very large databases during pre-training so that we didn't force
knowledge into the model itself. And we would tell the model that it would have access to
an external memory anyway. And so it was working quite well. We could actually lower the
perplexity, let's say. That's what you work on when you make LLMs. There were some limitations
that I think the community has started to address quite well. And that was at the time when
retrieval methods weren't really mainstream, now they've become completely mainstream. So that's
the first project I did. I also worked on sparse mixtures of experts quite quickly, because
that was related to the topic of my postdoc, which was optimal transport. Optimal transport
is a setting where you have, I guess, tokens, and you need to assign them to devices,
and you need to make sure that there's some good assignment between the two
so that the devices don't see too many tokens. And as it turns out, optimal transport
is the mathematical framework to do that assignment correctly. And so I started
to work on introducing this into sparse mixtures of experts. And very quickly, we started to move
on to scaling laws.
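To make the token-to-expert assignment he describes a bit more concrete, here is a minimal sketch of balanced routing with a few Sinkhorn (optimal transport) iterations. It is purely illustrative, using a toy NumPy setup; it is not Mistral's or DeepMind's actual routing code, and the function names are made up for the example.

```python
# Illustrative sketch of optimal-transport-style routing for a sparse mixture of
# experts: a few Sinkhorn iterations push the token-to-expert assignment toward
# equal load per expert. Toy NumPy setup, not actual production routing code.
import numpy as np

def sinkhorn_routing(router_logits, n_iters=10, temperature=1.0):
    """router_logits: (n_tokens, n_experts). Returns a soft assignment matrix
    whose rows each distribute one unit of token mass and whose columns are
    pushed toward an equal share of the total load."""
    n_tokens, n_experts = router_logits.shape
    plan = np.exp(router_logits / temperature)
    for _ in range(n_iters):
        plan /= plan.sum(axis=1, keepdims=True)   # each token sends 1 unit in total
        plan /= plan.sum(axis=0, keepdims=True)   # normalize expert loads...
        plan *= n_tokens / n_experts              # ...then rescale to n_tokens/n_experts each
    return plan

rng = np.random.default_rng(0)
assignment = sinkhorn_routing(rng.normal(size=(16, 4)))  # 16 tokens, 4 experts
print(assignment.sum(axis=0))                            # roughly 4 units of load per expert
```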
So how do you actually take the method that is working at a certain scale and try to predict
how that will evolve with the scale, the number of experts, the amount of data you see?
And so that's work I've done with many colleagues as well, on how you adapt the scaling
laws for dense models to a setting where you want to predict the performance
not only in relation to the size of the model, but also to the number of experts.
I guess that was the second thing I worked on, and then, connectedly, I worked on Chinchilla,
which is, I think, a major paper in the history of LLMs, also with Seb, Jordan, Laurent, and many other people.
Basically, the story was that everybody was training models on too few tokens because of the paper from 2020,
that happened to be not very well executed.
And so what we observed is that you could actually correct that.
And so instead of training very large models on very few tokens,
you should actually grow the number of tokens as you grow the size of the model,
which if you think about it makes a lot of sense
because you don't want to have infinite size model looking at a finite number of tokens.
And similarly, you don't want to have a finite size model looking at an infinite number of tokens.
There must be some proportionality. Yeah. Yeah, exactly. And that's something
that we showed empirically. And I think that's something that was adopted very fast because
it was like a pure win. For the same amount of compute, you would get a model that would be
better, but also a model that will be four times cheaper to serve. So that was definitely
a gain. And as it turns out, we didn't go far enough. That's what we did at Mistral. We
realized that there was also a lot of opportunity in actually compressing models more.
I mean, we've seen with Llama that it was actually possible.
What we showed in Mistral 7B is that we were definitely far away from the limit of compression.
We somehow corrected that by making a model very small, super cheap to serve, super fast, running on your MacBook Pro, but still good enough to be useful.
And so that's one of the first achievements we made in the company.
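As a rough illustration of the proportionality he describes, here is a back-of-envelope sketch using commonly cited Chinchilla rules of thumb (roughly 20 training tokens per parameter, and about 6·N·D training FLOPs). The constants are approximations for illustration, not Mistral's numbers; training a small model on far more tokens than this rule suggests is exactly the compression trade-off he mentions.

```python
# Back-of-envelope Chinchilla-style proportionality: the compute-optimal token
# count grows linearly with parameter count. The 20 tokens/parameter ratio and
# the C ~= 6*N*D FLOP rule are common approximations, used here only to illustrate.
def compute_optimal(n_params):
    tokens = 20 * n_params           # approximate compute-optimal token count
    flops = 6 * n_params * tokens    # standard training-FLOPs approximation
    return tokens, flops

for n in (7e9, 70e9):
    tokens, flops = compute_optimal(n)
    print(f"{n/1e9:.0f}B params -> ~{tokens/1e12:.2f}T tokens, ~{flops:.1e} FLOPs")
```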
Yeah, I think a lot of people were really impressed when Mistral came out with the 7B model because, A, you did it very quickly.
You know, you did it in a matter of a few months, but B, if you look at the cost of actually running these models,
obviously there's the training side of it in terms of actually building the model, but then there's
inference. And so much of the cost, day to day, if you're a user of OpenAI or a user of Llama
or a user of Mistral's models, is how much it costs to actually run the models, the inference. And that's often
driven by the size of the model. And so I think by coming out with these small models that were
very performant, it really made a huge change to how people thought about what was possible.
Is there anything that you can share in terms of where you think this performance is going to go
or how you think about the sizes of models, both in terms of smaller, more performant models,
as well as, do you folks plan to build something very large, more like a GPT-4 or GPT-5 equivalent over time?
Yeah, sure.
So I think what we've seen is that in 2021 and 2022, you had a few companies that were really
focusing on pushing the performance of models.
And if you want to push the performance, the pure performance of models, you don't
care about inference because you're not going to use the model, you're just going to see whether
they're good or not. And that's really for scientific purposes. But then when you start thinking
about deployment and enabling downstream applications, then you need to think about what it is going
to cost in runtime. So you're not only worried about the upfront payment you need to make to get
the model, but you're also worried about the runtime. And so I think the coefficient that you put
between the inference cost and the training cost is really business dependent.
And as a company that intends to have a valid business model,
we think a lot about inference costs.
We think that it's super important to get to a regime where inference is super cheap
so that you can run agents, so that you can basically use LLMs everywhere, for all of your use cases,
and you're not blocked by cost, which is the case for the largest models currently.
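For a sense of why model size dominates serving cost, here is a rough single-stream decoding estimate under the common assumption that generation is memory-bandwidth bound. The bandwidth and precision figures are assumptions for illustration, not benchmarks of any particular machine.

```python
# Rough single-stream decoding estimate: if generation is memory-bandwidth bound,
# each new token requires streaming the model weights once, so throughput scales
# inversely with model size. The figures below are illustrative assumptions.
def tokens_per_second(n_params, bytes_per_param, bandwidth_gb_s):
    weight_bytes = n_params * bytes_per_param
    return bandwidth_gb_s * 1e9 / weight_bytes

print(tokens_per_second(7e9, 2, 400))    # ~29 tok/s: 7B in 16-bit on ~400 GB/s laptop memory
print(tokens_per_second(70e9, 2, 400))   # ~2.9 tok/s: a 10x larger model is ~10x slower
```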
So that's definitely something we had in mind,
and we knew that we could make a 7-billion-parameter model very good.
That's for sure.
This is definitely not the end of the story.
Now, the question is, do we train bigger models?
And the answer is obviously yes.
There's still a limit to what a certain model size can do.
This limit was, I think, underestimated.
But if you want to get to more reasoning capabilities,
you do need to move into larger models.
The other thing about moving to larger models is
that it enables you to train smaller models that are better,
through a variety of techniques like distillation or synthetic data generation.
So these two things are quite related.
If you want to make very strong, small models, you do need to have bigger models.
And we are indeed training larger models for sure.
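A minimal sketch of the distillation route he mentions, in which a small student is trained to match a larger teacher's softened output distribution. The toy random logits stand in for real model outputs; this illustrates the general technique, not Mistral's training recipe.

```python
# Minimal knowledge-distillation sketch: a small "student" is trained to match the
# softened output distribution of a larger "teacher". Toy random logits stand in
# for real model outputs; this illustrates the idea, not any particular recipe.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions, then penalize KL(teacher || student).
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2

student_logits = torch.randn(4, 32000, requires_grad=True)  # 4 positions, toy vocabulary
teacher_logits = torch.randn(4, 32000)                       # would come from the larger model
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()                                              # gradients flow to the student only
print(loss.item())
```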
Can you tell us about your approach to data and annotations?
Because we kind of talked about the other two dimensions.
Yeah, so we've talked about compute, and obviously data
is super critical.
So we work from the open web.
Well, we do a lot of work.
I think we do a good job at getting some good data.
The data quality is really what makes the model good.
I mean, data and algorithms, obviously, but data is super important.
We put a lot of focus on that.
And I think we do have a very good data set, that's for sure.
Data annotation is, I guess, another topic.
It's not related to pre-training.
When you pre-train a model, you really want to have the purest knowledge,
the purest quality of data.
When you want to align your model and instruct it, to ask it to follow instructions,
which is useful for many use cases because it makes it steerable,
you do need to have a certain amount of
human-produced annotations, or potentially machine-produced annotations.
And so that's something that we started working on.
We're not the top experts in the world in making
good instruction fine-tuned models. We're definitely ramping up, and the team is getting better and
better at that. One of the things we haven't talked about is the fact that you guys are an open-source
company, which is very, very different from the other sort of labs working at the state of the
art today. Why is that important? So if you look at, if you look back at the history of
machine learning in the last 10 years, it went very fast. I mean, we went from a
poor cat-and-dog detector to something that basically looks human-intelligent.
And it's useful to remember how that happened.
It happened because you had many academic labs.
You had many industrial labs actually spending more money on different problems.
And there was like full communication, almost full transparency until 2020.
Like whatever was done in whatever lab, even in a competing lab, was actually published
at NeurIPS, was published at ICML, and every six months we would all gather and get new
ideas. Ideas would circulate and everybody would build on top of the work of others. And that's
the way we went from something, well, potentially interesting to something very interesting.
But then the issue is that around 2020, some companies started to be quite ahead in some
fields and realized that some value could be accrued. And then at that point,
opacity made it back into the field.
And I think that's a cycle we've observed in software already,
a cycle between openness and closeness.
We are observing it again.
We think that it's too early,
and we think it's really damaging for the science
to actually move into such an opaque regime,
where you have a few companies basically doing the same thing,
just not communicating about it,
spending billions on compute doing exactly the same thing,
And where really the technology we're looking at is not working completely yet.
So still, it doesn't reason well.
Memory mechanisms are not very well understood.
Causality mechanisms are not well understood.
It's not super steerable.
There's a lot of biases.
I mean, it's incomplete.
There's many things to be done.
We still need to invent new techniques.
And how are you going to invent new techniques if nobody is speaking about it?
When, in order to invent new techniques, you still need to spend
some large amount of money to actually try things at scale,
and the few companies that have the money to spend
now refuse to communicate.
That's something that we deeply regretted,
and that's something that we are trying to change,
because we do have some substantial amount of money
to actually spend on compute.
We do have some good ideas.
We know that there's a big community
that is waiting for AI players, for open-source AI players, to appear,
and we're very grateful that Meta is moving in that direction.
By doing what we do, by being much more open about the technology we create,
we want to steer the community into a regime where things just work better,
where things are safer because they are put under more scrutiny.
And really our intention there is to take that position and to change the rules of the game
because we don't think that this is moving into a proper direction.
Yeah, it's very interesting because if you look at the current discourse,
the really big tech companies are claiming that open-source AI is dangerous,
and it feels like really a form of regulatory capture, right?
They want regulators to step in so that they can constrain innovation and kind of control an
industry. And, you know, the reality is if you look at things like global health, global
equity, open source is one of the biggest potential ways for all of humanity to benefit from
this technology in a way that's transparent and open and people can really understand and
see. How do you approach safety and policy and thinking about, you know, the right way
to think about safety in the context of open source?
So I think we approach it from a very pragmatic point of view.
So the question is: is open-sourcing, today, a model like the ones we make,
a dangerous thing?
Is it actually enabling bad actors to misuse the model?
Is it giving them marginal capacity, like extra marginal capacities in pursuing their bad endeavours?
I think the answer to this question is no. That's my conclusion. We've seen, well, a lot of ideas around bioweapons,
around, I don't know, nuclear terrorism and the like. And it's very interesting, because if you
actually assume good faith in these arguments, and I think in many cases people are in good
faith, and that's an assumption that we always make, and if you try to go down their arguments,
which is what we did, you realize that there is really nothing to them.
Nothing shows that an LLM is actually marginally better than a search engine for finding knowledge on topics that would enable bad uses.
And the other thing is that it's not even proven, and it's very likely not the case, that knowledge is the bottleneck for the actual misuse that we're talking about.
So we have two things.
In order to demonstrate that open sourcing large language models is actually unsafe,
you need to demonstrate that they provide a marginal improvement over a web search engine,
and that knowledge is the bottleneck for creation.
And in both cases, the answer to these two questions is no.
And so that means that we believe that we can open source models today,
and that actually it's the best way of putting things under the highest scrutiny
so that we are ready for potential new generations of models that could be super intelligent.
And in that case, I think we can rediscuss these premises.
But today, we're really talking about the compression of knowledge that is widely available on the web.
And so, marginally speaking, we're not creating anything that is more dangerous than before.
So I think there's really a trade-off there;
there's a dynamic conversation to be had.
That's what we discussed at the AI Safety Summit.
For sure, this needs to be revisited as model capabilities build up.
But today, banning open source, preventing it from happening, is really a way,
well, to enforce regulatory capture, even if the actors that would benefit from it don't
want it to happen.
But by design, if you actually ban small actors from doing things in the most
efficient way, which is open source, you do facilitate the life of the larger incumbents.
And that's something that would be, I guess, detrimental to Mistral's life, for sure.
What do you make of the arbitrary sort of compute and scale limits proposed?
That's interesting. I don't exactly know how they came up with this threshold.
It's a high threshold by any standard, because if you compute it, if you make the bad-faith
assumption that this is float64, it comes out to approximately $300 million of compute for a
single run. So that's high. That's not something we can even afford, and that we won't
be able to afford for the coming years. So it's high. And it's very arbitrary, because who tells
you that beyond 10 to the 26 you end up with bad capacities, that models start
to see the emergence of bad behaviors? That's definitely
not proven. Relating capabilities to scale is also very approximate, in the sense that it really
depends on the data. The data set is super important. I mean, there's a focus on bioweapons.
So let's say we want to prevent models from generating chemical compounds, because we think
it's an enabler of bad behaviors, which, as I've said, we don't think is the case. But if you do
want to prevent that, well, you do need to adapt your compute flop budget to the data set that
you are working on. As it turns out, that's what they did, because they actually made a specific
flop budget for biology, I think. So you can see the bioweapon narrative building on. But this is
completely arbitrary. We should really focus on capabilities and not pre-market conditions. And I think
that's, I mean, to some extent, there's still consensus around that. So everybody knows that
it's imperfect. It's a proxy, which is maybe fairly correlated. But we definitely need to
agree on how we measure capabilities, and agree on what capabilities we deem
dangerous. And I think we don't agree with one another on that topic. But these really
should be the judge, and obviously not pre-market conditions or the number of flops that you use.
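For a sense of where an estimate like the one Arthur gestures at comes from, here is an order-of-magnitude sketch of the cost of a 1e26-FLOP training run. The throughput, utilization, and price numbers are assumptions chosen only for illustration; realistic estimates vary by a large factor depending on hardware, precision, and efficiency, which is part of why such thresholds feel arbitrary.

```python
# Order-of-magnitude sketch of what a 1e26-FLOP training run might cost. The
# per-GPU throughput, utilization, and hourly price are illustrative assumptions,
# not quotes; real estimates vary by a large factor.
def training_run_cost(total_flops, flops_per_gpu_s, utilization, dollars_per_gpu_hour):
    gpu_seconds = total_flops / (flops_per_gpu_s * utilization)
    return gpu_seconds / 3600 * dollars_per_gpu_hour

# Assuming ~1e15 dense FLOP/s per accelerator, 40% utilization, $2 per GPU-hour:
print(f"${training_run_cost(1e26, 1e15, 0.4, 2.0):,.0f}")   # ~$139 million under these assumptions
```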
Do you know why there's such a focus on bioweapons? I ask this as someone who worked for almost a decade as a
biologist. And, you know, when I look at some of the complexity of building viruses, or some of
the complexity of actually doing these things, I'm surprised that there's so much focus in the
community on that specific example. Do you have any sense of the origins of why people keep
bringing that up? Because it's actually hard to translate; it's not some digital thing
that you manipulate. I think it's a very interesting question. Honestly, I don't have the answer.
it's almost epistemology at that point.
So how did this idea appear
and how did it get amplified by the policy people,
and how did it actually end up being heard
by national security?
And I think it started somehow
with GPT-4, in an annex or something,
like on page 46,
I'm not exactly sure of these numbers,
but where they generated some chemical compound,
and then they added a small
remark saying that, okay, maybe that's not the direction we want to take.
We don't want to have a model that reasons about chemical compounds.
The chemical compound in question wasn't dangerous, but they made this observation,
which is definitely expected.
If you train on biology articles, you're definitely going to be able to produce some
chemical compounds.
So, an expected observation.
And then somehow people built things on top of it.
Nothing was observed.
No scientific study in proper form was published.
But then policy papers started to cite non-scientific papers, arguing that these were
scientific evidence, that the bioweapon narrative was actually true.
And then policy papers started to cite the other policy papers.
And all of a sudden you end up with like 50 papers saying that, for sure, bioweapons are going
to blow us up.
And this is what the policymakers read at the end.
I think that's how we ended up where we are today.
So there's some deconstruction to be made.
Unfortunately, I think the open source community hasn't been vocal enough because they
didn't see it coming.
But right now, this is changing and I'm very glad that it is.
I think it's mimetic.
You have the factor of, like, the world just went through the COVID-19 pandemic.
Yeah, I mean, the COVID trauma for sure played a role in that narrative.
I mean, that's definitely a trauma.
30 million people died.
That's definitely something we don't want to see recur.
I don't think AI is going to be the one triggering the next pandemic.
It's always going to be climate change.
That's the way it was, and that's probably where the focus should be,
instead of focusing on hypothetical, unproven biological risks induced by
token generators.
If bioweapons are not a pragmatic concern in the visible future, there are real concerns around
guardrails, about what we want our AI models to actually generate.
Like, how do you think about that?
Yeah.
So I think this is a very valid concern.
Models can output any kind of text.
And in many cases, you don't want it to output any kind of text.
So when you build an application, you need to think about the guardrails you want to put on the model output, and potentially also on the input.
So you do need to have a system that filters inputs that are not valid or that you deem illegal, and outputs that are not valid or that you deem illegal.
So the way you do it, in our mind, is that you create a modular architecture that the application maker can use, which means you provide the raw model,
so the model that hasn't been altered to ban some of its output space.
And then you propose new filters on top of that that can detect the output that we don't want.
So it can be pornography, it can be hateful speech.
These things you want to ban when you have a chatbot, for instance.
But these things, you don't want to ban from the raw model,
because if you want to use the raw model to do moderation, for instance,
you want your model to know about this stuff.
So really assuming that the model should be well-behaved is, I think, a wrong assumption.
You need to make the assumption that the model should know everything.
And then on top of that, have some modules that moderate and guardrail the model.
So that's the way we approach it.
And it's a way of empowering the application maker in making a well-guarded application.
And we think that it's our responsibility to make very good modules that allow guardrailing the model correctly.
It's part of the platform, and we think the way it should work is that there should be some
healthy competition in that domain, with different startups working on guardrailing the models.
And the way you make this healthy competition is not by trusting a couple of companies to do
their own safety; it's rather to ask application makers to comply
with some rules. So a chatbot should not output hateful speech. And so that means that now the
application makers need to find a good guardrailing solution. And now you have a competition
where there's some economic interest in providing the best guardrailing solution. And so that's
the way we think the ecosystem should work. And that's the way we position ourselves. That's the way
we build the platform with modular filters and modular mechanisms to control the model.
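A minimal sketch of the modular architecture he describes: the raw model stays unaltered, and the application maker composes input and output filters around it. The model callable and the `violates_policy` check are placeholders invented for the example; in a real system the filter would itself be a moderation model or service.

```python
# Sketch of the modular guardrail pattern described above: the raw model is left
# untouched, and the application maker wraps it with input and output filters.
# `raw_model` and `violates_policy` are placeholders, not a real API.
def violates_policy(text: str) -> bool:
    # Placeholder check; a real deployment would call a moderation classifier here.
    blocked_terms = ("example_slur", "example_banned_topic")
    return any(term in text.lower() for term in blocked_terms)

def guarded_generate(raw_model, prompt: str) -> str:
    if violates_policy(prompt):
        return "Request refused by the input filter."
    completion = raw_model(prompt)
    if violates_policy(completion):
        return "Response withheld by the output filter."
    return completion

# Usage with a stand-in model:
print(guarded_generate(lambda p: "Mistral 7B is an Apache 2.0-licensed model.",
                       "Tell me about Mistral 7B."))
```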
It's great that you folks are being so thoughtful about that. I think when people talk about
safety, they really talk about three topics, and sometimes they talk past each other. One is this
sort of moderation or text-based risk. And so that's the risk of hateful content, illegal content,
bias, et cetera. There's a second class, which we talked about already, which is physical risk.
It's things like bioweaponry or the ability of AI to help derail a train or interfere somehow.
And then third is like existential or species risk.
And that's when people start talking about AGI and new forms of life and, you know.
Resource competition.
Resource competition or aggregation or things like that.
So first of all, I think it's very important to address these three things separately, and to hammer home that solutions exist for the first one.
For the second one, there's no evidence that it actually exists at this point, and no evidence that it will exist in
the near future. The third point, I think, is very philosophical. Obviously, if you can make
a system of arbitrary complexity, it can start doing anything that you don't want it to do.
We are not at a stage where the model has arbitrary complexity. And so this is very abstract to me.
I think that, still, I mean, we'll move on to a world with agents and AIs interacting with one
another, and that will create a lot of complexity. But anticipating
that this complexity will necessarily lead to a collapse,
we call it a collapse in machine learning when suddenly everything stops working
because, I don't know, you fall into a local minimum,
well, it's unclear to me that complexity leads to a collapse.
Usually complexity leads to doing nothing because there's no self-organization
and no willpower to build something.
So I'm not too worried about existential risk.
Obviously, this is a dynamic conversation.
If we can make a model that is increasingly intelligent,
then maybe you're at a singularity level.
There's no evidence whatsoever that we are on the way to doing that,
to making that happen.
So I think it's a very open conversation we should have.
I personally don't see the scientific evidence,
and as a scientist, I trust only what I can see.
Sure.
And then I guess you mentioned agents,
which I think is an area of a lot of activity right now.
It feels like a number of things that are related to agents are still a little ways out in the future.
In other words, it feels like an area with enormous promise, but it's still quite early.
Are there any big technological innovations or things that you're working on that you think will really help expedite a world that moves more toward agent-based use in a broader sense?
I think making models smaller is definitely a way to make agents work, because one problem you have with agents is that, if you run an agent on GPT-4,
you're going to run out of money very quickly.
And so if you divide the cost of compute by a hundred, well,
you can start
to build more interesting things.
What we see with agents is mode collapse.
So not very interesting mode collapse.
They start repeating themselves and they fall into loops.
So definitely there's some research to be made there.
There's some research to be made on making models more
capable of reasoning and making them more capable
of adapting the amount of compute they spend
to the difficulty of the task.
And this can somehow be solved with agents.
So it's definitely an avenue of research that we're exploring.
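As a toy illustration of the mode-collapse failure he describes, here is a sketch of an agent loop that stops when recent actions start repeating. A real agent framework tracks much richer state; only the loop-detection idea matters here, and the names are hypothetical.

```python
# Toy illustration of the agent "mode collapse" failure described above: stop the
# loop once the agent's recent actions start repeating. A real agent framework
# would track richer state; only the loop-detection idea matters here.
from collections import deque

def run_agent(step_fn, max_steps=50, window=6):
    recent = deque(maxlen=window)
    for _ in range(max_steps):
        action = step_fn(list(recent))      # the agent proposes its next action
        if recent and list(recent).count(action) >= window // 2:
            return f"stopped early: agent is looping on {action!r}"
        recent.append(action)
    return "finished without detecting a loop"

# A degenerate agent that always proposes the same action triggers the check:
print(run_agent(lambda history: "search('mistral 7b')"))
```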
Yeah.
Going back to Mistral, you know,
one of the things you've talked about a little bit is the platform
that you've been building around the models that you train.
Can you tell us a little bit more about that and some of the directions that's heading in?
Yeah.
So we know that hosting models isn't easy.
We know there's a lot of work to be done on the inference side to make serving efficient.
There's a lot of work to be done on the training side
because you do need to come up with architectures
that are memory efficient, for instance.
That's something Mistral 7B is good at,
because it has a sparse attention mechanism
that makes it more memory-efficient.
So there's some work that you can do on the training side,
but in order to reap all the benefits of a good model,
you do need to work a lot on the inference part.
And so we are actively working on that part
to make it as efficient as possible
to build a platform that will be very cost-efficient.
And so you do need to have a good platform,
well, with good code, good inference code.
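To give a rough sense of the memory efficiency he refers to, here is a back-of-envelope comparison of KV-cache size with and without a windowed (sparse) attention pattern. The dimensions are illustrative, loosely in the style of a 7B model, and the window size is an assumption chosen for the example.

```python
# Back-of-envelope KV-cache size: with windowed (sparse) attention, each layer only
# keeps keys and values for the last `window` positions instead of the full sequence.
# Dimensions are illustrative, loosely 7B-model-like; they are assumptions, not specs.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, window=None, bytes_per_value=2):
    effective_len = min(seq_len, window) if window else seq_len
    return 2 * n_layers * n_kv_heads * head_dim * effective_len * bytes_per_value  # 2x for K and V

full = kv_cache_bytes(32, 8, 128, seq_len=32_000)
windowed = kv_cache_bytes(32, 8, 128, seq_len=32_000, window=4_096)
print(f"full attention: {full / 1e9:.2f} GB, windowed: {windowed / 1e9:.2f} GB per sequence")
```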
The other thing that you can propose to customers
is the fact that you do time sharing across customers.
So when you want to play around with a model,
if you want to make it completely safe,
you should spin it up on an instance of a cloud provider.
But if you just want to play around with it,
you can access an API.
It's going to be less costly, because just a single H100 can serve hundreds of customers.
So I think there's some demand, a lot of demand for experimentation and APIs,
and that's something that we started to build alongside the self-hosted platform
that we direct to other enterprise customers.
Your team is based in France.
You have said before that you think there's an opportunity for a really important AI company
that is French and European and serving the world.
I don't know if that was a mainstream point of view
before the early success of Mistral.
Can you talk about why you think that might work?
I think one very strong point of Europe in that domain is talent.
As it turns out, France, the UK, and Poland are very good at training mathematicians,
and as it turns out, mathematicians are very good at making AI.
Which means that there are a lot of French people and English people and Polish people in AI.
And many of them want to stay in Europe.
Their family is there.
The food is better.
You have many advantages.
I can't list them.
It would be too long.
And so obviously, we've been seeing the emergence of an AI ecosystem in London, I think very much
thanks to DeepMind, and then in Paris, also thanks to DeepMind and to Meta, which set up a lab there,
and to a lot of entrepreneurs who started to come back. So today we have, I think, hundreds of
startups in Paris. This is not the level of Silicon Valley, obviously, but we're starting to have
an ecosystem in place, with investors, and with operators investing as well. So the same kind of
flywheel that made San Francisco and the Bay Area successes is starting to spin in France,
and I'm very glad that we are participating in it.
This has been a great conversation, Arthur.
I always find you inspiring.
I'm very grateful to be an investor.
Thanks for doing this.
Well, thank you for having me
and looking forward to seeing you soon.
Find us on Twitter at No Prior's Pod.
Subscribe to our YouTube channel
if you want to see our faces, follow the show
on Apple Podcasts, Spotify, or wherever you listen.
That way you get a new episode every week.
And sign up for emails or find transcripts for every episode
at no-priors.com.