a16z Podcast - Safety in Numbers: Keeping AI Open

Episode Date: December 11, 2023

Arthur Mensch is the co-founder of Mistral and a co-author of DeepMind's pivotal 2022 "Chinchilla" paper. In September 2023, Mistral released Mistral-7B, an advanced open-source language model that... has rapidly become the top choice for developers. Just this week, they introduced a new mixture of experts model, Mixtral, that's already generating significant buzz among AI developers.

As the battleground around large language models heats up, join us for a conversation with Arthur as he sits down with a16z General Partner Anjney Midha. Together, they delve into the misconceptions and opportunities around open source; the current performance reality of open and closed models; and the compute, data, and algorithmic innovations required to efficiently scale LLMs.

Resources:
Find Arthur on Twitter: https://twitter.com/arthurmensch
Find Anjney on Twitter: https://twitter.com/anjneymidha
Learn more about Mistral: https://mistral.ai
Learn why we invested in Mistral: https://a16z.com/announcement/investing-in-mistral/

Stay Updated:
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://twitter.com/stephsmithio

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.

Transcript
Starting point is 00:00:00 I think the battle is for the neutrality of the technology. This is the story of humanity, making knowledge access more fluid. Basically in 2021, every paper made this mistake. It means that we're only trusting the team of large companies to figure out ways of addressing these problems. All of the people that joined us as well, deeply regretted because we think that we're definitely not at the end of the story. As it turns out, if you look at the history of software, the only way we did software collaboratively is really open source, so why change the recipe? Scaling laws.
Starting point is 00:00:34 These underpin the success of large language models today, but the relationship between datasets, compute, and the number of parameters was not always clear. But in 2022, a pivotal paper came out, often referred to as Chinchilla, that changed the way that many people in the research community thought about that very calculus, demonstrating that datasets were actually more important than just the sheer size of the model. One of the key authors behind that paper was Arthur Mensch, who was working at DeepMind at the time. Now, earlier this year, Arthur banded together with Guillaume Lample and Timothée Lacroix, two researchers at Meta who worked on the release of Llama, and together they founded a new company, Mistral. This team has been hard at work, releasing Mistral 7B in September, a state-of-the-art open source model that very quickly became the go-to for developers. And they just released, as in the last few days, a new mixture of experts model that they're calling Mixtral. So today you'll get to hear directly from Arthur as he sits down with a16z general partner Anjney Midha, as the battleground for large language models heats up, to say the least. Together, they discuss the many misconceptions around open source and the war being waged on the industry. Plus, they'll discuss the current performance reality of open and closed models and whether that gap will close with time. Plus, the kind of compute, data, and algorithmic innovation required to keep scaling LLMs efficiently. Now, it's really rare to have someone at the frontier of this kind of research be so candid about what they're building and why. So I hope you come out of this episode as excited about the future of open source as I did. Enjoy. As a reminder, the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any a16z fund.
Starting point is 00:02:30 Please note that a16z and its affiliates may also maintain investments in the companies discussed in this podcast. For more details, including a link to our investments, please see a16z.com slash disclosures. All right, why don't we start with the founding team story? If we flash back to a few years ago, labs were building foundation models, and the consensus across the research community was that the size of these models was what mattered most. How many million or billion parameters went into the model seemed to be the primary debate that people were having. But you had a hunch that the role of data mattered more. Could you just give us the backstory on the Chinchilla paper you co-wrote? What were the key takeaways in the paper and how was it received? Yeah, so I guess the backstory is that in 2019, 2020, people were relying a lot on the scaling laws paper for large language models.
Starting point is 00:03:26 That was advocating for basically scaling the size of models infinitely and keeping the number of data points rather fixed. So it was saying that if you had four times the amount of compute, you should be mostly multiplying your model size by 3.5, and then maybe your data size by 1.2. And so a lot of work was actually done on top of that. So in particular at DeepMind, when I joined, there was a project called Gopher, and there was a misconception there. There was also a misconception on GPT-3. And basically, in 2021, every paper made this mistake. And at the end of 2021, we started to realize there were some issues when scaling up. And as it turns out, we turned back to the mathematical paper that was actually talking about scaling laws, and it was a bit hard to understand. And we figured out that actually, if you thought about it a bit more from a theoretical perspective, and if you looked at the empirical evidence we had, it didn't really make sense to grow the model size faster than the data size. And we did some measurements. And as it turned out, what was actually true was what you would expect, which is, in common words: if you multiply your compute capacity by four, you should multiply the model size by two and the data size by two. That's approximately what you should be doing, which is good, because if you take everything to infinity, everything remains consistent. So you don't have a model which is infinitely big, or a model which is infinitely small with infinite compression or close to zero compression. So it really makes sense. And as it turns out, it's really what you observe. And so that's how we trained Chinchilla, and that's how we wrote the Chinchilla paper.
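To make that rule of thumb concrete, here is a minimal, illustrative sketch in Python. It assumes the common approximation that training compute C is roughly 6 × N × D (N parameters, D training tokens) and an assumed ratio of about 20 tokens per parameter; the function name and numbers are ours for illustration, not the exact Chinchilla fit.

```python
# Illustrative sketch of the compute-optimal rule of thumb described above,
# not the exact Chinchilla fit. Assumes training compute C ~= 6 * N * D
# (N parameters, D training tokens) and a fixed tokens-per-parameter ratio.
import math

FLOPS_PER_PARAM_TOKEN = 6  # rough cost of one forward+backward pass, per parameter per token

def compute_optimal_split(compute_flops, tokens_per_param=20.0):
    """Split a compute budget so that N and D both scale like sqrt(C)."""
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    base_compute = 1e23  # arbitrary budget, in FLOPs
    for scale in (1, 4):
        n, d = compute_optimal_split(scale * base_compute)
        print(f"compute x{scale}: ~{n / 1e9:.0f}B params, ~{d / 1e9:.0f}B tokens")
    # Quadrupling compute roughly doubles both the model size and the data size,
    # instead of the older 3.5x-model / 1.2x-data recipe.
```

The exact constant matters less than the proportionality: keeping the ratio of data to parameters fixed as compute grows is what keeps everything consistent as you scale up, which is the point Arthur makes above.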
Starting point is 00:05:02 At the time, you were at DeepMind and your co-founders were at Meta. What's the backstory around how you three ended up coming together to form Mistral after the compute-optimal scaling laws work that you just described? So we've known each other for a while, because Guillaume and I were in school together and Timothée and I were in a master's program together in Paris. Basically, we had very parallel careers. Timothée and I actually worked together again when I was doing a postdoc in mathematics. And then I joined DeepMind as Guillaume and Timothée went on to become permanent researchers at Meta. And so we continued doing this. I was doing large language models between 2020 and 2023. Guillaume and Timothée were working on solving mathematical problems with large language models, and they realized they had to have stronger models. And on my side, I was mostly working in a small team at DeepMind. So we did some very interesting work on Retro, which is a paper doing retrieval for large language models. We did Chinchilla. Then I was in the team doing Flamingo, which is actually one of the good ways of doing a model that can see things. And I guess when ChatGPT came out, we knew from before that the technology was very much game-changing, but it was a signal that there was a strong opportunity for building a small team focusing on a different way of distributing the technology. So, doing things in a more open-source manner, which was not the direction that Google, at least, was taking. And so we had this opportunity. Then we left the company at the beginning of the year and created the team that started to work, on the fifth of June, on recreating the entire stack and training our first models. And if I recall correctly, right before they left, Tim and Guillaume had started to work on Llama over at Meta. Could you describe that project and how it was related to the Chinchilla scaling laws work you'd done? So Llama was a small-team reproduction of Chinchilla, at least in its approach to parameterization and all of these things. It was one of the first papers that established that you could go beyond the Chinchilla scaling laws. So the Chinchilla scaling laws tell you what you should be training if you want to have an optimal model for a certain compute cost at training time. But if you take into account the fact that your model should also be efficient at inference time, you probably want to go far beyond the Chinchilla scaling laws. So it means you want to over-train the model, so train on more tokens than would be optimal for performance. But the reason why you do that is that you actually compress models more. And then when you do inference, you end up having a model which is much more efficient for a certain performance. So by spending more time during training, you spend less time during inference, and so you save cost. That was something we observed at Google also, but the Llama paper was the first to establish it in the open, and it opened a lot of opportunities.
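As a rough, hypothetical illustration of that trade-off (the numbers below are placeholders, not Llama's or Mistral's actual configurations): at a fixed training budget, picking a smaller model forces you to train it on more tokens, and the per-token inference cost, which scales with the parameter count, drops accordingly.

```python
# Hypothetical numbers illustrating over-training: fix the training budget
# C ~= 6 * N * D and compare a larger model with a smaller, over-trained one.
# A dense model costs roughly 2 * N FLOPs per generated token at inference.
TRAIN_BUDGET_FLOPS = 6e23  # arbitrary fixed training budget

def tokens_for(n_params):
    return TRAIN_BUDGET_FLOPS / (6 * n_params)

for n_params in (70e9, 35e9):  # compute-optimal-ish model vs. a smaller, over-trained one
    n_tokens = tokens_for(n_params)
    print(f"{n_params / 1e9:.0f}B params: train on ~{n_tokens / 1e12:.1f}T tokens, "
          f"~{2 * n_params / 1e9:.0f} GFLOPs per generated token at inference")
```

At the same training spend, the smaller model sees twice the tokens and costs half as much per generated token, which is the "spend more at training time to save at inference time" argument made above.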
Starting point is 00:07:47 Yep. I remember both the impact of the Chinchilla scaling laws work on multiple labs, realizing just how suboptimal the training setups were, and then the subsequent impact of Llama being dramatic on the industry, and realizing how to be much more efficient about inference. So let's fast-forward to today. It's December 2023. We'll get to the role of open source in a bit, but let's just level set on what you've built so far. A couple of months ago, you released Mistral 7B, which was a best-in-class dense model. And this week, you're releasing a new mixture of experts model. So just tell us a little bit more about Mixtral and how it compares to other models. Yeah, so Mixtral is our new model, of a kind that hadn't been released in open source before. It's a technology called sparse mixture of experts, which is quite simple. You take all of the dense layers of your transformer and you duplicate them. You call these layers expert layers. And then what you do is that for each token that you have in your sequence, you have a router mechanism, just a very simple network, that decides which expert should be looking at which token. And so you send all of the tokens to their experts, and then you apply the experts, you get back the outputs and you combine them, and then you go forward in the network. You have eight experts per layer and you execute only two of them. So what it means, at the end of the day, is that you have a lot of parameters in your model, 46 billion parameters, but the number of parameters that you execute is much lower than that, because you only execute two branches out of eight. And so at the end of the day you only execute 12 billion parameters per token. And this is what counts for latency and throughput and for performance. So you have a model which has the cost of a 12 billion parameter network, but with performance much higher than what you could get, even by compressing data a lot, from a 12 billion dense transformer. Sparse mixture of experts allows you to be much more efficient at inference time and also much more efficient at training time. So that's the reason why we chose to develop it very quickly.
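For readers who want to see the routing idea in code, here is a minimal, self-contained sketch using PyTorch. It illustrates top-2 routing over eight expert feed-forward blocks; the class name, layer sizes, and structure are placeholders for illustration, not Mixtral's actual implementation.

```python
# Minimal sketch of a top-2 sparse mixture-of-experts layer, as described above.
# All names and dimensions are placeholders for illustration, not Mixtral's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        # The router is just a small linear layer that scores the experts for each token.
        self.router = nn.Linear(d_model, n_experts)
        # Each "expert" is an ordinary dense feed-forward block, duplicated n_experts times.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                                    # x: (n_tokens, d_model)
        logits = self.router(x)                              # (n_tokens, n_experts)
        weights, chosen = logits.topk(self.top_k, dim=-1)    # keep only 2 experts per token
        weights = F.softmax(weights, dim=-1)                 # mixing weights for those 2 experts
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            token_idx, slot = (chosen == i).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue                                     # this expert received no tokens
            # Only the routed tokens flow through this expert's parameters.
            out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(x[token_idx])
        return out

layer = SparseMoELayer()
tokens = torch.randn(16, 512)
print(layer(tokens).shape)  # torch.Size([16, 512]); only 2 of 8 expert blocks ran per token
```

All eight experts' weights still have to live in memory, which is why the total parameter count is large, but each token only pays the compute of the two experts the router picks for it.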
Starting point is 00:09:58 Just for folks who are listening who might not be familiar with the state-of-the-art architectures in language models, could you just describe the difference between dense models, which have been your primary architecture to date, and mixture of experts? Intuitively, what are the biggest differences? So they are very similar, except for what we call the dense layers. In a dense transformer, you alternate between an attention layer and a dense layer, generally. That's the idea. In a sparse mixture of experts, you take the dense layer and you duplicate it several times. And so that's where you actually increase the number of parameters. So you increase the capacity of the model without increasing the cost. That's the way of decoupling what the model can remember, the capacity of the network, from its cost at inference time. If you had to describe the biggest benefits for developers as a result of that inference efficiency, what are those? It's cost and latency. Usually that's what you look at when you're a developer. You want something which is cheap and you want something which is fast. But generally speaking, the trade-off is strictly favorable in using Mixtral compared to using a 12 billion dense model. And the other way to think about it is that if you want to use a model which is as good as Llama 2 70B, you should be using Mixtral, because Mixtral is actually on par with Llama 2 70B while being approximately six times cheaper, or six times faster for the same price.
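A quick back-of-the-envelope check of that factor, using the approximate figures mentioned in the conversation (illustrative numbers, not benchmark results):

```python
# Rough sanity check of the "approximately six times cheaper" comparison above.
# A dense model touches every parameter for every token; Mixtral only executes
# roughly two of its eight expert branches per token.
mixtral_active_params = 12e9   # ~parameters executed per token (approximate)
llama2_70b_params = 70e9       # dense: all parameters executed per token

ratio = llama2_70b_params / mixtral_active_params
print(f"~{ratio:.1f}x fewer parameters executed per generated token")  # ~5.8x
```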
Starting point is 00:11:03 Could you talk just a little bit about why it's been so challenging for research labs and research teams to really get the mixture of experts architecture right? For a while now, folks have known that dense model architectures can be slow, they're expensive, and they're difficult to scale. And so for a while, people have been looking for an alternative architecture that could be, like you were saying, cheaper, faster, more efficient. But it's taken a while for folks to figure this out. What were some of the biggest challenges you had to figure out to get the MoE model right? Well, there are basically two challenges. The first one is you need to figure out how to train it correctly from a mathematical perspective. The other challenge is to train it efficiently, so how to actually use hardware as efficiently as possible. You have new challenges coming from the fact that you have tokens flying around from one expert to another, which creates some communication constraints, and you need to make it fast. And then on top of that, you also have new constraints that apply when you deploy the model to do inference efficiently. And that's also the reason why we released an open source package based on vLLM, so that the community can take this code and modify it and see how that works. Well, I'm definitely excited to see what the community does with the MoE. Let's talk about open source, which is an approach and a philosophy that's permeated all the work you've been doing so far. Why choose to tackle the space with an open source approach?
Starting point is 00:12:23 Well, I guess it's a good question. The answer is that it's partly ideological and partly pragmatic. We have grown with the field of AI. In 2012, we were detecting cats and dogs, and in 2022, we were actually generating text that looked human. So we really made a lot of progress. And if you look at the reason why we made all of this progress, most of it is explainable by the free flow of information. You had academic labs, you had very big industry-backed labs, communicating all the time about their results and building on top of each other's results. And that's the way we significantly improved the architectures and training techniques. We just made everything work as a community. And all of a sudden in 2020, with GPT-3, this tide reversed, and companies started to be more opaque about what they were doing, because they realized there was actually a very big market, and so they took this approach
Starting point is 00:13:16 and all of a sudden in 2022, on the important aspects of AI and on LLMs, which are really the hot topic and the most promising ones, beyond Chinchilla there was basically no communication at all. And that's something that I as a researcher, and Timothée and Guillaume, and all of the people that joined us as well, deeply regretted, because we think that we're definitely not at the end of the story. We need to invent new things. There's no reason to stop now, because the technology is effectively good, but not working completely well enough. And so we believe that it's still the case that we should be allowing the community to take the models and make them their own. And that's the ideological reason why we went into that. The other reason is that we are talking to developers. Developers want to modify things, and having deep access to a very good model is a good way of engaging
Starting point is 00:14:04 with this community and addressing their needs so that the platform we are building as well is going to be used by them. So that's also like a business reason. Obviously, as a business, we do need to have a valid monetization approach at some point. But we've seen many businesses build open core approaches and have a very strong open source community and also a very good offer of services. And that's what we want to build. That resonates.
Starting point is 00:14:28 The early days of deep learning were largely driven by a bunch of open collaboration between researchers from different labs, who would often publish all their work and share it at conferences. The transformer famously was published openly to the entire research community, but that has definitely changed. As a clarifier, do you see a difference between open and open source as viewed by the community, or are those two things the same in your mind? Yes. So I think there are several levels of open sourcing in AI. We offer the weights and we offer the inference code. That's like the end product that is already super usable. So it's already a very big step forward compared to closed APIs, because you can modify it and you can look at what's happening under the hood, look at activations and all. So you have interpretability and the possibility of modifying the model to adapt it to some editorial tone, to adapt it to proprietary data, to adapt it to some specific instructions, which is something that is actually much harder to do if you only have access to a closed source API. And that's something that also goes with our approach to the technology, which is to say, pretrained models should be neutral, and we should then empower our customers to take these models and just put their editorial approaches, their instructions, their constitution, if you want to talk like Anthropic, into the models. That's the way we approach the technology. We don't want to pour our own biases into the pre-trained model. On the other hand, we want to enable developers to control exactly how the model behaves, what kind of biases it has and what kind of biases it doesn't have. So we really take this modular approach, and that goes very well with the fact that we release open-weight models. Could you just ground us in the reality of where these models are today, just to give people a sense of where in the timeline we are? Is open source really a viable competitor to proprietary closed models, or is there a performance gap? So Mixtral has similar performance to GPT-3.5. So that's a good grounding. Internally, we have stronger models that are in between 3.5 and 4, that are basically the second or third best models in the world. So really, we think that the gap is closing. The gap is approximately six months
Starting point is 00:16:43 at this point. And the reason why it's six months is that it actually goes faster if you do open source things, because you get the community modifying the model, suggesting very good ideas that can then be consolidated by us, for instance, and we just go faster because of that. So it has always been the case that open source ends up going faster, and that's the reason why the entire internet runs on Linux. I don't see why it would be any different for AI. Obviously, there are some constraints that are slightly different, because the infrastructure cost to train a model is quite high. It costs a lot of money. But I really think that we'll converge to a setting where you have proprietary models and the open source models are just as good. Yeah, so let's talk about that a little bit more. How are you seeing people use and innovate on the open source models? I think we've seen several categories of usage. There are a few companies that know how to strongly fine-tune models to their needs. So they took Mistral 7B, had a lot of human annotations, had a lot of proprietary data, and just modified Mistral 7B so that it solved their task just as well as GPT-3.5, but at a lower cost and with a higher level of control. We've also seen, I think, very interesting community efforts in adding capabilities. So we saw a context length extension to 128K that worked very well. Again, it was done in the open, so the recipe was available, and this is something that we were able to consolidate. We've seen some efforts around encoders, image encoders, to make it a visual language model. A very actionable thing that we saw is, I think, that Hugging Face first did direct preference optimization on top of Mistral 7B and made a much stronger model than the instruct model we proposed at the initial release. And it turned out it's actually a very good idea to do it. And so that's something that we've consolidated as well. So generally speaking, the community is super eager to just take the model and add new capabilities, put it on a laptop, put it on an iPhone. I saw Mistral 7B on a stuffed parrot as well. So fun things, useful things, but generally speaking it has been super exciting to see the research community take hold of our technology. And with Mixtral, which is a new architecture, I think we are also going to see much more interesting things, because in the interpretability field, and also in the safety field, as it turns out, you have a lot of things to do when you have deep access to an open model. And so we're really eager to engage with the community. Safety is an important piece to talk about. The immediate reaction of a lot of folks is to deem open source less safe
Starting point is 00:19:06 than closed models. How would you respond to that? I think we believe that it's actually not the case for the current generation. The models that we are using today are not much more than just a compression of whatever is available on the internet. So it does make access to knowledge more fluid, but this is the story of humanity, making knowledge access more fluid. So it's no different from inventing the printing press, where we apparently had a similar debate. I wasn't there, but that was the debate we had. So we are not making the world any less safe by providing more interactive access to knowledge. So that's the first thing. Now, the other thing is that you do have immediate risks of misuse of large language models. And you do have them for open source models, but also for closed models. And so the way you address these problems and come up with countermeasures is to know about them. So you need to know about breaches, basically. And that's the same way in which you need to know about breaches on operating systems and on networks. And so it's no different for AI. Putting models under the highest level of scrutiny is the way of knowing how they can be misused, and it's a way of coming up with countermeasures. And I think a good example of that is that it's actually super easy to exploit an API, especially if you have fine-tuning access, to make GPT-4 behave in a very bad way. And since that's the case, and it's always going to be the case, it's super hard to be adversarially robust. It means that we're only trusting the teams of large companies to figure out ways of addressing these problems. Whereas if you do open sourcing, you trust the community, and the community is much larger. And so if you look at the history of software in cybersecurity and operating systems, that's the way we made these systems safe. And so if we want to make the current AI systems safe, and then move on to a next generation that potentially will be even stronger, and then we can have this discussion again, well, you do need to do open sourcing. So today, we think that open sourcing is a safe way of developing AI. Yeah, I think this is not widely understood, that when you have thousands or hundreds of thousands of people able to red team models because it's open source,
Starting point is 00:21:03 the likelihood that you'll detect biases and built-in breaches and risks is just dramatically higher. And I think if you were talking to policymakers, how would you advise them? How do you think they should be thinking about regulating open source models, given that the safest way to battle-harden software and tools is often to put them out in the open? We've been saying precisely this: that the current technology is not dangerous. On the other hand, it can be misused. On the other hand, the fact that we are effectively making these models stronger means that we need to monitor what's happening. The best way of empirically monitoring software performance is through open source. So that's what we've been saying. There's been some effort to try to come up with very complex governance structures where you would have several companies talking together, having some safe space, some safe sandbox for red teamers that would be potentially independent. So things that are super complex. But as it turns out, if you look at the history of software, the only way we did software collaboratively is through open source. So why change the recipe today, when the technology we're looking at is actually nothing else than a compression of the internet? So that's what we've been saying to the regulators, generally. Another thing we've added for the regulators is that if they want to enforce that AI products are safe (if you are to have a diagnosis assistant, you want it to be safe), well, in order to monitor and to evaluate whether it's actually safe, you need to have some very good tooling. And the tooling requires having access to LLMs. And if you access LLMs through closed-source APIs, you're a bit in troubled waters, because it's hard to be independent in that setting. So we think that independent controllers of product safety should have access to very strong open-source models and should own the technology. And if open source LLMs were to fail
Starting point is 00:23:00 relative to closed source models, why would that be? I guess the regulation burden is potentially one thing that could make it harder to release open source models. It's also, generally speaking, a very competitive market.
Starting point is 00:23:13 And I think in order for open source models to be widely adopted, they need to be as strong as closed source models. They have a little advantage, because you do have more control, and so you can do fine-tuning, and so you can make performance jump a lot on a specific task, because you have deep access.
Starting point is 00:23:30 But really, at the end of the day, developers look at performance and latency. And so that's why we think that as a company, we need to be very much on the frontier if we want to be relevant. Given the complexity of frontier models and foundation models in these systems, there are just tons of misconceptions that folks have
Starting point is 00:23:49 about these models. But if we just take a step back and look at the battle that's raging between folks pushing for closed source systems versus open source systems, what do you think is at stake here? What do you think the battle is really for? I think the battle is for the neutrality of the technology. A technology, in essence, is something neutral. You can use it for bad purposes, you can use it for good purposes. If you look at what the LLM does, it's not really different from a programming language. It's actually used very much as a programming language by the application makers. And so there's a strong confusion made between what we call a model and what we call an application. And so a model is really the programming language of AI applications. So if you talk to all of the startups doing amazing products with generative AI, they're using LLMs just as a function. And on top of that, you have a very big system with filters, with decision making, with control flow, and all of these things. What you want to regulate, if you want to regulate something, is the system. The system is the product. So, for instance, a healthcare diagnosis assistant is an application. You want it to be non-biased. You want it to take good decisions, even under high pressure. So you want its statistical accuracy to be very high. And so you want to measure that. And it doesn't matter if it uses a large language model under the hood. What you want to regulate is the application. And the issue we had, and the issue we're still having now, is that we hear a lot of people saying we should regulate the tech, so we should regulate the function, the mathematics behind it. But really, you never use a large language model itself. You always use it in an application, in a way, with a user interface.
Starting point is 00:25:33 And so that's the one thing you want to regulate. And what it means is that companies like us, foundational model companies, will obviously make the model as controllable as possible, so that the applications on top of it can be compliant, can be safe. We'll also build the tools that allow you to measure the compliance and the safety of the application, because that's super useful for the application makers. It's actually needed. But there's no point in regulating something that is neutral in itself, that is just a mathematical tool. I think that's the one thing that we've been hammering a lot, which is good. But there's still a lot of effort, I guess, in making this strong distinction, which is super important to understand what's going on. Regulating apps, not math, seems like the right direction that a lot of folks who understand the inner workings of these models, and how they're actually implemented in reality, are advocating for. What do you think is the best way to clear up this misconception for folks who maybe don't have technical backgrounds and don't actually understand how the foundation models work and how the scaling laws work? So I've been using a lot of metaphors to make it understood, but large language models are like programming languages, and so you don't regulate programming languages. You regulate malware. You ban malware. We've also been actively vocal about the fact that pre-market conditions like FLOPs, the number of FLOPs that you use to create a model, are definitely not the right way of measuring the performance of a model. We're very much in favor of having very strong evaluations.
Starting point is 00:27:06 As I've said, this is something that we want to provide to our customers, the ability to evaluate our models in their application. And so I think this is a very strong thing that we've been stressing. We want to provide the tools for application makers to be compliant. That's something we have been saying. And so we find it a bit unfortunate that we haven't been heard everywhere and that there's still a big focus on the tech, probably because things are not completely well understood
Starting point is 00:27:32 because it's a very complex field and it's also a very fast-moving field. But eventually I think I'm very optimistic that we'll find a way to continue innovating while having safe products, but also high level of competition on the foundational model layer. Let's channel your optimism a little bit. There's very few people who have the ground level understanding of scaling laws like you, Guillaume, and Tim and your team, when you step back and you look at the entire space of language modeling, in addition to open source, what are the key differentiators that
Starting point is 00:28:03 you see in the next wave of cutting-edge models? Things like self-play, process reward models, the use of synthetic data. If you had to conjecture, what do you think some of the most exciting or important breakthroughs will be in the field going forward? I guess it's good to start with a diagnosis. So what is not working that well? Reasoning is not working that well. And it's super inefficient to train a model. If you compare the training process of a large language model to the brain, you have a factor of, I think, 100,000. So really, there's some progress to be made in terms of data efficiency. So I think the frontier is increasing data efficiency, increasing reasoning capabilities. So adaptive compute is one way. And to increase data efficiency, you do need to work on coming up with very high quality data. Many new techniques still need to be invented. But that's really where the bottleneck is. Data is the one important thing. And the ability of the model to decide how much compute it wants to allocate to a certain problem is definitely on the frontier as well. So these are things that we're actively looking at. You know, this is a raging debate, right? And we've talked about this a few times before, which is, can models actually reason today? Do they actually generalize out of distribution? What's your take on it? And what would convince you that models are actually capable of multi-step, complex reasoning? Yeah, it's very hard, because you train on the entirety of human knowledge, and so you have a lot of reasoning traces. So it's hard to say whether they reason or not, or whether they do retrieval that just looks like reasoning. I guess at the end of the day, what matters is whether it works or not, and on many simple reasoning tasks it does. So we can call it reasoning. It doesn't really matter if they reason like we do.
Starting point is 00:29:46 We don't even know how we reason. So we are not going to know how machines reason anytime soon. Yeah, it's a raging debate. The way you evaluate that is to try to be as out of distribution as possible, like working on mathematics. It's not something I've ever done, but it's something that Timothée and Guillaume are very sensitive to, because they'd been doing it for a while when they were at Meta. That's probably one way of measuring whether you have a very good model or not. And actually, we're starting to see some very good mathematicians, I'm thinking of Terence Tao, that are using large language models for some things. Obviously not the high-level reasoning, but for some parts of their proofs. And so I think we will move up into abstraction. And the question is, where does that stop? We do need to find new paradigms, and we're actually looking for them. And we've talked a lot about developers so far. If you had to channel your product view and just conjecture on what these advances in scaling laws, in representation learning, in teaching the models to reason, faster, better, cheaper: what will these advances mean for end users in terms of how they consume, how they program, and how they generally work with models? What we think is that, fast-forward five years, everybody will be using their specialized models within parts of complex applications and systems. And so for all of the software stack, developers will be looking at latency. So for any specific task of the system, they will want to have the lowest cost and lowest latency. And the way you make that happen is that you adapt models for the task, for user preferences, for what you want the model to do, and you try to make them as small as possible and as suitable to the task as possible. And so I think that's the way we'll be evolving on the developer side. I also think that, generally speaking, the fact that we have access to large language models is going to completely reshape the way we interact with machines, and the internet of five years from now is going to be much different, so much more interactive.
Starting point is 00:31:41 This is already unlocked. It's just about making very good applications with very fast systems, with very fast models. So, yeah, very exciting times ahead. So what would those interaction modalities look like? Instead of just navigating the internet, you will be asking questions and discussing with machines. You will probably have a large language model doing some form of reasoning under the hood, but looking at your intention and figuring out how it can address your needs. So it's going to be much more interactive. It's going to be much closer to human-like conversation, because as it turns out, the best way to interact with something and find knowledge is to have a real discussion. We haven't found a better way to transmit information. So I expect that we'll be a bit more talking to machines in five years' time when talking to the Internet, and every content provider needs to adapt to these new paradigms. I think there's a lot of space for high-quality content to be well identified as human-created or human-edited, and for generative AI to help a user navigate through that knowledge. And generally speaking, I think the access to knowledge and the enrichment of what we know is going to be much better in the next five years.
Starting point is 00:32:52 What do you think changes the most when we go from interacting with one large frontier model to interacting maybe with a team of small models that are working together, like a swarm of Mistrals? Yeah, so that's very interesting. And I think in games, for instance, it's going to be fascinating. We've seen some very good applications. You do need to have small models, because you want to have swarms of them and it starts to be a bit costly if they're too big. But having them interact is just going to make for pretty complex systems, and interesting systems to observe and use. So we have a few friends making applications in the enterprise space, with different personas playing different roles, relying on the same language model but with different prompts and different fine-tuning. And I think that's going to be quite interesting as well. As I've said, complex applications in three years' time are just going to use different LLMs for different parts, and that's going to be quite exciting. Well, what's your call to action? To builders, researchers, folks who are excited about the space, what would you ask them to do? I would say, take Mistral models and try to build amazing applications. It's not that hard. The stack is starting to be pretty clear, pretty efficient. You only need a couple of GPUs. You can even do it on your MacBook Pro if you want. It's going to be a bit hot, but it's good enough to do interesting applications. Really, the way we do software today is very different from the way we did it last year. And so I'm really calling application makers to action, because we are going to try to enable them to build as fast as possible. On that note, thank you so much for finding the time to talk with us today, and we'll put a link to the Mixtral model in the show notes so people can go find it and play around with it.
Starting point is 00:34:28 If you like this episode, if you made it this far, help us grow the show. Share it with a friend, or if you're feeling really ambitious, you can leave us a review at ratethispodcast.com slash a16z. You know, candidly, producing a podcast can sometimes feel like you're just talking into a void. And so if you did like this episode, if you liked any of our episodes,
Starting point is 00:34:50 please let us know. We'll see you next time.
