Latent Space: The AI Engineer Podcast - NeurIPS 2023 Recap — Best Papers

Starting point is 00:00:03 Hello, hello. This is swyx with a special edition of the Latent Space pod for NeurIPS 2023. Both Alessio and I were there covering what we could cover. It is an impossible conference: 15,000 people, 3,500 papers, and tons and tons of sessions. So it's just impossible for two people to cover it, especially with limited time. But we did our best. A lot of you liked our OpenAI Dev Day coverage where we basically just jumped from paper to paper, person to person, and founder to founder and got their takes. And this is effectively what we've tried to do here. It's still an experimental new format for us.

Starting point is 00:00:36 So we'd really love your feedback. We're actually doing a listener survey now. If you click into the show notes, we'd really love to hear feedback and know what you want to hear for 2024. So we recorded a lot of audio at NeurIPS. And I figured the most logical way to cover this would be to start with the best papers. NeurIPS does hand out best paper awards. So we're going to start with the hardest one to obtain, which is the Test of Time Award.

Starting point is 00:00:57 The Test of Time Award is given to a paper that has stood the test of time, which by NeurIPS's definition is a paper that was published 10 years ago at NeurIPS. NeurIPS is in its 37th year, so this is honestly a flex that very, very few conferences can actually do. And it's really interesting to have the original authors of the paper come back and talk about what they've learned and how they look back at the past 10 years. So here's Jeff Dean and Greg Corrado. Thank you very much. I'm Jeff. And I'm Greg. And we're here to give a little talk and a retrospective on this work. So this work actually started out as an ICLR 2013 workshop paper with four of our co-authors working together.

Starting point is 00:01:36 And in that work, we sort of explored a bunch of different sort of loss functions and techniques for optimizing word embedding representations. And really, that was kind of the genesis of this work. And that work was cited by quite a few people, and one of the things that we discovered in that work was that the skip-gram model, one of the few models that we evaluated in this workshop paper, really was showing better performance than some of the other ones that we worked on. So we decided to focus on that and really focus on the skip-gram model and then some interesting sort of optimization techniques

Starting point is 00:02:15 to improve the optimization of the word embeddings and added the ability to do phrase embeddings as well. And along the way Ilya joined as a co-author, which is great. And this paper has been cited by a number of people, as Sergey mentioned. One thing we've discovered: including source code and trained representations really does boost your citation count. People have done this and, you know, used these downstream representations

Starting point is 00:02:41 for all kinds of things, and we're very gratified to see that in the community. And we also want to highlight that three of our co-authors couldn't make it today. So Tomas, Ilya, and Kai couldn't be here, but on their behalf, we're delighted to be giving this talk. And with that, I'm going to turn it over to Greg, I think. Oh, no, we're older now. Sorry. Sadly, we've found more recent photos, and this is a Test of Time Award.

Starting point is 00:03:10 And time has passed. Yes, I think we survived the test, mostly. But so let's stand back and ask ourselves, you know, what did we really learn from these papers? But before I get into that, I should probably stipulate that some of you out there rightfully say, well, we already believed these things before you published this work. And so for you, maybe this is really us reinforcing these points. Others of you might think that, well, the paper didn't really exactly prove this point.

Starting point is 00:03:42 It just suggested it. So it foreshadowed it. We don't have any quarrel with whether it was reinforcing or foreshadowing or learning, and so we'll just put that aside for the remainder of the talk and talk about what we think are at least the themes that were in this work that resonate today. So the first point is that semi-supervised objectives have an incredibly powerful opportunity, and we think that they're going to be critical for natural language understanding going forward. We think that this paper shows that fast, parallel, weak synchronization

Starting point is 00:04:18 in computation really dominates over the sort of fruitless precision of tight synchronization. Focusing compute where it really helps and improves your learning of representations is what's most important. And tokenization can be used as a good trick to solve some nuanced problems. And then the last, and I think most important point, is that treating language as a sequence of dense vectors has proven to be really powerful, and honestly more powerful than I think we imagined when we started this work. So first on semi-supervised objectives.

Starting point is 00:04:57 Why is this so important? Of course, almost all machine learning systems today go through some period of supervised learning. We're always going to use that, but there's too much to learn in the world to use supervised learning for everything. The promise of unsupervised learning, of course, is tantalizing, but has been difficult to implement in practice. And so semi-supervised learning, the ability to construct supervised-feeling data from an unlabeled corpus, is really what we think works. So what's the basic program here? You begin with a large corpus of sequence data, say text, choose a random window within that corpus, and then algorithmically construct inputs and target outputs on the fly.
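
The basic program just described can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `sample_training_pairs`, the window size, and the toy corpus are all assumptions, and the pairing rule shown (a word as input, each of its neighbors as a target) is just one possible choice of objective.

```python
import random

def sample_training_pairs(tokens, window=2, rng=random):
    """Pick a random position in the corpus and build (input, target) pairs on the fly.

    Hypothetical helper: the word at the chosen position is the input, and
    each neighbor within `window` positions becomes one target.
    """
    center = rng.randrange(len(tokens))
    lo = max(0, center - window)
    hi = min(len(tokens), center + window + 1)
    return [(tokens[center], tokens[j]) for j in range(lo, hi) if j != center]

# Toy corpus; a real run would stream over billions of tokens.
corpus = "the quick brown fox jumps over the lazy dog".split()
rng = random.Random(0)
for _ in range(3):
    print(sample_training_pairs(corpus, window=2, rng=rng))
```

No labels are stored anywhere: every call manufactures fresh supervised examples from raw text, which is what makes the data effectively unlimited.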

Starting point is 00:05:44 And I want to underscore, I actually think doing it on the fly is part of what makes this method so powerful. And you have your choice about how you'll do it on the fly. You might be taking a word in the corpus and trying to predict its neighbors, which is the so-called skip-gram model. You might be doing something like fill in the blank, or you might be trying to predict the end of the sequence, which is sort of the classic language modeling problem. All of these fit in this description. And if you repeat that a few billion times, it seems to work really well. But that's where we get into the hard part. Yeah. So I think one of the things that we really explored in this work and sort of work we were doing concurrently with this is how effectively could we make sort of weakly synchronized, asynchronous updates to a large model work. And Tomas, our first author, had been exploring these word embedding ideas on a single machine version that he implemented in C of both the skip-gram and the continuous bag of words objective functions.

Starting point is 00:06:45 And he actually did a fair amount of work to scale this up to be a very high performance implementation, using all the cores on a single machine, so about 20 different cores at that time with almost no synchronization. So you just kind of blindly update the embedding that was sort of a large 2D array in memory. And then he was able to have about 20 cores on these multi-core machines simultaneously updating this shared representation and get quite good embedding representations. Now, one of the things that we observed was every time we made the dimensionality of the word vectors larger, and every time we trained on more data, things got better, right? This is the lesson of a lot of the last 10 years of deep learning work,

Starting point is 00:07:30 is scaling actually gives you much better results. Fortunately, a bunch of us were simultaneously working on a highly scalable system for distributed training of neural networks. So we decided to take the single machine implementation that Tomas had built for these word embedding questions and implement that in our distributed framework. And so the work we were doing just a bit before this work was this large-scale distributed deep networks

Starting point is 00:07:58 where we were exploring distributed training of large-scale models, mostly for vision and for speech. And really the motivation was how can we scale training on these systems to thousands of machines. We actually titled this DistBelief internally, so named because it was a distributed system, but also because a bunch of people were skeptical that it would work. And it turns out it did work, which is nice. So the basic idea behind DistBelief is you have some set of parameters that are being represented on some set

Starting point is 00:08:32 of machines, and then you have independent replicas of the model where you fetch the current state of the parameters, you do some computation on the model, and then you update the parameters by sending a gradient back to the parameter servers. And, you know, in large-scale setups, we were using tens to hundreds of machines to hold the distributed state of the parameters and hundreds to thousands of machines to hold the sort of independent workers of the model.
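
The fetch, compute, push loop described here can be illustrated with a toy, single-process simulation. This is a sketch under loud assumptions: the `ParameterServer` class, the one-parameter linear model, and the learning rate are invented for illustration, and real asynchrony is imitated by letting every replica compute its gradient from the same stale snapshot.

```python
class ParameterServer:
    """Holds the shared parameter; replicas fetch it and push gradients back."""

    def __init__(self, w, lr):
        self.w = w
        self.lr = lr

    def fetch(self):
        return self.w

    def push(self, grad):
        # Applied immediately, with no locking and no waiting for other replicas.
        self.w -= self.lr * grad


def gradient(w, x, y):
    """Gradient of 0.5 * (w * x - y) ** 2 with respect to w."""
    return (w * x - y) * x


def train(num_workers=4, rounds=50, lr=0.05):
    data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]  # true weight is 3.0
    server = ParameterServer(w=0.0, lr=lr)
    for _ in range(rounds):
        # Every replica fetches the current parameters...
        snapshots = [server.fetch() for _ in range(num_workers)]
        # ...then pushes a gradient computed from its (by now stale) copy.
        for worker, w_stale in enumerate(snapshots):
            x, y = data[worker % len(data)]
            server.push(gradient(w_stale, x, y))
    return server.w


print(train())
```

Even though later pushes are based on out-of-date parameters, the shared weight still approaches the true value in this toy setting, which is the intuition behind tolerating weak synchronization.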

Starting point is 00:08:57 And so that really meant you had 1,000 to 10,000 simultaneous threads kind of updating the model for the word embedding kind of work. And we were using 300 to 1,000 dimensional embeddings for a lot of things, 100K to million item vocabularies and even beyond for a lot of internal uses. It turns out you can make vocabularies out of lots of things, you know, not just words, but, you know, particular videos that people have watched or all kinds of things and use similar approaches for more than just language modeling.

Starting point is 00:09:27 And now back to Greg. And so this kind of provocative disrespect for locking and synchronization was the biggest single enabler of being able to do this work. But there were other things that we did that tried to focus compute where it really actually made a difference in terms of model representation and quality. So, for example, the meaning of tokens that are uncommon is actually often more informative than common ones. The common ones are super easy to learn because you get a lot of chances at them.
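
One way to act on this observation is to drop occurrences of frequent tokens at random. The sketch below uses the discard rule from the published word2vec paper, P(discard) = 1 - sqrt(t / f), where f is a token's relative frequency and t is a threshold. The function names and the toy corpus are assumptions, and the threshold is set far higher than the paper's suggested 1e-5 only so the effect is visible on a tiny example.

```python
import math
import random
from collections import Counter

def discard_probability(freq, t=1e-5):
    """Chance of dropping one occurrence of a token with relative frequency `freq`."""
    return max(0.0, 1.0 - math.sqrt(t / freq))

def subsample(tokens, t=1e-5, rng=random):
    """Return the corpus with frequent tokens randomly thinned out."""
    counts = Counter(tokens)
    total = len(tokens)
    return [tok for tok in tokens
            if rng.random() >= discard_probability(counts[tok] / total, t)]

# Toy corpus: one very common token and one rare token.
tokens = ["the"] * 900 + ["volga"] * 100
kept = Counter(subsample(tokens, t=0.2, rng=random.Random(0)))
print(kept)
```

Rare tokens fall below the threshold and are always kept, while the common token loses a large share of its occurrences, so more of the training budget goes to the informative words.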

Starting point is 00:09:57 So we would probabilistically discard tokens related to their frequency, ignoring common tokens more often. And you could apply that both as inputs and in targets. Another thing that we did was we found that favoring objectives and models that were informative for the ultimate task, but were faster to compute, was better. And so in our paper you can see we go through softmax and then an approximation to that, through hierarchical softmax, and then noise contrastive estimation, which is an even faster version, and then Ilya came up with negative sampling, which is an even faster, faster,

Starting point is 00:10:30 faster version. We saw that quality went up every time that we were able to make it simpler and faster. We also found that you could use tools like tokenization to focus computation in the part that was interesting. One of the things that we used it for was to try to deal with phrase representation. So in English, compound concepts and nouns are often represented by multiple words. And so we just had a very simple heuristic that allowed us to build bigrams out of terms that were each individually not super frequent, but were co-occurring much more frequently together than you would expect. And many other authors

Starting point is 00:11:06 have used tokenization schemes in these systems to great benefit, dealing with everything from contractions to declensions, and I just think it's important for us to not overlook that when we're processing text, we begin with tokenization. But then to the point of getting concepts to be n-dimensional vectors and how it is that this is so powerful. And I was actually trained as a neuroscientist, and so I saw this come up as ideas from a long time ago, from the 80s, about maybe concepts could be represented in a dense vector space, and that operators in that vector space or geometric relationships in that vector space actually meant something. But that was simply a conjecture.

Starting point is 00:11:47 And then lo and behold, when we took these representations that we had learned in a semi-supervised fashion and investigated what was inside by, for example, flattening them into two dimensions using PCA, we found that syntactic relationships were represented geometrically, like these similar triangles representing the tenses of verbs, and that even arbitrary semantic relationships, like the relationship between countries and capitals or diseases and drugs, were also represented geometrically in this space as similar displacements. And that was really powerful. And then Tomas and Ilya were able to show that you could do these cute tricks,

Starting point is 00:12:24 like solve analogies with simple vector arithmetic. By adding and subtracting vectors, you could see that sushi is to Japan as bratwurst is to Germany, well, at least according to the language model. And in fact, you could even just do simple addition to imagine combining concepts and discovering what concept is nearby in this vector space. So, for example, putting together Russian and River, you get tokens like Volga River. Okay, so summing it all up, what did we learn in these papers? Let's go back to the five points that Greg talked about in the beginning. So semi-supervised objectives applied to large text corpora are pretty important in natural language understanding.

Starting point is 00:13:10 I would say definitely true today. Fast, parallel, weakly synchronized computation dominates in ML. Parallel, definitely. I would say larger-scale specialized ML hardware has really enabled fully synchronized approaches to scale, even to the scale of models that we're training today. But I personally think that asynchronous approaches are going to make a comeback, because I think we're sort of close to where we're going to have to start reconsidering some of these asynchronous approaches to training very large models. Focus compute on the aspects of learning that need improvement. Yeah, simpler, more parallel methods win out over more complex,

Starting point is 00:13:50 less parallelizable models. You know, Word2Vec versus RNNs, Transformers versus LSTMs. I think this is a good lesson as we're thinking about future improvements to these things. Tokenization can be used to solve seemingly nuanced problems. Yeah, more powerful models on top have actually pushed tokenization in the opposite direction of our phrase-based vocabulary, where we now have kind of subword sort of tokenization,

Starting point is 00:14:15 and that actually seemed to work pretty well for some of these models that have more complex attention mechanisms on top. And treating language as a sequence of dense vectors is more powerful than expected. Definitely true today. So we're really honored to receive this award. Thanks to the committee that selected the work. We're really honored.

Starting point is 00:14:35 And thanks to our co-authors who couldn't be here today, and there's their pictures. Tomas, Ilya, and Kai. Thank you for this delightful work and co-authoring. Were we still so young? Thanks, everybody. Yeah, we picked the younger ones. By the way, there was some discussion at NeurIPS around what would be the 2024 Test of Time winner.

Starting point is 00:14:57 There was some contention for GANs by Ian Goodfellow, but probably it's going to go to the sequence-to-sequence paper because that is most influential to language models today. The only thing I know for sure is that I know what's going to be the Test of Time Award winner for 2027. Up next are the best paper awards from this year. There are two papers chosen, but probably the most relevant for AI engineers is the Mirage paper. In other words, "Are Emergent Abilities of Large Language Models a Mirage?" And here is Schaeffer et al. My name is Rylan Schaeffer, and this is our NeurIPS paper, "Are Emergent Abilities of Large Language Models a Mirage?" This is joint work with Brando Miranda and Professor Sanmi Koyejo. Our paper is a story about predictability and surprise.

Starting point is 00:15:44 Our story begins with predictability. As many of you know, several years ago, researchers observed a striking phenomenon that as you fed large networks more and more data, the loss improved in a predictable manner. But it wasn't just the test data. Other researchers observed that scaling other quantities, scaling compute, scaling dataset size, scaling parameters, yielded predictable improvements in the performance of large networks. This was incredibly important because it told us that if you fed more into these models, you knew what you would get. That's extremely useful.

Starting point is 00:16:27 But approximately three years ago, this story was turned on its head. There was a new story in town, a story of surprise in large language models. Specifically, perhaps the first instance of this was in the GPT-3 paper, where the authors observed that you might try having language models solve a task, like arithmetic, and you make them larger and larger and larger, and they're unable to do this task. But then, at some seemingly unforeseeable model scale, performance skyrockets, almost to ceiling, something that was unpredictable. But it wasn't just on arithmetic.

Starting point is 00:17:06 It was also on many other tasks: IPA transliterate, word unscrambling, Persian question answering, all of these tasks across a variety of language model families. All of them seem to display these miraculous emergent abilities. What are emergent abilities? Emergent abilities were defined by their authors as abilities that are not present in smaller scale models, but that are present in larger scale models. Critically, emergent abilities cannot be predicted by simply extrapolating the performance improvements

Starting point is 00:17:43 on smaller scale models. These emergent abilities raised several interesting research questions, questions like: what controls which abilities will emerge? What controls when abilities will emerge? How can we make desirable abilities emerge faster? And how can we ensure undesirable abilities never emerge? These questions not only are fundamental scientific questions of interest to the machine learning community, but these are also fundamental questions for those interested

Starting point is 00:18:15 in governmental policy or economics. What our paper asked is whether or not the story of emergent abilities is complete. Specifically, if you look at these emergent abilities, you might notice something: if you hone in on the metrics, all of these metrics are quite harsh. They give no partial credit. Exact match, for instance: either you exactly output the correct answer or you do not. There is no in-between. And so it seemed, when we looked closer, that many emergent abilities appeared under metrics that nonlinearly or discontinuously scored models' performance. For instance, on Google's large-scale BIG-Bench, we found that over 90% of emergent abilities were observed under just two metrics.

Starting point is 00:19:11 One of those metrics, for those who haven't seen this, is called multiple-choice grade. It's like taking an A-through-D multiple-choice question. You get a score of one if you put the highest probability mass on the correct answer, and zero otherwise. The other metric was exact string match, where again, one point if you get it exactly right, zero otherwise. This raised the specter that emergent abilities might not be due to fundamental changes in models with scale, but due to our evaluations of said models. So what exactly is this alternative that I'm positing? What is our alternative hypothesis? Let's walk through it. First of all, let's just suppose that the test loss falls as we

Starting point is 00:19:57 increase the number of parameters in our models. So for example, motivated by power-law scaling, we might assume that the cross-entropy loss as a function of the number of parameters is some power law. What that means is if we visualize the number of model parameters against the cross-entropy loss in log-log space, we observe a very predictable linear trend.

Starting point is 00:20:18 In step two, we compute the probability mass that is placed on the correct token as a function of parameters. So how can we do this? Well, we know the definitional form of cross-entropy, and we know that we can substitute in our power-law scaling, so I can rearrange. And when I plot this, what I see is that,

Starting point is 00:20:38 as model parameters get larger, the probability mass that gets placed on the correct token asymptotes toward one. And everybody is comfortable with this. So how do we go from this to an emergent capability? The answer is, we might choose a metric that nonlinearly scores model performance. For example, suppose that we want to add two five-digit numbers

Starting point is 00:21:03 and we're going to measure performance with accuracy. What scaling should we expect? Well, the answer is that unless you get every token correct, you get zero points. Ergo, the chance of scoring one point is going to be the per-token probability approximately exponentiated to however many tokens you need to get correct.
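
The three steps so far (power-law loss, per-token probability, exact-match accuracy) can be traced numerically. This is a minimal sketch of the toy model as described in the talk, not the authors' code: the constants `c` and `alpha`, the five-token answer length, and the assumption that token errors are independent are all illustrative choices.

```python
import math

def cross_entropy(num_params, c=1e9, alpha=-0.5):
    """Assumed power law: per-token loss falls smoothly as parameters grow."""
    return (num_params / c) ** alpha

def p_correct_token(num_params):
    """Per-token probability mass on the correct token, from loss = -log(p)."""
    return math.exp(-cross_entropy(num_params))

def exact_match_accuracy(num_params, num_tokens):
    """Every token must be right; assumes errors are independent across tokens."""
    return p_correct_token(num_params) ** num_tokens

for n in (1e7, 1e8, 1e9, 1e10, 1e11):
    p = p_correct_token(n)
    acc = exact_match_accuracy(n, num_tokens=5)
    print(f"params={n:.0e}  per-token p={p:.3f}  exact-match accuracy={acc:.3f}")
```

The per-token probability climbs gradually across the whole range, while the exact-match column sits near zero for the smaller models and then rises quickly, which is the apparent jump the talk attributes to the metric.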

Starting point is 00:21:26 So what happens is this graph on the right, that we like and know, gets transformed into something that becomes much less predictable with model scaling. And indeed, this toy model qualitatively reproduces what's been observed empirically at large scale. But could we have done something differently? Yes, suppose we had done the evaluation differently. Suppose that we had chosen a different metric, one that linearly scores model performance. So, for example, I might instead count merely the number of mistakes that the language model makes. For those in NLP, you might call this an edit distance.
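
The contrast between the two metrics can be written down under the same toy assumptions. The symbols here are illustrative rather than taken from the slides: $p(N)$ is the per-token probability of the correct token at $N$ parameters, $n$ is the number of output tokens, and token errors are assumed independent.

```latex
\text{Exact-match accuracy}(N) \approx p(N)^{\,n}
\qquad\text{versus}\qquad
\mathbb{E}\big[\text{number of incorrect tokens}\big](N) \approx n\,\big(1 - p(N)\big)
```

The first expression is geometric in the output length, so small changes in $p(N)$ are compressed toward zero until $p(N)$ is close to one. The second is linear in both $n$ and $1 - p(N)$, so it tracks the smooth improvement in $p(N)$ directly.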

Starting point is 00:21:59 And what that then means is that the edit distance scales approximately linearly with the output length. And so if we look at this, instead, what we find is when we plot model parameters versus the number of incorrect tokens, we find a very nice predictable trend that asymptotes toward zero as you make models bigger. So nothing has fundamentally changed. From one viewpoint, we saw a seemingly emergent ability. From a different viewpoint, we removed it. Of course, it's not just about linear and non-linear metrics. It can also be discontinuous metrics.

Starting point is 00:22:32 So, for example, let's consider that multiple-choice metric. So multiple choice, again, is you get one if you place the highest probability mass on the correct option. And what that scaling looks like is you're at chance, up until some unforeseeable critical threshold, at which point you jump to ceiling. And this, again, qualitatively matches what's been observed empirically at scale. So if we had done the evaluation differently, we could have chosen a continuous metric like Brier score, which is just the mean squared error here between one and the probability mass, and then we find a very nice quadratic.
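
The discontinuous and continuous views of the same probability can be compared directly. This sketch reuses an assumed power-law loss with made-up constants, and it adds one more simplification that is not from the talk: the wrong options are assumed to split the remaining probability mass evenly, so a single question scores 0 (rather than chance) below the threshold.

```python
import math

def p_correct_option(num_params, c=1e9, alpha=-0.5):
    """Probability mass on the correct option under an assumed power-law loss."""
    return math.exp(-((num_params / c) ** alpha))

def multiple_choice_grade(p, num_options=4):
    """1 if the correct option holds the most mass, else 0.

    Simplifying assumption: wrong options share the leftover mass equally,
    so the correct option wins exactly when p > (1 - p) / (num_options - 1).
    """
    return 1 if p > (1 - p) / (num_options - 1) else 0

def brier_score(p):
    """Squared error between 1 and the mass on the correct option (lower is better)."""
    return (1 - p) ** 2

for n in (1e7, 1e8, 1e9, 1e10, 1e11):
    p = p_correct_option(n)
    print(f"params={n:.0e}  grade={multiple_choice_grade(p)}  brier={brier_score(p):.3f}")
```

The grade column flips from 0 to 1 at a single model size, while the Brier score shrinks steadily over the whole range, even though both are computed from the same underlying probability.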

Starting point is 00:23:05 So to summarize this together, we started with power-law scaling, and we computed what the probability mass on the correct token is. If we chose a nonlinear metric, we see an emergent ability, but if we chose a linear metric, we did not. Similarly, if we chose a discontinuous metric, emergent ability. If we choose a continuous metric, we do not. And so this is our alternative hypothesis for emergent abilities. Now, of course, to summarize this, there's basically three factors at play here.

Starting point is 00:23:34 One of them is the metrics that I focused on. Another one is that of statistics, about needing sufficient resolution, measuring discreteness, in order to accurately estimate the performance of models. And then third and finally, the third confounding factor is evaluating too few small and medium-sized models. So up till now, this has been Rylan's hypothesis. Do we have any actual evidence? And the answer is, in our paper, we considered three different types of evidence. We made and tested predictions using the largest publicly available model family at the time, GPT-3. We did a meta-analysis of published metrics and emergent abilities on Google's BIG-Bench.

Starting point is 00:24:13 And third, we induced emergent abilities in toy, minuscule networks on vision tasks. The reason why we did this is because prior to our paper we didn't know of any work that had found emergent abilities in vision tasks, so to induce them intentionally was quite novel. So let's walk through this. Let's first talk about the predictions that the toy model, the mathematical model, makes. The first is that if you change the metric, you should get more predictable scaling. So here again, model parameters versus accuracy. As I increase the number of tokens that the model needs to output correctly, we should expect to observe an approximately geometric decrease in performance, so we start at a peak here and then it falls. But if I change the metric to token edit distance, I should find this

Starting point is 00:24:55 nice quasi-linear behavior. I'm now going to go test this in GPT-3. And that's precisely what we did. So here is accuracy. And again, here's the four models in the GPT-3 family. And again, we find that as the target length gets longer, you find it decays geometrically in the length of the target. And that if I switch using the exact same data, fixed data, if I change the metric, I find very nice quasi-linear scaling. This is exactly what the toy mathematical model predicts. Moreover, there's a question about better statistics yielding more predictable scaling. What the toy model tells us is that when we said the tiny models are unable to do the task, that wasn't quite right. It was that their performance was so small, we didn't have sufficient

Starting point is 00:25:40 resolution in order to estimate it. So what our toy model says is we really need to consider accuracy on a log scale. And to estimate these quantities, we need sufficient data to do so. So we scale up the amount of data, and again we find that if we separate into log scale, we find a very, very nice separation with predictable behavior. Second, we conducted a meta-analysis of emergent abilities on Google's BIG-Bench, and what we found is that across many, many, many metrics, we could not find emergent abilities. But on a small subset, to be specific, four of these, we found emergent abilities.

Starting point is 00:26:16 That's what this little pie chart shows. So long story short, it seems like the metric is playing a fundamental role in producing these emergent abilities. And lastly, what we did is we induced emergent abilities in networks. So what we did is we did the simplest possible thing. We took a shallow, nonlinear autoencoder and trained it on CIFAR-100. Everybody has done this in their intro to machine learning class. And what we did is we plotted the squared reconstruction error as a function of the number of

Starting point is 00:26:42 parameters. And this looks very smooth and predictable; everybody has seen this. But if we define a discontinuous metric, so here the model scores one if the reconstruction error is below some threshold, then you find very, very unpredictable behavior. And so even in a shallow nonlinear autoencoder, we can again qualitatively produce what seems to be an emergent behavior. There's two takeaways. One is for emergent abilities, it might be, in certain cases, the researchers' analyses

Starting point is 00:27:14 that have produced these phenomena. That's why we call it a mirage. There's a more general lesson that I want to leave you with. The more general lesson is that if you want to predict changes in model capabilities with increasing scale, you need to consider the interplay between known scaling properties, the amount and quality of evaluation data, and the specific metrics and evaluation processes that you have available. So with that and with gratitude to all my collaborators and everyone here for attending. Thank you. So for the purposes of this episode, we actually tried to do interviews at the poster sessions for each paper,

Starting point is 00:27:52 but some we just didn't manage to find. Or for the case of the Emergent Mirage paper, it was just way too popular. There were just so many people crowding around and listening to Rylan explain his paper again and again that we just couldn't get a proper question in. And I have to say, if I'm allowed to be a little bit critical, I'm a bit puzzled as to why this paper was the best paper. I mean, it's a good paper, but it doesn't really deny the existence of emergence.

Starting point is 00:28:18 It just pointed out some methodological disagreements, which Jason Wei has also responded to. In other words, I don't really know if this paper affected literally anything in the field, so I don't know why it's Best Paper and not just a regular paper. But it's still a notable paper for sure, and it's very well done. Next, we have the runner-up for Best Paper, which is Direct Preference Optimization, which is a direct challenger to PPO, and you can hear directly from the authors. Hi, everyone. My name's Eric, and I'm here with Rafael and Archit, and today we're going to talk about direct preference optimization, which is this algorithm that simplifies RLHF,

Starting point is 00:29:02 which is this algorithm framework that has sort of been taking the LLM world by storm recently. So to start, why are we even talking about reinforcement learning for language models? Now, it's not the first time people have been studying reinforcement learning in the context of language models. But the sort of simple answer to this question is that a few years ago, GPT-3 came onto the scene and it was sort of a big deal and you probably, well, I'm an LLM person, but you probably heard from a lot of your researcher friends like, did you hear about this new model? And then last year, ChatGPT came on the scene and it was more like, at least I was like getting texts from my grandmother saying like, hey, have you seen this new model, right?

Starting point is 00:29:42 And these are just like two different levels of, you know, permeation in the public consciousness. So, you know, what is the difference between these two models? And really, the main sort of key ingredient is this reinforcement learning from human feedback framework, which lets us sort of align the behaviors of the models more towards what people kind of want or expect. Okay, so to give a little bit of an overview of what sort of the existing RLHF pipeline looked like, kind of when we started working on this project, so there are basically two main steps. So the first step is we're going to start with some behavior-cloned policy, what we call pi theta SFT here, so the supervised fine-tuned policy.

Starting point is 00:30:25 We're going to sample pairs of responses or trajectories from this policy conditioned on a prompt X, and that's how we're going to gather this data set of preferences. So we'll have an X and we'll have two Ys, and a human is going to just label which Y they think is better. So they're just going to give us this binary preference pair over responses, and we're going to use this data to fit a reward model. And then in the second step, we're just going to optimize a policy to maximize rewards. So that's just RL. Okay?

Starting point is 00:30:53 So to look at this a little more closely, in this first step, we get this feedback. It's these triples of a prompt and two responses. One is sort of the winner and one is the loser. And we're simply going to train a reward model with this binary classification loss on the preference data. So this is this Bradley-Terry model of discrete choice in humans from the 50s. But, you know, it has some nice properties, and it's relatively simple to understand, and we use this to fit this reward model. So we're just taking the difference in the rewards,

Starting point is 00:31:21 and we have this sort of Boltzmann rational model here that we're fitting with maximum likelihood. Okay, and so now that we're done with this first step, what are we going to do with this reward model? Well, we're going to try to find a policy achieving high reward. And so, you know, ideally this reward model, after we've done this supervised learning stage, should represent goodness according to what humans want.
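
The reward-model step described above (take the difference of the two rewards and fit it by maximum likelihood under the Bradley-Terry model) can be sketched in a few lines. This is a minimal illustration, not the speakers' implementation: the function name `bradley_terry_nll` is hypothetical, and the rewards are passed in as plain floats rather than produced by a neural network.

```python
import math


def bradley_terry_nll(reward_winner: float, reward_loser: float) -> float:
    """Negative log-likelihood that the winner is preferred over the loser.

    Under the Bradley-Terry model, P(winner > loser) = sigmoid(r_w - r_l),
    so the loss for one labeled pair is -log sigmoid(r_w - r_l).
    """
    diff = reward_winner - reward_loser
    # -log sigmoid(diff) == softplus(-diff), written in a numerically stable form.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


# Only the reward difference matters: shifting both rewards changes nothing.
loss_a = bradley_terry_nll(2.0, 0.5)
loss_b = bradley_terry_nll(12.0, 10.5)

# A reward model that ranks the pair correctly gets a lower loss.
loss_right = bradley_terry_nll(1.0, -1.0)
loss_wrong = bradley_terry_nll(-1.0, 1.0)
```

Because the loss depends only on the difference between the two rewards, any term that is the same for both responses to a prompt drops out, which is the property the DPO derivation later relies on.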

Starting point is 00:31:45 And so we're just going to fit a policy that both achieves high reward but also stays close to our original model, our reference model, or our supervised fine-tuned model. And so that means we are going to try to find a policy here, pi-theta, that generates samples that achieve high reward under our learned reward model, but also stays close to our original model, our reference model, because if you remember, we actually fit our reward model on samples that were annotated by humans, but these samples were generated by our reference model, our supervised fine-tuned model, right? So we don't want our policy to drift too far away because, you know, we want to stay in the regime where our reward

Starting point is 00:32:22 model is actually reliable. Okay, so now that we, you know, have this objective, we take some off-the-shelf RL algorithm. Typically, it's PPO, and we find a policy that optimizes these rewards. This is a very complicated procedure, so there's this nice figure in this recent paper, showing sort of the full pipeline of just the PPO step, and there are a lot of moving pieces here. And so in light of sort of this complexity, we kind of set out to see if there's some way we can sort of use the structure of this problem to simplify things. All right. So how the heck do we solve this optimization without reinforcement learning, or what we call direct preference optimization? Really the key here is that the optimization that was set up for RLHF has a closed-form optimal solution.

Starting point is 00:33:10 Now, this may look a bit intimidating, but it's really just the reference distribution reweighted by the exponentiated reward. So if you have a good completion, you want to put more probability mass on it, and if you have a bad completion, you want to put less probability mass on it. This may look familiar; it's the Boltzmann distribution that you might have seen earlier, and it's very commonly used across machine learning and physics. But the key takeaway here is that every reward function r will induce an optimal policy pi_r. But there's a very nice way to view this identity through another perspective where we express the reward model in terms of the policy itself. So r_pi(x, y) can be written as beta times the log ratio of pi to pi_ref, plus beta times the log partition function of x. And this really is the key, where every policy pi is optimal for some induced reward model, r_pi. And this really is the key to DPO because our key idea here is that you can fit this reward model

Starting point is 00:34:06 parameterized as a beta-log ratio to the preference data and hopefully skip the RL process altogether. But the problem is that this log partition function is basically intractable, as you have to sum over all possible completions for a given instruction. So how do we get away from this? Now, fortunately for us, the reward modeling loss that we looked at, the Bradley-Terry loss, only depends on the differences in the reward, specifically the reward for the preferred completion minus the dispreferred completion's reward. Now if you look at the induced reward difference, and if you plug in the DPO parameterization here, you can see that it only ends up depending on the DPO reward for the preferred completion minus the DPO reward for the dispreferred completion. Now the more important thing here is that the partition function, which only depends on the instruction x, cancels out, as it only depends on the prompt. And this really is the key part here. And if you plug in this difference of rewards in the classification loss,

Starting point is 00:35:12 you get the DPO loss function. And really, in its essence, it's just a classification loss with a specific reward parameterization, which will give you the optimal policy for the original RLHF objective. So to go back to what Eric presented earlier, RLHF is typically a two-step process. You first fit a reward model and then you do some RL on top of it.
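
For readers following along without the slides, the chain of steps described in the talk can be written out as below. The symbols $\beta$, $\pi_{\mathrm{ref}}$ and $\pi_\theta$ follow the speakers' wording; $Z(x)$ for the partition function, $\sigma$ for the logistic sigmoid, $\mathcal{D}$ for the preference dataset, and $y_w$, $y_l$ for the preferred and dispreferred completions are notational assumptions made here.

```latex
% KL-constrained reward maximization (the RLHF objective)
\max_{\pi}\;
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}\bigl[ r(x, y) \bigr]
  - \beta\, \mathrm{KL}\bigl( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \bigr)

% Closed-form optimum: the reference distribution reweighted by the exponentiated reward
\pi_r(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\Bigl( \tfrac{1}{\beta}\, r(x, y) \Bigr),
\qquad
Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Bigl( \tfrac{1}{\beta}\, r(x, y) \Bigr)

% Rearranged: the reward expressed through the policy it induces
r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% Bradley-Terry depends only on the reward difference, so \beta \log Z(x) cancels
p(y_w \succ y_l \mid x) = \sigma\bigl( r(x, y_w) - r(x, y_l) \bigr)

% Substituting the reparameterized reward gives the DPO loss
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \Bigl[ \log \sigma\Bigl(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \Bigr) \Bigr]
```

The only intractable quantity, $Z(x)$, appears identically in both reward terms for a given prompt, which is why it never has to be computed.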

Starting point is 00:35:38 Really what we are doing here is that we choose a specific parameterization, the DPO parameterization for the reward model. We're still fitting the reward model exactly the same way, but you get the optimal policy in the process, and you don't have to do step two at any point in time. It's pretty useful to look at the DPO loss function through its gradient as well. Just to recall, it's still a classification loss.

Starting point is 00:35:58 Nothing changed in the two slides. And you're trying to maximize the difference between the rewards. But the gradient is really intuitive. Specifically, what we're trying to do is increase the log probability of the chosen completion, and we're trying to reduce the log probability of the rejected completion. The important part here is that we slow down the training on the preference

Starting point is 00:36:18 pairs where the induced reward model is already pointing in the right direction, so you're not overfitting to the examples over and over again. But overall, it's really intuitive as you're just going up on the good examples and down on the bad examples. And finally moving to our experimental results, the first thing we really wanted to evaluate is how good of an optimizer it is for the core objective of the reward versus divergence trade-off for these language models. So we started with this synthetic experiment where the goal is to generate positive movie reviews on this IMDB dataset with a small GPT-2 base model. We created synthetic preferences by sampling several times from the base model and using a pre-trained score classifier to construct synthetic feedback pairs. Kind of immediately the first thing we see is that DPO provides the best reward-KL trade-off.

Starting point is 00:37:11 And PPO, although it improves quite a bit, doesn't quite match that efficiency of optimization, even when we provide it with the ground truth scoring model that generated the preference data. And in addition, other sort of algorithms that are RL-free and avoid the reward-modeling approach, such as just fine-tuning on the preferred answers or things like that, either don't produce the same level of improvement or are unstable. We then decided to try to scale these results up to harder, more involved problems. The first thing we did is this summarization task.

Starting point is 00:37:48 The goal is to provide summarizations of some Reddit posts, and a dialogue task on the Anthropic Helpful and Harmless dataset, both publicly released datasets. And kind of again, what we see there is that across the board, DPO either matches or outperforms all other baselines. And particularly, for example, in the summarization case, the PPO model is almost twice as big. So another interesting experiment that we ran recently is evaluating the generalization capabilities of the DPO policy

Starting point is 00:38:20 because essentially the PPO-trained approaches sample a lot of additional data and have the capability to train on a lot of additional data, while DPO is only using the offline data set of preferences. So what we did here is we took the summarization models that we presented in the previous slides. Those are the first two graphs on the left, sampled at different temperatures and evaluated within distribution, and as you kind of see, within distribution they're quite comparable. And then we evaluated them on out-of-distribution data,

Starting point is 00:38:51 particularly summarization of news, CNN and Daily Mail articles. And we do see quite a significant drop when we take these models out of distribution, but the interesting thing is that the DPO policy still generalizes just as well or even perhaps better than the PPO-trained policy, even though the PPO-trained policy is trained on a lot more additionally sampled data. However, I think the strongest sort of validation of this algorithm and its capabilities

Starting point is 00:39:17 are the strong open source models that have been trained by the community, and this is only a selection of those. There are others we couldn't fit on the slide. And if you go through all of them, you see that especially some of the recent ones do match or sometimes even outperform ChatGPT on some broad benchmarks. Another point to mention here is that this is only within the language domain, but recent works have done this, training state-of-the-art text-to-image models with the DPO algorithm, using it for vision-language models and also using it for

Starting point is 00:39:48 multi-step control as well. So this is going beyond language and is becoming kind of a paradigm of alignment. So in conclusion, I want to point out that DPO removes the complicated, expensive RL training loop from RLHF. It's simple, stable, and computationally cheaper than PPO, I think almost, you know, an order of magnitude. And most importantly, it's also principled. You're optimizing the exact same objective. It's not a hack. It's optimizing for the exact same thing.

Starting point is 00:40:16 And yeah, as you've seen, others are training, you know, a lot of state-of-the-art models. We've been achieving pretty strong results, so you should do as well. If you want to learn more about it, you can come talk to us at our poster, and we have publicly released our code implementation, which you can find on GitHub, and you can check our paper on arXiv as well. Thank you very much. So DPO is interesting because it promises to be simpler than PPO.

Starting point is 00:40:43 It's definitely easier and cheaper to train, and there are a bunch of models already emerging that are trained with it. The main criticism that people seem to have is that it isn't performing as well in terms of alignment or results or benchmarks as PPO-trained models. But it still remains to be seen whether that ease of use and cheapness of availability of data, whatever, makes it so much better that it doesn't actually matter. So what happens at NeurIPS is that some papers are selected for oral sessions and then everyone heads down to the poster hall where there's about 600 posters simultaneously presenting, including

Starting point is 00:41:18 the people from the oral sessions. And this is what we did. We went down to talk to the paper authors after their oral session. So we're going to hear them re-explain DPO in four minutes and then answer a bunch of Q&A. But you can also get a sense of how chaotic and noisy it is in that poster session. It's just a mess and I love it. I'm talking about direct preference optimization here.

Starting point is 00:41:39 RLHF is really cool. You get ChatGPT from GPT using RLHF. If you've never heard of ChatGPT, you might want to look it up. It's really important. RLHF is complicated. It's really hard. You start with like preference data distribution. You usually have to do some kind of RL process on top of it. And RL is hard to implement because it has a lot of moving components.

Starting point is 00:42:00 You have to sample the model a lot. You have to train a value function. You have to do a lot of magic decree to get it to work. Our hope was that like, can we make this simpler, and that's where we designed DPO. Just to give a brief overview of RLHF, it starts off with some distribution or some model that you have already trained which is usually reasonably good. I'm thinking of GPT-3, which is already pretty good. They collect some preference data on top of it, so you have an instruction, two pairs of completions, and the human labels which one is preferred and which one is dispreferred. With this

Starting point is 00:42:34 preference data, you first fit a reward model. The reward model is basically telling you that the preferred completion should have a higher reward and the dispreferred completion should have a lower reward. And this is a simple classification problem. It's very straightforward. Now given this reward model, you want to do RL on top of it. So like you want to generate completions which are good. And the way you set it up is you maximize the expected reward under a KL constraint

Starting point is 00:43:01 to the initial distribution that you started with. Now why the KL constraint? The models can degenerate very, very quickly. And usually what you want to do is stay close to these models so you don't degenerate and you do not exploit the reward model. The reward models are trained on a very small amount of data and these are very easy to exploit. So that's where the KL constraint is important. This is a traditional RLHF pipeline. This is exactly what was used for ChatGPT, initially at least, and it's very complicated to do

Starting point is 00:43:29 with PPO, or like it's hard to get it right. Now, our contribution is direct preference optimization. And the way this works is that it turns out for this optimization, there is an exact optimal solution. This optimal solution, if you have seen the Boltzmann distribution before, is very simple. You take a reference distribution, you upweight the good things by exponentiated reward, and you downweight the things by exponentiated reward which are bad. So it's just the exponentiated-reward weighting of the reference distribution.

Starting point is 00:43:59 Now, unfortunately, this is intractable. Why? Because the partition function is intractable. So you cannot actually compute this distribution. But as it will turn out, this is not going to matter. So our main contribution is that you can actually rewrite the reward in terms of the policy itself. So simple algebra, you write the reward in terms of beta log pi by pi ref. This is just simple algebra here. Take your time, just look at it for a second.

Starting point is 00:44:26 You're just rearranging terms. But the thing is that you still have a beta log partition function, which is just simply intractable. Now the key thing is we can fit this reward using the same classification loss that we were using earlier over here, and the nice thing is it depends upon the difference between the reward for the good completion and the bad completion, and the partition function actually cancels out. If you look at the partition function, it only depends on the instruction. So it only ends up depending on this quantity, and this is exactly how you get the DPO loss. You're

Starting point is 00:45:00 plugging in this like implied reward function into the classification loss and you get the DPO classification loss, which is directly in terms of your policy that is being fine-tuned. So you no longer need to do an explicit reward model where you're learning a different reward model. You do not have to do any RL optimization after that. What you're doing is exactly you're fitting this reward model and you immediately get the optimal policy for that reward model without doing any RL. And that's like the main pitch for DPO. Any questions? Anything I can explain further?
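
The "plug the implied reward into the classification loss" step can be sketched for a single preference pair as follows. This is an illustrative sketch rather than the authors' released code: the function name `dpo_pair_loss`, the scalar log-probability inputs, and the default `beta=0.1` are all assumptions made here. In practice the four log-probabilities would be summed token log-probs from the policy being fine-tuned and from a frozen reference model.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_pair_loss(policy_logp_chosen: float, policy_logp_rejected: float,
                  ref_logp_chosen: float, ref_logp_rejected: float,
                  beta: float = 0.1):
    """Return (loss, gradient_weight) for one (prompt, chosen, rejected) triple."""
    # Implied rewards: beta * log(pi / pi_ref). The beta * log Z(x) term is
    # identical for both completions of a prompt, so it is simply omitted.
    reward_chosen = beta * (policy_logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = reward_chosen - reward_rejected

    # Bradley-Terry classification loss: -log sigmoid(margin).
    loss = softplus(-margin)

    # sigmoid(-margin): the factor that scales the gradient. It is close to 0
    # when the implied reward already ranks the pair correctly and close to 1
    # when it ranks the pair the wrong way round.
    gradient_weight = math.exp(-softplus(margin))
    return loss, gradient_weight


# Policy identical to the reference: margin is 0, loss is log 2, weight is 0.5.
loss_init, weight_init = dpo_pair_loss(-12.0, -15.0, -12.0, -15.0)

# Policy has already shifted mass toward the chosen completion.
loss_good, weight_good = dpo_pair_loss(-8.0, -20.0, -12.0, -15.0)
```

The second return value is only there to make the gradient intuition from the talk concrete: pairs that are already ordered correctly contribute less to the update.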

Starting point is 00:45:32 You don't have to learn any reward. But this thing, you can extract the actual cost. You don't have to, but the policy already implies a reward. Yes, exactly. Is that make sense? I don't mean the . How do it? Yes.

Starting point is 00:45:45 Not the action, but this is this specific reward model. What about the data collection aspect? Sorry. What about the data collection aspect of RLHF? That's a great question. So people usually sample more completions online, and you don't have to do any of that.

Starting point is 00:46:02 You only have to sample the preference data set in the beginning, which we use for . How do you know that your preference data set is as good as? We use the exact same preference data set for RLHF and for DPO. It's like a mathematical shortcut. Yes. I assume. Like they created a new loss function.

Starting point is 00:46:23 You train this model on some data distribution, but when you explore, it might go out of distribution. Yes, yes, yes. It kind of limits the policy. Yes, exactly. So that is the major reason what drop is? In general, PPO also has a high variance estimator, so the optimization is never perfect, whereas with DPO, you know for a fact that it's an optimal policy. So it's very, very similar, like, you know for a fact that it's optimal.

Starting point is 00:46:46 But in general, like, if you have a very well-tuned PPO pipeline, it will usually work reasonably similarly. But yeah, you don't have to do any of that. Essentially, one of the things is that... This is not an assumption. This is the actual solution. This is not an assumption. It's generic in terms of mathematical form. But I was under the word, for example, does it match the definition of reward?

Starting point is 00:47:16 Because you could write any exponential function here and it's been called reward, but does it match the reward definition? In this optimal solution, you assume there's a reward function that has been given to you. Oh, yeah, yeah, yeah. The sequence of actions, that's a constant times the log ratio of some and I mean, overall if I look at the experiments, let's look at the real world data sets.

Starting point is 00:47:41 Like, I mean, we tried out like summarization, like single-turn dialogue, and it all works great. You never had to do like any online exploration of any form and like, DPO reliably works better than PPO or very similarly to it. I think... Can I, what's the methodology, Nick, you take a base model?

Starting point is 00:47:58 and you fine-tune it with DPO? So we take the same base model. We have the same preference data set. First you fit a reward model for PPO and then you do RL for it. Okay, so completely comparable. Yes. In general, like, I mean, we tried to reuse people's already pre-trained models for RLHF, but we looked at their pipeline, it was exactly the same.

Starting point is 00:48:21 Because if we do it, like, there's always a case that it's possible that we didn't tune it well enough. So, like, we tried to, like, take models already trained using RLHF, and try to compare to them directly. But they're trained on the same datasets. Very strong models have been trained using DPO. They're already being used. Yeah, Zephyr is the one I know about. Two Mixtral models, if you have looked at them,

Starting point is 00:48:42 were trained using DPO as well. Oh, that's the Mixtral Instruct? Yes. Okay. They were trained using DPO as well. So if you guys... That came out very recently. Yes, that's why it's not on the poster, but like, I mean,

Starting point is 00:48:53 you guys, if you're thinking of fine-tuning using preferences, you should try to use DPO. How much is the efficiency gain compared to a PPO process? A lot, because you only have to do one step. It uses the same set of preferences. It's usually the one-fifth. So, like, basically, no trade-offs? I'm looking for trade-offs.

Starting point is 00:49:16 I cannot find any. More research needs to be done. There are arguments to be made that PPO might do better in some cases, but it's unclear. Like, we haven't personally seen any evidence yet. I see, I see. Sorry, one more question before. Go for it, yeah.

Starting point is 00:49:31 I noticed Chelsea Finn's a co-author. What guidance has she given? I'm curious. I mean, look, we're all in her lab. She's the one who selected us. She's the one who's providing the infrastructure. Yeah. Like, I mean, none of this would be possible without her.

Starting point is 00:49:45 I'm just curious, like, is there any, like, interesting stories, any, like, good advice that she gave that, like, really inspired you that you want to pass on to others? When we started discussing the idea with her, she was very insistent that you should try to push this because this is a nice idea. But if you sit on it, somebody might do it or it might fade into irrelevance. This paper came about in three weeks before the NeurIPS deadline. So we had to push really hard. And how did you come up with the idea?

Starting point is 00:50:14 You said. I mean, we were looking at this kind of equation before. Rafael did a bit of algebra and said, oh, maybe we can just completely skip the RL part if we like look at this thing. Like, I mean, we're playing around, generally speaking, like, there's a reward estimation step. Whenever you're learning three things in a sequence, if you can statistically remove one of the steps. Yeah, you gain a lot. Yeah. So that's where the motivation usually comes from.

Starting point is 00:50:40 Has John Schulman commented on this? Yes. What did he say? I mean, he tried it. He said it works. But there's some questions about, like, they might be training their reward models on more than binary pairwise preferences. So it's not immediately clear how to extend that using DPO. Like multiple choice?

Starting point is 00:50:59 It's unclear. They obviously did not tell me what they're doing. But they're training on more than just pairwise preferences and they might still want to do RLHF. You can decompose most things into pairwise. Yeah, that's kind of what I assume, but I don't know what exactly they're doing. So there's a situation where they might be conditioning the reward model on something more than what your policy is conditioned on. That means my Rappell-Y and X is a positive.

Starting point is 00:51:23 That's all I got. Thank you. The other best paper runner-up that we'll talk about is Scaling Data-Constrained Language Models. In other words, the Datablations paper. And this is a scaling laws paper kind of in the vein of the Chinchilla paper, but done with a different assumption in mind. Instead of holding compute constant or holding parameter count constant, here we are running into the real-world problem of data constraints. So given that you have a fixed amount of data, what should you do to pre-train your models? This kind of paper tends to be a very expensive paper to write, just because you have to do so many ablations.

Starting point is 00:51:56 Here it's notable that Hugging Face has created this and open sourced both models and datasets. So kudos to Hugging Face. Hi, I'm Niklas, and I'm presenting Scaling Data-Constrained Language Models. The premise for this work is that we are data-constrained. Here's a plot from prior work that estimates that given their definition of high quality language data, we're going to be exhausted next year. And what they mean with high quality language data is data such as papers and books.

Starting point is 00:52:25 There are other sources like code, however it's unclear how useful it actually is for large language models. And for low-resource languages, we are already severely data-constrained. The first solution we investigate is simply repeating data. It's important to mention here that, while it's pretty common to train for multiple epochs

Starting point is 00:52:45 in most machine learning problems, for large language models, this has been very uncommon. In GPT-3, they write that data are sampled without replacement. In PaLM, they say that they explicitly avoid repeating data in any subcomponent, and there was other work explicitly recommending against repeating any data when training large language models. So we ask, is it really that bad?

Starting point is 00:53:09 To answer this question, we have three different setups. We start by simply training for a single epoch. Here, this is your usual training graph where we have the validation loss on the y-axis and the training tokens on the x-axis. And for all of those setups, there's nothing special here: loss improves as we increase training. Now what happens if we train for two epochs? Notably the performance is around the same. So here only half of the data is unique and it has to be repeated twice. So for the setup on the left, 28 billion tokens are unique

Starting point is 00:53:39 and they're repeated for two epochs. Three, four, and it's still pretty similar, however eventually it starts to diverge. So we shouldn't train for too many epochs. At 44 epochs, literally just 1/44th of the data is unique and repeated 44 times. So that's like billion unique tokens for the setup on the left, and that obviously isn't very good. However, for a few repeats performance is very similar, suggesting that we can scale a lot further with existing data constraints by simply repeating for large language models. This naturally leads to the question: how should we allocate compute when we are in that repeated regime? A quick reminder from last year: Chinchilla told us that we should, when we're not

Starting point is 00:54:22 repeating data, so in the single-epoch regime, we should scale model size and training data in equal proportions. What does it look like when we're repeating? To investigate this, we train on 100 million unique tokens and vary the model size and the number of epochs over those tokens. Each model is depicted as one of those dots. And as we go towards the upper right, so more parameters and more epochs, loss improves as indicated by the contours.

Starting point is 00:54:51 We put forth scaling equations to exactly predict this change in loss and how you should allocate when you're in that repeated regime. They're depicted on the right. Now if we add in the efficient frontiers, the Chinchilla scaling laws efficient frontier extrapolated to multiple epochs corresponds to the dashed line. So here's just an equal scaling of parameters and equal scaling of epochs. However, our fit suggests that data should be scaled faster when we're in that repeated regime. And this is seen by the

Starting point is 00:55:21 line branching off below, which eventually just fades away because at some point you can't get more value out of your data. Especially with just 100 million tokens, at some point you're just running out of value in those few tokens. Now we test our predictions at scale. Here we have two models, one allocated according to Chinchilla scaling laws,

Starting point is 00:55:45 and one allocated according to data-constrained scaling. The one on the top is Chinchilla, and the one on the bottom, indicated by the red star, is our allocated model. They both have the same number of FLOPs and the same data budget of 25 billion tokens, and we see that by training with fewer parameters for more epochs, so 6.3 billion parameters and 9.7 epochs, or 242 billion tokens, we get a better

Starting point is 00:56:11 loss. But not only loss: we also test this in terms of downstream performance and get better downstream performance, as indicated by the column towards the right. This was repeating, and now we're going to look at complementary strategies to solve data constraints. One intuitive strategy is making use of that code data that we saw earlier. So can we simply fill up the missing data with code from GitHub? In addition, we evaluate filtering strategies. Specifically, we look at fuzzy deduplication and perplexity filtering. The idea here is: can we use a quality filter and then repeat to get better

Starting point is 00:56:56 performance than with the initial data set? Here are the results. On the y-axis, we have the average performance across 19 natural language tasks. On the x-axis is the data budget. So towards the left we have 100% of available data so we don't need to use any of those strategies. But as we go to the right, our data budget is smaller and smaller and we need to repeat data or fill the missing data with code. Starting with the purple line, we can confirm our findings from earlier that also in terms of downstream performance roughly four epochs seems like a good trade-off. So at 25% data budget we have to repeat four times corresponding to

Starting point is 00:57:35 four epochs. And then eventually if you train for too many epochs, it drops quite a bit, so you have to be careful with repeating. The red line corresponds to filling missing data with Python code. Similar to the repeating line, we see that we can make up for a lot of natural language data with code without a drop in natural language performance. So these are all natural language tasks and it seems like coding data is helpful for some of them. We even see spikes on some of these tasks as soon as code is added. Finally we investigate the filtering strategies. We find that quality filtering then repeating

Starting point is 00:58:10 can be much better than the data set to start with. So here the yellow star at the top corresponds to perplexity filtering and then repeating for two epochs. The orange star corresponds to fuzzy deduplication towards the right. And we find that you have to be careful with too much de-duplication, because it can lead to a worse model by limiting your available data.

Starting point is 00:58:38 Now I'll go through the takeaways. The first takeaway is that repeating data is generally fine. For many setups, roughly four epochs seems to provide a good trade-off. However, there are diminishing returns and you have to be careful with too many epochs. Next, adding code data is fine, even if you're only interested in natural language tasks. We find that 50% provides a good trade-off for most setups. Finally, quality filtering plus repeating can be a good strategy and is often much better than the data set you started with.
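
The epoch arithmetic used throughout this talk is easy to reproduce. The sketch below is only bookkeeping, not the paper's fitted scaling equations; the helper names are hypothetical. The 25 billion unique tokens, 242 billion training tokens, 28 billion unique tokens over two epochs, 44 epochs, and 25% data budget are the figures quoted in the talk, while the 56 billion total is derived here as 28 billion times two.

```python
def epochs_over_unique_data(total_training_tokens: float, unique_tokens: float) -> float:
    """How many passes over the unique data a given token budget implies."""
    return total_training_tokens / unique_tokens


def unique_fraction(epochs: float) -> float:
    """Fraction of the training tokens that are unique when training for `epochs` epochs."""
    return 1.0 / epochs


def epochs_for_data_budget(data_budget_fraction: float) -> float:
    """Epochs needed to keep total tokens fixed when only a fraction of the data is available."""
    return 1.0 / data_budget_fraction


# 28B unique tokens repeated twice gives a 56B-token run (derived, not quoted).
two_epoch_run = epochs_over_unique_data(56e9, 28e9)

# The data-constrained allocation: 242B training tokens over 25B unique tokens.
large_run = epochs_over_unique_data(242e9, 25e9)

# At 44 epochs only 1/44 of the tokens seen are unique.
fraction_at_44 = unique_fraction(44)

# A 25% data budget means repeating four times.
epochs_at_quarter_budget = epochs_for_data_budget(0.25)
```

Rounded to one decimal, `large_run` matches the 9.7 epochs mentioned for the data-constrained model.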

Starting point is 00:59:09 Because the penalty from repeating is often much smaller than the additional gain you can get from quality filtering. And finally, I wanted to finish off with some other work that has made use of these findings in their large language model training. So at the top, we have FinGPT, a large language model for Finnish, where they only had 38 billion unique tokens, and they had to repeat them for eight epochs in order to be able to train a reasonable large language model with 13 billion parameters. And there are several more that have made use of these findings. The finding that training up to four epochs is almost as

Starting point is 00:59:46 good as getting new data is pretty surprising and actually directly counters a very famous paper called One Epoch Is All You Need. I actually ran into Aran Komatsuzaki at the decibel party. And it's just surprising at this stage in ML that we still don't know the answers to some very basic questions around how many epochs we should train on a dataset. I mean, I still think that we are surprisingly sample efficient. You know, the consensus is now between one and four epochs, in some cases maybe up to eight, but more importantly than that, I think this work is notable because it is the best example of what open source AI research should look like, and of course it's from Hugging Face. If you go to the GitHub repo,

Starting point is 01:00:25 you can see not only their papers, but also very, very well documented code showing exactly what they did and how they got their results, including the dataset filtering. So just exemplary work of open source AI and no surprise that they won one of the best paper awards. However, I did not manage to catch up with them for a post-presentation interview, but I did go straight to the next session on QLoRA with Tim Dettmers. I'm Tim. Today I present QLoRA, efficient fine-tuning of quantized large language models.

Starting point is 01:00:56 Language models have gotten a lot bigger and a lot more powerful, but they have become so big that it is actually quite difficult if you take a pre-trained model and you want to fine-tune it as sort of a normal researcher. Often you now need a big GPU server, and most researchers don't have that. So with QLoRA, what we worked on is reducing the memory requirements, so that everybody can fine-tune large language models.

Starting point is 01:01:19 The main contribution of QLoRA is we compress neural networks to 4-bit, and we develop a new data type, 4-bit NormalFloat, that can replicate 16-bit performance, even though we compress the neural network to 4-bit. Before I talk about QLoRA, I give you a little bit of background. So this work is about quantization, about compression. So we do quantization, for example, if we have a 32-bit float number, and we want to quantize it

Starting point is 01:01:45 to a 4-bit integer. In this diagram, I have a histogram, which is equivalent to an Int4 quantization with 16 different bins. And in red, I have the normal distribution. And if we want to quantize all the values in the normal distribution to a 4-bit integer, we need to reduce all these values to 16 different values. How do we do that?

Starting point is 01:02:07 We find the empirical minimum and maximum range of the distribution. And then we slice this distribution in 16 different slices with equal width. Each of these slices is a quantization bin, and all the values of the normal distribution contained in this bin are quantized to the middle value of the bin. With that, we can reduce all the values

Starting point is 01:02:29 in the normal distribution just to 16 different values. And this is Int4 quantization. Now, if we do other quantizations with other data types, we have different ranges. And so what I do in my work is I generalize these data types by normalizing the range the data types take to the range -1 to 1. This approach is also called a codebook,

Starting point is 01:02:50 where you map an index to particular values in the data type. And so if we have this codebook, there's a two-step recipe for how we can quantize any tensor. And so we take the tensor X, then we normalize it into the range -1 to 1 by dividing by the absolute maximum value, and then we go through each element in the tensor

Starting point is 01:03:14 and find the closest value in the data type. We do that by doing a binary search on the sorted values in the data type, and with that we can then quantize the entire tensor. Just to make this a little clearer, here is an example. This is a very unusual 2-bit data type. It has the values -1.0, 0.3, 0.5, and 1.0. The input tensor is 10, -3, 5, 4. And now let's go through the steps of the recipe. So first we find the absolute maximum value, which is 10.

Starting point is 01:03:46 We divide by it. We get 1, -0.3, 0.5, 0.4. And then, for each element, we find the closest value in the data type. We get 1, 0.3, 0.5, 0.5. Then we find the associated index of these values. And this is now a 2-bit representation. Now we can store it and it's compressed.

Starting point is 01:04:08 If we want to dequantize these values, we just do all the steps in reverse. So we look up the associated values in the data type, and then we denormalize by multiplying by the absolute maximum value of 10. It gives us 10, 3, 5, 5. And so if we compare input and output tensors, what we see is that we have two big errors.
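The two-step recipe and the worked example above can be sketched in a few lines of Python. This is a minimal illustration and not the bitsandbytes implementation: the names `quantize`, `dequantize` and `CODEBOOK` are hypothetical, the tensor is a plain list that is assumed to contain at least one non-zero value, and ties (0.4 is mathematically halfway between 0.3 and 0.5) are broken toward the larger value to match the example in the talk.

```python
from bisect import bisect_left

# The unusual 2-bit data type from the talk, sorted so we can binary search it.
CODEBOOK = [-1.0, 0.3, 0.5, 1.0]

def quantize(tensor, codebook):
    """Return (indices, absmax): one codebook index per element."""
    absmax = max(abs(v) for v in tensor)  # assumed non-zero
    indices = []
    for v in tensor:
        x = v / absmax                    # step 1: normalize into [-1, 1]
        pos = bisect_left(codebook, x)    # step 2: binary search
        if pos == 0:
            idx = 0
        elif pos == len(codebook):
            idx = len(codebook) - 1
        else:
            lo, hi = codebook[pos - 1], codebook[pos]
            # pick the nearer neighbour; ties go to the larger value
            idx = pos if (hi - x) <= (x - lo) else pos - 1
        indices.append(idx)
    return indices, absmax

def dequantize(indices, absmax, codebook):
    """Look up each index and undo the normalization."""
    return [codebook[i] * absmax for i in indices]

indices, absmax = quantize([10, -3, 5, 4], CODEBOOK)
restored = dequantize(indices, absmax, CODEBOOK)
print(indices)   # [3, 1, 2, 2]
print(restored)  # approximately [10, 3, 5, 5]
```

Each index fits in 2 bits, and the only extra value that has to be stored is the absolute maximum.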

Starting point is 01:04:27 The minus three turned into a three, and the four turned into a five. These are quantization errors. And so the main challenge in quantization research is we want to compress a neural network with a low-precision data type, but we want to keep all the quantization errors minimal. If the quantization errors are large,

Starting point is 01:04:47 we degrade the neural network performance. And we want to avoid that. And that's the main challenge. Let's talk a little bit about fine tuning. Why is it so expensive? So the best way to look at it is to look at the cost per parameter in fine tuning. And so the per parameter cost for full fine tuning

Starting point is 01:05:04 is 16 bit for each weight, 16 bit for each weight gradient, and 64 bit for each parameter if we use Adam. That gives us 12 bytes per parameter. And if you have a 70 billion parameter model, that's 840 gigabytes of GPU memory, or 36 consumer GPUs. That's a lot of memory.

Starting point is 01:05:24 If we use low-rank adapters, we get much more efficient. And so what we do there is we take a pre-trained model, we freeze it. Now we put some tiny layers on top of it, some adapters. And so if we fine-tune it, we do stochastic gradient descent through the frozen layers into the adapters, and we just update the adapters, not the main model. And so what that does is the weight still needs 16 bits per value.

Starting point is 01:05:50 But now all the other values that are updated, they're only a fraction of a bit on average. And so in total we have 17.6 bits per parameter. That adds up to 150 gigabytes of memory, which is eight consumer GPUs. Now, with QLoRA, we step in and go a step further. So now we take the pretrained model, quantize it to 4-bit, and then put adapters on top. It reduces the average footprint to 5.2 bits per parameter, which is 46 gigabytes, and that fits into two consumer GPUs.
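The per-parameter arithmetic in this part of the talk is easy to check. The sketch below uses only the bit counts quoted by the speaker (16 + 16 + 64 bits for full fine-tuning, 17.6 bits for LoRA, 5.2 bits for QLoRA) and a 70-billion-parameter model; the helper name `footprint_gb` is hypothetical, and a gigabyte is taken to be 10^9 bytes. It does not model activations or other overhead, which may be why the GPU counts quoted in the talk are a little higher than a plain division would suggest.

```python
def footprint_gb(num_params, bits_per_param):
    """Memory in gigabytes (10^9 bytes) at a given average bits per parameter."""
    return num_params * bits_per_param / 8 / 1e9

PARAMS = 70e9  # 70 billion parameters, as in the talk

full_bits = 16 + 16 + 64   # weight + gradient + Adam state, per the talk
lora_bits = 17.6           # 16-bit frozen weights plus a small adapter share
qlora_bits = 5.2           # 4-bit weights plus adapters and constants

full_gb = footprint_gb(PARAMS, full_bits)
lora_gb = footprint_gb(PARAMS, lora_bits)
qlora_gb = footprint_gb(PARAMS, qlora_bits)

print(full_bits / 8, "bytes per parameter")   # 12.0
print(round(full_gb), "GB full fine-tuning")  # 840
print(round(lora_gb), "GB with LoRA")         # 154, quoted as roughly 150
print(round(qlora_gb, 1), "GB with QLoRA")    # 45.5, quoted as 46
```

The ratio between the full fine-tuning and QLoRA figures is about 18, which is consistent with the "18 times cheaper" claim made at the end of the talk.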

Starting point is 01:06:22 Now, the main challenge is we want to preserve the performance while doing this 4-bit compression, and that is the main challenge. So we have three innovations that improve the memory performance, but then also the precision to reduce the quantization error. There's one part, paged optimizers, that I will not talk about. You can read about it in the paper. It's used to prevent memory spikes during fine-tuning if you hit a large document during your fine-tuning run. The main contribution that we have is the 4-bit NormalFloat data type. This is a data type

Starting point is 01:06:54 that's information theoretically optimal, and so you can think about it like this. So in the beginning, I showed you Int4 quantization, where the quantization bins have equal width. In the NormalFloat data type, the bins have equal area. That means each slice has equal probability mass in the normal distribution. And that means the same amount of values are quantized into each bin. With that, each bin has an equal amount of values and it's information theoretically optimal. And our second contribution is a little bit silly. It's double quantization. We do a quantization of the quantization. And so what does that look like? So in the normal quantization, we take the weight, quantize it, and now we get two pieces. The quantized weights, and then the absolute

Starting point is 01:07:38 maximum constants. We have multiple constants because we slice the weight into blocks, and each block has its own constant. And so we get a matrix of constants. On average, these are 0.5 bits, and that's multiple gigabytes of GPU memory. And now we quantize those constants again. We save about 0.4 bits on average, and that is important if we want to fit large models into consumer GPUs because otherwise they don't quite fit. And so these are the contributions. Now let's look at the results. So the main thing that we want is to replicate 16 bit performance.
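Both ideas can be illustrated with a short sketch. The first function builds a 16-value codebook whose bins have equal probability mass under a standard normal, which is the "equal area" intuition from the talk; it is a simplification and not the exact NF4 table shipped in bitsandbytes. The second part redoes the double-quantization arithmetic. The block sizes and bit widths used there (blocks of 64 weights, 32-bit constants, then 8-bit constants in blocks of 256) are not stated in the talk; they are assumptions taken from the settings described in the QLoRA paper. All function names are hypothetical.

```python
from statistics import NormalDist

def equal_area_codebook(bits=4):
    """One representative value per equal-probability bin, scaled to [-1, 1]."""
    n = 2 ** bits
    normal = NormalDist()  # standard normal, mean 0 and sigma 1
    # Quantile at the middle of each bin's probability range.
    values = [normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    scale = max(abs(v) for v in values)
    return [v / scale for v in values]

def constant_overhead_bits(block_size=64, const_bits=32):
    """Extra bits per weight for storing one absmax constant per block."""
    return const_bits / block_size

def double_quant_overhead_bits(block_size=64, inner_bits=8,
                               outer_block=256, outer_bits=32):
    """Overhead after quantizing the constants themselves (assumed settings)."""
    return inner_bits / block_size + outer_bits / (block_size * outer_block)

codebook = equal_area_codebook()
before = constant_overhead_bits()
after = double_quant_overhead_bits()

print(len(codebook), "values, densest near zero")
print(before)                     # 0.5 bits per parameter
print(round(before - after, 3))   # about 0.373 bits saved
```

With equal-width Int4 bins the outer bins of a normal distribution are nearly empty, whereas here the spacing between neighbouring codebook values is smallest around zero, where most weights sit.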

Starting point is 01:08:15 That was our main goal. And so what I have here is different LLaMA models of different sizes. And we fine-tune on the FLAN v2 instruction data set. We evaluate on MMLU accuracy. We have, in pink, the 16-bit baseline in bfloat16, and what we see now is that the regular float data type, 4-bit float, in blue, doesn't quite replicate 16-bit performance. However, if you use our NormalFloat data type, we get up to 16-bit performance. And so with that

Starting point is 01:08:49 we have now replicated 16-bit performance. In our paper, we have many more experiments that also have the same finding. But with that, now we are at the stage where we can very efficiently fine-tune very large language models with very little resources. And so now we go a step further and ask, can we build a high quality chatbot now that we very quickly can explore all possibilities with cheap fine tuning.

Starting point is 01:09:14 And so through our experiments, we ran over 1,000 experiments, we found a very good data set and built a chatbot called Guanaco, which is a 4-bit chatbot that we created by just fine-tuning on a single consumer GPU for 24 hours. And now we want to compare how good this chatbot is compared to other chatbots that are trained or fine-tuned in 16-bit. And so we have a tournament style setup where the setup is we have 80 different prompts from the

Starting point is 01:09:45 Vicuna dataset, and we give this prompt to two random chatbots, and then they compete to generate the best response. Each chatbot generates a response, and then the responses are judged by humans or GPT-4, and either humans or GPT-4 say which response is better. This counts as a game, and so we play multiple games with many random allocations of chatbots, and with that we can determine which chatbot is better than another chatbot. If we do this setup, then we find that humans think our chatbot on these Vicuna prompts is a little bit better than ChatGPT. If we ask GPT-4, it says it's about the same quality as ChatGPT.

Starting point is 01:10:28 This doesn't mean that our bot is as good as ChatGPT, but for these particular prompts it is about the same quality. On the right is also a demo. You can scan it and try our chatbot. And that's everything that I have. So just to conclude, QLoRA makes fine-tuning 18 times cheaper. With the 4-bit NormalFloat, we can replicate 16-bit fine-tuning performance, and we have also shown that you can create very high-quality chatbots with QLoRA. So with all of that, it's very simple to now create high quality fine-tuned models, and it's so cheap that everybody has access to the fine-tuning of these large models.

Starting point is 01:11:09 QLoRA is available in the bitsandbytes library, and it's also integrated in the Hugging Face Transformers stack, and so there you can very easily use it. I'm also on the academic job market, so please get in touch if you're interested. Later this week, I will also give a talk on the making of QLoRA at the workshop, so stay tuned on Twitter for more information about that. And that's what I have, and I'm happy to take questions. Thank you so much. So we're going to make a bit of a hard pivot now from the world of optimization, fine-tuning, and training methods,

Starting point is 01:11:45 into the world of multimodality, which is another big theme of this year and probably every year to come. Every previous paper we've covered on the pod up to this point I've heard of online and, you know, it's relatively well-known. You didn't actually need to meet the people to hear about them. But one of the joys of coming to a conference like NeurIPS is finding things that you may not have seen just because of your filter bubble or just because there's just way too many things out there and you didn't have the time to look into them. And this was definitely true for me for DataComp, which I never heard of, but also a very legitimate effort.

Starting point is 01:12:19 And I actually had a chat with them after their talk. But first, let's introduce what DataComp is. My name is Samir, and this is Gabriel Ilharco, and this is Alex Fang. And today we're going to be presenting our work, DataComp: in search of the next generation of multimodal datasets. And this paper was really made possible by a whole team of people, and so we're very lucky and fortunate to be able to share it on behalf of the whole team. Okay, so we want to start with a little bit of a history of computer vision models. So in this kind of traditional paradigm of image classification, what we would do is we would create a specialized data set. We'll call that a traditional supervised data set with certain class labels, for example, 10 different labels for the MNIST data set,

Starting point is 01:13:08 and then we would train these fixed models on these kinds of data sets. And this was really cool because it led to all kinds of architectural improvements. You can think ResNets, skip connections, applications of attention. But when you needed to add an additional task, say ImageNet-1K, you had to kind of create a new dataset with a new set of labels, and this was kind of a laborious process. But then right around 2021, something really cool happened. The paradigm switched a little bit to these kind of image-text data sets that allowed training these open vocabulary models.

Starting point is 01:13:48 And suddenly we could do things like train a unified model that could then downstream do arbitrary image classification tasks. And this sort of data set transition is really kind of the takeaway here. So in spite of this kind of transition between datasets, the standard machine learning pipeline actually stayed relatively consistent. So what we're still going to do is create a monolithic artifact, a data set, keep that fixed, and then iterate on model training on that data set. And this is still like a really cool recipe and it's led to progress in downstream evaluations. But what we really

Starting point is 01:14:32 ask in DataComp, and the center of our paper, is how much performance are we actually leaving on the table by adopting the standard ML pipeline? Can we actually improve models by iterating on data sets instead of on model architectures? And so fundamentally, DataComp is a benchmark for data set development to help the community understand how data set decisions improve models. So specifically, we're going to look at this CLIP training regime for these more modern image-text data sets, which are popular nowadays. And so we want to give just a brief overview of CLIP so that we're all kind of on the same

Starting point is 01:15:14 page. We roughly have a text encoder and an image encoder, and we're going to train these encoders from scratch contrastively in order to align image and text representations. And then downstream, if we have a new classification task, we're going to do things like write sentences, a photo of a plane, a photo of a car, etc., and then query an image feature against all of these text features to retrieve our class label. So kind of recentering things back to DataComp now, the picture I think that we should all have in mind is we're actually going to fix this CLIP bit, which is this middle training diagram,

Starting point is 01:16:00 and we're going to iterate on the data selection process to create new data sets to train our CLIP models. And now I'm going to hand it over to Alex. So the DataComp workflow consists of five steps. Choosing a scale, selecting data, training a model, evaluating, and submitting the results. And the first step is choosing the scale, which roughly reflects the amount of compute used. So DataComp has four scales. At the small scale, we train a ViT-B/32 for 12.8 million samples seen, which is equivalent to fine-tuning a model on ImageNet-1K. At the medium scale, we train a ViT-B/32 for 128 million samples seen, which is equivalent to training a model from scratch on ImageNet-1K.

Starting point is 01:16:47 At large, we train a ViT-B/16 for 1.28 billion samples seen, which is equivalent to training an ImageNet-21K model from scratch. And at extra large, we train for 12.8 billion samples seen on a ViT-L/14, which is equivalent to training an OpenAI CLIP model. One key design decision is that there is no constraint on data set size. We build our scale configurations around samples seen because practically speaking, the key constraints are pool size and compute. This means each data point in a data set of 6.4 million samples at the small scale is seen twice. At the chosen scale, participants can then use their data selection method on either a fixed

Starting point is 01:17:30 provided pool of raw data or are free to bring in additional data. So in the first option, which is the filtering track, participants filter from a provided raw pool equivalent in size to the samples seen at the chosen scale. Our pool, which we call CommonPool, comes from Common Crawl, and then we do minimal pre-processing such as near duplicate checking against evaluation sets and not-safe-for-work filtering. Additionally, we provide metadata to help with potential filtering approaches. This metadata includes original width and height, caption, a checksum, CLIP features, CLIP scores, and face bounding boxes for automatic blurring to help with privacy concerns.

Starting point is 01:18:11 The second option is the Bring Your Own Data track. This allows participants to use additional data sources, as well as both edit and generate images and captions from CommonPool. We hope this track supports participants whose creative approaches do not fit neatly into the filtering track, while also maintaining fair comparison within the filtering track. Next, participants use a fixed training procedure to train a model on their newly filtered data. For training, we adopt fixed training recipes, including hyperparameters for CLIP training, and this was based on prior experience.

Starting point is 01:18:45 Notably, DataComp participants are not allowed to modify these parameters, therefore focusing investigation on data set selection. In the paper, we show that better data sets are largely consistent across variations in training recipes. Once models are trained, they are evaluated using our provided script. Our evaluation suite contains 38 downstream tasks, which include ImageNet and variants, a subset of VTAB, a subset of WILDS distribution shifts, fairness benchmarks, and retrieval benchmarks. And the evaluations are done in a zero-shot manner

Starting point is 01:19:22 to remove the need for fine tuning on each individual downstream task. And the last step of the process is to submit your results. We provide an online leaderboard that participants can submit to, which we hope promotes participation and collaboration. We believe that many of these individual data filtering approaches stack and when combined will lead to better results. Next, I'll hand it over to Gabriel to talk about baselines and some new results.

Starting point is 01:19:50 All right, so let's talk about experiments now. We study many baselines in our paper, but I'll focus on the two most interesting ones in the interest of time. The first one is what we call CLIP score filtering. The idea behind CLIP score filtering is simple. We use a pre-trained CLIP model to compute cosine similarity scores for all image-text pairs in our dataset. In this plot, you can see a distribution of these scores in our dataset. We then choose a threshold for the similarity, for example, corresponding to the top 30% scores in our unfiltered pool.

Starting point is 01:20:23 We then remove all samples that have similarity smaller than this threshold, keeping only the samples with high score as a proxy for discarding all samples that we think have low quality. Another filtering baseline is what we call image-based filtering. For image-based filtering, we again use a trained CLIP model, but this time only to extract image features. We then cluster these image features and find clusters that match images on ImageNet. We keep all clusters that are assigned to at least one ImageNet image. We then discard all the other clusters.

Starting point is 01:20:57 Note that this filtering is purely based on image features, and we do not use any labels or captions for this filtering strategy. Our best performing baseline is built by intersecting the two baselines I just described, CLIP score filtering and image-based filtering. When you apply this technique to our larger pool, we find a data set with 1.4 billion samples that we call DataComp-1B. So let's see how well this works in practice.
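As a rough illustration of the two baselines and their intersection, the sketch below scores toy image-text pairs by cosine similarity, keeps the top 30%, and intersects that with an image-based keep mask. It is only a sketch under stated assumptions: the embeddings are made-up 2-D vectors rather than real CLIP features, the image-based mask is supplied by hand instead of coming from clustering against ImageNet, and the function names `cosine`, `clip_score_mask` and `intersect` are hypothetical rather than part of the DataComp codebase.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def clip_score_mask(image_embs, text_embs, keep_fraction=0.3):
    """Keep the pairs whose similarity is in the top `keep_fraction`."""
    scores = [cosine(i, t) for i, t in zip(image_embs, text_embs)]
    k = max(1, round(keep_fraction * len(scores)))
    threshold = sorted(scores, reverse=True)[k - 1]
    return [s >= threshold for s in scores]

def intersect(mask_a, mask_b):
    """A sample survives only if both filters keep it."""
    return [a and b for a, b in zip(mask_a, mask_b)]

# Toy pool: ten pairs whose captions drift further from the image (0 to 90 degrees).
images = [[1.0, 0.0] for _ in range(10)]
texts = [[math.cos(math.radians(d)), math.sin(math.radians(d))]
         for d in range(0, 100, 10)]

score_mask = clip_score_mask(images, texts, keep_fraction=0.3)
# Stand-in for the image-based filter (hand-written, not a real clustering).
image_mask = [True, False, True, True, True, False, True, True, False, True]
final_mask = intersect(score_mask, image_mask)

print(score_mask.count(True), "kept by CLIP score")  # 3
print(final_mask.count(True), "kept by both")       # 2
```

Because the final set is an intersection, it can only be as large as the smaller of the two filtered sets, which fits the talk's observation that a smaller, more aggressively filtered subset can still train a better model.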

Starting point is 01:21:27 We conducted over 300 pre-training experiments with many different strategies for filtering our pool. Our best data set is DataComp-1B, a 1.4 billion sample subset of our pool, that leads to much higher accuracy than existing data sets, including OpenAI's WIT and LAION-2B. This is the first public data set that outperforms OpenAI. Also note that all these models are compute matched, so these gains come at no extra cost at training time. One key

Starting point is 01:21:56 finding from our work is that smaller, more aggressively filtered datasets can perform better than larger datasets coming from the same pool. As you can see on the plot, when we select samples that have the highest cosine similarity according to a trained CLIP model, there's a sweet spot for the size of the data set that we keep, around 30% of the original pool. This means that you're better off using a smaller subset of the pool instead of using more, noisier data.

Starting point is 01:22:24 Interestingly, this doesn't happen when you sample randomly from the pool, as you can see from the dotted line. So you can get away with smaller data sets, but you do need to be a bit more careful on how you are selecting samples. Another key finding from our experiments is that the ranking of different filtering strategies

Starting point is 01:22:42 is relatively stable across scales. as you can see in these scatter plots. These plots show how performance on the small scale correlates to performance on the medium scale. And while it's not a perfect correlation, these plots show that there is hope for doing research at smaller scales, since there's a good chance that findings will generalize to larger scales.

Starting point is 01:23:01 And in fact, this is exactly how we proceeded during our experiments, by first testing things out at smaller scales and only scaling up the most promising results. This saved us a lot of compute during our experiments. There's much more in the paper, as you can see in these slides, and if you're interested, definitely check it out. We are very happy to answer any questions and talk more about any of these topics in our

Starting point is 01:23:25 poster. Since we released the paper, there's been a lot of activity in DataComp. The fun thing is that our best-performing baseline, which we thought was pretty decent, was blown out of the water by the community since, and it's just really nice to see that happening in real time. One example is data filtering networks, or DFN for short, where the main idea is similar to CLIP score filtering, but with a deeper dive into what makes a good model for data filtering. And careful data curation has led to what now are the best CLIP models, even outside DataComp, with an impressive 84.4% zero-shot accuracy on ImageNet

Starting point is 01:24:04 using a ViT-H/14. The central takeaway I'd like to leave with you today is that careful experimentation with datasets can really pay off and can lead to very large improvements in performance on downstream models. So instead of blindly scaling models up, I think we as a community should start paying more attention to how we design datasets. DataComp is designed to facilitate research in that direction. It's amazing to see what people are already building with it, and I'm super excited to see what comes next. Finally, I'd like to reiterate that our benchmark is designed to encourage everyone to participate, even if you only have a couple of GPUs under your desk.

Starting point is 01:24:44 So if any of this sounds interesting at all to you, feel free to check out our resources, including our website, codebase, and paper. Everything we do is fully open source. And we hope these resources are useful for the community. Thank you very much. So I quite enjoyed that presentation, and obviously this being an image-heavy and multimodal type of paper,

Starting point is 01:25:07 you should probably check out the images and the competition at datacomp.ai. But I did manage to catch up with them at their poster session and ask them more questions. It turns out there's some intellectual lineage from LAION with LAION-5B, and I do think that this has a strong chance to become the new ImageNet.

Starting point is 01:25:24 So let's give them a listen. Oh, fun fact, they were also wearing DataComp T-shirts. Like, most people, when they present their poster sessions, they're in kind of just like somewhat semi-formal attire. These guys, they make custom T-shirts for their posters. So you know they're serious. My name is Samir. I'm a fourth-year PhD student at Columbia.

Starting point is 01:25:42 Yeah. And I started working on DataComp, like, I guess, around November of last year. I collaborated with a lot of the folks that are already on the paper on previous projects, like Mitchell Wortsman, Ludwig, Vaishaal, and they kind of just kind of roped me in. They were looking for hands to help out with different tasks.

Starting point is 01:26:05 And then through the course of time, my involvement just kind of grew because I got really excited about it. How did this become such a big effort? You guys are wearing T-shirts. Yeah. This is not normal. Yeah, yeah. Yeah, so we really took this project very seriously

Starting point is 01:26:21 because we wanted the benchmark to be really good and thorough. And because of that, we were working at kind of a scale that was kind of unprecedented for academics. We generated the pool of like 12.8 billion image text pairs. We wanted evaluations to be very thorough, many, many downstream tasks. And that just took a lot of people to commit to the project. And how do people find out about something like this? You're not from the same university. Is there a community somewhere?

Starting point is 01:26:51 that you all just gather and coordinate? Yeah, yes. So Ludwig, who's kind of the last author on this paper, is kind of networked all around. He's very friendly and very open to collaboration. And I think because of him, many people from many different universities, corporations were able to join. Yeah. And this is separate from the Lyon group.

Starting point is 01:27:15 Yeah, so Ludwig is affiliated with LAION. Because I've seen his name around. Yeah, yeah. But most of the people are not necessarily part of LAION, but we all kind of know each other and collaborate. I mean, wouldn't it be

Starting point is 01:27:29 better to make this, like, LAION-1.4B? Yeah, so we maybe could have done that. I think... Sorry, LAION-12.8B, right? LAION has 5B? Yeah, LAION has a 5B and like a 2B subset that people train on a lot.

Starting point is 01:27:45 Yeah, we could have done that. I think we were thinking about things more from the standpoint of like a benchmark and that was really our focus. While this 1B data set that came out of the benchmark is an artifact, we kind of wanted to place emphasis on the benchmark itself. Yeah. So just to comment on that, the idea initially was a dataset, but then we thought also about benchmarking,

Starting point is 01:28:11 but then we thought about the real thing is the community. So we thought that the way that you can actually build a community is by opening up the tooling. So a lot of the, in data set curation, the problems is not about, usually you work super hard, and then at the end of the day you make a dataset, you release the dataset and you're done. But the tools that you developed to actually clean up the dataset, filter the dataset, benchmark the dataset, these are often more valuable for other people who want to do the same job. So the central idea was to make a community and to open source the tools in addition to the dataset,

Starting point is 01:28:46 and then allow other people to try different tooling methods, so going in this data-centric AI direction. So that was kind of one of the central ideas around DataComp. So DataComp is really about building community around dataset curation. This is the first time I've seen CLIP score filtering applied like this. And you also mentioned at the end of your oral presentation that there were other methods. What filtering methods are you seeing that are working really well? It's a whole community of people who are trying a gazillion different tricks.

Starting point is 01:29:17 And that's the whole point, right? One remarkable thing to point out is that if you pick a benchmark, you will see performance changing across different benchmarks, but we see surprising correlation of ImageNet zero-shot to a gazillion other benchmarks. So we have 38 benchmarks, and we see that if you do well, basically, zero-shot on ImageNet, that is very, very correlated with how good your model is across the board for retrieval, for all kinds of very useful things. And the community is developing a gazillion different methods of data curation. That's why we have a leaderboard and we're building a community. This is not like a paper and we're done.

Starting point is 01:29:52 You could write like 20 projects of different data set curation. It's more like a platform for data set curation evaluations. Do you remember other methods that are doing well? Yeah, yeah. So that's a great overview. And yeah, like I think specifically people have been looking into designing filtering networks. So rather than using CLIP to be the filtering network, like what are some other data sets that we might train on in order to create these filtering networks.

Starting point is 01:30:21 What are the differences between a good CLIP model and a good filtering model? So these are all kind of open questions that, you know, as Alex was saying, the community will answer by trying a bunch of tricks and methods. Yeah, so you can train like a new CLIP, but you can also train a new Stable Diffusion from this, which I'm sure Stability is interested in, unless you're working on your own sort of diffusion model. Yeah, so the problem is the compute

Starting point is 01:30:51 needed to train multiple Stable Diffusions. But yeah, we're definitely interested in that direction, and we're definitely thinking about that. But you could basically include quality of a Stable Diffusion as a benchmark and evaluate how you would select a subset of data to improve on that benchmark. You might want to talk to Eleuther. I've been talking to Stella Biderman from Eleuther. She's around here.

Starting point is 01:31:14 She'll come by. Cool. Any other future directions that you're very excited about? Yeah, we're actually really excited about just the concept of DataComp at a high level. So right now we're pretty excited about NLP and what a DataComp-like effort would look like in that space. You could extend this approach to audio, potentially video, although like video is tricky for me just because it's so data heavy. And there's a lot of orders of magnitude of different dimensions they could go to.

Starting point is 01:31:48 So I don't know what that might look like. Let me tell you about that. Yeah, yeah. So one idea is you can make a DataComp for MRI images or a DataComp for this, a DataComp for that, a DataComp for others. What's the idea, what does that mean? It means you fix the model. So classical machine learning, as was mentioned in the talk,

Starting point is 01:32:07 classical machine learning says, here's a data set, ImageNet, build a million models and tell me what's the best one, right? Now, the DataComp idea flips this on its head, right? It says, here is a big pool of data. The model is fixed. You only select a subset of the pool. So the thing you're selecting is which images to keep in the pool. Then the model is fixed.

Starting point is 01:32:30 But you're training other machine learning models to select what to keep. And that's very powerful. So that was in classical, you know, AI, if you're doing the data cleaning, the data filtering, that's like the shittiest job. That's like, what? But we're trying to make that a first-class citizen and trying to tell you that it's worth to do research because it's not that you will manually sit down

Starting point is 01:32:54 and select images from 5 billion or 13 billion. You will be building models that do that. So you can do a DataComp for X, and we're seeing that from the community. Yeah. Curious how you became involved with DataComp. Yeah. So we have this NSF Institute.

Starting point is 01:33:13 It's called IFML, the Institute for the Foundations of Machine Learning, and Ludwig is part of our institute. So we were having lunch and we were discussing LAION, right, and how to make a better LAION. And we said, okay, instead of just making a better LAION, which is what we started with, let's make it a community where we open the tools. So everybody can make a better LAION.

Starting point is 01:33:38 So that was the central idea, yeah. What happened to the original LAION? LAION is still a great dataset that's still public, but this is basically building the next generation. Yeah, yeah, very cool. I wish you good luck. I think this is really foundational work. It's basically the new ImageNet, right?

Starting point is 01:33:53 Yeah, yeah. That's 10 years after the original AlexNet moment. By the way, that second speaker who was not introduced was Alex Dimakis, who is a professor at UT Austin, who just jumped in and chatted. And I do find that a very charming element of NeurIPS is that it's effectively a coming-out party slash hiring party where all the grad students

Starting point is 01:34:12 publish their papers. They all have sponsors in more senior researchers and professors as the secondary or tertiary authors, but their name goes first because then they get all the credit and the citations, and the people who are more senior just kind of stand there and support them, and Alex definitely jumped in and supported them. Just like I saw a bunch of other senior authors supporting their grad students and directing questions to their grad students,

Starting point is 01:34:34 because their reputations are already secure, they have jobs, they're just here to help their interns and grad students. There's a very interesting tension between effectively datasets papers and models papers. The datasets people think that their work is more long-lasting, and the models people think that datasets work is done. And I think you just need both. So that's my awkward transition from DataComp into LLaVA,

Starting point is 01:34:57 which is probably the single most interesting visual language model this year. As much as people are in love with GPT-4 Vision, it's not open source, and we don't really honestly know very much about it, but LLaVA is open and trainable with a whole bunch of open-source models. And I think LLaVA and DataComp together will provide some kind of template for the next generation of multimodal models to form. So let's check out LLaVA. I'm Haotian, a final-year PhD student at UW-Madison, and I'm on the job market.

Starting point is 01:35:30 Today, I'm presenting visual instruction tuning, a joint work with Chunyuan, Qingyang, and my advisor, Yong Jae. So as background, we, as humans, can see and reason about the visual world, and express and interact with natural language. Doctors read CT scans and explain their findings to their patients, teachers teach students with conversations, and we share our life and findings on social media and interact with others. It will be great if we can have a visual intelligent assistant that can reason about the visual world and reflect with language. The closest works along this direction are image-to-text generation models, where the model takes in the image as an input and outputs text reflecting its understanding. Such models, like GIT, BLIP-2 and Flamingo, have basic

Starting point is 01:36:19 visual reasoning capability, while they generally lack the ability to follow very complex instructions or engage in very long conversations. Back in March, OpenAI demonstrated GPT-4 Vision with strong visual reasoning capability. For example, given such an image and the user request "what's unusual about this image", GPT-4 Vision is able to reason beyond just visual facts. It's able to figure out that the unusual thing is actually the man ironing clothes while standing on the back of a taxi. It's great, but it was not accessible until very recently, and there's no disclosure on how it works.

Starting point is 01:36:56 So if we are able to create an open-source model with a similar level of visual reasoning capability, it will be great, as it allows us to have a deeper understanding of how the models behave, and we can have a joint effort from the whole community to make it better. So, as a starting point, how can we create such multimodal models that can actually follow human intent? In NLP, researchers find that instruction tuning allows the model to learn to follow instructions by fine-tuning the model on a small set of instruction and answer pairs, like explaining human behavior or movie recommendation, and creating such instructions just by letting

Starting point is 01:37:32 humans write them is very costly. Self-Instruct proposes to use teacher models like ChatGPT to create such instructions by expanding a small set of seed instruction-output pairs to million scale using in-context learning. It's affordable and has been used to create open-source language models like Alpaca, based on the base LLaMA model. So now the question is, how can we create visual instruction-following models? Let's start with this basic architecture, where we have an image. First we have a visual encoder, which can encode it into visual features, then a cross-modal connector that can bridge it to the

Starting point is 01:38:07 language decoder. The language decoder also takes in the user instructions, performs the reasoning, and outputs its understanding as text. So the key is, how do we train this model to follow multimodal instructions, and how do we obtain such data? The straightforward way would be to use Self-Instruct: let's find a multimodal teacher and let it expand. However, if we take a look at the existing teacher models that were used, they're all text-only, and there were no powerful multimodal teachers.

Starting point is 01:38:38 And in our paper, we propose to leverage a text-only GPT, and we provide image context in a textual format to GPT so that it can understand. For example here, we have an image, and we can use the COCO caption annotations so we have image-level context which describes what's happening in the image. We can also take the bounding box and object category annotations from COCO so that we are able to get region-level context, which provides even more details that may not be captured in the captions. So let's take a closer look on our text-only data

Starting point is 01:39:11 engine. We will have two parts to the inputs. First are in-context examples, which are the exemplars with which we guide ChatGPT on how it should generate the visual instructions. So we'll have example images. We convert them into the image context in the textual format that we just described. We write the instructions and answers about the visual content in the image. These are the examples for ChatGPT to learn from. And then we do the actual inference: for any image in the COCO training dataset, we're able to convert it into the textual format using the COCO annotations, and ChatGPT will just learn to generate those instructions and answers about those

Starting point is 01:39:50 image context. We gather the instructions, answers, and also the image to create visual instruction-following data, which is a triplet of image, instruction, and answer. To better facilitate learning, we create three types of responses. First is conversation, to facilitate multi-turn engagement. Then detailed description, to train the model to focus on visual details. And complex reasoning, to allow the model to reason beyond the visual facts. For example, for the question "what challenges do people face?", the model not only needs to figure out that there is luggage, there are bags, there are SUVs,

Starting point is 01:40:27 it also needs to figure out that the challenge is that they may not be able to fit all the luggage in the back of the SUV. So we create LLaVA-Instruct-158K and train LLaVA. It's a model composed of these three simple components. We use CLIP as the vision encoder, the instruction-tuned language model Vicuna as the language decoder, and we use a linear layer for the projection. And we find it works quite well, because the CLIP visual features already carry great semantics, and a single linear layer is sufficient to project them into a space that the language decoder can understand well.
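As a rough illustration of the three components just described, the sketch below projects per-patch visual features with a single linear layer and places the result next to the text embeddings that a language decoder would consume. It is a toy in plain Python, not the LLaVA code: the dimensions, the random "features", the function names, and the choice to put visual tokens before text tokens are all assumptions made for illustration.

```python
import random

def linear_project(features, weight, bias):
    """Apply y = W x + b to every visual feature vector (one per image patch)."""
    return [
        [sum(w * x for w, x in zip(row, feat)) + b for row, b in zip(weight, bias)]
        for feat in features
    ]

def build_decoder_input(visual_features, text_embeddings, weight, bias):
    """Project visual features into the language model's embedding space and
    prepend them to the text embeddings (ordering is an assumption here)."""
    visual_tokens = linear_project(visual_features, weight, bias)
    return visual_tokens + text_embeddings

random.seed(0)
VISION_DIM, LM_DIM = 4, 3          # toy sizes, hypothetical
NUM_PATCHES, NUM_TEXT_TOKENS = 5, 2

# Stand-ins for the frozen vision encoder output and the embedded instruction.
visual_features = [[random.random() for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
text_embeddings = [[random.random() for _ in range(LM_DIM)] for _ in range(NUM_TEXT_TOKENS)]

# The only new parameters in this sketch: one LM_DIM x VISION_DIM matrix and a bias.
weight = [[random.random() for _ in range(VISION_DIM)] for _ in range(LM_DIM)]
bias = [0.0] * LM_DIM

sequence = build_decoder_input(visual_features, text_embeddings, weight, bias)
print(len(sequence), "tokens of width", len(sequence[0]))
```

The point of the sketch is only that the connector adds very few parameters: everything else is an existing encoder or decoder.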

Starting point is 01:41:00 For model training, we use a two-stage training pipeline, where in the first stage, we pre-train the projector only, for feature alignment, so that the features are projected into a proper space. And in stage two, we perform end-to-end visual instruction tuning on the generated visual instruction-following dataset. We train the projector, we train the language model, and if you have limited compute, you can try LoRA or QLoRA, or even just the projector only, and it can give you decent visual chat performance. After we trained LLaVA, we found several interesting emerging properties. Let's quickly revisit some of the data properties first. So our visual instructions are in English only, with no human annotation, and there's no explicit OCR data.

Starting point is 01:41:43 LLaVA can have strong visual reasoning capability as GPT-4 Vision does, where it is able to figure out that the unusual thing is actually the man ironing clothes on the back of a minivan, and it's more visually grounded than open-source baselines like BLIP-2 and OpenFlamingo. It understands this humorously parodied Mona Lisa with a dog in the same pose, and it's definitely out of distribution. It also has a strong emerging OCR capability, where it can recognize "NeurIPS 2023" from this presentation slide, and it correlates that with the pre-trained language knowledge when asked about who will be interested in this.

Starting point is 01:42:19 It will relate and say that it's related to artificial intelligence and machine learning. Although our visual instructions are just in English, it's able to perform the reasoning and output text in Chinese and other foreign languages, like it recognizes the French Quarter and gives a brief description in Chinese here. So how can we evaluate our large multimodal models? We draw inspiration from NLP and slightly modify our data creation pipeline to use a text-only GPT to do the evaluation, where we have the image context in the textual format,

Starting point is 01:42:53 we have the user instruction, we have two model outputs, and we just feed all of them into the text-only GPT and request feedback. It will give you a score out of 10 for each of the assistants and also provide you an explanation so that you can understand how the model behaves. We create a challenging benchmark, LLaVA-Bench (In-the-Wild), which requires knowledge beyond the training data, multilingual understanding, and also perception of subtle details. We create a very detailed textual annotation of the image context for those images and we can feed it in for a GPT evaluation, and it's not only just

Starting point is 01:43:30 for accuracy but also for hallucination. Since the introduction of LLaVA, there has been great effort from the community, ranging from data, model, modality, and expanding to different tasks, as well as developing benchmarks for us to better understand the model. The developments after June are just too many to fit onto the slide. And we as the LLaVA team have also been pushing the effort to make it more accessible and expanding its capability in terms of RLHF, tool use, as well as visual prompting. With the improved version, LLaVA-1.5, with just simple modifications to the data and model, we show that it achieves great performance on a range of 12 benchmarks, and it's

Starting point is 01:44:15 simple and efficient: it only requires less than 1% of the data that other approaches use. We're able to train LLaVA-1.5 within one day on a single node. Check out our workshop poster on Friday. So in conclusion, LLaVA can reason about the visual world and reflect with natural language. Its design is simple and general, and we show that it is possible to adapt a language model to multimodal effectively and efficiently: we can train it within one day on a single node. Because its design is so simple, it is compatible with almost all the optimizations that are designed for language models, for both training and deployment. And it's fully open source. And unfortunately due to the Wi-Fi network

Starting point is 01:45:02 issue, we are unable to do a live demo here, but LLaVA is able to run on a MacBook Air, so I will still do a live demo here. Let's try to see. This is an image that I took yesterday here, and I will just say, what's this event and is it popular? It's the Tree of Thoughts presentation. It's really popular. I could just barely stand in the back. So it will just run for a while, and it says the event seems to be a conference or presentation, there's a large number of people attending, and it appears to be

Starting point is 01:45:36 popular because lots of them are standing, including me. So I will say like, okay, I attended as well. It is NeurIPS 23 and the experience is great. Help me draft a tweet. And it will think for a while. I hope we can optimize it further for a faster prefilling stage. And it says: just attended NeurIPS 23, amazing experience packed with knowledgeable speakers and attendees, learned so much and made valuable connections, highly recommend for anyone interested in AI-related fields, and some hashtags. I hope you like it and thank you so much,

Starting point is 01:46:21 please come to our poster session at number 229. Our demo, code, data, model, everything is open source. We are so excited to talk with you more about LLaVA, and thank you for coming. If DataComp is an example of what a really good benchmarking dataset paper looks like, then I think LLaVA is an example of what really good state-of-the-art research on visual instruction tuning and visual language models looks like. It definitely has inspired a bunch of copycats and derivative work in the open-source model space,

Starting point is 01:46:59 notably BakLLaVA, and I think there's just going to be a lot more work being done here. Like we're just realizing that we can plug and play these models and train them together in all sorts of ways, and LLaVA is definitely one of the more innovative solutions that also just simultaneously solves a whole bunch of issues with visual understanding. Here's the poster session Q&A with Haotian. Basically we are trying to create an architecture that is as simple as possible. So we have a vision encoder just to encode those features, a language model to perform the reasoning, and we use a projection layer, which is a linear layer; we find it

Starting point is 01:47:37 doing pretty well at projecting the visual features into a latent space that the language decoder can understand. And we believe this is because the visual features of CLIP already carry great semantics and are in a good latent space. So a single linear layer is sufficient for it to understand. So is the language model GPT-4? Oh, the language model is Vicuna, something open source. Yeah. Not GPT-4. Right.

Starting point is 01:48:00 You can take that off the shelf, but you're training the linear layer. Yeah, so it will be two stages. In the first stage, we want to train the language model to understand those images. So we train the projection layer only. And this is our stage one. The language model and the vision encoder are both frozen. And in stage two, we will train the model to follow those instructions. So we train the language model and the projector.
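The two stages described here amount to a schedule of which components receive gradient updates. The sketch below records that schedule with plain Python objects; the `Module` class, the module names, and `configure_stage` are hypothetical stand-ins rather than anything from the LLaVA codebase, and a real implementation would toggle gradient flags on framework parameters instead.

```python
from dataclasses import dataclass

@dataclass
class Module:
    """Minimal stand-in for a model component with a trainable flag."""
    name: str
    trainable: bool = False

# Which components are updated in each stage, as described in the Q&A:
# stage 1 trains the projector only; stage 2 trains projector + language model.
TRAINABLE_BY_STAGE = {
    1: {"projector"},
    2: {"projector", "language_model"},
}

def configure_stage(modules, stage):
    """Freeze or unfreeze each module for the given stage; return trainable names."""
    if stage not in TRAINABLE_BY_STAGE:
        raise ValueError(f"unknown stage: {stage}")
    for module in modules.values():
        module.trainable = module.name in TRAINABLE_BY_STAGE[stage]
    return sorted(name for name, module in modules.items() if module.trainable)

modules = {
    name: Module(name)
    for name in ("vision_encoder", "projector", "language_model")
}

print("stage 1 trains:", configure_stage(modules, 1))
print("stage 2 trains:", configure_stage(modules, 2))
```

In this sketch the vision encoder stays frozen in both stages, matching what is said here; unfreezing it is discussed later in the Q&A as a separate experiment.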

Starting point is 01:48:24 To my knowledge, this is the first work that is adding the bounding boxes with the captions. Any difficulty in having the language model understand all those things? Our model does not need to understand bounding boxes, because what we provide to train our model is this visual instruction-following data. The model just needs to understand the image and give a proper answer when you give it a user's instruction. So this is not something our model needs to worry about. Although we do find that the model is able to understand those bounding boxes well.

Starting point is 01:48:57 And the key point is: does GPT-4 understand those well, and does a text-only GPT-4 understand them well? We find it to be true, because I also work on some image generation models, and we have a work where we can control the image layout by just providing some bounding boxes. What we did is that we gave GPT-4 a caption and said, can you generate a reasonable layout for me? And it's able to do that pretty well. So we did not quantitatively evaluate how good it is at doing that, but it does understand those layouts pretty well, and also, from the instructions it generates, it does know which is on the left, which is on the right.
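To make "image context in a textual format" concrete, here is a minimal sketch of turning captions and bounding boxes into a text block that a text-only model could read. The sample annotations, the function names, and the output layout are invented for illustration. COCO stores boxes as `[x, y, width, height]` in pixels; converting them to normalized corner coordinates is a choice made for this sketch, not a claim about the exact prompt format the authors used.

```python
def bbox_to_corners(bbox, image_width, image_height):
    """Convert a COCO-style [x, y, width, height] pixel box into
    normalized [x1, y1, x2, y2] corners, rounded to three decimals."""
    x, y, w, h = bbox
    return [
        round(x / image_width, 3),
        round(y / image_height, 3),
        round((x + w) / image_width, 3),
        round((y + h) / image_height, 3),
    ]

def image_to_text_context(captions, objects, image_width, image_height):
    """Serialize captions (image-level context) and boxes (region-level
    context) into one plain-text block."""
    lines = ["Captions:"]
    lines += [f"- {caption}" for caption in captions]
    lines.append("Objects (category: [x1, y1, x2, y2], normalized):")
    for obj in objects:
        corners = bbox_to_corners(obj["bbox"], image_width, image_height)
        lines.append(f"- {obj['category']}: {corners}")
    return "\n".join(lines)

# Hypothetical annotations for a single 400x200 image.
captions = ["People load luggage into a black SUV."]
objects = [
    {"category": "suitcase", "bbox": [100, 50, 200, 100]},
    {"category": "person", "bbox": [0, 20, 80, 160]},
]

context = image_to_text_context(captions, objects, 400, 200)
print(context)
```

With coordinates spelled out like this, left/right relations are recoverable from the numbers alone, which is consistent with the observation above that a text-only model can tell which object is on which side.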

Starting point is 01:49:41 Yeah. Did you have to qualitatively evaluate the output of the answers that GPT-4 gave you? Yeah, we actually did not quantitatively evaluate, but we did manually go through some of them when we were developing the data engine, because we do have some factors to consider in this. So we can change the number of in-context examples. We can change the way we write those reference instructions and answers.

Starting point is 01:50:11 We can also change the actual instructions we use to teach GPT what the task is. So we did qualitatively iterate on how we design the data engine. And we find this process really is quite rewarding, because in this process we do understand how GPT thinks and what information we need to provide to GPT-4. Yeah. And then all the bounding boxes that you provide, these are kind of ground truth because you get them from COCO, right? That's correct. Right, right, right, right. So this actually ensures that those contexts are perfect, if the human annotators are perfect,

Starting point is 01:50:55 and the generated instruction answers are as good as possible. I just want to ask about the training part. So if I were to take LLaVA-1.5 and fine-tune it, either full fine-tune or LoRA or whatever, would you recommend also retraining the projector? I guess it depends on your task. Are you considering a different domain or? So I want to build off the LLaVA-Plus stuff.

Starting point is 01:51:22 So I know that goes into using tools. So it's a little out of scope for this project, but maybe for both. So let's say I want to take a different multimodal instruction following data set. Like for that part, would you recommend retraining? Yeah, that's a good question. So I would say that if you want to,

Starting point is 01:51:43 if the domain, like the image domain that you're going to work on, say, medical images, if it is too different, then I would recommend actually going with a different stage one training, or even just doing everything from scratch if you have a biomedical CLIP, right? Because that may give you even more benefit. But we do observe that if you pre-train with LLaVA's instructions, you pre-train with that visual information.

Starting point is 01:52:15 Like it learns to do some reasoning about the visual content, and it may be crucial for the visual understanding on other domains. So I guess there will be a trade-off, and I guess there will be both pros and cons to training on another domain from scratch, because you may lose the benefit that you get from pre-training on LLaVA on how to localize those objects. So I guess you would need some more experimental evidence to make the proper decision. So is it fair to say that unless the domain is super different, like x-rays, maybe it's fine to just use it?

Starting point is 01:52:56 Yeah, I think it's totally fine. And I guess it's better to use the instruction-tuned version because it has so much vision knowledge injected into it. Okay, and then, sorry, one last question. Of course. For stage two, like, let's say I want to fine-tune on my own thing. Is the roughly 160K number of examples a good target to hit? Like, do you have recommendations around how big that dataset should be?

Starting point is 01:53:18 I guess it also depends on how different the task is and also how bad the model is on that task, because I can give a brief example from one of the experiments we have done. So there's a task where we train the model to generate Stable Diffusion prompts, for example. Basically it's kind of captioning in some style we want. And because LLaVA is already able to understand those visual attributes and the content very well, it's just a matter of reorganizing the style of its responses. So we find even 100 examples is sufficient.

Starting point is 01:53:59 Yeah, 100. And we just use 100 examples and it does the work decently. Yeah, it's just a form of changing the style. But if you're trying to do some very different reasoning tasks that LLaVA is not good at, I guess you may need more. I think 10K, generally speaking, is enough. 10K, generally speaking... Yeah, yeah, yeah, I guess 10K, or if you want to make it safe, like I guess,

Starting point is 01:54:26 maybe 50K at most. Yeah, yeah, of course. Yeah, thank you. Yeah, thank you. To understand how important this vision encoder is, have you ever tried to remove the encoder entirely and use the bounding boxes here as the input to whatever language model you are using, and just do the same task? So I guess the key point here is that if you want to remove the encoder completely and just use the bounding boxes as the input, there will be one question: how are you going to get those bounding boxes? And second is, what if the user asks you about the text, are you going to also have an OCR engine? And what if the user asks

Starting point is 01:55:09 about something else, for example the attributes? If you think of this, having an end-to-end model will make it much easier and much more generalizable to extend to different types of input and user instructions. And also because now you need some other models to generate those bounding boxes, tags, all of those things. I feel that it's good if we can have those models to enhance the capability. But if you do have a model that is really trained with vision and really understands what's happening in the image, it can better coordinate that information.

Starting point is 01:55:48 Yeah, so have you ever, you know, unfrozen this vision encoder, meaning in the first or second stage? Yes, we have tried to unfreeze the vision encoder, and we find it quite useful for some of the tasks but not for the others. So specifically, if it's just asking about what's the attribute, what's the object, for those kinds of tasks it does not matter much. But there are two kinds of tasks where unfreezing the vision encoder really matters. One is where it's not necessarily about the semantics. For example, I'm asking whether this line is straight, those kinds of tasks which require you to understand the low-level details, where the low-level detail really matters.

Starting point is 01:56:27 That's one of the things. And we also have another work, ViP-LLaVA, where we try to train the model to understand visual prompts. By visual prompts, we mean: can we just use some scribble to circle the objects that we want to ask about, instead of necessarily trying to describe them very clearly to make the model understand what we are curious about? So for that, in order to correctly identify those scribbles and those tiny lines, it requires you to somehow unfreeze the vision encoder, or use some earlier layers which still

Starting point is 01:57:10 preserve that information. I'm curious about the backstory behind this whole thing, like how did you get started exploring multi-modality, and your inspiration? We have been working on vision language, like, our team has also been working on vision language. Your team, is it a lab? Is it a, his team, I see. Chunyuan from Microsoft and me and my advisor. We have a collaborative effort on this.

Starting point is 01:57:36 We have a series of works on vision language. And although I don't have tons of years of experience in vision language, we do see that. In March, we saw Vicuna, which made us very impressed by the performance it can have. For the size. Yeah, for its size and also for being open source. And we believe that it's possible for us to create a visual reasoning model

Starting point is 01:58:06 that is purely open source with a similar level of capability. And we believe that with open source, we are able to have a joint effort from the community to make it much, much better. Yeah, it was cheap, right? You trained for like eight hours. Yeah, eight hours for LLaVA, and for 1.5, one day on a single node. That means everyone else can do it too.

Starting point is 01:58:27 Yes, not everyone else, but most people. Thank you very much. So super interesting and notable work on the LLaVA model. I guess someone should try to hire him. But I guess the next segment we're going to explore is the prompting segment, quote unquote, and there are a surprising number of prompting papers here. I'm not sure that many papers should be represented at NeurIPS, but where else are they going to present? I don't really know. But anyway, so there was a whole channel or track

Starting point is 01:58:58 just on chain of thought. That blows my mind. And I do think that that is appropriate. And I do think that the techniques here are innovative. It's impossible to cover all of them. I actually talked to Noah Shinn from Reflexion, remember Reflexion, as well as a whole bunch of others. But probably the most representative one was the Tree of Thoughts paper. So here it is. My name is Shunyu. I'm from Princeton. I'm very excited to talk about Tree of Thoughts. It's a joint work with my colleagues from Princeton and Google. So we all know language models and large language models. Language models were invented to generate text, token by token, and left to right. But now they are used to solve an increasingly

Starting point is 01:59:40 wide range of problems using scaled-up models and prompting techniques like chain of thought. So here is an example. You can break down a complex calculation into steps, and that lets the model solve problems that it otherwise cannot solve. So the question is, can those language models one day become a general problem solver by continuing to scale up and using autoregressive inference, or are there some fundamental limitations? So to answer the question, let's take a look

Starting point is 02:00:10 at a very simple example. This is the Game of 24, where the rule is you are given four numbers, and you have plus, minus, divide, and multiply operations, and you need to combine those four numbers to obtain 24. So one example is, if you are given the input 2, 9, 10, and 12, what you can do is first multiply 12 and 2 to get 24, then 10 minus 9 to get 1, then 24 times 1 to get 24. Okay, so it's not a really hard game. Now you give a new input, 4, 5, 6, 10, to GPT-3.5 and ask it to solve the task.
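The rule just stated can be checked mechanically. The sketch below validates a step-by-step trajectory: each step may only use numbers still available, each result goes back into the pool, and the final pool must be exactly the target. The function name and the `(a, op, b)` step format are assumptions made for this illustration; exact fractions are used so that division does not introduce rounding error.

```python
from fractions import Fraction

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}

def check_trajectory(numbers, steps, target=24):
    """Return True if `steps` (a list of (a, op, b) tuples) uses every input
    number exactly once and finishes with only `target` left in the pool."""
    pool = [Fraction(n) for n in numbers]
    for a, op, b in steps:
        a, b = Fraction(a), Fraction(b)
        for operand in (a, b):
            if operand not in pool:   # operand is not available any more
                return False
            pool.remove(operand)
        if op == "/" and b == 0:      # division by zero is never a valid move
            return False
        pool.append(OPS[op](a, b))
    return pool == [Fraction(target)]

# The worked example from the talk: 12 * 2 = 24, 10 - 9 = 1, 24 * 1 = 24.
worked = [(12, "*", 2), (10, "-", 9), (24, "*", 1)]
print(check_trajectory([2, 9, 10, 12], worked))

# One valid trajectory for the second input: (10 - 4) * 5 - 6 = 24.
second = [(10, "-", 4), (6, "*", 5), (30, "-", 6)]
print(check_trajectory([4, 5, 6, 10], second))
```

A trajectory that uses every number but lands on some other value is rejected by the final comparison, which is the kind of check a left-to-right generator cannot apply to its own first token.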

Starting point is 02:00:49 It will first try to multiply 10 by 6 to get 60, then divide by 5 to get 12, then 12 times 4 to get 48. Then to make it up, it will say it's 24, and then call it a day. So it's a hallucination. You might argue that if you have better models or better prompts, you will solve this. But even if you use GPT-4 with 5-shot, 5 examples in the same

Starting point is 02:01:10 CoT prompt, you will only get 4% task success. So why is this easy task so hard for language models? So if you look at the initial token generation, right, 10 and 6, 10 times 6. Because those language models are making local and token-level decisions, one by one, left to right, those initial decisions are really hard, right? Even for humans, we don't know whether the first token

Starting point is 02:01:37 should be 10 or 6 or 5. We don't have pre-training intuition, right? We have to play the game to have a better sense. Worse still, once you generate that one wrong token at the beginning, the task has already failed, in that you cannot really complete the whole trajectory in the CoT format and be right. So by this very simple example, what I want to show is

Starting point is 02:01:57 there is something about autoregressive inference that is lacking mechanisms for deliberate reasoning. It's even true for the biggest, strongest language models like GPT-4, and the reason is quite simple, just like Ben's talk mentioned, right? So for CoT to work, you really need strong local signals to guide every step through those local decisions. And just to draw an analogy, right, imagine if you have a robot that's trained only on successful navigation trajectories, and it's only trained to predict the next move.

Starting point is 02:02:31 And then you put it into a new maze, and then it's very hard for it to explore. So how do we solve this issue, right? So in this work, we took inspiration from human cognition. In his famous book, Thinking, Fast and Slow, Daniel Kahneman proposed that our cognition has two parts. We have a fast and automatic System 1 that's handling everyday tasks like riding a bike, and we have a slow and deliberate System 2 that's imposing control and intervention over System 1 for harder tasks like designing a plan. So if language models' autoregressive inference is similar to this spontaneous but error-prone

Starting point is 02:03:10 System 1 process, maybe we can impose some kind of control algorithm on top of it to get System 2 reasoning. And tree search is naturally the choice, which is also one of the oldest ideas in artificial intelligence. For example, Newell and Simon's General Problem Solver in the 1950s. However, doing search in this language reasoning space is non-trivial. Because traditionally, you know, if we search in classical games like chess, we often have a small fixed set of next moves, so that we can design or learn search heuristics. But if you want to search in open-ended reasoning, the next move can be arbitrary

Starting point is 02:03:50 text, which is really hard to enumerate or evaluate. So the idea here is, now that we have large language models, we can use them to start generating and evaluating next moves. So from the previous two slides, you have seen what's the problem of large language models and what's the problem of classical search, and a hint that, you know, combining them might lead to a better result, and that's true. So we propose Tree of Thoughts. It's a general method for combining language models

Starting point is 02:04:16 and search algorithms for deliberate reasoning. And to solve a problem, you need four parts, right? So first, you need to define what is a search space or what is the thought space. Then you need to generate and evaluate language thoughts using language models. And you need to combine that with a search algorithm to explore and maintain thoughts.

Starting point is 02:04:33 So I'll use the simplest example, which is the Game of 24, to explain each part. OK, so what is a thought, right? That's not a question in chain of thought, because everything is coherent, and you don't have to split it. But it's a very critical thing in Tree of Thoughts. So here we define a thought as a coherent piece of text

Starting point is 02:04:51 as a next move in the reasoning game. And if you think about the Game of 24, right, there are two extreme choices. On one extreme, you can treat each token as a thought, right? Then it will be very easy to generate each thought. But as explained before, it's very hard to evaluate whether 10 is a good thought or 13 is a good thought.

Starting point is 02:05:10 On the other extreme, you can treat the whole reasoning as a thought, right? You generate the whole thing, which will be very easy to evaluate. You just look at the end to see if the number is 24. But if you can generate it, the problem is solved already, so it's very hard to generate. So in this game, naturally, the choice of thought is something in between, right?

Starting point is 02:05:28 We can use each intermediate equation as a thought so that it's relatively easy to generate and evaluate thoughts. And this is really a problem-specific trade-off design, right? So for different problems, a thought can be a token, can be a word, can be a sentence, can be a paragraph and so on. So once you have defined what is a thought,

Starting point is 02:05:49 it's easy to generate that with a language model. So here, it's a simple prompt. You know, you have one example of what's the input and what are the possible thoughts. Then you give it a new input, and the language model can just generate new thoughts. So here, each new line is a new thought of how to continue the reasoning. Once you have those thoughts, right, you want to give them a value

Starting point is 02:06:10 so that you can search. So here, what we do is we give a prompt with examples where, you know, for the remaining numbers, if the language model can simulate within a few trials and reach 24, then a high value is given. If not, depending on whether the numbers look reasonable or not, a medium or a low value is given. Okay, so the previous three examples are the in-context examples. Now, for this new input, you know, 5, 6, 6, the language model tries one round, reaches 24, and "sure" is a high value. For 10, 13, 13, it will try a few rounds and it will fail, and the numbers look too large, so it's "impossible", so a low value.

Starting point is 02:06:52 So for something like 5, 5, 9, it tries a few rounds, and it doesn't work, but the numbers look reasonable, so "likely", so a medium value. But here actually, you know, with 5, 5, 9, it's not actually possible to reach 24. So it's important to know that, you know, just like any search heuristic, here, the value does not have to be perfect. It just needs to bias the search toward promising directions. Also here, the prompt uses, you know, common sense reasoning and simulation, but you can really design different strategies for different problems. It's really flexible. So lastly, you can combine them together with a tree search algorithm.

Starting point is 02:07:26 Here we use a breadth-first search, which is the simplest algorithm. You have a depth of three, and you have a breadth from one to five. And the idea is very simple, right? You have an input. You generate a bunch of thoughts. You evaluate them. You only keep the top choices. So it's like a thought-level beam search, right?
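
The generate, evaluate, and keep-the-top-choices loop described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `propose` and `evaluate` are hypothetical stand-ins for the two language model prompts (here `propose` enumerates every way to combine two numbers, and `evaluate` is an exact brute-force check rather than a noisy sure/likely/impossible judgment), and `tot_bfs` is an illustrative name.

```python
from fractions import Fraction
from itertools import combinations

def propose(state):
    """Stand-in for the proposal prompt: every way to combine two remaining numbers."""
    nums, steps = state
    children = []
    for i, j in combinations(range(len(nums)), 2):
        a, b = nums[i], nums[j]
        rest = [n for k, n in enumerate(nums) if k not in (i, j)]
        options = [(a + b, f"{a} + {b}"), (a * b, f"{a} * {b}"),
                   (a - b, f"{a} - {b}"), (b - a, f"{b} - {a}")]
        if b != 0:
            options.append((a / b, f"{a} / {b}"))
        if a != 0:
            options.append((b / a, f"{b} / {a}"))
        for result, text in options:
            new_nums = tuple(sorted(rest + [result]))
            children.append((new_nums, steps + (f"{text} = {result}",)))
    return children

def reachable(nums):
    """Brute-force check: can the remaining numbers still reach 24?"""
    if len(nums) == 1:
        return nums[0] == 24
    return any(reachable(child) for child, _ in propose((nums, ())))

def evaluate(nums):
    """Stand-in for the value prompt: high value if 24 is reachable, low otherwise."""
    return 1.0 if reachable(nums) else 0.0

def tot_bfs(numbers, breadth=5):
    """Thought-level beam search: expand, score, keep the top `breadth` states."""
    frontier = [(tuple(sorted(Fraction(n) for n in numbers)), ())]
    for _ in range(len(numbers) - 1):  # depth 3 for four numbers
        candidates = [child for state in frontier for child in propose(state)]
        candidates.sort(key=lambda s: evaluate(s[0]), reverse=True)
        frontier = candidates[:breadth]
    for nums, steps in frontier:
        if nums == (Fraction(24),):
            return list(steps)
    return None

if __name__ == "__main__":
    print(tot_bfs([4, 9, 10, 13], breadth=1))
```

Because the stand-in evaluator is exact, a breadth of 1 is enough in this sketch; with a language model as the evaluator the values are noisy, which is why a wider breadth helps.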

Starting point is 02:07:42 And you keep doing that until four numbers become three numbers, three numbers become two numbers, and two numbers become one number, and you succeed if the only number is 24. So while CoT only achieved 4%, ToT with a breadth of 1 already leads to 45%, and a breadth of 5 leads to an even higher 74%. We can also use a similar idea for different algorithms and for different problems. So for example, for crosswords, right? Suppose you have five clues horizontally and five clues vertically. What you can do is you can generate a bunch of thoughts, evaluate them,

Starting point is 02:08:17 and then proceed only with the most promising choice. So that's a depth-first search, or best-first search. And you can keep doing this until the language model realizes, you know, this board is no longer solvable. Then what you do is you prune the subtree and then you backtrack, right? So you'll move on to this, but maybe this is still not solvable, so you will move on again. If none of the things works, then you go back, one level back, and then you try again.
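
A minimal sketch of this depth-first variant with pruning and backtracking follows. It is an illustration only: `tot_dfs` and the `prune_below` threshold are assumed names, and a toy constraint puzzle (placing four non-attacking queens) stands in for the crossword, with `value` playing the role of the language model's "is this still solvable?" judgment.

```python
def tot_dfs(state, propose, value, is_goal, prune_below=0.5):
    """Depth-first search over thoughts with value-based pruning and backtracking."""
    if is_goal(state):
        return state
    # Try the most promising thoughts first (the sort is stable for ties).
    children = sorted(propose(state), key=value, reverse=True)
    for child in children:
        if value(child) < prune_below:
            break  # everything after this is judged unsolvable: prune the subtree
        found = tot_dfs(child, propose, value, is_goal, prune_below)
        if found is not None:
            return found
    return None  # nothing worked at this level: backtrack one level up

# Toy problem standing in for the crossword: one queen per row on a 4x4 board.
N = 4

def propose(state):
    """Candidate next thoughts: put the next row's queen in any column."""
    return [state + (col,) for col in range(N)]

def value(state):
    """Stand-in for the evaluator: 0.0 if the newest queen is attacked, else 1.0."""
    if not state:
        return 1.0
    row, col = len(state) - 1, state[-1]
    for r, c in enumerate(state[:-1]):
        if c == col or abs(c - col) == row - r:
            return 0.0
    return 1.0

def is_goal(state):
    return len(state) == N

if __name__ == "__main__":
    print(tot_dfs((), propose, value, is_goal))
```

Setting `prune_below` to 0 disables pruning in this sketch, so every candidate is expanded regardless of how it was scored.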

Starting point is 02:08:42 So it's a very classic depth-first search. And here are the results. CoT reaches 1%, and we got 20%. But if you don't have pruning and backtracking, then it goes to 5%, which shows pruning and backtracking are very important. So in our paper we have these two games, but we also have a natural language task that's trying to write creative stories. And the intuition is also very simple, right?

Starting point is 02:09:08 So if you're a good writer, you know, you don't just write token by token, right? You will deliberately plan, you know, what are the possible plots. You will compare between them and you will select one. So similarly here, the language model will write, you know, a bunch of diverse plans, then self-evaluate, you know, what is a good plan, then proceed with that. And you can do this kind of search for writing, and then humans will find it, you know, more creative than the CoT writing. But the writing is too complex and long, so I cannot display it here.

Starting point is 02:09:41 So what I want to say is, you know, across those different tasks with very different reasoning challenges, the modular design of ToT allows us to have very flexible, you know, ways to generate, evaluate, and search thoughts across very general and diverse tasks. And we are doing so in a very systematic framework and achieve very good performance without retraining any models, so it's very convenient to use. So we believe this is an initial step toward, you know, connecting old insights and new frontiers of AI. So here, you know, tree search, one of the oldest ideas in AI, helps language models do more deliberate reasoning.

Starting point is 02:10:20 While language models, you know, provide search with very flexible, general-purpose, and powerful heuristics. So we have this follow-up effort trying to connect cognitive architectures to language model-based agents. So those are systems that do not just reason internally but also interact with the external world and learn through such interaction continuously, so it's like autonomous agents. So we have this follow-up paper called CoALA, Cognitive Architectures for Language Agents. I highly encourage you guys to check it out. And I thank my co-authors. Thank you guys for listening. Check out the poster today, and happy to chat. Thank you so much. I do

Starting point is 02:11:05 like it when people come with a general enough model that you can customize it, specialize it to recover smaller effects that other people have found. So you can from the Tree of Thoughts paper recover something like the backspace token model or recover skeleton of thought, chain of thought, whatever of thought. I don't care, I can't keep track anymore. Anyway, so I caught up with him at his poster session and here's a bit of our chat. You can hold it up. Alright, thank you. So the TLDR of this paper is very simple. Large language models and search complement each other. So what's wrong with just using the large language models without search?

Starting point is 02:11:42 Everyone's familiar with chain of thought. Okay, so suppose you're trying to solve this Game of 24, where given four numbers, you try to combine them to get 24. Okay, so you can give GPT-4 this task instruction. You give it a couple of CoT examples, but the performance is really low, it's 4%. Why is it so hard, right? So that's because this problem, like,

Starting point is 02:12:08 intrinsically needs exploration. So let's take a look at the initial example, right? So the model is making local token decisions, right? It first generates 10, then it generates times, then it generates six. But it's very hard to decide those initial tokens, even for humans. You don't really know whether the first token should be 10

Starting point is 02:12:31 or 5 or 6, or whether they are equally good. That's really hard to decide. But what's worse is, once you decide on the wrong token at the beginning, the task is already failed. So in this particular example, if you generate 10 and times, this task is already failed. Because no matter what times 10,

Starting point is 02:12:51 you cannot get the three remaining numbers to reach 24. So the intuition is that autoregressive inference is like you keep making those local token decisions one by one, left to right, without lookahead, without backtracking. And it's not very robust when you don't have good local signals

Starting point is 02:13:15 to guide you through that kind of process. So another analogy will be, suppose you're training a robot that's trying to navigate mazes. If you only train them on successful trajectories, and you only train them to predict the next move, and you do this local imitation and you put them in a new maze that requires exploration,

Starting point is 02:13:40 then it probably won't solve the new maze. So obviously some kind of search is needed. But why is this a 2023 work, given that search has been around since the 1940s and 1950s? That's because classical search problems like chess, they have a small fixed set of next moves, what we call the search space. That makes it easy

Starting point is 02:14:05 to design or to learn search heuristics to guide the search. But here, for those kind of open-ended reasoning, the next move can be anything. It could be a token, it could be a sentence, it could be a paragraph, and it's impossible to enumerate this huge space or to design evaluations. So the key point here is

Starting point is 02:14:32 you want to really define what is the search space first. You can consider two extremes first. So on one extreme, you can define a thought as the next token. Then you will be searching in a tree of tokens, something like beam search. Then the problem is it's very easy to generate tokens, right? But it's very hard to evaluate tokens. You don't really know whether 10 is good or 13 is good or whatever. On the other extreme, you can define a thought as the whole reasoning. Then it's very easy to evaluate the thoughts. You just look at if the final number is 24 or not.

Starting point is 02:15:18 But in this case, it will be very hard to generate a good thought. Otherwise, the task is solved already. So in this case, it seems like the right balance is you define each of the intermediate steps as a thought. So you can do something like, you tell the language model, here are some numbers, come up with some different ways to combine two of the numbers. You can generate a bunch of thoughts.

Starting point is 02:15:41 And for each of them, you can do something like this. You can say, try a few rounds, can you reach 24? If not, try to assign a value based on that. So, within three trials, if you can already reach 24, then this thought has very high value. If it couldn't reach 24, but maybe you could reach maybe 25 or 26, okay, maybe a medium value is given. But if this is something like one, two, three, and then you can only reach six or four,

Starting point is 02:16:07 then maybe a low value is given. So this value is not perfect, and it does not need to be perfect, just like any search heuristic. It just needs to bias the search towards promising directions. So the point is, once you define this search space, you can generate and evaluate next moves using large language models, and then you can systematically maintain them using the tree search algorithm. And we show across diverse tasks, they significantly improve the task performance, and it's very easy to use,

Starting point is 02:16:39 you don't need to train any new models. Everything is done with GPT-4. Pretty elegant. So I like that comparison to beam search. This is like a level of abstraction above that, with the atomic unit being a thought. Yeah. A thought here, you illustrated being an equation,

Starting point is 02:16:59 but here you have an equation, a clue word, like here in the examples. Yeah. Do you have a planning stage in order to plan out the thought steps, right? Like here you have thought steps of three, thought steps of five to ten, thought steps of one. Usually when people design agents, they'll have like a planner. But I don't see a planner here. That's a great question.

Starting point is 02:17:22 You will notice here for those two games, the thought steps are kind of homogeneous because every step you're just trying to come up with a new equation, or you're trying to come up with a new clue. So in this case, you don't really need planning. You can just use one generation prompt and one evaluation prompt, and use that across different steps. But for something more complicated, where for each thought step, you might do different things.

Starting point is 02:17:46 Then you probably need to plan ahead and maybe design different prompts for different kinds of generation and different kinds of evaluation. Got it. So do you also see this being able to be combined with self-consistency? Because in a way your judge is self-consistency. That's a great idea and we did that. So here, like in this creative writing task, what we do for evaluation is,

Starting point is 02:18:14 here is a task instruction. Here are some of the plans. Think about the best plan and come out with the ID, right? So if you just do this one time, you just give one vote. It's very noisy. So you can apply something like self-consistency, right? You can do like 10 different votes or 100 different votes.
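
The repeated-voting idea can be sketched as follows. This is a hedged illustration, not the paper's code: `vote_best_plan` and `ask_for_vote` are hypothetical names, and the voter below is a deterministic stub standing in for one noisy language model call that returns the ID of the plan it prefers.

```python
from collections import Counter
from itertools import cycle

def vote_best_plan(plans, ask_for_vote, n_votes=10):
    """Sample several independent votes and return the most-voted plan ID."""
    votes = Counter()
    for _ in range(n_votes):
        plan_id = ask_for_vote(plans)
        if 0 <= plan_id < len(plans):  # ignore malformed votes
            votes[plan_id] += 1
    best_id, _ = votes.most_common(1)[0]
    return best_id, votes

# Stub voter: a fixed, repeating sequence of "noisy" votes (an assumption for the demo).
_vote_stream = cycle([1, 0, 1, 2, 1])

def fake_vote(plans):
    return next(_vote_stream)

if __name__ == "__main__":
    plans = ["plan A", "plan B", "plan C"]
    best, tally = vote_best_plan(plans, fake_vote, n_votes=10)
    print(best, dict(tally))
```

The number of votes is the knob mentioned in the conversation: more votes cost more model calls but make the aggregate judgment less noisy.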

Starting point is 02:18:39 And then the evaluation will become more faithful. Yeah. And that's kind of a hyperparameter you can choose. It's like if you want better performance, you can spend more money and try to do that more. Like a post-generation layer? It's like a stepwise democracy, I guess. Okay, so one more question about just in general Princeton NLP. How is it organized?

Starting point is 02:18:59 What should people know about the Princeton program? Because I feel like you guys are very productive, and how are you so productive? Like what's the backstory to Tree of Thoughts maybe? I think one thing that's good about Princeton is it's a kind of small school. I've been there, it's not that small. I mean, compared to Harvard or MIT.

Starting point is 02:19:22 You have a lot of interdisciplinary kind of collaborations. So I did this with cognitive science professors. I think this kind of idea across different fields is very important, right? So usually in NLP we don't consider tasks like that. Yeah. That's classical search, right? So I think it's very useful to combine ideas from different fields and that could be a way to come up with new ideas. Yeah. Are a lot of people asking you about like Q Star stuff? No. No comments. No comments. Okay, well thank you very much. This is a great paper. So perhaps one paper that made a bigger splash than Tree of Thoughts earlier in this year was Toolformer,

Starting point is 02:20:01 where we started really considering the myriad ways that we can train language models to use tools. So here's the Toolformer oral. Hi everyone. My name is Jane. I'm a researcher from FAIR Labs at Meta, and today I'm super excited to be presenting to you Toolformer: Language Models Can Teach Themselves to Use Tools. And the reason we might want language models like ChatGPT to have access to external tools is exemplified by these three queries. In the first two cases, I've asked who is the current president

Starting point is 02:20:30 and what day of the week is it today? And ChatGPT basically says it doesn't have real-time data or access to current time or date information. And in the final query, I've asked it to do a simple set of computations, but ChatGPT unfortunately hallucinates an answer that's about 300 off from the real answer. And what we really could have used here is access to external tools. For example, a QA system that has up-to-date information,

Starting point is 02:20:54 a calendar tool which has the time or date, and a calculator tool, which is designed specifically to do these simple computations perfectly. And so for Toolformer, we have five tools at our disposal. We have a QA system with up-to-date information. We have a Wikipedia search tool, which is able to search Wikipedia. We have a calculator tool, a calendar tool, which has the current day of the week and the date, and we have a translation tool which takes in text and puts it back into English. And so in choosing these five tools, we really wanted a set of tools that is not only diverse,

Starting point is 02:21:27 but is also going to be likely useful to the language model. And what we want to train a model to learn is not only which of these five tools to use, but when to use that particular tool and how to use that tool all on its own without human annotation. And the way we do this is by taking natural language text, like Pittsburgh is known as the Steel City, and augmenting that text with API or tool calls. So, for example, here, a useful API call would be to the QA system with the question, what other name is Pittsburgh known by? And this is useful because it's useful in anticipating the remainder of the text,

Starting point is 02:22:03 which is the Steel City. And we represent an API or tool call with natural language. We use square brackets, followed by the tool name. And then in round parentheses, we have the input to that tool, followed by a right arrow, which is followed by the output of the tool with that query. And with that, the steps to creating Toolformer are pretty simple.
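
The bracketed call notation just described (tool name, input in parentheses, right arrow, output) can be illustrated with a small formatter and parser. This is a sketch under assumptions: `format_call`, `parse_calls`, and the regular expression are illustrative, not the paper's code, and the pattern does not handle nested parentheses or brackets inside arguments.

```python
import re

# [Tool(input)] or [Tool(input) → output]; group 3 is None for a non-executed call.
CALL_RE = re.compile(r"\[(\w+)\((.*?)\)(?:\s*→\s*(.*?))?\]")

def format_call(tool, arg, result=None):
    """Render a tool call inline, with or without the tool's output."""
    call = f"{tool}({arg})"
    return f"[{call} → {result}]" if result is not None else f"[{call}]"

def parse_calls(text):
    """Extract (tool, input, output) triples from augmented text."""
    return [(m.group(1), m.group(2), m.group(3)) for m in CALL_RE.finditer(text)]

if __name__ == "__main__":
    call = format_call("QA", "What other name is Pittsburgh known by?", "Steel City")
    augmented = f"Pittsburgh is known as {call} the Steel City."
    print(augmented)
    print(parse_calls(augmented))
```

Keeping the call as plain text inside the sentence is what lets an ordinary language modeling objective be applied to the augmented data without any change to the model's vocabulary or architecture.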

Starting point is 02:22:25 In the first step, we want to create a new training data set augmented with these API calls that I just showed you on the previous slide. And in the second step, we want to fine-tune GPT-J, our base model, on this new data set. And this fine-tuned model is the model that we refer to as Toolformer. Now, to create that training data set, we have three simple steps, which I'll get into in just a second.

Starting point is 02:22:48 But first, we want to start out with a standard language modeling data set like CCNet. And the reason we want to start here is because we don't want to disrupt any of the core language modeling capabilities that the model may already have. And so in using a data set the same as or similar to what it's seen before, we minimize this risk as much as possible. Okay, so let's go into the first step, which is to generate API calls. And to do so, we show the model a simple prompt. We say, your task is to add calls to a question answering API to

Starting point is 02:23:18 a piece of text. The question should help you get information required to complete the text. And then we explain the format of the API call that we want, and we show it a couple of examples. I only have one example here, but we would put as many examples as can fit into the context window. And then we show the input that we actually want to do inference

Starting point is 02:23:37 on, and we let the model generate. And here, I'm only showing you the question answering API prompt, but you can imagine that we do a very similar thing for the other four tools. Okay, so let's look at a couple of generated API call examples for the input "Pittsburgh is known as the Steel City." And here the model has generated: in which state is Pittsburgh, what other name is Pittsburgh known by, and what is the second city in Pennsylvania?

Starting point is 02:24:04 And so from these generations, you can see that we get a mix of relevant API calls, non-relevant API calls, and also some that don't make a lot of sense, like the last one. And now for the second step where we actually try to execute those API calls. So what we do here is we take that natural language string, we parse it for the input parameters, we send it to the relevant tool, and we get an output from the tool. Now using those outputs, we want to put them back into the embedded API call, and we indicate this with a right arrow followed by the output. And this is also the step where we would filter out generated API calls that are ill-formatted

Starting point is 02:24:41 or don't actually return a result from the tool. And additionally, we also want to filter out API calls that aren't actually useful to the model. So the way we want to think about usefulness is whether it's useful for anticipating the remainder of the text, as I showed you earlier. And the way we quantify usefulness is through model-based perplexity. And perplexity is basically the negative log likelihood of the remainder of the text given the prefix of the text.

Starting point is 02:25:08 So basically you want the lowest perplexity possible because you want the model to be least perplexed about what it's about to see. So here we evaluate perplexity under three different settings. The first setting is where we don't have any API call. So here, the prefix would just be Pittsburgh is known as. In the second setting, we have the non-executed API call, where we have the API call, but we're not actually going to put the result from the tool yet.

Starting point is 02:25:36 And then finally, we have the full executed API call, where we have the API call and its corresponding output. And intuitively, what we want here is for the perplexity for setting C to be much lower than either A or B, because not only do we want the generated API call to be useful, but we also want the results from the tool to be really useful. So this is exactly how we evaluate usefulness. It's the minimum of the perplexity of either under A or B, minus the perplexity of C. So we want that difference to be as large as possible.
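
The usefulness score described here is a simple difference of losses, which can be written down directly. The sketch below assumes the loss is a plain sum of negative token log-probabilities over the remainder of the text (the talk does not specify any weighting or averaging), and the function names and example numbers are hypothetical.

```python
def nll(token_logprobs):
    """Negative log-likelihood of the remaining tokens, given their log-probabilities."""
    return -sum(token_logprobs)

def usefulness(loss_no_call, loss_call_only, loss_call_with_result):
    """min(loss under A, loss under B) minus loss under C; larger is better."""
    return min(loss_no_call, loss_call_only) - loss_call_with_result

def keep_call(score, threshold=1.0):
    """Keep an API call only if its usefulness clears the filtering threshold."""
    return score >= threshold

if __name__ == "__main__":
    # Made-up token log-probabilities for the same continuation under each setting.
    loss_a = nll([-1.2, -1.0, -0.8])   # A: no API call in the prefix
    loss_b = nll([-1.3, -1.0, -0.9])   # B: API call without its result
    loss_c = nll([-0.7, -0.6, -0.4])   # C: API call with its result
    score = usefulness(loss_a, loss_b, loss_c)
    print(round(score, 2), keep_call(score, threshold=1.0))
```

Taking the minimum over settings A and B means a call only scores well if the tool's result helps beyond whatever the call text alone contributes.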

Starting point is 02:26:12 And here we have a pretty sizable usefulness score of 1.3, which is pretty good. But to give you more context, here's another example from the calendar tool. It says the WL will be open on Friday. And the calendar tool tells us that today is Thursday, March 9th. And from this we can kind of infer that Friday is going to be March 10th.

Starting point is 02:26:31 So this gets a high usefulness score of 2.11. On the other hand, we have this example from the calculator tool. The model has seen these two numbers, 85 patients and 23%, and it thinks maybe the ratio is going to be useful. But unfortunately, that's not the case, and it gets a low usefulness score of negative 0.02. So this would likely be filtered out in our final third step. Now here I'm showing you the number of examples that remain after this filtering process. For two different kinds of thresholds, we have in light blue, 0.5, and in dark blue we have 1.0.

Starting point is 02:27:06 Obviously, you're going to get a lot more examples left over if you use a less stringent threshold, 0.5. But the other thing you can see here is that we have the most examples from the Wikipedia search tool, whereas for calculator and machine translation, we have the fewest examples. And now what we do here is we cap the number of examples per tool at 25,000, and we put it all together in one big dataset. And with that data set, we fine-tune our base model, GPT-J, and this fine-tuned model is what we refer to as Toolformer.

Starting point is 02:27:38 Now, to evaluate Toolformer, we want to evaluate on a range of tasks where we think at least one of the tools is going to be useful. So we have fact completion and question answering. We also have math computations and multilingual questions where the context is given in English, but the question can be in a different language. And we also have temporal questions like how many days is it until Christmas, where you need to know the current time or date in order to answer the question. Now here are the results for those five tasks.

Starting point is 02:28:08 We have three different models. We have GPT-J, which is the base model, Toolformer, and GPT-3, which is a 175 billion parameter model. And what you can see here is that in almost all cases, Toolformer is outperforming GPT-J, but it's also outperforming GPT-3, even though it's about 30 times smaller than GPT-3. And an exception to this is the question answering task, where we actually disabled the QA system. And this is because there's a lot of overlap between the training set of the

Starting point is 02:28:35 QA system and our evaluation tasks, so we thought this would be too much of an advantage if we enabled that tool. The second anomaly is the multilingual task where we don't see a lot of benefit from the translation tool, and we think this is likely because GPT-J has already seen a lot of multilingual text and isn't getting a whole lot of benefit from actually using that tool. But regardless, we see that Toolformer is either on par with GPT-J or outperforming GPT-J. And the second thing that we want to look at is whether or not small models can effectively use tools.

Starting point is 02:29:08 So in other words, is there a minimum size requirement at which models are able to effectively use tools? So to investigate this, we applied the same kind of pipeline to the family of GPT-2 models. So there are four of them, and in total, we have five different models at various sizes, which I'm showing you on the x-axis. And on the y-axis, we have model performance.

Starting point is 02:29:30 And in blue, we have Toolformer. And in red, we have Toolformer disabled, where we use constrained decoding to prevent the usage of tools. And as you can see, in the smallest two sizes, we don't see any performance difference between Toolformer and Toolformer disabled, meaning that Toolformer is not able to make use

Starting point is 02:29:47 of those five tools to its fullest. But once we get to 775 million parameters, we see a performance gap emerging, and this gets bigger and sustained for the rest of the sizes. And this is a similar thing that we see with the math benchmarks. It seems that tool usage is really emerging at 775 million parameters. For the question answering benchmarks, we don't see this as clearly, and we think that maybe

Starting point is 02:30:11 this is likely because the QA system and the Wikipedia search tool are easier tools to use, and so you don't need a more capable model to be able to understand how to use them effectively. And finally, we also want to revisit the question of whether or not Toolformer is a good language model. We originally used the data set CCNet because we didn't want to disrupt any of the core language modeling capabilities. And so now we revisit that question by looking at perplexity on a held-out set of WikiText and CCNet.

Starting point is 02:30:40 And here we have three different models, GPT-J, GPT-J further fine-tuned on CCNet, and Toolformer, which is further fine-tuned on CCNet augmented with those API calls. And what we find is that the perplexity is pretty much on par with the base model and the further fine-tuned one. We don't see a whole lot of difference.

Starting point is 02:30:59 And so we feel pretty encouraged that even though this data set may look a bit unnatural with these API calls, it doesn't actually harm the core language modeling capabilities here. So thank you for listening to this talk. Please check out our paper at this QR code. We have a poster in the next poster section. We are poster number 332, and I will be there with Roberta.

Starting point is 02:31:21 Please feel free to reach out to any of the co-authors and me. I'm happy to take questions now or later. Thanks. When I look at all the relevant papers for AI engineers this year, there's the chain of thought papers and the tool use papers, two of which we just covered. But something that I think incorporates all of them and adds a few ideas that are unique and notable is the Voyager paper from NVIDIA. And even though it was released in the first half of the year, people are still talking about it today.

Starting point is 02:31:55 It's still shaping people's mental perceptions of how they want to build their LLM architectures. It was somehow not accepted for posters or oral sessions at this year's NeurIPS. It's kind of a mystery as to why. I did chat with Jim and I'm still not really sure what's going on there. But it would have been my vote for Best Paper because it's so foundational and established such a strong baseline for everyone else to build on top of LLMs. And anyway, so there were some workshop presentations about Voyager with the first author, so here it is. My name is Guanzhi Wang.

Starting point is 02:32:33 Currently I'm a third-year PhD student at Caltech. I'm also a research intern at NVIDIA. I'm very happy to present Voyager, an open-ended embodied agent with large language models. This year, GPT-4 came, a large language model that's so good at coding and long-horizon planning, so we built Voyager, the first large language model powered lifelong learning agent. When we set Voyager loose in Minecraft, it is able to play the game for hours on end without any human intervention. The video here shows snippets from a single episode of Voyager.

Starting point is 02:33:07 So it explores the terrains, mines all kinds of materials, fights monsters, crafts hundreds of recipes, and unlocks an ever-expanding tree of skills. If you want to use the full power of GPT-4, a central question is how do we stringify things? In other words, how do we convert this embodied environment with multimodal observation and action space into pure text? We need a magic box. And thankfully, the enthusiastic Minecraft community already built one. It's called Mineflayer, a high-level JavaScript API that's actively maintained to work with every Minecraft version. The beauty of Mineflayer is that it has access

Starting point is 02:33:47 to the game state surrounding the agent like the nearby blocks, animals, and enemies. So we effectively have a ground-truth perception module as a textual channel. Now that we convert everything to text, we are ready to construct an agent algorithm on top of GPT-4. And on the high level, there are three components. First, a coding module that writes JavaScript to control the game bot. It's the main module that generates executable actions. Second, we have a code base to store the correctly written code and look it up in the future if the agent needs to recall the skill. In this way, we don't duplicate coding efforts and achieve a form of learning without gradient descent. Third, we have a curriculum that

Starting point is 02:34:32 proposes what to do next given the agent's current capabilities. So when we wire them up together, we get a loop that drives the agent indefinitely and achieves something like lifelong learning. So let's zoom in on the center module. We prompt GPT-4 with documentation and examples on how to use a subset of the Mineflayer API. Then GPT-4 writes code to take actions given the current assigned task. And because JavaScript runs in a code interpreter, GPT-4 can define new functions on the fly and run them interactively. But GPT-4 isn't always able to get the code right on the first try. We develop an iterative prompting mechanism

Starting point is 02:35:13 to refine the program. There are three types of feedback. First, the environment feedback, like what new materials did you get after taking an action? Second, the execution error from the JavaScript interpreter, like an undefined variable error. And we have another GPT-4 that provides

Starting point is 02:35:33 critique through self-reflection from the agent's own states. So these three components help the agent refine the program effectively. I want to show some examples of how the critique module provides feedback on the task completion progress. In the first example, the task is to craft a spyglass. So GPT-4 looks at the agent's inventory and decides that it has enough copper, but not enough amethyst. Second task is to kill three sheep to collect food, so each sheep drops one unit of white wool, but there are only two units in the inventory, so one more sheep to go.

Starting point is 02:36:13 Last example, killing the zombie drops a unit of rotten flesh, which is in the inventory, so GPT-4 determines that the task is successful and moves on. So this critique procedure is repeated until the task is deemed successful or hits the time limit. Now moving on to the second part. Once it implements a skill correctly, we save it to persistent storage. So think of it as a skill library that's authored purely by GPT-4 through trial and error. Then the agent can retrieve the skills from the library when facing similar situations in the future. So it doesn't need to write them again. In this way, Voyager improves itself as it experiences more and more in Minecraft.
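
The iterative prompting mechanism described above, with its three feedback channels (environment feedback, execution errors, and a self-critique), can be sketched as a simple loop. This is an assumption-laden illustration rather than Voyager's code: `refine_until_success` and its callable parameters are hypothetical, a round limit stands in for the time limit, and the three stubs replace the GPT-4 calls and the game.

```python
def refine_until_success(task, write_code, execute, critique, max_rounds=4):
    """Regenerate a program with accumulated feedback until the critic accepts it."""
    feedback = {"env": None, "error": None, "critique": None}
    for round_no in range(1, max_rounds + 1):
        code = write_code(task, feedback)
        env_feedback, error = execute(code)
        if error is None:
            success, comment = critique(task, env_feedback)
        else:
            success, comment = False, None
        if success:
            return code, round_no  # a correct skill, ready to be stored
        feedback = {"env": env_feedback, "error": error, "critique": comment}
    return None, max_rounds

# Deterministic stubs standing in for the code writer, the game, and the critic.
def fake_write_code(task, feedback):
    return "draft_1" if feedback["error"] is None else "draft_2"

def fake_execute(code):
    if code == "draft_1":
        return "", "ReferenceError: x is not defined"
    return "inventory: spyglass x1", None

def fake_critique(task, env_feedback):
    done = "spyglass" in env_feedback
    return done, "task complete" if done else "spyglass not in inventory"

if __name__ == "__main__":
    print(refine_until_success("craft a spyglass", fake_write_code,
                               fake_execute, fake_critique))
```

In this sketch the program returned on success is what would be saved into the skill library described next.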

Starting point is 02:37:03 Let's dive a bit deeper into how the skill library is implemented. So this is how we insert a new skill. First, we use GPT-3.5 to summarize the program into plain English. So summarization is very easy and doesn't need GPT-4, so we save some money here. Then the embedding of the summary becomes the key, and the program becomes the value, which we insert into a vector database. We find it better to embed the description instead of the raw program, because it's more semantic

Starting point is 02:37:38 and improves the retrieval. Now, when Voyager is faced with a new task, let's say, craft iron pickaxe, we use GPT-3.5 to generate a hint on how to solve the task and combine it with the world state as the query context. Then we do the embedding and retrieve the top five relevant skills from the skill library.
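
To make the key/value layout concrete, here is a small sketch of a skill library where the embedded description is the key and the program is the value, with top-k retrieval by similarity. Everything here is an assumption for illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and the class and method names are hypothetical rather than Voyager's actual interface.

```python
import math
from collections import Counter
from typing import List, Tuple


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SkillLibrary:
    """Hypothetical in-memory skill store: key = embedded description, value = program."""

    def __init__(self) -> None:
        self._entries: List[Tuple[Counter, str]] = []

    def add(self, description: str, program: str) -> None:
        # Embed the plain-English summary, not the raw program text.
        self._entries.append((toy_embed(description), program))

    def retrieve(self, query: str, top_k: int = 5) -> List[str]:
        # The query would be the generated hint combined with the world state.
        q = toy_embed(query)
        ranked = sorted(self._entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [program for _, program in ranked[:top_k]]
```

A usage pattern would be `lib.add(summary, program)` after a task is verified, then `lib.retrieve(hint_plus_state, top_k=5)` when a new task arrives.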

Starting point is 02:38:08 So Voyager is free to directly use one of the skills as is, or interpolate among the five, or rewrite one from scratch. In this way, we maximally reuse the old experiences. Think of it as an in-context replay buffer in reinforcement learning terminology. Now, we'll go on to the third part. We have yet another GPT-4 that proposes what task to do

Starting point is 02:38:35 given its own capability at the moment. The curriculum has an unsupervised objective, which is to maximize the number of novel items that the agent obtains. There are two key insights here. First, it's kind of curiosity-driven exploration, or novelty search in prior literature, but implemented purely in context. Oh, sorry. And second, it's a situation where the curriculum naturally gets progressively

Starting point is 02:39:03 harder over time, without any manual prescription from us. So let's go through a working example together. The agent finds its hunger bar dropping to 1 out of 20, so it needs to find food. Now it senses four entities nearby: a cat, a villager, a pig, and some wheat seeds. So it starts an inner monologue. Do I kill the cat or villager? Bad idea. How about the wheat seeds? I can grow a farm, but it's going to take a long time. So sorry, Piggy, you are the chosen one. It checks the inventory and retrieves an old skill from the library to craft an iron sword and then starts to learn a new skill called hunting. Now, we also know that Voyager isn't vegetarian, unfortunately. So putting all the pieces together, we have an

Starting point is 02:39:55 iterative prompting mechanism that refines a program by self-debugging, a skill library as an in-context replay buffer, and an automatic curriculum as in-context curiosity-driven exploration. This is Voyager's no-gradient architecture, where we don't train any new model or fine-tune any parameters. It allows Voyager to self-bootstrap and perform lifelong learning in an open-ended world. So these are the tasks that Voyager happens to pick up along the way. We didn't pre-program any of this.

Starting point is 02:40:31 It's all Voyager's idea. The agent is forever curious and forever pursuing new adventures. We've done a lot of systematic study for Voyager, and here is the quantitative learning curve. The x-axis is the number of prompting iterations, and the y-axis is the number of unique items obtained by each agent. We compare with three baselines: ReAct, Reflexion, and AutoGPT. All of these are no-gradient architectures on top of GPT-4.

Starting point is 02:41:08 ReAct is a very simple reasoning and acting loop. And Reflexion is built on top of ReAct with self-reflection. We see that both struggle to make progress beyond the basic wooden tools. And AutoGPT is a popular software repo. It combines ReAct and a task planner that decomposes an objective into sub-goals. It makes more progress but is very slow. And this is Voyager. It is able to obtain three times more novel items

Starting point is 02:41:37 than the prior methods and unlock the tech tree significantly faster, from wooden to stone to iron to diamond. The blue curve here is an ablation without the skill library, which plateaus after a while. So basically the skill library is essential for Voyager's lifelong learning capabilities. Here are two bird's-eye views of Minecraft maps. So these circles are what the prior methods explore,

Starting point is 02:42:05 given the same prompting iteration budget. You can see that they tend to get stuck in local areas. Voyager is able to navigate distances two times longer compared to prior works. It has to visit more diverse terrains in order to find more novel items quickly. Finally, one limitation is that Voyager does not currently support visual perception, because GPT-4 was text-only when we were developing Voyager, but there's nothing stopping Voyager from using a multimodal model to achieve more impressive tasks. And here we demonstrate that given human feedback,

Starting point is 02:42:36 Voyager is able to construct complex 3D structures in Minecraft, such as a house and a Nether portal. We basically use the human to replace the critic module of Voyager and provide 3D spatial advice. So to build very complex structures, we definitely need some full-blown multimodal models, and I will leave that to future work. This is Voyager's website at voyager.minedojo.org.

Starting point is 02:43:01 We open-source everything, including the environment, algorithm, prompts, and pre-trained skill libraries. Finally, I want to acknowledge all the team members for Voyager. This work would not be possible without their help. So please feel free to reach out if you have any questions. Thanks. I think the last component of agents, apart from chain of thought and tool use that I wrote up in the Anatomy of Autonomy write-up in April, is the need for better planning.

Starting point is 02:43:33 And I think one of the most interesting or challenging pieces of NeurIPS, depending how you look at it, is doing poster diving, where instead of going to all the oral sessions, which have been curated by track committees and all that, you just go and walk the halls and look for posters and look for papers and people that are underrated or have been overlooked. And in fact, the original Attention Is All You Need Transformers paper was one such paper, where they were just a poster-only paper, apparently. From walking the halls in the poster sessions, my pick for underrated paper was Ida Momennejad from Microsoft Research with CogEval. Ida was very confident and professorial in her presentation, made it engaging, made it a quiz.

Starting point is 02:44:17 Some parts of the quiz are visual, so if you're listening along and you want to solve it alongside us, you should probably pull up the show notes and check out the graphs that I'm going to paste inside of the show notes. But otherwise, she just made it very engaging for people to follow along. I'm not kidding, there was a group of 10 or 20 of us way back in the halls in the poster sessions where a lot of people don't really end up going. And we were just there for like half an hour while she was giving her

Starting point is 02:44:43 impromptu lecture about CogEval. And I do think that this is notable because it is potentially a quantifiable benchmark for reasoning and planning capabilities that currently all the language models don't do very well. And framing it as a graph problem helps us generalize to all sorts of reasoning,

Starting point is 02:45:01 planning, and search situations. And I just like that it was really well presented. This is obviously a benchmark paper, so there's no solutions proposed, but she has another paper that she's working on that has some of her solutions. So LLMs are ubiquitous, and a lot of people claim that they can plan or they're going to plan to take over the world. But first things first, can they actually plan? I have 15 years of experience working in reinforcement learning and cognitive science and in neuroscience, evaluating planning in humans and brains and reinforcement learning models. So I thought, okay, let's apply that.

Starting point is 02:45:33 In order to accurately evaluate whether a cognitive capacity exists in an agent or in a biological system, there needs to be a systematic protocol to evaluate it. Inspired by cognitive science, we have two contributions here. First, we introduce CogEval, a systematic protocol for evaluating cognitive capacities. What that means is you need to operationalize a particular latent ability in terms of multiple tasks that can be measured. And these measurements need to unconfound, or decouple, certain confounds from what is actually being measured in terms of that cognitive ability. So for instance, if you give it some simple situations, it might be that it solves them,

Starting point is 02:46:13 but you can't declare victory unless you show that the tasks that you have created somehow capture different aspects of the cognitive latent ability that you are measuring. Second, you want to operationalize it in terms of different structures, different domains, and different tasks. You don't want to measure one or two things in one or two environments and with an anecdote declare that something works or something exists. So here, for instance, what you have is different graph structures. I have six structures that I'll show you. Different domains: I'll show you the spatial domain.

Starting point is 02:46:46 For instance, if I ask you for planning, I could ask you, how do you go to Hall C from here? Or I could give you information like: Ali is friends with Michael, Michael is friends with Mary, Mary is friends with Sue. If Ali wants to pass a message to Sue, what is the path, for instance, right? That's planning in the social domain. So social and spatial domains, different domains, and also task conditions. We use 15 different tasks.

Starting point is 02:47:07 These are inspired by various tasks that I have designed in the past; you can look at these two papers and others. This goes back 100 years, to the tradition started by Edward Tolman. Cognitive Maps in Rats and Men, his 1948 review paper, reviews 20 years of research, where it shows behaviorally how to measure whether an entity, in that case he was measuring rats, possesses a cognitive map.

Starting point is 02:47:30 It was a revolutionary result at the time, because it went against the behaviorist dogma of the time that you need a reward to learn structures. It showed that no, rats can learn the cognitive map of the environment even if you don't give them rewards. Okay, come back to present day: 15 tasks in five different categories. The goal is to evaluate systematically whether LLMs can extract the cognitive map from descriptions of an environment. And what does that mean? It means, similar to Tolman from 100 years ago and that tradition: can it solve particular tasks?

Starting point is 02:48:00 Is it robust to certain tasks? Can it do flexible planning in response to different kinds of tasks where you have maybe brief local changes to the environment, like a reward location changed or one edge changed? Can it integrate those to accurately plan, for instance? And we have these different graph structures. Just to give you an example of how it goes: for graph A, where the domain is spatial and the task is value-based planning, what would it look like? I would describe the graph to the LLM as: imagine a building with six rooms. From the lobby,

Starting point is 02:48:36 you have two choices: you go to room one or two. From room one, there is a door to room three; from room three, there is a door to room five. In room five, there is $10. You don't take any money, because at the end you only have one possibility to take money. You go back. From room two, you can go to room four and then to room six, and in room six,

Starting point is 02:48:53 there's $50. And this was the description of the environment. Then the question is: you return to the lobby, and you can only take money once. What is the optimal room to choose in order to take the most money? And you should say two, because six has the most money, right? So all of these environments are described in that way, either in the spatial domain or the social domain,

Starting point is 02:49:14 and the different tasks are prompted like this. For cases where something in the environment changes, you can see how the second prompt, for instance, modifies something. You say, oh, now you learn that the reward in this room changed to such and such. Oh, now you learn that the door to this room has been changed and it all of a sudden opens to this other one. Okay, now with that, please don't look here. I don't want you guys to cheat.
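
For reference, the graph A example described a moment ago (lobby, two corridors, $10 in room five, $50 in room six) is small enough to solve by exhaustive search. The sketch below is only an illustration of the value-based planning question, not the CogEval code; the names `GRAPH_A`, `REWARD`, `best_reachable_reward`, and `best_first_move` are hypothetical, and the recursion assumes the graph has no cycles.

```python
from typing import Dict, List

# Graph A as described in the talk: two corridors leaving the lobby.
GRAPH_A: Dict[str, List[str]] = {
    "lobby": ["room1", "room2"],
    "room1": ["room3"],
    "room3": ["room5"],
    "room5": [],
    "room2": ["room4"],
    "room4": ["room6"],
    "room6": [],
}
REWARD: Dict[str, int] = {"room5": 10, "room6": 50}


def best_reachable_reward(graph: Dict[str, List[str]], reward: Dict[str, int], node: str) -> int:
    """Largest single reward reachable from `node` (money can be taken only once).

    Assumes the graph is acyclic, as in this corridor-style example.
    """
    here = reward.get(node, 0)
    return max([here] + [best_reachable_reward(graph, reward, nxt) for nxt in graph[node]])


def best_first_move(graph: Dict[str, List[str]], reward: Dict[str, int], start: str = "lobby") -> str:
    """Pick the neighbor of `start` that leads to the largest reachable reward."""
    return max(graph[start], key=lambda nxt: best_reachable_reward(graph, reward, nxt))
```

The "local change" prompts then correspond to editing `REWARD` or one adjacency list and asking the same question again.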

Starting point is 02:49:35 And I know you guys might have heard things, but forget everything you heard. Between these three, which one do you think is going to be the most difficult? And why? So the choice is A, B, and C. A, B and C. Which graph is it going to be difficult, or are they going to be the same? In terms of for the LLM to solve? They're similar.

Starting point is 02:49:52 So B has more branching and C has more length. So which one is going to be more difficult to solve for LLMs? You can say different things, and we can see who is right. I don't know the answer. I'll guess B because more branching. Okay, anybody guesses anything else? Probably C, because I guess the LLM is not going to handle a very long sequence. So we have two hypotheses here. Anybody think they're the same?

Starting point is 02:50:16 So between A, B, and C, which one is more difficult, or are they the same? I just don't understand. What kind of problem am I trying to solve? This problem that we just mentioned here. There is some money somewhere at the end of them. One of the nodes that is terminal has the most. C is harder. Okay, and then between D and E, which one do you think is harder?

Starting point is 02:50:38 More branching, so C, so E? You think E is harder? If it's branching. Okay. I don't actually know that. I mean, I do feel like he has a point, so I can be wrong. Okay, so who thinks, okay, so you think E is harder. Anybody thinks D is harder?

Starting point is 02:50:52 Okay, why? Because it has the fewest ways to go from one point to another. It has bottlenecks. Yeah, yeah. Right. Okay. Ready? Okay.

Starting point is 02:51:01 Right here. So take a look at this. B is harder than C, as you can see. It's branching, right? B is harder, even though C is twice as large as B in terms of the number of nodes. And A, you can see that it was easy, right? So imagine if I showed you this as the planning task and I declared victory and I said, look,

Starting point is 02:51:19 LLMs can solve planning. GPT-4, great, near 100%, right? But then you try just a little longer, or you have the same number of nodes but with a branching structure, and what do you see here? Huge drop, right? And in fact, what do you see for three of the LLMs? It's at almost 0%, right? And now between D and E, let's take a look.

Starting point is 02:51:37 As you can see, D is much more difficult for GPT-4, which is the blue one. In fact, E is more difficult than B. Sorry, B is more difficult than E. It's not consistent. Yeah. Well, there is something there. In these two, you have structures where you need to be exact. There are not multiple paths between different nodes, right?

Starting point is 02:51:57 So it's very important. If you're going from this cluster to this one, you have to pass through this bottleneck. So there needs to be an ability to plan accurately through the specific bottleneck, correct? Now, what about the different tasks? Let's see. As you can see, they're not robust to the different tasks either. Traversal, which is one-step, two-step, three-step,

Starting point is 02:52:17 n-step path, and value path. This is easier for these guys. Why is that? The reason is that traversal does not change the structure of the environment or the rewards. However, as soon as you have the local change, the stuff that Edward Tolman was talking about 100 years ago, that is required for measuring cognitive maps

Starting point is 02:52:34 in rodents, for instance, like detour and shortcut that we have, all of a sudden you see a drop, and you can see it all of a sudden goes to zero for Cohere, for Alpaca, and for LLaMA, right? And here you can see this sad thing also: it's almost at zero percent for four of the graphs. So all of these graphs are at almost zero percent for three of the LLMs and about 20 percent for most of them, and it's only GPT-4 that does a little better, and that's about 40 percent, right? So based on all of these things: robustness to tasks, if you aggregate across graphs, not robust to tasks; and robustness to different graphs, if you aggregate across tasks, also not very robust. So when you compare these, the general conclusion I would draw is that they're not good at planning.

Starting point is 02:53:21 Now let's take a look at some of their failure modes. So can you guys see what is the failure mode that is happening here? There is an edge that doesn't exist. It hallucinated an edge, in giving the planning response, that doesn't exist. Now let's take a look at this case where you have a direct path from 1 to 7, but it's giving a very long one. It asks what is the shortest path between 1 and 7, and it says 1, 13, 10, 7. But interestingly, if I ask the LLM, can you list the tuples? GPT-4 can easily list the tuples, but at the same time it can still hallucinate, like in this case.

Starting point is 02:53:56 Now in this last one, there's two mistakes. I told you one of the mistakes, which is hallucinating the edges. What other mistake do you see? Is it out of order somehow? I don't know. It's hard to tell from this distance. Take a look at the answer. What is wrong with the answer? Revisits a node. Exactly. There's a loop.

Starting point is 02:54:11 So the shortest path should not have a loop. Of course. So we found another case, right? So these three failure modes are failures of planning. Even though it knows the one-step tuples correctly, it seems to fail at planning. And it can give you some insight into what is going on. So it's not very good at stitching one-step things together. So based on that, why do you think it was better at graph A?
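
The three failure modes just listed (a hallucinated edge, a path longer than the shortest one, and a path that loops back on itself) are all mechanically checkable against the true graph. The following is a hedged sketch of such a checker, not the scoring code from the paper; `shortest_path_length` and `diagnose_path` are hypothetical names, and the graph is assumed to be given as an adjacency dictionary.

```python
from collections import deque
from typing import Dict, Hashable, List, Optional, Sequence


def shortest_path_length(graph: Dict[Hashable, Sequence], start, goal) -> Optional[int]:
    """Breadth-first search hop count from start to goal, or None if unreachable."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def diagnose_path(graph: Dict[Hashable, Sequence], path: List, start, goal) -> List[str]:
    """Return the list of problems found in a proposed path (empty list = valid and shortest)."""
    issues: List[str] = []
    if not path or path[0] != start or path[-1] != goal:
        issues.append("wrong endpoints")
    # Every consecutive pair must be a real edge in the graph.
    if any(b not in graph.get(a, ()) for a, b in zip(path, path[1:])):
        issues.append("hallucinated edge")
    # A shortest path never visits the same node twice.
    if len(set(path)) != len(path):
        issues.append("revisits a node")
    # Only compare lengths when the path is otherwise valid.
    best = shortest_path_length(graph, start, goal)
    if not issues and best is not None and len(path) - 1 > best:
        issues.append("longer than shortest path")
    return issues
```

For example, on a toy graph where nodes 1 and 7 share an edge, `diagnose_path(g, [1, 13, 10, 7], 1, 7)` would flag a valid but needlessly long route.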

Starting point is 02:54:33 Can people give me guesses? Why do you think graph A was easier? Or it showed some apparent success on graph A. Why do you think that is? Smaller depth, fewer choices. Fewer choices. But this one also has very few choices. It's like B. It's a tree, right? This has fewer choices than that, right? But why is this so much more difficult?

Starting point is 02:54:53 More ways to be wrong. Say again? More ways to be wrong. More ways to be wrong. So another way to say it is that the things that showed up exactly in the prompt are more likely to work for C and A, basically. So if it just did memorization of what's going on, right?

Starting point is 02:55:14 Because there are just sort of two tracks here. But there is more branching. For what it's worth, I think a lot of the common sense reasoning benchmarks that these things are specifically trained on are transitive. I don't know what you call this. Yeah. Yeah. Like, we trained it to be good at A and C.

Starting point is 02:55:35 No, that's not true. No? GPT-4 has been trained on a huge amount of text. A lot of that is family trees and structures that are actually tree-like. It turns out transformers, in fact, do have some limitations with tree-like structures and with things that have bottlenecks. We are very good at bottlenecks. In fact, bottlenecks make things easier for us, right?

Starting point is 02:55:53 You have a few nodes that basically have high centrality, especially eigenvector centrality or betweenness centrality. And basically what you do when you're solving a problem and planning is you say, I'm going to find that node, then from there I'll go somewhere. You have a subway system. You go to 14th Street station in New York City, then you can find a train that goes somewhere else, right?

Starting point is 02:56:15 So if you get lost, just find the hub. We actually use these heuristics a lot. It's available in human texts a lot, but it hasn't been picking up on that. So this is about the structures that the Transformer, for instance, might have been learning. And as you can see, you have here from 7 billion parameters to one trillion parameters to the best of our knowledge or larger, right?

Starting point is 02:56:36 And none of them is capable of figuring out or like having a high performance higher than like we have something between zero and 40 percent on a simple two-step tree, which is the simplest thing you can give a model-based planner. It's not even probabilistic. It's deterministic. And even that is failing, right? And then we saw these failure modes. Another thing, what if I give it extra instructions?

Starting point is 02:56:59 By the way, all of these have been told to think step by step. So we give that simple chain of thought. What if we give it extra instructions? For instance, I describe the entire breadth-first search and depth-first search. And I say, hey, use depth-first search. How is that working? First do this, then do that. And another one.

Starting point is 02:57:16 So you can see in the supplementary material of our paper the entire description of depth-first search. Then you see that it improves somewhat when you are within a cluster, but when you look at a situation in this graph D, where you need to find the shortest paths between nodes that are a cluster away from each other, what you see is that it doesn't help much. And interestingly, for different temperatures: for temperature zero, it doesn't help at all. It helps a little bit for temperatures that are higher and, I guess, take different kinds of paths.

Starting point is 02:57:45 But it's interesting: only one cluster away. Shortest path, one cluster away, it's not a big deal. The diameter of this network is not that large. There is not a lot of improvement, and the performance is pretty low, as you can see, for all of them, and for three of them it is actually closer to zero. So this is the evaluation we have done. Together with my summer interns, we have a paper where we did a prefrontal cortex-inspired modular architecture

Starting point is 02:58:12 where GPT-4 basically plays the role of these different kinds of modules and solves these problems in a modular way, similar to the prefrontal cortex. I have like 15 years of working on prefrontal cortex. I'm very excited to do this with these models. You can see it here. And this paper, you can find it on my website, Webb et al. 2023, and you can find it here as well. I have an arXiv number over there. Okay, so I had to cut it for time there, but literally, I'm not joking, I had another half an hour of audio just chatting with her, and all of us were just crowding around her like students.

Starting point is 02:58:44 Like, she was just very, very engaging in person. And I love to see that. I love to see when people can not only do great work, but then also talk in a compelling fashion about it, and not just passively answer questions about it, but also challenge you to think along the way. So I guess if I were to include one agents paper from NeurIPS, this would be it. And for the final talk of this entire pod, which is already stretching into three hours, I have saved the coverage of state space models, which have been the talk of the town. The Mamba model was released a few days before NeurIPS, and Albert Gu was there. I met him, but I couldn't get a conversation with him. But Chris Ré

Starting point is 02:59:22 was on stage talking about effectively all of Hazy Research, what Stanford's doing, and what Chris Ré is up to and all the people he's associated with, including Tri Dao and Albert Gu. So if you want a primer or a good entry point on just how Chris Ré is thinking about state space models, I think this is it. So as I mentioned, our motivation for getting rid of attention potentially is long sequences. That's the practical motivation. I'll come back to my real motivation in one slide. Practically, some data comes as long sequences. Audio; DNA is billions of base pairs. We can also cram in tons of few-shot examples, which seems pretty cool. When we started this project, really the standard models couldn't handle it. GPT-1 had only a 512 context length. And as I mentioned, transformers are

Starting point is 03:00:07 scaling quadratically in their sequence length. So we kind of took two parallel paths to this. One is better hardware algorithms. So we tried with, you know, FlashAttention, and now people have followed up to make that path really, really fast. Just optimize the crap out of it on hardware, and there's a lot of juice to squeeze there. The other approach, which I'll talk about now, is new models. Now, as I mentioned, I actually wasn't totally motivated by that. Honestly, that wasn't my total motivation. I was really motivated by this inductive bias issue. So the idea here is you give me this image, and I flatten it into one single pixel vector. And then I ask you, is it a car or a boat, some CIFAR-like thing? Sequential CIFAR, if you know

Starting point is 03:00:46 the task. And this is really interesting to me because when, you know, a human would do this, this would be hopeless. If you gave me a picture and gave me it in one pixel vector as a thing, I would have no chance of classifying it. Machines could do something, but there was a huge gap. And I wanted to understand why is there this inductive bias underneath the covers? Do you really need this spatial inductive bias for the machines to reason? Do they have to reason like us when they do this? So I was fascinated by this problem. Right. So there's another benchmark that came out that was really exciting from the Google folks, which was about how to benchmark efficient attention. It's called Long Range Arena. It's extremely cool. We found them,

Starting point is 03:01:25 basically because we were playing around with the sequential CIFAR things, and they had a much greater library of places where they were seeing possibilities to improve attention. This was the leaderboard in 2021 of this attention, and they were basically looking at a bunch of very cool linear attention variants, some of which we still play with. I want to draw your attention to two columns on this thing. The first is image.

Starting point is 03:01:46 That is that sequential CIFAR task I was just talking about. It's a really interesting task. You've probably trained CIFAR to like 90s or high 80s on your laptop, or on a small GPU, and you see the sequential version was lagging quite a bit behind. The other column is this thing, Path-X, which were these large images where you had two dots, and you're trying to say whether the two dots are connected.

Starting point is 03:02:06 And the reason there are X's is that every model was basically random guessing at this point. So there are three approaches that we were trying in order to improve long sequences: improve the utilization on hardware, approximate attention, and this last one, which I'm going to talk about most, which is using RNN-based kinds of ideas

Starting point is 03:02:24 and signal processing ideas. All of them are great, and we just happened to pick the last one. All right. So the idea is we're going to replace just the signal mixing box with this new operator, S4, that's based on signal processing ideas. So this was inspired by Albert and Karan. Albert's now a professor at CMU.

Starting point is 03:02:44 Karan is now running this company, Cartesia, which is a small company just started. And basically, S4 is a classic state space model. So if you're an EE person, you've seen these in your undergrad right away. It's an LTI system, but we're going to tweak it for deep learning. The first thing we're going to get, as I'll show you pretty mathematically and nicely, is that signal processing people are obsessed with stability. They understand bounded-input, bounded-output stability like nobody's business.

Starting point is 03:03:07 It's simple and it's clean, and we can use it right away. This is a challenge when training these models. A second thing, which was quite surprising is, I've always thought about CNNs and RNNs as quite distinct models. But what I'm going to show you mathematically is these models actually unify both. Now, these are CNNs in a kind of different way than we're used to. They're convolutions where the filters are potentially as long as the input, but we're going to be able to view the exact same weights and operate on them either as an RNN or CNN,

Starting point is 03:03:35 which is quite exciting. And the last piece, of course, is that we're going to make this quite fast. And these are going to be asymptotically more efficient than transformers. We're eventually going to be able to process sequences in like n log n time, which is then, you know, a challenge to make practical, and I'll share some results there. Now, this thing is extremely simple. Very simple signal processing ideas, but I just want to point out it had a large improvement on LRA that surprised me.
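
The claim that the same weights can be run either as an RNN or as a CNN with a filter as long as the input can be checked on a toy example. The sketch below assumes a discrete linear recurrence with a diagonal state matrix, x_k = a * x_{k-1} + b * u_k and y_k = c · x_k; this discrete form, and the names `run_recurrent`, `kernel`, and `run_convolution`, are assumptions for illustration and not the S4 implementation. The convolution here is the naive quadratic one; the n log n figure mentioned in the talk would come from computing it with an FFT instead.

```python
from typing import List


def run_recurrent(a: List[float], b: List[float], c: List[float], u: List[float]) -> List[float]:
    """RNN view: step the hidden state once per input sample (diagonal state matrix)."""
    x = [0.0] * len(a)
    ys = []
    for u_k in u:
        x = [a_i * x_i + b_i * u_k for a_i, x_i, b_i in zip(a, x, b)]
        ys.append(sum(c_i * x_i for c_i, x_i in zip(c, x)))
    return ys


def kernel(a: List[float], b: List[float], c: List[float], length: int) -> List[float]:
    """CNN view: filter taps K[j] = sum_i c_i * a_i**j * b_i, one tap per input position."""
    return [
        sum(c_i * (a_i ** j) * b_i for a_i, b_i, c_i in zip(a, b, c))
        for j in range(length)
    ]


def run_convolution(a: List[float], b: List[float], c: List[float], u: List[float]) -> List[float]:
    """Causal convolution of the input with the kernel built from the same weights."""
    k = kernel(a, b, c, len(u))
    return [sum(k[j] * u[t - j] for j in range(t + 1)) for t in range(len(u))]
```

Both functions take the same `(a, b, c)` and agree on every output sample up to floating-point error, which is the sense in which one set of weights gives both views.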

Starting point is 03:04:00 So here's that improvement on LRA. This is the first of its kind to solve Path-X. It was like a 26-point jump on this benchmark that a bunch of folks had played with. I also want to point out that on the image task, that spatial bias seems to matter less than I thought. And that was really the thing that was interesting to me. And since then, many people have followed on and pushed these numbers up higher, but I just think that's really interesting.

Starting point is 03:04:21 I don't know what to do with the observation, but I really like it. Okay, so what is signal processing? Well, signal processing people view a signal of D dimensions at N time steps as input, and the output is a signal of D dimensions at N time steps. That looks a lot like our X and O matrices that we had in attention. They also think causally. They think that time moves left to right through this, and things like GPT are also kind of causal.

Starting point is 03:04:46 So, so far, what I want to emphasize is, we've really done nothing. It's just symbol pushing that we've been able to move into this model. So what does signal processing actually buy us? Two big ideas. The first is, over 100 years, they figured out a bunch of models, which are relatively simple, but capture pretty interesting phenomena. These aren't the best models you could ever use, these LTI systems, but they're a simple and very well-understood starting point.

Starting point is 03:05:12 So I argue it makes sense to start there. The second piece, which I think a lot of machine learners don't necessarily love, is that they have this idea that a signal is a continuous object that then is discretely sampled. And that idea allows us to do a bunch of stuff. In particular, it allows us to use all our discrete tricks, which are more common in machine learning and AI, but also a bunch of, you know, 19th and 20th century mathematics that knows how to do integrals and solves things exactly. And I'll show you at least one of those tricks in the next couple of slides. I think that's an incredibly powerful idea, and it was really helpful for us to think about it. And as I said,

Starting point is 03:05:47 it's going to teach us about stability in like a trivial way. We're going to use theorems from the 1800s to be able to prove that our models are stable, which I just think is awesome. All right, so what's an LTI system? If you've never played with one, this is what's called a single-input, single-output system. You have some curve that's coming in, which is typically called u(t), that's the input, and some output curve y(t). You have some hidden state, which is much higher dimension, usually, than the input and the output.

Starting point is 03:06:12 We'll take the hidden state as large as the input; when it's discretized, it's going to be a huge thing. Now, I haven't told you how that hidden state evolves yet, but it's going to be constrained. And the LTI people say there's lots of things that can fit into basically letting it evolve according to an ODE. So I'm going to show you that in just one second. So here's what you need for the ODE.

Starting point is 03:06:33 You need two matrices A and B, and we're going to learn those matrices. And basically it says that the hidden state can only evolve according to this equation. It basically says the change in the hidden state is proportional to some learned function of the input plus the previous state. The output is then a linear projection from this high dimensional state down to 1D. This is all that an LTI system does.
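
In standard state-space notation, the system described here can be written as below. The symbols x(t), u(t), and y(t) follow the talk; the name C for the output projection is an assumption, since the talk only names A and B.

```latex
\begin{aligned}
\dot{x}(t) &= A\,x(t) + B\,u(t), \\
y(t) &= C\,x(t).
\end{aligned}
```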

Starting point is 03:06:58 I'm just saying it's something that's surprisingly powerful and well understood. This is not the best model. If you're a signal processing person, you say, oh, you should use X, Y, or Z. You're probably right. But we want to start with something really, really simple that we can understand all the way.

Starting point is 03:07:09 All right. So it turns out that one of the beautiful things is, because it has this continuous object lurking in the background, you can use high school calculus, and in particular you can get out this nice expression. What this says is the hidden state is exactly this function x of s and this convolution-style integral. Okay, this is exactly what it is. This is wonderful. You just solve the ODE. Then when we realize it, we have to discretize. We'll come back to that in a second. So the immediate win is, well, this can tell us exactly when the system is stable: basically, as long as the eigenvalues are in the left-hand part of the plane, which is

Starting point is 03:07:45 what every EE person memorizes, and the reason the left-hand part of the complex plane matters is that e to those values goes inside the unit disk, so you know that this thing is not going to blow up. This system is going to have bounded input, bounded output stability, which is really exciting. So when we train, we can fix our A's, our representations, so that the eigenvalues satisfy this property, and that's going to be one of the arts. Now, to implement this on a machine, we can't use continuous objects. We have to make them discrete. And integrals are just big, smooth sums, basically.
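
The stability condition above can be checked numerically. The sketch below assumes a diagonal A, so its eigenvalues are just the diagonal entries, and a sampling period T; the function names and the example values are hypothetical, not from the talk.

```python
import cmath

def discrete_pole(lam: complex, T: float) -> complex:
    """Map a continuous-time eigenvalue lam to its discrete counterpart exp(lam * T)."""
    return cmath.exp(lam * T)

def is_stable(eigenvalues, T: float = 0.1) -> bool:
    """True when every discrete pole lies strictly inside the unit disk.

    Since |exp(lam * T)| = exp(Re(lam) * T), this holds exactly when every
    continuous eigenvalue has a negative real part (left half-plane).
    """
    return all(abs(discrete_pole(lam, T)) < 1.0 for lam in eigenvalues)

# Diagonal entries of two hypothetical A matrices.
stable_A = [-0.5 + 2.0j, -0.5 - 2.0j, -1.0]
unstable_A = [-0.5 + 2.0j, 0.2]  # one eigenvalue in the right half-plane

print(is_stable(stable_A))
print(is_stable(unstable_A))
```

The imaginary part of an eigenvalue only rotates the pole around the origin; the real part alone decides whether it lands inside the disk.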

Starting point is 03:08:14 They're actually nicer to deal with than functions. And so what we'll do is we'll break that sum down into functions. And what happens in signal processing is you think that you're going to sample at some regular frequency T. And then what I'm denoting here is x bracket K means the kth sample, which is at the point kT. So you're seeing this animation that the integral is just this nice smooth sum. All right. Cool. All right. So now that we're in discrete land, we can relate it to more familiar machine learning concepts.

Starting point is 03:08:41 The first thing is you can view this as a recurrence, as an RNN. So I'll introduce notation g here, which is basically the B times the input. It's all the modifications on the input that we had. And with just a little bit of arithmetic, I can move it out so that I get x[n+1], the next hidden state, is T times g[n] plus some term that's kind of down-weighting it.

Starting point is 03:09:02 And I'm illustrating the down-weighting here in the visualization. RNNs are super fast. So if we did manage to learn the weights, the B's, the A, all the rest of these things in the filter, then we could run this as an RNN automatically from the same parameterizations. Super cool.
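
The recurrence described above can be sketched in a few lines. This is a minimal illustration, assuming a diagonal A with real entries, an output projection named c, and the convention that the output at step n is read after the state has absorbed u[n]; none of those details are stated in the talk.

```python
import math

def ssm_recurrence(a, b, c, u, T):
    """Run a diagonal SSM as an RNN.

    a, b, c: per-dimension coefficients (diagonal of A, input map B, output map C).
    u: list of scalar inputs sampled every T time units.
    Returns the list of scalar outputs.
    """
    decay = [math.exp(a_i * T) for a_i in a]  # the down-weighting term exp(a_i * T)
    x = [0.0] * len(a)                        # hidden state starts at zero
    ys = []
    for u_n in u:
        # next state = down-weighted previous state + T * g[n], with g[n] = b * u[n]
        x = [d * x_i + T * b_i * u_n for d, x_i, b_i in zip(decay, x, b)]
        ys.append(sum(c_i * x_i for c_i, x_i in zip(c, x)))
    return ys

# Impulse response of a one-dimensional system: a pure exponential decay.
ys = ssm_recurrence(a=[-1.0], b=[1.0], c=[1.0], u=[1.0, 0.0, 0.0], T=1.0)
print(ys)
```

Each step touches only the current state and the current input, which is why this view is cheap at inference time.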

Starting point is 03:09:19 With just a little bit more notation, I can take that e term, the exponential there, and put that matrix exponential into this function f. And that becomes a convolution that's probably more familiar to most people, which is a discrete convolution. But notice this discrete convolution is of length N. It's a huge, long convolution. It's not a three-by-three convolution like a ResNet.

Starting point is 03:09:38 It's actually as long, potentially, as the filter. So that's going to be challenging to process. But this model says they're basically both the same. So the key technical challenge is to make these SSMs fast. Those long convs are hard. If you think about it, that f, that filter, is huge. And so if you materialize it at every time step, you'd be toast. It turns out that you don't ever have to materialize the hidden state.
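
To make the claim that the two views are basically the same concrete, the sketch below runs one small diagonal system both ways and compares the outputs. It assumes real diagonal A, an output projection c, and made-up coefficient values; the direct convolution here is the slow quadratic version, used only to check the equivalence.

```python
import math

def ssm_recurrence(a, b, c, u, T):
    """RNN view: carry the hidden state forward one step at a time."""
    decay = [math.exp(a_i * T) for a_i in a]
    x = [0.0] * len(a)
    ys = []
    for u_n in u:
        x = [d * x_i + T * b_i * u_n for d, x_i, b_i in zip(decay, x, b)]
        ys.append(sum(c_i * x_i for c_i, x_i in zip(c, x)))
    return ys

def ssm_kernel(a, b, c, T, length):
    """Convolution view: filter f[k] = sum_i c_i * exp(a_i * T * k) * T * b_i."""
    return [
        sum(c_i * math.exp(a_i * T * k) * T * b_i for a_i, b_i, c_i in zip(a, b, c))
        for k in range(length)
    ]

def causal_conv(f, u):
    """Direct O(N^2) causal convolution: y[n] = sum over k <= n of f[k] * u[n - k]."""
    return [sum(f[k] * u[n - k] for k in range(n + 1)) for n in range(len(u))]

# Hypothetical two-dimensional system and input signal.
a, b, c, T = [-0.5, -2.0], [1.0, 0.5], [1.0, -1.0], 0.1
u = [1.0, 0.0, 2.0, -1.0, 0.5, 0.0, 3.0, 1.0]

y_rnn = ssm_recurrence(a, b, c, u, T)
y_conv = causal_conv(ssm_kernel(a, b, c, T, len(u)), u)
max_diff = max(abs(p - q) for p, q in zip(y_rnn, y_conv))
print(max_diff)  # agreement up to floating-point rounding
```

Note that the filter has one entry per time step, so it is as long as the input; that is the long convolution the talk is referring to.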

Starting point is 03:10:01 That's a really important observation. That allows you to go fast and allows you to have runtime that's proportional to the input and the output, not the massive hidden state. The hidden state is important for representation, but it's actually not important for implementation. You can check out the blog; the blog has more details about exactly how that works. The second thing which we spent a lot of time on, and Albert did a bunch of really brilliant

Starting point is 03:10:22 things inspired by the Legendre memory units, is: how do we make that A have that nice eigenvalue structure so that we know it's stable? Things like diagonal matrices are really easy to keep this structure because, you know, they're scalars on the diagonal, you can keep it. But computing matrix exponentials in general for expressive classes is actually pretty challenging. So we had to do a ton of work to get that to happen over the last couple of years. And the last bit is this practical fast convolution that we needed. Now, I love this slide because DALL-E 3 made most of the art in this talk, or all of the art in this talk, and it made

Starting point is 03:10:54 this poster. I didn't give it the tagline. I still think it's hysterical. Too fast, too furious, revving up the equations. I have no idea what that means. But I love it. It's supposed to be Fourier, by the way. That's the thing. In any case, the thing is, we had to do the same type of operation that we did in FlashAttention, but now on FFTs and convolutions. If you naively run FFTs, you have terrible memory behavior. If you can somehow group them together in nice ways and be IO-aware, you can get back to that kind of nice utilization.
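
For readers who have not seen why FFTs come up here: a long causal convolution can be computed in O(N log N) by multiplying in the frequency domain. The sketch below uses a textbook recursive radix-2 FFT in plain Python and assumes the filter and the input have the same length. It only illustrates the algorithmic idea; it is not the IO-aware kernel discussed in the talk and says nothing about memory behavior on a GPU.

```python
import cmath

def fft(x, invert=False):
    """Recursive radix-2 FFT; len(x) must be a power of two. The inverse is unnormalized."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], invert)
    odd = fft(x[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def fft_causal_conv(f, u):
    """Causal convolution y[n] = sum over k <= n of f[k] * u[n - k], via zero-padded FFTs."""
    n = len(u)
    size = 1
    while size < 2 * n:  # pad so the circular convolution does not wrap around
        size *= 2
    F = fft([complex(v) for v in f] + [0j] * (size - len(f)))
    U = fft([complex(v) for v in u] + [0j] * (size - n))
    y = fft([p * q for p, q in zip(F, U)], invert=True)
    return [(v / size).real for v in y[:n]]  # normalize and keep the first n outputs

f = [1.0, 2.0, 3.0]
u = [1.0, 1.0, 1.0]
y = fft_causal_conv(f, u)
print([round(v, 6) for v in y])
```

The zero padding matters: without it the FFT computes a circular convolution, and late inputs would leak into early outputs.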

Starting point is 03:11:21 FlashAttention, if you recall, was about 72% utilization. Dan and Hermann got to 65% utilization. I would also say that Dan's on the faculty market this year and Hermann's on the PhD market and you'd be smart to hire them. They're amazing. So the point is there's not really a hardware trade-off after you do a bunch of work.

Starting point is 03:11:38 It's really algorithmic. This thing is going to do a lot fewer operations. And this led to what Sasha called an RNN renaissance. And I want to say it's been super fun. I have to say the last, like, year and a half or two years of research, I've absolutely loved, because you've had a ton of people contributing amazing ideas like S5 and Mega and RWKV

Starting point is 03:11:56 on super technical topics that were really exciting for us to do. And there's just been so many more that I can't put on here and they've been pushing the state of the art. So now you've listened to my talk and you're like, should we use these models everywhere? And, you know, maybe I'm a California optimist, so I sound happy. I know it's irritating, but I'm happy about everything, so I am.

Starting point is 03:12:15 So you're like, maybe you should use these things. Say, well, maybe, but there's actually a pretty big gap on language. So it was wonderful on LRA and those signal processing tasks, but when we actually deployed it on language, there was a gap. Now, the standard way you measure a language model is perplexity. This is the score of how predictable the language is. To give you a sense of this measure, S4 was five points worse on perplexity versus transformers. And that's a staggering number, because five points is about the difference between

Starting point is 03:12:45 a 125 million parameter model and a 7 billion parameter model. It was a big gap. So we started to wonder, why is that? So we went back to work that other folks had done, which was amazing, this associative recall task. So the task here is I give you letters and numbers. The last letter is a query, in this case C, and you have to tell me which number is associated with that letter.
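
A toy generator makes the associative recall task concrete. The token layout below (alternating letter and digit, followed by one query letter) is an assumed simplification for illustration, and the function names are hypothetical; the dictionary lookup stands in for what a model has to learn to do.

```python
import random
import string

def make_recall_example(num_pairs, rng):
    """Build a sequence like ['A', '4', 'C', '7', ..., 'C'] and its expected answer."""
    keys = rng.sample(list(string.ascii_uppercase), num_pairs)  # distinct letters
    values = [str(rng.randrange(10)) for _ in keys]             # one digit per letter
    sequence = []
    for k, v in zip(keys, values):
        sequence += [k, v]
    query = rng.choice(keys)
    answer = values[keys.index(query)]
    return sequence + [query], answer

def solve_by_lookup(tokens):
    """Reference solution: treat the context as a key-value table and look up the query."""
    *context, query = tokens
    table = dict(zip(context[0::2], context[1::2]))
    return table[query]

rng = random.Random(0)
tokens, answer = make_recall_example(4, rng)
print(tokens, "->", answer)
```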

Starting point is 03:13:04 It's a lookup task. Attention can crush this, because it's a very easy lookup task. These two variants of S4 that came out later that are supposed to be better on language were better, but there was a gap here too. And so without going into too much detail on this piece, Michael Poli came along and did this thing, Hyena, and he showed he could get 100% on this underlying operator and did it in a very exciting way while still maintaining speed and all the rest. So this is what the picture looked like as of a couple of months ago, or a couple of weeks ago, I guess, two weeks ago. You had S4, which was a bit worse, but then in quick succession, people were coming down to this very strong attention baseline. All the baselines are released.

Starting point is 03:13:43 EleutherAI made a wonderful harness. These are all at 350 million. You can start to play with these things. And people are; RWKV has been releasing even bigger models. And so there was this baseline here. These are closing the gap without attention. But part of the reason I love academia is you can worry about tiny problems. It's like, well, it seems like a small problem, but why is it worse? So we kept asking, we kept poking at it, and Simran and Sabri came in, and they actually came up with this idea. It took us a surprising amount of time, but it's just a small twist. The small twist is what a transformer can do is not one lookup, but many lookups.

Starting point is 03:14:15 So what MQAR is, is multi-query. We don't just look up one letter, we look up many letters. Now we can worry about scaling in the letters, the vocab size, the model dimension. And what we found is that all of these models can, quote, solve the task. But how they do it, their scaling, is quite different. This relates to a bunch of things in parallel circuit

Starting point is 03:14:33 complexity that I won't get to. But this is a really interesting thing where we can start to study the scaling. And so what they realized is that attention can solve these things with a small number of dimensions, roughly logarithmic, whereas Hyena and RWKV, and all the convolutional models, as a result of their reduction, require model dimensions that scale with the sequence length. And so you get charts that look like this.
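
The multi-query variant can be sketched the same way, with integer tokens so that the number of pairs, the number of queries, and the vocabulary size can all be varied. This is a simplified layout chosen for illustration, with hypothetical names and parameters; the exact sequence format used in the MQAR experiments is not described in the talk.

```python
import random

def make_mqar_example(num_pairs, num_queries, vocab_size, rng):
    """Return (context, queries, answers) for a multi-query recall instance.

    Keys are drawn from [0, vocab_size) and values from [vocab_size, 2 * vocab_size),
    so a token's range tells you whether it is a key or a value.
    """
    keys = rng.sample(range(vocab_size), num_pairs)  # distinct keys
    values = [rng.randrange(vocab_size, 2 * vocab_size) for _ in keys]
    context = [tok for pair in zip(keys, values) for tok in pair]
    queries = rng.sample(keys, num_queries)          # several lookups per sequence
    lookup = dict(zip(keys, values))
    answers = [lookup[q] for q in queries]
    return context, queries, answers

rng = random.Random(0)
context, queries, answers = make_mqar_example(
    num_pairs=8, num_queries=4, vocab_size=64, rng=rng
)
print(context)
print(queries, "->", answers)
```

Growing num_pairs with the sequence length while holding the model size fixed is the kind of sweep that exposes the scaling differences mentioned above.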

Starting point is 03:14:54 They'll solve it, but they need more capacity to do so. So when we started looking at these MQAR things in the wild, we started thinking, well, okay, MQAR is a nice synthetic, but does it translate? This was really insightful. Simran and Sabri did this. They said, we're going to take the Pile, and we're going to segment out which ones are AR-like, which sentences are AR-like. So these are things that have repeated bigrams. Common buzzard is repeated twice. There's kind of an implicit lookup the second time you're

Starting point is 03:15:19 doing the common buzzard task. That's about 7% of the Pile. The non-AR slice was basically everything else. What they found is that the attention gap, 82% of it, was explained, even though this is a pretty rough proxy for the task, by just what's going on here. And this made us think, maybe if we solve this task, we can even close it. But the other observation was, actually these convolutional models are slightly better on the non-lookup task. So maybe there's hope to go beyond them.
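
The repeated-bigram idea can be illustrated with a small helper that flags the positions where a previously seen bigram completes again. This is a rough sketch using whitespace tokens and a made-up sentence; the actual criterion and tokenization used to slice the Pile are not given in the talk.

```python
def repeated_bigram_positions(tokens):
    """Return indices of tokens that complete a bigram already seen earlier in the sequence."""
    seen = set()
    hits = []
    for i in range(len(tokens) - 1):
        bigram = (tokens[i], tokens[i + 1])
        if bigram in seen:
            hits.append(i + 1)  # the second token could be recalled from context
        seen.add(bigram)
    return hits

text = "the common buzzard is a bird and the common buzzard hunts voles"
tokens = text.split()
hits = repeated_bigram_positions(tokens)
print(hits, [tokens[i] for i in hits])
```

At each flagged position, a model that can look back in its context has an easier prediction than one that has to rely on memorized statistics alone.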

Starting point is 03:15:47 And so we started this kind of architecture, and I want to give another shout out here to a paper I love. I love the T5 paper, I'm sure many of you do too. I love the vibe of it where it was like, hey, we just want to say, what are the common elements that are going on? If you're outside this little tiny sub-community, all the papers look very, very different.

Starting point is 03:16:02 But if you're inside, I would say there's a couple of really common themes, and Simran and Sabri tried to boil them down so that more folks can participate and come into the field more easily. The themes are long convolutions, convolutions that are scaling with the input, not necessarily the full input size.

Starting point is 03:16:18 Gating is a wonderful idea that's multiplying in this kind of component-wise way in the sequence. That's an old idea. And data dependence. And Mamba just came out from Albert and Tri, which did this and still kept that sub-quadratic runtime. Based is basically just simplifying all of the things that people are doing

Starting point is 03:16:33 and trying to get to something nice. We don't have T5-level niceness yet, but we are inspired by them. One thing I want to point out is that this new convolutional architecture does scale for MQAR a little bit like attention. So it has the same kind of dimension scaling that the others had, which is interesting. So the point is, very recently, this is in the last week in the run-up to NeurIPS, both Mamba and Based, and I'm sure five others will come out in the next couple of weeks, are now attention-free and actually getting you lower PPL at 350 million. It doesn't mean they're going to get you lower PPL necessarily at, you know, 100 billion,

Starting point is 03:17:07 but it's interesting to say there doesn't seem to be any fundamental kind of block, and that's, to me, extremely exciting. I did want to point out a little bit that, you know, there is another bottleneck that's lurking for truly subquadratic models. We talked a lot about the signal processing part, but there are also these MLPs, and I've become obsessed with them. There's a whole line of work, check out Dan Fu's talk, about trying to understand what's going on with the MLPs,

Starting point is 03:17:30 and can we slim those down? They become a bottleneck at much larger dimension sizes. So the questions that were driving our work really were threefold. I shared with you, I hope, a little bit, about how foundation models change the systems that we're building. I also talked a lot about how classical ideas from signal processing and databases were interesting bits of canon to bring into the field so that maybe we can make these models more efficient.

Starting point is 03:17:51 What I thought I would end with, for two minutes, is just why I think there's such a bright future in AI and systems. The first thing is, we weren't really using these models 15 to 18 months ago in the way we're using them now. We knew intuitively that you train them once and use them multiple times, but it's not really clear we were doing that. We were kind of just showing them to each other, if we're honest.

Starting point is 03:18:10 Now, people are using them on like a daily basis. And inference has become an unbelievable task. I would say even in the last three or four months, the speed of inference, if you watch on a bunch of the commercial servers, is just going through the roof as new ideas come in. Of course, people were thinking about this: MQA and GQA a while ago. Speculative decoding was an amazing paper.

Starting point is 03:18:29 vLLM was really exciting. Flash-Decoding, MatFormer. There's a ton of exciting work here. My point is, this really kicked off like six months ago. Wild to think about. But that's the whole thing. Another bit is there's a big difference between low latency systems and high throughput systems. When you don't care if it returns in a couple of milliseconds, but you want to, say, run on a hundred

Starting point is 03:18:49 different documents, or a million different documents. We're just at the outset of seeing that systems shift as people are actually using these foundation models on all the back-of-house data cleaning tasks that I think are going to happen in the next while. There's new data types. I do want to call out that there's all kinds of things you could worry about, from Kunle, about how to program these systems, what's the right accelerators and hardware, that's just happening. What are the right systems to build that are systems of record underneath the covers?

Starting point is 03:19:14 There's tons of stuff. Yep, I gave Chris a little bit more time there because he's such a legend, and he covers so many different concepts and updates and models in such a small amount of time. So his time is very high quality, and you should watch the whole talk if you get the opportunity. But that's it for our coverage of NeurIPS 2023. It's just a ton of papers. We are going to follow up with a lot of the startups that I encountered and met, a lot of which are returning guests.

Starting point is 03:19:42 So keep a look out for that. But also, thank you so much for listening in on this. It's an experimental new format. We grabbed a whole bunch of audio, spliced in, you know, live interviews or stage talks and some of my own commentary with a little bit of backing music. It's an experimental new thing. Like, did you like it? Let us know.

Starting point is 03:20:00 If you liked it and share it with a friend, that would help us a lot. And also, just remember, we have a listener survey going on. So please come to our website and fill out our survey. Thanks and see you at the next NeurIPS recap. DJ QD outro.
