Latent Space: The AI Engineer Podcast - NeurIPS 2023 Recap — Best Papers
Episode Date: December 23, 2023

We are running an end of year listener survey! Please let us know any feedback you have, what episodes resonated with you, and guest requests for 2024! Survey link here.

NeurIPS 2023 took place from Dec 10–16 in New Orleans. The Latent Space crew was onsite for as many of the talks and workshops as we could attend (and more importantly, hosted cocktails and parties after hours)!

Picking from the 3586 papers accepted to the conference (available online, full schedule here) is an impossible task, but we did our best to present an audio guide with brief commentary on each. We also recommend MLContests.com NeurIPS recap and Seb Ruder’s NeurIPS primer and Jerry Liu’s paper picks. We also found the VizHub guide useful for a t-SNE clustering of papers. Lots also happened in the arxiv publishing world outside NeurIPS, as highlighted by Karpathy, especially DeepMind’s Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models.

Jan 2024 update: we also strongly recommend Sebastian Raschka, PhD’s pick of the year’s 10 best papers, including Pythia.

We’ll start with the NeurIPS Best Paper Awards, and then go to a selection of non-awarded but highly influential papers, and then arbitrary personal picks to round out the selection. Where we were able to do a poster session interview, please scroll to the relevant show notes for images of their poster for discussion. We give Chris Ré the last word due to the Mamba and StripedHyena state space models drawing particular excitement but still being too early to assess impact.

Timestamps
* [0:01:19] Word2Vec (Jeff Dean, Greg Corrado)
* [0:15:28] Emergence Mirage (Rylan Schaeffer)
* [0:28:48] DPO (Rafael Rafailov)
* [0:41:36] DPO Poster Session (Archit Sharma)
* [0:52:03] Datablations (Niklas Muennighoff)
* [1:00:50] QLoRA (Tim Dettmers)
* [1:12:23] DataComp (Samir Gadre)
* [1:25:38] DataComp Poster Session (Samir Gadre, Alex Dimakis)
* [1:35:25] LLaVA (Haotian Liu)
* [1:47:21] LLaVA Poster Session (Haotian Liu)
* [1:59:19] Tree of Thought (Shunyu Yao)
* [2:11:27] Tree of Thought Poster Session (Shunyu Yao)
* [2:20:09] Toolformer (Jane Dwivedi-Yu)
* [2:32:26] Voyager (Guanzhi Wang)
* [2:45:14] CogEval (Ida Momennejad)
* [2:59:41] State Space Models (Chris Ré)

Papers covered
* Distributed Representations of Words and Phrases and their Compositionality (Word2Vec) Tomas Mikolov · Ilya Sutskever · Kai Chen · Greg Corrado · Jeff Dean. The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several improvements that make the Skip-gram model more expressive and enable it to learn higher quality vectors more rapidly. We show that by subsampling frequent words we obtain significant speedup, and also learn higher quality representations as measured by our tasks. We also introduce Negative Sampling, a simplified variant of Noise Contrastive Estimation (NCE) that learns more accurate vectors for frequent words compared to the hierarchical softmax. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada".
Motivated by this example, we present a simple and efficient method for finding phrases, and show that their vector representations can be accurately learned by the Skip-gram model.
  * Some notable reflections from Tomas Mikolov - and debate over the Seq2Seq paper credit with Quoc Le
* Are Emergent Abilities of Large Language Models a Mirage? (Schaeffer et al.). Emergent abilities are abilities that are present in large-scale models but not in smaller models and are hard to predict. Rather than being a product of models’ scaling behavior, this paper argues that emergent abilities are mainly an artifact of the choice of metric used to evaluate them. Specifically, nonlinear and discontinuous metrics can lead to sharp and unpredictable changes in model performance. Indeed, the authors find that when accuracy is changed to a continuous metric for arithmetic tasks where emergent behavior was previously observed, performance improves smoothly instead. So while emergent abilities may still exist, they should be properly controlled and researchers should consider how the chosen metric interacts with the model.
* Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al.)
  * While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model.
  * In this paper, we leverage a mapping between reward functions and optimal policies to show that this constrained reward maximization problem can be optimized exactly with a single stage of policy training, essentially solving a classification problem on the human preference data. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for fitting a reward model, sampling from the LM during fine-tuning, or performing significant hyperparameter tuning.
  * Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds RLHF's ability to control sentiment of generations and improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.
  * See also Interconnects on DPO and recent Twitter discussions
* Scaling Data-Constrained Language Models (Muennighoff et al.)
  * The current trend of scaling language models involves increasing both parameter count and training dataset size. Extrapolating this trend suggests that training dataset size may soon be limited by the amount of text data available on the internet. Motivated by this limit, we investigate scaling language models in data-constrained regimes. Specifically, we run a large set of experiments varying the extent of data repetition and compute budget, ranging up to 900 billion training tokens and 9 billion parameter models. We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data. However, with more repetition, the value of adding compute eventually decays to zero. We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters.
Finally, we experiment with approaches mitigating data scarcity, including augmenting the training dataset with code data or removing commonly used filters. Models and datasets from our 400 training runs are freely available at https://github.com/huggingface/datablations.
  * 2 minute poster session presentation video
* QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al.)
  * This paper proposes QLoRA, a more memory-efficient (but slower) version of LoRA that uses several optimization tricks to save memory. They train a new model, Guanaco, that is fine-tuned only on a single GPU for 24h and outperforms previous models on the Vicuna benchmark. Overall, QLoRA enables using much less GPU memory for fine-tuning LLMs. Concurrently, other methods such as 4-bit LoRA quantization have been developed that achieve similar results.
* DataComp: In search of the next generation of multimodal datasets (Gadre et al.)
  * Multimodal datasets are a critical component in recent breakthroughs such as CLIP, Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the machine learning ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets.
  * Our benchmark consists of multiple compute scales spanning four orders of magnitude, which enables the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow leads to better training sets.
Our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, outperforming OpenAI's CLIP ViT-L/14 by 3.7 percentage points while using the same training procedure and compute. We release DataComp and all accompanying code at www.datacomp.ai.
* Visual Instruction Tuning (Liu et al)
  * Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data.
  * By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding.
  * Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model and code base publicly available.
* Tree of Thoughts: Deliberate Problem Solving with Large Language Models (Yao et al)
  * Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role.
  * To surmount these challenges, we introduce a new framework for language model inference, Tree of Thoughts (ToT), which generalizes over the popular Chain of Thought approach to prompting language models, and enables exploration over coherent units of text (thoughts) that serve as intermediate steps toward problem solving.
  * ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices.
  * Our experiments show that ToT significantly enhances language models’ problem-solving abilities on three novel tasks requiring non-trivial planning or search: Game of 24, Creative Writing, and Mini Crosswords. For instance, in Game of 24, while GPT-4 with chain-of-thought prompting only solved 4% of tasks, our method achieved a success rate of 74%.
  * Code repo with all prompts: https://github.com/princeton-nlp/tree-of-thought-llm.
* Toolformer: Language Models Can Teach Themselves to Use Tools (Schick et al)
  * LMs exhibit remarkable abilities to solve new tasks from just a few examples or textual instructions, especially at scale. They also, paradoxically, struggle with basic functionality, such as arithmetic or factual lookup, where much simpler and smaller specialized models excel.
  * In this paper, we show that LMs can teach themselves to use external tools via simple APIs and achieve the best of both worlds.
  * We introduce Toolformer, a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction.
  * This is done in a self-supervised way, requiring nothing more than a handful of demonstrations for each API. We incorporate a range of tools, including a calculator, a Q&A system, a search engine, a translation system, and a calendar.
  * Toolformer achieves substantially improved zero-shot performance across a variety of downstream tasks, often competitive with much larger models, without sacrificing its core language modeling abilities.
* Voyager: An Open-Ended Embodied Agent with Large Language Models (Wang et al)
  * We introduce Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human intervention. Voyager consists of three key components:
    * 1) an automatic curriculum that maximizes exploration,
    * 2) an ever-growing skill library of executable code for storing and retrieving complex behaviors, and
    * 3) a new iterative prompting mechanism that incorporates environment feedback, execution errors, and self-verification for program improvement.
  * Voyager interacts with GPT-4 via blackbox queries, which bypasses the need for model parameter fine-tuning. The skills developed by Voyager are temporally extended, interpretable, and compositional, which compounds the agent's abilities rapidly and alleviates catastrophic forgetting. Empirically, Voyager shows strong in-context lifelong learning capability and exhibits exceptional proficiency in playing Minecraft. It obtains 3.3x more unique items, travels 2.3x longer distances, and unlocks key tech tree milestones up to 15.3x faster than prior SOTA. Voyager is able to utilize the learned skill library in a new Minecraft world to solve novel tasks from scratch, while other techniques struggle to generalize.
  * Voyager discovers new Minecraft items and skills continually by self-driven exploration, significantly outperforming the baselines.
* Evaluating Cognitive Maps and Planning in Large Language Models with CogEval (Momennejad et al)
  * Recently an influx of studies claims emergent cognitive abilities in large language models (LLMs).
Yet, most rely on anecdotes, overlook contamination of training sets, or lack systematic evaluation involving multiple tasks, control conditions, multiple iterations, and statistical robustness tests. Here we make two major contributions.
  * First, we propose CogEval, a cognitive science-inspired protocol for the systematic evaluation of cognitive capacities in LLMs. The CogEval protocol can be followed for the evaluation of various abilities.
  * Second, here we follow CogEval to systematically evaluate cognitive maps and planning ability across eight LLMs (OpenAI GPT-4, GPT-3.5-turbo-175B, davinci-003-175B, Google Bard, Cohere-xlarge-52.4B, Anthropic Claude-1-52B, LLaMA-13B, and Alpaca-7B). We base our task prompts on human experiments, which offer both established construct validity for evaluating planning, and are absent from LLM training sets.
  * We find that, while LLMs show apparent competence in a few planning tasks with simpler structures, systematic evaluation reveals striking failure modes in planning tasks, including hallucinations of invalid trajectories and falling in loops. These findings do not support the idea of emergent out-of-the-box planning ability in LLMs. This could be because LLMs do not understand the latent relational structures underlying planning problems, known as cognitive maps, and fail at unrolling goal-directed trajectories based on the underlying structure. Implications for application and future directions are discussed.
* Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Albert Gu, Tri Dao)
  * Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module.
Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements.
  * First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token.
  * Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba).
  * Mamba enjoys fast inference (5x higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-1.4B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe

Transcript
Hello, hello. This is swyx with a special edition of the Latent Space pod for NeurIPS
2023. Both Alessio and I were there covering what we could cover. It is an impossible conference:
15,000 people, 3,500 papers, and tons and tons of sessions. So it's just impossible for
two people to cover it, especially with limited time. But we did our best. A lot of you liked
our OpenAI Dev Day coverage where we basically just jumped from paper to paper, person to person,
and founder to founder and got their takes.
And this is effectively what we've tried to do here.
It's still an experimental new format for us.
So we'd really love your feedback.
We're actually doing a listener survey now.
If you click into the show notes, we'd really love to hear feedback and know what you want
to hear for 2024.
So we recorded a lot of audio at NeurIPS.
And I figured the most logical way to cover this would be to start with the best papers.
NeurIPS does hand out best paper awards.
So we're going to start with the hardest one to obtain, which is the Test of Time Award.
The Test of Time Award is given to a paper that
has stood the test of time, which by NeurIPS's definition is a paper that was published 10 years ago at NeurIPS.
NeurIPS is in its 37th year, so this is honestly a flex that very, very few conferences can actually do.
And it's really interesting to have the original authors of the paper come back and talk about what they've learned
and how they look back at the past 10 years. So here's Jeff Dean and Greg Corrado.
Thank you very much. I'm Jeff. And I'm Greg.
And we're here to give a little talk and a retrospective on this work.
So this work actually started out as an ICLR 2013 workshop paper with four of our co-authors working together.
And in that work, we sort of explored a bunch of different sort of loss functions and techniques for optimizing word embedding representations.
And really, that was kind of the genesis of this work.
And that work was cited by quite a few people.
And one of the things that we discovered in that work was that the Skip-gram model,
one of the few models that we evaluated in this workshop paper,
really was showing better performance than some of the other ones that we worked on.
So we decided to focus on that and really focus on the Skip-gram model
and then some interesting sort of optimization techniques
to improve the optimization of the word embeddings
and added the ability to do phrase embeddings as well.
And along the way Ilya joined
as a co-author, which is great.
And this paper has been cited by a number of people, as Sergey mentioned.
One thing we've discovered: including source code and trained representations
really does boost your citation count.
People have done this and, you know, used these representations downstream
for all kinds of things, and we're very gratified to see that in the community.
And we also want to highlight that three of our co-authors couldn't make it today.
So Tomas, Ilya, and Kai couldn't be here,
but on their behalf, we're delighted to be giving this talk.
And with that, I'm going to turn it over to Greg, I think.
Oh, no, we're older now.
Sorry.
Sadly, we've found more recent photos, and this is a test of time award.
And time has passed.
Yes, I think we survived the test, mostly.
But so let's stand back and ask ourselves, you know,
what did we really learn from these papers?
But before I get into that, I should probably stipulate that some of you out there might rightfully say,
well, we already believed these things before you published this work.
And so for you, maybe this is really us reinforcing these points.
Others of you might think that, well, the paper didn't really exactly prove this point.
It just suggested it.
So it foreshadowed it.
We don't have any quarrel with whether it was reinforcing or foreshadowing or learning,
and so we'll just put that aside for the remainder of the talk and talk about what we think
are at least the themes that were in this work that resonate today.
So the first point is that semi-supervised objectives are an incredibly powerful opportunity,
and we think that they're going to be critical for natural language understanding going forward.
We think that this paper shows that fast, parallel, and weakly synchronized
computation really dominates over the sort of fruitless precision of tight synchronization.
Focusing compute where it really helps and improves your learning of representations is what's
most important.
And tokenization can be used as a good trick to solve some nuanced problems.
And then the last, and I think most important point, is that treating language as a sequence
of dense vectors has proven to be really powerful,
and honestly more powerful than I think we imagined when we started this work.
So first on semi-supervised objectives.
Why is this so important?
Of course, almost all machine learning systems today go through some period of supervised learning.
We're always going to use that, but there's too much to learn in the world to use supervised learning for everything.
The promise of unsupervised learning, of course, is tantalizing, but has been difficult to implement in practice.
And so semi-supervised learning, the ability to construct supervised-feeling data from an unlabeled corpus, is really what we think works.
So what's the basic program here?
You begin with a large corpus of sequence data, say text, choose a random window within that corpus,
and then algorithmically construct inputs and target outputs on the fly.
And I want to underscore, I actually think doing it on the fly is part of what makes
this method so powerful. And you have your choice about how you'll do it on the fly. You might be
taking a word in the corpus and trying to predict its neighbors, which is the so-called Skip-gram model.
You might be doing something like fill in the blank, or you might be trying to predict the end of
the sequence, which is sort of the classic language modeling problem. All of these fit in this
description. And if you repeat that a few billion times, it seems to work really well. But that's
where we get into the hard part.
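
As a rough illustration of the recipe Greg describes (take a window of text and construct input and target pairs on the fly), here is a minimal Python sketch of Skip-gram pair generation. The function name `skipgram_pairs` and the fixed `window` parameter are assumptions made for this example; the talk does not show the original implementation.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs: each token paired with every neighbor within `window`."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                  # left edge of the window, clipped at the start
        hi = min(len(tokens), i + window + 1)    # right edge (exclusive), clipped at the end
        for j in range(lo, hi):
            if j != i:                           # a word is not its own context
                yield center, tokens[j]


pairs = list(skipgram_pairs(["the", "quick", "brown"], window=1))
print(pairs)
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ('brown', 'quick')]
```

Because the pairs come from a generator, nothing is materialized ahead of time, which is one simple way to read "on the fly". Swapping the inner loop for "mask one token" or "predict the last token" would give the fill-in-the-blank and language-modeling variants mentioned above.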
Yeah. So I think one of the things that we really explored in this work and sort of work we were doing concurrently with this is how effectively could we make sort of weakly synchronized, asynchronous updates to a large model work. And Tomas, our first author, had been exploring these word embedding ideas on a single machine version that he implemented in C of both the Skip-gram and the continuous bag-of-words objective functions.
And he actually did a fair amount of work to scale this up to be a very high performance implementation,
using all the cores on a single machine, so about 20 different cores at that time with almost no synchronization.
So you just kind of blindly update the embedding that was sort of a large 2D array in memory.
And then he was able to have about 20 cores on these multi-core machines simultaneously updating this shared representation
and get quite good embedding representations.
Now, one of the things that we observed was every time we made the dimensionality of the word vectors larger,
and every time we trained on more data, things got better, right?
This is the lesson of a lot of the last 10 years of deep learning work,
is scaling actually gives you much better results.
Fortunately, a bunch of us were simultaneously working on a highly scalable system
for distributed training of neural networks.
So we decided to take the single machine implementation that Tomas had built
for these word embedding questions
and implement that in our distributed framework.
And so the work we were doing just a bit before this work
was this large-scale distributed deep networks
where we were exploring distributed training
of large-scale models, mostly for vision and for speech.
And really the motivation was how can we scale training
on these systems to thousands of machines.
We actually titled this DistBelief
internally, so named because it was a distributed system, but also because a bunch of people
were skeptical that it would work. And it turns out it did work, which is nice. So the basic
idea behind DistBelief is you have some set of parameters that are being represented on some set
of machines, and then you have independent replicas of the model where you fetch the current
state of the parameters, you do some computation on the model, and then you update the parameters
by sending a gradient back to the parameter servers.
And, you know, in large-scale setups,
we were using tens to hundreds of machines
to hold the distributed state of the parameters
and hundreds to thousands of machines
to hold the sort of independent workers of the model.
And so that really meant you had 1,000 to 10,000 simultaneous threads
kind of updating the model for the word embedding kind of work.
And we were using 300 to 1,000 dimensional embeddings for a lot of things,
100K to million item vocabularies and even
beyond for a lot of internal uses.
It turns out you can make vocabularies out of lots of things, you know, not just words,
but, you know, particular videos that people have watched or all kinds of things, and use
kind of similar approaches beyond just language modeling.
And now back to Greg.
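
The parameter-server loop Jeff describes (replicas fetch the current parameters, compute on them, then send a gradient back) can be sketched in a few lines of Python. This is a single-process simulation under stated assumptions, not DistBelief itself: the class `ParameterServer`, the toy quadratic objective, and the learning rate are all hypothetical, and staleness is imitated by letting every replica fetch before any replica pushes.

```python
class ParameterServer:
    """Holds the shared parameters; replicas fetch copies and push gradients back."""

    def __init__(self, params):
        self.params = list(params)

    def fetch(self):
        return list(self.params)          # a replica gets a snapshot that may go stale

    def push(self, grad, lr):
        for i, g in enumerate(grad):      # applied immediately, with no locking or barrier
            self.params[i] -= lr * g


def replica_gradient(params, target):
    # Gradient of the toy loss 0.5 * ||params - target||^2 (an assumption for illustration).
    return [p - t for p, t in zip(params, target)]


def train(num_replicas=4, rounds=50, lr=0.1):
    target = [3.0, -1.0]
    server = ParameterServer([0.0, 0.0])
    for _ in range(rounds):
        # Every replica fetches first, so all but the first push use stale parameters.
        snapshots = [server.fetch() for _ in range(num_replicas)]
        for snap in snapshots:
            server.push(replica_gradient(snap, target), lr)
    return server.params


final_params = train()
print(final_params)
```

Even though three of the four gradients in each round are computed from out-of-date parameters, the toy problem still converges to the target, which is the intuition behind tolerating weak synchronization.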
And so that, this kind of provocative disrespect for locking and synchronization was the biggest
single enabler of being able to do this work.
But there were other things that we did that tried to focus
compute to where it really actually made a difference in terms of model representation and quality.
So, for example, the meaning of tokens that are uncommon is actually often more informative
than common ones.
The common ones are super easy to learn because you get a lot of chances at them.
So we would probabilistically discard tokens according to their frequency, ignoring common tokens more
often.
And you could apply that both to inputs and to targets.
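
A minimal sketch of this frequency-based discarding, assuming the rule given in the Word2Vec paper, where a token with relative frequency f is dropped with probability 1 - sqrt(t / f) for a small threshold t (the paper suggests values around 1e-5 for large corpora). The function names and the toy corpus below are hypothetical.

```python
import math
import random
from collections import Counter


def discard_probability(freq, t=1e-5):
    """Probability of dropping a token whose relative corpus frequency is `freq`."""
    return max(0.0, 1.0 - math.sqrt(t / freq))


def subsample(tokens, t=1e-5, seed=0):
    """Randomly drop frequent tokens; rare tokens (freq <= t) are always kept."""
    counts = Counter(tokens)
    total = len(tokens)
    rng = random.Random(seed)             # seeded only to make the sketch repeatable
    kept = []
    for tok in tokens:
        if rng.random() >= discard_probability(counts[tok] / total, t):
            kept.append(tok)
    return kept


# Toy corpus: a larger threshold is used here because the corpus is tiny.
corpus = ["the"] * 90 + ["volga"] * 10
kept = subsample(corpus, t=0.05)
print(Counter(kept))
```

With this rule a token making up 10% of the corpus is dropped about 99% of the time at t = 1e-5, while anything rarer than t is never dropped, so compute shifts toward the uncommon, more informative tokens.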
Another thing that we did was we found that favoring objectives and models that
were informative for the ultimate task, but were faster to compute, was better.
And so in our paper you can see we go through softmax and then an approximation to that,
through hierarchical softmax, and then noise contrastive estimation, which is an even faster
version, and then Ilya came up with negative sampling, which is an even faster, faster,
faster version.
We saw that quality went up every time that we were able to make it simpler and faster.
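
To make the last step of that progression concrete, here is a hedged sketch of the negative sampling objective as defined in the Word2Vec paper: push the score of the true (center, context) pair up and the scores of k sampled noise words down. The helper names and the toy vectors are assumptions; the 3/4 power on unigram counts for the noise distribution follows the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def negative_sampling_loss(center, context, negatives):
    """-log sigma(context . center) - sum over negatives of log sigma(-negative . center)."""
    loss = -math.log(sigmoid(dot(context, center)))
    for neg in negatives:
        loss -= math.log(sigmoid(-dot(neg, center)))
    return loss


def noise_distribution(counts, power=0.75):
    """Unigram counts raised to `power` and normalized, used to draw negative samples."""
    weights = {w: c ** power for w, c in counts.items()}
    z = sum(weights.values())
    return {w: x / z for w, x in weights.items()}


untrained = negative_sampling_loss([0.0, 0.0], [0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]])
trained = negative_sampling_loss([1.0, 0.0], [2.0, 0.0], [[-2.0, 0.0], [-2.0, 0.0]])
print(untrained, trained)
```

The cost per training pair is k + 1 dot products instead of a normalization over the whole vocabulary, which is why it is the "faster, faster, faster" option in the list above.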
We also found that you could use tools like tokenization to focus
computation on the part that was interesting. One of the things that we used it for was to
try to deal with phrase representation. So in English, compound concepts and nouns are often
represented by multiple words. And so we just had a very simple heuristic that allowed us to
build bigrams out of terms that were each individually not super frequent, but were
co-occurring much more frequently together than you would expect. And many other authors
have used tokenization schemes in these systems to great benefit, dealing with everything from
contractions to declensions, and I just think it's important for us to not overlook that
when we're processing text, we begin with tokenization. But then to the point of getting concepts
to be n-dimensional vectors and how it is that this is so powerful. And I was actually trained as
a neuroscientist, and so I saw this come up as ideas from a long time ago, from the 80s,
about maybe concepts could be represented in a dense vector space, and that operators in that
vector space or geometric relationships in that vector space actually meant something.
But that was simply a conjecture.
And then lo and behold, when we took these representations that we had learned in a semi-supervised
fashion and investigated what was inside by, for example, flattening them into two dimensions
using PCA, we found that syntactic relationships were represented geometrically,
like these similar triangles representing the tenses of verbs, and that even arbitrary
semantic relationships, like the relationship between countries and capitals or diseases and drugs,
were also represented geometrically in this space as similar displacements.
And that was really powerful.
And then Tomas and Ilya were able to show that you could do these cute tricks,
like solve analogies with simple vector arithmetic.
By adding and subtracting vectors, you could see that sushi is to Japan as bratwurst is to Germany,
well, at least according to the language model.
And in fact, you could even just do simple addition to imagine combining concepts and discovering what concept is nearby in this vector space.
So, for example, putting together Russian and River, you get tokens like Volga River.
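
The analogy trick can be sketched with a few hand-made vectors. Everything here is hypothetical: the four dimensions and the values in `EMBEDDINGS` are invented so the arithmetic is easy to follow, whereas real Word2Vec vectors are learned and have hundreds of dimensions.

```python
import math

# Hypothetical toy embeddings: [is_country, is_food, japan-ness, germany-ness]
EMBEDDINGS = {
    "japan": [1.0, 0.0, 1.0, 0.0],
    "germany": [1.0, 0.0, 0.0, 1.0],
    "sushi": [0.0, 1.0, 1.0, 0.0],
    "bratwurst": [0.0, 1.0, 0.0, 1.0],
    "pizza": [0.0, 1.0, 0.0, 0.0],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def analogy(a, b, c, embeddings):
    """Answer 'a is to b as ? is to c' by finding the nearest word to a - b + c."""
    query = [x - y + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in (a, b, c)]   # exclude the question words
    return max(candidates, key=lambda w: cosine(query, embeddings[w]))


result = analogy("sushi", "japan", "germany", EMBEDDINGS)
print(result)  # bratwurst
```

Subtracting "japan" and adding "germany" moves the query along the same displacement that separates the two countries, which is the geometric regularity described above; the simple-addition case (Russian plus River) is the same nearest-neighbor lookup with the subtraction left out.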
Okay, so summing it all up, what did we learn in these papers?
Let's go back to the five points that Greg talked about in the beginning.
So semi-supervised objectives applied to large text corpora are pretty important in natural language
understanding. I would say definitely true today. Fast, parallel, weakly synchronized computation
dominates in ML. Parallel, definitely. I would say larger-scale specialized ML hardware has really
enabled fully synchronized approaches to scale, even to the scale of models that we're training
today. But I personally think that asynchronous approaches are going to make a comeback, because I think
we're sort of close to where we're going to have to start reconsidering some of these asynchronous
approaches to training very large models.
Focus compute on the aspects of learning that need improvement.
Yeah, simpler, more parallel methods win out over more complex,
less parallelizable models.
You know, Word2Vec versus RNNs,
Transformers versus LSTMs.
I think this is a good lesson as we're thinking about future improvements to these things.
Tokenization can be used to solve seemingly nuanced problems.
Yeah, more powerful models on top have actually pushed tokenization
in the opposite direction of our phrase-based vocabulary,
where we now have kind of subword sort of tokenization,
and that actually seemed to work pretty well
for some of these models that have more complex attention mechanisms on top.
And treating language as a sequence of dense vectors
is more powerful than expected.
Definitely true today.
So we're really honored to receive this award.
Thanks to the committee that selected the work.
We're really honored.
And thanks to our co-authors who couldn't be here today,
and there's their pictures.
Tomas, Ilya, and Kai.
Thank you for this delightful work and co-authoring.
Were we still so young?
Thanks, everybody.
Yeah, we picked the younger ones.
By the way, there was some discussion at NeurIPS around what would be the 2024 Test of Time winner.
There was some contention for GANs by Ian Goodfellow, but probably it's going to go to the sequence-to-sequence paper because that is most influential to language models today.
The only thing I know for sure is that I know what's going to be the Test of Time Award winner
for 2027. Up next are the best paper awards from this year. There are two papers chosen,
but probably the most relevant for AI engineers is the Mirage paper, in other words,
Are Emergent Abilities of Large Language Models a Mirage? And here is Schaeffer et al.
My name is Rylan Schaeffer, and this is our NeurIPS paper, Are Emergent Abilities of Large Language
Models a Mirage? This is joint work with Brando Miranda and Professor Sanmi Koyejo.
Our paper is a story about predictability and surprise.
Our story begins with predictability.
As many of you know, several years ago, researchers observed a striking phenomenon
that as you fed large networks more and more data, the loss improved in a predictable manner.
But it wasn't just the test data.
Other researchers observed that other quantities, scaling compute,
scaling data set size, scaling parameters, yielded predictable improvements in the performance of large networks.
This was incredibly important because it told us that if you fed more into these models, you knew what you would get.
That's extremely useful.
But approximately three years ago, this story was turned on its head.
There was a new story in town, a story of surprise in large language models.
Specifically, perhaps the first instance of this was in the GPT-3 paper, where the authors observed
that you might try having language models solve a task, like arithmetic, and you make them
larger and larger and larger, and they're unable to do this task.
But then, at some seemingly unforeseeable model scale, performance skyrockets, almost
to ceiling, something that was unpredictable.
But it wasn't just on arithmetic.
It was also on many other tasks: IPA transliterate, word unscrambling, Persian question answering,
all of these tasks across a variety of language model families.
All of them seem to display these miraculous emergent abilities.
What are emergent abilities?
Emergent abilities were defined by their authors as abilities that are not present in
smaller scale models, but that are present in larger scale models.
Critically, emergent abilities cannot be predicted
by simply extrapolating the performance improvements
on smaller scale models.
These emergent abilities raised several interesting research
questions, questions like what controls which abilities will emerge?
What controls when abilities will emerge?
How can we make desirable abilities emerge faster?
And how can we ensure undesirable abilities
never emerge. These questions not only are fundamental scientific questions of interest
to the machine learning community, but these are also fundamental questions for those interested
in governmental policy or economics. What our paper asked is whether or not the story of
emergent abilities is complete. Specifically, if you look at these emergent abilities,
you might notice something that if you hone in on the metrics, all of these metrics are quite
harsh. They give no partial credit. Exact match, for instance, either you exactly output the correct
answer or you do not. There is no in-between. And so, it seemed when we looked closer that many
emergent abilities appeared under metrics that non-linearly or discontinuously scored models' performance.
For instance, on Google's large-scale BIG-Bench,
we found that over 90% of emergent abilities were observed under two metrics.
One of those metrics, for those who haven't seen this, is called multiple-choice grade.
It's like taking an A-through-D multiple-choice question.
You get a score of one if you put the highest probability mass on that answer, and zero otherwise.
The other metric was exact string match, where again,
one point, if you get it exactly right, zero otherwise. This raised the specter that emergent
abilities might not be due to fundamental changes in model with scale, but due to our evaluations
of said models. So what exactly is this alternative that I'm positing? What is our alternative
hypothesis? Let's walk through it. First of all, let's just suppose that the test loss falls as we
increase the number of parameters in our models.
So for example, motivated by power law scaling,
we might assume that the cross entropy loss
as a function of the number of parameters
is some power law.
What that means is if we visualize the number of model
parameters against the cross entropy loss
in log log space, we observe a very predictable linear trend.
In step two, we compute the probability mass
that is placed on the correct token
as a function of parameters.
So how can we do this?
Well, we know the definitional form of cross-entropy,
and we know that we can substitute in our power law scaling,
so I can rearrange.
And when I plot this, what I see is that,
as model parameters get larger, the probability mass
that gets placed on the correct token,
asymptotes, towards one.
And everybody is comfortable with this.
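Steps one and two can be written out as a short derivation. This is a sketch under stated assumptions: the constants $c > 0$ and $\alpha < 0$ are illustrative, and the per-token cross-entropy is treated as the negative log probability of the correct token.

```latex
% Step 1: assumed power law for the per-token cross-entropy loss
\mathcal{L}_{CE}(N) = \left(\frac{N}{c}\right)^{\alpha}, \qquad c > 0,\ \alpha < 0

% Step 2: cross-entropy is the negative log probability of the correct token
\mathcal{L}_{CE}(N) = -\log p(N)
\quad\Longrightarrow\quad
p(N) = \exp\!\left(-\left(\frac{N}{c}\right)^{\alpha}\right)
\xrightarrow[\,N \to \infty\,]{} 1
```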
So how do we go from this to an emergent capability?
The answer is, we might choose a metric that non-linearly
scores model performance.
For example, suppose that we want to add two five-digit numbers
and we're gonna measure performance with accuracy.
What scaling should we expect?
Well, the answer is that unless you get every token correct,
you get zero points, ergo to score one point,
it's going to be the per token probability
approximately exponentiated to however many tokens
you need to get correct.
So what happens is this graph on the right
that we like and know,
gets transformed into something that becomes much less predictable with model scaling.
And indeed, this toy model qualitatively reproduces what's been observed empirically at large scale.
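The toy model so far fits in a few lines of code. This is a minimal sketch, assuming a power-law loss with hypothetical constants `c` and `alpha` (not values from the paper) and assuming each output token is correct independently with the same probability.

```python
import math

def per_token_prob(n_params, c=1e8, alpha=-0.5):
    """Probability mass on the correct token under an assumed power-law loss."""
    loss = (n_params / c) ** alpha      # step 1: cross-entropy falls as a power law
    return math.exp(-loss)              # step 2: loss = -log p, so p = exp(-loss)

def exact_match_accuracy(n_params, target_len):
    """All-or-nothing score: every one of target_len tokens must be correct."""
    return per_token_prob(n_params) ** target_len

for n in [10 ** k for k in range(7, 12)]:
    p = per_token_prob(n)
    acc = exact_match_accuracy(n, target_len=5)
    print(f"N={n:.0e}  per-token p={p:.3f}  exact-match accuracy={acc:.2e}")
```

The per-token probability climbs smoothly, while raising it to the fifth power keeps exact-match accuracy near zero for the small models and then makes it rise sharply.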
But could we have done something differently?
Yes, suppose we had done the evaluation differently.
Suppose that we had chosen a different metric, one that linearly scales model performance.
So, for example, I might instead count merely the number of mistakes that the language model makes.
For those in NLP, you might call this an edit distance.
And what that then means is that the edit distance scales approximately linearly with the output length.
And so if we look at this, instead, what we find is when we plot model parameters versus the number of incorrect tokens,
we find a very nice predictable trend that asymptotes toward zero as you make models bigger.
So nothing has fundamentally changed.
From one viewpoint, we saw a seemingly emergent ability.
From a different viewpoint, we removed it.
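The linear-metric view can be sketched the same way. As a stand-in for token edit distance, this example uses the expected number of incorrect tokens, which is an assumption of the sketch; the constants `c` and `alpha` are hypothetical, and tokens are assumed to be independent.

```python
import math

def per_token_prob(n_params, c=1e8, alpha=-0.5):
    """Probability mass on the correct token under an assumed power-law loss."""
    return math.exp(-((n_params / c) ** alpha))

def expected_wrong_tokens(n_params, target_len):
    """Expected count of incorrect tokens; linear in the output length."""
    return target_len * (1.0 - per_token_prob(n_params))

for n in [10 ** k for k in range(7, 12)]:
    errors = expected_wrong_tokens(n, target_len=5)
    print(f"N={n:.0e}  expected incorrect tokens={errors:.3f}")
```

The underlying per-token probability is identical to the exact-match case; only the scoring rule changed, and the curve now falls gradually toward zero.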
Of course, it's not just about linear and non-linear metrics.
It can also be discontinuous metrics.
So, for example, let's consider that multiple-choice metric.
So multiple choice, again, is you get one if you place the highest probability mass on the correct option.
And what that scaling looks like is you're at chance, up until some unforeseeable critical threshold,
at which point you jump to ceiling.
And this, again, qualitatively matches what's been observed empirically at scale.
So if we had done the evaluation differently, we could have chosen a continuous metric like Brier score,
which is just the mean squared error here between one and the probability mass,
and then we find a very nice quadratic.
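A small sketch of the discontinuous versus continuous comparison, under simplifying assumptions: a single two-option question, the same hypothetical power-law constants as before, and a deterministic score (so the small models score zero here rather than chance).

```python
import math

def correct_option_prob(n_params, c=1e8, alpha=-0.5):
    """Probability mass on the correct option under an assumed power-law loss."""
    return math.exp(-((n_params / c) ** alpha))

def multiple_choice_grade(p_correct):
    """1 if the correct option has the highest mass (two options), else 0."""
    return 1.0 if p_correct > 1.0 - p_correct else 0.0

def brier_score(p_correct):
    """Squared error between 1 and the mass on the correct option."""
    return (1.0 - p_correct) ** 2

for n in [1e7, 1e8, 1e9, 1e10]:
    p = correct_option_prob(n)
    print(f"N={n:.0e}  grade={multiple_choice_grade(p):.0f}  Brier={brier_score(p):.3f}")
```

The grade flips from 0 to 1 at a single threshold, while the Brier score of the same probabilities shrinks smoothly.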
So to summarize this together, we started with power law scaling,
we figured out, we computed what the probability mass on the correct token is.
If we chose a non-linear metric, we see an emergent ability,
but if we chose a linear metric, we did not.
Similarly, if we chose a discontinuous metric, emergent ability.
If we choose a continuous metric, we do not.
And so this is our alternative hypothesis for emergent abilities.
Now, of course, to summarize this, there's basically three factors at play here.
One of them is the metrics that I focused on.
Another one is that of statistics about needing sufficient resolution,
measuring discreetness in order to accurately estimate the performance of models.
And then third and finally, the third confounding factor is evaluating too
few small and medium-sized models. So up till now, this has been Rylan's hypothesis. Do we have any
actual evidence? And the answer is, in our paper, we considered three different types of evidence.
We made and tested predictions using the largest publicly available model family at the time,
GPT-3. We did a meta-analysis of published metrics and emergent abilities on Google's BIG-Bench.
And third, we induced emergent abilities in toy miniature networks on vision tasks. The reason why
we did this is because prior to our paper we didn't know of any work that had found emergent
abilities in vision tasks so to induce them intentionally was quite novel so let's walk through this
let's first talk about the predictions that the toy model the mathematical model makes the first is that
if you change the metric you should get more predictable scaling so here again model parameters
versus accuracy as i increase the number of tokens that the model needs to output correctly we should
expect to observe approximately geometric decrease in performance so we start a piece
here and then it falls. But if I change the metric to token edit distance, I should find this
nice quasi-linear behavior. I'm now going to go test this in GPT-3. And that's precisely what we did.
So here is accuracy. And again, here's the four models in the GPT-3 family. And again, we find
that as the target length gets longer, you find it decays geometrically in the length of the target.
And that if I switch using the exact same data, fixed data, if I change the metric, I find very nice
quasi-linear scaling. This is exactly what the toy mathematical model predicts.
Moreover, there's a question about better statistics yielding more predictable scaling.
What the toy model tells us is that when we said the tiny models are unable to do the task,
that wasn't quite right. It was that their performance was so small, we didn't have sufficient
resolution in order to estimate it. So what our toy model says is we really need to consider
accuracy on a log scale. And to estimate these quantities,
we need sufficient data to do so.
So we scale up the amount of data, and again we find that if we separate into log scale,
we find a very, very nice separation with predictable behavior.
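The resolution argument can be made concrete with a short sketch. It assumes test examples are scored independently, and the accuracy value is hypothetical: a model with a small but nonzero accuracy will usually score exactly zero on a small test set.

```python
def prob_all_wrong(true_accuracy, n_test_examples):
    """Chance that a model with the given accuracy scores exactly zero."""
    return (1.0 - true_accuracy) ** n_test_examples

true_accuracy = 1e-3  # hypothetical small-model accuracy
for n_test in [100, 1_000, 10_000]:
    p_zero = prob_all_wrong(true_accuracy, n_test)
    print(f"{n_test:>6} test examples: P(measured accuracy == 0) = {p_zero:.4f}")
```

With too few examples the model looks unable to do the task at all; with enough examples the small accuracy becomes measurable and can be plotted on a log scale.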
Second, we conducted a meta-analysis of emergent abilities on Google's BIG-Bench, and what
we found is that across many, many, many metrics, we could not find emergent abilities.
But on a small subset, to be specific, four of these, we found emergent abilities.
That's what this little pie chart shows.
So long story short, it seems like the metric is playing a fundamental role in producing
these emergent abilities.
And lastly, what we did is we induced emergent abilities in networks.
So what we did is we did the simplest possible thing.
We took a shallow, nonlinear autoencoder and trained it on CIFAR-100.
Everybody has done this in their intro to machine learning class.
And what we did is we plotted the squared reconstruction error as a function of the number of
parameters.
But, and this looks very smooth and predictable, everybody has seen this.
But if we define a discontinuous metric, so here the model scores one, if the reconstruction
error is below some threshold, then you find very, very unpredictable behavior.
And so even in a shallow nonlinear autoencoder, we can again qualitatively produce what seems
to be an emergent behavior.
There's two takeaways.
One is for emergent abilities, it might be, in certain cases, the researchers' analyses
that have produced these phenomena.
That's why we call it a mirage.
There's a more general lesson that I want to leave you with. The more general lesson is that if you
want to predict changes in model capabilities with increasing scale, you need to consider the
interplay between known scaling properties, the amount and quality of evaluation data, and the
specific metrics and evaluation processes that you have available. So with that and with gratitude to all
my collaborators and everyone here for attending. Thank you. So for the purposes of this episode,
we actually tried to do interviews at the poster sessions for each paper,
but some we just didn't manage to find.
Or for the case of the Emergent Mirage paper,
it was just way too popular.
There were just so many people crowding out and listening to Rylan explain his paper again and again
that we just couldn't get a proper question in.
And I have to say, if I'm allowed to be a little bit critical,
I'm a bit puzzled as to why this paper was the best paper.
I mean, it's a good paper, but it doesn't really deny the existence of emergence.
It just pointed out some methodological disagreements, which Jason Wei has also responded to.
In other words, I don't really know if this paper affected literally anything in the field,
so I don't know why it's Best Paper and not just a regular paper.
But it's still a notable paper for sure, and it's very well done.
Next, we have the runner-up for Best Paper, which is Direct Preference Optimization,
which is a direct challenger to PPO, and you can hear directly from the authors.
Hi, everyone. My name's Eric, and I'm here with Rafael and Archit, and today we're
going to talk about direct preference optimization, which is this algorithm that simplifies RLHF,
which is this algorithm framework that has sort of been taking the LLM world by storm recently.
So to start, why are we even talking about reinforcement learning for language models? Now, it's
not the first time people have been studying reinforcement learning in the context of language
models. But the sort of simple answer to this question is that a few years ago, GPT-3 came
onto the scene and it was sort of a big deal and you probably, well, I'm an LLM person, but
you probably heard from a lot of your researcher friends like, did you hear about this new
model? And then last year, ChatGPT came on the scene and it was more like, at least I was
like getting texts from my grandmother saying like, hey, have you seen this new model, right?
And these are just like two different levels of, you know, permeation in the public consciousness.
So, you know, what is the difference between these two models?
And really, the main sort of key ingredient is this reinforcement learning from human feedback framework,
which lets us sort of align the behaviors of the models more towards what people kind of want or expect.
Okay, so to give a little bit of an overview of what sort of the existing RLHF pipeline looked like,
kind of when we started working on this project, so there are basically two main steps.
So the first step is we're going to start with some reasonable
behavior-cloned policy, what we call pi theta SFT here, so a supervised fine-tuned policy.
We're going to sample pairs of responses or trajectories from this policy conditioned on a prompt X,
and that's how we're going to gather this data set of preferences.
So we'll have an X and we'll have two Ys, and a human is going to just label which Y they think is better.
So they're just going to give us this binary preference pair over responses,
and we're going to use this data to fit a reward model.
And then in the second step, we're just going to optimize a policy to maximize rewards.
So that's just RL.
Okay?
So to look at this a little more closely in this first step, we get this feedback.
It's these triples of a prompt and two responses.
One is sort of the winner and one is the loser.
And we're simply going to train a reward model with this binary classification loss on the preference data.
So this is this Bradley-Terry model of discrete choice in humans from the 50s.
But, you know, it has some nice properties, and it's relatively simple
to understand, and we use this to fit this reward model.
So we're just taking the difference in the rewards,
and we have this sort of Boltzmann rational model here
that we're fitting with maximum likelihood.
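The binary classification loss described here can be sketched as follows. This is a minimal illustration, assuming the reward model has already produced scalar rewards for the winning and losing responses; the function names are hypothetical.

```python
import numpy as np

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x)) = -log(1 + exp(-x))."""
    return -np.logaddexp(0.0, -x)

def bradley_terry_loss(reward_winner, reward_loser):
    """Mean negative log-likelihood that the winner is preferred.

    Under the Bradley-Terry model, P(winner preferred) is the sigmoid
    of the difference between the two rewards.
    """
    margin = np.asarray(reward_winner) - np.asarray(reward_loser)
    return float(-np.mean(log_sigmoid(margin)))

# Hypothetical reward scores for three preference pairs.
print(bradley_terry_loss([1.5, 0.2, 2.0], [0.5, 0.4, -1.0]))
```

Because only the difference of rewards enters the loss, adding the same constant to both rewards of a pair leaves it unchanged.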
Okay, and so now that we're done with this first step,
what are we going to do with this reward model?
Well, we're going to try to find a policy achieving high reward.
And so, you know, ideally this reward model
after we've done this supervised learning stage
should represent goodness according to what humans want.
And so we're just going to fit a policy that both achieves
high reward but also stays close to our original model, our reference model, or our
supervised fine-tune model. And so that means we were going to try to find a policy here,
pi-theta, that generates samples that achieve high reward under our learned reward model, but
also stays close to our original model, our reference model, because if you remember, we
actually fit our reward model on samples that were annotated by humans, but these samples were
generated by our reference model, our supervised fine-tune model, right? So we don't want our policy
to drift too far away because, you know, the, we want to stay in the regime where our reward
model is actually reliable. Okay, so now that we, you know, have this objective, we take some
off-the-shelf RL algorithm. Typically, it's PPO, and we find a policy that optimizes these rewards.
This is a very complicated procedure, so there's this nice figure in this recent paper,
showing sort of the full pipeline of just the PPO step, and there are a lot of moving pieces here.
And so in light of sort of this complexity, we kind of set out to see if there's some way we can sort of use a structure of this problem to simplify things.
All right.
So how the heck do we solve this optimization without reinforcement learning or what we call direct preference optimization?
Really the key here is that the optimization that was set up for RLHF has a closed-form optimal solution.
Now, this may look a bit intimidating, but it's really just the reference distribution reweighted by the exponentiated reward.
So if you have a good completion, you want to put more probability mass on it, and if you have a bad completion, why, you want to put less probability mass on this.
This may look familiar, it's the Boltzmann distribution that you might have seen earlier, and it's very commonly used across machine learning and physics.
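The closed-form claim can be checked numerically on a tiny example. This is a sketch, assuming a single prompt with three possible completions and hypothetical values for the rewards, the reference distribution, and beta.

```python
import numpy as np

def kl_regularized_objective(policy, reward, reference, beta):
    """Expected reward minus beta * KL(policy || reference), for one prompt."""
    kl = np.sum(policy * np.log(policy / reference))
    return float(np.sum(policy * reward) - beta * kl)

def boltzmann_policy(reward, reference, beta):
    """Reference distribution reweighted by the exponentiated reward."""
    weights = reference * np.exp(reward / beta)
    return weights / weights.sum()      # the sum is the partition function

reference = np.array([0.5, 0.3, 0.2])   # hypothetical reference policy
reward = np.array([0.0, 1.0, 2.0])      # hypothetical rewards
beta = 0.5

optimal = boltzmann_policy(reward, reference, beta)
best_value = kl_regularized_objective(optimal, reward, reference, beta)

# Compare against many random policies: none should beat the closed form.
rng = np.random.default_rng(0)
random_policies = rng.dirichlet(np.ones(3), size=1000)
random_best = max(
    kl_regularized_objective(p, reward, reference, beta) for p in random_policies
)
print(best_value, random_best)
```

With three completions the partition function is a three-term sum; for a language model it would be a sum over every possible completion, which is why it cannot be computed directly.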
But the key takeaway here is that every reward function r will induce an optimal policy pi_r.
But there's a very nice way to view this identity through another perspective where we express the reward model in terms of the policy itself.
So r_pi(x, y) can be written as beta times the log ratio of pi over pi_ref, plus beta times the log partition function Z(x).
And this really is the key, where every policy pi is optimal for some induced reward model, r_pi.
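The two spoken identities can be written out as follows. The notation ($\pi_{\mathrm{ref}}$, $\beta$, $Z(x)$) follows the wording of the talk; the exact symbols on the slides are an assumption here.

```latex
% Optimal policy induced by a reward function r
\pi_r(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)
  \exp\!\left(\frac{1}{\beta}\, r(x, y)\right),
\qquad
Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x) \exp\!\left(\frac{1}{\beta}\, r(x, y)\right)

% Rearranged: the reward for which a given policy \pi is optimal
r_{\pi}(x, y) = \beta \log \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
```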
And this really is the key to DPO because our key idea here is that you can fit this reward model
parameterized as a beta-log ratio to the preference data and hopefully skip the RL process altogether.
But the problem is that this log partition function is basically intractable, as you have to sum over all possible completions for a given instruction.
So how do we get away from this?
Now, fortunately for us, the reward modeling loss that we looked at, the Bradley-Terry loss, only depends on the difference in the rewards.
Specifically, the reward for the preferred completion
minus the reward for the dispreferred completion.
Now if you look at the induced reward
difference and if you plug in the DPO parameterization here, you can see that it only ends up depending on the DPO reward for the preferred completion minus the DPO reward for the dispreferred completion. Now the more important thing here is that the partition function, which only depends on the instruction x, cancels out, as it only depends on the prompt. And this really is the key part here. And if you plug in this difference of rewards in the classification loss,
you get the DPO loss function.
And really, in its essence, it's just a classification loss
with a specific reward parameterization,
which will give you the optimal policy
for the original RLHF objective.
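A minimal sketch of this loss, assuming you already have the summed log-probability of each completion under the policy being trained and under the frozen reference model. The function and argument names are hypothetical and this is not the authors' released implementation.

```python
import numpy as np

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Classification loss on rewards parameterized as beta * log(pi / pi_ref)."""
    # Implied rewards; the beta * log Z(x) term is omitted because it cancels
    # in the difference below.
    reward_chosen = beta * (np.asarray(policy_logp_chosen) - np.asarray(ref_logp_chosen))
    reward_rejected = beta * (np.asarray(policy_logp_rejected) - np.asarray(ref_logp_rejected))
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)), computed stably as log(1 + exp(-margin)).
    return float(np.mean(np.logaddexp(0.0, -margin)))

# Hypothetical sequence log-probabilities for two preference pairs.
ref_chosen = np.array([-5.0, -7.0])
ref_rejected = np.array([-6.0, -8.0])
print(dpo_loss(ref_chosen + 1.0, ref_rejected - 1.0, ref_chosen, ref_rejected))
```

When the policy equals the reference, both implied rewards are zero and the loss is log 2; it decreases as the policy raises the chosen completion's log-probability relative to the rejected one.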
So to go back to what Eric presented earlier,
the RLHF is typically a two-step process.
You first fit a reward model and then you do some RL on top of it.
Really what we are doing here is that we choose a specific parameterization,
the DPO parameterization for the reward model,
we're still fitting the reward model exactly the same way,
but you get the optimal policy in the process,
and you don't have to do the step two at any point of time.
It's pretty useful to look at the DPO loss function
through its gradient as well.
Just to recall, it's still a classification loss.
Nothing changed in the two slides.
And you're trying to maximize the difference
between the rewards.
But the gradient is really intuitive.
Specifically, what we're trying to do is increase
the log probability of the chosen completion,
and we're trying to reduce the log probability of
the rejected completion. The important part here is that we slow down the training on the preference
pairs where the induced reward model is already pointed the right direction, so you're not
overfitting to the examples over and over again. But overall, it's really intuitive as you're just
doing up on the good examples and down on the bad examples. And finally moving to our experimental
results, the first thing we really wanted to evaluate is how good of an optimizer that is for
the core objective of reward versus divergence trade-off for these language models. So we
started with this synthetic experiment where the goal is to generate positive movie reviews on this IMDB dataset with a small GPT-2 base model.
We created synthetic preferences by sampling several times from the base model and using a pre-trained score classifier to construct synthetic feedback pairs.
Kind of immediately the first thing we see is that DPO provides the best reward-KL trade-off.
And PPO, although it improves quite a bit, doesn't quite match that efficiency
of optimization, even when we provide it with the ground truth scoring model that generated
the preference data.
And in addition, other sort of algorithms that are RL-free and avoid the reward-modeling approach,
such as just fine-tuning on the preferred answers or things like that, either don't
produce the same level of improvement or are unstable.
We then decided to try to scale these results up to harder, more involved problems.
The first thing we did is this summarization task.
The goal is to provide summarizations of some Reddit posts,
and a dialogue task on the Anthropic Helpful and Harmless dataset,
publicly released datasets.
And kind of again, what we see there is that across the board,
DPO either matches or outperforms all other baselines.
And particularly, for example, in the summarization case,
the PPO model is almost twice as big.
So another interesting experiment that we ran recently is evaluating the generalization capabilities of the DPO policy
because essentially the PPO-trained approaches sample a lot of additional data and have the capability to train on a lot of additional data,
while DPO is fully only using the offline data set of preferences.
So what we did here is we took the summarization models that we presented in the previous slides.
Those are the first two graphs on the left,
sampled at different temperatures,
and evaluate within distribution,
as you kind of see within distribution, they're quite comparable.
And then we evaluated them on out-of-distribution data,
particularly summarization of news, CNN and Daily Mail articles.
And we do see quite significant drop
when we take these models out of distribution,
but the interesting thing is that the DPO policy
still generalizes just as well or even perhaps better
than the PPO-trained policy,
even though the PPO-trained policy is trained on a lot more additionally sampled data.
However, I think the strongest sort of validation of this algorithm and its capabilities
are the strong open source models that have been trained by the community,
and this is only a selection of those.
There are others we couldn't fit on the slide.
And if you could go through all of them, you see that especially some of the recent ones,
do match or sometimes even outperform ChatGPT on some broad benchmarks.
Another point to mention here is this is only within the
language domain, but recent works have done this, training state-of-the-art text-to-image
models with the DPO algorithm, using it for vision-language models, and also using it for
multi-step control as well. So this is going beyond language and is becoming kind of a paradigm
of alignment. So in conclusion, I want to point out that DPO removes the
complicated, expensive RL training loop from RLHF. It's simple, stable, and computationally
cheaper than PPO, I think almost, you know, an order of magnitude.
And most importantly, it's also principled.
You're optimizing the exact same objective.
It's not a hack.
It's optimizing for the exact same thing.
And yeah, as you've seen, others are training, you know,
a lot of state-of-the-art models.
We've been achieving pretty strong results, so you should do as well.
If you want to learn more about it, you can come talk to us at our poster,
and we have publicly released our code implementation.
You can find it on GitHub, and you can check our paper on arXiv as well.
Thank you very much.
So DPO is interesting because it promises to be simpler than PPO.
It's definitely easier and cheaper to train and there are a bunch of models already emerging
being trained on it.
The main criticism that people seem to have is that it isn't performing as well in terms
of alignments or results or benchmarks as PPO trained models.
But that still remains to be seen whether that ease of use and cheapness of availability of data
whatever makes it so much better that it doesn't actually matter.
So what happens at NeurIPS is that some papers are selected for oral sessions and then everyone
heads down to the poster hall where there's about 600 posters simultaneously presenting, including
the people from the oral sessions.
And this is what we did.
We went down to talk to the paper authors after their oral session.
So we're going to hear them re-explain DPO in four minutes and then answer a bunch
of Q&A.
But you can also get a sense of how chaotic and noisy it is in that poster session.
It's just a mess and I love it.
I'm talking about direct preference optimization here.
RLHF is really cool.
You get ChatGPT from GPT using RLHF.
If you've never heard of ChatGPT, you might want to look it up.
It's really important.
RLHF is complicated. It's really hard.
You start with like preference data distribution.
You usually have to do some kind of RL process on top of it.
And RL is hard to implement because it has a lot of moving components.
You have to sample the model a lot.
You have to train a value function.
You have to do a lot of magic decree to get it to work.
Our hope was that like can we make this simpler and that's where we design DPO.
Just to give a brief overview of RLHF, it starts off with some distribution or some model that
you have already trained which is usually reasonably good. I'm thinking of GPT-3, which is already
pretty good. Then you collect some preference data on top of it, so you have an instruction, a pair
of completions, and the human labels which one is preferred and which one is dispreferred. With this
preference data, you first fit a reward model.
The reward model will give you, it's basically telling you that the preferred completion should
have a higher reward and the dispreferred completion should have a lower reward.
And this is a simple classification problem.
It's very straightforward.
Now given this reward model, you want to do RL on top of it.
So like you want to generate completions which are good.
And the way you set it up is the, you maximize the expected reward under a KL constraint
to the initial distribution that you started with.
Now why the KL constraint, the models can degenerate very, very quickly.
And usually what you want to do is stay close to these models so you don't degenerate
and you do not exploit the reward model.
The reward models are trained on a very little amount of data and these are very easy to exploit.
So that's where the KL constraint is important.
This is a traditional RLHF pipeline.
This is what exactly was used for ChatGPT initially at least and it's very complicated to do
with PPO or like it's hard to get it right.
Now, our contribution is the direct preference optimization.
And the way this works is that it turns out for this optimization, there is an exact optimal
solution.
This optimal solution, if you have seen Boltzmann distribution before, very simple.
You take a reference distribution, you upweigh the good things by exponentiated reward,
and you downweigh the things by exponential reward which are bad.
So it's just the exponential reward weighted for the reference distribution.
Now, unfortunately, this is intractable.
Why? Because the partition function is intractable.
So you cannot actually compute this distribution.
But as it will turn out, this is not going to matter.
So our main contribution is that you can actually rewrite the reward in terms of the policy itself.
So simple algebra, you write the reward in terms of beta log pi over pi ref.
This is just simple here.
Take your time, just look at it for a second.
You're just rearranging terms.
But the thing is that you still have a beta log partition function, which is
intractable. Now the key thing is we can fit this reward using the same
classification loss that we were using earlier over here but and the nice thing is
it depends upon the difference between the reward for the good completion and
the bad completion and the partition function actually cancels out. If you
look at the partition function it only depends on the instruction. So it only ends up
depending on this quantity and this is exactly how you get the DPO loss. You're
plugging in this, like, implied reward
function into the classification loss and you get the DPO classification loss, which is
directly in terms of your policy that is being fine-tuned. So you no longer need to do an
explicit reward model where you're learning a different reward model. You do not have to
do any RL optimization after that. What you're doing is exactly you're fitting this
reward model and you immediately get the optimal policy for that reward model without
doing any RL. And that's like the main main pitch for DPO. Any questions? Anything
I can explain further?
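The cancellation argument can be checked with a few lines of arithmetic. This is a sketch with hypothetical numbers: the log-ratios and the value of the log partition function are made up, and in practice the latter could not be computed at all.

```python
import math

def preference_prob(reward_chosen, reward_rejected):
    """Bradley-Terry probability that the chosen completion is preferred."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

beta = 0.1
log_ratio_chosen = 1.2     # hypothetical log(pi / pi_ref) for the good completion
log_ratio_rejected = -0.4  # hypothetical log(pi / pi_ref) for the bad completion
log_z = 37.0               # hypothetical log partition function for this prompt

# Implied rewards with the partition term included...
with_z = preference_prob(beta * log_ratio_chosen + beta * log_z,
                         beta * log_ratio_rejected + beta * log_z)
# ...and with it dropped entirely.
without_z = preference_prob(beta * log_ratio_chosen, beta * log_ratio_rejected)

print(with_z, without_z)
```

Both completions share the same prompt, so the same partition term is added to both rewards and vanishes in the difference.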
You don't have to learn any reward.
But this thing, you can extract the actual cost.
You don't have to, but the policy already implies a reward.
Yes, exactly.
Does that make sense?
I don't mean the .
How do it?
Yes.
Not the action, but this is this
specific reward model.
What about the data collection aspect?
Sorry.
What about the data collection aspect of RLHF?
That's a great question.
So people usually sample more completions online,
and you don't have to do any of that.
You only have to sample the preference data set in the beginning,
which we use for .
How do you know that your preference data set is as good as?
We use the exact same preference data set for RLHF and for DPO.
It's like a mathematical shortcut.
Yes.
I assume.
Like they created a new loss function.
You train this model on some data distribution,
but when you explore, it might go out of distribution.
Yes, yes, yes.
It kind of limits the policy.
Yes, exactly.
So that is the major reason what drop is?
In general, PPO also has a high-variance estimator, so the optimization is never perfect, whereas with DPO, you know for a fact that it's an optimal policy.
So it's very, very similar, but you know for a fact that it's optimal.
But in general, if you have a very well-tuned PPO pipeline, it will usually work reasonably similarly.
But yeah, you don't have to do any of that.
Essentially, one of the things is that...
This is not an assumption.
This is the actual solution.
This is not an assumption.
It's generic in terms of mathematical form.
But I was wondering, for example, does it match the definition of reward?
Because you could write any exponential function here and it's been called reward, but does it match the reward definition?
In this optimal solution, you assume there's a reward function that has been given to you.
Oh, yeah, yeah, yeah.
The sequence of actions, that's a
constant times the log ratio of some
and I mean, overall if I look at the experiments,
let's look at the real world data sets.
Like, I mean, we try out like summarizations,
like single turn dialogue, and it all works great.
You never had to do like any online exploration
or of any form and like,
DPO relatively works better than PPO or very similarly to it.
I think...
Can I ask, what's the methodology? Like,
you take a base model
and you fine-tune it with DPO?
So we take the same base model.
We have the same preference data set.
First you fit a reward model for PPO and then you do RL for it.
Okay, so completely comparable.
Yes.
In general, like, I mean, we tried to reuse people's already pre-trained models for RLHF,
but we looked at their pipeline, it was exactly the same.
Because if we do it, like, there's always a case that it's possible that we didn't tune it well enough.
So, like, we tried to, like, take models already trained using RLHF,
and try to compare to them directly.
But they're trained on the same datasets.
Very strong models have been trained using DPO.
They're already being used.
Yeah, Zephyr is the one I know about.
The Mixtral models, if you've looked at them,
were trained using DPO as well.
Oh, that's the Mixtral Instruct?
Yes.
Okay.
They were trained using DPO as well.
So if you guys...
That came out very recently.
Yes, that's why it's not on the poster, but like, I mean,
it's, you guys, if you're thinking of fine-tuning using preferences,
You should try to use DPO.
How much is the efficiency gain compared to a PPO process?
A lot, because you only have to do one step.
It uses the same set of preferences.
It's usually the one-fifth.
So, like, basically, no trade-offs?
I'm looking for trade-offs.
I cannot find any.
More research needs to be done.
There are arguments to be made that PPO might do better in some cases,
but it's unclear.
Like, we haven't personally seen any evidence yet.
I see, I see.
Sorry, one more question before.
Go for it, yeah.
I noticed Chelsea Finn's a co-author.
What guidance has she given?
I'm curious.
I mean, look, we're all in her lab.
She's the one who selected us.
She's the one who's providing the infrastructure.
Yeah.
Like, I mean, none of this would be possible without her.
I'm just curious, like, is there any, like, interesting stories,
any, like, good advice that she gave that, like, really inspired you that you want to pass on to others?
When we started discussing the idea with her,
she was very insistent that you should try to push this because this is a nice idea.
But if you sit on it, somebody else might do it or it might fade into irrelevance.
This paper came about in the three weeks before the NeurIPS deadline.
So we had to push really hard.
And how did you come up with the idea?
You said.
I mean, we were looking at this kind of equation, and Rafael did a bit of algebra and said,
oh, maybe we can just completely skip the RL part if we like look at this thing.
Like, I mean, we're playing around, generally speaking, like, there's a reward estimation step.
Whenever you're learning three things in a sequence, if you can statistically remove one of the steps.
Yeah, you gain a lot.
Yeah.
So that's where the motivation usually comes from.
Has John Schulman commented on this?
Yes.
What did he say?
I mean, he tried it.
He said it works.
But there's some questions about, like, they might be training their reward models on more than binary pairwise preferences.
So it's not immediately clear how to extend that using DPO.
Like multiple choice?
Unclear.
They obviously did not tell me what they're doing.
But they're training on more than just pairwise preferences and they might still want to do
RLHF.
You can decompose most things into pairwise.
Yeah, that's kind of what I assume, but I don't know what exactly they're doing.
So there's a situation where they might be conditioning the reward model on something more than what your policy is conditioned on.
That means my Rappell-Y and X is a positive.
That's all I got.
Thank you.
The other best paper runner-up that we'll talk about is Scaling Data-Constrained Language Models,
in other words, the Datablations paper.
And this is a scaling laws paper kind of in the vein of the Chinchilla paper, but done with a different assumption in mind.
Instead of holding compute constant or holding parameter count constant, here we are running into the real-world problem of data constraints.
So given that you have a fixed amount of data, what should you do to pre-train your models?
This kind of paper tends to be a very expensive paper to write, just because you have to do so many ablations.
Here it's notable that Hugging Face has created this and open-sourced both models and datasets.
So kudos to Hugging Face.
Hi, I'm Niklas, and I'm presenting Scaling Data-Constrained Language Models.
The premise for this work is that we are data constrained.
Here's a plot from prior work that estimates that, given their definition of high-quality language data,
it's going to be exhausted next year.
And what they mean by high-quality language data
is data such as papers and books.
There's other sources like code,
however it's unclear how useful it actually is
for large language models.
And for low-resource languages,
we are already hardcore data constrained.
The first solution we investigate is simply repeating data.
It's important to mention here that,
while it's pretty common to train for multiple epochs
in most machine learning problems,
for large language models, this has been very uncommon.
In GPT-3, they write that data are sampled
without replacement.
In PaLM, they say that they explicitly avoid repeating data in any subcomponent, and there
was other work explicitly recommending against repeating any data when training large language
models.
So we ask, is it really that bad?
To answer this question, we have three different setups.
We start by simply training for a single epoch.
Here, this is your usual training graph where we have the validation loss on the y-axis and
the training tokens on the x-axis.
And for all of those setups, there's nothing special here: performance
improves as we increase training. Now what happens if we train for two epochs? Notably
the performance is around the same. So here only half of the data is unique and
it has to be repeated twice. So for the setup on the left, 28 billion tokens are unique
and they're repeated for two epochs. Three, four and it's still pretty similar,
however eventually it starts to diverge. So we shouldn't train for too many epochs. At 44
epochs, literally just 1/44th of the data is unique and repeated 44 times. So that's like
billion unique tokens for the set up on the left and that obviously isn't very
good. However, for a few repeats, performance is very similar, suggesting that we can scale a lot
further with existing data constraints by simply repeating for large language models. This
naturally leads to the question: how should we allocate compute when we are in that repeated
regime? A quick reminder from last year: Chinchilla told us that when we're not
repeating data, so in the single epoch regime, we should scale model size and training data equally,
in equal proportions.
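A rough sketch of what scaling in "equal proportions" means in practice. It assumes the common approximation that training compute is about 6 x parameters x tokens, which the talk does not state; the function name and the example numbers are hypothetical.

```python
import math


def scale_equally(params, tokens, compute_multiplier):
    """Scale parameters and tokens by the same factor for a larger budget.

    Assumes compute ~ 6 * params * tokens, so multiplying compute by k
    multiplies both params and tokens by sqrt(k).
    """
    factor = math.sqrt(compute_multiplier)
    return params * factor, tokens * factor


# Example: 4x the compute doubles both model size and training tokens.
new_params, new_tokens = scale_equally(1e9, 20e9, 4.0)
```

Under this rule the token requirement grows with the model, which is exactly what stops working once the pool of unique data runs out.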
What does it look like when we're repeating?
To investigate this, we train on 100 million unique tokens
and vary the model size and the number of epochs over those tokens.
Each model is depicted as one of those dots.
And as we go towards the upper right,
so more parameters and more epochs, loss improves as indicated by the contours.
We put forth scaling equations
to exactly predict this change in loss
and how you should allocate when you're in that repeated
regime. They're depicted on the right. Now if we add in the efficient
frontiers, the Chinchilla scaling laws efficient frontier extrapolated to
multiple epochs corresponds to the dashed line. So here it's just an equal scaling of
parameters and equal scaling of epochs. However, our fit suggests that data
should be scaled faster when we're in that repeated regime. And this is seen by
the line branching off below, which eventually just fades away because at some point
you can't get more value out of your data,
especially with just 100 million tokens,
at some point you're just running out of value
in those few tokens.
Now we test our predictions at scale.
Here we have two models,
one allocated according to Chinchilla scaling laws,
and one allocated according to data-constrained scaling.
The one on the top is Chinchilla,
and the one on the bottom,
indicated by the red star, is our allocated model.
They both have the same number of flops
and the same data budget of 25 billion tokens,
and we see that by training with fewer parameters for more
epochs, so 6.3 billion parameters and 9.7 epochs, or 242 billion tokens, we get a better
loss. But not only loss, we also test this in terms of downstream performance and
get better downstream performance, as indicated by the column towards the right.
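The token arithmetic behind repeated training can be checked with a small sketch. The 25 billion unique tokens and 9.7 epochs come from the talk; the helper names and the 100 billion token target in the second example are hypothetical.

```python
def tokens_seen(unique_tokens, epochs):
    """Total training tokens when a fixed pool of unique tokens is repeated."""
    return unique_tokens * epochs


def epochs_needed(total_tokens, unique_tokens):
    """Number of passes over the unique data required to reach a token target."""
    return total_tokens / unique_tokens


# 25 billion unique tokens repeated for 9.7 epochs, as in the talk.
total = tokens_seen(25e9, 9.7)
# Hypothetical target: with only 25% of the tokens available, 4 passes are needed.
passes = epochs_needed(100e9, 25e9)
```

The first call gives roughly 242 billion tokens seen, matching the figure quoted for the data-constrained model.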
This was repeating, and now we're going to look at complementary strategies to
solve data constraints. One intuitive
strategy is making use of that code data that we saw earlier. So can we simply fill up
the missing data with code from GitHub? In addition, we evaluate filtering
strategies. Specifically, we look at fuzzy deduplication and perplexity filtering.
The idea here is: can we use a quality filter and then repeat to get better
performance than with the initial data set? Here are the results. On the y-axis,
we have the average performance across 19 natural language tasks. On the x-axis is the
data budget. So towards the left we have 100% of available data so we don't
need to use any of those strategies. But as we go to the right, our data budget
is smaller and smaller and we need to repeat data or fill the missing data with
code. Starting with the purple line, we can confirm our findings from earlier
that also in terms of downstream performance roughly four epochs seems like a
good trade-off. So at 25% data budget we have to repeat four times corresponding to
four epochs. And then eventually if you train for too many epochs, it drops quite
a bit so you have to be careful with repeating. The red line corresponds to
filling missing data with Python code. Similar to the repeating line we see that
we can make up for a lot of natural language data with code without a drop
in natural language performance. So these are all natural language tasks and it
seems like coding data is helpful for some of them. We even see spikes on some of
these tasks as soon as code is added. Finally we investigate the filtering
strategies. We find that quality filtering then repeating
can be much better than the data set to start with.
So here the yellow star at the top corresponds
to perplexity filtering and then repeating for two epochs.
The orange star corresponds to fuzzy deduplication
towards the right.
And we find that you have to be careful with too much
de-duplication, because it can lead to a worse model
by limiting your available data.
Now I'll go through the takeaways.
The first takeaway is that repeating data is generally fine.
For many setups, roughly four epochs
seems to provide a good trade-off.
However, there are diminishing returns and you have to be careful with too many epochs.
Next, adding code data is fine, even if you're only interested in natural language tasks.
We find that 50% provides a good trade-off for most setups.
Finally, quality filtering plus repeating can be a good strategy and is often much better than the data set you started with.
Because the penalty from repeating is often much smaller than the additional gain you can get from quality filtering.
And finally, I wanted to finish off with some other work that has made use of these findings
in their large language model training.
So at the top, we have FinGPT, a large language model for Finnish, where they only had 38 billion unique tokens,
and they had to repeat them for eight epochs in order to be able to train a reasonable large
language model with 13 billion parameters.
And there are several more that have made use of these findings.
The finding that training up to four epochs is almost as
good as getting new data is pretty surprising and actually directly counters a very famous paper
called One Epoch Is All You Need. I actually ran into Aran Komatsuzaki at the decibel party.
And it's just surprising at this stage in ML that we still don't know some very basic questions
around how many epochs we should train on a dataset. I mean, I still think that we are
surprisingly sample efficient. You know, the consensus is now between one to four epochs, sometimes
in some cases maybe up to eight, but more importantly than
that, I think this work is notable because it is the best example of what open source AI
research should look like, and of course it's from Hugging Face. If you go to the GitHub repo,
you can see not only their papers, but also very, very well documented code showing exactly
what they did and how they got their results, including the dataset filtering. So just exemplary
work of open source AI and no surprise that they won one of the best paper awards. However, I did not
manage to catch up with them for a post-presentation interview, but I did go straight
to the next session on QLoRA with Tim Dettmers.
I'm Tim.
Today I present QLoRA, efficient fine-tuning
of quantized large language models.
Language models have gotten a lot bigger and a lot more powerful,
but they have become so big that it is actually quite difficult
if you take a pre-trained model and you want to fine-tune it
as sort of a normal researcher.
Often you now need a big GPU server, and most researchers
don't have that.
So with QLoRA, what we worked on is reducing the memory requirements,
so that everybody can fine-tune large language models.
The main contribution of QLoRA is we compress
neural networks to 4-bit, and we develop a new data type,
4-bit NormalFloat, that can replicate 16-bit performance,
even though we compress the neural network to 4-bit.
Before I talk about QLoRA, I'll give you a little bit of background.
So this work is about quantization, about compression.
So we do, for example, quantization, if we
have a 32-bit float number, and we want to quantize it
to a 4-bit integer.
In this diagram, I have a histogram, which
is equivalent to an Int4 quantization with 16 different bins.
And in red, I have the normal distribution.
And if we want to quantize all the values
in the normal distribution to a 4-bit integer,
we need to reduce all these values to 16 different values.
How do we do that?
We find the empirical minimum and maximum range
of the distribution.
And then we slice this distribution in 16 different slices
with equal width.
Each of these slices is a quantization bin,
and all the values of the normal distribution contained
in this bin are quantized to the middle value of the bin.
With that, we can reduce all the values
in the normal distribution to just 16 different values.
And this is Int4 quantization.
Now, if we do other quantizations with other data types,
we have different ranges.
And so what I do in my work is I generalize these data
types by normalizing the range the data types take
to the range minus 1 to 1.
This approach is also called a codebook,
where you map an index to particular values in the data type.
And so if we have this codebook,
there's a two-step recipe, how we can quantize any tensor.
And so we take the tensor X,
and we normalize it into the range minus 1 to 1
by dividing by the absolute maximum value,
and then we go through each element in the tensor
and find the closest value in the data type.
We do that by doing a binary search on the sorted values in the data type,
and with that we can then quantize the entire tensor.
Just to make this a little clearer, here is an example.
This is a very unusual 2-bit data type.
It has the values minus 1, 0.3, 0.5, and 1.0. The input tensor is 10, minus 3, 5, 4.
And now let's go through the steps of the recipe.
So first we find the absolute maximum value, which is 10.
We divide by it.
We get 1, minus 0.3, 0.5, 0.4.
And then we find the closest value of these values
for each element associated in the data type.
We get 1, 0.3, 0.5, 0.5.
Then we find the associated index of these values.
And this is now a 2-bit representation.
Now we can store it and it's compressed.
If we want to dequantize these values,
we just do all the steps in reverse.
So we look up the associated values in the data type,
and then we denormalize by multiplying
by the absolute maximum value of 10.
It gives us 10, 3, 5, 5.
And so if we compare input and output tensors,
what we see is that we have two big errors.
The minus three turned into a three,
and the four turned into a five.
These are quantization errors.
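The two-step recipe and the worked example above can be reproduced with a short sketch. This is an illustration only, not the bitsandbytes implementation: the function names are hypothetical, it works on plain Python lists, and it assumes the input contains at least one non-zero value.

```python
from bisect import bisect_left


def quantize(tensor, codebook):
    """Absmax-normalize, then map each value to the index of the nearest
    codebook entry. The codebook must be sorted in ascending order."""
    absmax = max(abs(v) for v in tensor)
    indices = []
    for v in tensor:
        x = v / absmax
        pos = bisect_left(codebook, x)  # binary search on the sorted values
        if pos == 0:
            indices.append(0)
        elif pos == len(codebook):
            indices.append(len(codebook) - 1)
        else:
            lo, hi = codebook[pos - 1], codebook[pos]
            indices.append(pos if hi - x < x - lo else pos - 1)
    return indices, absmax


def dequantize(indices, absmax, codebook):
    """Look up each stored index and undo the normalization."""
    return [codebook[i] * absmax for i in indices]


codebook = [-1.0, 0.3, 0.5, 1.0]   # the unusual 2-bit data type from the talk
indices, absmax = quantize([10, -3, 5, 4], codebook)
restored = dequantize(indices, absmax, codebook)
```

Note that 0.4 sits halfway between 0.3 and 0.5, so which neighbor it lands on depends on the tie-breaking rule and on floating-point rounding; the talk rounds it to 0.5. The minus 3 becoming 3 is the larger error, caused by this data type having no small negative value.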
And so the main challenge in quantization research
is we want to compress a neural network
with low precision data type,
But we want to keep all the quantization errors minimal.
If the quantization errors are large,
we degrade the neural network performance.
And we want to avoid that.
And that's the main challenge.
Let's talk a little bit about fine tuning.
Why is it so expensive?
So the best way to look at it is to look at the cost per parameter
in fine tuning.
And so the per parameter cost for full fine tuning
is 16 bits for each weight, 16 bits for each weight gradient,
and 64 bits if we use Adam,
for each parameter.
That gives us 12 bytes per parameter.
And if you have a 70 billion parameter model,
that's 840 gigabytes of GPU memory,
or 36 consumer GPUs.
That's a lot of memory.
If we use low-rank adapters, we get much more efficient.
And so what we do there is we take a pre-trained model,
we freeze it.
Now we put some tiny layers on top of it, some adapters.
And so if we fine-tune it,
we do stochastic gradient descent through the frozen layers,
into the adapters, and we just update the adapters, not the main model.
And so what that does is the weight still needs 16 bits per value.
But now all the other values that are updated, they're only a fraction of a bit on average.
And so in total we have 17.6 bits per parameter.
That adds up to 150 gigabytes of memory, which is eight consumer GPUs.
Now, with QLoRA, we step in and go a step further.
So now we take the pretrained model,
quantize it to 4-bit and then put adapters on top.
It reduces the average footprint to 5.2 bits per parameter,
which is 46 gigabytes, and that fits into two consumer GPUs.
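The memory figures quoted for the three setups follow from the bits-per-parameter numbers. A small sketch of that arithmetic, assuming 1 GB = 1e9 bytes and counting only the per-parameter costs named in the talk (activations and other overheads are ignored); the function name is hypothetical.

```python
def finetune_memory_gb(num_params, bits_per_param):
    """Approximate memory in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 70e9  # the 70 billion parameter model from the talk

full = finetune_memory_gb(PARAMS, 16 + 16 + 64)  # weight + gradient + Adam state
lora = finetune_memory_gb(PARAMS, 17.6)          # 16-bit frozen weights + adapters
qlora = finetune_memory_gb(PARAMS, 5.2)          # 4-bit frozen weights + adapters

for name, gb in [("full", full), ("LoRA", lora), ("QLoRA", qlora)]:
    print(f"{name}: {gb:.1f} GB")
```

The LoRA case works out to about 154 GB, which the talk rounds to 150.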
Now, the main challenge is we want to preserve the performance
while doing this 4-bit compression, and that is the main challenge.
So we have three innovations that improve the memory performance,
but then also the precision to reduce the quantization error.
There's one part,
paged optimizers, that I will not talk about. You can read about it in the paper. It's used to prevent
memory spikes during fine-tuning if you hit a large document during your fine-tuning run.
The main contribution that we have is the 4-bit NormalFloat data type. This is a data type
that's information-theoretically optimal, and so you can think about it like this. So in the beginning,
I showed you Int4 quantization, where the quantization bins have equal width. In the NormalFloat
data type, the bins have equal area. That means each slice has equal probability mass in the
normal distribution. And that means the same amount of values are quantized into each bin. With
that, each bin has an equal amount of values and it's information-theoretically optimal.
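The equal-area idea can be sketched with the standard library. This is a simplified illustration, not the published NF4 construction: it takes the quantile at the center of each equal-mass slice of a standard normal and rescales to the range minus 1 to 1, whereas the actual data type differs in details (for example, it represents zero exactly). The function name is hypothetical.

```python
from statistics import NormalDist


def equal_area_codebook(bits=4):
    """Codebook whose bins each hold equal probability mass under N(0, 1).

    Simplified: uses the quantile at the center of each equal-mass slice,
    then rescales so the largest magnitude is 1.
    """
    n = 2 ** bits
    normal = NormalDist()  # standard normal, mean 0 and sigma 1
    quantiles = [normal.inv_cdf((k + 0.5) / n) for k in range(n)]
    scale = max(abs(q) for q in quantiles)
    return [q / scale for q in quantiles]


codebook = equal_area_codebook()
```

The resulting 16 values are packed densely around zero, where most normally distributed weights lie, and spread out toward the tails, in contrast to the equal-width bins of Int4.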
And our second contribution is a little bit silly. It's double quantization. We do a quantization
of the quantization. And so what does that look like? So in the normal quantization, we take
the weight, quantize it, and now we get two pieces. The quantized weights, and then the absolute
maximum constants. We have multiple constants because we slice the weight into blocks, and each block
has its own constant. And so we get a matrix of constants. On average, these are 0.5 bits,
and that's multiple gigabytes of GPU memory. And now we quantize those constants again. We save
about 0.4 bits on average, and that is important if we want to fit large models into
consumer GPUs because otherwise they don't quite fit.
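The 0.5-bit overhead and the roughly 0.4-bit saving can be reproduced with simple arithmetic. The block sizes and constant widths below are assumptions recalled from the QLoRA paper rather than stated in the talk: blocks of 64 weights with a 32-bit constant each, and for double quantization 8-bit constants plus one 32-bit constant per 256 first-level constants. The function name is hypothetical.

```python
def constant_overhead_bits(block_size, constant_bits):
    """Extra bits per parameter spent on one constant per block."""
    return constant_bits / block_size


# Assumed settings (recalled from the QLoRA paper, not given in the talk).
single = constant_overhead_bits(64, 32)
double = constant_overhead_bits(64, 8) + constant_overhead_bits(64 * 256, 32)
saved = single - double
```

Under these assumptions the overhead drops from 0.5 bits to about 0.127 bits per parameter, a saving of about 0.37 bits.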
And so these are the contributions.
Now let's look at the results.
So the main thing that we want is to replicate 16 bit performance.
That was our main goal.
And so what I have here is different LLaMA models of different sizes.
And we fine-tune on the FLAN v2 instruction data set.
We evaluate on MMLU accuracy.
We have, in pink, the 16-bit baseline in brain float
16. And what we see now is that the regular float data type, 4-bit
float, in blue, doesn't quite replicate 16-bit performance. However, if you use our
NormalFloat data type, we get up to 16-bit performance. And so with that
we have now replicated 16-bit performance. In our paper, we have many more
experiments that also have the same finding. But with that now we are at the
stage where we can very efficiently fine-tune very large
language models with very little resources.
And so now we go a step further and ask,
can we build a high quality chatbot
now that we very quickly can explore all possibilities
with cheap fine tuning.
And so through our experiments, we run over 1,000 experiments,
we find a very good data set and build a chatbot
called Guanaco, which is a 4-bit chatbot
we created by just fine-tuning on a single consumer GPU
for 24 hours.
And now we want to compare how good such
a chatbot is compared to other chatbots that are trained or fine-tuned in 16-bit.
And so we have a tournament-style setup where we have 80 different prompts from the
Vicuna dataset, and we give this prompt to two random chatbots, and then they compete to
generate the best response. Each chatbot generates a response, and then the responses are
judged by humans or GPT-4, and either humans or GPT-4
say which response is better. This constitutes a game, and so we play multiple games
of many random allocations of chatbots, and with that we can determine which
chatbot is better than another chatbot. If we do this setup, then we find
that humans think our chatbot on these Vicuna prompts is a little bit better
than ChatGPT. If we ask GPT-4, it says it's about the same quality as ChatGPT.
This doesn't mean that our bot is as good as ChatGPT, but for these
particular prompts it is about the same quality. On the right is also a demo. You can
scan it and try our chatbot. And that's everything that I have. So just to conclude,
QLoRA makes fine-tuning 18 times cheaper. With the 4-bit NormalFloat, we can
replicate 16-bit fine-tuning performance, and we have also shown that you can create
very high-quality chatbots with QLoRA. So with all of that, it's very simple to now create
high-quality fine-tuned models, and it's so cheap that everybody has access to
fine-tuning these large models.
QLoRA is available in the bitsandbytes library, and it's also integrated in the Hugging
Face Transformers stack, and so there you can very easily use it.
I'm also on the academic job market, so please get in touch if you're interested.
Later this week, I will also give a talk on the making of QLoRA at the workshop, so stay tuned
on Twitter for more information about that.
And that's what I have, and I'm happy to take questions.
Thank you so much.
So we're going to make a bit of a hard pivot now from the world of optimization, fine-tuning, and training methods,
into the world of multimodality, which is another big theme of this year and probably every year to come.
Every previous paper we've covered on the pod up to this point I've heard of online and, you know,
it's relatively well-known.
You didn't actually need to meet the people to hear about them.
But one of the joys of coming to a conference like NeurIPS is finding things
that you may not have seen just because of your filter bubble or just because there's just
way too many things out there and you didn't have the time to look into them. And this was
definitely true for me for DataComp, which I never heard of, but also a very legitimate effort.
And I actually had to chat with them after their talk. But first, let's introduce what DataComp is.
My name is Samir, and this is Gabriel Ilharco, and this is Alex Fang. And today we're going to be
presenting our work, DataComp: In Search of the Next Generation of Multimodal Datasets.
And this paper was really made possible by a whole team of people, and so we're very lucky and fortunate to be able to share it on behalf of the whole team.
Okay, so we want to start with a little bit of a history of computer vision models.
So in this kind of traditional paradigm of image classification, what we would do is we would create a specialized data set.
We'll call that a traditional supervised data set with certain class labels.
for example, 10 different labels for the MNIST data set,
and then we would train these fixed models on these kinds of data sets.
And this was really cool because it led to all kinds of architectural improvements.
You can think ResNets, skip connections, applications of attention.
But when you needed to add an additional task, say ImageNet-1K,
you had to kind of create a new dataset with a new set of labels,
and this was kind of a laborious process.
But then right around 2021, something really cool happened.
The paradigm switched a little bit to these kind of image-text data sets that allowed training these open-vocabulary models.
And suddenly we could do things like train a unified model that could then downstream do arbitrary image classification tasks.
And this sort of data set transition is really kind of the
takeaway here. So in spite of this kind of transition between
datasets, the standard machine learning pipeline actually stayed relatively
consistent. So what we're still going to do is create a monolithic
artifact, a data set, keep that fixed, and then iterate on model
training on that data set. And this is still like a really cool recipe
and it's led to progress in downstream evaluations. But what we really
ask in DataComp, and the center of our paper, is how much performance are we
actually leaving on the table by adopting the standard ML pipeline? Can we
actually improve models by iterating on data sets instead of on model
architectures? And so fundamentally DataComp is a benchmark for data set
development to help the community understand how data set decisions improve
models. So specifically we're going to look at this
CLIP training regime for these more modern image-text data sets, which are popular nowadays.
And so we want to give just a brief overview of CLIP so that we're all kind of on the same
page.
We roughly have a text encoder and an image encoder, and we're going to train these
encoders from scratch contrastively in order to align image and text representations.
And then downstream, if we have a new classification task, we're
going to do things like write sentences, a photo of a plane, a photo of a car, etc., and then query
an image feature against all of these text features to retrieve our class label.
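The zero-shot step described here amounts to a nearest-neighbor lookup by cosine similarity. A minimal sketch with toy 3-dimensional vectors standing in for real encoder outputs; the function names and all feature values are hypothetical, and a real CLIP model would produce the features.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_feature, text_features):
    """Return the prompt whose text feature is most similar to the image."""
    return max(text_features,
               key=lambda prompt: cosine(image_feature, text_features[prompt]))


# Toy features; in practice these come from the text and image encoders.
text_features = {
    "a photo of a plane": [0.9, 0.1, 0.0],
    "a photo of a car": [0.1, 0.9, 0.1],
}
label = zero_shot_classify([0.8, 0.2, 0.1], text_features)
```

Adding a new class only requires writing another sentence and encoding it, which is why no new labeled data set is needed per task.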
So kind of recentering things back to DataComp now, the picture I think that we should all
have in mind is we're actually going to fix this CLIP bit, which is this middle training diagram,
and we're going to iterate on the data selection process to create new data sets to train
our CLIP models. And now I'm going to hand it over to Alex. So the DataComp workflow
consists of five steps. Choosing a scale, selecting data, training a model, evaluating, and
submitting the results. And the first step is choosing the scale, which roughly reflects the
amount of compute used. So DataComp has four scales. At the small scale, we train a
ViT-B/32 for 12.8 million samples seen, which is equivalent to fine-tuning a model on ImageNet-1K.
At the medium scale, we train a ViT-B/32 for 128 million samples seen, which is equivalent to training a model from scratch on ImageNet-1K.
At large, we train a ViT-B/16 for 1.28 billion samples seen, which is equivalent to training an ImageNet-21K model from scratch.
And at extra large, we train for 12.8 billion samples seen on a ViT-L/14, which is equivalent to training an OpenAI CLIP model.
One key design decision is that there is no constraint on data set size.
We build our scale configurations around samples seen because practically speaking, the key
constraints are pool size and compute.
This means each data point in a data set of 6.4 million samples at the small scale is seen
twice.
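Because the budget is fixed in samples seen rather than in data set size, the number of passes over a filtered data set follows directly. A small sketch using the four budgets from the talk; the dictionary keys and function name are hypothetical.

```python
# Samples seen per scale, as stated in the talk.
SAMPLES_SEEN = {
    "small": 12_800_000,
    "medium": 128_000_000,
    "large": 1_280_000_000,
    "xlarge": 12_800_000_000,
}


def passes_over_dataset(scale, dataset_size):
    """How many times each sample is seen on average at a given scale."""
    return SAMPLES_SEEN[scale] / dataset_size


passes = passes_over_dataset("small", 6_400_000)
```

A more aggressive filter therefore trades a smaller data set for more repetitions of each remaining sample, at the same training cost.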
At the chosen scale, participants can then use their data selection method on either a fixed
provided pool of raw data or are free to bring in additional data.
So in the first option, which is the filtering track, participants filter from a provided
raw pool equivalent in size to the samples seen at the chosen scale.
Our pool, which we call CommonPool, comes from Common Crawl, and then we do minimal pre-processing
such as near-duplicate checking against evaluation sets and not-safe-for-work filtering.
Additionally, we provide metadata to help with potential filtering approaches.
This metadata includes original width and height, caption, a checksum, CLIP features, CLIP scores,
and face bounding boxes for automatic blurring to help with privacy concerns.
The second option is the Bring Your Own Data Track.
This allows participants to use additional data sources,
as well as both edit and generate images and captions from CommonPool.
We hope this track supports participants whose creative approaches
do not fit neatly into the filtering track,
while also maintaining fair comparison within the filtering track.
Next, participants use a fixed training procedure to train a model on their newly filtered data.
For training, we adopt fixed training recipes, including hyperparameters for CLIP training, and this was based on prior experience.
Notably, DataComp participants are not allowed to modify these parameters, therefore focusing investigation on data set selection.
In the paper, we show that better data sets are largely consistent across variations in training recipes.
Once models are trained, they are evaluated using our provided script.
Our evaluation suite contains 38 downstream tasks, which is a new one of the data,
which include ImageNet and variants, a subset of VTAB,
a subset of WILDS distribution shifts, fairness benchmarks,
and retrieval benchmarks.
And the evaluations are done in a zero-shot manner
to remove the need for fine-tuning
on each individual downstream task.
And the last step of the process is to submit your results.
We provide an online leaderboard that participants can submit to,
which we hope promotes participation and collaboration.
We believe that many of these individual data filtering approaches
stack and when combined will lead to better results.
Next, I'll hand it over to Gabriel to talk about baselines and some new results.
All right, so let's talk about experiments now.
We study many baselines in our paper, but I'll focus on the two most interesting ones in interest of time.
The first one is what we call CLIP score filtering.
The idea behind CLIP score filtering is simple.
We use a pre-trained CLIP model to compute cosine similarity scores for all image-text pairs in our dataset.
In this plot, you can see a distribution of these scores in our dataset.
We then choose a threshold for the similarity, for example, corresponding to the top 30%
scores in our unfiltered pool.
We then remove all samples that have similarity smaller than this threshold, keeping only the
samples with high score as a proxy for discarding all samples that we think have low quality.
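A minimal sketch of the thresholding step described above, assuming the image and text embeddings have already been computed by some CLIP model and are non-zero vectors. The function names and the toy embeddings are hypothetical, and this is not the DataComp codebase.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_score_filter(image_embs, text_embs, keep_fraction=0.3):
    """Return indices of pairs whose similarity is in the top keep_fraction."""
    scores = [cosine_similarity(i, t) for i, t in zip(image_embs, text_embs)]
    n_keep = max(1, int(len(scores) * keep_fraction))
    # The threshold is the score of the n_keep-th best pair.
    threshold = sorted(scores, reverse=True)[n_keep - 1]
    # Ties at the threshold are all kept, so slightly more than n_keep may survive.
    return [idx for idx, s in enumerate(scores) if s >= threshold]


# Toy 2-d "embeddings" (made up): pairs 0 and 2 are the best aligned.
images = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
texts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
print(clip_score_filter(images, texts, keep_fraction=0.5))
```

At pool scale the scores would be computed in batches and the threshold taken from a quantile, but the keep-or-discard rule is the same.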
Another filtering baseline is what we call image-based filtering.
For image-based filtering, we again use a trained CLIP model, but this time only to extract
image features.
We then cluster these image features and find clusters that match images on ImageNet.
We keep all clusters that are assigned to at least one ImageNet image.
We then discard all the other clusters.
Note that this filtering is purely based on image features, and we do not use any labels
or captions for this filtering strategy.
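The image-based filter can be sketched as follows, assuming the clustering has already been run and we are handed the cluster centroids. The helper names and toy vectors are hypothetical; the reference embeddings stand in for ImageNet image features.

```python
def nearest_centroid(vec, centroids):
    """Index of the centroid with the smallest squared Euclidean distance to vec."""
    def sq_dist(c):
        return sum((v - x) ** 2 for v, x in zip(vec, c))
    return min(range(len(centroids)), key=lambda k: sq_dist(centroids[k]))


def image_based_filter(pool_embs, reference_embs, centroids):
    """Keep pool samples that fall in a cluster matched by a reference image."""
    # Clusters that at least one reference (e.g. ImageNet) image is assigned to.
    kept_clusters = {nearest_centroid(r, centroids) for r in reference_embs}
    return [i for i, e in enumerate(pool_embs)
            if nearest_centroid(e, centroids) in kept_clusters]


# Toy example: three clusters, reference images land in clusters 0 and 1 only.
centroids = [[0.0, 0.0], [10.0, 10.0], [-10.0, -10.0]]
reference = [[1.0, 1.0], [9.0, 9.0]]
pool = [[0.5, 0.0], [11.0, 10.0], [-9.0, -9.0], [10.0, 9.0]]
print(image_based_filter(pool, reference, centroids))
```

Only image features are used here, which matches the point that no labels or captions enter this filter.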
Our best-performing baseline is built by intersecting the two baselines I just described:
CLIP score filtering and image-based filtering.
When you apply this technique to our larger pool,
we find a data set with 1.4 billion samples
that we call DataComp-1B.
So let's see how well this works in practice.
We conducted over 300 pre-training experiments
with many different strategies for filtering our pool.
Our best data set is DataComp-1B,
a 1.4 billion sample subset of our pool,
that leads to much higher accuracy than existing data
sets, including OpenAI's
WIT and LAION-2B. This is the first public data set that outperforms OpenAI. Also note that all
these models are compute matched, so these gains come at no extra cost at training time. One key
finding from our work is that smaller, more aggressively filtered datasets can perform better than
larger datasets coming from the same pool. As you can see on the plot, when we selected samples
that have the highest cosine similarity according to a trained CLIP model, there's a sweet spot
for the size of the data set that we keep,
around 30% of the original pool.
This means that you're better off
using a smaller subset of the pool
instead of using more, noisier data.
Interestingly, this doesn't happen
when you sample randomly from the pool,
as you can see from the dotted line.
So you can get away with smaller data sets,
but you do need to be a bit more careful
on how you are selecting samples.
Another key finding from our experiments
is that the ranking of different filtering strategies
is relatively stable across scales,
as you can see in these scatter plots.
These plots show how performance on the small scale
correlates to performance on the medium scale.
And while it's not a perfect correlation,
these plots show that there is hope for doing research
at smaller scales, since there's a good chance
that findings will generalize to larger scales.
And in fact, this is exactly how we proceeded
during our experiments, by first testing things out
at smaller scales and only scaling up
the most promising results.
This saves us a lot of compute during our experiments.
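One way to quantify "the ranking is relatively stable across scales" is a rank correlation between the accuracy of each filtering strategy at two scales. The sketch below uses Spearman's formula without tie handling; the accuracy numbers are made-up placeholders, not results from the paper.

```python
def ranks(values):
    """Rank positions (0 = smallest); assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for two equal-length lists without ties."""
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical accuracies of four filtering strategies at two scales.
small_scale = [0.10, 0.15, 0.12, 0.20]
medium_scale = [0.30, 0.42, 0.38, 0.50]
print(spearman(small_scale, medium_scale))
```

A value near 1 means the strategies that win at the small scale also tend to win at the larger one, which is what justifies prototyping cheaply before scaling up.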
There's much more in the paper,
as you can see in these slides, and if you're interested, definitely check it out. We are
very happy to answer any questions and talk more about any of these topics at our
poster. Since we released the paper, there's been a lot of activity in DataComp.
The fun thing is that our best-performing baseline, which we thought was pretty
decent, was blown out of the water by the community since, and it's just really
nice to see that happening in real time. One example is Data Filtering Networks,
or DFN for short, where the main idea is similar to CLIP score filtering,
but with a deeper dive into what makes a good model for data filtering.
And careful data creation has led to what now are the best CLIP models,
even outside DataComp, with an impressive 84.4% zero-shot accuracy on ImageNet
using a ViT-H/14.
The central takeaway I'd like to leave with you today
is that careful experimentation with datasets can really pay off
and can lead to very large improvements in performance on downstream models.
So instead of blindly scaling models up, I think we as a community should start paying more attention to how we design datasets.
DataComp is designed to facilitate research in that direction.
It's amazing to see what people are already building with it, and I'm super excited to see what comes next.
Finally, I'd like to reiterate that our benchmark is designed to encourage everyone to participate, even if you only have a couple of GPUs under your desk.
So if any of this sounds interesting at all to you, feel free to check out our resources, including our
website, codebase, and paper.
Everything we do is fully open source.
And we hope these resources are useful for the community.
Thank you very much.
So I quite enjoyed that presentation,
and obviously, this being an image-heavy
and multimodal type of paper,
you should probably check out the images
and the competition at datacomp.ai.
But I did manage to catch up with them
at their poster session and ask them more questions.
It turns out there's some intellectual lineage
from LAION with LAION-5B,
and I do think that this has a strong chance
to become the new ImageNet.
So let's give them a listen.
Oh, fun fact, they were also wearing DataComp T-shirts.
Like, most people, when they present their poster sessions,
they're in kind of just like somewhat semi-formal attire.
These guys, they make custom t-shirts for their posters.
So you know how they're serious.
My name is Samir.
I'm a fourth-year PhD student at Columbia.
Yeah.
And I started working on DataComp, like, I guess,
around November of last year.
I collaborated with a lot of the folks
that are already on the paper on previous projects,
like Mitchell Wortsman, Ludwig, Vaishaal,
and they kind of just kind of roped me in.
They were looking for hands to help out with different tasks.
And then through the course of time,
my involvement just kind of grew because I got really excited about it.
How did this become such a big effort?
You guys are wearing T-shirts.
Yeah.
This is not normal.
Yeah, yeah.
Yeah, so we really took this project very seriously
because we wanted the benchmark to be really good and thorough.
And because of that, we were working at kind of a scale that was kind of unprecedented for academics.
We generated the pool of like 12.8 billion image text pairs.
We wanted evaluations to be very thorough, many, many downstream tasks.
And that just took a lot of people to commit to the project.
And how do people find out about something like this?
You're not from the same university.
Is there a community somewhere
that you all just gather and coordinate?
Yeah, yes. So Ludwig, who's kind of the last author on this paper,
is kind of networked all around.
He's very friendly and very open to collaboration.
And I think because of him, many people from many different universities,
corporations were able to join.
Yeah.
And this is separate from the LAION group.
Yeah, so Ludwig is affiliated with LAION.
Because I've seen his name around.
Yeah, yeah.
But most of the people are not necessarily part of LAION, but we all kind of know each other and collaborate.
I mean, wouldn't it be better to make this, like, LAION-1.4B?
Yeah, so we maybe could have done that.
I think... Sorry, LAION-12.8B, right?
LAION has 5B?
Yeah, LAION has a 5B and like a 2B subset that people train on a lot.
Yeah, we could have done that.
I think we were thinking about things more
from the standpoint of like a benchmark
and that was really our focus.
While this 1B data set that came out of the benchmark is an artifact,
we kind of wanted to place emphasis on the benchmark itself.
So just to comment on that, the idea initially was a dataset,
but then we thought also about benchmarking,
but then we thought about the real thing is the community.
So we thought that the way that you can actually build a community
is by opening up the tooling.
So in dataset curation, usually you work super hard,
and then at the end of the day you make a dataset, you release the dataset and you're done.
But the tools that you developed to actually clean up the dataset, filter the dataset, benchmark the dataset,
these are often more valuable for other people who want to do the same job.
So the central idea was to make a community and to open source the tools in addition to the dataset,
and then allow other people to try different tooling methods.
so going this data-centric AI direction.
So that was kind of one of the central ideas around DataComp.
So DataComp is really about building community around dataset curation.
This is the first time I've seen CLIP score filtering applied like this.
And you also mentioned at the end of your oral presentation that there were other methods.
What filtering methods are you seeing that are working really well?
It's a whole community of people who are trying a gazillion different tricks.
And that's the whole point, right?
One remarkable thing to point out is that if you pick a benchmark, you will see performance changing across different benchmarks,
but we see a surprising correlation of ImageNet zero-shot to a gazillion other benchmarks.
So we have 38 benchmarks, and we see that if you do well, basically, zero-shot on ImageNet,
that's very, very correlated with how good your model is across the board for retrieval, for all kinds of very useful things.
And the community is developing a gazillion different types of data curation methods.
That's why we have a leaderboard and we're building a community.
This is not like a paper and we're done.
You could write like 20 projects of different data set curation.
It's more like a platform for data set curation evaluations.
Do you remember other methods that are doing well?
Yeah, yeah.
So that's a great overview.
And yeah, like I think specifically people have been looking into designing filtering networks.
So rather than using CLIP to be the filtering network, like what are some other data sets
that we might train on in order to create these filtering networks.
What are the differences between a good CLIP model and a good filtering model?
So these are all kind of open questions that, you know, as Alex was saying,
the community will answer by trying a bunch of tricks and methods.
Yeah, so you can train like a new CLIP, but you can also train a new Stable Diffusion from this,
which I'm sure Stability is interested in,
unless you're working on your own sort of diffusion model.
Yeah, so the problem is that compute
to train multiple Stable Diffusions is needed.
But yeah, we're definitely interested in that direction,
and we're definitely thinking about that.
But you could basically include quality of a Stable Diffusion
as a benchmark and evaluate how you would select a subset
of data to improve on that benchmark.
You might want to talk to Eleuther.
I've been talking to Stella Biderman from Eleuther.
She's around here.
She'll come by.
Cool.
Any other future directions
that you're very excited about?
Yeah, we're actually really excited about just the concept of DataComp at a high level.
So right now we're pretty excited about NLP and what a DataComp-like effort would look like in that space.
You could extend this approach to audio, potentially video, although like video is tricky for me just because it's so data heavy.
And there's a lot of orders of magnitude of different dimensions they could go to.
So I don't know what that might look like.
Let me tell you about that.
Yeah, yeah.
So one idea is you can make a DataComp for MRI images
or a DataComp for this, a DataComp for this, a DataComp for others.
What's the idea, what does that mean?
It means you fix the model.
So classical machine learning, as much mentioned in the talk,
classical machine learning says, here's a data set, ImageNet,
build a million models and tell me what's the best one, right?
Now, the DataComp idea flips this on its head, right?
It says, here is a big pool of data.
The model is fixed.
You only select a subset of the pool.
So the thing you're selecting is which images to keep in the pool.
Then the model is fixed.
But you're training other machine learning models to select what to keep.
And that's very powerful.
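The flipped setup described here can be sketched as a harness where the training and evaluation functions are fixed and the only thing a participant supplies is the selection function. Everything below is a toy stand-in with hypothetical names and made-up quality values; a real run would train and evaluate a CLIP model.

```python
def run_benchmark(pool, select_indices, train_fixed, evaluate):
    """Only select_indices comes from the participant; the rest is fixed."""
    subset = [pool[i] for i in select_indices(pool)]
    return evaluate(train_fixed(subset))


# Toy stand-in: the "model" is just the mean quality of its training data.
def train_fixed(subset):
    return sum(subset) / len(subset)


# Toy stand-in: the "evaluation" reports the model value, rounded.
def evaluate(model):
    return round(model, 3)


pool = [0.1, 0.9, 0.4, 0.8, 0.2]  # made-up per-sample quality values


def keep_all(p):
    return range(len(p))


def keep_good(p):
    return [i for i, quality in enumerate(p) if quality >= 0.4]


print(run_benchmark(pool, keep_all, train_fixed, evaluate))
print(run_benchmark(pool, keep_good, train_fixed, evaluate))
```

In practice `select_indices` would itself be a learned model, which is the sense in which data filtering becomes a research problem rather than manual cleanup.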
So that was in classical, you know, AI, if you're doing the data cleaning, the data filtering,
that's like the shittiest job.
That's like, what?
But we're trying to make that a first-class citizen
and trying to tell you that it's worth to do research
because it's not that you will manually sit down
and select images from 5 billion or 13 billion.
You will be building models that do that.
So you can do a DataComp for X,
and we're seeing that from the community.
Yeah.
Curious how you became involved with DataComp.
Yeah.
So we have this NSF Institute.
It's called IFML,
the Institute for the Foundations of Machine Learning.
and Ludwig is part of our institute.
So we were having lunch and we were discussing
LAION, right, and how to make a better LAION.
And we said, okay, instead of just making a better LAION,
which is what we started with, let's make it a community where we open the tools.
So everybody can make a better LAION.
So that was the central idea, yeah.
What happened to the original LAION?
LAION is still a great data set that's still public,
but this is basically building the next generation.
Yeah, yeah, very cool.
I wish you good luck.
I think this is really foundational work.
It's basically the new ImageNet, right?
Yeah, yeah.
That's 10 years after the original AlexNet moment.
By the way, that second speaker who was not introduced
was Alex Dimakis, who was a professor at UT Austin,
who just jumped in and chatted.
And I do find that a very charming element of NeurIPS
is that it's effectively a coming-out party
slash hiring party where all the grad students
publish their papers.
They all have sponsors in more senior researchers and professors
as the secondary or tertiary authors,
but their name gets first because then they get all the credit and the citations,
and the people who are more senior just kind of stand there and support them,
and Alex definitely jumped in and supported them.
Just like I saw a bunch of other senior authors supporting their grad students
and directing questions to their grad students,
because their reputations are already secure, they have jobs,
they're just here to help their interns and grad students.
There's a very interesting tension between effectively datasets papers and models papers.
The datasets people think,
that their work is more long-lasting,
and the models people think that data sets work is done.
And I think you just need both.
So that's my awkward transition from DataComp into LLaVA,
which is probably the single most interesting visual language model this year.
As much as people are in love with GPT-4 Vision,
it's not open source,
and we don't really honestly know very much about it,
but LLaVA is open and trainable with a whole bunch of open-source models.
And I think LLaVA and DataComp together will provide some kind of template for the next generation of multimodal models to form.
So let's check out LLaVA.
I'm Haotian, a final-year PhD student at UW-Madison, and I'm on the job market.
Today, I'm presenting visual instruction tuning, a joint work with Chunyuan, Qingyang, and my advisor, Yong Jae.
So as in background, we, as humans, we can see and reason about the visual world, express and
interact with natural language. Doctors read the CT scans and explain their findings to their
patients, teachers teach students with conversations, and we share our life and findings on social
media and interact with others. It will be great if we can have a visual intelligent assistant
that can reason about the visual world and reflect with language. The closest work along
this direction are image-to-text generation models, where the model takes in the image as an input and
outputs text reflecting its understanding. Models like GIT, BLIP-2, and Flamingo have basic
visual reasoning capability, while they generally lack the ability to follow very complex
instructions or engage in very long conversations. Back in March, OpenAI demonstrated GPT-4
Vision with strong visual reasoning capability. For example, given such an image and the user
request "what's unusual about this image," GPT-4 Vision is able to reason beyond just
visual facts; it's able to figure out that the unusual thing is actually that the man is ironing
clothes while standing on the back of a taxi.
It's great, but it was not accessible until very recently, and there's no disclosure
on how it works.
So if we are able to create an open source model with similar level of visual reasoning
capability, it will be great as it allows us to have a deeper understanding of how the models
behave, and we can have a joint effort from the whole community to make it better.
So, as a starting point, how can we create such multimodal models that can actually follow
human's intent?
In NLP, researchers find that instruction tuning allows the model to learn to follow instructions
by fine-tuning the model on a small set of instruction and answer pairs, like explaining
human behavior or movie recommendation. Creating such instructions just by having
humans write them is very costly, and Self-Instruct proposes to use teacher models like ChatGPT
to create such instructions by expanding a small set of seed
instruction-output pairs to million scale using in-context learning. It's
affordable and has been used to create open-source language models like Alpaca,
based on the base LLaMA model. So now the question is, how can we create visual
instruction-following models? Let's start with this basic architecture.
Where we have an image, first we have a visual encoder, which can encode it
into visual features, and a cross-modal connector that can bridge it to the
language decoder. The language decoder also takes in the user instructions,
performs the reasoning, and outputs its understanding as text.
So the key is, how do we train this model for following multimodal instructions and how do we obtain
such data?
The straightforward way would be to use Self-Instruct: let's find a multimodal teacher
and let it expand.
However, if we take a look at the existing teacher models that were used, they're all
text-only, and there were no powerful multimodal teachers.
So in our paper, we propose to leverage a text-only GPT,
and we provide image context in textual format to GPT so that it can understand.
For example here, we have an image and we can use the COCO caption annotations, so we
can have an image-level context which describes what's happening in the image.
We can also have the bounding box and object category annotations from COCO, so that we are
able to get region-level context, which provides even more details that may not be captured
in the captions.
So let's take a closer look at our text-only data
engine. We have two parts of the input. First are in-context examples, which are
the exemplars with which we guide ChatGPT on how it should generate the visual
instructions. So we have example images. We convert them into the image context
in the textual format that we just described. We write the instructions and answers
about the visual content in the image. These are the examples for ChatGPT
to learn from. And then we do the actual inference: for any image in the
COCO training data set, we're able to convert it into textual format using the COCO annotations,
and ChatGPT will just learn to generate instructions and answers about that
image context. We gather the instructions, answers, and also the image to create visual
instruction-following data, which is a triplet of image, instruction, and answer. To better facilitate learning,
we create three types of responses. First is conversation, to facilitate multi-turn engagement.
Detailed description, to train the model to focus on visual details.
And complex reasoning, to allow the model to reason beyond the visual facts.
For example, for the question "what challenges do people face?",
the model not only needs to figure out that there is
luggage, there are bags, there are SUVs,
it also needs to figure out that the challenge is that they may not be able to fit
all the luggage in the back of the SUV.
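The text-only data engine described above can be sketched as two small helpers: one that renders captions and bounding boxes as plain text, and one that assembles the in-context examples and the query context into a prompt for a text-only teacher. The exact prompt layout, the field labels, and the function names are assumptions for illustration, not the format used by the LLaVA authors.

```python
def image_to_text_context(captions, boxes):
    """Render captions and (category, x1, y1, x2, y2) boxes as plain text."""
    lines = list(captions)
    for category, x1, y1, x2, y2 in boxes:
        lines.append(f"{category}: [{x1:.3f}, {y1:.3f}, {x2:.3f}, {y2:.3f}]")
    return "\n".join(lines)


def build_prompt(task_description, in_context_examples, query_context):
    """Assemble a few-shot prompt; the teacher completes the final response."""
    parts = [task_description]
    for context, response in in_context_examples:
        parts.append(f"Context:\n{context}\nResponse:\n{response}")
    parts.append(f"Context:\n{query_context}\nResponse:")
    return "\n\n".join(parts)


# Hypothetical annotations for one image.
context = image_to_text_context(
    captions=["A man rides a horse on a beach."],
    boxes=[("person", 0.41, 0.20, 0.58, 0.71), ("horse", 0.30, 0.35, 0.75, 0.95)],
)
examples = [("A cat sleeps on a sofa.\ncat: [0.100, 0.400, 0.600, 0.900]",
             "Q: What is the cat doing?\nA: It is sleeping on a sofa.")]
print(build_prompt("Write a conversation about the image.", examples, context))
```

The teacher's completion, paired with the original image, would then form one image, instruction, answer triplet.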
So we create LLaVA-Instruct-158K and train LLaVA.
It's a model composed of these three simple components.
We use CLIP as the vision encoder, an instruction-tuned language model, Vicuna, as the language decoder,
and we use a linear layer for the projection.
And we find it works quite well because the CLIP visual features already carry great semantics,
and a single linear layer is sufficient to project them into a space that the language decoder can understand well.
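To make the role of the linear projection concrete, here is a minimal pure-Python sketch: each visual feature vector is mapped by y = xW + b into the language model's embedding width, and the projected visual tokens are placed alongside the text token embeddings. The dimensions, the weights, and the choice to put visual tokens first are assumptions for illustration.

```python
def project(visual_features, weight, bias):
    """Apply y = xW + b to each visual token (row) of visual_features."""
    columns = list(zip(*weight))  # weight has shape [d_vision][d_lm]
    return [
        [sum(x * w for x, w in zip(token, col)) + b
         for col, b in zip(columns, bias)]
        for token in visual_features
    ]


def build_decoder_input(visual_features, text_embeddings, weight, bias):
    """Projected visual tokens followed by text token embeddings."""
    return project(visual_features, weight, bias) + text_embeddings


# Toy sizes: d_vision = 2, d_lm = 3, one visual token and one text token.
W = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
b = [0.0, 0.0, 0.5]
visual = [[1.0, 2.0]]
text = [[9.0, 9.0, 9.0]]
print(build_decoder_input(visual, text, W, b))
```

The only new parameters in this sketch are W and b, which is why the projector is cheap to train on its own.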
For model training, we use a two-stage training pipeline, where in the first stage,
we pre-train the projector only for feature alignment so that the features are projected into a proper space.
And in stage two, we perform end-to-end visual instruction tuning on the generated visual instruction-following data set.
We train the projector, we train the language model, and if you have limited compute, you can try LoRA or QLoRA, or even just the projector only; it can give you decent visual chat performance.
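The two-stage recipe can be summarized as a table of which components receive gradient updates in each stage. This is a sketch with hypothetical names; keeping the vision encoder frozen in stage two is an assumption here, since the talk only says that the projector and the language model are trained.

```python
# True means the component is updated during that stage.
STAGES = {
    "stage1_feature_alignment": {
        "vision_encoder": False,
        "projector": True,
        "language_model": False,
    },
    "stage2_instruction_tuning": {
        "vision_encoder": False,  # assumption: stays frozen in stage two
        "projector": True,
        "language_model": True,
    },
}


def trainable_components(stage):
    """Names of the components updated in the given stage, sorted."""
    return sorted(name for name, train in STAGES[stage].items() if train)


for stage in STAGES:
    print(stage, trainable_components(stage))
```

The limited-compute options mentioned in the talk would change only the stage-two row, for example adapting the language model with LoRA or leaving it frozen and training the projector alone.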
After we train LLaVA, we found several interesting emerging properties.
Let's quickly revisit some of the data properties first.
So our visual instructions are
English only, with no human name annotation, and there's no explicit OCR data.
LLaVA can have strong visual reasoning capability as GPT-4 Vision does, where it is able to figure out that the
unusual thing is actually the man ironing clothes on the back of a minivan, and it's more visually
grounded than open-source baselines like BLIP-2 and OpenFlamingo.
It understands this humorously parodied Mona Lisa with a dog in the same pose, and it's definitely
out of distribution.
It also has a strong emerging OCR capability, where it can recognize NeurIPS
2023 from this presentation slide, and it correlates that with the pre-trained language knowledge
when asked about who will be interested in this.
It will relate and say that it's related to artificial intelligence and machine learning.
Although our visual instructions are just in English, it's able to perform the reasoning
and output text in Chinese and other foreign languages; for example, it recognizes the French
Quarter and gives a
brief description in Chinese here.
So how can we evaluate our large multimodal models?
We draw inspiration from NLP and slightly modify our data creation pipeline to use
a text-only GPT to do the evaluation, where we have the image context in textual format,
we have the user instruction, we have two model outputs, and we just feed all of them into
the text-only GPT and request feedback.
It will give you a score out of 10 for each of the assistants and also provide you an
explanation so that you can understand how the model behaves. We create a
challenging benchmark, LLaVA-Bench (In-the-Wild), which requires knowledge beyond the
training data, multilingual understanding, and also perception of subtle
details. We create a very detailed textual annotation of the image context for
those images, and we can feed them in for GPT evaluation, and it's not only just
for accuracy but also for hallucination. Since the introduction
of LLaVA, there has been great effort from the community, ranging from data, model, modality,
and expanding to different tasks, as well as developing benchmarks for us to better understand the model.
The developments after June are just too many to fit into the slide.
And we as the LLaVA team have also been pushing the effort to make it more accessible and expanding its capability
in terms of RLHF, tool use, as well as visual prompting.
With the improved version, LLaVA-1.5, with just simple modifications to the data and model,
we show that it achieves great performance on a range of 12 benchmarks, and it's
simple and efficient in that it requires less than 1% of the data that other
approaches use. We're able to train LLaVA-1.5 within one day on a single node.
Check out our workshop poster on Friday. So in conclusion, LLaVA can
reason about the visual world and reflect with natural language. Its design is simple and general,
and we show that it is possible to adapt the language model to multimodal effectively and efficiently,
in that we can train it within one day on a single node. Because its design is so simple,
we are compatible with almost all the optimizations that are designed for language models
for both training and deployment. And it's fully open source. And unfortunately due to the Wi-Fi network
issue, we are unable to do a live demo here, but
LLaVA is able to run on a MacBook Air, so I will still do a live demo here.
Let's see. This is an image that I took
yesterday here, and I will just say, like, "What's this
event, and is it popular?" It's the Tree of Thought presentation.
It's really popular; I could just barely stand in the back.
So it will just run for a while, and it says the event seems to be a
conference or presentation, there's a large number of people attending, and it appears to be
popular because lots of them are standing, including me. So I will say, like, "Okay, I
attended as well. It is NeurIPS 23 and the experience is great. Help me draft a tweet." And it will
think for a while. I hope it will be optimized further for a faster
prefilling stage. And: "Just attended NeurIPS 23. Amazing experience,
packed with knowledgeable speakers and attendees.
Learned so much and made valuable connections." True.
"And highly recommend for anyone interested in AI-related fields,"
and some hashtags. I hope you like it, and thank you so much.
Please come to our poster session at number 229.
Our demo, code, data, model, everything is open source.
We are so excited to talk with you more
about LLaVA, and thank you for coming.
If DataComp is an example of what a really good benchmarking data set paper looks like,
then I think LLaVA is an example of what really good kind of state-of-the-art research on visual
instruction tuning and visual language models looks like.
It definitely has inspired a bunch of copycats and derivative work in the open-source model space,
notably BakLLaVA, and I think there's just going to be a lot more work being done here.
Like we're just realizing that we can plug and play these models and train them together in
all sorts of ways, and LLaVA is definitely one of the more innovative solutions for that,
one that also simultaneously solves a whole bunch of issues with visual understanding.
Here's the poster session Q&A with Haotian.
Basically we are trying to create an architecture as simple as possible.
So we have a vision encoder just to encode those features,
a language model to perform the reasoning, and we use a projection layer, which is a linear layer that we find
does pretty well, to project the visual features to a latent space that the language decoder can understand.
And we believe this is because the visual features of CLIP already carry great semantics and are in a good, like, latent space.
So a single linear layer is sufficient for it to understand.
So is the language model GPT-4?
Oh, the language model is Vicuna, something open source.
Yeah.
Not GPT-4.
Right.
You can take that off the shelf, but you're training the linear layer.
Yeah, so it will be two stages.
In the first stage, we want to train the language model to understand those images.
So we train the projection layer only.
And this is our stage one.
The language model and the vision encoder are all frozen.
And in the stage two, we will train the model to follow those instructions.
So we train the language model and the projector.
To my knowledge, this is the first work that is adding the bounding boxes with the captions.
Any difficulty in having the language model understand all those things?
Our model does not need to understand bounding boxes.
Because what we provide to train our model is this
visual instruction-following data.
The model just needs to understand the image and give a proper answer when you give a user's instruction.
So this is not something our model needs to worry about.
Although we do find that the model is able to understand those bounding boxes well.
And the key point is, does GPT-4 understand those well, and does a text-only GPT-4 understand them well?
We find it to be true, because I also work on some image generation models,
and we have a work where we can control the image layout by just providing some bounding boxes.
What we did is that we gave GPT-4 a caption and we said, like, can you generate a reasonable layout for me?
And it's able to do that pretty well.
So we did not quantitatively evaluate how good it is at doing that, but it does understand those layouts pretty well.
And also, from the instructions it generates,
it does know which is on the left, which is on the right.
Yeah.
Did you have to qualitatively evaluate the output of the answers that GPT4 gave you?
Yeah, we actually did not quantitatively evaluate,
but we did, like, manually go through some of them
when we were developing the data engine,
because we do have some factors to consider in this.
So we can change the number of in-context examples.
We can change the way we write those reference instructions and answers.
We can also change the actual instructions we use to teach GPT what the task is.
So we did qualitatively iterate on how we designed the data engine.
And we find this process is really quite rewarding because we do, in this process,
understand how GPT thinks and what information we need to provide to GPT-4.
Yeah.
And then all the bounding boxes that you provide, these are kind of ground truth because you get them from COCO, right?
That's correct. Right, right, right, right.
So this actually ensures that those contexts are perfect, if the human annotators are perfect,
and the generated instruction answers are as good as possible.
I just want to ask about the training part.
So if I were to take LLaVA-1.5 and fine-tune it,
either full fine-tune or LoRA or whatever,
would you recommend also retraining the projector?
I guess it depends on your task.
Are you considering a different domain or?
So I want to build off the LLaVA-Plus stuff.
So I know that goes into using tools.
So it's a little out of scope for this project,
but maybe for both.
So let's say I want to take a different multimodal
instruction following data set.
Like for that part, would you recommend retraining?
Yeah, that's a good question.
So I would say that
if the image domain that you're going to work on,
say, medical images, is too different,
then I would recommend actually going with a different stage-one training
or even just doing everything from scratch if
you have a biomedical CLIP, right?
Because that may give you even more benefit.
But we do observe that if you pre-train with LLaVA's instructions,
you pre-train with that visual information.
Like, it learns to do some reasoning about the visual content,
and it may be crucial
for the visual understanding
on other domains.
So I guess there will be a trade-off, and there will be both pros and cons for training from another domain from scratch,
because you may lose the benefit that you get when pre-training on LLaVA on how to localize those objects.
So I guess you would need some more experimental evidence to make the proper decision.
So is it fair to say that unless the domain is super different like x-rays, maybe it's fine to just use it?
Yeah, I think it's totally fine.
And I guess it's better to use the instruction tuned version
because it has so much vision knowledge injected into it.
Okay, and then, sorry, one last question.
Of course.
For stage two, like, let's say I want to fine tune on my own thing.
Is the roughly 160K number of examples a good target to hit?
Like, do you have recommendations around how big that data set should be?
I guess it also depends on how different the task is
and also how bad the model is
on that task because I can give a brief example on one of the experiments we have done.
So there's a task where we train the model to generate Stable Diffusion prompts, for example.
Basically it's a kind of captioning in some style we want.
And because LLaVA is already able to understand those visual attributes and the content very well,
it's just a matter of reorganizing the style in which it responds.
So we find even 100 examples is sufficient.
Yeah, 100.
And we just use 100 examples and it does the work decently.
Yeah, it's just a matter of changing the style.
But if you're trying to do some very different reasoning tasks that LLaVA is not good at,
I guess you may need more.
I think 10K, generally speaking, is enough.
10K, generally speaking...
Yeah, yeah, yeah, I guess 10K, or if you want to make it safe,
maybe 50K at most. Yeah, of course. Thank you.
Yeah, thank you. To understand how important this vision encoder is, have you ever
tried to remove the encoder entirely, use the bounding boxes here as the input
of whatever language model you are using, and just do the same task?
So I guess the key point here is that if you want to remove the encoder completely and just use the
bounding boxes as the input, the first question is how you are going to
get those bounding boxes. The second is, what if the user asks you about the text?
Are you going to also have an OCR engine? And what if the user asks
about something else, for example the attributes? If you
think about it, having an end-to-end model makes it much easier and
much more generalizable to extend to different types of input and
user instructions.
And also, because now you need some other model to generate those bounding boxes,
tags, all of those things, I feel that it's good if we can have those models to enhance the capability.
But if you do have a model that is really trained with vision and really understands what's happening in the image,
it can better coordinate that information.
Yeah, so have you ever tried to unfreeze this vision encoder,
in the first or second stage?
Yes, we have tried to unfreeze the vision encoder, and we find it quite useful for some of the tasks but not for others.
So specifically, if it's just asking what's the attribute, what's the object, for those kinds of tasks it does not matter much.
But there are two kinds of tasks where unfreezing the vision encoder really matters.
One is where it's not necessarily about the semantics.
For example, I'm asking whether this line is straight, those kinds of tasks which require you to understand the low-level
details, where the low-level detail really matters.
That's one of them. And we also have another work,
ViP-LLaVA, where we try to train the model to understand visual prompts.
By visual prompts, we mean: can we just use some scribble to
circle some objects that we want to ask about, instead of necessarily trying to
describe them very clearly to make the model understand what we are curious about?
So for that, in order to correctly identify
those scribbles and those tiny lines, it requires you to somehow unfreeze the
vision encoder, or use some earlier layers which still
preserve that information.
I'm curious about the backstory behind this whole
thing, like how did you get started exploring multi-modality, and your
inspiration?
Our team
has been working on vision language.
Your team, is it a lab?
Is it a... his team, I see.
Chunyuan from Microsoft, and me and my advisor.
We have a collaborative effort on this.
We have a series of works on vision language.
And although I don't have tons of years of
experience in vision language, we do see that.
In March, we saw Vicuna,
which did make us very impressed by the performance it can have.
For the size.
Yeah, for its size and also for the open source.
And we believe that it's possible for us to create a visual reasoning model
that is purely open source with a similar level of capability.
And we believe that with open source, we are able to have a joint effort from the community
to make it much, much better.
Yeah, it was cheap, right?
You trained for like eight hours.
Yeah, eight hours. For LLaVA 1.5, one day on a single node.
That means everyone else can do it too.
Yes, not everyone else, but most people.
Thank you very much.
So, super interesting and notable work on the LLaVA model.
I guess someone should try to hire him.
But I guess the next segment we're going to explore is the "prompting" segment,
and there are a surprising number of prompting papers here.
I'm not sure that many prompting papers should be represented at NeurIPS, but where else are they
going to present? I don't really know. But anyway, there was a whole channel or track
just for chain of thought. That blows my mind. And I do think that that is appropriate.
And I do think that the techniques here are innovative. It's impossible to cover all of them.
I actually talked to Noah Shinn from Reflexion, remember Reflexion, as well as a whole bunch of
others. But probably the most representative one was the Tree of Thought paper. So here it is.
My name is Shunyu. I'm from Princeton. I'm very excited to talk about
Tree of Thoughts. It's a joint work with my colleagues from Princeton and Google.
So we all know language models and large language models. Language models were invented to
generate text, token by token, left to right. But now they are used to solve an increasingly
wide range of problems using scaled-up models and prompting techniques like chain of thought.
So here is an example. You can break down a complex calculation into steps, and it will let the model
solve problems that it cannot solve in one step.
So the question is, can those language models one day
become a general problem solver by continuing to scale up
and using autoregressive inference,
or are there some fundamental limitations?
So to answer the question, let's take a look
at a very simple example.
This is the game of 24, where the rule is you are given four numbers,
and you have plus, minus, divide, and multiply operations,
and you need to combine those four numbers to obtain 24.
So one example is if you are given the input 2, 9, 10, and 12, what you can do is you can
first multiply 12 and 2 to get 24, then 10 minus 9 to get 1, then 24 times 1 to get 24.
Okay, so it's not a really hard game.
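For readers who want to check the arithmetic, here is a minimal brute-force sketch of the game of 24. It is not part of the talk: `solve24` is a hypothetical helper name, and it uses exact fractions so that division does not introduce rounding errors.

```python
from fractions import Fraction
from itertools import combinations


def solve24(nums, target=24):
    """Return one expression string that reaches the target, or None."""
    items = [(Fraction(n), str(n)) for n in nums]
    return _search(items, Fraction(target))


def _search(items, target):
    # Base case: one number left, check whether it is the target.
    if len(items) == 1:
        return items[0][1] if items[0][0] == target else None
    # Pick any two numbers, combine them, and recurse on the smaller list.
    for i, j in combinations(range(len(items)), 2):
        (a, ea), (b, eb) = items[i], items[j]
        rest = [items[k] for k in range(len(items)) if k not in (i, j)]
        candidates = [
            (a + b, f"({ea} + {eb})"),
            (a * b, f"({ea} * {eb})"),
            (a - b, f"({ea} - {eb})"),
            (b - a, f"({eb} - {ea})"),
        ]
        if b != 0:
            candidates.append((a / b, f"({ea} / {eb})"))
        if a != 0:
            candidates.append((b / a, f"({eb} / {ea})"))
        for value, expr in candidates:
            found = _search(rest + [(value, expr)], target)
            if found is not None:
                return found
    return None


print(solve24([2, 9, 10, 12]))  # prints some valid expression
print(solve24([4, 5, 6, 10]))   # the input used later in the talk
```

Exhaustive search is trivial here because the space is tiny. The talk's point is about what happens when a model has to commit to tokens left to right without this kind of exploration.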
Now you give a new input, 4, 5, 6, 10, to GPT-3.5 to see whether it can solve the task.
It will first try to multiply 10 and 6 to get 60, then divide by
5 to get 12, then 12 times 4 to get 48.
Then, to make it up, it will say it's 24,
and then call it a day.
So it's a hallucination.
You might argue that if you have better models or better
prompts, you will solve this.
But even if you use GPT-4 with a 5-shot
COT prompt, you will only get 4% task success.
So why is this like easy task so hard for language models?
So if you look at the initial token generation, right,
10 and 6, 10 times.
Because those language models are making local
and token level decisions, one by one, left to right.
Those initial decisions are really hard, right?
Even for humans, we don't know whether the first token
should be 10 or 6 or 5.
We don't have pre-training intuition, right?
We have to play the game to have a better sense.
Worse still, once you generate the wrong token
at the beginning, the task has already failed,
in that you cannot really complete the whole trajectory
in the COT format and be right.
So by this very simple example, what I want to show is
there is something about autoregressive inference
that is lacking mechanisms for deliberate reasoning.
So it's true even for the biggest, strongest language models like GPT-4,
and the reason is quite simple, just like Ben's talk mentioned, right?
So for COT to work, you really need strong local signals
to guide every step through those local decisions.
And just to draw an analogy, right, imagine if you have a robot that's trained only on successful navigation
trajectories, and it's only trained to predict the next move.
And then you put it into a new maze, and then it's very hard to explore.
So how do we solve this issue, right?
So in this work, we took inspiration from human cognition.
In his famous book, Thinking, Fast and Slow, Daniel Kahneman proposed that our cognition has two parts.
We have a fast and automatic System 1 that's handling everyday tasks like riding a bike, and we have a slow and deliberate System 2 that's
imposing control and intervention over System 1 for harder tasks like designing a plan.
So if language models' automatic,
autoregressive inference is similar to this spontaneous but error-prone
System 1 process,
maybe we can impose some kind of control algorithm on top of it to get System 2 reasoning.
And tree search is naturally the choice, which is also one of the oldest ideas in artificial intelligence.
For example, Newell and Simon's General Problem Solver in the 1950s.
However, doing search in this language reasoning space is non-trivial.
Because traditionally, if we search in classical games like chess, we often have a small fixed set of next moves,
so that we can design or learn search heuristics.
But if you want to search in open-ended reasoning, the next move can be arbitrary
text, which is really hard to enumerate or evaluate.
So the idea here is, now that we have large language models, we can use them to start generating
and evaluating next moves.
So from the previous two slides, you have seen what's the problem of large language models
and what's the problem of classical search, and a hint that combining them might
lead to a better result, and that's true.
So we propose Tree of Thoughts.
It's a general method for combining language models
and search algorithms for deliberate reasoning.
And to solve a problem, you need four parts, right?
So first, you need to define what is a search space
or what is the thought space.
Then you need to generate and evaluate
language thoughts using language models.
And you need to combine that with a search algorithm
to explore and maintain thoughts.
So I'll use the simplest example, which is the game of 24,
to explain each part.
OK, so what is a thought, right?
That's not a question in chain of thought,
because everything is coherent, and you
don't have to split it.
But it's a very critical thing in Tree of Thoughts.
So here we define a thought as a coherent piece of text
that serves as a next move in the reasoning game.
And if you think about the game of 24, right,
there are two extreme choices.
On one extreme, you can treat each token as a thought,
right?
Then it will be very easy to generate each thought.
But as explained before, it's very hard to evaluate
whether 10 is a good thought or 13 is a good thought.
On the other extreme, you can treat the whole reasoning
as a thought, right?
You generate the whole thing, which will be very easy to evaluate.
You just look at the end to see if the number is 24.
But if you can generate that, the problem is solved already,
so it's very hard to generate.
So in this game, naturally, the choice of thought
is something in between, right?
We can use each intermediate equation as a thought,
so that it's relatively easy to generate and evaluate
thoughts.
And this is really a problem-specific trade-off design, right?
So for different problems, a thought can be a token,
can be a word, can be a sentence, can be a
paragraph, and so on.
So once you have defined what a thought is,
it's easy to generate it with a language model.
So here, it's a simple prompt.
You have one example of what's the input
and what are the possible thoughts.
Then you give it a new input, and the language model
can just generate new thoughts.
So here, each new line is a new thought of how to continue the reasoning.
Once you have those thoughts, right, you want to give them a value
so that you can search.
So here, what we do is
we give this prompt with examples where, for the remaining numbers, if the language model can simulate within a few trials and reach 24, then a high value is given.
If not, depending on whether the numbers look reasonable or not, a medium or a low value is given.
Okay, so the previous three examples are the in-context examples.
Now, for this new input, 5, 6, 6, the language model tries one round, reaches 24, and "sure", it's a high value.
For 10, 13, 13, it will try a few rounds and it will fail, and these numbers look too large,
so it's "impossible", so a low value.
For something like 5, 5, 9, it tries a few rounds and it doesn't work, but the numbers look reasonable,
so "likely", so a medium value.
But here, actually, with 5, 5, 9, it's not actually possible to reach 24.
So it's important to know that, just like any search heuristic, here the value does not have to be perfect.
It just needs to bias the search toward promising directions.
Also, here the prompt uses common-sense reasoning and simulation, but you can really design different strategies for different problems.
It's really flexible.
So lastly, you can combine them together with a tree search algorithm.
Here we use breadth-first search, which is the simplest algorithm.
You have a depth of three, and you have a breadth from one to five.
And the idea is very simple, right?
You have an input.
You generate a bunch of thoughts.
We evaluate them.
You only keep the top choices.
So it's like a thought-level beam search, right?
And you keep doing that until four numbers become three numbers, three numbers become two numbers,
and two numbers become one number, and you succeed if the only number left is 24.
So while COT only achieved 4%, TOT with a breadth of 1 already reaches 45%, and a breadth of 5
leads to an even higher 74%.
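The generate, evaluate, keep-the-top-few loop described here can be sketched as below. This is a minimal illustration, not the paper's code. The important assumption is that both language-model prompts are replaced by stand-ins: `propose` enumerates every way to combine two numbers, and `value` is an exact brute-force check, whereas the value prompt in the talk is an imperfect heuristic. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def propose(state):
    """Stand-in for the generation prompt: every way to combine two numbers."""
    out = set()
    for i, j in combinations(range(len(state)), 2):
        a, b = state[i], state[j]
        rest = [state[k] for k in range(len(state)) if k not in (i, j)]
        results = {a + b, a * b, a - b, b - a}
        if b != 0:
            results.add(a / b)
        if a != 0:
            results.add(b / a)
        for r in results:
            out.add(tuple(sorted(rest + [r])))
    return sorted(out)


def can_reach(state, target=Fraction(24)):
    if len(state) == 1:
        return state[0] == target
    return any(can_reach(s, target) for s in propose(state))


def value(state):
    """Stand-in for the value prompt (exact here, heuristic in the talk)."""
    return 1.0 if can_reach(state) else 0.0


def tot_bfs(numbers, breadth=5, generate=propose, evaluate=value):
    """Thought-level beam search: expand, score, keep the top `breadth` paths."""
    start = tuple(sorted(Fraction(n) for n in numbers))
    frontier = [(start,)]
    for _ in range(len(numbers) - 1):  # depth 3 for four numbers
        candidates = [path + (nxt,) for path in frontier for nxt in generate(path[-1])]
        candidates.sort(key=lambda p: evaluate(p[-1]), reverse=True)
        frontier = candidates[:breadth]
    for path in frontier:
        if path[-1] == (Fraction(24),):
            return path
    return None


path = tot_bfs([4, 5, 6, 10], breadth=1)
print([tuple(str(x) for x in state) for state in path])
```

With an exact evaluator a breadth of 1 is always enough, which is why this sketch cannot reproduce the 45% versus 74% gap. That gap comes from the noisy language-model evaluator, where a wider beam gives the search room to recover from a bad score.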
We can also use a similar idea for different algorithms and for different problems.
So for example, for crosswords, right?
Suppose you have five clues horizontally and five clues vertically.
What you can do is you can generate a bunch of thoughts, evaluate them,
and then proceed only with the most promising choice.
So that's a depth-first search, or best-first search.
And you can keep doing this until the language model realizes
this board is no longer solvable.
Then what you do is you prune the subtree and then you backtrack, right?
So you'll move on to this, but maybe this is still not solvable,
so you will move on again.
If none of the things work, then you go back one level, and then you try again.
So it's a very classic depth-first search.
And here are the results.
COT reaches 1%, and we got 20%.
But if you don't have pruning or backtracking, then it goes down to 5%,
which shows pruning and backtracking are very important.
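A minimal sketch of depth-first search with pruning and backtracking follows. It is not the paper's crossword solver: the toy task (spelling a three-letter word), the `PROMISING` table standing in for an imperfect language-model evaluator, and every name here are assumptions chosen so the backtracking is easy to see.

```python
def tot_dfs(state, generate, evaluate, is_goal, threshold=0.5):
    """Depth-first search over thoughts with pruning and backtracking."""
    if is_goal(state):
        return state
    # Try the most promising next thoughts first.
    for nxt in sorted(generate(state), key=evaluate, reverse=True):
        if evaluate(nxt) < threshold:
            continue  # prune: the evaluator judges this subtree unsolvable
        found = tot_dfs(nxt, generate, evaluate, is_goal, threshold)
        if found is not None:
            return found
        # Otherwise fall through: backtrack and try the next sibling.
    return None


# Toy task: spell a word from WORDS one letter at a time.
WORDS = {"dog"}
# Imperfect evaluator: "c" and "ca" look promising but lead nowhere.
PROMISING = {"c", "ca", "d", "do", "dog"}


def generate(prefix):
    return [prefix + ch for ch in "acdgot"] if len(prefix) < 3 else []


def evaluate(prefix):
    return 1.0 if prefix in PROMISING else 0.0


def is_goal(prefix):
    return prefix in WORDS


print(tot_dfs("", generate, evaluate, is_goal))  # prints "dog"
```

The search first commits to `c` and then `ca`, finds every continuation pruned, returns to the root, and only then tries `d`. Without the return-and-retry step it would stop at the dead end, which is the failure mode the 5% ablation figure points at.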
So in our paper we have these two games, but we also have a natural language task
that's trying to write creative stories.
And the intuition is also very simple, right?
So if you're a good writer, you know, you don't just write token by token, right?
You will deliberately plan, you know, what are the possible plots.
You will choose, compare between them and you will select them.
So similarly here, the language model will write, you know, a bunch of diverse plans,
then self-evaluate, you know, what is a good plan, then proceed with that.
And you can do this kind of search for writing, and then humans will find it
more creative than the COT
writing. But the writing is too complex and long, so I cannot display it here.
So what I want to say is, across those different tasks with very
different reasoning challenges, the modular design of TOT allows us to have very
flexible ways to generate, evaluate, and search thoughts across very general and
diverse tasks. And we were doing so in a very systematic framework and achieved
very good performance without retraining any models, so it's very convenient to
use.
So we believe this is an initial step toward connecting old insights and new frontiers of AI.
So here, tree search, one of the oldest ideas in AI, helps language models do more deliberate reasoning,
while language models help search by providing it with very flexible, general-purpose, powerful heuristics.
So we have this follow-up effort trying to connect cognitive architectures to language-model-based agents.
So those are systems that do not just
reason internally but also interact with the external world and learn through
such interaction continuously, so it's like autonomous agents. So we have this
follow-up paper called CoALA, Cognitive Architectures for Language Agents. I highly
encourage you guys to check it out. And I thank my co-authors. Thank you guys for
listening. Check out the poster today, and happy to chat. Thank you so much.
I do
like it when people come up with a general enough model that you can customize it or
specialize it to recover smaller effects that other people have found. So you can, from the
Tree of Thought paper, recover something like the backspace token model, or recover Skeleton of Thought,
Chain of Thought, whatever of thought. I don't care, I can't keep track anymore.
Anyway, so I caught up with him at his poster session, and here's a bit of our chat.
You can hold it up. Alright, thank you. So the TL;DR of this paper is very simple:
large language models and search complement each other. So what's wrong with just using
large language models without search?
Is everyone familiar with chain of thought?
Okay, so suppose you're trying to solve this game of 24,
where given four numbers, you try to combine them to get 24.
Okay, so you can give GPT-4 this task instruction.
You give it a couple of COT examples, but the performance is really low,
it's 4%.
Why is it so hard, right?
So that's because this problem
intrinsically needs exploration.
So let's take a look at the initial example, right?
So the model is making local token decisions, right?
It first generates 10, then it generates times,
then it generates six.
But it's very hard to decide those initial tokens,
even for humans.
You don't really know whether the first token should be 10
or 5 or 6, or are they equally good.
That's really hard to decide.
But what's worse is, once you decide on the wrong token
at the beginning, the task has already failed.
So in this particular example,
if you generate 10 and times,
this task has already failed.
Because no matter what you multiply by 10,
you cannot get three remaining numbers that reach 24.
So the intuition is that
autoregressive inference is like
you keep making those local token decisions
one by one, left to right,
without lookahead, without backtracking.
And it's not very robust
when you don't have good local signals
to guide you through that kind of process.
So another analogy will be,
suppose you're training a robot
that's trying to navigate mazes.
If you only train it on successful trajectories,
and you only train it to predict the next move,
and you do this local imitation,
and you put it in a new maze that requires exploration,
then it probably won't solve the new maze.
So obviously some kind of search is needed.
But why is this a 2023 work,
given that search has been around since the 1940s and 1950s?
That's because classical search problems like chess,
they have a small fixed set of next moves,
what we call the search space.
That makes it easy
to design or to learn search heuristics to guide the search.
But here, for those kind of open-ended reasoning,
the next move can be anything.
It could be a token, it could be a sentence,
it could be a paragraph,
and it's impossible to enumerate this huge space
or to design evaluations.
So the key point here is
you want to really define what the search space is
first. You can consider two extremes. On one extreme, you can define a thought as the next
token. Then you will be searching in a tree of tokens, something like beam search. Then the problem
is it's very easy to generate tokens, right? But it's very hard to evaluate tokens. You don't really
know whether 10 is good or 13 is good or whatever. On the other extreme,
you can define a thought as the whole reasoning.
Then it's very easy to evaluate the thoughts.
You just look at whether the final number is 24 or not.
But in this case, it will be very hard to generate a good thought.
Otherwise, the task is solved already.
So in this case, seems like the right balance is
you define each intermediate step as a thought.
So you can do something like,
you tell language model, here are some numbers,
come up with some different ways to combine two of the numbers.
You can generate a bunch of thoughts.
And for each of them, you can do something like this.
You can say, try a few rounds, can you reach 24?
If not, try to assign a value based on that.
So, within three trials, if you can already reach 24,
then this thought has a very high value.
If it couldn't reach 24, but maybe you could reach
25 or 26, okay, maybe a medium value is given.
But if this is something like one, two, three, and you can only reach six or four,
then maybe a low value is given.
So this value is not perfect, and it does not need to be perfect, just like any search heuristic.
It just needs to bias the search towards promising directions.
So the point is, once you define this search space, you can generate and evaluate next moves
using large language models, and then you can systematically maintain them using the tree search algorithm.
And we show across diverse tasks
that it significantly improves task performance,
and it's very easy to use,
you don't need to train any new models.
Everything is done with GPT-4.
Pretty elegant.
So I like that comparison to beam search.
This is like a level of abstraction above that,
with the atomic unit being a thought.
Yeah.
For a thought, here you illustrated it being an equation,
but here you have an equation,
a clue word, like the examples here.
Yeah.
Do you have a planning stage in order to plan out the thought steps, right?
Like here you have thought steps of three, thoughts of five to ten, thoughts of one.
Usually when people design agents, they'll have like a planner.
But I don't see a planner here.
That's a great question.
You will notice here for those two games, the thought steps are kind of homogeneous
because every step you're just trying to come up with
a new equation, or you're trying to come up with a new clue.
So in this case, you don't really need planning.
You can just use one generation prompt, write one evaluation prompt,
and use them across different steps.
But for something more complicated, where for each thought step,
you might do different things.
Then you probably need to plan ahead and maybe design different prompts
for different kinds of generation and different kinds of evaluation.
Got it.
So do you also see this being able to be combined with self-consistency?
because in a way your judge is self-consistency.
That's a great idea and we did that.
So here, in this creative writing task,
what we do for evaluation is:
here is a task instruction.
Here are some of the plans.
Think about the best plan and come out with its ID, right?
So if you just do this one time,
you just get one vote.
It's very noisy.
So you can apply something like self-consistency, right?
You can i.i.d. do like 10 different votes or 100 different votes.
And then the evaluation will become more faithful.
Yeah.
And that's kind of a hyperparameter you can choose.
It's like if you want better performance, you can spend more money and try to do that more.
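The repeated-voting idea can be sketched as a simple majority tally. This is an illustration under assumptions: `vote_once` is a hypothetical stand-in for one sampled language-model call that returns the ID of the plan it prefers, and here it is replaced by a fixed list of votes so the example is deterministic.

```python
from collections import Counter


def select_plan(plans, vote_once, n_votes=10):
    """Call a noisy single-vote function n_votes times and return the majority ID."""
    tally = Counter(vote_once(plans) for _ in range(n_votes))
    best_id, _ = tally.most_common(1)[0]
    return best_id, tally


# Stand-in for independent model samples: five pre-recorded votes.
recorded_votes = iter([1, 0, 1, 2, 1])
plans = ["plan A", "plan B", "plan C"]
best_id, tally = select_plan(plans, lambda _plans: next(recorded_votes), n_votes=5)
print(plans[best_id], dict(tally))  # plan B {1: 3, 0: 1, 2: 1}
```

`n_votes` is the hyperparameter mentioned in the conversation: more votes cost more calls and give a less noisy choice.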
Like a post-generation layer?
It's like a stepwise democracy, I guess.
Okay, so one more question about just in general Princeton NLP.
how is it organized
what should people know about
the Princeton program? Because I feel like you guys are very productive
and how are you so productive?
Like what's the backstory to Tree of Thoughts maybe?
I think one thing that's good about Princeton
is it's kind of a small school.
I've been there, it's not that small.
I mean, compared to Harvard or MIT.
You have a lot of interdisciplinary kinds of collaborations.
So I did this with
cognitive science professors. I think this kind of idea across different fields is very important,
right? So usually in NLP we don't consider tasks like that. Yeah. That's classical search, right?
So I think it's very useful to combine ideas from different fields, and that could be a way
to come up with new ideas. Yeah. Are a lot of people asking you about Q* stuff?
No. No comments. No comments. Okay, well thank you very much. This is a great paper.
So perhaps one paper that made a bigger splash than Tree of Thought earlier this year was Toolformer,
where we started really considering the myriad ways that we can train language models to use tools.
So here's the Toolformer oral.
Hi everyone. My name is Jane.
I'm a researcher from FAIR Labs at Meta, and today I'm super excited to be presenting to you Toolformer:
Language Models Can Teach Themselves to Use Tools.
And the reason we might want language models like ChatGPT to have access to external tools,
is exemplified by these three queries.
In the first two cases, I've asked who is the current president
and what day of the week is it today?
And ChatGPT basically says it doesn't have real-time data
or access to current time or date information.
And in the final query, I've asked it to do a simple set of computations,
but ChatGPT unfortunately hallucinates an answer
that's about 300 off from the real answer.
And what we really could have used here is access to external tools.
For example, a QA system that has up-to-date information,
a calendar tool which has the time or date, and a calculator tool, which is designed specifically
to do these simple computations perfectly.
And so for Toolformer, we have five tools at our disposal.
We have a QA system with up-to-date information.
We have a Wikipedia search tool, which is able to search Wikipedia.
We have a calculator tool, a calendar tool, which has the current day of the week and the date,
and we have a translation tool which takes in text and translates it into English.
And so in choosing these five tools, we really wanted a set of tools that is not only diverse,
but is also going to be likely useful to the language model.
And what we want to train a model to learn is not only which of these five tools to use,
but when to use that particular tool and how to use that tool all on its own without human annotation.
And the way we do this is by taking natural language text, like Pittsburgh is known as the Steel City,
and augmenting that text with API or tool calls.
So, for example, here, a useful API call would be to the QA system
with the question, what other name is Pittsburgh known by?
And this is useful because it's useful in anticipating the remainder of the text,
which is the Steel City.
And we represent an API or tool call with natural language.
We use square brackets, followed by the tool name.
And then in round parentheses, we have the input to that tool,
followed by a right arrow, which is
followed by the output of the tool for that query.
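As a small illustration of that textual format, here is a sketch that writes and reads such calls. The exact spacing around the arrow and the helper names `format_call` and `parse_call` are assumptions for this example, and the regular expression only handles the simple cases shown.

```python
import re


def format_call(tool, query, result=None):
    """Render a call as [Tool(query)] or, once executed, [Tool(query) → result]."""
    call = f"{tool}({query})"
    return f"[{call} → {result}]" if result is not None else f"[{call}]"


# Tool name, parenthesized input, then an optional arrow and output.
CALL_RE = re.compile(r"\[(\w+)\((.*?)\)(?: → (.*?))?\]")


def parse_call(text):
    """Return the first embedded call in text as a dict, or None if there is none."""
    match = CALL_RE.search(text)
    if match is None:
        return None
    tool, query, result = match.groups()
    return {"tool": tool, "query": query, "result": result}


call = format_call("QA", "What other name is Pittsburgh known by?", "the Steel City")
augmented = f"Pittsburgh is known as {call} the Steel City."
print(augmented)
print(parse_call(augmented))
```

Because the call is ordinary text spliced into the sentence, the augmented data can be used for standard language-model fine-tuning without any change to the training objective.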
And with that, the steps to creating Toolformer
are pretty simple.
In the first step, we want to create a new training
data set augmented with these API calls
that I just showed you on the previous slide.
And in the second step, we want to fine-tune GPT-J,
our base model, on this new data set.
And this fine-tuned model is the model that we refer to as Toolformer.
Now, to create that training data set,
we have three simple steps, which I'll get into in just a second.
But first, we want to start out with a standard language modeling data set like CCNet.
And the reason we want to start here is because we don't want to disrupt any of the core
language modeling capabilities that the model may already have.
And so in using a data set or something similar to what it's seen before, we minimize
this risk as much as possible.
Okay, so let's go into the first step, which is to generate API calls.
And to do so, we show the model a simple prompt.
We say, your task is to add calls to a question answering API to
a piece of text.
The question should help you get information required
to complete the text.
And then we explain the format of the API call that we want,
and we show it a couple of examples.
I only have one example here, but we would put as many examples
as can fit into the context window.
And then we show the input that we actually want to do inference
on, and we let the model generate.
And here, I'm only showing you the question answering API
prompt, but you can imagine that we do a very similar thing
for the other four tools.
Okay, so let's look at a couple of generated API call examples for the input.
Pittsburgh is known as the Steel City.
And here the model has generated in which state is Pittsburgh, what other name is Pittsburgh
known by, and what is the second city in Pennsylvania?
And so from these generations, you can see that we get a mix of relevant API calls, non-relevant
API calls, and also some that don't make a lot of sense, like the last one.
And now for the second step where we actually try to execute those API calls.
So what we do here is we take that natural language string, we parse it for the input parameters,
we send it to the relevant tool, and we get an output from the tool.
Now using those outputs, we want to put them back into the embedded API call, and we indicate
this with a right arrow followed by the output.
And this is also the step where we would filter out generated API calls that are ill-formatted
or don't actually return a result from the tool.
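A minimal sketch of this execution step follows, assuming a bracketed call format and a toy tool registry. The `calculator` stub (binary arithmetic only), the `TOOLS` table, and `execute_calls` are hypothetical names for illustration. Calls that name an unknown tool or fail to return a result are simply dropped from the text.

```python
import operator
import re

CALL_RE = re.compile(r"\[(\w+)\((.*?)\)\]")  # non-executed calls: [Tool(input)]
BINARY = re.compile(r"\s*(\d+(?:\.\d+)?)\s*([+\-*/])\s*(\d+(?:\.\d+)?)\s*")
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def calculator(expr):
    """Toy calculator: one binary operation, result rounded to two decimals."""
    match = BINARY.fullmatch(expr)
    if match is None:
        return None  # ill-formatted input
    a, op, b = float(match.group(1)), match.group(2), float(match.group(3))
    if op == "/" and b == 0:
        return None  # the tool has no result to return
    return f"{OPS[op](a, b):.2f}"


TOOLS = {"Calculator": calculator}


def execute_calls(text):
    """Run each embedded call and write its output back after a right arrow."""
    def run(match):
        tool, query = match.group(1), match.group(2)
        fn = TOOLS.get(tool)
        result = fn(query) if fn is not None else None
        if result is None:
            return ""  # filter out calls that cannot be executed
        return f"[{tool}({query}) → {result}]"
    return CALL_RE.sub(run, text)


print(execute_calls("85 patients (23%) [Calculator(85 / 23)] responded"))
```

Whether an executed call is worth keeping is a separate question, which the talk turns to next.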
And additionally, we also want to filter out API calls that aren't actually useful to the
model.
So the way we want to think about usefulness is whether it's useful for anticipating the remainder
of the text, as I showed you earlier.
And the way we quantify usefulness is through model-based perplexity.
And perplexity is basically the negative log likelihood of the remainder of the text given
the prefix of the text.
So basically you want the lowest perplexity possible because you want the model to be least perplexed
about what it's about to see.
So here we evaluate perplexity under three different settings.
The first setting is where we don't have any API call.
So here, the prefix would just be Pittsburgh is known as.
In the second setting, we have the non-executed API call,
where we have the API call, but we're not actually
going to put the result from the tool yet.
And then finally, we have the full executed API call,
where we have the API call and its corresponding output.
And intuitively, what we want here is for the perplexity for setting C
to be much lower than either A or B, because not only do we want the generated
API call to be useful, but we also want the results from the tool to be really useful.
So this is exactly how we evaluate usefulness.
It's the minimum of the perplexity under either A or B, minus the perplexity under C.
So we want that difference to be as large as possible.
And here we have a pretty sizable usefulness score of 1.3,
which is pretty good.
But to give you more context,
here's another example from the calendar tool.
It says the WL will be open on Friday.
And the calendar tool tells us that today is Thursday, March 9th.
And from this we can kind of infer that Friday
is going to be March 10th.
So this gets a high usefulness score of 2.11.
On the other hand, we have this example from the calculator tool.
The model has seen these two numbers,
85 patients and 23%, and it thinks maybe the ratio is going to be useful.
But unfortunately, that's not the case, and it gets a low usefulness score of negative 0.02.
So this would likely be filtered out in our final third step.
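To make the usefulness score concrete, here is a small Python sketch of the rule as stated in the talk: take the lower of the two losses without the tool result (settings A and B) and subtract the loss with the executed call (setting C). The loss values, the candidate records, and the threshold used below are made-up numbers for illustration only.

```python
def usefulness(loss_no_call, loss_call_only, loss_call_with_result):
    """min(A, B) - C: how much the executed call helps predict the rest of the text."""
    return min(loss_no_call, loss_call_only) - loss_call_with_result


def keep_useful(candidates, threshold):
    """Keep only candidates whose usefulness score reaches the threshold."""
    return [c for c in candidates if usefulness(*c["losses"]) >= threshold]


# Hypothetical losses for settings (A, B, C); not numbers from the paper.
candidates = [
    {"call": "QA(What other name is Pittsburgh known by?)", "losses": (2.0, 1.9, 0.6)},
    {"call": "Calculator(85 / 23)", "losses": (1.50, 1.48, 1.50)},
]

for candidate in candidates:
    print(candidate["call"], round(usefulness(*candidate["losses"]), 2))

kept = keep_useful(candidates, threshold=1.0)
print([c["call"] for c in kept])
```

With these made-up numbers the question answering call scores about 1.3 and survives, while the calculator call scores slightly below zero and is dropped.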
Now here I'm showing you the number of examples that remain after this filtering process.
For two different kinds of thresholds, we have in light blue, 0.5, and in dark blue we have 1.0.
Obviously, you're going to get a lot more examples left over if you use a less stringent threshold, 0.5.
But the other thing you can see here is that we have the most number of examples from the
Wikipedia search tool, whereas for calculator and machine translation, we have the
fewest number of examples.
And now what we do here is we cap the number of examples per tool at 25,000, and we put
it all together in one big dataset.
And with that dataset, we fine-tune our base model, GPT-J, and this fine-tuned model is what
we refer to as Toolformer.
Now, to evaluate Toolformer, we want to evaluate on a range of tasks where we think at least
one of the tools is going to be useful.
So we have fact completion and question answering.
We also have math computations and multilingual questions where the context is given in
English, but the question can be in a different language.
And we also have temporal questions like how many days is it until Christmas, where you
need to know the current time or date in order to answer the question.
Now here are the results for those five tasks.
We have three different models.
We have GPT-J, which is the base model, Toolformer, and GPT-3, which is a 175 billion parameter
model.
And what you can see here is that in almost all cases, Toolformer is outperforming
GPT-J, but it's also outperforming GPT-3, even though it's about 30 times smaller than GPT-3.
And an exception to this is the question answering task, where we actually disabled the
QA system.
And this is because there's a lot of overlap between the training set of the
QA system and our evaluation tasks, so we thought this would be too much of an advantage if
we enabled that tool.
The second anomaly is the multilingual task where we don't see a lot of benefit from the
translation tool, and we think this is likely because GPT-J has already seen a lot of
multilingual text and isn't getting a whole lot of benefit from actually using that tool.
But regardless, we see that Toolformer is either on par with GPT-J or outperforming GPT-J.
And the second thing that we want to look at is whether or not small models
can effectively use tools.
So in other words, is there a minimum size requirement
for models to be able to effectively use tools?
So to investigate this, we applied the same kind of pipeline
to the family of GPT-2 models.
So there are four of them, and in total,
we have five different models at various sizes,
which I'm showing you on the x-axis.
And on the y-axis, we have model performance.
And in blue, we have Toolformer.
And in red, we have Toolformer disabled,
where we use constrained decoding to prevent
the usage of tools.
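The talk does not spell out how the constrained decoding works. One simple way to get this effect, sketched here in Python as an assumption rather than as the authors' method, is to mask the logit of the token that opens an API call before picking the next token. The token ids and logits below are hypothetical.

```python
import math


def greedy_next_token(logits, banned_token_ids=()):
    """Pick the highest-scoring token after masking any banned token ids."""
    masked = list(logits)
    for token_id in banned_token_ids:
        masked[token_id] = -math.inf  # this token can never be selected
    return max(range(len(masked)), key=masked.__getitem__)


# Hypothetical vocabulary of three tokens; id 1 stands in for the API-call start token.
API_START_ID = 1
logits = [0.1, 2.5, 1.0]

print(greedy_next_token(logits))                                   # tools enabled
print(greedy_next_token(logits, banned_token_ids=[API_START_ID]))  # tools disabled
```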
And as you can see, in the smallest two sizes,
we don't see any performance difference
between Toolformer and Toolformer disabled,
meaning that Toolformer is not able to make use
of those five tools to its fullest.
But once we get to 775 million parameters,
we see a performance gap emerging,
and this gets bigger and is sustained for the rest of the sizes.
And this is a similar thing that we see with the math benchmarks.
It seems that tool usage is really emerging
at 775 million parameters.
For the question answering benchmarks, we don't see this as clearly, and we think that maybe
this is likely because the QA system and the Wikipedia search tool are easier tools to use,
and so you don't need a more capable model to be able to understand how to use it effectively.
And finally, we also want to revisit the question of whether or not Toolformer is a good language
model.
We originally used the dataset CCNet because we didn't want to disrupt any of the core language
modeling capabilities.
And so now we revisit that question by looking at perplexity on a held-out set
of WikiText and CCNet.
And here we have three different models, GPT-J,
GPT-J further fine-tuned on CCNet,
and Toolformer, which is further fine-tuned
on CCNet augmented with those API calls.
And what we find is that the perplexity
is pretty much on par with the base model
and the further fine-tuned one.
We don't see a whole lot of difference.
And so we feel pretty encouraged that even though this data
set may look a bit unnatural with these API calls,
it doesn't actually harm the core language
modeling capabilities here.
So thank you for listening to this talk.
Please check out our paper at this QR code.
We have a poster in the next poster session.
We are poster number 332, and I will be there with Roberta.
Please feel free to reach out to any of the co-authors and me.
I'm happy to take questions now or later.
Thanks.
When I look at all the relevant papers for AI engineers this year,
there's the chain of thought papers and the tool use papers,
two of which we just covered.
But something that I think incorporates all of them, and adds a few ideas that are unique and notable, is the Voyager paper from NVIDIA.
And even though it was released in the first half of the year, people are still talking about it today.
It's still shaping people's mental perceptions of how they want to build their LLM architectures.
It was somehow not accepted for posters or oral sessions at this year's NeurIPS.
It's kind of a mystery as to why.
I did chat with Jim and I'm still not really sure what's going on there.
But it would have been my vote for Best Paper because it's so foundational and established such
a strong baseline for everyone else to build on top of LLMs.
And anyway, there were some workshop presentations about Voyager with the first author, so here it is.
My name is Guanzhi Wang.
Currently I'm a third-year PhD student at Caltech.
I'm also a research intern at NVIDIA.
I'm very happy to present Voyager, an open-ended embodied agent with large language models.
This year, GPT-4 came, a large language model that's so good at coding and long-horizon
planning, so we built Voyager, the first large language model-powered lifelong learning agent.
When we set Voyager loose in Minecraft, it is able to play the game for hours on end without
any human intervention.
The video here shows snippets from a single episode of Voyager.
So it explores the terrain, mines all kinds of materials, fights monsters, crafts hundreds of recipes,
and unlocks an ever-expanding tree of skills.
If you want to use the full power of GPT-4, a central question is how do we stringify things?
In other words, how do we convert this embodied environment with multimodal observation and action space into pure text?
We need a magic box.
And thankfully, the enthusiastic Minecraft community already built one.
It's called Mineflayer, a high-level JavaScript API that's
actively maintained to work with every Minecraft version. The beauty of Mineflayer is that it has access
to the game state surrounding the agent, like the nearby blocks, animals, and enemies. So we
effectively have a ground-truth perception module as a textual channel. Now that we convert everything to
text, we are ready to construct an agent algorithm on top of GPT-4. And on the high level, there are
three components. First, a coding module that writes JavaScript to control the game
bot. It's the main module that generates executable actions. Second, we have a code
base to store the correctly written code and look it up in the future if the
agent needs to recall the skill. In this way, we don't duplicate coding efforts and
achieve a form of learning without gradient descent. Third, we have a curriculum that
proposes what to do next given the agent's current capabilities. When we wire
them up together, we get a loop that drives the agent indefinitely and achieves something
like lifelong learning. So let's zoom in on the center module. We prompt GPT-4 with
documentation and examples on how to use a subset of the Mineflayer API. Then GPT-4 writes code
to take actions given the current assigned task. And because JavaScript runs in a code interpreter,
GPT-4 can define new functions on the fly and run them interactively. But GPT-4
isn't always able to get the code right on the first try.
We develop an iterative prompting mechanism
to refine the program.
There are three types of feedback.
First, the environment feedback,
like what new materials did you get
after taking an action?
Second, the execution error from the JavaScript interpreter,
like an undefined variable error.
And third, we have another GPT-4 that provides
critique through self-reflection
from the agent's own state.
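The three feedback channels just listed can be pictured as a refinement loop. The Python sketch below is only an illustration under stated assumptions: `write_code`, `run_in_env`, and `critique` are hypothetical callables standing in for the language model calls and the game environment, and `max_rounds` is an assumed retry budget.

```python
from typing import Callable, Optional, Tuple


def refine_program(
    task: str,
    write_code: Callable[[str, str], str],
    run_in_env: Callable[[str], Tuple[str, Optional[str]]],
    critique: Callable[[str, str], Tuple[bool, str]],
    max_rounds: int = 4,
) -> Optional[str]:
    """Rewrite the program until it runs without error and the critic accepts it."""
    feedback = ""
    for _ in range(max_rounds):
        program = write_code(task, feedback)
        env_feedback, error = run_in_env(program)            # channels 1 and 2
        success, critic_note = critique(task, env_feedback)  # channel 3
        if error is None and success:
            return program
        feedback = (f"environment: {env_feedback}\n"
                    f"error: {error}\n"
                    f"critique: {critic_note}")
    return None  # gave up within the budget


# Toy stand-ins so the loop can be run end to end.
def write_code(task, feedback):
    return "v1" if not feedback else "v2"

def run_in_env(program):
    if program == "v1":
        return "nothing happened", "ReferenceError: zombie is not defined"
    return "inventory: 1 rotten_flesh", None

def critique(task, env_feedback):
    done = "rotten_flesh" in env_feedback
    return done, "task complete" if done else "no rotten flesh in inventory yet"

print(refine_program("kill one zombie", write_code, run_in_env, critique))
```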
So these three
components help the agent refine the program effectively. I want to show some examples of how the
critique module provides feedback on the task completion progress. In the first example, the task is to
craft a spyglass. So GPT-4 looks at the agent's inventory and decides that it has enough copper,
but not enough amethyst. The second task is to kill three sheep to collect food. Each sheep drops one
unit of white wool, but there are only two units in the inventory, so one more sheep to go.
Last example: killing the zombie drops a unit of rotten flesh, which is in the inventory,
so GPT-4 determines that the task is successful and moves on. So this critique procedure is repeated
until the task is deemed successful or hits the time limit. Now moving on to the second part.
Once it implements a skill correctly, we save it to a persistent storage.
So think of it as a skill library that's authored purely by GPT-4 through trial and error.
Then the agent can retrieve the skills from the library when facing similar situations in the future.
So it doesn't need to write them again.
In this way, Voyager improves itself as it experiences more and more in Minecraft.
Let's dive a bit deeper into how the skill library is implemented.
So this is how we insert a new skill.
First, we use GPT-3.5 to summarize the program into plain English.
So summarization is very easy and doesn't need GPT-4, so we save some money here.
Then the embedding of the summary becomes a key,
and the program becomes a value which we insert into a vector database.
We find it better to embed the description instead of the raw program,
because it's more semantic
and improves the retrieval.
Now, when Voyager is faced with a new task,
let's say, craft iron pickaxe,
we use GPT-3.5 to generate a hint
on how to solve the task and combine it
with the world state as the query context.
Then we do the embedding and retrieve
the top five relevant skills from the skill library.
So Voyager is free to directly use one of the skills
as is, or interpolate among the five,
or rewrite one from scratch.
In this way, we maximally re-use the old experiences,
think of it as an in-context replay buffer
in the reinforcement learning terminology.
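The insert-and-retrieve scheme described above (embed the plain-English summary as the key, store the program as the value, then fetch the top-k skills for a new query) might look roughly like this in Python. It is a minimal sketch: the bag-of-words `embed` function is a toy stand-in for a real embedding model, the in-memory list stands in for a vector database, and the skill names and descriptions are invented.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SkillLibrary:
    def __init__(self):
        self._entries = []  # (embedding of description, program)

    def add(self, description, program):
        """Key: embedding of the plain-English summary. Value: the program."""
        self._entries.append((embed(description), program))

    def retrieve(self, query, k=5):
        """Return the k programs whose descriptions are most similar to the query."""
        query_vec = embed(query)
        ranked = sorted(self._entries,
                        key=lambda entry: cosine(query_vec, entry[0]),
                        reverse=True)
        return [program for _, program in ranked[:k]]


library = SkillLibrary()
library.add("craft a wooden pickaxe from planks and sticks", "craftWoodenPickaxe")
library.add("smelt iron ore into iron ingots in a furnace", "smeltIronIngots")
library.add("kill a sheep to collect wool", "killSheep")

# In the talk the query is a generated hint combined with the world state.
print(library.retrieve("craft an iron pickaxe using iron ingots and sticks", k=2))
```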
Now, we'll go on to the third part.
We have yet another GPT-4 that proposes what task to do
given its own capability at the moment.
The curriculum has an unsupervised objective,
which is to maximize the number of novel items
that the agent obtains.
There are two key insights here.
First, it's kind of curiosity-driven
exploration or novelty search in prior literature, but implemented purely in context.
Oh, sorry. And second, it's a curriculum that naturally gets progressively
harder over time, without any manual prescription from us. So let's go through a working example
together. The agent finds its hunger bar dropping to 1 out of 20, so it needs to find food.
Now it senses four entities nearby, a cat, a villager, a pig, and some
wheat seed. So it starts an inner monologue. Do I kill the cat or villager? Bad idea. How about the
wheat seed? I can grow a farm, but it's going to take a long time. So sorry, Piggy, you are the
chosen one. It checks the inventory and retrieves an old skill from the library to craft an iron
sword and then starts to learn a new skill called hunting. Now, we also know that Voyager
isn't vegetarian, unfortunately. So putting all the pieces together, we have an
iterative prompting mechanism that refines a program by self-debugging, a skill library
as an in-context replay buffer, and an automatic curriculum as in-context curiosity-driven
exploration.
This is Voyager's no-gradient architecture, where we don't train any new model or fine-tune any
parameters.
It allows Voyager to self-bootstrap and perform lifelong learning in an open-ended world.
So these are the tasks that Voyager happens to pick up along the way.
We didn't pre-program any of this.
It's all Voyager's idea.
The agent is forever curious and forever pursuing new adventures.
We've done a lot of systematic study for Voyager, and here is the quantitative learning curve.
Well, the x-axis is the number of prompting iterations, and the y-axis is the number of
unique items obtained by each agent.
We compare with three baselines, ReAct, Reflexion, and AutoGPT.
All of these are no-gradient architectures on top of GPT-4.
ReAct is a very simple reasoning and acting loop.
And Reflexion is built on top of ReAct with self-reflection.
We see that both struggle to make progress beyond the basic wooden tools.
And AutoGPT is a popular software repo.
It combines ReAct and a task planner that decomposes an objective into sub-goals.
It makes more progress but is very slow.
And this is Voyager.
It is able to obtain three times more novel items
than the prior methods and unlock the tech tree
significantly faster, from wooden to stone to iron to diamond.
The blue curve here is an ablation without the skill library,
which plateaus after a while.
So basically the skill library is very essential
for Voyager's lifelong learning capabilities.
Here are two bird's-eye views of Minecraft maps.
So these circles are what the prior methods explore,
given the same prompting iteration budget.
You can see that they tend to get stuck
in local areas.
Voyager is able to navigate distances two times longer compared to prior works.
It has to visit more diverse terrain in order to find more novel items quickly.
Finally, one limitation is that Voyager does not currently support visual perception because
GPT-4 was text-only when we were developing Voyager, but there's nothing stopping Voyager from using a multimodal model to achieve more impressive tasks.
And here we demonstrate that given human feedback,
Voyager is able to construct complex 3D structures in Minecraft,
such as a house and a Nether portal.
We basically use the human to replace the critic module
of Voyager and provide 3D spatial advice.
So to build very complex structures,
we definitely need some full-blown multimodal models,
and I will leave that to future work.
This is Voyager's website at voyager.minedojo.org.
We open source everything, including the environment,
algorithm, prompts, and pre-trained skill libraries.
Finally, I want to acknowledge all the team members for Voyager.
This work would not be possible without their help.
So please feel free to reach out if you have any questions.
Thanks.
I think the last component of agents, apart from chain of thought and tool use that I wrote
up in the Anatomy of Autonomy write-up in April, is the need for better planning.
And I think one of the most interesting or challenging pieces of NeurIPS, depending how you look at it,
is doing poster diving, where instead of going to all the oral sessions, which have been
curated by track committees and all that, you just go and walk the halls and look for posters and look for
papers and people that are underrated or have been overlooked. And in fact, the original Attention Is All You Need
Transformers paper was one such paper, where they were just a poster-only paper, apparently. From walking the halls
in the poster sessions, my pick for underrated paper was
Ida Momennejad from Microsoft Research with CogEval.
Ida was very confident and professorial in her presentation, made it engaging, made it a quiz.
Some parts of the quiz are visual, so if you're listening along and you want to solve it alongside
us, you should probably pull up the show notes and check out the graphs that I'm going to paste
inside of the show notes.
But otherwise, she just made it very engaging for people to follow along.
I'm not kidding, there was a group of 10, 20 of us way back in the halls in the poster sessions
where a lot of people don't really end up going.
And we were there for like half an hour
while she was giving her
impromptu lecture about CogEval.
And I do think that this is notable
because it is potentially a quantifiable benchmark
for reasoning and planning capabilities
that currently all the language models
don't do very well.
And framing it as a graph problem
helps us generalize to all sorts of reasoning,
planning, and search situations.
And I just like that it was really well presented.
This is obviously a benchmark paper, so there's no solutions proposed, but she has another paper that she's working on that has some of her solutions.
So LLMs are ubiquitous, and a lot of people claim that they can plan or they're going to plan to take over the world.
But first things first, can they actually plan?
I have 15 years of experience working in reinforcement learning and cognitive science and in neuroscience,
evaluating planning in humans and brains and reinforcement learning models.
So I thought, okay, let's apply that.
In order to accurately evaluate whether a cognitive capacity exists in an agent or in a biological system,
there needs to be a systematic protocol to evaluate it.
Inspired by cognitive science, we have two contributions here.
First, we introduce COGEval, a systematic protocol for evaluating cognitive capacities.
What that means is you need to operationalize a particular latent ability in terms of multiple tasks that can be measured.
And these measurements need to unconfound,
or decouple, certain confounds from what is actually being measured in terms of that cognitive ability.
So for instance, if you give it some simple situations, it might be that it solves them,
but you can't declare victory unless you show that the tasks that you have created somehow capture
different aspects of the cognitive latent ability that you are measuring.
Second, you want to operationalize it in terms of different structures, different domains,
and different tasks. You don't want to measure one or two things in
one or two environments and with an anecdote declare that something works or something exists.
So here what, for instance, you have is different graph structures.
I have six structures that I'll show you, and different domains.
I'll show you the spatial domain.
For instance, if I ask you for planning, I could ask you, how do you go to Hall C from here?
Or I could give you information like: Ali is friends with Michael, Michael
is friends with Mary, Mary is friends with Sue.
If Ali wants to pass a message to Sue, what is the path, for instance, right?
That's the planning in the social domain.
So social and spatial domain, different domains,
and also task conditions.
We use 15 different tasks.
These are inspired by various tasks that I have designed
in the past, you can look at these two papers and others.
This goes back 100 years, to the tradition
started by Edward Tolman. "Cognitive Maps in Rats and Men,"
his 1948 review paper, reviews 20 years of research,
where it shows behaviorally how to measure
whether an entity, in that case he was measuring rats,
possesses a cognitive map.
It was a revolutionary result at the time,
because it went against the behaviorist dogma of the time that you need a reward to learn structures.
It showed that no, rats can learn the cognitive map of the environment even if you don't give them rewards.
Okay, come back to present day, 15 tasks in five different categories.
The goal is to evaluate systematically whether LLMs can extract from descriptions of an environment,
the cognitive map, and what does that mean?
It means, similar to Tolman from 100 years ago and that tradition:
Can it solve particular tasks?
Is it robust to certain tasks?
Can it do flexible planning in response to different kinds of tasks where you have
maybe short or brief local changes to the environment, like a reward location changed or one edge changed.
Can it integrate those to accurately plan, for instance?
And we have these different graph structures, just to give you an example of how it goes.
So for graph A, domain is spatial and the task is value-based planning, what would it look like?
I would describe the graph to the LLM.
I'd say: imagine a building with six rooms. From the lobby,
you have two choices, you go to room one or two.
From room one, there is a door to room three,
from room three, there is a door to room five.
In room five, there is $10.
You don't take any money because at the end
you only have one possibility to take money.
You go back. From room two,
you can go to four and then to six, and in room six,
there's $50.
And then the question is, and here,
this was the description of the environment,
then the question is, you return to the lobby,
you have only one choice; you can only take money
once. What is the optimal room to choose in order to take the most money?
And you should say two, because six has the most money, right?
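The building just described can be written down as a small graph, and the expected answer checked with a few lines of Python. This is only a sketch of the task as narrated: the node names and helper functions are made up here, and the recursion assumes the graph is a tree with no cycles, as in this example.

```python
# Graph A as narrated: lobby -> room1 -> room3 -> room5 ($10)
#                      lobby -> room2 -> room4 -> room6 ($50)
GRAPH_A = {
    "lobby": ["room1", "room2"],
    "room1": ["room3"],
    "room3": ["room5"],
    "room5": [],
    "room2": ["room4"],
    "room4": ["room6"],
    "room6": [],
}
REWARDS = {"room5": 10, "room6": 50}


def best_reachable_reward(graph, rewards, node):
    """Largest single reward reachable from node (money can be taken only once)."""
    best = rewards.get(node, 0)
    for next_room in graph[node]:
        best = max(best, best_reachable_reward(graph, rewards, next_room))
    return best


def best_first_move(graph, rewards, start="lobby"):
    """Which room to enter from the start so as to reach the most money."""
    return max(graph[start],
               key=lambda room: best_reachable_reward(graph, rewards, room))


print(best_first_move(GRAPH_A, REWARDS))

# A local change of the kind used in the harder task conditions:
# the reward in room 5 is raised, so the best first move flips.
print(best_first_move(GRAPH_A, {"room5": 100, "room6": 50}))
```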
So all of these environments are described in that way, either in the spatial domain or the social domain,
and the different tasks are prompted like this.
For cases where something in the environment changes, you can see how the second prompt, for instance,
modify something.
You say, oh, now you learn that the reward in this room changed to such and such.
Oh, now you learn that the door to this room has been changed and it all of a sudden opens
to this other.
Okay, now with that, please don't look here.
I don't want you guys to cheat.
And I know you guys might have heard things, but forget everything you heard.
Between these three, which one do you think is going to be the most difficult?
And why?
So the choice is A, B, and C.
A, B and C.
Which graph is going to be the most difficult, or are they going to be the same?
In terms of for the LLM to solve?
They're similar.
So B has more branching and C is more length.
So which one is going to be more difficult to solve for LLMs?
You can say different things, and we can see who is right.
I don't know the answer. I'll guess B because more branching.
Okay, anybody guesses anything else?
Probably C, because I guess the LLM is not going to handle a very long sequence.
So we have two hypotheses here.
Anybody thinks they're the same?
So between A B and C, which one is more difficult or are they the same?
I just don't understand that.
When you say solve, what kind of problem is it trying to solve?
This problem that we just mentioned here.
There is some money somewhere at the end of them.
One of the nodes that is terminal has the most.
C is harder.
Okay, and then between D and E, which one do you think is harder?
More branching, so C, so E?
You think E is harder?
If it's branching.
Okay.
I don't actually know that.
I mean, I do feel like he has a point, so I can be wrong.
Okay, so who thinks, okay, so you think E is harder.
Anybody thinks D is harder?
Okay, why?
Because it has the fewest ways to go from one point to another.
It has bottlenecks.
Yeah, yeah.
Right.
Okay.
Ready?
Okay.
Right here.
So take a look at this.
B is harder than C as you can see.
It's branching.
Right?
B is harder, even though C is twice as large as B in terms of the number of nodes.
And A, you can see that it was easy, right?
So imagine if I showed you this as the planning task and I declared victory and I said, look,
LLMs can solve planning.
GPT-4, great, near 100%, right?
But then you try just a little longer or you have the same number of nodes but with a branching
structure, what do you see here?
Huge drop, right?
And in fact, what do you see for three of the LLMs?
It's at almost 0%, right?
And now between D and E, let's take a look.
As you can see, D is much more difficult for GPT-4, which is the blue one.
In fact, E is more difficult than B.
Sorry, B is more difficult than E.
It's not consistent.
Yeah.
Well, there is something there.
In these two, you have structures where you need to be exact.
There is not multiple paths between different nodes, right?
So it's very important.
If you're going from this cluster to this one, you have to pass through this bottleneck.
So there needs to be an ability to
plan accurately through the specific bottleneck, correct?
Now, what about the different tasks?
Let's see.
As you can see, they're not robust to the different tasks either.
Traversal, which is one-step, two-step, three-step,
n-step path, and value path.
This is easier for these guys.
Why is that?
The reason is that traversal does not change the structure
of the environment or the rewards.
However, as soon as you have the local change,
the stuff that Edward Tolman was talking about 100 years ago,
that is required for measuring cognitive maps
in rodents, for instance, like detour and shortcut, all of a sudden you see a drop,
and you can see it all of a sudden goes to zero for Cohere, for Alpaca, and for LLaMA, right?
And here you can see this sad thing also: it's almost at zero percent for four of the graphs. So all
of these graphs are at almost zero percent for three of the LLMs and about 20 percent for most
of them, and it's only GPT-4 that does a little better, and that's about 40 percent, right?
So based on all of these things, robustness to task if you aggregate across graphs, not robust to tasks,
and robustness to different graphs if you aggregate across tasks, also not very robust.
So you compare these, the general conclusion I would draw is that they're not good at planning.
Now let's take a look at some of their failure modes.
So can you guys see what is the failure mode that is happening here?
There is an edge that doesn't exist.
It hallucinated an edge in giving the planning response that doesn't exist.
Now let's take a look at this case where you have a direct path from 1 to 7, but it's giving a very long one.
It says what is the shortest path between 1 and 7, and it says 1, 13, 10, 7.
But interestingly, if I asked the LLM, can you list the tuples?
GPT-4 can easily list the tuples, but at the same time still can hallucinate, like in this case.
Now in this last one, there's two mistakes.
I told you one of the mistakes, which is hallucinating the edges.
What other mistake do you see?
Is it out of order somehow?
I don't know. It's hard to tell from this distance.
Take a look at the answer. What is wrong with the answer?
Revisits a node.
Exactly. There's a loop.
So the shortest path should not have a loop.
Of course.
So we found another case, right?
So these three failure modes are failures of planning.
Even though it knows the one-step tuples correctly, it seems to fail at planning.
And it can give you some insight into what is going on.
So it's not very good at stitching one-step things together.
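The three failure modes discussed here (a hallucinated edge, a path that loops back on itself, and a path longer than the true shortest path) can each be checked mechanically against the edge list. The Python sketch below is an illustration only: the edge list is a made-up graph that merely reuses the node labels mentioned in the talk, and the function names are hypothetical.

```python
from collections import deque


def shortest_path_length(edges, start, goal):
    """Breadth-first search over an undirected edge list; hop count or None."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        node, distance = queue.popleft()
        if node == goal:
            return distance
        for next_node in neighbors.get(node, ()):
            if next_node not in seen:
                seen.add(next_node)
                queue.append((next_node, distance + 1))
    return None


def diagnose_path(edges, path):
    """List which of the three failure modes a proposed path exhibits."""
    edge_set = {frozenset(edge) for edge in edges}
    issues = []
    if any(frozenset(step) not in edge_set for step in zip(path, path[1:])):
        issues.append("hallucinated edge")
    if len(set(path)) < len(path):
        issues.append("loop")
    best = shortest_path_length(edges, path[0], path[-1])
    if best is not None and len(path) - 1 > best:
        issues.append("longer than shortest path")
    return issues


# Hypothetical graph with a direct edge from 1 to 7.
EDGES = [(1, 7), (1, 13), (13, 10), (10, 7)]

print(diagnose_path(EDGES, [1, 13, 10, 7]))  # valid edges, but not the shortest path
print(diagnose_path(EDGES, [1, 10, 7]))      # uses an edge that does not exist
print(diagnose_path(EDGES, [1, 13, 1, 7]))   # revisits node 1
print(diagnose_path(EDGES, [1, 7]))          # correct answer
```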
So based on that, why do you think it was better at graph A?
Can people give me guesses? Why do you think graph A was easier?
Or it showed some apparent success on graph A. Why do you think that is?
Smaller paths, fewer choices.
Fewer choices.
But this one is also very few choices. It's like B.
It's a tree, right?
This has fewer choices than that, right?
But why is this so much more difficult?
More ways to be wrong.
Say again?
More ways to be wrong.
More ways to be wrong.
So another way to say it
is that the things that showed up exactly in the prompt
are more likely to work for C and A, basically.
So it's as if it just did memorization of what's going on, right?
Because it's just sort of two tracks here.
But there, there is more branching.
For what it's worth, I think a lot of the common sense reasoning benchmarks
that these things are specifically trained on are transitive.
I don't know what you call this.
Yeah.
Yeah.
Like, we trained it to be good at A and C.
No, that's not true.
No?
GPT4 has been trained on a huge amount of text.
A lot of that is family trees and structures that are actually tree-like.
It turns out transformers, in fact, do have some limitations with tree-like structures
and with things that have bottlenecks.
We are very good at bottlenecks.
In fact, bottlenecks make things easier for us, right?
You have a few nodes that, basically, have high centrality, especially
eigenvector centrality or betweenness centrality.
And basically what you do is when you're solving a problem
and planning, you say, I'm going to find that
node, then from there, I'll go somewhere.
You have a subway system.
You go to 14th Street station in New York City,
then you can find a train that goes somewhere else, right?
So if you get lost, just find the hub.
We actually use these heuristics a lot.
It's available in human texts a lot,
but it hasn't been picking up on that.
So this is about the structures that the Transformer,
for instance, might have been learning.
And as you can see, you have here from 7 billion
parameters to one trillion parameters to the best of our knowledge or larger, right?
And none of them is capable of figuring out or like having a high performance higher than
like we have something between zero and 40 percent on a simple two-step tree, which is the
simplest thing you can give a model-based planner.
It's not even probabilistic.
It's deterministic.
And even that is failing, right?
And then we saw these failure modes.
Another thing, what if I give it extra instructions?
By the way, all of these have been told to think step by step.
So we give that simple chain of thought.
What if we give it extra instructions?
For instance, I describe entire breadth-first search and depth-first search.
And I say, hey, use depth-first search.
How is that working?
First do this, then do that.
And another one.
So you can see in the supplementary material of our paper,
the entire sort of description of depth-first search.
Then you see that it improves somewhat for when you are within a cluster,
but when you look at a situation in this graph D,
where you need to find the shortest paths between nodes that are a cluster away from each other,
what you see is that it doesn't help much.
And interestingly, for different temperatures, for temperature zero, it doesn't help at all.
It helps a little bit for temperatures that are higher and, I guess, take different kinds of paths.
But it's interesting, only one cluster away.
Shortest path, one cluster away, it's not a big deal.
The diameter of this network is not that large.
There is not a lot of improvement and the performance is pretty low, as you can see,
for all of them, and for three of them it is actually closer to zero.
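For contrast, the shortest-path task described here is straightforward for a classical search procedure. A minimal sketch of breadth-first search, assuming a hypothetical toy graph of two three-node clusters joined by a single edge (this is an illustration, not one of the graphs from the paper):

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list of nodes, or None."""
    parents = {start: None}          # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:  # walk parent links back to the start
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph[node]:
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None

# Hypothetical graph: cluster a1-a3 and cluster b1-b3, bridged by the edge a3-b1.
toy_graph = {
    "a1": ["a2", "a3"],
    "a2": ["a1", "a3"],
    "a3": ["a1", "a2", "b1"],
    "b1": ["a3", "b2", "b3"],
    "b2": ["b1", "b3"],
    "b3": ["b1", "b2"],
}
path = shortest_path(toy_graph, "a1", "b2")
```

Here the only route between clusters goes through the bridge nodes, which is the "find the hub" heuristic mentioned earlier.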
So this is the evaluation.
We have done together with my summer interns.
We have a paper where we did a prefrontal cortex-inspired modular architecture
where GPT4 basically plays the role of these different kind of modules
and solves these problems in a kind of a modular way similar to the prefrontal cortex.
I have like 15 years of working on prefrontal cortex.
I'm very excited to do this with these models.
You can see it here.
And this paper, you can find it on my website, Webb et al. 2023, and you can find it here as well.
I have an arXiv number over there.
Okay, so I had to cut it for time there, but literally, I'm not joking, I had another half an hour of audio just chatting with her and like all of us just crowding around her like students.
Like she just was very, very engaging in person.
And I love to see that.
I love to see when people can not only do great work, but then also talk in a compelling fashion about it, not just passively answer
questions about it, but also challenge you to think along the way. So I guess if I were to include
one agents paper from NeurIPS, this would be it. And for the final talk of this entire pod,
which is already stretching into three hours, I have saved the coverage of state space models,
which have been the talk of the town. The Mamba model was released a few days before NeurIPS,
and Albert Gu was there. I met him, but I couldn't get a conversation with him. But Chris Ré
was on stage talking about effectively all of Hazy Research, what Stanford's doing,
and what Chris Ré is up to and all the people he's associated with, including Tri Dao and Albert Gu.
So if you want a primer or like a good entry point on just how Chris Ré is thinking about state space models, I think this is it.
So as I mentioned, our motivation for getting rid of attention potentially is long sequences.
That's the practical motivation. I'll come back to my real motivation in one slide.
Practically, some data comes as long sequences. Data, audio; DNA is billions of base pairs. We can also cram in tons of
few-shot examples, which seems pretty cool. When we started this project, really the standard
models couldn't handle it. GPT-1 had only a 512 context length. And as I mentioned, transformers are
scaling quadratically in their sequence length. So we kind of took two parallel paths to this.
One is better hardware algorithms. So we tried with, you know, flash attention and now people
have followed up to make that path really, really fast. Just optimize the crap out of it on hardware,
and there's a lot of juice to squeeze there. The other approach, which I'll talk about now,
are new models. Now, as I mentioned, I actually wasn't totally motivated by that. I wasn't,
honestly, that wasn't my total motivation. I was really motivated by this inductive bias issue.
So the idea here is you give me this image, and I flatten it into one single pixel vector.
And then I ask you, is it a car or a boat, some CIFAR-like thing? Sequential CIFAR, if you know
the task. And this is really interesting to me because when, you know, a human would do this,
this would be hopeless. If you gave me a picture and gave it to me as one pixel vector,
I would have no chance of classifying it. Machines could do something, but there was a huge gap.
And I wanted to understand why is there this inductive bias underneath the covers?
Do you really need this spatial inductive bias for the machines to reason? Do they have to reason like us when they do this?
So I was fascinated by this problem.
Right. So there's another benchmark that came out that was really exciting from the Google folks,
which was about how to benchmark efficient attention. It's called Long Range Arena. It's extremely cool. We found them,
basically because we were playing around with the sequential CIFAR things,
and they had a much greater library of places where they were seeing
possibilities to improve attention.
This was the leaderboard in 2021 of this attention,
and they were basically looking at a bunch of very cool linear attention variants,
some of which we still play with.
I want to draw your attention to two columns on this thing.
The first is image.
That is that sequential CIFAR task I was just talking about.
It's a really interesting task.
You've probably trained CIFAR to like 90s or high 80s on your laptop,
or on a small GPU, and you see the sequential version
was lagging quite a bit behind.
The other column is this thing, PathX,
which were these large images where you had two dots,
and you're trying to say are the two dots connected.
And the reason there are X's is that every model
was basically random guessing at this point.
So there are three approaches that we were trying
to improve long sequences.
Improve the hardware, the utilization on hardware,
approximate attention, and this last one,
which I'm going to talk about most,
which is using RNN-based kinds of ideas
and signal processing ideas.
All of them are great, and we just happened to pick the last one.
All right.
So the idea is we're going to replace just the signal processing box,
the signal mixing box, with this new operator, S4,
that's based on signal processing ideas.
So this was inspired by Albert and Karan.
Albert's now a professor at CMU.
Karan is now running this company, Cartesia,
which is a small company just started.
And basically, S4 is a classic state space model.
So if you're an EE person, you've seen these in like your undergrad right away.
It's an LTI system, but we're going to tweak it for deep learning.
The first thing we're going to get, as I'll show you, pretty mathematically and nicely,
is that signal processing people are obsessed with stability.
They understand bounded input, bounded output stability like nobody's business.
It's simple and it's clean, and we can use it right away.
This is a challenge when training these models.
A second thing, which was quite surprising is,
I've always thought about CNNs and RNNs as quite distinct models.
But what I'm going to show you mathematically is these models actually unify both.
Now, these are CNNs in a kind of different way than we're used to.
They're convolutions where the filters are potentially as long as the input,
but we're going to be able to view the exact same weights and operate on them either as an RNN or CNN,
which is quite exciting.
And the last piece, of course, is that we're going to make this quite fast.
And these are going to be asymptotically more efficient than transformers.
We're eventually going to be able to process sequence in like n-log-n time,
which is then, you know, a challenge to make practical, and I'll share some results there.
Now, this thing is extremely simple.
Very simple, very simple signal processing ideas,
but I just want to point out it had a large improvement on LRA that surprised me.
So here's that improvement on LRA.
This is the first of its kind to solve PathX.
It was like a 26-point jump on this benchmark that a bunch of folks had played at.
I also want to point out that the image task,
that spatial bias seems to matter less than I thought.
And that was really the thing that was interesting to me.
And since then, many people have followed on and pushed these numbers up higher,
But I just think that's really interesting.
I don't know what to do with the observation, but I really like it.
Okay, so what is signal processing?
Well, signal processing people view a signal of d dimensions at n time steps as input,
and an output is a signal of d dimensions at n time steps.
That looks a lot like our X and O matrix that we had in attention.
They also think causally.
They think that time moves left to right through this,
and things like GPT are also kind of causal.
So, so far, what I want to emphasize is, we've really done nothing.
It's just symbol pushing that we've been able to move into this model.
So what does signal processing actually buy us?
Two big ideas.
The first is, over 100 years, they figured out a bunch of models,
which are relatively simple, but capture pretty interesting phenomena.
These aren't the best models you could ever use, these LTI systems,
but they're a simple and very well-understood starting point.
So I argue it makes sense to start there.
The second piece, which I think a lot of machine learners don't
necessarily love, is that they have this idea that a signal is a continuous object that then
is discretely sampled. And that idea allows us to do a bunch of stuff. In particular, it allows
us to use all our discrete tricks, which are more common in machine learning and AI, but also a bunch of,
you know, 19th, 20th century mathematics that knows how to do integrals and solves things exactly.
And I'll show you at least one of those tricks in the next couple of slides. I think that's an
incredibly powerful idea, and it was really helpful for us to think about it. And as I said,
it's going to teach us about stability in like a trivial way.
We're going to use theorems from the 1800s to be able to prove that our models are stable,
which I just think is awesome.
All right, so what's an LTI system?
If you've never played with one, this is what's called a single input, single output system.
You have some curve that's coming in, which is typically called u(t), that's the input,
and some output curve y(t).
You have some hidden state, which is much higher dimension, usually, than the input and the output.
We'll take the hidden state as large as the input; when it's discretized,
it's going to be a huge thing.
Now, I haven't told you how that hidden state evolves yet,
but it's going to be constrained.
And the LTI people say there's lots of things that can fit into
basically letting it evolve according to an ODE.
So I'm going to show you that in just one second.
So here's what you need for the ODE.
You need two matrices A and B, and we're going to learn those matrices.
And basically it says that the hidden state can only evolve according to this equation.
It basically says the change in the hidden state is proportional to some learned function
of the input plus the previous state.
The output is then a projection,
a linear projection,
from this high dimensional state down to 1D.
This is all that an LTI system does.
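Written out, the system described here is presumably the standard continuous-time state space form. The symbols x(t) for the hidden state, u(t) for the input, and y(t) for the output follow the talk; the name C for the output projection is an assumption, since the talk does not name it:

```latex
\begin{aligned}
x'(t) &= A\,x(t) + B\,u(t) \\
y(t)  &= C\,x(t)
\end{aligned}
```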
I'm just saying it's something that's surprisingly powerful
and well understood.
This is not the best model.
If you're a signal processing person,
you say, oh, you should use X, Y, or Z.
You're probably right.
But we want to start with something really,
really simple that we can understand all the way.
All right.
So it turns out that one of the beautiful things
is because it has this continuous object lurking
in the background,
you can use high school calculus, and in particular you can get out this nice expression. What this says is the hidden state is exactly this function x of s and this convolution-style integral. Okay,
this is exactly what it is. This is wonderful. You just solve the ODE. Then when we realize it, we have to discretize; we'll come back to that in a second.
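The slide is not reproduced in the transcript, but the expression is most likely the textbook closed-form solution of a linear ODE. Assuming the ODE is x'(t) = A x(t) + B u(t) with initial state x(0), it reads:

```latex
x(t) = e^{At}\,x(0) + \int_0^t e^{A(t-s)}\,B\,u(s)\,ds
```

The integral term is the "convolution-style" part: the input u is convolved with the kernel e^{At} B.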
So the immediate win is,
well, this can tell us exactly when the system is stable: basically, as long as the eigenvalues are in the left-hand part of the plane, which is something
every EE person memorizes. And the reason the left-hand part of the complex plane matters is that
e to those values goes inside the unit disk, so you know that this thing is not going to blow up.
This system is going to have bounded input, bounded output stability, which is really exciting.
So when we train, we can fix our A's, our representations, so that the eigenvalues satisfy
this property, and that's going to be one of the arts.
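A small numerical illustration of that stability condition. This is a sketch under stated assumptions: a hypothetical diagonal A, with the real part parameterized as a negated exponential so it can never become positive, and an assumed sampling interval `step`. It is one simple way to pin eigenvalues in the left half-plane, not necessarily the parameterization used in S4.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical diagonal A: the eigenvalues are just the diagonal entries.
real_part = -np.exp(rng.normal(size=8))   # strictly negative by construction
imag_part = rng.normal(size=8)
eigs = real_part + 1j * imag_part

step = 0.1                                # assumed sampling interval T
discrete_eigs = np.exp(step * eigs)       # eigenvalues of exp(T * A)

# Left half-plane in continuous time maps inside the unit disk in discrete time,
# because |exp(T * lam)| = exp(T * Re(lam)) < 1 whenever Re(lam) < 0.
all_stable = bool(np.all(np.abs(discrete_eigs) < 1.0))
```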
Now, to implement this on a machine, we can't use continuous objects.
We have to use them as discrete.
And integrals are just big, smooth sums, basically.
They're actually nicer to deal with than functions.
And so what we'll do is we'll break that sum down into functions.
And what happens in signal processing is you think that you're going to sample at some regular frequency T.
And then what I'm denoting here is x bracket K means the kth sample, which is at the point kT.
So you're seeing this animation that the integral is just this nice smooth sum.
All right. Cool.
All right.
So now that we're in discrete land, we can relate it to more familiar machine learning concepts.
The first thing is you can view this as a recurrence.
as an RNN.
So I'll introduce notation G here, which is basically
the B times the input.
It's all the modifications on the input that we had.
And with just a little bit of arithmetic,
I can move it out so that I get x[n+1], the next hidden state,
is T times g[n] plus some term that's kind of down-weighting it.
And I'm illustrating the down weighting here
and the visualization.
RNNs are super fast.
So if we did manage to learn the weights, the B's,
the A, all the rest of these things in the filter,
then we could run this as an RNN automatically
from the same parameterizations.
Super cool.
With just a little bit more notation, I can take that E term,
the exponential there, and put that matrix exponential
into this function F.
And that becomes a convolution that's probably
more familiar to most people, which is a discrete convolution.
But notice this discrete convolution is of length n.
It's a huge, long convolution.
It's not a three-by-three convolution like a ResNet.
It's actually as long, potentially, as the filter.
So that's going to be challenging to process.
But this model says they're basically both the same.
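The equivalence of the two views can be checked numerically. A minimal sketch, assuming a generic already-discretized linear recurrence x[k] = A_bar x[k-1] + B_bar u[k], y[k] = C x[k] with hypothetical random parameters; the indexing is simplified relative to the talk's notation, and the names `A_bar`, `B_bar`, and `C` are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, seq_len = 4, 16

# Hypothetical discrete-time parameters (diagonal and stable for simplicity).
A_bar = np.diag(rng.uniform(0.1, 0.9, size=state_dim))
B_bar = rng.normal(size=(state_dim, 1))
C = rng.normal(size=(1, state_dim))
u = rng.normal(size=seq_len)

# View 1: run it as an RNN, one step at a time.
x = np.zeros((state_dim, 1))
y_rnn = []
for k in range(seq_len):
    x = A_bar @ x + B_bar * u[k]
    y_rnn.append((C @ x).item())
y_rnn = np.array(y_rnn)

# View 2: one long causal convolution whose filter is as long as the input.
# Filter tap j is C @ A_bar^j @ B_bar, built from the same weights.
kernel = np.array([
    (C @ np.linalg.matrix_power(A_bar, j) @ B_bar).item()
    for j in range(seq_len)
])
y_conv = np.convolve(u, kernel)[:seq_len]
```

Both outputs agree up to floating-point error. Building the kernel with explicit matrix powers, as done here, is the naive approach and is only meant to show the equivalence.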
So the key technical challenge is to make these SSMs fast.
Those long convs are hard.
If you think about it, that F, that filter, is huge.
And so if you materialize it at every time step, you'd be toast.
It turns out that you don't ever have to materialize the hidden state.
That's a really important observation.
That allows you to go fast and allows you to have runtime
that's proportional to the input and the output, not the massive hidden state.
The hidden state is important for representation,
but it's actually not important for implementation.
You can check out
the blog; the blog has more details about exactly how that works.
The second thing, which we spent a lot of time on, and Albert did a bunch of really brilliant
things inspired by the Legendre memory units, is how do we make that A have
that nice eigenvalue structure so that we know it's stable?
Things like diagonal matrices are really easy to keep this structure because you just,
you know, they're scalars on the diagonal, you can keep it.
But computing matrix exponentials in general for expressive classes is actually pretty challenging.
So we had to do a ton of work to get that to happen over the last couple of
years. And the last bit is this practical fast convolution that we needed. Now, I love this slide
because DALL-E 3 made most of the art in this talk or all of the art in this talk, and it made
this poster. I didn't give it the tagline. I still think it's hysterical. Too fast, too furious,
revving up the equations. I have no idea what that means. But I love it. It's supposed to be
Fourier, by the way. That's the thing. In any case, the thing is we had to do the same type of
operation that we did in flash attention, but now on FFTs and convolutions. If you naively run FFTs,
you have terrible memory behavior.
If you can somehow group them together in nice ways
and be IO-aware, you can get back
to that kind of nice utilization.
Flash attention, if you recall, was about 72% utilization.
Dan and Hermann got to 65% utilization.
I would also say that Dan's on the faculty market this year
and Hermann's on the PhD market
and you'd be smart to hire them.
They're amazing.
So the point is there's not really a hardware trade-off
after you do a bunch of work.
It's really algorithmic.
This thing is going to do a lot fewer operations.
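The algorithmic saving comes from doing the long convolution through the FFT, which costs on the order of n log n operations instead of n squared. A minimal NumPy sketch of that idea follows. It is the plain, unfused version, not the IO-aware GPU kernel discussed in the talk, and the function name `fft_causal_conv` is made up for illustration:

```python
import numpy as np

def fft_causal_conv(u, kernel):
    """Causal convolution of two length-n signals via FFT, O(n log n)."""
    n = len(u)
    fft_size = 2 * n  # zero-pad so circular convolution matches linear convolution
    spectrum = np.fft.rfft(u, n=fft_size) * np.fft.rfft(kernel, n=fft_size)
    return np.fft.irfft(spectrum, n=fft_size)[:n]

rng = np.random.default_rng(0)
u = rng.normal(size=1024)
kernel = rng.normal(size=1024)       # filter as long as the input

y_fft = fft_causal_conv(u, kernel)
y_direct = np.convolve(u, kernel)[:1024]   # quadratic-time reference
```

The two results match up to floating-point error. The hardware work described above is about making this same computation use memory well, which this sketch does not attempt.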
And this led to what
some folks, Sasha, have called an RNN renaissance.
And I want to say it's been super fun.
I have to say the last like year and a half or two years of research,
I've absolutely loved because you've had a ton of people
contributing amazing ideas like S5 and Mega and RWKV
on super technical topics that were really exciting for us to do.
And there's just been so many more that I can't put on here
and they've been pushing the state of the art.
So now you've listened to my talk and you're like,
should we use these models everywhere?
And, you know, maybe I'm a California optimist,
so I sound happy.
I know it's irritating, but I'm happy about everything, so I am.
So you're like, maybe you should use these things.
Say, well, maybe, but there's actually a pretty big gap on language.
So it was wonderful on LRA and those signal processing tasks,
but when we actually deployed it on language, there was a gap.
Now, the standard way you measure a language model is perplexity.
This is the score of how predictable the language is.
To give you a sense of this measure, S4 was five points worse on perplexity versus transformers.
And that's a staggering number, because five points is about the difference between
a 125 million parameter model and a 7 billion parameter model.
It was a big gap.
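For readers who have not computed it, perplexity is the exponential of the average negative log-probability the model assigned to each actual next token; lower is better. A minimal sketch with made-up probabilities:

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability given to each true token."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)

# A model that always spreads its bets evenly over 4 options has a
# perplexity of about 4: it is as uncertain as a fair four-way choice.
uniform = perplexity([0.25] * 8)

# A more confident (and correct) model scores lower.
confident = perplexity([0.9] * 8)
```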
So we started to wonder, why is that?
So we went back to work that other folks had done, which was amazing, this associative
recall task.
So the task here is I give you letters and numbers.
The last letter is a query in this case C, and you have to tell me which number is
associated with that letter.
It's a lookup task.
Attention can crush this, because it's a very easy lookup task.
These two variants of S4 that came out later that are supposed to be better on language were better, but there was a gap here too.
And so without going into too much detail on this piece, Michael Poli came along and did this thing, Hyena,
and he showed he could get 100% on this underlying operator and did it in a very exciting way while still maintaining speed and all the rest.
So this is what the picture looked like as of a couple of months ago, or a couple of weeks ago, I guess, two weeks ago.
You had S4, which was a bit worse, but then in quick succession,
people were coming down to this very strong attention baseline. All the baselines are released.
Eleuther made a wonderful harness. These are all at 350 million. You can start to play with these things.
And people are, RWKV has been releasing even bigger models. And so there was this baseline here.
These are closing the gap without attention. But part of the reason I love academia is you can worry about tiny problems.
It's like, well, it seems like a small problem, but why is it worse? So we kept asking, we kept poking at it,
and Simran and Sabri came in, and they actually came up with this idea. It took us a surprising amount of
time, but it's just a small twist.
The small twist is what a transformer can do is not one lookup,
but many lookups.
So what MQAR is, is multi-query.
We don't just look up one letter, we look up many letters.
Now we can worry about scaling in the letters, the vocab size,
the model dimension.
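To make the task concrete, here is a sketch of a generator for such examples. The token format (single letters as keys, single digits as values) and the function name `make_mqar_example` are assumptions for illustration; the actual benchmark format may differ. Setting `num_queries=1` gives the single-query associative recall task described earlier.

```python
import random

def make_mqar_example(num_pairs, num_queries, seed=0):
    """Build a key-value context plus several lookup queries and their answers."""
    rng = random.Random(seed)
    keys = rng.sample("abcdefghijklmnopqrstuvwxyz", num_pairs)  # distinct letters
    values = [rng.randrange(10) for _ in keys]
    table = dict(zip(keys, values))
    # Context is an interleaved sequence: key, value, key, value, ...
    context = [tok for k, v in zip(keys, values) for tok in (k, str(v))]
    queries = rng.sample(keys, num_queries)
    answers = [table[q] for q in queries]
    return context, queries, answers

context, queries, answers = make_mqar_example(num_pairs=4, num_queries=2)
```

Growing `num_pairs` and `num_queries` with the sequence length is what lets one study how model dimension has to scale to keep solving the task.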
And what we found is that all of these models can, quote,
solve the task.
But how they do it, their scaling is quite different.
This relates to a bunch of things in parallel circuit
complexity that I won't get to.
But this is a really interesting thing where we can start
to study the scaling.
And so what they realized is that attention
can solve these things with a small number of dimensions, roughly logarithmic,
whereas Hyena and RWKV, and all the convolutional models,
as a result of their reduction, require model dimensions that scale with the sequence length.
And so you get charts that look like this.
They'll solve it, but they need more capacity to do so.
So when we started looking at these MQAR things in the wild,
we started thinking, well, okay, MQAR is a nice synthetic, but does it translate?
This was really insightful.
Simran and Sabri did this.
They said, we're going to take the Pile, and we're going to segment
out which ones are AR-like, which sentences are AR-like. So these are things that have repeated
bigrams. Common buzzard is repeated twice. There's kind of an implicit lookup the second time you're
doing the common buzzard task. That's about 7% of the Pile. The non-AR slice was basically everything
else. What they found is that the attention gap, 82% of it, was explained, even though this is a
pretty rough proxy for the task, by just what's going on here. And this made us think,
maybe if we solve this task, we can even close it.
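A rough sketch of what "AR-like" could mean operationally: flag a piece of text if any adjacent pair of tokens occurs more than once. The whitespace tokenization and the example sentence are assumptions for illustration; the exact slicing criteria used on the Pile are not given in the talk.

```python
def repeated_bigrams(tokens):
    """Return the set of adjacent token pairs that occur more than once."""
    seen, repeated = set(), set()
    for pair in zip(tokens, tokens[1:]):
        if pair in seen:
            repeated.add(pair)
        seen.add(pair)
    return repeated

sentence = "the common buzzard is a bird and the common buzzard hunts".split()
hits = repeated_bigrams(sentence)
is_ar_like = len(hits) > 0
```

The second time "common" appears, predicting "buzzard" amounts to looking up what followed it earlier in the context.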
But the other observation was,
actually these convolutional models are slightly better
on the non-lookup task.
So maybe there's hope to go beyond them.
And so we started this kind of architecture,
and I want to give another shout out here to a paper I love.
I love the T5 paper, I'm sure many of you do too.
I love the vibe of it where it was like,
hey, we just want to say,
what are the common elements that are going on?
If you're outside this little tiny sub-community,
all the papers look very, very different.
But if you're inside, I would say
there's a couple of really common themes
and Simran and Sabri tried to boil them down
so that more folks can participate
and come into the field in a more easy way.
The themes are long convolutions,
convolutions that are scaling with the input,
not necessarily the full input size.
Gating is a wonderful idea
that's multiplying in this kind of component-wise way
in the sequence. That's an old idea.
And data dependence.
And Mamba just came out from Albert and Tri,
which did this and still kept that sub-quadratic runtime.
Based is basically just simplifying
all of the things that people are doing
and trying to get to something nice.
We don't have T5-level niceness yet, but we are inspired by them.
One thing I want to point out is that this new convolutional architecture does scale for MQAR a little bit like attention.
So it has the same kind of dimension scaling that the others had, which is interesting.
So the point is, very recently, this is in the last week's run-up to NeurIPS,
both Mamba and Based, and I'm sure five others will come out in the next couple of weeks,
are now attention-free and actually getting you lower PPL at 350.
It doesn't mean they're going to get you lower PPL necessarily at, you know, 100 billion,
but it's interesting to say there doesn't seem to be any fundamental kind of block,
and that's, to me, extremely exciting.
I did want to point out a little bit that, you know,
there is another bottleneck that's lurking for truly subquadratic models.
We talked a lot about the signal processing part, but there's also this MLPs,
and I've become obsessed with them.
There's a whole line of work, check out Dan Fu's talk,
about trying to understand what's going on with the MLPs,
and can we slim those down?
they become a bottleneck at much larger dimension sizes.
So the questions that were driving our work really were threefold.
I shared with you, I hope, a little bit,
about how foundation models change the systems that we're building.
I also talked a lot about how classical ideas
from signal processing and databases were interesting bits of canon
to bring into the field so that maybe we can make these models more efficient.
What I thought I would end with is just why I think there's such a bright future in AI
and systems, for two minutes.
The first thing is, we weren't using these models really 15 to 18 months ago
in the way we're using them now.
We knew intuitively that you train them once
and use them multiple times,
but it's not really clear we were doing that.
We were kind of just showing them to each other, if we're honest.
Now, people are using them on like a daily basis.
And inference has become an unbelievable task.
I would say even the last three or four months,
the speed of inference, if you watch on a bunch of the commercial servers,
are just going through the roof as new ideas come in.
Of course, people were thinking about this:
MQA and GQA a while ago.
Speculative decoding was an amazing paper.
vLLM was really exciting.
Flash decode, MatFormer.
There's a ton of exciting work here.
My point is, this really kicked off like six months ago.
Wild to think about.
But that's the whole thing.
Another bit is there's a big difference between low latency systems and high throughput systems.
When you don't care if it returns in a couple of milliseconds, but you want to say run on a hundred
different documents, or a million different documents.
We're just at the outset of seeing that systems shift as people are actually using these
foundation models on all the back-of-house, data-cleaning-ish tasks that I think are going to
happen the next while.
There's new data types.
I do want to call out that there's all kinds of things you could worry about from Kuhnlae,
about how to program these systems, what's the right accelerators and hardware, that's just happening.
What are the right systems to build that are systems of record underneath the covers?
There's tons of stuff.
Yep, I gave Chris a little bit more time there because he's such a legend,
and he covers so many different concepts and updates and models in such a small amount of time.
So his time is very high quality, and you should watch the whole talk if you get the opportunity.
But that's it for our coverage of NeurIPS 2023.
It's just a ton of papers.
We are going to follow up with a lot of the startups that I encountered and met,
a lot of which are returning guests.
So keep a look out for that.
But also, thank you so much for listening in on this.
It's an experimental new format.
We grabbed a whole bunch of audio spliced in, you know, live interviews or stage talks
and some of my own commentary with a little bit of backing music.
It's an experimental new thing.
Like, did you like it?
Let us know.
If you liked it, share it with a friend; that would help us a lot.
And also, just remember, we have a listener survey going on.
So please come to our website and fill out our survey.
Thanks and see you at the next NeurIPS recap.
DJ QD outro.
