Latent Space: The AI Engineer Podcast - Better Data is All You Need — Ari Morcos, Datology
Episode Date: August 29, 2025

Our chat with Ari shows that data curation is the most impactful and underinvested area in AI. He argues that the prevailing focus on model architecture and compute scaling overlooks the “bitter lesson” that “models are what they eat.” Effective data curation, a sophisticated process involving filtering, rebalancing, sequencing (curriculum), and synthetic data generation, allows for training models that are simultaneously faster, better, and smaller. Morcos recounts his personal journey from focusing on model-centric inductive biases to realizing that data quality is the primary lever for breaking the diminishing returns of naive scaling laws. Datology’s mission is to automate this complex curation process, making state-of-the-art data accessible to any organization and enabling a new paradigm of AI development where data efficiency, not just raw scale, drives progress.

Timestamps
00:00 Introduction
00:46 What is Datology? The mission to train models faster, better, and smaller through data curation.
01:59 Ari’s background: From neuroscience to realizing the “Bitter Lesson” of AI.
05:30 Key Insight: Inductive biases from architecture become less important and even harmful as data scale increases.
08:08 Thesis: Data is the most underinvested area of AI research relative to its impact.
10:15 Why data work is culturally undervalued in research and industry.
12:19 How self-supervised learning changed everything, moving from a data-scarce to a data-abundant regime.
17:05 Why automated curation is superior to human-in-the-loop, citing the DCLM study.
19:22 The “Elephants vs. Dogs” analogy for managing data redundancy and complexity.
22:46 A brief history and commentary on key datasets (Common Crawl, GitHub, Books3).
26:24 Breaking naive scaling laws by improving data quality to maintain high marginal information gain.
29:07 Datology’s demonstrated impact: Achieving baseline performance 12x faster.
34:19 The business of data: Datology’s moat and its relationship with open-source datasets.
39:12 Synthetic Data Explained: The difference between risky “net-new” creation and powerful “rephrasing.”
49:02 The Resurgence of Curriculum Learning: Why ordering data matters in the underfitting regime.
52:55 The Future of Training: Optimizing pre-training data to make post-training more effective.
54:49 Who is training their own models and why (Sovereign AI, large enterprises).
57:24 “Train Smaller”: Why inference cost makes smaller, specialized models the ultimate goal for enterprises.
01:00:19 The problem with model pruning and why data-side solutions are complementary.
01:03:03 On finding the smallest possible model for a given capability.
01:06:49 Key learnings from the Arcee foundation model collaboration, proving that data curation “stacks.”
01:09:46 Lightning Round: What data everyone wants & who should work at Datology.
01:14:24 Commentary on Meta’s superintelligence efforts and Yann LeCun’s role.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
Transcript
Hey, everyone. Welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel.
And I'm joined by swyx, founder of Smol AI.
Hello, hello. And we're so excited to be in the studio
with Ari Morcos, CEO and co-founder of Datology. Welcome.
Thank you so much for having me.
Ari, so you first came across my radar... I mean, I guess Datology is a relatively
exciting, well-hyped startup, at least with the fundraising and the higher profile of the people
that you hire. I reached out to book this interview after you worked on the Arcee... I don't even
know how to pronounce it. Arkee?
Arcee. Yeah, it's inspired by a real Transformer that was called Arcee.
Yeah, the Arcee foundation models.
You guys have been doing a lot of data work.
How would you describe Datology today?
Yeah, so our mission at Datology is to take everything around the data side of machine learning, right?
So going from, you have a bunch of data sitting in storage, to you're going to feed it into a model, you know, via a data loader.
There are a ton of choices you would make in that process, ranging from how you're going to filter the data, how you're going to sequence the data, what
synthetic data you're going to generate, if any, how you're going to batch the data, all of those
things. And those will have a tremendous impact on the performance of the model that you train on the data.
One of my favorite catchphrases is models are what they eat. If you show them great data,
they're going to be really high quality. If you show them low quality data, they're going to be
low quality. But this is a frontier research problem. How do you actually do this effectively?
How do you do this automatically at scale? It has to be automatic to be able to process trillions of
tokens, billions of images, things like that. And that's our mission at Datology. It's to take
that whole process and make it really easy so that anybody can get access to state-of-the-art data
curation without needing to be an expert themselves.
And in doing so, help the folks we work with to train models much faster to much better performance
and to also help them train much smaller models to the same or better performance,
which I actually think is some of the most exciting stuff going forward.
But fundamentally, that's what we do at Datology: help people curate their data so they can
train models faster, better, smaller.
So the key words for that: data curation as a service, data efficiency, all those.
In the pre-chat, before we started recording, you mentioned that there's a cool story around
how you got into data in the first place, right? You were at GDM, you were at Meta as a research
scientist. Describe how that became an interest.
My PhD is actually in neuroscience. So I come
much more from an empirical science sort of background. I actually spent time trying to teach mice
how to count and then analyzing the activity of thousands of neurons in the brain while the mice
counted, to try to understand how that actually happened. What were the neural dynamics that
enabled that? And that's actually how I initially got into machine learning, as a means to
analyze my neural datasets. I also started my PhD in 2011. So AlexNet came right after that,
Atari DQN right after that. Lots of evidence that AI was going to be very, very exciting, which led to
me transitioning. But as a result, because I had this kind of somewhat different background of being
trained as an empirical scientist rather than as a computer scientist, my real first mission when I
joined AI was to try to build more of a science of deep learning. Something that I think, you know,
is still true today in many cases is
that deep learning is an empirical science,
but most people that have computer science backgrounds
were trained more in the context of a branch of theory.
Everything was very provable.
That was the initial pushback to deep learning, actually,
was that you couldn't prove anything in it.
But deep learning is, at its core, an empirical science, right?
We have to run large experiments.
We understand the rules for how we design these systems,
but the properties that come out of them
when we actually train them on a ton of data
are emergent and unexpected.
So I always really wanted to write these papers
where they had two halves, where the first half of the paper was trying to understand why is this
representation desirable or undesirable? Why is the model good or bad? And then understand that
and then use that understanding to then improve the model. And that was always my goal. That was kind of
the perfect paper. Rather than just throwing spaghetti against the wall and seeing what stuck,
we were able to really understand why something didn't work and then use that understanding to
improve it. Unfortunately, it turns out that it's not so difficult to do the first half of that,
to try to understand the system, but really, really difficult to actually use that understanding
to improve the system.
A lot of the time what happens is you find, hey,
here's this property of representations that makes models good.
You go and you optimize for that, and then it turns out that wasn't a causal variable.
That was a correlate.
And it doesn't actually work.
So I maybe wrote 30 papers where we did that first half and maybe only three or four
where we did that second half.
And that was always kind of frustrating and dissatisfying to me.
And then around 2020, I had several papers that all kind of slapped me in the face at
the same time with the same insight, which is that all that really matters is the data. And I had come
into all three of these papers very much focused on inductive biases. How do we put better inductive
biases into models, either through changing the objective or through changing the architecture,
which is where most of the field was, and still where you see a lot of the papers at the big
conferences are about architectures and various tweaks to architectures. But I had these multiple
papers, all of which made this clear takeaway that the data is the only thing that matters.
I'll give you one example. There's a paper we had called ConViT, where the
idea was to take a vision transformer and initialize it as if it was a convolutional neural
network.
And that way, you could actually start with this inductive bias of convolution, but the model
could choose to unlearn it if it wanted to.
So the idea was it was a soft inductive bias, not a hard inductive bias.
ConvNets have a hard inductive bias.
You can't not be convolutional in a ConvNet.
But in this case, you initialize the transformer that way, and then if it wants, the model
could learn not to be that.
And the idea here was that this would be really helpful for models to give them this
inductive bias, but then they could learn not to use it if they didn't want to.
Just to follow up, there's a one-to-one mapping of a ConvNet to a transformer, and you can map it directly onto the
weights?
Exactly. You can map it exactly, it turns out. Say you have a three-by-three kernel: you can have nine heads, and each head corresponds to a
different part of that kernel. And then you can initialize it so that it is exactly convolutional.
So it's like a very coarse thing that can then be refined with training?
Exactly. And then
it can choose to change its weights so that it can undo the weight tying that you impose on it this way.
We actually had a follow-up paper which showed you could take a trained network and actually instantiate a trained CNN as a ViT as well.
So there's a way to do this.
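As a rough illustration of the mapping described above, the sketch below checks that nine attention heads, each attending only to one fixed offset of a three-by-three neighborhood, reproduce a three-by-three convolution. This is not the ConViT implementation. It assumes a single-channel image, a scalar per-head weight standing in for the value projection, hard one-hot attention (a limit that softmax attention can only approach), and zero padding at the borders. The function names are hypothetical.

```python
from itertools import product

# The nine (dy, dx) offsets of a 3x3 kernel; one attention head per offset.
OFFSETS = list(product((-1, 0, 1), repeat=2))

def conv3x3(image, kernel):
    """Plain 3x3 cross-correlation with zero padding on a 2D list of floats."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y, x in product(range(h), range(w)):
        for dy, dx in OFFSETS:
            yy, xx = y + dy, x + dx
            if 0 <= yy < h and 0 <= xx < w:
                out[y][x] += kernel[dy + 1][dx + 1] * image[yy][xx]
    return out

def positional_attention(h, w, offset):
    """One-hot attention matrix: each query pixel attends only to the pixel at `offset`.

    Rows whose target falls outside the image stay all-zero (idealized zero padding).
    """
    dy, dx = offset
    n = h * w
    attn = [[0.0] * n for _ in range(n)]
    for y, x in product(range(h), range(w)):
        yy, xx = y + dy, x + dx
        if 0 <= yy < h and 0 <= xx < w:
            attn[y * w + x][yy * w + xx] = 1.0
    return attn

def attention_as_conv(image, kernel):
    """Sum of nine position-only attention heads, initialized from the kernel."""
    h, w = len(image), len(image[0])
    tokens = [v for row in image for v in row]  # flatten pixels into a token sequence
    out = [0.0] * (h * w)
    for dy, dx in OFFSETS:
        attn = positional_attention(h, w, (dy, dx))
        head_weight = kernel[dy + 1][dx + 1]  # stands in for the head's value projection
        for q in range(h * w):
            out[q] += head_weight * sum(a * t for a, t in zip(attn[q], tokens))
    return [out[r * w:(r + 1) * w] for r in range(h)]
```

Because the attention pattern here is just an initialization, training would be free to move each head away from its one-hot pattern, which is the "soft" part of the inductive bias.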
Turns out in the small data regime, and when I say small data here, I mean, say, less than 500,000 data points.
And this was in the context of image self-supervised learning.
So in that small data regime, this is super helpful.
And where this paper has actually been cited is a whole bunch of kind of niche scientific problems where there's very little data.
For example, volcano prediction, where you have like 1,500 data points or things like that.
But the advantage of using this soft inductive bias decays as the data size increases and eventually
actually becomes harmful.
So if you see enough data, and the threshold at which this changes is around like a million
data points.
So it's not massive by any stretch, by our current standards.
So basically, once you get past a million data points, that soft inductive bias no longer
helps you.
And it actually now is mildly harmful.
So I had this paper and a couple other papers that all kind of made this same point that,
basically, you know, when you get to enough scale, inductive biases matter not at all.
All that really matters is the learned posterior from the data distribution.
And that's really what defines everything.
And then, of course, the rise of the transformer really showed that actually starting with
models that have fewer inductive biases built into their architecture, you know, is the right thing.
So we had this kind of, the combination of factors, which ultimately, like, actually was very,
very confronting for me, because I had spent the last six years in my career working on inductive
biases. And now I'm faced with, you know, several different papers, all of which show me that,
hey, what you've been working on isn't actually really that important.
Bitter lesson pilled.
Bitter lesson, indeed. So, you know, the bitter lesson was indeed very bitter for me. And, you know,
that was really my, you know, inculcation in it, I suppose, where at the end I kind of thought to myself,
okay, clearly the bitter lesson is true here. What should I do in this new world? You know, and it became
clear to me that there are really two options that made a ton of sense. Either go work on making
GPUs go brrr, and I'm not a hardware engineer, I don't know how to make GPUs go faster,
or work on data. And for a whole bunch of reasons, data has been dramatically underinvested in
relative to its impact. Something I've said before, and I'll say again, is that data is the most
underinvested area of research relative to its impact, and I don't think it's even close. And
there are a whole bunch of reasons for this, which we can go into, some of which have to do with
the culture of machine learning, some of which have to do with the incentives that have been
set up, but data has systematically generally not been considered. And even if you go and you look
at the scaling laws work from Kaplan and Chinchilla and all these other things, they all assume IID
data, which is insane. We know that all data are not created equal, that garbage in, garbage out
is like the oldest adage in computer science. And yet all these scaling laws assume that all
data is created equal. That makes no sense whatsoever. That's what led me to start working on this
problem. And it turns out that there's a really cool thing about data research. In addition to it being
something that's impactful relative to the investment, which makes it a great research area and
makes it an even better company. What I'd said previously was that with representations, you have this
disconnect where there are the questions which are kind of scientifically interesting about understanding
why a representation is good. And then the questions that are practically relevant, how do I use this
to improve it? And I think what was so frustrating to me early in my career was that those were
different questions a lot of the time. The questions that I wanted to ask, which were
curiosity-driven and really interesting to me as a scientist, ended up often not being the ones
that were practically relevant downstream.
But it turns out with data, this is no longer true.
With data, if you can understand what makes a given data point useful
or what makes a given data point not informative,
you can almost always use that insight to make a data set better
and therefore make a model better.
So what this means is that the set of questions which are scientifically interesting
and the set of questions which are practically relevant in data research
are largely the same questions.
And that's really rare to find in research,
period. And what this means is that we can ask the questions which as scientists are extremely
motivating to us, but then have very high confidence that the answers to those questions are
going to help us to build models that train much faster, that train to much better performance,
and that can train with far fewer parameters. So that's a little bit of a high-level view of
kind of how I got into the data problem, and I think the pain that I had to go through to get there
in the first place.
You mentioned something about the incentives in data not being aligned.
Can you unpack that? Because I think from the outside, you have companies like Scale that
have obviously become super successful. So people are investing a good amount of money. But what you're
basically saying is, like, you know, NVIDIA is like $4 trillion and Scale is not $4 trillion. So why do you
think there's that inefficiency?
Okay. So first off, we have to divide the research community from
the industrial community because I think they're very different. And I think in general,
data work has been far more valued in industry consistently than it has been in the research
community. First and foremost, part of this is that data work has just often been considered
a second-class citizen sort of work. It's the grunt work. It's the plumbing. It's the stuff that,
you know, you don't want to work on as a, you know, super hoity-toity scientist. There are even
some tweets recently going around people saying, you know, data cleaning is boring, it's low-value
work. Whereas I think what you'd find is that if you talk to the most talented AI researchers and
you ask them, what's the secret to your success, they'll largely tell you that they look at the data.
Ultimately, these models are a reflection of the data that you show them.
And yeah, it can be tedious.
It can be challenging.
But it is so critical to get this right.
So I think first off, there is this general perception that this is lower quality work,
or not quality, but lower prestige work.
And that's been there for a long time.
I think part of this had to do with the way that research incentives were set up.
The data set was viewed as the given.
So if you think about research circa, say, 2018, given ImageNet,
maximize performance on the val set or on the test set, right? But the dataset, ImageNet, was given
as something you don't change. Even Kaggle had this framework, right? Given the dataset, go and make this
better. People might try things like bootstrapping or stuff like that. But generally the assumption was
you're going to improve the model through better modeling, not through improving the data set.
And part of this also was just in the supervised learning era, this made sense, right? We generally
weren't compute limited. We were generally very data limited, right? Data was very scarce.
Like, if you want to assemble ImageNet, you have to go to MTurk and get a whole bunch of people
to label the dataset.
And then there's generally some quality floor, right?
Because a human has looked at every data point in this data set.
Even if there are still a lot of errors there, at least it's not going to be as bad as just
a scrape of the internet.
But then in 2019, the field underwent this pretty massive change, right?
We figured out how to train without labels.
And one of my, like, more controversial viewpoints, I think, is that I think the transformer
is a great advance to be sure.
But I think it's one of a very large set of equivalently good architectures that we could have found.
And there are many, many ways we could get to the same performance without the transformer.
But I do not think there's any way we could get to where we are today without self-supervised learning
and the ability to train on unlabeled data.
That was the real advance to my mind that enabled us to get these incredible increases in capabilities.
Which is like the masked objective?
It's not just masking objectives.
I think the masked language modeling objective is one,
but even next-token prediction, right?
But generally this notion that, hey,
instead of having to get an external label from a human,
we can ask the model to predict one aspect of a data point
from other parts of that data.
And that is really powerful.
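To make "predict one aspect of a data point from other parts of that data" concrete, here is a minimal sketch of how training targets can be derived from raw text with no human labels, once for next-token prediction and once for a masked objective. The whitespace tokenization, the `[MASK]` string, and both function names are illustrative assumptions, not any particular library's API.

```python
def next_token_pairs(tokens, context_size):
    """Build (context, target) pairs: each token is the label for the tokens before it."""
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_size):i]
        pairs.append((context, tokens[i]))
    return pairs

def masked_pairs(tokens, mask_positions, mask_token="[MASK]"):
    """Hide the chosen positions; the hidden tokens become the prediction targets."""
    masked = [mask_token if i in mask_positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in mask_positions}
    return masked, targets

tokens = "to be or not to be".split()
pairs = next_token_pairs(tokens, context_size=3)
masked, targets = masked_pairs(tokens, {1, 3})
```

In both cases the supervision signal comes from the data itself, which is why the amount of usable training data is no longer bounded by how much a human can label.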
Because think about it, right?
That meant that we went from ImageNet,
a million data points,
to literally trillions of tokens,
a million-fold increase in data quantity
in a matter of, like, several years.
That's completely unheard of.
And that also changed everything.
Because now we went from data being scarce
and having a high quality floor,
to now all of a sudden, data is absolutely massive.
All of our models are basically always underfitting the data,
whereas previously we would do 160 epochs on an image data set, right,
where they would all be overfitting the data generally.
So now we move to this underfitting the data regime.
There's no more quality floor.
And now we have all of these problems with redundancy,
with low quality, with low information gain,
all these various things that come with these massive unlabeled data sets.
So I think the problem also changed pretty dramatically.
from the 2010s to the 2020s.
And I think that's what makes it so exciting as a scientific question,
is that this didn't really make sense to study prior to 2020.
But now this makes tremendous sense and is, I think,
absolutely critical for us to solve in order for us to enable these models
to continue to improve and also to enable the cost effectiveness of these models
so that they don't just stay as something that's only possible to achieve
if you have hundreds and hundreds of millions of dollars.
Making the data better can be a massive,
massive compute multiplier. It can change the performance per dollar by orders of magnitude.
And in many ways, that's our whole goal is how do we make that easy and effective for everyone?
Totally. And you were at Meta from 2018 to September '23, which is during both Llama 1 and Llama 2.
At what point inside of Meta did some of these learnings become apparent, like, okay, we should start to spend resources working on this?
You mentioned 2020, so I'm wondering if that was like...
I think Llama 1 was already a big breakthrough.
Yeah, Llama 1 definitely put more effort into data filtering, I think, than many others,
and definitely started to change this.
But even then, I would say that actually, you know, even when I left Meta,
the idea of actually curating the data
to figure out what's the high-quality, high-value data
was, I think, still fairly underappreciated.
And if you talk to a lot of the folks on the data teams within the big frontier labs,
what you'll find is that they've actually invested really heavily in crawling.
Oftentimes, they've really worked on getting better crawlers.
trying to clean up the source of the data that's coming in, which makes sense.
But ultimately, you know, I think what you really need to do is you need to take this perspective
of given everything that the model has seen so far, and given a potential candidate set of data,
what data point is going to teach the model the most the next time it sees a data point.
And that's a pretty different framing for how to think about this problem.
And I think there's certainly been some great work done, although it's all secretive within, I think,
the bigger labs.
But that's a really hard problem.
That's a frontier research problem.
And I don't think we still know how to solve that.
I think data curation also is a hard problem to solve, quote unquote,
because it's not one where there's a single silver bullet.
There's not just do this one trick and all of a sudden things work.
It's rather, here are these 50 different things that you can do,
each of which provides a pretty modest gain on its own.
But then if you can figure out how to make them combine,
you then get a really big gain.
But you have to figure out, first off,
what are all these different things you want to do?
And then, two, how do you make them play nice with each other?
Because by default, they don't play nice with each other.
Yeah. I'll make a quick observation on, you mentioned self-supervised learning. I definitely agree. Like that, just getting rid of labels altogether is great. Or forming your own labels, right? And I have a general observation that I think that extends to things that are not just learning. So self-supervised, I don't know, optimization, self-supervised neural architecture search, self-supervised curation. If you can just automate everything, I think that's the lesson really. Just like just get the machines to do it because we are the rate limiters if we must label everything.
Yeah, I think this is very true.
It's actually something I think about a lot: are we actually falling prey to the bitter lesson again here by trying to have human-guided methods of data curation?
Probably the best open effort on data curation is DCLM, DataComp-LM.
It was led by Ludwig Schmidt, a professor at Stanford, and about 30 students across many different institutions.
Really wonderful effort to kind of curate Common Crawl-style datasets.
Yeah, we've actually covered DataComp
and DCLM on the podcast.
Awesome. Great.
But DCLM had a really cool study at the end of the paper that I don't think gets nearly enough attention as it should.
So, okay, so they have these 30 grad students spend, you know, two years, basically,
trying to design what are the optimal filtering criteria for these models, right?
And they built a system that's pretty good at this, right?
So then they asked all those students predict what that system is going to do.
So given a data point, is the system going to say keep the data point or is it going to say reject the
data point. These are nominally the best experts you could ever hire to do this. These are students
who have just spent all of their time looking at NLP data for two years. They could not predict
what the DCLM classifiers would say above chance. So, you know, this comes up a lot of times
where people often ask me, how can you possibly do this without a human in the loop? You know,
it just seems impossible. You need to have a human to actually rate these data. But I think that,
you know, what the takeaway from that study is, and I think there are a number of other pieces of
evidence that also suggest this, is that obviously we have to be automated because humans just
can't scale to billions of data points, trillions of tokens, just not possible. But even if we could,
we actually wouldn't want that. Humans are not good at this task. And to give an intuition as to why
humans aren't good at this task, I think the easiest way to think about this is that the value of
a data point is not just a function of that data point itself. It's rather a function of how that
data point relates to every other data point in the training set. Right. So if I have 10,000 copies
of slightly variable summaries of Hamlet, I don't need all of those. But if I were to look at any
one of those individual summaries, I might say, hey, this is really high quality. This is really
accurate. It tracks all the characters. It's well written. It's clear. But I don't need 10,000 of those.
And that's just a task that a human would never be able to do because a human can't keep the
whole data set in their head, obviously. So even if you could have this scale with
humans, you wouldn't want to.
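The point that a data point's value depends on what else is in the dataset can be illustrated with a naive near-duplicate filter: whether a document is kept depends on what has already been kept, not on the document alone. This is only a sketch. Word-trigram Jaccard similarity, the 0.5 threshold, and the function names are assumptions chosen for illustration, and the pairwise comparison would not scale to web-sized corpora as written.

```python
def shingles(text, n=3):
    """Set of word n-grams used as a cheap fingerprint of a document."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def jaccard(a, b):
    """Overlap between two shingle sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def greedy_near_dedup(docs, threshold=0.5, n=3):
    """Keep a document only if it is not too similar to anything already kept."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc, n)
        if all(jaccard(s, k) < threshold for k in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept

docs = [
    "Hamlet is a prince of Denmark who seeks revenge for his father's murder",
    "Hamlet is a prince of Denmark who seeks revenge for his father's death",
    "Elephants are large gray mammals with trunks and tusks",
]
kept = greedy_near_dedup(docs)
```

Each Hamlet summary looks fine when read in isolation; the second one is dropped only because the first is already in the kept set.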
But so what's the right number between 1 and 10,000?
The unsatisfying answer is it depends.
But it's also the right answer.
So it depends on how complex the concept is.
So redundancy is really useful, right?
And like removing all redundancy is a bad thing.
If I remove all redundancy, then I'd only be able to understand, say, a golden retriever
in the one situation that I've ever seen it in before.
I wouldn't be able to generalize and that would be bad, right?
So some redundancy is good, but I think we all have the intuitive understanding that
infinite redundancy is not good, it's bad. So where is this line for different concepts?
Well, one example I like to give for this is elephants versus dogs. So elephants are pretty
stereotyped. There are two kinds of elephants in the world. They're Asian elephants and African
elephants. They're all gray. They all have floppy ears. They all have a trunk and some tusks.
They all have, you know, wrinkly skin. African elephants are bigger than Asian elephants,
but largely they're all pretty similar. There's not too much variability. So I don't need
that much data or that much redundancy to understand the concept of elephants, you know,
fully and completely. But dogs, on the other hand, are totally different, right? Dogs are super
variable. There are hundreds of breeds, not to mention all the mixes of different dog breeds,
they're different shapes, sizes, textures, colors, all of these different things. The amount of
data that I need in order to properly understand dogs is going to be a lot higher than the amount
of data I need to understand elephants. So this gets at some of the challenge when you're actually
trying to do this sort of curation, at least on the filtering side. First off, you
don't get a dataset where you're given, hey, these are a bunch of dogs, these are a bunch of elephants. Instead you just get, here's a bunch of data, right?
So first off, you have to, in an unsupervised way, discover what these concepts are.
Then use something about each concept to make some inference about how complicated or complex it is, and therefore how much data you need to understand it.
Figure out, okay, this is a really complicated concept, I probably should keep a lot of redundancy; this is a really simple concept, I don't need that much redundancy.
And then make the appropriate choice of what you want to remove. So this is, I think, where a lot of the challenge
comes from, but these are the sorts of factors that you have to keep in mind when you're trying
to design these systems.
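One very rough way to picture the elephants-versus-dogs idea is to give each discovered concept a keep budget that grows with how variable it is. The sketch below assumes the clustering has already been done and that each example is a small embedding vector. Using mean distance to the centroid as the complexity proxy, and splitting the budget proportionally, are illustrative assumptions, not Datology's method.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def spread(points):
    """Mean Euclidean distance to the centroid: a crude proxy for concept variability."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)

def keep_budgets(clusters, total_budget):
    """Split total_budget across clusters in proportion to their spread.

    Every cluster keeps at least one example and never more than it has,
    so the budgets may not sum exactly to total_budget.
    """
    spreads = {name: spread(pts) for name, pts in clusters.items()}
    total = sum(spreads.values())
    budgets = {}
    for name, pts in clusters.items():
        share = spreads[name] / total if total else 1 / len(clusters)
        budgets[name] = min(len(pts), max(1, round(total_budget * share)))
    return budgets

clusters = {
    "elephants": [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]],  # tight cluster
    "dogs": [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0]],        # highly variable
}
budgets = keep_budgets(clusters, total_budget=4)
```

The granularity of the clusters themselves is a separate knob that this sketch takes as given.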
How do you draw the line of a concept, though, right?
Like, because then it's like, well, the elephant and the dog, but what about mammals?
And then what about, you know what I mean?
It's like, how should people think about it?
Maybe that's why you need the technology, because it's hard.
Yeah, no, I think that's right, to some extent.
I mean, look, it's an empirical question, like all things are, right?
With every dataset, you can choose a different level of granularity.
Ultimately, it's a hyperparameter.
It's a knob that you can tune, right?
for how aggressive are you going to be
with respect to creating new concepts
versus keeping concepts together.
And it's one of these things where, you know,
I think to your point,
it's why we've run hundreds and hundreds
of thousands of experiments
to try to figure this out.
I think, you know, this is something
where it requires just a lot of experimentation
to understand how to do this.
And I think one of the challenges we have
is not only do we have to make this
so that this works on one data set,
but we also have to build a system
that can automatically adapt
to any arbitrary data distribution
and be able to make the appropriate inferences,
you know, zero-shot on a new data distribution.
So we kind of have these two sets of questions.
First off is like, how do we push the frontier of data curation forward?
And then second of all, how do we do out of distribution generalization where we say,
hey, we have this great data curation approach.
How do we make sure that this generalizes to a novel data distribution?
I don't know if this is like a good time, but I was going to ask for like a brief history
of data sets.
It might be too much.
I don't know.
I'll just list off because we've done the datasets 101 episode.
I think that was like one of our earliest episodes by far, because
we want people to know the datasets.
And I think everyone starts at Common Crawl.
I think every lab has their own web scrape.
Would you say that's true?
Or do they start from Common Crawl?
At this point, yeah.
I think, like I said, this is where most of the labs, I think, have actually invested most of their time and effort.
Yeah, yeah.
Is in building better versions of Common Crawl for themselves.
Yeah.
I'll just name check some of these.
If you have commentary, just, you know, just chime in.
GitHub, the source of code, maybe Stack Overflow, even though that's cut off these days.
I don't know.
Do people get code from anywhere else?
I mean, I think they're obviously places where you buy code data, but for public code, I think those are the most common.
Yeah.
I think some interesting things about those that I just personally find surprising.
Stars are not a good predictor of whether data is useful for models or not.
Not surprised.
Like, I think that's, like, the most popular repos are not necessarily higher quality, at least with respect to whether they improve models' coding capabilities.
You've ablated this.
I haven't done it, but the StarCoder paper has done it, and there have been a couple other papers that have all shown that.
It's something that I just consistently have found to be a little bit surprising.
And there's a lot of things that are kind of counterintuitive about data curation.
Did they, this shows that I haven't read the paper, but did they find anything good?
That was like a sign of a good codebase?
There wasn't anything that was super predictive.
Oh, man.
Like, honestly, in some ways, like, some of them were length.
Like, some of these like simple heuristics actually ended up being better.
But nothing was super discriminative there, which is kind of interesting.
Okay, cool.
I'm going to keep going.
arXiv, which is, you know, GitHub for papers.
Books: Books 1, Books 2, and obviously Books 3, controversial.
I think Anthropic is getting sued over Books 3.
Yeah, I think a bunch of people are getting sued.
Meta is also being sued over Books 3.
In some sense, like, can we just like look past it?
I don't know.
It's like books are a transformative use.
Like, I don't know if you have a view on this.
Well, I think the recent ruling was interesting, although it was an appellate court ruling,
so presumably it's going to go to a higher court afterwards.
But what they ruled was that it's fair use so long as you purchase the book.
So, you know, you can't download Books 3 and then use it, because that's
piracy and you've stolen the books in the first place. But if you bought a copy of all of those
books, then you can train on it. And then it just counts as fair use, which I think is an interesting
and, to me, pretty reasonable line there. One fun thing about Books 3 is that it also
has like a lot of not safe for work stuff in Books 3, which is kind of interesting if you actually
go and look through it. There should be a Stripe one-click checkout with, like, Books 3.
Just buy Books 3 and then get a warehouse and then get the ball, get the motion.
I wonder what the cost would be.
I'm sure somebody has run the numbers.
I'll look it up.
I don't know if you can comment on this at all, but in the Meta lawsuit, I remember there was
an email thread with some of the research scientists inside of Meta talking about Books 3,
and Zuck was like, just do it.
This is public, right?
Yeah, that was, I think, public and part of the lawsuits.
Yeah.
Any reflections, comments?
All I can say is that when I was at Meta, certainly legal stuff around data sets was
very challenging and becoming increasingly challenging.
And there are a number of situations where
the only person that could approve things was Zuck because of the scale of the risk, I think.
But it definitely made publishing at Meta near the end more challenging around just what we could do with any data set.
Because, I mean, realistically, companies like Meta and OpenAI and Anthropic are big targets for these lawsuits.
Yeah. So my conspiracy theory for what happens to Llama 4 is the lawyers got to it.
The lawyers got to the datasets. And they had to change what they use.
Yeah, they were just, like, tied behind their back when other labs were not, because Meta had an active
lawsuit. I think that's possible. I think probably more of it just has to do with the challenges of
just continuing to scale and having that be the goal. Like, this is actually a lot of the reason why
I got into data and started Datology was that the scaling laws always were terrible. What the
scaling laws paper showed was that there was a predictable relationship between. The Kaplan one.
Yeah, the Kaplan one. There's a predictable relationship between performance and compute and data, right?
That's really useful. But it was a bad predictable relationship. Power law scaling is terrible. It means
that every time you 10x your data, you get a diminishing marginal return on performance.
You know, this is why you had these prognostications. Oh, you know, GPTN is going to cost,
you know, a trillion dollars to train. It's because you take that scaling curve and you just
naively extrapolate it out. And I think that's what we've seen to some extent with the failure of
the mega models, right, with 4.5 and Llama 4 and others. I think that there's a challenge of
just continuing to do that naively, and you have to figure out how to break it. I think there
are a number of theories of ways how to break it. And I don't think they're mutually exclusive.
My bet is that data quality is a massive way to do this.
And in many ways, actually the paper that was the foundational paper for Datology,
it's called Beyond Neural Scaling Laws, and it was fortunate to get a best paper at NeurIPS.
And what that paper showed was that if you use your data correctly,
you can actually bend the scaling laws themselves.
And an interesting kind of technical part of this is that, you know,
I mentioned what we really care about is this, how much new information do you learn from the next data point?
So technically, that's the marginal information gain per data point.
Perplexity is another variant of it.
There's a duality between them.
It turns out that we were able to prove in perceptrons, at least,
because that's pretty much all you can prove things in.
So in small scale, and this work was led by Ben Sorscher,
who was a really fantastic grad student I worked with on this paper.
And what he showed was that there's a direct duality between power law scaling
and the fact that you also see that the marginal information gain per data point
also decays as a power law.
And that's why you get power law scaling,
because every successive data point is teaching you less
and less and less, and it follows a power law, so then you get performance decaying as a power
law as well. So if instead you can keep that so it's flat, then you bend the scaling law.
And now all of a sudden you learn dramatically faster because the amount of information
you're learning is not decaying with data set size. Now, that was all in theory what you could
accomplish, you know, and we proposed a couple metrics that got us one step there. But in many ways,
I would actually say that the whole point of Datology is how do we realize the potential that was
shown in that paper? How do we actually make that a reality?
And I think fundamentally, if we want to get scaling to work well fundamentally, we need to do a better job here.
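As a rough illustration of the power-law argument in this answer, here is a small worked example. The symbols (error ε, dataset size D, constants a and ν) are notation assumed here for illustration, not taken from the paper, and ν = 0.1 is an arbitrary example exponent.

```latex
\begin{align}
\varepsilon(D) &= a\,D^{-\nu}, \qquad a > 0,\ \nu > 0 \\
\frac{\varepsilon(10D)}{\varepsilon(D)} &= 10^{-\nu} \approx 0.79 \quad \text{for } \nu = 0.1 \\
-\frac{d\varepsilon}{dD} &= a\,\nu\,D^{-(\nu + 1)}
\end{align}
```

Under these assumptions, a 10x increase in data removes only about 21% of the remaining error, and the improvement contributed by each additional data point itself shrinks as a power law in D.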
Are you measuring the quality of these open datasets over time?
Are the most recent open datasets better than the older ones at a good rate or like just marginal?
They do get better, but I think they're not relative to the headroom and potential, I would say.
Like, Nemotron is actually pretty similar in quality to DCLM.
It came out about six months later.
It has more unique tokens.
They made a really big deal about it
having more unique tokens, but on average, the quality is pretty straightforward.
So, you know, when we think about what we are able to accomplish at Datology, we usually
think about along these three axes I mentioned, train faster, train better, train smaller.
So typically, basically that's like first question, train faster.
Given a certain baseline dataset, how much faster can we achieve the same performance?
So, you know, and how many fewer tokens.
So we're able to now get to the same performance as DCLM about 12x faster.
So, you know, in fewer than 10% of the tokens, we can match what you get from training to convergence.
And when you say performance, you mean like GPQA or you mean loss?
Yeah, so we typically take the accuracy across 15 kind of standard benchmark tasks that are relevant for, you know,
a given model size. So your MMLUs, your ARCs, your RACEs, you know, et cetera.
The problem with those is like, are you training to the test, right?
Like, are you, you know, I'm sure you know this.
And that's something that we're super careful about because it's really easy to overfit to these benchmarks, of course.
and then end up with models that are really brittle.
And I think this is something that we've seen,
especially with synthetic data.
And synthetic data is a big part of what we do at Datology.
We found that it can drive pretty dramatic gains if you do it correctly.
There are lots of ways to do synthetic data incorrectly.
We've seen a number of models, right, that are trained on a lot of synthetic data
and end up doing really well on benchmarks,
but then kind of don't pass vibe checks and people don't really use.
So we do a lot to try to prevent this.
First and foremost, we keep a held out set of tests that we only look at very occasionally.
And we also don't evaluate on a whole bunch of other
evals that we then have, you know, models that end up getting evaluated on later to try to
really ensure this. But yeah, this is fundamentally how we measure. We look at an average of
benchmarks, just trying to kind of think what's fair and reasonable with respect to what we can do.
So, you know, that's like the first thing we typically look at. Then we look at train better,
of course, under the same compute budget, how much better can you do with a given data set?
We're able to beat kind of the best open data sets by anywhere from four to five points,
depending on the specific dataset and eval. Some of the evals are actually much bigger than
four to five points; it's four to five points on average.
And those are absolute points.
We generally find that in order to get that same performance from training longer on the baseline data sets,
you'd have to train on those baseline data sets, you know, at least five to ten times longer to try to match that performance,
because every successive point of accuracy, of course, gets harder and harder to achieve.
And then finally, train smaller: basically say, okay, holding performance constant,
what's the smallest parameter count model that we can get to outperform?
We can already get models that have fewer than half the parameters and also train faster
and also outperform the larger models trained on the uncurated or alternatively curated data sets by a large margin.
So, you know, this is a big roundabout way of getting to this answer,
have the open data sets, I think, kept up with this improvement.
You know, with a fairly small team, we're now a team of about 30.
You know, most of the results that I've discussed, like, were achieved with a team of under 20,
because we've grown quite a bit in the last couple months.
And with not that much compute by kind of common standards, you know, more than academics,
but certainly nowhere close to the frontier labs,
we've been able to achieve, I think, pretty dramatic results.
I think the reason for this is because there's so much headroom here.
You know, we've already been able to get 10x gains.
I think there's at least another 100x behind this that are still to be done.
There's so much stuff that we're just not even doing right now that I know makes sense to do,
let alone all the things that we are doing that I know we can be doing better,
that we're still very suboptimal with respect to how we're doing this.
Like, I know that the way we do our synthetic data right now could be much better,
that the way we do our filtering could be
much better. The way we do our model-based filtering, our embedding-based filtering, all these different
aspects could be much stronger. So I think there's just so much headroom here. I think the challenge
is that there's not a huge incentive to do this in the open dataset community. I mean, the labs, which have
the biggest incentives, obviously have strong incentives not to share anything with respect to that.
So you're left to kind of, you know, the Allen Institute, things like DCLM, Hugging Face, etc., to make
progress there. But I do think that this is such, this is a hard enough problem that it
really demands a whole company that is really focused on this. I think what you've seen at all the
frontier labs is that they have data teams. And if you talk to the folks that work on those data
teams, what you'll kind of systematically hear is that typically they're under-resourced relative
to the gains that they're delivering, that they're always having to fight for attention.
And this is just like a fundamental thing that I saw at Meta, I saw at DeepMind, and I've heard
at all these other places. It was a big part of why I decided to start Datology instead of doing this
within Meta. I had the opportunity to start a data team there that would try to centralize this.
But fundamentally, I think that this is such an important problem, that it's a problem that needs
to be the end itself, not just the means to the end, which I think is what you see in many of these
big groups. You need to have a large team of really talented people who are really passionate
about looking at the data, and there aren't that many people who are that passionate about it,
to just focus on how do we build the best possible data sets for model training. I think
it's hard to do this as a data team. I think there's a real benefit of being a data company.
And that's a lot of why I started Datology. How do you think the, almost, economics of
the open source datasets world evolve? Because you basically have these, like, open source data sets that
are like good, but maybe they're not quite as good to make production data systems. And then
you have companies like yourselves that are sitting on top of it. Do you think at some point there's
going to be some sort of rupture between like, hey, why are you just taking my open source data set and
making it better in private for people without contributing back.
And do you guys have plans to then open source other sets?
I think there's like kind of this open question of, are these things actually useful in
the open?
Or should you just do it in private?
Yeah, it's a great question.
And one that we've thought a lot about.
I mean, so first off, one thing to note is, right, is that while we do work with folks
who are just training on open models, and generally really built our product and designed
it to be able to work with companies that are training on a combination of open source
and proprietary data.
And that proprietary data could just be data they've been collecting as a matter of business for the last decade, or that could be data that they've sourced from a data annotator or, you know, another data provider.
And some folks we work with have all three, right?
They're going to use open data.
They're going to use data that they've acquired.
And then they're going to use data that's part of their business to begin with.
So, you know, and that's like I think a lot of where our focus goes, although, of course, we are excited about working with lots of folks who are training on more open data sets.
So I published for, you know, a decade, more than that even. Like, you know, this was very near and dear to my heart.
And it's something that we thought a lot about at Datology.
I think one of the challenges of building a startup today, especially a startup for which
science is a critical component, which, as I mentioned, is one of the things that really
attracted me to starting Datology is this tension, right?
Fundamentally, we have to build a business.
In order to do that, we have to have a moat.
And you can think about kind of three places, I think, where our moat could come from.
You know, one is from science know-how.
One is from engineering infrastructure and the challenge of just implementing
this yourself. And then finally, there's a brand moat that you can eventually reach. We're very
far from a brand moat at this point in our journey. Eventually, I would love to have a brand moat where
whenever anyone thinks data and AI, they think Datology and, oh, that's where I should go first.
I hope that we get to that point. But in the meantime, you know, we have to rely on the other two moats
on the science know-how and the engineering infrastructure. I think on the open data side,
what we've seen is that the engineering infrastructure definitely can be a moat. But unfortunately,
I think that science know-how moat is actually pretty,
pretty important. And a lot of the evidence that we've seen so far has suggested that that
is something that's meaningful. As an example, you know, with many of the customers we talk to,
one of the first things they'll ask is, hey, compare to the best open source data set.
Right. So if we were giving away everything needed in order to build that best
open source data set, some folks would just go there. So I think that's been where our challenge
has been. Now, what we've tried to do, and I think we've done a good job of, and I'm generally
happy with the balance we've struck, is try to, in the blog posts that we put out,
give a lot of intuition as to kind of what we're doing and how it works without necessarily
getting to that point of reproducibility.
That's, I think, much more open than you see most of the big labs be.
If you look at the data section of the Gemini Tech Report, it basically says, like,
data quality was the single most important thing for making a great model.
One paragraph.
We used algorithms and heuristics.
It's like, great.
You know, so I think some people were even pointing out, you know,
Like recently, there's been a lot more attention on rephrasing as a method for using synthetic data.
Was it the Apple paper?
The Apple paper, the Kimi paper has mentioned this, a bunch of others.
And, you know, some folks recently pointed out that, like, hey, in our blog post from November, we were talking a lot about that.
Not only do we do a lot of that; Pratyush, the guy who first came up with rephrasing, was one of our first employees.
So, you know, we've improved on that pretty dramatically and taken it to new places.
But that's something that, you know, I think there would have been an incentive to just, like, not even talk about that at all.
Sorry, just on that, do you feel like this is like a great example of you were talking about it in the data?
And then the Kimi paper comes out with a model.
And then people are like, oh, the rephrasing is important.
But you're like, hey, I was telling you that before.
But I just didn't have a model to show you that it was important.
Do you think that's still, even in open science, like a limiter for people that like if you don't have a model, people don't care?
Same with DeepSeek.
A lot of the things in the paper were like kind of known.
But then once you have them applied, people care.
I think that's certainly something that happens.
and I think speaks to the same sort of cultural incentives that we talked about earlier,
where I think that, you know, people tend to think about this very much in, you know, ultimately,
it being a means to an end.
And I understand why that is, of course, and ultimately, like, you know, when we sell better data,
like ultimately we're selling a better model at the end of it, you know, more cost-effective model.
But I think that the fact that people don't care about it as much, unless it's really,
you're smacked in the face with it, I think is both a tragedy and an opportunity.
And, you know, I would love it if it weren't that case.
But given that it is, you know, that's, I think, the opportunity we see at Datology to really make an impact here.
This might be a little bit of a tangent, but you mentioned synthetic data.
You mentioned rephrasing.
So I figured now's a good time to go into it.
You know, I figured that most of the work of Datology is filtering.
But I see synthetic data as something slightly different.
It is in a general domain of improved data quality, but it's different than filtering.
Yeah.
Am I right to equate synthetic data with rephrasing, or
are there other parts to synthetic data in your mind?
Yes, I think there are different parts of synthetic data.
There are two parts.
But let me first actually just comment on the filtering versus things.
So I used to actually use the word data filtering or data pruning.
And actually that paper I mentioned that was at NeurIPS, that one actually has data pruning in the title.
And that's how you beat scaling laws through data pruning.
When I started Datology, I really changed the language to be data curation over data pruning or data filtering.
And that's because curation is a lot more than just filtering.
Filtering and saying, hey, this is a bad data point.
We want to get rid of it is absolutely an important part of what we do.
But it's also about rebalancing data sets, up-weighting, up-sampling certain data distributions
and down-sampling others.
That might not mean filtering.
It might just be changing the weighting with which you take it.
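The rebalancing idea described here (changing sampling weights without removing anything) can be sketched in a few lines. This is only an illustration: the function name `mixture_probabilities`, the source names, the token counts, and the multipliers are all hypothetical and are not Datology's method or values.

```python
def mixture_probabilities(token_counts, multipliers):
    """Return per-source sampling probabilities after reweighting.

    token_counts: dict mapping source name -> number of tokens available.
    multipliers:  dict mapping source name -> up/down-weighting factor
                  (1.0 means "sample in proportion to size").
    """
    weighted = {s: n * multipliers.get(s, 1.0) for s, n in token_counts.items()}
    total = sum(weighted.values())
    return {s: w / total for s, w in weighted.items()}


# Hypothetical corpus; sizes and multipliers are made up for illustration.
counts = {"web": 900, "code": 80, "math": 20}
probs = mixture_probabilities(counts, {"web": 0.5, "code": 2.0, "math": 5.0})

for source, p in probs.items():
    print(f"{source}: {p:.3f}")
```

No source is dropped in this sketch: every source keeps a nonzero probability, and only its share of the training mix changes.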
The order in which you present data can be really impactful: curricula.
And we now have seen this with discrete curricula, you know, for multi-phase training and
things like that.
That's not filtering.
You know, the way you batch the data can be an important factor.
Synthetic data can be an important factor.
How you mix sources, all of these sorts of things beyond just filtering. So filtering is a very important
part of what we do, and it will always be something that we care a lot about, but it's much more
than that. Okay, so now to the question about synthetic data. I think at a high level there are
two approaches to synthetic data, and we have focused more on one of them, the rephrasing one, than
the other, although I think there is opportunity in the other one. So the first approach is
create new data where the knowledge that's in that data is largely coming from the model that's
generating that synthetic data.
Oh, that's distillation then.
It's a version of distillation, and I think that this version of synthetic data could be construed
as distillation in disguise.
And I think it is a very clear version of this.
And when you think about the criticisms of synthetic data around model collapse and stuff
like that, I think they largely apply to this version of you have a net new data creation
that's coming out of these models.
So that's like path one.
I'll slip one in there.
There's also model steganography, where you can sort of hide preferences in a model and distill
it down.
Absolutely.
And now we've seen like the recent like owl stuff around that.
If people search Anthropic owls, you'll see it.
Yeah, exactly.
The other way is this rephrasing, rewriting approach.
So this is where the information that's in the data is actually coming from the data that you're conditioning
the rephrasing on in the first place.
And all the model's doing is it's reformatting the data or presenting it in a new way that
maybe is easier for a model to learn.
Yeah, cleaning, right?
It's cleaning it in some way.
It could be cleaning it.
it could be making it, you know, the information more accessible.
It could be putting that information in a format that is more representative of what the model's
going to be faced with downstream.
So I do think that, like, one thing that definitely happens with synthetic data is we are bringing
more post-training like data into pre-training.
Yeah, sounds like SFT.
And in general, like, one of my beliefs is that most of what we do in post-training
is better done in pre- and mid-training and earlier on in training in general.
It's just the scale, you know, you don't have that scale until now.
It's just that.
Yeah, exactly.
I think if you assume this paradigm where, you know, pre-training is incredibly expensive and something that you can only do very, very rarely and then post-training is cheap, then it makes sense.
But as soon as you break that assumption, and I think DeepSeek showed that already you can get a frontier model for a marginal cost of a couple million dollars.
Yeah.
That's gone down since then because we've gotten better at it and compute has come down in price.
Since then, like, I believe that getting to a frontier model should cost a million dollars or less for most organizations.
at least in a specialized domain, right?
And when you think about what enterprises need, that's generally what they need.
They don't need a model that can do everything.
They need a model that can do a constrained set of tasks to very high accuracy for as low
an inference cost as possible.
And I think that that will be under a million dollars very, very soon.
And that changes a lot of these dynamics.
But going back to the synthetic data question of these two different types.
So I think there's one towards this net new creation.
I think that's where you have a lot of risk.
That's where you get the model collapse concerns, where I train a model,
I train a generative model on a given data distribution,
it overfits the modes and it underfits the tails.
So then if I have it generate a bunch of data,
it's going to be more mode and less tail.
And then I do that a bunch of times
and eventually I get a spike.
I get a delta function.
Only mode.
Only mode, exactly.
Like, that makes sense why that happens.
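The collapse dynamic just described can be sketched with a toy simulation. This is a simplified illustration under stated assumptions, not a model of any real training run: the `sharpen` exponent is a hypothetical stand-in for a generative model that overfits modes and underfits tails, and the starting distribution is made up.

```python
import math


def next_generation(probs, sharpen=1.2):
    """One train-then-generate round of a toy generative model.

    Raising probabilities to a power > 1 and renormalizing puts more mass
    on the modes and less on the tails (an assumed stand-in for overfitting
    the modes of the training distribution).
    """
    powered = [p ** sharpen for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


def entropy(probs):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def simulate(probs, generations=30, sharpen=1.2):
    """Return the distribution at every generation, starting with the real data."""
    history = [list(probs)]
    for _ in range(generations):
        history.append(next_generation(history[-1], sharpen))
    return history


# Hypothetical starting distribution over four outcomes.
history = simulate([0.4, 0.3, 0.2, 0.1])
for gen in (0, 5, 15, 30):
    dist = history[gen]
    print(gen, [round(p, 4) for p in dist], round(entropy(dist), 4))
```

In this sketch the diversity (entropy) shrinks every generation and nearly all of the mass ends up on the single most likely outcome, which is the "only mode" spike described above.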
I will note that if you filter the data after each point,
that's now information injection,
and that can break all of this.
And I think can prevent model collapse.
Which a little bit is what RL is.
Which is a little bit what RL is.
I think you can absolutely view it that way.
And I think actually a lot of the work that has suggested that, you know, RL is really just eliciting the capabilities of pre-trained models like random rewards or a single example.
And then it's just changing the distribution.
It's like aligning to the distribution the model has in the first place are, I think, very in line with that way of thinking about it.
You're distilling from a perfect model, which is the environment or the verifier or whatever.
And then you're distilling that into the thing.
So it's amazing.
It's beautiful.
But the cool thing about rewriting is that because the model
that's doing the rephrasing
just needs to know how to rephrase.
It doesn't need to know
anything about the content itself.
It doesn't need to understand it.
It means you can use a pretty weak model
to do the rephrasing
and have it generalize and generate data
that can teach a model
that's much better
than the model that's doing that rephrasing.
So I think with this distillation in disguise,
I'm generally quite skeptical
that you can get a model
that will be better than the teacher
that's generating the synthetic data
when you do this sort of net new data creation.
It's possible you could
through some sort of heavy rejection sampling
on the big model
because you're effectively inserting new information
when you say which of the synthetic outputs is good or bad,
right? There's some new supervision coming in there.
But I'm generally skeptical of that.
Whereas we've seen this,
we actually will have a blog post coming out
in the next week or two
about kind of our synthetic data generation,
which we call BeyondWeb.
Wow.
And we'll have some cool scientific experiments in there, too,
to our point of trying to figure out this
balance where we can share some of the science, but also do so in a way that, you know,
it's sustainable for our business. And one of the things we show there actually is that by doing
this, you can actually get a model to do much, much better than if you had trained on
all of the data, all raw tokens in the first place. So that by doing this rephrasing effectively,
you actually can break this data wall and now get models that are better than either of the
models that generated the data. With rephrasing, I think this is super possible because most of the
information is coming from the data. It's not coming from the model itself. A couple follow-ups
on that, just things I've always wondered. Are textbooks all you need? No, they are not all you need.
I think textbooks are great, and I think there's a lot of really great content and high-quality
data points like that. But obviously, textbooks are also a very narrow data distribution.
And if there's only one thing that you should take away from this entire interview about
what is good for data quality, it's diversity. Like, in many
ways, right, there was this, like, I used to do all this work on out of distribution
generalization, and we had all of these, like, you know, very careful studies where we would
say, okay, let's, you know, make this corner of the data distribution, then we leave this held out
where it's never seen this combination of things, and let's see if it can generalize.
And then, like, you know, LLMs and the modern way of training models came along and said,
hey, what if nothing was out of distribution?
What if we just made it so that we trained on everything?
And everything's now in distribution.
And by the way, you know, that is in line with AGI, right?
So you might as well.
And that's basically what we've done.
And it's worked.
It's worked shockingly well.
Like way beyond anyone, I think, or most people would have expected.
I certainly was shocked by it.
I made a strong bet that there is no way you can get compositionality just from scaling.
And, well, you can.
Turns out it does work when you get big enough.
What I was really referencing was this is the Microsoft Phi papers, right?
One through three, four.
A lot of them do the rephrasing or rewriting in textbook format.
And I feel like that's a little bit of cargo culting of like, oh, just because you write like Wikipedia or write like textbooks, the models learn better.
That's not proven.
I don't know.
That's not automatically proven to be the case.
I think that's also part of the reason why you see a big difference between the benchmark scores of those models and their real world use.
They went to too narrow a distribution.
And I think this is the problem with synthetic data fundamentally is that you're always going to have some bias here.
I think you can do a lot to make it more diverse.
And we have put a lot of effort into finding ways to do that.
For example, we rephrase into many, many, many different styles and formats.
That's really important to get stuff that's good.
But I think this is the risk, right, that you go on way too narrow a distribution,
and models are always going to be fairly peaky with their output distribution,
and then that actually results in reducing diversity.
That said, I will say that there is a takeaway of that "Textbooks Are All You Need" that I think is correct,
which is repeating higher quality tokens is almost always better than seeing net new
lower quality tokens. So, like, epoching over higher quality data is almost always better than getting the same
amount of new data of an unknown quality or of average quality, average in this case being like what you
just get from an internet dump or something like that or even a reasonably filtered internet dump.
It's always better. The modification I made or the study I would want to commission out of that is like instead
of having another epoch on high quality data, if you found high quality data, good, go and paraphrase it
and then train on that. Maybe that'll get additional gains. I don't think I've seen any papers
to that effect?
The Kimi paper actually had an experiment to that effect where they tried adding multiple
epochs and they looked at how many rephrasings they did of each of them and had some
results there that were interesting to that effect.
Amazing.
And then the other question was more on curriculum.
Curriculum learning had a bad rep for a while.
How come it's back?
What's changed?
Yeah.
So a bunch of things.
And this was really interesting because when I was going out and, you know, initially deciding
whether to start Datology and raising and, like, talking to various, you know, initial
recruits and stuff, it was like mid-2023.
And at the time, I was saying, you know, curricula are going to be a really important aspect.
And a lot of people were basically just like, no, curricula don't work.
Like, we tried this a bunch of times and curricula don't work.
Curricula are one of these ideas that I think always, like, had to work in the sense that it just made
too much sense.
There are a number of these things where it's like, it might be hard to figure out how to make it
work well, but like it always had to work.
There was actually a really cool paper from Stanford that had a nice way of conceptualizing this,
which is imagine a graph where each of the nodes are a different concept or, you know, idea
that you want the model to understand.
And then the edges are basically the dependency
between those concepts, right?
So if concept A helps you learn concept B,
there would be an edge from concept A to concept B.
So now this is the graph.
Imagine this graph of all concepts in the world
and all the different edges between them, right?
Huge graph.
If that graph is empty,
then it would mean that nothing is helpful
for learning anything else, right?
And then curricula would not make any sense.
You should just randomly order things.
If that graph was complete,
so that there is an edge
of equivalent weight between every pair of nodes, then similarly it would mean that everything
is equally useful for learning everything else, and curricula don't work, and you shouldn't
use them. For any other graph besides those two graphs, curricula make sense. I think it's pretty obvious
that neither of those is the graph of the actual world that we live in. Clearly, the world does
have dependencies, some very, very obvious, like the fact that it would be hard for me to do
division and multiplication if I don't understand addition and subtraction. And, you know, some are much
more vague. But I have always believed that this has to work, and the challenge has largely
been that if you're fully saturating your data, then there's really no advantage to curricula,
unless you wouldn't be able to learn it otherwise. Generally I think the idea behind
curricula is that it makes you much more efficient. But in the supervised learning world,
we were fully saturating these data sets. So, you know, maybe a curriculum would get you there
faster, but that wasn't the bottleneck or the limiting factor. So there wasn't a clear
incentive to actually go and do these hard experiments to try to figure out how to make
a good curriculum because, like, who cares if I can get you to ImageNet performance in 80 epochs
instead of 160 epochs? Like, that's nice, but, like, it's not a big deal in the first place.
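[Editor's sketch] The concept-graph picture of curricula described above can be made concrete with a few lines of Python. This is only an illustration, not the method from the Stanford paper mentioned in the conversation: the `PREREQS` graph and both function names are hypothetical, and a curriculum is reduced to "any ordering that teaches prerequisites first." The complete-graph case is not modeled here, since a graph with edges in both directions has no such ordering.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical concept graph: each key lists the concepts it depends on.
PREREQS = {
    "addition": [],
    "subtraction": ["addition"],
    "multiplication": ["addition"],
    "division": ["multiplication", "subtraction"],
}


def curriculum_order(prereqs):
    """Return one ordering in which every concept follows its prerequisites."""
    return list(TopologicalSorter(prereqs).static_order())


def respects_dependencies(order, prereqs):
    """Check that no concept appears before something it depends on."""
    position = {concept: i for i, concept in enumerate(order)}
    return all(
        position[dep] < position[concept]
        for concept, deps in prereqs.items()
        for dep in deps
    )


order = curriculum_order(PREREQS)
print(order)

# With an empty graph (no edges), every ordering is valid,
# so a random shuffle is as good as any curriculum.
no_edges = {concept: [] for concept in PREREQS}
shuffled = ["division", "addition", "multiplication", "subtraction"]
print(respects_dependencies(shuffled, no_edges))
print(respects_dependencies(shuffled, PREREQS))
```

The same shuffled order passes the check on the empty graph and fails it on the graph with dependencies, which is the point of the argument: ordering only matters when the graph has structure.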
But now we're in this totally different world, where now all of a sudden, all of our models
are underfitting the data. This is super important, and getting a curriculum right could literally
make the difference between, you know, spending 10 times as much on a model training, you know,
hundreds of millions of dollars, potentially. And now all of a sudden, curricula make a ton of sense.
So I think that's why it didn't really make sense to put a
lot of effort into the problem previously. And, you know, now we've seen pretty clearly with discrete
curricula that this makes a big impact. And, like, largely what we talk about when we say
mid-training is really just like a later phase of your discrete curriculum, I think, is another
way of thinking about it, right? You could even think of post-training as part of a curriculum.
In fact, one of the things that I'm really excited about is, you know, we've mostly focused
on pre- and mid-training at Datology so far. One of the kind of most consistent asks from every
one of our customers has been, can you do more on post-training? Can you also help us curate
the post-training data, so we're starting to invest pretty heavily there.
And one of the things I'm really excited about is actually viewing this whole thing from pre-training
to mid-training to post-training holistically as a single process.
And then asking questions like, how do we optimize our pre-training data to make post-training
more effective or things like that?
These are, I think, really exciting questions.
And something that you don't see happen, even at the big labs, because they have entirely
separate teams, right?
There's a pre-training team, there's a mid-training team, there's a post-training team,
and, like, the mid-training team is a customer of the pre-training team,
and the post-training team is like a customer of the mid-training and pre-training teams,
but it's quite hard to actually have signals propagate through all these.
So I think this is a really exciting area.
I'll push you a bit on this.
Yeah.
You know, I think a popular view is that post-training is elicitation of capabilities
that you already trained in pre-training.
So what dependencies can you have that feedback into the pre-training?
So I'm inclined to agree with that view.
And I think that that view would lead very strongly to the fact that you should be
trying to optimize your pre-training data to make post-training processes more effective.
So you should try to figure out how do I optimize my pre-training data so that the slope of the
test-time compute curve or so that the slope of the RL curve is as steep as it possibly can be.
Or alternatively, how do I optimize my pre-training data so that the slope of the jailbreaking
curve is as shallow as possible, right?
Like fundamentally, I think alignment in post-training doesn't really make sense as a long-term
solution.
If you can easily align a model through post-training, you can easily misalign a model
through post-training. If it's easy to put it in, it's easy to take it out. If it's really
hard to put it in, it's really hard to take it out. That's just like a truism of models, right?
So if you do alignment during pre-training, you'll actually end up with models that are, I think,
largely impossible to misalign without putting a massive amount of data into them.
I think there are a lot of benefits to that. And I think we've also seen evidence for this,
like looking at the difference between Llama and Qwen with respect to their ability to be post-trained,
right? It's much easier to RL Qwen than it is to do Llama.
Likely that has to do with the fact that Qwen put a lot of synthetic reasoning traces into their training data.
Even with wrong examples.
Yeah, even with wrong examples, it still learns a lot here, which is wild, right?
But I think that pretty clearly shows that it's the base model that's doing it.
It's not the rewards you're giving.
If you give random rewards and the model still learns, it's probably not the reward signal that's doing it.
That's cool.
I'm just curious on the customer usage.
How many people are doing post-training?
You see nobody today because you don't have it. But when people come to you, are people looking
mostly to do post-training on open models, on OpenAI models, or what do they ask for?
Yeah. So we usually work with folks who are either training their own models from scratch
or doing continued pre-training on an open model with a bunch of domain-specific data that
they have that's unique to their use cases and their business. We typically focus on folks
that are doing training with significant costs. So typically that means, you know, at least
a couple tens of billions of tokens, oftentimes more.
So kind of the standard small-scale post-training case, we don't focus as much on.
That said, I think this has been a question that a lot of people have asked us consistently,
like, hey, who's actually training their own models?
Like, why don't I just rely on the open models?
And I think there are a number of reasons why we see people do this.
So first off, I think sovereign AI has been a pretty big place where we've seen a lot of demand.
Lots of countries.
They want to have models that they own that are unique to their language, their culture,
and, you know, that requires them to have really good data curation, of course, in order to do this effectively.
Just to double-click, countries owning models isn't actually a thing that I know about.
Like, I'm from Singapore, we have the SEA-LION model, but it's not like owned by a country.
And I can't name any other country that owns a model.
Yeah, I think that's actually correct.
Okay.
It's largely, what you see right now is these public-private partnerships where governments are making pretty large grants.
TII in the UAE is like the closest.
Yeah, I think you have those.
I think you also have these places, right, where the funding is the country and it becomes a little
unclear where it comes from. But yeah, I think usually what you see is that countries are doing
big grants to private companies or public-private partnerships to go and build that
sort of thing. So that's a big thing. I think we've seen a lot of, you know, larger enterprises
that have a lot of their own data that want to do this. And when you think about this,
ultimately what we see is, okay, across those three value props, train faster, train better,
train smaller, like which matters and when. Like, train faster. In principle, that's the easiest one
to compute. You know, I say, okay, this model would have cost you $10 million to train. I get it to you
for a million dollars or for $800,000 or whatever, right? Great. I saved you a ton of money.
In practice, though, nobody wants to train a $10 million model for a million dollars.
They already have that model. They want to train a $100 million
model for $10 million. You know, they want train better. So train faster usually doesn't matter
so much from the perspective of, hey, this model is now a lot cheaper. It does matter a lot more
from the perspective of you can iterate much faster, right? Because when you think of the workflow
of most ML engineers, you start a training, you go and you sit on your hands until the training
finishes. You know, you find something else to do, but largely you're waiting and your iteration
is bounded by how long that takes. If you can take something from taking 10 days for a model
to finish training to being overnight, now your existing team is way more productive and can do
far more iterations and stuff like that. So that's where we usually see that matter the most.
Most people care the most about train better, right? I can get a better model for the same compute,
and we can absolutely deliver that through data. Data is effectively a compute multiplier, right? Because
all models are underfitting their data sets, if you can make your model more data efficient,
you effectively make your compute more valuable. Because if you think about compute as I inject
a certain number of dollars and I get a certain performance back, if I use better data, then I
will get more performance back per dollar invested and now my compute is more valuable. So that's
where train better, I think, it tends to be the most meaningful thing. But interestingly,
for the companies that are most advanced on their AI transformation journey,
train smaller is the one that I think actually means the most. Because when you think
about the total cost of ownership of these models, it's going to be very, very heavily weighted
towards inference. It's all inference. And you know, you think about a company that's spending,
say, 50 mil a year on inference, which in the scheme of things is not very much, right? If you
deploy a model that's twice as big as it needs to be, that's going to cost you 25 mil in year one.
The cost to train a model that has fewer than half the parameters but is just as good
or even better at your particular use cases is, say, two or three million dollars. That's a no-brainer
if you can do it easily, right?
If it's really hard, then you're never going to do that.
But if you can do it easily and you can get it right on the first try, that's a no-brainer.
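[Editor's sketch] The arithmetic behind the "train smaller" argument can be written out directly. The figures are the ones quoted in the conversation ($50M a year on inference, a model half the size, roughly $3M to train it). The function name is hypothetical, and the key assumption, which is not from the episode, is that serving cost scales linearly with parameter count.

```python
def smaller_model_economics(annual_inference_cost, size_ratio, training_cost):
    """Estimate first-year savings from serving a smaller model.

    Assumption (not from the episode): inference cost scales linearly
    with parameter count, so a model at `size_ratio` of the original
    size costs `size_ratio` times as much to serve.
    """
    annual_savings = annual_inference_cost * (1 - size_ratio)
    net_first_year = annual_savings - training_cost
    payback_months = 12 * training_cost / annual_savings
    return annual_savings, net_first_year, payback_months


# Figures quoted in the conversation: $50M/year on inference, a model
# half the size, and roughly $3M to train it.
savings, net, payback = smaller_model_economics(50_000_000, 0.5, 3_000_000)
print(f"annual savings:  ${savings:,.0f}")
print(f"net in year one: ${net:,.0f}")
print(f"payback period:  {payback:.2f} months")
```

Under that linear-cost assumption the training spend pays for itself in well under a quarter, which is why the conversation calls it a no-brainer when the smaller model can be trained reliably.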
And then, you know, 50 million a year is not going to be very much, right?
We know that all of these products have, you know, a tiny, tiny fraction of what their eventual user bases will be, right?
We're still very much in the first inning here.
You know, everyone that listens to this podcast is using AI nonstop, but the rest of the world is not yet.
So the inference costs are going to skyrocket with these models.
and if you use a general-purpose model that you then constrain to say, hey, this model knows about everything,
but now only do this one thing, that model is going to have a ton of parameters that do not need to be there
that are going to massively increase the cost of serving that model.
So I think that, you know, when you think about the use case of an enterprise where they need a model
that's an inch wide and a mile deep, it can do a small handful of things, but it can do that really, really effectively,
to five-nines of reliability, and it can do it for as low a cost as possible,
The economics make it so that it really makes a lot of sense to do this yourself if you can do it easily.
And the way we think about it is that there were kind of two big barriers.
First, you have to get training right, and then you've got to get data right.
And on the training side, I think three years ago, this was super hard.
But Mosaic was the first one to really recognize that there was a huge opportunity in making this easy.
And now this has largely been commoditized by things like SageMaker and Together and lots of different folks that help you on the training side.
But on the data side, the barrier is just as high as ever.
And in many ways, that's our mission at Datology.
It's how do we bring that barrier down so that anyone who wants to train a model can do so with the best quality data on their first try?
They don't have to go and spend 40 years in the desert.
They don't have to get it wrong 100 times first, which is what will happen if you don't have this experience.
But instead on the first shot, they get a really great model.
Yeah.
Just a follow-up question on train smaller.
Yeah, I fully agree.
And I think that this is something a lot of people are investing in.
You are primarily doing work on the data side, data pruning,
which maybe is a bad word now, data curation, whatever.
I think a lot of people, you know, Jonathan Frankle was on the podcast very early on,
but a lot of people were betting on pruning the model itself.
Like you have a working model at size and you just lop off anything above like a certain epsilon.
Is that confirmed to just be dead?
So it's funny.
Jonathan actually interned with me when I was at Meta and we worked on this stuff together.
You know, he had the lottery ticket hypothesis, which is a really beautiful paper.
Which he now completely disowns.
Which he loves to disown.
You know, I had this whole idea when Jonathan and I worked together that we wanted to create a lottery ticket initialization.
It would just be an initialization you'd sample from for initializing the weights that would then be one of these like perfect winning ticket initializations.
But we actually found out that the problem was that the lottery ticket was actually data dependent.
And that was where the fundamental problem came, that as soon as you change the data distribution a little bit, like the winning tickets changed in a really big way.
I don't think pruning is dead.
Parameter pruning still absolutely has a place, but I think certainly we found it challenging
to really realize the potential of it.
I think one of the big tricks with pruning, parameter pruning, just to be clear, was that
unstructured pruning, when you would, you know, prune weights randomly, so you view all
the weights as a smorgasbord and just prune them randomly, that worked really well, and you
could remove massive quantities of the weights with unstructured pruning.
The problem is that unstructured pruning doesn't really give you
a clear compute advantage because you need to have a sparse matrix now to reflect this.
And there's a pretty huge overhead of sparse matrix multiplies.
GPUs are not very good at sparse matrix multiplies.
Like there's some support for them now.
There's some hardware alternatives for that.
And people talked about like building ASICs that would be really good at unstructured pruning,
but I don't think I've seen one that works super well.
I think if someone did make something that worked really well for kind of models that
were pruned in an unstructured way, that could be effective.
Structured pruning, in which case you just remove a unit, you just remove a neuron, is really easy to make faster on a GPU, but that just doesn't work nearly as well.
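[Editor's sketch] The compute difference between the two kinds of pruning can be shown on a toy weight matrix. This is a pure-Python illustration under stated assumptions: the matrix `W` is made up, weights are selected by magnitude and rows by L2 norm (a common choice, not something specified in the conversation), and "cost" is just the multiply-add count of a dense matrix-vector product.

```python
import math


def unstructured_prune(weights, fraction):
    """Zero the smallest-magnitude individual weights; the shape is unchanged."""
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * fraction)
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]  # everything at or below this magnitude is zeroed
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def structured_prune(weights, n_remove):
    """Drop the rows (output units) with the smallest L2 norm; the matrix shrinks."""
    norms = [math.sqrt(sum(w * w for w in row)) for row in weights]
    ranked = sorted(range(len(weights)), key=lambda i: norms[i], reverse=True)
    keep = sorted(ranked[: len(weights) - n_remove])
    return [weights[i][:] for i in keep]


def dense_multiply_adds(weights):
    """Multiply-adds for a dense matrix-vector product: rows * cols."""
    return len(weights) * len(weights[0])


# Made-up 4x4 weight matrix, for illustration only.
W = [
    [0.9, -0.1, 0.05, 0.8],
    [0.02, 0.01, -0.03, 0.04],
    [-0.7, 0.6, 0.2, -0.01],
    [0.3, -0.02, 0.5, 0.07],
]

sparse_W = unstructured_prune(W, 0.5)  # half the entries become zero
small_W = structured_prune(W, 1)       # one whole unit is removed

print(dense_multiply_adds(W))         # original dense cost
print(dense_multiply_adds(sparse_W))  # same shape, so the dense cost is unchanged
print(dense_multiply_adds(small_W))   # fewer rows, so fewer multiply-adds
```

The unstructured result is half zeros but still a 4x4 dense matrix, so a dense kernel does the same work unless sparse kernels or hardware take advantage of the zeros. The structured result is a genuinely smaller matrix.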
So, you know, I think there's still potential here.
I don't think it's the panacea that I, and I think many others, had hoped.
That said, I think one thing that's cool about using better data to train smaller models is that it's complementary with any other approaches for optimizing inference.
So, you know, I think pruning and quantization obviously still have a big
role to play in helping inference go faster. And that would stack on top of anything that we're doing,
which I think is kind of cool. Yeah. One other thing, I think kind of a grand challenge, golden question
that would be very valuable for you, and just in general, is this idea of what is the
smallest possible model for a given capability. Do you have any insights on that? I did a podcast with
Jack Morris, who's out of Cornell. And, you know, I think there's some information
limit. And I think he had some answer like, you know, it's like eight bits per parameter or something
like that. I forget what the conclusion was. Yeah, I'm not sure I would put out a specific number,
but I would definitely say far, far smaller than what our current models are trained to be.
Right. Like, we are nowhere close to this. And, and, like, I am generally of the belief that
most of the models that the vast majority of people will be using in, say, three years,
will be single-digit B or smaller. I think we've seen this very clearly. Like,
you look at just the Llama series, you know, if you want to exclude Llama 4, do so.
But, you know, Llama 1 through 3, you can see pretty clearly that the 7B variant from one generation is pretty close to the 70B variant from the prior generation. You know, it's not quite there, but there's still a very clear trend here.
We're seeing this with the Qwen models, right?
You look at some of these small Qwen models and they're just incredibly performant relative to what state of the art was, you know, a year ago.
I think it's pretty clear that these models are way too big.
I personally would bet against kind of the next frontier being trillion parameter models,
and rather that we're going to really optimize the inference cost of it.
I think also test time compute as a paradigm really pushes you towards smaller models, right?
Because if your cost of solving a problem is cost of inference times number of thinking steps,
and you have to do a lot of thinking steps, well, now
minimizing the cost of inference is really important.
And I think that anything we can do to make it so that you can just make that inference model
that is doing the one step of thinking a lot faster enables test time compute to be a lot more effective.
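[Editor's sketch] The cost relationship stated above (cost of solving a problem equals cost of one inference step times the number of thinking steps) is simple enough to write down. All numbers here are hypothetical and chosen only for illustration; costs are kept in integer micro-dollars to avoid floating-point noise.

```python
def cost_per_problem(cost_per_step, thinking_steps):
    """Cost of one problem = cost of one inference step * number of thinking steps."""
    return cost_per_step * thinking_steps


# Hypothetical per-step costs in micro-dollars, for illustration only.
BIG_MODEL_STEP = 4_000
SMALL_MODEL_STEP = 1_000
STEPS = 500

big_cost = cost_per_problem(BIG_MODEL_STEP, STEPS)
small_cost = cost_per_problem(SMALL_MODEL_STEP, STEPS)

# With the big model's budget, how many steps could the small model take?
steps_on_same_budget = big_cost // SMALL_MODEL_STEP

print(big_cost, small_cost, steps_on_same_budget)
```

Because the cost is a product, any reduction in per-step cost either cuts the bill proportionally or buys proportionally more thinking steps for the same budget.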
Yeah.
I think there's another version of this, which is the sort of Andrej Karpathy cognitive core concept
of a model that doesn't know anything, but can use tools a lot to figure things out.
Again, another information theoretical limit that would be very helpful to figure out is what is the
minimal viable model for that stuff.
Like zero on GPQA,
100 on BrowseComp.
I really like that idea, and I think it's very possible
to do that, because knowledge storing
takes a lot of capacity. It takes a lot of
parameters. And you don't need it. And, you know,
we can just look, like, actually
one of my first papers that I ever wrote was
actually about showing that when you
train models on randomized labels,
because this was something that was kind of a common test to do.
That was the one way you could prove that a model was
memorizing would be that you randomize all the labels and now there's no actual true association.
It would have to memorize it. And like models could do this really well. There was like an
ICLR best paper from 2017 that showed this, and people were really surprised that models could
memorize all of ImageNet. Now this seems crazy because of course models can memorize the whole
internet. But at the time that was like crazy. Wait, they could just memorize a million labels.
Like that's wild. And what we found there actually was that if you went and you just deleted units
with a model that memorized, it would be really damaging to the model that memorized. But a model
that actually learned a generalizing solution, you could delete a lot of units.
And it would be pretty robust to that.
So it's actually a very clear demonstration of exactly this concept that the more you memorize, the more capacity you're using.
Dropout regularization.
There's a lot of dualities to dropout.
And I think there's an argument to be made that dropout, like, you know, helps to prevent memorization.
And it helps to learn more generalizable solutions.
And that's part of why it worked well.
But yeah, I think it's very possible to do this.
And like, I think we're wasting a ton of capacity in these models on knowledge that is just totally unnecessary for them to have.
Before we wrap, just because we started with the Arcee models and then we never talked about them.
I think the most interesting thing to me was they started with 23 trillion tokens of data and then you helped them get down to 6.6 trillion.
Any learnings from that?
And this is a 4.5B model, which is on par with Gemma 4B and a little worse than Qwen3, but roughly the same.
Any learnings there, experiences, things that other open models should adopt?
So, yeah, for that one,
we started with a combination of DCLM, Nemotron, and FineWeb.
We basically just concatenated them all together.
It's about 25 trillion tokens combined for all of those, to produce 7 trillion out of that.
I mean, I think what was exciting to us about that was, in general, you know, seeing the
speed at which the model learned.
So, you know, it was beating Gemma pretty consistently before the 1 trillion token mark, which was pretty cool to see.
And I think really highlighted in many ways, you know, how higher quality data can get you
much better performance much more quickly.
General insights, I think, or takeaways from that.
I mean, I think it was exciting for us as kind of one of our first real, like, Arcee is the first customer that we're talking about and being public about, you know, since starting the company.
So obviously, that was an exciting moment.
But I think really generally, it's a good showcase about the fact that combining all of these different techniques can give you a really big gain.
You know, I think that's one of the things we've been saying, but it's nice to have a real demonstration about that.
You know, this is not something where it was synthetic data taking us here or was filtering taking us here.
It was really about thinking about how do we actually combine all of these techniques.
And one of the things we've consistently found, actually, is that when you take these
different techniques and you try to make them work together, they don't generally. You can make them
work together, but it's quite hard to do so. So I think what was quite exciting for us there was
showing that that's possible. And then combined with that, I think people, first off, tend to think
that you can't stack curation. I think the fact that we started with some of the best curated
open datasets and were able to make them dramatically better is a pretty good insight to the fact
that there's still a ton of headroom left here. Like, we didn't need to go to Common Crawl to get those
tokens. We are of course doing work on that, and we think there's a lot we can do to improve
there. But just starting from that, and we actually now are making bigger data sets from that.
I think we can get up to 15 trillion, just starting from that corpus and still have
pretty identical quality to that, which is pretty neat. So I think showing that you can get there,
and then it really stacks. Like, one of the other things we consistently find is that if we apply
our curation on top of, say, DCLM, and then we apply it on top of FineWeb, the gap between
FineWeb and DCLM is maintained in the gap between kind of
Datology-curated DCLM and Datology-curated FineWeb.
They both get a lot better, but Datology DCLM is still better than Datology FineWeb.
So, you know, there really is a lot that we can do here.
And I think that would be the biggest thing that I would just say.
There's so much still left to do here.
We're just scratching the surface.
We're pretty excited about what these results showed.
We already have better data sets than what Arcee trained on because that model was largely
trained in May, and we're pretty excited about all the next trainings that we'll have that go even bigger.
I have a couple more lightning fun questions. What data does everyone want, based on your customer
conversation? What data does everyone want, but it's really hard to get? I mean, I think expert data
is the pretty obvious thing. That's domain experts. Domain expertise. That said, I would also
note that most people don't know what data they actually should be getting. They just show up with
whatever they have. Yeah, I think we've actually found shockingly frequently as we talk to folks
who, you know, have been planning for a really expensive training run, you know, a millions and millions
of dollars training run. They've been thinking about the architecture they're going to use,
they've been thinking about all this stuff.
And then they reach out to us and they're like, hey, like, we realize we need a good data set.
And we're planning to kick off training in two weeks.
Like, can you help us?
And a lot of it's like, hey, you probably should be thinking about your data set before all the other things.
If anything, that's actually the most important thing.
So I think, I'd say the most surprising thing is maybe how often people don't even have a conception of what good data is.
And oftentimes I think what they think is good data often isn't, which goes to the DCLM point.
I think that we mentioned in the past. It's very counterintuitive and really hard for humans to identify this is high quality. This is low quality.
This is a little bit of a recruiting question. What is the data efficiency question where, if somebody had an answer, they should join Datology immediately?
The first thing I would just say is, if you are one of these people that keeps on finding yourself just staring at the data, you keep on going into the dataset, if you can tell me what your, you know, favorite and least favorite C4 example is, you belong at Datology. You should come join
us and join a bunch of other nerds that love doing that exact same thing. I think in many ways,
that's kind of the single biggest predictor of whether someone is going to be really happy at
Datology is, like, how much do you just look at the data in your own work? Because I think you'd
be surprised by how many really talented researchers don't do it very often, that they really
just view it as a given. I think it's been pretty surprising across the board. That said,
there are so many questions from the science side that I'm just super excited about.
I mentioned the interactions between pre and post training. That's definitely one that we're really
excited about. One of the things that we really care a lot about is making it so that our product
and curation automatically adapts to novel data distributions, right? If you have this where it has
to be fully automated, and we didn't talk about this too much, but one of our challenges often is
if we're working with an enterprise that has a lot of proprietary data, they obviously don't want
to give that to us. So we bring our curation to their data, but this means that it has to adapt
automatically. You know, we have pretty limited access into going in and
looking at that data. So that's actually a really hairy and interesting out-of-distribution
generalization problem. But it's also really important because there's no golden curation.
A curation is only optimal with respect to a given set of downstream use cases or tasks, right?
So we need to be able to define based off of, you know, if the model needs to be able to do
XYZ, how should we use that information to adjust the curation that we do to make sure that
we're giving the data that's most relevant for solving tasks XYZ? And that needs to happen
automatically. So we have a number of ways that we can do that for a number of our techniques,
but that's a very broad and general question that we want to apply to every part of our
pipeline, so that the way we do synthetic data differs based off of the downstream use
cases, the way we're doing this, the way of doing every different part, filtering, et cetera,
is going to change based off of that. So that's another question that we're just really excited
about. And fundamentally, you know, anything about really trying to answer this question about,
you know, how do you value data with respect to a target? You know, when I think of Datology and
our core competency. I think every company needs to have an unfair advantage or some core
competence that they do better than anyone else. And for us at Datology, you know, I want us to be,
and I think we already are, the best in the world at valuing data with respect to a downstream
use case. In many ways, I think that's kind of the NP-complete problem of AI. If you can do that,
you can kind of do anything. And that's the thing that we're really focused on. And of course,
curation is like the very obvious direct application of that core competency. But when we think about
kind of the vision for the company in the long term, it's about asking: what are all the other
ways we can operationalize that same core skill set? And I think there are tons of really interesting
things you can do there. But that's the fundamental question that we really want to answer.
And then there are tons of different entry points to that question. But if that's a question
that excites you, if you have been working on data somewhere else and you have felt this pain
of being a second-class citizen or having the data team be kind of dismissed and you want to be in a
place where literally the only reason that the company exists is because data is all we care
about. I mean, the name of the company, Datology, the science of data, that's why we're here.
Then you should absolutely talk to us.
Awesome. And just to wrap on some gossip, let's talk about meta and super intelligence.
And just in the notes, you know, when you talk about science mode and whatnot, you raised a lot of
money from very prominent people. So you have, you know, Yel Lecun as one of your investor, Jeffrey Hinton,
Jeff Dean.
So when Ari says that they have a science mode, believe it.
So maybe since you have Jan as an investor, this is more of a touchy question.
But what do you make of the whole meta, super intelligence team?
And, you know, Jan was also linked in.
And it was like, hey, you know, I'm actually working on, you know, I fear.
We're focused on the next generation of AI, not on this current generation.
So my role is the same.
But then maybe people might say, you know, then why didn't you do the current
generation 10 years ago? What do you make of the whole change and whether or not you
think this is an interesting direction for Meta, especially given the large platform and user base
that they have?
Well, first, with respect to Yann specifically, I mean, Yann's an incredibly
talented scientist, of course. But I think that, you know, his preference has always been to do
science rather than to run an organization. So I think he ran FAIR, like, organizationally for a
year or two right at the very beginning. But pretty quickly, he handed that off to other people.
And when I was there, it was Joelle Pineau and Antoine Bordes, and then Joelle for most of it, that really were running FAIR. And she was an incredible leader. I really respect her deeply and couldn't have asked for a better kind of advocate for science within FAIR.
When she left, people were saying, like, this is the end of FAIR.
I hope that's not true. But I also had that concern. But I think Yann always really wanted to just actually do the science himself. And, you know, for most of the time I was at FAIR, he kind of operated with his own group of a couple
postdocs and visiting scientists, and then he'd have a couple students through NYU, and he would
kind of do his own research there. So I don't think he was ever, you know, or at least not since
the beginning, in a role where he was defining AI strategy for Meta. I don't think that's the
role he wanted at any point. You know, I think he really wanted to be doing that research. And I think,
so I don't think that his role probably is changing very significantly in the sense that he
wasn't doing that previously, and I don't think it was what he wanted to do. I mean, I think one thing
that's pretty cool about it, obviously, is it showcases the importance of data that Meta is willing
to spend quite this much on, you know, Scale, kind of the acquisition, not-acquisition that we're seeing today.
Alex Wang is not going to underrate data.
Let's put it that way.
Yes, he's not going to underrate the importance of data.
And, you know, and I do think that this is an area where, you know, the stuff we've done
is quite different than, I think, what we've seen from the data annotators, which have
been more focused on collecting the data versus actually optimizing and curating it.
I think there's quite a bit you can do on top of those things.
So I think it definitely draws some attention to that.
I will also just say generally when Zuck makes a very big bet, it's not proven wise to bet against him.
Just historically, that's been the case.
And like most of the big bets, I think, have panned out.
I think the one that's still really up in the air is the metaverse.
But I would actually argue that I think that's going to end up paying off in the long run.
I think the Ray-Ban glasses are pretty darn cool.
And a lot of the foundations of what was in Reality Labs will go into those.
Also, FAIR was part of Reality Labs, actually, for like a year and a half after one reorg.
Like, initially, FAIR wasn't, and then got reorged into Reality Labs.
So I think when I left, actually, FAIR was officially part of Reality Labs.
Wow.
If I recall correctly.
There was at least a one-and-a-half- to two-year period where that was the case.
So some of the AI investment, actually, that, like, laid the foundations came out of that
metaverse investment in the first place.
You know, that said, I think, you know, we talk about data as being a compute multiplier all
the time.
Talent, I think, obviously, is a compute multiplier.
And given the amounts that they're spending on compute, I think you can make a good
argument as to why spending a crazy amount on talent is also worth it. So I'm excited to see what they
do. I hope that they put a lot of focus on data and become customers. Yes.
Awesome. Well, thank you so much for chatting and coming by and insisting on in person because you're actually
very charismatic in person. So I'm glad you did this.
Well, thank you very much. Thanks for having me, and it's a joy to get to chat in real life.
Awesome. Cool.
