No Priors: Artificial Intelligence | Technology | Startups - The Power of Quality Human Data with SurgeAI Founder and CEO Edwin Chen
Episode Date: July 24, 2025. In the generative AI revolution, quality data is a valuable commodity. But not all data is created equally. Sarah Guo and Elad Gil sit down with SurgeAI founder and CEO Edwin Chen to discuss the meaning and importance of quality human data. Edwin talks about why he bootstrapped Surge instead of raising venture funds, the importance of scalable oversight in producing quality data, and the work Surge is doing to standardize human evals. Plus, we get Edwin’s take on what Meta’s investment into Scale AI means for Surge, as well as whether or not he thinks an underdog can catch up with OpenAI, Anthropic, and other dominant industry players. Sign up for new podcasts every week. Email feedback to show@no-priors.com Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @echen | @HelloSurgeAI Chapters: 00:00 – Edwin Chen Introduction 00:41 – Overview of SurgeAI 02:28 – Why SurgeAI Bootstrapped Instead of Raising Funds 07:59 – Explaining SurgeAI’s Product 09:39 – Differentiating SurgeAI from Competitors 11:27 – Measuring the Quality of SurgeAI’s Output 12:25 – Role of Scalable Oversight at SurgeAI 14:02 – Challenges of Building Rich RL Environments 16:39 – Predicting Future Needs for Training AI Models 17:29 – Role of Humans in Data Generation 21:27 – Importance of Human Evaluation for Quality Data 22:51 – SurgeAI’s Work Toward Standardization of Human Evals 23:37 – What the Meta/ScaleAI Deal Means for SurgeAI 24:35 – Edwin’s Underdog Pick to Catch Up to Big AI Companies 24:50 – The Future Frontier Model Landscape 26:25 – Future Directions for SurgeAI 29:29 – What Does High Quality Data Mean? 32:26 – Conclusion
Transcript
Hi, listeners. Welcome back to NoPriors. Today, Elad and I are here with Edwin Chen, the founder
and CEO of Surge, the bootstrapped human data startup that surpassed a billion in revenue
last year and serves top tier clients like Google, OpenAI, and Anthropic. We talk about
what high quality human data means, the role of humans as models become superhuman,
benchmark hacking, why he believes in a diversity of frontier models, the Scale/Meta not-quite-M&A
deal, and why there's no ceiling on environment quality for RL or the simulated worlds that
labs want to train agents in. Edwin, thanks for joining us.
Great, great to see you guys today.
Surge has been really under the radar until just about now.
Can you give us a little bit of color on sort of scale of the company and what the original
founding thesis was?
So we hit over a billion in revenue last year.
We are kind of like the biggest human data player in this space.
And we're about 100, a little over 100 people.
And our original thesis was,
we just really believed in the power of human data
to advance AI.
And we just had this really big focus from the start
of making sure that we had the highest quality data possible.
Can you give people context for how long you've been around,
how you got going, et cetera?
I think, again, you all have accomplished
an enormous amount in a short period of time.
And I think, you know, you've been very quiet
about some of the things you've been doing. So we're going to just get a little bit of history
and you know, when you started, how you got started and how long you've been around.
Oh, yeah. So we've been around for five years. I think we just hit our five year anniversary.
So we started in 2020. Before that, to give some of the context, I used to work at Google, Facebook, and Twitter. And basically the reason we started Surge was that I used to work on ML at a bunch of these big companies.
And the problem I kept running into over and over again was that it was really impossible to get the data that we needed to train our models.
So it's just this big blocker that we faced over and over again.
And there was just like so much more that we wanted to do.
Like even just the basic things that we want to do, we struggled so hard to get the data.
It was really just the big blocker.
But then simultaneously,
there were all these more futuristic things
that we wanted to build.
Like, if you thought about the next generation of AI systems: if we could barely get the data that we needed at the time to solve something simple, like building a basic sentiment classifier, then how would we ever advance beyond that? So that really was the biggest problem. I can go into more of that, but that was the big problem we faced.
And you guys are also known for having bootstrapped the company, versus raising a lot of external venture money
or things like that.
Do you want to talk about that choice
in terms of going profitable early
and then scaling off of that?
In terms of why we didn't raise: I mean, part of it was obviously just that we didn't need the money. I think we were very, very lucky to be profitable from the start, so we didn't need the money.
It always felt weird to give up control.
And like one of the things I've always hated about Silicon Valley is that you see so many
people raising for the sake of raising.
Like, I think one of the things that I often see is that a lot of founders that I know,
they don't have some big dream of building a product that solves some idea that they
really believe in.
Like, if you talk to a bunch of YC founders or whoever it is, like what is their goal?
It really is to tell all their friends that they raised $10 million and show their parents
they got a headline on TechCrunch.
Like that is their goal.
Like I think of like my friends at Google.
They often tell me, oh yeah, I've been at Google or Facebook for 10 years and I want
to start a company.
I'm like, okay, so what problem do you want to solve?
They don't know.
They're like, yeah, I just want to start something new. I'm bored. And it's weird because they can pay their own salaries for a couple of months. Again, they've been at Google or Facebook for 10 years. They're not just fresh out of school.
They can pay their own salaries. But the first thing they think about is just going out and
raising money. And it's always struck me as weird, because they might try talking to some users and they might try building an MVP, but they kind of just do it in this throwaway manner, where the only reason they do it is to check off a box on a startup accelerator application. And then they'll just play around with these random product ideas, and maybe they happen to get a little bit of traction so that a VC DMs them.
And so they spend all their time tweeting
and they go to these VC dinners
and it's all just so that they can show the world
that they raised a big amount of money.
And so I think raising immediately always felt silly to me.
Like everybody's default is to just immediately raise. But if you were to think about it from first principles,
like if you didn't know how Silicon Valley worked,
if you didn't know that raising was a thing,
like why would you do that?
Like what is money really going to solve
for 90% of these startups
where the founders are lucky to have some savings?
I really think that your first instinct
should be to go out and build whatever you're dreaming of.
And sure, if you ever run into financial problems,
then sure, think about raising money then,
but don't waste all this effort and time
when you don't even know what you'd do with it.
Yeah, it's funny.
I feel like I'm one of the few investors
that actually tries to talk people out of fundraising often.
Oh, really?
Like, I actually had a conversation today
where the founder was talking about doing a raise,
and I'm like, why?
You know, you don't have to.
You can maintain control, et cetera.
And then the flip side of it is,
I would actually argue outside of Silicon Valley,
too few people raise venture capital
when the money can actually help them scale.
And so I feel like in Silicon Valley, there's too much and outside of Silicon Valley, there's
too little.
So it's this interesting spread of different models that sort of stick.
Edwin, what would you say to founders who feel like there's some external validation
necessary to, especially, hire a team or scale their team.
This is a very like common complaint or
rationale for going and raising more capital.
I think about it a couple ways.
So I guess it depends on what you mean by external validation.
Like in my mind, again, like I often think about things from a perspective of,
are you trying to build a startup that's actually going to change the world?
Like do you have this big thing that you're dreaming of?
And if you have this big thing that you're dreaming of, you...
Like, why do you care?
Maybe the way to think about it is in Sarah's context,
like, if you haven't... Say you're a YC founder,
you haven't been at Google, you haven't been at Meta,
you haven't been at Twitter,
you don't have this network of engineers,
you're a complete unknown, you haven't worked with very many people,
you're straight out of school.
How do you then attract that talent? And to your point, you can tell a story of how you're going
to build things or what you're going to do. But it is a harder obstacle to basically convince others
to join you, or for others to come on board, or to have money to pay them, if you don't have a long work history. So I think maybe that's the point Sarah's making.
Oh, yeah. So I think I would differentiate between maybe two things. Like, one is, do you need the money? So first of all, there's a difference between people who are totally fresh out of school, or maybe, you know, never went to school in the first place. And so maybe they don't have any savings, and they literally need some money in order to live. And then there's others who, okay, let's assume you don't necessarily need money, because again, you've been working at Google or Facebook for 10 years, or, you know, five years, whatever it is, and you have some savings. So the path kind of differs depending on those two scenarios. But I think one of the questions is, well, do you really need to go out and
hire all these people? Like one of the things I often see, again, like I'm curious what
you guys see, but one of the things I often see is, founders will tell me like, okay,
so I'm trying, I'm trying to think about the first few hires I'm going to make.
And they're like, yeah, I'm going to hire a PM, I'm going to hire a data scientist, these will be among my first five to 10 hires. I'm like, what? This is just wild to me. I would never hire a data scientist as one of the first three people in a company. And I say that because I used to be a data scientist. Data scientists are great when you want to optimize your product by 2% or 5%.
But that's definitely not what you want to be doing when you start a company.
You're trying to swing for 10x or 100x changes, not worrying and nitpicking about small percentage
points that are just noise anyways. And to some extent, like product managers. Product
managers are great when your company gets big enough, but at the beginning, you should
be thinking about yourself about what product you want to build. And your engineers should
be hands on. They should be having great ideas as well.
And so product management is kind of this weird conception that big companies have when your engineers don't have time to be in the weeds on the details and try things themselves. It's not a role that you'd come up with otherwise.
So I guess with the initial surge team,
it sounds like you had sort of a small,
initial, tight engineering team.
You guys started building product.
You were bootstrapping off of revenue.
At this point, you're at over a billion dollars in revenue, which is amazing.
How do you think about the future of how you want to shape the organization, how big you
want to get, the different products you're launching and introducing?
What do you view as the future of Surge and how that's all going to evolve?
Before we do that, can you just explain at whatever level of detail makes sense here
what the billion dollars of revenue
is, maybe like how the product supports the company, who your data, who your humans are, because I think there's just very little visibility into all of that. So in terms of what the product is,
I mean, at the end of the day, our product is our data. Like we literally deliver data to companies.
And that is what they use to train and evaluate their models. So imagine, you know, you're one of these frontier labs and you want to improve your model's coding abilities.
What we will do on our end is we will gather a lot of coding data.
And so this coding data may come in different forms.
Maybe it's SFT data, where we are literally writing out coding solutions, or maybe unit tests: these are the tests that a good piece of code must pass. Or maybe it's preference data, where it's, okay, here are two pieces of code, or here are two coding explanations, which one is better? Or these might be verifiers, like, okay, here's a web app that I created, I want to make sure that in the top right-hand corner of the screen there's a login button, or I want to make sure that when you click this button, something else happens. Like, there's a bunch of different forms that this data may take.
At the end of the day, what we're doing is we're delivering data.
They'll basically help the models improve on these capabilities.
Very, very related to that is this notion of evaluating the models.
Like you also want to know, yeah, is this good coding model?
Is it better than this other one?
What are the errors in which this model is weak and this model is worse? Like what insights can we get from that?
And so in addition to the data, oftentimes we're delivering insights to our customers,
we're delivering loss patterns,
we're delivering failure modes.
So there may be a lot of other things related to data,
but I think it's like this universe of applications
or just like this universe around the data
that we deliver and that is our product.
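To make the data formats Edwin lists a bit more concrete, here is a small, hypothetical sketch of what SFT, preference, and verifier records could look like; the field names are my own illustration, not Surge's actual schema.

```python
# Hypothetical examples of the data formats described above (field names invented).

sft_example = {
    "kind": "sft",
    "prompt": "Write a Python function that checks if a string is a palindrome.",
    "completion": "def is_palindrome(s):\n    s = s.lower()\n    return s == s[::-1]",
}

preference_example = {
    "kind": "preference",
    "prompt": "Explain what this regex does: ^a+b$",
    "response_a": "Matches one or more 'a' characters followed by a single 'b'.",
    "response_b": "Matches the letters a and b.",
    "chosen": "response_a",          # picked by a human rater
    "rationale": "A is precise about anchors and repetition; B is vague.",
}

verifier_example = {
    "kind": "verifier",
    "task": "Build a login page for the web app.",
    "checks": [
        {"assert": "element_exists", "selector": "#login-button", "region": "top-right"},
        {"assert": "click_navigates", "selector": "#login-button", "target": "/dashboard"},
    ],
}

for record in (sft_example, preference_example, verifier_example):
    print(record["kind"], "->", list(record)[1:])
```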
Yeah, and maybe going back to Elad's question, maybe product isn't actually the right word here, but what's, like, repeatable about the company? Or what are the core capabilities that you guys have where you would say your competitors, you know, fail to meet the mark?
The way we think about the company, and the way we differentiate from others, is that a lot of other companies in this space are essentially just body shops. What they are delivering is not data, they are literally just delivering warm bodies to companies.
And so what that means is like at the end of the day, they don't have any
technology and one of our fundamental beliefs is that again, quality is the
most important thing at the end of the day.
Like, is this high quality data?
Is this a good coding solution?
Is this a good unit test?
Is this mathematical problem solved correctly?
Is this a great poem?
And basically, just as a relic of how things have worked out historically, a lot of companies in this space have treated quality and data as a commodity. Like, one of the ways we often think about it is: imagine you're trying to draw a bounding box around a car. Sarah, you and I, we're probably going to draw the same bounding box. Ask Hemingway and ask a second grader; at the end of the day, we're all going to draw the same bounding box. There's not much difference that we can do. So there's a very, very low ceiling on quality. But then take something like
writing poetry. Well, I suck at writing poetry. Hemingway is definitely going to write a much
better poem than I am. Or imagine a, I don't know, a VC pitch deck.
You're going to write a much better,
you're going to create a much better pitch deck than I will.
And so there's almost an unlimited ceiling in this gen AI world
on the type of quality that you can build.
And so the way we think of our product is like we have a platform.
We have actual technology that we're using to measure the quality
that our workers or annotators are generating.
If you don't have that technology, if you don't have any way of measuring it.
Is the measurement through human evaluation? Is it through model-based evaluation? I'm a little
bit curious how you create that feedback loop since to some extent it's a little bit of this
question of how do you have enough evaluators to evaluate the output relative to the people
generating the output? Or do you use models? Or how do you approach it?
I think one analogy that we often make is think about something like Google search
or think about something like YouTube. Like you have, you know, millions of search results.
You have millions of web pages, you have millions of videos. How do you evaluate the qualities
of these videos? Like, is this a high quality webpage? Is it informative or is it really spammy? The way you do this is you gather so many signals. You gather page-dependent signals, you gather user-dependent signals, you gather activity-based signals, and all of these feed into a giant ML algorithm at the end of the day. It's the same way for us: we gather all these signals about our annotators, about the work that they're performing, about their activity on the site, and we basically have an ML team internally that builds a lot of algorithms to measure all of this.
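To make the idea concrete, here's a minimal sketch, assuming hypothetical signal names and hand-tuned weights, of how per-annotator signals might be blended into a quality score. Surge's actual system is an internal ML pipeline, so this is only illustrative.

```python
# Illustrative sketch only: hypothetical signals and a hand-tuned logistic blend.
# A production system would learn these weights from labeled audits.
import math
from dataclasses import dataclass

@dataclass
class AnnotatorSignals:
    agreement_with_trusted: float  # agreement rate vs. gold/expert labels, 0-1
    time_per_task_seconds: float   # suspiciously fast work is a red flag
    revision_acceptance: float     # fraction of work accepted after review, 0-1
    activity_anomaly: float        # 0 = normal activity pattern, 1 = highly anomalous

def quality_score(s: AnnotatorSignals) -> float:
    """Blend signals into a 0-1 quality estimate (hypothetical weights)."""
    speed_penalty = 1.0 if s.time_per_task_seconds < 20 else 0.0
    z = (
        3.0 * s.agreement_with_trusted
        + 2.0 * s.revision_acceptance
        - 1.5 * s.activity_anomaly
        - 1.0 * speed_penalty
        - 1.5  # bias term
    )
    return 1.0 / (1.0 + math.exp(-z))  # squash to a probability-like score

print(quality_score(AnnotatorSignals(0.92, 240, 0.88, 0.05)))  # strong annotator
print(quality_score(AnnotatorSignals(0.55, 12, 0.40, 0.70)))   # likely low quality
```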
What is changing or breaking as you are scaling increasingly sophisticated annotations?
Like, if the model quality baseline is going up every couple of months, then the expectation is it exceeds what might have been a random human at some point, as you said, like anyone can draw a bounding box, and in all of these different fields, you know, we have models better than the 90th percentile human at some point. So this is actually something that we
do a lot of internal research on ourselves as well. So there's basically this field of AI
alignment called scalable oversight, which is basically this question of how do you
have models and humans working together, hand in hand, to produce data that is better than either one of them can achieve on their own. So even today, take something like writing an SFT story: a couple years ago, we might have written that story completely from scratch ourselves. But today, that's just not very efficient, right? You might start with a story that a model created, and then you would edit it.
You might edit it in a very substantial way, like maybe just the core of it is very vanilla,
very generic, but there's just so much kind of like cruft that is just inefficient for a human to do
and doesn't really benefit from the human creativity and human ingenuity that we're
trying to add into the response. So you can just start with this bare bones structure that you're
basically just layering on top of.
And so again, there's more sophisticated ways
of thinking about scalable oversight,
but just this question of how do you build the right interfaces?
How do you build the right tools?
How do you just combine people with AI in the right ways
to make them more efficient?
It is something that we build a lot of technology for.
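As a rough illustration of the scalable-oversight pattern described here, the sketch below, with made-up function names and placeholder model calls, shows a human editing a model draft rather than writing from scratch, keeping both versions so the edit itself becomes signal.

```python
# Minimal sketch of a human-in-the-loop drafting workflow (all names hypothetical).
from dataclasses import dataclass

@dataclass
class OversightRecord:
    prompt: str
    model_draft: str     # cheap scaffolding produced by a model
    human_final: str     # human adds the creativity/judgment the draft lacks
    edit_distance: int   # rough proxy for how much human value was added

def levenshtein(a: str, b: str) -> int:
    """Simple edit distance, used here as a crude 'human contribution' proxy."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def collect_record(prompt: str, draft_fn, edit_fn) -> OversightRecord:
    draft = draft_fn(prompt)          # e.g. a model generates a vanilla draft
    final = edit_fn(prompt, draft)    # a human rewrites the parts that matter
    return OversightRecord(prompt, draft, final, levenshtein(draft, final))

# Usage with stand-in functions:
record = collect_record(
    "Write a short story about a lighthouse keeper.",
    draft_fn=lambda p: "A generic story about a lighthouse keeper...",
    edit_fn=lambda p, d: d.replace("generic", "strange, salt-bitten"),
)
print(record.edit_distance)
```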
A lot of the discussion in terms of what human data the labs want has moved to RL environments
and reward models in recent months.
What is hard about this or what are you guys working on here?
So we do a lot of work building RL environments. And I think one of the things that people really underestimate is how complicated it is, that you can't just synthetically generate it. Like, for example, you need a lot of tools because these are massive environments that people want.
Can you give an example of, like,
just to make it more real? Like, imagine you are a salesperson. And when you are a salesperson,
you need to be interacting with Salesforce, You need to be getting leads through Gmail.
You're going to be talking to customers in Slack.
You're going to be creating Excel sheets, tracking your leads.
You're going to be, I don't know, writing Google Docs and
making PowerPoint presentations to present things to customers.
And so you want to basically build these very rich environments that are literally simulating your entire world as a salesperson. It literally is, imagine your entire world, with everything on your desktop,
and then in the future, everything that is, you know,
not on your desktop as well.
Like maybe you have a calendar, maybe there's,
maybe you need to travel to a meeting to meet a customer,
and then you want to simulate a car accident happening,
and you're getting notified of that.
So you need to like leave a little bit earlier.
Like all of these things are things that we actually
want to model in these very, very rich RL environments. And so the question is, how do you generate all of the data that
goes into this? Like, okay, you're going to need to generate like thousands of Slack messages,
hundreds of emails, you need to make sure that these are all consistent with each other.
You need to make sure, going back to my car example, that time is evolving in these environments and certain external events happen. How do you do all this, and in a way that's actually interesting and creative, but also realistic
and not like incongruent with each other.
Like there's just like a lot of thought
that needs to go into these environments
to make sure that they're, again,
like rich creative environments
that models can learn interesting things from.
And so, yeah, you basically need a lot of tools
and kind of sophistication for creating these.
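To give a feel for what "rich, internally consistent environment data" might look like, here's a tiny sketch with invented app names and checks. Real environments at the labs or at Surge would be far larger; this only illustrates the consistency problem Edwin describes.

```python
# Tiny illustrative sketch of a simulated workday environment (all details invented).
from dataclasses import dataclass, field

@dataclass
class Event:
    timestamp: float   # hours since the start of the simulated day
    app: str           # e.g. "gmail", "slack", "salesforce", "calendar"
    payload: dict

@dataclass
class SalesWorld:
    events: list[Event] = field(default_factory=list)

    def add(self, t: float, app: str, **payload) -> None:
        self.events.append(Event(t, app, payload))

    def check_consistency(self) -> list[str]:
        """Cheap sanity checks: time must move forward, replies need a prior message."""
        issues = []
        seen_threads = set()
        for prev, cur in zip(self.events, self.events[1:]):
            if cur.timestamp < prev.timestamp:
                issues.append(f"time goes backwards at {cur.app}")
        for e in self.events:
            thread = e.payload.get("thread")
            if e.payload.get("is_reply") and thread not in seen_threads:
                issues.append(f"reply in {e.app} to a thread that never existed")
            if thread:
                seen_threads.add(thread)
        return issues

world = SalesWorld()
world.add(9.0, "gmail", thread="lead-42", is_reply=False, text="Intro from a new lead")
world.add(9.5, "slack", thread="deal-review", is_reply=False, text="Can someone take lead-42?")
world.add(10.0, "calendar", text="Car accident on route; meeting pushed 30 min")
print(world.check_consistency())  # [] if the generated world is internally consistent
```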
Is there any intuition for how real or how complex is enough?
Or is it just like, you know, there's
no ceiling on the realism that is useful here
or the complexity of environment that is useful here?
I think there's no ceiling.
At the end of the day, you just want as much diversity and richness as
you can get because the more richness that you have, the more the models can learn from.
The longer the time horizons, the more the models can learn on and improve on.
So I think there's almost an unlimited ceiling here.
If you were to make a five or ten-year bet on what scales most in terms of demand from people training AI models, what types of data would it be? Is it RL environments, or is it traces of expert reasoning, or what other areas do you think there's going to be really large demand for?
I mean, I think it will be all of the above. I don't think RL environments alone will suffice, just because, I mean, it depends on how you think about RL environments, but oftentimes these are very, very rich trajectories that are very, very long. And so it's almost inconceivable that a single reward, and I think even today we often think about things in terms of multiple rewards, not just a single reward, but a single reward may just not be rich enough to capture all the work that goes into the model solving some very, very complicated goal. So I think it'll probably be a combination of all of those.
If you assume eventually some form of superhuman performance across different model types relative
to human experts, how do you think about the role of humans relative to data and data generation
versus synthetic data or other approaches?
At what point does human input run out as a useful point of either feedback or data
generation?
So I think human feedback will never run out, and that's for a couple of reasons.
So even if I think about the landscape today, I think people often overestimate the role
of synthetic data.
Personally, I think synthetic data actually is very, very useful. We use it a ton ourselves in order to supplement what the humans do. Like I said earlier, there's a lot of cruft that sometimes isn't worth a human's time. But what we often find is that, for example, a lot of the time customers will come to us and they'll be like, yeah, for the past six months I've been experimenting with synthetic data, I've got 10 to 20 million pieces of synthetic data, and actually, yeah, we finally realized that 99% of it just wasn't useful. And so right now we're trying to curate the 5% that is useful, but we are literally going to throw out 9 million of it. And oftentimes they'll find that, yeah, actually even a thousand pieces of really, really high quality, highly curated human data is actually more valuable than those 10 million points.
So that is one thing I'll say.
Another thing I'll say is that it's almost like sometimes you need an external signal
to the models.
The models just think so differently from humans that you always need to make sure that
they're kind of aligned with the actual objectives that you want.
We can give two examples.
So one example is kind of funny. One of the frontier models, I won't say which one, but it's one of the top models, or one of the models everybody thinks is one of the top. If you go use it today, maybe 10% of the time when I use it, it'll just output random Hindi characters and random Russian characters into one of my responses. So I'll be like, tell me about Donald Trump, tell me about Barack Obama, and just in the middle of it, it will output Hindi and Russian. It's like, what is this? And the model just isn't self-consistent enough to be aware of this. It's almost like you need an external human to tell the model that, yeah, this is wrong.
One of the things I think is a giant plague on AI is LMSYS, LM Arena. And I'll skip the details for now, but right now people will often train their models on the wrong objectives. So the mental model that you should have of LMSYS, LM Arena, is that people are writing prompts, they'll get two responses, and they'll spend five, ten seconds looking at the responses and just pick whichever one looks better to them. So they're not evaluating whether or not the model hallucinated, they're not evaluating the factual accuracy or whether it's following instructions, they're literally just vibing with the model, like, okay, yeah, this one seemed better because it had a bunch of formatting, it had a bunch of emojis, it just looks more impressive. And people will train on basically an LMSYS objective, and they won't realize all the consequences of it. And again, the model itself doesn't know what its objective is. You almost need an external quality signal in order to tell it what the right objective should be. And if you don't have that, then the model will just go in all these crazy directions. Again, you may have seen some of the results with Llama 4, but it'll just go in all these crazy directions, which kind of means you need these external auditors.
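For readers unfamiliar with how arena-style preferences turn into a training objective, here is a minimal, hypothetical sketch of the standard pairwise (Bradley-Terry) loss. If raters pick whichever response looks prettier, a reward model trained this way will happily learn to reward length and emojis rather than factuality; the toy reward function below exaggerates that failure on purpose.

```python
# Minimal sketch of the pairwise preference loss used to train reward models.
# The reward function is a deliberately silly stand-in that scores surface polish,
# to illustrate how shallow preferences produce a shallow objective.
import math

def toy_reward(response: str) -> float:
    """Stand-in 'reward model' that only sees surface features (length, emojis, bold)."""
    return 0.01 * len(response) + 1.0 * response.count("🚀") + 0.5 * response.count("**")

def pairwise_loss(chosen: str, rejected: str) -> float:
    """Bradley-Terry style loss: -log sigmoid(r_chosen - r_rejected)."""
    margin = toy_reward(chosen) - toy_reward(rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

concise_and_accurate = "Lincoln was the 16th U.S. president, serving 1861-1865."
long_and_flashy = "**Abraham Lincoln** 🚀🚀 was an *incredible* leader! " * 5

# If raters click on whatever looks more impressive, 'long_and_flashy' becomes 'chosen',
# and minimizing this loss teaches the reward model to prefer clickbait.
print(pairwise_loss(chosen=long_and_flashy, rejected=concise_and_accurate))
```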
This also happens actually when you do different forms of like protein evolution or things
like that where you select a protein against a catalytic function or something else and
you just kind of randomize it and have like a giant library of them and you end up with
the same thing where you have these really weird activities that you didn't anticipate
actually happening.
And so I sometimes think of model training as
almost this odd evolutionary landscape that you're effectively evolving and
selecting against and you're kind of shaping the model into that local maxima
or something. And so it's kind of this really interesting output of anything
where you're effectively evolving against a feedback signal. And depending
on what that feedback signal is, you just end up with these odd results. So
it's interesting to see how it kind of transfers across domains.
These coarse, as you said, five-second-reaction academic benchmarks, or even non-academic industrial benchmarks, are easily hacked, or are not the right gauge of performance against any given task. Yet they are very popular. What is the alternative for somebody who's trying to choose the right model or understand model capability?
So the alternative that I think all the Frontier Labs view as a gold standard is basically human
evaluation. So again, proper human evaluation where you're actually taking the time to look at
the response, you're going to fact check it, you're going to see whether or not it followed all the
instructions. You have good taste, so you know whether or not the model has good writing quality.
This concept of doing all that and spending all the time
to do that, as opposed to just vibing for five seconds,
I think actually is really, really important.
Because if you don't do this, you're
basically just training your models
on the analog of clickbait.
So I think it actually is really, really important
for model progress.
If it's not LMSYS, how should people actually evaluate model capability for any given task?
What all the frontier labs find is that human evals really are the gold standard. You really need to take a lot of time to fact check these responses, to verify that they're following instructions. You need people with good taste to evaluate the writing quality, and so on and so on. And if you don't do this, you're basically training your models on the analog of clickbait. And so I think that really, really harms model progress.
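By contrast, a proper human eval looks less like a single preference click and more like a structured rubric. A minimal sketch, with hypothetical dimensions and a toy aggregate, might look like this:

```python
# Illustrative sketch of a structured human eval record (dimensions are hypothetical).
from dataclasses import dataclass, asdict

@dataclass
class HumanEval:
    prompt: str
    response_id: str
    factual_errors: int          # found by actually fact-checking claims
    instructions_followed: bool  # checked against every constraint in the prompt
    hallucinated_sources: int
    writing_quality: int         # 1-5, judged by a rater with domain taste
    notes: str

def overall(e: HumanEval) -> float:
    """Toy aggregate: heavily penalize factual and instruction failures."""
    score = e.writing_quality / 5.0
    score -= 0.3 * e.factual_errors + 0.5 * e.hallucinated_sources
    if not e.instructions_followed:
        score -= 0.5
    return max(score, 0.0)

ev = HumanEval(
    prompt="Summarize this earnings report in 3 bullet points.",
    response_id="model-A-0017",
    factual_errors=1,
    instructions_followed=True,
    hallucinated_sources=0,
    writing_quality=4,
    notes="Clear summary, but misstates Q2 revenue.",
)
print(asdict(ev), overall(ev))
```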
Is there work that Surge is doing in this domain of, like, trying to standardize human
eval or make it more transparent to end consumers of the API or even users?
So internally, we actually do a lot of work today working with all the frontier labs
to help them understand their models.
So again, we're constantly evaluating them.
We're constantly surfacing loss areas for them to improve on
and so on and so on.
And so right now, a lot of this is internal,
but one of the things that we actually wanna do
is to start external forms of this as well.
Where we're helping educate people on,
yeah, like these are the different capabilities
of all these models. Here, these models are better at coding. Here, these models
are better at instruction following. Here, these models are actually hallucinating a
lot, so you just don't trust them as much. So we actually do want to start a lot of external
work to help educate the broader landscape on this.
If we can zoom in and talk just about the larger, like, competitive landscape and what
happens with frontier models over time, what does the Meta/Scale deal mean for you guys? Or what do you make
of it?
So I think it's kind of interesting. We were already the number one player in the space. It's been beneficial because, yeah, there were still some legacy teams using Scale; they just didn't know about us because we were still pretty under the radar. I think it's been beneficial because one of the things that we've always believed is that when people use these low quality data solutions, they kind of get burned on human data. They have this negative experience, and so then they don't want to use human data again, and they try these other methods that are honestly just a lot slower and don't come with the right objectives, and I think that just harms model progress overall. And so the more we can get all these frontier labs using high quality data, I think it really is beneficial for the industry as a whole. So I think, overall, it was a good thing to happen.
If you were to make a bet that an underdog catches up to OpenAI, Anthropic, and DeepMind, who would it be?
So I would bet on xAI.
I think they're just very hungry and mission-oriented in a way that gives them a lot of really unique
advantages.
I guess maybe another sort of broader question is, do you think there's three competitive
frontier models, 10 competitive frontier models a couple years from now?
And is any of those open source?
Yeah.
So I actually see more and more frontier models opening up over time, because I actually don't think that the models will become commodities. I think one of the things that has actually been surprising in the past couple of years is that all of the models have their own focuses that give them unique strengths. Like, for example, I think Anthropic's obviously been really, really amazing at coding and enterprise. OpenAI has this big consumer focus because of ChatGPT, and I actually really love its models' personality. And then Grok, you know, has a different set of things it's willing to say and to build. And so it's almost like every company has a different set of principles that they care about. Some will just never do one thing, others are totally willing to do it, others just have different focuses. Models will just have so many different facets to their personality, so many different facets to the type of skills that they will be good at. And sure, eventually AGI will maybe encompass all of this, but in the meantime, you just kind of need to focus. There are only so many focuses that you can have as a company. And so I think that will just lead to different strengths for all the model providers. So I think today, you know, we already see a lot of people, including me, who will switch between all the different models, just depending on what we're doing. And so in the future, I think that will just happen even more, as people use models for different aspects of their lives, both their personal and professional lives.
Going back to something Elad mentioned,
where should we expect to see Surge investing over time?
What do you think you guys will do a few years from now
that you don't do today?
Again, I think I'm really excited about this more
kind of public research push that we're starting to have.
I think it is really interesting in that, for obvious reasons, a lot of frontier labs are not publishing anymore. And as a result of that, I think the industry has fallen into kind of a trap that I worry about. So maybe to dig into some of the things I said earlier, with some of the negative incentives of the industry and some of the kind of concerning trends that we've seen.
So going back to LMSYS, one of the things that we'll see is a lot of researchers, they'll tell us that their VPs make them focus on increasing their rank on LMSYS.
And so I've had researchers explicitly tell me that they're okay with making their models
worse at factuality, worse at following instructions, as long as it improves their ranking, because their leadership just
wants to see these metrics go up.
And again, that is something that literally happens because the people ranking these things, they don't care whether the models are good at instruction following. They don't care whether the models are emitting factual responses. What they care about is, okay, did this model emit a lot of emojis? Did it emit a lot of bold words? Did it have really long responses? Because that's just going to look more impressive to them. Like, one of the things that we found is that the easiest way to improve your rank on LM Arena is to make your model responses longer. And so what happens is there are a lot of companies who are trying to improve their leaderboard rank, and they'll see progress for six months because all they're doing is unwittingly making their model responses longer and adding more emojis. And they don't realize that all they're doing is training the models to produce better clickbait. And they might finally realize six months or a year later, again, you may have seen some of these things in industry, but it basically means that they've spent the past six months making zero progress.
In a similar way, I think, you know, besides LMSYS you have all these academic benchmarks, and they're completely divorced from the real world. Like, a lot of teams are focused on improving these SAT-style scores instead of real world progress. I'll give an example: there's a benchmark called IFEval, which stands for instruction following eval. If you look at IFEval, some of the instructions it's trying to check whether models can follow are like, hey, can you write an essay about Abraham Lincoln? And every time you mention the word Abraham Lincoln, make sure that five of the letters are capitalized and all the other letters are lowercase. It's like, what is this? And sometimes we'll get customers telling us, yeah, we really, really need to improve our score on IFEval. And what this means, again, is you have all these companies, all these researchers who, instead of focusing on real world progress, are just optimizing for these silly SAT-style benchmarks. And so one of the things that we really want to do is just think
about ways to educate the industry, think about ways of publishing on our own, just like think
about ways of steering the industry into like hopefully a better direction. And so I think
that's just one big thing that we're really excited about and could be really
big in the next five years.
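As an aside, the IFEval-style constraint Edwin mentions is literally checkable with a few lines of code, which is part of why it's so easy to optimize for without any real-world progress. A rough sketch of that kind of check (my own approximation, not the benchmark's actual implementation) could look like this:

```python
# Rough approximation of an IFEval-style mechanical constraint check
# (not the benchmark's actual code, just an illustration of how shallow it is).
import re

def check_lincoln_capitalization(essay: str, required_capitals: int = 5) -> bool:
    """Every mention of 'Abraham Lincoln' must have exactly N capital letters."""
    mentions = re.findall(r"abraham lincoln", essay, flags=re.IGNORECASE)
    if not mentions:
        return False
    return all(sum(c.isupper() for c in m) == required_capitals for m in mentions)

print(check_lincoln_capitalization("ABrahAm LinColn was president."))  # True: 5 capitals
print(check_lincoln_capitalization("Abraham Lincoln was president."))  # False: only 2
```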
Okay.
Yeah.
I mean, so Sarah brought up earlier how everybody kind of wants high quality data.
What does that mean?
How do you think about that?
How do you generate it?
Can you tell us a little bit more about your thoughts on that?
So let's say you wanted to train a model to write an eight-line poem about the moon. The way most companies think about it is, well, let's just hire a bunch of people from Craigslist, or through some recruiting agency, and let's ask them to write poems. And then the way they think about quality is, well, is this a poem? Is it eight lines? Does it contain the word moon? If so, okay, yeah, it hit these three checkboxes, so sure, this is a great poem, because it follows all these instructions. But if you think about it, the reality is you get these terrible poems. Sure, it's eight lines and has the word moon, but they feel like they're written by kids in high school. And so other companies will be like, okay, sure, these people on Craigslist don't have any poetry experience, so what I'm going to do instead is hire a bunch of people with PhDs in English literature.
But this is also terrible. Like a lot of PhDs, they are actually not good writers or poets.
Like if you think of like think of Hemingway or Emily Dickinson, they definitely didn't
have a PhD.
I don't think they even completed college.
And like one of the things I will say is like, yeah, I went to MIT.
I think you went there too.
And a lot of people I knew from MIT who graduated with a CS degree are terrible coders.
And so we think about quality completely differently.
Like, what we want isn't poetry that checks the boxes, like, okay, yeah, it checked these boxes and used some complicated language. We want the type of poetry that Nobel Prize laureates would write.
So what you want is, okay, we want to recognize that poetry is actually really subjective and rich. Like, maybe one poem is a haiku about moonlight on water. And there's another poem that has a lot of internal rhyme and meter. And another one that, I don't know, focuses on the emotion behind the moon rising at night. And so you actually want to capture that there are thousands of
ways to write a poem about the moon. There isn't a single correct way, and each one gives
you all these different insights into language and imagery and poetry. If you think about
it, it's not just poetry, it's like math, there's a thousand ways probably to prove
the Pythagorean theorem. And so I think the difference is that when you think about quality
the wrong way, you kind of get commodity data that optimizes for things like inter-annotator agreement.
And again, checking boxes off of some list.
But one of the things that we try to teach all of our customers is that high quality
data actually really embraces human intelligence and creativity.
And when you train the models on this like richer data, they don't just learn to follow
instructions.
They really learn all of these deeper patterns about all the stuff that makes language in the world really compelling and meaningful. And so I think a lot of companies,
they just throw humans at the problem and they think that you can get good data that way. But
I think you really need to think about quality from first principles and what it means. And you
need a lot of technology to identify, yeah, that these are amazing poems and these are creative
math problems and these are games and web apps that are beautiful and fun to play. And these
ones are terrible to use. So like you really need to build a lot of technology
and think about quality in the right way.
Otherwise, you're basically just scaling up mediocrity.
That sounds very domain-specific.
So do you, like, in every domain, are you building a lens
of what quality looks like along with your partners?
Yeah, I mean, I think we have kind of holistic quality principles,
but then oftentimes there are differences per domain.
So it's like a combination of both.
I think we got all the core topics.
Nice work on podcast number two, Edwin. And thanks for doing this.
Congrats on all the progress with the business.
Yeah, no, thanks so much for having us.
Yeah, it's great.
Great meeting you guys.
Find us on Twitter at no priors pod.
Subscribe to our YouTube channel.
If you want to see our faces, follow the show on Apple Podcasts, Spotify, or wherever you listen.
That way you get a new episode every week. And sign up for emails or find
transcripts for every episode at no-priors.com.