No Priors: Artificial Intelligence | Technology | Startups - The Power of Quality Human Data with SurgeAI Founder and CEO Edwin Chen
Episode Date: July 24, 2025. In the generative AI revolution, quality data is a valuable commodity. But not all data is created equally. Sarah Guo and Elad Gil sit down with SurgeAI founder and CEO Edwin Chen to discuss the meaning and importance of quality human data. Edwin talks about why he bootstrapped Surge instead of raising venture funds, the importance of scalable oversight in producing quality data, and the work Surge is doing to standardize human evals. Plus, we get Edwin’s take on what Meta’s investment into Scale AI means for Surge, as well as whether or not he thinks an underdog can catch up with OpenAI, Anthropic, and other dominant industry players. Sign up for new podcasts every week. Email feedback to show@no-priors.com Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @echen | @HelloSurgeAI Chapters: 00:00 – Edwin Chen Introduction 00:41 – Overview of SurgeAI 02:28 – Why SurgeAI Bootstrapped Instead of Raising Funds 07:59 – Explaining SurgeAI’s Product 09:39 – Differentiating SurgeAI from Competitors 11:27 – Measuring the Quality of SurgeAI’s Output 12:25 – Role of Scalable Oversight at SurgeAI 14:02 – Challenges of Building Rich RL Environments 16:39 – Predicting Future Needs for Training AI Models 17:29 – Role of Humans in Data Generation 21:27 – Importance of Human Evaluation for Quality Data 22:51 – SurgeAI’s Work Toward Standardization of Human Evals 23:37 – What the Meta/ScaleAI Deal Means for SurgeAI 24:35 – Edwin’s Underdog Pick to Catch Up to Big AI Companies 24:50 – The Future Frontier Model Landscape 26:25 – Future Directions for SurgeAI 29:29 – What Does High Quality Data Mean? 32:26 – Conclusion
Transcript
Hi, listeners. Welcome back to NoPriors. Today, Elad and I are here with Edwin Chen, the founder
and CEO of Surge, the bootstrapped human data startup that surpassed a billion in revenue
last year and serves top tier clients like Google, OpenAI, and Anthropic. We talk about
what high quality human data means, the role of humans as models become superhuman,
benchmark hacking, why he believes in a diversity of frontier models, the Scale/Meta not-quite-M&A
deal, and why there's no ceiling on environment quality for RL or the simulated worlds that
labs want to train agents in. Edwin, thanks for joining us.
Great, great to see you guys today.
Surge has been really under the radar until just about now.
Can you give us a little bit of color on sort of scale of the company and what the original
founding thesis was?
So we hit over a billion in revenue last year.
We are kind of like the biggest human data player in this space.
And we're about 100, a little over 100 people.
And our original thesis was,
we just really believed in the power of human data
to advance AI.
And we just had this really big focus from the start
of making sure that we had the highest quality data possible.
Can you give people context for how long you've been around,
how you got going, et cetera?
I think, again, you all have accomplished
an enormous amount in a short period of time.
And I think, you know, you've been very quiet
about some of the things you've been doing. So we're going to just get a little bit of history
and you know, when you started, how you got started and how long you've been around.
Oh, yeah. So we've been around for five years. I think we just hit our five year anniversary.
So we started in 2020. Before that, to give some of the context, I used to work at Google, Facebook, and Twitter. And basically the reason we started Surge was that I used to work on ML at a bunch of these big companies.
And the problem I kept running into over and over again was that it was really impossible to get the data that we needed to train our models.
So it's just this big blocker that we faced over and over again.
And there was just like so much more that we wanted to do.
Like even just the basic things that we want to do, we struggled so hard to get the data.
It was really just the big blocker.
But then simultaneously,
there were all these more futuristic things
that we wanted to build.
Like, if you thought about the next generation of AI systems: if we could barely get the data that we needed at the time to solve something simple, like building a basic sentiment classifier, then how would we ever advance beyond that? So that really was the biggest problem. I can go into more of that, but that was the big problem we faced.
And you guys are also known for having bootstrapped the company, versus raising a lot of external venture money
or things like that.
Do you want to talk about that choice
in terms of going profitable early
and then scaling off of that?
In terms of why we didn't raise: I mean, part of it was obviously just that we didn't need the money. I think we were very, very lucky to be profitable from the start, so we didn't need the money.
It always felt weird to give up control.
And like one of the things I've always hated about Silicon Valley is that you see so many
people raising for the sake of raising.
Like, I think one of the things that I often see is that a lot of founders that I know,
they don't have some big dream of building a product that solves some idea that they
really believe in.
Like, if you talk to a bunch of YC founders or whoever it is, like what is their goal?
It really is to tell all their friends that they raised $10 million and show their parents
they got a headline on TechCrunch.
Like that is their goal.
Like I think of like my friends at Google.
They often tell me, oh yeah, I've been at Google or Facebook for 10 years and I want
to start a company.
I'm like, okay, so what problem do you want to solve?
They don't know.
They're like, yeah, I just want to start something new. I'm bored. And it's weird because they can pay their own salaries for a couple of months. Again, they've been at Google or Facebook for 10 years. They're not just fresh out of school.
They can pay their own salaries. But the first thing they think about is just going out and
raising money. And it's always struck me as weird, because they might try talking to some users and they might try building an MVP, but they kind of just do it in this throwaway manner, where the only reason they do it is to check off a box on a startup accelerator application. And then they'll just play around with these random product ideas, and maybe they happen to get a little bit of traction so that a VC DMs them.
And so they spend all their time tweeting
and they go to these VC dinners
and it's all just so that they can show the world
that they raised a big amount of money.
And so I think raising immediately always felt silly to me.
Like everybody's default is to just immediately raise. But if you were to think about it from first principles,
like if you didn't know how Silicon Valley worked,
if you didn't know that raising was a thing,
like why would you do that?
Like what is money really going to solve
for 90% of these startups
where the founders are lucky to have some savings?
I really think that your first instinct
should be to go out and build whatever you're dreaming of.
And sure, if you ever run into financial problems,
then sure, think about raising money then,
but don't waste all this effort and time
when you don't even know what you'd do with it.
Yeah, it's funny.
I feel like I'm one of the few investors
that actually tries to talk people out of fundraising often.
Oh, really?
Like, I actually had a conversation today
where the founder was talking about doing a raise,
and I'm like, why?
You know, you don't have to.
You can maintain control, et cetera.
And then the flip side of it is,
I would actually argue outside of Silicon Valley,
too few people raise venture capital
when the money can actually help them scale.
And so I feel like in Silicon Valley, there's too much and outside of Silicon Valley, there's
too little.
So it's this interesting spread of different models that sort of stick.
Edwin, what would you say to founders who feel like there's some external validation
necessary to, especially, hire a team or scale their team.
This is a very like common complaint or
rationale for going and raising more capital.
I think about it a couple ways.
So I guess it depends on what you mean by external validation.
Like in my mind, again, like I often think about things from a perspective of,
are you trying to build a startup that's actually going to change the world?
Like do you have this big thing that you're dreaming of?
And if you have this big thing that you're dreaming of, you...
Like, why do you care?
Maybe the way to think about it is in Sarah's context,
like, if you haven't... Say you're a YC founder,
you haven't been at Google, you haven't been at Meta,
you haven't been at Twitter,
you don't have this network of engineers,
you're a complete unknown, you haven't worked with very many people,
you're straight out of school.
How do you then attract that talent? And to your point, you can tell a story of how you're going
to build things or what you're going to do. But it is a harder obstacle to basically convince others
to join you, or for others to come on board, or to have money to pay them, if you don't have a long work history. So I think maybe that's the point Sarah's making.
Oh, yeah. So I think I would differentiate between maybe two things. Like, one is, do you need the money? So first of all, there's a difference between people who are totally fresh out of school, or maybe, you know, never went to school in the first place. And so maybe they don't have any savings, and they literally need some money in order to live. And then there's others who, okay, let's assume you don't necessarily need money, because again, you've been working at Google or Facebook for 10 years, or, you know, five years, whatever it is, and you have some savings. So the path kind of differs depending on those two scenarios. But I think one of the questions is, well, do you really need to go out and
hire all these people? Like one of the things I often see, again, like I'm curious what
you guys see, but one of the things I often see is, founders will tell me like, okay,
so I'm trying, I'm trying to think about the first few hires I'm going to make.
And they're like, yeah, I'm going to hire a PM, I'm going to hire a data scientist, these will be among my first five to 10 hires. I'm like, what? This is just wild to me. I would never hire a data scientist as one of the first three people in a company. And I say that because I used to be a data scientist. Data scientists are great when you want to optimize your product by 2% or 5%.
But that's definitely not what you want to be doing when you start a company.
You're trying to swing for 10x or 100x changes, not worrying and nitpicking about small percentage
points that are just noise anyways. And to some extent, like product managers. Product
managers are great when your company gets big enough, but at the beginning, you should
be thinking about yourself about what product you want to build. And your engineers should
be hands on. They should be having great ideas as well.
And so product management is kind of this weird conception that big companies have when your engineers don't have time to be in the weeds on the details and try things themselves. It's not a role that you'd come up with otherwise.
So I guess with the initial surge team,
it sounds like you had sort of a small,
initial, tight engineering team.
You guys started building product.
You were bootstrapping off of revenue.
At this point, you're at over a billion dollars in revenue, which is amazing.
How do you think about the future of how you want to shape the organization, how big you
want to get, the different products you're launching and introducing?
What do you view as the future of Surge and how that's all going to evolve?
Before we do that, can you just explain at whatever level of detail makes sense here
what the billion dollars of revenue
is, maybe like how the product supports the company, who your data, who your humans are, because I think there's just very little visibility into all of that. So in terms of what the product is,
I mean, at the end of the day, our product is our data. Like we literally deliver data to companies.
And that is what they use to train and evaluate their models. So imagine, you know, you're one of these frontier labs and you want to improve your model's coding abilities.
What we will do on our end is we will gather a lot of coding data.
And so this coding data may come in different forms.
Maybe it's SFT data, where we are literally writing out coding solutions, or maybe unit tests: these are the tests that a good piece of code must pass. Or maybe it's preference data, where it's, okay, here are two pieces of code, or here are two coding explanations, which one is better? Or these might be verifiers, like, okay, here's a web app that I created, I want to make sure that in the top right-hand corner of the screen there's a login button, or I want to make sure that when you click this button, something else happens. Like, there's a bunch of different forms that this data may take.
At the end of the day, what we're doing is we're delivering data.
They'll basically help the models improve on these capabilities.
Very, very related to that is this notion of evaluating the models.
Like you also want to know, yeah, is this good coding model?
Is it better than this other one?
What are the errors in which this model is weak and this model is worse? Like what insights can we get from that?
And so in addition to the data, oftentimes we're delivering insights to our customers,
we're delivering loss patterns,
we're delivering failure modes.
So there may be a lot of other things related to data,
but I think it's like this universe of applications
or just like this universe around the data
that we deliver and that is our product.
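To make the data formats Edwin lists a bit more concrete, here is a small, hypothetical sketch of what SFT, preference, and verifier records could look like; the field names are my own illustration, not Surge's actual schema.

```python
# Hypothetical examples of the data formats described above (field names invented).

sft_example = {
    "kind": "sft",
    "prompt": "Write a Python function that checks if a string is a palindrome.",
    "completion": "def is_palindrome(s):\n    s = s.lower()\n    return s == s[::-1]",
}

preference_example = {
    "kind": "preference",
    "prompt": "Explain what this regex does: ^a+b$",
    "response_a": "Matches one or more 'a' characters followed by a single 'b'.",
    "response_b": "Matches the letters a and b.",
    "chosen": "response_a",          # picked by a human rater
    "rationale": "A is precise about anchors and repetition; B is vague.",
}

verifier_example = {
    "kind": "verifier",
    "task": "Build a login page for the web app.",
    "checks": [
        {"assert": "element_exists", "selector": "#login-button", "region": "top-right"},
        {"assert": "click_navigates", "selector": "#login-button", "target": "/dashboard"},
    ],
}

for record in (sft_example, preference_example, verifier_example):
    print(record["kind"], "->", list(record)[1:])
```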
Yeah, and maybe going back to Elad's question, maybe product isn't actually the right word here, but what's, like, repeatable about the company? Or what are the core capabilities that you guys have where you would say your competitors, you know, fail to meet the mark?
The way we think about the company, and the way we differentiate from others, is that a lot of other companies in this space are essentially just body shops. What they are delivering is not data, they are literally just delivering warm bodies to companies.
And so what that means is like at the end of the day, they don't have any
technology and one of our fundamental beliefs is that again, quality is the
most important thing at the end of the day.
Like, is this high quality data?
Is this a good coding solution?
Is this a good unit test?
Is this mathematical problem solved correctly?
Is this a great poem?
And basically, just as a relic of how things have worked out historically, a lot of companies in this space have treated quality and data as a commodity. Like, one of the ways we often think about it is: imagine you're trying to draw a bounding box around a car. Sarah, you and I, we're probably going to draw the same bounding box. Ask Hemingway and ask a second grader; at the end of the day, we're all going to draw the same bounding box. There's not much difference that we can do. So there's a very, very low ceiling on quality. But then take something like
writing poetry. Well, I suck at writing poetry. Hemingway is definitely going to write a much
better poem than I am. Or imagine a, I don't know, a VC pitch deck.
You're going to write a much better,
you're going to create a much better pitch deck than I will.
And so there's almost an unlimited ceiling in this gen AI world
on the type of quality that you can build.
And so the way we think of our product is like we have a platform.
We have actual technology that we're using to measure the quality
that our workers or annotators are generating.
If you don't have that technology, if you don't have any way of measuring it.
Is the measurement through human evaluation? Is it through model-based evaluation? I'm a little
bit curious how you create that feedback loop since to some extent it's a little bit of this
question of how do you have enough evaluators to evaluate the output relative to the people
generating the output? Or do you use models? Or how do you approach it?
I think one analogy that we often make is think about something like Google search
or think about something like YouTube. Like you have, you know, millions of search results.
You have millions of web pages, you have millions of videos. How do you evaluate the qualities
of these videos? Like, is this a high quality webpage? Is it informative or is it really spammy? The way you do this is you gather so many signals. You gather page-dependent signals, you gather user-dependent signals, you gather activity-based signals, and all of these feed into a giant ML algorithm at the end of the day. It's the same way for us: we gather all these signals about our annotators, about the work that they're performing, about their activity on the site, and we basically have an ML team internally that builds a lot of algorithms to measure all of this.
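To make the idea concrete, here's a minimal sketch, assuming hypothetical signal names and hand-tuned weights, of how per-annotator signals might be blended into a quality score. Surge's actual system is an internal ML pipeline, so this is only illustrative.

```python
# Illustrative sketch only: hypothetical signals and a hand-tuned logistic blend.
# A production system would learn these weights from labeled audits.
import math
from dataclasses import dataclass

@dataclass
class AnnotatorSignals:
    agreement_with_trusted: float  # agreement rate vs. gold/expert labels, 0-1
    time_per_task_seconds: float   # suspiciously fast work is a red flag
    revision_acceptance: float     # fraction of work accepted after review, 0-1
    activity_anomaly: float        # 0 = normal activity pattern, 1 = highly anomalous

def quality_score(s: AnnotatorSignals) -> float:
    """Blend signals into a 0-1 quality estimate (hypothetical weights)."""
    speed_penalty = 1.0 if s.time_per_task_seconds < 20 else 0.0
    z = (
        3.0 * s.agreement_with_trusted
        + 2.0 * s.revision_acceptance
        - 1.5 * s.activity_anomaly
        - 1.0 * speed_penalty
        - 1.5  # bias term
    )
    return 1.0 / (1.0 + math.exp(-z))  # squash to a probability-like score

print(quality_score(AnnotatorSignals(0.92, 240, 0.88, 0.05)))  # strong annotator
print(quality_score(AnnotatorSignals(0.55, 12, 0.40, 0.70)))   # likely low quality
```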
What is changing or breaking as you are scaling increasingly sophisticated annotations?
Like, if the model quality baseline is going up every couple of months, then the expectation is it exceeds what might have been a random human at some point, as you said, like anyone can draw a bounding box, and in all of these different fields, you know, we have models better than the 90th percentile human at some point. So this is actually something that we
do a lot of internal research on ourselves as well. So there's basically this field of AI
alignment called scalable oversight, which is basically this question of how do you
have models and humans working together, hand in hand, to produce data that is better than either one of them can achieve on their own. So even today, take something like writing an SFT story: a couple years ago, we might have written that story completely from scratch ourselves. But today, that's just not very efficient, right? You might start with a story that a model created, and then you would edit it.
You might edit it in a very substantial way, like maybe just the core of it is very vanilla,
very generic, but there's just so much kind of like cruft that is just inefficient for a human to do
and doesn't really benefit from the human creativity and human ingenuity that we're
trying to add into the response. So you can just start with this bare bones structure that you're
basically just layering on top of.
And so again, there's more sophisticated ways
of thinking about scalable oversight,
but just this question of how do you build the right interfaces?
How do you build the right tools?
How do you just combine people with AI in the right ways
to make them more efficient?
It is something that we build a lot of technology for.
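As a rough illustration of the scalable-oversight pattern described here, the sketch below, with made-up function names and placeholder model calls, shows a human editing a model draft rather than writing from scratch, keeping both versions so the edit itself becomes signal.

```python
# Minimal sketch of a human-in-the-loop drafting workflow (all names hypothetical).
from dataclasses import dataclass

@dataclass
class OversightRecord:
    prompt: str
    model_draft: str     # cheap scaffolding produced by a model
    human_final: str     # human adds the creativity/judgment the draft lacks
    edit_distance: int   # rough proxy for how much human value was added

def levenshtein(a: str, b: str) -> int:
    """Simple edit distance, used here as a crude 'human contribution' proxy."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def collect_record(prompt: str, draft_fn, edit_fn) -> OversightRecord:
    draft = draft_fn(prompt)          # e.g. a model generates a vanilla draft
    final = edit_fn(prompt, draft)    # a human rewrites the parts that matter
    return OversightRecord(prompt, draft, final, levenshtein(draft, final))

# Usage with stand-in functions:
record = collect_record(
    "Write a short story about a lighthouse keeper.",
    draft_fn=lambda p: "A generic story about a lighthouse keeper...",
    edit_fn=lambda p, d: d.replace("generic", "strange, salt-bitten"),
)
print(record.edit_distance)
```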
A lot of the discussion in terms of what human data the labs want has moved to RL environments
and reward models in recent months.
What is hard about this or what are you guys working on here?
So we do a lot of work building RL environments. And I think one of the things that people really underestimate is how complicated it is, that you can't just synthetically generate it. Like, for example, you need a lot of tools because these are massive environments that people want.
Can you give an example of, like,
just to make it more real? Like, imagine you are a salesperson. And when you are a salesperson,
you need to be interacting with Salesforce, You need to be getting leads through Gmail.
You're going to be talking to customers in Slack.
You're going to be creating Excel sheets, tracking your leads.
You're going to be, I don't know, writing Google Docs and
making PowerPoint presentations to present things to customers.
And so you want to basically build these very rich environments that are literally simulating your entire world as a salesperson. It literally is, imagine your entire world, with everything on your desktop,
and then in the future, everything that is, you know,
not on your desktop as well.
Like maybe you have a calendar, maybe there's,
maybe you need to travel to a meeting to meet a customer,
and then you want to simulate a car accident happening,
and you're getting notified of that.
So you need to like leave a little bit earlier.
Like all of these things are things that we actually
want to model in these very, very rich RL environments. And so the question is, how do you generate all of the data that
goes into this? Like, okay, you're going to need to generate like thousands of Slack messages,
hundreds of emails, you need to make sure that these are all consistent with each other.
You need to make sure, going back to my car example, that time is evolving in these environments and certain external events happen. How do you do all this, and in a way that's actually interesting and creative, but also realistic
and not like incongruent with each other.
Like there's just like a lot of thought
that needs to go into these environments
to make sure that they're, again,
like rich creative environments
that models can learn interesting things from.
And so, yeah, you basically need a lot of tools
and kind of sophistication for creating these.
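To give a feel for what "rich, internally consistent environment data" might look like, here's a tiny sketch with invented app names and checks. Real environments at the labs or at Surge would be far larger; this only illustrates the consistency problem Edwin describes.

```python
# Tiny illustrative sketch of a simulated workday environment (all details invented).
from dataclasses import dataclass, field

@dataclass
class Event:
    timestamp: float   # hours since the start of the simulated day
    app: str           # e.g. "gmail", "slack", "salesforce", "calendar"
    payload: dict

@dataclass
class SalesWorld:
    events: list[Event] = field(default_factory=list)

    def add(self, t: float, app: str, **payload) -> None:
        self.events.append(Event(t, app, payload))

    def check_consistency(self) -> list[str]:
        """Cheap sanity checks: time must move forward, replies need a prior message."""
        issues = []
        seen_threads = set()
        for prev, cur in zip(self.events, self.events[1:]):
            if cur.timestamp < prev.timestamp:
                issues.append(f"time goes backwards at {cur.app}")
        for e in self.events:
            thread = e.payload.get("thread")
            if e.payload.get("is_reply") and thread not in seen_threads:
                issues.append(f"reply in {e.app} to a thread that never existed")
            if thread:
                seen_threads.add(thread)
        return issues

world = SalesWorld()
world.add(9.0, "gmail", thread="lead-42", is_reply=False, text="Intro from a new lead")
world.add(9.5, "slack", thread="deal-review", is_reply=False, text="Can someone take lead-42?")
world.add(10.0, "calendar", text="Car accident on route; meeting pushed 30 min")
print(world.check_consistency())  # [] if the generated world is internally consistent
```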
Is there any intuition for how real or how complex is enough?
Or is it just like, you know, there's
no ceiling on the realism that is useful here
or the complexity of environment that is useful here?
I think there's no ceiling.
At the end of the day, you just want as much diversity and richness as
you can get because the more richness that you have, the more the models can learn from.
The longer the time horizons, the more the models can learn on and improve on.
So I think there's almost an unlimited ceiling here.
If you were to make a five or ten-year bet on what scales most in terms of demand from people training AI models, what types of data would it be? Is it RL environments, or is it traces of expert reasoning, or what other areas do you think there's going to be really large demand for?
I mean, I think it will be all of the above. I don't think RL environments alone will suffice, just because, I mean, it depends on how you think about RL environments, but oftentimes these are very, very rich trajectories that are very, very long. And so it's almost inconceivable that a single reward, and I think even today we often think about things in terms of multiple rewards, not just a single reward, but a single reward may just not be rich enough to capture all the work that goes into the model solving some very, very complicated goal. So I think it'll probably be a combination of all of those.
If you assume eventually some form of superhuman performance across different model types relative
to human experts, how do you think about the role of humans relative to data and data generation
versus synthetic data or other approaches?
At what point does human input run out as a useful point of either feedback or data
generation?
So I think human feedback will never run out, and that's for a couple of reasons.
So even if I think about the landscape today, I think people often overestimate the role
of synthetic data.
Personally, I think synthetic data actually is very, very useful. We use it a ton ourselves in order to supplement what the humans do. Like I said earlier, there's a lot of cruft that sometimes isn't worth a human's time. But what we often find is that, for example, a lot of the time customers will come to us and they'll be like, yeah, for the past six months I've been experimenting with synthetic data, I've got 10 to 20 million pieces of synthetic data, and actually, yeah, we finally realized that 99% of it just wasn't useful. And so right now we're trying to curate the 5% that is useful, but we are literally going to throw out 9 million of it. And oftentimes they'll find that, yeah, actually even a thousand pieces of really, really high quality, highly curated human data is actually more valuable than those 10 million points.
So that is one thing I'll say.
Another thing I'll say is that it's almost like sometimes you need an external signal
to the models.
The models just think so differently from humans that you always need to make sure that
they're kind of aligned with the actual objectives that you want.
We can give two examples.
So one example is kind of funny. One of the frontier models, I won't say which one, but it's one of the top models, or one of the models everybody thinks is one of the top. If you go use it today, maybe 10% of the time when I use it, it'll just output random Hindi characters and random Russian characters into one of my responses. So I'll be like, tell me about Donald Trump, tell me about Barack Obama, and just in the middle of it, it will output Hindi and Russian. It's like, what is this? And the model just isn't self-consistent enough to be aware of this. It's almost like you need an external human to tell the model that, yeah, this is wrong.
One of the things I think is a giant plague on AI is LMSYS, LM Arena. And I'll skip the details for now, but right now people will often train their models on the wrong objectives. So the mental model that you should have of LMSYS, LM Arena, is that people are writing prompts, they'll get two responses, and they'll spend five, ten seconds looking at the responses and just pick whichever one looks better to them. So they're not evaluating whether or not the model hallucinated, they're not evaluating the factual accuracy or whether it's following instructions, they're literally just vibing with the model, like, okay, yeah, this one seemed better because it had a bunch of formatting, it had a bunch of emojis, it just looks more impressive. And people will train on basically an LMSYS objective, and they won't realize all the consequences of it. And again, the model itself doesn't know what its objective is. You almost need an external quality signal in order to tell it what the right objective should be. And if you don't have that, then the model will just go in all these crazy directions. Again, you may have seen some of the results with Llama 4, but it'll just go in all these crazy directions, which kind of means you need these external auditors.
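For readers unfamiliar with how arena-style preferences turn into a training objective, here is a minimal, hypothetical sketch of the standard pairwise (Bradley-Terry) loss. If raters pick whichever response looks prettier, a reward model trained this way will happily learn to reward length and emojis rather than factuality; the toy reward function below exaggerates that failure on purpose.

```python
# Minimal sketch of the pairwise preference loss used to train reward models.
# The reward function is a deliberately silly stand-in that scores surface polish,
# to illustrate how shallow preferences produce a shallow objective.
import math

def toy_reward(response: str) -> float:
    """Stand-in 'reward model' that only sees surface features (length, emojis, bold)."""
    return 0.01 * len(response) + 1.0 * response.count("🚀") + 0.5 * response.count("**")

def pairwise_loss(chosen: str, rejected: str) -> float:
    """Bradley-Terry style loss: -log sigmoid(r_chosen - r_rejected)."""
    margin = toy_reward(chosen) - toy_reward(rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

concise_and_accurate = "Lincoln was the 16th U.S. president, serving 1861-1865."
long_and_flashy = "**Abraham Lincoln** 🚀🚀 was an *incredible* leader! " * 5

# If raters click on whatever looks more impressive, 'long_and_flashy' becomes 'chosen',
# and minimizing this loss teaches the reward model to prefer clickbait.
print(pairwise_loss(chosen=long_and_flashy, rejected=concise_and_accurate))
```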
This also happens actually when you do different forms of like protein evolution or things
like that where you select a protein against a catalytic function or something else and
you just kind of randomize it and have like a giant library of them and you end up with
the same thing where you have these really weird activities that you didn't anticipate
actually happening.
And so I sometimes think of model training as
almost this odd evolutionary landscape that you're effectively evolving and
selecting against and you're kind of shaping the model into that local maxima
or something. And so it's kind of this really interesting output of anything
where you're effectively evolving against a feedback signal. And depending
on what that feedback signal is, you just end up with these odd results. So
it's interesting to see how it kind of transfers across domains.
These coarse, as you said, five-second-reaction academic benchmarks, or even non-academic industrial benchmarks, are easily hacked, or are not the right gauge of performance against any given task. Yet they are very popular. What is the alternative for somebody who's trying to choose the right model or understand model capability?
So the alternative that I think all the Frontier Labs view as a gold standard is basically human
evaluation. So again, proper human evaluation where you're actually taking the time to look at
the response, you're going to fact check it, you're going to see whether or not it followed all the
instructions. You have good taste, so you know whether or not the model has good writing quality.
This concept of doing all that and spending all the time
to do that, as opposed to just vibing for five seconds,
I think actually is really, really important.
Because if you don't do this, you're
basically just training your models
on the analog of clickbait.
So I think it actually is really, really important
for model progress.
If it's not LMSYS, how should people actually evaluate model capability for any given task?
What all the frontier labs find is that human evals really are the gold standard. You really need to take a lot of time to fact check these responses, to verify that they're following instructions. You need people with good taste to evaluate the writing quality, and so on and so on. And if you don't do this, you're basically training your models on the analog of clickbait. And so I think that really, really harms model progress.
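By contrast, a proper human eval looks less like a single preference click and more like a structured rubric. A minimal sketch, with hypothetical dimensions and a toy aggregate, might look like this:

```python
# Illustrative sketch of a structured human eval record (dimensions are hypothetical).
from dataclasses import dataclass, asdict

@dataclass
class HumanEval:
    prompt: str
    response_id: str
    factual_errors: int          # found by actually fact-checking claims
    instructions_followed: bool  # checked against every constraint in the prompt
    hallucinated_sources: int
    writing_quality: int         # 1-5, judged by a rater with domain taste
    notes: str

def overall(e: HumanEval) -> float:
    """Toy aggregate: heavily penalize factual and instruction failures."""
    score = e.writing_quality / 5.0
    score -= 0.3 * e.factual_errors + 0.5 * e.hallucinated_sources
    if not e.instructions_followed:
        score -= 0.5
    return max(score, 0.0)

ev = HumanEval(
    prompt="Summarize this earnings report in 3 bullet points.",
    response_id="model-A-0017",
    factual_errors=1,
    instructions_followed=True,
    hallucinated_sources=0,
    writing_quality=4,
    notes="Clear summary, but misstates Q2 revenue.",
)
print(asdict(ev), overall(ev))
```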
Is there work that Surge is doing in this domain of, like, trying to standardize human
eval or make it more transparent to end consumers of the API or even users?
So internally, we actually do a lot of work today working with all the frontier labs
to help them understand their models.
So again, we're constantly evaluating them.
We're constantly surfacing loss areas for them to improve on
and so on and so on.
And so right now, a lot of this is internal,
but one of the things that we actually wanna do
is to start external forms of this as well.
Where we're helping educate people on,
yeah, like these are the different capabilities
of all these models. Here, these models are better at coding. Here, these models
are better at instruction following. Here, these models are actually hallucinating a
lot, so you just don't trust them as much. So we actually do want to start a lot of external
work to help educate the broader landscape on this.
If we can zoom in and talk just about the larger, like, competitive landscape and what
happens with frontier models over time, what does the Meta/Scale deal mean for you guys? Or what do you make
of it?
So I think it's kind of interesting. We were already the number one player in the space. It's been beneficial because, yeah, there were still some legacy teams using Scale; they just didn't know about us because we were still pretty under the radar. I think it's been beneficial because one of the things that we've always believed is that when people use these low quality data solutions, they kind of get burned on human data. They have this negative experience, and so then they don't want to use human data again, and they try these other methods that are honestly just a lot slower and don't come with the right objectives, and I think that just harms model progress overall. And so the more we can get all these frontier labs using high quality data, I think it really is beneficial for the industry as a whole. So I think, overall, it was a good thing to happen.
If you were to make a bet that an underdog catches up to OpenAI, Anthropic, and DeepMind, who would it be?
So I would bet on xAI.
I think they're just very hungry and mission-oriented in a way that gives them a lot of really unique
advantages.
I guess maybe another sort of broader question is, do you think there's three competitive
frontier models, 10 competitive frontier models a couple years from now?
And is any of those open source?
Yeah.
So I actually see more and more frontier models opening up over time, because I actually don't think that the models will become commodities. I think one of the things that has actually been surprising in the past couple of years is that all of the models have their own focuses that give them unique strengths. Like, for example, I think Anthropic's obviously been really, really amazing at coding and enterprise. OpenAI has this big consumer focus because of ChatGPT, and I actually really love its models' personality. And then Grok, you know, has a different set of things it's willing to say and to build. And so it's almost like every company has a different set of principles that they care about. Some will just never do one thing, others are totally willing to do it, others just have different focuses. Models will just have so many different facets to their personality, so many different facets to the type of skills that they will be good at. And sure, eventually AGI will maybe encompass all of this, but in the meantime, you just kind of need to focus. There are only so many focuses that you can have as a company. And so I think that will just lead to different strengths for all the model providers. So I think today, you know, we already see a lot of people, including me, who will switch between all the different models, just depending on what we're doing. And so in the future, I think that will just happen even more, as people use models for different aspects of their lives, both their personal and professional lives.
Going back to something Elad mentioned,
where should we expect to see Surge investing over time?
What do you think you guys will do a few years from now
that you don't do today?
Again, I think I'm really excited about this more
kind of public research push that we're starting to have.
I think it is really interesting in that, for obvious reasons, a lot of frontier labs are not publishing anymore. And as a result of that, I think the industry has fallen into kind of a trap that I worry about. So maybe to dig into some of the things I said earlier, with some of the negative incentives of the industry and some of the kind of concerning trends that we've seen.
So going back to LMSYS, one of the things that we'll see is a lot of researchers, they'll tell us that their VPs make them focus on increasing their rank on LMSYS.
And so I've had researchers explicitly tell me that they're okay with making their models
worse at factuality, worse at following instructions, as long as it improves their ranking, because their leadership just
wants to see these metrics go up.
And again, that is something that literally happens because the people ranking these things, they don't care whether the models are good at instruction following. They don't care whether the models are emitting factual responses. What they care about is, okay, did this model emit a lot of emojis? Did it emit a lot of bold words? Did it have really long responses? Because that's just going to look more impressive to them. Like, one of the things that we found is that the easiest way to improve your rank on LM Arena is to make your model responses longer. And so what happens is there are a lot of companies who are trying to improve their leaderboard rank, and they'll see progress for six months because all they're doing is unwittingly making their model responses longer and adding more emojis. And they don't realize that all they're doing is training the models to produce better clickbait. And they might finally realize six months or a year later, again, you may have seen some of these things in industry, but it basically means that they've spent the past six months making zero progress.
In a similar way, I think, you know, besides LMSYS you have all these academic benchmarks, and they're completely divorced from the real world. Like, a lot of teams are focused on improving these SAT-style scores instead of real world progress. I'll give an example: there's a benchmark called IFEval, which stands for instruction following eval. If you look at IFEval, some of the instructions it's trying to check whether models can follow are like, hey, can you write an essay about Abraham Lincoln? And every time you mention the word Abraham Lincoln, make sure that five of the letters are capitalized and all the other letters are lowercase. It's like, what is this? And sometimes we'll get customers telling us, yeah, we really, really need to improve our score on IFEval. And what this means, again, is you have all these companies, all these researchers who, instead of focusing on real world progress, are just optimizing for these silly SAT-style benchmarks. And so one of the things that we really want to do is just think
about ways to educate the industry, think about ways of publishing on our own, just like think
about ways of steering the industry into like hopefully a better direction. And so I think
that's just one big thing that we're really excited about and could be really
big in the next five years.
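As an aside, the IFEval-style constraint Edwin mentions is literally checkable with a few lines of code, which is part of why it's so easy to optimize for without any real-world progress. A rough sketch of that kind of check (my own approximation, not the benchmark's actual implementation) could look like this:

```python
# Rough approximation of an IFEval-style mechanical constraint check
# (not the benchmark's actual code, just an illustration of how shallow it is).
import re

def check_lincoln_capitalization(essay: str, required_capitals: int = 5) -> bool:
    """Every mention of 'Abraham Lincoln' must have exactly N capital letters."""
    mentions = re.findall(r"abraham lincoln", essay, flags=re.IGNORECASE)
    if not mentions:
        return False
    return all(sum(c.isupper() for c in m) == required_capitals for m in mentions)

print(check_lincoln_capitalization("ABrahAm LinColn was president."))  # True: 5 capitals
print(check_lincoln_capitalization("Abraham Lincoln was president."))  # False: only 2
```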
Okay.
Yeah.
I mean, so Sarah brought up earlier how everybody kind of wants high quality data.
What does that mean?
How do you think about that?
How do you generate it?
Can you tell us a little bit more about your thoughts on that?
So let's say you wanted to train a model to write an eight-line poem about the moon. The way most companies think about it is, well, let's just hire a bunch of people from Craigslist, or through some recruiting agency, and let's ask them to write poems. And then the way they think about quality is, well, is this a poem? Is it eight lines? Does it contain the word moon? If so, okay, yeah, it hit these three checkboxes, so sure, this is a great poem, because it follows all these instructions. But if you think about it, the reality is you get these terrible poems. Sure, it's eight lines and has the word moon, but they feel like they're written by kids in high school. And so other companies will be like, okay, sure, these people on Craigslist don't have any poetry experience, so what I'm going to do instead is hire a bunch of people with PhDs in English literature.
But this is also terrible. Like a lot of PhDs, they are actually not good writers or poets.
Like if you think of like think of Hemingway or Emily Dickinson, they definitely didn't
have a PhD.
I don't think they even completed college.
And like one of the things I will say is like, yeah, I went to MIT.
I think you went there too.
And a lot of people I knew from MIT who graduated with a CS degree are terrible coders.
And so we think about quality completely differently.
Like, what we want isn't poetry that checks the boxes, like, okay, yeah, it checked these boxes and used some complicated language. We want the type of poetry that Nobel Prize laureates would write.
So what you want is, okay, we want to recognize that poetry is actually really subjective and rich. Like, maybe one poem is a haiku about moonlight on water. And there's another poem that has a lot of internal rhyme and meter. And another one that, I don't know, focuses on the emotion behind the moon rising at night. And so you actually want to capture that there are thousands of
ways to write a poem about the moon. There isn't a single correct way, and each one gives
you all these different insights into language and imagery and poetry. If you think about
it, it's not just poetry, it's like math, there's a thousand ways probably to prove
the Pythagorean theorem. And so I think the difference is that when you think about quality
the wrong way, you kind of get commodity data that optimizes for things like inter-annotator agreement.
And again, checking boxes off of some list.
But one of the things that we try to teach all of our customers is that high quality
data actually really embraces human intelligence and creativity.
And when you train the models on this like richer data, they don't just learn to follow
instructions.
They really learn all of these deeper patterns about all the stuff that makes language in the world really compelling and meaningful. And so I think a lot of companies,
they just throw humans at the problem and they think that you can get good data that way. But
I think you really need to think about quality from first principles and what it means. And you
need a lot of technology to identify, yeah, that these are amazing poems and these are creative
math problems and these are games and web apps that are beautiful and fun to play. And these
ones are terrible to use. So like you really need to build a lot of technology
and think about quality in the right way.
Otherwise, you're basically just scaling up mediocrity.
That sounds very domain-specific.
So do you, like, in every domain, are you building a lens
of what quality looks like along with your partners?
Yeah, I mean, I think we have kind of holistic quality principles,
but then oftentimes there are differences per domain.
So it's like a combination of both.
I think we got all the core topics.
Nice work on podcast number two, Edwin. And thanks for doing this.
Congrats on all the progress with the business.
Yeah, no, thanks so much for having us.
Yeah, it's great.
Great meeting you guys.
Find us on Twitter at no priors pod.
Subscribe to our YouTube channel.
If you want to see our faces, follow the show on Apple Podcasts, Spotify, or wherever you listen.
That way you get a new episode every week. And sign up for emails or find
transcripts for every episode at no-priors.com.