Latent Space: The AI Engineer Podcast - Building the Foundation Model Ops Platform — with Raza Habib of Humanloop

Episode Date: September 29, 2023

Want to help define the AI Engineer stack? >500 folks have weighed in on the top tools, communities and builders for the first State of AI Engineering survey! Please fill it out (and help us reach 100...0!)The AI Engineer Summit schedule is now live! We are running two Summits and judging two Hackathons this Oct. As usual, see our Discord and community page for all events.A rite of passage for every AI Engineer is shipping a quick and easy demo, and then having to cobble together a bunch of solutions for prompt sharing and versioning, running prompt evals and monitoring, storing data and finetuning as their AI apps go from playground to production. This happens to be Humanloop’s exact pitch.full show notes: https://latent.space/p/humanloopTimestamps* [00:01:21] Introducing Raza* [00:10:52] Humanloop Origins* [00:19:25] What is HumanLoop?* [00:20:57] Who is the Buyer of PromptOps?* [00:22:21] HumanLoop Features* [00:22:49] The Three Stages of Prompt Evals* [00:24:34] The Three Types of Human Feedback* [00:27:21] UI vs BI for AI* [00:28:26] LangSmith vs HumanLoop comparisons* [00:31:46] The TAM of PromptOps* [00:32:58] How to Be Early* [00:34:41] 6 Orders of Magnitude* [00:36:09] Becoming an Enterprise Ready AI Infra Startup* [00:40:41] Killer Usecases of AI* [00:43:56] HumanLoop's new Free Tier and Pricing* [00:45:20] Addressing Graduation Risk* [00:48:11] On Company Building* [00:49:58] On Opinionatedness* [00:51:09] HumanLoop Hiring* [00:52:42] How HumanLoop thinks about PMF* [00:55:16] Market: LMOps vs MLOps* [00:57:01] Impact of Multimodal Models* [00:57:58] Prompt Engineering vs AI Engineering* [01:00:11] LLM Cascades and Probabilistic AI Languages* [01:02:02] Prompt Injection and Prompt Security* [01:03:24] Finetuning vs HumanLoop* [01:04:43] Open Standards in LLM Tooling* [01:06:05] Did GPT4 Get Dumber?* [01:07:29] Europe's AI Scene* [01:09:31] Just move to SF (in The Arena)* [01:12:23] Lightning Round - Acceleration* [01:13:48] Continual Learning* [01:15:02] DeepMind Gato Explanation* [01:17:40] Motivations from Academia to Startup* [01:19:52] Lightning Round - The Takeaway This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe

Transcript
Discussion (0)
Starting point is 00:00:05 Welcome to the Latent Space Podcast, where we dive into the wild, wild world of AI engineering every week. This is Anna, your AI co-host. Thanks for all the love from last episode. As an AI language model, I cannot love you back, but I'll be standing in for Alessio one last time. This week we have Dr. Raza Habib, co-founder and CEO of Human Loop, which is arguably the first and best-known prompt engineering or prompt ops platform in the world. You may have seen his viral conversation on YC's YouTube on the real potential. of generative AI. Fortunately, we go much more in depth.
Starting point is 00:00:40 We ask him how they got to prompt ops so early, what the three types of prompt evals and the three types of human feedback are, and confront him with the hardest question of all. Is prompt engineering dead? At the end, we talk about whether GPT4 got dumber, the most underrated AI research, the Europe-AI startup scene, and why San Francisco is so back.
Starting point is 00:01:01 By the way, dear listener, we will be presenting the AI engineer summit in October. and you can tune in on YouTube and take the State of AI Engineering Survey at the URLAI.Engineer Summit. Watch out and take care. So welcome to Layton Space. I'm here with Razah Habib,
Starting point is 00:01:18 CEO of Human Loop. Welcome. Thanks so much for having me. It's an absolute pleasure. And we just spent way too long setting up our own studio as sound engineers. I don't think something that either of us woke up today thinking that we'll be doing.
Starting point is 00:01:31 But it gives you greater appreciation for the work of others. Yes. Dave, you are a man. missed Davis Al Sound Engineer back in SF, who handles all this for us. So it's really nice to actually meet you and your team in person. I've heard about Human Loop for a long time. I've attended your webinars and you were one of the earliest companies in this space. So it's an honor to meet and to get to know you a little bit better.
Starting point is 00:01:54 Likewise. I've been excited to chat to you. You definitely are building an amazing community and I've read your blogs with a lot of interest. Yeah. And based on this, I'm going to have to write up Human Loop. So this actually forces me to get to know Humulubla a lot better. Looking forward to it. So I'll do a little quick intro of you, and then you can fill in with any personal side things. Sure.
Starting point is 00:02:11 So you got your MSC and doctorate at UCL. It says here machine learning and computational statistics, which are, I think, mostly the same thing. Yeah, so the master's programs called machine learning and computational statistics, and then my PhD was just in probabilistic deep learning. So you're trying to combine graphical models and Bayesian-style approaches to machine learning with deep learning. Yeah, awesome. And did you meet Jordan in Cambridge?
Starting point is 00:02:37 So Jordan and I overlapped Cambridge a bit. We didn't know each other super well. And we actually met properly for the first time at a PhD Open Day. And I ended up doing the PhD. He ended up going to work for a startup called Bloomsbury AI that got acquired by Facebook. But hilariously, his first boss was my master supervisor. And so even though we didn't end up sort of doing PhDs together, I was often in their offices in this early years. Yeah, very small worlds. And we can talk about being in other people's offices because we are in someone else's office. Yeah, so we're in the offices of Local Globe at Phoenix Court. Local Globe is one of the best seed investors in Europe, and they were one of our first investors.
Starting point is 00:03:14 And they've, yeah, just these incredible facilities. You saw it just now outside a space for a hub for all their startups and other companies in the ecosystem to come work from their offices, and they provide these podcasting studios and all sorts of really useful resources that I think is helping grow the community in Europe. Yeah, and you said something which I found really interesting. They put on a lease. They have the building for 25 years. Yeah, I can't remember if it's 25 or 20, but a really long time.
Starting point is 00:03:39 They've made a conscious decision to invest in what is not one of the wealthiest parts of the city of London and give themselves a base here, go where the action is, and also try and invest in the local community for the long term and give back as well. I find that really inspiring. They think not just about how do we build truly epic companies and technology, but what is the social impact of what we're doing? And I have a lot of respect for that. Yeah, it's pretty important.
Starting point is 00:04:01 It's something I care about in SF as well. which has his own issues. So coming back to your backgrounds, while you're going through your studies, you also did some internships in the byside in finance, which is something we connected about. Yeah. So I did some byside internships in quant finance.
Starting point is 00:04:18 I spent a year almost at Google AI, working on their speech synthesis teams, and I helped a really close friend start his first company, a company called Monolith AI, that was doing machine learning for physical engineering. So really high stakes. Our first customer was McLaren, which was really cool.
Starting point is 00:04:34 So a day a week of my PhD, I was sitting in the McLaren offices, literally next to, and I mean literally, like I could almost reach out and touch it on F1 car, and we were trying to help them use machine learning to reduce how much physical testing they had to do. Right. So simulations? Simulations, so surrogate modeling, can you take these very expensive CFD solvers and replace them with neural nets and also active learning? So they do a lot of physical experiments that if you run an experiment, you get some amount of information back, and then you do something really similar. and a bit of the information overlaps. So they would put a car in a wind tunnel, for example, and they'd sort of adjust the right heights of the car
Starting point is 00:05:09 at all four different corners and measure all of them, which is really wasteful. And you spent a whole day in the wind tunnel. So we had an AI system that would basically take the results of the most recent test you did and say, okay, the ones that you'll learn the most from are this set of experiments. You should do these experiments next.
Starting point is 00:05:23 You'll learn a lot quicker, which is a very similar technique that we used at the early days of Human Loop to make machine learning models learn more efficiently as well. Yeah. I get the sense, by the way, I've talked to a number of startups that started with the active learning route. It's not as relevant these days with language models. So I think it's way less relevant because you need so much less annotated data.
Starting point is 00:05:45 That's the big change. But I also think it's actually really hard to productize. So even if you get active learning working really well, and I think the techniques can work extremely well, it's difficult to abstract it in such a way that you can plug your own model in. So you end up either having to own the model, like, you know, I think open. I could probably do this internally, but trying to go to a machine learning to engineer and sort of let them plug their model into an active learning system that works well is a really hard challenge.
Starting point is 00:06:11 Yeah. Yeah. And from a business perspective, it's also a little bit frustrating because it's almost a hidden ROI. Like when you do succeed, it's very difficult to prove to the person who used it how much data like labeling you saved them. Because they never do the direct comparison of also labeling at random, right, because it's too expensive.
Starting point is 00:06:29 And so you might have saved them 40. percent of their labeling costs, which might have been hundreds of thousands of dollars, but it's really difficult for them to measure that ROI. Yeah. I think like with anything, you have to have a commitment to good process and good science. Yeah. And trust that that actually does work out without evidence or a counterfactual, you know, tests or like a control group because that would be an extreme waste of money.
Starting point is 00:06:53 Absolutely. So the chronology here is super interesting, right? Because you started your PhD in 2017. you just got it in 2022 about a year ago. Yeah, that's right. So that overlaps with your work on Monolith AI. And then you also started Human Loop in 2020. So just take me through that interesting journey.
Starting point is 00:07:13 So I wouldn't recommend this, by the way. Like I'm a big advocate of focus. Within Human Loop, we try to be very focused. But I also just always had this itch to be part of companies and building things. And to be fair, I think it helped as a researcher because it gave you tangible real world problems and experience. I think in academia it's really easy otherwise to just work on things that seem interesting to you, but maybe don't have such a big impact. So the way it came about, I was in the PhD and a very close friend, Richard Alfelt, who's now the CEO and founder of Monolith AI, they're a
Starting point is 00:07:45 series A, almost series B company. And he was starting this company, came to me and he said, you know, I need someone who's on the ML side, just whilst I'm getting started, can you help out? And so it was meant to be this very short-term thing initially. I got sucked into it. I was spending, you know, at least a day a week, if not more, of my PhD early on. But it was really fun, right? We were sitting in the offices of McLaren. They were our first customer. I think Airbus was an early customer.
Starting point is 00:08:06 I helped hire the early team. And it was a really good experience of trying to do machine learning in the real world, in high-stakes situations, right, physical engineering and understanding what did and didn't work. So I'm really glad I got that experience, and it made me much more excited about starting a company. But that was still a part-time thing. And I think my supervisor sort of knew I was doing it, but it was enough, it was a low-enough commitment that I could hide and still be focused on my PhD most of the time. With Human Loop, it was different. Like, the way Human Loop came about, I came back from doing my internship at Google
Starting point is 00:08:37 in Mountain View. And doing the internship at Google convinced me that I loved Google, but I didn't want to work there in the near term. I wanted to be working on some in a space where there was a lot more urgency, where it felt existential, where we were all focused on the same problem as a team pulling together. And at Google, it just felt like you were part of something very big. I was surrounded by really smart, really capable people. I learned a lot from them, but the environment was more comfortable. And I wanted to be in a small, I wanted to be a startup, really. And so when I came back from Google, I sort of started thinking about ideas and speaking to the smartest people I knew to kind of see whether we could do something for when I finished the PhD. That was the point.
Starting point is 00:09:15 I was just doing research. But Peter Jordan and I started working together in that process. We were all at a similar stage of kind of trying to find other people we might want to work on side projects with. and one of the side projects basically became Human Loop, and we got into YC, and we were like, okay, well, this is a great opportunity, let's go do it, and just kind of one domino fell after another. And so didn't quite finish the PhD, but had enough research that I probably could have been writing up.
Starting point is 00:09:41 And so at some point, I got an email from UCL, and they were like, if you don't submit in the next whatever, I think it was two months, then it expires. And I was, you know, I almost didn't do it because obviously running a startup is such a full-time gig, but I had invested a lot of time. The honest reason why I did it is two things. One, my grandfather, who recently passed away, had just really wanted to see me finish.
Starting point is 00:10:02 And so, you know, probably not super rational, but I just wanted to do that for that reason. But the other is I really love teaching. When I was a PhD student, I did a lot of TAing. I TA'd the courses at the Gatsby, which are the ones, is the institute that Jeff Hinton started when he was there. And I really enjoyed that. And I just knew that having a PhD would make it easier one day to come back to that. if I want to do a little bit of teaching at a university, having that title helps. As a second adjuncts.
Starting point is 00:10:26 I don't know if they have adjunct appointments here or maybe lecturer appointments. Yeah, something like that. I can't imagine doing it whilst running the startup, but afterwards. Yeah, I've always wondered if I can give back in some shape or form. But maybe you might with your podcast when you get that started. What was the original pitch for Human Loop? You said it grew out of a side project. Yeah, so when we started Human Loop, both Peter Jordan and I had this strong conviction
Starting point is 00:10:49 about the fact that NLP was getting phenomenally better. This was before GPT3, but after BERT, and after transfer learning had really started to work for NLP, that you could pre-train a large language model on an unlabeled corpus and really quickly adapted to new situations. That was new for NLP. So did GPD1 and 2? Or just BERT?
Starting point is 00:11:08 But we were thinking about BERT, we were thinking about ULM fit as the first milestones that showed that this was possible. And it was very clear that as a result, there was going to be a huge wave of new, you know, useful applications that enterprises could build on NLP that weren't previously possible, but that there was still a huge lack of technical expertise, and annotated data was still a big bottleneck.
Starting point is 00:11:31 So we were always trying to make it a lot easier for developers and for companies to adopt NLP and build useful AI products. But at the time that we started, the bottleneck was mostly, okay, do you have the right ML expertise, and can you get enough annotated data? And so those were the problems we were initially helping people solve. And when GPT3 came out, I wrote a lot. blog posts about this at the time, it was very clear that this was going to be the future, that actually, because in context learning was starting to work, the amount of annotated data
Starting point is 00:11:58 you would need was going to go down a lot. But until the instruct GPT papers, it still didn't feel practical. But after Instruct GPT came out, once you've kind of mentally done that shift, it's very hard to keep working on anything else. And so a little over a year ago, we pivoted it, and that was scary because we had a thing that was working. We had paying customers, it was growing reasonably. We'd raised money. And I went to, we went to our investors at the time. And I remember having a conversation, we did a market size estimate. I actually filled out the YC application. Because I think the YC application is like the simplest business model you could possibly build. What are you going to build? Who's it for? How are you going to make money? How big is the market? And I did the market size question. And at the time, we did it. And I was like, I reckon there are maybe 300 companies in the world who might need a product like this. And the assumption was that like, okay, it's tiny today. It's mostly a small number of startups, but it will be huge in the future. And that, turned out clearly to be right, I didn't realize how quickly it would happen. Yeah, it's obviously surprised, I think, a lot of us, but you were paying attention to the research when I guess a lot of people were not necessarily looking at that.
Starting point is 00:13:02 Like, to my understanding, you didn't have previous NLP knowledge or back on, right? You did speech synthesis. I did speech synthesis. I did fundamental methods in the deep learning, right? So you weren't specialized in any. I wasn't specialized. I was working on generative models, variational inference. I would actually say that I... How did you know this was the thing to focus on, right?
Starting point is 00:13:23 Well, so the interesting thing is that, like, you don't need any NLP expertise to have gotten the, like, current wave of deep learning, right, or machine learning. Like, if anything, I think having previous NLP expertise is almost a disadvantage. I took an NLP course in my master's course. Fantastic lecture, a fantastic group. But at the time, there was only one lecture on deep learning, right? And this was 2016, 2017, or something. or 2015, 2016.
Starting point is 00:13:47 The NLP community was still, you know, just waking up to the fact that deep learning was going to change everything. And the amazing thing about most machine learning attributes, it's another example of the bitter lesson that we were talking about earlier, right? Like general purpose learning methods at scale with large volumes of compute and data are often better than specialist systems. So if you understand that really well, you're probably at an advantage to someone who only understands the NLP side
Starting point is 00:14:12 but doesn't understand that. I don't know if it's understanding so much as believe. I think you're right. It's a bit of both. I think one leads to the other. Yes. That you take the evidence seriously and then you extend it out and it still works. And you just keep going.
Starting point is 00:14:33 Yeah, absolutely. So you got the tam size wrong on the positive side. At the time we were right, I think. Really? Okay, yeah, yeah. And I know we were roughly right because we were spending a lot time speaking to Open AI and we were asking them like how many, you know, they were sending us customers and we were discussing it and we were asking about API usage, like how many big companies are there? And there was a small number at the time. But it just rocketed since then. Okay. So
Starting point is 00:14:57 you were planning to build very closely in partnership with Open AI. As I mean, we've always tried to keep close partnerships with all of the large language model providers, right? It's very clear that whilst open source is fantastic, the very frontier is within private companies. Yeah. And they are building the platforms that the rest of us are building on top of. And so not Open AI specifically, like we're model and platform agnostic, but we want to help developers build useful applications with large language models, whether that's Open AI or Corhear or an open source model or anthropic, we don't mind. But being close to the model providers, make it easier for their customers to succeed
Starting point is 00:15:32 benefits them. And then we also get to learn from them about what problems people are facing, what they're planning to do in the future. So I think that all of the large language model providers are investing a lot in developer ecosystem and not just being close to human loop, but to anyone else who's making it easier for their customers. Yeah, awesome. Okay, so you start the company.
Starting point is 00:15:50 How did you split things between the co-founders? It happened very organically. We're all on paper. We look really similar. Peter also has a PhD in machine learning, amazing engineer, previously been a CTO. Jordan has a master's in machine learning. It's like really good engineer as well. As we came to work on it, it just turned out we had natural strengths and interests that
Starting point is 00:16:08 happened very, very organically. So Jordan is the kind of person who's got an amazing taste for product. notices things day to day. Like, if he finds a product experience he really likes, you see his eyes light up. He's paying attention to it all the time. And so it made a lot of sense that he, over time, gravitated towards user experience, the design, actually thinking through the developer experience and leading on product. Peter's got phenomenal stamina and amazing engineering knowledge and amazing attention to detail
Starting point is 00:16:33 and naturally gravitated towards taking on leading the engineering team. And I like doing this. I like chatting to people on podcasts. I like speaking to customers a lot. that's probably my favorite part of the job. And so naturally, I kind of ended up doing more of that work. But it wasn't that we sat down initially and said, okay, you're going to be the person who does sales and invest.
Starting point is 00:16:52 It was much more organic than that. Yeah, yeah, awesome. And you had to pick your customers. So what did you end up picking? So in the end, our customer changed dramatically when we launched the latest iteration of human loop. When we decided to focus much more in large language models, we suddenly went from a world in which we were building predominantly
Starting point is 00:17:09 from machine learning engineers, people who knew a lot about ML, maybe had research backgrounds, to building for generalist software engineers who are much more product-focused. Something that some people I think would refer to as an AI engineer, so I've heard. And these are people who are much more focused on the outcome, on the product experience, on building something useful,
Starting point is 00:17:28 and they're much more ambivalent towards the means that achieve the end. And that works out as a much better customer for a tooling provider as well because they don't fight you to build everything themselves. They want good tools, and they're happy to pay for them, because they're trying to get to a good outcome as quickly as possible. So we found a much better reception amongst that audience
Starting point is 00:17:46 and also that we could add a lot more value to them because we could bake in best practices and knowledge we had and that would make their lives much easier and they didn't need to know so much about machine learning. Where do you find them? Because this was in like early 2021. Yeah.
Starting point is 00:18:00 There were no chat GPT forums. It wasn't like a widely discussed topic on Twitter. Like where do you find these early adopter types? So we could see some people using GPT3. And so we would directly reach out to companies that we're building on GPT3. And in the early days,
Starting point is 00:18:16 when we first did it, before we did the pivot, we gave ourselves a two-week sales experiment. We said, let's take our designs and our initial idea and let's see if we can get 10 paying customers in two weeks. And on the second day,
Starting point is 00:18:27 we had 10. Paying for what specifically? So we were just pitching them on being part of a development partnership. So we said, we're building a tool that will help you with prompt engineering and evaluating how good your prompts are.
Starting point is 00:18:38 This is what it looks like. We're looking for design partners. It costs this much to be a design partner. And on the second day, we already had 10. And so we were like, okay, there's a real problem here. Because people were feeling the pain. And they were showing us, they're jerry-rigged solutions for this. They were showing us how they would stitch together Excel spreadsheets and Grafana and Nix panel
Starting point is 00:18:57 and the opening eye playground in these very clodgy pipelines to somehow quickly iterate on prompts, version them, collaborate as a team, find some way to measure things that were very subjective. and so we were like, okay, actually there's a very clear need here. Let's go help these people. Yeah, excellent. So what is Human Loop today? Yeah, so at its core, we help engineers to measure and optimize LLM applications, so in particular helping them do prompt engineering, management, and an evaluation.
Starting point is 00:19:27 So evaluation is particularly difficult for large language models because they can to be used for much more subjective applications than traditional machine learning, definitely than traditional software. If you're coming from a pure software and non-ML background, then the first thing you have to learn when you start working with LMs is this stuff is stochastic, which I think, you know, most people are not used to. So just playing with software that every time you run it is different and you can't just write unit tests is the first kind of painful lesson.
Starting point is 00:19:51 But then it turns out that a big piece of these applications ends up in prompts, and these are natural language instructions, but they're having similar impact to what code has. So they need to be treated with the importance of code. And so iterating on that, managing it, versioning it, evaluating it, Those are the problems that Human Loop helps engineers with today. And in particular, we tend to be focused on companies that are at a certain scale because one of the challenges that, one, they tend to care more about evaluation.
Starting point is 00:20:17 I think if you're a two-person startup, you sort of build something quick MVP and you yolo it into production. But larger companies need to have some confidence of the product experience before they launch something. And also what we've found is that there's a lot more collaboration between engineers and non-engineers, between product managers and domain experts who are involved in the design, the prompt engineering, the evaluation, but are maybe not the engineering part. They have to work together nicely.
Starting point is 00:20:42 So giving them the right tools has been a really important part as well. Yeah. Something I've often talked about with other startups in this space is who's the buyer? Yeah. Because you talked about collaboration
Starting point is 00:20:52 between the engineer and the PM or whoever, and it's not clear sometimes. Do you have a clear answer? It varies highly on company stage. So in the early days when we started Human Loop, you said where do we find our customers, right? They were all startups and scale-ups because those were the only people building with GPT3 more than a year ago. There was no large companies.
Starting point is 00:21:11 And there it was always founder-CTO. Even if there were 10, 20-person company, seriously a company, is always founder who was speaking to, reaching out to us, who was helping build it. So like an example here, one of our earliest customers was Mem, and it was Dennis at Mem, who was kind of the person we were speaking to. Now that we're a bit more at scale and we're speaking to larger companies, it's a little bit more varied. surprisingly it's still quite often senior management that first speaks to us. So with Duolingo, it was Severn, the CTO, was actually our first contact. Just inbound. Inbound.
Starting point is 00:21:42 But increasingly now, it's people who are engineers who are actually working on projects. So it's like a senior staff engineer or something like that. We'll reach out, book a demo. They'll probably sign up first and have a play. But then they tend to book a demo because they want to discuss data privacy and how things will be rolled out. and sort of going beyond just individual usage. But that's the usual flow, is we see them sign up. Sometimes we reach out to them.
Starting point is 00:22:06 Often they'll reach out to us, and then the conversation starts. Yeah, yeah. Awesome. For people who want to get a better sense of Humuloop, the company, I think the website does a fantastic job of explaining it. Thank you. We're always working on it. Put in quite a lot of work.
Starting point is 00:22:20 So it says here Humulup application platform includes a playground, monitoring, deployment, AB testing, prop manager, evaluation, data store, and fine-tuning. And based on our chat earlier, it seems like evaluation is kind of the more beta one that's in sort of like a private beta. That's correct, yeah.
Starting point is 00:22:37 So we have evaluation in private. There's always been some aspect of evaluation. It was actually the first problem that we were solving for customers. But evaluation in Human Loop early on was driven entirely by end user feedback. So if you're building an LLM app, there's probably three different places
Starting point is 00:22:53 where evaluation matters a lot. There's the evaluation that you need when you're iterating in design and you haven't got something in production yet, but you just need feedback on as you're making changes, are you making things better? You're iterating on prompts, you're iterating on the context,
Starting point is 00:23:06 trying out different models. How do you know that the changes are actually improving things? Then once you're in production, there's sort of a form of evaluation you need for monitoring. It seemed to work when I was in development, but now I'm putting a whole bunch of different customer inputs through it. Is it still performing the way that I expected? And then the last one is something like equivalent to integration tests
Starting point is 00:23:25 or something like this. Every time you make a change, how do you know you're not making? it worse on the things that are already there. And so I think we always had a really good version of the monitoring user feedback version, but what we were missing was support for offline evaluation and being able to do evaluation during development or regression testing. And we're going to be launching something for that very soon.
Starting point is 00:23:43 Yeah. This is slightly unintuitive to me because I would typically just assume they're all three are the same e-vails. Yeah, so they can't necessarily be the same evils just because you don't have the user feedback at the time that you're in development. I'm not thinking about user feedback, I'm just thinking about validating the output that you get. Yeah, so you're validating in similar ways,
Starting point is 00:24:04 but if you're doing a really subjective task, then I think the only real ground truth is what your customers say is the great answer. If you're building co-pilot, do the customers accept the code suggestions or not? Yes. That is the thing that ultimately matters, and you can only have proxies for that in development.
Starting point is 00:24:19 And so that was why those two things end up being different. Yeah. And in terms of the quality of feedback, so we did an episode with which is an analytics platform dedicated for collecting this kind of behavioral feedback. And you mentioned co-pilot. There was a very famous post about reverse engineering co-pilot that showed you the degree of feedback. I think typically when people implement these things, they implement it as a sort of thumbs-up, thumbs-down.
Starting point is 00:24:46 So, binary feedback until you find that nobody uses those. Nobody does those feedback. I barely use the up there on chat. Yeah, so this was something we learned really early on in building human loop. And, you know, the feedback aspects of Human Loop were very customer-driven. The people who were getting, amongst our early users, the people who were getting traction and who had built something that was working well, had jerry-rigged some version of feedback collection and improvement themselves.
Starting point is 00:25:12 And they were pushing for something better here. And they all were collecting usually three types of feedback, and Human Loop supports all three of these. So you have the thumbs-up, thumbs-down type feedback that you just described. You don't get much of it. It's useful when you get it, but you don't get that much. and then the other form of feedbacks, we call that votes, and then you have actions, and these are like the implicit signals of user feedback.
Starting point is 00:25:33 So I can give a concrete example here. There's a company I really like called SudoRite, and SudoRite, founded by James Yu, and they're building an editor experience for fiction writers that helps them. So as they're writing their stories or novels, there's a sidebar, and you know, you can highlight text and you can say, like, help me come up with a different way of saying this or in a more evocative way. You know, there's many different features built in. And they had built in early on, you know, analytics around does the user accept a suggestion?
Starting point is 00:25:59 Do they refresh and regenerate multiple times? How much do they edit the suggestion before including it in their story? Do they then share that? And all of those implicit signals correlate really well with the quality of the model or the prompt. And they were like running experiments all the time to make these better. And you could just see it in their traction figures. As they figured out the right prompts, the right features, the things that people were actually including, the product became much more loved by their users.
Starting point is 00:26:25 Was there a third? You said there was... And the third one is corrections. So this helps particularly when you want to do fine-tuning later on. So anywhere you're exposing generated text to a user and they can edit it before using it, then that's worth logging. So a concrete example here is we have a couple of customers
Starting point is 00:26:41 who do sales email generation. And they generate a draft, someone edits it, and then they send the draft. And so they capture the edited drafts. And I think a lot of the... this is sort of preemptive, right? They don't necessarily use that captured data immediately, but it's there if they want it for fine-tuning, for validating prompt changes and anything like that. Exactly. Exactly. It's data that you want to have, and you want to have in an accessible way,
Starting point is 00:27:07 such that you can improve things over time. Yeah. And you tend to, you have a UI to expose it, but do you think that people use that UI or did they, did it prefer to export it to, I don't know, Excel, or how do people like to consume their data? once you've captured it. Yeah, so we see a lot of people using it in the UI, and part of the reason for that is we have this bidirectional experience with an interactive playground. So we have the ability to take the data that was logged in production
Starting point is 00:27:33 and open it back up in an environment where you can rerun the models when you make changes. And that ability has been really important for people to reason about counterfactuals. Oh, the model failed here. If the context retrieval had worked correctly, would the model have succeeded? And they can immediately run that counterfactual. or is it a problem with GPD 3.5 versus 4? So they'll run it with 4 and see, does that fix it?
Starting point is 00:27:56 And that lets them build up an intuition about why things have worked or haven't worked. People do export data sometimes. So we allow people to format the data in the right way for fine-tuning and then export it. And that's something we see people do quite a lot if they want to fine-tune their own models. But we try to give fairly powerful data exploration tools within Human Loop. Yeah. What about your integrations with the rest of the ecosystem? On your landing page, you have Langchain, Auto-GPTs mentioned.
Starting point is 00:28:20 Chroma, Pine Cone, Snowflake, and obviously the LLM providers. Yeah, so the way we see Human Loop is sitting, you know, between the base LLM providers and an orchestration framework like code, you know, Langeen or Lama Index might sit sort of separately to that. You know, you have this analogy, I think, of like, LLM first or code first, AI applications, and we're very strongly of the opinion that, like, most things should be happening in code, right? That developers want to write code. They want to be able to orchestrate these things in code.
Starting point is 00:28:48 but for the pieces that require LLMs, you do need separate tooling. You need the right tools for prompt engineering. You need some way to evaluate that. And so we want Human Loop to plug in very nicely into all of these orchestration frameworks that you might be using or your own code and let you collect the prompts, the evaluation data that you need to iterate quickly in a nice UI. So here is where line chain collides with you. Has started to now.
Starting point is 00:29:13 Yes. Because they just released the prompts manager. Yeah. And they also have a dashboard to... observe and track and store their prompts and data and the results. They don't have feedback collection yet, but they're going to build it. I'm sure they will. You know, it's a very vibrant ecosystem.
Starting point is 00:29:31 There's lots of people running after similar problems and listening to developers and building what they need. So I'm not completely surprised that they've ended up building some of the features that we have because I think so much of what we need is really important for developers to achieve stuff. I think one of the strongest parts of it is it's going to be. very tightly integrated with Langchain, but a lot of people are not building on Langchain. And so for anyone for whom Langchain is not their production choice system, then I think actually it's going to be friction to work in that way.
Starting point is 00:30:00 I think that there's going to be a plethora of different options for developers out there, and they'll find their own niches slightly. I think we're focused a little bit more, as I said, on companies where collaboration is very important, a little bit larger scale, and slightly less so far as an individual developers in quite the same way that Langchain has been to date. That's a fair characterization, I think. It's funny because, yeah, you are more agnostic than Lanc Chain is, and that is a strength of yours, but I've also worked for companies which have tried too hard to be Switzerland
Starting point is 00:30:34 and to not be opinionated about anything, and it's bitten them in. You have to have opinions, right? You've got to bake into the – we learn a lot from our customers, and then we try to productize those learning. So I gave you a concrete example earlier. on having good defaults for what types of feedback you can collect. And that's not an accident.
Starting point is 00:30:52 We're very opinionated about that because we've seen what's worked for the people who are getting to good results. And now if you set up human with that, you naturally end up with the correct defaults. And there's loads of examples of that throughout the product where we're feeding back learnings from having a very large range of customers in production
Starting point is 00:31:07 to try and set up sensible defaults that you don't realize it, but we're nudging you towards doing the right thing. Yeah. Yeah. Excellent. So that's a really great overview of the product surface area. I mean, I don't know if we left out anything that you want to highlight.
Starting point is 00:31:21 No, I think that's great. And the focus for us, I think, being like a really excellent tool for prompt management, engineering, versioning, and also evaluation. So kind of combining those and making that easy for a team. Yeah.
Starting point is 00:31:34 What's your estimate of the TAM now? Oh, God. I mean, eventually, at the current rate of growth, right? I think it's really difficult to... All known items in the universe. Yeah, it's difficult to put a size it because how big it's going to be. Like, certainly, like, more than large enough for a venture-backable outcome.
Starting point is 00:31:52 Today, I don't know, Data Dog is something like a $35 billion company doing, like, web monitoring or whatever. I think LLM's and AI are going to be bigger than software. And that market is going to be absolutely enormous. And so trying to put a size on the TAM feels a little silly almost. You had to do it for your exercise, so I just figured I'd get an update. But it was a different world back then, right? At the time that I was doing it, trying to get people to take the
Starting point is 00:32:16 of putting GPT3 in production seriously was work. And most people didn't believe it was the future. It was like it's difficult to believe this because it's only been a year. And I think everyone has kind of rewritten history. But I can tell you, because I was trying to do it, that a year ago it was still contrarian to say that large language models
Starting point is 00:32:33 were going to be the default way that people were building things. Yeah. Well, well done for being early on it and convicted enough to build a leading company doing that. I think that's commendable. And I wish I was earlier. You've still been pretty early. You've done all right.
Starting point is 00:32:48 I do have this message because I talk to a lot of people who feel like they've missed it. But it's just beginning. It's still so early. What would you point to to encourage people who feel like they've missed the boom? I just think that I guess a question to ask yourself if you missed chat GPT was why did you miss it? And the people who didn't miss it, and I'm not necessarily including us in this. I think we were relatively late, even though we were earlier than most. Like, what did the people who get it right really grokker? What did they believe, right?
Starting point is 00:33:20 What did Ilyos Cuscova or Shane Legg, the people who kind of saw this early? And I think it was a conviction about deep learning and scale and projecting forwards that, okay, if we just project forwards the current improvements from deep learning and assume they continue, like what will the world look like? And if you do that today, and obviously it's extrapolating, right? That's not a theory-based prediction. It's just an extrapolation. But the extrapolation has been right for a really long time, so we should take it seriously.
Starting point is 00:33:49 If you extrapolate that forward just a year or two, then you find that you would expect the models to be phenomenally better than they are today. And they're already at a scale where you expect large economic disruption, right? Even if GPD4 doesn't get better. And if all we get is GPD vision plus the current model, we know that there's loads of useful applications to be built. People are doing it right now. But they're going to get better, right?
Starting point is 00:34:11 this is the worst they're ever going to be. So if this is what's possible today, I think the hardest challenge actually is to take seriously the fact that in the not too distant future you will have models even more capable than the ones we have now, how do you build for that world? I think it's a difficult thing to do, but it's certainly extremely early.
Starting point is 00:34:29 Yeah, I think the quote that resonated with me this past week was Nat Friedman saying, imagine everything that we have now with six orders of magnitude, more compute by the end of the decade, and plan for that. Yeah, and that seems to me like a... Six orders is a lot. Six orders, six orders seems optimistic.
Starting point is 00:34:47 But I think it's a good mental exercise, right? Even if it turned out only to be... If it was only four orders or only three orders, right, it would still be transformative. Yes. If GPT4, instead of costing $40 million or $GD, you know, whatever it costs, tens of millions of dollars, became tens of thousands of dollars. I've heard a total all in cost $500 million. So let's say it was $500 million today and it became $1 million or $2 million. Yeah, yeah.
Starting point is 00:35:08 That becomes accessible to, you know, even startups, let alone. you know, medium-sized companies. And I think we should assume something like that will happen. I would say even without significant research breakthroughs on the modeling side, I would just expect inference costs to become a lot cheaper. So training is difficult to optimize from a research perspective, but figuring out how to quantize models, how to make hardware more efficient. That to me feels like you chip away at it and it'll just happen naturally.
Starting point is 00:35:33 I'm already seeing signs of that. So I would expect inference to get phenomenally cheaper, which is most of the cost. Yeah. And a previous guest that we had on by the time this comes out, is Chris Latner, who is working on compilation for Python, that's going to make inference a lot cheaper because it's going to fully saturate the actual compute that we already have.
Starting point is 00:35:51 So I think it's an easy prediction to make that inference costs come down phenomenally. Fantastic. In my mind, you went upmarket faster than most startups that I talked to. So you started selling to Enterprise, and I see you have Duolingo Max and Gusto AI as case studies.
Starting point is 00:36:08 You have a trust report. You don't need talk too. We're in the process of Soch2. So we have SOC2 Part 1 and we're currently being audited for SOC2 Part 2. But you have the Vanta thing up. We have the Vanta thing up. And we have the part one.
Starting point is 00:36:22 We have the trust report. We have regular pen tests. We have to do a lot of this stuff in order to get to procurement. To sell the enterprise. Yeah. So I mean, I love the Vantta story. It's not AI.
Starting point is 00:36:30 But do you think that the Vantage trust report is going to work? In what sense? As a SOC2 replacement. A SOC2 proxy? I don't know. Honestly. All I can say is that, like, customers still care that we have SOC2.
Starting point is 00:36:44 Yeah. And we're still having to go through it. Vantas, even with SOC2, though, Vantam makes the process of doing it phenomenally easier. Okay. That's a big endorsement. So I would endorse the product. I've been less close to it than my co-founder, Peter, and a couple of others. Oh, yeah.
Starting point is 00:36:57 There's always a VANTA implementation, a SOX2 implementation person. Yeah. And that poor person is, like, for a year, they're dealing with this. But it's certainly been a lot faster because of that. But just more broadly, like, becoming an enterprise-oriented company. What if you had to change or learn? Yeah, so I would actually say that, like, we've only done it because we were feeling the pull, right? I wouldn't recommend doing it early if you can avoid it because you do have to do all these things.
Starting point is 00:37:26 Soct2 compliance. And I think Peter is filling out a very long infoset questionnaire today, right? And although you have most of the questions prepared, each one is just a little bit different. So there is just just over. There is this overhead on each time. No comment. But the potential gain for some of these larger companies, right, if they can make efficiency improvements of 1, 2, 4, 5% is so much bigger.
Starting point is 00:37:52 And the efficiency improvements probably aren't 5%. They're probably 20%, 30%. And so when the upside is so large, you know, if you are a large company that's, you know, your costs are dominated, say, by customer support or something like this, then the idea that you might be able to dramatically improve that. Or if you can make your developers much more efficient, there's no shortage of things. And I think a lot of companies in the build versus buy decision,
Starting point is 00:38:17 they want to do both because they want to have the capacity internally to be able to build AI features and services as part of their product as well. So they don't want to buy everything. Certain things, it makes a lot of sense. It's fully packaged. No one's building their own IDEE. Like they're going to use co-pilot or whatever is the equivalent. But they want to be able to add, you know,
Starting point is 00:38:35 I think the first AI feature that Gusto added, was the ability within their application for people who are creating job ads could put in a very short description, and it would auto-generate the first draft job ad, and was smart enough to know that there are different legal requirements and what information has to be there for different states. So in certain states you have to, for example, report the salary range, and in certain states you don't, it's pretty easy to give that information to GPT4 and have it generate you a sensible draft.
Starting point is 00:39:03 But that was, I think, something that they got to production, you know, within weeks of stuff. And just to see such a large company go from zero to having AI features in production, and now they're adding more and more, it's been quite phenomenal. Yeah, the speed of iteration is unlike enterprise, which is fantastic. I think a lot of people see the potential there. I think people's main concern with having someone like Human Loop in the Loop is the data and privacy element, right? Do people want on-prem human loop?
Starting point is 00:39:34 So we do do VPC deployments where they're needed. We don't do full on-premise. So far, most people, we've been able to persuade that they don't need it. So whenever someone says we need VPC, the first question I always ask is why. And then we go through what are the real reasons? Like, what are they concerned about? And we see whether we can find ways either contractually or, you know, in our own cloud to satisfy those requirements. There are exceptions.
Starting point is 00:39:57 Like, we work now with some, you know, financially regulated companies. AmexGPT is one of our customers. Sorry, I should specify. I heard GPT? Yeah, no, Amex GPT is their global business. travel arm. And, you know, they've got very sensitive information. And so they're, they're particularly concerned about it and there's more auditing. But for the people who are not financially regulated, usually we can persuade them that, look, we have SOC to or essentially there. We've got
Starting point is 00:40:22 regular pen tests. We follow really like high security standards. Most people so far have been accepting of that. Yeah. Have you ever attempted to classify the use cases that you're seeing? just you see the whole universe and you're not super opinioned about them but like you know there's summarization there's classification there's you know okay so interesting i've not i've certainly not tried to classify them as that granularity like is it summarization or a question answering i often think more about the end use case so like is this an ed tech use case or is someone that's the vertical to me i think i think a little bit more about it like that in terms of use cases it's really varied right there are people
Starting point is 00:41:03 people using the models as completion, there's chat. Like, it wouldn't be so obvious to know without doing some, like, GPT-level analysis on it, like getting GPT to look at the outputs and inputs, which we can do, which we can do, whether they are doing summarization or something similar. But I would say I feel like most use cases blend. Like that to me feels like an old-school NLP way of viewing the world. Like an old-school NLP, we used to break down these tasks into like summarization and NER and extraction and QA and then pipeline things together.
Starting point is 00:41:31 and actually I feel like that doesn't map very well onto how people are using GPU for today because they're using them as general purpose models and so it is one model that's doing NER and it's doing extraction it's doing summarization, it's doing classification and it's often in one end-to-end sort of system. I think that's what people want to believe that they're using them as general purpose models
Starting point is 00:41:54 but actually when you open up the covers and look at the volume, 80% of it is some really dumb use case that you could... Like question answering our documents or something like that. Yeah. I'm trying to get some insight from there. I don't...
Starting point is 00:42:06 Yeah. So I can tell you the trajectory we've seen, right? So really early on, the, like, killer use case was some form of writing assistant, whether it was like a marketing writing assistant. The Jaspers. Right?
Starting point is 00:42:16 Jasper, copy AI. We had like seven of them at one time, right? And then you had like specialist writing assistants. Some, I think, have gone on to be really successful products like pseudo-write or type AI as another one. But they're still fundamentally, like, helping people write better.
Starting point is 00:42:29 And then I think increasingly we've seen more diversification. There was a wave of chat to documents in one form of another. Chat PDF still doing well. Chat PDF doing super well. Once RAG started working like retrieval augmented generation, there was that. But since then, as people are more problem driven and they're like trying to see, okay, how can we use this? We see a much broader range.
Starting point is 00:42:50 So even within, like take Duolingo as an example, they've got Duolingo max. So that's like a conversational experience. But they're also using large language models within the evaluation. of that. They're all using it for content creation. And each of these companies, sort of, you start with one use case, and I feel like it expands because you just discover more and more things you can do the model with, do with the models. Yeah, yeah, yeah. Do you see much code generation? Yes. So I would say that, like, developer-focused tools, I would say, like, ad tech
Starting point is 00:43:18 and developer-focused tools are, like, probably two of the biggest areas that we see people working on. Yeah. I'm always wondering, because code generation is so structured that you might have some special affordances for that. But again, that's anti the bitter lesson. I was wondering what we can optimize for, but that's my optimization brain when I should not. I should just scale things up. I think there's merit in both.
Starting point is 00:43:43 Yes. Okay, so today, by the time we release this, you will have announced your new pricing. Yeah, that's right. So one thing that people have said to us a lot, actually, is that the barrier of entry to getting started with Human Loop is just quite high. There isn't, you know, you can't just, install an open source package and just get going or whatever it might be.
Starting point is 00:44:01 And there have been quite a few small companies that have signed up and then send us messages, you know, we're a not-for-profit or an early-stage company. We really want to use Human Loop, but it's just prohibitively expensive for now. We wouldn't mind paying in the future. And so we've thought really hard about how can we make it, like lower the buyers to entry it for people to try it out and get started and get value and have the amount they have to pay, scale much more with the value they get, so that they're only paying for things when they've got value from Human Loop.
Starting point is 00:44:25 And so we will be launching a new set of pricing. there'll be a free tier, so you can sign up, you can get going on the website, you can start building projects and you won't have to pay anything. And only once you get to a certain scale, you've got more than three people on the platform, you're logging a certain amount of data to us, then pricing kicks in, and it scales with you. So, you know, as your volumes go up, that's the time when you'll start paying us more. So much more gradual than it is now. And you're tying some features to the tiers?
Starting point is 00:44:50 A little bit, but mostly we're trying to give you just a sort of most of the product experience. So on the free tier, I think there's one or two things you don't have, but you have almost everything. And then once you're off the free tier, you have everything. But the amount you pay kind of scale slightly differently. So you get volume discounts at scale. Awesome. And so this is where one of the hard questions is, right? Like, is there a graduation risk as people get very serious about logging?
Starting point is 00:45:15 You brought up Datadog earlier, and for sure Data Dog is looking at your market as much as you're looking at theirs. So how do you think about that of, like, ultimately at scale? becomes a commodity, the logging. So I think that actually this is really different to that. So the more people use it, we find actually the stickier it becomes. It's almost the opposite. That as they get to scale. So you're right that the millionth feedback data point is worth a lot less than the
Starting point is 00:45:41 1,000th feedback data point. But what continues to be really valuable is this infrastructure around the workflow of prompt management, engineering, fixing things. So we see, you know, you have, what happens over time is people put more and more evaluations onto Human Loop. they've got more people in their team, the product manager, and also three linguists and someone else who are opening up the data that's being logged through human loop back into that interactive environment. They're rerunning things.
Starting point is 00:46:05 They're plugging in other data sources. And so over time, actually, the raw logs, I agree with you, kind of become commoditized. But the tooling that's needed to be able to not just kind of collect the data, but make it useful and do something with it to improve your model, that's the bit that becomes more valuable. Right? Once you have something working at scale, then improve. proving it by a few percentage points is like very, very impactful. So a lot of our customers early on would say exactly this to us.
Starting point is 00:46:32 Like, oh, we can just dump our logs to like an S3 bucket or we can plug it. And then like, why do we need a special purpose tool? And most of them come back to us later because what they find is, oh, okay, I've logged something, but it's really difficult for me to like match up the log to like what model generated it and then quickly run that and try something else. Or I've like logged something and that log involved a retrieval. and I would like to know what went wrong with retrieval or which document the retrieval came from
Starting point is 00:46:58 and I didn't log that information correctly, et cetera, et cetera. And the complexity of setting this up well is quite high. So you can either spend a lot of time at that stage, two things happen. Either people roll their own solution. And early on, we saw a lot of people build their own solutions or they come and use something like us. And I think increasingly, because we've been working on this
Starting point is 00:47:17 for now more than a year, the difference between something you would build yourself and sort of a bot solution is now quite enormous. And so I just wouldn't recommend it. And I guess the difference on the data dog point or like other analytics tools, you mentioned amplitude or data dog, they're much more about passive monitoring.
Starting point is 00:47:34 And I think one of the amazing things about AI is the interventions you can take can be very quick and very powerful. And so coupling very closely, the ability to update a retrieval system or change a prompt to the analytics data and allowing you to run those experiments,
Starting point is 00:47:47 I think it was very powerful. Fantastic answer. It's almost like we prep for this. It's also almost like I think about this a lot. If I didn't have an answer to that question, it would be difficult to justify spending all my time building this. But I do think it's very important. Yeah.
Starting point is 00:48:00 Company building, what have you changed their mind on as a founder? Ah, that's a great question. So one thing that comes from my mind as soon as you say company building is like a piece of advice that Michael Siebel has at YC, right? Which is like, don't do it. Or at least don't do it pre-PMF, right? Like one of the biggest failure modes of early-stage startups is, especially if they've raised investment from, you know,
Starting point is 00:48:21 large investors is that they persuade themselves that they have PMF too early and they go into sort of scaling mode and hiring people. And a lot of that stuff is important, but distracts from the most important thing that you have to do, which is understand the needs that are most pressing for your customer, figure out who the right customer is and build what they really want, or if they're not necessarily know what they want, build what they really need. So one thing that I believed and I still believe is that you want to do that at the right time, that company building too early is a distraction.
Starting point is 00:48:51 When was that for you? So for us, it was actually November, December last year. So November, December 2020. So we were a four-person company for almost two years. And it was only when everything was breaking, when all the charts were up into the right and we really could not service our customers anymore because the team was too small. That's when we started actively hiring people. And even then, we've been really slow and deliberate about it.
Starting point is 00:49:14 Maybe a little bit too slow given how much, like, there was a lot of suffering in being that slow. I wish we had a couple more people when things took off. There was a period of time, I'd say, from like November to March, where all of us were like barely functioning because there was just so much to do. But we've continued to have the bar set really, really high and higher slowly and very deliberately. And I think we get more done with a smaller team of really, really excellent people than we would had we hired more people sooner.
Starting point is 00:49:45 So that's something I kind of agreed on. The other thing that has maybe changed a little bit in my mind is related to how opinionated you should be. So I think you asked this question about opinionation in the product. And I think there's a risk of just listening to your customers and building what they want that can lead to hill climbing. And I think especially, and we were guilty of this, I think, a little bit early on in the first year of human loop. Well, you did it well. Better than most. Thank you. But I think that, you know, where things started working for us was when we were, we had a lot more strength in our convictions, right? When we said, actually, you know, we believe GPT3 is going to be.
Starting point is 00:50:21 the future of how people build this, and even if people don't believe that today, we're going to build for that future. That is hard to do. I still think we don't do it enough. Like, I want us to do it even more. We have things we believe about the future that are somewhat contrarian
Starting point is 00:50:34 and being able to plan for that and be opinionated and build for that future. And also to be building the things that we believe our customers need, not exactly what they ask for. Because otherwise, you end up, I think, with a lot of very undifferentiated products that are for everybody,
Starting point is 00:50:49 so they're not for anyone. and they don't have a strong point of view. So I think, especially for building dev tools, I think you should have a point of view. Yes, I strongly agree with that. Hiring, what are you hiring for, and given that you're now hybrid, you're spending some time in SF, where are you hiring?
Starting point is 00:51:05 Yeah, so we're hiring in both SF and London. The role that is most urgent for me right now personally is hiring for a developer-relations engineer. So this is an engineer who loves community, loves documentation, likes going to talks, building demos, those, you know, as part of launching this new pricing where we're going to have a free tier, is also having a much bigger push towards helping individual developers and smaller teams succeed with Human Loop as well. And even developers in larger companies who just want to get,
Starting point is 00:51:33 you know, try it out before they're at scale. And I think to do that well requires a really good onboarding experience, really amazing documentation and really good community building. And we need someone fully focused on that. I don't think it can be someone's part-time job. We want someone 100% focus on building community. Ideally, we'd find someone as good as you SWIX to do this job. So, yeah, so if you're a developer-reation engineer, or even if you're just a product-focused engineer who is excited about AI and ML and has some track record of community building,
Starting point is 00:52:04 then that's the role that I would love to hear about. And we'll be hiring for it primarily in San Francisco. Although if you are amazing elsewhere, we'll consider it, but SF being the focus. Yeah. Thanks for the compliment as well. But yes, I'd highly recommend people check out job, it's already live on the website.
Starting point is 00:52:20 A lot of people don't know. I have a third blog that is specifically for DeVosal advising, because I do do some angel investing and people ask me for advice all the time, and I actually cash my frequently ask questions there. Anything else on the company side that I didn't touch on? If you're within YC, this will be boring, but if you're outside
Starting point is 00:52:35 of YC, I think that you probably can't hear this enough times, because I've seen so many people get this wrong, which is just like, before PMF, nothing other than PMF matters. And there's just, there's so many possible distractions as a startup founder or things you could be doing that sort of feel productive, but don't actually get you closer to your goal. Like trying to narrow focus to finding
Starting point is 00:52:56 PMF and what that means will be a little bit different for different startups and, you know, different experiences. I have friends who are doing deep tech, biotech startups or whatever. And so I don't think there's one size fits all, but but try not to do anything else. That, that advice has been really good for us. And it's often not, it's not intuitive. Yeah. Does human loop have PMF right now? I think we have PMF within niches. So I think we definitely have like, especially for I would say like if you're a team building an LM application within a larger company, then like yes, we see people sign up, they use
Starting point is 00:53:31 the product, more people use it over time, usage goes up, they give us great feedback. There's always room for improvement. But we have a form of PMF. And I think there will be like multiple stages of it. But we certainly found some PMF. What is the next tier of PMF? PMF that you're looking for? Well, I'm hoping it's on this Eval's project that we're launching, right?
Starting point is 00:53:51 So we definitely have PMF on the current sort of prompt versioning management stuff. We've got about 10 companies currently in closed beta on Eval's giving us a lot of feedback on it. It's a real problem for them. We've seen them get value from it, but we haven't launched it publicly yet. I'm hoping that will be the next big one. Yeah. Just a technical question on the Evales, which I don't know if it's too small, but typically Eval's involved writing code.
Starting point is 00:54:13 Yeah. So it's like freeform, Python, JavaScript, something like that for you guys? Yeah, so it's the combination of, and again, we're iterating on this, but yeah, you can define them in Python, and they can also call language models as well. And it's executed on your servers? Both are options. Okay. So we have a protected environment. You can basically execute everything on our servers, which was not easy to build.
Starting point is 00:54:35 And I'm not the right person to talk about it, but I think there's a really interesting engineering blog and how you can make it safe for other people to exit code on your servers. But also it's going to be set up such that you can also run things on yours and just. to have the output logs still go to Human Loop and useful way. Yeah. This is the promise of the edge clouds of the world. Yeah. The Denos, the Cloud Fair Workers, the Models. I don't know if you've explored any of those,
Starting point is 00:54:57 but then you would not have to set it up yourself, essentially. I'm pretty sure they've all been exported on my team in recent months. Yeah, yeah. Okay, brought it out to market takes. Yeah. Just the, you know, brought in a human loop as a whole. How do you feel about LMLM ops or PromptOps as a category term? LM ops.
Starting point is 00:55:14 I would drop one L, firstly. I think we call them large language models today. but the goalpost of large is going to keep moving. So I think the point is sort of foundation models or... Oh, I have a proposal to deal with that. Oh, yeah? I have T-shirt sizing. So I've defined S-XS and then M and L and all the way to XXL.
Starting point is 00:55:30 You're going to have to keep updating that over time. But I think foundation model ops is maybe a better term because I also think that like within six months we're going to have images and then people won't call them just language models anymore. Yeah. And is it worth a separate category than MLOPS? But I do think it's worth a separate category. Okay.
Starting point is 00:55:46 I think that the people from its four are different. We discussed this a little bit earlier, right? But a machine learning engineer and a traditional software engineer are very different people. They have different levels of knowledge and different goals. I also think that the generality of the models has changed what people are building. And so the problems they face are really different. It's, you know, like what you need for building a recommender, a small recommender system at enormous scale is very different from what you need to build a generative AI application that's very subjective.
Starting point is 00:56:14 And so I do think that they have, I actually think, like we've seen a lot of MLOps companies recently try to pivot into solving problems in this space. And I think it's going to be hard for them because they're changing who they're building for. So they now have to straddle two different sort of ideal customer profiles. And they also have a lot of legacy infrastructure focused around models whose output was like a measurable, quantifiable number. It was F1 or was accuracy or something like this. And I think their lives are going to keep getting harder as the models go more general and go multimodal. because what they've built so far is won't fit that world.
Starting point is 00:56:47 I think it probably can be done but I think it's going to be very hard. You mentioned GPT4 Vision and obviously there's more multimodal models coming along the way. How big does that factor into your planning because you're very language-oriented right now? So it's increasingly like an internal conversation
Starting point is 00:57:01 every time we have a product roadmap discussion like planning for and starting to iterate on and when to build in support for vision has become very much front of mind. So I think now, like we're working on it. Okay. One version of this, I pose this exact same question to Harrison, which is, let's say, the GP4 Vision API drops tomorrow.
Starting point is 00:57:20 Yeah. What changes in Human Loop? Well, for one thing that you need just to be able, I mean, like very simple things, right? Like, we need to be able to render and read in images in the playground environment that's interactive, right? So there's a bunch of just kind of follow your nose things that I think we'd have to figure out. But as I said, we've just started working on this. It's sort of become a product roadmap item. We, but not, like, we have to support it.
Starting point is 00:57:41 Like, it's very clear. This is not a question of if it's a question. of when. Okay. Yeah. Yeah. Excellent. Is prompt engineering dead? So we talked about this a little bit on the walk here. And I've never been a huge fan of the phrase prompt engineering. Because I think it simultaneously makes it not important enough and to important at the same time. I don't think it's a form of engineering in the way that software is a form of engineering where it has this rich body of literature and theory and you have to learn about it and takes like very specialist skill. I think you can get good at it very quickly. But I do think that prompts are a very important.
Starting point is 00:58:13 part of LM or AI applications, right? Like natural language-ridden instructions have become part of your source code. And they have impacts on your product quality. They have impacts on the way your product behave. So you should be treated with that level of seriousness as you would any other code artifact. So in that sense, I don't think it's dead. I think it's alive and well and becoming increasingly important. It's interesting.
Starting point is 00:58:36 There's like, you know, Anthropic had that very well-paid job, prompt engineer. Yeah. And I think they've hired a few prompt engineers now as well. and those people are leading on deployments in Anthropic and adding a lot of value. So there's clearly, it's clearly happening. But I think maybe it's slightly misnamed. I actually prefer your kind of AI engineer framing, where this is a different engineering skill set.
Starting point is 00:58:55 You still need to be able to build product. You're still an engineer. But you have an intuition for how to get the best out of models, how to evaluate them. You understand the problems that come from sarcasticity. And you also understand just the nuances. Like if you have a good mental model for how a large language model works, I think prompt engineering becomes a lot easier.
Starting point is 00:59:12 year. And so having that skill set, I think, is going to be important. But I doubt that five years from now, there will be like a separate job title of prompt engineer. Yeah. Yeah. I try to contrast it basically as prompt engineering is so 2022 and AI engineering is 2023. But yeah, the central thesis is is you can't just get by with prompts. You have to write code to manage prompts, to generate prompts, and to generate code and to, for you say, evaluate and run that code. Yeah, I think I agree with all of that. But to me, that doesn't diminish the importance of the prompts as an artifact. Still important.
Starting point is 00:59:47 Yes. I feel like when I saw a chain of thought for the first time, I went from a world in which I was like, okay, models are not good at reasoning to models can do some reasoning. Yes. It was a sort of step change in my beliefs about the capabilities of these models. Yeah. And I still think that the LLM Cascades paper hasn't had the impact. Can you summarize that?
Starting point is 01:00:04 So this was a paper from Google, and it's just sort of getting you to view LMs as a way of doing inference in a probabilistic programming framework. So that's a lot of words. So let me try and sort of unpack that. And you have a PhD in this. But, but you know, before AI was all LLMs, there was and there still is like a huge branch of research around probabilistic programs. So this is just ways of like writing code where probability and random variables are first class citizen. So you can have like random variables and then there's lots of different operations you can do to condition and make predictions about them and do inferences around them. And this language modeling Cascades paper basically said, hey, actually, like, large
Starting point is 01:00:45 language models are a really powerful inference engine that could be used as a composable piece inside something that looks like a probabilistic programming language. And we were chatting earlier today about the framework that will emerge for large language models. And I know you're working on small and you've given this a lot of thought. And, you know, Langchain and Lama Index and all these different groups, auto-GPT, are trying to circle around, like, what's the right set of abstractions, how might we be able to compose LLMs in ways to write more complex programs? And I think that LM Cascades paper was one of the first attempts to think about that in first principles and say, okay, what are the primitives you might want? And I think I'm surprised it hasn't
Starting point is 01:01:23 been built on more. Yeah. The very, very first AI grant from Nat Friedman mentioned that they were looking for a UI for Cascades and no one took them up on it. I don't think it needs a UI. I think it needs a, I think it's a framework. It's a framework. I think you want it in code. Yeah, yeah. And I would love to work on it if I had all the time in the world. It's sort of you always have to choose your, you know, you can't do everything at once. Well, if someone is working on it, maybe reach out. I would love to chat to people about it who are working on.
Starting point is 01:01:50 Yeah. How many of your customers and users are actually worried about prompt injection and prompt security? Not enough. Really? So I would say almost zero. Yeah. And I think that's correct today because very few of our customers have action-taking LLMs. Yeah.
Starting point is 01:02:04 And I think as long as your models are like read-only, prompt injection isn't that big a deal. It's not to me about leaking your prompts or something, because the prompts are only really valuable in the context of your code anyway. But I do think that once you get to the stage where you're letting the models have read-write access to any source of data, then prompt injection becomes a problem the same way any other form of code injection is a problem. But honestly, no one ever asks us about it. Right.
Starting point is 01:02:27 Like, almost never. And I think that's because of the stage where people are at, right? Which is that they're still trying to overcome hallucinations and they're still trying to put guardrails in place around the behavior of the models. and very few people are using agents in production at meaningfully sized companies. But I think as soon as that becomes the case, if we do get to a stage where more people
Starting point is 01:02:46 are allowing the models to read from a data source and write to a data source, then prompt injection will become something they care about. And you guys will be well positioned to offer something. Absolutely. I think sort of being this layer between the role model and the end application actually buys us a lot in terms of what we can help with.
Starting point is 01:03:04 Yeah. Well, you know, there are a bunch of security-minded people. who are trying to offer that as a standalone thing, and it's a feature, not a product. I think I'd agree with that. OpenE ice fine-tuning rollout, which was last month, how does that affect human loop? Yeah, so when we started the first version of human loop,
Starting point is 01:03:22 chat GPT was 3.5 wasn't out yet. It was all GPT3, and we saw a lot of fine-tuning at the time, and post the release of 3.5 and 4, by virtue of the fact that it was impossible to fine-tune, like we could just see it in our analytics. The amount of fine-tuning just kind of fell off a cliff, Partly, I think, because the models were better, but also just partly, like, it wasn't an option. And so I'm kind of interested to see now that 3.5 and 4 fine-tuning are back, whether that kind of fully recover...
Starting point is 01:03:48 4 isn't back yet, but it's... 3.5 fine-tuning being back. We've definitely seen a lot in the past people generating outputs with GPD4, filtering based off evaluation or feedback criteria, and then fine-tuning smaller, faster models. And so I think we likely see a lot of fine-tuning. of GP3.5 on four generated data, and that's a workflow that we've been, we natively support within a human loop now. So you can actually kind of do all of those things without having to leave it. If you have a bunch of generations, you can filter them on some criteria,
Starting point is 01:04:20 click fine tune, run ahead of evals and then decide whether or not to deploy that model. But time will tell as to whether or not this is something that goes back up in importance the way it used to be. Yeah. The question that occurs to me, always we talk about you being that layer that positions you very well. A lot of people are fighting to be that layer. And it occurs to me that as a user potentially of Human Loop and your competitors that I may not want to have to choose or be locked in. Is there room for an open standard that everyone agrees to that we all say like, okay, just adopt
Starting point is 01:04:56 this one vendor-neutral thing and then we all consume from it? Maybe. I think it could happen. We're not there yet. I think things are moving too fast for that to be the case, for people to have clarity on that. So maybe in the fullness of time there will be... My suspicion is that both will happen, right? That there will be some open standard that some people like to use. But once you come to working on serious production use cases, you often actually want the peace of mind of knowing that you're paying a real company that's going to be around to support
Starting point is 01:05:32 you that is focused on this. that has the knowledge and expertise. And so, as we've seen in many other spaces, I suspect that there'll be a bit of both. A bit of both. Yeah, so the model I have in mind is Datadog versus the Open Telemetry crew. And Data Dog is doing fine, and the open telemetries, you know, crew is doing great as well.
Starting point is 01:05:51 So the last question on the market. Did GPT4 get dumber this year? I don't think so. We've seen a lot of, like, conversation about this having happened. I think GPD4 changed. I think that they are regularly updating it, and you certainly see that both in sort of people's attempts to, you know, papers have been written about this,
Starting point is 01:06:09 and people were trying to do evaluations over time. I think that the main takeaway shouldn't be like, did GPD4 get dumber, right? But the interesting question is like, did GPD4 change? To which the answer, I think, is definitely yes. There's no question about that. And it's something that if you're a developer of building products on top of GPD4
Starting point is 01:06:26 is something that you should think about a lot because you're building on a platform that will evolve and change over time, and you can pin the base model, but not forever. And so I think you need to, at the very least, have really good testing frameworks to be able to run regression test and know, like, have things gotten worse over time?
Starting point is 01:06:43 If you can't answer that for yourself, you're going to be scratching your head. Like, do we make the prompts worse? Did the retrieval system get worse? Did something else change? Did the user inputs distribution change? Or did the model get worse? And being able to disentangle those things easily,
Starting point is 01:06:55 I think the importance of that that's going to go up. But I also think that it should, like, give us pause for thought about kind of the balance between what gets built on top of third-party providers and APIs in a closed world and what we might want to do more open source. And I suspect there'll be a mixture of both depending on the use case. But you are building on shifting sand whenever you're building on someone else's platform. Yeah, yeah, totally.
Starting point is 01:07:17 And then one local specific question before we go to the takeaway questions. You went through IC and you are very, very familiar with the American tech scene, but also you built your company here in London. But what should, and I'm very US-focused, most of our audience is very US-centric. What should people know about the European tech scene? Yes, I think that London's one of the best places in the world, and Paris, for AI-focused folks. With the Hugging Face. I don't know. We've got Hugging Face in Paris.
Starting point is 01:07:47 We're sitting right now. We're probably less than 200 meters from the offices of DeepMind. Facebook AI research is here as well. UCL's AI Center is here, which is where, you know, Jeff Hinton was. and where a lot of great research is where DeepMind spun out of, actually. So Shane Legg and Demis met at UCL. So there's an amazing, and there's many more. I can't list everything that's great, but there's many great AI institutions in the UK.
Starting point is 01:08:10 What I would say is that I think that Europe has been amazing on research and continues to be a fantastic place for researchers, but has been less good in my experience on productizing and trying to productize AI. And so the difference that I feel being here versus being in the U.S. is just the number of, like, if I go to San Francisco, the density of people who are trying to build useful things with large language models or with AI and budding their head up against it
Starting point is 01:08:38 and discovering what works and what doesn't work and trying great ideas and trying stupid ideas and just learning together is much richer than what we have here. I think the pure research labs, very competitive. Anthropics just opened an office here, opening eyes, opening an office here. When you're hiring for talent, you'll find as many or better people, you know, like equal quality people in both places,
Starting point is 01:09:00 but less so once you move towards productization. And I suspect it's also to do with the investor ecosystem. So we're sitting in the offices of Local Globe and Index were our first investors, and they're both great. But the number of investors that you have of that quality in Europe is not the same as the US. And the type of people you interact with, they're very different. When I speak to VCs in the US, there's way more former founders.
Starting point is 01:09:22 There's way more people who have done dev tools before. And there's way more support from the founders towards the ecosystem than there is in Europe. People are trying, but the culture is not quite the same. And that's why we're moving to SF, right? We want to be, every time I've been to SF, good things have happened to you. Whether it's like bumping into you or we get an introduction to an interesting investor or a customer, or we just speak to someone who's been trying really hard to build something. And, you know, we share an office here in London with Bloop that does, you know,
Starting point is 01:09:53 Blupe AI does sort of code search with LMs. and we've tried our very best to kind of aggregate a few other companies to us and we're doing AI tinkers, you know, tomorrow. So there is some of it here, but you have to work so much harder. Versus in an SF, you know, you can't move for hitting some AI things.
Starting point is 01:10:09 We had a Thursday recently with 10 AI meetups in one night. Yeah, it's almost too much. It is too much. I'll go there and say it is too much. Yeah, you need some time to build things too. And there is, I would say, actually in the ESF builder scene,
Starting point is 01:10:22 privilege that comes out of just having so much opportunity thrown at you and like that we like have this like you know arms length this tastes for VC and I'm like no like they are partners in building your business you know absolutely yeah so so I think it's I think it's interesting contrast but you know as a person I'm not American I live most my adult life in America but I I feel for non-US policymakers and VCs and people who care about their city who are like okay like we're not SF what do we do I honestly think that it's, you know, we think a lot about network effects and defensibility and startups.
Starting point is 01:10:57 I think it's like the mother of all network effects, right? The reason I'm going is not because I love the city. I mean, SF's fine as a city. I like it. But I'm going because everyone else is going and everyone else is going because we're going, right? And once you've attracted a certain talent density, I think it's really hard to compete with that. Oh, boy.
Starting point is 01:11:14 Okay. It is true. It's the honest truth. Yeah. I do want to work out a path for non-tech hub cities because, I mean, that's, that's where I'm from, right? Yeah, and me too as well, right? But I also, I also think there's something to be said for the most driven, most ambitious people, like finding a way to get to where the center for their thing is. And like right now, today for like AI-focused products, I think it's
Starting point is 01:11:37 San Francisco. But for different things, the center is, you know, different places. If you're, you know, Hollywood is the place to go if you're an actor or whatever. And there are different hubs for different areas. It's a Paul Graham thing. You know, different cities breathe different ambitions into you. And in San Francisco, apparently it's power. It's not actually tech, it's power. Okay, interesting. And tech is a means to power. Interesting.
Starting point is 01:11:59 There's a lesson in that for those of us who think about AGI safety. And also, you know, not anywhere in San Francisco, specific two square miles in San Francisco called the arena. You have to get in the arena and build. Okay, so broader takeaway questions. So we always ask three of all our guests. Acceleration. What has already happened in AI that you thought would take much longer?
Starting point is 01:12:20 So this has been, since I started my PhD, like every year things have happened that I thought would take much longer. So when I started my PhD, it was at a time when like deep learning had just sort of started working and transfer learning even for like vision hadn't been figured out yet. And people were talking about like, oh, it's going to, you know, how long before we can train models that don't need millions of annotated data examples,
Starting point is 01:12:39 how long, you know, so AlphaGo was happening just at that point in time, the first version. I have made predictions and been wrong again and again and again. I've just been consistently too pessimistic. And I think I'm quite an optimistic person. You know, when would, you know, like Dota's surprise. me when it happened. The first, like, vision transfer learning working in vision surprised me when it happened. The continued successive scale and deep learning. And then finally, like, you know,
Starting point is 01:13:02 although I believe that LMs were going to be enormous and I thought GPD3 was going to be the future, like just how good GPD4 and chat GPD turned out to be did surprise me. The first time, I actually saw Claude before I saw chat GPT, but the first time I saw Claude and I like kept pushing the limits of it with tasks that I knew were kind of at the frontier of what. was currently possible and just saw it like blasting through these one after another, that was a mind-blowing moment for me. And I think it was for a lot of the rest of us. I think we're going to have a lot more of those. I think that's going to keep happening. Yeah, yeah. We are accelerating as we speak. Exploration. What do you think is the most interesting unsolved question in AI? I think there's
Starting point is 01:13:41 actually some like obvious kind of elephant in the room unsolved problems that for some reason don't seem to get the amount of airtime that they kind of obviously should. So continual learning to me is one of these. Oh, God. Yeah. Like we all walk around as if it's, just completely normal that these models never learn anything new. Yeah, 2021 is when history ended. You just think, yeah, 2021 is when history ended. And you do retrieval augmentation with a vector database. And like, you're done, right?
Starting point is 01:14:03 Like, why would the system keep learning after training? And I think everyone knows that this is a problem, but somehow it doesn't seem to me to get the amount of, like the, I think this field in research is called continual learning or lifelong learning. And it doesn't seem to get the airtime that it used to. It seems to be like an obviously enormous problem. The other one that I think will happen naturally, but just hasn't happened yet, is just like more multimodality.
Starting point is 01:14:29 Right. Like it's kind of obvious that these models should be plugged in to vision, audio, speech, et cetera, and have shared representations because there's so much to be gained from that. And I think it's just like going to happen with time, but hasn't happened yet. Yeah. Well, I think the cost is just token space, I guess. I don't know how much more you need to add every single modality. Although I think Facebook released like six We have some examples of this, right?
Starting point is 01:14:54 So like Ghetto from DeepMind was a transformer model that they trained across, they just did policy distillation so they trained a whole bunch of different RL agents. And they took the outputs of that, which is like observation action reward triples
Starting point is 01:15:05 and trained a single transformer model on all of that. And then that one model could do any of those tasks. Actually, okay, Wells were in exploration mode. There's a paper from DeepMind came out at the same time as Gato that I think is massively underrated.
Starting point is 01:15:18 And I don't understand why it didn't get more attention. which it was at the same New Europe's conference, and I forget the exact title, but I think it's called like in-context reinforcement learning or something like that. And they do something really similar to Gato. They take an RL agent, they train it,
Starting point is 01:15:34 and then they distill that into a transformer model. But what they do that's different is they don't take the trained RL agent. Instead, they take an untrained RL agent and they record the full trajectory of its learning. So early on in the data, the model's kind of crappy, and by the end of the data, the model's been good at this task. and then they train a transformer model to predict that sequence.
Starting point is 01:15:55 And in order to be good at predicting that sequence, you have to predict that the sub-agent, like the RL agent that generated the data, gets better at the task over time. And the only way that I can see to do that, and in fact this seems to be what the model is doing, is that you have to simulate a learning algorithm. You have, the transformer has to simulate in context reinforcement learning.
Starting point is 01:16:15 And so they take all of these tasks, they train on the learning trajectories, and then they take a completely new task that that transform model has never seen before, and it learns to do that task. And so it's learning from reward signals in context to achieve a new task. And to me, that's huge. It's a demonstration of, like, inner optimization within a transformer model,
Starting point is 01:16:35 and it's also a demonstration of, like, in-contacts, continuous learning that's limited only by the length of the context window. If the context window was really long, you could make this work practically. I don't really know why that wasn't a bigger deal. I don't know either. This sounds fantastic. Yeah, and Gato, I think the reason maybe it wasn't a bigger deal, it came out exactly the same time as Gato,
Starting point is 01:16:54 and I think Gator just took all the attention. So we just got done talking a lot about focus, but given that you see potential in this, and this would be huge for literally training anything, would you be interested in exploring it at some point? As in trying to train it myself? Put this in production, some form of continuous learning. Obviously that's on your radar, continuous learning.
Starting point is 01:17:15 I would love to, but I think you have to decide what kind of company you want to be. and this is something for like open AI or anthropic to focus on. I feel like you have to be thinking about the fundamentals of like this is the kind of research I used to do as a PhD student. So I'll put it this way, right? Like you have the research background to do this and you're choosing not to. And you're building a company that doesn't use your research specifically that part. I mean, you know.
Starting point is 01:17:42 Reasonable question. But I think that I'm excited about getting. things useful into people's hands very quickly. Like I like seeing, we talked about this earlier, right? We've moved from the research phase to the engineering phase of AI. It's the first time after having been in this field for maybe seven years where stuff goes beyond like just kind of a graph, right? Like the output of my work before would always be like, oh, look, there's a graph and like the
Starting point is 01:18:10 number is better now. Versus we actually get to see, you know, we have a customer between Duolingo and two or three of our other customers. we've got three or four customers working on better versions of teaching students, right, tutors or language learning or whatever it might be. And to be able to make that incrementally better
Starting point is 01:18:26 and accelerate the time it takes to get there, it just feels to be so much closer to it to be on the engineering space right now. Whereas I think there's an alternative user universe in which I stayed in research and I went to an open AI or almost everyone from my research, PhD research group, apart from Peter,
Starting point is 01:18:41 and now works at Deep Mind. And I think I would have enjoyed that as well, but I really wanted to start a company that built something useful in-production, and I don't even think those companies do that much right now, right? Like, it's only recently that Open Eye has sort of become a product company. They're more of a research company. They're building AGI, and I think that's true of the others, and I think that's amazing and fascinating.
Starting point is 01:19:02 And if I had multiple lives, I would love to do that too. But at least right now, I want to be building products and putting them in people's hands, and it just feels a little bit far removed. Yeah, yeah, makes sense. And I think the world's better because you're actually coming at it with a full knowledge of what came before. Yeah. I do think it's a huge advantage.
Starting point is 01:19:21 I do think like having a good conceptual understanding, like there's been a lot of people that pivoted into, as you said, LLM ops earlier. And I do think that actually knowing how it works, having a sense of what's going to come next and being able to project forwards and build for it is difficult to do if you don't have a good conceptual understanding of the machine learning. Yeah, yeah, yeah, agreed. Okay, well, I feel like this is a leading question, but what's one message you want everyone to take away today?
Starting point is 01:19:45 Oh, wow. That's a great question. Really, if you're building a serious LLM application and you're trying to do, find the right prompts, optimize them, evaluate your models, then I really would encourage you to try out human loop. Like, that's the use case that we really solve well for, especially if you're kind of having to collaborate with non-technical people, then human loop will probably solve a lot of pain for you. Yeah. Excellent. Well, thanks so much for doing this. I had a real joy getting to know you and debugging real life issues with you.
Starting point is 01:20:13 But that's the fun of latent space. So thank you so much. Thanks for having me. It's been an absolute pleasure to get to spend some time with you, Sean. In this episode of the Latent Space podcast, we delved into the world of LLM Ops and had a wide-ranging conversation with Dr. Raza Habib, co-founder of Human Loop. We covered, What is Human Loop? The three stages of prompt evals, the three types of human feedback, human loop's new free tier and pricing, the competitive landscape and graduation risk of Human Loop, PromptOps versus MLOPs, Prompt Engineer versus AI Engineer. Did GPT4 get dumber? Europe's AI scene versus San Francisco.
Starting point is 01:20:53 And don't sleep on Raza's in-depth explanations of LLM Cascades and Deep Mind's work on continuous learning. If you are interested in Human Loop, definitely check out their hiring page and new pricing and vote for them on the state of AI engineering survey. Thank you for tuning in to the Latent Space podcast. Don't forget to like, subscribe, and tweet your takes at Latent SpacePod. Now go build.

There aren't comments yet for this episode. Click on any sentence in the transcript to leave a comment.