Latent Space: The AI Engineer Podcast - Building the Foundation Model Ops Platform — with Raza Habib of Humanloop
Episode Date: September 29, 2023Want to help define the AI Engineer stack? >500 folks have weighed in on the top tools, communities and builders for the first State of AI Engineering survey! Please fill it out (and help us reach 100...0!)The AI Engineer Summit schedule is now live! We are running two Summits and judging two Hackathons this Oct. As usual, see our Discord and community page for all events.A rite of passage for every AI Engineer is shipping a quick and easy demo, and then having to cobble together a bunch of solutions for prompt sharing and versioning, running prompt evals and monitoring, storing data and finetuning as their AI apps go from playground to production. This happens to be Humanloop’s exact pitch.full show notes: https://latent.space/p/humanloopTimestamps* [00:01:21] Introducing Raza* [00:10:52] Humanloop Origins* [00:19:25] What is HumanLoop?* [00:20:57] Who is the Buyer of PromptOps?* [00:22:21] HumanLoop Features* [00:22:49] The Three Stages of Prompt Evals* [00:24:34] The Three Types of Human Feedback* [00:27:21] UI vs BI for AI* [00:28:26] LangSmith vs HumanLoop comparisons* [00:31:46] The TAM of PromptOps* [00:32:58] How to Be Early* [00:34:41] 6 Orders of Magnitude* [00:36:09] Becoming an Enterprise Ready AI Infra Startup* [00:40:41] Killer Usecases of AI* [00:43:56] HumanLoop's new Free Tier and Pricing* [00:45:20] Addressing Graduation Risk* [00:48:11] On Company Building* [00:49:58] On Opinionatedness* [00:51:09] HumanLoop Hiring* [00:52:42] How HumanLoop thinks about PMF* [00:55:16] Market: LMOps vs MLOps* [00:57:01] Impact of Multimodal Models* [00:57:58] Prompt Engineering vs AI Engineering* [01:00:11] LLM Cascades and Probabilistic AI Languages* [01:02:02] Prompt Injection and Prompt Security* [01:03:24] Finetuning vs HumanLoop* [01:04:43] Open Standards in LLM Tooling* [01:06:05] Did GPT4 Get Dumber?* [01:07:29] Europe's AI Scene* [01:09:31] Just move to SF (in The Arena)* [01:12:23] Lightning Round - Acceleration* [01:13:48] Continual Learning* [01:15:02] DeepMind Gato Explanation* [01:17:40] Motivations from Academia to Startup* [01:19:52] Lightning Round - The Takeaway This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
Transcript
Discussion (0)
Welcome to the Latent Space Podcast, where we dive into the wild, wild world of AI engineering every week.
This is Anna, your AI co-host.
Thanks for all the love from last episode.
As an AI language model, I cannot love you back, but I'll be standing in for Alessio one last time.
This week we have Dr. Raza Habib, co-founder and CEO of Human Loop, which is arguably the first and best-known prompt engineering or prompt ops platform in the world.
You may have seen his viral conversation on YC's YouTube on the real potential.
of generative AI.
Fortunately, we go much more in depth.
We ask him how they got to prompt ops so early,
what the three types of prompt evals and the three types of human feedback are,
and confront him with the hardest question of all.
Is prompt engineering dead?
At the end, we talk about whether GPT4 got dumber,
the most underrated AI research,
the Europe-AI startup scene,
and why San Francisco is so back.
By the way, dear listener,
we will be presenting the AI engineer summit in October.
and you can tune in on YouTube
and take the State of AI Engineering Survey
at the URLAI.Engineer Summit.
Watch out and take care.
So welcome to Layton Space.
I'm here with Razah Habib,
CEO of Human Loop.
Welcome.
Thanks so much for having me.
It's an absolute pleasure.
And we just spent way too long
setting up our own studio as sound engineers.
I don't think something that either of us
woke up today thinking that we'll be doing.
But it gives you greater appreciation
for the work of others.
Yes.
Dave, you are a man.
missed Davis Al Sound Engineer back in SF, who handles all this for us. So it's really nice to
actually meet you and your team in person. I've heard about Human Loop for a long time.
I've attended your webinars and you were one of the earliest companies in this space. So it's an
honor to meet and to get to know you a little bit better.
Likewise. I've been excited to chat to you. You definitely are building an amazing community
and I've read your blogs with a lot of interest.
Yeah. And based on this, I'm going to have to write up Human Loop. So this actually forces me to get
to know Humulubla a lot better.
Looking forward to it.
So I'll do a little quick intro of you,
and then you can fill in with any personal side things.
Sure.
So you got your MSC and doctorate at UCL.
It says here machine learning and computational statistics,
which are, I think, mostly the same thing.
Yeah, so the master's programs called machine learning and computational statistics,
and then my PhD was just in probabilistic deep learning.
So you're trying to combine graphical models
and Bayesian-style approaches to machine learning with deep learning.
Yeah, awesome. And did you meet Jordan in Cambridge?
So Jordan and I overlapped Cambridge a bit. We didn't know each other super well.
And we actually met properly for the first time at a PhD Open Day. And I ended up doing the PhD.
He ended up going to work for a startup called Bloomsbury AI that got acquired by Facebook.
But hilariously, his first boss was my master supervisor. And so even though we didn't end up sort of doing PhDs together, I was often in their offices in this early years.
Yeah, very small worlds. And we can talk about being in other people's offices because we are in
someone else's office.
Yeah, so we're in the offices of Local Globe at Phoenix Court.
Local Globe is one of the best seed investors in Europe, and they were one of our first investors.
And they've, yeah, just these incredible facilities.
You saw it just now outside a space for a hub for all their startups and other companies
in the ecosystem to come work from their offices, and they provide these podcasting studios
and all sorts of really useful resources that I think is helping grow the community in Europe.
Yeah, and you said something which I found really interesting.
They put on a lease.
They have the building for 25 years.
Yeah, I can't remember if it's 25 or 20, but a really long time.
They've made a conscious decision to invest in what is not one of the wealthiest parts of the city of London
and give themselves a base here, go where the action is,
and also try and invest in the local community for the long term and give back as well.
I find that really inspiring.
They think not just about how do we build truly epic companies and technology,
but what is the social impact of what we're doing?
And I have a lot of respect for that.
Yeah, it's pretty important.
It's something I care about in SF as well.
which has his own issues.
So coming back to your backgrounds,
while you're going through your studies,
you also did some internships in the byside in finance,
which is something we connected about.
Yeah.
So I did some byside internships in quant finance.
I spent a year almost at Google AI,
working on their speech synthesis teams,
and I helped a really close friend start his first company,
a company called Monolith AI,
that was doing machine learning for physical engineering.
So really high stakes.
Our first customer was McLaren,
which was really cool.
So a day a week of my PhD, I was sitting in the McLaren offices, literally next to, and I mean literally, like I could almost reach out and touch it on F1 car, and we were trying to help them use machine learning to reduce how much physical testing they had to do.
Right.
So simulations?
Simulations, so surrogate modeling, can you take these very expensive CFD solvers and replace them with neural nets and also active learning?
So they do a lot of physical experiments that if you run an experiment, you get some amount of information back, and then you do something really similar.
and a bit of the information overlaps.
So they would put a car in a wind tunnel, for example,
and they'd sort of adjust the right heights of the car
at all four different corners and measure all of them,
which is really wasteful.
And you spent a whole day in the wind tunnel.
So we had an AI system that would basically take the results
of the most recent test you did and say,
okay, the ones that you'll learn the most from
are this set of experiments.
You should do these experiments next.
You'll learn a lot quicker,
which is a very similar technique
that we used at the early days of Human Loop
to make machine learning models learn more efficiently as well.
Yeah.
I get the sense, by the way, I've talked to a number of startups that started with the active learning route.
It's not as relevant these days with language models.
So I think it's way less relevant because you need so much less annotated data.
That's the big change.
But I also think it's actually really hard to productize.
So even if you get active learning working really well, and I think the techniques can work extremely well,
it's difficult to abstract it in such a way that you can plug your own model in.
So you end up either having to own the model, like, you know, I think open.
I could probably do this internally, but trying to go to a machine learning to engineer
and sort of let them plug their model into an active learning system that works well is a really
hard challenge.
Yeah.
Yeah.
And from a business perspective, it's also a little bit frustrating because it's almost a hidden
ROI.
Like when you do succeed, it's very difficult to prove to the person who used it how much data
like labeling you saved them.
Because they never do the direct comparison of also labeling at random, right, because it's too
expensive.
And so you might have saved them 40.
percent of their labeling costs, which might have been hundreds of thousands of dollars, but it's
really difficult for them to measure that ROI.
Yeah.
I think like with anything, you have to have a commitment to good process and good science.
Yeah.
And trust that that actually does work out without evidence or a counterfactual, you know,
tests or like a control group because that would be an extreme waste of money.
Absolutely.
So the chronology here is super interesting, right?
Because you started your PhD in 2017.
you just got it in 2022 about a year ago.
Yeah, that's right.
So that overlaps with your work on Monolith AI.
And then you also started Human Loop in 2020.
So just take me through that interesting journey.
So I wouldn't recommend this, by the way.
Like I'm a big advocate of focus.
Within Human Loop, we try to be very focused.
But I also just always had this itch to be part of companies and building things.
And to be fair, I think it helped as a researcher because it gave you tangible real world
problems and experience. I think in academia it's really easy otherwise to just work on things that
seem interesting to you, but maybe don't have such a big impact. So the way it came about, I was in the
PhD and a very close friend, Richard Alfelt, who's now the CEO and founder of Monolith AI, they're a
series A, almost series B company. And he was starting this company, came to me and he said, you know,
I need someone who's on the ML side, just whilst I'm getting started, can you help out? And so it was
meant to be this very short-term thing initially. I got sucked into it. I was spending, you know,
at least a day a week, if not more, of my PhD early on.
But it was really fun, right?
We were sitting in the offices of McLaren.
They were our first customer.
I think Airbus was an early customer.
I helped hire the early team.
And it was a really good experience of trying to do machine learning in the real world,
in high-stakes situations, right, physical engineering and understanding what did and didn't work.
So I'm really glad I got that experience, and it made me much more excited about starting a company.
But that was still a part-time thing.
And I think my supervisor sort of knew I was doing it, but it was enough, it was a low-enough
commitment that I could hide and still be focused on my PhD most of the time. With Human Loop,
it was different. Like, the way Human Loop came about, I came back from doing my internship at Google
in Mountain View. And doing the internship at Google convinced me that I loved Google, but I didn't
want to work there in the near term. I wanted to be working on some in a space where there was a lot
more urgency, where it felt existential, where we were all focused on the same problem as a team
pulling together. And at Google, it just felt like you were part of something very big. I was
surrounded by really smart, really capable people. I learned a lot from them, but the environment
was more comfortable. And I wanted to be in a small, I wanted to be a startup, really. And so when I
came back from Google, I sort of started thinking about ideas and speaking to the smartest people I
knew to kind of see whether we could do something for when I finished the PhD. That was the point.
I was just doing research. But Peter Jordan and I started working together in that process.
We were all at a similar stage of kind of trying to find other people we might want to work on
side projects with.
and one of the side projects basically became Human Loop,
and we got into YC, and we were like, okay, well, this is a great opportunity,
let's go do it, and just kind of one domino fell after another.
And so didn't quite finish the PhD,
but had enough research that I probably could have been writing up.
And so at some point, I got an email from UCL,
and they were like, if you don't submit in the next whatever,
I think it was two months, then it expires.
And I was, you know, I almost didn't do it
because obviously running a startup is such a full-time gig,
but I had invested a lot of time.
The honest reason why I did it is two things.
One, my grandfather, who recently passed away, had just really wanted to see me finish.
And so, you know, probably not super rational, but I just wanted to do that for that reason.
But the other is I really love teaching.
When I was a PhD student, I did a lot of TAing.
I TA'd the courses at the Gatsby, which are the ones, is the institute that Jeff Hinton started when he was there.
And I really enjoyed that.
And I just knew that having a PhD would make it easier one day to come back to that.
if I want to do a little bit of teaching at a university, having that title helps.
As a second adjuncts.
I don't know if they have adjunct appointments here or maybe lecturer appointments.
Yeah, something like that.
I can't imagine doing it whilst running the startup, but afterwards.
Yeah, I've always wondered if I can give back in some shape or form.
But maybe you might with your podcast when you get that started.
What was the original pitch for Human Loop?
You said it grew out of a side project.
Yeah, so when we started Human Loop, both Peter Jordan and I had this strong conviction
about the fact that NLP was getting phenomenally better.
This was before GPT3, but after BERT,
and after transfer learning had really started to work for NLP,
that you could pre-train a large language model on an unlabeled corpus
and really quickly adapted to new situations.
That was new for NLP.
So did GPD1 and 2?
Or just BERT?
But we were thinking about BERT,
we were thinking about ULM fit as the first milestones
that showed that this was possible.
And it was very clear that as a result,
there was going to be a huge wave of new, you know,
useful applications that enterprises could build on NLP that weren't previously possible,
but that there was still a huge lack of technical expertise, and annotated data was still a big
bottleneck.
So we were always trying to make it a lot easier for developers and for companies to adopt
NLP and build useful AI products.
But at the time that we started, the bottleneck was mostly, okay, do you have the right
ML expertise, and can you get enough annotated data?
And so those were the problems we were initially helping people solve.
And when GPT3 came out, I wrote a lot.
blog posts about this at the time, it was very clear that this was going to be the future,
that actually, because in context learning was starting to work, the amount of annotated data
you would need was going to go down a lot. But until the instruct GPT papers, it still didn't
feel practical. But after Instruct GPT came out, once you've kind of mentally done that shift,
it's very hard to keep working on anything else. And so a little over a year ago, we pivoted it,
and that was scary because we had a thing that was working. We had paying customers,
it was growing reasonably. We'd raised money. And I went to, we went to our investors at the time. And I remember having a conversation, we did a market size estimate. I actually filled out the YC application. Because I think the YC application is like the simplest business model you could possibly build. What are you going to build? Who's it for? How are you going to make money? How big is the market? And I did the market size question. And at the time, we did it. And I was like, I reckon there are maybe 300 companies in the world who might need a product like this. And the assumption was that like, okay, it's tiny today. It's mostly a small number of startups, but it will be huge in the future. And that,
turned out clearly to be right, I didn't realize how quickly it would happen.
Yeah, it's obviously surprised, I think, a lot of us, but you were paying attention to the
research when I guess a lot of people were not necessarily looking at that.
Like, to my understanding, you didn't have previous NLP knowledge or back on, right?
You did speech synthesis.
I did speech synthesis.
I did fundamental methods in the deep learning, right?
So you weren't specialized in any.
I wasn't specialized. I was working on generative models, variational inference.
I would actually say that I...
How did you know this was the thing to focus on, right?
Well, so the interesting thing is that, like, you don't need any NLP expertise
to have gotten the, like, current wave of deep learning, right, or machine learning.
Like, if anything, I think having previous NLP expertise is almost a disadvantage.
I took an NLP course in my master's course.
Fantastic lecture, a fantastic group.
But at the time, there was only one lecture on deep learning, right?
And this was 2016, 2017, or something.
or 2015, 2016.
The NLP community was still, you know,
just waking up to the fact that deep learning was going to change everything.
And the amazing thing about most machine learning attributes,
it's another example of the bitter lesson that we were talking about earlier, right?
Like general purpose learning methods at scale with large volumes of compute and data
are often better than specialist systems.
So if you understand that really well,
you're probably at an advantage to someone who only understands the NLP side
but doesn't understand that.
I don't know if it's understanding so much as believe.
I think you're right.
It's a bit of both.
I think one leads to the other.
Yes.
That you take the evidence seriously and then you extend it out and it still works.
And you just keep going.
Yeah, absolutely.
So you got the tam size wrong on the positive side.
At the time we were right, I think.
Really? Okay, yeah, yeah.
And I know we were roughly right because we were spending a lot
time speaking to Open AI and we were asking them like how many, you know, they were sending
us customers and we were discussing it and we were asking about API usage, like how many big companies
are there? And there was a small number at the time. But it just rocketed since then. Okay. So
you were planning to build very closely in partnership with Open AI. As I mean, we've always
tried to keep close partnerships with all of the large language model providers, right? It's very
clear that whilst open source is fantastic, the very frontier is within private companies. Yeah. And
they are building the platforms that the rest of us are building on top of.
And so not Open AI specifically, like we're model and platform agnostic, but we want to help
developers build useful applications with large language models, whether that's Open AI or
Corhear or an open source model or anthropic, we don't mind.
But being close to the model providers, make it easier for their customers to succeed
benefits them.
And then we also get to learn from them about what problems people are facing, what they're
planning to do in the future.
So I think that all of the large language model providers are investing a lot in developer
ecosystem and not just being close to human loop, but to anyone else who's making it easier for
their customers.
Yeah, awesome.
Okay, so you start the company.
How did you split things between the co-founders?
It happened very organically.
We're all on paper.
We look really similar.
Peter also has a PhD in machine learning, amazing engineer, previously been a CTO.
Jordan has a master's in machine learning.
It's like really good engineer as well.
As we came to work on it, it just turned out we had natural strengths and interests that
happened very, very organically.
So Jordan is the kind of person who's got an amazing taste for product.
notices things day to day.
Like, if he finds a product experience he really likes, you see his eyes light up.
He's paying attention to it all the time.
And so it made a lot of sense that he, over time, gravitated towards user experience,
the design, actually thinking through the developer experience and leading on product.
Peter's got phenomenal stamina and amazing engineering knowledge and amazing attention to detail
and naturally gravitated towards taking on leading the engineering team.
And I like doing this.
I like chatting to people on podcasts.
I like speaking to customers a lot.
that's probably my favorite part of the job.
And so naturally, I kind of ended up doing more of that work.
But it wasn't that we sat down initially and said,
okay, you're going to be the person who does sales and invest.
It was much more organic than that.
Yeah, yeah, awesome.
And you had to pick your customers.
So what did you end up picking?
So in the end, our customer changed dramatically
when we launched the latest iteration of human loop.
When we decided to focus much more in large language models,
we suddenly went from a world in which we were building predominantly
from machine learning engineers,
people who knew a lot about ML, maybe had research backgrounds,
to building for generalist software engineers
who are much more product-focused.
Something that some people I think would refer to as an AI engineer,
so I've heard.
And these are people who are much more focused on the outcome,
on the product experience, on building something useful,
and they're much more ambivalent towards the means that achieve the end.
And that works out as a much better customer for a tooling provider as well
because they don't fight you to build everything themselves.
They want good tools, and they're happy to pay for them,
because they're trying to get to a good outcome
as quickly as possible.
So we found a much better reception
amongst that audience
and also that we could add a lot more value to them
because we could bake in best practices
and knowledge we had
and that would make their lives much easier
and they didn't need to know so much about machine learning.
Where do you find them?
Because this was in like early 2021.
Yeah.
There were no chat GPT forums.
It wasn't like a widely discussed topic on Twitter.
Like where do you find these early adopter types?
So we could see some people
using GPT3.
And so we would directly reach out
to companies that we're building on GPT3.
And in the early days,
when we first did it,
before we did the pivot,
we gave ourselves a two-week sales experiment.
We said, let's take our designs
and our initial idea
and let's see if we can get
10 paying customers in two weeks.
And on the second day,
we had 10.
Paying for what specifically?
So we were just pitching them
on being part of a development partnership.
So we said,
we're building a tool
that will help you with prompt engineering
and evaluating how good your prompts are.
This is what it looks like.
We're looking for design partners.
It costs this much to be a design partner.
And on the second day, we already had 10.
And so we were like, okay, there's a real problem here.
Because people were feeling the pain.
And they were showing us, they're jerry-rigged solutions for this.
They were showing us how they would stitch together Excel spreadsheets and Grafana and Nix panel
and the opening eye playground in these very clodgy pipelines to somehow quickly iterate on prompts,
version them, collaborate as a team, find some way to measure things that were very subjective.
and so we were like, okay, actually there's a very clear need here.
Let's go help these people.
Yeah, excellent.
So what is Human Loop today?
Yeah, so at its core, we help engineers to measure and optimize LLM applications,
so in particular helping them do prompt engineering, management, and an evaluation.
So evaluation is particularly difficult for large language models
because they can to be used for much more subjective applications than traditional machine learning,
definitely than traditional software.
If you're coming from a pure software and non-ML background,
then the first thing you have to learn when you start working with LMs is this stuff is stochastic,
which I think, you know, most people are not used to.
So just playing with software that every time you run it is different
and you can't just write unit tests is the first kind of painful lesson.
But then it turns out that a big piece of these applications ends up in prompts,
and these are natural language instructions,
but they're having similar impact to what code has.
So they need to be treated with the importance of code.
And so iterating on that, managing it, versioning it, evaluating it,
Those are the problems that Human Loop helps engineers with today.
And in particular, we tend to be focused on companies that are at a certain scale
because one of the challenges that, one, they tend to care more about evaluation.
I think if you're a two-person startup, you sort of build something quick MVP and you yolo it into production.
But larger companies need to have some confidence of the product experience before they launch something.
And also what we've found is that there's a lot more collaboration between engineers and non-engineers,
between product managers and domain experts
who are involved in the design,
the prompt engineering, the evaluation,
but are maybe not the engineering part.
They have to work together nicely.
So giving them the right tools
has been a really important part as well.
Yeah.
Something I've often talked about
with other startups in this space
is who's the buyer?
Yeah.
Because you talked about collaboration
between the engineer
and the PM or whoever,
and it's not clear sometimes.
Do you have a clear answer?
It varies highly on company stage.
So in the early days when we started Human Loop, you said where do we find our customers, right?
They were all startups and scale-ups because those were the only people building with GPT3 more than a year ago.
There was no large companies.
And there it was always founder-CTO.
Even if there were 10, 20-person company, seriously a company, is always founder who was speaking to, reaching out to us, who was helping build it.
So like an example here, one of our earliest customers was Mem, and it was Dennis at Mem, who was kind of the person we were speaking to.
Now that we're a bit more at scale and we're speaking to larger companies, it's a little bit more varied.
surprisingly it's still quite often senior management that first speaks to us.
So with Duolingo, it was Severn, the CTO, was actually our first contact.
Just inbound.
Inbound.
But increasingly now, it's people who are engineers who are actually working on projects.
So it's like a senior staff engineer or something like that.
We'll reach out, book a demo.
They'll probably sign up first and have a play.
But then they tend to book a demo because they want to discuss data privacy and how things will be rolled out.
and sort of going beyond just individual usage.
But that's the usual flow, is we see them sign up.
Sometimes we reach out to them.
Often they'll reach out to us, and then the conversation starts.
Yeah, yeah.
Awesome.
For people who want to get a better sense of Humuloop, the company,
I think the website does a fantastic job of explaining it.
Thank you.
We're always working on it.
Put in quite a lot of work.
So it says here Humulup application platform includes a playground,
monitoring, deployment, AB testing,
prop manager, evaluation, data store,
and fine-tuning.
And based on our chat earlier,
it seems like evaluation is kind of the more beta one
that's in sort of like a private beta.
That's correct, yeah.
So we have evaluation in private.
There's always been some aspect of evaluation.
It was actually the first problem
that we were solving for customers.
But evaluation in Human Loop early on
was driven entirely by end user feedback.
So if you're building an LLM app,
there's probably three different places
where evaluation matters a lot.
There's the evaluation that you need
when you're iterating in design
and you haven't got something in production yet,
but you just need feedback on as you're making changes,
are you making things better?
You're iterating on prompts,
you're iterating on the context,
trying out different models.
How do you know that the changes are actually improving things?
Then once you're in production,
there's sort of a form of evaluation you need for monitoring.
It seemed to work when I was in development,
but now I'm putting a whole bunch of different customer inputs through it.
Is it still performing the way that I expected?
And then the last one is something like equivalent to integration tests
or something like this.
Every time you make a change,
how do you know you're not making?
it worse on the things that are already there.
And so I think we always had a really good version of the monitoring user feedback version,
but what we were missing was support for offline evaluation and being able to do evaluation
during development or regression testing.
And we're going to be launching something for that very soon.
Yeah.
This is slightly unintuitive to me because I would typically just assume they're all three
are the same e-vails.
Yeah, so they can't necessarily be the same evils just because you don't have the user feedback
at the time that you're in development.
I'm not thinking about user feedback,
I'm just thinking about validating the output that you get.
Yeah, so you're validating in similar ways,
but if you're doing a really subjective task,
then I think the only real ground truth
is what your customers say is the great answer.
If you're building co-pilot, do the customers
accept the code suggestions or not?
Yes.
That is the thing that ultimately matters,
and you can only have proxies for that in development.
And so that was why those two things end up being different.
Yeah.
And in terms of the quality of feedback,
so we did an episode with
which is an analytics platform dedicated for collecting this kind of behavioral feedback.
And you mentioned co-pilot.
There was a very famous post about reverse engineering co-pilot that showed you the degree of feedback.
I think typically when people implement these things, they implement it as a sort of thumbs-up, thumbs-down.
So, binary feedback until you find that nobody uses those.
Nobody does those feedback.
I barely use the up there on chat.
Yeah, so this was something we learned really early on in building human loop.
And, you know, the feedback aspects of Human Loop were very customer-driven.
The people who were getting, amongst our early users, the people who were getting traction
and who had built something that was working well, had jerry-rigged some version of feedback
collection and improvement themselves.
And they were pushing for something better here.
And they all were collecting usually three types of feedback, and Human Loop supports all three
of these.
So you have the thumbs-up, thumbs-down type feedback that you just described.
You don't get much of it.
It's useful when you get it, but you don't get that much.
and then the other form of feedbacks, we call that votes, and then you have actions,
and these are like the implicit signals of user feedback.
So I can give a concrete example here.
There's a company I really like called SudoRite, and SudoRite, founded by James Yu,
and they're building an editor experience for fiction writers that helps them.
So as they're writing their stories or novels, there's a sidebar, and you know,
you can highlight text and you can say, like, help me come up with a different way of saying this
or in a more evocative way.
You know, there's many different features built in.
And they had built in early on, you know, analytics around does the user accept a suggestion?
Do they refresh and regenerate multiple times?
How much do they edit the suggestion before including it in their story?
Do they then share that?
And all of those implicit signals correlate really well with the quality of the model or the prompt.
And they were like running experiments all the time to make these better.
And you could just see it in their traction figures.
As they figured out the right prompts, the right features, the things that people were actually including,
the product became much more loved by their users.
Was there a third?
You said there was...
And the third one is corrections.
So this helps particularly when you want to do fine-tuning later on.
So anywhere you're exposing generated text to a user
and they can edit it before using it,
then that's worth logging.
So a concrete example here is we have a couple of customers
who do sales email generation.
And they generate a draft, someone edits it,
and then they send the draft.
And so they capture the edited drafts.
And I think a lot of the...
this is sort of preemptive, right? They don't necessarily use that captured data immediately,
but it's there if they want it for fine-tuning, for validating prompt changes and anything like that.
Exactly. Exactly. It's data that you want to have, and you want to have in an accessible way,
such that you can improve things over time. Yeah. And you tend to, you have a UI to expose it,
but do you think that people use that UI or did they, did it prefer to export it to, I don't know, Excel,
or how do people like to consume their data?
once you've captured it.
Yeah, so we see a lot of people using it in the UI,
and part of the reason for that is we have this bidirectional experience
with an interactive playground.
So we have the ability to take the data that was logged in production
and open it back up in an environment where you can rerun the models
when you make changes.
And that ability has been really important for people to reason about counterfactuals.
Oh, the model failed here.
If the context retrieval had worked correctly, would the model have succeeded?
And they can immediately run that counterfactual.
or is it a problem with GPD 3.5 versus 4?
So they'll run it with 4 and see, does that fix it?
And that lets them build up an intuition about why things have worked or haven't worked.
People do export data sometimes.
So we allow people to format the data in the right way for fine-tuning and then export it.
And that's something we see people do quite a lot if they want to fine-tune their own models.
But we try to give fairly powerful data exploration tools within Human Loop.
Yeah.
What about your integrations with the rest of the ecosystem?
On your landing page, you have Langchain, Auto-GPTs mentioned.
Chroma, Pine Cone, Snowflake, and obviously the LLM providers.
Yeah, so the way we see Human Loop is sitting, you know, between the base LLM providers
and an orchestration framework like code, you know, Langeen or Lama Index might sit sort of
separately to that.
You know, you have this analogy, I think, of like, LLM first or code first, AI applications,
and we're very strongly of the opinion that, like, most things should be happening in code, right?
That developers want to write code.
They want to be able to orchestrate these things in code.
but for the pieces that require LLMs, you do need separate tooling.
You need the right tools for prompt engineering.
You need some way to evaluate that.
And so we want Human Loop to plug in very nicely into all of these orchestration frameworks
that you might be using or your own code and let you collect the prompts, the evaluation data
that you need to iterate quickly in a nice UI.
So here is where line chain collides with you.
Has started to now.
Yes.
Because they just released the prompts manager.
Yeah.
And they also have a dashboard to...
observe and track and store their prompts and data and the results.
They don't have feedback collection yet, but they're going to build it.
I'm sure they will.
You know, it's a very vibrant ecosystem.
There's lots of people running after similar problems and listening to developers
and building what they need.
So I'm not completely surprised that they've ended up building some of the features that we
have because I think so much of what we need is really important for developers to achieve stuff.
I think one of the strongest parts of it is it's going to be.
very tightly integrated with Langchain, but a lot of people are not building on Langchain.
And so for anyone for whom Langchain is not their production choice system, then I think actually
it's going to be friction to work in that way.
I think that there's going to be a plethora of different options for developers out there,
and they'll find their own niches slightly.
I think we're focused a little bit more, as I said, on companies where collaboration is very
important, a little bit larger scale, and slightly less so far as an individual developers in
quite the same way that Langchain has been to date.
That's a fair characterization, I think.
It's funny because, yeah, you are more agnostic than Lanc Chain is, and that is a strength
of yours, but I've also worked for companies which have tried too hard to be Switzerland
and to not be opinionated about anything, and it's bitten them in.
You have to have opinions, right?
You've got to bake into the – we learn a lot from our customers, and then we try to productize
those learning.
So I gave you a concrete example earlier.
on having good defaults
for what types of feedback you can collect.
And that's not an accident.
We're very opinionated about that
because we've seen what's worked
for the people who are getting to good results.
And now if you set up human with that,
you naturally end up with the correct defaults.
And there's loads of examples of that throughout the product
where we're feeding back learnings
from having a very large range of customers in production
to try and set up sensible defaults
that you don't realize it,
but we're nudging you towards doing the right thing.
Yeah. Yeah. Excellent.
So that's a really great overview
of the product surface area.
I mean, I don't know if we left out anything
that you want to highlight.
No, I think that's great.
And the focus for us, I think,
being like a really excellent tool
for prompt management, engineering, versioning,
and also evaluation.
So kind of combining those
and making that easy for a team.
Yeah.
What's your estimate of the TAM now?
Oh, God.
I mean, eventually, at the current rate of growth, right?
I think it's really difficult to...
All known items in the universe.
Yeah, it's difficult to put a size
it because how big it's going to be.
Like, certainly, like, more than large enough for a venture-backable outcome.
Today, I don't know, Data Dog is something like a $35 billion company doing, like,
web monitoring or whatever.
I think LLM's and AI are going to be bigger than software.
And that market is going to be absolutely enormous.
And so trying to put a size on the TAM feels a little silly almost.
You had to do it for your exercise, so I just figured I'd get an update.
But it was a different world back then, right?
At the time that I was doing it, trying to get people to take the
of putting GPT3 in production seriously was work.
And most people didn't believe it was the future.
It was like it's difficult to believe this
because it's only been a year.
And I think everyone has kind of rewritten history.
But I can tell you, because I was trying to do it,
that a year ago it was still contrarian
to say that large language models
were going to be the default way
that people were building things.
Yeah.
Well, well done for being early on it
and convicted enough to build a leading company doing that.
I think that's commendable.
And I wish I was earlier.
You've still been pretty early. You've done all right.
I do have this message because I talk to a lot of people who feel like they've missed it.
But it's just beginning. It's still so early.
What would you point to to encourage people who feel like they've missed the boom?
I just think that I guess a question to ask yourself if you missed chat GPT was why did you miss it?
And the people who didn't miss it, and I'm not necessarily including us in this.
I think we were relatively late, even though we were earlier than most.
Like, what did the people who get it right really grokker?
What did they believe, right?
What did Ilyos Cuscova or Shane Legg, the people who kind of saw this early?
And I think it was a conviction about deep learning and scale and projecting forwards that,
okay, if we just project forwards the current improvements from deep learning and assume they continue,
like what will the world look like?
And if you do that today, and obviously it's extrapolating, right?
That's not a theory-based prediction.
It's just an extrapolation.
But the extrapolation has been right for a really long time, so we should take it seriously.
If you extrapolate that forward just a year or two, then you find that you would expect
the models to be phenomenally better than they are today.
And they're already at a scale where you expect large economic disruption, right?
Even if GPD4 doesn't get better.
And if all we get is GPD vision plus the current model, we know that there's loads of useful
applications to be built.
People are doing it right now.
But they're going to get better, right?
this is the worst they're ever going to be.
So if this is what's possible today,
I think the hardest challenge actually is to take seriously the fact
that in the not too distant future
you will have models even more capable than the ones we have now,
how do you build for that world?
I think it's a difficult thing to do,
but it's certainly extremely early.
Yeah, I think the quote that resonated with me
this past week was Nat Friedman saying,
imagine everything that we have now with six orders of magnitude,
more compute by the end of the decade,
and plan for that.
Yeah, and that seems to me like a...
Six orders is a lot.
Six orders, six orders seems optimistic.
But I think it's a good mental exercise, right?
Even if it turned out only to be...
If it was only four orders or only three orders, right, it would still be transformative.
Yes.
If GPT4, instead of costing $40 million or $GD, you know, whatever it costs, tens of millions of dollars, became tens of thousands of dollars.
I've heard a total all in cost $500 million.
So let's say it was $500 million today and it became $1 million or $2 million.
Yeah, yeah.
That becomes accessible to, you know, even startups, let alone.
you know, medium-sized companies.
And I think we should assume something like that will happen.
I would say even without significant research breakthroughs on the modeling side,
I would just expect inference costs to become a lot cheaper.
So training is difficult to optimize from a research perspective,
but figuring out how to quantize models, how to make hardware more efficient.
That to me feels like you chip away at it and it'll just happen naturally.
I'm already seeing signs of that.
So I would expect inference to get phenomenally cheaper, which is most of the cost.
Yeah.
And a previous guest that we had on by the time this comes out,
is Chris Latner, who is working on compilation for Python,
that's going to make inference a lot cheaper
because it's going to fully saturate the actual compute
that we already have.
So I think it's an easy prediction to make
that inference costs come down phenomenally.
Fantastic.
In my mind, you went upmarket faster than most startups
that I talked to.
So you started selling to Enterprise,
and I see you have Duolingo Max and Gusto AI
as case studies.
You have a trust report.
You don't need talk too.
We're in the process of Soch2.
So we have SOC2 Part 1
and we're currently being audited for SOC2 Part 2.
But you have the Vanta thing up.
We have the Vanta thing up.
And we have the part one.
We have the trust report.
We have regular pen tests.
We have to do a lot of this stuff
in order to get to procurement.
To sell the enterprise.
Yeah.
So I mean, I love the Vantta story.
It's not AI.
But do you think that the Vantage trust report
is going to work?
In what sense?
As a SOC2 replacement.
A SOC2 proxy?
I don't know.
Honestly.
All I can say is that, like, customers still care that we have SOC2.
Yeah.
And we're still having to go through it.
Vantas, even with SOC2, though, Vantam makes the process of doing it phenomenally easier.
Okay.
That's a big endorsement.
So I would endorse the product.
I've been less close to it than my co-founder, Peter, and a couple of others.
Oh, yeah.
There's always a VANTA implementation, a SOX2 implementation person.
Yeah.
And that poor person is, like, for a year, they're dealing with this.
But it's certainly been a lot faster because of that.
But just more broadly, like, becoming an enterprise-oriented company.
What if you had to change or learn?
Yeah, so I would actually say that, like, we've only done it because we were feeling the pull, right?
I wouldn't recommend doing it early if you can avoid it because you do have to do all these things.
Soct2 compliance.
And I think Peter is filling out a very long infoset questionnaire today, right?
And although you have most of the questions prepared, each one is just a little bit different.
So there is just just over.
There is this overhead on each time.
No comment.
But the potential gain for some of these larger companies, right, if they can make efficiency
improvements of 1, 2, 4, 5% is so much bigger.
And the efficiency improvements probably aren't 5%.
They're probably 20%, 30%.
And so when the upside is so large, you know, if you are a large company that's, you know,
your costs are dominated, say, by customer support or something like this, then the idea that
you might be able to dramatically improve that.
Or if you can make your developers much more efficient,
there's no shortage of things.
And I think a lot of companies in the build versus buy decision,
they want to do both because they want to have the capacity internally
to be able to build AI features and services as part of their product as well.
So they don't want to buy everything.
Certain things, it makes a lot of sense.
It's fully packaged.
No one's building their own IDEE.
Like they're going to use co-pilot or whatever is the equivalent.
But they want to be able to add, you know,
I think the first AI feature that Gusto added,
was the ability within their application for people who are creating job ads could put in a very short description,
and it would auto-generate the first draft job ad,
and was smart enough to know that there are different legal requirements
and what information has to be there for different states.
So in certain states you have to, for example, report the salary range,
and in certain states you don't, it's pretty easy to give that information to GPT4
and have it generate you a sensible draft.
But that was, I think, something that they got to production, you know, within weeks of stuff.
And just to see such a large company go from zero to having AI features in production,
and now they're adding more and more, it's been quite phenomenal.
Yeah, the speed of iteration is unlike enterprise, which is fantastic.
I think a lot of people see the potential there.
I think people's main concern with having someone like Human Loop in the Loop is the data
and privacy element, right?
Do people want on-prem human loop?
So we do do VPC deployments where they're needed.
We don't do full on-premise.
So far, most people, we've been able to persuade that they don't need it.
So whenever someone says we need VPC, the first question I always ask is why.
And then we go through what are the real reasons?
Like, what are they concerned about?
And we see whether we can find ways either contractually or, you know, in our own cloud to satisfy those requirements.
There are exceptions.
Like, we work now with some, you know, financially regulated companies.
AmexGPT is one of our customers.
Sorry, I should specify.
I heard GPT?
Yeah, no, Amex GPT is their global business.
travel arm. And, you know, they've got very sensitive information. And so they're, they're
particularly concerned about it and there's more auditing. But for the people who are not financially
regulated, usually we can persuade them that, look, we have SOC to or essentially there. We've got
regular pen tests. We follow really like high security standards. Most people so far have been
accepting of that. Yeah. Have you ever attempted to classify the use cases that you're seeing?
just you see the whole universe and you're not super opinioned about them but like you know there's
summarization there's classification there's you know okay so interesting i've not i've certainly
not tried to classify them as that granularity like is it summarization or a question answering
i often think more about the end use case so like is this an ed tech use case or is someone
that's the vertical to me i think i think a little bit more about it like that in terms of
use cases it's really varied right there are people
people using the models as completion, there's chat.
Like, it wouldn't be so obvious to know without doing some, like, GPT-level analysis on it,
like getting GPT to look at the outputs and inputs, which we can do, which we can do,
whether they are doing summarization or something similar.
But I would say I feel like most use cases blend.
Like that to me feels like an old-school NLP way of viewing the world.
Like an old-school NLP, we used to break down these tasks into like summarization and NER and extraction and QA
and then pipeline things together.
and actually I feel like that doesn't map very well
onto how people are using GPU for today
because they're using them as general purpose models
and so it is one model that's doing NER and it's doing extraction
it's doing summarization, it's doing classification
and it's often in one end-to-end sort of system.
I think that's what people want to believe
that they're using them as general purpose models
but actually when you open up the covers
and look at the volume, 80% of it is some really dumb use case
that you could...
Like question answering our documents
or something like that.
Yeah.
I'm trying to get some insight from there.
I don't...
Yeah.
So I can tell you the trajectory we've seen, right?
So really early on,
the, like, killer use case
was some form of writing assistant,
whether it was like a marketing writing assistant.
The Jaspers.
Right?
Jasper, copy AI.
We had like seven of them at one time, right?
And then you had like specialist writing assistants.
Some, I think, have gone on
to be really successful products like pseudo-write
or type AI as another one.
But they're still fundamentally, like,
helping people write better.
And then I think increasingly we've seen more diversification.
There was a wave of chat to documents in one form of another.
Chat PDF still doing well.
Chat PDF doing super well.
Once RAG started working like retrieval augmented generation, there was that.
But since then, as people are more problem driven and they're like trying to see,
okay, how can we use this?
We see a much broader range.
So even within, like take Duolingo as an example, they've got Duolingo max.
So that's like a conversational experience.
But they're also using large language models within the evaluation.
of that. They're all using it for content creation.
And each of these companies, sort of, you start with one use case, and I feel like it expands
because you just discover more and more things you can do the model with, do with the models.
Yeah, yeah, yeah. Do you see much code generation?
Yes. So I would say that, like, developer-focused tools, I would say, like, ad tech
and developer-focused tools are, like, probably two of the biggest areas that we see people
working on. Yeah. I'm always wondering, because code generation is so structured that
you might have some special affordances for that.
But again, that's anti the bitter lesson.
I was wondering what we can optimize for,
but that's my optimization brain when I should not.
I should just scale things up.
I think there's merit in both.
Yes.
Okay, so today, by the time we release this,
you will have announced your new pricing.
Yeah, that's right.
So one thing that people have said to us a lot, actually,
is that the barrier of entry to getting started with Human Loop is just quite high.
There isn't, you know, you can't just,
install an open source package and just get going or whatever it might be.
And there have been quite a few small companies that have signed up and then send us messages,
you know, we're a not-for-profit or an early-stage company.
We really want to use Human Loop, but it's just prohibitively expensive for now.
We wouldn't mind paying in the future.
And so we've thought really hard about how can we make it, like lower the buyers to entry it for
people to try it out and get started and get value and have the amount they have to pay,
scale much more with the value they get, so that they're only paying for things when they've
got value from Human Loop.
And so we will be launching a new set of pricing.
there'll be a free tier, so you can sign up, you can get going on the website, you can start
building projects and you won't have to pay anything.
And only once you get to a certain scale, you've got more than three people on the platform,
you're logging a certain amount of data to us, then pricing kicks in, and it scales with you.
So, you know, as your volumes go up, that's the time when you'll start paying us more.
So much more gradual than it is now.
And you're tying some features to the tiers?
A little bit, but mostly we're trying to give you just a sort of most of the product experience.
So on the free tier, I think there's one or two things you don't have, but you have almost everything.
And then once you're off the free tier, you have everything.
But the amount you pay kind of scale slightly differently.
So you get volume discounts at scale.
Awesome.
And so this is where one of the hard questions is, right?
Like, is there a graduation risk as people get very serious about logging?
You brought up Datadog earlier, and for sure Data Dog is looking at your market as much as you're looking at theirs.
So how do you think about that of, like, ultimately at scale?
becomes a commodity, the logging.
So I think that actually this is really different to that.
So the more people use it, we find actually the stickier it becomes.
It's almost the opposite.
That as they get to scale.
So you're right that the millionth feedback data point is worth a lot less than the
1,000th feedback data point.
But what continues to be really valuable is this infrastructure around the workflow
of prompt management, engineering, fixing things.
So we see, you know, you have, what happens over time is people put more and more
evaluations onto Human Loop.
they've got more people in their team, the product manager, and also three linguists and someone else
who are opening up the data that's being logged through human loop back into that interactive environment.
They're rerunning things.
They're plugging in other data sources.
And so over time, actually, the raw logs, I agree with you, kind of become commoditized.
But the tooling that's needed to be able to not just kind of collect the data, but make it useful
and do something with it to improve your model, that's the bit that becomes more valuable.
Right?
Once you have something working at scale, then improve.
proving it by a few percentage points is like very, very impactful.
So a lot of our customers early on would say exactly this to us.
Like, oh, we can just dump our logs to like an S3 bucket or we can plug it.
And then like, why do we need a special purpose tool?
And most of them come back to us later because what they find is, oh, okay, I've logged something,
but it's really difficult for me to like match up the log to like what model generated it
and then quickly run that and try something else.
Or I've like logged something and that log involved a retrieval.
and I would like to know what went wrong with retrieval
or which document the retrieval came from
and I didn't log that information correctly, et cetera, et cetera.
And the complexity of setting this up well is quite high.
So you can either spend a lot of time at that stage,
two things happen.
Either people roll their own solution.
And early on, we saw a lot of people build their own solutions
or they come and use something like us.
And I think increasingly, because we've been working on this
for now more than a year,
the difference between something you would build yourself
and sort of a bot solution is now quite enormous.
And so I just wouldn't recommend it.
And I guess the difference on the data dog point
or like other analytics tools,
you mentioned amplitude or data dog,
they're much more about passive monitoring.
And I think one of the amazing things about AI
is the interventions you can take
can be very quick and very powerful.
And so coupling very closely,
the ability to update a retrieval system
or change a prompt
to the analytics data
and allowing you to run those experiments,
I think it was very powerful.
Fantastic answer.
It's almost like we prep for this.
It's also almost like I think about this a lot.
If I didn't have an answer to that question,
it would be difficult to justify spending all my time building this.
But I do think it's very important.
Yeah.
Company building, what have you changed their mind on as a founder?
Ah, that's a great question.
So one thing that comes from my mind as soon as you say company building
is like a piece of advice that Michael Siebel has at YC, right?
Which is like, don't do it.
Or at least don't do it pre-PMF, right?
Like one of the biggest failure modes of early-stage startups
is, especially if they've raised investment from, you know,
large investors is that they persuade themselves that they have PMF too early and they go into
sort of scaling mode and hiring people.
And a lot of that stuff is important, but distracts from the most important thing that you
have to do, which is understand the needs that are most pressing for your customer,
figure out who the right customer is and build what they really want, or if they're not
necessarily know what they want, build what they really need.
So one thing that I believed and I still believe is that you want to do that at the right
time, that company building too early is a distraction.
When was that for you?
So for us, it was actually November, December last year.
So November, December 2020.
So we were a four-person company for almost two years.
And it was only when everything was breaking, when all the charts were up into the right
and we really could not service our customers anymore because the team was too small.
That's when we started actively hiring people.
And even then, we've been really slow and deliberate about it.
Maybe a little bit too slow given how much, like, there was a lot of suffering in being that slow.
I wish we had a couple more people when things took off.
There was a period of time, I'd say, from like November to March, where all of us were like
barely functioning because there was just so much to do.
But we've continued to have the bar set really, really high and higher slowly and very
deliberately.
And I think we get more done with a smaller team of really, really excellent people than we
would had we hired more people sooner.
So that's something I kind of agreed on.
The other thing that has maybe changed a little bit in my mind is related to how
opinionated you should be. So I think you asked this question about opinionation in the product.
And I think there's a risk of just listening to your customers and building what they want that can lead to
hill climbing. And I think especially, and we were guilty of this, I think, a little bit early on in the
first year of human loop. Well, you did it well. Better than most. Thank you. But I think that,
you know, where things started working for us was when we were, we had a lot more strength in our
convictions, right? When we said, actually, you know, we believe GPT3 is going to be.
the future of how people build this,
and even if people don't believe that today,
we're going to build for that future.
That is hard to do.
I still think we don't do it enough.
Like, I want us to do it even more.
We have things we believe about the future
that are somewhat contrarian
and being able to plan for that
and be opinionated and build for that future.
And also to be building the things
that we believe our customers need,
not exactly what they ask for.
Because otherwise, you end up, I think,
with a lot of very undifferentiated products
that are for everybody,
so they're not for anyone.
and they don't have a strong point of view.
So I think, especially for building dev tools,
I think you should have a point of view.
Yes, I strongly agree with that.
Hiring, what are you hiring for,
and given that you're now hybrid,
you're spending some time in SF, where are you hiring?
Yeah, so we're hiring in both SF and London.
The role that is most urgent for me right now personally
is hiring for a developer-relations engineer.
So this is an engineer who loves community,
loves documentation, likes going to talks, building demos,
those, you know, as part of launching this new pricing where we're going to have a free tier,
is also having a much bigger push towards helping individual developers and smaller teams
succeed with Human Loop as well. And even developers in larger companies who just want to get,
you know, try it out before they're at scale. And I think to do that well requires a really
good onboarding experience, really amazing documentation and really good community building.
And we need someone fully focused on that. I don't think it can be someone's part-time job.
We want someone 100% focus on building community.
Ideally, we'd find someone as good as you SWIX to do this job.
So, yeah, so if you're a developer-reation engineer,
or even if you're just a product-focused engineer
who is excited about AI and ML and has some track record of community building,
then that's the role that I would love to hear about.
And we'll be hiring for it primarily in San Francisco.
Although if you are amazing elsewhere, we'll consider it,
but SF being the focus.
Yeah.
Thanks for the compliment as well.
But yes, I'd highly recommend people check out
job, it's already live on the website.
A lot of people don't know. I have a third blog
that is specifically for DeVosal advising, because
I do do some angel investing and people
ask me for advice all the time, and I actually
cash my frequently ask questions
there. Anything else on the company
side that I didn't touch on? If you're
within YC, this will be boring, but if you're outside
of YC, I think that you probably can't hear
this enough times, because I've seen so many
people get this wrong, which is just
like, before PMF,
nothing other than PMF matters.
And there's just, there's so
many possible distractions as a startup founder or things you could be doing that sort of feel
productive, but don't actually get you closer to your goal. Like trying to narrow focus to finding
PMF and what that means will be a little bit different for different startups and, you know,
different experiences. I have friends who are doing deep tech, biotech startups or whatever. And so
I don't think there's one size fits all, but but try not to do anything else. That, that advice
has been really good for us. And it's often not, it's not intuitive. Yeah. Does
human loop have PMF right now?
I think we have PMF within niches.
So I think we definitely have like, especially for I would say like if you're a team building
an LM application within a larger company, then like yes, we see people sign up, they use
the product, more people use it over time, usage goes up, they give us great feedback.
There's always room for improvement.
But we have a form of PMF.
And I think there will be like multiple stages of it.
But we certainly found some PMF.
What is the next tier of PMF?
PMF that you're looking for?
Well, I'm hoping it's on this Eval's project that we're launching, right?
So we definitely have PMF on the current sort of prompt versioning management stuff.
We've got about 10 companies currently in closed beta on Eval's giving us a lot of feedback on it.
It's a real problem for them.
We've seen them get value from it, but we haven't launched it publicly yet.
I'm hoping that will be the next big one.
Yeah.
Just a technical question on the Evales, which I don't know if it's too small, but typically Eval's
involved writing code.
Yeah.
So it's like freeform, Python, JavaScript, something like that for you guys?
Yeah, so it's the combination of, and again, we're iterating on this, but yeah, you can define them in Python, and they can also call language models as well.
And it's executed on your servers?
Both are options.
Okay.
So we have a protected environment.
You can basically execute everything on our servers, which was not easy to build.
And I'm not the right person to talk about it, but I think there's a really interesting engineering blog and how you can make it safe for other people to exit code on your servers.
But also it's going to be set up such that you can also run things on yours and just.
to have the output logs still go to Human Loop and useful way.
Yeah.
This is the promise of the edge clouds of the world.
Yeah.
The Denos, the Cloud Fair Workers, the Models.
I don't know if you've explored any of those,
but then you would not have to set it up yourself, essentially.
I'm pretty sure they've all been exported on my team in recent months.
Yeah, yeah.
Okay, brought it out to market takes.
Yeah.
Just the, you know, brought in a human loop as a whole.
How do you feel about LMLM ops or PromptOps as a category term?
LM ops.
I would drop one L, firstly.
I think we call them large language models today.
but the goalpost of large is going to keep moving.
So I think the point is sort of foundation models or...
Oh, I have a proposal to deal with that.
Oh, yeah?
I have T-shirt sizing.
So I've defined S-XS and then M and L and all the way to XXL.
You're going to have to keep updating that over time.
But I think foundation model ops is maybe a better term
because I also think that like within six months we're going to have images
and then people won't call them just language models anymore.
Yeah.
And is it worth a separate category than MLOPS?
But I do think it's worth a separate category.
Okay.
I think that the people from its four are different.
We discussed this a little bit earlier, right?
But a machine learning engineer and a traditional software engineer are very different people.
They have different levels of knowledge and different goals.
I also think that the generality of the models has changed what people are building.
And so the problems they face are really different.
It's, you know, like what you need for building a recommender,
a small recommender system at enormous scale is very different from what you need to build a generative AI application that's very subjective.
And so I do think that they have, I actually think, like we've seen a lot of MLOps companies recently try to pivot into solving problems in this space.
And I think it's going to be hard for them because they're changing who they're building for.
So they now have to straddle two different sort of ideal customer profiles.
And they also have a lot of legacy infrastructure focused around models whose output was like a measurable, quantifiable number.
It was F1 or was accuracy or something like this.
And I think their lives are going to keep getting harder as the models go more general and go multimodal.
because what they've built so far
is won't fit that world.
I think it probably can be done
but I think it's going to be very hard.
You mentioned GPT4 Vision
and obviously there's more multimodal
models coming along the way.
How big does that factor into your planning
because you're very language-oriented right now?
So it's increasingly like an internal conversation
every time we have a product roadmap discussion
like planning for and starting to iterate on
and when to build in support for vision
has become very much front of mind.
So I think now, like we're working on it.
Okay.
One version of this, I pose this exact same question to Harrison,
which is, let's say, the GP4 Vision API drops tomorrow.
Yeah.
What changes in Human Loop?
Well, for one thing that you need just to be able, I mean, like very simple things, right?
Like, we need to be able to render and read in images in the playground environment that's interactive, right?
So there's a bunch of just kind of follow your nose things that I think we'd have to figure out.
But as I said, we've just started working on this.
It's sort of become a product roadmap item.
We, but not, like, we have to support it.
Like, it's very clear.
This is not a question of if it's a question.
of when. Okay. Yeah. Yeah. Excellent. Is prompt engineering dead? So we talked about this a little bit
on the walk here. And I've never been a huge fan of the phrase prompt engineering. Because I think it
simultaneously makes it not important enough and to important at the same time. I don't think it's a form of
engineering in the way that software is a form of engineering where it has this rich body of literature
and theory and you have to learn about it and takes like very specialist skill. I think you can get good
at it very quickly. But I do think that prompts are a very important.
part of LM or AI applications, right?
Like natural language-ridden instructions have become part of your source code.
And they have impacts on your product quality.
They have impacts on the way your product behave.
So you should be treated with that level of seriousness as you would any other code artifact.
So in that sense, I don't think it's dead.
I think it's alive and well and becoming increasingly important.
It's interesting.
There's like, you know, Anthropic had that very well-paid job, prompt engineer.
Yeah.
And I think they've hired a few prompt engineers now as well.
and those people are leading on deployments in Anthropic and adding a lot of value.
So there's clearly, it's clearly happening.
But I think maybe it's slightly misnamed.
I actually prefer your kind of AI engineer framing,
where this is a different engineering skill set.
You still need to be able to build product.
You're still an engineer.
But you have an intuition for how to get the best out of models,
how to evaluate them.
You understand the problems that come from sarcasticity.
And you also understand just the nuances.
Like if you have a good mental model for how a large language model works,
I think prompt engineering becomes a lot easier.
year. And so having that skill set, I think, is going to be important. But I doubt that five years
from now, there will be like a separate job title of prompt engineer. Yeah. Yeah. I try to contrast it
basically as prompt engineering is so 2022 and AI engineering is 2023. But yeah, the central thesis is
is you can't just get by with prompts. You have to write code to manage prompts, to generate prompts,
and to generate code and to, for you say, evaluate and run that code.
Yeah, I think I agree with all of that.
But to me, that doesn't diminish the importance of the prompts as an artifact.
Still important.
Yes.
I feel like when I saw a chain of thought for the first time, I went from a world in which I was like,
okay, models are not good at reasoning to models can do some reasoning.
Yes.
It was a sort of step change in my beliefs about the capabilities of these models.
Yeah.
And I still think that the LLM Cascades paper hasn't had the impact.
Can you summarize that?
So this was a paper from Google, and it's just sort of getting you to view
LMs as a way of doing inference in a probabilistic programming framework. So that's a lot of words.
So let me try and sort of unpack that. And you have a PhD in this. But, but you know, before
AI was all LLMs, there was and there still is like a huge branch of research around probabilistic
programs. So this is just ways of like writing code where probability and random variables are
first class citizen. So you can have like random variables and then there's lots of different
operations you can do to condition and make predictions about them and do inferences around
them. And this language modeling Cascades paper basically said, hey, actually, like, large
language models are a really powerful inference engine that could be used as a composable piece
inside something that looks like a probabilistic programming language. And we were chatting earlier
today about the framework that will emerge for large language models. And I know you're working
on small and you've given this a lot of thought. And, you know, Langchain and Lama Index and all
these different groups, auto-GPT, are trying to circle around, like, what's the right set of
abstractions, how might we be able to compose LLMs in ways to write more complex programs?
And I think that LM Cascades paper was one of the first attempts to think about that in first
principles and say, okay, what are the primitives you might want? And I think I'm surprised it hasn't
been built on more. Yeah. The very, very first AI grant from Nat Friedman mentioned that
they were looking for a UI for Cascades and no one took them up on it. I don't think it needs a UI.
I think it needs a, I think it's a framework. It's a framework. I think you want it in code.
Yeah, yeah.
And I would love to work on it if I had all the time in the world.
It's sort of you always have to choose your, you know, you can't do everything at once.
Well, if someone is working on it, maybe reach out.
I would love to chat to people about it who are working on.
Yeah.
How many of your customers and users are actually worried about prompt injection and prompt security?
Not enough.
Really?
So I would say almost zero.
Yeah.
And I think that's correct today because very few of our customers have action-taking LLMs.
Yeah.
And I think as long as your models are like read-only, prompt injection isn't that big a deal.
It's not to me about leaking your prompts or something,
because the prompts are only really valuable in the context of your code anyway.
But I do think that once you get to the stage
where you're letting the models have read-write access to any source of data,
then prompt injection becomes a problem the same way any other form of code injection is a problem.
But honestly, no one ever asks us about it.
Right.
Like, almost never.
And I think that's because of the stage where people are at, right?
Which is that they're still trying to overcome hallucinations
and they're still trying to put guardrails in place around the behavior of the models.
and very few people are using agents in production
at meaningfully sized companies.
But I think as soon as that becomes the case,
if we do get to a stage where more people
are allowing the models to read from a data source
and write to a data source,
then prompt injection will become something they care about.
And you guys will be well positioned to offer something.
Absolutely.
I think sort of being this layer
between the role model and the end application
actually buys us a lot in terms of what we can help with.
Yeah.
Well, you know, there are a bunch of security-minded people.
who are trying to offer that as a standalone thing,
and it's a feature, not a product.
I think I'd agree with that.
OpenE ice fine-tuning rollout, which was last month,
how does that affect human loop?
Yeah, so when we started the first version of human loop,
chat GPT was 3.5 wasn't out yet.
It was all GPT3, and we saw a lot of fine-tuning at the time,
and post the release of 3.5 and 4,
by virtue of the fact that it was impossible to fine-tune,
like we could just see it in our analytics.
The amount of fine-tuning just kind of fell off a cliff,
Partly, I think, because the models were better, but also just partly, like, it wasn't an option.
And so I'm kind of interested to see now that 3.5 and 4 fine-tuning are back, whether that kind of fully recover...
4 isn't back yet, but it's...
3.5 fine-tuning being back.
We've definitely seen a lot in the past people generating outputs with GPD4, filtering based off evaluation or feedback criteria,
and then fine-tuning smaller, faster models.
And so I think we likely see a lot of fine-tuning.
of GP3.5 on four generated data, and that's a workflow that we've been, we natively support
within a human loop now. So you can actually kind of do all of those things without having
to leave it. If you have a bunch of generations, you can filter them on some criteria,
click fine tune, run ahead of evals and then decide whether or not to deploy that model.
But time will tell as to whether or not this is something that goes back up in importance
the way it used to be.
Yeah.
The question that occurs to me, always we talk about you being that layer that positions
you very well. A lot of people are fighting to be that layer. And it occurs to me that as a user
potentially of Human Loop and your competitors that I may not want to have to choose or be locked in.
Is there room for an open standard that everyone agrees to that we all say like, okay, just adopt
this one vendor-neutral thing and then we all consume from it? Maybe. I think it could happen.
We're not there yet.
I think things are moving too fast for that to be the case, for people to have clarity on that.
So maybe in the fullness of time there will be...
My suspicion is that both will happen, right?
That there will be some open standard that some people like to use.
But once you come to working on serious production use cases, you often actually want the
peace of mind of knowing that you're paying a real company that's going to be around to support
you that is focused on this.
that has the knowledge and expertise.
And so, as we've seen in many other spaces,
I suspect that there'll be a bit of both.
A bit of both.
Yeah, so the model I have in mind is Datadog versus the Open Telemetry crew.
And Data Dog is doing fine,
and the open telemetries, you know, crew is doing great as well.
So the last question on the market.
Did GPT4 get dumber this year?
I don't think so.
We've seen a lot of, like, conversation about this having happened.
I think GPD4 changed.
I think that they are regularly updating it,
and you certainly see that both in sort of people's attempts to,
you know, papers have been written about this,
and people were trying to do evaluations over time.
I think that the main takeaway shouldn't be like,
did GPD4 get dumber, right?
But the interesting question is like, did GPD4 change?
To which the answer, I think, is definitely yes.
There's no question about that.
And it's something that if you're a developer
of building products on top of GPD4
is something that you should think about a lot
because you're building on a platform
that will evolve and change over time,
and you can pin the base model, but not forever.
And so I think you need to, at the very least,
have really good testing frameworks
to be able to run regression test and know,
like, have things gotten worse over time?
If you can't answer that for yourself,
you're going to be scratching your head.
Like, do we make the prompts worse?
Did the retrieval system get worse?
Did something else change?
Did the user inputs distribution change?
Or did the model get worse?
And being able to disentangle those things easily,
I think the importance of that that's going to go up.
But I also think that it should, like,
give us pause for thought about kind of the balance
between what gets built on top of third-party providers and APIs in a closed world
and what we might want to do more open source.
And I suspect there'll be a mixture of both depending on the use case.
But you are building on shifting sand whenever you're building on someone else's platform.
Yeah, yeah, totally.
And then one local specific question before we go to the takeaway questions.
You went through IC and you are very, very familiar with the American tech scene,
but also you built your company here in London.
But what should, and I'm very US-focused, most of our audience is very US-centric.
What should people know about the European tech scene?
Yes, I think that London's one of the best places in the world, and Paris, for AI-focused folks.
With the Hugging Face. I don't know.
We've got Hugging Face in Paris.
We're sitting right now.
We're probably less than 200 meters from the offices of DeepMind.
Facebook AI research is here as well.
UCL's AI Center is here, which is where, you know, Jeff Hinton was.
and where a lot of great research is where DeepMind spun out of, actually.
So Shane Legg and Demis met at UCL.
So there's an amazing, and there's many more.
I can't list everything that's great, but there's many great AI institutions in the UK.
What I would say is that I think that Europe has been amazing on research
and continues to be a fantastic place for researchers,
but has been less good in my experience on productizing and trying to productize AI.
And so the difference that I feel being here versus being in the U.S.
is just the number of, like, if I go to San Francisco,
the density of people who are trying to build useful things
with large language models or with AI
and budding their head up against it
and discovering what works and what doesn't work
and trying great ideas and trying stupid ideas
and just learning together is much richer than what we have here.
I think the pure research labs, very competitive.
Anthropics just opened an office here,
opening eyes, opening an office here.
When you're hiring for talent,
you'll find as many or better people, you know, like equal quality people in both places,
but less so once you move towards productization.
And I suspect it's also to do with the investor ecosystem.
So we're sitting in the offices of Local Globe and Index were our first investors,
and they're both great.
But the number of investors that you have of that quality in Europe is not the same as the
US.
And the type of people you interact with, they're very different.
When I speak to VCs in the US, there's way more former founders.
There's way more people who have done dev tools before.
And there's way more support from the founders towards the ecosystem than there is in Europe.
People are trying, but the culture is not quite the same.
And that's why we're moving to SF, right?
We want to be, every time I've been to SF, good things have happened to you.
Whether it's like bumping into you or we get an introduction to an interesting investor or a customer,
or we just speak to someone who's been trying really hard to build something.
And, you know, we share an office here in London with Bloop that does, you know,
Blupe AI does sort of code search with LMs.
and we've tried our very best
to kind of aggregate a few other companies to us
and we're doing AI tinkers, you know, tomorrow.
So there is some of it here,
but you have to work so much harder.
Versus in an SF, you know,
you can't move for hitting some AI things.
We had a Thursday recently
with 10 AI meetups in one night.
Yeah, it's almost too much.
It is too much.
I'll go there and say it is too much.
Yeah, you need some time to build things too.
And there is, I would say,
actually in the ESF builder scene,
privilege that comes out of just having so much opportunity thrown at you and like that we like have
this like you know arms length this tastes for VC and I'm like no like they are partners in
building your business you know absolutely yeah so so I think it's I think it's interesting contrast
but you know as a person I'm not American I live most my adult life in America but I I feel for
non-US policymakers and VCs and people who care about their city who are like okay like we're
not SF what do we do
I honestly think that it's, you know, we think a lot about network effects and defensibility
and startups.
I think it's like the mother of all network effects, right?
The reason I'm going is not because I love the city.
I mean, SF's fine as a city.
I like it.
But I'm going because everyone else is going and everyone else is going because we're going, right?
And once you've attracted a certain talent density, I think it's really hard to compete
with that.
Oh, boy.
Okay.
It is true.
It's the honest truth.
Yeah.
I do want to work out a path for non-tech hub cities because, I mean, that's, that's
where I'm from, right? Yeah, and me too as well, right? But I also, I also think there's something
to be said for the most driven, most ambitious people, like finding a way to get to where the
center for their thing is. And like right now, today for like AI-focused products, I think it's
San Francisco. But for different things, the center is, you know, different places. If you're, you know,
Hollywood is the place to go if you're an actor or whatever. And there are different hubs for
different areas. It's a Paul Graham thing. You know, different cities breathe different ambitions
into you. And in San Francisco, apparently it's power.
It's not actually tech, it's power.
Okay, interesting.
And tech is a means to power.
Interesting.
There's a lesson in that for those of us who think about AGI safety.
And also, you know, not anywhere in San Francisco,
specific two square miles in San Francisco called the arena.
You have to get in the arena and build.
Okay, so broader takeaway questions.
So we always ask three of all our guests.
Acceleration.
What has already happened in AI that you thought would take much longer?
So this has been, since I started my PhD,
like every year things have happened that I thought would take much longer.
So when I started my PhD,
it was at a time when like deep learning had just sort of started working
and transfer learning even for like vision hadn't been figured out yet.
And people were talking about like,
oh, it's going to, you know, how long before we can train models
that don't need millions of annotated data examples,
how long, you know, so AlphaGo was happening just at that point in time,
the first version.
I have made predictions and been wrong again and again and again.
I've just been consistently too pessimistic.
And I think I'm quite an optimistic person.
You know, when would, you know, like Dota's surprise.
me when it happened. The first, like, vision transfer learning working in vision surprised me when
it happened. The continued successive scale and deep learning. And then finally, like, you know,
although I believe that LMs were going to be enormous and I thought GPD3 was going to be the
future, like just how good GPD4 and chat GPD turned out to be did surprise me. The first time,
I actually saw Claude before I saw chat GPT, but the first time I saw Claude and I like kept pushing
the limits of it with tasks that I knew were kind of at the frontier of what.
was currently possible and just saw it like blasting through these one after another, that was a
mind-blowing moment for me. And I think it was for a lot of the rest of us. I think we're going to
have a lot more of those. I think that's going to keep happening. Yeah, yeah. We are accelerating as we
speak. Exploration. What do you think is the most interesting unsolved question in AI? I think there's
actually some like obvious kind of elephant in the room unsolved problems that for some reason don't seem to
get the amount of airtime that they kind of obviously should. So continual learning to me is one of these.
Oh, God. Yeah. Like we all walk around as if it's,
just completely normal that these models never learn anything new.
Yeah, 2021 is when history ended.
You just think, yeah, 2021 is when history ended.
And you do retrieval augmentation with a vector database.
And like, you're done, right?
Like, why would the system keep learning after training?
And I think everyone knows that this is a problem,
but somehow it doesn't seem to me to get the amount of,
like the, I think this field in research is called continual learning
or lifelong learning.
And it doesn't seem to get the airtime that it used to.
It seems to be like an obviously enormous problem.
The other one that I think will happen naturally, but just hasn't happened yet, is just like more multimodality.
Right.
Like it's kind of obvious that these models should be plugged in to vision, audio, speech, et cetera, and have shared representations because there's so much to be gained from that.
And I think it's just like going to happen with time, but hasn't happened yet.
Yeah.
Well, I think the cost is just token space, I guess.
I don't know how much more you need to add every single modality.
Although I think Facebook released like six
We have some examples of this, right?
So like Ghetto from DeepMind
was a transformer model
that they trained across,
they just did policy distillation
so they trained a whole bunch of different
RL agents.
And they took the outputs of that,
which is like observation action reward triples
and trained a single transformer model
on all of that.
And then that one model could do any of those tasks.
Actually, okay,
Wells were in exploration mode.
There's a paper from DeepMind
came out at the same time as Gato
that I think is massively underrated.
And I don't understand
why it didn't get more attention.
which it was at the same New Europe's conference,
and I forget the exact title,
but I think it's called like in-context reinforcement learning
or something like that.
And they do something really similar to Gato.
They take an RL agent, they train it,
and then they distill that into a transformer model.
But what they do that's different
is they don't take the trained RL agent.
Instead, they take an untrained RL agent
and they record the full trajectory of its learning.
So early on in the data, the model's kind of crappy,
and by the end of the data, the model's been good at this task.
and then they train a transformer model to predict that sequence.
And in order to be good at predicting that sequence,
you have to predict that the sub-agent,
like the RL agent that generated the data,
gets better at the task over time.
And the only way that I can see to do that,
and in fact this seems to be what the model is doing,
is that you have to simulate a learning algorithm.
You have, the transformer has to simulate in context reinforcement learning.
And so they take all of these tasks,
they train on the learning trajectories,
and then they take a completely new task
that that transform model has never seen before,
and it learns to do that task.
And so it's learning from reward signals in context to achieve a new task.
And to me, that's huge.
It's a demonstration of, like, inner optimization within a transformer model,
and it's also a demonstration of, like, in-contacts, continuous learning
that's limited only by the length of the context window.
If the context window was really long, you could make this work practically.
I don't really know why that wasn't a bigger deal.
I don't know either.
This sounds fantastic.
Yeah, and Gato, I think the reason maybe it wasn't a bigger deal,
it came out exactly the same time as Gato,
and I think Gator just took all the attention.
So we just got done talking a lot about focus,
but given that you see potential in this,
and this would be huge for literally training anything,
would you be interested in exploring it at some point?
As in trying to train it myself?
Put this in production, some form of continuous learning.
Obviously that's on your radar, continuous learning.
I would love to, but I think you have to decide
what kind of company you want to be.
and this is something for like open AI or anthropic to focus on.
I feel like you have to be thinking about the fundamentals of like this is the kind of research I used to do as a PhD student.
So I'll put it this way, right?
Like you have the research background to do this and you're choosing not to.
And you're building a company that doesn't use your research specifically that part.
I mean, you know.
Reasonable question.
But I think that I'm excited about getting.
things useful into people's hands very quickly.
Like I like seeing, we talked about this earlier, right?
We've moved from the research phase to the engineering phase of AI.
It's the first time after having been in this field for maybe seven years where stuff goes
beyond like just kind of a graph, right?
Like the output of my work before would always be like, oh, look, there's a graph and like the
number is better now.
Versus we actually get to see, you know, we have a customer between Duolingo and two or three
of our other customers.
we've got three or four customers working on
better versions of teaching students,
right, tutors or language learning
or whatever it might be.
And to be able to make that incrementally better
and accelerate the time it takes to get there,
it just feels to be so much closer to it
to be on the engineering space right now.
Whereas I think there's an alternative user universe
in which I stayed in research
and I went to an open AI
or almost everyone from my research,
PhD research group, apart from Peter,
and now works at Deep Mind.
And I think I would have enjoyed that as well,
but I really wanted to start a company that built something
useful in-production, and I don't even think those companies do that much right now, right?
Like, it's only recently that Open Eye has sort of become a product company.
They're more of a research company.
They're building AGI, and I think that's true of the others,
and I think that's amazing and fascinating.
And if I had multiple lives, I would love to do that too.
But at least right now, I want to be building products
and putting them in people's hands, and it just feels a little bit far removed.
Yeah, yeah, makes sense.
And I think the world's better because you're actually coming at it with a full
knowledge of what came before.
Yeah.
I do think it's a huge advantage.
I do think like having a good conceptual understanding, like there's been a lot of people
that pivoted into, as you said, LLM ops earlier.
And I do think that actually knowing how it works, having a sense of what's going to come
next and being able to project forwards and build for it is difficult to do if you don't
have a good conceptual understanding of the machine learning.
Yeah, yeah, yeah, agreed.
Okay, well, I feel like this is a leading question, but what's one message you want everyone
to take away today?
Oh, wow.
That's a great question.
Really, if you're building a serious LLM application and you're trying to do, find the right prompts, optimize them, evaluate your models, then I really would encourage you to try out human loop.
Like, that's the use case that we really solve well for, especially if you're kind of having to collaborate with non-technical people, then human loop will probably solve a lot of pain for you.
Yeah.
Excellent.
Well, thanks so much for doing this.
I had a real joy getting to know you and debugging real life issues with you.
But that's the fun of latent space.
So thank you so much.
Thanks for having me. It's been an absolute pleasure to get to spend some time with you, Sean.
In this episode of the Latent Space podcast, we delved into the world of LLM Ops and had a wide-ranging conversation with Dr. Raza Habib, co-founder of Human Loop.
We covered, What is Human Loop? The three stages of prompt evals, the three types of human feedback, human loop's new free tier and pricing,
the competitive landscape and graduation risk of Human Loop, PromptOps versus MLOPs, Prompt Engineer versus AI Engineer.
Did GPT4 get dumber?
Europe's AI scene versus San Francisco.
And don't sleep on Raza's in-depth explanations of LLM Cascades and Deep Mind's work on continuous learning.
If you are interested in Human Loop, definitely check out their hiring page and new pricing and vote for them on the state of AI engineering survey.
Thank you for tuning in to the Latent Space podcast.
Don't forget to like, subscribe, and tweet your takes at Latent SpacePod.
Now go build.
