Lenny's Podcast: Product | Career | Growth - AI Engineering 101 with Chip Huyen (Nvidia, Stanford, Netflix)
Episode Date: October 23, 2025

Chip Huyen is a core developer on Nvidia’s Nemo platform, a former AI researcher at Netflix, and taught machine learning at Stanford. She’s a two-time founder and the author of two widely read books on AI, including AI Engineering, which has been the most-read book on the O’Reilly platform since its launch. Unlike many AI commentators, Chip has built multiple successful AI products and platforms and works directly with enterprises on their AI strategies, giving her unique visibility into what’s actually happening inside companies building AI products.

We discuss:
1. What people think makes AI apps better vs. what actually makes AI apps better
2. What pre-training vs. post-training is, and why fine-tuning should be your last resort
3. How RLHF (reinforcement learning from human feedback) actually works
4. Why data quality matters more than which vector database you choose
5. Why high performers are seeing the most gains from AI coding tools
6. Why most AI problems are actually UX issues

Brought to you by:
Dscout: The UX platform to capture insights at every stage: from ideation to production: https://www.dscout.com/
Justworks: The all-in-one HR solution for managing your small business with confidence: https://www.justworks.com
Persona: A global leader in digital identity verification: https://withpersona.com/lenny

Where to find Chip Huyen:
• X: https://x.com/chipro
• LinkedIn: https://www.linkedin.com/in/chiphuyen/
• Website: https://huyenchip.com/
• Substack: https://substack.com/@chiphuyen

Where to find Lenny:
• Newsletter: https://www.lennysnewsletter.com
• X: https://twitter.com/lennysan
• LinkedIn: https://www.linkedin.com/in/lennyrachitsky/

In this episode, we cover:
(00:00) Introduction to Chip Huyen
(04:28) Chip’s viral LinkedIn post
(07:05) Understanding AI training: pre-training vs. post-training
(08:50) Language modeling explained
(13:55) The importance of post-training
(15:20) Reinforcement learning and human feedback
(22:23) The importance of evals in AI development
(31:55) Retrieval augmented generation (RAG) explained
(38:50) Challenges in AI tool adoption
(43:19) Challenges in measuring productivity
(45:20) The three-bucket test
(49:10) The future of engineering roles
(55:31) ML Engineers vs. AI engineers
(57:12) Looking forward: the impact of AI
(01:05:48) Model capabilities vs. perceived performance
(01:08:23) Lightning round and final thoughts

Referenced:
• Chip’s LinkedIn post on what actually improves AI apps: https://www.linkedin.com/posts/chiphuyen_aiapplications-aiengineering-activity-7358971409227792384-y0mf/
• Prediction and Entropy of Printed English: https://www.princeton.edu/~wbialek/rome/refs/shannon_51.pdf
• Why experts writing AI evals is creating the fastest-growing companies in history | Brendan Foody (CEO of Mercor): https://www.lennysnewsletter.com/p/experts-writing-ai-evals-brendan-foody
• Inside the expert network training every frontier AI model | Garrett Lord (Handshake CEO): https://www.lennysnewsletter.com/p/inside-handshake-garrett-lord
• First interview with Scale AI’s CEO: $14B Meta deal, what’s working in enterprise AI, and what frontier labs are building next | Jason Droege: https://www.lennysnewsletter.com/p/first-interview-with-scale-ais-ceo-jason-droege
• Anthropic’s CPO on what comes next | Mike Krieger (co-founder of Instagram): https://www.lennysnewsletter.com/p/anthropics-cpo-heres-what-comes-next
• Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course): https://www.lennysnewsletter.com/p/why-ai-evals-are-the-hottest-new-skill
• The rise of Cursor: The $300M ARR AI tool that engineers can’t stop using | Michael Truell (co-founder and CEO): https://www.lennysnewsletter.com/p/the-rise-of-cursor-michael-truell
• Stanford webinar: How AI Is Changing Coding and Education, Andrew Ng & Mehran Sahami: https://www.youtube.com/watch?v=J91_npj0Nfw
• He saved OpenAI, invented the “Like” button, and built Google Maps: Bret Taylor on the future of careers, coding, agents, and more: https://www.lennysnewsletter.com/p/he-saved-openai-bret-taylor
• Anthropic co-founder on quitting OpenAI, AGI predictions, $100M talent wars, 20% unemployment, and the nightmare scenarios keeping him up at night | Ben Mann: https://www.lennysnewsletter.com/p/anthropic-co-founder-benjamin-mann
• Lenny’s vibe-coded app made on Lovable: https://gdoc-images-grab.lovable.app/
• Story of Yanxi Palace: https://www.imdb.com/title/tt8865016/
• Steve Jobs’s quote: https://www.goodreads.com/quotes/427317-remembering-that-i-ll-be-dead-soon-is-the-most-important

Recommended books:
• The Complete Sherlock Holmes: https://www.amazon.com/Complete-Sherlock-Holmes-Volumes/dp/0553328255
• AI Engineering: Building Applications with Foundation Models: https://www.amazon.com/AI-Engineering-Building-Applications-Foundation/dp/1098166302
• The Selfish Gene: https://www.amazon.com/Selfish-Gene-Anniversary-Introduction/dp/0199291152
• From Third World to First: The Singapore Story: 1965-2000: https://www.amazon.com/Third-World-First-Singapore-1965-2000/dp/0060197765

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email podcast@lennyrachitsky.com.

Lenny may be an investor in the companies discussed.

To hear more, visit www.lennysnewsletter.com
Transcript
A question that gets asked a lot is, how do we keep up to date with the latest AI news?
Why do you have to keep up to date with the latest AI news?
If you talk to the users and understand what they want or don't want, and look into the feedback,
then you can actually improve the application way, way, way more.
A lot of companies are building AI products.
A lot of companies are not having a good time building AI products.
We are in an idea crisis.
Now we have all these really cool tools to help you build everything from scratch.
They can help you design.
They can help you write code.
They can help you build your website.
So in theory, we should see a lot more.
But at the same time, people are somehow stuck.
They don't know what to build.
For all the AI hype, the data is actually showing that most companies try it, it doesn't do a lot,
and they stop.
What do you think is the gap here?
It's really hard to measure productivity.
So I do ask people to ask their managers: would you rather give everyone on the team very expensive
coding agent subscriptions,
or get an extra headcount?
Almost everyone, the managers, would say headcount.
But if you ask VP level or someone who manages a lot of teams, they would say the AI assistant.
Because as a manager, you are still growing, so for you, having one extra headcount is big.
Whereas as an executive, maybe you have more business metrics that you care about.
So you actually think about what actually drives productivity metrics for you.
Today, my guest is Chip Huyen. Unlike a lot of people who share insights into
building great AI products and where things are heading, Chip has built multiple successful AI products,
platforms, and tools. Chip was a core developer on Nvidia's Nemo platform and an AI researcher at Netflix.
She taught machine learning at Stanford.
She's also a two-time founder and the author of two of the most popular books in the world of AI,
including her most recent book called AI Engineering, which has been the most read book on the
O'Reilly platform since its launch.
She's also gotten to work with a lot of enterprises on their AI strategies, and so she
gets to see what's actually happening on the ground inside a lot of different companies.
In our conversation, Chip explains a lot of the basics, like what exactly do pre-training
and post-training look like?
What is RAG?
What is reinforcement learning? What is RLHF?
We also get into everything she's learned about how to build great AI products,
including what people think it takes and what it actually takes.
We talk about the most common pitfalls that companies run into,
where she's seeing the most productivity gains, and so much more.
This episode is quite technical, more technical than most conversations I've had,
and is meant for anyone looking for a more in-depth conversation about AI.
If you enjoy this podcast, don't forget to subscribe and follow it in your favorite podcasting app or YouTube.
And if you become an annual subscriber of my newsletter,
you get a year free of 16 incredible products,
including Devin, Lovable, Replit, Bolt,
n8n, Linear, Superhuman, Descript, Wispr Flow, Gamma,
Perplexity, Warp, Granola, Magic Patterns,
Raycast, ChatPRD, and Mobbin.
Head on over to lennysnewsletter.com and click Product Pass.
With that, I bring you Chip Huyen,
after a short word from our sponsors.
This episode is brought to you by Dscout.
Design teams today are expected to move fast,
but also to get it right.
That's where Dscout comes in.
Dscout is the all-in-one research platform built for modern product and design teams.
Whether you're running usability tests, interviews, surveys, or in-the-wild fieldwork,
Dscout makes it easy to connect with real users and get real insights fast.
You can even test your Figma prototypes directly inside the platform.
No juggling tools, no chasing ghost participants.
And with the industry's most trusted panel plus AI-powered analysis,
your team gets clarity and confidence to build better without slowing down.
So if you're ready to streamline your research, speed up decisions, and design with impact,
head to dscout.com to learn more.
That's D-S-C-O-U-T dot com.
The answers you need to move confidently.
Did you know that I have a whole team that helps me with my podcast and with my newsletter?
I want everyone on that team to be super happy and thrive in their roles.
Justworks knows that your employees are more than just your employees.
They're your people.
My team is spread out across Colorado, Australia, Nepal, West Africa, and
San Francisco. My life would be so incredibly complicated if I had to hire people internationally, pay people
on time and in their local currencies, and answer their HR questions 24-7. But with Justworks,
it's super easy. Whether you're setting up your own automated payroll, offering premium benefits,
or hiring internationally, Justworks offers simple software and 24-7 human support from small
business experts for you and your people. They do your human resources right so that you can
do right by your people. Justworks. For your people.
Chip, thank you so much for being here and welcome to the podcast.
Hi, Lenny. I've been a big fan of the podcast for a while, so I'm really excited to be here. Thank you for having me.
I want to start with this table slash chart that you shared on LinkedIn a while ago that went
super viral, and I think it went super viral because it hit a nerve with a lot of people. Let me just
read this, and we'll show this on YouTube for people that are watching. So it's this very simple table
you shared of what people think will improve AI apps and what actually improves AI apps. What
people think will improve AI apps: staying up to date with the latest AI news, adopting the newest
agentic framework, agonizing over what vector database to use, constantly evaluating which model is
smarter, fine-tuning a model. And then you have what actually improves AI apps: talking to users,
building more reliable platforms, preparing better data, optimizing end-to-end workflows,
writing better prompts. Why do you think this hit such a nerve with people? And
if you had to boil it down, what do you think people are missing about building
successful AI apps?
One question I get asked a lot is, how do we keep up to date with the latest
AI news?
And I'm like, why do you need to keep up to date with the latest AI news?
I know it sounds very counterintuitive, but there's so much news out there.
A lot of people also ask me questions like, how do I choose between two different technologies?
Maybe recently it's MCP versus the agent-to-agent protocol, and it's like,
which one is better, this or that?
And there's a series of questions I ask them.
First, how much improvement could you get
from the optimal solution versus a non-optimal solution?
And sometimes they're like, actually, it's not much.
And I'm like, okay, if it's not much improvement,
then why do you want to spend so much time debating something that doesn't make
much difference to your performance?
And another question I ask is, if you adopt a new technology,
how hard would it be to switch it out for something else?
And sometimes they're like, oh, I think it would be a lot of work switching it out.
And I'm like, hmm, let's say it's a new technology.
It hasn't been tested by a lot of people.
And if you adopt it, you'd be stuck with it forever.
Do you actually want to adopt it?
Right.
Maybe you want to think twice about overcommitting to new technologies that haven't been
well tested.
I love that your broader advice is just simple.
To build successful apps, talk to users, build better data,
write better prompts, optimize the user experience.
Versus just, what is the latest and greatest?
What's the best model to use right now?
What's happening in AI?
Let me follow this thread of this idea of fine-tuning and basically post-training.
There's all these terms that people hear in AI,
and I think this is going to be a really good opportunity for people to learn what we're actually talking about,
since you actually do these things, you build these things, you work with companies doing these things.
And there's a few terms I want to sprinkle in through the conversation.
But let's start with this one.
What's the simplest way for someone to understand
the difference between pre-training and post-training,
and then just how fine-tuning fits into that,
just what fine-tuning actually is?
Disclaimer: I don't have full visibility
into what these big, secretive frontier labs are doing.
But from what I've heard,
one approach is supervised fine-tuning,
where you have demonstration data.
You have a bunch of experts say, okay, here's a prompt,
and here is what the answer should be like.
And you just train the model to emulate what the human expert would write.
And that's also what a lot of open source models are doing, except they do it by distillation.
So instead of having human experts write really great answers to prompts,
they get very popular, famous, good models to generate the responses, and then train a smaller model to emulate them.
So sometimes you see people do that.
I really appreciate the open source community,
by the way, but
being able to train models
that can emulate an existing
good model is very different from
being able to train
good models that outperform existing
good models. So it's a big step there.
So yeah, we have supervised
fine-tuning, and another thing that's
very big, I'm not sure you have had
guests talking about it already, is reinforcement
learning. It's everywhere.
Let's pull on that because I definitely want to spend time on that.
That's such a cool topic that's emerging more and more in my conversations.
But just to summarize the things you just shared, which I think is really, really important stuff.
So the idea here is a model is essentially this algorithm, a piece of code that someone writes,
and, say, the frontier labs are feeding it basically the entire internet of content.
And basically it's trying to test itself on predicting, across all that data, the next word.
Token is the correct way to think about it,
but a simpler way to think about it is the next word in text.
And as it gets it wrong, it adjusts these things called weights, essentially.
Is that a simple way to think about it, even though
even that's just very surface level?
So I think of language modeling as a way of encoding statistical information about language, right?
So let's say that we both speak English.
So we kind of get a sense of what is more statistically likely.
Like if I say "my favorite color is," then you would think, okay, the next word should be a color.
The word "blue" would be much more likely to appear than the word "table," right?
Because statistically, "blue" is more likely to follow "my favorite color is."
So that's it, it's a way of encoding statistical information.
So when a language model is trained on a large amount of data, it sees a lot of languages, a lot of domains.
So it can tell, okay, if the user gives it a prompt,
it would come up with the next most likely token.
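A minimal sketch of that idea, assuming a tiny made-up corpus and simple word-pair counts rather than a trained neural network. Real language models work on tokens and learn far richer statistics; the corpus and function name here are illustrative only.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration.
corpus = [
    "my favorite color is blue",
    "my favorite color is green",
    "my favorite color is blue",
    "the table is brown",
]

# Count which word follows each word.
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current, following in zip(words, words[1:]):
        next_word_counts[current][following] += 1

def next_word_distribution(word):
    """Return {next_word: probability} estimated from the pair counts."""
    counts = next_word_counts[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

dist = next_word_distribution("is")
print(dist)
```

In this toy corpus "blue" comes out as the most likely word after "is", and "table" never follows it at all, which is the statistical intuition described above.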
So by the way, it's not a new idea.
It's actually a very, very old idea, from a 1951 paper
on the entropy of English, I think by Claude Shannon.
It's a great paper.
And there's a story I really like. Did you read Sherlock Holmes, by the way?
Yeah, I read a few Sherlock Holmes, yeah.
Yeah, so there's a story where Sherlock Holmes uses this statistical information to help solve a case.
So this is the story.
Somebody left a message with a lot of stick figures.
So Sherlock Holmes was like, okay, he knows that in English the most common letter is E,
so the most common stick figure must be E.
Right.
And he starts from that,
and he goes on to solve the code.
So in a way, it's simple language modeling, right?
But instead of at the word level, he does it at the character level.
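The character-level trick in that story can be sketched as a frequency count. This is an illustration under assumptions: the plaintext is invented, and a simple letter shift stands in for the stick figures.

```python
import string
from collections import Counter

def symbol_frequencies(message):
    """Count how often each non-space symbol appears, most common first."""
    return Counter(ch for ch in message if not ch.isspace()).most_common()

# Hypothetical secret message: every letter is replaced by another symbol.
# A 3-letter shift plays the role of the stick-figure alphabet here.
plaintext = "meet me here we see the tree"
alphabet = string.ascii_lowercase
cipher_table = str.maketrans(alphabet, alphabet[3:] + alphabet[:3])
ciphertext = plaintext.translate(cipher_table)

# First guess: the most common symbol stands for "e",
# the most common letter in English.
most_common_symbol, count = symbol_frequencies(ciphertext)[0]
first_guess = {most_common_symbol: "e"}
print(ciphertext, first_guess)
```

A real decipherment would continue from this first guess, using known words and other letter statistics to fill in the rest of the mapping.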
And a token is something in between, right?
A token is not quite a word, but it's bigger than a character.
We use tokens because they help us
reduce the vocabulary. Characters give you the smallest
vocabulary:
you have like 26 characters, but words can be in the millions and millions, right?
Whereas with tokens you can get the sweet spot between the two.
So let's say that we have a new word like,
how to say, "podcasting," right?
Let's say it's a new word, but it can be divided into "podcast" and "ing."
So people would say, okay, "podcast," we know the meaning.
We know that "ing" is like a verb ending, a gerund, whatever it is.
So we understand the word "podcasting."
So that's where tokens come in.
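As a rough illustration of splitting an unseen word into known pieces, here is a greedy longest-match splitter over a small hand-written vocabulary. This is a simplification: production tokenizers typically learn their vocabulary from data (for example with byte-pair encoding), and the vocabulary below is made up.

```python
def tokenize(word, vocab):
    """Greedy longest-match split of `word` into pieces found in `vocab`.

    Anything not covered by the vocabulary falls back to single characters.
    """
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining piece first, then shrink it.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

# Hypothetical subword vocabulary.
vocab = {"podcast", "pod", "cast", "ing"}

tokens = tokenize("podcasting", vocab)
print(tokens)
```

The word "podcasting" is not in the vocabulary, yet it is still represented by two pieces the model has seen before.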
But yeah, pre-training is basically encoding statistical information about language
to help you predict what is most likely.
I think "most likely" is a simplified way of putting it, because it's more like building
a distribution: okay, the next token, 90% of the time it could be
a color, and 10% of the time it could be something else, right?
So it's basically a distribution, and the language model will pick from it depending on your sampling strategy.
Do you want it to always pick the most likely token, or do you want it to pick something
more creative?
So I think sampling strategy is something extremely
important. It can help you boost a model's performance in a huge way, and it's very, very underrated.
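Two common sampling strategies can be sketched over a made-up next-token distribution: always taking the most likely token (greedy) versus sampling with a temperature. The probabilities and function names are assumptions for illustration, not taken from any particular model or library.

```python
import math
import random

def apply_temperature(probs, temperature):
    """Rescale a {token: probability} distribution.

    Temperature below 1 sharpens it toward the top token;
    temperature above 1 flattens it, giving rarer tokens more chance.
    """
    scaled = {t: math.exp(math.log(p) / temperature) for t, p in probs.items()}
    total = sum(scaled.values())
    return {t: s / total for t, s in scaled.items()}

def greedy(probs):
    """Always pick the single most likely token."""
    return max(probs, key=probs.get)

def sample(probs, temperature=1.0, rng=random):
    """Draw one token at random according to the rescaled distribution."""
    adjusted = apply_temperature(probs, temperature)
    tokens = list(adjusted)
    weights = [adjusted[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical distribution after the prompt "my favorite color is".
next_token_probs = {"blue": 0.6, "green": 0.3, "table": 0.1}

print(greedy(next_token_probs))
print(sample(next_token_probs, temperature=1.5, rng=random.Random(0)))
```

Greedy decoding is deterministic, while a higher temperature trades some likelihood for variety, which is one way to get "something more creative".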
Okay, awesome. So essentially, a model is just code with this whole set of weights,
essentially the statistical model that has learned to predict what comes next after certain
words and phrases. Yeah. And then post-training and fine-tuning specifically is doing that same
thing. So pre-training, you get like GPT-5. Fine-tuning is someone taking GPT-5
and doing the same sort of thing,
adjusting these weights a little bit
for specific use cases on data that they
find necessary for their very specific use case.
Is that a simple way to think about it?
Yeah, think of weights as part of a function, right?
So let's say you have
a function where maybe Lenny's height
is like 1x plus something,
or 2x plus something, and the 1 and the "plus something" are the weights, right?
So you change them until you fit the data.
There's the correct data, which is like my height and your height, right?
So the weights are just the numbers in the function,
and you adjust the weights so the function fits the data, which is the training data.
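The "adjust the weights until the function fits the data" idea can be sketched with a one-variable linear function trained by gradient descent. The data points, learning rate, and step count are invented for illustration; real models have billions of weights, but the loop has the same shape.

```python
# Made-up training data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

# The two weights of the function y = w * x + b, starting from zero.
w, b = 0.0, 0.0
learning_rate = 0.05

for _ in range(2000):
    # Gradient of the mean squared error with respect to each weight.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    # Nudge each weight in the direction that reduces the error.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 3), round(b, 3))
```

After enough small adjustments the weights settle close to the values that generated the data. Fine-tuning follows the same principle, starting from already-trained weights instead of zeros.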
Awesome.
Okay.
So we're talking about pre-training, post-training, fine-tuning.
Is there anything else here that's important to share about just like what this is exactly,
what people need to understand about these parts of training?
So the vast majority of the time, we don't touch pre-trained models anymore.
As users, we don't use them.
Right. It's already done for us.
Yeah. It's actually a bit of a fun process. When my friends are training models,
I try to play with their pre-trained models, and they're horrendous.
They say things that are like, oh my gosh, it's crazy.
So it's very interesting to look at how much post-training can change the model's behavior.
And I think that's where a lot of people are spending energy nowadays.
The frontier labs are focused on post-training.
Pre-training has been used to increase the general capacity of a model,
the capabilities of a model.
And it needs a lot of data and model size to increase the model's capabilities.
And at some point, we have kind of maxed out on the internet data, right?
Text data is maybe maxed out.
I think a lot of people are working with other data, like audio and video.
And everyone's trying to think of what the new source of data is.
But with post-training, it's more like, everyone can have very similar pre-training data.
Post-training is where they make a big difference nowadays.
This is a good segue to, you talked about supervised learning versus unsupervised learning.
I love we're getting into this, by the way.
This is super interesting.
So you're talking about labeled data.
Basically supervised learning is AI learning on data that somebody has already labeled and told it.
Here's correct versus incorrect.
For example, this is spam versus not spam.
This is a good short story.
This is not a good short story.
We've had the CEOs of a lot of these companies that do this for labs,
Mercor, Scale, Handshake, there's Micro1, there's a few others.
So is that essentially what these companies are doing for labs,
giving them labeled data, high quality data to train on?
It is in a way, but I think it's more like one part of a bigger equation.
So there are a lot more different components than that.
So that's why we're talking about reinforcement learning.
I'm not sure if the CEOs that you interviewed brought up that
term. So the idea is, let's say you have a model.
You give the model a prompt, and it produces an output. You want to
reinforce, to encourage, the model to produce an output that is better, right? So
now it comes to, how do we know whether the answer is good or bad? So
usually people rely on signals. One way to tell whether a response is good or bad is
human feedback. Say we have two responses. You can say, okay, this one is better than the
other. And we do that because, as humans, it's very hard for us to give a
concrete score, but it's easier to do comparisons. Like, if you ask me, okay, give this
song a score, I'm not a musician, I don't know how hard it is, so
I don't know, out of 10, I'm going to say six. And then if you ask me
again a month from now, I've completely forgotten, and I say, okay, maybe now seven,
or maybe four, I don't know. But then if you ask me, okay, here are two songs, which one
would you prefer to play at the birthday party? I'm like, okay, I'd prefer this
song. So comparison is a lot easier. So you have human feedback, and then you use this human
feedback to train a reward model. And then the reward model will help you:
okay, the model produces this response, and the reward model can score it. Is this good or bad?
And you're trying to bias the model toward producing better responses. Another way is that
instead of using a human, you can use AI, right? You ask an AI to look at the
response and say, yes, good or bad. And something that people are very big on
nowadays is verifiable rewards, which is kind of natural. So basically, they give it a math
problem, and the model outputs a solution. You know that the
expected response should be 482, and if it doesn't produce 482, then it's wrong, right? It's
not a good response. So a lot of the time people are using human labor
to produce, like,
how do I say, expert questions and expected answers,
and in a way to design systems that are verifiable,
so that the models can be trained on them.
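A verifiable reward of the kind described above can be sketched as a function that checks the model's final answer against a known expected answer. The answer-extraction rule (take the last number in the text) and the function names are assumptions for illustration; real setups use task-specific checkers.

```python
import re

def extract_final_number(text):
    """Return the last number that appears in the text, or None if there is none."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(numbers[-1]) if numbers else None

def verifiable_reward(model_output, expected_answer):
    """Reward 1.0 when the final number matches the expected answer, else 0.0."""
    answer = extract_final_number(model_output)
    if answer is None:
        return 0.0
    return 1.0 if answer == float(expected_answer) else 0.0

# Hypothetical model outputs for a problem whose expected answer is 482.
print(verifiable_reward("Adding the two terms gives 482.", 482))
print(verifiable_reward("So the answer is 402.", 482))
```

Because the check is automatic, no human or reward model has to judge each output. The human effort goes into writing the questions and expected answers.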
Okay, I'm really glad you went there.
This is essentially RLHF, reinforcement learning from human feedback,
which is exactly what I wanted to also talk about, right?
Yeah.
So I think it's more general.
It's a way of learning.
The training is reinforcement learning,
whether it learns from human feedback
or AI feedback or
verifiable rewards.
I'd say those are just different ways of
collecting signals.
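For the human-feedback signal, one common way to turn "this one is better than that one" comparisons into scores is a Bradley-Terry style model. The sketch below is a deliberate simplification: it learns one score per response from a made-up list of comparisons, whereas a real reward model is a neural network that scores arbitrary text.

```python
import math

def train_scores(items, comparisons, learning_rate=0.1, epochs=500):
    """Fit one score per item so that preferred items end up with higher scores.

    Each comparison is a (winner, loser) pair. The model assumes
    P(winner preferred) = sigmoid(score[winner] - score[loser]).
    """
    scores = {item: 0.0 for item in items}
    for _ in range(epochs):
        for winner, loser in comparisons:
            p = 1.0 / (1.0 + math.exp(-(scores[winner] - scores[loser])))
            # Gradient ascent on log p: push the winner up and the loser down,
            # more strongly when the model was surprised by the outcome.
            scores[winner] += learning_rate * (1.0 - p)
            scores[loser] -= learning_rate * (1.0 - p)
    return scores

# Hypothetical human judgments over three responses, A, B, and C.
comparisons = [("A", "B"), ("A", "B"), ("B", "C"), ("A", "C"), ("B", "A")]
scores = train_scores(["A", "B", "C"], comparisons)
print(sorted(scores, key=scores.get, reverse=True))
```

The learned scores can then stand in for the human: given a new pair, the higher score predicts which response people would prefer, even though nobody ever assigned an absolute score.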
Awesome. Yeah. We had a
co-founder of Anthropic on the podcast and he talked about
their version of RLHF, which is AI-driven
reinforcement learning. I love the way
you phrased it, where basically you want to
help the model, you want to reinforce
correct behavior and correct answers,
and this is the method to do it, whether
it's, say, an engineer seeing
an output from a model and being like, no,
here's how I would code it differently.
And then it's training a different model that the original model works with
to tell it, am I correct or not correct?
Is that right?
Yeah.
I think that's a way of looking at it.
And I think this space is so exciting nowadays because there are so many domain-expert
tasks that model developers want models to do well on, right?
Let's say you're an accountant, right?
Maybe they want the model to do an accounting task.
So they need a lot of accounting data, like examples from accountants.
So they need to hire a lot of them to do it.
Or they want to solve physics problems,
or, I don't know, legal questions and stuff, or engineering questions.
Or somebody was telling me they want the model to use coding
to solve scientific problems, and not just coding to build a product,
which is another whole different realm of things.
And also using very specific tooling.
Like, I'm not sure what apps you use, but maybe
some editing app, or QuickBooks, or Google Excel.
They require very tool-specific expertise.
So you want the models to learn that.
So they need a lot of human experts in these areas to create data to train them.
And it's a massive thing.
Everyone wants a lot of data, and the frontier labs have, like, unlimited
budgets.
But I think there's a little bit of low-key interesting economics here.
I'm not sure you've talked to your guests about it.
I find it very interesting to think about, because it's very lopsided, right?
There's only a very small number of frontier labs,
and they want a lot of data.
And there's a massive number of startups and companies providing data.
So you can see these companies, these startups doing data labeling,
and maybe they have massive ARR.
But then you ask, okay, how many customers do you have?
And it could be a very small number.
I'm not sure.
I'm not sure if you, so you're smiling.
Yeah, we chatted about that.
Yeah, so it makes me a little bit low-key uneasy.
You have a company that is growing like crazy,
but it's heavily dependent on two or three companies.
And at the same time, if I were one of these frontier labs,
what would be the right economic thing for me to do?
I'd want a lot of startups.
I'd want to have a lot of providers so I can pick and choose.
And then these providers would compete with each other
to lower the price, and they're so dependent on me
that they would sell to me regardless.
So I feel like, yeah, this economics,
the whole economics is very interesting to me,
and I'm curious to see how it plays out.
What I'm hearing is you're bearish on the future of these data labeling companies
because, as you said, they don't have a lot of leverage over pricing
because they have so few customers,
and there's so many people getting into the space.
So basically, even though these are some of the fastest growing companies in the world,
you're feeling like there's a challenge up ahead.
I'm not saying I'm bearish on it.
I think I'm curious, because things have a way of working out in ways that I don't expect.
So maybe these companies, since they have a lot of data, would be able to use that to get some insights that help them stay ahead of the curve.
You know, so I don't know.
A very fair answer.
Okay, while we're on this topic, I want to chat about evals, which is a very recurring topic on this podcast.
This is the other piece of data these companies share that AI labs really need.
Can you just talk about what an eval is, the simplest way to understand it,
and then how this helps models get smarter?
So the way people approach evals,
I think there are two very different problems.
One is as an app builder, right?
Let's say I have an app, maybe a chatbot.
It's very simple.
It's the first thing that came to my mind.
And I want
to know if the chatbot is good or bad, right?
So I need to come up with a way to evaluate the chatbot.
The other, I think of as task-specific eval design.
So let's say I'm a model developer and I want to make my model better at creative writing.
Right?
And I'm like, okay, but how do I even evaluate creative writing?
So I need someone who understands creative writing to think about
what makes a story good,
and then design the whole dataset and the criteria to evaluate creative writing.
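For the app-builder case, a very small eval can be sketched as a set of test prompts, each with explicit pass/fail criteria, run against the app. Everything here is hypothetical: the canned chatbot, the prompts, and the checks are stand-ins, and criteria for something like creative writing would need expert-designed rubrics rather than string checks.

```python
def toy_chatbot(prompt):
    """Stand-in for the real app: returns canned replies (hypothetical)."""
    replies = {
        "What are your opening hours?": "We are open 9am to 5pm, Monday to Friday.",
        "Can I get a refund?": "Sorry, I can't help with that.",
    }
    return replies.get(prompt, "I don't know.")

# Each test case pairs a prompt with the criteria a good answer must meet.
test_cases = [
    {
        "prompt": "What are your opening hours?",
        "checks": [lambda out: "9am" in out, lambda out: len(out) < 200],
    },
    {
        "prompt": "Can I get a refund?",
        "checks": [lambda out: "refund" in out.lower()],
    },
]

def run_eval(app, cases):
    """Return the fraction of test cases whose output passes every check."""
    passed = [
        all(check(app(case["prompt"])) for check in case["checks"])
        for case in cases
    ]
    return sum(passed) / len(passed)

pass_rate = run_eval(toy_chatbot, test_cases)
print(pass_rate)
```

The value of a harness like this comes mostly from the test cases and criteria, which is the design work described above. The code that runs them is the easy part.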
So, yeah, I think of that more as eval design.
That is very interesting.
You come up with criteria, come up with guidelines on how to do it,
and then also train people on how to do it effectively.
So in any case, I think evals are really, really fun because they're extremely creative.
I was looking at different evals that people have built,
and I was like, wow.
This is not dry at all.
It's just super, super, super fun.
We had a whole podcast on evals with Hamel and Shreya,
and that's exactly what they talked about.
It's actually really fun to create evals, for companies especially.
So let's still dig into that one a little bit more.
There's this kind of debate online that I don't know how big of a deal this debate is,
but it feels like people spend a lot of time thinking about this, this idea of, do we need evals for AI products?
Some of the best companies say they don't really do e-vails.
they just go on vibes.
They're just like, is this working well?
Can I feel it or not?
What's your take on just the importance of building e-vals
and the skill of e-vals for AI apps,
not the model companies?
You don't have to be absolutely perfect at a thing to win.
You just need to be good enough
and be consistent about it.
Okay, this is not the philosophy I follow,
but I have worked with enough companies
to see that play out.
So when I say why a company might not need evals, right?
Let's say you are an executive, right?
And you want to have a new use case.
So here's a use case you started out with.
You built it and it works well, right?
The customers are somewhat happy.
You don't have the exact metric for it.
But traffic keeps increasing, people seem happy.
People keep buying stuff, right?
And now here's an engineer coming in like, okay, we need evals for it.
And the executive was like, okay, how much effort do we need to put into evals?
And they were like, okay, maybe two engineers, this much time.
And maybe that would improve it.
And it was like, okay, so how much expected gain
can I get from it? And the engineer would be like, oh, maybe you can improve it from like
80% to like 82% or 85%, right. And it was like, okay, but if I take those two engineers
and have them launch a new feature, then it could give me so much more improvement, right?
So I think that's one of the reasons. Sometimes people think of evals like, okay,
this is good enough, don't touch it. If you spend a lot of energy on evals, you would
only get incremental improvement, whereas you could spend the energy on another use case. And maybe
you know that it's good enough because you vibe-checked it, right?
So I do think that's maybe what the debate is about.
I do think a lot of times people just get things to the place
where it's like, okay, good enough.
People run with it.
But of course, there's a lot of risk associated with it,
because if you don't have a clear metric, you don't have good visibility
into how the application and the model are performing.
It might do something very dumb, or it can cause you, I don't know,
something crazy can happen.
So, yeah, I do think evals are very, very important if you operate at scale
and where failures can have catastrophic consequences. Then you do need to be very
rigorous about what you put in front of the users, and understand the different failure modes,
like what could go wrong.
And also maybe in a space where the feature or the product is a competitive advantage, right?
You want to be the best at it.
So you want to have a very strong
understanding of where you are, and where you are relative to the competitors.
But if it's something that's more low-key, something like,
okay, it's not the core, it just helps our users, then maybe you don't need to be
so obsessed or rigorous about it. It's like, okay, that's good enough for now.
And if it fails, then it fails. Like, okay, I know it sounds terrifying, but, yeah.
Yeah. I think it's all about the question of return on investment. I'm a big fan of
evals. I love reading evals. But I understand
why some people would choose not to focus on evals right away
and choose bringing on new functionalities instead.
Awesome. That is a really pragmatic answer.
What I'm hearing is evals are great, very important,
especially if you're operating at scale,
but pick your battles. You don't need to write evals for every little feature.
Something that Hamel and Shreya shared is that people need just, like,
I don't know, five or seven evals for the most important elements of their product.
Is that what you see, or do you see a lot more in production that people build and need?
I don't think of it as just a fixed number of evals. Like, what is the
goal of evals, right? The goal of evals is to guide product development. So the reason
I'm a big fan of evals is that they help you uncover opportunities where
the product is not doing well. So we see this very often. It was like, okay, we
look at the evals and we realize, okay, it performed really poorly on this specific
segment of users. And then we look into what's
wrong with it. And it turns out we just don't have good messaging for it.
So maybe we should just focus on that segment, and it can improve significantly.
Yeah, so the number of evals really depends.
We have seen products with hundreds of different metrics.
Oh wow.
Like, going crazy.
This is because that product is general, right?
It has different types: one eval for, I don't know, verbosity, one
eval for sensitive user data.
And another one for length.
But let's just take a good example, a concrete example,
like deep research.
So you have the application, you build a model to do deep research
for you, right?
Like, have a prompt.
Let's say, okay, do comprehensive research on Lenny's podcast and
give me a report on what kind of topics he's interested in, what kind of
videos get the most views, or what topics he's missing that he should
cover, right? Have this kind of prompt. Then how do you evaluate the result, right?
I don't think there's one metric that would help. Maybe,
I think somebody has a benchmark where they get a hundred experts
to write a bunch of prompts and go through all the AI answers, and
it's extremely costly and slow, right? But you might have something else. For example,
one way I was thinking about it, I was talking to a friend about it, and one way is,
how do you produce the result, the summary?
First, you need to gather information.
And to gather information, you need to do a lot of search queries.
You grab the search results, and then you aggregate some of the search results,
and then maybe say, okay, I'm still missing this.
You have to do another round, and another round, and then in the end you have a summary.
So every step of the way, you need evaluations.
Not just at the end.
So maybe with the search queries, the first thing I think about is, okay, now I write five search queries.
I'm looking at how good these search queries are.
Like, are they similar to each other?
Because if the five search queries are very similar, like, okay, Lenny's podcast,
then Lenny's podcast last month, Lenny's podcast two months ago, right?
It's not very exciting.
But it's better if the keywords are more diverse, right?
And then I look at the results of the search queries. Let's say you enter a search query
like "Lenny podcast data labeling."
And it comes up with 10 pages, 10 results.
And then you come up with, oh, "Lenny podcast on," I don't know,
"frontier labs," and have 10 results.
And I look at the different web pages, like how much of them are overlapping.
I would look at both the breadth, like, are we getting a lot of pages?
But also, do we have depth?
And also, are they relevant? Because we could come up with search queries
that are completely irrelevant to the original problem.
So I feel like every aspect of it would need a way of evaluating.
Right.
So I don't think it's like, how many evals should I have?
But how many evals do I need to get good coverage,
high confidence in my application's performance?
And also to help me understand where it is not performing well so that I can fix it.
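The two intermediate checks described above (are the search queries too similar to each other, and how much do their results overlap) can be sketched in a few lines. This is only an illustrative sketch, not how any particular deep research product does it: the function names, the example queries, and the choice of word-level Jaccard similarity as the metric are all assumptions.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_pairwise_similarity(queries: list) -> float:
    """Average word-level Jaccard similarity over all pairs of queries.

    A lower score means the queries are more diverse.
    """
    token_sets = [set(q.lower().split()) for q in queries]
    pairs = list(combinations(token_sets, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def result_overlap(results_a: list, results_b: list) -> float:
    """Share of result URLs that two queries have in common."""
    return jaccard(set(results_a), set(results_b))


# Hypothetical queries, modeled on the examples in the conversation.
similar = ["lenny podcast", "lenny podcast last month", "lenny podcast two months ago"]
diverse = ["lenny podcast data labeling", "frontier labs interviews", "most viewed product episodes"]

print(mean_pairwise_similarity(similar))
print(mean_pairwise_similarity(diverse))
```

A relevance check against the original prompt would be a third metric alongside these two; it usually needs a human or model judge, so it is left out of this sketch.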
Awesome.
And I'm hearing also just especially for the very core use case,
like the most common path people take in your product is where you want to focus.
Yeah, so yeah.
Okay, there's one more term I want to cover,
and then I want to go in a somewhat different direction.
RAG. People see this term a lot, R-A-G. What does it mean?
So RAG stands for retrieval-augmented generation.
It's also not specific to GenAI.
So the idea is just that for a lot of questions,
we need context to answer.
So I think it came pretty early, I think it's from a paper in 2017.
So someone was like,
so they realized that for a bunch of benchmarks, question-answering benchmarks,
if we give the model information about the questions, then
the answers can be much, much better.
So what they did was just try to retrieve information from Wikipedia.
So for a question about a topic, retrieve that and then put it into the context, and
the answers are much better.
So I feel like it sounds like a no-brainer, right?
I mean, like, obviously.
So I think that's what RAG is in the simplest sense: just providing the model with the relevant
context so that it can answer the questions. And that's where things get really
more interesting, because traditionally, when it started out, RAG was mostly text. So we talk
about a lot of ways to prepare data so that the model can retrieve effectively.
Let's say not everything is a Wikipedia page, right? A Wikipedia page is pretty
contained, and you know everything in it is about a topic. But a lot of times
you have documents that are, like, really long, right? And they have weird ways of
structuring documents. Let's say that you have a document about Lenny's podcast, right? And
in the beginning, it says, from now on, "podcast" will refer to Lenny's podcast, right?
So let's say somebody in the future is like, okay, tell me about Lenny, right? And because
the rest of the document does not have the term Lenny, you might not
retrieve it. And the document is long enough that it's chunked into different parts. So
the second part doesn't have the word Lenny, so you cannot retrieve it.
So you have to find a way to process the data to make sure it can retrieve the
information that's relevant to the query, even though it might not be immediately obvious
that it is related.
So people come up with things like contextual retrieval, like giving
each chunk of the data the relevant context, maybe a summary or metadata, so that it knows.
Also, some people use hypothetical questions, which is very interesting.
For a given chunk of a document, you generate a bunch of questions that the
chunk can help answer. So when I have a query, it's like, okay, does it match any of
the hypothetical questions? So it can fetch it. So it's a very interesting approach.
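The hypothetical-question idea above can be sketched roughly as follows. This is a toy illustration under stated assumptions: in practice the questions would be generated by a model and matched with embeddings, whereas here the questions are hand-written stand-ins, matching is plain word overlap, and the chunk IDs and function names are hypothetical.

```python
from typing import Dict, List, Optional, Set


def tokens(text: str) -> Set[str]:
    """Lowercased words with surrounding punctuation stripped."""
    return {w.strip("?.,!").lower() for w in text.split()} - {""}


# chunk_id -> questions that chunk can answer. Hand-written here as a
# stand-in for questions an LLM would generate for each chunk.
hypothetical_questions: Dict[str, List[str]] = {
    "chunk-1": ["What is Lenny's podcast about?", "Who hosts the podcast?"],
    "chunk-2": ["How often are new episodes released?"],
}


def best_chunk(query: str) -> Optional[str]:
    """Return the chunk whose hypothetical questions best match the query.

    Scores by word-level Jaccard overlap; returns None if nothing overlaps.
    """
    q = tokens(query)
    best_id, best_score = None, 0.0
    for chunk_id, questions in hypothetical_questions.items():
        for question in questions:
            t = tokens(question)
            union = q | t
            score = len(q & t) / len(union) if union else 0.0
            if score > best_score:
                best_id, best_score = chunk_id, score
    return best_id


print(best_chunk("Tell me about Lenny's podcast"))
```

The point of the design is that the query is compared against questions rather than against the raw chunk text, so a chunk can be found even when it never mentions the query's key term.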
Okay, so maybe before I go to the next thing, I just want to say that data preparation
for RAG is extremely important. And I would say that in a lot of the companies that
I have seen, the biggest performance gains in their RAG solutions come from better
data preparation, not from agonizing over which vector database to use.
With vector databases,
of course, it's very important to care about things like latency, or if you have
very specific access patterns, like read-heavy or write-heavy.
Of course it matters.
But in terms of pure quality of answers, right, I think data preparation wins hands down.
When you say data preparation, what's an example to make that real and concrete for us to understand?
So one way, as I just mentioned, is that you chunk the data.
So we think about how big each chunk should be, right? Because
the context is something you want to maximize. Maybe, it's a very simple example,
you want to retrieve, like, a thousand words, right?
So if a data chunk is long, then it's more likely to contain more relevant information,
so you can retrieve more. But if it's too long, like, you have a thousand words and
each chunk is a thousand words, you get to retrieve only one chunk, so it's not very useful.
But if the chunks are short, then you can retrieve more relevant information.
It can also retrieve a wider range of documents and chunks.
But at the same time, if each chunk is too small, it contains less of the relevant context.
So we have to have very careful chunk design, like how big each chunk should be.
You add contextual information like summaries, metadata, hypothetical questions.
Somebody was telling me that a very big performance gain they got was from
rewriting their data in a question-answering format.
So they have a podcast, right?
Instead of using it as is, reframe it, rewrite it into, here's a question, here's the answer,
and produce a lot of them.
You can use AI for that as well.
So that's one example of data processing.
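The chunk-size trade-off described above (with a fixed budget of, say, a thousand retrieved words, bigger chunks mean fewer chunks fit) can be made concrete with a minimal word-based chunker. This is a sketch under assumptions: the function names are hypothetical, splitting on whitespace is a simplification, and the `overlap` parameter is an addition that the conversation does not mention.

```python
def chunk_words(text: str, chunk_size: int, overlap: int = 0) -> list:
    """Split text into chunks of at most chunk_size words.

    Consecutive chunks share `overlap` words (an assumption added here so
    that context at chunk boundaries is not lost entirely).
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def chunks_per_budget(budget_words: int, chunk_size: int) -> int:
    """How many whole chunks fit into a fixed retrieval budget."""
    return budget_words // chunk_size


# With a 1,000-word budget, 1,000-word chunks leave room for only one
# chunk, while 200-word chunks leave room for five.
print(chunks_per_budget(1000, 1000), chunks_per_budget(1000, 200))
```

Smaller chunks allow a wider range of documents in the same budget, but each one carries less surrounding context, which is why summaries or metadata are often attached to each chunk.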
A lot of the examples I see are for people using AI
with specific tool use and documentation, right?
And we write documentation, usually,
documentation today is written for human reading.
And AI reading is different, because humans, we have common sense.
And we kind of know what it is.
So one thing is that human experts have the context that AI doesn't quite have.
So somebody told me that a big change they made was,
let's say that you have documentation for a function, maybe in a library.
And the library says, okay, the output of this one is, I don't know, some crazy term,
maybe there's some temperature or something on a graph. It should be like one,
zero, or minus one. And as a human expert, you maybe understand the scale, like, what does one
on this scale mean? But for AI, it just really doesn't understand what that means.
So they actually have another annotation layer for AI. It's like, okay, temperature
equals one means this. It's not like it's an absolute temperature. It's more
associated with the scale over there. So it's just doing all this data processing to make it
easier for AI to retrieve the relevant information to answer the questions.
This episode is brought to you by Persona, the verified identity platform helping organizations
onboard users, fight fraud, and build trust. We talk a lot on this podcast about the amazing
advances in AI, but this can be a double-edged sword. For every wow moment, there are fraudsters
using the same tech to wreak havoc, laundering money, taking over employee identities, and impersonating
businesses. Persona helps combat these threats with automated user, business, and employee
verification. Whether you're looking to catch candidate fraud, meet age restrictions, or keep your
platform safe, Persona helps you verify users in a way that's tailored to your specific needs.
Best of all, Persona makes it easy to know who you're dealing with without adding friction for good
users. This is why leading platforms like Etsy, LinkedIn, Square, and Lyft, trust Persona to secure
their platform. Persona is also offering my listeners 500 free services per month for one full year.
Just head to withpersona.com/lenny to get started. That's withpersona.com/lenny.
Thanks again to Persona for sponsoring this episode.
Awesome. Okay. So you've talked a bit about
how you work with companies on these sorts of things, on their AI strategies, on their AI products,
how they build, which tools they build, all these things. I want to spend a little time here.
Because a lot of companies are building AI products.
A lot of companies are not having a good time building AI products.
Let me ask a few questions along these lines of what you've learned working with companies
that are doing this well.
One is just, I guess, in terms of AI tool adoption and adoption in general within companies,
there's all this talk recently of just like all this AI hype.
The data is actually showing most companies try it.
It doesn't do a lot.
They stop.
And so there's all this just like maybe this isn't going anywhere.
So in terms of just adoption of tools and AI within companies, what are you seeing there?
For GenAI in companies, I think there are two
types of GenAI tooling that I have seen. One is for internal productivity,
right? Like coding tools, chatbots for internal knowledge. A lot of big
enterprises have some kind of, like, wrapper around a model, but with access to
maybe some kind of a RAG solution. I think we talked about, kind of, text-based
RAG. We haven't talked about agentic RAG or multimodal
RAG yet, but yes, there's a whole very exciting area around that. Yeah, so
basically to allow employees to access internal documents. Someone is going to ask,
like, okay, I'm having a baby. What would be the maternity or paternity policy, right? Or,
I'm having this operation. Would the health benefits cover that? Or, I want to
interview, or I want to refer my friend, what would be the process for that? So a lot of
it is having an internal chatbot to help with internal operations.
And another category is more customer-facing,
or partner-facing.
So a customer support chatbot is a big one.
If you're a hotel chain, you might have a booking chatbot, which is somehow massive,
like, there are a lot of booking chatbots. Because I guess, I do have this theory that
a lot of the applications companies pursue are chosen because they can measure the concrete
outcomes. And I feel like a booking or sales chatbot is very clear, right? Here is the conversion
rate right now with human operators, and what would be the conversion rate
with a chatbot? So it's something with very clear outcomes, and it's
easier for companies to buy into these solutions. So a lot of companies have that customer-facing chatbot.
So yeah, that is another category of tool. And I think that, for customer- or
external-facing tools, because people are
driven to choose applications with clear outcomes,
the question of adopting them is really based on whether they see the outcome or not.
Of course, it's not perfect, because sometimes the outcome can be bad
not because the application idea itself is bad.
It's just because the process of building it is not that great.
Yeah, so it's tricky.
For the internal adoption of tooling,
like internal productivity,
that's where it gets tricky.
I would say a lot of companies,
when they think of AI strategy,
like, I think AI strategies usually
have two key aspects, right?
One is use cases.
And the second is talent.
You might have great data for great use cases,
but if you don't have talent, you cannot do it.
So a lot of times in the beginning with GenAI,
and still,
and I really admire a lot of companies for that,
it's just like, okay, we need our employees to be very GenAI-aware, very AI-literate,
right?
So what they do is they start adopting a bunch of tools for the team to use.
They have upskilling workshops, they encourage learning.
And it's a really, really good thing.
And they're also willing to spend a lot of money on giving people
ChatGPT subscriptions, Cursor subscriptions, Claude Code subscriptions,
to get the employees to be more AI-literate.
And that's the thing.
A lot of the executives in the company may say,
okay, we spent a ton of money on this tooling.
But then, because you can see the usage,
people don't seem to use them as much.
And what is the issue?
So, yeah, I think that is tricky.
What do you think is the issue?
Is it just they're not, they're like,
they don't know how to use them?
Like, what do you think is the gap here?
Do you think we'll get to a place of just like, wow, work is completely different because of AI for a lot of companies?
The main thing is it's really hard to measure productivity gains.
So I talked to a lot of people on this.
First of all, an example is coding, right?
A lot of companies are now using coding agents or AI-assisted coding.
And I was asking, do you think that it helps with your productivity?
And a lot of times the answers are very hand-wavy.
They'd say, okay, I feel like it's been better, right?
Okay, because we have more PRs, we see more code. And then they immediately correct themselves:
okay, but of course, number of lines of code is not a good metric for that.
Right.
So it's really, really tricky.
And here's something funny.
So I do ask people to ask their managers, because I work with, like, the VP level,
so they have multiple teams under them.
So I asked them, okay, ask your managers:
would you rather
give everyone on the team
very expensive coding agent subscriptions,
or get an extra headcount, right?
And almost everyone, the managers, would say headcount.
But if you ask the VP level,
or someone who manages a lot of teams,
they would say they want
AI assistive tools.
And the reason is, people say, okay,
because as a manager, right, you are still growing.
You're not at the level where you manage hundreds or thousands of people.
So for you, having one extra headcount is big.
So you want that not for productivity reasons,
but because you just want to have more people working for you.
Whereas as an executive,
you have more business metrics that you care about.
So you actually think about what actually drives productivity metrics for you.
So, yeah, so it's tricky.
And I think the question of productivity, I'm not sure it's fundamentally
subjective, but we just don't have a good way of measuring productivity
improvement.
Another thing is it also varies widely.
People do tell me that they notice different buckets of employees have different
reactions to AI assistive tools.
Again, I keep going back to coding because it's
big and it's easier to reason about somehow.
So I have different reports.
One person told me that,
among all his engineers, he thinks senior engineers would get the most output,
would be more productive. So that person is very interesting.
He actually divided his team into three buckets, but he didn't tell them, obviously.
He was like, okay, here are currently the best performing, average performing,
and lowest performing.
And then he ran a randomized trial.
So he gave half of each group access to Cursor.
And then he noticed over time, okay, something funny: the group
that got the biggest performance boost, in his opinion, and he's very close to his team,
was the senior engineers, the highest performing.
So the highest-performing engineers got the biggest boost out of it.
And then the second group was the average performing.
So his opinion is, okay, the highest-performing engineers are also more proactive.
They already know how to solve problems,
so they use it to solve problems better.
Whereas the people who are the lowest performing, they already don't care much about work.
Right.
So it's easier for them to just go on autopilot, get it to generate the code and just use it.
Or they just don't know how to use it.
Another company, however, told me that actually senior engineers are the ones most resistant to
using AI assistive tooling, because they are more
opinionated and they have very high standards. They were like, okay, but AI-generated
code just sucks. So they're very, very resistant to using this. So I don't know, I haven't
quite been able to reconcile the very different reports on that yet.
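The three-bucket trial described above (split engineers into performance buckets, then give half of each bucket access to the tool) is a form of stratified random assignment. The mechanics of the actual trial are not described in the conversation, so the following is only a minimal sketch: the function name, the bucket labels, the placeholder engineer names, and the fixed seed are all assumptions.

```python
import random


def stratified_assignment(buckets: dict, seed: int = 0) -> dict:
    """Randomly give half of each bucket the tool; the rest are controls.

    `buckets` maps a bucket label to a list of people. With an odd-sized
    bucket, the extra person lands in the control group.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    assignment = {}
    for members in buckets.values():
        shuffled = list(members)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        for i, person in enumerate(shuffled):
            assignment[person] = "tool" if i < half else "control"
    return assignment


# Hypothetical team; the names are placeholders.
team = {
    "highest": ["h1", "h2", "h3", "h4"],
    "average": ["a1", "a2", "a3", "a4"],
    "lowest": ["l1", "l2", "l3", "l4"],
}
groups = stratified_assignment(team)
for label, members in team.items():
    with_tool = [m for m in members if groups[m] == "tool"]
    print(label, len(with_tool), "of", len(members), "get the tool")
```

Randomizing within each bucket, rather than across the whole team, keeps the tool and control groups balanced on prior performance, which matters for a team of only 30 to 40 people.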
This is so interesting. So just to make sure I'm hearing the story. So there's a company
you work with that did a three-bucket test with their engineering team, where they created three
groups, the highest-performing engineers, mid-performing engineers, lowest-performing engineers,
and gave some of them access to, say, Cursor. Was it Cursor, or what did
they give them access to?
It was Cursor. I think he said it was Cursor.
Okay, cool. And so within...
I didn't work with them, this is more like a friend's company.
Okay, it's a friend's company.
So did they give half of the higher-performing engineers Cursor and half not, or how did they
do the split there?
Yeah, so they gave it to half of the entire company, but half of each
bucket, yeah. And then they observed the difference in productivity.
I see. Yeah. So how do they
even do that? They're just like, okay, you get Cursor, you don't get Cursor? How do they do that?
Yeah, I didn't get the mechanics of it. But I was like, respect to him for doing a
randomized trial.
That is so cool. Yeah. Okay. Wow. How large was this engineering team? Was it like
hundreds of people?
It's not that large. It's about, like, maybe 30 to 40. Yeah.
30 to 40. Okay. Yeah. Wow. Okay. So they found that the
highest-performing engineers had the most benefit from using AI tools.
And then behind them were the middle-tier engineers and the worst performers.
Yeah.
But also not the same everywhere.
Right, right, right.
Right.
Right.
This other example you shared, of senior engineers in this one example being the most resistant to changing the way they work, which I get.
I do feel like the most valuable people right now, other than ML researchers
and AI researchers like yourself,
are senior engineers,
because it feels like with junior engineers,
so much of this is now done by AI.
But an engineer that knows what they're doing,
that understands how things work
at a large scale, with AI tools,
basically has infinite junior engineers
doing their bidding.
It feels like an extremely valuable
and powerful asset.
Yeah, I definitely really appreciate that.
As you see, companies
really appreciate engineers
who
have a good understanding of the whole system
and are able to have good problem-solving skills,
thinking holistically instead of locally.
One company I have seen,
as they told me, they work completely differently now.
So they actually restructured the engineering org
so that they get more senior engineers
to be more into peer review.
They get them to write guidelines
on what good engineering practices are,
what the process should be like.
So they write
a lot of processes
on how to work well.
And then they have more junior engineers
just produce code and submit PRs,
with senior engineers more in a reviewing capacity.
So I think it might be preparing for the future.
Another company actually told me something very similar.
So they're kind of preparing for a future
where they only need a very small group of very, very strong engineers
to create processes
and review code to get it into production,
and get AI or junior engineers
to produce code.
But then the question becomes,
how does one become a very strong engineer?
Right, that's right.
That's right.
I feel like, yeah.
Yeah, so I don't know what the process is.
Who's thinking about it, like, yeah.
No one's thinking about it.
It's just, it's a problem.
We won't have any more in 10, 20 years.
There will be no more engineers
because no one's hiring junior engineers.
Although I could make the case that
junior engineers, people just getting into computer
science right now, are just native,
AI native. And in theory, you could argue they will become really good, really fast. If they're
curious, and aren't just delegating learning and thinking to AI, but are actually
using it to learn how to code well and architect correctly, you could argue they will be
the most successful engineers in the future.
I do think that what you mentioned,
learning to architect, I think I group that into system thinking. I do think it's a very
important skill, because I think AI can help automate a lot of discrete skills.
But knowing how to utilize these skills together to solve a problem is hard.
So there's a webinar with Mehran Sahami, who is one of my favorite professors.
He was the chair of the curriculum for the CS department at Stanford.
So he spent a lot of time thinking about CS education, right?
Like, what should students learn nowadays in the era of AI coding?
And the other person is Andrew Ng,
who is, of course, a legend in the AI space.
And Mehran, Professor Sahami,
said something very interesting.
He said a lot of people think that CS is about coding, but it's not.
Coding is just a means to an end.
CS is about system thinking,
using coding to solve actual problems.
And problems will never go away,
because as AI can automate more stuff,
the problems just get bigger.
But the process of understanding what causes the issue and how to design a step-by-step solution
to it will always be there.
So I think an example of that, I actually have a lot of issues with AI in the way
it's debugging.
I'm not sure if you use a lot of AI for coding, but something I've noticed, and also
seen from my friends, is that it is pretty good when you have a very clear, well-defined
task, maybe write documentation, fix a specific feature,
or build an app from scratch, right?
Like, it doesn't have to interact with a large existing codebase.
But if you add something a little bit more complicated,
maybe something that has to interact with a lot of components and stuff,
it's usually not that good.
For example, I was using AI to deploy an application.
And I was testing out a new hosting service I was not familiar with.
It was, okay, completely new to me.
So what AI does give me is confidence to try new tools.
Before AI, trying a new tool
meant reading the documentation from the beginning, but now I was like, okay, just try it out and learn.
So I was testing a new hosting service, and I kept getting a bug that was very,
very annoying. And I asked Claude Code to fix it. And
it kept changing things, like maybe change the environment variable, fix the code, maybe
change from this function to that function, maybe change the language, maybe it doesn't process
JavaScript well, I don't know, whatever. And it didn't work. And I was like, okay, that's it.
I'm going to read the documentation myself and see what's wrong.
And it turns out I'm on another tier.
The feature that I want is not available in this tier, right?
So I feel like the issue was that Claude was just trying to focus on fixing things
in one component, whereas the issue was from a different component.
So I think understanding how different components work together
and where the source of the issue might come from,
you need to have a holistic view of it.
And it's made me thing
it's like, okay, how do we teach AI, like system thinking?
Like that, right?
I think you could have human experts, like, write,
like, very much a scaffold.
It's just like, okay, for this kind of problem,
look into this, look into that, look into that,
and stuff.
So I think that could be one way.
But what that also made me think is, like,
how do we teach humans, like, system thinking?
Yeah.
So, yeah, so I think it's very interesting skill.
I do think it's very important.
That's exactly the same insight
Bret Taylor shared on the podcast. He's the co-founder of Sierra. He created Google Maps. He was
CEO of Salesforce, Quip, a few other things. And I asked him just like, should people learn to code?
And his point is exactly what you said, which is that taking computer science classes is not about learning Java and Python.
It's learning how systems work and how code operates and how software works broadly, not just here's like a function to do a thing.
One thing that I wanted to help people understand: you wrote this book
called AI Engineering, which is essentially helping people understand this new genre of engineer.
And you have this really simple way of thinking about the difference between an ML engineer and an
AI engineer, which has a really good corollary to product managers now, of just, like, an AI product
manager versus a non-AI product manager. The way you describe it, and fill in what I'm missing, is just:
ML engineers build models themselves. AI engineers use existing models to build products.
Anything you want to add there?
One thing I really dislike about writing books is that you have to define things like this.
And I think no definition is going to be perfect, because there will always be, like, edge cases.
But yeah, in general, I think it's like AI as a service, like models as a service,
where somebody builds the models for you and the base model performance is pretty strong.
So it enables people to just, like, okay, now I want to integrate AI into my product.
I don't need to learn what gradient descent is,
even though knowing that would really help.
But yeah, it makes the entry barrier really low
for people who want to use AI to build products.
And at the same time, AI capabilities are, like, so strong
that it has also increased the possibilities,
like the types of applications that AI can be used for.
So I think, yeah, both the entry barrier is, like, super low
and the demand for AI applications is, like, a lot bigger.
So it feels very, very exciting.
It opens up, like, a whole new world of possibilities.
Yeah, it's like, now you don't have to
spend time building this AI brain.
Now you can just use it to do stuff.
Such an unlock.
Okay, maybe just a final question.
You get to see a lot of what's working, what's not working, where things are heading.
I'm curious, just if you had to think about the next two or three years, where things are heading.
What do you think?
How do you think building products will be different?
How do you think the way companies work will be different? If you had to think of
maybe the biggest change we should expect to see in the next few years in terms of how companies work.
I think in a lot of organizations, they don't move that fast, right?
But at the same time, they also move faster than I expected.
Because, again, I think I'm biased: I don't work with dinosaur companies
that don't care.
I think a lot of executives who come to me are, like, very forward-looking.
So maybe my sense of
how fast organizations move is very biased.
So, yeah, I think one big change I see is just in organizational structure.
I think a lot of it plays out like this: before, right, we had a lot of
disjoint teams.
Like, we had a very clear engineering team, product team.
But then there's the question of, like, who should write evals, right?
Like, who should own the metrics?
And it turns out evals are not a separate problem, they're a system problem, right?
Because you need to look into the different
components and how they interact with each other, and you need to understand user behaviors, because you need to know
what users care about so that you can write evals that
reflect what users care about.
So on one side, you're looking into different component
architectures, placing guardrails and stuff.
So that's engineering, but understanding users is, like, product, right?
So because of a lot of things like that, evals are extremely important,
and they kind of bring the product team and the engineering team, even the marketing team,
like user acquisition, very close to each other.
So, yeah, I'm seeing new ways people are restructuring
so there's more communication between previously very distinct functions.
Another thing is, I also see teams starting to think about
what can be automated in the next few years and what cannot be automated.
And I see that people are already, like, shedding, and it actually is a little bit
scary to think about, but it's what people on these teams have told me.
It's like, okay, this is between you and me, but
we, like, got rid of these functions, right?
Like, for a lot of things that were previously outsourced, for example:
traditionally, a business outsources things that are not core to it
and that can be more systematized.
So with those, you can actually use AI to automate a lot of that.
And also, people think more about what the value is of junior engineers
versus senior engineers, and how to restructure the engineering org for that.
So yeah, I do see different things there.
But one thing I see across organizations is people moving pieces around and thinking
about use cases, whether they're going to spin out new use cases and who would lead
the new effort.
And, yeah, that is one big change.
Another thing, in terms of AI: I'm not sure how true this is,
but I guess I'm also in the camp that thinks it has merit. It's the camp of,
like, okay, with base models we have probably not quite maxed out, but we are
unlikely to see a really, really strong, like crazily stronger, model. So, you remember
when we had GPT, right? And then GPT-2, which was a big step up, like an order
of magnitude better than GPT. And then GPT-3, which was much, much bigger.
And GPT-4, much, much, much bigger. And of course now we have GPT-5, but is GPT-5
that scale of, like, a step jump compared to the previous one? I think it's
debatable, right? So I think we're at the point where the base model performance
improvement is not going to be as mind-blowing as it was in the last three years. So I think
a lot of the improvements we're going to see are in the post-training phase, in the application
building phase.
And, yeah, so I think that's where we would see a lot of improvement.
I'm also very interested in multimodality.
So we've seen a lot of text-based use cases, but I think there are a lot of audio and video use cases
that are very, very exciting.
And I think audio is not quite solved yet.
I do work with a couple of voice startups,
and when you think about voice, it's an entirely different beast.
So let's say you have a chatbot, right?
When you go from a text chatbot to a voice chatbot,
the concerns are completely different.
Because now with a voice chatbot, right, we need to think about latency.
Because there are multiple steps: first voice to text for the question, then text to text, from text question
to text answer, and then text to voice for the answer.
So we have, like, multiple hops,
and latency becomes very important.
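To make the "multiple hops" point concrete, here is a minimal sketch of a latency budget for a cascaded voice pipeline (voice to text, text to text, text to voice). The hop names and every millisecond figure are invented for illustration; none of them come from the conversation or from any real system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hop:
    """One stage of a cascaded voice pipeline."""
    name: str
    latency_ms: float  # hypothetical number, for illustration only


def total_latency_ms(hops: List[Hop]) -> float:
    """Hops run one after another, so their latencies add up."""
    return sum(hop.latency_ms for hop in hops)


def slowest_hop(hops: List[Hop]) -> Hop:
    """The hop to look at first when trimming the budget."""
    return max(hops, key=lambda hop: hop.latency_ms)


# Assumed pipeline: the three hops named in the conversation,
# with made-up latencies.
CASCADED_PIPELINE = [
    Hop("voice_to_text", 300.0),
    Hop("text_to_text", 700.0),
    Hop("text_to_voice", 250.0),
]

if __name__ == "__main__":
    print("total ms:", total_latency_ms(CASCADED_PIPELINE))
    print("slowest hop:", slowest_hop(CASCADED_PIPELINE).name)
```

Because the hops are sequential, every stage's delay is felt by the user, which is one reason a text chatbot and a voice chatbot have such different concerns.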
And there's the question of what it
takes to sound natural.
So for example, people think about
AI and humans versus
when humans talk to each other.
Like, if
you try to interrupt me and say,
"Um, Chip, that's right," I would, like, pause
and try to hear you out, right?
But sometimes you just
say some word, like
an acknowledgment, like "mm-hmm, mm-hmm,"
and then I shouldn't stop, I just continue.
So the question
for interruptions, like
should I stop or not,
is a big part of what is perceived as a natural conversation.
And there are also regulations, right?
Because a lot of the time people want to build AI chatbots,
voice chatbots that sound like humans,
to try to, like, trick users into thinking they're talking to humans.
But also, right, there may be potential regulation saying, like,
okay, you have to disclose to users
whether what they're talking to is a human or an AI.
So I think that's, like, just a whole space.
I think it's not quite as solved
as you'd think, but it's also not quite an AI foundation model problem, right?
Because interruption detection is actually a classic machine learning problem.
Like, with a different framing, you can build a classifier for that.
Or the question of latency: that's actually a massive engineering challenge,
not an AI challenge.
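As a rough illustration of framing interruption detection as a classification problem, the sketch below trains a tiny nearest-centroid classifier that separates short acknowledgments ("mm-hmm") from real interruptions. The two features (utterance duration and word count), the labels, and all training points are invented toy values; a real system would likely use audio features and far more data.

```python
import math
from typing import Dict, List, Tuple

Features = Tuple[float, float]  # (duration_seconds, word_count), assumed features

# Toy, invented training data; not taken from any real dataset.
TRAINING_DATA: Dict[str, List[Features]] = {
    "backchannel": [(0.3, 1), (0.5, 2), (0.4, 1)],   # e.g. "mm-hmm", "yeah"
    "interruption": [(1.5, 5), (2.0, 7), (1.2, 4)],  # e.g. "wait, can I add something"
}


def centroid(points: List[Features]) -> Features:
    """Average each feature over all points of one class."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def train(data: Dict[str, List[Features]]) -> Dict[str, Features]:
    """'Training' here is just computing one centroid per label."""
    return {label: centroid(points) for label, points in data.items()}


def predict(model: Dict[str, Features], features: Features) -> str:
    """Return the label whose centroid is closest (Euclidean distance)."""
    return min(model, key=lambda label: math.dist(model[label], features))


if __name__ == "__main__":
    model = train(TRAINING_DATA)
    print(predict(model, (0.35, 1)))  # short, one word: keep talking
    print(predict(model, (1.8, 6)))   # longer utterance: stop and listen
```

The features are left unscaled to keep the sketch short; with real data, normalizing them (or using a standard library classifier) would be the more careful choice.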
Of course, it can be an AI challenge, because people are trying to build voice-to-voice models.
So instead of having to first transcribe the voice
into text, and then get a model
to generate a text answer, and get another model
to turn it from text to speech,
you can just go voice to voice directly.
So that is something people are working on, but it's, like, very hard.
Yeah.
So, yeah, even audio, which I think
is easier than video, right,
because with video you have both image and voice,
is already pretty hard.
So I think there are a lot of challenges in that space.
That was an awesome list of things.
Let me mirror back real quick.
So what you're predicting in the next few years,
things that will change in the way we work.
And these actually resonate with so many conversations I've had on this podcast.
So this is just kind of doubling down on where things are heading.
One is the blurring of lines between different functions instead of just like design engineering.
Everyone's going to be doing a lot of different things now.
Two is just more of work being automated with agents and all these AI tools and just in theory,
productivity going up.
Third is shifting from pre-training models to post-training,
fine-tuning and things like that because, to your point, models maybe are slowing down
in how smart they're getting. Although I'll point folks to the chat with the co-founder
of Anthropic. He made a really good point here. He's like, we're really bad at understanding
what exponentials feel like. We're in the middle of that. And also, models are being released more
often. So the difference between them, we may not notice, because they're just happening more often,
versus GPT-3 came out like a year, I don't know, after GPT-2. So maybe true, maybe not. And then the
fourth point you made is this idea of multimodal, investing in multimodal experiences.
I cannot wait for ChatGPT voice mode to get better at interruption, like exactly what you're
saying. I'm just like talking to it. And so it makes a little sound. It's like,
"Okay." And then you have to, and then it's like...
I don't have a good voice assistant at home yet. I have been testing out a bunch.
I keep hoping, oh my God, this will be the one. And then I don't know how many of
them I just had to give away because they're not that good.
I think it's coming. I hear
it's coming. Anthropic's working on something; I don't know if it's launched or not yet.
Yeah. Sorry, I want to go back to what you mentioned about, like, what your guest
from Anthropic mentioned about the performance improvement. I think there's a big change.
I think there's a difference between a model's base capability, so I'm talking about, like, the pre-trained
model, right, versus the perceived performance. So let's say,
are you familiar with the term test-time compute?
I don't think so.
Yeah.
So the idea is, okay, you have a fixed amount of compute, right?
So you're going to spend a lot of compute on pre-training the model.
Pre-training.
And then you spend some compute on, like, fine-tuning.
And the ratio of pre-training to post-training compute is, like, crazy: very different,
even a different order of magnitude.
And then you also have to spend compute on, like, doing inference.
Once you have trained and fine-tuned the model, now you want to, like, serve it to users.
So a user might type a question or a prompt, and the model has to, like, do inference, and that
requires compute.
And so people have this discussion of, like, should I spend more compute on pre-training
or fine-tuning or inference, right?
Because inference is what people call test-time compute.
So spending more compute on inference is called, like, test-time compute:
a strategy of just allocating more compute resources to doing inference,
and it actually brings better performance.
How does it do that?
Let's say you have a math question, right?
And maybe instead of just generating one answer, you generate, like, four different answers and
say, okay, whichever is the best according to some standard.
Or, like, okay, it has four answers, and then maybe three of them say 482 and
one of them says, like, 20. Okay, three of them are in agreement,
so the answer should be 482.
Right.
So you just generate a bunch of them.
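The "three of four answers agree" idea is often called majority voting, and it can be sketched in a few lines. The `generate` callable below stands in for a model call and is a hypothetical placeholder, not any particular library's API; the stub simply replays canned answers so the example runs on its own.

```python
from collections import Counter
from typing import Callable, Iterable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def answer_with_test_time_compute(
    generate: Callable[[str], str],  # hypothetical model call: question -> answer
    question: str,
    n_samples: int = 4,
) -> str:
    """Spend more inference compute: sample several answers, keep the consensus."""
    samples = [generate(question) for _ in range(n_samples)]
    return majority_vote(samples)


def make_stub_generator(canned_answers: Iterable[str]) -> Callable[[str], str]:
    """Stand-in for a real model: replays canned answers in order."""
    answers = iter(canned_answers)
    return lambda question: next(answers)


if __name__ == "__main__":
    # Mirrors the example in the conversation: three samples say 482, one says 20.
    stub = make_stub_generator(["482", "20", "482", "482"])
    print(answer_with_test_time_compute(stub, "a math question", n_samples=4))
```

Note that the model itself is untouched here: the only thing that changes is how many times it is called per question, which is the sense in which the extra compute is spent at test time.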
Or another thing is, a lot of the time with reasoning, thinking, the model just
generates more thinking tokens,
spends more time thinking before showing the final answer.
It requires more compute,
but it gives you
better performance.
So, yeah, so I think from the user's perspective, right,
when the model spends more time exploring different potential answers,
thinking longer, it can give you much better final answers.
But the base model itself does not change.
Does that make sense?
Yes, that does.
Absolutely.
Yeah.
That is a good corollary
to Ben Mann's point.
Yeah.
Chip, we covered a lot of ground.
I've gone through everything I was hoping to learn and more.
Before we get to a very exciting lightning round,
is there anything else that you wanted to share,
anything else you want to leave listeners with?
So I do work with a few companies that do this thing where
they want employees to, like, come up with ideas.
So there's a big debate on what is the better way for your strategy,
right?
Should it be top-down or bottom-up?
Right.
Should, like, executives come up with one or two killer use
cases and everyone allocates resources to that, or should you let engineers and
PMs and smart people come up with ideas? And it makes sense to have a mixture of both. So
some companies are like, okay, we hired a bunch of smart people, let's see what they
come up with. And they organize, like, hackathons or internal challenges to get people
to build products. And one thing that I noticed is that a lot of people just don't know
what to build. And it shocked me. I feel like we are in some kind of, like, idea
crisis, right? Now we have all these really cool tools to help you do everything from
scratch. They can help you design, they can help you write code, they can help you build a
website. So in theory, we should see a lot more. But at the same time, people are
somehow stuck, like they don't know what to build. And I think maybe a lot of it
has to do with, like, societal expectations. Because we have gone
into this phase of specialization, where people are very highly
specialized and are supposed to focus on doing one thing really well instead of
having a big picture. And when you don't have a big-picture view, it's hard to come up with
ideas of what to build. So when I work with these companies on
hackathons, we do work out, like, a guide on how to come up
with ideas. And usually one tip is, like, look at
the last week, right? Like, for a week, just pay attention to what you do
and what frustrates you. And when something frustrates you, think: is there
anything we can do?
Can it be done a different way
so it's not frustrating?
And then people can swap
notes across teams.
And if you see there are common frustrations,
maybe that's something you can think about,
to build something around that.
So yeah, I feel like just notice
how you work,
constantly ask the question,
how can it be better?
And then just build something
to address the frustrations.
I think it's a good way
to just learn and adopt AI.
I think people have felt exactly
what you're describing every time they open up one of these vibe coding tools,
where you could just describe anything you want.
I'm like, I don't know, what do I want?
And I love this very tactical piece of advice,
just like what frustrates you,
just pay attention to where you're frustrated.
For example, I just built a very cool little vibe-coded app.
I was working on a newsletter post inside Google Docs.
And I pasted all these images into the Google Doc from screenshots and stuff.
And then I forgot, oh yeah, you can't take images out of Google Docs.
It's like this Hotel California experience where you can paste stuff
into it, but it's very hard to get images back out. So I just went to all the vibe coding tools and just
built an app where I can give it a Google Doc URL and it lets me download all the images automatically.
And it worked amazingly well. And I made it really cute and I'll link to it in the show notes.
Oh, I'm very bullish on using AI to just create, like, micro tools.
It's just something that makes your life a bit easier.
100%. I feel like that's one of the main ways people are using these tools, just for, like,
a little niche problem they have. With that,
we've reached our very exciting lightning round.
I've got five questions for you.
Are you ready?
Yeah, always.
I don't know.
It depends on how hard the questions are.
They're very consistent across every guest.
So I imagine you've heard them before.
First question, what are two or three books that you find yourself recommending most to other people?
I'm really terrified of like book recommendations because I feel like what books or people should read really depends on what they want and where they're in life and I always want to get you.
But there are several books that I do think really changed the way I think
and see the world.
So one is The Selfish Gene.
It's, like, to understand,
it actually changed,
it actually helped me with the question of whether I want to have kids or not.
Because it's about understanding that, yeah, a lot of how we
operate is a function of our genes.
And genes want you to do one thing, which is to procreate.
So yes, in a way.
But the book also proposes another thing.
It's like, everyone wants to live forever, right?
Maybe not consciously, but subconsciously
we do want that.
And there are two ways. One is by genes:
genes just, like, want to continue forever.
But the other is ideas.
I think it's something called a meme.
It's being able to put some ideas out there
that last for a long time.
That's another way to live on.
I know it's a little bit abstract, but it was very interesting.
The other book I really, really like
is a book from
Singapore's previous leader.
I think he's seen as the
father of Singapore, I know.
Lee Kuan Yew. I'm not really sure what the title
is, but he was the one who changed
Singapore
from a third-world country to a first-world country
within 25 years. And I have never seen
any country leader who spent so much effort
putting down his thoughts on how
to build a country
like that.
Yeah, and he talks a lot about, like, public policy,
like how to create policies
that encourage people to do the right things
that are good for the nation.
And he also talks about, like,
foreign affairs, foreign policies,
like the relations of the country with others.
So it's a really good book for thinking about,
for me, system thinking.
But it's a different kind of system,
which is a country,
which a lot of us don't get a chance
to ever experiment with in our life.
So it's good to learn about that.
What was the name of that second book?
It's called From Third World to First.
I think we have it somewhere here.
Yeah.
There it is.
Show and tell.
That's awesome.
I definitely want to read that.
That's a really good tip.
I've heard a lot about just the impact he's had,
and I've seen all these videos on Twitter
of just his really wise insights into how to build a thriving society.
And clearly it worked.
How does he have time to write such a thick book?
It's like insane.
That is.
Claude, please summarize.
I'm just joking.
By the way, The Selfish Gene,
I also absolutely love that book. That is such a good choice. It's such an under-the-radar kind of
book that really changed the way I see the world as well. So really good pick. Okay, next question.
Do you have a favorite recent movie or TV show that you really enjoyed?
So I watch a lot of movies and
TV shows as research, because I'm working on my first novel, and I recently sold it. So I'm
interested in, like, what makes a story work. It's a drama. It's not science fiction or anything that
tech people usually read. So I know it's very
out of left field. So it's almost like I'm watching TV to see what
kind of stories become popular, trying to understand the tropes and stuff like that. So I'm not sure
if the audience would like it.
Well, what's one that taught you something about writing?
I think, like, Yanxi Palace. It's a Chinese TV show.
Cool. Okay, haven't heard that one on the podcast before.
Okay, cool. Next question: do you have a life motto
that you often think about and
come back to when you're dealing with something hard,
whether it's in work or in life?
This sounds very nihilistic,
but I'd say it's like, in the end,
nothing really matters.
I usually think that in the grand scale
of things, like in a billion years,
no one will be there.
I think, okay, someone will argue with me about that.
So I tend to think, like,
so my theory is, in a billion years,
none of us will exist.
So whatever messy things,
crazy things we do, or however badly
we do them, no one will be there to remember it. And I think in a way,
it sounds scary, but it's very liberating, because it allows me to say, okay, let's just try
things out, right? Like, why does it matter? And then there's a story. Recently we had
a family member who passed away. And I was talking to my dad because I couldn't be
home for that. I was asking my dad, like, okay, is there anything I can do to give that person
some, like, comfort,
is there anything I can get for that person?
And my dad was just like,
what can he possibly want
at this moment?
It made me realize that at the end of life
there's nothing
material that can bring you joy:
no money, no product,
nothing. And in a way it made me
think, like, what do I really care about
at the end of the day?
So I guess when I think about it,
it's like, okay, maybe I fail, maybe I don't get that contract,
maybe I did things badly in the past. At the end of life,
I don't think that actually really matters.
So in a way, it's kind of liberating.
I know you said it might be nihilistic.
This is what Steve Jobs shared too,
in one of his most famous speeches: just,
we will all die someday, so don't take things so seriously.
And it is freeing, absolutely.
It just makes you appreciate every moment, every day you have,
just like, yeah, let's just do something hard and scary.
Okay, final question.
You talked about how you're writing a novel.
Most people in tech have never written something creative
in fiction. What's just, like, one thing you learned in the process about how to write better
stories, better fiction?
A lot of the time when we read, we get tripped up by some small things.
So I think I wanted to do creative writing because I just want to become a better writer.
And I told myself, maybe trying a different audience could help me become better
at anticipating what this different type of audience would want to hear and what they care
about. So it's a way for me to get better. So I think writing, or even any
kind of content creation, is about predicting the user's reactions, right?
The next token.
Just kidding.
Yeah.
So, like, when you do a podcast, it's like, okay, what kind of things would the users find engaging, right?
And I find it's a little bit like how at a lot of companies, when you launch a product,
you have a narrative coming out.
So, okay, how do we position this product in a way that users want, right?
So I have done technical writing for a while, and I felt like I have some experience
trying to predict what engineers would want to hear
or care about.
But I don't have experience with this completely different type of audience.
So that's why I wanted to do, like, creative writing, writing a story.
And that's why I was doing a lot of research.
I mean, I call it research, but I actually enjoy it a lot,
like, watching a lot of dramas,
just to see what people like.
So one thing that I learned to care about
a lot is, like, the emotional journey.
That was from an editor, right?
So when we write something, we care about
how users would feel across the story.
Like, we want something in the beginning, right?
We need to have a hook so that people continue reading.
But we also don't want too much drama, because people will get too tired, right?
Because it's emotionally exhausting, because it's like you're being
emotionally manipulated a lot of the time.
So we care about the emotional journey.
Maybe we have some climax and then something more chill.
And another thing that I didn't
realize: for me, in technical writing, you entirely focus on the content, like the argument.
It's very impersonal, right? For example, for people learning about compilers, it doesn't
matter if they like the person telling them about compilers or not, right? Because it's just
objective. But for a novel, people care about, like, character likability. So
in the first version of my story, I made the character
very logical, very rational, someone who just does everything very rationally.
And then for feedback, I had a very good friend read it.
He's an amazing person.
He's a great person.
And he was like, Chip, I'll be honest,
I hate that person.
So it doesn't matter what the story is.
It's just that the person is so unlikable.
He does a bunch of crudity.
So in the second version, I made the character more likable.
And how you make a character more likable is that you put in some vulnerability.
Like, maybe the person has a setback sometimes, because that's something we can relate to.
So in a lot of ways, it's very interesting.
A lot of it is about understanding the emotional
bit, like how the users feel, not just about the story, but also about the characters.
That is so interesting.
Wow.
I learned a lot more there than I thought.
That was awesome.
Really good example.
Chip, two final questions.
Where can folks find you online if they want to reach out and maybe work with you or maybe
even just share the stuff that you offer if folks want to reach out?
And then how can listeners be useful to you?
I'm on social media: LinkedIn, Twitter.
I don't post a lot, but I keep telling myself that I should do more
because I kind of like the conversation with readers.
So I'm actually about to start a Substack.
I have, like, a placeholder for a Substack right now,
and I'm thinking of doing it more on system thinking,
because I think it's a very interesting skill.
And I'm also thinking of doing a YouTube channel on book reviews,
so basically books that help you think better.
So I think the first book I'm going to review
is more like this book because it's like my favorite book growing up
and I keep on rereading it.
So yeah, how can you be helpful?
Like, send me books that you like,
books that have changed the way you think
or changed the way you do anything.
So I would appreciate it.
Amazing. I'm excited to read that book.
Chip, thank you so much for being here.
Thank you so much, Lenny, for having me.
Bye, everyone.
Thank you so much for listening.
If you found this valuable, you can subscribe to the show on Apple Podcasts, Spotify, or your favorite podcast app.
Also, please consider giving us a rating or leaving a review, as that really helps other listeners find the podcast.
You can find all past episodes or learn more about the show at lennyspodcast.com.
See you in the next episode.
