Programming Throwdown - 187: Agentic Coding
Episode Date: May 2, 2026. In this episode, Patrick and Jason cover Agentic Coding!
Transcript
Programming Throwdown, Episode 187,
Agentic Coding.
Take it away, Jason.
Hey, everybody.
Okay, so quick intro topic here.
I was talking to a lady.
I won't use her name, but let's call her Jane.
I was talking to Jane, a friend of mine.
And she has been at her company,
which is kind of like a hardware company.
So it's not really Jane's specialty.
It's a hardware company that needs,
some software, right?
And, you know, Jane's unhappy there.
And there's all sorts of reasons why Jane's unhappy.
It's understandable.
And she's at nine months.
And so they have a one-year cliff.
And so she has to kind of tough it out for three more months.
And it just made me think about vesting schedules.
Because I think that companies, you know, haven't really thought this through.
There's like a cargo cult mentality where, you know, oh, this company did, you know, this type of vesting schedule, so we're just going to copy them.
And I think some bad behavior has been copied.
And when you start breaking it down, it never really makes sense.
And so I'll start by criticizing the one-year cliff.
You want to explain what it is, though?
Not everybody knows what it is.
Oh, yeah.
Okay.
So, okay.
Let's, like, wind it back.
If you're in, let's say, sales, you get a commission, right?
So you build a portfolio of companies that want, you know, some product and you get a commission.
And if you're doing some service-based sales, then you could potentially get a commission every single year.
You can have a customer that is going to use your service for the next 20 years.
But because you were the salesperson that, like, got the deal, you will just keep getting commission for 20 years, potentially.
That's how sales works, right? And so for engineering and other fields, you know, they want to have an incentive structure. But it doesn't make sense for us to do commission because we're not that close to the customer. So what they'll do in our case is they will give equity. So if it's a private company, or, you know, not a public company, they might give you a percentage of the company. You might get options,
you might get RSUs or shares.
If it's a public company, you're going to get shares.
And the way this becomes an incentive is they price the shares based on your start date.
So for example, if a company's stock is trading for $80 a share, they're going to tell you,
okay, for the next four years, you're going to get this many shares,
you know, every month or every three months or there's some schedule, right?
But let's say the price jumps up to $160 a share.
Let's say the price doubles.
You're still getting that number of shares.
And so you can end up in a position where if the company does well,
even if you, you know, don't get more equity later on,
just that one grant could double in how much you're getting every month
just because the company did well.
And so it aligns your incentives with the company's incentives.
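The grant math described above can be sketched in a few lines of Python. The 1,000-share grant size is a made-up figure for illustration; the $80 and $160 prices come from the example in the discussion.

```python
# Sketch of the grant example above: the share count is fixed at grant
# time, so the dollar value of each vest tracks the stock price.

grant_shares = 1000
months = 48                      # four years, vesting monthly
monthly_shares = grant_shares / months

value_at_grant = monthly_shares * 80     # stock at $80/share
value_if_doubled = monthly_shares * 160  # stock doubles to $160/share

# The monthly vest is worth twice as much, with no new grant needed.
```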
Okay, so that's equity.
Now, you know, they're not going to give you four years of equity right on day one, right?
Because you would just leave, right?
Like any rational person, if they said, here's four years of, you know, up front money,
and there's no strings attached, well, then, yeah, you would just go to every single company
collecting four years and work for a month.
or something, right? So obviously, so they're going to space it out. Makes a ton of sense. Now,
where I think a lot of companies go wrong is they have this one-year cliff, which means that for the
first 12 months, you won't get any equity. And then after 12 months, they will give you the past
12 months worth of equity right then and there. And then you'll start accruing, you know,
monthly or quarterly or something like that. I think this came about, I mean, I haven't
done the sort of historical work on this, but I think this came about back when, you know, a lot of
these companies were very small. And so each person, you know, had to be, you know, a line item on
your capital table. So when you're, when you're, you know, a small company, you have what's called a
cap table. And it lists out, you know, who are your investors, basically all the people who own parts
of your company. And so, you know, I guess companies wanted to sort of try people out without
polluting their cap table with all these names of all these people who, you know, after a couple of
months, it wasn't a good fit. That might be the inspiration. Maybe you could argue that it, you know,
saves a bit of money because people who leave before a year, you know, you don't have to pay them
as much, right? So that's the idea behind it. In practice, it's kind of like, you know, that joke
about communism,
that it works in theory
but it's never been applied right.
Do you have the saying,
have you heard the saying?
It's like something, yeah, yeah.
Yeah, there's some saying.
Yeah, we're not good at politics here,
but there's some kind of saying where it's like,
oh, communism works, that just wasn't true communism.
So it's like all these things
where, like, the cliff never
actually is implemented this way.
At every single company I've worked at,
you know, there have been people
who, for one reason or another, had to go within a year.
Every single time we prorated their equity.
I have never, in my entire career, seen a time where somebody, let's say, worked six months.
After six months, we decided they weren't a good fit or there was a reduction in force or
what have you, and we didn't give that person the six months of equity.
So it's kind of like a stick, but it's never really used.
And then on the flip side, you have cases like Jane here, where, you know, she has to kind of stick around for three more months. So that's my kind of rant. On that, I'll close. What's your take on one-year cliffs?
I mean, I guess I've worked a lot, and I've never seen that, like, oh, in the first year you're going to somehow learn something about the person. I mean, I've never seen people just be like, oh, this isn't working, let's
let them go before the first year as a, that way we don't have to pay them, which I guess
would be the fear as an employee. But I've just in general never seen someone let go in under a
year, because you kind of give them a period of ramp-up, whatever, and by the time you kind
of get through that and you're like, oh, they're really not sticking around, the process to sort of
let them go at most big companies is long enough that they'll get past a year easily. So if they
decide to stay, they'll make it. And so that's a good point, hearing your prorating
story. It's just, I think, yeah, like you said, I feel like it's just a stamped-out template, and it's
probably been, you know, legally tested enough that they're hesitant to, you know, do something
different. Also, by having the longer, you know, four-year vesting schedule, rather
than just giving something for one year and then regiving it each year over
the next 12 months, I think the hope is that the stock will grow. It's not like
a golden parachute, but what people sometimes refer to as handcuffs, or golden handcuffs.
The hope is the price goes up,
and therefore you don't want to leave
because you wouldn't get an equivalent pay package somewhere else.
But I will say just very recently,
a lot of software as a service companies
had a really huge stock decline as like AI tools,
like we're going to talk about today, are rolling out
because people are nervous about future earnings.
And so there's a lot of employees who are
looking to basically bail or leave, because a huge portion of their compensation,
I mean, at a lot of places it can exceed 50%, it's easily 30%, like it's a very big portion
of your salary.
And people at these public companies just treated it as cash.
So the fact that it's not, I don't want to get into the finance game
theory of it, maybe for another time, but even the fact that they're issued as RSUs
is really an accounting trick to basically say
that Wall Street agrees to treat that cost differently than a salary cost.
And so therefore, it looks a little bit better to have RSUs as a separate line item.
And if you got rid of all those things, it's just another sort of quirky way of paying people.
Most companies already have a stock purchase program, right, where you can either, you know,
encouraged or get a discount to buy stock with a portion of your salary.
So there's various ways they already get sort of joint ownership with the company.
Yeah, I really think it's sort of time for a revamp, and it's possible it'll happen with these disruptions. If people are really worried about what's going to happen, then shifting some compensation from, you know, stock-based to cash-based for public companies makes sense. That's the point of being public, right? These things are marked to market. That's much harder for a startup that either doesn't have the cash or doesn't have a reliable market value.
Yeah.
Yeah, totally right.
And then the other thing, which, like, kind of got popular in a certain niche of companies,
is this, like, 10, 20, 30, 40 vesting schedule.
So this, I've never worked at a place that has this.
Amazon is definitely the most famous.
I think Snapchat had this for a while, although I think they abandoned it.
But the idea was that you would get 10.
So if your grant is, let's say, 100 shares to make it simple, you get 10 of those shares your first year.
You get 20 the second year, 30 the next year, and 40 the final year.
And so it would kind of backload the equity.
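The backloading above is easy to see worked out in code. This is just the arithmetic from the 100-share example in the discussion, comparing an even four-year schedule against 10/20/30/40:

```python
# Cumulative vesting under an even four-year schedule versus the
# 10/20/30/40 backloaded schedule, using the 100-share example grant.

def cumulative_vested(schedule_pcts, grant=100):
    """Shares vested by the end of each year, given per-year percentages."""
    vested, out = 0, []
    for pct in schedule_pcts:
        vested += grant * pct // 100
        out.append(vested)
    return out

even = cumulative_vested([25, 25, 25, 25])        # [25, 50, 75, 100]
backloaded = cumulative_vested([10, 20, 30, 40])  # [10, 30, 60, 100]

# Halfway through the grant, the even schedule has delivered 50 shares
# but the backloaded one only 30, so leaving early costs you more.
```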
I never understood that.
I always felt like shouldn't it be the opposite?
Because the first year, you know, you don't have any refreshers or anything.
And so it never really made sense to me.
That's one that just boggles me.
I really don't know why they do that.
I mean, people have said cynically, oh, it's because I think the average tenure at Amazon is very low.
So the average tenure, okay, kind of side rant here, but, like, people say the average tenure at FAANG is four years.
But it's actually much longer than that.
That's only if you measure all the people who have left, right?
you can't really do it that way.
You have to assign a tenure to the people who haven't left yet.
If you give them a tenure of infinity,
well, now the overall average tenure is also infinity, right?
If you give it zero, well, then if you don't count it,
well, then you get four years.
But neither of those are really appropriate.
So the tenure is pretty long, you know.
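The averaging problem described above is what statisticians call censored data: people who haven't left yet have no final tenure. A toy calculation shows the two bad extremes from the discussion; all the figures here are made up for illustration.

```python
# Averaging only the people who already left ignores everyone still
# employed ("censored" observations). Numbers are invented.
import math

left = [2, 3, 4, 5, 6]        # final tenures (years) of people who left
still_here = [1, 4, 7, 10]    # tenure so far of people who haven't left

leavers_only = sum(left) / len(left)   # 4.0 -- the often-quoted figure

# Counting the stayers as infinite tenure makes the mean infinite:
with_infinity = math.inf

# Counting their current tenure as if it were final is still only a
# lower bound, since they'll accrue more tenure before leaving:
lower_bound = (sum(left) + sum(still_here)) / (len(left) + len(still_here))
```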
At Amazon, people said it was, I think, 18 months or something.
Again, it's probably longer than that,
but it is definitely a lot less than at the other companies.
But I feel like that's probably not the reason they did the 10, 20, 30, 40,
that they plan on getting rid of people after two years.
But I think it's kind of what you said, Patrick, that, you know, they just came up with
something and then it's just too hard to change.
I do have a separate thing.
We should move on.
But it's like reverse survivorship bias.
So there's this thing where people first start thinking about, you know, writing code
to back test stock trading.
So they take the S&P 500,
you know, top 500 companies
and you look back in time.
But what you are missing out by doing that
is like the stock index changes over time.
So if you take companies that exist today,
you're not looking at all the companies that have failed.
So if you want to go 10 years ago and play forward,
you need to start with the set of companies
that were in existence 10 years ago,
not the set of companies in existence today.
Because when you go backwards in time,
you're going to lose companies that haven't
yet started, but that's fine. But you're also going to lose companies that failed, which is
hugely impactful. And so they call that survivorship bias. It's the same thing as the planes that
returned from World War II: analyzing where they had bullet holes misses the fact that, like,
that's not where you should patch up the bullet holes. There's like a famous meme about this,
right? Because the ones that went down, you never see those, right? But these come up, like
you were sort of mentioning, in tenure computation, which I didn't actually realize. So unless your
company is, like, super old, to where people retired and everything, you don't, like, sort of reach
steady state. And then you can't account for sort of the flux. The other one I saw is
someone was talking about, you know, we've talked previously about starting to run and learning to
run or whatever, you know, maybe not when we were younger. And people are talking about this
standard of qualifying for the Boston Marathon where you have to run pretty fast in a marathon.
And they were talking about the number of years it takes to, like train before you qualify for
the Boston Marathon. But they could only take people who had qualified for the Boston Marathon
and ask how long they trained for. But like me, I've run for a few years. One, I'm not anywhere
close to qualifying. And even if it was my goal to qualify, I don't know that I could. It just takes a
lot of work to get there. And I'm not particularly naturally talented for it. So I would never
show up in the numbers, even if I had it as a goal and trained for 10 years. So what does it mean to
say on average people took two years? It's, on average, people who
succeeded, succeeded in two years, right?
Like, what about all the people who didn't?
You have to put that number in?
Or else, unless you know that's going to happen to you,
you don't know if that's a valid statistic.
Yeah, we call this a type 2 error, or a false negative.
Oh, it has a name.
Good.
Yes, it's all the things that, you know,
that went wrong that you never saw.
So all the people who didn't qualify for the Boston Marathon
because they sprained their ankle and you just never saw it
because they never applied.
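The backtest pitfall described above can be shown with a toy simulation: measure average returns over only the companies that survived to the end of the window, versus the full set that existed at the start. All the numbers here are simulated, not real market data.

```python
# Toy survivorship-bias demo: the survivor-only average return is
# systematically higher, because it drops exactly the failures.
import random

random.seed(0)

returns = []
for _ in range(500):
    if random.random() < 0.3:          # ~30% of companies fail outright
        returns.append(-1.0)           # a failure is a total loss
    else:
        returns.append(random.uniform(-0.2, 0.8))

survivors = [r for r in returns if r > -1.0]

avg_all = sum(returns) / len(returns)
avg_survivors = sum(survivors) / len(survivors)

# avg_survivors overstates avg_all -- the survivor-only view is the
# equivalent of backtesting against today's index members.
```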
Yep. All right.
All right. So actually, well, we should at least just put a vote on this. I mean,
oh, sorry, you go for it. Amazon, well, no, I mean, okay, the question is, should you blacklist these companies, or should you just not apply?
Well, okay, almost every company has a cliff. You're not getting around that. Maybe they'll all listen to us and they'll get rid of their cliffs. I know OpenAI got rid of their cliff, their one-year cliff, about a year ago.
But okay, so we're not going to get away from that. But like, the 10, 20, 30, 40, should people still go to those companies?
I guess maybe it's just not the most important thing, you know. That's what I was
going to say. If you're comparing two companies, I think the incentives are poor for fit and changeability
with the backloaded things like 10, 20, 30, 40, so it would be a notch against. But I don't
think it's a, if that's the way to break into the industry, to get into a tech company, if that's
the people that are going to, you know, pay you what you feel you're worth, then, like, I don't think
it's a reason to not go there. Yeah, I think that's fair. Unless you think there's foul play,
unless you have evidence that they're purposely churning people out within a year or two.
Yeah, I think we're in the same boat. I mean, I would say maybe a little stronger, you know,
for me, it's definitely a yellow flag. Uh, companies that do 10, 20, 30, 40, um,
It is definitely a concern.
But again, if you have a good boss, a good team, it seems like the work you're interested in, then it could easily overcome it.
All right.
On to news.
Patrick, what's your news?
All right.
So my news, probably most people, well, actually, I don't know.
I'm a little curious.
I don't feel like it got as much coverage as it should.
But it is now past, which is the Artemis II mission, which was the United States, NASA, trying to send people,
not in orbit, but in a loop around the moon.
So they were influenced by the moon's gravity to sort of slingshot around the
backside of the moon and back to Earth.
And this happened as of this recording, like last week.
And it was a, you know, very exciting thing.
But it was a little, I don't know, underappreciated initially.
I think once it was happening, the news coverage kind of picked up.
But a lot of people just like weren't talking about it.
It didn't have a hype.
If you ever, you know, talk to someone who was around for the original moon landings,
which are now, you know, whatever, 70 years ago, it was everything.
It was a major deal.
It was, like, absolutely, you know, wall-to-wall coverage.
And I feel this was just, you know, that's cool.
And people just kind of moved on from it.
And I've been trying to come up with a theory as to, you know, sort of why.
And there's ones, you know, just the world is a little bit of a different place.
Politically, our country here in the United States is, you know,
in a, what do you want to say, a bit of turmoil, I guess, internal stuff.
So it's hard to get, you know, everybody to rally on any cause.
But then I think that as well, I feel with the amount of launches for things like Starlink and the International Space Station,
I think it's a bit just of, it feels like, yeah, okay, we do this all the time.
And I think people don't understand the sort of energy difference needed to go to the moon versus go to
low Earth orbit.
And so anyways, I just want to give a shout out for, you know, the work of getting back to
the moon.
And, you know, it's a little controversial whether or not a moon base will be set up.
But definitely, you know, as far as like technology demonstration and the ability to push
the envelope, as well as potentially unlock lots of new resources for manufacturing,
it's a definitely exciting time.
And, you know, it's going to be a little bit of a gap before the next Artemis, Artemis 3.
but, you know, Artemis II was an exciting watch
if you got the opportunity to tune in during it.
If not, plenty of YouTube retrospectives.
Go back and check out all that happened.
So they, so this already, I don't follow space.
And for some reason, space doesn't follow me either.
Like, it doesn't recommend me anything about space.
But so the whole project started and ended,
the astronauts just slingshotted around the moon,
came back, and everyone's safe?
Yes.
Yeah, I mean, I feel like that should, even for us, you know, non-space enthusiasts, like, that should have showed up. Oh, man, I'm not going to go on a huge tangent on how people get their news. I get my news from swiping left on my Android phone, and I just get the Google News that kind of everybody gets. So, like, really, that could have been a good place for it.
Yeah, maybe it's more of a commentary
on people's individual funnels, or however you want to call the individual filters that we all
have into our news systems now, where they're all biased. But I guess, not that people
below a certain age just tune into the news, but since they launched, I mean, I live in
Florida, so they launched from Florida. So we got some local news coverage of it. So we do sometimes
watch the local news just to hear, you know, local events. But yeah, in general, national news,
it seemed a little undercovered in my opinion.
Yeah.
Yeah, totally.
I fully agree.
They're back successful, yeah.
So four astronauts, three from the United States and one from Canada,
went into orbit, a sort of half orbit around the moon.
They went, again, like sort of slingshotted around the backside.
So it took a bunch of pictures.
They didn't get super close.
They got close enough that it was very big in their field of view,
but they could kind of see the whole thing from side to side,
not super low.
And so by virtue of being so far away from the moon
when they went around the backside,
they actually had a record for farthest travel
for a human away from Earth.
Oh, wow.
And so they went further than any human has gone before,
if we want to start channeling Star Trek.
That's cool.
You know, but to your point,
when the SpaceX people caught the rocket,
you know, when it landed in the chopsticks,
or whatever you all call that thing.
Okay.
Yeah, that I saw it.
Like, that was all over my Google news.
So I do think that this wasn't able to get the kind of attention it deserved, for whatever reason.
Agreed.
But a shout out.
Either way, you know, I think it's something exciting.
And, you know, I'm here for it. There's a lot of people, young people, who
have their imagination captured by these sort of, like, big technological feats, more so than,
increasingly, like, yeah, another smaller, you know, phone, or, you know, a better chat
bot, right? I mean, people are going to start coming of age who just grew up with that stuff.
And so I'm hopeful that some of these things, you know, going deep under
the ocean, going into outer space, there's always things to just feel a little special about.
Yeah. Yeah, that makes sense. Cool. All right. My news story, first news story is the Gemma 4 release.
So Gemma is a family of vision language models, or multimodal models, that are totally
open source, open weight, and are small enough that you can even run them on your phone.
You can definitely run them on your desktop or laptop.
And I actually did something kind of cool with Gemma 4 already.
I got it to, I fine-tuned a very small Gemma 4 model to try to correct grammar.
Okay.
And it actually works really, really well.
I tried this in the past with really small models,
and I never got good behavior.
So my vision was to build something into my phone
where no matter what app I'm on,
it would just maybe it would be a custom keyboard.
I don't know, but it would just correct my grammar.
So it would look at like the entire content of what I was saying,
and it would go through and fix grammatical things.
And the models were never very good,
and they were hard to fine tune,
et cetera, et cetera.
But Gemma 4 actually kind of crossed that barrier
where I ran through a fine tuning on my desktop.
Okay, real quick thing on fine tuning.
So, you know, fine tuning basically just means continuing training.
And so even though it's a small model,
it's small because of all these tricks,
but the tricks don't work at training time.
So think of it as, like,
you bake a cookie and then you put icing on it.
But if you put the icing first and then bake it,
or if you try to re-bake a cookie with icing on it,
you know, the icing gets all hard and it gets kind of, you know,
not very good.
So it's kind of two different processes.
And so what you need is to, it might be that to bake the cookie or in this case
to tune the model, you need like a pretty beefy setup.
And then you can sort of, you know, shrink it later.
So I got my desktop together.
I fine-tune this model.
Oh, yeah.
So there's a whole bunch of tricks now,
low-rank adaptation,
all these other tricks where you can fine-tune it
even if you don't have a really beefy desktop.
Mine's okay.
It's got 16 gigs of VRAM,
which is pretty good,
but not enough to just do a pure fine-tuning.
Long story short,
the tooling is amazing.
It's come so far.
you don't have to be, like, a super well-read expert to do these things now.
I literally just handed it a text file that I got off the web, a CSV that had a bunch of grammatical
mistakes and then the corrected sentences.
And then it went off, and the validation accuracy went from, like, 50% to 80%.
So the fine-tuning definitely did something.
and it might actually work,
so I haven't finished it yet,
but hopefully I'll have something
that just corrects grammar for every app.
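The low-rank adaptation trick mentioned above is what makes this kind of fine-tuning fit on a 16 GB card. A back-of-the-envelope sketch: instead of updating a full weight matrix, you freeze it and train two small factors whose product is added to it. The dimensions below are made up and are not the actual Gemma configuration.

```python
# LoRA sketch: freeze W (d_out x d_in), train B (d_out x r) and
# A (r x d_in), and use W + B @ A in the forward pass. With small r,
# the trainable parameter count collapses.
import numpy as np

d_out, d_in, r = 4096, 4096, 8

W = np.random.randn(d_out, d_in) * 0.02  # frozen pretrained weight (stand-in)
A = np.random.randn(r, d_in) * 0.01      # trainable
B = np.zeros((d_out, r))                 # trainable; starts at zero so the
                                         # adapter is a no-op before training

full_params = d_out * d_in           # 16,777,216 values for a full update
lora_params = d_out * r + r * d_in   # 65,536 -- well under 1% of that

x = np.random.randn(d_in)
y_adapted = (W + B @ A) @ x          # forward pass with the adapter
# With B at zero, the adapted output equals the frozen model's output.
```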
That would be pretty neat.
Oh, that's awesome.
I was going to ask how you got good training data
because I would give it poor training data
if I just gave it my own conversations.
Oh, yeah, I know.
It doesn't seem correct.
I just went online and said,
hey, I need a giant data set
full of grammar errors.
I found one.
It's called the, I think, the C4-200M
dataset.
But if you just go online, look up
a grammar error correction data set,
you'll find it. It's the first result. And it's got
way more grammar corrections than I'll ever use in my entire life.
It's like 200 million of them or something.
Oh, wow.
Yeah, I mean, I would love to try the fine-tuning.
Unfortunately, I've always, we've talked about this before.
I've always lagged in playing video games. I always play older video games.
So I never had a good GPU. Now I kind of
want a better GPU than the one I have, for both playing video games and for playing with, like, model training.
And it's extortion.
I like I could buy one.
But then I see articles saying, oh, this is a, you know, $500.
And I go on eBay and it's like $800.
I'm like, no, I'm not, I'm not doing that.
Like, I'm sorry.
It's like, it's like a principled thing.
Like, I just, I can't.
Yeah.
You know, if I had to start over, I would, uh, well, I use my GPU for games,
too. But if I didn't have the GPU, I would probably just use an EC2 instance.
Yeah, that's what I need to do. It's just one more step, one more, you know, unbounded cost.
Rather than, I spend this money and I do whatever I want, it's like, okay, each
time I do this, if I make a mistake, you pay a little more psychologically. But you're
absolutely right. I mean, it's way cheaper to do that stuff often with rented, and probably better, hardware
than you would have at home. So, yeah, that's something I definitely would love to get into.
And I think there's something to be said for the approach you're taking. Like, small models in general run a lot faster, run in more places.
So even if you have the ability to run a big model or a subscription, there's value in having, like, a small model that you can use for specific tasks.
And we're going to talk about, you know, potentially in the future, more agentic sort of work, having things offloaded to scripts or to small models to do.
And just the speed, the tokens per second that you get is just so much faster.
I think you're going to see more of that blended stuff.
Yeah, totally agree.
Turning the corner to yet another one,
the news is all over the place this time.
It's okay.
It's a small one, you've got to click the link in the show notes or just search it.
Hatris.
I hate this game so bad.
I guess that's why it's named Hatris.
First of all, I watch the,
I don't know if it's Summoning Salt,
whoever does the, you know, Tetris,
like, infinity scores and the broken
levels and speedrunners.
Why I love these?
I watch them.
I'm like,
I'm going to play some Tetris.
I suck.
Like, oh, this is the worst.
It's so bad.
Like, I'm like, I know, you know,
Tetris strategies.
Anyways, so I've been playing,
I think it's a Game Boy Advance homebrew game,
Apotris, on a little Retroid handheld that I have.
So if you're interested,
there's like sort of a modern interpretation.
So it has like some niceties.
And a lot of customization.
So I've been playing that a little bit, but still suck at it.
So Hatris, though, instead of, okay, for people who don't know what Tetris is.
There's all the different shaped pieces.
There's various ways of selecting them.
So sometimes it can be literally just whatever the next pieces is random.
Sometimes it's selected from what's called like a draw bag.
So they'll put all the pieces in a bag metaphorically and then draw one out at a time.
So, kind of, the longest you have to wait for a certain piece is bounded.
That's what more modern interpretations do.
Hatris instead says,
I'm going to run a little like, you know,
programmed, heuristic, machine learning thing, whatever, the computer.
And it's going to pick the worst next piece.
So you start off getting one of the,
I don't know what they're called, like the Z pieces.
And you kind of just keep getting Z pieces,
which are kind of hard to put together.
Yeah.
But then, unless you force it into a position
where there's two ways for you to score a line,
it's just going to give you whatever piece
won't arrange to fit in the notch that you have left.
So you have to, like, carefully plan.
And to make it worse,
it lets you go as slow as you want.
Right?
So you feel like this has got to be easy.
But I literally, it's like getting one point
and then I'm like flipping the table, excited,
you know, the whole thing going off.
It's amazing.
But yeah, zero is where I started.
And maybe I'm just bad.
So you can be like, oh man,
that Patrick guy, like, it clearly just sucks.
Like, I'm way good at this.
Go play it.
Maybe you'll score a lot of points, but it is enormously frustrating.
I can only play it, like, two or three times.
And then I'm like, not, I'm done.
I can't.
I can't.
I've got to try it.
Hatris.
If you get more than zero, don't email me because I don't want to know.
Oh, man.
This is so good.
Yeah, I'll have to try a report back.
All right.
All right.
He's going to be doing it while
I go on a little monologue and give him time to play.
At the end of our episode, it's like, oh, I got 10.
No, dude, yeah, yeah, I'll hang up.
Oh, man.
All right, my second news is Dripwarts, School of Drip.
So this is a, I don't have to click the YouTube link.
I already know what it is.
Do you really?
Yes.
Okay, so, so I play board games every Monday with a group of guys.
a very fun group.
And one of those showed me this.
This has got to be the most viral AI video that that's been created.
I mean, I don't think anything has even come close to this in terms of virality.
I disagree.
I think there's been viral videos that aren't clearly AI that are AI.
This is the most overtly AI thing to go viral.
Like, it's clear that it's not bad AI, but it's clear that it's AI,
like it's not real.
I think there are ones that have been purported to be real that were just viral.
And like, oh, okay, yeah, that's true.
Yeah.
I don't have evidence to be clear.
So, yeah, what I mean specifically is like, this is one where the whole point is it's
AI, no producer would ever make this.
And we all know that.
And so, and it's, it's just hilarious.
Amazing.
The premise is, you know, Harry Potter, but if it's,
kind of like gangster style,
but also like,
you know,
high fashion,
you know,
all mixed together.
Very,
very funny.
I just burst out laughing almost immediately.
This is great.
I mean,
I feel like there,
I'm actually surprised that OpenAI killed Sora,
because I do think that there is a play there.
You know,
there's,
you know,
really like opening it up and letting the whole world think about what are funny things
that we could mash together and sharing that, that seems like super powerful.
I mean, maybe that's what TikTok is going to become.
Is that.
Okay, but, I don't know how much conspiracy theory, tinfoil hat, we want to go about why.
Yeah, go. Go all.
Okay.
No, no, I also thought the same thing, because Sora was like trying to do a thing.
like my kids kind of knew what it was,
which is always like a pretty big thing for
tech. Yeah. So we
don't have it in the news of the episode now
because I think it's still in flight, but
Anthropic has been teasing
this mythos, right? Their next
sort of LLM version
after Opus 4-6, I guess.
And it's supposedly
earth-shattering, which to be fair,
they all acclaimed to be before they come out.
I'm making no statement there.
But the rumor is that
it was sort of like 10x, more
training than the last one, but that there's some sort of, do you know, like, the thing that happens
with grokking? So when a machine learning model trains and it hits some sort of, like, asymptote, and it seems
like it's just sort of stabilized in performance. But actually under the hood, it's sort of like
self-organizing. And then it's able to like sort of reach a new level after it sort of gets through
this barrier of not actually getting better loss. Right. So your loss, your validations aren't
getting better, but the model is sort of actually organizing itself to the point where then it's
sort of like unlocks additional capacity and then trains further. That's Patrick's non-mathematical explanation.
Yeah, it's like defragmenting your hard drive kind of, and then it can run fast enough for you
to do something else. Right. And so supposedly the rumors go that Anthropic's Mythos took like,
whatever, 10 or 100 X what the last one did, which was already insanity, but that it was a
huge unlock, right? So there's all these scaling rules. And so it would break the sort of projected
sort of regressions fit to the sort of like how much training versus performance you get.
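For listeners who haven't seen grokking plots before, here's a toy sketch of the shape Patrick is describing. Everything below is synthetic, with made-up numbers purely for illustration; no real training run is this clean:

```python
# Toy "grokking"-shaped validation-loss curve: fast early improvement,
# a long plateau where loss looks stuck, then a delayed drop once the
# model's internals reorganize. All constants here are invented.
def toy_val_loss(step, plateau_end=500):
    if step < 50:
        return 3.0 - 0.04 * step                 # rapid initial improvement
    if step < plateau_end:
        return 1.0                               # long apparent stall
    return max(0.1, 1.0 - 0.005 * (step - plateau_end))  # delayed "unlock"

curve = [toy_val_loss(s) for s in range(1000)]
```

The point of the sketch is just the shape: if you stopped training anywhere on the plateau, you'd conclude the model had converged.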
And supposedly, if that's true, right, then OpenAI needed to free up the Sora services
so that they could use the additional hardware to get the training budget they needed to basically
not get leapfrogged. Wow. Yeah, I mean, that would be remarkable, if it's true.
I mean, to be fair, it's like a rumor about a rumor. It could have just been that it's not making money, but it feels like there had to have been some reason to sort of like deprecate it and not just wait for the new version to be better.
Yeah, I mean, I well, okay, I'll tell you my, I don't know if this is a conspiracy theory or just a theory.
My theory was that, you know, OpenAI's brand recognition is suffering a lot, you know, like Sam Altman's
name recognition is suffering a lot.
Did you see the Onion interview of Sam Altman?
No.
Oh my God, it's hilarious.
I mean, obviously completely fake.
You only see a transcript.
There's no video or anything, but very, very funny.
And I felt like Sora is kind of a big brand risk.
Like, even this Harry Potter thing, absolutely hilarious.
But the brands, you know, Balenciaga, I think, Louis Vuitton, are literally
in it by name.
I don't think they're happy with that.
So I felt like they were trying to just limit their exposure.
But they had a big partnership with Disney, right?
Like a billion dollar like potential value.
Yeah, yeah.
And it's all just gone.
It's kind of wild.
I mean, yeah, I would chalk it up to the same thing, like a brand thing even for Disney.
But maybe it really is the compute.
We'll have to wait and see when this new Mythos comes out
what's going on there.
All right.
Talking about science fiction,
it's time for book of the show.
We're going to take turns.
I'm going to do a book of the show,
and Jason's going to do Tool the Show.
So my book of the show,
better late than never they say,
Project Hail Mary.
So it's well-timed with Artemis,
which I think was coincidental.
Okay, real quick,
not to interject too much,
but I am just starting this book.
So try not to spoil it.
Yeah, yeah.
No, I'll be good.
I'll be good.
I always undershoot, I think.
But it's also, there's movies, so there's trailers.
It's very hard to avoid.
That's true.
There's like levels of spoilers, but I'll be even better than the trailers, I think.
I think trailers reveal too much.
But Project Hail Mary is a book about, you know, science fiction, about a, you know,
person out trying to, you know, save a dying earth.
Right.
And there's lots that goes into that.
Um, it's a bit of, uh, if you've, you know, kind of gone through Andy Weir's other book,
oh, why is the name escaping me now? The Martian, right? Oh, the Martian, thank you. I was going to
say a different sci-fi Mars book. I was like, no, that's wrong. Um, thank you. The Martian.
It's a little bit the same, right? It's like sort of, uh, tech science grounded. Uh, he does a pretty
good job about that. But then also like this sort of hopeful, like, you know, you want to root for
the good guy, you know, and not have like sort of the same chaotic, you know,
black, mysterious, dark, like, you know, is this good or bad? It's just like you want to root for
someone. And so it's kind of in the same theming. And I will say, I'd been encouraged to read the book for a very
long time. Did you like it more than The Martian? Man, I don't know. Like the book was good. It's a short read
compared to most of my recommendations, most of the books that I read. Very easy read. I went on a
vacation recently and actually like first two days of the vacation. I basically read the whole book.
Oh, wow. It included a very long flight, you know.
leaving the country going to a different continent.
So it was a very long flight.
So to be fair, okay, it was still a lot of reading.
Book was good.
Book was like, you know, pretty good for what it is.
Like, you know, not super deep, but, you know, well grounded.
I had some issues.
But, you know, just around the edges.
I'm not super critical of science stuff generally,
other than just being like, yeah, right?
Come on. Nobody knows such different disciplines to this level.
Oh, yeah.
That's just unrealistic.
But other than that, you know, plot.
I guess.
But I, you know, the movie, people love the movie, but I feel like I wasn't as keen on the
movie as I was on the book.
The book was definitely better than the movie, but the book was not as good as it was hyped
up to me.
But it was so short, it's got to be worth it for just like, sometimes you need the casual
stuff, right?
Like, you can love the deep, grindy, this is really making me think.
But then, you know, sometimes just taking that sort of like I say, I'm that way about
Marvel movies.
A lot of people rag on Marvel movies, superhero movies.
I want to go in and watch something
I don't have to think so hard about like
you know I don't want it to be you know
some deep brooding you know
at the end was he in the dream or was he awake or
you know okay I don't I don't whatever
I don't want to know like just tell me what happened
yeah I'm right there with you
I mean you know I go into a movie
I saw the Super Mario movie with the boys
I want to see this yeah yeah they they loved it
and you just you have to go in
with the right expectations.
You're not going in
to expect something
really high concept.
Yeah.
So Project Hail Mary is like
a near future,
very grounded,
not super,
you know,
far flung,
but definitely,
definitely worth the read.
It's pretty easy read,
you know,
on a scale of such books,
I guess,
and reasonably short.
So I've heard
the audio book is also
really good.
I normally listen to
audiobooks.
I actually read this one on my Kindle.
Oh.
I don't know if the
audiobook is as good
as it's cracked up to be.
I'm digging the audiobook. I mean, I've only just started, but the voice is fine. So yeah, I think it's a good voice.
All right. Shout out for the audio book. But yeah, most people probably know this now because the movie is out.
My daughter is reading the book after the movie and having a good time as well. So I don't think, you know, it's one of those.
They're just different. They're the same story. It's pretty true to each other, but they're still pretty different in the, you know, depth of content they have.
So you shared your Project Hail Mary literature with your daughter and I shared my drip,
or school of drip, with my son.
Which one of us is the better father?
I showed my kids that video too.
Just to be clear.
Just to be clear.
There is some vulgarity.
So if you, like, just be mindful if you,
if that's something you're very concerned about.
There are a few minor profanities.
There's definitely, yeah, there's probably some F bombs and stuff like that.
To be fair, I only showed it to my 12-year-old.
So I don't think I would show it to a six-year-old.
Just watch it first.
That's what we're saying.
Yeah, totally.
Definitely. A viewer, what is it, discretion advised or something?
Cool. All right. Yeah. And if you do read the book and get it from the library instead of paying for expensive movie tickets, you could turn around and give that extra money to us by following us on Patreon. We do really appreciate all of our patrons. All the money just sits in an account where we use it to help out the show. We try and get more folks.
interested in the show, especially folks who are starting their career.
And so all of us collectively really appreciate your donations.
Okay, so tool of the show, Patrick's going to skip this time.
Okay, Patrick, this is going to blow, either going to blow your mind or you already know about it.
I wanted to play Final Fantasy 6.
That's the one with Edgar and Sabin and Terra.
Basically, it's actually Final Fantasy 3 in the U.S., but they call it six in Japan.
I know which one this is.
So it's with, you know, Terra has like got a connection with the Espers.
And it turns out, I'm not going to spoil it.
But basically, love the game, love the story, haven't played it in probably 20 years.
So I thought, more than 20 years.
So I thought, I want to play it.
But like, I know I'm just going to beat it.
I mean, if I could beat it at 12, I'm sure I could beat it at 40, whatever.
So I was like, how can I play this game and get the story and experience, but like still be
challenged, right? Okay. Oh, man. So I found out that people make, you, I've always done, like,
ROM hacks for translation, so I can play, like, English versions of Japanese games. But there's a
whole ROM hacking community just for making games either harder or more interesting or both,
or what have you. And so I played, uh, ogre battle, it's not Ogre Battle hard type. Uh,
There's an ogre battle mod, which I can look up,
but basically made the game harder,
but also balanced all the units and everything.
There's separately an ogre battle hard type,
but that was just very frustrating.
Like at some point,
what I don't want to do is to play the same game,
but have to grind for like 100 hours.
You know, like that's not fun.
What I want is just like a harder experience,
but taking roughly the same amount of time,
you just have to be more strategic, right?
So I finished that Ogre Battle mod, and then I found this amazing Final Fantasy 6 mod.
It's called T-Edition, and it adds an insane amount of content.
For example, there's achievements, there's like all these extra side quests.
There's just a ton of content.
There's new bosses.
There's actually like a different mechanics.
So for example, you know, generally in Final Fantasy,
you almost never like
cast certain spells
like you know they're kind of in the game
for continuity but like when do you really
ever like poison your enemy
it's pretty rare right
but but now it's like there's different
bosses that have certain weaknesses
that kind of encourages you to use the whole
gamut of spells
phenomenal
I mean I'm about halfway through it
I was worried that you know when you play
a ROM hack you know
There's always the risk that the ROM hack designer just isn't as thorough as the original game designer.
And like, you'll get maybe 40 hours into the game and now you're just stuck.
Like, you just can't make progress.
And so you wasted your time.
But this T edition is super popular.
It's got a thriving community.
It's been around forever.
Tons of people have beaten it.
And so you know that it's like you're going to get to the end.
but it's super, super fun.
I've been playing it on my phone,
and it's a blast.
That is awesome.
I have never played Final Fantasy.
I would always call it Final Fantasy 3, but six, I guess.
Wait, so you never played 3?
No, uh-uh.
Oh, man.
Seven.
I did 7, but never 3.
See, I actually stopped at 3 because I never had a PlayStation.
I missed out.
I actually,
I played seven a year ago,
but you know,
my childhood I missed out on seven.
Is it worth like,
I guess,
sometimes when you play the old games,
like if you don't have nostalgia,
they're really tough to play
for like quality of life reasons.
Just like,
lots of random,
stupid stuff you got to do
or just like,
you know,
they tried to make the game longer.
You know,
I don't know stuff like that.
Like,
is this game still playable?
Or do you got to have nostalgia for it?
Or is this going to be like,
no,
there's better games to play?
Um,
It's really hard to say.
Okay, I would say the story is phenomenal and probably worth playing, even if it's the first time.
The story is very, very good.
I feel like the pace is good.
I'm debating whether, if you go straight to T edition, that might be difficult because, you know, you are kind of expected to,
Yeah, I would play the regular edition.
Okay.
The T edition would really be for people who want to play it a second time.
But I would definitely go back and play.
If you're going to play the regular version,
there's some Android ports or iOS ports that are probably better
than playing it on emulator.
But I think it's a great game, very solid.
It shocked me the first time I played it,
where I thought I was at the end of the game,
and it turns out you're only at the halfway point.
Kind of like, if you ever played a link to the past,
Super Nintendo.
Yeah, remember when, like, you fight Ganon
and then, like, he throws you into the dark world
or whatever they call that, the other world?
Do we have to give spoiler alerts for, like, 30-year-old games?
It's kind of like that where, although in that game, it was more obvious
because, like, clearly there aren't just three things to do in the whole world, right?
Here it's less obvious, right?
I literally thought, well, the game's over, and I was only halfway done.
Oh, nice.
Yeah, I'm ashamed to say,
I also have only ever made it, like, through the very beginning of Chrono Trigger.
Everyone says it's the most amazing RPG ever, but yeah, I'm bad with RPGs.
Yeah, Chrono Trigger was very good.
I think the, I could see people getting stuck in Chrono Trigger because there's some parts where you just have to kind of persevere.
to make the other parts
kind of worth it.
I would say Final Fantasy 3,
there's constantly something interesting going on.
All right, all right.
Well, I am trying to go back and play more of these.
So I'll add it to the list.
We'll see if I get to it.
You should totally play the ROM hack.
It's weird.
It says that there's glitches on emulators,
but I think that was back in the past.
Just quick, quick history of emulators.
So, you know, the machine has its own
instruction set. And Patrick probably knows
this way better than I do, so I'm going to try my best.
You know, it has like, you know, move things over here or this other type of
instruction or whatever, like at really low level.
And your computer, you know, doesn't have all the same instructions, right?
Your phone probably has different instructions in your desktop.
And so, like, it has to sort of translate the instructions from, you know, what it would give
a real Nintendo to what it has to give your computer.
And back in the day, like, they couldn't just
translate it perfectly because it became really expensive.
Like there might be an instruction that was super fast on Nintendo,
but it's really slow on your computer.
And so when they made these hacks,
I guess the hacks only worked on the Nintendo without glitches.
But now the emulators are like absolutely perfect
because the machines are just so fast.
So there were these warnings about,
oh, if you're playing on emulator, the sound won't work,
but everything worked perfectly.
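As a cartoon of that translation loop, here's an interpreter for a three-instruction guest machine. The opcodes are invented for illustration; real emulators also have to worry about timing, memory maps, and much more:

```python
# Minimal fetch-decode-execute loop: each guest opcode is mapped to an
# equivalent operation on the host. Opcodes here are made up.
def run(program, registers):
    pc = 0  # program counter into the guest instruction list
    while pc < len(program):
        op, *args = program[pc]
        if op == "LOAD":            # LOAD reg, value
            registers[args[0]] = args[1]
        elif op == "ADD":           # ADD dst, src  (dst += src)
            registers[args[0]] += registers[args[1]]
        elif op == "HALT":
            break
        pc += 1
    return registers
```

For example, `run([("LOAD", "a", 2), ("LOAD", "b", 3), ("ADD", "a", "b"), ("HALT",)], {})` leaves register `a` holding 5.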
I play almost all old games with the sound off
because I find the repetitive chip tune thing
only listenable in very small doses.
So I basically play with everything on mute.
Yeah, me too actually.
Even on my switch.
In my case it's because I'm like in the car
or I'm in the passenger seat or something where I don't.
Thank you for clarification.
Yeah.
I just don't want to bother people usually.
Yeah.
All right.
it is time for agentic coding, the rest of our podcast.
Yeah, I mean, why are we even here?
Like, can't I just press a button and generate a script?
Oh, man.
I mean, what a crazy time we're living in.
So quick, quick history lesson.
You know, IntelliSense has been around forever.
I don't know if folks remember the word IntelliSense,
but and I don't know how it worked.
I mean, I can take a guess.
I mean, I guess that, you know,
it ingests all of your code on your, on your local computer.
And then it, you know, builds up some kind of statistical models.
And then when you start typing, you know,
or maybe it doesn't even need statistics,
maybe it's totally deterministic,
but you start typing, you know,
my object dot, cac, and it just fills in, you know,
cache triangle or something like that because it knows that that function exists and there aren't
any other functions that start with CAC and so it'll just pop that up next to your cursor and you
could hit tab and it'll just auto complete that function name for you. So that's been around forever
and that's been great. I had like an Emacs extension at some point a long time ago that did this.
So even in the terminal you could do it and everything.
that's been around forever.
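Jason's guess at the mechanism can be sketched deterministically: index the known member names, then complete an unambiguous prefix. The member names below are invented:

```python
# Classic IntelliSense-style prefix completion, no statistics needed:
# return the unique match for a prefix, or the list of candidates.
def complete(prefix, known_members):
    matches = [m for m in known_members if m.startswith(prefix)]
    return matches[0] if len(matches) == 1 else matches

members = ["cache_triangle", "clear", "count"]
```

`complete("cac", members)` returns `"cache_triangle"` because nothing else starts with `cac`, while `complete("c", members)` returns all three candidates for a popup.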
And then, you know, the large language models started to get really good.
And so, yeah, there was, they started doing auto-complete on steroids where you could use GitHub
Copilot and you could start typing, you know, for.
And it would just auto-complete, you know, the entire for-loop and the contents of the
for-loop or something if it was pretty easy to infer from other parts of the
code base. So that was really cool. And then Cursor came out. And so Cursor had this neat feature
where it would auto-complete starting at your cursor, but then it would jump around. So for
example, in most languages, you have some type of import. So in TypeScript it's called import, in Python it's
called import, and in C++ it's pound include, right?
And so let's say as part of the auto-complete,
it ended up needing to use a library that you weren't yet importing.
You could tab to complete that code,
and then it would give you a little notification-type thing
that there's more work to do somewhere else in the file.
So you could type tab again,
and it would jump to the top of the file
and show you what it wants to auto-complete over there,
and then you could type tab a third time,
and it would do the imports as well.
And so, you know, this kind of continued to the point where I think at some point
cursor had multi-file, like you could just kind of keep hitting tab,
and it would jump around your code base doing various things.
And so this is all great.
And so then, so people were, you know, using that.
I was using it.
It was very exciting.
And then Claude Code came around, and that was a huge game
changer. So this is where, you know, instead of just extending, you know, an idea that you had
partially written, you could, you know, kind of ask it in English to do something with your code,
and it would go off and do it. And that is, was pretty wild. I remember when that first came out,
I posted, actually, my most popular post on LinkedIn was basically talking about how awesome
Claude Code is and a whole bunch of people
like ripping on me. This one guy I'll never forget it.
If you're out there listening to us, you're a jerk. Stop listening to us.
But this one guy was like, you should be better than this.
Like do better. You know, like Claude Code and these things are,
he called it a stochastic parrot, which I found out later is like a common pejorative for LLMs.
He's like, you should do better than this. You have a programming podcast and
you're trusting these stochastic parrots. And just insulting my character and everything.
That guy was a jerk. But it was a very controversial post. It's one of these
things that, like, I didn't do it on purpose, but it got, I think, like one and a half million
views, and there's definitely, like, proponents and antagonists. But I was right, in hindsight.
I told everyone early on that this Claude Code thing is amazing.
Actually, what I had said that I think was so provocative was I said,
I have not written a single line of machine code in my whole life.
I run 100% of my code through the compiler.
And so I'm already not writing pure code.
So if I start running 100% of my code through Claude and I just tell it in English what to do,
it's really not changing anything.
And that's kind of where I still stand.
I mean, I use it constantly.
Now, you know, even today, there are times I have to go in the code and we'll talk about all of that.
But in general, kind of where I'm coming from is I'm a big fan.
And I think it's pretty amazing, pretty amazing times we're living in.
So maybe to, like, you know, expand a little on the switch from those early days (you kind of explained how Cursor was
in the early days) to, you know, maybe what like Claude Code or Codex or the Gemini, you know,
solutions are today. I think we've also seen a lot of iterations like around the edges.
So things like, what is it, the MCPs, you know, retrieval-augmented generation, RAG, around like,
you know, basically, in my opinion, these are kind of like around how to let the LLM,
which is just next token prediction, right, how to allow it to do tool calling, which is something
we talked about in the podcast a long, long, long time ago where we said, hey, these chat things
would be really cool if they could reach out and do web searches or connect to Wolfram Alpha or
okay, well, anyways, turns out people are already working on that. We just didn't know.
So everybody had the same good idea. Good. But then, you know, so basically how to interact with
tools. And there's been a lot of iterations through, you know, how that works to, you know,
I don't say it's settled today, but to where it is today. And then the other one is around
sort of managing the context window, right?
The, hey, how much of your code base, of your problem space, of your conversation,
how much of that can be held?
And for the context, I think it's been interesting that the models themselves have gotten
bigger.
But one of the things that isn't super obvious is that even if you will see something like
Gemini can handle a million tokens as context, that doesn't mean it's as efficient,
both in terms of like how fast it runs, but also in how the quality is,
a million tokens versus 10,000 tokens.
And so a lot of these models have degraded performance
as they reach up to those million tokens.
It is unintuitive because it's not like a hash map
or a list or an array that just grows.
And you get some weird cache effects,
but generally, you know, they're kind of well understood.
There's these unintuitive mechanisms.
So aggressively managing what's loaded into the context
and things like compaction is super important.
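A minimal sketch of the budget-managed context idea, assuming a crude word-count tokenizer; real compaction in tools like Claude Code summarizes older turns rather than just dropping them:

```python
# Keep the newest messages within a token budget, replacing the overflow
# with a single marker. The word-count tokenizer is a stand-in.
def compact(messages, budget, count_tokens=lambda m: len(m.split())):
    kept, used = [], 0
    for msg in reversed(messages):          # walk newest-first
        cost = count_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    dropped = len(messages) - len(kept)
    if dropped:
        kept.append(f"[{dropped} earlier messages compacted]")
    return list(reversed(kept))
```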
And I think it's been very interesting
to see, you know, something like taking your whole code base,
understanding how to use tools to search in your code base,
how to learn like which pieces to extract and load up and sort of assume.
And I will say it's still not perfect.
You still see sometimes I see it even using, you know,
a tool like Claude Code where it'll, we talk about hallucination.
It'll hallucinate API calls.
But it's in a loop now.
That's the agentic part where it'll try to compile itself and realize like,
wait, why did I call, you know,
object.foo?
.foo doesn't exist.
It's .bar.
And, you know, it'll figure it out.
But for whatever reason, like this, you know,
hallucinating of .foo, it just assumed that,
you know, you have a list class,
therefore you have an insert.
And maybe you don't have an insert because of, you know,
some esoteric reason.
And so you still see around the edges,
but a lot of that is because it's trying to keep the context down in the first pass,
which causes some problems.
If it loads up all your code base,
if it tries to understand all of this,
the speed can really, you know, get bogged down.
Yeah.
I mean, the other thing that I think you touched on,
but just double click on it is the actual intelligence of the model goes down
as you put more information in it.
And so this creates a weird trap where someone will start with a completely blank slate
and say, build me a website for e-commerce.
And it can do that.
but a lot of that is based on its sort of innate knowledge from looking at many e-commerce
sites on GitHub and things like that.
And so it'll build something and you'll feel like it really understood the nature of what
you built, but a lot of it might be kind of copied, right?
And then as the context grows and you end up with more and more bespoke information about
your particular, you know, product that you're selling, then it needs to keep more and more
information in its, you know, short-term memory. And as it does that, the intelligence starts to
go down. And so you see this trap. And so it's just something to be aware of that when you're
working with bigger projects, you shouldn't expect the same level of intelligence that you have,
you know, at a smaller scale.
But yeah, the loop thing and the tool calling super important.
Okay, so we'll explain a little bit how this works under the hood.
So in the beginning, it takes your question and the AI can do one of several things, right?
It can answer your question just by emitting some text, or it can call a tool.
and there's several tools that it can choose from.
And so this technology is older than Claude Code, to Patrick's point.
So when it's done calling a tool,
the tool information is added to the context.
So just to back up a bit,
so if you ask it, you know,
what's a distance from here to the moon?
Because I want to slingshot some astronauts, right?
So it'll start giving you that answer.
But as it's emitting those tokens, it's also using what it said to generate the next token.
So it's possible, I mean, it might be hard to make it do this, but it's possible for it to generate half of an answer and then to actually say, wait a minute, stop.
I'm heading in the wrong direction, and then generate a totally different answer.
Like mathematically, that's plausible.
It might be hard to craft a question that would cause that every time, but it can happen.
So similarly, when it calls a tool, the tool output, it's as if it said those tokens.
So in the sense of like it's now part of the context and it's used to generate the next thing that it says.
So the model can either answer you or it can call a tool.
When a tool comes back, that information is, let's say, part of the context along with your question,
then it can call another tool or it can answer you.
And whether, you know, however many tools you want to allow it to call
and all of that is up to the discretion of the developer, right?
But at some point it's done calling tools.
It gives you an answer and that's it.
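That answer-or-call-a-tool loop can be sketched in a few lines. Here `model` stands in for an LLM call returning either `{"answer": ...}` or `{"tool": ..., "args": ...}`; that shape is an assumption for illustration, not any vendor's actual API:

```python
# Skeleton agent loop: the model either answers or requests a tool call;
# tool output is appended to the context, as if the model had "said" it,
# and conditions every subsequent step.
def agent_loop(model, tools, question, max_steps=10):
    context = [("user", question)]
    for _ in range(max_steps):
        step = model(context)
        if "answer" in step:
            return step["answer"]
        result = tools[step["tool"]](**step["args"])
        context.append(("tool", result))
    return None  # stopping criterion reached without a final answer
```

The `max_steps` cap plays the role of the developer-chosen stopping criterion mentioned above.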
So the idea with agentic was,
what if a tool was itself like another question?
Like, what if we made this recursive?
And so there's two basic ways you can make this recursive.
One is where the tool call is actually another question that starts a session within a session.
Another way is to say, well, a tool call can actually give me back a list of more tools to call.
Both of those are implemented in Claude Code.
So you might say something like,
I want to remove all the lint errors in my code base
and it'll come back and say,
okay, I'll run a tool call and we'll run Pyright
and we'll look at the errors that are involved.
That might come back with, you know, 10,000 errors.
And the model will actually say,
okay, you have a ton of errors here.
You know, we're not going to just fix this in one diff.
A diff, by the way, or a patch is also another tool call,
but we're not going to fix 10,000 errors in one patch.
So I'm going to create a bunch of subtasks.
And those subtasks are like isolated questions that I'm going to ask myself.
So the model will ask itself, like in another process or another context,
you know, fix all the lint errors where the capitalization is wrong, and
just focus on that.
So now the model is really like the initial model is now an orchestrator that's orchestrating
all these sub-questions.
And the thing Patrick was saying is, you know, the nice thing is if anything is wrong,
that's okay because the expectation is to be eventually correct.
So the model, you know, as long as it has a way to verify, it can come back and try to compile
to code and say, oh, I made a mistake. I'm going to call another tool call to fix it,
et cetera, et cetera. And this continues until there's some kind of stopping criteria,
which again is created by the developers. At that point, the model hands the reins back over to you.
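The orchestrator pattern can be sketched the same way. `ask` and `split` are placeholders for model calls, not a real API; the key idea is that each subtask runs in its own fresh, small context:

```python
# Orchestration sketch: split one big job into isolated sub-questions,
# answer each in a fresh context, and collect the results.
def orchestrate(ask, task, split):
    results = []
    for sub in split(task):        # e.g. group 10,000 lint errors by kind
        results.append(ask(sub, context=[]))  # fresh context per subtask
    return results
```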
There's a lot of, I think, maybe unintuitive-to-an-outsider interplay, as you're describing,
between what you, when we talked about fine-tuning before, but targeting these coding benchmarks
and coding applications and code as like a way of doing sort of long-term planning,
which has been something difficult for LLMs to do.
But also the interplay between the harness of Claude Code, like, or, you know,
codex or any of them and the underlying model, right?
How do you tell it what tool calls to use?
You have like a tool call language or do you just let it use a command line?
Do you use MCPs, right?
Like internal for stuff.
how do you sort of like craft the system prompt?
How do you craft like the each turn, right?
Like how far to let the model think before it acts?
Like there's this interplay between how the model was trained and how you prompt it,
guide it, harness it, skeletonize its structure, and even ask separate questions about, like,
hey, how do I, like, how would I plan to do this or how do I decompose this task?
So how do you decide, you know, whether to kind of have it do more of a one shot,
like here's what I'm trying to do, just do it.
versus saying, let me first ask it to decompose it into tasks.
Then I'm going to say for the first task, like output the task as JSON.
And then for each thing in the JSON array, prompt it again, right?
Like you get all this interplay between how you invoke the LLM and how the LLM works.
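One hypothetical way to wire up that decompose-then-execute flow; `call_llm` is a stand-in for whatever client you use, and the prompts are invented:

```python
import json

# Two-phase prompting: first ask for a JSON task list, then prompt once
# per task. A real version would validate the JSON and handle failures.
def plan_then_execute(call_llm, goal):
    plan = call_llm(f"Decompose this into tasks; output a JSON array: {goal}")
    tasks = json.loads(plan)
    return [call_llm(f"Do this task: {task}") for task in tasks]
```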
And then we, I don't think we've yet seen, to be honest, you know, we might talk about in a future episode,
but things like OpenClaar or like the various more like computer use things, which is the LLM is really starting to train
on this use case, right?
Sort of actually getting to where they themselves are, you know,
making sure that during training, they are understanding and working with this tooling
better.
And so for now, it's a lot of, I don't want to say hackery.
That's not the right word, but like a lot of sort of humans iterating or having
LLMs iterate the LLM harness for the LLM.
Oh my gosh, it's just LLM.
But, you know, I think there's a lot of nuance.
And you'll see people talk about how Codex works, how,
you know, Claude Code works, how
the Gemini one works, I think, but there's also
like Open Code, right, which is
a version of Claude Code, but
with open source and, you know, bring your
own back end as a more like supported
sort of methodology.
And I think all of them have
nuanced difference. In fact, just last week
Claude Code had its source code leaked.
And people were kind of deep diving and seeing
oh, hey, there's like, you know,
monitoring of how upset the user
is. There's like all this stuff in it you kind of
wouldn't assume. So I don't think there's a
settled out harness approach yet.
And then the question is like, how much does the harness need to match the individual model
is, is unclear because every model is sort of slightly different too.
Yeah. Yeah, great points. Yeah, I think it's still pretty early days.
One sort of thing that really surprised me is how well it works on things that aren't
even coding related.
you know, recently Google released a set of skills.
So we didn't really talk about this,
but a skill is basically a set of tools with really detailed prompts behind each of the tools.
So, you know, there might be a skill where it's about reading and writing Google Docs.
And so you'll get a set of tools that let you sort of do the mechanics, you know,
add to a Google Doc, read a Google Doc, insert characters into a doc, etc.
But then you'll also get this really detailed markdown of what is the true sort of platonic nature.
What is the nature of a Google Doc?
Like what actually is it?
And yeah, and so you can plug that Google G Suite skill — or I think they're calling it a Google Workspace skill —
into any of these coding agents
and then they have that power.
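One way to picture a "skill" as described here: a bundle of mechanical tools plus a detailed instruction document the agent loads before using them. Everything below — the skill name, the tool names, the instruction text — is illustrative, not Google's actual API.

```python
# A "skill" modeled as: detailed instructions + a set of mechanical tools.
docs_skill = {
    "name": "google_docs",  # hypothetical name
    "instructions": (
        "# Google Docs skill\n"
        "A doc is a tree of structural elements; text lives in runs.\n"
        "Always read a doc before inserting into it.\n"
    ),
    "tools": {
        "read_doc": lambda doc_id: f"(contents of {doc_id})",
        "insert_text": lambda doc_id, index, text: f"inserted {len(text)} chars",
    },
}

def use_skill(skill: dict, tool_name: str, *args):
    # A real agent would first load skill["instructions"] into its context,
    # then make the mechanical tool call.
    return skill["tools"][tool_name](*args)
```

The point is that the markdown instructions carry the "platonic nature" of the domain, while the tools only carry the mechanics.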
So I think that this is going to be
pretty disruptive to a lot of industries.
I'm thinking like finance,
medical, legal, et cetera,
in the same way that it's disruptive to coding.
It's a very generic system.
So the way that, you know,
Claude Code worked that was pretty different
from any of its predecessors was that the tools that it had
were extremely generic.
You know, people were at the time building very specific tools.
Like, here's a tool to access, you know, my bespoke database.
But I'm not going to give you just open SQL access because you might just drop all the tables in my database.
So I'm going to give you like this tool adds a customer record and this tool deletes a customer record, like very, very specific tools.
And Claude Code came around and said, well, we're going to give you a tool called
bash, which can do anything. And we're going to give you a tool called patch that can just patch
any file anywhere, and a read-file tool, and a read-a-chunk-of-a-file tool, etc. But just super, super
generic. And in the beginning, it's kind of scary, right? And so that's why by default,
most of the tools kind of ask for your permission if you're going to patch a file, etc.
But then over time, I feel people have just gotten more and more comfortable where I think most people probably run with like a workspace yolo kind of mode where, hey, as long as you're in my code base, you can read and write whatever you want.
And you don't have to prompt me all the time.
And so people are getting more comfortable with it.
I was using it recently to create notes where basically I wanted a midterm kind of help guide for my students.
And so I said, hey, you know, here's the textbook and here's the midterm, you know, come up with sort of a help guide.
And kind of what Patrick was saying in the beginning, the help guide was okay, but, you know, not great.
and I realized I was kind of asking it to do a lot.
You know, it's a big textbook.
It's a big midterm.
And I said, hey, break the textbook up into chapters
and give me a study guide for each chapter.
And that kind of forced it to, you know, use many tasks
and each task having only to read a chapter, you know,
as we talked about, the model gets dumber if you give it more information.
So because each subtask only had to think about a chapter,
I actually got a much better summary
just from changing the prompt.
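The per-chapter trick is just task decomposition: instead of one giant prompt over the whole book, one small subtask per chapter. A minimal sketch, with `llm()` again a hypothetical stand-in:

```python
def llm(prompt: str) -> str:
    # Hypothetical model call; returns a placeholder "guide".
    return f"guide[{prompt[:30]}]"

def study_guide_one_shot(textbook: str) -> str:
    # Whole book in one context: the model "gets dumber" as input grows.
    return llm(f"Write a study guide for: {textbook}")

def study_guide_per_chapter(chapters: list[str]) -> str:
    # Each subtask only has to think about one chapter.
    return "\n".join(llm(f"Study guide for chapter: {c}") for c in chapters)
```

Same total work, but each model invocation sees a fraction of the input, which is exactly why the summary got better.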
Yeah, I mean, I guess that's like a good transition
into our next thing, like a set of maybe like learnings
guidelines.
I do agree with like general use and the bright future.
But I mean, I guess like the first one
and we kind of talked about this with hallucinating,
with needing to break things down, you know,
sort of plans, which we're going to talk about in a minute,
is at least if you're working in code,
but even if you're not, like even if you're working,
I've been playing around, maybe we'll talk about a future episode,
but like using a lot of markdown files and something like obsidian
to track stuff and to track thinking and doing basic file manipulation.
But use Git.
Even if you're not going to push it up to GitHub, it doesn't matter.
And I will say a lot of these tools have skills natively to do things like Git worktrees,
which, I will not lie,
I have never used a Git worktree in my life.
I don't know how it works.
I don't know how it works either.
Okay, good.
I'm not the only one. But like sophisticated ways of using Git,
they know how to do it.
You just have to tell them you want it done, right?
And then anytime it's sort of like working, check it in.
If you make a change, check it in.
Like you can always go back, and then you can always tell the LLM, like,
if you want to try rolling forward, which we'll talk about in a future tip,
you can say things like, hey, like this last change didn't work, right?
And then it will be like, oh, okay, yeah, like let me revert it or let me, you know,
look at the diff, right?
And so it has access.
So Git is like something we talk about,
kind of, you know, basic, technical person,
software engineer skill,
but definitely here.
And again,
I use this for personal workflows
without any remote repository.
It's just a local thing.
It doesn't matter.
It's just to have like a history
that is like archived of my instructions of,
you know,
what I was doing, of my code.
Yeah, the agent is amazing at writing detailed
Git commit messages too.
So you just tell it, hey, do a Git commit here of all the untracked files, with a nice message.
And if you do that in your like session, it will often include notes from the session,
like things you were trying to do that wouldn't necessarily be inferrable from the code.
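The "check it in after every change, no remote needed" workflow is easy to script. A rough sketch, assuming `git` is on the PATH; the identity values are placeholders:

```python
import pathlib
import subprocess

def checkpoint(repo: str, message: str) -> None:
    """Commit the entire working directory as a local-only checkpoint."""
    def git(*args):
        subprocess.run(["git", "-C", repo, *args], check=True, capture_output=True)

    if not (pathlib.Path(repo) / ".git").exists():
        git("init")
        # Local identity so commits work without global git config (placeholder values).
        git("config", "user.email", "agent@example.com")
        git("config", "user.name", "agent")
    git("add", "-A")  # include untracked files
    git("commit", "--allow-empty", "-m", message)
```

Calling `checkpoint(".", "tried X; didn't work")` after each agent step gives you the archived history of instructions and code the hosts describe, and the diffs the LLM can later revert or inspect.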
And then I guess that'll go to my next thing, which is before you start, you know, attempting —
I guess the term is sort of broadly vibe coding, where, to Jason's point, you try not to actually write code,
although that term is, I think, a little derogatory, but like whatever.
And people, we can talk about the ethics or morality of it some other time, but, you know,
it just is here.
So, okay.
But I think there's the similar term of one-shotting, which is like, hey, make me a website, and you just let it go.
You don't do any.
That may be a way.
I don't know that it works amazingly today.
But what I will say is all the tools I use have like a planning, you know, sort of flow.
And in the planning flow, the tool is not writing anything to disk.
It's not making any changes.
It is simply attempting to break down your request into what it's going to do to actually
go ahead and extract most of the needed context.
And it used to be that it was just telling it like think harder and like make a task list.
And then it would go to the task list.
More recently, I've seen the tools get better.
And basically what they'll do is, you know, hey, I want to add this feature.
I want to build this thing.
It will go like do the research,
put snippets of code.
And then you can review it, at varying lengths,
and spend time going through it two,
three, four times.
The better you get that,
the better like the output will be.
And you can ask questions like,
why did you do that?
If it's in a domain you don't know about
or like just force it to reconsider.
Like, is this the only option?
There's like lots of things you can do
and there's skills that will also help to.
Although sometimes those skills go out of date
as the harnesses roll forward in terms of like,
do you need to do those extra steps or not.
But then once you sort of get through the planning,
a lot of them will then clear the context and keep only the like,
I have minimized the description of the work to be done,
including function calls,
pertinent snippets and all of that.
And what you'll find is the execution runs much faster
and is less likely to run out of context
while doing the thing you described in your planning.
Yeah.
So we should talk about running out of context.
So, as we talked about earlier, your model has a context window, which means the model can only take in so many tokens at a time.
It's not a recurrent model, so it doesn't sort of accumulate anything.
And so every time it reads in a question or a tool call result, it reads in the entire context and then makes a decision.
And so because of that, the context has to have a limit.
When the model hits the limit or gets close to it,
it does what's called compacting, and it sucks.
And as of today, it hasn't really gotten much better.
So basically, the way compacting works is pretty simple.
The context is made up of a list of messages.
So a message might be something you asked it.
A message could be part of its response because now you have this multi question answer session, right?
A message could also be a result from a tool.
A message could be a request to call a tool.
These are all messages, this whole ledger of things that have happened, right?
And so in compaction, you know, it's going to preserve some of the things that are really important.
So when you told it something and it responded,
there's probably enough important information in it.
Those will get kept in their entirety.
But when a tool responds with like an entire PDF and it read the whole thing,
it's going to try to summarize that.
So it's going to summarize that whole PDF, you know,
down to maybe a paragraph or two.
And so compaction is all about sort of summarization.
And then it continues to summarize until you have 90% of your context window open again.
The problem with the summarization is that details often really matter.
And so when you compact, often you're in this really weird state.
I've seen situations where I ask a question, it compacts,
and then it actually answers like the previous question I asked again.
Um, that's just probably maybe a bug.
I don't even know.
Um, but there's, there's weirdness.
So you really want to avoid compaction, and task decomposition is the best way to do it.
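The compaction scheme described above — keep user/assistant turns verbatim, summarize bulky tool results until there's room again — can be sketched as a toy. `summarize()` stands in for an LLM summarization call; real harnesses count tokens rather than characters.

```python
def summarize(text: str) -> str:
    # Stand-in for an LLM summarization call.
    return text[:40] + "..."

def compact(messages: list[dict], limit: int) -> list[dict]:
    """Shrink a message ledger under `limit` by summarizing tool results."""
    def size(msgs):
        return sum(len(m["content"]) for m in msgs)

    msgs = [dict(m) for m in messages]  # don't mutate the caller's ledger
    while size(msgs) > limit:
        # Only tool results are candidates; user/assistant turns are kept whole.
        candidates = [m for m in msgs
                      if m["role"] == "tool" and len(m["content"]) > 43]
        if not candidates:
            break  # nothing left to shrink; the details are already lost
        biggest = max(candidates, key=lambda m: len(m["content"]))
        biggest["content"] = summarize(biggest["content"])
    return msgs
```

The lossiness is visible right in the sketch: a thousand-character tool result becomes a 43-character stub, which is exactly why details go missing after compaction.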
Yes.
Uh, there's another point for the notes, but yes, you know, decomposition is a skill that I think
matters both for the purposes of what we're talking about, but also
more generally. Like separation of concerns, like object-oriented thinking — those things
do not go out the window. But at a time when the LLM, I will say, currently tends to
generate a lot of spaghetti code —
it's not pure spaghetti code, but it also doesn't spend cycles — that's the wrong way to reason about it,
like it's not literally true, but whatever —
it doesn't spend tokens trying to think through, you know, finding duplicates.
It is just happy to copy and paste stuff, right, and not think, oh, I should move this out
to a common helper function.
A lot of that stuff won't happen. It will also happily do super inefficient things.
And the more things per unit of work it is trying to do, the worse that becomes. So really
letting it be step by step, even more than maybe its own step by step, is very important. Yeah, totally. Um,
Kind of related to this — the filename is a little different, it hasn't standardized yet, but
for most of these, the file is called
AGENTS.md, and
"agents" has to be all caps.
This is like a magic file.
So whenever you
start Claude Code
in a directory, it's going to
look for an AGENTS.md file, and it's actually going to
recurse back through the file tree and
look for all of them and add them all up.
So here's where you can put
things like: anytime you make a significant change to the code, you should run unit tests, and here's
how you run them. Or: don't use the system version of Python, because we have a virtual environment —
use the virtual environment Python. These are all instructions that you put in your AGENTS.md
file, and that way you don't have to say it every single time you have a question. So think of
whatever you put there as being sort of prepended to any question that you could ask. And that's
been extremely important.
You know, if you want to have a code that doesn't have all sorts of lint errors and just,
you know, really kind of cruddy design, then that's where you kind of enforce all of that.
Over time, I've been making this more and more complicated.
I have one now where, you know, if a file gets too big, it breaks it up and it's sort of
instructed to, you know, break the file up semantically, group the file, the code,
semantically.
Originally, I said,
hey, if a file gets too big, break it up.
And I was ending up with like
cube_part_1.py,
cube_part_2.py.
So then I had to go and add,
you know, no, okay,
when you break it up,
it has to be semantically.
There has to be meaning behind each of the file names.
So these are all the kind of things you'll do as you're iterating through.
But,
you know,
you can end up with something that granted is going to burn more tokens,
but it kind of keeps a healthy,
ecosystem so that if you do end up having to go back and look at that code, you could do it.
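Putting the rules from this discussion together, a file like this might be what's being described. The specific commands and paths here are illustrative placeholders, not from the episode:

```markdown
# AGENTS.md (example)

- After any significant change to the code, run the unit tests: `pytest tests/`.
- Do not use the system Python; use the virtual environment: `.venv/bin/python`.
- If a file grows too large, split it semantically: group related code
  together and give each new file a meaningful name (never `foo_part_1.py`).
- Fix lint errors before finishing a task.
```

Every instruction in it is effectively prepended to every question you ask in that directory.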
I had an issue recently where there was a config parameter.
I wanted to configure the size of the output video.
And so I said, hey, the size of the output video is too small, add a parameter, and by default, make it eight,
so that the video is eight times as large.
So it's like, sure, done.
And I do a run and the video is the same size.
So I said, hey, you know, the video is only this many pixels by this many.
It should have been bigger.
And it actually argued with me.
It was like, well, you know, the domain might have gotten smaller.
So it actually made the video bigger, but the domain was now smaller.
And so it canceled out.
I was like, well, wait a minute, wait a minute, run all the, you know, lint tests and everything.
and then come back.
And it's like, oh, yeah, I didn't pass this variable through and blah, blah, blah.
So, like, having good code hygiene is important even for the AI.
And so even if you're doing a side project, just keep around an agents.md file and dump it into every project you do so that your AI will have good code hygiene.
Otherwise, it will make mistakes just like a person.
I think maybe one day this will be different.
But just like your analogy earlier where we pass all our code through a compiler,
and maybe now we pass all of our instructions through, you know, an agent to code all of our stuff,
I do think some classic debugging stuff is kind of like just like those skills that people have learned.
And I don't know, it'll be tougher to learn them in the future, perhaps.
But I feel like are still like how you're saying, when you see a problem, it's like, well,
how do I get you to agree with me?
Why does that come?
That comes because you were a manager and you've had to argue with, you know, individual contributors under you.
before. Like, why don't you write a test, right? If you just say go look for the bug,
it's going to be like, I didn't see one. Like, did you really try? Like, let's write a test.
Write a test. Change the — oh, oh, yeah, yeah, you're right. Okay. You know, like,
those skills are still, you know, understanding where the most likely source of problems are,
especially in your domain and in your experience, right? Like, I feel, at least for the time being,
are going to continue to be, you know, sort of very useful. And then also,
there's different ways of building a system,
but certain ones sort of stacked together nicely.
And so, can the LLM training ever really be equally good
across all the different ways of developing,
right?
Like, imagine one person wants to develop in an agile way and one person in a waterfall
method, and so on. Like,
is it really going to be equally good at all of them?
It feels unlikely.
And so depending on what,
but you may still want to stick in your lane or your company has policy or whatever.
right? Like, it's very interesting. If it's just like, your code sucks, I rewrote all of it.
Well, hang on. How am I going to like get buy off for that? Like, how am I going to get acceptance testing?
Right. You know, this is all very difficult, difficult questions. It does lead to your, you know, sort of debugging example.
One of the things that I hear a lot of back and forth, I try a little bit of both, but it's just something to be aware of.
on different levels, there's like rolling forward with the issue and then just restarting.
And that restart can be, like we talked about, reverting to an earlier Git commit.
It can be which some tools support better than others.
It can be clearing your session or closing it out and starting it again,
which erases your history, basically, unless you know, force resume it.
Because sometimes something in the context window is tripping it up or polluting it,
and it's just stuck in some weird loop.
And literally just exiting, starting again,
and asking the same question sounds stupid.
Now, we haven't talked about it.
It is still kind of slow.
Like, it sucks.
Like, it generally takes a long time,
even if you're asking a simple question,
which is not how it is with humans.
Like, if you were just having a debate with a human,
like, you know, you ask them something simple,
they're going to respond very quickly.
But you do need to be aware of the difference
between sort of adjusting your questions,
trying to fix bugs,
and just, like, nuking it and starting back
and saying, like, let's try this again,
but I'm going to state it a little.
different and being aware that even if you copy and pasted the same prompt, you're not actually
going to get the same answer because there is randomness in the LLM that's inserted on purpose through
things like temperature and other things. So I don't, I haven't tested it, but I would not be,
so I would assume that if you paste in the same question twice from the same base state, you aren't
guaranteed to get the same answer. Oh, totally. Okay. I'm like not crazy. I know that's true in the
chatbots, but I haven't tried it in cloud code or codex or anything.
Even if you set the temperature to zero and you fix the random seed and all of that,
you're still not going to get a deterministic answer because your question is batched
with other people's questions. And the low-level CUDA operators will give slightly different
answers depending on what data is around you. And so, so yeah,
there's no way, zero way, you can guarantee determinism.
Ah, okay. Today I learned.
Yeah, unless you're running on your own hardware with a batch size of one in a very efficient way.
And you're waiting a day for every question.
Yeah, yeah, exactly.
Okay, so we're just want to wrap up on this.
You know, I get asked this a lot.
You know, is software engineering dead?
How is this going to work?
Will I have a job after we graduate?
Wait — you know, I teach at UT now, and so tomorrow I have office hours, and I guarantee
you tomorrow someone is going to ask me this. I know this because they emailed me today saying
they're going to ask me tomorrow. Wait, that's not fair. But I get this every single week. And so,
you know, my, I guess I'll give my overall take, and I love to hear your take, Patrick.
it. I think that, you know, using a systematic way to solve problems — that is always going to be in demand, right? The demand for doing that has done nothing but go up. You know, I'd like to think that my great, great ancestors were also engineers, like building catapults and stuff. Maybe they weren't. Maybe they were tyrants. I don't know. Probably just died of dysentery, dude.
But before they died of dysentery, they were making some pretty kick-ass trebuchets.
That's what I'd like to believe.
But we actually work more hours than they did, right?
Like, we work more hours than medieval peasants.
And so clearly the demand for, the demand for like systematic thinking and turning, you know, abstract problems into like really concrete problems, that's not going away.
What is going away
is the very
rote stuff, you know, like, I have this
project in Java and I need to port it to
C++ and that's going to take two years.
Like, that's going away.
And so,
um,
and so, you know, maybe the supply of software engineering
jobs actually does,
does take a permanent hit.
Um,
but that was never really the point.
The point,
wasn't really to write software.
The point was to build something cool.
And if that, as long as we stayed true to that,
then there's always going to be a ton of opportunity.
And if, you know, we have thousands of years of history showing that,
you know, it's just going to get more busy for everybody.
So, yeah, we sort of talked about it.
I think, like, curiosity, building —
I mean, I think these are things you
see throughout history that certain people
had, and the way they applied it was maybe
different. The opportunities for applying it were maybe
different.
But I mean, I think
of all the things in the world that you can
build, having, maybe
call it style, like
having the persnicketiness
to keep working through it
when it doesn't
work the first time or didn't match what you expected.
Like, the question
would be — I guess we were talking about Sora
earlier. Someone was saying,
you know, oh, Sora will kill Hollywood, right?
Like, everybody will just watch their own
movie. I don't actually think that's
true. Right. For me, the
most likely outcome is
a new, maybe
order of magnitude, maybe two orders
of magnitude, literally a hundred more
creators and movies get made.
But there are still ones that just like,
like you were talking about drip boards. I wouldn't
have had that idea, but I enjoyed it.
Yeah. And so therefore, like, somebody
had taste, like someone to call it whatever
you want, I don't know. They had
ambition to like prompt it to do that thing that I never would have. And I think the same will be true
in software — deciding what to build. Now, it may not be, I don't even want to say fair,
in the same way that you just go to college, get a good job, and you're guaranteed — in the same way that, like,
you know, you hear the criticisms about, you know, 40 years ago, you did A, B, and C, and you, like,
you know, worked X years and then retired, and it was good — like, that may not be true anymore. May or may not.
I'm not here to say.
I'm not trying to get on politics.
There's like, there's no guarantee,
like you're saying,
you know,
medieval peasants,
it's completely different than us today,
you know,
like things change.
And so my take on software engineering is
the skill set feels valuable,
understanding how systems work,
how they get built,
how to debug stuff,
right?
Like,
this feels useful,
but maybe there's less of a distinction
between learning it as a mechanical engineer,
an electrical engineer, or a software engineer —
maybe the lines blur
because you can reach further outside your domain.
But maybe there's less difference
between how those people interact with tools like Claude Code
than there was between them working with an IDE.
But then maybe someone who's a journalism student
works with Claude Code as well,
but in a fundamentally sort of different interaction model, right?
They're building something very different.
They're building smaller things.
They're building, you know,
and maybe they don't want to buy the thing you're building,
but maybe you're building for an end product.
I think there's probably going to be seismic shifts,
but it's very difficult to predict a specific shift.
Yeah.
That was a lot of words to say very little.
Boom.
Done.
Throw the gauntlet down.
Yeah.
No, I think that makes a lot of sense.
I mean, I think that, you know, it makes sense for teachers,
for all these folks, to use Claude Code for their specific tasks.
Oh, here's another thing.
you know, we seem to have forgotten the importance of data and particularly data around user experience.
You know, so, like, you hear people say, oh, I'll just make Salesforce myself for my company.
And I'll save my company 100K a year.
It's like, yeah, but Salesforce isn't just like a database and some front end code and back in code.
It's also, you know, like a decade plus of user research.
like, oh, you know, we, we allowed everyone to just add and delete people in Salesforce.
And then some rogue employee just got pissed and wiped the whole Salesforce database for one of our clients.
And we learned not to do that anymore.
We learned, oh, we actually need RBAC.
We need role-based access control.
So that Joe Schmo, who, you know, is like entry level, who's an intern for the summer, you know, when his internship is over,
that the last day he can't just go and download
all the personal information out of Salesforce, right?
So, like, that's a lesson they learned.
And there's definitely, like, you know,
they probably learned like 10 lessons a day for 10 years.
They learned like tons of tons of lessons.
And the AI is not going to have all those lessons
because, you know, it's not going to be in the GitHub comments.
You know, it's like, Joe Schmo
tried to wipe the database for ExxonMobil on his last intern day,
and that's why this code is here.
Like, that's not there.
So I don't think SaaS companies are dead.
I don't think it, I think, I know companies that are like,
we're going to cancel our Salesforce contract and in-house it.
I expect that to be a complete and utter disaster,
all sorts of weird security problems over the next 12 months.
And ultimately for SaaS to make a comeback.
And I'm not even in SaaS, but I have no horse in the race,
but that's where I see it going.
I also feel like it's a focus thing.
Like, Salesforce is expensive, but I think people are maybe too worried about some of the
expenses. Like, if you can have focus and just outperform — right, like, is that really what
you want people spending time on, even if it's easy? I don't know. I mean, I don't know how much it
costs, but maybe it is worth it for a certain size of company, or people doing very low-margin
work. But growing your margins is probably about finding a way to be more effective in some
area rather than just reducing costs.
There's like always that balance, that tension
in business. And you're
right. I feel like it's overblown that suddenly it'll
just make sense for everyone to do it.
Some people will. Like the total
addressable market will probably go down
when there's lots of companies who are just
paying too much per seat, right? Like it's
the total amount they spend,
they can employ a small group
of engineers to do this. But then in a bunch
it just, it takes away
focus. Right. Right. Yeah.
Because when you say re-implement
Salesforce, that's not good enough.
You know what I mean? Like, you can't just go to an engineer and say, do that.
Because Salesforce is enormous.
You probably don't need all of it.
And even if you did need all of it, how would you faithfully reproduce it?
So chances are they say redo Salesforce and you end up with some janky thing that can't handle
load and just doesn't behave the way you expect.
And guess what?
It has no documentation either.
Or if it does, it's AI generated.
And it hasn't been looked
at and reviewed by a person.
This is broken.
Talk to Claude.
Talk to,
talk to GPT.
There we go.
Open AI.
Ring,
ring, ring.
Sam Altman.
Can you fix my?
Oh, my gosh.
So, yeah,
I mean,
I'm not a betting person,
but I would probably
try to buy the dip
on SaaS.
I don't,
again,
I don't know where it is.
I wouldn't try to
try to convince
anyone that they could time
the market,
but it just feels like,
feels like SaaS is a little underrated at the moment.
All right.
So yeah, we will definitely cover OpenClaw.
Someone requested that.
There's been a lot of requests for things adjacent to this.
We wanted to really set the foundation,
talk about the history behind this,
and build the sort of first floor of this tower
that we're building on this topic.
So hope you all appreciated it.
And yeah, we'll catch you all later.
All right.
See you next time.
Music by Eric Barndollar.
Programming Throwdown is distributed under a Creative Commons Attribution-ShareAlike 2.0 license.
You're free to share, copy, distribute, and transmit the work, and to remix and adapt the work,
but you must provide an attribution to Patrick and I, and share alike in kind.
