The Data Stack Show - 229: The Future of AI: Superhuman Intelligence, Autonomous Coding, and the Path to AGI with Misha Laskin of ReflectionAI
Episode Date: February 19, 2025. Highlights from this week's conversation include: Misha's Background and Journey in AI (1:13), Childhood Interest in Physics (4:43), Future of AI and Human Interaction (7:09), AI's Transformative Nature (10:12), Superhuman Intelligence in AI (12:44), Clarifying AGI and Superhuman Intelligence (15:48), Understanding AGI (18:12), Counterintuitive Intelligence (22:06), Reflection's Mission (25:00), Focus on Autonomous Coding (29:18), Future of Automation (34:00), Geofencing in Coding (38:01), Challenges of Autonomous Coding (40:46), Evaluations in AI Projects (43:27), Example of Evaluation Metrics (46:52), Starting with AI Tools and Final Takeaways (50:35). The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data. RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Hi, I'm Eric Dodds.
And I'm John Wessel.
Welcome to the Data Stack Show.
The Data Stack Show is a podcast where we talk about the technical, business, and human
challenges involved in data work.
Join our casual conversations with innovators and data professionals to learn about new
data technologies and how data teams are run at top companies. Welcome back to the Data Stack Show.
We are here today with Misha Laskin.
And Misha, I don't know if we could have had a guest
who is better suited to talk about AI
because you and your co-founder have this amazing, vast background,
working in sort of the depths of AI,
doing research, building all sorts of fascinating things,
being part of the historic acquisition
by Google on the DeepMind side,
and some amazing stuff there.
So I am humbled to have you on the show.
Thank you so much for joining us.
Yeah, thanks a lot, Eric. It's great to be here.
Okay, give us just a brief background on yourself, like the quick overview. How did you get into AI?
And then, you know, what was your high-level journey?
So initially, I actually did not start in AI. I started in theoretical physics. I wanted to be a physicist since I was a kid.
And the reason was I just wanted to work on what I believe to be the most interesting and
impactful scientific problems out there. And the one miscalibration I think I made is that
when I was reading back on all these really exciting things that happened in physics,
they actually happened basically 100 years ago. And I sort of realized that I had missed my time. You know, you want to work on not just impactful
scientific problems, but the impactful scientific problems of your time. And that's how I made it
into AI. As I was working in physics, I saw the field of deep learning growing and all sorts of
interesting things being invented. Actually, what made me get into AI was seeing AlphaGo happen, which
was this system that was trained autonomously to beat kind of the
world champion at the game of Go.
And I decided I needed to get into AI then.
So after that, I ended up doing a postdoc at Berkeley in this lab
called Pieter Abbeel's lab, which specializes in reinforcement
learning and other areas of deep learning as well. And then I joined DeepMind and worked there for
a couple of years, where I met my co-founder, as we were working on Gemini and leading a lot of the
reinforcement learning efforts that were happening on Gemini at the time.
Yeah, so many topics we could dive into, Misha. So I'm gonna have to take the data topic.
There are many things I'm really interested in, but something I'm really interested in is how do you set up evaluations on a data side that ensure that you can predict where
your AIs will be successful?
Because when you deploy AIs to a customer, you don't know exactly what the customer's
tasks are.
And so you need to set up evals that allow you to kind of predict what's going to happen.
And I think that's a big part of what a data team does is setting up evaluations.
And it's maybe one of the last things that a lot of people think about when they think
about AI, because we're thinking about language models and reinforcement learning and so forth.
But actually the first thing that any team needs to get right in any AI project is setting
up clear evaluations that matter. And so on the data side, that's something I'm really interested in. Awesome. All right,
well, let's dig in because we have a ton to cover. Yeah, let's do it. Misha, I obviously want to talk
about AI and we want to dig into reinforcement learning and talk about data for the entire show,
but I have to ask about your interest
in physics as a young child.
So you mentioned that you were interested in sort of working on some of the most important
scientific problems and you realized, okay, maybe some of those problems were actually
maybe 100 years old.
But what sparked that interest as a, you know, knowing you want to get into physics,
that obviously, you know, you ended up not being a professional physicist.
But what sparked that interest at a young age?
Do you have like a story that you could share around that?
Because, you know, knowing you want to be a physicist as a child is not the most common
thing.
Yeah, I considered cowboy first, but yeah.
Cowboy, fireman, physicist.
Well, what happened is that I'm not from the States originally.
I'm Russian, Israeli, and then moved to the States as a kid.
When I moved, I didn't really speak the language very well and didn't have a community here.
And so I ended up having a lot of time on my hands.
And my parents had a library, you know, a number of different kinds of books.
But one of the sets of books that they brought with them was The Feynman Lectures on Physics.
And this is kind of a legendary set of lectures that I recommend anyone read, whether you're in physics or not, because it's kind of an
example of really clear, and in that way simple, and very beautiful thinking. And
I read those books and it was just so interesting that
the way in which Feynman described the physical world, the way in which you could make really counterintuitive predictions
about how the world works by just understanding how it works from,
you know, a set of very simple assumptions, very simple equations.
And so the short answer is I had a lot of time on my hands
and got interested actually in a lot of things.
At the time, I got interested in literature as well.
I ended up double majoring in literature and physics,
but it was literature and physics that I got interested in at the time, and then ended up going
kind of hard committing to physics. Wow, absolutely fascinating. Yeah, when we were chatting
before the show and you said, you know, I realized it was 100 years too late, I was like,
oh, theoretical physics, the answer to that problem is, you know, traveling through time,
so you can get back to, you know, back to that era.
Yeah.
Well, it might be that the problems also that we have in physics today are just so hard
that it's really hard to solve them.
I think that progress in this problem is definitely not being made nearly as quickly as it was
100 years ago, and there's so much to discover.
One of my hopes with AI is that we develop
AIs that are smart enough as scientists that they help us answer some of these fundamental
questions that we have in physics, which to me seemed like a complete sci-fi
thing even a few years ago. But now, almost counterintuitively, I think theoretical
math and theoretical physics are going to be among the first use cases for the next generation of models that are coming out today.
Let's dig into that a little bit because one of the questions, just one of my burning questions to ask you was,
what do you envision the future with AI to be like?
What does that look like for you? What
types of things do you see in the future that make you excited, in the ways that humans will interact with AI or the way that it will shape the world that we live in?
Yeah, I think that, I mean, I'm very personally quite optimistic about AI.
Obviously, there are a lot of things that we need to be careful about, especially from
a safety perspective.
But there's one quote that I heard a friend say that really stuck with me, which was,
you know, artificial and general intelligence, AGI.
He said, you know, I think AGI will come and no one will care.
I hadn't heard that before.
Then I thought about it and I think that's what's going to happen.
But think of it from this perspective, right? We have computers today,
we have personal computers today, which is a massive leap from what people had decades ago,
or personal phones. And I would say we don't care; we just don't know our lives in any other
way. We don't know what life was like before computers or before personal phones. Even though
I remember what it was like not having an iPhone, from
a day-to-day perspective, I never even think about it anymore.
So I think what's going to happen is that, you know, all of the ways in which AI is going
to transform us are going to be similar in perception to the way technology has transformed
us already.
And so what I mean by that is that I
think that in AI, there are oftentimes
really polarizing, either hyper-optimistic,
it's going to be a completely transformed
world, which obviously it is, or doomsday scenarios,
things are really going to go down poorly.
And I think the reality is that it's
a remarkable piece of technology
that's probably more transformative than mobile phones or computers themselves. But the effect
on us as people is going to be that we just live our day-to-day lives. It will have changed
our day-to-day lives, but we won't even remember what life used to be like. So yeah, I think what's
going to happen, for example, from a work perspective, is that now we don't really take
notes with pencil and paper; we have much better storage systems on the computer for our
notes and things like this. And so we've accelerated the amount of work we can do, the knowledge
work we can do, just by having a computer.
And I think there's gonna be kind of a massive increase
in like productivity, especially in knowledge work to start
and in physical work as well.
But let's just think about knowledge work.
I think in the future, and this is kind of how I
at least think about AGI is that it's a system
that does the majority of knowledge work on a computer.
So what I think that means, it's
not that it's like a zero sum pie and that we go from today doing, let's say, almost 100% of knowledge
work on a computer to us going to 10%, AI going to 90%, and now we're doing 10x less work. I think
it's going to be that we kind of work the same amount that we did before, but we're getting 10x
more things done. And we don't even remember what it was like to get the amount of
things done that we do today.
I think that's what, that's what the world is going to look like.
That fits the historical curve too, right?
Like we don't even know what it's like to sit down and handwrite a memo and then
wait several days for it to get delivered, right?
Like compare that to email, for example.
So it seems like it would fit that curve, right?
Like you get that drastically faster,
more leveraged life and it's just the life you live.
Yep.
Yeah, absolutely fascinating.
One follow-up question to that, Misha,
you talked about your co-founder developing
some pretty
amazing technology that could mimic what humans do, right? And actually you mentioned, you
know, seeing the world champion, world human champion at Go get beat by AI. And then I
believe your co-founder developed an autonomous system
that could play video games by looking at a screen,
which is pretty wild.
And one of the interesting things is that,
and maybe you can give us some insight into the research aspects of this. But one thing that's interesting about your perspective
on, you know, we sort of go about our day-to-day work
and we get 10x throughput.
One thing that's interesting about that is, you know, is that a replacement of some of the things that I'm doing as a human?
Is it augmentation? Is it... Can you speak to that a little bit? Because just replicating the keystrokes that I make in my computer isn't necessarily
the way to get 10x, right?
And I think we know that context is something
that the AI is amazing with, right?
It can take context and really do
some amazing things with it.
So can you speak to that a little bit
in terms of replicating humans, augmenting?
What does that actually look like?
Yeah, so the first thing that I'll say is that
the kinds of algorithms developed leading up to this
moment that we're in right now,
and the things that you mentioned too
that Ioannis, my co-founder, worked on,
which we call Deep Q-Networks (DQN) in the case of video games
and AlphaGo in the case of the Go example,
were actually superhuman. So they got to a human level, and then they exceeded it and became superhuman. So when you look at an AI system playing Atari,
it looks kind of alien because it's just so much better now than a human could be. And the same
thing was true for Go. And now what you said was right in that
the way these systems are trained,
especially like let's take AlphaGo as an example,
it had two phases.
The first phase was you train it to mimic human behavior.
So you have all these games, online games of Go,
like similar to how chess has online game servers.
There are a bunch of online
game servers for Go. And they picked a bunch of those games and filtered them for the expert
amateur humans and taught a model to basically imitate expert amateur human behavior. And
what that ended up getting was just a model that was pretty proficient, but still just kind of a
human model. And then after that, they trained that model, that sort of human level model,
with reinforcement learning based on feedback of whether the model was winning the game or not.
And the thing with reinforcement learning is that you don't need demonstrations from people.
You just need a criteria for whether the thing that the model did was correct or not.
And as long as you have that,
which in the case of the game of Go
is did you win the game or not?
Sure.
You can basically push it almost, you know,
if you throw enough compute at it,
it will get to a superhuman level, right?
It will just find strategies people have never even thought of.
And that's kind of what ended up happening.
So there's a famous move called move 37
in the game of AlphaGo against Lee Sedol,
the world champion in Go.
And move 37 was a move that looked really bad at first,
like analysts who were looking at it were confused
and Lee Sedol was confused.
Everyone was just really confused by it.
And then it turned out a few moves later
that it was actually a really creative play that
was just really hard for people to wrap their minds around.
And it turned out to be the right play in retrospect.
So what I'm trying to say is, we have the blueprints for
how to build superhuman intelligence systems.
And so I think we are heading into an era of super intelligence. Now, it does
not necessarily mean super intelligence at everything, but we will have models that are
super intelligent at some things.
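To make that two-phase recipe a little more concrete, here is a minimal, self-contained sketch in the spirit of what Misha describes rather than AlphaGo's actual code: a toy one-move game where a tabular policy is first fit to imitate expert demonstrations and then improved with reinforcement learning using only a win/loss signal. All names and numbers are made up for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    N_STATES, N_MOVES = 5, 4
    # Hidden "rule of the game": the winning move for each state (unknown to the agent).
    best_move = rng.integers(N_MOVES, size=N_STATES)

    # Tabular policy: one logit per (state, move) pair.
    logits = np.zeros((N_STATES, N_MOVES))

    def policy(state):
        p = np.exp(logits[state] - logits[state].max())
        return p / p.sum()

    # Phase 1: imitation learning on "expert" demonstrations.
    # Experts usually, but not always, play the winning move.
    demos = [(s, best_move[s] if rng.random() < 0.8 else rng.integers(N_MOVES))
             for s in rng.integers(N_STATES, size=2000)]
    lr = 0.1
    for state, move in demos:
        p = policy(state)
        grad = -p
        grad[move] += 1.0                      # gradient of the log-likelihood of the expert move
        logits[state] += lr * grad

    # Phase 2: reinforcement learning (REINFORCE) with only a win/loss reward.
    for _ in range(5000):
        state = rng.integers(N_STATES)
        p = policy(state)
        move = rng.choice(N_MOVES, p=p)
        reward = 1.0 if move == best_move[state] else -1.0   # did you win the game or not?
        grad = -p
        grad[move] += 1.0
        logits[state] += lr * reward * grad    # reinforce moves that led to wins

    print("greedy policy matches the winning move:",
          np.mean(logits.argmax(axis=1) == best_move))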
Well, I think it's a great time to talk about Reflection. So tell us about Reflection, because
that's a focus of what you're trying to do. Tell us about Reflection, what you're working on.
Before we jump into that, just because I think I've seen a lot of this
thrown around in news articles and stuff.
So you've got AGI, right, and you've got this superhuman, and I think there's been some chat around that,
like, oh, we're moving past AGI to superhuman.
It'd be awesome, I think, for the listeners to just take a minute and be like,
all right, what do we mean by AGI? Obviously that's general intelligence, superhuman.
And then just parse that out for them a little bit, because I think those words
already are just getting thrown around. What does it mean to go beyond human level proficiency and be superhuman? Yeah, right. Yeah.
Yeah. And I think, you know, if we put other words into the mix that may be good to kind of talk
about later is also the word agent, right? I think the word. Yeah, yeah, let's throw that into the
soup. Yeah, for sure. And then super. Yeah, yeah, exactly. Many things. So at least the way I think about it is, first, I don't think about binary events,
like there's AGI and then there's superintelligence. I think about it more as a continuous
spectrum and that's kind of how like in the game of Go, for example, there was no, it's
really hard to pinpoint a moment when it went from, you know, human level intelligence to
superhuman. Like the curve is actually smooth.
Like, so it's kind of a smooth continuum
and even subhuman intelligence,
like it's smooth from subhuman to human to superhuman.
So it's really around,
like if we have discovered methods that scale,
that the more kind of compute and data we throw at them, they just
predictably, right, they scale in their intelligence, then those are the kinds of systems that we're talking
about. So to answer your question, to me, the distinction between sub-human intelligence,
human intelligence, and superintelligence is just where on the smooth curve of intelligence are you. Now, it's helpful to be, you know, yeah, it's helpful to define what some of these things are.
And different people have different definitions for AGI.
I don't think that there's a centralized definition, like the community hasn't converged on what people agree it to be.
But we have a version that we're working with,
a working version that is kind of meaningful to us.
And that's kind of how we think about AGI, which is,
it's a functional definition.
It's just, we're thinking about digital AGI.
We think the same thing can be applied to physical AGI.
It's a system.
We don't know exactly what form it takes,
it can be a model,
it can be a model with, you know, tools on a computer, but it's a system that does
the majority of knowledge work on a computer. And notice I'm not saying the majority of
knowledge work that people do today, because I think the majority of the knowledge work
that's done, you know, even a few years from now is going to look largely different. So just at a
given point in time, when you assess the work that's being done on a computer
that's generating economic value, is the majority of that being done by humans or
basically by the computers themselves? And to me, that's kind of what AGI is. So it's more a
functional definition. And what that means is that the only benchmark that matters
is whether AI is doing meaningful work
for you on a computer.
It doesn't matter what math benchmark it's solved.
It doesn't matter.
None of the academic benchmarks matter whatsoever.
All that matters is it doing meaningful work
for you on a computer or not.
And so what's an example of like products that I think,
you know, make meaningful impact along that kind of
benchmark?
Let's say GitHub Copilot.
GitHub Copilot, you can just track the amount of code that it writes versus the amount of
code that the person writes.
Now, of course, you also have to decouple the amount of time a software engineer thinks
about the design of the code and things like this.
But it's hard to argue that it's not doing work
on a computer.
Like it's definitely doing some work on the computer.
And so on the smooth spectrum from, you know,
sub human intelligence to human intelligence,
super intelligence, I think copilot is on that spectrum.
Right? It might not be general intelligence,
but it's on the way there.
So quick, quick followup.
And then I definitely want to dig in on Reflection's application of superhuman intelligence.
But something that's frustrated me a little bit in how we talk about this is we've got this AI curve that you just explained.
But then we treat the human intelligence as like a static factor, like some kind of standard to get to.
And like I would, I mean, the way I think about it is like that human intelligence has changed over time for sure and will continue to change. And I think there's an aspect of like whenever we talk about AGI, like when is AGI going to happen? It's like, well, I think the humans are going to get more intelligent too. And that like, you know, like even with a game of Go example, I would think it's very possible that like if somebody used,
you know, this model to essentially like learn new Go strategies and therefore like they're
better too. Now maybe, you know, maybe the AI is still better than them overall. So like
maybe just briefly like, I'd love your thoughts on that.
I think that's actually exactly what's happened that the Go community and the chess community,
they both, yeah, they both learn from
the AI systems now. So, right, what made Move37 special, people analyzed it and have incorporated
that into their gameplay. One of the things I'm really excited about is, you know, I just remember
what my life was like as a theoretical physicist, which is, I mean, it was very like theoretical
physicists, like, you know, write equations on a chalkboard
and, you know, derive things with pencil and paper. And you
basically sit in the room, think really hard, derive things, go
talk to collaborators and, you know, kind of try to sketch out
ideas on a chalkboard. And what I'm really excited about, you
know, especially AI that's super intelligent in some aspects of physics, that it's going to be this sort of patient and infinitely available thought partner for scientists to be able to do their best work.
So I think that kind of for a while, it's going to be the combination of, you know, scientists together with an AI system that works together to accomplish something because something that's kind of counterintuitive that we usually think about intelligence is this very
general thing because humans are generally intelligent and these AI systems are
generally intelligent and will continue to be as well but general in their case means
something different than in our case. That is to say, they can be intelligent across many things,
but there are some things where they're not gonna be
as intelligent that are counterintuitive to us
because you're like, wait, that's like so easy for us.
It's kind of like the, yeah, we have these systems
for playing like Go, but it's really hard to train robots
to like move a cup somewhere or something like this.
Right? Yeah, yeah, yeah.
Yeah, yeah. So yeah, that's how I kind of
see the interplay. I think that this universal generality as we see it is sort of maybe possible,
but these AI systems end up spiking at many things that are
counterintuitive to us, and they end up being, you know, pretty bad at many things that are
kind of intuitive, and we'll sort of co-evolve together with them. Yeah. Yeah, that's such a helpful perspective, Misha. I want to return to the point that you made
around the definition of AGI, or the working definition at Reflection, around,
you know, AI doing the majority of knowledge work on a computer,
but with the important distinction that, you know, that's not just
a wholesale replacement, you know, so it's not like, you know, the human is not even
interacting with the computer.
It's that the knowledge work that a human does actually changes.
And I think that's a really helpful mindset to have in that when we talk about, you know,
the future of AI, we tend to think about how it impacts the world
as we experience it today,
when in fact it will be a completely different context.
There will be new types of work that don't exist today,
which is really interesting, so just appreciate that.
And there'll be things that it's bad at,
like there'll be lots of maybe more human cup movers, or whatever the equivalent of that may be in knowledge work, which will be interesting.
Yeah, there was actually a scene from, I think it was Willy Wonka, Charlie and the Chocolate
Factory.
Yeah.
And it's, I think it's the Tim Burton Johnny Depp one, where they show his father being on the conveyor belt line and screwing the caps onto tubes of toothpaste.
And then one day he gets replaced by a robot that does that.
When I was at Berkeley, I studied robotics and, you know, how to make robots autonomous.
And then I thought about that and it was like, that's actually a really hard problem.
You know, like that requires dexterity that requires like, like it's all those things that, you know, in the movies we think like you can,
you can do that easily. That that was like one of those things that's counterintuitive.
It's really hard. Yeah, that's hilarious. Yeah. I mean, that was truly, truly fantasy,
you know, in the movie. Well, let's jump over to Reflection. So,
tell us about Reflection.
I know you're still early on the product side of things, but what can you tell us about what you're working on and what you're building?
Definitely happy to share more.
The way we think about our company and the way we thought about it since we started it
is that we've been on the path as researchers of building AGI for the better part of a decade now, or that was
kind of our entrance into the field.
Right?
Ioannis, my co-founder, joined DeepMind in 2012 as one of the founding engineers, when
it was just a crazy thing to even say. It just seemed like a complete sci-fi dream
that you want to work on AGI and in the scientific community, most people kind of even ostracized you if that's kind of what you want to do
because it was just such a crazy, almost unscientific thing to say.
It's just not serious.
And so he joined at that time.
And this is when these methods and reinforcement learning were developed that resulted in these
projects like Deep Q-Networks and AlphaGo.
But ultimately, the reason he joined, the reason I joined AI as a researcher is this
belief that at first it was pretty vague.
What does it mean?
There was a belief that maybe we can build something like AGI within our lifetime, so we might as
well try it; it seemed like the most exciting thing we could do.
But since then, I think it's gotten a bit more concrete. And now I think we're in a world where this definition of the
system that does majority of meaningful knowledge work on a computer is in the realm of possibilities.
Like it's not, it doesn't feel like sci-fi to me at all. It seems like something that we're just
inevitably headed towards. And so if that's a system you want to build, you then have to think
backwards towards what does that
mean from a product perspective, from a research perspective. And we basically started thinking
about, well, what does the world look like a few years from now? Once we start making it as a field,
a wedge into starting to do some meaningful knowledge or computer, where does that even
start? Where does that happen? And what does the world look like? And one useful place to think is that now that we have, you know, before language models,
we didn't even know what the form calculation would be, right? It was the fact that language
models work was pretty crazy. It surprised everyone. And it's still today. I just remember what
the world was like before then. And it's just kind of magic that it even worked. Like we,
this is one of those things, right?
It happened and we don't care.
Like language models are just magic.
Yeah.
I just want to stop and appreciate
that you have been researching AI for a decade.
And that's the way that you describe it
was that everyone was surprised at this.
Cause I was thinking, I wonder if Misha, you know,
sort of could see this acceleration happening, but it sounds like, you know, you were surprised at this too?
which was like, yeah, it works at this and this, but will it really scale and do these things?
And different AI researchers at different points in time
in their careers got scaling-pilled
and realized that, wow, these things do scale.
For some people it happened earlier;
I think for the OpenAI crew it happened earlier.
I was, I would say somewhere middle on that spectrum.
So, you know, early enough where I got to be,
you know, part of like the early team in Gemini and really build that out. But still, I feel
like, it felt like I was a bit late to the game.
Fascinating. Okay, sorry to interrupt. Okay, so reflection, you are looking several years
ahead, imagining what it takes, you know, for AI to do a
majority of knowledge work on a computer and you're working back. So where, where did you
like, where's the focus? Right? Like, you know, cause that's a pretty broad thing, right?
Like knowledge work on a computer is pretty broad.
Yeah. So I'll start with the punchline first and kind of explain why just to contextualize.
So we decided that the problem that needs to be solved is the problem of autonomous
coding.
So if you want to build for this future, or if you have systems that do majority of knowledge
work in a computer, you have to solve the autonomous coding problem.
It's kind of an inevitable problem that just must be solved.
The reason is the following. Language models, the way language models are most likely going to
interact with a lot of software and computer is going to be through code. We think about interactions
with computer through keyboard and mouse because the mouse was designed for this. By the way,
the mouse was invented what, like 60 years ago? Like the Engelbart "mother of all demos" was in the 1960s.
So it's, you know, it's actually like a pretty new thing.
And it was an affordance that unlocked our ability to interact with computers.
Now what we have to think about for AI is, knowing that the form factor that's really
working is language models.
What is like the most ergonomic way for them to do work on a computer?
And by and large, it turns out that they actually
understand code pretty well because there's a lot of code
on the internet.
And so the most natural way for a language model
to do work on a computer is basically through function calls,
API calls and programmatic languages.
And we're starting to see the software world kind of evolve
around that already.
Like Stripe a few months ago released an SDK that is built for a language model to basically
transact on Stripe reliably.
And we think that a lot of software, like Excel, for example, do we think that a language
model is going to drag, you know, an AI is going to drag a mouse around the way people do
to click a table in Excel and manipulate data that way? Almost certainly not. It's going to probably do it
through, again, through function calls, right? We have SQL, we have querying languages. And
so we kind of need to think about how we believe software will get re-architected in a
way that is ergonomic to AI systems. So that's how we're thinking about things. And if you think about it that way,
you just realize that, I mean, there's always
going to be a long list of things that there aren't
going to be code affordances for,
but a lot of the meaningful work will,
like a lot of those big pieces of software
that people use today and where you do most of your work
today, will have affordances through basically
programmatic affordances.
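As a rough illustration of what a programmatic affordance can look like, here is a small hypothetical sketch, not any particular vendor's SDK: a couple of ordinary functions exposed to a model as tools, and a dispatcher that executes the structured function call the model emits instead of dragging a mouse around a UI. The model output is hard-coded here for illustration.

    import json

    # Ordinary functions exposed to the model as "programmatic affordances".
    def run_sql(query: str) -> list:
        # Stand-in for a real database call.
        return [{"region": "EMEA", "revenue": 1200}, {"region": "AMER", "revenue": 3400}]

    def send_email(to: str, subject: str, body: str) -> str:
        return f"queued email to {to!r} with subject {subject!r}"

    TOOLS = {"run_sql": run_sql, "send_email": send_email}

    # What a model might emit instead of mouse clicks: a structured function call.
    # (A real system would parse this out of the model's response.)
    model_output = json.dumps({
        "name": "run_sql",
        "arguments": {"query": "SELECT region, SUM(revenue) FROM sales GROUP BY region"},
    })

    call = json.loads(model_output)
    result = TOOLS[call["name"]](**call["arguments"])
    print(result)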
So if that's what you believe the world looks like, at least, a significant part of knowledge
work on a computer is done that way, then the bottleneck problem is, okay, assume it
has all of these programmatic affordances, how do you build the intelligence around it?
And so the intelligence around that is an autonomous coder.
It's something that kind
of, you know, it's not just generating code. It's also thinking. It's reasoning. I would
say, I think now I need to go, like, and open up this file and, you know, search for this
information and then, you know, maybe send an email to this person, right? Like, it needs
to be thinking and kind of reasoning. But then it's acting on a computer through
code. So, we were thinking backwards and we thought about, okay, what is the category that today,
like basically today has like the affordances that we need to start and it's like very valuable
and something that we do all the time that we can have kind of high empathy for as product
builders.
And so we landed on autonomous coding,
both because we believe that that's sort of
the gateway problem to automating
a whole bunch of pieces of software that are not coding,
but coding is also the problem setting that
is ripe today for language models
because the ergonomics are already there.
Because language models are good at code, because there's a lot of code on the internet. And so the
ergonomics are there. They know how to, like, you can build tools to read files, to run commands in a
terminal, to read documentation. And so it's just kind of a ripe category today that's truly
valuable and that we understand very well.
So code, based on your study of AI over the last decade,
that's an area that's ripe for this.
What are the other areas you think, in the relative near term, are ripe for automation as well?
There are a bunch of them. The way I would think about it, the way we think about it, and this is true both for
what we're doing and both for other companies that are working in automation with AI, building
autonomous agents, be it for coding or something else, is that I
think a good analogy here is this sort of transportation analogy, like going from cars as they are
today to autonomous vehicles. I think that kind of lands here as well. And the way to think about it
is that chatbots like ChatGPT and Perplexity and GitHub Copilot, these products that are much more
chat, like you ask them something, they give you something back. We think about them as like the
cruise control of vehicles,
of transportation vehicles, because they kind of work everywhere. They're not fully autonomous on
anything really yet, but they work everywhere. And so there are these like general purpose
tools that are kind of, you know, that are cruise controls, augment the human. Now, if you're trying
to build a fully autonomous experience, like, you know, this is what people refer to as agents today,
the same thinking,
it's much closer to how you would think about designing an autonomous vehicle.
Autonomous vehicles don't work everywhere from day one. They have a geofencing problem. And the kind of player that won is Waymo, I think. I got on a Waymo when I was in San Francisco last, and it was just this
magical experience. And they did a fantastic job by basically nailing San Francisco and
they geofence it. And you can't go on highways, you can't do all these things that
you can do in a normal car. But within the geofenced area, it works so well that it's just a transformative
magical experience. And I think that is how people should be thinking about autonomous agents.
So we shouldn't actually be promising, you know, the equivalent of a fully
autonomous vehicle, like, in the future.
Sure, we're promising a thing that automates a lot of stuff on a computer in the
future, and it's clear where things are going.
But today the important problem is geo-fencing.
And so what are examples of that?
I think customer support is an area
that has shown this kind of workflow working really well. How does the geofencing analogy
transfer there? It's that some tickets that your customers are asking about can be fully
resolved autonomously. Maybe they have a simple question that's actually an FAQ or something like this.
And so you'll route that to an autonomous agent that will just solve that.
And the tickets that are more complex,
you'll send to a human.
Or if, like, the customer asks for it to be escalated,
you'll send it to a human.
So there's a sort of a,
like I think that successful product form factors
in agency and autonomy,
have this sort of geo-fencing baked into them,
but they kind of take on the thing they can do well,
and then help the customer outsource the thing that they can't do well yet to like the normal, you know, state of
affairs.
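A minimal sketch of that geofencing idea applied to support tickets might look like the following; the intents and routing rules here are hypothetical, not a description of any real product.

    from dataclasses import dataclass

    # The "geofence": task types the agent is trusted to handle end to end.
    AUTOMATED_INTENTS = {"password_reset", "order_status", "faq"}

    @dataclass
    class Ticket:
        text: str
        intent: str                  # assume an upstream classifier produced this label
        asked_for_human: bool = False

    def route(ticket: Ticket) -> str:
        if ticket.asked_for_human:
            return "human"           # always honor an escalation request
        if ticket.intent in AUTOMATED_INTENTS:
            return "agent"           # inside the geofence: handle autonomously
        return "human"               # outside the geofence: hand off

    print(route(Ticket("Where is my order #123?", intent="order_status")))    # agent
    print(route(Ticket("My invoice looks wrong", intent="billing_dispute")))  # human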
So I'm curious your opinion on this.
I think there's an interesting like loop here where, yeah, like it makes total sense, like
interact with this AI thing and then like human, you know, human in the loop type thing.
But I think there's also this aspect of, enough companies have to generally get this right.
So say it was a solved problem, but essentially only 5% of companies have the technology where this works well.
Humans are going to be like, I want to talk to a person.
They're just going to try to get past the AI agent as soon as possible.
So I'm curious about your thoughts with that because there's this what's possible problem
and there's like, will humans adopt it?
Will humans use it?
Because you guys must face that building a product. Yeah. So I think for us and for others like to complete the customer support thing, the
ideal experience is that the human doesn't even know.
Like it's just, the customer came in, their problem got solved, and they didn't care or
know who did it for them.
Yeah, give it a name, give it a face.
Right.
And that's the way we think about kind of autonomous coding.
So the kinds of things, you know, so when we think about geo-fencing, we think about,
we, you know, you want to go for tasks that are actually pretty straightforward for an
engineer to do, because these models aren't, you know, super capable yet, but you
want these tasks to be things that are tedious and high volume and that engineers don't like
doing.
So there's so many examples of these things,
like code migrations.
There's so much of a migration,
when you're moving from this version of Java to that one,
that is kind of thankless work. Or, you know, writing tests. Or
suppose you're relying on a bunch of third party APIs
or dependencies.
It got, you know, an API got updated.
It wasn't backwards compatible.
Your code fails.
Your engineer has to change what they were doing to go fix that.
And right, it's sort of, again, undifferentiated work that's, especially for companies that
have very sophisticated engineering teams and are doing a lot, they end up having this
sort of backlog of these kinds of tedious small tasks that are actually not really
differentiating tasks for them as a business at all.
And so these are the kinds of tasks where a product like ours comes in. You know, when customers talk to us,
they don't even think necessarily of a copilot-like product, because they think
about, if we can just automate these for them, some subset of them, right? Some subset of
these migration tasks or third party API breakages or some subset of their backlog, then it's something that engineers never even have to do.
Whereas the copilot helps them do the things that are on their plate faster. And
interestingly, from the developer's perspective, much like the customer support use case,
it should be indistinguishable for the tasks where it works from like a competent engineer
sending them a pull request to review, right? Like a failure mode for a company that does
autonomous coding is that you took on more than you could chew and your agent is sending
bad pull requests and now the developers are wasting their time reviewing junk code.
Yeah, right, right.
So as long as from like, you know,
you have to be pretty strategic about the tasks
that you take on and sort of not over-promise,
you know, set expectations correctly
and deliver an experience that is basically
indistinguishable from a competent engineer doing this.
Yeah, so that's really interesting.
So essentially, and I mean, this is an overused term,
but essentially this could look like some kind of self-healing component of an app.
So from an engineer's perspective, you could engineer this into the app
and it's able to autonomously take care of API updates
and maybe a couple other things.
That's really interesting.
One question I have is around what it takes to get to fully autonomous.
We used the example of tests or API integrations
or other things like that. With autonomous vehicles, that last stretch was
really hard because they had to deal with all these edge cases. Even geofencing, I think, helped limit the scope of that, but it was still really difficult
to solve for all these edge cases.
Is it the same way when you think about autonomous coding?
Is the last 10% really difficult, to go from mostly working to
something where it is truly autonomous?
There's kind of a yes and no part to that.
The part where I think the analogy to autonomous vehicle breaks is that
an autonomous vehicle is truly autonomous and safety is so important that there's absolutely no way it can do anything wrong, right? But in this instance, right,
suppose that a coding agent
did most of what you asked it to do,
but didn't do, you know, miss some things.
Well, if it did stuff that was pretty reasonable, right,
then you just go into code review with it
and tell it, hey, you missed this,
just like you would with a developer.
So I think that the kind of failure tolerance is higher.
Like, you know, there's more tolerance in like digital
applications like this. Now, the thing is, what you want to avoid is, you know, a model where you
asked it to do something, it came back, and it just wasted your time, basically, right? It's
like, whatever amount of time it would have saved me, going back and forth with this
thing just wasted that time. So it's similar to how, when you hire someone, if it's someone who, let's say, was just not
trained as a software engineer, and it would take longer to upskill them and train them
to be a software engineer than just to do the task yourself. So I think that the actual
eval is like, is this net beneficial to you as a developer? Like are you spending less time on doing things you don't like to do with the system or not?
Rather than like meeting that level of perfection in the time you spend.
Makes total sense.
Okay, you mentioned evals early in the show, we were talking earlier, and how you said
that's one of the most important aspects
of this, especially as it relates to data.
So I think the last topic we should cover
is your question, John, which we made everyone wait
a really long time around data teams.
Some games have been changed over here.
Yeah, exactly, exactly.
So John, why don't you revisit your question?
Because I want to wrap up by talking
about the data aspect of this.
I mean, I could keep going asking a ton of questions
because it's so interesting, but.
Yeah, I think, you know, obviously a lot of our audience,
you know, works on data teams.
And I think I'm personally curious
and I bet a lot of the audience is curious about what,
what does it look like?
So say I'm a data team that works for reflection.
I'm on that data team and dealing with AI agents
and on a daily basis, like how is it similar
to what I might do at a B2B tech company or in another industry?
And what are the main differences?
Something, as I mentioned earlier,
when you kind of first asked the question,
I think something that is possibly the most important thing
to any successful like AI project product or research
is getting your evaluations right.
So actually the most like successful AI projects,
they typically start with some phase where they spend time,
they're not training any models
or doing anything like that,
they're just figuring out,
how are we going to measure value and success?
And the reason I say this is that,
let's say,
when you see like all these coding products
and like AI products in the market,
there is sort of like shooting from the hip thing
where it's like, I put it through some workflow,
here you go customer, like does it have value or not?
Whereas the way like I've seen like successful products
like this built out, like how does, for example,
like when a company develops a language model,
a GPT model or a Gemini model or whatever,
how does it know that the thing it's running,
like the people, the users will like it, right?
You have to develop all these evaluations internally that are
really well correlated to what your customers actually care about. And so in the case of
chatbots, that evaluation is basically preferences. You have your data team, and what does a data
team do for a normal language model chatbot-like product? They get a lot of data from
human ratings on responses to different prompts.
Then those raters basically say which ones,
which responses they liked more over the others.
Typically, it means that the thing that gets
up-weighted is more helpful responses,
things that are formatted nicely,
things that are safe,
like they say they're not offensive.
And those, it's really important to set up those evals
that you're benchmarking internally
to actually correlate with what your customers
actually care about in your end product.
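As a toy illustration of that kind of preference eval, with entirely made-up data: each rating record pairs a prompt with which of two responses (candidate vs. baseline) the rater preferred, and the internal benchmark is simply the win rate of the candidate model over the baseline.

    # Each record: one prompt and which of two responses the rater preferred.
    ratings = [
        {"prompt": "Summarize this doc", "preferred": "candidate"},
        {"prompt": "Write a SQL query",  "preferred": "baseline"},
        {"prompt": "Draft an email",     "preferred": "candidate"},
    ]

    win_rate = sum(r["preferred"] == "candidate" for r in ratings) / len(ratings)
    print(f"candidate preferred over baseline in {win_rate:.0%} of comparisons")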
And I think that that's something that
it's kind of a new way of operating
because you're like, these systems aren't deterministic,
like, you know, like software as we know it. And so when you're shipping something that is probabilistic,
that is going to work in some cases and not work in other cases, you have to come in with some degree
of confidence. Like, you know, when we're coming to a customer, sometimes their use cases will not be
a good fit for us, because we built evals and we were able to predict that actually, for these use
cases, the models are not ready yet. Yeah. Can you give us an example of just a really simple
eval, like what that would look like? Yeah. So for example, like for coding,
right? That's kind of what we're building these autonomous coding models and the eval,
what is the eval there? The eval there will be, from a customer perspective,
will they actually merge the code that is proposed,
and how long of an interaction or back and forth
will it take them to merge it, right?
So then the question is, well,
we want that experience to be delightful for customers.
We're not going to, we don't want to like set up
complex evals for every customer
because that's just gonna be a waste of their time.
So it's how do we set up internal evals that are kind of representative of what our customers
care about?
And so an example of this is, well, if we care about the merge rate, like the merge
rate of pull requests from our customers, then we should be tracking like the merge
rate on similar kinds of tasks to our customers.
So you know, some things that we, right?
So we have different task categories like migrations,
cybersecurity vulnerabilities,
these sort of third-party like API breakages.
And, you know, your data team,
what it does on the eval side
is that it curates data sets that are representative of that.
And then for every version of our model,
right, we basically run a ton of evals,
and we have different evals for different use cases.
And we're seeing like where our models stack up,
you know, some of them they do better,
some they do worse,
but it allows us to come to customers,
and when we've identified a use case that is a good fit,
we have high confidence that it will be a delightful experience.
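A minimal sketch of that kind of internal eval, with made-up numbers and task categories rather than Reflection's actual data, just tracks merge rate and review rounds per task category.

    from collections import defaultdict

    # One row per agent-generated pull request on curated tasks that mirror customer work.
    results = [
        {"category": "java_migration",   "merged": True,  "review_rounds": 1},
        {"category": "java_migration",   "merged": False, "review_rounds": 3},
        {"category": "api_breakage_fix", "merged": True,  "review_rounds": 0},
        {"category": "api_breakage_fix", "merged": True,  "review_rounds": 2},
        {"category": "security_patch",   "merged": False, "review_rounds": 4},
    ]

    by_category = defaultdict(list)
    for row in results:
        by_category[row["category"]].append(row)

    # Only pitch a use case to customers once its merge rate clears a chosen threshold.
    for category, rows in by_category.items():
        merge_rate = sum(r["merged"] for r in rows) / len(rows)
        avg_rounds = sum(r["review_rounds"] for r in rows) / len(rows)
        print(f"{category:18s} merge rate {merge_rate:.0%}  avg review rounds {avg_rounds:.1f}")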
And I don't think most teams that build products like this, but do not come from a research
background, are as scientific about it, because setting up the evals takes a really long
time.
And it's just kind of a pretty complex process, right?
Where are you going to like, where are you going to source like the coding raters who
are going to basically rate whether you'd merge these things or not?
How are you going to manage that team?
Where are you going to source the tasks from that are representative of what your customers care about?
These are the kinds of questions that the data team answers and more so beyond that,
it's how do we collect the data that we need to train models to be good at the things that
the customers care about?
There are various aspects: how do we collect data for supervised fine-tuning?
How do we collect data for reinforcement learning?
You need to be as nimble on data research as you are on basically software and model
research.
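To illustrate the difference between those two kinds of data with a tiny, hypothetical example: a supervised fine-tuning record carries a demonstration of the desired output, while a reinforcement learning record only carries a success criterion that can be checked automatically.

    # A supervised fine-tuning record pairs an input with a demonstration,
    # typically written or approved by a person.
    sft_record = {
        "prompt": "Write a function rev(s) that reverses a string",
        "demonstration": "def rev(s):\n    return s[::-1]",
    }

    # A reinforcement learning record needs no demonstration, only a criterion
    # the training loop can check automatically: did the thing work or not?
    def reward(candidate_code: str) -> float:
        scope = {}
        exec(candidate_code, scope)            # run the model's attempt
        return 1.0 if scope["rev"]("abc") == "cba" else 0.0

    rl_record = {"prompt": "Write a function rev(s) that reverses a string",
                 "reward_fn": reward}

    print(rl_record["reward_fn"](sft_record["demonstration"]))   # prints 1.0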
We think a lot about algorithms and model architectures and things like that. And the thing that maybe is equally important but less frequently talked about, like in
papers, is the data research, the operational data research, that needs to go into
making sure that these systems are reliable at the things you care about.
Right.
Love it.
That's so interesting and very true of people too.
Well, I was just going to say there's a lot of timeless wisdom in that approach as well.
Probably a lot of our listeners have tried tools that can help them write code.
If you're just starting out in this space, the best thing is to use coding products like Copilot or Cursor, which are, as you're talking about, cruise control.
I think that that's how I actually started using both products.
A lot of members of our team use those products and they've been very, very informative. And as I said, in a sense, sort of complementary.
I think that getting autonomy right and getting agency to work is a more complex and nuanced
problem.
And typically what we find when we talk to customers, by the time they're thinking about
autonomy and agency, they've already been using copilot for some time.
And they're pretty well educated on what kinds of problems they believe they have or can be automated. So if
it's someone coming from a blank slate, I would, you know, take an off-the-
shelf product like a Copilot or a Cursor and give that a shot, and sort of start
just trying it out empirically and seeing what sorts of value it drives
for them. Love it. All right, well, Misha, best of luck
as you continue to dig into research and build products.
And when you're ready to come out of stealth mode,
of course, you know, tell John and I,
so we can, you know, kick the tires,
but we'd love to have you back on the show
to talk about some product specifics in the future.
That sounds great.
Thanks, Eric. Thanks, John, for having me.
The Data Stack Show is brought to you by RudderStack, the warehouse native customer data
platform.
RudderStack is purpose-built to help data teams turn customer data into competitive
advantage.
Learn more at rudderstack.com.