The Changelog: Software Development, Open Source - The era of durable execution (Interview)
Episode Date: April 10, 2025
Stephan Ewen, Founder and CEO of Restate.dev joins the show to talk about the coming era of resilient apps, the meaning of idempotency and what it takes to achieve it, this world of stateful durable execution functions, and when it makes sense to reach for this tech.
Transcript
Okay, friends, it's time for your favorite podcast. Welcome to the Changelog, where we feature
the hackers, the leaders, and those who are building durable execution functions. Today,
Jared and I are joined by Stefan Ewen, the founder and CEO of Restate, talking about the coming era of resilient applications, the meaning of,
and what it takes to achieve, idempotency,
this world of stateful, durable execution functions,
and when it makes sense to reach for this tech.
A massive thank you to our friends
and our partners over at fly.io.
That is the home of changelog.com. Learn more at fly.io. Okay, let's get resilient.
Well friends, before the show, I'm here with good friend David Hsu over at Retool.
Now David, I've known about Retool for a very long time. You've been working with us for
many, many years. And speaking of many, many years, Brex is one of your oldest customers.
You've been in business almost seven years. I think they've been a customer of yours for
almost all those seven years, to my knowledge. But share the story: what do you do for Brex,
how does Brex leverage Retool, and why have they stayed with you all these years?
So what's really interesting about Brex is that they are an extremely
operationally heavy company.
And so for them, the quality of the internal tools is so important
because you can imagine they have to deal with fraud.
They have to deal with underwriting.
They have to deal with so many problems.
They have a giant team internally, basically just using internal tools day in and day out.
And so they have a very high bar for internal tools.
And when they first started, we were in the same YC batch, actually.
We were both at Winter 17.
And they were, yeah, I think maybe customer number five or something like that for us.
I think DoorDash was a little bit before them, but they were pretty early.
And the problem they had was they had so many internal tools they needed to go and build, but not enough time
or engineers to go build all of them. And even if they did have
the time or engineers, they wanted their engineers focused
on building external-facing software, because that is what
would drive the business forward. Brex mobile app, for
example, is awesome. The Brex website, for example, is
awesome. The Brex expense flow, all really, you know, really
great external software.
So they wanted their engineers focused on that as opposed to building internal CRUD
UIs.
And so that's why they came to us.
And it was honestly a wonderful partnership.
It has been for seven, eight years now.
Today, I think Brex has probably around 1,000 Retool apps they use in production, I want
to say every week, which is awesome.
And their whole business effectively runs now on Retool.
And we are so, so privileged to be a part of their journey.
And to me, I think what's really cool about all this
is that we've managed to allow them to move so fast.
So whether it's launching new product lines,
whether it's responding to customers faster,
whatever it is, if they need an app for that,
they can get an app for it in a day,
which is a lot better than, you know, in six months, or, for example,
having to schlep through spreadsheets, et cetera.
So I'm really, really proud of our partnership with Brex.
OK, Retool is the best way to build, maintain, and deploy internal software.
Seamlessly connect to databases, build with elegant components,
and customize with code, accelerate mundane tasks and free up time for the work that really matters for
you and your team. Learn more at retool.com. Start for free. Book a demo.
Again, retool.com. We are joined today by Stefan Ewen from restate.dev.
Stefan, welcome to the Changelog.
Hey, thanks for having me.
It's a pleasure.
Pleasure to have you as well.
Adam, how you doing, man?
So good, Jared, how about you?
I'm doing well.
Always excited
at the beginning of a conversation
to dig into something new, something different,
and something called Restate.
This is supposed to be the simplest way
to build resilient applications.
This is a requested show, Stefan,
so we do take episode requests.
This listener would like to remain anonymous.
However, they say that Restate is a super exciting approach
to managing distributed systems.
And they say that we should get you on the show.
And so we just take orders around here
and our listeners often get what they want.
And so that's how we found you.
It was a listener request.
Awesome, that's very cool to hear.
That is open source communities at work and all that.
That's right.
So Restate, let's not get into Restate itself at first.
Let's talk about resilient apps first,
because that's in your tagline:
The simplest way to build resilient applications.
Let's talk about that.
What is exactly a resilient application in your estimation?
Okay, yeah, so in the way we think of it
in the context of Restate,
we're talking mostly about the backends of applications,
the sort of coordination and orchestration logic. A resilient application would be an
application that doesn't accidentally drop your order, that
doesn't accidentally place it twice if you hit F5 at the wrong
point when you're in the browser, that doesn't, you know,
accidentally book your Uber for two people instead of for one,
that doesn't, you know, just disconnect you from your chat bot, lose the history, make
you start over, all these kind of things.
This is what we're talking about when we mean resilient apps in the context
of Restate.
So basically apps that tolerate all sorts of hiccups, errors in the infrastructure, unavailable
endpoints, you know, network failures, process failures,
temporary outages, but also, you know, types of
programming glitches that cause requests to fall through
and having to be retried in order to, you know, be processed
reliably, but then not get duplicated,
the system understanding how to idempotently
treat them as a retry and not as a second request.
This is sort of the bigger picture of what we mean here
with resilient applications.
Can you take a moment to demystify that term
that you just said, idempotently?
So everyone's on the same page.
What does that mean, idempotently?
Yeah, I think idempotency, you can think of it
as just understanding that a repeated request
is actually not a new request,
but it's the same request again.
You're just sending it again because you were
maybe disconnected from the original request
or you got an error back.
It's the thing that, if you don't do it correctly,
is actually accidentally placing an order twice
when you only tried to place it once.
It's the thing that many applications don't get right
and that's why you still see too many websites saying,
don't hit F5 while this is showing.
We just don't want-
Do not reload while we finish this transaction.
Exactly, because it doesn't understand
that you submit the thing again,
and this is actually supposed to be the same thing.
It's just like another submission of the same request.
It has no way of identifying that,
so it might accidentally treat it as a second one
or it has just like a very rough way of doing this.
It's a surprisingly complicated problem
and there's like lots of applications
that don't get it correct
and a lot of weird ways applications work around this.
As a fun fact, I think the first bank where I was a customer,
they only allowed you to wire a certain amount
to a certain recipient once per day. If you were trying to wire
the same amount to the same person a second time that day, they just wouldn't allow it, because
they didn't know if that was a retry from your browser or something stuck
in a queue at some point in time. That's just how they did deduplication. And yeah, so generally,
idempotency is deduplication
in a meaningful way, understanding retries versus new requests.
Yeah, that's called overkill, I think, when they did that.
They're like, how can we just use a blunt tool
to solve this tiny little problem?
Let's just not let you do more than one per day.
Not ideal, surely, for lots of uses.
Okay, so same request twice, operates the first time,
won't operate the second time.
Generally speaking, how do you achieve this?
I mean, you could just limit to one request per day,
but if you're not gonna do that,
how do people usually implement
or ensure idempotency in their applications?
Yeah, so in a way, you basically have to find a way
to anchor the identity of requests all the way through.
There's different standards for doing that.
The HTTP standard has actually
defined a header where you can put in
an idempotency key, and services, when they support this,
are supposed to understand that if that key is set
and a previous request with the same key
and the same parameters has come in, this is a duplicate request
for the same operation.
But then down the road, you basically just try
to anchor requests and the processing
of different steps in each other.
When you do message queues, you place correlation IDs;
when you're working with databases,
maybe you try to use primary keys or transaction IDs
or leases or tokens.
There's tons and tons of tricks people do,
but it's ultimately still a very hard problem
if you wanna do it end to end.
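To make the idempotency-key pattern concrete, here is a minimal sketch in TypeScript. It is an illustration, not any particular service's implementation: the in-memory store and the placeOrder handler are hypothetical, and a real service would persist keys durably (say, a table with a unique constraint and a TTL) and also handle two concurrent requests racing on the same key.

```typescript
// Minimal sketch of idempotency-key deduplication (hypothetical handler).
// The Map stands in for a durable store keyed by the Idempotency-Key header.

type StoredResponse = { status: number; body: unknown };

const processed = new Map<string, StoredResponse>();

// Hypothetical business operation: place the order exactly once.
async function placeOrder(order: { item: string; qty: number }): Promise<StoredResponse> {
  // ...charge the card, insert the order row, etc.
  return { status: 201, body: { ok: true, item: order.item } };
}

async function handleCreateOrder(
  idempotencyKey: string | undefined,
  order: { item: string; qty: number },
): Promise<StoredResponse> {
  if (!idempotencyKey) {
    return placeOrder(order); // caller opted out of deduplication
  }
  const prior = processed.get(idempotencyKey);
  if (prior) {
    return prior; // a retry: replay the recorded response, do no new work
  }
  const result = await placeOrder(order);
  processed.set(idempotencyKey, result); // record before acknowledging
  return result;
}
```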
It's kind of a mindset.
You have to set out to do it, don't you?
Otherwise.
I think you have to design for it from the start.
Like, it's one of these things where the weakest link
in the chain breaks it, yes.
Or you can just say, do not reload your page
while you're hitting this API.
Exactly.
That's the simplest way to do it.
Just like make it the user's problem,
don't handle it in your infrastructure, yeah.
Also, as an engineer, when I've achieved idempotency,
I know it feels very good when you're like,
okay, I'm for sure not gonna do double execution
in this particular code path.
That feels good.
And as an end user, I'm also happy
when I know that I'm not gonna get charged twice,
for instance.
That's right. So you can actually see good APIs
design this in from the start.
I think the first API I came across
where I thought this was really, really well done
was Stripe's payment API.
You can really see, I think that's also why
it got so insanely popular so quick.
The way they made that handling just seamless
for folks that embedded this into their code,
to really understand how do I make sure
I accept this request once and deal with all these things.
That was stellar.
All right, so here's a harder question.
Where does the word idempotency come from
and why do we use that to describe this thing?
It seems unnecessarily verbose and jargony.
Do you know?
Just spell it first, Jared.
I-D-E-M-P-O-T-E-N-T would be the adjective.
Do you know?
If you don't know, you can just say I have no idea.
But if you know, it'd be awesome.
I don't really know, but my guess is it's a Latin word.
It comes from Latin.
It does sound like that.
It's not like, yeah.
I don't know.
It's all Greek to me.
Maybe we can get a real time look up for that
and follow up on it from some sort of LLM.
Just prompting Adam behind the scenes to prompt
his favorite LLM.
I'm on Wikipedia as we speak.
Okay, good. I was stalling for you.
You know, I don't really have the details here for you, Jared.
I'm sorry I can't LLM quickly enough for you,
but it says idempotence is the property
of certain operations in mathematics.
All right, I went straight to the LLM
and I got the answer to my question.
So the term idempotent comes from Latin roots.
So Stefan, excellent call there.
Idem meaning "the same" and potent meaning "having power"
or "being able to."
Put together, idempotent roughly means
having the power to remain the same.
So there's the actual word.
And then yes, mathematics, blah, blah, blah.
I've stopped reading now,
so hopefully that wasn't a hallucination
and we can all move on.
Yeah, that sounds about right.
Like if it's a hallucination, it's a close one.
I've learned that.
Yeah, idem plus potence, same plus power.
There you go, it's the same power.
Very cool.
Well, thank you for scratching my itch,
my curious itch there.
Both of you and ChatGPT, I suppose, pitched in on that one.
What else?
So we're talking resiliency.
I'm curious, obviously, having a resilient app
is just good, right?
Like, who wouldn't want these things?
And just taking one of those things,
like idempotency and realizing it's hard to achieve
on your own throughout especially
more complicated applications.
Why did you all set out to solve this problem for folks
to build Restate and why did you feel like
I'm the guy for the job?
Yeah, that has a long answer and a short answer.
I can start with the short answer.
The short answer is, I would say the state of the art for building backends
that do any non-trivial state management and coordination
is completely unsustainable, the way we're building this today.
Just to give you an example. Let's
actually stay with LLMs, because we just
talked about it, right? So let's say you're building a chatbot, you're submitting something like a
message there. This thing in the end has to reach the
LLM, but it has to look up the context
in which that chat happened before.
It has to make the call, has to go back,
store the context.
You don't want it to just lose everything
if you lose your connection in the middle.
If you, let's go with the F5 thing again,
you don't want it to actually trigger the same request twice
or lose the entire session, make you start over.
So you're probably just putting this as an asynchronous request that runs in the
background, that you're sending from your chat session, from your
browser. But it's a separate, asynchronous request that talks to the LLM. You
want it to actually retry in case something fails or is overloaded and it's throttling you.
And then you want to be able to reconnect to that task or request in case
something goes wrong in your browser or you accidentally hit the back button or whatever.
Just implementing this is a surprisingly complicated thing where you start to stitch together
like probably a queue, a database, and a bunch of tasks to manage that. To give you another example,
We just talked about Stripe, right? So let's say you're sending a request there for a payment, and sometimes they tell you,
look, this is good or bad, like we accepted it or didn't. Sometimes they tell you, I don't
really know, the fraud detector is still running, or we have some weird thing in the
background that we're still asking and it hasn't told us. So I'm going to send you a
webhook in a moment to tell you whether this went through or not. And now you have like
a synchronous request there
and then somewhere else an asynchronous request coming in.
You just want to make those two reliably meet.
Even if this one fails, you want it to sort of like
recover somewhere, understand where to, you know,
reconnect with that web hook that you're awaiting.
And this little piece, it's really
just one case of handling in the backend
where Stripe says, okay, I'm processing,
instead of yes or no.
There's actually many days of work to make that work reliable.
And it's lots and lots and lots of things like this that just get in the way, with
so many moving pieces, so many APIs to talk to, so much more work happening
asynchronously in separate requests than just in the synchronous user interaction.
Just gluing those all together has become such a complicated thing that we felt this
does need a better solution.
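To picture the "make the synchronous request and the webhook reliably meet" problem, here's a small TypeScript sketch. It is deliberately naive: all names are hypothetical, and the pending map lives in memory, so a crash loses the awaited future, which is exactly the gap durable execution systems close by persisting it.

```typescript
// Sketch: a synchronous checkout path awaiting an asynchronous webhook.

type Outcome = "succeeded" | "failed";

const pending = new Map<string, (outcome: Outcome) => void>();

// Synchronous path: the provider answered "still processing", so park a
// future that the webhook will later resolve.
function awaitWebhook(paymentId: string): Promise<Outcome> {
  return new Promise((resolve) => pending.set(paymentId, resolve));
}

// Webhook endpoint: the provider's asynchronous answer arrives.
function onWebhook(paymentId: string, outcome: Outcome): void {
  const resolve = pending.get(paymentId);
  if (resolve) {
    pending.delete(paymentId);
    resolve(outcome); // the two requests "meet" here
  }
  // No entry means this process restarted while waiting; a durable runtime
  // would recover the persisted future instead of silently dropping it.
}

// Usage on the checkout path:
async function checkout(paymentId: string) {
  const outcome = await awaitWebhook(paymentId); // may take minutes
  console.log(`payment ${paymentId}: ${outcome}`);
}
```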
This is the motivation more from, let's say, the use case side.
I can give you a motivation more from why we actually
ended up doing this. I think this is a motivation that
probably lots of folks stumbled across, namely that
this needs a better solution. I think there's
different projects approaching that
problem. Why we're approaching it the way we are has to do with where we come from, like
before we worked on Restate, we were building Apache Flink,
which is a different system, a stream processing framework, basically
events and analytics. So you know, you have these events
coming in, often through a message queue, and you want to,
you know, aggregate them, join them.
Just a few examples of where this is used: fraud detection in banks, where some payment events
go in, you aggregate feature vectors, and you throw them at a fraud model. Or I think, you know,
things like the TikTok recommender use Flink to actually join information from users and interactions
together in real time and understand how to update the features that will
go into the recommendation model. I think companies like Uber use it to determine
pricing and traffic models and ETAs. So it's whenever you have events and you want to analyze
them in a way that you aggregate them into some sort of typically
statistical value or a materialized view. This is what
we were building before. So it's an analytical framework. What
did actually happen then is at some point in time, we saw folks
were using that thing to solve distributed transactional processing, the types of things
where you would say, hey, let's assume an order
processing service that, you know, takes the event that a user
checked out an order. And it has to do a bunch
of steps, let's say update the inventory, trigger payment, call
the service to prepare logistics, maybe call another service to put this in the user's history,
maybe more steps and so on.
And we started to see folks using Flink for that
because it had this interesting property
that it had sort of this baked in way
of reliable communication and state management.
It was all built for analytical use cases, but they found this such an interesting property
that they started to apply this to the transactional use cases as well, like order processing.
Just because they found that this is otherwise way too complicated to build and way too easy
to build it in such a way that it is brittle, not scalable, that it has corner cases where
it violates a lot of these properties that we just said. When this started happening
repeatedly we thought, okay, but apparently there isn't really a good tool out there yet.
Apparently this property of correct stateful coordination is something people really appreciate.
They feel like these types of frameworks make their life easier. And then we set out to build a solution for
that. And that became Restate. In many ways, actually, from
the way it approaches things, from its architecture, it's
inspired by our work on Apache Flink. But it's
almost a complete mirror image implementation of it. It
takes almost the opposite design choice in most aspects, because
it's really optimized for low-latency transactional processing rather than high-throughput analytical
processing, which was Flink. But what we retained from this idea is, yes, stateful
orchestration and an event-driven foundation and so on.
This is something we should build
and we should be working on and that became Restate.
What's the timing on all this?
When did Flink start?
How mature was it or is it?
And when was Restate born out of the idea?
Like, give us a context in time.
Yeah, so this context is measured in decades, I think.
Okay, that's useful.
So yeah, I mean, Flink was officially founded in 2014,
but the work that became Flink goes back to 2010,
when I was still in university.
So it was like four years of academic work. And then
Flink started in 2014. I worked on it until 2022, so eight years after it became
an open source project. And then I left Flink because I needed to work on something else,
like, for a change. And then we started working on Restate, like end of '22, early '23. So Restate
is a bit over two years old now. Flink had its 10th anniversary last year. Flink is super
mature. I think it's used by thousands and thousands of companies at absolutely insane scale.
The probably largest installation of Flink that I know must be Alibaba, who run tens of thousands of cores for a single processing pipeline to live-
compute their e-commerce search and recommender continuously. Restate is not quite as old and not quite as mature
as that.
Yeah, two years.
It's been two years. But we released our stable 1.0 last summer, and we recently released
our first distributed, replicated, highly available architecture. And we have a bunch of folks
that actually productively use that.
So I would say we're on course to get there.
Nice.
Was leaving Flink difficult, sad, joyous?
Like what was that like when you left?
Cause that's a long time to work on one thing
and then to move on to that thing.
I mean, you become pretty attached, don't you?
Yeah, yeah, absolutely.
So it is definitely a difficult thing.
And it was not just leaving Flink.
So we created Flink as an open source project,
but we also built a company around that.
And the company went through an acquisition,
but we actually still stayed there,
started building and growing the team.
So I was simultaneously leaving the open source
project and the company I built and everything. And absolutely, it's a difficult thing because
that becomes like your babies, actually two babies in a way, right? The project and the
company. But yes, I feel that at some point I felt like, I don't know how
to say this in English, but I felt I was getting this tunnel
vision on problems. I'd been working on this for so long, and I'd seen so many sort of
repeated things, that whenever I heard a problem, I was just putting it
into this category that I knew. And it started happening that I did that with
things and then later realized, oh no, that wasn't actually the right thing for that
problem. That might have been right for the last nine,
but this one should actually have been different.
And so you get this kind of,
if you work too long on the same thing,
you're starting to not see the forest for all the trees.
And I felt I was reaching that point.
So it sounded like, yeah,
I should probably start doing something new.
It's kind of like familiar grooves in your brain, you know?
If you just go to the same thing
over and over again, it's like your brain gets new grooves
and then different problems fall into those familiar grooves
and they just slide in there.
Sometimes when they don't even fit
or they aren't a good fit.
I certainly understand that.
I think when you've focused on one thing
for a very long time,
it is hard to think outside that groove.
And it's interesting that you're building a similar system,
spiritually similar,
but I guess you could say radically different architecture.
Is that the way you described it?
Like it's inspired by Flink,
but it seems like it takes
the opposite approach.
Yes, I think you could call it like that. It is, under the hood, still an event-driven system,
as Flink is, but it's just built for completely different trade-offs. So just to give
you an example, the core of Flink is this exactly-once stateful processing.
So it basically has the data streams that keep moving, operators that do
stateful stuff with events, count, join and so on.
And then there's this asynchronous process running in the background that
takes these consistent snapshots.
So if something goes down, you can restore the state from the snapshot and sort of
just start the flow from there. And it has this kind of clever way to do this in a way
that maintains consistency across all the parallel machines, and it does that efficiently,
incrementally, frequently and so on.
But it's a very throughput optimized thing.
So it stays off the critical path, if you wish; it runs in the background.
So yeah, it's really good for throughput, but,
you know, it does this persistence operation once every couple of seconds if, you know,
you tune it to be very, very frequent. Most people actually run it more in the order of
minutes, right? So when something goes down in a pipeline, you just replay the
last minute of data, which typically doesn't quite take a minute, because replay is faster than the rate at which the events are produced.
But still, it takes you back a certain amount of time, which is usually okay for analytics.
The worst thing that happens is this feature in that vector that goes into that traffic
model or that recommendation model is maybe
a few seconds older than it would have otherwise been.
It's not such a big deal typically.
On the transactional processing side, imagine you have this multi-step process.
You want a really fast checkout process and you say, I want to just start the next step
after I know the payment has gone through.
Before that, I'm not updating inventory.
I'm not kicking off any of the
other processes, then you really need actually, like a persistent
step to be recorded ideally in milliseconds. Maybe it's not
that critical for the processing, but we're building
this for like even more low latency use case like payment
processing and settlement and so on. And there you really just
want to kick off the next step after you know, like the previous
step is persisted
like possibly in a multi-data center replication way.
And only then do I start the next thing.
So you have to really design this completely differently.
It's completely optimized for low latency,
transactional durability rather than analytical throughput.
So it's, yeah, it's a completely different design,
even though both ultimately are event-driven architectures.
Right.
But your atomic unit now is like, is compute, right?
Like it's logic.
And perhaps data that comes from that.
It's a transactional step.
And so you can't just skip a transactional step
because there's a workflow here
and certain things rely on other things.
And so the way that you think about durability
as opposed to analytical data is, like I said earlier,
radically different.
That makes sense to me.
Yeah, exactly.
The atomic step in Restate is extremely fine-grained, right?
Like we're really building this in such a
way that you should feel comfortable, in a program, using Restate to persist
fine-grained steps, state updates. It actually uses internally this durable mechanism
for a leader election to understand that it can lock and fence off different retries. And with
such a fine-grained
nature, what's really important is that recording a durable step
has the lowest possible latency. Whereas in Flink, the atomic step
is like a couple of million events being aggregated
together in some state spread over ten machines. So that's
one atomic step. It's completely different, yes.
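The "lock and fence off retries" idea deserves a tiny illustration. The sketch below is not Restate's actual mechanism, just the generic fencing pattern it alludes to: every new retry or leader acquires a strictly higher epoch, and writes carrying a stale epoch are rejected, so a paused "zombie" execution cannot overwrite newer work.

```typescript
// Generic fencing sketch (illustrative; not Restate's implementation).

let nextEpoch = 0;
const record = { value: "", epoch: -1 };

// Each retry or newly elected leader gets a strictly higher epoch.
function acquireLease(): number {
  nextEpoch += 1;
  return nextEpoch;
}

// Writes must carry their epoch; stale epochs are fenced off.
function fencedWrite(epoch: number, value: string): boolean {
  if (epoch < record.epoch) {
    return false; // a zombie from an older lease: rejected
  }
  record.value = value;
  record.epoch = epoch;
  return true;
}

// A paused "zombie" resumes after a newer retry already wrote:
const zombie = acquireLease(); // epoch 1
const retry = acquireLease();  // epoch 2
console.log(fencedWrite(retry, "step done"));   // true: accepted
console.log(fencedWrite(zombie, "stale step")); // false: fenced off
```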
Well friends, I'm here with a good old friend of mine, Terrence Lee, cloud native architect
at Heroku. So Terrence, the next gen of Heroku, called
Fir, is coming soon. What can you say about the next generation of Heroku?
Fir represents the next decade of Heroku. You know, Cedar lasted for 14 years and more.
Still going. And Heroku has this history of using trees to represent ushering in
new technology stacks and foundations for the platform.
And so like Cedar before, which we've had for over a decade,
we're thinking about Fir in the same way.
So if you're familiar with fir trees at all, Douglas firs,
they're known for their stability and resilience.
And that's what you want for the foundation of a platform that you're going to trust your business on top of.
We've used stacks to kind of usher in this new technology.
And what that means for Fir is we're re-platforming
on top of open standards.
A lot has changed over the last decade.
Things like container images and OCI and Kubernetes
and Cloud Native, all these things have happened in this space.
And instead of being on our own island,
we're embracing those technologies and standards
that we help popularize
and pulling them into our technology stack.
And so that means you as a customer
don't have to kind of pick or choose.
So as an example, on Cedar today,
we produce a proprietary tarball called Slugs.
That's how you run your apps.
That's how we package them.
On Fir, we're just gonna use OCI images, right?
So that means that tools like Docker
are part of this ecosystem that you get to use.
So with our Cloud Native Build Packs,
you can build your app
locally with the tool called Pack and then run it inside Docker. And that's the same kind of
basic technology stack we're going to be running in Fir. So you can run them in your platform as
well. So we're providing this access to tools and things that developers are already using
and extensibility on the platform that you haven't had before. But this sounds like a lot of change,
right? And so what isn't changing? And what isn't changing is the Heroku you know and love.
That's about focusing on apps and on infrastructure
and focusing on developer productivity.
And so you're still gonna have that
get push Heroku main experience.
You're still gonna be able to connect your applications
and pipelines up to GitHub, have that Heroku flow.
We're still about abstracting out the infrastructure
from underneath you and allowing you as an app developer
to focus on developer productivity.
Well, the next generation of Heroku is coming soon.
I hope you're excited because I know a lot of us,
me included, have a massive love
and place in our heart for Heroku.
And this next generation of Heroku sounds very promising.
To learn more, go to heroku.com slash changelog podcast
and get excited about what's to come for Heroku.
Once again, heroku.com slash changelog podcast.
This word durability is being used a lot.
Durable execution, durability.
What exactly is durability?
Doesn't fall down, doesn't break?
Always good?
Yeah, something like that.
I think durability is probably the same as persistence,
maybe with a bit of a stronger emphasis on
it really doesn't get lost after it happens.
So durability is the D in ACID
when it comes to databases, right?
Databases say we're giving you atomicity, consistency,
isolation and durability. Once you do an update, we're not
going to lose it, no matter what crashes. The database
has a mechanism to bring that change to the database
back. If I told you I've recorded that row, I've recorded
the change, it will be there no matter what. And in
the context of Restate, what that means,
for example: the core building block of Restate is a stateful durable function,
you can think of it like that. And with a stateful durable function, when you schedule an invocation
for it, as you go through the code of that stateful durable function, it has multiple steps,
and it's recording each step. Whenever you go beyond a step that you
asked Restate to treat as durable, you know that no matter
what happens, you will never re-execute that step, you'll never
come up with a different value. If your machine goes down,
the Restate server goes down,
if you deploy it across availability zones and the data center goes down, the network gets
partitioned, whatever, you'll never ever go back and re-execute that step. If it once
told you that it's done, that's the meaning of durability. Once it says it's there, it's always going to be there. And I think this
is, in a way, I'd say almost one of the magic ingredients, the way Restate looks
at making distributed application development simple. I'd say there's two core pieces that you need to think about. One of
them is the durability. Make durability extremely fine-grained and extremely cheap. Because
if you can apply durability in fine-grained steps, you always have to worry about very
little after a failure. Let's say your durability is coarse-grained. Let's say the order workflow is
one durable step, right? And it crashes in the middle. It gets retried. It's up to you to figure
out, well, did I actually process the payment already or not? Maybe there is a way to just
assume, okay, it's idempotent, I can send it again, or I might not even be able to ask the service,
did I do that or not? Did I actually decrement the available count of product already or not? Maybe I have a
way to again make this durable or not. I don't know. These
things tend to be harder than one thinks because sometimes the
API gives you, you know, it might have given you an error
back the first time and you thought I didn't do this and
followed some control flow path. And then the next time you
actually get not an error, but the real result and then you
follow different paths. So people mess this up all the time. It's really hard to reconcile if you have these multiple steps
as a coarse atomic unit. What did I do? How did I do it the last time? How do I recover
from this? But if you have extremely fine-grained durability, if you're recording every individual
step as durable in the system, and when it comes back, it can tell you exactly like this
was the last step that you recorded, then you just have a very small amount of uncertainty. Okay, here's this one thing that I might have
tried already. I have to just worry about that bit instead of the whole history and
possible control flow and all the choices how I might have ended up here that I need
to reconstruct in order to proceed consistently from there. So just like very fine-grained
durability is extremely powerful and simplifying. I'd say the second magic ingredient is then how do you anchor this in the
whole retrying and resolving potentially inconsistent situations with partitions, with timeouts,
with zombie processes and so on, so that there's always a very consistent view of what the last
durable step was. I think that's the second sort of ingredient of Restate.
It's not just durability, it's actually durability and consensus.
And I'm giving you a very, very crystal clear view on where you left off, where you need
to continue from.
I think if you take those two things in conceptually, you've simplified the problem massively and
the rest is almost API sugar
that you build on top of that. That's the magic that happens in the Restate runtime.
It's a very low-latency, durable consensus log that fuses queuing, state management,
locking, fencing,
creating futures, resolving futures, like all these kind of operations that tend to be part
of a distributed coordination process.
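A toy model of this "journal of fine-grained steps" idea might look like the following TypeScript sketch. Everything here is hypothetical and in-memory: a real runtime persists the journal with consensus. The replay logic is the essence: on a retry, already-recorded steps return their journaled results instead of re-executing their side effects.

```typescript
// Toy journal of durable steps (in-memory; illustration only).
const journal: unknown[] = []; // results of completed steps, in order
let cursor = 0;                // position within the current execution

async function durableStep<T>(fn: () => Promise<T>): Promise<T> {
  if (cursor < journal.length) {
    // Replay after a crash: the step already ran; return its recorded
    // result rather than re-executing the side effect.
    return journal[cursor++] as T;
  }
  const result = await fn(); // first execution of this step
  journal.push(result);      // a real runtime persists this before continuing
  cursor++;
  return result;
}

// An order workflow written as plain code over durable steps. On a retry,
// reset the cursor and re-run: completed steps come straight from the journal.
async function processOrder(orderId: string) {
  const paymentId = await durableStep(() => chargeCard(orderId));
  await durableStep(() => decrementInventory(orderId));
  await durableStep(() => scheduleShipping(orderId, paymentId));
}

// Hypothetical side effects, stubbed for the sketch:
async function chargeCard(id: string) { return `pay-${id}`; }
async function decrementInventory(id: string) { return true; }
async function scheduleShipping(id: string, pay: string) { return `ship-${id}`; }
```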
When you say the restate runtime,
what can you liken that to,
for those of us who don't know what a restate runtime is?
Is it like a Node.js thing?
Is it like a database?
What is that?
Yes.
So using Restate is a bit like, I would say, somewhere in between using a database and
using a message broker.
So you write your program pretty much as code, how you would write it before. But you're using the Restate SDK.
Think of it a bit like your database driver,
in order to sort of
wrap certain operations as, okay,
this operation here should be recorded as a durable step,
or attach this state to the invocation transactionally,
or, you know, create
this future, complete this future and so on. So you do these operations through
the Restate SDK.
Restate itself is then, maybe a message broker is the best comparison. It's on the level
of the message broker. So when you invoke your code, you're not calling it directly,
you're actually calling Restate, which makes the invocation of your
function on behalf of you. The programming model that we try to provide is you're writing a service
that looks like an RPC service, like you're writing RPC handlers, and then Restate almost looks like a
reverse proxy for you. So the other services, instead of calling the code directly, they call
it indirectly through Restate; Restate proxies the call. And it puts itself in the middle
with its durable consensus log. And when it forwards the request to the service, it
isn't forwarded naively as an HTTP request; it actually uses an invocation protocol,
it uses HTTP/2 or another type of streaming connection, holds onto that connection, allows
the service to sort of synchronize fine-grained steps. It will, when it forwards an invocation, for example, tell it
exactly what the supposed state of the world should be, as in, here's the steps I know you
should treat as completed, here's where you should continue. And it will then allow the application to use that connection, that sort of lifeline,
you know, to create durable actions.
So yeah, it's on the level of a broker or database, looks like a reverse proxy to the invoker,
looks like, maybe almost like a database, to the service that uses it.
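For a feel of what that looks like from the application side, here is roughly the shape of a handler written with Restate's TypeScript SDK. The names follow my memory of the SDK's documented surface and may differ by version, and the downstream calls are hypothetical stubs; treat it as a sketch, not a reference.

```typescript
import * as restate from "@restatedev/restate-sdk";

// Hypothetical downstream calls, stubbed for the sketch:
async function chargeCard(orderId: string) { return `charge-${orderId}`; }
async function recordLedgerEntry(orderId: string, chargeId: string) {}

// A service whose handlers Restate invokes through its proxy and log.
// Each ctx.run() wraps an action as a journaled durable step: on a retry,
// completed steps are replayed from the journal, not re-executed.
const payments = restate.service({
  name: "payments",
  handlers: {
    process: async (ctx: restate.Context, req: { orderId: string }) => {
      const chargeId = await ctx.run("charge", () => chargeCard(req.orderId));
      await ctx.run("ledger", () => recordLedgerEntry(req.orderId, chargeId));
      return { chargeId };
    },
  },
});

// Expose the service; Restate calls in over a streaming HTTP/2 connection.
restate.endpoint().bind(payments).listen(9080);
```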
And when would somebody reach for this? Now you said distributed systems,
but some people think every network attached system
is a distributed system.
So I mean, if I'm building a web application,
let's call it a monolith that answers HTTP requests
and has a database backend, a Ruby on Rails or a Phoenix
or insert your Django,
insert your backend framework here.
Are those folks pulling in Restate
and using it for certain aspects of their workflows
or is that not necessary for them
because they are kind of a monolith?
Like do I have to be building
a services-based architecture?
Like where does it fit in?
I would very much be with you on like
almost any system we build is a distributed system.
Yeah, completely.
So I think it becomes useful very, very quickly. Maybe one way to think about this is
your backend is where you maintain state and update it and run operations and changes.
It usually has a database that has the sort of core business state; some operations go
purely straight to the database. That's all they do. That's fine.
But anytime you have to do something that's not straight
against this one core database, but something
that goes against a different API, something that runs in the
background, yeah, something that is asynchronous work that goes beyond just touching the database, I think
you already are at the point where it's starting to become
useful. Then, you know, if the only thing you're doing is maybe
forwarding one call, yes, maybe it's overkill. But I
think the usefulness starts much, much sooner than lots of folks realize.
I would say every time you think about pulling in a message queue, you should probably start
to think about pulling in something like restate because it gives you a way to do the things you're probably trying to do with a message queue,
but in a more high level, in a more well-defined concrete way. You're not treating events,
but you're dealing with stateful, durable invocations, stateful, durable functions all
of a sudden, which is very often what you really want. If you're putting something off the synchronous path with a queue, you very often want to say, okay, here's
something where I really care about that this happens. It shouldn't get lost, right? That's
why I'm putting it in a queue. And then you probably care about this thing happening once, having reliable retries, to quickly reach the state where
the processing of this operation is actually multiple steps. And then you're again in the
okay, how do I do reconciliation of multiple steps if it failed somewhere in the middle?
And I don't know what I already completed or not. So I would say the moment you start
to pull in a message queue, you probably should think about something like this. Like we said, the point comes very
quickly. Yeah. But that's the
simplest use case. I would say the most complicated
ones that we see people build with us right now is using this
to replace a complex choreography of multiple
Kafka topics and RabbitMQ queues and session servers
and workers and so on.
Or even like a distributed sort of payment ledger keeping system.
So it's really a very broad spectrum.
I would say, in a way, you could think all the type of work you do in the backend that's not the central database that keeps your business state
is ultimately where Restate comes in. Yeah, what about a scenario where
it's publishing? I'm thinking like TikTok or YouTube, for example. As a creator, we will upload videos to YouTube, there's a process that happens,
there's a certain orchestration that happens,
it has to be compressed,
it has to go through certain filters,
maybe there's even a content filter
it has to go through, a copyright filter.
Is that an example of where you would use
something like Restate where you want it to go,
you want the user to be able to upload properly
and your server capture the data
and all the good things, but you got to run it through a process of saying okay this is now content
that can be seen by what we call the world because it's been blessed by the copyright
filter etc etc. Is that a scenario where it makes sense?
Yeah absolutely. This is basically a workflow again if you think about it right? You're
uploading the video, let's say, maybe the
upload first puts it into some cloud storage.
But then as you said, you first pass it to the content filter, then you have maybe a
few steps that even run in parallel, like recoding it for different resolutions, optimizing
it to be served through the CDN and so on.
Then you're, I don't know, running it through a system that tries to figure out what's the
best sort of title frame to display
and like all these different steps that you do and they take potentially a long time. So it's a
long running process. There's a fair chance that the container goes down in the middle or wants to
be migrated and when it comes back up, you really want this to understand where did I leave off?
Like what are the processes I should reconnect to that are doing the encoding or the analysis?
Like this is exactly the orchestration of that process
is where we said would come in.
You wouldn't feed the video frames to the system.
That's like overkill.
You don't need to feed the video frames
to a transactional log.
Like it's just that you put them
in whatever cloud storage or so,
but the orchestration of the process,
of the workflow, of the pipeline that does that,
that's a very good Restate use case, actually, yes.
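Sketching Adam's upload pipeline as a durable workflow makes the shape clear. The helpers below are hypothetical stubs and the journal is an in-memory map; steps are keyed by name rather than position so the parallel branch replays correctly after a crash.

```typescript
// The video pipeline as named durable steps (illustration only).
const stepJournal = new Map<string, unknown>();

async function step<T>(name: string, fn: () => Promise<T>): Promise<T> {
  if (stepJournal.has(name)) return stepJournal.get(name) as T; // replayed
  const result = await fn();
  stepJournal.set(name, result); // a real runtime persists this durably
  return result;
}

async function publishVideo(videoId: string) {
  await step("filter", () => runContentFilter(videoId));
  // Independent steps run in parallel, each journaled under its own key:
  await Promise.all([
    step("1080p", () => transcode(videoId, "1080p")),
    step("720p", () => transcode(videoId, "720p")),
    step("cdn", () => optimizeForCdn(videoId)),
  ]);
  await step("thumbnail", () => pickThumbnail(videoId));
  await step("publish", () => markPublic(videoId)); // only after every gate
}

// Hypothetical stubs for the pipeline's long-running work:
async function runContentFilter(id: string) { return true; }
async function transcode(id: string, res: string) { return `${id}-${res}`; }
async function optimizeForCdn(id: string) {}
async function pickThumbnail(id: string) { return `${id}.jpg`; }
async function markPublic(id: string) {}
```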
That's a great example, Adam,
because it definitely makes it easy to think through.
I guess as people who upload to YouTube,
we are intimately familiar with all the different steps.
That's why I enumerated it so well.
And it's asynchronous,
because you can go about doing the other things
while it's working on the long-running tasks, for instance.
And somebody coded up some nice orchestration
behind that sucker to keep that thing running.
Yeah, and Google has the engineers
to code up reliable orchestration flows,
even in a way that they're nicely observable.
You can reconnect to them.
They're efficient, they know how to parallelize steps
and synchronize steps and so on. It's a much harder thing to do for many companies who don't hire the same
type of engineers as Google does. And I think, for those reasons, it actually makes
these types of things much more achievable than if you try to embark on that
journey without it.
Even when you were sharing how it worked early on,
you were saying that it seemed,
at least from my perspective,
it seemed like the burden was on the user every time.
Every new application, every new scenario, every new job,
every new what have you,
maybe even in your ordering scenario
that you sort of focused on for a bit there,
you keep recreating this durable invocation world over and over and over again.
And why not turn it into like you have done here with a server and a client and SDKs for
different languages and a flow that every developer can grab.
Is that kind of where what landed you to this point here was that frustration of the repetition
and repeating and rebuilding every single time you build an application?
I think that's a great way of putting it. Yes. I would say the number one alternative to
Restate that people use is roll-your-own. Absolutely. And it's a very repetitive process.
And most of the time, I would say folks don't really realize all the edge cases
that exist in what they do; they just maybe don't even solve them. So this is basically
half-baked roll-your-own. And it's, yeah, every time, again and again. And it's very
often very similar problems that you're solving.
Like, let's say you're taking a message queue to say, an action that I triggered should run
asynchronously, and it should, you know, have reliability, retries. And then I'm
pulling in another store like Redis or another key-value store to record different steps.
Then I might be pulling in something like ZooKeeper or etcd to place a lock on certain operations, so they don't happen concurrently; like, a retry shouldn't work on the same payment ID
if the lock is still being held by the original processor. And then you're trying,
going back to idempotency, to make an update to another system and understand how do you actually anchor
the ID of that processing forward into the update to that other system. You're
exactly recreating that type of pattern over and over and over again. And I think this
is where in a way workflow engines were originally born, if you wish, like enterprise workflow engines, which try to say, okay, let's try to define a flow where
we can have steps following a certain predefined control flow graph.
And we have the workflow engine giving you the guarantees that step B that follows step
A really only starts after step A is done, and step A is transactionally persisted before B starts, and so on. They tend to be extremely heavyweight and inflexible.
Yeah, they just break all the tools and everything when you want to interact with them.
And what Restate does, and what the durable execution space is trying to do in general, is kind of bring this level of guarantees in a very, very lightweight way into almost arbitrary programs, because
it's just such a useful power to have, to kind of define these durable steps, especially
if you don't have to branch out into a different domain-specific language or graphical
way to define them. But if you could just write your regular code,
but have it executed
with the same sort of guarantees
as if it was an enterprise workflow.
It seems like you might have a large education challenge
in front of you because there's so much thought
that has to go into this kind of architecture.
I think the fact that your largest competitor
is Roll Your Own means most people don't know.
Like, we kind of all discover this pattern slowly over time
inside of our own daily work,
and so I'm just curious how you think you can attack that,
or are there other people, like,
is there a common thread or movement
that you can attach to or create
in which people are like, yeah, here's this new style?
I thought of this because you mentioned workflow.
And I think that's sort of in the wheelhouse
or in the ballpark of what restate is, message queue.
I mean, is there like a simple idea or concept, pattern,
of which restate could be one,
or maybe restate is the brand?
But have you thought through this,
because you have a marketing problem here,
or a challenge, I should call it.
Yeah, so I think that is very true in many ways.
Okay, I think there's like lots of layers of answers to that.
I would say you can actually explain it
reasonably simply if you just start from:
it's stateful durable functions,
which have guarantees that they execute,
they run to the end, they are able to record steps,
they're able to basically do these sort of asynchronous building blocks that you have in your usual programs, calling other functions,
creating promises, resolving them, updating state, making calls and so on, just in a fine
grained persistent way that knows how to recover. This is sort of the basic building block of
state for durable function. Now, the harder thing is actually in a way making people realize that
they should be using something like this rather than roll your
own. It's not uncommon, mostly on the junior
side of engineers, that you talk to them and it's like, I don't
get it. Like, I know how to write a retry loop, what
is this? And then, you know, it's a journey from
there. Interestingly, I think the most enthusiastic audience
is often the engineers that have been burned before,
that know, okay, I know how to build distributed systems,
but holy cow, I know how hard it is,
and like even though I'm really good at this,
I tend to overlook still two out of 10 corner cases
and I get paged Sunday night or so.
Those are the ones that really, that often go like,
yes, I know why I wanna use this
because I know how much time I would otherwise spend
on solving all these things if I have to do it myself.
So I guess that's right.
There's definitely an education challenge there.
And I would say a very sort of like in your face example
of that is if we look at the AI space
and agents right now, I think every AI company is reinventing workflows in the context
of agents and agentic workflows.
Everybody's building and, I would say, slowly rediscovering all those things,
when there's been an entire industry that has been working on this for, like, wait,
I mean, we've been working on this for
two years. But if you ask IBM, they've been working on this
for probably 30 years or something like that. I mean, in
a very different way, right? But still, I feel like
the AI companies are kind of bit by bit rediscovering
this. And when you start talking to them, I
think some of them understand, okay, if you're building agents,
if you deploy them, they ultimately end up having to solve these problems again.
Like, imagine you have a chatbot that does your flight booking. There's something
you have to do to make it not rebook your flight twice if it just crashes at the wrong
point. They're ultimately going to hit the same problems. They actually
have a perfect foundation to build on with these systems being built today. But yes,
I think they're not aware yet that this is something that they'll eventually run into. So
yeah, I think you can see this in many places, that the industry is rediscovering work in
different sort of subfields
that other fields have already done, just because
information flow isn't perfect.
Kind of an off-topic rant:
Why are all of the AI agent, like Hello World examples,
why are they all booking flights for us?
It's like, do you want some non-deterministic,
half-baked language model,
booking your flight, that's like a very difficult thing
to roll back, you know?
Like, I just don't... that's like one of my last human-
out-of-the-loop AI agent moves.
Like, can we start with something a little bit less critical?
I don't know about you, Adam, but I get like serious
heart palpitations thinking that someone's gonna
book a flight for me.
And you don't get heart palpitations, Jared,
you're a pretty chill dude.
I am, I'm pretty chill, but I just feel like, gosh,
you know how hard it is to roll back a flight?
I mean, come on.
Oh yeah.
Well, I think it depends.
I mean, I don't mind.
I think it's the human dream
to have somebody or something take that kind of action.
Right?
That specific action.
Like, book me a flight.
Let's simplify it.
How about you just like, give me a, yeah, exactly.
Restaurant reservation, you know,
because worst case scenario, I ghost it and feel bad.
But if I don't show up for my flight,
I lose my 400 bucks or whatever, you know?
Yeah.
Maybe this is so much accumulated pain
from people waiting in the call centers for airlines
that, you know, all these companies see, like,
oh, that's a perfect example for a chatbot.
People will wanna use it because they will not want
a single other minute to spend on the phone
with these call centers.
Yeah, perhaps. Well, friends, I am here with a new friend of mine, Scott Dietzen, CEO of Augment Code.
I'm excited about this.
Augment taps into your team's collective knowledge, your code base, your documentation, your dependencies.
It is the most context aware developer AI, so you won't just code faster, you also build smarter.
It's an ask-me-anything for your code. It's your deep-thinking buddy. It's your Stack Overflow antidote.
Okay, Scott. So for the foreseeable future, AI assisted is here to stay. It's just a matter
of getting the AI to be a better assistant. And in particular, I want help on the thinking part,
not necessarily the coding part.
Can you speak to the thinking problem
versus the coding problem
and the potential false dichotomy there?
A couple of different points to make.
AIs have gotten good at making incremental changes,
at least when they understand customer software.
So first and the biggest limitation
that these AIs have today, they really don't understand anything about your code base. If you
take GitHub Copilot for example, it's like a fresh college graduate, understands
some programming languages and algorithms, but doesn't understand what
you're trying to do. And as a result of that, something like two-thirds of the
community on average drops off of the product, especially the expert developers.
Augment is different.
We use retrieval augmented generation
to deeply mine the knowledge that's inherent
inside your code base.
So we are a co-pilot that is an expert
and they can help you navigate the code base,
help you find issues and fix them and resolve them over time
much more quickly than you can trying to tutor up a novice
on your software.
So you're often compared to GitHub Copilot.
I can imagine that you have a hot take.
What's your hot take on GitHub Copilot?
I think it was a great 1.0 product, and I think they've done a huge service in promoting
AI.
But I think the game has changed.
We have moved from AIs that are new college graduates
to in effect AIs that are now among the best developers
in your code base.
And that difference is a profound one
for software engineering in particular.
If you're writing a new application from scratch,
you want a webpage that'll play tic-tac-toe,
piece of cake to crank that out.
But if you're looking at a tens of millions of line code base,
like many of our customers, Lemonade is one of them.
I mean, 10 million line mono repo,
as they move engineers inside and around that code base
and hire new engineers,
just the workload on senior developers
to mentor people into areas of the code base
they're not familiar with is hugely painful.
An AI that knows the answer and is available seven by 24,
you don't have to interrupt anybody
and can help coach you through
whatever you're trying to work on
is hugely empowering to an engineer
working on unfamiliar code.
Very cool.
Well, friends, Augment Code is developer AI
that uses deep understanding of your large code base
and how you build software to deliver personalized
code suggestions and insights.
A good next step is to go to augmentcode.com.
That's A-U-G-M-E-N-T-C-O-D-E.com.
Request a free trial, contact sales, or if you're an open source project, Augment is
free to you to use.
Learn more at augmentcode.com.
That's A-U-G-M-E-N-T-C-O-D-E.com.
Augmentcode.com.
I'm gonna go out on a limb to bring us back
to somewhere left of center, but basically center.
Please do.
And I'm gonna say that this is the year, 2025 is the year
where durable execution of things is more important
than it ever has been.
Oh really?
It's always been important, but more and more people
are leveraging APIs, they're building out this agentic world
we keep hearing about.
Right.
And I think you keep having more and more people
program against brittle APIs, brittle latency of networks, databases, etc.
And you need that promise. I'm gonna say that this is the year
where the marketing problem that you have, that Jared alluded to, is still there,
I'm sorry, but it's less. And I'll tell you why it's less: because
I just talked to Anurag, CEO and founder
of Render, and this is on their radar.
So they're building an application platform for developers.
We did a whole show on this.
And during that conversation, he mentioned a brand, at least I think I did actually,
I mentioned a brand that sponsors us, not this show, but has been, and I think still
is, a sponsor into Q2 and maybe Q3. And that brand is Temporal.
So I'm going to ask you to sort of help me understand the difference between Temporal,
NATS, Synadia, Restate, your open source flavors and your cloud, what Render may be doing for
application developers.
It seems like this durable execution retry model doesn't live in the language itself.
It's something you have to build every single time.
That sucks.
And it seems like more and more people are trying
to solve it.
So break down all those for me: Temporal, NATS,
Synadia, yourself, what Render's doing,
and anything else that may be doing it. I mean, Flink,
but you know, that's a different world.
Yeah.
There's another one called Resonate.
You know that one?
Stefan, do you know it?
Yeah, I know Resonate.
I know the guy behind it.
It's pretty new, but anyways, there is definitely,
like you said, there's other people
trying to solve this problem.
Yes, exactly.
I think this starts from the same observation:
the state of how things are built, if
you don't rely on one of those tools, is almost unsustainable. It's hard to build.
It's hard to hand over to another person. There are often so many implicit and brittle
assumptions in how this works. So folks have been trying to come up with solutions. From the ones you mentioned, Temporal is absolutely the closest, maybe,
yeah, between Temporal and Resonate, I would say those are the closest to Restate. So I
would actually focus on those. I would say NATS goes more in the direction of flexible,
persistent messaging together with some state management blended
in and so on.
But you can already see folks are trying to figure out what the different
aspects are that we need when building applications, and sort of make them work together
with each other in a tighter way.
And if you wish, for me there's a couple of
things that make Restate unique, but I would say two things stand
out. First, I'd say the model goes a bit further than
every other system. If you look at Temporal,
Temporal is workflows, that's really what they implement:
workflows and activities. So it's like durable
steps, and then with sleeps in there and signals and
so on. The full-fledged workflow model is actually very flexible if you're a power user and know
how to use it. Restate goes beyond that by saying we're not just looking at a workflow,
at one durable execution of multiple persistent steps. What Temporal does for a workflow,
we're sort of generalizing: we're trying to do this for a distributed service
architecture consisting of multiple stateful services
that interact with each other.
And you can see this from the fact that Restate has
persistent messaging and RPC built in;
it has state built in that lives beyond
a single durable execution.
So again, in Temporal terms, the workflow is
sort of a self-contained unit: within the workflow,
across the durable steps, it remembers context,
but once the workflow is done, it's done.
Restate is a stateful model where you could almost think of the activities as decoupled from the workflow. The activities can be stateful
services and entities that live for a very long time, and then you have durable functions
that interact with them. It's a much more flexible and powerful model to build things
like distributed state machines. We have folks that actually start to ditch certain elements
of databases and put their state in Restate, because that is then transactionally integrated with the durable steps and consistent out of the box.
So that's number one: think of the Temporal model generalized into distributed services,
to include long-lived state, to include communication between microservices. It makes for
a more powerful, more flexible toolbox.
That's the one thing.
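(A minimal sketch of that long-lived state idea, using Restate's TypeScript SDK. The account service, its state keys, and the port are invented for illustration, and the exact API surface can differ between SDK versions.)

```typescript
import * as restate from "@restatedev/restate-sdk";

// A keyed, stateful service: each account key gets its own durable state,
// which outlives any single invocation or durable execution.
const account = restate.object({
  name: "account",
  handlers: {
    deposit: async (ctx: restate.ObjectContext, amount: number) => {
      const balance = (await ctx.get<number>("balance")) ?? 0;
      ctx.set("balance", balance + amount); // state update recorded with the execution
      return balance + amount;
    },
    balance: async (ctx: restate.ObjectContext) =>
      (await ctx.get<number>("balance")) ?? 0,
  },
});

// Serve it so the Restate server can discover and invoke it.
restate.endpoint().bind(account).listen(9080);
```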
The second thing goes a bit back to what I said earlier.
When we started this project,
we set out with the following.
You can implement durable execution. I think it's not terribly complicated
to implement a durable execution API on top of a database,
if you make it very simple: have a step, write it to a database, and on replay just query the database for which steps are already in there. It has a lot of holes,
but it gets you started.
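(To make that concrete, here is a rough sketch of the naive approach in TypeScript; the step store interface is hypothetical, purely for illustration.)

```typescript
// Hypothetical storage interface: one row per (runId, stepId) holding the
// serialized step result.
interface StepStore {
  get(runId: string, stepId: string): Promise<string | undefined>;
  put(runId: string, stepId: string, result: string): Promise<void>;
}

// Run fn once; on replay, return the recorded result instead of re-executing.
async function durableStep<T>(
  store: StepStore,
  runId: string,
  stepId: string,
  fn: () => Promise<T>,
): Promise<T> {
  const recorded = await store.get(runId, stepId);
  if (recorded !== undefined) {
    return JSON.parse(recorded) as T; // replay path: skip the side effect
  }
  const result = await fn(); // first execution: perform the side effect
  await store.put(runId, stepId, JSON.stringify(result)); // persist before moving on
  return result;
  // Holes, as discussed next: a crash between fn() and put() re-runs the step,
  // and nothing stops two zombie executors from racing on the same run.
}
```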
But then, okay, let's talk about the holes. All of a sudden, you have a problem with long-running processes that suspend for a long time, with scaling this to zero.
You have a problem that in your library you have to implement your own distributed locking and mutexes,
in case you have timeouts and zombie processes and so on.
And so when you try to make it a really good experience, you quickly come to the point of,
okay, we actually have to go a lot further than building a library on top of a database. Then you start thinking, maybe we're building a big
orchestration server that still uses a database in the background. And then you really come to the
point of, if you want to make durable execution so lightweight that you can use it almost pervasively,
how low latency do you have to make these steps under load? What is the best you can actually do if you deploy this across multiple data centers,
if you deploy this across multiple regions?
Then you come to the point that with a distributed database across multiple data centers and
regions, there's a lot of coordination back and forth, because the database model needs
to guarantee integrity.
It does a lot of transaction timestamping back and forth
and round trips. On the other hand, if you build this on a log, an optimized transaction
log, you can get as good as making one flexible quorum write across your different data centers,
and you have the step persisted and you can continue. So it's kind of getting to the point of saying, if
we want to make this extremely fast, so low latency that you can actually start to use
it in places where you didn't think you could use durable execution before, because it becomes
so cheap, so low latency. How would you have to build a system to do that? And that's where
we went. You'd have to build it from first principles, starting with a low-latency replicated log. On
top of that, build it end-to-end event-driven. So you don't
do batch queries on a database; you do the lowest-latency
thing you can do: fine-grained messaging and event
pipelines. And you basically layer from there.
And then the other thing is, okay, let's not just make it really low latency; at
the same time, it also has to be an extremely lightweight thing.
Because, you know, we just said, what's the simplest use case?
Should you actually only start looking at Restate when you have a distributed
ledger to build?
Or do you want to do this if the only thing you want to do is put your asynchronous
email sending in the background, but reliably? So the next thing is, how do
we actually make this extremely lightweight? What's the most lightweight package we can
give that thing? And the most lightweight package is a single binary, zero dependencies.
Just download that thing. It has its log built in, its orchestration layer, its metadata consensus
module, everything in a single binary. Just download it, one command, it starts in a second, and you're done. Literally nothing else to do.
And then you can actually take this thing and start scaling out just by adding more nodes.
If you want to migrate it, let it take a snapshot to an object store, start deploying it elsewhere,
let it resume, and go from there. So, what's really the experience that durable execution needs,
if you want to be able to take it from the point where it's so lightweight you almost want to embed it with any application,
all the way to this thing powering distributed multi-regional payment processing? What's the architecture
you need for that? So that's what we started building in Restate. So, that was
a very long way of saying: the second thing is Restate is really sort of a
durable execution stack built from first principles for low
latency, serverless operations, high throughput, and just
really nice operations from the small to the
large scale. Rather than saying, let's start with whatever
database we have. In Temporal's case, when they came out
of Uber, they started with Cassandra and said, let's build
a server that sort of sits on top of Cassandra and
stores all the state it needs for coordination in there.
And then, you know, you have different pieces that you
need to scale; you have a database that is actually a lot
more than you really need for durable execution, but that on the
way also sacrifices
the potential for optimizing. Those are the differences, I would say.
Gotcha. Built for speed basically, that's what you're saying.
Built to be lightweight.
Scale down.
Yeah, scale down, scale up. And yeah, lightweight, simple to operate. I usually don't like to do this.
I usually like to talk more about like
what makes Restate great than what makes other systems
not great.
Uh oh, Adam puts you on the spot.
Well, you know, I think it's important.
Well, if I'm gonna say, if I'm gonna go out on a limb
and say this is the year, then you have to follow me, okay?
Yeah.
You have to follow me and you have to answer my question
because I'm reducing your marketing churn for you.
Yeah.
Just by nature.
So I'd just say, look at the way Restate is built and how it allows you to get started
and scale from there. If you say, okay, I care about self-hosting this because what
I pipe through this is critical data,
it's not something I trust to some managed
cloud, it really has to run in my account, then I think the experience you get out of Restate is
vastly different from what you get from many other systems. And that's because
it's just been this very thoughtfully crafted stack from the very beginning, and not sort of
incrementally evolved from this database and
that server.
Right. If you're directly comparing to Temporal, which is an incumbent, which was spun
out of Uber as you mentioned and was built on different principles, you
went back to first principles and said, okay, if you want to get to the point where you
can put this almost everywhere you want, you have to be low latency, you have to be fast. These first principles you built on have to be there.
Yeah. And you can't have the requirement to first install a distributed database before
you get started. What are the requirements? That's where I was going to go, too. So it
seems like there's a client, which is an SDK essentially, inside your code base, making
calls to a server. What is the architecture,
the infrastructure required?
So the Restate server, which is where the low-latency consensus log lives, is the thing
that basically becomes the reverse proxy for your services. That thing has no real requirements if you want to get
started. It's a self-contained binary. It embeds its own distributed log, a RocksDB storage engine,
its own consensus engine. The only thing, if you want to run it as a single node, is you need to
give it a persistent disk. It's a little bit like running SQLite or Postgres:
let's go back to the good old days where you download one binary, it just starts and
it's actually running. There's nothing else you need to do.
But at the same time, it's also able to go from that single process, from starting with
the single binary, to actually clustering up
and building a distributed cluster. And there's a very interesting architecture in there.
We built it basically for the cloud age,
where you would say any system that you run at scale
should not really store its own data;
it should just make use of object stores
as much as it can, because
S3 and these systems are these bottomless, insanely durable, and insanely cheap storage
systems. So make use of that as much as you can and put a large chunk of your data there.
So that means while you may be working with the data on your individual nodes, you're not really required to safeguard it on the nodes,
because you can recover it from S3 or an object store.
So what Restate then does is implement its log
in such a way that it only uses the local disks
to give you the very low latencies
for the durable steps.
And then in the background, it incrementally moves data to S3, which makes the individual
nodes fairly lightweight to operate.
So to go back to your question, what are really the requirements when you want to run it?
If you want to run it on a single node: none, or a persistent volume if you actually want
to run it in production. If you want to run it in a distributed setup, give it an S3 bucket. Those are the requirements. If you want to
use it from your code, the requirement in your code is to use the SDK and to basically
create a Restate entry point that Restate can connect to, where it can use its
durable invocation protocol and understands how to decode that. This entry point
mimics the popular frameworks: it's relatively close to Express.js
if you're talking the JavaScript world, and in the Java world it looks more like Spring
Boot and so on. And then within the individual durable function or service handler, you need
to use the Restate context to say, okay, I want to run this step and record it as durable,
or I want to create this as a durable promise for a persistent callback or so. But otherwise,
the structure of your code is very much the same as it used to be. It's supposed to be as little invasive
as it can, to get as little in the way
of how you used to do things as it can,
just changing the paradigm in that,
because it has this built-in recoverability
for these operations,
you can get rid of a
lot of this sort of unhappy-path code.
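(A minimal sketch of that handler shape with the TypeScript SDK; the signup service and its helpers are invented for illustration, and the API details may vary slightly by SDK version.)

```typescript
import * as restate from "@restatedev/restate-sdk";

// Invented side-effecting helpers, stubbed so the sketch is self-contained.
const createUser = async (name: string): Promise<string> => `user-${name}`;
const sendWelcomeEmail = async (userId: string): Promise<void> => {
  console.log(`welcome email to ${userId}`);
};

const signup = restate.service({
  name: "signup",
  handlers: {
    run: async (ctx: restate.Context, name: string) => {
      // Each ctx.run records its result durably; on a retry, completed steps
      // are replayed from the journal instead of being re-executed.
      const userId = await ctx.run("create user", () => createUser(name));
      await ctx.run("welcome email", () => sendWelcomeEmail(userId));
      return userId;
    },
  },
});

// The Express.js-like entry point that the Restate server connects to.
restate.endpoint().bind(signup).listen(9080);
```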
There are still cases you need to treat, but mostly those are
persistent errors that come from the application, where you say, okay, you're making
a call to an API where you're not authorized.
There's not really a way to recover from that;
it's trying to do something it's not supposed to. So handle this, but don't worry about handling process failures, network failures,
rate limits that bounce you back. Don't worry about many classes of race conditions,
you know, like the state being maintained in the database versus the logic that interacts
with it in a function, where you don't really know, did this go through
or not.
If you put the state in the Restate handler,
it's just gonna be consistent for you.
All of those things.
By keeping the structure of the code
like close to what you used to write.
So I made you talk about things you don't like to talk about
except for maybe the architecture.
That seems kind of fun to you.
What is it that you do like to talk about
when it comes to defining and describing Restate
and why developers should consider it?
So, what I don't like to talk about is competitors, in the sense that I don't want to
say, okay, I don't like this about one competitor,
I don't like that about another competitor. Because, number one, I'm not an expert in those
systems.
I try to be honest.
I look at them to the extent I need to, but
actually no deeper than I need to, because I found it very liberating to not have my
judgment clouded or pre-biased by having looked at something.
I feel if we looked in depth at, for example, how Temporal built their
API and so on, there's a very good chance that,
oh yeah, I get it.
This is why they did it and this makes sense and so on.
Then there's a good chance you'll probably do it
the same way, just because you've seen this example,
coded it, understood it, and you're preconditioned.
If you don't do this.
Yeah.
It's like Schrödinger's cat.
Is the cat alive or dead in the box?
We won't know until we look.
Maybe, yeah.
But if you don't do that,
It's both dead and alive.
You actually have a chance to do something, to come up with
your own creativity, possibly do something better, right?
So that's one of the reasons why I don't like to talk about them so much because I'm absolutely
not an expert.
I look as much as I need to, but I usually don't try to go super deep into these systems.
And the second thing is that, I don't know, I'd rather talk about good
things than bad things.
Yeah. It's more fun to say nice things than bad things.
I understand your discomfort, then and now.
Definitely, it can be tumultuous talking about competitors and what they do
and what they don't do.
I think the reason the question is pertinent is because,
to Jared's point, you have a marketing challenge ahead of you.
And I think it's because the idea of durability and idempotency is mostly well known, not
always easily implemented, and there are options out there.
And so when you sort of look at that challenge, you think, well, what could someone reach
for?
When would they reach for it?
When does it make the most sense to reach for it?
And does it actually fit whenever they do try to implement it at scale, you know, across different boundaries and whatnot?
And so I think when you compare that, you look at, like, well, NATS is a whole different scenario,
but they do similar things. It's kind of funny, because when you mentioned Flink, you're like,
well, it does this in a different way, and then you got to Restate because of your experience there and whatnot.
The thing with NATS is, NATS does a lot of similar things: you're
brokering messages, there's a lot of retries, there's a lot of key-value storing in there.
There's a lot of those same principles, but it's not about durability, it's not about retries.
And then you obviously have Temporal, you have Render, who's trying to or going to do something
like that in the same platform, which I just had that conversation with
Anurag about, and then you obviously have Restate and how you went back to first principles
versus being spun out of something or
what have you. So I think you're the best-suited guide in this conversation to explore those,
because Jared and I can't do that for us.
Yeah, absolutely. So if you want a quick summary, I'm very biased, but I think there's almost no reason to not
reach for Restate.
I think it really is this solution from first principles, with amazing developer experience, with a very powerful abstraction
that allows you to build what you can build with workflows and signals, but also so much
more. And yeah, just the journey from the beginning, downloading the binary, then migrating, scaling out,
it's a great experience.
And I mean, the project is newer than other projects,
so it will have a rough edge here or there,
but it's also moving very quickly.
It's very good at reacting to community feedback fast.
So I think it's a good choice.
It has made a lot of users happy so far.
Could we maybe use, Jared, an example from our own application
to consider how we would pick up Restate?
I know we publish episodes, right?
We publish episodes.
We often will have scenarios where the slug isn't right.
We've had different scenarios where we had to do things
in prod to fix something.
You know, it could be metadata,
and you've got different checks before the publish process.
Is there a way, knowing what you know now about Restate,
you would consider implementing something like that to safeguard publishing
episodes in a durable way?
I've never really used one of these tools before,
so it's difficult for me to say. I do know, just at a technical level, that I do not believe Restate has an Elixir SDK, so we might be out of luck.
An Elixir one, that's a good ask.
Okay, maybe I can help you come up with an example here.
Let's say you're recording the episodes, and every time an episode is done,
let's do an AI thing here or so. So you're
building your chat where you can chat with an episode,
like, okay, tell me, when did they
talk about this? Or tell me what episodes talked about these
topics, and so on. So what you're doing is, whenever an
episode is done, you're feeding it first through a model
that transcribes the audio, then you're chunking it up, feeding it through embeddings models, storing that
maybe in a vector database, and then you have kind of a RAG-style way of, you know, when
a query comes: create the embedding, look up the similarity search in your vector database,
feed it to the model to get the answer.
For something like this, let's say you started just building the flow in, say, a Node.js
application, in a simpler way. You just said, okay, here's the episode,
it gets uploaded.
Let's say you're uploading it to an S3 bucket, and whenever something
gets uploaded to this bucket,
you have an event that represents this,
and then it starts a Node.js script
or something like this.
And this script is of the type that, you know,
if it fails, somebody would have to restart it.
And now let's say you're trying to implement that
with Restate.
I would say approach it the following way.
The first thing is get a handle on Restate itself.
There's a cloud service that you can use on our site,
which has a free tier.
Either go there, or just use one of these ways
to run it yourself on, say, a
single machine with an EBS volume.
Then you have the server there.
Then take your Node.js script;
maybe you can actually put it on something like Lambda or ECS,
just use a serverless option to host this.
And then use the Restate SDK
to define the entry point, and tell Restate,
okay, here's the service that you should now
durably manage. So Restate will then go there and discover
this and understand, okay, hey, there's this, what do we call it, video
transcriber or video embedder service. And then Restate knows about this. And then you
would go to your Amazon console and say, okay, for this type of event,
I wanna create a webhook to Restate,
so that it makes an invocation to Restate that says,
okay, this thing has been uploaded.
The kind of event
that would previously call your Node.js process
or script directly, you actually make it an HTTP call
to Restate, and Restate will with that call your process.
You've already gained one thing right away:
you now basically have a reliable queue in front of it,
just like that, even if you don't do anything special.
So when the webhook call comes, it's gonna be acknowledged back,
and Restate has it; if your process crashes, it will retry it.
It will actually give you nice observability,
much more than you would get from your average message queue,
about individual retries, configuration about timers and back
off and timelines and so on.
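(Concretely, such a webhook target is just an HTTP call to Restate's ingress, roughly like the sketch below. The service and handler names are invented, the default ingress port of 8080 is assumed, and the /send suffix for one-way calls should be checked against the docs for your version.)

```typescript
// Fire-and-forget invocation through Restate's ingress: the call is durably
// enqueued and acknowledged, and Restate drives the handler with retries.
await fetch("http://localhost:8080/episodeIndexer/index/send", {
  method: "POST",
  headers: { "content-type": "application/json" },
  body: JSON.stringify({ bucket: "episodes", key: "episode-42.mp3" }),
});
```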
As the next step, you would then go into your script and say, okay, let's
identify the steps where, if something fails in that step or after it, I don't want
it to go back. Let's say forking the process
that does the transcribing,
or calling the LLM to create the embeddings.
You then introduce the Restate context that you get
by using the Restate SDK, and just say,
okay, let me wrap these API calls with Restate's run.
That will capture the results durably,
and you've now basically turned it into a workflow.
Let's say you want to do something like
parallelize the different steps.
You know, maybe piping this one by one
through this embeddings model is a little tricky.
So you want to fan out.
You could then go and say,
let me do the exact same thing I'd do
in a regular Node process:
just make a bunch of function calls,
remember the promises,
do sort of a Promise.all for those in the end,
join the results, put those in the database.
You can do exactly that in your code,
just, again, anchored in the Restate context.
So you get this durable parallelization, durable
scatter-gather, and so on.
And so you would then incrementally rewrite
your code to say, okay, let's make this step durable,
let's make that step durable, and that step durable.
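(Roughly sketched in code, with invented transcribe/embed/store helpers. The deterministic Promise.all-style combinator is assumed to be exposed as RestatePromise.all, which may differ by SDK version.)

```typescript
import * as restate from "@restatedev/restate-sdk";

// Hypothetical helpers standing in for the real transcription/embedding calls.
const transcribe = async (key: string): Promise<string> => `transcript of ${key}`;
const chunk = (text: string): string[] => text.split("\n\n");
const embed = async (text: string): Promise<number[]> => [text.length];
const storeVectors = async (key: string, vs: number[][]): Promise<void> => {};

const episodeIndexer = restate.service({
  name: "episodeIndexer",
  handlers: {
    index: async (ctx: restate.Context, req: { bucket: string; key: string }) => {
      // One durable step around the expensive transcription call.
      const transcript = await ctx.run("transcribe", () => transcribe(req.key));

      // Durable scatter/gather: each chunk's embedding is its own recorded
      // step, and the join is replayed deterministically on retry.
      const chunks = chunk(transcript);
      const embeddings = await restate.RestatePromise.all(
        chunks.map((c, i) => ctx.run(`embed-${i}`, () => embed(c))),
      );

      // Persist the results as a final durable step.
      await ctx.run("store", () => storeVectors(req.key, embeddings));
    },
  },
});

restate.endpoint().bind(episodeIndexer).listen(9080);
```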
Say, as a next thing, maybe one of your folks wants to approve
it before it really goes
out.
Then let's do that in the simplest way possible:
we create an awakeable, a durable promise, in Restate, and say, okay, somebody needs to
complete this actively.
Send an event, make an HTTP call to complete this and say, okay, this is approved, go through,
or no, this is not approved, abort.
You could, for example, put the result of the transcription just in Restate,
so somebody could look at it from the UI and then say, okay, yeah, I'm making an API
call here to approve this and continue.
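(Inside a handler like the one sketched above, that approval step could look roughly like this. ctx.awakeable and TerminalError come from the TypeScript SDK, while notifyApprover and the exact resolve URL shape are illustrative assumptions.)

```typescript
// Illustrative stub: deliver the awakeable id to whoever approves episodes.
const notifyApprover = async (id: string) => console.log(`approval handle: ${id}`);

// Create a durable promise; its id is the handle used to complete it later.
const approval = ctx.awakeable<boolean>();
await ctx.run("notify approver", () => notifyApprover(approval.id));

// The handler suspends here at no cost while it waits; an HTTP call to the
// ingress (roughly POST /restate/awakeables/{id}/resolve with a JSON body)
// wakes it up, and replay restores execution to exactly this point.
const approved = await approval.promise;
if (!approved) {
  throw new restate.TerminalError("publication rejected by reviewer");
}
```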
And so you can then incrementally
rebuild your process into durable steps.
As the next thing, you could then, for example,
take it and migrate it from a long-running process
to a Lambda function.
Because one of the nice things you have
with durable execution is that when it's waiting
for something else to happen, it can actually just make this thing go away, because it knows how to recover it
back to the place where it was by replaying the history of durable steps.
So you could then say, if you're on vacation and you approve it a week later, you don't have
some process running and waiting for it. It's just going to go away. And when
the approval finally comes, it's going to come back,
use the durable steps to replay back to the point, and then do the remaining
steps. And so typically folks would incrementally rework their
non-durable services: first connect them to Restate to basically get the
equivalent of a durable queue, and then incrementally rework it and say,
okay, I want to add durable steps here, maybe parallelization, maybe a signal.
And I think that's typically how you'd approach it.
That makes a lot of sense.
I do see also you have some guides on the website
about how to implement certain things.
I'm curious about the observability bit.
Is that a part of your hosted offering?
Is that a part of the open source project?
How does the business end fit in
and is observability part of that open core sort of thing?
Yeah.
So at the moment, what you get in the open source
is very broad.
You get in the open source, compared to the hosted offering, pretty much everything except
the fact that you would self-host it, and the whole authentication
and API tokens and so on that exist only in the managed offering.
But other than that, we've started with an open source first approach.
So the open source has pretty much
the full suite at the moment.
On observability, there's two things
about observability in Restate.
Number one, it can actually give you
an amazing amount of observability itself,
out of the box, because it funnels all these durable steps
through its consensus log.
It has all the information about what happened,
and not just at the function-call level,
but to the granularity of: here is a step that happened,
or it actually failed, this is the last step that completed before this failure, and since then I've retried so many
times, and this is the last exception I've seen.
It has all that information available because it's also connected to the service and understands
what type of errors are happening, is this a retryable error or not.
And it gives you access to all that observability data in its own UI.
It's actually a fascinating way that this is implemented.
I do want like one or two technical details.
Sure. So there's this durable log that records all the actions.
Then everything is indexed into RocksDB instances to retain it in a scalable way. We've built a SQL query engine around this, using the DataFusion project, which allows you to basically do SQL queries
against all of that invocation and transaction journal state and so on. What the UI actually does
is basically issue SQL queries. It's almost like back to the good old days
when all your state was in a single Postgres database.
And we kind of lost that
because we went into distributed microservices.
And if you want to find out what happened,
you now have to do a murder mystery with 20 services.
Yeah, and you're bringing it back.
And we're kind of bringing it back,
like, yeah, SQL query for the win,
for your distributed application state.
So this is one of the things:
you get an amazing amount of insight right out of
just the Restate journal.
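(For a flavor of that, a hand-issued query might look like the sketch below. It assumes the admin API exposes a query endpoint on port 9070, which the restate CLI's sql command talks to, and that the invocation table is named sys_invocation; both are assumptions to verify against your version's docs.)

```typescript
// Ask Restate which invocations are not yet completed, via plain SQL.
// Endpoint path, port, and schema are assumptions based on the CLI's behavior.
const res = await fetch("http://localhost:9070/query", {
  method: "POST",
  headers: { "content-type": "application/json" },
  body: JSON.stringify({
    query:
      "SELECT id, target, status, last_failure FROM sys_invocation WHERE status <> 'completed'",
  }),
});
console.log(await res.json());
```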
The second thing is, because all the operations
go through there, Restate can also just
out of the box generate OpenTelemetry traces
and spans for you.
So if you give it an OTel endpoint
that it should push those to,
it will just give you the traces right away,
without you needing to configure anything.
You can then extend it and augment it with your own traces.
But yeah, those are the two things you can do.
And so the business end is basically cloud hosting
for Restate.
The business end is gonna be a lot more than that.
Okay.
What is it going to look like?
That, I don't think we can go into yet.
Interview us again in six months.
Interview us again in six months.
It's not ready to be announced, but what's available
right now, most of what we have built, is also in the open source.
So yes, on the business side,
we do currently only hosting.
For the next six months or so.
Maybe.
Fair enough, cool.
Well, I think it sounds like a really cool system.
I'm excited about this new world
of durable execution functions
and some way to slap a name on that
that brings all of the junior engineers to the yard,
along with us seasoned engineers
who have felt these pains for all these years.
You know, like the serverless folk did.
You know, they just said it's serverless,
and they're like, oh, okay, cool, serverless.
Maybe I should try it.
I feel like restate and friends need
some sort of a marketing term to just simplify
the overall concept of what you all are building.
But I do think it's very interesting tech
and very promising.
I do like the term resilient apps,
so I think maybe we need something with resiliency involved.
But that's all for me, Adam.
Any other questions from you before we let him go?
I actually like durable, personally.
I don't know if you want to wordsmith that a little bit here,
but I like durable.
I think that seems to have-
So if you're interested,
we've actually gone through a few iterations.
We started with something we just called durable
async await, because in many ways that's what it is underneath the hood.
It's like, yes, durable asynchronous operations: a function invocation is an asynchronous operation made durable, a step is sort of
an asynchronous API call with a durable result.
And there were some sort of expert programmers
that were like, oh, that's really cool, I get it.
Like, you know, it's like distributed durable event loops.
It's very cool.
But then 90% of the folks did not get that.
And we just went with durable execution.
And then it turns out that's
a term that's maybe increasingly more recognized,
but it also undersells a little bit what we do there,
because folks actually think, oh yeah,
so it's just the same thing as Temporal,
but it actually does a bit more.
So yeah, we're still on the wordsmithing side.
Like, yes, distributed durability, resilient apps,
resilient distributed state management,
there are so many things on the table.
At the moment, and I used this earlier in the talk,
stateful durable functions is something we've used.
I think this is maybe increasingly getting recognized
because of a lot of the efforts that, let's say,
Cloudflare does with its Workers or durable objects.
There's a construct in Restate called a virtual object that has a surprising
amount of similarity with durable objects, and we would
have called it a durable object if that term hadn't been
trademarked by Cloudflare. And Azure Durable Functions
is probably even closer to Restate than Temporal. So I
think you can actually think: Temporal,
then Azure Durable Functions,
and Restate is a bit more than that.
I think it combines a bit more orchestration
and stateful logic, in an even more flexible way
than Durable Functions does.
Yeah.
But yeah, so stateful durable functions
is where we've currently landed at.
But look, it's a journey.
I think honestly, even Temporal
hasn't figured that out after five years.
I think it's still-
Well, that's why I said it's a challenge
that maybe Restate shouldn't solve alone;
I feel like everybody who's in this category needs it. There's a missing category.
It's almost like a style of application or an architecture.
Where it's like, well, what architecture is this?
Well, it's model view controller.
Okay, it's MVC, I can build an MVC-style app.
Whereas this is like,
I don't know what to call it.
I'm missing a word.
But it's almost like, you know, it's Restate-style
or something. Maybe you have to
term it after yourself if you want to really own the market.
That's durable function style or it's, yeah.
Durable function doesn't speak to me personally at all.
Sounds really boring, but that's just me.
And maybe it's working on other folks.
Adam likes durable.
Durable to me just sounds like cool.
It's not gonna break.
Stateful, durable functions, that's what you said.
Is that right?
That's what I said earlier, yeah.
Stateful, durable functions.
SDF. SDF style.
I'm gonna make an acronym or something like that.
I don't know, TDD, SDF.
Right. MVC, yes.
Yeah, I mean, model view controller,
it doesn't have any sort of appeal to it either on its face.
So, I think that wasn't a bad example.
Anyways, we could continue to workshop it
till we're blue in the face.
But obviously you've been working on it longer than that.
Hey listen, this is the year.
This is the year of it.
I'm just saying. The year of the what?
The year of whatever this is.
The durable function.
Whatever this is, it's the year.
Yeah, I feel there's something about durable in itself
that's not recognized by lots of folks.
I think you actually asked about that earlier as well.
Like what does durable really mean?
Like maybe a stronger emphasis on persistence or so.
I think there's something to be said about resilience. I think resilience is a much more attractive word,
generally speaking, and one that to me calls and says,
is your app resilient?
I'm like, ooh, I don't know if it is.
I want resilience.
Because durability can still show wear somewhere,
whereas resilience is like, you know what?
No matter what happens, I'm going to succeed.
I'm going to try until I bounce back.
Yeah, and I think you're right in the sense that durability
is a means to an end.
It's very much an implementation detail, if you wish.
Restate achieves resilience by doing a lot of fine-grained
durable operations, which make it easy to bring
things back to a consistent state, and that drives resilience.
There you go.
Now you're getting your messaging down.
Love it.
Now we're deep.
Listen, hey, you know we have a fun place to hang.
It's called Zulip.
Oh, that's true.
Go to changelog.com slash community.
Join us in there, and then if you have some ideas about this name,
or want to wordsmith this with us, this world that Stefan is creating, then,
you know, pile on, share your thoughts, all the good stuff.
Well, what's left, anything left unsaid about this durable, resilient world we're going to live in?
What's unsaid about the durable world that we live in? I think it's inevitably coming.
The question is mostly in what shape it's coming.
I think it's actually been worked on from multiple dimensions.
There are folks like us that work on this from the Restate side, like, here's the lightweight
durable log that is easy to integrate with your functions.
I think the serverless folks
and the Wasm folks are working on that
from a different side, saying,
okay, hey, let's compile everything to Wasm
and let the system use the Wasm interpreter
to snapshot things.
And I think there's folks that kind of use container
engines to implement this. So the thing they all share is just the understanding that it's completely
unsustainable to not have anything like this. It's hard. It gets increasingly
more important the more moving parts and the more asynchronous processes you have.
And if we all believe what the AI people tell us, that like 80% of all this is
going to be some agentic stuff in two years anyway, then you've just created an even bigger problem and an
even bigger need for this type of system. So I think this is coming in one shape or the
other. This is our sort of...
It's inevitable.
Our bet on how to best achieve it.
And it's gonna be fun to see.
It's gonna be fun to see what happens.
There you go.
All right, Stefan, well, thank you so much
for sharing the journey, sharing the love,
sharing the things.
Appreciate you.
Thanks for having me.
Cheers.
Okay, very fun conversation today with Stefan.
Very big idea of resilient applications.
Love the idea of Restate.
The idea of stateful, durable execution functions
is awesome, but it's just four words too long for me.
I do agree with Jared on the marketing challenge ahead, but resilient applications, I'm
down with that. I think you are too. If you haven't yet, check them out: restate.dev. Okay,
big thank you to our friends over at Augment Code, our friends over at Retool, and of course,
our friends over at Heroku, Heroku.com. Actually, I think it's Heroku.com slash changelog podcast,
if you wanna use the URL they gave us, there you go.
But the next-gen platform is coming
and I heard it's awesome.
Okay, BMC, thank you so much for those beats.
You are awesome and we'll see you soon. Thanks for watching!