Software Huddle - Durable Async/Await with Stephan Ewen of Restate
Episode Date: January 30, 2024
Today's guest is a legend in the distributed systems community. Stephan Ewen was one of the creators of Apache Flink, a stream processing engine that took off with the rise of Apache Kafka. Stephan is now working on core transactional problems by building a durable async/await system that integrates with any programming language. It's designed to help with a number of difficult problems in transactional processing, including idempotency, dual writes, distributed locks, and even simple retries and cancellation. In this episode, we get into the details of how Restate works and what it does. We cover core use cases and how people are solving these problems today. Then, we dive into the core of the Restate engine to learn why they're building on a log-based system. Finally, we cover lessons learned from Stephan's time with Flink and what's next for Restate.
Transcript
How big is the current team working on Restate?
So we're still a fairly compact team
because we're iterating a lot on the foundation still,
which I think works great in a small team.
What is durable async await?
Where would I sort of use it in my application?
It's actually a very, kind of very, very broad and general, but powerful programming paradigm. Like, the idea of Restate, and in the end durable async/await, is that you add durability around certain types of operations, be that, you know, RPC calls between services, external API calls, you know, timed waiting, background work, and so on. By adding durability around these operations,
it actually makes it very easy to recover from failure.
So it kind of eliminates a lot of the issues you have to deal with in distributed systems.
Are people moving more towards, hey, I just want a managed service that does this for me?
Even if I have that safety of open source, but most of the time I want that managed service.
Have you seen a big change over the last, since I think '09 you've been working on Flink, or what does that sort of market look like?
Yeah, that's actually a good question.
Hey folks, this is Alex. Today's guest is Stephan Ewen, one of the founders of Restate. Restate is a durable async/await framework type system,
right? So if you have some sort of workflow or process you need to do that needs to call a bunch
of services, maybe add retries and error handling and rollbacks and all that sort of
thing. Restate helps you do that in your programming language of choice. So TypeScript, Java, something
like that, rather than using something like step functions, you know, where you're sort of mixing
between your Lambda functions, but also like the step functions DSL. So I thought this was a really
great episode on sort of what problems it solves, how it works, some of the distributed system stuff behind Restate that was really interesting.
If you recognize Stephan's name, he was one of the creators of Apache Flink, right?
So this stream processing framework that really took off with the rise of Apache Kafka.
And, you know, he worked on that 10 or 12 years, saw some really cool things, and now he's been working on Restate for the last two years.
So really cool distributed systems stuff we talked about here. As always, if you have any comments, questions, guests you want to see, anything like that, feel free to reach out to me or Sean Falter with any of that. And with that, let's get to the show.
Stephan, welcome to the show.
Hey, hi, Alex.
Yeah, I'm excited to have you here, because you're kind of a distributed systems, I don't know, legend, or at least someone I know and look up to in the area, based on previous work you've done with Flink, and now you have a new exciting project, Restate, that we're going to talk about a lot today.
For people that don't know, you can give a little background on you and what you're working on lately.
Yeah, of course.
Thanks for having me, and thank you for the kind words. Yeah, so I think most of my professional life I've been working on what became Apache Flink. I actually started out as a database person originally, working on database things like query execution, query optimizers, and so on. After I graduated, we took a lot of the academic work that I worked on, started an open source project, Stratosphere, which became Apache Flink, and grew into what is now Apache Flink. After a few detours, like, you know, that happens in startup and open source land, we sort of settled on stream processing, unified batch and stream processing. That's where I spent most of my professional life before 2021, which is when I phased out from the work on Apache Flink and stream processing.
And we slowly looked at the sort of first experimental steps of what is now Restate, which I think we're going to talk about today. So previously I was working mostly on event stream processing for analytics, and Restate, interestingly, is almost like the complementary part of Flink in my mind. It's a little bit like event streaming for transactions. At least that's how it works underneath the hood. It's not how I guess we think about it in the end, but that's at least where it comes from and how it's built.
Yep, absolutely.
And I'm sure Flink was a really exciting time
where a lot of stuff was moving to streaming
with the rise of Kafka, and then how do we process those,
and especially in more complex ways,
and Flink did a lot with that.
I guess, how did you get interested in this space, like more sort of durable async/await or whatever you want to call that? How did you get the idea for Restate? Was it based on your work with Flink? Was it something else? How did that come up?
Yeah, I think it's kind of two things that came together.
i mean on the one hand a little bit just the frustration in general at the state of the art when it comes to just building reliable APIs, like reliable distributed applications.
But at the same time, also seeing we're not the only ones with that frustration.
We actually saw in the Flink community a bunch of folks that were somewhat like abusing Flink to build reliable event-driven applications for more like transactional style applications.
And that's not really what Flink is built for.
You can kind of like make it do a bit of that work because it has a very like high consistency
model internally that you can try to sort of like then connect other systems and make
it kind of coordinate distributed transactions.
But it's not actually great.
It's a little bit like trying to abuse your OLAP database for transaction processing.
But it kind of showed us that apparently there's nothing really there.
So folks are turning to these kind of solutions that they're using systems not built for this, for these type of problems.
And when we said, OK, we have issues with these problems.
Other folks are abusing our analytical technology for transactional processing because they seem to have these problems.
Maybe there's something there and we should start looking at that.
Absolutely. So for those that aren't sort of aware of this space and Restate, what is durable async/await? Where would I sort of use it in my application?
Right. So it's actually a very, kind of very, very broad and general, but powerful programming paradigm. The idea of Restate, and in the end durable async/await, is that you add durability around certain types of operations, be that, you know, RPC calls between services, external API calls, you know, timed waiting, background work, and so on. By adding durability around these
operations, it actually makes it very easy to recover from failures. It kind of eliminates a
lot of the issues you have to deal with in distributed systems. Like how do you figure
out if, for example, you're updating different systems, you want to keep them in sync and some
stuff has happened, but the other not. And you have to kind of figure out what went through, what didn't.
How do you make sure you don't re-execute the previous parts, but you always make sure
and eventually also execute the parts that didn't initially go through and so on.
So you get a sort of consistent view in the end.
It helps with all sorts of those problems.
So anything where you sort of talk to different systems, where you need reliable background work or elaborate process-to-process communication.
It's kind of geared at solving those problems.
And the core idea is very lightweight durability added to, if you come from a TypeScript world, the easiest way to think about this is the promises. A lot of this is written in async/await style. You make a call to another system, you await the response, and so on, and around these async/await operations you add a lightweight form of durability that helps you recover.
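[Editor's note: to make the idea concrete, here is a minimal TypeScript sketch of durable async/await. The DurableContext shape and all names below are invented for illustration; this is the pattern Stephan describes, not the actual Restate SDK API.]

```typescript
// Hypothetical shapes, for illustration only, not the real Restate SDK.
type DurableContext = {
  // Journal an arbitrary async action so it is not re-executed on replay.
  run<T>(action: () => Promise<T>): Promise<T>;
  // Make a durable RPC to another service registered with the runtime.
  call<T>(service: string, method: string, arg: unknown): Promise<T>;
};

// A plain async/await handler, written durable-async/await style: each awaited
// operation is recorded, so a crash midway resumes instead of restarting.
async function handleSignup(ctx: DurableContext, email: string) {
  const userId = await ctx.run(() => createUserInDb(email)); // recorded once
  await ctx.call("email-service", "sendWelcome", { userId }); // deduplicated on retry
  return userId;
}

declare function createUserInDb(email: string): Promise<string>;
```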
Awesome. And, you know, what are people currently using to solve these types of problems? Is it just sort of certain patterns in their application? Are there other tools that help them with this? Or what sort of tools are people putting together to solve this?
Yeah, I think there's different ways that folks approach this right now. I'd say a lot is, I'd say, not solving that specific problem, but just trying to work around it in a way: trying to design applications in a very specific way so you can try and make certain pieces idempotent, and then just retry forever until it goes through, and just hope everything else cancels each other out, which sometimes works, sometimes doesn't.
Sometimes folks use workflow engines for this.
I think this kind of problem, making sure stuff happens, runs to the end, all steps get completed, is kind of a classical, you know, workflow engine topic, actually. It's just that there's, I think, a very big space where you do want this kind of behavior, but it's just not such a big and heavyweight sort of thing. It's maybe just a lightweight process that orchestrates between two other processes, so you don't actually want to pull in a heavyweight workflow engine. And that's when you have to come up with all of your own transaction protocols and retries and deduplication semantics and so on. So I would say it's kind of a mix of both. Like if it's a sufficiently complex
of both. Like if it's a sufficiently complex
process, maybe you pull in a workflow engine. If it's
kind of a lightweight process, you kind of build your own
protocol. Maybe there's something
in between where folks use like message queues,
SQS, Kinesis, Kafka, and so on, to sort of schedule these tasks in a more durable way. But also that kind of still involves a lot of your own design of requests and how to deduplicate them, how to tie different IDs and tokens together, and so on. So yeah, all in all, I would say it's a lot of roll-your-own.
And this roll-your-own is, I think, tougher than it initially sounds. I'm always surprised at the corner cases that come up that are not that easy to spot in the beginning. I think that's why the roll-your-own ultimately is something that has many problems when folks do it.
And it's also something that more and more folks realize is actually very hard and try to avoid, and they try to look at different solutions.
So I think that's also part of why workflow engines recently have become more popular.
And you can see this, for example, in the rise of popularity of step functions and so on.
Yep, yep, absolutely.
So just to put it in sort of an example everyone can understand: the example I think of, and I think it's in your docs as well, is just like a checkout workflow on an e-commerce site, right? Where someone hits checkout and now you need to, you know, decrement the inventory, maybe you need to process a payment, queue that up, or at least create some sort of order record. Like you need to hit a bunch of different systems or databases or things like that, right, to make this happen.
And like you're saying, you can just sort of like
hit them in order in your application
and like maybe wrap them in retries
and sort of hope it works.
And occasionally one falls through and, you know,
now you gotta like manually fix that.
Or, like, now, in the world I see, I see a lot of Step Functions wrapped around that, which has its own sort of, you know, pros and cons.
I think that'd be like the workflow solution that you're talking about.
You're talking about adding durability to this. What does that specifically look like? When you say, you know, Restate sort of adds durability to this, where does that durability come in? Especially since I see Restate integrates with Lambda, which I don't think of as having durability or anything like that at all. What does that actually look like, to add durability to this process?
The way it works in Restate is you have basically two parts. You have a small SDK, and you have a server component. You can think of the server component a bit like something in between a log, a database, and a very lightweight workflow engine. It's a stateful component that maintains a durable log internally. It also acts like a proxy for a lot of things that happen in your application. For example, again, the simplest way of using Restate: you're writing sort of an RPC handler and you want to say, okay, make that RPC handler durable for me. You'd actually register that at Restate and say, you know, Restate, become the proxy for this one. Then as you actually call this through Restate being the proxy,
Then as you actually call this through restate being the proxy,
you have this connection between your RPC handler and the Restate server.
And it basically tracks many operations.
Like if you do an external API call and you wrap them sort of in a Restate operation,
or if you make a call to another system through the Restate SDK, it wraps those operations
in an internal durable journal entry that it commits.
But it also remembers how this connects to the application. So it first of all records internally that this operation has happened, and then it also understands how to complete that operation when it's done and send this back to the process. So in case, for example, something fails in between kicking off the RPC and receiving the response, or maybe, in the case of Lambda, you explicitly want to go away between sending the RPC and getting the response because you don't want to wait, it understands how to connect the response back, complete that promise, and then give it back to a new invocation of the function and say, okay, now continue from here, now that we've completed that promise. So in the end, you program in a sort of familiar sequential request-response style: make an RPC or an API call, await the response. And Restate is kind of the proxy for these promises, sits kind of as a middleman in between, records the creation of them and the completion of them, and facilitates the recovery in case something happens in between.
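[Editor's note: a sketch of the replay mechanic Stephan describes, with invented names. In the real system the journal lives durably in the Restate server and the SDK drives this loop over a streaming connection; an in-memory array stands in for it here.]

```typescript
// One journal entry per durable operation in a handler's execution.
type JournalEntry = { op: string; result?: unknown; completed: boolean };

async function invokeWithJournal(
  journal: JournalEntry[],          // durable journal kept by the server
  step: number,                     // position of this operation in the code
  op: string,                       // label, e.g. "call flight-service"
  send: () => Promise<unknown>      // the real side effect (RPC, API call)
): Promise<unknown> {
  const existing = journal[step];
  if (existing && existing.completed) {
    return existing.result;         // replay: the result already came back
  }
  if (!existing) {
    journal[step] = { op, completed: false };          // record intent first
    const result = await send();                       // then do the work
    journal[step] = { op, result, completed: true };   // record completion
    return result;
  }
  // Entry exists but is not completed: the call is in flight, so suspend and
  // wait for the completion event from the log rather than sending a duplicate.
  throw new Error("suspend until completion event arrives");
}
```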
Gotcha. So all that traffic is going through this Restate server component type thing, which
I want to talk about.
I want to talk about architecture for that and what that looks like.
But all that traffic is going there.
It's routing it to the other services you have, which could be Lambda functions.
It could be a Kubernetes-based service or whatever it is, just some sort of endpoint that can then handle it.
And it's sort of tracking the status of those as it goes through there.
Yeah, that's right. It's not unlike saying you're invoking other functions through an event. Like, you know, for example, if you talk Lambda to Lambda and so on, you can do that by enqueuing an event in SQS or something like that. It's not unlike that. It's just that you're not actually writing it like that. You're not writing: okay, I want to enqueue an event, I need to make sure I enqueue it only once, I do enqueue it in case something fails. You don't need to worry about the recovery and the deduplication that connects the recovery and the enqueuing and all of it together, because Restate sits a little differently between the components. You don't actually explicitly enqueue events and so on. It's really the SDK that captures this, creates and logs it to the system in the background, and always connects it with the current execution and retry and so on, to make sure things happen only once and don't get duplicated. So in the end, yeah, you can think of it as like an event-driven system.
It's like an event bridge or SQS or something sitting between applications.
But you program it in a much more high-level way so that it in the end looks like you're really programming sequential RPC with durable promises.
Gotcha.
On that sort of, you mentioned EventBridge, does it have the sort of decoupled
producers and consumer type
concept that you would see in EventBridge where
a producer might just fire something into
EventBridge and then whoever
wants to be subscribed to that thing, does it have that sort of concept
or is it more sort of intentional RPC, or at least message-passing type stuff?
Yeah, it is, at least in the current version, more this intentional RPC, though with a few nice additions. You can think of this as request-response, or one-way fire-and-forget, but at least with reliable acceptance and delivery semantics. You can use this to also schedule events and invocations, a bit like you can do with EventBridge, where you say, okay, hey, maybe this is an invoicing process that happens only once per month, so I want to schedule this for the next month. And then when this happens, let me schedule it again for the next month, or something like this, or just on a periodic timer or so. So it's not a classical pub/sub, and there's no reason for us to not expose that at some point, but at least at the moment it's not a pub/sub in the way of, here's a bunch of producers and a bunch of consumers. It's really more a facilitator for direct RPC-style communication at the moment.
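[Editor's note: a sketch of the monthly-invoicing scheduling pattern just mentioned. `sendDelayed` is an invented name for a durable, fire-and-forget delayed invocation; the real SDK's method names differ.]

```typescript
// Hypothetical shape of a context that can durably schedule future calls.
type SchedulingContext = {
  sendDelayed(service: string, method: string, arg: unknown, delayMs: number): Promise<void>;
};

const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

async function runMonthlyInvoicing(ctx: SchedulingContext, customerId: string) {
  await generateInvoice(customerId);
  // Durably enqueue the next run; the schedule survives crashes of this
  // process because it is recorded in the server's log, not in memory.
  await ctx.sendDelayed("invoicing", "runMonthlyInvoicing", customerId, THIRTY_DAYS_MS);
}

declare function generateInvoice(customerId: string): Promise<void>;
```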
Gotcha. Yeah. And it is interesting. I see a lot of people use EventBridge for sort of that direct messaging, even though it's not quite as much for that, and they just don't know what other abstraction to use for that. But it's interesting, because it seems like Restate can replace a lot of other AWS primitives that I often use. I thought of it initially as, hey, this sort of is a Step Functions replacement. But it also sounds like SQS, I'd probably be using less of that if I have something like this. Or EventBridge Scheduler, like you're saying, if I need more dynamic scheduling, do this in 15 minutes or three days or at this particular time, I can do that with Restate. It could replace even some of the DynamoDB Streams usage I use, where often I have sort of the dual-write problem, right? Where you want to write a record to the database and then enqueue a message or do something else. And you can either do that in your same process, which is kind of finicky, or you can write it to the database and then process it asynchronously with Dynamo Streams or whatever change data capture. But now Restate can sort of help handle that specific issue.
Yeah, I think that's right. And the way I would think about it is the following, right? So many of these primitives exist because there are so many different challenges you have in a distributed application.
And I mean, distributed application doesn't even mean like fancy distributed.
As soon as you have your own process and you're talking to one other process and maybe a database,
you have a distributed system, right?
And you have a lot of the problems already, right?
And there are so many classes of problems. You just mentioned the dual-write problem, right? Then there's the problem of, you know, making sure you have idempotency, making sure you even have recovery. There's the issue of building circuit breakers, and so on. I think that's why so many systems exist. And then there's, of course, sender and receiver failure decoupling, right? If you're doing an operation that takes longer on the receiver side, you absolutely don't want this thing to be completely lost if the connection breaks, the sender fails, or the receiver goes away in between. You really want to decouple all of those things. And because there are so many individual things you actually need to worry about, I think there are so many products and so many solutions. And one of the ideas of Restate was to take a step back and look at this in a slightly more fundamental way: what is really the thing that causes so many of these issues? And then build a very broadly usable set of primitives for that.
And I think because of that, it says, OK, if you use it,
you probably need a bit less of that and that and that
and that and that, because it helps actually
with a bunch of problems, not just with one problem.
Yeah, absolutely.
Okay, so I'm sure you've paid attention to this space better than I have, because I mostly know Step Functions. And if I could sort of compare and contrast Step Functions with Restate, it would be: Restate lets you sort of stay in your, you know, TypeScript land or Java land or whatever your programming language of choice is, when you have these sort of distributed things going on. Rather than, if you're in Step Functions, you might say, hey, there's this first step, it's going to do something, and it's going to throw an error or throw some sort of result.
And you're sort of kicking between your programming language, maybe in a Lambda function or something else, and then back to more like infrastructure-as-code, or some sort of DSL for processing through the states. So that would be the difference, I'd say, between Restate and Step Functions. What about some other ones I hear about? Like Temporal, which came out of Cadence. Is Temporal in the same category as Restate? Is it a different type of thing? How does that compare or contrast?
Yeah, I think that's interesting.
It definitely is in a similar category.
And Temporal is a durable execution framework; I think that's how they at least used to call themselves for a long time. For Restate, durable execution is definitely a core part of what it does. I think Restate takes a bit of a different approach, because it doesn't look at,
because it doesn't look at,
okay, here's a durably executed workflow, and now you try to connect everything to that.
But it really tries to take more of a, you could say microservice architecture approach
or even a slightly simpler approach to that where we say it's not specifically about building
workflows.
It's about building durable event handlers, durable RPC handlers, stateful RPC handlers. There's a component that actually lets you attach state directly to the handlers and keep that transactionally consistent with the invocations. And I think that gives it a bunch of different characteristics than Temporal. And while I think there's some overlap, I think Restate really is a bit more general, more towards general distributed applications rather than the specific workflow category.
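[Editor's note: a sketch of the "stateful RPC handlers" idea just described. The `get`/`set` names and `KeyedContext` shape are invented; the point is that state updates commit together with the handler's journal, so on a retry either both happen or neither does.]

```typescript
// Hypothetical keyed handler context: state is scoped to a key (e.g. a cart id)
// and persisted by the same log that records the invocation's progress.
type KeyedContext = {
  key: string;
  get<T>(name: string): Promise<T | null>;
  set<T>(name: string, value: T): Promise<void>;
};

type CartItem = { sku: string; qty: number };

async function addToCart(ctx: KeyedContext, item: CartItem) {
  const cart = (await ctx.get<CartItem[]>("cart")) ?? [];
  cart.push(item);
  await ctx.set("cart", cart); // committed consistently with this invocation
}
```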
I mean, that being said, you said that you compare it mostly to Step Functions. I think that's actually a great category, because to my mind, Step Functions is also, it's like workflows, but it's a little bit more. You don't think of Step Functions necessarily as just the long-running workflows, for days, with human tasks in the middle. You can also use that for fairly low-latency and fast orchestration tasks and so on. And I think that's also kind of the category that Restate is in more. It's not just heavyweight workflow, it's also very lightweight, fast, distributed coordination and orchestration.
So if I use Restate, do you call it a workflow? What do you call it in Restate? If I have a handler that uses that, what would you call that?
It's actually interesting. We're still debating what's the best way to call it at the moment. We just call it a durable handler.
Okay. So I might have a durable
handler that calls a bunch of systems, but that could still be in the synchronous path
of a request from a client or something like that, that goes through a bunch of different
things. I could still get a synchronous HTTP response back
from my client sometimes when using restate.
Yeah, that's right.
So the individual operations in Restate, we're paying a lot of attention to building them in a way that they're quite fast, that they add very low latency.
So when you, for example, have different processes, even if they both run on Restate, and you're kicking off one handler by an RPC call from the outside, you're making this call, another one, and so on, you're waiting on a response to get back, calling another one: it's in the order of milliseconds every time you're hitting the different processes. It's actually built to be a very fast append to a log, and that's it. That allows you to do a bunch of steps in the synchronous path, which is really where we want to see this thing go. Because if you actually have this ability to add workflow-ish durability to handlers with low latency, you move a lot of things where it was not possible to use workflow engines before, because they tend to just take too long, and you make it possible to use them. And all of a sudden you have a big class of use cases, of types of patterns, where you can benefit from this very, very nice tool that makes a lot of these problems just go away, which you couldn't before, because it was latency-prohibitive to use those types of tools. So maybe one interesting example,
it's the one that we're also using on our website and in the blog post, is this classical Step Functions demo, the holiday reservation, where we have a process that takes care of orchestrating everything: talking to flight reservation, car booking, hotel reservation, and so on. It also implements all the control for what if something goes wrong. It implements the Saga pattern, actually, in the end, and undoes the previous steps in case an error occurs. You can actually run this in a synchronous response path. Assuming at least every individual service is reasonably fast in responding, the Restate component adds very moderate latency overhead to this.
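[Editor's note: a sketch of that holiday-booking Saga in durable-handler style. The service names and the `ctx.call` shape are invented for illustration; the point is that because each step's result is journaled, the compensation path is crash-safe too.]

```typescript
type Ctx = { call<T = unknown>(svc: string, method: string, arg: unknown): Promise<T> };
type Trip = { from: string; to: string; dates: [string, string] };

async function reserveTrip(ctx: Ctx, trip: Trip) {
  // Compensations to run, in reverse order, if a later step fails.
  const compensations: Array<() => Promise<unknown>> = [];
  try {
    const flight = await ctx.call("flights", "reserve", trip);
    compensations.push(() => ctx.call("flights", "cancel", flight));

    const car = await ctx.call("cars", "book", trip);
    compensations.push(() => ctx.call("cars", "cancel", car));

    const hotel = await ctx.call("hotels", "reserve", trip);
    compensations.push(() => ctx.call("hotels", "cancel", hotel));

    return { flight, car, hotel };
  } catch (err) {
    // A crash mid-undo is fine: on retry, completed reservations and completed
    // cancellations are replayed from the journal, not re-executed.
    for (const undo of compensations.reverse()) await undo();
    throw err;
  }
}
```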
Yep, very cool.
Okay, so I want to know more about, I think you call it the restate runtime,
but basically the server component that's handling this flow between
serving as a proxy, handling the durable state, and things like that.
What does that component look like?
Right. So the server component, just to quickly recap, is the one that sits as an intermediary between all the different endpoints. And it's the one that's responsible for making sure that you can actually program distributed programs that execute and recover from failures in an almost simple, sequential RPC style. So, kind of pretending failures don't happen, you can embed the control flow, the failure handling flow, just like that. That's actually, in a way, the biggest difference also to Step Functions, if you wish. On the Step Functions side, you basically
go and take every individual bit that you care about. Let's say between two parts where you care about failure handling, you say: okay, once this part is completed, I have to make sure it doesn't get re-executed, and I need to have the results of this durably committed before I'm starting the next thing. Every time you have this, you pull it apart into different pieces and orchestrate something like a Step Functions flow across it. And in Restate you don't do this. In Restate you basically just write the control flow down and let the durability of the individual steps take care of it. And the runtime
is the piece that sort of sits as
a proxy or sort of sits in the background and sort of accepts the stream of messages that facilitate
the communication, that sort of persists the creation of the promises and so on.
What it looks like, when you use it yourself, is just a single binary that you bring up. It's actually fairly compact. It has everything it needs in there: a log, a state storage engine, sort of a registry for schemas, functions, communication, and so on. So it's a fairly compact bit. It maybe feels a little anachronistic, but then at the same time it doesn't. I think there's a bit of a swing back into components being built that way. From everything is just a set of custom resource definitions on Kubernetes, we're going back to, okay, things actually are standalone binaries that do their thing without requiring a hundred other things. So Restate is also that. It's a simple thing. You start it, it runs there, almost like a lightweight database. And then in your applications, you embed the simple SDK,
almost like the database client. You connect to it, and then it kind of acts as your proxy or
broker for the communication. That's kind of how it works. So in that sense, it's not actually wrong to think of it as it kind of takes the place
of maybe the SQS queue between your services, or it takes the place of the step function
coordinator.
But what it looks like internally is it has a very specific architecture.
It's a log-based system with sort of an engine that sort of asynchronously interprets commands, materializes indexes over the log,
and handles sort of the network communication,
the invocations, event dispatch, and all that.
I'm always trying to find simple ways to compare it,
but you can think of it a little bit as: it's a log, a bit like a Kafka or Kinesis log, with sort of an event processing application built into it that works on the events on behalf of your application, and then basically only talks to your application in a very high-level way, as in: okay, I've done this for you, I've got a response here for you based on that. And it keeps sort of the ground truth of everything that's happening in this log and command processor. And that really is also where the simplicity of the correctness model comes from, because it makes the application mostly just say what should happen, lets the command interpreter, or the central log, make it happen for you, and then listen to the response for that. And that gives you this sort of nice correctness out-of-the-box experience.
Okay, sounds good.
So trying to understand it, and I definitely
don't know as much about distributed systems as you,
but you've got this server component
where it's sort of receiving these RPC calls,
or maybe even state setting calls, right?
Like if I'm a handler, I can sort of set some context
or just little bits of state for that particular function.
It'll handle that.
It's going to persist it to that durable log you have.
Also maybe some materialized data,
like in, I believe, like RocksDB,
just to make that a little more queryable and stuff
for resuming stuff.
And then also proxying that to the different other handlers you have.
Am I understanding that correctly?
Yeah.
Yeah, that's right. Let's maybe work through this as an example and really show what happens in the runtime. Then we can also explain why it is built the way it is. Let's take this holiday reservation, holiday booking example for a moment.
So say we're in the main function that has the workflow, the one that talks to the different services. It's being executed proxied through Restate, so Restate has an event in the log that says: okay, I want to run that workflow. That part's not too unlike, I think, how most other workflow engines work; they write it in the write-ahead log. So Restate kind of dispatches that. And then as you go through it, as, for example, the code says, hey, I need to now talk to the flight reservation service first, it basically streams an event in the background to that server component. And the first thing that server component does is it actually appends the event. And then it has established the ground truth: okay, we've reached that part of the program. We've sent this bit.
So it first of all is just a very cheap append to the log. But what happens at the same time is that there's a process that indexes these events. It establishes the relationship that this event we just appended is related to the original workflow execution, because it's an event that came out of it. And it attaches this to a recovery journal for that workflow function, for that handler. And anytime something goes wrong, it can actually use that to re-invoke it and say, okay, do your thing again. But then the next time the promise gets created, it actually sees from the recovery log: oh yeah, that promise actually is already there. It might not be completed, but at least it's there. So let me actually not send this again, because I already have that.
At the same time, at some point, the other service is going to respond with an event that completes it. And all of these are really meant to be very fast appends, so we actually get that durability very fast. But then it's again indexed into: okay, this is a completion for this specific original RPC event, so let me attach it there, augment the original promise with its completion result, failure or success, and so on. And then, if it's still executing, just, you know, forward it to that process; otherwise keep it there, and the next time it executes it can be resupplied as a completed promise to the executed function. So those are basically, I would say, the two or three most important pieces we have in the system. It's this log built for fast append, fast durability of operations. RocksDB is more like an index that tracks the relationship between events. And if you wish, state that you attach to a handler is just a specific sort of event that then gets attached to the identity of that handler. So keyed RPCs are really specific forms of events that get attached to an execution of a handler, and so on. That's the second part. And the third part is the component that manages the communication with the services.
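[Editor's note: conceptual shapes, with invented names, for the pieces just described: events appended to the log, and the index that groups them into per-invocation recovery journals. This is an illustration of the model, not Restate's internal wire format.]

```typescript
// The kinds of events that get appended to the single log.
type LogEvent =
  | { kind: "InvocationStarted"; invocationId: string; handler: string }
  | { kind: "CallSent"; invocationId: string; entryIndex: number; target: string }
  | {
      kind: "CallCompleted";
      invocationId: string;
      entryIndex: number;
      outcome: { ok: true; value: unknown } | { ok: false; error: string };
    }
  | { kind: "StateSet"; invocationId: string; name: string; value: unknown };

// The index (RocksDB in Restate's case) answers: which entries belong to which
// invocation? On recovery, completed entries are re-supplied to the handler as
// resolved promises instead of being re-sent.
type RecoveryJournal = Map<string /* invocationId */, LogEvent[]>;
```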
Maybe a quick note on why we built it that way. So in theory, we could have also built it as: you know, we use Kafka as a log, or Kinesis, and then we store things in DynamoDB, and so on. I think maybe let's take a step back and look at what that means from a distributed systems perspective if you actually do it like that. Let's just take Kafka and DynamoDB. We actually pieced together those two systems. What you ultimately have now is actually two logs involved. You have Kafka as a log, and then DynamoDB; every database's core truth is its own write-ahead log. That's where it records things that, by its definition, have happened. Now, if you say, okay, I'm writing an event to Kafka and to DynamoDB, you have a problem. You actually have two logs that need to be kept in sync. And you need to implement, if you wish, at least a poor man's transaction coordinator that says: okay, you know, maybe after I've appended to Kinesis, if I crash, I have to look at this and see, have I moved it to the other system or not? What if I do find something in that other system and not here? You're already at, I would say, the core of where most of the distributed systems complexity comes from: keeping different components in sync. The beauty, I think, of the way that we've built this, with just one log and everything else being an asynchronous process that follows this log, is that there is exactly one ground truth. Something is appended to the log or not. That determines, from Restate's perspective, whether it happened or not. And everything else is a view that follows. And it's kind of very easy to
let that view follow consistently,
especially if it's the same component that manages it.
That actually gives us two very nice characteristics.
First of all, it's a fairly easy thing to make it correct.
Whenever you have different systems you need to keep in sync,
you're going to hit so many
corner cases over time where you need
to have a more elaborate protocol that deals with that.
We don't actually have that.
It's fairly simple to get it correct.
And one of the interesting outcomes of that was that fairly early versions of that system, when we really stress-tested them with very heavy, sort of Chaos Monkey-style tests, with complex logic that cares about consistency and correctness, and just shooting things down, introducing disruptions and corruptions, it was fairly hard to actually make it trip up. For the early stage that it's at, it's fairly resilient. That's one of the really nice outcomes. The second one is it's also fairly easy to operate, because it doesn't run the risk of: okay, what if the log actually executes faster than DynamoDB can take the inserts? Or what if this one has a compaction and then the other one has one at a different time, and they stall each other out, and so on? Because we care so much about the system being easy to operate, this architecture lends itself to that very much.
It's an architecture that has comparatively few knobs and sort of things that can drift apart from each other.
That's what we liked about it a lot.
And that's why I decided to do it that way.
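[Editor's note: the two-logs problem Stephan just described, in miniature. The function and helper names are invented; the bug is structural, not specific to any library.]

```typescript
// Naive dual write: two systems, two logs, and a crash window between them.
async function checkoutNaive(order: { id: string }) {
  await appendToKafka("orders", order);      // log #1 commits here...
  // ...a crash here leaves the two systems disagreeing forever...
  await putToDynamo("orders-table", order);  // ...log #2 commits here
}
// Recovering correctly now needs the "poor man's transaction coordinator":
// on restart, check both systems, dedupe, and reconcile. With a single log as
// the only ground truth, that window disappears: one append either happened or
// it didn't, and every view (state index, outbound calls) is derived
// asynchronously from that one record.

declare function appendToKafka(topic: string, event: unknown): Promise<void>;
declare function putToDynamo(table: string, item: unknown): Promise<void>;
```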
Okay, that was a very long speech.
No, that was great.
Just thinking as a user of a piece of open source infrastructure,
if I know I have to go set up not just this infrastructure
and start running this binary, but send in these credentials
to this DynamoDB table and this Kafka cluster and all that sort of stuff.
It's just like, okay, now I'm managing three things.
I don't really have that much control over how they interact.
It's just nicer to have that more in that single binary type situation.
Exactly.
Because maybe we haven't actually said this before: the way we really want folks to use the system is, we're working on a managed service, and of course we want to have a service that's easy to operate for ourselves. But we're also publishing this as an open source system, and we want folks to be able to run it themselves, at least in their own accounts and so on, if they need to, in an easy way, in a way that doesn't require too complex operations. And maybe that's something we learned a little bit from our previous work in the space, Flink and Kafka and all the other systems that connect to that. It's very easy to get into trouble even if you think a system is super well behaved. Every system by itself makes sense, but there are just different assumptions they make every time they interact: slightly different intervals in which they refresh their credentials, slightly different ways in which they burst and how they react to bursts, how much they can stomach. If you combine many complex systems, you have something that is very easy to trip over, even if each system itself is actually very well behaved. And that's the situation we wanted to get out of. That's why we're putting it into one binary.
Yeah.
Based on your experience with Flink, and especially with Flink for a long time,
and now talking to customers, I'm sure, with Restate, are you seeing a change in how people, like, are people moving more towards, hey, I just want a managed service that does this for me?
Even if I have that safety of open source, but most of the time I want that managed service.
Have you seen a big change over the last, you know, since I think 2009 you were working on Flink?
Or what does that sort of market look like?
Yeah, that's actually a good question.
So definitely the acceptance of managed services is way up, and also the desire to use a managed service. Flink, when we started, was absolutely all on-prem. In the end, there were so many folks asking for managed services. But there's not such a thing as the managed service; there are so many different paradigms. You can have a multi-tenant managed service that is shared by many users. You can have a managed service that has dedicated resources and an isolated process per tenant. You can have a managed service that runs some part in maybe the vendor's account, keeps parts of the control plane or some very low-level storage primitives there, and then runs the sensitive parts in the user's account, to make sure that, let's say, at least the unencrypted data, or the direct connection to their application services and so on, doesn't leave their account. There's this wide spectrum. So I'd say it depends on what kind of user you talk about. The long tail is more on the really fully managed side of things. But there are still a lot of big use cases that need at least dedicated clusters, or these clusters to run in their own accounts and so on. Building to make this easy and possible, I don't think it hurts, to be honest.
Yeah.
OK.
I want to backtrack to something you said a while ago, because it was in my head. You were talking about the process of what's happening on the server. Like, if I want to call some other service, it's going to write that log event.
And then when that service responds at some point,
that'll have a different event and mark that.
So does that mean if I'm using Restate, both services have to be using Restate for that sort of thing? Can I use Restate around calling an external API that I don't have control over, or even calling out to Postgres or something like that, that's not going to be fully integrated into that Restate ecosystem? Or is it more just internal RPC-type stuff that's the core of what you're doing there?
I think you can use it for all of those things. I would say the simplest thing is, I mean, you can make it as simple as a one-step workflow, right?
Let's say you have something that comes in, maybe it's something you want to schedule for later or so, and that's why you're running it through a system like Restate, and then it does exactly one thing. It wants to commit that, it wants to acknowledge that this happened, done. So you can use it for that. Restate allows you to basically wrap any type of API call in a side effect, in sort of a managed promise, if you wish. And that's that. And I think a lot of the simple examples look like that: you have one durable handler with a bunch of calls that it makes, you know, to other services, and that is it. And that's totally fair. And I think the interesting part is if you start going beyond that. If you say, okay, maybe this one handler I have, there's a bunch of other services that it needs to interact with.
Let's say that you have that example of the checkout process, right?
It calls maybe an inventory service to decrement the counts.
It calls a payment service and so on.
And when you start saying, okay, hey, I'm not just using Restate for this one thing, but I'm also using it as part of, not just the checkout flow, let's say also the inventory service and so on, then you get a few additional benefits. Because now you have both sides managed by Restate, and that means you get exactly-once sort of communication between them, retries, idempotency, all of that, just completely for free. You can actually pretend they talk in a completely reliable, exactly-once manner, because both ends are proxied through the same system, and it can work its internal magic to make this nice. You don't have to do this. It's completely valuable for one service. But when you put a few more services onto it, you get a bunch of additional benefits. I think that's where, over time, people like to go.
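[Editor's note: a sketch of wrapping an external API call in a side effect, with invented names (`run` is not necessarily the real SDK method). The recorded result is replayed on retry instead of calling the provider again; an idempotency key still guards the small window where the call succeeded but the record of it was not yet durable.]

```typescript
type SideEffectContext = { run<T>(action: () => Promise<T>): Promise<T> };

async function chargeCustomer(ctx: SideEffectContext, orderId: string, amountCents: number) {
  return ctx.run(() =>
    // Hypothetical payment client; the order id doubles as the idempotency key.
    paymentProvider.charge({ idempotencyKey: orderId, amountCents })
  );
}

declare const paymentProvider: {
  charge(req: { idempotencyKey: string; amountCents: number }): Promise<{ chargeId: string }>;
};
```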
Yeah. Tell me about that exactly-once and how that works. I know there are some blog posts on how exactly-once is impossible and things like that. What do you mean by exactly-once, and how does Restate solve that particular issue?
Yeah. So I think that's good; exactly once is a very controversial term. I think exactly once, in terms of, let's say, invoking a service exactly once, is not something you can realistically achieve, by the laws of logic, math, and so on. But that's also not what you really need. You need the effects of the service to happen just once, right? That's what I think exactly once has always been about. When you talk about it in stream processing, or even in the context of database transactions, you never talk about kicking something off only once. You talk about it materializing its effects once and only once, right? And it's often achieved by a combination of retries and idempotency, or sometimes an actual commit protocol at the bottom, which usually has at least one retried and idempotent step in it as well. So I think that's really what it's about. If you have two services,
one sending its event to the other: it's important to understand that the sending of an event from one service to the other is anchored in the durable execution and the deduplication and so on. So when this is sent, it actually goes as an event in the background to the Restate server, which appends it to the log. And then you have this single write that acts both as, okay, we've created the promise, but also, we have work to do here, namely send that message to another service. So we've established an endpoint that sends an event, and that deduplicates on retries: if that handler fails, it recovers, goes to there, and says, okay, you know, I have actually created that promise. So creation and sending happen in a deduplicated manner. On the receiving side, it's the same thing. You receive it, you invoke the handler; you might invoke it multiple times if failures happen in between. But every time you invoke it, you understand the effects that already happened, you give the completed promises for these effects back, and eventually you conclude it. So what you will not ever see is the event being sent out twice, or resulting in two individual invocations that don't share each other's partial progress. I think that's what it boils down to in the end, and that's really all you care about. In the end, it means that whether one or two invocations happen as part of a retry strategy doesn't matter. It matters that applying the effects multiple times doesn't change the result.
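[Editor's note: a sketch of the receiver-side half of this, with invented names: duplicate deliveries of the same message must map onto one effect. In the description above, this bookkeeping lives in the log; a Map stands in for it here.]

```typescript
// Durable in the real system; in-memory here for illustration.
const completedEffects = new Map<string, unknown>();

async function deliver(messageId: string, applyEffect: () => Promise<unknown>) {
  if (completedEffects.has(messageId)) {
    // Redelivery after a retry: return the recorded outcome, apply nothing.
    return completedEffects.get(messageId);
  }
  const outcome = await applyEffect();      // apply the effect once
  completedEffects.set(messageId, outcome); // record it durably with the effect
  return outcome;
}
```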
Yep, yep, very cool. Okay. And then, going back to this runtime thing: if I have this server, am I running multiple instances of it for replication and better durability and things like that?
What does that look like?
What's a common configuration?
Am I running three, five, seven of those?
That's a great question.
At the moment, you're running one because that's the version that is out there.
We're working on the distributed version. I mean, I think you would eventually run this at whatever your load requires. I would say starting with three, one in each availability zone of a typical region; I think that would be what we would probably recommend in the future. If you want to run it yourself at the moment, it's one process. It basically persists its log durably by writing it to something like EBS. So the simple thing you can use right now: we actually have a few tools around this, like CDK constructs and so on, that would basically deploy it as either an EC2 instance with an EBS volume, or, we just saw the release of Fargate EBS support and so on, a Fargate task with an EBS volume. Those are perfectly fine deployments. It actually doesn't really matter whether your task is that long-lived; it really only matters that the log survives. The log has to be durable. That's why it has to be on EBS. Wherever you run the rest doesn't really matter.
And in the future, we want to take care of that replication ourselves, for a few reasons. I mean, cost effectiveness is one thing, but the other thing is cross-AZ durability; it's not something EBS gives you at the moment. And S3 is still too high-latency to write to. The low-latency version is, again, only single availability zone, so it also doesn't help there. So that's why we're working towards our own log replication at the moment.
Gotcha.
So in some future where you're doing replication,
does that mean you'd move away from EBS
and do more instance-based storage?
Or would you still be using EBS
and relying on some of those guarantees there?
I'm kind of curious on how those trade-offs manifest
for distributed systems builders like yourself.
Yeah, that's actually a very interesting question. I would say it depends a little bit on your risk appetite. To me it's very related to the question of whether you should fsync a log or not. If you don't fsync your log, for example, you're basically relying on instance memory or so for the time being. Let me put it like this. What are the logs really for in a modern architecture? In the end, you very quickly want to go to durability through something like S3 for the majority of your data. You get latencies out of that in the tens to hundreds of milliseconds. If you go to the cross-AZ one, the 99th percentile here really matters, and it's probably more like three-digit milliseconds in many cases.
In a way, the most important thing you have to bridge is the latency until that point has come, and maybe a bit more, because depending on how much you write, you want to amortize costs into bigger writes, so you don't write too-small chunks. You have to bridge just this small bit until you've gotten enough data and enough of a latency window that you can flush it out to S3. This is what you have to bridge. This is what the replication is for. So: is just replicating to instance storage, or even to instance memory, good enough to bridge that?
It kind of depends on what failure rates and what kind of failure scenarios you assume. Most of the time, you'll probably be all right replicating to memory. You'll probably be good replicating to disk. If you actually want to replicate to something like EBS, even in that small window, it really depends on what class of failures you're addressing. I had this discussion a while ago with some Kafka people, where I asked them: would you fsync or not? And what is the worst thing that you lose if you don't fsync? And they said the worst things that you typically lose, the cases where not fsyncing usually matters, are somebody accidentally deallocating the entire deployment. You know, you run it on Kubernetes, even with multiple failure domains and anti-affinity and so on, and you don't ever lose data because all of your AZs go down at the same time. But you lose data if some administrator unfortunately deletes the deployment, because then everything goes away in the small window before you can flush it out to the more stable storage. And that's why, for some people, it might still matter to persist to something like EBS, because then you still get to defend against those scenarios. If you're happy with the math around failure rates, and you trust your ops team not to take down all AZ deployments at the same time within the same second, maybe you are OK with the cost benefits of just replicating to memory. I'm not sure there's one answer for all.
Yeah, yeah, it's a tough one, right on that edge there.
So is Restate currently flushing stuff to S3 occasionally, for that more durable storage, and getting stuff off just regular disk?
We're not doing that at the moment, but it's something we're working towards. Which is incidentally also why, for example, we really like having RocksDB in there. Because if you actually have a replicated log, the LSM-tree design is actually a really interesting one for storage engines. Because it has these hierarchical, immutable artifacts that are asynchronously merged, it lends itself really well to storage on something like S3, which is good for immutable artifacts.
So you're flushing not just... You might flush the RocksDB segments and things like that to S3,
as well as the log segments. Is that what you're saying?
Look, we might. We're still in the process of developing this.
It could be very well that by the time we do this,
it's not actually RocksDB anymore,
but it's like a different storage tree and so on.
But conceptually thinking, yes, that's where we would move to.
It's been fun seeing all the S3 stuff that's happening with data systems.
I talked to the WarpStream people recently
and just what they're doing with S3
and just how it's sort of changing data infrastructure
in a lot of interesting ways, like you were talking about.
One thing I thought that was interesting, too, in the docs,
so you mentioned RocksDB is sort of your state storage,
but you can actually query it via Postgres, right?
You expose it basically via Postgres. What's going on there?
Yeah. So RocksDB basically is our state index, but not just state, right? Let me actually just try to emphasize this: RocksDB stores the relationship between all the events, in the end. Here's an event that corresponds to this workflow execution. Here's an event corresponding to that RPC invocation of that thing on Restate. Here's an event corresponding to that committed side effect to another system. All of these are events that get appended to the log, but then indexed in RocksDB.
And by making this queryable, you can find out a lot about what's going on in the application: what's currently running, who's calling whom, who is waiting, who's sleeping, who's waiting on some other function to return. That's, I think, the real power. You can almost query the distributed call stack, at least for the services that are on Restate. If you have a bunch of services on Restate calling each other, you can almost query the distributed call stack between them.
And it's not that you're maintaining this synchronously; the whole thing is just an asynchronous index view. So Restate internally has a query engine based on Arrow DataFusion, but the endpoint behind which we expose it actually speaks the Postgres wire protocol, so you can use a Postgres client to query it. Actually, I haven't mentioned this before:
DataFusion makes sense for us because the server component is completely implemented in Rust. We're actually a big user of the Tokio framework, which fits very well with our design: we have this log with sort of a single appender per partition, which fits the Tokio programming model pretty well. And being a Rust system, DataFusion just looked like a promising Rust-based query execution engine that we could connect to RocksDB, by making RocksDB sort of a virtual table that DataFusion can scan and push certain predicates down to.
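(As a concrete illustration: since the endpoint speaks the Postgres wire protocol, any ordinary Postgres client can connect to it. Below is a hedged sketch in TypeScript using the pg package; the port and the sys_invocation_state table and column names are assumptions for illustration, so check the Restate docs for the actual schema.)

```typescript
// Hedged sketch: querying Restate's introspection endpoint with a plain
// Postgres client. Port, table, and column names are assumptions.
import { Client } from "pg";

async function showWaitingInvocations(): Promise<void> {
  const client = new Client({ host: "localhost", port: 9071 }); // assumed port
  await client.connect();
  try {
    // Hypothetical query: who is suspended, waiting on some other function?
    const res = await client.query(
      "SELECT service, method, status FROM sys_invocation_state WHERE status = 'suspended'"
    );
    for (const row of res.rows) {
      console.log(`${row.service}/${row.method}: ${row.status}`);
    }
  } finally {
    await client.end();
  }
}

showWaitingInvocations().catch(console.error);
```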
Yeah, okay, that's really cool. Do you anticipate that Postgres-compatible endpoint being something that's only used for occasional one-off, hardcore debugging-type things? Or will it be something where I might have a dashboard open that's continuously refreshing and showing the state of my services and the traffic and things like that going across? How often do you see that used?
It's a very good question. At the moment, the biggest user of this is actually our own tools. The latest version, which I don't think is officially published yet, but whose release artifacts you can already find on GitHub and our website, the 0.7 version, introduces a very powerful CLI that you can use to examine exactly that sort of status of services: why is something waiting, how long has it been waiting, who is it waiting for, and so on. And that thing internally issues a lot of queries to that system, right? That's the biggest user of it so far.
You could use it to expose something about the state of your system in a dashboard. If you don't do it in a crazy aggressive way, like issuing thousands of queries per second, it might actually work. It's not at the moment optimized for production query workloads; it's really more for debugging, analysis, introspection-type workloads. But it's a very interesting question. We've seen this be so valuable that we might actually do more work on it in the future so it can sustain bigger loads. That remains to be seen.
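(Building on the sketch above: a dashboard along the lines discussed here would poll gently rather than issue thousands of queries per second. Again an illustrative sketch with an assumed port and schema, not a documented interface.)

```typescript
// Hedged sketch of a low-frequency status poll against the introspection
// endpoint; the endpoint is tuned for debugging, so keep the rate modest.
import { Client } from "pg";

const POLL_INTERVAL_MS = 5_000; // every few seconds is plenty for a status view

async function pollStatusCounts(client: Client): Promise<void> {
  // Hypothetical aggregate: how many invocations are in each status?
  const res = await client.query(
    "SELECT status, COUNT(*) AS n FROM sys_invocation_state GROUP BY status"
  );
  console.table(res.rows);
}

async function main(): Promise<void> {
  const client = new Client({ host: "localhost", port: 9071 }); // assumed port
  await client.connect();
  setInterval(() => pollStatusCounts(client).catch(console.error), POLL_INTERVAL_MS);
}

main().catch(console.error);
```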
Yeah, even if you had a read replica that's maybe just pulling down the S3 data. Maybe it's a few seconds or however long behind, but you can query it and not worry about messing with the hot path of those events getting appended and all that sort of stuff. It just gives you a look at your system and what's happening. That'd be kind of interesting, just to see where that goes.
Yeah, that'd be very interesting. And again, an LSM tree-based architecture that you asynchronously write into is great for that, because you can have a different consumer pulling those artifacts and running a query engine over them and so on. Lots of things are possible, yeah.
Yeah, I've seen other tools doing that. Rockset has what they call compute-compute separation, right? Where they pull those artifacts down into different compute instances so you don't have one query here messing with one query over there, and stuff like that.
Yeah, it makes total sense. I mean, Rockset is the folks who built RocksDB, so a lot of the things we're thinking about, of course they're also doing. It just kind of comes from using RocksDB.
Yeah, cool. Let's talk about where Restate is, both as a product and as a company and things like that. So is it in a state where I can go out and use it and start running stuff through it? Where are you at in the development cycle here?
Yeah, so there are different pieces that are in different stages. We have SDKs out there in TypeScript/JavaScript, Java, and Kotlin that are pretty much good to go. There's the open-source runtime that you can download. It's usable, but I would add that it's still under rapid development. That doesn't mean it eats your data; like I mentioned, we're actually pretty happy with the reliability so far, because I think that's one of the things we get from this architecture, the very simple, single-log paradigm. But we'll absolutely go through a few iterations of the storage formats and everything. So if you're using it right now, be aware that you might have to run a migration script at some point when you upgrade. That's definitely a caution I would throw in. If you're happy with that, we'd be happy for you to check it out. We're also working on a cloud service, which is in pretty early stages, and we're looking for folks that would be interested as early design partners for that. So: open source, I would say try it out, as long as you're OK with running migrations down the road. Managed service, talk to us.
Yeah, absolutely.
OK, so you sort of stopped working on Flink in 2021. Did you start on Restate right away? How long have you been working on Restate?
It depends a little on when you start counting. Some of the basic ideas of it we'd been playing around with in the backs of our heads for a while. At the end of 2021, I left Flink. I took a longer parental leave in between, and then went to work on Restate properly in the second half of 2022. We started by prototyping a lot. I would say the actual Restate development started around the end of 2022, so a bit over a year ago.
Okay. And you said "we" started working on it, so how big is the current team working on Restate?
Yeah, so we're nine and a half people: nine people, and somebody is at 50%. We're still a fairly compact team, because we're iterating a lot on the foundation still, which I think works great in a small team.
Yep, cool. Are you based all over the world? I know you're in Germany, I believe. Are most people in Germany, or where's the team based?
I'm based in Germany. The team is based sort of all around, mostly in European or similar time zones at the moment. We have some folks in the UK, Italy, and South Africa, and one person on the US East Coast. So we're somewhat all over the place, working as a fully distributed company. But for the early days that we're in, we try to keep it to time zones with enough overlap, so that when big discussions are necessary, we can actually get in front of the screen together for a while, or somebody can fly in on relatively short notice and we do a workshop and so on. That's what we try to optimize for.
Very cool. Well, I just love learning about this. I love what you did with Flink, and I've followed you on some of that stuff for a while, so it's cool to see this new project. If people want to know more about you or about Restate, where should they go to find you?
Yeah, I mean, the Restate website is maybe the best first place to start: restate.dev. We do have a Twitter account as well, restatedev. I think those are the places where we usually share our stuff. I guess we're not young enough to have an Instagram account or something like that.
Yeah, that's all right. I don't either. My kids, maybe, someday. But yeah, anyway, thanks for coming on the show. I love this stuff, I think it's super interesting, and thanks for walking through what's going on in that runtime and some of the distributed systems stuff. I think it's really cool. So best of luck to you and the team going forward. Excited to see this space continue to develop.
Awesome. Yeah, thanks so much for having me. It was a lot of fun. Thanks.
Cool. Thanks.