Software Huddle - Durable Async/Await with Stephan Ewen of Restate

Episode Date: January 30, 2024

Today's guest is a legend in the distributed systems community. Stephan Ewen was one of the creators of Apache Flink, a stream processing engine that took off with the rise of Apache Kafka. Stephan is now working on core transactional problems by building a durable async/await system that integrates with any programming language. It's designed to help with a number of difficult problems in transactional processing, including idempotency, dual writes, distributed locks, and even simple retries and cancellation. In this episode, we get into the details of how Restate works and what it does. We cover core use cases and how people are solving these problems today. Then, we dive into the core of the Restate engine to learn why they're building on a log-based system. Finally, we cover lessons learned from Stephan's time with Flink and what's next for Restate.

Transcript
Starting point is 00:00:00 How big is the current team working on Restate? So we're still a fairly compact team, because we're iterating a lot on the foundation still, which I think works great in a small team. What is durable async/await? Where would I sort of use it in my application? It's actually a very, very broad and general,
Starting point is 00:00:20 but powerful programming paradigm. The idea of Restate, and durable async/await in general, is that you add durability around certain types of operations, be that RPC calls between services, external API calls, timed waiting, background work, and so on. By adding durability around these operations, it actually makes it very easy to recover from failure. So it kind of eliminates a lot of the issues you have to deal with in distributed systems.
Starting point is 00:00:53 Are people moving more towards, hey, I just want a managed service that does this for me? Even if I have that safety of open source, most of the time I want that managed service. Have you seen a big change since, I think, '09, when you started working on Flink? What does that sort of market look like? Yeah, that's actually a good question. Hey folks, this is Alex. Today's guest is Stephan Ewen, one of the founders of Restate. Restate is a durable async/await framework type system, right? So if you have some sort of workflow or process you need to do that needs to call a bunch of services, maybe add retries and error handling and rollbacks and all that sort of thing, Restate helps you do that in your programming language of choice. So TypeScript, Java, something
Starting point is 00:01:33 like that, rather than using something like Step Functions, you know, where you're sort of mixing between your Lambda functions but also the Step Functions DSL. So I thought this was a really great episode on sort of what problems it solves, how it works, some of the distributed systems stuff behind Restate that was really interesting. If you recognize Stephan's name, he was one of the creators of Apache Flink, right? So this stream processing framework that really took off with the rise of Apache Kafka. And, you know, he worked on that 10 or 12 years, saw some really cool things, and now he's been working on Restate for the last two years. So really cool distributed systems stuff we talked about here. As always, if you have any comments, questions, guests you want to see, anything like that, feel free to reach out to me or Sean Falconer
Starting point is 00:02:14 with any of that. And with that, let's get to the show. Stephan, welcome to the show. Hey, hi Alex. Yeah, I'm excited to have you here because you're kind of a distributed systems, I don't know, legend, or at least someone I know and look up to in the area, based on previous work you've done with Flink, and now you have a new exciting project, Restate, that we're going to talk about a lot today. For people that don't know, can you give a little background on you and what you're working on lately? Yeah, of course. Thanks for having me, and thank you for the kind words. Yeah, so I think most of my professional life I've been working on what became Apache Flink. I actually started out as a database person originally, working on things like query execution, query optimizers, and so on. After I graduated, we took sort of a lot of the academic
Starting point is 00:03:08 work that I worked on, started an open source project, Stratosphere, which became Apache Flink, and yeah, grew into what is now Apache Flink. So after a few detours, like, you know, happens in startup and open source land, we sort of settled on stream processing, unified batch and stream processing. That's where I spent most of my professional life before 2021,
Starting point is 00:03:34 which is when I sort of phased out from the work on Apache Flink and stream processing. And we slowly looked at the first experimental steps of what now is Restate, which I think is what we're going to talk about today. So previously I was working mostly on event stream processing for analytics, and Restate, interestingly, is almost the complementary part of Flink in my mind. It's a little bit like event streaming for transactions. At least that's how it works underneath the hood. It's not how I guess we think about it in the end,
Starting point is 00:04:06 but that's at least where it comes from and how it's built. Yep, absolutely. And I'm sure Flink was a really exciting time where a lot of stuff was moving to streaming with the rise of Kafka, and then how do we process those streams, especially in more complex ways, and Flink did a lot with that. I guess, how did you get interested in
Starting point is 00:04:26 this space, like more sort of durable async/await or whatever you want to call it? How did you get the idea for Restate? Was it based on your work with Flink, was it something else, how did that come up? Yeah, I think it's kind of two things that came together. I mean, on the one hand, a little bit just the frustration in general at the state of the art when it comes to building reliable APIs, like reliable distributed applications. But at the same time, also seeing we're not the only ones with that frustration. We actually saw in the Flink community a bunch of folks that were somewhat abusing Flink to build reliable event-driven applications for more transactional-style applications. And that's not really what Flink is built for. You can kind of make it do a bit of that work because it has a very high consistency
Starting point is 00:05:13 model internally that you can try to connect other systems to and make it kind of coordinate distributed transactions. But it's not actually great. It's a little bit like trying to abuse your OLAP database for transaction processing. But it kind of showed us that apparently there's nothing really there. So folks were turning to these kinds of solutions, using systems not built for these types of problems. And when we said, okay, we have issues with these problems, and other folks are abusing our analytical technology for transactional processing because they seem to have these problems,
Starting point is 00:05:43 maybe there's something there and we should start looking at that. Absolutely. So for those that aren't aware of this space and Restate, what is durable async/await? Where would I sort of use it in my application? Right. So it's actually a very, very broad and general but powerful programming paradigm. The idea of Restate, and durable async/await in general, is that you add durability around certain types of operations, be that RPC calls between
Starting point is 00:06:17 services, external API calls, timed waiting, background work, and so on. By adding durability around these operations, it actually makes it very easy to recover from failures. It kind of eliminates a lot of the issues you have to deal with in distributed systems. Like, how do you figure out, if for example you're updating different systems that you want to keep in sync, and some stuff has happened but the other not, what went through and what didn't? How do you make sure you don't re-execute the previous parts, but always make sure to eventually also execute the parts that didn't initially go through, and so on, so you get a consistent view in the end?
Starting point is 00:06:57 It helps with all sorts of those problems. So anything where you talk to different systems, where you need reliable background work or elaborate process-to-process communication, it's geared at solving those problems. And the core idea is very lightweight durability added to the... if you come from a TypeScript world, the easiest way to think about this is the promises. A lot of this is written in async/await style. You make a call to another system, you await the response, and so on, and around these async/await operations you add a lightweight form of durability that helps you recover.
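To make that promises analogy concrete, here is a minimal TypeScript sketch of the idea. The `DurableContext` interface and its method names are hypothetical, invented for illustration rather than taken from Restate's actual SDK; the point is just that every awaited operation is backed by a journaled, durable promise.

```typescript
// A minimal sketch of durable async/await, with hypothetical names
// (DurableContext and its methods are illustrative, not Restate's real API).
interface DurableContext {
  // Journals the closure's result; on retry, the recorded result is
  // replayed instead of executing the closure again.
  run<T>(fn: () => Promise<T>): Promise<T>;
  // Durable RPC: the request is logged before it is sent, so it
  // survives a crash of the calling process.
  call<T>(service: string, method: string, payload: unknown): Promise<T>;
  // Durable timer: the wake-up is persisted, so the handler can be
  // suspended and resumed when the timer fires.
  sleep(millis: number): Promise<void>;
}

// Reads like plain sequential async/await code, but every awaited
// step is a durable promise that recovery can resume from.
async function welcomeUser(ctx: DurableContext, userId: string): Promise<void> {
  // External API call wrapped in durability: its result is recorded,
  // so a retry of the handler does not hit the API again.
  const profile = await ctx.run(() =>
    fetch(`https://api.example.com/users/${userId}`).then((r) => r.json())
  );
  // Durable RPC between services.
  await ctx.call("email", "sendWelcome", { to: profile.email });
  // Durable timed wait: survives restarts of this process.
  await ctx.sleep(24 * 60 * 60 * 1000);
  await ctx.call("email", "sendFollowUp", { to: profile.email });
}
```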
Starting point is 00:07:39 Awesome. And, you know, what are people currently using to solve these types of problems? Is it just certain patterns in their application, are there other tools that help them with this, or what sort of tools are people putting together to solve this? Yeah, I think there are different ways that folks approach this right now. I'd say a lot of it is not solving that specific problem, but just trying to work around it in a way: trying to design applications in a very specific way so you can try and make certain pieces idempotent, and then just retry forever until it goes through, and just hope everything else cancels each other out,
Starting point is 00:08:15 which sometimes works, sometimes doesn't. Sometimes folks use workflow engines for this. I think this kind of problem, making sure stuff runs to the end and all steps get completed, is kind of a classical workflow engine topic, actually. It's just that there's, I think, a very big space where you do want this kind of behavior, but it's just not such a big and heavyweight sort of thing. It's maybe just a lightweight process that orchestrates between two other processes, so you don't actually want to pull in a heavyweight workflow engine. And that's when it actually starts,
Starting point is 00:08:46 and you have to come up with all of your own transaction protocols and retries and deduplication semantics and so on. So I would say it's kind of a mix of both. If it's a sufficiently complex process, maybe you pull in a workflow engine. If it's a lightweight process, you build your own protocol. Maybe there's something
Starting point is 00:09:02 in between, where folks use message queues, SQS, Kinesis, Kafka, and so on, to schedule these tasks in a more durable way. But that still involves a lot of your own design of the requests and how to deduplicate them, how to tie different IDs and tokens together, and so on. So yeah, all in all, I would say it's a lot of roll-your-own. And this roll-your-own is, I think, tougher than it initially sounds. I'm always surprised at the corner cases that come up that are not that easy to spot in the beginning. I think that's why roll-your-own ultimately is something that has many problems when folks do it. And it's also something that increasingly more folks realize is actually very hard and try to avoid,
Starting point is 00:09:51 and they try to look at different solutions. So I think that's also part of why workflow engines recently have become more popular. And you can see this, for example, in the rise in popularity of Step Functions and so on.
Starting point is 00:10:24 Yep, yep, absolutely. So just to put it in an example everyone can understand: the example I think of, and I think it's in your docs as well, is just a checkout workflow on an e-commerce site, right? Where someone hits checkout, and now you need to, you know, decrement the inventory, maybe you need to process a payment, queue that up for an order, or at least write some sort of order record. Like, you need to hit a bunch of
Starting point is 00:10:38 different systems or databases or things like that, right, to make this happen. And like you're saying, you can just sort of hit them in order in your application, and maybe wrap them in retries, and sort of hope it works. And occasionally one falls through, and, you know,
Starting point is 00:10:57 now you've got to manually fix that. Or, in the world I see now, I see a lot of Step Functions wrapped around that, which has its own, you know, pros and cons. I think that'd be the workflow solution that you're talking about. You're talking about adding durability to this. What does that specifically look like? When you say, you know, Restate sort of adds durability to this, where does that durability come in? Especially since I see Restate integrates with Lambda, which I don't think of as having durability or anything like that at all. What does that actually look like, to add durability to this process?
Starting point is 00:11:16 The way it works in Restate is you have basically two parts. You have a small SDK and you have a server component. You can think of the server component a bit like something in between a log, a database, and a very lightweight workflow engine. It's a stateful component that maintains a durable log internally. It also acts like a proxy for a lot of things that happen in your application. For example, again, the simplest way of using Restate: you're writing sort of an RPC handler and you want to say, okay, make that RPC handler durable for me. You'd actually register that at Restate and say, you know, Restate,
Starting point is 00:11:54 become the proxy for this one. Then, as you actually call this through Restate being the proxy, you have this connection between your RPC handler and the Restate server. And it basically tracks many operations. If you do an external API call and you wrap it in a Restate operation, or if you make a call to another system through the Restate SDK, it wraps those operations in an internal, durable journal entry that it commits. But it also remembers how this connects to the application. So it first of all records
Starting point is 00:12:29 internally that this operation has happened, and then it also understands how to complete that operation when it's done, and send this back to the process. So in case, for example, something fails in between kicking off the RPC and receiving the response, or maybe, in the case of Lambda, you explicitly want to go away between sending the RPC and getting the response because you don't want to wait, it understands how to connect the response back, complete that promise, and then give it back to a new invocation of the function and say, okay, now continue from here, now that we've completed that promise. So in the end, you program in a sort of familiar, sequential style of request-response RPC, or make an API call and await the response. And Restate
Starting point is 00:13:11 is kind of the proxy for these promises. It sits as a middleman in between, records the creation of them and the completion of them, and facilitates the recovery in case something happens in between.
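Here is what that could look like for the checkout example, as a hedged sketch built on the same hypothetical `DurableContext` from the earlier block (again, not Restate's confirmed API). Each call is journaled by the server before it goes out, so a crashed invocation can be re-run and resume from the journal instead of re-calling the services.

```typescript
// Checkout as a durable handler, sketched against the hypothetical
// DurableContext from the earlier example. Each awaited operation is
// committed to the server's journal, so on retry the completed steps
// are replayed from the journal rather than re-executed.
async function checkout(
  ctx: DurableContext,
  order: { orderId: string; items: string[]; amountCents: number }
): Promise<{ success: boolean }> {
  // Durable RPC to the inventory service; journaled before sending.
  await ctx.call("inventory", "decrement", { items: order.items });

  // Payment wrapped as a journaled step: if the handler crashes after
  // the charge succeeded, recovery replays the recorded result instead
  // of charging the card a second time.
  const payment = await ctx.run(() =>
    chargeCard(order.orderId, order.amountCents)
  );

  // Write the order record; same durability story.
  await ctx.call("orders", "create", { orderId: order.orderId, payment });

  return { success: true };
}

// Stand-in for a real payment API call (illustrative only).
async function chargeCard(orderId: string, amountCents: number) {
  return { chargeId: `charge-${orderId}`, amountCents };
}
```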
Starting point is 00:13:35 Gotcha. So all that traffic is going through this Restate server component type thing, which I want to talk about; I want to talk about the architecture for that and what that looks like. But all that traffic is going there, and it's routing it to the other services you have, which could be Lambda functions, could be a Kubernetes-based service, or whatever it is, just some sort of endpoint that can then handle it. And it's sort of tracking the status of those as it goes through there. Yeah, that's right. It's not unlike saying you're invoking other functions through an event. Like, you know, if you talk Lambda to Lambda and so on, you can do that by enqueuing an event in SQS or something like that. It's not unlike that. It's just that you're not actually writing it like that. You're not writing,
Starting point is 00:14:16 okay, I want to enqueue an event, I need to make sure I enqueue it only once, I make sure I do enqueue it in case something fails. You don't need to worry about the recovery and the deduplication that connects the recovery and the enqueuing all together, because Restate sits a little differently between the components. You don't explicitly enqueue events and so on. It's really the SDK that captures this, creates it and logs it to the system in the background, and always connects it with the current execution and retries and so on, to make sure things happen only once
Starting point is 00:14:42 and don't get duplicated. So in the end, yeah, you can think of it as an event-driven system. It's like an EventBridge or SQS or something sitting between applications. But you program it in a much more high-level way, so that in the end it looks like you're really programming sequential RPC with durable promises.
Starting point is 00:15:10 Gotcha. On that sort of... you mentioned EventBridge. Does it have the decoupled producers and consumers concept that you would see in EventBridge, where a producer might just fire something into EventBridge and then whoever wants to can be subscribed to that thing? Does it have that sort of concept, or is it more sort of intentional RPC, or at least message-passing type stuff? Yeah, it is,
Starting point is 00:15:28 at least in the current version, more this intentional RPC, though with a few nice additions. You can think of it as request-response, or one-way fire-and-forget,
Starting point is 00:15:43 but at least with reliable acceptance and delivery semantics. You can use this to also schedule events and invocations, a bit like you can do with EventBridge, where you say, okay, hey, you know, maybe this is an invoicing process that happens only once per month, so I want to schedule this for the next month, and then when it happens, let me schedule it again for the month after, or something like this, or just on a periodic timer or so. So it's not a classical pub/sub, and there's no reason for us to not expose that at some point, but at least at the moment it's not a pub/sub in the way of: here's a bunch of producers and a bunch of consumers. It's really more a facilitator for really direct, RPC-style communication at the moment.
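A sketch of that recurring-invoicing pattern, using the same hypothetical context as before plus an assumed delayed-call method (again illustrative, not the SDK's confirmed surface):

```typescript
// Recurring monthly invoicing via durable, delayed self-invocation.
// sendAfter is a hypothetical "schedule this call for later" method;
// the durable timer lives in the server's log, not in this process.
interface SchedulingContext extends DurableContext {
  sendAfter(
    delayMillis: number,
    service: string,
    method: string,
    payload: unknown
  ): Promise<void>;
}

const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

async function runMonthlyInvoicing(
  ctx: SchedulingContext,
  customerId: string
): Promise<void> {
  // Do this month's invoicing as a journaled step.
  await ctx.run(() => generateInvoice(customerId));

  // Re-schedule ourselves for next month. The scheduled invocation is
  // durable: it fires even if this process is long gone by then.
  await ctx.sendAfter(THIRTY_DAYS_MS, "invoicing", "runMonthlyInvoicing", {
    customerId,
  });
}

async function generateInvoice(customerId: string): Promise<void> {
  // e.g. compute usage and call a billing API (illustrative stub).
}
```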
Starting point is 00:16:21 Gotcha. Yeah. And it is interesting, I see a lot of people use EventBridge for sort of that direct messaging, even though it's not quite meant for that; they just don't know what other abstraction to use for it. But it's interesting, because it seems like Restate can replace a lot of other AWS primitives that I often use. I thought of it initially as, hey, this sort of is a Step Functions replacement. But it also sounds like SQS: I'd probably be using less of that if I have something like this.
Starting point is 00:16:56 EventBridge Scheduler, like you're saying: if I need more dynamic scheduling, do this in 15 minutes or three days or at this particular time, I can do that with Restate. It could even replace some of the DynamoDB Streams usage I have, where often I have sort of the dual-write problem, right? Where you want to write a record to the database and then enqueue a message or do something else, and you can either do that in your same process, which is kind of finicky, or you can write it to the database and then process it asynchronously with DynamoDB Streams or whatever change data capture. But now Restate can help handle that specific issue. Yeah, yeah, I think that's right. And, I mean, the way I would think about it is the following, right? So many of these primitives exist
Starting point is 00:17:44 because there are so many different challenges you have in a distributed application. And I mean, distributed application doesn't even mean fancy distributed. As soon as you have your own process and you're talking to one other process and maybe a database, you have a distributed system, right? And you have a lot of the problems already, right? And there are so many classes of problems. You just mentioned the dual-write problem, right? Then there's the problem of, you know,
Starting point is 00:18:10 making sure you have idempotency, making sure you even have recovery. There's the issue of building circuit breakers and so on; I think that's why so many systems exist. And then there's, of course, sender and receiver decoupling, right? If you're doing an operation that takes longer on the receiver side, you absolutely don't want this thing to be completely lost if the connection breaks, or the sender fails, or the receiver goes away in between. You really want to decouple all of those things. And because there are so many individual things you actually need to
Starting point is 00:18:48 worry about, I think there are so many products and so many solutions. And one of the ideas of Restate was to take a step back and look at this in a slightly more fundamental way: what is really the thing that causes so many of these issues? And then build a very broadly usable set of primitives for that. And I think because of that, it says, okay, if you use it, you probably need a bit less of that and that and that, because it actually helps with a bunch of problems, not just with one problem.
Starting point is 00:19:17 Yeah, absolutely. OK, so I'm sure you've paid attention to this space better than I have, because I mostly know Step Functions. And if I could sort of compare and contrast Step Functions with Restate, it would be: Restate lets you sort of stay in your, you know, TypeScript land or Java land, or whatever your programming language of choice is, when you have these sort of distributed things going on. Whereas if you're in Step Functions, you might say, hey, there's this first step, it's going to do something, and it's going to throw an error or throw some sort of result,
Starting point is 00:19:48 and you're sort of kicking between your programming language, maybe in a Lambda function or something else, and then back to more of an infrastructure-as-code or some sort of DSL for processing through the states. So that would be the difference I'd say between Restate and Step Functions. What about some other ones I hear about? Like, Temporal came out of Cadence. Is Temporal in the same category as Restate? Is it a different type of thing? How does that compare or contrast? Yeah, I think that's interesting.
Starting point is 00:20:19 It definitely is in a similar category. Temporal is a durable execution framework; I think that's how they at least used to describe themselves for a long time. For Restate, definitely, durable execution is a core part of what it does. I think Restate takes a bit of a different approach, because it doesn't look at: okay, here's a durably executed workflow, and now you try to connect everything to that.
Starting point is 00:20:47 It really tries to take more of, you could say, a microservice architecture approach, or even a slightly simpler approach, where we say it's not specifically about building workflows. It's about building durable event handlers, durable RPC handlers, stateful RPC handlers. There's a component that actually lets you attach state directly to the handlers and keep that transactionally consistent with the invocations. And I think that gives it a bunch of different characteristics than Temporal.
Starting point is 00:21:25 And while I think there's some overlap, I think Restate really is a bit more general, geared more towards general distributed applications rather than the specific workflow category. That being said, you just said that you compare it mostly to Step Functions. I think that's actually a great category, because to my mind, Step Functions is also... it's like workflows, but it's a little bit more. You don't think of Step Functions necessarily as just the long-running workflows, running for days with human tasks in the middle. You can also use it for fairly low-latency, fast orchestration tasks and so on. And I think that's also the kind of category that Restate is in. It's not just heavyweight workflow; it's also
Starting point is 00:22:11 very lightweight, fast, distributed coordination and orchestration. So you might use Restate... do you call it a workflow? What do you call it? If I have a handler that uses this, what would you call that? It's actually interesting; we're still debating the best way to name it. At the moment, we just call it a durable handler. Okay. So I might have a durable handler that calls a bunch of systems, but that could still be in the synchronous path of a request from a client or something like that, that goes through a bunch of different
Starting point is 00:22:43 things. I could still get a synchronous HTTP response back to my client sometimes when using Restate? Yeah, that's right. So for the individual operations in Restate, we're paying a lot of attention to building them in a way that they're quite fast, with very low latency. So when you, for example, have different processes,
Starting point is 00:23:09 even if they both run on Restate, and you're kicking off one handler by an RPC call from the outside, you're making this call, then another one, and so on, you're waiting on a response to get back, calling another one... it's in the order of milliseconds every time you're hitting the different processes. It's actually built to be
Starting point is 00:23:26 a very fast append to a log, and that's it. That allows you to do a bunch of steps in the synchronous path, which is really where we want to see this thing go. Because if you actually have this ability to add workflow-ish durability to handlers with low latency, you reach a lot of things where it's not possible to use workflow engines today, because they tend to just take too long.
Starting point is 00:24:02 You actually make it possible to use this, and all of a sudden you have a big class of use cases, of types of patterns, where you can benefit from this very, very nice tool that makes a lot of these problems just go away, which you couldn't before, because it was latency-prohibitive to use those types of tools. So maybe one interesting example, the one that we're also using on our website and in the blog post, is the classical Step Functions demo,
Starting point is 00:24:34 the holiday reservation, where we have a process that takes care of orchestrating everything: talking to flight reservation, car booking, hotel reservation, and so on. It also implements all the control for what happens if something goes wrong. It actually implements the Saga pattern in the end, and undoes the previous steps in case an error occurs. You can actually run this in a synchronous response path.
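A sketch of that travel-booking saga as a durable handler, again using the hypothetical context from the earlier examples. The compensation logic is ordinary code; durability is what makes it safe, since a crash mid-saga resumes with full knowledge of which steps completed.

```typescript
// The Saga pattern inside one durable handler: book three services,
// and on failure run the compensations for whatever already succeeded.
// Service names and payloads are illustrative.
async function bookTrip(ctx: DurableContext, tripId: string): Promise<void> {
  // Compensations accumulate as steps succeed; because the handler is
  // durably journaled, a crash and re-execution rebuilds this list
  // deterministically from recorded results instead of re-booking.
  const compensations: Array<() => Promise<unknown>> = [];
  try {
    await ctx.call("flights", "reserve", { tripId });
    compensations.push(() => ctx.call("flights", "cancel", { tripId }));

    await ctx.call("cars", "reserve", { tripId });
    compensations.push(() => ctx.call("cars", "cancel", { tripId }));

    await ctx.call("hotels", "reserve", { tripId });
    compensations.push(() => ctx.call("hotels", "cancel", { tripId }));
  } catch (err) {
    // Undo completed steps in reverse order; each undo is itself a
    // durable call, so compensation also survives crashes.
    for (const undo of compensations.reverse()) {
      await undo();
    }
    throw err;
  }
}
```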
Starting point is 00:25:00 Assuming at least every individual service is reasonably fast in responding, the Restate component adds very moderate latency overhead to this. Yep, very cool. Okay, so I want to know more about, I think you call it the Restate runtime, but basically the server component that's handling this flow: serving as a proxy, handling the durable state, and things like that. What does that component look like? Right. So the server component, just to quickly recap, is the one that sits as an intermediary between all the different endpoints. And it's the one that's
Starting point is 00:25:39 responsible for making sure that you can actually program distributed programs that execute and recover from failures in an almost simple, sequential RPC style. So, kind of pretending failures don't happen, you can embed the control flow, the failure-handling flow, just like that. That's actually, in a way, the biggest difference to Step Functions, if you wish. On the Step Functions side, for every individual bit that you care about, let's say between two parts where you care about failure handling, you basically go and say: okay, once this part is completed, I have to make sure it doesn't get re-executed, and I need to have the result of this durably committed before I'm starting the next
Starting point is 00:26:26 thing. Every time you have this, you pull it apart into different pieces and orchestrate it with something like a Step Functions flow across it. And in Restate, you don't do this. In Restate, you basically just write the control flow down and let the durability of the individual steps take care of it. And the runtime
Starting point is 00:26:42 is the piece that sits as a proxy, or sits in the background, and accepts the stream of messages that facilitate the communication, that persists the creation of the promises, and so on. As for what it looks like when you use it yourself: it's just a single binary that you bring up. It's actually fairly compact. It has everything it needs in there: a log, a state storage engine, a registry for schemas, functions, communication, and so on. So it's a fairly compact, self-contained bit. It maybe feels a little anachronistic, but then at the same time it doesn't.
Starting point is 00:27:25 I think there's a bit of a swing back into components being built that way. From "everything is just a set of custom resource definitions on Kubernetes," we're going back to: okay, things actually are standalone binaries that do their thing without requiring a hundred other things. So Restate is also that. It's a simple thing. You start it, it runs there, almost like a lightweight database. And then in your applications, you embed the simple SDK, almost like the database client. You connect to it, and then it acts as your proxy or
Starting point is 00:27:57 broker for the communication. That's how it works. So in that sense, it's not actually wrong to think of it as taking the place of maybe the SQS queue between your services, or taking the place of the Step Functions coordinator. But internally, it has a very specific architecture. It's a log-based system, with an engine that asynchronously interprets commands, materializes indexes over the log, and handles the network communication, the invocations, event dispatch, and all that. I'm always trying to find simple ways to compare it,
Starting point is 00:28:42 but you can think of it a little bit as: it's a log, a bit like a Kafka or Kinesis log, with an event processing application built into it that works on the events on behalf of your application, and then basically only talks to your application in a very high-level way, as in: okay, I've done this for you, I've got a response here for you based on that. And it keeps the ground truth of everything that's happening in this log and command processor. And that really is where the simplicity of the correctness model comes from, because it makes the application mostly just say what should happen,
Starting point is 00:29:28 and lets the command interpreter, or the central log, make it happen for you, and then listen to the response for that. And that gives you this nice correctness out-of-the-box experience. Okay, sounds good. So, trying to understand it, and I definitely don't know as much about distributed systems as you: you've got this server component where it's sort of receiving these RPC calls,
Starting point is 00:29:52 or maybe even state-setting calls, right? Like, if I'm a handler, I can sort of set some context, or just little bits of state for that particular function. It'll handle that; it's going to persist it to that durable log you have, plus maybe some materialized data, like in, I believe, RocksDB, just to make that a little more queryable and usable
Starting point is 00:30:15 for resuming stuff. And then it's also proxying that to the different other handlers you have. Am I understanding that correctly? Yeah. Yeah, that's right. Let's maybe work through this
Starting point is 00:30:30 as an example, and really show what happens in the runtime. Then we can also explain why it's built the way it is. Let's maybe take this holiday reservation,
Starting point is 00:30:46 holiday booking example for a moment. So we have the general main function that has the workflow that talks to the different services. It gets executed, proxied through Restate. So Restate has an event in the log that says, okay, I want to run that workflow. That part's not too unlike, I think, how most other workflow engines work;
Starting point is 00:31:12 they write it in the write-ahead log. So Restate dispatches that. And then as you go through it, as, for example, the code says, hey, I need to now talk to the flight reservation service first, it basically streams an event in the background to that server component. And the first thing that server component does is it actually appends the event. And then it has established the ground
Starting point is 00:31:37 truth: okay, we've reached that part of the program, we've sent this bit. So it is, first of all, just a very cheap append to the log. But what happens at the same time is that there's a process that indexes these events. It establishes the relationship that this event we just appended is actually related to the original workflow execution, because it's an event that came out of it,
Starting point is 00:32:06 and it attaches this to a recovery journal for that workflow function, for that handler. And anytime something goes wrong, it can actually use that to re-invoke it and say, okay, do your thing again. But the next time the promise gets created, it actually sees from the recovery log: oh yeah, that promise is already there. It might not be completed, but at least it's there. So let me not send this again, because I already have it. At the same time, at some point, the other service is going to respond with an event that completes it.
Starting point is 00:32:42 And all of these are really meant to be very fast appends, so we actually get that durability very fast. But then it's again indexed into: okay, this is a completion for this specific original RPC event, so let me attach it there, augment the original promise with its completion result, failure or success, and so on. And then, if it's still executing, just, you know, forward it to that process; otherwise keep it there, and the next time it executes, it can be resupplied as a completed promise to the executing function.
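That recovery-journal behavior can be sketched in plain TypeScript, just to model the mechanics described here (a toy simulation of the idea, not Restate's implementation): on re-invocation, journaled entries are replayed instead of re-sent, and unresolved ones simply wait for their completion event.

```typescript
// Toy model of the recovery journal described above (not Restate's code).
// Each durable operation gets a deterministic index in the journal; on
// re-invocation, recorded entries are replayed instead of re-executed.
type JournalEntry =
  | { state: "pending" }                 // promise created, no result yet
  | { state: "completed"; result: unknown };

class Journal {
  private entries: JournalEntry[] = [];
  private cursor = 0;

  // Called for each durable operation, in program order.
  record(send: () => void): JournalEntry {
    const index = this.cursor++;
    if (this.entries[index] === undefined) {
      // First execution: append to the log and actually send the event.
      this.entries[index] = { state: "pending" };
      send();
    }
    // Replay: the entry already exists, so do NOT send again (dedup).
    return this.entries[index];
  }

  // Called when the other service's completion event is indexed.
  complete(index: number, result: unknown): void {
    this.entries[index] = { state: "completed", result };
  }

  // A crash resets the cursor; the entries (the ground truth) survive.
  restart(): void {
    this.cursor = 0;
  }
}
```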
Starting point is 00:33:28 So those are basically the two or three most important pieces we have in the system. The first is the log, built for fast appends, fast durability of operations. RocksDB is more like an index that tracks the relationships between events; and, if you wish, state that you've attached to a handler is just a specific sort of event that then gets attached to the identity of that handler. So the keyed RPCs are really specific forms of events that get attached to an execution of a handler, and so on. That's the second part. And the third part is the component that manages the communication with the services. Maybe a quick note on why we built it that way. In theory, we could have also built it as: we use Kafka as a log, or Kinesis,
Starting point is 00:34:14 and then we store things in DynamoDB, and so on. I think maybe let's take a step back and look at what that means from a distributed systems perspective, if you actually do it like that. Let's just take Kafka and DynamoDB, and we actually piece together those two systems. What you now have is actually two logs involved. You have Kafka as a log, and then DynamoDB: every database's core truth
Starting point is 00:34:44 is its own write-ahead log. That's where it records the things that, by its definition, have happened. Now, if you say, okay, I'm writing an event to Kafka and to DynamoDB, you have a problem: you actually have two logs that need to be kept in sync. And you need to implement, if you wish, at least a poor man's transaction coordinator that says: okay, you know, maybe after I've appended to Kinesis, if I crash, I have to look at this and see, have I moved it to the other system or not? What if I find something in that other system and not here? You're already at, I would say, the core of where most of the distributed
Starting point is 00:35:20 systems complexity comes from: keeping different components in sync. The beauty, I think, of the way that we've built this, with just one log and everything else being an asynchronous process that follows this log, is exactly one ground truth. Something is appended to the log or not. That determines, from Restate's perspective, did it happen or not? And everything else is a view that follows. And it's very easy to let that view follow consistently, especially if it's the same component that manages it. That actually gives us two very nice characteristics. First of all, it's a fairly easy thing to make correct.
Starting point is 00:36:04 Whenever you have different systems you need to keep in sync, you're going to hit so many corner cases over time where you need a more elaborate protocol that deals with that. We don't actually have that. It's fairly simple to get it correct. And one of the interesting outcomes of that was that
Starting point is 00:36:19 fairly early versions of the system, when we really stress-tested them with very heavy, Chaos-Monkey-style tests that had complex logic that cares about consistency and correctness, and just shot things down, introduced disruptions and corruptions, it was fairly hard to actually make it trip up. For the early stage that it's at, it's fairly resilient. That's one of the really nice outcomes. The second one is it's also fairly easy to operate, because it doesn't run the risk of: okay, what if
Starting point is 00:36:55 the log actually executes faster than DynamoDB can take the inserts, or, yeah, what if this one has a compaction and then the other one has one at a different time, and they can both stall each other out, and so on. Because we care so much about the system being easy to operate, this architecture lends itself to that very much. It's an architecture that has comparatively few knobs, and few things that can drift apart from each other. That's what we liked about it a lot, and that's why we decided to do it that way. Okay, that was a very long speech.
Starting point is 00:37:29 No, that was great. Just thinking as a user of a piece of open source infrastructure: if I know I have to go set up not just this infrastructure and start running this binary, but also send in these credentials to this DynamoDB table and this Kafka cluster and all that sort of stuff, it's just like, okay, now I'm managing three things, and I don't really have that much control over how they interact. It's just nicer to have that more in that single-binary type situation.
Starting point is 00:37:57 Exactly. And maybe we haven't actually said this before: the way we really want people to use the system is, we're working on a managed service, and of course we want to have a service that's easy to operate for ourselves. But we're also publishing this as an open source system. We want folks to be able to run it themselves, at least in their own accounts and so on, if they need to, in an easy way, in a way that doesn't require too complex operations.
Starting point is 00:38:28 And maybe that's something we learned a little bit from our previous work in the space, Flink and Kafka and all the other systems that connect to them: it's very easy to get into trouble even if you think a system is super well behaved. Every system by itself makes sense, but there are just different assumptions they make whenever they interact: slightly different intervals in which they refresh
Starting point is 00:38:53 their credentials, slightly different ways in which they have their bursts, and how they react to bursts and how they can stomach that, and so on. If you combine many complex systems, you have something that is very easy to trip over, even if each system by itself is actually very well behaved. And that's what we wanted to get away from. That's why we're putting it all into one. Yeah.
Starting point is 00:39:20 Based on your experience with Flink, and especially with Flink for a long time, and now talking to customers, I'm sure, with Restate: are you seeing a change in how people... are people moving more towards, hey, I just want a managed service that does this for me? Even if I have that safety of open source, most of the time I want that managed service. Have you seen a big change since, I think, 2009, when you were working on Flink? Or what does that sort of market look like? Yeah, that's actually a good question. So definitely the acceptance of managed services is way up there, and also the wanting to use a managed service.
Starting point is 00:39:57 Flink, when we started, was absolutely all on-prem. In the end, there were so many folks asking for managed services. But there's no such thing as the managed service; there are so many different paradigms. You can have a multi-tenant
Starting point is 00:40:15 managed service that is shared by many users. You can have a managed service that has dedicated resources and isolated processes per tenant. You can have a managed service that runs some parts in maybe the vendor's account, keeps a lot of maybe the control plane or some very low-level storage primitives there, and then runs the sensitive
Starting point is 00:40:38 parts in the user's account, to make sure that, let's say, at least the unencrypted data, or the direct connection to their application services and so on, doesn't leave their account. There's this wide spectrum. So I'd say it depends on what kind of user you're talking about. The long tail is more on the really fully managed side of things. But there are still a lot of big use cases that need at least dedicated clusters, or need these clusters to run in their own accounts, and so on. Building it to make this easy and possible, I don't think it hurts, to be honest.
Starting point is 00:41:18 Yeah. OK. I want to backtrack to something you said a while ago, because it was in my head. So you were talking about the process of what's happening on the server. Like, if I want to call some other service, it's going to write that log event,
Starting point is 00:41:29 and then when that service responds at some point, that'll have a different event and mark that. So does that mean that, if I'm using Restate, both services have to be using Restate for that sort of thing? Can I use Restate around calling an external API that I don't have control over, or even calling out to Postgres or something like that that's not going to be fully integrated into the Restate ecosystem? Or is it more just internal RPC type stuff that's the core of what you're doing there? I think you can use it for all of those things. I would say
Starting point is 00:42:03 the simplest thing is, I mean, you can make it as simple as a one-step workflow, right? Let's say you have something that comes in, maybe it's something you want to schedule for later or so, and that's why you're routing it through a system like Restate, and then it does exactly one thing: it wants to commit that, it wants to acknowledge that this happened, done.
Starting point is 00:42:24 So you can use it for that. Restate allows you to basically wrap any type of API call in a side effect, a sort of managed promise, if you wish. And that's that. And I think a lot of the simple examples look like that: you have one durable handler with a bunch of calls that it makes, you know, to other services, and that is it. And that's totally fair. And I think the interesting part is if you start going beyond that. If you say, okay, maybe this one handler I have, there are a bunch of other services that it needs to interact with. Let's say you have that example of the checkout process, right?
Starting point is 00:43:12 It calls maybe an inventory service to decrement the counts, it calls a payment service, and so on. And when you start saying, okay, hey, I'm not just using Restate for this one thing, but I'm also using it as part of, not just the checkout flow, let's say, but also the inventory service and so on, then you get a few additional benefits. Because now you have both sides managed by Restate, and that means you get exactly-once sort of communication between them, or things like, you know, retries, idempotency, all of that, completely for free. You can
Starting point is 00:43:56 actually pretend they talk in a completely reliable, exactly-once manner, because both ends are proxied through the same system, and it can work its internal magic to make this nice. You don't have to do this; it's completely valuable for one service. But when you put a few more services onto it, you get a bunch of additional benefits. I think that's where, over time, people like to go.
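That "wrap any API call in a side effect" idea, sketched with the same hypothetical context as the earlier examples (the method name is assumed, not Restate's confirmed API): the call's result is committed to the journal, so a re-executed handler reuses the recorded result rather than calling the external system again.

```typescript
// A one-step durable handler around an external API that Restate does
// not control. The run(...) wrapper journals the result: if the handler
// is retried after a crash, the recorded result is replayed and the
// external API is not called a second time.
async function recordSignup(ctx: DurableContext, email: string): Promise<string> {
  const contactId = await ctx.run(async () => {
    // Hypothetical third-party CRM call -- outside Restate's ecosystem.
    const res = await fetch("https://crm.example.com/contacts", {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ email }),
    });
    const body = await res.json();
    return body.id as string;
  });
  // Anything after this point can rely on contactId being stable across
  // retries, because it comes from the journal, not from a fresh call.
  return contactId;
}
```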
Starting point is 00:44:17 Yeah. Tell me about that exactly-once and how that works. I know there are some blog posts on how exactly-once is impossible, and things like that. What do you mean by exactly-once, and how does Restate solve that particular issue?
Starting point is 00:44:42 Yeah. So, I think that's good. Exactly-once is a very controversial term. Exactly-once in terms of, let's say, invoking a service exactly once is not something you can realistically achieve, by the laws of logic, math, and so on. But that's also not what you really need. You need the effects of the service to happen just once, right?
Starting point is 00:45:07 That's what I think exactly-once has always been about. When you talk about it in stream processing, or even in the context of database transactions, you never talk about kicking something off only once. You talk about it materializing
Starting point is 00:45:24 its effects once and only once, right? And it's often achieved by a combination of retries and idempotency, or sometimes actually a commit protocol at the bottom, which usually has at least one retried and idempotent step in it as well. So I think that's really what it's about. If you have two services, and one is sending its event, it's important to understand that the sending of an event from one service to the other is anchored in the durable execution and the deduplication and so on. So when this is sent, it actually goes as an event in the background to the Restate server,
Starting point is 00:46:00 which appends it to the log. And then you have this single write that acts both as "okay, we've created the promise," but also "we have work to do here," namely: send that message to another service. So we've established an endpoint that sends an event that deduplicates on retries: if that handler fails, it recovers, goes there, and says, okay, you know, I have actually created that promise. So creation and sending happen in a deduplicated manner. On the receiving side, it's the same thing. You receive it, you invoke the handler; you might invoke it multiple times if failures happen in between. But every time you invoke it, you understand the effects that already
Starting point is 00:46:38 happened, you give the completed promises for these effects back, and eventually you conclude it. So what you will not ever see is the event being sent out twice, or resulting in two individual invocations that don't share each other's partial progress. I think that's what it boils down to in the end, and that's really all you care about. In the end, it means that whether one or two invocations happen as part of a retry strategy doesn't matter. What matters is whether applying the effects multiple times changes the result. Yep, yep, very cool.
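A compact way to model that "effects once" guarantee (toy code, with the same caveats as the other sketches): one log append doubles as both the promise creation and the send command, and the receiver keys everything off a stable invocation ID, so redeliveries and retries collapse into a single logical effect.

```typescript
// Toy model of effects-once messaging: sends are deduplicated by a
// stable invocation ID, and redelivered events re-attach to the same
// invocation instead of spawning an independent one.
type Envelope = { invocationId: string; payload: unknown };

class EffectsOnceChannel {
  private log: Envelope[] = [];                   // the single ground truth
  private delivered = new Map<string, unknown>(); // invocationId -> result

  // Sender side: appending IS the send. Retrying the same ID is a no-op.
  send(env: Envelope): void {
    if (this.log.some((e) => e.invocationId === env.invocationId)) return;
    this.log.push(env);
  }

  // Receiver side: at-least-once delivery, effects applied at most once.
  deliver(env: Envelope, applyEffect: (p: unknown) => unknown): unknown {
    const prior = this.delivered.get(env.invocationId);
    if (prior !== undefined) return prior;        // replay recorded result
    const result = applyEffect(env.payload);      // the effect happens once
    this.delivered.set(env.invocationId, result);
    return result;
  }
}
```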
Starting point is 00:47:27 Okay, and then, going back to this runtime thing: if I have this server, am I running multiple instances of it for replication and better durability and things like that? What does that look like? What's a common configuration? Am I running three, five, seven of those? That's a great question. At the moment, you're running one, because that's the version that's out there. We're working on the distributed version, where, I mean, I think you would eventually run whatever your load sort of requires.
Starting point is 00:47:51 I would say starting with three, one in each availability zone of a typical region. So I think that's what we would probably recommend in the future. If you want to run it yourself at the moment, it's one process. It basically persists its log durably by writing it to something like EBS.
Starting point is 00:48:20 So for the simple thing you can do right now, we actually have a few tools around this, like CDK constructs and so on, that basically deploy it as either an EC2 instance with an EBS volume, or, we just saw the release of Fargate EBS support, a Fargate task with an EBS volume. Those are perfectly fine deployments. It actually doesn't really matter whether your task is long-lived; it really only matters that the log survives. The log has to be durable, and that's why it has to be on EBS. Wherever you run the rest doesn't really matter. And in the future, we want to take care of that replication ourselves, for a few reasons.
Starting point is 00:48:52 I mean, cost-effectiveness is one thing, but the other thing is cross-AZ durability. That's not something EBS gives you at the moment. And S3 is still too high-latency to write to, and the low-latency version is, again, only single availability zone, so it also doesn't help there. So that's why we're working towards our own log
Starting point is 00:49:15 replication at the moment. Gotcha. So in some future where you're doing replication, does that mean you'd move away from EBS and do more instance-based storage? Or would you still be using EBS and relying on some of those guarantees there? I'm kind of curious how those trade-offs manifest
Starting point is 00:49:30 for distributed systems builders like yourself. Yeah, that's actually a very interesting question. I would say it depends a little bit on your risk appetite. To me, it's very related to the question of whether you should fsync a log or not. If you don't fsync your log, for example, you're basically relying on instance memory, or so, for the time being. Let me put it like this:
Starting point is 00:50:00 what are the logs really for in a modern architecture? In the end, you very quickly want to go to durability through something like S3 for the majority of your data. You get latencies out of that in the tens to hundreds of milliseconds. If you go to the cross-AZ one, the 99th percentile here really matters; it's probably more like three-digit milliseconds in many cases. In a way, the most important thing you have to bridge is the latency until that point has come, and maybe a bit more, because,
Starting point is 00:50:33 depending on how much you write, you want to amortize costs into bigger writes so you don't write too-small chunks. You have to bridge just this small bit, until you've gotten enough data and enough of a latency window that you can flush it out to S3. This is what you have to bridge. This is what the replication is for. So, is just
Starting point is 00:50:55 replicating to instance storage, or even to instance memory, good enough to bridge that? It kind of depends on what failure rates and what kinds of failure scenarios you assume. Most of the time, you'll probably be all right replicating to memory. You'll probably be good replicating to disk. Whether you actually want to replicate to something like EBS, even in that small window, really depends on what class of failures you're addressing. I had this discussion a while ago with some Kafka people, where I asked them: would you fsync or not?
Starting point is 00:51:33 And what is the worst thing that you lose if you don't fsync? And they said the worst things they typically lose are exactly the cases where not fsyncing matters: somebody accidentally deallocating the entire deployment. Like, you know, you run it on Kubernetes, even with multiple failure domains and anti-affinity and so on, and you don't usually lose data because all of your AZs go down at the same time, but you lose data if some administrator unfortunately deletes the deployment, because then everything goes away in the small window before you can flush it out to the more stable storage. And that's why it might
Starting point is 00:52:08 still matter to some people to persist to something like EBS, because then you still get to defend against those scenarios. If you're happy with the math around failure rates, and you trust your ops team not to take down all AZ deployments within the same second, maybe you're okay with the cost benefits of just replicating to memory. I'm not sure there's one answer for all. Yeah, yeah, it's a tough one, right on that edge there.
Starting point is 00:52:41 So is Restate currently flushing stuff to S3 occasionally for that more durable storage, and getting stuff off just regular disk? We're not doing that at the moment, but it's something we're working towards, which is incidentally also part of why we really like having RocksDB in there. Because if you actually have a replicated log, the LSM tree design is actually a really interesting one for storage engines. Because it has these hierarchical immutable artifacts that are asynchronously merged, it lends itself really well to storage on something like S3, which is good for immutable artifacts. So it's another reason why we'd like to go that way.
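As a toy illustration of why those immutable artifacts are S3-friendly: once an SST-style file is written it never changes, so copying it to object storage needs no coordination with writers. A hedged sketch, where `upload` is a placeholder rather than a real S3 call:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Placeholder for an object-store upload; not a real S3 SDK call.
fn upload(key: &str, bytes: &[u8]) {
    println!("PUT s3://bucket/{} ({} bytes)", key, bytes.len());
}

/// LSM artifacts (e.g. SST files) are written once, never modified, and only
/// deleted after compaction merges them away. That makes tiering them to
/// object storage trivially safe: no file we upload can change under us.
fn tier_immutable_segments(dir: &Path) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.extension().and_then(|e| e.to_str()) == Some("sst") {
            let bytes = fs::read(&path)?;
            let key = format!("lsm/{}", path.file_name().unwrap().to_string_lossy());
            upload(&key, &bytes);
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    fs::create_dir_all("./db")?;
    fs::write("./db/000001.sst", b"immutable segment bytes")?;
    tier_immutable_segments(Path::new("./db"))
}
```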
Starting point is 00:53:30 So you're flushing not just... You might flush the RocksDB segments and things like that to S3, as well as the log segments. Is that what you're saying? Look, we might. We're still in the process of developing this. It could very well be that by the time we do this, it's not actually RocksDB anymore but a different storage tree and so on.
Starting point is 00:53:54 But conceptually thinking, yes, that's where we would move to. It's been fun seeing all the S3 stuff that's happening with data systems. I talked to the WarpStream people recently and just what they're doing with S3 and just how it's sort of changing data infrastructure in a lot of interesting ways, like you were talking about. One thing I thought that was interesting, too, in the docs, so you mentioned RocksDB is sort of your state storage,
Starting point is 00:54:21 but you can actually query it via Postgres, right? You expose it basically via Postgres. What's going on there? Yeah. So RocksDB basically is our state index, but not just state, right? I'll actually just try to emphasize this. RocksDB stores sort of the relationships between all events. In the end: here's an event that corresponds to this workflow execution, here's an event corresponding to that RPC invocation of that thing on Restate, here's an event corresponding to that committed side effect
Starting point is 00:54:57 or so to another system. All of this is events that get appended to the log, but then indexed in RocksDB. And by making this queryable, you can find out a lot about what's going on in the application. Like what's currently running, who's calling whom, who is waiting, who's sleeping, who's waiting on some other function to return, right? That's, I think, the real power. You can almost query the distributed call stack, at least for the services that are on Restate. If you have a bunch of services on Restate calling each other,
Starting point is 00:55:35 you kind of can almost query the distributed call stack between them. And it's not that you're maintaining this synchronously, because the whole thing is just an asynchronous index view. So Restate internally has a query engine based on Arrow DataFusion. But the endpoint behind which we expose this actually speaks the Postgres wire protocol, so you can use a Postgres client to query it.
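For instance, any standard Postgres client can be pointed at that endpoint. Here is a sketch using the `postgres` Rust crate; the port, user, and `sys_status` table name are invented for illustration, so check the Restate docs for the actual introspection schema and connection details.

```rust
// Cargo.toml: postgres = "0.19"
use postgres::{Client, Error, NoTls};

fn main() -> Result<(), Error> {
    // Connection parameters are assumptions for illustration,
    // not Restate's documented defaults.
    let mut client = Client::connect("host=localhost port=9071 user=restate", NoTls)?;

    // Ask the asynchronously maintained index what is currently in flight.
    // `sys_status` is a hypothetical table name.
    for row in client.query("SELECT service, method, status FROM sys_status", &[])? {
        let service: String = row.get(0);
        let method: String = row.get(1);
        let status: String = row.get(2);
        println!("{}::{} -> {}", service, method, status);
    }
    Ok(())
}
```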
Starting point is 00:56:06 Actually, I haven't mentioned this before: DataFusion makes sense for us because the server component is completely implemented in Rust. We're actually a big user of the Tokio framework, which fits very well with our philosophy. We have this log, with sort of a single appender per partition to the log, and that fits the Tokio programming model pretty well. And, being a Rust system, DataFusion just looked like a promising Rust-based query execution engine, which we connect to RocksDB by making RocksDB sort of a virtual table that DataFusion can scan and push certain predicates down to.
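Conceptually, that wiring looks something like the sketch below. The `VirtualTable` trait is invented for illustration; DataFusion's real `TableProvider` API is different. The point is only the shape: the engine hands the table a pushable predicate, and a key-ordered store like RocksDB (played here by a `BTreeMap`) turns it into a narrowed range scan instead of a full scan.

```rust
use std::collections::BTreeMap;

/// Invented stand-in for a query engine's table interface; DataFusion's
/// actual TableProvider trait looks different.
trait VirtualTable {
    /// The engine offers a predicate; the table scans only matching rows.
    fn scan(&self, key_prefix: Option<&str>) -> Vec<(String, String)>;
}

/// A sorted map playing the role of RocksDB (also key-ordered), so a
/// key-prefix predicate becomes a bounded range scan.
struct KvTable {
    kv: BTreeMap<String, String>,
}

impl VirtualTable for KvTable {
    fn scan(&self, key_prefix: Option<&str>) -> Vec<(String, String)> {
        match key_prefix {
            // Pushed-down predicate: seek to the prefix, stop on leaving it.
            Some(p) => self
                .kv
                .range(p.to_string()..)
                .take_while(|(k, _)| k.starts_with(p))
                .map(|(k, v)| (k.clone(), v.clone()))
                .collect(),
            // No predicate: full scan.
            None => self.kv.iter().map(|(k, v)| (k.clone(), v.clone())).collect(),
        }
    }
}

fn main() {
    let mut kv = BTreeMap::new();
    kv.insert("invocation/checkout/1".into(), "running".into());
    kv.insert("invocation/checkout/2".into(), "suspended".into());
    kv.insert("state/cart/42".into(), "{...}".into());
    let table = KvTable { kv };

    // Roughly: SELECT * FROM table WHERE key LIKE 'invocation/%'
    for (k, v) in table.scan(Some("invocation/")) {
        println!("{} = {}", k, v);
    }
}
```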
Starting point is 00:56:51 Yeah, okay, that's really cool. Do you anticipate that will be that Postgres compatible endpoint there? Will that be something that is only used for occasional one-off hardcore debugging type thing? Or will that be something where I might have a dashboard open that's continuously refreshing and just showing the state of my services and traffic and things like that going across? How often do you see that used? It's a very good question.
Starting point is 00:57:18 At the moment, the biggest user of this is actually our own tools. The latest version, I don't think it's officially published yet, but you can already find the release artifacts on GitHub and our website, the 0.7 version, introduces a very powerful CLI that you can use to examine exactly the status of services. Like, how long has something been waiting? Who is it waiting for?
Starting point is 00:57:48 And so on. And that thing internally issues like a lot of queries to that system, right? That's the biggest user of that so far. You can use it to expose something about the state of your system in a dashboard. If you don't do it in a crazy aggressive way, like issue thousands of queries per second might actually work. It's not at the
Starting point is 00:58:17 moment optimized for production query workloads. It's really more for debugging, analysis, introspection-type workloads. But it's a very interesting question. We've seen this be so valuable that we might actually do more work on it in the future to make it sustain bigger loads. That remains to be seen. Yeah, even if you had a read replica that's maybe just pulling down the S3 data.
Starting point is 00:58:48 Maybe it's a few seconds or however long behind, but you can query it and not worry about messing with the hot path of these FMs getting appended and all that sort of stuff. But it just gives you a look at your system and what's happening. That'd be kind of interesting, just to sort of see where that goes. Yeah, that'd be very interesting. And again, like an LSM tree-based architecture
Starting point is 00:59:09 that you asynchronously write into is actually great for that, because you can have a different consumer pulling these artifacts and running a query engine over them and so on. Lots of things are possible, yeah. Yeah, I've seen other tools doing that,
Starting point is 00:59:24 like Rockset has, they call it it compute-compute separation, right? Where they're just pulling those down into different compute instances so you don't have one query here messing with one query over there and stuff like that. Yeah, it makes total sense. I mean, Rockset is the folks that build RocksDB, so a lot of these things that we're thinking about, of course, they're also doing that. It just kind of comes from using Rock Studio. Yeah, cool.
Starting point is 00:59:47 Let's talk about where Restate is, both as a product and as a company and things like that. So is it in a state where I can go out and use it and start running stuff through it? Where are you at in the development cycle here? Yeah, so there are different pieces that are in different stages. We have SDKs out there in TypeScript/JavaScript, Java, and Kotlin that are pretty much good to go. And there's the open source runtime that you can download.
Starting point is 01:00:26 That is usable, but I would just add that it is still under rapid development. That doesn't mean it eats your data; like I mentioned, we're actually pretty happy with the reliability so far, because I think that's one of the things we get from this architecture, the very simple single-log paradigm. But we'll absolutely go through a few iterations of the storage formats and everything. So if you're using it right now, be aware that you might have to run a migration script at some point when you upgrade.
Starting point is 01:01:02 That's definitely a caution I would throw in if you're happy with that um would be we would be happy for you to check it out um we're we're also working on a cloud service which is in in in pretty early stages um and we're um we're looking for folks that that would be interested like early design partners for that. So open source, I would say, try it out as long as you're OK with running migration down the road. Managed service, talk to us. MARK MANDELMANN, Yeah, absolutely. OK, so you sort of stopped working on Flink in 2021.
Starting point is 01:01:43 Did you start on Restate right away? How long have you been working on Restate? It depends a little bit on when you count. Some of the basic thoughts of it we'd been playing around with for a while in the back of our heads. End of 2021, I left Flink. I took a longer parental leave in between, and I really went to work on Restate in the second half of 2022. We started by
Starting point is 01:02:16 prototyping a lot. I would say the actual Restate development started around the end of 2022, so a bit over a year ago. Okay. And how big is the current team working on Restate? Yeah, so we're nine and a half people, so nine people and somebody at 50%. So we're still a fairly compact team, because we're iterating a lot on the foundation still,
Starting point is 01:02:47 which I think works great in a small team. Yep, cool. Are you based all over the world? I know you're in Germany, I believe. Are most people in Germany? Or where's the team based? I'm based in Germany from the team. The team is based all the team is based sort of all around mostly
Starting point is 01:03:07 so European time zones or similar time zones at the moment. We have some folks in the UK, Italy, South Africa, and one person on the US East Coast. So somewhat all over the place; we're working as a fully distributed company. But for the early days that we're in, we try to keep it in time zones that have enough overlap so that, when big discussions are necessary, we can actually get in front
Starting point is 01:03:39 of the screen for a while or somebody might actually fly in and on relatively short notice and we do a workshop and so on. So that's what we try to optimize for. Very cool. Well, I just love learning about this. I love what you did with Flink and I've followed you on some of that stuff for a while. So it's cool to see this new project. If people want to know more about you or about Restate, where should they go to find you? Yeah, I mean, the Rest the reset website maybe is the,
Starting point is 01:04:06 best first place to start: restate.dev. We do have a Twitter account also, restatedev.
Starting point is 01:04:14 I think those were, those are the places that, that we usually, that where we usually share our stuff. Um, where I guess we're, we're not young enough to have an Instagram account or something like this. Yeah, that's all right.
Starting point is 01:04:30 I don't either. So yeah, my kids maybe someday. So, but yeah, anyway, like, thanks for coming on the show. I love this stuff. I think it's super interesting. And thanks for like walking through like what's going on in that runtime and some of the distributed system stuff. I think it's really cool.
Starting point is 01:04:44 So, you know, best of luck to you and the team going forward. Excited to see this space continue to develop. Awesome. Yeah. Thanks so much for having me. It was a lot of fun. Thanks. Cool. Thanks.
