Materialized View Podcast - SKDB and Reactive Databases With Julien Verlaguet
Episode Date: January 18, 2024

I recently talked with Julien Verlaguet. Julien is the founder of SkipLabs, a company building infrastructure for reactive applications. Before SkipLabs, Julien spent 9 years at Facebook, where he was... a tech lead for Hack and Skiplang.

In this interview, we discuss SkipLabs's foundational components: SKDB, SkipStorage, and Skiplang. SKDB is built as a reactive database that features performant materialized views, diffing between databases, and streaming via ephemeral tables. SkipStorage and Skiplang serve as building blocks for SKDB.

By the time you hear this, SkipLabs will have launched, so I invite you to check skiplabs.io and skdb.io for more details. In the meantime, please enjoy the following conversation with Julien.

Oh, and one final note: I apologize for the tin-can sound from my microphone. Since this recording, I've purchased a better mic, and future recordings will have better audio.

Note: I am an investor in WarpStream [$], which I mention in this podcast.

You can support me by purchasing The Missing README: A Guide for the New Software Engineer for yourself or gifting it to new software engineers that you know.

I occasionally invest in infrastructure startups. Companies that I've invested in are marked with a [$] in this newsletter. See my LinkedIn profile for a complete list.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit materializedview.io
Transcript
I recently talked with Julien Verlaguet.
Julien is the founder of SkipLabs, a company building infrastructure for reactive applications.
Before SkipLabs, Julien spent nine years at Facebook, where he was a tech lead for both
Hack and Skiplang.
In this interview, we discuss SkipLabs' foundational components: SKDB, SkipStorage, and Skiplang.
SKDB is built as a reactive database that features performant materialized views,
diffing between databases, and streaming via ephemeral tables.
SkipStorage and Skiplang serve as building blocks for SKDB.
By the time you hear this, SkipLabs will have launched, so I invite you to check skiplabs.io and skdb.io for more details.
In the meantime, please enjoy the following conversation with Julien.
Oh, and one final note.
I apologize for the tin can sound from the microphone.
Since this recording, I've purchased a better mic, and future recordings will have better audio.
So last we spoke, I think we mostly talked about SkipDB, or SKDB. And I think in talking with you about that, a few things came out of it. One of which is, it sounds like you're working on something called SkipFS, or some file system that enables the DB as well. And then of course you have Skiplang, right?
Which is the work that you did at Facebook. And so I was,
I was thinking about this last night. I was like,
there's almost a one-to-one correlation between those three projects and the
things that I'm paying quite a bit of attention to now.
So the sort of spaces,
three of the spaces that I've kind of spent some time in recently are,
you know, distributed durable execution frameworks. The second is like edge DBs, kind of the Turso, libSQL,
you know, LiteFS, Litestream, SQLite area. And then the third area is sort of conceptually
this pattern we're seeing where people are moving serverless infrastructures' persistent storage basically to S3 as the primary store. Like there's some version of this,
which is, you know, tiered storage,
but then there's a more extreme version that I'm starting to see more now
where people are using S3 kind of as their primary store.
So examples of this would include WarpStream,
which is a Kafka pub/sub system, TurboPuffer,
which is a vector search system.
And then like Neon is sort of the canonical example where they've got like
this super, super duper...
You mentioned them. I was looking at TurboPuffer.
It's pretty new. It's pretty new. It's interesting though.
They're trying to serve queries essentially straight from S3
and then they put a cache in front of it. And I think there's like variations
on this theme. On the more OLTP-style databases, they definitely bolt on either some kind of transactional key value store or some consensus-based write-ahead log on top of S3 so that they can provide transactionality and low latency without having to worry about dealing with S3 directly. And then there are various caches: write caches, write-back caches, and stuff.
The reason I raised these three things is because when I look at the Skip stuff, it's like SkipDB
very much could be to me like an edge database kind of thing.
The SkipFS stuff to me could be very, you know, sort of analogous to the sort of serverless persistence thing.
And then SkipLang could be, I think, a foundation for something that's more, you know, durable execution like.
And so it was just striking to me how the stuff you're working on kind of spans all three of these areas.
So I've dumped out a bunch of stuff there, but maybe a good place to start is with the SkipDB stuff, the stuff that you're currently working on.
Yeah, I'm working on SKDB right now.
Yeah.
So what can I say?
So maybe I can give a little bit of context on what these three things are, and then we can talk about how they position against what's out there.
So you've got Skiplang.
Skiplang, the version that's out there, so if you go on skiplang.com, you'll see that
there's a website with documentation about the language, et cetera, et cetera.
But the problem is the open source version that is out there is pretty much unusable.
That's not what we use at Skiplabs.
And we plan to release a version of Skip
that actually works for everyone relatively soon,
but we didn't have time to do that.
And the language has changed.
Well, the language has not changed all that much.
What has changed a lot is the runtime
because we wanted to support WebAssembly.
And it turns out that writing a garbage collector
to target WebAssembly and a runtime in general is difficult.
It's a can of worms, to say the least.
I had my nose in that stuff for a long time.
But so what is Skip?
What Skip is, in a nutshell, is a programming language that gives you very strong immutability.
And that's pretty much all it is.
And the reason why you care about that stuff is when you're going to have durable execution
or if you're going to do reactive programming like we do,
you will have to store things in a cache, right?
You will have to store objects, long-lived objects in this, right?
And when you do that, you better have a guarantee
that those objects are immutable
or have a tight control over the effects one way or another.
Because if you start mutating objects in a cache,
well, things are not going to go super well for you.
So there's the obvious solution
that consists in making a copy of the objects
as they come out of the cache,
and that you can do.
But the problem with that is that
for reactive programming,
it would be a showstopper.
The cost of copying would be too high.
So to give you a concrete example,
the Skip compiler is written in Skip, because we wanted the compiler to be reactive, incremental; that kind of makes sense, right? And with the number of objects that are stored in cache, if instead of reading them from a cache we were paying for a copy, it would make the whole system completely unusable. So what Skip does
really is that it gives you the immutability guarantees
that you need to put an object in a cache or take an object out of the cache and not worry about the
fact that there's a mutable reference that still lives or anything, which is a notion of
immutability that's very different from most programming languages out there. Most programming
languages out there, in fact, all the mainstream, they don't talk about the value. They talk about
what the function is going to do. So when I write const in C++, I'm saying me as a function, I don't intend to modify this
thing, right? But you're not guaranteeing by any means that nobody else is holding a mutable reference to that pointer, right? That's none of the business of the function.
And if you want that function to be able to take that object and put it in a cache without a copy,
that's what you need, right? You need that kind of guarantee. So that's what Skip gives you.
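(To make the const point concrete, here is a rough TypeScript analogy, not Skip code; the cache and Config type are hypothetical. A readonly promise constrains only the function that makes it, not the value itself, which is exactly the gap that bites once objects live in a long-lived cache.)

```ts
// A rough TypeScript analogy (not Skip code): `Readonly` only constrains this
// function; it does not guarantee that the value itself is immutable.
interface Config {
  retries: number;
}

const cache = new Map<string, Config>();

function cachePut(key: string, value: Readonly<Config>): void {
  // This function promises not to mutate `value`, but the caller may still
  // hold a mutable reference to the very same object.
  cache.set(key, value as Config);
}

const cfg: Config = { retries: 3 };
cachePut("service", cfg);
cfg.retries = 99; // the caller mutates after caching...
console.log(cache.get("service")?.retries); // ...and the cache now says 99
```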
And then on top of that, we built, let me kill my Discord, on top of that we built a key value store. So I often refer to it as a file system. I guess it's, you know, the Plan 9 blood that is still a little bit in me, the Plan 9 influence, you know, everything is a file and I tend to see things as a file system.
But I noticed that whenever I use that term, it confuses people.
I get it.
I think I wrestle with the same thing.
You know, I mentioned I've been thinking about this sort of S3 and caching thing.
And I see different implementations of it.
And one of them is very much like a page cache. And in that world, you're thinking more akin to bytes and pages,
and it looks a lot like a file system. But then for a lot of the infrastructure,
what they end up building is some kind of key value store on top of it. And so it's like,
what is the relationship between these two things? Should the byte level page cache thing exist
underneath the key value store? But then when you look at the key value store implementations that you see, a lot of them
aren't using page caches underneath.
Like Neon does.
Neon has page server, I believe.
But a lot of them are more like write ahead log style thing.
They're storing data on EBS.
And then they're periodically flushing that stuff out to S3.
So I totally get what you're saying there.
Just to be concrete, it sounds like the thing you have, though, is that the API is like a key value kind of API.
Yeah, it's more like a key value store.
And it has all the stuff you would expect from a key value store.
So it has a notion of a transaction, it has MVCC, all the kind of stuff that you would not expect from a file system.
I guess it's modeled in my head.
The only thing that it has is that it has a hierarchy of collections ordered in directories with slash, I guess.
That's maybe the only thing that looks like a file system, but really everything else is closer to your key value store.
Interesting.
And, you know, sort of random question, but how much of this are you planning on open sourcing?
I know Skip is already open sourced.
We're going to open source all of it very, very soon.
So Skip, the file system, SKDB, our entire stack. We want to be a 100% open source shop, and MIT-licensed also.
So it's not going to be one of those, you know, dual-license things and stuff.
In fact, I mean, I wanted to open source for a long time, but, you know, we didn't want to get distracted.
And so open sourcing can also mean, you know, there's a community once you open source and you want to take them seriously.
And so when they find issues, you need the time.
And so I wanted to have enough money and big enough of a team to make sure that we don't open source something and then we don't follow what the community wants.
So now I think we're at the point where we have what we need to support a community.
So the reason I ask that is on the SkipFS front, like literally just this week, I was writing that we need more transactional key value stores on top of S3. Because all these OLTP systems are building, you know, they're building their systems on top of these things. And right now when I look around, there's TiKV, which is one that I looked at, and they just were starting to talk about the S3 work that they did. But there aren't a ton of others. Like I was looking at RocksDB-Cloud, I was like, is this thing going to work? And it's basically no, I don't think it does. I don't think it provides the semantics you would need. So this is really interesting. On the folder front
on SkipFS, that's interesting too. I need to think about that. What was the motivation for
adding the folder hierarchy? It was just a way to... So the problem is when... So bear in mind that it's
a key value store, but its primary purpose is not to be a key value store. Its primary purpose is
to have reactive directories, right? Where you have some directories that are the result of a
computation. That's what it does. So it's a key value store where some directories are computed.
And so it really forms
a graph of computation. And the motivation for the hierarchy is the amount of states that you
have in a large system. Like if I take my compiler and I look at all the objects that are parsed, then typed, then blah, blah, blah, and there are different passes for everything.
It actually becomes difficult to make sense of things. And then when you have so many objects, you want to subdivide into categories. And it's just to
help you think about your global state. When your global state gets big, that's what a hierarchy
does for you. It helps you, you know, structure things. That's what it is.
Gotcha. Okay. And then I think I have two questions. The first one is just a quick
question. I think I sort of jumped to the conclusion that this was built on a blob store,
but is it, is it handling its own persistence or is it using, you know, some other blobby thing
or something? No, it handles its own persistence. Gotcha. Okay. Okay. So that was all rewritten
from scratch. I mean, look, most of the blobby things, they optimize for
write rates, you know, and for us, the write rate cannot be that high anyway, because when you
write something, what we need once you've written something is go update a bunch of reactive
directories, right? Yeah. There's no point in being super smart about, you know, how are we
going to take these millions of writes when we will not be able to handle that kind of load anyway,
because we'll be limited by how much the rest of the system can update. And so the constraints were very different from what you would typically
find in, you know, a key value store that has been optimized for heavy writes, like for a
logging system or something like that. Yeah. And so this leads me to my second question I was
going to ask, which I think is just, I think it'd be good to jump into this SKDB use case, because I think that's going to inform a lot of what you're saying in terms of, you know, write versus read and the materialized views and stuff. So can you just run through SKDB?
So SKDB is basically an SQL interpreter on top of an SQL engine, on top of SKStore. SKStore is this reactive key value store I just described.
And in a nutshell, I mean, conceptually, it's really simple.
Like you have a key value store.
You're going to take some of those directories,
some of those tables, whatever you want to call them.
And there will be a representation that is an SQL representation
that can hold, you know, an SQL row, basically.
And what you're going to do is your queries
are going to become virtual
directories. So, virtual directories or reactive directories, whatever you want to call them, but where the directory is the result of a computation. So for example, I run a select on a table. I'm not going to walk that table like, you know, I would in a typical database. What I'm going to do is I'm going to create a new folder, a new directory,
and this new directory is going to be a view on the old one.
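(A conceptual sketch of that idea in TypeScript; this is not SKDB's API or implementation, and all the names are made up. The point is just that a query becomes a derived collection maintained change by change rather than re-scanned.)

```ts
// Conceptual sketch only, not SKDB's API: a "reactive directory" is a derived
// collection kept up to date by applying each source change to it.
type Row = { id: number; price: number };

class ReactiveView {
  private rows = new Map<number, Row>();
  constructor(private predicate: (r: Row) => boolean) {}

  // Called once per write to the source table; no full rescan of the table.
  onSourceChange(row: Row, deleted: boolean): void {
    if (deleted || !this.predicate(row)) {
      this.rows.delete(row.id);
    } else {
      this.rows.set(row.id, row);
    }
  }

  snapshot(): Row[] {
    return [...this.rows.values()];
  }
}

// Roughly "SELECT * FROM items WHERE price > 100" as a reactive directory:
const expensiveItems = new ReactiveView(r => r.price > 100);
expensiveItems.onSourceChange({ id: 1, price: 250 }, false);
expensiveItems.onSourceChange({ id: 2, price: 40 }, false);
console.log(expensiveItems.snapshot()); // only row 1 is in the view
```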
And things get a little heavier once you have reads and writes, right?
Like once you read and then you write into a new table.
So for example, let's say I were to do updates
of plus one on a table within a transaction, right?
This is all in one transaction.
What will end up happening
is you will have to make copies of the table. Well, that might be a bit complicated,
but the idea is when you have a transaction, when you're doing multiple reads and writes,
you might have to have logical copies of the table at different stages.
So let me give you a concrete example, because this is probably way too abstract. Imagine my query says insert into table one. And
what I'm inserting is a select. It's a select on itself. You do understand that if I do that,
and my select is reactive, so it's something that depends on the table itself, I have a cycle. I
have a problem where whenever there's an update, I'm updating myself, which is inserting into myself. And so I have a problem there. If you want to break that, what you'll
have to do is to introduce basically logical copies of T1, where you're going to say, I have
my table one, I'm going to create a virtual view of table one, which is going to be my select,
which is going to create a new version of table one. And so that's how things
are modeled. And I don't know how I got into such a technical detail, but long story short, your queries in SKDB are in fact a chain, a graph of computation, in this reactive file system that's a key value store.
Yeah. Yeah. It's interesting. I came across a similar
kind of concept five or six years ago from a guy named Carl Steinbach at LinkedIn. And he was
coming at it from the data warehousing perspective, but he was basically saying like, Hey, imagine
we don't have workflow orchestrators like Airflow and Dagster and Prefect and whatnot,
but instead everything is just views. So rather than have the workflow, it's views,
querying stuff, and then you compose more views on top of that. And that is in effect a DAG, right? It's a hierarchy
of views. Again, very different context on the sort of data warehousing hive side of things,
but similar idea. So the thing that caught my eye with SKDB is that kind of paradigm could go a lot
of different ways. I think SKDB, there's diffing involved. And so that part to me looked sort of edge-DB-like, and it starts to get into the CRDT realm and how you,
you know, merge these databases that are, you know, perhaps spread around. I think the other
one is sort of the materialized, you know, use case of streaming, stream processing and stream
queries. And then the third one that occurred to me is, I think, probably flavored by
your history at Facebook, but it was like, okay, so this could be used, you know, purely as a
persistence layer for somebody that's got a bunch of React components, and it's just on the front
end, and there's some wasm in the browser that they're using to update a bunch of UI components
when one thing changes, it, you know, cascades down. So it seems like there's a lot of different
things you can do with it. Is there like, I guess, the first question would just be,
what was the initial motivation for it? And I think the second is sort of what do you envision
like the, the ideal, you know, first customers being? Yeah, so that is a great question. Thanks.
Well, first, the idea of how we got into that was, we wanted to make the product more
approachable. So we are really interested in reactive systems in general.
So we have a general programming language called Skip,
and you can build very complex reactive systems,
such as a compiler.
I mean, this is pretty, you know, beefy stuff, right?
That you can make completely incremental
using our technology.
The problem is you'd have to learn Skip.
And that is, you know, a barrier to entry
that is pretty high.
And so one of the things that we promised ourselves when we started SkipLabs was none of our products should be based on the assumption that people are going to learn Skip. And so what SKDB really is, is an attempt to make SKStore more approachable. So now you can write SQL and you don't have to worry about Skip.
Then about the streaming and all the use cases, there are many use cases possible with Skip.
So the diffing thing that you've seen, a lot of it has to do with, you know, how you would deal with conflicts and how you would do merging, obviously.
But we found, we tried to find a compromise, right?
Like there are two ways you can go about this.
You can go full on, you know, last writer wins, which is what a typical database would do. And then this is going
to be efficient. But if, you know, somebody goes offline, then, you know, merging things back will be difficult or impossible. Or you can go full on, you know, keep the whole history, like PouchDB style and these kinds of databases. And what's very nice about those kinds of databases is that you don't really need a main head, right? You can merge them in
whatever order you want and they will all get eventually consistent. The downside with those
kind of databases is that they force you to keep a lot of data around. You'll have to have all these
versions around. And so what we tried to do was to get a compromise between the two. And so what
we do is we're going to do last writer wins on the server.
But if there's a conflict, and only if there's a conflict, we will keep that data around.
And then we'll let you choose how to resolve that conflict.
So that's not as flexible as a database that really does proper versioning.
But it goes a long way.
And then if you don't want to deal with conflicts, we'll do a last writer wins for you. And we'll just do
the, you know, what's a typical database would do. But now if you want more fine grained behavior,
and you want to write your own CRDT in the database, you can, right? We give you control and you can decide what to do. So that's for the diffing stuff.
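(A hypothetical TypeScript sketch of that compromise, not SKDB's actual conflict API: last writer wins by default, conflicting versions are kept, and the application decides how to merge them.)

```ts
// Hypothetical sketch, not SKDB's conflict API: last-writer-wins by default,
// but concurrent writes are kept around so the app can resolve them itself.
interface Version<T> {
  value: T;
  clock: number;   // e.g. a logical timestamp
  client: string;
}

interface Cell<T> {
  head: Version<T>;        // what the server currently serves
  conflicts: Version<T>[]; // concurrent writes kept for later resolution
}

function applyWrite<T>(cell: Cell<T>, incoming: Version<T>): Cell<T> {
  if (incoming.clock > cell.head.clock) {
    // Newer write wins and becomes the head.
    return { head: incoming, conflicts: cell.conflicts };
  }
  // Older/concurrent write: don't silently drop it, keep it as a conflict.
  return { head: cell.head, conflicts: [...cell.conflicts, incoming] };
}

// Application-defined resolution, e.g. treat the cell as a max-counter
// instead of letting one writer clobber the other.
function resolveAsMax(cell: Cell<number>): number {
  return Math.max(cell.head.value, ...cell.conflicts.map(v => v.value));
}
```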
Then for the streaming stuff: look, we could do streaming and it would work.
Would it shine?
I don't think so.
I mean, I think it would work
and I think it would be within a reasonable range
in terms of performance of the state of the art.
Like if I had to bet,
I would say we probably would be 2x slower,
maybe 5x slower.
That's what I was going to kind of poke at
because earlier you were saying, you know,
it's not optimized for heavy writes, right?
It's not.
It's not.
And the reason is, our use case is, you have a notion of data, you have a notion of query,
most of that data doesn't change all that much, and you want those queries to be... So when something changes, you want the queries that are affected to respond relatively quickly. And so what I mean by relatively quickly is log n in the size of the table, not O(n).
You don't want to have to scan anything when you do a write, right?
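(A toy TypeScript sketch of that "don't rescan on every write" point, not SKDB internals: an aggregate view is maintained by applying only the delta of the row that changed, so the cost per write is tied to the change, not to the size of the table.)

```ts
// Toy sketch, not SKDB internals: maintain a per-group sum incrementally,
// applying only the delta of the changed row instead of re-running the
// aggregate over the whole table on every write.
const sumByGroup = new Map<string, number>();

function onRowChange(group: string, oldValue: number | null, newValue: number | null): void {
  const delta = (newValue ?? 0) - (oldValue ?? 0);
  sumByGroup.set(group, (sumByGroup.get(group) ?? 0) + delta);
}

onRowChange("post-42", null, 1); // a like is inserted
onRowChange("post-42", null, 1); // another like
onRowChange("post-42", 1, null); // one like is deleted
console.log(sumByGroup.get("post-42")); // 1, with no table scan
```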
And so that's going to be a use case where, let's say, a cache, for example.
Like you have a lot of queries in your cache.
And when something changed, you want
to know the queries, you know, that are affected by this change and you want that to happen
relatively quickly. But if your use case is you have a couple of queries that are very well
identified and you have a host of changes, like they're coming from a log or something like that,
I don't think you can beat a streaming engine at this because what's going to happen is the streaming engine forms a sort of natural parallelism, right?
Where the stuff comes inside the pipeline that becomes a node that can run on one thread and then that's processed in a pipeline of changes.
And I don't think that you're going to do a much better job, you know, than a streaming engine.
So if your use case is you have a couple of queries well identified and you have data changes very, very quickly, I think a streaming
engine is going to be better. So yeah, that's not the use case we are after. But yeah, the typical
use case we envision is you have an app, you want this app to become much more responsive because
you would like to have the data for a
particular user locally. So what you're going to do is you're going to suck that data in your
browser directly or in your service, in your Node service, in your Python service, or on your phone
or whatever it is. And then you're going to run queries over that data as if the data was directly
the data from the server. And the latency is going to be awesome because it's all going to be there.
Consistency is going to be much easier to deal with because whenever you want to change something, you just update the local database and we will take care of propagating all of that.
And it will all be live for free. Meaning if your database is touching a particular object
and this object is used by another client on another browser, whenever you touch this object,
that client is going to see
it live. So if you want to build something collaborative, something that feels live,
it's going to be very easy. And the key to get this stuff right is privacy. And I think that's
what's lacking today in today's system. If you want a system like that to work well, you need
a real privacy layer. That's where you can really express with complex rules,
who can see what, because if you don't have that, how many use cases do you have where you want your
user to be able to see the data of all the other users? How often does this happen? So privacy is
key to make that work. Yeah, that's, I think, a really interesting point. The space where I've
seen some work going there is specifically around the work that this database company, Nile, is doing,
where they're coming at it more from a SaaS multi-tenant kind of thing, point of view,
versus sort of an edge database point of view. And from the SaaS multi-tenant point of view,
it's exactly what you said. You have a SaaS service that has a bunch of tenants. Each of
those tenants doesn't need to see the other tenants' data.
And so they've overlaid some semantics on top of the PostgreSQL SQL
to basically set tenant IDs,
and then it automatically hides what doesn't need to be seen.
On the edge side, I haven't seen a whole lot in that space.
And so I think what you're saying really makes sense to me.
I think, how are you guys thinking about that?
Is that something that's just gated on the server side?
And when you connect-
I mean, I think to get it working properly on an edge,
the reason why you haven't seen it
is I don't think you can do a good job
without materialized views.
And that's why you haven't seen that.
Because what will happen is you have a user.
And so what you will see in typical databases today is you
have rules on what a user can do on a table. And sometimes those rules are actually pretty fine
grain where you'll be able to say, let's say, can I read this? Can I modify this? Yes. If not,
blah, blah, blah. Okay, great. But typically on an edge database, you will need the same version
of the data with different views, depending on the context.
So let me give you a concrete example.
Imagine you want to build a like button, right?
So you have your user, that user is seeing objects,
and you want that user to be able to like stuff,
click on the stuff that this user likes.
So you want the user to be able to like its own likes or unlike its own likes, right?
But you don't want that user to be able to read all the other likes,
at least not if they didn't choose to, and certainly not modify other users' likes, right?
So you need some fine-grained access on who can do what.
And so already expressing that with table permissions would be a little bit of a challenge.
But let's say you manage, right?
Let's say you do.
Your problem is you want another view of this stuff which gives you, let's say, a like count or some other query, right? And you want to feed that back into a system with a different visibility, right? And today nobody does that, right? And so the only way you would have to build something like that today would be either with a trigger, where, you know, you would maintain a count by hand and do all sorts of things.
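(To make that trigger-style approach concrete, here's roughly the kind of thing people hand-write today; a hypothetical TypeScript sketch, not code from SKDB or from the episode.)

```ts
// Hypothetical sketch of the hand-rolled approach: a trigger-like handler
// that maintains a like count per post, plus a separate hand-written
// visibility check for who may read it.
const likeCounts = new Map<string, number>();   // postId -> count
const likedBy = new Map<string, Set<string>>(); // postId -> userIds

function onLikeEvent(postId: string, userId: string, liked: boolean): void {
  const users = likedBy.get(postId) ?? new Set<string>();
  if (liked) users.add(userId);
  else users.delete(userId);
  likedBy.set(postId, users);
  likeCounts.set(postId, users.size); // the "materialized view", by hand
}

function canSeeLikeCount(viewerId: string, postOwnerId: string, friends: Set<string>): boolean {
  return viewerId === postOwnerId || friends.has(viewerId);
}
```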
But then that's what I call a poor man's materialized view. That's what it is. You're maintaining a materialized view in your trigger. Or you would build a service. You would watch the
changes on that particular table and aggregate some count and whatnot. And you run into all the
classic problems, like concurrency, transaction did not go through,
the service goes down. I mean, all that fun stuff, right? And so I think to have an edge database
where you can really pack a lot of the action on the client side will require what I described,
materialized view plus privacy. Yeah, interesting. Thinking back to the edge privacy stuff,
I think where we're at right now,
sort of state of the art that I see
is really just one SQLite DB per user.
And then whatever data is in that SQLite DB
for that user is what they get.
It sounds like what you're talking about
is maybe much more fine grained, right?
Yeah, we want users to have a view of the database
and then we ship them what they're supposed to see on that client.
And then they get to interact with that mini database that is an edge database, really, in their browser, as if it was their backend and not have to worry about the backend, what's going on in the backend.
Yeah, and so I have to imagine the permission controls are extremely expressive and fine-grained
because it's essentially SQL, right?
Like you get to say whatever you want,
as long as you can express it in SQL,
that becomes the data that they're allowed to see
versus more row and column level ACLs.
I mean, there are row and columns level too.
So you can do that too.
But the killer feature is you can use arbitrary SQL where you can say,
I want the like count to be visible to the friends of this person who became friends with that person less than two months ago. And you can do that. Well, that's something you would
have struggled with in a typical system. Yeah. Well, it's funny that you raised that
exact use case because the friends of friends query is sort of a classic database killer.
And the fact that you can not only express that, but that you can have it materialize
means that the read queries are going to be pretty reasonable for social networking style
queries.
Friends of friends might get a little bit big though.
It depends.
I mean, you need to-
On the network, yeah.
... make that usable in size.
Because friends of friends of friends... there's a level of n where this is not going to work anymore.
Unless you're running a social network with 100 people, and then that's fine.
But that's right.
The challenges with the kind of approach that we are bringing to the table, I think they're twofold.
One is expressing what you want in SQL and keeping the size reasonable.
But I think that that happens more or less naturally.
And two, keeping the partition that the user can see reasonable.
So you don't want the user...
So let's say you are in a setup where there are a lot of public documents.
Well, you don't want to get all those public documents
if they're very large on your clients.
So you'll have to come up with a partition
that makes sense for a user. So ideally, all the data that this user can interact with can just
fit on the client and you don't have to worry about it. And then that's great. But if you cannot
afford that, then you need to come up with a reasonable way to partition the data.
Yeah. Okay. That's what I was going to ask. So there is a model with SKDB where,
you know, not all the data that the user is interested in can fit on the client.
And that SKDB can still handle that. And I'm assuming it just pushes the queries to the server.
No, the way it does it is it lets you establish a filter. So you can express a filter on the tables you want using an SQL filter, an SQL expression, and then we will give you just that data. That's the way it works.
Gotcha. Okay. So essentially all the data, all the state, does have to fit on the client side,
but you have, you know, essentially all the SQL query, you know, expressiveness that you need in
order to make that happen. Is that a fair statement? Yeah, it's a fair assessment, but I still think
it's going to be a challenge. I think most of the time it's just going to work. I think most people
won't have to worry too much because, look, how much data can a user really see? Like if you look at most apps today, I'm pretty sure this is going to be very reasonable and it's going to fit in one gig of RAM. But you're freezing a little bit here.
There we go. You're back. I'm pretty sure that for most use cases, you just give, you know, all the data that this user can see, and it will just work. But if you have to worry about that, then that's something that is a little bit ad hoc right now. We don't have a good solution other than you need to sit down and think hard about what you want the user to have on their device. Gotcha. Interesting. And then the classic
question I always ask every edge person that I talk to, like, what are the durability guarantees
on this? If I write something to my client and I like begin and commit, is it really committed?
Does it go to the server? Like what happens with durability?
So your durability guarantee is: when you write something locally, it holds only if, between the moment where you wrote it locally and the moment where it was sent to the network, your browser was not killed.
So we don't, right now,
SKDB does not support persistence in the browser.
We have a branch implementation of this,
but we have not pushed it.
At least we're probably not gonna have it
in our initial release for two reasons.
And I know this is gonna be very unpopular
because SQLite is doing a big push on that
and getting a Wasm version to work.
But I would say, number one, I'm not comfortable putting data in persistent storage in a browser
if it's not encrypted.
Because some browsers are actually shared among users.
You have computers in a library, for example, or in a public space.
And people have a model of how the web works that does not match the persistent
storage of the browser. So for example, if I go and I log in, and my app is using SKDB,
and SKDB is storing stuff on the persistent storage, and then I log out, when I log out,
I have a mental model that my data is gone. And so on a public computer, that could be misleading. So I would be more
comfortable. I'm not comfortable storing in persistent storage in a browser, something
that hasn't been encrypted. So that's one. The second thing is that Safari makes life really
hard with this kind of stuff. They break persistent storage. They wipe out data, you know, after one week without usage, they do a lot of things that's
like, if you put persistent storage out there, you're kind of telling your users if it's in the
API that they should use it, right? But do I want to encourage people to use persistent storage if
they envision some of their users to be in Safari? Probably not. I would tell them unless, you know,
you have a good guarantee that
your users are using Chrome, if you're going to rely on persistent storage, I think Safari...
So that was, let's close the paren on the rant on persistent storage in browsers. But I think we're not ready yet to use persistent storage. So what does it mean when you write something
in a browser? It means that if the process is killed between the moment where you've written and before the time it reached the network, well, your data is gone. So even if we came back with
a commit that said transaction successful, that data could be gone, right? So now let's imagine
that this succeeded. Then once it's reached the server, then it will become durable on the server. And you will know of it if you ask for it.
The client will know that something has been acknowledged in sync by the server.
So these are the guarantees that we have.
When you write locally at first, you don't have many guarantees until the server comes back and says, well, this stuff was actually written.
Or there could be problems when you have been breaking privacy rules.
But this should be very rare because we also check them client side.
So the only case where you're going to break a privacy rule is if there was a race between
the moment where you've written something and somebody changed the privacy rule server
side and that happened at the same time.
It was a racy behavior.
So that should be very rare.
Gotcha.
And I guess because the system's set up to be reactive, the end user, when they make a mutation, I'm assuming the UI component
or whatever it is that they're waiting on, is that going to wait for the local?
No, that's hooked up on the local stuff, you know, to give an experience that's more live.
Yeah. Makes sense. Yeah. Low latency. Okay, awesome. Okay, so we touched on a lot here.
Open source, we talked about Skiplang, Skip, well, I'm using the wrong terms, SkipFS, which you called something else, SKStore, I think.
And then SKDB.
I think by the time we ship this out,
everything's going to be available
because I'm going to wait on you.
But is there anything else you want to plug
or touch on before we wind things down?
Yeah, I mean, I want to talk a little bit about
concurrency and how I think reactivity really changes the game on how we approach concurrency
and especially systems that require high availability. I mean, one of the things that's
so, and we talked about this last week, so it's going to be a repeat for you, but
the typical way you're going to deal with concurrency normally in a database is you have locks at all kind of different levels, right?
You have a lock at the table level, at the row level. And what you're really trying to do is
have locks that are as precise as possible, because the more precise you are, the less you're going to block other potential queries, right? But you have some queries that... And the
fact that there are these locks, you could deadlock, so you need to be able to roll back. And so you need a journal
and all that fun stuff, right? But some queries, you know, need to touch a lot of data, right? And then you don't really have a good solution. Either you block everybody else,
or you roll back every time, you know, somebody has touched something that you were looking at,
and now you have a fairness issue, you could be starved for access to the resources, right?
And so I think reactivity brings something pretty cool to the table, which is that
what you can do when your transaction is reactive is to go build it on your own,
on a thread that is not blocking anyone. And then when you come back and it's commit time
and you know exactly what you want to do,
what you can do because the transaction is reactive
is only update what has changed
between the moment you started
and the moment where you're trying to commit.
And that in practice is really a game changer
when you have a system that is a mix of fat, complex queries that can last for several minutes and, at the same time, writes, without blocking the system.
And I find it really interesting. So there's one caveat with this approach, which is that
the data cannot leave the transaction until we've reached the end of the
transaction. Meaning if I'm going to let you go off and build your transaction, right? And then
when you come back, we're going to incrementally update it. I need to have a full view of what you
were really trying to do. Now, if I give data out of the transaction before I've reached the end,
then I don't know if you have
not executed some Python code that will, you know, make the rest of the transaction depend on what
was read before. And so if you build it that way, because I don't have my hands on this Python code,
my incremental update is going to be wrong. And so I cannot do it. And so the one caveat is your
entire transaction has to basically be built at once, which is not as convenient as being able to do that as you go.
But I think that's a pretty cool approach on how to deal with concurrency.
The part that breaks my brain is the kind of replay part where you come back after you've run through the initial
pass of the query and you need to merge the stuff that's changed. Because I would think that the
ordering might matter on that. And maybe this is where I'm mistaken. But if you're merging
changes in after the fact, is it possible that those changes could have affected things downstream
of them that you've already computed? You know what I mean? Yeah, but those things I've already computed,
they will come back at commit time and they will be incrementally updated, right? So let's do it
together. Let's say, you know, I have an insert of a select. So I have a select that selects, you know,
some amount of rows in my thing, and I'm going to insert them in a table. So now comes commit time, and turns out that one of those rows in my select was deleted.
Because the whole system is reactive, I'm able to say, delete this thing,
and it will just propagate in log n and figure out which parts of the query need to be updated.
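(A much-simplified TypeScript sketch of that commit-time reconciliation, not SKDB's implementation: the result is built off to the side against a snapshot, and at commit time only the rows that changed since the snapshot are re-applied, under a brief lock.)

```ts
// Simplified sketch, not SKDB internals: build the result off to the side,
// then at commit time apply only the source changes seen since the snapshot.
type SrcRow = { id: number; value: number };
type Change = { row: SrcRow; deleted: boolean };

function buildResult(snapshot: SrcRow[], predicate: (r: SrcRow) => boolean): Map<number, SrcRow> {
  const result = new Map<number, SrcRow>();
  for (const row of snapshot) {
    if (predicate(row)) result.set(row.id, row);
  }
  return result;
}

function reconcileAtCommit(result: Map<number, SrcRow>, changes: Change[], predicate: (r: SrcRow) => boolean): void {
  // Held under a lock, but only for the handful of changed rows,
  // not for a full re-run of the query.
  for (const { row, deleted } of changes) {
    if (deleted || !predicate(row)) result.delete(row.id);
    else result.set(row.id, row);
  }
}

const isBig = (r: SrcRow) => r.value > 10;
const result = buildResult([{ id: 1, value: 50 }, { id: 2, value: 5 }], isBig);
reconcileAtCommit(result, [{ row: { id: 1, value: 50 }, deleted: true }], isBig);
console.log([...result.keys()]); // row 1 was deleted concurrently, so: []
```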
I see.
I end up, yeah.
And so that's where the magic is.
So what you do is you do that part under a lock,
but you're holding the lock
for a very, very brief amount of time.
You're not blocking everybody else.
I see.
So I understand now.
It's as if you let people go off
with their queries and their writes,
and you're like, go nuts.
When you come back, we'll reconcile.
Yeah, and the merging happens at that point
once they come back with anything that's changed.
Interesting.
I'm gonna have to think about this some more.
The skeptic in me is like, there must be some operations where it results in a full recompute
of the query anyway.
But I can't think of any.
Everything seems incremental off the top of my head.
You know, it's an insert, update, or delete, right?
Yeah.
Yeah.
Okay.
So maybe something like a median, right?
So would that change or, you know, I'm trying to think of these statistical computations
that are difficult to compute.
There will be a couple of cases where you will have to recompute everything, but then
again, when that happens, you're not worse off than an existing database.
Yeah.
So the worst case is you're performing the same way as you would with an existing DB.
Best case is...
Maybe 2x slower, because you run it once and then you had to rerun it. So maybe 2x slower, but you're going to be in the same order of magnitude as a typical database.
Gotcha.
Very interesting.
Interesting.
Okay, great.
Well, that's all I had.
Thanks for having me.
It was a pleasure.
Yeah, man.
I'm excited.