Materialized View Podcast - SKDB and Reactive Databases With Julien Verlaguet
Episode Date: January 18, 2024

I recently talked with Julien Verlaguet. Julien is the founder of SkipLabs, a company building infrastructure for reactive applications. Before SkipLabs, Julien spent 9 years at Facebook, where he was... a tech lead for Hack and Skiplang.

In this interview, we discuss SkipLabs's foundational components: SKDB, SkipStorage, and Skiplang. SKDB is built as a reactive database that features performant materialized views, diffing between databases, and streaming via ephemeral tables. SkipStorage and Skiplang serve as building blocks for SKDB.

By the time you hear this, SkipLabs will have launched, so I invite you to check skiplabs.io and skdb.io for more details. In the meantime, please enjoy the following conversation with Julien.

Oh, and one final note: I apologize for the tin-can sound from my microphone. Since this recording, I've purchased a better mic, and future recordings will have better audio.

Note: I am an investor in WarpStream [$], which I mention in this podcast.

You can support me by purchasing The Missing README: A Guide for the New Software Engineer for yourself or gifting it to new software engineers that you know.

I occasionally invest in infrastructure startups. Companies that I've invested in are marked with a [$] in this newsletter. See my LinkedIn profile for a complete list.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit materializedview.io
Transcript
I recently talked with Julien Verlaguet.
Julien is the founder of SkipLabs, a company building infrastructure for reactive applications.
Before SkipLabs, Julien spent nine years at Facebook, where he was a tech lead for both
Hack and Skiplang.
In this interview, we discuss SkipLabs' foundational components: SKDB, SkipStorage, and Skiplang.
SKDB is built as a reactive database that features performant materialized views,
diffing between databases, and streaming via ephemeral tables.
SkipStorage and Skiplang serve as building blocks for SKDB.
By the time you hear this, SkipLabs will have launched, so I invite you to check skiplabs.io and skdb.io for more details.
In the meantime, please enjoy the following conversation with Julien.
Oh, and one final note.
I apologize for the tin can sound from the microphone.
Since this recording, I've purchased a better mic, and future recordings will have better audio.
So last we spoke, I think we mostly talked about SkipDB, or SKDB. And I think in talking with you about that, a few things came out of it. One of which is, it sounds like you're working on something called SkipFS, or some file system that enables the DB as well. And then of course you have Skiplang, right?
Which is the work that you did at Facebook. And so I was,
I was thinking about this last night. I was like,
there's almost a one-to-one correlation between those three projects and the
things that I'm paying quite a bit of attention to now.
So the sort of spaces,
three of the spaces that I've kind of spent some time in recently are,
you know, distributed durable execution frameworks. The second is like edge DBs, kind of the Turso, libSQL,
you know, LiteFS, Litestream, SQLite area. And then the third area is sort of conceptually
this pattern we're seeing where people are moving serverless infrastructures' persistent storage basically to S3 as the primary store. Like there's some version of this,
which is, you know, tiered storage,
but then there's a more extreme version that I'm starting to see more now
where people are using S3 kind of as their primary store.
So examples of this would include WarpStream,
which is a Kafka pub/sub system, TurboPuffer,
which is a vector search system.
And then like Neon is sort of the canonical example where they've got like
this super, super duper...
You mentioned them. I was looking at TurboPuffer.
It's pretty new. It's pretty new. It's interesting though.
They're trying to serve queries essentially straight from S3
and then they put a cache in front of it. And I think there's like variations
on this theme. On the more OLTP-style databases, they definitely bolt on either some kind of transactional key value store or some consensus-based write-ahead log on top of S3 so that they can provide transactionality and low latency without having to worry about dealing with S3 directly. And then there are various caches: write caches, write-back caches, and stuff.
The reason I raised these three things is because when I look at the Skip stuff, it's like SkipDB
very much could be to me like an edge database kind of thing.
The SkipFS stuff to me could be very, you know, sort of analogous to the sort of serverless persistence thing.
And then SkipLang could be, I think, a foundation for something that's more, you know, durable execution like.
And so it was just striking to me how the stuff you're working on kind of spans all three of these areas.
So I've dumped out a bunch of stuff there, but maybe a good place to start is with the SkipDB stuff, the stuff that you're currently working on.
Yeah, I'm working on SKDB right now.
Yeah.
So what can I say?
So maybe I can give a little bit of context on what these three things are, and then we can talk about how they position against what's out there.
So you've got Skiplang.
Skiplang, the version that's out there, so if you go on skiplang.com, you'll see that
there's a website with documentation about the language, et cetera, et cetera.
But the problem is the open source version that is out there is pretty much unusable.
That's not what we use at Skiplabs.
And we plan to release a version of Skip
that actually works for everyone relatively soon,
but we didn't have time to do that.
And the language has changed.
Well, the language has not changed all that much.
What has changed a lot is the runtime
because we wanted to support WebAssembly.
And it turns out that writing a garbage collector
to target WebAssembly and a runtime in general is difficult.
It's a can of worms, to say the least.
I had my nose in that stuff for a long time.
But so what is Skip?
What Skip is, in a nutshell, is a programming language that gives you very strong immutability.
And that's pretty much all it is.
And the reason why you care about that stuff is when you're going to have durable execution
or if you're going to do reactive programming like we do,
you will have to store things in a cache, right?
You will have to store objects, long-lived objects in this, right?
And when you do that, you better have a guarantee
that those objects are immutable
or have a tight control over the effects one way or another.
Because if you start mutating objects in a cache,
well, things are not going to go super well for you.
So there's the obvious solution
that consists in making a copy of the objects
as they come out of the cache,
and that you can do.
But the problem with that is that
for reactive programming,
it would be a showstopper.
The cost of copying would be too high.
So to give you a concrete example,
the Skip compiler is written in Skip, because we wanted the compiler to be reactive, incremental; that kind of makes sense, right? And with the number of objects that are stored in cache, if instead of reading them from a cache we were paying for a copy, it would make the whole system completely unusable. So what Skip does
really is that it gives you the immutability guarantees
that you need to put an object in a cache or take an object out of the cache and not worry about the
fact that there's a mutable reference that still lives or anything, which is a notion of
immutability that's very different from most programming languages out there. Most programming
languages out there, in fact, all the mainstream, they don't talk about the value. They talk about
what the function is going to do. So when I write const in C++, I'm saying me as a function, I don't intend to modify this
thing, right? But you're not guaranteeing by any means that nobody else is holding a mutable reference to that pointer, right? That's none of the business of the function.
And if you want that function to be able to take that object and put it in a cache without a copy,
that's what you need, right? You need that kind of guarantee. So that's what Skip gives you.
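(To make the const point concrete, here is a rough TypeScript analogy, not Skip code; the cache and Config type are hypothetical. A readonly promise constrains only the function that makes it, not the value itself, which is exactly the gap that bites once objects live in a long-lived cache.)

```ts
// A rough TypeScript analogy (not Skip code): `Readonly` only constrains this
// function; it does not guarantee that the value itself is immutable.
interface Config {
  retries: number;
}

const cache = new Map<string, Config>();

function cachePut(key: string, value: Readonly<Config>): void {
  // This function promises not to mutate `value`, but the caller may still
  // hold a mutable reference to the very same object.
  cache.set(key, value as Config);
}

const cfg: Config = { retries: 3 };
cachePut("service", cfg);
cfg.retries = 99; // the caller mutates after caching...
console.log(cache.get("service")?.retries); // ...and the cache now says 99
```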
And then on top of that, we built, let me kill my Discord, on top of that we built a key value store. So I often refer to it as a file system. I guess it's, you know, the Plan 9 blood that is still a little bit in me, the Plan 9 influence, you know, everything is a file and I tend to see things as a file system.
But I noticed that whenever I use that term, it confuses people.
I get it.
I think I wrestle with the same thing.
You know, I mentioned I've been thinking about this sort of S3 and caching thing.
And I see different implementations of it.
And one of them is very much like a page cache. And in that world, you're thinking more akin to bytes and pages,
and it looks a lot like a file system. But then for a lot of the infrastructure,
what they end up building is some kind of key value store on top of it. And so it's like,
what is the relationship between these two things? Should the byte level page cache thing exist
underneath the key value store? But then when you look at the key value store implementations that you see, a lot of them
aren't using page caches underneath.
Like Neon does.
Neon has page server, I believe.
But a lot of them are more like write ahead log style thing.
They're storing data on EBS.
And then they're periodically flushing that stuff out to S3.
So I totally get what you're saying there.
Just to be concrete, it sounds like the thing you have, though, is that the API is like a key value kind of API.
Yeah, it's more like a key value store.
And it has all the stuff you would expect from a key value store.
So it has a notion of a transaction, it has MVCC, all the kind of stuff that you would not expect from a file system.
I guess it's modeled in my head.
The only thing that it has is that it has a hierarchy of collections ordered in directories with slash, I guess.
That's maybe the only thing that looks like a file system, but really everything else is closer to your key value store.
Interesting.
And, you know, sort of random question, but how much of this are you planning on open sourcing?
I know Skip is already open sourced.
We're going to open source all of it very, very soon.
So Skip, the file system, SKDB, our entire stack. We want to be a 100% open source shop, and MIT-licensed also.
So it's not going to be one of those, you know, dual-license things and stuff.
In fact, I mean, I wanted to open source for a long time, but, you know, we didn't want to get distracted.
And so open sourcing can also mean, you know, there's a community once you open source and you want to take them seriously.
And so when they find issues, you need the time.
And so I wanted to have enough money and big enough of a team to make sure that we don't open source something and then we don't follow what the community wants.
So now I think we're at the point where we have what we need to support a community.
So the reason I ask that is on the SkipFS front, like literally just this week, I was writing that we need more transactional key value stores on top of S3. Because all these OLTP systems are building, you know, they're building their systems on top of these things. And right now when I look around, there's TiKV, which is one that I looked at, and they just were starting to talk about the S3 work that they did. But there aren't a ton of others. Like I was looking at RocksDB-Cloud, I was like, is this thing going to work? And it's basically no, I don't think it does. I don't think it provides the semantics you would need. So this is really interesting. On the folder front
on SkipFS, that's interesting too. I need to think about that. What was the motivation for
adding the folder hierarchy? It was just a way to... So the problem is when... So bear in mind that it's
a key value store, but its primary purpose is not to be a key value store. Its primary purpose is
to have reactive directories, right? Where you have some directories that are the result of a
computation. That's what it does. So it's a key value store where some directories are computed.
And so it really forms
a graph of computation. And the motivation for the hierarchy is the amount of states that you
have in a large system. Like if I take my compiler and I look at all the objects that are parsed, then typed, then blah, blah, blah, and there are different passes for everything.
It actually becomes difficult to make sense of things. And then when you have so many objects, you want to subdivide into categories. And it's just to
help you think about your global state. When your global state gets big, that's what a hierarchy
does for you. It helps you, you know, structure things. That's what it is.
Gotcha. Okay. And then I think I have two questions. The first one is just a quick
question. I think I sort of jumped to the conclusion that this was built on a blob store,
but is it, is it handling its own persistence or is it using, you know, some other blobby thing
or something? No, it handles its own persistence. Gotcha. Okay. Okay. So that was all rewritten
from scratch. I mean, look, most of the blobby things, they optimize for
write rates, you know, and for us, the write rate cannot be that high anyway, because when you
write something, what we need once you've written something is go update a bunch of reactive
directories, right? Yeah. There's no point in being super smart about, you know, how are we
going to take these millions of writes when we will not be able to handle that kind of load anyway,
because we'll be limited by how much the rest of the system can update. And so the constraints were very different from what you would typically
find in, you know, a key value store that has been optimized for heavy writes, like for a
logging system or something like that. Yeah. And so this leads me to my second question I was
going to ask, which I think is just, I think it'd be good to jump into this SKDB use case, because I think that's going to inform a lot of what you're saying in terms of, you know, write versus read and the materialized views and stuff. So can you just run through SKDB?
So SKDB is basically an SQL interpreter on top of an SQL engine, on top of SKStore. SKStore is this reactive key value store I just described.
And in a nutshell, I mean, conceptually, it's really simple.
Like you have a key value store.
You're going to take some of those directories,
some of those tables, whatever you want to call them.
And there will be a representation that is an SQL representation
that can hold, you know, an SQL row, basically.
And what you're going to do is your queries
are going to become virtual
directories. So, virtual directories or reactive directories, whatever you want to call them, but where the directory is the result of a computation. So for example, I run a select on a table. I'm not going to walk that table like, you know, I would in a typical database. What I'm going to do is I'm going to create a new folder, a new directory,
and this new directory is going to be a view on the old one.
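(A conceptual sketch of that idea in TypeScript; this is not SKDB's API or implementation, and all the names are made up. The point is just that a query becomes a derived collection maintained change by change rather than re-scanned.)

```ts
// Conceptual sketch only, not SKDB's API: a "reactive directory" is a derived
// collection kept up to date by applying each source change to it.
type Row = { id: number; price: number };

class ReactiveView {
  private rows = new Map<number, Row>();
  constructor(private predicate: (r: Row) => boolean) {}

  // Called once per write to the source table; no full rescan of the table.
  onSourceChange(row: Row, deleted: boolean): void {
    if (deleted || !this.predicate(row)) {
      this.rows.delete(row.id);
    } else {
      this.rows.set(row.id, row);
    }
  }

  snapshot(): Row[] {
    return [...this.rows.values()];
  }
}

// Roughly "SELECT * FROM items WHERE price > 100" as a reactive directory:
const expensiveItems = new ReactiveView(r => r.price > 100);
expensiveItems.onSourceChange({ id: 1, price: 250 }, false);
expensiveItems.onSourceChange({ id: 2, price: 40 }, false);
console.log(expensiveItems.snapshot()); // only row 1 is in the view
```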
And things get a little heavier once you have reads and writes, right?
Like once you read and then you write into a new table.
So for example, let's say I were to do updates
of plus one on a table within a transaction, right?
This is all in one transaction.
What will end up happening
is you will have to make copies of the table. Well, that might be a bit complicated,
but the idea is when you have a transaction, when you're doing multiple reads and writes,
you might have to have logical copies of the table at different stages.
So let me give you a concrete example, because this is probably way too abstract. Imagine my query says insert into table one. And
what I'm inserting is a select. It's a select on itself. You do understand that if I do that,
and my select is reactive, so it's something that depends on the table itself, I have a cycle. I
have a problem where whenever there's an update, I'm updating myself, which is inserting into myself. And so I have a problem there. If you want to break that, what you'll
have to do is to introduce basically logical copies of T1, where you're going to say, I have
my table one, I'm going to create a virtual view of table one, which is going to be my select,
which is going to create a new version of table one. And so that's how things
are modeled. And I don't know how I got into such a technical detail, but long story short, your queries in SKDB are in fact a chain, a graph of computation, in this reactive file system that's a key value store.
Yeah. Yeah. It's interesting. I came across a similar
kind of concept five or six years ago from a guy named Carl Steinbach at LinkedIn. And he was
coming at it from the data warehousing perspective, but he was basically saying like, Hey, imagine
we don't have workflow orchestrators like Airflow and Dagster and Prefect and whatnot,
but instead everything is just views. So rather than have the workflow, it's views,
querying stuff, and then you compose more views on top of that. And that is in effect a DAG, right? It's a hierarchy
of views. Again, very different context on the sort of data warehousing hive side of things,
but similar idea. So the thing that caught my eye with SKDB is that kind of paradigm could go a lot
of different ways. I think SKDB, there's diffing involved. And so that part to me looked sort of edge-DB-like, and it starts to get into the CRDT realm and how you,
you know, merge these databases that are, you know, perhaps spread around. I think the other
one is sort of the materialized, you know, use case of streaming, stream processing and stream
queries. And then the third one that occurred to me is, I think, probably flavored by
your history at Facebook, but it was like, okay, so this could be used, you know, purely as a
persistence layer for somebody that's got a bunch of React components, and it's just on the front
end, and there's some wasm in the browser that they're using to update a bunch of UI components
when one thing changes, it, you know, cascades down. So it seems like there's a lot of different
things you can do with it. Is there like, I guess, the first question would just be,
what was the initial motivation for it? And I think the second is sort of what do you envision
like the, the ideal, you know, first customers being? Yeah, so that is a great question. Thanks.
Well, first, the idea of how we got into that was, we wanted to make the product more
approachable. So we are really interested in reactive systems in general.
So we have a general programming language called Skip,
and you can build very complex reactive systems,
such as a compiler.
I mean, this is pretty, you know, beefy stuff, right?
That you can make completely incremental
using our technology.
The problem is you'd have to learn Skip.
And that is, you know, a barrier to entry
that is pretty high.
And so one of the things that we promised ourselves when we started SkipLabs was none of our products should be based on the assumption that people are going to learn Skip. And so what SKDB really is, is an attempt to make SKStore more approachable. So now you can write SQL and you don't have to worry about Skip.
Then about the streaming and all the use cases, there are many use cases possible with Skip.
So the diffing thing that you've seen, a lot of it has to do with, you know, how you would deal with conflicts and how you would do merging, obviously.
But we found, we tried to find a compromise, right?
Like there are two ways you can go about this.
You can go full on, you know, last writer wins, which is what a typical database would do. And then this is going
to be efficient. But if, you know, somebody goes offline, then, you know, merging things back will be difficult or impossible. Or you can go full on, you know, keep the whole history, like PouchDB style and these kinds of databases. And what's very nice about those kinds of databases is that you don't really need a main head, right? You can merge them in
whatever order you want and they will all get eventually consistent. The downside with those
kind of databases is that they force you to keep a lot of data around. You'll have to have all these
versions around. And so what we tried to do was to get a compromise between the two. And so what
we do is we're going to do last writer wins on the server.
But if there's a conflict, and only if there's a conflict, we will keep that data around.
And then we'll let you choose how to resolve that conflict.
So that's not as flexible as a database that really does proper versioning.
But it goes a long way.
And then if you don't want to deal with conflicts, we'll do a last writer wins for you. And we'll just do
the, you know, what's a typical database would do. But now if you want more fine grained behavior,
and you want to write your own CRDT in the database, you can, right? We give you control and you can decide what to do. So that's for the diffing stuff.
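(A hypothetical TypeScript sketch of that compromise, not SKDB's actual conflict API: last writer wins by default, conflicting versions are kept, and the application decides how to merge them.)

```ts
// Hypothetical sketch, not SKDB's conflict API: last-writer-wins by default,
// but concurrent writes are kept around so the app can resolve them itself.
interface Version<T> {
  value: T;
  clock: number;   // e.g. a logical timestamp
  client: string;
}

interface Cell<T> {
  head: Version<T>;        // what the server currently serves
  conflicts: Version<T>[]; // concurrent writes kept for later resolution
}

function applyWrite<T>(cell: Cell<T>, incoming: Version<T>): Cell<T> {
  if (incoming.clock > cell.head.clock) {
    // Newer write wins and becomes the head.
    return { head: incoming, conflicts: cell.conflicts };
  }
  // Older/concurrent write: don't silently drop it, keep it as a conflict.
  return { head: cell.head, conflicts: [...cell.conflicts, incoming] };
}

// Application-defined resolution, e.g. treat the cell as a max-counter
// instead of letting one writer clobber the other.
function resolveAsMax(cell: Cell<number>): number {
  return Math.max(cell.head.value, ...cell.conflicts.map(v => v.value));
}
```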
Then for the streaming stuff: look, we could do streaming and it would work.
Would it shine?
I don't think so.
I mean, I think it would work
and I think it would be within a reasonable range
in terms of performance of the state of the art.
Like if I had to bet,
I would say we probably would be 2x slower,
maybe 5x slower.
That's what I was going to kind of poke at
because earlier you were saying, you know,
it's not optimized for heavy writes, right?
It's not.
It's not.
And the reason is, our use case is, you have a notion of data, you have a notion of query,
most of that data doesn't change all that much, and you want those queries to be... So when something changes, you want the queries that are affected to respond relatively quickly. And so what I mean by relatively quickly is log n in the size of the table, not O(n).
You don't want to have to scan anything when you do a write, right?
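(A toy TypeScript sketch of that "don't rescan on every write" point, not SKDB internals: an aggregate view is maintained by applying only the delta of the row that changed, so the cost per write is tied to the change, not to the size of the table.)

```ts
// Toy sketch, not SKDB internals: maintain a per-group sum incrementally,
// applying only the delta of the changed row instead of re-running the
// aggregate over the whole table on every write.
const sumByGroup = new Map<string, number>();

function onRowChange(group: string, oldValue: number | null, newValue: number | null): void {
  const delta = (newValue ?? 0) - (oldValue ?? 0);
  sumByGroup.set(group, (sumByGroup.get(group) ?? 0) + delta);
}

onRowChange("post-42", null, 1); // a like is inserted
onRowChange("post-42", null, 1); // another like
onRowChange("post-42", 1, null); // one like is deleted
console.log(sumByGroup.get("post-42")); // 1, with no table scan
```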
And so that's going to be a use case where, let's say, a cache, for example.
Like you have a lot of queries in your cache.
And when something changed, you want
to know the queries, you know, that are affected by this change and you want that to happen
relatively quickly. But if your use case is you have a couple of queries that are very well
identified and you have a host of changes, like they're coming from a log or something like that,
I don't think you can beat a streaming engine at this because what's going to happen is the streaming engine forms a sort of natural parallelism, right?
Where the stuff comes inside the pipeline that becomes a node that can run on one thread and then that's processed in a pipeline of changes.
And I don't think that you're going to do a much better job, you know, than a streaming engine.
So if your use case is you have a couple of queries well identified and you have data changes very, very quickly, I think a streaming
engine is going to be better. So yeah, that's not the use case we are after. But yeah, the typical
use case we envision is you have an app, you want this app to become much more responsive because
you would like to have the data for a
particular user locally. So what you're going to do is you're going to suck that data in your
browser directly or in your service, in your Node service, in your Python service, or on your phone
or whatever it is. And then you're going to run queries over that data as if the data was directly
the data from the server. And the latency is going to be awesome because it's all going to be there.
Consistency is going to be much easier to deal with because whenever you want to change something, you just update the local database and we will take care of propagating all of that.
And it will all be live for free. Meaning if your database is touching a particular object
and this object is used by another client on another browser, whenever you touch this object,
that client is going to see
it live. So if you want to build something collaborative, something that feels live,
it's going to be very easy. And the key to get this stuff right is privacy. And I think that's
what's lacking today in today's system. If you want a system like that to work well, you need
a real privacy layer. That's where you can really express with complex rules,
who can see what, because if you don't have that, how many use cases do you have where you want your
user to be able to see the data of all the other users? How often does this happen? So privacy is
key to make that work. Yeah, that's, I think, a really interesting point. The space where I've
seen some work going there is specifically around the work that this database company, Nile, is doing,
where they're coming at it more from a SaaS multi-tenant kind of thing, point of view,
versus sort of an edge database point of view. And from the SaaS multi-tenant point of view,
it's exactly what you said. You have a SaaS service that has a bunch of tenants. Each of
those tenants doesn't need to see the other tenants' data.
And so they've overlaid some semantics on top of the PostgreSQL SQL
to basically set tenant IDs,
and then it automatically hides what doesn't need to be seen.
On the edge side, I haven't seen a whole lot in that space.
And so I think what you're saying really makes sense to me.
I think, how are you guys thinking about that?
Is that something that's just gated on the server side?
And when you connect-
I mean, I think to get it working properly on an edge,
the reason why you haven't seen it
is I don't think you can do a good job
without materialized views.
And that's why you haven't seen that.
Because what will happen is you have a user.
And so what you will see in typical databases today is you
have rules on what a user can do on a table. And sometimes those rules are actually pretty fine
grain where you'll be able to say, let's say, can I read this? Can I modify this? Yes. If not,
blah, blah, blah. Okay, great. But typically on an edge database, you will need the same version
of the data with different views, depending on the context.
So let me give you a concrete example.
Imagine you want to build a like button, right?
So you have your user, that user is seeing objects,
and you want that user to be able to like stuff,
click on the stuff that this user likes.
So you want the user to be able to like its own likes or unlike its own likes, right?
But you don't want that user to be able to read all the other likes,
at least not if they didn't choose to, and certainly not modify other users' likes, right?
So you need some fine-grained access on who can do what.
And so already expressing that with table permissions would be a little bit of a challenge.
But let's say you manage, right?
Let's say you do.
Your problem is you want another view of this stuff which gives you, let's say, a like count or some other query, right? And you want to feed that back into a system with a different visibility, right? And today nobody does that, right? And so the only way you would have to build something like that today would be either with a trigger, where, you know, you would maintain a count by hand and do all sorts of things.
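(To make that trigger-style approach concrete, here's roughly the kind of thing people hand-write today; a hypothetical TypeScript sketch, not code from SKDB or from the episode.)

```ts
// Hypothetical sketch of the hand-rolled approach: a trigger-like handler
// that maintains a like count per post, plus a separate hand-written
// visibility check for who may read it.
const likeCounts = new Map<string, number>();   // postId -> count
const likedBy = new Map<string, Set<string>>(); // postId -> userIds

function onLikeEvent(postId: string, userId: string, liked: boolean): void {
  const users = likedBy.get(postId) ?? new Set<string>();
  if (liked) users.add(userId);
  else users.delete(userId);
  likedBy.set(postId, users);
  likeCounts.set(postId, users.size); // the "materialized view", by hand
}

function canSeeLikeCount(viewerId: string, postOwnerId: string, friends: Set<string>): boolean {
  return viewerId === postOwnerId || friends.has(viewerId);
}
```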
But then that's what I call a poor man's materialized view. That's what it is. You're maintaining a materialized view in your trigger. Or you would build a service. You would watch the
changes on that particular table and aggregate some count and whatnot. And you run into all the
classic problems, like concurrency, transaction did not go through,
the service goes down. I mean, all that fun stuff, right? And so I think to have an edge database
where you can really pack a lot of the action on the client side will require what I described,
materialized view plus privacy. Yeah, interesting. Thinking back to the edge privacy stuff,
I think where we're at right now,
sort of state of the art that I see
is really just one SQLite DB per user.
And then whatever data is in that SQLite DB
for that user is what they get.
It sounds like what you're talking about
is maybe much more fine grained, right?
Yeah, we want users to have a view of the database
and then we ship them what they're supposed to see on that client.
And then they get to interact with that mini database that is an edge database, really, in their browser, as if it was their backend and not have to worry about the backend, what's going on in the backend.
Yeah, and so I have to imagine the permission controls are extremely expressive and fine-grained
because it's essentially SQL, right?
Like you get to say whatever you want,
as long as you can express it in SQL,
that becomes the data that they're allowed to see
versus more row and column level ACLs.
I mean, there are row and columns level too.
So you can do that too.
But the killer feature is you can use arbitrary SQL where you can say,
I want the like count to be visible to the friends of this person who became friends with that person less than two months ago. And you can do that. Well, that's something you would
have struggled with in a typical system. Yeah. Well, it's funny that you raised that
exact use case because the friends of friends query is sort of a classic database killer.
And the fact that you can not only express that, but that you can have it materialize
means that the read queries are going to be pretty reasonable for social networking style
queries.
Friends of friends might get a little bit big though.
It depends.
I mean, you need to-
On the network, yeah.
... make that usable in size.
Because friends of friends of friends... there's a level of n where this is not going to work anymore.
Unless you're running a social network with 100 people, and then that's fine.
But that's right.
The challenges with the kind of approach that we are bringing to the table, I think they're twofold.
One is expressing what you want in SQL and keeping the size reasonable.
But I think that that happens more or less naturally.
And two, keeping the partition that the user can see reasonable.
So you don't want the user...
So let's say you are in a setup where there are a lot of public documents.
Well, you don't want to get all those public documents
if they're very large on your clients.
So you'll have to come up with a partition
that makes sense for a user. So ideally, all the data that this user can interact with can just
fit on the client and you don't have to worry about it. And then that's great. But if you cannot
afford that, then you need to come up with a reasonable way to partition the data.
Yeah. Okay. That's what I was going to ask. So there is a model with SKDB where,
you know, not all the data that the user is interested in can fit on the client.
And that SKDB can still handle that. And I'm assuming it just pushes the queries to the server.
No, the way it does it is it lets you establish a filter. So you can express a filter on the tables you want using an SQL filter, an SQL expression, and then we will give you just that data. That's the way it works.
Gotcha. Okay. So essentially all the data, all the state, does have to fit on the client side,
but you have, you know, essentially all the SQL query, you know, expressiveness that you need in
order to make that happen. Is that a fair statement? Yeah, it's a fair assessment, but I still think
it's going to be a challenge. I think most of the time it's just going to work. I think most people
won't have to worry too much because, look, how much data can a user really see? Like if you look at most apps today, I'm pretty sure this is going to be very reasonable and it's going to fit in one gig of RAM. But you're freezing a little bit here.
There we go. You're back. I'm pretty sure that for most use cases, you just give, you know, all the data that this user can see, and it will just work. But if you have to worry about that, then that's something that is a little bit ad hoc right now. We don't have a good solution other than you need to sit down and think hard about what you want the user to have on their device. Gotcha. Interesting. And then the classic
question I always ask every edge person that I talk to, like, what are the durability guarantees
on this? If I write something to my client and I like begin and commit, is it really committed?
Does it go to the server? Like what happens with durability?
So your durability guarantee is: when you write something locally, it holds only if, between the moment where you wrote it locally and the moment where it was sent to the network, your browser was not killed.
So we don't, right now,
SKDB does not support persistence in the browser.
We have a branch implementation of this,
but we have not pushed it.
At least we're probably not gonna have it
in our initial release for two reasons.
And I know this is gonna be very unpopular
because SQLite is doing a big push on that
and getting a Wasm version to work.
But I would say, number one, I'm not comfortable putting data in persistent storage in a browser
if it's not encrypted.
Because some browsers are actually shared among users.
You have computers in a library, for example, or in a public space.
And people have a model of how the web works that does not match the persistent
storage of the browser. So for example, if I go and I log in, and my app is using SKDB,
and SKDB is storing stuff on the persistent storage, and then I log out, when I log out,
I have a mental model that my data is gone. And so on a public computer, that could be misleading. So I would be more
comfortable. I'm not comfortable storing in persistent storage in a browser, something
that hasn't been encrypted. So that's one. The second thing is that Safari makes life really
hard with this kind of stuff. They break persistent storage. They wipe out data, you know, after one week without usage, they do a lot of things that's
like, if you put persistent storage out there, you're kind of telling your users if it's in the
API that they should use it, right? But do I want to encourage people to use persistent storage if
they envision some of their users to be in Safari? Probably not. I would tell them unless, you know,
you have a good guarantee that
your users are using Chrome, if you're going to rely on persistent storage, I think Safari...
So that was, let's close the paren on the rant on persistent storage in browsers. But I think we're not ready yet to use persistent storage. So what does it mean when you write something
in a browser? It means that if the process is killed between the moment where you've written and before the time it reached the network, well, your data is gone. So even if we came back with
a commit that said transaction successful, that data could be gone, right? So now let's imagine
that this succeeded. Then once it's reached the server, then it will become durable on the server. And you will know of it if you ask for it.
The client will know that something has been acknowledged in sync by the server.
So these are the guarantees that we have.
When you write locally at first, you don't have many guarantees until the server comes back and says, well, this stuff was actually written.
Or there could be problems when you have been breaking privacy rules.
But this should be very rare because we also check them client side.
So the only case where you're going to break a privacy rule is if there was a race between
the moment where you've written something and somebody changed the privacy rule server
side and that happened at the same time.
It was a racy behavior.
So that should be very rare.
Gotcha.
And I guess because the system's set up to be reactive, the end user, when they make a mutation, I'm assuming the UI component
or whatever it is that they're waiting on, is that going to wait for the local?
No, that's hooked up on the local stuff, you know, to give an experience that's more live.
Yeah. Makes sense. Yeah. Low latency. Okay, awesome. Okay, so we touched on a lot here.
Open source, we talked about Skiplang, Skip, well, I'm using the wrong terms, SkipFS, which you called something else, SKStore, I think.
And then SKDB.
I think by the time we ship this out,
everything's going to be available
because I'm going to wait on you.
But is there anything else you want to plug
or touch on before we wind things down?
Yeah, I mean, I want to talk a little bit about
concurrency and how I think reactivity really changes the game on how we approach concurrency
and especially systems that require high availability. I mean, one of the things that's
so, and we talked about this last week, so it's going to be a repeat for you, but
the typical way you're going to deal with concurrency normally in a database is you have locks at all kind of different levels, right?
You have a lock at the table level, at the row level. And what you're really trying to do is
have locks that are as precise as possible, because the more precise you are, the less you're going to block other potential queries, right? But you have some queries that... And the
fact that there are these locks, you could deadlock, so you need to be able to roll back. And so you need a journal
and all that fun stuff, right? But some queries, you know, need to touch a lot of data, right? And then you don't really have a good solution. Either you block everybody else,
or you roll back every time, you know, somebody has touched something that you were looking at,
and now you have a fairness issue, you could be starved for access to the resources, right?
And so I think reactivity brings something pretty cool to the table, which is that
what you can do when your transaction is reactive is to go build it on your own,
on a thread that is not blocking anyone. And then when you come back and it's commit time
and you know exactly what you want to do,
what you can do because the transaction is reactive
is only update what has changed
between the moment you started
and the moment where you're trying to commit.
And that in practice is really a game changer
when you have a system that is a mix of fat, complex queries that can last for several minutes and, at the same time, writes, without blocking the system.
And I find it really interesting. So there's one caveat with this approach, which is that
the data cannot leave the transaction until we've reached the end of the
transaction. Meaning if I'm going to let you go off and build your transaction, right? And then
when you come back, we're going to incrementally update it. I need to have a full view of what you
were really trying to do. Now, if I give data out of the transaction before I've reached the end,
then I don't know if you have
not executed some Python code that will, you know, make the rest of the transaction depend on what
was read before. And so if you build it that way, because I don't have my hands on this Python code,
my incremental update is going to be wrong. And so I cannot do it. And so the one caveat is your
entire transaction has to basically be built at once, which is not as convenient as being able to do that as you go.
But I think that's a pretty cool approach on how to deal with concurrency.
The part that breaks my brain is the kind of replay part where you come back after you've run through the initial
pass of the query and you need to merge the stuff that's changed. Because I would think that the
ordering might matter on that. And maybe this is where I'm mistaken. But if you're merging
changes in after the fact, is it possible that those changes could have affected things downstream
of them that you've already computed? You know what I mean? Yeah, but those things I've already computed,
they will come back at commit time and they will be incrementally updated, right? So let's do it
together. Let's say, you know, I have an insert of a select. So I have a select that selects, you know,
some amount of rows in my thing, and I'm going to insert them in a table. So now comes commit time, and turns out that one of those rows in my select was deleted.
Because the whole system is reactive, I'm able to say, delete this thing,
and it will just propagate in log n and figure out which parts of the query need to be updated.
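(A much-simplified TypeScript sketch of that commit-time reconciliation, not SKDB's implementation: the result is built off to the side against a snapshot, and at commit time only the rows that changed since the snapshot are re-applied, under a brief lock.)

```ts
// Simplified sketch, not SKDB internals: build the result off to the side,
// then at commit time apply only the source changes seen since the snapshot.
type SrcRow = { id: number; value: number };
type Change = { row: SrcRow; deleted: boolean };

function buildResult(snapshot: SrcRow[], predicate: (r: SrcRow) => boolean): Map<number, SrcRow> {
  const result = new Map<number, SrcRow>();
  for (const row of snapshot) {
    if (predicate(row)) result.set(row.id, row);
  }
  return result;
}

function reconcileAtCommit(result: Map<number, SrcRow>, changes: Change[], predicate: (r: SrcRow) => boolean): void {
  // Held under a lock, but only for the handful of changed rows,
  // not for a full re-run of the query.
  for (const { row, deleted } of changes) {
    if (deleted || !predicate(row)) result.delete(row.id);
    else result.set(row.id, row);
  }
}

const isBig = (r: SrcRow) => r.value > 10;
const result = buildResult([{ id: 1, value: 50 }, { id: 2, value: 5 }], isBig);
reconcileAtCommit(result, [{ row: { id: 1, value: 50 }, deleted: true }], isBig);
console.log([...result.keys()]); // row 1 was deleted concurrently, so: []
```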
I see.
I end up, yeah.
And so that's where the magic is.
So what you do is you do that part under a lock,
but you're holding the lock
for a very, very brief amount of time.
You're not blocking everybody else.
I see.
So I understand now.
It's as if you let people go off
with their queries and their writes,
and you're like, go nuts.
When you come back, we'll reconcile.
Yeah, and the merging happens at that point
once they come back with anything that's changed.
Interesting.
I'm gonna have to think about this some more.
The skeptic in me is like, there must be some operations where it results in a full recompute
of the query anyway.
But I can't think of any.
Everything seems incremental off the top of my head.
You know, it's an insert, update, or delete, right?
Yeah.
Yeah.
Okay.
So maybe something like a median, right?
So would that change or, you know, I'm trying to think of these statistical computations
that are difficult to compute.
There will be a couple of cases where you will have to recompute everything, but then
again, when that happens, you're not worse off than an existing database.
Yeah.
So the worst case is you're performing the same way as you would with an existing DB.
Best case is...
Maybe 2x slower, because you run it once and then you had to rerun it. So maybe 2x slower, but you're going to be in the same order of magnitude as a typical database.
Gotcha.
Very interesting.
Interesting.
Okay, great.
Well, that's all I had.
Thanks for having me.
It was a pleasure.
Yeah, man.
I'm excited.