CoRecursive: Coding Stories - Tech Talk: Rethinking databases and Noria with Jon Gjengset

Episode Date: April 30, 2019

Can we make databases faster and remove the need for caching reads in an external cache? Can we make a distributed SQL based relational database that outperforms memcached? Jon Gjengset and the PDOS team at MIT CSAIL have done just that with Noria. Today I talk to Jon about Noria, about building a database in Rust and his efforts to teach people intermediate Rust via live coding sessions. Jon was great to talk to. He really was able to explain to me how Noria is able to do what it does and where it is in terms of maturity. The key, besides Rust and evmap, is that Noria uses materialized views to do query optimization ahead of time, on write. The devil is in the details though, of course. And the details, in this case, are turning declarative SQL into a dataflow program that handles cache updates on new writes. http://corecursive.com/030-rethinking-databases-with-jon-gjengset/ Show notes: Noria Project, PDOS group at MIT, Noria Paper, Noria Article, Jon's Rust Streaming

Transcript
Starting point is 00:00:00 Hello, this is Adam Gordon Bell. Join me as I learn about building software. This is CoRecursive. I think it has also made me a better programmer because in some sense, it's almost like pair programming, except with a hundred other people. And that is really fun, but you are right. It is also intimidating.
Starting point is 00:00:23 I think what's needed is to accept the fact that you're going to get stuck and you're going to make mistakes, and then realize that the people watching, what they learn the most from is watching how you deal with problems in your code. That was Jon Gjengset. He is one of the leads behind an exciting new database. He and the other members of the MIT CSAIL team have built something pretty cool. Externally, it looks like a database with some extensions for materialized views, but inside they have rethought how a database works, leading to some pretty exciting performance numbers.
Starting point is 00:01:00 The whole thing is built on Rust, and we're going to talk a little bit about that towards the end of the podcast. If you enjoy the podcast, I would love it if you would spread the word to your colleagues or friends who you think might enjoy it. Also, we have a Slack channel you can join. Recently, there were some interesting discussions of the J programming language. User KR007 had some interesting thoughts about it. And we've also been talking a lot about UnisonWeb after interview number 27. So, Jon, thank you for coming on to the CoRecursive podcast. Well, thanks. I'm happy to be here. So I have actually a printout, because I'm a bit of a Luddite here,
Starting point is 00:01:46 of a paper that you contributed to or the lead on, I'm not really sure, about Noria. Could you tell me what Noria is? Sure. So Noria came about as we observed that databases are a little bit weird for a lot of the applications that people use them for. Usually web applications and the like are very read heavy and databases do a lot of work on read. Like they have to execute all these select statements and queries. But when the workload is primarily reads, you want the reads to be fast. And that's not what databases today are currently written for. And so people end up sticking a cache in front of the database. And we figured,
Starting point is 00:02:25 why not have that cache be maintained by the database instead, which knows about how your queries work? Interesting, interesting perspective. So in the paper, it says that it's a data flow engine, I believe. What is a data flow engine? Yeah, so we had to play a few tricks to get the database to keep your cache up to date, right? The idea is that whenever new writes come in, they cause the state that's been cached in the past to change in some way. So we wanted a system that can capture that, capture the effect that writes have on your cache. And Dataflow is a nice sort of conceptual model to use for this, where you think of data entering the system sort of at the top of a graph, if you will, sort of near the root. And then the changes that that
Starting point is 00:03:11 data cause as a result of whatever operators are in your query propagate down through the graph, all the way down to the caches that are maintained at the leaves. And so Dataflow lets us capture that notion of rights are propagated through a bunch of operators. And in the end, they have some effect on the cached state that you keep in your cache. So it's kind of like the old joke about naming and cache invalidation being the two hard problems in computer science. So what you've built is like an inference thing for finding out when you have to invalidate a piece of cache? In some sense, yeah. What the system does is you name your queries in advance. You give the database sort of, these are the queries that my application cares
Starting point is 00:03:57 about. And then the database uses those queries to figure out how the cache has to change in response to different types of writes. So it constructs essentially a data flow program that knows what to do with any single write and how that affects your cache, rather than you having to manually write up the code for that in your application. So it has a declarative model, sort of actually using SQL as the declarative model of this data flow. Yeah, that's entirely right. So the Noria takes normal SQL queries, sort of prepared statements, and turns them into these data flow programs. It basically, as you say, it infers how to compute changes to your cache using the information from your SQL queries. It basically understands how your
Starting point is 00:04:46 reads are related to your writes. So this isn't the type of new database that has its own language that I need to use instead of SQL? No, not at all. In fact, we very explicitly wanted this to feel very much like interacting with a regular database because one of the problems you have with all these research databases that exist today is they require that you change your application pretty substantially in order to make use of them. But people have lots of application code written already, and it would be really nice if they could take that existing code and just sort of remove all the cache maintenance stuff and just have the database do that for them. But apart from that, it should feel like interacting with a regular database.
Starting point is 00:05:27 You had this really neat example about, I guess, making a clone of like a Hacker News website. I wonder if you could describe how the database helps in that case. Sure. So we've taken this website called Lobsters, which is basically Hacker News, except that the source code is open sourced. And we also managed to get a bunch of analytics about the workload that that site sees. And this let us construct a benchmark that where we both have the correct access patterns, but also know the real schema and the real queries that that application issues, because we can look at the underlying source and actually run the real application. And this let us construct a benchmark that contains real queries issued at sort of the real rates that a normal application would see,
Starting point is 00:06:12 and then benchmark Noria using that workload, which is essentially a way to measure what a real application that used Noria instead of what it was previously using, what kind of speedup would that see if they were to use Noria instead? So if I understand the example, it's like you have like a table of news articles, people can submit to it, and you have a table of users and people vote on the stories. So where does the caching come in?
Starting point is 00:06:38 Yeah, that's right. So it's sort of similar to Reddit, Hacker News, all those sites. It's basically they use pretty much the schema you would have guessed that if you had to guess. And the caching comes in actually at a number of layers. So Noria caches not just the final results, but also intermediate values. And it does so sort of cleverly, like it caches things that are useful to cache. For example, it would cache the current vote count for every article. And then it would also cache the current front page
Starting point is 00:07:08 based on what those vote counts are. And then if a new vote comes in for an article, it will update the vote count for that article and also update the front page if necessary. So like if I just had my MySQL database, I would have these two tables. And then I guess I have to aggregate all the votes to count things. I'm just making sure I understand this. And then so the optimization I might want to do because that's kind of a join and then a sum and it probably ends up being a bottleneck is that I sort of just every once in a while, like get these totals and actually cache them and then actually show the first page off that is how I'm thinking how I would make it without your database, right? Yeah, that's exactly right. So that is in fact what lobsters does, except they take it slightly further and they keep a column in the stories
Starting point is 00:07:53 table that is the number of votes. And whenever someone votes for a story, they do an insert into the votes table. And then they also do an update to the stories table. And so that way, getting the front page is you select from stories, order by votes descending, and then limit. Except it turns out that the computation of how popular a story is, is not just the number of votes. It's also computed from the number of votes on comments on that post. There are penalties applies to certain tags that stories can have, then the number of comments is important. So the computation of the sort of hotness of a story
Starting point is 00:08:31 is actually fairly involved. And the way they capture this is they do a transaction on the database whenever anyone votes for or comments on a story, where they do sort of this full querying all the things that are relevant, computing the updated score, doing a transaction that updates the story's hotness, which turns out to make writes really slow. Anytime you vote, you have to do all this work. And that's fine because most people are reading. I assume that's the trade-off they're making. Exactly. So they have in some sense made the same observation that Noria makes, which is you should do this work on write rather than on read because the reads are more common. And so what's your solution to the problem?
Starting point is 00:09:11 So in Noria, your select would be exactly that thought you had initially of doing a join and a sum. That is the query you would write. There's no other columns you add. There's no, you don't think about caching at all. You just write the straightforward join and sum query. And behind the scenes, Noria will recognize that, oh, I should cache this sum, and I should also cache the front page view. And it will incrementally keep those up to date whenever a vote comes in or if a comment comes in. So you describe the full read you want to do,
Starting point is 00:09:41 including how to compute this hotness in your SQL queries. And then Noria will construct the appropriate data flow. And that will, whenever a write comes into any relevant table, it will figure out how that write, whether it's a new comment, a new vote, whatever, it will figure out how that affects the current score, update the score in place, and then update the front page if necessary. So all of that happens automatically internally in Noria. Do I have to tell it ahead of time what this front page query looks like? Yeah. So Noria does rely on knowing the queries in advance. However, you can always give it a new query or a new set of queries, and then it will
Starting point is 00:10:22 dynamically adapt the data flow to also take those queries into account. So it's not as though the query set has to be static. It's just that you do need to declare a query before you run it. The easiest way to think about this is probably something like prepared statements where you prepare a statement and then you execute it a bunch of times. It sounds a bit like a materialized view. Yes, in fact, it basically is a materialized view. Well, Noria actually does implement materialized views. That is a technically accurate statement.
Starting point is 00:10:54 Where it gets tricky is that materialized views have a bunch of restrictions as they exist in many systems today that Noria tries to address. The first of which is that materialized views, you want them to be incrementally maintained. You don't want it to be so that whenever any write comes in, you throw away the entire materialized views and compute it again, which is what you would get with something like create materialized view in Postgres or
Starting point is 00:11:20 something that's trigger-based. Yeah, you have to refresh it on some frequency, right? And it throws it all away. Exactly. And Noria does not do that. It does incremental view maintenance. Now, incremental view maintenance is also known in the database literature. Where Noria innovates here is that Noria also has the ability to make state partial. So what this means is, if you imagine a website such as Lobsters, you don't want to keep the vote counts for every story ever around all the time. That would be pretty inefficient. You want to sort of stories that are years old, you don't really need to remember anymore. And if someone asks for them, then you can compute it at that time, but you don't really need to
Starting point is 00:12:01 keep them all around all the time. And so what Noria lets you do is it only keeps in memory the things that have been asked for recently. And then it has the ability to evict from the materialized views. This is not something that any existing materialized view system that we know about can do. They are all full materialized views. They are never sort of only keeping some of the state or the state that the application cares about. How does it decide? So it sounds like an LRU cache mixed with a materialized view, I guess, but how does it decide what's in there to start with? Yeah, that's a good analogy. It pretty much tries to apply some kind of caching strategy.
Starting point is 00:12:42 Currently, it uses randomized eviction, which is sort of a halfway decent approximation of LRU. But over time, one of the things we want to look at is how to incorporate more advanced caching strategies that are not just randomized eviction. And in theory, there's nothing preventing us from doing pretty severe or pretty significant analysis of the access pattern of the application to figure out what things to throw away and what things to keep. One potential problem with materialized views that it sounds like maybe you found a solution for is they could just be really large. If I'm joining a number of tables and I have this fully expanded results, like it's just too big to store or it just becomes
Starting point is 00:13:19 expensive to read because it's so large that it's on disk or yeah exactly and this is one of the reasons why the ability to have partial materializations is so important because you just can't realistically store everything and especially when you have things like joins that might cause some kind of combinatorial explosion in the amount of state necessary and so this is why noria will only keep the state that you have asked for recently and any other state it will just get rid of. The way many existing systems try to deal with this is they provide windowed operators. So like window joins or windowed state, usually based on time. And while those are fine in some contexts, like for analytics, it's not great if you have a web application where you really want your reads to represent all of the data. Definitely. Are you familiar with the SQL server indexed views? Yeah. So SQL server indexed views is something we've looked at a little,
Starting point is 00:14:16 and they're a little interesting because they do try to provide something similar to what Norea does. They do have some limitations though. So first of all, they are full materializations. They are not partial. And second, from doing some performance analysis, they seem to be relatively slow and you have many writes. So if you have writes to different keys, then they still effectively end up, we're not sure whether it's locking or recomputing the entire materialized view, but it does not scale well in writes. It also seems as though it does not scale well in reads on those tables. And especially when they collide with writes, it's almost as though everything gets sequentialized in some way. The other restriction that materialized views or
Starting point is 00:15:01 indexed views in SQL Server have is that there are a number of restrictions on what you're allowed to do in a query that you create an index view over. They have this whole documentation webpage on if you want an indexed view, here are all the things your query has to do. And it includes pretty severe restrictions, like you cannot use other views inside of a view. Yeah. So like it's been a long time, but at some point years ago, I tried to use these index views as a solution for something. And I found that those restrictions you're describing, they made it not a feature that I could use. It's something like after looking at your work, it seems like just they're not able to reason beyond certain types of structures to know how to update this index view. That's exactly right. And we've tried to figure out why that is,
Starting point is 00:15:50 but unfortunately, the internal design of SQL Server isn't something we really have access to. We hypothesize that either they're doing essentially some kind of trigger magic behind the scenes, in which case they would effectively be recomputing the materialized view every time. Or maybe they're doing some like per key triggers to make it slightly more efficient. It could also be that they're using something the research literature calls delta queries. They're basically a way to refresh a view more efficiently. So it's a way to query for only the things that have changed since last time you queried. But in order to compute these Delta queries, you essentially have to compute derivatives over SQL queries, which is pretty hard. And you can only do that
Starting point is 00:16:38 analysis for certain queries. Whereas for us, we're not really doing that. We're not trying to figure out how to compute only the changes since last time we asked. Instead, we're doing just a regular forward computation for every write, which just turns out to be easier to construct. At least that's been our experience. Yeah, it sounds like the DeLorean or something, right? Like what you've built looks like a database on the outside, but in fact, it's something different on the inside. I don't know if that metaphor makes any sense. Yeah, I think it actually makes a lot of sense. And this is something we've struggled a little bit with when trying to present this work is people will either say, look, this is just a database when we say, no, really, it's not. But it looks like a database. In fact, Noria has this MySQL compatibility layer that means that
Starting point is 00:17:26 you can have an application that just uses normal MySQL client libraries, whatever off-the-shelf thing exists for your language, and you can just hook it up to Noria and it will work without you making any other modifications. So it really just looks like a database. But of course, as you observe, it isn't. And then we have the other side of the coin, which is people who say this is really just a data flow system, where sure, it really is a data flow system under the scenes. But in contrast to basically all existing data flow systems, it is specifically designed for the sort of caching database use case, where you care about state being partial, you care about not doing
Starting point is 00:18:06 windowing, you care about being able to shard efficiently, you care about supporting SQL. And in fact, you also care about having something that can change so you can add new queries without having to start the whole thing again, which is generally what you have to do with existing solutions. So what is an existing data flow solution, like an example? So there aren't any that are directly trying to solve the same thing Noria has. There are a number of systems that are similar, though. So Nyad is the first something that performs some computation.
Starting point is 00:18:51 And while it's generally geared towards things like graph computation, where you have a lot of cycles, it works for any kind of data flow program you want to make. But NIAID makes a number of assumptions about how you're going to use it. So for example, it assumes that the computation isn't going to change while you're running. It assumes you don't really have reads. There isn't a good way to expose sort of state that is cached, that someone external to the program can read. And also NIAID tries to provide very strong consistency guarantees. And that comes at a cost, especially when you look at more distributed settings. Whereas Noria says, we're going to provide the same consistency as a normal cache, which means that sometimes your
Starting point is 00:19:37 read will be a little stale, but we assume that that's okay for applications. Consistency is an interesting one. So does that mean is any read? Like if I'm not using your materialized view, if I just write to a table and read to it, can I read my writes like immediately or? So in Noria currently, there is no synchronization between reads and writes. So if you do a write and then you subsequently do a read, you may or may not observe
Starting point is 00:20:05 that read. What we guarantee though, is that every write will be applied exactly once. So you might read and then not see your result. And then you read and then you see your result. Your read will then not then at some later point to go away. That cannot happen. You will also never see it applied twice. That makes sense. Now, that said, we are looking at ways in which we can give things like read your own rights, so slightly stronger consistency guarantees. And we have a scheme that we don't have fully implemented yet, but that is sort of in the works that provides that kind of stronger consistency for users and queries that need that. So if I have my derived or my materialized view of these votes that we were talking about, what type of lag is there between writes and reads?
Starting point is 00:20:53 Writes to underlying votes, let's say, and this summation table? In general, it should be very fast. It should just be however long it takes to sort of write the write over the network and then updating an in-memory counter and then passing it along. And this is because Noria doesn't try to provide strong consistency guarantees. It also has to do very little internal coordination. happens in a very streaming fashion where there's a little bit of batching internally, but you should observe the read within sort of millisecond scale. Oh, wow. Of course, this depends a little bit on how large your graph is. So if you have a query that is very complicated, that has sort of lots of nested joins and aggregations,
Starting point is 00:21:42 then of course, it'll take longer for the write to propagate through those operators. And so you will get the read whenever all of the operators have finished computing. In some sense, you can compute the delay between you doing a write and that being represented in your reads by just adding together the computation cost of the operators that the write has to go through in order to reach the view you're reading from. So in fact, if you think about it, you might have two views that both will eventually represent your write. And it might be that the write is visible in one way sooner than it's visible in the other, because one was a much simpler view to compute. And can we have views that are derived
Starting point is 00:22:25 from other materialized views? Absolutely. Essentially, you just write create view SQL statements and Noria will create that view for you. And it will decide whether or not it thinks that view should be materialized, but that's not really a decision that the user has to think about.
Starting point is 00:22:42 They can just write their queries and Noria will choose whatever materializations are appropriate for that query such that the reads has to think about. They can just write their queries and Noria will choose whatever materializations are appropriate for that query such that the reads will be fast. I don't know much about database internals, but I'm imagining I write something and there is a materialized view and because of the join behavior,
Starting point is 00:22:58 that change needs to be in many rows of the materialized view. So is it possible for me to read it and see that change like only partway through? No, you will actually always see any given write either fully reflected or not at all reflected. The way this works is that when a write enters Snoria, so think of this as an insert into the base table, that is sort of logically one update. That update is then propagated down the operator data flow graph. So imagine it reaches a join.
Starting point is 00:23:31 That join is then going to do a lookup for the key in that update that you wrote on the other side of the join. That gives back a number of records that happen to match that join key. All of those records are going to be used to produce a single new update that has all of the sort of outbound changes. So if you add a vote for, let's see, what is a good example of this? Maybe if the username was on the final results and I changed my username, but I had multiple stories submitted. Sure. Okay. So let's take that example. So you have a stories table
Starting point is 00:24:06 that has a join with users on the author and you want to display the username and the final stories. In that case, what you would do is if you change your username, what will enter the base table is you will do an update on users, set name equals your new username,
Starting point is 00:24:24 where ID is, whatever your user ID is. That will enter the graph as a single update that contains two logical updates, removal of your old username and the assertion of your new one, which is basically a replace. Those will then flow to the join as a single update. That join is then going to do the lookup into the stories table based on the user ID. It will find any stories that you have written. So then it takes all the stories that it founds and performs essentially the join with the old record, so your old username and your new username.
Starting point is 00:25:03 And it emits a single update that has negative records or revocation records for all of the stories that you have written with your old username and positive or insertions for all the stories with your new username. Oh, it's like a diff. Yeah, exactly. It's exactly like a diff. And that diff is then propagated as a single update down to the cache at the bottom, which will then be applied to that cache in a single atomic operation. So it will remove all the old records and add all the new ones. And when you do a read, you will either read before that update is applied or after. So it sounds like I would need to not be able to read while that was going on, is my thought. Yeah, exactly. And in fact, we've put a lot of work
Starting point is 00:25:45 into trying to make the views that the user can read from in such a way that writes and reads can happen concurrently while we still preserve this guarantee of atomic application of updates. And the way we do that is using this neat data structure
Starting point is 00:26:00 that essentially keeps two maps and all the writes go to one and all the reads go to the other. And then the writer swaps keeps two maps and all the writes go to one and all the reads go to the other. And then the writer swaps the two maps whenever it's done applying an update. So the reader is always reading off sort of a snapshot of like a couple writes back. Is it conceptually?
Starting point is 00:26:17 Yeah, that's a good way to think about it. And in fact, you can give basically perfect consistency by saying that the writer is going to swap the maps after every update. So if you do that, then the reads will always see the latest state. Or of course, you can trade this off for performance by saying, we're going to swap only every five updates. And now you're amortizing the cost of these swaps. So there's some atomics involved there. And the readers will now see slightly more stale reads, but you have amortizing the cost of these swaps. So there's some atomics involved there. And the readers will now see slightly more stale reads, but you have amortized the cost of the swap.
Starting point is 00:26:51 So this is a map with a tunable consistency level, sort of. Exactly. What's it called? This data structure is called EVMap for eventually consistent map. And that is what we use internally in Aurea for the leaf views, precisely as you say, so that you can choose what consistency guarantees you want, and you can trade off performance for consistency. That's very cool. Do you expose it up at the database level that this is tunable or
Starting point is 00:27:16 not yet? Currently, we don't. So currently, we always refresh the map or swap the map after every update. So the reads get the strongest consistency we can give them. There's still sort of the propagation delay for writes. But beyond that, we try to expose the latest updates or the latest state at all time. But it is something we're looking into where you could imagine that the user can choose how fresh they want this view to be.
Starting point is 00:27:44 And then it should be pretty easy for us to incorporate that into Noria to then say, okay, we will only then swap this often. So is it like, it's because there could be like a queue of rights. So it's not immediately consistent because there could be this backlog of rights, even though you're syncing it after every right. Yeah, sort of. It's okay if I'm wrong. No, no, no. So you're on the right track. The way to think about it is you do a write, and then that write has to be processed, right? The write has to go through some kind of count or a join to produce the diffs we talked about earlier. And that takes a bit of time. When you
Starting point is 00:28:22 do a write to Norio, though, we reply to you saying the write has finished the moment we have stored it on disk and sent it into the data flow. And so there's this time period between when the write is accepted by Norio and when it's reflected in your reads. And so that is that propagation time. And of course, that means that it is eventually consistent
Starting point is 00:28:44 even if once the write reaches the that means that it is eventually consistent, even if once the right reaches the leaf view, it is immediately exposed. So what are we talking about in terms of performance numbers here as compared to MySQL on this example we were discussing? So it really depends on the workload. In the paper, we both measure a sort of simplified version of lobsters where all you have are stories and votes and nothing else. This is basically so that we can have a micro benchmark where we can test lots of different approaches and see how Noria compares to each of them. So here we compare against sort of materialized view in a commercial system, against MySQL, I guess, Memcached and a bunch of similar systems. And then we also compare the full lobsters example against just MySQL. And in the lobsters case, what we see is about a six to seven X speed up using Noria. And in the micro benchmark,
Starting point is 00:29:42 we see a speed up of, oh, I don't even know if I have the number in my head, but it's more than 10x compared to both the commercial system and MySQL. And in fact, because of this EV map, we end up outperforming Memcached for reads. Even though Memcached doesn't do any, like there, that's just key value, right? It doesn't execute any queries. It doesn't maintain the cache for you. It doesn't do any, like there, that's just key value, right? It doesn't execute any queries. It doesn't maintain the cache for you. It doesn't persist anything. Noria does all of those things and still outperforms Memcached.
Starting point is 00:30:13 Yeah, that doesn't sound right. So Memcached, isn't it just like some memory map to some value and you're reading and writing from it? How do you beat it? I don't know. So the way we beat it is actually purely because of EVMap. Memcached has a bunch of internal locking to manage their memory,
Starting point is 00:30:30 basically to manage writes and reads for the same key. And that ends up costing them whenever you have very high amounts of load. Whereas in EVMap, the reads never have to take a lock. The reads always just read. They do an atomic read, an atomic increment, and then they go ahead and do their read. They never have to take a lock. The reads always just read. They do an atomic read, an atomic increment, and then they go ahead and do their read. They never have to wait. Wow, that's impressive. Do you think, is this a new data structure, a known data structure? Will other databases be taking this idea up, you think?
Starting point is 00:30:58 It's a little hard to say. I haven't looked enough at the literature to say that this data structure is new, but it is an interesting trade-off because it is really targeted for this eventually consistent case. It also comes at a bit of a cost in terms of memory use because you now need to keep essentially two indices over your data rather than just one, right? You keep two maps. Now, you can deduplicate the actual values, but the keys you still have to duplicate. So you are duplicating the index. I see. Well, there's a couple of things. Like you're presenting this as an improvement to like having a MySQL database and then like some sort of cache. Another thing it sounds like maybe on a bigger scale, something like CQRS or
Starting point is 00:31:41 even something like people use Kafka in a certain way in which they, I've heard it called turning the database inside out where like all your rights go into some sort of queue and then each materialized view, in fact, is a service that receives those and then like actually writes out the database structure that is needed. Yeah, that is a very similar way to think about the problem that Noria takes. Specifically, so we think about it as turning the database upside down, where instead of doing the work on read, you do the work on writes. And then they think of some kind of queries instead of reads. So the queries are operations that are executed by some service. And that service might choose to process the writes a bunch before you get to do that read. And that is exactly the same model Noria follows. It's kind of crazy to think because you're doing it
Starting point is 00:32:43 just based on declarative statements, where with the Kafka system, you have to kind of crazy to think because you're doing it just based on declarative statements, where with the Kafka system, you have to kind of build a service and case streams and whatever. You guys are inferring that from a declarative statement. It's impressive to me. Oh, yeah. A lot of pain went into trying to get from the declarative queries we get from SQL to a Dataflow program that actually faithfully executes that SQL statement, or in fact, sequence of SQL statements. And it turns out that we're essentially doing the same thing a database query planner is doing, except that we are doing it for the long term.
Starting point is 00:33:22 So rather than doing it for a query when you're about to read, we do it across the set of all queries that we know about. So this is also a research problem that has been studied in the past called multi-query optimization. And we're doing the same thing. We're taking all of the queries that we know the application cares about, and we're trying to construct a single program that jointly satisfies and serves all those queries. You have more information as well, right? Because that's kind of the advantage, knowing these things ahead of time, you can do this.
Starting point is 00:33:54 Absolutely. And we take full advantage of that. This is how we are able to do things like infer what indices to add, because we know what the user is going to query by. We know what the free parameters of to query by. We know what the free parameters of the parameterized SQL query is. And so we can infer what things will be looked up. And when those things are looked up, what other things have to be looked up through
Starting point is 00:34:15 join keys and the like, we can use the queries to infer what kind of sharding we should do, what internal intermediate materializations we should add. And so we have a lot of information based on having the queries in advance. So you mentioned sharding. Does this scale beyond a single server? Yes. So Noria actually works in a distributed setting in the same way that data flow systems generally work across settings. So Noria works in a distributed setting, but it does so in a somewhat naive way currently. It basically follows the same strategy that some existing data flow systems do, where anytime there's an edge between two operators, implying that the output of one operator goes as input to the other operator, that edge doesn't
Starting point is 00:35:06 have to be on the same machine. That edge can be a TCP connection instead. And so you can sort of partition the operator graph any way you want. In addition, Noria has the ability to infer how to shard a given operator or sort of a clique of operators and run sort of shards of that clique of operators on different machines and then doing shuffles before and after that stage. If, for example, you want to aggregate by some other column that doesn't match the sharding key. The first part of that, does that mean like each materialized view would live on a single server? Is that the kind of unit of division? Yeah. So if you partition the graph, then you would end up with different materialized views
Starting point is 00:35:47 on different servers. If you, in addition, shard those operators, then you would end up with one shard per machine. That said, because Noria can choose for any given shard or any given clique of operators where to place them, it can also place them on different cores on the same machine. So this is how Noria gets multi-core scalability too, is that you can't execute a single operator
Starting point is 00:36:12 on multiple cores, but you can shard that operator and run it across multiple cores or multiple machines, whichever you prefer. Similarly, you can partition the graph and run the partitions in parallel, either on different machines or in different cores. Wow. So this thing does a lot. How big is, how much, how long did it take for this to be constructed? Let's start there. Yeah, it's definitely a bit of a beast at this point. We started this project in the end of 2015. And so it's been going for a while now. It's now up to, I think last time I counted, we were at around 70,000 lines of code, although that's including sort of documentation, testing, SQL parser, MySQL front end. So a bunch of other somewhat auxiliary things. But I think that the core of Noria is 40 to 50,000 lines of code. Yeah, so it's not a small research toy.
Starting point is 00:37:07 It's a real database. Would you encourage people to use it for, to actually use what you're saying as a drop-in replacement for some MySQL system? So I think those are two very different questions. For the former, I think I would say yes. It is a much more real system, I think, than many research systems tend to be in the sense that we strive really hard to be able to execute real application queries.
Starting point is 00:37:35 And we want the MySQL shim, for example, the thing that lets you use existing MySQL client libraries, means that in theory, at least, you can just run a normal application and it should just work. And that is something we sort of strive to provide. In terms of using this in practice, though, I think there are a number of things that you care about in production that we haven't cared about in the research. So examples of this are things like being able to choose what you use as your underlying persistence layer. So currently we use RocksDB on a single disk, but you could totally imagine that someone wants to integrate this with their existing MySQL database or run it on GFS or something like that. Similarly, we don't have very good support for some more esoteric SQL constructions.
Starting point is 00:38:28 Like if you have joins where the join condition is not a quality, or if you have things that require range indices we don't currently support, or even more esoteric things like the sound X SQL operator, which checks whether two strings sound the same when pronounced in English. We just haven't added support for that, but it's also not particularly interesting from a research perspective. So what programming language is Noria implemented in?
Starting point is 00:38:55 So before we go to that, I'd like to add one more limitation that I think is kind of important. Yeah. So there's one other thing that I think prevents you from using Norium production today, at least if you want to run it on multiple machines. And that is the fact that it currently doesn at the failed operators or below in the graph. And that includes any materialized state. You can always recompute that state from the base tables, and that would be equivalent to running the read queries again in a database setting,
Starting point is 00:39:37 but that might take a bunch of time. And similarly, if you have any other kind of failure, even if it's controlled, like you want to restart your server, all the materialized views start out empty. And so we are working on ways to add more dynamic fault tolerance to Noria, but that is not something we currently support. So how did you implement it? Noria is implemented in Rust, the new, well, I guess it's not new at this point, but Mozilla's sort of, quote unquote, systems programming language. And when we started working on Noria back in 2015, Rust was relatively new. This was not too long after the 1.0 release of Rust. And do you regret starting so early on Rust?
Starting point is 00:40:21 Actually, quite to the contrary. I am very happy that we chose Rust over pretty much any of the alternatives that were available to us. I think if we had started writing it in C++, I think we would not be able to have a code base of 70,000 lines of code today because we would be too bogged down in just getting stuff to work in the first place.
Starting point is 00:40:44 This is a highly concurrent system and writing those in C++ is pretty painful. We could have chosen to write it in Go, but with Go, you have some other problems like debugging concurrency is pretty hard, even though concurrency itself is easy. And we also want pretty low level control of memory, which you can sort of make, go let you do that. But Rust is better tailored for implementing your own data structures at a lower level. I'm familiar Rust has the borrow checker and the ownership model. Were those helpful in building this? Yeah, I think there's certainly been a sort of love-hate relationship with the compiler, which I think a lot of people develop with Rust when they first start out.
Starting point is 00:41:28 I think it takes a little while to get used to the borrow checker because it's a part of the programming process that people aren't used to dealing with, or they're used to dealing with it sort of after the fact, like they realized later on that they were trying to modify a value they weren't allowed to. They're not used to the compiler yelling at them when they try to do it in the first place. And so that definitely took some getting used to. But I think subsequently we've found it to really help us avoid a number of bugs that would have been very nasty to track down. And I had a previous interview with Jim Blandy, who wrote a great book about Rust. And like, I think one of the things he said was sort of, if you're using unsafe, which is the way to kind of turn off the world checker, like you're probably doing it wrong. So is it all unsafe everywhere? Or what's your strategy around when Rust gets in the way and when you can kind of ignore its concerns? So I think people overestimate how bad unsafe is. And I think
Starting point is 00:42:33 they also overestimate how often you need it. In Noria, we have very little unsafe code. I would have to go back and check, but from memory, it's on the order of tens of lines. What we do have is we depend on some libraries like EVMap, which do have some unsafe internally. But even EVMap, that only has, I think, seven unsafe lines of rust. So you only really need it when you're doing something where you know about an invariant that the compiler can't check, that is really what it comes down to. Like if you have pointer to some piece of memory and you know
Starting point is 00:43:12 right now because of some other bookkeeping you're doing that no one else is currently accessing that memory, the Rust compiler has no way of checking that it's safe for you to mutate through that pointer. So when you're using unsafe, you're basically telling the compiler, look, I have checked this manually. And it's pretty rare that you have to do that. I think in the beginning, when you don't really know why the borrow checker is yelling at you, it's tempting to use unsafe to sort of circumvent what it's trying to make you do. But usually that's not the way to go about it. Usually there's a safe way you can use instead.
Starting point is 00:43:51 But learning what those safe ways are might take a little while. So you're kind of like, don't use unsafe. But I am far enough along. I understand the rules. And in this particular narrow case, I can do it without exposing a memory safety issue. Sort of. I mean, I'm also very hesitant about using unsafe because I know that that is where you can shoot yourself in the foot. However, writing unsafe is basically like writing any C code, right? So it still is an overall win. The way I think about it is if I think I have to use unsafe, I first try to figure out a way in which I could do it slightly less efficiently or maybe not even less efficiently at all without using unsafe.
Starting point is 00:44:35 And only when I've convinced myself that I really need unsafe, then I sit down and go, okay, if this is in fact going to be unsafe, I'm going to have to convince myself that there really isn't, I'm really not doing anything unsafe here. And that usually leads to like a page full of comments that explains exactly why this one line of unsafe is okay. So you are part of a research group that's building distributed systems, if I understand correctly. In the past, there's been distributed systems built, like obviously C++, but some more recent ones like Cassandra is like, it's the JVM, like Zookeeper is like JVM based,
Starting point is 00:45:16 etcd is like Go, I think. So like, where does Rust fit in the future of building distributed systems, distributed databases? I actually think Rust fits really well into that ecosystem. And there are a couple of reasons why. The first one is that what a lot of these distributed systems end up with in terms of time spent during development is in trying to debug concurrency issues. And Rust just fixes that for you. While it is true that you can still have
Starting point is 00:45:46 bugs in Rust software, absolutely, and you can still have races if you use unsafe or you use some library incorrectly, they're just much rarer, at least empirically, I found when you use Rust. The second one is Rust has this nice intermediate where you get low level control over memory if you need it, but the language still feels pretty high level when you work with it. You get things like generics, you get sort of enums like algebraic data types, which are very nice, gives you options and results. Error management is quite nice. The asynchronous IO story for Rust, I think, is getting really good now. And so I think the language has a lot to offer in that space. In fact, to add to that, as you mentioned, I'm in the Parallel and Distributed Operating Systems
Starting point is 00:46:36 group at MIT. And slowly but surely, I'm sort of getting the other people on my floor to give Rust a shot. And it's now starting to take over in a number of the projects there. Not all, there's still like, if you're writing a kernel, you probably still to some extent want a lower level language, at least if you want to modify existing kernels. But Rust is now being used for things like writing low level networking stuff. It's being used for code generation. They're considering using it for porting a sort of RPC, a new research RPC system. So it really is starting to get some traction in that world. And it sounds like because of you, do you see yourself as an advocate for
Starting point is 00:47:18 Rust? Well, I suppose to some extent that's true. Although I think the way it came about for me was more that Rust is the first language in a long time where I find myself enjoying the language. Like I have found that I actively want to take old projects I've written and rewrite them in Rust, which is a sort of weird, weird desire to have. So I really feel as though Rust has a lot to offer. And that means that I end up talking about it, but it's not really to evangelize in any way. It's just my experience is that it works well for these problems. Yeah, I totally get that. And you do live streaming of yourself doing Rust coding? Yeah, that's right. So I started this last year and it came about because some of the Rust core team ran this, they ran the Rust survey that they do every year. And in it, a number of people said that what they really felt was lacking in the Rust ecosystem was
Starting point is 00:48:18 resources for intermediate level developers, people who either come, have a lot of experience for languages like Go or C or C++ and want to see what Rust can do for them or for programmers who had started out with Rust and now felt like they were ready to level up. And I figured that one of the best intermediate resources would be to see real software being built. And so I sort of sent out some feelers
Starting point is 00:48:41 on what would people like to see? And the overwhelming response I got was we want to see people program. Don't want to read blog posts. We want to actually watch people get stuck and people who know what they're doing so they can then get themselves unstuck. And so I started doing these live coding streams where I sort of just build real software in Rust and people watch and comment and sort of correct me all the times I'm wrong and then post them online afterwards.
Starting point is 00:49:09 It sounds like, it sounds a little intimidating to me to do live coding in front of an audience. How do you find it? Oh yeah, it was absolutely terrifying. No doubt about it. And I remember that very first stream, I had all sorts of worries. Like I would start the stream and no one would watch. I would start the stream, people would watch, and I would get stuck and not be able to get unstuck. Or I would just make like massive mistakes that people would comment out and sort of laugh at me. What I found though is actually quite the opposite. I started programming.
Starting point is 00:49:40 I tried to sort of solicit ideas on what people would like to see. I started programming all of them, and the people watching were enthusiastic, helpful. They found the material useful and interesting. And it was just a really sort of supportive experience. And I found that over time, I think it has also made me a better programmer because in some sense, it's almost like pair programming, except with a hundred other people. And that is really fun. But you are right, it is also intimidating. I think what's needed is to accept the fact that you're going to get stuck
Starting point is 00:50:12 and you're going to make mistakes. And then realize that the people watching, what they learn the most from is watching how you deal with problems in your code, right? How do you debug? How do you figure out what went wrong? How do you think about how to structure your program in the first place? That is what people learn from. And therefore it's good when you get stuck. Yeah. Do you feel like there's any sort of a performance art aspect to it that you're performing for a crowd or do you look at it a different way? That's a good question. I think I don't think of it so much as performance art as I think of it as it teaches me to articulate my thoughts, right? I think a live programming
Starting point is 00:50:55 stream is pretty much useless if the people watching don't hear what you're thinking, because that's what they're going to learn from. And so you need to just like start talking and keep talking and making your reasoning out loud, which feels somewhat unnatural to begin with. And so in that sense, you do have to get used to the performing in the sense of making your thoughts audible. But I don't think there's that much performance beyond that. In fact, if anything, it feels more like a conversation with the viewers, right? They are observing you programming and will say, why were you doing that? Or can you tell me more about why you made that particular design choice? Or in the process of explaining some design choice, I realized that it's not the right one. And I point out why, and then we collectively come up with a better design.
Starting point is 00:51:44 I was thinking of like, they have those contests where they like build a game in 48 hours. And a lot of people like live stream it. And I get the sense like some people, they've built a similar like game engine, like many times. And so you sit down, you can almost watch them like with popcorn, right? Like they know exactly what they're doing. And it's almost like a, as an outsider, it feels a little bit to me like, you know, this is a practice routine and they're like building it up, but I guess you're doing a more exploratory approach, right? Yeah. So I very intentionally didn't want it to be like that. So I try to choose problems that I have some past exposure to in the sense that I know roughly what the problem is like, but I don't do anything where
Starting point is 00:52:27 I've already written this code before. I also don't do any planning ahead of the stream. So I don't design the software, the APIs. I don't read the RFCs. I don't look at the APIs that we're going to interact with at all before I start the stream because I really want it to be a resource where people can watch and learn how someone who's experienced with the language and experience with software development in general, how they would approach solving this problem. And that includes the process of how do you get to the point where you're writing code. And so I think it's important to actually include that aspect and not just do sort of a show off, look at, I can write code that I know how to write because I've done it before. I don't think that's interesting or
Starting point is 00:53:10 useful to anyone. Yeah, no, that's true. You're brave, I think. It's certainly been scary and intimidating. There's no doubt about that. If people wanted to become live coders themselves, like where would they get started? So I think you need to first figure out why you're doing it. What is the goal? Is the goal for you to develop interesting stuff and you are just happy for people to watch? Do you want to teach something specifically, sort of do like a video tutorial or a class, or are you trying to just teach them programming in general? Because what your intended outcome is, is going to affect how you do the stream. It's going to change whether you do planning or not. So for example, there are a number of Rust even start. Some of them, they just read code and that can be useful too. And so I think having a good idea of what it is you're trying to do is important. And then I think the second element
Starting point is 00:54:16 of that is you want to make sure that the viewers or listeners have something to listen to, right? It's not going to be useful if they're just watching you type stuff out because they're not going to learn anything from it. It might be interesting, like you mentioned in sort of game jams or something like that, but it's not a learning experience unless you are also telling them why you're writing that particular code. Yeah, that must be a challenging part. It's true. One of the streams I did that was really helpful for me, and I think for other people too, in figuring out what the stream was even going to be, was I did an open source contribution stream where I had people on Twitter send me suggestions for open source Rust projects that were still relatively new, where we would just go to that GitHub page.
Starting point is 00:55:10 And I had only seen the readme before. And then we were going to make a contribution to that project. And that could be filing a pull request. It could be reading through the code and trying to understand how it works and file an issue. It could be taking an existing pull request and doing a review on it. Just something where we're learning some other code base. We're learning how to reason about existing code and we're learning how to contribute to the ecosystem. That's amazing. Yeah. I can see why that would be really interesting to watch. I'm thinking like, hey, I want to contribute to an open source Rust project and now I can watch John
Starting point is 00:55:44 and kind of see his approach. Yeah, and that's exactly how it worked. And I think another reason why that stream went so well was also because people felt as though they had a stake in it because they could suggest projects that we were going to do. And it also made them feel as though it was more real, I think, precisely because they knew that I hadn't seen these projects before. And even though I say that in every stream,
Starting point is 00:56:08 I think that adds to the genuine feeling of it. And we ended up doing, I think, three different projects in that one stream. So the streams end up pretty long. They're usually about five hours long. But over five hours, you can cover a lot of material. And I've had people ask me whether the streams should be shorter, whether I should do sort of shorter videos and then rather do more of them. But I think for programming, it's valuable to run longer sessions
Starting point is 00:56:37 because otherwise there's just so much context you have to keep track of in your head and having to put away the project and then pick it up a week later, I think is actually pretty hard and requires that you spend a bunch of time just reiterating what you did last time. And it's not how programming usually works. Yeah. So I think that we covered Noria in a lot of depth and kind of, I had you compare it to a number of different databases. And then we talked about Rust and live streaming. Is there anything else you think we should have covered? No, I think that's a pretty good summary
Starting point is 00:57:13 of the kind of things I've done recently. I think the one thing I would add is we really want to hear other use cases for Noria that people think would be interesting, whether that is queries that they think would that people think would be interesting, whether that is queries that they think would be hard or would be interesting, workload patterns that they care about, or even just from the business use case side of things, what are use cases that you have found to be hard to cache, hard to write queries for, how to write a Kafka pipeline for. I think it would be useful for the
Starting point is 00:57:46 research to learn more about what the sort of quote unquote real problems are. Yeah, that's great. So I will definitely put links in the show notes to all of these various projects. I mean, yeah, hopefully some people have feedback on this database, which seems quite amazing. So thanks for your time, John. This has been a lot of fun. Absolutely. It was a lot of fun and it's always interesting to talk about these things. So I'm happy to connect to anyone after the fact too, if they want to chat more. Great. That was the interview. I hope you enjoyed it as much as I did. If you have any feedback, let me know via email or Twitter, or just join the Slack channel. Thanks to Leif Batterman and Rodrigo Willrich and Tim Heaney, and probably many more people who recently said nice things
Starting point is 00:58:33 about the show. Thank you very much, guys. I appreciate that a lot. I will talk to you next time..
