Programming Throwdown - 164: Choosing a Database For Your Project With Kris Zyp
Episode Date: September 11, 2023

Things to consider when choosing a database:
- Speed & latency
- Consistency, ACID compliance
- Scalability
- Language support & developer experience
- Relational vs. non-relational (SQL vs. NoSQL)
- Data types
- Security
- Database environment
- Client vs. server access

Info on Kris & Harper:
Website: harperdb.io
Twitter: @harperdbio, @kriszyp
Github: @HarperDB, @kriszyp

★ Support this podcast on Patreon ★
Transcript
Programming Throwdown, Episode 164: Choosing a Database For Your Project with Kris Zyp.
Take it away, Jason.
Hey everybody. So Patrick and I have been doing
a bunch of solo episodes or duo episodes, non-interview episodes. It's been really fun.
But every now and then, an interview comes across our plate that is a really spectacular
opportunity for us to dive into something that's really important, especially
for folks that are just getting started. You might be in college or high school. You might
not have a lot of years of experience under your belt. And one of the things that I didn't know
until much, much later is the power and the usefulness of databases. I thought databases
were for financial folks or for
people who are really professional, people who wore shirts with buttons. I thought databases
were for them. And so as a high school student, college student, I was doing a lot with data
structures that really should have been in databases. So we're going to talk about how
to choose a database for your project and what
different databases have to offer and how they can make things a lot easier. And with me, I have the
SVP of engineering at HarperDB, Kris Zyp. So Kris, thank you so much for coming on the show.
Thank you. I am delighted to be here.
Cool. So before we get into the topic, why don't you tell us a little bit about yourself?
So how did you get into computing? Did you go to college for it? How did you kind of learn that
trade? And how did you end up following the path to HarperDB?
Sure. Well, it started when I was about 10. I grew up in a great family of school teachers, and we had a computer with a 20 megabyte drive and Turbo Pascal. So I got started with Turbo Pascal when I was 10 and just dove into it and
loved it, loved what I could do. I've always kind of been a do-it-yourselfer. So when I saw
there was this new game called Tetris, I was like, I can do that. I'll write it and play it myself. So that's what I did: I wrote Tetris in Turbo Pascal. It was probably a terrible clone, but whatever, I had fun doing it.
That's amazing. Did you share it with anybody, or was it a solo thing?
Yeah, this was before the open source world, I think.
Yeah. And I wrote a check-recording program for a chiropractor friend of mine when I was, I think, 13 or 14.
So that was great.
I had a lot of experiences as a kid growing up programming.
And so even going into college, I knew I wanted to be involved in computers.
So I did computer engineering at Oregon State and then computer science at University of Utah.
And did interesting work on simulations with electromagnetics in the body at University of Utah.
Oh, wow.
So did you have a background in magnetism and electromagnetism, or was it more like you were the engineer and you found yourself with these scientists?
It was more I was the engineer and I was, you know, handed a C framework for writing these simulations.
And, you know, it was a great opportunity to learn more about physics and medicine and medical research. And so I really enjoyed getting to be a part of that.
Wow, super cool. That is amazing. I come from a very similar background to you, and because we were kind of pre-internet, we didn't really have an opportunity to share a lot of projects. That is something that folks today really take advantage of: there's a whole community out there that loves to see the different hobby projects and other things you're building.
Yeah, for sure.
Yeah, that's amazing.
And then from there, I worked with a friend doing some consulting work with a company called Documentum that did database software.
And then from there, that's when I actually really got involved in more in open source software.
And I started working at a company called SitePen.
They were kind of the main company behind the Dojo Toolkit, if you remember back in the days of the Dojo Toolkit.
You know, I remember the name, but I forgot what it is.
What is the Dojo Toolkit?
It was kind of like around the same time as jQuery.
It actually came out a little bit before jQuery, but it was in a similar vein: a client-side web toolkit that was filling in for all the crazy discrepancies between different browsers at the time, trying to provide a client-side library. So I was a core contributor with Dojo for a while, and really doubled down on open source: got involved with the CommonJS committee, went to W3C and TC39 meetings. I was the author of the first draft of the JSON Schema specification.
Wow.
Okay, wait, let's rewind a little bit, right?
So we just talked about how it's kind of hard to get into, at least back in our day, it's
hard to get into communities.
I definitely didn't have a Turbo Pascal community in my small town or anything like that. And so how did you
break into that? Right. So, you know, W3C, you're writing the JSON schema. Like, how did you build
up kind of that network over time?
I mean, I was just working on open source software and throwing projects out there, and that was at the time when open source was starting to really take off. In some ways it was actually easier at that time; now there's so much saturation of projects out there.
Good point.
I mean, that was around 2010, where if you spent a couple days creating a new web framework or something, people actually paid attention, because there wasn't much else out there. So I think it was relatively easy to break in and start connecting with people, start emailing different people and talking about different ideas. And so, yeah, it was a time when it was easy to get involved with what CommonJS and different groups were doing.
Very cool.
Yeah. And so while you were doing open source, were you working at companies that were very open source friendly? How does that symbiotic relationship work?
Yeah, well, again, I was at a company called SitePen, and we were really focused
on Dojo at the time. So I was doing a lot of work for the Dojo framework. I did a lot to write their event handling. We had a store interface for interacting with different data stores, so that was a little bit of bridging into databases from the client side: finding good common uniform interfaces for interacting with different data storage mediums. I did a lot of work with that at the time, and it also afforded a lot
of opportunities to be involved in the broader JavaScript ecosystem at the time.
Well, that makes sense. Very cool.
And so what... Oh, go ahead.
Yeah, sorry. Another thing I contributed to at the time: I wrote the original proposal for the Promises/A specification. We had put together some of the original ideas of what promises should look like in JavaScript. And the Promises/A proposal is actually what promises basically are these days: when you call await on a value, if it has a then method, that's what defines it as a promise. I'm certainly not claiming to be the originator of promises in JavaScript, but I had an opportunity to be involved in a lot of the original discussions about how that should work in JavaScript.
That is really cool.
I remember, and I don't know if promises weren't around or I just didn't know about them, but writing JavaScript without async await is really painful. It's every function, and then... and then... but any one of them could crash, and if they do, you have to roll back the parts that you did, but that means you have to keep track of it everywhere. It just got totally out of control. And, yeah, learning about that was a lifesaver.
Yeah, and before promises, everything was just callbacks, right?
Right. Node.js applications were just callback, callback, callback, callback, and every step had to have an error handler.
Yeah, it was fun. That was wild. So from SitePen, at some point you ended up at Harper. Why don't you talk about that?
Were you there at the beginning?
Or what's the Harper story like?
Yeah, yeah.
So there was a little bit of a transition in between.
So from SitePen, I went to a company called Dr. Evidence.
Kind of an opportunity to go back into medical research a little bit with programming.
We were doing a lot of work analyzing clinical studies. These were super complex data structures that were super nested. They were in, like, sixth normal form in a SQL database: highly normalized, really well structured. But to load a study took several minutes to do the join required to pull a single study from the database and do analysis on it. And we were trying to create a user interface to do these clinical analyses on the fly, in a few seconds. So that was really where I got involved
more at the lower level. We needed to build caching, so we started looking at key value stores, and we were using LevelDB. Then I became more interested in using LMDB, partly because it had multiprocess support and it looked like it had very good performance characteristics. So I started using the existing LMDB JavaScript package, and I actually ended up taking over that package, maintaining it, and continuing to make advancements on it. That was mainly to facilitate the very low latency interactions we needed, where we could constantly be fetching different parts of these studies, doing analysis, fetching different parts, doing natural language processing retrievals, pulling all this data together. These database interactions needed to happen in microseconds, so we needed low-level capabilities to interact with this data.
So I got more and more invested in optimizing these low-level interactions between JavaScript and LMDB. And it actually became reasonably popular, partly transitively: Parcel and Gatsby, and Kibana, which is used with Elasticsearch, started using this package. So it actually has over half a million downloads on NPM, which isn't wildly popular or anything, but it's enough.
It's a really cool, popular package. And then I'd also developed some serialization/deserialization libraries for MessagePack that are used with it as well, which ultimately meant we could get microsecond-level access to data from a data storage engine, which was really cool.
How did you take it from, you were saying, multiple seconds to, let's say, microseconds or even milliseconds? You're talking about orders and orders of magnitude. Is it really just changing technology? Was their code just broken? How did you get such a dramatic speedup?
Yeah, yeah.
No, this is a fascinating part of it.
You know, first of all, there are a lot of layers involved in just a normal SQL query, right? You have parsing involved, you have a network connection involved, you have serialization and deserialization involved. There's a huge amount of complexity. If you are just trying to retrieve a record by primary key, at the actual storage engine level those things are insanely fast; just doing a B-tree retrieval is an extremely fast operation, and there's a huge amount of overhead on top of it. So if you're just dealing with "I need to fetch this data really quickly, I need to fetch that data really quickly," first of all, doing things in-process with an embedded data storage engine is radically faster, because you don't have any network overhead and you don't have as much serialization/deserialization cost.
So that was the first step. But then, as I was getting more into optimizing this JavaScript, there are also really fascinating, just weird things that you run into. The simple process of having a memory pointer and being able to access that data with a C pointer is the bread and butter of C programming. Turning that over into JavaScript and getting a buffer that points to that same reference
is like an insanely expensive operation.
That allocation is actually really, really expensive
in a JavaScript engine.
And so there was a massive performance gain
that could be had just by realizing
that if we just use the same allocated buffer over and over
and copy data into it as the mechanism for that interface
between the C-level and the JavaScript level,
at least for small records, you get like a 10x performance gain. You go from, you know, 40 microseconds down to four microseconds or two microseconds. And so it was the combination of that and then employing more sophisticated deserialization techniques. It turns out there are techniques you can use for MessagePack deserialization that are actually faster than JSON.parse. So that single fetch can actually be performed, retrieved from the database, and deserialized from MessagePack faster than JSON.parse can be called with a pre-built string. So it's a big pile of different optimizations that came together to achieve this several-microsecond-level access to data.
Yeah, that's amazing. That's so satisfying with something like that, because you're marching towards something as you kind of hit that asymptote. I love that feeling where things get super optimized over time.
Yeah. And so from there, I guess, back to the journey:
To me, this was maybe the open source dream outcome. It's fun to make an open source project. It's cool to see some people use it, fun to see it get moderately popular. And then I basically applied to HarperDB because they had been using LMDB.js. We actually call it lime juice so that we don't have to say six syllables. So, basically: make an open source project, and get hired at a company to work on that open source project and build database software on top of it. That's how I ended up at HarperDB, through this work at the data store level.
Wow, that's amazing.
So you applied to HarperDB as an individual. You weren't part of a company acquisition or anything like that?
No, no, nope.
And yeah, I had had a number of interactions with like Kyle at HarperDB.
And so we knew each other.
I mean, we'd had, you know, a number of interactions on GitHub issues and, you know, I'd solved
some problems for them. And so when I applied,
they were like, oh, it's Kris. That's awesome. Come join us.
That is so cool. I'm hearing more and more of these stories, very similar to yours, and it's extremely inspiring. Two of them I saw recently. One is the person who created FastAPI and SQLModel; I'm going to butcher the name, but I think his handle is tiangolo. He actually got into some kind of incubator with Sequoia Capital, which is a VC, a venture capital fund.
And basically they just said,
look, you have amazing open source projects.
We're just going to pay you to keep working on them.
And it's an amazing relationship.
And then the person who came up with llama.cpp, which is a way to run these ChatGPT-like open source large language models really fast on the CPU.
That person, same thing, they started a GitHub project
and started posting about it on Reddit.
The GitHub project got really popular.
They've just been spending just an insane amount of time on it.
They've brought in Llama 2 and all these things that have just come out in the past couple of weeks; this person's totally on top of it. And same kind of thing: a group of venture capitalists got together, and I think the person's in Serbia, so there's not really even a personal connection, but they just funded this person: this is amazing and we want to be a part of it. And your story as well. I think if you have that penchant to create things, put it on GitHub. So, to turn this into a question: how do you actually build some word of mouth? You create lime juice, right? It's a GitHub project. You push the first version. It has zero stars, zero followers. What do you do next? How do you actually build it up?
Yeah, I feel like you're asking the wrong person. I mean, I talked about it on Twitter some. And the lime juice project actually started as node-lmdb, so there were already some existing users. I took over maintenance of it and then basically forked it with some of the newer ideas I wanted to implement to make things faster. So there was some natural growth there. But yeah, I've tried to talk about it on Twitter, and I think from there, once you actually get a little bit of a foothold, you see some other projects using it, and it kind of just organically grows. But I'm the last person in the world to talk about how to be effective in marketing open source projects.
Well, it goes to show how healthy the system is, right? You can focus on making good content, and through the power of the internet, the collective consciousness of humanity here, we can all start to find those amazing projects.
Yeah, for sure.
I agree. Yeah. I think it's
similar with eternal terminal. I had a buddy call me a few days ago saying, oh, he has some eternal
terminal issue at work. And they were asking who can, who knows anything about this? And he said,
oh, I know the guy who wrote that. And so he called me and was asking me some questions around.
It was pretty esoteric, you know, SSH type stuff.
But same kind of thing.
No real promotion or anything.
And, you know, I've created hundreds of projects.
And that's the only one that's really taken off to that degree.
And you can't, at least I can't, really predict it. But when you do find something that strikes that chord, it's really satisfying.
Yeah, I agree.
And there's actually been projects I've had where it's been frustrating that they aren't seen by more people.
And then I've had projects where it's like, please stop using this, too many people are using it.
I had developed one of the early JSON Schema implementations and didn't do a good job of maintaining it, but it has a ton of NPM downloads. I mean, it'll continue to exist and I'll keep it out there, but it's not something I continue to work on.
So, yeah, that's really difficult, because you only have so many hours in a day. But it is hard to see the issues pile up. Last week I actually went through and addressed so many issues in Eternal Terminal, but they're piling up way higher than I can really address.
Maybe one day. I know so many folks talk about building a marketplace economy on top of GitHub. So many companies have tried this. It's almost starting to become a tar pit idea. Have you heard of this term? A tar pit idea is something that's so appealing, it feels like a warm bath, but you get in and you're stuck and your company dies.
Yeah.
So a personal CRM is a tar pit idea. Facebook for X, Uber for X, these are all tar pit ideas. And I feel like monetizing GitHub is starting to become a tar pit idea. But I do think, you know, Eternal Terminal is a great example: there are so many people using it. And your JSON Schema library is another, even better example. Somebody should be able to make a modest living making that library better, and we really just don't have the marketplace for it. But there are just so many moving parts, it's hard to really get that right.
Yep, you're absolutely right. And I agree, it's one of those things where I would love it if that could be reality; I don't know how to make it reality.
Yeah. So many smart people have tried. I'm a little afraid to... it's like saying Voldemort or something, right?
Yeah.
Well, that's amazing. So how long have you been at Harper?
I've actually been there for just a little over a year, about a year and a half now.
Cool. Great.
And we'll get more into the company after we talk about the main topic, but I'm just curious: is it a remote thing? Are you all together, or is it distributed?
Yes, it's distributed. Our headquarters are in Denver, and there are a number of people out there. I live in Salt Lake City, and most of the engineers are working remotely. It's nice to actually be in the same time zone, though. I've been working remotely for many years, I think this is my 15th year, and the previous companies were in other time zones.
Oh wow.
Yeah. So this has been just a normal transition for me. COVID didn't affect work at all; I just kept working remotely.
Yeah, very cool.
Great.
We'll definitely put a bookmark in that.
I definitely want to talk more about Harper and that database. But we'll kind of step out here and talk about just choosing the right database.
And maybe before we even do that, we should talk a little bit about what is a database, you know, kind of in practical terms, like why would someone use a database versus using a B-tree library or some, you know, some JavaScript library for storing data?
You know, when should people make that decision?
Yeah, that's a good question. There actually are probably times people can use a B-tree library directly, but there certainly is a tremendous amount of functionality built on top of the core B-tree libraries in the databases you typically use day to day. Databases handle the work of maintaining data in a structured format, so that instead of just having raw binary data, it's in the form of actual fields or properties or columns. They handle indexing, so you can search for records by different values and perform that efficiently. They handle things like transactions, ensuring that multiple operations can be handled atomically, with isolation and consistency, and that data is stored on the disk drives in a durable, reliable way. Databases can get into issues of management and observability, and then provide higher-level queries. Obviously, many of us use databases through SQL queries, which gives us a much easier way of thinking about querying data than having to think about interacting with individual indices and B-trees and how those are connected and related. So that's broadly why we use databases: they give us the ability to interact with complex data using relatively simple mechanisms for querying and updating that data.
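To make that concrete, here's a minimal sketch using Python's built-in sqlite3 module. The table and field names are invented for illustration, but it shows the structured fields, indexing, transactions, and declarative querying Kris is describing:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Structured format: named fields instead of raw binary data
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")

# An index so lookups by email can be performed efficiently
conn.execute("CREATE INDEX idx_users_email ON users (email)")

# A transaction: both inserts commit atomically, or neither does
with conn:
    conn.execute("INSERT INTO users (name, email) VALUES (?, ?)", ("Ada", "ada@example.com"))
    conn.execute("INSERT INTO users (name, email) VALUES (?, ?)", ("Grace", "grace@example.com"))

# A declarative query instead of hand-walking indices and B-trees
print(conn.execute("SELECT id, name FROM users WHERE email = ?", ("ada@example.com",)).fetchone())
```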
I think it's interesting, too, like you mentioned, Kris: I think maybe in some cases you don't need a database.
I think we were having this a little bit debate maybe in the pre-show of what makes a database a database.
I feel like it's expanded a lot.
So, you know, something from like a key value store, you know, can still be a database.
And then you were mentioning a lot of things, which I think hits upon things that folks miss, which is how many users are you talking about? Is there contention for data
or not contention? So in other words, does your application running in multiple places
need to make updates to the same data or not is a big one. And then for internal tooling,
it may be that each person is kind of by construction doing something slightly different. And so really, it's more of a caching transmission mechanism thing,
in which case it's different. But then you mentioned schemas as well. I think that's one
that we were referencing JSON earlier. But people maybe not with JSON schema, but with just JSON
plain, we'll just insert a new field, right? And then stuff will break. There's no planned way of dealing with it, and everyone says, well, you don't need that stuff, I'll just figure it out. And it's like, well, yeah, you're right, that is true, but at what cost? And indexes are another big one you mentioned that falls into the same bucket. If you just put opaque binary data in a blob somewhere, could you write something to extract and index the fields you want? Yes. But is the code you write going to do a better job than something that's already battle-tested and robust? You're just going to reinvent a crappier version of existing indexers. So there's not a hard line, but I think early on, sitting down and really thinking about what you're optimizing for and targeting makes a big difference in what you select. And then also, like you mentioned: is it going to be a SQL interface or not, and what are the implications of saying, hey, I'm just going to shove random JSON objects, or, I'll stop beating on JSON, random JPEG pictures, into these columns? Well, wait a minute, hang on. Like, SQL is not going to buy you much
if all your data columns are JPEGs.
I'm not, I mean, maybe it does.
Maybe I'm not an expert there.
It feels like it probably doesn't buy you as much, right?
Could you do it?
Sure, but it's not like a good choice.
And so I think you end up with classically extremists
on both ends, you know, no database
or everything in the database.
And in reality, it's probably a little bit more fluid.
Yeah, for sure. You're right. Those are some great examples
of where, you know, this is the reason why like Redis and Memcached and different things like
that have really grown in popularity is because they are fulfilling a role of this, you know,
high speed access to data that doesn't need the extra overhead
of a full SQL engine.
So that does illustrate
some of the different needs of databases.
And one of the challenges is,
I think maybe one of the primary drivers
for what database you're going to use is,
what is the hardest thing the database is going to do
in terms of querying?
And it's hard to figure that out ahead of time, right?
Like, what is the most difficult query going to be?
Is it just going to be these like, by key lookups?
Or is it going to be, you know, a three level join or something like that?
So yeah, kind of thinking ahead about that.
And then the other aspect is: what do the data structures look like? Traditional databases have had tables with a relatively flat
structure of columns that each can have a field in it.
And part of the driver for like NoSQL databases, document driven databases is the idea that,
you know, when we are working with data in typical programming languages, like it can
be very convenient to think about data as nested structures, right? I have an object, and inside that object is an array, and inside of that array is a set of objects. That's really convenient when I'm working in a programming language. When I translate that to a relational model, now I'm starting to get into junction tables and joins and things like that, and it's like, hey, I thought this was supposed to be pretty easy, it felt really easy in my programming language, and now it's getting more complicated. So certainly data structures influence that, as well as just: how am I going to be accessing that data?
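As a rough sketch of the mismatch Kris describes (names invented for illustration), compare the nested structure you'd use in a program with the junction-table shape the relational model pushes you toward:

```python
import sqlite3

# Natural in a programming language: an object containing an array of objects
order = {
    "id": 1,
    "customer": "Ada",
    "items": [{"sku": "A100", "qty": 2}, {"sku": "B200", "qty": 1}],
}

# The same data in a relational model needs a second table and a join
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE order_items (order_id INTEGER REFERENCES orders(id), sku TEXT, qty INTEGER);
    INSERT INTO orders VALUES (1, 'Ada');
    INSERT INTO order_items VALUES (1, 'A100', 2), (1, 'B200', 1);
""")
rows = conn.execute("""
    SELECT o.customer, i.sku, i.qty
    FROM orders o JOIN order_items i ON i.order_id = o.id
    WHERE o.id = 1
""").fetchall()
print(rows)  # [('Ada', 'A100', 2), ('Ada', 'B200', 1)]
```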
Oh, that's interesting. I hadn't thought about this actually.
That explains the rise of the ORM as well, right? The object-relational mapping, the sort of middleware. A Ruby on Rails person would just go, no, this is no problem, I got you, and they would attack it by saying, hey, call it what you want, middleware, ORM, I don't even know all the terms, but basically: how do I take a structured in-memory view and push it into the correct representation in a database, and have that be, I'll call it a translator, back and forth between the two sides of the system, or even do joins or queries on the back end appropriately, so that you're trying to get the best of both worlds by having a description in the middle?
Right, right, for sure. And one of the realizations people have is: okay, if I have an array of objects inside of my object, and it only belongs inside of that object, the traditional relational model for that, where you have the object and then another table that's joined, and you may even have junction tables in between, is actually pretty complicated if all I want is this single document, which could very well just be a single lookup in a B-tree. So part of this is: how is that data structured in terms of ownership? Is this hierarchy completely contained within the objects themselves? Or are these arrays references to other objects that are then shared? In that sense, the relational model starts making more sense: you have these relationships between these objects and other objects, and if I can normalize them, there are certainly benefits to normalization, in terms of one source of truth for where a record goes, and then the joins start making more sense. So there's a lot of questions just in terms of what those data structures look like and how those map to a database appropriately.
Yeah, that makes sense.
I think that you touched on something really important
where even without a database,
you know, like, kind of circular dependencies
and circular references become really difficult
to manage. Like, you know, even imagine like an email app. So you have email folders.
Imagine you're trying to write this without a database, you know, and then you have a bunch
of email objects. And so the folder has a list of objects. Each of the objects needs to know
what folder it's in. And so if you don't do this right, you end up with this kind of pointer nightmare
where if you want to move an email
from one folder to the other,
first you have to delete it from the folder list.
Then you have to also tell it
that it's now part of another folder.
So you end up having to change like three places.
And if your app crashes or if something happens,
I have to roll that back and it just
becomes really difficult. And, you know, I remember SQL normalization was really popular in the late 90s and early 2000s, where people said, oh, you just have to follow these rules, and if you follow these rules, then you will always have a perfectly normalized world. And so we did follow these rules, and as you said, we ended up with so many different joins. It's like, oh, a person could have at most two phone numbers. But instead of having a phone-number-one and a phone-number-two column, which would be super easy, now we're going to join to this table, as you said, a junction table, which joins to another table of user IDs and phone numbers. And then you end up having to write this really complicated query to pull an entire object. So there are a lot of deceivingly complicated design decisions you have to make there.
Yeah, yeah. And you're absolutely right. We kind of grew up with normalization, in Codd we trust, and the normal forms he taught. But the last decade or two has really been characterized more by trying to figure out where the appropriate place is to denormalize that data.
And that's not necessarily mutually exclusive with normalization. You know, there are a lot of systems out there that do have a
source of truth normalization, but caching layers that do some of this denormalization where,
you know, you have a derived version of that record where the phone numbers are in line,
and you can very, very quickly and easily access
that. And so I think a lot of the evolution of databases has been learning what the appropriate ways are to do this denormalization. It can go too far the other way too, right? You can have so much data denormalized that it becomes inefficient to store. And so you start looking at ways where maybe a simple key value store that's just doing this massive denormalization is a little too simplified. You want to do some normalization; you want to have some relationships in there that are normalized to other parts. And so I think that hybrid is really maybe the direction we're starting to learn: getting in between the two pendulum swings and having efficient data storage.
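To make the phone-number example concrete, here's a sketch of both shapes (schema invented for illustration): fully normalized with a join, and denormalized with the numbers inlined, which makes reads a single lookup but gives up the single source of truth:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Normalized: phone numbers live in their own table, one source of truth
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE phone_numbers (user_id INTEGER REFERENCES users(id), number TEXT);
    INSERT INTO users VALUES (1, 'Ada');
    INSERT INTO phone_numbers VALUES (1, '555-1234'), (1, '555-5678');
""")

# Pulling a "whole user" now requires a join
print(conn.execute("""
    SELECT u.name, p.number
    FROM users u LEFT JOIN phone_numbers p ON p.user_id = u.id
""").fetchall())

# Denormalized: inline the numbers, e.g. as a JSON array column.
# Reads become one lookup, but updates now have two shapes to keep in sync.
conn.execute("CREATE TABLE users_denorm (id INTEGER PRIMARY KEY, name TEXT, phones TEXT)")
conn.execute("INSERT INTO users_denorm VALUES (1, 'Ada', '[\"555-1234\", \"555-5678\"]')")
```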
Yeah, totally. The other thing is that your data structures, whether in a database or something you serialize in C++, you want to change as your product changes. And change becomes really, really difficult: changing, but keeping backwards compatibility, handling migrations. If you saved your data a year ago and you find, oh, I need some of those records, I need to retrieve something, now I need to mutate all of this year-old data so that it can work with my modern software. These things are incredibly difficult to do yourself. And so the database tooling that I'm most familiar with, being a Python guy,
is SQLAlchemy with Alembic. Alembic is a tool where you try your best not to change the database in the database. You use Alembic to say: create a column, change this type to an int, create a new table, create a junction table. And as long as you do everything through Alembic, it keeps track of all these changes, and you now have this ledger. So if I have year-old data, I know exactly what my database schema was like a year ago, and I can tell Alembic: take this database and bring it up to modern standards, and it will execute all of these steps. So under the hood there's a ton of complexity.
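For listeners who haven't seen one, here's roughly what a hand-filled Alembic migration script looks like; the revision IDs and schema names are made up for illustration:

```python
# versions/a1b2c3d4e5f6_add_phone_and_groups.py
from alembic import op
import sqlalchemy as sa

revision = "a1b2c3d4e5f6"       # this entry in the ledger
down_revision = "9f8e7d6c5b4a"  # the previous entry it builds on

def upgrade():
    # Schema changes are expressed as operations, not ad hoc SQL
    op.add_column("users", sa.Column("phone_number", sa.String(), nullable=True))
    op.create_table(
        "user_groups",
        sa.Column("user_id", sa.Integer, sa.ForeignKey("users.id")),
        sa.Column("group_id", sa.Integer, sa.ForeignKey("groups.id")),
    )

def downgrade():
    # Each step is reversible, which is what makes the ledger replayable
    op.drop_table("user_groups")
    op.drop_column("users", "phone_number")
```

Running `alembic upgrade head` then replays whatever steps a given database is missing, which is the "bring year-old data up to modern standards" workflow Jason describes.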
I would say, maybe just to tie it off: I think that databases will force you to be more disciplined. They'll force you to do things you can't get away with if you're doing all sorts of pointer tricks and things like that. But from that discipline, you'll end up with a better product that you can rely on.
Yeah, yeah. And I think where that is maybe most visible, or at least a good
example of it, is when you're dealing with transactions. Transactions are one of those things where you never feel like you need them right from the get-go, right? You're like, oh, I want to update this, and then I want to update that. Why should I have to think about transactions? But part of the reason we do is that once you ask, well, what do we have to do if one of these is updated and this other thing isn't updated, you start dealing with tons of edge cases that are incredibly difficult to think through: these in-between states and the race conditions involved. So I think you're absolutely right: when we're forced to deal with data through transactions, even though sometimes that's a little annoying to start with, it deals with this whole class of really painful problems and makes them a lot more tackleable.
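A minimal sketch of the class of problem transactions solve, again with sqlite3 and invented names; the failure between the two updates never becomes visible:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # sqlite3 wraps this block in a transaction
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        # Crash before the matching credit to bob ever runs
        raise RuntimeError("crash mid-transfer")
except RuntimeError:
    pass  # the transaction was rolled back for us

# Alice still has 100; the in-between state never existed for readers
print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('alice', 100), ('bob', 0)]
```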
Yeah, totally. Cool. So let's see: people are super excited now, they want to make their new game engine use a database. How should they go about picking a good database?
I have a list of topics here, and we'll kind of walk through them.
The first one I have on my list is speed and latency.
So different databases kind of make different trade-offs there.
Why would anyone want a slower database?
What are the things that those databases are doing with that time?
And what are the reasons for that?
Yeah, I don't think any of your listeners are out here looking for,
what is the slowest database I can find?
Maybe I can get advice on how to find that.
So obviously that is a trade-off.
There are reasons why people have ended up with slower databases.
And there's a lot of applications that simply cannot sacrifice when it comes to speed.
You know, oftentimes when you're dealing with things that are directly driving user interfaces, or even more so maybe gaming, speed is something you can't sacrifice. Whereas what will often drive slower speeds is dealing with something that has higher levels of data consistency requirements. When you get into financial applications, there are pretty strict requirements about things not only being transactional, but making sure that you are fully coordinating any systems involved, that you have all the correct checks in place, and that you have the ability to roll back if anything doesn't look correct. And that is a very different scenario
the positions of the players in a game, for example,
or something like that,
where the speed requires very, very low latency.
Certainly, there are situations where things are slow
just because it does involve complex queries.
And oftentimes that's well recognized by the people making the query: hey, I'm searching through a huge database with a very complex set of conditions, and a lot of times there's a recognition on both sides that this is going to be difficult.
It's going to take a while.
So there's certainly those different aspects of it, I think.
Yeah, that makes sense.
Totally makes sense.
Yeah, I mean, there's the saying that premature optimization is the root of all evil; I think it's Donald Knuth who said that. But again, if you're using a database, not some kind of homebrew thing but a common database, it will be relatively easy to migrate from one database to another. And so you can always start with whatever is the most convenient. If you find that all of a sudden some government agency wants to use my product and they're demanding that it's consistent, then you can switch to another database. Or if the latency is a real problem and you're willing to be eventually consistent, then you can go the other way as well.
Right. Yeah, there definitely are opportunities for that. And, like anything in programming, you want to get it right the first time, because there is work involved in switching, but we do it all the time.
Yep. Yeah, totally. Okay, the next one I have is scalability.
One thing that comes to mind here is SQLite. I almost always start every project with SQLite. And maybe this is, again, because I'm a Python guy and I'm using SQLAlchemy, so it's very simple to switch from SQLite to something else. So I'll always start projects in SQLite, test out the project, test out the idea. Just for people who don't know, SQLite is a full SQL database. You can write queries against it, select statements, updates. You can create tables. You can do all of that. But the database is literally just one file on your computer, a .sqlite file. Now, that file could be enormous if you're putting a lot of data in it, but it's really elegant in the sense that you don't have to worry about networking or any of that. The downside is that only one process can write to the file at a time, ever. So you're not going to build Facebook on SQLite; it's just not going to happen. And so invariably you have to move to something else. But the reason there are so many different databases on that scalability spectrum is that you do get speed and latency and a really smooth developer experience if you're willing to have those really constrained environments, like running everything off of a file. So SQLite is actually extraordinarily powerful, even if it's not very scalable.
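A sketch of why that starting point is cheap to abandon later: with something like SQLAlchemy, the database is mostly just a connection URL, so the SQLite-to-server switch can be a one-line change (URLs here are illustrative):

```python
from sqlalchemy import create_engine, text

# Start the project on SQLite: the whole database is one local file
engine = create_engine("sqlite:///app.db")

# Later, moving to a server database is (mostly) just a new URL:
# engine = create_engine("postgresql://user:password@dbhost/app")

with engine.connect() as conn:
    conn.execute(text("CREATE TABLE IF NOT EXISTS ideas (id INTEGER PRIMARY KEY, note TEXT)"))
    conn.execute(text("INSERT INTO ideas (note) VALUES (:note)"), {"note": "test the idea first"})
    conn.commit()
    print(conn.execute(text("SELECT * FROM ideas")).fetchall())
```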
Yeah, and that actually is a great example of an embedded database. And like you're saying, there are actually big performance benefits to being able to directly access that data in-process; you eliminate a lot of extra hops.
But yeah, generally as you're scaling, you want to achieve a state where you can be running on multiple processes, multiple threads, even multiple servers. And a big part of scalability is: what are the ways we can vertically scale, to make sure we're leveraging the highly multi-core machines of modern servers? Are we going to be able to scale to larger and larger storage? And this is always a classic issue with databases: even if you aren't indexing data and you're just doing full table scans, everything is actually really fast at first. All queries are really fast on small tables. The real challenge with any database work is not how do I query the data, but how do I query the data in a way that's guaranteed to stay fast as the data gets bigger. That's always the challenge, I think: making sure I can do that. And it can be deceptive when you start building an application, because again, everything is fast when you get going. But you always have to be thinking: is this query going to be fast once the database is several gigabytes or several terabytes, and is it going to maintain that speed? So there's that aspect of it. And, like you were saying, the other scalability is horizontal scaling: can we run this database across multiple machines? What if we get too big for one machine? Then you start getting into issues of how the databases cluster, replicate, or shard with each other. Those definitely get into more complicated aspects of scaling a database, but those are all the different concerns related to it.
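A quick way to see the "fast now, slow later" trap is to ask the database for its query plan. A sketch with sqlite3 (names invented): the same query is a full table scan without an index and an index search with one:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, ts REAL)")

# Without an index: a full table scan. Fine at 1,000 rows, painful at a terabyte.
print(conn.execute("EXPLAIN QUERY PLAN SELECT * FROM events WHERE user_id = 42").fetchall())
# -> ... SCAN events

# With an index, the same query stays fast as the table grows
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
print(conn.execute("EXPLAIN QUERY PLAN SELECT * FROM events WHERE user_id = 42").fetchall())
# -> ... SEARCH events USING INDEX idx_events_user (user_id=?)
```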
Yeah, that makes sense.
I was always kind of curious about this, and maybe you can help elucidate it for me.
I mean, there's sort of single node databases
like SQLite, for example,
Berkeley DB is another example.
And then there's, you know, multi-node,
which would be everything from like Postgres
and MySQL to HBase to Dynamo
to all of these other ones.
And then it seemed like people were saying things like,
MySQL doesn't scale as well.
I remember when NoSQL became a big thing,
the thing that they were pushing was that it was just way more scalable,
that you could scale something like Cassandra or HBase
or one of these ones to an extraordinary degree
that you couldn't scale Postgres to.
But I never really understood why or if that was true or just marketing.
So once you go multi-node, is there a spectrum there
or are they all pretty much the same?
Yeah, I mean, I think there's definitely a spectrum there. And I think what you're hitting on is that a lot of the guarantees you would typically get in a relational database are actually quite difficult to maintain in a distributed network. It's hard to have just a partial set of a table and do correct secondary indexing on it; the whole table has to be there to get a coherent secondary index. When you start dealing with things like foreign key constraints and cascading deletes, those are really difficult to maintain consistently across a distributed network. And so when you take the existing consistency guarantees of a traditional database and just try to scale them to a distributed network, it's fairly complicated.
So you eventually end up with situations where you are trying to decide, OK, what are the guarantees that we really need. And one of the advantages that NoSQL databases had in terms of distribution
was kind of starting without those constraints, kind of starting with this blank slate of like,
okay, we are going to think about what is the level of guarantees that we can provide,
assuming that we are going to be in a distributed network and not providing any guarantees that we
can't back up.
And so it was kind of taking that different approach.
And there's certainly ways that, you know, I mean, MySQL and a lot of these databases
certainly have done a valiant job of trying to, you know, do better jobs of scaling.
And sometimes that can involve things that are a little more complex, like sharding, which involves a fair degree of work in understanding, well, how can this data be distributed? So there are certainly approaches. But carrying those guarantees, the ACID expectations that worked on a single node, and trying to guarantee those same things across a distributed network, is a difficult leap to make.
Got it. Yeah, that makes sense. I think it's PlanetScale, I want to say, one of them might be Neon, but I think PlanetScale actually bans foreign keys.
And so you have to do the cascading deletes and all of that yourself.
But what they get from that is probably much better scalability.
Yeah, I think I remember listening to your podcast on this.
And I think when he said that, I was like, yes, that is the thing that you do not want to attempt to do across the distributed network.
Yeah, just to dive in a little bit on that for the audience.
So imagine you have a user account,
the user has phone numbers, they have credit cards,
they have transactions, and then they say,
I want you to delete my account,
and I want it actually deleted,
not like Google or Facebook deleted
where they just keep your data forever,
but actually deleted.
And so you delete that account, and then you have to also delete all those other things that are derived from that account. That's where the cascading metaphor comes from: it cascades into the credit card table and the phone number table and all of that. Then, to do that quickly, you need to somehow keep, and I have no idea how this works, I'm very curious, but you somehow keep a dependency graph of a person to all of their dependent data, so that you have it ready on hand. And that sounds incredibly difficult to do across multiple machines.
Yeah, and in particular, foreign key constraints and
cascading deletes have very significant locking requirements: ensuring that the record referencing this one still exists while we're doing the delete, and that nothing else has come into existence that is also potentially using it. So it simply requires a lot of global coordination to ensure that the constraints foreign keys or cascading deletes provide are actually maintained across the network. In the NoSQL world, where you aren't necessarily guaranteeing these types of relational constraints, things get a lot simpler. You deal with things potentially after the fact: if there's a record referencing a record that no longer exists, well, we either remove that reference on the fly or tolerate it. So there are a lot of things that can be done after the fact, rather than relying on trying to maintain this consistency in real time.
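On a single node, here's what the declared version of that guarantee looks like, sketched in sqlite3 with invented names (note that SQLite leaves foreign key enforcement off unless you enable it):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires opting in to FK enforcement

conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE credit_cards (
        id INTEGER PRIMARY KEY,
        account_id INTEGER REFERENCES accounts(id) ON DELETE CASCADE,
        last4 TEXT
    );
    INSERT INTO accounts VALUES (1, 'ada@example.com');
    INSERT INTO credit_cards VALUES (10, 1, '4242');
""")

# Deleting the account cascades into the dependent table automatically,
# because one engine holds all the data and can lock both tables at once.
# Distributed stores often drop this and clean up dangling references later.
conn.execute("DELETE FROM accounts WHERE id = 1")
print(conn.execute("SELECT COUNT(*) FROM credit_cards").fetchone())  # (0,)
```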
Yeah, that makes sense. I mean, you touched on some of the extremely difficult edge cases. Imagine you're deleting someone's account, but right after they issue the command, they go to another tab and say, I want to delete my credit card, just to make sure it's really gone. And so now you might get it even in the wrong order, where a request comes in to delete a credit card while you're in the middle of deleting that credit card, so you get double deletes. Or, even worse, maybe on their phone a family member on the same account is adding a credit card, so you're trying to wipe the account and a credit card gets added right in the middle. There are so many things that can happen. And if you have these foreign key constraints, these cascading deletes,
you're making a very, very hard guarantee. And so if you're not allowing yourself even for a moment to be inconsistent, then the only way you can accomplish that is by hitting the pause button.
Yeah, yeah. Which is getting back to that speed thing. So is NoSQL basically everything that isn't, logically, a table, everything that wouldn't just look like an Excel spreadsheet? What is a good way of explaining SQL versus NoSQL to folks out there?
Yeah, I mean, I think that is a good starting point. And it is kind of a complicated thing, because there has been so much wrapped up into the notion of SQL and traditional databases. I think the primary conceptual idea behind NoSQL has been this idea of a document-driven database, where the document can be a data structure with any structure I want, and I can freely map that to the data structures in my application. It may look more similar to them, and I don't have to have as much ORM magic doing the translation. But it's also about how things are queried: NoSQL is obviously a contrast with SQL, which is a query language. So oftentimes NoSQL gets wrapped up with: okay, we're going to have different querying mechanisms for accessing that data.
Maybe it's also wrapped up into the whole relational versus non-relational.
And what does that even mean? Part of what's funny about what relational means, at least in the SQL world, is that, like we talked about, even when you do a query and there's a known foreign key, you actually have to tell the SQL engine every single time how those two tables are supposed to be joined. You have to say: join on this field to this other field. I can't just say, hey, give me the data from table two that's associated with table one. That's not part of SQL, right? You have to tell it every time what that relationship is, which is kind of a funny thing.
That's such a good point. We've kind of associated SQL with relational, even though SQL, the query language itself, isn't terribly relational. We do all that with ORMs, right? ORMs know these relationships; they're the ones that put together these joins.
Right. So, historically, there are all of these things that have been associated with traditional databases. And NoSQL was this effort to rethink some of those things: rethink the relational aspect, rethink the querying aspect, rethink the structural aspect, how we store that data. And so it's kind of given
us a way to re-approach that stuff, I think. But it does encompass a lot. And the reality is, one of the things that I've learned is that you can say a database is not relational, but there's a lot of relational data out there. Even if you aren't doing SQL, and even if you don't have foreign key constraints, I bet your data has some relational properties to it.
Yep, yep, yeah, exactly.
I mean, you almost always want to reduce on part of the data.
So you say something like,
what is the average or the median number of phone numbers
of all the users in my account?
You know, is it zero? Is it one? It makes a big difference to my product.
And so as soon as you want to start reducing on parts of these objects, then you find yourself like really wishing you had SQL again.
Yeah. Once you start normalizing more, it starts becoming more convenient.
Yeah, that makes sense.
Something you touched on that we should
we should explain more detail our orms so yeah i talked about sql alchemy as an example i'm sure
there's a ton of other ones but you know um you can write raw sql and you know you or you know
really for any database you can write raw queries, and you will get back data.
And you can definitely work that way.
And there's times where you'll want to do that for certain queries.
There's advantages to that, just like there are advantages to writing some of your code in C, even if most of it is in Python.
But by and large, you'll use an ORM for a lot of this work. And the way that works is you can actually have the ORM generate the database.
I don't really advocate for that because I feel like you can't change languages then.
You're kind of like stuck, right?
If your Python ORM generates a database, then you switch to JavaScript and your JavaScript ORM also wants to generate the database.
Now what do you do?
So either you have some leader and everyone else follows, or you use something else, like Alembic for example, but it could be anything, to generate the database. And then SQLAlchemy and a lot of these ORMs can actually look at the database and map it in real time to your data types. So just to give a very simple example, you might have a class called User. The user has an ID, a first name, a last name, a phone number. These are all just strings in your class. And with some annotations, you can now take that class and turn it into a sort of SQLAlchemy first-class citizen. And so what SQLAlchemy will do is look for a table called user, and then if you do something like, give me the User where the ID is three, SQLAlchemy will do sort of the magic to say, okay, fetch this row from this table, turn it into a Python class, and then give it to the developer. And so it generates a lot of really nice features for you.
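To make that concrete, here's a minimal sketch of that pattern using SQLAlchemy's declarative mapping. The table and column names are made up for illustration:

```python
# Minimal sketch of the mapping described above. The User class becomes a
# first-class SQLAlchemy citizen; table/column names are hypothetical.
from sqlalchemy import String, create_engine
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column

class Base(DeclarativeBase):
    pass

class User(Base):
    __tablename__ = "user"  # SQLAlchemy maps this class to the "user" table
    id: Mapped[int] = mapped_column(primary_key=True)
    first_name: Mapped[str] = mapped_column(String(50))
    last_name: Mapped[str] = mapped_column(String(50))
    phone_number: Mapped[str] = mapped_column(String(20))

engine = create_engine("sqlite:///example.db")  # any supported backend works
Base.metadata.create_all(engine)

with Session(engine) as session:
    # "Give me the User where the ID is three": the ORM fetches the row
    # and hands back a plain Python object.
    user = session.get(User, 3)
```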
Yeah, yep, that's exactly right.
Yeah, it's really fun. In the beginning, I had so much trouble with ORMs. It's one of these things that's not very intuitive, especially if you have natively nested structures; you have to kind of pull those out. But I would encourage your listeners to take the time to learn something like that. Once I learned it, I was much, much more productive.
Yeah. And maybe this is a segue into some of the challenges with ORMs.
Yeah. ORMs are great, but one of the challenges that we often face with ORMs is the classic select N+1 problem. That problem is that oftentimes you are getting data, and then there's all this related data. And if you do a query and then start accessing this data, maybe each time you access that data, it has to do another query to your database, right? This is a common problem: it can actually be challenging to get that initial data with the appropriate SQL query, the one that's going to fetch all the data that you'll need later, when you're accessing the data from the properties, right? And that can be anywhere on the spectrum, from a pretty easy change to how you do the query, to maybe it's just downright impossible to know ahead of time, based upon how you're going to process this data, what you're going to end up accessing. And this isn't necessarily a problem that the ORM causes; it's just making it easier to access the data. You're still forced to deal with these issues of, what is the appropriate way to query the data so that I'm reducing the amount of back and forth?
Yeah, let me see if I...
Oh, go ahead.
No, go ahead.
Oh, I was going to see if I could understand the problem, because I just want to see if I can wrap my head around this. So the idea is, let's say I just want to show someone's first name, last name, and their phone number, but the user class has 30 fields in it. If I use an ORM, I'll get all 30 fields, and 27 of those are wasted. Is that the problem?
Well, that can be one of the problems. But the other problem is,
let's say that you're getting this list of users and they each have a relationship with their
employer record, right? And so you're doing a join on it. And there's different ways this can work.
It can potentially do that join ahead of time and pull in all that data ahead of time. Or maybe not; maybe you just have these IDs that reference the employer table. And then as you iterate through the users, oftentimes the ORM will act reactively: as you access that employer field, it will say, oh, I haven't fetched that yet, I will go do a query to fetch that employer record. And so as I go through 30 user records, depending upon the way that you initially fetched this data, every time you access that employer field, you're doing a separate fetch to access this related table. So that's kind of the classic select N+1 problem with ORMs.
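In SQLAlchemy terms, the trap looks roughly like this. This is a hedged sketch with a hypothetical Employer relationship; by default the relationship is lazy-loaded, so the loop issues one extra query per user:

```python
# Sketch of the select N+1 trap: one query for the users, then one more
# query per user the first time its employer is touched. Models are
# hypothetical, not from any particular codebase.
from sqlalchemy import ForeignKey, String, create_engine, select
from sqlalchemy.orm import (DeclarativeBase, Mapped, Session,
                            mapped_column, relationship)

class Base(DeclarativeBase):
    pass

class Employer(Base):
    __tablename__ = "employer"
    id: Mapped[int] = mapped_column(primary_key=True)
    name: Mapped[str] = mapped_column(String(100))

class User(Base):
    __tablename__ = "user"
    id: Mapped[int] = mapped_column(primary_key=True)
    employer_id: Mapped[int] = mapped_column(ForeignKey("employer.id"))
    employer: Mapped["Employer"] = relationship()  # lazy-loaded by default

engine = create_engine("sqlite:///example.db")
Base.metadata.create_all(engine)

with Session(engine) as session:
    users = session.scalars(select(User)).all()  # query #1
    for u in users:
        print(u.employer.name)  # queries #2..N+1, one per user
```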
Oh, now I totally get it. I totally, totally get it. Yeah, that is really painful, right? So if you're writing the SQL yourself, you would know to just join the user table to the employer table and fetch all of it at once, in one query.
Exactly. And it's hard for ORMs, because you actually kind of have to look into the future a little bit, right? You have to know ahead of time, well, what data is going to be accessed from this, right? So it's challenging.
Oh man, that is wild. Yeah, I mean, for an ORM to handle this explicitly, in my user.get, you'd have to provide a list of all the related classes that I would want and not want.
Right, right, right. Exactly. Yeah.
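For what it's worth, most ORMs do give you a way to declare up front what you'll access. In SQLAlchemy that's an eager-loading option; a sketch, reusing the hypothetical models and engine from the previous snippet:

```python
# Eager loading: tell the ORM ahead of time which relationship you'll
# touch, and it folds the employer fetch into the initial query.
# (User, Employer, and engine as defined in the previous sketch.)
from sqlalchemy import select
from sqlalchemy.orm import Session, joinedload

with Session(engine) as session:
    users = session.scalars(
        select(User).options(joinedload(User.employer))  # one JOINed query
    ).all()
    for u in users:
        print(u.employer.name)  # already loaded: no extra queries
```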
And then this is maybe kind of a segue into thinking about this problem from another approach.
And that is kind of going back to the idea of embedded databases. Like, well, what if we made it so that this code that's iterating through these users
is actually close enough to the database that it can efficiently retrieve these employer
records on the fly, right?
Part of the reason why the select N+1 problem is so crippling is because we know that there's a lot of overhead to issuing each query. But if this code is executing close enough to the data, well, the internals of a SQL engine are basically doing the same thing. I mean, there's different approaches to joins, but oftentimes when a join is executed, it's going through a table, getting a foreign key, and doing a fetch from another table. It's kind of as simple as that, unless you're doing hash joins or something like that. But oftentimes it's relatively straightforward: just iteratively getting other records. So if your code can access the data at a speed relatively similar to the way the internal engine works, then you're kind of back in the realm where the code doesn't need to think ahead. And again, maybe it's not even possible: maybe as you're iterating through the users, there's actually really complex logic, involving the permissions of the user, or which employer is related to another employer, that dictates whether or not that employer record is actually retrieved. Those things may not even be expressible in SQL queries, right? And so this idea of getting code that works close to the data kind of opens up new opportunities for doing these more complex levels of data retrieval on the fly, and taking advantage of this low-latency access to data.
Got it. Yeah, that makes sense.
Yeah, that's a good transition to the last area here,
which is the database environment.
Just to give an example,
I built kind of a clone of Google Photos, just for my family. So I had a little Android app, and I have a database. I store all the photos on S3, which is Amazon's storage service, and my database kind of keeps track of the photos. But I ran into this issue where I had, and I can't remember if I was using Postgres or MySQL, but I had some SQL database. But then on my phone, I basically needed the database too, and I can't run MySQL on my phone. So I ended up running this thing called Android Room, which I think is built on SQLite. But that's an example where, on the phone, it's just not practical in an Android app or an iPhone app to run a MySQL database. And so your environment plays a huge, huge role in what set of databases you're going to be looking at. So if you're on the browser, for example, or if you're running on an edge server, that's going to change the scope and the type of databases you're going to look at.
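The embedded option really is that lightweight. A tiny sketch with Python's built-in sqlite3 module, the same engine Android Room wraps, just to show that there's no server process at all, only an in-process library and a local file (the filename and schema are made up):

```python
# Embedded database: no server, no network hop. The engine runs inside
# your process, and the whole database is one local file.
import sqlite3

conn = sqlite3.connect("photos.db")  # hypothetical filename
conn.execute(
    "CREATE TABLE IF NOT EXISTS photo (id INTEGER PRIMARY KEY, s3_key TEXT)"
)
conn.execute("INSERT INTO photo (s3_key) VALUES (?)", ("2023/beach.jpg",))
conn.commit()
for row in conn.execute("SELECT id, s3_key FROM photo"):
    print(row)
conn.close()
```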
Right.
Right, right.
For sure.
Yeah.
Yeah, and as you start being more concerned about getting access to data quickly, fundamentally this is a problem of getting data as close to the user as possible. And that kind of goes into the subject of edge-based databases, where we're trying to keep data as close to the user as possible. We have a few fundamental constraints here. The speed of light actually dictates that there are fundamental limits to how quickly you can get data from very far around the world to a user. And the other fundamental constraint is that we know this is one of the most important things to users, right? There's been study after study on user interaction showing that low latency is absolutely key to a high-quality user experience. And so I think this is another fundamental direction for databases: recognizing that we do need to get data close to users if we're going to really try to achieve the optimal experience for them.
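To put rough numbers on the speed-of-light point: light in fiber travels at roughly two-thirds of c, which sets a hard floor on round-trip time before the server does any work at all. A back-of-the-envelope sketch; the distances are illustrative:

```python
# Back-of-the-envelope floor on round-trip latency imposed by physics.
C_KM_PER_MS = 300.0     # speed of light in vacuum: ~300 km per millisecond
FIBER_FACTOR = 0.67     # light in fiber moves at roughly two-thirds of c

def min_rtt_ms(distance_km: float) -> float:
    """Theoretical minimum round trip in fiber: out and back, no hops."""
    return 2 * distance_km / (C_KM_PER_MS * FIBER_FACTOR)

print(min_rtt_ms(50))      # nearby edge node:         ~0.5 ms
print(min_rtt_ms(4000))    # across a continent:       ~40 ms
print(min_rtt_ms(15000))   # halfway around the world: ~150 ms
```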
Yeah, that totally makes sense. I think even in this Android app, it just was totally untenable to wait for a database lookup. I just wanted to be able to scroll and see all the photos. And particularly for this app, because it's meant to show photos that your family and your friends, who have agreed to share with you, have created, but also photos that you had on your own phone. And so you kind of feel like, why is this taking 800 milliseconds to pull up a photo that I took two seconds ago,
right? And you'll also see this a lot with games, where there's a lot of transitions. I've been paying attention to a lot of game design and game art recently; that's just the latest kick I'm on. When someone clicks New Game, there's always a little fade out, fade in. And I thought about, what would this game have been like if they didn't do that? And the reality is, it's hard to tell, because they're hiding it with the fade, but it probably was going to take, let's say, at least two or three hundred milliseconds to create this game, or to get from the New Game splash screen to whatever's next. And if you don't have a transition, people can see how long it took after they clicked that button, and that is kind of jarring. I was thinking about when I play really low-budget indie games, there is this thing where you feel a little bit of a stutter when you click New Game, and it kind of tells you that this is not going to be a really professional experience, you know? So it is amazing; it's a subconscious thing, one of these things you don't think about until you think about it. But latency has an enormous, enormous impact on the user experience, and it's just a phenomenal degree to which it does.
It does, yeah, absolutely.
Yeah, I mean, even if you try to use your mouse on a 30-hertz screen, it's like, just give up.
And we're talking about, you know, a few milliseconds here, right?
Yeah, that's right.
Yeah, it's totally wild.
It's just something about that synergy of really real time.
It is a totally different experience.
And you can do things to hide it.
But, you know, when you're talking about databases, you could be potentially talking about multiple seconds and you really can't hide that.
I mean, you have to get it faster than that.
There's no other way.
Yeah, absolutely.
Yeah.
Yeah, so I know for Android there's Room. For iOS... actually, Patrick, do you know what the iOS equivalent of Android Room is, for storing data on phones?
No, not short of Googling it.
Okay.
ChatGPT open.
All right.
Yeah, ask ChatGPT.
But there is something like that for iOS where it's basically a SQLite database, just like Android Room.
But I'm sure it has the word framework in it.
It's like data framework or something.
Everything is a framework.
But there is something like that on iOS.
And so if you're on those platforms,
you're almost certainly going to be using one of those.
Again, you could load LevelDB, like do some C++ interop type thing on Android; it's totally possible, there's GitHub projects for it. But if you're just starting out, use Android Room. I mean, it has the vast majority of the market share. So for Android and iOS it's kind of a no-brainer. What about for the web? I mean, what are the kinds of things that people can do in the browser,
things that people could do on the edge?
What are sort of different options there?
Well, in the browser, there's been a few different attempts over the years to provide native functionality. There's Web SQL, and then the IndexedDB engine. Lately there's been efforts to get SQLite running in WebAssembly, which is kind of interesting.
Oh, cool.
Yeah. So there's been some different things in the works for getting data to be, you know, like a database in the browser. For most large-scale applications, though, you typically are still dealing more with a back-end database. And so edge databases are kind of a big driver there: there's still a back-end that you're talking to, but it's as close as possible to the user.
Yeah, that makes sense.
So, describe for some folks... some folks might not have listened to it, but we had a whole episode on edge computing. If you haven't heard that one, go back; it's great. But if you haven't heard it yet, kind of give folks a little intro: what is the edge, when people say the edge, and what is that environment?
Sure. I mean, at a basic level, the edge is about distributing your cloud computing around the world so that there is a server that is close to every person that is accessing your data, your application. That obviously has a huge spectrum of how close you can get these edge compute machines to your users. Certainly, if you have more money, you can have 200 server locations around the world; you're going to be able to get closer than if you have, you know, four locations around the world. But at the fundamental level, you're just simply trying to get your servers as close to your users as possible, which again is all about achieving lower latency.
Got it. And so when you have a server on the edge, how is that different from, you know,
renting an EC2 instance or something and installing Linux on it?
What is that environment?
Do you get just a whole machine where you can do anything you want or are there restrictions?
I mean, there's a spectrum here, just like you'll have with cloud computing, as far as whether you can afford dedicated edge computing or whether you're utilizing shared resources. We do a lot with Akamai, and they have a lot of edge capabilities, with EdgeWorkers and things like that. But yeah, again, there's a broad spectrum of what you can afford.
Got it. Yeah, I do know that Amazon relatively recently announced Lambda@Edge, where you can write Lambda functions for the edge. But I think it's only Node or something; it's some type of JavaScript runtime. You couldn't run Python or something, not without converting it first.
Right, right.
Yeah.
Yeah.
So how did that evolve?
What's the connection between these edge nodes and JavaScript?
You know, I think the big driver is that JavaScript
really has become probably the most advanced
primary language for being able to sandbox
in an effective way.
So, you know, being able to take code
that a user has provided and execute that on a machine has always been like kind of a challenging task to deal with.
Right. Like, is this code going to do something malicious or take too much resources?
The thing is, with JavaScript, we have been using web browsers that run it everywhere. I mean, I've got a dozen tabs open that are all from different sites. This is the most well-tested, battle-tested system for taking user code and running it on a different machine in an untrusted model, where different code can be malicious, can be doing different things. And so JavaScript has really gone further than any other language in terms of this ability to host code, do so in a safe and secure way, and ensure that there's correct limitations on resources. We see this with Lambda@Edge and Akamai EdgeWorkers, and Cloudflare has very similar capabilities, where they're hosting things in JavaScript. And, kind of getting back to where I'm working with HarperDB, this is exactly the same model that we're using as well: hosting JavaScript as a mechanism for taking user code and being able to run that across the edge. And JavaScript just works really well because it is, again, so battle-tested for being able to distribute and quickly run code in a secure way.
Yeah, that makes sense.
So let's spend a little bit of time talking about Harper. Where does HarperDB fit here, in terms of, just to recap: latency, consistency, scalability, language support, relational versus non-relational? What is HarperDB, and how does it fit into the picture? When should folks use it?
Yeah. I mean, it certainly has its roots, in terms of storage, as a NoSQL database. It uses document storage mechanisms; basically, we store the object structures, and we actually store them in MessagePack format, because that's a lot more efficient than JSON. But it also has a lot of hybrid characteristics as well: a SQL query engine, secondary indexing, ACID compliance. And so a lot of those things that really make for robust application development exist, built on top of that NoSQL engine capability.
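As a rough illustration of the MessagePack point, here's a quick size comparison in Python using the third-party msgpack package (pip install msgpack). The record is made up, and the exact savings depend on the shape of your data:

```python
# Comparing the encoded size of the same record in JSON vs MessagePack.
import json

import msgpack  # third-party: pip install msgpack

record = {"id": 3, "first_name": "Ada", "last_name": "Lovelace",
          "phone_numbers": ["555-0100", "555-0101"]}

as_json = json.dumps(record).encode("utf-8")
as_msgpack = msgpack.packb(record)

# MessagePack drops the quoting/punctuation overhead and uses compact
# binary type tags, so it's typically the smaller of the two.
print(len(as_json), len(as_msgpack))
```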
Probably one of the distinctive aspects of HarperDB is the fact that it is designed to, again, run JavaScript application code, and do it basically in-process with the database engine, to achieve very, very low latency access between the JavaScript and the database engine. And so when you fundamentally have a user, a client, that's requesting data, that can go directly to an edge server. There can be application code that handles that; it can do whatever appropriate queries into the database, fetch data as it needs to, and then respond to the user. And you've had exactly one network hop.
And so that's kind of our fundamental goal: this notion that rather than maybe going around the world to an application server, which then makes another hop to a database and then comes back, we're trying to achieve basically one-hop access to data, even through the complexities of application logic, and back to the database.
Well, so if you're running on the edge, my guess is it's a full replication, so each node has a full copy of the database. And so then how do you get around some of those challenges we talked about? Like, if you're ACID compliant and two folks in different parts of the world try to delete the same shopping cart?
So we're ACID compliant at the node level, and then at the network level it's eventually consistent. But that actually still means you
get all the characteristics of atomic commits, you get the characteristics of durability, you get the characteristics of isolation. It just means that basically we aren't employing locking, so I can't lock a record across the entire database.
I can atomically interact with it. But this isn't necessarily a great fit for a financial application, where you need to do a row-level lock on a record, on an account: okay, I don't want anyone else changing this while I retrieve this money out of this one account and put it into this other account. Right. But there's a lot of applications where you still have the basic concepts of atomic, isolated, durable commits, but those can be happening concurrently. We can replicate this data and resolve conflicts based on timestamps as that data comes together, and in doing so achieve very low latency replication, as well as low-latency access to the data.
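The timestamp-based conflict resolution described here is essentially a last-write-wins merge. A toy sketch of the general technique, not HarperDB's actual internals:

```python
# Toy last-write-wins merge: when two nodes hold divergent copies of the
# same record, keep the one with the newest timestamp. Illustrates the
# general technique only; HarperDB's real implementation may differ.
from typing import Optional

def merge(local: Optional[dict], remote: Optional[dict]) -> Optional[dict]:
    """Each record carries an 'updated_at' timestamp set by its writer."""
    if local is None:
        return remote
    if remote is None:
        return local
    return remote if remote["updated_at"] > local["updated_at"] else local

a = {"id": "cart-1", "items": ["book"], "updated_at": 1694000100.0}
b = {"id": "cart-1", "items": ["book", "pen"], "updated_at": 1694000150.5}
print(merge(a, b))  # b wins: it was written later
```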
Well, that makes sense. Yeah, I mean, just tying a lot of things together here: I remember World of Warcraft had, and I don't know if this is still an issue, but they had some issue where, I guess, you could be in one part of the world... I'm totally going to get this wrong, because I don't play World of Warcraft, but you could be in one part of the world and pass something to somebody who was right next to you but in a different part of the world, because of the chunking, and it would duplicate it. So it was like, you could make a trade and then both of you cancel at the same time or something. And just because they were different nodes, and they were eventually consistent, their way of reconciling was to just let you both keep the weapon.
So yeah, Chase Bank is not going to let you go halfway across the world and double-withdraw your money. I mean, that would be nice; it'd definitely pay for the plane ticket to Singapore or what have you. But they're not going to let you get away with that. But for most situations, if your shopping cart has the item twice in it because two people in different parts of the globe added the item, that's just a glitch that we're going to have to sort out downstream, right? In exchange, what you get is all of those things that we talked about that are so important: that speed and that latency that causes something in your brain to be really, really happy when you're on a product.
Yes, exactly. Yep. We want people to be happy.
Very cool. Yeah, I have a buddy who's a musician, and he says you don't want to play kind of crazy notes; you kind of want everybody nodding their head and feeling the rhythm, and then he'll save the crazy notes for when he's playing with other guitarists. Same kind of thing here: you want people to feel like they're in this really natural environment, and latency has been proven over and over again to be super critical for that.
Let's talk about Harper, the company.
So we mentioned that you're distributed.
Roughly how many people and what's something kind of unique about Harper?
It could be your mascot.
It could be what you guys do for onsite.
What's something that makes Harper stand out from a company perspective.
Sure. Yeah, I think we have about 18 people right now, and Harper is named after the CEO's dog. And so it's very much a dog-loving company. I actually don't have a dog myself; I have a cat. I've considered it a small miracle that they hired me despite the fact that I don't have a dog. But there's generally been, you know, stand-up meetings with chickens on the calls, and in general it's a very pet-friendly company. So that might be a little bit of a distinctive.
Oh, that is really cool. I go to this place called Civil Goat Coffee, and for the longest time there were goats right there. The goats would come up to you and nudge you and stuff like that. I think they finally got some kind of complaint or something, and they had to put the goats behind a fence. But I was a little bummed. I thought the whole experience was to watch my kids freak out when the goats got close to them. That was part of the fun.
Yeah, that's awesome.
Yeah, well, that is really cool. Well, you know, you can always go from there to Datadog; it seems to be a recurring theme.
Yeah, for sure. We've definitely done plenty of Datadog.
That's right. Very cool. Well, this is great.
Anything else that you wanted to kind of get out there?
It could be... well, actually, you know, one thing is, if someone's in high school or college, they might be really looking for something that's a pretty low barrier to entry.
They're not going to want to sign an RFP or anything like that.
So for folks who are kind of really just getting started,
does Harper have a product for them and how would they get started?
Yeah, you can go to studio.harperdb.io and sign up for a free instance of the database, and that's one of the easiest ways to get started. You can also install it from npm, so you can do an npm install harperdb and start with a local installation. And so, yeah, those are some great ways to just spin up a HarperDB instance, start creating some tables, add some data; you can import CSV to have some sample data. And then you could get started with writing some application code as well, and experience what it's like to have this fast, in-process access to data.
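If you'd rather poke at a local instance from a script, HarperDB also exposes an HTTP operations API where you POST a JSON operation object. A hedged sketch in Python with requests; the port, credentials, and exact operation/field names here are assumptions, so check the current HarperDB docs before relying on them:

```python
# Hedged sketch: driving a local HarperDB instance via its HTTP
# operations API. URL, port, credentials, and field names are
# assumptions -- verify against the current HarperDB documentation.
import requests

URL = "http://localhost:9925"     # assumed default operations endpoint
AUTH = ("HDB_ADMIN", "password")  # hypothetical credentials

def op(body: dict) -> dict:
    """POST one operation object and return the JSON response."""
    resp = requests.post(URL, json=body, auth=AUTH, timeout=10)
    resp.raise_for_status()
    return resp.json()

print(op({"operation": "create_table", "database": "dev",
          "table": "photo", "primary_key": "id"}))
print(op({"operation": "insert", "database": "dev", "table": "photo",
          "records": [{"id": 1, "s3_key": "2023/beach.jpg"}]}))
```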
Very cool. And so, just so I'm clear: it really excels at the edge, but you could run Harper just on your own computer, the server part of it as well. Is that correct?
Yes, that is correct. Yep. And in general, I think, like you've experienced, that's usually a great way to do development. Usually you want to have a local instance if you're going to be doing any significant development, so that things are fast and direct, and you know exactly what's going on, and you can look at things in your task manager and stuff like that.
Yeah, totally. Really cool. Hey, Chris, thank you so much for coming on the show. It's been awesome. I really hope we've motivated folks out there to learn about using databases. You know, if you have a database class at your university, it'd be great to take it.
I know there's a lot of competition.
There's a lot of other really exciting classes you might want to take.
So if you don't take the database class, definitely take some time to get familiar with databases as a way to store and retrieve data pretty easily.
Because it's an incredibly important part of pretty much everything you're going to do
in your professional life.
And really, just thanks again for coming on the show
and helping folks get started with that.
Thank you so much for having me.
This has been a lot of fun.
I really appreciate it.
All right, thanks.
And thanks to everybody out there.
We've been going through a bunch of folks' requests
for programming languages
and topics. We have... differential equations, I think, is the next show, which will be pretty exciting; that's a pretty heavy, mathy topic we're going to talk about. We're also talking about game engines. We have a whole bunch of topics, and we really couldn't do it without all of your inspiration, all of your ideas, your emails, and also without all of your practical support on Patreon. That's really the way that we keep the show going and get the word out for everyone. And so we really thank everybody for your support on there, and we will see you all next show.
Thank you.
Music by Eric Varndollar.
Programming Throwdown is distributed under a Creative Commons Attribution
Sharealike 2.0 license.
You're free to share, copy, distribute, transmit the work, to remix, adapt the work,
but you must provide an attribution to Patrick and me, and share alike in kind.