Programming Throwdown - 164: Choosing a Database For Your Project With Kris Zyp
Episode Date: September 11, 2023

Things to consider when choosing a database:
- Speed & latency
- Consistency, ACID compliance
- Scalability
- Language support & developer experience
- Relational vs. non-relational (SQL vs. NoSQL)
- Data types
- Security
- Database environment
- Client vs. server access

Info on Kris & Harper:
Website: harperdb.io
Twitter: @harperdbio, @kriszyp
Github: @HarperDB, @kriszyp

★ Support this podcast on Patreon ★
Transcript
Programming Throwdown, Episode 164: Choosing a Database For Your Project with Kris Zyp.
Take it away, Jason.
Hey everybody. So Patrick and I have been doing
a bunch of solo episodes or duo episodes, non-interview episodes. It's been really fun.
But every now and then, an interview comes across our plate that is a really spectacular
opportunity for us to dive into something that's really important, especially
for folks that are just getting started. You might be in college or high school. You might
not have a lot of years of experience under your belt. And one of the things that I didn't know
until much, much later is the power and the usefulness of databases. I thought databases
were for financial folks or for
people who are really professional, people who wore shirts with buttons. I thought databases
were for them. And so as a high school student, college student, I was doing a lot with data
structures that really should have been in databases. So we're going to talk about how
to choose a database for your project and what
different databases have to offer and how they can make things a lot easier. And with me, I have the
SVP of engineering at HarperDB, Kris Zyp. So Kris, thank you so much for coming on the show.
Thank you. I am delighted to be here.
Cool. So before we get into the topic, why don't you tell us a little bit about yourself?
So how did you get into computing? Did you go to college for it? How did you kind of learn that
trade? And how did you end up following the path to HarperDB?
Sure. Well, it started when I was about 10. I grew up in a great family of school teachers, and we had a computer with a 20 megabyte drive and Turbo Pascal. So I got started with Turbo Pascal when I was 10 and just dove into it and
loved it, loved what I could do. I've always kind of been a do-it-yourselfer. So when I saw
there was this new game called Tetris, I was like, I can do that. I'll write it and play it myself. So that's what I did: I wrote Tetris in Turbo Pascal. It was probably a terrible clone, but whatever, I had fun doing it.
That's amazing. Did you share it with anybody, or was it a solo thing?
Yeah, this was before the open source world, I think.
Yeah. And I wrote a check-recording program for a chiropractor friend of mine when I was, I think, 13 or 14.
So that was great.
I had a lot of experiences as a kid growing up programming.
And so even going into college, I knew I wanted to be involved in computers.
So I did computer engineering at Oregon State and then computer science at University of Utah.
And did interesting work on simulations with electromagnetics in the body at University of Utah.
Oh, wow.
So did you have a background in magnetism and electromagnetism, or was it more like you were the engineer and you found yourself with these scientists?
It was more I was the engineer and I was, you know, handed a C framework for writing these simulations.
And, you know, it was a great opportunity to learn more about physics and medicine and medical research. And so I really enjoyed getting to be a part of that.
Wow, super cool. That is amazing. I come from a very similar background to you, and because we were kind of pre-internet, we didn't really have an opportunity to share a lot of projects. That is something that folks today really take advantage of: there's a whole community out there that loves to see the different hobby projects and other things you're building.
Yeah, for sure.
Yeah, that's amazing.
And then from there, I worked with a friend doing some consulting work with a company called Documentum that did database software.
And then from there, that's when I actually really got involved in more in open source software.
And I started working at a company called SitePen.
They were kind of the main company behind the Dojo Toolkit, if you remember back in the days of the Dojo Toolkit.
You know, I remember the name, but I forgot what it is.
What is the Dojo Toolkit?
It was kind of like around the same time as jQuery.
It actually came out a little bit before jQuery, but it was in a similar vein: a client-side web toolkit that was filling in for all the crazy discrepancies between different browsers at the time, trying to provide a client-side library. So I was a core contributor with Dojo for a while, and really doubled down on open source: got involved with the CommonJS committee, went to W3C and TC39 meetings. I was the author of the first draft of the JSON Schema specification.
Wow.
Okay, wait, let's rewind a little bit, right?
So we just talked about how it's kind of hard to get into, at least back in our day, it's
hard to get into communities.
I definitely didn't have a Turbo Pascal community in my small town or anything like that. And so how did you
break into that? Right. So, you know, W3C, you're writing the JSON schema. Like, how did you build
up kind of that network over time?
I mean, I was just working on open source software and throwing projects out there, and that was at the time when open source was starting to really take off. In some ways it was actually easier at that time; now there's so much saturation of projects out there.
Good point.
I mean, that was around 2010, where if you spent a couple days creating a new web framework or something, people actually paid attention, because there wasn't much else out there. So I think it was relatively easy to break in and start connecting with people, start emailing different people and talking about different ideas. And so, yeah, it was a time when it was easy to get involved with what CommonJS and different groups were doing.
Very cool.
Yeah. And so while you were doing open source, were you working at companies that were very open source friendly? How does that symbiotic relationship work?
Yeah, well, again, I was at a company called SitePen, and we were really focused
on Dojo at the time. So I was doing a lot of work for the Dojo framework. I did a lot to write their event handling. We had a store interface for interacting with different data stores, so that was a little bit of bridging into databases from the client side: finding good common uniform interfaces for interacting with different data storage mediums. I did a lot of work with that at the time, and it also afforded a lot
of opportunities to be involved in the broader JavaScript ecosystem at the time.
Well, that makes sense. Very cool.
And so what... Oh, go ahead.
Yeah, sorry. Another thing I contributed to at the time: I wrote the original proposal for the Promises/A specification. We had put together some of the original ideas of what promises should look like in JavaScript. And the Promises/A proposal is actually what promises basically are these days: when you call await on a value, if it has a then method, that's what defines it as a promise. I'm certainly not claiming to be the originator of promises in JavaScript, but I had an opportunity to be involved in a lot of the original discussions about how that should work in JavaScript.
That is really cool.
I remember, and I don't know if promises weren't around or I just didn't know about them, but writing JavaScript without async await is really painful. It's every function, and then... and then... but any one of them could crash, and if they do, you have to roll back the parts that you did, but that means you have to keep track of it everywhere. It just got totally out of control. And, yeah, learning about that was a lifesaver.
Yeah, and before promises, everything was just callbacks, right?
Right. Node.js applications were just callback, callback, callback, callback, and every step had to have an error handler.
Yeah, it was fun. That was wild. So from SitePen, at some point you ended up at Harper. Why don't you talk about that?
Were you there at the beginning?
Or what's the Harper story like?
Yeah, yeah.
So there was a little bit of a transition in between.
So from SitePen, I went to a company called Dr. Evidence.
Kind of an opportunity to go back into medical research a little bit with programming.
We were doing a lot of work analyzing clinical studies. These were super complex data structures that were super nested. They were in, like, sixth normal form in a SQL database: highly normalized, really well structured. But to load a study took several minutes to do the join required to pull a single study from the database and do analysis on it. And we were trying to create a user interface to do these clinical analyses on the fly, in a few seconds. So that was really where I got involved
more at the lower level. We needed to build caching, so we started looking at key value stores, and we were using LevelDB. Then I became more interested in using LMDB, partly because it had multiprocess support and it looked like it had very good performance characteristics. So I started using the existing LMDB JavaScript package, and I actually ended up taking over that package, maintaining it, and continuing to make advancements on it. That was mainly to facilitate the very low latency interactions we needed, where we could constantly be fetching different parts of these studies, doing analysis, fetching different parts, doing natural language processing retrievals, pulling all this data together. These database interactions needed to happen in microseconds, so we needed low-level capabilities to interact with this data.
So I got more and more invested in optimizing these low-level interactions between JavaScript and LMDB. And it actually became reasonably popular, partly transitively: Parcel and Gatsby, and Kibana, which is used with Elasticsearch, started using this package. So it actually has over half a million downloads on NPM, which isn't wildly popular or anything, but it's enough.
It's a really cool, popular package. And then I'd also developed some serialization/deserialization libraries for MessagePack that are used with it as well, which ultimately meant we could get microsecond-level access to data from a data storage engine, which was really cool.
How did you take it from, you were saying, multiple seconds to, let's say, microseconds or even milliseconds? You're talking about orders and orders of magnitude. Is it really just changing technology? Was their code just broken? How did you get such a dramatic speedup?
Yeah, yeah.
No, this is a fascinating part of it.
You know, first of all, there are a lot of layers involved in just a normal SQL query, right? You have parsing involved, you have a network connection involved, you have serialization and deserialization involved. There's a huge amount of complexity. If you are just trying to retrieve a record by primary key, at the actual storage engine level those things are insanely fast; just doing a B-tree retrieval is an extremely fast operation, and there's a huge amount of overhead on top of it. So if you're just dealing with "I need to fetch this data really quickly, I need to fetch that data really quickly," first of all, doing things in-process with an embedded data storage engine is radically faster, because you don't have any network overhead and you don't have as much serialization/deserialization cost.
So that was the first step. But then, as I was getting more into optimizing this JavaScript, there are also really fascinating, just weird things that you run into. The simple process of having a memory pointer and being able to access that data with a C pointer is the bread and butter of C programming. Turning that over into JavaScript and getting a buffer that points to that same reference
is like an insanely expensive operation.
That allocation is actually really, really expensive
in a JavaScript engine.
And so there was a massive performance gain
that could be had just by realizing
that if we just use the same allocated buffer over and over
and copy data into it as the mechanism for that interface
between the C-level and the JavaScript level,
at least for small records, you get like a 10x performance gain. You go from, you know, 40 microseconds down to four microseconds or two microseconds. And so it was the combination of that and then employing more sophisticated deserialization techniques. It turns out there are techniques you can use for MessagePack deserialization that are actually faster than JSON.parse. So that single fetch can actually be performed, retrieved from the database, and deserialized from MessagePack faster than JSON.parse can be called with a pre-built string. So it's a big pile of different optimizations that came together to achieve this several-microsecond-level access to data.
Yeah, that's amazing. That's so satisfying with something like that, because you're marching towards something as you kind of hit that asymptote. I love that feeling where things get super optimized over time.
Yeah. And so from there, I guess, back to the journey:
To me, this was maybe the open source dream outcome. It's fun to make an open source project. It's cool to see some people use it, fun to see it get moderately popular. And then I basically applied to HarperDB because they had been using LMDB.js. We actually call it lime juice so that we don't have to say six syllables. So, basically: make an open source project, and get hired at a company to work on that open source project and build database software on top of it. That's how I ended up at HarperDB, through this work at the data store level.
Wow, that's amazing.
So you applied to HarperDB as an individual. You weren't part of a company acquisition or anything like that?
No, no, nope.
And yeah, I had had a number of interactions with like Kyle at HarperDB.
And so we knew each other.
I mean, we'd had, you know, a number of interactions on GitHub issues and, you know, I'd solved
some problems for them. And so when I applied,
they were like, oh, it's Kris. That's awesome. Come join us.
That is so cool. I'm hearing more and more of these stories, very similar to yours, and it's extremely inspiring. Two of them I saw recently. One is the person who created FastAPI and SQLModel; I'm going to butcher the name, but I think his handle is tiangolo. He actually got into some kind of incubator with Sequoia Capital, which is a VC, a venture capital fund.
And basically they just said,
look, you have amazing open source projects.
We're just going to pay you to keep working on them.
And it's an amazing relationship.
And then the person who came up with llama.cpp, which is a way to run these ChatGPT-like open source large language models really fast on the CPU.
That person, same thing, they started a GitHub project
and started posting about it on Reddit.
The GitHub project got really popular.
They've just been spending just an insane amount of time on it.
They've brought in Llama 2 and all these things that have just come out in the past couple of weeks; this person's totally on top of it. And same kind of thing: a group of venture capitalists got together, and I think the person's in Serbia, so there's not really even a personal connection, but they just funded this person: this is amazing and we want to be a part of it. And your story as well. I think if you have that penchant to create things, put it on GitHub. So, to turn this into a question: how do you actually build some word of mouth? You create lime juice, right? It's a GitHub project. You push the first version. It has zero stars, zero followers. What do you do next? How do you actually build it up?
Yeah, I feel like you're asking the wrong person. I mean, I talked about it on Twitter some. And the lime juice project actually started as node-lmdb, so there were already some existing users. I took over maintenance of it and then basically forked it with some of the newer ideas I wanted to implement to make things faster. So there was some natural growth there. But yeah, I've tried to talk about it on Twitter, and I think from there, once you actually get a little bit of a foothold, you see some other projects using it, and it kind of just organically grows. But I'm the last person in the world to talk about how to be effective in marketing open source projects.
Well, it goes to show how healthy the system is, right? You can focus on making good content, and through the power of the internet, the collective consciousness of humanity here, we can all start to find those amazing projects.
Yeah, for sure.
I agree. Yeah. I think it's
similar with eternal terminal. I had a buddy call me a few days ago saying, oh, he has some eternal
terminal issue at work. And they were asking who can, who knows anything about this? And he said,
oh, I know the guy who wrote that. And so he called me and was asking me some questions around.
It was pretty esoteric, you know, SSH type stuff.
But same kind of thing.
No real promotion or anything.
And, you know, I've created hundreds of projects.
And that's the only one that's really taken off to that degree.
And you can't, at least I can't, really predict it. But when you do find something that strikes that chord, it's really satisfying.
Yeah, I agree.
And there's actually been projects I've had where it's been frustrating that they aren't seen by more people.
And then I've had projects where it's like, please stop using this, too many people are using it.
I had developed one of the early JSON Schema implementations and didn't do a good job of maintaining it, but it has a ton of NPM downloads. I mean, it'll continue to exist and I'll keep it out there, but it's not something I continue to work on.
So, yeah, that's really difficult, because you only have so many hours in a day. But it is hard to see the issues pile up. Last week I actually went through and addressed so many issues in Eternal Terminal, but they're piling up way higher than I can really address.
Maybe one day. I know so many folks talk about building a marketplace economy on top of GitHub. So many companies have tried this. It's almost starting to become a tar pit idea. Have you heard of this term? A tar pit idea is something that's so appealing, it feels like a warm bath, but you get in and you're stuck and your company dies.
Yeah.
So a personal CRM is a tar pit idea. Facebook for X, Uber for X, these are all tar pit ideas. And I feel like monetizing GitHub is starting to become a tar pit idea. But I do think, you know, Eternal Terminal is a great example: there are so many people using it. And your JSON Schema library is another, even better example. Somebody should be able to make a modest living making that library better, and we really just don't have the marketplace for it. But there are just so many moving parts, it's hard to really get that right.
Yep, you're absolutely right. And I agree, it's one of those things where I would love it if that could be reality; I don't know how to make it reality.
Yeah. So many smart people have tried. I'm a little afraid to... it's like saying Voldemort or something, right?
Yeah.
Well, that's amazing. So how long have you been at Harper?
I've actually been there for just a little over a year, about a year and a half now.
Cool. Great.
And we'll get more into the company after we talk about the main topic, but I'm just curious: is it a remote thing? Are you all together, or is it distributed?
Yes, it's distributed. Our headquarters are in Denver, and there are a number of people out there. I live in Salt Lake City, and most of the engineers are working remotely. It's nice to actually be in the same time zone, though. I've been working remotely for many years, I think this is my 15th year, and the previous companies were in other time zones.
Oh wow.
Yeah. So this has been just a normal transition for me. COVID didn't affect work at all; I just kept working remotely.
Yeah, very cool.
Great.
We'll definitely put a bookmark in that.
I definitely want to talk more about Harper and that database. But we'll kind of step out here and talk about just choosing the right database.
And maybe before we even do that, we should talk a little bit about what is a database, you know, kind of in practical terms, like why would someone use a database versus using a B-tree library or some, you know, some JavaScript library for storing data?
You know, when should people make that decision?
Yeah, that's a good question. There actually are probably times people can use a B-tree library directly, but there certainly is a tremendous amount of functionality built on top of the core B-tree libraries in the databases you typically use day to day. Databases handle the work of maintaining data in a structured format, so that instead of just having raw binary data, it's in the form of actual fields or properties or columns. They handle indexing, so you can search for records by different values and perform that efficiently. They handle things like transactions, ensuring that multiple operations can be handled atomically, with isolation and consistency, and that data is stored on the disk drives in a durable, reliable way. Databases can get into issues of management and observability, and then provide higher-level queries. Obviously, many of us use databases through SQL queries, which gives us a much easier way of thinking about querying data than having to think about interacting with individual indices and B-trees and how those are connected and related. So that's broadly why we use databases: they give us the ability to interact with complex data using relatively simple mechanisms for querying and updating that data.
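To make that concrete, here's a minimal sketch using Python's built-in sqlite3 module. The table and field names are invented for illustration, but it shows the structured fields, indexing, transactions, and declarative querying Kris is describing:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Structured format: named fields instead of raw binary data
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")

# An index so lookups by email can be performed efficiently
conn.execute("CREATE INDEX idx_users_email ON users (email)")

# A transaction: both inserts commit atomically, or neither does
with conn:
    conn.execute("INSERT INTO users (name, email) VALUES (?, ?)", ("Ada", "ada@example.com"))
    conn.execute("INSERT INTO users (name, email) VALUES (?, ?)", ("Grace", "grace@example.com"))

# A declarative query instead of hand-walking indices and B-trees
print(conn.execute("SELECT id, name FROM users WHERE email = ?", ("ada@example.com",)).fetchone())
```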
I think it's interesting, too, like you mentioned, Kris: I think maybe in some cases you don't need a database.
I think we were having this a little bit debate maybe in the pre-show of what makes a database a database.
I feel like it's expanded a lot.
So, you know, something from like a key value store, you know, can still be a database.
And then you were mentioning a lot of things, which I think hits upon things that folks miss, which is how many users are you talking about? Is there contention for data
or not contention? So in other words, does your application running in multiple places
need to make updates to the same data or not is a big one. And then for internal tooling,
it may be that each person is kind of by construction doing something slightly different. And so really, it's more of a caching transmission mechanism thing,
in which case it's different. But then you mentioned schemas as well. I think that's one
that we were referencing JSON earlier. But people maybe not with JSON schema, but with just JSON
plain, we'll just insert a new field, right? And then stuff will break. There's no planned way of dealing with it, and everyone says, well, you don't need that stuff, I'll just figure it out. And it's like, well, yeah, you're right, that is true, but at what cost? And indexes are another big one you mentioned that falls into the same bucket. If you just put opaque binary data in a blob somewhere, could you write something to extract and index the fields you want? Yes. But is the code you write going to do a better job than something that's already battle-tested and robust? You're just going to reinvent a crappier version of existing indexers. So there's not a hard line, but I think early on, sitting down and really thinking about what you're optimizing for and targeting makes a big difference in what you select. And then also, like you mentioned: is it going to be a SQL interface or not, and what are the implications of saying, hey, I'm just going to shove random JSON objects, or, I'll stop beating on JSON, random JPEG pictures, into these columns? Well, wait a minute, hang on. Like, SQL is not going to buy you much
if all your data columns are JPEGs.
I'm not, I mean, maybe it does.
Maybe I'm not an expert there.
It feels like it probably doesn't buy you as much, right?
Could you do it?
Sure, but it's not like a good choice.
And so I think you end up with classically extremists
on both ends, you know, no database
or everything in the database.
And in reality, it's probably a little bit more fluid.
Yeah, for sure. You're right. Those are some great examples
of where, you know, this is the reason why like Redis and Memcached and different things like
that have really grown in popularity is because they are fulfilling a role of this, you know,
high speed access to data that doesn't need the extra overhead
of a full SQL engine.
So that does illustrate
some of the different needs of databases.
And one of the challenges is,
I think maybe one of the primary drivers
for what database you're going to use is,
what is the hardest thing the database is going to do
in terms of querying?
And it's hard to figure that out ahead of time, right?
Like, what is the most difficult query going to be?
Is it just going to be these like, by key lookups?
Or is it going to be, you know, a three level join or something like that?
So yeah, kind of thinking ahead about that.
And then the other aspect is: what do the data structures look like? Traditional databases have had tables with a relatively flat
structure of columns that each can have a field in it.
And part of the driver for like NoSQL databases, document driven databases is the idea that,
you know, when we are working with data in typical programming languages, like it can
be very convenient to think about data as nested structures, right? I have an object, and inside that object is an array, and inside of that array is a set of objects. That's really convenient when I'm working in a programming language. When I translate that to a relational model, now I'm starting to get into junction tables and joins and things like that, and it's like, hey, I thought this was supposed to be pretty easy, it felt really easy in my programming language, and now it's getting more complicated. So certainly data structures influence that, as well as just: how am I going to be accessing that data?
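As a rough sketch of the mismatch Kris describes (names invented for illustration), compare the nested structure you'd use in a program with the junction-table shape the relational model pushes you toward:

```python
import sqlite3

# Natural in a programming language: an object containing an array of objects
order = {
    "id": 1,
    "customer": "Ada",
    "items": [{"sku": "A100", "qty": 2}, {"sku": "B200", "qty": 1}],
}

# The same data in a relational model needs a second table and a join
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE order_items (order_id INTEGER REFERENCES orders(id), sku TEXT, qty INTEGER);
    INSERT INTO orders VALUES (1, 'Ada');
    INSERT INTO order_items VALUES (1, 'A100', 2), (1, 'B200', 1);
""")
rows = conn.execute("""
    SELECT o.customer, i.sku, i.qty
    FROM orders o JOIN order_items i ON i.order_id = o.id
    WHERE o.id = 1
""").fetchall()
print(rows)  # [('Ada', 'A100', 2), ('Ada', 'B200', 1)]
```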
Oh, that's interesting. I hadn't thought about this actually.
That explains the rise of the ORM as well, right? The object-relational mapping, the sort of middleware. A Ruby on Rails person would just go, no, this is no problem, I got you, and they would attack it by saying, hey, call it what you want, middleware, ORM, I don't even know all the terms, but basically: how do I take a structured in-memory view and push it into the correct representation in a database, and have that be, I'll call it a translator, back and forth between the two sides of the system, or even do joins or queries on the back end appropriately, so that you're trying to get the best of both worlds by having a description in the middle?
Right, right, for sure. And one of the realizations people have is: okay, if I have an array of objects inside of my object, and it only belongs inside of that object, the traditional relational model for that, where you have the object and then another table that's joined, and you may even have junction tables in between, is actually pretty complicated if all I want is this single document, which could very well just be a single lookup in a B-tree. So part of this is: how is that data structured in terms of ownership? Is this hierarchy completely contained within the objects themselves? Or are these arrays references to other objects that are then shared? In that sense, the relational model starts making more sense: you have these relationships between these objects and other objects, and if I can normalize them, there are certainly benefits to normalization, in terms of one source of truth for where a record goes, and then the joins start making more sense. So there's a lot of questions just in terms of what those data structures look like and how those map to a database appropriately.
Yeah, that makes sense.
I think that you touched on something really important
where even without a database,
you know, like, kind of circular dependencies
and circular references become really difficult
to manage. Like, you know, even imagine like an email app. So you have email folders.
Imagine you're trying to write this without a database, you know, and then you have a bunch
of email objects. And so the folder has a list of objects. Each of the objects needs to know
what folder it's in. And so if you don't do this right, you end up with this kind of pointer nightmare
where if you want to move an email
from one folder to the other,
first you have to delete it from the folder list.
Then you have to also tell it
that it's now part of another folder.
So you end up having to change like three places.
And if your app crashes or if something happens,
I have to roll that back and it just
becomes really difficult. And, you know, I remember SQL normalization was really popular in the late 90s and early 2000s, where people said, oh, you just have to follow these rules, and if you follow these rules, then you will always have a perfectly normalized world. And so we did follow these rules, and as you said, we ended up with so many different joins. It's like, oh, a person could have at most two phone numbers. But instead of having a phone-number-one and a phone-number-two column, which would be super easy, now we're going to join to this table, as you said, a junction table, which joins to another table of user IDs and phone numbers. And then you end up having to write this really complicated query to pull an entire object. So there are a lot of deceivingly complicated design decisions you have to make there.
Yeah, yeah. And you're absolutely right. We kind of grew up with normalization, in Codd we trust, and the normal forms he taught. But the last decade or two has really been characterized more by trying to figure out where the appropriate place is to denormalize that data.
And that's not necessarily mutually exclusive with normalization. You know, there are a lot of systems out there that do have a
source of truth normalization, but caching layers that do some of this denormalization where,
you know, you have a derived version of that record where the phone numbers are in line,
and you can very, very quickly and easily access
that. And so I think a lot of the evolution of databases has been learning what the appropriate ways are to do this denormalization. It can go too far the other way too, right? You can have so much data denormalized that it becomes inefficient to store. And so you start looking at ways where maybe a simple key value store that's just doing this massive denormalization is a little too simplified. You want to do some normalization; you want to have some relationships in there that are normalized to other parts. And so I think that hybrid is really maybe the direction we're starting to learn: getting in between the two pendulum swings and having efficient data storage.
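To make the phone-number example concrete, here's a sketch of both shapes (schema invented for illustration): fully normalized with a join, and denormalized with the numbers inlined, which makes reads a single lookup but gives up the single source of truth:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Normalized: phone numbers live in their own table, one source of truth
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE phone_numbers (user_id INTEGER REFERENCES users(id), number TEXT);
    INSERT INTO users VALUES (1, 'Ada');
    INSERT INTO phone_numbers VALUES (1, '555-1234'), (1, '555-5678');
""")

# Pulling a "whole user" now requires a join
print(conn.execute("""
    SELECT u.name, p.number
    FROM users u LEFT JOIN phone_numbers p ON p.user_id = u.id
""").fetchall())

# Denormalized: inline the numbers, e.g. as a JSON array column.
# Reads become one lookup, but updates now have two shapes to keep in sync.
conn.execute("CREATE TABLE users_denorm (id INTEGER PRIMARY KEY, name TEXT, phones TEXT)")
conn.execute("INSERT INTO users_denorm VALUES (1, 'Ada', '[\"555-1234\", \"555-5678\"]')")
```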
Yeah, totally. The other thing is that your data structures, whether in a database or something you serialize in C++, you want to change as your product changes. And change becomes really, really difficult: changing, but keeping backwards compatibility, handling migrations. If you saved your data a year ago and you find, oh, I need some of those records, I need to retrieve something, now I need to mutate all of this year-old data so that it can work with my modern software. These things are incredibly difficult to do yourself. And so the database tooling that I'm most familiar with, being a Python guy,
is SQLAlchemy with Alembic. Alembic is a tool where you try your best not to change the database in the database. You use Alembic to say: create a column, change this type to an int, create a new table, create a junction table. And as long as you do everything through Alembic, it keeps track of all these changes, and you now have this ledger. So if I have year-old data, I know exactly what my database schema was like a year ago, and I can tell Alembic: take this database and bring it up to modern standards, and it will execute all of these steps. So under the hood there's a ton of complexity.
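For listeners who haven't seen one, here's roughly what a hand-filled Alembic migration script looks like; the revision IDs and schema names are made up for illustration:

```python
# versions/a1b2c3d4e5f6_add_phone_and_groups.py
from alembic import op
import sqlalchemy as sa

revision = "a1b2c3d4e5f6"       # this entry in the ledger
down_revision = "9f8e7d6c5b4a"  # the previous entry it builds on

def upgrade():
    # Schema changes are expressed as operations, not ad hoc SQL
    op.add_column("users", sa.Column("phone_number", sa.String(), nullable=True))
    op.create_table(
        "user_groups",
        sa.Column("user_id", sa.Integer, sa.ForeignKey("users.id")),
        sa.Column("group_id", sa.Integer, sa.ForeignKey("groups.id")),
    )

def downgrade():
    # Each step is reversible, which is what makes the ledger replayable
    op.drop_table("user_groups")
    op.drop_column("users", "phone_number")
```

Running `alembic upgrade head` then replays whatever steps a given database is missing, which is the "bring year-old data up to modern standards" workflow Jason describes.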
I would say, maybe just to tie it off: I think that databases will force you to be more disciplined. They'll force you to do things you can't get away with if you're doing all sorts of pointer tricks and things like that. But from that discipline, you'll end up with a better product that you can rely on.
Yeah, yeah. And I think where that is maybe most visible, or at least a good
example of it, is when you're dealing with transactions. Transactions are one of those things where you never feel like you need them right from the get-go, right? You're like, oh, I want to update this, and then I want to update that. Why should I have to think about transactions? But part of the reason we do is that once you ask, well, what do we have to do if one of these is updated and this other thing isn't updated, you start dealing with tons of edge cases that are incredibly difficult to think through: these in-between states and the race conditions involved. So I think you're absolutely right: when we're forced to deal with data through transactions, even though sometimes that's a little annoying to start with, it deals with this whole class of really painful problems and makes them a lot more tackleable.
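A minimal sketch of the class of problem transactions solve, again with sqlite3 and invented names; the failure between the two updates never becomes visible:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # sqlite3 wraps this block in a transaction
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        # Crash before the matching credit to bob ever runs
        raise RuntimeError("crash mid-transfer")
except RuntimeError:
    pass  # the transaction was rolled back for us

# Alice still has 100; the in-between state never existed for readers
print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('alice', 100), ('bob', 0)]
```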
Yeah, totally. Cool. So let's see: people are super excited now, they want to make their new game engine use a database. How should they go about picking a good database?
I have a list of topics here, and we'll kind of walk through them.
The first one I have on my list is speed and latency.
So different databases kind of make different trade-offs there.
Why would anyone want a slower database?
What are the things that those databases are doing with that time?
And what are the reasons for that?
Yeah, I don't think any of your listeners are out here looking for,
what is the slowest database I can find?
Maybe I can get advice on how to find that.
So obviously that is a trade-off.
There are reasons why people have ended up with slower databases.
And there's a lot of applications that simply cannot sacrifice when it comes to speed.
You know, oftentimes when you're dealing with things that are directly driving user interfaces, or even more so maybe gaming, speed is something you can't sacrifice. Whereas what will often drive slower speeds is dealing with something that has higher levels of data consistency requirements. When you get into financial applications, there are pretty strict requirements about things not only being transactional, but making sure that you are fully coordinating any systems involved, that you have all the correct checks in place, and that you have the ability to roll back if anything doesn't look correct. And that is a very different scenario
the positions of the players in a game, for example,
or something like that,
where the speed requires very, very low latency.
Certainly, there are situations where things are slow
just because it does involve complex queries.
And oftentimes that's well recognized by the people making the query: hey, I'm searching through a huge database with a very complex set of conditions, and a lot of times there's a recognition on both sides that this is going to be difficult.
It's going to take a while.
So there's certainly those different aspects of it, I think.
Yeah, that makes sense.
Totally makes sense.
Yeah, I mean, there's the saying that premature optimization is the root of all evil; I think it's Donald Knuth who said that. But again, if you're using a database, not some kind of homebrew thing but a common database, it will be relatively easy to migrate from one database to another. And so you can always start with whatever is the most convenient. If you find that all of a sudden some government agency wants to use my product and they're demanding that it's consistent, then you can switch to another database. Or if the latency is a real problem and you're willing to be eventually consistent, then you can go the other way as well.
Right. Yeah, there definitely are opportunities for that. And, like anything in programming, you want to get it right the first time, because there is work involved in switching, but we do it all the time.
Yep. Yeah, totally. Okay, the next one I have is scalability.
One thing that comes to mind here is SQLite. I almost always start every project with SQLite. And maybe this is, again, because I'm a Python guy and I'm using SQLAlchemy, so it's very simple to switch from SQLite to something else. So I'll always start projects in SQLite, test out the project, test out the idea. Just for people who don't know, SQLite is a full SQL database. You can write queries against it, select statements, updates. You can create tables. You can do all of that. But the database is literally just one file on your computer, a .sqlite file. Now, that file could be enormous if you're putting a lot of data in it, but it's really elegant in the sense that you don't have to worry about networking or any of that. The downside is that only one process can write to the file at a time, ever. So you're not going to build Facebook on SQLite; it's just not going to happen. And so invariably you have to move to something else. But the reason there are so many different databases on that scalability spectrum is that you do get speed and latency and a really smooth developer experience if you're willing to have those really constrained environments, like running everything off of a file. So SQLite is actually extraordinarily powerful, even if it's not very scalable.
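A sketch of why that starting point is cheap to abandon later: with something like SQLAlchemy, the database is mostly just a connection URL, so the SQLite-to-server switch can be a one-line change (URLs here are illustrative):

```python
from sqlalchemy import create_engine, text

# Start the project on SQLite: the whole database is one local file
engine = create_engine("sqlite:///app.db")

# Later, moving to a server database is (mostly) just a new URL:
# engine = create_engine("postgresql://user:password@dbhost/app")

with engine.connect() as conn:
    conn.execute(text("CREATE TABLE IF NOT EXISTS ideas (id INTEGER PRIMARY KEY, note TEXT)"))
    conn.execute(text("INSERT INTO ideas (note) VALUES (:note)"), {"note": "test the idea first"})
    conn.commit()
    print(conn.execute(text("SELECT * FROM ideas")).fetchall())
```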
Yeah, and that actually is a great example of an embedded database. And like you're saying, there are actually big performance benefits to being able to directly access that data in-process; you eliminate a lot of extra hops.
But yeah, generally as you're scaling, you want to achieve a state where you can be running on multiple processes, multiple threads, even multiple servers. And a big part of scalability is: what are the ways we can vertically scale, to make sure we're leveraging the highly multi-core machines of modern servers? Are we going to be able to scale to larger and larger storage? And this is always a classic issue with databases: even if you aren't indexing data and you're just doing full table scans, everything is actually really fast at first. All queries are really fast on small tables. The real challenge with any database work is not how do I query the data, but how do I query the data in a way that's guaranteed to stay fast as the data gets bigger. That's always the challenge, I think: making sure I can do that. And it can be deceptive when you start building an application, because again, everything is fast when you get going. But you always have to be thinking: is this query going to be fast once the database is several gigabytes or several terabytes, and is it going to maintain that speed? So there's that aspect of it. And, like you were saying, the other scalability is horizontal scaling: can we run this database across multiple machines? What if we get too big for one machine? Then you start getting into issues of how the databases cluster, replicate, or shard with each other. Those definitely get into more complicated aspects of scaling a database, but those are all the different concerns related to it.
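A quick way to see the "fast now, slow later" trap is to ask the database for its query plan. A sketch with sqlite3 (names invented): the same query is a full table scan without an index and an index search with one:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, ts REAL)")

# Without an index: a full table scan. Fine at 1,000 rows, painful at a terabyte.
print(conn.execute("EXPLAIN QUERY PLAN SELECT * FROM events WHERE user_id = 42").fetchall())
# -> ... SCAN events

# With an index, the same query stays fast as the table grows
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
print(conn.execute("EXPLAIN QUERY PLAN SELECT * FROM events WHERE user_id = 42").fetchall())
# -> ... SEARCH events USING INDEX idx_events_user (user_id=?)
```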
Yeah, that makes sense.
I was always kind of curious about this, and maybe you can help elucidate it for me.
I mean, there's sort of single node databases
like SQLite, for example,
Berkeley DB is another example.
And then there's, you know, multi-node,
which would be everything from like Postgres
and MySQL to HBase to Dynamo
to all of these other ones.
And then it seemed like people were saying things like,
MySQL doesn't scale as well.
I remember when NoSQL became a big thing,
the thing that they were pushing was that it was just way more scalable,
that you could scale something like Cassandra or HBase
or one of these ones to an extraordinary degree
that you couldn't scale Postgres to.
But I never really understood why or if that was true or just marketing.
So once you go multi-node, is there a spectrum there
or are they all pretty much the same?
Yeah, I mean, I think there's definitely a spectrum there. And I think what you're hitting on is that a lot of the guarantees you would typically get in a relational database are actually quite difficult to maintain in a distributed network. It's hard to have just a partial set of a table and do correct secondary indexing on it; the whole table has to be there to get a coherent secondary index. When you start dealing with things like foreign key constraints and cascading deletes, those are really difficult to maintain consistently across a distributed network. And so when you take the existing consistency guarantees of a traditional database and just try to scale them to a distributed network, it's fairly complicated.
So you eventually end up with situations where you are trying to decide, OK, what are the guarantees that we really need. And one of the advantages that NoSQL databases had in terms of distribution
was kind of starting without those constraints, kind of starting with this blank slate of like,
okay, we are going to think about what is the level of guarantees that we can provide,
assuming that we are going to be in a distributed network and not providing any guarantees that we
can't back up.
And so it was kind of taking that different approach.
And there's certainly ways that, you know, I mean, MySQL and a lot of these databases
certainly have done a valiant job of trying to, you know, do better jobs of scaling.
And sometimes that can involve things that are a little more complex, like sharding, which involves a fair degree of work in understanding, well, how can this data be distributed? So there are certainly approaches. But carrying those guarantees, the ACID expectations that worked on a single node, and trying to guarantee those same things across a distributed network, is a difficult leap to make.
Got it. Yeah, that makes sense. I think it's PlanetScale, I want to say, one of them might be Neon, but I think PlanetScale actually bans foreign keys.
And so you have to do the cascading deletes and all of that yourself.
But what they get from that is probably much better scalability.
Yeah, I think I remember listening to your podcast on this.
And I think when he said that, I was like, yes, that is the thing that you do not want to attempt to do across the distributed network.
Yeah, just to dive in a little bit on that for the audience.
So imagine you have a user account,
the user has phone numbers, they have credit cards,
they have transactions, and then they say,
I want you to delete my account,
and I want it actually deleted,
not like Google or Facebook deleted
where they just keep your data forever,
but actually deleted.
And so you delete that account, and then you have to also delete all those other things that are derived from that account. That's where the cascading metaphor comes from: it cascades into the credit card table and the phone number table and all of that. Then, to do that quickly, you need to somehow keep, and I have no idea how this works, I'm very curious, but you somehow keep a dependency graph of a person to all of their dependent data, so that you have it ready on hand. And that sounds incredibly difficult to do across multiple machines.
Yeah, and in particular, foreign key constraints and
cascading deletes have very significant locking requirements: ensuring that the record referencing this one still exists while we're doing the delete, and that nothing else has come into existence that is also potentially using it. So it simply requires a lot of global coordination to ensure that the constraints foreign keys or cascading deletes provide are actually maintained across the network. In the NoSQL world, where you aren't necessarily guaranteeing these types of relational constraints, things get a lot simpler. You deal with things potentially after the fact: if there's a record referencing a record that no longer exists, well, we either remove that reference on the fly or tolerate it. So there are a lot of things that can be done after the fact, rather than relying on trying to maintain this consistency in real time.
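On a single node, here's what the declared version of that guarantee looks like, sketched in sqlite3 with invented names (note that SQLite leaves foreign key enforcement off unless you enable it):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires opting in to FK enforcement

conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE credit_cards (
        id INTEGER PRIMARY KEY,
        account_id INTEGER REFERENCES accounts(id) ON DELETE CASCADE,
        last4 TEXT
    );
    INSERT INTO accounts VALUES (1, 'ada@example.com');
    INSERT INTO credit_cards VALUES (10, 1, '4242');
""")

# Deleting the account cascades into the dependent table automatically,
# because one engine holds all the data and can lock both tables at once.
# Distributed stores often drop this and clean up dangling references later.
conn.execute("DELETE FROM accounts WHERE id = 1")
print(conn.execute("SELECT COUNT(*) FROM credit_cards").fetchone())  # (0,)
```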
Yeah, that makes sense. I mean, you touched on some of the extremely difficult edge cases. Imagine you're deleting someone's account, but right after they issue the command, they go to another tab and say, I want to delete my credit card, just to make sure it's really gone. And so now you might get it even in the wrong order, where a request comes in to delete a credit card while you're in the middle of deleting that credit card, so you get double deletes. Or, even worse, maybe on their phone a family member on the same account is adding a credit card, so you're trying to wipe the account and a credit card gets added right in the middle. There are so many things that can happen. And if you have these foreign key constraints, these cascading deletes,
you're making a very, very hard guarantee. And so if you're not allowing yourself even for a moment to be inconsistent, then the only way you can accomplish that is by hitting the pause button.
Yeah, yeah. Which is getting back to that speed thing. So is NoSQL basically everything that isn't, logically, a table, everything that wouldn't just look like an Excel spreadsheet? What is a good way of explaining SQL versus NoSQL to folks out there?
Yeah, I mean, I think that is a good starting point. And it is kind of a complicated thing, because there has been so much wrapped up into the notion of SQL and traditional databases. I think the primary conceptual idea behind NoSQL has been this idea of a document-driven database, where the document can be a data structure with any structure I want, and I can freely map that to the data structures in my application. It may look more similar to them, and I don't have to have as much ORM magic doing the translation. But it's also about how things are queried: NoSQL is obviously a contrast with SQL, which is a query language. So oftentimes NoSQL gets wrapped up with: okay, we're going to have different querying mechanisms for accessing that data.
Maybe it's also wrapped up into the whole relational versus non-relational.
And what does that even mean? Part of what's funny about what relational means, at least in the SQL world, is that, like we talked about, even when you do a query and there's a known foreign key, you actually have to tell the SQL engine every single time how those two tables are supposed to be joined. You have to say: join on this field to this other field. I can't just say, hey, give me the data from table two that's associated with table one. That's not part of SQL, right? You have to tell it every time what that relationship is, which is kind of a funny thing.
That's such a good point. We've kind of associated SQL with relational, even though SQL, the query language itself, isn't terribly relational. We do all that with ORMs, right? ORMs know these relationships; they're the ones that put together these joins.
Right. So, historically, there are all of these things that have been associated with traditional databases. And NoSQL was this effort to rethink some of those things: rethink the relational aspect, rethink the querying aspect, rethink the structural aspect, how we store that data. And so it's kind of given
us a way to re-approach that stuff, I think. But it does encompass a lot. And the reality is, one of the things that I've learned is that you can say a database is not relational, but there's a lot of relational data out there. Even if you aren't doing SQL, and even if you don't have foreign key constraints, I bet your data has some relational properties to it.
Yep, yep, yeah, exactly.
I mean, you almost always want to reduce on part of the data.
So you say something like,
what is the average or the median number of phone numbers
of all the users in my account?
You know, is it zero? Is it one? It makes a big difference to my product.
And so as soon as you want to start reducing on parts of these objects, then you find yourself like really wishing you had SQL again.
Yeah. Once you start normalizing more, it starts becoming more convenient.
Yeah, that makes sense.
Something you touched on that we should
we should explain more detail our orms so yeah i talked about sql alchemy as an example i'm sure
there's a ton of other ones but you know um you can write raw sql and you know you or you know
really for any database you can write raw queries, and you will get back data.
And you can definitely work that way.
And there's times where you'll want to do that for certain queries.
There's advantages to that, just like there are advantages to writing some of your code in C, even if most of it is in Python.
But by and large, you'll use an ORM for a lot of this work. And the way that works is you can actually have the ORM generate the database.
I don't really advocate for that because I feel like you can't change languages then.
You're kind of like stuck, right?
If your Python ORM generates a database, then you switch to JavaScript and your JavaScript ORM also wants to generate the database.
Now what do you do?
So either you have some leader and everyone else follows, or you use something else, like Alembic for example, but it could be anything, to generate the database. And then SQLAlchemy and a lot of these ORMs can actually look at the database and map it in real time to your data types. So just to give a very simple example, you might have a class called User. The user has an ID, a first name, a last name, a phone number. These are all just strings in your class. And with some annotations, you can now take that class and turn it into a sort of SQLAlchemy first-class citizen. And so what SQLAlchemy will do is look for a table called user, and then if you do something like, give me the User where the ID is three, SQLAlchemy will do sort of the magic to say, okay, fetch this row from this table, turn it into a Python class, and then give it to the developer. And so it generates a lot of really nice features for you.
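To make that concrete, here's a minimal sketch of that pattern using SQLAlchemy's declarative mapping. The table and column names are made up for illustration:

```python
# Minimal sketch of the mapping described above. The User class becomes a
# first-class SQLAlchemy citizen; table/column names are hypothetical.
from sqlalchemy import String, create_engine
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column

class Base(DeclarativeBase):
    pass

class User(Base):
    __tablename__ = "user"  # SQLAlchemy maps this class to the "user" table
    id: Mapped[int] = mapped_column(primary_key=True)
    first_name: Mapped[str] = mapped_column(String(50))
    last_name: Mapped[str] = mapped_column(String(50))
    phone_number: Mapped[str] = mapped_column(String(20))

engine = create_engine("sqlite:///example.db")  # any supported backend works
Base.metadata.create_all(engine)

with Session(engine) as session:
    # "Give me the User where the ID is three": the ORM fetches the row
    # and hands back a plain Python object.
    user = session.get(User, 3)
```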
Yeah, yep, that's exactly right.
Yeah, it's really fun. In the beginning, I had so much trouble with ORMs. It's one of these things that's not very intuitive, especially if you have natively nested structures; you have to kind of pull those out. But I would encourage your listeners to take the time to learn something like that. Once I learned it, I was much, much more productive.
Yeah. And maybe this is a segue into some of the challenges with ORMs.
Yeah. ORMs are great, but one of the challenges that we often face with ORMs is the classic select N+1 problem. That problem is that oftentimes you are getting data, and then there's all this related data. And if you do a query and then start accessing this data, maybe each time you access that data, it has to do another query to your database, right? This is a common problem: it can actually be challenging to get that initial data with the appropriate SQL query, the one that's going to fetch all the data that you'll need later, when you're accessing the data from the properties, right? And that can be anywhere on the spectrum, from a pretty easy change to how you do the query, to maybe it's just downright impossible to know ahead of time, based upon how you're going to process this data, what you're going to end up accessing. And this isn't necessarily a problem that the ORM causes; it's just making it easier to access the data. You're still forced to deal with these issues of, what is the appropriate way to query the data so that I'm reducing the amount of back and forth?
Yeah, let me see if I...
Oh, go ahead.
No, go ahead.
Oh, I was going to see if I could understand the problem, because I just want to see if I can wrap my head around this. So the idea is, let's say I just want to show someone's first name, last name, and their phone number, but the user class has 30 fields in it. If I use an ORM, I'll get all 30 fields, and 27 of those are wasted. Is that the problem?
Well, that can be one of the problems. But the other problem is,
let's say that you're getting this list of users and they each have a relationship with their
employer record, right? And so you're doing a join on it. And there's different ways this can work.
It can potentially do that join ahead of time and pull in all that data ahead of time. Or maybe not; maybe you just have these IDs that reference the employer table. And then as you iterate through the users, oftentimes the ORM will act reactively: as you access that employer field, it will say, oh, I haven't fetched that yet, I will go do a query to fetch that employer record. And so as I go through 30 user records, depending upon the way that you initially fetched this data, every time you access that employer field, you're doing a separate fetch to access this related table. So that's kind of the classic select N+1 problem with ORMs.
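In SQLAlchemy terms, the trap looks roughly like this. This is a hedged sketch with a hypothetical Employer relationship; by default the relationship is lazy-loaded, so the loop issues one extra query per user:

```python
# Sketch of the select N+1 trap: one query for the users, then one more
# query per user the first time its employer is touched. Models are
# hypothetical, not from any particular codebase.
from sqlalchemy import ForeignKey, String, create_engine, select
from sqlalchemy.orm import (DeclarativeBase, Mapped, Session,
                            mapped_column, relationship)

class Base(DeclarativeBase):
    pass

class Employer(Base):
    __tablename__ = "employer"
    id: Mapped[int] = mapped_column(primary_key=True)
    name: Mapped[str] = mapped_column(String(100))

class User(Base):
    __tablename__ = "user"
    id: Mapped[int] = mapped_column(primary_key=True)
    employer_id: Mapped[int] = mapped_column(ForeignKey("employer.id"))
    employer: Mapped["Employer"] = relationship()  # lazy-loaded by default

engine = create_engine("sqlite:///example.db")
Base.metadata.create_all(engine)

with Session(engine) as session:
    users = session.scalars(select(User)).all()  # query #1
    for u in users:
        print(u.employer.name)  # queries #2..N+1, one per user
```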
Oh, now I totally get it. I totally, totally get it. Yeah, that is really painful, right? So if you're writing the SQL yourself, you would know to just join the user table to the employer table and fetch all of it at once, in one query.
Exactly. And it's hard for ORMs, because you actually kind of have to look into the future a little bit, right? You have to know ahead of time, well, what data is going to be accessed from this, right? So it's challenging.
Oh man, that is wild. Yeah, I mean, for an ORM to handle this explicitly, in my user.get, you'd have to provide a list of all the related classes that I would want and not want.
Right, right, right. Exactly. Yeah.
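For what it's worth, most ORMs do give you a way to declare up front what you'll access. In SQLAlchemy that's an eager-loading option; a sketch, reusing the hypothetical models and engine from the previous snippet:

```python
# Eager loading: tell the ORM ahead of time which relationship you'll
# touch, and it folds the employer fetch into the initial query.
# (User, Employer, and engine as defined in the previous sketch.)
from sqlalchemy import select
from sqlalchemy.orm import Session, joinedload

with Session(engine) as session:
    users = session.scalars(
        select(User).options(joinedload(User.employer))  # one JOINed query
    ).all()
    for u in users:
        print(u.employer.name)  # already loaded: no extra queries
```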
And then this is maybe kind of a segue into thinking about this problem from another approach.
And that is kind of going back to the idea of embedded databases. Like, well, what if we made it so that this code that's iterating through these users
is actually close enough to the database that it can efficiently retrieve these employer
records on the fly, right?
Part of the reason why the select N+1 problem is so crippling is because we know that there's a lot of overhead to issuing each query. But if this code is executing close enough to the data, well, the internals of a SQL engine are basically doing the same thing. I mean, there's different approaches to joins, but oftentimes when a join is executed, it's going through a table, getting a foreign key, and doing a fetch from another table. It's kind of as simple as that, unless you're doing hash joins or something like that. But oftentimes it's relatively straightforward: just iteratively getting other records. So if your code can access the data at a speed relatively similar to the way the internal engine works, then you're kind of back in the realm where the code doesn't need to think ahead. And again, maybe it's not even possible: maybe as you're iterating through the users, there's actually really complex logic, involving the permissions of the user, or which employer is related to another employer, that dictates whether or not that employer record is actually retrieved. Those things may not even be expressible in SQL queries, right? And so this idea of getting code that works close to the data kind of opens up new opportunities for doing these more complex levels of data retrieval on the fly, and taking advantage of this low-latency access to data.
Got it. Yeah, that makes sense.
Yeah, that's a good transition to the last area here,
which is the database environment.
Just to give an example,
I built kind of a clone of Google Photos, just for my family. So I had a little Android app, and I have a database. I store all the photos on S3, which is Amazon's storage service, and my database kind of keeps track of the photos. But I ran into this issue where I had, and I can't remember if I was using Postgres or MySQL, but I had some SQL database. But then on my phone, I basically needed the database too, and I can't run MySQL on my phone. So I ended up running this thing called Android Room, which I think is built on SQLite. But that's an example where, on the phone, it's just not practical in an Android app or an iPhone app to run a MySQL database. And so your environment plays a huge, huge role in what set of databases you're going to be looking at. So if you're on the browser, for example, or if you're running on an edge server, that's going to change the scope and the type of databases you're going to look at.
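The embedded option really is that lightweight. A tiny sketch with Python's built-in sqlite3 module, the same engine Android Room wraps, just to show that there's no server process at all, only an in-process library and a local file (the filename and schema are made up):

```python
# Embedded database: no server, no network hop. The engine runs inside
# your process, and the whole database is one local file.
import sqlite3

conn = sqlite3.connect("photos.db")  # hypothetical filename
conn.execute(
    "CREATE TABLE IF NOT EXISTS photo (id INTEGER PRIMARY KEY, s3_key TEXT)"
)
conn.execute("INSERT INTO photo (s3_key) VALUES (?)", ("2023/beach.jpg",))
conn.commit()
for row in conn.execute("SELECT id, s3_key FROM photo"):
    print(row)
conn.close()
```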
Right.
Right, right.
For sure.
Yeah.
Yeah, and as you start being more concerned about getting access to data quickly, fundamentally this is a problem of getting data as close to the user as possible. And that kind of goes into the subject of edge-based databases, where we're trying to keep data as close to the user as possible. We have a few fundamental constraints here. The speed of light actually dictates that there are fundamental limits to how quickly you can get data from very far around the world to a user. And the other fundamental constraint is that we know this is one of the most important things to users, right? There's been study after study on user interaction showing that low latency is absolutely key to a high-quality user experience. And so I think this is another fundamental direction for databases: recognizing that we do need to get data close to users if we're going to really try to achieve the optimal experience for them.
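To put rough numbers on the speed-of-light point: light in fiber travels at roughly two-thirds of c, which sets a hard floor on round-trip time before the server does any work at all. A back-of-the-envelope sketch; the distances are illustrative:

```python
# Back-of-the-envelope floor on round-trip latency imposed by physics.
C_KM_PER_MS = 300.0     # speed of light in vacuum: ~300 km per millisecond
FIBER_FACTOR = 0.67     # light in fiber moves at roughly two-thirds of c

def min_rtt_ms(distance_km: float) -> float:
    """Theoretical minimum round trip in fiber: out and back, no hops."""
    return 2 * distance_km / (C_KM_PER_MS * FIBER_FACTOR)

print(min_rtt_ms(50))      # nearby edge node:         ~0.5 ms
print(min_rtt_ms(4000))    # across a continent:       ~40 ms
print(min_rtt_ms(15000))   # halfway around the world: ~150 ms
```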
Yeah, that totally makes sense. I think even in this Android app, it just was totally untenable to wait for a database lookup. I just wanted to be able to scroll and see all the photos. And particularly for this app, because it's meant to show photos that your family and your friends, who have agreed to share with you, have created, but also photos that you had on your own phone. And so you kind of feel like, why is this taking 800 milliseconds to pull up a photo that I took two seconds ago,
right? And you'll also see this a lot with games, where there's a lot of transitions. I've been paying attention to a lot of game design and game art recently; that's just the latest kick I'm on. When someone clicks New Game, there's always a little fade out, fade in. And I thought about, what would this game have been like if they didn't do that? And the reality is, it's hard to tell, because they're hiding it with the fade, but it probably was going to take, let's say, at least two or three hundred milliseconds to create this game, or to get from the New Game splash screen to whatever's next. And if you don't have a transition, people can see how long it took after they clicked that button, and that is kind of jarring. I was thinking about when I play really low-budget indie games, there is this thing where you feel a little bit of a stutter when you click New Game, and it kind of tells you that this is not going to be a really professional experience, you know? So it is amazing; it's a subconscious thing, one of these things you don't think about until you think about it. But latency has an enormous, enormous impact on the user experience, and it's just a phenomenal degree to which it does.
It does, yeah, absolutely.
Yeah, I mean, even if you try to use your mouse on a 30-hertz screen, it's like, just give up.
And we're talking about, you know, a few milliseconds here, right?
Yeah, that's right.
Yeah, it's totally wild.
It's just something about that synergy of really real time.
It is a totally different experience.
And you can do things to hide it.
But, you know, when you're talking about databases, you could be potentially talking about multiple seconds and you really can't hide that.
I mean, you have to get it faster than that.
There's no other way.
Yeah, absolutely.
Yeah.
Yeah, so I know for Android there's Room. For iOS... actually, Patrick, do you know what the iOS equivalent of Android Room is, for storing data on phones?
No, not short of Googling it.
Okay.
ChatGPT open.
All right.
Yeah, ask ChatGPT.
But there is something like that for iOS where it's basically a SQLite database, just like Android Room.
But I'm sure it has the word framework in it.
It's like data framework or something.
Everything is a framework.
But there is something like that on iOS.
And so if you're on those platforms,
you're almost certainly going to be using one of those.
Again, you could load LevelDB, like do some C++ interop type thing on Android; it's totally possible, there's GitHub projects for it. But if you're just starting out, use Android Room. I mean, it has the vast majority of the market share. So for Android and iOS it's kind of a no-brainer. What about for the web? I mean, what are the kinds of things that people can do in the browser,
things that people could do on the edge?
What are sort of different options there?
Well, in the browser, there's been a few different attempts over the years to provide native functionality. There's Web SQL, and then the IndexedDB engine. Lately there's been efforts to get SQLite running in WebAssembly, which is kind of interesting.
Oh, cool.
Yeah. So there's been some different things in the works for getting data to be, you know, like a database in the browser. For most large-scale applications, though, you typically are still dealing more with a back-end database. And so edge databases are kind of a big driver there: there's still a back-end that you're talking to, but it's as close as possible to the user.
Yeah, that makes sense.
So, describe for some folks... some folks might not have listened to it, but we had a whole episode on edge computing. If you haven't heard that one, go back; it's great. But if you haven't heard it yet, kind of give folks a little intro: what is the edge, when people say the edge, and what is that environment?
Sure. I mean, at a basic level, the edge is about distributing your cloud computing around the world so that there is a server that is close to every person that is accessing your data, your application. That obviously has a huge spectrum of how close you can get these edge compute machines to your users. Certainly, if you have more money, you can have 200 server locations around the world; you're going to be able to get closer than if you have, you know, four locations around the world. But at the fundamental level, you're just simply trying to get your servers as close to your users as possible, which again is all about achieving lower latency.
Got it. And so when you have a server on the edge, how is that different from, you know,
renting an EC2 instance or something and installing Linux on it?
What is that environment?
Do you get just a whole machine where you can do anything you want or are there restrictions?
I mean, there's a spectrum here, just like you'll have with cloud computing, as far as whether you can afford dedicated edge computing or whether you're utilizing shared resources. We do a lot with Akamai, and they have a lot of edge capabilities, with EdgeWorkers and things like that. But yeah, again, there's a broad spectrum of what you can afford.
Got it. Yeah, I do know that Amazon relatively recently announced Lambda@Edge, where you can write Lambda functions for the edge. But I think it's only Node or something; it's some type of JavaScript runtime. You couldn't run Python or something, not without converting it first.
Right, right.
Yeah.
Yeah.
So how did that evolve?
What's the connection between these edge nodes and JavaScript?
You know, I think the big driver is that JavaScript
really has become probably the most advanced
primary language for being able to sandbox
in an effective way.
So, you know, being able to take code
that a user has provided and execute that on a machine has always been like kind of a challenging task to deal with.
Right. Like, is this code going to do something malicious or take too much resources?
The thing is, with JavaScript, we have been using web browsers that run it everywhere. I mean, I've got a dozen tabs open that are all from different sites. This is the most well-tested, battle-tested system for taking user code and running it on a different machine in an untrusted model, where different code can be malicious, can be doing different things. And so JavaScript has really gone further than any other language in terms of this ability to host code, do so in a safe and secure way, and ensure that there's correct limitations on resources. We see this with Lambda@Edge and Akamai EdgeWorkers, and Cloudflare has very similar capabilities, where they're hosting things in JavaScript. And, kind of getting back to where I'm working with HarperDB, this is exactly the same model that we're using as well: hosting JavaScript as a mechanism for taking user code and being able to run that across the edge. And JavaScript just works really well because it is, again, so battle-tested for being able to distribute and quickly run code in a secure way.
Yeah, that makes sense.
So let's spend a little bit of time talking about Harper. Where does HarperDB fit here, in terms of, just to recap: latency, consistency, scalability, language support, relational versus non-relational? What is HarperDB, and how does it fit into the picture? When should folks use it?
Yeah. I mean, it certainly has its roots, in terms of storage, as a NoSQL database. It uses document storage mechanisms; basically, we store the object structures, and we actually store them in MessagePack format, because that's a lot more efficient than JSON. But it also has a lot of hybrid characteristics as well: a SQL query engine, secondary indexing, ACID compliance. And so a lot of those things that really make for robust application development exist, built on top of that NoSQL engine capability.
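As a rough illustration of the MessagePack point, here's a quick size comparison in Python using the third-party msgpack package (pip install msgpack). The record is made up, and the exact savings depend on the shape of your data:

```python
# Comparing the encoded size of the same record in JSON vs MessagePack.
import json

import msgpack  # third-party: pip install msgpack

record = {"id": 3, "first_name": "Ada", "last_name": "Lovelace",
          "phone_numbers": ["555-0100", "555-0101"]}

as_json = json.dumps(record).encode("utf-8")
as_msgpack = msgpack.packb(record)

# MessagePack drops the quoting/punctuation overhead and uses compact
# binary type tags, so it's typically the smaller of the two.
print(len(as_json), len(as_msgpack))
```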
Probably one of the distinctive aspects of HarperDB is the fact that it is designed to, again, run JavaScript application code, and do it basically in-process with the database engine, to achieve very, very low latency access between the JavaScript and the database engine. And so when you fundamentally have a user, a client, that's requesting data, that can go directly to an edge server. There can be application code that handles that; it can do whatever appropriate queries into the database, fetch data as it needs to, and then respond to the user. And you've had exactly one network hop.
And so that's kind of our fundamental goal: this notion that rather than maybe going around the world to an application server, which then makes another hop to a database and then comes back, we're trying to achieve basically one-hop access to data, even through the complexities of application logic, and back to the database.
Well, so if you're running on the edge, my guess is it's a full replication, so each node has a full copy of the database. And so then how do you get around some of those challenges we talked about? Like, if you're ACID compliant and two folks in different parts of the world try to delete the same shopping cart?
So we're ACID compliant at the node level, and then at the network level it's eventually consistent. But that actually still means you
get all the characteristics of atomic commits, you get the characteristics of durability, you get the characteristics of isolation. It just means that basically we aren't employing locking, so I can't lock a record across the entire database.
I can atomically interact with it. But this isn't necessarily a great fit for a financial application, where you need to do a row-level lock on a record, on an account: okay, I don't want anyone else changing this while I retrieve this money out of this one account and put it into this other account. Right. But there's a lot of applications where you still have the basic concepts of atomic, isolated, durable commits, but those can be happening concurrently. We can replicate this data and resolve conflicts based on timestamps as that data comes together, and in doing so achieve very low latency replication, as well as low-latency access to the data.
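The timestamp-based conflict resolution described here is essentially a last-write-wins merge. A toy sketch of the general technique, not HarperDB's actual internals:

```python
# Toy last-write-wins merge: when two nodes hold divergent copies of the
# same record, keep the one with the newest timestamp. Illustrates the
# general technique only; HarperDB's real implementation may differ.
from typing import Optional

def merge(local: Optional[dict], remote: Optional[dict]) -> Optional[dict]:
    """Each record carries an 'updated_at' timestamp set by its writer."""
    if local is None:
        return remote
    if remote is None:
        return local
    return remote if remote["updated_at"] > local["updated_at"] else local

a = {"id": "cart-1", "items": ["book"], "updated_at": 1694000100.0}
b = {"id": "cart-1", "items": ["book", "pen"], "updated_at": 1694000150.5}
print(merge(a, b))  # b wins: it was written later
```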
Well, that makes sense. Yeah, I mean, just tying a lot of things together here: I remember World of Warcraft had, and I don't know if this is still an issue, but they had some issue where, I guess, you could be in one part of the world... I'm totally going to get this wrong, because I don't play World of Warcraft, but you could be in one part of the world and pass something to somebody who was right next to you but in a different part of the world, because of the chunking, and it would duplicate it. So it was like, you could make a trade and then both of you cancel at the same time or something. And just because they were different nodes, and they were eventually consistent, their way of reconciling was to just let you both keep the weapon.
So yeah, Chase Bank is not going to let you go halfway across the world and double-withdraw your money. I mean, that would be nice; it'd definitely pay for the plane ticket to Singapore or what have you. But they're not going to let you get away with that. But for most situations, if your shopping cart has the item twice in it because two people in different parts of the globe added the item, that's just a glitch that we're going to have to sort out downstream, right? In exchange, what you get is all of those things that we talked about that are so important: that speed and that latency that causes something in your brain to be really, really happy when you're on a product.
Yes, exactly. Yep. We want people to be happy.
Very cool. Yeah, I have a buddy who's a musician, and he says you don't want to play kind of crazy notes; you kind of want everybody nodding their head and feeling the rhythm, and then he'll save the crazy notes for when he's playing with other guitarists. Same kind of thing here: you want people to feel like they're in this really natural environment, and latency has been proven over and over again to be super critical for that.
Let's talk about Harper, the company.
So we mentioned that you're distributed.
Roughly how many people and what's something kind of unique about Harper?
It could be your mascot.
It could be what you guys do for onsite.
What's something that makes Harper stand out from a company perspective.
Sure. Yeah, I think we have about 18 people right now, and Harper is named after the CEO's dog. And so it's very much a dog-loving company. I actually don't have a dog myself; I have a cat. I've considered it a small miracle that they hired me despite the fact that I don't have a dog. But there's generally been, you know, stand-up meetings with chickens on the calls, and in general it's a very pet-friendly company. So that might be a little bit of a distinctive.
Oh, that is really cool. I go to this place called Civil Goat Coffee, and for the longest time there were goats right there. The goats would come up to you and nudge you and stuff like that. I think they finally got some kind of complaint or something, and they had to put the goats behind a fence. But I was a little bummed. I thought the whole experience was to watch my kids freak out when the goats got close to them. That was part of the fun.
Yeah, that's awesome.
Yeah, well, that is really cool. Well, you know, you can always go from there to Datadog; it seems to be a recurring theme.
Yeah, for sure. We've definitely done plenty of Datadog.
That's right. Very cool. Well, this is great.
Anything else that you wanted to kind of get out there?
It could be... well, actually, you know, one thing is, if someone's in high school or college, they might be really looking for something that's a pretty low barrier to entry.
They're not going to want to sign an RFP or anything like that.
So for folks who are kind of really just getting started,
does Harper have a product for them and how would they get started?
Yeah, you can go to studio.harperdb.io and sign up for a free instance of the database, and that's one of the easiest ways to get started. You can also install it from npm, so you can do an npm install harperdb and start with a local installation. And so, yeah, those are some great ways to just spin up a HarperDB instance, start creating some tables, add some data; you can import CSV to have some sample data. And then you could get started with writing some application code as well, and experience what it's like to have this fast, in-process access to data.
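If you'd rather poke at a local instance from a script, HarperDB also exposes an HTTP operations API where you POST a JSON operation object. A hedged sketch in Python with requests; the port, credentials, and exact operation/field names here are assumptions, so check the current HarperDB docs before relying on them:

```python
# Hedged sketch: driving a local HarperDB instance via its HTTP
# operations API. URL, port, credentials, and field names are
# assumptions -- verify against the current HarperDB documentation.
import requests

URL = "http://localhost:9925"     # assumed default operations endpoint
AUTH = ("HDB_ADMIN", "password")  # hypothetical credentials

def op(body: dict) -> dict:
    """POST one operation object and return the JSON response."""
    resp = requests.post(URL, json=body, auth=AUTH, timeout=10)
    resp.raise_for_status()
    return resp.json()

print(op({"operation": "create_table", "database": "dev",
          "table": "photo", "primary_key": "id"}))
print(op({"operation": "insert", "database": "dev", "table": "photo",
          "records": [{"id": 1, "s3_key": "2023/beach.jpg"}]}))
```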
Very cool. And so, just so I'm clear: it really excels at the edge, but you could run Harper just on your own computer, the server part of it as well. Is that correct?
Yes, that is correct. Yep. And in general, I think, like you've experienced, that's usually a great way to do development. Usually you want to have a local instance if you're going to be doing any significant development, so that things are fast and direct, and you know exactly what's going on, and you can look at things in your task manager and stuff like that.
Yeah, totally. Really cool. Hey, Chris, thank you so much for coming on the show. It's been awesome. I really hope we've motivated folks out there to learn about using databases. You know, if you have a database class at your university, it'd be great to take it.
I know there's a lot of competition.
There's a lot of other really exciting classes you might want to take.
So if you don't take the database class, definitely take some time to get familiar with databases as a way to store and retrieve data pretty easily.
Because it's an incredibly important part of pretty much everything you're going to do
in your professional life.
And really, just thanks again for coming on the show
and helping folks get started with that.
Thank you so much for having me.
This has been a lot of fun.
I really appreciate it.
All right, thanks.
And thanks to everybody out there.
We've been going through a bunch of folks' requests
for programming languages
and topics. We have... differential equations, I think, is the next show, which will be pretty exciting; that's a pretty heavy, mathy topic we're going to talk about. We're also talking about game engines. We have a whole bunch of topics, and we really couldn't do it without all of your inspiration, all of your ideas, your emails, and also without all of your practical support on Patreon. That's really the way that we keep the show going and get the word out for everyone. And so we really thank everybody for your support on there, and we will see you all next show.
Thank you.
Music by Eric Varndollar.
Programming Throwdown is distributed under a Creative Commons Attribution
Sharealike 2.0 license.
You're free to share, copy, distribute, transmit the work, to remix, adapt the work,
but you must provide an attribution to Patrick and me, and share alike in kind.