Programming Throwdown - 166: Speedy Database Queries with Lukas Fittl

Episode Date: October 16, 2023

- pganalyze: https://pganalyze.com/
- Weekly series "5mins of Postgres": https://www.youtube.com/channel/UCDV_1Dz2Ixgl1nT_3DUZVFw
- How Postgres chooses which index to use: https://pganalyze.com/blog/how-postgres-chooses-index
- CMU databases courses: https://db.cs.cmu.edu/courses/
- Postgres community: https://www.postgresql.org/community/

As well as social links:
- Mastodon: https://hachyderm.io/@lukas
- Twitter/X: @pganalyze, @LukasFittl
- GitHub: @pganalyze, @lfittl
- LinkedIn: https://www.linkedin.com/in/lfittl/ , https://www.linkedin.com/company/pganalyze/

★ Support this podcast on Patreon ★

Transcript
Starting point is 00:00:00 Programming Throwdown episode 166: Speedy Database Queries with Lukas Fittl. Take it away, Patrick. Everyone, welcome to another episode. Pretty excited about this one. Well, I think I say that every time. It's true every time. Try to bring you guys good content. Today, talking a little bit about a part of databases we haven't talked about. We've had a fair amount of people on to talk about various aspects of databases and learned a lot. Today, Lukas is here and he's going to help us understand some lower-level stuff, some optimizations and queries.
Starting point is 00:00:50 We had a bunch of good thoughts, even in the pre-recording. I took notes. I don't know how many of them we're going to get to, but I'm excited to have Lukas here. Lukas is the founder at pganalyze. Welcome to the show, Lukas. Thank you. Thanks for having me. All right. So we normally start off by talking a little bit about how people got into tech. So the question we normally tee up at the beginning is: what was your first sort of, like, computer programming experience? Do you have, like, a formative
Starting point is 00:01:14 moment where you're like, oh yeah, that's it, that's magical? Yeah, good question. So how did I get into tech? It was probably my dad's laptop, which was probably running, you know, Windows 95 or whatever was before that at that point. So I'm in my 30s now, for context. And I remember what fascinated me back then: there was this game where you had these apes throwing bananas at each other, and you could program it, because it was just written in, I think, a version of Visual Basic or something, or, like, BASIC, I think, at the time. And so I think how I got into programming, you could argue, is going in and modifying that game so, you know, the ape would throw the banana a bit stronger, essentially. Um, and so that kind of stuff was fascinating for me early on. And how I think I actually got into serious programming was
Starting point is 00:02:00 probably, um, you know, with game programming, um, from a hobby perspective. So I wanted to build my own game engine. I wanted to build my own games. Let's say the most I got to was, there was this competition over the weekend where people were building games with Pygame. And so I ended up building a really simple game where you were rolling a ball through a tube and going around obstacles.
Starting point is 00:02:20 And that game actually worked, compared to all the other ones that I coded up the engine for but didn't create any assets for. But that's really how I got started with, you know, programming as something that I enjoyed at the time. Um, and then later, how I actually got into the tech industry was, I, um, actually left school. I wouldn't recommend it, but I left school at the age of 16 and essentially just started working. And my first job was a hosting company, essentially, you know,
Starting point is 00:02:48 putting servers in racks, like back when you still had physical servers, and also writing code to support, you know, somebody provisioning in the backend for that hosting company. And so really that's, you know, where I would say I got my first professional experience, actually, you know, working with customers, working with applications, working with databases. Um, and that's, you know, a long time ago now, since I was 16. Um, and then I pretty, you know, soon actually got into, um, the more entrepreneurial side of this, right? So, um, back in, if I remember correctly, 2007, um, we started a company together where I was essentially
Starting point is 00:03:22 the, you know, one of the co-founders, and technical CTO, I think, at some point as well, I forget. But essentially, creating a blogging network. So something like Tumblr or Twitter, or X, if you call it that these days. But much earlier, essentially. And so the backing database for that was Postgres. And so my interest in Postgres really started back then, in that startup where we essentially had that blogging site that got a lot of visitors and had a lot of traffic coming into that site. But ultimately, it was always a database query that, you know, fetched your results if they weren't cached. And so that's really where my interest in Postgres and the database world came in, was, you know, from an application perspective, as an application engineer, encountering frustratingly slow experiences.
Starting point is 00:04:04 Wow, that was awesome. That was a whirlwind tour. I think that, you know, mentioning these various sort of stops along the road, and sort of, even, I don't want to say meager, but there's an undertone, like, oh, working at a database hosting company and putting servers in racks. I feel like everyone, or even myself, I look around and see everyone where we are at today, and, you know, you're at a company, you know, Jason and I are at tech companies or whatever, like, these kinds of things, people look at them and say, oh, hey, like, wow, it must have been this, like, you know, meteoric rise. And I guess I do know some people who this kind of is, but then others, it's just like this story of just, like, no, it's like really basic job after basic job,
Starting point is 00:04:41 and just, like, continuing to sort of roll it forward. So I guess, like, that's an encouragement to me to hear it, but also to other folks out there listening: like, oh man, sometimes it takes that first job. It's like all about the next step or whatever. Maybe some people will sort of innately have everything set up for them to just go directly into a quote-unquote dream job. But for some people it just really takes finding your way, growing. Not to say that, maybe you actually loved all those jobs, uh, but sometimes you look back on them like, wow, I can't believe I did that, at least for me. And it was funny, you know, like, so I worked at Microsoft a couple years back, and my first big company job at Microsoft, I was like, wow, you know, I never thought I would get here, because I remember thinking I could never get a job at
Starting point is 00:05:22 Apple or Google, right? Like, I was not, you know, the Ivy League graduate who could, you know, actually have that background that they look for in those interviews. But there's always a backdoor of sorts into those companies, right? It's just not through the front door, right? Like, if your resume looks like you dropped out of school at 16, they're not just going to rubber stamp it, right? But if you have a way of showing your skills and showing your work, right, in some other way, then there's
Starting point is 00:05:45 always a way in the system, essentially, to, you know, find that dream job of yours, if that's what you're going for. I have opinions on big companies, by the way, but if that's what you want, there is a way to do it. Yeah, I feel like, I have opinions about Ivy League schools and about big companies, but maybe we have to save those up for another time. Yeah, a bonus episode or something. One thing I remember, when I got my first tech job, it was much less than half of what a starting engineer makes at one of these big tech companies, or, like, FAANG companies, let's say. But I remember, this is kind of a really weird thing.
Starting point is 00:06:21 But I remember I had to do something like go get a pen from the cabinet, because I didn't have a pen at my desk, and I walked back and I realized, oh, I made like 10 cents in the walk, or it was like three cents or something like that. But that was such a big deal. It's like, wow, you know, they paid me a quarter to, you know, go and get this pen and this notebook. And that just blew my freaking mind, that, like, oh, the salary that you get paid as part of everything. And yeah, it's true. It really is kind of stepping stones.
Starting point is 00:06:54 I feel like now, and Patrick maybe can attest to this, in management, in leadership, there's just always kind of a crisis. And I feel like, uh, you know, you have to kind of build the battle scars to walk into the office every day where there's some kind of crisis and say, all right, you know, let's get this figured out and let's move on to the next one. I feel like that would have just completely destroyed me as a new hire. Yeah, and it's fascinating.
Starting point is 00:07:25 I think, you know, to me, that's why I now run a small company, right? And so to me it is this notion of: there's nobody else, like, everything stops with me ultimately as a CEO, right? So if things don't work, yes, of course I can tell people to improve, but ultimately, you know, I need to be the one making sure that it happens. And that's a really different mindset than being a cog in a wheel, so to say, which sometimes can happen in some jobs, right, where you're just doing what you're told, but your buck stops really early, right? Like, you're just shipping the code, and not even shipping it, you're just writing the code, you're committing it, and you're done, right? Versus in a startup you actually have to care about the outcome of the code.
Starting point is 00:08:00 Yeah, that's an important distinction and lesson. And I'll also say, one of the interesting things, and you're kind of mentioning it, that it never stops happening, is people believe this, like you said: there's going to be an issue and you're just going to tell someone to take care of it. And it's like, I mean, maybe at some level, or if you've grown a really good team, you can kind of claim through your hard work that you've gotten to that point. But I was trying to explain to some people the other day: I have a team of people who work with me, but you can't just tell them what to do. You have to get them to buy in, convince them, like, explain, teach them, to make those decisions jointly. So any expertise I have is only as good as how much I can imbue
Starting point is 00:08:37 into others the same feelings and same understandings and context, and, you know, having to trust each other to make good decisions. But it's this weird thing where everyone believes, or even, you know, I was explaining to my kids, they have this thing like, oh, you know, you've been there a long time, like, you just tell everyone what to do. It's like, yeah, no, that doesn't work. Like, it's all about trying to explain what you would like, explain a plan, like, trying to convince others to buy in. And so people believe it's this boolean flag that just gets flipped, like, oh, you're senior, or you're a manager, or you're whatever, and that means anyone without that flag, you get to tell them what to do. And it's like, no, it does not work like that. All right, so I guess, like, trying to get us on to our story arc today. Uh, you mentioned something in your sort of intro
Starting point is 00:09:21 there, that you started, uh, you got introduced to Postgres, and you were sort of talking about, and I think this is an interesting segue: a lot of people get very focused on what tool to do a job, right? Not what kind of tool, but the specific name brand on the thing. And you see it in other hobbies. It's not like it's unique to programmers,
Starting point is 00:09:40 but, you know, I don't know, I'm going to pick something I don't know anything about. Oh, you know, running. I want to go running. What exact shoe do I need? Is it going to be Nike, you know, is it going to be Reebok? Like, they get very hyper-focused, and then you hear people just say, no, just go run. And it's like, no, no, you don't understand, I've got to know which shoe. Uh, and so I think sometimes people can get caught up with, uh, you know, choosing a specific database, you know, uh, brand, or a specific program, or which one they're going to run. And I
Starting point is 00:10:07 think we're going to talk about a bunch of stuff today that's going to cut across, hopefully, a lot of different things, in this sort of generic sense, the underworkings, the underpinnings. And often, I'll say, in some ways they are more similar than they are different, and being aware of the differences is important, but being aware of the similarities is more important, maybe, or equally important. Well, we'll figure it out. But you had mentioned something in the intro about Postgres. I thought it was really, really interesting. And we'll drop it here just because I want to make sure we hit it, and you kind of mentioned it,
Starting point is 00:10:34 which is, Postgres, unlike a lot of other databases, is open source. And so, being open source, the community vibe is a little bit different. Can you maybe speak a little to, like, your introduction to Postgres? And I think even now you're still working with Postgres, so, like, obviously it's sort of stuck with you, and why does it continue to be something that, uh, you work with? For sure, yeah. And I think it's interesting you mentioned open source as the qualifier. I actually, you know, these days open source as a term has gotten a bit muddled, because a lot of databases are open source or open core in some way or form. Um, I think what really differentiates Postgres to me is that it's not run by a single entity.
Starting point is 00:11:10 It's a community project, right? And so if you think of the Linux kernel, for example, it's a very similar community project. I mean, actually, the one big difference is that in the Linux kernel there's Linus Torvalds at the head of everything, versus Postgres doesn't have that person, necessarily. But what's interesting to me with Postgres in particular is that it's 27 years old now. So, you know, it's almost as old as I am. Not all the way, but almost there.
Starting point is 00:11:31 And I'm sure it's older than some of you listeners here. And I think what's interesting is that it has survived the test of time, right? Like, it wasn't a fad that went away. It actually, you know, survived the changes that happened over the years, cloud databases, right? Like, if people deploy a database in the cloud today, a lot of times it's Postgres. Probably more than half the relational databases deployed are going
Starting point is 00:11:52 to be Postgres today, um, in the cloud, you know, caveat comment, but, um, I think where it really comes back to, right, is that it's a community project. Everybody can contribute to it. It's in many ways had all the downsides of community projects too. So sometimes you might have to look for tools beyond the core product. So one of the things that we did with my company is provide a monitoring and optimization tool for Postgres called pganalyze. And the idea there is essentially to add the, you know, add the parts to Postgres that, you know, are not there yet. And similarly, you know, what I think has stood out about Postgres over the test of time also is the fact that you can extend Postgres. So we don't do much of that personally. But what many people have done, you know, there's companies like TimescaleDB or
Starting point is 00:12:38 Citus Data, that I used to work at personally, that extend Postgres by creating an extension for Postgres. And that allows you to ultimately build a very different database on top of the core engine. And so, I think, coming back to which database should you choose: I think I'm cool with any choice. So I will never fault anyone for choosing, you know, MySQL or MongoDB. It's all cool. The reason I personally choose Postgres as my default is because I think it has the extensibility that covers a lot of different use cases. And so you kind of have that wiggle room: like, if you're suddenly looking to store columnar data, there is a way to do that in Postgres. If you're looking to do time-series data, there's a way to do that in Postgres. And if you want to work with embeddings in your AI
Starting point is 00:13:17 system, there's a way to do it in Postgres with pgvector. And so those flexibilities are very convenient, right? Because oftentimes, if you choose a specialized database too early, then you will just hit a, you know, brick wall, and then you can't do anything. Yeah, I think this is something I've heard echoed a lot. And, um, I guess you kind of said stood the test of time; I guess, like, the other word there is, like, you know, fad. So there are, like, fad databases. I don't know, people who, like you said, choose, let's say, a specialized database, um, and I think, and maybe we can talk about it briefly, but, like, sort of document-store databases, or just key-value stores with no relational part, no real query engine, leave people in the lurch when they realize they need to do a query, right? Like, oh hey, I have this key-value
Starting point is 00:13:59 store, it's super cool, it's super fast, like, I feel really good about myself. And then, you know, someone comes in, and we're going to talk about this a little later, hopefully, but, like, someone comes in and says, oh, I want some aggregation, I want whatever. And you're like, uh, okay, there's no SQL. I've got to write this myself. Or you find yourself, like, writing a SQL engine over your thing. Like, you know, and oftentimes it's some form of premature optimization, I guess, like, people are trying to narrow in on some future concern of what happens if I have a billion visitors to my website, and you don't have one. It's like, okay, first get, you know, a handful, make some good decisions or whatever. But yeah, you kind of mentioned that, like, early specialization, or going to a niche database. And it's interesting that you say sort of Postgres
Starting point is 00:14:39 has, I guess I knew that, but I didn't know, like, I hear about extensions for all sorts of things for Postgres, and it's really a way to customize it later, potentially for lower cost than sort of completely ripping out and switching to a new database; it's sort of like layering on an extension. Do I sort of have that model kind of right? Yeah, I think so.
Starting point is 00:14:58 I mean, you could, I think one good example is, Postgres, you know, has a lot of data types. So Postgres can be typed, right? So it's not just a, you know, throw in text and it's always text; you can actually say, hey, this is a URL, and then there's some validation, because URLs have a certain structure. And so one of the most basic extensibilities of Postgres is that you can do your own custom data types. So if you know a way that you want your data to be validated, and you want your data to be stored, but you want the
Starting point is 00:15:22 input to be this particular text format, you can write an extension that does that for you in Postgres. And so it's that, right? It's the fact that you don't have to contribute a change to the core database to add a new data type; you can actually just do create extension my data type, and that's all you need to do, right? Like, ultimately you write a C extension, if you do that kind of low-level stuff. There's also a way to do this in Rust these days. So if you want to have more type safety, there's a good way to write complex extensions in Rust. And then sometimes extensions can be as simple as just a SQL script, right? Sometimes it's just literally, I want to have this function in all my projects. So I can just say, create extension, my function.
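To make that concrete, here is a rough sketch of what the SQL-script flavor of this can look like. The names (short_url, first_non_null, my_helpers) are made up for illustration and are not from the episode; a real extension would also ship a small control file alongside the SQL:

    -- A lightweight custom type with validation, via a domain over text:
    CREATE DOMAIN short_url AS text
        CHECK (VALUE ~ '^https?://');   -- only accept http(s) URLs

    -- A helper function you want available in every project:
    CREATE FUNCTION first_non_null(a text, b text) RETURNS text
        LANGUAGE sql IMMUTABLE
        AS 'SELECT COALESCE(a, b)';

    -- Once these statements are packaged as an extension, installing them
    -- anywhere becomes a single statement:
    CREATE EXTENSION my_helpers;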
Starting point is 00:16:00 And then that's just a way to package it, essentially. Cool. Maybe let's dive real deep here for a second, and then, uh, we'll see how this goes. If it goes bad, it'll be on me, not on Lukas. But we're going to try to dive deep into databases. Okay, so I guess, like, not every database, but a lot of databases, you'll sort of hear about the sort of storage, and I guess, even, you know, maybe this goes back to, like, spinning metal hard drives, maybe people don't even know what that is anymore. Um, but, you know, sort of, people talk about, uh, you know, using a B-tree, and we don't have to go into, like, that's always really hard, like, describe to me what a B-tree is. No, no, let's not do that,
Starting point is 00:16:37 please. Uh, but, like, you know, you have this B-tree, which is a generalization of a binary tree, so it's this ordered tree structure that sort of lives there, and we're going to insert data into it. Um, and so the idea here being, we want to store something, and then later we're going to, you know, kind of query for it. So, um, I guess, like, when you start at this level, this feels like not that hard, right? Like, oh, I'm just going to insert something into a hash map, and then I'm just going to, you know, query it back out. Um, and this is really, you know, no big deal. And so I guess sort of the first thing is, like, making sure it goes on disk. So, like, it's not in memory, but it actually needs to be on disk, that way if
Starting point is 00:17:13 your, you know, service dies or whatever, you know, it's preserved and you can kind of load it back up. Um, but leaving that as sort of the base, and maybe I set you up poorly, but, like, you know, sort of the base there, like, what are some of the mechanics that a database offers? We're going to try to get to, ultimately, talking about why queries get slow. Um, but I start putting data in this, you know, tree, and the tree is, you know, ordered in some manner, and I want to write, you know, queries against it. Sort of, what kind of is the growth that leads me from there to, like, all of a sudden, something was fast, and now it's not fast?
Starting point is 00:17:48 Yeah, I think this is a great topic. We'll have to go into a lot of details there. All right, yeah, let's go. So I think, let's forget about data structures for a moment. Let's forget about trees or hash maps. I don't think, from a fundamental perspective, they necessarily matter, right? We can definitely talk about how B-trees work,
Starting point is 00:18:03 but I don't think that's actually the most important thing. I think the most important thing is to understand that ultimately, what most databases that are not in-memory databases do for you is that they store data in a file on disk, right? An actual file on your disk, um, in a sense. Um, and they make access to portions of that file very efficient, right? Because, like, usually what that means is, in a relational database you have a table; in Postgres, for example, a table gets represented as a file, and then if you have indexes, each index also gets its own file. But you don't always want to look at the whole file, right? What you want to do is you want to look at a portion of the file, just enough to answer the question you're asking, right? Which might be, give me this particular person's birth date
Starting point is 00:18:45 by email address or something, right? Or, like, give me the top 10 users of my website or something. And so really what the database does for you is figure out the best way to work with those files, both from a reading and a writing perspective, so that you don't have to, right? Because the alternative here is that you, in your own code, in your own software, do the same thing, right? You do file open, and then file read,
Starting point is 00:19:08 and file write. But ultimately, what you would usually, most likely what I would do, is I would just literally read the whole file and then put it in a hash map in a Python or Ruby script, right, and then just look at it. And it's just not workable if you're talking gigabytes or terabytes of data; like, it's just too slow. And so really the big job of the database is to make that interaction way more effective, way more efficient, and then also let you write queries in a way where you don't have to write the code to do all these lookups, right? You can actually express your intent using SQL, and then the database figures out how to locate the data you're looking for. Okay, that was much better than I did. All right, so what I hear here is, like, uh... That's all right, I give it to you. No, no, no. Uh, the separation of
Starting point is 00:19:50 concerns, that's how I would sort of term it. Which is, like, you know, you're writing some application that, like you said, sort of wants quick access to a wide variety of data, past the point where, you know, you're going to necessarily want to just have it all in memory, and all these other benefits. And so the database's job is, I think you said it really well, like, make the smallest portion of data available, or need the smallest portion of data, in order to answer your questions, and then provide sort of an API, in this case SQL, a query, so that you can tell the database what it is that you're looking for. And then its job is to, as efficiently as possible, give you back your answer. That's right. And I think from a performance
Starting point is 00:20:30 perspective, the one thing that I'll mention that's really important to understand is that oftentimes the database will look at a lot more data than what is in the result it gives back to you, right? So oftentimes performance issues are like, you might just be looking for a single row, like in this case of the birthdate and the email address, right? But the database actually has to look at the whole users file, aka table in this case, right? And actually, in the worst case, read the whole file, right? Until it finds the matching row.
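To put a concrete shape on the running example here, assume a hypothetical users table like this (the table and query are illustrative, not taken from the episode):

    -- A small table; on disk, Postgres stores its rows in a single file.
    CREATE TABLE users (
        id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        email     text NOT NULL,
        birthdate date
    );

    -- The query only returns one small value...
    SELECT birthdate FROM users WHERE email = 'someone@example.com';
    -- ...but without an index on email, the database may have to read
    -- through the whole table file before it finds the matching row.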
Starting point is 00:20:57 And so really... Oh, go ahead. I was just going to say, so what would cause that? So if I have a table, I have a file, what would be the difference between being able to efficiently go to the specific row versus having to look at the whole thing? So I think ultimately what it,
Starting point is 00:21:13 I mean, so the simplest thing you could do... so maybe, again, important to remember is that usually in systems like Postgres, and I think this applies to at least most relational systems, like MySQL and such, is that the index and the main table are separate, right? So the main table is usually... so I'll talk a little bit in detail, just because it's helpful to think about this visually, in a sense, right? So the main table in Postgres
Starting point is 00:21:35 is usually separated into what's called pages, and they're eight-kilobyte pages. And so each eight-kilobyte page has, you know, a number attached to it, like just an integer number that counts up. And then in each page, you can have one or more rows; Postgres calls them tuples, often, for various reasons. We may not have to get into that. But the point is, these rows that you're looking for, they're in these pages. Now, if you knew which page you're looking for,
Starting point is 00:21:59 you could, very much like a phone book, right, you could just look in the phone book on page 20, and you knew that your row is somewhere on that page. Now, the challenge is that oftentimes you don't know that, right? You don't know on which page the data you're looking for is. And so, short of, you know, just reading through each individual page, right, until you get to the matching row, really what you want is an index, which sometimes can also answer the thing directly, but usually, in many cases, what the index does is just point you back at the right page, right? So the index just says, hey, I know you're
Starting point is 00:22:29 looking for this user with this email address, um, that's in page, like, 67. And, you know, you just then read that one page of eight kilobytes, right? It's an eight-kilobyte read, essentially. Um, and then you know that in that eight-kilobyte read, somewhere, is going to be that row that you're looking for. And that's really where indexes come in, is just reducing that lookup that you have to do. Now, the important thing to remember, of course, is the index is again a file. And so we don't just have to do I/O for the main table, the page on the main table, but we also have to do I/O for the index itself. And that's where it gets, I think, conceptually more challenging. Even I have challenges these days when you ask me how much overhead it is going to be
Starting point is 00:23:10 if you read something from an index. It's really hard to estimate that, because in a B-tree, you have to walk the tree. So you ultimately start at the root page, figure out, okay, do I go left, do I go right? Do I go left, do I go right? At some point, you end up on the thing that's essentially matching your query, or multiple index entries that are matching your query. Um, and
Starting point is 00:23:28 then these index entries will have a pointer back to one or more pages. And so it's hard to do a conceptual model for that, but I think that's roughly how I would try to describe it. So, right after you said it's rough, I'm going to attempt it, because I want to make sure I kind of understand this. So the index is, so the data that you want to feed into the query, so one or more entries, like, let's say, email address. So it's like a mapping of email address to page number for the rest of the data associated with that row.
Starting point is 00:23:56 But it could be, I don't know what the right term is, compound. It could be composite. It could be more than one thing. And so the idea is the index file gets opened, and then searched, finding the sort of entries in the key there, and then it gets the value of the page. And then it goes to the actual table file, opens that, and then efficiently goes to that page, sort of by number lookup, because you have it. And then, when people talk about building a database table and saying this thing is the key,
Starting point is 00:24:29 so that rows have some unique identifier or whatever, is that inherently different, or is it still just an index file, but it happens to have a shorthand name because it's so common? Yeah, it depends a bit on the database. So in Postgres, that I'm most familiar with, it's mainly a convention, in a sense. So if you have a primary key, right, let's say we have a users table, the users table has three columns, an ID and email and a birthdate. Usually the ID would be the primary key; you could also make the email the primary key. And really, one of the distinctions with primary keys is that they're usually, well, they always have to be unique. So email might not be the best choice if you want to support, you know, the same email being used twice. So let's suppose for a moment that the ID column is the primary key. Now, in this scenario we're describing, where we're looking up by email, the ID column actually does not get involved at all, unless we're looking for the ID, right? But if you're just looking for the birthday value, and we're just doing select birthday from users where email equals something, then we never need to look at the ID. Now, in Postgres specifically, primary keys do matter mostly when you're joining tables,
Starting point is 00:25:30 and you're, like, trying to say, this is, you know, kind of, like, I'm talking about this one record, and there's only ever going to be one record, because I already kind of am grouping in a certain way. And so it's mostly relevant if you're joining things and you're using the ID column in a join. Then it matters a lot from a query performance optimization perspective. But otherwise, it doesn't make a big difference if you're looking for an ID value versus an email value. They're essentially both indexes in this case, right? So a primary key is an index.
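Continuing the hypothetical users table sketched earlier, the distinction being drawn here looks roughly like this (the orders table in the join is also hypothetical):

    -- The primary key already gave us a unique index on id.
    -- Adding a secondary index on email makes the earlier lookup cheap:
    CREATE INDEX users_email_idx ON users (email);

    -- Now this query can use the index to jump to the right page(s)
    -- instead of scanning the whole table file:
    SELECT birthdate FROM users WHERE email = 'someone@example.com';

    -- The id column (the primary key) only comes into play when we
    -- actually filter or join on it:
    SELECT u.birthdate
    FROM users u
    JOIN orders o ON o.user_id = u.id   -- hypothetical orders table
    WHERE o.id = 42;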
Starting point is 00:25:57 Okay. And so then, reversing that, I think you said birthdate. So if you're using birthdate, and you didn't know in advance that you were going to be searching on birthdate, then this is where the database has no option. It doesn't have an efficient way to find birthdate. And so this is where it needs to end up going through the whole table, or does it do some magic there as well?
Starting point is 00:26:18 No, I think ultimately, if you just said select, let's say select email from users where birthdate equals something, but you don't have an index on birthdate, then there would be no way for the database to do it efficiently, beyond just looking through the table until it finds matches. And unless you're doing a limit one in your query, it may have to look at the whole table, right? If you're doing limit one, then maybe it gets lucky and the first page has that row, right? Then, you know, 50% on average or whatever, yeah. Um, and, I mean, this is also, again, where it matters: how queries are written matters in a huge way for how the database can optimize things for you, right? So if you're very specific, you're like, I want this order, and I want,
Starting point is 00:26:57 you know, the top five, then that's going to be more expensive than, I don't care about the order and I just want one, um, because that's going to be faster. Okay, yeah. So maybe, actually, that's a decent segue. So, all right, so we have these sets of tables, the database is managing them, we have some indexes over the tables, and now we're sort of talking, you were kind of moving to the next level, sort of, like, writing queries. And then, as you said, there's, um... I feel like this happens in programming as well. Uh, at least I made this mistake a lot, and I've seen other people do it. When you first start writing a program, you're worried you didn't write your conditional correctly.
Starting point is 00:27:31 So you write your conditional multiple times, or other variations of it. You don't trust, like, oh, if i is less than 50, and then a couple of lines later, you'll say, like, assert that i does not equal 50. And it's like, well, it can't equal 50, because you already said it's less than 50. Or maybe I'm messing up. Anyways, you sort of, like, repeat conditionals that don't need to be there. Um, or you put a lot of parentheses in your math operations. And so, you know, I come from C++, so this is like a big deal, because everyone's always performance-nerdy. But, like, you know, oh, I have a constant and, you know, another constant, but I put my
Starting point is 00:28:04 parentheses in such a way that I'm telling the compiler, I don't want to let you, you know, pre-multiply these constants together at compile time, uh, because I'm over-prescriptive about what I'm doing. Um, so when we're writing queries, you're sort of saying, if you're over-prescriptive in some way, like, you're very, very stringent, then you're sort of limiting the hand of the database and sort of how it can do things. Can you maybe, like, unpack that a little? Like, what exactly are the kinds of things it does or doesn't care about, and what would you sort of be looking for as, like, code smells, I guess, in the, you know, you're-making-it-slower-than-it-needs-to-be sense, right? And I think I would say, well, I mean, in general, you need to do
Starting point is 00:28:41 what the intent is you're trying to implement, right? So sometimes you may just have to do an order by with three columns and a filter clause, a where clause, with five different things you're looking for, right? So I think it depends a lot on what you're trying to do. I think you actually mentioned an interesting case with compilers, right? So I think that situation where you're optimizing for certain things because the compiler, you know, does things a certain way, and so, you know, you have to write them with all these parentheses. Um, unfortunately, at least today, you also have to do a little bit of that with databases. So in Postgres specifically, I wouldn't even say here's one right way or wrong way to do it. I would rather say, learn how to learn about the database. So, how do you figure out what the
Starting point is 00:29:18 database is doing, and how do you understand the different choices available, in terms of how the database can find your data, ultimately. Because generally I would say, yes, it's better to not over-specify, like, the where clauses, but there are exceptions. There are actually cases, I've, you know, I've seen real-life situations, where I'm joining two tables and adding a where clause actually improved performance drastically, because it allowed a different kind of join to happen. Versus, if there was no where clause, it had to, like, read one table first and then do a nested loop over the other one. And so it really does depend a bit. Um, and what was most important in that situation for me was being able
Starting point is 00:29:53 to... So in Postgres there is a command, explain. And if you put explain in front of a query, what the database will give you is essentially the query plan. Well, if you do explain analyze, it actually executes, and says, here is the plan I used, and here is, you know, how long each part took. If you just do explain, it just gives you the plan that is most likely going to be used if you run it again. Um, and so what that really tells you is what the database is doing. And so, one of the most important feedback loops, right, like, in programming it might be a REPL, right, like, you're typing something in an interactive shell and it gives you back a result. And so one of the most important feedback loops in the database world is, I would say, that I write a
Starting point is 00:30:28 query, I do an explain, I see the query plan, I think it's a bad query plan, I change my query, right? Um, maybe add an index, so that, database, you know, I try different indexes, I see what sticks, essentially, right? Um, but that interaction, you can't really get around that; that just has to happen, right? Even if you have best practices, even if you, you know, don't over-specify your conditions, sometimes you just have to get in and do that feedback loop and do that iteration. So when you run explain on a query and you get back the query plan, I'm sure I've done this before, I know I've done it before, but it's not something I do routinely, so it's not top of mind to me. But, so, like, let's say Postgres, and I send the query plan to the, I guess there's some
Starting point is 00:31:09 sort of execution engine that is going to run, I think. And you were saying, and you sort of mentioned something interesting, which is what keyed me off, which is like, that it thinks it's going to run. So is it true that like, if, if we had two databases that were the same schemas, but yours was an order of magnitude more data than mine, and I send my query to mine, you send it to yours, would we get back the same explain? That is like, is it invariant? And it's like just a schema
Starting point is 00:31:36 and the execution engine, same version? Or is it somehow like understanding the size of your data? It's monitored queries in the past and is somehow like trying to do this like at what level is it sort of building that plan yeah and i think to a big extent this again does depend on the system right so i'll speak to postgres but definitely this does vary between database systems um so generally speaking it's definitely deterministic or invariant how we're going to call it right but essentially it's definitely deterministic, or invariant, however you want to call it. But essentially, the planner in Postgres, with the exception of essentially one feature I can think of right now,
Starting point is 00:32:10 is if you give it the exact same files on disk, and you run it on a different server, it's going to give you the same plan. Now, the problem is that usually you're not really aware of all those details. I would say the most important things... Let me actually take a step back just for people to follow along more easily, right? So again, like think through you're sending this
Starting point is 00:32:29 query to a database. So what does the database actually do? Maybe just to like give you a visual because I find that helpful. So like query comes in, right? So the first thing the database does is actually parses the query. So the engine, in this case, Postgres, turns that query into a parse tree. And that parse tree then essentially gets, you know, analyzed and says, hey, you're looking for this particular table and such. Now, the part we're talking about is after that initial parsing, right after the database essentially knows, you know, this is what you're looking for these tables, you're querying, then it actually has to figure out, you know, how do I get that data. And so that's the component that is trying to, you know, come up with a career plan is called usually the planner or in Postgres,
Starting point is 00:33:04 sometimes the optimizer. And that's really the part where it looks at that you know parse tree that comes in like the query that comes in in combination with a couple of other things i'll talk about in a second and then says here is the plan and then what happens is that in postgres called the executor goes and executes that plan right and then ultimately the executor is what sends you back your query result. Now, if we're doing explain, what we're doing is we're just looking at the output of that planner component without doing the execution, unless you're doing the explain on lies I mentioned earlier. Now, looking into what does this planner actually do, right?
Starting point is 00:33:37 Like, how does it come up with its result? And in most systems today, the way this works is it essentially does a cost-based estimation. So it tries to say, if I were to go ahead and execute this, or if my friend, the executor, would go ahead and execute, how expensive would it be, right? So is it more expensive to use this index or that index? Is it more expensive to, you know, join this table first with this other table? So it's trying to essentially go through all these different variations of how to get the data. And the way it does that is it attaches a cost to it like an actual cost value which is a floating point value and it says you know it's a tree that it builds ultimately so it says you know join this first and then do that and so like ultimately it comes up with a cost at the end right so the top there always sits a cost on the plan and that cost is the lowest cost right like that's kind of what
Starting point is 00:34:24 the point there's there's again caveats there but like generally it's speaking cost is the lowest cost right like that's kind of what i mean there's there's again caveats there but like generally speaking it's the lowest cost that it tries to find and then that's the best plan from database perspective now in terms of it being reproducible between systems it what's important for us to look for is what is the data it actually looks for when it makes these cost estimations right um and so we talked about an example earlier where you're doing that select star from table limit one and you don't care about the order, that's actually going to be cheaper in terms of cost, right? Because the database can estimate, well, most likely, you know, I'll not have to look at the whole table because I'll just find trenching row at some point. And so it will
Starting point is 00:34:59 actually discount the cost for that, right? It will actually reduce the cost because of that. And the way it can do that is because it has statistics about the data in the database right it wouldn't want to read the whole data but it has statistics about how frequent are certain values right and so if it knows actually when you're looking for a birth date that everybody has like maybe everybody inputs like 2000 the january 1st 2000 for some reason right Because they're just like being funny. And let's say you're looking for that. The database will know that that's the most common value in that table. And so we'll actually be able to do different optimizations
Starting point is 00:35:33 because of that and give you a different plan. And so when you compare between databases, what matters is that those statistics that database scatters are actually the same because if they're different, then it's going to give you a different plan. One last note on that, just to make sure i included that is the other thing that's really important is the actual physical size of the table and of the indexes so a issue that a lot of people run into is they're trying to compare their development environment
Starting point is 00:35:59 with a production environment and statistics are a problem yes but the other issue is that the planner will know how big your table is and if your table is 100 gigabytes it's not going to of a production environment and statistics are a problem yes but the other issue is that the planner will know how big your table is and if your table is 100 gigabytes it's not going to do sequential scan but if the table is just one megabyte it is going to suggest a sequential scan and so sometimes i see people asking why is my database not using my index and the reason is well you're running this locally and the database thinks your table is tiny so it's just gonna not use an index because it doesn't need one oh wow that's actually like several steps more complicated than i was thinking it would it would be so we mentioned a bunch of really interesting stuff there so understanding the
Starting point is 00:36:36 data sizing i can kind of see that like it's sort of a function of like you know how big various caches are and speeds out to the hard drive and the amount of RAM you're using and this kind of thing. The statistics one is somewhat intriguing to me, which is understanding what are useful statistics to measure. And then presumably, whenever data is modified or inserted into the table, rather than re-scanning and computing the statistics, presumably it's incrementally updating them as it goes. But then using that, that actually, I don't know, that's not obvious to me that like,
Starting point is 00:37:11 you know, like you said, birthdays, right? We know, for instance, there can only be, you know, in a month and day, only 366, I guess, birthdays that there can be. And so chances of collision are really high. The birthday paradox or whatever that's called the birthday birthday i don't know what it's called anyways uh so so i mean not that it's getting that clever but as you mentioned like some things uh you know have like skew really left like all really small values and very few really big numbers and you there are various techniques for kind of understanding that and so that's actually really interesting are there hints
Starting point is 00:37:42 you provide uh you know at the table to say, hey, as I'm building this, I kind of know in advance I'm going to be interested in these kinds of things or these are queries I'm going to want to run later or it's just not kind of worth it? Yeah, and I would say it again depends very much on your database system, right? Like what it provides.
Starting point is 00:37:58 So the way this works in Postgres is these statistics are collected by a process called analyze. There's also an automatic process called auto-analyze. And so auto-analyze will just run on the periodic schedule depending on the amount of change in a table, essentially. So if you do a lot of updates, a lot of inserts, at some point, that auto-analyze
Starting point is 00:38:16 counter essentially has gone to the point where it kicks off an analyze. But what it does then, it actually samples your table. So it looks for, I think, by default, 30,000 records. Don't quote my discipline. I look it up in Postgres documentation. But the point is it looks at a certain amount of default part of the table.
Starting point is 00:38:33 And then it actually, by default, takes just 100 records from that. And so there is a way to customize a couple of things here in Postgres specifically. For example, you can tell it actually look at more data. So don't just look at those 100 ultimately that it saves, but look at the thousand, for example. So oftentimes people say, you know, store more statistics so we know more in more detail, you know, how often, like let's say birthdays, right?
Starting point is 00:38:56 If we want to cover every birthday and we just do day and month, then, you know, we could just say, let's raise that statistics target to 400. Keep it easier. Don't know how to do 366, just do 400. And then Postgres will actually remember exactly how oftentimes each of those values showed up in its sample. The other thing, and I'll just drop this here in case you ever get to this point, but just to note that the database also offers this, is Postgres has a way to do what's called extended statistics. And that's essentially where it collects even more statistics
Starting point is 00:39:29 that are slightly more expensive to calculate, which is why they don't always do it. But then you can have things like, is this column dependent on this other column, right? Like, for example, do people with the name Lucas always have birthdays in August? I don't know. My birthday's not in August.
Starting point is 00:39:42 But the point is, if there are correlations between those values then often sometimes that matters a lot for complex career plans and so there is a way to instruct the database to measure that information specifically so that then you get better career plans uh yeah i can imagine probably similar to compiler there's like a whole like in-depth rabbit trail of complexity. So I guess like slightly shifting off of that. So the other thing I guess when you were explaining, you know, trying to build these costs is the one thing as someone who kind of doesn't know, but always kind of boggles my mind
Starting point is 00:40:15 is handling transactions. So if you're doing some read, modify, you know, right, you know, kind of cycle, or you're doing something transactionally into the database, I imagine like how the plan executes, is it just trying to minimize the sort of costo? Or is it thinking a bit about like the probability that like, a right comes in and interrupts the query and has to go again? Or how is that sort of like component? Maybe maybe I'm off, it's fine if this is like not relevant. But I guess like, that's always one of those things
Starting point is 00:40:44 that like all of this has to be done. Yes yes but it all has to be done in the phase of like data that could be changing out from under you and it has to be done consistently right and i think this is where you'll once again you'll hear hear me say this a couple of times but um you'll once again see that you know postgres is this is one way our data is to do differently um so postgres has you has a system called MVCC, Multi-Version Concurrency Control. And I mean, it's more of a general term, right? Outdated pistols have two,
Starting point is 00:41:13 but Postgres has a particular way of implementing MVCC. And Postgres has received criticism for that too. Like people sometimes criticize that for performance reasons and such. But one of the most important things that the MVCC implementation in Postgres does, it solves exactly the scenario you're describing, right? Where I'm doing a select, but whilst
Starting point is 00:41:30 my select is running, because maybe I'm looking at a big portion of the table, right? So it takes some time, then a write comes in. And so there's two choices here. Either we block that, right? So I finish my select and the write is blocked, or they are able to run concurrently. Now, if you implement this naively, if you built your own database,
Starting point is 00:41:45 you should probably just block the writes, right? Like keep it simple. But in a concurrent like real world system, that doesn't work, right? Because you just have too much stuff going on at the same time. And so that's really where what Postgres does at the most fundamental level
Starting point is 00:41:57 is if you think back to those pages, right? With those rows. So each of those rows, I mentioned earlier that they're actually called a tuple in Postgres. Now that distinction matters because it's technically a row version not a row because there can be rows that are physically there in that page that you're not actually seeing because they're for example an insert that came in after you started your select right and so postgres has a way to essentially say well this is this is something from the future right so you can see
Starting point is 00:42:23 these like even as you're doing your select the postgres execut, this is something from the future, right? So you can see these, like, even as you're doing your select, the Postgres executor might encounter things from the future that it's just going to ignore because they're from the future, right? And it's really fascinating because, like, you have to think about this in performance optimization too because you can also encounter things that are in the past and not relevant to you anymore, but they're still physically there, right? Because, again, it's a physical structure.
Starting point is 00:42:44 And so, like, it really, like like the part where it sucks for you ultimately is that there are situations where you can have data that you need to look at but because of these visibility rules it's actually no longer there right like you're not supposed to see it it's just you have to physically walk over it because it's in that file still okay so it's trying to put put these things in a place where you'll see them but then like you said there's some i guess i always think like some counter some timestamp it is a counter like yeah your query is like at some counter version and it knows like anything less than this or greater than this like it needs the highest value up to but not greater than your your ticker value i guess uh and so it's scanning and
Starting point is 00:43:25 it's seeing stuff that's crazy i don't know this starts to get really and this is why i guess people always sort of have this caution where as you said if you're building your own it's better just to block people have this they slowly end up building their own database and then not thinking about these things right i always just have a file now i need to have it updated now i need to and then you slowly backing your way into uh some of this complexity so the query planner has this explain function and it's giving you these costs when it goes to the execution it has to kind of respect all these all these rules um so as someone writing the query plan this is i we most been talking about it sounds like reads like joins and order buys
Starting point is 00:44:00 are these important for inserts as well or or composite statements like how is the sort of like breakdown of like when is it like yeah yeah it's it's probably not something worth worrying about yeah i would say it depends yeah i mean definitely so anything that's more complex is going to be more interesting to think about right so if you're joining tables even if it's just two tables oftentimes that can already be an important question which is what you're joining first like are you first looking at table a or table b right um or can you look at them both at the same time um it gets much more relevant of course if you're joining five tables together or you're looking only for a small part of this table then you're using that
Starting point is 00:44:36 to find something else in another table um if you're just inserting data um i don't think so postgres technically still makes a prayer plan for that because what you can do in postgres is you can do insert um into the table and then select and then you actually run a query to insert the data right so you are essentially running a full query just to get the data you're inserting um but if you're just passing data to the server there's not no magic to that right it's literally just writing it out there's different ways to do it like there's copy in postgres which is way faster than insert blah. But it's generally not something you need to worry about from a planning perspective. And the same applies to, you know, like, let's say you're changing a schema or something like those are called utility statements in Postgres,
Starting point is 00:45:15 they're totally separate from query plans. So really, it matters most of your reading data. And then what is it so so when you're you know writing your your queries and they're complex and you're you know looking at your query plans like on the balance i guess there's this uh decision between like you said maybe you can't optimize your query it has to be a certain way or there's just like yes it's frustrating but it's no good and then you can kind of sort of look to i guess indexes we were talking about that but also I guess to uh schema definitions like what is this sort of troubleshooting debugging go from like okay I have this query it's not doing what I want you know we talked about query plan which is sort of like I guess how to mutate the query in
Starting point is 00:45:56 some form to make it better — so you've exhausted that. What does this sort of trip down the optimization path look like? Yes, so usually what I would do in practice — I'll just walk you through my personal debugging approach here. So imagine we have the query, right, and my situation is I'm the application engineer, or the backend engineer, who got handed, you know, kind of a SQL query that's bad, from the data team or something, and they're like, this is so slow, you're making the whole database slow, fix it, right? And so I'm like, okay, what do I do? And so the first thing I would do, again, is go look at a query plan to understand what's actually happening, right? Because up until now I might have just thought about this function call in my application,
Starting point is 00:46:37 which uses an ORM, so I don't even see the SQL, right? So the first step is actually looking at the SQL; the next step is looking at the query plan. And then really what oftentimes helps in Postgres specifically, and the same applies to other systems, is looking at the physical IO that's being done. So don't just look at the plan, but look at how much data it actually has to fetch. And so the way you do this in Postgres is you pass... So first of all, you do EXPLAIN ANALYZE,
Starting point is 00:47:00 which actually executes the query. So it doesn't just say here is the plan, but it also says when it executes the plan, what part takes how long, right? And this helps you then say, well, you know, in my really complex query plan where I'm doing joins and insert, like, sorry, index selection and stuff and whatnot,
Starting point is 00:47:17 this index scan is slow, right? Like this particular index scan — we looked at the whole index and that was really slow. And the reason you know this is by doing EXPLAIN ANALYZE versus EXPLAIN. Now, EXPLAIN ANALYZE has an option called BUFFERS, and if you pass BUFFERS — it's kind of this terminology thing, which is confusing: a buffer in Postgres is also a page; Postgres uses those interchangeably. So when we say buffer, what we really mean is those eight-kilobyte pages that I talked about earlier, right? So these portions of
Starting point is 00:47:42 the file on disk. And so if you pass that BUFFERS option, it will show you how much of the file it ultimately had to read. It will tell you, I had to read 100 buffers, and then you have to do the mental math and say 100 times eight kilobytes — that's this much in actual bytes. And so that way you know, for each part of the plan, not only how slow or fast it was, but also that it actually took this much IO, essentially. What I would do next is kind of what you were alluding to earlier, right? Which is try to think about: is this an indexing problem? Or is this more of a data modeling type problem, right? Or a query structure problem, right?
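A minimal sketch of what that looks like in practice, assuming the same hypothetical orders table — the plan nodes, timings, and buffer counts shown in the comments are illustrative, not real output:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, sum(total)
FROM orders
WHERE created_at >= now() - interval '7 days'
GROUP BY customer_id;

-- Illustrative output (abridged):
--   HashAggregate (actual time=412.3..415.8 rows=1200 loops=1)
--     Buffers: shared hit=1843 read=9100
--     ->  Seq Scan on orders (actual time=0.02..398.1 rows=250000 loops=1)
--           Filter: (created_at >= ...)
--           Buffers: shared hit=1843 read=9100
-- "read=9100" means roughly 9,100 eight-kilobyte pages came from disk,
-- i.e. on the order of 71 MB of physical IO for this one node.
```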
Starting point is 00:48:23 So I think they're all different directions, right? I can change the way my query is written, I can change my table definition, or I could add indexes. Oftentimes, changing the query — how you write it — or adding, or sometimes removing or changing, an index, those are the easier choices, because they're really fast to do; indexes are cache structures of sorts, right? So you can just create new ones and the Postgres planner will choose them if they're better, most of the time. And so most of the time, I would say, probably start with understanding
Starting point is 00:48:54 what the query is trying to do. If there's a simple change you can make in the query, try that first because it's going to be fastest, right? Then next, look at indexes. And then really only if that doesn't work does it help to look at the actual data model. I'll give you one example of where the data model makes a big difference: again, think back to this physicality of looking at those portions of the file on disk. If your rows in your table — if
Starting point is 00:49:18 each row is very wide, right — so you have a lot of columns in them, maybe a lot of text columns that have a lot of text in them — then they will take up a lot of space in each page. And so what happens is, let's say you have this eight-kilobyte page and each row takes one kilobyte, so at most you'll have probably seven, maybe eight rows per page, if the math works out. Now imagine instead of that one-kilobyte row you had a 100-byte row — suddenly you can fit roughly ten times more rows into each page, right? And so the thing that matters there is, if you have a really wide table, that can sometimes be a big problem. And so it does sometimes make sense to essentially think about making tables small from
Starting point is 00:50:02 a physical perspective, in terms of each row being small — having fewer columns, or fewer text columns in particular — because then you're optimizing for that. Yeah, so that's, I mean, I guess that's pretty universal, it's like a data locality thing, which is like, oh, hey, I want to,
Starting point is 00:50:18 if I have a JPEG attached to the row, it might be better to store it somewhere else with an ID and then put the ID in the row. That way, as you're mentioning, you don't have to physically read in pages just to get past the JPEG — maybe handle the JPEG separately — you don't have to physically read in multiple pages trying to get to the next page of actual data that you're interested in. You're trying to make it really compact, so that rows go through the reader as fast as possible, the highest throughput, and
Starting point is 00:50:46 you do that by moving out data that you're unlikely to query — you're not querying for text inside a JPEG — so it's stored externally or differently, so that your queries run fast. That's right, and I'll add one quick thing just so that people don't get confused if at some point you do dive into this detail. So one important thing to know in Postgres in particular is that if your data gets beyond a certain size, there's actually separate storage for it. So what you just described — Postgres has a way of doing this, right? So if you, for some reason, do decide to store a JPEG, which you shouldn't do — don't store images in your database — but if you do do that for some reason, and they are multiple megabytes in size, Postgres will actually store it in what's called TOAST. Not the, you know,
Starting point is 00:51:14 thing you eat, but — The Oversized-Attribute Storage Technique, I think. Okay, but the point is, it's a thing. And so if you have really large values, they're actually less of a problem. The issue is more if you have these medium-sized values, right — things that are not large enough to be stored separately, but are large enough to mess up your page structure. That's the issue.
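One way to get a feel for this — a rough sketch assuming a hypothetical documents table with a body text column — is to compare the raw length of a value with the space Postgres actually uses for it, since large values end up compressed and/or moved out to the table's TOAST storage:

```sql
-- Per-value: raw length vs. bytes actually used (after compression / out-of-line TOAST storage).
SELECT id,
       octet_length(body)   AS raw_bytes,
       pg_column_size(body) AS stored_bytes
FROM documents
ORDER BY octet_length(body) DESC
LIMIT 10;

-- Per-table: total space for the table including its TOAST data.
SELECT pg_size_pretty(pg_table_size('documents')) AS table_including_toast;
```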
Starting point is 00:51:43 Okay, so it's basically — you confuse Postgres, so it doesn't know, is this something I'm actually going to need to access, or is it just a blob? And so there's some ambiguous overlap where, yeah, it doesn't know where to put them. Okay, that makes sense. And then, good database recommendation: don't put your JPEGs in the database — but I know people are still going to do it, so it definitely happens. Okay, so we've covered the indexing, and, like you said, lastly the schemas. I think this is one of those things too that gets fairly debated, but I'm a big fan of thinking about your data models, and in general, even for code, I think it's under-thought-about. And so people don't really sit
Starting point is 00:52:29 down and think, how does my data actually need to look, what goes in it — people just sort of start writing code and shoving stuff into classes, structures, tables; it's all manifesting the same sort of problem. But sitting down — you get into this discussion where, I guess, you hear people talk about normalizing or denormalizing your data, like star schemas in data warehousing versus anything else. I mean, feel free to just say, no, I don't want to talk about that, but any commentary or thoughts on whether it's better to put as much stuff locally into the row, or put it into separate tables and do joins? What is your philosophy, sort of? I mean, the one thing I'll mention — so denormalizing and normalizing is important, right? So the, I mean, the
Starting point is 00:53:09 most — let's put this in layman's terms — it's like, do I have one copy of my data or multiple copies of my data, right? So sometimes it makes sense to essentially write the exact same value to multiple tables — that's denormalizing it — and the benefit can be that you don't have to look at the other table to get the data, right? But — this is one case, and it's very situational — I would say don't denormalize unnecessarily, right? In the simplest case, start out by just storing your data once; it's going to be smaller, it's going to be easier, and then if you encounter issues where that's not feasible, then think about denormalizing. The other thing I'll mention, because I think — we talked about this at the very beginning, about document stores versus relational databases. And one of
Starting point is 00:53:50 the things that Postgres has is this data type called JSONB. And JSONB is essentially a binary variant of JSON, which is slightly more optimized for indexing. And so the big benefit of JSONB is that instead of you having to specify each of the columns that you're going to need, you actually just have this JSON column — or JSONB column, rather. And this JSONB column is just storing a JSON document, right? It's just storing key values, and it can be nested and all that stuff. And so what you can do that way is you can just throw your data into Postgres and then query it without having that rigid structure. There are limitations, and it's definitely going to be slower in some ways than just having separate columns.
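A minimal sketch of what that can look like, assuming a hypothetical events table — the column names and the metadata keys are made up for illustration:

```sql
CREATE TABLE events (
    id         bigserial PRIMARY KEY,
    created_at timestamptz NOT NULL DEFAULT now(),
    metadata   jsonb NOT NULL DEFAULT '{}'
);

-- Throw loosely structured data in without declaring a column per key.
INSERT INTO events (metadata)
VALUES ('{"type": "signup", "plan": "pro", "utm": {"source": "newsletter"}}');

-- Query into the document with the JSON operators...
SELECT metadata->>'plan' AS plan
FROM events
WHERE metadata @> '{"type": "signup"}';

-- ...and a GIN index can speed up containment (@>) lookups on the column.
CREATE INDEX events_metadata_gin ON events USING gin (metadata);
```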
Starting point is 00:54:30 But if you're unsure about the structure of your data, right — if you don't know yet, maybe there's metadata attached to your objects — then using something unstructured like a JSONB field is the way to go and is what I would do oftentimes. So I guess it's just, don't optimize too early, right? Rather, keep things normalized, use a JSONB column where you need that unstructured information, and then if you need to, optimize the structure later on. This has been super useful. I, you know, know what databases are, and in fact I've tried to encourage people — we probably should be using more of them — but I've generally avoided them in my life, so I've never had to run this stuff down. It's always just sort of been one of those things that's, kind of, it's a little
Starting point is 00:55:14 confusing, it's pretty in-depth, and I feel — I don't know — we're going to kind of try to transition here, but for me, these techniques you're talking about are critical and useful, but at one point it became like — when I was coming out of school, being a database admin was a big thing, and then people did not want to be database admins. I don't know the current zeitgeist around database admins, but a lot of people just got this, you know, I do not want to be looking at queries, and I do not want to do optimization — that is a dead-end job, you do not want it, you're going to be stuck racking servers, or whatever the mindset was. I don't know where it
Starting point is 00:55:55 came from, but I feel like there are lingering after-effects even today where people have this hesitancy to engage with databases. Yet we see all the major tech companies very reliant on quite traditional uses of databases — you know, they gussy them up, or, I don't know, make them fancy by sharding them and making them distributed and all of this, and we've had great podcast interviews with folks working on that — but I feel like still, at the engineer level, a lot of people would just rather avoid it, for, I don't want to say stigma reasons, but... it's been great to hear your explanation of this, and it's really not that different than the compiling and debugging
Starting point is 00:56:34 stuff you would do, or at least I do, day-to-day in sort of C++ code — making code run faster, don't use more data than you need to — it's all the same concepts, right? And I would say, you know, administration is uncool, but you don't have to call it administration, right? It's the same with system administrators — people didn't want to be called system administrators anymore, maybe new people coming into the industry didn't want to be called administrators, and so now we have DevOps engineers, right? It's the same thing with databases now — people are called data platform engineers instead of DBAs. I think
Starting point is 00:57:06 you pointed out one thing — if I think about database performance work, the part that I really enjoy is that I can make a difference pretty easily, often. Oftentimes you can get drastic performance improvements — something takes multiple seconds, people have a slow experience, and it goes down to milliseconds and it's super fast — and that is really rewarding, right? As somebody working on it, as an engineer working on it, it's really rewarding to get that kind of performance improvement by doing boring administration, in a sense. I'm with you. I mean, running the debug tool, timing code, and seeing literally, like,
Starting point is 00:57:42 7,000x speedups, right? You know, like you said, somebody has some program, it takes minutes to run, and now it runs in a few milliseconds. This is something I find enormously rewarding. But it's this grunge work of going in, putting the logging in, and paying attention — but maybe you and I are broken the same way. I find that enormously rewarding, because it's just like, haha, you did this. It's like code golf, I guess, or whatever — here's my speed, it's so much better than your speed. So I was going to transition now. We opened up with the fact that you're the founder at this company, pganalyze. I guess, tipped off by — you've already said the word "analyze," and Postgres
Starting point is 00:58:20 shortening to PG there — I think we might be tipped off to what it is a little. But maybe help explain what your company does, a little bit about it, and just tell folks what you're up to. Sure. And I'll try to put this in, you know, from the perspective of why I as an engineer care about that and why I started a company. So let me give you just a little bit of background
Starting point is 00:58:42 on how the company came to be. So ultimately I started it as a side project, which is close to 11 years ago now. So it's been a while. But I've only been, you know, kind of full time on it in the last couple of years. And so we now have a small team, you know, essentially supporting it. And ultimately, what I set out to do with pganalyze at the beginning was giving myself better introspection into the database, right? So back then, again, all focused on Postgres, right? As you pointed out, PG, short for Postgres.
Starting point is 00:59:09 And so the reason that I started that as a project back then was that I tried to say, you know, what does the database think is going on, right? Like if I look at the database and I see like CPU utilization or IO utilization, it doesn't really tell me much. And so the very first thing that we did back then was just query performance metrics, essentially. And so it was just saying, you know, here is this query that was running and this query took the most time and there's various ways in postgres to get that data and to kind of say you know this is
Starting point is 00:59:33 a query that's essentially making the system busy and so that's how we kind of set out to you know kind of just have a way to say here is what's most slow on the database um now over the years um and especially more recently, what we've turned this into is not just, you know, kind of that monitoring and observability side, but really also giving recommendations.
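One of the standard ways to get that kind of per-query data in Postgres is the pg_stat_statements extension — a rough sketch, assuming the extension is available (it also has to be listed in shared_preload_libraries) and a Postgres 13+ install, where the timing columns are named total_exec_time and mean_exec_time (older versions call them total_time and mean_time):

```sql
-- Enable it once per database.
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Which normalized queries are consuming the most total execution time?
SELECT query,
       calls,
       round(total_exec_time::numeric, 1) AS total_ms,
       round(mean_exec_time::numeric, 2)  AS mean_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```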
Starting point is 00:59:51 And so we already touched upon indexes earlier, but one of the things that I'm, I think, reasonably proud of is our implementation of how we make index recommendations. Now, it's actually not as good as I'd like it to be, nowhere near it,
Starting point is 01:00:03 but it is, you know, I think a system that is a really good starting point that says, here is my query workload. Here are the suggested indexes, essentially. Here is what's missing. You're querying for these things. You're doing these where clauses and these join clauses. We think this index would be helpful to make your queries faster.
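Trying out a recommendation like that is usually just a matter of creating the index and re-checking the plan — a sketch with the same hypothetical orders table and a query filtering on customer_id and created_at:

```sql
-- CONCURRENTLY avoids blocking writes while the index builds
-- (note: it can't run inside a transaction block).
CREATE INDEX CONCURRENTLY idx_orders_customer_created
    ON orders (customer_id, created_at);

-- Re-run the plan and confirm the new index is actually chosen and faster.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM orders
WHERE customer_id = 42
  AND created_at >= now() - interval '30 days';

-- If it doesn't help, dropping it is just as easy.
DROP INDEX CONCURRENTLY idx_orders_customer_created;
```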
Starting point is 01:00:22 Go try it out. Human in the loop type system, right? Actually try it out and see if it makes a difference. Well, that's awesome. So you guys, so how, I mean, so you're an extension, you're sort of like,
Starting point is 01:00:35 how does that, how does it like integrate in? I have a database I'm running, like your service comes alongside and sort of monitors what mine is doing. And then I'm able to sort of get these suggestions and try them out. So we'd love to be an extension, but we're not.
Starting point is 01:00:47 So the problem with extensions — this is kind of a technicality — but the sad part is that extensions usually require you to have access to the database server. And what we find in practice is that most people today don't run their own database servers. They use managed services like Amazon RDS or Google Cloud SQL or Azure Database.
Starting point is 01:01:04 And so the issue is that you can't really write a custom extension and expect your customers to run it, right? I mean, you yourself can certainly run it, but we intentionally did not require any custom extension. What we do is we have people install an agent, and that agent essentially sits next to the database, right? Like, it's in a container or in a virtual machine. And essentially it runs SQL queries itself
Starting point is 01:01:26 to get data from database about like statistics that are happening. And then the other thing it's doing, it's looking at the error logs. So we do a lot of log parsing because that's where we get additional information, like which query plan happened at which time, like there's ways to tell Postgres
Starting point is 01:01:39 to log query plans. And so we essentially pick up those query plans to then kind of put them in a more easily accessible UI. Oh, wow, that's awesome. That's interesting. Yeah, I guess I didn't really think about that, but you're right.
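The usual way to get Postgres to log plans for slow queries is the auto_explain module — a sketch of the relevant settings (the 500ms threshold is a made-up example; for all sessions the module would normally go into shared_preload_libraries in postgresql.conf rather than being loaded per session):

```sql
-- Session-level sketch; requires sufficient privileges to LOAD the module.
LOAD 'auto_explain';
SET auto_explain.log_min_duration = '500ms';  -- log plans for queries slower than this
SET auto_explain.log_analyze     = on;        -- include actual row counts and timings
SET auto_explain.log_buffers     = on;        -- include buffer/IO counts in the logged plan
```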
Starting point is 01:01:51 Like most people are probably not like, you know, installing locally a Postgres database and running it and they're running it somewhere in the cloud. So like adapting to that and still being able to make it work, I guess like it's even less obvious to me that it would work.
Starting point is 01:02:05 But yeah, the fact that you guys, that's actually really cool. And so this started as like sort of a side project is clearly something you're passionate about. I mean, like we've sort of touched on these subjects sort of all together and building a tool that sort of like helps you do the things that you would do naturally,
Starting point is 01:02:20 but, like, faster. And, you know, for people who maybe don't have that background or expertise, really helping people who run into slow queries be able to break them open and look at them. Have you felt like — is it mostly people who are pretty experienced, saying, oh, these are actually just incredibly time-saving, and I know, I can tell, what it's doing is smart and good. And
Starting point is 01:02:45 I like it. And I want to use it to save time. Or is it more people are like, yeah, I have no clue. Like, I'm just going to, you know, click yes, whatever it says, and I'm just going to trust you to be in charge of everything. Yeah, I would definitely say it's both, you know, it really depends. So you know, we, the good thing is, we have enough customers that I definitely don't know all their names. And you know, I don't know all the use cases either. But I think one pattern that I definitely see, and this is, you know, for the folks listening, if you're currently in college, for example, you may, this may be like, at least conceptually a long way from where you're at. But what's out there often in industry these days is that you have usually what's called data platform team. And so there's often in big
Starting point is 01:03:25 organizations there's a central team that operates data stores right in databases and so one of the things that we found is that those you know centralized teams that manage the databases they ultimately work with application engineers right because application engineers write the code they make all these feature changes and so the challenge that they have is that they are usually just a small team like you might have 100 application engineers, 200 application engineers, but there's just like five data platform engineers that are like wrangling all the databases. And you know, if any of the databases is down or slow, everything is on fire. But there's still just this tiny team. And so really, where we come in, and where we found, you know, making the biggest impact,
Starting point is 01:04:01 so to say, is just enabling those teams to then ultimately hand off more things to the application teams, right? So to give better tooling to the application team so that the application team can say, hey, maybe this is an obvious issue that PG Analyze can help me identify. So I don't need to spend as much time with that really overwhelmed data platform team.
Starting point is 01:04:22 And so it just becomes that way of kind of collaborating more effectively. That's awesome. And then, how — I guess a bit of an adjacent topic, a bit off-topic — sort of taking something that you had been doing for a lot of years, that had kind of been, like you said, a side project, and turning it into a company: how has that experience been? Any sort of introspective tips or suggestions for people out there? This is, for anyone who's sort of new — you will have these thoughts: you will sit someday in a big company, if you ever go there, and you will think, why am I here, I could do this on my own, I could be making more doing this on my own. If you're not at
Starting point is 01:04:58 a big company you're going to say why am i doing this i could just be at a big company you know so these thoughts are at least for everyone I've ever spoken to, these are always sort of at war in our heads. But from someone who's sort of, I think you said you had been at Microsoft for a while. Now you're sort of doing this on your own. You've kind of played both sides of it, I guess. Any sort of like thoughts or tips from the trenches?
Starting point is 01:05:19 Yeah, I think, you know, there's the saying: if you like drinking coffee at a coffee shop, don't create your own coffee shop, right? Don't start your coffee shop just because you wish there was a coffee shop around the corner, because running a coffee shop is not the same as drinking coffee at a coffee shop. I'm paraphrasing, but you know what I mean, right? Running a business is definitely not the exact same thing as imagining using that business as an end user, imagining it exists, right? So I think there is something to be said about, do you actually want
Starting point is 01:05:49 to run a business because it it's great don't get me wrong i really enjoy working on my own terms and you know having a team of folks working you know with me and kind of like that's all great but it is not necessarily the same as saying i'll just go and create a project for fun right so i think there's an important distinction there. Now, I will say that it's definitely possible, but you have to have a lot of patience. So I'm very much, I mean, we bootstrapped the company, we have no outside funding. And because we bootstrapped it,
Starting point is 01:06:15 that's why it was a side project for a long time is because revenue just wasn't there in the beginning. It just took a long time to even be able to pay one person's salary. And if you're willing to do that, I personally believe that most people listening to this will be able to create their own business if they really go for it, right?
Starting point is 01:06:31 But the challenge is that you have to have that longevity. You have to have the motivation. And so I think what I would do, if I would do it again, is focus on something that you, A, enjoy working on. B, can use yourself, right? So ideally, you're your own customer in some way or form because then it's just much more motivating um and you know c you actually enjoy building the business side of it not just you know like you actually want
Starting point is 01:06:54 to run the coffee shop yeah i guess like for for folks out of the industry i mean i guess that like eating your own dog food was like very weird term to me the first time I sort of like heard it. Everyone sort of takes it for granted now, but like I was like, what? It is a weird term. I was like, this is, this is like interior and people even say, oh, dog food programs. And like, there's all these, and there's even other variants I won't go into for different big companies, but this is like a very common term, which is, you know, what Lucas is referring to here is like, if you can build something and be your own customer for that thing, and like,
Starting point is 01:07:27 force yourself to use it, like, that just makes the iteration cycle better. Like you're teaching yourself something, you're staying engaged with it. I think we talked about at the beginning, you guys started playing video games, I feel like video games are very obvious example, because you're like, well, of course, I would play this video game. And you I don't know anyone who would sit down and say, I'm going to build a video game i don't want to play like everyone sits down and says i'm going to build a video game that i want to play and so they're inherently saying i want to build something that i know whether or not they enjoy playing it by the end that's a different different question um but maybe this loses some of its obviousness as people
Starting point is 01:08:02 sit down and try to say, I just want to build a business. And then this thing you mentioned, I think was great — for folks who don't know, this bootstrap-versus-taking-external-investment question is a very tough decision. And I haven't been through it myself, so I won't profess any knowledge or wisdom there, but it's great to hear someone share some examples. I think you hear a lot of pros and cons on both sides. But clearly a really big passion project for you — you can kind of hear it come across. And, you know, even in the beginning, Lucas was making sure people know that, you know, he's technical, but I mean, I think that's pretty clear from the explanations. But, you know, I think it's exciting and it's been great talking to you. And
Starting point is 01:08:42 I learned a lot today. I mean, there was a huge gap in my knowledge of what actually happens. At a really low level, I know what a B-tree is. At a really high level, I know what a Postgres server is. But kind of unboxing some of the middle parts and some of these optimizations
Starting point is 01:08:58 and what happens in between, as well as a ton of tips and tricks along the way, sort of heuristics about when to... This has been great, Lucas. Is there anything else you wanted to tee up? We'll have the link to pganalyze — just pganalyze.com, so it's pretty easy to find. Anything else you want to send people to, or recommend they look at? For sure. And I'll actually add one more thing that we talked a little bit about earlier, but I'll make sure to mention it for this audience in particular: if you're interested in databases,
Starting point is 01:09:27 well, there are two things I'll point out. So if you're interested in databases in general, CMU actually has a course on databases where they publish all the lectures online. It's really good on databases more broadly. So you want to look for just CMU and database things — Andy Pavlo is the one who runs that — and that's, I would say, you know,
Starting point is 01:09:44 one of the best just online materials to learn about databases more broadly. If you're interested in Postgres in general, one thing I would recommend is Postgres like talks about everything in the open. So Postgres has mailing lists. It's old school in that sense. So you can actually follow along,
Starting point is 01:10:00 which is really cool. You can actually follow along with Postgres development. It's really technical, right? But if you're ever really interested in this at the lowest level, you can just subscribe to the mailing list, and you can just see what people are discussing and how new features get contributed, so that can be really fascinating if you're into that side of the house. And if you want to contribute to Postgres, there's also, each year, Google Summer of Code, where Postgres participates, so you can actually have an official project and a mentor
Starting point is 01:10:26 that help you kind of contribute to Postgres. Now on PG Analyze, I do a weekly video series called Five Minutes of Postgres. So if you're interested in Postgres, each week on the PG Analyze YouTube channel, there's just a five-minute video where I talk about what I found interesting that week. So sometimes that's, you know, usually it's other people's blog posts that I use as a starting point, but I'll talk about things like the slow query optimization that we talked about earlier today,
Starting point is 01:10:49 new features and new Postgres releases. So it's just a way to stay on top of Postgres. And then pganalyze — pganalyze.com. We're also on Twitter as pganalyze. Awesome. And also kudos for remembering the multiple punch-list things to go through and getting back onto the topic.
Starting point is 01:11:06 That was very smooth. Very impressive, Lucas. Well, I've enjoyed having you on the show. I hope folks find this as enjoyable and educational as I did. And databases are a very important topic for our industry, with lots of different avenues and directions. So I've learned a lot today, and it's been great to have you on the show. Thank you for coming. Yeah. Thank you so much for hosting.
Starting point is 01:11:30 All right. And we'll see everyone next time. See everyone later. Music by Eric Barndollar. Programming Throwdown is distributed under a Creative Commons Attribution-ShareAlike 2.0 license. You're free to share, copy, distribute, and transmit the work, and to remix and adapt the work, but you must provide attribution to Patrick and me, and share alike in kind.
