Programming Throwdown - 166: Speedy Database Queries with Lukas Fittl
Episode Date: October 16, 2023
- pganalyze: https://pganalyze.com/
- Weekly series "5mins of Postgres": https://www.youtube.com/channel/UCDV_1Dz2Ixgl1nT_3DUZVFw
- How Postgres chooses which index to use: https://pganalyze.com/blog/how-postgres-chooses-index
- CMU databases courses: https://db.cs.cmu.edu/courses/
- Postgres community: https://www.postgresql.org/community/
As well as social links:
- Mastodon: https://hachyderm.io/@lukas
- Twitter/X: @pganalyze, @LukasFittl
- GitHub: @pganalyze, @lfittl
- LinkedIn: https://www.linkedin.com/in/lfittl/ , https://www.linkedin.com/company/pganalyze/
★ Support this podcast on Patreon ★
Transcript
Programming Throwdown, episode 166: Speedy Database Queries with Lukas Fittl. Take it away, Patrick.
everyone welcome to another episode.
Pretty excited about this one. Well, I think I say that every time.
It's true every time. Try to bring you guys good content.
Today, talking a little bit about some part of databases we haven't talked about.
We've had a fair amount of people on to talk about various aspects of databases and learned a lot.
Today, Lucas is here and he's going to help us understand some a little bit lower level stuff,
some optimizations and queries.
We had a bunch of good thoughts, even in the pre-recording.
I took notes. I don't know how many of them we're going to get to,
but I'm excited to have Lucas here.
Lucas is the founder at PG Analyze.
Welcome to the show, Lucas.
Thank you. Thanks for having me.
All right. So we normally start off by talking a little bit about how people got into tech so the question we normally tee up at the beginning is what was
your first sort of like computer programming experience like do you have like a formative
moment where you're like oh yeah that's that's it that's magical yeah good question so how did
I get into tech? It was probably, I think it was my dad's laptop, which was probably running, you know, Windows 95 or whatever came before that. So I'm in my 30s now, for context,
and i remember what fascinated me back then there was this game where you had these apes
throwing bananas at each other and you could program it because it was just written in i
think a version of Visual Basic or something, or like BASIC, I think, at the time. And so how I got into programming, you could argue, is going in and modifying that game
so you know the ape would throw the banana a bit stronger essentially um and so that kind of stuff
was like fascinating for me early on and how i think i got actually into serious programming was
probably um you know with game programming um from a hobby perspective. So I wanted to build my own game engine.
I wanted to build my own games.
Let's say the most I got to was
there was this competition over the weekend
where people were building games with Pygame.
And so I ended up building a really simple game
where you were rolling a ball through a tube
and going around obstacles.
And that game actually worked
compared to all the other ones
that I coded up the engine for
but didn't create any assets for. But that's really how I got started with, you know, kind of
programming as something that i enjoyed at the time um and then more you know later how i actually
got into the tech industry was I, um, actually left school, I wouldn't recommend it, but I left school, um, at the age of 16 and essentially just started working.
And my first job was a hosting company, essentially, you know,
putting servers in racks, like back when you still had physical servers
and also writing code to support, you know,
somebody provisioning in the backend for that hosting company.
And so really that's, you know, where I would say I got
my first professional experience actually, you know,
working with customers, working with, you know, applications,
working with databases. And that's, you know, a long time ago now, since I was 16. And then I pretty soon actually got into the more entrepreneurial side of this, right? So
back in, if I remember correctly, 2007, we started a company together where I was essentially, you know, one of the co-founders, and technically CTO, I think, at some point as well, I forget.
But essentially, creating a blogging network.
So something like Tumblr or Twitter or X, if you call it these days.
But much earlier, essentially.
And so the backing database for that was Postgres.
And so my interest in Postgres really started back then in that startup where we essentially had that blogging site that got a lot of visitors and had a lot of traffic coming into that site.
But ultimately, it was always a database query that, you know, fetched your results if they weren't cached.
And so that's really where my interest in Postgres and the database world came in was, you know, from an application perspective as an application engineer, encountering frustrating slow experiences.
Wow, that was awesome. That was a whirlwind tour. I think that, you know, mentioning these various sort of stops along the road, and sort of, I don't want to say meager, but there's an underpinning, like, oh, working at a hosting company and putting servers in racks. I feel like everyone, or even myself, I look around and see everyone where we are at today, and, you know, you're at a company, you know, Jason and I are at tech companies or whatever, like,
these kinds of things, people look at them and say, Oh, hey, like, wow, it must have been this,
like, you know, meteoric rise. And I guess I do know some people who this kind of is, but then
others, it's just like this story of just like, no, it's like really basic job after basic job,
and just like continuing to just sort of roll it forward. So I guess like that's an encouragement to me to hear it, but also to other folks out there
listening like, oh, man, sometimes it takes that first job. It's like all about the next step or
whatever. Maybe some people will sort of like innately have the everything set up for them to
just sort of like go directly into a quote unquote dream job. But some people it just really takes
finding it, finding your way, growing. Not to say that you didn't maybe actually love all those jobs, uh, but sometimes you look back on them and think, like, wow, I can't believe I did that. At least for me. And it was funny, you know, like, so I worked at
microsoft a couple years back and my first big company job at microsoft i was like wow you know
i never thought it would get here because i i remember thinking i could never get a job at
apple or google right like i was not you know, the Ivy League graduate who could, you know, actually have that background
that they look for in those interviews.
But there's always a backdoor of sorts into those companies, right?
But it's just not for the front door, right?
Like if your resume looks like you dropped out of school at 16, like they're not just
going to be like a rubber stamp, right?
But if you have a way of showing your skills and showing your work, right, in some other
way, then there's always a way in the system essentially to, you know, find that dream job of yours, if that's what
you're going for. I have opinions on big companies, by the way, but if that's what you want, there is a
way to do it. Yeah, I feel like, I have opinions about Ivy League schools, about big companies, but maybe you have to save those up for another time, or, yeah, a bonus episode or something. One thing I remember, when I got my first tech job, it was much less than half of what a
starting engineer makes at one of these big tech companies or like fang companies, let's
say.
But I remember this is a kind of a really weird thing.
But I remember I had to do something like go get a pen from the
cabinet because i didn't have a pen at my desk and i walked back and i realized oh i made like
10 cents in the walk or it was like three cents or something like that but that was such a big
deal it's like wow you know they paid me a quarter to you know go and get this pen and this notebook
and that just blew my freaking mind that, like,
oh, the salary that you get paid as part of everything.
And yeah, it's true.
It really is kind of stepping stones.
I feel like now, and Patrick maybe can attest to this,
in management, in leadership,
there's just always kind of a crisis.
And I feel like, uh, you know,
you have to kind of build the battle scars to walk into the office every day where there's some kind
of crisis and say, all right, you know, let's get this figured out and let's move
on to the next one. I feel like that would have just completely destroyed me as a new
hire yeah and it's fascinating
I think, you know, to me it is, you know, that's why I now run a small company, right? And so
to me it is this notion of there's nobody else like everything stops with me ultimately as a ceo
right so if if things don't work yes of course i can tell people to improve but ultimately
you know i need to be the one making sure that it happens and that's a really different mindset than
being a cog in a wheel so to say which sometimes can happen in some jobs right where you're just
like doing what you're told, but, like, your buck stops really early, right? Like you're just
shipping the code and not even shipping it you're just writing the code you're committing it and
you're done, right? Versus in a startup, you actually have to care about the outcome of the code.
yeah that's an important distinction and lesson and i'll also say one of the interesting things
and you're kind of mentioning it that it never stops happening is people believe this like you
said, there's going to be an issue and you're just going to tell someone to take care of it,
and it's like, I mean, maybe at some level, or if you've grown a really good team, you can kind
of claim through your hard work that you've gotten to that point but i was trying to explain to some
people the other day, it's, you know, I have a team of people who work with me, but you can't tell
them what to do you have to get them to buy in convince them like explain teach them like to
make those decisions jointly so any expertise i have is only as good as like how much i can imbue
into others the same feelings and same understandings and context and you know having to
trust each other to make good decisions. But it's this weird thing where everyone believes, or even, you know, I was explaining to my kids, and they have this thing like, oh, you know, you've been there a long time, like, you just tell everyone what to do. It's like, yeah, no, that doesn't work. Like, it's all about trying to, like,
explain what you would like explain a plan like trying to convince others to buy in and so people
believe it's this, like, boolean flag that just gets flipped, like, oh, you're senior or you're a manager or you're whatever, and that means anyone without that flag, you get to
tell what to do. And it's like, no, it does not work like that. All right, so I guess, like, trying to get us on to our story arc today, uh, you mentioned something in your sort of intro
there that you started uh you got introduced to Postgres and you were sort of talking about,
and I think this is a, you know,
interesting segue is a lot of people get very focused
on what tool to do a job, right?
Not what like kind of tool,
but the specific name brand on the thing.
And you see it in other hobbies.
It's not like it's unique to programmers,
but you know, I don't know.
I'm going to pick something I don't know anything about.
Oh, you know, running.
I want to go running.
What exact shoe do I need? Is it going to be Nike? You know, is it going to be Reebok? Like, they get very hyper-focused, and then you hear people just say, no, just go run. And, like, no, no, you don't understand, I gotta know which shoe. Uh, and so I think sometimes people can get
caught up with uh you know choosing a specific database you know uh brand or a specific program or which one they're going to run and i
think we're going to talk about a bunch of stuff today that's going to cut across hopefully a lot
of different things in this sort of generic, the underworkings, the underpinnings, and often, I'll say, in some ways they are more similar than they are different. And being aware of the differences is important, but being aware of the similarities is more important, maybe equally important. Well, we'll figure it out.
But you had mentioned something in the intro about Postgres.
I thought it was really, really interesting.
And we'll drop it here just because I want to make sure we hit it and you kind of mentioned it,
which is Postgres, unlike a lot of other databases, is open source.
And so being open source, the community vibe is a little bit different.
Can you maybe speak a little like your introduction to Postgres?
And I think even now you're still working with postgres so like obviously it's sort of stuck
with you, and why does it continue to be something that, uh, you work with? For sure, yeah. And I think it's interesting you mentioned open source as the qualifier. I actually, you know, these days, open source as a term has gotten a bit muddled, because a lot of databases are open source or
open core in some way or form um i think what really differentiates Postgres to me is that it's not run by a single entity.
It's a community project, right?
And so if you think of the Linux kernel, for example, it's a very similar community project.
I mean, actually, the one big difference is, in the Linux kernel, there's Linus Torvalds at the head of everything, versus Postgres doesn't have that person necessarily.
But what's interesting to me with Postgres in particular is that
it's 27 years old now.
So, you know, it's almost as old as I am.
Not all the way, but almost there.
And I'm sure it's older than some of you listeners here.
And I think
what's interesting is that it has survived
the test of time, right? Like it wasn't
a fad that went away. It actually, you know, survived the, like, changes that happened over the years, cloud databases, right? Like, if people deploy a database in the cloud today, a lot of times it's Postgres. Probably more than half the relational databases deployed in the cloud today are going to be Postgres, you know, with some caveats. But I think, like, where
it really comes back to right is that it's a community project. Everybody can contribute to it.
It's in many ways had all the downsides of community projects.
So sometimes you might have to look for tools beyond the core product.
So one of the things that we did with my company is provide a monitoring and optimization tool for Postgres called PG Analyze.
And the idea there is essentially to add the parts to Postgres that, you know, are not there yet. And similarly, you know, what I think has stood out about Postgres
over the test of time also is the fact that you can extend Postgres. So we don't do much of that
personally. But what many people have done, you know, there's companies like TimescaleDB or
Citus Data that I used to work at personally, that extend Postgres by creating an extension
for Postgres. And that allows you to ultimately build a very different database into the core engine. And so I think
coming back to which database should you choose? I think I'm cool with any choice. So I will never
fault anyone for choosing, you know, my SQL or MongoDB. It's all cool. The reason I personally
choose Postgres as my default is because I think it has the extensibility that covers a lot of
different use cases. And so you kind of have that wiggle room, like if you're suddenly looking to store
columnar data, there is a way to do that in Postgres. If you're looking to do time-series
data, there's a way to do that in Postgres. And if you want to work with embeddings in your AI
system, there's a way to do it in Postgres with pgVector. And so those flexibilities are very
convenient, right? Because oftentimes, if you choose a specialized database too early then you will just hit a you know brick wall and then you
can't do anything. Yeah, I think this is something I've heard echoed a lot, and, um, I guess you kind of said stood the test of time. I guess, like, the other word there is, like, you know, fad. So
there's like fad databases i don't know people who like you said choose let's say a specialized
database um and i think and maybe we can talk about it briefly but like sort of document store
databases or just key-value stores with no relational part, no real query engine, leave people in the lurch when they realize they need to do a query, right? Like, oh hey, I have this key-value
store it's it's super cool it's super fast like i feel really good about myself and then you know
someone comes in and we're going to talk about this a little later, hopefully, but like someone
comes in and says, oh, I want some aggregation, I want whatever. And you're like, uh, okay, there's no SQL, I've got to write this myself. Or you find yourself, like, writing a SQL engine over your thing. Like, you know, and oftentimes it's some form of premature optimization,
I guess, like people are trying to like narrow in for some future concern of what happens if i have a billion visitors to my website and you don't have one it's like okay first get
you know a handful make some good decisions or whatever but yeah you kind of mentioned that like
early specialization or going to a niche database. And it's interesting that you say sort of Postgres
has, I guess I knew that, but I didn't know, it's like, I hear about extensions for all sorts of things for Postgres,
and it's really a way to customize it later,
potentially for lower cost
than sort of completely ripping out
and switching to a new database
that's sort of like layering on an extension.
Do I sort of have that model kind of right?
Yeah, I think so.
I mean, you could,
I think one good example is Postgres,
you know, has a lot of data types.
So Postgres can be typed, right?
So it's not just a, you know, throw in text and it's always text you can actually say hey this
is a url and then there's some validation because urls have certain structure and so one of the
most basic extensibilities of postgres is that you can do your own custom data types so if you know
a way that you want your data to be validated and you want your data to be stored but you want the
input to be this particular text format you can write an extension that does that for you in postgres and so it's that right it's the fact
that you don't have to contribute a change to the core database to add a new data type you can
actually just do CREATE EXTENSION my_data_type, and that's all you need to do, right? Like, ultimately, you write a C extension, if you do that kind of stuff and low-level parts. There's also a way to do this in Rust these days.
So if you want to have more type safety, there's a good way to write complex extensions in Rust.
And then sometimes extensions can be as simple as just a SQL script, right?
Sometimes it's just literally, I want to have this function in all my projects.
So I can just say, create extension, my function.
And then that's just a way to package it, essentially.
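For illustration, a minimal sketch of what that packaging looks like; citext is a real contrib extension that ships a custom data type, while my_url stands in for a hypothetical type you might write yourself:

    -- Load a packaged extension; citext adds a case-insensitive text type.
    CREATE EXTENSION IF NOT EXISTS citext;

    -- Once installed, the new type behaves like any built-in type.
    CREATE TABLE accounts (
      id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
      email citext UNIQUE
    );

    -- A custom type of your own (written in C, Rust, or plain SQL) would load the same way:
    -- CREATE EXTENSION my_url;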
Cool. Maybe let's dive real
deep here for a second, and then, uh, we'll see how this goes. If it goes bad, it'll be on me, not on Lukas. But we're going to try to dive deep into databases. Okay, so I guess, like, not every database, but a lot of databases, you'll sort of hear about the sort of storage, and I guess even
you know maybe this goes back to a lot like spinning metal hard drives maybe people don't even know what that is anymore
Um, but, you know, sort of, people talk about, uh, you know, using a B-tree, and we don't have to go into, like, that's always really hard, like, describe to me what a B-tree is, no, no, let's not do that, please. Uh, but, like, you know, you have this B-tree, which is a generalization of a binary tree, so it's this ordered tree structure that sort of lives, and we're going to insert data into it. Um, and so the idea here being we want to store something and
then later we're going to you know kind of query for it so um i guess like when you start at this
level this feels like not that hard right like oh i'm just going to insert something into a hash map
and then i'm just going to you know query it back out um and this
is really, you know, no big deal. And so I guess sort of, like, the first thing is, like, making
sure it goes on disk so like it's not in memory but it actually needs to be on disk that way if
your you know service dies or whatever you know it's preserved and you can kind of kind of load
it back up. Um, but leaving that as sort of the base, and maybe I set you up poorly, but, like, you
know, sort of the base there, like, what are some of the, like, mechanics that a database offers? We're going to try to get to
ultimately talking about, like, why queries get slow. Um, but I start putting data in this, you know, tree, and the tree is, you know, ordered in some manner, and I want to write, you know, queries against it. Sort of, what kind of is the growth that leads me from there to
like all of a sudden, something was fast,
and now it's not fast?
Yeah, I think this is a great topic.
We'll have to go into a lot of details there.
All right, yeah, let's go.
So I think, let's forget about data structures for a moment.
So I think, let's forget about trees or hash maps.
I don't think, from a fundamental perspective, they necessarily matter, right? We can definitely talk about how B-trees work, but I don't think that's actually the most important thing. I think the most
important thing is to understand that ultimately what most databases that are not in memory
databases do for you is that they store data in a file on disk, right? An actual file on your disk, um, in a sense. Um, and they make access to portions of that file very efficient, right? Because, like, usually what that
means is in a relational database you have a table in postgres for example a table gets represented
as a file and then if you have indexes each index also gets its own file but you don't always want
to look at the whole file right what you want to do is you want to look at a portion of the file
just enough to answer the question you're asking, right? Which might be, give me this particular person's birth date
by email address or something, right?
Or like give me the top 10 users of my website or something.
And so really what the database does for you
is figure out the best way to work with those files,
both from a reading and a writing perspective
so that you don't have to, right?
Because the alternative here is that you and your own code
and your own software do the same thing right you do file open and then file read
and file write. But ultimately, what you would usually, most likely what I would do, is I would just literally read the whole file and then put it in a hash map in a Python or Ruby script, right? And then just look at it. And it's just not workable if you're talking gigabytes or terabytes of data, like, it's just too slow. And so really the big job of the database is to make that interaction way more effective, way more
efficient and then also lets you write queries in a way where you don't have to write the code to
do all these lookups right but you can actually express your intent using sql and then database
figures out how to locate the data you're looking for. Okay, that was much better than I did. All right, so what I hear here is, like, uh, that's all right, I'll give it to you, no, no. Uh, the separation of concerns, that's how I would sort of term it, which is, like, you know, you're writing some application
that like you said sort of wants quick access to a wide variety of data past the point where you
know you're going to necessarily want to just have it all in memory and all these other benefits
And so the database's job is to, I think you said it really well, like, make the smallest portion of data available, or need the smallest portion of data available, in order to answer your questions, and then provide sort of an API, in this case SQL, a query language, so that you can
tell the database what it is that you're looking for. So and then its job is to
as efficiently as possible, give you back your answer. That's right. And I think from performance
perspective, the one thing that I'll mention that's really important to understand is that
oftentimes the database will look at a lot more data than what is in the result that it gives back to you, right? So oftentimes performance issues are like, you might just
be looking for a single row, like in this case of the birthdate and the email address, right?
But the database actually has to look at the whole user's file, aka table in this case,
right?
And actually, in the worst case, read the whole file, right?
Until it finds the matching row.
And so really...
Oh, go ahead.
I was just gonna say, so what would cause that?
So if I have a table, I have a file,
what would be the difference between
able to efficiently go to the specific row
versus have to look at the whole thing?
So I think ultimately what it,
I mean, so the simplest thing you could do,
so maybe, again, important to remember
is that usually in systems like Postgres,
and I think this applies to at least most relational systems
like MySQL and such,
is that the index and the main
table are separate, right? So the main table, I'll talk a little bit in detail just because it's helpful to think about this visually, in a sense, right? So the main table in Postgres is usually separated into what's called pages, and they're eight-kilobyte pages. And so each eight-kilobyte page has, you know, a number attached to it, like just an integer number that counts up.
And then in each page, you can have one or more rows,
which Postgres calls tuples, for various reasons. We may not have to get into that.
But the point is, these rows that you're looking for,
they're in these pages.
Now, if you knew which page you're looking for,
you could very much like a phone book, right?
You could just look in the phone book on page 20
and you knew that there's your row somewhere on that page now the challenge is that oftentimes you don't know that
right you don't know on which page the data is you're looking for and so short of you know just
reading through the whole like each individual page right until you get to the matching row
really what you want is an index that sometimes can also answer the thing directly but usually
in many cases what the index
does is just point you back at the right page right so the index just says hey i know you're
looking for this user with this email address um that's in page like 67 and you know you just then
read that one page of eight kilobyte right it's an eight kilobyte read essentially um and then
you know that in that eight kilobyte read somewhere is going to be that row that you're looking for and that's really where indexes come in is just reducing that lookup that you have to
do. Now, the important thing to remember, of course, is that the index is again a file, and so we don't just have to do I/O for the page on the main table, but we also have to do I/O
for the index itself and that's where it gets, I think, conceptually more challenging.
Even I have challenges these days when you ask me
how much overhead is it going to be
if you read something from an index.
It's really hard to estimate that because in a B-tree,
you have to walk the tree.
So you ultimately start at the root page,
figure out, okay, do I go left, do I go right?
Do I go left, do I go right?
At some point, you end up on that thing
that's essentially matching a query or multiple like index entries that are matching your query um and
then these index entries will have a pointer back to one or more pages and so it's hard to do a
conceptual model for that, but I think that's roughly how I would try to describe it. So, right after you said it's rough, I'm going to attempt it, because I want to make sure I kind of understand this. So the index is, so the data that you want to feed into the query,
so one or more entries,
like let's say email address.
So it's like a mapping of email address to page number
for the rest of the data associated with that row.
But it could be,
I don't know what the right term is, compound.
It could be composite.
It could be more than one thing.
And so the idea is the index file gets opened,
and then searched for finding the sort of entries in the key there. And then it gets the value of
the page. And then it goes to the actual table file, opens that and then efficiently goes to
that page sort of by number lookup, because you have it. And then, when people talk about building a database table and saying this thing is the key, so that rows have some unique identifier or whatever, is that inherently different, or is it still just an index file that, like, happens to have a shorthand name because it's so common?
Yeah, it depends a bit on the database. So in Postgres, which I'm most familiar with, it's
mainly a convention, in a sense. So if you have a primary key, right, let's say we have a users table, the users table has three columns, an ID, an email, and a birthdate. Usually the ID would be the primary key; you could also make the email a primary key. And really, one of the distinctions with primary keys is that they're usually, well, they always have to be unique. So email might not be the best choice if you want to support, you know, the same email being used twice. So let's suppose for a moment that the ID column is
the primary key. Now, in this scenario, we're describing where we're looking for the email,
the ID column actually does not get involved at all, unless we're looking for the ID, right? But
if you're just looking for the birthday value, and we're just doing select birthday from users
where email equals something, then we never need to look at the ID. Now, in Postgres specifically, primary keys matter mostly when you're joining tables
and you're like trying to say this is you know kind of like i'm talking about this one record
and there's only ever going to be one record because i already kind of am grouping in a certain
way. And so it's mostly relevant if you're joining things and you're, like, using the ID column
in a join.
Then it matters a lot from a query performance optimization perspective.
But otherwise, it doesn't make a big difference if you're looking for an ID value versus an email value.
They're essentially both indexes in this case, right?
So primary key is an index.
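As a rough sketch of the users example being discussed (table, column, and index names here are hypothetical):

    CREATE TABLE users (
      id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- primary key = a unique index
      email     text NOT NULL,
      birthdate date
    );

    -- A secondary index so email lookups don't have to scan the whole table.
    CREATE UNIQUE INDEX users_email_idx ON users (email);

    -- The index maps the email to a (page, tuple) location, so this read touches a few
    -- index pages plus one 8 kB table page instead of the whole file; the id column is
    -- never consulted here.
    SELECT birthdate FROM users WHERE email = 'lukas@example.com';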
Okay.
And so then reversing that, I think you said birthdate.
So if you're using birthdate and you didn't know in advance
that you were going to be searching on birthdate,
then this is where the database has no option.
It doesn't have an efficient way to find birthdate.
And so this is where it needs to end up going through the table,
or does it do some magic there as well?
No, I think ultimately, if you just said select,
let's say select email from users where birthdate equals something,
but you don't have an index on birthdate, then there would be no way for the database to do it efficiently beyond just, like, looking through the table until it might find matches. And unless you're doing a LIMIT 1 in your query, it may have to look at the whole table, right? If you're doing LIMIT 1, then maybe it gets lucky and the first page has that row, or, you know, it's 50% on average or whatever. Um, and I mean, this is also again where it matters, like, how queries are written matters in a huge way for how the database can optimize things for you, right? So if you're very specific, you're like, I want this order and I want, you know, the top five, then that's going to be more expensive than I don't care about the order and I just want one, um, because that's going to be faster.
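Continuing the same hypothetical users table, roughly the cases being described:

    -- No index on birthdate: the database may have to read the whole table.
    SELECT email FROM users WHERE birthdate = '2000-01-01';

    -- With LIMIT 1 it can stop at the first match it happens to find.
    SELECT email FROM users WHERE birthdate = '2000-01-01' LIMIT 1;

    -- Asking for a specific order plus the top five is more work than "any one row".
    SELECT email FROM users ORDER BY birthdate DESC LIMIT 5;

    -- Adding an index gives the planner a cheaper path for these queries.
    CREATE INDEX users_birthdate_idx ON users (birthdate);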
Okay, yeah, so maybe actually that's a decent segue. All right, so we have these sets of tables, the database is managing them, we have some indexes over them, and then now we're sort of talking, you were kind of moving to the next level, sort of like writing queries. And then, as you said, there's, um, I feel like this happens in programming as well.
Uh, I at least made this mistake a lot, and I've seen other people do it.
When you first start writing a program, you're worried you didn't write your conditional
correct.
So you write your conditional multiple times or other variations of it.
You don't trust it, like, oh, if i is less than 50, and then a couple of lines later, you'll say, like, if i does not equal 50. And it's like, well, it can't equal 50 because you already said it's less than 50, or maybe I'm messing that up. Anyways, you sort of, like, repeat conditionals
that don't need to be um or you put a lot of parentheses in your math operations and so
you know i come from c++ so this is like a big deal because everyone's always performance
nerdy but like you know oh i have a constant and a you know another constant but i put my
parentheses in such a way that I'm telling the compiler, I don't want to let you, you know, pre-multiply these
constants together at compile time uh because i'm over prescriptive about what i'm doing um so when
we're writing queries you're sort of saying if you're over prescriptive in some way like you're
very, very stringent, then you're sort of limiting the hand of the database and sort of how it can do things. Can you maybe, like, unpack that a little? Like, what exactly are the kinds of things it does or doesn't care about,
and what would you sort of be looking for as like code smells i guess and the you know you're making
it slower than it needs to be right and i think i would say well i mean in general you need to do
what the intent is you're trying to implement, right? So sometimes you may just have to do an order by with three columns and a where clause with five different things you're looking for, right? So I think it depends a lot on what
you're trying to do i think you actually mentioned an interesting case of compilers right so i think
that situation where you're like optimizing for certain things because the compiler you know does
things a certain way and so you know you have to write them with all these parentheses um
unfortunately at least today you also have to do a little bit of that with
databases. So in Postgres specifically, I wouldn't even say here's one right way or wrong way to do
it. I would rather say learn how to learn about the database. So how do you figure out what the
database is doing and how do you understand the different choices available in terms of how the
database can find your data
ultimately because generally I would say yes it's better to not over specify like the where clauses
but there are exceptions there are actually cases where I've you know I've seen real life situations
where I'm joining two tables and adding a where clause actually improved performance drastically
because it allowed a different kind of join to happen versus if there was no where clause it had
to join like read one table first and then do a nested loop over the other one
And so it really does depend a bit. Um, and what was most important in that situation for me was, in Postgres there is a command EXPLAIN, and if you put EXPLAIN in front of a query, what the database will give you is essentially the query plan. Well, if you do EXPLAIN ANALYZE, it actually executes and says, here is the plan I used and here is, you know, how long each part took. If you just do EXPLAIN, it just gives you the plan that is most likely going to be used if you run it. Um, and so what that really tells you is
what the database is doing and so one of the most important feedback loops right like when programming
might be a repl right like you're typing something in an interactive shell and gives you back a
result. And so one of the most important feedback loops in the database world, I would say, is that I write a query, I do an EXPLAIN, I see the query plan, I think it's a bad query plan, I change my query, right? Um, maybe add an index; you know, I try different indexes, I see what sticks, essentially,
right? Um, but that interaction, you can't really get around that, that just has to happen, right? Even if you have best practices, even if you, you know, don't over-specify your conditions, sometimes you just have to, like, get in and do that feedback loop and do that iteration.
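The feedback loop described above looks roughly like this, using the same hypothetical users table:

    -- Show the plan the planner would most likely pick, without running the query.
    EXPLAIN SELECT birthdate FROM users WHERE email = 'lukas@example.com';

    -- Actually execute the query and report how long each part of the plan took.
    EXPLAIN ANALYZE SELECT birthdate FROM users WHERE email = 'lukas@example.com';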
So when you run EXPLAIN on a query and you get back the query plan, I'm sure I've done this before, I know I've done it before, but it's not something I do routinely, so it's not top of mind to me. But, so like, let's say Postgres, and I send the query to, I guess, some sort of execution engine that is going to run it, I think.
And you were saying, and you sort of mentioned something interesting, which is what keyed
me off, which is like, that it thinks it's going to run.
So is it true that like, if, if we had two databases that were the same schemas,
but yours was an order of magnitude more data than mine,
and I send my query to mine, you send it to yours,
would we get back the same explain?
That is like, is it invariant? And it's like just a schema
and the execution engine, same version?
Or is it somehow like understanding the size of your data?
Has it monitored queries in the past and is somehow, like, trying to do this? Like, at what level is it sort of building that plan?
yeah and i think to a big extent this again does depend on the system right so i'll speak to
postgres but definitely this does vary between database systems um so generally speaking it's
definitely deterministic, or invariant, however you want to call it. But essentially, the planner in Postgres,
with the exception of essentially one feature I can think of right now,
is if you give it the exact same files on disk,
and you run it on a different server,
it's going to give you the same plan.
Now, the problem is that usually you're not really aware
of all those details.
I would say the most important things...
Let me actually take a step back just for
people to follow along more easily, right? So again, like think through you're sending this
query to a database. So what does the database actually do? Maybe just to like give you a visual
because I find that helpful. So like query comes in, right? So the first thing the database does
is actually parses the query. So the engine, in this case, Postgres, turns that query into a
parse tree. And that parse tree then essentially gets, you know, analyzed and says, hey, you're looking for this particular table and such.
Now, the part we're talking about is after that initial parsing, right after the database
essentially knows, you know, this is what you're looking for these tables, you're querying,
then it actually has to figure out, you know, how do I get that data. And so that's the component
that is trying to, you know, come up with a query plan is usually called the planner, or in Postgres,
sometimes the optimizer. And that's really the part where it looks at that you know parse
tree that comes in like the query that comes in in combination with a couple of other things i'll
talk about in a second and then says here is the plan and then what happens is that in postgres
called the executor goes and executes that plan right and then ultimately the executor is what
sends you back your query result.
Now, if we're doing EXPLAIN, what we're doing is we're just looking at the output of that planner component without doing the execution, unless you're doing the EXPLAIN ANALYZE I mentioned earlier.
Now, looking into what does this planner actually do, right?
Like, how does it come up with its result?
And in most systems today, the way this works is it essentially does a cost-based estimation.
So it tries to say, if I were to go ahead and execute this, or if my friend, the executor,
would go ahead and execute, how expensive would it be, right? So is it more expensive to use this index or that index? Is it more expensive to, you know, join this table first with this other table?
So it's trying to essentially go through all these different variations of how to get the data.
And the way it does that is it attaches a cost to it like an actual cost value which is a floating point value and it says you know it's a tree that it builds ultimately so it says you know join this
first and then do that and so like ultimately it comes up with a cost at the end right so the top
there always sits a cost on the plan, and that cost is the lowest cost, right? Like, that's kind of what I mean. There's again caveats there, but generally speaking, it's the lowest cost that it tries to find, and then that's the best plan from the database's perspective. Now, in terms of it being reproducible
between systems, what's important for us to look at is what data it actually looks
for when it makes these cost estimations right um and so we talked about an example earlier where you're doing that select
star from table limit one and you don't care about the order, that's actually going to be cheaper in
terms of cost, right? Because the database can estimate, well, most likely, you know, I'll not
have to look at the whole table, because I'll just find a matching row at some point. And so it will
actually discount the cost for that, right? It will actually reduce the cost because of that.
And the way it can do that is because it has statistics about the data in the database right it wouldn't want to read the
whole data but it has statistics about how frequent are certain values right and so if it knows
actually, when you're looking for a birthdate, that everybody, like, maybe everybody inputs January 1st, 2000 for some reason, right? Because they're just, like, being funny. And let's say you're looking for that.
The database will know that that's the most common value
in that table.
And so we'll actually be able to do different optimizations
because of that and give you a different plan.
And so when you compare between databases,
what matters is that those statistics
that the database gathers are actually the same,
because if they're different,
then it's going to give you a different plan.
One last note on that, just to make sure I include it: the other thing that's really important is the actual physical size of the table and of the indexes. So an issue that a lot of people run into is they're trying to compare their development environment with a production environment, and statistics are a problem, yes, but the other issue is that the planner will know how big your table is. And if your table is 100 gigabytes, it's not going to do a sequential scan, but if the table is just one megabyte, it is going to suggest a sequential scan. And so sometimes I see people asking, why is my database not using my index? And the reason is, well, you're running this locally and the database thinks your table is tiny, so it's just going to not use an index because it doesn't need one.
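To make the cost idea concrete, here is roughly what that looks like; the plans and numbers below are illustrative sketches, not output from a real database:

    EXPLAIN SELECT email FROM users WHERE birthdate = '2000-01-01';

    -- On a tiny local table, reading everything is cheapest, so you might see:
    --   Seq Scan on users  (cost=0.00..1.05 rows=1 width=32)
    --     Filter: (birthdate = '2000-01-01'::date)

    -- On a large production table with an index, a different plan wins:
    --   Index Scan using users_birthdate_idx on users  (cost=0.43..8.45 rows=120 width=32)
    --     Index Cond: (birthdate = '2000-01-01'::date)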
Oh wow, that's actually, like, several steps more complicated than I was thinking it would be. So we mentioned a bunch of really interesting stuff there. So understanding the data sizing, I can kind of see that, like, it's sort of a function of, like, you know, how big various caches are and speeds out to the hard drive and the amount of RAM you're using and this kind of thing.
The statistics one is somewhat intriguing to me,
which is understanding what are useful statistics to measure.
And then presumably, whenever data is modified or inserted into the table,
rather than re-scanning and computing the statistics,
presumably it's incrementally updating them as it goes.
But then using that, that actually, I don't know, that's not obvious to me that like,
you know, like you said, birthdays, right?
We know, for instance, there can only be, you know, in a month and day, only 366, I
guess, birthdays that there can be.
And so chances of collision are really high.
The birthday paradox or whatever that's called, I don't know what it's called. Anyways, uh, so, I mean, not that it's getting that clever, but as you mentioned, like, some things, uh, you know, have, like, skew, really skewed left, like all really small values and very few really big numbers, and there are various
techniques for kind of understanding that and so that's actually really interesting are there hints
you provide uh you know at the table to say,
hey, as I'm building this, I kind of know in advance
I'm going to be interested in these kinds of things
or these are queries I'm going to want to run later
or it's just not kind of worth it?
Yeah, and I would say it again depends very much
on your database system, right?
Like what it provides.
So the way this works in Postgres is
these statistics are collected by a process called analyze.
There's also an automatic process called auto-analyze.
And so auto-analyze will just run
on the periodic schedule depending on
the amount of change in a table, essentially.
So if you do a lot of updates, a lot of inserts,
at some point, that auto-analyze
counter essentially has gone to the point
where it kicks off an analyze.
But what it does then, it actually samples
your table. So it looks for, I think,
by default, 30,000 records.
Don't quote me on the specifics; look it up in the Postgres documentation. But the point is, it looks at a certain default-sized portion of the table.
And then it actually, by default, takes just 100 records from that.
And so there is a way to customize a couple of things here in Postgres specifically.
For example, you can tell it actually look at more data.
So don't just look at those 100 ultimately that it saves,
but look at a thousand, for example.
So oftentimes people say, you know, store more statistics so we know in more detail, you know, how often,
like let's say birthdays, right?
If we want to cover every birthday and we just do day and month,
then, you know, we could just say,
let's raise that statistics target to 400. Keep it easy, no need to do exactly 366, just do 400. And then Postgres will actually remember exactly how often each of those values showed up in its sample. The other thing, and I'll just drop this
here in case you ever get to this point, but just to note that the database also offers this,
is Postgres has a way to do what's called extended statistics.
And that's essentially where it collects even more statistics
that are slightly more expensive to calculate,
which is why they don't always do it.
But then you can have things like,
is this column dependent on this other column, right?
Like, for example, do people with the name Lucas
always have birthdays in August?
I don't know.
My birthday's not in August.
But the point is, if there are correlations between those values, then sometimes that matters a lot for complex query plans. And so there is a way to instruct the database to measure that information specifically, so that then you get better query plans.
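A hedged sketch of the knobs mentioned here, as I understand them; double-check names and defaults against the Postgres documentation, and note that first_name is a hypothetical column:

    -- Per-column statistics target: keep more detail for the birthdate column.
    ALTER TABLE users ALTER COLUMN birthdate SET STATISTICS 400;

    -- Extended statistics: tell the planner that two columns may be correlated.
    CREATE STATISTICS users_name_birthdate (dependencies) ON first_name, birthdate FROM users;

    -- Re-collect the samples so the new settings take effect.
    ANALYZE users;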
Yeah, I can imagine, probably similar to compilers, there's, like, a whole in-depth rabbit trail of complexity. So I guess, like, slightly shifting off of that,
So the other thing I guess when you were explaining,
you know, trying to build these costs
is the one thing as someone who kind of doesn't know,
but always kind of boggles my mind
is handling transactions.
So if you're doing some read, modify,
you know, right, you know, kind of cycle,
or you're doing something transactionally
into the database,
I imagine, like, how the plan executes, is it just trying to minimize the sort of cost? Or is it thinking a bit about, like, the probability that, like, a write comes in and interrupts the query and it has to go again? Or how does that sort of factor in? Maybe I'm
off, it's fine if this is like not relevant. But I guess like, that's always one of those things
that, like, all of this has to be done. Yes, but it all has to be done in the face
of like data that could be changing out from under you and it has to be done consistently
right? And I think this is where, you'll once again, you'll hear me say this a couple of times, but, um, you'll once again see that, you know, Postgres, this is one area where it does things differently. Um, so Postgres has a system called MVCC,
Multi-Version Concurrency Control.
And I mean, it's more of a general term, right?
Other databases have it too,
but Postgres has a particular way of implementing MVCC.
And Postgres has received criticism for that too.
Like people sometimes criticize that
for performance reasons and such.
But one of the most important things
that the MVCC implementation in Postgres does, it solves
exactly the scenario you're describing, right?
Where I'm doing a select, but whilst
my select is running, because maybe I'm looking at a big portion
of the table, right? So it takes some time,
then a write comes in. And so there's
two choices here. Either we block that, right?
So I finish my select and the write is blocked,
or they are able to run concurrently.
Now, if you implement this naively,
if you built your own database,
you should probably just block the writes, right?
Like keep it simple.
But in a concurrent like real world system,
that doesn't work, right?
Because you just have too much stuff
going on at the same time.
And so that's really where what Postgres does
at the most fundamental level
is if you think back to those pages, right?
With those rows.
So each of those rows, I mentioned earlier
that they're actually called a tuple in Postgres.
Now that distinction matters because it's technically a row version not a row because
there can be rows that are physically there in that page that you're not actually seeing because
they're for example an insert that came in after you started your select right and so postgres has
a way to essentially say, well, this is something from the future, right? So you can see these, like, even as you're doing your select, the Postgres executor might encounter things from the future
that it's just going to ignore because they're from the future, right?
And it's really fascinating because, like,
you have to think about this in performance optimization too
because you can also encounter things that are in the past
and not relevant to you anymore, but they're still physically there, right?
Because, again, it's a physical structure.
And so, like, the part where it sucks for you ultimately is that there are situations where you can have data that you need to look at, but because of these visibility rules, it's actually no longer there, right? Like, you're not supposed to see it, it's just, you have to physically walk over it because it's in that file still.
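You can actually peek at this row-version bookkeeping through Postgres's hidden system columns (again using the hypothetical users table):

    -- Every table exposes these system columns alongside its own.
    SELECT ctid, xmin, xmax, id, email FROM users LIMIT 5;
    -- ctid: the physical location as (page number, tuple index)
    -- xmin: ID of the transaction that created this row version
    -- xmax: ID of the transaction that deleted or updated it (0 while it's still live)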
Okay, so it's trying to put these things in a place where you'll see them, but then, like you said, there's some, I guess I always think, like, some counter, some timestamp. It is a counter, like, yeah, your query is, like, at some counter version, and it knows, like, anything less than this or greater than this, like, it needs the highest value up to but not greater than your ticker value, I guess. Uh, and so it's scanning and
it's seeing stuff, that's crazy. I don't know, this starts to get really deep. And this is why, I guess, people always sort of have this caution where, as you said, if you're building your own, it's better just to block. People have this, they slowly end up building their own database and then not thinking about these things, right? It's, I always just have a file, now I need to have it updated, now I need to, and then you're slowly backing your way into, uh, some of this complexity. So the query planner has this EXPLAIN function and it's giving you these costs. When it goes to the execution, it has to kind of respect all these rules. Um, so as someone writing the query, this is, we've mostly been talking about, it sounds like, reads, like joins and order bys.
Are these important for inserts as well, or composite statements? Like, how is the sort of breakdown of, like, when is it, yeah, probably not something worth
worrying about yeah i would say it depends yeah i mean definitely so anything that's more complex
is going to be more interesting to think about right so if you're joining tables even if it's
just two tables oftentimes that can already be an important question which is what you're joining
first like are you first looking at table a or table b right um or can you
look at them both at the same time um it gets much more relevant of course if you're joining
five tables together or you're looking only for a small part of this table then you're using that
to find something else in another table. Um, if you're just inserting data, um, I don't think so. Postgres technically still makes a query plan for that, because what you can do in Postgres is you can do an INSERT into the table and then a SELECT, and then you actually run a query to insert the data, right? So you are essentially running a full query just to get the data you're inserting. Um, but if you're just passing data to the server, there's no magic to that, right? It's literally just writing it out. There's different ways to do it, like there's COPY in Postgres, which is way faster than individual inserts. But it's generally not something you need
to worry about from a planning perspective. And the same applies to, you know, like, let's say
you're changing a schema or something like those are called utility statements in Postgres,
they're totally separate from query plans. So really, it matters most if you're reading data.
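Roughly what those two write paths look like; users_archive, created_at, and the file path are hypothetical:

    -- An INSERT that runs a full query to produce the rows it writes.
    INSERT INTO users_archive (id, email, birthdate)
    SELECT id, email, birthdate
    FROM users
    WHERE created_at < now() - interval '5 years';

    -- Bulk loading with COPY, typically much faster than row-by-row inserts.
    COPY users (email, birthdate) FROM '/tmp/users.csv' WITH (FORMAT csv, HEADER true);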
And then, so when you're, you know, writing your queries and they're complex and you're, you know, looking at your query plans, like, on the balance, I guess there's this, uh, decision between, like you said, maybe you can't optimize your query, it has to be a certain way, or there's just, like, yes, it's frustrating, but it's no good. And then you can kind of sort of
look to, I guess, indexes, we were talking about that, but also, I guess, to, uh, schema definitions. Like, where does this sort of troubleshooting, debugging go from, like, okay, I have this query, it's not doing what I
want you know we talked about query plan which is sort of like I guess how to mutate the query in
some form to make it better so you've exhausted that what is this sort of like trip down this uh
you know, optimization look like? Yes. Usually, what I would do in practice, I'll just walk you through my personal debugging approach here. Um, so imagine we have the query, right? And so my situation is, I'm the application engineer, or maybe the backend engineer, who got handed, you know, kind of a SQL query that's bad, um, from the data team or something,
and they're like this is so slow you're like making the whole database slow fix it right and so i'm like okay what do i do um and so the first thing i would do again is
go look at a query plan to understand you know what is you know what's actually happening right
because i might up until now have just thought about this function call in my like application
which uses an ORM, so I don't even see the SQL, right? So, like, the first step is actually looking at the SQL. The next step is looking at the query plan. And then really what oftentimes helps in Postgres specifically,
and the same applies to other systems,
is looking at the physical IO that's being done.
So don't just look at the plan,
but look at how much data does it actually have to fetch.
And so the way you do this in Postgres is you pass...
So first of all, you do explain analyze,
which actually executes the query.
So it doesn't just say here is the plan,
but it also says when it executes the plan,
what part takes how long, right?
And this helps you then say,
well, you know, in my really complex query plan
where I'm doing joins and insert,
like, sorry, index selection and stuff and whatnot,
this index scan is slow, right?
Like this particular index scan,
we looked at the whole index and that was really slow.
And the reason you know this
is by doing explain analyze versus explain. Now explain analyze has an option called buffers and
if you pass buffers um it's kind of this terminology thing which is confusing so a buffer in postgres
is also a page um postgres uses those interchangeably so when we say buffer is really
what we mean is those eight kilobyte pages that i talked about earlier right so these portions of
the file on this and so if you pass that buffers option it will show you how much of the file ultimately it had to read and so
it will tell you i had to read 100 buffers and then you have to do the mental math and say 100
times eight kilobytes is you know that's this much in actual bytes um and so that way you know
for each part of the plan, not only how slow or fast it was, but also, you know, that it actually took this much I/O, essentially.
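As a minimal sketch of this workflow, assuming a made-up orders/customers schema, EXPLAIN (ANALYZE, BUFFERS) runs the query and reports, per plan node, both timing and how many 8 kB buffers were touched:

EXPLAIN (ANALYZE, BUFFERS)
SELECT o.id, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'pending';

-- Each node in the output then carries lines like:
--   Buffers: shared hit=20 read=100
-- hit = pages already in shared_buffers, read = pages fetched from disk (or the OS cache),
-- so read=100 is roughly 100 x 8 kB = 800 kB of physical I/O for that node.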
What I would do next is kind of what you were alluding to earlier, right?
Which is try to think about, do I want to, like, is this an indexing problem?
Or is this more of a data modeling type problem, right?
Or a query structure problem, right?
So I think they're all different directions, right. I can change the way my query is written, I can change my table definition,
or I could add indexes. Oftentimes, changing the query, like how you write it, or adding or
sometimes removing or changing an index, those are the easier choices, because they're really
fast to do. Like, indexes are cache structures of sorts, right? So you can just create new ones
and the Postgres planner will choose them
if they're better most of the time.
And so most of the time, I would say,
probably start with understanding
what the query is trying to do.
If there's a simple change you can make in the query,
try that first because it's going to be fastest, right?
Then next, look at indexes.
And then really only if that doesn't work,
then it helps to look at the actual data model.
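A sketch of the "try an index before touching the data model" step, with hypothetical table, column, and index names. Indexes are cheap to add and drop compared to a schema change, and the planner will pick a new one up on its own if it looks cheaper:

-- The plan showed a slow sequential scan filtering on status and created_at, so try a matching index:
CREATE INDEX CONCURRENTLY idx_orders_status_created_at
    ON orders (status, created_at);

-- Re-check the plan; if the index isn't used (or doesn't help), it's just as easy to remove:
EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM orders
WHERE status = 'pending' AND created_at > now() - interval '7 days';

DROP INDEX CONCURRENTLY idx_orders_status_created_at;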
I'll give you one example of where the data model makes a big difference. Again, think back to this physicality of looking at those portions of the file on disk. In your table, if each row is very wide, right, so you have a lot of columns in them, maybe a lot of text columns that have a lot of text in them, then they will take up a lot of space in each page. And so what happens is, let's say you have this eight-kilobyte page, and you have each row taking one kilobyte, so at most you'll have probably seven, maybe eight rows, if the math works out, per page. Now imagine instead of that one-kilobyte row you had a 100-byte row: suddenly you can fit ten times more into each page, right? And so the thing that
matters there is, if your data, if you have a really wide table, that can sometimes be a big
problem. And so it does sometimes make sense to essentially think about making tables small from
a physical perspective, in terms of each row being small,
like having less columns
or having less text columns in particular,
because then you're optimizing for that.
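A sketch of that idea with made-up table names: keep the columns you scan frequently in a narrow table, and move the bulky text into a one-to-one side table that only gets joined when it's actually needed:

-- Wide version: every scan over users_wide drags the big text columns along
CREATE TABLE users_wide (
    id       bigint PRIMARY KEY,
    email    text NOT NULL,
    status   text NOT NULL,
    bio      text,   -- potentially long
    settings text    -- potentially long
);

-- Narrower version: small rows pack many more per 8 kB page
CREATE TABLE users (
    id     bigint PRIMARY KEY,
    email  text NOT NULL,
    status text NOT NULL
);

CREATE TABLE user_profiles (
    user_id  bigint PRIMARY KEY REFERENCES users (id),
    bio      text,
    settings text
);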
Yeah, so that's, I mean,
I guess that's like a pretty universal,
it's like a data locality thing,
which is like, oh, hey, I want to,
if I have a JPEG attached to the row,
it might be better to like insert it with an ID
and then put the ID in the row.
So that way, like, as you're mentioning, you don't have to physically read in pages to get through the data to get
through the jpeg maybe handle jpeg separately but you know you don't have to like physically read
in multiple pages trying to get to the next page of actual data that you're interested in you're
trying to make it really compact so that the number of rows through the reader go as fast as
possible like the highest throughput and
you do that by moving data that's unlikely to be queried, like, you're not querying for text inside a JPEG, you know; it's sort of stored externally or differently, so that, you know, your queries run
fast that's right and i'll add one quick thing just so that people don't get confused if at some
point you do dive into this detail um so one important thing to know in Postgres in particular
is that if your data gets beyond a certain size,
there's actually a separate storage for it.
So Postgres, what you just described,
Postgres has a way of doing this, right?
So if you, for some reason, do decide to store a JPEG,
which you shouldn't do, don't store images in your database.
But if you do do that for some reason,
and they are like multiple megabytes in size,
Postgres will actually store it in what's called Toast.
Not the, you know, thing you eat, but, uh, The Oversized-Attribute Storage Technique, I think it stands for. Okay, but the point is, it's a thing. And so if you have really
large values they're actually less of a problem the issue is more if you have these medium-sized
values right so these things that are kind of they're not large enough to be stored separately
but they're large enough to mess up your page structure. That's the issue.
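One way to see where a table's bytes actually live is Postgres's size functions (the table name here is hypothetical). pg_relation_size is just the main heap, while pg_table_size also includes the TOAST data and a couple of small auxiliary forks, so the difference is mostly TOAST:

SELECT pg_size_pretty(pg_total_relation_size('users'))  AS total_incl_indexes,
       pg_size_pretty(pg_relation_size('users'))         AS main_heap,
       pg_size_pretty(pg_table_size('users')
                      - pg_relation_size('users'))        AS mostly_toast,
       pg_size_pretty(pg_indexes_size('users'))           AS indexes;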
Okay, so it's basically, you confuse Postgres, so it doesn't know: is this something I'm actually going to need to access, or is it just a blob? And so there's like some ambiguous overlap where, yeah, it doesn't know
where to put them. Okay, that makes sense. And then, good database recommendation: don't put your JPEGs in the database. But I know people are still going to do it, so, uh, definitely happens.
Okay, so that's the indexing, and, you know, like you said, lastly, the schemas. And I think this is one of those things, too, that gets fairly debated and approached. But I'm a big fan of thinking about your data models, and in general, even for code, I think it's under-thought-about. And so people don't really sit
down and think, like, how does my data actually need to look, what goes in it? People just sort of start writing code and shoving stuff into classes, structures, tables; it's sort of all manifesting the same kind of problem. And sitting down, you get into this discussion where, I guess, you hear people talk about normalizing or denormalizing your data, like, you know, the star schema in data warehousing versus anything. I mean, feel free to
just be like, no, I don't want to talk about that. But any commentary or thoughts on, like, is it better to put as much stuff locally into the row, or put it into separate tables and do joins? What is your philosophy, sort of? I mean,
the one thing i'll mention so denormalizing and normalizing is important right so the i mean the
most let's put this in layman's terms it's like do i have one copy of my data or multiple copies
of my data right so sometimes it makes sense to essentially write the exact same value to multiple
tables, because then you denormalize it, and the benefit of that can be that you don't have to
look at the other table to get the data, right? So this is one case, and it's very situational, but I would say,
don't denormalize unnecessarily right like in the simplest case like start out by just storing your
data once like it's going to be smaller it's going to be easier and then if you encounter issues where
that's not feasible then think about um denormalizing um the other thing i'll mention
because, I think, oftentimes... we talked about this at the very beginning, about document stores versus relational databases. And one of
the things that Postgres actually has this data type is called JSONB. And JSONB is essentially a
binary variant of JSON, which is slightly more optimized for indexing. And so the big benefit
of JSONB is that instead of you having to specify each of your columns that you're going to need, you actually just have this, you know, JSON column, or JSONB column rather.
And this JSONB column is just storing a JSON document, right?
It's just storing key values, and it can be nested and all that stuff.
And so what you can do that way is you can just throw your data into Postgres and then query it without having to have that rigid structure.
There are limitations, and it's not going to be always, it's definitely going to be slower
in some ways than just having separate columns.
But if you're unsure about the structure of your data, right, if you don't know yet, maybe
there's metadata attached to your objects, then using something unstructured like a JSONB field is the way to go, and is what I would do oftentimes.
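A minimal sketch of that pattern, with a hypothetical events table. The payload stays schemaless, and a GIN index on the JSONB column supports containment queries like the one at the end:

CREATE TABLE events (
    id         bigserial PRIMARY KEY,
    created_at timestamptz NOT NULL DEFAULT now(),
    payload    jsonb NOT NULL
);

INSERT INTO events (payload)
VALUES ('{"type": "signup", "plan": "pro", "metadata": {"source": "ad"}}');

CREATE INDEX idx_events_payload ON events USING gin (payload);

-- Find events whose payload contains this key/value pair
SELECT id, created_at
FROM events
WHERE payload @> '{"type": "signup"}';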
So I guess it's just, don't optimize too early, right? Like, rather, keep things normalized, do a JSONB column where you need to have that unstructured information, and then, if you need to, optimize the structure later on.
This has been, like, super useful. I, you know, know what databases are, and in fact I've tried to encourage people, we probably should be using more of them, but I've generally avoided them in my life, so I've never had to run down this. It's always just sort of one of those things that's kind of, uh, a little confusing, it's pretty in-depth, and I feel like I don't know it. Um, and we're going to kind of try to transition this here. But for me, I feel like these techniques you're talking about are critical and useful, but at one point it became like, oh, when I was coming out of school, being a database admin was like a big thing, and then people did not want to be database admins.
i don't know like the current zeitgeist around like database admins but a lot of people just
got this like you know i do not want to do you know looking at queries and i
do not want to do optimization like that is a dead-end job you do not want it you're going to
be like you know stuck racking servers or whatever the the like mindset was i don't know where it
came from but i feel like there's like lingering after effects through even to today where people
have this hesitancy to like engage with you know databases yet we see like all the
major tech companies very reliant on quite traditional uses of databases you know they
they gussy them up, I don't know, anyways, make them fancy by, like, you know, sharding them and distributing them and all of this, and we've had great, you know, podcast interviews with folks
working on that. But I feel, still, like at the engineer level, engineer to engineer, there are a lot of people who would just rather avoid it, for, I don't want to say stigma reasons, but, uh, it's been great to
hear your explanation of this and like it's really not that different than compiling and debugging
stuff you would do or at least i do you know day-to-day just like in sort of c++ code and
making code run faster and don't use more data than you need to it's all the same concepts right
And I would say, you know, administration is uncool, but you don't have to call it administration, right? Like, it's the same with system administrators: people didn't want to be called system administrators
anymore maybe other new people coming into industry didn't want to be called administrators
and so now we have devops engineers right like it's it's the same thing with databases now people
are called data platform engineers instead of like dbas like i think you
you pointed out one thing that you know if i think about database performance work right the part that
i really enjoy is i can make a difference pretty easily often right like it's oftentimes like you
can get drastic performance improvements like something takes multiple seconds and like people
have a slow experience to milliseconds and you know it's super fast and that is really rewarding
right? Like, as somebody working on that, as an engineer working on it, it's really rewarding to kind of get
that kind of uh performance benefit the performance improvement um by doing boring administration in
a sense i'm with you i mean running the like debug tool timing code and like seeing literally like
7,000x speedups, right? You know, like you
said, something, somebody has some program, it takes minutes to run, and now it runs in like,
a few milliseconds. This is something I find enormously rewarding. But it's this grunt work
of going in and putting the logging in like paying attention, but maybe you and I are broken the same
way. I find that enormously rewarding, because it's just like, haha, you did this. And here's my
it's like code golf, I guess, or whatever, like, you know, here's my speed, it's so much better than your speed.
So I was going to transition it now. So we opened up with the fact that you're the founder at this company, PG Analyze. I guess, tipped off by, you've already said the word analyze, and Postgres shortening to PG there, I think we might be teed up to what it is a little. But maybe help explain what your company does, and sort of a little bit about it, and just sort of tell folks what you're up to.
Sure.
And I'll try to put this in, you know,
from the perspective of why I as an engineer
care about that and why I started a company.
So let me give you just a little bit of background
on how the company came to be.
So ultimately I started it as a side project, which is close to 11 years ago now.
So it's been a while.
But I've only been, you know, kind of full time in the last couple of years.
And so we now have a small team, you know, essentially supporting it.
And ultimately, what I set out to do with PG Analyze at the beginning was giving me better introspection into the database, right?
So back then, again, all focused on Postgres, right?
Pointed out PG, short for Postgres.
And so the reason that I started that as a project back then
was that I tried to say, you know,
what does the database think is going on, right?
Like if I look at the database and I see like CPU utilization
or IO utilization, it doesn't really tell me much.
And so the very first thing that we did back then
was just query performance metrics, essentially.
And so it was just saying, you know, here is this query that was running, and this query took the most time. And there are various ways in Postgres to get that data and to kind of say, you know, this is a query that's essentially making the system busy. And so that's how we set out to, you know, just have a way to say, here is what's slowest on the database.
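One common way to get that kind of per-query data in Postgres (not necessarily how pganalyze gathers it internally) is the pg_stat_statements extension; the column names below assume Postgres 13 or newer:

-- pg_stat_statements must be in shared_preload_libraries, then:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Which normalized queries account for the most total execution time?
SELECT calls,
       round(total_exec_time::numeric, 1) AS total_ms,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;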
Now, over the years, and especially more recently,
what we've turned this into
is not just, you know,
kind of that monitoring
and observability side,
but really also giving recommendations.
And so we already touched upon indexes earlier,
but one of the things that I'm,
I think, reasonably proud of
is our implementation of
how we make index recommendations.
Now, it's actually not as good
as I'd like it to be,
nowhere near it,
but it is, you know,
I think a system that is a really good starting point
that says, here is my query workload.
Here are the suggested indexes, essentially.
Here is what's missing.
You're querying for these things.
You're doing these where clauses and these join clauses.
We think this index would be helpful to make your queries faster.
Go try it out.
Human in the loop type system, right?
Actually try it out and see if it makes a difference.
Well, that's awesome.
So you guys,
so how, I mean,
so you're an extension,
you're sort of like,
how does that,
how does it like integrate in?
I have a database I'm running,
like your service comes alongside
and sort of monitors what mine is doing.
And then I'm able to sort of get these suggestions
and try them out.
So we'd love to be an extension, but we're not.
So the problem with extensions,
this is kind of a technicality,
but the sad part is extensions require you
to usually have access to a database server.
And what we find in practice is that most people today
don't run their own database service.
They use managed services like Amazon RDS
or Google Cloud SQL or Azure Database.
And so the issue is that you can't really write a custom extension and expect your customers to run it, right? I mean, you yourself can certainly run it,
but we intentionally did not require any custom extension.
What we do is we have people install an agent
and that agent essentially sits next to the database, right?
Like it's in a container or in a virtual machine.
And essentially it runs SQL queries itself
to get data from database
about like statistics that are happening.
And then the other thing it's doing,
it's looking at the error logs.
So we do a lot of log parsing
because that's where we get additional information,
like which query plan happened at which time,
like there's ways to tell Postgres
to log query plans.
And so we essentially pick up those query plans
to then kind of put them
in a more easily accessible UI.
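For reference, one built-in way to have Postgres log query plans is the contrib module auto_explain; the threshold below is just an illustrative value:

-- Load auto_explain for the current session (server-wide it would go into shared_preload_libraries)
LOAD 'auto_explain';

-- Log the plan of any statement slower than 500 ms, including per-node timings
SET auto_explain.log_min_duration = '500ms';
SET auto_explain.log_analyze = on;

-- From here on, qualifying plans show up in the Postgres log, where an agent (or a human) can pick them up.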
Oh, wow, that's awesome.
That's interesting.
Yeah, I guess I didn't really think about that,
but you're right.
Like most people are probably not like,
you know, installing locally a Postgres database
and running it and they're running it
somewhere in the cloud.
So like adapting to that
and still being able to make it work,
I guess like it's even less obvious to me
that it would work.
But yeah, the fact that you guys,
that's actually really cool.
And so this started as like sort of a side project
is clearly something you're passionate about.
I mean, like we've sort of touched on these subjects
sort of all together
and building a tool that sort of like
helps you do the things that you would do naturally,
but like faster.
And, you know, for people who maybe
don't have that background or expertise
and really helping people who, you know, run into slow queries, and be able to sort of help break them open and look at them. Have you guys felt like, is it mostly people who are pretty experienced, saying, oh, these are actually just incredibly time-saving, and I can tell what it's doing is smart and good, and I like it, and I want to use it to save time? Or is it more people who are like, yeah, I have no clue, I'm just going to, you know, click yes on whatever it says, and I'm just going to trust you to be in charge of everything? Yeah, I would definitely say it's both, you know, it really
depends. So you know, we, the good thing is, we have enough customers that I definitely don't
know all their names. And you know, I don't know all the use cases either.
But I think one pattern that I definitely see, and this is, you know, for the folks listening, if you're currently in college, for example, you may, this may be like, at least conceptually a long way from where you're at.
But what's out there often in industry these days is that you usually have what's called a data platform team. And so, often, in big organizations, there's a central team that operates data stores, right, and databases. And so one of the
things that we found is that those you know centralized teams that manage the databases
they ultimately work with application engineers right because application engineers write the code
they make all these feature changes and so the challenge that they have is that they are usually
just a small team like you might have 100 application engineers, 200 application engineers, but there's just like
five data platform engineers that are like wrangling all the databases. And you know,
if any of the databases is down or slow, everything is on fire. But there's still just this tiny team.
And so really, where we come in, and where we found, you know, making the biggest impact,
so to say, is just enabling those teams to then ultimately hand off more things
to the application teams, right?
So to give better tooling to the application team
so that the application team can say,
hey, maybe this is an obvious issue
that PG Analyze can help me identify.
So I don't need to spend as much time
with that really overwhelmed data platform team.
And so it just becomes that way
of kind of collaborating more
effectively. That's awesome. And then, I guess, a bit of an adjacent topic, but a bit off topic: sort of taking something that I think you had been doing for a lot of years, you know, that had kind of been, like you said, a side project, and turning it into a company. How's that experience been? Any sort of introspective tips or suggestions? Like, people out there, this is for anyone who is sort of new: you will have
these thoughts you will sit someday in a big company if you ever go there and you will think
why am i here i could do this on my own i could be making more doing this on my own if you're not at
a big company you're going to say why am i doing this i could just be at a big company you know so
these thoughts are at least for everyone I've ever spoken to,
these are always sort of at war in our heads.
But from someone who's sort of,
I think you said you had been at Microsoft for a while.
Now you're sort of doing this on your own.
You've kind of played both sides of it, I guess.
Any sort of like thoughts or tips from the trenches?
Yeah, I think, you know, there's the saying,
if you like drinking coffee at a coffee shop, don't create your own coffee shop, right? So, like, don't start your coffee shop just because you wish there
was a coffee shop around the corner because running a coffee shop is not the same as drinking
coffee at a coffee shop. So, I'm paraphrasing, but you know what I mean, right? Like, running a business is definitely not the exact same thing as you imagining using that business as an end user, or imagining that it exists, right? So I think there is something to be said about, do you actually want to run a business. Because it's great, don't get me wrong, I really enjoy working on my own terms,
and you know having a team of folks working you know with me and kind of like that's all great
but it is not necessarily the same as saying i'll just go and create a project for fun right
so i think there's an important distinction there. Now, I will say that it's definitely possible,
but you have to have a lot of patience.
So I'm very much, I mean, we bootstrapped the company,
we have no outside funding.
And because we bootstrapped it,
that's why it was a side project for a long time
is because revenue just wasn't there in the beginning.
It just took a long time
to even be able to pay one person's salary.
And if you're willing to do that,
I personally believe that most people listening to this
will be able to create their own business
if they really go for it, right?
But the challenge is that you have to have that longevity.
You have to have the motivation.
And so I think what I would do, if I would do it again,
is focus on something that you, A, enjoy working on.
B, can use yourself, right?
So ideally, you're your
own customer in some way or form because then it's just much more motivating um and you know
c you actually enjoy building the business side of it not just you know like you actually want
to run the coffee shop yeah i guess like for for folks out of the industry i mean i guess that like
eating your own dog food was like very weird term to me the first time I sort of like heard it.
Everyone sort of takes it for granted now, but like I was like, what?
It is a weird term.
I was like, this is, this is like interior and people even say, oh, dog food programs.
And like, there's all these, and there's even other variants I won't go into for different
big companies, but this is like a very common term, which is, you know, what Lucas is referring
to here is like, if you can build something and be your own customer for that thing, and like,
force yourself to use it, like, that just makes the iteration cycle better. Like you're teaching
yourself something, you're staying engaged with it. I think we talked about at the beginning,
you guys started playing video games, I feel like video games are very obvious example,
because you're like, well, of course, I would play this video game. And, you know, I don't know anyone who
would sit down and say, I'm going to build a video game i don't want to play like everyone
sits down and says i'm going to build a video game that i want to play and so they're inherently
saying i want to build something that i know whether or not they enjoy playing it by the
end that's a different different question um but maybe this loses some of its obviousness as people
sit down and they try to say, I just want to build a business. And then this thing you mentioned, I think, was great too: for folks who don't know, this bootstrapped versus taking external investment thing is a very tough decision. And I haven't done it myself, so I won't profess any knowledge or wisdom there. But it's great to hear someone, you know, share some examples. I think you hear a lot
of pros and cons of both sides, but clearly a really big passion project for you. You can kind of like hear it come across and that,
you know, even in the beginning, you know, Lucas was making sure people know that, you know, he's technical. But I mean, I think that's pretty clear from the explanations. But, you know, I think it's exciting, and, you know, it's been great talking to you. And
I learned a lot today. I mean, there's a huge gap in my knowledge
from what actually happens.
I know at a really low level,
I know what a B-tree is.
At a really high level,
I know what a Postgres server is,
but kind of unboxing some of the middle parts
and some of these optimizations
and what happens to you,
as well as a ton of tips and tricks along the way
for sort of heuristics about when to...
This has been great, Lucas. Is there anything else you wanted to, like, tee up? We'll have the link to PG Analyze, just pganalyze.com, so it's pretty easy to find. Anything else you want to, like, send people to, or, you know, recommend them to look at? For sure. And I'll actually add one more thing that we talked a little bit about earlier, but I just want to make sure to mention it
for this audience in particular is if you're interested in databases,
well, there's two things I'll point out.
So if you're interested in database in general,
CMU actually has a course on databases
where they publish all the lectures online.
It's really good on databases more broadly.
So if you want to look for just CMU and database things,
and Andy Pavlo is the one who runs that,
that's, I would say, you know,
one of the best just online materials
to learn about databases more broadly.
If you're interested in Postgres in general,
one thing I would recommend is Postgres
like talks about everything in the open.
So Postgres has mailing lists.
It's old school in that sense.
So you can actually follow along,
which is really cool.
You can actually follow along Postgres development.
It's really technical, right?
But if you ever are really interested in this like at the lowest level
you can just subscribe to the mailing list and you can just see what people are discussing and
how new features get contributed so that can be really fascinating if you're into that side of
the house. And if you, you know, want to contribute to Postgres, there's also, each year, Google Summer of Code, where Postgres participates, so you can actually have an official project and a mentor that helps you contribute to Postgres.
Now on PG Analyze, I do a weekly video series called Five Minutes of Postgres.
So if you're interested in Postgres, each week on the PG Analyze YouTube channel,
there's just a five-minute video where I talk about what I found interesting that week.
So sometimes that's, you know, usually it's other people's blog posts that I use as a starting point,
but I'll talk about things like
the slow query optimization
that we talked about earlier today,
new features and new Postgres releases.
So just a way to stay on top of Postgres.
And then pganalyze, pganalyze.com. We're also on Twitter as pganalyze.
Awesome.
And also, kudos for, like, remembering multiple punch-list things to go through and getting back onto the topic.
That was very smooth. Very impressive, Lucas.
Well, I've enjoyed having you on the show.
I hope folks find this as enjoyable and as educational as I did.
And databases are a very important topic for our industry and lots of different avenues and directions.
And so I've learned a lot today and it's been great to have you on the show.
So thank you for coming.
Yeah.
Thank you so much for hosting.
All right.
And we'll see everyone next time.
See everyone later.
Music by Eric Barndollar. Programming Throwdown is distributed under a Creative Commons Attribution-ShareAlike 2.0 license. You're free to share, copy, distribute, and transmit the work, and to remix and adapt the work, but you must provide attribution to Patrick and I and share alike in kind.