Postgres FM - Gadget's use of Postgres
Episode Date: September 19, 2025

Nik and Michael are joined by Harry Brundage from Gadget to talk about their recent zero-downtime major version upgrade, how they use Postgres more generally, their dream database, and some challenges of providing Postgres as an abstracted service at scale.

Here are some links to things they mentioned:

Harry Brundage https://postgres.fm/people/harry-brundage
Gadget https://gadget.dev
Zero downtime Postgres upgrades using logical replication (blog post) https://gadget.dev/blog/zero-downtime-postgres-upgrades-using-logical-replication
HOT updates https://www.postgresql.org/docs/current/storage-hot.html
PgDog https://pgdog.dev
Multigres https://multigres.com
Neki https://www.neki.dev
Running 10 Million PostgreSQL Indexes In Production (Heap blog post) https://www.heap.io/blog/running-10-million-postgresql-indexes-in-production
pgwatch2 (Postgres.ai Edition) https://gitlab.com/postgres-ai/pgwatch2
Advanced query insights on AlloyDB https://cloud.google.com/alloydb/docs/advanced-query-insights-overview
OrioleDB https://www.orioledb.com

Not discussed but relevant: Gadget have also now published a blog post about their sharding! https://gadget.dev/blog/sharding-our-core-postgres-database-without-any-downtime

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With credit to:
Jessie Draws for the elephant artwork
Transcript
Hello, and welcome to Postgres FM.
A weekly show about all things PostgreSQL.
My name is Michael. I'm the founder of pgMustard, and I'm joined, as usual, by Nick, founder of Postgres.ai.
Hey, Nick.
Hi, Michael.
And today we have a special guest.
We have Harry Brundage from Gadget, co-founder and CTO there.
Hey, Harry, thanks for joining us.
Thanks so much for having me on.
I'm happy to be here.
It is our pleasure.
So you've been doing some cool things with Postgres for a while.
You recently blogged about a zero-downtime upgrade,
but there's a ton of other things we wanted to talk with you about as well.
Where would you like to start?
Yeah, I'd love to tell you about this zero-downtime upgrade.
Just because, I don't know, it's like a point of pride of mine, I guess.
It's always so frustrating to me when I get an email from one of my providers.
It's like, hey, Sunday at 2 p.m., we're going to be down for 30 minutes.
Like, hope that's okay.
You know, so, yeah.
Gadget's an infrastructure provider; the developers who use us, use us for mission-critical apps, so we just try to take it more seriously than that and, you know, not disappear. So yeah, we did a zero-downtime upgrade with the help of Postgres.ai. And Nick, thank you so much for your help.
What a treat.
Thank you for trusting us, because, you know, we did it with other companies, and always, in the beginning, I hear pushback once people understand how much work is needed there, because it's far from being well automated. It should be automated in Postgres, and it's improving, but not quite yet. So you did this journey, and thank you for not stopping.
Absolutely, absolutely. Yeah, so it was a lot of work, it was. Yeah. I think, too, it just kind of feels to me like something that should be built into the managed providers. I'm kind of surprised that so few of them have it. I think the only one I could find that has, like, true zero-downtime is PlanetScale.
But for Postgres?
The GCP options.
I don't know about the new Postgres one.
I guess I know they have it for MySQL, but not for Postgres yet, eh?
For MySQL, yeah, I think so.
I don't know details.
But for Postgres, I don't see this anywhere yet.
But again, like, what the heck?
Like, as an industry, shouldn't this have been a thing we kind of, like,
solved 20 years ago and never looked back, you know?
I don't understand.
Yeah. And there's no AI needed there. It's just pure automation. That's it.
Yeah. Terrifying button to press.
I think...
Yeah, I think what happens is that it should be automated more in Postgres itself. And most managed platforms, like RDS and others, they just sit on the fence. Well, I know they work on it, but they don't want to automate things outside Postgres too much. For example, ANALYZE, right? ANALYZE is not automated, you always forget about it, and they don't automate it; they just put it in the documentation, on the shoulders of users: now it's time to run ANALYZE. And I started asking around, like, a couple of years ago, maybe one year ago: why, why, why? And then, okay, Postgres 18 should be out next week, right? Finally, dump and restore of statistics will be fully automated, so they will just exclude this from their how-to manuals. So this is an example of things that should be inside Postgres, but I think they still could make it much better themselves, not waiting for Postgres itself. Anyway, I'm with you here 100%.
I'm glad... What I was gonna say is, I've always attributed the absence of something like a true zero-downtime major version upgrade to the fact that it's more than just Postgres that you need to pull it off, right?
Like you need a connection pooler or a proxy or something in front of it sitting outside the system
and you need a bunch of tight coordination between that proxy and the rest of the system to kind of orchestrate it all.
And so, like, I know at least on our previous Postgres provider, they ran the pooler on the same VM as Postgres itself. And so they're kind of hemmed in by the architecture, where they can't have this piece that lives outside that can talk to two at once, whereas we self-hosted our bouncers for exactly this reason and can kind of do it all and orchestrate it all. And it, like, sucks to self-host it, but it gets you that architectural flexibility, you know. We needed that.
Yeah, and PgBouncer supports pause and resume, which helps you achieve that.
Absolutely, absolutely.
We're maybe nuts, too, and we run our PgBouncers in K8s and do the whole, you know, rolling restart and whatever. And it's turned out to be, I think, an okay choice. I think a lot of people maybe shy away from that because it's sort of super stateful, super critical; maybe this is a pet, not cattle. But we did it in K8s just because that's where the rest of our compute runs, and it's easy to sort of bin-pack and share it all. It actually really saved our butts recently, because we were paying a lot for inter-zonal network transit to our databases, specifically once from our application pod to the bouncer and then again from the bouncer to the database. And because the bouncer was in K8s, we could turn on the handy-dandy, like, prefer-zone-local network routing configurations and kind of have the application pod go to a zone-local bouncer and then to the database. And I think, were we outside of K8s, that would have been a lot more annoying, to kind of teach all the routing to be able to do the right thing.
Nick, you look scared.
You look skeptical.
No, no, I'm just thinking, this is also a job for a database platform, I think, to take care of such things. So, yeah, I know RDS has RDS Proxy, but it has cool features PgBouncer doesn't, and it also lacks cool features PgBouncer has. So it's a zoo. And, yeah, I think pause/resume is a wonderful feature.
Yeah, absolutely.
I'm going to emphasize it.
I was going to ask about your PgBouncer setup, actually, because your post on the upgrade is really cool and goes into good detail on the actual upgrade process, but you didn't talk much about your PgBouncer setup, like how many you're running, and does all of your application traffic go through that, or does any of it skip it and go straight to the database? Is there anything interesting about that that you wanted to share with people?
Good question. I have two interesting tidbits. The first is, our PgBouncer deployment's a bit weird: we have nine instances.
Interesting.
Yeah.
And we take our kind of total connection count where we're targeting 400 kind of server-side
connections and split that up over nine.
And the reason for that is that same K8s network-aware routing thing, where, up until very recent versions of Kubernetes, you needed three endpoints per availability zone for the network-aware routing to kick in.
So we had to use a minimum of nine.
And initially, we were really skeptical about that because it seems kind of bad, right?
Instead of having like, you say, one or two buckets of a lot of connections, we were kind of
brokering them fairly.
You have many small buckets and you could get kind of weird inefficiencies.
But it didn't come to pass where that bit us.
It seemed actually okay in the end.
And what I like about it is we have a lot of CPU headroom, right?
Like, PgBouncer, I think, is still largely single-threaded.
I don't know the innermost details of it.
But we're using the fancy new prepared statement stuff within it.
And so I'm kind of happy that I can give it nine cores instead of two or four or something like that and kind of be like, you know, take as much as you need.
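To make the arithmetic concrete, here's a rough sketch of splitting a fixed server-connection budget across nine bouncers; apart from the 400/9 split mentioned above, every value is an illustrative guess rather than the real configuration (max_prepared_statements is the PgBouncer 1.21+ setting behind the prepared-statement support mentioned).

```typescript
// Rough sketch (illustrative values, not the real config): splitting a global
// server-connection budget across N PgBouncer replicas and emitting a
// pgbouncer.ini fragment for each one.
const TARGET_SERVER_CONNECTIONS = 400; // total Postgres backends we're willing to use
const BOUNCER_REPLICAS = 9;            // 3 per zone, so topology-aware routing kicks in

const poolSizePerBouncer = Math.floor(TARGET_SERVER_CONNECTIONS / BOUNCER_REPLICAS); // 44

const pgbouncerIniFragment = `
[pgbouncer]
pool_mode = transaction
default_pool_size = ${poolSizePerBouncer}
max_client_conn = 5000
; PgBouncer 1.21+ prepared statement support in transaction pooling
max_prepared_statements = 200
`;

console.log(pgbouncerIniFragment);
```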
Yeah, nice.
Always good to have headroom.
Cool.
Anything else about the upgrade that, or like, which bits were most scary when Nick came to you with the proposed plan?
You said pushing the button.
Is it kind of because it's single direction?
Did you have a rollback plan?
Good question.
We had a rollback plan, but it was written rather than tested. Because of the way that we pulled it off, or the way that we kind of built the whole thing: we built a Temporal workflow. We're, like, huge fans of Temporal; it's very easy to kind of script something like this. And the reason why we did this is because it lets us get all the nice retries and idempotency stuff for free, where we get to write in a somewhat failure-oblivious way. And we wrote our workflow to have, like, a dry-run capability where we would do all but the very last step. So we would go do everything: we prepared the new host, kind of replicated the data, made sure we were caught up, did the PgBouncer pause, and then we just wouldn't reload the config or change over; we'd just do a resume, and everything would keep on going sort of as it was. And so the terrifying part is we'd just never actually done the middle thing, and we weren't really sure how we could test this, other than standing up a whole parallel set of infrastructure, and we were like, I'm not even sure that will prove this will work, right? Just in my admittedly still limited lifespan, the things that go wrong are always the things you didn't foresee, right? So it's like, we dry-ran it a million times, found all the things we could foresee, and then it was, like, the spooky unknown, you know.
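The blog post has the real details, but as a sketch of the shape being described, a dry-runnable workflow in the Temporal TypeScript SDK might look something like this; the activity names and options are hypothetical, not the actual workflow.

```typescript
// Sketch of a dry-runnable upgrade workflow using the Temporal TypeScript SDK.
// Activity names and timeouts are hypothetical; the point is that a dry run
// exercises every step except the irreversible cutover.
import { proxyActivities } from "@temporalio/workflow";

interface UpgradeActivities {
  provisionNewCluster(): Promise<void>;
  startLogicalReplication(): Promise<void>;
  waitForReplicationCatchup(): Promise<void>;
  pausePgbouncer(): Promise<void>;
  repointPgbouncerAndReload(): Promise<void>; // the one step a dry run skips
  resumePgbouncer(): Promise<void>;
}

const acts = proxyActivities<UpgradeActivities>({
  startToCloseTimeout: "30 minutes",
  retry: { maximumAttempts: 5 }, // Temporal retries failed activities for us
});

export async function majorVersionUpgrade(opts: { dryRun: boolean }): Promise<void> {
  await acts.provisionNewCluster();
  await acts.startLogicalReplication();
  await acts.waitForReplicationCatchup();

  await acts.pausePgbouncer();
  if (!opts.dryRun) {
    // Only a real run performs the actual cutover to the new primary.
    await acts.repointPgbouncerAndReload();
  }
  await acts.resumePgbouncer();
}
```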
impossible to test switchover in production but I wanted to emphasize it's possible to
test everything before that and it's great like because this is beauty of this
procedure you like create the whole cluster which is running already a new
version you like see it's working you can touch it and it's great like and so on
but yeah I agree the scariest part is switch over how to test it properly of
course it was tested in non-production right but it's different
Yeah, here we just need to jump at some point.
Exactly. And it was the best kind of jump, where it was like a big meeting that we were all scared of. We all sat down, Nick was on the phone, and it was over in, like, seven minutes, where it was like, oh, well, guess it worked, all the numbers look good, onwards, you know. So I feel fortunate to work with a team where, you know, that was how it went, as opposed to us running around like chickens with our heads cut off.
And I wanted to thank you for emphasizing that you use this Temporal framework, because we, at Postgres.ai, were always big fans of Go, and we developed a lot of stuff using Go, but recently our team has a few Python guys, and I just envision that we'll probably learn from you and this experience as well and do some stuff with Temporal too. So thank you for emphasizing it.
Yeah, we're, like, heavy, heavy users of it. We use it for these kinds of systems administration tasks, or, like, behind-the-scenes maintenance workflows, but we use it for a lot of other stuff too. Gadget is mostly used today by people building Shopify apps.
And when you build Shopify apps, you have to receive a metric shit ton of webhooks.
And so Gadget has, like, a background queuing system that knows how to take a webhook
and process it later or process it with retries or whatever.
So each one of those is a temporal workflow for us.
And so we're doing, I don't know, something like 3K, 4K Temporal state transitions per second, also backed by Postgres, in this, like, big-ass Temporal cluster. And that's something I never, ever want to have to try to build myself. It's so complicated and nasty. And, you know, fun for some.
You mentioned in the post also that sharding was on your to-do list.
Yes.
You mentioned, though, that you're using a different database for that now, AlloyDB specifically, I think, for the new setup, but you said it was just for the user data. What's the separation there? With, like, Shopify, is it the data of the Shopify apps, for example?
Yeah, so Gadget is like a general-purpose app development platform. So if you want to make a to-do list tracker or a ChatGPT clone or a Shopify app, you can sign up for Gadget and you get kind of like a full-stack dev environment. So you get a place to run backend code (it's all JavaScript/TypeScript), you get a database, or a sort of virtual slice of a database, to store backend data, you get a front end in React and whatever, and you're off to the races.
You write backend code, you write front-end code, you define your data models, you transact, so on and so forth. And then the Shopify part comes in, where we kind of have a managed API connection to Shopify, where we'll do the OAuth and the webhooks and sync the historical data, et cetera, et cetera.
So like many infrastructure providers, we have a control plane and a data plane.
The control plane stores like our list of users, our kind of who has what,
access to what app, so on and so forth.
And then the data plane is the actual app's data.
Like, this app has a to-do model that stores, you know, this many to-dos.
This app has the Shopify product model and, you know, stores this many Shopify products.
So when we started, we just stuck those in the same Postgres.
And then that Postgres got too big.
And so this major version upgrade kind of precipitated splitting the control plane and the data plane,
which I think is a pretty obvious best practice.
When you say too big, by which metric? Where was the bottleneck coming from?
Good question. So we had about eight terabytes of data in the one instance, or, sorry, our disk was eight terabytes big and we were at six or something like that, six and a half, and we were like, okay, this is scary, we've got to figure out the next thing. And we were on PG 13, which was end-of-life long ago. And so we needed to both do the upgrade, and we just wanted kind of a better roadmap for how we were going to deal with growth.
Because Gadget has lots of these e-commerce customers, BFCM is always a quote-unquote "fun" time for us. And we'll see probably this year, like, a 4x traffic increase and a 4x database QPS increase. And so that means we kind of need to be ready for a 4x-ing of database CPU.
And our provider at the time, it didn't sell a big enough one.
So we're kind of like, we have to do the horizontal scaling sort of journey.
So what we did is we sharded in order to move the bulk of that six terabytes out
into sort of newer instances that had much more room to grow,
leaving only the sort of reasonably sized control plane left over to do this major version upgrade.
And it simplified a few of the steps of this zero-downtime major version upgrade
because we didn't have to replay six terabytes worth of stuff on the new instance.
Yeah, yeah. Makes sense. But why not stick with the existing provider for those shards?
Two reasons. Or three reasons. The first is we're a Google shop and we get good pricing from Google. The second is Google claims up to a 4x TPS increase with AlloyDB. Like, I like Google and I don't want to poo-poo them, but I'm not sure I've ever seen it actually hit 4x.
But we did see like at least a 1.5x kind of like TPS increase without really much config change or tuning or whatever.
I can't even say if it's actually the same CPUs like under the hood.
You know what I mean?
I don't know what the initial provider was using.
But the instance was faster for the same price.
So that was nice.
And then the other big thing with Alloy is you don't pay inter-zonal network transit fees. It's not like a normal VM. I present that without comment to the competitive forces that may or may not be listening. But yeah, it's free networking to the Alloy instance, whereas, like, if you're, you know, Aiven or Crunchy or whatever, and you run instances, your customers have to pay from two of the three zones just to talk to it, so that sucks.
And then the second thing is you don't pay for replica storage.
It's a split storage-compute sort of setup, where the disk is this sort of auto-scaling, elastic, fancy log-storage service thing, and then you pay for your nodes that read and write from that shared tier, but you don't pay N times, you pay it once.
So it's like 40-ish cents a gigabyte, which is like more expensive than, like, say, a big fat EBS
volume, but not that much more. And so if you're running like a lot of replicas, it ends up being
cheaper storage-wise, and you don't have like all the nasty I.O. spikes from backups or any of that
stuff.
Cool.
And unlike Aurora... I don't see the... like, the hidden costs with Aurora usually are your I/O costs.
Yeah, no I/O costs. Yeah, that's interesting. So our database bill came out to be about the same as a pretty classic GCP VM with, you know, they're not EBS in GCP, but, like, fat SSD volumes backing it. And we save money on the networking, we saved money on not having to pay for storage again for the replicas, and we deleted our I/O-spike-from-backups problem.
That's an interesting comparison. And we're not even talking about features like the column store versus row store; even without that, just a direct comparison. Interesting.
Yeah. We intend to make use of that column store. Like, I'm skeptical; I used to work on sort of the big data warehousing team at Shopify when I worked there. And I just don't really believe that you can force Postgres to do the kind of super-wide-throughput stuff you need for crazy-scale analytics.
But you can definitely get pretty far with it.
You know what I mean?
So my guess is like some of our operational reporting use cases where you're traversing
half a million, a million rows to answer a query on a read replica with this column store is
going to be fine.
But in the like billion, 10 billion scale, I think purpose built analytics databases still
should exist.
We've had a few of those conversations recently.
I'm sure, I'm sure.
I had one more boring question on the major version upgrade
before we move on to more interesting things.
It looks like you were on Crunchy Data before
for the control plane stuff
and are going to be again for the upgrade.
Yep.
But you only upgraded, I think, from 13,
which, by the way, you were a bit harsh on yourself.
I don't think it's quite yet out of support.
I think it's like a few months.
Oh, really? Oh, good for me.
In a couple of months' time, it'll be out of support. But you only moved to 15.
And I think Crunchy have always been very good at supporting the latest versions.
So probably 17 is available.
So why not, like, why 15, not 16 or 17?
Yes, Crunchy has been great, that kind of thing.
I think the main reason was, shoot, you know what?
I don't even remember.
Maybe Nick knows.
I also don't remember.
I remember that Crunchy had different end-of-life dates somehow.
Crunchy Bridge had different dates.
Yeah, they deprecate support, or they stop sort of long-term support, before the full-on end of the kind of core support. But, in fairness, they gave us extremely advance warning, we knew it was coming for a long time, and they supported it, you know, for four years or whatever, which is, I think, pretty good.
I don't remember.
I think it might have been a desire to keep the versions homogenous across all our instances.
Sure.
And for really silly reasons, the Alloy shards that we stood up were stuck on 15 and couldn't be newer.
Yes, that would explain it.
So Alloy, I think because it's a fork, and it's probably quite hard to maintain a fork, I don't think they're as good at keeping up with the major versions.
So there's a good chance you wouldn't even have the option of going to 16 or 17 on there.
I don't know for sure.
That's correct. That's correct.
At the time, 16 was GA and 17 wasn't even out yet.
Now, 17 is out, but it's still in preview.
Like, you can't, or shouldn't, run production workloads on Alloy 17. But the weirdest thing was we wanted to go to 16, but they just don't have pg_squeeze support on their 16 version. They have it on 15 and they have it on 17 and not on 16, and I couldn't tell you why, but we squeeze a lot, so we really cared about that.
That makes so much sense.
In my head, the only reason I could think of is because now you're so good at major version upgrades, you wanted to give yourself more to do in the future.
Let's rinse and repeat, yeah.
Nice. All right, so you're running Postgres for a lot of different workloads. Are there any challenges or fun things you wanted to share on that front?
I think maybe a good one, just for the audience: if you're listening out there and you think I'm full of it, please send me an email, I'd love to know what you think. But we did the classic Postgres-as-a-queue thing early in our life.
And I think that that, well, it's on my mind a lot.
So a lot of people who have like background queuing problems use Redis.
When we were at Shopify or when I was at Shopify,
I used this library called Resque. There's another really popular one called Sidekiq. And in Node, it's BullMQ.
There's a litany of them, right?
And they all seem to use Redis.
We decided that, for our user, who tends to be like a front-end dev, or sort of someone who isn't deeply familiar with, like, data consistency or data durability concerns, we didn't really want to be like, hey, we'll happily process your background jobs, but we'll drop 0.5% of them and not be able to tell you which ones we dropped and not be able to tell you when it happened. You know? So, for our background queue product, we said it's going to be durable.
Not only is it going to be durable, it's going to be transactional, where when you do some business logic and you enqueue a job within a transaction, if the transaction commits, then the job will get enqueued. And if the transaction doesn't commit, the job won't be enqueued.
And that's like the transaction boundary.
So we were like, this is nice.
This is us kind of trying to prioritize our customers' sanity.
And I think it's the kind of thing that's like annoying because not that many of them are going to notice or care.
It's just going to be no data quality issues for this like business logic.
But that locked us into needing Postgres to be on the hot path for those enqueues, right? We have to participate in that business logic transaction to kind of, like, save the thing. We knew we didn't want to build a giant queue of our own, so we knew we were going to get it into Temporal and let Temporal take over, but we sort of needed like a Postgres outbox, if that makes sense, where it's like you write to the outbox, and then you're slowly forklifting out of the outbox into the actual system eventually, and then Temporal itself is durable, and so, yeah. So that's the architecture we went with.
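As a rough illustration of that outbox idea (table and column names are made up, not the real schema), the enqueue rides along in the caller's transaction and a separate forklift process later hands committed rows to Temporal:

```typescript
// Rough illustration of a transactional outbox (hypothetical table/columns,
// not the real schema): the job row commits or rolls back with the business write.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function markPaidAndEnqueueReceipt(orderId: string): Promise<void> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");

    // The business write...
    await client.query("UPDATE orders SET status = 'paid' WHERE id = $1", [orderId]);

    // ...and the enqueue, in the same transaction: if the UPDATE rolls back,
    // so does the job, which is exactly the guarantee described above.
    await client.query(
      "INSERT INTO job_outbox (kind, payload) VALUES ($1, $2)",
      ["send-receipt", JSON.stringify({ orderId })]
    );

    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}

// A separate forklift loop then reads committed outbox rows, starts the real
// Temporal workflow for each one, and marks the rows as handed off.
```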
But as we sort of built it out more and more, we learned about more limitations of Temporal. Like, it came out of Uber, from this system that kind of predates it called Cadence. And my guess is they learned all this stuff the hard way and built in a bunch of limits that make it, you know, easier for the operators to guarantee certain sort of performance characteristics, but really annoying for people like me. So one of those limits is, when you start a job, there's a maximum payload size that's pretty small. I forget what it is exactly; I think it's like 50 megs or five megs or something like that. And then, as that job executes, it can only generate X amount of state transitions, they're called, you know, events, in its processing, where it can only have a certain number of retries, basically, and you need to sort of re-architect your Temporal workflows if you want to overcome that. And again, in our case, our users are kind of like, hey, Harry, I have this 400-megabyte video file that I'd like to process in a background action. Can you just make it work, please? Can you not make me, like, re-architect my entire thing to toss it into cloud storage or whatever? It's ephemeral data, I don't need it in the cloud, I just want to push it to this thing and work on it in the background. So we found that the Temporal limitations were annoying enough that we actually needed to stage a bunch of each job's data outside of Temporal. And so we ended up kind of having this nasty hybrid, where that Postgres outbox I mentioned sort of became, like, a store for a bunch of extra stuff, and then the Temporal workflow, as it went, would be, you know, transacting with its state storage and doing its state transition things, but then also updating what is now no longer really an outbox to be our sort of source of truth: it powered our UI, it had all these extra params, yada, yada, yada. So we weren't even really using Postgres as a
queue; we weren't having the classic, like, SKIP LOCKED problems and whatever, but we were just having this sort of request-scale data churn, where one webhook would be one write to this outbox to enqueue, one write to start processing, one write to finish processing. So just massive, massive tuple turnover. And we also needed to build a UI on this stuff, so we could only get a few of them to be HOT updates; many of them were not. And so we were just destroying this poor Postgres instance with what I would call less important data than the actual data plane data, but still important enough that we felt it merited being on a disk. So all that to say, we ditched Postgres for this and we
switched to Bigtable, which is, like, Google's fancy super-distributed key-value store, and it friggin' sucks. It really sucks, and the reason is no transactions, right?
It's like a way worse model for me as a programmer and a systems designer and a way
more scalable, fundamentally, you know, supersized system under the hood. But boy, oh boy, do I miss
the creature comforts, we'll call them, of secondary indexes and transactions and like a sane
query language, let alone, you know, the gazillion and one features that a, you know, full
relational system has over a key value store like that. So we use Postgres, I think, by default
for all our workloads. And then when we find something that we're like, okay, I think
there's maybe a better fit here.
We'll try to switch it out for the next best thing.
But it's like, it's really hard to beat.
And I know this is maybe unsurprising on a Postgres podcast,
but it's like, I just wish I didn't have to do that.
I could have stayed with my creature comforts and got my scale, you know?
Yeah, so I'm curious what was the number one reason.
Is it like bloat, maybe, or?
Yeah, it was, like, bloat, tuple turnover, and just the fact that we were, like, always vacuuming.
Yeah, yeah, I can imagine.
Another big motivation for switching to something akin to Bigtable is, like, compute elasticity, where Bigtable auto-scales by default. You can't even turn it off, I don't think; they just add more nodes as demand kind of grows. And for us, the background subsystems are the thing that absorbs load, right? When you have a flash sale on your Shopify shop and you need to enqueue a bunch of stuff, we need this thing to be able to handle unpredictable surges. And it was a bummer: at the time, we didn't have any way to do a zero-downtime, like, non-disruptive Postgres resize. Now we do, but even then, I don't really want to be doing it automatically, like, in response to load. I kind of want something like Bigtable, where, like, five seconds later, you can get your capacity expansion.
Yeah.
Yeah.
Well, I can see it easily.
Yeah.
I think, like, my ideal would be a million tiny Postgreses, and have some fancy system for, like, disaggregating a logical Postgres from a physical VM. Where, let's say, I shard my workload into 16,000 Postgreses, but they only run on four VMs, and then as I need more compute, they kind of split out, get bigger, and whatever. But that's like a twinkle in old man Harry's eyes.
Old man Harry's eyes.
There is work in this direction by multiple teams right now, so we have good...
Go on? Go on?
Yeah, there is PgDog, there is Multigres, and PlanetScale also. They're all working in this direction.
And Multigres specifically, it's one of the things Sugu mentioned as well: from his experience at YouTube, he loved having lots of smaller databases, and he much preferred that in terms of reliability and localized outages even. So there were just so many benefits. Highly automated operations are also on the roadmap, like upgrades as well. And rebalancing.
Yeah.
Yeah.
Forgive me for another brief aside, but are you familiar with Azure Cosmos DB?
Like the kind of main one, not the Citus one, the main one?
I thought Cosmos was Citus.
I get confused by their names, though.
So I think there is Azure Cosmos DB for Postgres, which is Citus.
But I think there's, and I'm not deeply familiar, so I may be totally wrong here.
But I think there is a different, original Cosmos DB that's kind of the Azure answer to Dynamo and Bigtable. Like, they're kind of super-scale, sort of less sophisticated query patterns, but guaranteed scale for whatever.
The way that it works, or the way that it scales, is you specify a partition key with every piece of data, and you're only allowed partition-local transactions. So it's not like a Spanner or a Cockroach where the whole key space can be transacted against together; it's just, like, a million tiny partitions, and then every partition can be a max of 50 gigs. And to me, that's so desirable, because it means, like, in one little partition you get node-local performance, but the same horizontal scaling properties of the kind of super-scalers, without necessarily two-phase commit or Paxos or whatever. And so I'm kind of hoping someone manages to do that. I think it's maybe, what's it called, Neki, like the Vitess for Postgres. Did I get that right?
Yeah, that's, uh, it's N-E-K-I. I have no idea how to pronounce it.
Yeah, same, I have no idea. But I'm hoping that's what it is, which is, like, tiny little boundaries that force me to design my data in a way that makes the transactions efficient, where you don't need a bunch of computers to agree on a transaction outcome and instead you can just funnel them to the right place.
Unfortunately, they
all, not all, but many of them came to me and asked: have you used 2PC? I said no, I always avoided it. But so they need it, unfortunately. Of course, some workloads will avoid it, but some workloads will still need 2PC. This is the sad truth.
Totally, totally. I think, like... so I
used to work at Shopify circa, I don't know, 2012, 2013. We were running like a MySQL fleet at
the time, and we felt constrained by the biggest database servers we could buy. Like, we were sharding, but it was still like the biggest customers could push one of the instances, I don't know what it was, probably 64 threads, to the max. I'm not really feeling that way anymore. Like, I feel like there are few workloads that don't have a natural partitioning, where you need more than one server's worth of compute just to satisfy one partition's worth of work. I mean, maybe you talked about this last time, but it's like, you can always split it up. And when you split it up, you need, in aggregate, many computers, but one transaction boundary doesn't need, you know, more than one. And the hardware has just gotten so, so much better in that 10 years that I'm not sure I could grow my company fast enough to need that, right? It's like, maybe if we had a graph problem, maybe if we were Facebook or whatever, but as it stands right now...
That's true.
And on GCP, if we don't look at AMD, if we look only at Xeon Scalable fifth generation, the best available C4 nodes, I think, right, you can get 192 cores or so. But if you go to AWS, you will have even more, almost, I think, 800 cores, also the same type of processor. And this is already huge, so you can scale one node.
It's just bonkers. Like, that's ridiculous.
Yeah, it's a lot of Dogecoins, you know. Sorry. So that was a long-winded rant, but what more can I tell you about that Temporal-and-Postgres workload setup thing before we move on?
You mentioned one of the most interesting things you're working on is that you're serving other people's workloads, with, like, schemas and design decisions that you don't control. How do you stay sane while doing that, and what are the interesting parts of that?
Great question. You hit me with the question that just invalidates my entire prior rant, which is, like, yeah, you can partition your workload if you think hard about the schema design.
But yeah, so, okay, let me set the stage a little. So Gadget's reason for being, the reason why we're working hard on this thing, is because we kind of believe software development is still unnecessarily hard. That when you go to spin up an app on the weekend, or your next big business, or just solve a small problem in your life with software, you spend a lot of time on what we call baloney problems: provisioning infrastructure, setting up connection poolers, worrying about the autovacuum schedule, so on and so forth.
We just think that's all a distraction from actually building something useful.
It matters.
Like you can't ignore it, but you also ideally don't want to have to deal with it.
So, similar to the people who came before us, like the Firebases of the world,
we kind of want to delete a bunch of those problems from our customers' lives,
take them on ourselves and allow them to focus on whatever unique, interesting custom problem they want.
Like, recently it might even be allow Claude Code to focus on whatever unique, interesting problem they want, right? But by kind of deleting infrastructure footguns and systems integration, we just let people succeed faster. And then, I guess I should say, bearing in mind the audience listening here, it's not for everybody, right? Like, if you have perfect knowledge of the Postgres schema you want, Gadget's probably going to be less efficient than that, right? Because we've had lots of layers of abstraction, we have our own operations and
infrastructure team kind of managing this stuff. So I'm not trying to say, like, no one needs
to care about these problems. I'm just trying to say that a lot of people, I hope, don't have to care about these problems, so that me and my team, you know, pull our hair out stressing about them instead.
All this means is there's this tension between asking the developer to be more specific about their data access patterns and adding friction on their path to success, right? Every time you say, do you want four, eight, or 32 bytes for this number, you're posing what I would call a computer baloney question, rather than, like, it's just a number, you know, screw off, I don't know how big it's ever going to be, you figure it out. So in Gadget we have, like, a number field type instead of an int and a float and a decimal. We've taken this to a pretty intense extreme, where when you do data modeling
in Gadget, you specify tables (we call them models because they include behavior as well), you specify fields, you can specify relationships among them, but you don't run any SQL to realize those schemas. And then you get a kind of fully auto-generated GraphQL API for interacting with those tables without doing anything else. You just specify your schema, you get the API. We generate a nice client for that API. We generate nice docs for that API. But the idea is, like, you say: to-do, title string, completed boolean, and now you immediately can write some React code for the form, or you can ask the AI to write some React code for a form to, like, transact against that stuff.
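Purely as a hypothetical illustration of that flow (this is not Gadget's real model format or generated API), the developer-facing input and output look roughly like a declared model and an auto-generated query:

```typescript
// Purely hypothetical illustration, not Gadget's real model format or API:
// a model declared as data, and the kind of GraphQL a platform could
// generate from it without the developer writing any SQL.
const todoModel = {
  name: "todo",
  fields: {
    title: { type: "string" },      // no "4, 8, or 32 bytes?" questions
    completed: { type: "boolean" },
  },
  relationships: {},                 // e.g. belongsTo user
};

// The generated API might then accept something like this, with filtering
// and sorting only permitted on fields the platform has indexed:
const exampleQuery = `
  query {
    todos(filter: { completed: { equals: false } }) {
      edges { node { id title completed } }
    }
  }
`;

export { todoModel, exampleQuery };
```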
All this to say, we're responsible for the schema of those tables, like the real tables under the hood, as well as the SQL that that API generates and executes.
So this is a win because we can avoid things like the N-plus-1 problem.
Like, we know how to do that good.
We can implement things like zero-downtime cutovers, or kind of maintenance modes, where it's like, this app is in the process of being moved to a different shard, hold on a second, we'll finish that and then serve your request, you know. It's powerful that Gadget sort of owns the whole stack. We can implement other stuff like cloud storage, where you can store data blobs in the right way instead of within Postgres, you know.
But it means that all we know about a given application is what we've collected from the developer, which is, yeah, that list of tables, the list of fields on the tables, and the relationships among them. Not stuff like, ah, this is a shard key, or this is how I intend to transact, or whether I'm going to make an HTTP call in the middle of my transaction two days from now when I change my code. So we picked this bed and we made it, like, on purpose. And
the idea is we want building high-quality software that scales, and meets the demands of Black Friday, Cyber Monday, to be more accessible and not locked behind, you know, whatever, between the three of us, 60 years of Postgres experience or something like that, right?
You should be able to build and get an app going way faster without that.
So the gist of it is we need some way to structure our data under the hood, within our Postgreses, such that people can't, like, take down the platform. They can't use more than their fair share. We just need to be safe by default, where we're committed to serving this API, and regardless of what strange queries you throw at it, strange transaction patterns, poorly written code, either out of ignorance or malice or whatever, it just needs to still work. Same as, like, a Firebase or, yeah, those other companies I mentioned before.
And the way that we do that is, like, kind of two main ways.
We don't give everyone their own instance and say, it's your job, like, you manage the performance.
I think that's what, say, a Neon or a Supabase does, which is still helpful, but it's different, right?
They're sort of saying, like, you get this much resources,
and within that, it's up to you to optimize.
We don't say that.
We charge kind of by operation, sort of more usage-based, like, say, an Upstash Redis or something like that. And then we consider ourselves, Gadget, on the hook for the performance of those operations.
for the performance of those operations.
And that means that, again, like, we have to find some way
to, like, not process queries that could take us down.
So we do two things.
We refuse to execute a query unless there's an index on the predicates. And that's just a safety must, basically. We can't run unindexed queries, because we can't know that they'll process fast enough to be safe to run at scale. And then we rate limit, and that allows us to say, like, you know, no one can have more than their fair share. But the crazy thing is almost no one hits their rate limits. We consider that, like, a regrettable platform-protection mechanism, as opposed to, like, a pricing mechanism or an incentive. And the reason that no one hits their rate limits is because, if you just force every query to be indexed, you get pretty far. The planner is so good, and, like, the bitmap index combining stuff gets you so far, that almost nobody... Any time we ever have, like, a database CPU spike, it's usually because something's gone wrong with our creation of the indexes, like Gadget's automations (they're Temporal workflows, surprise) have failed to set the index up properly, or we let the code that makes the API calls ship before the index was actually in place, or something like that, you know. But it's like, Postgres just works if you do this. So it's really naive, but it works.
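A naive sketch of that guard, with hypothetical field names (this is the shape of the idea, not the actual code): the API layer only builds predicates for fields it knows are backed by an index, and refuses everything else.

```typescript
// Naive sketch of a "no index, no query" guard (hypothetical field metadata):
// only indexed fields may appear in filters, so every generated query stays
// index-backed and safe to run on a shared, multi-tenant instance.
interface FieldMeta {
  column: string;
  indexed: boolean;
}

const fields: Record<string, FieldMeta> = {
  title: { column: "title", indexed: true },
  completed: { column: "completed", indexed: true },
  internalNotes: { column: "internal_notes", indexed: false },
};

export function buildFilterSQL(
  filters: Record<string, unknown>
): { sql: string; params: unknown[] } {
  const clauses: string[] = [];
  const params: unknown[] = [];

  for (const [name, value] of Object.entries(filters)) {
    const meta = fields[name];
    if (!meta) throw new Error(`Unknown field: ${name}`);
    if (!meta.indexed) {
      // Refuse rather than risk a sequential scan taking down the instance.
      throw new Error(`Field "${name}" is not filterable until an index exists`);
    }
    params.push(value);
    clauses.push(`${meta.column} = $${params.length}`);
  }

  const sql = clauses.length ? `WHERE ${clauses.join(" AND ")}` : "";
  return { sql, params };
}
```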
Are you okay to share how many indexes you've already created?
Yeah, we have, I think, probably over two million indexes in our system right now... actually, sorry, I take that back: at the time of the upgrade we did, but we've reduced that drastically since then, for kind of unrelated reasons.
But, yeah, basically, if a user wants to be able to, in their API, sort or filter on a column,
we just say there has to be an index.
And so we have, like, I don't know, most of the columns indexed.
And, like, oh, God, I can feel the hate mail now, talking to a giant legion of database engineers
who are all screaming in their cars right now.
What?
You index them all?
No, no. The modern hardware that respected Nick just mentioned means that the cost of this, while substantial, is not actually that crazy. Like, it's higher, no doubt, but we're not in the days where you need to be so incredibly stingy about every index you make, especially when the cardinality of the tables is, like, you know, a million, two million records, as opposed to, like, 10 gazillion, right? When our first 10-gazillion customer comes along, we're going to work with them closely to be like, you don't need that index, let's turn that off, your stuff will go faster. But for the vast majority of people, the overhead adds up, but it's not so meaningful that our business is busted, you know.
Yeah, so I wanted to say that this is a pets-versus-cattle
philosophy. If we think about indexes as pets, we need to get rid of unused indexes, take care of bloat. But at your scale, before the schema became sharded, millions of indexes, you think about this as cattle as well, just inside one Postgres instance, right, before it became sharded.
Yes.
This is an interesting approach, and I agree: since we have small tables, we know there is write amplification, all indexes, if it's not a HOT update, it needs to update all those indexes, some of them are not used, but since the approach, by design, is cattle, okay, it's not that bad. Yes, this is an interesting case. I must say I never saw it before. It's interesting.
I've heard of one other case,
which Heap Analytics blogged about, something about... you know the one?
Great. I sure do.
So that was the only other time I heard about this.
Was it also where they discussed that creating indexes doesn't work concurrently, right? Yeah, because you can't do more than one at the same time, to parallelize the work.
Yeah.
What's neat or strange or weird about Gadget is we kind of participate in the whole SDLC, the whole life cycle of authoring a feature, debugging it, deciding it's good, merging it, and deploying to production. Gadget has a development environment as well as the production environment, and so we kind of get to witness the moment that you want to go live. And we just do a normal-style, call it Rails or whatever, migration, where on deploy we create the indexes concurrently behind the scenes, then we ship the code that makes use of them, you know. And we can do that because we're not sort of exposing the raw SQL and asking people to manage their own schema, so on and so forth. But what do I want to share about this that I think is the most interesting? It's, like, giant catalogs, where you have, you know, we had, I think at peak, in one instance, like 250,000 tables, 1.4 million indexes.
Giant catalogs, like, suck, but were more manageable than I was guessing. I was assuming there was going to be some, like, O(N²) system table inside Postgres, or some doinky little thing that just breaks when things are that big. But no, that instance operated well. The thing that did break was a bunch of the analytics tools out there. Like pg_stat_statements... well, pg_stat_statements is okay: it was big, it takes a lot of data, but it wasn't an ungodly amount of performance overhead. But, like, pganalyze and many of the kind of collectors out there, they're like, let's get the size of every table every minute, you know, and it's like, that's a hard no.
Our monitoring also didn't work properly because of the catalogs.
Yeah, but you
guys only had to make a couple of tweaks, I think. We're talking about pgwatch2, like, the Postgres.ai remix, and it works well for us; it's kind of the fastest one. I haven't tried pgMustard, but I must... I owe...
Not a monitoring tool.
Yeah, all right, it's all right. So
that was surprising. But the other, actually, this is just a fun fact: so we did our sharding, we moved this gazillion number of tables over to Alloy, we turned on their analytics thing, it's called Advanced Query Insights, and it's not bad; it's a nice UI for kind of pg_stat_statements souped up a bit. They store all the data for it in a private database in your instance, so you have an almost-superuser, but not a full superuser, and there's this one database you can't touch that has it in it. And currently, and this is my fault, I set it all up and I was like, oh, this is probably going to be not that expensive, currently our AlloyDB insights database is almost 1.5x the size of the actual database. And I think it's because the query diversity is so high, and all the data that they're keeping about each one adds up so much that, yeah, we're paying more than double in storage just for the privilege of this thing.
And half the, like, queries time out. But again, I'm going to blame me for that, not Alloy.
We just need to turn that off.
I'm curious, if we have just one AlloyDB database and we store metrics inside the same thing, if it's down, how will we troubleshoot?
I think it is only, like, query insights, like the equivalent of pg_stat_statements.
For troubleshooting?
I guess you're right. I don't know.
Isn't the same true for Postgres, like, with pg_stat_statements?
Well, yeah. That's why we need an external monitoring tool, which will pull data and store it independently, for troubleshooting. I thought about this, like, self-observed or externally observed systems. I think we actually need both. We will need both, but if it's just self-observed, it might be a problem.
Yeah, it goes
full circle to that same thing we were talking about with the infrastructure provider earlier, which is, like, they don't have anywhere else to put it unless they're going to pay for it, right? So by putting it in my instance, I pay for it, which I am paying for dearly and am now going to turn off, you know. But I agree, you do get the, like, system metrics that they store in their, you know, fancy time-series database.
Yeah. That's... I didn't know about this. Interesting.
I think, if I may, the most interesting thing we've done to manage this ungodly number of indexes and tables is classic computer science, a hybrid strategy.
What we noticed is, like, many, many, many Gadget apps are, like, tutorials. It's like someone wants to kick the tires, or they're building a to-do app and they only ever add three to-dos. It's kind of excessive to create them one to-do table, one user table, indexes on many of those columns, to store zero to five rows.
where different gadget models are backed by different data.
We still store them in the same Postgres,
and that's so that we can still transact against them together.
Like we support, you know, real actual Postgres transactions,
but we use doinkey little JSONB tables
when the cardinality is small.
So we, like, have one
called shared JSON table
which has like an ID,
a model identifier, and then just
field values, and we don't index that at all.
No gin, no nothing.
We just kind of do a table,
you could call it a table scan, even though it's not.
It's like a range scan.
And then we just bound the cardinality of that.
And then when we witness the,
I think it's like 22,500th record,
we swap out kind of a
real table that has the real indexes,
whatever. So that cut our table count and our index count down by a lot, a lot, a lot.
Again, that swap is zero downtime, and it has to be. Like, that's, again, our commitment to our
customers. And that was actually quite difficult, because, like, someone's doing a bulk load of data or whatever, we witness that 2,500th record, we have to kind of pause traffic, do the switcheroo, resume traffic, without... like, there'll be a latency spike, but not an error spike.
And I think that's kind of Gadget as a bundle, right, where we have this back-end runtime, this front-end runtime, this database, all these things that work nicely together. And we see that as validation of the bundle showing through, where it's like, we can only do that because we have kind of control over the app tier, control over the load balancers, to be able to be like, ah, hold on a second.
We've got to do some maintenance on your app real quick.
We won't drop any requests.
We'll do a little, you know, switcheroo without anybody knowing.
And you're off to the races with the sort of full performance after that.
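As a sketch of that hybrid (the table, columns, and threshold here are hypothetical, not the actual schema or cutoff), the shared JSONB table plus a promotion check might look roughly like this:

```typescript
// Sketch of the hybrid storage strategy (hypothetical names and threshold):
// tiny models share one JSONB table; a model gets promoted to a dedicated,
// indexed table once it crosses a row-count threshold.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Shared table for small models, scanned by model_id rather than indexed per field:
//   CREATE TABLE shared_json_records (
//     id           bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
//     model_id     bigint NOT NULL,
//     field_values jsonb  NOT NULL
//   );
//   CREATE INDEX ON shared_json_records (model_id);

const PROMOTION_THRESHOLD = 2500; // hypothetical cutoff

export async function needsPromotion(modelId: number): Promise<boolean> {
  // In practice you'd track counts incrementally; a count(*) per write is
  // just the simplest way to show the idea.
  const { rows } = await pool.query(
    "SELECT count(*) AS n FROM shared_json_records WHERE model_id = $1",
    [modelId]
  );
  // Past the threshold, kick off the separately orchestrated, zero-downtime
  // promotion: pause the model's traffic, copy rows into a real table with
  // real indexes, then resume against the new table.
  return Number(rows[0].n) >= PROMOTION_THRESHOLD;
}
```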
That's a really neat trick. I like that. But you must have provisions in that code that doesn't allow queries without indexes, to, like, account for tiny tables?
Yeah, oh my goodness.
This sounds painful.
We've been working on it for, like, four years, so yeah. Yes. But I think we're in a good position. Like, I'm excited for BFCM this year, which, for operations staff, is, I think, a rare thing to hear.
Yeah.
I'm excited for the challenge, you know.
I saw the color drain from Nikolay's face.
Yeah.
Yeah.
Do you think there might be some point when true Postgres will be fully replaced in your company by something? Or will there always be a place for true Postgres, not Alloy or something Postgres-like?
Great question.
I think, yes.
I do think it will be fully replaced.
And I think that's only because we haven't seen any Alloy downsides. Like, while it's a fork, it's still 99% true Postgres. So we have, like, pgvector support, for example, and pg_squeeze support. You know, it's not, like, a wire-compatible thing; it's mostly actually Postgres with the, like, WAL and the block storage kind of ripped out.
It also has lightweight locks and so on, all those problems.
Yes.
Yeah, bloat.
Yes, true.
I think we're going to stick on Alloy, but not because it's the very best, but because now we have familiarity, now we know we can maintain it, now we can scale to Valhalla, so we've sort of unlocked something, and we've got other stuff to do. I think the thing that would make me revisit that is OrioleDB, or whatever the InnoDB for Postgres ends up being. Like, when that comes around and I don't have to vacuum anymore, that seems like a pretty epic win that I'd be interested in realizing.
Makes sense. Yeah, great answer. Thank you so much, enjoyed it, learned a lot.
I really appreciate getting to nerd out. These are the things that keep me up at night, like, 1.4 million indexes, what do I do, you know? It's nice to let it out a little bit.
Cool. Thank you.
It's our pleasure. Thanks so much.