Postgres FM - Gadget's use of Postgres
Episode Date: September 19, 2025

Nik and Michael are joined by Harry Brundage from Gadget to talk about their recent zero-downtime major version upgrade, how they use Postgres more generally, their dream database, and some challenges of providing Postgres as an abstracted service at scale.

Here are some links to things they mentioned:

Harry Brundage https://postgres.fm/people/harry-brundage
Gadget https://gadget.dev
Zero downtime Postgres upgrades using logical replication (blog post) https://gadget.dev/blog/zero-downtime-postgres-upgrades-using-logical-replication
HOT updates https://www.postgresql.org/docs/current/storage-hot.html
PgDog https://pgdog.dev
Multigres https://multigres.com
Neki https://www.neki.dev
Running 10 Million PostgreSQL Indexes In Production (Heap blog post) https://www.heap.io/blog/running-10-million-postgresql-indexes-in-production
pgwatch2 (Postgres.ai Edition) https://gitlab.com/postgres-ai/pgwatch2
Advanced query insights on AlloyDB https://cloud.google.com/alloydb/docs/advanced-query-insights-overview
OrioleDB https://www.orioledb.com

Not discussed but relevant: Gadget have also now published a blog post about their sharding! https://gadget.dev/blog/sharding-our-core-postgres-database-without-any-downtime

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With credit to:
Jessie Draws for the elephant artwork
Transcript
Hello, and welcome to Postgres FM.
A weekly show about all things PostgreSQL.
My name is Michael. I'm the founder of pgMustard, and I'm joined, as usual, by Nick, founder of Postgres.ai.
Hey, Nick.
Hi, Michael.
And today we have a special guest.
We have Harry Brundage from Gadget, co-founder and CTO there.
Hey, Harry, thanks for joining us.
Thanks so much for having me on.
I'm happy to be here.
It is our pleasure.
So you've been doing some cool things with Postgres for a while.
You recently blogged about a zero-downtime upgrade,
but there's a ton of other things we wanted to talk with you about as well.
Where would you like to start?
Yeah, I'd love to tell you about this zero-downtime upgrade.
Just because, I don't know, it's like a point of pride of mine, I guess.
It's always so frustrating to me when I get an email from one of my providers.
It's like, hey, Sunday at 2 p.m., we're going to be down for 30 minutes.
Like, hope that's okay.
You know, so, yeah.
Gadget's an infrastructure provider; the developers who use us, use us for mission-critical apps, so we just try to take it more seriously than that and, you know, not disappear. So yeah, we did a zero-downtime upgrade with the help of Postgres.ai. And Nick, thank you so much for your help.
What a treat.
Thank you for trusting us, because, you know, we did it with other companies, and always, in the beginning, I hear pushback once people understand how much work is needed there, because it's far from being well automated. It should be automated in Postgres, and it's improving, but not quite yet. So you did this journey, and thank you for not stopping.
Absolutely, absolutely. Yeah, so it was a lot of work, it was. Yeah. I think, too, it just kind of feels to me like something that should be built into the managed providers. I'm kind of surprised that so few of them have it. I think the only one I could find that has, like, true zero-downtime is PlanetScale.
But for Postgres?
The GCP options.
I don't know about the new Postgres one.
I guess I know they have it for MySQL, but not for Postgres yet, eh?
For MySQL, yeah, I think so.
I don't know details.
But for Postgres, I don't see this anywhere yet.
But again, like, what the heck?
Like, as an industry, shouldn't this have been a thing we kind of, like,
solved 20 years ago and never looked back, you know?
I don't understand.
Yeah. And there's no AI needed there. It's just pure automation. That's it.
Yeah. Terrifying button to press.
I think...
Yeah, I think what happens is that it should be automated more in Postgres itself. And most managed platforms, like RDS and others, they just sit on the fence. Well, I know they work on it, but they don't want to automate things outside Postgres too much. For example, ANALYZE, right? ANALYZE is not automated, you always forget about it, and they don't automate it; they just put it in the documentation, on the shoulders of users: now it's time to run ANALYZE. And I started asking around, like, a couple of years ago, maybe one year ago: why, why, why? And then, okay, Postgres 18 should be out next week, right? Finally, dump and restore of statistics will be fully automated, so they will just exclude this from their how-to manuals. So this is an example of things that should be inside Postgres, but I think they still could make it much better themselves, not waiting for Postgres itself. Anyway, I'm with you here 100%.
I'm glad... What I was gonna say is, I've always attributed the absence of something like a true zero-downtime major version upgrade to the fact that it's more than just Postgres that you need to pull it off, right?
Like you need a connection pooler or a proxy or something in front of it sitting outside the system
and you need a bunch of tight coordination between that proxy and the rest of the system to kind of orchestrate it all.
And so, like, I know at least on our previous Postgres provider, they ran the pooler on the same VM as Postgres itself. And so they're kind of hemmed in by the architecture, where they can't have this piece that lives outside that can talk to two at once, whereas we self-hosted our bouncers for exactly this reason and can kind of do it all and orchestrate it all. And it, like, sucks to self-host it, but it gets you that architectural flexibility, you know. We needed that.
Yeah, and PgBouncer supports pause and resume, which helps you achieve that.
Absolutely, absolutely.
We're maybe nuts, too, and we run our PgBouncers in K8s and do the whole, you know, rolling restart and whatever. And it's turned out to be, I think, an okay choice. I think a lot of people maybe shy away from that because it's sort of super stateful, super critical; maybe this is a pet, not cattle. But we did it in K8s just because that's where the rest of our compute runs, and it's easy to sort of bin-pack and share it all. It actually really saved our butts recently, because we were paying a lot for inter-zonal network transit to our databases, specifically once from our application pod to the bouncer and then again from the bouncer to the database. And because the bouncer was in K8s, we could turn on the handy-dandy, like, prefer-zone-local network routing configurations and kind of have the application pod go to a zone-local bouncer and then to the database. And I think, were we outside of K8s, that would have been a lot more annoying, to kind of teach all the routing to be able to do the right thing.
Nick, you look scared.
You look skeptical.
No, no, I'm just thinking, this is also a job for a database platform, I think, to take care of such things. So, yeah, I know RDS has RDS Proxy, but it has cool features PgBouncer doesn't, and it also lacks cool features PgBouncer has. So it's a zoo. And, yeah, I think pause/resume is a wonderful feature.
Yeah, absolutely.
I'm going to emphasize it.
I was going to ask about your PgBouncer setup, actually, because your post on the upgrade is really cool and goes into good detail on the actual upgrade process, but you didn't talk much about your PgBouncer setup, like how many you're running, and does all of your application traffic go through that, or does any of it skip it and go straight to the database? Is there anything interesting about that that you wanted to share with people?
Good question. I have two interesting tidbits. The first is, our PgBouncer deployment's a bit weird: we have nine instances.
Interesting.
Yeah.
And we take our kind of total connection count where we're targeting 400 kind of server-side
connections and split that up over nine.
And the reason for that is that same K8s network-aware routing thing, where, up until very recent versions of Kubernetes, you needed three endpoints per availability zone for the network-aware routing to kick in.
So we had to use a minimum of nine.
And initially, we were really skeptical about that because it seems kind of bad, right?
Instead of having like, you say, one or two buckets of a lot of connections, we were kind of
brokering them fairly.
You have many small buckets and you could get kind of weird inefficiencies.
But it didn't come to pass where that bit us.
It seemed actually okay in the end.
And what I like about it is we have a lot of CPU headroom, right?
Like, PgBouncer, I think, is still largely single-threaded.
I don't know the innermost details of it.
But we're using the fancy new prepared statement stuff within it.
And so I'm kind of happy that I can give it nine cores instead of two or four or something like that and kind of be like, you know, take as much as you need.
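To make the arithmetic concrete, here's a rough sketch of splitting a fixed server-connection budget across nine bouncers; apart from the 400/9 split mentioned above, every value is an illustrative guess rather than the real configuration (max_prepared_statements is the PgBouncer 1.21+ setting behind the prepared-statement support mentioned).

```typescript
// Rough sketch (illustrative values, not the real config): splitting a global
// server-connection budget across N PgBouncer replicas and emitting a
// pgbouncer.ini fragment for each one.
const TARGET_SERVER_CONNECTIONS = 400; // total Postgres backends we're willing to use
const BOUNCER_REPLICAS = 9;            // 3 per zone, so topology-aware routing kicks in

const poolSizePerBouncer = Math.floor(TARGET_SERVER_CONNECTIONS / BOUNCER_REPLICAS); // 44

const pgbouncerIniFragment = `
[pgbouncer]
pool_mode = transaction
default_pool_size = ${poolSizePerBouncer}
max_client_conn = 5000
; PgBouncer 1.21+ prepared statement support in transaction pooling
max_prepared_statements = 200
`;

console.log(pgbouncerIniFragment);
```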
Yeah, nice.
Always good to have headroom.
Cool.
Anything else about the upgrade that, or like, which bits were most scary when Nick came to you with the proposed plan?
You said pushing the button.
Is it kind of because it's single direction?
Did you have a rollback plan?
Good question.
We had a rollback plan, but it was written rather than tested. Because of the way that we pulled it off, or the way that we kind of built the whole thing: we built a Temporal workflow. We're, like, huge fans of Temporal; it's very easy to kind of script something like this. And the reason why we did this is because it lets us get all the nice retries and idempotency stuff for free, where we get to write in a somewhat failure-oblivious way. And we wrote our workflow to have, like, a dry-run capability where we would do all but the very last step. So we would go do everything: we prepared the new host, kind of replicated the data, made sure we were caught up, did the PgBouncer pause, and then we just wouldn't reload the config or change over; we'd just do a resume, and everything would keep on going sort of as it was. And so the terrifying part is we'd just never actually done the middle thing, and we weren't really sure how we could test this, other than standing up a whole parallel set of infrastructure, and we were like, I'm not even sure that will prove this will work, right? Just in my admittedly still limited lifespan, the things that go wrong are always the things you didn't foresee, right? So it's like, we dry-ran it a million times, found all the things we could foresee, and then it was, like, the spooky unknown, you know.
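The blog post has the real details, but as a sketch of the shape being described, a dry-runnable workflow in the Temporal TypeScript SDK might look something like this; the activity names and options are hypothetical, not the actual workflow.

```typescript
// Sketch of a dry-runnable upgrade workflow using the Temporal TypeScript SDK.
// Activity names and timeouts are hypothetical; the point is that a dry run
// exercises every step except the irreversible cutover.
import { proxyActivities } from "@temporalio/workflow";

interface UpgradeActivities {
  provisionNewCluster(): Promise<void>;
  startLogicalReplication(): Promise<void>;
  waitForReplicationCatchup(): Promise<void>;
  pausePgbouncer(): Promise<void>;
  repointPgbouncerAndReload(): Promise<void>; // the one step a dry run skips
  resumePgbouncer(): Promise<void>;
}

const acts = proxyActivities<UpgradeActivities>({
  startToCloseTimeout: "30 minutes",
  retry: { maximumAttempts: 5 }, // Temporal retries failed activities for us
});

export async function majorVersionUpgrade(opts: { dryRun: boolean }): Promise<void> {
  await acts.provisionNewCluster();
  await acts.startLogicalReplication();
  await acts.waitForReplicationCatchup();

  await acts.pausePgbouncer();
  if (!opts.dryRun) {
    // Only a real run performs the actual cutover to the new primary.
    await acts.repointPgbouncerAndReload();
  }
  await acts.resumePgbouncer();
}
```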
impossible to test switchover in production but I wanted to emphasize it's possible to
test everything before that and it's great like because this is beauty of this
procedure you like create the whole cluster which is running already a new
version you like see it's working you can touch it and it's great like and so on
but yeah I agree the scariest part is switch over how to test it properly of
course it was tested in non-production right but it's different
Yeah, here we just need to jump at some point.
Exactly. And it was the best kind of jump, where it was like a big meeting that we were all scared of. We all sat down, Nick was on the phone, and it was over in, like, seven minutes, where it was like, oh, well, guess it worked, all the numbers look good, onwards, you know. So I feel fortunate to work with a team where, you know, that was how it went, as opposed to us running around like chickens with our heads cut off.
And I wanted to thank you for emphasizing that you use this Temporal framework, because we, at Postgres.ai, were always big fans of Go, and we developed a lot of stuff using Go, but recently our team has a few Python guys, and I just envision that we'll probably learn from you and this experience as well and do some stuff with Temporal too. So thank you for emphasizing it.
Yeah, we're, like, heavy, heavy users of it. We use it for these kinds of systems administration tasks, or, like, behind-the-scenes maintenance workflows, but we use it for a lot of other stuff too. Gadget is mostly used today by people building Shopify apps.
And when you build Shopify apps, you have to receive a metric shit ton of webhooks.
And so Gadget has, like, a background queuing system that knows how to take a webhook
and process it later or process it with retries or whatever.
So each one of those is a temporal workflow for us.
And so we're doing, I don't know, something like 3K, 4K Temporal state transitions per second, also backed by Postgres, in this, like, big-ass Temporal cluster. And that's something I never, ever want to have to try to build myself. It's so complicated and nasty. And, you know, fun for some.
You mentioned in the post also that sharding was on your to-do list.
Yes.
You mentioned, though, that you're using a different database for that now, AlloyDB specifically, I think, for the new setup, but you said it was just for the user data. What's the separation there? With, like, Shopify, is it the data of the Shopify apps, for example?
Yeah, so Gadget is like a general-purpose app development platform. So if you want to make a to-do list tracker or a ChatGPT clone or a Shopify app, you can sign up for Gadget and you get kind of like a full-stack dev environment. So you get a place to run backend code (it's all JavaScript/TypeScript), you get a database, or a sort of virtual slice of a database, to store backend data, you get a front end in React and whatever, and you're off to the races.
You write backend code, you write front-end code, you define your data models, you transact, so on and so forth. And then the Shopify part comes in, where we kind of have a managed API connection to Shopify, where we'll do the OAuth and the webhooks and sync the historical data, et cetera, et cetera.
So like many infrastructure providers, we have a control plane and a data plane.
The control plane stores like our list of users, our kind of who has what,
access to what app, so on and so forth.
And then the data plane is the actual app's data.
Like, this app has a to-do model that stores, you know, this many to-dos.
This app has the Shopify product model and, you know, stores this many Shopify products.
So when we started, we just stuck those in the same Postgres.
And then that Postgres got too big.
And so this major version upgrade kind of precipitated splitting the control plane and the data plane,
which I think is a pretty obvious best practice.
When you say too big, by which metric? Where was the bottleneck coming from?
Good question. So we had about eight terabytes of data in the one instance, or, sorry, our disk was eight terabytes big and we were at six or something like that, six and a half, and we were like, okay, this is scary, we've got to figure out the next thing. And we were on PG 13, which was end-of-life long ago. And so we needed to both do the upgrade, and we just wanted kind of a better roadmap for how we were going to deal with growth.
Because Gadget has lots of these e-commerce customers, BFCM is always a quote-unquote "fun" time for us. And we'll see probably this year, like, a 4x traffic increase and a 4x database QPS increase. And so that means we kind of need to be ready for a 4x-ing of database CPU.
And our provider at the time, it didn't sell a big enough one.
So we're kind of like, we have to do the horizontal scaling sort of journey.
So what we did is we sharded in order to move the bulk of that six terabytes out
into sort of newer instances that had much more room to grow,
leaving only the sort of reasonably sized control plane left over to do this major version upgrade.
And it simplified a few of the steps of this zero-downtime major version upgrade
because we didn't have to replay six terabytes worth of stuff on the new instance.
Yeah, yeah. Makes sense. But why not stick with the existing provider for those shards?
Two reasons. Or three reasons. The first is we're a Google shop and we get good pricing from Google. The second is Google claims up to a 4x TPS increase with AlloyDB. Like, I like Google and I don't want to poo-poo them, but I'm not sure I've ever seen it actually hit 4x.
But we did see like at least a 1.5x kind of like TPS increase without really much config change or tuning or whatever.
I can't even say if it's actually the same CPUs like under the hood.
You know what I mean?
I don't know what the initial provider was using.
But the instance was faster for the same price.
So that was nice.
And then the other big thing with Alloy is you don't pay inter-zonal network transit fees. It's not like a normal VM. I present that without comment to the competitive forces that may or may not be listening. But yeah, it's free networking to the Alloy instance, whereas, like, if you're, you know, Aiven or Crunchy or whatever, and you run instances, your customers have to pay from two of the three zones just to talk to it, so that sucks.
And then the second thing is you don't pay for replica storage.
It's a split storage-compute sort of setup, where the disk is this sort of auto-scaling, elastic, fancy log-storage service thing, and then you pay for your nodes that read and write from that shared tier, but you don't pay N times, you pay it once.
So it's like 40-ish cents a gigabyte, which is like more expensive than, like, say, a big fat EBS
volume, but not that much more. And so if you're running like a lot of replicas, it ends up being
cheaper storage-wise, and you don't have like all the nasty I.O. spikes from backups or any of that
stuff.
Cool.
And unlike Aurora... I don't see the... like, the hidden costs with Aurora usually are your I/O costs.
Yeah, no I/O costs. Yeah, that's interesting. So our database bill came out to be about the same as a pretty classic GCP VM with, you know, they're not EBS in GCP, but, like, fat SSD volumes backing it. And we save money on the networking, we saved money on not having to pay for storage again for the replicas, and we deleted our I/O-spike-from-backups problem.
That's an interesting comparison. And we're not even talking about features like the column store versus row store; even without that, just a direct comparison. Interesting.
Yeah. We intend to make use of that column store. Like, I'm skeptical; I used to work on sort of the big data warehousing team at Shopify when I worked there. And I just don't really believe that you can force Postgres to do the kind of super-wide-throughput stuff you need for crazy-scale analytics.
But you can definitely get pretty far with it.
You know what I mean?
So my guess is like some of our operational reporting use cases where you're traversing
half a million, a million rows to answer a query on a read replica with this column store is
going to be fine.
But in the like billion, 10 billion scale, I think purpose built analytics databases still
should exist.
We've had a few of those conversations recently.
I'm sure, I'm sure.
I had one more boring question on the major version upgrade
before we move on to more interesting things.
It looks like you were on Crunchy Data before
for the control plane stuff
and are going to be again for the upgrade.
Yep.
But you only upgraded, I think, from 13,
which, by the way, you were a bit harsh on yourself.
I don't think it's quite yet out of support.
I think it's like a few months.
Oh, really? Oh, good for me.
In a couple of months' time, it'll be out of support. But you only moved to 15.
And I think Crunchy have always been very good at supporting the latest versions.
So probably 17 is available.
So why not, like, why 15, not 16 or 17?
Yes, Crunchy has been great, that kind of thing.
I think the main reason was, shoot, you know what?
I don't even remember.
Maybe Nick knows.
I also don't remember.
I remember that Crunchy had different end-of-life dates somehow.
Crunchy Bridge had different dates.
Yeah, they deprecate support, or they stop sort of long-term support, before the full-on end of the kind of core support. But, in fairness, they gave us extremely advance warning, we knew it was coming for a long time, and they supported it, you know, for four years or whatever, which is, I think, pretty good.
I don't remember.
I think it might have been a desire to keep the versions homogenous across all our instances.
Sure.
And for really silly reasons, the Alloy shards that we stood up were stuck on 15 and couldn't be newer.
Yes, that would explain it.
So Alloy, I think because it's a fork, and it's probably quite hard to maintain a fork, I don't think they're as good at keeping up with the major versions.
So there's a good chance you wouldn't even have the option of going to 16 or 17 on there.
I don't know for sure.
That's correct. That's correct.
At the time, 16 was GA and 17 wasn't even out yet.
Now, 17 is out, but it's still in preview.
Like, you can't, or shouldn't, run production workloads on Alloy 17. But the weirdest thing was we wanted to go to 16, but they just don't have pg_squeeze support on their 16 version. They have it on 15 and they have it on 17 and not on 16, and I couldn't tell you why, but we squeeze a lot, so we really cared about that.
That makes so much sense.
In my head, the only reason I could think of is because now you're so good at major version upgrades, you wanted to give yourself more to do in the future.
Let's rinse and repeat, yeah.
Nice. All right, so you're running Postgres for a lot of different workloads. Are there any challenges or fun things you wanted to share on that front?
I think maybe a good one, just for the audience: if you're listening out there and you think I'm full of it, please send me an email, I'd love to know what you think. But we did the classic Postgres-as-a-queue thing early in our life.
And I think that that, well, it's on my mind a lot.
So a lot of people who have like background queuing problems use Redis.
When we were at Shopify or when I was at Shopify,
I used this library called Resque. There's another really popular one called Sidekiq. And in Node, it's BullMQ.
There's a litany of them, right?
And they all seem to use Redis.
We decided that, for our user, who tends to be like a front-end dev, or sort of someone who isn't deeply familiar with, like, data consistency or data durability concerns, we didn't really want to be like, hey, we'll happily process your background jobs, but we'll drop 0.5% of them and not be able to tell you which ones we dropped and not be able to tell you when it happened. You know? So, for our background queue product, we said it's going to be durable.
Not only is it going to be durable, it's going to be transactional, where when you do some business logic and you enqueue a job within a transaction, if the transaction commits, then the job will get enqueued. And if the transaction doesn't commit, the job won't be enqueued.
And that's like the transaction boundary.
So we were like, this is nice.
This is us kind of trying to prioritize our customers' sanity.
And I think it's the kind of thing that's like annoying because not that many of them are going to notice or care.
It's just going to be no data quality issues for this like business logic.
But that locked us into needing Postgres to be on the hot path for those enqueues, right? We have to participate in that business logic transaction to kind of, like, save the thing. We knew we didn't want to build a giant queue of our own, so we knew we were going to get it into Temporal and let Temporal take over, but we sort of needed like a Postgres outbox, if that makes sense, where it's like you write to the outbox, and then you're slowly forklifting out of the outbox into the actual system eventually, and then Temporal itself is durable, and so, yeah. So that's the architecture we went with.
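As a rough illustration of that outbox idea (table and column names are made up, not the real schema), the enqueue rides along in the caller's transaction and a separate forklift process later hands committed rows to Temporal:

```typescript
// Rough illustration of a transactional outbox (hypothetical table/columns,
// not the real schema): the job row commits or rolls back with the business write.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function markPaidAndEnqueueReceipt(orderId: string): Promise<void> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");

    // The business write...
    await client.query("UPDATE orders SET status = 'paid' WHERE id = $1", [orderId]);

    // ...and the enqueue, in the same transaction: if the UPDATE rolls back,
    // so does the job, which is exactly the guarantee described above.
    await client.query(
      "INSERT INTO job_outbox (kind, payload) VALUES ($1, $2)",
      ["send-receipt", JSON.stringify({ orderId })]
    );

    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}

// A separate forklift loop then reads committed outbox rows, starts the real
// Temporal workflow for each one, and marks the rows as handed off.
```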
But as we sort of built it out more and more, we learned about more limitations of Temporal. Like, it came out of Uber, from this system that kind of predates it called Cadence. And my guess is they learned all this stuff the hard way and built in a bunch of limits that make it, you know, easier for the operators to guarantee certain sort of performance characteristics, but really annoying for people like me. So one of those limits is, when you start a job, there's a maximum payload size that's pretty small. I forget what it is exactly; I think it's like 50 megs or five megs or something like that. And then, as that job executes, it can only generate X amount of state transitions, they're called, you know, events, in its processing, where it can only have a certain number of retries, basically, and you need to sort of re-architect your Temporal workflows if you want to overcome that. And again, in our case, our users are kind of like, hey, Harry, I have this 400-megabyte video file that I'd like to process in a background action. Can you just make it work, please? Can you not make me, like, re-architect my entire thing to toss it into cloud storage or whatever? It's ephemeral data, I don't need it in the cloud, I just want to push it to this thing and work on it in the background. So we found that the Temporal limitations were annoying enough that we actually needed to stage a bunch of each job's data outside of Temporal. And so we ended up kind of having this nasty hybrid, where that Postgres outbox I mentioned sort of became, like, a store for a bunch of extra stuff, and then the Temporal workflow, as it went, would be, you know, transacting with its state storage and doing its state transition things, but then also updating what is now no longer really an outbox to be our sort of source of truth: it powered our UI, it had all these extra params, yada, yada, yada. So we weren't even really using Postgres as a
queue; we weren't having the classic, like, SKIP LOCKED problems and whatever, but we were just having this sort of request-scale data churn, where one webhook would be one write to this outbox to enqueue, one write to start processing, one write to finish processing. So just massive, massive tuple turnover. And we also needed to build a UI on this stuff, so we could only get a few of them to be HOT updates; many of them were not. And so we were just destroying this poor Postgres instance with what I would call less important data than the actual data plane data, but still important enough that we felt it merited being on a disk. So all that to say, we ditched Postgres for this and we
switched to Bigtable, which is, like, Google's fancy super-distributed key-value store, and it friggin' sucks. It really sucks, and the reason is no transactions, right?
It's like a way worse model for me as a programmer and a systems designer and a way
more scalable, fundamentally, you know, supersized system under the hood. But boy, oh boy, do I miss
the creature comforts, we'll call them, of secondary indexes and transactions and like a sane
query language, let alone, you know, the gazillion and one features that a, you know, full
relational system has over a key value store like that. So we use Postgres, I think, by default
for all our workloads. And then when we find something that we're like, okay, I think
there's maybe a better fit here.
We'll try to switch it out for the next best thing.
But it's like, it's really hard to beat.
And I know this is maybe unsurprising on a Postgres podcast,
but it's like, I just wish I didn't have to do that.
I could have stayed with my creature comforts and got my scale, you know?
Yeah, so I'm curious what was the number one reason.
Is it like bloat, maybe, or?
Yeah, it was, like, bloat, tuple turnover, and just the fact that we were, like, always vacuuming.
Yeah, yeah, I can imagine.
Another big motivation for switching to something akin to Bigtable is, like, compute elasticity, where Bigtable auto-scales by default. You can't even turn it off, I don't think; they just add more nodes as demand kind of grows. And for us, the background subsystems are the thing that absorbs load, right? When you have a flash sale on your Shopify shop and you need to enqueue a bunch of stuff, we need this thing to be able to handle unpredictable surges. And it was a bummer: at the time, we didn't have any way to do a zero-downtime, like, non-disruptive Postgres resize. Now we do, but even then, I don't really want to be doing it automatically, like, in response to load. I kind of want something like Bigtable, where, like, five seconds later, you can get your capacity expansion.
Yeah.
Yeah.
Well, I can see it easily.
Yeah.
I think, like, my ideal would be a million tiny Postgreses, and have some fancy system for, like, disaggregating a logical Postgres from a physical VM. Where, let's say, I shard my workload into 16,000 Postgreses, but they only run on four VMs, and then as I need more compute, they kind of split out, get bigger, and whatever. But that's like a twinkle in old man Harry's eyes.
Old man Harry's eyes.
There is work in this direction by multiple teams right now, so we have good...
Go on? Go on?
Yeah, there is PgDog, there is Multigres, and PlanetScale also. They're all working in this direction.
And Multigres specifically, it's one of the things Sugu mentioned as well: from his experience at YouTube, he loved having lots of smaller databases, and he much preferred that in terms of reliability and localized outages even. So there were just so many benefits. Highly automated operations are also on the roadmap, like upgrades as well. And rebalancing.
Yeah.
Yeah.
Forgive me for another brief aside, but are you familiar with Azure Cosmos DB?
Like the kind of main one, not the Citus one, the main one?
I thought Cosmos was Citus.
I get confused by their names, though.
So I think there is Azure Cosmos DB for Postgres, which is Citus.
But I think there's, and I'm not deeply familiar, so I may be totally wrong here.
But I think there is a different, original Cosmos DB that's kind of the Azure answer to Dynamo and Bigtable. Like, they're kind of super-scale, sort of less sophisticated query patterns, but guaranteed scale for whatever.
The way that it works, or the way that it scales, is you specify a partition key with every piece of data, and you're only allowed partition-local transactions. So it's not like a Spanner or a Cockroach where the whole key space can be transacted against together; it's just, like, a million tiny partitions, and then every partition can be a max of 50 gigs. And to me, that's so desirable, because it means, like, in one little partition you get node-local performance, but the same horizontal scaling properties of the kind of super-scalers, without necessarily two-phase commit or Paxos or whatever. And so I'm kind of hoping someone manages to do that. I think it's maybe, what's it called, Neki, like the Vitess for Postgres. Did I get that right?
Yeah, that's, uh, it's N-E-K-I. I have no idea how to pronounce it.
Yeah, same, I have no idea. But I'm hoping that's what it is, which is, like, tiny little boundaries that force me to design my data in a way that makes the transactions efficient, where you don't need a bunch of computers to agree on a transaction outcome and instead you can just funnel them to the right place.
Unfortunately, they
all, not all, but many of them came to me and asked: have you used 2PC? I said no, I always avoided it. But so they need it, unfortunately. Of course, some workloads will avoid it, but some workloads will still need 2PC. This is the sad truth.
Totally, totally. I think, like... so I
used to work at Shopify circa, I don't know, 2012, 2013. We were running like a MySQL fleet at
the time, and we felt constrained by the biggest database servers we could buy. Like, we were sharding, but it was still like the biggest customers could push one of the instances, I don't know what it was, probably 64 threads, to the max. I'm not really feeling that way anymore. Like, I feel like there are few workloads that don't have a natural partitioning, where you need more than one server's worth of compute just to satisfy one partition's worth of work. I mean, maybe you talked about this last time, but it's like, you can always split it up. And when you split it up, you need, in aggregate, many computers, but one transaction boundary doesn't need, you know, more than one. And the hardware has just gotten so, so much better in that 10 years that I'm not sure I could grow my company fast enough to need that, right? It's like, maybe if we had a graph problem, maybe if we were Facebook or whatever, but as it stands right now...
That's true.
And on GCP, if we don't look at AMD, if we look only at Xeon Scalable fifth generation, the best available C4 nodes, I think, right, you can get 192 cores or so. But if you go to AWS, you will have even more, almost, I think, 800 cores, also the same type of processor. And this is already huge, so you can scale one node.
It's just bonkers. Like, that's ridiculous.
Yeah, it's a lot of Dogecoins, you know. Sorry. So that was a long-winded rant, but what more can I tell you about that Temporal-and-Postgres workload setup thing before we move on?
You mentioned one of the most interesting things you're working on is that you're serving other people's workloads, with, like, schemas and design decisions that you don't control. How do you stay sane while doing that, and what are the interesting parts of that?
Great question. You hit me with the question that just invalidates my entire prior rant, which is, like, yeah, you can partition your workload if you think hard about the schema design.
But yeah, so, okay, let me set the stage a little. So Gadget's reason for being, the reason why we're working hard on this thing, is because we kind of believe software development is still unnecessarily hard. That when you go to spin up an app on the weekend, or your next big business, or just solve a small problem in your life with software, you spend a lot of time on what we call baloney problems: provisioning infrastructure, setting up connection poolers, worrying about the autovacuum schedule, so on and so forth.
We just think that's all a distraction from actually building something useful.
It matters.
Like you can't ignore it, but you also ideally don't want to have to deal with it.
So, similar to the people who came before us, like the Firebases of the world,
we kind of want to delete a bunch of those problems from our customers' lives,
take them on ourselves and allow them to focus on whatever unique, interesting custom problem they want.
Like, recently it might even be allow Claude Code to focus on whatever unique, interesting problem they want, right? But by kind of deleting infrastructure footguns and systems integration, we just let people succeed faster. And then, I guess I should say, bearing in mind the audience listening here, it's not for everybody, right? Like, if you have perfect knowledge of the Postgres schema you want, Gadget's probably going to be less efficient than that, right? Because we've had lots of layers of abstraction, we have our own operations and
infrastructure team kind of managing this stuff. So I'm not trying to say, like, no one needs
to care about these problems. I'm just trying to say that a lot of people, I hope, don't have to care about these problems, so that me and my team, you know, pull our hair out stressing about them instead.
All this means is there's this tension between asking the developer to be more specific about their data access patterns and adding friction on their path to success, right? Every time you say, do you want four, eight, or 32 bytes for this number, you're posing what I would call a computer baloney question, rather than, like, it's just a number, you know, screw off, I don't know how big it's ever going to be, you figure it out. So in Gadget we have, like, a number field type instead of an int and a float and a decimal. We've taken this to a pretty intense extreme, where when you do data modeling
in Gadget, you specify tables (we call them models because they include behavior as well), you specify fields, you can specify relationships among them, but you don't run any SQL to realize those schemas. And then you get a kind of fully auto-generated GraphQL API for interacting with those tables without doing anything else. You just specify your schema, you get the API. We generate a nice client for that API. We generate nice docs for that API. But the idea is, like, you say: to-do, title string, completed boolean, and now you immediately can write some React code for the form, or you can ask the AI to write some React code for a form to, like, transact against that stuff.
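Purely as a hypothetical illustration of that flow (this is not Gadget's real model format or generated API), the developer-facing input and output look roughly like a declared model and an auto-generated query:

```typescript
// Purely hypothetical illustration, not Gadget's real model format or API:
// a model declared as data, and the kind of GraphQL a platform could
// generate from it without the developer writing any SQL.
const todoModel = {
  name: "todo",
  fields: {
    title: { type: "string" },      // no "4, 8, or 32 bytes?" questions
    completed: { type: "boolean" },
  },
  relationships: {},                 // e.g. belongsTo user
};

// The generated API might then accept something like this, with filtering
// and sorting only permitted on fields the platform has indexed:
const exampleQuery = `
  query {
    todos(filter: { completed: { equals: false } }) {
      edges { node { id title completed } }
    }
  }
`;

export { todoModel, exampleQuery };
```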
All this to say, we're responsible for the schema of those tables, like the real tables under the hood, as well as the SQL that that API generates and executes.
So this is a win because we can avoid things like the N-plus-1 problem.
Like, we know how to do that good.
We can implement things like zero-downtime cutovers, or kind of maintenance modes, where it's like, this app is in the process of being moved to a different shard, hold on a second, we'll finish that and then serve your request, you know. It's powerful that Gadget sort of owns the whole stack. We can implement other stuff like cloud storage, where you can store data blobs in the right way instead of within Postgres, you know.
But it means that all we know about a given application is what we've collected from the developer, which is, yeah, that list of tables, the list of fields on the tables, and the relationships among them. Not stuff like, ah, this is a shard key, or this is how I intend to transact, or whether I'm going to make an HTTP call in the middle of my transaction two days from now when I change my code. So we picked this bed and we made it, like, on purpose. And
the idea is we want building high-quality software that scales, and meets the demands of Black Friday, Cyber Monday, to be more accessible and not locked behind, you know, whatever, between the three of us, 60 years of Postgres experience or something like that, right?
You should be able to build and get an app going way faster without that.
So the gist of it is we need some way to structure our data under the hood, within our Postgreses, such that people can't, like, take down the platform. They can't use more than their fair share. We just need to be safe by default, where we're committed to serving this API, and regardless of what strange queries you throw at it, strange transaction patterns, poorly written code, either out of ignorance or malice or whatever, it just needs to still work. Same as, like, a Firebase or, yeah, those other companies I mentioned before.
And the way that we do that is, like, kind of two main ways.
We don't give everyone their own instance and say, it's your job, like, you manage the performance.
I think that's what, say, a Neon or a Supabase does, which is still helpful, but it's different, right?
They're sort of saying, like, you get this much resources,
and within that, it's up to you to optimize.
We don't say that.
We charge kind of by operation, sort of more usage-based, like, say, an Upstash Redis or something like that. And then we consider ourselves, Gadget, on the hook for the performance of those operations.
for the performance of those operations.
And that means that, again, like, we have to find some way
to, like, not process queries that could take us down.
So we do two things.
We refuse to execute a query unless there's an index on the predicates. And that's just a safety must, basically. We can't run unindexed queries, because we can't know that they'll process fast enough to be safe to run at scale. And then we rate limit, and that allows us to say, like, you know, no one can have more than their fair share. But the crazy thing is almost no one hits their rate limits. We consider that, like, a regrettable platform-protection mechanism, as opposed to, like, a pricing mechanism or an incentive. And the reason that no one hits their rate limits is because, if you just force every query to be indexed, you get pretty far. The planner is so good, and, like, the bitmap index combining stuff gets you so far, that almost nobody... Any time we ever have, like, a database CPU spike, it's usually because something's gone wrong with our creation of the indexes, like Gadget's automations (they're Temporal workflows, surprise) have failed to set the index up properly, or we let the code that makes the API calls ship before the index was actually in place, or something like that, you know. But it's like, Postgres just works if you do this. So it's really naive, but it works.
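A naive sketch of that guard, with hypothetical field names (this is the shape of the idea, not the actual code): the API layer only builds predicates for fields it knows are backed by an index, and refuses everything else.

```typescript
// Naive sketch of a "no index, no query" guard (hypothetical field metadata):
// only indexed fields may appear in filters, so every generated query stays
// index-backed and safe to run on a shared, multi-tenant instance.
interface FieldMeta {
  column: string;
  indexed: boolean;
}

const fields: Record<string, FieldMeta> = {
  title: { column: "title", indexed: true },
  completed: { column: "completed", indexed: true },
  internalNotes: { column: "internal_notes", indexed: false },
};

export function buildFilterSQL(
  filters: Record<string, unknown>
): { sql: string; params: unknown[] } {
  const clauses: string[] = [];
  const params: unknown[] = [];

  for (const [name, value] of Object.entries(filters)) {
    const meta = fields[name];
    if (!meta) throw new Error(`Unknown field: ${name}`);
    if (!meta.indexed) {
      // Refuse rather than risk a sequential scan taking down the instance.
      throw new Error(`Field "${name}" is not filterable until an index exists`);
    }
    params.push(value);
    clauses.push(`${meta.column} = $${params.length}`);
  }

  const sql = clauses.length ? `WHERE ${clauses.join(" AND ")}` : "";
  return { sql, params };
}
```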
Are you okay to share how many indexes you've already created?
Yeah, we have, I think, probably over two million indexes in our system right now... actually, sorry, I take that back: at the time of the upgrade we did, but we've reduced that drastically since then, for kind of unrelated reasons.
But, yeah, basically, if a user wants to be able to, in their API, sort or filter on a column,
we just say there has to be an index.
And so we have, like, I don't know, most of the columns indexed.
And, like, oh, God, I can feel the hate mail now, talking to a giant legion of database engineers
who are all screaming in their cars right now.
What?
You index them all?
No, no. The modern hardware that respected Nick just mentioned means that the cost of this, while substantial, is not actually that crazy. Like, it's higher, no doubt, but we're not in the days where you need to be so incredibly stingy about every index you make, especially when the cardinality of the tables is, like, you know, a million, two million records, as opposed to, like, 10 gazillion, right? When our first 10-gazillion customer comes along, we're going to work with them closely to be like, you don't need that index, let's turn that off, your stuff will go faster. But for the vast majority of people, the overhead adds up, but it's not so meaningful that our business is busted, you know.
Yeah, so I wanted to say that this is a pets-versus-cattle
philosophy. If we think about indexes as pets, we need to get rid of unused indexes, take care of bloat. But at your scale, before the schema became sharded, millions of indexes, you think about this as cattle as well, just inside one Postgres instance, right, before it became sharded.
Yes.
This is an interesting approach, and I agree: since we have small tables, we know there is write amplification, all indexes, if it's not a HOT update, it needs to update all those indexes, some of them are not used, but since the approach, by design, is cattle, okay, it's not that bad. Yes, this is an interesting case. I must say I never saw it before. It's interesting.
I've heard of one other case,
which Heap Analytics blogged about, something about... you know the one?
Great. I sure do.
So that was the only other time I heard about this.
Was it also where they discussed that creating indexes doesn't work concurrently, right? Yeah, because you can't do more than one at the same time, to parallelize the work.
Yeah.
What's neat or strange or weird about Gadget is we kind of participate in the whole SDLC, the whole life cycle of authoring a feature, debugging it, deciding it's good, merging it, and deploying to production. Gadget has a development environment as well as the production environment, and so we kind of get to witness the moment that you want to go live. And we just do a normal-style, call it Rails or whatever, migration, where on deploy we create the indexes concurrently behind the scenes, then we ship the code that makes use of them, you know. And we can do that because we're not sort of exposing the raw SQL and asking people to manage their own schema, so on and so forth. But what do I want to share about this that I think is the most interesting? It's, like, giant catalogs, where you have, you know, we had, I think at peak, in one instance, like 250,000 tables, 1.4 million indexes.
Giant catalogs, like, suck, but were more manageable than I was guessing. I was assuming there was going to be some, like, O(N²) system table inside Postgres, or some doinky little thing that just breaks when things are that big. But no, that instance operated well. The thing that did break was a bunch of the analytics tools out there. Like pg_stat_statements... well, pg_stat_statements is okay: it was big, it takes a lot of data, but it wasn't an ungodly amount of performance overhead. But, like, pganalyze and many of the kind of collectors out there, they're like, let's get the size of every table every minute, you know, and it's like, that's a hard no.
Our monitoring also didn't work properly because of the catalogs.
Yeah, but you
guys only had to make a couple of tweaks, I think. We're talking about pgwatch2, like, the Postgres.ai remix, and it works well for us; it's kind of the fastest one. I haven't tried pgMustard, but I must... I owe...
Not a monitoring tool.
Yeah, all right, it's all right. So
that was surprising. But the other, actually, this is just a fun fact: so we did our sharding, we moved this gazillion number of tables over to Alloy, we turned on their analytics thing, it's called Advanced Query Insights, and it's not bad; it's a nice UI for kind of pg_stat_statements souped up a bit. They store all the data for it in a private database in your instance, so you have an almost-superuser, but not a full superuser, and there's this one database you can't touch that has it in it. And currently, and this is my fault, I set it all up and I was like, oh, this is probably going to be not that expensive, currently our AlloyDB insights database is almost 1.5x the size of the actual database. And I think it's because the query diversity is so high, and all the data that they're keeping about each one adds up so much that, yeah, we're paying more than double in storage just for the privilege of this thing.
And half the, like, queries time out. But again, I'm going to blame me for that, not Alloy.
We just need to turn that off.
I'm curious, if we have just one AlloyDB database and we store metrics inside the same thing, if it's down, how will we troubleshoot?
I think it is only, like, query insights, like the equivalent of pg_stat_statements.
For troubleshooting?
I guess you're right. I don't know.
Isn't the same true for Postgres, like, with pg_stat_statements?
Well, yeah. That's why we need an external monitoring tool, which will pull data and store it independently, for troubleshooting. I thought about this, like, self-observed or externally observed systems. I think we actually need both. We will need both, but if it's just self-observed, it might be a problem.
Yeah, it goes
full circle to that same thing we were talking about with the infrastructure provider earlier, which is, like, they don't have anywhere else to put it unless they're going to pay for it, right? So by putting it in my instance, I pay for it, which I am paying for dearly and am now going to turn off, you know. But I agree, you do get the, like, system metrics that they store in their, you know, fancy time-series database.
Yeah. That's... I didn't know about this. Interesting.
I think, if I may, the most interesting thing we've done to manage this ungodly number of indexes and tables is classic computer science, a hybrid strategy.
What we noticed is, like, many, many, many Gadget apps are, like, tutorials. It's like someone wants to kick the tires, or they're building a to-do app and they only ever add three to-dos. It's kind of excessive to create them one to-do table, one user table, indexes on many of those columns, to store zero to five rows.
where different gadget models are backed by different data.
We still store them in the same Postgres,
and that's so that we can still transact against them together.
Like we support, you know, real actual Postgres transactions,
but we use doinkey little JSONB tables
when the cardinality is small.
So we, like, have one
called shared JSON table
which has like an ID,
a model identifier, and then just
field values, and we don't index that at all.
No gin, no nothing.
We just kind of do a table,
you could call it a table scan, even though it's not.
It's like a range scan.
And then we just bound the cardinality of that.
And then when we witness the,
I think it's like 22,500th record,
we swap out kind of a
real table that has the real indexes,
whatever. So that cut our table count and our index count down by a lot, a lot, a lot.
Again, that swap is zero downtime, and it has to be. Like, that's, again, our commitment to our
customers. And that was actually quite difficult, because, like, someone's doing a bulk load of data or whatever, we witness that 2,500th record, we have to kind of pause traffic, do the switcheroo, resume traffic, without... like, there'll be a latency spike, but not an error spike.
And I think that's kind of Gadget as a bundle, right, where we have this back-end runtime, this front-end runtime, this database, all these things that work nicely together. And we see that as validation of the bundle showing through, where it's like, we can only do that because we have kind of control over the app tier, control over the load balancers, to be able to be like, ah, hold on a second.
We've got to do some maintenance on your app real quick.
We won't drop any requests.
We'll do a little, you know, switcheroo without anybody knowing.
And you're off to the races with the sort of full performance after that.
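As a sketch of that hybrid (the table, columns, and threshold here are hypothetical, not the actual schema or cutoff), the shared JSONB table plus a promotion check might look roughly like this:

```typescript
// Sketch of the hybrid storage strategy (hypothetical names and threshold):
// tiny models share one JSONB table; a model gets promoted to a dedicated,
// indexed table once it crosses a row-count threshold.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Shared table for small models, scanned by model_id rather than indexed per field:
//   CREATE TABLE shared_json_records (
//     id           bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
//     model_id     bigint NOT NULL,
//     field_values jsonb  NOT NULL
//   );
//   CREATE INDEX ON shared_json_records (model_id);

const PROMOTION_THRESHOLD = 2500; // hypothetical cutoff

export async function needsPromotion(modelId: number): Promise<boolean> {
  // In practice you'd track counts incrementally; a count(*) per write is
  // just the simplest way to show the idea.
  const { rows } = await pool.query(
    "SELECT count(*) AS n FROM shared_json_records WHERE model_id = $1",
    [modelId]
  );
  // Past the threshold, kick off the separately orchestrated, zero-downtime
  // promotion: pause the model's traffic, copy rows into a real table with
  // real indexes, then resume against the new table.
  return Number(rows[0].n) >= PROMOTION_THRESHOLD;
}
```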
That's a really neat trick. I like that. But you must have provisions in that code that doesn't allow queries without indexes, to, like, account for tiny tables?
Yeah, oh my goodness.
This sounds painful.
We've been working on it for, like, four years, so yeah. Yes. But I think we're in a good position. Like, I'm excited for BFCM this year, which, for operations staff, is, I think, a rare thing to hear.
Yeah.
I'm excited for the challenge, you know.
I saw the color drain from Nikolay's face.
Yeah.
Yeah.
Do you think there might be some point when true Postgres will be fully replaced in your company by something? Or will there always be a place for true Postgres, not Alloy or something Postgres-like?
Great question.
I think, yes.
I do think it will be fully replaced.
And I think that's only because we haven't seen any Alloy downsides. Like, while it's a fork, it's still 99% true Postgres. So we have, like, pgvector support, for example, and pg_squeeze support. You know, it's not, like, a wire-compatible thing; it's mostly actually Postgres with the, like, WAL and the block storage kind of ripped out.
It also has lightweight locks and so on, all those problems.
Yes.
Yeah, bloat.
Yes, true.
I think we're going to stick on Alloy, but not because it's the very best, but because now we have familiarity, now we know we can maintain it, now we can scale to Valhalla, so we've sort of unlocked something, and we've got other stuff to do. I think the thing that would make me revisit that is OrioleDB, or whatever the InnoDB for Postgres ends up being. Like, when that comes around and I don't have to vacuum anymore, that seems like a pretty epic win that I'd be interested in realizing.
Makes sense. Yeah, great answer. Thank you so much, enjoyed it, learned a lot.
I really appreciate getting to nerd out. These are the things that keep me up at night, like, 1.4 million indexes, what do I do, you know? It's nice to let it out a little bit.
Cool. Thank you.
It's our pleasure. Thanks so much.