Postgres FM - Managed service support

Episode Date: April 25, 2025

Nikolay and Michael discuss managed service support — some tips on how to handle cases that aren't going well, tips for requesting features, whether to factor in support when choosing a service provider, and whether to use one at all.

Here are some links to things they mentioned:

YugabyteDB's new upgrade framework: https://www.yugabyte.com/blog/postgresql-upgrade-framework
Episode on Blue-green deployments: https://postgres.fm/episodes/blue-green-deployments
pg_createsubscriber: https://www.postgresql.org/docs/current/app-pgcreatesubscriber.html

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With credit to:
Jessie Draws for the elephant artwork

Transcript
Starting point is 00:00:00 Hello hello, this is Postgres FM. I'm Nikolay, from Postgres.ai, as usual. My co-host is Michael, from pgMustard. Hi Michael, how was your week? Hi Nikolay, I'm good, thank you. How was yours? Perfect — very active, and a lot of stuff is happening. We needed to skip last week because I had even more stuff going on, but I'm happy we're continuing, right? We don't stop. Oh, yeah. Yeah.
Starting point is 00:00:28 I remember in the beginning I was always against skipping any week, because for me it would be a sign that we might stop, which I don't want. So yeah, by now we've already proved that... during a couple of years we... A couple of years, almost... How many years? Nearly three, maybe. Almost three, yeah. Wow. This July it will be three years.
Starting point is 00:00:56 And I've already proved to myself — we've proved to ourselves — that if we skip one or two weeks, it's not game over. Yeah, this is me as the European convincing you it's okay to have a week off every now and again. Yeah, exactly. Okay: if we stop, that's it, and I don't want that. Yeah, good. Today's topic was my choice, and it's not less technical — although we will talk about technical stuff as well. The topic is managed Postgres services and how they help us, or don't help us, as customers. I'm in a different situation, probably, but of course sometimes I'm just a customer
Starting point is 00:01:42 or I'm on the customer's side. And there's a problem: when we don't have access to the cluster and we hit some issue, there's a whole big class of problems around how to deal with it. Maybe we should create some best practices for dealing with support engineers from RDS, Cloud SQL, and all the others. Let me start with this. I learned an important lesson, I think, in 2015 or 2016, when I first tried RDS. I liked it a lot because of the ability to experiment a lot. Before the cloud, it was really difficult to experiment, because experiments usually need machines
Starting point is 00:02:37 of the same size as production, for a very limited amount of time, to be full-fledged. Fifteen years ago or so, we were buying servers and putting them into data centers, and experiments were super limited. The cloud brought us this capability. Great. And with RDS, I quickly learned how cool it is to just create a clone, check everything and how it works, throw it away, and then rinse and repeat many times. By the time you deploy, you've already studied all the behavior. And I remember I was creating a clone, but then it was so slow. An RDS clone.
Starting point is 00:03:21 I think it was 2016, maybe 2015. Why is it slow? The cluster was maybe 100 gigabytes — today that's a tiny cluster; not tiny, small. But back in those days it was already quite a big one. I restored it, and somehow it took forever to run a simple SELECT. Experienced AWS users know this phenomenon very well: it's called lazy loading, because the data is still on S3, and you have an EBS volume which only pretends to have the data, while it's actually still being lazily loaded in the background. And I reached out to support, because we had good support.
Starting point is 00:04:08 The engineer said: oh, let's diagnose, it's some kind of issue. It was hard to understand what was happening, and I spent maybe an hour or so with that support engineer, who was not really helpful, right? And then — maybe it was my experience of managing people; by that time I had created three companies, so I had learned something about psychology and so on — what I did was simply close the ticket and open another one.
Starting point is 00:04:48 Usually any support team would hate that — don't duplicate, right? But this solved my problem in a few minutes, because the next engineer told me: oh, that's just lazy loading. I googled it and quickly educated myself. Okay, what to do? Just SELECT * FROM your table to warm it up. Okay. Since then I have a rule, and I share it with my customers all the time: if you are on a managed Postgres service and
Starting point is 00:05:19 you need to deal with support sometimes, it's like roulette, right? It's 50/50 — it can be helpful, it can be not. If it's not helpful, don't spend more than 10 minutes: just close the ticket, say thank you, and open another one. If it's a big company with a big support team, you'll probably find another engineer who is more helpful.
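As an aside, here is a minimal sketch of the lazy-load warm-up described above — the table name is hypothetical. A plain sequential read works anywhere, and the pg_prewarm extension (available on RDS) does the same thing more explicitly:

```sql
-- Force a full read of the table: each block gets pulled from S3 once,
-- after which reads run at normal EBS speed.
SELECT count(*) FROM my_big_table;

-- Or load a relation explicitly with pg_prewarm:
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('my_big_table');  -- returns the number of blocks read
```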
Starting point is 00:05:43 Actually, I use this rule in other areas of my life as well — for example, when talking to support people at banks about credit cards, debit cards, anything. If it's not helpful: okay, thank you — and you just call again, and another person will probably help you much faster. What do you think about this problem? Yeah, I think you must have different banking services to ours, because if we need to call the bank, you're guaranteed to be waiting 20 minutes on hold. Oh yes, it's terrible. It can be hours. I think we'll see the day when someone will
Starting point is 00:06:17 create an AI assistant serving the human side, not the company side. Oh, interesting. Yeah. It should wait on that line for me, with all the small details already negotiated, and ask me to join only when everything is ready and some approval is needed — and that's it. Yeah, sure. So maybe one day we will have such systems. Yeah. I think at big companies that makes a lot of sense, and at smaller ones much less so — and I think there are some smaller managed services out there. But yeah, maybe this problem happens less there. I was going to ask,
Starting point is 00:06:55 because sometimes they have the ability to escalate, right? Do you have any tips? Let's say you've got a support engineer who wasn't able to work out the issue. Do you have any tips for getting them to escalate the problem to a second tier, or do you always go for: let's open another ticket and hope? Yeah, that's a great, great question. I don't know about RDS, by the way, but what I see in many cases is that there is no such ladder built yet. In big corporations, banks and so on, there is such an option.
Starting point is 00:07:31 You can ask for a senior manager, and so on. Especially if you go offline, it's definitely always an option: please let me speak to another person — you escalate, and so on. But here's what I observed recently: we had a client who experienced some weird incidents. Those incidents required low-level access, which you don't have on RDS. You need to see where Postgres spends time — using perf, for example, or something like that. But you cannot connect; it's all in their hands. And you also need to grant them approval to connect to your box, and so on. So there's a lot of bureaucracy
Starting point is 00:08:18 here. And I told them: you need to escalate. Of course that's the normal path, but I don't see this option working — if you say "escalate", it looks like they don't understand what's happening, right? Really? Well, you can try: bring some difficult problem and try to escalate. Will it work? Is there any official option? Because if it's unofficial and works only sometimes, it's okay — but again, it's like gambling, like I said.
Starting point is 00:08:53 It's similar to closing and reopening the ticket and hoping the next engineer will be more helpful. Escalation is also not guaranteed. In many cases it's good, right? Because then they will probably try to solve it. Actually, I have several recent cases, very interesting ones. One client had a bunch of issues — like 10 issues of various natures, different kinds. One issue was eventually identified, with mutual effort, as: don't run backup-push, or whatever you
Starting point is 00:09:40 call it, on the primary — if the system is loaded, do it on replicas. We talk about this from time to time, whenever we touch on backups, right? And this was an issue on that platform.
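A small aside: this placement rule is easy to enforce in backup scripts with one standard check (the policy around it is up to the script):

```sql
-- True on a standby, false on the primary; backup tooling can bail out
-- (or pick another node) when this returns false on a loaded primary.
SELECT pg_is_in_recovery();
```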
Starting point is 00:10:12 But what I observed — trying to work with support engineers, and also the ultimate escalation, when you go to CTO or CEO level and say: look, the CTOs are talking, right? That's the ultimate escalation, and even it is sometimes not helpful. In that case there was a chunk of disappointment; that was the feedback I heard. So escalation is interesting, but my point is that we probably need to learn about escalation ladders and practices from other businesses. And I still think it's not fair that the customer pays a bigger price and doesn't have control. Yeah, sure. Well, actually, on this topic I was going to ask: do you think this is less of an issue for the managed service providers that give more access? We had an old episode on superuser, for example, and it's come up
Starting point is 00:11:01 a few times. Yeah — obviously that doesn't cover everything; you're talking about running perf, for example. But I'm guessing a whole category of issues just doesn't exist if you've got superuser access. So is it less of an issue on those? I'll tell you a funny story. It's about Crunchy Bridge. I respect Crunchy Bridge for two reasons — it was one, now it's two.
Starting point is 00:11:24 One is superuser. I don't know of any other managed service yet that gives you superuser. It's amazing — you can shoot yourself in the foot very quickly if you want. It's freedom, right? And the other thing is that they provide access to physical backups, which is also nice. This is true freedom, honoring your ownership of the database. Because without it, maybe you own your data, but not your database.
Starting point is 00:11:59 You can dump, but you cannot access PGDATA — the physical data — nothing. And even your data you own only conditionally, because if bugs happen, you cannot even dump it. And that sucks completely. I'm talking about everyone except Crunchy Bridge: all the other managed services steal ownership from you. That sucks. Well, there's at least one other that I think is quite smart — I think maybe Tembo gives superuser access. Maybe, maybe. Apologies if I missed something; I work with a lot of customers and I'm expanding my vision all the time, but of course it's
Starting point is 00:12:46 not 100% coverage. Definitely not. Definitely the big ones don't. Right, exactly. And they say this is for your own good, but it's not. So let me talk a little bit about Crunchy Bridge. It was super funny: we needed to help our customer reboot a
Starting point is 00:13:06 standby node. And it turned out Crunchy Bridge doesn't support restarting Postgres on standby nodes. They support it for the primary or for the whole cluster, but not for a specific standby node. It was very weird — I think they just didn't get around to it somehow. It should be provided. But we could not afford to restart the whole cluster when we just wanted one replica. And then I said: okay, we have superuser — what can we do?
Starting point is 00:13:40 COPY ... FROM PROGRAM, right? So you crashed the server? Not crashed — why crash? pg_ctl restart, -m fast, all good, all good. Yeah, there are some nuances there. But on that, let's go back to the topic briefly, because it's relevant.
Starting point is 00:14:01 Let me finish. COPY ... FROM PROGRAM doesn't work on replicas, because it's a writing operation. So you had to contact support, right? That's where I was going with this. Well, support would say this feature is not working... I mean, it's not... But they could do it for you, you know. No, no, no — I needed it as part of automation we were building. It was part of a bigger picture, and we needed this ability. So what we ended up doing was COPY ... TO PROGRAM, writing to a local file. This worked on the replica, but we were flying blind
a little bit. But then I talked to the developers and realized we had an easier path in our hands: PL/Python — plpython3u, the untrusted one. Anyway, if you have superuser, you can hack things a little bit yourself. That's your right — and if you break something, that's on you.
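For the curious, here is a hedged sketch of the two tricks just described — the data directory path and the helper function are invented, and on a managed service this is strictly an at-your-own-risk move:

```sql
-- COPY ... FROM PROGRAM counts as a write, so a hot standby rejects it,
-- but COPY ... TO PROGRAM only reads from the database, so it still runs
-- there. The program executes as the OS user that runs Postgres:
COPY (SELECT 1) TO PROGRAM
  'pg_ctl restart -m fast -D /var/lib/postgresql/data > /tmp/restart.log 2>&1 &';

-- The PL/Python route: create the function on the primary (DDL cannot run
-- on a standby; it replicates over), then call it on the replica.
CREATE EXTENSION IF NOT EXISTS plpython3u;
CREATE FUNCTION run_shell(cmd text) RETURNS text AS $$
import subprocess
r = subprocess.run(cmd, shell=True, capture_output=True, text=True)
return r.stdout + r.stderr
$$ LANGUAGE plpython3u;
```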
Starting point is 00:15:06 Yeah, it's a really good point — that was kind of my question: if you've got more access, I presume there are fewer issues you need support for. But it does raise a good question, because there are kind of three occasions when you need to contact support, right? We've got an issue right now, maybe urgent, maybe not; I've got a question about how something works; and the third category is feature requests — I'd like to be able to do this, which we can't currently do. Exactly.
Starting point is 00:15:25 My experience of feature requests — looking at the different forums where managed service providers ask people to go and request and vote on features — is that it looks a little hit and miss. Do you have any advice on how to do that? We have two paths here — advice to whom? To users, or to platform builders? To users; I'm thinking of the people listening, mostly users. Well, it's a bad state right now. Again, I think managed services should stop hiding access. They build everything on top of open source, and they charge for operations and for support — good, good. But hiding access to purely open-source pieces sounds like bullshit to
Starting point is 00:16:16 me — complete bullshit. Actually, it even makes me angry, you know. And yesterday I saw an article from Yugabyte. Yugabyte suddenly — like Tembo, who released their database AI product going outside of their own platform — did a similar thing: they went outside of their database product and platform, and started offering a tool for zero-downtime upgrades, compatible with Postgres running on many managed service providers — RDS, Cloud SQL, Supabase, Crunchy Bridge, and so on.
Starting point is 00:16:54 And that's great. That's great. They got it slightly wrong, though, because they called it blue-green deployments, while it's not — they made a similar mistake to the one RDS made. We discussed it, right? Yeah — but I saw your tweet about this, and I'm going to defend them, because I don't think it's their fault.
Starting point is 00:17:11 I think the problem is that RDS broke people's understanding. Wait a little bit — I'm going there, exactly; I'm going exactly there. So, blue-green deployments: according to Martin Fowler — he published an article about 15 years ago —
Starting point is 00:17:26 they by nature must be symmetric. We did an episode on it, remember? Yes — exactly, criticizing the RDS implementation. And Postgres definitely supports the symmetric approach: we implemented it, and some customers use it. That's great. And my point is that Yugabyte probably hit the same limitations we hit: on RDS you cannot change things — it's simply not available. And since you don't have low-level access, you cannot change many things. This limits you drastically.
Starting point is 00:17:58 And it feels like some weird vendor lock-in. If you want RDS — okay, good, I understand — but you cannot engineer the best approach for upgrades, and you need to wait how many years for, say, blue-green deployments? At least I see a better path for blue-green deployments, and it's my database, yet I cannot do it without going off RDS. At the same time, if they provided more access, opening the gates for additional kinds of changes, it would be possible to engineer blue-green deployments — for me, or for a third party. Like, okay, say Yugabyte —
Starting point is 00:18:41 this third party — wants to offer or sell some product or tool compatible with RDS; but since they don't have access to recovery_target_lsn and so on, they are very limited, right? But it might be for exactly that reason: if one of the reasons for needing it is migrating off, migrating out, then you can see the incentive not to... Yes — vendor lock-in. That's what I'm saying.
Starting point is 00:19:13 And for upgrades, things are becoming much better in Postgres 17. Blue-green deployments are not only for upgrades, though. If we set the upgrade idea aside, we can implement blue-green deployments on any platform right now, because you can skip many LSNs in the slot and just... what is it called? Not promote — promote is different. I forget the name — you shift the position of the logical slot, synchronize it to the position we need, and from there we can already perform this dance with blue-green deployments. It's doable.
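The function being reached for here is most likely pg_replication_slot_advance, which fast-forwards a logical slot without decoding the skipped changes — a sketch with an invented slot name and LSN:

```sql
-- Where is the slot now?
SELECT slot_name, confirmed_flush_lsn FROM pg_replication_slots;

-- Fast-forward the logical slot to the position we need:
SELECT pg_replication_slot_advance('blue_green_slot', '5C/AB123400'::pg_lsn);
```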
Starting point is 00:20:07 But if you want upgrades, okay — we had to wait until 17, because there the risk of corruption is low. You mean 18? 17 — 17 has the pg_createsubscriber CLI tool, and it also officially supports major upgrades of logical replicas. So yeah, these two powerful things give us a great path to upgrading really huge clusters using a zero-downtime approach. Well, near-zero downtime — unless you have PgBouncer. If you have PgBouncer, with pause and resume, then it's purely zero downtime.
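To make those two pieces concrete, a hedged sketch — hostnames, database names, and the data directory are invented. pg_createsubscriber converts a stopped physical standby into a logical subscriber, and PgBouncer's admin console (a plain SQL connection to the special pgbouncer database) provides the pause/resume that hides the switchover from clients:

```sql
-- Shell step (PostgreSQL 17+), run against the stopped standby's data
-- directory; it turns the physical standby into a logical subscriber:
--   pg_createsubscriber -D /var/lib/postgresql/17/main \
--     -P 'host=old-primary dbname=appdb' -d appdb

-- Later, the client-facing switchover via PgBouncer's admin console:
PAUSE;   -- hold new queries, let in-flight ones finish
-- ...repoint the pool to the upgraded node, then:
RESUME;  -- queued clients continue; they never saw an error
```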
Starting point is 00:20:37 Anyway, my point is: since they maintain this vendor lock-in, they hesitate to open the gates. Customers cannot diagnose incidents, and others cannot build tools — third parties like Yugabyte, or Postgres.ai for example: we would probably also build some tools compatible with many other platforms. Not "other" — we don't have a platform; we help customers regardless of where their Postgres database lives. If it's RDS, okay; Cloud SQL, okay. But building tools for them is very
Starting point is 00:21:08 limited right now, because we don't have access to many things, and we don't have superuser, and so on. So yeah, that's bad. That's bad. But back to support: my main advice is just the gambling advice — just gamble, guys. Well, I have something to add. I think a lot of people, when they request features, have a very high belief that the provider will understand why they're asking. A lot of people don't include context when they ask: why they want the feature, what it's preventing them from doing, what it might cause them to do if they can't get it, or what their alternatives are going to be.
Starting point is 00:21:50 So I think that sometimes, when you make products, people just ask for features, and you have to ask them: why do you want this? What are you trying to do? Because without that context, it's really hard to know which of your potential solutions could be worth it, or whether it's worth doing at all. But most vendors I've seen just don't ask that question. People ask for a feature, or for a new extension to be supported, or something — and even if that extension has multiple use cases, there's no question back as to why they want
Starting point is 00:22:20 that feature. Value, right? Goals. Yeah, exactly. And sometimes five people could want the same feature, but all for different reasons — and that's really interesting. Which shows bigger value, if there are many different reasons. Yeah — or maybe it's an issue: maybe it's actually less of a good idea, because
Starting point is 00:22:40 they're actually going to want different things from it, so it's going to be harder to implement it well — unless it's an extension and you get them all straight away. But in terms of customers asking for things — I've not seen this work for managed service providers specifically, but for products in general — I think it is helpful to give the context of why you're asking for something. The only other thing I had to add from my side was about if and when you're considering migrating to a managed service provider.
Starting point is 00:23:09 So either at the beginning, or when you've got a project up and running. I see quite a few people on Reddit and similar places at the moment looking at moving self-hosted setups to managed service providers as they gain a little bit of traction. And I've seen at least one case go badly wrong where the person didn't contact support at the beginning of the process — they tried to
Starting point is 00:23:32 do everything self-service, and actually it would have been helpful for them to contact support earlier. I think there are two good reasons for that. One is to make sure the migration goes smoothly; the second is to test the support out. How does it work for you? Is it responsive? What kind of answers do you get? Is it helpful? That kind of thing. Yeah — we need to write some automation to periodically test all the support teams using an LLM. I'm joking, of course. But I know it's your database. Even if you
Starting point is 00:24:07 consider it cattle — like with microservices: it's not a pet, it's cattle — you, being maybe a DBA, a DBRE, a backend engineer, it doesn't matter, are very interested in taking proper care of the database. For support, though, your database is one of many, and they also have their own KPIs: your question closed, okay, goodbye.
Starting point is 00:24:37 And also: okay, do this, and so on. And since we don't have access and so on, I just feel there's a big imbalance. I've seen many helpful support attempts — very helpful, very careful — but it's rare, right? And Postgres experts are also rare; there aren't many, right? And this closes the door to third parties: for example, if somebody involves us, we immediately say — okay, for this you need to put pressure on their support; we cannot help.
Starting point is 00:25:29 Okay — so what do you mean by putting pressure on them? Do you mean following up regularly? Reopening, escalating, and so on — and explaining why. A big company can have very different support engineers. For example, there's a hanging query — this is a recent real story, on RDS — and suddenly they say: okay, we solved it, the query is not hanging anymore. And I wonder: how come? It was hanging because it couldn't intercept the signal, blah, blah, blah — it was hanging for many hours. How did you manage it, RDS support?
Starting point is 00:26:09 Did a restart happen? Yes, it did — and in the logs we see the signs of a kill -9. So that's what the support engineer did. That support engineer should be fired from the RDS team; that's my opinion.
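For contrast, these are the signals support could have tried from SQL before reaching for kill -9 (the pid is hypothetical):

```sql
-- Find the stuck backend:
SELECT pid, state, query_start, query
FROM pg_stat_activity
WHERE state = 'active';

-- SIGINT: cancel only the current query:
SELECT pg_cancel_backend(12345);

-- SIGTERM: terminate the whole backend, connection included:
SELECT pg_terminate_backend(12345);

-- kill -9 on a backend makes the postmaster assume shared memory may be
-- corrupted, so it restarts all backends and runs crash recovery -- hence
-- the restart that showed up in the logs.
```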
Starting point is 00:26:32 But I'm just saying it's hard to build a super strong support team, and it will always be lacking. It would be great if companies would allow third-party people to help. Look at other aspects of our life: if you have a car — or, recently, I replaced the tankless heater in my house — you go to the vendor, and sometimes the vendor doesn't exist anymore; my solar system, for example, is very old. Anyway, there's a variety of service people who can help and do maintenance. If a company — even RDS — limits maintenance to their own staff only, it will always be very limited, because Postgres expertise is limited on the market. They should find a way to open the gates.
Starting point is 00:27:13 This is my... it's already a message to platform builders. What? Well, I mean, I understand where you're coming from as a... I'm not coming from — it's not "from", it's "to": the future. I mean, I understand where you're coming from — that they can't hire everyone, and there's actually a benefit in other people being able to provide support. But if Postgres expertise is so limited, where is everyone else going to get their support
Starting point is 00:27:39 from? Like, it's not... It's an open market, with competition. Yeah, exactly — so you're saying there is plenty of Postgres expertise. Well, the company would only benefit if they opened the gates and allowed other people to help, while those customers are still on the same platform. Because otherwise the concern, and the level of disappointment about support, can rise
Starting point is 00:28:06 until the point they leave — which is actually probably not a bad idea. And I also believe that slowly our segment of the market will start to realize that there's self-managed, there's managed, but there should probably be something in between. And I know some work is happening that I cannot share: something in between, where you truly own your database, but still have the benefits of managed services. This should happen.
Starting point is 00:28:31 And I think multiple companies are going in this direction. Or — and I'm seeing this more from somewhat smaller companies, quite established in terms of the database and the team, but not brand-new startups necessarily — moving to services while factoring in support as one of the main things they're looking for in a service provider. In the past, people would look at a lot of things, right? Price, ease of use, region — they look for a bunch of features, but don't always
Starting point is 00:29:10 factor in support as one of those key factors. And I like to see it when people do factor that in and take it seriously. So that's the alternative, right? Pick your managed service provider partly based on how good their support is. And I'm talking about an absolutely new approach, where a service is not self-managed, not managed, but very, very well automated — and if you're not satisfied with the company who helps you maintain it, you can switch the provider of this maintenance work, right? This should be like co-managed.
Starting point is 00:29:44 Yeah — co-managed. Yes, exactly. It's great, because the market is growing and competition is growing. And, as the few examples about several managed services I just gave show, we see bad examples all the time, and the problem is systematic. It's not just that some company is bad and others are good, or vice versa. It's a systematic problem, rooted in the decision to close the gates and not allow others to look inside. I also think providing good support is expensive — deep Postgres expertise is expensive. I'm a bit surprised by your experience with escalation. Most companies I see do have escalation paths,
Starting point is 00:30:28 but I don't deal with managed Postgres providers' support that often, so I'm surprised to hear they don't have good escalation paths. But yeah, if that's the case, I feel like there must be an opportunity for people — and I know some do really well. I have a question about this.
Starting point is 00:30:49 You're also running something in the cloud, on GCP, right? Yeah. Do you have Kubernetes? Yeah, we use it. Okay, so you use GKE, right? Yeah — Google Kubernetes Engine. So if you go to Compute Engine — they call it Compute Engine, right? — where you can see VMs: do you see the VMs where this Kubernetes cluster is running? I guess yes.
Starting point is 00:31:23 You can see the pods and the... Yeah, I see that. No, not the pods — I mean the VMs. Can you use SSH to those VMs? Oh — I have SSH, yeah. So, Google provides Kubernetes Engine — automation, everything — and you still have SSH access.
Starting point is 00:31:42 Yeah. So why can't the same thing be done for managed Postgres? Okay, yeah — good question. If you have SSH access, well, you can break things. Okay, I know, I know — if I open my car, I can break things there as well. So this is interesting, right? I know companies who provide services to tune and maintain Kubernetes clusters. And this is a perfect example, because for them there is great automation from Google. Everything is automated.
Starting point is 00:32:20 But if customers have specific needs, and Google cannot meet those needs — because they still have a limited number of hands and limited attention — the customer can hire another company who are experts in this particular topic. They can go in, and they have everything: they have SSH access to this fully automated thing. Interesting, right? Yeah. Well, any last advice for actual users? Well, yeah — I know I'm biased towards platform builders, because I'm upset and angry, and I hope I've explained the origins of my anger. But yeah: put as much pressure on support as possible — politely, but very firmly, and with explanations. I think it's possible to...
Starting point is 00:33:12 You had a great point that the reasons and the final goals need to be explained, right? And also the risks — what will happen if we don't achieve this. Sometimes up to: okay, we're considering switching to a different approach, or provider, or something. Yeah — I think people should be more detailed and put more pressure on support, to squeeze details out of them. I'm very interested, because many managed-Postgres users come to us more and more recently and ask for help. And if support is doing their job well, it helps us as well, because it's
Starting point is 00:33:54 beneficial for everyone: we help level up the health of Postgres clusters — get rid of bloat, add some automation, tuning, and so on. But if support does a poor job, well, the customer starts looking in a different direction: where to migrate, right? So yeah, my advice to users: pressure, details, and so on, towards support. Is there anything to be gained in the cases where they give exceptional support? You know, you mentioned rare cases where the support is very good. Is there anything we can do in those cases — not just say thank you, but give feedback that this was really good?
Starting point is 00:34:34 Oh, yes. What I liked a lot was when support engineers formatted their responses very well. And I knew it wasn't an LLM — well, maybe partially, but there was a human behind it for sure, because I saw it... although, actually, who knows. In such a case I would say: thank you for a well-formatted, well-explained, well-structured response. Definitely. So you try to find good things — and mitigate my anger, calm me down. Thank you so much, thank you for it. Well, it's been a less technical discussion today, but I hope it provokes some thoughts.
Starting point is 00:35:21 I think changes are inevitable. I'm very curious in which direction the whole market will eventually go. Let's see. Me too. Good. Well, have a good week, and catch you next time. Thank you. See you. Bye.
