Postgres FM - Managed service support
Episode Date: April 25, 2025

Nikolay and Michael discuss managed service support — some tips on how to handle cases that aren't going well, tips for requesting features, whether to factor in support when choosing a service provider, and whether to use one at all.

Here are some links to things they mentioned:

YugabyteDB’s new upgrade framework https://www.yugabyte.com/blog/postgresql-upgrade-framework
Episode on Blue-green deployments https://postgres.fm/episodes/blue-green-deployments
pg_createsubscriber https://www.postgresql.org/docs/current/app-pgcreatesubscriber.html

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With credit to:
Jessie Draws for the elephant artwork
Transcript
Hello, hello, this is Postgres.FM. I'm Nikolay, Postgres.AI, as usual. My co-host is Michael,
pgMustard. Hi Michael, how was your week?
Hi Nikolay, I'm good, thank you. How was yours?
Perfect, very active, and a lot of stuff is happening. So we had to skip last week
because I had even more stuff last week, but I'm happy if we continue, right?
We don't stop.
Oh, yeah.
Yeah.
I remember in the beginning, I was always against skipping any week because for me,
it would be a sign that we probably stop, which I don't want.
So yeah, right now I'm already, we already proved that, like, during a couple of years we...
A couple of years almost...
How many years?
Nearly three, maybe.
Almost three, yeah. Wow.
It's like this July it will be three years.
And I already proved to myself, we proved to ourselves that
if we skip one or two weeks it's not game over.
Yeah, this is me as the European convincing you it's okay to have a week off every now and again.
Yeah, exactly. Okay, if we stop, that's it. I don't want that.
Yeah, good. And today this was my choice, and the topic is a bit less technical, although we will talk about
technical stuff as well.
The topic is managed Postgres services: how they help us, or don't help us, as customers.
I mean, I'm probably in a different situation, but of course sometimes I'm just a customer,
or I'm on the customer's side. And there's
a problem in the fact that we cannot have access to the cluster when we have some issue;
there's a whole big class of problems around how to deal with it. And maybe we should create some best practices for how to deal with support engineers
from RDS, Cloud SQL, I don't know, and all the others.
Let me start from this.
I learned an important lesson, I think, in 2015-16 when I first tried RDS.
I liked it a lot because of the ability to experiment a lot.
Before cloud, it was really difficult to experiment because for experiments you need machines
of the same size usually for full-fledged experiments for a very limited amount of time.
15 years ago or so we were buying servers and putting them to data centers and
experiments were super limited. Cloud brought us this capability. Great. And with RDS I quickly learned how
cool it is to just create a clone, check everything, how it works, throw it out and then rinse and
repeat many times.
And then when you deploy, you already studied all the behavior.
And I remember I was creating a clone, but then it was so slow.
RDS clone.
I think it was 2016, maybe 2015. Why is it slow? The cluster was maybe 100 gigabytes.
Today that's a tiny cluster. Not tiny, small. But back in those days it was already quite a big one.
And I restored it, and somehow it took forever to run some SELECT.
And experienced AWS users know very well this phenomenon.
It's called lazy load, because the data is still on S3,
and you have an EBS volume which only pretends to have the data,
but the data is still there, lazy-loading in the background.
And I reached support because we had good support.
And engineer said, oh, let's diagnose, it's some kind of issue.
So it was hard to understand what's happening and so on.
And I spent maybe an hour or so with that engineer, support engineer, who was not really
helpful, right?
And then, I don't know, maybe it was my experience of managing people; by that time
I had already created three companies in the past, so I had learned something
about psychology and so on.
What I did, I just closed the ticket and opened another one.
Although usually any support would hate it.
Don't duplicate, right?
But this helped me solve the problem in a few minutes because another engineer told
me, oh, that's just lazy load.
And I Googled it.
I quickly educated myself: okay, what to do? Oh, just SELECT * from your table
to warm it up. Okay. And since then I have a rule, and I
share it with my customers all the time. If you are on a managed Postgres service and
you need to deal with support sometimes, it's like roulette, right?
It's 50-50.
Can be helpful, can be not.
If it's not helpful, don't spend more than 10 minutes
and just close the ticket, say thank you,
and open another one, because if it's a big company
who has big support, probably you will find
another engineer who will be more helpful.
Actually, I use this rule in other areas of my life as well, for example, talking to some
support people like bank, credit cards, debit cards, anything.
It's not helpful, okay, thank you, and you can just call again and another person will
probably help you much faster.
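For reference, the warm-up trick mentioned above looks roughly like this; the table name and connection string are hypothetical, and pg_prewarm is only an option if the extension is available on the instance:

```bash
# A minimal sketch of warming up a freshly restored RDS clone, so the lazily
# loaded blocks get pulled down from S3. Names here are hypothetical.
psql "host=my-rds-clone.example.com dbname=appdb" <<'SQL'
-- the "just SELECT from your table" approach: a full sequential read
SELECT count(*) FROM my_big_table;

-- or, if the pg_prewarm extension is available:
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('my_big_table');
SQL
```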
What do you think about this problem?
Yeah, I think you must have different banking services to us, because if we need to
call the bank, you're guaranteed to be waiting 20 minutes on hold.
Oh, yes, it's terrible. It can be hours. I think we'll have a day when someone will
create an AI assistant serving on the human side, not on the company side.
Oh, interesting. Yeah.
Yeah. So it should wait on that line and ask me to join only when everything is ready, with the small details already negotiated.
Some approvals needed, and that's it. Yeah.
Sure.
So maybe one day we will have such systems.
Yeah. I think at big companies that makes a lot of sense and at smaller ones much less so
I think there are some smaller managed services out there. But yeah, maybe this problem happens less. I was gonna ask
Because sometimes they have the ability to escalate right? Do you have any tips?
So let's say you've got a support engineer that wasn't able to work out the issue.
Do you have any tips for getting them to escalate the problem to a second tier, or do you always go with, like, let's open another ticket and hope?
Yeah, that's a great, great question.
And, I don't know about RDS by the way, but what I see in many cases is that
there is no such ladder built yet.
So in the case of big corporations, banks and so on, there is such an option.
You can ask for a senior manager, blah, blah, blah.
Especially if you go offline, it's definitely always an option.
So please let me speak to another person.
You escalate and so on. But what I observe and recently what happened,
we had a client who experienced some weird incidents. Those incidents require you to have
low-level access, which you don't have on RDS. You need to see where Postgres spends time, for example,
using perf or something. But you cannot connect. It's all in their hands. And you also need to grant
them approval to allow them to connect to your box and so on. So a lot of bureaucracy
here. And I told them: you need to escalate. And of course, it's normal, but I don't see this option working.
If you say "escalate", it looks like they don't understand what's happening here,
right? Really? Well, you can try: bring some difficult
problem and try to escalate. Will it work?
Is there any official option?
Because if it's not official and it works sometimes,
it's OK.
Again, it's like gambling, like I said.
It's similar to closing and reopening the issue
and hoping next engineer will be more helpful.
Escalation is also not guaranteed.
It's like in many cases, it's good, right?
Because there, probably, they will try to solve.
Well, I actually have several recent cases, very interesting ones. A client had issues, a bunch of them, like 10 issues,
of various natures, different kinds.
One issue was eventually identified, with mutual effort, as: don't run backup-push, or whatever you
call it, on the primary.
If the system is loaded, do it on replicas.
We talk about
it from time to time, when we touch backups, right? And this was an issue on that platform.
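As a sketch of that advice, assuming a WAL-G-style setup (the episode doesn't name the tool) and hypothetical host and path names, the idea is simply to run the base backup from a standby instead of the busy primary:

```bash
# Hypothetical sketch: take the base backup from a replica rather than the
# loaded primary. Host name, data directory, and WAL-G itself are assumptions.
ssh replica-1 'wal-g backup-push /var/lib/postgresql/16/main'
```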
But what I observed, trying to work with engineers, support engineers, and also
the ultimate escalation, going to CTO or CEO level and saying, look, you know, CTOs are talking, right?
That is the ultimate escalation.
And it's also not helpful sometimes, right?
In that case, there was a fair amount of disappointment, from what I observed.
This was feedback I heard.
So escalation is interesting, but my point is we probably need to learn about escalation
ladders and practices from other businesses, obviously.
And I still think it's not fair that the customer pays a higher price and doesn't have control.
Yeah, sure. Well, actually, on this topic I was going to ask: do you think this
is less of an issue for the managed service providers that give more access?
Like, we had an old episode on superuser, for example, and it's come up
a few times. Yeah, obviously that's not everything, like you're talking about running perf for example, but I'm guessing
a whole category of issues just doesn't exist if you've got superuser access.
So is it less of an issue on those?
I will tell you a funny story.
It was with Crunchy Bridge.
I respect Crunchy Bridge for two reasons.
Already for two.
It was one, now for two.
One is super user.
I don't know any other managed service yet which provides you super user.
It's amazing.
You can shoot yourself in the foot very quickly if you want.
It's freedom, right?
And another thing is that they provide access to physical backups, which is also nice.
This is true freedom and honoring the ownership of database and so on.
Because without it, maybe you own your data, but not your database.
You can dump, but you cannot access PGDATA, the physical data, nothing. And also, you own your data only conditionally,
because if bugs happen, you cannot even dump.
And this sucks completely.
And I'm talking about everyone except Crunchy Bridge,
all managed services, they all steal ownership from you. That sucks.
So the final thing is, there's at least one other, and I think they're quite smart. I think maybe
Tembo gives superuser access. Maybe, maybe, maybe. Yeah. Apologies if I missed something.
I work with a lot of customers and expanding my vision all the time, but of course it's
not 100% coverage.
Definitely not.
Definitely the big ones don't.
Right, exactly.
And they say this is for your own good, but it's not.
So let me talk a little bit about Crunchy Bridge.
It was super funny.
We needed to help our customer and reboot a
standby node. And it turned out Crunchy Bridge doesn't support rebooting, restarting Postgres,
on standby nodes. They support it on the primary or the whole cluster, but not a specific standby node.
It was very weird. I think it's just because they haven't gotten to it somehow.
It should be done.
It should be provided.
But we could not afford restarting the whole cluster when you just want to restart a replica.
And then I said, OK, we have superuser.
Yeah, what can we do?
COPY FROM PROGRAM, right?
So you crashed the server?
Not crashed, why crash?
pg_ctl restart, like, it's all good.
Just -m fast, all good, all good.
Yeah, there are some nuances there.
But on that, let's go back to the topic briefly,
because it's relevant.
Let me finish.
COPY FROM PROGRAM doesn't work on replicas because it's a writing operation.
So you had to contact support, right? That's where I was going with this.
Well support says this feature is not working. I mean, it's not... But they could do it for you, you know.
No, no, no, I needed it as part of automation we were building.
It was part of a bigger picture and we needed this ability. So what we ended up doing is COPY TO PROGRAM, writing to a local file. And this worked on a replica, but we were a little bit
blind. But then I talked to the developers and realized we had an easier path in our
hands: untrusted PL/Python, plpython3u. Anyway, if you have superuser, you can hack yourself a little bit.
It's your own risk.
If you break something, that's on you.
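To make the trick concrete, here's a rough sketch of the mechanism described above, with hypothetical connection details: COPY ... FROM PROGRAM is rejected on a hot standby because it writes into a table, while COPY ... TO PROGRAM is a read, so with superuser (or the pg_execute_server_program role) it still lets you run a server-side command.

```bash
# Rough sketch of the superuser hack discussed above; connection string,
# paths, and the command itself are hypothetical.
psql "host=replica.example.com dbname=postgres user=postgres" <<'SQL'
-- COPY ... FROM PROGRAM fails on a standby (it inserts rows into a table),
-- but COPY ... TO PROGRAM only pipes query output to a program it executes:
COPY (SELECT now()::text)
  TO PROGRAM 'cat > /tmp/copy_to_program_was_here.txt';
SQL
```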
Yeah, it's a really good point.
So that was kind of my questions.
If you've got more access, I presume there are fewer issues that you need support for.
But that does raise a good question,
because there's kind of three times
you need to contact support, right?
We've got an issue right now, maybe urgent, maybe not.
I've got a question, how does something work?
And then the third category is feature requests.
Like, I'd like to be able to do this,
which we can't currently do.
Exactly.
My experience of feature requests, like looking at the different forums of different managed
service providers, where they ask people to go to request and vote on features, it
looks a little hit and miss.
What's your... do you have any advice in terms of how to do that?
We have two paths here. Advice to whom? To users or to platform builders?
To users. I'm thinking for people listening, mostly users. Well, it's a bad state right now.
Again, I think managed services should stop hiding access. They build everything on top of open source, and they charge for operations and for support, good, good, good, but
hiding access to purely open source pieces, it sounds like bullshit to
me, complete bullshit. It actually makes me angry, you know. So,
amazing: yesterday I saw an article from
YugabyteDB. YugabyteDB suddenly... I feel it, like Tembo actually released DBAI,
going outside of their platform. And YugabyteDB did a similar thing. They
went outside of their database product and platform and
they started offering a tool for zero-downtime upgrades,
compatible with Postgres running on many managed service providers,
like RDS, Cloud SQL, Supabase, Crunchy Bridge, and so on.
And that's great. That's great.
They did it a little bit wrong, because they called it blue-green deployments,
while it's not... They made the same mistake as RDS did.
We discussed it, right?
They, this-
Yeah, but I saw your tweet about this
and I'm gonna defend them
because I don't think it's their fault.
I think the problem is RDS broke people's understanding.
Wait a little bit.
Yeah.
I'm going there, exactly.
I'm going exactly there.
So blue-green deployments,
according to Martin Fowler, who 15 years ago
published an article,
they by nature must be symmetric.
We did an episode, remember? Yes, exactly, criticizing the RDS implementation. And Postgres definitely supports symmetric ones; we implemented this, and
some customers use it. That's great.
And my point is, probably YugabyteDB hit the same
limitations we hit.
On RDS you cannot change things, like, it's not available.
And since you don't have low-level access, you cannot change many things.
And this limits you so drastically.
And it feels like some weird vendor lock-in.
If you want RDS, okay, good, I understand, but you
cannot engineer the best approach for upgrades, and you need to wait how
many years? Like, okay, blue-green deployments, say. At least I see a better
path for blue-green deployments, and it's my database, and I cannot do it, and I
need to go off RDS. At the same time,
if they provided access, more access, opening the gates for additional kinds of changes, it would
be possible to engineer blue-green deployments, for me or for a third party. Like, okay, YugabyteDB,
this third party, they want to offer or sell
some product or tool compatible with RDS, but since they don't have access to
recovery_target_lsn and so on, they are very limited, right?
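For context, recovery_target_lsn is an ordinary recovery setting on self-managed Postgres. A minimal sketch of the kind of knob being talked about, with a hypothetical LSN and data directory:

```bash
# Minimal sketch: point-in-time recovery to an exact LSN on a self-managed
# standby. The LSN value and PGDATA path are hypothetical.
cat >> /var/lib/postgresql/16/main/postgresql.conf <<'EOF'
recovery_target_lsn    = '0/3000060'
recovery_target_action = 'pause'
EOF
pg_ctl -D /var/lib/postgresql/16/main restart -m fast
```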
But it might be exactly for that reason.
If we're talking about the reason for needing it, one of the reasons is migrating off, migrating
out, then you can see the incentives to not...
Yes, vendor lock-in.
This is what I...
And for upgrades, things are becoming much better in Postgres 17.
Blue-green deployments, it's kind of not only for upgrades.
If we eliminate the upgrade idea, we can implement blue-green deployments on any platform right now.
Because you can skip many LSNs in the slot and just... how is it called?
Not promote, because promote is different. I forgot the name. Like, shift the position of the logical slot and synchronize it to the
position we need, and then from there we can already perform this dance with blue-green deployments.
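The function being reached for here is presumably pg_replication_slot_advance(); a minimal sketch, with a hypothetical slot name and target LSN:

```bash
# Minimal sketch: move a logical slot's position forward to a known LSN
# without consuming the changes. Slot name and LSN are hypothetical.
psql -d appdb <<'SQL'
SELECT pg_replication_slot_advance('my_logical_slot', '1A/2B000000'::pg_lsn);
SQL
```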
It's doable. But if you want upgrades, okay, we need to wait until 17, because there the
risk of corruption is low. You mean 18? 17. 17 has the pg_createsubscriber CLI tool.
And it also officially supports major upgrades on replicas, logical replicas.
So yeah, these two powerful things give us a great path to upgrading really huge clusters
using a zero-downtime approach.
Well, near-zero downtime, unless you have PgBouncer.
If you have PgBouncer, with pause and resume,
then it's purely zero downtime.
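For anyone curious, here's a minimal sketch of pg_createsubscriber, the Postgres 17 tool just mentioned, which converts a stopped physical standby into a logical subscriber; the data directory, connection string, and object names are hypothetical:

```bash
# Minimal sketch: convert a stopped physical standby into a logical subscriber.
# All paths, hosts, and names are hypothetical.
pg_createsubscriber \
  --pgdata /var/lib/postgresql/17/standby \
  --publisher-server "host=primary.example.com port=5432 dbname=appdb" \
  --database appdb \
  --publication pub_bluegreen \
  --subscription sub_bluegreen \
  --dry-run        # drop this flag to actually perform the conversion
```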
Anyway, my point is since they
perform this vendor lock-in,
they hesitate opening gates.
Customers cannot diagnose incidents,
and they also cannot build tools.
Third parties like YugabyteDB, or, for example, Postgres.AI: we probably would also build some
tools compatible with many other platforms.
Not "other", we don't have a platform, right?
We help customers regardless of the location of their Postgres database.
So if it's RDS, okay; Cloud SQL, okay. But building tools for them is very
limited right now, because we don't have access to many things, and we don't have superuser, and so
on. So yeah, that's bad. That's bad. But back to support: my main advice is just gambling
advice, just gamble, guys. Well, I think a lot of people have very high trust when they request
features, a very, very high belief that people will understand why they're asking
for it.
And I think a lot of people don't include context when they're asking: they
don't include why they want the feature, what it's preventing them from doing, what it might cause them to do if they can't get it, or what their alternatives
are going to be.
So I think sometimes when you make products, people just ask for features and you have
to ask them, why do you want this?
Like, what are you trying to do?
Because without that context, it's really hard to know which of your potential solutions
could be worth it or if it's worth doing at all.
But most vendors I've seen just don't ask that question.
People ask for a feature or a new extension to be supported or something.
Even if that extension has multiple use cases, there's no question back as to why they want
that feature.
Value, right?
Goals.
Yeah.
Well, exactly. And
sometimes five people could want the same feature but it's all for different
reasons and that's like really interesting. Which shows bigger value if there are many different reasons.
Yeah, or maybe it's an issue, like maybe it's actually less of a good idea, because
they're actually gonna want different things from it, like it's gonna be harder
to implement it well, unless it's an extension and you get them all, like, straight away. But
I think in terms of customers asking for things, I've not seen this work from managed service
providers specifically, but for products in general, I think it is helpful to give the
context as to why you're asking for something. The only other thing I had to add from my side was
if and when you are considering migrating
to a managed service provider.
So either at the beginning
or when you've got a project up and running,
I see quite a few people on Reddit and places at the moment
looking at moving self-hosted things
to managed service providers, you know,
as they're gaining a little bit of traction.
And I've seen at least one case go badly wrong when the person
didn't contact support at the beginning of the process, you know, they tried to
do everything self-service and actually it would have been helpful for them to
contact support earlier.
I think there's two good reasons for that.
One is to make sure the migration goes smoothly, but the second is test the
support out how, like. How does it work
for you? Is it responsive? What kind of answers do you get? Is it helpful? That kind of thing.
Yeah, we need to write some automation to periodically test all the support teams using an LLM.
I'm joking, of course. But I know it's your database; even if you
consider it cattle, like microservices,
it's not a pet, it's cattle,
it's still yours. You, being maybe a DBA, a DBRE,
it doesn't matter, a backend engineer,
you are very interested in taking proper care of the database and so on.
And for support, your database is one of many.
And they also have their own KPIs.
Your question closed, okay, goodbye.
And also, like, okay, do this, and so on. And since we don't have access and so on, I just feel the big need.
Like, this is a big imbalance. If you ask something of support... I saw many
helpful support attempts, very helpful, very careful, but it's rare, right? And
Postgres experts are also rare.
Not many, right? Yeah. And this closing of the ability for third parties to help... for example, if somebody involves us,
we immediately say: okay, for this you need to put pressure on their support, we cannot help.
Okay, so what do you mean by putting pressure?
Do you mean like following up regularly?
What do you mean by putting pressure on?
Reopening, escalating and so on, like explaining why. For example, a big company can have
various support engineers.
And for example, if there is a hanging query, and it's RDS, it's a recent real story,
and they suddenly say: okay, we solved it, the query is not hanging anymore. And I wonder, how come?
It was hanging because it couldn't intercept the signal, blah, blah, blah; it was hanging for many
hours. How did you manage it?
Support said, RDS support.
Did restart happen?
Yes it did.
And in the logs we see the signs of a kill -9.
So this is what the support engineer did.
This support engineer should be fired from the RDS team.
This is my opinion.
But I'm just saying it's hard to build a super
strong support team, and it will always be lacking. And it would be great if companies
would allow third-party people to help. If you check other aspects of our life, for example,
if you have a car... or, recently I replaced the tankless heater in my house.
If you go to the vendor, sometimes the vendor doesn't exist anymore.
For example, my solar is very old.
Anyway, you can find a variety of service people who can help and do maintenance.
If a company, even RDS, limits maintenance aspects only to their own staff, it will always be very
limited, because Postgres expertise is limited on the market.
They should find a way to open the gates.
This is my... it's already a message to platform builders.
What?
Well, I mean, I understand where you're coming from as a...
I'm coming to.
It's not "from", it's "to". It's the future.
I mean, I understand where you're coming from, that they can't hire all of them and actually
there's benefit in terms of other people being able to provide support.
But if Postgres expertise is so limited, where is everyone else going to get their support
from?
Like it's not...
It's open market and competition.
Yeah, exactly.
So you're saying there is plenty of Postgres expertise.
Well, the company should only benefit if they open the gates and allow other people to help,
while customers are still on the same platform.
Because otherwise the concern and the level of disappointment about support can rise
to the point where they go off it, which is actually probably not a bad idea. And I
also believe that slowly our segment of the market will start to realize
that there's self-managed, there's managed, but probably
there should be something in between. And I know there's some work, which I cannot share, that is
happening.
So, something in between where you truly own,
but still have benefits of managed services.
This should happen.
And I think multiple companies are going in this direction.
Or, and I'm seeing this more from kind of smaller companies,
quite established in terms of the database and team,
but not brand-new
startups necessarily, moving to services and factoring in support as one of the main things
they're looking for in a service provider.
I think in the past people would look at a lot of things, right?
Price, ease of use, region. Yeah, they look at a bunch of features, but don't always
factor in support as one of those key factors. And I like to see it when people do factor
that in and take it seriously. So that's the alternative, right? Pick your managed
service provider partly based on how good their support is.
I'm talking about an absolutely new approach, when a service is not self-managed, not managed,
but it's very, very well automated, and if you're not satisfied with the
company who helps you maintain it, you can switch the provider of this maintenance work,
right?
This should be like co-managed.
Yeah, co-managed.
Yeah, co-managed. Yes, exactly. It's great, because the market is growing and
competition is growing, and we see... like, I just provided a few examples about
several managed services; we see bad examples all the time, and the
problem is systematic. It's not just that some company is bad and others are good, or vice
versa. It's a systematic problem, rooted in the decision to close the gates and not allow others
to look inside. I also think providing good support is expensive. Deep Postgres expertise
is expensive. I'm a bit surprised by your experience with escalation. Most companies I see do have escalation paths,
but I don't deal with,
like Postgres managed service providers support that often.
So I'm surprised to hear
they don't have good escalation paths.
But yeah, if that's the case,
I feel like there must be opportunity for people.
And I know some do really
I have a question about this.
Like, you're also running something in the cloud, on GCP, right? Yeah, cloud.
Do you have Kubernetes?
Yeah, you use it. Okay. So you use GKE, right?
Yeah, Google Kubernetes Engine, right?
So if you go to Compute Engine, they call it Compute Engine, right?
Where you can see VMs.
Do you see VMs where this Kubernetes cluster is running?
I guess yes.
You can see the pods and the...
Yeah, I see that.
No, not the pods.
I mean VMs.
Can you use SSH to those VMs?
Oh, I guess so.
Yeah.
So, Google provides Kubernetes Engine, automation, everything, and you still have SSH access.
Yeah.
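For the curious, a minimal sketch of what that looks like on GKE; the GKE nodes are ordinary Compute Engine VMs, and the cluster, node name, and zone here are hypothetical:

```bash
# Minimal sketch: list the GKE nodes (regular Compute Engine VMs) and SSH
# into one of them. Node name and zone are hypothetical.
kubectl get nodes -o wide
gcloud compute ssh gke-my-cluster-default-pool-abc123 --zone us-central1-a
```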
Why can't the same thing be done for managed Postgres? Okay, yeah. Good question.
If you have SSH access, well, you can break things. Well, okay, I know. I know. If I open my car,
I can break things there as well. So this is interesting, right? So I know companies who provide services to tune,
maintain Kubernetes clusters.
And this is a perfect example, because for them, there is
great automation from Google.
Everything is automated.
But if customers have specific needs, and Google cannot meet those needs because they still have
a limited number of hands, right, and limited attention and so on,
the company can hire another company who are experts in this particular topic. They can go in,
they have everything, and they have SSH access to this fully automated thing.
Interesting, right? Yeah. Well, any last advice for people like actual users?
Well, yeah, I know I'm biased towards platform builders, because I'm upset and angry, and I
hope I explained the origins of my anger. But yeah, put as much pressure on support as possible, politely, but very firmly, and explaining.
I think it's possible to...
You had a great point that reasons and final goals need to be explained, right?
And also the risks, like what will happen if we don't achieve this.
Sometimes up to: okay, we're considering switching
to a different approach, provider, or something. Yeah, I think people should be more detailed
and put more pressure on support, to squeeze details out of them.
I'm very interested because many managed Postgres users come to us more and more recently
and they ask for help.
And if support is working, doing their job well, it helps us too, because it's
beneficial for everyone: we help to level up the health of Postgres clusters, get
rid of bloat, add some automation, tuning, and so on.
But if support does a poor job, well, the customer starts looking in a different direction, for where
to migrate, right?
And yeah, so my advice to users: pressure, details, and so on, towards support.
Is there anything to be gained in the cases where they give exceptional support?
You know, the time you mentioned rare cases where the support is very good. Is there anything that we can do in those
cases to like, not just say thank you, but say this was really good or feedback that this is...
Oh, yes. What I liked a lot is when support engineers formatted the responses very well.
And I knew it was not an LLM, actually, well, maybe partially, but there was a human
behind it for sure, because I saw it... well, actually, who knows. Yeah, and
in that case I would say thank you for a well formatted, well
explained, well structured response. And so, definitely. So you try to find good things and mitigate my anger, calm me
down. Thank you so much, thank you for it.
Well, it's good, it's interesting. Thank you, yeah.
A less technical discussion today, but I hope it provokes some thoughts.
I think changes are inevitable. I'm very curious in which direction the whole market
will go eventually. Let's see.
Me too.
Good.
Well, have a good week and catch you next time.
Thank you. See you. Bye.