Postgres FM - Get or Create
Episode Date: August 23, 2024

Michael and Nikolay are joined by Haki Benita, a technical lead and database enthusiast who writes an excellent blog and gives popular talks and training sessions too, to discuss the surprisingly complex topic of trying to implement "get or create" in PostgreSQL — handling issues around idempotency, concurrency, and bloat.

Here are some links to things they mentioned:

Haki Benita: https://hakibenita.com
How to Get or Create in PostgreSQL (blog post by Haki): https://hakibenita.com/postgresql-get-or-create
"Find-or-insert" using a single query (how-to guide by Nikolay): https://gitlab.com/postgres-ai/postgresql-consulting/postgres-howtos/-/blob/main/0036_find-or-insert_using_a_single_query.md?ref_type=heads
Is SELECT or INSERT in a function prone to race conditions? (answer by Erwin Brandstetter): https://stackoverflow.com/questions/15939902/is-select-or-insert-in-a-function-prone-to-race-conditions/15950324#15950324
get_or_create() in Django: https://docs.djangoproject.com/en/5.1/ref/models/querysets/#get-or-create
Subtransactions Considered Harmful (blog post by Nikolay): https://postgres.ai/blog/20210831-postgresql-subtransactions-considered-harmful
MERGE (Postgres documentation): https://www.postgresql.org/docs/current/sql-merge.html
Hidden dangers of duplicate key violations in PostgreSQL and how to avoid them (blog post by Divya Sharma and Shawn McCoy from the RDS team): https://aws.amazon.com/blogs/database/hidden-dangers-of-duplicate-key-violations-in-postgresql-and-how-to-avoid-them/
One, Two, Skip a Few... (blog post by Pete Hamilton from Incident): https://incident.io/blog/one-two-skip-a-few

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With special thanks to:
Jessie Draws for the elephant artwork
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I am Michael, founder of pgMustard.
This is my co-host Nikolay, founder of Postgres.ai.
Hey Nikolay.
Hi Michael.
How are you?
Good.
How are you?
Very good.
Well, yeah, me too.
Thank you for asking.
Thank you for asking.
You're like, how are you doing?
I'm here too.
If you just want to keep chat among yourselves, I'll just wait here.
Look, we have a guest today.
Yeah.
Hi.
Yeah.
We are delighted to have a guest today.
And it is the one and only Haki Benita, technical lead and database enthusiast, who also writes an excellent blog over at hakibanita.com that we've mentioned many times on the podcast before.
He also gives popular talks and training sessions.
Thank you so much for joining us.
Thank you for having me.
Big fan, very excited.
The blog is great.
If someone didn't read it yet, it's a must-read, I think.
And long reads, right?
It's long reads, but not very frequent.
I guess it's not possible to make them frequent, right?
Yeah, it's a lot of work.
Funny story is, we agreed that our bot will have your blog posts in our knowledge base.
Yeah, that's right.
And thank you for that.
And a few days ago, I told the team,
this part of our knowledge base was not updated since January, since alpha release.
And now we're approaching like beta release.
Pay attention to this blog, I said. And they said, no problem, we'll update it.
And then they came back and said, we have updated, only one article was added since January.
Okay, okay.
But it makes sense because long reads require a lot of time.
I know it very well.
Yeah, it takes a very long time to produce these types of articles. Yeah, I know. I've been writing for a very, very long time, and I've been very surprised about this article. I'll be honest with you: after you write for some time, you start to develop a sense of which ones are going to do well and be very accepted, and which ones are going to be just one for the archives.
So when I wrote stuff like comparing pandas to SQL,
I knew that it's going to be a hit because people like pandas.
And if you say pandas is slow, then people are going to hate you.
And when I wrote this, me and Micah wrote something about
hash indexes a few years back. And I think it's a top result for Postgres hash index on just about
every search engine. So that one, you know, you release that one, you know that it's going to do
well. But then I released this one. I said, man, it's so long, such a niche problem. No chance
anybody's actually going to read all of that.
And I was very surprised by the amount of responses
that I received to this article.
And I think the most surprising comment
that I've received on this article
is that a lot of Postgres people didn't know about Merge.
Now, I came from Oracle, so I knew all about merge. And actually, when I came to Postgres, I wrote a lot of ETL processes and I really wanted merge, and I didn't have merge. So I had to learn about insert on conflict. So a lot of people told me, this is the first time I heard about merge, such a cool command.
Because it's new. It's still new. It was added in 15, right?
15. So it's super new.
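For anyone meeting MERGE for the first time, a minimal example of the command being discussed (available since Postgres 15) might look like this; the tags table and values are illustrative, not from the episode:

```sql
-- Get-or-create style MERGE: insert names that are missing,
-- explicitly do nothing for names that already exist
MERGE INTO tags t
USING (VALUES ('red'), ('blue')) AS input(name)
ON t.name = input.name
WHEN MATCHED THEN
    DO NOTHING
WHEN NOT MATCHED THEN
    INSERT (name) VALUES (input.name);
```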
Also, to be honest, insert on conflict is what you want in 99% of the cases. And while merge is nice, I think that on conflict is...
You seem pissed.
Pissed? Why do you say that?
Well, it was you describing your article, why it's bad, right?
Let's talk about why it's bad.
These gaps in sequences, the bloat created.
Yeah.
This is not what people want, right?
I know.
Let's step back.
Like our, Michael likes us to pronounce the topic, right?
Topic is...
I have to tell you something. It's funny, okay?
So when we talked, I talked with Michael about this, doing this show, and I asked him, okay, let's do it.
Can you give me like a rough outline of how it's going to go?
And he told me, I can tell you how it's going to go. But to be honest, if Nikolay is going to be on the show, then I can't promise anything.
Well, I see you are the same type of person, actually.
I don't know, you delivered in, like, the first five minutes.
Don't follow the agenda, right? So you said that you were surprised this topic would be popular. But how come? It's super popular.
It's like in computer science,
particularly in databases,
find or create,
this is like a super common pattern, right?
And the fact that in Postgres,
these solutions have so many pros and cons,
it's super interesting
because everyone needs it, actually.
Yeah.
You build something,
you want this, right? Yeah. I'm going to tell you a secret. After I published this article,
I actually discovered that there are still situations where you get very unpredictable
results. And I've done some experimenting in the past two weeks based on comments that I received.
And I haven't cracked it yet.
Yeah.
With the locks, it's more complicated.
With what?
And I'll even tell you another secret.
There's a very, in my opinion, an unexpected difference between the way merge and insert on conflict behave under different circumstances.
But, you know, we promised Michael that we'd describe the topic before we actually dive into the details.
Right.
Let's name the topic.
Yeah.
Well, I was interested in your thoughts on this, actually, because I feel like you deliberately called your blog post get or create, whereas from the database side of it, I've always thought of it as insert or select. And I think Nikolay's called it that in the past, in a how-to guide and things. So we'll link up the blog post, and we'll link up Nikolay's how-to, and a few other things. I think one of the things you're referencing, about how it's become even more complicated in the last few weeks: you linked me to a great answer by Erwin Brandstetter on Stack Overflow that discusses this as well.
But it's a surprise. It sounds really simple, right? Select or insert, get or create. It seems such a simple piece of functionality, and in small, low-traffic systems it can be. Like, any newbie developer could come up with a solution to this problem. It just gets complicated when you have to factor in concurrency and MVCC implementations and things. So yeah, I loved your post. I found it really interesting how complicated this could get. But yeah, maybe we should start simple, and it'd be great to hear a little bit about it. You both said it's quite a common thing, like you have to implement it quite often. I haven't had to that often. Like, I've got a couple of occasions, but maybe not as often as you do. So I'd be interested to hear where you're coming across this.
Okay.
So my use case was very similar to what I actually implemented in the article because I had this organization table and we had members.
And we wanted to have users be able to tag members
with their own tags.
And we wanted to encourage reuse of tags.
Otherwise, they don't make much sense.
So we've set up this very simple HTML form.
Now, HTML forms are not SPAs.
They're not very complicated.
You can't send JSON.
You just send a list of names,
like from a radio selection
or like an autocomplete, whatever.
So at the end of the day,
the server receives a list of strings
and it wants to set these tags on the member.
Now, if the tag exists,
you want to reuse the tag.
And if the tag does not exist, you want to create a new tag and get the ID.
Because the next step would be get all the IDs and put them in the table associating tags with members, right?
So that's the first part where you have a list of tag names and you want to get or create tags, this is where I came to this idea.
Now, the thing that surprised me is that, now, this is not a very high traffic part of the
application. I could have done the brute force approach, would have been just fine. But, you
know, I wrote some unit testing and one of the tests was, let's see what happens when you just have an existing tag.
You want to make sure that it's reused.
So I used insert on conflict do nothing with returning star.
So I expected that if I insert, for example, two tags and one of them already exists, I expected to get two
tags in return.
But in fact, what I got was just one tag.
So this was very strange to me.
So at this point, I started investigating and exploring why this was happening.
And in fact, the first thing that I thought about was, let's do a meaningless update.
Like, instead of doing on conflict do nothing, I did on conflict do update set id equals excluded.id.
Like, let's fool the database into thinking that this tag was modified.
So I get that in return.
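The two statements being contrasted look roughly like this; a sketch with an illustrative tags table (id plus a unique name column), not code from the episode:

```sql
-- Hypothetical schema for illustration
CREATE TABLE tags (
    id   bigserial PRIMARY KEY,
    name text NOT NULL UNIQUE
);

-- Plain DO NOTHING: RETURNING emits only rows actually inserted,
-- so pre-existing tags silently disappear from the result
INSERT INTO tags (name)
VALUES ('red'), ('blue')
ON CONFLICT (name) DO NOTHING
RETURNING id, name;

-- The "meaningless update" trick: fool the database into treating
-- the conflicting row as modified, so RETURNING emits it too
INSERT INTO tags (name)
VALUES ('red'), ('blue')
ON CONFLICT (name) DO UPDATE SET id = EXCLUDED.id
RETURNING id, name;
```

The second form returns all the rows, but as discussed below, each conflicting row is rewritten for no reason, which is where the bloat objection comes from.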
But it really bugged me because it's a very ugly solution. Why
would I want to update something for no reason just to get it back? So this is where all the
different scenarios started to uncover themselves. And as I tested further and further, I came to the conclusion,
like, hell man, why is this so complicated? I mean, this is what database and applications do.
This should be so simple.
Why is this so complicated?
And then I started digging.
And one of the places that I eventually arrived was Django.
Django is a web framework and there's an implementation of get or create.
And what Django is doing,
they try to fetch the record.
If they find it, they return it.
If they don't find it,
they try to insert, right?
But then, and that's like the funny part, the non-intuitive part, they actually handle the integrity error, the unique constraint violation, and then they try to get it again.
This is the select, insert, select.
But then it gets even more complicated because if you have a unique constraint violation
inside a transaction, it aborts the transaction, right?
You need to wrap the second insert in another transaction.
Or use sub-transactions. I think Django by default uses sub-transactions, which is a very bad idea, actually, but we will talk about it later.
The reason that it uses sub-transactions, and I know you're a very big fan of sub-transactions, you use them all the time, you encourage everybody to use sub-transactions. You wrote a very long blog post about why they are absolutely great and you should use them all the time.
But the reason that you do that in Django, the reason that they're doing that is because
if you happen to call this function inside a transaction of your own, and if you reach
a unique constraint violation, it gets your transaction aborted.
So that's a problem. So the only way that Django can make sure that they don't get your transaction
aborted is to execute this part in a sub-transaction of its own. Now, there's also another very
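The Django-style select, insert, select flow, translated to plain SQL with a savepoint playing the role of the sub-transaction, looks roughly like this (a sketch; in real code the rollback runs only in the error path):

```sql
BEGIN;

-- 1. Try to fetch the existing row
SELECT id FROM tags WHERE name = 'red';

-- 2. Not found: try to insert, guarded by a savepoint so a
--    unique_violation does not abort the caller's transaction
SAVEPOINT get_or_create;
INSERT INTO tags (name) VALUES ('red') RETURNING id;
-- ...if the INSERT raised unique_violation (a concurrent insert won):
ROLLBACK TO SAVEPOINT get_or_create;

-- 3. Fetch again; under read committed this new statement takes a
--    fresh snapshot, so the winner's committed row is now visible
SELECT id FROM tags WHERE name = 'red';

COMMIT;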
interesting thing that happens here. And this is also something that I mentioned in the article.
Python encourages asking for forgiveness. So in Python, the idiom says that you should try something and then handle all the possible exceptions.
So trying to insert a record and then handle the unique constraint violation is actually
consistent with how you're expected to do things in Python.
But it kind of conflicts with the way Postgres handles the situation because in Postgres,
when you have an exception inside a transaction, it aborts the transaction, which is not a
very good thing.
So the approaches between how you would do things in Postgres and how you would do things in Python kind of clashed in this case.
I thought it was very interesting.
So, you know, I explored different approaches, like what would be the equivalent of asking for forgiveness in Postgres?
What would be the equivalent of look before you leap, check all the preconditions in advance?
So, yeah, it turned out to be
way more complicated than I anticipated. Yeah, that's interesting. And you explored
several properties, right? Bloat, concurrency, constraint, and idempotency, right?
Idempotency, yeah.
So, I'm very curious. I
saw some parts of benchmarks, but you decided not to include performance in the resulting table when you compared methods. Why so? Performance is also interesting, right?
Performance is interesting. And I saw some articles that do this from a performance point of view. In my situation, performance was not a concern.
I was more focused on getting the functional attributes that I wanted,
like idempotency, for example, was a lot more important to me than making it fast.
Because at the end of the day, you don't update lots of tags.
You probably set no more than five. So performance was not a concern. I did want to
focus on the functional attributes. Also, the article got pretty long. So at some point, you need to let something go, you know?
Right. Yeah. Let's maybe talk about
your favorite method right now. And for the future, I guess it's merge, right? Because in Postgres 17 it will have a returning clause. Actually, I didn't know, and when I read about it in your blog post, I immediately thought, oh, this is a candidate to be one of my favorite additions to Postgres 17. I somehow overlooked it, so it's great. Before, like, merge was added to Postgres 15.
It was being developed more than 10 years, I think.
It was a huge story.
But it lacks returning in 15 and 16.
Now, in 17,
to be released soon,
it's receiving returning.
And it looks like
all your checkboxes are green, right?
I mean...
Yeah, it's the favorite. You would choose this if you
already were on Postgres 17.
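The Postgres 17 variant being described could be sketched like this; note (an assumption worth checking against the MERGE docs) that RETURNING only emits rows touched by an INSERT, UPDATE, or DELETE action, not rows handled by DO NOTHING:

```sql
-- Postgres 17+: MERGE gains RETURNING, plus merge_action()
-- to report which action fired for each returned row
MERGE INTO tags t
USING (VALUES ('red'), ('blue')) AS input(name)
ON t.name = input.name
WHEN NOT MATCHED THEN
    INSERT (name) VALUES (input.name)
WHEN MATCHED THEN
    -- an UPDATE here would also surface existing rows in RETURNING,
    -- at the cost of rewriting them; DO NOTHING rows are not returned
    DO NOTHING
RETURNING merge_action(), t.id, t.name;
```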
But before 17, what's your favorite method?
Well, insert on conflict do nothing with a union is currently my way to go. And I expect it to remain my go-to approach, mostly because of habit, and also because I experimented with the new merge under different circumstances in Postgres 17. And it turns out that merge can actually fail with a unique constraint violation. So I found it unexpected. I don't know, because I don't know much about how these two are different in the underlying implementation, but I'm guessing that they are not implemented the same way. And we're talking about the read committed level here.
Because I think it also matters.
So after I published this article, a reader reached out to me and he said,
I really liked your post and everything is very comprehensive, blah, blah, blah.
But there is one race condition that can still cause inconsistent results.
So basically, if you have two sessions inserting the exact same tag at the exact same time,
then you can end up with a situation where, using insert on conflict, you would not get idempotency, meaning the first statement would return the tag, but the second would return nothing.
And the reason for that is: when you start the first transaction, tag A does not exist, so you insert tag A, right? And then you get it back in return, but then you don't commit.
Now another session starts, the transaction begins, and now you do the exact same thing.
You do insert tag A on conflict, do nothing, returning star, and then you select from the
table and select from the returning. What's happening now is interesting, because right now the row is locked, so the second session hangs. So now you commit the first session, and the second session at this point is going to return nothing. And the reason is: insert on conflict waited on the lock, encountered a unique constraint violation, so the row was not inserted. But then when you try to select from the table, it finds nothing, because the snapshot is older than the other transaction.
So this is a scenario where you don't get idempotency.
You expect to get the same output in both cases,
but you don't.
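The race being described can be sketched as two psql sessions, assuming the same illustrative tags table with a unique name column:

```sql
-- Session 1
BEGIN;
WITH ins AS (
    INSERT INTO tags (name) VALUES ('a')
    ON CONFLICT (name) DO NOTHING
    RETURNING id, name
)
SELECT id, name FROM ins
UNION
SELECT id, name FROM tags WHERE name = 'a';
-- returns the new row; do NOT commit yet

-- Session 2, concurrently: run the exact same statement
-- ...it hangs, blocked on session 1's uncommitted insert

-- Session 1
COMMIT;

-- Session 2 now resumes: the insert hits the conflict, so nothing is
-- inserted and RETURNING is empty; the SELECT arm still uses the
-- snapshot taken when the statement started, before session 1
-- committed, so under read committed it can return zero rows
```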
And the more interesting thing is if you do the exact same exercise with merge,
you get different results.
What do you get?
Can you guess?
I found it to be surprising.
If you do the exact same experiment with merge, you're going to get a unique constraint violation. Even though in the merge clause you can say, when matched, do nothing, you are still going to get a unique constraint violation. So this raises a question of whether merge is really... I wouldn't want to say something like, is it safe in highly concurrent workloads, but it doesn't do what it promised. Better to understand how merge handles these types of situations, you know.
Yeah, it shouldn't produce a unique constraint violation in read committed, but it does.
Yeah, okay.
But it should be easy to check based on what you say, like if you just do pgbench multiple
sessions in parallel doing the same thing, right?
It's very easily reproducible.
You just need two terminal windows and you can immediately replicate it.
The thing I found to be, I would say, "disturbing", yeah, and I'm using air quotes for those of you just listening, is that when you do merge and you say when matched do nothing, you don't expect to get a unique constraint violation. Like, if I wrote this statement in my code, I would not handle a unique constraint violation. Because if I'm inserting a row and I know that it can raise, I would handle the situation and do whatever I need to do. But I would not expect to get an integrity error when I explicitly handle the case where there is a match. So I found this to be surprising.
So to answer your previous question,
what would I use now? And that's a tough choice between "you don't get the expected result" and "you get an exception". You know, when I think about it right now, I think that it kind of makes sense to get the exception, right?
Yeah, well, I guess we're moving
from the end of article
backwards, right?
So I'm not sure the problem you described,
is it discussed somewhere?
Do you know, like, this behavior,
was it discussed in mailing lists or somewhere, no?
I don't know.
I'm guessing that it might have, but I haven't seen any.
And also, as you said, it's kind of new in Postgres,
so it's possible that some of the people that worked on it
didn't have enough time to actually document it.
I wouldn't say document, but maybe write these types of articles on it,
analyzing how it behaves under different circumstances.
Yeah, I'm curious.
I think this should be documented.
Maybe it's actually documented.
We should check.
But it's super interesting.
I never used merge in production, never, never ever.
But, yeah, so,
well, honestly,
most production systems are still
behind a lot
in terms of major version.
But moving backwards,
insert on conflict
do nothing or do update. This is
interesting because, so, we
said our goal is
find or create or
get or create.
But when you say
let's also update, it's like
third layer of logic.
Find or create
but if you found, also
update. It's more complicated.
I like it, but it's already a different
task.
When I wrote lots of ETL, I don't know, 15 years ago in Oracle, I used merge all the time. It was like a third hand. Every time I needed to do anything, I would use merge because
when you do ETL processes, you're basically taking data from one place and you want to sync it to
some kind of target. So you use merge all the time.
In Postgres, I kind of got used to not using it.
And also, some may also claim that the whole ETL,
the way that we do ETL now is kind of different
than the way we used to do ETL 15 years ago.
A lot of it is the same, but some of it is still kind of different.
Yeah, I think it's interesting.
And hopefully when more people adopt merge, I know that it was
a very desirable feature, right?
A lot of people wanted that.
Right.
People were very excited when it came out.
It was in all the "what I anticipate most in version N" lists, you know, the merge command.
So hopefully we're going to see some people doing interesting stuff with merge.
Yeah, and I was surprised to see, so there is a difference between insert on conflict do nothing and do update in the table you provided at the end of the article. Do update there has a red cross in the bloat column, meaning that it has bloat issues. Obviously, update, okay. But you also have a green checkbox for insert on conflict do nothing.
The thing is that when we created our bot, we also used insert on conflict do nothing for the knowledge base. For example, when your article gets inserted, it searches if this URL is known already, right?
We decided not to apply update yet.
So I was surprised to see you concluded
that there is no bloat issues
because we had a lot of activities, a lot of data to insert
and it suffers from bloat.
We have gaps.
So if you have a collision, right, so you inserted something, you try to insert and do nothing, it actually generates a dead tuple, I think. No?
No, I think that, first of all...
Gaps for sure. I just checked.
Yeah, gaps for sure.
And, okay, I know what you're talking about.
You're talking about the issue of bloat, right?
Bloat, yeah. So this is also a very interesting topic,
which I briefly mentioned,
but I think that this is also something
that some people commented about,
that they weren't aware of the
fact that it causes bloat. So apparently in Postgres, when you have unique constraint violation,
it can cause bloat. Because the way it works is that Postgres tries to, it basically inserts
the row into the table, and then it checks that there are no unique constraint violations.
And if there is a unique constraint violation, this row is marked as dead, which causes bloat. So this is what's happening
if you rely, heavily rely on catching unique constraint violation. However, however, however,
and this is also in the original RDS article, which is how I found out about it. It was also
very unexpected that unique constraint violation would cause bloat.
But according to the article and according to the benchmarks that I've done,
when you use insert on conflict, it uses a different mechanism
that can check in advance for collisions, which prevents bloat.
So I'm pretty confident that insert on conflict do nothing does not cause bloat, which is a big plus for do nothing. And also, you know, we talked before about the difference between how you do things in Python, like asking for forgiveness, versus look before you leap. So in Python, the pattern that is encouraged, basically try to insert something and then handle the exception, can end up causing lots of bloat, which is a problem.
So if you have a process with lots of collision where you actually rely on unique constraint
violation, this is something that you need to think about. Right. So yeah, I double checked.
You're right. No bloat. But sequence gaps are happening.
Yeah, sequence gaps.
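The difference just discussed can be observed directly; a sketch, assuming the illustrative tags table with a unique name column and a row 'red' already present:

```sql
-- Variant 1: rely on catching the error. The failed INSERT still
-- writes a heap tuple that is then marked dead -> bloat
INSERT INTO tags (name) VALUES ('red');  -- raises unique_violation

-- Variant 2: ON CONFLICT checks for the collision up front,
-- so no dead tuple is left behind
INSERT INTO tags (name) VALUES ('red')
ON CONFLICT (name) DO NOTHING;

-- Compare dead-tuple counts after running each variant many times
SELECT n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'tags';
```

Both variants still consume a sequence value on every attempt, which is where the gaps come from.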
You know, there was a period when I liked to ask in job interviews, how would you implement gapless sequences?
Have you ever talked about this?
Evil question.
Evil question.
Yeah, it has many depths to go into, right?
Deeper, deeper, deeper, yeah.
There's a company that a few of my friends and former colleagues work at called Incident.io. They do incident management within things like Slack and Teams. And they had this issue, which they blogged about: they wanted gapless sequences for their incident numbers. The customers got confused by the idea of the incident number jumping by, like, five, when they'd only had, you know, one incident, then a second incident and a third incident, and suddenly they're on incident six. What happened? So there's a really nice blog post there explaining how they solved it.
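A common answer to that interview question (not from the episode; a sketch of the usual counter-table approach) is to serialize allocation on a counter row instead of using a sequence:

```sql
-- One counter row per scope; transactions that allocate a number
-- lock the row, so numbering is gapless but fully serialized
CREATE TABLE incident_counters (
    org_id  bigint PRIMARY KEY,
    last_no bigint NOT NULL DEFAULT 0
);

-- Allocate the next number inside the same transaction that inserts
-- the incident; if that transaction rolls back, so does the counter
UPDATE incident_counters
SET last_no = last_no + 1
WHERE org_id = 42
RETURNING last_no;
```

The trade-off is exactly why sequences allow gaps in the first place: every concurrent allocator has to wait on that one row lock.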
Yeah, but I'm curious, why are you bothered by sequence gaps, Nikolay? Like, why is that an issue?
Well, it's not a big issue. It might be an issue for some cases, but actually I don't care. It gave me a false impression that there is bloat, that's why. But now I double checked, and you're absolutely right, no bloat.
And so, insert on conflict do nothing probably is the winner in Postgres 16 and older.
I've got a question here, because you mentioned the kind of trade-off between the issues in the highly concurrent cases, or like the potential issue of insert on conflict do nothing returning null. If you happen to have this case with the two concurrent sessions inserting the exact same tag at the same time, wouldn't insert on conflict do update avoid that issue at the cost of some bloat?
Well, that's a different issue.
The issue I describe is that when you have two transactions trying to insert a new tag
at the same time, the same new tag at the same time, then the second on conflict would
not do anything and it would also not return anything.
This is the unexpected part because one of my requirements is that the
operation would be idempotent.
So if I give it tag A, I expect to get an ID in return.
So in this case, there is a scenario where I provide it with a list of N tags
and I get less than N tags in return.
But if they both updated, imagine if both concurrent sessions were doing insert on conflict do update, they both get to the update stage.
You get two updates, but you still get back all the tags that you wanted to insert.
First of all, in my scenario, I don't update.
A lot of people got this very confused. This is why I added a comment, and some people made, let's call them, funny comments about it.
Okay.
But there's a difference between upsert, where you want to update or insert, which is also a very, very popular pattern. In that case, by the way, it's simpler, because if you actually update the row, then it's going to get returned by returning star. Yeah, so that's like the easy case. In my case, I don't want to touch the row, but I still want to get it back. This is why get or create is a more difficult variation of upsert, if you will.
Okay.
So, there's a tricky part. So apparently there's no good solution. By the way, the post by Erwin on Stack Overflow lands on a brute force solution, where you basically loop until you get what you expect. I don't know, maybe this is the way to go.
A loop where? In a different language, or in PL/pgSQL?
He ended up writing a function where you essentially loop until you stop getting unique constraint violations. Yeah.
But you need the sub-transactions, so this is a no-go for me, right? This doesn't scale. I mean, this...
Nikolai, you're a man with a mission.
Well, it's no joke. I have clients; literally last week, I spent like half an hour, and people went and switched it off in Django. Because it hurts. It hurts people constantly. Like, the subtrans SLRU wait event, it happens all the time. So the fact that Django by default uses them, and people don't realize, and then come to us with subtrans SLRU, it's good for my income, right? But it's bad in general, right?
By the way, I wanted to highlight one thing.
If you do insert on conflict do nothing, returning star, it won't return anything if the row already exists. I just wanted to highlight that you have a trick. You use a CTE, right? And then a select for such a case, right?
Yeah.
Additionally, with a union, right?
Yeah, that's right. This is the only way that you can actually communicate between parts of a query. Otherwise, they all see the same snapshot and you don't get the row back. This is a trick, also not straightforward at all.
Yeah, like some people can move away from insert on conflict do nothing just because of that. Right. One of many cool tricks in that post.
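The CTE-plus-union trick being described can be sketched like this, again against the illustrative tags table (the concurrent-insert edge case discussed earlier in the episode still applies):

```sql
-- Get-or-create in one statement: returns the row whether it was
-- just inserted or already existed
WITH ins AS (
    INSERT INTO tags (name) VALUES ('red')
    ON CONFLICT (name) DO NOTHING
    RETURNING id, name           -- populated only when a row was inserted
)
SELECT id, name FROM ins
UNION
SELECT id, name FROM tags WHERE name = 'red';
-- The second arm cannot see rows inserted by the CTE (data-modifying
-- CTE changes are invisible to the rest of the same query), so exactly
-- one of the two arms produces the row
```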
I went through this process where I tried to figure out, wait, I just inserted this row, why, when I select outside the CTE, doesn't it give it to me?
Yeah.
Yeah.
I understand there's some cost to upsert, but given the complexity we're running into here, would you both see it as acceptable to pay that extra cost, the kind of heavier write? I know you mentioned the kind of double hit of the table being annoying, but those updates, even though they'd be setting the tag name to the exact same tag name, potentially over and over and over again... I just see the guarantees that it provides as so attractive, personally.
I'll be honest with you.
I understand what you're saying and you're probably correct.
This is like the practical approach, but I would not do it.
It just bugs me.
I would not do it.
I would not do a meaningless update just to say, I would just do two queries.
I would insert and then I would select separately.
It's like insert, rollback, insert, rollback. You end up having a lot of bloat depending on the concrete situation.
So, sub-transactions and bloat, huge limitations of performance here, right? I know. I think this is a case where you kind of
understand the limitation
and restrictions of the database
and you kind of end up
solving it at the application level.
Now,
you mentioned at the beginning, I know that
I, well, get
or create is
useful, but
I haven't had a chance to implement it as much.
So, yeah, for the rare occasion when I actually need to do get or create,
I would just do it at the application level.
Because as it turns out, it's very complicated at the database level.
Right. In the application, it becomes simpler. You probably don't deal with sub-transactions and bloat, but latency, like round-trip times between attempts to do one step, another step, it increases the chances of failures, right? I mean, collisions and so on.
Yeah, but you know, if my main business were get or create, I would come up with a very innovative solution, put a lot of engineering work into it. But if it's just updating tags for members in an organization, I would let it go.
Yeah, but I understand what you're saying.
I spent a lot of time doing unnecessary optimizations on, you know, weird places in the code just for fun.
I do it all the time.
Yeah, like 90% of my blog posts are inspired by these strange endeavors where I try to optimize things.
So yeah, this is interesting.
And I think that this very simple problem
surfaced a lot of issues.
I also learned a lot from writing this.
It got me interested in speculative insertion, the mechanism used by INSERT ON CONFLICT. It, I think, broadened my understanding of how the interactions between different queries and common table expressions work within the same query. There are things that I'm now aware of, so at least I'm better equipped to debug issues I might have and don't even know about.
Yeah.
Yeah.
I really liked it as a journey. I would encourage people to read this, even if they don't have this problem, almost just to watch somebody go through it. And you know, I feel like you've included things as well, like there's one section where you say you took a wrong turn. That's so helpful to share with people, because otherwise you can read some of these blog posts and they just sound like the person got it right straight away. And it's quite frustrating for me, as somebody that quite often takes wrong turns, reading those.
So I appreciated that. I just wish you had also included some warning about sub-transactions.
Anyway, you know what, let me just change the banner at the top of the website. Instead of the about page, I would place this huge warning: don't use sub-transactions. Let me put it right here. If you see PL/pgSQL with BEGIN, EXCEPTION WHEN, or something, so, nested BEGIN and END blocks, this is a sub-transaction.
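For illustration, the kind of PL/pgSQL pattern being warned about might look like the following sketch; the nested BEGIN ... EXCEPTION ... END block implicitly opens a sub-transaction on every call (the `tags` table and function name here are hypothetical):

```sql
CREATE OR REPLACE FUNCTION get_or_create_tag(p_name text)
RETURNS bigint AS $$
DECLARE
    v_id bigint;
BEGIN
    -- This nested block with an EXCEPTION clause is a sub-transaction:
    -- Postgres sets a savepoint before entering it.
    BEGIN
        INSERT INTO tags (name) VALUES (p_name) RETURNING id INTO v_id;
    EXCEPTION WHEN unique_violation THEN
        -- The row already exists; fetch it instead.
        SELECT id INTO v_id FROM tags WHERE name = p_name;
    END;
    RETURN v_id;
END;
$$ LANGUAGE plpgsql;
```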
Yeah.
But in this case, it's warranted, because otherwise it cancels any calling transaction. So that would be the responsible thing to do in this case. But you know what? Maybe it needs a comment saying this is a good sub-transaction.
It's not a good sub-transaction, because if you have high... This is the least worst type of sub-transaction.
I cannot agree with you here, because you explore collisions and concurrency issues.
Yeah.
It means you think that there will be a lot of sessions issuing this query. It means eventually the project will bump into this wall.
Yeah.
Right? A good sub-transaction is, for example, DDL, which happens not often in one session. This is a good sub-transaction. And even there it can become bad. But here I cannot agree, because we should consider multiple sessions fighting, and this is where things can go wrong.
What you're saying reinforces what I'm saying: that in this situation, because it's so complicated in the database, I would just elevate it to the application level and try to find a solution there, to avoid all the complexity. But it was a nice thought exercise.
Just try a regular transaction, maybe, right? It should not be so complex, right?
I know. I was surprised that it was so complicated.
Why is it so? What would you both like? Is it that you would like MERGE to work as expected? What is the nice solution on the database side for this?
Well, the dream syntax would be for ON CONFLICT DO NOTHING RETURNING * to just return everything. That would be the dream syntax. I think part of the problem is that RETURNING * is not a super popular thing; it was added later to INSERT, UPDATE, and DELETE, it was not originally there, so it's not a super natural thing, right? Also, once you've defined that behavior, it's so difficult... like, you can't change the behavior for existing applications.
No, you can't.
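A sketch of the current behavior being discussed, again assuming a hypothetical `tags` table with a unique `name` column: RETURNING only returns rows the statement actually inserted, so a conflicting row yields nothing.

```sql
-- If 'postgres' already exists, ON CONFLICT DO NOTHING skips the insert,
-- and RETURNING * returns zero rows.
INSERT INTO tags (name) VALUES ('postgres')
ON CONFLICT (name) DO NOTHING
RETURNING *;
```

The "dream syntax" would make this return the existing row too, but as noted, changing that behavior now would break existing applications that rely on RETURNING meaning "rows actually affected".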
And I can even give you an example, because you usually use INSERT ON CONFLICT when, you know, there is no MERGE command in Postgres. So if you want to sync data, you use INSERT ON CONFLICT DO NOTHING. Okay. And more often, what you want to do is avoid the unnecessary update, so you do DO UPDATE with a WHERE clause checking that something actually changed. Okay, and then some rows end up not being affected. So at this point you do RETURNING *, and then, usually, what I like to do in my ETL processes is count the rows, so I can log how many rows were affected, right? So now the question is: if I expect to get the number of affected rows, we go full circle here.
I know.
So if I'm just expecting the affected rows, then getting rows that were not affected is also very strange, right?
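A sketch of the ETL pattern described here, with hypothetical `items` and `staging_items` tables: the WHERE clause skips no-op updates, and a CTE counts only the rows RETURNING actually produced.

```sql
WITH upserted AS (
    INSERT INTO items (id, payload)
    SELECT id, payload FROM staging_items
    ON CONFLICT (id) DO UPDATE
        SET payload = EXCLUDED.payload
        -- Skip the unnecessary update when nothing actually changed
        WHERE items.payload IS DISTINCT FROM EXCLUDED.payload
    RETURNING *
)
SELECT count(*) AS affected_rows FROM upserted;
```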
Right.
Yeah.
So what's the conclusion?
Come on, guys.
I mean, what's the conclusion?
Just do it in the application?
Don't go there.
The main conclusion is don't go there.
Don't think about it.
Yeah. If an error happens, it only confirms that application developers should implement proper retry logic. If a transaction fails, the code should be ready to retry it. That's it. The universal solution to everything.
We should rename this to Just Do It in the Application FM.
Well.
It's complicated FM.
It's complicated.
No, it is complicated.
Jokes aside, I don't know why it's so complicated.
But it is.
Turns out concurrency is hard.
Yeah.
You know, one of the things that I remember is that when I read this RDS article about unique constraint violations causing bloat, I was shocked. I was shocked because, coming from Python, I'm encouraged to trigger exceptions. This is how I do things: I fail, and then I adjust. So that was very strange for me.
I have this entire lecture about how to handle concurrency, with a short URL system and everything. There's the problem of how you generate short random IDs. How do you do it? If you know short URL systems, you need to generate keys. Very small keys, so the space is limited. So you want to generate random unique keys. How do you do it? You end up with a solution where you try to insert, and then you loop until you don't get unique constraint violations anymore, right? And now all I'm thinking is that I may have caused lots of bloat without intention.
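The retry loop described might be sketched like this in PL/pgSQL (the `short_urls` table is hypothetical); note that the EXCEPTION block is itself a sub-transaction, and each caught unique violation leaves behind a dead tuple, which is exactly the bloat concern.

```sql
DO $$
DECLARE
    new_key text;
    inserted boolean := false;
BEGIN
    WHILE NOT inserted LOOP
        -- Generate a short random key; small keyspace means collisions happen
        new_key := substr(md5(random()::text), 1, 6);
        BEGIN
            INSERT INTO short_urls (key, url)
            VALUES (new_key, 'https://example.com/some/long/path');
            inserted := true;
        EXCEPTION WHEN unique_violation THEN
            -- Collision: loop and try another key
            NULL;
        END;
    END LOOP;
END;
$$;
```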
Maybe you have a better solution.
By the way, it's very similar to the question about
the gapless sequences.
Yeah.
I get that bloat is bad, right? Like, I understand; I see lots of slow queries that have, of course, been caused by bloat. But once you're aware of it, it's something that can be managed. And you've written a great blog post about how to manage it as well, Haki, that I'll include. But it feels to me like it might be one of those pains worth paying sometimes, when you're aware. Every single Postgres table is going to have some bloat in it over its natural working life. You're not going to get it down to zero, even when you do some of these maintenance tasks, so we're always managing some amount. It's just: what's a healthy amount, and can we keep it within those boundaries? It's going to be more painful with certain workloads. Like, maybe once we're getting near the end of that space, once we get past maybe halfway full, an average of two tries per unique URL might start to get annoying. So maybe that's the point where you start expanding the space by an extra digit or something. I imagine once you're aware of the problems, you can design a solution that isn't that bad, and then manage the bloat.
Well, it all boils down to how many collisions you actually expect to get.
Yeah. I think the fact that rolled-back inserts cause bloat makes me think that only undo people can save us someday.
We had a quote last week: the new term, "undo people". Because indeed, if you think about it... We should credit Melanie. So this different approach would place new rows in the same position in the page, right? And if it committed, it's there. If not, like... I mean, the different approach would not cause bloat if an insert is rolled back, right? So this is just the design of Postgres MVCC that we should blame. Right?
Yeah.
It's very unexpected, for sure.
Yeah. Well, at least we can run very long queries without worrying about undo segments running out.
I've been there.
And it's great to have someone who's been there understanding the downsides of undo too, right?
Haki, any last things that we should have mentioned that we didn't?
No, I think that we zigzagged across this article very, very nicely.
Nikolai definitely delivered on his promise.
Yeah. Well, are you planning part two, with benchmarks?
I don't know. I still don't care much about performance in this case, but I think that I might add an update on some of the findings about how MERGE and INSERT can end up eventually causing unique constraint violations and non-idempotent results.
I hope that you implemented upsert, so you can identify when a blog post is updated, so your AI can answer this correctly.
Oh, this is a good goal, actually.
We don't have that.
Yeah. Yeah.
Brilliant.
Misconceptions about blog posts.
They never update.
Challenge accepted.
Challenge accepted.
Yeah, I just solved your get or create problem.
You can just update all the time.
It's not that easy, because we have a lot of preprocessing stuff for really long articles like yours. Because, you know, the LLM usually has some limitations. OpenAI has like 8K tokens as input. I'm sure your article exceeds it a lot.
I'll try to write shorter ones.
Yeah.
Probably you should have two versions, for people who have issues with long reads.
Yeah. I do, actually.
You like blog posts in parts?
I like short versions, TL;DR-style, but with some details.
I'm a big fan of long blog posts, Haki.
I know there's a lot of others out there too.
So thank you for continuing to write for us.
Yeah.
Yeah.
It's something that I'm going to cite, for sure.
Yeah.
Links to this article will be used.
Yeah.
Thank you for this.
I need an idea for the next one.
So if you have one, ping me.
I have a working title in my mind. It's called "How to Take a Dump Like a Pro". I couldn't resist the title.
That's a huge topic, actually.
Yeah, I've actually had some experience with it lately, and I think that it's interesting to see how you can optimize pg_dump with the different compression algorithms, how they affect the CPU, and when dumping in parallel is actually not useful at all.
On partitioned tables?
No, if you have one big table.
Right. It's possible, but you need to engineer it yourself with snapshots and repeatable read transactions. So that's common. And we use ranges of IDs. This is what PeerDB, for example, does for logical replication initialization. So it's possible, for sure.
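The snapshot technique mentioned here might be sketched like this: one session exports a snapshot, and parallel worker sessions attach to it so they all see identical data while dumping disjoint ID ranges (the table name and snapshot ID are illustrative).

```sql
-- Session 1: open a transaction and export its snapshot
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();
-- returns a snapshot identifier, e.g. '00000003-0000001B-1'

-- Each worker session: attach to that snapshot, then dump its ID range
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000003-0000001B-1';
COPY (SELECT * FROM big_table WHERE id BETWEEN 1 AND 1000000) TO STDOUT;
```

The exporting transaction must stay open for as long as workers need to attach to the snapshot.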
Yeah, but I do have another consideration that you didn't think about. Because one of my restrictions was that I wanted to stream the output directly to S3, and not to the local file system. Imagine that you run this in a Lambda. In this case, using parallel is problematic. But using a single file, you can just stream it directly and skip the file system. And also, if you want to pass the dump through some kind of encryption, then doing things in parallel also makes it a bit difficult. So yeah: different restrictions, different solutions.
When you do this, like when you find some lacking feature, do you sometimes have the idea to implement something?
Implement?
To become a hacker.
To become a hacker?
What, like contribute to Postgres?
Because, well, pg_dump lacks a lot of stuff, all the time. But it's super complicated to add some things there; definitely, though, it lacks a lot of stuff. I'm curious if you've had such a feeling, like, oh, I'm going to try to implement this.
No, I looked at the Postgres code many, many times. I think that the documentation in the source itself is absolutely excellent. I remember when we worked on this hash indexes article, we looked at the actual source file. There's lots of very, very useful information. By the way, this information was a lot more useful than anything I could find online, including the documentation itself. So it was absolutely excellent, and I think that the code is very well written.
I'll be honest with you: I don't think I'm capable enough to contribute to Postgres itself. But, you know, I do, from time to time, think about my contribution, my type of contribution, to the Postgres community. And I know that the most straightforward way to contribute to Postgres is to contribute code. But I feel like I contribute in my own way, by advocating for Postgres and educating people on how to use Postgres correctly. Because I believe that if people have good systems, they'll be happy with Postgres. They'll continue using it. They'll recommend it to their friends. And this is how you build a community. So I think that I'm more capable of telling stories, educating people about Postgres, and working on ETL processes and applications than I am working on, you know, the tooling, the internal stuff. There are people much smarter than me that can do this stuff. I'll write about them.
Well, yeah, Postgres documentation lacks how-to recipes,
and this is what you do.
You describe some recipes and pros and cons.
This is an excellent thing.
Yeah, war stories.
You absolutely do all those things, and we really appreciate it, Haki.
Thank you so much for coming on as well.
It's been a pleasure having you here.
Okay, thank you, Nikolai, Michael.
It was a pleasure.
This is my first time on a podcast.
But you did great.
Thank you so much.
Bye.
Thank you.
Take care.
Thanks.