Postgres FM - Four million TPS
Episode Date: July 5, 2024

Nikolay talks Michael through a recent experiment to find the current maximum transactions per second single-node Postgres can achieve — why he was looking into it, what bottlenecks occurred along the way, and ideas for follow-up experiments.

Here are some links to things they mentioned:

How many TPS can we get from a single Postgres node? (article by Nikolay) https://www.linkedin.com/pulse/how-many-tps-can-we-get-from-single-postgres-node-nikolay-samokhvalov-yu0rc
Chat history with Postgres AI bot https://postgres.ai/chats/01905a83-4573-7dca-b47a-bb60ce30fe6c
Our episode on the overhead of pg_stat_statements and pg_stat_kcache https://postgres.fm/episodes/overhead-of-pg_stat_statements-and-pg_stat_kcache
PostgreSQL 17 Beta 2 is out https://www.postgresql.org/about/news/postgresql-17-beta-2-released-2885/
PostgreSQL and MySQL: Millions of Queries per Second (about the work by Sveta and Alexander) https://www.percona.com/blog/millions-queries-per-second-postgresql-and-mysql-peaceful-battle-at-modern-demanding-workloads
postgresql_cluster https://github.com/vitabaks/postgresql_cluster
Track on-CPU events for pg_wait_sampling https://github.com/postgrespro/pg_wait_sampling/pull/74
The Year of the Lock Manager's Revenge (post by Jeremy Schneider) https://ardentperf.com/2024/03/03/postgres-indexes-partitioning-and-lwlocklockmanager-scalability
Pluggable cumulative statistics (pgsql-hackers thread started by Michael Paquier) https://www.postgresql.org/message-id/flat/Zmqm9j5EO0I4W8dx%40paquier.xyz

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With special thanks to:
Jessie Draws for the elephant artwork
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I am Michael, founder of pgMustard, and this is my co-host Nikolay, founder of Postgres.ai.
Hey Nikolay, what are we talking about today?
Hi, Michael. Let's talk about how many TPS we can get from a single Postgres node.
I recently achieved something. I'm curious, is it some interesting achievement or like, no?
I'm sure it's possible to get more on bigger machines,
because my machine is... yeah, big enough, right? It's just 360 vCPUs and two terabytes of
RAM, or maybe 1.5 terabytes of RAM. Fourth-generation AMD EPYC.
There is already fifth-generation AMD EPYC,
and there are Intels also,
Xeon Scalable, bigger instances and so on.
It's possible.
But I've almost achieved
4 million transactions per second,
full-fledged transactions,
not just like what other NoSQL databases have.
It's transactions.
Read-only, we should admit, read-only,
and meaningful transactions.
I mean, it was a table with, I think, 10 million rows.
It's a tiny table for this machine.
Everything is cached, of course.
And it was a primary key lookup with pgbench.
The regular pgbench_accounts table, you know it, right?
So, yeah.
Is it single read per transaction, single query per transaction?
Is that right?
That's a good question, actually.
Let's double check.
I will double check it right now.
But I think, yes, it's just a select from pgbench_accounts by a random ID.
And that's it.
It should be.
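For reference, the built-in pgbench select-only workload being described is one primary-key lookup per transaction; a run of roughly this shape (scale factor, client count, duration, and database name are all illustrative) looks like:

    # initialize pgbench tables; scale 100 gives ~10 million rows in pgbench_accounts
    pgbench -i -s 100 benchdb

    # built-in select-only script (-S): one random primary-key lookup per transaction,
    # roughly equivalent to:
    #   \set aid random(1, 100000 * :scale)
    #   SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
    pgbench -S -c 100 -j 100 -T 30 benchdb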
I was just hoping you'd get to multiply your 4 million by 2 or something.
Get queries per second.
Let me tell you what you don't know.
Yesterday I was driving to LA and I used this.
I'm thinking I should check my AI and get even more TPS.
And thankfully I had AI with autopilot in the car, right?
So I could...
Maybe it's still not a good idea to use the phone,
but it's just a quick question, right?
So I asked my bot to repeat this TPS.
Oh, I forgot to mention,
all these experiments are done with our Postgres AI bot,
which simplifies experimentation
a lot. So,
if you need to provision a huge machine,
run some pgbench for a couple of hours
according to some script, you just
ask for it, then you think, oh, what
if we change something else?
And it's just a few words.
You just ask, and then
it reports back in a couple of hours.
So I asked: can you please check not the regular select-only for pgbench,
which is a very standard Postgres benchmark, quite simple but widely used,
but rather, let's just use SELECT with a semicolon.
Not meaningful, right?
Not dealing with data at all, but still it needs to be parsed, right?
And so on.
I had it on my list of questions to ask you what would happen if we...
I was going to suggest SELECT 1, which I guess is the beginner's version.
I can tell you, we exceeded 6 million.
6 million, wow, nice.
But I don't consider it fair, because it's just a heartbeat.
It depends what you're measuring, right?
If you're trying to measure what's the most we can achieve,
then sure, it's meaningful.
But if you're actually trying to do something meaningful,
read-only is probably not realistic.
There's so many things you could say.
I like the idea of what's the maximum we could get
if we don't worry about constraints,
about how realistic it is.
Right.
So, yeah, in this case,
we only checked the parser and general transaction,
read-only transaction, not full-fledged transaction,
all these pieces.
But we didn't check how Postgres works with,
like, for example, the buffer pool, with reading data and so on. Because I know there was some
contention in the past, even for read-only queries, which was improved, I think, in 9.6,
on the road to achieving 1 million TPS. We will probably discuss this slightly later. But in general, what's interesting, I need to explain:
I hate select one statements.
You know why, right?
No.
Because one is not needed.
Fair enough.
You can just say SELECT in Postgres.
It will return you a result consisting of one row and zero columns.
Can you just run semicolon on its own or something?
Yes, you can.
I think we also discussed it.
Must have done.
Yeah, it has issues, negative side effects,
because, for example, you won't see it in logs
if somehow it's slow.
It's an empty query.
You won't see it in pg_stat_statements, for example,
unlike a regular SELECT with a semicolon.
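A minimal way to see the difference by hand (database name is illustrative):

    # bare SELECT: a valid statement that returns one row with zero columns,
    # and it is tracked by pg_stat_statements like any other query
    psql benchdb -c 'SELECT;'

    # an empty query string (just a semicolon) is accepted by the server,
    # but it is treated as an empty query, so it does not show up in
    # pg_stat_statements or in the slow-query log the way a normal statement would
    psql benchdb -c ';'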
And I failed to convince the bot to run just a semicolon.
Somehow it didn't work.
And I also had difficulties convincing the bot to run a bare SELECT.
It started telling me... I think it was Gemini.
It was the specific aggressive manner of Gemini, for sure.
So it started telling me, you know, it's incorrect,
it's an invalid SQL statement. It offered me: let's put a 1 at least. Or, where's the table
name? I said no tables, no tables, just idle queries, no
real action. And then it said, let's put at least a 1, because otherwise it's an
incorrect statement. I, of course, said, are you sure? You can check. So we had some battle,
but in general it worked, and we achieved 6 million TPS, which was good.
But the semicolon didn't work, didn't work for us. I think it's possible. I don't expect a huge win compared to bare SELECT.
Maybe it will be Parse.
I don't know.
You could run the experiment manually, right?
Could you just submit the JSON yourself?
It's better to convince the bot.
You prefer to fight the LLM, fair enough.
A little bit, yeah, because every time we do it,
I find opportunity to improve, of course.
Fair.
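For anyone who wants to try this particular variant by hand rather than through the bot, a pgbench custom script is enough (file and database names are illustrative):

    # put the bare statement into a custom script file
    echo 'SELECT;' > bare_select.sql

    # -n skips vacuuming, -f runs the custom script instead of the built-in ones
    pgbench -n -f bare_select.sql -c 100 -j 100 -T 30 benchdb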
But back to meaningful results, which is interesting.
We almost achieved 4 million, but originally it was just 1 million.
And when I saw just 1 million on modern Postgres,
we checked 16 and 17 beta 1.
By the way, beta 2 is already out.
I think we should update and switch to it in our tests.
So I expected much more because I remember the post from Alexander Korotkov from 2016
where, I remember, he worked with the Percona people, Sveta Smirnova particularly.
They had a friendly battle:
let's see
what's better, Postgres or
MySQL in terms of how many
TPS we can get from a single node.
And I think
both Postgres and MySQL learned
something from it and
Alexander created a few patches
because some
issues were found, some contention was found.
And it was improved: in Postgres 9.5 it was not so good, and in 9.6 it was improved, just based
on this work. And he achieved 1 million TPS. MySQL also achieved 1 million TPS. Good. But I was
surprised: so many years, like 8 years, right?
We have so many Postgres versions,
basically 8 Postgres versions already
since then, and maybe 7,
and much better hardware.
Not much, but
significantly better hardware.
So I expected to get more
from this.
Well, we looked into the machines, didn't we?
And even back then,
they had managed to get their hands on a 144 vCPU machine.
And I think you mentioned yours was a bit more than double that, was it?
360.
And I think they had the Intels, all much older Intels,
and I had AMD, which is very easily available on GCP, which I use because
I have credits there. Right. So yeah, before we started this call... I cannot miss this.
You showed me what AWS has. Super impressive. How many? Almost 900: 896
vCPUs,
and it's Intel Xeon Scalable,
fourth generation.
Yeah, it'll cost you.
But also 32 terabytes of RAM
maximum.
The 224xlarge size.
And they don't have spots, of course.
This is something new.
Yeah, and it's fourth-generation Intel Xeon Scalable.
What I found interesting, like during last year and this year,
various benchmarks, not this like toy benchmark,
it's a side project for me,
but working with customers, various cases.
What I found, what my impression right now from Intel and AMD,
recent versions of them available in cloud, is that AMD, for example, on GCP, you can get more
vCPUs and reach more TPS on very simple workloads. And it will be cheaper in general: if you divide TPS
by dollars, find TPS per dollar,
it will be cheaper probably
and you can achieve
higher throughput
for simple workloads. But when it
comes to complex workloads
and various problems like
lightweight lock manager
contention, which we discussed a few times,
and in general complex workloads, the behavior of autovacuum and so on,
AMD looks less preferable
and Intels are winning.
So that's why I'm additionally impressed
that AWS offers fourth-generation Intel Xeon Scalable,
almost 900 vCPUs.
It's insane for me.
32 terabytes.
Wow.
And you think I can exceed 4 million TPS, right?
It's a joke, I guess.
It'd be interesting to hear more about what you think the current bottlenecks are,
like where you think... But I guess my stupid brain
is thinking
this is probably
CPU-constrained at this point,
so throwing more at it
makes sense. If you could
throw double at it, maybe you can
get close to doubling it.
Well, we collect
a lot of information. We have wait event analysis. Yeah, we have
flame graphs collected automatically for each step, and it still needs to be analyzed
deeper. But what I see right now, the bottlenecks are mostly communication. Actually,
maybe let's take a step or a couple of steps back and discuss what issues I had originally.
Why only one million?
Can we take a step back further than that?
Why were you doing this?
Yeah, it's a good question.
I was asking myself, why do we do this?
Well, for fun, first of all, actually.
I was very curious.
I remember that
work. I think
for Alexander
and Sveta, it was also for fun
eight years ago.
But it led to some interesting findings
and eventually to optimizations.
So this
fun was converted
to something useful. And we started doing
this for fun because we have like big machines, we have very good automation, we have newer
Postgres. I was just curious how easy it is for us to achieve 1 million and maybe go further.
And since we have a complex system, a lot of things and observability tools, the first thing we saw is
that one of the extensions was a huge bottleneck. It was pg_stat_kcache.
And it didn't even let us achieve 1 million TPS, quickly showing a huge observer effect,
meaning that just measuring... pg_stat_kcache extends pg_stat_statements
to see metrics related
to physical metrics,
like real CPU:
user time,
system time, then
real disk I/O, not
just talking to the page cache,
and context
switches.
Very good, detailed physical query analysis, right?
And we like to have it,
but if you had it, until the latest versions,
it quickly became a huge overhead
when you have a lot of queries per second for a specific query.
I remember we discussed this back in,
I just looked up the episode back in February
about the overhead.
Right, exactly.
Yeah, I'll link that episode up as well.
Right, and yeah, just to recap,
we found that it shows up in wait event analysis
using pg_wait_sampling. Great.
And we just quickly excluded it
and also reported to the maintainers
and they fixed it in four hours.
It was amazing.
And we tested it with the bot, confirmed that now it's improved.
So this is how we achieved 1 million.
But why only 1 million?
We have newer Postgres and better hardware, right?
Did I answer the question why, by the way?
You said for fun.
For fun, right? Yeah, this is the
key actually. I enjoy it, you know. Let's just run pgbench this way or that way, let's
collect more data points, let's visualize it, let's repeat this with an adjusted version, let's compare
various versions, for example, right?
So it sounds like for fun, but also, like, you've got credits,
and also dogfooding, right? Like trying it out on... here's an idea... Yeah, yeah.
We try to collect successful experiments, and we do it on smaller machines as well, but on bigger
machines we collect them just to understand how the system is working. Of course, we're moving towards more useful benchmarks,
for example, when it will be already some specific database
and specific kind of workload,
and maybe not only PgBench and so on.
And all the tooling will be in place,
and automation is super high,
so you can just talk to the chatbot
and get results visualized and collected with all details,
and you don't miss any point. We collect almost 80 artifacts for each data point. It's
huge. It reflects many years of experience, I would say, not only mine. For example, you can say,
okay, let's compare versions 12 through 17 beta 1. This is what I did as well.
And I confirmed that for regular PgBench workload, or for select-only PgBench workload, no difference.
These versions are stable.
No regression, but also no wins.
12 through 17.
Yeah.
But for select-only, this is exactly like workload which allowed me to achieve almost 4 million TPS. For this, 12 is losing significantly.
Wait, wait, wait. You lost me. So for read only...
No, no, for regular.
Oh, for read-write.
For read-write.
You know, the regular transaction pgbench has is, I don't know,
a few updates, an insert, a select,
and they are packaged as a single transaction,
and then you just say, okay, let's just check it.
And the way we check it is actually called not load testing,
but stress testing, a specialized version of load testing,
and we check at the edge.
We say, okay, with one client, how many?
50 clients, how many?
100 clients, how many?
And then you have a graph dependency of TPS,
how TPS depends on the number of clients.
Yeah.
And it's interesting that we cannot say zero
and due to scale,
we need to jump with big steps like 50.
So I say one, 50, 100, then dot, dot, dot, 500.
And the bot understands that it needs to go with 50 increments.
But the first dot is shifted to one, and it's okay.
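A rough sketch of that kind of stepped stress test outside the bot, as a simple shell loop (client counts, duration, and database name are illustrative):

    # find the edge: run the same select-only workload at increasing client counts
    # and collect the reported TPS line for each step
    for clients in 1 50 100 150 200 250 300 350 400 450 500; do
      echo -n "clients=$clients " >> tps_by_clients.log
      pgbench -S -c "$clients" -j "$clients" -T 30 benchdb | grep tps >> tps_by_clients.log
    done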
So it's interesting.
It adjusts very quickly. So yeah, for fun. And back to the versions comparison:
we found that for select-only we don't see a difference, but for the regular workload, where we don't
get a million transactions, we have only like 30,000, maybe 40,000,
something quite a lot lower, because these are already writing transactions and so on, we saw that 12 is
losing, apparently, and 16 and 17 are winning slightly compared to the average for the older versions.
But yeah, I think we will continue exploring and maybe test some specific workloads,
not just standard ones.
And just to check, this is like completely untuned,
like just default config.
Okay, it's tuned.
Good to check.
Yeah, we use postgresql_cluster,
Ansible playbooks
maintained by Vitaly, who works with me,
Vitaly Kukharik, and
it's a great project if you don't like
Kubernetes, for example,
and actually containers,
and you just need bare Postgres
installed on Debian,
Ubuntu or something, and you want
automation for it.
This is a good project.
It goes with everything,
like all the most popular stuff,
like Patroni,
pgBackRest
or WAL-G,
what else,
like a lot of
simple things,
actually.
Even
TimescaleDB
is packaged,
so you can choose
to use it
and quickly have it
in your
cloud
or
so on.
So for self-managed
Postgres,
it's a good option.
Yeah.
It takes time to install, of course,
because it will be running like apt-install,
apt-get-install, right?
So it's coming with some tuning.
It's similar to what cloud providers do,
like 25% for shared buffers.
It also adjusts operating system parameters,
like kernel settings.
I guess the dataset is so small that it still easily fits within that 25%.
Yeah, yeah, of course.
And maybe we can tune additionally,
but I didn't see the point yet.
This default tuning this project provides
was also like...
I know it quite well
because we adjusted it together in the past
a few times. So I know how it works. So I relied on it for now, but maybe we will
additionally adjust some things, but queries are super simple, right? Data is cached.
So I understand. I'm just trying to think of the major things. It's all cached, I guess. Basic,
like, write-ahead log stuff, when you've got the read-write stuff going on...
Yeah, it doesn't matter, because we don't write.
So in the other one, the 30,000 one... I reckon the 30,000 is low
and you could get much higher there.
Well, yeah.
If we talk about writing transactions, of course,
we should think about checkpoints and so on.
And it comes with some adjustments based on the size of the machine already.
So I just relied on this default behavior.
We can look further, and the good thing is,
we consider settings one of the artifacts we collect.
So for history, we record everything,
all custom settings and some system info as well.
So we can revisit it later and think about how to improve as a next step. But the bottlenecks are not there yet.
And I was wondering,
last question I had on the top line number,
I get keeping pg_wait_sampling for being able to introspect,
for being able to try and work out where the bottleneck is now.
It's great for that.
But doesn't it also add a small amount of overhead?
I know it's only sampling, so it's probably minimal.
But were you tempted to try without it as well and just see what TPS you got?
Yeah, I think we did.
And I think it's very minimal.
It's worth revisiting as well.
But for now, our impression is that pg_wait_sampling, among...
That's why, by the way,
we encountered the problems with pg_stat_kcache,
because we used basically the package
we use for various kinds of experiments
in the case of the self-managed approach, right?
It had them more or less automatically
on by default. Yeah, we already
had these observability tools there,
and this is how we found the pg_stat_kcache
problem. As for
pg_wait_sampling, among all three
extensions, pg_stat_kcache,
pg_stat_statements, and pg_wait_sampling...
By the way, both
pg_wait_sampling and pg_stat_kcache
depend on pg_stat_statements,
because they basically extend it for query analysis.
They all provide query analysis.
Regular pg_stat_statements provides query analysis
with regular database metrics like calls, timing,
I/O metrics at the upper level, buffer pool metrics, right?
pg_stat_kcache goes down to the physical level, CPU and disk.
And pg_wait_sampling, for each query, also provides a profile in terms of wait
events, which is super useful.
It's like different angles of analysis, right?
So if you can have all three, it's super good.
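As a rough illustration of combining those angles (assuming all three extensions are installed; column names follow the pg_wait_sampling and pg_stat_statements docs, so worth double-checking against the installed versions), something like this joins the per-query wait-event profile to pg_stat_statements by query ID:

    # per-query wait-event profile joined to pg_stat_statements by queryid
    psql benchdb -c "
    SELECT s.query, p.event_type, p.event, sum(p.count) AS samples
    FROM pg_wait_sampling_profile p
    JOIN pg_stat_statements s USING (queryid)
    GROUP BY 1, 2, 3
    ORDER BY samples DESC
    LIMIT 20;"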
Also worth mentioning, it's off topic, but interesting.
Until very recently,
pg_wait_sampling didn't register events where the wait event is null,
which most wait event analysis tools we know,
such as RDS Performance Insights, picture as a green area called CPU.
And by the way, Alexander Korotkov, who was originally the author of pg_wait_sampling,
confirmed my idea that it's not fair to name it CPU.
It should be like CPU or some unknown wait event
which is not yet implemented
because not everything is covered
by a wait event analysis in the code base, right?
And until recently, pg_wait_sampling had a huge problem.
Huge problem.
It didn't register nulls at all,
not even naming them somehow,
at least as CPU or something.
But the guys from CyberTec implemented it
a couple of weeks ago.
It's huge news. It's good.
Thank you, guys, if you listen to us.
I'm happy that it's
implemented. And I know GCP
Cloud SQL also has
pg_wait_sampling.
And I like the idea that unlike
in Performance Insights in RDS,
we have interface to this data.
I mean, good interface with SQL, right?
We can just query this data right from our database
and build our own observability extensions or automation.
For experiments, it's super important, because I want to take a snapshot,
and it's easier for me just to take a snapshot
using SQL.
Let's collect the profile,
globally and per query.
And we dump it
to CSV files, basically.
That's it for long-term
storage.
So we quickly found the pg_stat_kcache issue,
solved it, and we got stuck
at 1 million TPS,
which made me sad because it just repeated the experiment
Alexander had eight years ago.
So what's next?
Do you know what was next?
I've cheated.
I've read your post already.
You know.
Yeah.
And I think the last episode as well gave it away, back in February,
with pg_stat_statements.
It is pg_stat_statements,
a huge bottleneck for such cases.
I consider this an edge case, because normally we don't have
a super high-frequency, super fast query and a lot of CPUs.
and a lot of CPUs.
But theoretically, it can happen.
For example, you have Postgres and a few replicas,
and you have some huge table, maybe not so huge, single index, and you need just to find records.
Basically, almost like a key value approach. Maybe partitioned. And then queries are super fast, a lot below one millisecond.
So in this case, if you have a query which, after you remove parameters, is the same,
you have a single normalized query with very high frequency.
Yeah, there are probably only two or three normalized queries in that pgbench script.
Well, by the way, I double-checked: for select-only it's just a single query.
It's one. Okay, wow.
It also has this macro, backslash set, to get a random number,
but I think it's client-side, so it's not going to Postgres.
It's only for pgbench itself to generate the random
number, and I don't think it's a bottleneck. I hope not. Oh, by the way, our pgbench clients were sitting
on the same host as Postgres, which is interesting, right? Because they consume CPU as well, and
unfortunately right now we don't collect host stats and so on. We don't know how much CPU they were eating. But usually, it's noticeable.
It's not 50% usually when you do experiments like that,
co-hosting clients and server.
But I think it's noticeable, I would say maybe 10%, 20%.
So if you offload them somehow, but if you offload them,
you bring in the network.
Yeah.
And we will
get back to that point. So
single machine, clients
on the same machine, eating some CPU.
And if you
have this pattern, just a single query,
a normalized query, pg_stat_statements becomes
a huge bottleneck.
Just removing pg_stat_statements, we jumped
from 1 million to 2.5 million
TPS. Yeah.
Yeah, wow.
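For completeness, "removing pg_stat_statements" in a setup like this mostly means taking the module out of shared_preload_libraries, since the tracking hooks live in the preloaded library (library list and database name are illustrative):

    # in postgresql.conf, change e.g.
    #   shared_preload_libraries = 'pg_stat_statements, pg_stat_kcache, pg_wait_sampling'
    # to
    #   shared_preload_libraries = 'pg_wait_sampling'
    # and restart Postgres; dropping the extension objects afterwards is optional
    psql benchdb -c 'DROP EXTENSION IF EXISTS pg_stat_statements CASCADE'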
Yeah. And then what's next? We
reached 3 million.
How? Let's recall.
Then we reached 3 million just
because I forgot originally
I should use it. I remember Alexander
used it in 2016.
Prepare statements, right?
I forgot to use it and
how did I understand that I
forgot? Looking at flame graphs. Our bot
collects flame graphs.
And I just
was inspecting
flame graphs and I just saw planning time.
And I'm thinking, oh, we spent a lot
on planning time.
Let's just get rid of it.
So the right way is just to say dash capital M prepared,
-M prepared, for pgbench.
And it's a perfect workload for it because it's simple,
same query over and over again, no variance.
Perfect.
Right.
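In pgbench terms that is just switching the protocol mode on the same run (client counts and duration illustrative):

    # same select-only workload, but with prepared statements:
    # each client parses and plans the query once, then only binds and executes
    pgbench -S -M prepared -c 360 -j 360 -T 30 benchdb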
So it's also interesting... interesting, again, a step back.
When you see that, starting from 100, 150, 200,
you go to 500 clients, and you have 360 cores,
and you see it goes up to 100 or 150
and then it's a plateau,
it also gives you a strong feeling it's not normal, right?
Because it means Postgres doesn't scale in terms of the number of parallel clients
to the number of cores.
Would you expect it to scale to the number of cores?
Of course, we have clients running on the same host, so maybe saturation will start earlier,
but definitely not at 100 or 150
if you have 360 cores.
So this is how I saw this is not right.
And it happened with pg_stat_kcache
and then pg_stat_statements:
saturation started earlier.
So yeah, some bottleneck, obviously.
But now the curve looks much better already.
We reach the maximum point
already close to the number of cores,
300 or 360.
Maybe 400, actually, slightly higher.
So then it doesn't make sense to increase further,
but we just check it, trying to find the plateau.
So prepared statements gave us more than an additional half a million TPS.
Oh, I expected more, actually.
Only because, you know, when you're looking at super fast primary key lookups,
quite often planning time is more than execution time
in a simple explain-analyze.
Interesting, yeah.
Yeah, so it's just interesting to me that it wasn't like double.
But yeah, half a million more is nothing to look down our noses at.
Yeah, well, interesting.
Maybe I should think about it and explore additionally.
But this is how it is.
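On the planning-versus-execution point, a quick way to see it for this kind of lookup (query and database name illustrative; exact timings depend on the machine):

    # EXPLAIN ANALYZE reports both phases at the end of its output,
    # as "Planning Time: ..." and "Execution Time: ..." lines; for fast
    # primary-key lookups on cached data, planning is often the larger share
    psql benchdb -c 'EXPLAIN ANALYZE SELECT abalance FROM pgbench_accounts WHERE aid = 1'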
Did you do force generic plan as well?
No, not yet.
Mind you, that would only... that would only be, like, five...
Yeah. No, that probably won't help. Here's the thing: it's my to-do, maybe for this week or next,
when I have time. I don't have time... again, this is a side project, just for fun and to understand
the general behavior of Postgres. What I remember about this forced generic plan is that Jeremy Schneider posted
a very good overview
of problems
others had
with lock manager recently,
right? And I remember
some
benchmarks, some specific case
someone showed
on Twitter. If you just
tell pgbench, if you adjust
user, for example, alter user
and set force plan
cache mode, this force generic
plan, it leads to
huge degradation in terms of TPS.
So I need to check
it. I need to check it.
And the lightweight
lock manager pops up
at the top of wait events.
So I need to double check it because I don't understand the nature of it.
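The setting in question, for reference (role name is illustrative; the degradation claim is the benchmark result Nikolay mentions, not something verified here):

    # force generic plans for prepared statements for a given role;
    # valid values are auto (default), force_custom_plan, force_generic_plan
    psql benchdb -c 'ALTER ROLE app_user SET plan_cache_mode = force_generic_plan'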
It's super interesting.
I realized soon after saying it that the reason I thought it would help
is not going to help here because you've got the same query over and over,
not lots of little different queries.
But that should be cached soon enough, right?
Yeah, but after five executions, which
is probably, yeah, less than half a millisecond total, there should be no difference,
in my opinion. But no... degradation? Why degradation? That's also a question. Yeah, fair, fair. Something
interesting to check, or maybe I just misunderstood. This is my to-do, to clarify. So I guess
it's interesting, exploring Postgres behavior on the edge.
So let's just recap.
Like, pg_stat_kcache: we removed it,
and it was also fixed, but did we
get it back? We should get it back, actually,
because now it's much better. They just removed
some lock, and it's
good again. Then we found
pg_stat_statements. Oh, so
we achieved 1 million, and then pg_stat_statements... we
removed pg_stat_statements.
This is the key.
By the way, when I say
we keep only pg_wait_sampling,
but it also depends on pg_stat_statements,
I also feel
some inconsistency
in my...
I've looked it up. When you said it, I got confused.
It doesn't depend on it. It's just that more
stuff is possible.
It's optional, right?
Yeah.
It can run alone, as we do right now in these experiments.
But if pg_stat_statements is present, we can join data using query ID, right?
Exactly.
This is a good correction because I said the wrong thing.
But pg_stat_kcache does depend on pg_stat_statements.
You cannot install it without it.
This is for sure.
Okay.
And these guys are not available in cloud environments usually.
pg_wait_sampling is in Cloud SQL.
I mean, not in cloud, in managed Postgres.
pg_wait_sampling is available on Cloud SQL, but that's it.
But maybe after this podcast, people will consider adding it.
Especially pg_wait_sampling,
which is really great.
Especially now with this fix
from CyberTec.
So we removed
pg_stat_statements, reached
2.5 million.
We added prepared statements,
exceeded 3 million.
And then the final step, how I almost approached 4 million:
it's like 3.75 million TPS.
How?
The last optimization.
It's actually not a super fair optimization.
My last optimization was: let's just switch from a TCP connection
to a Unix domain socket connection,
which is possible in limited
cases, because it works only when you're on the same node. You exclude the network completely.
And of course, this is obvious: TCP/IP connections are heavier, definitely,
more overhead than when processes talk through a Unix domain socket.
It's much more lightweight.
And this allowed me to jump from 3-something to 3.75 million TPS,
which is good.
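In pgbench terms, the difference is just what gets passed to -h (the socket directory shown is the Debian/Ubuntu default and may differ elsewhere):

    # over TCP, via loopback
    pgbench -S -M prepared -h 127.0.0.1 -c 360 -j 360 -T 30 benchdb

    # over a Unix domain socket: point -h at the socket directory instead of an address
    # (omitting -h entirely also uses the default local socket on most builds)
    pgbench -S -M prepared -h /var/run/postgresql -c 360 -j 360 -T 30 benchdb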
And also I found that 17 works worse over a Unix domain socket than 16.
Interesting.
Yeah, this is also in my to-do.
Why?
For TCP connections,
no difference.
On this edge case,
like select only.
But for a Unix domain
socket, I see
degradation and it's
worth exploring why.
So maybe I found
some degradation for
17.
I also need to
double check.
Quite a big
difference.
Yeah.
Oh, no, no, no.
So sorry.
7 to 8%, as I saw, at higher numbers of clients.
Yeah, sure.
Sure, sure, sure.
I've made the mistake of looking at the 17 beta chart
and looking at TCP versus Unix, not 17 versus 16.
By the way, when you have a long chart with both
and different series of experiments,
then you want to cross-compare something, it's super easy.
Yeah, just visualize this and that on the same graph.
That's it.
Why not?
Much easier than getting it to give you the correct JSON, as I saw from the chat transcript.
Well, yeah, it depends. And Gemini
is better at JSON than GPT. Well, we have a lot of fun stories, so let's not go there. But
it happens, yeah. So yeah, almost 4 million TPS for a quite meaningful workload. I
think it's worth checking bigger tables, for example, to see how it
depends on that. For example, we can just take like 300 clients and then draw a different picture.
It's in my mind: for example, a very small table, then bigger, bigger, bigger, and
huge, right? And how it will degrade, for example, right?
When you say a bigger table, do you mean like
100 million rows instead of 10 million rows? Yeah, 200 million, half a billion rows, a billion rows, partitioned, not partitioned. It's
like a huge maze: you can go in many directions, and turn, and then maybe return and inspect another path.
The cool thing is that I do it, like, anywhere.
Like, I don't know, like, I'm eating breakfast
and checking what's up with our experiments.
Let's do something else.
It's just my current hobby, you know, to explore.
Yeah, I think we should have more tools, like sysbench, for example,
to have different kinds of workloads to bring to this.
So, yeah, that's it.
I'm very curious: for select-only pgbench,
has anyone gotten more?
But now, like, I was thinking,
oh, maybe it's the biggest number.
It's definitely the biggest I ever saw,
but I didn't see everything.
Maybe someone has bigger numbers
or saw bigger numbers for pgbench.
If they have, I couldn't find it.
I did do some looking before the episode just to try and see if anyone pushed it further.
If it's been published, I've struggled to find it.
But it's definitely possible on these huge machines on AWS, right?
Because you have more vCPUs
and as we see Postgres scales quite well
in terms of number of clients
and number of CPUs.
That's the direction.
If you're talking about that maze,
the way I would be tempted to go with this
is with the currently available
cloud-provided VMs,
what's the most TPS I can get?
I don't care about cheating.
I don't care if it's only select
semicolon.
I just would love to know, what is that
number? I'm guessing it's above
10 million, but is it 12? Is it
15? How many could we...
That would be super interesting.
Actually, it's another good
point. By the way, I forgot to mention
that for this problem with pg_stat_statements we identified (again, if you have a select-only, key-value approach and you need to squeeze out more, pg_stat_statements can drop performance three times), there is a discussion, I think, started by Michael Paquier.
Sorry, I always pronounce it wrong.
How to optimize, what to do with pg_stat_statements in general.
And he reacted on Twitter to my benchmark.
So definitely there is an opportunity here to improve pg_stat_statements.
One of the ideas was to start sampling at some point.
Sampling is the obvious one.
Also, like,
deal with the high-contention
case
differently, and somehow
update less.
There are ideas.
I would hate for the
takeaway here to be that there's a
real overhead of pg_stat_statements,
mostly because
this is so much the extreme version, right? This is the same query being... What
I put everywhere as a disclaimer: this is an edge case. In production you probably don't see it. Yeah,
yeah. With your workload you won't see it. With your workload, probably, it's only a few percent
overhead. But in some cases, when your workload
is just one or a few normalized queries and the frequency is super high, you have a lot of vCPUs
and they start competing, this is a high-contention situation. In this case, just check it. And
in terms of wait event analysis: pg_stat_statements didn't show up in wait event analysis until,
I think, Postgres 16, because basically it was not covered by wait events, I think.
Interesting.
So maybe I remember incorrectly, but definitely it was not covered until some version and then
it started to be covered. So you can see it if you have a very fresh
version of Postgres, you can see it in, for example, RDS performance insights.
Right.
In this case, it's worth just removing it.
But I think some folks have this kind of workload. It's not for everyone.
Quite rare.
Let's say it's quite rare, but it's a valid key-value case, right?
Just selects.
You need to serve a lot of... you need to respond very quickly using just a B-tree index search, just a single row, a super fast query, and you have a lot of them, right?
So just by removing pg_stat_statements, you can get more.
So this leads to the question of how to optimize pg_stat_statements.
There is an interesting discussion in hackers.
Let's just advertise it a little bit for those who are interested.
Yeah, for sure.
And another point I wanted to mention,
you started this discussion, various clouds.
Our bot knows already about price, about cost.
So we could potentially check different machines on GCP and bring also AWS and start discovering.
I want 1 million TPS, just select where it's cheaper, which machine.
ARM, Intel, AMD, let's compare different clouds, different machines, where.
It's interesting, right?
I mean, performance versus cost.
I think so.
I liked your transactions per second per dollar.
TPSPD, I reckon you should go with.
Starting from a very simple, trivial workload like these selects,
and then maybe extending it closer to what you have,
and then you can decide which machine is better for you.
This is something I've actually been struggling with.
I think it could be really useful.
How do you let someone know that their workload could
be done on the next instance size down? Maybe through a bit of optimization, but if people
have a rough idea of their workload, maybe they know it's all OLTP and they know roughly the
average transactions per second, what size instance should they roughly be on? Just as a rule of thumb.
This is called capacity planning. I know, but also I think so many people are so over-provisioned because they're so
under-tuned.
Many people are under-provisioned.
They don't realize.
I have cases where people have already hit the ceiling for a specific kind of platform,
for example. For example, Graviton
on AWS, or on RDS, for example,
and you already have the biggest size.
Maybe it's time to,
I don't know,
scale somehow, or to take
different kinds of machines, because you see
how big the Intels are that you could have,
but it's not available on RDS, of course, yet,
right? So
I see cases both under-provisioned
and, maybe you're right, it's more common,
over-provisioned,
when you have CPU at 5% maximum, right?
And you pay for everything else, like, why?
But I also see cases when it's already hitting ceilings
and it's time to do something about it, right?
Of course.
Maybe I see them because people go to us for consulting and help, right?
So they don't go for, well, sometimes we have this,
like let's optimize our spendings, right?
But it's like we're mostly useful when you have problems with performance
and how to grow further. Yeah. Well, I see the
opposite, I think. Not because they come to me, or they come to us. It's more
that I know friends, like founder friends, and quite often the cheapest solution in the
early days is to double the instance size. You know, if you're going from $30 to $60 to $120,
there's not many engineering hours
that can go into fixing that.
You might as well just upgrade
and then worry about it later.
For example, you became a big fan of Graviton ARM, right?
And you decide, oh, it's great.
And you checked: performance versus cost, great. But maybe
it's not. It was good a year ago, but maybe not now, and things change very quickly, and different
instances and types of instances are added in the cloud very quickly. So, in an ideal
world, we should have some rule for how to find the optimal choice of instance family and size.
And also, we should have great migration automation to switch with zero downtime, right?
That's the other thing.
I think people are scared.
People are scared of going back down.
Test and then switch.
Everything is possible, but requires some effort, unfortunately.
So, good.
4 million TPS, almost.
Almost, yeah.
Or 6 million.
I'm going with 6 million TPS.
And maybe someone already has this machine
with almost 900 vCPUs
and they can show us 10 million or something.
I'd expect more.
Maybe. Yeah.
10 million TPS is the next milestone
for Postgres on a single node.
Great. Yeah. By the way,
final note, I tried to find information
about MongoDB. I asked some folks,
and I found only
some slightly outdated benchmarks, like
three years old or so. And
I didn't see anything higher
from a single node.
I didn't even care about
whether they had transaction mode enabled or not, right?
Because they can cheat there, right?
But I didn't see four million.
I saw only 2 million, 2.5 million or something TPS.
That's it.
So I think Postgres is very competitive.
Good to know.
Yeah. Okay. Good.
Nice one.
Cheers, Nikolay. Catch you next week.