Postgres FM - TOAST
Episode Date: March 3, 2023

Nikolay and Michael discuss TOAST (The Oversized-Attribute Storage Technique) — what it is, how it works, and some general things to be aware of.

Here are links to a few things we mentioned:
TOAST docs
TOAST wiki
Hussein Nasser on rows per page (Twitter)
Toasting in action (dbi services blog)
Interview with Peter Zaitsev (Postgres TV)
Building columnar compression in a row-oriented database (Timescale blog post)
The Surprising Impact of Medium-Size Texts on PostgreSQL Performance (blog post by Haki Benita)
PostgreSQL at Scale: Saving Space Basically for Free (blog post by Braintree on column Tetris)
postgres_dba alignment padding query

What did you like or not like? What should we discuss next time? Let us know by tweeting us on @samokhvalov / @michristofides / @PostgresFM, or by commenting on our Google doc.
If you would like to share this episode, here's a good link (and thank you!)

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:
Jessie Draws for the amazing artwork
Transcript
Hello, and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I'm Michael, founder of pgMustard, and this is my co-host Nikolay, founder of Postgres.ai.
Hey Nikolay, what are we talking about today?
Your choice, and it's toast, yet another boring topic.
I think it's the tastiest breakfast snack, so I thought you might actually want to talk about it being, you know, food-related.
Well, okay, I agree that this is something that everyone should learn, at least in some detail. Maybe not deeply, but, I mean, all developers should understand how Postgres stores data and why the limits are so good. I mean, we can store up to one gigabyte of data in a single column of a single row — in a single cell, to speak a simple language. And it's great. So you can store really large values, usually.
Yeah, this is one of my favorite types of Postgres feature in that it's on by default.
It does something that most people never have to worry about.
Transparent is the correct word.
Exactly. Transparent and on by default. It helps in so many ways that you might never have thought about, or you might only ever hit a limit. I love this kind of thing, because you can go through life never knowing about it and it's still helping you out.
Limited toast. Toast to everyone — almost unlimited!
Yeah. So should we start with what we're even talking about? I guess, for anybody new to this, we're talking about how Postgres,
well, based on a couple of Postgres limitations,
specifically, you can't store a row across multiple pages
and the default page size being 8 kilobytes and...
Kibibytes.
Kibibytes, sure, yeah.
And there's one other thing that is specifically limited,
I think, a two kilobyte limit
so that it can store at least four rows or tuples per page.
If we only had that,
we'd only be able to store small and medium values in Postgres, right?
But we want to be able to potentially store
sometimes larger or relatively large pieces of data from time to time.
Yeah, let's warm up with a small topic.
It's not about TOAST itself, but it's a good intro topic.
Recently, a very popular YouTuber and blogger, Hussein Nasser —
sorry, I always pronounce other people's names wrong,
but people do the same with mine, because my last name is very difficult to pronounce —
so, there was a great couple of posts on Twitter and LinkedIn.
How many rows can we insert?
How many tuples, actually?
Let's speak correct language here because we are on Postgres.fm podcast, right?
Not rows, but tuples, or tuples as you prefer. How many of them can fit
into one single page, which is, as you said, by default, eight kibibytes? And it's a great
question to understand the internal layout. There is a page header — a static number of bytes, 24 bytes, is allocated in each page for the page header.
Then, from both directions, we fill the page with data.
From the beginning, we fill it with pointers to tuples — pointers to entries, right?
And each one is only four bytes.
So for each tuple in the page, the pointer takes four bytes, and they start from the beginning.
But real data comes from the end of the page, right?
And of course, the size there depends on the data itself.
But each tuple there always has a header as well, tuple header.
And it's 23 bytes, but due to alignment padding, it's padded to 24 bytes.
So alignment padding is for performance reasons: data should be fetched in so-called words of eight bytes. That's why 24, not 23. And it means that we can fit a few hundred tuples, right?
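As a rough sanity check of that figure, using the byte sizes just mentioned (8,192-byte page, 24-byte page header, 4-byte line pointer, 24-byte padded tuple header):

    -- usable space per page: 8192 - 24 = 8168 bytes
    -- each zero-column tuple costs 4 (line pointer) + 24 (padded tuple header) = 28 bytes
    select (8192 - 24) / (4 + 24) as max_tuples_per_page;  -- 291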
But an interesting point: you can create a table without any columns at all, right?
That's interesting.
So you can create table, parenthesis, that's it.
Create table T1, for example, parenthesis, that's it, semicolon.
Postgres allows it.
Of course, it's not a standard thing, but Postgres allows it.
Why not?
We will always have internal, like, virtual
system columns, ctid, xmin,
xmax, as we discussed a few
times. So you can select
ctid, and it will provide
page number and offset there.
Offset, this is exactly this
pointer. And so we can
start inserting, insert into
t1, select. Again,
select in Postgres,
select clause allows
empty list of columns as well.
It's some kind of trick.
And this is how you can reach the maximum number of tuples inserted into a page: 291 tuples.
And the overhead is huge,
actually, in this case.
But I would like to say it's not 100%, because the number of rows or tuples we have is also some knowledge, right?
But almost 100% wasted for overhead
for various headers and pointers.
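A minimal sketch of that experiment, assuming the default 8 KiB page size and a freshly created table (the row count is just for illustration):

    create table t1 ();                                   -- a table with no columns is allowed
    insert into t1 select from generate_series(1, 300);   -- SELECT with an empty column list
    -- count tuples per page using the ctid trick:
    select (ctid::text::point)[0]::int as page_no, count(*)
    from t1
    group by 1
    order by 1;
    -- expect 291 tuples on page 0 and the remaining 9 on page 1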
Yeah, so we're looking, I guess,
that's the theoretical maximum number of tuples.
Practical as well.
I did it, but nobody wants to...
But it's not useful, right?
In some testing, maybe.
In some testing, sometimes it's useful.
But it does mean that in practice,
we'll only ever have a maximum of a few hundred rows per page,
even if you're only storing very, very small amounts of data.
But I guess we're talking about the opposite in this case.
We're talking about the other end of the limit.
Yes, this is extreme.
I just wanted to give a wider understanding of what we talk about.
This is an extreme situation when we have no data at all in the table, and we can have
291 tuples.
Then we start having some column, for example. An interesting thing: it almost doesn't matter which one. The smallest column will probably be one byte — if you take, for example, boolean, boolean is one byte. Already a lot of waste, right? Because it's not a bit, it's a byte. But this is the world we live in. This is the overhead of the relational approach. Of course, if you need to store some bits, you probably need to create some additional storage, or pack them into bytes and store them somehow differently.
But take, for example, a one-byte data type. Again, it will be padded to eight bytes, and you will be able to store only 226 rows because of this overhead. The same for a regular integer, which is 4 bytes — it will be padded with four zero bytes in each value — or int8, the bigint data type. They will all take the same storage, and you will be able to fit only 226 tuples into a single page. Quite straightforward.
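The same arithmetic, with the column data padded to 8 bytes, gives that 226 number:

    -- 4 (line pointer) + 24 (tuple header) + 8 (data after padding) = 36 bytes per tuple
    select (8192 - 24) / (4 + 24 + 8) as tuples_per_page;  -- 226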
And if it's a timestamp and a timestamp with time zone, which together take 16 bytes, okay, you will be able to fit fewer tuples into a page — and they might be just one row, but many versions of it, right? So now let's go to TOAST, finally. We add more and more, or, for example, we decide to have a column from the varlena family of data types. Varlena means variable length. For example: text, varchar, JSON, XML.
Previous episode.
Yeah.
Or numeric, for example, as well — because numeric is an exact number type, and it can take a lot of space. So it's also varlena.
Interesting.
But yeah,
typically this would be
a text field
or a JSON field.
That's pretty common,
isn't it?
Right.
And so they all behave similarly,
and they all involve this mechanism, TOAST,
which works implicitly, transparently.
We usually don't think about it
until we deal with some performance issues
and we need to understand what's happening under the hood.
So what is TOAST?
Yeah, let's discuss.
I was really interested in where the name came
from and I looked into the mailing
lists and I think the original
reason for calling it
Toast, so it's an acronym
but the original reason was
it's about slicing. So
if we go
back to the page size,
and let's say we're just trying to insert an 80 kilobyte blob of text.
Now, we can't fit that in an 8 kilobyte page.
I probably picked a bad example there.
Let's say a megabyte of text into an 8 kilobyte page.
Naturally, we're going to struggle,
and we've got the limitation that we can't store it over multiple pages.
So this is a strategy for slicing that data up
into multiple tuples in a separate table
and storing pointers to each one
so that we can reconstruct it.
So Postgres can reconstruct it later when we request it.
It also, by the way, includes compression. I'm sure we'll come to that, but by default it will first try to compress it, and if that's not enough, it will store it off-page. And yeah, so the name comes initially from slicing, and I think there's something funny about people kind of working backwards into acronyms, which I think is how a lot of things end up getting named. But I think the generally accepted one is The Oversized-Attribute Storage Technique.
And it had a code name of Toast or the best thing since sliced bread, which I found funny.
And I'll include the mailing list thread there for anybody interested in that.
Yeah.
Imagine TOAST with GIN — how they work together.
Yeah.
So, about the technical aspects of it: roughly, very roughly, when the total length of all the columns involved in one tuple's value approaches roughly 2,000 bytes, this mechanism starts working. There are many details inside it, and a podcast is probably not the best place to discuss everything, because there are many if-else details of how it works. But roughly, Postgres tries to fit at least four values in one page. That's actually why I wanted to bring up this topic of the maximum number. So, to have at least four values in one page, it means we need to split the value into chunks of roughly 2,000 bytes. Some data will be in the main page, but other chunks will go to a separate table, which is created automatically, called the TOAST table. For each regular table, we can have at most one TOAST table, right? But it can have many, many chunks for each value.
And this is interesting. What should we discuss about it? Let's talk.
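If you want to see which TOAST table backs a given table and how big it is, one way is to look at pg_class (the table name here is just an example):

    select c.relname                                          as main_table,
           c.reltoastrelid::regclass                          as toast_table,
           pg_size_pretty(pg_relation_size(c.reltoastrelid))  as toast_size
    from pg_class c
    where c.relname = 'my_table'
      and c.reltoastrelid <> 0;   -- only if the table actually has a TOAST table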
So there's a couple of things. One, kind of like the performance implications of that,
both good and bad. Also, I guess, when do we need to worry about it — we've got certain values, certain things that can be changed.
Should we ever change those?
So there's default compression, for example,
there's default whether this is even on or off.
There's a few things to be wary of, like the usual kind of select star,
or even whether we should store, you know, we can use Postgres'
version of this, or we could avoid storing this data in the database at all. That's a common
technique. We could store it ourselves deliberately in a separate table to avoid some of those issues.
Like, I guess we could discuss some of those pros and cons. Which direction would you like to go?
Yeah, well, first of all, as we already discussed in the previous episode about JSON, nothing can beat the narrow cases, where we have a really small number of columns and all columns are small. Even if varlena data types like text are involved, the actual values are small. Everything fits: roughly, the total value of the tuple is less than 2,000 bytes.
By the way, the easiest way to check the size — it's really interesting — you can select t1 from t1, taking the row record. You can convert it to text, and you can get the length of it. But also there are special functions — pg_column_size, I don't remember the exact names off the top of my head — so it's possible to check it more precisely. But quite often I use this trick of converting the whole record to text and just checking the length of it.
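For reference, both approaches look something like this (t1 stands for whatever table you are checking):

    select pg_column_size(t1.*) as row_bytes,    -- size of the row as a composite value
           length(t1::text)     as text_length   -- the rough "cast the record to text" trick
    from t1
    limit 10;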
So if everything is below 2,000, nothing can beat this, because TOAST is not involved, so there is no overhead. Because, of course, if you need to read the whole tuple, Postgres needs to join with the TOAST table, extract multiple records from it, and then collect everything. And it takes time, of course,
and there is overhead here.
Without Toast, it's much faster, obviously.
That's why we discussed in JSON episode,
if JSON values are below 2,000,
performance is much better.
Once we approach roughly 2,000,
we have some degradation
because Toast starts to be involved.
But the interesting thing is that there are many things done to make TOAST quite efficient in terms of performance. For example, we know the Postgres MVCC model: when you update some row, a new tuple is created, and the old tuple is also stored until autovacuum deletes it, once it sees that it's not needed anymore and no transaction can look at it. So for TOAST, it's slightly different. In TOAST, there is special versioning applied, and it's quite reduced: only inserts and deletes, and updates trigger inserts of TOAST values only if the TOAST value actually changed. If you change some different column, the same TOAST records are kept — the TOAST table is a regular table, actually, at first glance. The update happens in the main page. It can actually be a HOT update inside the same page — very efficient, not involving indexes, not causing the index write amplification problem. But in TOAST, we just reuse the same records for the new tuple in the main page, in the main table. And this is quite interesting, and this adds efficiency as well, right?
Yeah, very much so. And I read another cool thing, which is that the large values from the TOAST table are only pulled out if
they're selected at all and they're only pulled out at the time the result set is sent to the client.
Now that's super interesting for things like in memory processes.
So if we're, like for example, if we're sorting rows and we're carrying like large rows around,
that's going to take up a bit more space in memory.
And if we only have to pull out some of that larger information at the end,
chances are more of that's going to be done quicker
and with less footprint.
So it's super interesting
some of the benefits this comes with,
but some of the other things we mentioned don't.
So yeah.
This means that putting all columns into the select clause, which an ORM sometimes does, can cause issues with performance.
So you can be much more efficient if you select only those columns you really need at this moment, right?
Yeah. So I think it's slightly different. This is the typical select-star problem, isn't it? Now, that has other issues, like what happens if you add a column — select star can kind of change on you. But yeah.
Right, you have everything.
For anybody who's listening on audio, Nikolay has got a smirk on his face and he's definitely joking. Right, so what about compression?
There is automatic compression applied, which is lightweight — not producing a super high ratio, but not consuming too many CPU cycles either.
Yeah, and there's a blog post — the dbi services one we'll link to — where they tried to test out TOAST. And it looks like the first example they picked actually didn't trigger a TOAST table
as they were expecting,
or didn't trigger any chunks to be inserted
into a Toast table
because the value they were inserting,
whilst it would be two kilobytes of text,
it compressed really well
because they were just doing a repeat character,
for example.
So you're saying it was compressed before going to the TOAST table, and it could fit in the main table.
Exactly.
In fact, that's a better way of describing it.
At first, if the whole tuple, as you mentioned, looks like it's going to exceed two kilobytes, the first thing Postgres tries to do is compress as many of these TOASTable values as it can, to still store them on the page, and only if it can't will it then move them off the page into the TOAST table. So that's a really important thing that's on by default — every combination of this is possible — but yeah, by default it will try to compress. And it will, by default, use — and this is something I didn't realize until a year or two ago — Postgres's own pglz compression algorithm. And since Postgres 14, you can use lz4, is it?
You can control it. I don't remember the details, but there are options in newer Postgres versions to control this. But I would like to emphasize that you can also tell Postgres not to try to compress at all.
Because if you, for example, have some value which you know is already compressed, it would not be good in terms of efficiency to try to compress it once again.
So you can tell Postgres to use something like STORAGE EXTERNAL for that column.
In this case, it will go to TOAST immediately, without any compression attempts at all.
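For example, on Postgres 14 or newer, with made-up table and column names:

    -- choose the TOAST compression algorithm per column (Postgres 14+)
    alter table docs alter column body set compression lz4;

    -- skip compression entirely for already-compressed data: out of line, uncompressed
    alter table docs alter column image set storage external;

    -- the server-wide default is pglz unless changed
    show default_toast_compression;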
Yeah, exactly.
Really good point.
You do pay an overhead.
I think one of the benefits of LZ4 — well, with compression there's normally a trade-off, right? It's how well it compresses versus speed of compression versus speed of read. And in the benchmarks I've seen, LZ4 does come out on top,
but I guess it will depend on the exact scenario.
But as you say, if it's already compressed,
it doesn't matter what algorithm you're going to use,
there's going to be overhead for no real benefit.
Right. I'm not a big compression expert, but of course there are many recent advances in this area — not only Pied Piper and so on. For example, Brotli: we use Brotli quite often for WAL-G compression of WALs when we do backups. But I'd like to say it's good that we have more controls. It's very good.
But recently we had a brief, small word battle with the CTO of VictoriaMetrics.
And we discussed the number of rows versus terabytes of data.
For example, one terabyte for me is always something like one billion rows, roughly.
And for them, one billion rows is roughly 1 gigabyte of data.
So it's like three orders of magnitude difference.
Why?
Because they talk about time series often.
For example, Victoria Metrics, right?
ClickHouse — column stores.
And they live in a very different world.
Unfortunately, with Postgres, we still don't have a really good, battle-proven column store.
There are attempts.
There is ongoing work.
Several teams working on it, we know.
But when we talk about compression,
it really shines when we have column store.
Yeah, because the data is more similar, right?
Right, right, right.
Because when you try to compress values in a row,
not involving the context of other values in the same column,
you're in a worse position.
So the compression topic,
actually Peter Zaitsev's interview happened this week on Monday
on Postgres TV, a very good chat it was.
This is his idea that you
should consider compression
and expect good results only if
you combine this topic with
storage engine topic.
And for column storage,
it's much better. Or like Timescale,
they are very smart. They still
work in the row store, but
they apply compression looking at columns. There's an interesting article — we've already discussed it a few times, so it's worth referencing one more time. But again, in some cases, having some compression in place is good. And it's good to have more controls
in Postgres 15. But I hope in future...
14.
14, sorry. I hope in future Postgres versions we will have, in the core product, a column store storage engine providing the same level of compression — maybe not 1,000x, but 100x. And in that case we will also say, okay, 1 billion rows is not 1 terabyte anymore.
It's like 10 gigabytes, maybe.
Should we talk about some of the limits of Toast?
Like what's the maximum size?
1 gigabyte for one value?
I don't remember.
The limits are quite good, but sometimes we hit them.
And for me, the biggest limitation is that TOAST is not applied to indexes. There is compression there — I don't remember the details — but I remember you cannot index a really huge value of 100 megabytes. You cannot index it. You need to hash it first, maybe, and index only some expression. So this is one of the biggest limitations, because full-fledged TOASTing is applied only to the heap, right?
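A common workaround looks something like this (the table and column names are made up; md5 is just one possible hashing expression):

    -- index a hash of the huge value instead of the value itself
    create index on docs (md5(body));

    -- equality lookups then have to use the same expression
    select id from docs where md5(body) = md5($1);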
One gigabyte per value is another limitation. What else? I've read a couple more in the docs, but I don't think
you're likely to hit them. So for example, you can't have more than 4 billion toasted values in a single table,
which would be a lot given they all have to be over a certain size.
You know, we're talking about terabytes of data.
Well, I can easily imagine that we can hit it.
But before that, we probably will face other performance issues.
So everyone with a table more than 100 gigabytes should already use partitioning.
This is another benefit of partitioning.
And then one more kind of slightly hypothetical one.
Please don't do this, but you could, I think in theory,
you could only have about 450 wide columns in a table.
How many?
450 columns that are TOASTable, or something.
It's a lot. It's a lot.
It's a lot, but still I already
saw cases when it was achieved.
So yeah, pretty big numbers,
but things to be aware of if you're
working in an extreme case for some reason.
Right, right.
And if you care
about storage
efficiency, my advice
is to learn about column Tetris in Postgres.
This is quite a common term already.
And usually there are two approaches.
Either start from big and go to small, or vice versa.
I prefer starting from small columns first. For example, all 1-byte and 2-byte columns go first, then 4-byte columns, 8-byte columns, 16-byte columns, and then all varlena columns go last.
In this case, you probably will have fewer zeros due to alignment padding and spend fewer
bytes, maybe sometimes gigabytes, actually, or hundreds of gigabytes if you have big data
volumes for storage. And it's not only for storage.
It's also for the amount of RAM,
both for shared buffers and page cache,
and also for backups and so on and so on.
Sometimes the benefits of applying
better column Tetris tactics are good.
And TOAST, yeah — varlena and TOASTed columns should go last, and in this case it's good. If you mix them, this is the worst case: a regular integer, some text, a regular integer, some text. In this case, you will have a lot of zeros due to alignment padding.
I read a good
post by Braintree, who I think built something to help with this — it's almost like repacking a table.
For example, you can sort out a column ordering
that makes sense initially,
but then what happens if you add a couple of columns?
Does the order still make sense?
For large enough tables,
they found it could be worth changing it.
So that's a really good...
I also built something for this. I have a small report — actually, it's not small, it's a recursive CTE — analyzing this for any table and proposing a better column order. It's in my postgres_dba tool, a quite old, ad hoc tool for analyzing various aspects of a Postgres cluster.
Nice.
Yeah.
And some people use it, I think.
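A small, made-up illustration of the idea — fixed-width columns grouped together, varlena columns last:

    -- wasteful ordering: small and large fixed-width columns interleaved
    create table events_padded (
        is_active  boolean,       -- 1 byte, then 7 bytes of padding before the next column
        created_at timestamptz,   -- 8 bytes
        flag       smallint,      -- 2 bytes, then 6 bytes of padding
        user_id    bigint,        -- 8 bytes
        note       text
    );

    -- tighter ordering: fixed-width columns grouped by size, varlena last
    create table events_packed (
        created_at timestamptz,
        user_id    bigint,
        flag       smallint,
        is_active  boolean,
        note       text           -- varlena columns go last
    );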
There's one more blog post I wanted to mention before,
kind of on this topic, but kind of not,
is by Haki Benita. I've mentioned it
in a couple of episodes, but
this is almost a side effect of the fact
we have Toast, is he noticed
that there was some kind of medium
sized text. So he defined small text
as being like integers, timestamps,
small numbers of bytes per
column, and then large text being kind of like the blobs we're talking about here,
you know, whole HTML pages or larger things that, you know,
are automatically going to be quite big and therefore off the page.
He mentioned there was, he calls it, a surprising impact of medium-sized text.
So kind of still text fields or, you know, it might be a comment on a blog
or something could get quite long, could get into the multiple kilobytes even, or just about still fits on the page.
So it's below that threshold, but we're getting towards two kilobytes each. Suddenly, if you're doing reads of multiple pages — if you're doing scans — you could be having to read quite a lot of data for the amount you're returning. If you're not returning those columns — if you're not returning, for example, the comments, because you don't need them for the query you're doing — you're still having to go through quite a lot of pages. So efficiency-wise, sometimes you might want to deliberately store those yourself in a different table, so that you can fit much more data on the pages that you are scanning more frequently.
Yeah, it's really interesting.
Yeah, sometimes a one-to-one relationship might seem strange, but it makes sense. So one table is for very compact storage
and another for bigger values.
Yeah, exactly.
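A sketch of that kind of one-to-one split, with a made-up schema:

    create table posts (
        id         bigint primary key,
        author_id  bigint not null,
        created_at timestamptz not null
        -- other small, frequently scanned columns
    );

    create table post_bodies (
        post_id bigint primary key references posts (id),
        body    text not null     -- the medium or large text lives here
    );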
But in this case, you should always remember that there are no indexes spanning two tables, so you lose some index-only scan or index scan opportunities, for example.
Right.
Yeah.
Was there anything else you wanted to cover on this one?
No, no.
That's it, I think, right?
I think so. Thank you everyone for joining us. Thank you, Nikolay. See you next week.
Thank you. Bye-bye.