Postgres FM - TOAST

Episode Date: March 3, 2023

Nikolay and Michael discuss TOAST (The Oversized-Attribute Storage Technique) — what it is, how it works, and some general things to be aware of.

Here are links to a few things we mentioned:
- TOAST docs
- TOAST wiki
- Hussein Nasser on rows per page (Twitter)
- Toasting in action (dbi services blog)
- Interview with Peter Zaitsev (Postgres TV)
- Building columnar compression in a row-oriented database (Timescale blog post)
- The Surprising Impact of Medium-Size Texts on PostgreSQL Performance (blog post by Haki Benita)
- PostgreSQL at Scale: Saving Space Basically for Free (blog post by Braintree on column Tetris)
- postgres_dba alignment padding query

What did you like or not like? What should we discuss next time? Let us know by tweeting us on @samokhvalov / @michristofides / @PostgresFM, or by commenting on our Google doc. If you would like to share this episode, here's a good link (and thank you!)

Postgres FM is brought to you by:
- Nikolay Samokhvalov, founder of Postgres.ai
- Michael Christofides, founder of pgMustard

With special thanks to:
- Jessie Draws for the amazing artwork

Transcript
Starting point is 00:00:00 Hello, and welcome to Postgres FM, a weekly show about all things PostgreSQL. I'm Michael, founder of pgMustard, and this is my co-host Nikolay, founder of Postgres.AI. Hey Nikolay, what are we talking about today? Your choice, and it's TOAST, yet another boring topic. I think it's the tastiest breakfast snack, so I thought you might actually want to talk about it, being, you know, food-related. Well, okay, I agree that this is something that everyone should learn, at least in some detail. Maybe not
Starting point is 00:00:29 deep, but, I mean, all developers should understand how Postgres stores data and why the limits are as generous as they are. I mean, we can store up to one gigabyte of data in a single column of a single row, a single cell, if you speak a simple language.
Starting point is 00:00:45 And it's great. So you can store really large values, usually. Yeah, this is one of my favorite types of Postgres feature, in that it's on by default. It does something that most people never have to worry about. Transparent is the correct word. Exactly, transparent and on by default. It helps in so many ways that you might never have thought about, or you might only ever notice it when you hit a limit. I love this kind of thing because you can go through life never knowing about it and it's still helping you out. Limited toast. Toast to everyone, almost unlimited. Yeah, so should we start with what we're
Starting point is 00:01:22 even talking about? I guess for anybody new to this, we're talking about how Postgres, well, it's based on a couple of Postgres limitations. Specifically, you can't store a row across multiple pages, and the default page size is 8 kilobytes and... Kibibytes. Kibibytes, sure, yeah. And there's one other thing that is specifically limited, I think, a two kilobyte limit
Starting point is 00:01:48 so that it can store at least four rows or tuples per page. If we only had that, we'd only be able to store small and medium values in Postgres, right? But we want to be able to potentially store sometimes larger or relatively large pieces of data from time to time. Yeah, let's warm up with a small topic. It's not about Toast.cl, but it's a good inter-topic. Recently, a very popular YouTuber and blogger, Hussein Nasser,
Starting point is 00:02:20 sorry, I always pronounce other names wrong, but they pay me for the same because my last name is very difficult to pronounce. So it was a great couple of posts on Twitter, LinkedIn. How many rows can we ensure? How many tuples, actually? Let's speak correct language here because we are on Postgres.fm podcast, right? Not rows, but tuples, or tuples as you prefer. How many of them can fit into one single page, which is, as you said, by default, eight kibibytes? And it's a great
Starting point is 00:02:54 question to understand the internal layout. There is a page header, and it's some static amount of bytes, number of bytes is 24 bytes in each page is allocated for page header. Then from both directions, we fill page with some data. From the beginning, we fill it with pointers to tuples, pointers to entries, right? And each one is only four bytes. So each tuple in page, the pointer takes four bytes and they start from the beginning. But real data comes from the end of the page, right?
Starting point is 00:03:30 And of course, the size there depends on the data itself. But each tuple there always has a header as well, tuple header. And it's 23 bytes, but due to alignment padding, it's padded to 24 bytes. So alignment padding is for performance reasons. Data should be fetched in like so-called words of eight bytes. That's why 24, not 23. And it means that we can fit a few hundred of tuples, right? But interesting point that you can create a table without any columns at all, right? That's interesting. So you can create table, parenthesis, that's it.
Starting point is 00:04:12 Create table T1, for example, parenthesis, that's it, semicolon. Postgres allows it. Of course, it's not like standard thing, but Postgres allows it. Why not? We will always have internal, like, virtual system columns, ctid, xmin, xmax, as we discussed a few times. So you can select
Starting point is 00:04:32 ctid, and it will provide page number and offset there. Offset, this is exactly this pointer. And so we can start inserting, insert into t1, select. Again, select in Postgres, select clause allows
Starting point is 00:04:46 empty list of columns as well. It's some kind of trick. And this is how you can reach a little bit more tuples inserted to page, 291 tuples. And the overhead is huge, actually, in this case.
Starting point is 00:05:01 But I would like to say it's not 100% because number of columns, the number of rows or tuples we have, it's also some knowledge, right? But almost 100% wasted for overhead for various headers and pointers. Yeah, so we're looking, I guess,
Starting point is 00:05:19 that's the theoretical maximum number of tuples. Practical as well. I did it, but nobody wants them to... But it's not useful, right? In some testing, maybe. In some testing, sometimes it's useful. But it does mean that in practice, we'll only ever have a maximum of a few hundred rows per page,
Starting point is 00:05:42 even if you're only storing very, very small amounts of data. But I guess we're talking about the opposite in this case. We're talking about the other end of the limit. Yes, this is extreme. I just wanted to give a wider understanding of what we talk about. This is an extreme situation when we have no data at all in the table, and we can have 291 tuples. Then we start having some column,
Starting point is 00:06:08 for example. An interesting thing, it doesn't matter. Like, smallest column probably will be one byte. If you take, for example, boolean, boolean is one byte. Already a lot of waste, right? Because it's not bit, it's byte. But this is how the world we live in. Like, we need this sort the world we live in. This is all of overhead and relational approach.
Starting point is 00:06:30 Of course, if you need to store some bits, probably you need to create some additional storage or pack them to bytes and store somehow differently. But take, for example, one byte data type. Again, it will be padded till eight bytes and you will be able to store only 226 rows because of this overhead immediately. The same for regular integer, which is 4 byte. It will be padded with four zeros in each value. Or integer 8 or bigint data type. They all will take the same storage and you will be able to fit only 226 tuples into a single
Starting point is 00:07:08 page. Quite straightforward. And if it's timestamp of timestamp with timezone, which both take 16 bytes, okay, you will have less, you will be able to fit fewer tuples into a page
Starting point is 00:07:24 because they might be just one row, but many versions of it, right? So now let's go to Toast now, finally, right? We add more and more, or for example, we decide to have a column from Varlina family
Starting point is 00:07:39 of data types. Varlina means variable length. For example, text, Varchar, JSON, XML. Freelina means variable length. For example, text, varchar, JSON, XML. Previous episode. Yeah.
Starting point is 00:07:49 Or numeric, for example, as well. Because numeric is exact number and it can take a lot of data. So it's also Varlina.
Starting point is 00:07:58 Interesting. But yeah, typically this would be a text field or a JSON field. That's pretty common, isn't it? Right. right.
Starting point is 00:08:06 And so they all behave similarly, and they all involve this mechanism, TOST, which works implicitly, transparently. We usually don't think about it until we deal with some performance issues and we need to understand what's happening under the hood. So what is TOST? Yeah, let's discuss.
Starting point is 00:08:28 I was really interested in where the name came from and I looked into the mailing lists and I think the original reason for calling it Toast, so it's an acronym but the original reason was it's about slicing. So if we go
Starting point is 00:08:44 back to the page size, and let's say we're just trying to insert an 80 kilobyte blob of text. Now, to fit that, we can't fit that in an 8 kilobyte. I probably picked a bad format there. Let's say a megabyte of text into an 8 kilobyte page. Naturally, we're going to struggle, and we've got the limitation that we can't store it over multiple pages. So this is a strategy for slicing that data up
Starting point is 00:09:12 into multiple tuples in a separate table and storing pointers to each one so that we can reconstruct it. So Postgres can reconstruct it later when we request it. It also, by the way, includes compression. will i'm sure we'll come to that but by default we'll first try and compress it and if not we'll store it off page and yeah so the name comes initially from slicing and i think there's some funny like people kind of working backwards into acronyms which i think is how a lot of things end up getting named. But I think the generally accepted one is the oversized attribute storage technique. And it had a code name of Toast or the best thing since sliced bread, which I found funny.
Starting point is 00:09:56 And I'll include the mailing list thread there for anybody interested in that. Yeah. Imagine Toast with gin, how they work together. Yeah. Imagine Toast with Jim, how they work together. Yeah. So about technical aspects of it, roughly, very roughly, when our total length of all
Starting point is 00:10:13 columns involved, one tuple value approaches roughly 2,000 bytes, this mechanism starts working. There are many details inside it. Podcasts are probably not the best place to discuss everything because there are many if, else, and so on details how it works.
Starting point is 00:10:34 But roughly, Postgres tries to fit at least four values in one page. That's why, actually, I wanted to bring this topic of the maximum number. But let's have at least four values in one page. It means we need to shrink value to chunks roughly 2000 bytes. And some data will be in the main page, but other chunks will go to a separate table, which automatically created called Toast table. For each regular table, we can have up to only one Toast table, right? But it can have many, many chunks for each value. And this is interesting. What should we discuss about it? Let's talk.
Starting point is 00:11:17 So there's a couple of things. One, kind of like the performance implications of that, both good and bad. Also, I guess, when do we need to worry about, we've got certain values that can be, certain things that can be changed. Should we ever change those? So there's default compression, for example, there's default whether this is even on or off. There's a few things to be wary of, like the usual kind of select star, or even whether we should store, you know, we can use Postgres'
Starting point is 00:11:45 version of this, or we could avoid storing this data in the database at all. That's a common technique. We could store it ourselves deliberately in a separate table to avoid some of those issues. Like, I guess we could discuss some of those pros and cons. Which direction would you like to go? Yeah, well, first of all, as we already discussed on previous episode about JSON, nothing can beat narrow cases when we have really small number of columns. All columns are small.
Starting point is 00:12:15 Even if Varlina data types involved text, the actual values are small. Everything fits roughly. Total values of tuple is less than 2,000 bytes. By the way, the easiest way to check the size, it's really interesting. You can select T1 from T1, taking row record.
Starting point is 00:12:37 You can convert it to text. And you can get length of it. But also there are special functions, pg, column size, I don't remember by name on top of my head, but it's possible also to check it more wisely.
Starting point is 00:12:54 But quite often I use this trick to converting whole record to text and just checking the length of it. So if everything below 2000, nothing can beat this, because because toast is not involved so there is no overhead because of course if you need to read whole tuple you need to join with like postgres needs to join with toast table and multiple records get and extract multiple
Starting point is 00:13:21 records from toast table and then collect everything. And it takes time, of course, and there is overhead here. Without Toast, it's much faster, obviously. That's why we discussed in JSON episode, if JSON values are below 2,000, performance is much better. Once we approach roughly 2,000, we have some degradation
Starting point is 00:13:43 because Toast starts to be involved. But interesting thing that it's quite, there are many things we've done to make Toast quite efficient in terms of performance. For example, we know MVCC Postgres model, when you change, when you update some row, a new tuple is created. And the old tuple is stored also until Autowacom deletes it, seeing that it's not needed anymore, not all transactions can look at it. So for Toast, it's slightly different. So in Toast,
Starting point is 00:14:21 there is special versioning applied, and it's quite reduced. So only inserts and deletes, and updates trigger inserts of Toast values only if Toast value actually changed. If you change some different column, the same Toast records, Toast is regular table, actually, right? From first glance. But update happens in the main page. It can be hot update actually inside this same page,
Starting point is 00:14:53 very efficient, not involving index, not causing index amplification problem. But Toast, we just reuse the same records for new actual tuple in main page right in main table and this is quite interesting and this adds efficiency as well right yeah very much so and i read another cool thing which is that the the large values in the from the toast table are only pulled out if they're selected at all and they're only pulled out at the time the result set is sent to the client. Now that's super interesting for things like in memory processes.
Starting point is 00:15:32 So if we're, like for example, if we're sorting rows and we're carrying like large rows around, that's going to take up a bit more space in memory. And if we only have to pull out some of that larger information at the end, chances are more of that's going to be done quicker and with less footprint. So it's super interesting some of the benefits this comes with, but some of the other things we mentioned don't.
Starting point is 00:15:56 So yeah. This means putting all columns into select clouds, which sometimes ORM can do. It can cause issue with performance. So you can be much more efficient if you select only those columns you really need at this moment, right? Yeah. So I think it's slightly different. Yeah. So this is like a select, this is a typical select star from problem, isn't it? Now that has other issues like if what what happens if you add a column select star is um kind of can change on you but yeah right you have everything for anybody who's listening on audio khali has got a smirk on his face and he's definitely joking right so so what
Starting point is 00:16:43 about compression there is automatic compression applied, which is lightweight, not super high ratio, producing not super high ratio, but doesn't consume too many CPU cycles. that he tried to test out Toast. And it looks like the first example they picked actually didn't trigger a Toast table as they were expecting, or didn't trigger any chunks to be inserted into a Toast table because the value they were inserting, whilst it would be two kilobytes of text, it compressed really well because they were just doing a repeat character,
Starting point is 00:17:18 for example. So you're saying it was compressed before going to Toast tables and it could fit main table. Exactly. In fact, that's a better way of describing it. At first, if the whole tuple, as you mentioned, looks like it's going to exceed two kilobytes, the first thing that Postgres tries to do is compress as many of these toastable values as it can to store them on the
Starting point is 00:17:47 page still and only if it can't will it then move them off the page into a into the toast table so that's a really important thing that's on by default that can be like every combination of this is possible but yeah by default it will try and compress. And it will, by default, use, and this is something I didn't realize until a year or two ago, it will use a Postgres, a pglz compression algorithm. And since Postgres 14, you can use lz4, is it? You can control. I don't remember, but there are options in new Postgres for options to control. But I would like to emphasize that you also can control and tell Postgres not to try to
Starting point is 00:18:29 compress. Because if you, for example, have some value which you know already compressed, it would be not good in terms of efficiency to try to compress it once again. So you can tell Postgres to have something like storage external. In this case, it would go to Tofst immediately without compression attempts at all. Yeah, exactly. Really good point. You do pan overhead.
Starting point is 00:18:52 Like I think, I think one of the benefits of LZ4, I think it's, um, well with compression, there's normally like a trade off, right? It's how well it compresses like versus speed of compression versus speed of read. And the benchmarks I've seen, LZ4 does come out on top, but I guess it will depend on the exact scenario. But as you say, if it's already compressed, it doesn't matter what algorithm you're going to use, there's going to be overhead for no real benefit. Right. I'm not a big compression expert,
Starting point is 00:19:22 but of course there are many recent advances in this area, not only Pied Piper and so on. And it was like, for example, Broadly, we use Broadly quite often for wall G compression of walls when we do backups. But I'd like to say it's good that we have more controls. It's very good. But recently we had a brief, small, like, world battle with CTO of Victoria Metrics. And we discussed number of rows versus terabytes of data. For example, one terabyte for me is always something like one billion of rows, roughly. And for them, it's roughly one billion of rows is 1 gigabyte of data. So it's like three orders of magnitude difference. Why? Because they talk about time series often.
Starting point is 00:20:16 For example, Victoria Metrics, right? ClickHouse, ColumnStore. And they live in a very different world. Unfortunately, with Postgres, we still don't have quite good battle-proven column store. There are attempts. There is ongoing work. Several teams working on it, we know. But when we talk about compression,
Starting point is 00:20:34 it really shines when we have column store. Yeah, because the data is more similar, right? Right, right, right. Because when you try to compress values in raw, not involving the context of other values in the same column, you're in a worse position. So the compression topic, actually Peter Zaitsev's interview happened this week on Monday
Starting point is 00:21:00 on Postgres TV, a very good chat it was. This is his idea that you should consider compression and expect good results only if you combine this topic with storage engine topic. And for column storage, it's much better. Or like Timescale,
Starting point is 00:21:17 they are very smart. They still work in the row store, but they apply compression looking at column. It's an interesting article we already discussed it a few times. So worth referencing once more time. But again, in some cases, some compression in place, it's good. And it's good to have more controls in Postgres 15. But I hope in future... 14.
Starting point is 00:21:44 14, sorry. I hope in future Postgres versions we will have in core product we will have column store storage engine providing the same level of compression like maybe not 1000x but 100x and in this case
Starting point is 00:22:00 we will also say okay 1 billion rows it's not 1 terabyte anymore. It's like 10 gigabytes, maybe. Should we talk about some of the limits of Toast? Like what's the maximum size? 1 gigabyte for one value? I don't remember. The limits are quite good, but sometimes we hit them.
Starting point is 00:22:21 And for me, the biggest limitation is that Toast is not applied. Like for indexes, it's not applied that toast is not applied. For indexes, it's not applied. There is compression there. I don't remember details, but I remember you cannot index a really huge value of 100 megabytes. You cannot index it. You need to first maybe hash it and index only
Starting point is 00:22:40 some expression. So this is one of the biggest limitations because in toasting, it's applied only to hit, right? This the biggest limitations because in toasting it's applied only to heat, right? This full-fledged toasting. So this is one limitation. One gigabyte for values, another limitation. What else? I've read a couple more in the docs, but I don't think you're likely to hit them. So for example, you can't have more than 4 billion toasted values in a single table, which would be a lot given they all have to be over a certain size. You know, we're talking about terabytes of data.
Starting point is 00:23:13 Well, I can easily imagine that we can hit it. But before that, we probably will face other performance issues. So everyone with a table more than 100 gigabytes should already use partitioning. This is another benefit of partitioning. And then one more kind of slightly hypothetical one. Please don't do this, but you could, I think in theory, you could only have about 450 wide columns in a table. How many?
Starting point is 00:23:40 450 columns that are tasteable or something. It's a lot. It's a lot. It's a lot, but still I already saw cases when it was achieved. So yeah, pretty big numbers, but things to be aware of if you're working in an extreme case for some reason. Right, right.
Starting point is 00:23:57 And if you care about storage efficiency, my advice is to learn about column Tetris in Postgres. This is quite a common term already. And usually there are two approaches. Either start from big and go to small, or vice versa. I prefer starting from small columns first.
Starting point is 00:24:17 For example, all 1-byte, 2-byte columns go first. Then we have 4-byte columns, 8-byte columns, 16. And then all Varlin columns go first, then we have 4 byte columns, 8 byte columns, 16, and then all Varlina columns go last. In this case, you probably will have fewer zeros due to alignment padding and spend fewer bytes, maybe sometimes gigabytes, actually, or hundreds of gigabytes if you have big data volumes for storage. And it's not only for storage. It's also for the amount of RAM, both for shared buffers and page cache,
Starting point is 00:24:52 and also for backups and so on and so on. Sometimes the benefits of applying better column Tetris tactics are good. And Toast, yeah, like Varlin and Toast should go last. And in this case, it's good. If you mix them like Varlin or Toast should go last, and in this case it's good. If you mix them, this is the worst case. Irregular integer,
Starting point is 00:25:10 some text. Regular integer, some text. In this case, you will have a lot of zeros due to alignment padding. I read a good post by Braintree, who I think built a they built something to help kind of, it's almost like repacking a table.
Starting point is 00:25:27 For example, you can sort out a column ordering that makes sense initially, but then what happens if you add a couple of columns? Does the order still make sense? For large enough tables, they found it could be worth changing it. So that's a really good... I also build it.
Starting point is 00:25:40 I have a small report. Actually, it's not small. It's recursive CTE with recursive analyzing this for any table and proposing better order. It's in my Postgres DBA, quite old tool, like ad hoc tool to analyze some Postgres cluster aspects. Nice. Yeah. And some people use it, I think. There's one more blog post I wanted to mention before,
Starting point is 00:26:05 kind of on this topic, but kind of not, is by Haki Benito. I've mentioned it in a couple of episodes, but this is almost a side effect of the fact we have Toast, is he noticed that there was some kind of medium sized text. So he defined small text as being like integers, timestamps,
Starting point is 00:26:22 small numbers of bytes per column, and then large text being kind of like the blobs we're talking about here, you know, whole HTML pages or larger things that, you know, are automatically going to be quite big and therefore off the page. He mentioned there was, he calls it, a surprising impact of medium-sized text. So kind of still text fields or, you know, it might be a comment on a blog or something could get quite long, could get into the multiple kilobytes even, or just about still fits on the page. So it's below that threshold we were're getting towards two kilobytes each suddenly if you're
Starting point is 00:27:05 doing reads of multiple pages if you're doing kind of scans you could be having to read quite a lot of data for the amount you're returning if you're not returning those columns if you're not returning for example the comments you don't need them for the query you're doing you're still having to go through quite a lot of pages. So efficiency wise, those sometimes you might want to deliberately store yourself in a different table, so that you can fit much more data on the pages that you are scanning more frequent. Yeah, it's really interesting. Sanyam Bhutaniyaraman Yeah, sometimes one one one to one relationship, like might seem strange, but it makes sense. So one table is for very compact storage and another for bigger values. Yeah, exactly.
Starting point is 00:27:49 But in this case, you always should remember that there are no indexes on top of two tables. So you lose some index-only scan, for example, or index scan opportunities. Right. Yeah. Was there anything else you wanted to cover on this one? No, no.
Starting point is 00:28:04 That's it, I think, right? I think so. Thank you everyone for joining us. Thank you, Nikolai. See you next week. Thank you. Bye-bye.
