Postgres FM - JSON

Episode Date: February 24, 2023

Nikolay and Michael discuss JSON — our options for storing it in Postgres, whether or when we should, as well as a brief intro and some tips for the JSON functions available.

Here are links to a few things we mentioned:

- hstore
- XML type
- XML functions
- JSON types
- JSON functions
- JSONB indexing
- NULLS episode
- Why Postgres is popular episode
- PostgreSQL 12 release notes
- What's New in SQL:2016 (blog post by Markus Winand)
- SQL/JSON is postponed (blog post by depesz)
- JSON[b] Roadmap (talk by Oleg Bartunov)
- Slides, with benchmarks
- RUM access method
- JSON in PostgreSQL: how to use it right (blog post by Laurenz Albe from Cybertec)
- pg_jsonschema
- TOAST_TUPLE_TARGET and TOAST_TUPLE_THRESHOLD
- The Surprising Impact of Medium-Size Texts on PostgreSQL Performance (blog post by Haki Benita)
- Aggregate functions
- How to store and index json data (blog post by ScaleGrid)
- When to avoid JSONB (blog post by Heap)
- FerretDB

What did you like or not like? What should we discuss next time? Let us know by tweeting us on @samokhvalov / @michristofides / @PostgresFM, or by commenting on our Google doc. If you would like to share this episode, here's a good link (and thank you!)

Postgres FM is brought to you by Nikolay Samokhvalov, founder of Postgres.ai, and Michael Christofides, founder of pgMustard, with special thanks to Jessie Draws for the amazing artwork.

Transcript
Starting point is 00:00:00 Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL. I'm Michael, founder of pgMustard, and this is my co-host Nikolay, founder of Postgres.ai. Hey Nikolay, what are we talking about today? Hi Michael, JSONB. Oh, we're going straight for B. Yeah. B is better. Yeah, I found that funny. Is that semi-official or is that just a joke?
Starting point is 00:00:20 Well, I remember when it was presented, it was discussed: binary, better, nobody knows. Maybe B like number two, A and B. But no, I've never seen that discussed. I've never heard that explanation, so I doubt it. Yeah. So, I mean, specifically, I guess we're talking about storing JSON in Postgres, the options for doing so, and how you can then use it. We've had a couple of requests for similar things in this area. So thank you to the people that have requested those. Where did you want to start?
Starting point is 00:00:53 Well, I would start with a discussion of first normal form and atomic values versus non-atomic values, and whether to consider arrays, for example, as atomic. And we definitely should touch on history and hstore. Well, arrays have been supported for quite a long time in Postgres, both numeric and other types of arrays, like text arrays and so on, and multidimensional arrays, actually. Then hstore, I think it appeared first in 2004,
Starting point is 00:01:21 very long ago, almost 20 years ago. And hstore is like key-value, and it doesn't support nesting, but it's already very flexible. So arrays and hstore, in some people's understanding, already break first normal form, because these values are not atomic. But actually, we can argue that, for example, timestamp is not atomic as well. So let's leave this discussion for the theory part and focus on the practice part, right? In practice, I would definitely say these values are very convenient and they are not as evil as NULLs. They can save you time
Starting point is 00:02:07 without bringing too many issues in terms of unexpected behavior. And yes, there are some cases where you get unexpected behavior. And the most interesting thing, I remember, like 15 years ago I had a talk in Maryland. It was my first visit to the US.
Starting point is 00:02:23 And the talk was about how Postgres is good for modern web applications. At the time, Web 2.0 was a thing, so my talk was called something like Postgres is good for Web 2.0. I did some benchmarks for EAV versus hstore or arrays. It doesn't actually matter in this case. JSON didn't exist at that time yet. And actually, I was also working on XML. It's yet another data type which you can say breaks first normal form, but also provides some flexibility in some cases.
Starting point is 00:02:58 I think currently it's not the most popular format, but still, there is some development ongoing. And the thing is that EAV, entity-attribute-value, is very flexible: you have three tables with foreign keys between them, optional foreign keys. Foreign keys are not the biggest problem there. The biggest problem is that you need to join extensively. So, some entity, some attribute, and an additional connection table between them which stores the values. Basically, it's like your document: if you originally had an XML or JSON document, you shrink it into three tables. It's flexible, it supports a lot of possible documents, but to construct the document back, you need to join all of them, right? And one thing we have is a storage
Starting point is 00:03:53 overhead issue. Usually a lot more bytes are wasted on storage compared to, for example, a text representation or a JSON representation. And the second thing is that you need to join everything together all the time. And, as my very simple explanation to new Postgres users goes, you cannot have an index on two tables. This means that for best performance you usually want to have everything in one table. Joins are good, Postgres can be very performant with joins, but not always. Nothing can beat an index-only scan, for example, right?
Starting point is 00:04:36 Yeah. And just to clarify, when you're saying you can't have an index on two tables, we mean that if you want a multi-column index, you can't pick two columns that are from separate tables. Right. So the EAV approach is, I would say, a radical normalization approach: you normalize everything, you have IDs everywhere, and then you need to index, and then you need to rely on join algorithms, of which we have three, right? And compared to a denormalized state, where you probably repeat some keys and so on, you usually have worse performance.
Starting point is 00:05:13 So, both a storage overhead and worse performance to extract the whole document. And it usually means you need to avoid EAV. In most cases, yeah. I think it's even worth saying the developer experience is worse, right? Even to get the simplest data out, if you're having to do multiple joins, it's just more complex as well. But yeah, I agree that those two are far worse.
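To make the comparison concrete, here is a minimal sketch of the two approaches; all table and column names are made up for illustration:

    -- EAV: three tables, flexible but join-heavy.
    CREATE TABLE entities      (entity_id    bigint PRIMARY KEY);
    CREATE TABLE attributes    (attribute_id bigint PRIMARY KEY, name text UNIQUE NOT NULL);
    CREATE TABLE entity_values (
        entity_id    bigint REFERENCES entities,
        attribute_id bigint REFERENCES attributes,
        value        text,
        PRIMARY KEY (entity_id, attribute_id)
    );

    -- Reconstructing one "document" means joining and aggregating every time:
    SELECT e.entity_id, jsonb_object_agg(a.name, v.value) AS doc
    FROM entities e
    JOIN entity_values v ON v.entity_id = e.entity_id
    JOIN attributes a    ON a.attribute_id = v.attribute_id
    GROUP BY e.entity_id;

    -- Denormalized alternative: one table, one flexible column, indexable with GIN.
    CREATE TABLE documents (
        entity_id bigint PRIMARY KEY,
        doc       jsonb NOT NULL
    );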
Starting point is 00:05:32 Well, developers specifically: I've observed several teams who loved the flexibility. It's like with star schema for analytical databases. They love the flexibility, they can quite quickly develop an abstraction in their language of choice and
Starting point is 00:05:53 be very flexible. Very good, but then at some point, when we have billions of rows, we can already feel trouble with this approach. So my talk 15 years ago was trying to prove that hstore or XML or arrays are better. So you need to go the denormalized way and rely on the flexible indexing capabilities Postgres provides. For example,
Starting point is 00:06:20 GIN is a very flexible way to index complex data types. And then JSON appeared, I think, in 2014, 2013. I've written it down, yeah. 9.2 was 2012, but that was JSON without the B, the less good version. Syntax-sugar JSON, let's call it that. Yeah, well, actually, if we're sticking with history, should we give a quick rundown of what that included?
Starting point is 00:06:47 So we could store JSON. It did some validation, but it was just a blob. It was like a string, basically. Text, yes. But why was that? It was a success even before JSONB because, by that time, JSON had already become the de facto standard
Starting point is 00:07:06 for web applications and mobile applications. Most pieces of systems started to communicate using JSON. And it was obvious that it would be great to support it in the database. It also gave birth to this movement: let's eliminate middleware, or allow maybe some very thin middleware
Starting point is 00:07:28 like PostgREST, we had an episode about it, so the front end can communicate with the data store, the Postgres database, through some very thin middleware which doesn't have a lot of logic. A lot of logic goes, for example, to a React app, to client code or a mobile app. A lot of logic is there. And the data logic goes closer to the database. It also gave another wave of interest to database-side programming using PL/pgSQL or PL/Python and so on. So this is interesting. Indeed, it started roughly 10 years ago. Yeah. Wow. That's
Starting point is 00:08:08 longer than I would have guessed, actually. Yeah. And then a few years later, JSONB was implemented, and it has indexing support. It's a binary format which stores keys in ordered form internally, and it supports flexible indexing capabilities. Well, for JSON you can also index using B-tree and a path. So you can use B-tree indexing for some path expressions.
Starting point is 00:08:37 So you can create an index, an expression index, but the expression should return scalar values, right? It cannot return the whole value, or you'd need to convert it to text, which would not be very good. Basically, this index is meaningful only if your query has the same expression as well, right? And sometimes it's good, because B-tree is very efficient, and you can use it in some cases. But for JSONB, you can use a GIN index with some modifiers, like jsonb pattern ops.
Starting point is 00:09:13 And I don't remember, but there is a way to... jsonb_path_ops, maybe? Path ops, yes, yes. So you can say: I don't need to remember anything, I just need keys and values. And in this case, only limited expressions will be supported, but the index size will be smaller. So you have a choice.
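To make those indexing options concrete, here is a minimal sketch; the table and column names are made up for illustration:

    -- A hypothetical table with a flexible JSONB column.
    CREATE TABLE events (
        id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        created_at timestamptz NOT NULL DEFAULT now(),
        data       jsonb NOT NULL
    );

    -- B-tree expression index: good for equality/range lookups on one extracted scalar;
    -- the query has to use the same expression to benefit.
    CREATE INDEX ON events ((data ->> 'user_id'));

    -- GIN with the default jsonb_ops operator class: supports @>, ?, ?| and ?&.
    CREATE INDEX ON events USING gin (data);

    -- GIN with jsonb_path_ops: only containment (@>), but a smaller, faster index.
    CREATE INDEX ON events USING gin (data jsonb_path_ops);

    -- Queries these can serve:
    SELECT * FROM events WHERE data ->> 'user_id' = '42';        -- B-tree expression index
    SELECT * FROM events WHERE data @> '{"status": "active"}';   -- either GIN index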
Starting point is 00:09:32 You can choose how to index it. And GIN is great in many cases. Not always, though: there are also issues with GIN sometimes, especially with the fastupdate option. But that's another story. So, JSONB... what do you think, how much of Postgres's popularity can be associated with the appearance of JSON or JSONB?
Starting point is 00:09:54 Yeah, I think I mentioned this in our... I think we did an episode roughly on this topic of, you know, why is Postgres so popular? And I actually assign quite a lot of value to this, but not necessarily for fully technical reasons. I think there are really good technical reasons for it, which we'll go into soon, but also for marketing reasons. I think MongoDB was doing really well around this time, not so much for technical reasons but for marketing reasons, and this, as I think we're
Starting point is 00:10:21 going to discuss, didn't put an end to that debate, but it meant people didn't have to make a decision either way: they could store some of the key-value data in Postgres and get nearly as good performance, or better in some cases. So I personally think it was really important for the trend we're seeing now of people seeing relational databases as cool again, or the de facto choice again. I think we were at risk of losing that, or losing traction there, with people going more towards databases like Mongo. But yeah, I don't know. I think it was very important, but I'm not sure. Right. Likewise. I think it's important and it helped to join two worlds, right? The relational and non-relational worlds. And now everyone who wants to stay in the JSON world can just have a table with, like, a surrogate key, a big integer, and a JSON data column, right? No, don't give people bad ideas.
Starting point is 00:11:28 But, of course, I don't encourage that. Usually I say, let's understand which columns we know: we know these columns, and we know the data types and all the constraints, and we should extract them and store them separately, and have JSON only for some flexible, not-yet-understood parts of our data. But it's worth mentioning that some people, for example Oleg Bartunov, mention this: usually in his conference slides he shows some graphs showing that when JSONB appeared, the popularity of Postgres started to increase. But if you look closer, actually, popularity started
Starting point is 00:12:11 to increase slightly earlier. It was the same year, but JSON was not yet released; it was released at the end of the year. But popularity started to go up earlier that year. It was 2013, maybe 2012 or 2013, I don't remember. And as we discussed in the past, we suspect it was related to RDS, probably the support of Postgres in RDS. Back to JSON: I think before we move on from the comparison of JSON and JSONB, I have seen some very smart people making the case that there are some use cases for JSON over JSONB. There are a few things, like you mentioned ordering. If you really want to store exactly what you were sent, for example, for some reason, JSON preserves the order. It preserves whitespace.
Starting point is 00:12:59 It even preserves duplicate keys. And there's one more advantage I saw somebody share, which is purely on ingest rate: it can... yeah, it is faster. Yeah. So if, for any reason, those are important to you, it can be useful, but it comes at a big cost if you need to read the data. So I think most people will want to access this data in some way in the future, and you're probably going to want JSONB. Yeah, I think your default choice should be JSONB. But in some cases, indeed, we need faster inserts and we don't care about better validation in terms of duplicate keys and so on.
Starting point is 00:13:45 Also, regular JSON takes roughly... more space, right? No, no, no. Less space, right? Yes, yes. It takes less space than JSONB. So the default choice is JSONB, but if you want to optimize for insert speed and for storage space, probably the older JSON data type is better.
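A quick way to see these differences for yourself; the literals here are just illustrative:

    -- json keeps the text exactly as sent: whitespace, key order, even duplicate keys.
    SELECT '{"b": 1, "a": 2, "a": 3}'::json;
    -- Result: {"b": 1, "a": 2, "a": 3}

    -- jsonb normalizes it: keys reordered, duplicates collapsed (the last one wins).
    SELECT '{"b": 1, "a": 2, "a": 3}'::jsonb;
    -- Result: {"a": 3, "b": 1}

    -- Per-value size can be compared with pg_column_size():
    SELECT pg_column_size('{"b": 1, "a": 2}'::json)  AS json_bytes,
           pg_column_size('{"b": 1, "a": 2}'::jsonb) AS jsonb_bytes;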
Starting point is 00:14:01 Basically, there are some differences. If these kinds of things are important to you, look into it, test it. But that's also why, if you read up on these topics, people are often just defaulting to JSONB and not really even explaining why.
Starting point is 00:14:19 Well, right. Because, for example, the speed of accessing particular keys will be slower, especially if you have a lot of keys, nested structures, and so on. In JSONB it's easy, because it's already parsed. But for JSON, Postgres will be parsing on the fly, trying to extract the part of the value you want. So it depends on the workload, right? But I would say go with JSONB, and only in some narrow cases
Starting point is 00:14:48 consider JSON, regular JSON. Cool. Continuing with history, should we talk a little bit about the JSON path stuff added in... it was only a few years ago, 2019 I think, version 12, right? Not everything was added. First of all, there is a whole part in the SQL standard right now: similar to SQL/XML, there is now SQL/JSON. And with JSON path, a lot of stuff appeared there. Well, before that, I would say Postgres already had a lot of manipulation capabilities, like, a lot.
Starting point is 00:15:31 And it's hard to remember all of them. For example, you can switch from relational data back and forth, do aggregation, access particular parts, also change particular paths. But it was done in a native Postgres way, because no standard existed. I think Postgres pioneered a lot of things in this area before even the commercial databases, right? And when the standard appeared, a lot of effort went into developing support for the standard. But unfortunately, not everything was committed to recent Postgres versions, because it's a huge amount of work and it needs more time for testing and so
Starting point is 00:16:08 on. So only parts of that went into Postgres 15. Yeah, fingers crossed for future versions. I know a lot of people are excited about that. And on the functions front, I think you're right. I don't even try to remember them. I don't use it that much, but my general rule of thumb is: if you want to manipulate your data in some way and transform it in some way, there's probably a function for that. The docs are great, I'll share a link to them in the show notes, but there's a serious list. So if you think something might be possible, check, because there's probably a function that's already there. Right. I worked with JSON a lot, in the Postgres context and not only, and I always have the reference Postgres documentation open
Starting point is 00:16:51 because it has a lot of stuff to cover. Also, speaking of SQL/JSON, of course we need to recommend the work of Oleg Bartunov and others, Alexander Korotkov, Nikita Glukhov. They had a lot of presentations, a lot of slides. I think we can link them. And there is a lot of work done there over the last few years. Yeah, a lot in terms of performance and indexing, especially, I believe.
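A small taste of what's in that reference, a sketch reusing the hypothetical events table from earlier:

    -- The SQL/JSON path language (available since Postgres 12):
    SELECT jsonb_path_query(data, '$.items[*] ? (@.price > 100)')
    FROM events;

    -- Relational to JSON and back:
    SELECT jsonb_build_object('id', id, 'created', created_at) FROM events;
    SELECT * FROM jsonb_to_record('{"id": 1, "status": "active"}'::jsonb)
             AS t(id int, status text);

    -- Expand a JSON array into a set of rows:
    SELECT jsonb_array_elements(data -> 'items') FROM events;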
Starting point is 00:17:16 Well, not only indexing. I mean, support of the SQL/JSON standard, even if it's not finished. Yes, it started several years ago. It's a lot of work, go check it out. But indexing, yes, the same people were also involved in GIN support for JSONB. Indeed, RUM, or VODKA, I forgot what it's called... no, no, RUM; VODKA is different. It's an extension of GIN. Yeah, so the problem with GIN usually is that, okay, you indexed your document, for example JSON, doesn't matter, maybe something else. GIN supports various things,
Starting point is 00:17:54 for example, arrays as well, right? Or full-text search, so also GIN. So we indexed some document, and we have a huge table with, for example, ID, created_at, something else, and this data value indexed by GIN. And then we want to have a search covered by GIN, supported by GIN. But we also want to have some pagination and ORDER BY. With ORDER BY, we usually want to have not... like, the traditional approach with full-text search... Full-text search, why do I mention full-text search? Because GIN was originally created for it, for full-text search.
Starting point is 00:18:33 Only then was it propagated to complex data types, this, like, RD-tree, Russian doll tree, for arrays and so on. It's historical knowledge, right? So, usually we order by some relevance metric. For example, this document is most relevant for this search and goes to first place, and so on. And this data is, like, in the index. But in social media use cases, we usually order by time. Right? We want the latest, awesome,
Starting point is 00:19:04 fresh data to go first, and old data goes last. And pagination. How to do that? We need to order by either ID or a created_at timestamp. And this data is not present in GIN. And this is a huge problem,
Starting point is 00:19:19 because Postgres now needs to choose what to use: either a B-tree for that historical ordering, right, or GIN for full-text search or JSON search, anything. Trigram search, it can be anything. And how do you combine them? The idea, let's combine them, is the RUM index. So an integer or timestamp value goes into GIN,
Starting point is 00:19:44 it extends it, and you have the RUM index. But RUM index... in my tests, honestly, I didn't do it in the past two or three years, so there might be improvements, but in my tests in the past I had such big index sizes and performance issues that I decided not to go with it in production, every time I tried it. Maybe today it's better, I don't know. If our listeners have fresh data about RUM behavior, please comment, let us know. It's interesting. Yeah, that'd be cool.
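For reference, the basic shape of the RUM extension looks roughly like this, following its README as I recall it; the table, columns and options are illustrative, so double-check against the extension's docs before relying on it:

    CREATE EXTENSION rum;

    -- A search column plus the timestamp we want to paginate by.
    CREATE TABLE docs (
        id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        created_at timestamp NOT NULL DEFAULT now(),
        tsv        tsvector NOT NULL
    );

    -- rum_tsvector_addon_ops lets the timestamp ride along inside the index,
    -- so search plus ORDER BY time can be served together.
    CREATE INDEX docs_rum_idx ON docs
        USING rum (tsv rum_tsvector_addon_ops, created_at)
        WITH (attach = 'created_at', to = 'tsv');

    -- "Latest matching documents first" style pagination, ordering by distance
    -- to a far-future timestamp:
    SELECT id, created_at
    FROM docs
    WHERE tsv @@ to_tsquery('english', 'postgres & json')
    ORDER BY created_at <=> '2100-01-01'::timestamp
    LIMIT 20;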
Starting point is 00:20:19 Should we move back to... I feel like we haven't really even discussed when it is and isn't a good idea. There are some really good articles about this. My personal favorite is by the team at Cybertec; I think Laurenz goes into great detail and actually focuses mostly on mistakes he sees people making. So, as we said, it's very popular, people like the idea of it, but he listed a bunch of things that they often do that he wouldn't advise. And I think they're very sensible suggestions. Like, for example, the first thing you mentioned, which is an ID and then JSON, and that was it, instead of pulling out some of the columns that you might. Right. A classic example here: if it's some account data, for example, which has email and so on, I
Starting point is 00:21:05 would definitely put those as normal columns for better control, to have all constraints in place and to have better data quality, to ensure that everything is filled in: not null, uniqueness, and so on. It can be achieved in JSON, because we can have constraints, like check constraints using some expressions over JSON, but it would be much harder to maintain. Yeah, I think that might be an understatement. I would not like to maintain a system that was doing that. But yeah, so constraints are a really good point. Foreign keys are an example, uniqueness... oh yeah, there are so many things that you're kind of losing out on, nice
Starting point is 00:21:52 support for. You know what, I remember from my XML practice, 15 years ago when I worked on it for Postgres, of course there were several ways to ensure some semi-structured schema for XML values. And for JSON, we don't have that. Of course, we can use some check constraints, as I mentioned. But there is an interesting project. I only know about it, I haven't looked at it. From Supabase, it's called pg_jsonschema.
Starting point is 00:22:20 So there is something to put in place a schema for JSON and ensure that our values are matched against the rules defined in that schema. So, something interesting as well. But for account data, for data like billing data, I would never put the most important fields into JSON. I would keep them in the relational part of the tables.
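As a rough sketch of what schema enforcement can look like, with made-up table and column names, and with the pg_jsonschema function name taken from that project's README, so verify it against the current docs:

    -- Plain check constraints over jsonb, no extension needed:
    ALTER TABLE accounts
        ADD CONSTRAINT settings_has_plan
        CHECK (settings ? 'plan' AND jsonb_typeof(settings -> 'plan') = 'string');

    -- With the pg_jsonschema extension, validating against a JSON Schema document
    -- (function name per the extension's README):
    CREATE EXTENSION pg_jsonschema;

    ALTER TABLE accounts
        ADD CONSTRAINT settings_matches_schema
        CHECK (jsonb_matches_schema(
            '{"type": "object",
              "required": ["plan"],
              "properties": {"plan": {"type": "string"}}}',
            settings
        ));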
Starting point is 00:22:54 Yeah, I've got a few more reasons here as to why. I think it's worth saying that, for example, if we want to update some of the data in there, it's hugely more expensive for some technical reasons, especially if it's been TOASTed. Exactly. The TOASTing... it's not only about updating; for updating it's a big question, but it matters for reading as well. All benchmarks which deal with,
Starting point is 00:23:13 which check JSON values of different sizes, we can observe that when we reach the TOAST point, two kilobytes per value, or it can be controlled, right? Right. In this case, degradation happens immediately, and then TOASTing adds huge overhead for both reading and modifications. But there is some work, I don't think it was finished and committed already, I think not, there is some work
Starting point is 00:23:41 to improve the update speed by touching only those TOAST chunks which really need to be touched. So this would be an interesting improvement, but I think it's not yet finished. Yeah, that'd be cool. I'm not sure how that would work, but that'd be very cool. A couple of other, kind of more basic, points that I think are quite important: well, actually Laurenz mentions joining.
Starting point is 00:23:59 If you want to join on one of the columns, for example, that is in the JSON, that's obviously going to be a challenge. But I think the more important part for me, coming from the performance side, is that there are no statistics on the data within JSON. Yeah, that's why joining can work poorly, because of the lack of statistics for the internal data, right? Yeah, so it's really difficult for the planner to make good decisions. I suspect we can collect them, no? Using CREATE STATISTICS and an expression, no?
Starting point is 00:24:32 You can do that? Why not? That would be cool. But it's not automatic, right? You'd have to do that on a manual basis. Right, but why not? It's just a fresh question that came to my mind.
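For what it's worth, a sketch of what that could look like; expression statistics need Postgres 14 or newer, and the names are again hypothetical:

    -- Option 1: an expression index; ANALYZE then also gathers statistics
    -- for that expression.
    CREATE INDEX ON events ((data ->> 'status'));

    -- Option 2 (Postgres 14+): extended statistics on an expression, no index required.
    CREATE STATISTICS events_status_stats ON ((data ->> 'status')) FROM events;
    ANALYZE events;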
Starting point is 00:24:49 I actually never needed it, because I was usually very careful with JSONB values, keeping only some really flexible data there. Also worth mentioning: you might need to work with JSON even without storing any JSON, right? You just provide it as output, for example. Oh yes, you do, you do it all the time implicitly, and you might do it explicitly if you have some queries. So relational data is stored, but JSONB or JSON is the output.
Starting point is 00:25:20 Oh, like casting to it, or using, I guess, creating or parsing. I never thought of it. Yeah, or parsing: you take it as input, and then you parse it and store it as a relation. Actually, there is some direction of research called updatable views, but that was XML views over relational data. I guess it could be done for JSON as well; I'm not sure it's very practical. Interesting, practically interesting. But yeah, so you can have, like, layers: relational data is stored, and JSON is the communication language. Yeah, true. And I think what you said just then is where a lot of people land with this:
Starting point is 00:26:06 you're very sensible and store a lot of it in regular columns, but then, the more flexible part, as you said... the other way I've heard that described is fields where the vast majority of your rows would be NULL. I think a common example given is an e-commerce store, where you might want to store some data for some types of products that most products don't have that data for. So you want it structured, but for most of your objects it's going to be NULL. So you could store maybe five attributes for each item, but they could be different for different items. So I think that'd be a really good use case. Good. And speaking of nulls, you know how I love nulls.
Starting point is 00:26:49 Of course, it's a very important concept in all relational databases. But if you use jsonb_set to modify a part of your JSON value, you should be careful with nulls, because if you use a relational null, an SQL NULL, it will nullify the whole value. You need to use a JSONB null, so it's like the string 'null' converted to jsonb. In this case you'll be fine, and it will nullify only the part of your JSON value which you target. So there are a lot of small things to learn when you work with it, but I think modern development is impossible without JSON, right?
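A tiny illustration of that gotcha, with a hypothetical table and column:

    -- SQL NULL as the new value: jsonb_set is strict, so the whole result becomes NULL.
    UPDATE accounts SET settings = jsonb_set(settings, '{plan}', NULL);

    -- JSON null is usually what you want: only the "plan" key is nulled out.
    UPDATE accounts SET settings = jsonb_set(settings, '{plan}', 'null'::jsonb);

    -- Since Postgres 13, jsonb_set_lax() also lets you choose how an SQL NULL is treated.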
Starting point is 00:27:34 Any other pieces of advice? No... well, you mentioned the TOAST value threshold: it starts toasting things over a threshold. Yes, and it can be configured. If anybody wants to look that up, that's toast_tuple_target, and there's a really good blog post by Haki Benita where I learned about that, about storing medium-sized texts. But by default, if we have JSON values much
Starting point is 00:27:55 less in size than 2 kilobytes, we should be fine. Well, much less, yes. If they're tiny, great. If they're really big, there can be performance issues. But he actually makes a really good point that there are really weird performance penalties for medium sizes, so things that are just slightly under that two-kilobyte limit. So it's a really good blog post, quite long, but I highly recommend reading it.
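The per-table knob mentioned above looks like this; the table name is hypothetical, and the default threshold is around 2 kB:

    -- Start compressing/TOASTing rows at a lower threshold than the ~2 kB default:
    ALTER TABLE events SET (toast_tuple_target = 256);

    -- Check the current per-table setting (empty means the default):
    SELECT reloptions FROM pg_class WHERE relname = 'events';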
Starting point is 00:28:22 Well, especially I recommend an experiment-oriented approach. Take your schema, take your workload. If you don't have a workload yet, try to invent it, trying to predict the future; it's sometimes possible. And just benchmark all aspects and pay attention to TOASTing and so on. Why not? Absolutely. I'm constantly learning; I don't know all the details of this reference of all the functions related to the JSON and JSONB data types. An interesting thing that I use quite a lot: I use row_to_json and json_agg for aggregation. And an interesting small piece of advice is that you can ORDER BY inside the aggregation, so you can have a GROUP BY, for example,
Starting point is 00:29:07 and have json_agg, and there, inside the parentheses, you write ORDER BY, and it works. That's so cool, I didn't know that. It works independently from the main ORDER BY clause in your SELECT statement, so it's interesting. Yeah. Well, sorry, it's not... it's about arrays: then you can order, and for arrays it's not fully related, but then you can also have indexing, for example, I want only the first five members of this array.
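Roughly what that looks like, with a hypothetical orders table:

    -- ORDER BY inside the aggregate, independent of the statement's own ORDER BY:
    SELECT customer_id,
           json_agg(item ORDER BY created_at DESC) AS items
    FROM orders
    GROUP BY customer_id
    ORDER BY customer_id;

    -- With array_agg you can additionally slice, e.g. keep only the first five:
    SELECT customer_id,
           (array_agg(item ORDER BY created_at DESC))[1:5] AS latest_items
    FROM orders
    GROUP BY customer_id;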
Starting point is 00:29:56 Ah, a very important thing: new people sometimes make a mistake here. If you use the arrow operator, the regular arrow operator, for JSON and JSONB values, you will get JSON or JSONB values as a result. But sometimes you want text, and you will see double quotes: if you later convert it to text, those double quotes will be preserved there, and that's not good. In this case, one of the things I wish I knew earlier, right, the double arrow operator will directly convert it to text, eliminating the double quotes. It's very handy.
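In other words:

    SELECT '{"name": "Ann"}'::jsonb -> 'name';   -- jsonb result: "Ann"  (quotes preserved)
    SELECT '{"name": "Ann"}'::jsonb ->> 'name';  -- text result:  Ann    (no quotes)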
Starting point is 00:30:18 Yeah, that's a great tip. Some awesome little tips at the end as well. Bonus. We can write an article now, right? JSON and JSONB tips. Yeah, that'd be great. Actually, there was one last blog post that I wanted to... So there are a couple of good ones, one by ScaleGrid
Starting point is 00:30:34 that I'll share and one by the Heap team that I'll share as well, that were good as a bit of further reading. Yeah, and we should definitely share the benchmark results from Oleg Bartunov's talks. The last one was maybe a year ago or something. There is a comparison with MongoDB there, also interesting. And speaking of MongoDB, there is a new project called FerretDB, right?
Starting point is 00:30:55 It attempts to build a MongoDB-like database system on top of Postgres, which is interesting. Is it compatible with Mongo? I don't know the details. I only know that one of the founders is Peter Zaitsev from Percona, so a lot of experience should be brought there. And I think it's an interesting project. I checked it yesterday.
Starting point is 00:31:17 It looks very alive, with a lot of development happening. So it's worth keeping an eye on it for those who have a MongoDB preference in terms of working with JSON documents and so on. I'm curious about checking benchmarks for this new system. It's like open-source MongoDB, because, you know, MongoDB is not really open source anymore, right? They push everyone to use Atlas. So this project has good chances to convert some users to
Starting point is 00:31:48 Postgres, basically, right? So it's a good thing. Yep. Wonderful. Well, thank you very much, Nikolay. Thank you, everybody, for joining us, and see you next week. Thanks, Michael. Bye-bye.
