Postgres FM - JSON
Episode Date: February 24, 2023

Nikolay and Michael discuss JSON — our options for storing it in Postgres, whether or when we should, as well as a brief intro and some tips for JSON functions available.

Here are links to a few things we mentioned:
hstore
XML type
XML functions
JSON types
JSON functions
JSONB indexing
NULLS episode
Why Postgres is popular episode
PostgreSQL 12 release notes
What's New in SQL:2016 (blog post by Markus Winand)
SQL/JSON is postponed (blog post by depesz)
JSON[b] Roadmap (talk by Oleg Bartunov)
Slides, with benchmarks
RUM access method
JSON in PostgreSQL: how to use it right (blog post by Laurenz Albe from Cybertec)
pg_jsonschema
TOAST_TUPLE_TARGET and TOAST_TUPLE_THRESHOLD
The Surprising Impact of Medium-Size Texts on PostgreSQL Performance (blog post by Haki Benita)
Aggregate functions
How to store and index JSON data (blog post by ScaleGrid)
When to avoid JSONB (blog post by Heap)
FerretDB

What did you like or not like? What should we discuss next time? Let us know by tweeting us on @samokhvalov / @michristofides / @PostgresFM, or by commenting on our Google doc.

If you would like to share this episode, here's a good link (and thank you!)

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:
Jessie Draws for the amazing artwork
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I'm Michael, founder of pgMustard, and this is my co-host Nikolay, founder of Postgres.ai.
Hey Nikolay, what are we talking about today?
Hi Michael. JSONB.
Oh, we're going straight for B.
Yeah.
B is better.
Yeah, I found that funny. Is that semi-official or is that just a joke?
Well, I remember when it was presented, it was discussed — binary, better, nobody knows.
Maybe B, like number two — A and B. But no, I never saw that discussed.
I never heard this explanation, so I doubt it.
Yeah. So, I mean, specifically, I guess we're talking about storing JSON in Postgres,
and the options for doing so, and how you can then use it.
We've had a couple of requests for similar things in this area.
So thank you to the people that have requested those.
Where did you want to start?
Well, I would start with a discussion of first normal form and atomic values versus non-atomic
values, and whether to consider arrays, for example, as atomic.
And we definitely should touch history and HStore.
Well, arrays have been supported in Postgres for quite a long time,
both numeric and other types of arrays,
like text arrays and so on,
and multidimensional arrays, actually.
Then HStore, I think it appeared first in 2004,
very long ago, almost 20 years ago.
And hstore is like a key-value type, and it doesn't support nesting,
but it's already very flexible.
So arrays and hstore — by some understanding,
they already break first normal form, because these values are not atomic.
But actually, we can argue that, for example, timestamp is not atomic as well.
So let's leave this discussion for the theory part and focus on the practice part, right? For
practice, I would definitely say these values are very convenient and they are not as evil as nulls. They can save you time
without bringing too many issues
in terms of unexpected behavior.
And so there are some cases
when you have some unexpected behavior.
And the most interesting thing,
I remember like 15 years ago,
I had a talk in Maryland.
It was my first visit to the US.
And the talk was about
how Postgres is
good for modern web applications. At the time, web 2.0 was a thing. So my talk was called like
Postgres is good for web 2.0. I did some benchmarks for EAV versus hstore or arrays.
Doesn't matter actually in this case. JSON didn't exist at that time yet.
And actually, I was also working on XML.
It's yet another data type which you can say is breaking first normal form,
but also provides some flexibility in some cases.
I think currently it's not the most popular format, but still.
There is some development ongoing still.
And the thing is that EAV — entity-attribute-value — is very flexible: you have three tables with optional foreign
keys between them. Foreign keys are not the biggest problem there; the biggest
problem is that you need to join extensively. Some entity, some attribute, and an additional, like, connection table between them
which stores values. So basically it's like your document: if you originally had an
XML or JSON document, you shrink it into three tables. It's flexible, it supports a lot of possible documents, but to construct
the document back, you need to join all of them, right? And one thing we have is a storage
overhead issue. Usually a lot more bytes are wasted on storage compared to, for example,
a text representation or JSON representation.
And second thing, you need to join everything together all the time.
And — like my very simple explanation to new Postgres users — you cannot have an index
on two tables.
This means, for best performance, you usually want to have everything in one
table. Joins — joins are good, Postgres can be very performant with joins, but not
always. Nothing can beat an index-only scan, for example, right?
Yeah. And just to clarify, when you're saying you can't have an index on two
tables, we mean if you want a multi-column index, you can't pick two columns that are
from separate tables.
Right. So the EAV approach is, I would say, a radical normalization approach.
You normalize everything, you have IDs everywhere, and then you need to index, and then you need to
rely on join algorithms — we have three, right? And compared to a denormalized state,
when you probably repeat some keys and so on,
you usually have worse performance.
So both storage overhead and worse performance
to extract the whole document.
And it usually means you need to avoid EAV.
In most cases, yeah.
I think it's worth even saying
developer experience is worse, right?
Even to get the simplest data out,
if you're having to do multiple joins,
it's just more complex as well.
But yeah, I agree that those two are far worse.
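As a rough illustration of the EAV layout being described (hypothetical table and column names), reconstructing a single document means joining all three tables:

    -- Entity-attribute-value: the document is shredded into three tables
    CREATE TABLE entities   (id bigint PRIMARY KEY, kind text);
    CREATE TABLE attributes (id bigint PRIMARY KEY, name text);
    CREATE TABLE entity_values (
      entity_id    bigint REFERENCES entities(id),
      attribute_id bigint REFERENCES attributes(id),
      value        text,
      PRIMARY KEY (entity_id, attribute_id)
    );

    -- Rebuilding one "document" requires joining everything back together
    SELECT e.id, a.name, v.value
    FROM entities e
    JOIN entity_values v ON v.entity_id = e.id
    JOIN attributes a    ON a.id = v.attribute_id
    WHERE e.id = 42;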
Well, developers specifically —
I've observed several teams
who loved the flexibility.
It's like with star schema
for analytical databases: they
love the flexibility, they can quite quickly develop an abstraction in their language of choice and
be very flexible. Very good. But then, at some point, when we have a billion rows, we can already
feel trouble with this approach. So in my talk 15 years ago,
I was trying to prove that
hstore or XML or arrays are better.
So you need to go the denormalized way
and rely on the flexible indexing capabilities
Postgres provides.
For example,
GIN is a very flexible way
to index complex data types.
And then JSON appeared, I think, in 2014, 2013.
I've written it down, yeah.
9.2 is 2012, but that was JSON without the B — the less good version.
Syntax-sugar JSON, let's call it that.
Yeah, well, actually, if we stick with the history,
should we give a quick rundown of what that included?
So we could store JSON.
It did some validation, but it was just a blob.
It was like a string, basically.
Text, yes.
But why was it so?
It was a success even before JSONB
because by that time already,
JSON had become a de facto standard
for web applications, mobile applications.
Most pieces of systems
started to communicate using JSON.
And it was obvious that it's great
to support it in the database.
And also it gave birth to this movement:
let's eliminate middleware
and allow maybe some very thin middleware
like PostgREST — we had an episode about it. So the front-end can communicate with the
Postgres database through some very thin middleware which doesn't have a lot of logic,
and a lot of logic goes, for example, to a React app — it goes to client code or a mobile app.
A lot of logic is there.
And the data logic moves closer to the database.
And it also gave another wave of interest in database-side programming using PL/pgSQL or PL/Python and so on.
So this is interesting.
Indeed, it started roughly 10 years ago. Yeah. Wow. That's
longer than I would have guessed, actually. Yeah. And then a few
years later, JSONB was implemented, and
it has indexing support. So it's a binary format
which stores keys in ordered form internally
and it supports flexible indexing
capabilities.
Well, for JSON, you can also index using a B-tree and a path.
So you can use B-tree indexing for some path expressions.
So you can create an index — an expression index — but the expression should return scalar
values, right?
It cannot return the whole value.
Or you need to convert it to text, but that won't be very good.
Basically, this index is only meaningful if your query has the same expression as well, right?
And sometimes it's good, because this B-tree is very efficient
and you can use it in some cases.
But for JSONB, you can use a GIN index with some modifiers, like jsonb_ops.
And I don't remember, but there is a way to...
jsonb_path_ops, maybe?
Path ops, yes, yes.
So you can say, I don't need to remember anything.
I just need keys and values.
And in this case, only limited expressions will be supported.
But index size will be smaller.
So you have a choice.
You can choose how to index it.
And GIN is great in many cases.
Though not always — there are also issues with GIN sometimes, especially with this fastupdate option.
But it's another story.
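To make the indexing options concrete, here is a rough sketch assuming a hypothetical docs table with a JSONB column named data:

    -- B-tree expression index: works for a single scalar path,
    -- and only helps queries that use the exact same expression
    CREATE INDEX docs_email_idx ON docs ((data->>'email'));
    SELECT * FROM docs WHERE data->>'email' = 'someone@example.com';

    -- GIN with the default jsonb_ops operator class: supports more operators
    CREATE INDEX docs_data_gin ON docs USING gin (data);

    -- GIN with jsonb_path_ops: only containment (@>), but a smaller index
    CREATE INDEX docs_data_gin_path ON docs USING gin (data jsonb_path_ops);
    SELECT * FROM docs WHERE data @> '{"status": "active"}';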
So, JSONB, probably. What do you think —
how much of Postgres' popularity
can be associated with the appearance of JSON or JSONB?
Yeah, I think I mentioned in our...
I think we did an episode roughly on this topic of,
you know, why is Postgres so popular?
And I actually assign quite a lot of value to this,
but not necessarily for fully
technical reasons. I think there are really good technical reasons for it, which we'll go into
soon, but also for marketing reasons. I think MongoDB was doing really well around this time,
not so much for technical reasons but for marketing reasons, and this — as I think we're
going to discuss — didn't put an end to that debate, but it meant people didn't have to make a decision either-or: they could store some of the key-value data in Postgres and get nearly as good performance, or better in some cases.
So I personally think it was really important for the trend we're seeing now of people seeing relational databases
as cool again or the de facto again. I think we were at risk of losing that or losing traction
there and people going more towards databases like Mongo. But yeah, I don't know. I think it
was very important, but I'm not sure. Right. Likewise. I think it's important and it helped to join two worlds, right?
So relational and non-relational world.
And now everyone who wants to stay in the JSON world can just have a table with, like, a surrogate key — a big integer — and a JSON data column, right?
No, don't give people bad ideas.
But, of course, I don't encourage that.
Usually I say, let's understand which columns we know —
we know these columns, we know the data types and all constraints —
and we should extract them and store them separately, and have JSON only for some
flexible, not-yet-understood parts of our data.
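A minimal sketch of that approach, with hypothetical names — well-understood fields as regular columns with constraints, and JSONB only for the flexible remainder:

    CREATE TABLE accounts (
      id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
      email      text NOT NULL UNIQUE,            -- known column: full constraints
      created_at timestamptz NOT NULL DEFAULT now(),
      extra      jsonb                            -- flexible, not-yet-understood part
    );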
But it's worth mentioning that some people — for example, Oleg Bartunov usually mentions this
in his slides at conferences — show graphs indicating that when JSONB appeared,
the popularity of Postgres started to increase. But if you look closer, actually, popularity started
to increase slightly earlier. It was the same year, but JSONB was not yet released — it was
released at the end of the year. But popularity started to go up earlier that year; it was 2013, maybe 2012 or 2013, I don't
remember. And as we discussed in the past, we suspect it was related to RDS — probably support
of Postgres in RDS.
Back to JSON. I think before we move on from the comparison of JSON and JSONB,
I have seen some very smart people making a case that there are some use cases for JSON over JSONB.
There are a few things — like you mentioned, ordering.
If you really want to store exactly what you were sent, for example, for some reason: JSON preserves the order,
it preserves whitespace,
it even preserves duplicate keys.
And there's one more advantage I saw somebody share, which is
purely on ingest rate — it can...
Yeah, it is faster.
Yeah. So if for any reason those are important to you, it can be useful, but it comes at a big cost if you need to read the data.
So I think for most people, who want to access this data in some way in the future,
you're probably going to want JSONB.
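A quick illustration of the difference being described:

    SELECT '{"a": 1,  "a": 2}'::json;   -- stored as sent: whitespace, key order, duplicate keys preserved
    SELECT '{"a": 1,  "a": 2}'::jsonb;  -- parsed: {"a": 2} — last duplicate key wins, formatting normalized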
Yeah, I think your default choice should be JSONB.
But in some cases, indeed, when we need faster inserts and we don't care about better validation in terms of duplicate keys and so on...
Also, regular JSON takes roughly two times more space, right?
No, no, no. Less space, right?
Yes, yes — it takes less space than JSONB.
So the default choice is JSONB, but if
you want to optimize for insert
speed and storage space,
probably the older
JSON data type is better.
Basically, there are some differences.
If these kinds of things are important to you, look into
it, test it.
But that's why if you read up on these topics, often people
are just defaulting to JSONB and not really even explaining
why.
Well, right.
Because, for example, the speed of accessing particular
keys will be slower,
especially if you have a lot of keys, a nested structure and so on.
In JSONB, it's easy because it's already parsed.
But for JSON, Postgres will be parsing on the fly, trying to extract the part of your value.
So it depends on the workload, right?
But I would say go with JSONB and only in some narrow cases
consider regular JSON.
Cool. Continuing with history, should we talk a little bit about
the JSON path stuff added in... it was only a few years ago, 2019 I think, version 12?
Right, but not everything was added.
First of all, there is a whole part in the SQL standard right now, similar to SQL/XML.
Now there is SQL/JSON.
And JSON path — a lot of stuff appeared there.
Well, before that, I would say Postgres already had a lot of
manipulation capabilities, like a lot,
and it's hard to remember all of them. For example, you can switch between relational data and JSON back and forth, aggregation, accessing particular parts, also changing particular paths. But it was
done in native Postgres way because no standard existed. I think Postgres pioneered a lot of things in this area
before even commercial databases, right?
And when the standard appeared,
a lot of effort went into developing support for it.
But unfortunately, not everything was committed
to recent Postgres versions, because it's a huge amount of work
and it needs more time for testing and so
on. So only parts of that went to Postgres 15. Yeah, fingers crossed for future versions. I know
a lot of people are excited about that. And on the functions front, I think you're right.
I don't even try to remember them — I don't use them that much — but my general rule of
thumb is: if you want to manipulate your data in some way and transform it in some way, there's
probably a function for that. The docs are great — I'll share a link to them in the show notes — but
there's a serious list. So if you think something might be possible, check, because there's probably a
function that's already there.
Right. I worked with JSON a lot, in the Postgres context and not only,
and I always have the Postgres documentation reference open
because it has a lot of stuff to cover.
Also, speaking of SQL/JSON, of course,
we need to recommend the work of Oleg Bartunov and others —
Alexander Korotkov, Nikita Glukhov.
They had a lot of presentations, a lot of slides.
I think we can link them.
And there is a lot of work there done over the last few years.
Yeah, a lot in terms of performance and indexing, especially, I believe.
Well, not only indexing.
I mean, support of the SQL/JSON standard, even if it's not finished.
Yes, it started several years ago.
It's a lot of work —
go check it out. But indexing, yes — the same people were also involved in GIN support for JSONB.
Indeed. RUM or VODKA, I forget what it's called.
No, no — RUM. VODKA is different. It's an extension of GIN.
Yeah. So the problem with GIN usually is: okay, you indexed your document —
for example JSON, doesn't matter, maybe something else. GIN supports various things,
for example arrays as well, right? Or full-text search, so also GIN. So we indexed some document,
and we have a huge table with, for example, ID, created_at, something else, and this data value indexed by GIN.
And then we want to have a search covered by GIN, supported by GIN.
But we also want to have some pagination and ORDER BY.
ORDER BY — we usually want to have... Like, the traditional approach with full-text search...
Full-text search — why do I mention full-text search?
Because GIN was originally created for it,
for full-text search.
Only then was it propagated to complex data types —
this, like, RD-tree, Russian doll tree, for arrays, and so on.
It's historical knowledge, right?
So usually we order by
some relevance metric. For example, this document is most
relevant for this search, so it goes to first place, and so
on. And this data is in the index. But in social media use
cases, we usually order by time, right? We want the latest, fresh data to go first;
old data goes last.
And pagination.
How to do that?
We need to order by either ID
or the created_at timestamp.
And this data is not present in GIN.
And this is a huge problem
because Postgres now needs to choose
what to use:
either a B-tree for that ordering by time,
or GIN for full-text search or JSON search, anything —
trigram search, it can be anything.
And how to combine them?
The idea — let's combine them — is the RUM index.
So an integer or timestamp value goes into GIN,
it extends it, and we have the RUM index. But the RUM
index, in my tests — honestly, I didn't do it in the past two or three years, so there might be
improvements — but in my tests in the past, I had such big index sizes and performance issues that I decided
not to go with it in production, every time I tried it.
Maybe today it's better, I don't know.
If our listeners have fresh data about RUM behavior, please comment.
Let us know. It's interesting.
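For anyone who wants to experiment, the usage sketched in the RUM extension's documentation looks roughly like this (hypothetical posts table with a tsvector column tsv; syntax recalled from the extension's README, so worth verifying — and benchmarking — before relying on it):

    CREATE EXTENSION rum;

    -- Store created_at alongside the tsvector inside the index,
    -- so "search + order by time" can be served by a single index
    CREATE INDEX posts_rum_idx ON posts
      USING rum (tsv rum_tsvector_addon_ops, created_at)
      WITH (attach = 'created_at', to = 'tsv');

    SELECT id, created_at
    FROM posts
    WHERE tsv @@ to_tsquery('english', 'postgres')
    ORDER BY created_at <=> '2023-01-01'::timestamptz
    LIMIT 10;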
Yeah, that'd be cool.
Should we move back to... I feel like we haven't really even discussed when it isn't a good idea.
There are some really good articles about this. My personal favorite is by the team at Cybertec — I think
Laurenz goes into great detail; actually, he focuses mostly on mistakes he sees people
making. So, as we said, it's very popular, people like the idea of it, but he listed a bunch of things that they often do that he
wouldn't advise. And I think they're very sensible suggestions. Like, for example, the first thing you
mentioned, which is an ID and then JSON, and that was it, instead of pulling out some of the columns
that you might.
Right. Classic example here: if it's some account data, for example, which has
email and so on — I
would definitely put those as normal columns for better control, to have all constraints
in place and to have better data quality, to ensure that everything is filled: not null,
uniqueness and so on.
It can be achieved in JSON, because we can have constraints — like check constraints
using some expressions over JSON — but it would be much harder to maintain.
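For example, a constraint like this is possible on a hypothetical accounts table with a JSONB column extra, but every rule has to be spelled out by hand per key, which is why it gets hard to maintain:

    ALTER TABLE accounts
      ADD CONSTRAINT extra_email_valid CHECK (
        extra ? 'email'                 -- key must exist
        AND extra->>'email' LIKE '%@%'  -- crude format check
      );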
Yeah, I think that might be an understatement. But yeah, I would not like to maintain a system
that was doing that. But yeah, so constraints are a really good point — foreign keys are an example,
uniqueness... there are so many things that you're kind of losing out on nice
support for.
You know what, I remember from my XML practice 15 years ago, when I worked on it
for Postgres — of course, there were several ways to ensure some semi-structured schema for XML values.
And for JSON, we don't have that.
Of course, we can use some check constraints, as I mentioned.
But there is an interesting project —
I only know about it,
I haven't looked at it —
from Supabase, it's called pg_jsonschema.
So there is something to have a schema in place for JSON and ensure that
our values are matched against the rules defined in this schema.
So, something interesting also.
But for account data, for data like billing data, I would never put the most important
fields into JSON.
I would keep them in the relational part of the tables.
Yeah, I've got a few more kind of reasons here as to why.
Like, I think it's worth saying that, for example,
if we want to update some of the data in there,
it's hugely more expensive for some technical reasons,
like especially if it's been toasted.
Exactly. The toasting —
it's not only about updating;
for updating it's a big question,
but it matters for reading as well.
In all benchmarks which
check JSON values
of different sizes, we can observe
that when we reach the
TOAST point — two kilobytes
per value, or it can
be controlled, right? — in this case degradation happens
immediately, and then toasting adds huge overhead for both reading and modifications. But there is
some work — I don't think it was finished and committed yet — there is some work
to improve the update speed, touching only those chunks of TOAST
which really need to be touched.
So this would be an interesting improvement,
but I think it's not yet finished.
Yeah, that'd be cool.
But I'm not sure how that would work,
but that'd be very cool.
A couple of other, kind of more basic, points
that I think are quite important —
well, actually, Laurenz mentions joining.
If you want to join on one of the columns, for example, that is in the JSON, that's obviously going to be a challenge. But I think the more
important part for me coming from the performance side is there's no statistics on the data within
JSON.
Yeah, that's why joining can work poorly — because of the lack of statistics for the internal
data, right?
Yeah, so it's really difficult for the planner to make good decisions.
I suspect we can collect them, no?
Using CREATE STATISTICS and an expression, no?
You can do that?
Why not?
That would be cool.
But it's not automatic, right?
You'd have to do that on a manual basis.
Right, but why not?
It's just a fresh question that came to my mind.
I actually never needed it,
because I was usually very careful with JSONB values,
keeping only some really flexible data there.
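For the curious, this does look possible — a sketch with hypothetical names, worth double-checking against the docs: Postgres 14+ supports extended statistics on expressions, and on older versions an expression index also makes ANALYZE collect statistics for that expression:

    -- Extended statistics on an expression (Postgres 14+)
    CREATE STATISTICS accounts_plan_stats ON ((extra->>'plan')) FROM accounts;
    ANALYZE accounts;

    -- Alternative that works on older versions: an expression index
    -- causes ANALYZE to gather statistics for the indexed expression
    CREATE INDEX accounts_plan_idx ON accounts ((extra->>'plan'));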
And also worth mentioning:
you might need to work with JSON
even without storing any JSON, right?
You just provide it as output,
for example, in case you use...
Oh yes, you do — you do it all the time implicitly. You might do it explicitly if you have some queries.
So relational data is stored, but JSONB or JSON as output.
Oh, like casting to it, or, I guess, creating or parsing.
I never thought of it.
Yeah, or parsing — you take it as input, and then you parse it and store it as a relation.
Actually, there's some direction of research called updatable views, but that was XML
views over relational data. I guess it could be done for JSON as well. I'm not sure it's very
practical... interesting, practically interesting. But yeah, so you can have, like, layers:
relational data stored, and JSON as the communication language.
Yeah, true.
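A small sketch of both directions, with hypothetical names — relational data stored, JSON only at the boundary:

    -- Output: relational rows serialized to JSON for the client
    SELECT to_jsonb(a) FROM accounts a WHERE a.id = 42;

    -- Input: a JSON payload parsed back into a relational shape
    SELECT *
    FROM jsonb_to_recordset('[{"email": "a@example.com", "plan": "free"}]'::jsonb)
           AS t(email text, plan text);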
And I think what you said just then is where a lot of people land with this:
being very sensible and storing a lot of it in regular columns, and then, as you said,
the flexible part. The other way I've heard that described is fields where the vast
majority of your rows would be null. I think a common example given is an e-commerce store, where
you might want to store some data for some types of products that most products don't have that
data for. So you want it structured, but for most of your objects it's going to be null. So
you could store maybe five attributes for each item, but they could be different for different
items. So I think that'd be a really good use case.
Good. And speaking of nulls — you know how I love nulls.
Of course, it's a very important concept in all relational databases.
But if you use jsonb_set to modify a part of your JSON value,
you should be careful with nulls, because if you use a relational null —
an SQL null — in this case it
will nullify the whole value. You need to use a JSONB null, so it's like the string 'null' converted
to JSONB. In this case, you'll be fine, and it will nullify only the part of your JSON value
which you target.
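A small sketch of that pitfall (hypothetical table and path):

    -- SQL NULL: jsonb_set is strict, so the whole value becomes NULL
    UPDATE accounts SET extra = jsonb_set(extra, '{plan}', NULL);

    -- JSONB null: only the targeted key becomes null
    UPDATE accounts SET extra = jsonb_set(extra, '{plan}', 'null'::jsonb);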
So there are a lot of small things to learn when you work with it, but I think modern development is impossible without JSON, right?
Any other pieces of advice you have?
No... you mentioned that TOAST value threshold — it starts toasting things over the threshold. Yes, it can be configured.
If anybody wants to look that up, that's toast_tuple_target,
and there's a really
good blog post by Haki Benita
where I learned about that, about storing
medium-sized texts.
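For reference, it's a per-table storage parameter; a sketch on a hypothetical table (the default threshold is about 2 kilobytes):

    -- Lower the size at which Postgres starts compressing and moving
    -- long values out to the TOAST table (default is roughly 2 kB)
    ALTER TABLE accounts SET (toast_tuple_target = 256);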
But by default, if we
have JSON values much
less in size than
2 kilobytes, we should be fine.
Well, much
less, yes. If they're tiny, great.
If they're really big, there can be
performance issues, but he actually makes a really good point that there are really weird
performance penalties for medium sizes — things that are just slightly under that two-kilobyte
limit. So it's a really good blog post, quite long, but I highly recommend reading it.
Well, especially, I recommend an experiment-oriented approach. Take your schema, take your workload.
If you don't have workload yet, try to invent it, trying to predict the future. It's sometimes
possible. And just benchmark all aspects and pay attention to toasting and so on. Why not?
Absolutely.
I'm constantly learning — I don't know all the details of this reference of all functions related to JSON and JSONB data types.
An interesting thing that I usually use quite a lot: I use row_to_json and json_agg for
aggregation.
And an interesting small piece of advice is that you can ORDER BY inside aggregation. So you can have GROUP BY, for example,
and have json_agg, and there, inside the parentheses, you write ORDER BY, and it works.
That's so cool, I didn't know that.
And it works independently from the main ORDER BY clause in your
SELECT statement, so it's interesting. Well... sorry, this next part is not fully related, it's about arrays:
you can order by there too, and you can also have indexing — for example, I want only the first five members of this array.
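A sketch of both tips, assuming a hypothetical orders table:

    -- ORDER BY inside the aggregate, independent of the query's own ORDER BY
    SELECT customer_id,
           json_agg(item ORDER BY created_at DESC) AS items
    FROM orders
    GROUP BY customer_id
    ORDER BY customer_id;

    -- For arrays, you can also slice the result: only the first five elements
    SELECT (array_agg(item ORDER BY created_at DESC))[1:5] AS latest_items
    FROM orders;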
Ah, a very important thing — new people sometimes make a mistake here.
If you use the arrow operator — the regular, single arrow operator — on JSON and JSONB values, you will get
JSON or JSONB values as a result.
But sometimes you want text, and you will see double quotes: if you later convert
it to text, these double quotes will be preserved there.
And that's not good.
In this case — one of the things I wish I knew earlier, right? —
the double arrow operator will directly convert it to text,
eliminating the double quotes.
It's very handy.
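In short (literal values just for illustration):

    SELECT '{"name": "Ann"}'::jsonb ->  'name';  -- jsonb result: "Ann" (quotes preserved)
    SELECT '{"name": "Ann"}'::jsonb ->> 'name';  -- text result:  Ann  (no quotes)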
Yeah, that's a great tip.
Some awesome little tips at the end as well.
Bonus.
We can write an article now, right?
JSON and JSONB tips.
Yeah, that'd
be great. Actually, there was one last
blog post that I wanted to... So there's
a couple of good ones, one by ScaleGrid
that I'll share and one by the Heap team
that I'll share as well, that were good
and a bit further reading. Yeah, and
we should definitely share benchmark
results from Oleg Bartunov's talks.
The last was maybe a year ago or something.
There is a comparison with MongoDB there, also interesting.
And speaking of MongoDB, there is a new project called FerretDB, right?
It attempts to build a MongoDB-like database system on top of Postgres, which is interesting.
Is it compatible with Mongo?
I don't know details.
I only know that one of the founders
is Peter Zaitsev from Percona.
So a lot of experience should be brought there.
And I think it's an interesting project.
I checked it yesterday.
It looks very live.
So a lot of development happening.
So worth keeping an eye on it
for those who have mongodb preference in terms of work with
json documents and so on i'm curious about checking benchmarks for this new system it's like open
source mongodb because you know mongodb is not really open source anymore right they push everyone
to use atlas so this project has good chances
to convert some users to
Postgres, basically, right? So it's a good
thing. Yep.
Wonderful. Well, thank you very much,
Nikolai. Thank you, everybody, for joining us
and see you next week. Thanks, Michael.
Bye-bye.