Postgres FM - NULLs: the good, the bad, the ugly, and the unknown
Episode Date: August 5, 2022Here are links to a few things we mentioned: Three-valued_logic (Wikipedia)Postgres 15 improves UNIQUE and NULL (blog post by Ryan Lambert) Practical SQL for Data Analysis — Interpolation... (blog post by Haki Benita) What is the deal with NULLs? (blog post by Jeff Davis) NULL in SQL: Indicating the Absence of Data (Markus Winand on Modern SQL) The Art of PostgreSQL (book by Dimitri Fontaine) ------------------------What did you like or not like? What should we discuss next time? Let us know by tweeting us on @samokhvalov and @michristofidesIf you would like to share this episode, here's a good link (and thank you!)Postgres FM is brought to you by:Nikolay Samokhvalov, founder of Postgres.aiMichael Christofides, founder of pgMustardWith special thanks to:Jessie Draws for the amazing artworkÂ
Transcript
Discussion (0)
Hello and welcome to PostgresFM, a weekly show about all things PostgresQL.
I am Michael, I'm founder of PG Mustard, and this is my co-host Nikolai, founder of Postgres AI.
Hey Nikolai, what are we talking about today?
Hi Michael, so it's time to talk about various mistakes and I've noticed that many people,
from my conference experience and so on, I noticed that many people love to hear about
other people's mistakes. I don't know why they love it so much,
but I've noticed it's very popular when you talk about,
we made this mistake, it was so painful,
and people enjoy listening or watching about this.
Maybe they want to learn something, I don't know.
Yeah, I love it.
And there are two main reasons.
One is the hope that by
hearing about their mistake, I hopefully can avoid making the exact same one, avoid having to make
that myself. It's extra painful. And then the second reason I love it is it just makes you feel
better when you make mistakes, knowing that these amazing other people make mistakes all the time,
too. So I think that for me, there's that part of it as well right and and it's already fifth episode of our podcast and i've noticed we have
quite good numbers so far and i wanted to thank everyone who's sharing and liking and so on
please continue it's it helps we we see this and we also see comments in twitter feedback it's so
great to see yeah and thank you to everyone who suggested ways we can make it better as well.
Really appreciate it.
Keep topic suggestions coming too.
We'll try to get to quite a few of them over time.
Oh, and I also want to say a big hello to all those who are currently running or riding bicycle and so on.
It's good.
I also need to do it.
Okay, so mistakes.
If we talk about mistakes related to SQL and specifically Postgres,
I think the biggest, the most painful, at least in my career,
it's always related to nulls.
Nulls, this is the thing I hate most in SQL, in theory,
in SQL model, if you think about SQL model, in standard, in various
database systems and specifically in Postgres. So let's talk about Nulls today.
Awesome. And one of the reasons they're so painful is because the mistakes don't come
back right away, right? You don't necessarily hit an error message or you don't, you know, there's, yeah, that feels like
they kind of, they go invisible for a while. Silent.
So that, yeah, that's a better word for it. Yeah.
Yeah. Hidden, silent, and so on. This is like, if, of course, if you see the error, you can,
you can react, but if you don't see error, and now it's very, very tricky. Let's start with maybe some news about Postgres 15.
There will be new capability that will allow people to say that unique constraints should not distinguish nulls.
So there is only one in universe for some unique constraint.
By default, all nulls are distinguishable.
Like they are all different.
So we cannot say this null is the same as that null because null
means unknown in SQL logic.
SQL logic is based on three valued logic.
It's it's worth understanding what three valued logic is.
And again, if you don't know, please check out Wikipedia or other basic articles.
It's very fundamental knowledge everyone should possess.
So if you compare
to nulls, the result is always unknown, which actually translates to null if we like we
don't have unknown values, we have only null values. So like, now, compared to now, kind
of also now, however, strictly speaking, it's unknown. And that's why if you define a unique constraint,
you can insert as many nulls in this column as you want. This is, by the way, a big difference
between unique constraint and primary keys, because primary keys completely forbid nulls
in columns that participate in primary key. But unique constraint allow nulls.
So if it's one column unique constraint,
you can insert as many values of null as you want,
as many rows having null in this column as you want.
And this, in some cases, is not convenient to some developers. So that's why in Postgres 15, a long-awaited feature appeared.
So default behavior is the same.
You can insert as many nulls as you want.
But with new option, I don't
remember this option, but this option
allows you to say
that there can be only one null
here.
And this is like
kind of, I like that this
is explicit option, not a
setting you change.
Of course, it would be a nightmare and it would break all logic that standard dictates and so on.
It's good that it's explicit and only for those who need it will have it.
But it's dangerous, I guess. Yeah, I saw it.
Sorry, I saw a good blog post on this by Ryan Lambert maybe a month or so ago.
I'll share it in the show notes.
I thought it was super interesting,
but I was also struggling to come up with a use case for it.
I didn't really understand why you would want to be able
to determine that there was exactly,
or sorry, at most one null value in a column.
Did you ever come across it?
Well, yeah.
If you, I remember somewhere in some articles probably or some discussions,
we compared like when you start understanding three-valued logic,
you're kind of infected.
You always think about, okay, we have now unknowns.
If you are more database guy, you start teaching your backend and frontend developers,
colleagues, especially frontend colleagues who write a lot of JavaScript code that you know
unknown is unknown. You cannot return, like think it's about it as a zero or about it as an empty string.
So, and at some point, you probably want to create some table and use null as, like, unknown.
And you can imagine, for example, we have, for example, true, false, and unknown, and we want to restrict only three options.
It can be, for example, two-column constraint.
For each person, we can say, for example, it can be a poll results.
So each person can say, I agree, I disagree, or I don't vote in this question,
for example, right. And each person can say only once. So we
can have personal ID and result, right. And unknown result, we
can use nulls to record that the vote is like an unknown. But
it's actually it's breaking logic because in this case the third option
i i by purpose i don't want vote is is a kind of known result so strictly speaking i wouldn't do
this i would use for example minus one or something some people may choose nulls and
then they want to restrict and say only one result for each person is possible in this case you want unique constraint
and in this case you would want to have only one null possible for one person something like this
is what popped up in to my head initially like i can imagine this can be useful but i would avoid
it that makes sense i actually used to work at a company that did survey software and we did have
more than one negative value because
we had well for example you might have seen the question and skipped it versus you might never
have been presented the question and those are different um so yeah very interesting various
flavors of various meanings i i understand some some people need it and Postgres. This is probably not the biggest problem with us, actually.
Let's talk about other issues.
For example, I had a bad situation roughly 10 years ago.
It was my third one already, social media.
And I was completely sure that it's good and market is ready and we should launch with great success. And I remember we launched slowly in spring first,
and everything was ready to grow.
And we used a lot of mechanics, all social.
You might hate it or not, but when you check,
initiating some invitations without lying, of course,
we know the case LinkedIn had and so on, but it's a different topic. So everything was ready and I saw audience and likes it. But at
some point during summer, we experienced very low. I didn't understand why, because I see like,
if we talk about, if you talk to specific people, they are happy. They consider project very interesting.
And analytics shows us they use system to connect to friends and so on.
But somehow, registration don't grow as I expected, as I predicted.
And my experience was already quite good because it was already a third project.
I like, I understand how people work and so on,
but somehow it didn't grow.
And I remember in early September,
I went digging what's happening.
I went to code and I saw some select,
which I don't remember it was comparison
or some arithmetic operation,
but it was not null safe.
So the result was null,
but developer obviously didn't expect, result was unknown, implicitly converted to null, but developer didn't expect null to be present
there. And I just put coalesce there and we started to grow like a rocket during the next
several months. So by the end of year, it was huge success. So I just put one word
call us there.
Wow. So what was this doing? What part of it? Like, was it?
Was it about suggesting people add others? Or like, how did
that limit growth so much?
If you're interested, it's off topic for different
interests that usually all social networks do like this,
they offer you who which friends already use the service.
And when you do it, they suggest you to connect.
So back to Coraless.
My understanding of that is it returns the first non-null value that it sees.
So you can, for example, put Coraless and then the result you're expecting.
Zero, for example.
And then zero, yeah.
If you, for example, want to use arithmetic to make them null,
say if you need to use coalesce, use zero when there is no value.
Otherwise, whole result will become null
if at least one null is present among operands, right?
And this is exactly what happened.
Because of that, the thing didn't work at all.
And once they fixed it, it was like, it was so good.
I was happy.
I built the loss, and it was estimated it was like $30,000
during several months.
Loss of missing one word.
And that's a startup, right?
That could be bigger, a larger company.
Imagine it.
Of course, yeah.
So it's significant.
And of course, silent, as we were talking about before.
It doesn't fail spectacularly.
Yeah, it's just working, but not working well.
We had some registrations, but not enough.
You know what I thought about it?
It's not a this person mistake.
I would do the same easily.
And even if you have 20 years experience writing SQL as I do,
I completely 100% understand.
I realize it that I still have high chances to write some not null safe code.
This is the thing that you need always to keep in mind.
And it's so hard to just forget about it
and once again have not null safe operations.
And it's a dangerous part of SQL.
So I guess this takes us back to why they're your least favorite part.
Right.
So I like the flexibility because true or false,
it's not enough.
You need something unknown and it was added by purpose, but maybe in future we will
have some better tools that will, maybe it's already exists.
I don't know if our listeners can suggest something.
Maybe there are tools that can put some warnings and say, you know, this column might have
nulls and this operation is not null safe, probably you
will have unexpected results. It would be good to have such something like in CI CD checked all the
time we write code. But if you use ORM, which dynamically constructs SQL, it's probably not.
Well, it's an interesting topic, how to improve experience here. I'm 100% sure we're not in the perfect position in terms of nulls in relation to databases.
Yeah, that makes complete sense.
And tooling sounds interesting.
I wonder if it could be like some even static analysis on, well, I guess you might need to know that a column does contain nulls, but knowing that it could is enough, right?
So that's, yeah, it feels doable.
I'm not aware of anything myself.
So is the message is basically,
remember that something could be null
and handle it appropriately.
But try to remember is the best.
And maybe useful for code reviews as well.
I think sometimes in teams that I see,
there's sometimes more people
that are comfortable reviewing back-end code or
application code than there are reviewing database code and sometimes
as you know there's an expert on the team or somebody who's better at this
than most people perhaps sometimes their code isn't getting reviewed or you know
just remembering to review sequel as well well if you if you are reviewing uh of course this is one should be you should
have your in in your bag of reviewers tool set let's check if all operations are now safe
but i i think it's not right we should have the path that leads us to better results by
definitely not the perfect solution but for now i'm looking forward to hearing if there are good
tools for this but if there aren't and if neither of us are aware of them for now i guess that's what people
need to be doing yeah the developer has to think about it and also the reviewer right and we also
discussed only arithmetic operations in comparison but there are many many many more other places
where nos can can shoot off your legs or arms, I don't know.
For example, aggregates.
Some aggregates ignore NALs, some aggregates don't ignore NALs.
Some aggregates do it depending on the situation.
For example, if you do sum, again, like summarizing all values in a column,
NALs will be ignored, of course, right? Although
sum is based on plus operator, right? What would the difference be if they weren't ignored?
It would have to return null, I guess. Well, if you think about sum as first value plus second
value plus blah, blah, blah, if null is present, the result should be null as well. Okay, yeah, I see what you mean. But sum actually does col s, comma, zero, right?
So it ignores nulls.
So it's like null safe, right?
One people aren't probably aware of is count.
Count is interesting.
Count for column will also ignore null.
So it won't count rows where you have null for this column.
But if you say count star, all rows will be counted.
And if you say count table name from table name, it's possible.
Because you can count the whole row, considering the whole row, all its columns as one single value.
It's Postgres magic.
It's like a record type.
Again, nulls will be counted what about this is probably ridiculous
and definitely uh not a real case but if every column in the table was null so imagine a single
row with every column null yeah single yeah would that be counted by count tables okay wow yes yes
why no because i also discussed now some time ago, like two years ago, it was a YouTube live
stream.
And I said, you know, of course, if we do this, what you described, it won't be counted.
I do it and I see it's counted.
That's so funny.
Like it's so many unexpected parts when behavior is not as, like we could have some certifications
like null expert in Postgres.
And also
like array
agg, aggregation.
What do you think? Will nulls
be counted or
used? It'd be a good
quiz, wouldn't it? If you have a
chain of questions
specifically moving
to very bad places.
So it would be tricky to answer everything properly.
Yeah.
What grade would you give people at the end?
Would it be A, B, C, D, E, or no, maybe?
Or a no.
So what do you think about array aggregation, array agg?
Aggregate function.
I'm going to guess.
I'm going to play this game.
And I am going to say that it does not ignore nulls.
Yes, it will take them and use as members of array.
Exactly.
So you see, it's a very complex thing.
All of it is just one word.
That was genuinely a 50-50 guess, so I would definitely not have...
I didn't know that.
Also, disclaimer, I might be wrong.
What percentage sure are you?
As I've said, no matter how many years of experience you have,
you cannot be 100% right all the time with nulls.
It's impossible.
But the crucial part here is not being unaware of which one it is.
It's remembering that it might not be null safe in the first place.
So there might be unexpected behaviors when you're working with data that can include nulls.
So it's the same takeaway, isn't it? Remember that when
you're working with data that contains null values, or I'm not even sure if that's the right
way of saying it, that contains nulls, then remember to test, check, put some test data in,
add some realistic test data, and then check that your code's working as expected.
But the key part is remembering. Yeah, realistic testing. Sure. But I'm quite skeptical. I think
people really understand all the dangers related to now's only
when they experience it in production. So like learning
curve should bring you some real examples to hit you so badly
that you understand Oh, next time I should be very careful.
Yeah. Also, only recently I learned, for example, there is JSONB set function.
And it can be used to set some part of JSONB value to some different value.
So not be in need to override everything.
So you can say, along this path, I need to replace something.
And if you put, if you want, JSON also has nulls.
If you want to put null there and say JSON B provide some path array of keys,
and then say null, you will have whole value set to null unexpectedly again.
So you need to use single quote null converted to JSON B instead of regular SQL null.
So it's like everywhere danger.
When you say converted to, do you mean like the cast operator or what do you mean?
Yes, cast it to JSONB.
Null string converted to JSONB.
So in this case, it will be as expected.
You say you want to change the part of the whole value, not the whole.
So if I'm a team lead and I've got a few junior developers on my team
and I'd like them to learn a few things around this,
are there any good starting places?
I know you said they probably need to learn through experience,
but if I want to try and avoid that, what can I do?
Oh, it's a good question and probably a hard one.
I will, of course, do some training maybe
related to nulls. Of course, making sure everyone understands what three-valued logic is and so on,
some theoretical basics. And then some examples, I don't know, like it depends on the tools
for education you have. Of course, if you have some playground where some database exists and you can do some exercises,
it would be helpful to write some code and see how dangerous null can be.
But I don't have an excellent, simple answer here.
Again, as I've said, I'm very skeptical and think only when it hits you very badly,
breaking some functionality, only then you probably will learn that it's really dangerous.
Yeah.
I think I saw a really good article by Haki Benita that included an issue they came across that I will share in the show notes.
And I think you shared one from, is it Jeff Davis?
Jeff Davis, very old article and excellent collection of very unexpected examples, much more than we discussed here.
So yeah, let's also attach the link in show notes.
For sure.
I know it's difficult to ask people
to go away and read stuff,
but it feels like learning
that there even are these problems
gives you a chance of knowing.
I think probably sometimes
people don't even realize
this could be a problem.
I definitely only learned about Coral S quite late.
I knew quite a lot about SQL or SQL before learning that coalesce.
Anti-join example as well.
If you use a not in and then expect something, then have nulls inside in, you can also have
unexpected.
It's quite popular.
I think all articles we mentioned, they should discuss this example.
Anti-join also have some problems with nulls.
Same with no exists, I'm guessing.
Ah, here's the trick.
Like I always forget about it.
Actually, I, as I remember, not in is dangerous, but not exist is not.
Something like this.
But every time I deal with it, I'm refreshing memory and even the
experimenting or checking some blog posts or articles.
Unfortunately, again, it's a bag of knowledge you should carry all the time.
And it's easy to stop thinking about it if you don't use it every day.
So always test kids.
Yes.
So testing is perfect.
So if you think about something, just test it.
Yeah, actually one other. So in terms of learning resources, just test it. Yeah, actually, one other.
So in terms of learning resources,
there are a couple of others that I thought were great.
Probably always a good port of call is,
have you come across Marcus Winnand's Modern SQL?
Sure, long ago.
I think that's pretty great.
I'd be surprised if they don't have something around this.
They have it.
Yeah, awesome.
And the book Art of postgresql i think that's quite good in terms of like a mixture of beginner friendly and quite advanced topics should be somewhere behind the purple one
right um awesome so i think those those stand out to me as particularly good places to learn. Also, anything else on NOLS you wanted to talk about?
Well, last thing I would like to say, to add, you say like tested, and I say tested.
But I think many times when people start, it's hard for them to test properly, because
to create proper tests, you need to understand the depth of the issue and i wish some people like i so many times people
go to some chats or some places where they can get public help and ask questions and got reaction
like oh why you just don't test it i wish some people like it can be considered as negative
actually reaction but i agree that i don't know why the new people why they
don't ask can you help me to create a proper test so why I would I would see
myself yeah so please help me to test it all all aspects of this problem this
would be a great approach to ask questions when you start with some
topic I completely agree I think we could definitely be a great approach to ask questions when you start with some topic i completely agree i
think we could definitely be a little bit friendlier in some communities in the postgres
world to newcomers but equally i do also understand that we do get people that clearly haven't even
used google or even looked in the docs at the most appropriate place so i think it goes both ways, but for sure, teaching people how to share an online snippet or how to set a little test example in one of the fiddle tools or something like that, just to demonstrate their problem a little bit clearer, goes a long way.
And if you're asking for help, it definitely helps to show that you've tried something yourself first.
But I think we can all improve there for sure.
Okay, good. Nice one. i hope it was helpful to someone i hope so too quite basic material but as i've said even
you have 20 years of experience it still it still hurts hopefully this was a good public service
announcement at least and i think some of this is quite technical and quite difficult to do without
examples so the articles we share in the show notes might be extra helpful for this one. So I'll make sure to make those nice and clear.
Oh, one more thing I wanted to mention, I forgot. Since Postgres 11, we are in a better place.
Why? Because before Postgres 11, imagine if you have a billion row table and there are many nulls so so and you add some column which can be true false
and you have you think okay i want it to be strict i want to have not now there
so so only true or and or only false two options exclusive luck right but you know
yes updating billion rows it's it's nightmare i don't want to go with it. So I will consider nulls as false.
And this is what we always did because practically it was a big headache for us to update billion rows.
So we started to treat nulls as false.
And because practically it makes sense.
You avoid big problems.
But since Postgres 11, we can say default false.
That's it.
And it will be fast since Postgres 11, we can say default false. That's it. And it will be fast since Postgres 11.
Even you can say default false, not null.
What does that mean?
Yes, it's possible.
Oh, one minute.
So I think I understand.
Yeah.
So you can add a null.
You can just get result you want.
Only two options are possible.
All existing rows have false.
And null is prohibited in this column.
It's perfect.
By the way, the same trick allows you to redefine primary keys in SpotGrid 11,
but it's a different topic.
Let's discuss it another day, probably.
Awesome.
Well, thanks again, everybody, for listening.
Keep the feedback coming, and I hope you have a good week.
Yeah, don't forget to share us as much as possible.
We like it a lot.
Thank you so much.
Cheers.
Bye now.
See you.