Postgres FM - NULLs: the good, the bad, the ugly, and the unknown

Episode Date: August 5, 2022

Here are links to a few things we mentioned: Three-valued_logic (Wikipedia)Postgres 15 improves UNIQUE and NULL (blog post by Ryan Lambert) Practical SQL for Data Analysis — Interpolation... (blog post by Haki Benita) What is the deal with NULLs? (blog post by Jeff Davis) NULL in SQL: Indicating the Absence of Data (Markus Winand on Modern SQL) The Art of PostgreSQL (book by Dimitri Fontaine) ------------------------What did you like or not like? What should we discuss next time? Let us know by tweeting us on @samokhvalov and @michristofidesIf you would like to share this episode, here's a good link (and thank you!)Postgres FM is brought to you by:Nikolay Samokhvalov, founder of Postgres.aiMichael Christofides, founder of pgMustardWith special thanks to:Jessie Draws for the amazing artwork 

Transcript
Discussion (0)
Starting point is 00:00:00 Hello and welcome to PostgresFM, a weekly show about all things PostgresQL. I am Michael, I'm founder of PG Mustard, and this is my co-host Nikolai, founder of Postgres AI. Hey Nikolai, what are we talking about today? Hi Michael, so it's time to talk about various mistakes and I've noticed that many people, from my conference experience and so on, I noticed that many people love to hear about other people's mistakes. I don't know why they love it so much, but I've noticed it's very popular when you talk about, we made this mistake, it was so painful,
Starting point is 00:00:33 and people enjoy listening or watching about this. Maybe they want to learn something, I don't know. Yeah, I love it. And there are two main reasons. One is the hope that by hearing about their mistake, I hopefully can avoid making the exact same one, avoid having to make that myself. It's extra painful. And then the second reason I love it is it just makes you feel better when you make mistakes, knowing that these amazing other people make mistakes all the time,
Starting point is 00:01:01 too. So I think that for me, there's that part of it as well right and and it's already fifth episode of our podcast and i've noticed we have quite good numbers so far and i wanted to thank everyone who's sharing and liking and so on please continue it's it helps we we see this and we also see comments in twitter feedback it's so great to see yeah and thank you to everyone who suggested ways we can make it better as well. Really appreciate it. Keep topic suggestions coming too. We'll try to get to quite a few of them over time. Oh, and I also want to say a big hello to all those who are currently running or riding bicycle and so on.
Starting point is 00:01:40 It's good. I also need to do it. Okay, so mistakes. If we talk about mistakes related to SQL and specifically Postgres, I think the biggest, the most painful, at least in my career, it's always related to nulls. Nulls, this is the thing I hate most in SQL, in theory, in SQL model, if you think about SQL model, in standard, in various
Starting point is 00:02:08 database systems and specifically in Postgres. So let's talk about Nulls today. Awesome. And one of the reasons they're so painful is because the mistakes don't come back right away, right? You don't necessarily hit an error message or you don't, you know, there's, yeah, that feels like they kind of, they go invisible for a while. Silent. So that, yeah, that's a better word for it. Yeah. Yeah. Hidden, silent, and so on. This is like, if, of course, if you see the error, you can, you can react, but if you don't see error, and now it's very, very tricky. Let's start with maybe some news about Postgres 15. There will be new capability that will allow people to say that unique constraints should not distinguish nulls.
Starting point is 00:02:54 So there is only one in universe for some unique constraint. By default, all nulls are distinguishable. Like they are all different. So we cannot say this null is the same as that null because null means unknown in SQL logic. SQL logic is based on three valued logic. It's it's worth understanding what three valued logic is. And again, if you don't know, please check out Wikipedia or other basic articles.
Starting point is 00:03:20 It's very fundamental knowledge everyone should possess. So if you compare to nulls, the result is always unknown, which actually translates to null if we like we don't have unknown values, we have only null values. So like, now, compared to now, kind of also now, however, strictly speaking, it's unknown. And that's why if you define a unique constraint, you can insert as many nulls in this column as you want. This is, by the way, a big difference between unique constraint and primary keys, because primary keys completely forbid nulls in columns that participate in primary key. But unique constraint allow nulls.
Starting point is 00:04:07 So if it's one column unique constraint, you can insert as many values of null as you want, as many rows having null in this column as you want. And this, in some cases, is not convenient to some developers. So that's why in Postgres 15, a long-awaited feature appeared. So default behavior is the same. You can insert as many nulls as you want. But with new option, I don't remember this option, but this option
Starting point is 00:04:32 allows you to say that there can be only one null here. And this is like kind of, I like that this is explicit option, not a setting you change. Of course, it would be a nightmare and it would break all logic that standard dictates and so on.
Starting point is 00:04:53 It's good that it's explicit and only for those who need it will have it. But it's dangerous, I guess. Yeah, I saw it. Sorry, I saw a good blog post on this by Ryan Lambert maybe a month or so ago. I'll share it in the show notes. I thought it was super interesting, but I was also struggling to come up with a use case for it. I didn't really understand why you would want to be able to determine that there was exactly,
Starting point is 00:05:24 or sorry, at most one null value in a column. Did you ever come across it? Well, yeah. If you, I remember somewhere in some articles probably or some discussions, we compared like when you start understanding three-valued logic, you're kind of infected. You always think about, okay, we have now unknowns. If you are more database guy, you start teaching your backend and frontend developers,
Starting point is 00:05:53 colleagues, especially frontend colleagues who write a lot of JavaScript code that you know unknown is unknown. You cannot return, like think it's about it as a zero or about it as an empty string. So, and at some point, you probably want to create some table and use null as, like, unknown. And you can imagine, for example, we have, for example, true, false, and unknown, and we want to restrict only three options. It can be, for example, two-column constraint. For each person, we can say, for example, it can be a poll results. So each person can say, I agree, I disagree, or I don't vote in this question, for example, right. And each person can say only once. So we
Starting point is 00:06:50 can have personal ID and result, right. And unknown result, we can use nulls to record that the vote is like an unknown. But it's actually it's breaking logic because in this case the third option i i by purpose i don't want vote is is a kind of known result so strictly speaking i wouldn't do this i would use for example minus one or something some people may choose nulls and then they want to restrict and say only one result for each person is possible in this case you want unique constraint and in this case you would want to have only one null possible for one person something like this is what popped up in to my head initially like i can imagine this can be useful but i would avoid
Starting point is 00:07:37 it that makes sense i actually used to work at a company that did survey software and we did have more than one negative value because we had well for example you might have seen the question and skipped it versus you might never have been presented the question and those are different um so yeah very interesting various flavors of various meanings i i understand some some people need it and Postgres. This is probably not the biggest problem with us, actually. Let's talk about other issues. For example, I had a bad situation roughly 10 years ago. It was my third one already, social media.
Starting point is 00:08:16 And I was completely sure that it's good and market is ready and we should launch with great success. And I remember we launched slowly in spring first, and everything was ready to grow. And we used a lot of mechanics, all social. You might hate it or not, but when you check, initiating some invitations without lying, of course, we know the case LinkedIn had and so on, but it's a different topic. So everything was ready and I saw audience and likes it. But at some point during summer, we experienced very low. I didn't understand why, because I see like, if we talk about, if you talk to specific people, they are happy. They consider project very interesting.
Starting point is 00:09:06 And analytics shows us they use system to connect to friends and so on. But somehow, registration don't grow as I expected, as I predicted. And my experience was already quite good because it was already a third project. I like, I understand how people work and so on, but somehow it didn't grow. And I remember in early September, I went digging what's happening. I went to code and I saw some select,
Starting point is 00:09:36 which I don't remember it was comparison or some arithmetic operation, but it was not null safe. So the result was null, but developer obviously didn't expect, result was unknown, implicitly converted to null, but developer didn't expect null to be present there. And I just put coalesce there and we started to grow like a rocket during the next several months. So by the end of year, it was huge success. So I just put one word call us there.
Starting point is 00:10:08 Wow. So what was this doing? What part of it? Like, was it? Was it about suggesting people add others? Or like, how did that limit growth so much? If you're interested, it's off topic for different interests that usually all social networks do like this, they offer you who which friends already use the service. And when you do it, they suggest you to connect. So back to Coraless.
Starting point is 00:10:33 My understanding of that is it returns the first non-null value that it sees. So you can, for example, put Coraless and then the result you're expecting. Zero, for example. And then zero, yeah. If you, for example, want to use arithmetic to make them null, say if you need to use coalesce, use zero when there is no value. Otherwise, whole result will become null if at least one null is present among operands, right?
Starting point is 00:11:03 And this is exactly what happened. Because of that, the thing didn't work at all. And once they fixed it, it was like, it was so good. I was happy. I built the loss, and it was estimated it was like $30,000 during several months. Loss of missing one word. And that's a startup, right?
Starting point is 00:11:27 That could be bigger, a larger company. Imagine it. Of course, yeah. So it's significant. And of course, silent, as we were talking about before. It doesn't fail spectacularly. Yeah, it's just working, but not working well. We had some registrations, but not enough.
Starting point is 00:11:43 You know what I thought about it? It's not a this person mistake. I would do the same easily. And even if you have 20 years experience writing SQL as I do, I completely 100% understand. I realize it that I still have high chances to write some not null safe code. This is the thing that you need always to keep in mind. And it's so hard to just forget about it
Starting point is 00:12:08 and once again have not null safe operations. And it's a dangerous part of SQL. So I guess this takes us back to why they're your least favorite part. Right. So I like the flexibility because true or false, it's not enough. You need something unknown and it was added by purpose, but maybe in future we will have some better tools that will, maybe it's already exists.
Starting point is 00:12:33 I don't know if our listeners can suggest something. Maybe there are tools that can put some warnings and say, you know, this column might have nulls and this operation is not null safe, probably you will have unexpected results. It would be good to have such something like in CI CD checked all the time we write code. But if you use ORM, which dynamically constructs SQL, it's probably not. Well, it's an interesting topic, how to improve experience here. I'm 100% sure we're not in the perfect position in terms of nulls in relation to databases. Yeah, that makes complete sense. And tooling sounds interesting.
Starting point is 00:13:12 I wonder if it could be like some even static analysis on, well, I guess you might need to know that a column does contain nulls, but knowing that it could is enough, right? So that's, yeah, it feels doable. I'm not aware of anything myself. So is the message is basically, remember that something could be null and handle it appropriately. But try to remember is the best. And maybe useful for code reviews as well.
Starting point is 00:13:36 I think sometimes in teams that I see, there's sometimes more people that are comfortable reviewing back-end code or application code than there are reviewing database code and sometimes as you know there's an expert on the team or somebody who's better at this than most people perhaps sometimes their code isn't getting reviewed or you know just remembering to review sequel as well well if you if you are reviewing uh of course this is one should be you should have your in in your bag of reviewers tool set let's check if all operations are now safe
Starting point is 00:14:14 but i i think it's not right we should have the path that leads us to better results by definitely not the perfect solution but for now i'm looking forward to hearing if there are good tools for this but if there aren't and if neither of us are aware of them for now i guess that's what people need to be doing yeah the developer has to think about it and also the reviewer right and we also discussed only arithmetic operations in comparison but there are many many many more other places where nos can can shoot off your legs or arms, I don't know. For example, aggregates. Some aggregates ignore NALs, some aggregates don't ignore NALs.
Starting point is 00:14:53 Some aggregates do it depending on the situation. For example, if you do sum, again, like summarizing all values in a column, NALs will be ignored, of course, right? Although sum is based on plus operator, right? What would the difference be if they weren't ignored? It would have to return null, I guess. Well, if you think about sum as first value plus second value plus blah, blah, blah, if null is present, the result should be null as well. Okay, yeah, I see what you mean. But sum actually does col s, comma, zero, right? So it ignores nulls. So it's like null safe, right?
Starting point is 00:15:32 One people aren't probably aware of is count. Count is interesting. Count for column will also ignore null. So it won't count rows where you have null for this column. But if you say count star, all rows will be counted. And if you say count table name from table name, it's possible. Because you can count the whole row, considering the whole row, all its columns as one single value. It's Postgres magic.
Starting point is 00:16:00 It's like a record type. Again, nulls will be counted what about this is probably ridiculous and definitely uh not a real case but if every column in the table was null so imagine a single row with every column null yeah single yeah would that be counted by count tables okay wow yes yes why no because i also discussed now some time ago, like two years ago, it was a YouTube live stream. And I said, you know, of course, if we do this, what you described, it won't be counted. I do it and I see it's counted.
Starting point is 00:16:35 That's so funny. Like it's so many unexpected parts when behavior is not as, like we could have some certifications like null expert in Postgres. And also like array agg, aggregation. What do you think? Will nulls be counted or
Starting point is 00:16:55 used? It'd be a good quiz, wouldn't it? If you have a chain of questions specifically moving to very bad places. So it would be tricky to answer everything properly. Yeah. What grade would you give people at the end?
Starting point is 00:17:11 Would it be A, B, C, D, E, or no, maybe? Or a no. So what do you think about array aggregation, array agg? Aggregate function. I'm going to guess. I'm going to play this game. And I am going to say that it does not ignore nulls. Yes, it will take them and use as members of array.
Starting point is 00:17:38 Exactly. So you see, it's a very complex thing. All of it is just one word. That was genuinely a 50-50 guess, so I would definitely not have... I didn't know that. Also, disclaimer, I might be wrong. What percentage sure are you? As I've said, no matter how many years of experience you have,
Starting point is 00:18:03 you cannot be 100% right all the time with nulls. It's impossible. But the crucial part here is not being unaware of which one it is. It's remembering that it might not be null safe in the first place. So there might be unexpected behaviors when you're working with data that can include nulls. So it's the same takeaway, isn't it? Remember that when you're working with data that contains null values, or I'm not even sure if that's the right way of saying it, that contains nulls, then remember to test, check, put some test data in,
Starting point is 00:18:36 add some realistic test data, and then check that your code's working as expected. But the key part is remembering. Yeah, realistic testing. Sure. But I'm quite skeptical. I think people really understand all the dangers related to now's only when they experience it in production. So like learning curve should bring you some real examples to hit you so badly that you understand Oh, next time I should be very careful. Yeah. Also, only recently I learned, for example, there is JSONB set function. And it can be used to set some part of JSONB value to some different value.
Starting point is 00:19:14 So not be in need to override everything. So you can say, along this path, I need to replace something. And if you put, if you want, JSON also has nulls. If you want to put null there and say JSON B provide some path array of keys, and then say null, you will have whole value set to null unexpectedly again. So you need to use single quote null converted to JSON B instead of regular SQL null. So it's like everywhere danger. When you say converted to, do you mean like the cast operator or what do you mean?
Starting point is 00:19:53 Yes, cast it to JSONB. Null string converted to JSONB. So in this case, it will be as expected. You say you want to change the part of the whole value, not the whole. So if I'm a team lead and I've got a few junior developers on my team and I'd like them to learn a few things around this, are there any good starting places? I know you said they probably need to learn through experience,
Starting point is 00:20:17 but if I want to try and avoid that, what can I do? Oh, it's a good question and probably a hard one. I will, of course, do some training maybe related to nulls. Of course, making sure everyone understands what three-valued logic is and so on, some theoretical basics. And then some examples, I don't know, like it depends on the tools for education you have. Of course, if you have some playground where some database exists and you can do some exercises, it would be helpful to write some code and see how dangerous null can be. But I don't have an excellent, simple answer here.
Starting point is 00:20:55 Again, as I've said, I'm very skeptical and think only when it hits you very badly, breaking some functionality, only then you probably will learn that it's really dangerous. Yeah. I think I saw a really good article by Haki Benita that included an issue they came across that I will share in the show notes. And I think you shared one from, is it Jeff Davis? Jeff Davis, very old article and excellent collection of very unexpected examples, much more than we discussed here. So yeah, let's also attach the link in show notes. For sure.
Starting point is 00:21:28 I know it's difficult to ask people to go away and read stuff, but it feels like learning that there even are these problems gives you a chance of knowing. I think probably sometimes people don't even realize this could be a problem.
Starting point is 00:21:39 I definitely only learned about Coral S quite late. I knew quite a lot about SQL or SQL before learning that coalesce. Anti-join example as well. If you use a not in and then expect something, then have nulls inside in, you can also have unexpected. It's quite popular. I think all articles we mentioned, they should discuss this example. Anti-join also have some problems with nulls.
Starting point is 00:22:06 Same with no exists, I'm guessing. Ah, here's the trick. Like I always forget about it. Actually, I, as I remember, not in is dangerous, but not exist is not. Something like this. But every time I deal with it, I'm refreshing memory and even the experimenting or checking some blog posts or articles. Unfortunately, again, it's a bag of knowledge you should carry all the time.
Starting point is 00:22:30 And it's easy to stop thinking about it if you don't use it every day. So always test kids. Yes. So testing is perfect. So if you think about something, just test it. Yeah, actually one other. So in terms of learning resources, just test it. Yeah, actually, one other. So in terms of learning resources, there are a couple of others that I thought were great.
Starting point is 00:22:50 Probably always a good port of call is, have you come across Marcus Winnand's Modern SQL? Sure, long ago. I think that's pretty great. I'd be surprised if they don't have something around this. They have it. Yeah, awesome. And the book Art of postgresql i think that's quite good in terms of like a mixture of beginner friendly and quite advanced topics should be somewhere behind the purple one
Starting point is 00:23:16 right um awesome so i think those those stand out to me as particularly good places to learn. Also, anything else on NOLS you wanted to talk about? Well, last thing I would like to say, to add, you say like tested, and I say tested. But I think many times when people start, it's hard for them to test properly, because to create proper tests, you need to understand the depth of the issue and i wish some people like i so many times people go to some chats or some places where they can get public help and ask questions and got reaction like oh why you just don't test it i wish some people like it can be considered as negative actually reaction but i agree that i don't know why the new people why they don't ask can you help me to create a proper test so why I would I would see
Starting point is 00:24:12 myself yeah so please help me to test it all all aspects of this problem this would be a great approach to ask questions when you start with some topic I completely agree I think we could definitely be a great approach to ask questions when you start with some topic i completely agree i think we could definitely be a little bit friendlier in some communities in the postgres world to newcomers but equally i do also understand that we do get people that clearly haven't even used google or even looked in the docs at the most appropriate place so i think it goes both ways, but for sure, teaching people how to share an online snippet or how to set a little test example in one of the fiddle tools or something like that, just to demonstrate their problem a little bit clearer, goes a long way. And if you're asking for help, it definitely helps to show that you've tried something yourself first. But I think we can all improve there for sure.
Starting point is 00:25:06 Okay, good. Nice one. i hope it was helpful to someone i hope so too quite basic material but as i've said even you have 20 years of experience it still it still hurts hopefully this was a good public service announcement at least and i think some of this is quite technical and quite difficult to do without examples so the articles we share in the show notes might be extra helpful for this one. So I'll make sure to make those nice and clear. Oh, one more thing I wanted to mention, I forgot. Since Postgres 11, we are in a better place. Why? Because before Postgres 11, imagine if you have a billion row table and there are many nulls so so and you add some column which can be true false and you have you think okay i want it to be strict i want to have not now there so so only true or and or only false two options exclusive luck right but you know
Starting point is 00:25:59 yes updating billion rows it's it's nightmare i don't want to go with it. So I will consider nulls as false. And this is what we always did because practically it was a big headache for us to update billion rows. So we started to treat nulls as false. And because practically it makes sense. You avoid big problems. But since Postgres 11, we can say default false. That's it. And it will be fast since Postgres 11, we can say default false. That's it. And it will be fast since Postgres 11.
Starting point is 00:26:26 Even you can say default false, not null. What does that mean? Yes, it's possible. Oh, one minute. So I think I understand. Yeah. So you can add a null. You can just get result you want.
Starting point is 00:26:37 Only two options are possible. All existing rows have false. And null is prohibited in this column. It's perfect. By the way, the same trick allows you to redefine primary keys in SpotGrid 11, but it's a different topic. Let's discuss it another day, probably. Awesome.
Starting point is 00:26:54 Well, thanks again, everybody, for listening. Keep the feedback coming, and I hope you have a good week. Yeah, don't forget to share us as much as possible. We like it a lot. Thank you so much. Cheers. Bye now. See you.

There aren't comments yet for this episode. Click on any sentence in the transcript to leave a comment.