Postgres FM - RegreSQL

Episode Date: January 16, 2026

Nik and Michael are joined by Radim Marek from boringSQL to talk about RegreSQL, a regression testing tool for SQL queries he forked and improved recently. Here are some links to things they mentioned:

- Radim Marek https://postgres.fm/people/radim-marek
- boringSQL https://boringsql.com
- RegreSQL: Regression Testing for PostgreSQL Queries (blog post by Radim) https://boringsql.com/posts/regresql-testing-queries
- Discussion on Hacker News https://news.ycombinator.com/item?id=45924619
- Radim's fork of RegreSQL on GitHub https://github.com/boringSQL/regresql
- Original RegreSQL on GitHub (by Dimitri Fontaine) https://github.com/dimitri/regresql
- The Art of PostgreSQL (book) https://theartofpostgresql.com
- How to make the non-production Postgres planner behave like in production (how-to post by Nik) https://postgres.ai/docs/postgres-howtos/performance-optimization/query-tuning/how-to-imitate-production-planner
- Just because you're getting an index scan, doesn't mean you can't do better! (blog post by Michael) https://www.pgmustard.com/blog/index-scan-doesnt-mean-its-fast
- boringSQL Labs https://labs.boringsql.com

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
- Michael Christofides, founder of pgMustard
- Nikolay Samokhvalov, founder of Postgres.ai

With credit to:
- Jessie Draws for the elephant artwork

Transcript
Starting point is 00:00:00 Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard, and I'm joined as usual by Nik, founder of Postgres AI. Hey, Nik. Hi, Michael. And we also have Radim Marek with us, who is a consultant and Postgres enthusiast, whose boringSQL site we have mentioned many times on the show. Welcome, Radim. Hello, guys.
Starting point is 00:00:20 Thank you for having me. Thank you for coming. Yeah, absolutely. So we wanted to talk about the newest project you have kicked off, called RegreSQL, about regression testing SQL queries. Very cool. It got quite a lot of attention on Hacker News, a lively discussion on there. Congrats.
Starting point is 00:00:41 But also, I'd be interested in discussing with you how you got into the topic, like regression testing in general. Where would you like to start? Okay. So that's the topic to start with. So effectively, it wasn't my project, or it didn't start as my project. It actually started as a complementary project to a book called The Art of PostgreSQL, which I still believe is one of the best things you can do to learn how to work with Postgres. Dimitri, right?
Starting point is 00:01:11 Yes, Dimitri. And I've been trying to get in touch with him forever, like the last four years, but I'm not able to. So maybe if he hears this podcast, I will have a chance to talk to him. So I did start with the project because I was kind of a bit worried about how people deal with SQL queries in their code, because usually it's either generated by an ORM and you don't see it, or it was a string hard-coded, in my case in Go files, or you can see it in TypeScript files. And all the testing around this originally kind of bothered me, because you have infrastructure which was difficult to set up and everything. And I said, there must be a simpler way, something where actually running a simple query doesn't take minutes or seconds to do all the infrastructure
Starting point is 00:02:04 tear up, tear down. And I found RegreSQL that is doing this job. And I tried to fix the project and, fast forward, I think four years, because it's actually quite a long time ago, I found other use cases, because I'm working on some educational products and I needed to test whether users are hitting the metrics that I'm interested in. And then I kind of figured out, yes, this is actually where it can come in, because it was comparing the output, the correctness of the query. And that was only a step away, because then I realized, okay, every single incident is based on increased time of SQL queries.
Starting point is 00:02:47 And that was just a, you know, small step to go from something that is unpredictable, because time is unpredictable, to something predictable, and this is where I started. So I started committing, I started adjusting it. And right now I'm actually working on a version 2.0, which is a big update in terms of functionality and in terms of developer experience. But how do you approach the testing? So I understand the tool.
Starting point is 00:03:16 It's for regression testing. It's super important these days, especially because there are no more excuses not to have 100% coverage of tests, and proper coverage, not just good-looking coverage, and performance testing in CI, before your AI assistant commits and pushes code; that should happen before, probably. How do we approach the fact that we need a lot of data to have proper plans? Okay, so this is effectively a never-ending topic, how developers can even get to something which is representative of their database. I actually call this one.
Starting point is 00:03:55 I think that happened somewhere on Hacker News with somebody from Supabase. We came to a topic which was called It Works on My Database. That's the same issue. But to answer your question: as I said, initially I naively thought I can do it with time, but time is all over the place and that wasn't working. So the only way to approach this one is buffers and cost. Let's pause here, because I think it's super important. I think the industry doesn't understand it yet.
Starting point is 00:04:27 We chatted about buffers with Michael two years ago. Finally, Postgres 18 has it by default in EXPLAIN ANALYZE. This is great, like absolutely great. So the industry doesn't realize how important it is to shift focus from timing to buffers still, right? Why is timing not enough? Why is it not so good when you want controlled performance? What's your point of view on this? Okay, so timing.
Starting point is 00:04:52 Timing is a perfect metric because this is what you see. If you come in the middle of an incident, timing is actually the easiest metric to spot. Queries are timing out. Hopefully all of us have, you know, statement timeouts, transaction timeouts, everything in place. And humans need good timing. This is our final goal, right? Exactly. But the problem is, A, if you want something fast,
Starting point is 00:05:17 EXPLAIN alone doesn't give you timing, because that's just what the planner thinks, and EXPLAIN ANALYZE can't give you something that is predictable. So if you want to do regression testing, you can't rely on it if you are running on, I don't know,
Starting point is 00:05:33 an underutilized AMD machine or an ARM machine, or, exactly, or, you know, everybody these days, we have Apple Silicon Macs and they are super fast, their I/O is fast. So timing is generally, that's the...
Starting point is 00:05:50 It doesn't work in CI. It doesn't, yes. Well, it doesn't even work in production, because you will get a big spread of queries. You still have outliers. One customer can hit 10,000 rows. One customer hits five rows. How do you... Also locking, waits on lock acquisition as well.
Starting point is 00:06:08 Yeah, yeah. That's great. Yeah. I think it's even worse in CI, because flaky tests are a nightmare. So even if, for example, we accepted double the time, if we said, like, fail this test if it's double the time, and that worked 99% of runs,
Starting point is 00:06:30 when it fails, but for a bad reason, when it fails, it's picked the same plan, no indexing changes, nothing has actually changed, but there was just something else happening on the machine at that moment, and it kind of blitzed past its P99. We got that one time it was slow and the test fails. That's now a flaky test, a test
Starting point is 00:06:49 that sometimes fails when it shouldn't, like false positives, and those are always the nightmare tests. Before we proceed to the solution: we all agree buffers, right? Still, let's pause. Maybe I have a crazy idea. I've been thinking several years about this and maybe it's time to
Starting point is 00:07:05 implement it. Probably it's a better topic for our hacking sessions with Kirk and Andrei we usually have on Wednesdays, live. But your opinion, I'm very interested in hearing your opinion. What if we still keep analyzing timing, but, imagine, you know, wait events, right? Which usually are used in an aggregated form, like how many active sessions, aggregated and so on.
Starting point is 00:07:28 What if that analysis would be brought to EXPLAIN ANALYZE, to understand how this timing was spent exactly? Was it heavyweight lock acquisition waiting, or was I/O too slow this time? In this case, wouldn't it help to structure the whole latency, to split it into pieces, and in the CI context extract the pieces which are unpredictable: okay, this doesn't matter, waiting time, it doesn't matter, waiting on lock acquisition; but here we see a lot of actual CPU, non-waiting. But there is a trick there. But anyway, wouldn't it help to still keep timing as the main metric for regression testing in CI for performance? I believe in theory that would work, but for me, the complexity behind this is already covered on the cost and buffer level. So I'm not sure if the benefit would actually be having better metrics, because what I found out is, and we can get to the discussion, you know, buffers are fixed if you have a predictable data set that you start with
Starting point is 00:08:44 and you have an ideal state scenario. I think that's the important part, ideal state, because production is never in an ideal state. You will have bloat, you will have a... So, you know, my regression testing is never going to be much a... Exactly, everything is changing. So it depends. But if you want to do regression testing,
Starting point is 00:09:02 you just need a metric, one or two of them, which are predictable. Buffers are always predictable. Yeah. It's just so, so interesting, because you came and said what I was, like, shouting about at every event, basically. Buffers. I even have a T-shirt with EXPLAIN (ANALYZE, BUFFERS), where BUFFERS is bold. By the way, it can depend a little bit, unfortunately.
Starting point is 00:09:26 It depends what you mean by buffers, but if you include planning buffers, which are sometimes reported, there are states where, for example, if autovacuum just ran, or if ANALYZE just ran, you might end up with a different number of planning buffers than if it hasn't run recently. So there are slight variations. So yeah, it might not be perfect. But cost is an interesting one for sure.
Starting point is 00:09:53 So, okay, what I'm going to do: I will postpone my comments about why not only buffers, and maybe let's return to timing and discuss it later. But let's agree with you: buffers is the thing that doesn't depend on the state so much. So we just check hits and reads of shared buffers, also temporary, like everything. It's I/O. Databases are all about I/O, right?
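As an aside for readers: the hits and reads discussed here appear as "Shared Hit Blocks" and "Shared Read Blocks" in EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) output. A minimal sketch of extracting the total; the sample plan and its numbers are invented for illustration:

```python
import json

# Invented sample resembling EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) output;
# the node type and counter values are made up for illustration.
SAMPLE = json.dumps([{
    "Plan": {
        "Node Type": "Index Scan",
        "Shared Hit Blocks": 120,
        "Shared Read Blocks": 8,
    }
}])

def shared_blocks(explain_json: str) -> int:
    """Total shared-buffer I/O (hits + reads) for the whole query.

    The top-level plan node's buffer counters already include its
    children, so there is no need to walk the tree.
    """
    top = json.loads(explain_json)[0]["Plan"]
    return top.get("Shared Hit Blocks", 0) + top.get("Shared Read Blocks", 0)

print(shared_blocks(SAMPLE))  # 128
```

On a cloned data set with fresh statistics, this number is stable across runs and across machines, which is exactly what makes it usable as a regression baseline.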
Starting point is 00:10:18 But still the question remains: your approach, like, you need a lot of data to reproduce the executor behavior, so you will see the proper number of buffers, right? You cannot do it with smaller database sizes. Yes, that's the part: it works on my database, because developers, they will start with hundreds of rows. Cardinality of data is going to be very small. And I haven't seen a single company that wouldn't have a problem with how to get a kind of representative set of data to developers' machines.
Starting point is 00:10:52 Well, I know there are attempts, but if you take an average software team, they usually have a huge issue with this one. So my issue, and that's actually partly a reaction to Michael's comment, is I do always have to start with a clean slate. So for me, this is not production, so I can count on the way you run it and that statistics are up to date. The original solution, which I described, fixtures, they are the obvious solution. So, you know, I will always have to improve the ways to build the fixtures. But the way to do it, and this is actually my roadmap towards version 2.0, is to do snapshot building. You probably, well, you actually have seen it, because you have discussed it:
Starting point is 00:11:37 that instant database clones article wasn't a coincidence. It was actually, for me, a need: how to get a fast starting state for the regression testing, and I needed something that works. This is why I had to dig, and then I actually,
Starting point is 00:11:53 same as you guys, I was surprised it's already there, you know, because it wasn't something I was aware actually exists in version 18, but suddenly everything was done. So it's not for production use, which you mention correctly, because there are still issues. But if you fix an easy-to-get starting point, if you imagine you have a predictable build, which actually is coming, it's now already committed, but it's coming for version 2.0 as part of the developer experience.
Starting point is 00:12:25 And you have fresh statistics. You can run the query. And your Postgres configuration is actually a very interesting part, because you can change the cost, how it's calculated, but you will always get the same buffers. Exactly. If you talk about hits plus reads, the distribution will probably be different. Also, hits might be hitting the same buffer multiple times. So there are nuances there. But the total, it's like an invariant. It's always the same.
Starting point is 00:12:55 And, Nik, you said it. We, all of us, came from spindle disks. So you remember times when you tried to hit Control-C on the query, and, you know, it was just flushing WAL files to the disk. And we can say that I/O is cheap these days, but it's not. It's still the main metric that you have to watch. And I/O, whenever there's contention, will affect time. So for me, suddenly buffers is number one.
Starting point is 00:13:24 So there is nothing more predictable. Cost is very close. And I actually did, I haven't done scientific research on the topic, but if you take one version of Postgres and you run it through different architectures with the same settings, the variance is surprisingly low. There is a specific set of settings, because you can have much less memory, shared buffers can be very small; you just need to set a proper effective_cache_size, all the proper planner settings, and work_mem. We have an article about it. And then, I agree with you, it will be the same planner behavior
Starting point is 00:13:59 for the same version. That's it. It's great. That's magic. You can. Exactly. You can have a Raspberry Pi and repeat the behavior of a huge, like, 800-core machine. It's insane. Exactly. This is what I did. I actually took my old Raspberry Pi, I took a Hetzner ARM machine, I took a GCP machine, and the difference was something around 2%. So right now I have a threshold of 10% by default, but I'm looking forward to actually making it configurable. That's great. And let people decide. And then the answer to all your initial questions is: I don't have to care about timing.
Starting point is 00:14:37 Because in production, if you would have 1,000 transactions per second, I will get saturated, and that will naturally raise the timing. So if you care about buffers, you don't have to care about timing, because timing is just, you know, an extrapolation of problems that happened before. Exactly. Some people say, like, let's do proper full-fledged benchmarking just to study the behavior of planning for a few queries.
Starting point is 00:15:05 It's a wrong idea, because you have so many moving parts around. You just narrow down to a single plan, a single session, right, and focus on what matters. But here, I had that question before. So you mentioned CREATE DATABASE in Postgres 18, you mean the strategy, right? When you can do a copy-on-write clone, when you have CREATE DATABASE which clones a template database, not in the regular way, but basically very fast because of copy-on-write. Does it mean you use ZFS or something, Btrfs, or what?
Starting point is 00:15:39 Well, it depends what you want. You know, I'm not forcing people to use anything. I'm just offering an option if you want files, because, you know, another part of the frustration was that everything in your integration tests that touches the database is slow. Yeah. And the way people normally use it, it's slow, but, you know, running most of the queries is like literally milliseconds. Yeah, yeah. And we have Neon branching, we have other copy-on-write branching, so that's great. And here, if you compare, like, do you know about DBLab or not? Yes, I know. Yeah. So I'm comparing the various approaches, and this approach, like, when we have multiple logical databases in one physical cluster,
Starting point is 00:16:31 It's great, because this is exactly what we try to achieve with DBLab, but not many people realize what you just said. I hope with this episode, maybe, more people will realize. And the thing is that if you allocate this machine, you can have a huge database. This wonderful Postgres 18 feature, if you use ZFS, we don't care about time anymore, right? Almost, right? So we focus on buffers. It's fast, and Postgres doesn't know which file system we have, right? And then you can achieve a constant price for many, many tests in parallel. Unlike Neon or Aurora thin cloning, where you need to pay.
Starting point is 00:17:05 If you run 10 tests in parallel, for example, you can have 10 ideas from an LLM, you need to pay 10 times more for compute. Yes, storage is shared, but compute is not. Here, with CREATE DATABASE, 10 databases in parallel, it's great. The only thing compared to DBLab that is not possible: you cannot upgrade, you cannot have a major upgrade inside a clone and have independent versions in parallel, you know. But maybe some more differences. In general, it's great that it comes to Postgres
Starting point is 00:17:34 naturally. This feature, it's a great feature. I hope it will get traction in the AI environments. I think it's actually great to have this natively. I'm not saying it's perfect. It's just an option. You know, nobody is forced into third-party solutions;
Starting point is 00:17:51 everybody will pick whatever they want. And I'm not against Neon or anything. I think you evaluate your requirements and you say what's the best approach you want to take. Well, with Neon, it's also the same. You can just pay for compute, but also you can put it into CI. And the thing is that you need to focus on buffers,
Starting point is 00:18:09 because a cold startup will be different, right? But we don't care anymore if we focus on buffers. That's magic, right? That's great. And cost, also cost, because this is what shows how the planner thinks. The only problem is there are some queries which are out of this methodology. For example, people who deal with RLS, they know this problem, right?
Starting point is 00:18:33 If you have a query counting a million rows and there is a hidden filter involving current_setting, it's going to be executed for each row. And it's not buffers, it's pure CPU time. But these are exceptions, right? We have only a limited number of them. Right, exactly. I would say before
Starting point is 00:18:55 hitting that problem, yeah, there are 99% of problems it can solve much easier. I agree, it's not all of them. And still, you can troubleshoot it on this machine, because getting those million rows is much faster
Starting point is 00:19:06 if you have this copy-on-write. And your methodology, I mean, what you said, it doesn't mean you should stop thinking about timing, right? You still can think about timing, right? You can. It will hit you. That's what's going to hit you in production at the end of the day, because this is what the user perceives when they render the page, waiting for the output or something. But it's not predictable. Yeah. We gave this tool, DBLab, to many hundreds of engineers,
Starting point is 00:19:35 I think thousands already, and noticed that some people expect that it's a full reproduction of production behavior. They say, why is my index being created three times longer than in production? Well, because it's quite a different environment, right? So there is some adjustment needed in the mindset to test like this, right?
Starting point is 00:19:58 Well, okay, enough about thin cloning and branching, let's think about tooling. Yeah, I had a question going back to cost tolerance. You mentioned allowing for 10% higher costs, and that confused me slightly, because I was thinking, on CI,
Starting point is 00:20:15 we'd always be running on the same data, effectively a clone of the same database, the static state you mentioned. Why would the cost, why would we need to allow for 10% higher costs, potentially? That was initially my naive protection, effectively. So before I did the benchmarking, I said, you know, 10%, literally I just,
Starting point is 00:20:37 there was no measuring. I just said 10% sounds reasonable. When I did the measuring, I found, as we discussed, you know, there is slight variance, but it's not much; I think on average it was below 2%. I would like to see details here, because I think there should be zero difference. Same. Let's talk.
Starting point is 00:20:57 I'm going to follow up with all of you on this, because there are some cases which I don't know. I'm super interested to understand. And yeah. I'll just, you know, just finish the topic, because, you know, I did extensive testing in terms of establishing it. I haven't dug into the reason, because, you know, some of the machines were, you know, Raspberry Pis. So that wasn't comparable.
Starting point is 00:21:19 And that was where the difference came from. Some queries were zero. But the planner doesn't know about hardware. It doesn't know anything about hardware. We actually had this discussion at the Prague Postgres meetup two months ago, that it would be actually very interesting to go into the differences between ARM and Intel processors. What it actually is, if there is some default somewhere. I have, for example, seen that
Starting point is 00:21:45 I/O startup time, again, it doesn't involve buffers, but the I/O startup times on a cold start are much slower on ARM in general. I believe that there might be something else. Different code paths. Exactly. Right. This I can imagine. Yeah. But it's not about hardware, just code paths. The architecture might matter. Yeah, that's interesting. Yeah. What's also interesting here, I think we touched an interesting topic. We could zoom in there, because it's super interesting. When I say 10%, I think it's because in CI you need to define some thresholds, to define when it's failed, when it's passed, right? And when you work manually, you have some thresholds in your mind you don't realize. But then you need to code it, basically. And this is a challenge, right?
Starting point is 00:22:28 And this is a challenge, right? Exactly. This is where it started from. And this is actually where I hit a wall with the original release of the tool. Because people didn't understand, you know, you said it. People didn't understand buffers. why they matter so much. People didn't understand why something is changing.
Starting point is 00:22:47 why they matter so much. People didn't understand why something was changing. And I had to start thinking again. And, you know, I think if you go to the original blog post, now there is a notice that I had to take the fixtures out of the test, because people were kind of linking them together, and they were getting different data sets and didn't use it correctly. My fault was that I didn't specify it properly; that was the issue with the blog post.
Starting point is 00:23:08 But this is why version 2.0 is actually heading in this direction, because we have touched on the topic. You can have multiple snapshots, and those snapshots will give you a different experience. But the same baseline on a given snapshot should be stable, or more or less stable, over time. And then it depends, because your business will change, business requirements will change. For example, yesterday I committed a change which is a selective update. So you can literally link a change in a baseline to a commit. So you can see and blame what actually caused that kind of historical trend change in the query.
Starting point is 00:23:48 So this kind of adds another dimension, because suddenly you don't just have git blame for who broke something. You actually can trace back how it affected the performance. And this was actually in one of the posts you had on your blog, about this optimization from seven milliseconds to something. You had the index scan. And that was actually a big inspiration, because I said, you know, this is actually very important, because the fact that the query takes seven milliseconds, that doesn't matter.
Starting point is 00:24:20 But if you save so many buffers, you should be able to somehow trace it back, even on the positive outcome. So at one point, there can be a layer which will take all those optimizations. And this is where we get to a very interesting topic, and that's AI. Right.
Starting point is 00:24:37 Experiments, right? Experiments. Well, I wouldn't call it experiments. I call it sad reality, because I believe, for most people, code is written by AI only right now, right? Yes. Yeah, yeah. And how do we control it, right? So as I said, we need not only 100% coverage, but real tests should be inside, right? Including performance. I agree. Yeah. So yeah, we had a discussion with Michael in London during PGDay. If you have the context which AI doesn't have, at one point you can then see RegreSQL actually in a very different light,
Starting point is 00:25:20 because it knows the query, it knows the baseline, it has a snapshot, and it has the context, which is the schema. Exactly. And statistics. Exactly. You have statistics. So you suddenly have much more data, which you need. And, you know, all of us have to think about how AI is going to change our work. I've been thinking about it every day since 2017.
Starting point is 00:25:45 Exactly. I have a weekly session which I do, and, you know, I'm trying to digest the news. But there are things that just confirmed that this is the right direction, because the joke is, 84% of developers use some sort of assisted coding, or vibe code, or whatever; we all do it. But the same 84% of people, they don't write SQL queries anymore. Yes. And if they were bad at writing SQL queries, you know how LLM SQL queries look without the context. I actually found it, you know, this is the deformation you have when you look into how the planner works.
Starting point is 00:26:26 I found it very funny, because it doesn't differentiate whether you're running Postgres 12, 14, or 18. It will mix all that advice; it kind of reads the blog posts, which just use something wrong, and people will just go with it. And if you put this in the context of 10 times more queries, who's going to review it? And now you see where all this is coming together, because if you don't have time to review it, if you don't have the human aspect of reviewing it, and if you don't have the DBA knowledge, and I think the lost DBA knowledge is actually kind of key here, you need something that will govern that, A, your tests are fine. You can write unit
Starting point is 00:27:11 tests, integration tests on a couple of rows, but how do you do it if you have to return 5,000 rows? This is very difficult. And this is what RegreSQL actually was doing ever since the start. You can take 5,000 rows and compare them. And it can tell you, you know, the order is wrong, or maybe, you know, there is a wrong field, and it will notice it. So actually, it sits somewhere between a unit test and an integration test; it can measure or control that your data and your contract are still valid. And this is where I believe the direction of the tool actually has a big future.
Starting point is 00:27:48 That's absolutely great. Everything you say makes total sense, absolute sense. Yeah. It's great that we think in the same direction. That's what I can say. Well, thank you. But I would say, you know, if you have experience dealing with production systems, this is what you will effectively arrive at. Yeah.
Starting point is 00:28:08 Because you have to generalize, you have to find underlying issues. We need guardrails and protection, automated protection at a larger scale. We cannot say, oh, let's not run those pipelines, right, because they're expensive and long. We need to make them radically cheap and fast, right, so we can run like 50 of them, and check all the things and throw away all the hallucinated stuff, right? This is the way. Yeah, this is a way. And then it comes to many details, including, for example, some hard problems which come from the production side, not from the development side, for example plan flips, where, you know, the plan is different then. But it's a different problem, if we split these problems. Anyway, I agree, the most
Starting point is 00:28:54 danger to performance right now is AI. But it's also great. But, like, I mean, it's the Wild West right now, right? So we need some practices and tools to be established. But just to kind of put it in contrast, I actually believe most issues are actually very simple. Because most incidents, they start, you know, I'm not sure if I'm the only one, but most incidents you see are literally a simple missing index.
Starting point is 00:29:24 Or somebody writes an index without understanding how it should be written. And, you know, they obviously verify it locally. It works fine because they have a small database. But they don't notice that the query exploded 100 times in terms of buffers retrieved, and that will have an impact. Well, 100 times is easy, you can survive it. An important point here, also, I noticed: if you say a thousand buffers, people don't understand you. If you say 8 mebibytes, they start understanding how much it is, right? So you just multiply by the block size, and this is magic.
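The conversion is just the buffer count times the default 8 KiB Postgres block size; a small illustrative sketch:

```python
POSTGRES_BLOCK_SIZE = 8192  # default 8 KiB page size (a compile-time setting)

def buffers_to_human(n_buffers: int) -> str:
    """Render a buffer count as a human-readable byte size."""
    value = float(n_buffers * POSTGRES_BLOCK_SIZE)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if value < 1024 or unit == "GiB":
            return f"{value:.1f} {unit}"
        value /= 1024

print(buffers_to_human(1000))  # 7.8 MiB
```

So "a thousand buffers" is roughly 8 MB of data touched, which is the framing people tend to grasp immediately.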
Starting point is 00:30:01 Such simple magic, you know. It is. And you know, I don't want this to sound like a promotion. I'm building, it's called Labs. So, you know, we both have labs of some sort. For experimentation, right, yeah. Because for me, and you mentioned it, Nik, before, developers want the same environment as production.
Starting point is 00:30:26 And they don't understand why it's different. And how many of us? I think the three of us, we are lucky, because we had access to really busy production environments, and we know what it looks like. But if you take an average developer,
Starting point is 00:30:42 during their career, they will never actually touch an environment that has more than 10 TPS and maybe a gigabyte of data. They simply don't see it. This is why, you know, if you trace the boringSQL blog, everything comes down to: this is fine on your database, but it will eat you alive on production,
Starting point is 00:31:04 because it won't work. Yes, I compare it to a bicycle. It's hard to be a DBA, because you need to learn, but where do you learn if you don't have access to terabyte scale and heavy workloads? If the only practice you have is production, it's basically learning how to ride a bicycle on a highway. You know, some people survived and they are called DBAs now, right? Others, they say the database is slow, it's dangerous, don't touch it. If you have an environment where you can safely make mistakes... I saw magic when people got access to terabyte-scale clones and they started doing really crazy stuff and realized
Starting point is 00:31:49 it's wrong, but nobody is affected. And this is how learning actually works, you know. So, yeah, what you do basically resonates 100% with what's in my mind. And then take whatever you do, because, you know, I'm actually following your posts on LinkedIn about, you know, self-driving Postgres. Now you take... you know, people want to write SQL queries. They will not, let's forget that there is a regression testing framework, they will not have this experience, and they will never hit an issue where Postgres is badly tuned.
Starting point is 00:32:28 That suddenly increases the gap. Sure, there will be AI which will help, but you still need a manual, a human operator at one point, or somebody who will be able to diagnose what went wrong. Yeah, because real testing shifted to production. That's why. And we just need to shift it left again.
Starting point is 00:32:47 Yeah. Exactly. So I think it's all about clarity and giving people... sure, I think we will never have 10 times more DBAs. We will probably have 10 times fewer DBAs. But we need to make sure the new people will actually learn those skills, and that they will learn them predictably, without that fear, without being scared, and who knows what will happen in five to 10 years.
Starting point is 00:33:14 Yeah, I agree. I agree. And it also really resonates with what happened to code and containers. Containers are the way to have reproducible tests. It's standard, right? We just need the same for data, and copy-on-write is the key, and all that. But also methodologies should be adjusted: buffers and so on. And I agree, DBA as a title is in the past; people who build stuff need to learn Postgres and know how to operate bigger databases, coming from various angles. I have front-end engineers who already start understanding things. Or Supabase. Supabase is a great example: it attracted a lot of front-end engineers to learn SQL and so on, right? That's great, I think.
Starting point is 00:34:00 And that's the boring part. You know, I have to make the joke, because, you know, for me, literally the direction where I'm going is: those boring things have been with us for 40 years. They will be with us for quite some time. And they just work. Those skills are reusable. No matter if you've wrapped them in an LLM or not, somebody needs to understand them. Yeah. Yeah. So can you explain what your tool does? So assuming we understood this, that we need to focus on buffers during optimization and establish baselines focusing on buffers and cost, and we managed to achieve the same data using this feature in Postgres 18. What does your tool do and who should use it?
Starting point is 00:34:44 Okay, so the main goal of the tool is to find all your SQL queries and help you in some way to write a plan, because, you know, queries need parameters. It needs to have a way. So you need to write fixtures or provide your database dump, an anonymized database. That's your choice. I'm not actually forcing you into anything. You have to get it to a predictable state. That's your snapshot.
Starting point is 00:35:12 Then you run it on your queries. And based on your plan, each query can have one to n plans. That means: what's the variation? So you can have an LLM generate plans for you, you know, using random values; you can have fuzzy-generated plans. And that means every single plan will execute that query once. What the tool does is it captures the expected output; the original tool did it also.
Starting point is 00:35:37 It captures the data. So what you have right now is JSON with all the data that it produced. This is your baseline, and that's the contract for correctness of the queries. Then you have the functionality for the baseline. So right now, version 1.0 does EXPLAIN only, but version 2.0 will have a full EXPLAIN ANALYZE, because that's what I need for buffers.
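As a rough illustration of what capturing buffer data from EXPLAIN could look like, here is a sketch that walks a plan tree and sums the shared-buffer counts. The key names are the real ones Postgres emits with EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON); the plan dict itself is a trimmed, hand-written sample, not actual RegreSQL output.

```python
# Sketch: aggregate shared-buffer usage over a Postgres EXPLAIN JSON plan tree.
# "Shared Hit Blocks", "Shared Read Blocks", and "Plans" are keys found in
# EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) output; the sample plan is hand-made.

def total_shared_blocks(node: dict) -> int:
    """Recursively sum shared hit + read blocks for a plan node and children."""
    total = node.get("Shared Hit Blocks", 0) + node.get("Shared Read Blocks", 0)
    for child in node.get("Plans", []):
        total += total_shared_blocks(child)
    return total


sample_plan = {
    "Node Type": "Nested Loop",
    "Shared Hit Blocks": 12,
    "Shared Read Blocks": 3,
    "Plans": [
        {"Node Type": "Index Scan", "Shared Hit Blocks": 40, "Shared Read Blocks": 0},
        {"Node Type": "Seq Scan", "Shared Hit Blocks": 100, "Shared Read Blocks": 55},
    ],
}

print(total_shared_blocks(sample_plan))  # prints 210
```

A single aggregated number like this is what makes a buffers-based baseline easy to diff between runs.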
Starting point is 00:35:59 And it records that data. Yeah, what about other settings, modern settings like memory and serialize? That is interesting, because again, this is part of the 2.0 roadmap, because the snapshot has to be consistent. You have to have the data. You now even have to have the configuration. So I'm not saving all the configuration, all the metadata about where it's running and how it's running, separately.
Starting point is 00:36:27 But I'm actually saving this as part of the snapshot. So if something changes, like if you change a configuration setting, say the cost of a sequential scan, it will warn you: you changed something, and it will probably give you wrong results. So you still have to interpret the data. At one point, you arrive at a scenario where you have a good commit and you can distribute the snapshots for testing. So ideally, you can have a couple of versions of the snapshots.
Starting point is 00:36:57 You can have a small snapshot, which will be 100 or 200 megabytes. You can have a mid-size, 5-gigabyte snapshot, which you can use. And then you can have a large snapshot, 50 gigabytes or something. And then every time you run the tests, you will get your report of what has actually changed. And that change is based on cost and buffers. Yeah, that's great. And you also capture the version, I think. Right? And settings and some statistics and the structure of tables, which indexes we have and so on, right?
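A minimal sketch of the kind of baseline comparison described here: stored per-query metrics versus a fresh run, flagging anything whose cost or buffers grew past a threshold. The field names and the 10x factor are illustrative assumptions, not RegreSQL's actual format.

```python
# Sketch of a baseline-vs-current regression report keyed on cost and buffers.
# The layout ({query: {"cost": ..., "buffers": ...}}) and the 10x threshold
# are assumptions for illustration.

def find_regressions(baseline: dict, current: dict, factor: float = 10.0) -> list[str]:
    """List queries whose cost or buffers grew more than `factor` times."""
    report = []
    for query, base in baseline.items():
        cur = current.get(query)
        if cur is None:
            continue  # query removed; could be reported separately
        for metric in ("cost", "buffers"):
            if base[metric] > 0 and cur[metric] / base[metric] > factor:
                report.append(f"{query}: {metric} {base[metric]} -> {cur[metric]}")
    return report


baseline = {"list_orders": {"cost": 120.5, "buffers": 340}}
current = {"list_orders": {"cost": 118.0, "buffers": 5200}}  # ~15x more buffers

print(find_regressions(baseline, current))
# prints ['list_orders: buffers 340 -> 5200']
```

This echoes the point made later in the conversation that real incidents tend to start with order-of-magnitude differences, not 10% ones, so even a coarse threshold catches a lot.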
Starting point is 00:37:30 We're basically in the same... like, we met at some point, it resonates a lot, but you come more from development, I come more from production. And in production, what also matters often, including buffers, for example, if you think about index-only scans, heap fetches matter a lot, right? And this is the state of production, which can be bad or good. And capturing some statistics from production might make sense as well. Do you see how it could play in, or not? For example, when was this table last vacuumed?
Starting point is 00:38:11 How many dead tuples it has, or how much bloat accumulated, things like that, or is it outside of your scope? I spent so much time thinking about this topic. And if you would ask an LLM, it would say yes, that's your natural direction, but I actually proactively said no to this, because this is a different domain. You know, you have to guarantee what's broken on the development side. You have to make sure that the contract, which is also the performance in buffers,
Starting point is 00:38:37 is the same or better, and you track it, but you can't replace this in production. Because in production, you can have latency, you can have a replica which runs in a different availability zone. You can have network issues. You can have, you know, so many issues that can happen in production. And there are tools that, effectively, I can't compete with. And I think that wouldn't be my goal.
Starting point is 00:39:01 Because my angle is education; effectively, for me, it stops with CI. Yeah, it again resonates 100%. Because what you do, you say: let's split the system into pieces and study each piece separately. And when we understand each piece's behavior fully, then we can go and increase complexity. And that's why you shouldn't use pgbench to study plans, right? Because pgbench is for testing the whole system. And also that's why you shouldn't bring too much from production.
Starting point is 00:39:37 Study your system first, how it works, right? And then let's go to production, understanding our system and knowing what it's capable of already; then we bring complexity. This is one of the oldest scientific methods, René Descartes, right? Like, let's split into pieces, study each piece, and then composition, to have the whole understanding. Without this split, it's super hard, and you always have too many moving pieces, right?
Starting point is 00:40:08 So I like this, but still, I'm from production, so I care about the state of production, you know, it's different. Yeah. Cool. I think we're talking about different things, though, right? I really like that this is about regression testing. This is about not breaking things ourselves through development. This is about not breaking correctness, whether that's a human changing a query or an LLM chatbot changing a query.
Starting point is 00:40:32 So correctness is of primary importance, that makes sense. And ideally not affecting performance, but we can't measure production performance in test, in development. So what we can do is get a proxy for it. I don't think we can anticipate every single potential performance issue in development that could hit production. We've reproduced really difficult problems. Yeah, with point-in-time recovery, sometimes you can reproduce really hard problems at the macro level.
Starting point is 00:41:03 Reproduction is different from anticipation, though. Anticipating, so to get it into your regression test, you would have to anticipate every possible issue. And personally, I really like the direction of: let's do one thing at a time, let's tackle the common things. I'm not arguing with the whole concept. I'm just saying that that is also possible. It's just much, much heavier, much more expensive, time-consuming.
Starting point is 00:41:27 In development, we need to move faster, to iterate a lot and check dozens of ideas at the same time, like in parallel. This is what we need, right? And this is the perfect approach. The last thing I wanted to say is, I personally think that this hasn't been a topic of particular interest to people for many years. I expected it to be; I thought it should be. I've worked at a company previously that made tools in this area that weren't very successful.
Starting point is 00:41:54 And I was actually pleasantly surprised by how much positivity and interest there was in the Hacker News discussion and in your revamp of this tool. So for me, it actually feels like people are ready for this topic for some reason. I don't know what changed or what is driving it. Who knows? Do you think that's what it is, Radim, from your perspective? I think this goes to... there's not one answer. First, going back one topic.
Starting point is 00:42:27 How many percent of the incidents you deal with are really production-based? Like, no matter if it's human or LLM, you know, I'll be just guessing: 90% of all incidents are change management that didn't go well, and something happened. So this is why regression testing... you might not measure everything, but if you take a snapshot of something which is now working, you actually have something that is measurable.
Starting point is 00:42:58 You measure if you're improving on performance. So, you know, sure, the contract is important, but actually, if you start with point zero, you can technically say: we, on average, go up or down in terms of performance. This is very easy to measure. And you know this one, for example: you know how many times you second-guess, when you upgrade from Postgres 17 to 18, what's the impact? You can suddenly put a number on it. And if you
Starting point is 00:43:26 somehow take all your queries and produce that number, you can put that number into a bucket. And the second one is, no matter what you do... I was asking myself, you know, why nobody thought about this one before. I actually don't understand why nobody did this, because it's predictable. The metrics were there, but nobody did it. And from my experience, the frustration I had was that all the incidents were very easily avoidable, because the metrics were not within a 10% margin. They were at 1,000%. Every single incident started with a 1,000% difference. I also have an important question about this tooling. So how do you capture queries to test?
Starting point is 00:44:16 Because, for example, if we have an ORM, it produces various queries; we can see pg_stat_statements, normalized queries, in production. But for each query, we might have multiple plans depending on parameters. So how do you find those queries? I have my own methodology. It's actually described in some articles; we could maybe collaborate a little bit on it. But what's your approach? This is the difficult part.
Starting point is 00:44:42 I'm not going to lie. I come from a Go background, and if you take ORMs, they're not that popular in Go, I would say. Frameworks like sqlx and things like that encourage you to write SQL by hand, or SQL queries by hand. And this is actually where I started. So effectively my first commits were about supporting multiple queries in one file. So I have a different base. There's even a framework like sqlc that generates code based on the queries, or vice versa.
Starting point is 00:45:16 But that would keep the target market very small. So the obvious next step is the support of different frameworks. So I think in my article I did SQLAlchemy. That was one that actually allows you to save the queries. And this, if you remember, we talked about the plans. And plans are that variation of arguments. Right. So you have an artificially generated snapshot,
Starting point is 00:45:42 which is fixed, and then you need to generate the plans. And plans are the variation: if you write one plan, your query testing is not going to be the best. So you need to generate them. And this is actually where an LLM can help you. Because suddenly you have a tool which can create the variation, create even hallucinated data. And you will be testing queries with...
Starting point is 00:46:03 I would actually be quite happy to have hallucinated parameters, because it will show you whether something is changing or not. It's a proactive approach, actually. Because my approach is to take everything from production, because we deal with really large systems, and we need to understand, when we perform a major upgrade, for example, that there are no plan flips, or the plan flips are positive.
Starting point is 00:46:25 But what you say is: let's use an LLM to generate parameters which will probably happen next month, right? Like some edge cases we haven't met before. I like this a lot as well. Yeah. Yeah. So you have to use what you are afraid of. So if we are afraid of randomized queries, then why not use them?
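The randomized-parameter idea can be sketched like this. The spec format and helper function are hypothetical; in practice an LLM (or a fuzzer) would propose the ranges and values.

```python
# Sketch: generate randomized parameter sets ("hallucinated parameters") for a
# parameterized query, so each variation exercises a potentially different plan.
# The spec format is invented for this example.
import random


def generate_param_sets(spec: dict, n: int, seed: int = 42) -> list[dict]:
    """Produce n randomized parameter bindings from simple range/choice rules."""
    rng = random.Random(seed)  # seeded so CI runs stay reproducible
    sets = []
    for _ in range(n):
        params = {}
        for name, rule in spec.items():
            if rule["kind"] == "int_range":
                params[name] = rng.randint(rule["lo"], rule["hi"])
            elif rule["kind"] == "choice":
                params[name] = rng.choice(rule["values"])
        sets.append(params)
    return sets


spec = {
    "customer_id": {"kind": "int_range", "lo": 1, "hi": 1_000_000},
    "status": {"kind": "choice", "values": ["new", "paid", "cancelled"]},
}

for params in generate_param_sets(spec, 3):
    print(params)
```

Seeding the generator is deliberate: randomized inputs are only useful for regression testing if a failing run can be reproduced exactly.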
Starting point is 00:46:43 And effectively say: yeah, I need a variation of a thousand plans. Yes. Again, the problem is... again, I have a slightly different opinion here, because I'm coming from production experience. I would say we can generate thousands and thousands of various cases, maybe a million queries to test, but it will increase the duration and the price we pay for CI and testing, right? So we probably need to prioritize some queries. How to prioritize? We need to study production. No, exactly. Okay. You need to understand your business requirements, you need to understand
Starting point is 00:47:19 your power users, you need to understand your smallest customers, you need to understand your biggest customers, you need to understand the patterns which are happening. So that's also an interesting comment, because I was thinking about this tooling... Oh, by the way, parameters right now are just an idea, it's not implemented yet, a collection of... No, that's already... the queries are already parameterized. I mean, yeah, I understand fixtures, but if I already have a large 10-terabyte database with a complex workload and I want to start using your tool, will it help me create a set of queries, example queries, which will be tested?
Starting point is 00:47:56 Queries we have to extract from somewhere. So no, I would say the answer is no, because right now I assume you have the queries, or a way to capture the queries, either by having them in files or by being able to have a CI task that will generate those SQL files from your ORM. Yeah, I would like to have an additional call with you to discuss this, because I have an approach; maybe we could connect the dots here and see if it can land, because this methodology probably works with your tool naturally, you know, how to inject this approach into an existing big project. We deal mostly with projects that are already
Starting point is 00:48:37 starting to hit walls. They come to us for help. So I completely understand how to start from scratch. But when you have something already, there it can be difficult. Another thing I wanted to point out, from what I hear: you say smaller, medium-size, bigger. Sometimes we have only one database and that's it, some systems like a SaaS which is installed only once. But sometimes we develop software which has many installations, small and big ones. So it's interesting. I think the approach where you define some reference databases works better when we have multiple installations of various sizes. Is that what you think? Yes. As I said, you know, there's no universal answer for how to generate. And many people
Starting point is 00:49:24 actually wrote to me about fixtures being completely wrong. But what I'm trying to get across is: I don't care about the fixtures. I care about the way to generate a snapshot. And if you have the same cardinality and the same data, you can then measure correctness. So obviously, the investment in a tool like RegreSQL is a big one. You have to start using it, and you will get the dividends months later. You might get some dividends as you go, but the true value will only come later.
Starting point is 00:49:54 So the developer experience, and then the whole lifecycle, that's maybe version 3.0, because there needs to be much more support. So the tool itself is not as important as the whole experience and the process, because methodology, exactly. So what I'm actually interested in, let me share, to be honest with you, what I have in my mind listening to you. For example, GitLab, they use our approach.
Starting point is 00:50:21 Many engineers; that's great. And we have a case study published. The whole thing is great. But listening to you, I realize we actually always cared about the biggest installation, GitLab.com, which is huge, multiple databases, huge. And we focused on solving that problem. We solved it... like, not fully, but we solved how to test queries there. It can always be proven. But now I'm thinking,
Starting point is 00:50:44 actually, GitLab has hundreds of thousands of other installations. And they have references. And I think, well, we actually should care about those as well, right? So it's interesting, because we could generate different sizes of databases and cover them with testing as well. Yeah. So it's interesting, because I was always thinking about the biggest one, you know, because problems are usually there. Right. Exactly. You hear about the biggest ones. This is where, you know, production breaks, the incident starts, the query which was fine... It sounds similar, but it's a very different problem. But you need to find some sort of boundary. Sometimes you can't solve all the issues, and maybe you need to find a different solution.
Starting point is 00:51:30 Right, right, right. Yeah, so it's interesting. Just out of interest: what would you expect to break on a smaller instance? Is it that it's just a different shape? For example, on GitLab.com's main instance, maybe there are tons of one type of object, and in a different customer's installation, there's a different distribution, and therefore plans are different. Or is it like...
Starting point is 00:51:53 I can tell you. So what we think about: we have various kinds of testing, like real data testing. It's like the ultimate final testing, with basically production clones, which only a few people can have access to. But also, you cannot download this database due to its size and also restrictions, PII and so on. But what if you need to care about multiple installations?
Starting point is 00:52:16 In this case, the direction DBLab is moving in is having full-fledged database branching, Git-like, but also cloning between multiple setups, and we already run on a MacBook. So I'm thinking, if you care about smaller databases as well, developers could basically download reference databases and iterate much faster, because I hate how CI is organized right now. If AI is hallucinating in production and I hit it, it's an absolute nightmare.
Starting point is 00:52:48 But hitting problems even in GitLab CI or GitHub Actions is already too late for me. I want to iterate much faster. I want all the tests to be run before push, git push, right? In this case, if we have a DBLab of a smaller size, you download it and don't hunt only for the biggest case: medium size, small size. I have like a terabyte or so here. I can have a good example database on my laptop and iterate much faster.
Starting point is 00:53:17 And before push, I already have much better code, you know. Yeah. So catch more issues earlier and only catch the... Shift left, what you can shift. Yeah, shift left. Yeah, that makes sense. Okay. So it's saying let's test earlier.
Starting point is 00:53:30 Ideally, developers should test. Usually they don't have time, but AI... I have no excuses for AI. It should test everything. I pay for it, right? Tokens, not time, right? So, yeah. And that's it. I think it's the perfect time to think about all levels,
Starting point is 00:53:50 and if you can shift left to your laptop, to your work machine, you should do it. Right. So future DBLab versions will be able to clone data between machines, to pass snapshots between developers. You can have it on your machine, extract it, do something, extract it back and share it with your colleagues, some tests and so on. Yeah, anyway, it's still on my agenda all the time, because it's actually my topic. I'm so glad you came, actually. There are actually so many topics, and I think we could go for another hour, because, you know,
Starting point is 00:54:22 it's like the support I pushed last night, the same thing: you take the snapshot, you run it against a migration, either a SQL file or a command, so you can actually trigger a third-party system and then test the same queries, just to test the migration, how it looks. And there are so many angles on this topic alone.
Starting point is 00:54:42 Oh, yes. You know, I don't say I have answers for everything, but I just feel like it's a huge topic. Let's have a part two about testing of schema changes. Let's have a part two just about it, because it's a big topic. This is where, as you said, change management, this is
Starting point is 00:54:59 where the danger usually comes from in incidents, right? So let's just discuss it separately, because I think it's worth discussing. Radim, is there anything on the regression testing side that you wish we'd asked about, or that you wanted to mention before we wrap up? I think we have covered most of the things. Where should people start? You know, first steps? Follow the project, I think that's the first good step. I will definitely
Starting point is 00:55:29 make it public and do some, you know, sessions around version 2.0, because this is where the core developer experience will come. And I will actually need feedback on that. So I'm open to any feedback. You can email me at any point or open a GitHub issue. I would actually be grateful to see that. But I would say: pick your ten queries. Don't start with everything. Find a way to pick ten queries which you believe are, maybe not small ones, but, I would say, representative of something that's breaking, and try with them. Try to create a plan, try the tool, how it works, what the flow might be, and try to live with it.
Starting point is 00:56:11 And then have a rule that every single handwritten or ORM-written query needs to be covered. If you start with that, you will start seeing value in a couple of weeks, a couple of months, because you will have that checkmark: everything is fine. And then in the future, hopefully all the features and the developer flow will support that; you will get much more value, and you will actually get your time back in terms of a safer production. The confidence level builds up. Yeah, I mean, a small set of queries, but it's a good idea to understand which queries matter most, based on some production monitoring or something. And you have a website, boringSQL.com, so this is where you have the blog and so on. And the project you're talking about is RegreSQL, right?
Starting point is 00:56:57 Exactly. Yeah, great, great. Cool. Right now, it's not featured. I'm preparing a section. You have a huge roadmap, I guess. You can't even imagine. And this is actually where I loved it.
Starting point is 00:57:09 I can. But I believe you can. Yeah, that's great. That's great. And it's good. And it's in Go. So that's great. Yeah.
Starting point is 00:57:17 Good, good. I think, finishing words, right, unless Michael has something. No, I'll link to everything in the show notes if anybody needs anything that we've mentioned. And yeah, thanks so much for joining us, Radim, it's been great. Yeah, thank you, it was very interesting, and I love talking, because, you know, it's an important topic. Thank you for coming. Okay, thank you, guys. Thanks, both.
