Postgres FM - PostGIS

Episode Date: March 13, 2026

Nik and Michael are joined by Regina Obe and Paul Ramsey to discuss PostGIS. Here are some links to things they mentioned:

Regina Obe https://postgres.fm/people/regina-obe
Paul Ramsey https://postgres.fm/people/paul-ramsey
PostGIS https://postgis.net
MobilityDB https://github.com/MobilityDB/MobilityDB
pgRouting https://github.com/pgRouting/pgrouting
Google BigQuery GIS public alpha blog post https://cloud.google.com/blog/products/data-analytics/whats-happening-bigquery-integrated-machine-learning-maps-and-more
PostGIS Day 2025 talk recordings https://www.youtube.com/watch?v=wuNO_cW2g-0&list=PLavJpcg8cl1EkQWoCbczsOjFTe-SHg_8m
pg_lake https://github.com/Snowflake-Labs/pg_lake
GeoParquet https://geoparquet.org
ST_DWithin https://postgis.net/docs/ST_DWithin.html
Postgres JSONB Columns and TOAST: A Performance Guide https://www.snowflake.com/en/engineering-blog/postgres-jsonb-columns-and-toast
FOSS4G https://foss4g.org
OpenStreetMap https://www.openstreetmap.org
PgDay Boston https://2026.pgdayboston.org
SKILL.md file https://github.com/postgis/postgis/blob/68dde711039986b47eb62feda45bb24b13b0ea37/doc/SKILL.md
Production query plans without production data (blog post by Radim Marek) https://boringsql.com/posts/portable-stats
PostgreSQL: Up and Running, 4th Edition (by Regina Obe, Leo Hsu) https://www.oreilly.com/library/view/postgresql-up-and/9798341660885

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With credit to:
Jessie Draws for the elephant artwork

Transcript
Starting point is 00:00:00 Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard, and I'm joined as usual by Nik, founder of Postgres AI. Hey, Nik. Hi, Michael. And we have not one but two heavy hitters from the PostGIS world, both also recognized Postgres contributors. First, we have Regina Obe, president of the Paragon Corporation, a consulting firm, member of the PostGIS core development team and steering committee,
Starting point is 00:00:27 and co-author of a whole host of books on GIS and SQL, including the book PostGIS in Action. Welcome, Regina. Thanks. Thanks for having me. It's our pleasure. And also we have Paul Ramsey, who is a staff engineer at Snowflake via the Crunchy Data acquisition, the co-founder of PostGIS, member of the core development team, and the chair of its steering committee. Pleasure to have you too, Paul. Hi, Michael and Nik. Nice to be here. It's wonderful. Right. So I was hoping, actually, that maybe one of you could give us a kind of brief history of the project, and maybe the other one could give us an update on where it's at today and what it looks like.
Starting point is 00:01:06 Well, I'll take the history side of this, because PostGIS comes out of my history as a geospatial consultant, and it's a long time ago now. PostGIS 0.1 was first released at the end of May in 2001, so we're creeping up on 25 years of PostGIS. Pretty crazy. PostGIS was born out of a consulting company here in Victoria, British Columbia, and it was born because we ran out
Starting point is 00:01:33 of work. The fiscal year for the government, which was our main source of money, ends on March 31st, and the first release of PostGIS was two months after that. We had been doing analytical work for the government using standard GIS tools and a database to store interim results, and the database we were using was Postgres. And having had that experience of using a database and using SQL, becoming familiar with this sort of powerful tool to slice through big chunks of data in different ways, and also being spatial people, we looked at this database tool and said, there's got to be a way. Like, there's got to be a way to bring together the spatial questions we're asking
Starting point is 00:02:11 and the non-spatial questions we're asking into one place, so we can ask more complicated questions and have it all fulfilled inside the database. And there was, at that point, a relatively fresh international standard, the Simple Features for SQL standard from the Open Geospatial Consortium, and there was one extant proprietary product that did that. Oracle 8i had been released at that point, and it had a pretty 80-20-complete implementation of spatial SQL. And we did both.
Starting point is 00:02:43 We both looked at Oracle Spatial and thought, can we use this? And at the time, if you wanted to use it, you had to have Oracle Enterprise. The implication was about a quarter-million-dollar licensing bill to really use it in anger. And that obviously scared us off as a small consulting company. And it also successfully scared off anyone in the government. We thought, maybe we'll get someone in the government who wants to do this stuff, and we'll piggyback on their licensing and learn
Starting point is 00:03:08 it and do implementations for them, and we'll be like spatial database experts. But now it was too expensive for anyone in the government. But we had this specification, and we had this experience with Postgres, and Postgres itself was built for type extension. So we also had a really clear on-ramp to add a new type, to bind it to access methods, to add functions that worked on this type, in a way that really didn't exist in what was at the time the sort of premier open-source database, MySQL. It took quite a few years beyond 2001 before MySQL added spatial. And when they did, we went and looked at their implementation, looked at the patch. And it was big. It was a big patch that crossed the whole code base. It really
Starting point is 00:03:51 took someone who understood MySQL soup to nuts to do it. They had to do the index implementation, like, everything. They had to do a hell of a lot of work to do it. Whereas we got this nice, tight little code base sitting as an extension, completely divorced from the core code. The core code provided an access method pre-built. We were good to go. So it was doable to have a working extension that we could demonstrate to ourselves and to customers within two months and then release. So yeah, we were clever people, but Postgres is a hell of a clever product. PostGIS just wouldn't exist if Postgres hadn't been so clever to start with. And it was GiST, right, at that time?
Starting point is 00:04:27 At that point it was GiST, and it still is GiST, actually, the access method we use for spatial indexing. I was smiling the whole time you were talking, because it connects to 20 years ago: 2005, 2006, 2007. I remember, first of all, this is when I left Oracle 8i.
Starting point is 00:04:43 Exactly, the last time I touched Oracle was exactly 8i. And then, when I was building my second startup, a social network, we put a lot of people and objects like restaurants and so on on the map. 2007, it was too early. And I remember I hired two guys, Oleg Bartunov and Teodor Sigaev. Oh yeah, yeah, yeah.
Starting point is 00:05:05 To help us. Oh, yeah, we remember them. It was amazing to have GiST, and reading all the articles and the basics, like the seven functions, the abstraction: all great. But when we went to practical questions, we had 10 million people to be placed on the map, and a lot of objects. And then you try to solve a simple problem: show me the nearest 10 objects on the map
Starting point is 00:05:35 which have good reviews. And it was terrible, because there was no support for ORDER BY and KNN. I didn't know KNN. And since we moved so fast, I eventually decided to pivot my startup away from maps and so on. But Oleg and Teodor,
Starting point is 00:05:52 a couple of years later, brought KNN to GiST. It was 9.1. And then Alexander Korotkov and Nikita Glukhov helped bring the same to SP-GiST. And this connects us a lot. But I just showed them this problem, and that's it for me. It was great.
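For readers following along, this is the query shape that the 9.1-era KNN work enables. A minimal sketch, assuming a hypothetical places table with a GiST index on its geometry column:

```sql
-- The <-> distance operator in ORDER BY lets GiST walk outward from the
-- query point instead of computing and sorting a distance for every row.
-- "places", its columns, and the coordinates are illustrative only.
SELECT id, name
FROM places
ORDER BY geom <-> ST_SetSRID(ST_MakePoint(-123.37, 48.42), 4326)
LIMIT 10;
```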
Starting point is 00:06:20 Yeah, back in the early days, starting off incubating an open source project, within a couple of years we had quite a few users and some institutional users. And at the time, the GiST infrastructure did not include recovery; it wasn't a completely ACID implementation. And at that point I had, like, a stable of people who had businesses depending on PostGIS, and I put out a call for development money and scraped together, from all the different companies who were using or dependent on PostGIS, about $10,000 or $15,000, and then shipped that money to Oleg and Teodor, who did the work necessary to make GiST recoverable. And it was a weird thing to sell to the community:
Starting point is 00:07:01 we're going to get this work done; you're using it now, but as a result of the release cycle of Postgres, and the time it takes to do the work and all that stuff, you'll be able to use it and show it to your clients in 12 to 18 months; but you're going to give me the money right up front. So it was an interesting introduction, both to the Postgres development cycle
Starting point is 00:07:26 and also to trying to drum up crowdsourced funds for open source development. And it really taught me that crowdsourcing funds is really hard, because people are willing to pay to get the bug which is bothering them right now fixed right away. Yeah. It's much more difficult to get them to pay for something which will maybe make them a little bit happier in the far future. It's a harder sale for sure. Especially for startups: they move too fast to understand that the thing will come
Starting point is 00:07:50 only in a couple of years. Yeah, they might not be around that long. Yeah, yeah. But I'm glad KNN was brought into Postgres and PostGIS as well. Yeah, great. Is that a good excuse for us to get Regina to give us a bit of a catch-up? It's been 25 years, right? You've done a lot of work since those early days. What's changed?
Starting point is 00:08:12 What does it look like now, and how is that different to the first versions? Mistakes. I mean, we definitely have a bigger audience, and people now seem to be running things in the cloud more than they're running things on premises, or maybe not. I think it's, yeah, slowly getting to the point where most people we see are running PostGIS in Google Cloud or Amazon Cloud or Microsoft Azure. I think that's the biggest thing that's changed since we started. I was thinking as well: it started as a single extension, and it's now kind of a collection, like a family of extensions that serve quite a few more use cases. Oh, that's true.
Starting point is 00:09:03 Yeah. So when we started off, we just had the vector support, and then we added raster support. And originally we thought, oh, we'll just throw raster into the PostGIS extension, and then it got very fat, and people started complaining: why do I have to carry these thousands of functions when I don't even use raster? So then we broke it out into two extensions, postgis and postgis_raster, and we also have postgis_topology.
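As a rough sketch of how that split looks from the user's side today, with the extension names as they currently ship:

```sql
-- Each piece installs independently, so you only carry what you use.
CREATE EXTENSION postgis;            -- core vector types, functions, operators
CREATE EXTENSION postgis_raster;     -- raster support, split out of core
CREATE EXTENSION postgis_topology;   -- topology model (edges, nodes, faces)
```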
Starting point is 00:09:48 And then PostGIS spun off other extensions. So you have the H3 extension, which we brought into our organization, but it's still a separate extension; it's not even part of the PostGIS extension, because we forked it from the original owner into our system. So that's very recent. And then there's another extension called MobilityDB, which is a separate extension, a separate group of people, but they build on top of PostGIS. And they deal with things like, you know, vehicle movement, the speed at which things move and all that, and flying birds, trains. I think those are the key ones. Yeah. And then there's also pgRouting, which I'm a member of too, which deals with network routing: driving, biking, walking, getting directions. So yes, it's spun up a lot of other extensions. Yeah. So that's the technical side. Deployment is a big deal for sure. When we started, it was: you want to use PostGIS? Here's the source code, untarred in the contrib directory of an existing Postgres installation. And then you build, and then you install, and now you have PostGIS. And around that time Regina joined us. She helped with Windows builds, so Windows people at least could have a pre-built binary.
Starting point is 00:10:38 And I still do Windows builds. At some point, PGDG included us in their builds, and that was a huge change. It meant that you didn't have to build it yourself; you could just find the right packages and install those. That opened up a whole pile of new users who previously wouldn't have tried it. But yeah, as Regina said, the real tipping point lately has been being part of the default installs on all the clouds.
Starting point is 00:11:09 So there's no longer a software install step from the point of view of the end user. It's just: type CREATE EXTENSION postgis. And that ubiquity really goes beyond ubiquity for users. I feel like we've achieved a certain level of ubiquity in the industry. First of all, geospatial as database functionality is table stakes now. It's pretty rare to find a SQL database that doesn't have geospatial; those who don't are all tacking it on at a great rate. Ours is still the most popular, I think.
Starting point is 00:11:42 I think it's surprising that, even though Postgres started out as a meager database, PostGIS itself is probably more popular than Oracle Spatial and SQL Server Spatial. I don't even hear of other people doing serious work on those. Yeah, and that's probably not fair, because I bet you there's a whole bunch of institutional users of those databases who never leave the ranch. So all they understand is Oracle Spatial or SQL Server Spatial, because that's just the way their institutions are shaped. But there's no doubt that in the wider world, PostGIS is the thing.
Starting point is 00:12:15 The thing that made me really understand that PostGIS was the lingua franca was seeing the Google BigQuery announcement blog post, where they said one of the features of Google BigQuery was that it was PostGIS compatible. It's like, oh, okay. So that's nice to see. That's them saying you're the standard, right? That's pretty cool.
Starting point is 00:12:39 Yeah, and it's a standard. They could have said, we support the Simple Features for SQL standard, but they decided that "PostGIS compatible" was more understandable to the larger marketplace. It's like, okay, so we have the mind share. Yeah, that's great, actually. My daughter is a fourth-year bachelor's degree student at UCSD, and they have projects related to wildfires in California, and GIS as well.
Starting point is 00:13:01 But I was super annoyed to hear it's not PostGIS, it's actually Google BigQuery. So it's good to hear it's at least compatible. Great: reusable knowledge. Cool. What are the latest things you can share? What's the latest development? This is one thing
Starting point is 00:13:19 I'm also curious about. I know, for example, PostGIS works quite well with TimescaleDB. It should work with pgvector; I never tried, I'm curious. And with sharding systems as well. There are, like, super extensions. In the case of PostGIS, it's not a super extension;
Starting point is 00:13:35 it's a collection of extensions. But sometimes it's hard to make these big building blocks work together. What's there? Especially pgvector, because it's also an important building block these days, right? What's the recent development? I don't know if there's any recent
Starting point is 00:13:55 development. I think they just work together anyway, just because of the way Postgres is set up, right? You can have three or four different extensions in the same query, just because they standardize: all data types
Starting point is 00:14:11 get the same core feature set. And so you don't have that annoying problem that MySQL has, with their different storage engines and their different query syntax for things. So things just work seamlessly. So I don't think we did anything special to try to make ourselves work with pgvector.
Starting point is 00:14:31 Yeah, pgvector is more similar to us than, say, like you said, the super extensions, like Timescale or Citus, which are sort of overarching big swings at the architecture of the system. pgvector: they provide a type, they provide access methods, they provide some functions to make sense of the type. They're very PostGIS-esque in that respect. But from the point of view of what they're for, they're kind of orthogonal. I've only seen one example recently of PostGIS and pgvector being used together. And they're not being used together because they're getting any great leverage out of PostGIS; they're getting leverage out of the fact that PostGIS itself has a huge amount of third-party interconnectivity. So you use something like really high-dimensional search in pgvector to get a bunch of results, and if those results happen to include spatial information, then you throw that spatial information on a map, because everything connects to PostGIS; all the mapping software connects to PostGIS.
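A minimal sketch of that pattern: pgvector does the high-dimensional search, PostGIS provides the map-ready output. The patches table, its columns, and the query vector are all invented for illustration:

```sql
-- Nearest-neighbor search on embeddings, spatial output for mapping tools.
SELECT id,
       ST_AsGeoJSON(geom) AS geojson    -- hand this to any map client
FROM patches
ORDER BY embedding <-> '[0.11, 0.42, 0.97]'::vector   -- query embedding
LIMIT 10;
```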
Starting point is 00:15:22 And that's how it works. We really saw a cool example of this at our last PostGIS Day. We have a PostGIS Day mid-November every year, an online conference. And Burke from Geobase showed us geospatial AI using pgvector as a search engine, but then gating the results back through PostGIS to its mapping tools, so it could show us the results. A neat example of what geospatial people are calling foundation models: not large language models, but a completely different look at how to use this kind of technology for doing geospatial work. And if we talk back to TimescaleDB: I saw great examples of PostGIS and TimescaleDB working together, because if you collect a lot of geographic signals from drivers, or from anything on the map, you want to partition, always, right? Because it's a lot of data. One way or another, partitioning should work, because it's also time series, not only geographical data, right? So this is a natural pair of extensions working together.
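A sketch of that pairing using plain declarative partitioning (TimescaleDB's hypertables automate the same idea; all names here are invented):

```sql
-- Time-partitioned tracking data with a spatial index per partition.
CREATE TABLE vehicle_positions (
    vehicle_id  bigint                NOT NULL,
    recorded_at timestamptz           NOT NULL,
    geom        geometry(Point, 4326) NOT NULL
) PARTITION BY RANGE (recorded_at);

CREATE TABLE vehicle_positions_2026_03
    PARTITION OF vehicle_positions
    FOR VALUES FROM ('2026-03-01') TO ('2026-04-01');

-- Time filters prune partitions first; the per-partition GiST index then
-- handles the spatial predicate on whatever partitions survive.
CREATE INDEX ON vehicle_positions_2026_03 USING gist (geom);
```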
Starting point is 00:16:16 Okay, what about lakes? Since, like, your company got acquired last year. Yeah, yeah. What are the recent developments there? And this ties actually back to your other question, because at the core it is an interoperability question.
Starting point is 00:16:38 And the nature of the type model in Postgres is such that when you add a new type using the Postgres type model, a lot of the time things just work, because the overarching machinery of a big piece of machinery, say Citus, doesn't necessarily think about a particular type. It thinks just about type mappings, and the type mappings are entered when you add the new type to the pg_type table, and your pg_attribute entries just refer back to the type table.
Starting point is 00:17:07 Everything is runtime configured, and that's the underlying expectation of these extensions. And so stuff just works. It's only when you have to push out to some external format that has its own understanding of the world that you have to start thinking about these mappings again. pg_lake, which was open-sourced a few months ago, is a set of tooling to allow you to push data from a Postgres database out to Parquet or Iceberg data on object stores. I had a question around that.
Starting point is 00:17:37 Only in the searching for this episode, I came across GeoParquet for the first time. I didn't know that was a thing. Yeah. There's a long history of the geospatial community taking extant open standards which are good and clearly going somewhere, and extending them. GeoJSON is another one. Exactly. Back when JSON was turning into the lingua franca wire transfer protocol,
Starting point is 00:18:00 the geospatial community said, but we need a way to encode geometry too. So they invented one, and eventually got it ISO, not ISO, IETF standardized. Same thing with Parquet. So you have Parquet, which in its initial versions understands the basic types of relational databases
Starting point is 00:18:16 and doesn't understand anything about geo. So geo people who want to push data through this format out to their own clients say: we've got this byte type and we've got this metadata slot, so we'll slot in some information about the columns that are holding geometry, and we'll stick the raw geometry into the byte type. And that will be just enough affordance for geo-aware tools to look in the metadata and say, oh, wait, there's geo here, and we can pull it out. Fortunately, for the Parquet and Iceberg formats, that has been a very short interregnum.
Starting point is 00:18:50 Sometimes these things last for years: five years and we're still playing with these hacks. I think GeoParquet and Geo Iceberg lasted for about a year and a half before the mainline projects came along and said: you know what we need? We need a geometry type and a geography type. We should just formalize that. So yeah, that was formalized, I think,
Starting point is 00:19:09 nine to twelve months ago. Both Parquet and Iceberg got around to saying, yeah, we have an actual type. And that meant the metadata could live in proper slots, where you'd expect it to be, which in turn made it possible for engine developers to do things like much faster scans and pruning of these files. So it's been good all around, and you can forget about GeoParquet and forget about Geo Iceberg, because they're pieces of history now. The geo stuff is now actually embedded in the standards proper. That's great news.
Starting point is 00:19:35 They served their purpose. Yes. So, yeah, pg_lake is a way of pushing data out of Postgres to object stores, where these Parquet and Iceberg engines can scan it. And the promise of that is that if you pump your data out in, hopefully, an ordering which is good for the bulk queries you expect to apply to the data, you can then hit those object store data lakes, if you will, God help us, with an engine which understands those formats and can slam through them in massive parallel.
Starting point is 00:20:10 And, you know, one of those engines is DuckDB, a single-node engine which is very good at that. One of those engines is Snowflake, a multi-node, highly parallel engine which is also really good at that. Another one is Databricks. It's all the same gig: this separation of compute and storage. The compute comes to the object store and rips this file off. It is ridiculously inefficient. If you're a Postgres developer, you think: this is really inefficient, you should have an index.
Starting point is 00:20:35 You can find exactly the thing you're looking for. And these compute engines are like: no, I will look at every single record. I will lightly prune the file as I zip through it, but they're willing to scan a terabyte of data to get a few hundred thousand rows of results, which is crazy, unless you happen to have the compute and storage and cloud infrastructure to afford to do that, which these days we do. It's one of the problems of being old: your understanding of what is efficient and correct is receding into the distance. But I still have some doubts and questions about how stable performance for such queries is. It's maybe not a fully resolved problem.
Starting point is 00:21:10 I agree. I find it really weird that you have to reformat or resort your data to match your use case. And it's great when things work out: you're sloughing your data off the transactional store, off to the object store, usually in time series order, right? Time of most recent change. And it so happens that your queries are either grouping on time or filtering on time, and then the underlying format metadata allows the engines to really throw away 90% of the data upfront. That's great. But I'm used to the power of SQL and a real OLTP engine, where you can address any combination of tables and joins, and you're not
Starting point is 00:21:49 going to be hit too hard on efficiency as long as your indexes are in place. And it sounds reasonable for column stores and for purely analytical queries. But when you, for example, come back to this KNN, k-nearest neighbors, problem: we just need, like, 10 rows, but we have millions, right? Or billions. Yeah, how does that work? Does it even work? Can you even do a KNN in those systems?
Starting point is 00:22:14 You end up making some kind of compromise with reality. If you sort your data on a key, like a Hilbert code, or something that has spatial collocation to it, then you can do that kind of scan a little better, because you know where to hop in the Hilbert code. DuckDB, for its internal implementation, I think just sucked it up and put a lightweight index next to the table, so they're not actually doing the full column store scan for those particular kinds of queries.
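One readily available locality key in PostGIS is a geohash. A rough sketch of resorting a table before export so that spatially nearby rows land together (table name invented):

```sql
-- Rows sorted by geohash cluster spatial neighbors together, so downstream
-- engines can prune far more of the exported files on spatial queries.
CREATE TABLE roads_sorted AS
SELECT *
FROM roads
ORDER BY ST_GeoHash(ST_Transform(geom, 4326));  -- ST_GeoHash expects lon/lat
```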
Starting point is 00:23:00 As database people who have put a geospatial type into a standard relational database, we like to say spatial isn't special, because you just type SQL and you get your answer back. But at an implementation level, there's no getting around it: spatial is special in a whole bunch of ways, just by being higher dimensional. It's the same thing people run into with pgvector: once you get beyond that one dimension of sort order, a lot of your preconceptions about the most efficient way to deal with the data fall away, and you have to rethink problems. One difference seems to be that people aren't as forgiving of accuracy issues when it comes to spatial data. You know, you don't have approximate nearest neighbor. That is 100% true. Yeah, people are very unforgiving with maps. One mistake and you're wrong, whereas you roll up a BI dashboard and give people some pie charts, whatever. It's actually one of my great
Starting point is 00:23:40 shaking-fist-at-cloud things about the big column stores: people insist on summing up every one of 5 billion records to make a pie chart. You could have sampled and got the exact same pie chart. I feel like we've dived into the most complex type of performance issues, but I come across performance issues when it comes to PostGIS work, a few customers that have the odd specific case, and I think it's even the basics that they have questions around,
Starting point is 00:24:08 like geometry versus geography, or GiST versus SP-GiST. Any advice either of you have for people that are new to this stuff, on the basics, on the simpler queries: how should they be thinking about these things? The absolute basic is how the R-tree is implemented,
Starting point is 00:24:24 and what an R-tree is compared to a B-tree. This is like the gem of Postgres, actually, right? So maybe let's talk a little bit about some basics in the middle of
Starting point is 00:24:41 the discussion. I think that's most normal users: they just want to say, I can write this and I get this, and they don't care about the underlying implementation. You think so? There are different users. I don't know. Maybe. But, okay, let's mention that Postgres has this GiST, the generalized search tree.
Starting point is 00:24:58 And the R-tree is implemented using it, not separately. Correct. And, yeah, there is great work by Hellerstein from Berkeley, from the late '90s. It's basically a generalization of the B-tree idea to multidimensional space: two-dimensional, three-dimensional, N-dimensional. Well, it's defined more or less by the API, right? Not so much by the space. Whatever space you're in, as long as you can meet the conditions of the API. Seven functions, right?
Starting point is 00:25:28 Yeah: given a key, can you say whether things are consistent or not with that key? That's one piece of the API. Another is: given a collection of keys, can you split that collection? Depending on your dimensional space, those splits will be calculated differently, but it's just: given a set, can you do a split for me?
Starting point is 00:25:49 Yes. Okay, can you do key consistency? And I'm trying to think what the other API pieces are; I feel like there's a third. But regardless, with the rules of the GiST API, you can implement an R-tree. The kinds of questions that an R-tree answers, the kinds of guarantees an R-tree gives, are ones that fit inside the generalized API of GiST. So it means that you don't
Starting point is 00:26:09 need a separate access method for an R-tree. You don't actually need a separate access method for B-tree either; in fact, there's a B-tree GiST implementation in contrib which you can look at. Because it's more generic, there are fewer opportunities for performance optimizations, but the beauty is that for almost any type, because the GiST conditions are really, really loose and generic, you can get an access method without having to think about all the problems of pages and consistency and so on. That's one of the things that has made, say, a tool like pgvector a complex thing to approach, one that only someone with serious computer science chops would want to
Starting point is 00:26:46 attack, because they had to write their own access method right down to disk. They did not get to just layer their access method on top of an existing implementation like GiST. They had to write it all the way down to the disk and page level, which is a lot harder to do if you're going to retain guarantees of consistency.
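You can actually see that API surface in the system catalogs. A small sketch that lists the GiST support functions PostGIS registers for its 2D geometry operator class; it assumes PostGIS is installed, and the operator family name gist_geometry_ops_2d matches what recent releases ship:

```sql
-- Enumerate GiST support functions (consistent, union, penalty,
-- picksplit, same, ...) registered for the PostGIS geometry opfamily.
SELECT ap.amprocnum, ap.amproc::regproc AS support_function
FROM pg_amproc ap
JOIN pg_opfamily opf ON opf.oid = ap.amprocfamily
JOIN pg_am am        ON am.oid = opf.opfmethod
WHERE am.amname = 'gist'
  AND opf.opfname = 'gist_geometry_ops_2d'
ORDER BY ap.amprocnum;
```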
Starting point is 00:27:10 All right. So you mentioned btree_gist, I guess it's called, or gist_btree, I always mix it up. Yeah, btree_gist. And it loses in comparison to the native implementation of B-tree, while the R-tree didn't lose; it won. Yeah, yeah, yeah. And I'm not sure if that's just, probably that's more a matter of people not wanting to spend the time to work on the old R-tree implementation. It was considered not a good implementation at the time that we started. It still existed when we started with PostGIS, but it was more tightly bound to the native spatial types in Postgres. And so we didn't use it.
Starting point is 00:27:42 No one else used it. Eventually it was ripped out. It had, I think, all the way up to the time it got taken out, an object size limitation, so it couldn't handle objects larger than a page size. That was, I think, the death knell, really, because practical work with spatial data
Starting point is 00:28:03 inevitably involves dealing with objects that exceed the page size. So I wanted to offer a small counterargument for why it matters to understand things slightly more deeply than "let's just use it." Because, for example, in this particular case: B-tree is less-than, greater-than, equals; R-tree is contains, is-contained, right?
Starting point is 00:28:24 Overlaps. And if you use this btree_gist extension, you can combine them and have a multi-column index over very different data types, right? Yeah, and I use that a lot too. Yeah, and this is a great performance trick. If you know why it's useful, it can save you, because it's just a single index scan, that's it. So you use it a lot.
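A minimal sketch of that trick, with an invented listings table: the scalar column and the geometry share one GiST index.

```sql
CREATE EXTENSION IF NOT EXISTS btree_gist;

-- One composite index, one index scan, for mixed scalar + spatial filters.
CREATE INDEX ON listings USING gist (status, geom);

SELECT id
FROM listings
WHERE status = 'active'
  AND ST_DWithin(geom, ST_SetSRID(ST_MakePoint(-71.06, 42.36), 4326), 0.01);
```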
Starting point is 00:28:47 Well, if you have things where there are a lot of attributes that you have to filter against, but you also need to do a spatial filter at the same time: especially with the sensitivity that Postgres has, it sometimes picks the wrong plan, whereas when you have it all together, it can't screw up. So there are some cases I've had where, if I had just a GiST index and a B-tree separately, the planner would do something really stupid and try to use the spatial index and not use the B-tree at all.
Starting point is 00:29:23 And so the performance: you'd go from minutes to milliseconds in those cases. Yeah. Yeah. I had the same experience, more than once.
Starting point is 00:29:43 And I also thought Postgres would be able to do a bitmap scan. I thought it would be able to use the GiST index and the B-tree separately. I know it wouldn't be as efficient, still, but it seemed to be refusing to do so for me. So I also wonder if there's something missing on the Postgres side in considering them for bitmap scans, as with multiple indexes under an AND condition. I thought it would consider it, but I really struggled to get it to. I can also complain: you take pgvector, and you cannot combine it with GiST,
Starting point is 00:30:05 basically. Yeah, that's super hard. The core use case for pgvector, find me the nearest thing, really does not fit nicely into the executor. One case works really fast, another doesn't. Yeah. What about for beginners?
Starting point is 00:30:22 Regina, you've written some books on this. How do you describe performance stuff for beginners, in terms of practical tips? What mistakes do you tend to see people making? I think the classic one is people trying to do distance filtering using ST_Distance. So they say ST_Distance is less than 50, instead of using ST_DWithin. ST_Distance can't use an index, so their queries might take minutes.
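The two query shapes side by side, sketched with invented names; distances are in the units of the geometry's coordinate system (SRID 26918 is UTM zone 18N, in meters):

```sql
-- Slow: ST_Distance in the filter defeats the spatial index, so every row
-- gets a full distance computation.
SELECT id FROM parcels
WHERE ST_Distance(geom, ST_SetSRID(ST_MakePoint(500000, 4649776), 26918)) < 50;

-- Fast: ST_DWithin is index-aware; the GiST index prunes almost everything.
SELECT id FROM parcels
WHERE ST_DWithin(geom, ST_SetSRID(ST_MakePoint(500000, 4649776), 26918), 50);
```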
Starting point is 00:30:52 And then when they switch to ST_DWithin, it finishes in milliseconds, that kind of thing, because it would have to scan all the records if you do ST_Distance. That's a classic mistake that people make. I'm curious: does AI make this mistake or not? And what's annoying is that in SQL Server it's different: you can actually use ST_Distance, and it uses an index. So what are they doing in SQL Server? They must be rewriting the query.
Starting point is 00:31:21 Yeah, they must be rewriting the query. So I was surprised: oh, yeah, you can do distance less than 50, and it will do the right thing. It'll do pretty much what our ST_DWithin does. That's interesting. Well, query rewriting is kind of deep. But the fact that they do that is interesting
Starting point is 00:31:39 just in and of itself. It speaks to them either presupposing a problem their users are going to have, or something their users really did have, constantly, so it was worth it for them to go to the extra work of doing a query rewrite for that case. Yeah, I noticed that Postgres does do
Starting point is 00:31:57 query rewrites too, right? It does. There's an affordance there; we could go muck with the plan if we wanted to. Yeah, if you do IN, it changes it to ANY, right? Yeah, yeah. So in theory, we could do a rewrite somewhere. It's just: where would that end? Yeah, I mean, that's adding a certain amount of brittleness to the system, where you don't
Starting point is 00:32:15 know that the users actually need it. Anyway, your example presupposes that they have a spatial index in the first place, which is what I would have raised as the number one new-user mistake. Oh, that's true. Yeah, that's the other thing. It's quite an easy mistake to make, because by and large people don't think about indexes as having a type. If they've come from the SQL Server world or wherever, the idea that they need this extra keyword, the USING GIST keyword, to get a spatial index
Starting point is 00:32:46 is not going to be super obvious if they've just come through some other database. And if you type CREATE INDEX blah on a geometry column, it'll do something. It won't error out, because we've had to provide lightweight B-tree bindings in order to get aspects of sortability for our type. So it's not like it'll just say, oh, I don't have a B-tree binding. It'll go and build a really terrible index that's no good for you, and quietly return. So I think that's one of the biggest gotchas: just having a proper spatial index.
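The gotcha and its fix, sketched with an invented table name:

```sql
-- Quietly builds a B-tree on the geometry: it succeeds, but is useless
-- for spatial predicates.
CREATE INDEX parcels_geom_btree ON parcels (geom);

-- What you almost always want instead:
CREATE INDEX parcels_geom_gist ON parcels USING gist (geom);
```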
Starting point is 00:33:09 Yes: knowing that you need an index, and knowing that you need it to be a GiST index. Yeah. As for choosing between GiST and SP-GiST, I'd still tend to go toward GiST, because I know we put more effort into GiST
Starting point is 00:33:27 than we did into SP-GiST. Same. I think our GiST support is more robust than SP-GiST. Yeah. I have tried different benchmarks and come up with different answers from different data,
Starting point is 00:33:38 different queries. I've not been able to fully characterize the kinds of places where SP-GiST outperforms. I don't know, Nik, you seem really excited about SP-GiST. Are there places you've found
Starting point is 00:33:49 where SP-GiST is the clear winner? No, somehow I just don't see these kinds of workloads recently at all, maybe for a couple of years, so I don't have fresh data. I don't know why. Obviously, it's just a matter of luck.
Starting point is 00:34:04 I came across one recently where they were using a GiST index and still struggling with performance, and I suggested trying SP-GiST because I thought it might make sense, and it didn't help. It wasn't much worse; it was very similar.
Starting point is 00:34:18 But the one thing they changed that did help a lot: they were using geography before, and it was only to calculate quite small distances, like within 25 meters. And switching to geometry made a big difference. Yeah, I'm not surprised by that. You two are not
Starting point is 00:34:36 surprised at all. Because geography is more complex. It's three-dimensional, whereas geometry is two-dimensional, really. I have a slide in my standard PostGIS talk where I say, you know, geometry or geography: which do I use? And the thing that I use to demonstrate why geometry is going to be better for performance purposes is just the distance calculation. Everyone has learned Pythagoras, so we know what the distance calculation looks like there, right? Two squares and a square root. Any distance calculation in geography space is going to have to do a bunch of haversine calculations, which is distance on a sphere. And haversine has five transcendentals in it, plus the square root. So it's just way more, like orders of magnitude,
Starting point is 00:35:16 more computationally intensive. So if you're doing a bunch of those calculations, you're just going to pay a big price, for sure. The earth is flat. The earth is flat, I knew it. In an ideal world, it's flat, yes. So that means you need to commit to a planar understanding of your work area if you can. If you are doing an app which works in London, there's a planar projection which is good for London. If you're doing something in Idaho, there's a planar projection which is good for Idaho.
Starting point is 00:35:45 If you're doing something which covers the continental US, there is a planar projection which is good for most people's purposes in the continental US. It's very rare that there's something which requires latitude-longitude. You only get the breakdowns when you're starting to work with truly global data: places where you go over the poles, places where you have data in the Northwest Territories and also in South America, where the scale distortions of something like Mercator cannot be ignored. That's when you say: okay, fine, I'll use geography. And then you'll get some nice advantages out of geography. If you have truly global data, the fact that the shortest path might go over the North Pole, that's a useful thing to have. Or go across the dateline.
Starting point is 00:36:23 That's a useful thing for the system to transparently handle, which it does, both for the calculations of things like distance and area, and for the performance-based stuff like indexing. It does not care that it goes over the poles; it goes around the dateline and doesn't notice, because the model just takes away those singularity points when we're working on a sphere. Nice.
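A sketch of the trade-off with a single pair of points: the same distance computed on the spheroid via geography, then in a local planar projection via geometry (SRID 27700 is the British National Grid; the coordinates are two arbitrary London locations):

```sql
-- Spheroid math: correct anywhere on the globe, but transcendental-heavy.
SELECT ST_Distance(
    'SRID=4326;POINT(-0.1276 51.5072)'::geography,
    'SRID=4326;POINT(-0.0877 51.5080)'::geography);

-- Planar math in a projection chosen for the work area: plain Pythagoras.
SELECT ST_Distance(
    ST_Transform('SRID=4326;POINT(-0.1276 51.5072)'::geometry, 27700),
    ST_Transform('SRID=4326;POINT(-0.0877 51.5080)'::geometry, 27700));
```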
Starting point is 00:36:40 I think the reason I was surprised was more because I'm so used to queries being I/O-bound, and it's quite refreshing to get one where the CPU can actually matter. That is one of the places where spatial is special, and people don't notice it. And it's a weird one. In trying to integrate with the Postgres planner, we've been trying to help the planner get smarter by costing our functions appropriately.
Starting point is 00:37:09 But the planner only looks at function costs in filters and joins, and a lot of the time people are doing the spatial calculations in the SELECT list, right? The return values. So if you have a complex function in your return set, Postgres is not going to add that in and say, oh, I should do a different thing, there's a lot of cost here. So that's an issue. And then there's actually a real problem on the I/O side as well. I did a blog post for Snowflake about two weeks ago that went public, talking about that with respect to JSONB.
Starting point is 00:37:41 But I learned about it in the geography-geometry world, which is: you get a big object, Canada, right? Canada has, like, 10,000 vertices. It's well above the maximum object size that you'll find in a page, so it's going to be TOASTed. And that means any retrieval of that object is going to be a two-stepper: it has to go and find the TOAST pointer, then go to the TOAST tables, gather all the parts, put them together in order, and then decompress all that, just to get the object out. So it's a combination of the I/O to pull that across, but also the computational overhead of just reassembling the thing. It's about 10x, I found, in joins: you can get things to go 10x faster without that TOAST overhead, if I pre-cut everything into shapes which are smaller than the TOAST threshold.
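PostGIS ships a function for exactly that pre-cutting. A minimal sketch, with invented table names; 255 is a maximum vertex count per piece:

```sql
-- Split big polygons into pieces small enough to stay inline in the heap
-- page instead of spilling to TOAST; joins then skip the two-step fetch.
CREATE TABLE countries_subdivided AS
SELECT id, ST_Subdivide(geom, 255) AS geom
FROM countries;

CREATE INDEX ON countries_subdivided USING gist (geom);
```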
Starting point is 00:38:46 Yeah, speaking of TOAST: is it one gigabyte, the limit on TOAST? People have already started complaining about that size. It's for varlena types, so it's one gigabyte, I think. Yeah, yeah. But it's better not to go beyond 300 megabytes, I think. There are benchmarks showing that for JSON it's terrible. Oh, I can't imagine. Yeah. There is a drop in performance after 300 megabytes or so. That's a lot of JSON, though. Yeah, I forget what the reason was. It had more to do with outputting stuff.
Starting point is 00:39:09 I also forget. Like, if you're trying to output a big MVT file. The whole world in one file? Yes, the level-zero tile. Does it compress well? I was just thinking: JSON generally has a lot of repetition;
Starting point is 00:39:23 a 300-megabyte post-compression JSON file would be pretty massive. This is one of the places where there are potential implementation paths which would have given us
Starting point is 00:39:35 a much better I/O profile, but at the cost of something which users consider pretty inviolate, and that's perfect fidelity of input and output. Users are feeding the system geometries which are described in terms of coordinates built on double precision, and if you want to return what they gave you, you have to store the double precision. And double precision does not compress so great.
Starting point is 00:40:03 So, in answer to your compression question: nah, not so great. But if you just take the smallest swing at saying this stuff is quantized, that it has an effective minimum precision (your road segment is not more precise than a centimeter), and if I take advantage of that fact, if I know that there is a quantization available, I can squash stuff down into a delta encoding, which is incredibly cheap for spatial data, because spatial data tends to be highly autocorrelated in space. So you can store the difference between each coordinate with a very small number, which
Starting point is 00:40:38 is great: it's really good compression, and it's very fast decompression. But you have to give up precision to take advantage of it. Or if it's a lot of moving objects on the map, this could be columnar compression, like TimescaleDB does, so the position just changes a little bit;
Starting point is 00:40:57 we cannot jump to a different continent immediately, right? Yeah, but there will have to be precision loss associated with that: they will be losing a little bit of precision when they compress. You can't just slam floats or doubles together cleanly all the time
Starting point is 00:41:15 and get a nice clean delta encoding. Sometimes you can, but not always. Similar with timestamps, actually: if you're willing to give up milliseconds, you can have fewer bytes, definitely. Speaking of numbers, I'm curious: what are the biggest PostGIS setups you've seen, both single node and maybe sharded systems, like with Citus or something?
Starting point is 00:41:39 Regina, you're the implementer. What have you seen? I have no idea. I don't work with that big of data sets. Yeah. So PostGIS is a weird member of the Postgres community, because we have, I think, almost certainly the largest number of deployed clusters. Like, the number of PostGIS users is huge.
Starting point is 00:42:03 I remember Stephen Frost came to his first FOSS4G, and his jaw sort of hit the floor, because he was used to going to a Postgres conference where there'd be like 200
Starting point is 00:42:24 attendees, and he came to one of the Free and Open Source Software for Geospatial conferences, and there were 800 attendees. And every one of them he talked to said: yeah, yeah, we use PostGIS. It was like he'd never seen so many
Starting point is 00:42:40 distinct Postgres users in one place. So that's one difference: the population of PostGIS users is huge relative to the general population of Postgres users, but they all have very small deployments. So, yeah, if you're a county, you've got several dozen tables, none of which has more
Starting point is 00:43:12 than 100,000 records; from the point of view of scale, it barely registers. But there are a few big ones. It's just that you don't necessarily know they exist, because no one talks about them. Like, I'm pretty sure Apple Maps' production system is all on PostGIS, Postgres, but that's only because someone whispered it to me in a dark alley; no one's ever made it public. I know that HERE, which is a really large geospatial data aggregator (if you're going to get raw geospatial data and want a map which competes with
Starting point is 00:43:29 what Apple or Google provides, you probably go to HERE and license their data), I know that their production pipeline and a lot of their APIs are built on PostGIS. So that's a really big user: global data, high uptime requirements. And then one that repeats over and over again, so there are a lot of these installations,
Starting point is 00:43:52 and they're big and high performance, is people doing OSM map builds. OSM means OpenStreetMap. OpenStreetMap is a map of the world based on a very strict topological model, edge-node collections, which means that anything which covers an area has to be built from those edge-node relationships. That implies loading the whole edge-node system, then building up the polygons from the parts, and then testing the polygons against each other. So people end up using PostGIS as the intermediate stage between "I've got a lot of raw OpenStreetMap data" and "I have a rendered tile or a vector tile." There's a big PostGIS, Postgres process in the middle of that, shuffling through all the data and building up the higher-level structures that just don't exist in the raw OSM dump. And there have got to be hundreds and hundreds of those. But they all look about the same size: it's the size of OSM.
Starting point is 00:44:21 More than hundreds: thousands of them. Okay, yeah. I also saw in your blog posts Redfin, State Farm; those setups should have a lot of data, right? They do. And real production loads: they're actually asking real questions of the spatial database and getting
Starting point is 00:44:40 spatial answers back. And also state level, government level, like in France, Norway. Yeah, that was our biggest installation for a long time, the biggest one I could, like, quote chapter and verse on. And they came on very early, in 2005. One of our first big institutional users was IGN, the French national mapping agency. They ran Postgres and PostGIS through a competitive process against DB2 Spatial and Oracle Spatial at that time, 2005, and determined that Postgres and PostGIS were,
Starting point is 00:45:12 in their words, just as good as DB2 Spatial or Oracle Spatial, and, most importantly, way cheaper. And way easier to deploy: they didn't have to go through a full commercial requisition; they could just say, okay, we're going to use it. And they did. So IGN has been on Postgres and PostGIS for 20 years. And that initial database was a 150-million-object database.
Starting point is 00:45:33 At the time, very substantial. Nowadays it's rounding error, but at the time it was one of the biggest databases. And you needed to be really brave in those days, because of the
Starting point is 00:45:48 lack of replication and proper backup. Yeah. Today it's much, much better. Much better. One of their requirements was multi-site replication; they wanted to have, I think it was Slony to start with, yeah. Yeah. Good.
Starting point is 00:45:59 Okay. I'm glad that's now relatively solved. We still have problems with very large setups, but it's already a different story. Great. Is there anything we haven't talked about that either of you wanted to
Starting point is 00:46:11 make sure we mentioned? I guess I should mention conferences. There's going to be a PGDay Boston conference June 9th. And then the conference that Paul was mentioning, the Free and Open Source Software for Geospatial one, is going to be in Japan this year. Let's see: that's Hiroshima, August 30th to September 5th.
Starting point is 00:46:35 Yeah. Oh, and then there's also FOSS4G NA, so that's coming up. But I don't think we've set the date yet; it's going to be probably in November or October. I really appreciate that you run PostGIS Day the day after GIS Day. And I've actually watched a few of them. You run them all online, which is great. And yes, I've seen quite a few.
Starting point is 00:46:58 I tend to at least watch the talks on performance, my special interest, and anything by Brian Timoney as well. I wanted to give him a shout-out; I've enjoyed his talks. Yeah, PostGIS Day: if you just search for it, you'll find the video archives. So if you're jonesing for PostGIS content, that's the place to get it, for sure. Yeah, it's awesome.
Starting point is 00:47:20 I highly recommend it. Any last things from you, Paul? We haven't talked about AI, which feels like a huge miss. Yeah. And I don't know what your experience with this is. Nik's next company has AI in its name. Oh, yeah.
Starting point is 00:47:33 So he's really into it. That's good. And I wanted to point out a couple of things that I've seen in the AI space. One is, in the geospatial AI space there's this whole other field called foundation models, which is basically like RAG (retrieval-augmented generation) for geospatial, where they build up a characterization of an area, or the whole globe, where every pixel in that characterization has a multidimensional,
Starting point is 00:47:58 very high dimensionality vector associated with it. So it really is a RAG process. And people have been using pgvector as the storage engine for their foundation models, and to do quick gatherings of nearest neighbors. And when you do nearest neighbor with a foundation model, what you get is land which is similar to this other land. So you can draw a circle around, say,
Starting point is 00:48:19 a bunch of solar panels and say: show me the other solar panels. And it will just go off and find every solar panel in the world for you. Very cool stuff. And there's a cool talk on it from the 2025 PostGIS Day, really worth looking at. The other thing that came up at that PostGIS Day around AI was the idea of natural language
Starting point is 00:48:37 to SQL. And, swing a cat or throw a stone, you'll hit someone who's doing natural language to SQL out there. But a really cool observation from Brendan Ashworth from Bunting Labs during his talk: how you get a coding model is, first you train a generic model, and then you do reinforcement learning against it. And the way you get a code model which can speak SQL is a lot of reinforcement learning against standard SQL patterns. And his observation is that the standard SQL patterns tend to be Postgres patterns. So you tend to get really good Postgres SQL, and maybe not-so-good Oracle SQL. And the same thing holds true on the spatial side, because they're using
Starting point is 00:49:15 Postgres: because PostGIS is the lingua franca for geospatial SQL, you tend to get good spatial SQL generated by these models. And the real limiting step tends to be providing the correct amount of semantic metadata around your data model, so that the LLM can make sense of the natural language question you're asking and spit out the right SQL. But you're already ahead of the game if you're using Postgres with PostGIS as your engine, because these models will tend to produce SQL that matches them. It's like how writing Python is easier with LLMs, because they know it much better than newer languages.
Starting point is 00:49:55 Yeah: give me Perl 6. Yeah, and that's why Postgres SQL comes very naturally: so much material was trained on it, and with PostGIS it's the same thing. That's a great point. So it should help grow popularity even more, because a lot of applications are created by AI these days. But I'm curious: if you deal with a lot of data, you need to verify. If you don't have a way to verify all the ideas AI is generating, the quality will not be good. So one thing we always do is give AI tools to verify its own ideas, very fast, without a human. Before
Starting point is 00:50:35 a human is involved, it's already much better. So I'm curious what you think about this: how to write proper queries and not use that distance function you mentioned. The good thing is, as Paul said, it does write the right distance function.
Starting point is 00:50:50 It does do ST_DWithin if you give it text like "give me something within 50 miles." It writes the right query; I have tested that. Good, but it still would be better if it had some tooling to test and see actual results, not just guess.
Starting point is 00:51:07 Yeah, and we've started to see that discipline arrive in the PostGIS codebase. We have our first agents file. Oh, yeah, we have that SKILL.md file, sorry, yeah. So, yeah, we have our first skills file, which does touch on the kind of stuff you're talking about, Nik: these are best practices for generating spatial SQL.
Starting point is 00:51:39 Yeah, in my opinion, copy-on-write and database branching should grow, because we need to be able to test on large data sets for very little money and very fast. It can be on S3, like some companies implemented it (Neon or Timescale implemented it); it can be with ZFS or anything. But it should be fast and cheap. So you give it millions of points on the map, and it can just check EXPLAIN (ANALYZE, BUFFERS). Or, Nik, did you see that
Starting point is 00:51:57 friend of the show Radim posted a new blog post recently about just importing the stats, so you don't even need to load all the data? Yeah, I saw it. Just test planner behavior without data, actually, just with stats. I think it's a valid approach, but at the same time I think it's limited, because you don't see actual execution, right?
Starting point is 00:52:17 Of course, but you can at least check index use and things, which is quite cool. Yeah, definitely. It's a super lightweight approach; I like that. And I think more and more of this should be developed because of AI; otherwise, there will be so many mistakes. But if you have this testing pipeline, you're not even involved: AI has some ideas for how to improve a query, and tooling to verify them. This is the way. And put it into CI and so on.
Starting point is 00:52:42 And it's great. Yeah. So that's why we created branching many years ago with DBLab: we were sitting and waiting for these very days, when it's needed so much. Yeah. So, great. Well, it was a pleasure to meet you both. Thank you so much for coming.
Starting point is 00:53:00 Thank you for joining us. Thank you for coming. It was a nice talk. Thanks, Nik. Thanks, Michael. Yeah. Have a great week. Bye-bye.
