CppCast - GDAL and PDAL
Episode Date: February 18, 2022

Howard Butler joins Rob and Jason. They first talk about an announcement from Swift on a C++ interoperability workgroup. Then they talk to Howard Butler about the C++ geospatial libraries GDAL and PDAL, and his involvement with geospatial development.

News: Swift and C++ interoperability workgroup announcement; The mystery of the crash that seems to be on a std::move operation; How we used C++20 to eliminate an entire class of runtime bugs

Links: hobu; GDAL; GDAL on GitHub; PDAL; PDAL on GitHub; Cloud Optimized Point Cloud

Sponsors: Use code JetBrainsForCppCast during checkout at JetBrains.com for a 25% discount
Transcript
Episode 337 of CppCast with guest Howard Butler, recorded February 15th, 2022.
This episode of CppCast is sponsored by JetBrains. JetBrains has a range of C++ IDEs to help you
avoid the typical pitfalls and headaches that are often associated with coding in C++.
Exclusively for CppCast, JetBrains is offering a 25% discount for purchasing or renewing a yearly
individual license on the C++ tool of your choice, CLion, ReSharper C++, or AppCode.
Use the coupon code JetBrainsForCppCast during checkout at www.jetbrains.com. In this episode, we talked about Swift and C++ Interop.
And we talked to Howard Butler from Hobu.
Howard talks to us about the geospatial libraries GDAL and PDAL. Welcome to Episode 337 of CppCast,
the first podcast for C++ developers by C++ developers.
I'm your host, Rob Irving,
joined by my co-host, Jason Turner.
Jason, how are you doing today?
I'm all right, Rob. How are you doing?
I'm doing all right.
My son turned 13 today,
so I am feeling a little old from that, but I'm doing okay.
That means we've been doing this podcast for more than half of your son's life, it seems.
Yeah. Wow.
Does that make you feel older again?
Yeah, it does. How about you? Got anything going on?
Well, I just had that moment myself when I released episode 311 of C++ Weekly,
which, as probably a lot of our listeners know, I have not missed a single week.
So knowing that I've been doing it for 311 weeks,
at first I wanted to feel a sense of pride, but then I just felt old instead.
All right. Well, at the top
of every episode, I'd like to read a piece of
feedback. We got this message
on Discord from
Eli or Ellie saying
this is not a very important piece of feedback, but
I like it a little better when you let the theme
song outro play for longer before starting
the podcast audio proper.
I think it might be because the first minute and change
of every podcast used
to be exactly the same and it became sort of comforting for me to follow along with that.
So yeah, we did start using an editor recently, and I guess they just did a little bit of a different
style with editing the intro music with the intro of the show. But I will pass on that feedback to
the editor because there could be other listeners also, you know, frazzled by change. So did we actually talk about why you're using an editor now on the podcast? I'm not sure if
we had on the podcast, but you know, we have had a Patreon for a while for the show and we put out
a goal when we first opened up the Patreon that if we ever get to a certain level of support,
that we would start using an editor as opposed to me
editing the show myself. And we did hit that goal earlier this year.
And it's starting to work out?
Yeah. Oh, no. I think the editor has been doing a really great job.
All right. We'd love to hear your thoughts about the show. You can always reach out to us
on Facebook, Twitter, or email us at feedback at cppcast.com. And don't forget to leave us
a review on iTunes or subscribe on YouTube.
Joining us today is Howard Butler.
Howard is the founder and president of Hobu, Inc.,
an open-source software consultancy located in Iowa City
that focuses on point cloud data management solutions.
He's an active participant in the ASPRS LAS Committee,
a Project Steering Committee member of both the PROJ
and GDAL open source software projects,
a contributing author to the GeoJSON specification, and a past member of the OSGeo Board of Directors.
With his firm, Howard leads the development of the PDAL and Entwine open source point cloud processing and organization software libraries.
Howard, welcome to the show.
Thanks for the introduction.
Excited for the opportunity to talk about geospatial stuff and C++ with you guys.
How did you decide to create a company founded on open source stuff?
Because that sounds risky to a lot of people, I think.
There wasn't a decision so much as I was doing consulting and my wife and I moved towns.
We moved from Ames to Iowa City, and part of the job search, I wasn't doing so well
finding kind of what I wanted to do for a job,
and I'd been very active in open-source geospatial software.
And so I thought, well, maybe I can just try to make a business
out of doing this where people need features added to software
or support and capability or help navigating the community.
I started the consultancy doing that.
It was kind of like a moonlighting sort of thing.
And eventually we started developing software in the point cloud and LIDAR domain, which
at the time was a very small niche.
It's grown quite a bit now.
And so people would find us to add capability to the software, and they were building systems with our tools.
And so the business just kind of grew from there.
How long ago was that?
So Hobu actually started as a consultancy maybe 20 years ago, but we started very heavily focusing on open source and LIDAR stuff in 2008 and 2009.
2008 and 2009. Okay, cool.
How big is the company now?
We're five employees and have one employee
in Peterborough, New Hampshire,
and then the rest of us here in Iowa City.
Yep, always looking to find good engineers
like everyone else, but...
So you are hiring right now?
Of course.
Everyone's unicorn hunting.
It does seem like it's verging on impossible to find
new programmers right now because everyone has a job that they want or there's not enough people
or something. Yeah, I also think, you know, we're in Iowa, which is kind of a challenging environment
too, just because there isn't as big of a community. And so, you know, the buy or build
solution is mostly build, right? You want to hire young engineers and mentor them forward either in your development practices
or teach them the culture that you're trying to get them to participate in.
And then they grow up and leave, of course.
But that's one way to move forward with trying to get the talent that you need.
All right.
All right, Howard, we got a couple news articles to discuss.
Feel free to comment on any of these and we'll start talking more about these geospatial libraries that you're involved with. Okay.
Okay. So this first one is Swift and C++ interoperability workgroup announcement.
I had not heard of this, but we had a listener reach out to us over Twitter, Jason. Actually,
I think listener and former guest telling us that we should talk about this one.
And yeah, it seems really interesting.
The Swift development is going for
bidirectional interoperability with C++?
Yeah, I mean, examples here are
things like standard string and standard vector
without having to go through a memory copy
C binding layer,
which is generally what has to happen with other languages like Java, Python, whatever.
Are there other languages with this type of interop?
As far as I can recall, only D, from what Andrei Alexandrescu told us when he was on.
Because he was talking about basically when D and C++ interop works
by relying on knowing what the calling conventions are
for the C++ compiler that you happen to be trying to interop with.
And so that's really getting into the guts of things there.
It's the only other language I know of that does that.
Seems really cool.
And I don't think we've ever talked about Swift in any detail on the show,
but maybe this is a good opportunity to try to get someone from the Swift community
who's involved in this C++ interop because it seems really cool.
You have any thoughts on this one, Howard?
You do anything with other languages that maybe interop with C++?
Certainly, you know, the libraries we develop have interop with things like Python, done some Java.
The deep bindings approach of doing that is always pretty painful for everyone.
Somewhere across that boundary, somebody gets to be responsible for who owns what when.
And that's always the problem.
So things that make that better would always be exciting.
I don't know that there's ever really been a silver bullet for it, though. Every generation has had its own sort of approach to doing that kind of interop, and it always ends up being a compromise in places that you don't want. And then it kind of gets redone the next generation in a little bit different way, with slightly different compromises, and you're kind of back in the same spot.
All these kinds of issues are why I created ChaiScript in the first place, because I wanted a scripting language that was designed for interop with C++. And just for the record, I don't maintain it anymore, so I'm not trying to convince anyone to use it. It cost me far more money to write and maintain it than I ever made from it.
All right. Next thing we have is another post on The Old New Thing, Raymond Chen's blog.
And this one is the mystery of the crash that seems to be on a std::move operation.
Could you go over this one for us, Jason? I love this because even though I had already
read the article once, when I came back to it to review before the show, I'm like, I forgot where the crash was.
but what it comes down to is that the programmer is using an object and moving that object in the
same statement and that was unspecified behavior before C++17
and the way they're doing it,
because there's no guarantee as to what order these things are going to occur in.
In C++17, there is the guarantee that the function itself will be resolved first,
and then its function arguments will be resolved.
But there's still no guarantee as to what order the function arguments will be instantiated in C++17. So this code would be okay in C++17?
Yes, that code would be okay in C++17. Yeah. But I mean, not that it's good code, I would
say just for the record, because it still feels a little sketchy. Like you have to know exactly what
the rules are saying right here
to have any real clue what it's doing.
That's how I feel about it.
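To make that concrete, here's a minimal sketch of the kind of pattern being described; it's not the exact code from Raymond's article, and the Widget and adopt names are made up for illustration:

```cpp
#include <memory>
#include <utility>
#include <vector>

struct Widget {
    static std::vector<std::unique_ptr<Widget>> registry;
    void adopt(std::unique_ptr<Widget> self) {
        registry.push_back(std::move(self)); // keep ourselves alive
    }
};
std::vector<std::unique_ptr<Widget>> Widget::registry;

int main() {
    auto w = std::make_unique<Widget>();
    // Uses w (to find the callee) and moves from w in one statement.
    // Before C++17 the compiler was free to construct the unique_ptr
    // argument first, nulling w, and then dereference the now-null w
    // to find the object: a crash that looks like it's "on" the
    // std::move. Since C++17 the postfix expression w->adopt is
    // sequenced before the arguments, so this exact shape is
    // well-defined, if still sketchy.
    w->adopt(std::move(w));
}
```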
All right.
And then the last thing we have is another post on the Visual C++ blog.
And this is how we use C++ 20 to eliminate an entire class of runtime bugs.
This is a good article.
Basically, you know, it goes down to formats, compile time, checking
support, being able to just get rid of tons of possible bugs with their format strings in the
Visual C++ compiler. Yeah, but how this is actually implemented, I think our readers should just go
and read this and let it soak in for how it works. Yeah, we'll definitely put the link in the show notes for our listeners to go read the article.
All right, so Howard, let's start off by talking about what is the GDAL library.
I don't think we've really ever talked about any geospatial libraries before on the show,
so I think it should be a new topic for a lot of our listeners.
Sure. So GDAL, or "GooDAL", or "G-DAL", you might hear it pronounced lots of different ways in many different languages, is the Geodata Abstraction Library, or Geospatial Data Abstraction Library.
It was started in about 1998 by a guy named Frank Warmerdam. Frank had developed a software package or the engine of a
software package called PCI at his former employer. And it allowed people to do raster data. So raster,
just array data, if you use the term raster vector in geospatial, they have kind of geospatial
meanings. But PCI allowed a user to process and manipulate raster data in kind of an engine.
And so it allowed people to construct workflows and they could manipulate data, change formats.
So Frank had developed PCI and it was proprietary.
His employment situation changed and he decided, well, I'll make a PCI version 2.0, except it'll be open source.
And of course, the 2.0 version of everything you ever write is always better than the 1.0 version.
And so he started kind of developing this.
And his business model at the time, I met him a number of years later.
He said, well, I wanted a business model, which he likened it to being the garbage man.
And what he meant by that is there's a set of tasking that is convenient to outsource or contract out,
and format translation is one of those.
Nobody likes to write format translators, be responsible for them,
have to worry about all of the impedance mismatch that goes on with taking data in one system and moving it to another.
And so he saw this as kind of like a business
opportunity in that he could make a living developing these translators for people and
then add processing capability to it. And so his business model was actively doing that until about,
I want to say 2010 or so, a Google recruiter email finally landed for him. And so he started
working for Google for a little while.
And now he works at Planet Labs, where he was part of the team that helped process the data. Planet Labs acquires satellite imagery, and the goal is everywhere on the Earth, every day, right?
And so big geospatial data processing backends and engines. And so GDAL sits underneath all of this.
And so Frank developed it, and it kind of took on a life of its own as other open source
and commercial software products started to pick it up and use it as that kind of underlying
engine, you know, conveniently for format translation.
But then as they needed to add more capability to their software, GDAL became more capable as well.
Is it primarily a library, primarily a command line tool?
How do I actually work with it?
What do I feed it, and what do I get back out, from an API programming perspective?
So yes to all of those answers.
So it's a command line utility.
It's actually a set of utilities that start with the name GDAL as the prefix and then
a bunch of different names that kind of do different tasks.
It is a C API that sits on top of a C++ API that provides kind of an ABI stable layer
for people to write software against.
Although the C++ API hasn't really changed that much over the years either.
And then it's a C++ library that people can program directly with.
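For a flavor of what programming directly against the C++ API looks like, here's a minimal sketch of opening a raster and reading one scanline. It follows the pattern from GDAL's public headers and tutorial; the example.tif input is a hypothetical file.

```cpp
#include <gdal_priv.h> // GDAL C++ API
#include <cstdio>
#include <vector>

int main() {
    GDALAllRegister(); // register all built-in format drivers

    auto *ds = static_cast<GDALDataset *>(
        GDALOpen("example.tif", GA_ReadOnly)); // hypothetical input
    if (ds == nullptr) return 1;

    const int width = ds->GetRasterXSize();
    std::printf("%d x %d, %d band(s)\n",
                width, ds->GetRasterYSize(), ds->GetRasterCount());

    // Read the first row of band 1, converted to 32-bit floats.
    GDALRasterBand *band = ds->GetRasterBand(1);
    std::vector<float> row(width);
    band->RasterIO(GF_Read, 0, 0, width, 1,
                   row.data(), width, 1, GDT_Float32, 0, 0);

    GDALClose(ds);
}
```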
And then there are kind of two legs to the library.
There's the raster, which would be array and imagery and image processing,
and then the vector side, which is called OGR,
which is points, lines, and polygons.
And those APIs were kind of merged together,
at least at a really high level,
maybe about four or five years ago. But most programmers kind of will specialize depending on what major data types they're working with. And then there's two virtual layers. There's one
virtual layer, called VFS, or the virtual file system layer, that abstracts away data access and network file access, and is something that a software engineer can write to.
And then there's kind of a directed programming language called VRT,
which used to stand for something,
but it's essentially kind of an XML mini-language
for orchestrating processing workflows to do various things to data
so that you don't have to realize that data by a full IO step or a full process. You can simply just describe it as a
process and then execute it in one shot. And you mentioned there the raster and vector data models
and, just to make sure our listeners understand what we're talking about. So raster, you said imagery,
and we're talking like that satellite imagery that you were mentioning before. Yeah, so PNGs, JPEGs, TIFFs, you know, of course, those would be most common and commonly
seen. But, you know, there's nearly 200 raster format drivers in GDAL. So every little weird
software developed its own little format somewhere. And now it's like, you know, from 1970s NASA data that you're reading, or it's, you know, it was a small commercial product that
existed for a little while, and it had a slight and little unique format. And so, again, the idea
of the library is to provide support and capability for people to have abstract access to that content
and be able to take that and do other things with it. Now that
task is messy, right? Like the content, the pixels themselves might mean different things in different
formats. The metadata or information that describes what those pixels are might be different across all
of those formats. In some cases, the library tries to handle some of that, but it can't in all. And so
that's often left to the user
to manage that as they might go from one format to the other. That said, there's probably three or
four very common geospatial formats that you'll see most data in. So one is TIFF or something
called a GeoTIFF, which is just a regular old Adobe TIFF file with some geospatial metadata.
You'll also see the JPEG family of raster data.
So JPEG, JPEG 2000, and various flavors.
And then there's the HDF family. So HDF and NetCDF, which are open source format containers
that allow people to construct binary representations of data.
And you'll see that very commonly used
in meteorology and space applications. So you'll find HDF and NASA and stuff like that quite
commonly. And then you said over 200 raster formats. Go ahead and describe what you've got
going on the vector side too then, I guess. So on the vector side, there's not 200 of them. There's certainly more than 30.
The commonality there is points, lines, and polygons, and then attribute data, which would
mean that those columns of data or the table of data that might correspond to every one of those
features or geometries in that data source. And that world has its own kind of common format base
as well, or commonly used
formats. So you'll hear the phrase shapefile, which isn't actually a file, it's a group of files
that came out from a company called ESRI or Esri in the 90s. GeoJSON is a specification that's out
there that's frequently used for data interchange in web contexts. Not particularly efficient,
but super convenient. I was wondering
why it wasn't GeoXML or GeoYAML when I saw it in your bio here. I'm not really being serious.
There actually is a GeoXML, but it's not called GeoXML, it's called GML. And when XML was super
in vogue in the late 90s and early 2000s, there was a lot of modeling in the geospatial realm to do that around XML
and through an organization called Open Geospatial Consortium.
So there are very detailed thousand-page specifications about doing GML.
Very, very complex.
If you've ever had to implement GML, I'm sorry for you.
In the end, most people
just want the point line and polygon or what you'll hear the phrase simple features, right?
They don't want a complex ontology of geospatial modeling. Like, you know, very few applications
need all of that. And if your application needs that kind of thing, then, you know, you're going
to have to support it with a lot more engineering than hoping to just pull a standard off the shelf. On the GDAL website, there's a
list of software using GDAL. And I'm guessing a lot of our listeners have at least used an
application that is using GDAL, like Google Earth. I'm sure a lot of people have used,
but it seems like there's a whole lot of others too. Yeah, and then a lot of the back-end server-side geospatial computing too,
most of the mapping companies use it in some form, at the very least for data translation,
if not some of the algorithm and processing.
There's people putting it on phones.
There's a browser-based implementation that people have kind of been working on
to pull some of the processing algorithms up to browser land.
So it's kind of going all over the place, just like computing's kind of going all over the place.
How do the vector and raster worlds meet, or do they?
I guess my supposition is they meet at point clouds, which is what we do.
Point clouds is simply just a bunch of points, but they're modeled
as discrete locations in space instead of an array of pixel values, like raster data might be.
Point clouds have properties of both.
You want to manage them in raster ways as much as possible, meaning you want to address them conveniently.
You want to compress them as efficiently as possible.
You want to transmit them and model them simply.
But you also want all of that fidelity of that point cloud corresponding to things like where it's located.
Maybe you have some other information per point.
Like say you had a laser scanner and it was measuring a returned intensity or other information that you might attach to like a LIDAR point or
something like that. And so in some ways, point clouds are kind of in between, at least in the
geospatial realm. Another area where all of these kind of meet together, and when I talk about
PDAL, or "POO-dal", and what it brings unique to the point cloud space, is the geospatial part. And so the geospatial part here, for raster, vector, and point cloud, is coordinate systems: the modeling of that coordinate system, and the "where" part of it. So it's not simply just a coordinate that describes some point in some space, it's a coordinate that describes a point in world space, or other-world space if you're doing maybe NASA stuff or something like that. But
being able to model all of that coordinate system information, be able to transform between those,
that's where all the kind of the stickier parts of geospatial tend to be just because it's
necessarily complex sometimes. And also because most people don't want to have to think about it
and want to be able to outsource that to software libraries that can do that for them.
So GDAL depends on a library called Proj.
And so Proj is an open source library that originated in 1983 from a guy named Gerald Evenden at the USGS.
And so there's a very famous book out there, the Robinson book from USGS that described a whole bunch of coordinate system transformations and coordinate systems.
And so Jerry implemented a number of these in the personal and mini computer era.
And so this library kind of survived as a C library over the years as he was developing it.
He retired in, I want to say, 98 or 99, around the time when Frank Warmerdam was doing GDAL.
And so Frank needed a coordinate system support library for GDAL. And so he used Proj as the
basis of that. So Proj mostly concerned itself with just the mathematics of geospatial projection,
and it didn't concern itself with geodetics, which is, you know, modeling the earth.
And so recently we had an effort to update the Proj library to support a geodetic transformation
engine. So applying the math and the modeling of, you know, geoids and datums and things like that,
so that people can do very precise engineering activities with this geospatial library
instead of making a Web Mercator map for the web or something simple like that.
You can do that stuff with it too, but being able to have the capability to do this complex geodetic modeling.
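As a small sketch of what that looks like in code, here's the shape of a transformation with the modern PROJ C API, usable from C++. The EPSG codes are just example choices, geographic WGS84 into Web Mercator, and the coordinates are an arbitrary test point.

```cpp
#include <proj.h>
#include <cstdio>

int main() {
    PJ_CONTEXT *ctx = proj_context_create();
    // Build a transformation from geographic WGS84 to Web Mercator.
    PJ *p = proj_create_crs_to_crs(ctx, "EPSG:4326", "EPSG:3857", nullptr);
    if (p == nullptr) return 1;

    // EPSG:4326's axis order is latitude, longitude.
    PJ_COORD in = proj_coord(41.66, -91.53, 0, 0); // roughly Iowa City
    PJ_COORD out = proj_trans(p, PJ_FWD, in);
    std::printf("x = %.2f, y = %.2f\n", out.xy.x, out.xy.y);

    proj_destroy(p);
    proj_context_destroy(ctx);
    return 0;
}
```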
I wonder if it's worthwhile, just for a moment, to make sure our listeners know what we're talking about with projections. And you just said Web Mercator, so we're talking about, like, different ways of projecting Earth coordinates to a map?
Yep. So, you know, a while ago there was the orange peel story, where, you know, you try to take the orange peel and lay it flat out and draw a map with it. And if you take a Geography 101 class, you might get to do that as an exercise. It's a lossy operation no matter what you're trying to do. If you logged into Google Maps, you know, maybe five or ten years ago,
the projection or the coordinate system of the description of that map is something called
Web Mercator. It was kind of a special thing that Google had made as they were developing Google
Maps. And it's kind of a shortcut over something that's called Mercator, which of course is a very well-known coordinate system.
As a data modeler, you have to make choices about your coordinate system depending on what you're using.
So you have the ability to preserve area, direction, or angle, right?
And so you can't do all three of those at once.
And so depending on what you want to do with your data or how you want to model it, you'll have to make a coordinate system choice.
And there's lots of them to choose from. And you'll see them used in various contexts.
There's definitely some very common ones. So your GPS on your phone, for example, is going to be just in geographic coordinates or plate carrée, which is, you know, latitude, longitude,
or longitude, latitude, depending on who you want to argue with. I hear, you know, raster, point clouds, geodesic.
You mentioned LIDAR earlier.
I start thinking, you did mention LIDAR earlier, right?
Yeah.
Okay.
I start thinking like, you know, people that are doing like archaeological surveys
and photogrammetry, reconstruction of scenes and that kind of thing.
Does that have an overlap or have I gone too far? Definitely an overlap. So the group of
archaeologists who are flying LIDAR in Central America to be able to look for lost cities and
stuff like that. And LIDAR doesn't penetrate the foliage. It's just you're shooting so many lasers at such a high frequency that if there is a ray that can cast through the foliage to hit the ground, you try to do that, and then you spend a bunch of computing to process it back out, right? And the photogrammetry-based
point clouds is where you're taking lots and lots of photographs at lots of different angles and
doing various stereo photography processing techniques to extract out
a 3D scene or a 3D point cloud from that data, and then build a 3D model from that. So sometimes you'll have lasers, actively scanned lasers and LIDAR, to do that, but you don't need that to do it. So if you're just out flying your drone or whatever, you can put that through various open source software to be able to construct an orthophoto, or like a photogrammetrically correct raster image.
So there's a software project out there called Open Drone Map that you can use to do that.
But you also extract out this 3D scene point cloud, right?
And other associated things.
And so, again, it's all kind of the same sort of story of lots of data, compute out a geospatial product, and then take that geospatial data product and go do whatever you need with it.
Because I'm just trying to wrap my mind around where the edges of these tools are.
And the original goal, you said, of the person who originally created the stuff was to be the garbage man that just did the translation that no one else wanted to do. But it sounds like you do a little bit more than that now, because you can also do processing and
alignment of points, perhaps. Is that? Yeah, over the years, GDAL grew the ability to do
warping, which is, you know, taking that raster, taking that array of data and stretching it or
rubber sheeting it over a bunch of points and doing that various ways depending on what the needs are.
And so that's a task that you need to be able to do to, say, take these raster data and
transform them in different coordinate systems.
So warping is a common one.
Compression, you know, there's a large diversity of flavors of compression.
And depending on your data type, this may or may not have lots of meaning for you.
One thing that the geospatial data almost universally has in common is it's large.
And so maybe it's not so dense in content, but what I always say is it's very fluffy.
There's just lots of bytes.
And so sometimes it compresses well, and other times it doesn't.
But compression is a huge part of any story with geospatial data.
I like the idea of fluffy data.
I've never heard that before.
I mean, yes, it's big data.
Maybe it was one of the original big datas.
But, you know, the meaning of that phrase is kind of smashed out a little bit.
And, you know, it's a little bit like cotton candy.
I want to interrupt the discussion for just a moment to bring you a word from our sponsor.
CLion is a smart cross-platform IDE for C and C++ by JetBrains.
It understands all the tricky parts of modern C++ and integrates with essential tools from the C++ ecosystem,
like CMake, Clang tools, unit testing frameworks, sanitizers, profilers, Doxygen, and many others.
CLion runs its code analysis to detect unused and unreachable code,
dangling pointers, missing typecasts,
no matching function overloads, and many other issues.
They're detected instantly as you type
and can be fixed with a touch of a button
while the IDE correctly handles the changes throughout the project.
No matter what you're involved in,
embedded development, CUDA, or Qt,
you'll find specialized support for it. You can run and debug your apps locally, remotely, or on a microcontroller,
as well as benefit from the collaborative development service. Download the trial
version and learn more at jb.gg/cppcast-clion. Use the coupon code JetBrainsForCppCast during checkout for a 25% discount off the price of a yearly individual license.
So you were talking about "GooDAL", or GDAL.
And so is PDAL part of GDAL, or is it a separate library built on top of it?
The latter.
So we started developing some point cloud software, like I said, in about 2008 or 2009. And this point cloud software was using GDAL to provide
some geospatial capability like coordinate system transformation.
We were doing some geometry filtering and stuff like that.
And then we started developing a library called LibLAS, which was the
first open source library that was available. And it was clear that
LibLAS was focused on this LAS format,
which would be kind of like the TIFF of geospatial LIDAR, for example.
I think of TIFF as the over-engineered image file format.
And I'm wondering, when you said it's like the TIFF of geospatial,
is that because it's very configurable and has lots of options,
or just because it's ubiquitous?
Mostly on the ubiquity thing, right?
So TIFF is kind of like a primary sort of geospatial raster container.
I agree with you that people write really horrible TIFFs.
But just in terms of ubiquity and frequency or commonality in the industry,
for geospatial point clouds, LAS is kind of that format.
But there are other formats.
And so it was clear we needed some kind of abstraction API that would provide a similar sort of capability or space as GDAL. We wanted the ability to have an abstract data API so that we could provide access to the content without necessarily imposing all of the complexity of the organization on the user, right? So in exchange for that,
you might be a little bit slower or you have impedance mismatch, but the ability to
just get those points out and go do something with them in your application, if that's really
valuable, we needed to be able to do that. And so we developed that primarily for doing
point cloud data warehousing, right? So, you know, there's lots of companies and governments out
there flying LIDAR over, you know, places like Central America, but they're doing it all over
the United States or all over Europe. And they're using this information as kind of geospatial base
data for, you know, things like flood mapping, 3D modeling, photogrammetry. And so there's kind of a
bucket where all that stuff gets dumped and our software kind of sits for one of those buckets.
Our software sits there and does all the processing for it.
So is there like a knob that you can turn and say, I want slower processing but, like, super-precision conversions, or, you know, I want faster processing and quicker results so I can just visualize the data real quick?
So for raster data, that knob is to kind of overview or mip map the
data into a pyramid. And, you know, that's your kind of common processing technique for imagery.
For point cloud data, that sort of mip map is going to be some sort of indexing structure. So,
you know, commonly you'll see an octree used as that filtering technique where,
you know, I want to select the data but to a specified
resolution. I don't need, you know, 250 points per square meter or whatever that original data capture was, that allows me to model, you know, the surface so intensely. I don't need that to maybe ask and answer a really quick question. And so you're seeing data organization that's doing those kinds of processing techniques to allow people to touch that data at rest without having to process the entire thing at once.
That octree is what Carmack famously used in Quake to be able to do software 3D rendering, is that right?
Yep, yep. Although it was groundbreaking when he did it, right? But that capability, those are the hammers you have in the box to be able to go do stuff.
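A toy sketch of that idea, nothing like PDAL's or Entwine's actual data structures, just the shape of a resolution-limited octree query: each node keeps a thinned sample of its points, and traversal only descends while the children would add detail finer than what was asked for.

```cpp
#include <array>
#include <memory>
#include <vector>

struct Point { double x, y, z; };

struct Node {
    double cellSize = 0;                       // edge length of this node's cube
    std::vector<Point> samples;                // representative points at this level
    std::array<std::unique_ptr<Node>, 8> kids; // octants
};

// Collect points down to, but not finer than, the requested resolution.
void query(const Node &n, double resolution, std::vector<Point> &out) {
    out.insert(out.end(), n.samples.begin(), n.samples.end());
    if (n.cellSize / 2.0 < resolution)
        return; // children are finer than requested; stop descending
    for (const auto &k : n.kids)
        if (k) query(*k, resolution, out);
}
```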
Sounds like there's just an awful lot going on here
in your world.
Yeah.
It's been a world that was kind of the domain
of what I would call governments and government data capture.
LiDAR cost $200,000 for a laser scanner that you could mount on an airplane maybe 15 years ago.
It was just a really large capital investment to capture and do stuff with this kind of data a while ago.
And you have a LiDAR on your iPhone 12 Pro now.
You can walk around and capture point cloud data.
I did not realize that.
Yeah, there's a laser scanner on your iPhone and people are building applications to do things like
augmented reality where they want to make digital representation of their scene or their environment
and put that into their goggles or put that into photographs, 3D photographs or whatever.
And so the data type and the data to support that is the same stuff, right?
It's geospatial, it's raster, it's point clouds, it's meshes of the space.
And so these processing techniques, even though they're 20 and 30 years old, are still applicable
today.
I'm reminded, I think tangentially, although maybe it has more of a direct relationship
than I'm aware of, OpenStreetMap, which, I saw a recent article discussing how the world's trillion-dollar companies and the world's governments all rely on OpenStreetMap for their GIS kinds of things right now.
For the base, yep, absolutely. And in my opinion, the government entities that are a little bit more progressive are embracing that as the place to have lots of impact with the data that they produce.
Whereas previously, a government would spend a lot of money constructing that data.
So in the United States, we have, you know, Census Bureau does that, right?
So the Census Bureau needs to know where everybody is so that they can find them all to do the census. And so part of that tasking was to build the maps of all
of the streets. Then companies would take that and go do stuff with it. Some of these organizations,
though, they're embracing OpenStreetMap and pushing that data out to that map or out to the
common database early in the process instead of trying to do it later on so that
it has more impact, has more utility, and it's a lot more up-to-date, right? Because it's just
something that's necessarily always changing. Yeah. 15 years ago, I went on vacation to Ireland,
and at the time, all of the travel info said, you can't rely on GPS, you can't rely on a map
that you buy in the U.S., you have to buy a physical map in Ireland when you get there to know what the roads are doing.
I'm hoping now that that's changed,
as you said,
with people getting more pushing data
to open street maps
and embracing that anyhow.
Some countries, you know, there's variation in the policies too.
So the United States, for example,
has an open data policy
where the data it captures
tends to be open and
available and companies can use that to build products and make things more valuable if they
want and take that and close it up and enhance it. EU, it wasn't quite like that. Certainly 15 years
ago, I think things have gotten better. Everybody's geospatial view of the world kind of starts with
their phone nowadays. And so that might be Apple Maps or Google Maps or something like that. But then, you know, they're starting to take that into other spaces. Like, you know, I know Snapchat has a bunch of geospatial stuff that they're doing. I don't know if TikTok's doing any geospatial stuff. But the spaces are kind of converging, where this, like, digital space and physical space stuff is kind of starting to merge together a little bit. And the part that takes that, the reality, the physical part,
and puts it into digital is challenging data processing,
challenging data modeling, and software.
So I wanted to kind of ask a little bit about the C++ code in these libraries.
I'm guessing the answer is going to be very different,
whether we're talking about GDAL or PDAL,
but are we using modern C++ at all in these libraries? It sounds like GDAL is a very
old library and PDAL is maybe relatively new. Yeah, GDAL predates good STL. Okay.
We actually had a policy in GDAL where we wouldn't allow STL into the library. Like GDAL had its own
kind of STL thing going on.
And there's still quite a bit of that
in the code base, actually.
But hold on, I'm gonna make a fork
and start a pull request now.
I think you would describe the design of GDAL
as like, you know, that original sort of vision
of C++ of C with classes.
And not much more than that,
you know, a very simple kind of core API, top-level API,
and then the concept that GDAL calls drivers,
which are just essentially implementations of a particular data model
applied to the overall GDAL model.
What that's allowed is for people to contribute to the library
and only have to
focus on the part they care about which is modeling their one specific little format or
their one specific use, without having to worry about, you know, I think it's almost a million lines of code now, the big complexity of this library. They don't have to worry about that if they want to just focus on their one little task.
You know, GDAL is, I guess, what I would call kind of fluffy or verbose in terms of
style. It's not particularly compact. A lot of the reasoning for that is in the late 90s, early 2000s,
compiler variability was a lot bigger. GDAL expects to run on, it was running on everything
from AIX to Windows 32 to OS X, first flavors of OS X.
So it was running all over the place.
And so it had to assume the responsibility for all that compiler variability.
Things have improved a lot in the interim.
So when we started PDAL in 2010 or so, Clang and GCC were implementing modern C++ at that point.
C++ 11 was just kind of coming onto the horizon.
And so we were able to say, hey, we're going to cut our baseline here at C++ 11.
Originally, our first implementations were based on Boost.
But once C++ 11 compilers were prevalent enough that you could depend upon them for the platforms you cared about. We set that as our floor.
And so that gives you things like shared pointers and memory management and stuff like that.
Recently for PDAL, though, for the next upcoming release, we're setting that floor at C++17.
The thing that pushed us up there is standard file system and getting that universal support
for that across everything. I think the deployment targets for where our software goes
tends to lag just because of the open source community,
the packagers or the vendors like Ubuntu or Debian or Red Hat
that will pick up these libraries are quite conservative.
And so the compilers they have available to them by default
are conservative, their settings tend to be quite conservative because their customer base and their user base
wants things to move slowly so that they don't break. And so for us, that has to be kind of a
target of, you know, we're not going to turn C++20 on, for example, even though as a developer,
we would love to, you know, both of those libraries kind of track that migration or evolution.
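For reference, this is the kind of portable path handling that motivated that C++17 floor; a trivial std::filesystem sketch, not anything taken from PDAL itself, and the tile path is made up:

```cpp
#include <filesystem>
#include <iostream>

namespace fs = std::filesystem;

int main() {
    // Portable path composition and queries, no platform-specific code.
    fs::path tile = fs::temp_directory_path() / "tiles" / "0_0_0.laz";
    std::cout << tile.filename() << ' ' << tile.extension() << '\n';
    std::cout << "exists: " << std::boolalpha << fs::exists(tile) << '\n';
}
```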
What kind of performance considerations does this system have?
Are people relying on it being super performant?
In some cases, maybe.
I mean, there's a penalty for having an abstract API, right?
You have this penalty of transferring between your data models, potentially. And if you
write purpose-built software for a purpose-built format, you're going to beat it. And so the idea
with these libraries is, if that's not a consideration for you, and you as a consumer
of the content just want access to the content without regard to those high performance considerations,
and you can get away with that, then libraries like these are quite valuable.
Because as a developer, you don't have to worry about all the intricacies of how data was laid out.
And then they roll a new version of that format out, and now you have more complexity to deal with. In exchange for that, the bargain there is that as a consumer
of the content, you don't have to worry about all those intricacies. But if you need performance
and you need to manage your system very precisely so that you know what's going on all the time,
you're just going to do much better developing and controlling that yourself. And then you might
provide a driver, say for GDAL or PDAL, to allow people to access your thing. And so we see lots of companies will do that. They might even be
closed source, so they'll have an SDK for their format, but they'll provide a GDAL driver for it
so that people have access to it. There's lots of benefits to that. One is GDAL as kind of a
distribution platform for capability. GDAL is available on Ubuntu and Red Hat
and all over the place.
And so people will install it and use that
as kind of a base for stuff that they're building.
So having support for your thing there has some value to it.
And also that allows you to manage the life
of the features of that thing
without regard to this larger thing.
And so you can kind of control how those move together.
A couple of follow-up questions I've been thinking.
You have 200, 300-ish, whatever, different formats that you support
between the raster and the vector, right?
Are all of those parsers then written internally, or do you ever rely on like libpng, libjpeg, you know, those kinds of things as well?
Frequently, we rely upon libpng or libjpeg or libtiff or lib-whatever, or a proprietary SDK for a weird one-off format that, you know, three people use but is really important to some group.
Like, you'll see all combinations of that.
So GDAL and PDAL both allow a user to build a proprietary, dynamically loaded plugin that'll get woken up at runtime, so that they can adapt the API to their driver.
And so that's been an important part of the model of both of these libraries
in that we recognize and want people doing
proprietary software to adapt to the system in a way that is opt-in when they want. You know,
they don't have to make some bargain to be able to participate. And, you know, we've seen companies
that will do that. And then eventually, like, it's more convenient to just have an open SDK.
And then eventually, it's just more convenient to have it in GDAL or PDAL. And so we've
seen some of that go on as well.
But every company doing various things can make their own business choices or technical choices to do that.
And they have the flexibility within the system to manage that.
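For a sense of what the driver side looks like in code, here's a sketch following the registration pattern from GDAL's raster driver tutorial; the FOO name and the FooDataset class are hypothetical stand-ins for a real format implementation.

```cpp
#include <gdal_priv.h>

// Hypothetical dataset class adapting the "FOO" format to GDAL's model.
class FooDataset final : public GDALDataset {
public:
    static GDALDataset *Open(GDALOpenInfo *info);
};

GDALDataset *FooDataset::Open(GDALOpenInfo *) {
    return nullptr; // a real driver sniffs the header and builds a dataset
}

// A plugin exports a registration entry point like this; once it runs,
// the driver is visible to every GDAL-based tool in the process.
void GDALRegister_FOO() {
    if (GDALGetDriverByName("FOO") != nullptr)
        return; // already registered

    auto *driver = new GDALDriver();
    driver->SetDescription("FOO");
    driver->SetMetadataItem(GDAL_DCAP_RASTER, "YES");
    driver->pfnOpen = FooDataset::Open;
    GetGDALDriverManager()->RegisterDriver(driver);
}
```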
So how much of these different drivers with supporting up to 200 or 300 formats, how much of it is in the GDAL code base?
And how much is like third-party drivers that you could optionally put in if you have it available? So if you go to the
website and kind of list the raster drivers, you know, I think there's like 180 or 200 or something,
I don't know what the list is right now. The majority of those are kind of adaptations on
the same thing, which is a bunch of binary bytes laid down on disk with some metadata.
And so they might have different organization,
they might have slightly different meaning,
but that's the gist of what raster data is, right? It's just a bunch of bytes on disk
and something that describes how they're organized.
Those format parsers tend to be pretty simple.
They don't have to be super complex.
They don't have lots of options or features with them.
But then there's others that are.
You'll see things like JPEG 2000, which is a really complex specification for compressing raster data.
And there's a library called OpenJPEG that's available to manage that.
And GDAL will just delegate and use that as the library.
So it doesn't do the decompression.
It doesn't do the geospatial part.
It just adapts that model to GDAL's model.
And so some of the drivers are that. You'll see proprietary drivers as well. And what most vendors
will do is they'll include that proprietary driver or parser or implementation of what's
using their SDK in the GDAL project itself. And then they'll have the user, whoever it is,
essentially compile and link that when they're building their own GDAL, to make it available to their customers. And so that's
another model that's available. And then finally, the last model is we just provide a runtime DLL
that's linked against GDAL and you can load it up at runtime to get access to our format along with
either our SDK or whatnot. And you'll see that commonly done as well,
mostly in commercial software deployments where they're deploying GDAL as part of their
commercial software system. We've talked a lot on this show about fuzz testing, and I'm finding
myself thinking, okay, you've got 200 whatever file formats, however many of them you offload
out. But it's pretty well accepted that fuzz testing,
anything that accepts user input, is good practice.
What does your test scenario look like for all of this?
So GDAL was, let's see, Even's not here, but Even Rouault is the current maintainer and kind of the majordomo of GDAL at this point.
And so he was invited by Google to include GDAL into OSS Fuzz very early on.
And I think GDAL was one of the projects with the most findings, at least for some time window.
It was also one with some of the more complex findings. Not every format driver is Fuzz tested
equally because of this linking and the proprietary things and all this sort of stuff.
But the test as currently run is fuzz tested with commits to the code base. And so we get findings not as frequently as we were. OSS fuzz is a continuous fuzzing harness, right? Like it's just
always running. Yes. They've found issues in libtiff. They find library issues. I don't think they found any issues in libjpeg, but they'll find little issues, right?
And sometimes that originates from GDAL and its usage or just the SDK usage.
The test suite for GDAL is very large, but it's testing of capability more so than full
engineering tests of every possible path through the code base.
It doesn't test that.
The C++ testing in GDAL is actually
driven by Python because it's very data sensitive, right? You have all these different data types and
configurations. Most of the testing is around that. I wouldn't recommend putting GDAL at the
end of your web address and saying, pass me in data. Even though it is fuzz tested, you might
want to jail stuff up a little bit.
That's awesome, though. That's really cool. I don't know what answer I expected,
but I didn't expect that. Yes, of course, we're continuously fuzzing as much of it as we
reasonably can. Underlying support libraries for GDAL, like Proj, are also fuzz tested through OSS
Fuzz: Proj, and I think GEOS, which would be the geometry algebra engine that you'll also see used.
And so the groups in these projects were early participants in OSS fuzz,
mostly because Google was using them and also wanted to, you know,
essentially not have to upstream all the bugs they were finding, right?
So, like, that feedback loop was tightened up quite a bit.
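For listeners who haven't seen one, an OSS-Fuzz target is essentially just a libFuzzer entry point wrapped around a parser. This is a generic sketch, with parse_my_format as a hypothetical stand-in, not one of GDAL's actual fuzzers:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical parser under test; a real one would live in the library.
static bool parse_my_format(const uint8_t *data, size_t size) {
    return size >= 4 && data[0] == 'F'; // stand-in for real parsing logic
}

// libFuzzer calls this in a loop with mutated inputs; sanitizers flag
// memory errors, and OSS-Fuzz minimizes and reports any crashes.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    parse_my_format(data, size);
    return 0; // non-zero return values are reserved
}
```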
All right. Well, Howard, it was great having you on the show today.
Thank you so much for telling us about these libraries.
Anything you want to plug before we let you go or anything like that?
A couple of things I want to plug.
One is something called cloud optimized point cloud and analysis ready data.
So in the geospatial realm, because the data are so large and they're sitting at rest somewhere, having the data laid down on disk in a way that you can access them without having to process all of them, like we were talking about earlier, is an important thing. And so you're starting to see some efforts to do that, and one of our efforts in the point cloud space is Cloud Optimized Point Cloud; that's at copc.io.
And the other is, I would never claim to be much of a C++ software engineer, so thank you very much for the opportunity to be on your podcast. I mean, for us, C++ is kind of a means to an end. At the time, doing high-performance geospatial
software processing where you want to move lots of bytes around, compress lots of bytes,
transform lots of bytes, that meant C or C++. And maybe it still does today,
but that's how these libraries came into being. And it's been interesting watching C++
kind of continue to evolve over the years
to be something certainly quite a bit different
and more like other programming languages
than it certainly was 20 years ago.
Well, I certainly hope C and C++
still do mean high performance and everything
like you were just saying.
I think it does.
Thanks again, Howard.
Thanks for coming on.
Thank you very much.
Thanks so much for listening in as we chat about
C++. We'd love to hear what you think
of the podcast. Please let us know if we're
discussing the stuff you're interested in,
or if you have a suggestion for a topic, we'd love
to hear about that too. You can email
all your thoughts to feedback at cppcast.com.
We'd also appreciate
if you can like CppCast on Facebook
and follow CppCast on
Twitter. You can also follow me @robwirving and Jason @lefticus on Twitter. We'd also like
to thank all our patrons who help support the show through Patreon. If you'd like to support
us on Patreon, you can do so at patreon.com slash cppcast. And of course, you can find all that info
and the show notes on the podcast website at cppcast.com.
Theme music for this episode is provided by podcastthemes.com.