CppCast - GDAL and PDAL
Episode Date: February 18, 2022

Howard Butler joins Rob and Jason. They first talk about an announcement from Swift on a C++ interoperability workgroup. Then they talk to Howard Butler about the C++ geospatial libraries GDAL and PDAL, and his involvement with geospatial development.

News: Swift and C++ interoperability workgroup announcement; The mystery of the crash that seems to be on a std::move operation; How we used C++20 to eliminate an entire class of runtime bugs

Links: hobu; GDAL; GDAL on GitHub; PDAL; PDAL on GitHub; Cloud Optimized Point Cloud

Sponsors: Use code JetBrainsForCppCast during checkout at JetBrains.com for a 25% discount
Transcript
Episode 337 of CppCast with guest Howard Butler, recorded February 15th, 2022.
This episode of CppCast is sponsored by JetBrains. JetBrains has a range of C++ IDEs to help you
avoid the typical pitfalls and headaches that are often associated with coding in C++.
Exclusively for CppCast, JetBrains is offering a 25% discount for purchasing or renewing a yearly
individual license on the C++ tool of your choice, CLion, ReSharper C++, or AppCode.
Use the coupon code JetBrainsForCppCast during checkout at www.jetbrains.com. In this episode, we talked about Swift and C++ Interop.
And we talked to Howard Butler from Hobu.
Howard talks to us about the geospatial libraries GDAL and PDAL. Welcome to Episode 337 of CppCast,
the first podcast for C++ developers by C++ developers.
I'm your host, Rob Irving,
joined by my co-host, Jason Turner.
Jason, how are you doing today?
I'm all right, Rob. How are you doing?
I'm doing all right.
My son turned 13 today,
so I am feeling a little old from that, but I'm doing okay.
That means we've been doing this podcast for more than half of your son's life, it seems.
Yeah. Wow.
Does that make you feel older again?
Yeah, it does. How about you? Got anything going on?
Well, I just had that moment myself when I released episode 311 of C++ Weekly,
which, as probably a lot of our listeners know, I have not missed a single week.
So knowing that I've been doing it for 311 weeks,
at first I wanted to feel a sense of pride, but then I just felt old instead.
All right. Well, at the top
of every episode, I'd like to read a piece of
feedback. We got this message
on Discord from
Eli or Ellie saying
this is not a very important piece of feedback, but
I like it a little better when you let the theme
song outro play for longer before starting
the podcast audio proper.
I think it might be because the first minute and change
of every podcast used
to be exactly the same and it became sort of comforting for me to follow along with that.
So yeah, we did start using an editor recently, and I guess they just did a little bit of a different
style with editing the intro music with the intro of the show. But I will pass on that feedback to
the editor because there could be other listeners also, you know, frazzled by change. So did we actually talk about why you're using an editor now on the podcast? I'm not sure if
we had on the podcast, but you know, we have had a Patreon for a while for the show and we put out
a goal when we first opened up the Patreon that if we ever get to a certain level of support,
that we would start using an editor as opposed to me
editing the show myself. And we did hit that goal earlier this year.
And it's starting to work out?
Yeah. Oh, no. I think the editor has been doing a really great job.
All right. We'd love to hear your thoughts about the show. You can always reach out to us
on Facebook, Twitter, or email us at feedback at cppcast.com. And don't forget to leave us
a review on iTunes or subscribe on YouTube.
Joining us today is Howard Butler.
Howard is the founder and president of Hobu, Inc.,
an open-source software consultancy located in Iowa City
that focuses on point cloud data management solutions.
He's an active participant in the ASPRS LAS Committee,
a Project Steering Committee member of both the PROJ
and GDAL open source software projects,
a contributing author to the GeoJSON specification, and a past member of the OSGeo Board of Directors.
With his firm, Howard leads the development of the PDAL and Entwine open source point cloud processing and organization software libraries.
Howard, welcome to the show.
Thanks for the introduction.
Excited for the opportunity to talk about geospatial stuff and C++ with you guys.
How did you decide to create a company founded on open source stuff?
Because that sounds risky to a lot of people, I think.
There wasn't a decision so much as I was doing consulting and my wife and I moved towns.
We moved from Ames to Iowa City, and part of the job search, I wasn't doing so well
finding kind of what I wanted to do for a job,
and I'd been very active in open-source geospatial software.
And so I thought, well, maybe I can just try to make a business
out of doing this where people need features added to software
or support and capability or help navigating the community.
I started the consultancy doing that.
It was kind of like a moonlighting sort of thing.
And eventually we started developing software in the point cloud and LIDAR domain, which
at the time was a very small niche.
It's grown quite a bit now.
And so people would find us to add capability to the software, and they were building systems with our tools.
And so the business just kind of grew from there.
How long ago was that?
So Hobu actually started as a consultancy maybe 20 years ago, but we started very heavily focusing on open source and LIDAR stuff in 2008 and 2009.
2008 and 2009. Okay, cool.
How big is the company now?
We're five employees and have one employee
in Peterborough, New Hampshire,
and then the rest of us here in Iowa City.
Yep, always looking to find good engineers
like everyone else, but...
So you are hiring right now?
Of course.
Everyone's unicorn hunting.
It does seem like it's verging on impossible to find
new programmers right now because everyone has a job that they want or there's not enough people
or something. Yeah, I also think, you know, we're in Iowa, which is kind of a challenging environment
too, just because there isn't as big of a community. And so, you know, the buy or build
solution is mostly build, right? You want to hire young engineers and mentor them forward either in your development practices
or teach them the culture that you're trying to get them to participate in.
And then they grow up and leave, of course.
But that's one way to move forward with trying to get the talent that you need.
All right.
All right, Howard, we got a couple news articles to discuss.
Feel free to comment on any of these and we'll start talking more about these geospatial libraries that you're involved with. Okay.
Okay. So this first one is Swift and C++ interoperability workgroup announcement.
I had not heard of this, but we had a listener reach out to us over Twitter, Jason. Actually,
I think listener and former guest telling us that we should talk about this one.
And yeah, it seems really interesting.
The Swift development is going for
bidirectional interoperability with C++?
Yeah, I mean, examples here are
things like standard string and standard vector
without having to go through a memory copy
C binding layer,
which is generally what has to happen with other languages like Java, Python, whatever.
Are there other languages with this type of interop?
As far as I can recall, only D, from what Andrei Alexandrescu told us when he was on.
Because he was talking about basically when D and C++ interop works
by relying on knowing what the calling conventions are
for the C++ compiler that you happen to be trying to interop with.
And so that's really getting into the guts of things there.
It's the only other language I know of that does that.
Seems really cool.
And I don't think we've ever talked about Swift in any detail on the show,
but maybe this is a good opportunity to try to get someone from the Swift community
who's involved in this C++ interop because it seems really cool.
You have any thoughts on this one, Howard?
You do anything with other languages that maybe interop with C++?
Certainly, you know, the libraries we develop have interop with things like Python, done some Java.
The deep bindings approach of doing that is always pretty painful for everyone.
Somewhere across that boundary, somebody gets to be responsible for who owns what when.
And that's always the problem.
So things that make that better would always be exciting.
I don't know that there's ever really been a silver bullet for it, though. Every generation has had its own sort of approach to doing that kind of interop, and it always ends up being a compromise in places that you don't want. And then it kind of gets redone the next generation in a little bit different way, with slightly different compromises, and you're kind of back in the same spot.
All these kinds of issues are why I created ChaiScript in the first place, because I wanted a scripting language that was designed for interop with C++. And just for the record, I don't maintain it anymore, so I'm not trying to convince anyone to use it. It cost me far more money to write and maintain it than I ever made from it.
All right. Next thing we have is another post on The Old New Thing, Raymond Chen's blog.
And this one is the mystery of the crash that seems to be on a std::move operation.
Could you go over this one for us, Jason? I love this because even though I had already
read the article once, when I came back to it to review before the show, I'm like, I forgot where the crash was.
but what it comes down to is that the programmer is using an object and moving that object in the
same statement and that was unspecified behavior before C++17
and the way they're doing it,
because there's no guarantee as to what order these things are going to occur in.
In C++17, there is the guarantee that the function itself will be resolved first,
and then its function arguments will be resolved.
But there's still no guarantee as to what order the function arguments will be instantiated in C++17. So this code would be okay in C++17?
Yes, that code would be okay in C++17. Yeah. But I mean, not that it's good code, I would
say just for the record, because it still feels a little sketchy. Like you have to know exactly what
the rules are saying right here
to have any real clue what it's doing.
That's how I feel about it.
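To make that concrete, here's a minimal sketch of the kind of pattern being described; it's not the exact code from Raymond's article, and the Widget and adopt names are made up for illustration:

```cpp
#include <memory>
#include <utility>
#include <vector>

struct Widget {
    static std::vector<std::unique_ptr<Widget>> registry;
    void adopt(std::unique_ptr<Widget> self) {
        registry.push_back(std::move(self)); // keep ourselves alive
    }
};
std::vector<std::unique_ptr<Widget>> Widget::registry;

int main() {
    auto w = std::make_unique<Widget>();
    // Uses w (to find the callee) and moves from w in one statement.
    // Before C++17 the compiler was free to construct the unique_ptr
    // argument first, nulling w, and then dereference the now-null w
    // to find the object: a crash that looks like it's "on" the
    // std::move. Since C++17 the postfix expression w->adopt is
    // sequenced before the arguments, so this exact shape is
    // well-defined, if still sketchy.
    w->adopt(std::move(w));
}
```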
All right.
And then the last thing we have is another post on the Visual C++ blog.
And this is how we use C++ 20 to eliminate an entire class of runtime bugs.
This is a good article.
Basically, you know, it goes down to formats, compile time, checking
support, being able to just get rid of tons of possible bugs with their format strings in the
Visual C++ compiler. Yeah, but how this is actually implemented, I think our readers should just go
and read this and let it soak in for how it works. Yeah, we'll definitely put the link in the show notes for our listeners to go read the article.
All right, so Howard, let's start off by talking about what is the GDAL library.
I don't think we've really ever talked about any geospatial libraries before on the show,
so I think it should be a new topic for a lot of our listeners.
Sure. So GDAL, or "GooDAL", or "G-DAL", you might hear it pronounced lots of different ways in many different languages, is the Geodata Abstraction Library, or Geospatial Data Abstraction Library.
It was started in about 1998 by a guy named Frank Warmerdam. Frank had developed a software package or the engine of a
software package called PCI at his former employer. And it allowed people to do raster data. So raster,
just array data, if you use the term raster vector in geospatial, they have kind of geospatial
meanings. But PCI allowed a user to process and manipulate raster data in kind of an engine.
And so it allowed people to construct workflows and they could manipulate data, change formats.
So Frank had developed PCI and it was proprietary.
His employment situation changed and he decided, well, I'll make a PCI version 2.0, except it'll be open source.
And of course, the 2.0 version of everything you ever write is always better than the 1.0 version.
And so he started kind of developing this.
And his business model at the time, I met him a number of years later.
He said, well, I wanted a business model, which he likened it to being the garbage man.
And what he meant by that is there's a set of tasking that is convenient to outsource or contract out,
and format translation is one of those.
Nobody likes to write format translators, be responsible for them,
have to worry about all of the impedance mismatch that goes on with taking data in one system and moving it to another.
And so he saw this as kind of like a business
opportunity in that he could make a living developing these translators for people and
then add processing capability to it. And so his business model was actively doing that until about,
I want to say 2010 or so, a Google recruiter email finally landed for him. And so he started
working for Google for a little while.
And now he works at Planet Labs, where he was part of the team that helped process the data. Planet Labs acquires satellite imagery, and the goal is everywhere on the Earth, every day, right?
And so big geospatial data processing backends and engines. And so GDAL sits underneath all of this.
And so Frank developed it, and it kind of took on a life of its own as other open source
and commercial software products started to pick it up and use it as that kind of underlying
engine, you know, conveniently for format translation.
But then as they needed to add more capability to their software, GDAL became more capable as well.
Is it primarily a library, primarily a command line tool?
How do I actually work with it?
What do I feed it, and what do I get back out, from an API programming perspective?
So yes to all of those answers.
So it's a command line utility.
It's actually a set of utilities that start with the name GDAL as the prefix and then
a bunch of different names that kind of do different tasks.
It is a C API that sits on top of a C++ API that provides kind of an ABI stable layer
for people to write software against.
Although the C++ API hasn't really changed that much over the years either.
And then it's a C++ library that people can program directly with.
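For a flavor of what programming directly against the C++ API looks like, here's a minimal sketch of opening a raster and reading one scanline. It follows the pattern from GDAL's public headers and tutorial; the example.tif input is a hypothetical file.

```cpp
#include <gdal_priv.h> // GDAL C++ API
#include <cstdio>
#include <vector>

int main() {
    GDALAllRegister(); // register all built-in format drivers

    auto *ds = static_cast<GDALDataset *>(
        GDALOpen("example.tif", GA_ReadOnly)); // hypothetical input
    if (ds == nullptr) return 1;

    const int width = ds->GetRasterXSize();
    std::printf("%d x %d, %d band(s)\n",
                width, ds->GetRasterYSize(), ds->GetRasterCount());

    // Read the first row of band 1, converted to 32-bit floats.
    GDALRasterBand *band = ds->GetRasterBand(1);
    std::vector<float> row(width);
    band->RasterIO(GF_Read, 0, 0, width, 1,
                   row.data(), width, 1, GDT_Float32, 0, 0);

    GDALClose(ds);
}
```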
And then there are kind of two legs to the library.
There's the raster, which would be array and imagery and image processing,
and then the vector side, which is called OGR,
which is points, lines, and polygons.
And those APIs were kind of merged together,
at least at a really high level,
maybe about four or five years ago. But most programmers kind of will specialize depending on what major data types they're working with. And then there's two virtual layers. There's one
virtual layer, called VFS, or the virtual file system layer, that abstracts away data access and network file access, and is something that a software engineer can write to.
And then there's kind of a directed programming language called VRT,
which used to stand for something,
but it's essentially kind of an XML mini-language
for orchestrating processing workflows to do various things to data
so that you don't have to realize that data by a full IO step or a full process. You can simply just describe it as a
process and then execute it in one shot. And you mentioned there the raster and vector data models
and, just to make sure our listeners understand what we're talking about. So raster, you said imagery,
and we're talking like that satellite imagery that you were mentioning before. Yeah, so PNGs, JPEGs, TIFFs, you know, of course, those would be most common and commonly
seen. But, you know, there's nearly 200 raster format drivers in GDAL. So every little weird
software developed its own little format somewhere. And now it's like, you know, from 1970s NASA data that you're reading, or it's, you know, it was a small commercial product that
existed for a little while, and it had a slight and little unique format. And so, again, the idea
of the library is to provide support and capability for people to have abstract access to that content
and be able to take that and do other things with it. Now that
task is messy, right? Like the content, the pixels themselves might mean different things in different
formats. The metadata or information that describes what those pixels are might be different across all
of those formats. In some cases, the library tries to handle some of that, but it can't in all. And so
that's often left to the user
to manage that as they might go from one format to the other. That said, there's probably three or
four very common geospatial formats that you'll see most data in. So one is TIFF or something
called a GeoTIFF, which is just a regular old Adobe TIFF file with some geospatial metadata.
You'll also see the JPEG family of raster data.
So JPEG, JPEG 2000, and various flavors.
And then there's the HDF family. So HDF and NetCDF, which are open source format containers
that allow people to construct binary representations of data.
And you'll see that very commonly used
in meteorology and space applications. So you'll find HDF and NASA and stuff like that quite
commonly. And then you said over 200 raster formats. Go ahead and describe what you've got
going on the vector side too then, I guess. So on the vector side, there's not 200 of them. There's certainly more than 30.
The commonality there is points, lines, and polygons, and then attribute data, which would
mean that those columns of data or the table of data that might correspond to every one of those
features or geometries in that data source. And that world has its own kind of common format base
as well, or commonly used
formats. So you'll hear the phrase shapefile, which isn't actually a file, it's a group of files
that came out from a company called ESRI or Esri in the 90s. GeoJSON is a specification that's out
there that's frequently used for data interchange in web contexts. Not particularly efficient,
but super convenient. I was wondering
why it wasn't GeoXML or GeoYAML when I saw it in your bio here. I'm not really being serious.
There actually is a GeoXML, but it's not called GeoXML, it's called GML. And when XML was super
in vogue in the late 90s and early 2000s, there was a lot of modeling in the geospatial realm to do that around XML
and through an organization called Open Geospatial Consortium.
So there are very detailed thousand-page specifications about doing GML.
Very, very complex.
If you've ever had to implement GML, I'm sorry for you.
In the end, most people
just want the point line and polygon or what you'll hear the phrase simple features, right?
They don't want a complex ontology of geospatial modeling. Like, you know, very few applications
need all of that. And if your application needs that kind of thing, then, you know, you're going
to have to support it with a lot more engineering than hoping to just pull a standard off the shelf. On the GDAL website, there's a
list of software using GDAL. And I'm guessing a lot of our listeners have at least used an
application that is using GDAL, like Google Earth. I'm sure a lot of people have used,
but it seems like there's a whole lot of others too. Yeah, and then a lot of the back-end server-side geospatial computing too,
most of the mapping companies use it in some form, at the very least for data translation,
if not some of the algorithm and processing.
There's people putting it on phones.
There's a browser-based implementation that people have kind of been working on
to pull some of the processing algorithms up to browser land.
So it's kind of going all over the place, just like computing's kind of going all over the place.
How do the vector and raster worlds meet, or do they?
I guess my supposition is they meet at point clouds, which is what we do.
Point clouds is simply just a bunch of points, but they're modeled
as discrete locations in space instead of an array of pixel values, like raster data might be.
Point clouds have properties of both.
You want to manage them in raster ways as much as possible, meaning you want to address them conveniently.
You want to compress them as efficiently as possible.
You want to transmit them and model them simply.
But you also want all of that fidelity of that point cloud corresponding to things like where it's located.
Maybe you have some other information per point.
Like say you had a laser scanner and it was measuring a returned intensity or other information that you might attach to like a LIDAR point or
something like that. And so in some ways, point clouds are kind of in between, at least in the
geospatial realm. Another area where all of these kind of meet together, and when I talk about
PDAL, or "POO-dal", and what it brings unique to the point cloud space, is the geospatial part. And so the geospatial part here, for raster, vector, and point cloud, is coordinate systems: the modeling of that coordinate system, and the "where" part of it. So it's not simply just a coordinate that describes some point in some space, it's a coordinate that describes a point in world space, or other-world space if you're doing maybe NASA stuff or something like that. But
being able to model all of that coordinate system information, be able to transform between those,
that's where all the kind of the stickier parts of geospatial tend to be just because it's
necessarily complex sometimes. And also because most people don't want to have to think about it
and want to be able to outsource that to software libraries that can do that for them.
So GDAL depends on a library called Proj.
And so Proj is an open source library that originated in 1983 from a guy named Gerald Evenden at the USGS.
And so there's a very famous book out there, the Robinson book from USGS that described a whole bunch of coordinate system transformations and coordinate systems.
And so Jerry implemented a number of these in the personal and mini computer era.
And so this library kind of survived as a C library over the years as he was developing it.
He retired in, I want to say, 98 or 99, around the time when Frank Warmerdam was doing GDAL.
And so Frank needed a coordinate system support library for GDAL. And so he used Proj as the
basis of that. So Proj mostly concerned itself with just the mathematics of geospatial projection,
and it didn't concern itself with geodetics, which is, you know, modeling the earth.
And so recently we had an effort to update the Proj library to support a geodetic transformation
engine. So applying the math and the modeling of, you know, geoids and datums and things like that,
so that people can do very precise engineering activities with this geospatial library
instead of making a Web Mercator map for the web or something simple like that.
You can do that stuff with it too, but being able to have the capability to do this complex geodetic modeling.
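As a small sketch of what that looks like in code, here's the shape of a transformation with the modern PROJ C API, usable from C++. The EPSG codes are just example choices, geographic WGS84 into Web Mercator, and the coordinates are an arbitrary test point.

```cpp
#include <proj.h>
#include <cstdio>

int main() {
    PJ_CONTEXT *ctx = proj_context_create();
    // Build a transformation from geographic WGS84 to Web Mercator.
    PJ *p = proj_create_crs_to_crs(ctx, "EPSG:4326", "EPSG:3857", nullptr);
    if (p == nullptr) return 1;

    // EPSG:4326's axis order is latitude, longitude.
    PJ_COORD in = proj_coord(41.66, -91.53, 0, 0); // roughly Iowa City
    PJ_COORD out = proj_trans(p, PJ_FWD, in);
    std::printf("x = %.2f, y = %.2f\n", out.xy.x, out.xy.y);

    proj_destroy(p);
    proj_context_destroy(ctx);
    return 0;
}
```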
I wonder if it's worthwhile, just for a moment, to make sure our listeners know what we're talking about with projections. And you just said Web Mercator, so we're talking about, like, different ways of projecting Earth coordinates to a map?
Yep. So, you know, a while ago there was the orange peel story, where, you know, you try to take the orange peel and lay it flat out and draw a map with it. And if you take a Geography 101 class, you might get to do that as an exercise. It's a lossy operation no matter what you're trying to do. If you logged into Google Maps, you know, maybe five or ten years ago,
the projection or the coordinate system of the description of that map is something called
Web Mercator. It was kind of a special thing that Google had made as they were developing Google
Maps. And it's kind of a shortcut over something that's called Mercator, which of course is a very well-known coordinate system.
As a data modeler, you have to make choices about your coordinate system depending on what you're using.
So you have the ability to preserve area, direction, or angle, right?
And so you can't do all three of those at once.
And so depending on what you want to do with your data or how you want to model it, you'll have to make a coordinate system choice.
And there's lots of them to choose from. And you'll see them used in various contexts.
There's definitely some very common ones. So your GPS on your phone, for example, is going to be just in geographic coordinates or plate carrée, which is, you know, latitude, longitude,
or longitude, latitude, depending on who you want to argue with. I hear, you know, raster, point clouds, geodesic.
You mentioned LIDAR earlier.
I start thinking, you did mention LIDAR earlier, right?
Yeah.
Okay.
I start thinking like, you know, people that are doing like archaeological surveys
and photogrammetry, reconstruction of scenes and that kind of thing.
Does that have an overlap or have I gone too far? Definitely an overlap. So the group of
archaeologists who are flying LIDAR in Central America to be able to look for lost cities and
stuff like that. And LIDAR doesn't penetrate the foliage. It's just you're shooting so many lasers at such a high frequency that if there is a ray that can cast through the foliage to hit the ground, you try to do that, and then you spend a bunch of computing to process it back out, right? And the photogrammetry-based
point clouds is where you're taking lots and lots of photographs at lots of different angles and
doing various stereo photography processing techniques to extract out
a 3D scene or a 3D point cloud from that data, and then build a 3D model from that. So sometimes you'll have lasers, actively scanned lasers and LIDAR, to do that, but you don't need that to do it. So if you're just out flying your drone or whatever, you can put that through various open source software to be able to construct an orthophoto, or like a photogrammetrically correct raster image.
So there's a software project out there called Open Drone Map that you can use to do that.
But you also extract out this 3D scene point cloud, right?
And other associated things.
And so, again, it's all kind of the same sort of story of lots of data, compute out a geospatial product, and then take that geospatial data product and go do whatever you need with it.
Because I'm just trying to wrap my mind around where the edges of these tools are.
And the original goal, you said, of the person who originally created the stuff was to be the garbage man that just did the translation that no one else wanted to do. But it sounds like you do a little bit more than that now, because you can also do processing and
alignment of points, perhaps. Is that? Yeah, over the years, GDAL grew the ability to do
warping, which is, you know, taking that raster, taking that array of data and stretching it or
rubber sheeting it over a bunch of points and doing that various ways depending on what the needs are.
And so that's a task that you need to be able to do to, say, take these raster data and
transform them in different coordinate systems.
So warping is a common one.
Compression, you know, there's a large diversity of flavors of compression.
And depending on your data type, this may or may not have lots of meaning for you.
One thing that the geospatial data almost universally has in common is it's large.
And so maybe it's not so dense in content, but what I always say is it's very fluffy.
There's just lots of bytes.
And so sometimes it compresses well, and other times it doesn't.
But compression is a huge part of any story with geospatial data.
I like the idea of fluffy data.
I've never heard that before.
I mean, yes, it's big data.
Maybe it was one of the original big datas.
But, you know, the meaning of that phrase is kind of smashed out a little bit.
And, you know, it's a little bit like cotton candy.
I want to interrupt the discussion for just a moment to bring you a word from our sponsor.
CLion is a smart cross-platform IDE for C and C++ by JetBrains.
It understands all the tricky parts of modern C++ and integrates with essential tools from the C++ ecosystem,
like CMake, Clang tools, unit testing frameworks, sanitizers, profilers, Doxygen, and many others.
CLion runs its code analysis to detect unused and unreachable code,
dangling pointers, missing typecasts,
no matching function overloads, and many other issues.
They're detected instantly as you type
and can be fixed with a touch of a button
while the IDE correctly handles the changes throughout the project.
No matter what you're involved in,
embedded development, CUDA, or Qt,
you'll find specialized support for it. You can run and debug your apps locally, remotely, or on a microcontroller,
as well as benefit from the collaborative development service. Download the trial
version and learn more at jb.gg/cppcast-clion. Use the coupon code JetBrainsForCppCast during checkout for a 25% discount off the price of a yearly individual license.
So you were talking about "GooDAL", or GDAL.
And so is PDAL part of GDAL, or is it a separate library built on top of it?
The latter.
So we started developing some point cloud software, like I said, in about 2008 or 2009. And this point cloud software was using GDAL to provide
some geospatial capability like coordinate system transformation.
We were doing some geometry filtering and stuff like that.
And then we started developing a library called LibLAS, which was the
first open source library that was available. And it was clear that
LibLAS was focused on this LAS format,
which would be kind of like the TIFF of geospatial LIDAR, for example.
I think of TIFF as the over-engineered image file format.
And I'm wondering, when you said it's like the TIFF of geospatial,
is that because it's very configurable and has lots of options,
or just because it's ubiquitous?
Mostly on the ubiquity thing, right?
So TIFF is kind of like a primary sort of geospatial raster container.
I agree with you that people write really horrible TIFFs.
But just in terms of ubiquity and frequency or commonality in the industry,
for geospatial point clouds, LAS is kind of that format.
But there are other formats.
And so it was clear we needed some kind of abstraction API that would provide a similar sort of capability or space as GDAL. We wanted the ability to have an abstract data API so that we could provide access to the content without necessarily imposing all of the complexity of the organization on the user, right? So in exchange for that,
you might be a little bit slower or you have impedance mismatch, but the ability to
just get those points out and go do something with them in your application, if that's really
valuable, we needed to be able to do that. And so we developed that primarily for doing
point cloud data warehousing, right? So, you know, there's lots of companies and governments out
there flying LIDAR over, you know, places like Central America, but they're doing it all over
the United States or all over Europe. And they're using this information as kind of geospatial base
data for, you know, things like flood mapping, 3D modeling, photogrammetry. And so there's kind of a
bucket where all that stuff gets dumped and our software kind of sits for one of those buckets.
Our software sits there and does all the processing for it.
So is there like a knob that you can turn and say, I want slower processing but, like, super-precision conversions, or, you know, I want faster processing and quicker results so I can just visualize the data real quick?
So for raster data, that knob is to kind of overview or mip map the
data into a pyramid. And, you know, that's your kind of common processing technique for imagery.
For point cloud data, that sort of mip map is going to be some sort of indexing structure. So,
you know, commonly you'll see an octree used as that filtering technique where,
you know, I want to select the data but to a specified
resolution. I don't need, you know, 250 points per square meter or whatever that original data capture was, that allows me to model, you know, the surface so intensely. I don't need that to maybe ask and answer a really quick question. And so you're seeing data organization that's doing those kinds of processing techniques to allow people to touch that data at rest without having to process the entire thing at once.
That octree is what Carmack famously used in Quake to be able to do software 3D rendering, is that right?
Yep, yep. Although it was groundbreaking when he did it, right? But that capability, those are the hammers you have in the box to be able to go do stuff.
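A toy sketch of that idea, nothing like PDAL's or Entwine's actual data structures, just the shape of a resolution-limited octree query: each node keeps a thinned sample of its points, and traversal only descends while the children would add detail finer than what was asked for.

```cpp
#include <array>
#include <memory>
#include <vector>

struct Point { double x, y, z; };

struct Node {
    double cellSize = 0;                       // edge length of this node's cube
    std::vector<Point> samples;                // representative points at this level
    std::array<std::unique_ptr<Node>, 8> kids; // octants
};

// Collect points down to, but not finer than, the requested resolution.
void query(const Node &n, double resolution, std::vector<Point> &out) {
    out.insert(out.end(), n.samples.begin(), n.samples.end());
    if (n.cellSize / 2.0 < resolution)
        return; // children are finer than requested; stop descending
    for (const auto &k : n.kids)
        if (k) query(*k, resolution, out);
}
```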
Sounds like there's just an awful lot going on here
in your world.
Yeah.
It's been a world that was kind of the domain
of what I would call governments and government data capture.
LiDAR cost $200,000 for a laser scanner that you could mount on an airplane maybe 15 years ago.
It was just a really large capital investment to capture and do stuff with this kind of data a while ago.
And you have a LiDAR on your iPhone 12 Pro now.
You can walk around and capture point cloud data.
I did not realize that.
Yeah, there's a laser scanner on your iPhone and people are building applications to do things like
augmented reality where they want to make digital representation of their scene or their environment
and put that into their goggles or put that into photographs, 3D photographs or whatever.
And so the data type and the data to support that is the same stuff, right?
It's geospatial, it's raster, it's point clouds, it's meshes of the space.
And so these processing techniques, even though they're 20 and 30 years old, are still applicable
today.
I'm reminded, I think tangentially, although maybe it has more of a direct relationship
than I'm aware of, OpenStreetMap, which, I saw a recent article discussing how the world's trillion-dollar companies and the world's governments all rely on OpenStreetMap for their GIS kinds of things right now.
For the base, yep, absolutely. And in my opinion, the government entities that are a little bit more progressive are embracing that as the place to have lots of impact with the data that they produce.
Whereas previously, a government would spend a lot of money constructing that data.
So in the United States, we have, you know, Census Bureau does that, right?
So the Census Bureau needs to know where everybody is so that they can find them all to do the census. And so part of that tasking was to build the maps of all
of the streets. Then companies would take that and go do stuff with it. Some of these organizations,
though, they're embracing OpenStreetMap and pushing that data out to that map or out to the
common database early in the process instead of trying to do it later on so that
it has more impact, has more utility, and it's a lot more up-to-date, right? Because it's just
something that's necessarily always changing. Yeah. 15 years ago, I went on vacation to Ireland,
and at the time, all of the travel info said, you can't rely on GPS, you can't rely on a map
that you buy in the U.S., you have to buy a physical map in Ireland when you get there to know what the roads are doing.
I'm hoping now that that's changed,
as you said,
with people getting more pushing data
to open street maps
and embracing that anyhow.
Some countries, you know, there's variation in the policies too.
So the United States, for example,
has an open data policy
where the data it captures
tends to be open and
available and companies can use that to build products and make things more valuable if they
want and take that and close it up and enhance it. EU, it wasn't quite like that. Certainly 15 years
ago, I think things have gotten better. Everybody's geospatial view of the world kind of starts with
their phone nowadays. And so that might be Apple Maps or Google Maps or something like that. But then, you know, they're starting to take that into other spaces. Like, you know, I know Snapchat has a bunch of geospatial stuff that they're doing. I don't know if TikTok's doing any geospatial stuff. But the spaces are kind of converging, where this, like, digital space and physical space stuff is kind of starting to merge together a little bit. And the part that takes that, the reality, the physical part,
and puts it into digital is challenging data processing,
challenging data modeling, and software.
So I wanted to kind of ask a little bit about the C++ code in these libraries.
I'm guessing the answer is going to be very different,
whether we're talking about GDAL or PDAL,
but are we using modern C++ at all in these libraries? It sounds like GDAL is a very
old library and PDAL is maybe relatively new. Yeah, GDAL predates good STL. Okay.
We actually had a policy in GDAL where we wouldn't allow STL into the library. Like GDAL had its own
kind of STL thing going on.
And there's still quite a bit of that
in the code base, actually.
But hold on, I'm gonna make a fork
and start a pull request now.
I think you would describe the design of GDAL
as like, you know, that original sort of vision
of C++ of C with classes.
And not much more than that,
you know, a very simple kind of core API, top-level API,
and then the concept that GDAL calls drivers,
which are just essentially implementations of a particular data model
applied to the overall GDAL model.
What that's allowed is for people to contribute to the library
and only have to
focus on the part they care about which is modeling their one specific little format or
their one specific use, without having to worry about, you know, I think it's almost a million lines of code now, the big complexity of this library. They don't have to worry about that if they want to just focus on their one little task.
You know, GDAL is, I guess, what I would call kind of fluffy or verbose in terms of
style. It's not particularly compact. A lot of the reasoning for that is in the late 90s, early 2000s,
compiler variability was a lot bigger. GDAL expects to run on, it was running on everything
from AIX to Windows 32 to OS X, first flavors of OS X.
So it was running all over the place.
And so it had to assume the responsibility for all that compiler variability.
Things have improved a lot in the interim.
So when we started PDAL in 2010 or so, Clang and GCC were implementing modern C++ at that point.
C++ 11 was just kind of coming onto the horizon.
And so we were able to say, hey, we're going to cut our baseline here at C++ 11.
Originally, our first implementations were based on Boost.
But once C++ 11 compilers were prevalent enough that you could depend upon them for the platforms you cared about. We set that as our floor.
And so that gives you things like shared pointers and memory management and stuff like that.
Recently for PDAL, though, for the next upcoming release, we're setting that floor at C++17.
The thing that pushed us up there is standard file system and getting that universal support
for that across everything. I think the deployment targets for where our software goes
tends to lag just because of the open source community,
the packagers or the vendors like Ubuntu or Debian or Red Hat
that will pick up these libraries are quite conservative.
And so the compilers they have available to them by default
are conservative, their settings tend to be quite conservative because their customer base and their user base
wants things to move slowly so that they don't break. And so for us, that has to be kind of a
target of, you know, we're not going to turn C++20 on, for example, even though as a developer,
we would love to, you know, both of those libraries kind of track that migration or evolution.
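For reference, this is the kind of portable path handling that motivated that C++17 floor; a trivial std::filesystem sketch, not anything taken from PDAL itself, and the tile path is made up:

```cpp
#include <filesystem>
#include <iostream>

namespace fs = std::filesystem;

int main() {
    // Portable path composition and queries, no platform-specific code.
    fs::path tile = fs::temp_directory_path() / "tiles" / "0_0_0.laz";
    std::cout << tile.filename() << ' ' << tile.extension() << '\n';
    std::cout << "exists: " << std::boolalpha << fs::exists(tile) << '\n';
}
```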
What kind of performance considerations does this system have?
Are people relying on it being super performant?
In some cases, maybe.
I mean, there's a penalty for having an abstract API, right?
You have this penalty of transferring between your data models, potentially. And if you
write purpose-built software for a purpose-built format, you're going to beat it. And so the idea
with these libraries is, if that's not a consideration for you, and you as a consumer
of the content just want access to the content without regard to those high performance considerations,
and you can get away with that, then libraries like these are quite valuable.
Because as a developer, you don't have to worry about all the intricacies of how data was laid out.
And then they roll a new version of that format out, and now you have more complexity to deal with. In exchange for that, the bargain there is that as a consumer
of the content, you don't have to worry about all those intricacies. But if you need performance
and you need to manage your system very precisely so that you know what's going on all the time,
you're just going to do much better developing and controlling that yourself. And then you might
provide a driver, say for GDAL or PDAL, to allow people to access your thing. And so we see lots of companies will do that. They might even be
closed source, so they'll have an SDK for their format, but they'll provide a GDAL driver for it
so that people have access to it. There's lots of benefits to that. One is GDAL as kind of a
distribution platform for capability. GDAL is available on Ubuntu and Red Hat
and all over the place.
And so people will install it and use that
as kind of a base for stuff that they're building.
So having support for your thing there has some value to it.
And also that allows you to manage the life
of the features of that thing
without regard to this larger thing.
And so you can kind of control how those move together.
A couple of follow-up questions I've been thinking.
You have 200, 300-ish, whatever, different formats that you support
between the raster and the vector, right?
Are all of those parsers then written internally, or do you ever rely on like libpng, libjpeg, you know, those kinds of things as well?
Frequently, we rely upon libpng or libjpeg or libtiff or lib-whatever, or a proprietary SDK for a weird one-off format that, you know, three people use but is really important to some group.
Like, you'll see all combinations of that.
So GDAL and PDAL both allow a user to build a proprietary, dynamically loaded plugin that'll get woken up at runtime, so that they can adapt the API to their driver.
And so that's been an important part of the model of both of these libraries
in that we recognize and want people doing
proprietary software to adapt to the system in a way that is opt-in when they want. You know,
they don't have to make some bargain to be able to participate. And, you know, we've seen companies
that will do that. And then eventually, like, it's more convenient to just have an open SDK.
And then eventually, it's just more convenient to have it in GDAL or PDAL. And so we've
seen some of that go on as well.
But every company doing various things can make their own business choices or technical choices to do that.
And they have the flexibility within the system to manage that.
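For a sense of what the driver side looks like in code, here's a sketch following the registration pattern from GDAL's raster driver tutorial; the FOO name and the FooDataset class are hypothetical stand-ins for a real format implementation.

```cpp
#include <gdal_priv.h>

// Hypothetical dataset class adapting the "FOO" format to GDAL's model.
class FooDataset final : public GDALDataset {
public:
    static GDALDataset *Open(GDALOpenInfo *info);
};

GDALDataset *FooDataset::Open(GDALOpenInfo *) {
    return nullptr; // a real driver sniffs the header and builds a dataset
}

// A plugin exports a registration entry point like this; once it runs,
// the driver is visible to every GDAL-based tool in the process.
void GDALRegister_FOO() {
    if (GDALGetDriverByName("FOO") != nullptr)
        return; // already registered

    auto *driver = new GDALDriver();
    driver->SetDescription("FOO");
    driver->SetMetadataItem(GDAL_DCAP_RASTER, "YES");
    driver->pfnOpen = FooDataset::Open;
    GetGDALDriverManager()->RegisterDriver(driver);
}
```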
So how much of these different drivers with supporting up to 200 or 300 formats, how much of it is in the GDAL code base?
And how much is like third-party drivers that you could optionally put in if you have it available? So if you go to the
website and kind of list the raster drivers, you know, I think there's like 180 or 200 or something,
I don't know what the list is right now. The majority of those are kind of adaptations on
the same thing, which is a bunch of binary bytes laid down on disk with some metadata.
And so they might have different organization,
they might have slightly different meaning,
but that's the gist of what raster data is, right? It's just a bunch of bytes on disk
and something that describes how they're organized.
Those format parsers tend to be pretty simple.
They don't have to be super complex.
They don't have lots of options or features with them.
But then there's others that are.
You'll see things like JPEG 2000, which is a really complex specification for compressing raster data.
And there's a library called OpenJPEG that's available to manage that.
And GDAL will just delegate and use that as the library.
So it doesn't do the decompression.
It doesn't do the geospatial part.
It just adapts that model to GDAL's model.
And so some of the drivers are that. You'll see proprietary drivers as well. And what most vendors
will do is they'll include that proprietary driver or parser or implementation of what's
using their SDK in the GDAL project itself. And then they'll have the user, whoever it is,
essentially compile and link that when they're building their own GDAL, to make it available to their customers. And so that's
another model that's available. And then finally, the last model is we just provide a runtime DLL
that's linked against GDAL and you can load it up at runtime to get access to our format along with
either our SDK or whatnot. And you'll see that commonly done as well,
mostly in commercial software deployments where they're deploying GDAL as part of their
commercial software system. We've talked a lot on this show about fuzz testing, and I'm finding
myself thinking, okay, you've got 200 whatever file formats, however many of them you offload
out. But it's pretty well accepted that fuzz testing,
anything that accepts user input, is good practice.
What does your test scenario look like for all of this?
So GDAL was, let's see, Even's not here, but Even Rouault is the current maintainer and kind of the majordomo of GDAL at this point.
And so he was invited by Google to include GDAL into OSS Fuzz very early on.
And I think GDAL was one of the projects with the most findings, at least for some time window.
It was also one with some of the more complex findings. Not every format driver is Fuzz tested
equally because of this linking and the proprietary things and all this sort of stuff.
But the test as currently run is fuzz tested with commits to the code base. And so we get findings not as frequently as we were. OSS fuzz is a continuous fuzzing harness, right? Like it's just
always running. Yes. They've found issues in libtiff. They find library issues. I don't think they found any issues in libjpeg, but they'll find little issues, right?
And sometimes that originates from GDAL and its usage or just the SDK usage.
The test suite for GDAL is very large, but it's testing of capability more so than full
engineering tests of every possible path through the code base.
It doesn't test that.
The C++ testing in GDAL is actually
driven by Python because it's very data sensitive, right? You have all these different data types and
configurations. Most of the testing is around that. I wouldn't recommend putting GDAL at the
end of your web address and saying, pass me in data. Even though it is fuzz tested, you might
want to jail stuff up a little bit.
That's awesome, though. That's really cool. I don't know what answer I expected,
but I didn't expect that. Yes, of course, we're continuously fuzzing as much of it as we
reasonably can. Underlying support libraries for GDAL, like Proj, are also fuzz tested through OSS
Fuzz: Proj, and I think GEOS, which would be the geometry algebra engine that you'll also see used.
And so the groups in these projects were early participants in OSS fuzz,
mostly because Google was using them and also wanted to, you know,
essentially not have to upstream all the bugs they were finding, right?
So, like, that feedback loop was tightened up quite a bit.
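For listeners who haven't seen one, an OSS-Fuzz target is essentially just a libFuzzer entry point wrapped around a parser. This is a generic sketch, with parse_my_format as a hypothetical stand-in, not one of GDAL's actual fuzzers:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical parser under test; a real one would live in the library.
static bool parse_my_format(const uint8_t *data, size_t size) {
    return size >= 4 && data[0] == 'F'; // stand-in for real parsing logic
}

// libFuzzer calls this in a loop with mutated inputs; sanitizers flag
// memory errors, and OSS-Fuzz minimizes and reports any crashes.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    parse_my_format(data, size);
    return 0; // non-zero return values are reserved
}
```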
All right. Well, Howard, it was great having you on the show today.
Thank you so much for telling us about these libraries.
Anything you want to plug before we let you go or anything like that?
A couple of things I want to plug.
One is something called cloud optimized point cloud and analysis ready data.
So in the geospatial realm, because the data are so large and they're sitting at rest somewhere, having the data laid down on disk in a way that you can access them without having to process all of them, like we were talking about earlier, is an important thing. And so you're starting to see some efforts to do that, and one of our efforts in the point cloud space is Cloud Optimized Point Cloud; that's at copc.io.
And the other is, I would never claim to be much of a C++ software engineer, so thank you very much for the opportunity to be on your podcast. I mean, for us, C++ is kind of a means to an end. At the time, doing high-performance geospatial
software processing where you want to move lots of bytes around, compress lots of bytes,
transform lots of bytes, that meant C or C++. And maybe it still does today,
but that's how these libraries came into being. And it's been interesting watching C++
kind of continue to evolve over the years
to be something certainly quite a bit different
and more like other programming languages
than it certainly was 20 years ago.
Well, I certainly hope C and C++
still do mean high performance and everything
like you were just saying.
I think it does.
Thanks again, Howard.
Thanks for coming on.
Thank you very much.
Thanks so much for listening in as we chat about
C++. We'd love to hear what you think
of the podcast. Please let us know if we're
discussing the stuff you're interested in,
or if you have a suggestion for a topic, we'd love
to hear about that too. You can email
all your thoughts to feedback at cppcast.com.
We'd also appreciate
if you can like CppCast on Facebook
and follow CppCast on
Twitter. You can also follow me @robwirving and Jason @lefticus on Twitter. We'd also like
to thank all our patrons who help support the show through Patreon. If you'd like to support
us on Patreon, you can do so at patreon.com slash cppcast. And of course, you can find all that info
and the show notes on the podcast website at cppcast.com.
Theme music for this episode is provided by podcastthemes.com.