CppCast - Modules and build systems

Episode Date: June 9, 2023

Daniel Ruoso joins Phil and Timur. After covering a couple of blog posts and a new UI library, we welcome Daniel back to talk with us about modules, package and build systems, and SG15, the tooling study group. We also revisit the Ecosystem International Standard.

News
- Modern C++ In-Depth — Is string_view Worth It?
- How to check if a pointer is in a range of memory - Raymond Chen
- Nui - new C++ Webview UI library
- Nui on Reddit
- Timur's Undefined Behaviour survey

Links
- P2898R0 - "Importable Headers are Not Universally Implementable"
- "Clang Automated Refactoring for everyone with clangmetatool" - Daniel's C++Now 2019 talk
- P1689R5 - "Format for describing dependencies of source files" (Kitware)

Transcript
Starting point is 00:00:00 Episode 362 of CppCast with guest Daniel Ruoso, recorded 7th of June 2023. This episode is sponsored by JetBrains. In our news section, we talk about a couple of blog posts and a new C++ user interface library. Then we are joined by Daniel Ruoso. Daniel talks to us about his work on C++ modules. Welcome to episode 362 of CppCast, the first podcast for C++ developers by C++ developers. I'm your host, Timur Doumler, joined by my co-host, Phil Nash. Phil, how are you doing today? I'm all right, Timur. How are you? I'm not too bad. I should say, if my audio sounds a little bit different or less good than usual, it's because I'm traveling.
Starting point is 00:01:25 I don't have my usual mic setup with me. I'm currently in Thessaloniki in Greece, visiting a friend. And next week, I'm traveling from here to Varna in Bulgaria, which is where we're going to have the committee meeting. It's actually not very far from here at all. And then, yeah, head back home to Finland after that. How are you, Phil?
Starting point is 00:01:40 Are you also still traveling? Or how's it going? Well, after my trip to Norway a couple of weeks ago, which concluded my Scandinavian tour, I'm actually done with traveling, at least for a couple of months now, although I have a holiday here in August. But apart from that, and C++ on Sea in a couple of weeks, which is only an hour or so drive, but I will be staying away from home. Apart from that, I'm going to be looking forward to reminding myself what home looks like.
Starting point is 00:02:03 All right, so at the top of every episode, I'd like to read a piece of feedback. This time we have a tweet by René Ferdinand Rivera Morrell, who commented on our Conan 2.0 episode, that was, I believe, two episodes ago, with Luis Caro Campos. And René said, good interview, love the ecosystem IS mentions, smiley face. So René, I'm very happy you liked the episode. And actually, I'm sure that the ecosystem IS will be mentioned again today, because we are continuing our mini-series on C++ tooling,
Starting point is 00:02:32 which we started in that Conan episode. Before we get to that, I just want to briefly mention also that apparently there's been a bit of confusion with the last CppCast episode, the one with Anthony Peacock, where apparently some people initially only saw a six-minute-long audio file instead of the full episode. So Phil, can you explain what was going
Starting point is 00:02:50 on there? Angle brackets, sigh, yes. As far as I know, the issues are all resolved. This is a new one, actually, because I was traveling and we had recorded the week before, but I uploaded the episode from the airport in Oslo and I thought it had gone through fine. But apparently there was, I think, a processing glitch on the host. So initially it only went out as a six-minute episode. So of course, when I landed, I corrected it, uploaded the full episode, and thought it was all fine. Then I started getting reports that, yeah, there's only a six-minute episode. And it's another one of these things where every podcast host or podcast
Starting point is 00:03:29 redistributor, they have their own way of caching things and clearing things. In some cases you just have to re-download. In other cases, I had to re-upload or give it a new GUID. But eventually Spotify was the holdout, where the only way I could get Spotify to not serve up a cached version that was only six minutes long was to change the file name that it was coming from. So I did that, and then it seemed to be fine. Except that for me, and I don't know if anybody else has seen this, I've not heard anybody else report it, I'm now getting past episodes showing up in my podcast player, and every time I remove them it just gives me another batch. So I don't know if anybody else has seen that.
Starting point is 00:04:05 Hopefully it's just me, because I can't see why it would be related. But if you have seen that, do let me know. I'm not sure there's much we can do about it at this point, but I will investigate further. Other than that, hopefully we're back on track. Well, thanks, Phil. And we'd like to hear your thoughts about the show.
Starting point is 00:04:22 You can always reach out to us on Twitter or Mastodon or email us at feedback at cppcast.com. Joining us today is Daniel Ruozo. Daniel has been working for over 20 years, working in and around build systems and package management. He introduced a package management system at Bloomberg that is used today for over 10,000 C++ projects. In the last five years, Daniel focused on static analysis, automated refactoring, and
Starting point is 00:04:47 building consensus on engineering practice. In the last two years, he collaborated with the ISO C++ Tooling Study Group to help figure out C++ modules. Daniel, welcome to the show. Thank you for having me. It's a pleasure to be here for the second time. Many of the things on your bio we're actually going to talk about a bit more when we get into the interview but the one thing that i picked up on that stood out to me because i work at sonar do static analysis tools that mention the static analysis got my attention
Starting point is 00:05:13 so is that something that you you are or have been doing in-house at bloomberg or is it just using existing tools uh both uh so we have a team that is called the STAR team, Static Analysis and Automated Refactoring. It's a nice acronym. And the role that we have is to introduce static analysis, both for off-the-shelf tools, and we use Clang Tidy and related tools. We also use CodeQL, which is the GitHub Advanced Security, and Fortify, and a bunch of other tools. We happen to also have some Sonar, but we're not actually driving the Sonar Analyzer a lot.
Starting point is 00:05:56 It's actually something that we have to investigate how much overlap there is with the other tools we do. But we also have a very dedicated effort to building custom tools for specific purposes. So we have, for instance, a feature enablement library. And so we have a custom tool that will go in, like retire the code that's behind the switch
Starting point is 00:06:24 after the switch has been fully rolled out and do that automatically for the users when the switch is set. And also a bunch of other refactorings that we do, mostly using Clang tools. Yeah, a lot of interest in those custom rules and custom refactorings at the moment, I think. We probably need better tooling, too, to support that, but Clang tooling is pretty good.
Starting point is 00:06:49 Yeah, that was actually the subject of a talk I gave back in 2019 in C++ Now. So if anyone is interested, there is more details. The talk is about the Clang MetaTool library that we developed at Bloomberg. It's open source and github.com slash Bloomberg slash Clang MetaTool library that we developed at Bloomberg. It's open source and github.com slash Bloomberg slash Clang MetaTool. That essentially makes it very easy to build a small standalone tool. And the workflow is usually you're going to build that small tool,
Starting point is 00:07:16 run it across your code base, be done with it, and throw it away. Nice. We'll put a link to that in the show notes. Thanks. All right, Daniel. We'll get more into your work in just a few minutes, but we have a couple of news articles to talk about. So feel free to comment on any of those, okay? So the first one, I already kind of mentioned it briefly, next week is the committee meeting in Varna, Bulgaria. So I'm very much looking forward to that.
Starting point is 00:07:37 It's going to be our first post-C++23 meeting where we can start voting new things into C++26. So that's going to be very exciting. And yeah, I expect there's going to be lots of trip reports afterwards so you can see what happened there if you haven't had the opportunity to be there yourself. Yeah, and I think Vana was also the one that was cancelled during the pandemic. The first one to be cancelled, I seem to remember. Yes. Well, there were quite a few that were cancelled during the pandemic, but that was, I think,
Starting point is 00:08:05 the first one that would have been summer 2020. Yes, and then it has been postponed multiple times, and I'm very, very happy that you finally get to go there. I actually have been to Varna myself a few years ago on a completely unrelated vacation trip. So it's a really nice place. I'm very much looking forward to being back there. I'm jealous.
Starting point is 00:08:28 I'll see you there. See you there, yeah. Then we have a couple of blog posts that caught my attention this week. First one was by Michael Christophic, and it's actually part of his series, Modern C++ in Depth. And the blog post is called,
Starting point is 00:08:42 Is StringView Worth It? And I found that really interesting because it was kind of from the perspective of a large financial data company that started using C++ before the string was actually available in the standard library. So they developed their own string class,
Starting point is 00:08:57 like I'm sure thousands of companies out there. I've actually worked at one of those myself, a music tech company that had their own std string like way before std string was a thing and obviously with different semantics. So that's quite a common scenario, I think, especially for bigger companies that have been around for longer.
Starting point is 00:09:14 And so the article actually says, well, StringView can be helpful as a kind of lingua franca if you want to move away from char pointers as the lingua franca. You have this kind of situation. And then it gives a really nice overview of when and how you actually should use the string view
Starting point is 00:09:29 and when you really shouldn't. And I thought that was a really nice overview. It really discusses the typical pitfalls you get with the string view. It has reference semantics, so every reference it can dangle. But actually, it's even worse than normal references. But if you have're playing const ref,
Starting point is 00:09:46 and you assign like a temporary object to it, then that's going to extend the lifetime of the temporary object. But std string view doesn't actually do that. So it's even easier to end up with kind of dangling string views. And yeah, I thought that this blog post was a really nice overview. Because now that we had the string view since c++ 17 so for six years kind of kind of nice to uh look back and see okay what are the use cases what have you learned from from using it in the field yeah i actually uh i had my own string class post c++ 98 in fact uh even post c++ 11 because there were some semantics that we particularly wanted that std string didn't give us.
Starting point is 00:10:28 And we had our own string view, where we called it string ref. And the nice thing is that we could actually make them work well together. So our string class was reference counted. And the string view could also have like a weak reference in. So you could take a strong string from it at the end, and it will pick up the reference count again. So there are still some things, still some reasons that you might want to have your own string class you should think very carefully before doing so but uh but it may still apply i think the
Starting point is 00:10:56 interesting thing this this blog post made me think about is just how deep into the C++ ecosystem the problem of vocabulary is, where you end up with this niche places, each one with its own set of words to talk about things, and how important it is that we start converging on those basic vocabulary words that we use. Yeah. Right. And so the second blog post that I want to briefly mention is by Raymond Chen.
Starting point is 00:11:28 It's called How to Check if a Pointer is in a Range of Memory. And that was from the Microsoft Dev Blog. And I thought that was a really fascinating blog post. It basically asks the question, given a range described by a pointer and a size, can you check if some other pointer actually lies within that range? And you might be tempted to say, well, okay, we'll just check the
Starting point is 00:11:51 integer value of the pointer and see if it's numerically between pointer and pointer plus size. But that's actually wrong because conceptually in the C++ abstract machine, pointers are not just integers, right? They have an address, but they also have a provenance.
Starting point is 00:12:10 And so that kind of makes this approach not work at all. And you have to think about this quite differently. And while on modern platforms, typically you have a flat memory model, but in the end, it's again, just an address, at least at runtime. But Raymond actually gives an example of an older architecture,
Starting point is 00:12:28 like the 802286 processor, where that is actually not the case. And pointers, indeed, are not just addresses, but they have, like, different parts that mean different things, where, kind of, you need this concept in a C++ abstract machine of provenance to actually work with that kind of thing.
Starting point is 00:12:43 So I thought that was really cool and fascinating. Yeah, it was a real trip down memory lane for me. I did start out working with near and far pointers on the 286 back in the day. I'd sort of forgotten what it was like to work with that, but yeah. The idea that a near pointer is sort of within a single 16-bit integer jump from where you are now you could actually address that more quickly than something that was further away where it needed like the two segments so yeah these things still exist i think it's actually in in the very near future we're starting to go to a place where we're going to have more exact architectures that we interface with as like gpu
Starting point is 00:13:25 code gets uh more common and other specialized chips if you think about how the risk 5 architecture is being designed where there's all these extensions so i think we're we're actually going to be on a point where these kinds of subtleties in the abstract machine are going to become more relevant. The one thing that the post made me think about was a joke that we have internally with some core workers that we really should build a troll OS where everything that's in the standard is implemented to the latter
Starting point is 00:14:00 and then everything else is just randomized. That's a great idea. I actually thought about that too at some point like you can have um you know 11 bit bytes and all kinds of weird stuff and like the size of int can be equal to size of char and all kinds of evil things that could be fun i'm curious how many how many library uh test suites would fall apart if you do that. Let's do it. Right. And so one more thing that I want to mention is I saw this initially on Reddit.
Starting point is 00:14:33 There was a post about a new C++ user interface library called Nui. And so there was a discussion on Reddit, but there's also GitHub, and there's also a dedicated website. So that's a new GUI library that is permissively licensed. It has a boost license, and it lets you write a UI in C++. But it's not like Qt or anything like that,
Starting point is 00:14:53 because it then turns that UI into a web view. So it's a bit more like Electron or something like that, but you actually write the UI in C++. So my understanding is that the C++ you write for UI gets compiled to WebAssembly and then rendered by WebView in a finished app. It does use modern C++. It has
Starting point is 00:15:11 quite a few interesting features. The author says that it still needs some polish and features, but it's already fully usable and documented. And that currently only Linux and Windows are supported, but the author is hoping for a contribution from Mac in the future. So I thought that was kind of an interesting approach.
Starting point is 00:15:32 I'm curious what you think about that. I think interesting is doing a lot of work there. Now, I will confess I haven't actually read the article, so maybe some of this is explained there, but I wonder what the use case for this would be because obviously there are plenty of webview based frameworks that are more javascript based or typescript based and it seems like a better fit for that sort of thing and that's of course you're trying to bundle in a bit a big c++ library that's the only reason i could think of it but
Starting point is 00:15:59 i just think there are probably still better ways to interoperate. But maybe there's a good use case for it. It also struck me as amusing that one reason that you would want to do this more web-based is for maximum cross-platform capabilities, but it's not available in all platforms yet. I was actually wondering
Starting point is 00:16:20 the same thing, like what's the use case for this? Because from what I've seen, I would approach it the other way around, that you often wondering the same thing, like what's the use case for this? Because, you know, from what I've seen, kind of, I would approach it the other way around, that you often don't want to write the GUI in C++, right?
Starting point is 00:16:31 Because it's not necessarily a language that lends itself to that. Like I've used like Qt and Juice and like other frameworks, and it's typically quite painful to write a GUI in C++ compared to some like declarative approach or, you know, JavaScript
Starting point is 00:16:44 or something like that. Particularly because as integration with WebAsm becomes a thing, if you have a specific part of your UI that has a heavy computation step, then you can still write that heavy computation step in C++ and call that from the JavaScript UI code. Yeah. But what I found interesting is that in the Reddit discussion, there were lots of very positive comments like,
Starting point is 00:17:12 oh, this is amazing, this is great. So apparently, you know, there seems to be a use case for this kind of stuff. Oh, technically, I'm sure it is amazing. It does sound like a very interesting problem to be solved. I'm just not sure. I will claim defensive ignorance on writing GUIs. GUIs have been at least 20 years since I did one. Not my area at all.
Starting point is 00:17:41 Yeah, and it seems like if you are writing a GUI in C++, then you kind of want it to be a native GUI rather than a work view, right? But again, maybe there is a use case here that I'm missing. All right, so that kind of concludes the news articles, but I do actually have two things on my own behalf that I quickly want to mention. First one is, if you remember,
Starting point is 00:18:03 I said I'm going to organize a meetup in Helsinki and actually it's going really well. So we're going to have our first meetup in two weeks on Tuesday, the 20th of June. We're going to have our first ever C++ Helsinki meetup. So if you are in Finland, please come. We will have, so it's going to be the week after Vana. So we're going to have a talk from
Starting point is 00:18:25 yari ronkainen about forming habits and teaching c++ but we're also going to have a report about what happened in varna by lauri wasama mark gillard and myself and yeah it's our first ever meetup so i'm very curious about how it's going to go hopefully not the last so this is the this is the second time that you've done your first ever meetup in Finland, isn't it? Right. So the other first ever meetup was basically just a very informal thing. It's just a bunch of people met in a bar who were on the Discord. It was never announced officially on the internet anywhere. It was just whoever was on the Discord channel, we just got together for a beer, basically.
Starting point is 00:19:01 But that was kind of the unofficial first meetup. But this is going to be the official first meetup, but this is going to be the official first meetup. We're going to have talks and everything. And we expect quite a few people to come. So we're going to be excited. Well, good luck with that. Thank you. Thank you. And the other thing is that my undefined behavior survey for CPP on C is also still running. So please, if you haven't done so yet, please participate under timur.audio slash survey. And if it hasn't terminated at this point, it may well be undefined behavior, as we said before. So with that, let's transition to our main topic for the day, which is modules
Starting point is 00:19:35 and C++. And our guest, Daniel Rousseau. Hello again. Hello. So you mentioned in your bio that you work on build systems. So how did you get to work on build systems? This is something that I know some people I've met try to avoid working in, although it's a fascinating kind of area. So I'm curious how you got into that particular slice of the C++ universe.
Starting point is 00:20:02 So I think the way that the industry worked when I started, which was like 98, 99, we still have very few division of labor classes. Like we were all playing like all the roles all the time. And I ended up being the person that started putting together the package management. At the time, it was not mostly C++. It was mostly a Perl shop,
Starting point is 00:20:36 but it still has to build packages and create deployable artifacts and create like a CI, CD framework of some sort. And I ended up being the person that solved that problem. And that's kind of the decision that starts pushing you in a direction in your career, regardless of how much you want it or not. It's not that I didn't want, but it's kind of like a self-guiding process where the more you
Starting point is 00:21:06 start looking at that problem the more the problem becomes more complex and the more you have to spend on it um then after this initial contact uh context i i built a little automation at the time we were using cvs and we were using deb. And so I had a little automation that would watch the CVS repo and whatever you changed the package, it would like rebuild the Debian package and ship to the repo that would then get installed in the production machines. And that actually was how I ended up becoming a Debian developer myself around 2004.
Starting point is 00:21:42 And that, that like just pushed me all the way through. I got involved into trying to bootstrap new architectures. I was at some point trying to build a Debian based on UC LibC to run in really small devices. But that didn't go anywhere. But that was a huge introduction to the whole world of toolchains. And then I was just lost to the world. And that was my life.
Starting point is 00:22:13 So you work at Bloomberg, but you're also on the C++ Standards Committee. And if you draw a Venn diagram of well-known C++ people at Bloomberg and people on the C++ Standards Committee. There's quite a big overlap. But what is it you do both at Bloomberg and on the committee? And is there a relation between the two? There is. So I joined Bloomberg in February 2011, so 12 years ago at this point. And one of the very few, well, not the very first, but one of the very few well not the very first but one of the first
Starting point is 00:22:47 projects i worked with was actually again introducing automation to getting package management and builds and uh ironically or not ironically but funnily enough also based on the debian packaging system so i was like back to 1999, building the same thing at Bloomberg, but now with sandboxing and a bunch of other things and supporting C++. And it was only around 2019 that I started to be more interested in the C++ standard committee work
Starting point is 00:23:25 because I was honestly scared of the direction of modules because they seemed somewhat incompatible with the way that we built code. So in the early experiences, a lot of the discussions that happened in the context of defining how modules are going to work were heavily influenced by organizations with what I call heavily regulated build systems or monorepos. different than the requirements of organizations like Bloomberg, where we have an open-ended package-based build system where you don't see the source of the other package. You just see files on disk that were produced by the build process. And that's the only interface you have. And the different projects could be using entirely different build systems. And in our case, sometimes they do. It's often that they do. So it was clear to me that we had a huge gap to get to a point where we could have like a sconce project and a CMake project in a C++ ecosystem with modules and everything worked. And that's what pushed me into the work in the C++ standard committee.
Starting point is 00:24:51 And later what pushed Bloomberg to start a funding kit where to work on the implementation of modules to drive a vision where our universe at Bloomberg actually could work with modules. In a way, our biggest fear at the time is that we would have a big fork of the C++ ecosystem where you would have some parts of the ecosystem that could use modules
Starting point is 00:25:16 and some parts of the ecosystem that couldn't. And now either you're in these companies that have these huge monorepos and module work for them, or you're not. And then there's a bunch of open source libraries that you can't use. So that was our biggest fear. Right. So let's talk about modules. We can get into the specific problems maybe in a second. But let me kind of zoom out and ask a very kind of general question
Starting point is 00:25:45 that I think quite a lot of people are asking themselves these days. So we have modules and standards in C++ 20, so it's been three years now. They're still not widely used in the C++ community, right? If you go and get, I don't know, whatever library, it might even be using very modern C++, but it's going to be a header, it's going to be a header and source or something.
Starting point is 00:26:07 It's not going to be a module, right? And so you don't really see modules really being widely used in the C++ community. And I'm just curious why that is, like what's the actual challenge there? And did we as a committee get them wrong? Like, what do you think is the main issue there? Why do we not
Starting point is 00:26:27 see them in the wild? So I want to focus on named modules first because named modules are something a lot easier to talk about and we can get into importable headers or header units later. Can you maybe quickly
Starting point is 00:26:43 for our listeners, can you say what named modules are? Yeah, so named modules are what you see, what you would expect to say import STD. So you say STD without angle brackets, without quotes. There is this new name space of like module lookup. So this is an entirely new avenue of module lookup. So this is an entirely new avenue of how things are named in C++
Starting point is 00:27:09 because C++ doesn't have enough types of namespaces. And the idea is, and the main distinction between a named module and a header unit is, the import statement cannot affect the state of the preprocessor. But all the entities that were exported from that module are now reachable to the translation unit that did the import.
Starting point is 00:27:38 What does that mean in practice? Right. So if I have a file that's like module blah and then export this export that and then another file i say import blah then that's what you're talking about that's what name modules correct yeah yeah and and so what the reason why they're called named modules is because the compiler has this new namespace which is like the module name lookup. And then you can tell the compiler, for module blah, you can take the pre-built module interface from this location, right?
Starting point is 00:28:13 And the distinction from that to importable headers or header units is primarily that, one, you don't have this new namespace. When you say import angle bracket IO stream, you're essentially doing the same... You're supposedly doing the same lookup as you would do with the include statement.
Starting point is 00:28:41 But we don't really have a concept of identity of headers, right? Is that also why you can't standardize pragma once, even though everybody's using it? Yeah, that is exactly the reason. Once what? We barely acknowledge in the standard that the things are in files. How can we even define that they're the same file? What is a file?
Starting point is 00:29:06 And so that's the first challenge. So if I have to give the compiler the pre-built module interface, or the built module interface, we use the acronym BMI all the time, saying this header unit is in this location, what is it to tell that this is going to be coming from the same source that is what the compiler
Starting point is 00:29:34 would have seen if it was doing an include? There's nothing to say this. And for the user, I think this is going to be very challenging. And the second part is the compiler is allowed to of code is understood by the compiler can be drastically different if you allow the compiler to do the source inclusion or if now you're saying, take this BMI instead.
Starting point is 00:30:16 So that's the main difference. And I think we kind of stepped ahead a bit on the discussion, but that's the main thing that's the biggest challenge for header units. And there's a lot of additional problems in the tooling space that come because of it. Right. So let's rewind maybe and start with the name modules. So you say that those are actually a lot more straightforward, but those are also not really
Starting point is 00:30:47 widely adopted yet. The main thing is there is a gigantic shift that happens when it comes to the adoption of C++ modules in the tooling ecosystem. We like to use this term embarrassingly parallel when we talk about C++ build systems, because up to modules, the order at which you translated individual units was irrelevant. There were no dependencies across translation units. There were dependencies from translation units to source files that were included.
Starting point is 00:31:36 And the traditional way that this is implemented is you have the first translation generate a dependency file that gets read by the build system when it's available, and that tells the build system when an incremental build of that translation unit is necessary. So this is how C and C++ build systems have worked forever now when we get to modules we have a significant change in which not only the order of translation units is now relevant, but we have a new type of relationship between translation units. So before we had only one kind of translation unit relationship, which was linkage. So we only had ABI relationship between different translation units. Either you built your objects coherently
Starting point is 00:32:42 and everything should work, or you built your objects incoherently and you're going to have an ODR violation, which hopefully fails at link time, but most likely will just sag fault in production. With built module interfaces, you have this new import relationship where in order to translate this particular unit, you need to have translated the module being imported beforehand.
Starting point is 00:33:10 So now you need the build system to topologically sort all the translation units in order to build them in the right order, because otherwise it just can't build. It would just fail saying, hey, I can't find this module. I can't compile this translation unit. Yeah, but isn't this like the whole point of modules? In pre-modules, you can do #include <vector>
Starting point is 00:33:38 in your 2,000 different source files, and then you're going to be parsing and compiling the vector header 2,000 times, right? And then you throw it all out again when you link. And so you just don't want to spend all of those resources on recompiling the same stuff over and over again. You want to compile it once and then reuse it. So this relationship, that's kind of the whole point, right? That's what you want.
Starting point is 00:33:59 Yeah, and conceptually, it's all great. It's just, for this to work, there are a lot of steps in converging how build systems work. For instance, how do we know the order in which the translations have to happen? Which means that now we need a dependency scanning step before the build even starts, because the build needs to know what modules are where and which modules depend on what modules.
Starting point is 00:34:34 And so there has been a long process of figuring out. So now we have this extra dependency scanning. What is the output of this dependency scanning? Great. So now we have a common output format for the dependency scanning. There was a paper that Kitware wrote back in 2020, I think, that essentially describes the output format for the dependency scanning that will say,
Starting point is 00:35:07 this source file provides this named module and requires these other named modules. Oh, is it this JSON format that I think CMake now supports? I think the three major compilers are supposed to support it now, I guess. GCC is almost there, I think. There's a patch that Kitware is pushing upstream on this. But the reality is, it's only last year that we got all the compilers, or at least MSVC and Clang, and a patched GCC, to produce this format.
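The Kitware paper in question is P1689 ("Format for describing dependencies of source files", linked in the show notes). Its per-source-file output looks roughly like the snippet below; treat the field details as an approximation, as the exact schema is specified in P1689R5:

```json
{
  "version": 1,
  "revision": 0,
  "rules": [
    {
      "primary-output": "app.o",
      "provides": [ { "logical-name": "app", "is-interface": true } ],
      "requires": [ { "logical-name": "util" } ]
    }
  ]
}
```

The build system runs the scanner over every source file, then collates these rules into the module-level dependency graph Daniel describes.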
Starting point is 00:35:48 And this doesn't even work with header units on GCC. Because, and we'll come back to header units, I'm very aggravated about header units. So let's say named modules for now. So now that we converged on the dependency scanning step, then we had a second problem, which is, right, so CMake generates a build system in Ninja, but Ninja didn't support dynamically adding dependency nodes between translation units,
Starting point is 00:36:27 between build nodes, right? You couldn't add dependency edges between nodes dynamically as part of the build. And it turns out that Kitware had a fork of Ninja for a long time, because that was required for Fortran modules, because C++ modules are heavily inspired by Fortran modules. And it was only when it became clear that this was a requirement for C++
Starting point is 00:36:54 that Ninja upstream finally accepted the patch. And so CMake could generate a Ninja build that worked with upstream Ninja to support building modules in the right order. And so it has just been this very long process of getting the tools in place. And we're now confident, with the work that Bloomberg is helping fund with Kitware, that by the end of this year, if you have a CMake project and you want to use modules internally to that project, it's going to be a viable solution.
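The collation step Daniel describes, turning per-TU scan results into a valid build order, can be sketched in a few lines. This is a hypothetical illustration, not CMake's or Ninja's actual implementation, and the struct fields are only loosely modeled on P1689:

```cpp
#include <functional>
#include <map>
#include <set>
#include <string>
#include <vector>

// One translation unit's scan result, loosely in the spirit of P1689.
struct ScanResult {
    std::string output;                  // object file this TU produces
    std::vector<std::string> provides;   // named modules this TU exports
    std::vector<std::string> required;   // named modules this TU imports
};

// Collate per-TU scan results into a build order where every module
// interface is translated before its importers. Returns an empty vector
// on a cycle or a dangling import.
std::vector<std::string> buildOrder(const std::vector<ScanResult>& scans) {
    std::map<std::string, const ScanResult*> byModule;
    for (const auto& s : scans)
        for (const auto& m : s.provides) byModule[m] = &s;

    std::vector<std::string> order;
    std::set<std::string> done, visiting;
    bool ok = true;
    // Depth-first visit: a TU is emitted only after all its imports.
    std::function<void(const ScanResult&)> visit = [&](const ScanResult& s) {
        if (done.count(s.output) || !ok) return;
        if (!visiting.insert(s.output).second) { ok = false; return; }  // cycle
        for (const auto& m : s.required) {
            auto it = byModule.find(m);
            if (it == byModule.end()) { ok = false; return; }  // missing module
            visit(*it->second);
        }
        visiting.erase(s.output);
        done.insert(s.output);
        order.push_back(s.output);
    };
    for (const auto& s : scans) visit(s);
    return ok ? order : std::vector<std::string>{};
}
```

Note the contrast with the pre-modules world: here the dependency edges must be known before the first compile command runs, which is exactly why the scanning step has to happen up front.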
Starting point is 00:37:35 Or if you have a multi-project integrated CMake build where everything is in the same CMake project, kind of like submodule style, this is likely going to work. Or even with export files, as long as you're careful about making the flags close enough. Because here comes the other challenge of modules, which is that the implementation of the built module interface, and how that gets imported, actually had its roots in precompiled headers,
Starting point is 00:38:09 which means that this new import relationship between translation units is actually driven by the performance of the import statement rather than the interoperability of the tooling. What does that mean?
Starting point is 00:38:32 What I mean is, different objects produced by different compilers, as long as they use the same standard library in ABI-compatible ways, can be linked together. But if you have a BMI that was produced by Clang and you want to import it from GCC, it's just not going to work at all. It's even worse.
Starting point is 00:39:00 If you have a BMI produced by Clang 14 and you want to import it in a translation unit that you're compiling with Clang 15, that's not going to work. Because the import process is actually doing heavily optimized mmap, memcpy kind of things to make the import really, really fast. But it gets even worse, because specific flags, even if you're using exactly the same compiler, will change the way that the abstract syntax tree is constructed,
Starting point is 00:39:33 which means that that BMI is not going to be usable. So let's say you have a library that was built with standard C++20, and then you have your translation unit building with standard C++23; the BMI is not going to be usable. And so that's the second challenge that we have been working through.
Starting point is 00:39:53 And we kind of have a consensus of the idea that the compiler needs to advertise kind of like an opaque hash token describing what is the compatibility of this BMI. And then you can ask the compiler doing the import what is the compatibility of its BMIs. And essentially, in a way, you kind of have to build in the most pessimistic way possible
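The opaque-token idea can be sketched like this. Everything here is hypothetical: in practice the compiler itself would produce the token, and the real set of ingredients (compiler identity, version, AST-affecting flags) is compiler-specific:

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical sketch of a BMI compatibility token: a hash over everything
// known to affect how the BMI is laid out. A real compiler would compute
// this itself; the ingredients chosen here are purely illustrative.
std::string bmiToken(const std::string& compiler, const std::string& version,
                     const std::vector<std::string>& flags) {
    std::string key = compiler + '\0' + version;
    for (const auto& f : flags) key += '\0' + f;
    return std::to_string(std::hash<std::string>{}(key));
}

// A consumer may only import a BMI whose token matches its own exactly;
// anything else means rebuilding that module for this consumer.
bool canImport(const std::string& producerToken, const std::string& consumerToken) {
    return producerToken == consumerToken;
}
```

With a scheme like this, the build system can ask the importing compiler for its token up front and decide whether a cached BMI is reusable, instead of finding out at compile time.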
Starting point is 00:40:24 as if every translation unit doing an import needs its own build of every module that it needs transitively, and then just hope that the deduplication across those translation units is actually going to get you better performance. Because if we don't do that, what ends up happening is your build system starts working and then suddenly you just get a compilation failure saying
Starting point is 00:40:51 you chose the wrong flags, that's too bad. Sounds fun. But again, we're almost there. Does all of this imply that we actually got it wrong when we standardized modules in the first place? I don't think we got it wrong when we standardized modules in the first place? That was one of Tim's original questions.
Starting point is 00:41:07 I don't think we got it wrong. I think the main thing is, a lot of the challenges would have been significantly easier if we had done packaging before modules. Because a lot of what we spent a year talking about in SG15 was: how do I ship a pre-built library with modules? What does that look like? What are the files for this? How can CMake import a library that has modules that were built in SCons? And we have been slowly working through it, and we have a general mental model of how that should work.
Starting point is 00:41:57 And now it's a simple matter of programming to get CMake to actually implement all of that. But I don't think it was... From a language perspective, I think modules are fine. Named modules are fine. I think it's just that there was a general underestimation
Starting point is 00:42:18 of the impact that it would have in the tooling space. And for that reason, the effort required to make it work is significantly higher than most people expect. GCC has had module support since GCC 10, I think. But the tooling to make it usable is just
Starting point is 00:42:45 not there yet. Well, I know you want to get on to talk about importable headers and header units. But before we do that, it's a good time to take a little break. And while we're talking about the sorry state of C++, it's a great time to
Starting point is 00:43:01 talk about Sonar, the home of Clean Code, a sponsor of this episode. So SonarLint is a free plugin for your IDE. It helps you find and fix bugs and security issues from the moment you start writing code. You can also add SonarQube or SonarCloud to extend your CI/CD pipeline and enable your whole team to deliver clean code consistently
Starting point is 00:43:21 and efficiently on every check-in or pull request. Sonar Cloud is completely free for open-source projects and integrates with all of the cloud DevOps platforms. All right, Daniel, so let's talk about importable headers. So you recently wrote a paper, P2898, saying that importable headers are not universally implementable. And you also had a talk at C++ Now a few weeks ago, which was a great talk, by the way,
Starting point is 00:43:47 about the challenges of implementing header units. First of all, apologies if that's a stupid question, but are importable headers and header units the same thing? It's not a stupid question at all. Naming things is really hard. But it's essentially talking about similar things just on different levels of abstraction. When the standard talk about importable headers,
Starting point is 00:44:13 it's talking about the header in the abstract from the perspective of how the compiler should think about the semantics of the importation process. And when we talk about header units, we're talking about how the build system needs to think about it in terms of the fact that there is now one more node in the build graph to produce that header as a translation unit, and that's what we call a header unit.
Starting point is 00:44:42 So it's a very similar thing, just talking about it at different layers of abstraction, because the header unit is the translation of an importable header. All right. And so when you say they're not universally implementable, what does that mean, and how bad is it, and can you fix it? So let me start by talking about what universal means. And this goes back to how the conversation about modules in the beginning was very focused on specific environments, again, what I call the highly regulated environments. And the main difference is that those environments tend to have
Starting point is 00:45:27 very elaborate and powerful build systems where adding new nodes to the build graph during the build itself is a normal thing to do. But when we consider the C++ ecosystem as a whole, then we have to consider that POSIX make is a part of the C++ ecosystem, right? And while it is possible to get named modules to work in POSIX make,
Starting point is 00:45:58 and it's definitely not going to be the case that you're going to be manually editing those make files, like that part is gone. But from the perspective of the tooling ecosystem, it's still possible for CMake, for instance, to generate POSIX make files that will be able to build a system using named modules. And the reason for that is that the dependency scanning process happens before everything else. And that's a requirement in all cases.
Starting point is 00:46:38 But as I was talking about before, in the case of named modules, the import statement is not allowed to affect the state of the preprocessor at the time of the import, which means that you can look at a translation unit in isolation, and you're going to find all the edges that this translation unit has in terms of dependencies. And it doesn't matter that they're dangling at the time of the dependency scan, because you can do the dependency scan in an embarrassingly parallel step, and then you collate all that data into a coherent build graph, and it works. With importable headers, on the other hand, the import statement, or the transparent rewriting of a #include
Starting point is 00:47:30 into an import by the compiler, which is allowed by the standard, is allowed to influence the state of the preprocessor. What does that mean? It means that if we are doing a header import, and we are saying that the translation of the header unit happens independently from this translation unit, then I can't just pretend that the source inclusion is equivalent to the import. Because if I have something in my preprocessor state that would result in that header being interpreted in a different way than if it was imported standalone, I will end up with an incoherent build. And the end result of that thought process is that the list of header units becomes a dependency of the dependency scanning process itself. So we need to know the list of all importable headers before we read any file. But it gets worse than that.
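The incoherence Daniel describes can be modeled with a toy example. Imagine a hypothetical header whose meaning depends on a macro; under textual inclusion the header sees the importer's preprocessor state, while a header unit is translated standalone with its own flags:

```cpp
#include <set>
#include <string>

// Imagine a hypothetical header foo.h containing:
//   #ifdef FAST
//   inline int answer() { return 1; }
//   #else
//   inline int answer() { return 2; }
//   #endif
// This function models "translating" foo.h under a given set of defined
// macros, the way the preprocessor would see it.
int answerUnderMacros(const std::set<std::string>& definedMacros) {
    return definedMacros.count("FAST") ? 1 : 2;
}
```

A TU that does `#define FAST` before `#include "foo.h"` gets one `answer()`; a header unit built without `-DFAST` gives the other. So if the compiler transparently rewrites that `#include` into an import of a standalone header unit, the program's meaning silently changes, which is exactly why the scanner must know the importable-header list, and each header's flags, up front.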
Starting point is 00:48:46 Because as we translate the header unit independently, that header unit needs its own compiler flags. And the compiler flags of the header unit are not necessarily the same as the compiler flags of the translation unit doing the import. In fact, this is very much a desired outcome. Like, one of the things that we want is to be able to isolate the preprocessor flags such that
Starting point is 00:49:12 everyone sees iostream the same way. On the other hand, that means that I need to know what the compiler flags for iostream are before I do the dependency scanning. And the dependency scanning now needs to emulate what the import process will do. What does that emulation look like? The preprocessor stops at the point of the import, starts a new preprocessor context informed by the compile command of the header unit
Starting point is 00:49:48 that's going to be translated later, processes that header unit, gets the final state of the preprocessor at that point, and merges it back into the original preprocessor state. Right? So that's what's necessary
Starting point is 00:50:03 to correctly do the dependency scanning with header units. The end result of that is that the list of all header units, and all the arguments for all the header units, is now a dependency of the dependency scanning process itself. Consequently, any change to any one of those things, to the list of header units, either adding a new header unit or removing a header unit, or changes to the arguments of how those header units are translated, effectively invalidates the entire build.
Starting point is 00:50:44 Because with POSIX make, if a target gets invalidated, it's over, the invalidation goes all the way through. So a workaround for that is that, in the case of Ninja, for instance, you can use the restat option on that target and say, okay, if the dependency scanning ran and the output of
Starting point is 00:51:08 the dependency scanning is the same, then stop. Essentially, you write through an intermediary file, and then have the final file depend on the intermediate file with the restat option, but only copy it over if the contents are different. And then Ninja can stop the invalidation after that. Or in the case of SCons, where you have checksum-based invalidation: if the dependency scanning produces the same checksum, again, the invalidation stops.
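The write-through-an-intermediate trick amounts to a copy-if-different step (similar in spirit to CMake's `copy_if_different`). A minimal sketch, with an illustrative file name:

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Rewrite `path` only when `contents` differ from what is already there,
// so the file's timestamp is untouched on a no-op and a restat- or
// checksum-aware tool (Ninja, SCons) can stop the invalidation chain.
// Returns true if the file was actually (re)written.
bool writeIfDifferent(const std::string& path, const std::string& contents) {
    {
        std::ifstream in(path, std::ios::binary);
        if (in) {
            std::ostringstream existing;
            existing << in.rdbuf();
            if (existing.str() == contents) return false;  // unchanged: skip
        }
    }
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    out << contents;
    return true;
}
```

Under POSIX make, by contrast, the scan rule firing at all is enough to invalidate everything downstream, regardless of whether its output changed, which is the crux of the problem with header units there.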
Starting point is 00:51:43 But in the case of POSIX make, that's not how it works. The moment that you invalidate a rule, all downstream rules are automatically invalidated. So what does that mean? It can mean two things. It can mean, one, that header units should not be used, because they're unusable in environments where C++ is used today.
Starting point is 00:52:12 Or it means that we're declaring, like Michael Scott style, that POSIX make is no longer a valid part of the C++ ecosystem. And it may be that we get to that point, but my main concern has been that we have been driving this conversation in a very implicit way, as saying like, oh, the standard requires this, therefore the standard is right.
Starting point is 00:52:46 Therefore, it doesn't matter what the cost is. And the thing that I want us to be explicit about is if we're saying that we want to commit to what's in the standard, then we're explicitly saying, fine, like we're explicitly choosing to say POSIX make is no longer a valid C++ build system driver. I personally think it's weird for us to make that choice, but I'm not going to hold everyone back if that's where the consensus is going.
Starting point is 00:53:18 This definitely brings to mind that quote, and I used it in a talk just recently. No plan survives first contact with the enemy. I think that's definitely what we're seeing with modules. It's been years specifying it. And then when we actually try to use it, then we hit all these rafts of unexpected consequences, I think, that we're still working through. So I want to thank you for your part
Starting point is 00:53:41 in trying to make sense of all this and even do something about it. Thank you for recognizing the pain. Yeah. Well, we definitely need to get somewhere with it. I actually want to switch gears a little bit because we don't have a lot of time left and there's another bit that we want to get into. So just taking a step back and talking about SG15,
Starting point is 00:54:06 the tooling study group, because you're quite a central member of that group. Can you talk about the group and what they do? So the main thing, it's a bit of a weird thing, and it has been weird for a while, but I think we're now finding a different way of framing it. Because I remember at CppCon 2021 there was a panel with the chairs of the standard committee, and there were quite a few questions. I asked a similar question: whether WG21 thinks that the ecosystem is something that they should be interested in, not as individuals, but for WG21 to be interested in the ecosystem as a whole and not just the semantics of the language itself. And at the time, there was a surprising yet very clarifying answer, that there was a sentiment that this is out of scope for WG21,
Starting point is 00:55:12 that WG21 was meant to work on the language itself and that the ecosystem was not part of the scope. Since then, there have been a number of conversations with various people and in kona last year uh october i think uh kona a substantive shift on how the WG21 chairs think about this problem. And there is a consensus building that driving the ecosystem as a whole, not just the semantics of the language itself, is and should be part of the WJ21 scope. And it was one of the most expressive, positive votes in the room in Kona, where there was this realization that, yes, we need to work to consolidate how the ecosystem is driving,
Starting point is 00:56:23 because the amount of divergence we have, and most of the time unnecessary divergence, is hurting the ecosystem a lot. So what is the scope of this Ecosystem IS? Is, for example, the JSON file that we mentioned earlier that describes the dependencies of the modules, is that something that would be standardized in there? And what else could be part of this new standard? So the basic framework where this was presented is a framework of interoperability. So we're not
Starting point is 00:56:58 trying to specify what is the standard build system for C++, what is the standard package manager. Right, that's not the goal. We are profoundly aware that this would not only fail catastrophically to actually be standardized, it would actually be profoundly damaging to the ecosystem to commit to a single build system. But what is important is that we're able to say, hey, I have a CMake project, you have a SCons project, maybe we should be able to share code, right? Like, I should be able to ship you a library, and you should be
Starting point is 00:57:40 able to import my library into your build system. That's not... I think we have a bit of a Stockholm syndrome thing with the C++ ecosystem, where we just accept that this is how things are. But if you explain this to anyone that hasn't been in this ecosystem for long, they will look at us like we're crazy. And it's like, how is this still a problem in your ecosystem?
Starting point is 00:58:10 Every other ecosystem has solved this. Yes, I've experienced this with people that were relatively new to C++. They were like, okay, so I need this library. How do I use it? Which command do I need to type to download and install this library and link against it and everything? And then you have to go and say, well, that's not really a thing. You have to actually get the source, and then it depends on the build system,
Starting point is 00:58:35 how you compile it, and then you get into this mess. And as you say, everybody looks at this and says, this is crazy. How can you get any work done if this is how your ecosystem works? If you come from Rust or Python or anything like that. But the reality is that a lot of the C++ shops, they have solved this problem internally. Like Bloomberg, we have a package management ecosystem with more than 10,000 C++ projects.
Starting point is 00:59:06 And if you're at Bloomberg as a C++ developer writing a library, you know exactly what you have to do. And if you're consuming someone else's library, you know exactly what you have to do. The same thing is true for Google. Google, they have their Blaze build system. And if you want to consume a library, you know exactly what you're going to do.
Starting point is 00:59:26 And if you want to ship a library, you also know exactly what you're going to do. And it's also not entirely broken in some ecosystems, like GNU/Linux distributions. If you commit to a specific Linux distribution, you actually have a fairly reasonable way to specify things. Like, you know how you depend on a library, you know how you ship a library. And if you commit to that ecosystem, it kind of works.
Starting point is 00:59:54 And we now also have things like vcpkg. And as we discussed a few episodes ago, we have Conan. Yeah, and those are starting to create specialized ecosystems where people have been able to reduce this kind of divergence. But we are now, again, I talked before about bifurcating the ecosystem, we're kind of at this point where, well, I'm in Conan,
Starting point is 01:00:25 but this library I need doesn't have a Conan file so maybe I can't use that library or I need to become a Conan expert to be able to figure out how to package a third-party library that I
Starting point is 01:00:42 barely know how it builds. So it seems like that's actually solved the problem, which bumps it one level up, right? Kind of, right? It just makes it more affordable to the people where it was completely unaffordable before, without actually solving the problem. So what we're doing is essentially finding a bunch of local maxima. And some local maxima have a lot of investment. Like, Bloomberg has a lot of people working on the packaging system, on the build orchestration, on making sure everything's coherent, making sure all of that. And then you have small organizations that can't
Starting point is 01:01:19 afford that, and so they will find whatever is the local maximum that works for them. But I think the thing that we're hoping for is that we can break through that local maximum via interoperability. Let's imagine a world where a C++ project describes, in an automatable way, the steps for its build, which dependencies it has. How do you look up dependencies? So we don't need everyone to converge on the same package manager. We don't need everyone to converge on the same build system. We just need to find which languages we're missing to allow this interoperability to happen today. And so that's what you're going to put into this, or what you're aiming to put into this Ecosystem International Standard that's being developed? That is the goal, yeah. It's just, build interoperability is one of the first things that is being discussed. And, to be fair, I'm still focusing mostly on modules. I'm not really
Starting point is 01:02:33 investing a lot of effort in that right now. But one of the first things that is being discussed is an introspection mechanism, where you can ask your toolchain or your tooling environment: what are the capabilities of your tooling environment in regards to the specified interoperability languages? Because then CMake can go and say, oh, this package manager supports this format. I can go and ask what libraries are installed in the system and what modules come with those libraries, in an interoperable way.
Starting point is 01:03:12 So we might finally get something richer than the compilation database. Yeah, the compilation database, it's a very good example. It's something that has been profoundly useful. But everyone that seriously interacts with it knows just how painful it is to try and extract semantics out of the compile commands. Like in the case of Bloomberg, we have legacy environments, and we run Clang tooling on the legacy environment, so we have a bunch
Starting point is 01:03:43 of code that just says, oh, I recognize the semantics of this particular compiler, let me rewrite this compilation command into a Clang equivalent of whatever this compiler was doing. So we need to raise the semantics from a dash-capital-Y flag into more structured data: what do we mean for the compiler to be doing? Well, we are running long again. We didn't really get to ask you any more personal questions,
Starting point is 01:04:15 so just very, very quickly, if you could say one other thing that we haven't talked about so far in the world of C++ that you find interesting or exciting, what would it be? Oh, I don't know. I've been so sucked into the tooling world that I don't even know. Well, to be fair, you're actually working
Starting point is 01:04:35 on some pretty interesting, exciting stuff there. So we'll give you a pass on that one. But anything else that you do want to tell us, though? Anything you want to let our listeners know before we wrap up? Just if you're a person that is tooling-inclined,
Starting point is 01:04:54 we do have a lot of work to get through the tooling ecosystem international specification. So this is all work. It will take effort. And we need people to actually come in and chip in on figuring stuff out.
Starting point is 01:05:15 There's a lot of stuff to be figured out. And if people don't come in to join the effort, it's just going to take longer. And hopefully we're still going to be motivated five years from now. So if you're tooling-inclined, please do get involved in the committee work to help us get there. Well, there's a call to action. Thanks for that.
Starting point is 01:05:36 And thank you so much for being a guest on the show today and telling us all about the current state of modules and build systems and the Ecosystem International Standard. Thank you for having me. It's a pleasure. So how can people reach you if they want to talk to you about this stuff? So my email is druoso at bloomberg.net, or my personal email, daniel at ruoso.com.
Starting point is 01:06:03 I am on Twitter, but I don't really use Twitter. I'm there just because Twitter was annoying me enough with not having an account that I eventually created one. So it's Daniel Ruoso there, if you want to DM me. But yeah, I also try to follow the SG15 mailing list, so if you want to discuss something tooling-based, maybe just go straight there. All right. Well, thank you so much, Daniel, for joining us today and for this fascinating discussion. I found that very, very informative. Thank you for having me. Thanks so much for listening in as we chat about C++. We'd love to hear what you think of the podcast. Please let us know if we're discussing the stuff you're interested in, or if you have
Starting point is 01:06:44 a suggestion for a guest or topic, we'd love to hear about that too. You can email all your thoughts to feedback at cppcast.com. We'd also appreciate it if you can follow CppCast on Twitter or Mastodon. You can also follow me and Phil individually on Twitter or Mastodon. All those links, as well as the show notes, can be found on the podcast website at cppcast.com. The theme music for this episode was provided by podcastthemes.com.
