CppCast - Automatic Static Analysis
Episode Date: September 1, 2023
Abbas Sabra joins Phil and Timur. Abbas talks to us about static analysis, the challenges - and benefits - of analysing C++ code, and a new feature from Sonar that can scan public repos with zero config.
News: Boost 1.83.0 released; fmt 10.1 released; The downsides of C++ Coroutines.
Links: "All the defaults are backwards" - Phil's Lightning Talk; "No, C++ static analysis does not have to be painful" - Sonar blog; video showing Sonar's Automatic Analysis in action; Sonar Community Discourse forums
Transcript
Episode 368 of CppCast with guest Abbas Sabra, recorded 23rd of August 2023.
This episode is sponsored by Sonar, the home of clean code. In this episode, we talk about the latest releases from the Boost and FMT libraries.
And about the downsides of C++ coroutines.
Then we are joined by Abbas Sabra.
Abbas talks to us about static code analysis and a new approach to automatically analyzing C++ projects.
Welcome to episode 368 of CppCast, the first podcast for C++ developers by C++ developers.
I'm your host, Timur Doumler, joined by my co-host,
Phil Nash. Phil, how are you doing today? I'm all right, Timo. How are you doing?
I'm great. Yeah, I'm very happy you're back from vacation. I, as you know, had a few episodes with
some guest co-hosts, but it's great to have you back here. How was your vacation, by the way?
It was a great vacation, actually. But as is often the case, I now need a break to recover from it. We did a lot of driving; it was a road trip from the UK, where I live, down to Italy. We actually went through Switzerland in both directions, and on the way there we went right past Sonar's head offices in Geneva, so appropriate to this episode. And we somehow managed to get incredibly lucky with the timing, because for a three-week holiday that we actually planned last year, we managed to land exactly between two record-breaking heat waves in southern Europe. So for us it was just a manageable 30 degrees most days, which was still a bit uncomfortable when walking around Rome, for example, but we definitely had a lot of gelato and managed to survive that. But if we'd been a week either way, we would have been into the 40s, and that would not have been as comfortable for a holiday, I think.
Oh, wow. No, it hasn't been 30 degrees here in Finland at any point, so you managed to avoid that heat wave quite a bit further north. Kind of as the rest of Europe was sizzling, it was a very comfortable 20 to 23 degrees here most of the time.
Yeah, I've been in the UK as well. I
did need to give a full report on my vacation because, as we've heard before, I have a
colleague at Sonar who says that he only ever hears about these things from CppCast.
well this time he's going to get to hear it earlier than usual
because he is here with us today.
So I can exclusively reveal that my mysterious colleague
is our guest today, Abbas Sabra.
So that one's for you, Abbas.
Thank you. Thank you for having me.
Excited to be here.
Finally, I don't have to re-watch the next episode of CppCast
because I know about your vacation now.
Yeah.
All right.
So at the top of every episode, I'd like to read a piece of feedback.
And this week, the feedback tab has been turned back on.
Thank you very much for everybody who gave us feedback.
I've selected this one from GrainedRanger7305 on Reddit
about the last episode with Jason Turner and Mark Gillard. And GrainedRanger says: I enjoyed the show very much. Jason is always a great guest. It's sad to hear that the reflection support is not being worked on at the moment. I had a project where I had to work around the language because I wanted to iterate over all members of a certain type. Thank you for your work.
P.S. About the episode before,
the voice of Matt Godbolt is kind of soothing.
He should record audiobooks.
Well, Matt, if you're listening to this,
yeah, maybe that's something you should consider.
And thank you, GrainedRanger, for your feedback.
Yeah, I think Matt needs some sort of side hustle.
Yeah, I would totally listen to an audiobook read by Matt Godbolt.
I think that would work really well.
There's also some other feedback from another listener of the last two episodes, and that was me.
I was playing the listener role this time, and I had that experience where I was listening
to, particularly the episode, two episodes ago, where I was shouting at the podcast because
I had all the answers and wasn't there to say them. So I'm not going to give all of the points that I would have made during the episode, but there's a couple of things I did want to address. First of all, there was a whole section on adding slides to a podcast as the chapter art. So most podcast clients can show images at certain points that you can specify in the podcast.
And that was discussed in the episode.
As it happens, I've done that before. No Diagnostic Required, the podcast I did with Anastasia Kazakova, we actually did that every episode.
So we prepared slides in advance,
almost like a conference talk.
We talked around the slides
and I put those in as a chapter art.
As you guessed, that was a lot of extra work. You're spot on there. And of course, the other problem is not everyone can see that, so you can't come to rely on it, but it can be handy sometimes. We also did it a couple of times in cpp.chat. There was one time we had Sean Baxter on, and he just suddenly started doing a screen share and was walking through some code. And we said, we can't show this to people on the audio podcast. And I said, well, I'll add it as chapter art. And we did that, and as a one-off that was fine, but it wouldn't be sustainable in the long term. But maybe we'll do it from time to time.
We also talked about mold, the linker. We did actually attempt to reach the author a while back to come on the show, because I think that'd make a great episode, particularly in a series about tooling. We've not actually yet been able to reach them. So if the author is listening and would like to come on, do please reach out to us. We'd love to have you on the show.
And my final piece of feedback was when you talked about the article from Jonathan Mueller
suggesting that we use lambdas instead of functions.
Perhaps a little bit tongue-in-cheek.
As it happens, I did a lightning talk on almost exactly the same thing a few years ago.
All the defaults are backwards.
I'll put a link in the show notes.
I had a slightly different focus.
I was suggesting that rather than just using lambdas as they are,
we actually changed the language feature
so you can have named lambdas at sort of global and class scope,
which would have all of the same benefits
that Jonathan talked about,
but be a more official part of the language.
So kind of functions 2.0, basically.
Yeah, yeah, exactly that.
Yeah, it gives us a chance to reset on a few other things,
which I talked about in that talk.
So do go and watch that.
But I thought it was like an interesting overlap
between that and what Jonathan was saying.
But that's where I'm going to stop for now.
We need to get on with the show.
All right.
So we'd like to hear your thoughts about the show.
And you can always reach out to us on X,
formerly known as Twitter,
Mastodon, LinkedIn, or email us at feedback at cppcast.com. Joining us today is Abbas Sabra.
Abbas is a principal engineer at Sonar, where he has discovered the ideal platform to pursue his passion for C++ development, development processes, and tooling. His career began in the financial
industry, where he identified inefficiencies
within the C++ tooling ecosystem that led to significant debugging and time losses.
He firmly believes that static analyzers can significantly improve the productivity of C++
developers. Fueled by a keen interest in compilers, static analysis techniques,
and language design, he's continually driven to innovate and push the boundaries in his field. Abbas, welcome to the show. Thanks for having me. So you mentioned tooling inefficiencies
in finance there. My own experience working in the finance industry is that every organization
and often every team in an organization has developed their own ad hoc set of tooling
practices or even their own tools, including static analysis tools.
Did you find the same thing?
Isn't that the case for C++ in general?
Especially so in finance, I found.
Yeah, so we had special tooling, we had a special build system, and we had a special language also. That is... yeah, I will not go into that.
All right.
So, Abbas, we will get more into your work in just a few minutes,
but we have a couple of news items to talk about before we get to that.
So feel free to comment on any of these, okay?
Okay.
So the first news item for today is that Boost has released version 1.83,
and it has a lot of new stuff, much more than what I could talk about in the show. But just to name a few things: there's a new library called Boost.Compat. It's a repository of C++11 implementations of standard components added in later C++ standards, from Peter Dimov and contributors. And apart from that new library, there's also loads of updates to existing libraries. So just a few highlights: Boost.Any now has a unique_any, which is an alternative to boost::any (or actually std::any) that does not require copy or move construction from the held type. There's loads of new additions to Boost.Filesystem. Boost.Iterator now has an is_iterator type trait that allows you to test whether a type qualifies as an iterator type. That, I think, sounds like something that's very useful in kind of generic programming. Boost.Math now works with C++23's new float types, like std::float16_t, std::float32_t, etc. There's a major update to Boost.MySQL with loads of new stuff. And there's a major update to Boost.Unordered, which is something that I think we've discussed a few episodes ago. They added a boost::concurrent_flat_map, which is a fast, thread-safe hash map based on open addressing. So that's just a few things. There's a lot more in there, but I think that's kind of a really, really interesting major update to Boost.
Yeah, it does sound like a pretty major update.
And the thing that jumped out at me was the Boost.Compat library,
because it's sort of almost the inverse of our previous mention of Boost updates,
where they removed, I think it was compatibility with pre-C++11.
Yeah.
So removed a load of backwards compatibility,
and now we've added some back in a different way.
Right, but I think compatibility with C++98 and C++03 is gone for good.
Yeah, yeah.
Good riddance.
Right.
So there was another major library update this time around,
FMT, Victor Zverovich's formatting library, which is now on version 10.1. And version 10.1 gives us lots of new, interesting updates compared to version 10. One thing that I found particularly interesting is that there is optimized format string compilation now, resulting in up to 40% speedup in compiled fmt::format_to and up to four times speedup in compiled fmt::format_to_n, at least on a kind of concatenation benchmark that they published, compared to FMT 10.0. So I thought that was quite impressive. They also added formatters for proxy references to elements of things like std::vector<bool> and std::bitset<N> and things like that. And there are many, many other fixes and improvements here as well. So another pretty significant update to a very popular library.
Yeah, and std::vector<bool> just keeps giving, doesn't it?
Well, it's not going anywhere.
We'll have to support it until the heat death
of the universe, I guess. Or maybe
nobody proposed it.
Maybe somebody should just propose to get rid of it.
I don't know.
Anyway, there's one more news item that I would like to discuss this time around,
which is a blog post called The Downsides of C++ Coroutines.
And that blog post caught my attention.
It's by James Mitchell, aka Reductor,
which is his name on Twitter and GitHub and a few other places
which I don't think I've encountered that name before but I kind of thought that was a really
interesting blog post. So James works at Sledgehammer Games, which is the studio that makes Call of Duty, so it's interesting to see gaming people write a blog post like this. I think that's great. One of the reasons why I thought his blog post was really interesting is that, before he goes into what he actually wants to talk about, he actually shows what the compiler transforms your coroutine into. If you write a coroutine, a function that says co_return or co_yield somewhere, this is what you actually get. And I think that's really, really good for actually understanding what's going on with coroutines. I saw a blog post before that did that: it was Lewis Baker's blog post, Understanding the Compiler Transform. But Lewis was very, very complete (obviously, because Lewis is one of the people who actually worked on coroutines), but it's quite complex, and I wasn't really able to understand
everything in that blog post
without extensive knowledge of all the ins and outs of the guts of coroutines.
So James's blog post is actually
not quite as kind of accurate
and precise and complete,
but I think way easier to understand.
So I think that is kind of really cool
that you can actually see,
now here's the
coroutine, and here's what your compiler actually does with it. And once he has that, he then talks about various downsides of stackless coroutines and the way they're implemented in C++. He calls out a bunch of security issues that don't happen in the same way with plain functions, like object lifetime, iterator and pointer invalidation, etc. Also, he talks about how you get
either a dynamic memory allocation
with the coroutine,
because the coroutine frame in general
will be allocated on the heap.
Or in those cases where the compiler
can optimize out that allocation,
you get lots of stack bloat.
And the other thing that he calls out,
which is something that gaming people
actually often complain about,
and I think it's quite a valid concern, is that because of the way coroutines work under the hood,
there's a lot of kind of injected internal function calls.
And that leads to very slow debug builds, right?
So if you're compiling for release, a lot of that stuff gets optimized out.
But if you're compiling for debug, you actually get those function calls under the hood that you didn't write.
And that slows down your debug build massively
when you're like testing it or whatever.
So I thought that was really interesting.
That's something that I haven't really thought about,
but it's, you know,
legit kind of downside of coroutines.
And I wonder if,
if we had designed them differently,
maybe as stackful coroutines,
we could have avoided some of that stuff.
And he actually talks about that
as well towards the end of the blog post yeah i thought it was a great post just to be clear the
the pseudo code that he shows up front it's not actually what the compiler generates it's
it's close to it's enough to give you an idea of the sort of thing that the compiler
might generate which of course is going to be implementation dependent which is why he can
get away with being not quite as accurate as Lewis Baker's version.
I did something similar in my own coroutines talks, you know, how you might implement coroutines without the coroutines feature.
So it's quite a useful education technique.
I also want to encourage readers
not to get put off by the pessimistic title,
at least, of the blog post. It doesn't mean that coroutines
are not worth using or they've got lots of problems, just that there are things you have
to be aware of. And I thought it was a little bit ironic that one of the problems is that
it simplifies code so much that you might miss problems, when most people's experience of using
coroutines is that they're too complex. But most of the complexity is in the infrastructure
around the coroutines.
And when you're writing the coroutine code itself
with the co_returns and the co_awaits,
that part's actually much simpler
than it would have been otherwise.
That's the whole point.
Sort of pushes that complexity into the infrastructure.
Which means, yes, you can be fooled into thinking,
well, this is just like normal synchronous code,
when actually there are things you have to be aware of.
And one of my favorite things that James does bring out
is the fact that you can actually get dangling references
going into functions now, as well as coming out of them.
And that's another thing that I bring out in my own coroutines talks.
And I will always point out that, again, because of our guest today, it's a great time to talk about this: SonarLint actually catches that for you. So it will tell you when you've got a dangling reference going into a coroutine if you pass something in by reference. So, great post. I recommend you read it if you're interested in coroutines.
All right. So over the last few months, we had a little bit of a miniseries about tooling, which was interleaved with episodes about other stuff.
So we had an episode about Conan, which is a package manager.
We had an episode about build systems in the context of modules.
And more recently, we had an episode about CLion, which is an IDE.
And another very important type of tooling is static analysis tools. So we
thought we should do an episode about that. And as it so happens, Phil actually works at a company
that provides tools for static analysis. And so, Phil, you've invited one of your colleagues along,
isn't that right? It is, yes. So that's why we have Abbas here today. Not just so that he knows
about my vacations. All right. So welcome again, Abbas.
Pleasure to have you here today on our episode. Thank you. Happy to be here.
So you're in static analysis now. You mentioned in your bio that you did other stuff before,
and we heard about that a little bit, like you used to be in finance. But what was your journey
that got you into static analysis?
How did you get into that field?
Yeah, it's an interesting journey that I usually like to talk about.
So I started in the finance industry.
And as many developers do, we used to spend 80% of our time debugging. And one day, I had a ticket related to... I used to work on interest rate derivative functionality, with a million lines of code, and we had a ticket where the end calculation was wrong.
So I started debugging.
It took me two days, almost two days,
to figure out what's happening.
And it turned out after two days of debugging
that we had an expression with a side effect, like a ++iterator, and someone decided to add a decltype around it. And as you know, in C++, when you do something like that, the expression inside the decltype is not evaluated. So the side effect suddenly disappears,
which led to miscalculation in the finance model that we were working on.
The interesting part is that it took me two days
to discover it.
Then after discovering such a bug,
I wrote a quick regex script that could detect similar issues. And it took my machine five seconds to go over the whole code base and find similar bugs.
And we ended up having two bugs like that in our code base.
And hopefully I saved another colleague two days of debugging.
But that's where my passion came from.
There are many problems in the development world where a machine does it better, and we should work as much as possible to leave those problems to a machine rather than to a human.
So how long did it take you to write that script?
It was a simple regex, so it took less than an hour to find that bug.
So less than an hour to save a couple of days of work.
Yeah, and once you do that and you realize the power of something as simple as that,
you start to think about patterns that can happen in C++ that lead to such bugs.
And I can tell you that there are many.
Oh, yeah.
But your bio also mentioned, and we talked about it a little bit,
about the challenges of C++ tooling, which I'm sure most of us are familiar with in some sense. But what was your experience then that contributed to this journey?
Yes.
So C++ tooling is in general challenging. But if you think about it and compare it to other languages, every tool created for the IDE, if we focus on IDE tooling for example, is either created by someone as a side project or by a big company.
And if you look at every successful C++ tool,
there is a parser behind it.
And then you have an entry barrier
that if you want to write
a good C++ tool,
you need to know
how to parse C++,
which is, I think everybody knows
that it's a very complex problem.
And the main challenge
is that IDEs are interactive. And if you know something about C++ parsing, you know that it's slow. So you start your tooling project, like I did, and you directly have a bottleneck on the parser, because your tool will only be as fast as Clang, if you are using Clang. And if it takes, I don't know, 10 seconds to compile your file, your tool, best case scenario, will take 10 seconds to get back to the user.
So there's basically a high entry barrier for tooling.
And if you look at the successful tools, for example,
I dug deeper recently into clang-format, one of the tools that is becoming standard in our industry. What they did is they implemented their own parser that focuses only on formatting. And I don't think... you cannot reasonably expect every tool to write its own parser to optimize for its use case.
So two episodes ago, we actually talked to Dmitry Kozhevnikov about CLion, the CLion IDE, and he was talking about the fact that they also have their own parser, because certain things like clangd don't have the right trade-offs for them, so they have their own parser in the IDE as well.
Yes, so
even when you write your own parser,
like in CLion, which is a great tool,
if you use other JetBrains IDEs,
like IntelliJ for Java,
and you move to CLion,
you feel that it is heavier,
which is not the fault of JetBrains.
It's basically the current state of C++.
Right.
So, I mean, in order to parse C++ code,
I mean, it's not just hard per se, right?
Because you have all of these ambiguities
you have to resolve,
you have to kind of backtrack.
But also you might have to execute arbitrary code, like constexpr functions, to figure things out. For example, you have an identifier and you need to figure out if it's a value or a type, and that is a dependent name, so you might have to execute code somewhere else, in the context of a function, to figure that out. So you actually have to have a full-blown interpreter of C++, for constexpr and such, if you want to actually parse it fully, right?
So yeah, the usual two bottlenecks are constexpr evaluation and text-based includes, because they are costly to resolve.
And hopefully now with C++20 modules, once every compiler implements them, we might get faster analysis, faster tooling, for the new modern code bases.
Yeah, modules will fix everything.
So do you then also have your own parser at Sonar, your own C++ parser?
So at Sonar, we have a fork of LLVM where we have a lot of patches to the parser.
Then once we have something that is upstreamable, we upstream as much as we can to contribute back to LLVM. But we have multiple hacks to optimize how fast the parser is in the IDE.
One of the common ones is, for example, to compute a preamble, where you precompile. While you're working in the IDE, you usually modify the code multiple times, and you can precompile the set of headers at the beginning of the file, so you don't have to compile them again before the analysis starts. This way you save time in parsing.
Right, so before we continue talking about
how this stuff works under the hood, maybe you could tell us a little bit more about static analysis
in general. Like, what is it? How can I use it? What is it good for?
Sure.
So static analysis is basically the art of knowing information about the code without
executing it, which is the opposite of dynamic analysis.
If you think about dynamic analysis, it's mostly Valgrind, Address Sanitizer.
I'm just mentioning some names that are familiar to our listeners. And for static analysis, the requirement is that you should not execute the code while doing it.
And once you learn about static analysis, you discover that it's used in many places,
like compilers do static analysis to optimize your code and get rid of dead code, or reorder instructions to make your binary more optimized. So they analyze the code, they learn information about it, then they apply it for code optimization.
Another use case of static analysis is tooling to detect bugs
and basically undefined behavior in C++.
So there are some tools that try to detect null pointer dereferences,
buffer overflow, and all the common undefined behavior.
And here we reach the usual discussion about static analysis
not being able to get everything right.
Then there is other application like guidelines,
like CPP core guidelines or MISRA guidelines,
where the goal of static analysis is not to detect issues in your code,
rather to detect patterns that might lead to issues.
And here, static analysis is usually very successful at doing that.
Other use cases: there are tools to build a dependency graph of your code base.
That's some sort of static analysis.
But the most interesting use case for static analysis for me
is, in my opinion, education.
Let's say you want to learn about C++ algorithm.
The usual thing to do is either to go to CPP reference
or go to a conference or read the C++ standard.
But you can actually use static analysis to, let's say, find every raw loop in your code base and tell you: hey, you can replace this raw loop with std::any_of or with std::rotate. You can map every std algorithm to a certain pattern and try to educate the newcomer to C++ about the different parts of the STL algorithms using static analysis. So if done right, static analysis can also be a very good education tool.
One of my favorite examples of that is using the heterogeneous comparison operators, you know, where the comparator argument to a container is std::less of the key type by default, which compares the same type. Whereas if you have a std::string and you want to compare it against a char*, that means the char* has to be converted to a std::string every time. So at some point, I think it was C++14, maybe C++11, we introduced std::less with empty angle brackets, which has a templated comparison. So that will use std::string's own comparison operator to compare against a char*, so you don't do that conversion every time. So we will tell you, if you haven't used that, that there's an opportunity to speed up your comparisons. Except there are some cases where it becomes a pessimization, and it will detect that as well. So you can progressively learn more and more as you go. You think, oh, I know this now, and then you learn a refinement on it as well.
Yeah, completely agree.
So static analysis can also help you with performance?
Yes, performance is also a place where static analysis can contribute.
Like if you have a push_back, and you can detect the type of the object statically, because it compiles, you can automatically detect that emplace_back, in this case, is better.
Right. I do have a question.
So, I mean, I've used static analysis before.
For example, CLion also has a built-in static analysis tool, right?
And what I found often is that for cases where the tool can reason
about what's happening at compile time,
you know, it points out to me: hey, there is a bug. For example, let's say a null pointer dereference: if I outright dereference a null pointer, obviously the tool is going to tell me, hey, you're doing something here which might be UB. But if I'm doing something where it's a runtime property whether it's a bug, like the pointer comes from somewhere else, the static analysis tool can't reason about whether or not it's a null pointer in this case. Or, for example, it depends on a branch, and the condition of that branch is a runtime value.
Then typically, from what I've seen, the static analysis tool just doesn't tell you anything at that point. And then you only find
the bug when you run a runtime analysis tool such as address sanitizer or UB sanitizer or something
like that. Is that kind of how they all work? Or is there a mode where you can say, you know,
point out to me potential problems, you know, like, hey, this might be a null pointer dereference here, you might want to double-check that, or you might want to insert a null pointer check.
I don't know what happens at runtime,
but there's a potential problem.
Is that something that these tools can do?
Because I'm not seeing that by default, typically.
Or maybe I'm using the wrong tools.
I don't know.
I don't think you are.
So static analysis has different techniques to do this. And what you are talking about is basically detecting a null pointer dereference in the wild, without any constraint on the code. And usually this is detected by a technique called symbolic execution, where you try to execute the code without executing it, by simulating it in the static analyzer. And these kinds of techniques are usually limited, because they hit the path explosion problem, where you reach the theoretical limit of what you can do in a reasonable time.
Oh, the path explosion problem. Interesting.
Yeah.
Cool name, I like that.
There's our episode title.
So if you want to simulate code and you have a loop inside it, for example, and you don't know how many times this loop is going to run, you have to execute the loop multiple times until you finish it. So the static analyzer, in the end, if it wants to detect all these kinds of issues, will be as slow as the runtime, and not as good either. So these kinds of techniques, I believe they are useful, but they cannot be the only thing that you use, because if you have ten null pointer dereferences in your code base, it's probably going to detect two or three of them, which I believe is already good.
I think you mentioned earlier that a better approach is to detect patterns that themselves
are not wrong, but can lead to these sort of issues.
So if we can avoid a situation where a naked null dereference can happen in the first place without some other checking around it, then that's a better pattern.
So I think that often can avoid us going down that path in the first place.
Yes. A common example is that you cannot detect all the buffer overflows, and that's the most common CVE in the database. But if you have a static analysis check that tells you that you should use gsl::span, for example, everywhere, this is the kind of check that can be implemented by a static analyzer and can catch everything.
And I'm not arguing that everybody should use that, but many people should be aware
of it and consider it as a possibility.
So you already mentioned CVEs.
This is something we talked about quite a few times on the show already, this whole
safety discussion, where some people are saying, you know, you can actually find a safe subset of C++ that can be statically proven to be safe. And other people are saying, no, that's not possible.
Or that subset would be too limited to be useful.
Like, I'm actually curious: is it possible to use static analysis to, you know, build
some kind of subset of C++ where you can reason about it being safe and not having undefined behavior?
Or do you think that's kind of not a viable way forward?
In the theoretical sense, it is possible,
because there is a well-known tool that actually does that,
made by Microsoft. Trying to remember its name.
It's called Dafny, I think.
So they have a tool where you basically write the precondition of each function
and the postcondition of each function,
and the static analyzer can build mathematical verifications
that your code does what it's supposed to do.
So you can reach a subset of a language where you can verify statically
that it is correct and does what it's supposed to do.
Now, if you want to do the same for C++,
yes, you can do that,
but the question is not whether you can do it.
The question is at what cost.
You can definitely tell people
that they should use bounds checking everywhere,
and verify statically that they do it.
You can tell them to create a function, for example, that checks for null
before every dereference, and statically detect
that only this function is used.
But I'm not sure this is the language
that everybody wants to write.
That's interesting because you're kind of contradicting,
I think, what Bjarne said a few episodes ago
where we discussed this.
So it's actually interesting to hear, you know,
the perspective of somebody who actually works on these tools.
And it seems like that perspective is quite
different from, for example, what we discussed with
Bjarne, who comes at it from a
language design theoretical point of view.
Before we
dig into that, because that was really going to be my
next question anyway, I wanted to
dig into that a bit more. Let's take the
opportunity to have our sponsor break,
which is also
appropriate because we are sponsored again by Sonar, the home of clean code. So SonarLint is a
free plugin for your IDE to help you find and fix bugs and security issues from the moment you start
writing code. We've been learning all about that. But you can also add SonarQube or SonarCloud,
which we may hear about later, to your CI/CD pipeline and enable your whole
team to deliver clean code consistently and efficiently on every check-in or pull request.
SonarCloud is completely free for open source projects and integrates with all the main cloud
DevOps platforms. So, as Timur said, we had Bjarne on recently to talk about safety and security, and he
said that static analysis should play a central role in that.
But he suggested that existing tools don't really address that yet.
And to be fair, he did actually say at that point,
correct me if I'm wrong.
And one of my regrets is I didn't actually get back to,
well, correct him.
And I know you've got views on this.
We've already started discussing them, Abbas.
But yeah, what's your take then on this?
How much do we really cover with static analysis?
What else can we do?
Yeah, it's a two-part question from each of you.
I think I can start with one from Timur.
I think when Bjarne talked about this,
we are not talking about 100% proving
that this program is correct without any CVE.
At the end, we are engineers.
So if we get to 99%, that's already good.
So for me, the problem with all of this
is that we are relying on CVEs
and these CVEs are not well organized
because they don't separate
between C and C++ and they don't separate between old C++ and modern C++.
So when we talk about memory safety, we don't actually know whether code bases that are using
smart pointers still have these CVEs.
We don't have this information in our database. And to start thinking about a solution for this
before getting the data might lead us to a different path
than the one that is needed.
And I know that, for example, smart pointers can be misused,
and you can detect some of that misuse statically.
But are the people who are using smart pointers actually misusing them?
I don't know.
Because the other languages that C++ is compared with
can also be misused.
It's just that it is harder.
So yeah, this question of whether we want to get to 100%,
or whether it is enough to get to 95%, is always interesting.
And I'm always on the side that 95% is good enough, because
we are engineers.
Then the second question was from Phil about the current state of tooling.
I think, for example, we all advocate for using the C++ Core Guidelines, which our tools
support. And I can also talk about other tools: clang-tidy supports part of it,
and Microsoft's static analyzer supports most of it.
So there are tools that can enforce the C++ Core Guidelines for you.
But the question is, do actually people use them?
And if you look at the survey from JetBrains, not many people use static analysis:
30% don't at all, and some of them use only the one in their IDE.
So we have a problem of awareness of the value of static analysis.
We also have a problem of processes and, let's say, of a push towards static analysis.
For example, we support different coding standards, like the C++ Core Guidelines and MISRA.
If we look at the requests that we get as a company, most of the requests are for MISRA.
And I personally believe that the C++ Core Guidelines are a better guideline for writing code for most people.
But yet, many people ask for MISRA.
Why is that? Because there is an obligation on some companies to use MISRA.
So people are aware of MISRA and are trying to apply it in their code base,
while there is not enough awareness of the C++ Core Guidelines and their value.
That's my opinion.
Right.
So we talked about static analysis in general, but then there's also
a bunch of actual products that do static analysis, and you briefly mentioned that
not as many people use them as they probably should. But also, there are quite a few
of them around, right? So I already mentioned that CLion has its own static analysis built in.
Then there's PVS-Studio, which has a static analysis tool.
We had them as sponsors the last couple of episodes.
There's Perforce Klocwork, there's Coverity, there's a bunch of others.
So it seems like there's quite a few tools out there.
And you actually have three tools at Sonar, don't you?
So, as we heard in the sponsor read, there's SonarQube, there's SonarLint, there's
SonarCloud. So what do all of these products do? When do you use
which, and how do you integrate them into your
workflow if you want to try and start
using static analysis?
Just to be clear, I'm more of an advocate
for static analysis in general. I use clang-tidy even though
I work on Sonar. I use other static
analysis tools. So for me,
my opinion is that if combining them is better,
if you can find more issues, that's okay.
You don't have to use one tool.
And C++ is a big language.
So we have 600 rules already, 600 checks, and in our backlog there are thousands of them.
So we are not going to run out of checks, so there is room for all competitors. So yes, we have three products. All of them have
the same target, which is reaching a state of clean code in your code base. So all of them
try to push towards the same goal. There are multiple definitions of clean code,
but it's reaching the point
where your code base is an asset for you
rather than a debt.
So you can think about maintainability,
you can think about security,
you can think about performance,
you can think about safety:
all the non-functional aspects of good code.
So all of these tools are trying to get you there. SonarLint is the
first line, which is in your IDE. It's usually a personal choice when someone installs an extension
in their IDE, to comply with the C++ Core Guidelines, for example, and you don't have to invest a lot.
While SonarQube and SonarCloud are the same product: one is in the cloud, hence the name, and the other one, SonarQube, is on-premise.
But they both have the same goal, which is scaling up to a project level or organizational level quality.
When I say organizational level: let's say your organization wants to comply with the C++ Core Guidelines.
It's not enough that half of your team applies them in the IDE. You need some sort of process, a guideline for
what criteria you need to meet in order to merge to your main branch.
And all of that governance is usually done by SonarQube and SonarCloud. And there are processes to do that, because you cannot just say, hey, I'm going to refactor my entire code base to comply with the C++ Core Guidelines.
Because if you have a million lines of code, you cannot just go and remove every new and put in place of it smart pointers.
So SonarQube and SonarCloud are meant to put processes to reach this goal incrementally.
For example, if you open a pull request, you have checks that say you only have to comply with the C++ Core Guidelines in your new code, rather than in the old code.
And as you change your code incrementally, at some point you will reach a code base that
is compliant with the C++ Core Guidelines.
I think the statistic we usually use is that after about five years,
you would have touched half of your code base.
So just by only changing new code,
you would have cleaned half the code base within five years.
And some statistics say even faster than that.
So we talked earlier about the challenges of C++ tooling,
of which there are many.
But are the Sonar tools, then,
particularly SonarQube and SonarCloud that you mentioned there,
just as easy to pick up,
or are there particular challenges to using those two tools?
So SonarLint is free in your IDE.
You don't have to configure anything.
It works with CLion, VS Code, and Visual Studio.
But for SonarQube and SonarCloud, you have to set up the analysis in your pipeline, on your CI.
And hence, recently we worked on a new feature called automatic analysis,
which makes this process of analyzing your code on SonarCloud much easier.
So I do have another question.
What if I'm working on an open source library
and it's like an entirely not-for-profit kind of thing,
which is out there on GitHub,
and I have my CI set up with GitHub Actions
or something like that?
Is there any non-commercial license of these tools
that I can use to have static analysis on CI
rather than just in my IDE?
Yes, so for SonarCloud, you have two options.
The previous way: there was a GitHub Action
for analysis on SonarCloud,
and it's free for open-source projects,
so you can just integrate it in your pipeline
and you will get our static analyzer.
Or you can use a new way of analyzing C++ code
on SonarCloud, which is called automatic analysis,
where you just click one button
and your project is analyzed on SonarSource infrastructure,
so you don't need a CI.
And it's going to be configured automatically.
So we are going to resolve your dependencies
and your preprocessor and every single part
of the C++ ecosystem automatically.
Wait, so how do you do that?
So I have been working for the last year on this feature.
It takes time.
So the motivation is that
not many people use static analysis.
And previously, to set up a GitHub Action to do that,
you needed to make sure that your build works.
You needed to make sure that the dependencies exist in this action.
You needed to know how to invoke the build of your code,
because if you build your code with different preprocessor settings, it means
different things. So there are many things that the static analyzer needs to know about your
project to be able to analyze it. And now, with this new feature, we don't need that anymore.
And what we did: we did three main things. First, for dependencies, we built our own dependency
manager. Basically, instead of you telling us which dependencies you are using, we scan your code, look at your includes, try to match them with open-source dependencies, and we automatically check them out for you.
That's the first part.
This works for most dependencies, but it doesn't work for private dependencies.
And we are going to go over that later. The second step is resolving the preprocessor.
And the good thing about resolving the preprocessor is that you only need to find one valid combination of preprocessor settings
that leads to your code compiling. So what we basically do is build an equation from all the preprocessor
directives, and we use a tool called an SMT solver to find the best combination of all the preprocessor settings
that leads to code that actually compiles. An SMT solver is well known in the static analysis
world. It's basically a tool where you give it an equation
and it does some mathematical modeling to solve that equation.
And the equation here is:
what is the combination of all the preprocessor settings
that leads to compilable C++
and to the maximum number of tokens?
Because if you have a preprocessor directive
that removes your entire code base,
we don't want to enable it.
We want to actually analyze your code.
And by using this technology,
we get
a solution for the preprocessor.
We can understand your code.
And the third thing is, we go back to the parser.
We had to modify the parser to understand invalid code, because
if you have an internal dependency and you have a call into this internal dependency,
Clang will fail, because it cannot resolve what this dependency is.
But on our side, we worked on the parser to be able to say: keep this aside,
we don't understand it right now.
And we make the analyzer try to guess what it is and behave as if this function exists.
So there are three elements:
the preprocessor,
the dependency manager,
and the parser.
Combining them together, we analyzed 200 open-source projects,
and with this method we were able to reach something like 95% accuracy compared
to a manual analysis configured by the user themselves.
This is a lot of work that is done behind the scenes
to reach a state where the user only has to click one button
to get their C++ code analyzed.
It only works on GitHub.
For open-source projects, it's free.
For private projects, it is paid.
So that's super impressive.
So basically, you have your own package manager, so to say,
and your own build-system-ish thing
that just magically configures itself
by looking at what's there in the repository.
And it sounds like magic to me.
It sounds also like this kind of system would be very useful
in many contexts, even beyond just doing static analysis,
like this kind of magic auto-configuring thing
that figures out how to compile your code
or how to reason about your code.
That sounds really, really cool and useful.
Yes, so it works in this tooling area,
where you mostly need to know how your code compiles.
If you need 100%, like if you need to generate the binary of your code,
then this is not reliable, because it might make small mistakes
that change the meaning of your code.
But for things like tooling, refactoring, and static analysis,
this seems to be a good approach.
And I think magic is the right word there, when you try this for the first time,
especially if you've tried setting up something like SonarQube manually,
or even SonarCloud.
It is just magic.
You just literally click that button and it just works.
And 95% accuracy is pretty good
when it comes to applying static analysis.
It's good enough that most people would never need
to manually configure their projects anymore.
Well, now I know what I'm going to be playing with this weekend.
All right.
So we talked a lot about static analysis and tooling,
but is there anything else going on in the world of C++ right now
that you find particularly interesting or exciting?
Let's see what's happening in the C++ world these days.
So there is this movement of new languages that you went over.
I think I'm excited about it in the sense of hopefully it moves C++ language forward.
Let's see what comes out of it.
The new hot topic in the last year in C++ is safety.
So I hope that the language improves due to this push.
But yeah, that's all.
So can I ask you a question? It's kind of a controversial question
that I've asked a lot of people,
and I'm just curious about your opinion,
as somebody who works on static analysis and obviously worries about safety and security.
Do you think that C++ is doomed if you can't make it a guaranteed memory-safe language the
way that Rust is? No, I don't think it's doomed, because there are many use cases of C++ where
that's not the main concern. Maybe some industries
will move away from C++ in the
places where you need 100% safety.
But in the industry of gaming, for example,
or in the industry of music,
I'm not sure all of this discussion is
going to impact the users of C++.
And there's many people that are using COBOL these days,
so C++ is not going away.
Right. Well, thank you very much.
That was a very good insight.
I like your answer.
So anything else you want to tell us
before we let you go, Abbas?
What do I want to tell you?
Maybe how people can reach you if they want to get in touch.
So I'm usually on SonarSource community, community.sonarsource.com.
I'm interested in C++.
If you have any idea of how to check things statically, please post it there,
and I will be one of the people that are going to reply to you.
I have a LinkedIn profile.
I have a Twitter slash X profile that I don't use that much,
but feel free to reach out to discuss C++ and static analysis.
All right.
We're going to put links to all of these things in the show notes.
Perfect.
By the way, that community Discourse forum, I can say from seeing it on the inside, is almost a direct line to the developer team.
So if you've got any questions for any of us, that's where to post them.
All right.
So it looks like we're at the end of this episode, but thank you again, Abbas, for being
our guest today and for this fascinating discussion.
That was a pleasure to have you here.
Thank you for having me.
And I'll see you next week.
All right.
See you next time.
Thanks so much for listening in as we chat about C++.
We'd love to hear what you think of the podcast.
Please let us know if we're discussing the stuff you're interested in.
Or if you have a suggestion for a guest or topic, we'd love to hear about that too.
You can email all your thoughts to feedback at cppcast.com. We'd also appreciate it if you can follow CppCast on Twitter or Mastodon.
You can also follow me and Phil individually. All those links, as well as the show notes,
can be found on the podcast website at cppcast.com.
The theme music for this episode was provided by podcastthemes.com.