CppCast - Automatic Static Analysis
Episode Date: September 1, 2023
Abbas Sabra joins Phil and Timur. Abbas talks to us about static analysis, the challenges - and benefits - of analysing C++ code, and a new feature from Sonar that can scan public repos with zero config.
News: Boost 1.83.0 released; fmt 10.1 released; The downsides of C++ Coroutines.
Links: "All the defaults are backwards" - Phil's Lightning Talk; "No, C++ static analysis does not have to be painful" - Sonar blog; video showing Sonar's Automatic Analysis in action; Sonar Community Discourse forums
Transcript
Episode 368 of CppCast with guest Abbas Sabra, recorded 23rd of August 2023.
This episode is sponsored by Sonar, the home of clean code. In this episode, we talk about the latest releases from the Boost and FMT libraries.
And about the downsides of C++ coroutines.
Then we are joined by Abbas Sabra.
Abbas talks to us about static code analysis and a new approach to automatically analyzing C++ projects.
Welcome to episode 368 of CppCast, the first podcast for C++ developers by C++ developers.
I'm your host, Timur Doumler, joined by my co-host,
Phil Nash. Phil, how are you doing today? I'm all right, Timo. How are you doing?
I'm great. Yeah, I'm very happy you're back from vacation. I, as you know, had a few episodes with
some guest co-hosts, but it's great to have you back here. How was your vacation, by the way?
It was a great vacation, actually. But as is often the case, I now need a break to recover from it. We did a lot of driving; it was a road trip from the UK, where I live, down to Italy. We actually went through Switzerland in both directions, and on the way there we went right past Sonar's head offices in Geneva, so appropriate to this episode. And we somehow managed to get incredibly lucky with the timing, because for a three-week holiday that we actually planned last year, we managed to land exactly between two record-breaking heat waves in southern Europe. So for us it was just a manageable 30 degrees most days, which was still a bit uncomfortable when walking around Rome, for example, but we definitely had a lot of gelato and managed to survive that. But if we'd been a week either way, we would have been into the 40s, and that would not have been as comfortable for a holiday, I think.
Oh, wow. No, it hasn't been 30 degrees here in Finland at any point, so you managed to avoid that heat wave quite a bit further north. Kind of as the rest of Europe was sizzling, it was a very comfortable 20 to 23 degrees here most of the time.
Yeah, I've been in the UK as well. I
did need to give a full report on my vacation because, as we've heard before, I have a
colleague at Sonar who says that he only ever hears about these things from CppCast.
well this time he's going to get to hear it earlier than usual
because he is here with us today.
So I can exclusively reveal that my mysterious colleague
is our guest today, Abbas Sabra.
So that one's for you, Abbas.
Thank you. Thank you for having me.
Excited to be here.
Finally, I don't have to re-watch the next episode of CppCast
because I know about your vacation now.
Yeah.
All right.
So at the top of every episode, I'd like to read a piece of feedback.
And this week, the feedback tab has been turned back on.
Thank you very much for everybody who gave us feedback.
I've selected this one from GrainedRanger7305 on Reddit
about the last episode with Jason Turner and Mark Gillard. And GrainedRanger says: I enjoyed the show very much. Jason is always a great guest. It's sad to hear that the reflection support is not being worked on at the moment. I had a project where I had to work around the language because I wanted to iterate over all members of a certain type. Thank you for your work.
P.S. About the episode before,
the voice of Matt Godbolt is kind of soothing.
He should record audiobooks.
Well, Matt, if you're listening to this,
yeah, maybe that's something you should consider.
And thank you, GrainedRanger, for your feedback.
Yeah, I think Matt needs some sort of side hustle.
Yeah, I would totally listen to an audiobook read by Matt Godbolt.
I think that would work really well.
There's also some other feedback from another listener of the last two episodes, and that was me.
I was playing the listener role this time, and I had that experience where I was listening
to, particularly the episode, two episodes ago, where I was shouting at the podcast because
I had all the answers and wasn't there to say them. So I'm not going to give all of the points that I would have made during the episode, but there's a couple of things I did want to address. First of all, there was a whole section on adding slides to a podcast as the chapter art. So most podcast clients can show images at certain points that you can specify in the podcast.
And that was discussed in the episode.
As it happens, I've done that before. No Diagnostic Required, the podcast I did with Anastasia Kazakova, we actually did that every episode.
So we prepared slides in advance,
almost like a conference talk.
We talked around the slides
and I put those in as a chapter art.
As you guessed, that was a lot of extra work. You're spot on there. And of course, the other problem is not everyone can see that, so you can't come to rely on it, but it can be handy sometimes. We also did it a couple of times in cpp.chat. There was one time we had Sean Baxter on, and he just suddenly started doing a screen share and was walking through some code. And we said, we can't show this to people on the audio podcast. And I said, well, I'll add it as chapter art. And we did that, and as a one-off that was fine, but it wouldn't be sustainable in the long term. But maybe we'll do it from time to time.
We also talked about mold, the linker. We did actually attempt to reach the author a while back to come on the show, because I think that'd make a great episode, particularly in a series about tooling. We've not actually yet been able to reach them. So if the author is listening and would like to come on, do please reach out to us. We'd love to have you on the show.
And my final piece of feedback was when you talked about the article from Jonathan Mueller
suggesting that we use lambdas instead of functions.
Perhaps a little bit tongue-in-cheek.
As it happens, I did a lightning talk on almost exactly the same thing a few years ago.
All the defaults are backwards.
I'll put a link in the show notes.
I had a slightly different focus.
I was suggesting that rather than just using lambdas as they are,
we actually changed the language feature
so you can have named lambdas at sort of global and class scope,
which would have all of the same benefits
that Jonathan talked about,
but be a more official part of the language.
So kind of functions 2.0, basically.
Yeah, yeah, exactly that.
Yeah, it gives us a chance to reset on a few other things,
which I talked about in that talk.
So do go and watch that.
But I thought it was like an interesting overlap
between that and what Jonathan was saying.
But that's where I'm going to stop for now.
We need to get on with the show.
All right.
So we'd like to hear your thoughts about the show.
And you can always reach out to us on X,
formerly known as Twitter,
Mastodon, LinkedIn, or email us at feedback at cppcast.com. Joining us today is Abbas Sabra.
Abbas is a principal engineer at Sonar, where he has discovered the ideal platform to pursue his passion for C++ development, development processes, and tooling. His career began in the financial
industry, where he identified inefficiencies
within the C++ tooling ecosystem that led to significant debugging and time losses.
He firmly believes that static analyzers can significantly improve the productivity of C++
developers. Fueled by a keen interest in compilers, static analysis techniques,
and language design, he's continually driven to innovate and push the boundaries in his field. Abbas, welcome to the show. Thanks for having me. So you mentioned tooling inefficiencies
in finance there. My own experience working in the finance industry is that every organization
and often every team in an organization has developed their own ad hoc set of tooling
practices or even their own tools, including static analysis tools.
Did you find the same thing?
Isn't that the case for C++ in general?
Especially so in finance, I found.
Yeah, so we had special tooling, we had a special build system, and we had a special language also. That is... yeah, I will not go into that.
All right.
So, Abbas, we will get more into your work in just a few minutes,
but we have a couple of news items to talk about before we get to that.
So feel free to comment on any of these, okay?
Okay.
So the first news item for today is that Boost has released version 1.83,
and it has a lot of new stuff, much more than what I could talk about in the show. But just to name a few things: there's a new library called Boost.Compat. It's a repository of C++11 implementations of standard components added in later C++ standards, from Peter Dimov and contributors. And apart from that new library, there's also loads of updates to existing libraries. So just a few highlights: Boost.Any now has a unique_any, which is an alternative to boost::any (or actually std::any) that does not require copy or move construction from the held type. There's loads of new additions to Boost.Filesystem. Boost.Iterator now has an is_iterator type trait that allows you to test whether a type qualifies as an iterator type. That, I think, sounds like something that's very useful in kind of generic programming. Boost.Math now works with C++23's new float types, like std::float16_t, std::float32_t, etc. There's a major update to Boost.MySQL with loads of new stuff. And there's a major update to Boost.Unordered, which is something that I think we've discussed a few episodes ago. They added a boost::concurrent_flat_map, which is a fast, thread-safe hash map based on open addressing. So that's just a few things. There's a lot more in there, but I think that's kind of a really, really interesting major update to Boost.
Yeah, it does sound like a pretty major update.
And the thing that jumped out at me was the Boost.Compat library,
because it's sort of almost the inverse of our previous mention of Boost updates,
where they removed, I think it was compatibility with pre-C++11.
Yeah.
So removed a load of backwards compatibility,
and now we've added some back in a different way.
Right, but I think compatibility with C++98 and C++03 is gone for good.
Yeah, yeah.
Good riddance.
Right.
So there was another major library update this time around,
FMT, Victor Zverovich's formatting library, which is now on version 10.1. And version 10.1 gives us lots of new, interesting updates compared to version 10. One thing that I found particularly interesting is that there is optimized format string compilation now, resulting in up to 40% speedup in compiled fmt::format_to and up to four times speedup in compiled fmt::format_to_n, at least on a kind of concatenation benchmark that they published, compared to FMT 10.0. So I thought that was quite impressive. They also added formatters for proxy references to elements of things like std::vector<bool> and std::bitset<N> and things like that. And there are many, many other fixes and improvements here as well. So another pretty significant update to a very popular library.
Yeah, and std::vector<bool> just keeps giving, doesn't it?
Well, it's not going anywhere.
We'll have to support it until the heat death
of the universe, I guess. Or maybe
nobody proposed it.
Maybe somebody should just propose to get rid of it.
I don't know.
Anyway, there's one more news item that I would like to discuss this time around,
which is a blog post called The Downsides of C++ Coroutines.
And that blog post caught my attention.
It's by James Mitchell, aka Reductor,
which is his name on Twitter and GitHub and a few other places
which I don't think I've encountered that name before but I kind of thought that was a really
interesting blog post. So James works at Sledgehammer Games, which is the studio that makes Call of Duty, so it's interesting to see gaming people write a blog post like this. I think that's great. One of the reasons why I thought his blog post was really interesting is that, before he goes into what he actually wants to talk about, he actually shows what the compiler transforms your coroutine into. If you write a coroutine, a function that says co_return or co_yield somewhere, this is what you actually get. And I think that's really, really good for actually understanding what's going on with coroutines. I saw a blog post before that did that: it was Lewis Baker's blog post, Understanding the Compiler Transform. But Lewis was very, very complete (obviously, because Lewis is one of the people who actually worked on coroutines), but it's quite complex, and I wasn't really able to understand
everything in that blog post
without extensive knowledge of all the ins and outs of the guts of coroutines.
So James's blog post is actually
not quite as kind of accurate
and precise and complete,
but I think way easier to understand.
So I think that is kind of really cool
that you can actually see,
now here's the
coroutine, and here's what your compiler actually does with it. And once he has that, he then talks about various downsides of stackless coroutines and the way they're implemented in C++. He calls out a bunch of security issues that don't happen in the same way with plain functions, like object lifetime, iterator and pointer invalidation, etc. Also, he talks about how you get
either a dynamic memory allocation
with the coroutine,
because the coroutine frame in general
will be allocated on the heap.
Or in those cases where the compiler
can optimize out that allocation,
you get lots of stack bloat.
And the other thing that he calls out,
which is something that gaming people
actually often complain about,
and I think it's quite a valid concern, is that because of the way coroutines work under the hood,
there's a lot of kind of injected internal function calls.
And that leads to very slow debug builds, right?
So if you're compiling for release, a lot of that stuff gets optimized out.
But if you're compiling for debug, you actually get those function calls under the hood that you didn't write.
And that slows down your debug build massively
when you're like testing it or whatever.
So I thought that was really interesting.
That's something that I haven't really thought about,
but it's, you know,
legit kind of downside of coroutines.
And I wonder if,
if we had designed them differently,
maybe as stackful coroutines,
we could have avoided some of that stuff.
And he actually talks about that
as well towards the end of the blog post yeah i thought it was a great post just to be clear the
the pseudo code that he shows up front it's not actually what the compiler generates it's
it's close to it's enough to give you an idea of the sort of thing that the compiler
might generate which of course is going to be implementation dependent which is why he can
get away with being not quite as accurate as Lewis Baker's version.
I did something similar in my own coroutines talks, you know, how you might implement coroutines without the coroutines feature.
So it's quite a useful education technique.
I also want to encourage readers
not to get put off by the pessimistic title,
at least, of the blog post. It doesn't mean that coroutines
are not worth using or they've got lots of problems, just that there are things you have
to be aware of. And I thought it was a little bit ironic that one of the problems is that
it simplifies code so much that you might miss problems, when most people's experience of using
coroutines is that they're too complex. But most of the complexity is in the infrastructure
around the coroutines.
And when you're writing the coroutine code itself
with the co_returns and the co_awaits,
that part's actually much simpler
than it would have been otherwise.
That's the whole point.
Sort of pushes that complexity into the infrastructure.
Which means, yes, you can be fooled into thinking,
well, this is just like normal synchronous code,
when actually there are things you have to be aware of.
And one of my favorite things that James does bring out
is the fact that you can actually get dangling references
going into functions now, as well as coming out of them.
And that's another thing that I bring out in my own coroutines talks.
And I will always point out that, again, because of our guest today, it's a great time to talk about this: SonarLint actually catches that for you. So it will tell you when you've got a dangling reference going into a coroutine if you pass something in by reference. So, great post. I recommend you read it if you're interested in coroutines.
All right. So over the last few months, we had a little bit of a miniseries about tooling, which was interleaved with episodes about other stuff.
So we had an episode about Conan, which is a package manager.
We had an episode about build systems in the context of modules.
And more recently, we had an episode about CLion, which is an IDE.
And another very important type of tooling is static analysis tools. So we
thought we should do an episode about that. And as it so happens, Phil actually works at a company
that provides tools for static analysis. And so, Phil, you've invited one of your colleagues along,
isn't that right? It is, yes. So that's why we have Abbas here today. Not just so that he knows
about my vacations. All right. So welcome again, Abbas.
Pleasure to have you here today on our episode. Thank you. Happy to be here.
So you're in static analysis now. You mentioned in your bio that you did other stuff before,
and we heard about that a little bit, like you used to be in finance. But what was your journey
that got you into static analysis?
How did you get into that field?
Yeah, it's an interesting journey that I usually like to talk about.
So I started in the finance industry.
And as many developers do, we used to spend 80% of our time debugging. And one day, I had a ticket related to... I used to work on interest rate derivative functionality, with a million lines of code, and we had a ticket where the end calculation was wrong.
So I started debugging.
It took me two days, almost two days,
to figure out what's happening.
And it turned out after two days of debugging
that we had an expression with a side effect, like a ++iterator, and someone decided to add a decltype around it. And as you know, in C++, when you do something like that, the expression inside the decltype is not evaluated. So the side effect suddenly disappears,
which led to miscalculation in the finance model that we were working on.
The interesting part is that it took me two days
to discover it.
Then after discovering such a bug,
I wrote a quick regex script that could detect similar issues. And it took my machine five seconds to go over the whole code base and find similar bugs.
And we ended up having two bugs like that in our code base.
And hopefully I saved another colleague two days of debugging.
But that's where my passion came from.
There are many problems in the development world where a machine does it better, and we should work as much as possible to leave those problems to a machine rather than to a human.
So how long did it take you to write that script?
It was a simple regex, so it took less than an hour to find that bug.
So less than an hour to save a couple of days of work.
Yeah, and once you do that and you realize the power of something as simple as that,
you start to think about patterns that can happen in C++ that lead to such bugs.
And I can tell you that there are many.
Oh, yeah.
But your bio also mentioned, and we talked about it a little bit,
about the challenges of C++ tooling, which I'm sure most of us are familiar with in some sense. But what was your experience then that contributed to this journey?
Yes.
So C++ tooling is in general challenging. But if you think about it and compare it to other languages, every tool created for the IDE, if we focus on IDE tooling for example, is either created by someone as a side project or by a big company.
And if you look at every successful C++ tool,
there is a parser behind it.
And then you have an entry barrier
that if you want to write
a good C++ tool,
you need to know
how to parse C++,
which is, I think everybody knows
that it's a very complex problem.
And the main challenge
is that IDEs are interactive. And if you know something about C++ parsing, you know that it's slow. So you start your tooling project, like I did, and you directly have a bottleneck on the parser, because your tool will only be as fast as Clang, if you are using Clang. And if it takes, I don't know, 10 seconds to compile your file, your tool, best case scenario, will take 10 seconds to get back to the user.
So there's basically a high entry barrier for tooling.
And if you look at the successful tools, for example,
I dug deeper recently into clang-format, one of the tools that is becoming standard in our industry. What they did is they implemented their own parser that focuses only on formatting. And I don't think... you cannot reasonably expect every tool to write its own parser to optimize for its use case.
So two episodes ago, we actually talked to Dmitry Kozhevnikov about CLion, the CLion IDE, and he was talking about the fact that they also have their own parser, because certain things like clangd don't have the right trade-offs for them, so they have their own parser in the IDE as well.
Yes, so
even when you write your own parser,
like in CLion, which is a great tool,
if you use other JetBrains IDEs,
like IntelliJ for Java,
and you move to CLion,
you feel that it is heavier,
which is not the fault of JetBrains.
It's basically the current state of C++.
Right.
So, I mean, in order to parse C++ code,
I mean, it's not just hard per se, right?
Because you have all of these ambiguities
you have to resolve,
you have to kind of backtrack.
But also you might have to execute arbitrary code, like constexpr functions, to figure things out. For example, you have an identifier and you need to figure out if it's a value or a type, and that is a dependent name, so you might have to execute code somewhere else, in the context of a function, to figure that out. So you actually have to have a full-blown interpreter of C++, for constexpr and such, if you want to actually parse it fully, right?
So yeah, the usual two bottlenecks are constexpr evaluation and text-based includes, because they are costly to resolve.
And hopefully now with C++20 modules, once every compiler implements them, we might get faster analysis, faster tooling, for the new modern code bases.
Yeah, modules will fix everything.
So do you then also have your own parser at Sonar, your own C++ parser?
So at Sonar, we have a fork of LLVM where we have a lot of patches to the parser.
Then once we have something that is upstreamable, we upstream as much as we can to contribute back to LLVM. But we have multiple hacks to optimize how fast the parser is in the IDE.
One of the common ones is, for example, to compute a preamble, where you precompile. While you're working in the IDE, you usually modify the code multiple times, and you can precompile the set of headers at the beginning of the file, so you don't have to compile them again before the analysis starts. This way you save time in parsing.
Right, so before we continue talking about
how this stuff works under the hood, maybe you could tell us a little bit more about static analysis
in general. Like, what is it? How can I use it? What is it good for?
Sure.
So static analysis is basically the art of knowing information about the code without
executing it, which is the opposite of dynamic analysis.
If you think about dynamic analysis, it's mostly Valgrind, Address Sanitizer.
I'm just mentioning some names that are familiar to our listeners. And for static analysis, the requirement is that you should not execute the code while doing it.
And once you learn about static analysis, you discover that it's used in many places,
like compilers do static analysis to optimize your code and get rid of dead code, or reorder instructions to make your binary more optimized. So they analyze the code, they learn information about it, then they apply it for code optimization.
Another use case of static analysis is tooling to detect bugs
and basically undefined behavior in C++.
So there are some tools that try to detect null pointer dereferences,
buffer overflow, and all the common undefined behavior.
And here we reach the usual discussion about static analysis
not being able to get everything right.
Then there is other application like guidelines,
like CPP core guidelines or MISRA guidelines,
where the goal of static analysis is not to detect issues in your code,
rather to detect patterns that might lead to issues.
And here, static analysis is usually very successful at doing that.
Other use cases: there are tools to build a dependency graph of your code base.
That's some sort of static analysis.
But the most interesting use case for static analysis for me
is, in my opinion, education.
Let's say you want to learn about C++ algorithm.
The usual thing to do is either to go to CPP reference
or go to a conference or read the C++ standard.
But you can actually use static analysis to, let's say, find every raw loop in your code base and tell you: hey, you can replace this raw loop with std::any_of or with std::rotate. You can map every std algorithm to a certain pattern and try to educate the newcomer to C++ about the different parts of the STL algorithms using static analysis. So if done right, static analysis can also be a very good education tool.
One of my favorite examples of that is using the heterogeneous comparison operators, you know, where the comparator argument to a container is std::less of the key type by default, which compares the same type. Whereas if you have a std::string and you want to compare it against a char*, that means the char* has to be converted to a std::string every time. So at some point, I think it was C++14, maybe C++11, we introduced std::less with empty angle brackets, which has a templated comparison. So that will use std::string's own comparison operator to compare against a char*, so you don't do that conversion every time. So we will tell you, if you haven't used that, that there's an opportunity to speed up your comparisons. Except there are some cases where it becomes a pessimization, and it will detect that as well. So you can progressively learn more and more as you go. You think, oh, I know this now, and then you learn a refinement on it as well.
Yeah, completely agree.
So static analysis can also help you with performance?
Yes, performance is also a place where static analysis can contribute.
Like if you have a push_back, and you can detect the type of the object statically, because it compiles, you can automatically detect that emplace_back, in this case, is better.
Right. I do have a question.
So, I mean, I've used static analysis before.
For example, CLion also has a built-in static analysis tool, right?
And what I found often is that for cases where the tool can reason
about what's happening at compile time,
you know, it points out to me: hey, there is a bug. For example, let's say a null pointer dereference: if I outright dereference a null pointer, obviously the tool is going to tell me, hey, you're doing something here which might be UB. But if I'm doing something where it's a runtime property whether it's a bug, like the pointer comes from somewhere else, the static analysis tool can't reason about whether or not it's a null pointer in this case. Or, for example, it depends on a branch, and the condition of that branch is a runtime value.
Then typically, from what I've seen, the static analysis tool just doesn't tell you anything at that point. And then you only find
the bug when you run a runtime analysis tool such as address sanitizer or UB sanitizer or something
like that. Is that kind of how they all work? Or is there a mode where you can say, you know,
point out to me potential problems, you know, like, hey, this might be a null pointer dereference here, you might want to double-check that, or you might want to insert a null pointer check.
I don't know what happens at runtime,
but there's a potential problem.
Is that something that these tools can do?
Because I'm not seeing that by default, typically.
Or maybe I'm using the wrong tools.
I don't know.
I don't think you are.
So static analysis has different techniques to do this. And what you are talking about is basically detecting a null pointer dereference in the wild, without any constraint on the code. And usually this is detected by a technique called symbolic execution, where you try to execute the code without executing it, by simulating it in the static analyzer. And these kinds of techniques are usually limited, because they hit the path explosion problem, where you reach the theoretical limit of what you can do in a reasonable time.
Oh, the path explosion problem. Interesting.
Yeah.
Cool name, I like that.
There's our episode title.
So if you want to simulate code and you have a loop inside it, for example, and you don't know how many times this loop is going to run, you have to execute the loop multiple times until you finish it. So the static analyzer, in the end, if it wants to detect all these kinds of issues, will be as slow as the runtime, and not as good either. So these kinds of techniques, I believe they are useful, but they cannot be the only thing that you use, because if you have ten null pointer dereferences in your code base, it's probably going to detect two or three of them, which I believe is already good.
I think you mentioned earlier that a better approach is to detect patterns that themselves
are not wrong, but can lead to these sort of issues.
So if we can avoid a situation where a naked null dereference can happen in the first place without some other checking around it, then that's a better pattern.
So I think that often can avoid us going down that path in the first place.
Yes. A common example is that you cannot detect all the buffer overflows, and that's the most common CVE in the database. But if you have a static analysis check that tells you that you should use gsl::span, for example, everywhere, this is the kind of check that can be implemented by a static analyzer and can catch everything.
And I'm not arguing that everybody should use that, but many people should be aware
of it and consider it as a possibility.
So you already mentioned CVEs.
This is something we talked about quite a few times on the show already, this whole
safety discussion, where some people are saying, you know, you can actually find a safe subset of C++ that can be statically proven to be safe. And other people are saying, no, that's not possible.
Or that subset would be too limited to be useful.
Like, I'm actually curious: is it possible to use static analysis to, you know, build
some kind of subset of C++ where you can reason about it being safe and not having undefined behavior?
Or do you think that's kind of not a viable way forward?
In the theoretical sense, it is possible,
because there is a well-known tool that actually does that,
made by Microsoft. Trying to remember its name.
It's called Dafny, I think.
So they have a tool where you basically write the precondition of each function
and the postcondition of each function,
and the static analyzer can build mathematical verifications
that your code does what it's supposed to do.
So you can reach a subset of a language where you can verify statically
that it is correct and does what it's supposed to do.
Now, if you want to do the same for C++,
yes, you can do that,
but the question is not whether you can do it.
The question is at what cost.
You can definitely tell people
that they should use bounds checking everywhere,
and verify statically that they do it.
You can tell them to create a function, for example, that checks for null
before every dereference, and statically detect
that only this function is used.
But I'm not sure this is the language
that everybody wants to write.
That's interesting because you're kind of contradicting,
I think, what Bjarne said a few episodes ago
where we discussed this.
So it's actually interesting to hear, you know,
the perspective of somebody who actually works on these tools.
And it seems like that perspective is quite
different from, for example, what we discussed with
Bjarne, who comes at it from a
language design theoretical point of view.
Before we
dig into that, because that was really going to be my
next question anyway, I wanted to
dig into that a bit more. Let's take the
opportunity to have our sponsor break,
which is also
appropriate because we are sponsored again by Sonar, the home of clean code. So SonarLint is a
free plugin for your IDE to help you find and fix bugs and security issues from the moment you start
writing code. We've been learning all about that. But you can also add SonarQube or SonarCloud,
which we may hear about later, to your CI/CD pipeline and enable your whole
team to deliver clean code consistently and efficiently on every check-in or pull request.
SonarCloud is completely free for open source projects and integrates with all the main cloud
DevOps platforms. So, as Timur said, we had Bjarne on recently to talk about safety and security, and he
said that static analysis should play a central role in that.
But he suggested that existing tools don't really address that yet.
And to be fair, he did actually say at that point,
correct me if I'm wrong.
And one of my regrets is I didn't actually get back to,
well, correct him.
And I know you've got views on this.
We've already started discussing them, Abbas.
But yeah, what's your take then on this?
How much do we really cover with static analysis?
What else can we do?
Yeah, it's a two-part question from each of you.
I think I can start with one from Timur.
I think when Bjarne talked about this,
we are not talking about 100% proving
that this program is correct without any CVE.
At the end, we are engineers.
So if we get to 99%, that's already good.
So for me, the problem with all of this
is that we are relying on CVEs
and these CVEs are not well organized
because they don't separate
between C and C++ and they don't separate between old C++ and modern C++.
So when we talk about memory safety, we don't actually know whether code bases that are using
smart pointers still have these CVEs.
We don't have this information in our database. And to start thinking about a solution for this
before getting the data might lead us to a different path
than the one that is needed.
And I know that, for example, smart pointers can be misused,
and you can detect some of that misuse statically.
But are the people who are using smart pointers actually misusing them?
I don't know.
Because the other languages that C++ is compared with
can also be misused.
It's just that it is harder.
So yeah, this question of whether we want to get to 100%,
or whether it is enough to get to 95%, is always interesting.
And I'm always on the side that 95% is good enough, because
we are engineers.
Then the second question was from Phil about the current state of tooling.
I think, for example, we all advocate for using the C++ Core Guidelines, which our tools
support. And I can also talk about other tools: clang-tidy supports part of it,
and Microsoft's static analyzer supports most of it.
So there are tools that can enforce the C++ Core Guidelines for you.
But the question is, do actually people use them?
And if you look at the survey from JetBrains, not many people use static analysis:
30% don't at all, and some of them use only the one in their IDE.
So we have a problem of awareness of the value of static analysis.
We also have a problem of processes and, let's say, of a push towards static analysis.
For example, we support different coding standards, like the C++ Core Guidelines and MISRA.
If we look at the requests that we get as a company, most of the requests are for MISRA.
And I personally believe that the C++ Core Guidelines are a better guideline for writing code for most people.
But yet, many people ask for MISRA.
Why is that? Because there is an obligation on some companies to use MISRA.
So people are aware of MISRA and are trying to apply it in their code base,
while there is not enough awareness of the C++ Core Guidelines and their value.
That's my opinion.
Right.
So we talked about static analysis in general, but then there's also
a bunch of actual products that do static analysis, and you briefly mentioned that
not as many people use them as they probably should. But also, there are quite a few
of them around, right? So I already mentioned that CLion has its own static analysis built in.
Then there's PVS-Studio, which has a static analysis tool.
We had them as sponsors the last couple of episodes.
There's Perforce Klocwork, there's Coverity, there's a bunch of others.
So it seems like there's quite a few tools out there.
And you actually have three tools at Sonar, don't you?
So, as we heard in the sponsor read, there's SonarQube, there's SonarLint, there's
SonarCloud. So what do all of these products do? When do you use
which, and how do you integrate them into your
workflow if you want to try and start
using static analysis?
Just to be clear, I'm more of an advocate
for static analysis in general. I use clang-tidy even though
I work on Sonar. I use other static
analysis tools. So for me,
my opinion is that if combining them is better,
if you can find more issues, that's okay.
You don't have to use one tool.
And C++ is a big language.
So we have 600 rules already, 600 checks, and in our backlog there are thousands of them.
So we are not going to run out of checks, so there is room for all competitors. So yes, we have three products. All of them have
the same target, which is reaching a state of clean code in your code base. So all of them
try to push towards the same goal. There are multiple definitions of clean code,
but it's reaching the point
where your code base is an asset for you
rather than a debt.
So you can think about maintainability,
you can think about security,
you can think about performance,
you can think about safety:
all the non-functional aspects of good code.
So all of these tools are trying to get you there. SonarLint is the
first line, which is in your IDE. It's usually a personal choice when someone installs an extension
in their IDE, to comply with the C++ Core Guidelines, for example, and you don't have to invest a lot.
While SonarQube and SonarCloud are the same product: one is in the cloud, hence the name, and the other one, SonarQube, is on-premise.
But they both have the same goal, which is scaling up to a project level or organizational level quality.
When I say organizational level: let's say your organization wants to comply with the C++ Core Guidelines.
It's not enough that half of your team applies them in the IDE. You need some sort of process, a guideline for
what criteria you need to meet in order to merge to your main branch.
And all of that governance is usually done by SonarQube and SonarCloud. And there are processes to do that, because you cannot just say, hey, I'm going to refactor my entire code base to comply with the C++ Core Guidelines.
Because if you have a million lines of code, you cannot just go and remove every new and put in place of it smart pointers.
So SonarQube and SonarCloud are meant to put processes to reach this goal incrementally.
For example, if you open a pull request, you have checks that say you only have to comply with the C++ Core Guidelines in your new code, rather than in the old code.
And as you change your code incrementally, at some point you will reach a code base that
is compliant with the C++ Core Guidelines.
I think the statistic we usually use is that after about five years,
you would have touched half of your code base.
So just by only changing new code,
you would have cleaned half the code base within five years.
And some statistics say even faster than that.
So we talked earlier about the challenges of C++ tooling,
of which there are many.
But are the Sonar tools, then,
particularly SonarQube and SonarCloud that you mentioned there,
just as easy to pick up,
or are there particular challenges to using those two tools?
So SonarLint is free in your IDE.
You don't have to configure anything.
It works with CLion, VS Code, and Visual Studio.
But for SonarQube and SonarCloud, you have to set up the analysis in your pipeline, on your CI.
And hence, recently we worked on a new feature called automatic analysis,
which makes this process of analyzing your code on SonarCloud much easier.
So I do have another question.
What if I'm working on an open source library
and it's like an entirely not-for-profit kind of thing,
which is out there on GitHub,
and I have my CI set up with GitHub Actions
or something like that?
Is there any non-commercial license of these tools
that I can use to have static analysis on CI
rather than just in my IDE?
Yes, so for SonarCloud, you have two options.
The previous way: there was a GitHub Action
for analysis on SonarCloud,
and it's free for open-source projects,
so you can just integrate it in your pipeline
and you will get our static analyzer.
Or you can use a new way of analyzing C++ code
on SonarCloud, which is called automatic analysis,
where you just click one button
and your project is analyzed on SonarSource infrastructure,
so you don't need a CI.
And it's going to be configured automatically.
So we are going to resolve your dependencies
and your preprocessor and every single part
of the C++ ecosystem automatically.
Wait, so how do you do that?
So I have been working for the last year on this feature.
It takes time.
So the motivation is that
not many people use static analysis.
And previously, to set up a GitHub Action to do that,
you needed to make sure that your build works.
You needed to make sure that the dependencies exist in this action.
You needed to know how to invoke the build of your code,
because if you build your code with different preprocessor settings, it means
different things. So there are many things that the static analyzer needs to know about your
project to be able to analyze it. And now, with this new feature, we don't need that anymore.
And what we did: we did three main things. First, for dependencies, we built our own dependency
manager. Basically, instead of you telling us which dependencies you are using, we scan your code, look at your includes, try to match them with open-source dependencies, and we automatically check them out for you.
That's the first part.
This works for most dependencies, but it doesn't work for private dependencies.
And we are going to go over that later. The second step is resolving the preprocessor.
And the good thing about resolving the preprocessor is that you only need to find one valid combination of preprocessor settings
that leads to your code compiling. So what we basically do is build an equation from all the preprocessor
directives, and we use a tool called an SMT solver to find the best combination of all the preprocessor settings
that leads to code that actually compiles. An SMT solver is well known in the static analysis
world. It's basically a tool where you give it an equation
and it does some mathematical modeling to solve that equation.
And the equation here is:
what is the combination of all the preprocessor settings
that leads to compilable C++
and to the maximum number of tokens?
Because if you have a preprocessor directive
that removes your entire code base,
we don't want to enable it.
We want to actually analyze your code.
And by using this technology,
we get
a solution for the preprocessor.
We can understand your code.
And the third thing is, we go back to the parser.
We had to modify the parser to understand invalid code, because
if you have an internal dependency and you have a call into this internal dependency,
Clang will fail, because it cannot resolve what this dependency is.
But on our side, we worked on the parser to be able to say: keep this aside,
we don't understand it right now.
And we make the analyzer try to guess what it is and behave as if this function exists.
So there are three elements:
the preprocessor,
the dependency manager,
and the parser.
Combining them together, we analyzed 200 open-source projects,
and with this method we were able to reach something like 95% accuracy compared
to a manual analysis configured by the user themselves.
This is a lot of work that is done behind the scenes
to reach a state where the user only has to click one button
to get their C++ code analyzed.
It only works on GitHub.
For open-source projects, it's free.
For private projects, it is paid.
So that's super impressive.
So basically, you have your own package manager, so to say,
and your own build-system-ish thing
that just magically configures itself
by looking at what's there in the repository.
And it sounds like magic to me.
It sounds also like this kind of system would be very useful
in many contexts, even beyond just doing static analysis,
like this kind of magic auto-configuring thing
that figures out how to compile your code
or how to reason about your code.
That sounds really, really cool and useful.
Yes, so it works in this tooling area,
where you mostly need to know how your code compiles.
If you need 100%, like if you need to generate the binary of your code,
then this is not reliable, because it might make small mistakes
that change the meaning of your code.
But for things like tooling, refactoring, and static analysis,
this seems to be a good approach.
And I think magic is the right word there, when you try this for the first time,
especially if you've tried setting up something like SonarQube manually,
or even SonarCloud.
It is just magic.
You just literally click that button and it just works.
And 95% accuracy is pretty good
when it comes to applying static analysis.
It's good enough that most people would never need
to manually configure their projects anymore.
Well, now I know what I'm going to be playing with this weekend.
All right.
So we talked a lot about static analysis and tooling,
but is there anything else going on in the world of C++ right now
that you find particularly interesting or exciting?
Let's see what's happening in the C++ world these days.
So there is this movement of new languages that you went over.
I think I'm excited about it in the sense of hopefully it moves C++ language forward.
Let's see what comes out of it.
The new hot topic in the last year in C++ is safety.
So I hope that the language improves due to this push.
But yeah, that's all.
So can I ask you a question? It's kind of a controversial question
that I've asked a lot of people,
and I'm just curious about your opinion,
as somebody who works on static analysis and obviously worries about safety and security.
Do you think that C++ is doomed if you can't make it a guaranteed memory-safe language the
way that Rust is? No, I don't think it's doomed, because there are many use cases of C++ where
that's not the main concern. Maybe some industries
will move away from C++ in the
places where you need 100% safety.
But in the industry of gaming, for example,
or in the industry of music,
I'm not sure all of this discussion is
going to impact the users of C++.
And there's many people that are using COBOL these days,
so C++ is not going away.
Right. Well, thank you very much.
That was a very good insight.
I like your answer.
So anything else you want to tell us
before we let you go, Abbas?
What do I want to tell you?
Maybe how people can reach you if they want to get in touch.
So I'm usually on SonarSource community, community.sonarsource.com.
I'm interested in C++.
If you have any idea of how to check things statically, please post it there,
and I will be one of the people that are going to reply to you.
I have a LinkedIn profile.
I have a Twitter slash X profile that I don't use that much,
but feel free to reach out to discuss C++ and static analysis.
All right.
We're going to put links to all of these things in the show notes.
Perfect.
By the way, that community Discourse forum, I can say from seeing it on the inside, is almost a direct line to the developer team.
So if you've got any questions for any of us, that's where to post them.
All right.
So it looks like we're at the end of this episode, but thank you again, Abbas, for being
our guest today and for this fascinating discussion.
That was a pleasure to have you here.
Thank you for having me.
And I'll see you next week.
All right.
See you next time.
Thanks so much for listening in as we chat about C++.
We'd love to hear what you think of the podcast.
Please let us know if we're discussing the stuff you're interested in.
Or if you have a suggestion for a guest or topic, we'd love to hear about that too.
You can email all your thoughts to feedback at cppcast.com. We'd also appreciate it if you can follow CppCast on Twitter or Mastodon.
You can also follow me and Phil individually. All those links, as well as the show notes,
can be found on the podcast website at cppcast.com.
The theme music for this episode was provided by podcastthemes.com.