CppCast - Parsing and Analysing C++
Episode Date: October 4, 2024
Yuri Minaev joins Timur and Phil. Yuri talks to us about static analysis and how PVS-Studio helps. Then we chat about his work on a custom C++ parser, and what challenges he's encountered.
News:
- CppCon 2024 keynotes on YouTube (via CppCon site): Herb Sutter - "Peering Forward: C++'s Next Decade"; Khalil Estell - "C++ Exceptions for Smaller Firmware"; Amanda Rousseau - "Embracing an Adversarial Mindset for C++ Security"; David Gross - "Ultrafast Trading Systems in C++"; Daveed Vandevoorde - "Gazing Beyond Reflection for C++26"
- Coros - task-based parallelism library built on C++20 coroutines
- "The case of the crash when destructing a std::map" - Raymond Chen
- ACCU 2025 Call for Speakers and (super) Early Bird Tickets
Links:
- C++ Under the Sea
- PVS-Studio (download)
- PVS-Studio Blog
- Yuri's Webinar: Parsing C++
Transcript
Episode 391 of CppCast with guest Yuri Minaev, recorded 1st of October 2024.
This episode is sponsored by the PVS-Studio team. PVS-Studio is a static analyzer created to detect errors and potential vulnerabilities in C, C++, C#, and Java code.
In this episode, we talk about CppCon 2024, a new coroutine library, and debugging std::map. Then we are joined by Yuri Minaev. Yuri talks to us about static analysis and parsing C++.
Welcome to episode 391 of CppCast, the first podcast for C++ developers by C++ developers. I'm your host, Timur Doumler, joined by my co-host, Phil Nash.
Phil, how are you doing today?
I'm all right, Timur. How are you doing?
I'm good. I've kind of recovered from CppCon. I've been back for a couple of weeks now, one and a half weeks to be precise. And there's a new thing coming up: in about 10 days we have the deadline for the pre-Wrocław mailing for the committee, which is the mailing for which you have to submit all the papers you want to be discussed at the next committee meeting. So there's a flurry of papers that need to be written, reviewed, and commented on. Busy days, because we have quite a few features, like contracts, reflection, pattern matching, and others, that are trying to make it into C++26 before the deadline. The feature freeze deadline is early next year, so it's getting tight. So that's what I'm doing right now when I'm not recording CppCast. What about you, Phil? What are you busy with right now?
Well, I think I've just about recovered from CppCon as well, but I'm off to C++ Under the Sea next week.
Oh, is that the one in Amsterdam, the new conference?
Yeah, that is so exciting.
Unfortunately, I can't make it this year, but let me know how it goes.
Yeah, I'm really excited to see how that's going to play out.
As far as I know, you can still buy tickets for it.
Not for my workshop, though, on coroutines, because I believe that sold out.
So it gives you an idea of the level of interest in this new conference.
So if you can get there, then you probably should.
So I'm busy preparing for that workshop at the moment because that's a brand new one.
And then I should be able to maybe take a day off.
That sounds like a good idea.
All right.
So at the top of every episode, we'd like to read a piece of feedback. This week, we have an email from George. George wrote: "I was listening to the latest episode. When the idea was first raised, I began saying to myself, no, no, no. I then proceeded to say: hey Siri, remind me to write to CppCast that the committee episode is a bad idea. So, checking off this to-do item: having an episode on how the committee works is a bad idea. I was glad when I heard that both of you recognized that the majority of listeners would probably not be interested in this."
Yeah, apologies to anyone whose Siri just went off listening to that.
Mine did. We cut that out.
So what do you think, Phil? Should we then just abandon this idea?
No, I don't think we should abandon it.
But as we said on the show, we just need to be careful how often we get into that sort of thing and how deep we go.
I think some surface level discussion every now and then is fine.
Like you were just talking about your preparation for the upcoming meeting in Wrocław.
So I think that's fine.
But yeah, probably having multiple episodes on how deep the rabbit hole goes might be a little bit too much.
Yeah.
I mean, we also have the local C++ meetup here in Helsinki that I'm running, and we sometimes have talks about future stuff and committee stuff. We actually had a couple of talks like that just this week, and the impression I get there is that most people who come to meetups, and I suspect also most people who listen to our podcast, are more interested in things that they can actually use in their code today. So I think we should really try to focus on that maybe a little bit more. I think that's what's most relevant for people listening to this.
Absolutely. But if people do want to find out more, where can they go?
I'm glad you asked. If you are interested in how the committee process works, we have extensive documentation about that at isocpp.org/std, so you can read all about the process there.
Right.
So, we'd like to hear your thoughts about the show. You can always reach out to us on X, Mastodon, or LinkedIn, or email us at feedback@cppcast.com.
Joining us today is Yuri Minaev. Yuri works on PVS-Studio as the C++ static analyzer architect.
His primary responsibility is to keep low-level stuff in order and add new features to the
core module.
A bit of a surgeon operating on the open heart of a C++ parser, a bit of an orthopedist inventing
prosthetics for the legacy code,
father of three cats, and a couple of pet projects.
Yuri, welcome to the show.
Hi, Phil.
Hi, Timur.
Glad to be here.
Hi, Yuri.
It's an interesting range of responsibilities that you have there, but interesting that
you do talk about your pet projects alongside your cats.
So you treat your code as pets rather than cattle?
Well, my pet projects, I treat them more like my cats.
Sadly, I don't have much time for them, but that's okay.
I'm not on a deadline, you know.
I had, well, you know, at some point I decided, hey, I want to create a game engine.
Sounds great, right?
And I tried and I failed.
Yeah, because just if you can imagine how many things, how much stuff you need to put in there.
So at some point I decided, hey, I'm working on that sort of
compiler-y thing, you know, like static analyzer.
Why don't I write my own programming language?
Of course.
And yeah, currently I have a project which is like a functional Python, I don't know, programming language. It started as a calculator. I just wanted to create a command-line calculator, and it kind of went out of control.
Yeah, that's the thing. You start off writing a calculator, and it soon adds up.
Yeah.
So at some point you add variables,
then you want to add functions,
then you do conditionals
and it goes downhill.
Yeah.
As long as you don't add side effects.
All right.
So Yuri, we'll get more into your work
in just a few minutes but
first we have a couple of news articles to talk about, so feel free to comment on any of these.
Okay.
Okay, so the first one I have here is about CppCon, which Phil and I went to just two weeks ago. They have published their keynotes, or at least four out of the five keynotes that happened there: one by Herb Sutter, one by Khalil Estell, one by David Gross, and one by Daveed Vandevoorde. Also this year's committee panel discussion, which happened there. I haven't seen the fifth keynote yet, the one by Amanda Rousseau about safety and security, but I'm sure it's going to pop up soon.
They're not really public, in the sense that they're unlisted. So you're not going to find them through Google or YouTube search, but they are online.
If you have the link, you can access them.
And conveniently, the links have been posted on Reddit.
So you can find them there.
And I guess you can also include them in the show notes, Phil.
Is that right?
Absolutely.
And then you can watch them there for free.
And they're all pretty amazing talks.
So I certainly learned a lot from them, and I'm sure it's going to be interesting to some of you as well.
Just a bit of inside information about why those videos are unlisted at the moment. This is something that's common across most of the conferences now: most of the viewings for YouTube videos come in organically from searches, so everything is optimized to cater to the YouTube algorithm. Having a staggered release schedule through the year really helps with those numbers. But for those of us that are closer to these things, you just want to get those videos as soon as possible. So for those smaller numbers, it's okay just to give out the unlisted links.
We're putting them up unlisted.
We can share them with the speakers
and with the community.
That's not really going to change
YouTube statistics.
So we get a bit of the best of both worlds.
Yeah, I'm sure you're in the thick of it
with all the conference stuff,
aren't you, Phil?
So you know everything about that.
Well, not everything,
but more than I should.
All right. So, talking a little bit more about code that you can use today: there is a new library for task-based parallelism called Coros. It's available on GitHub, and it's built on top of C++20 coroutines. It's not the "coro" library, that's a different library; this one is called Coros, with an s. It has a pretty straightforward interface, and it's a header-only library, which is of course great if you want to integrate it. It's optimized for performance. I guess you kind of have to do that if you do C++; that's kind of the point, isn't it? It also uses std::expected for error handling, which is pretty cool. That's a new thing in C++23, so I guess you do need C++23 to compile it. But it's a pretty cool new little standard class for error handling, which I talked about at some point. Using expected doesn't mean that they don't throw exceptions left and right; I would assume that they don't, but I haven't looked into the code.
They don't throw exceptions, but they do catch exceptions in the unhandled exception function and just drop them into the expected.
Yeah. So, more interestingly, the library supports monadic operations, so you can chain tasks with things like and_then, where you give it another task, and stuff like that. And it's all built on coroutines. There are, of course, lots of questions, like how it compares to other libraries that do similar things, like Taskflow and libfork; there's a bunch of them. And how does it compare to the whole sender-receiver thing, of which there is also a reference library out there, I think by NVIDIA, libexecution or something like that.
Right. As well as libunifex.
Yeah, libunifex from Meta is the other one as well.
So there's a bunch of libraries. I'm not sure what makes this one different or better, or why you should use this one, but it's a new project, very modern C++, which makes interesting use of coroutines, so I thought it was newsworthy to mention.
Yeah, you know, I think for the people who developed it there's also the hacker value, or whatever it's called: when you do things for the sake of having done interesting things, and it's like an achievement. Yeah, I can do that.
Maybe. Maybe there's really a need for another library.
Oh, I think we need lots of these, and we'll see how they shake out, and then we'll have the prior experience that we can standardize, because we do need some task-based libraries in the standard that are lower level than the std::execution stuff. I think that's a much bigger framework for this sort of thing, whereas this one seems to be much lower level, more at the std::generator level. So it seems to work well alongside that; it just adds a few of the common features. There's a thread pool, you can have different coroutines going off on different threads and then join them at the end with a very simple co_await syntax. And you can even do recursive calls, so you can have a task within a task, and the outer one will await the inner one, that sort of thing. Stuff that's actually just generally very useful, but not too opinionated. So I think there's definitely a need for this sort of library.
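As an aside, the and_then-style chaining mentioned above looks roughly like this. To keep the sketch compilable without C++23, it uses a tiny hand-rolled Expected rather than the real std::expected; all the names here are invented for illustration, and Coros' actual task API will differ.

```cpp
#include <optional>
#include <string>

// A miniature, hand-rolled stand-in for std::expected with and_then, only to
// show the monadic chaining style; not the real std::expected or Coros API.
template <typename T>
struct Expected {
    std::optional<T> value;   // engaged on success
    std::string error;        // set on failure

    template <typename F>
    auto and_then(F f) -> decltype(f(*value)) {
        if (value) return f(*value);   // success: run the next step
        return {std::nullopt, error};  // failure: skip f, propagate the error
    }
};

// Example steps that could make up a chained pipeline.
inline Expected<int> parsePositive(int raw) {
    if (raw <= 0) return {std::nullopt, "not positive"};
    return {raw, ""};
}
inline Expected<int> doubled(int v) { return {v * 2, ""}; }
```

With the real C++23 std::expected, a chain reads the same way: `parsePositive(x).and_then(doubled).and_then(...)`, and the first failing step short-circuits the rest.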
Yeah, and that's the stuff that's very tricky to get right if you want to do it yourself. You had a few talks about that, and it's good to have libraries that just do it for you.
I tried to understand it, and I was terrified by how much stuff you need to write by hand just to make it work with nothing but the standard library.
Yeah, I think that's why my workshop next week sold out.
Yeah, I wish I could attend your workshop. I actually spent quite a lot of time trying to understand coroutines recently, because we are trying to make contracts, like pre- and postconditions, work with coroutines. For that, I had to actually understand how coroutines work, or at least to an extent sufficient to understand what was going on there. And that was a deep dive. That's probably a bit too much information about what's going on there. But yeah, I think I would really enjoy attending your workshop, Phil, to make sense of it all. Unfortunately, I can't be there. Maybe there's going to be another opportunity.
Oh, yeah.
So I have one more news item, which is, again, about kind of a very practical thing.
And it is another great blog post by the never tiring Raymond Chen,
who is writing multiple amazing
blog posts every week.
I don't know how he does that.
This one stood out even from his usual blog posts to me.
It's called The Case of the Crash When Destructing a StdMap.
And it's all about debugging.
So a customer was getting a crash in the destructor of StdMap in the Microsoft STL and just send
them the
crash log and said hey my code is crashing it looks like your stl is broken can you fix that
and the whole blog post is basically about raymond trying to figure out what's actually going on
like figuring out what the bug is basically just from the information that the the customer sent
them and it's just an amazing read it's like um he starts by reading the source code of std map
like in the stl
implementation that they have trying to figure out like what's going on like okay what are the
pointers and no's there like what refers to what okay great and he looks at the disassembly
he maps that back to the actual code then he figures out where like the the corruption the
memory corruption happens that triggers the actual crash and he guesses okay it
must be like probably this kind of error code and then he goes on and on and on and on and like
oh okay this this this is not happening here okay let's dive deeper oh okay it must be that
and then at the end of that he finds the bug which is uh completely unrelated to sitmap
It's an asynchronous I/O operation that was not even happening in the client code. It was happening in some other library that they were using, where there was a bug: an asynchronous I/O operation was started and then just, you know, shot into space, never joined or anything. At some point the code decides: okay, we have waited a second now for this operation to complete; whatever, we're just going to abandon it. Then it goes on and does something else with that memory. Much later, the I/O operation comes back and says: oh, I'm done now. And it writes its result into that memory, which just happens to overwrite the root node of that std::map in a completely different place.
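The failure mode can be sketched in a few lines. This is a deterministic toy simulation, assuming nothing about Raymond Chen's actual code: the "async I/O" is just a deferred callback, and the memory reuse is modelled by handing out the same buffer twice instead of real heap recycling.

```cpp
#include <functional>
#include <vector>

// Toy model of the failure mode: an "async I/O" completion is a deferred
// callback that still holds a pointer to memory the caller has since
// abandoned and reused. Every name here is invented for illustration.
inline char simulateLateIoCompletion() {
    std::vector<char> arena(16, 0);       // stand-in for one heap allocation
    char* ioBuffer = arena.data();        // buffer handed to the async operation
    std::function<void()> pendingIo = [ioBuffer] { ioBuffer[0] = 'X'; };

    // The caller gives up waiting, "frees" the buffer, and the allocator hands
    // the same bytes to a new owner -- the std::map's root node in the story.
    char* mapRootNode = arena.data();
    *mapRootNode = 'M';

    pendingIo();                          // late completion scribbles over it
    return *mapRootNode;                  // 'X' now, not the 'M' the map wrote
}
```

The function returns 'X': the late write wins, which in the real bug silently corrupts whatever object happens to live at that address by the time the I/O completes.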
I liked the analogy he used there. Imagine you hire a demolition company to knock down your house, but they can't come right now, so they say: we'll come later. You get tired of waiting for them and sell your house, and eventually they come and knock the house down, even though somebody else is living in it now. It was sold ages ago, and, surprise!
Right. No, that's great. I mean, this whole article is just, you know, Jedi-master-level debugging. You can learn so much just from reading it: how he thinks, how he approaches the problem. It's just a masterclass. I really enjoyed it.
So the thing that struck me was
how he actually has time to do all this debugging
when he's knocking out so many articles every week.
Maybe he's delegating articles.
I mean, I'm sure he can debug.
Maybe, maybe we'll have to try and debug that.
Okay, so Timur said that was the last news item, but I sneaked one more in. I'll take this one, because we actually launched the ACCU 2025 call for speakers this week, in fact just yesterday. So it's breaking news: if you want to speak at next year's ACCU, which will be running from the 1st to the 4th of April, Tuesday to Friday, then that is open now. We've also opened early bird tickets. It's not explicitly called "super early bird", but it's effectively a super early bird, because there's an additional discount you can get, and the discount is going to decrease over time during the early bird period. So the sooner you book, the cheaper the tickets. Pause the podcast now, go to the website, get your tickets, come back, and then finish the show.
All right. Yeah, it's an amazing conference. I've been there quite a few times now. I hope I can be back next year; I'm pretty sure I can make it. Do you want to remind people where it is?
Yeah, it's in Bristol in the UK. The conference has been in that venue for getting on over 10 years now, so if you've been there before, you should be familiar with it. And although it's in the west of England, it's very accessible from London, especially Heathrow: just a straight train line, an hour or so, maybe an hour and a half, from Heathrow. So come to ACCU 2025.
And that is the last news item, so we can now get to Yuri. You were on CppCast back in 2020, back when Rob and Jason were running the show.
Of course.
So what have you been up to in the four years since?
A lot of things, actually. Well, let's see. I developed a new type system for PVS-Studio.
We are currently working on a parser.
I have three cats now.
In 2020, I had one.
I finished Dark Souls like 20 more times.
Lots of stuff.
So definitely busy then.
Yeah, it's been productive, especially the cats and Dark Souls parts of it.
I'm sure they take a lot of time.
Cats, yeah, they do.
Yeah.
But static analysis, then. PVS-Studio is a static analyzer for C++ and a number of other languages, as we said. Even since you were last on, we've had a few shows about static analysis. But maybe you can just remind us what static analysis is and why you would want to use it.
The way I usually talk about it is: well, imagine a compiler. What it does is understand your code and then generate code for your computer. Static analyzers do the same thing, but they don't generate code. They analyze your code and tell you where you made mistakes, basically. You need it to make sure your code is as bug-free as you can make it before it goes to production. We even have a diagram of the cost of fixing an error: if you catch it early, it's cheap; if you catch it when, I don't know, your software is already uploaded to a spacecraft or something, it's really expensive. So yeah, you might want to use static analysis because it will catch the things that people usually overlook.
Because people are not good at noticing small things. Like, you accidentally forgot to check, I don't know, a pointer or something. It's easy to forget, and it's difficult to see, especially if you're reviewing a lot of code by yourself. So this thing is just a little helper that can help you find obvious mistakes, and sometimes not-so-obvious ones too. But obvious mistakes are the main focus because, like I said, people are not good at noticing small patterns and stuff like that.
Yeah, I think those obvious mistakes, when you discover them, can be a bit embarrassing: how did I let that slip in? So they're the ones you really want to catch. At least with the non-obvious ones you get some respect, like, nobody could have caught that.
Well, the nasty thing about those small mistakes is that sometimes you won't just notice them. Sometimes you crash, and sometimes it's UB.
Oh yeah, and then you just don't know what happens. We've had a lot of weird bugs in our own code, which I probably shouldn't say, but who cares. Where something really strange happens, like a bug that only shows up when you consume over a gig of memory or something. And many of them were UB.
Yeah, so I've seen quite a few conference talks by people like you and your colleagues, usually about weird bugs and how they happen, and it's usually quite interesting material. But the description mentions that it's a static analyzer for C, C++, C#, and Java. That's quite a few different languages. Can I ask what language the actual analysis tool is written in?
Well, the C and C++ ones are in C++. The C# one is in C#. And the Java one is, surprise, in Java.
Oh, so they're separate products, separate code bases.
Three of them.
Interesting.
Right now, some people here are dreaming about making a unified engine or something which can handle them all. But I don't know, because I don't want to write a C# parser or something like that. C# and Java use imports instead of headers. In C++, it's really convenient: you have a .cpp file which includes everything it needs, and you can analyze the entire preprocessed output. In C# and Java, you have to build that model; you have to load those imports and stuff like that. That's why for C# they use Roslyn, and for Java, Spoon. They use those for their front ends, because writing that from scratch, I think, is a little bit more difficult.
Yeah, it always seems almost wasteful to have completely separate implementations of these parsers and analyzers in different languages, but every attempt I've seen to unify them, and I've worked at a few places now that have come down this path, has never quite worked out.
Right. So how different are these? I assume that you work on the C++ one, since that's why we're here. I wonder how different the problems are, because in C++, obviously, we have undefined behavior; we have all of these things like invalid pointer dereferences and out-of-range accesses that I guess you can detect. Whereas in C# and Java, those things are still bugs, but they're not undefined behavior; you get an exception. Is that a completely different game in C++, or is it relatively similar? What kind of problems do you find?
I'm not exactly sure how the other parts work internally, because I don't go there a lot, but I guess they also have null references and stuff like that. Well, yeah, you will crash; it's not UB, but you still have to catch these errors and fix them. And in Java, I remember, when new versions of Java were released, they added checks for code which becomes obsolete in the new versions, and pointed that out to the user in the Java analyzer. So you can take an older code base, run the analysis, and you will see the parts you need to rewrite for the new version of the language. That was a request from a customer, I believe.
Do you distinguish between what we might call actual bugs, things that are actually wrong and will lead to problematic behavior, versus what we might call code smells, or violations of best practices, where the code may itself be correct but is going to lead to problems?
Well, we usually have diagnostic levels. If the analyzer is sure that something is a bug that shouldn't be there, it will complain with a high priority. And if it's more like, we can let it slide, it's not a bug, but don't do that, it will complain at a lower level of priority. So the alarm level is different, but in the actual code, when the checks run, they work the same. Usually there's no clear distinction between "that's a bug" and "that's a code smell". For code smells and minor stuff, we usually check the context, some surrounding conditions. By using the circumstances around the place where the error is reported, we can guess: okay, probably this was supposed to be there, the author knew what he was doing; or probably that was a copy-paste mistake, something like that.
Right. So they're context-aware.
Yeah.
And that presumably implies symbolic execution?
Yeah, we have data flow and symbolic execution.
Right, which in general is like solving the halting problem, but in a limited context you can actually get quite far with it. So I actually have a question about this, because I'm trying to wrap my head around this stuff right now. There is program execution, where you just run the code, and if you call a function, you know what the inputs are, because you literally call the function with that input. That can happen at runtime or at compile time, which is what we call constant evaluation, like constexpr and stuff. But symbolic execution is different, right? Instead of just calling a function with an input, you're kind of calling it with every possible input, or something like this. Is that vaguely right?
Or what is symbolic execution exactly?
We go through the function's code, and based on how different entities inside that function are used, we can infer some facts about them. For example, let's say you have a function that accepts a pointer, and the first thing it does is: if not ptr, return. So from that point, ptr gets the label "never a null pointer", and stuff like that. For pointers it's really simple. But, for example, if you have "if a minus b equals zero" and you enter the if body, then inside that body we know that a equals b, because a minus b equals zero, if that makes sense. So you just gather different facts. It's similar to what an optimizer does in its analysis passes. Take LLVM, for example: we don't have an IR, but LLVM uses an IR. What it does is first go through the instructions and collect some information about them, and then use that information to optimize. It can infer that some variables never change, even if you don't explicitly label them const in your code, and it can throw them out of your function. It can inline functions. If it infers that a function always returns the same value, it can just replace the entire function call with that value. That kind of thing. It's like an optimizer, but a real optimizer is more sophisticated than symbolic execution.
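To make the two facts Yuri mentions concrete, here is a small illustrative function annotated with what a path-sensitive engine could conclude at each point. The comments describe the shape of the reasoning, not PVS-Studio's actual internals.

```cpp
// Illustrative only: facts a data-flow/symbolic engine could derive per path.
inline int pathFacts(int* ptr, int a, int b) {
    if (!ptr) return -1;  // past this line, ptr carries the fact "never null"
    if (a - b == 0) {
        // On this path the engine knows a == b, so an expression like
        // a / (a - b) here could be flagged as a guaranteed division by zero.
        return *ptr + a + b;  // dereference is provably safe on this path
    }
    return *ptr;              // also safe: the early return filtered out null
}
```

Running the function just confirms the three paths the engine would distinguish: the null path, the a == b path, and the a != b path.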
Does it work across different functions, or even different translation units? For example, say I have one function that returns a pointer, and something in that function lets you reason that the pointer will never be null, and then I pass that pointer into another function. Can you then deduce that the call in the other function must be correct, because the function the pointer came from never returns a null pointer? Or do you look at every function in isolation?
Well, yeah, it can do that. We annotate functions when we analyze them. We analyze one function and annotate it: we can say that it never returns a zero or a null pointer, or something like that. There are also manual annotations: we know that malloc allocates memory and free releases memory. How do we know that? Because we manually annotated malloc and free. But we also create such annotations automatically while analyzing, and when you call a function, we can know something about its return value, for example. It's kind of limited, so it's not as good as, I don't know, LLVM's optimizer, but it's able to understand things like that. And to go across modules, we built a special kind of quasi-linker. To do intermodular analysis, we first go and collect information about the modules and store it in a binary format, and then we run the analysis and use that collected information to understand what the functions from other .cpp files do.
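The "analyze once, annotate, reuse at every call site" idea can be sketched like this. Every name below is invented for illustration; PVS-Studio's real annotation store certainly looks different.

```cpp
#include <map>
#include <string>

// Sketch of function summaries: analyze a callee once, record a fact about
// it, then consult the fact at call sites instead of re-analyzing the body.
struct Summary { bool neverReturnsNull; };

inline std::map<std::string, Summary>& annotations() {
    static std::map<std::string, Summary> table;  // the "quasi-linker" data
    return table;
}

// Pretend this fact was learned while analyzing getBuffer() in another
// translation unit (or loaded from the binary file the intermodular
// pass produces).
inline void recordSummaries() { annotations()["getBuffer"] = {true}; }

inline bool callNeedsNullCheck(const std::string& callee) {
    auto it = annotations().find(callee);
    // Unknown functions are assumed to possibly return null: stay cautious.
    return it == annotations().end() || !it->second.neverReturnsNull;
}
```

The conservative default for unknown functions is the important design choice: a missing summary must never make the analyzer less suspicious.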
So I've got one last question about this, if I may. Let's say I have a code base and I want to throw PVS-Studio at it and see if it finds any bugs. How do I do that? Do I need to have a CMake file, or a Visual Studio solution, or something else that tells the tool: these are the source files that belong to the project, this is how they are related, and this is how I compile them? Or a compilation database? How do you figure out what the actual code is that you're going to look at?
Let's see. You can use a Visual Studio solution, because there's an extension for Visual Studio. There's also integration in VS Code, Qt Creator, something else, I don't remember. It supports CMake. You can give it a special file, compile_commands.json; some tools like CMake can generate it. It's a special JSON file which contains all the compilation commands of your project. Under Linux, you can also use strace to trace the compilation commands, and then a special tool that we created will parse the output, collect all the flags and paths to the source files and so on, and run the analysis on them. Under Windows, you can use our compiler monitoring tool: you start it in monitoring mode, build your project, and it will catch the compiler calls and then start the analysis on the .cpp files.
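For reference, a compile_commands.json (which CMake can emit when CMAKE_EXPORT_COMPILE_COMMANDS is enabled) contains one entry per translation unit, roughly like this; the paths and flags below are invented for illustration:

```json
[
  {
    "directory": "/home/user/project/build",
    "command": "g++ -std=c++17 -Iinclude -c /home/user/project/src/main.cpp -o main.o",
    "file": "/home/user/project/src/main.cpp"
  }
]
```

A tool that reads this file knows exactly which files to analyze and with which include paths and defines, without having to understand the build system itself.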
So there are different ways to do it. In some cases, do you actually interpret the project model?
Well, not exactly the project model, because usually, in Visual Studio for example, MSBuild gives us all the information we need. With CMake, CMake does that. Basically, we just integrate and listen. So we don't have, like... you know what's funny? About a year ago, maybe, I had this crazy idea that we need to do a PVS toolchain. Let's say you have a CMake project, and you say: build it, but instead of your compiler, use PVS. Then we'll have a "compiler" and a "linker" and stuff like that, and we will just work as a toolchain. But it's still just a dream.
I think it would be interesting.
Yeah, that's the interesting thing about build tools and toolchains: nobody really wants to have to deal with them, but everybody wants to do it exactly their own way. Bit of a contradiction. So is it easy enough to integrate with your existing tools, or do you have to do a lot of configuration?
Usually people just set up PVS-Studio in their CI, and it just runs after builds.
Right. Okay. That sounds reasonable.
And you mentioned LLVM.
We have Clang tooling, and part of that is Clang Tidy,
which is already a static analyzer.
So what does it do differently?
What does PVS Studio do differently to, say, Clang Tidy
that we already have very often built into our IDs?
If I had a dollar for every time I'm asked this question, right?
Tidy is a little linter.
As far as I know, it doesn't have much semantic
information, so if you want to compare PVS-Studio with something, I guess Clang
Static Analyzer would be the better comparison. But as for Tidy, well, it's good, but it doesn't dive deep enough, I think.
And Clang Static Analyzer, I'm not even sure which state it is in now.
I think it's not big enough yet.
I mean, it doesn't catch as many errors, if I know correctly, because I haven't looked at
it for a long time.
Right, yeah. I mean, it's definitely still evolving, and I know some issues
that I saw with it even a year or so ago have been fixed, so it's getting better. But
you're right that there's a limit to how far it goes. Tidy has a really great advantage, I think,
because you can write your own rules there.
You can write those AST matchers,
and they will work like custom checks.
Right.
And it can be surprisingly easy to get up and running with that
if you've got a specific custom need. It's definitely something to look at.
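As a sketch of what those AST matchers look like, here is a fragment in the style of a custom clang-tidy check's matcher registration. It assumes the LLVM/Clang ASTMatchers library and the surrounding check class, so it is illustrative rather than standalone; the matched method name is just an example:

```cpp
// Sketch of a custom clang-tidy check's matcher registration.
// Requires the LLVM/Clang libraries; shown only to illustrate the DSL.
#include "clang/ASTMatchers/ASTMatchFinder.h"
#include "clang/ASTMatchers/ASTMatchers.h"

using namespace clang::ast_matchers;

// This would live inside a class deriving from clang::tidy::ClangTidyCheck:
void registerMatchers(MatchFinder *Finder) /* override */ {
  // Match every call to a member function named "c_str" and bind it to
  // the name "call", so the check's callback can report on it later.
  Finder->addMatcher(
      cxxMemberCallExpr(callee(cxxMethodDecl(hasName("c_str"))))
          .bind("call"),
      this);  // in a real check, `this` is the registered MatchCallback
}
```

The matcher expression composes like a query over the AST, which is what makes small custom rules quick to write.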
Well, this is usually what I hear from people who understand marketing and sales better than me,
because I know nothing about it:
we have user support and we have integration everywhere.
So it's kind of a big advantage over free tools,
especially, where you take the tool and you're on your own in many cases.
Yeah, yeah.
And one thing I like to say is, you know, you don't have to just use one static analysis tool.
You can use a whole range of them; they often complement each other.
That is, if you don't sink in their output...
if you're not afraid to
sink in their output,
because you will be
bombarded with a lot of things,
especially if you run one for the first time
and you don't ignore
stuff which is irrelevant,
and you don't disable checks which you don't
really want. Oh, it's like...
was it Tidy?
It's like Tidy was complaining:
oh, you're passing a parameter
here and you don't change
it inside the function, make
it a const reference. And
when you do make it a const reference, it starts
complaining: hey, what are you doing, dude?
The object is really small.
Just pass it by copy.
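To make that ping-pong concrete, here is a small sketch (my example, not one from the episode) of the two patterns the checks disagree about:

```cpp
#include <string>

struct Point { int x; int y; };  // small, trivially copyable

// A linter may suggest taking `s` by const reference, since the function
// only reads it (clang-tidy's performance-unnecessary-value-param does this):
int length(std::string s) { return static_cast<int>(s.size()); }

// ...but for a small trivially copyable type like Point, the usual guidance
// is the opposite: just pass it by value. Mechanically applying
// "make everything a const reference" can then trip the other rule.
int sum(Point p) { return p.x + p.y; }  // pass-by-value is fine here
```

Either way, `length("hello")` is 5 and `sum({2, 3})` is 5; the behaviour is identical, and the only question is which warnings fire, which is why these checks need tuning.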
Right.
Yeah, there can be a lot of false positives like that.
You need to tune it.
Well, I mean, what I mean by that
is not that Tidy is bad or anything.
It's just that you have to set up
what you want to see in the output, really.
Yeah, yeah.
Yeah, exactly.
Not everything is an obvious bug.
Okay.
So, Yuri, we're going to continue the conversation with you in a minute.
But before we do that, we have a few words from our sponsors this week.
PVS-Studio is a team that develops its own tool,
a static code analyzer for C, C++, C Sharp, and Java code.
Large legacy projects require the full focus of the development team
and newer projects require experienced
support. With a static analysis
tool at your disposal, your team will write
even safer, cleaner, and more secure code.
The static analyzer has more than
1,000 diagnostic rules and a special mode
for mass suppression of warnings in legacy code.
It can find dead code,
typos, potential vulnerabilities, and
much more. If you haven't tried PVS Studio yet, now is a great opportunity to do it.
Podcast listeners can get a one-month trial of the analyzer with the CPP Cast promo code.
You can find the link to download the tool in the description as well.
I guess description means you're going to put it in the show notes, right, Phil?
Yes, yeah, that'll be there.
Go and have a look.
So yeah, a bit of a coincidence
that we have PVS-Studio as a sponsor for this episode. Not entirely a coincidence; we sort of
tried to coordinate those things, so that worked out quite nicely. But yeah, a great opportunity
to go and try it out yourself. So, having talked about PVS-Studio, I just want to change gears a
little bit, because you did mention working on a C++ parser in your bio.
And that interested me because I assumed that you would do something like
clang tooling and build on top of that.
But are you actually doing the parsing yourself?
Yeah.
We are doing our own thing.
Clang would be great, but not good enough, let's say. I don't mean, again, that
Clang is bad, but here's how we typically work: we support different compilers for different platforms.
I mean, code written for them. And what we do: we get a CPP file as input, and we ask the compiler to preprocess it and give us the preprocessed version of the file.
Right.
Without macros, includes, and stuff like that.
And then we parse that.
And the problem is that Clang doesn't support everything. There are compilers, especially in the embedded world,
which have non-standard types, non-standard intrinsics,
and stuff like that, and we have to understand that.
But Clang doesn't support all of them.
Clang would be great if we just did Linux and Windows,
MSVC, GCC kind of stuff.
Well, actually, embedded compilers are usually GCC-based,
so we could cover a lot of them, but not all.
I remember supporting some 8-bit compiler
which had types like int24, int40,
and it had a bit modifier,
so you could manipulate bits directly somewhere in memory.
And its pointers were a flexible size.
They could be from 4 to 12 bytes long, I think, the addresses.
It was a long time ago,
so I don't remember the
details, but the idea is
something like that.
And one more thing,
which I hope dies soon, is
we support Microsoft
C++/CLI, which is
C++ for .NET.
We don't support it too well,
but we can at least parse it.
And Clang... I think if we gave Clang
something like that to parse,
it would tell us to go and do something else.
Right, yeah.
No, that makes sense. But then building your own C++ parser, that sounds like a
pretty big project. So, well, has it been worth it, first of all? And what have the challenges been?
Oh, challenges. In C++, challenges start at grammar.
Right.
You know?
And often end there as well.
If I may propose a feature for some upcoming standard: can we please have a keyword for functions, for declaring a function?
You mean like fn or...?
Yeah, yeah, something like that.
I have a proposal on the back burner for exactly that.
If you knew how much easier it would make
the job of a compiler to know that. Because, well, we do recursive descent in the parser. So you take the grammar and you go left to right, from the top-level grammar rule, and down, down, down, down, down to the bottom.
You reach something like a literal there, or an identifier or something, and you go up, and you parse things. The thing about recursive descent, by the way,
is that it's the only algorithm you can actually write,
you can actually code by hand,
because the others require generated tables,
state machines, and stuff like that.
And recursive descent has some limitations
it imposes on the grammar.
For example, you can't have two rules in your grammar which start with the same thing.
That's why I want a keyword for function: because a function and a variable, they start the same.
You have a type specifier, you have a declarator.
And the difference is after the declarator,
where you either have an initializer or parameters.
And the other thing is left recursion.
In your grammar, if you have left recursion,
like a rule is defined using itself.
Usually it's for binary expressions: like, you know,
a binary expression is a binary expression,
an operator, and an
expression with a
higher priority.
So
the grammar of C++
is not well suited for recursive descent.
But everyone does
recursive descent. Clang does recursive
descent. GCC does recursive
descent. We do recursive descent.
I think Microsoft does something else.
Maybe. I'm not sure about Microsoft.
I know they did a token-stream-based approach, but I think that was in contrast to
an AST-based one, which they now do more of as well. I forget whether they do it entirely now, or
they still do the thing where, when the original approach doesn't work anymore, they have to
backtrack and then try an AST-based one. But I'm not sure if that's the difference.
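To make the left-recursion point concrete, here is a minimal sketch (my example, certainly not PVS-Studio's parser) of a recursive-descent expression evaluator. The left-recursive rule like `term -> term '*' factor` is rewritten as a loop, which is the standard workaround:

```cpp
#include <cctype>
#include <string>

// Grammar, after removing left recursion:
//   expr   -> term   { ('+' | '-') term }
//   term   -> factor { ('*' | '/') factor }
//   factor -> NUMBER | '(' expr ')'
// Input is assumed to contain no whitespace, for brevity.
struct Parser {
    std::string src;
    size_t pos = 0;

    char peek() const { return pos < src.size() ? src[pos] : '\0'; }

    int factor() {
        if (peek() == '(') {  // '(' expr ')'
            ++pos;
            int v = expr();
            ++pos;            // skip ')'
            return v;
        }
        int v = 0;            // NUMBER: one or more digits
        while (std::isdigit(static_cast<unsigned char>(peek())))
            v = v * 10 + (src[pos++] - '0');
        return v;
    }

    int term() {
        int v = factor();
        // "term -> term '*' factor" is left-recursive; as a loop it is fine:
        while (peek() == '*' || peek() == '/') {
            char op = src[pos++];
            int rhs = factor();
            v = (op == '*') ? v * rhs : v / rhs;
        }
        return v;
    }

    int expr() {
        int v = term();
        while (peek() == '+' || peek() == '-') {
            char op = src[pos++];
            int rhs = term();
            v = (op == '+') ? v + rhs : v - rhs;
        }
        return v;
    }
};

int eval(const std::string& s) { return Parser{s}.expr(); }
```

Each grammar rule becomes one function; operator precedence falls out of which function calls which, and no generated tables are needed.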
So I know a little bit about this. I have never written my own parser from scratch, but I have
rewritten large parts of an existing parser. When I was at JetBrains, all the way back in 2018,
I rewrote
large parts
of their C++ parser at the time,
which was written in Java.
That was also fun, but that's a different story.
But I remember
this kind of
realization that I had.
Because when you, for example, want to
parse a declaration, or anything interesting, there's all of these ambiguities where you have
to, like, try to parse it as a type, or try to parse it as a variable, or try to parse it as a function
call or whatever, and then if it doesn't work, you're going to have to rewind and start again.
Yeah.
But a lot of the time, what you're actually parsing depends on whether some identifier x is itself a type or a variable, right?
But in order to find that out, you might have to basically execute arbitrary constexpr functions and instantiate things; you have to interpret half your program sometimes, if there's a lot of constexpr stuff going on, just in order to make
sense of it and continue parsing. And, yeah,
I think that's the moment when I realized how messed up C++ actually is. What's
your take on this?
I agree that it's really messed up. You know, the worst part of C++ is C, really.
Because we have to carry that legacy.
All those declaration syntaxes, which don't really make sense, they're inconvenient.
Oh, yeah.
I did rewrite that specific part from scratch.
I remember I found out just
how many places you can surround something with parens and it's still valid.
And wait until you get to function pointers which return references to arrays, and something like that,
like a pointer to a function which takes a pointer...
Yeah, yeah, you kind of have to read it inside out, and it gets very weird.
Yeah, yeah, I've been there and
done that. I'm never going to do that again.
And you have to backtrack a lot, really.
Yes.
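As an illustration of that inside-out reading (my example, not one from the episode), here is "pointer to a function taking an int and returning a reference to an array of three ints", spelled both as one raw declarator and built up with aliases:

```cpp
#include <type_traits>

// Built up step by step with aliases — readable:
using Arr3Ref = int (&)[3];    // reference to an array of 3 ints
using Fn      = Arr3Ref(int);  // function taking int, returning that
using FnPtr   = Fn*;           // pointer to such a function

// The same type as a single raw declarator. To read it, start at the
// innermost (*) and spiral outwards: pointer -> to function taking int
// -> returning reference -> to array of 3 -> of int.
using FnPtrRaw = int (&(*)(int))[3];

static_assert(std::is_same_v<FnPtr, FnPtrRaw>,
              "both spellings name the same type");
```

The `static_assert` confirms the two spellings are identical; a parser has to untangle the raw form token by token, backtracking through the nested parentheses.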
And another funny part is templates. But templates... like, you know, template declarations, they're easy, because you have keywords: you have template, typename, stuff like that.
So, yeah, that's actually a context-free grammar, I believe, if you just stick to template declarations.
And when you get into instantiating them, it becomes real fun.
Especially if you have non-type parameters there.
That's where you have to do
a lot of those evaluations,
which you probably want to avoid
when you're parsing,
because evaluating something
during parsing
is kind of out of place there. But you have
to.
Because the whole meaning can change.
You have to calculate expressions
and stuff like that.
And you're lucky if
you don't get
some call to some constexpr
function which does recursion.
God forbid.
I'm not sure, by the way.
I just thought about it:
can you do recursion in a constexpr context?
Sure, why not?
You can, right.
I was doubting for some reason.
Good idea to terminate it, though.
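Indeed you can; a quick sketch (my example) of exactly the situation described, where the compiler, and any tool that wants to understand the code, has to evaluate the recursion during compilation:

```cpp
// Recursion in a constexpr context is fine. The compiler must actually
// evaluate it at compile time; non-terminating recursion hits an
// implementation-defined evaluation-depth limit rather than looping forever.
constexpr long factorial(int n) {
    return n <= 1 ? 1L : n * factorial(n - 1);
}

static_assert(factorial(5) == 120);      // forces compile-time evaluation
static_assert(factorial(10) == 3628800);
```

The `static_assert`s pass only because the evaluator really runs the recursion while parsing and compiling the translation unit.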
Yeah.
By the way, about constexpr,
I always found it funny that
the standard says
UB is forbidden at
compile time.
Well, to a point. And the same
standard... well, it doesn't say it explicitly,
but it doesn't list
every possible UB, so
how can you, as a compiler, know that you have UB?
There is some UB that is allowed at compile time, for now.
There's some UB allowed?
Yeah.
I can't remember the details off the top of my head.
I think there was a lightning talk at CppCon that went into some of it.
So I'll put a link in the show notes if that's available.
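A common concrete case (my example, not one mentioned in the episode): signed-integer overflow is UB, and inside a constant expression the compiler is required to catch it and reject the program, whereas at runtime it would go undetected:

```cpp
#include <limits>

constexpr int add(int a, int b) { return a + b; }

constexpr int ok = add(1, 2);  // fine: a valid constant expression

// constexpr int bad = add(std::numeric_limits<int>::max(), 1);
// ^ ill-formed if uncommented: signed overflow is undefined behaviour,
//   and a constant expression may not contain UB, so the compiler must
//   diagnose it here — at runtime the same overflow would be silent.

static_assert(ok == 3);
```

This is exactly why a constant evaluator has to track far more than a runtime execution would: it must notice the UB, not just compute a value.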
Well, I've got something even more scary for you: now that we get reflection in C++26, we're going to have
stateful compile-time evaluation. That's what I always wanted, as a C++ parser writer.
But I guess we'll deal with it as it goes.
But reflection will be interesting to deal with,
because you have to really generate code based on patterns,
if I get it right.
Yeah, and the thing about reflection,
the way it's heading into C++26 right now,
is that it's not really just reflection
because you can also splice back
any of the entities that you're dealing with.
So you can take a type, reflect on it,
you get a metatype,
you can add more members to it or whatever,
and then you can splice it back into an actual type
and then declare a member of that,
declare a variable of that type or whatever in your actual code, right?
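As a rough sketch of that reflect-then-splice round trip, using the syntax from the P2996 reflection proposal around the time of this recording (the `^^` and `[: :]` spellings have varied between revisions, and no shipping compiler accepts this yet, so treat it purely as illustration):

```cpp
#include <meta>  // proposed reflection header (P2996), not yet shipping

struct S { int a; double b; };

constexpr std::meta::info r = ^^S;  // reflect: S becomes a meta value
using T = [:r:];                    // splice: the meta value becomes a type again

T obj{1, 2.5};                      // T is just S, so this compiles as usual
```

The interesting part for tooling is the middle step: between `^^` and `[: :]`, arbitrary compile-time code can inspect and build on the meta value before it is spliced back into the program.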
I wonder how code completion tools will work with that.
If I just take a class, add like five methods, and then I declare a variable.
Well, I guess they can get information from what you're writing in reflection for that.
You mean how code completion would work
on a variable of this spliced back generated type?
I guess the tool will have to actually execute
the reflection code at compile time,
just like it has to do today with like constexpr stuff.
Ultimately, it's another form of constexpr evaluation, right?
It's just like that.
It's just that it can also
now declare entities and do things that
normal constexpr can't do.
I used to work with F# quite a lot.
And one of the things that really impressed me in F# is type providers.
Type providers are a way of generating code,
not even at compile time,
but at design time, you might say. You can actually have a type provider that will open a socket
and load stuff over the internet, and then generate code based on that.
And you can do code completion on it.
It just blew my mind.
It just blew my mind.
Hopefully we won't have that, but it does show what's possible.
But I think we're going a little bit down a rabbit hole here.
We should probably come back up to the surface.
We started off talking about this in the context of static analysis.
So in your deep dive into the murky world of C++ parsing,
has any of that actually influenced what sort of things
you actually might want the analyzer to look for?
You know, examples of code that you think,
oh, we should never allow people to write that.
Well, not really because parsing, well, static analysis
deals with some specific cases,
and parsing is for the...
You want to do it in the most general case possible, if you can.
Oh, yeah.
But I certainly want to ban all compiler intrinsics now,
because I've seen a lot of things which don't even fit the grammar.
Like in MSVC or Clang... I think it was Clang: in its type traits
it has the is_same trait, which compares two types, and under the hood
it uses an intrinsic called __is_same. And it looks like a function,
but it accepts two type names as its arguments.
That's fun for a parser, isn't it?
Yeah, that was great,
because that would look like a function type declaration to a parser or something like that, right?
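For illustration (my example, guarded so it also builds on compilers without the intrinsic), here is the oddity in source form:

```cpp
#include <type_traits>

// __is_same is a Clang intrinsic: it looks like a function call, but its
// "arguments" are type names, which the ordinary expression grammar has no
// rule for — so a parser needs a special case keyed on the name.
#if defined(__clang__)
constexpr bool same = __is_same(int, int);
#else
constexpr bool same = std::is_same<int, int>::value;  // portable spelling
#endif

static_assert(same, "int is the same type as int");
```

Nothing in standard C++ lets a "function" take bare type names, which is exactly why such names get handled as one-off rules rather than through the normal grammar.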
Yeah, but we have a list of them.
We just decided: okay, they have names which start with two underscores,
and since the standard says that if you use such names, expect anything, we decided, okay, we can just
do it. We write down the names of those intrinsics and create custom rules for them, so if we
encounter the name, we will apply the rule which parses this intrinsic.
So we are nearing
our time limit for this episode.
So I'd love to
talk a lot more to
you about static analysis and
parsing and the deep,
deep depths of
C++. But I think we have to slowly wrap
up. Before we do that, do you want to
quickly talk about the webinar that you're going to be doing next week?
Yeah, I'm going to be doing the one about parsing next week.
I will be talking about all those problems with declarations, with backtracking and how it leads to exponential algorithms.
And you can get stuck.
And I will touch on grammars a bit and how parsing works in general.
Actually, we are planning to do a series of those webinars. So one is next week, one in November,
which is going to be about semantics, I think, and one in December
is going to be kind of a continuation of semantics.
Right. So, yeah, I guess the link is
in the notes, right? You probably will put that in the notes. Is only the first webinar available as a link so far?
Yes, because it's the only one scheduled, actually.
For the others, we don't have a schedule yet, at least, as I said.
But I know that the second one will be in November,
probably somewhere in the first week of the month.
And the third one, again, the first week of December or something
along those lines, at the beginning of the month.
Okay, so if you listen to this in the future,
then you'll just have to search for that one,
but we'll put the link to the first one in the show notes.
I guess we'll have recordings. So if someone doesn't catch it,
you can always watch the recording later.
Great.
Okay.
Well, then we just have one final question for you.
It's our usual wrap-up question.
What else in the world of C++
do you find interesting or exciting?
Or in the case of static analysis,
maybe dangerous or challenging?
You know, the more I write in C++,
the more I understand why functional languages exist.
Which is funny.
Very interesting take.
Yeah.
So, I don't know.
I guess C++ likes to borrow good ideas
from those languages,
like the ranges library.
I guess I would
want to see more stuff like that.
And maybe Herb
Sutter finishes his
Cpp2 and we will
be using a different syntax
at some point.
Oh, yes, and please ban C arrays and C casts.
They are evil.
Right.
If only we had tools that could analyze your code statically
and spot these things.
If only.
Well, thank you, Yuri, for coming on this episode,
being our guest, talking about static analysis and parsing C++
and all of the, well, the depths of the rabbit hole that can take you down.
So, yeah, thank you for your time, and we look forward to your webinar.
Anything else you want to let people know,
like where people can find you or follow you?
Well, not really.
You know, the thing about me: I'm
not really a social guy, right? So I guess maybe I'll visit some conferences
at some time.
Okay, then all the more reason for people to catch you while they can, in your webinar. So, yeah, thank you.
Right. Thanks very much, Yuri.
And thank you for letting me in.
It's been a pleasure.
Thanks so much for listening in as we chat about C++.
We'd love to hear what you think of the podcast.
Please let us know if we're discussing the stuff you're interested in,
or if you have a suggestion for a guest or topic,
we'd love to hear about that too. You can email all your thoughts to feedback at cppcast.com. We'd also
appreciate it if you can follow CppCast on Twitter or Mastodon. You can also follow me and Phil
individually on Twitter or Mastodon. All those links, as well as the show notes, can be found
on the podcast website at cppcast.com. The theme music for this episode was provided by podcastthemes.com.