CppCast - Parsing and Analysing C++

Episode Date: October 4, 2024

Yuri Minaev joins Timur and Phil. Yuri talks to us about static analysis and how PVS-Studio helps. Then we chat about his work on a custom C++ parser, and what challenges he's encountered.

News

CppCon 2024 keynotes on YouTube (via CppCon site): Herb Sutter - "Peering Forward: C++'s Next Decade"; Khalil Estell - "C++ Exceptions for Smaller Firmware"; Amanda Rousseau - "Embracing an Adversarial Mindset for C++ Security"; David Gross - "Ultrafast Trading Systems in C++"; Daveed Vandevoorde - "Gazing Beyond Reflection for C++26"
Coros - task-based parallelism library built on C++20 Coroutines
"The case of the crash when destructing a std::map" - Raymond Chen
ACCU 2025 Call for Speakers and (super) Early Bird Tickets

Links

C++ Under the Sea
PVS-Studio (download)
PVS-Studio Blog
Yuri's Webinar: Parsing C++

Transcript
Starting point is 00:00:00 Episode 391 of CppCast with guest Yuri Minaev, recorded 1st of October 2024. This episode is sponsored by the PVS-Studio team. PVS-Studio is a static analyzer created to detect errors and potential vulnerabilities in C, C++, C#, and Java code. In this episode, we talk about CppCon 2024, a new coroutine library, and debugging std::map. Then we are joined by Yuri Minaev. Yuri talks to us about static analysis and parsing C++. Welcome to episode 391 of CppCast, the first podcast for C++ developers by C++ developers. I'm your host, Timur Doumler, joined by my co-host, Phil Nash. Phil, how are you doing today? I'm all right, Timur. How are you doing?
Starting point is 00:01:17 I'm good. I've kind of recovered from CppCon. I've been back for a couple of weeks now, one and a half weeks to be precise. And there's the new thing that comes up, which is that in about 10 days we have the deadline for the pre-Wrocław mailing for the committee, which is the mailing for which you have to submit all the papers you want to be discussed at the next committee meeting. So there is a flurry of papers that need to be written and reviewed and commented on, and busy days, because we have quite a few features, like contracts and reflection and pattern matching and others, that are trying to make it into C++26 before the deadline. The feature freeze deadline is early next year, so
Starting point is 00:02:03 uh it's getting tight so so there's a lot of uh it's very busy so so that's kind of what i'm doing right now when i'm not recording cppcast so what about you phil what are you busy with right now well yeah i think i've just about recovered from cppcon as well but um i'm off to c++ under the sea next week oh is that the one in Amsterdam, the new conference? Yeah, that is so exciting. Unfortunately, I can't make it this year, but let me know how it goes. Yeah, I'm really excited to see how that's going to play out. As far as I know, you can still buy tickets for it.
Starting point is 00:02:37 Not for my workshop, though, on coroutines, because I believe that sold out. So it gives you an idea of the level of interest in this new conference. So if you can get there, then you probably should. So I'm busy preparing for that workshop at the moment because that's a brand new one. And then I should be able to maybe take a day off. That sounds like a good idea. All right. So at the top of every episode, we'd like to read a piece of feedback. This week, we have an email from George. George wrote, I was listening to the latest episode. He refersed the idea i began saying to myself no no
Starting point is 00:03:26 no i then proceeded to say hey siri remind me to write to cpp cast that the committee episode is a bad idea so checking off this to-do item having an episode on how the committee works is a bad idea i was glad when i heard that both of you recognized that the majority of listeners would probably not be interested in this. Yeah, apologies to anyone whose Siri just went off listening to that. Mine did. We cut that out. So what do you think, Phil? Should we then just abandon this idea? No, I don't think we should abandon it.
Starting point is 00:04:03 But as we said on the show, we just need to be careful how often we get into that sort of thing and how deep we go. I think some surface level discussion every now and then is fine. Like you were just talking about your preparation for the upcoming meeting in Roslov. So I think that's fine. But yeah, probably having multiple episodes on how deep the rabbit hole goes might be a little bit too much. Yeah. So, I i mean you also have the local uh super sauce meetup here in helsinki that i'm running and we sometimes have
Starting point is 00:04:29 these talks that uh talk about like future stuff and committee stuff and we actually had a couple talks like that just this week and the impression i get there is that most people who come to meetups and i suspect also most people who listen to our podcast are more interested in things that they can actually use in their code today so i think we should really kind of try and focus on that maybe a little bit more. I think this is kind of really what's most relevant for people listening to this. Absolutely. But if people do want to find out more, where can they go? I'm glad you asked. So if you are interested in how the committee process works, we do have extensive documentation about that on isocpp.org slash std. So you can read all about the process there.
Starting point is 00:05:15 Right. Right. So we'd like to hear your thoughts about the show. You can always reach out to us on xmaster on LinkedIn or email us at feedback at cppcast.com. Joining us today is Yuri Minayev. Yuri works on PVS Studio as the C++ Static Analyzer Architect. His primary responsibility is to keep low-level stuff in order and add new features to the
Starting point is 00:05:36 core module. A bit of a surgeon operating on the open heart of a C++ parser, a bit of an orthopedist inventing prosthetics for the legacy code, father of three cats, and a couple of pet projects. Juri, welcome to the show. Hi, Phil. Hi, Timur. Glad to be here.
Starting point is 00:05:53 Hi, Juri. It's an interesting range of responsibilities that you have there, but interesting that you do talk about your pet projects alongside your cats. So you treat your code as pets rather than cattle? Well, my pet projects are kind of my, well, I treat them like my cats, more like. Sadly, I don't have much time for them, but that's okay. I'm not on a deadline, you know. I had, well, you know, at some point I decided, hey, I want to create a game engine.
Starting point is 00:06:34 Sounds great, right? And I tried and I failed. Yeah, because just if you can imagine how many things, how much stuff you need to put in there. So at some point I decided, hey, I'm working on that sort of compiler-y thing, you know, like static analyzer. Why don't I write my own programming language? Of course. And yeah, currently I have project which it's like a functional
Starting point is 00:07:08 python i don't know programming language it started as a calculator i just wanted to you know create a command line calculator and it kind of went out of out of control yeah yeah that's the thing you start off writing a calculator and it soon adds up. Yeah. So at some point you add variables, then you want to add functions, then you do conditionals and it goes downhill.
Starting point is 00:07:37 Yeah. As long as you don't add side effects. All right. So Yuri, we'll get more into your work in just a few minutes but first we have a couple of news articles to talk about so feel free to comment on any of these okay okay so the first one um i have here is about cpp con which phil and i went to just two weeks ago um and they have published their keynotes or at least four out of the five keynotes that happened there.
Starting point is 00:08:09 One by Herb Sutter, one by Khalil Estelle, David Gross, and David van der Voorde. Also, this year's committee panel discussion, which happened there. I haven't seen the fifth keynote yet, the one by Armando Rousseau about safety and security, but I'm sure it's going to pop up soon. They're all not really public in the sense that they're unlisted. So you're not going to find them on Google or on YouTube, but they are online. If you have the link, you can access them. And conveniently, the links have been posted on Reddit. So you can find them there.
Starting point is 00:08:35 And I guess you can also include them in the show notes, Phil. Is that right? Absolutely. And then you can watch them there for free. And they're all pretty amazing talks. So I certainly learned a lot from them. So I'm sure it's going to be uh interesting to some of you as well but just a bit of inside information about why those links are those videos are unlisted at the moment this is something that's common across most of the conferences now that most of the viewings for youtube videos come in
Starting point is 00:09:02 sort of organically for searches so everything is sort of optimized to cater to the YouTube algorithm for that. And so having it on like a staggered release schedule through the year really helps with that, with those numbers. But for those of us that are sort of closer to these things, you just want to get those videos as soon as possible. So for those smaller numbers, it's okay just to give out the unlisted links. So that's what we're all doing now. We're putting them up unlisted.
Starting point is 00:09:28 We can share them with the speakers and with the community. That's not really going to change YouTube statistics. So we get a bit of the best of both worlds. Yeah, I'm sure you're in the thick of it with all the conference stuff, aren't you, Phil?
Starting point is 00:09:40 So you know everything about that. Well, not everything, but more than I should. All right. So talking a little bit more about code that you can use today um there is a new library for task based parallelism which is called chorus it's available on github it's built on top of c++ 20 coroutine so it's not the coro library that's a different library it's called chorus with an s so it's built on top of c++ 20 coroutines it's uh it's a pretty straightforward interface it's a different library. It's called Coros with an S. So it's built on top of C++20 coroutines. It's a pretty straightforward interface.
Starting point is 00:10:07 It's a header-only library, which is, of course, great if you want to integrate it. It's optimized to performance. I guess you kind of have to do that if you do C++. That's kind of the point, isn't it? It also uses std expected for error handling. This is pretty cool. This is a new thing in C++23,
Starting point is 00:10:22 so I guess you do need C++23 to compile it. But it's a pretty cool new little standard uh class to to do error handling which i talked about at some point they use expected doesn't mean that they don't throw exceptions left and right i would assume that they don't but i haven't looked into the code they don't throw exceptions but if they catch an exception in the unhandled exception function yeah so they just drop it to the expected yeah so more more interestingly um the library supports monadic operations so you can chain tasks with things like and then and then you give it like another one and stuff like that and yeah it's all built on coroutines. There's, of course, lots of questions, like how does it compare to other libraries
Starting point is 00:11:06 that do similar things, like Taskflow and LibFork? There's a bunch of them. How does it compare to the whole sender-receiver thing, of which there is also a reference library out there, I think by NVIDIA, I think, has that now, like LibExecution or something. Right.
Starting point is 00:11:27 As well as LibUnifx. Yeah. LibUnifx from Meta is the other one as well. Yeah. So there's a bunch of libraries. I'm not sure what makes this one different or better or why you should use this one, but it's a new project, very modern C++,
Starting point is 00:11:43 which I guess makes interesting uses of coroutine so i thought it was newsworthy to mention uh yeah you know it's uh i think it's also for people who developed it it's a hacker value or whatever it's called like when you do things for the sake of having things done interesting things, and it's like an achievement. Yeah, I can do that. Maybe. Maybe there's really a need for another library. Oh, I think we need lots of these, and we'll see how they shake out,
Starting point is 00:12:18 and then we'll have previous experience that we can standardize because we do need some task-based libraries in the standard that are lower level than std execution stuff because i think that's a that's a much bigger framework for this sort of stuff whereas this one seems to be much lower level more at the like std generator level so it seems to work well alongside that just adds a few of the common features and a bit of um like there's a thread pool you can have different coroutines going off in different threads and then join them at the end with a very simple co-await syntax. And you can even do recursive calls, so you can have a task within a task,
Starting point is 00:12:52 and the outer one will await the inner one and that sort of thing. Stuff that's actually just generally very useful, but not too opinionated. So I think there's definitely a need for this sort of library. Yeah, and that's the stuff that's very tricky to get right if you want to do it yourself you had a few talks about that and uh yeah it's good to have just kind of libraries that do that for you i tried to understand that and i was terrified by how much stuff you need to write by hand just to make it work without anything, just with the standard library. Yeah, I think that's why my workshop next week sold out.
Starting point is 00:13:32 Yeah, I wish I could attend your workshop. I actually spent quite a lot of time trying to understand coroutines recently because we are trying to make contracts like pre and post conditions work with coroutines and um for that i had to actually understand how coroutines work or at least to an extent sufficient to understand what was going on there and that was a deep dive that um i don't know i that's probably a bit too much information about what's going on there information is the starting point so so yeah i think i think i would really enjoy attending your workshop, Phil, to kind of make sense of it all. But unfortunately, I can't be there.
Starting point is 00:14:09 Maybe there's going to be another opportunity. Oh, yeah. So I have one more news item, which is, again, about kind of a very practical thing. And it is another great blog post by the never tiring Raymond Chen, who is writing multiple amazing blog posts every week. I don't know how he does that. This one stood out even from his usual blog posts to me.
Starting point is 00:14:32 It's called The Case of the Crash When Destructing a StdMap. And it's all about debugging. So a customer was getting a crash in the destructor of StdMap in the Microsoft STL and just send them the crash log and said hey my code is crashing it looks like your stl is broken can you fix that and the whole blog post is basically about raymond trying to figure out what's actually going on like figuring out what the bug is basically just from the information that the the customer sent them and it's just an amazing read it's like um he starts by reading the source code of std map
Starting point is 00:15:04 like in the stl implementation that they have trying to figure out like what's going on like okay what are the pointers and no's there like what refers to what okay great and he looks at the disassembly he maps that back to the actual code then he figures out where like the the corruption the memory corruption happens that triggers the actual crash and he guesses okay it must be like probably this kind of error code and then he goes on and on and on and on and like oh okay this this this is not happening here okay let's dive deeper oh okay it must be that and then at the end of that he finds the bug which is uh completely unrelated to sitmap
Starting point is 00:15:41 at all and it's an asynchronous IO operation that was not even happening in the client code. It was happening in some other library that they were using where there was a bug that there was an asynchronous IO operation that was started and then kind of just, you know, shot into space and never joined or anything. And then it's like, okay, we have waited for a second now for this operation to complete.
Starting point is 00:16:07 Okay, whatever, we're just going to abandon it. And then you kind of just go on and like do something else in that memory. And then much later that IO operation comes back and says, oh, I'm done now. And kind of writing its result in like that memory, which just happens to override the root node of that std map somewhere
Starting point is 00:16:23 in a completely entirely different place. I liked the analogy he used there. Like, imagine you hire a demolition company to knock out your house, but they can't just go now. So they say, we'll come later. So you just, you're tired of waiting for them and you sell your house and eventually they come and knock that house down although somebody else is living there now it's kind of sold a million years ago and and here oh my surprise right no that's great i mean this whole article is just you know know, Jedi master level debugging.
Starting point is 00:17:05 Like you can just learn so much just from reading that. Like how he thinks, how he approaches the problem. It's just masterclass. I really enjoyed that. So the thing that struck me was how he actually has time to do all this debugging when he's knocking out so many articles every week. Maybe he's delegating articles.
Starting point is 00:17:23 I mean, I'm sure he can debug. Maybe, maybe we'll have to try and debug that. Okay, so Timo said that was the last news item, but I sneaked one more in. So I'll take this one because we actually launched the ACCU 2025 call for speakers this week, breakfast yesterday. So it's breaking news news so if you want to
Starting point is 00:17:47 speak at next year's accu which will be running from the 1st to the 4th of april tuesday to friday then that is open now and we've also opened early bird tickets in fact we're calling them sort of um it's not explicitly called super early bird but it's like effectively super early bird because there's like an additional discount you can get and the discount is going to decrease over time during the early bird period so the sooner you book cheaper the tickets so pause the podcast now go to the website get your tickets come back and then finish the show all right yeah it's an amazing conference i have been there quite a few times now um i hope uh i can be back next year i'm pretty sure i can make it uh do you want to remind people where it is yeah that's in bristol in the uk so it's the
Starting point is 00:18:35 venue it's been that in for getting over 10 years now so if you've been there before you should be familiar and it's uh although it's sort of in the west of england it's very very accessible from from london especially heathrow uh just a straight line train hour or so hour and a half maybe from heathrow so yeah come to accu 2025 and that is the last news item so we can now get to uh to yuri who you were on um cpb cars back in 2020 back when rob and jason were running the show of course so what have you been up to in the four years now a lot of things actually well let's see i'm i developed a new type system for PVS Studio. We are currently working on a parser.
Starting point is 00:19:29 I have three cats now. In 2020, I had one. I finished Dark Souls like 20 more times. Lots of stuff. So definitely busy then. Yeah, it's been productive, especially the cats and dark souls parts of it. I'm sure they take a lot of time.
Starting point is 00:19:53 Cats, yeah, they do. Yeah. But static analysis then. PVS Studio is a static analyzer for C++ and a number of other languages, as we said. Even since we were last on, we've had a few shows about static analysis. But maybe you can just remind us what static analysis is and why you would want to use it. The way I usually talk about that is, well, imagine a compiler, right? What it does, it understands your code, and then it generates code for your computer.
Starting point is 00:20:31 Static analyzers, they do the same thing, but they don't generate code. They analyze your code and tell you where you made mistakes, basically. Well, you need it really just to make sure your code, because before it goes to production, that it is as bug-free as you can do it. Because, well, usually we even have a diagram like cost of an error, of fixing an error. If you catch it early, it's cheap. If you catch it when, I don't know, your software is uploaded to a spacecraft or something, that's really expensive.
Starting point is 00:21:20 So, yeah, you might want to use static analysis just because it will catch for you things that people usually overlook. Because people are not good at noticing small things. Like you forgot accidentally to check, I don't know, a point or something. It's easy to forget. And it's difficult to see, especially if you're reviewing a lot of code by yourself. So that thing is just a little helper that can help you find obvious mistakes. And sometimes not so obvious too.
Starting point is 00:21:57 But obvious mistakes are the main focus because, like I said, people are not good at noticing small patterns and stuff like that yeah i think those obvious mistakes when you discover them they can always be well they can be a bit embarrassing like how did i let that slip in so they're the ones you really want to get at least the non-obvious ones you got sort of some respect like nobody could afford that well uh the nasty thing about those small things small mistakes is that sometimes you won't just notice them sometimes you crash uh and sometimes it's ub oh yeah and you just don't know what happens yeah we've had a lot of um esic bugs in our own code, which I probably shouldn't say, but
Starting point is 00:22:48 who cares? When something really weird happens, like a bug happens only when you consume over a gig of memory or something. And many of them were UB. Yeah, so I've seen quite a few conference talks by people like you and your colleagues, usually about weird bugs and how they happen and it's usually quite interesting material.
Starting point is 00:23:18 But so the description mentions that it's a static analyzer for C, C++, C Sharp and Java. So that's's a static analysis for C, C++, C Sharp, and Java. So that's quite a few different languages. Can I ask what language the actual analysis tool is written in? Well, C and C++ ones are in C++. C Sharp one is in C Sharp. And Java is, surprise, in Java.
Starting point is 00:23:43 Oh, so they're separate products, separate code bases. Three of them. Interesting. Right now, some people here are dreaming about making a unified engine or something which can handle them all. But I don't know.
Starting point is 00:24:00 Because I don't want to write a C Sharp parser or something like that, because C-sharp and Java, they use imports instead of headers. And in C++, it's really convenient. You have a CPP file, which includes everything it needs. And you can analyze the entire preprocessed output. And in C-sharp and Java, you have to build that model.
Starting point is 00:24:27 You have to load those inputs and stuff like that. That's why for C Sharp and Java, they use Roslyn. And for Java, Spoon. They use it for their front end. Because writing that from scratch, I think it's a little bit more difficult yeah yeah it always seems a bit almost wasteful to have completely separate implementations of these parsers and analyzers in different languages but every attempt i've seen to try to unify them and i've worked at a few places now that have come down this path it's never quite
Starting point is 00:25:05 worked out right so so how different are these i assume that you work on the c++ one since yeah we're here yeah i wonder how different like the problems are because in c++ obviously we have undefined behavior we have all of these things like you know invalid pointer dereference and whatever out of range that I guess you can detect. Whereas in C Sharp and Java, I guess those things are still bugs, but they're not undefined behavior. So you get an exception.
Starting point is 00:25:33 Is that like a completely different game in C++? Or is it relatively similar? What kind of problems do you find? I'm not exactly sure how like the other parts work internally because i'm not um i'm not going there a lot uh but i guess they have also no references and stuff like that well yeah you will crash uh it's not ub but still you have to catch these errors and fix them. And in Java, I remember they did some checks when new versions of Java were released. They did checks for if you have some code which becomes obsolete in new versions,
Starting point is 00:26:26 they pointed that out in the Java analyzer to the user. So you can use an older code base, run the analysis, and you will see parts you need to rewrite for the new version of the language. And that was a request from a customer, I believe. Do you distinguish between what we might call actual bugs things that are actually wrong and will lead to problematic behavior versus what we might call code smells or you know violations of best practices the where the code may itself be correct but it's going to lead to problems well we usually have uh like diagnostic levels so it will if it's sure that okay that's a bug that shouldn't be there it will complain with a high priority and if it's like we can make we can let it slide kind of it's not a bug but don't do that it will complain on the lower level of
Starting point is 00:27:27 priority so the alarm level is different but like in in actual code when checks occur they work the same usually for well well, there's no clear distinction. Okay, that's a bug and that's a code smell. For code smells and minor stuff, we usually check the context, some surrounding conditions or something. We can, by using circumstances around the place where this happens, the error is posted. We can guess, okay, probably this was supposed to be there.
Starting point is 00:28:13 He knew what he was doing. Or probably that was a copy-paste and something like that. Right. So they're context-aware. Yeah. And that presumably implies um symbolic execution yeah we have data flow and symbolic execution right yeah yeah which um in general is like solving the hoarding problem but in limited context you can you can actually get quite far with it
Starting point is 00:28:41 so i i'm actually i actually have a question about this i'm because i'm trying to wrap my head around around the stuff uh kind of uh right now so there is program execution right which is you just run the code and you know for example if you call a function you know what the inputs are right because you literally call the function with that input and that can happen during runtime or during compile time, which is what we call constant evaluation, like constexpr and stuff. But symbolic evaluation is different, right?
Starting point is 00:29:12 Instead of just calling a function with an input, you're kind of calling it with every possible input or something like this. Is this like a vaguely right? Or what is symbolic evaluation exactly? We go through the function's code inside. And based on how different entities inside that function are used, we can infer some facts about them. Like, for example, let's say if you have a function
Starting point is 00:29:42 and it accepts a pointer, and the first thing you do is if not PTR return, right? So that PTR gets a label never null pointer and stuff like that. For pointers, it's just really simple. But for example, if you have if a minus B equals zero, you enter the if body. Inside that if body, we know that A equals B because A minus B equals zero, if it makes sense. So yeah, you just gather different facts. It's similar to what an optimizer does on an analyzing path.
Starting point is 00:30:29 So like LLVM, for example, we don't have IR, but LLVM uses IR. What it does, it first goes through instructions and collects data, some information about them, and then it uses that information to optimize. It can infer that some variables are, even if you don't label them const in your code explicitly, it can infer that they never change. It can throw them out of your function. It can inline functions.
Starting point is 00:31:03 If it infers that a function always returns the same value, it can just replace the entire function call with it. So that kind of thing. It's like an optimizer, but the real optimizer is more sophisticated than symbolic execution. Does it work across different like across like different functions or even different station units for example if i have one function that returns a pointer but then
Starting point is 00:31:32 something in that function like like makes you be able to reason that that pointer will never be null and then you pass that into another function uh can you then do you then know that you know that function uh do you like look at every function. Do you then know that function? Do you look at every function in isolation? Or can you look at the actual flow and reduce that that function call must be correct? Because the other function that you get that pointer from never returns a not pointer or something. Well, yeah, it can do that. We annotate functions when we analyze them.
Starting point is 00:32:05 We analyze one function, we annotate it like we can say that it never returns a zero or a null point or something. There are also manual annotations like we know that malloc allocates memory and free returns memory. How do then we know that? Because we manually
Starting point is 00:32:22 annotated malloc and free. But automatically we also create such annotations. and we know that because we manually annotated malloc and free, right? But automatically, we also create such annotations while analyzing. And when you call a function, we can know something about its return value, for example. It's kind of limited, so it's not as good as, I don't know, LLVM's optimizer, but it's able to understand things like that. And for across modules, for that, we did a special kind of quasi-linker.
Starting point is 00:33:09 So to do intermodular analysis, we first go and collect information about modules and store it in a binary format. And then we call analysis and use that collected information to understand what functions from other CPPs do. So I got one last question about this, if I may. So let's say I have a code base and I want to like throw PVS Studio at it and see if it finds any bugs. Like, how do I do that?
Starting point is 00:33:34 Like, do I need to have a CMake file or a Visual Studio solution or like something else that tells the tool, like these are the source files that belong to the project and this is how they are related and this is how I compile them or like a compilation database or like how do you figure out
Starting point is 00:33:54 what the actual code will be that you're going to look at? Let's see. So you can do a Visual Studio solution because there's an extension for Visual Studio. You can do integration in VS Code, KeyUtopreator, something else. I don't remember. It supports CMake. You can give it a special file, compile comments, JSON. Some tools like CMake can generate it.
Starting point is 00:34:26 It's a special JSON file which contains all the compilation commands on your project. You can also, under Linux, you can use strace to trace compilation commands and then a special tool that we created will parse the output, and it will collect all the flags and paths to source files and stuff like that and run analysis on them.
Starting point is 00:34:57 Under Windows, you can use Compiler Monitor, which is also a tool from us, which basically you start it in the monitoring mode and you build your project and it will catch compiler calls. And then it will start analysis on CPP files. So there are different ways to do that so in some cases you do actually interpret the the project model well not exactly project model because usually well in visual studio for example ms build give us it gives us all the information we need. In CMake, CMake does that. Basically, we just integrate and listen.
Starting point is 00:35:50 Yeah, so we don't have like... You know what's funny? I was about a year ago, maybe I had that crazy idea that we need to do a PVS tool chain. You have a CMake, let's say you have a CMake, and you say build it, but instead of your compiler, use PVS, and then we'll have compiler and the linker
Starting point is 00:36:16 and stuff like that, and we will just work as a tool chain. But it's still like a dream. I think it would be interesting. Yeah, that's the interesting thing about build tools and toolchains is nobody really wants to have to deal with them, but they'll want to do it exactly their way. Bit of a contradiction.
Starting point is 00:36:38 So is it easy enough to integrate with your existing tools or do you have to do a lot of configuration? Usually people just set up PVS Studio in their CI, and it just runs after builds. Right. Okay. That sounds reasonable. And you mentioned LLVM. We have Clang tooling, and part of that is Clang Tidy,
Starting point is 00:37:05 which is already a static analyzer. So what does it do differently? What does PVS Studio do differently to, say, Clang Tidy that we already have very often built into our IDs? If I had a dollar for every time I'm asked this question, right? Tidy is a little linter. As far as I know, it doesn't have much of semantic information so if you if you want to compare pv studio with something i guess uh clang static
Starting point is 00:37:36 analyzer would be a better right comparison but as for tidy well's good, but it doesn't dive deep enough, I think. And client-state analyzer, I'm not even sure in which state it is now. I think it's not big enough yet. I mean, it doesn't catch as many errors if if i know correctly because i haven't looked at it uh for a long time right yeah i mean it's it's definitely still evolving and i know some issues that i saw with it even like a year or so ago have been fixed so it's getting better but um you're right that there's a there's a limit to how far it goes. Tidy has a really great advantage, I think, because you can write your own rules there.
Starting point is 00:38:30 You can write those AST matchers, and they will work like custom checks. Right. And it can be surprisingly easy to get up and running with that if you've got a specific custom need, it's definitely something to look at. Well, usually what people hear say from people who understand marketing and sales better than me, because I know nothing about it. We have user support and we have integration everywhere.
Starting point is 00:39:03 So it's kind of like a big advantage over free tools especially which like you take it and you do it and you're on your own in many cases yeah yeah and one thing i like to say is um you know you don't have to just use one static analysis tool you can use a whole range of them they often often complement each other. If you don't sync in their output, if you're not afraid to sync in their output because you will be
Starting point is 00:39:33 bombarded with a lot of things, especially if you run for the first time and you don't ignore stuff which is irrelevant, you don't disable checks which you don't really want. Oh, it's like was it tidy?
Starting point is 00:39:49 It's like tidy was complaining oh, you're passing a parameter here and you don't change it inside a function. Make it const reference and when you do it make it a const reference it starts complaining, hey, what are you doing, dude?
Starting point is 00:40:05 The object is really small. Just pass it by copy. Right. Yeah, there can be a lot of false positives like that. You need to tune. Well, I mean, what I mean by that is not that tidy is bad or anything. It's just you have to set up
Starting point is 00:40:21 what you want to see in the output, really. Yeah, yeah. Yeah, exactly. Not everything is an obvious bug. Okay. So, Yuri, we're going to continue the conversation with you in a minute. But before we do that, we have a few words from our sponsors this week. PVS Studio is a team that develops its own tool,
Starting point is 00:40:38 a static code analyzer for C, C++, C Sharp, and Java code. Large legacy projects require the full focus of the development team and newer projects require experienced support. With a static analysis tool at your disposal, your team will write even safer, cleaner, and more secure code. The static analyzer has more than 1,000 diagnostic rules and a special mode
Starting point is 00:40:58 for mass suppression of warnings in legacy code. It can find dead code, typos, potential vulnerabilities, and much more. If you haven't tried PVS Studio yet, now is a great opportunity to do it. Podcast listeners can get a one-month trial of the analyzer with the CPP Cast promo code. You can find the link to download the tool in the description as well. I guess description means you're going to put it in the show notes, right, Phil? Yes, yeah, that'll be there.
Starting point is 00:41:23 Go and have a look. So yeah, a bit of a coincidence that we have pbs studios a sponsor for this episode not entirely a coincidence we sort of tried to coordinate those things so that that'll work out quite nicely but um yeah great opportunity to to go and try it out yourself so having talked about pbs studio um just want to change gears a little bit because you did mention working on a C++ parser in your bio. And that interested me because I assumed that you would do something like clang tooling and build on top of that.
Starting point is 00:41:55 But are you actually doing the parsing yourself? Yeah. We are doing our own thing. Clang would be great, but not good enough. Let's say, I don't mean again that Clang is bad, but how we typically work, we support different compilers from different platforms. I mean, code written by them. And what we do, we get a CPP file as input, and we ask the compiler to preprocess it and give us the preprocessed version of the file. Right. Without macros, includes, and stuff like that.
Starting point is 00:42:34 And then we parse that. And the problem is that Clang doesn't support everything. There are compilers, especially in the embedded world, which have non-standard types, non-standard intrinsics, and stuff like that, and we have to understand that. But Clang doesn't support all of them. Clang would be great if we just did Linux and Windows, MSVC, GCC kind of stuff. Well, actually, embedded compilers are usually GCC-based,
Starting point is 00:43:11 so we could cover a lot of them, but not all. I remember supporting some 8-bit compiler which had types like int24, int40, and it had a bit modifier, so you could manipulate bits directly somewhere in memory. And its pointers were a flexible size. They could be from 4 to 12 bytes long, I think, addresses. It's been a long time ago,
Starting point is 00:43:48 so I don't remember the details, but the idea is kind of like that. And one more thing, which I hope dies soon, is we support Microsoft C++ CLI, which is
Starting point is 00:44:02 C++.NET. Don't support it too well, but we can at least parse it. And Clang, I think if we gave Clang something like that to parse, it would tell us to go and do something right yeah no that that makes sense but then building your own c++ parser that that sounds like a a pretty big project so well how's it been worth it first of all what have the challenges been
Starting point is 00:44:40 oh challenges uh in c, challenges start at grammar. Right. You know? I'll often end there as well. If I may propose a feature for some upcoming standards, can we please have a keyword for functions to define a function declare? You mean like FN or or yeah yeah something like that i have a proposal on the back burner for exactly that if you knew how easier it would make the job of a compiler to know that uh because well we do recursive descent in the parser. So you take the grammar and you go left, right, from the top level grammatic rule and down, down, down, down, down to the bottom.
Starting point is 00:45:35 You reach something like a literal there or an identifier or something and you go up and you parse things. The problem is, by the way, recursive descent, because that's the only algorithm you can actually write, you can actually code, because others require some generated tables, state machines, and stuff like that. And recursive descent, it has some limitations it imposes on the grammar. For example, you can't have two rules in your grammar which start with the same thing.
Starting point is 00:46:12 That's why I want a keyword for function, because both function and variable, they start the same. You have type specifier, you have a declarator. And the difference is after the declarator, when you either have an initializer or parameters. And the other thing is left recursion. In your grammar, if you have left recursion, like a rule is defined using itself. Usually it's for binary expressions, like, you know,
Starting point is 00:46:43 binary expression is binary expression operator and an expression with a higher priority. So the grammar of C++ is not well suited for recursive descent. But everyone does
Starting point is 00:47:00 recursive descent. Clang does recursive descent. GCC does recursive descent. We do recursive descent. Iang does recursive descent. GCC does recursive descent. We do recursive descent. I think Microsoft does something else. Maybe. I'm not sure about Microsoft. I know they did a token stream-based approach, but I think that was in contrast to AST-based, which they now do more of as well. I forget whether they do it entirely now or they still do the thing where they backtrack when they the original approach doesn't work anymore they have to
Starting point is 00:47:29 backtrack and then try an ast based but i'm not sure if that's the difference between so so i know a little bit about this i have never written my own parser from scratch but i have rewritten large parts of an existing parser when i was at JetBrains all the way back in 2018 I rewrote like large parts of their C++ parser at the time which was written in Java that was also fun but that's a different story
Starting point is 00:47:56 but I remember that I like this kind of realization that I had because when you for example want to parse a declaration or anything anything interesting you have to like there's there's all of these ambiguities where you have to like try and pass it as a type or try and pass it as a variable or try and pass it as a function call or whatever and then if it doesn't work you're gonna have to rewind and start again yeah
Starting point is 00:48:19 but what but a lot of the time like what you're actually parsing depends on whether some identifier x is itself type or a variable, right? But in order to find that out, you might have to basically execute arbitrary constexpr functions anywhere and instantiate you have to interpret half your program sometimes in order to make sense if there's a lot of constexpr stuff going on in order just to continue parsing and and yeah i think that's the moment when i realized how messed up c++ actually is yeah what's what's your take on this i agree that it's really messed up uh You know, the worst part of C++ is C, really. Because we have to carry that legacy. All those declaration syntaxes, which don't really make sense, they're inconvenient. Oh, yeah.
Starting point is 00:49:22 I did rewrite that specific part from scratch. I remember I found out just how many places you can surround something with parens and it's still part of this but and wait until you get to function pointers which get references to arrays and something like that like a pointer to a function which they could it takes a pointer yeah yeah you kind of have to pause it inside out and it's kind of gets very weird yeah yeah i've been there and done that i'm never going to do that again and you and you have to backtrack a lot really yes and another funny part is templates but templates like like, you know, template declarations, they're easy because you have keywords, you have template, type name, stuff like that.
Starting point is 00:50:12 So, yeah, that's actually context-free grammar, I believe, if you just stick to templates. And when you get into instantiating them, it becomes real fun. Especially if you have non-type parameters there. That's where you have to do a lot of those evaluations, which you probably want to avoid when you're parsing because evaluating something
Starting point is 00:50:41 during parsing, it's kind of out of place there. But you have to. Because the whole meaning can change. You have to calculate expressions and stuff like that. And you're lucky if you don't get
Starting point is 00:50:57 some call to some constexpr function which does recursion. God forbid. I'm not sure. By the way, I just thought about it. Can you do recursion, God forbid. I'm not sure, by the way, just thought about it. Can you do recursion in constexpr context? Sure, why not? You can, right.
Starting point is 00:51:13 I was doubting for some reason. Good idea to terminate it, though. Yeah. By the way, about, I don't know, about constexpr, I always found it funny that the standard says ub is forbidden in
Starting point is 00:51:29 compile time. Wow, to a point. And the same standard says that can a ub, well, it doesn't say it explicitly, but it doesn't list every possible ub, so how can you as a compiler know that you have a UB?
Starting point is 00:51:48 There is some UB that I would contact for now. There's some UB allowed? Yeah. I can't remember the details off the top of my head. I think there was a lightning talk at the CppCon that went into some of it. So I'll put a link in the show notes if that's available. Well, I got something even more scary for you now that we get reflection on css 26 we're going to have stateful uh uh compile time evaluation uh that's what i actually always wanted but as a C++ R server writer.
Starting point is 00:52:26 But I guess we'll deal with it as it goes. But reflection will be interesting to deal with because you have to really generate code based on patterns, if I get it right. Yeah, and the thing about reflection, the way it's heading to C++ 26 right now is that it's not really just reflection because you can also splice back
Starting point is 00:52:51 any of the entities that you're dealing with. So you can take a type, reflect on it, you get a metatype, you can add more members to it or whatever, and then you can splice it back into an actual type and then declare a member of that, declare a variable of that type or whatever in your actual code, right? I wonder how code completion tools will work with that.
Starting point is 00:53:22 If I just take a class, add like five methods, and then I declare a variable. Well, I guess they can get information from what you're writing in reflection for that. You mean how code completion would work on a variable of this spliced back generated type? I guess the tool will have to actually execute the reflection code at compile time, just like it has to do today with like constexpr stuff. Ultimately, it's another form of constexpr evaluation, right?
Starting point is 00:53:51 It's just like that. It also like declares, it can now declare entities and do things like, like just normal constexpr can't do. I used to work with F sharp quite a lot. And one of the things that really impressed me in a shop is type providers and type providers are a way of generating code, not even that compile time,
Starting point is 00:54:11 but that design time, you might say you can actually have a type provider that will open a socket and load stuff over the internet and then generate code based on that. And you can do code completion on it. It just blew my mind. Hopefully we won't have that, but it does show what's possible. But I think we're going a little bit down a rabbit hole here. We should probably come back up to the surface.
Starting point is 00:54:37 We started off talking about this in the context of static analysis. So in your deep dive into the murky world of C++ parsing, has any of that actually influenced what sort of things you actually might want the analyzer to look for? You know, examples of code that you think, oh, we should never allow people to write that. Well, not really because parsing, well, static analysis deals with some specific cases,
Starting point is 00:55:06 and parsing is for the... You want to do it in the most general case possible, if you can. Oh, yeah. But I certainly want to ban all intrinsics now of compilers because I've seen a lot of things which don't even fit grammar. Like in
Starting point is 00:55:32 MSVC or CLAN I think it was CLAN it has in type traits it has the is same trait which compares to types and under the hood it uses an intrinsic
Starting point is 00:55:47 called is same and it looks like a function but it accepts two type names
Starting point is 00:55:54 as its arguments that's fun for a parser isn't it yeah that was
Starting point is 00:56:01 great because that would look like a function type declaration to a parser or something like that, right? Yeah, but we have a list of them. We just decided, okay, so they have names which start with two underscores. And since the standard says if you use such names so expect anything so we decided okay we just can do it we just write names of those intrinsics down and create custom rules for them so if we encounter the name we will apply the rule which parse is this intrinsic. So we are nearing
Starting point is 00:56:47 our time limit for this episode. So I'd love to talk a lot more to you about static analysis and parsing and the deep, deep depths of C++. But I think we have to slowly wrap up. Before we do that, do you want to
Starting point is 00:57:03 quickly talk about the webinar that you're going to be doing next week? Yeah, I'm going to be doing the one about parsing next week. I will be talking about all those problems with declarations, with backtracking and how it leads to exponential algorithms. And you can get stuck. And I will touch on grammars a bit and how parsing works in general. Actually, we are planning to do a series of those webinars. So one is next week, one in November. It's going to be about semantics i think and one in december is going to be like kind of a continuation of semantics right so yeah i i guess the link is
Starting point is 00:57:57 in the notes right you probably will put that in the notes. Is only the first webinar available as a link so far? Yes, because it's the only one scheduled, actually. For others, we don't have a schedule yet, at least as said. But I know that the second one will be in November, somewhere first week probably of the month. And the third one, again, first week of December or something along those lines at the beginning of the month. Okay, so if you listen to this in the future,
Starting point is 00:58:35 then you'll just have to search for that one, but we'll put the link to the first one in the show notes. I guess we'll have recording, So if someone doesn't catch it, you can always watch recording later. Great. Okay. Well, then we just have one final question for you. It's our usual wrap-up question.
Starting point is 00:58:55 What else in the world of C++ do you find interesting or exciting? Or in the case of static analysis, maybe dangerous or challenging? You know, the more I write in C++, the more I become, the more I understand why functional languages exist. Which is funny. Very interesting take.
Starting point is 00:59:17 Yeah. So, I don't know. I guess C++ likes to borrow good ideas from those languages, like the ranges library. I guess I would want to see more stuff like that. And maybe here
Starting point is 00:59:36 Sutter finishes his CPP2 and we will be using a different syntax at some point. Oh, yes, and please ban C arrays and C costs. They are evil. Right. If only we had tools that could analyze your code statically
Starting point is 00:59:55 and spot these things. If only. Well, thank you, Yuri, for coming on this episode, being our guest, talking about static analysis and parsing C++ and all of the, well, the depths of the rabbit hole that can take you down. So, yeah, thank you for your time, and we look forward to your webinar. Anything else you want to let people know, like where people can find you or follow you?
Starting point is 01:00:23 Well, not really. You know, the thing about me i'm not really a social guy right so i i guess maybe i'll visit some conferences at some time okay then all the more reason for people to catch you while they can in your webinar. So, yeah, thank you. Right. Thanks very much, Yuri. And thank you for letting me in. It's been a pleasure. Thanks so much for listening in as we chat about C++.
Starting point is 01:00:55 We'd love to hear what you think of the podcast. Please let us know if we're discussing the stuff you're interested in, or if you have a suggestion for a guest or topic, we'd love to hear about that too. You can email all your thoughts to feedback at cppcast.com. We'd also appreciate it if you can follow CppCast on Twitter or Mastodon. You can also follow me and Phil individually on Twitter or Mastodon. All those links, as well as the show notes, can be found on the podcast website at cppcast.com. The theme music for this episode was provided by podcastthemes.com.
