CppCast - BrontoSource and Swiss Tables

Episode Date: July 3, 2025

Matt Kulukundis joins Timur and Phil. Matt talks to us about BrontoSource, his start-up focused on refactoring, updating or migrating large codebases, as well as his work on Swiss Tables.

News
- Herb Sutter's WG21 Sofia, Bulgaria trip report
- End of active development on jemalloc
- "Amortized O(1) complexity" - Andreas Weis' lightning talk
- Reddit discussion of filter view issue

Links
- Acronyms on cppreference.com
- Arthur O'Dwyer's acronym glossary
- Matt's Swiss Tables talk at CppCon
- Example of BrontoSource integration in Compiler Explorer

Transcript
Discussion (0)
Starting point is 00:00:00 Episode 401 of CppCast, recorded 27th of June 2025. In this episode, we talk about the latest news regarding C++26, the retirement of jemalloc, and the algorithmic complexity of filter_view. Then we are joined by Matt Kulukundis. Matt talks to us about BrontoSource, an AI-powered tool to modernize large C++ code bases.
Starting point is 00:00:54 Welcome to episode 401 of CppCast, the first podcast for C++ developers by C++ developers. I'm your host, Timur Doumler, joined by my co-host Phil Nash. Phil, how are you doing today? I'm good, Timur. A little bit tired, but how are you? Yeah, I'm also quite tired. For context, I just came out of a very intense one-week-long C++ committee meeting. We're going to talk about that later. Going straight into a conference, which was very good, but also very intense, where I was also keynoting, then had a bunch of meetings afterwards, then got onto a plane back to Finland, arrived very late last night, and then had something like five hours of sleep and just got up.
Starting point is 00:01:31 So I think I'm okay. I think, Phil, you had even less sleep than I did. Yeah, I did. And I am still at that conference that you mentioned that I happen to be running as well. So it's about 20 past five in the morning for me here, which is not quite as bad as we had it for for MetcodPolt in the last episode, but I'm feeling it a bit. So hopefully we'll make it through this episode and that's all I can say. Okay, so at the top of every episode we'd like to read a piece of feedback. This time you received an email from Raga Vendra. I have been following CWVcast for quite a while now. The show is going amazing. Would love to know how to navigate the jargon in the C++ space.
Starting point is 00:02:15 Better understand the concepts being talked about. Cheers. Hmm. Interesting. So are we too jargony? I mean, I think if you think about like CRTP, CTAD, there is a lot of jargon in C++. Yeah. Right. So I remember somebody somewhere had a list of all of those things. I don't know, it wasn't
Starting point is 00:02:37 somebody's blog. There's a list of all the acronyms, at least. Maybe that would help. Yeah. If you're ever in a live context, I actively recommend people, if you're ever in a live context, just ask. If, if you're stuck on something, odds are someone else in the room is stuck on it too, and is also not asking. Okay. So I just looked this up. There is a long list of acronyms on cppreference.com. Actually, that's the first thing that pops up if you Google C++
Starting point is 00:03:08 acronyms. And then there is an even longer list with very detailed explanations on Arthur Dwyer's blog. And that blog post is called a C++ acronym glossary. And that's the one that I remembered somebody did this. So it was Arthur. Yeah. So that is a very detailed explanation of every single one. So that's just the acronyms. I don't know. Maybe there's other jargon that we're using, which is not acronyms, but also confusing.
Starting point is 00:03:31 So, um, yeah, apologies for that. And thank you so much for the feedback. I think the worst ones are where we reuse a word that has a general meaning and then we have a very specific meaning and you have to know which context you're in to know which one we're talking about. Yeah. To grasp the concept, if you will. There you go. Like, can you think of something? As Matt just said, just the word concept. Very easy to be talking about the general concepts or... Oh, just the word concept. Yeah, absolutely. Yeah. Yeah, it's funny. When I talk about C++ stuff,
Starting point is 00:04:03 I catch myself wanting to say concept if I don't mean concept, the feature. And then I say something like idea or notion instead. But it's just extra seconds to think about and come up with a different word. Just adds an extra constraint. Well, thank you so much for the feedback, Raghavendra. We'd like to hear your thoughts about the show. If you're listening to this, you can always email us at feedback at cppcast.com. So joining us today is Matt Koulokandis. Matt is the CEO and co-founder of BrontoSource, a startup that builds tools to modernize legacy code bases at scale with a focus on a C and C++ space. Prior to that, Matt spent 11 years at Google, where he led the software ecosystem organization
Starting point is 00:04:46 as a principal engineer. During that time, he designed language and library features for migration, as well as directly planning and executing multiple migrations across Google's entire code base. Russ's SturtCollections hash map and Go's map are based directly on his Swiss table work. When he isn't trying to figure out
Starting point is 00:05:04 how to rewrite all of the world's code, he scuba dives every chance he gets. Matt, welcome to the show. Hey, it's great to be here. I have a long time listener, first time caller, if you will. So, yeah, it's great that you join us. Actually, last time we had, no, the episode before last time, we had Kristen on the show and she made the connection. She said, Matt would be a great guest. And I quickly Googled you and I thought, yes,
Starting point is 00:05:31 Matt would be a great guest. And then Phil agreed. And so here we are. So thank you again for joining us. My pleasure. Kristin is an amazing networker and also an incredible engineer. I worked with her for a number of years at Google. and also an incredible engineer. I worked with her for a number of years at Google. You talk about C++, Rust, Go, but scuba diving as well. So I think we're missing deep sea from the list. Yes.
Starting point is 00:05:55 Most of my scuba diving is relatively shallow. I haven't advanced up in water, but generally speaking, I only go around sort of 15 to 22 meters depth. Right. I think most of my C++ is quite shallow as well. Generally speaking, I only go around sort of 15 to 22 meters depth. Right. Yeah. I think most of my C++ is quite shallow as well. Okay. Um, so we'll get, uh, back into talking about Matt's work in just a few minutes.
Starting point is 00:06:15 So Matt, hang in there. Um, because we first have a couple of news articles to talk about, but feel free to comment on any of these as well. Um, so we got three news items for today. The first one is the really big news. So as I said earlier, we had a committee meeting, which ended a week ago as we record this. And we finished a C++ 26 committee draft,
Starting point is 00:06:38 which means that C++ 26 is now complete-ish, except that now there's two more meetings where we can address kind of wording bugs, essentially, or other kinds of bugs that people report. There's a process for this called the ballot. But we are not any more procedurally going to do any more design changes. So this is now just about two more meetings to have like fit to iron out
Starting point is 00:07:07 the last little wrinkles. So that means C626 is design complete. And the big news there is that we managed to put reflection in at the last minute, there was a bit of a scramble to get the wedding done. But it did happen. We voted reflection in a week ago. So, um, so your statistics will have reflection. So that's pretty, pretty big news. Yeah.
Starting point is 00:07:31 Huge congratulations to everyone who worked on that. There were a bunch of people who put in a lot of hours there. I'm sure. And I'm sure we'll be hearing about it in all of our final questions. Yes. So, so there is already one trip report that I saw online from Herb. There might be more coming out between now and when this episode is released, but we're going to put definitely Herb's one in show notes.
Starting point is 00:07:55 We are not going to talk about what else happened at Sofia too much now, because we're actually planning to do a special episode about that next time. We have a very exciting guest planned for this, so stay tuned for that. Yeah, maybe one more thing on that, which is that this is an awkward time during the standardization process where for about two or three meetings we'll be saying, and now it really is complete, because there's just different stages of completeness that we go through. So it's always a little bit awkward until it's finally out there. In fact, we only just got C++23 actually out at the end of last year.
Starting point is 00:08:30 Yeah. So it's going to be two more milestones. One is going to be two meetings from now, which is now, I think, officially announced to be in March in London, right next year. Phil, you might know something about that. Yes, I know a little bit about that. In fact, much more than I would like, because I'm actually hosting it for JGX, the company that I run, is hosting it next year. Okay.
Starting point is 00:08:57 So that is going to be the meeting when we officially, when C++36 is going to be officially done, done. So that's when this ballot where we iron out any issues that people report is also done. And we're going to have a finished document. And then there's going to be another year plus until ISO actually approves it and it becomes a new standard. So there's still a little bit of time to go, but we already now know what the feature set is going to be. So it is very, I think, in a very exciting stage of that process right now. So we have one more news item, which
Starting point is 00:09:36 is a little bit of a sad one. So the active development on the gmalloc memory allocator has come to an end. There is a blog post by the main author and maintainer explaining like why and like describing the whole history of the project. And yeah, I think that's kind of interesting, both in terms of like, you know, what it is and how it happened,
Starting point is 00:10:08 how it was developed, how it was used, and also that now they reached a stage where they're like, oh yeah, we can't really keep going with this anymore. So I believe TCMalloc is like the Google allocator and gmalloc is kind of the meta one, right? Yeah. I actually think phase three meta of this blog post is really interesting because it talks very heavily about sort of the politics of large organizations and how that influences these projects like this. And sort of when Facebook shifted to meta, the way that they rewarded individual contributors ended up being the death knell for J.E. Malik just took time to play out. And so if you're a high placed engineer, like really think about second order effects to how you incentivize the people
Starting point is 00:10:56 under you. Yeah, that's interesting, because it's a very popular project, right? I think it's used in a bunch of places. It's kind of very efficient, very fast. There's trade offs between project. I think it's used in a bunch of places. It's very efficient, very fast. There's trade-offs between JMalloc and TCMalloc, so you might choose one or the other. But I think they're equally like marvels of SuperStock engineering. So yes, it's very interesting and slightly sad
Starting point is 00:11:19 to see that this is not going to be continuing as an actively maintained project. Thanks for the memories. And I guess we're going to talk about allocators a little bit more with Matt further down the line. But before we get there, I have one more news item, which is a lightning talk by Andreas Weiss from last year's CPPCON, which has been released on YouTube.
Starting point is 00:11:39 And it's got a lot of views. And it's a really, really interesting one. So it's just five minutes, like all the lightning talks. But it's a really, really interesting one. So it's just five minutes, like all the lightning talks, but it's quite fascinating, quite densely packed with information. There's a long Reddit discussion about it as well. So it starts off by Andreas Weiss explaining what amortized complexity actually means. So we got like big O like OO1 or N or log N and sometimes the standard or some other description of an algorithm size, this is amortized N complexity,
Starting point is 00:12:08 for example, right? Or amortized O1 complexity. So amortized constant complexity. And so he managed to figure out a way to really, really nicely explain what that actually means without too much math in a very intuitive way. So I thought that was great. And then you went on to show that for a C++ range where the standard says that begin is amortized 01, that is actually unimplementable. So that's actually not true for filter views specifically,
Starting point is 00:12:37 which is a very popular view in the ranges library. Because what it actually does and how it actually works and how you would implement it, even the most efficient case just kind of isn't compatible with any reasonable definition of amortized complexity. So, and the kind of consequence of that is that it turns out we can't actually trust the C++ standard when it talks about algorithmic complexity guarantees. So I think that's something that we should keep in mind. Yeah, it's interesting. Hashtables often follow this iteration over hashtables is often of capacity instead of of size and depending on how you implement it. Right. But that it's actually amortized, right? So I think the standard specifies that the hash table iteration should be O size. I should double check that. Let's see.
Starting point is 00:13:33 It's actually next on an iterator in a hash table is isomorphic to a complexity of iteration. So when you're talking about something like st ordered, stood unordered map operator bracket, it says average case constant worst case linear. So not so it's going to be iterators plus plus on stood unordered stood unordered map iterator. All right, so just one like find the next item. Yeah, that can be of capacity, depending on implementation.
Starting point is 00:14:06 All right. So I'd have to dig a little bit deeper to find that. So let's do that offline. But yeah, you're probably right. That is probably exactly what it says. And you've done a lot of work on HashMaps, so I think you know what you're talking about. You're going to talk about that in a minute as well. But before we get there, actually first, Matt, again,
Starting point is 00:14:30 thanks for joining us at this late hour for you. It's kind of a bit of a weird one. It's 7am for me, which is just a lot. So okay, it's 5am for Phil, which is pretty rough. And then it's like past midnight for you now, right? So thank you very much for getting up, not getting up, but like staying up late, just for us. Really, really appreciate that. So what we want to talk to you about first is the main thing that you're working on right now. So that's called Bronto Source. Do you want to talk to us about what it is and what is it for and why it's awesome? Absolutely. So Brought to Source is a startup that I co-founded with my co-founder Andy, building out tools for large-scale code migrations and code updating with a real focus on C++. And this comes directly out of the work that we did at Google for a number of years. Actually, Kristin also did a lot of this work. And it's how do you build out a toolkit
Starting point is 00:15:34 of sort of easy things to allow you to make sweeping changes to sort of over millions of lines of C++ with a very high correctness, right? You want it to be so correct that it can commit code, basically compiles, passes tests, submit, the end. And that's a very high bar of correctness that you have to aim for. And so the architecture of a system like that is fun and interesting. And so the catch here is, though, that, or not the catch, but the special
Starting point is 00:16:08 sauce here is that that's actually done with the help of AI. Is that right? So how does that work? What does that mean? It's actually a bit of a mixture. So we use a lot of the sort of traditional static analysis techniques. Like we do build on top of playing and we have the AST and we manipulate it. But whenever you're doing an analysis on code,
Starting point is 00:16:26 you always run like face first into the halting problem. You're like, okay, can this pointer be null? And like, without fail, your answer is gonna be like, yes, no, and the vast majority of the time will be like, I cannot prove that this is or is not null, right? That's what the halting problem gives you for all static analysis is the most common answer is, I don't know, I can't prove it in either direction. And so what you can do with AI though, is you can get it to like extract a little bit more
Starting point is 00:16:55 information. You can read the comments, you can look at variable names and function names. So you can augment the traditional static analyses with additional information gleaned from the text. Then we go back to traditional static code generation techniques that are correct by construction. That is so cool. So that means that the AI is going to give you just a little bit more data kind of that you use as a heuristic basically? Yeah, very much so.
Starting point is 00:17:23 kind of that you use as a heuristic, basically? Yeah, very much so. One of the early products, one of the other products we're sort of looking at doing is automatic conversion of C and C++ code to Rust. And so if we're talking strictly about C code for a second, in C you've got a struct and then you have a function whose first argument is a pointer to the struct. When you convert that to Rust, obviously you can convert it to a function that takes a pointer or a reference to a struct. But the idiomatic conversion would be to put it in an input block so that it's a method. And you can actually use AI a little bit to ask it, like, hey, which of these is more idiomatic? Right. But, okay, there's so many questions there. First of all, how would you even convert CoC or SOS to Rust, given that Rust
Starting point is 00:18:14 doesn't have pointers and references and you express everything with other things and you don't have, you know, random access and into an array and you don't have, uh, just, just, you can't just pass references around. You just have to rewrite everything that would run into the borrower checker and fail. Is that something that you do or do you just do a translation into unsafe rust basically? So we try to go to enigmatic rust. There is always a fallback of unsafe rust. But right, it's not like, yeah, okay, so rust doesn't call them references, it calls them, you know, borrows or things like it. But in fact, most C++ code actually will, if sort of translated to idiomatic rust graphs will pass the borrow checker. Like aliasing for const references is surprisingly, it's uncommon. You know, it's not non-existent, but it's uncommon.
Starting point is 00:19:12 You usually can make a more idiomatic conversion than the pure semantically correct, like we're going to throw unsafe on everything and do everything that way. Huh. That is interesting on many levels, because one thing that people, including myself, correct, like we're going to throw unsafe on everything and do everything that way. That is interesting on many levels because one thing that people, including myself, kept saying when we were talking about things like safety and getting rid of UB and C++ is that if you were to introduce something like the borrower checker into C++, you would have to rewrite all of your code because none of it would work. And you're saying, well, actually, there's a lot of code that technically has references and pointers, but, you know, it doesn't actually have like multiple, you know, mutable references to
Starting point is 00:19:55 the same thing. And you can reason about it not having that locally. So you can actually make a transformation to save Rust or something like it. This is very new information for me. This kind of changes the picture actually for me. So that's really interesting. Yeah. And it's important to realize that you're always playing this numbers game of like, how much of the code can you lift to save idiomatic Rust? And how much do you have to fall back to the unsafe versions of it
Starting point is 00:20:22 when you do these conversions? All right. So when people say that, oh, we don't really want to do C++ anymore. fall back to the unsafe versions of it when you do these conversions. All right. So when people say that, oh, we don't really want to do C++ anymore, it's all unsafe, you want to import everything to Rust, then you have a tool with which they can do that at scale. That's the hope. Right now that tool, like our C++ refactoring tool is available. You can find it on our website and look at the docs.
Starting point is 00:20:46 I can't wait for Andy's C Plus Now talk about it to come out. I'm very excited for that. But the C2Rust converter is much more alpha-level software, at least at this point in time, June 2025. All right. So that's actually a good point. So what stage is the C++ refactoring tool at? Is it better? Is it actually a product that you can now buy and use? What stage is that at? It is a product. You could reach out to us and buy it and use it. It's available on Godbolt to play around with. Oh, that is so cool. Yeah. If you go to our docs, all of the examples in the docs and with a like, hey, see it in action on compiler explorer. And so you can use it on Godbolt, play around with it and get a sense for how well it works.
Starting point is 00:21:32 I will own very openly. It's early days. There are bugs. There are, you know, features yet to be implemented, but we're working at it and making progress. That is very cool. I did look at your webpage, but I missed that there were Godbolt links there. So big thanks to you for letting me know that we will put a link to the website in the show notes so other people can check it out too.
Starting point is 00:21:55 That is very cool. So one thing that I found when I was doing somewhat similar stuff at JetBrains where obviously the C-Line IDE and all of their other IDEs for other languages have lots of refactoring tools, which is really cool. Like for example, for me, that was the killer feature of C-Line that I can just go on an identifier, click whatever hotkey that is and say rename, and it's not going to just rename it in a file, it's actually going to rename it properly. And we'll know that if the same identifier appears in a different scope, it's probably a different actually going to rename it properly. And we'll know that it's the same identifier appears in a different scope.
Starting point is 00:22:26 It's probably a different variable and not rename that one. And so it would actually do that kind of very intelligently. So that for me was a killer feature. But I think whenever I tried to use it at like the code bases that were like larger than let's say a million files, a million lines or something, it would get slow. And I think nowadays there are improving it, but I think the difference, your product is that you're explicitly targeting massive code base, right? So what's the difference and challenge there? And can you talk a little bit about that? Absolutely. So the way we think about it is,
Starting point is 00:23:02 for the in IDE kind of experience that you're talking about, you want the entire change to be done and you want to submit sort of a whole logical change that contains everything. And for a very large code base, you really can't. You know, relatively few code bases can tolerate submitting 10,000 or 100,000 files in a single change. It just gets rough. CI systems break down. You have too many races with developers. And so the idea that we have is you submit into your code base a declaration of intent. You sort of say, hey, here's a pattern for old code and here's the pattern for new code that I want it to look like. And then the system
Starting point is 00:23:42 goes in the background, making those changes, breaking them up incrementally, submitting them bit by bit. And this is, there are a lot of things like Open Rewrite or TreeSitter that give you the ability to do stuff like this in different languages. TreeSitter kind of claims to support C++, but doesn't really, the C++ grammar is not great for that sort of thing. And you can build custom tools in Clang doing it, but you really have to deeply understand the internals, the Clang AST to build those tools. And so what we have allows you to write actual C++ snippets that are, here's my code before and here's my code after. And then we do the sort of translation of that C++ snippet to the right thing to search
Starting point is 00:24:31 the AST and find the right nodes and then make the transformation. So we sort of think about it as a declarative intent where you just submit these declarative intents and the system will start to move your code base in the background for you. And so how do you express that intent? Is that in English? Or is there some kind of declarative special language in which you do that? It's actually C++ code where you say like, okay, I have a struct, it inherits a struct or class, it inherits from just the tag that says, I'm a rewrite. And then you have a function that's annotated before and a function that's annotated after. And whatever it looks for patterns in your
Starting point is 00:25:10 code base that are like the body of the before, and it changes it to patterns in your code base to be the like the after. And this is on Godbolt. You can play with it. It's pretty fun. So yeah, no, I encountered this quite a few times, I think, where there were refactorings that were just not trivial. Like, I don't know, some vendor, you know, updated their audio API or whatever other API and said, you know, this function call that you've been using for the last 10 years is no longer safe.
Starting point is 00:25:41 So now we have a different one, but it takes like this other parameter where you now have to specify, you know, do you want to do like, I don't know, the safe mode or the unsafe mode or something. And so then you have to just do a little bit more work. Typically, like the code bases where I was working on, like they weren't that big. So you could get away with doing it manually. weren't that big, so you could get away with doing it manually. Yeah, absolutely. And if you can get away with doing these things manually, your life is so much easier, you should like, there's a question of scale where you're like, what's your code base is, you know, million, 10 million lines of code. The manual kind of breaks down.
Starting point is 00:26:22 So that's kind of what you're targeting, right? Code bases that are sufficiently large like this. Yeah. That's really interesting. And so that goes back to stuff you did at Google. So can you tell us a little bit more about what you were doing there and how that played into kind of where you are now and how that transition process happened that you decided,
Starting point is 00:26:42 oh, I'm going to now do my own startup and do this differently or something. I'm just curious, how did that happen? So I was at Google for 11 years. And of that, I think nine of those years, I was on C++ core libraries, which was the team that owns the sort of libraries like app sale and the internal libraries to Google. And we would do these migrations, but every time we did them, it was a very bespoke tool. So we would say like, okay, we want to build a new error handling library for use everywhere across Google.
Starting point is 00:27:14 And so we're gonna look at the existing set of error handling libraries and build tools to migrate everyone onto this sort of central standard one. And we would do each migration like this would be a bespoke singular tool. And after like eight years of this, we sort of started to see the pattern of like, actually if we could lower the cost of this
Starting point is 00:27:37 and have something where you don't have to learn clients internals, because teams would come to us all the time and they would say, hey, I see the changes you guys are making and I really want to do something like that on my code base, on my like corner of Google's code base. Can I do that? We would say like, oh yeah, you can build your own tool over here using Clang AST matchers and all this. And they would say, thanks. And we would never hear from them again. Because like, right. And eventually
Starting point is 00:28:10 we were like, maybe if we gave them something that really lowered the bar here, that made it so that they didn't have to learn all of Klang's internals in order to do refactors. And so we started down this path. And about a year ago, right, I founded Browse Source in September, and sort of the year before that, I was working with figuring out how AI would fit into our larger code migration strategy across Google. And honestly, I think most people in the AI space are missing it. There's a lot of focus on how do I put AI in the IDE and how do I have it be like autocomplete on steroids?
Starting point is 00:28:51 And no one is looking at how do I actually get it to have correctness at 99.9% correct. And so I had this insight of like, oh, what if I use it to read the parts that my static analysis can't do, like comments and variable names, but still use the traditional techniques that give you that correctness bar? And I got really excited about this idea and decided that I wanted to bring it outside of Google because I don't know, I like rewriting code. I like seeing things change. So where does the name come in? Because Brontosaurus sounds like Brontosaurus. Presumably that is the intention. But is that because you're taking old dinosaur code and
Starting point is 00:29:36 updating it? Or is it because you're dealing with huge dinosaur sized code bases? Or is it the combination of the two? It's the combination of the two and the fact that the domain name was available. It is when we were when we were founding the company, we had a bunch of time trying to figure out what to name it. And like the first name I came up with was Verdigris. And Andy was like, I hate that. I don't know how to spell it. I don't know what it means. Like this is a terrible day. Um, so we took a step back and we're like, what are the rules that we want for a good name? What should a name for a company be?
Starting point is 00:30:11 We said the dot com and dot dev have to both be available. That way, if you get the wrong URL, it will go to the right place. If you hear it pronounced, you should be able to guess how it is spelled. And if, uh, it should be three syllables or fewer, right? Cause we want it to be kind of short. And then we found out very quickly that dotcom and dotdev have to be available are deeply restricting set of things. And our fourth sort of soft rule was that we wanted it to be clever and a little playful because I find it's really important to have a sense of play and whimsy. Right.
Starting point is 00:30:53 So I can't help but notice that there is a tendency for dev tools like this to have like cute colorful animals as their kind of mascot. So, you know, PBS studio has a bright blue unicorn and you have a bright green brontosaurus, which also both are not actually extant animals, which is also fun. Yeah. So a good friend of ours named Dan Zaloggi is a professional Dan Zaloggi is a professional graphic artist and we hired him to design our logo. That is Charlotte Bronto, if you're curious. Also, it's not AI generated, the Brontosaurus. It is not AI that I generated. Dan Zaloggi did great work, sort of working with us to figure out what color palette, sort of how the vibe was. Cause we told them like we wanted something sort of friendly like the go gopher. Right.
Starting point is 00:31:49 That approachable kind of mascot. I think you nailed it. Dan is amazing. Yeah, I'll put a plug out there: if you're working in the video game space and you want a graphic designer, reach out to him. Also,
Starting point is 00:32:12 he's best known for a series called Creepy Pokemon, which is hilarious. Okay, all right. So yeah, this is very exciting. I really hope you're going to succeed with this startup. You seem to have some quite unique and amazing tech there, which is hopefully going to be very valuable for anybody with a big code base that needs to do stuff with it, basically. So I think that's a pretty big market, no? Yeah, I think so. I'm hopeful. If anyone is interested in trying it out or playing with it, feel free to go to our website, play with it on Godbolt, or just shoot me an email: matt at brontosource.dev.
Starting point is 00:32:47 I think I want to drill down into one last thing: how would I use this in practice? Is it a plugin for my IDE? Is it a cloud-based thing? Does it hook into CI? At what level does it attach, and how do I actually interact with it? So it would hook into your CI, right?
Starting point is 00:33:04 GitHub Actions, that sort of thing, because that's how it decides: when it reads from your code base, it starts to make the changes, and then it wants to test them, right? It needs to verify the correctness of the changes by running them through your CI before sending them. And based on configuration,
Starting point is 00:33:22 you could say, okay, I want to send changes to specific reviewers, or I want it to just submit on green, things like that. The plan (and now we're firmly in the world of theory and where we're going, not what we have) is to run it at code review time as well. So you can give it rules for how to look at code at code review time, so it can suggest edits in that context. So would that be basically competing with other AI-powered tools that are out there?
Starting point is 00:33:59 It would compete with all of the automated code review tools and linters and things like that. You could think of it a bit like Clippy; Rust has Clippy, and there are clang-tidy checks. So it does compete with that. I used to work at Sonar, as many people know, which makes static analysis tools, but one of the main products also does a sort of automated code review and quality gate on CI. And I haven't been there for a while now, but I understand that they are
Starting point is 00:34:29 introducing AI into some of that process as well. And I'm wondering if we're getting to a world where we'll have AI agents modifying the code, submitting a PR, and then a different AI agent reviewing it and deciding whether to allow it to merge into the codebase or not. So in some senses it depends on your definition of AI. At Google we already had it to the point where systems would generate large volumes of code and secondary systems would review them, and if the secondary system was okay with it, we would submit it directly. And sometimes those secondary systems were just a pile of regexes doing a sanity check on the tool's output.
Starting point is 00:35:12 And so you can already be in that space. I think the real question is that organizations need to take a very SRE mindset to this: if you have automated systems operating at scales like this and there is an issue, something slips by, you actually have to do the postmortem and understand what could we have done that would have stopped this. And the answer shouldn't be, oh, just don't run systems like that. No, the answer needs to be: how do you ensure correctness at the levels you need? Right? Much the same way, if I, as a developer, write a
Starting point is 00:35:47 bug that takes down prod, we should ask not, Matt, you're a bad developer, but: how did our systems fail in a way that allowed a simple mistake to take down prod? Yeah, that's a good way of thinking about it, actually. All right, Matt, so that sounds really cool. It sounds like you have some really interesting technology going on there. And there's a wider question of how AI will fit into the whole ecosystem in the long term; that remains to be seen, and it's going to play out one way or another. Everywhere I look these days (I'm not working on AI myself directly, but I go to conferences, I go to committee meetings, I talk to people) I get this vibe that we are still in the baby-steps stage. People are just scratching the surface and trying to discover how to use these tools effectively, beyond the really obvious things that are out there, right? Yeah, I think we're also going to get into a very interesting space in the world of Hyrum's Law with AI, where people are going to build a system and they're going to deploy
Starting point is 00:36:57 it and it's going to be working. And then Anthropic is going to upgrade from, say, Claude Sonnet 3.7 to 3.8, and that deployed system is suddenly going to break. And then you're going to say, no, no, leave the old one up. They'll say, okay, you're paying us enough money that we'll leave 3.7 up for you for a while. And then they'll say, actually, it's been three years, you really need to upgrade to 3.8.
Starting point is 00:37:23 And there's going to be this tension as a lot of the players in the AI space relearn all of the difficult lessons of SREs and versioning APIs over time. Right. So I'm going to now very deliberately change topic here, because I've noticed that there's a trend where, when we get into the depth of any discussion on the show lately, it always turns into a discussion about either safety or AI, pretty much every time. One of those two, which is kind of fun. But there's just also other stuff to talk about.
Starting point is 00:37:55 So I just wanna use the remaining time to cover other things, but it's obviously a very fascinating discussion. So thanks again for talking about BrontoSource, very exciting. Apart from that, and apart from your work at Google on refactoring stuff, you also did a lot of very impactful work on hash tables, right? So you, I think, came up with a particular type of hash table, which is now the
Starting point is 00:38:19 standard in two other programming languages. Is that right? Can you talk about that a little bit? Yeah, I want to be clear: I did not come up with it. I was part of a group of people. Alkis Evlogimenos did most of the implementation. Sanjay Ghemawat and Jeff Dean had a couple of really key insights. What I did was a lot of the politicking. I collected and gathered data, and I convinced all of the stakeholders across Google that we should actually move all of our hash tables to this. And then I did the work of actually moving the hash tables. But I don't want to take credit for what is really a brilliant algorithm; it was a mixture of Alkis's and Jeff Dean's and Sanjay's work. Okay.
Starting point is 00:39:05 So not only are Rust's and Go's default hash maps based on that, but also internally at Google, this is the hash map they use. Yeah. It's open source. It's Abseil's flat_hash_map. Oh, okay. I actually know about that one. So what's so cool about this algorithm compared to the stuff that
Starting point is 00:39:23 people were using before? So, right, the standard kind of hash table that you would write in college has a number of buckets: you compute your hash, you take it modulo the size, and then you have a linked list or a vector of things that fit in that hash bucket. Or you can do a probing hash table, where you advance one slot at a time when you hit collisions. And that works, but each individual probe inspects a single element. The way that Swiss tables work is that they have a metadata array at the front that contains seven bits of hash code. And so the hash code is split into what's called H1 and H2. H1 is what you
Starting point is 00:40:06 take the modulus of, and you say, okay, what position am I in? And then H2 is just those seven bits, packed in alongside an eighth bit that is a presence bit. And so you have a set of sixteen one-byte objects that represent, very compactly, sixteen different entries in the hash table with a lot of hash code bits. And then you use SSE instructions to compare all sixteen entries simultaneously. And so the analogy is: in a traditional hash table, you're probing elements one at a time to figure out which one it is, and in this, you're probing them sixteen at a time. Well, that's obviously going to be more performant. So that is really cool.
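The control-byte scheme Matt describes can be sketched in portable C++ like this. This is a simplified illustration with made-up helper names, not the real code: Abseil's flat_hash_map stores the same kind of control bytes but compares all sixteen at once with SSE2 (`_mm_cmpeq_epi8`) rather than a loop.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch of the Swiss-table hash split. The high bit of a control
// byte is the "no entry here" marker; a present slot stores the 7-bit H2
// fingerprint with the high bit clear.
constexpr std::size_t kGroupWidth = 16;
constexpr std::uint8_t kEmpty = 0x80;

// H1 picks the group to probe; H2 is the fingerprint kept in the control byte.
inline std::uint64_t H1(std::uint64_t hash) { return hash >> 7; }
inline std::uint8_t  H2(std::uint64_t hash) { return hash & 0x7F; }

// Portable stand-in for the SIMD group match: returns a bitmask of slots whose
// control byte equals the fingerprint. Each set bit is only a candidate that
// still needs a full key comparison, since 7 bits of hash can collide.
inline std::uint32_t MatchGroup(const std::array<std::uint8_t, kGroupWidth>& ctrl,
                                std::uint8_t h2) {
    std::uint32_t mask = 0;
    for (std::size_t i = 0; i < kGroupWidth; ++i) {
        if (ctrl[i] == h2) mask |= (1u << i);
    }
    return mask;
}
```

Because the presence information lives in the control bytes, an erase can often mark a slot empty again instead of planting a tombstone, which connects to the point about erases below.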
Starting point is 00:40:53 Yeah, it was, it's a lot of fun. It's really impressive. And the other thing about it is because you're probing them 16 at a time, when you do erases, you can often not leave tombstones behind. The set of times you need to put in tombstones is smaller. And so you can actually get that win as well. Right. And that's called a Swiss table.
Starting point is 00:41:14 Yeah. That was the internal code name, because Alkis is in Zurich and... So it's not something clever about holes in cheese or something like that? Nope. It's a common guess, but it's actually just named for Zurich; Swiss efficiency has a good spot in the American zeitgeist. Oh, I'm really curious actually now about this, because I mean, I'm not originally from Germany, right?
Starting point is 00:41:39 I'm originally from Russia, but I grew up in Germany, so I have this German accent. So I get this a lot sometimes when I'm in countries like America, where people say, Oh, you're you're German, Germans are so efficient. And so now you're saying, so I'm curious now, who is like, what's the stereotype? Like, what nation is considered like, the kind of the most efficient one is the Swiss, the Germans or somebody else? Like, so the Swiss have the like association with clockmakers in the US. Yes, yes, yes, of course. Where Germans have an association with rule followers in the US. Okay, that's
Starting point is 00:42:14 interesting. That's interesting. Huh, okay. Well, thanks. Sorry, that was a bit of an aside, but I just wanted to know. That's interesting. Okay, so you not only did a lot of work on refactoring and large-scale codebase management and hash tables, but you also did a lot of work on concurrency as well. And I remember actually the first time I met you, that was CppCon, I believe 2021. It was this weird one when the COVID lockdowns had just ended, and we had the very first in-person one after that, which was really, really small. And it was impossible to travel there from Europe, because there was this weird rule where if you had been physically in Europe
Starting point is 00:42:59 for the last two weeks, you couldn't even travel to the US. So I had to jump through quite a few hoops to even get there. I think I was pretty much the only European there, and there were just like 200 other people, which made it a very weird edition of CppCon. But I remember at that edition of CppCon you gave a talk, which I obviously went to, about building a lock-free, multi-producer, multi-consumer queue for TCMalloc, which is the other big, famous, efficient allocator out there, the Google one. That was a really, really cool talk, because what I remember from it is that you said, okay, we have TCMalloc,
Starting point is 00:43:38 which is this really, really complex, massively parallel allocator with many different stages of caching and all of that stuff. And in the middle, there's this one mutex, which is probably the most highly contended mutex in all of Google. Let's get rid of that and replace it with a lock-free data structure, which is ambitious, to say the least. I don't quite remember how it then played out. I remember the premise, but I think it kind of lost me halfway through because of jet lag and things like that. Can you talk a little bit about that work? Because that sounds really, really interesting and also very impactful, actually.
Starting point is 00:44:17 Yeah, it's a ton of fun. I'm in the weird set of people that like concurrency. And my favorite memory ordering is relaxed. Oh, mine too, mine too. I think you should use sequential consistency. No, never. If you can't name your acquires and releases, you shouldn't be in the world of atomics; just use mutexes. That's so funny.
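A minimal sketch of what "naming your acquires and releases" looks like in practice (the variable names here are made up for illustration): the release store that publishes the payload is paired with exactly one acquire load, while a pure statistics counter that synchronizes nothing can stay relaxed.

```cpp
#include <atomic>
#include <thread>

// The release (producer) / acquire (consumer) pair is named: "ready publishes
// payload". Everything else is relaxed because it carries no synchronization.
std::atomic<int> payload{0};
std::atomic<bool> ready{false};   // release/acquire pair guarding 'payload'
std::atomic<long> ops{0};         // pure counter: relaxed is enough

void producer() {
    payload.store(42, std::memory_order_relaxed);
    ready.store(true, std::memory_order_release);    // publish 'payload'
    ops.fetch_add(1, std::memory_order_relaxed);
}

int consumer() {
    // Spin until the acquire load observes the release store above; that
    // edge is what makes the relaxed read of 'payload' see 42.
    while (!ready.load(std::memory_order_acquire)) {}
    ops.fetch_add(1, std::memory_order_relaxed);
    return payload.load(std::memory_order_relaxed);
}
```

The point of the discipline: if you cannot write down which acquire observes which release (as the comments above do), the code is not reasoned about, and a mutex is the honest choice.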
Starting point is 00:44:42 I just submitted a talk to the Audio Developer Conference in November in the UK, so if that gets accepted, I will be there talking about exactly this. So I agree with everything you just said, and I have a talk scheduled hopefully for later this year about this topic, so it resonates with me very strongly. Yeah, I'm actually very proud to have, over the course of working on things at Google, discovered two bugs in TSan. Wow, okay. So I wanted to replace this particular data structure in the guts of TCMalloc. And one of the things, right, for most data structures, there are commonly known best-in-class things, right?
Starting point is 00:45:26 Like, cool, use a vector, use a B-tree; these are commonly known. Multi-producer, multi-consumer queues are kind of an open problem, in that there's no single best one. You should always have one that is tuned to the specifics of your system. And I saw that the component I was replacing was actually a stack in TCMalloc, but it didn't need to be a stack. It actually just needed to be a thing that you could put things into and take things out of,
Starting point is 00:45:58 and it didn't care about ordering. And so I thought, okay, I'll do this multi-producer, multi-consumer queue, because I know how to implement one. It's based on what's called the Disruptor pattern from LMAX, a London trading firm, I think, back in the late 2000s. They had this Disruptor pattern, which is a way to implement a multi-producer, multi-consumer queue. And so I was like, oh, cool, I'll build it up.
Starting point is 00:46:27 I'll base it on that. And the talk goes through doing it, then debugging it, rolling it out, testing it, and debugging it more. As it turns out, by the way, if you're ever writing a concurrent data structure and you think, oh, I have some bug, and you're tracing down all your concurrency issues: pause for a second and just put a giant mutex on it. Put the mutex everywhere and see if your bug persists. Because a lot of the time your bug is actually just in your indexes and your
Starting point is 00:46:59 bookkeeping and not in your concurrency. And so that's a very easy first step in debugging any of these. Also, have fuzz tests. Fuzz tests are great for multi-threaded things. And thread fuzzing specifically. Yeah, thread fuzzing specifically, right? You want to have a test that brings up N different threads that just kind of
Starting point is 00:47:24 pound on it. You don't even need to assert exact results, right? Because if you run it in TSan, the thread sanitizer from LLVM, it will tell you when you got your interleavings wrong. So it's great. And so then, as I went through and did all these benchmarks and everything else, at the very end of the day, the punchline was: after doing all of this work, as near as I could tell, it was a very mild performance regression, but it was so hard to get statistical significance that it was functionally equivalent performance. Interesting. Yeah.
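The two debugging tactics just described can be sketched together. This is a made-up illustration, not TCMalloc's code: a build-time switch that puts a giant mutex around every operation (so you can check whether a bug lives in the bookkeeping rather than the concurrency), plus a fuzz-style stress test that brings up N threads to pound on the structure, asserting only the weakest invariant and relying on TSan (`-fsanitize=thread`) to flag bad interleavings.

```cpp
#include <atomic>
#include <mutex>
#include <optional>
#include <thread>
#include <utility>
#include <vector>

// Flip to 0 once the bookkeeping is proven correct and the real lock-free
// implementation goes in. All names here are hypothetical.
#define DEBUG_WITH_MUTEX 1

template <typename T>
class Bag {  // stand-in for "put things in, take things out, order irrelevant"
 public:
  void put(T v) {
    auto g = guard();
    items_.push_back(std::move(v));
  }
  std::optional<T> take() {
    auto g = guard();
    if (items_.empty()) return std::nullopt;
    T v = std::move(items_.back());
    items_.pop_back();
    return v;
  }

 private:
#if DEBUG_WITH_MUTEX
  std::unique_lock<std::mutex> guard() { return std::unique_lock<std::mutex>(mu_); }
  std::mutex mu_;
#else
  int guard() { return 0; }  // no-op in the lock-free build
#endif
  std::vector<T> items_;
};

// Stress test: N threads alternate put/take. Since every take is preceded by
// that thread's own put, the bag is never empty at a take, so under the mutex
// build every take succeeds; TSan reports any racy interleaving for free.
inline long StressTest(int num_threads, int ops_per_thread) {
  Bag<int> bag;
  std::atomic<long> taken{0};
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t) {
    threads.emplace_back([&] {
      for (int i = 0; i < ops_per_thread; ++i) {
        bag.put(i);
        if (bag.take().has_value()) taken.fetch_add(1, std::memory_order_relaxed);
      }
    });
  }
  for (auto& th : threads) th.join();
  return taken.load();
}
```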
Starting point is 00:48:01 But actually, on the path to doing it, I refactored and cleaned up a bunch of the code, and I added unit tests to a bunch of things in TCMalloc. And so the code base actually got better, even though the final thing that I did those refactors for didn't land. Okay, so we talked a lot about things from the world of C++. Is there anything else that we haven't talked about in the world of C++
Starting point is 00:48:33 that you find particularly interesting or exciting, that maybe we'll be hearing more from you about in the future? Yeah, so I was at C++Now recently, and David Sankel gave a talk on some Rust binding stuff. And I think there are actually going to be a lot of interesting things coming out as a side effect of reflection, around binding C++ to other languages, which I'm really excited about. I'm also mildly terrified of reflection in a Hyrum's Law sense. For those who don't know, Hyrum's Law is the idea that any change whatsoever to your source
Starting point is 00:49:04 code can break some user. And now with reflection: if you change the order of private variables, I mean, that could already change layout, which could break users, right? Changing the name of a private variable could change reflection in some way and break users. And so the tyranny of Hyrum's Law is going to increase with reflection in a way I'm kind of curious about. So for a moment there, I thought you were going to say something other than reflection, but you still managed to sneak it in anyway.
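The "reordering members can already change layout" point is easy to demonstrate today, without any reflection. These are hypothetical structs (members are public here for brevity; the padding argument is identical for private ones): reordering fields changes how much padding the compiler inserts, and therefore `sizeof`, which Hyrum's Law says some user somewhere depends on.

```cpp
#include <cstdint>

// A one-byte tag before an 8-byte id forces alignment padding both after the
// tag and at the tail.
struct Original {
  std::uint8_t  tag;    // 1 byte, then padding to align 'id'
  std::uint64_t id;     // 8 bytes
  std::uint8_t  flags;  // 1 byte, then tail padding
};

// Same members, reordered: the two one-byte fields pack together.
struct Reordered {
  std::uint64_t id;
  std::uint8_t  tag;
  std::uint8_t  flags;
};
// On a typical 64-bit ABI, sizeof(Original) is 24 while sizeof(Reordered)
// is 16, even though the two types hold exactly the same data.
```

Reflection adds a new dimension to this: not just sizes and offsets but member names and declaration order become programmatically observable, so even more incidental details can quietly become load-bearing.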
Starting point is 00:49:36 Yep. You know, I thought about trying to not say reflection, but it's really cool. It is really cool. Okay, then I think we will start to wrap up there. And I just want to take a moment to point out that Timur has been doing most of the talking for this episode, because I've been having a lot of latency issues here, which hopefully you won't hear too much of because of the editing. But that's why I haven't said so much. So as we do reach the final stretch, is there anything else you want to tell us,
Starting point is 00:50:12 or where people can go to find out more about BrontoSource or any of the other things that we've been talking about, Matt? Yeah, first of all, thank you so much for having me. This has been a fun conversation, and thank you, Phil, for fighting through the connection issues and finding a time that worked for all of us; being spread across the world makes scheduling interesting. And that, I think, is the big point. Oh, and you can find us at brontosource.dev. If we managed to achieve our naming goals with the name BrontoSource, you should be able to guess how it's spelled. But we will put that in the show
Starting point is 00:50:53 notes as well anyway. I figured. All right, so that wraps up our episode 401 with Matt Kulukundis about BrontoSource. Thank you again, Matt, for coming on the show, and we will see you all again here in two weeks. Bye. Bye. Thanks so much for listening in as we chat about C++. We'd love to hear what you think of the podcast. Please let us know if we're discussing the stuff that you're interested in, or if you have a suggestion for a topic; we'd love to hear about that too.
Starting point is 00:51:20 You can email all your thoughts to feedback at cppcast.com. We'd also appreciate it if you can follow @cppcast on X or @mastodon@cppcast.com on Mastodon, and leave us a review on iTunes. You can find all of that info and the show notes on the podcast website at cppcast.com. The theme music for this episode was provided by podcastthemes.com.
