CppCast - Rcpp
Episode Date: April 21, 2022. Dirk Eddelbuettel joins Rob and Jason. They first talk about an updated C++ web framework, and whether C should be considered a programming language or a protocol. Then they talk to Dirk about the R programming language, and Rcpp, the R/C++ interop library. News: Crow v1.0 released; C++23 will be really awesome; C isn't A Programming Language Anymore. Links: Rcpp; Rcpp: Seamless R and C++ Integration; Patreon; CppCast Patreon
Transcript
Episode 346 of CppCast with guest Dirk Eddelbuettel, recorded April 5th, 2022.
This episode of CppCast, we're thanking all of our patrons on Patreon.
Thank you so much for your ongoing support of the show.
And if you'd like to support the show too, please check us out at patreon.com slash cppcast. In this episode, we talk about a C++ web framework.
And we talk to Dirk Eddelbuettel.
Dirk talks to us about the R programming language and Rcpp, the R/C++ interop library. Welcome to episode 346 of CppCast, the first podcast for C++ developers by C++ developers.
I'm your host, Rob Irving, joined by my co-host, Jason Turner.
Jason, how are you doing today?
I'm all right, Rob. How are you doing?
I'm doing okay. I know you're still right in the middle of the C++ game jam.
How's that going?
I am in the middle of the game jam right now.
And it seems that, you know, I'm expecting at least like two or three submissions or something like that.
I honestly have no real idea how many people are involved.
I know how many people have joined my Discord and are having a conversation. I think one of the more interesting parts that's come out of this
is there's been a couple of people who have mostly like,
well, they're like, well, I don't know if I'll have enough time
to actually participate in the game jam itself.
But they have been hanging out on the Discord,
helping other people get set up with the example project
and make sure that they can get moving when they need to and everything.
Yeah, that's been cool.
I did look at it on Twitter the other day and clicked on the hashtag and saw a couple people
posting screenshots and little gifs.
Oh, that's cool. I must have missed a couple of those.
Yeah, I saw that there are a lot of bouncing circles around in the console window, that sort of thing.
Yeah, one of the ones that I did retweet, the circles bouncing in the window, is actually from Arthur, whose last name I forget at the moment. I know we've talked about FTXUI
and a bunch of really cool console-based tools.
We should probably really get them on at some point
to talk about these things.
Yeah, definitely.
Well, at the top of our episode,
I'd like to read a piece of feedback.
We got a lot of tweets about last week's episode
or I guess two weeks ago
when we talked with Logan about Julia.
This tweet was saying, thanks for hosting an episode about Julia.
There was a point raised about Julia in real-time applications,
which is worthy of further discussion
if Julia is to be promoted as solving the two-language problem,
especially in robotics and embedded applications.
So yeah, it was definitely great having Logan on the show,
and it seems like a lot of people really enjoyed that episode,
but it seems like there's a desire for us to be able to go a little bit deeper into Julia. Maybe
we should get a compiler engineer who works on Julia on sometime. Similar comments from a friend
who's a listener of the show that said, yeah, if we could get someone, you know, back end or
something like that from Julia to really be able to dig into these details, that could be really
cool. Well, we'd love to hear your
thoughts about the show. You can always reach out to us on Facebook, Twitter, or email us at
feedback at cppcast.com. And don't forget to leave us a review on iTunes or subscribe on YouTube.
Joining us today is Dirk Eddelbuettel. Dirk is a senior data scientist, engineer, and quant with
extensive experience in research, development, and trading. He works at TileDB. Dirk is also an adjunct clinical professor in statistics at the
University of Illinois Urbana-Champaign, where he created and teaches STAT 447, Data Science Programming
methods. Dirk has been contributing to open source, mostly Debian and R, since the 1990s.
Besides looking after numerous Debian packages, he develops and or maintains a number of projects, many around R, often involving C++ and making use of R CPP.
Dirk also is a co-creator of the Rocker project, bringing Docker to R.
Dirk, welcome to the show.
What do you do at TileDB? You say quant and trading, but TileDB does not sound like the typical trading company name to me.
That's right. That's right.
So I've been working since the 90s and I worked for 20 years in financial services writ large,
trading investment firms, from analytical roles, and pivoted over into toolchain building and
integrating things as one does as a quant, as we were called then before data science.
But I left finance two and a quarter years ago, and now I'm in a startup building,
being part of a team that builds a new universal data engine.
That's TileDB.
It's an MIT-licensed, really clever C++ backend.
So I can send you pointers.
Maybe that's good for a follow-up.
Instead of a format specification, we really just have an API,
and then it binds to just about everything,
because just about everything can use FFIs and C interfaces.
We're using a very clever thing that I finally learned the term for: the hourglass pattern.
Have you seen that?
There's a good CppCon 2014 talk about it.
Because we're using C++, but the bindings are all C, so we can actually have pre-built libraries to build against. And I basically look after the R integration because R is one of the data
languages which you would want to connect to a universal backend.
Well, I mean, unfortunately, we didn't bring you on to talk about TileDB today, but maybe
we'll talk about it more through the course of the interview. But is it like a
traditional relational database, SQL-backed, that kind of thing? Or what are you talking about?
No, it's storage. It's files.
And it just abstracts that away.
So it's native to the three main cloud engines and is new enough so that cloud was built in rather than after the fact.
If you wish, a more modern version of HDF5,
if you've come across that, with cloud support in there.
And the basic gist is that multidimensional arrays
are actually fairly universal and can be used for just about anything.
So we have very big uptake in geospatial,
point cloud data, geoinformatics, standard data frames,
just things that can be indexed by one, two, or many dimensions, dense or sparse.
And we have a unified approach for those whether they're dense
or sparse. So it's pretty cool.
Sounds directly related to
the GDAL episode we did, yeah.
That's right. And we work with Howard. I saw
that he was on just a couple of weeks ago. That's right.
So we have direct PDAL and GDAL
support and all those things and
work with Howard quite a bit. It all comes
together. That's right.
Alright.
Well, Dirk, we got a couple news articles to discuss. Feel
free to comment on any of these, then we'll start talking more about R and Rcpp. All right. So this
first one, we have a Reddit post and a GitHub link. And this is a year and a half ago, I picked
up an abandoned C++ web framework. And today, it was released as version 1.0. And this is Crow,
a C++ web application framework. I'm trying to remember if we talked about it like a long time
ago, because it did get abandoned at some point, but maybe it's possible. I don't know. But you
know, there's lots of comments in here, you know, thanking these people for working the project and
bringing it back to life. And it sounds like it's, you know,
a pretty useful C++ web framework. Yeah, I'm so not familiar with these things. But I did dig into
the like, how to write your first application. And I guess it's so you could set up a REST server
of some sort, basically in C++ and like 10 lines of code. That's how I'm looking at this.
Pretty awesome.
It reminds me a little bit of a toolkit called Wt.
Have you seen that?
It's not that well known.
It's a bit like Qt, and they play on that with the name,
but it's really made for the web.
And we'll get to RCPP in a minute,
but let's sort of do a little bit of a cottage industry
of various things that one can do with it.
And one use that I actually had very early was taking R and embedding it into C++ applications.
And then you sometimes have an embedded web server in there. So long story short, I also have a
really small C++ class framework called RInside that allows you to stick R,
the statistical language we've been discussing, into a C++ app, and then led to various front ends
where you could present that,
that you have sort of analytical capabilities behind.
So I first did something with Qt,
and I've forgotten where the pointer came from.
Maybe I just saw it somewhere on the web,
something called Wt, and I just put up
sort of standard applications.
It's still not quite clear whether you want
quick web apps
really written in C++
because there's a little bit more
tooling and getting going.
But, you know, if you want it that way,
they're there.
And I played a little bit with it,
but it's a bit out there, I guess.
Same with that framework that you showed.
It's intriguing,
but maybe not the most mainstream of things
for web applications.
I feel like my first thought when I saw this was, okay, I'm going to connect my C++ application
directly to the internet.
How well has their URL parser been fuzzed?
That's what I want to know straight away.
I'm assuming that the authors of the project have taken care of that kind of thing.
I feel like, hopefully, fuzz testing and whatever to make sure there are no obvious security flaws in the thing that you connect directly to the
internet is becoming more standard practice these days. Okay, next article we have is a blog
post. This is on kdab.com. C++23 will be really awesome. And this is an April Fool's post.
It took me a second, Rob. Yeah, it's one of those ones where you kind of wish
that this was a real feature coming to us in C++23.
Just to give a brief overview,
it's about a fictional keyword being added,
the really keyword,
which maybe makes C++ behave a little bit more
like how we sometimes might wish it would.
Brought a smile to my face.
I do get fooled just like other people every now and then,
but he had only talked about two paragraphs
and when they repeated the keyword
for the next sort of little use,
it's really well done.
I mean, including sort of the bits of asm in there
and what have you, but yeah, really?
It needs to be partnered with 'just',
to make everything more verbose, so that we can, like, put more stuff in. Really? Yeah, there's a comment here similar to
that. Indeed, like 'really' and 'indeed' and a couple of others. Honestly. Oh, you missed the 'honestly'
proposal? That was like, I really honestly want to allow this conversion. Well, we have explicit, so yeah. Right.
Okay.
And then the last thing we have is another
blog post, and this one is
called C isn't a programming language
anymore. And this
makes reference to another blog
post we talked about recently from
Shahid Manid about
C-ABI. This person is apparently working with Shah Manid about C-ABI.
This person is apparently working with Shahid
on the C-ABI problem, but
this post is basically how they're coming at it from
a very different perspective, how
they don't really care so much about
C itself, but
because C has become
the lingua franca of programming,
it needs to be improved, although this
person seems to be thinking it's not likely to get improved. Yes, it's lingua franca. It can't be improved. It's
Latin, seems to be effectively their argument here. But this falls, I think, comes right back
to Dirk, your comment earlier about the hourglass pattern, which if I understood right, is basically
you're using C to funnel your C++ to whatever, pick other language, right?
Right.
That talk is actually really relevant, because we do that at TileDB, and it works.
Because on Linux, we have this problem that you otherwise need to rebuild; but for the integration, we can, in CI, build the static libraries because they're C libraries.
Because C is portable, whereas the C++ signatures change so much that you can't.
And it just puts a little bit of extra effort initially on things
because you have a core that's all C++, you have all the abstractions that you want,
but then everything that you expose, you expose at plain C functions,
and you wrap those, again, with a header-only layer
that makes it really cheap to use and deploy it.
And that way the client sees the C++ API that's high level,
but that C++ API internally just pivots to C
and then the linkage is at the C level
and all of a sudden it's beautiful
because you can actually pre-build and ship libraries.
A bit like what Python does with wheels and other things,
and that's otherwise a problem
that we don't really have licked
for deployment on Linux.
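The hourglass pattern Dirk describes can be sketched in one file (in a real project the C++ core, the C header, and the header-only wrapper would be separate files; all names here are invented for illustration):

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// --- Layer 1: internal C++ core (in a real project, the library's .cpp files).
// Full C++ freedom here: classes, std::string, templates, exceptions.
namespace core {
    class Greeter {
    public:
        explicit Greeter(std::string name) : name_(std::move(name)) {}
        std::string greet() const { return "hello, " + name_; }
    private:
        std::string name_;
    };
}

// --- Layer 2: the narrow C waist, the only symbols clients link against.
// Only C types cross here: an opaque handle, char*, int. Because the ABI is
// plain C, prebuilt binaries link cleanly from any compiler or language.
extern "C" {
    typedef void* greeter_t;

    greeter_t greeter_new(const char* name) {
        return new core::Greeter(name);
    }
    int greeter_greet(greeter_t g, char* buf, int buflen) {
        std::string s = static_cast<core::Greeter*>(g)->greet();
        if (static_cast<int>(s.size()) + 1 > buflen) return -1;  // C-style error
        std::copy(s.begin(), s.end(), buf);
        buf[s.size()] = '\0';
        return static_cast<int>(s.size());
    }
    void greeter_free(greeter_t g) {
        delete static_cast<core::Greeter*>(g);
    }
}

// --- Layer 3: header-only C++ wrapper shipped to clients. It compiles in the
// client's own translation unit, so no C++ symbols cross the library boundary.
class Greeter {
public:
    explicit Greeter(const std::string& name) : h_(greeter_new(name.c_str())) {}
    ~Greeter() { greeter_free(h_); }
    std::string greet() const {
        char buf[256];
        int n = greeter_greet(h_, buf, sizeof(buf));
        return n >= 0 ? std::string(buf, n) : std::string();
    }
private:
    greeter_t h_;
};
```

The client only ever links against the three C symbols in the middle layer, which is why prebuilt binaries keep working across compilers and language runtimes.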
On one hand, it sounds like a lot of work.
On the other hand, it sounds like it really makes you think about
what your API needs to be
so that you really get the succinct,
you know, this represents the problem we're trying to solve kind of API.
Yeah.
One of the most interesting things I thought I read in this post
was talking about how no one wants to write their own way
of parsing C header files and everything.
Everyone just winds up embedding or relying on Clang
or some other compiler.
Well, I don't know.
I found the post very ranty.
I just glanced at it.
It's a bit ranty, yes.
I think it's just off base because it is 2022
and we're still using C for all these things
because it stood the test of time.
And he's ranting a little that you don't quite know
what the size of a variable is and all the rest of it,
but the spec left that out.
And for that, we have header files and defines,
and we switch. This is not a dynamic interpreted language, so we're not reading the parser and
doing something directly. It will have an #ifdef on this architecture and that architecture.
I'm so old that I actually remember when we moved from 16-bit to 32-bit. And now we're mostly in a
world of luxury because everything's 64-bit and it's rich enough and enough resolution and all the rest of it.
And you don't really have these portability issues anymore.
But those used to be there and C, for all its warts,
is so powerful and general enough that it accommodated all those changes.
And that's why it hasn't gone away.
And that article won't change that.
Right. All right.
Well, Dirk, we already made a few references to Rcpp,
which is going to be the main thing
we talk about today.
But before we get into that,
I was wondering if you could maybe
just give us a brief overview on R itself.
We've obviously mentioned the language
plenty of times on the show,
but we've never dedicated an episode to it.
So could you tell us a little bit about R?
Absolutely.
FAQ calls it a language and an environment,
and it's meant and designed to work interactively with data,
but being very extensible.
It actually goes back, like many of the things we just mentioned and care about,
C, C++, Unix, to Bell Labs.
At Bell Labs, they also had, other than all the computer science research,
a very powerful statistics research department,
and they invented themselves a system in the 70s that became the statistical system,
which got shortened to S. But then the phone company was a regulated monopoly, which was not
allowed to license IT products. That's why Bell Labs never made money really off Unix or the
language and all that stuff. It was the same with the S language.
So that got licensed eventually,
and there was a commercial implementation called S+.
And then in sort of a very ironic twist of fate,
two lecturers in Auckland in New Zealand
were frustrated with the vendor and the license,
because at the time they didn't have a binary
for the Macs in the teaching lab.
So they kind of said, shucks this, and just built their own. So R was started around '92, and by '94 or '95, it was sort of a new project. And as we were all getting going with Linux at the time and,
you know, the comp.os.linux.announce newsgroups, I first read about R then, kept an eye on it,
it was really small, but grew.
And I started playing with it in the 90s and then switched in the 2000s.
So R is a reimplementation of S.
And S was once described, and S and R were once described by someone equivalently,
that the best thing about them is that it's a language by statisticians for statistics.
And the worst thing about it is that it's a language by statisticians for statistics. But, you know, I like to follow up on that with the gorgeous quote by Bjarne,
which just basically flips that there's two types of programming languages,
the ones that everybody complains about, and the ones that nobody uses. So it's an evolved language
that has been in use all this time, and is quite widely used for things that have to do
with data at large. So, you know, around statistics departments, but also other data centric things.
And yeah, because of this design from the 70s that fit really well into this mold of workstations
from the 80s. When I was growing up, that was the ideal: real scientists, workstations, big monitors,
and then this model that you have a powerful central processing unit and enough RAM, and your data sits in there.
So that's sort of the first sort of approach that was initially written.
You work best with the data that you can hold in memory.
And then, of course, data sets always grow.
So other approaches get developed to sweep over data
or distribute it on different machines.
But by and large, it's sort of meant to work with data, explore data, transform data, and be extensible around data.
So the main creator of the language is a fellow named John Chambers.
And he had always put up two dictums, basically: that everything in S and R is an object.
So everything that you represent is sort of an object,
a number or the result of a fit or a plot or what have you.
And everything that does something is a function call.
So a function call would work around objects.
And then the next sort of close thing to that was extensibility.
So there were some reasons that meant we got really lucky
when we started probing with this Rcpp thing.
And I got particularly lucky
with really excellent contributions by collaborators
that pushed it much further
than I had planned at the times, a few times.
And yeah, it became what it is.
And it's one of the different extension systems for S and R.
It isn't even the first one,
but when I got more serious with
S and then R, we were sort of coming out of the 90s
and we were beginning of this new century
and Java was still more active
and C++ was a bit stagnant.
C++ 11 hadn't happened.
The standards committee hadn't gotten back together yet.
So everybody thought Java would be the next best thing.
So at the time,
most advanced extension mechanism for R
was actually a system called rJava
by one of the R core team contributors.
I looked at that and kind of think,
oh, you know, he has 60, 80 packages using it.
And that was a big deal for us
when we reached that number
and we went a little bit further by now.
So rJava is still at about 120
and we have about 20 times that now.
So it just turned out that we picked something
where the shoe really fit.
There's some things internally
about how R goes about things. It's a child of these designs
before OO, really. These sketches in the 70s and some core development in the 80s and 90s before
C++. So one key aspect is that everything that's held inside of R is always an S-expression
pointer object (a SEXP), which is an old-school C union type, if you've seen those, which is where you basically have multiple aspects in one
by just having a controlling type in there. Oh, today you're an int, or
you're a closure, or you're a function, or you're a double, and sort of things like that.
And that makes the interface really simple because everything goes in and out of
SEXPs. And so I was first interested
in extending R to C++ libraries. That's something I played
with in other systems too, to other libraries, often database backends. And I wanted to do that
for a financial library called QuantLib that had also started around 2001, was really far
out there and forward looking with extensive use of Boost and at the time really
modern C++ idioms and still did really well.
And I got some help about how to connect that to R and just marshal some simple values back
and forth in a very pedestrian way because as an MVP, you just have to start somewhere
and have a first feasibility study.
Then I had that and I worked for simple things.
And out of the blue someone came
and realized that, well, you should use C++ and a little bit of templating, and then we can at least
translate char vectors, int vectors, double vectors automatically and pass them back and forth. And
that at the time was called RcppTemplate, and then Rcpp. And that collaborator
eventually got disenchanted with a couple of things, including open source politics; it was all very volatile, and he left after a year or two. And then
it was just there, and I was still looking after it. And then I needed something like exactly that
for a problem at work related to the embedding, and then I played a little bit more with that,
and that was sort of 2005, '06, '07, I think. And around that time, also '08, '09,
I was interested in connecting it to Google's protocol buffers library,
multilingual code generators.
So I reached out to a fellow whom I had met at one of the conferences,
a French guy who I knew was very fluent with Java
and had become a collaborator on this Java extension.
I told him, look what I'm doing here.
He said, oh, this is really neat.
So you're using C++.
How does one go about C++?
Literally, he sort of asked me.
I sent him an email.
I really like these C++ books, and a couple of other things.
And he basically inhaled those
and became one of the most amazing C++ programmers I've worked with
and wrote a bunch of the API design as we have it now. A lot of that really is
Romain's, and he did really good work there, and we're still taking advantage of that and profiting from it
today. And yeah, that's just basically it. With C++ we can do some tricks around passing values
back and forth and that makes the integration much more direct and seamless is the term we
often use. If you've seen things like,
it's sometimes called inline in a couple of languages.
I think I first saw it in Perl.
I can't remember whether Python had it,
but that's basically the idea that you're sitting
in one of those high-level script languages,
and then you want to extend it with some compiled code,
and you more or less pass it down as a string vector,
as a long character variable,
and then sometimes it just picks it up,
maybe prefixes it with a header, invokes a compiler, links it,
and it so happens that R actually has really good abstracted tooling
across the operating system where you just say,
R CMD compile, R CMD SHLIB for the shared library,
then load it, and so we can invoke those behind the scenes,
and yeah, that made the extensibility
much more straightforward and powerful. And that, that appeals to a lot of people too.
That kind of sounds like how you might do inline assembly in C++ if you were so inclined.
Like just open an asm block, put some assembly in there.
Yes, but on steroids. For many data types, you know, I can marshal an R function, or I can call R from C++ code
and back and forth.
It's pretty good.
There's still sort of some issues
about what gets mutated
and what doesn't get mutated.
We have some constraints in there
because everything's really
underneath the C-level pointer type.
So you have some difficulties
really protecting things,
but by and large, it's pretty good.
So do I understand correctly,
you're saying when you're doing
this function call handling
back and forth,
that everything in R is the same type, basically, but you said it's a union. So it could be a double,
it could be an int, it could be a function pointer, something like that. In all seriousness,
it sounds kind of like using JSON for marshalling data between applications, because you're like,
well, I got this data structure.
Everything is either a double, an int, an array, or an object.
And now I just have to unpack these things on my side
in some meaningful way.
Does that sound fair?
Not a bad analogy, but JSON, of course,
has its cardinal sin of being untyped.
Right.
And here, at least, we are typed.
We just have this union type in the interface.
But down below, it's then something else.
That creates some friction for us sometimes with dispatching.
So, for example, we can't do this trick that C++ does,
where you have a function foo with five different signatures
because it's five different types that it's called with.
At this level, we're calling foo with a single argument
that is an S-expression pointer,
and you then inside have to look at the payload and switch
to whether you want to do this or that.
So it's sometimes these little things.
But other than that, it's yes.
The serialization example isn't perfect,
but you can think of it a little bit that way.
But we do get types and other things.
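A miniature stand-in for the dispatch constraint Dirk describes. The real SEXP has many more payload kinds, but the shape of the problem is the same; the names below are invented for illustration:

```cpp
#include <cassert>

// Miniature stand-in for R's SEXP: a C-style tagged union whose controlling
// "tag" field says what the payload currently is.
enum Tag { INT, REAL, FUNC };

struct SExp {
    Tag tag;
    union {
        int    i;               // integer payload
        double d;               // double payload
        double (*fn)(double);   // function payload
    };
};

// Every call crosses the boundary as foo(SExp): there is no overload set on
// several static types, so dispatch happens inside, by switching on the tag.
double foo(const SExp& x) {
    switch (x.tag) {
        case INT:  return x.i * 2.0;
        case REAL: return x.d * 2.0;
        case FUNC: return x.fn(21.0);  // invoke the stored function
    }
    return 0.0;
}
```

Where C++ would pick an overload at compile time, here every call arrives as the same type and the branch happens at run time inside the function.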
So in R, can you define your own types, your own structs
that contain whatever in them?
You can, but then you have to provide the mappers
because you can have something as rich as you want,
like these protocol buffer objects.
But then you have to define some interface
that gets it into R and then again unpacks it.
So if that makes sense.
So if the protocol buffer, say, is a list that contains a vector and a matrix
and three strings as attributes or whatever,
so then you bring them to R as a matrix
because it knows what that is,
two strings and whatever else,
or just a vector,
and then you can sort of just unpack them
into the base types,
how structs basically are compositional
of these core things.
And the nice thing is the core things we can do.
But it's extensible because you can add the particular converters
for the classes and types that you're interested in.
That's reasonably advanced use, but the doors are open and people have done that.
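A pure-C++ sketch of that converter idea. In Rcpp itself the extension points are specializations of `Rcpp::wrap()` and `Rcpp::as<>()`; the types and function names below are invented stand-ins that just show the decomposition into base parts:

```cpp
#include <cassert>
#include <vector>

// A rich C++-side type, like a protocol buffer message: R has no idea
// what this is as a whole.
struct Matrix {
    int rows, cols;
    std::vector<double> data;   // row-major payload
};

// A stand-in for "a list of base R types": an integer vector for the
// dimensions and a numeric vector for the values.
struct RList {
    std::vector<int> dims;      // c(rows, cols)
    std::vector<double> values; // the numbers
};

// Converter out: decompose the rich type into base parts R understands.
RList wrap_matrix(const Matrix& m) {
    return RList{{m.rows, m.cols}, m.data};
}

// Converter in: reassemble the rich type from the base parts.
Matrix as_matrix(const RList& l) {
    return Matrix{l.dims[0], l.dims[1], l.values};
}
```

Once both directions exist, the interface layer only ever moves base types, and the rich type round-trips through them.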
Is there anything like a preprocessor available or whatever,
if you wanted to say, please automatically write this interface layer for me?
No, we don't have a parser that is a code generator for those steps.
But you could do that conceivably.
In a previous life, I spent a lot of time with SWIG.
So when we talk about cross-language interoperability,
the way SWIG works often comes to my mind.
It's a really great example because the QuantLib project
that I bound things to by hand
to make this happen,
and out of which,
out of these interactions,
RCPG really grew,
always had interfaces to SWIG,
but there were some difficulties.
I mean, SWIG is really promising
and powerful
and can try many languages,
but it's complicated.
I mean, at the time, for example,
it didn't do shared pointers in C++.
So yeah, there are issues.
At one point, and someone else worked on that,
at one point we had a completely working
R interface with SWIG,
but then it became one really huge compilation unit.
It was unreal, lots of resources.
But yeah, it's an interesting problem
and a very interesting project
that we were all very much in love with.
But then over time, it became apparent that it's really, really hard
to do it in full generality for all client languages,
and it didn't really carry the day all that much.
I mean, it's still used, I guess, for some projects,
but the scope is a bit more reduced. It seems like its effort to be generic and work with so many languages
means that for any given language, there's a more efficient, better way to do it.
Right. Yeah. I use that as a mental model too, exactly. That's often true.
So you've mentioned code generation a couple of times, and that was one of the things that
caught my attention when reading about RCPP with attributes being used to declare a C++ function as callable from R.
Can you tell us a little bit more about how the code generation works?
Hey, everyone. Quick interruption to let you know that CppCast is on Patreon.
We want to thank all of our patrons for their ongoing support of the show.
And thanks to our patrons, we recently started using a professional editor for CppCast episodes. If you'd like to support the show too, please check us out at patreon.com
slash CppCast. Absolutely. So I had mentioned that the first pass that we used was a thing
called inline. And someone had, I think, already written that; it was already on the package repository, which for us is called CRAN,
in a wordplay on CPAN or CTAN or whatever.
And we'd adapted inline to work with RCPP.
And that was already pretty good,
but not quite good enough.
And then someone who became a friend and collaborator
was just quietly working on a generalization.
And that's JJ Allaire, a co-founder of RStudio.
And he did that. And he basically contributed the attributes extension to Rcpp, and with that
then became a team member. That was mostly written a decade ago, I want to say, just about.
So attributes already existed for C++ as a notion, even though the compilers didn't have that. So we
do something that looks
like it, but we hide it behind a slash slash. We hide it in a comment. But basically the attribute,
the most simple one is that one line above your function signature, you say slash slash and then
in two square brackets, delineated for regex parsing, you say Rcpp colon colon export.
And that basically says that this function that we're seeing here, we want to call from R.
And then all the beauty that makes this happen, by and large, is some 3,000 lines or something like that on the order in source file attribute cpp, where JJ just goes over the business of doing everything that needs to be done.
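The marker looks like this; `timesTwo` is the stock skeleton example. The `#include <Rcpp.h>` line is noted in a comment rather than written out, so the snippet also compiles as plain standalone C++ (the attribute itself is just a comment):

```cpp
#include <cassert>

// In real use the file would start with:  #include <Rcpp.h>
// and you would build and load it from an R session with
// Rcpp::sourceCpp("timesTwo.cpp"). The line below is what the attributes
// machinery scans for; since it is a comment, the function also compiles
// as ordinary C++.

// [[Rcpp::export]]
double timesTwo(double x) {
    return x * 2.0;
}
```

After sourcing, R gets a `timesTwo` function it can call directly, with the marshalling shims generated behind the scenes.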
And that's sort of parsing the signature. So we're doing that old school, pre-Clang,
by just reading it out
and then basically generating the little wrappers.
And basically the attributes package
gives us three functions that you can call
of sort of increasing order of importance.
The simplest one is evalCpp,
and you just have a string,
and I often show that with two plus two.
Then attributes picks up the string 2 plus 2
and actually writes the minimal shim around it
so that it becomes a callable function that has this expression in there
and returns the value and knows enough about what the value is
or maybe takes advantage of the fact that SEXPs are general anyway.
But it just goes off, takes as long as the compiler and linker take, whatever,
and then the result comes back and you get the 4.
And that's a great litmus test to actually check
that your compiler is set up correctly and what have you,
because Windows users install R,
but they don't have a compiler there natively.
R is very tightly integrated with just one build of GCC
that they call Rtools, so you have to have that.
And if you don't have it, you get an error.
And if you have it right, it comes back.
On the Mac, where things always change with Xcode,
you may have the right or wrong thing.
You know, same thing.
But this is a great litmus test.
So that's just evalCpp.
And obviously, you can't do very rich things with it.
I have some simple demos in there that you can look at.
You know, std::numeric_limits, templated on double,
and it comes back and you get the 1.7e308 or whatever it is.
I mean, you know, it's the representation of a double max in C++.
But the next one up is cppFunction.
And it took this idea that we had from inline one further
because it's just one function call,
essentially one argument, one long string.
And that's basically a return value, a function name,
a set of arguments, and then curlies,
and then the string closes.
And the attributes code by JJ goes in,
picks out the name.
So it creates us an R callable function of the name of the C++ function,
looks at the arguments, picks up the ones that are properly mappable.
You know, you can't put a random Boost math type in there
for which we don't have a translator; then it would fail.
But for the standard things, including vectors — both R's internal ones
that we have in Rcpp and C++ ones, so
std::vector<double>, say — it provides you that function, and then it's callable.
And if you invoke it with verbose equal true, you see what it does, because for every function
that the user supplies, there's sort of one shim that calls it, and then one R function
that goes into the other one.
So there's a little bit of extra code being generated and just does that, but it's transparent.
You can see what it does.
And then it's just there and bam, you have your function.
You've just extended your system interactively.
And our eyes glazed over because John Chambers,
the guy sort of at the core of S and then R,
taught me around that time,
showing sort of sketches from these meetings
in the 70s and a bit of the genesis of R,
that this direct extensibility is really what they had in mind 40 years earlier. And this
was the first sort of real approach that accomplished that as it was scoped, because
the Java and other language extensions that existed weren't quite as powerful. That one's
pretty neat. And then the next one is sourceCpp, which just reads an entire source file,
not too dissimilar to things I had once seen, like,
I don't know, C++ extensions for MATLAB or other things.
MATLAB is actually a commercial product.
It's quite disappointing in there, I always found,
because you can only, in each file, define one function.
I found that very laborious.
I played with it 20, 25 years ago and thought that, you know,
it would be great.
I mean, this is what professionals use, and it wasn't so great.
But with sourceCpp, you can have one or 10 or 100, if you insist, functions in a file.
It reads it in and then also does things.
And the attributes that are in there are sort of mostly that export thing where you can turn on and off.
I want this function callable or not callable, as well as other plugins that we added over the years.
Remember, this started sort of 10 or so years ago.
So we had plugins in there to say now turn on C++11 or 14 or 17.
Basically, the attributes code just reads that plugin and then emits the appropriate flags — the CPPFLAGS, the CXXFLAGS arguments —
sets the environment variables so that make and the compiler pick them up and carry on with that.
And similarly, a couple of other things, including OpenMP and sort of other natural extensions as well as package.
So I work a lot with the C++ library for sort of linear algebra called Armadillo.
So there's then a plugin that just says, OK, now just work with RcppArmadillo, because that's a header-only library.
And that makes things really easy
because you just have to look at headers.
So we've done that for a couple of other things.
But as well, some numeric libraries,
there's an old scientific library called the GNU GSL,
also going back to the 90s,
C interface, that one you have to link with.
And yeah, then we have a plugin for RcppGSL as well
and some converters for GSL type vectors and matrices
into R and out of R.
And that's what Attributes does.
Basically, it's one really impressive tour de force
by a very good engineer who kind of thought,
I think you guys could use this.
And he just showed me that in Chicago
around a conference that we organized.
And yeah, it's just, wow, this will make a difference.
And that definitely helped popularize Rcpp quite a bit
because it made the usability just that much easier.
And then sort of the next level up is what you really want to do
is create little packages because the R ecosystem is really powerful
because there's a curated set of packages, currently 19,000.
And then you can build packages also with Rcpp that connect to something.
So, for example, typically the package is talking to MySQL
or PostgreSQL or whatever, and just would rely on Rcpp or the like.
It's been quite helpful that way and quite widely deployed,
which is a treat to see.
I had started really just looking after other people's code as an open source maintainer for Debian
and always looked sort of with a bit of imposter syndrome at actual open source project authors.
And then it was sort of really timid steps of building small ones.
And then it turns out that, yeah, that I just got lucky and the time was right.
And we built some things to help people and they're widely used, which was really cool.
I couldn't have aimed for that, but it's just, yeah, it worked out really well.
It's good.
R is an interpreted language, right?
We would call it an interpreted language.
Correct.
Because of this richness that you want to work interactively with data, it allows you
to do really crazy things for computing on the language.
So it can self-modify and do other things that make
some things more expensive than others. Function calls tend to be expensive. Internally, it uses
a boatload of C and Fortran for normal operations. So if you have a vector from, well, there's
several things to that now, but even in the old model, if you have a vector from 1 to 100,
it would just send that vector down to compiled C code that loops over it and sums the values up. Something else may happen with these sequences because now we have
expressions that have it, but that's an implementation detail. But yeah, a lot of
compiled code behind that just used to be straight up C and Fortran only. The core team does not use
C++ for R itself. But yeah, that opened the door because the extensibility was already
pretty close to it anyhow. So when you do this embedded C++ stuff and it has to call out the compiler,
does it then cache that in some way so the next time that you go to execute it,
it doesn't have to call the compiler and link and do whatever again?
Most excellent question.
So that, for example, is why you would want to create it as a package
because then you can more easily persist this
and then just say, load me the package in the next session.
Because otherwise, these things that you do ad hoc, they'll disappear.
There's a technical reason behind it and another team that needed it for heavy duty
Bayesian computations, the Stan team.
And I was supposed to meet them in New York and then had flight and weather delays and
that never happened.
They actually managed to do that, because basically what happens is you call the compiler — you have a shared library on Linux,
say, but the other operating systems are the same — because R made a very early call internally to rely on
libtool and, basically, you know, dynamic libraries being callable. That's how these
little function blocks come in. We can just always load something in and then call it.
But those things would just, in your current session,
sit at a random memory location,
and they're a little tricky to persist.
It's not impossible or implausible to persist.
So someone eventually worked out how to do that,
but I was always a bit, you know, play it simple, play it safe:
just write yourself a package,
because a package is the unit made for compiled code —
packages can be without compiled code, but about a fifth or a quarter of packages on CRAN use compiled code.
And that's really the clear way to do that. Otherwise you get hit in another way. So,
you know, when our problems got bigger and we got from computers with one CPU to maybe still one CPU,
but multiple cores, and you want to unroll a loop in parallel, process in parallel, sort of
forking.
But your main session has this compiled function, and the other ones that fork off will not
necessarily get that either.
So for these things, you're also better off persisting it into a package and then have
the child processes basically invoke the package clearly to have it callable.
But that's inside baseball technical details about how we fought sort of parallel computing
back in our day. R was made to be extensible, and yeah, that's what we help provide.
And you mentioned like GSL. It sounds like what you're saying at the moment,
that the main use case is when you want some very high performance or well-known library that you
want to call, you say, I know that R is not the right language for speed's sake,
I'm going to call into C++ or C. Great example. Yeah. So I still give workshops every now and
then. And the last one that I just gave a couple of days ago, I had the slides on that again,
sort of the prototypical example of where we look glorious is something where I actually sat here
late one evening looking at Stack Overflow, which already existed maybe 12, 13 years ago,
and a kid was writing himself a Fibonacci function.
Yet again, Fibonacci, very well understood,
but super, super costly.
And because R does these things where it can compute on the language,
it has to keep track of so and so many things
with a function call stack.
So there's a bit of overhead there.
So functions aren't its strongest point.
Recursive functions compound
that problem. So he was just sitting there, wrote a Fibonacci and then said Fibonacci of 30 or
something, or 35, and his R interpreter went off for half an hour. And when you do that with Rcpp,
because the actual Fibonacci expression is just three lines. It's a great demo in these talks
with cppFunction, because you can just have that embedded in a string,
maybe with a line break in there, or even one long string.
And the speed differences vary a little between Linux and the Mac,
but on Linux, for a moderate size,
if you have 20 or 25, I get 500 times faster code.
But that's a completely silly corner case
because we know that R ties its hands
behind the back for recursion. A thing that's more common and where we had a lot of early adopters
is just simulating in loops, Monte Carlo Markov chain things, very popular statistical technique
with more powerful computers, but people still have a lot of problems, wait for too long. So
then they put these loops in and then you often get 60, 70, 80. That's sort of kind of nice. And the counter example may be, I'm not sure if you've ever seen that, but there's a
really clever little trick when you can play in simulation and random draws is how to compute pi
by simulation. Have you seen that? Oh, you, like, draw a circle and measure, kind of thing?
Exactly. It's super intuitive because you just think about the unit circle,
take one quadrant of the unit circle, and you basically just draw an x and a y coordinate and then do, you know, x squared plus y squared.
You're either inside or outside the unit circle. And that's a quarter of a unit circle in a unit square.
That's the surface area that you're after. So you just basically count in and out over so-and-so many draws and multiply by four.
And you can do that in R code because you can have this expression of getting 10,000 random number values in a vector of 10,000 easily,
squared, summed, and square-rooted in one expression too, and then just count how many of those are larger than the threshold value.
And for that particular problem, going from R to C++ only gives you about a threefold speedup.
So not every problem translates in equal measure, because if the code inherently was already heavily vectorized or making use of internal operations that are actually C-heavy, then the gain will not be as big.
Then you're just basically saving time because the interpreter updates the value,
assigns something, sort of the little things,
and you're replacing all of that with compiled code.
So you're still saving, but not the same amount.
So that I find is a really good mental model
that there may be code where you're not gaining that much,
or you may be gaining 50, 60, 80 times for loops,
our bread and butter problems,
or in corner cases, you may get hundreds fold,
but that's corner cases. That's not typical. And by the same token, you know, people have used it,
of course, to go to anything and everything, image libraries, databases, computational things.
Yeah, it's in a lot of places because there's a lot of great libraries out there because
the FFI, C and C++ provide a really rich set of things to work with.
And then people just have an urge to do that.
And they can provide the glue relatively cheap with RCPP.
You use GCC, MinGW, I'm assuming, on Windows and GCC on Linux.
But you said, I might be projecting my own troubles with macOS here onto what you described,
but it sounds like sometimes it works on Mac, sometimes it doesn't,
depending on whether or not Xcode has been updated recently.
Yeah, the Mac can be challenging, but it has been solved.
The R core team provides a set of tools,
so you can download basically versions of gcc, gfortran, g++ that work with it.
But people don't usually start there.
They start with brew, and then you need to coordinate it.
But then on the Mac, you also need Xcode.
I never had a Mac.
I left that piece of trouble out of my life.
So we get a lot of questions on that.
I mean, at the GitHub repo, it's like overflow,
because it tends to change with the releases,
but it's very, very widely used, and it's all solvable.
It reminds me of my youth.
I mean, when you just had to go
to some random websites,
download certain things,
glue them together.
I find I'm spoiled
on a moderately competent Linux distribution.
I mean, you don't have any of those issues.
You dnf the tool in, or apt install the tool in,
and it just works.
And on these Unix-y operating systems,
any compiler really works.
So it can be Clang,
can be GCC, G++.
We had users using the Intel compilers back in the day,
when those still had a bit more market and mindshare.
Because again, the real key is the thing that R wants
is a C interface.
It's this call that returns an S expression pointer
and takes zero, one, or several S expression pointers.
And that's the C level interface.
And you just have to set yourself up
such that ultimately you can call a C function.
And people have used the same thing
from any and all interesting languages.
There's R bindings to Rust.
There's R bindings to V8.
There's, of course, several packages to Julia.
And ultimately, they all meet in the middle
of that C interface because they can.
It's just we've been handed sort of a secret weapon by template metaprogramming
because we can have these converters happen automatically at the interface point.
And that's really cool.
I'm just imagining that the R Rust interface is just called RR.
No, that one's taken. rr, isn't that the...
The rewindable debugger.
Oh, right, right, right, right.
So I think the Rust project is called RustR or something.
They're one or two, and they're in cargo this and cargo that.
Can I pass an R function into C++ with these interfaces?
I don't remember if you covered that.
You can, because everything is an S expression pointer.
Everything's an object.
And so in the C++ code, what it then realizes is just, oh, this is an R function that can be
called, and at that point you can provide it a payload. But what it does behind the scenes,
because Rcpp really is an extension to the R system, you're sitting in the R system,
so this piece of code never exists
outside of an R interpreter somewhere.
So it just then basically calls up,
has the workload evaluated,
and brings the result back to you,
which is super powerful for some things.
For example, some really early cases
were numerical optimizers
or things where you needed initialization
or things like that,
sort of things that you maybe just call once
and then you call something else a thousand times
and that one thing may be a bit more expensive,
it doesn't really matter,
but you can then get the seeding, setting, parameterization,
whatever just done in the R call and then you go off.
It provides a nice bit of flexibility.
I think we still do that even in Rcpp for one or two things.
We just call back into R because it was sort of simpler
or certain functions weren't available
on all operating systems back in the day.
Yeah.
Something related around string parsing of time.
It was strptime, I think.
I have to check again.
But yeah, you can.
Okay.
It sounds like Rcpp has been around for a while.
Have you kept up to date with, you know,
the C++ standard as it's, you know,
brought in lots of new features?
Like, can you use lambdas with Rcpp, for example?
Yes. So, great question.
Because we were as excited as everybody else
has been over the last decade
about everything that came to us
and the toys that we've been given.
The active constraint often was
which compiler was provided by Windows.
Because, again, on the Mac, you sort of get them from several places,
can work them, and then it's whatever that version is.
On Windows, it was very boiled down.
And for a really long time, we had G++ for nine, and that was it.
And that, for example, provided a fair amount of C++,
including TR1 for C++11 and other things, but not all. At one point,
I was working with a time zone and time parsing library from Google. And I think I was using
IO streams to print something. And that particular one, on Windows — I think the g++ implementation of the
C++ standard library was missing one function. And then a friend helped me port that over from Clang so that we had it on our operating system. That tends to go in leaps
and bounds. So that use of G++ 4.9 got replaced after a long wait about two years ago. And on
Windows, G++ 10 is now used. And that opens the door up to quite something. So R itself, being written in C,
doesn't mind and just looks at,
basically the position is
whatever the user's compiler supports can be used.
And that's affected to where it comes in.
So these days — R tends to now release annually in April,
so we'll have a new one coming up in three weeks —
but I think it was two years ago that the default, if you don't specify anything otherwise, became C++11.
And the release last year moved it to 14.
But it's this thing that you are as constrained as the equipment that your user uses.
And having JJ around and his RStudio experience has been really good.
So we kept this very conservative for Rcpp.
I was just reminding a team member about that the other day
because he thought that we sort of passed that watershed
and are now at C++ 11.
No, no, no, no, no.
We still have C++ 98 code in there.
And with that, a bit of code bloat: we have defines
that will either do it this way or that way.
I mean, the much-hated varargs and lots of other stuff.
So it's good in the sense that
if your use case has a new enough compiler,
then you can use whatever the compiler supports.
There are definitely C++17 using packages on CRAN,
including the one from my day job.
And that's no issue either
because the compilers do that.
That's been fantastic.
There was one use case,
and I think that only had to do with
some interop that we do with some other parts. So I knew that I was protected by if-defs for Linux.
I got to use the file system standard and the regular expressions in C++, and that was just
one of the most joyful days ever. I mean, that was like working in a scripting language. This is just
amazing. It's just not as
commonplace yet as we really wanted, but that was great. So yeah, it's there. We're moving quite a
bit towards that. You have to be mindful of your users who may still be on a, you know, CentOS 3
system from 1974. So I exaggerate, but we still see 7 every now and then. It's just bad.
And I feel bad for people there, too, because Rcpp really is glue code,
and we are then the error message.
And then someone goes off and wants to install a really large bioinformatics pipeline,
and it's sort of a complex composed of 20 packages directly with recursive dependencies and maybe 200 of them,
and then somewhere in there,
and one of them may be doing something relatively modern
and then it just all ends
in tears because they
what does CentOS call that, devtoolset
or whatever. I mean there's always an out that you get at least
to G++ 8 or something like that but
sometimes they're stuck with really old ones and then
it's bad. But yeah I really look forward
to talking to you guys again in 10 years
when I'll show you what we do with the really keyword
in C++23 behind the scenes in Rcpp.
It's good. It's a nice option to have, but we're trying to be
reasonable and careful and not impose too much pain on
users. If we've got a listener who says, I've never played with R
before, but I'm really curious to try your bindings.
Rcpp.
Is there a getting-started guide somewhere,
something that would take them from not having R installed to this Fibonacci example?
Yes.
We have it somewhat easy for Rcpp, the package and project,
because we don't really have to force you to bring the tooling in.
So you have to start somewhere and install R.
So you just go to, you know,
rproject.org and download it
for your particular operating system.
Where if it's Windows, you have to make,
and you want to do RCP, not just R,
then you also have to do the extra step
of doing a compiler.
But even that is helped and automated.
These days, de facto,
maybe 90% of users drive R from RStudio,
which is a freestanding application, desktop app,
that looks cross-platform, behaves the same on all operating systems.
And that one, for example, on Windows,
when it sees you call out to Rcpp,
checks the path and informs you that,
oh, you actually don't have Rtools.
Do you want me to get it for you?
And does these things.
So there's a little bit of user bells and whistles.
So you first have to get R,
and then you have to learn how to get a package,
which in RStudio is mostly just a button click,
and then you're there.
And documentation is sort of taken reasonably seriously
by the R community at large.
So we have 10 PDF vignettes in the package,
including an older document for intros
and a newer document for intros, as well as others with more technical detail.
It gets head-spinning and hair-splitting relatively quickly, but there's one vignette on the attributes internals that we discussed earlier, and others.
And the getting-started examples are there.
And we have one other good website, too.
Yet another brilliant idea by JJ.
He once grabbed the rcpp.org domain and handed it to me.
And then we created a website, which is basically, by now it's sort of, I think, 110
self-contained short little stories, posts in Markdown, uses Jekyll behind the scenes. And
then it has sort of the, here's the problem that we're solving here. And in some cases, it's just
how to create a particular data structure, or how to do another thing, or how to do a particular interface. But they go from relatively simple to obscure at the very tail, and they're also a good place to start.
Awesome.
That's gallery.rcpp.org, and it's hosted at GitHub and everywhere, so with a bit of Google one can get places. And we have by now, I think, 2,800
questions on Stack Overflow, so that's a pretty good corpus, too.
And their search engine actually isn't all that bad either. So, you know, things come up, sort of, you know, how do I do JSON with R, sort of something. And then it would, you know, point you to the couple of packages that do those things.
Well, it sounds like RStudio is free and open source as well.
Right.
Okay.
Well, Dirk,
thank you so much
for coming on the show and telling us all
about R and Rcpp. We'll put, you know, links in the show notes. And yeah, thanks for coming on.
Thank you so much for having me. That was awesome. I hope your listeners will enjoy it.
Yes, thank you very much.
Thanks so much for listening in as we chat about C++. We'd love to hear what you think of the podcast. Please let us know if we're discussing the stuff you're interested in, or if you have a suggestion for a topic; we'd love to hear about that too.
You can email all your thoughts to feedback at cppcast.com. We'd also appreciate if you can
like CppCast on Facebook and follow CppCast on Twitter. You can also follow me at Rob W. Irving
and Jason at Lefticus on Twitter. We'd also like to thank all our patrons who help support the
show through Patreon.
If you'd like to support us on Patreon,
you can do so at patreon.com slash cppcast.
And of course, you can find all that info
and the show notes on the podcast website
at cppcast.com.
Theme music for this episode
was provided by podcastthemes.com.