CppCast - Rcpp
Episode Date: April 21, 2022. Dirk Eddelbuettel joins Rob and Jason. They first talk about an updated C++ web framework, and whether C should be considered a programming language or a protocol. Then they talk to Dirk about the R programming language, and Rcpp, the R/C++ interop library. News: Crow v1.0 released; C++23 will be really awesome; C isn't A Programming Language Anymore. Links: Rcpp; Rcpp: Seamless R and C++ Integration; Patreon; CppCast Patreon
Transcript
Episode 346 of CppCast with guest Dirk Eddelbuettel, recorded April 5th, 2022.
This episode of CppCast, we're thanking all of our patrons on Patreon.
Thank you so much for your ongoing support of the show.
And if you'd like to support the show too, please check us out at patreon.com slash cppcast. In this episode, we talk about a C++ web framework.
And we talk to Dirk Eddelbuettel.
Dirk talks to us about the R programming language and Rcpp, the R/C++ interop library. Welcome to episode 346 of CppCast, the first podcast for C++ developers by C++ developers.
I'm your host, Rob Irving, joined by my co-host, Jason Turner.
Jason, how are you doing today?
I'm all right, Rob. How are you doing?
I'm doing okay. I know you're still right in the middle of the C++ game jam.
How's that going?
I am in the middle of the game jam right now.
And it seems that, you know, I'm expecting at least like two or three submissions or something like that.
I honestly have no real idea how many people are involved.
I know how many people have joined my Discord and are having a conversation. I think one of the more interesting parts that's come out of this
is there's been a couple of people who have mostly like,
well, they're like, well, I don't know if I'll have enough time
to actually participate in the game jam itself.
But they have been hanging out on the Discord,
helping other people get set up with the example project
and make sure that they can get moving when they need to and everything.
Yeah, that's been cool.
I did look at it on Twitter the other day and clicked on the hashtag and saw a couple people
posting screenshots and little gifs.
Oh, that's cool. I must have missed a couple of those.
Yeah, I saw that there are a lot of bouncing circles around in the console window, that sort of thing.
Yeah, one of the ones that I did retweet, the circles bouncing in the window, is actually from Arthur, whose last name I forget at the moment. I know we've talked about FTXUI
and a bunch of really cool console-based tools.
We should probably really get them on at some point
to talk about these things.
Yeah, definitely.
Well, at the top of our episode,
I'd like to read a piece of feedback.
We got a lot of tweets about last week's episode
or I guess two weeks ago
when we talked with Logan about Julia.
This tweet was saying, thanks for hosting an episode about Julia.
There was a point raised about Julia in real-time applications,
which is worthy of further discussion
if Julia is to be promoted as solving the two-language problem,
especially in robotics and embedded applications.
So yeah, it was definitely great having Logan on the show,
and it seems like a lot of people really enjoyed that episode,
but it seems like there's a desire for us to be able to go a little bit deeper into Julia. Maybe
we should get a compiler engineer who works on Julia on sometime. Similar comments from a friend
who's a listener of the show that said, yeah, if we could get someone, you know, back end or
something like that from Julia to really be able to dig into these details, that could be really
cool. Well, we'd love to hear your
thoughts about the show. You can always reach out to us on Facebook, Twitter, or email us at
feedback at cppcast.com. And don't forget to leave us a review on iTunes or subscribe on YouTube.
Joining us today is Dirk Eddelbuettel. Dirk is a senior data scientist, engineer, and quant with
extensive experience in research, development, and trading. He works at TileDB. Dirk is also an adjunct clinical professor in statistics at the
University of Illinois Urbana-Champaign, where he created and teaches STAT 447, Data Science Programming
methods. Dirk has been contributing to open source, mostly Debian and R, since the 1990s.
Besides looking after numerous Debian packages, he develops and or maintains a number of projects, many around R, often involving C++ and making use of R CPP.
Dirk also is a co-creator of the Rocker project, bringing Docker to R.
Dirk, welcome to the show.
What do you do at TileDB? You say quant and trading, but TileDB does not sound like the typical trading company name to me.
That's right. That's right.
So I've been working since the 90s and I worked for 20 years in financial services writ large,
trading investment firms, from analytical roles, and pivoted over into toolchain building and
integrating things as one does as a quant, as we were called then before data science.
But I left finance two and a quarter years ago, and now I'm in a startup building,
being part of a team that builds a new universal data engine.
That's TileDB.
It's an MIT-licensed, really clever C++ backend.
So I can send you pointers.
Maybe that's good for a follow-up.
Instead of a format specification, we really just have an API,
and then it binds to just about everything,
because just about everything can use FFIs and C interfaces.
We're using a very clever thing that I finally learned the term for: the hourglass pattern.
Have you seen that?
There's a good CppCon 2014 talk about it.
Because we're using C++, but the bindings are all C, so we can actually have pre-built libraries to build against. And I basically look after the R integration because R is one of the data
languages which you would want to connect to a universal backend.
Well, I mean, unfortunately, we didn't bring you on to talk about TileDB today, but maybe
we'll talk about it more through the course of the interview. But is it like a
traditional relational database, SQL-backed, that kind of thing? Or what are you talking about?
No, it's storage. It's files.
And it just abstracts that away.
So it's native to the three main cloud engines and is new enough so that cloud was built in rather than after the fact.
If you wish, a more modern version of HDF5,
if you've come across that, with cloud support in there.
And the basic gist is that multidimensional arrays
are actually fairly universal and can be used for just about anything.
So we have very big uptake in geospatial,
point cloud data, geoinformatics, standard data frames,
just things that can be indexed by one, two, or many dimensions, dense or sparse.
And we have a unified approach for those whether they're dense
or sparse. So it's pretty cool.
Sounds directly related to
the GDAL episode we did, yeah.
That's right. And we work with Howard. I saw
that he was on just a couple of weeks ago. That's right.
So we have direct PDAL and GDAL
support and all those things and
work with Howard quite a bit. It all comes
together. That's right.
Alright.
Well, Dirk, we got a couple news articles to discuss. Feel
free to comment on any of these, then we'll start talking more about R and Rcpp. All right. So this
first one, we have a Reddit post and a GitHub link. And this is a year and a half ago, I picked
up an abandoned C++ web framework. And today, it was released as version 1.0. And this is Crow,
a C++ web application framework. I'm trying to remember if we talked about it like a long time
ago, because it did get abandoned at some point, but maybe it's possible. I don't know. But you
know, there's lots of comments in here, you know, thanking these people for working the project and
bringing it back to life. And it sounds like it's, you know,
a pretty useful C++ web framework. Yeah, I'm so not familiar with these things. But I did dig into
the like, how to write your first application. And I guess it's so you could set up a REST server
of some sort, basically in C++ and like 10 lines of code. That's how I'm looking at this.
Pretty awesome.
It reminds me a little bit of a toolkit called Wt.
Have you seen that?
It's not that well known.
It's a bit like Qt, and they play on that with the name,
but it's really made for the web.
And we'll get to RCPP in a minute,
but let's sort of do a little bit of a cottage industry
of various things that one can do with it.
And one use that I actually had very early was taking R and embedding it into C++ applications.
And then you sometimes have an embedded web server in there. So long story short, I also have a
really small C++ class framework called RInside that allows you to stick R,
the statistical language we've been discussing, into a C++ app, and then led to various front ends
where you could present that,
that you have sort of analytical capabilities behind.
So I first did something with Qt,
and I've forgotten where the pointer came from.
Maybe I just saw it somewhere on the web,
something called Wt, and I just put up
sort of standard applications.
It's still not quite clear whether you want
quick web apps
really written in C++
because there's a little bit more
tooling and getting going.
But, you know, if you want it that way,
they're there.
And I played a little bit with it,
but it's a bit out there, I guess.
Same with that framework that you showed.
It's intriguing,
but maybe not the most mainstream of things
for web applications.
I feel like my first thought when I saw this was, okay, I'm going to connect my C++ application
directly to the internet.
How well has their URL parser been fuzzed?
That's what I want to know straight away.
I'm assuming that the authors of the project have taken care of that kind of thing.
I feel like, hopefully, fuzz testing and whatever to make sure there are no obvious security flaws in the thing that you connect directly to the
internet is becoming more standard practice these days. Okay, next article we have is a blog
post. This is on kdab.com. C++23 will be really awesome. And this is an April Fool's post.
It took me a second, Rob. Yeah, it's one of those ones where you kind of wish
that this was a real feature coming to us in C++23.
Just to give a brief overview,
it's about a fictional keyword being added,
the really keyword,
which maybe makes C++ behave a little bit more
like how we sometimes might wish it would.
Brought a smile to my face.
I do get fooled just like other people every now and then,
but he had only talked about two paragraphs
and when they repeated the keyword
for the next sort of little use,
it's really well done.
I mean, including sort of the bits of asm in there
and what have you, but yeah, really?
It needs to be partnered with 'just',
to make everything more verbose, so that we can, like, put more stuff in. Really? Yeah, there's a comment here similar to
that. Indeed, like 'really' and 'indeed' and a couple of others. Honestly. Oh, you missed the 'honestly'
proposal? That was like, I really honestly want to allow this conversion. Well, we have explicit, so yeah. Right.
Okay.
And then the last thing we have is another
blog post, and this one is
called C isn't a programming language
anymore. And this
makes reference to another blog
post we talked about recently from
Shahid Manid about
C-ABI. This person is apparently working with Shah Manid about C-ABI.
This person is apparently working with Shahid
on the C-ABI problem, but
this post is basically how they're coming at it from
a very different perspective, how
they don't really care so much about
C itself, but
because C has become
the lingua franca of programming,
it needs to be improved, although this
person seems to be thinking it's not likely to get improved. Yes, it's lingua franca. It can't be improved. It's
Latin, seems to be effectively their argument here. But this falls, I think, comes right back
to Dirk, your comment earlier about the hourglass pattern, which if I understood right, is basically
you're using C to funnel your C++ to whatever, pick other language, right?
Right.
That talk is actually really relevant, because we do that at TileDB, and it works.
Because on Linux, we have this problem that you otherwise need to rebuild; but for the integration, we can, in CI, build the static libraries because they're C libraries.
Because C is portable, whereas the C++ signatures change so much that you can't.
And it just puts a little bit of extra effort initially on things
because you have a core that's all C++, you have all the abstractions that you want,
but then everything that you expose, you expose at plain C functions,
and you wrap those, again, with a header-only layer
that makes it really cheap to use and deploy it.
And that way the client sees the C++ API that's high level,
but that C++ API internally just pivots to C
and then the linkage is at the C level
and all of a sudden it's beautiful
because you can actually pre-build and ship libraries.
A bit like what Python does with wheels and other things,
and that's otherwise a problem
that we don't really have licked
for deployment on Linux.
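The hourglass pattern Dirk describes can be sketched in one file (in a real project the C++ core, the C header, and the header-only wrapper would be separate files; all names here are invented for illustration):

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// --- Layer 1: internal C++ core (in a real project, the library's .cpp files).
// Full C++ freedom here: classes, std::string, templates, exceptions.
namespace core {
    class Greeter {
    public:
        explicit Greeter(std::string name) : name_(std::move(name)) {}
        std::string greet() const { return "hello, " + name_; }
    private:
        std::string name_;
    };
}

// --- Layer 2: the narrow C waist, the only symbols clients link against.
// Only C types cross here: an opaque handle, char*, int. Because the ABI is
// plain C, prebuilt binaries link cleanly from any compiler or language.
extern "C" {
    typedef void* greeter_t;

    greeter_t greeter_new(const char* name) {
        return new core::Greeter(name);
    }
    int greeter_greet(greeter_t g, char* buf, int buflen) {
        std::string s = static_cast<core::Greeter*>(g)->greet();
        if (static_cast<int>(s.size()) + 1 > buflen) return -1;  // C-style error
        std::copy(s.begin(), s.end(), buf);
        buf[s.size()] = '\0';
        return static_cast<int>(s.size());
    }
    void greeter_free(greeter_t g) {
        delete static_cast<core::Greeter*>(g);
    }
}

// --- Layer 3: header-only C++ wrapper shipped to clients. It compiles in the
// client's own translation unit, so no C++ symbols cross the library boundary.
class Greeter {
public:
    explicit Greeter(const std::string& name) : h_(greeter_new(name.c_str())) {}
    ~Greeter() { greeter_free(h_); }
    std::string greet() const {
        char buf[256];
        int n = greeter_greet(h_, buf, sizeof(buf));
        return n >= 0 ? std::string(buf, n) : std::string();
    }
private:
    greeter_t h_;
};
```

The client only ever links against the three C symbols in the middle layer, which is why prebuilt binaries keep working across compilers and language runtimes.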
On one hand, it sounds like a lot of work.
On the other hand, it sounds like it really makes you think about
what your API needs to be
so that you really get the succinct,
you know, this represents the problem we're trying to solve kind of API.
Yeah.
One of the most interesting things I thought I read in this post
was talking about how no one wants to write their own way
of parsing C header files and everything.
Everyone just winds up embedding or relying on Clang
or some other compiler.
Well, I don't know.
I found the post very ranty.
I just glanced at it.
It's a bit ranty, yes.
I think it's just off base because it is 2022
and we're still using C for all these things
because it stood the test of time.
And he's ranting a little that you don't quite know
what the size of a variable is and all the rest of it,
but the spec left that out.
And for that, we have header files and defines,
and we switch. This is not a dynamic interpreted language, so we're not reading the parser and
doing something directly. It will have an #ifdef on this architecture and that architecture.
I'm so old that I actually remember when we moved from 16-bit to 32-bit. And now we're mostly in a
world of luxury because everything's 64-bit and it's rich enough and enough resolution and all the rest of it.
And you don't really have these portability issues anymore.
But those used to be there and C, for all its warts,
is so powerful and general enough that it accommodated all those changes.
And that's why it hasn't gone away.
And that article won't change that.
Right. All right.
Well, Dirk, we already made a few references to Rcpp,
which is going to be the main thing
we talk about today.
But before we get into that,
I was wondering if you could maybe
just give us a brief overview on R itself.
We've obviously mentioned the language
plenty of times on the show,
but we've never dedicated an episode to it.
So could you tell us a little bit about R?
Absolutely.
FAQ calls it a language and an environment,
and it's meant and designed to work interactively with data,
but being very extensible.
It actually goes back, like many of the things we just mentioned and care about,
C, C++, Unix, to Bell Labs.
At Bell Labs, they also had, other than all the computer science research,
a very powerful statistics research department,
and they invented themselves a system in the 70s that became the statistical system,
which got shortened to S. But then the phone company was a regulated monopoly, which was not
allowed to license IT products. That's why Bell Labs never made money really off Unix or the
language and all that stuff. It was the same with the S language.
So that got licensed eventually,
and there was a commercial implementation called S+.
And then in sort of a very ironic twist of fate,
two lecturers in Auckland in New Zealand
were frustrated with the vendor and the license,
because at the time they didn't have a binary
for the Macs in the teaching lab.
So they kind of said, shucks this, and just built their own. So R was started around '92, and by '94 or '95, it was sort of a new project. And as we were all getting going with Linux at the time and,
you know, the comp.os.linux.announce newsgroups, I first read about R then, kept an eye on it,
it was really small, but grew.
And I started playing with it in the 90s and then switched in the 2000s.
So R is a reimplementation of S.
And S was once described, and S and R were once described by someone equivalently,
that the best thing about them is that it's a language by statisticians for statistics.
And the worst thing about it is that it's a language by statisticians for statistics. But, you know, I like to follow up on that with the gorgeous quote by Bjarne,
which just basically flips that there's two types of programming languages,
the ones that everybody complains about, and the ones that nobody uses. So it's an evolved language
that has been in use all this time, and is quite widely used for things that have to do
with data at large. So, you know, around statistics departments, but also other data centric things.
And yeah, because of this design from the 70s that fit really well into this mold of workstations
from the 80s. When I was growing up, that was the ideal: real scientists, workstations, big monitors,
and then this model that you have a powerful central processing unit and enough RAM, and your data sits in there.
So that's sort of the first sort of approach that was initially written.
You work best with the data that you can hold in memory.
And then, of course, data sets always grow.
So other approaches get developed to sweep over data
or distribute it on different machines.
But by and large, it's sort of meant to work with data, explore data, transform data, and be extensible around data.
So the main creator of the language is a fellow named John Chambers.
And he had always put up two dictums, basically: that everything in S and R is an object.
So everything that you represent is sort of an object,
a number or the result of a fit or a plot or what have you.
And everything that does something is a function call.
So a function call would work around objects.
And then the next sort of close thing to that was extensibility.
So there were some reasons that meant we got really lucky
when we started probing with this Rcpp thing.
And I got particularly lucky
with really excellent contributions by collaborators
that pushed it much further
than I had planned at the times, a few times.
And yeah, it became what it is.
And it's one of the different extension systems for S and R.
It isn't even the first one,
but when I got more serious with
S and then R, we were sort of coming out of the 90s
and we were beginning of this new century
and Java was still more active
and C++ was a bit stagnant.
C++ 11 hadn't happened.
The standards committee hadn't gotten back together yet.
So everybody thought Java would be the next best thing.
So at the time,
most advanced extension mechanism for R
was actually a system called rJava
by one of the R core team contributors.
I looked at that and kind of think,
oh, you know, he has 60, 80 packages using it.
And that was a big deal for us
when we reached that number
and we went a little bit further by now.
So rJava is still at about 120
and we have about 20 times that now.
So it just turned out that we picked something
where the shoe really fit.
There's some things internally
about how R goes about things. It's a child of these designs
before OO, really. These sketches in the 70s and some core development in the 80s and 90s before
C++. So one key aspect is that everything that's held inside of R is always an S-expression
pointer object (a SEXP), which is an old-school C union type, if you've seen those, which is where you basically have multiple aspects in one
by just having a controlling type in there. Oh, today you're an int, or
you're a closure, or you're a function, or you're a double, and sort of things like that.
And that makes the interface really simple because everything goes in and out of
SEXPs. And so I was first interested
in extending R to C++ libraries. That's something I played
with in other systems too, to other libraries, often database backends. And I wanted to do that
for a financial library called QuantLib that had also started around 2001, was really far
out there and forward looking with extensive use of Boost and at the time really
modern C++ idioms and still did really well.
And I got some help about how to connect that to R and just marshal some simple values back
and forth in a very pedestrian way because as an MVP, you just have to start somewhere
and have a first feasibility study.
Then I had that and I worked for simple things.
And out of the blue someone came
and realized that, well, you should use C++ and a little bit of templating, and then we can at least
translate char vectors, int vectors, double vectors automatically and pass them back and forth. And
that at the time was called RcppTemplate, and then Rcpp. And that collaborator
eventually got disenchanted with a couple of things, including open source politics; it was all very volatile, and he left after a year or two. And then
it was just there, and I was still looking after it. And then I needed something like exactly that
for a problem at work related to the embedding, and then I played a little bit more with that,
and that was sort of 2005, '06, '07, I think. And around that time, also '08, '09,
I was interested in connecting it to Google's protocol buffers library,
multilingual code generators.
So I reached out to a fellow whom I had met at one of the conferences,
a French guy who I knew was very fluent with Java
and had become a collaborator on this Java extension.
I told him, look what I'm doing here.
He said, oh, this is really neat.
So you're using C++.
How does one go about C++?
Literally, he sort of asked me.
I sent him an email.
I really like these C++ books, and a couple of other things.
And he basically inhaled those
and became one of the most amazing C++ programmers I've worked with
and wrote a bunch of the API design as we have it now. A lot of that really is
Romain's, and he did really good work there, and we're still taking advantage of that and profiting from it
today. And yeah, that's just basically it. With C++ we can do some tricks around passing values
back and forth and that makes the integration much more direct and seamless is the term we
often use. If you've seen things like,
it's sometimes called inline in a couple of languages.
I think I first saw it in Perl.
I can't remember whether Python had it,
but that's basically the idea that you're sitting
in one of those high-level script languages,
and then you want to extend it with some compiled code,
and you more or less pass it down as a string vector,
as a long character variable,
and then sometimes it just picks it up,
maybe prefixes it with a header, invokes a compiler, links it,
and it so happens that R actually has really good abstracted tooling
across the operating system where you just say,
R CMD compile, R CMD SHLIB for the shared library,
then load it, and so we can invoke those behind the scenes,
and yeah, that made the extensibility
much more straightforward and powerful. And that, that appeals to a lot of people too.
That kind of sounds like how you might do inline assembly in C++ if you were so inclined.
Like just open an asm block, put some assembly in there.
Yes, but on steroids. For many data types, you know, I can marshal an R function, or I can call R from C++ code
and back and forth.
It's pretty good.
There's still sort of some issues
about what gets mutated
and what doesn't get mutated.
We have some constraints in there
because everything's really
underneath the C-level pointer type.
So you have some difficulties
really protecting things,
but by and large, it's pretty good.
So do I understand correctly,
you're saying when you're doing
this function call handling
back and forth,
that everything in R is the same type, basically, but you said it's a union. So it could be a double,
it could be an int, it could be a function pointer, something like that. In all seriousness,
it sounds kind of like using JSON for marshalling data between applications, because you're like,
well, I got this data structure.
Everything is either a double, an int, an array, or an object.
And now I just have to unpack these things on my side
in some meaningful way.
Does that sound fair?
Not a bad analogy, but JSON, of course,
has its cardinal sin of being untyped.
Right.
And here, at least, we are typed.
We just have this union type in the interface.
But down below, it's then something else.
That creates some friction for us sometimes with dispatching.
So, for example, we can't do this trick that C++ does,
where you have a function foo with five different signatures
because it's five different types that it's called with.
At this level, we're calling foo with a single argument
that is an S-expression pointer,
and you then inside have to look at the payload and switch
to whether you want to do this or that.
So it's sometimes these little things.
But other than that, it's yes.
The serialization example isn't perfect,
but you can think of it a little bit that way.
But we do get types and other things.
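A miniature stand-in for the dispatch constraint Dirk describes. The real SEXP has many more payload kinds, but the shape of the problem is the same; the names below are invented for illustration:

```cpp
#include <cassert>

// Miniature stand-in for R's SEXP: a C-style tagged union whose controlling
// "tag" field says what the payload currently is.
enum Tag { INT, REAL, FUNC };

struct SExp {
    Tag tag;
    union {
        int    i;               // integer payload
        double d;               // double payload
        double (*fn)(double);   // function payload
    };
};

// Every call crosses the boundary as foo(SExp): there is no overload set on
// several static types, so dispatch happens inside, by switching on the tag.
double foo(const SExp& x) {
    switch (x.tag) {
        case INT:  return x.i * 2.0;
        case REAL: return x.d * 2.0;
        case FUNC: return x.fn(21.0);  // invoke the stored function
    }
    return 0.0;
}
```

Where C++ would pick an overload at compile time, here every call arrives as the same type and the branch happens at run time inside the function.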
So in R, can you define your own types, your own structs
that contain whatever in them?
You can, but then you have to provide the mappers
because you can have something as rich as you want,
like these protocol buffer objects.
But then you have to define some interface
that gets it into R and then again unpacks it.
So if that makes sense.
So if the protocol buffer, say, is a list that contains a vector and a matrix
and three strings as attributes or whatever,
so then you bring them to R as a matrix
because it knows what that is,
two strings and whatever else,
or just a vector,
and then you can sort of just unpack them
into the base types,
how structs basically are compositional
of these core things.
And the nice thing is the core things we can do.
But it's extensible because you can add the particular converters
for the classes and types that you're interested in.
That's reasonably advanced use, but the doors are open and people have done that.
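A pure-C++ sketch of that converter idea. In Rcpp itself the extension points are specializations of `Rcpp::wrap()` and `Rcpp::as<>()`; the types and function names below are invented stand-ins that just show the decomposition into base parts:

```cpp
#include <cassert>
#include <vector>

// A rich C++-side type, like a protocol buffer message: R has no idea
// what this is as a whole.
struct Matrix {
    int rows, cols;
    std::vector<double> data;   // row-major payload
};

// A stand-in for "a list of base R types": an integer vector for the
// dimensions and a numeric vector for the values.
struct RList {
    std::vector<int> dims;      // c(rows, cols)
    std::vector<double> values; // the numbers
};

// Converter out: decompose the rich type into base parts R understands.
RList wrap_matrix(const Matrix& m) {
    return RList{{m.rows, m.cols}, m.data};
}

// Converter in: reassemble the rich type from the base parts.
Matrix as_matrix(const RList& l) {
    return Matrix{l.dims[0], l.dims[1], l.values};
}
```

Once both directions exist, the interface layer only ever moves base types, and the rich type round-trips through them.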
Is there anything like a preprocessor available or whatever,
if you wanted to say, please automatically write this interface layer for me?
No, we don't have a parser that is a code generator for those steps.
But you could do that conceivably.
In a previous life, I spent a lot of time with SWIG.
So when we talk about cross-language interoperability,
the way SWIG works often comes to my mind.
It's a really great example because the QuantLib project
that I bound things to by hand
to make this happen,
and out of which,
out of these interactions,
RCPG really grew,
always had interfaces to SWIG,
but there were some difficulties.
I mean, SWIG is really promising
and powerful
and can try many languages,
but it's complicated.
I mean, at the time, for example,
it didn't do shared pointers in C++.
So yeah, there are issues.
At one point, and someone else worked on that,
at one point we had a completely working
R interface with SWIG,
but then it became one really huge compilation unit.
It was unreal, lots of resources.
But yeah, it's an interesting problem
and a very interesting project
that we were all very much in love with.
But then over time, it became apparent that it's really, really hard
to do it in full generality for all client languages,
and it didn't really carry the day all that much.
I mean, it's still used, I guess, for some projects,
but the scope is a bit more reduced. It seems like its effort to be generic and work with so many languages
means that for any given language, there's a more efficient, better way to do it.
Right. Yeah. I use that as a mental model too, exactly. That's often true.
So you've mentioned code generation a couple of times, and that was one of the things that
caught my attention when reading about RCPP with attributes being used to declare a C++ function as callable from R.
Can you tell us a little bit more about how the code generation works?
Hey, everyone. Quick interruption to let you know that CppCast is on Patreon.
We want to thank all of our patrons for their ongoing support of the show.
And thanks to our patrons, we recently started using a professional editor for CppCast episodes. If you'd like to support the show too, please check us out at patreon.com
slash CppCast. Absolutely. So I had mentioned that the first pass that we used was a thing
called inline. And someone had, I think, already written that; it was already on the package repository, which for us is called CRAN,
in a wordplay on CPAN or CTAN or whatever.
And we'd adapted inline to work with RCPP.
And that was already pretty good,
but not quite good enough.
And then someone who became a friend and collaborator
was just quietly working on a generalization.
And that's JJ Allaire, a co-founder of RStudio.
And he did that. And he basically contributed the attributes extension to Rcpp, and with that
then became a team member. That was mostly written a decade ago, I want to say, just about.
So attributes already existed for C++ as a notion, even though the compilers didn't have that. So we
do something that looks
like it, but we hide it behind a slash slash. We hide it in a comment. But basically the attribute,
the most simple one is that one line above your function signature, you say slash slash and then
in two square brackets, delineated for regex parsing, you say Rcpp colon colon export.
And that basically says that this function that we're seeing here, we want to call from R.
And then all the beauty that makes this happen, by and large, is some 3,000 lines or something like that on the order in source file attribute cpp, where JJ just goes over the business of doing everything that needs to be done.
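The marker looks like this; `timesTwo` is the stock skeleton example. The `#include <Rcpp.h>` line is noted in a comment rather than written out, so the snippet also compiles as plain standalone C++ (the attribute itself is just a comment):

```cpp
#include <cassert>

// In real use the file would start with:  #include <Rcpp.h>
// and you would build and load it from an R session with
// Rcpp::sourceCpp("timesTwo.cpp"). The line below is what the attributes
// machinery scans for; since it is a comment, the function also compiles
// as ordinary C++.

// [[Rcpp::export]]
double timesTwo(double x) {
    return x * 2.0;
}
```

After sourcing, R gets a `timesTwo` function it can call directly, with the marshalling shims generated behind the scenes.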
And that's sort of parsing the signature. So we're doing that old school, pre-Clang,
by just reading it out
and then basically generating the little wrappers.
And basically the attributes package
gives us three functions that you can call
of sort of increasing order of importance.
The simplest one is evalCpp,
and you just have a string,
and I often show that with two plus two.
Then attributes picks up the string 2 plus 2
and actually writes the minimal shim around it
so that it becomes a callable function that has this expression in there
and returns the value and knows enough about what the value is
or maybe takes advantage of the fact that SEXPs are general anyway.
But it just goes off, takes as long as the compiler and linker take, whatever,
and then the result comes back and you get the 4.
And that's a great litmus test to actually check
that your compiler is set up correctly and what have you,
because Windows users install R,
but they don't have a compiler there natively.
R is very tightly integrated with just one build of GCC
that they call Rtools, so you have to have that.
And if you don't have it, you get an error.
And if you have it right, it comes back.
On the Mac, where things always change with Xcode,
you may have the right or wrong thing.
You know, same thing.
But this is a great litmus test.
So that's just evalCpp.
And obviously, you can't do very rich things with it.
I have some simple demos in there that you can look at.
You know, std::numeric_limits, templated on double,
and it comes back and you get the 1.7e308 or whatever it is.
I mean, you know, it's the representation of a double max in C++.
But the next one up is cppFunction.
And it took this idea that we had from inline one further
because it's just one function call,
essentially one argument, one long string.
And that's basically a return value, a function name,
a set of arguments, and then curlies,
and then the string closes.
And the attributes code by JJ goes in,
picks out the name.
So it creates us an R callable function of the name of the C++ function,
looks at the arguments, picks up the ones that are properly mappable.
You know, you can't put a random Boost math type in there
for which we don't have a translator; then it would fail.
But for the standard things, including vectors — both R's internal ones
that we have in Rcpp and C++ ones, so
std::vector<double>, say — it provides you that function, and then it's callable.
And if you invoke it with verbose equal true, you see what it does, because for every function
that the user supplies, there's sort of one shim that calls it, and then one R function
that goes into the other one.
So there's a little bit of extra code being generated and just does that, but it's transparent.
You can see what it does.
And then it's just there and bam, you have your function.
You've just extended your system interactively.
And our eyes glazed over because John Chambers,
the guy sort of at the core of S and then R,
taught me around that time,
showing sort of sketches from these meetings
in the 70s and a bit of the genesis of R,
that this direct extensibility is really what they had in mind 40 years earlier. And this
was the first sort of real approach that accomplished that as it was scoped, because
the Java and other language extensions that existed weren't quite as powerful. That one's
pretty neat. And then the next one is sourceCpp, which just reads an entire source file,
not too dissimilar to things I had once seen, like,
I don't know, C++ extensions for MATLAB or other things.
MATLAB is actually a commercial product.
It's quite disappointing in there, I always found,
because you can only, in each file, define one function.
I found that very laborious.
I played with it 20, 25 years ago and thought that, you know,
it would be great.
I mean, this is what professionals use, and it wasn't so great.
But with sourceCpp, you can have one or 10 or 100, if you insist, functions in a file.
It reads it in and then also does things.
And the attributes that are in there are sort of mostly that export thing where you can turn on and off.
I want this function callable or not callable, as well as other plugins that we added over the years.
Remember, this started sort of 10 or so years ago.
So we had plugins in there to say now turn on C++11 or 14 or 17.
Basically, the attributes code just reads that plugin and then emits the appropriate flags — the CPPFLAGS, the CXXFLAGS arguments —
sets the environment variables so that make and the compiler pick them up and carry on with that.
And similarly, a couple of other things, including OpenMP and sort of other natural extensions as well as package.
So I work a lot with the C++ library for sort of linear algebra called Armadillo.
So there's then a plugin that just says, OK, now just work with RcppArmadillo, because that's a header-only library.
And that makes things really easy
because you just have to look at headers.
So we've done that for a couple of other things.
But as well, some numeric libraries,
there's an old scientific library called the GNU GSL,
also going back to the 90s,
C interface, that one you have to link with.
And yeah, then we have a plugin for RcppGSL as well
and some converters for GSL type vectors and matrices
into R and out of R.
And that's what Attributes does.
Basically, it's one really impressive tour de force
by a very good engineer who kind of thought,
I think you guys could use this.
And he just showed me that in Chicago
around a conference that we organized.
And yeah, it's just, wow, this will make a difference.
And that definitely helped popularize Rcpp quite a bit
because it made the usability just that much easier.
And then sort of the next level up is what you really want to do
is create little packages because the R ecosystem is really powerful
because there's a curated set of packages, currently 19,000.
And then you can build packages also with Rcpp that connect to something.
So, for example, typically the package is talking to MySQL
or PostgreSQL or whatever, and just would rely on Rcpp or the like.
It's been quite helpful that way and quite widely deployed,
which is a treat to see.
I had started really just looking after other people's code as an open source maintainer for Debian
and always looked sort of with a bit of imposter syndrome at actual open source project authors.
And then it was sort of really timid steps of building small ones.
And then it turns out that, yeah, that I just got lucky and the time was right.
And we built some things to help people and they're widely used, which was really cool.
I couldn't have aimed for that, but it's just, yeah, it worked out really well.
It's good.
R is an interpreted language, right?
We would call it an interpreted language.
Correct.
Because of this richness that you want to work interactively with data, it allows you
to do really crazy things for computing on the language.
So it can self-modify and do other things that make
some things more expensive than others. Function calls tend to be expensive. Internally, it uses
a boatload of C and Fortran for normal operations. So if you have a vector from, well, there's
several things to that now, but even in the old model, if you have a vector from 1 to 100,
it would just send that vector down to compiled C code that loops over it and sums the values up. Something else may happen with these sequences because now we have
expressions that have it, but that's an implementation detail. But yeah, a lot of
compiled code behind that just used to be straight up C and Fortran only. The core team does not use
C++ for R itself. But yeah, that opened the door because the extensibility was already
pretty close to it anyhow. So when you do this embedded C++ stuff and it has to call out the compiler,
does it then cache that in some way so the next time that you go to execute it,
it doesn't have to call the compiler and link and do whatever again?
Most excellent question.
So that, for example, is why you would want to create it as a package
because then you can more easily persist this
and then just say, load me the package in the next session.
Because otherwise, these things that you do ad hoc, they'll disappear.
There's a technical reason behind it and another team that needed it for heavy duty
Bayesian computations, the Stan team.
And I was supposed to meet them in New York and then had flight and weather delays and
that never happened.
They actually managed to do that, because basically what happens is you call the compiler — you have a shared library on Linux,
say, but the other operating systems are the same — because R made a very early call internally to rely on
libtool and, basically, you know, dynamic libraries being callable. That's how these
little function blocks come in. We can just always load something in and then call it.
But those things would just, in your current session,
sit at a random memory location,
and they're a little tricky to persist.
It's not impossible or implausible to persist.
So someone eventually worked out how to do that,
but I was always a bit, you know, play it simple, play it safe:
just write yourself a package,
because a package is the unit made for compiled code —
packages can be without compiled code, but about a fifth or a quarter of packages on CRAN use compiled code.
And that's really the clear way to do that. Otherwise you get hit in another way. So,
you know, when our problems got bigger and we got from computers with one CPU to maybe still one CPU,
but multiple cores, and you want to unroll a loop in parallel, process in parallel, sort of
forking.
But your main session has this compiled function, and the other ones that fork off will not
necessarily get that either.
So for these things, you're also better off persisting it into a package and then have
the child processes basically invoke the package clearly to have it callable.
But that's inside baseball technical details about how we fought sort of parallel computing
back in our day. R was made to be extensible, and yeah, that's what we help provide.
And you mentioned like GSL. It sounds like what you're saying at the moment,
that the main use case is when you want some very high performance or well-known library that you
want to call, you say, I know that R is not the right language for speed's sake,
I'm going to call into C++ or C. Great example. Yeah. So I still give workshops every now and
then. And the last one that I just gave a couple of days ago, I had the slides on that again,
sort of the prototypical example of where we look glorious is something where I actually sat here
late one evening looking at Stack Overflow, which already existed maybe 12, 13 years ago,
and a kid was writing himself a Fibonacci function.
Yet again, Fibonacci, very well understood,
but super, super costly.
And because R does these things where it can compute on the language,
it has to keep track of so and so many things
with a function call stack.
So there's a bit of overhead there.
So functions aren't its strongest point.
Recursive functions compound
that problem. So he was just sitting there, wrote a Fibonacci and then said Fibonacci of 30 or
something, or 35, and his R interpreter went off for half an hour. And when you do that with Rcpp,
because the actual Fibonacci expression is just three lines. It's a great demo in these talks
with cppFunction, because you can just have that embedded in a string,
maybe with a line break in there, or even one long string.
And the speed differences vary a little between Linux and the Mac,
but on Linux, for a moderate size,
if you have 20 or 25, I get 500 times faster code.
But that's a completely silly corner case
because we know that R ties its hands
behind the back for recursion. A thing that's more common and where we had a lot of early adopters
is just simulating in loops, Monte Carlo Markov chain things, very popular statistical technique
with more powerful computers, but people still have a lot of problems, wait for too long. So
then they put these loops in and then you often get 60, 70, 80. That's sort of kind of nice. And the counter example may be, I'm not sure if you've ever seen that, but there's a
really clever little trick when you can play in simulation and random draws is how to compute pi
by simulation. Have you seen that? Oh, you, like, draw a circle and measure, kind of thing?
Exactly. It's super intuitive because you just think about the unit circle,
take one quadrant of the unit circle, and you basically just draw an x and a y coordinate and then do, you know, x squared plus y squared.
You're either inside or outside the unit circle. And that's a quarter of a unit circle in a unit square.
That's the surface area that you're after. So you just basically count in and out over so-and-so many draws and multiply by four.
And you can do that in R code because you can have this expression of getting 10,000 random number values in a vector of 10,000 easily,
squared, summed, and square-rooted in one expression too, and then just count how many of those are larger than the threshold value.
And for that particular problem, going from R to C++ only gives you about a threefold speedup.
So not every problem translates in equal measure, because if the code inherently was already heavily vectorized or making use of internal operations that are actually C-heavy, then the gain will not be as big.
Then you're just basically saving time because the interpreter updates the value,
assigns something, sort of the little things,
and you're replacing all of that with compiled code.
So you're still saving, but not the same amount.
So that I find is a really good mental model
that there may be code where you're not gaining that much,
or you may be gaining 50, 60, 80 times for loops,
our bread and butter problems,
or in corner cases, you may get hundreds fold,
but that's corner cases. That's not typical. And by the same token, you know, people have used it,
of course, to go to anything and everything, image libraries, databases, computational things.
Yeah, it's in a lot of places because there's a lot of great libraries out there because
the FFI, C and C++ provide a really rich set of things to work with.
And then people just have an urge to do that.
And they can provide the glue relatively cheap with RCPP.
You use GCC, MinGW, I'm assuming, on Windows and GCC on Linux.
But you said, I might be projecting my own troubles with macOS here onto what you described,
but it sounds like sometimes it works on Mac, sometimes it doesn't,
depending on whether or not Xcode has been updated recently.
Yeah, the Mac can be challenging, but it has been solved.
The R core team provides a set of tools,
so you can download basically versions of gcc, gfortran, g++ that work with it.
But people don't usually start there.
They start with brew, and then you need to coordinate it.
But then on the Mac, you also need Xcode.
I never had a Mac.
I left that piece of trouble out of my life.
So we get a lot of questions on that.
I mean, at the GitHub repo, it's like overflow,
because it tends to change with the releases,
but it's very, very widely used, and it's all solvable.
It reminds me of my youth.
I mean, when you just had to go
to some random websites,
download certain things,
glue them together.
I find I'm spoiled
on a moderately competent Linux distribution.
I mean, you don't have any of those issues.
You dnf the tool in, or apt install the tool in,
and it just works.
And on these Unix-y operating systems,
any compiler really works.
So it can be Clang,
can be GCC, G++.
We had users using the Intel compilers back in the day,
when those still had a bit more market and mindshare.
Because again, the real key is the thing that R wants
is a C interface.
It's this call that returns an S expression pointer
and takes zero, one, or several S expression pointers.
And that's the C level interface.
And you just have to set yourself up
such that ultimately you can call a C function.
And people have used the same thing
from any and all interesting languages.
There's R bindings to Rust.
There's R bindings to V8.
There's, of course, several packages to Julia.
And ultimately, they all meet in the middle
of that C interface because they can.
It's just we've been handed sort of a secret weapon by template metaprogramming
because we can have these converters happen automatically at the interface point.
And that's really cool.
I'm just imagining that the R Rust interface is just called RR.
No, that one's taken. rr, isn't that the...
The rewindable debugger.
Oh, right, right, right, right.
So I think the Rust project is called RustR or something.
They're one or two, and they're in cargo this and cargo that.
Can I pass an R function into C++ with these interfaces?
I don't remember if you covered that.
You can, because everything is an S expression pointer.
Everything's an object.
And so in the C++ code, what it then realizes is just, oh, this is an R function that can be
called, and at that point you can provide it a payload. But what it does behind the scenes,
because Rcpp really is an extension to the R system, you're sitting in the R system,
so this piece of code never exists
outside of an R interpreter somewhere.
So it just then basically calls up,
has the workload evaluated,
and brings the result back to you,
which is super powerful for some things.
For example, some really early cases
were numerical optimizers
or things where you needed initialization
or things like that,
sort of things that you maybe just call once
and then you call something else a thousand times
and that one thing may be a bit more expensive,
it doesn't really matter,
but you can then get the seeding, setting, parameterization,
whatever just done in the R call and then you go off.
It provides a nice bit of flexibility.
I think we still do that even in Rcpp for one or two things.
We just call back into R because it was sort of simpler
or certain functions weren't available
on all operating systems back in the day.
Yeah.
Something related around string parsing of time.
It was strptime, I think.
I have to check again.
But yeah, you can.
Okay.
It sounds like Rcpp has been around for a while.
Have you kept up to date with, you know,
the C++ standard as it's, you know,
brought in lots of new features?
Like, can you use lambdas with Rcpp, for example?
Yes. So, great question.
Because we were as excited as everybody else
has been over the last decade
about everything that came to us
and the toys that we've been given.
The active constraint often was
which compiler was provided by Windows.
Because, again, on the Mac, you sort of get them from several places,
can work them, and then it's whatever that version is.
On Windows, it was very boiled down.
And for a really long time, we had G++ for nine, and that was it.
And that, for example, provided a fair amount of C++,
including TR1 for C++11 and other things, but not all. At one point,
I was working with a time zone and time parsing library from Google. And I think I was using
IO streams to print something. And that particular one, on Windows — I think the g++ implementation of the
C++ standard library was missing one function. And then a friend helped me port that over from Clang so that we had it on our operating system. That tends to go in leaps
and bounds. So that use of G++ 4.9 got replaced after a long wait about two years ago. And on
Windows, G++ 10 is now used. And that opens the door up to quite something. So R itself, being written in C,
doesn't mind and just looks at,
basically the position is
whatever the user's compiler supports can be used.
And that's affected to where it comes in.
So these days — R tends to now release annually in April,
so we'll have a new one coming up in three weeks —
but I think it was two years ago that the default, if you don't specify anything otherwise, became C++11.
And the release last year moved it to 14.
But it's this thing that you are as constrained as the equipment that your user uses.
And having JJ around and his RStudio experience has been really good.
So we kept this very conservative for Rcpp.
I was just reminding a team member about that the other day
because he thought that we sort of passed that watershed
and are now at C++ 11.
No, no, no, no, no.
We still have C++ 98 code in there.
And with that, a bit of code bloat: we have defines
that will either do it this way or that way.
I mean, the much-hated varargs and lots of other stuff.
So it's good in the sense that
if your use case has a new enough compiler,
then you can use whatever the compiler supports.
There are definitely C++17 using packages on CRAN,
including the one from my day job.
And that's no issue either
because the compilers do that.
That's been fantastic.
There was one use case,
and I think that only had to do with
some interop that we do with some other parts. So I knew that I was protected by if-defs for Linux.
I got to use the file system standard and the regular expressions in C++, and that was just
one of the most joyful days ever. I mean, that was like working in a scripting language. This is just
amazing. It's just not as
commonplace yet as we really wanted, but that was great. So yeah, it's there. We're moving quite a
bit towards that. You have to be mindful of your users who may still be on a, you know, CentOS 3
system from 1974. So I exaggerate, but we still see 7 every now and then. It's just bad.
And I feel bad for people there, too, because Rcpp really is glue code,
and we are then the error message.
And then someone goes off and wants to install a really large bioinformatics pipeline,
and it's sort of a complex composed of 20 packages directly with recursive dependencies and maybe 200 of them,
and then somewhere in there,
and one of them may be doing something relatively modern
and then it just all ends
in tears because they
what does CentOS call that, devtoolset
or whatever. I mean there's always an out that you get at least
to G++ 8 or something like that but
sometimes they're stuck with really old ones and then
it's bad. But yeah I really look forward
to talking to you guys again in 10 years
when I'll show you what we do with the really keyword
in C++23 behind the scenes in Rcpp.
It's good. It's a nice option to have, but we're trying to be
reasonable and careful and not impose too much pain on
users. If we've got a listener who says, I've never played with R
before, but I'm really curious to try your bindings.
Rcpp.
Is there a getting-started guide somewhere,
something that would take them from not having R installed to this Fibonacci example?
Yes.
We have it somewhat easy for Rcpp, the package and project,
because we don't really have to force you to bring the tooling in.
So you have to start somewhere and install R.
So you just go to, you know,
rproject.org and download it
for your particular operating system.
Where if it's Windows, you have to make,
and you want to do RCP, not just R,
then you also have to do the extra step
of doing a compiler.
But even that is helped and automated.
These days, de facto,
maybe 90% of users drive R from RStudio,
which is a freestanding application, desktop app,
that looks cross-platform, behaves the same on all operating systems.
And that one, for example, on Windows,
when it sees you call out to Rcpp,
checks the path and informs you that,
oh, you actually don't have Rtools.
Do you want me to get it for you?
And does these things.
So there's a little bit of user bells and whistles.
So you first have to get R,
and then you have to learn how to get a package,
which in RStudio is mostly just a button click,
and then you're there.
And documentation is sort of taken reasonably seriously
by the R community at large.
So we have 10 PDF vignettes in the package,
including an older document for intros
and a newer document for intros, as well as others with more technical detail.
It gets head-spinning and hair-splitting relatively quickly, but there's one vignette on the attributes internals that we discussed earlier, and others.
And the getting-started examples are there.
And we have one other good website, too.
Yet another brilliant idea by JJ.
He once grabbed the rcpp.org domain and handed it to me.
And then we created a website, which is basically, by now it's sort of, I think, 110
self-contained short little stories, posts in Markdown, uses Jekyll behind the scenes. And
then it has sort of the, here's the problem that we're solving here. And in some cases, it's just
how to create a particular data structure, or how to do another thing, or how to do a particular interface. But they go from relatively simple to obscure at the very tail, and they're also a good place to start.
Awesome.
That's gallery.rcpp.org, and it's hosted at GitHub and everywhere, so with a bit of Google one can get places. And we have by now, I think, 2,800
questions on Stack Overflow, so that's a pretty good corpus, too.
And their search engine actually isn't all that bad either. So, you know, things come up, sort of, you know, how do I do JSON with R, sort of something. And then it would, you know, point you to the couple of packages that do those things.
Well, it sounds like RStudio is free and open source as well.
Right.
Okay.
Well, Dirk,
thank you so much
for coming on the show and telling us all
about R and Rcpp. We'll put, you know, links in the show notes. And yeah, thanks for coming on.
Thank you so much for having me. That was awesome. I hope your listeners will enjoy it.
Yes, thank you very much.
Thanks so much for listening in as we chat about C++. We'd love to hear what you think of the podcast. Please let us know if we're discussing the stuff you're interested in, or if you have a suggestion for a topic; we'd love to hear about that too.
You can email all your thoughts to feedback at cppcast.com. We'd also appreciate if you can
like CppCast on Facebook and follow CppCast on Twitter. You can also follow me at Rob W. Irving
and Jason at Lefticus on Twitter. We'd also like to thank all our patrons who help support the
show through Patreon.
If you'd like to support us on Patreon,
you can do so at patreon.com slash cppcast.
And of course, you can find all that info
and the show notes on the podcast website
at cppcast.com.
Theme music for this episode
was provided by podcastthemes.com.