CppCast - High Performance Computing
Episode Date: November 12, 2015

Rob and Jason are joined by Dmitri Nesteruk to talk about High Performance Computing and some of the new features coming to CLion and ReSharper for C++.

Dmitri Nesteruk is a developer, speaker, podcaster and a technical evangelist at JetBrains. His interests lie in software development and integration practices in the areas of computation, quantitative finance and algorithmic trading. His technological interests include C#, F# and C++ programming as well as high-performance computing using technologies such as CUDA. He has been a C# MVP since 2009.

News
Visual Studio 2015 Update 1 RC Available
Reverse Iteration with Range-Based for Loops
Interactively create clang-format configurations

Dmitri Nesteruk
@dnesteruk
Dmitri Nesteruk's Pluralsight courses

Links
Webinar Recording: A Tour of Modern C++
What's New in CLion 1.2
What's New in ReSharper C++
High Performance Computing in C++
Transcript
This episode of CppCast is sponsored by JetBrains, maker of excellent C++ developer tools including
CLion, ReSharper for C++, and AppCode.
Start your free evaluation today at jetbrains.com slash cppcast dash cpp.
Episode 34 of CppCast with guest Dmitri Nesteruk, recorded November 2nd, 2015.
In this episode, we discuss upcoming changes in Visual C++ 2015 Update 1.
Then we'll interview Dmitri Nesteruk from JetBrains.
Dmitri will talk to us about high-performance computing and some of the new features coming to CLion and ReSharper for C++. Welcome to episode 34 of CppCast, the only podcast for C++ developers by C++ developers.
I'm your host, Rob Irving, joined by my co-host, Jason Turner.
Jason, how are you doing today?
All right, Rob, how about you?
Doing pretty good. This little kind of time change we're doing with the recording this week is a
little odd. Just so people know, we're recording this episode about two weeks before it's going to
air. So some of the articles might seem a bit older. That's why. So at the top of every episode,
I'd like to read a piece of feedback. This one, we got a lot of great feedback on Andrei
Alexandrescu's episode on D.
And this one I picked out was a Reddit comment
where someone listened to the episode
and was saying how they talked with Andrei during a break at CppCon,
and he was incredibly humble, down to earth.
And it was after his talk, and he said he can now get out of character.
And he was the only guy in the whole conference
who actually had an interest in what he was working on.
And Andrei goes and actually replies to him and says,
so my trick to pretend to listen did work.
So it's great to see Andrei's sense of humor.
It was really great having him on the show.
Right, Jason?
Yeah, so it was a good talk.
Yeah.
So we'd love to hear your thoughts about the show as well.
You can always email us at feedback at cppcast.com,
follow us on Twitter at twitter.com slash cppcast,
and like us on Facebook at facebook.com slash cppcast.
And you can always review us on iTunes as well.
Joining us today is Dmitri Nesteruk.
Dimitri is a developer, speaker, podcaster,
and a technical evangelist at JetBrains.
His interests lie in software development and integration practices in the area of computation,
quantitative finance, and algorithmic trading.
His technological interests include C-sharp, F-sharp, and C++ programming, as well as high
performance computing using technologies such as CUDA.
He has been a C-sharp MVP since 2009.
Dmitri, welcome to the show.
Thank you. Glad to be here.
So you're also a podcaster.
Well, yeah, I do a podcast, although my podcast is in Russian.
Oh, okay. So is it a C-Sharp or C++ focused podcast?
Well, it's actually called Solo on .NET, so it originally started as a podcast I was doing with
other people, related to the user group that I was taking care of back in St. Petersburg. And it
was .NET related, but from then on it kind of branched out into all sorts of
directions, because obviously my personal interests have diverged somewhat.
Also, recently, the last, I think, two or three episodes were
actually both podcasts as well as video recordings, because I got a new camera.
I thought I'd check out, you know, 4K recording and all the rest of it. So it's now dual mode, if you will.
Wow. You know, it would be a little embarrassing to find out if there had been a Russian language
C++ podcast this whole time
while we've been calling ourselves the only C++ podcast.
Well, yeah, a lot of it is C++.
Okay.
Okay, maybe we can still say we're the only podcast completely focused on C++,
although I guess that's not true if we've had episodes on D and Rust.
Oh, well.
Yeah.
So I want to go over a couple news items. This first one is Visual Studio 2015 Update 1 Release Candidate is available.
And again, we are going through a bit of a time change.
So maybe by the time this episode airs, the Update 1 release will actually be out.
Hopefully.
Well, there's an interesting note from Eric on that where he says,
well, I can't give a specific date, but the conference is coming up.
Connect 2015 is coming up on November 18th. So maybe there'll be some interesting news then.
Yeah. You know, thinking back to last year, I think they had a similar event. I actually got
to attend that one, and they released Visual Studio. I guess it may have been still in beta
at that point, but they made a bunch of releases on that day.
So hopefully update one will come out
during that event.
You know what's funny? I remember
releases of at least
2013, maybe even
2012, where Microsoft explicitly
promised that the updates to the C++
compiler would be out of band.
They basically said that from now on, we're going to
do everything kind of asynchronously
whenever it happens.
And it still hasn't happened.
So maybe we're finally seeing this sort of thing.
Maybe, but I mean, we are expecting Update 1.
I think everything's coming with that,
from what I can tell.
So some of the things coming with this update, though,
one of the big ones that we've talked about already
was the C++ build tools, where you can have a separate package of just the C++ compiler without any of the Visual Studio IDE.
I know a lot of people are looking forward to that.
And it looks like they're also making improvements with memory diagnostics, compiler improvements, improvements with the cross-platform development tools.
Jason, was there anything else you wanted to call out in this?
Well, just specifically constexpr and SFINAE support
are things that keep coming up over and over again.
And if you've been following any of the tweets from STL,
it looks like they're very, very close to having good support for those things now.
Yeah, it says partial expression SFINAE support.
So I'm not sure what that means exactly
for it to be partially done.
I don't know.
Sorry, wasn't it the case at CppCon
that they said they would have support for modules as well?
Yes, I think actually they said it currently in...
Oh, shoot.
I saw something about it.
They explicitly said Update 1 will have like a beta preview thing
with modules. Yeah, so in the comments they say the
support for modules is still going to be unofficial, and you're going to have to use whatever out-of-band,
unofficial documentation, blog postings, and whatever to get the information you need to use
it. But it will be in Update 1, I believe. That's good. I personally kind of get the feeling that that's what
many people are waiting for in terms of the build process. I mean, certainly having a separate
kind of Visual-Studio-free stack is great for continuous integration, but it still doesn't
cover the problem of why is my program so slow to compile. But I guess even the introduction of modules, I
imagine, would still leave the problem of, well, you have your STL and your Boost, and they're not, you
know, they're not going to jump to modules overnight. It's going to be a process. Right, Jason?
Speaking about modules, we really need to get Gaby on to talk about modules in more depth. I think
that'd be a great episode. Yes, we should try to do that.
Okay, so the next article is
reverse iteration with range-based for loops.
Jason, do you want to go over this one?
Yeah, it's just a quick comment from someone on Reddit
who's like, hey, how come we can't use
the range-based for syntax in C++11
to do a reverse iteration?
And I think it's an interesting point
that as useful as range-based for loops are,
they do have a lot of limitations.
But the basic answer is,
well, we pretty much have to wait
until Eric Niebler's ranges proposal gets fully accepted.
But someday it'll be here.
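A minimal sketch of the usual workaround in the meantime, assuming C++14 (the reverse_view and reversed names here are purely illustrative): a tiny wrapper whose begin() and end() forward to rbegin() and rend(), which is all a range-based for loop actually requires.

    #include <iostream>
    #include <iterator>
    #include <vector>

    // Expose a container's reverse iterators as begin()/end().
    template <typename Range>
    struct reverse_view {
        Range& range;
        auto begin() { return std::rbegin(range); }
        auto end() { return std::rend(range); }
    };

    template <typename Range>
    reverse_view<Range> reversed(Range& range) { return {range}; }

    int main() {
        std::vector<int> v{1, 2, 3, 4};
        for (int x : reversed(v))
            std::cout << x << ' '; // prints: 4 3 2 1
    }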
Yeah, I think an interesting question is
what would happen in such an arrangement
if you've got the yield keyword in?
Because yield kind of goes forward in time.
You get your next value and your next value and the next value.
So starting from the end, the paradigm no longer applies.
Because we're kind of encouraged in a way to use the for keyword together with yield.
But it doesn't make sense if you introduce some sort of for r keyword, which does it backwards,
because there is no final element, effectively.
So that's an interesting problem.
Yeah.
I've not yet watched any of the videos
on the continuations and yield work.
Well, I'm coming from the C-sharp world
where yield is kind of a standard thing.
And you have this idea of infinite collections
or generators which will effectively yield the values infinitely
however many times you ask.
Right.
So in the context of this, reverse iteration kind of doesn't make much sense.
Right.
That's an interesting point.
Okay, so this last article is really a project which is pretty interesting.
You can now interactively create
Clang format configurations. And basically they have this interactive web page where they have
a little code sample on the right. And you can switch between the default styles that clang-format
offers: LLVM, Google, Chromium, Mozilla, and WebKit. And then you can dive down into all the different clang-format options
and see it live in this little interactive code editor
on the right.
So if you're deciding what clang-format style you want to use,
it's really the best way I've seen
to test different options.
Yeah, I feel like if you have any interest at all
in libclang and the stuff that you can do with it,
you should just go and play with the formatter for a little while.
Because it's just kind of fun to play with, too.
Yeah.
Is there anything you want to add with this, Dmitri?
Oh, I don't know.
I mean, we're kind of the competitor in this space in the sense that all of our tools provide a certain formatting support.
An interesting story, actually.
I know you already had Anastasia here on the show a couple of episodes ago, but the formatting options in CLion, as opposed to ReSharper C++, were actually
inferred using genetic algorithms, which I think is pretty impressive, because we hardly see any sort
of AI-type tech being used for development specifically. But the case there was that you essentially took a
bunch of code which already conformed to a particular coding style, and you ran a genetic
algorithm, shifting all the options that you have in the IDE until your gene sequence, if you will,
matched the formatting that was originally in the code. So that's an interesting point. I think we
also have to do this in ReSharper at some point as well,
because I think it kind of nails perfectly the set of settings
that you need for a particular coding style.
Okay, do you mind if we dig into that for just a second?
Does that work with anyone's...
I mean, starting from my own code base, can it learn what my style is?
No, this was essentially something
that we did. We have this hackathon
thing once a year where everybody at the company
just takes the
weekend to do their own crazy thing.
And so this is essentially somebody's
hackathon project that was applied specifically
to samples taken from
the particular settings. I suppose
we can start talking about including it
in a kind of inferential
way in the sense that you feed it your own code and say, can you please infer all the stuff from
here. But from what I remember, this is not exactly a quick operation in the sense that we,
I think when the presentation was happening live in front of an audience, we actually had to
just look at the final results, because it wasn't quick, you basically have,
I don't know, you might have a couple of hundred settings
that are constantly being readjusted and re-evaluated.
And because all of CLion's infrastructure has to be effectively engaged
to build the parse tree and everything, it's not a cheap operation to do so.
I don't know if we can actually make it palatable to the end user
in the sense of not annoying them and having them
wait for, like, I don't know, 10-15 minutes while the code is being analyzed. And if I had to do
that exactly once on my project, I think I would probably put up with it personally. Although my
style is not so far out there that I can't generally adjust it with whatever settings are available.
Well, yeah, we can sort of discuss it and make a feature
request. That would certainly
get somebody's attention. But from what I understand,
we generally try not to skew
the user experience
in terms of long waits, because even
if you do explain clearly that there is
a long wait coming
at the end of this, people are still going to be
annoyed about it.
That's pretty neat, though.
Yeah, I think so.
So, we'll dive
a little bit deeper into
what you've been working on with JetBrains. It's actually been
29 episodes since we had your
colleague Anastasia on.
What has the JetBrains team
been busy with since then?
Well, we've been doing pretty much the same
stuff. In fact, today, we've had
the world release.
We released just about everything, every product line that we do in a single day, which is kind of a bit crazy.
But we've still been working on CLion, obviously, improving the cross-platform story and improving the different features there. In addition, we have been working on
ReSharper C++, which is something that I guess didn't get covered so much in your previous
podcast. That's essentially the ReSharper. We have a product, let's start from the beginning. We have
a product called ReSharper, which is over 10 years old, and it supports primarily.NET development,
though in recent times, it's branched into all sorts of things. It's done
web languages like HTML, CSS, JavaScript. It's also done all sorts of specific formats, like,
for example, in the latest version, we support things like the Google Protocol Buffers format
or the JSX format from Facebook. So sort of format-specific kind of tech. And in recent years, we've added support for C++ there as well.
So the story right now is that
if you are into cross-platform development
or if you want an IDE,
which is a standalone IDE,
then CLion is the way to go.
However, if you're still in the Visual Studio mindset,
you work with a Microsoft compiler,
then ReSharper C++.
So essentially, we've made a kind
of umbrella product, if you will, called ReSharper Ultimate, and that's something that includes
everything from ReSharper, supporting .NET and whatnot, as well as the support for C++, and also
our tools for things like memory profiling and performance profiling and that sort of thing.
And just to add to this a little bit, you know, working a bit in the C-Sharp community
myself, ReSharper has been like, you know, a must have for Visual Studio developers for
a long time.
So do you think that's going to become the same case with ReSharper C++ for Visual Studio
developers?
I think if you're on Visual Studio, then, well, I'm certainly hoping that it does.
And I think that as we, I mean, it's a fairly
young product, but as we provide more and more value, I think people are going to sort of see
that it's really, you know, it adds so much in terms of, you know, making your life easier, then
it's silly not to use it. Certainly, that's what we've been experiencing with ReSharper. And I think
the challenges are, in most cases, quite similar.
Although I have to admit that, of course, supporting C++ is a bit harder
because you effectively have to implement your own preprocessor
and you effectively have to, I mean, to analyze a particular translation unit correctly,
you have to build it yourself.
So even if you have a tiny little hello world,
which includes some boost header somewhere,
you may end up with 300 megabytes worth of text that ReSharper would have to process.
So the story is a lot harder than it is with .NET.
I'm hoping it will improve with modules and whatnot.
That's why I'm sort of enthusiastic about them.
But at the moment, it's quite a challenge. So one thing worth mentioning is with Visual Studio 2015, Microsoft actually delivered some of their own built-in refactoring support, which I think has been a big request for years, and they finally delivered on it.
What does ReSharper bring to enhance that?
Well, if you look at the feature set just like for like, then we generally provide so much that I could take up a podcast or two just talking about
the features. And, I mean, even if you restrict it just to the features which have
their identical kind of counterparts in Visual Studio, then what we pride ourselves on
is the correctness in terms of how they work. I mean, doing an ordinary kind of basic refactoring is
something that everybody can more or less manage.
But doing a refactoring where your variable is kind of dragged through a lambda or
used in some bizarre setting, or it's actually part of a macro where, you know, if you do the
refactoring, you actually break what this macro is doing. These sorts of things are the things that
we provide diagnostics for. And quite often in
ReSharper, what you're going to see is you're going to see a window pop up and that window will say,
oh, by the way, we found a couple of conflicts where this is simply not going to work because
of the way you're using it. And people, I think, are sometimes not ready for that because they
assume that if you're doing a rename, a rename will work consistently across the board, whereas this only works if you're just using plain C++ without macros, without any kind of template
magic and whatever. So I would say even on like-for-like features, we pride ourselves on the
thoroughness and the correctness. And of course, we put a lot of things on top of that, a lot of
analyses, a lot of interesting kind of, you know, notions to help
people to make their lives better. So, I mean, you just consider a simple printf statement where you
make a mistake in the format specifier. That's not something that Visual Studio will necessarily
pick up, but this is something that we can do. And of course, the fact that we implemented our own preprocessor implies that one of the
things we can do is we can expand the actual preprocessing macros to like a depth of one,
for example, or an infinite depth.
So what this means is, let's say you're writing Google tests or something, which is just,
you know, macros basically being used to actually manufacture these massive tests.
If you want to take a peek at what's going on with ReSharper, you can just navigate on
top of your test, press Alt-Enter and expand the entire macro.
So you get to see the final code.
It's, you know, readable.
Finally, you see what's wrong and then you can undo and go back.
And that way you can sort of get a feel for what's actually going on in macro, what's
going wrong there. Because otherwise, the diagnostic information is effectively lost. You get an
error at compile time, and you're unable to determine what actually happened.
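A tiny illustration of why seeing the expansion helps (an invented example, not one from the show): the bug is invisible at the call site and obvious once the macro is expanded.

    #include <cstdio>

    // Classic unparenthesized-macro pitfall.
    #define SQUARE(x) x * x

    int main() {
        // Expands to 1 + 2 * 1 + 2, which is 5, not the expected 9.
        std::printf("%d\n", SQUARE(1 + 2));
    }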
Well, there have been some times in the past when I was using the Boost preprocessor library, before
variadic templates were available. And I could see being able to expand those preprocessor macros would be huge.
Does it work like that? Well, I mean, it doesn't really matter how complex the macros are.
It doesn't, because unlike certain other products (I won't say which), we actually took this whole
thing seriously, because one approach to macros would be to simply ignore them. And that's what
some people do. They just look at the C++ code,
and if they see a macro, they assume it to be redundant tokens that we have nothing to do with.
Whereas with ReSharper, we first of all, internally expand all the macros, and then we perform the
analysis on top of what's expanded, which is interesting. So if you use the macro incorrectly,
you might actually get a highlighting on top of the macro, which will tell you an error which relates to what would be expanded. So it gives you an insight into the
final kind of output. So you don't have to wait until compilation time to find out something's
wrong. ReSharper will just tell you right there. I'm curious, like how much time did you guys
spend developing the ability to be able to handle this much C++ parsing before you're
able to actually release it to the public?
It's actually difficult to say because I think Anastasia already went over the history.
Essentially, originally, we were going through the phase where we were adding C++ support
to Objective-C programs because they can use a kind of C++, a really somewhat bizarre version of
C++ with some of the things removed, like destructors, for example.
Very strange.
But we knew we had to do it at least to some extent.
And then from then on, we kind of branched out into these two products.
So it took several years and it took quite a few people to get where we are.
And we're still improving.
It's still an ongoing process. I would say that the ReSharper C++ team, if you include the testers,
it's about 10 people, maybe less than 10. And yeah, it did take a couple of years. But,
you know, when you go to market, well, you don't want any
false positives. And certainly one of the kind of challenges
is that we still have false positives.
I recently put out an article on why that is.
Essentially, the Microsoft compiler
does some things incorrectly with respect to C++.
And as a result, we have to take sides.
Do we side with the Microsoft compiler
or do we side with correctness?
Because we can imagine situations where people would work in Visual Studio
and then try to cross-compile on Linux and whatnot.
In this case, it's a problem.
And that's why if you go into some of the usages of Boost,
like using AccumulatorSet, for example,
then yeah, we're still going to have a few nitpicks here and there.
And these, unfortunately, I mean, in my post,
I just explained this is Microsoft's problem.
And we've been in communication with Microsoft about it.
We told them, you know, and you will see there was Herb Sutter's comment
in that original blog post saying that they are working
on fixing those issues so that hopefully some years down the line we will have a kind of perfect consistency.
But we did get the product to a state where the vast majority of things are parsed and resolved correctly.
And that includes handling all those crazy cases.
I'm sure you've seen the C++ WAT talk from the lightning talks at CppCon.
We do handle just about everything, including the really bizarre.
Cool.
So besides refactoring, it looks like ReSharper C++ is also going to do some static analysis
in the IDE.
Is that correct?
Yes.
Yes, indeed.
Well, essentially, as soon as you open up a solution, ReSharper starts continuously analyzing
everything that you edit.
And it kind of does this in real time effectively.
So as you type your code, if you make a mistake or if you write something which isn't a mistake
but can be improved, then ReSharper is going to underline it with a wavy underline.
And you will have a pop-up where you can, for example, fix an error or you can somehow
improve the code.
And there is also a marker bar on the right-hand side of the editor
that shows you throughout the whole file where particular issues are,
from subtle hints to actual warnings and errors.
Cool. Okay.
So changing gears a little bit,
I see that you recently finished up a Pluralsight course
on high-performance computing in C++.
I was first just wondering what your background was with high-performance computing.
Well, a couple years ago, I got into quant finance.
I'm kind of self-taught, but quant finance is all about doing math and doing all sorts
of simulations and mathematical modeling.
And the thing about this is it's one of those areas where, you know, if you're doing
any kind of random simulations, the more computing power you have, the better, essentially. So I had
to get into computation just by virtue of wanting my stuff to run faster than it runs in MATLAB. So
QuantFinance is actually very much a C++ oriented kind of business. It always was and I imagine it will be for a very
long time. So it's the language that they teach if you go and you actually do like a master's in
financial engineering. So in this setup, after I found a very large machine cluster where I could
actually do my research, I realized that I didn't have the necessary skills to actually, you know, leverage all this power. And so I went, I kind of systematically picked out the topics that
I wanted to master in order to be able to basically leverage all this wealth of computing
power. So that's, the course was born from that. Okay, could you give us an overview of some of the,
you know, instructions you go over in the course? It looks like the first chapter covers SIMD? Well, I mean, you have your algorithm, and you want to give your algorithm more entities to compute on. And by entity, we can mean different things. So
at the simplest level, at the most basic level, we have this idea of single instruction, multiple
data. So essentially, in addition to the ordinary registers, you have very large registers on the
CPU, and you can stick several values into them, like instead of sticking one value, you can stick four values in a register, and then you can stick another four values in
another register. And when you perform the add operation, it's not going to be an ordinary sort
of x86 add, it might be an add PS. So it would add these four values and these four values and
give you kind of four additions at the same time. So you can see the performance improvement in this regard.
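As a concrete sketch of that four-at-a-time addition (illustrative only; the course may use different examples), here it is with SSE intrinsics, where _mm_add_ps maps to the addps instruction Dmitri mentions:

    #include <xmmintrin.h> // SSE intrinsics

    // Add four pairs of floats with a single SIMD instruction.
    void add4(const float* a, const float* b, float* out) {
        __m128 va = _mm_loadu_ps(a); // load four floats into a 128-bit register
        __m128 vb = _mm_loadu_ps(b); // load another four
        _mm_storeu_ps(out, _mm_add_ps(va, vb)); // one addps: four additions at once
    }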
And this has been going on for a very long time.
I think when I was a child and processors were like 166 megahertz, not gigahertz.
I think we started out back then with technologies like MMX.
They were primarily targeting multimedia back then.
But right now, you know, it's open to anybody,
including, you know, codecs
and certainly scientific computing to leverage this.
Unfortunately, this is something
that you can only really do in C++.
It's not available in .NET just yet.
And I don't think it's available on the JVM just yet.
There are efforts in both cases
to bring it to those platforms,
but, you know, it might take a few more years
to get the JIT working
because in the managed world,
people just assume that your JIT compiler
will do everything for you.
And unfortunately, it doesn't.
It really, really doesn't,
even in the simplest cases.
And nobody is really complaining.
I mean, if you go out on the market
and you actually buy yourself a math library,
let's say you buy yourself a popular .NET math library, that's just a wrapper around a C++ library.
It's a C++ library with all the optimizations, and then you get a wrapper on top of it for.NET or Java or whatever.
So, SIMD is the first step.
And certainly, the compiler tries to help you in this regard.
So unlike the managed compilers,
the C++ compilers like the Intel compiler,
they're actually very smart.
So they try to, you know,
they have vectorization built in.
So if they see something that's obviously vectorizable, they would go and rewrite it
in terms of these large registers.
And also you can give the compiler hints
in terms of pragmas and whatnot
to make it even more efficient.
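A sketch of what such a hint can look like, using OpenMP 4's simd directive (other compilers have their own spellings, such as Intel's #pragma ivdep; treat the exact choice here as illustrative):

    // The pragma asserts the loop is safe to vectorize; __restrict additionally
    // promises the compiler that the arrays don't alias.
    void scale(float* __restrict dst, const float* __restrict src, float k, int n) {
        #pragma omp simd
        for (int i = 0; i < n; ++i)
            dst[i] = src[i] * k;
    }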
So I'm sorry, what compilers did you say actually support that?
Well, I think pretty much every popular C++ compiler
supports vectorization to some extent.
Some do it better than others.
I was recently surprised that GCC apparently supports
the vectorization on Intel
CPUs, which are not even out yet. So it would actually use instructions that no CPU can use,
which is fantastic in terms of, I guess, some sort of future-proofing. But the problem with
SIMD, you have to realize, is it makes your code not really portable, because it's not part of the sort of baseline x86.
It's something that's been evolving in time.
So if I have my SIMD code using the newest AVX extensions
and I run it on an older machine,
well, the program will just go boom.
It will say, I'm sorry,
but these instructions are not supported.
That's it.
So it's not, in terms of portability, it's not great.
But if you have your machine cluster
and you know exactly what versions of what sort of CPUs you're using,
you can build for specifically this instruction set,
and it will be just fine.
So you cover in your course how best to write your C++ code
so that the compiler can leverage SIMD and that kind of thing.
Well, actually, to start with, I show inline assembly.
I know it's a cardinal sin to do this kind of stuff,
but I start with inline assembly and then move on to mnemonics,
which are little wrappers around assembly language
for doing those little operations like initialize
or large register with four values.
There would be a tiny little kind of C-like function
which would do that for you.
And then, of course, yes, the next stage is really just getting the compiler
and giving the compiler hints on how to do it properly.
Okay.
So then where does this next chapter come in on open multiprocessing?
All right.
So you're leveraging your instructions to the best of your ability,
and the next kind of entity, the next level of scale is multi-core.
And of course, we kind of rely on this idea
that if you spin up multiple threads,
then the operating system will actually
put them on different cores
and it will all run concurrently
and therefore improve your program.
And here, there are actually two approaches.
There is an imperative approach,
which is kind of straightforward.
You make your threads,
and you sort of make your own thread pool maybe, even use some library functions. But
there is also a declarative approach. So I thought that since somebody else on Pluralsight is doing
a course on Intel Threading Building Blocks, which is the imperative library for multi-threading,
I decided to go with the declarative route. And the idea of declarative was once again,
that instead of doing the parallelization manually, you actually give the compiler certain
instructions on how to do that. So once again, you do it with pragmas. These are very smart
pragmas; you can stick a lot of stuff into them. They essentially say, oh,
by the way, here is a loop, and I know that you, the compiler, can parallelize this loop. And here
are some hints as to which variables you can capture and where, which ones are private or shared and whatnot.
And the end result is OpenMP, which is a compiler-plus-library solution. So the compiler
has to support these pragmas. But in addition, you have some library function calls for actually
deploying the stuff using the thread pool and whatever.
So this compiler plus library solution is what actually turns your code parallel.
And the interesting thing about modern optimizing compilers is that even if you're not using
OpenMP and you turn on the parallelized flag in your compiler, what will happen is your
compiler will still use OpenMP behind the scenes.
You can look at the disassembly and see that. So in both cases, and OpenMP is a very old technology. It's a very
mature technology. It looks weird because, I mean, people look at all these pragmas and they think,
haven't we gotten away from it already? But in actual fact, it just works and people are
sufficiently happy with it. So you have a choice in a way. You can go with the imperative route
and just, you know, write your functors and
write a parallel_for, which takes a lambda plus an iteration variable, and that will work just fine,
or you can use it this way. And OpenMP is less intrusive because you can take existing code and
you don't have to rewrite it to any great degree. You just put the pragmas. If the compiler doesn't
know those pragmas, if it doesn't know OpenMP, you just get serial execution. So that's it.
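A minimal sketch of that markup on a hypothetical data-parallel loop: the pragma is the only change to the serial code. Build with OpenMP enabled (-fopenmp on GCC/Clang, /openmp on MSVC) and the iterations are spread across cores; without OpenMP support the pragma is ignored and the loop runs serially, exactly as described.

    // Each iteration is independent, so OpenMP may split the range across threads.
    void saxpy(float a, const float* x, float* y, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }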
So do most compilers support the OpenMP pragmas?
Yeah, they totally support it.
OpenMP is now at version 4.
There is a lot of stuff you can do in there, not just parallelizing loops.
For example, you can just slice up your code into different tasks, and those would run in parallel.
And this is, once again, a very unobtrusive operation.
So you have a chunk of code that you've written already. And you say,
oh, by the way, these things are independent. So why don't I slice it into three little tasks,
and then they can complete at the same time, and I'll just wait for them. Very convenient.
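A sketch of that task slicing, with three hypothetical functions standing in for independent chunks of existing code:

    void load_configuration(); // hypothetical, mutually independent work
    void warm_up_caches();
    void connect_to_feed();

    void run_startup_work() {
        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task
            load_configuration();
            #pragma omp task
            warm_up_caches();
            #pragma omp task
            connect_to_feed();
            #pragma omp taskwait // wait for all three tasks to complete
        }
    }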
Yeah, I don't know if I should admit this on the air or not. But I'm pretty sure the only time I've
actually seen OpenMP in use was in Fortran code. So I might have to take another look at that
for the C++ support.
Yeah, C++ and Fortran are kind of,
they go kind of hand in hand for scientific computing.
Although I would say that Fortran is mainly a remnant
which is used by astronomers, surprisingly enough.
So if you're into astronomy or I guess cosmology,
then you might see some Fortran.
For the most part, I think people use C++ these days.
Right.
Yeah, I'll definitely have to look into that a bit myself.
The next article or the next chapter you have
is on message passing interface.
How does that factor in?
All right, so that's the highest level essentially.
So you've got your machine cluster
of however many thousand cores you have
and you want to run your calculations, your Monte Carlo simulations, whatever on them.
It turns out that the API, the most common API that people use for actually spreading out the data and then sort of collecting the end results is called MPI.
That's short for message passing interface. It's been around for years, and it doesn't look like anybody is replacing it anytime soon, because it works pretty well. Although personally, the variety of MPI that
I use is called Boost MPI. That's essentially a wrapper around the core MPI. So when you build
Boost by default, it doesn't get built because it's kind of, how should I put it? It's library
specific. There are different implementations of MPI: for example, the Microsoft MPI, MPICH (the MPI Chameleon), the Intel MPI. So I build it
for my variety of MPI and I get these nice boost wrappers. So it becomes, because originally it's
kind of like C style functions, it looks pretty ugly and it has pointers all over the place.
And then you plug in Boost and suddenly you get things like Boost serialization.
So you can just send an object across the wire as a single parameter.
It just works.
I mean, it's beautiful, and the friction is minimal.
So I love Boost MPI.
It's very useful.
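A sketch of that low-friction style (illustrative; a real run would be launched under mpirun with at least two processes):

    #include <boost/mpi.hpp>
    #include <boost/serialization/string.hpp>
    #include <iostream>
    #include <string>

    namespace mpi = boost::mpi;

    int main(int argc, char* argv[]) {
        mpi::environment env(argc, argv);
        mpi::communicator world;

        if (world.rank() == 0) {
            // Any serializable object goes across the wire as a single parameter.
            std::string msg = "hello from rank 0";
            world.send(1, 0, msg); // destination rank 1, tag 0
        } else if (world.rank() == 1) {
            std::string msg;
            world.recv(0, 0, msg); // from rank 0, tag 0
            std::cout << msg << '\n';
        }
    }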
I wanted to interrupt this discussion for just a moment to bring you a word from our
sponsor, JetBrains.
ReSharper C++ makes Visual Studio a much better IDE for
C++ developers. It provides on-the-fly code analysis, quick fixes, powerful search and
navigation, smart code completion, automated refactorings, a wide variety of code generation
options, and a host of other features to help increase your everyday productivity.
Code refactoring for C++ helps change your code safely, while context actions
let you switch between alternative syntax constructs and serve as shortcuts to code
generation actions. With ReSharper C++, you can instantly jump to any file, type, or type member
in your solution. You can search for usages of any code and get a clear view of all found usages with grouping and preview options. Visit jb.gg slash cppcast dash rcpp
to learn more and download your free 30-day evaluation. And use the following coupon code
to get a 20% discount for the ReSharper C++ personal license. CppCast, JetBrains CPP tools.
And then the last chapter you have there is C++ Accelerated Massive Parallelism.
Now, is this where you're using the GPU to parallelize?
Yes, indeed.
And first of all, I have to mention that in addition to this,
I also have another course on CUDA up on Pluralsight.
And the GPU story is actually a very interesting story.
So maybe I can elaborate a bit more, especially for those of your listeners who are not so aware of the situation.
So originally, years ago, we had a situation where the graphics output went to the monitor right from the motherboard.
And nobody found that in any way surprising.
So at some point, people realized that, you know, there are two ways this can go.
We can improve the motherboard or we can start selling a separate piece of hardware,
which would go into the motherboard on some sort of bus and then take care of the rendering.
And that's the rise of the GPU.
And it's an interesting, it's a unique story in a way, because if you think about, for example, sound cards,
then not many of us have discrete sound cards these days.
Most people just use what's on the motherboard.
But for GPU, what happened essentially is the GPU industry got a lot of money flowing in all the time
because people would buy more modern games and they would require better and better hardware.
I think it's slowed down a little bit now, but it used to be the case of kind of an arms race.
So as a consequence,
a lot of money got funneled into development of GPUs. And at some point, we got this idea of
programmable shaders. Now, a shader is just a piece of microcode that you can, it's kind of like
hardware programming in a way. You program your GPU to, for example, take a single pixel and color it in a specific way to perform a particular effect, in a kind of C-like or assembly-language-like
fashion. So that came in and it was very popular because, I mean, the GPU is a massively
parallel kind of construct. It runs lots of threads at the same time. So it's specifically
designed for processing like large arrays of pixels or large arrays of vertices. So people could write these little microcodes to actually
like do all sorts of wonderful fantastic effects. And they still do, by the way. However, at some
point, the sort of scientific community and the computation community realized that this
capability can be tricked. You can actually take your, like, let's say you want to multiply two matrices.
You turn those matrices into textures.
You feed those textures to the GPU.
You perform the multiplication using microcode,
and then you get the data back from the resulting texture.
And people realized that this was so fast that it was worth actually doing.
And back then, the technology wasn't really that advanced.
It didn't support, you know, double precision variables, for example.
It was really kind of coming from the ground up in this regard.
But what happened, and this was great, is that manufacturers themselves realized that this was worth it.
This was worth doing.
So this explains why now you can buy GPUs like the Tesla GPU from NVIDIA, for example, which doesn't even have video output.
I mean, it's a graphics card that doesn't output any video because it's sold specifically for
computation. So we're in a way lucky that this kind of evolutionary approach led us to a situation
where for massively parallel tasks, for data parallel tasks, like when you have, let's say,
you have a huge array of financial data, and you want
to perform the same operation on the same kind of groups within that data, for example, the GPU is
a fantastic tool, and it saves a lot of time. So on the market, we at the moment, we have two players,
we have AMD, formerly ATI, they bought ATI, and we have NVIDIA. NVIDIA is the one that's
commercially successful, and it provides fantastic tools, and it
introduced this idea of CUDA,
which is, well, a slight
small extension to the C language,
which support the programming model
for their specific graphics
cards. However, the problem is, if you
use CUDA, it doesn't support ATI
or AMD, as it
is now. It doesn't support those graphics cards.
And let me just remind you that when the Bitcoin,
when Bitcoins were mined on GPUs,
they were mined on ATI GPUs and not on NVIDIA GPUs
because one particular instruction there
was twice as fast as on NVIDIA.
And that kind of settled the field back then.
Of course, right now we're past that.
You need ASICs to mine GPUs, mine Bitcoin rather.
But the end result is that NVIDIA's tool set is fantastic.
I did a course on it.
However, it only targets one side of the equation.
If you want to use AMD GPUs, then, well, you would have to use OpenCL.
And OpenCL is great.
OpenCL is an open standard, which aims not to support just GPUs, but also things like
the Intel Xeon Phi, and even Altera is now
experimenting with OpenCL for FPGA development. So it's a great technology. However, if you're
targeting just GPUs and nothing else, it gets really verbose. You have to write a lot of code
for the simplest of things because it tries to be so general. So what Microsoft did is they said, hey, how about we try to get our compiler to
produce uniform code or to produce code which will actually, you know, you write it once and
it works both on NVIDIA and on the AMD device as well. And maybe at some point it will also work on,
you know, other classes of devices. And they did exactly that. So they came out with C++ AMP,
which is essentially a technology.
It's just two extensions to the C++ compiler that Microsoft does, which enable them to figure out
that a certain chunk of code is intended for the GPU. So essentially, what does it mean that the
chunk of code, the chunk of C++ code is intended for the GPU? This means that you cannot just call
any arbitrary function. You cannot pass a string into it. You can only do what's possible on the GPU because GPU is not an x86 device. So you cannot just
run arbitrary code on it. And essentially, they compile things in a kind of uniform fashion.
And it's great. It's a great approach. And it would be even more great if the device manufacturers
kind of jumped in and took a more active participation.
So Microsoft provides what they call a reference implementation.
So they kind of, they show you how to do it
and what you have to support.
Unfortunately, I haven't seen any other compiler support C++ AMP,
but what Microsoft is doing is already pretty interesting.
And this part of the HPC course,
I'm actually showing how you can get started with it
and get some work done, basically.
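For flavor, a sketch of what a C++ AMP kernel looks like with the public API (Visual C++ only; the element-wise add is an invented example): the restrict(amp) clause is what marks the lambda as GPU-eligible code.

    #include <amp.h>
    #include <vector>
    using namespace concurrency;

    // Element-wise vector add on whatever accelerator the runtime picks.
    void add(const std::vector<float>& a, const std::vector<float>& b,
             std::vector<float>& c) {
        const int n = static_cast<int>(c.size());
        array_view<const float, 1> av(n, a), bv(n, b);
        array_view<float, 1> cv(n, c);
        cv.discard_data(); // c is write-only, so skip the copy to the device

        parallel_for_each(cv.extent, [=](index<1> i) restrict(amp) {
            cv[i] = av[i] + bv[i]; // only GPU-legal operations allowed here
        });
        cv.synchronize(); // copy results back to the host vector
    }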
So do you have any hope for that, that other compilers might support it,
or there might be other initiatives that are as flexible as the Microsoft solution?
I don't know. I think we're still in limbo in terms of some sort of convergence language or
convergence technology, because we have to mention other classes of devices. We have to mention the
Intel Xeon Phi, which is essentially your 60-core
coprocessor solution that you can also stick into the PCI slot.
And it's something, I mean, we see some innovation from Intel.
Again, Altera, which I guess is now also Intel since they bought them, Altera is experimenting
with using OpenCL for the FPGA side of things.
So I think in the long run, we might see a convergent technology.
It's certainly still a C- or C++-like language.
So essentially, at some point, it might be folded into C++ proper
and just become part of a compiler.
I don't know how exactly that will happen,
because obviously, you know, in the modern world,
whenever Microsoft comes up with a standard,
people are a bit apprehensive.
They think that, you know, you've built this for your own platform. Because, I mean, the C++ AMP technology is currently reliant
upon DirectX, which is, I mean, great for Windows, not so great if you want to have widespread
availability, because DirectX is Windows-only. So these are, you know, some of the hurdles that have
to be overcome somehow. I don't think anybody has the sort of silver bullet
or the magic solution that will work on every class of device,
but at least we're seeing new things.
And so I try to basically cover two out of three of these technologies on Pluralsight.
Okay.
So you came at all this with an interest in quantitative finance.
If you're like an application or library developer
and you're interested in paralyzing your code,
where do you think would be a good place to start?
Well, it depends on whether you have written code already
or whether you're writing something from scratch
because I think you can sort of, if you're starting from the ground up,
you have to ask yourself whether you're writing the library
just kind of for your internal use or for external use.
Because if you're like the authors of Eigen, for example, which is a very popular matrix manipulation library, then your best goal, and that's what the Eigen people are doing, your best goal is to basically target every kind of CPU support,
every SIMD level that there is.
And it's a huge task, effectively, to make a library
which performs its best under all CPUs.
But unfortunately, you would have to have this kind of stratified testing.
This problem, by the way, it's not just a CPU problem.
The same happens on GPUs because GPUs get more and more modern.
And so if you want to leverage all the best stuff,
then it becomes a case of testing on all sorts of different devices.
And that's why quite often you see programs like, let's say,
MATLAB or Mathematica, they only support a cross-section of GPUs.
They're honest.
They're saying we're only supporting the latest
because it is extremely difficult to write code which is portable across, you know, a whole history of GPUs across time. Right. And what
if you're working with an existing code base and you're trying to increase performance through
parallelization? Well, guided auto-parallelism is an interesting idea, and that's something that
Intel is doing. So essentially what they do is they kind of run your code, and they give you hints on possible locations where you might actually be able to parallelize
things. Because, I mean, certainly, if you're doing, let's say, some data parallel stuff,
and you have for loops all over the place, that's easy to detect. Unfortunately, in the real world,
if you're working with, let's say, a tree-based structure, for example, it becomes a lot less
obvious. And in this case, you would have to use your brain much more to figure out where the parallelism lies.
But once you figure that out, then you have plenty of choices for how you want to actually parallelize things.
And certainly we're finally having thread support as part of C++ proper.
But in addition, there are so many libraries out there.
I would say that the pair of libraries, the Intel Threading Building Blocks and the Microsoft Parallel Patterns Library, their interfaces are almost identical. The difference is that Intel's implementation gives you a bit more in terms of data structures and whatnot. At the very basic level, you can try OpenMP, just marking up code and seeing how that goes. At the more
advanced level, where, for example, you're using, let's say, a vector, you might have to turn that
into some sort of concurrent, you know, hash map or something in order to get the parallelism,
to get the correctness, effectively. Because remember, STL constructs are not thread safe
by default. So this is where, you know, if you start writing to a structure from multiple threads,
you might get all sorts of problems. So that requires analysis, and that requires time, unfortunately.
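A minimal sketch of that hazard and the simplest fix, in plain C++11 (a concurrent container such as TBB's tbb::concurrent_vector would be the lock-free alternative):

    #include <mutex>
    #include <vector>

    std::vector<int> results; // STL containers are not thread safe by default
    std::mutex results_mutex;

    // Unsynchronized results.push_back() from several threads is a data race;
    // guarding the container with a lock is the simplest correct fix.
    void record(int value) {
        std::lock_guard<std::mutex> lock(results_mutex);
        results.push_back(value);
    }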
So if you were to throw all these solutions, as much as you could, at your
problems, SIMD and C++ AMP and whatever, what kind of crazy performance improvements have
you seen? A hundredfold is not unheard of. Essentially, the issue here is that some algorithms
are what we call embarrassingly parallel, in the sense that if you're not parallelizing them, you
should be embarrassed with yourself, because it's really so obvious. Like, for example, a typical
matrix multiplication is something that you can speed up quite a bit. However, it's not really
so straightforward. For example, there is still,
I would say, a fairly significant difference between single precision and double precision.
And that's something that you would particularly feel on the GPU. So once again, you might jump
from a CPU implementation to a GPU and not see such a massive increase because double precision
on the GPU, depending on the class of the device, can still be not as great.
But in terms of... It's very difficult to put a number.
I would say that if you are just parallelizing multi-core and your algorithm is embarrassingly
parallel, you should see a linear increase, more or less.
However, if you get some sort of entanglement, or at some point you have a barrier, or you're waiting on something, or I don't know,
then that effectively reduces
that part to a
single core. So it's difficult to
kind of put a number there,
but at least, you know, on the CPU you can expect
a linear increase, more or less,
if you have a
fairly straightforward data-parallel
solution. On the GPU,
things are, well, you still get fairly significant increases,
but it's very difficult to predict what they are
because, remember, the performance of parallelization
is dependent upon a rather large set of parameters
related to the way you invoke the GPU's actual infrastructure.
Because when you fire up a GPU kernel,
you give it some parameters for how you want to partition your data.
And interestingly enough, this partitioning has all sorts of weird effects.
In fact, NVIDIA, they actually ship you an Excel spreadsheet
where you can put in the size of these partitions,
the size of your data,
and they give you a
graph where you can spot the sort of optimum point where you get maximum performance.
And it's not a straightforward graph.
It's not like a line where if you split it in this way, you get a linear increase.
It's a very jagged graph, which I guess unless you're an engineer, a GPU engineer, you're
going to have a really hard time figuring out what the dependencies are.
So GPU, I would say, is fairly hard to predict. Then again, if we're talking about things like the Xeon Phi, you know, just coprocessors where you've got these 60 cores, and they're
slow cores, they're like Pentium 4 class cores, but there are 60 of them, and they're sitting
there just doing nothing kind of like a computer within your computer, then once again, because this device is effectively x86, despite the fact that you
have to recompile, this device should also give you a straightforward increase because, I mean,
it supports things like OpenMP and pthreads and whatever you throw at it. It even supports MPI,
though, unfortunately, I have to admit, I didn't get it to work in a kind of uniform MPI setting.
The problem with this is that this device, even though it's a PCI card, it runs its own brand of Linux on one of the cores.
It's effectively a computer that just feeds off your PCI line.
So you can use it as a separate machine, which doesn't have much to do with your main machine.
You can just peek at the data from time to time with SSH or something. Or you can do the offload mode, where you have your
program running on your ordinary machine, but some of your loops, for example, have these little
pragmas. So we have pragmas again. And these pragmas say, if possible, can you please
offload this computation to the Xeon Phi. So that's an interesting model.
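A sketch of that offload markup, based on Intel's compiler-specific offload pragma (the coprocessor target was named mic; treat the exact clause syntax as an assumption):

    // If a Xeon Phi is present, the marked block runs there; otherwise it falls
    // back to the host. inout() describes what data to ship each way.
    void scale_on_phi(float* data, int n, float k) {
        #pragma offload target(mic) inout(data : length(n))
        {
            #pragma omp parallel for
            for (int i = 0; i < n; ++i)
                data[i] *= k;
        }
    }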
But once again, it's very difficult to say what the performance increase will be, because, I mean, it's a bit of
a black box; you have to use the Intel compiler, and they do some magic behind the scenes. But
certainly, it's of benefit in terms of computation. And then again, keep in mind that you can plug in
more than one device into your machine. You can plug in two GPUs
or two of these Phis. After two devices it gets a bit non-linear in terms of performance, especially
if you're saturating PCI bandwidth, and it doesn't go so well. But two devices seems to be the norm
around here, and it seems to be adding the performance benefit. And of course, if
you are kind of, you know, if you have tons of money to burn,
and certainly financial institutions do have tons of money to burn,
then you can go the route of FPGAs and custom chips and all the rest of it,
where you can get insane performance increases
simply by virtue of the fact that instead of leveraging the instruction set of a processor,
an FPGA is essentially you designing your own processor,
you doing parallelization intrinsically. So it's not parallelization where, you know, you have a
multi-core device and you're spreading tasks or spreading kind of threads across or your operating
system doing that. It's actually a piece of silicon effectively, which is inherently parallel.
You feed it data and four things happen physically at the same time. It's not just,
you know, an emulation. So in this case,
it's a different ballgame. But unfortunately, development for these classes of devices is
extremely expensive. The tools are fairly immature, I have to say. And it's something that is available
for certain classes of tasks. And certainly, the financial industry is using it for feed handlers
and for, essentially, once something comes over the wire from the stock exchange, for example, processing this data faster than your competitors. Instead
of it going through an ordinary, you know, CPU and RAM, it goes through this specialized piece of
hardware. So that provides you, you know, it might be like a microsecond-level advantage, but
it's worth it if there is big money on the line. So people do that, but
the development costs are huge, unfortunately.
That's amazing.
Yeah. So is there anything else
you wanted to go over before we let you go, Dmitri?
Gosh, I don't know.
I wanted to talk more about
the C++ stuff
because we just had a release and
I thought there would be questions like, what have
you done that's new since the last version?
I really haven't prepared the answer for that.
You see, it's actually funny, because I mean,
I could go off listing the features here.
But in actual fact, like number one feature,
we now open Unreal Engine in under 40 seconds.
That's, I mean, a big deal for a lot of people.
We took a lot of flak for taking a huge amount of time.
I saw some of those comments, yeah.
Yeah, you have to be honest, though.
It is a massive solution.
And like I said, because we have to do it correctly,
we have to get each of the files and process it
and make sure that we're treating everything fairly.
But now we're actually doing it in reasonable time,
and hopefully this would sort of tone down
the rhetoric
on the internet regarding
this particular thing.
So for comparison,
how long did it used to take, if it takes
40 seconds now?
Hard to say, but
the first run could have taken maybe
half an hour, something like that.
Oh, okay, that's quite the improvement.
It was an unreasonable amount of time. Now it's kind of tolerable. And of course, you only pay this price on first startup now, so whenever
you open the solution again, it's going to be fairly quick, because we cache everything.
But apart from that, we actually did other things. We are now correctly supporting the C language.
And I know people are kind of like,
oh God, not C, you're taking us even further back in time.
But honestly, it's something that we felt we had to do and do it well.
And in addition, as always, there are analysis features and code generation features.
Actually, code generation is the thing that quite often gets ignored
in the discussions of tools and whatnot.
I'm a huge fan, actually, of code generation.
I think that if I want a constructor which initializes my fields,
I should get it in a fraction of a second,
not, you know, writing it with all the correct types being passed.
Because, well, it's the whole point of IDEs,
the whole point of you paying
with your CPU time and
your RAM is that you get
it back in terms of productivity.
So CodeGen is kind of
a personal favorite
of mine, I would say.
It also solves that reverse iteration problem,
because why
have compiler
support for it when you can just generate a chunk of code which, yes, does go backwards through an iteration variable of some kind?
But, I mean, so what?
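The generated chunk in question might look something like this sketch, which walks a container backwards with plain reverse iterators and needs no special compiler or language support:

```cpp
#include <iostream>
#include <vector>

int main() {
    std::vector<int> v{1, 2, 3, 4};
    // Generated reverse loop: explicit reverse iterators instead of a
    // built-in reverse range-based for.
    for (auto it = v.rbegin(); it != v.rend(); ++it) {
        std::cout << *it << '\n';
    }
}
```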
I mean, instead of having a built-in language feature, you just generate it. And you should be kind of familiar with this whole generated code thing because of macros: when you use something like Google Test, you're essentially manufacturing huge tons of C++ anyway.
You're just not seeing it.
So you feel kind of insulated and you think, you know, all this cozy Google Test, nothing magical is happening.
Whereas behind the scenes, you have like classes full of functions in them and you're totally...
Ignorance is bliss
in this particular regard.
I wouldn't say overall, but in this regard it's
good to just get the final
result and start using it.
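To make the point about macros concrete: a one-line Google Test case really does manufacture a class behind the scenes. Roughly, and heavily simplified (the actual expansion includes registration machinery and more):

```cpp
#include <gtest/gtest.h>

// What you write:
TEST(MathTest, AddsTwoNumbers) {
    EXPECT_EQ(2 + 2, 4);
}

// Roughly what the macro manufactures behind the scenes (simplified):
//
//   class MathTest_AddsTwoNumbers_Test : public ::testing::Test {
//       void TestBody() override;  // your braces become this body
//       // ...plus static registration so the test runner can find it
//   };
//   void MathTest_AddsTwoNumbers_Test::TestBody() {
//       EXPECT_EQ(2 + 2, 4);
//   }
```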
I'm looking at the CLion
1.2 blog post and it does mention
Google Test support. Is there anything else you wanted to mention there?
Google Test, yeah.
Google Test actually appeared in ReSharper C++ before it appeared in CLion, and obviously the
implementations are somewhat different because the UIs are different. Visual Studio and ReSharper
C++ have their own user story in terms of unit testing, because from the early days of
ReSharper we were providing our own test runner. And well,
I would say it's a much nicer test runner and it can do a lot of things. For example, in the
latest version, we provide continuous testing. And I mean, not many people out there provide
continuous testing. I'll just explain for the listeners that the whole business of continuous
testing is that you rerun tests whenever either somebody saves a file or
somebody builds a project. But the key thing about continuous testing is you only rerun those tests
which are actually affected by the code that you changed. And this is a rocket science problem.
This is an extremely difficult problem. It's not something that you can just, you know, do within a month or two.
So it took us, I would say, maybe two years to get this.
And we're talking about .NET now.
I don't know what the situation is with C++ because it's even harder to get it in this regard.
But it's a rocket science problem that I think we cracked to some degree.
And even though it looks very simple, it's a fairly simple interface,
and we actually like to keep it simple and straightforward.
What's happening behind the scenes
is that we're essentially leveraging code coverage analysis,
but instead of just telling you what tests affected
what parts of your code,
what we're doing is we're looking at how your changes,
your recent changes when you press Ctrl-S to save a file,
how those changes propagated across all the
other files, how those files affected
the test that you wrote, and then
we're not rerunning your whole test suite.
And test suites, some people have huge
test suites. We are rerunning only
the parts which have actually been affected.
So that's one of the
really cool ReSharper Ultimate
10 features.
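As a rough conceptual sketch of the idea, not JetBrains' actual implementation (which tracks changes at a much finer granularity than whole files), the selection step amounts to intersecting a coverage map with the set of changed files:

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

using File = std::string;
using Test = std::string;

// Hypothetical sketch of coverage-based test selection: given a map from
// each source file to the tests that exercise it, return only the tests
// affected by the files changed since the last run.
std::set<Test> testsToRerun(const std::map<File, std::set<Test>>& coverage,
                            const std::vector<File>& changedFiles) {
    std::set<Test> affected;
    for (const File& file : changedFiles) {
        auto it = coverage.find(file);
        if (it != coverage.end()) {
            affected.insert(it->second.begin(), it->second.end());
        }
    }
    return affected;  // rerun these instead of the whole suite
}
```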
Well, where can people find you online, Dimitri, if they want to find more of your info?
All right.
So I have a Twitter.
It's dnesteruk.
I also, well, I have a couple of courses on Pluralsight,
like you mentioned,
so you can just Google Pluralsight and Nesteruk,
and you'll find me.
In addition, JetBrains' YouTube channel has lots of my videos: obviously product overview
videos, as well as webinars, because sometimes we do webinars where we just talk about, you know,
whatever the hell we like, basically. So I think some of my recent webinars were
talks on things like generative art, for example. So, you know, all sorts of experiments with C++ and what you can do with it.
Once again, the generative art webinar, by the way, is interesting because it actually used all those HPC practices that were mentioned here on the podcast today.
And also, it's actually part of the demo in the Pluralsight course, which I thought was really
neat, because, you know, with generative art you show people actual pictures of stuff, and then you say,
by the way, our generation of those pictures is really slow, so let's look at ten ways
of optimizing that. And mentioning webinars, you also did one on, like, kind of modern C++ in general,
right?
Yes, indeed. And this was one of the more popular webinars, although there are some
things which didn't fit into the one-hour time slot. But yeah, it was very popular, although I
get the feeling that I would have to be redoing this one regularly, because obviously as new
stuff comes out, you have to kind of update your samples and update the code and whatever.
So yeah. And it was actually a lot of... I got a lot of unpleasant surprises
during the preparation of this webinar
because it turned out that the compiler I was using,
the Intel compiler,
which, well, is a commercial compiler.
You have to pay money for it.
You expect top-notch performance.
What you don't expect is to be lagging quite a bit
behind other compilers in terms of standards.
And unfortunately, it does lag. The Intel compiler is kind of... it's not for everyone,
in the sense that, one, it doesn't support C++ to the same language level; it's not as fast in
terms of getting you, you know, automatic return type deduction, for example. But on the
other hand, it also has like really insane errors, where sometimes you write a piece of valid C++ code
and, instead of decent output,
you just get something like error four.
And you have to take that error four with the example,
send it to Intel, and they tell you,
okay, this will be fixed in the next update.
So it's actually quite, I mean, I wouldn't really recommend it to people
unless they're doing scientific computing
because in the scientific computing domain, Intel, they have their own MPI implementation. In addition,
they ship lots of algorithms as libraries. And those algorithms, it's interesting,
they're not just optimized for multicore, because I mean, you'd expect that from Intel to optimize
for their own CPUs. But in addition, some of the algorithms actually leverage MPI, meaning that if
you want to do like a fast
Fourier transform, and you want to do it quickly, then they would use their own MPI infrastructure.
So if you've got like 100 machines prepped for this sort of thing, then you don't really have
to write your own MPI code, you just do an invocation. And that invocation gets spread
across the entire network, which is fantastic. I think that's the whole point
of having algorithms
that behave in this kind of transparent manner.
Of course, if I tried to do it by hand,
it would take a really long time
to implement all of this.
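For the single-node case, the invocation described here looks roughly like this sketch using MKL's DFTI interface (the cluster variant that spreads the transform over MPI uses a separate descriptor API, omitted here):

```cpp
#include <complex>
#include <vector>
#include "mkl_dfti.h"  // Intel MKL's DFT interface

int main() {
    const MKL_LONG n = 1024;
    std::vector<std::complex<float>> signal(n);  // fill with your data...

    // One-dimensional, single-precision, complex-to-complex transform.
    DFTI_DESCRIPTOR_HANDLE handle = nullptr;
    DftiCreateDescriptor(&handle, DFTI_SINGLE, DFTI_COMPLEX, 1, n);
    DftiCommitDescriptor(handle);
    DftiComputeForward(handle, signal.data());  // in-place forward FFT
    DftiFreeDescriptor(&handle);
}
```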
Right.
Okay.
Well, thank you so much for your time, Dimitri.
All right.
Thanks a lot for having me over.
Thanks.
Thanks so much for listening
as we chat about C++.
I'd love to hear what you think of the podcast. Please let me know if we're discussing the stuff you're interested in,
or if you have a suggestion for a topic, I'd love to hear that also. You can email all your
thoughts to feedback at cppcast.com. I'd also appreciate if you can follow CppCast on Twitter
and like CppCast on Facebook. And of course, you can find all that info and the show notes on the