CppCast - Intel C++ Compiler
Episode Date: April 27, 2017

Rob and Jason are joined by Udit Patidar and Anoop Prabha from Intel to discuss Intel's C++ Compiler and suite of performance tuning software development tools.

Anoop Prabha is currently a Software Engineer in the Software and Services Group at Intel working with Intel® C++ Compiler support. He played a paramount role in driving customer adoption for features like Intel® Cilk™ Plus, explicit vectorization, and compute offload to Intel® Processor Graphics across all Intel targets by creating technical articles and code samples and educating customers through webinars and 1-on-1 engagements. He is currently driving Parallel STL feature adoption (a new feature in the 18.0 beta compiler). Before joining Intel, Anoop worked at IBM India Private Ltd as a software developer for 3 years in Bangalore, India, and later completed his graduate studies at the State University of New York at Buffalo.

Udit Patidar works in the Developer Products Division of Intel, where he is a product manager for Intel software tools. He was previously a developer working on Intel compilers, focusing on the OpenMP parallel programming model for technical and scientific computing workloads. He has extensive experience in high performance computing, both at Intel and previously. Udit holds an MBA in General Management from Cornell University and a PhD in Computer Science from the University of Houston.

News
Sandstorm Cap'n Proto
cppast - A library to parse and work with the C++ AST
Exposing containers of unique pointers
Clang-include-fixer

Links
Free Intel Software Development Tools
Intel Parallel Studio XE Suite Page
Intel System Studio Suite Page
Intel C++ Compiler Product Page
C++11 support
C++14 support
C++17 support
Intel C++ Compiler Forum

Sponsors
Conan.io
JetBrains
Transcript
This episode of CppCast is sponsored by JFrog, the universal artifact repository, including C++ binaries thanks to the integration of Conan, the C and C++ package manager. Start today at jfrog.com and conan.io. JetBrains is offering a 25% discount for an individual license on the C++ tool of your choice,
CLion, ReSharper C++, or AppCode.
Use the coupon code JETBRAINS for CppCast during checkout at jetbrains.com.
CppCast is also sponsored by Pacific++,
the first major C++ conference in the Pacific region, providing great talks and opportunities for networking.
Get your ticket now during early bird registration until June 1st.
Episode 99 of CppCast with guests Udit Patidar and Anoop Prabha, recorded April 26th, 2017. In this episode, we discuss parsing the C++ abstract syntax tree.
Then we talk to Udit Patidar and Anoop Prabha from Intel.
Udit and Anoop talk to us about the Intel C++ Compiler and Intel's performance tuning tools for C++ developers.
I'm your host, Rob Irving, joined by my co-host, Jason Turner.
Jason, how are you doing today?
Good, Rob. Episode 99.
Episode 99. We're almost at the big one.
Almost, yeah. That's what they tell us, that 100's supposed to be a big one.
Yeah, and I don't want to give it away, but we do have a good special guest planned for next week.
Obviously, all our guests are special, but we have a very special one for next week.
I won't give it away, but if you think back to some of the big name guests we have,
you can probably realize there's maybe one or two notable exceptions we haven't gone on yet,
and we're going to have one of them.
All right.
Should be fun.
Should be fun.
Well, at the top of every episode, I'd like to read a piece of feedback.
This week, I got an email from Amit.
He wrote in, I really enjoy listening to CppCast every week.
I would like to offer Kenton Varda as a guest on the show. As is apparent from the amount of growing traffic the GitHub project gets, Kenton is extremely productive and is always quick to answer questions and
bug reports. He can be a great example on how to
foster a community around an open source project, which
is why I think he'll be an interesting guest for the
show. Thanks for the podcast. Keep
it up. So I reached out to
Kenton. I haven't heard
back from him yet, but he definitely sounds like a good guest.
I think we may have discussed Cap'n Proto on the show before, right,
Jason? I'm almost positive it has come up on the show, yes.
I know I've seen it at conference talks.
I feel like it was probably discussed maybe with Odin when we had him on.
Yeah, that seems to ring a bell for me.
But either way, a conversation about developing community around your projects
might be interesting and beneficial to our audience.
Yeah, absolutely. So hopefully Kenton will get back to us and we can schedule an interview with him.
Well, anyway, we'd love to hear your thoughts about the show as well. You can always reach
out to us on Facebook, Twitter, or email us at feedback at cppcast.com. And don't forget to leave
us a review on iTunes. Joining us today, we have Udit Patidar and Anoop Prabha.
Anup is currently a software engineer
in the software and services group at Intel,
working with the Intel C++ compiler support.
He played a paramount role in driving
customer adoption for features like
Intel Cilk Plus, explicit vectorization,
compute offload to Intel processor
graphics across all Intel targets
by creating technical articles and code
samples, educating
customers through webinars and one-on-one engagements, and he is currently driving the
Parallel STL feature adoption. Before joining Intel, Anoop worked at IBM India Private Ltd as a
software developer for three years in Bangalore, and later completed his graduation from State
University of New York at Buffalo. And Udit Patidar works in the developer products division
of Intel, where
he is a product manager for Intel Software Tools. He was previously a developer working
on Intel compilers, focusing on OpenMP parallel programming model for technical and scientific
computing workloads. He has extensive experience in high-performance computing, both at Intel
and previously. Udit holds an MBA in general management from Cornell and a PhD in computer
science from the University of Houston.
Guys, welcome to the show.
Thank you.
Thank you very much.
So this is the first time we've had two guests on the show at the same time.
So maybe you could quickly introduce yourself just by name so everyone knows, you know, which voice is who.
Okay.
This is Anoop here.
I'm the one who is working as a software engineer at Intel.
So, Udit, did you want to go next? Yes, and I am Udit. I am the product manager for Intel compilers and other C++ developer tools.
Okay, okay. So since this is the first time that we've had two people on, I, you know, I like to find something from the person's bio to talk about, but let's just go maybe a little bit, just my standard question I like to ask.
So, Udit, how did you get started with programming?
And then we'll move to Anup and ask him.
Interesting for me because I started off as a double E in Sweden, I got into electrical engineering in Stockholm. And then as luck would have it, I started programming in MATLAB of all languages and then moved on to Fortran.
And as a grad student, I started taking up C also. And most of my graduate work was in implementing a neuroscience code base on a parallel computer.
And that was a combination of Fortran, Perl at that time, and also some of the MATLAB routines.
So it was a combination of whatever works best for the problem that I was trying to solve.
And towards the end of my PhD program,
I realized that I wanted to go into industry rather than academia.
So, you know, I started looking for roles,
which were more, you know,
aligned with the high performance computing and technical
computing. And Intel, you know, was one of the best employers in that field. So, you know,
that's where I actually ended up being in programming and, you know, learning to program
in different languages. And now, you know, more of a generalist focusing on C++ and Fortran at Intel.
I think you're the first person we've had on to mention MATLAB as a first language. But I know a lot of engineering departments actually require a MATLAB class.
And this was like back 16 years ago.
So I guess that's when MATLAB was dominant and, you know, there was not so much of Python.
And R was basically non-existent.
So MATLAB was the only
engineering programming language in town
apart from if you were doing
some C++ or Fortran.
That's interesting.
I haven't looked at MATLAB
in at least 16 years.
Last time I saw it was back then.
Exactly, right?
And Anoop, how did you get started programming?
Well, I started programming with C during my high school.
So that's when I took the baby steps towards programming.
But the very first class, I was like really onto it.
And I really found that something within me is like what keeps me going is programming.
And I continued that.
So it went on for my undergraduate studies in India where, like,
I was doing projects in C++, learned object-oriented concepts.
And I was doing my course on signal processing there.
So that really helped.
Obviously, it had a lot of challenging problems.
I was introduced to MATLAB, but still I used to do a lot of programming in C++, which kept me going.
And I know I had to write a lot of stuff on my own during that time, but still, that kept me going.
And then as I went to IBM, I started working on Java and Oracle, like back-end programming.
So I was kind of a module lead for the back-end.
So kind of doing an object-oriented SQL programming there, PLSQL,
procedural-oriented language,
and then moved on to the JavaScript and Java for the front-end development.
In the university here in Buffalo,
I was actually doing some programming stuff on the web.
It was PHP, and I was also introduced to the Perl scripting
because there were a lot of automations which had to be done at the university grading system,
and obviously the best scripting language which I could think of for that problem was Perl.
So that was a great opportunity for me to dive into Perl scripting.
And so during my thesis and master's, I was actually working on the Airfoil, which is the CFD problem.
And that's when I actually stepped back into the C++ programming arena.
And then that continued from then because after working on that thesis
actually I was absorbed by Intel, and from that day I'm working on supporting the C++
compiler features which we have in our Intel compiler. As for my work in HPC, I haven't
dealt with HPC applications directly myself, but I've been working with a lot of customers
who actually work on HPC.
So I'm kind of relevant in that way that across like oil and gas or even the animation folks
for that matter, I've worked closely with them and showed them how they can actually
optimize their code to be more friendly with Intel hardware.
So that's the short story.
You both have experience with parallel
computing then, if I understood
correctly, right? That's right.
Most of our customer base for the compiler,
especially have historically been
towards the higher
end of the compute spectrum.
So that's where our legacy has
been. So that's where
most of our customer base is.
Okay.
We'll have to dig into that a little bit more.
But first, we just have a couple news articles to discuss.
Feel free to comment on either any of these, either of you.
First one is an article from foonathan.
And this is, Jason, I want to pronounce it as CPP-ass, but that doesn't sound right.
cppast. It's a library to parse and work with the C++ AST.
And I thought this was a pretty interesting article.
While working on standardese, foonathan decided he needed to do some work with the abstract syntax tree, and he started off using libclang and just found a number of problems with it,
and wound up creating his own library to kind of work around some of the issues he ran into.
Yeah, there's an interesting note at the top, and he also discussed this on Reddit, saying, in hindsight, I should have probably used libtooling.
Okay. But it's an interesting article to read and see what conclusions he came to, for sure.
Yeah, I wonder if he'll go back and decide to use libtooling, or if he's going to continue working on and contributing to cppast. I don't know. Yeah.
Next one, exposing containers of unique pointers. Jason, what were your thoughts on this one?
It seems like a decent way to do things. I saw some of the comments on Reddit; people were saying this is a great reason that we need ranges in C++.
Yeah, that might be. I didn't actually read the Reddit comments on this one. That might be a good argument there.
That's a good point.
But I thought it was well-structured to get you to the point of implementing your own iterator, essentially.
Did either of you guys get the chance to read this article also?
Yes, I did. And I think there is a classic comparison with the Boost iterators, which they have,
which kind of covers the actual dereferencing part.
So the advantage of using these unique pointers is obviously that it kind of has that ownership for that particular object.
And the fact that they want to hide it at the code level, and I think that's what the article mainly focuses on.
And I think the ideas which you have shared out there, I would say the code snippets kind of gives a classic example on how to do it ourselves.
or I would also recommend taking a look at the Boost equivalent of that, which also does the same job. Yeah, every now and then I see these articles where
people mention the Boost iterator helpers,
which I've never personally worked with,
but there's definitely cases where I can see them coming up, being useful.
I guess if I'm, I'm sorry, go ahead, Rob.
No, go ahead.
So one of the comments at the top here says the objects aren't actually stored contiguously in this vector, because it's a vector of unique pointers, right?
So I was just thinking this is maybe perhaps a downside of hiding the fact that it's unique
pointers because sometimes you think, well, I'm working with a vector. I'll just get a pointer
to the first element, and I know all the data has to be contiguous because that's the point of a
vector. But if it's obscured from you, you get the first element,
and now the data's not contiguous.
I don't know.
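For listeners who want a concrete picture, here is a rough sketch of the pattern the article is getting at: hiding the unique_ptr behind a small iterator wrapper so callers only ever see references. The class and member names below are purely illustrative, not taken from the article.

    #include <memory>
    #include <vector>

    struct Widget { int value = 0; };

    // A thin iterator that walks a vector<unique_ptr<Widget>> but yields
    // Widget& on dereference, so callers never see the unique_ptr itself.
    class WidgetIterator {
    public:
        using Inner = std::vector<std::unique_ptr<Widget>>::const_iterator;
        explicit WidgetIterator(Inner it) : it_(it) {}
        Widget& operator*() const { return **it_; }   // deref the iterator, then the unique_ptr
        WidgetIterator& operator++() { ++it_; return *this; }
        bool operator!=(const WidgetIterator& rhs) const { return it_ != rhs.it_; }
    private:
        Inner it_;
    };

    // The owning container exposes only begin()/end() over references.
    class WidgetList {
    public:
        void add(int v) { items_.push_back(std::make_unique<Widget>(Widget{v})); }
        WidgetIterator begin() const { return WidgetIterator(items_.begin()); }
        WidgetIterator end() const { return WidgetIterator(items_.end()); }
    private:
        std::vector<std::unique_ptr<Widget>> items_;
    };

A range-based for over WidgetList then sees plain Widget& values, which is exactly the contiguity caveat being discussed: the unique_ptr objects themselves are contiguous, but the Widgets they point to are not.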
Okay, and this last article we have is Clang Include Fixer.
And I know we were talking about includes last week, too, I think.
This seemed like a pretty good project for basically dealing with a problem
that other languages maybe have, you know, fixes for in their IDE integration, like adding a using or import statement automatically for you in a language like C# or Java.
But there's no way within C++ to automatically add includes.
So that's what this project is trying to solve for you.
And Jason, I thought you would like how they showed you how to do the integration with Vim and Emacs.
Yes.
That looks very interesting, although I did not try it out yet.
Yeah.
Is there anything you guys wanted to comment on this one?
So do we have a VI versus Emacs community here?
I love VI. For reasons I won't get into at the moment
I've been using Emacs recently
and I feel like I just want to make the comment
that anyone who thinks Vim is obtuse
should try using Emacs
I've used both
so I also have my
slight inclination towards Emacs
but it's a topic
for another time.
Although I think it brings up an interesting question. You guys are compiler authors,
and whenever we talk about IDEs on this show, we talk about
Clang, we talk about Visual Studio, we talk about GCC, we just don't bring up
Intel C++ compiler very often. Do you guys use an IDE?
What IDEs can work with Intel's C++ compiler?
We have good integration as a compiler vendor with both Visual Studio and Eclipse,
and even Xcode, I believe, right? Yes, on Mac. So the idea is to be a drop-in replacement for the default compiler that is in those IDEs.
So the debuggers and other features should work as seamlessly as possible.
So yes, the major IDEs in the various OSs, we do support those.
Okay.
Okay.
So let's start talking about the Intel C++ compiler.
As Jason just mentioned, we frequently talk about MSVC, GCC, and Clang on this show.
Where exactly do you see the Intel compiler fitting in amongst those other, maybe more well-known or more talked-about ones?
Okay, so obviously for folks who are actually developing their application with Microsoft on Windows, or with a GCC compiler on Linux, or with Clang, it is quite easy for them to actually transition over to the Intel compiler, because the Intel compiler provides great compatibility, especially
on Windows with source and binary compatibility with Microsoft
Compiler. And the same applies for the source and binary
compatibility with GCC in the case of Linux.
So what's the big thing about the compatibility? Well, it makes it easier
for developers to actually evaluate Intel compiler to see if they get a performance edge for any performance sensitive application or library they have.
Now, if their application is really huge, I mean, it's like spans 150, 200 projects, each project having like thousands of files in it.
Well, it's not an easy engineering effort to put the full thing over to a different compiler.
But obviously, having the source and binary level compatibility
makes it a lot easier to selectively
convert the performance sensitive projects
to use the Intel compiler. So the mix and match
of having the binaries
generated by two different compilers
and make it work as a single application,
that's where the real plus comes from.
Now, I've been talking about performance,
and that's the main objective
for anyone who looks at Intel compiler,
because we actually have a very good code generation
for different targets of Intel,
which includes starting from SSE all the way to AVX-512,
that the code generation part is really optimal for IA targets
across different architectures.
And that's what makes it all the more important for people to evaluate,
especially their performance-critical kernels in their project with Intel Compiler.
And that's where it exactly fits in the ecosystem, too.
All right, you mentioned source compatibility, and each compiler has its own bugs.
In some ways, unintentionally, people come to rely on those bugs.
Do you emulate the bugs, depending on which compiler you're plugging in for?
I'm being serious.
So, yeah, I mean, we have had cases before where this was on Windows.
I don't want to specifically talk about the bug here,
but this was in Visual Studio 2010
and moving from 2010 to 2012.
So what we found was one of the customers
was using Visual Studio 2010.
They found their application works fine, but when they actually started using Intel Compiler on Windows,
they found that, well, they had a memory leak in the application.
And then they were like, why is it happening just with Intel Compiler?
And later we figured out that when we were actually doing the testing internally,
that this was really a problem with the memory allocation thing.
And that was fixed in the later version of Visual Studio.
So that's a classic example where like we have a C or C++ standard,
which is published online.
And every compiler vendor, they try to have their implementation stick to those standards.
Now, when I say source compatibility, what I really mean by that is that we depend on a lot of headers,
which is provided by Microsoft on the Windows side and the headers from GCC on the Linux.
So that's exactly what I mean by the source compatibility part of it.
And binary compatible, obviously, the object files and the libraries
which we generate are compliant with the ones with the Windows,
like the Microsoft compiler, or on the Linux side with the GCC compiler.
So that makes it a lot easier for two different binaries
generated by two different compilers
to coexist in the same application.
So you can handle all of the Windows system headers
and link to all the Windows system libraries
with no problem?
That's right.
Okay.
Does that mean you can compile a project
that relies on something like MFC
or C++/CLI on Windows?
Is that an issue?
Let's see.
I don't see an issue there either.
Okay.
Yeah.
Okay, so let's talk a little bit about, you know,
what type of customer might want to use the Intel compiler.
You kind of made some mentions when we were talking about your bios
that it's kind of a high-performance community.
That was originally our customer base.
So we started off with your national labs, right, your Los Alamos national labs.
And in Europe, we had CERN being one of our big customers and NASA, right? Those big, big high-performance computing centers around the world. And then many of our biggest customers are also in computational fluid dynamics code, right?
Oil and gas down in Texas, life sciences.
So it's, you know, wherever you have compute intensive C++ and also Fortran codes,
we have the applicability of our compiler. And on the other end of the spectrum, we do have
some customers who are using the compiler for embedded applications. So it's, you know, you have
medical scanners, right?
So if you have a GE or a Philips who have a medical scanner, basically the medical scanner is doing a lot of real-time image processing to generate a CT scan or to generate what have you.
So those are also high performance applications, although they are in a different form factor, but they
have the same requirements.
So that's where we also see our compiler getting used.
So it's, you know, wherever you have the requirement of a lot of compute capabilities and performance
is critical on Intel-compatible hardware,
that's where we see our compiler basically being used.
So I guess that brings up two questions for me.
You're saying, well, Intel makes some ARM chips also, correct?
Not just x86 line?
That's a newer business that they are going into, but we as a compiler vendor and a C++
tools vendor, our group does not support that yet. And I don't know if we have plans, but
as of right now, we have x86 and Intel-architecture-compatible hardware vendors being the major target customers.
So you've mentioned Intel compatible, and I was looking, a lot of the supercomputers aren't necessarily Intel chips. Sometimes they're AMD, and actually, looking at IBM Roadrunner, which I was not familiar with, at Los Alamos, it has IBM PowerCell processors.
I don't know if those are Intel compatible or not.
AMD is. AMD historically has been compatible with Intel.
They borrowed some of our ideas and we borrowed some of theirs.
But the idea is that Intel and AMD at an instruction set level are compatible.
So we are able to generate optimized code for AMD.
And I know that, you know, architectures such as Sun or IBM Power are different,
so we don't support those.
And ARM on the embedded side is also different, so we don't support those architectures.
That's true.
Okay.
But you do equally as well support AMD as Intel brand?
Yes, yes, yes. But I would have to caveat there that we, as an Intel software team, have much better knowledge of the Intel hardware than the AMD hardware, right?
Because it's just the nature of what the compiler has
visible to it to optimize.
But
it's just
something that we try and optimize
to the best of our knowledge.
Okay.
I wanted to interrupt this discussion for just a moment
to bring you a word from our sponsors.
JFrog is the leading DevOps solution provider
that gives engineers the freedom of choice.
Manage and securely host your packages
for any programming language with Artifactory.
With highly available registries on-prem
or in the cloud and integrations with all major build
and continuous integration tools,
Artifactory provides the only universal automated end-to-end solution from development to production.
Artifactory now provides full support for Conan C and C++ Package Manager,
the free open source tool for developers that works in any OS or platform and integrates with all build systems,
providing the best support for binary package creation and reuse across platforms.
Validated in the field to provide a flexible,
production-ready solution,
and with a growing community,
Conan is the leading multi-platform C++ package manager.
Together, JFrog and Conan provide an effective,
modern, and powerful solution for DevOps
for the C++ ecosystem.
If you want to learn more,
join us at the JFrog User Conference SwampUp 2017
running the 24th to 26th of May in Napa Valley, California,
or visit Conan.io.
So I don't want to get into licensing too much or anything,
but is there any availability for the Intel compiler
for student or open source projects?
There is.
And actually, that's a good point.
Students are one of our major users.
Research and academia as a general segment is one of our major users
because a lot of R&D projects in, say, oil and gas or physics
or financial services, right,
quantitative analysts and so on, those are academic projects.
So we do have a lot of offerings for academia,
and there are certain segments who can get our compiler and other C++ tools for free, right, without any charge.
And there is, you mentioned, an open source license also.
And the idea is that if you are a developer who's contributing to an open source project, you can use our tools for free, without any charge. So, you know, we have the community licensing program, which basically is for educators, students, and open source contributors, and then there's the commercial side of the house. So these are two different, well, it's the same product, but two different licensing options.
I guess just as an aside, I'd like to point out, I think it's interesting how many companies are offering licensing to open source developers right now.
It is, and it's been more and more with Clang coming into the picture.
You have more favorable licensing, right, where you can actually mix and match proprietary and open source code as a compiler vendor.
So more and more companies are taking advantage of that as opposed to GCC where, you know,
you have to adhere to the GPL, right, the GNU General Public License.
So the community is moving in that direction, yes.
Yeah, not to, I guess, you know, try to make you do a direct comparison with Clang or anything,
but it is interesting that they seem to be aiming for approximately the same thing you guys are doing.
They're a drop-in replacement compiler for Visual Studio or for GCC.
That is true.
And we are aware of the community's interest in Clang, right?
So we are closely monitoring and looking at being compatible with Clang.
But as of right now, our compatibility extends to GCC and Visual C++.
And, again, I would point out that, you know, a person would use the Intel compiler for getting superior performance.
And CLang, in the Foronix and other comparisons that I've seen, it is getting up there and it is very promising.
But I'm not aware of where we stand with the peak performance just yet. Okay. Yeah, I was curious if you're willing or
interested in actually giving us some performance comparisons for how you guys compare
to the other compilers. We do have. So I said that
most of our
customers, they are HPC, high performance computing
applications.
So there is this industry standard benchmark called SPEC.
And SPEC is basically for evaluating CPU performance.
And the latest benchmark is SPEC 2006, SPEC CPU 2006.
And, you know, we benchmark regularly on the latest Intel hardware against GCC and Clang on Linux and against PGI and also against Visual C++ on Windows. And the aim is to consistently be at least 20% better in performance,
in runtime performance
for both the floating point
and integer variants
of the spec CPU benchmark.
And, you know,
we have maintained this differential
for a while now.
But, you know, these are the in-house comparisons
that we do for spec.
And many of the applications
which we actually optimize for
are used by HPC centers around the world.
So these could be your CFD applications.
It could be crash testing applications.
These could be molecular dynamics applications.
And those are also optimized.
And we do aim to get the Intel compiler outperform the other alternatives there also.
20% is pretty huge.
20% is what we aim for the benchmark, yes.
And this is just for the compiler.
And you may be aware that, you know,
Intel has other C++ tools also.
You know, there's a threading library,
which is called Threading Building Blocks,
which allows you to multi-thread.
There's vectorization opportunities in the Intel compiler.
So, you know, when you actually get the compounded performance benefit,
it is much more than 20% from what we've seen, right?
You can get in orders of magnitude rather than in percentage points.
If you are actually seriously, you know,
tuning your code using our tools for the Intel hardware.
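To give a rough idea of what the threading library side looks like, here is a minimal Threading Building Blocks sketch; the array size and lambda body are just placeholders I've made up, not anything from the show.

    #include <tbb/parallel_for.h>
    #include <vector>

    int main() {
        std::vector<float> data(1'000'000, 1.0f);
        // Split the index range across worker threads; each chunk runs the lambda.
        tbb::parallel_for(std::size_t(0), data.size(), [&](std::size_t i) {
            data[i] = data[i] * 2.0f + 1.0f;   // some per-element work
        });
        return 0;
    }

The library decides how to partition the range and schedule it onto the available cores, which is what lets the multi-threading benefit compound with whatever vectorization the compiler does inside each chunk.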
Wow. Orders of magnitude is even bigger than 20%.
Exactly. And all this is to say that you should really make sure that you don't leave any performance on the table, right? If you have the opportunity, you know, do some analysis,
you know, using our tools
or using, you know,
some open source tools
to see if there is some headroom
for performance benefits, right?
And then direct your efforts
to optimize and not have any bottlenecks.
Because, you know, as the latest Intel hardware is coming along, right, it's not that the clock speeds are going higher; they have been stagnant for, you know, so many years. What's been going on now is you have more and more cores coming into, you know, one processor. So you have dual core, quad core, what have you. And inside each core, you have something called SIMD vectorization, or SIMD vector lanes,
which basically allows you to have a single instruction, multiple data kind of processing.
And if you are not utilizing the multiple vectors and the multiple cores of whatever hardware that you are investing in,
you are basically leaving performance on the table.
And that's with any compiler that you use: without tuning your application, you're going to be leaving performance on the table.
So then do the Intel tools actually help you find opportunities for multi-core parallelization,
not just vectorization?
There are tools, yes.
So our compiler and the threading library that I mentioned,
they come in a suite of tools called the Intel Parallel Studio.
And again, there's a word parallel in there, right?
Right.
So just to, you know, continue to emphasize the point. But one of the tools inside Parallel Studio is called VTune, which is a hotspot analysis tool which will look at your code and identify where most of your time is being spent, right? It could be some kernel, like a for loop or something,
which might be taking 60 or 80% of your time,
and you can direct your optimization efforts there.
So you can have some, you know, some vectorization opportunities
using OpenMP as a parallel programming model that we support,
and you can actually vectorize that code,
and then suddenly you see so much performance benefit.
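To make that concrete, a typical OpenMP-style vectorization hint on a hot loop might look like the sketch below; the loop itself is a made-up example rather than one from the show.

    #include <cstddef>

    // Ask the compiler to generate SIMD code for this loop (OpenMP 4.0 'omp simd').
    // Compilers that honor OpenMP will vectorize it; others simply ignore the pragma.
    void saxpy(float a, const float* x, float* y, std::size_t n) {
        #pragma omp simd
        for (std::size_t i = 0; i < n; ++i) {
            y[i] = a * x[i] + y[i];
        }
    }

The pragma is a hint rather than a rewrite of the algorithm, which is why it pairs naturally with a hotspot report: you find the loop that matters first, then spend the annotation effort there.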
There's another tool called Vector Advisor, or it's just called Advisor now,
which actually specifically looks at vectorization headroom in your code.
And these are what we classify as analysis tools.
And then there are the build tools, right?
The compiler, the libraries, you know, you have the threading library, which I mentioned.
We have a math library, which is called the Math Kernel Library, which has a lot of C++
and Fortran bindings for matrix multiply, Fourier transforms, things of that nature for linear algebra kind of applications.
And once you use all of those in an optimal way, that's when you start to see the performance benefit.
And our support team, like Anup and his colleagues, they are the ones who are
at the forefront of this knowledge, how to actually help the customers get the most out
of what they're trying to do.
How easy is it to run VTune against an existing application to look for performance critical
areas that you might want to improve on? Well, I mean, it
depends on what you're looking for.
It could be
at a higher level where you
are a new developer who's
introduced to an existing application
and you would like to look for tuning
opportunities on a given hardware
architecture. So what you would do is
basically do a general hotspot analysis
where, like, you basically try to run different workloads on that application, the usual workloads, and
try to see where the application spends most time. And often it's the case in like 99%
of the applications that it's not a flat profile. It's like up and down. So you like to see
all the ups, the functions where it spends most time, and then further drill down, and you can do a general exploration reporting where it will actually give you
a detailed information on what kind of, I mean, how many L1 cache hits or cache misses
has happened, same at the L2 level.
It also has a separate report for bandwidth analysis. If you feel that your application, in spite of being computationally optimized, still doesn't show any improvement in terms of performance, then you really need to check if it is compute-bound or memory-bound.
And VTune is a place where you can actually go and check whether your application's real
bottleneck is the compute side or is it on the memory side.
Also, I want to mention that the Advisor tool, which Udit just mentioned,
the advisor also provides a new feature, which is actually rolled out with this beta,
which is called roofline analysis.
So it basically gives in a pictorial manner where your application stands in a timeline.
Because every application, when it starts, obviously it needs to know the data.
So every application starts as a memory bound.
But what happens after that is basically deciding the nature of the application.
So after that ramp up time, if it continues to be a memory bound application,
then you could see that either in the
roofline analysis in Advisor or with the memory bandwidth analysis on the VTune side. Now, with
that being said, these are the reports which will be quite handy for any developer who's trying to
do the tuning. But we also provide a custom report analysis where like you as a developer, if you're
looking for certain specific hardware
events, you can create a custom report where you can choose the hardware event which you want to
monitor. And then you can come up with your own formulas to look for the metrics which you are
specifically interested in. So that flexibility is also offered. Well, if you don't mind, if we take
just one little step back, you mentioned L1 and L2 cache and memory being memory bound.
Would you mind explaining what the significance of the cache layers are and what memory bound would mean to our listeners?
Sure, yeah.
So, I mean, as Udit was mentioning, I mean, initially the race was about, if you look at the processor, like generations, like dating back to Pentium or even before,
initially it was all about trying to increase the frequency of operation of individual cores
so that you can have more number of cycles per second,
which means more number of instructions can be executed per second.
So that's the scalar optimization part.
But obviously we hit a bottleneck there.
So once we introduced the multi-core architecture, none of the programs was actually
getting free performance, or a free lunch, as we call it at Intel. So it's like the developer
really has to go and change their code to make the optimal use of different cores. Now, with that
being said, the processing speed has been tremendously increased,
because the processors are operating at gigahertz.
But that's not the case with the memory.
Because memory, if you look at the DRAM, the speed at which the DRAM operates
or the read or write, which happens from the DRAM,
is very slow when compared to the speed at which the processor is operating.
So when your code is actually data hungry, when it is looking for data from the memory,
then your CPU is sitting idle waiting for the data to be fetched into the register so that it can start computing.
So that's what I mean by memory bound, because your core is operating at a really high speed
and it can only operate when the data is available.
So if it is waiting for the data from the memory, then you're memory bound.
No matter how much compute optimizations you do, eventually the data has to arrive and that's where the bottleneck is.
Now, with that being said, obviously, we follow a new architecture.
It's a non-uniform memory access because you have different layers of memory.
Now, if you look at the DRAM in that hierarchy, that's the last in the hierarchy.
Obviously, the capacity is more, but the speed at which it can deliver the data is the least.
Now, if you look at the other levels, which are faster, well, we have different levels of caches, and that's what I mean by L1, L2, and L3. L1 is the closest to the processing core, and obviously that's the fastest memory, but obviously it comes with the trade-off that the size is less. So what we expect
the developers to do is basically look at the compute kernel. And if the data is bigger than what can be
fit into an L1 cache, you basically do a tiling of your data, which means you divide your
data set so that each tile perfectly fits into your L1 cache. You utilize that data
for the computation and then pull the next tile in. So that's what it means by L1. Now,
if you don't do this optimization, you would see that there's a lot of data which is in L1,
which is not getting properly utilized for the computation.
And then there is a miss happening in L1, which will lead to the data being looked for in L2.
And if there is a miss in L2, then it will look for data in last level cache.
And if it is not there in the last level cache, then it will go into the DRAM.
So the deeper you go, the slower it gets.
And that's what another level of tuning is where you look at the behavior of your application and then try to optimize it.
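A classic illustration of the tiling idea Anoop describes is loop blocking, for example in a matrix transpose; the sketch below processes one tile at a time so each tile's working set stays resident in cache while it is worked on. The block size here is a made-up tuning knob for illustration, not a recommendation from the guests.

    #include <cstddef>

    // Blocked (tiled) matrix transpose: process BLOCK x BLOCK tiles so the
    // working set of each tile fits in L1/L2 instead of streaming the whole matrix.
    constexpr std::size_t BLOCK = 64;   // hypothetical tile size; tune for your cache sizes

    void transpose_blocked(const double* in, double* out, std::size_t n) {
        for (std::size_t ii = 0; ii < n; ii += BLOCK) {
            for (std::size_t jj = 0; jj < n; jj += BLOCK) {
                // Finish one tile before moving on, improving cache reuse.
                for (std::size_t i = ii; i < ii + BLOCK && i < n; ++i) {
                    for (std::size_t j = jj; j < jj + BLOCK && j < n; ++j) {
                        out[j * n + i] = in[i * n + j];
                    }
                }
            }
        }
    }

The untiled version touches memory in a pattern that misses in L1 constantly on the write side; the tiled version trades a little loop overhead for far better cache hit rates, which is exactly what the memory-access analysis in a profiler would point you toward.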
So VTune will be able to tell you basically what the data access patterns are.
And I'd just like to add that VTune can be run in batch mode through a command line, or you can have a GUI which will give you a pictorial representation of what is going on in your application.
That's interesting.
I appreciate the explanation of the L1, L2, L3 cache.
It's nice to get that from a hardware guy.
Yeah, I'm curious.
Does L1 actually run at the native speed of the processor
or is it still slower than registers?
It is slower,
but obviously it is the fastest you can get on a core right now.
Okay.
Yeah.
So going back to the compiler a bit,
does it support all of the latest modern C++ features,
11, 14?
Are you already working on C++ 17?
It does.
So we actually have backwards compatibility to C99 also, and this being the 99th episode, I'll just give a shout-out to C99.
But yes, we do have C++11 and C++14 supported in the latest version of the compiler. And our compiler actually runs on, you know, a model year, similar to what a car would do. So we are now releasing the 18.0 compiler for 2018 products, and it's in full support of C++14,
and we have some support for C++17,
which was recently ratified as a standard, right? So the support that we have invested in right now
is the parallel STL,
or the standard template library,
which was extended to offer multi-threading and parallelism.
And I mentioned our different tools, right?
We had the threading library called Threading Building Blocks and the compiler.
And the idea is that if you are using templatized code, right,
through the standard template library,
then you can actually, without changing your code,
without deviating from standard C++,
you can actually use the parallelization opportunities of your Intel hardware by using the parallel STL syntax.
So that is supported in the beta C++ compiler.
And there are two levels, right? First is that you multi-thread your code which will allow you to leverage all the cores that you have
So if you have a dual-core laptop or a quad-core machine, what have you, right, the first level would
allow you to multi-thread your code, and the second level would go one step further and allow you to vectorize within those cores.
So that's when you are actually getting the most bang for your hardware buck, so to say, right, when you actually use the standard C++ language,
so it is portable across compilers,
but you're actually noticing a lot of performance benefit
when you're actually getting the full control over your hardware that you are running on.
And that's what we are investing in now.
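For a sense of what that looks like in code, the C++17 parallel algorithms take an execution policy as their first argument; a minimal sketch might be the following (the transform itself is invented for illustration, and the exact headers or namespaces in the 18.0 beta's Parallel STL may differ from plain standard C++17 shown here).

    #include <algorithm>
    #include <execution>
    #include <vector>

    int main() {
        std::vector<double> v(1'000'000, 1.0);
        // Level one: std::execution::par runs the algorithm across multiple threads.
        // Level two: std::execution::par_unseq additionally allows vectorization within each thread.
        std::transform(std::execution::par_unseq, v.begin(), v.end(), v.begin(),
                       [](double x) { return x * x + 1.0; });
        return 0;
    }

The appeal is that the call site stays standard, portable C++; swapping the policy (or dropping it entirely) changes how the work is scheduled without changing the algorithm.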
So that's the newer C++17 features in the compiler. So that's version release 18, you said, that is in beta right now, is that correct?
It is in beta right now as of a couple of weeks ago, and in September-ish, that's when the gold version, or the actual release for the year, happens.
And then we are on a minor update schedule.
So every quarter, roughly every three months, we have updates to your latest product.
And then one year later, we'll have a major release for, you know,
what we call 19.0 release in that next year.
So that's how we usually roll our product updates out.
Okay.
So does the VTune tool then actually point out to you
that this is like,
does it have the capability of saying
this standard algorithm could be parallelized?
You might want to look at that?
Or is that just something you need to find
where your hotspots are?
I think at a
high level, VTune will tell you, and you can fill me in here, Anoop, that you have the most amount of compute time spent in this part of the code, right? And then the next level would be to, you know, almost like double-click and zoom into that part of the code,
and you would then do maybe even like a line by line analysis
of what's going on in each line of your code, right?
That I think the capability is offered by VTune.
So there is capability to zoom in and out,
look at a macro level and a micro level of how your
code is performing at runtime.
And, you know, I would just like to point out that
VTune is a very powerful tool, but the compiler itself also
emits very detailed diagnostic reports.
So you don't have to use VTune.
You can actually use a compiler switch,
which will actually dump in text file,
you know, this loop was vectorized,
this loop was not vectorized
because of some dependency of a variable
or more informative messages
than, you know, a silent compile
or a silent just throwing an error and exiting out.
So there is a lot of control that you can actually get
through the compiler itself if you're not using VTune also.
That could definitely be very interesting.
So does it show up like a regular compiler message,
like if you're building in Visual Studio?
Right, yeah.
Yes, yes.
So it'll be like, remark, loop was vectorized
or loop was not vectorized
and possible dependency of variable X
onto something outside the loop
or something like that, right?
So the compiler deemed that it was unsafe
to vectorize a particular,
or parallelize at a higher level,
a particular piece of code,
and it bailed out.
You can, of course, force it,
but then, of course,
you have to be a more advanced programmer
to do that because the compiler,
the very first thing that it wants to do is generate correct code rather than the fastest code, right?
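As a hedged illustration of what's being described here: the Intel compiler's optimization-report switch (something like -qopt-report on Linux, going by documentation of that era; treat the exact spelling as an assumption) emits remarks such as the "loop was vectorized" messages quoted above, and a pragma can assert that an assumed dependency is safe to ignore. The loop below is invented for illustration.

    // Compiled with something like: icpc -O2 -qopt-report=2 example.cpp
    // (report switch per Intel's docs of the time; verify against your version)
    void scale(float* a, const float* b, int n, int offset) {
        // The compiler may assume a[i] and a[i + offset] alias and refuse to vectorize.
        // '#pragma ivdep' (an Intel extension) tells it to ignore assumed dependencies;
        // a portable alternative is '#pragma omp simd'. Only force this when you know it is safe.
        #pragma ivdep
        for (int i = 0; i < n; ++i) {
            a[i] = a[i + offset] * b[i];
        }
    }

The report then tells you whether the loop actually vectorized and, if not, which dependency or cost-model decision stopped it, which is the "informative messages" point Udit is making.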
And there are switches to control precision for some arithmetic operations, right?
You can actually sacrifice precision for performance, right?
If you only want to go to, you know,
two or three significant digits and you're fine,
but there are some applications,
especially, you know, in things like, you know,
rocket science, where you actually want to go to the 16th digit of precision
just because you can't tolerate any difference.
So there, you know, it's a trade-off
which you as an informed developer
would be controlling the compiler for,
but there are options to tweak the performance and sensitivity of the compiler.
You don't want to accidentally crash into the surface of Mars or something like that.
That has unfortunately happened.
Not with the Intel compiler, but this was actually 20, 25 years ago before the Intel compiler was a product, actually,
that there was some rocket which crashed because there was a numerical rounding error.
And, you know, that was a big lesson learned from the entire scientific computing community.
And I believe, you know, that that was a lesson, you know, which resulted in loss of a test rocket, but still it was a loss which would rather be avoided.
Right.
I have just kind of a generic question about performance and tuning, since that's what you guys seem to be focusing on. If I'm using Intel's compiler and VTune, and I'm, like, optimizing the heck out of my application, but it's a cross-platform application, and I'm on some other platform that you guys don't support, or some other CPU architecture, is there any chance that by optimizing for one platform, I might be de-optimizing for another platform?
Well, I mean, so there is something called portability,
and then there is the next step,
which is called the tuning part.
Obviously, any development project
which you write,
you start writing the functionality.
And obviously,
when you just implement the functionality,
that code is bound to be portable.
It'll work.
Now, when we talk about tuning and specific,
well, I mean,
there are a lot of ways to tune. Say you're going for a pragma-driven approach, which the reason
why I'm mentioning pragma-driven approach is because...
You mean like as in compiler pragmas?
That's right, yeah. Say, for instance, you're using OpenMP, right? So OpenMP is a standard,
and the best part about using something which is standardized is that when you go across different compilers,
and even if you hop to different architecture, the compiler vendor who provides or supports that OpenMP standard,
he would make sure that for the given pragma, he generates optimal code for the different architecture.
Now, that's the best thing about the standard.
And OpenMP, the reason why I'm bringing up OpenMP is OpenMP is one of the standards.
I mean, there is another one, like, in the different domain,
like, which is OpenCL for different targets, like maybe the graphics cards, right?
So the thing is, there are so many graphics cards available.
The thing is, every graphics card vendor, he provides his own compiler,
which is compliant with OpenCL.
So he provides an OpenCL driver for his target.
So as long as you write your code so that it basically complies with that standard, in our case, it happens to be OpenMP for the CPU targets.
Then what happens is if you switch to a different compiler, say, for example, Intel architecture, but a different compiler,
as long as the compiler supports OpenMP, great, it will go ahead and generate the code.
Now, for some reason, if the compiler doesn't support it, these are just pragmas.
It will just ignore it, right?
Okay.
Yeah, that's the behavior of the compilers across the board. Now, if it happens to be a different target other than Intel,
then obviously you'll be using a compiler provided by that vendor,
and as long as that supports OpenMP,
they're going to have the code generation for that target
based on your pragmas.
Right.
So that's how you write your program to be portable,
but at the same time you're not giving up on tuning either.
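To illustrate that portability point, here is a small sketch: the same loop with an OpenMP pragma builds everywhere, because a compiler that doesn't support (or isn't told to enable) OpenMP just ignores the pragma and compiles the serial loop. The function itself is made up for illustration.

    #include <cstddef>

    void sum_arrays(const float* a, const float* b, float* c, std::size_t n) {
        // With OpenMP enabled (e.g. -fopenmp on GCC/Clang, -qopenmp on the Intel
        // compiler, /Qopenmp on Windows), the iterations are split across threads;
        // without it, the pragma is ignored, possibly with a warning, and the loop
        // runs serially, so the code stays portable either way.
        #pragma omp parallel for
        for (std::size_t i = 0; i < n; ++i) {
            c[i] = a[i] + b[i];
        }
    }

That is the appeal of a pragma-driven, standards-based approach: the tuning annotation and the portable fallback are the same source code.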
Okay. Okay.
Is there anything else that we haven't gone over yet with Intel,
the compiler, or any other tools we haven't mentioned yet that you guys wanted to
name drop? I can touch upon because
most of our applications are on the
high-performance computing spectrum.
We have something called the message-passing interface,
which allows you to, again, do the third level of parallelization.
So I mentioned, right, the lowest level was vectorization.
Then you get the multi-core, and then now you're going out of one processor into multiple processors, which are connected through an Ethernet cable or whatever, right? So this is the distributed memory platform, and what we offer is an MPI library, the Message Passing Interface library, which allows you to scale out your application beyond, you know, the one physical device that you have to multiple physical devices. So, you mentioned the IBM Blue Gene, or if you have any cluster which is using multiple processors, you are actually able to scale out and deploy your code, your application, on multiple nodes also.
And that's the last tool to scale out your application.
And we see a lot of these high-performance computing centers actually use the MPI library also
to scale out their application
to the entire data center.
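For anyone who hasn't seen MPI, a minimal "scale out" sketch looks roughly like this; each rank (process) can live on a different node of the cluster, and the names below are the standard MPI C API rather than anything Intel-specific.

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);                  // start the MPI runtime

        int rank = 0, size = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    // which process am I?
        MPI_Comm_size(MPI_COMM_WORLD, &size);    // how many processes in total?

        std::printf("Hello from rank %d of %d\n", rank, size);

        MPI_Finalize();                          // shut down the MPI runtime
        return 0;
    }

You would typically launch it with something like mpirun -n N ./app, and the N processes can be scheduled across however many nodes the cluster provides, which is the distributed-memory level of parallelism on top of threading and vectorization.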
Okay.
Well, thank you so much for your time today,
Udit and Anoop.
Sure, sure.
Thank you for having me on the show.
Yeah, we're happy to help.
And there are some links I think you guys can share
which can allow you to get started.
And if you are a student, if you are a developer
who's contributing to open source, which many of you might be,
you can get started and start tuning your applications
and get the most out of your hardware.
Sounds great. Thank you.
Thank you.
Thank you.
Thanks, guys.
Thanks so much for listening in as we chat about C++. I'd love to hear what you think of the podcast.
Please let me know if we're discussing the stuff you're interested in, or if you have a suggestion for a topic; I'd love to hear about that too.
You can email all your thoughts to feedback at cppcast.com.
I'd also appreciate if you like CppCast on Facebook
and follow CppCast on Twitter.
You can also follow me at Rob W. Irving and Jason at Lefticus on Twitter.
And of course, you can find all that info and the show notes on the podcast website at CppCast.com.
Theme music for this episode is provided by PodcastThemes.com.