CppCast - Intel C++ Compiler

Episode Date: April 27, 2017

Rob and Jason are joined by Udit Patidar and Anoop Prabha from Intel to discuss Intel's C++ Compiler and suite of performance tuning software development tools.

Anoop Prabha is currently a Software Engineer in the Software and Services Group at Intel working with Intel® C++ Compiler Support. He played a paramount role in driving customer adoption for features like Intel® Cilk™ Plus, Explicit Vectorization, and Compute Offload to Intel® Processor Graphics across all Intel targets by creating technical articles and code samples and educating customers through webinars and 1-on-1 engagements. He is currently driving Parallel STL feature adoption (a new feature in the 18.0 beta compiler). Before joining Intel, Anoop worked at IBM India Private Ltd as a Software Developer for 3 years in Bangalore, India, and later completed his graduate studies at the State University of New York at Buffalo.

Udit Patidar works in the Developer Products Division of Intel, where he is a product manager for Intel software tools. He was previously a developer working on Intel compilers, focusing on the OpenMP parallel programming model for technical and scientific computing workloads. He has extensive experience in high performance computing, both at Intel and previously. Udit holds an MBA in General Management from Cornell University, and a PhD in Computer Science from the University of Houston.

News
- Sandstorm
- Cap'n Proto
- cppast - A library to parse and work with the C++ AST
- Exposing containers of unique pointers
- Clang-include-fixer

Guests
- Anoop Prabha
- Udit Patidar

Links
- Free Intel Software Development Tools
- Intel Parallel Studio XE Suite Page
- Intel System Studio Suite Page
- Intel C++ Compiler Product Page
- C++11 support
- C++14 support
- C++17 support
- Intel C++ Compiler Forum

Sponsors
- Conan.io
- JetBrains

Transcript
Starting point is 00:00:00 This episode of CppCast is sponsored by JFrog, the universal artifact repository, including C++ binaries thanks to the integration of Conan, the C and C++ package manager. Start today at jfrog.com and conan.io. JetBrains is offering a 25% discount for an individual license on the C++ tool of your choice: CLion, ReSharper C++, or AppCode. Use the coupon code JetBrains for CppCast during checkout at jetbrains.com. CppCast is also sponsored by Pacific++, the first major C++ conference in the Pacific region, providing great talks and opportunities for networking. Get your ticket now during early bird registration until June 1st. Episode 99 of CppCast with guests Udit Patidar and Anoop Prabha, recorded April 26th, 2017. In this episode, we discuss parsing the C++ abstract syntax tree. Then we talk to Udit Patidar and Anoop Prabha from Intel.
Starting point is 00:01:23 Udit and Anoop talk to us about the Intel C++ Compiler. I'm your host, Rob Irving, joined by my co-host, Jason Turner. Jason, how are you doing today? Good, Rob. Episode 99. Episode 99. We're almost at the big one. Almost, yeah. That's what they tell us, that 100 is supposed to be a big one. Yeah, and I don't want to give it away, but we do have a good special guest planned for next week. Obviously, all our guests are special, but we have a very special one for next week.
Starting point is 00:02:20 I won't give it away, but if you think back to some of the big name guests we've had, you can probably realize there's maybe one or two notable exceptions we haven't had on yet, and we're going to have one of them. All right. Should be fun. Should be fun. Well, at the top of every episode, I'd like to read a piece of feedback. This week, I got an email from Amit.
Starting point is 00:02:40 He wrote in, I really enjoy listening to CppCast every week. I would like to offer Kenton Varda as a guest on the show. As is apparent from the amount of growing traffic the GitHub project gets, Kenton is extremely productive and is always quick to answer questions and bug reports. He can be a great example of how to foster a community around an open source project, which is why I think he'll be an interesting guest for the show. Thanks for the podcast. Keep it up. So I reached out to Kenton. I haven't heard
Starting point is 00:03:17 back from him yet, but he definitely sounds like a good guest. I think we may have discussed Captain Proto on the show before, right, Jason? I'm almost positive it has come up on the show, yes. I know I've seen it at conference talks. I feel like it was probably discussed maybe with Odin when we had it on. Yeah, that seems to ring a bell for me. But either way, a conversation about developing community around your projects
Starting point is 00:03:42 might be interesting and beneficial to our audience. Yeah, absolutely. So hopefully Kenton will get back to us and we can schedule an interview with him. Well, anyway, we'd love to hear your thoughts about the show as well. You can always reach out to us on Facebook, Twitter, or email us at feedback at cppcast.com. And don't forget to leave us a review on iTunes. Joining us today, we have Udit Patidar and Anoop Prabha. Anoop is currently a software engineer in the software and services group at Intel, working with the Intel C++ compiler support.
Starting point is 00:04:12 He played a paramount role in driving customer adoption for features like Intel Cilk Plus, explicit vectorization, and compute offload to Intel processor graphics across all Intel targets by creating technical articles and code samples, educating customers through webinars and one-on-one engagements, and he is currently driving the
Starting point is 00:04:28 Parallel STL feature adoption. Before joining Intel, Anoop worked at IBM India Private Ltd as a software developer for three years in Bangalore, and later completed his graduate studies at the State University of New York at Buffalo. And Udit Patidar works in the developer products division of Intel, where he is a product manager for Intel software tools. He was previously a developer working on Intel compilers, focusing on the OpenMP parallel programming model for technical and scientific computing workloads. He has extensive experience in high-performance computing, both at Intel and previously. Udit holds an MBA in general management from Cornell and a PhD in computer
Starting point is 00:05:04 science from the University of Houston. Guys, welcome to the show. Thank you. Thank you very much. So this is the first time we've had two guests on the show at the same time. So maybe you could quickly introduce yourself just by name so everyone knows, you know, which voice is who. Okay. This is Anoop here.
Starting point is 00:05:24 I'm the one who is working as a software engineer at Intel. So, did you want to go next? Yes, and I am Udit. I am the product manager for Intel compilers and other C++ developer tools. Okay. Okay, so since this is the first time that we've had two people on, I, you know, I like to find something from the person's bio to talk about, but let's just go maybe a little bit, just my standard question I like to ask. So, Udit, how did you get started with programming? And then we'll move to Anoop and ask him. Interesting for me, because I started off as a double E in Sweden. I got into electrical engineering in Stockholm. And then as luck would have it, I started programming in MATLAB of all languages and then moved on to Fortran. And as a grad student, I started taking up C also. And most of my graduate work was in implementing a neuroscience code base on a parallel computer.
Starting point is 00:06:31 And that was a combination of Fortran, Perl at that time, and also some of the MATLAB routines. So it was a combination of whatever works best for the problem that I was trying to solve. And towards the end of my PhD program, I realized that I wanted to go into industry rather than academia. So, you know, I started looking for roles, which were more, you know, aligned with the high performance computing and technical computing. And Intel, you know, was one of the best employers in that field. So, you know,
Starting point is 00:07:11 that's where I actually ended up being in programming and, you know, learning to program in different languages. And now, you know, more of a generalist focusing on C++ and Fortran at Intel. I think you're the first person we've had on to mention MATLAB as a first language. But I know a lot of engineering departments actually require a MATLAB class. And this was like back 16 years ago. So I guess that's when MATLAB was flourishing and, you know, you did not have so much of Python. And R was basically non-existent. So MATLAB was the only engineering programming language in town
Starting point is 00:07:51 apart from if you were doing some C++ or Fortran. That's interesting. I haven't looked at MATLAB in at least 16 years. Last time I saw it wasn't. Exactly, right? And Anup, how did you get started programming? Well, I started programming as less to my salt was in the pool. Exactly, right? And a new poll.
Starting point is 00:08:06 How did you get started programming? Well, I started programming with C during my high school. So that's when I took the baby steps towards programming. But the very first class, I was like really onto it. And I really found that something within me is like what keeps me going is programming. And I continued that. So it went on for my undergraduate studies in India where, like, I was doing projects in C++, learned object-oriented concepts.
Starting point is 00:08:36 And I was doing my course on signal processing there. So that really helped. Obviously, it had a lot of challenging problems. I was introduced to MATLAB, but still I used to do a lot of programming in C++, which kept me going. And I know I had to write a lot of stuff on my own during that time, but still, that kept me going. And then as I went to IBM, I started working on Java and Oracle, like back-end programming. So I was kind of a module lead for the back-end. So kind of doing an object-oriented SQL programming there, PLSQL,
Starting point is 00:09:13 procedural-oriented language, and then moved on to the JavaScript and Java for the front-end development. In the university here in Buffalo, I was actually doing some programming stuff on the web. It was PHP, and I was also introduced to the Perl scripting because there were a lot of automations which had to be done at the university grading system, which obviously the best programming of all scripting language which I could think of for that problem was Perl. So that was a great opportunity for me to dive into Perl scripting.
Starting point is 00:09:49 And so during my thesis and master's, I was actually working on the airfoil, which is a CFD problem. And that's when I actually stepped back into the C++ programming arena. And then that continued from then, because after working on that thesis, actually, I was absorbed by Intel, and from that day I'm working on supporting the C++ compiler features which we have in our Intel compiler. As for my work in HPC, I haven't dealt with HPC applications directly myself, but I've been working with a lot of customers who actually work on HPC. So I'm kind of relevant in that way that across like oil and gas or even the animation folks
Starting point is 00:10:32 for that matter, I've worked closely with them and showed them how they can actually optimize their code to be more friendly with Intel hardware. So that's the short story. You both have experience with parallel computing then, if I understood correctly, right? That's right. Most of our customer base for the compiler
Starting point is 00:10:54 especially have historically been towards the higher end of the compute spectrum. So that's where our legacy has been. So that's where most of our customer base is. Okay. We'll have to dig into that a little bit more.
Starting point is 00:11:09 But first, we just have a couple news articles to discuss. Feel free to comment on any of these, either of you. First one is an article from Foonathan. And this is, Jason, I want to pronounce it as CPP-ass, but that doesn't sound right. CPP-AST. It's a library to parse and work with the C++ AST. And I thought this was a pretty interesting article. While working on standardese, Foonathan decided he needed to do some work with the abstract syntax tree, and he started off using libclang and just found a number of problems with it, and wound up creating his own library to kind of work around some of these issues he ran into.
Starting point is 00:11:51 yeah uh there's an interesting note at the top and also he discussed this on reddit saying in hindsight i should have probably used lib tooling okay but it's an interesting article to read and see what conclusions he came to for sure yeah i wonder if if he'll go back and decide to use lib tooling or if he's going to continue working and contributing to c++ ast i don't know yeah uh next one exposing a container containers of unique pointers um jason what were your thoughts on this one it it seems like a decent way to do things i saw some of the comments on reddit people were saying this is a great reason that we uh we need ranges and c++ uh yeah that might be i didn't actually read the reddit comments on this one. That might be a good argument there.
Starting point is 00:12:46 That's a good point. But I thought it was well-structured to get you to the point of implementing your own iterator, essentially. Did either of you guys get the chance to read this article also? Yes, I did. And I think there is a classic comparison with the Boost iterators, which they have, which kind of covers the actual dereferencing part. So the advantage of using these unique pointers is obviously that it kind of has that ownership for that particular object. And the fact that they want to hide it at the code level, I think that's what the article mainly focuses on. And I think the ideas which are shared out there, I would say the code snippets kind of give a classic example on how to do it ourselves.
Starting point is 00:13:37 Or I would also recommend taking a look at the Boost equivalent of that, which also does the same job. Yeah, every now and then I see these Boost articles, these people mention the Boost iterator helpers, which I've never personally worked with, but there's definitely cases where I can see them coming up, being useful. I guess if I'm, I'm sorry, go ahead, Rob. No, go ahead. So one of the comments at the top here says the data is not stored contiguously in this vector because it's a vector of unique pointers, right? So I was just thinking this is maybe perhaps a downside of hiding the fact that it's unique pointers.
Starting point is 00:14:31 Because sometimes you think, well, I'm working with a vector. I'll just get a pointer to the first element, and I know all the data has to be contiguous because that's the point of a vector. But if it's obscured from you, you get the first element, and now the data's not contiguous. I don't know.
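To make the technique under discussion concrete, here is a minimal sketch of an iterator adaptor that hides the unique_ptr from callers. All names here are hypothetical, and this is not the article's actual code; it is just one way the idea can look in C++14:

    #include <iostream>
    #include <memory>
    #include <vector>

    // Wraps an iterator over unique_ptrs and dereferences through the
    // pointer, so users see plain references instead of smart pointers.
    template <typename Iter>
    class deref_iterator {
    public:
        explicit deref_iterator(Iter it) : it_(it) {}
        auto& operator*() const { return **it_; }  // iterator, then unique_ptr
        deref_iterator& operator++() { ++it_; return *this; }
        bool operator!=(const deref_iterator& rhs) const { return it_ != rhs.it_; }
    private:
        Iter it_;
    };

    struct widget { int value; };

    // Owns its widgets via unique_ptr, but hands out widget& to callers.
    class widget_list {
        using storage = std::vector<std::unique_ptr<widget>>;
    public:
        using iterator = deref_iterator<storage::const_iterator>;
        void add(int v) { items_.push_back(std::make_unique<widget>(widget{v})); }
        iterator begin() const { return iterator(items_.begin()); }
        iterator end() const { return iterator(items_.end()); }
    private:
        storage items_;
    };

    int main() {
        widget_list list;
        list.add(1);
        list.add(2);
        for (const widget& w : list)  // the unique_ptr never leaks into user code
            std::cout << w.value << '\n';
    }

As Rob notes, the ownership detail is hidden but so is the memory layout: the widgets themselves are individually heap-allocated, so iterating this way is not the same as walking a contiguous vector of widgets.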
Starting point is 00:15:04 that other languages maybe have, you know, fixes for in their IDE integration, like adding, using namespace automatically for you in a language like C Sharp or Java. But there's no way within C++ to automatically add includes. So that's what this project is trying to solve for you. And Jason, I thought you would like how they showed you how to do the integration with Vim and Emacs. Yes. That looks very interesting, although I did not try it out yet.
Starting point is 00:15:33 Yeah. Is there anything you guys wanted to comment on this one? So do we have a vi versus Emacs community here? I love vi. For reasons I won't get into at the moment, I've been using Emacs recently, and I feel like I just want to make the comment that anyone who thinks Vim is obtuse should try using Emacs.
Starting point is 00:15:55 I've used both so I also have my slight inclination towards Emacs but it's a topic for another time. Although I think it brings up an interesting question. You guys are compiler authors, and whenever we talk about IDEs on this show, we talk about Clang, we talk about Visual Studio, we talk about GCC, we just don't bring up
Starting point is 00:16:18 Intel C++ compiler very often. Do you guys use an IDE? What IDEs can work with Intel's C++ compiler? We have good integration as a compiler vendor with both Visual Studio and Eclipse, and even Xcode, I believe, right, yes, on Mac. So the idea is to have a drop-in replacement for the default compiler in those IDEs. So the debuggers and other features should work as seamlessly as possible. So yes, the major IDEs on the various OSes, we do support those. Okay.
Starting point is 00:17:07 Okay. So let's start talking about the Intel C++ compiler. As Jason just mentioned, we frequently talk about MSVC, GCC, and Clang on this show. Where exactly do you see the Intel compiler fitting in amongst those other, maybe more well-known or more talked about ones? Okay, so obviously for folks who are actually developing their application with Microsoft on Windows, or like a GCC compiler on Linux, or like Clang, it is quite easy for them to actually transition over to the Intel compiler, because the Intel compiler provides great compatibility, especially on Windows with source and binary compatibility with the Microsoft
Starting point is 00:17:52 Compiler. And the same applies for the source and binary compatibility with GCC in the case of Linux. So what's the big thing about the compatibility? Well, it makes it easier for developers to actually evaluate Intel compiler to see if they get a performance edge for any performance sensitive application or library they have. Now, if their application is really huge, I mean, it's like spans 150, 200 projects, each project having like thousands of files in it. Well, it's not an easy engineering effort to put the full thing over to a different compiler. But obviously, having the source and binary level compatibility makes it a lot easier to selectively
Starting point is 00:18:36 convert the performance sensitive projects to use the Intel compiler. So the mix and match of having the binaries generated by two different compilers and make it work as a single application, that's where the real plus comes from. Now, I've been talking about performance, and that's the main objective
Starting point is 00:18:56 for anyone who looks at the Intel compiler, because we actually have very good code generation for different targets of Intel, which includes starting from SSE all the way to AVX-512. The code generation part is really optimal for IA targets across different architectures. And that's what makes it all the more important for people to evaluate, especially their performance-critical kernels in their project with the Intel Compiler.
Starting point is 00:19:24 And that's where it exactly fits in the ecosystem, too. All right, you mentioned source compatibility, and each compiler has its own bugs. In some ways, unintentionally, people come to rely on those bugs. Do you emulate the bugs, depending on which compiler you're plugging in for? I'm being serious. So, yeah, I mean, we have had cases before where this was on Windows. I don't want to specifically talk about the bug here, but this was in Visual Studio 2010
Starting point is 00:19:57 and moving from 2010 to 2012. So what we found was one of the customers was using Visual Studio 2010. They found their application works fine, but when they actually started using Intel Compiler on Windows, they found that, well, they had a memory leak in the application. And then they were like, why is it happening just with Intel Compiler? And later we figured out that when we were actually doing the testing internally, that this was really a problem with the memory allocation thing.
Starting point is 00:20:31 And that was fixed in the later version of Visual Studio. So that's a classic example where like we have a C or C++ standard, which is published online. And every compiler vendor, they try to have their implementation stick to those standards. Now, when I say source compatibility, what I really mean by that is that we depend on a lot of headers, which is provided by Microsoft on the Windows side and the headers from GCC on the Linux. So that's exactly what I mean by the source compatibility part of it. And binary compatible, obviously, the object files and the libraries
Starting point is 00:21:12 which we generate are compliant with the ones from Windows, like the Microsoft compiler, or on the Linux side with the GCC compiler. So that makes it a lot easier for two different binaries generated by two different compilers to coexist in the same application. So you can handle all of the Windows system headers and link to all the Windows system libraries with no problem?
Starting point is 00:21:36 That's right. Okay. Does that mean you can compile a project that relies on something like MFC or C++ CLI on Windows? Is that an issue? Let's see. I don't see an issue there either.
Starting point is 00:21:52 Okay. Yeah. Okay, so let's talk a little bit about, you know, what type of customer might want to use the Intel compiler. You kind of made some mentions when we were talking about your bios that it's kind of a high-performance community. That was originally our customer base. So we started off with your national labs, right, your Los Alamos National Labs.
Starting point is 00:22:27 And in Europe, we had CERN being one of our big customers and NASA, right? Those big, big high-performance computing centers around the world. And then many of our biggest customers are also in computational fluid dynamics code, right? Oil and gas down in Texas, life sciences. So it's, you know, wherever you have compute intensive C++ and also Fortran codes, we have the applicability of our compiler. And on the other end of the spectrum, we do have some customers who are using the compiler for embedded applications. So it's, you know, you have medical scanners, right? So if you have a GE or a Philips who have a medical scanner and basically the medical scanner is doing a lot of real time application image processing to generate a CT scan or to generate what have you. So those are also high performance applications, although they are in a different form factor, but they
Starting point is 00:23:26 have the same requirements. So that's where we also see our compiler getting used. So it's, you know, wherever you have the requirement of a lot of compute capabilities and performance is critical on Intel compatiblecompatible hardware, that's where we see our compiler basically being used. So I guess that brings up two questions for me. You're saying, well, Intel makes some ARM chips also, correct? Not just x86 line?
Starting point is 00:24:01 That's a newer business that they are going into, but we as a compiler vendor and a C++ tools vendor, our group does not support that yet. And I don't know if we have plans, but as of right now, we have x86 and Intel architecture compatible vendors being the major target customers. So you've mentioned Intel compatible, and I was looking, a lot of the supercomputers aren't necessarily Intel chips. Sometimes they're AMD. And actually, looking at IBM Roadrunner, which I was not familiar with, at Los Alamos, it has IBM PowerCell processors. I don't know if those are Intel compatible or not. AMD is. AMD historically has been compatible with Intel.
Starting point is 00:24:56 They borrowed some of ours and we borrowed some of their ideas. But the idea is that Intel and AMD at an instruction set level are compatible. So we are able to generate optimized code for AMD. And I know that, you know, architecture such as such as Sun or IBM power are different. So we don't support those. And ARM on the embedded side is also different. IBM Power are different, so we don't support those. And ARM on the embedded side is also different, so we don't support those architectures. That's true.
Starting point is 00:25:33 Okay. But you do equally as well support AMD as the Intel brand? Yes, yes, yes. But I would have to caveat there, because we, as an Intel software team, have much better knowledge of the Intel hardware, right? Because it's just the nature of what the compiler has visible to it to optimize. But it's just something that we try and optimize
Starting point is 00:26:20 to the best of our knowledge. Okay. I wanted to interrupt this discussion for just a moment to bring you a word from our sponsors. JFrog is the leading DevOps solution provider that gives engineers the freedom of choice. Manage and securely host your packages for any programming language with Artifactory.
Starting point is 00:26:37 With highly available registries based on-prem or in the cloud and integrations with all major build and continuous integration tools, Artifactory provides the only universal automated end-to-end solution from development to production. Artifactory now provides full support for Conan, the C and C++ package manager, the free open source tool for developers that works in any OS or platform and integrates with all build systems, providing the best support for binary package creation and reuse across platforms. Validated in the field to provide a flexible,
Starting point is 00:27:08 production-ready solution, and with a growing community, Conan is the leading multi-platform C++ package manager. Together, JFrog and Conan provide an effective, modern, and powerful solution for DevOps for the C++ ecosystem. If you want to learn more, join us at the JFrog User Conference SwampUp 2017
Starting point is 00:27:26 running the 24th to 26th of May in Napa Valley, California, or visit Conan.io. So I don't want to get into licensing too much or anything, but is there any availability for the Intel compiler for student or open source projects? There is. And actually, that's a good point. Students are one of our major users.
Starting point is 00:27:52 Research and academia as a general segment is one of our major users, because a lot of R&D projects in, say, oil and gas or physics or financial services, right, quantitative analysts and so on, those are academic projects. So we do have a lot of offerings for academia, and there are certain segments who can get our compiler and other C++ tools for free, right, without any charge. And there is, you mentioned, an open source license also. And the idea is that if you are a developer who's contributing to an open source project, you can use our tools for free, without any charge. So, you know, we have the community licensing
Starting point is 00:28:47 program which basically is for educators students and open source contributors and then there's the commercial uh uh side of the house so these are two different um it's the same product but two different licensing options i guess just as an aside i I'd like to point out, I think it's interesting how many companies are offering licensing to open source developers right now. It is, and it's been more and more with CLang coming into the picture. You have more favorable licensing, right, where you can actually mix and match proprietary and open source code as a compiler vendor. So more and more companies are taking advantage of that as opposed to GCC where, you know, you have to adhere to the GPL, right, GNU public license. So the community is moving in that direction, yes.
Starting point is 00:29:39 Yeah, not to, I guess, you know, try to make you do a direct comparison with Clang or anything, but it is interesting that they seem to be aiming for approximately the same thing you guys are doing. They're a drop-in replacement compiler for Visual Studio or for GCC. That is true. And we are aware of the community's interest in Clang, right? So we are closely monitoring and looking at being compatible with Clang. But as of right now, our compatibility extends to GCC and Visual C++. And, again, I would point out that, you know, a person would use the Intel compiler for getting superior performance.
Starting point is 00:30:25 And Clang, in the Phoronix and other comparisons that I've seen, it is getting up there and it is very promising. But I'm not aware of where we stand with the peak performance just yet. Okay. Yeah, I was curious if you're willing or interested in actually giving us some performance comparisons for how you guys compare to the other compilers. We do have. So I said that most of our customers, they are HPC, high performance computing applications. So there is this industry standard benchmark called SPEC.
Starting point is 00:31:17 And SPEC is basically for evaluating CPU performance. And the latest benchmark is SPEC 2006, SPEC CPU 2006. And, you know, we benchmark regularly on the latest Intel hardware against GCC and Clang on Linux, and against PGI, and also against Visual C++ on Windows. And the aim is to consistently be at least 20% better in performance, in runtime performance, for both the floating point and integer variants of the SPEC CPU benchmark. And, you know,
Starting point is 00:31:58 we have maintained this differential for a while now. But, you know, these are the in-house comparisons that we do for spec. And many of the applications which we actually optimize for are used by HPC centers around the world. So these could be your CFD applications.
Starting point is 00:32:20 It could be crash testing applications. These could be molecular dynamics applications. And those are also optimized. And we do aim to get the Intel compiler to outperform the other alternatives there also. 20% is pretty huge. 20% is what we aim for on the benchmark, yes. And this is just for the compiler. And you may be aware that, you know,
Starting point is 00:32:50 Intel has other C++ tools also. You know, there's a threading library, which is called Threading Building Blocks, which allows you to multi-thread. There's vectorization opportunities in the Intel compiler. So, you know, when you actually get the compounded performance benefit, it is much more than 20% from what we've seen, right? You can get in orders of magnitude rather than in percentage points.
Starting point is 00:33:21 If you are actually seriously, you know, tuning your code using our tools for the Intel hardware. Wow. Orders of magnitude is even bigger than 20%. Exactly. And all this is to say that you should really make sure that you don't leave any performance on the table, right? If you have the opportunity, you know, do some analysis, you know, using our tools or using, you know, some open source tools to see if there is some headroom
Starting point is 00:33:54 for performance benefits, right? And then direct your efforts to optimize and not have any bottlenecks. Because, you know know as the latest intel hardware is coming along right it's not that the clock speeds are going higher they have been stagnant for you know for so many years what's been uh what's been going on now is uh you have more and more cores coming into uh you know into one processor so you have dual core, quad core, what have you. And inside each core, you have something called a SIMD vectorization or SIMD vector lanes, which basically allows you to have a single instruction, multiple data kind of processing.
Starting point is 00:34:38 And if you are not utilizing the multiple vectors and the multiple cores of whatever hardware that you are investing in, you are basically leaving performance on the table. And that's with any compiler that you use: without tuning your application, you're going to be leaving performance on the table. So then do the Intel tools actually help you find opportunities for multi-core parallelization, not just vectorization? There are tools, yes. So our compiler and the threading library that I mentioned, they come in a suite of tools called the Intel Parallel Studio.
Starting point is 00:35:20 And again, there's the word parallel in there, right? Right. So just to, you know, continue to emphasize the point. But one of the tools inside Parallel Studio is called VTune, which is a hotspot analysis tool, which will look at your code and identify where most of your time is being spent, right? It could be some kernel, like a for loop or something, which might be taking 60 or 80% of your time, and you can direct your optimization efforts there.
Starting point is 00:35:54 So you can have some, you know, some vectorization opportunities using OpenMP as a parallel programming model that we support, and you can actually vectorize that code, and then suddenly you see so much performance benefit. There's another tool called Vector Advisor, or it's just called Advisor Now, which actually specifically looks at vectorization headroom in your code. And these are what we classify as analysis tools. And then there are the build tools, right?
Starting point is 00:36:28 And then there are the build tools, right? The compiler, the libraries, you know, you have the threading library, which I mentioned. We have a math library, which is called the Math Kernel Library, which provides a lot of C++ and Fortran bindings for matrix multiply, Fourier transforms, things of that nature for linear algebra kind of applications. And once you use all of those in an optimal way, that's when you start to see the performance benefit. And our support team, like Anoop and his colleagues, they are the ones who are at the forefront of this knowledge, how to actually help the customers get the most out of what they're trying to do. How easy is it to run VTune against an existing application to look for performance critical
Starting point is 00:37:22 areas that you might want to improve on? Well, I mean, it depends on what you're looking for. It could be at a higher level where you are a new developer who's introduced to an existing application and you would like to look for tuning opportunities on a given hardware
Starting point is 00:37:39 architecture. So what you would do is basically do a general hotspot analysis where you basically try to run different workloads on that application, the usual workloads, and try architecture. So what you would do is basically do a general hotspot analysis where like you basically try to run different workloads on that application, the usual workloads, and try to see where the application spends most time. And often it's the case in like 99% of the applications that it's not a flat profile. It's like up and down. So you like to see all the ups, the functions where it spends most time, and then further drill down, and you can do a general exploration reporting where it will actually give you a detailed information on what kind of, I mean, how many L1 cache hits or cache misses
Starting point is 00:38:18 have happened, same at the L2 level. It also has a separate report for bandwidth analysis. If you feel that your application, in spite of being computationally optimized, still doesn't show any improvement in terms of performance, then you need to really check if it is compute-bound or memory-bound. And VTune is a place where you can actually go and check whether your application's real bottleneck is on the compute side or on the memory side. Also, I want to mention that the Advisor tool, which Udit just mentioned, also provides a new feature, which is actually rolled out with this beta,
Starting point is 00:38:57 which is called roofline analysis. So it basically gives in a pictorial manner where your application stands in a timeline. Because every application, when it starts, obviously it needs to know the data. So every application starts as a memory bound. But what happens after that is basically deciding the nature of the application. So after that ramp up time, if it continues to be a memory bound application, then you could see that either in the roofline analysis on Visor or with the memory bandwidth analysis on the VTune side. Now, with
Starting point is 00:39:32 that being said, these are the reports which will be quite handy for any developer who's trying to do the tuning. But we also provide a custom report analysis where like you as a developer, if you're looking for certain specific hardware events, you can create a custom report where you can choose the hardware event which you want to monitor. And then you can come up with your own formulas to look for the metrics which you are specifically interested in. So that flexibility is also offered. Well, if you don't mind, if we take just one little step back, you mentioned L1 and L2 cache and memory being memory bound. Would you mind explaining what the significance of the cache layers are and what memory bound would mean to our listeners?
Starting point is 00:40:14 Sure, yeah. So, I mean, as Udit was mentioning, I mean, initially the race was about, if you look at the processor, like generations, like dating back to Pentium or even before, initially it was all about trying to increase the frequency of operation of individual cores so that you can have more number of cycles per second, which means more number of instructions can be executed per second. So that's the scalar optimization part. But obviously we hit a bottleneck there. So it wasn't like once we introduced the multi-core architecture, none of the programs was actually
Starting point is 00:40:51 getting free performance, or a free lunch, as we call it at Intel. So the developer really has to go and change their code to make optimal use of the different cores. Now, with that being said, the processing speed has tremendously increased, because the processors are operating at gigahertz. But that's not the case with the memory. Because memory, if you look at the DRAM, the speed at which the DRAM operates, or the read or write which happens from the DRAM, is very slow when compared to the speed at which the processor is operating.
Starting point is 00:41:26 So when your code is actually data hungry, when it is looking for data from the memory, then your CPU is sitting idle waiting for the data to be fetched into the register so that it can start computing. So that's what I mean by memory bound, because your core is operating at a really high speed and it can only operate when the data is available. So if it is waiting for the data from the memory, then you're memory bound. No matter how much compute optimizations you do, eventually the data has to arrive and that's where the bottleneck is. Now, with that being said, obviously, we follow a new architecture. It's a non-uniform memory access because you have different layers of memory.
Starting point is 00:42:07 Now, if you look at the DRAM in that hierarchy, that's the last in the hierarchy. Obviously, the capacity is more, but the speed at which it can deliver the data is the least. Now, if you look at the other levels, which are faster, well, we have different levels of caches, and that's what I mean by L1, L2, and L3, with L1 being the closest to the processing core, and obviously that's the fastest memory. But it comes with the trade-off that the size is less. So what we expect the developers to do is basically look at the compute kernel. And if the data is bigger than what can fit into the L1 cache, you basically do a tiling of your data, which means you divide your data set so that each tile perfectly fits into your L1 cache. You utilize that data
Starting point is 00:42:57 for the computation and then pull the next tile in. So that's what it means by L1. Now, if you don't do this optimization, you would see that there's a lot of data which is in L1, which is not getting properly utilized for the computation. And then there is a miss happening in L1, which will lead to the data being looked for in L2. And if there is a miss in L2, then it will look for data in last level cache. And if it is not there in the last level cache, then it will go into the DRAM. So the more deeper you go, the more slower it gets. And that's what another level of tuning is where you look at the behavior of your application and then try to optimize it.
Starting point is 00:43:34 So VTune will be able to tell you basically what the data access patterns are. And I'd just like to add that VTune can be run in batch mode through a command line, or you can have a GUI which will give you a pictorial representation of what is going on in your application. That's interesting. I appreciate the explanation of the L1, L2, L3 cache. It's nice to get that from a hardware guy. Yeah, I'm curious. Does L1 actually run at the native speed of the processor or is it still slower than registers?
Starting point is 00:44:08 It is slower, but obviously it is the fastest you can get on a core right now. Okay. Yeah. So going back to the compiler a bit, does it support all of the latest modern C++ features, 11, 14? Are you already working on C++ 17?
Starting point is 00:44:27 It does. So we actually have backwards compatibility to C99 also, and this being the 99th episode, I'll just say a shout-out to C99. But yes, we do have C++11 and C++14 supported in the latest version of the compiler, and our compiler is actually run by, you know, model year, similar to what a car would do. So we are now releasing the 18.0 compiler for 2018 products, and it's in full support of C++14, and we have some support for C++17,
Starting point is 00:45:07 which was recently ratified as a standard, right? So the support that we have invested in right now is the parallel STL, or the standard template library, which was extended to offer multi-threading and parallelism. And I mentioned our different tools, right? We had the threading library called Threading Building Blocks and the compiler. And the idea is that if you are using templatized code, right, through the standard template library,
Starting point is 00:45:40 then you can actually, without changing your code, without deviating from standard C++, use the parallelization opportunities of your Intel hardware by using the parallel STL syntax. So that is supported in the beta C++ compiler. And there are two levels, right? First is that you multi-thread your code, which will allow you to leverage all the cores that you have. So if you have a dual core laptop or a quad core machine, what have you, right, the first level would allow you to multi-thread your code, and the second level would go one step further and allow you to vectorize within those cores. So that's when you are actually getting the most bang for your hardware buck, so to say, right, when you actually use the standard C++ language,
Starting point is 00:46:46 so it is portable across compilers, but you're actually noticing a lot of performance benefit when you're actually getting the full control over your hardware that you are going on. And that's what we are investing in new. So that's C++ 17 newer features in the in the compiler so that's uh version release 18 you said um that is in beta right now is that correct it is in beta right now as of a couple of weeks ago and in september ish that's when the uh the uh the gold version or the actual release for the year happens. And then we are on a minor update schedule.
Starting point is 00:47:28 So every quarter, roughly every three months, we have updates to your latest product. And then one year later, we'll have a major release for, you know, what we call 19.0 release in that next year. So that's how we usually roll our product updates out. Okay. So does the VTune tool then actually point out to you that this is like, does it have the capability of saying
Starting point is 00:47:56 this standard algorithm could be parallelized? Do you might want to look at that? Or is that just something you need to find where your hotspots are? I think at a high level vtune will tell you um and you can film me and anuk that uh uh you have uh uh you have the most amount of compute time spent in this part of the code right and uh then uh the next level would be to uh you know almost like double click and zoom into that part of the code
Starting point is 00:48:27 and you would then do maybe even like a line by line analysis of what's going on in each line of your code, right? That I think the capability is offered by VTune. So there is capability to zoom in and out, look at a macro level and a micro level of how your code is performing at runtime. And I know I would just like to point out that VTune is a very powerful tool, but the compiler itself also
Starting point is 00:49:00 emits very detailed diagnostic reports. So you don't have to use VTune. You can actually use a compiler switch, which will actually dump in text file, you know, this loop was vectorized, this loop was not vectorized because of some dependency of a variable or more informative messages
Starting point is 00:49:22 than, you know than a silent compile or a silent just throwing an error and exiting out. So there is a lot of control that you can actually get through the compiler itself if you're not using VTune also. That could definitely be very interesting. So does it show up like a regular compiler message, like if you're building in Visual Studio? Right, yeah.
Starting point is 00:49:48 Yes, yes. So it'll be like, remark, loop was vectorized or loop was not vectorized and possible dependency of variable X onto something outside the loop or something like that, right? So the compiler deemed that it was unsafe to vectorize a particular,
Starting point is 00:50:07 or parallelize at a higher level, a particular piece of code, and it bailed out. You can, of course, force it, but then, of course, you have to be a more advanced programmer to do that because the compiler, the very first thing that it wants to do is generate correct code rather than, right?
Starting point is 00:50:30 And there are switches to control precision for some arithmetic operations, right? You can actually sacrifice precision for performance, right? If you only want to go to, you know, two or three significant digits and you're fine, but there are some applications, especially, you know, in things like, you know, rocket science where you actually want to go to the 16th level of precision just because you don't want to make any difference.
Starting point is 00:50:57 So there, you know, it's a trade-off which you as an informed developer would be controlling the compiler for, but there are options to tweak the performance and sensitivity of the compiler. You don't want to accidentally crash into the surface of Mars or something like that. That has unfortunately happened. Not with the Intel compiler, but this was actually 20, 25 years ago before the Intel compiler was a product, actually, that there was some rocket which crashed because there was a numerical rounding error.
Starting point is 00:51:35 And, you know, that was a big lesson learned from the entire scientific computing community. And I believe, you know, that that was a lesson, you know, which resulted in loss of a test rocket, but still it was a loss which would rather be avoided. Right. I have just kind of a generic question about performance and tuning since that's what you guys seem to be focusing on if i'm using intel's compiler and v2 and i'm like optimizing the heck out of my application but it's a cross-platform application and i'm on some other platform that you guys don't support uh or some other cpu architecture is there any chance that by optimizing for one platform i might be de-optimizing for another platform well uh, I mean, so there is something called portability and then there is the next step,
Starting point is 00:52:29 which is called the tuning part. Obviously, any development project which you write, you start writing the functionality. And obviously, when you just implement the functionality, that code is bound to be portable. It'll work.
Starting point is 00:52:41 Now, when we talk about tuning and specific, well, I mean, there are a lot of ways to tune. Say you're going for a pragma-driven approach, which the reason why I'm mentioning pragma-driven approach is because... You mean like as in compiler pragmas? That's right, yeah. Say, for instance, you're using OpenMP, right? So OpenMP is a standard, and the best part about using something which is standardized is that when you go across different compilers, and even if you hop to different architecture, the compiler vendor who provides or supports that OpenMP standard,
Starting point is 00:53:16 he would make sure that for the given pragma, he generates optimal code for the different architecture. Now, that's the best thing about the standard. And OpenMP, the reason why I'm breaking OpenMP is OpenMP is one of the standards. I mean, there is another one, like, in the different domain, like, which is OpenCL for different targets, like maybe the graphics cards, right? So the thing is, there are so many graphics cards available. The thing is, every graphics card vendor, he provides his own compiler, which is compliant with OpenCL.
Starting point is 00:53:49 So he provides an OpenCL driver for his target. So as long as you write your code so that it basically complies with that standard, in our case, it happens to be OpenMP for the CPU targets. Then what happens is if you switch to a different compiler, say, for example, Intel architecture, but a different compiler, as long as the compiler supports OpenMP, great, it will go ahead and generate the code. Now, for some reason, if the compiler doesn't support it, these are just pragmas. It will just ignore it, right? Okay. Yeah, that's the behavior of the compilers across the board. Now, if it happens to be a different target other than Intel,
Starting point is 00:54:25 then obviously you'll be using a compiler provided by that vendor, and as long as that supports OpenMP, they're going to have the code generation for that target based on your pragmas. Right. So that's how you write your program to be portable, but at the same time you're not giving up on tuning either. Okay. Okay.
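A tiny sketch of that portability property (hypothetical function): a compiler with OpenMP support threads the loop below, while any other compiler simply ignores the unknown pragma, perhaps with a warning, and still builds a correct serial loop from the same source:

    void add_arrays(double* out, const double* a, const double* b, int n) {
        #pragma omp parallel for  // ignored where OpenMP is unsupported
        for (int i = 0; i < n; ++i)
            out[i] = a[i] + b[i];
    }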
Starting point is 00:54:47 Is there anything else that we haven't gone over yet with Intel, the compiler, or any other tools we haven't mentioned yet that you guys wanted to name drop? I can touch upon, because most of our applications are on the high-performance computing spectrum, something called the message-passing interface, which allows you to, again, do the third level of parallelization. So I mentioned, right, the lowest level was vectorization.
Starting point is 00:55:19 Then you get the multi-core, and then now you're going out of one processor into multiple processors which are connected through an Ethernet cable or whatever, right? So this is the distributed memory platform, and what we offer is an MPI library, the Message Passing Interface library, which allows you to scale out your application beyond the, you know, the one physical device that you have, to multiple physical devices. So, you mentioned the IBM Blue Gene, or if you have any cluster which is using multiple processors, you are actually able to scale out and deploy your code, your application, on multiple nodes also. And that's the last tool to scale out your application. And we see a lot of these high-performance computing centers actually use the MPI library also
Starting point is 00:56:26 to scale out their application to the entire data center.
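To sketch that third, scale-out level, here is a minimal MPI program (a hypothetical example, not Intel MPI sample code; build with an MPI wrapper compiler such as mpicxx and launch with mpirun across the nodes of a cluster):

    #include <cstdio>
    #include <mpi.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // this process's id in the job
        MPI_Comm_size(MPI_COMM_WORLD, &size);  // total processes, possibly on many nodes

        // Each rank would work on its own slice of the problem here.
        std::printf("rank %d of %d reporting\n", rank, size);

        MPI_Finalize();
        return 0;
    }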
Okay. Well, thank you so much for your time today, Udit and Anoop. Sure, sure. Thank you for having me on the show. Yeah, we're happy to help.

Starting point is 00:56:42 And there are some links I think you guys can share which can allow you to get started. And if you are a student, or if you are a developer who's contributing to open source, which many of you might be, you can get started and start tuning your applications and get the most out of your hardware. Sounds great. Thank you. Thank you.
Starting point is 00:57:02 Thank you. Thanks, guys. Thanks so much for listening in as we chat about C++ I'd love to hear what you think of the podcast please let me know if we're discussing the stuff you're interested in or if you have a suggestion for a topic I'd love to hear about that too you can email all your thoughts to feedback at cppcast.com I'd also appreciate if you like CppCast on Facebook and follow CppCast on Twitter.
Starting point is 00:57:26 You can also follow me at Rob W. Irving and Jason at Leftkiss on Twitter. And of course, you can find all that info and the show notes on the podcast website at CppCast.com. Theme music for this episode is provided by PodcastThemes.com.
