CppCast - Linear Algebra Standardization

Episode Date: April 28, 2022

Mark Hoemmen joins Rob and Jason. They first talk about debugging improvements in VS Code and C++20/23 features going into MSVC. Then they talk to Mark Hoemmen about his past work on the linear algebra libraries Tpetra and Kokkos, and current efforts to get linear algebra into the standard.

News

- What's new for C++ debugging in VS Code
- Conformance should mean something - fputc, and freestanding
- MSVC C++20/23 Update

Links

- Tpetra parallel linear algebra
- P1417R0: Historical lessons for C++ linear algebra library standardization
- P1673R7: A free function linear algebra interface based on the BLAS
- P1674R1: Evolving a Standard C++ Linear Algebra Library from the BLAS

Patreon

- CppCast Patreon

Transcript
Starting point is 00:00:00 Episode 347 of CppCast with guest Mark Hoemmen recorded April 26, 2022. This episode of CppCast, we're thanking all of our patrons on Patreon. Thank you so much for your ongoing support of the show. And if you'd like to support the show too, please check us out at patreon.com slash cppcast. In this episode, we talk about debugging improvements in VS Code. And we talk to Mark Hoemmen. Mark talks to us about his work on linear algebra libraries and more. Welcome to episode 347 of CppCast, the first podcast for C++ developers by C++ developers. I'm your host, Rob Irving, joined by my co-host, Jason Turner.
Starting point is 00:01:16 Jason, how are you doing today? I'm okay, Rob. How are you doing? I'm doing all right. I saw a movie last night. You saw a movie last night? Which one? Northman. I think I've seen a preview for that. It's pretty good. Very violent, but very good. It's like
Starting point is 00:01:31 Hamlet, but with Vikings. Is it violent in like a comedic kind of way? No. Okay. Action movie. Okay. I say I don't go for over-the-top violence unless it's over the top to the point of ridiculousness. I'll watch the Expendables movies,
Starting point is 00:01:49 which are just comedy, basically. Like, yes, that much blood can't come out of a human. You know, like that kind of... Yeah. Do you have anything you want to talk about before we get started? I feel like we may as well go ahead and mention that C++ on Sea registration is still open. And I will be doing a workshop there on doing code reviews
Starting point is 00:02:09 because I thought that seemed like a fun one. I've never done a workshop on that before. But I like doing code reviews. So I thought maybe if I can kind of help people get a better feel for what a good code review looks like, then that would be fun. And I'm keynoting at that, which I think we've mentioned. I think so. When is C++ on Sea? Fourth of July weekend.
Starting point is 00:02:31 Oh, okay. Yeah, it's like July 3rd, I think is the first day with the workshop. I think that's right. And then 4, 5, 6, something like that. It's either the 4th through the 7th or the 3rd through the 6th. We can put those dates in the show notes or put a link in. Yeah. But that's the next conference
Starting point is 00:02:52 that I know that I'm going to. So that's exciting. Very exciting. Well, at the top of every episode, I'd like to read a piece of feedback. We got this direct message to the CppCast account a couple weeks ago. This is from Jacob
Starting point is 00:03:06 saying, hey guys, I really enjoyed the conversation about Zig with Andrew. The constexpr stuff seems really great. In particular, constexpr sounds like it saves a lot of boilerplate compared to C++. I would also like to suggest a guest, Mark Hoemmen. He's had a long career in HPC and scientific computing. I think you could have an interesting conversation about those fields. In particular, two topics which I don't think have been discussed on the podcast are Kokkos and the distributed linear algebra library Tpetra, of which Mark was one of the main authors. And yeah, so we're actually going to be talking to Mark in just a moment. Thank you for the suggestion, Jacob.
Starting point is 00:03:41 We'd love to hear your thoughts about the show. You can always reach out to us on Facebook, Twitter, or email us at feedback at cppcast.com. And don't forget to leave us a review on iTunes or subscribe on YouTube. Joining us today is Mark Hoemmen. Mark is a scientific software developer
Starting point is 00:03:56 with a background in parallel computing and numerical linear algebra. He has a bachelor's degree in mathematics and computer science from the University of Illinois, Urbana-Champaign, and a PhD in computer science from the University of California, Berkeley. After finishing his studies, Mark worked at Sandia National Lab for 10 years.
Starting point is 00:04:12 There, he contributed to two open-source C++ software projects, Trilinos and Kokkos. For the last few years at Sandia, he focused more on applying Trilinos and Kokkos to non-open-source engineering simulation applications. In early 2020, he moved to Stellar Science, a scientific software company based in Albuquerque, and contributed to several non-open source physics and discrete event simulations. This April, Mark joined the DevTech Compute Team at NVIDIA,
Starting point is 00:04:36 working remotely from his home in Albuquerque, New Mexico. Mark's preferred programming language is C++. He has been writing it professionally for 22 years and has been contributing to the C++ standardization process for the last five. He is co-author on three standard library proposals in flight: P1673, a C++ BLAS interface; P0009, mdspan; and P1684, mdarray. After C++, he feels most comfortable working in Python. He can get by in Fortran, MATLAB, and ANSI Common Lisp. In his spare time, he likes to play mandolin, lute, and baroque guitar.
Starting point is 00:05:08 Mark, welcome to the show. Hi, thank you so much. I didn't realize you were going to be reading the entire thing out loud, or else I would have shortened it more. Sorry. No, no problem. Oh, we just had a guest on who is also from Urbana, right? They were an adjunct professor or something like that.
Starting point is 00:05:26 Sounds familiar. I know we've had a lot of national lab developers. No, no. I mean, Illinois from the university. Yeah. Albuquerque, though. The most difficult to spell capital city in America, I think. No, it's not a capital. The most difficult to spell large city in America, I think. Well, what's funnier is that the Albuquerque in Spain has an extra R. Oh, does it really? Yeah. We dropped it at some point.
Starting point is 00:05:50 Where do you stick an extra R in there? B-U-R-Q-U-E. That's fun. Okay. It supposedly stands for the cork oak, the oak tree from which you harvest cork. Oh. Which in Spain is what they do there in that part of Spain. I did not have any idea about that whatsoever. I just can't hear the name without starting to
Starting point is 00:06:12 sing in the back of my head Weird Al's Albuquerque song. Yes. All right. Well, Mark, we've got a couple news articles to discuss. Feel free to comment on any of these, and we'll start talking more about your involvement with linear algebra libraries, okay? Sure. Okay, so this first one is, what's new for C++ debugging in VS Code?
Starting point is 00:06:36 You know, this seems like it's a pretty good update. They now support debugging on the Apple M1 chip, and they support data breakpoints with GDB, which is one of my favorite features in a normal Visual Studio, being able to set data breakpoints. It's really handy, so it's really nice that they're bringing that to GDB and VS Code. They also have a quick run or debug play button. So if you just make a single C++ file, which apparently is pretty common with VS Code, you can just hit play without having to set up all the other project settings and everything
Starting point is 00:07:12 to get it compiling. Nice. Yeah, really nice. You know, just add edit and continue features, and it will be a full Visual Studio replacement product at this point. Pretty much, pretty much. I need to spend a little bit more time with VS Code.
Starting point is 00:07:27 It's worked well, actually. I've tried it with C++ projects a few times recently. And for certain applications, I actually prefer it over Visual Studio at this point. Yeah, I still haven't really used it for real projects, but it's my go-to editor if I just want to view something or work on a script or something like that, but I don't use it for larger projects.
Starting point is 00:07:53 Live preview of Markdown. I have been using that feature quite extensively with my Provelza work. Yeah, it's a great Markdown editor, so that's mainly what I use it for, but it's nice to pull up a little Goat example or something. For the last two years in my last job, we did Visual Studio development. But code was useful for Markdown.
Starting point is 00:08:15 The next thing we have is a post from JeanHeyd Meneide's blog. And this is Conformance should mean something - fputc, and freestanding. I forgot how long this article was when I first grabbed the link and I did not give myself enough time to read it. So I'm hoping Jason and Mark, you can talk me through this one. I did. I've been very fortunate, blessed perhaps not to have to deal with very strange implementations of C. And so the concern here is a little foreign to me, but when I read it,
it's sort of like one of those slow creeping horror novels where you discover something unnatural lurks beneath the surface. And it has to do with freestanding implementations, so no operating system? No, it's worse than that. It's somehow even worse than that. It's like the meaning of the word character is not 100% clear in the standard, and implementations apparently exploit that. Oh, that's handy. Yeah. I don't know, you know, you reach a point where you've read too many of JeanHeyd's articles and you wonder if any of this is meaningful at all. Yeah, you start to question the meaning of words. It's a very frightening thing. You shouldn't do this for too long,
Starting point is 00:09:27 or you have to buff your sanity before you do this. I've always been just a little bothered by the fact that all of the C standard library character functions take int. That's always bothered me. There was an 8-bit type. There was char. Char has always been in the C standard, right? Because one of the main points was that
Starting point is 00:09:47 the developers of C wanted to be able to do byte addressing on hardware that supported it and previous languages didn't support that. So why the heck did they make those functions take an int instead of a char? Troubles me. And then this last
Starting point is 00:10:04 thing we have is a YouTube video. And this is Stephan T. Lavavej, aka STL, from the MSVC team talking about MSVC's C++20 and 23 compiler and library updates. And I believe this video is part of Pure Virtual C++. It was. The virtual conference that's going on right now. So it's good to see plenty of updates to go over. So always great to see all the compilers chugging on ahead with the new releases. I think definitely worth commenting is that some of the C++20 conformance fixes
Starting point is 00:10:41 that are going into Visual Studio 2022 are being backported into Visual Studio 2019 in an upcoming release. Yeah, some of them are just important enough worth backporting into 2019. I'd count them down. Yeah, having worked at a defense contractor for the last two years,
Starting point is 00:10:58 their world is a little bit more Windows-y and a little bit more concerned about ABI compatibility than the HPC world. And so that was kind of an eye-opener for me to realize before that was never a concern that, oh, whatever, you break ABI, whatever, you recompile everything anyway, whatever. But I start to see the importance of that in the last two years. Fascinating. So that might make you only the second guest that we've had on in the last two years, actually, who says that ABI compatibility is important. So that's an aside at the moment, but whatever, we have an interview to do here. From an ABI perspective, do you do things to make sure that the libraries you produce are still ABI compatible with previous versions?
Starting point is 00:11:46 Yes. In my last project, one of the things that we designed was a profiling system. So something that developers could use to measure the performance of chunks of C++ code, something that would work in a thread parallel world and not have too much performance overhead. And the system works by loading from a dynamic shared library, loading from a DLL. And so there, the ABI matters. And so a lot of thought was put into making sure that the ABI was really a C ABI. So we would stick to, you know, that C89 subset of C where the ABI generally is pretty stable. Wow. And having a versioning system. But it always shocks me how
Starting point is 00:12:28 people casually expect C++ ABI to be stable. No, it's not. So what is it? Were we talking like an interface that's nothing but void pointers and ints or something like that? Do you want to be that far back?
Starting point is 00:12:44 It was like that. It was, well, we did rely on the C stdint type aliases. So int32_t and uint64_t and void star and char star for our character arrays. I mean, ABI can go to the extremes. When I worked in distributed memory parallel code with the message passing interface with MPI, there are MPI backends that will do ABI translation of data. Like they'll flip the endianness of doubles and you pay for that, but there are implementations that could do that. And so this wasn't going to that extreme. It was assuming that your endianness won't differ and the bit layout of things isn't going to change. Did you ever have to approach situations where you said, well, for technical reasons, we really do need to break ABI here.
Starting point is 00:13:35 Now what do we do? Did that ever come up on that project? That did come up on that project, and the way that we handled it was versioning. So there was always a stable interface that would let you query the version that the library provides. So there was a kind of a dialogue. The application says, well, I expect this version or this range of versions, and the library says, well, I provide this range of versions. And if they don't meet, then the application knows to bail or not load the library. Profiling is best effort.
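To make that version handshake concrete, here is a minimal C++ sketch. Every name in it is a hypothetical illustration, not the actual project's interface; in practice the query function would be resolved from the profiling DLL at load time.

```cpp
#include <cstdint>

// Hypothetical signature of the version-query function exported by the profiling
// DLL, looked up at runtime (dlsym / GetProcAddress); it stays null if no DLL is present.
using get_version_range_fn = void (*)(std::uint32_t* lowest, std::uint32_t* highest);
get_version_range_fn profiler_get_version_range = nullptr;

// The range of interface versions this application was built to understand.
constexpr std::uint32_t kMinSupported = 2;
constexpr std::uint32_t kMaxSupported = 3;

// True only if the library advertises a version range overlapping ours; otherwise
// the application simply runs without profiling, since profiling is best effort.
bool profiler_is_compatible() {
    if (!profiler_get_version_range) return false;
    std::uint32_t lo = 0, hi = 0;
    profiler_get_version_range(&lo, &hi);
    return lo <= kMaxSupported && hi >= kMinSupported;
}
```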
Starting point is 00:14:06 And so users might still want to run even if they don't have profiling. What they don't want is the application crashes in the middle because it called a hook that doesn't exist or has a different signature. That would be uncool. Did you ever have to resort to anything like function, functionEx, functionEx version two?
Starting point is 00:14:30 Thankfully, no. Because it's profiling the kinds of things you want to profile. But that actually was very interesting, because this was a discrete event simulator. It was like a serious video game, where simulated events happen and simulated entities interact with each other, a war gaming thing. And so the goal was to run all of this in a thread parallel context. And so it was a little bit different than, say, the Kokkos world or the high performance computing world where you know where your thread parallel loops are, and you profile around them. This last project, I was profiling inside of thread parallel code. And so I had to design the profiling interface and the profiling system differently to accommodate the fact that it's always running in a thread parallel world.
Starting point is 00:15:17 And that even meant different interfaces, interfaces that were meant to work in a thread parallel world, but also hopefully that didn't have stumbling blocks for users. So I'm curious, trying to do runtime profiling with an ABI hook in a thread parallel world, like how much performance impact would you cause by trying to measure the performance of things? That was an important question for us. It turns out if you don't load the hooks, the performance impact is: I test a null function pointer and then I do nothing.
Starting point is 00:15:52 If I assume that the interface is inlined, and if I don't consider the little scope guard creation kind of thing, then on scope guard creation, I test if the function pointer is null, and on scope guard destruction, I test if the function pointer is null. Okay. And so the kinds of things you would want to measure performance of need to be significantly bigger than testing a function pointer. But yeah, you have to think of the locality issues, right? If you call in, you're kind of reaching outside
Starting point is 00:16:20 your I-cache, maybe, if you do that. So we weren't really thinking of it that fine grained. It should be coarse enough grained that you don't worry about that, right? So it's not like a tight loop where you're creating one of these scope guards inside of it. Okay. Sometimes when I look at problems like that, I wonder, like, well, how much branch prediction cache pressure am I putting on, because now I've got this extra check here and this check here. And, like, you know what, because then sometimes, and we've had previous guests on that have these stories that are like, well, on Tuesday, the code always ran faster. And it comes down to something completely random, because on Tuesday, the timestamp caused, you know, a different pressure on, you know, some branch prediction unit or whatever.
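As a rough sketch of the scope guard Mark describes: the profiling hooks can be plain function pointers loaded from the DLL, and when no profiler is loaded each region costs only two null tests. All of the names below are hypothetical illustrations, not the actual project's API.

```cpp
// Hook signatures kept to a C-compatible ABI; both pointers stay null if the
// profiling DLL was not loaded.
using region_begin_fn = void (*)(const char* name);
using region_end_fn   = void (*)(const char* name);
inline region_begin_fn g_region_begin = nullptr;
inline region_end_fn   g_region_end   = nullptr;

class ProfilingRegion {
public:
    explicit ProfilingRegion(const char* name) : name_(name) {
        if (g_region_begin) g_region_begin(name_);   // one null test on creation
    }
    ~ProfilingRegion() {
        if (g_region_end) g_region_end(name_);       // one null test on destruction
    }
    ProfilingRegion(const ProfilingRegion&) = delete;
    ProfilingRegion& operator=(const ProfilingRegion&) = delete;
private:
    const char* name_;
};

void simulate_step() {
    ProfilingRegion region("simulate_step");   // coarse-grained region, not a tight inner loop
    // ... work worth measuring ...
}
```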
Starting point is 00:17:06 That's a thing to think about. And we were trying to design for the future, but this was an old code base. And having something that could measure performance was better than having nothing, which is what they had at the time. But especially for measuring profiling, I had to fight really hard about overheads, because I was assuming that I didn't want threads to synchronize. I wanted minimal synchronization when I
Starting point is 00:17:30 measured performance of things. And sometimes the users or the project leaders didn't really understand. So it all had to be thread safe, but I'm guessing mostly lock free, because otherwise, yeah, yes. Yeah, there are lots of questions there. Right. Interesting. Well, Mark, you heard me introduce your listener comment suggesting we have you on, but before having you on the show, you mentioned how you're not actively working
Starting point is 00:17:56 on these libraries anymore, but could you maybe start off by giving us a little bit of background on what Kokkos and Tpetra are and your involvement with them? Yeah, I'll start with Tpetra because that came first in time, sort of. There's a bit of a history there. Sandia National Laboratories, as part of their work, needs to simulate things, physics things, engineering
Starting point is 00:18:17 things. For example, if you drop something on your toe, you want to know if it goes boom or not. That's sort of the coarse way of describing it. And a lot of what Sandia is concerned about is safety. And their catchphrase is always never. And that means it always goes boom when you want it to, and it never goes boom when you don't. That sounds important for the kind of research that's done in New Mexico. Yes, it's very important. And I live here too, so it matters to me. And a lot of what I was concerned with was the never part of that. So in engineering situations
Starting point is 00:18:52 and situations where people are doing work, where people are building things or taking things apart, they want to know if it's safe. Or if an accident happens, will something incredibly bad happen or will something not so bad happen? And so a lot of my work was supporting simulations to do that.
Starting point is 00:19:12 And these were physics simulations. So things were crashing into other things or things were on fire and crashing into other things. That was the thing. Things sometimes catch on fire because they crash into other things. And so that involves finite element discretizations that turn that into maybe a linear system or a least squares problem, a math problem. That's something that I can help you solve. My main concern at Sandia, I did work some on the discretizations, but my main concern was once you've turned the physics into a math problem,
Starting point is 00:20:00 a matrix kind of problem, a linear algebra problem, then I worked on libraries to help you solve that fast and correctly. And so Trilinos is a massive library of things to help you solve math problems efficiently and correctly. And Tpetra was kind of the center of that, the sparse matrices and vectors, parallel data structures to help you do that in an efficient way. I was concerned with interfaces. I was concerned with implementations. When I started working on Tpetra, that was about the end of 2010. At the time, it was the templated version of Epetra, essential Petra. E stands for essential. Historically, in the 2000s, the idea of using C++ was new for our kind of work.
Starting point is 00:20:49 And so vendors of supercomputer systems didn't necessarily try to support the latest C++ versions. What were you using before C or MATLAB or something? Fortran. Fortran? Fortran, yeah, or C. There's a little bit of history of parallel computing there. So everything was kind of Fortran for a long time with a little occasional experimentation with other languages.
Starting point is 00:21:13 That was the sort of Cray-1 vector supercomputing world. Fortran with Pragmas. And as parallel computers started to diverge in architectures in the 80s, there were distributed memory parallel computers. Sandia was an early adopter, one of the early adopters of those technologies. And those tended to use C as their native programming language. And so people started to implement C versions of these libraries. And then it was really like 2000 and after when C++ started to pick up in the big national lab world. And so compiler support circa 2004 was still a little bit weak. And so people
Starting point is 00:21:55 had to say, oh, yeah, we can't use templates, or we've got to not use this feature or that feature. And so eventually, the compilers caught up and people wanted to use generic libraries, and they wanted to use more of C++, and that's where Tpetra came in. It was a postdoc project in 2009, so a postdoc worked on it for a year, maybe two years, and then he left for a different lab. And then what they wanted to use Tpetra for was, oh, they wanted to solve very large problems, problems where you couldn't express the number, the dimensions of the matrix in a 32-bit integer. But you can't express the dimensions in a 32-bit integer. The number of rows or the number of columns.
Starting point is 00:22:37 That is a very large problem. Yeah, but people want to solve problems with more than 4 billion degrees of freedom. That was a thing. And people have solved linear systems with hundreds of billions of degrees of freedom. That's a thing that people do. It's not a regular thing that people do. Sometimes it's a little bit more of a stunt, but people solve real problems that are large. It's hard to wrap your head around running in parallel on a million processors and having problems that have billions of degrees
Starting point is 00:23:04 of freedom, but that's what people do. You made a comment a moment ago that caught my attention. You said in the 90s when parallel computing architecture started to diverge. And I got my computer science degree in the 90s. And I do remember these conversations about, you know, shared memory versus distributed memory versus, you know, whatever, and how the interconnects work and whatever. I've got impression that all of that conversation has mostly gone away. And today, all supercomputers are effectively the same thing of a bunch of nodes of Linux boxes that are connected together with Ethernet. Am I missing something? Or is that like, it's kind of gone all the way around in the circle as it tends to do. Okay.
Starting point is 00:23:48 It was really the 80s when distributed memory parallelism versus shared memory vector parallelism started to become a thing. But by the mid-2000s, the world that you describe Linux boxes or Unix boxes connected over a network, that was kind of the world where people are programming in MPI, the message passing interface, in a distributed memory way across many nodes. And part of the reason they use MPI was because their problems are large and they don't fit in memory on a single node. But then with GPUs, with accelerators, the world starts to get more interesting again.
Starting point is 00:24:25 And so now, yeah, we still have nodes connected together over a network to solve the biggest problems, but the nodes get more interesting, richer. So that's the world that we find ourselves in today. Interesting. I also remember conversations about like, well, you know, if we'd made really simple RISC machines and fit a thousand cores on a CPU, that seems to have largely gone away in favor of GPUs, which is less general. That is what a GPU is. It is, yes, but it's not what we had in our minds when we thought a thousand cores on one chip. But it is. That's what it is.
Starting point is 00:24:59 And a lot of other architectures that aren't called GPU have that design. Many small, you know, individually low power instruction units that work in parallel to solve your problem. And so that's modern computer architecture. This is perhaps a good segue to talk about Kokkos. Go for it. Let's go for it. Thank you. So as mentioned, in the 2000s, architecture kind of stabilized. The way that you program nodes of Linux boxes is that you use MPI, which is a distributed memory explicit message passing programming model. And Trilinos, Tpetra, those kinds of codes grew up in this world where we could rely on MPI being the way that you program. And so you don't think so much about how you program on a single node.
Starting point is 00:25:43 You're writing straight line, single thread code, and you're communicating with other processes on other nodes. And so because the programming model was stable, people built up a huge software infrastructure to solve all kinds of problems. They built huge applications, millions and millions, tens of millions of lines of code, lines of C++. So like, Trilinos is single-digit millions of lines of code to solve linear systems. And so that stability of programming model made it easy to write code and build up large systems. But as node architecture started to diverge again,
Starting point is 00:26:18 it became harder to do that, because do I write in CUDA? Do I write in OpenMP? Do I write in OpenACC? Do I write in OpenCL? Do I write in SYCL? Do I write in FUBAR, XYZ, whatever? I wanted to experiment with this just myself like six months ago, and I couldn't decide what to use, and so I just gave up. And that's what people do. People need to get work done. And the problem is, as the architecture
Starting point is 00:26:43 started to come out, that's the time where you need to experiment with new algorithms, new ways of writing code. 10 years later, it's too late. You don't have time to experiment. You don't have time to influence the architecture, the vendors. It's already given to you, and it may be terrible. It may be not helpful for what you need to do. And so especially people who do supercomputing, they need the ability to be early adopters so that they can influence or at least beg and plead. And so that's really what Kokkos is. Kokkos is a common interface to a lot of these shared memory programming models, and it works on NVIDIA GPUs, it works on AMD GPUs, it works on Intel whatever's or any kind of architecture you want
Starting point is 00:27:29 that has threads in it. So are you still involved in the Kokkos project? I'm a little bit, tangentially, because a lot of my work with C++ standard proposals, the collaborators like Christian Trott are people who are in charge of Kokkos, who work on Kokkos.
Starting point is 00:27:45 Okay. So I could download Kokkos. What is it? I mean, like, from a programmer perspective, is it just a library that I use that abstracts these programming models away for me? Yes, that's right. It presents a common interface so that you can write parallel loops, parallel for loops, reductions, scans, parallel patterns, and also multidimensional array data structures. So then I need OpenCL or whatever installed also for it to talk to? That's right. So it's a configure time choice. So at configure time in Kokkos, you decide I'm going to use OpenMP. And then Kokkos can use multiple
Starting point is 00:28:26 programming models that you specify when you're writing code. I want to target this to the OpenMP backend, or I want to target it to the CUDA backend. I see. It doesn't have a stable ABI. No, it does not. It does not aim for that. Although that has become, occasionally the question will come up, oh, I'm writing Python and I want to use Kokkos. Right. And Kokkos is C++. It relies on passing in lambdas and functors and compile time things. It's very compile timey.
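For a sense of what that looks like in code, here is a small Kokkos sketch: the same loop body compiles for whichever backend Kokkos was configured with, and a specific backend can be requested through an execution-space template argument on the policy. The labels and sizes here are made up for illustration.

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 1'000'000;

    // Views are Kokkos' multidimensional arrays, allocated in the default
    // execution space's memory (host for OpenMP, device for CUDA).
    Kokkos::View<double*> x("x", n), y("y", n);
    Kokkos::deep_copy(x, 1.0);

    // Parallel loop; KOKKOS_LAMBDA becomes a device-capable lambda when needed.
    Kokkos::parallel_for("axpy", n, KOKKOS_LAMBDA(const int i) {
      y(i) = 2.0 * x(i) + y(i);
    });

    // Parallel reduction over the same range.
    double sum = 0.0;
    Kokkos::parallel_reduce("sum", n, KOKKOS_LAMBDA(const int i, double& partial) {
      partial += y(i);
    }, sum);

    // Pinning a loop to one backend is a template argument on the policy, e.g.
    // Kokkos::RangePolicy<Kokkos::OpenMP>(0, n) instead of the bare extent n.
  }
  Kokkos::finalize();
}
```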
Starting point is 00:28:57 And it has to be compile timey, to be optimized. Because OpenMP is not just a set of pragmas. It's like shifting gears in your compiler and you engage a whole other compiler mode. Sometimes when you have OpenMP on, the compiler is doing different things to loops. Different chunks of optimizer kick in. The compiler needs to see the loop.
Starting point is 00:29:19 It needs transparency through the inlining. That's historically been, you kind of hope that the compiler will inline things, because that's how Kokkos works. And so ABI is not a thing with Kokkos, but if you're a Python programmer or a Fortran programmer, you could use Kokkos or you want to be compatible with Kokkos. And there the story is a little bit harder because it's a C++ thing. I mean, in using something like OpenMP, usually vendors, when they provide OpenMP runtimes, the OpenMP runtimes are compatible across their different languages. And so if you have C code and C++ code and Fortran code and they use OpenMP, they generally
Starting point is 00:29:58 use the same OpenMP runtime. And so you can compatibly call across languages, and OpenMP won't spawn extra threads that fight each other for the system's resources. It'll use a single thread pool. That's handy, yeah. So Kokkos will just take advantage of that, but Kokkos doesn't have its own thread pool. If I were to write code and say,
Starting point is 00:30:19 okay, I'm using Kokkos, right now I'm targeting OpenMP, but then you're like, no, I need to pivot to CUDA. How difficult is that? You change the template parameter, the template argument. Oh, okay. You're saying, if I'm hearing this right, I could actually write code that targets CUDA, OpenMP, whatever, all in one project?
Starting point is 00:30:42 The explicit goal of Kokkos was a single code base that can target multiple backends. Okay. And the reason for that is the Department of Energy lacks the resources to optimize code for N different backends. So there's a huge code base that needs thread parallelization. Maybe it can't, it doesn't have to go at a hundred percent flat out on every architecture,
Starting point is 00:31:02 but it needs to not be slow. That's really the goal. And if you need to optimize one particular thing really hard, yeah, you write CUDA or whatever. You don't have to do that for every single thing. And the thing is, as parallelism increases on Node, the serial bottlenecks start to matter more and more and more. And so you have to start writing everything and thinking thread parallel all the time. Right. Hey, everyone, quick interruption to let you know that CppCast is on Patreon. We want to thank all of our patrons for their ongoing support of the show. And thanks to our patrons, we recently started using a professional editor for CppCast episodes.
Starting point is 00:31:38 If you'd like to support the show too, please check us out at patreon.com slash cppcast. So you're now involved in the standards committee over the past five years. What sorts of projects are you working on? What sorts of papers are you working on? How do they relate to your history with these linear algebra libraries? A lot of my history in linear algebra libraries was making them generic to support, for example, different element types, so different kinds of values in the matrix or vector. That actually mattered for us. We had users who had scalar types, C++ types that could support automatic differentiation automatically. So they would just automatically compute derivatives. Or there were types where instead of computing on
Starting point is 00:32:25 one thing at a time, you were actually solving n different problems at a time, like an ensemble of problems. And so you would propagate this ensemble of values as a single value, and they would have all the overloaded operators and behave like double or float. Or there were types where you could embed like solving a tiny partial differential equation inside, where people were solving stochastic partial differential equations to do uncertainty quantification. They were trying to figure out the probability of something terrible happening by running a lot of different problems that are slightly different from each other and looking for worst cases. And so people were pushing all kinds of crazy element types through my library.
Starting point is 00:33:08 And so I learned a lot about, well, how do I write a generic library? How do I support generic element types? These are things that people want to do. People want to put quaternions in matrix matrix multiply. People want to put complex numbers in there. People want to mix precisions. People want to have matrices of float, but they want to accumulate in doubles so they get better accuracy. Those are all things that people want to do. And so one of the proposals I've been spending a lot of time on recently is P1673, which is a C++ wrapper for the BLAS, for the basic linear algebra subroutines.
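To make the mixed-precision idea concrete, here is a small hand-written illustration (not P1673's interface): the storage element type and the accumulator type are independent template parameters, so you can store float but accumulate in double, or plug in a custom autodiff or ensemble type.

```cpp
#include <cstddef>
#include <vector>

// Dot product with independent storage and accumulator types.
template <class Accum, class T>
Accum dot(const std::vector<T>& x, const std::vector<T>& y) {
    Accum sum{};   // accumulate in the wider (or fancier) type
    for (std::size_t i = 0; i < x.size(); ++i) {
        sum += static_cast<Accum>(x[i]) * static_cast<Accum>(y[i]);
    }
    return sum;
}

int main() {
    std::vector<float> x(1000, 0.1f), y(1000, 0.2f);
    double d = dot<double>(x, y);   // float storage, double accumulation
    (void)d;
}
```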
Starting point is 00:33:43 And I took the lessons learned from generic linear algebra and I applied them. I wrote a companion paper to that, P1674, and it talks about my experience of, given the basic linear algebra subroutines, which is a C or Fortran library, if I'm writing C++, what do I need to do to that Fortran library so that I can use it from C++ in an idiomatic C++ way? How do I make the lowest level possible but still idiomatic C++ library out of something that was written for Fortran 77 code or Fortran 4 code? If you look at some of the older routines. Fortran 4? That's really old, right? Oh, people were still writing Fortran 4 in the 80s. But yeah, if you're not familiar with the
Starting point is 00:34:32 history of the basic linear algebra subroutines, people might be familiar with Jack Dongarra, the recent Turing Award winner. And the BLAS was his effort at separating out the parts of linear algebra that a vendor could optimize from the parts of linear algebra that a mathematician should write. And so the point of the BLAS was, I work at a company that makes hardware, and I don't really know how to do a matrix factorization. But I know that I can do plus and times, and I know my high school algebra. And if somebody explains to me, oh, this is how you do matrix-matrix multiply, I can implement it. And I can make it go fast because I know the architecture. And if I'm the mathematician, what I can do is I can rewrite my
Starting point is 00:35:15 algorithms to use these BLAS routines that the vendor knows how to optimize. And if I do that, I can make my math go fast, as long as I spend most of my time in the vendor-provided routines. And part of that was an insight about the importance of memory locality, and that importance has only increased over time, where arithmetic is practically free on a computer. Memory bandwidth costs you money, hardware essentially, but memory latency is physics. You just can't get around it. And so the way that you make your code go fast, yeah, physics.
Starting point is 00:35:49 I credit that saying to my co-advisor, Kathy Yelick, who was a Berkeley professor who was in charge of the Supercomputing Center in Berkeley. She said that arithmetic was free, bandwidth is money, latency is physics. Admiral Grace Hopper, right? Right, yes, she was known for carrying around, this is a nanosecond, or whatever it was. Yes. Do you know what we're talking about, Rob? It was like a piece of metal or something that she would bring to lectures. Yeah. This is how
Starting point is 00:36:19 far light travels in a nanosecond was it a nanosecond? Something like that, yeah. So that you could actually visualize, like, yeah, physics matter. Yeah. And for data centers, if I'm programming multiple nodes, a lot of the latency comes from hardware and software needing to interpret my message.
Starting point is 00:36:38 But yeah, the length of a cable matters. The reason the Cray vector supercomputers had their funny layout with the sofa around the central core, the central core was laid out because of latency concerns for memory. It's designed that way on purpose. Any high-frequency trader will tell you how important it is to be co-located at the exchange these days. Anyhow. Yeah, so the way that people have been writing linear algebra for the past 30 or more years
Starting point is 00:37:07 has been focused around memory locality and in designing the algorithm so that it can spend most of its time in BLAS algorithms that have the most memory reuse potential. And so that's really what this proposal is about. It's just the C++ version of that. And being idiomatic C++, that implies more genericity, you know, supporting more types, also supporting parallelism through execution policy overloads. And the way in C++ that we express a matrix or vector,
Starting point is 00:37:43 we didn't want to write like a matrix class hierarchy. That's not something we wanted to do. We wanted to be as low level as possible. And so the analog of iterator for a matrix is MD-SPAN. So that's the proposal P0009. And that's how you view a matrix. And MD-SPAN also has genericity on the layout of data and how you access the data. And so you can use MD-SPAN, for example, to access, we've written accessors for, partitioned global address space. So memory that lives on another node that you have to call a special function to get to, you can use MD-SPAN to express that. For NVSHMEM, for our NVIDIA hardware, NVSHMEM is how you can, in a shared memory kind of
Starting point is 00:38:23 way, you can get to memory on foreign remote GPUs. So basically every vendor already has some implementation of the C version of this, right? Yes. So are you expecting that the C++ proposal, if it's accepted, would literally be a wrapper around the existing implementations or a fresh implementation in C++? Like, I'm just curious. The reference implementation that we have, what we expect is if you can call the C BLAS, you will. Otherwise, you'll write generic C++ code. And so there may be a bit of a performance, I don't want to call it a performance cliff. If your element types or layouts are not supported by the C BLAS, then we have to write C++. But that in itself is an opportunity for vendors to optimize. And some vendors do optimize things like mixed precision or interesting element
Starting point is 00:39:16 types. It's kind of an opportunity for vendors to say, oh, yeah, well, we have this cool new feature that says this short floating point type or this extended precision decimal floating point type or this crazy element type, we optimize that, or we optimize this new interesting recursive matrix layout that improves cache locality. And so the C++ interface is like, the users just call that, it's generic, but then the vendors can do optimizations underneath however they like. Kind of seems like the exact kind of scenario that has always kept the Intel C++ compiler around, as people say, well, I know that I need this optimized whatever feature that Intel's compiler supports. It's a little bit like that, yeah. I can just imagine Intel, for example, saying, yes, our BLAS C++ interface
Starting point is 00:40:11 is going to be the best or whatever because they want to highlight their hardware. You can use the Intel BLAS without the Intel compiler. And so there may be interesting questions for vendors about whether they want their implementation to support switching BLAS backends. I mean, that could be a thing that implementation could do. Even at runtime, they could load a different DLL and swap in different functions.
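As a rough usage sketch, here is what calling the proposed interface looks like with the reference implementations (kokkos/mdspan for P0009 and kokkos/stdBLAS for P1673). The headers, namespaces, and class-template-argument-deduction style follow those repositories at the time of the episode and may change as the proposals evolve.

```cpp
#include <experimental/mdspan>
#include <experimental/linalg>
#include <cstddef>
#include <vector>

namespace stdex = std::experimental;

int main() {
  const std::size_t m = 3, n = 4;
  std::vector<double> A_storage(m * n, 1.0);
  std::vector<double> x_storage(n, 2.0);
  std::vector<double> y_storage(m, 0.0);

  // mdspan is the non-owning "view of a matrix": a pointer plus extents plus a
  // layout, with no allocation of its own.
  stdex::mdspan A(A_storage.data(), m, n);   // 2-D view, layout_right by default
  stdex::mdspan x(x_storage.data(), n);      // 1-D views
  stdex::mdspan y(y_storage.data(), m);

  // y = A * x; the element types and layouts stay generic, and an
  // execution-policy overload exists for parallel runs.
  stdex::linalg::matrix_vector_product(A, x, y);
}
```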
Starting point is 00:40:34 There are a lot of options for implementations. So is this proposal related to the linear algebra proposals that we saw come out of the graphics proposal from Guy Davidson a couple years ago? No. We actually have a joint paper in which we describe the relationship between the two proposals, which is we are not in competition. Their proposal serves a different audience, I would say. And the proposal in question is p1385. And the design is matrix classes and overloaded operators, vector classes, you know, times and plus on matrices and vectors.
Starting point is 00:41:14 So those are things that graphics people want and need to do, and they like doing it that way. We chose a different approach for a few different reasons. First of all, we didn't want to figure out like, what's the element type of adding two matrices? So let's say I have a matrix of double and a matrix of complex float. What's the element type of their sum? It should be complex double. I would think complex double would be my guess. But you know, sometimes I'm like, I'm not a math person. Well, it's worse than that. In C++, if you try to use Operator Plus on a double and a complex float, you get a compiler error.
Starting point is 00:41:49 Oh, really? Yes, it's very exciting. I've spent very little time with standard complex. Yeah, maybe perhaps for the better. It's not bad, but people naively think that, oh yeah, it works just like float. And then they try to write code and it's different. I've only used it for Mandelbrot rendering because that's all complex numbers.
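A tiny illustration of the mixed-precision pitfall with std::complex that Mark mentions: the arithmetic operators are only defined between complex<T> and T, so mixing complex<float> with double needs an explicit conversion.

```cpp
#include <complex>

int main() {
  std::complex<float> cf{1.0f, 2.0f};
  double d = 3.0;

  // auto bad = cf + d;                     // compile error: no operator+ for
                                            // complex<float> and double
  auto ok = std::complex<double>(cf) + d;   // widen explicitly; result is complex<double>
  (void)ok;
}
```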
Starting point is 00:42:10 So when you have a matrix operator plus, you have to decide what the return type, the element type should be. You also have to allocate. And so now you start to get into the business of storage and allocation. And what if I don't just want to do new? What if I want to allocate on a GPU or something? And now you've conflated the issues of allocation and storage and representation and the actual functions. I'm not saying that that's bad, but I think for the graphics use case, a lot of linear
Starting point is 00:42:42 algebra problems for graphics are small. They're four by four, three by three. And there, a lot of people who are doing graphics, maybe they want operator overloading. They don't want to call a function with a long name. So I feel like that perhaps could be a different use case. That's not to say there aren't many users who are happy, like the kind of MATLAB user who has medium-sized matrices, and they're not really concerned about this problem.
Starting point is 00:43:10 They want A plus B to return a matrix, and they're happy with that. That's not really the world in which I live. I was the kind of person who was implementing MATLAB, not the person who was using MATLAB, generally. Well, I've used MATLAB, but yeah. There's a fun story about the history of MATLAB, which is that Cleve Moler was trying to teach students linear algebra, and they were all programming in Fortran. The students spent most of their time trying to get Fortran to compile,
Starting point is 00:43:36 and Cleve didn't like that very much. And so he wrote a wrapper around Fortran, and that's what MATLAB is, at least the beginning of it. Interesting, actually. I was going to say, I was looking at the P1673 paper you have, and you're currently on the seventh revision, which just went up like 10 days ago. How's this moving along? Are you getting good feedback in the committee meetings?
Starting point is 00:43:56 We are. Actually, this morning, we had a review of the paper in LEWG, the Library Evolution Working Group, and it had a lot of great feedback. Probably going to see more LEWG review before it's done, and some people are concerned about the length and the amount of wording review to do, and whether perhaps it would be better to spin it off as a TS rather than an IS. It's not something that I've wanted to do because it's not clear that that would save any work. But there was a lot of discussion about the length of the paper
Starting point is 00:44:30 and the number of functions in it. I don't want to give up the functions, especially the BLAS 2 and 3 ones. Those are really critical. Like, you really need all of those functions to do basic linear algebra. And it's a lot more than plus and times. And you said there is a reference implementation. Did I hear that right?
Starting point is 00:44:48 Yes. So you can find the reference implementation. It's linked off of Kokkos' GitHub. So it's GitHub. I'll type it in the chat. Make sure we get that in. Yeah. stdBLAS. Yeah. And there are reference implementations also of MD-SPAN, which is P0009, and MD-ARRAY, which is P1684. I think we talked to Daisy Hollman a few months back about MD-SPAN. Yeah, so the MD-SPAN repository has the latest reference implementation of MD-ARRAY now.
Starting point is 00:45:22 So for listeners who are interested in trying these reference implementations, do you have a Conan or vcpkg package? Is it fetch-contentable? Is it easy to get and use in your library? It's a little more than that. It's typical CMake and a little bit of handwork, so it's not very beautifully packaged. But the goal is more of that. But in
Starting point is 00:45:46 terms of build system, it's not hideously complicated. Okay. Just curious. So our listeners know what they're getting themselves into. For sure. You didn't say you have to run a bunch of custom shell scripts. So I think you're... No, no, no, no. It's CMake. It's all CMake. Okay. So it looks like MD-SPAN and MD-ARRAY are probably likely to get accepted soon. Is that so? MD-SPAN is in the middle of wording review, and we're hoping C++23. MD-ARRAY is still undergoing design changes in design review, so it almost certainly will not make 23. Now, if I understand MD-SPAN right, I can say I've got a flat array or a vector or whatever and say I want now you to give me a view of this
Starting point is 00:46:37 as if it were a fourth dimensional array, whatever with this layout and access patterns and such. Yes, that's right. Okay. So then, dumb question, what is the need for MD array then? The point of MD array is convenience, so that you don't have to do what you just described, create a vector and then create an MD span to view it. You can just create an MD array and it does that for you. Another point is if you have very small arrays, so arrays of small types that have compile time dimensions, you know, your three by three, three by four case,
Starting point is 00:47:15 MD array doesn't need to store a pointer. So MD span is a view and it stores a pointer and possibly any extents that are not known at compile time. So MD span will always store a pointer. And you can have types that are small enough that the whole thing will fit in 64 bits. Or, you know, maybe it doesn't take much more than that. And so MD array can optimize for those use cases where you have very small objects that you would maybe use an array, std array, to store. And so md array
Starting point is 00:47:46 can use std array to store things. I don't have the math background for this at all, but I just found myself wondering, like, is there a use case, and I'm kind of visualizing like C style arrays, where you could say, I want one that's, I know it's five dimensions, it's vector backed, but I'm only going to tell you the length of the first four dimensions and the fifth dimension can grow dynamically as we add more data to it or something like that. I mean, not that you have dynamically growing arrays in C, but if you declare it a C style array, you don't have to declare the size of the last dimension because that is just another point. That's just an index offset at the end.
Starting point is 00:48:28 So just a random thought, like, is there a use case for something like that? Does that exist? Is that a math thing? I'm just talking. No, users sometimes want to do that. So, for example, users may have tensor data that can be organized in a time series. And as time elapses, more data sets are added to their collection, and they may want to represent that as a single tensor. Sorry,
Starting point is 00:48:53 mathematicians sometimes grump at me if I say the word tensor, but they want to represent that as a single multi-dimensional array and just pack on to the end. Right, right, right. Yeah, okay. So that's a thing that people want. That's not in our design. Right. Well, it just made me wonder, is MD vector on the horizon or something like that instead of MD array at some point? If that would be a use case.
Starting point is 00:49:15 We haven't thought about that. We've excluded that from the design because you have to think very carefully about compile time versus runtime extents. So the design of MD span and MD array revolves around this extents object, and it mixes compile time sizes and runtime sizes in the type. So just like std span has dynamic extent or a compile time extent, extents is the generalization of that.
Starting point is 00:49:41 And because you can mix arbitrary compile time and runtime extents, if you were to design an MD vector around that, suddenly it gets complicated. How do you know which extents you can grow and which you can't? And it gets messy. And so we didn't try to tackle that problem. I think that's a hard interface problem. Yeah, that's fair. Like I said, I have no idea if there's even a use case for it. But you know, like people who do Python data analysis with pandas and whatnot, they want to do things like that. And I think the way to handle that would be the growing would happen in the Python world. And if you wanted to call into C++, you'd make an MD span, statically, call the function that you want to call in C++, and then go back to the Python world and grow it as you like. So do MD-SPAN and SPAN and array and
Starting point is 00:50:26 MD-ARRAY play nicely with each other? Like if I say, well, I've just got a regular std span, is it like convertible to an MD-SPAN with one dimension? Yes, there are those kinds of constructors. And the history is funny. The MD-ARRAY proposal has a long history. The first paper was 2014. Oh, wow. And span was actually split off from MD span. That's the origin. Oh, okay. People wanted a simpler vocabulary type, where the pointer was actually like T star. And with MD span, the pointer type can be more general. And so people wanted span to be a separate vocabulary type. But yeah, the two are interconvertible.
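For readers who have not used it, here is a short sketch of mdspan viewing flat storage as a multidimensional array, again using the kokkos/mdspan reference implementation of P0009; indexing uses the call operator, as in that implementation's pre-C++23 mode.

```cpp
#include <experimental/mdspan>
#include <cstddef>
#include <vector>

namespace stdex = std::experimental;

int main() {
  std::vector<double> buffer(2 * 3 * 4, 0.0);

  // View the flat buffer as a 2 x 3 x 4 array. All extents are runtime values
  // here, but any of them could be fixed at compile time through extents<...>.
  stdex::mdspan grid(buffer.data(), 2, 3, 4);

  for (std::size_t i = 0; i < grid.extent(0); ++i)
    for (std::size_t j = 0; j < grid.extent(1); ++j)
      for (std::size_t k = 0; k < grid.extent(2); ++k)
        grid(i, j, k) = double(i + j + k);

  // A rank-1 mdspan over the same memory plays the std::span role.
  stdex::mdspan flat(buffer.data(), buffer.size());
  double total = 0.0;
  for (std::size_t i = 0; i < flat.extent(0); ++i) total += flat(i);
  (void)total;
}
```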
Starting point is 00:51:09 I mean, array, it's very annoying that the dynamic version of array is called vector. The name is unfortunate. And so MD array can have a mix of compile time and runtime extents. So it's different than std array. But the naming is difficult, but there's nothing else we could call it. It's an array. That's what it is. Sorry. All right. Well, Mark, we're going to put links to all these papers and reference library
Starting point is 00:51:34 implementations in the show notes. Is there anything else you want to plug or let our listeners know about before we let you go today? Oh, well, I just wanted to thank you for giving me the opportunity to speak and thank the listeners for listening. And I welcome feedback on any of these proposals. Especially std BLAS or MD array are more in the design review stages right now. We especially welcome feedback on those. But also on MD span, if you find a bug in the spec, now's the time to look. So I welcome any feedback you may have.
Starting point is 00:52:04 All right. Thanks so much for coming on today. Thank you. Thank you. Thanks so much for listening in as we chat about C++. We'd love to hear what you think of the podcast. Please let us know if we're discussing the stuff you're interested in, or if you have a suggestion for a topic, we'd love to hear about that too.
Starting point is 00:52:19 You can email all your thoughts to feedback at cppcast.com. We'd also appreciate it if you can like CppCast on Facebook and follow CppCast on Twitter. You can also follow me at Rob W. Irving and Jason at Lefticus on Twitter. We'd also like to thank all our patrons who help support the show through Patreon. If you'd like to support us on Patreon, you can do so at patreon.com slash cppcast. And of course, you can find all that info and the show notes on the podcast website at cppcast.com.
