CppCast - Linear Algebra Standardization
Episode Date: April 28, 2022

Mark Hoemmen joins Rob and Jason. They first talk about debugging improvements in VS Code and C++20/23 features going into MSVC. Then they talk to Mark Hoemmen about his past work on the linear algebra libraries Tpetra and Kokkos, and current efforts to get linear algebra into the standard.

News
- What's new for C++ debugging in VS Code
- Conformance should mean something - fputc, and freestanding
- MSVC C++20/23 Update

Links
- Tpetra parallel linear algebra
- P1417R0: Historical lessons for C++ linear algebra library standardization
- P1673R7: A free function linear algebra interface based on the BLAS
- P1674R1: Evolving a Standard C++ Linear Algebra Library from the BLAS

Patreon
- CppCast Patreon
Transcript
Episode 347 of CppCast with guest Mark Hoemmen recorded April 26, 2022.
This episode of CppCast, we're thanking all of our patrons on Patreon.
Thank you so much for your ongoing support of the show.
And if you'd like to support the show too, please check us out at patreon.com slash cppcast. In this episode, we talk about debugging improvements in VS Code.
And we talk to Mark Hoemmen.
Mark talks to us about his work on linear algebra libraries and more. Welcome to episode 347 of CppCast,
the first podcast for C++ developers by C++ developers.
I'm your host, Rob Irving, and joining my co-host, Jason Turner.
Jason, how are you doing today?
I'm okay, Rob. How are you doing?
I'm doing all right. I saw a movie last night.
You saw a movie last night? Which one?
The Northman.
I think I've seen a preview for that.
It's pretty good. Very violent,
but very good. It's like
Hamlet, but with Vikings.
Is it violent in like a comedic
kind of way? No.
Okay. Action movie.
Okay. I say I don't go for
over-the-top violence unless it's over
the top to the point of ridiculousness.
I'll watch the Expendables movies,
which are just comedy, basically.
Like, yes, that much blood can't come out of a human.
You know, like that kind of...
Yeah.
Do you have anything you want to talk about before we get started?
I feel like we may as well go ahead and mention
that C++ on Sea registration is still open.
And I will be doing a workshop there on doing code reviews
because I thought that seemed like a fun one.
I've never done a workshop on that before.
But I like doing code reviews.
So I thought maybe if I can kind of help people get a better feel
for what a good code review looks like, then that would be fun.
And I'm keynoting at that, which I think we've mentioned.
I think so. When is C++ on Sea?
Fourth of July weekend.
Oh, okay.
Yeah, it's like July 3rd, I think is the first day with the workshop. I think that's right.
And then 4, 5, 6, something like that. It's either the 4th through the 7th
or the 3rd through the 6th.
We can put those dates in the show notes
or put a link in.
Yeah.
But that's the next conference
that I know that I'm going to.
So that's exciting.
Very exciting.
Well, at the top of every episode,
I'd like to read a piece of feedback.
We got this direct message
to the SuperPcast account a couple weeks ago.
This is from Jacob
saying, hey guys, I really enjoyed the conversation about Zig with Andrew. The comptime stuff seems
really great. In particular, comptime sounds like it saves a lot of boilerplate compared to C++.
I would also like to suggest a guest, Mark Homan. He's had a long career in HPC and scientific
computing. I think you could have an interesting conversation about those fields.
In particular, two topics which I don't think have been discussed on the podcast are Kokkos
and the distributed linear algebra library Tpetra, where Mark was one of its main authors.
And yeah, so we're actually going to be talking to Mark in just a moment.
Thank you for the suggestion, Jacob.
We'd love to hear your thoughts about the show.
You can always reach out to us on Facebook, Twitter,
or email us at feedback at cppcast.com.
And don't forget to leave us
a review on iTunes
or subscribe on YouTube.
Joining us today is Mark Hoemmen.
Mark is a scientific software developer
with a background in parallel computing
and numerical linear algebra.
He has a bachelor's degree
in mathematics and computer science
from the University of Illinois,
Urbana-Champaign,
and a PhD in computer science from the University of California, Berkeley.
After finishing his studies, Mark worked at Sandia National Lab for 10 years.
There, he contributed to two open-source C++ software projects, Trilinos and Kokkos.
For the last few years at Sandia, he focused more on applying Trilinos and Kokkos to non-open-source engineering,
including simulation applications.
In early 2020, he moved to Stellar Science,
a scientific software company based in Albuquerque,
and contributed to several non-open source physics
and discrete event simulations.
This April, Mark joined the DevTech Compute Team at NVIDIA,
working remotely from his home in Albuquerque, New Mexico.
Mark's preferred programming language is C++.
He has been writing it professionally for 22 years
and has been contributing to the C++ standard process for the last five. He is co-author on three standards
library proposals in flight: P1673 (a C++ BLAS interface), P0009 (MD-SPAN), and P1684 (MD-ARRAY).
After C++, he feels most comfortable working in Python. He can get by in Fortran, MATLAB,
and ANSI Common Lisp.
In his spare time, he likes to play mandolin, lute, and baroque guitar.
Mark, welcome to the show.
Hi, thank you so much.
I didn't realize you were going to be reading the entire thing out loud,
or else I would have shortened it more.
Sorry.
No, no problem.
Oh, we just had a guest on who is also from Urbana, right?
They were an adjunct professor or something like that.
Sounds familiar. I know we've had a lot of national lab developers.
No, no. I mean, Illinois from the university. Yeah. Albuquerque, though. The most difficult
to spell capital city in America, I think. No, it's not a capital. The most difficult
to spell large city in America, I think. Well, what's funnier is that the Albuquerque
in Spain has an extra R.
Oh, does it really?
Yeah.
We dropped it at some point.
Where do you stick an extra R in there?
B-U-R-Q-U-E.
That's fun.
Okay.
It supposedly stands for the cork oak, the oak tree from which you harvest cork.
Oh.
Which in Spain is what they do there in that part of Spain.
I did not have any idea about that whatsoever. I just can't hear the name without starting to
sing in the back of my head Weird Al's Albuquerque song.
Yes.
All right. Well, Mark, we've got a couple news articles to discuss. Feel free to comment on
any of these,
and we'll start talking more about your involvement with linear algebra libraries, okay?
Sure.
Okay, so this first one is,
what's new for C++ debugging in VS Code?
You know, this seems like it's a pretty good update.
They now support debugging on the Apple M1 chip,
and they support data breakpoints with GDB, which is one of my favorite features in a normal Visual Studio, being able to set data breakpoints.
It's really handy, so it's really nice that they're bringing that to GDB and VS Code.
They also have a quick run or debug play button. So if you just make a single C++ file,
which apparently is pretty common with VS Code,
you can just hit play without having to set up
all the other project settings and everything
to get it compiling.
Nice.
Yeah, really nice.
You know, just add edit and continue features,
and it will be a full Visual Studio replacement product
at this point.
Pretty much, pretty much.
I need to spend a little bit more time with VS Code.
It's worked well, actually.
I've tried it with C++ projects a few times recently.
And for certain applications,
I actually prefer it over Visual Studio at this point.
Yeah, I still haven't really used it for real projects,
but it's my go-to editor if I just want to view something
or work on a script or something like that,
but I don't use it for larger projects.
Live preview of Markdown.
I have been using that feature quite extensively
with my Provelza work.
Yeah, it's a great Markdown editor,
so that's mainly what I use it for,
but it's nice to pull up a little Goat example or something.
For the last two years in my last job, we did Visual Studio development.
But VS Code was useful for Markdown.
The next thing we have is a post from JeanHeyd Meneide's blog.
And this is Conformance Should Mean Something -
fputc, and freestanding.
I forgot how long this article
was when I first grabbed the link and I did not give myself enough time to read it. So I'm hoping
Jason and Mark, you can talk me through this one. I did. I've been very fortunate,
blessed perhaps not to have to deal with very strange implementations of C. And so the concern
here is a little foreign to me, but when I read it,
it's sort of like one of those slow creeping horror novels where you discover something
unnatural lurks beneath the surface. And it has to do with freestanding
implementation, so no operating system? No, it's worse than that. It's somehow
even worse than that. It's like the meaning of the word character is not 100% clear in the standard
and implementations apparently exploit that. Oh, that's handy. Yeah. I don't know, you know,
you reach a point where you've read too many of JeanHeyd's articles and you wonder if any of
this is meaningful at all. Yeah, you start to question the meaning of words. It's a very
frightening thing. You shouldn't do this for too long,
or you have to buff your sanity before you do this.
I've always been just a little bothered by the fact
that all of the C standard library character functions take int.
That's always bothered me.
There was an 8-bit type.
char has always been in the C standard, right?
Because that was
one of the main points, that
the developers of C wanted
to be able to do byte addressing
on hardware that supported
it and previous languages didn't support
that. So why the heck did
they make those functions take an
int instead of a char? Troubles me.
And then this last
thing we have is a YouTube video. And
this is Stephan T. Lavavej, aka STL, from the MSVC team talking about MSVC's C++20 and 23 compiler
and library updates. And I believe this video is part of Pure Virtual C++. It was. The virtual
conference that's going on right now. So it's good to see plenty of updates to go over.
So always great to see all the compilers
chugging on ahead with the new releases.
I think definitely worth commenting
is that some of the C++ 20 conformance fixes
that are going into Visual Studio 2022
are being backported into Visual Studio 2019
in an upcoming release.
Yeah, some of them are just important enough
worth backporting into 2019.
I'd count them down.
Yeah, having worked at a defense contractor
for the last two years,
their world is a little bit more Windows-y
and a little bit more concerned about ABI compatibility
than the HPC world. And so
that was kind of an eye-opener for me to realize before that was never a concern that, oh, whatever,
you break ABI, whatever, you recompile everything anyway, whatever. But I start to see the importance
of that in the last two years. Fascinating. So that might make you only the second guest that we've had on in the last two years, actually, who says that ABI compatibility is important.
So that's an aside at the moment, but whatever, we have an interview to do here.
From an ABI perspective, do you do things to make sure that the libraries you produce are still ABI compatible with previous versions?
Yes. In my last project, one of the things that we designed was a profiling system.
So something that developers could use to measure the performance of chunks of C++ code,
something that would work in a thread parallel world and not have too much performance overhead.
And the system works by loading from a
dynamic shared library, loading from a DLL. And so there, the ABI matters. And so there was put a
lot of thought into making sure that the ABI was really a C ABI. So we wouldn't have, you know,
that C89 subset of C where the ABI generally is pretty stable. Wow. And having a versioning system.
But it always shocks me how
people
casually expect C++ ABI
to be stable.
No, it's not.
So what is it? Were we talking like
an interface that's nothing but void pointers
and ints or something like that? Do you want to be
that far back?
It was like that. It was, well, we did rely on the C stdint type aliases. So int32_t and uint64_t,
and void* and char* for our character arrays. I mean, ABI can go to the extremes. When
I worked in distributed memory parallel code with the message passing interface with MPI, there are MPI backends that will do ABI translation of data.
Like they'll flip the endianness of doubles and you pay for that, but there are implementations that could do that.
And so this wasn't going to that extreme. It was assuming that your endianness won't differ
and the bit layout of things isn't going to change.
Did you have to ever approach situations where you said,
well, for technical reasons, we really do need to break ABI here.
Now what do we do?
Did that ever come up on that project?
That did come up on that project,
and the way that we handled it was versioning.
So there was always a stable interface that would let you query the version that the library provides. So there was a kind
of a dialogue. The application says, well, I expect this version or this range of versions,
and the library says, well, I provide this range of versions. And if they don't meet,
then the application knows to bail or not load the library. Profiling is best effort.
And so users might still want to run
even if they don't have profiling.
What they don't want is the application crashes in the middle
because it called a hook that doesn't exist
or has a different signature.
That would be uncool.
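A minimal sketch of the version handshake Mark describes, with every name invented for illustration; the real system resolves its hooks from a DLL over a C ABI and compares version ranges before using anything:

```cpp
// Hedged sketch of the version handshake described above: the application
// asks the (optionally loaded) profiling DLL which ABI version it provides,
// and disables profiling rather than calling mismatched hooks.
// All names are hypothetical; the real system resolves these symbols from a
// DLL over a C ABI (char*, fixed-width integers, plain function pointers).
#include <cstdint>
#include <cstdio>

struct ProfilerApi {
    std::uint32_t abi_version = 0;                  // reported by the library
    void (*region_begin)(const char* name) = nullptr;
    void (*region_end)(const char* name) = nullptr;
};

// Stand-in for dlsym/GetProcAddress-style loading; returns an empty API if
// the DLL is absent, which the application treats as "no profiling".
ProfilerApi load_profiler() { return {}; }

int main() {
    constexpr std::uint32_t min_supported = 2, max_supported = 3;
    ProfilerApi api = load_profiler();

    const bool usable = api.region_begin && api.region_end &&
                        api.abi_version >= min_supported &&
                        api.abi_version <= max_supported;
    if (!usable) {
        std::puts("profiler missing or ABI mismatch; running without profiling");
    }
}
```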
Did you ever have to revert to anything like function,
function EX, function EX version two?
Thankfully, no. Because it's profiling the kinds of things you want to profile. But that actually was very interesting, because this was a discrete event simulator, and it was
like a serious video game, where simulated things happen and simulated events happen and simulated entities interact with each
other, a war gaming thing. And so the goal was to run all of this in a thread parallel context.
And so it was a little bit different than say the Kokkos world or the high performance computing
world where you know where your thread parallel loops are, and you profile around them.
This last project, I was profiling inside of thread parallel code.
And so I had to design the profiling interface and the profiling system differently
to accommodate the fact that it's always running in a thread parallel world.
And that even meant different interfaces, interfaces that were meant to work in a thread parallel world,
but also hopefully that
didn't have stumbling blocks for users. So I'm curious, trying to do runtime profiling with
an ABI hook and a thread parallel world, like how much performance impact would you cause by trying
to measure the performance of things? That was an important question for us.
It turns out if you don't load the hooks,
performance impact is I test a null function pointer
and then I do nothing.
If I assume that the interface is inlined,
if I don't consider a little scope guard creation kind of thing,
then on scope guard creation,
I test if the function pointer is null,
and on scope guard destruction, I test if the function pointer is null. Okay. And so the kinds of things you would want
to measure performance of need to be significantly bigger than measuring a function pointer. But yeah,
you have to think of the locality issues, right? If you call in, you're kind of reaching outside
your I-cache, maybe, if you do that. So we weren't really thinking at that fine a grain. It should be coarse enough grained that you don't worry about that, right? So it's not like
a tight loop where you're creating one of these things, your scope guard, inside of. Okay. Sometimes
when I look at problems like that, I wonder, like, well, how much branch prediction cache pressure am
I putting on? Because now I've got this extra check here and this check
here and like, you know what, because then sometimes, and we've had previous guests on that
have these stories that are like, well, on Tuesday, the code always ran faster. And it comes down to
something completely random, because on Tuesday, the timestamp caused, you know, a different
pressure on, you know, some branch prediction unit or whatever.
That's a thing to think about.
And we were trying to design for the future,
but this was an old code base.
And having something that could measure performance
was better than having nothing, which is what they had at the time.
But especially for measuring profiling,
I had to fight really hard about overheads, because I was assuming that I
didn't want threads to synchronize. I wanted minimal synchronization when I
measured performance of things. And sometimes the users or the project leaders didn't really
understand. So it all had to be thread safe, but I'm guessing mostly lock free, because otherwise,
yeah, yes. Yeah, there are lots of questions there.
Right. Interesting.
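For a concrete picture of the mechanism Mark outlines above, here is a hedged, hypothetical sketch of the scope-guard hook: a null function pointer test on construction and destruction, and nothing else when no profiler is loaded. None of these names come from the actual project.

```cpp
// Hypothetical sketch of a scope-guard profiling hook like the one described.
// The hook pointers would really come from a DLL at startup; here they are
// plain globals so the example stands alone. If no hooks are loaded, each
// guarded region costs only two null-pointer tests.
#include <atomic>

using Hook = void (*)(const char* region_name);

// Set by the profiler DLL if it is present; otherwise they stay null.
std::atomic<Hook> g_region_begin{nullptr};
std::atomic<Hook> g_region_end{nullptr};

class ProfileRegion {
public:
    explicit ProfileRegion(const char* name) : name_(name) {
        if (Hook h = g_region_begin.load(std::memory_order_relaxed)) h(name_);
    }
    ~ProfileRegion() {
        if (Hook h = g_region_end.load(std::memory_order_relaxed)) h(name_);
    }
    ProfileRegion(const ProfileRegion&) = delete;
    ProfileRegion& operator=(const ProfileRegion&) = delete;
private:
    const char* name_;
};

void do_work() {
    ProfileRegion guard("do_work");  // coarse-grained region, not a tight loop
    // ... work worth measuring goes here ...
}

int main() { do_work(); }
```

The design questions Mark raises in the conversation (locality, branch prediction pressure, avoiding synchronization on the measured path) sit on top of something roughly this small.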
Well, Mark, you heard me introduce your listener comment
suggesting we have you on,
but before having you on the show,
you mentioned how you're not actively working
on these libraries anymore,
but could you maybe start off
by giving us a little bit of background
on what Kokkos and Tpetra are
and your involvement with them?
Yeah, I'll start with
Tpetra because that came first in time, sort of. There's a bit of a history there. Sandia National
Laboratories, as part of their work, needs to simulate things, physics things, engineering
things. For example, if you drop something on your toe, you want to know if it goes boom or not.
That's sort of the coarse way of describing it. And a lot of what Sandia is concerned about is safety. And their catchphrase
is always never. And that means it always goes boom when you want it to, and it never goes boom
when you don't. That sounds important for the kind of research that's done in New Mexico.
Yes, it's very important. And I live here too, so it matters to me.
And a lot of what I was concerned with
was the never part of that.
So in engineering situations
and situations where people are doing work,
where people are building things
or taking things apart,
they want to know if it's safe.
Or if an accident happens,
will something incredibly bad happen
or will something not so bad happen?
And so a lot of my work was supporting simulations to do that.
And these were physics simulations.
So things were crashing into other things or things were on fire and crashing into other things.
That was the thing.
Things sometimes catch on fire because they crash into other things. And so that involves finite elements, turning that into maybe a linear system or a least squares problem,
a math problem.
That's something that I can help you solve.
My main concern at Sandia, I did work some on the discretizations,
but my main concern was once you've turned the physics into a math problem,
a matrix kind of problem, a linear algebra problem,
then I worked on libraries to help you solve that
fast and correctly. And so Trilinos is a massive library of things to help you solve math problems
efficiently and correctly. And TPetra was kind of the center of that, the sparse matrices and
vectors, parallel data structures to help you do that in an efficient
way. I was concerned with interfaces. I was concerned with implementations. When I started
working on Tpetra, that was about the end of 2010. At the time, it was the templated version of Epetra,
essential Petra. The E stands for essential. Historically, in the 2000s, the idea of using C++ was new for our kind of work.
And so vendors of supercomputer systems
didn't necessarily try to support the latest C++ versions.
What were you using before C or MATLAB or something?
Fortran.
Fortran?
Fortran, yeah, or C.
There's a little bit of history of parallel computing there.
So everything was kind of Fortran for a long time with a little occasional experimentation with other languages.
That was the sort of Cray-1 vector supercomputing world.
Fortran with Pragmas.
And as parallel computers started to diverge in architectures in the 80s, there were distributed memory parallel computers.
Sandia was an early adopter, one of the early adopters of those technologies.
And those tended to use C as their native programming language.
And so people started to implement C versions of these libraries.
And then it was really like 2000 and after when C++ started to pick up in the big
national lab world. And so compiler support circa 2004 was still a little bit weak. And so people
had to say, oh, yeah, we can't use templates, or we've got to not use this feature or that feature.
And so eventually the compilers caught up, and people wanted to use generic libraries and they wanted to use more of C++.
And that's where Tpetra came in. It was a postdoc project in 2009. So a postdoc worked on it for a year,
a year, maybe two years, and then he left for a different lab. And then what they wanted to use
Tpetra for was, oh, they wanted to solve very large problems, problems where you couldn't express
the dimensions of the matrix in a 32-bit integer.
But you can't express the dimensions in a 32-bit integer.
The number of rows or the number of columns.
That is a very large problem.
Yeah, but people want to solve problems
with more than 4 billion degrees of freedom.
That was a thing.
And people have solved linear systems with hundreds than 4 billion degrees of freedom. That was a thing.
And people have solved linear systems with hundreds of billions of degrees of freedom.
That's a thing that people do. It's not a regular thing that people do. Sometimes it's a little bit more of a stunt, but people solve real problems that are large. It's hard to wrap your head around
running in parallel on a million processors and having problems that have billions of degrees
of freedom, but that's what people do. You made a comment a moment ago that caught my attention.
You said in the 90s when parallel computing architecture started to diverge. And I got my
computer science degree in the 90s. And I do remember these conversations about, you know,
shared memory versus distributed memory versus, you know, whatever, and how the interconnects
work and whatever. I've got the impression that all of that conversation has mostly gone away. And today,
all supercomputers are effectively the same thing of a bunch of nodes of Linux boxes that are
connected together with Ethernet. Am I missing something? Or is that like, it's kind of gone all
the way around in the circle as it tends to do. Okay.
It was really the 80s when distributed memory parallelism versus shared memory vector parallelism started to become a thing.
But by the mid-2000s, the world that you describe Linux boxes
or Unix boxes connected over a network,
that was kind of the world where people are programming in MPI,
the message passing interface, in a distributed memory way across many nodes.
And part of the reason they use MPI was because their problems are large
and they don't fit in memory on a single node.
But then with GPUs, with accelerators, the world starts to get more interesting again.
And so now, yeah, we still have nodes connected together over a network to solve the biggest problems, but the nodes get more interesting, richer.
So that's the world that we find ourselves in today.
Interesting.
I also remember conversations about like, well, you know, if we'd made really simple RISC machines and fit a thousand cores on a CPU,
that seems to have largely gone away in favor of GPU, which is less general.
What is a GPU?
It is, yes, but it's not what we had in our minds when we thought a thousand cores on one chip.
But it is. That's what it is.
And a lot of other architectures that aren't called GPU have that design.
Many small, you know, individually
low power instruction units that work in parallel to solve your problem. And so that's modern
computer architecture. This is perhaps a good segue to talk about Kokkos. Go for it. Let's go
for it. Thank you. So as mentioned, in the 2000s, architecture kind of stabilized. The way that you program nodes of
Linux is that you use MPI, which is a distributed memory, explicit message passing programming model.
And Trilinos, Tpetra, those kinds of codes grew up in this world where we could rely on MPI being
the way that you program. And so you don't think so much about how you program on a single node.
You're writing straight line, single thread code, and you're communicating with other processes on other nodes.
And so because the programming model was stable, people built up a huge software infrastructure to solve all kinds of problems.
They built huge applications, millions and millions, tens of millions of lines of code, lines of C++.
So like, Trilinos is single-digit millions
of lines of code to solve linear systems.
And so that stability of programming model
made it easy to write code and build up large systems.
But as node architecture started to diverge again,
it became harder to do that,
because do I write in CUDA?
Do I write in OpenMP?
Do I write in OpenACC?
Do I write in OpenCL? Do I write in SYCL?
Do I write in FUBAR, XYZ, whatever? I wanted to experiment with this just myself like
six months ago, and I couldn't decide what to use, and so I just gave up.
And that's what people do. People need to get work done. And the problem is, as the architecture
started to come out, that's the time where you need to experiment with new algorithms, new ways of writing code. 10 years later, it's too late.
You don't have time to experiment. You don't have time to influence the architecture, the vendors.
It's already given to you, and it may be terrible. It may be not helpful for what you need to do.
And so especially people who do supercomputing, they need the ability to be early adopters so
that they can influence or at least beg and plead.
And so that's really what Kokkos is.
Kokkos is a common interface to a lot of these shared memory programming models, and it works on NVIDIA GPUs, it works on AMD GPUs, it works on Intel whatever's
or any kind of architecture you want
that has threads in it.
So are you still involved in the Kokkos project?
I'm a little bit tangentially
because a lot of my work
with C++ standard proposals,
the collaborators like Christian Trott
are people who are in charge of Kokkos
who work on Kokkos.
Okay. So I could download Kokkos. What is it?
I mean, like, from a programmer perspective, is it just a library that I use that abstracts these programming models away for me?
Yes, that's right. It presents a common interface so that you can write parallel loops, parallel for loops, reductions, scans, parallel patterns,
and also multidimensional array data structures.
So then I need OpenCL or whatever installed also for it to talk to?
That's right. So it's a configure time choice.
So at configure time in Kokkos, you decide I'm going to use OpenMP.
And then Kokkos can use multiple
programming models that you specify when you're writing code. I want to target this to the OpenMP
backend, or I want to target it to the CUDA backend. I see. It doesn't have a stable ABI.
No, it does not. It does not aim for that. Although that has become, occasionally the question will come up,
oh, I'm writing Python and I want to use Kokkos.
Right.
And Kokkos is C++.
It relies on passing in lambdas and functors and compile time things.
It's very compile timey.
And it has to be to be optimized.
Because OpenMP is not just a set of pragmas.
It's like shifting gears in your compiler
and you engage a whole other compiler mode.
Sometimes when you have OpenMP on,
the compiler is doing different things to loops.
Different chunks of optimizer kick in.
The compiler needs to see the loop.
It needs transparency through the inlining.
That's historically been,
you kind of hope that the compiler will
inline things because that's how Kokkos works. And so ABI is not a thing with Kokkos, but if you're
a Python programmer or a Fortran programmer, you could use Kokkos or you want to be compatible with
Kokkos. And there the story is a little bit harder because it's a C++ thing. I mean, in using something like OpenMP, usually vendors, when they provide OpenMP runtimes,
the OpenMP runtimes are compatible across their different languages.
And so if you have C code and C++ code and Fortran code and they use OpenMP, they generally
use the same OpenMP runtime.
And so you can compatibly call across languages, and OpenMP won't spawn extra threads
that fight each other for the system's resources.
It'll use a single thread pool.
That's handy, yeah.
So Kokkos will just take advantage of that,
but Kokkos doesn't have its own thread pool.
If I were to write code and say,
okay, I'm using Kokkos, right now I'm targeting OpenMP,
but then you're like, no, I need to pivot to CUDA.
How difficult is that?
You change the template parameter, the template argument.
Oh, okay.
You're saying, if I'm hearing this right,
I could actually write code that targets CUDA, OpenMP, whatever,
all in one project?
The explicit goal of Kokkos was a single code base that can target
multiple backends.
Okay.
And the reason for that is the Department of Energy lacks the
resources to optimize code for N different backends.
So there's a huge code base that needs thread parallelization.
Maybe it can't,
it doesn't have to go at a hundred percent flat out on every architecture,
but it needs to not be slow.
That's really the goal. And if you need to optimize one particular thing really hard, yeah, you write CUDA or whatever.
You don't have to do that for every single thing. And the thing is, as parallelism increases on
Node, the serial bottlenecks start to matter more and more and more. And so you have to start
writing everything and thinking thread parallel all the time.
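To make the single-source, multiple-backend idea concrete, here is a minimal Kokkos-style sketch. It assumes a Kokkos installation configured with the backend you want; the default execution space is chosen at configure time, and a specific one can be named through the policy's template argument, which is the "change the template parameter" switch mentioned earlier.

```cpp
// Minimal Kokkos sketch: the same parallel loop can run on OpenMP, CUDA, or
// another backend depending on how Kokkos was configured, or on the execution
// space you name explicitly. Hedged example, not taken from the episode.
#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int n = 1 << 20;
        // View: a multidimensional array living in the default memory space.
        Kokkos::View<double*> x("x", n);

        // parallel_for over the default execution space (OpenMP, CUDA, ...),
        // selected when Kokkos was configured and built.
        Kokkos::parallel_for("fill", n, KOKKOS_LAMBDA(const int i) {
            x(i) = 0.5 * i;
        });

        // A reduction; to pin a specific backend you would pass, for example,
        // Kokkos::RangePolicy<Kokkos::OpenMP>(0, n) as the policy instead.
        double sum = 0.0;
        Kokkos::parallel_reduce("sum", n,
            KOKKOS_LAMBDA(const int i, double& partial) { partial += x(i); },
            sum);

        std::printf("sum = %f\n", sum);
    }
    Kokkos::finalize();
}
```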
Right. Hey, everyone, quick interruption to let you know that CppCast is on Patreon.
We want to thank all of our patrons for their ongoing support of the show.
And thanks to our patrons, we recently started using a professional editor for CppCast episodes.
If you'd like to support the show too, please check us out at patreon.com slash cppcast. So you're now involved in the
standards committee over the past five years. What sorts of projects are you working on? What
sorts of papers are you working on? How do they relate to your history with these linear algebra
libraries? A lot of my history in linear algebra libraries was making them generic to support, for example, different
element types, so different kinds of values in the matrix or vector. That actually mattered for us.
We had users who had scalar types, C++ types that could support automatic differentiation
automatically. So they would just automatically compute derivatives. Or there were types where
instead of computing on
one thing at a time, you were actually solving n different problems at a time, like an ensemble of
problems. And so you would propagate this ensemble of values as a single value, and they would have
all the overloaded operators and behave like double or float. Or there were types where
you could embed like solving a tiny partial differential equation inside where people
were solving stochastic partial differential equations to do uncertainty quantification.
They were trying to figure out the probability of something terrible happening by running a lot of
different problems that are slightly different from each other and seeing and looking for worst
cases. And so people were pushing all kinds of crazy element types through my library.
And so I learned a lot about, well, how do I write a generic library? How do I support
generic element types? These are things that people want to do. People want to put quaternions
in matrix matrix multiply. People want to put complex numbers in there. People want to mix
precisions. People want to have matrices of float, but they want to accumulate in doubles so they get
better accuracy.
Those are all things that people want to do.
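As a tiny illustration of that kind of genericity (hand-written for this write-up, not the proposal's actual API), here is a matrix-vector product whose element type and accumulator type are independent template parameters, so float storage can accumulate in double, or a custom arithmetic type with overloaded operators can flow through:

```cpp
// Hedged illustration only: a generic matrix-vector product where the value
// type and the accumulator type are independent, so you can store float and
// accumulate in double, or push a custom arithmetic type through it.
#include <cstddef>
#include <vector>

template <class Value, class Accum = Value>
std::vector<Accum> matvec(const std::vector<Value>& A,   // row-major m x n
                          const std::vector<Value>& x,
                          std::size_t m, std::size_t n) {
    std::vector<Accum> y(m, Accum{});
    for (std::size_t i = 0; i < m; ++i) {
        Accum acc{};                       // accumulate in the wider type
        for (std::size_t j = 0; j < n; ++j) {
            acc += static_cast<Accum>(A[i * n + j]) * static_cast<Accum>(x[j]);
        }
        y[i] = acc;
    }
    return y;
}

// Usage: float storage, double accumulation for better accuracy, e.g.
//   std::vector<float> A(m * n), x(n);
//   auto y = matvec<float, double>(A, x, m, n);
```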
And so one of the proposals I've been spending a lot of time on recently is P1673, which
is a C++ wrapper for the BLAS, for the basic linear algebra subroutines.
And I took the lessons
learned from generic linear algebra and I applied them. I wrote a companion paper to that, 1674,
and it talks about my experience of given the basic linear algebra subroutines, which is a C
or Fortran library, if I'm writing C++, what do I need to do to that Fortran library so that I can use it from C++ in an idiomatic C++ way?
How do I make the lowest level possible but still idiomatic C++ library out of something that was written for Fortran 77 code or Fortran IV code?
If you look at some of the older routines.
Fortran IV? That's really old, right?
Oh, people were still writing Fortran IV in the 80s. But yeah, if you're not familiar with the
history of the basic linear algebra subroutines, people might be familiar with Jack Dongarra,
the recent Turing Award winner. And the BLAS was his effort at separating out the parts of
linear algebra that a vendor could optimize from the parts of linear algebra that a mathematician should write.
And so the point of the BLAS was, I work at a company that makes hardware, and I don't really know how to do a matrix factorization.
But I know that I can do plus and times, and I know my high school algebra.
And if somebody explains to me, oh, this is how you do matrix, matrix, multiply, I can do plus and times, and I know my high school algebra. And if somebody explains to me,
oh, this is how you do matrix, matrix, multiple, I can implement it. And I can make it go fast
because I know the architecture. And if I'm the mathematician, what I can do is I can rewrite my
algorithms to use these BLAS routines that the vendor knows how to optimize. And if I do that,
I can make my math go fast, as long as I spend most of my time in the
vendor-provided routines. And part of that was an insight about the importance of memory locality,
and that importance has only increased over time, where arithmetic is practically free on a computer.
Memory bandwidth costs you money, hardware essentially, but memory latency is physics.
You just can't get around it.
And so the way that you make your code go fast,
yeah, physics.
I credit that saying to my co-advisor, Kathy Yelick,
who was a Berkeley professor
who was in charge of the Supercomputing Center in Berkeley.
She said that arithmetic was free,
bandwidth is money, latency is physics.
Admiral Grace Hopper, right? Right, yes, she was known for carrying
around "this is a nanosecond" or whatever it was. Yes. Do you know what we're talking about, Rob?
It was like a piece of metal or something that she would bring to lectures. Yeah, this is how
far light travels in a nanosecond. Was it a nanosecond? Something like that, yeah. So that you could actually visualize,
like, yeah, physics matter.
Yeah.
And for data centers,
if I'm programming multiple nodes,
a lot of the latency comes from
hardware and software needing to
interpret my message.
But yeah, the length of a cable matters.
The reason the Cray vector supercomputers
had their funny layout with the sofa around the central core,
the central core was laid out because of latency concerns for memory.
It's designed that way on purpose.
Any high-frequency trader will tell you how important it is to be co-located at the exchange these days.
Anyhow.
Yeah, so the way that people have been writing linear algebra for the past 30 or more years
has been focused around memory locality and in designing the algorithm so that it can
spend most of its time in BLAS algorithms that have the most memory reuse potential.
And so that's really what this proposal is about.
It's just the C++ version of that.
And being idiomatic C++, that implies more genericity,
you know, supporting more types,
also supporting parallelism through execution policy overloads.
And the way in C++ that we express a matrix or vector,
we didn't want to write like a matrix class hierarchy.
That's not something we wanted to do. We wanted to be as low level as possible. And so the analog of iterator for a matrix is
MD-SPAN. So that's the proposal P0009. And that's how you view a matrix. And MD-SPAN also has
genericity on the layout of data and how you access the data. And so you can use MD-SPAN, for example, to access.
We've written accessors for partition global address space.
So memory that lives on another node that you have to call a special function to get
to, you can use MD-SPAN to express that.
For NVSHMEM, for our NVIDIA hardware, NVSHMEM is how you can, in a shared memory kind of
way, you can get to memory on foreign remote GPUs.
So basically every vendor already has some implementation of the C version of this, right?
Yes.
So are you expecting that the C++ proposal, if it's accepted, would literally be a wrapper around the existing implementations or a fresh implementation in C++?
Like, I'm just curious.
The reference implementation that we have, what we expect is if you can call the C BLAS, you will. Otherwise, you'll write generic C++ code. And so there may be a bit of a performance,
I don't want to call it a performance cliff. If your element types or layouts are not supported by the C BLAS, then we have to write C++. But that in itself is an opportunity for
vendors to optimize. And some vendors do optimize things like mixed precision or interesting element
types. It's kind of an opportunity for vendors to say, oh, yeah, well, we have this cool new feature that says this short floating point type or this
extended precision decimal floating point type or this crazy element type, we optimize that,
or we optimize this new interesting recursive matrix layout that improves cache locality.
And so the C++ interface is like, the users just call that, it's generic, but then the vendors can do optimizations underneath however they like.
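Roughly what calling the proposed interface looks like, modeled on the P1673/stdBLAS reference implementation around the time of this episode. Header names, namespaces, and signatures were still in flux, so treat this as a sketch rather than the final API:

```cpp
// Hedged sketch of using mdspan plus the P1673-style free functions, modeled
// on the stdBLAS reference implementation (headers/namespaces may differ).
#include <experimental/mdspan>
#include <experimental/linalg>
#include <vector>

namespace stdex = std::experimental;

int main() {
    constexpr std::size_t m = 3, n = 4;
    std::vector<double> A_data(m * n, 1.0), x_data(n, 2.0), y_data(m, 0.0);

    // Non-owning views over flat storage; extents can be runtime or compile time.
    stdex::mdspan A(A_data.data(), m, n);
    stdex::mdspan x(x_data.data(), n);
    stdex::mdspan y(y_data.data(), m);

    // y = A * x, a BLAS-2 style free function taking mdspan views.
    stdex::linalg::matrix_vector_product(A, x, y);

    // Scaling without copying: y = A * (0.5 * x).
    stdex::linalg::matrix_vector_product(A, stdex::linalg::scaled(0.5, x), y);
}
```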
Kind of seems like the exact kind of scenario that has always kept the Intel C++ compiler around, as people say, well, I know that I need this optimized whatever feature that Intel's compiler supports.
It's a little bit like that, yeah.
I can just imagine Intel, for example,
saying, yes, our BLAS C++ interface
is going to be the best or whatever
because they want to highlight their hardware.
You can use the Intel BLAS without the Intel compiler.
And so there may be interesting questions for vendors
about whether they want their implementation
to support switching BLAS backends.
I mean, that could be a thing that implementation could do.
Even at runtime, they could load a different DLL and swap in different functions.
There are a lot of options for implementations.
So is this proposal related to the linear algebra proposals that we saw come out of the graphics proposal from Guy Davidson a couple years ago?
No.
We actually have a joint paper in which we describe the relationship between the two proposals,
which is we are not in competition.
Their proposal serves a different audience, I would say.
And the proposal in question is p1385. And the design is matrix classes and
overloaded operators, vector classes, you know, times and plus on matrices and vectors.
So those are things that graphics people want and need to do, and they like doing it that way.
We chose a different approach for a few different reasons. First of all,
we didn't want to figure out like, what's the element type of adding two matrices? So let's say I have a matrix of double and a
matrix of complex float. What's the element type of their sum? It should be complex double.
I would think complex double would be my guess. But you know, sometimes I'm like, I'm not a math
person. Well, it's worse than that. In C++, if you try to use Operator Plus
on a double and a complex float,
you get a compiler error.
Oh, really?
Yes, it's very exciting.
I've spent very little time with standard complex.
Yeah, maybe perhaps for the better.
It's not bad, but people naively think that,
oh yeah, it works just like float.
And then they try to write code and it's different.
I've only used it for Mandelbrot rendering because that's all complex numbers.
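A quick demonstration of the pitfall Mark mentions; this is plain ISO C++ behavior you can check with any compiler:

```cpp
// std::complex's heterogeneous operators are templates on a single T, so
// mixing complex<float> with double does not compile; you have to convert
// explicitly and decide the result type yourself.
#include <complex>

int main() {
    std::complex<float> c{1.0f, 2.0f};
    double d = 3.0;

    // auto bad = c + d;   // error: no operator+ for complex<float> and double

    // One explicit choice: widen to complex<double> first.
    std::complex<double> ok = std::complex<double>(c) + d;
    (void)ok;
}
```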
So when you have a matrix operator plus, you have to decide what the return type,
the element type should be. You also have to allocate. And so now you start to get into the
business of storage and allocation.
And what if I don't just want to do new?
What if I want to allocate on a GPU or something?
And now you've conflated the issues of allocation and storage and representation and the actual
functions.
I'm not saying that that's bad, but I think for the graphics use case, a lot of linear
algebra problems for graphics are small.
There are four by four, three by three.
And there, a lot of people who are doing graphics, maybe they want operator overloading.
They don't want to call a function with a long name.
So I feel like that perhaps could be a different use case.
I don't want to say that. There are many users who are happy, like the kind of MATLAB user
who has medium-sized matrices,
and they're not really concerned about this problem.
They want A plus B to return a matrix, and they're happy with that.
That's not really the world in which I live.
I was the kind of person who was implementing MATLAB, not the person who was using MATLAB, generally.
Well, I've used MATLAB, but yeah.
There's a fun story about the history of MATLAB,
which is that Cleve Moler was trying to teach students linear algebra, and they were all programming in Fortran.
The students spent most of their time
trying to get Fortran to compile,
and Cleve didn't like that very much.
And so he wrote a wrapper around Fortran,
and that's what MATLAB is, at least the beginning of it.
Interesting, actually.
I was going to say, I was looking at the P1673 paper you have, and you're currently on the
seventh revision, which just went up like 10 days ago.
How's this moving along?
Are you getting good feedback in the committee meetings?
We are.
Actually, this morning, we had a review of the paper in LEWG, in the Library Evolution
Working Group, and it had a lot of great feedback.
Probably going to
see more LEWG review before it's done, and some people are concerned about the length and the
amount of wording review to do, and whether perhaps it would be better to spin off as a TS rather than
an IS. It's not something that I've wanted to do because it's not clear that that would save any work.
But there was a lot of discussion about the length of the paper
and the number of functions in it.
I don't want to give up the functions,
especially the BLAS 2 and 3 ones.
Those are really critical.
Like, you really need all of those functions
to do basic linear algebra.
And it's a lot more than plus and times.
And you said there is a reference implementation. Did I hear that right?
Yes. So you can find the reference implementation. It's linked off of Kokkos' GitHub. So it's
GitHub. I'll type it in the chat.
Make sure we get that in.
Yeah.
stdBLAS.
Yeah. And there are reference implementations also of MD-SPAN,
which is P0009, MD-ARRAY, which is P1684.
I think we talked to Daisy Hollman a few months back about MD-SPAN. Yeah, so the MD-SPAN repository has the latest reference implementation of MD-ARRAY now.
So for listeners who are interested in trying these reference implementations,
do you have a Conan VC package?
Is it fetch-contentable?
Is it easy to get and use in your library?
It's a little more than that.
It's typical CMake and a little bit of handwork,
so it's not very beautifully packaged.
But the goal is more of that. But in terms of build system, it's not very beautifully packaged. But the goal is more of that. But in
terms of build system, it's not hideously complicated. Okay. Just curious. So our
listeners know what they're getting themselves into. For sure. You didn't say you have to run
a bunch of custom shell scripts. So I think you're... No, no, no, no. It's CMake. It's all CMake. Okay. So it looks like MD-SPAN and MD-ARRAY are probably likely to get accepted soon. Is that so?
MD-SPAN is in the middle of wording review, and we're hoping C++23.
MD-ARRAY is still undergoing design changes in design review, so it almost certainly will not make 23.
Now, if I understand MD-SPAN right,
I can say I've got a flat array or a vector or whatever
and say I want you to now give me a view of this
as if it were a four-dimensional array,
whatever with this layout and access patterns and such.
Yes, that's right. Okay.
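That "view of a flat buffer" looks roughly like this with the reference mdspan implementation (hedged: the exact spelling of extents and the indexing operator changed between revisions):

```cpp
// Hedged sketch: viewing one flat std::vector as a 3D array with mdspan,
// using the reference implementation (https://github.com/kokkos/mdspan).
#include <experimental/mdspan>
#include <cstddef>
#include <vector>

namespace stdex = std::experimental;

int main() {
    std::vector<double> buffer(2 * 3 * 4, 0.0);

    // A non-owning 2 x 3 x 4 view over the flat storage; no copy is made.
    // Extents here are runtime values, but any of them could be compile time.
    stdex::mdspan<double, stdex::dextents<std::size_t, 3>>
        view(buffer.data(), 2, 3, 4);

    view(1, 2, 3) = 42.0;          // indexes into the same flat buffer

    // Layout is a policy: the default is layout_right (row-major, C-style);
    // layout_left would give Fortran-style column-major indexing instead.
    return static_cast<int>(view(1, 2, 3));
}
```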
So then, dumb question, what is the need for MD array then? The point of MD array is convenience,
so that you don't have to do what you just described, create a vector and then create
an MD span to view it. You can just create an MD array and it does that for you. Another point is if you have very small arrays,
so arrays of small types that have compile time dimensions,
you know, your three by three, three by four case,
MD array doesn't need to store a pointer.
So MD span is a view and it stores a pointer
and possibly any extents that are not known at compile time.
So MD span will always store a pointer.
And you can have types that are small enough that the whole thing will fit in 64 bits.
Or, you know, maybe it doesn't take much more than that.
And so MD array can optimize for those use cases where you have very small objects
that you would maybe use an array, std array, to store. And so md array
can use std array to store things. I don't have the math background for this at all, but I just
found myself wondering, like, is there a use case, and I'm kind of visualizing like C style arrays,
where you could say, I want one that's, I know it's five dimensions, it's vector backed, but I'm only going to tell
you the length of the first four dimensions and the fifth dimension can grow dynamically
as we add more data to it or something like that. I mean, not that you have dynamically
growing arrays in C, but if you declare it a C style array, you don't have to declare the
size of the last dimension because that is just another point.
That's just an index offset at the end.
So just a random thought, like, is there a use case for something like that?
Does that exist?
Is that a math thing?
I'm just talking.
No, users sometimes want to do that.
So, for example, users may have tensor data that can be organized in a time series.
And as time elapses, more data sets are
added to their collection, and they may want to represent that as a single tensor. Sorry,
mathematicians sometimes grump at me if I say the word tensor, but they want to represent that as a
single multi-dimensional array and just pack on to the end. Right, right, right. Yeah, okay. So
that's a thing that people want. That's not in our design.
Right.
Well, it just made me wonder,
is MD vector on the horizon or something like that
instead of MD array at some point?
If that would be a use case.
We haven't thought about that.
We've excluded that from the design
because you have to think very carefully
about compile time versus runtime extents.
So the design of MD span anddaray revolves around this extents object,
and it mixes compile time sizes and runtime sizes in the type.
So just like stdspan has dynamic extent or a compile time extent,
extents is the generalization of that.
And because you can mix arbitrary compile time and runtime extents,
if you were to design an MD vector around that, suddenly it gets complicated. How do you know
which extents you can grow and which you can't? And it gets messy. And so we didn't try to tackle
that problem. I think that's a hard interface problem. Yeah, that's fair. Like I said, I have
no idea if there's even a use case for it. But you know, like people who do Python data analysis with pandas and whatnot, they want to do things like that. And I think the way to
handle that would be the growing would happen in the Python world. And if you wanted to call into
C++, you'd make an MD span, statically, call the function that you want to call in C++,
and then go back to the Python world and grow it as you like. So do MD-SPAN and SPAN and array and
MD-ARRAY play nicely with each other? Like if I say, well, I've just got a regular std span,
is it like convertible to an MD-SPAN with one dimension? Yes, there are those kinds of
constructors. And the history is funny. The MD-ARRAY proposal has a long history. The first paper was 2014.
Oh, wow.
And span was actually split off from MD span. That's the origin.
Oh, okay.
People wanted a simpler vocabulary type, where pointer was actually like T star. And with MD span, the pointer type can be more general. And so, as a vocabulary type, people wanted span to be a separate thing.
But yeah, the two are interconvertible.
I mean, array, it's very annoying
that the dynamic version of array is called vector.
The name is unfortunate.
And so MD array can have a mix of compile time
and runtime extents.
So it's different than STD array.
But the naming is difficult, but there's nothing else we could call it. It's an array. That's what it is. Sorry.
All right. Well, Mark, we're going to put links to all these papers and reference library
implementations in the show notes. Is there anything else you want to plug or let our
listeners know about before we let you go today? Oh, well, I just wanted to thank you for giving
me the opportunity to speak and thank the listeners for listening.
And I welcome feedback on any of these proposals,
especially std BLAS or MD-ARRAY, which are more in the design review stages right now.
We especially welcome feedback on those.
But also on MD Span, if you find a bug in the spec, now's the time to look.
So I welcome any feedback you may have.
All right. Thanks so much for coming any feedback you may have. All right.
Thanks so much for coming on today.
Thank you.
Thank you.
Thanks so much for listening in as we chat about C++.
We'd love to hear what you think of the podcast.
Please let us know if we're discussing the stuff you're interested in,
or if you have a suggestion for a topic, we'd love to hear about that too.
You can email all your thoughts to feedback at cppcast.com.
We'd also appreciate if you can like CppCast on Facebook and follow CppCast on Twitter.
You can also follow me at Rob W. Irving and Jason at Lefticus on Twitter.
We'd also like to thank all our patrons who help support the show through Patreon.
If you'd like to support us on Patreon, you can do so at patreon.com slash cppcast.
And of course, you can find all that info and the show notes on the podcast website at cppcast.com.