Embedded - 193: Axiomatically Did Not Happen
Episode Date: March 29, 2017
Owen Anderson (@OwenResistor) joined us to talk about how compilers are written. We discussed LLVM, intermediate representation, Clang, and GPUs. As mentioned at the end of the show, Owen's current employer is hiring. If you are interested and would like to get the brownie points that come with being a friend of a friend, contact us and we'll connect you to Owen and he'll submit your resume.
Recent books Owen mentioned: The Manager's Path, Feminist Fight Club, and The Laundry Files series by Charles Stross.
LLVM Language Reference
Teardown of what goes into rasterization
What Every C Programmer Should Know About Undefined Behavior
Transcript
Welcome to Embedded.
I'm Elecia White with Christopher White.
Remember how we keep mentioning Clang?
And that time we tried to explain what happens to your source code before it runs on your hardware?
And when I dumbly asked, what is LLVM?
And everybody who'd been nodding knowingly looked a little grateful that
someone finally asked. Well, it is time for an expert. Our guest this week is Owen Anderson.
Hi, Owen. Thanks for being on the show this week.
Hi, Chris. Hi, Elecia.
Could you tell us a bit about yourself?
So I am a compiler engineer by background. I got involved in the LLVM project from the open source side when it was a very, very young project.
I think that was 2006.
And that was while I was still in school and ended up becoming a professional compiler engineer
where I worked on LLVM and the Clang C and C++ compiler,
sort of through the birth and then lifecycle of that,
porting it to different architectures,
and then eventually started working on it,
targeting it towards graphics processors.
I also studied computer architecture in school,
and while I'm not a computer architect,
I have worked with a lot of computer architects.
All right.
That leads to many, many questions.
Hopefully many answers.
Before we get to those questions, Lightning Round is finally back from vacation. And so we're going to ask you
a bunch of short questions that are relatively random and
fairly meaningless. And we want short
answers, which could be random or meaningless, whichever, you know, it's fine. Christopher,
do you want to start? No. All right. So, Owen, do you start many projects and only finish a few,
or do you only work on a few projects and finish most of them?
For personal projects, my house is littered with unfinished projects.
All right. Tabs or spaces?
Spaces. Two spaces.
Oh, you anticipated the next question. There's only one right answer. Preferred voltage?
I love... I'm obsessed with the idea of everything being low power, so I want to say some very low voltage, but practically all the stuff I have is five volts.
Favorite animal?
No idea. My daughter's obsessed with bats right now. Oh, that's a good one. Favorite by proxy.
Yeah.
Well, we talk about them a lot, so.
Was there a movie or book that you encountered for the first time in the last year that now is a favorite?
Fiction or nonfiction?
Whatever. It's open.
Oh, okay. I can do one of each. So I listen to almost all my books on audiobook.
I also listened to The Laundry Files in the past year, which was new to me.
Oh, yes, by Stross.
So probably not.
Yes.
Charles Stross, yes.
Which was incredibly nerdy and awesome.
In nonfiction, I read Feminist Fight Club,
which was really great about fighting sexism in the workplace.
And I actually just read The Manager's Path, which is the first book about management I've
ever read that wasn't complete garbage.
So.
And while you've been a compiler engineer for a long time, you are now a manager, right?
Yeah, I've been a tech lead or engineering manager for a few years now.
That grew directly out of my compiler engineering.
I was an engineer and then a senior engineer and then sort of the tech lead on the project
and eventually became the engineering manager.
Do you want to continue?
Yes, sir.
No, that's fine.
If you could have dinner with anyone in the history of the world, who would it be?
I'm going to cop out on that one.
I'm actually, I have an interest in history because my father was a classicist and a historian,
but I'm more interested in like social history
rather than a sort of great man view of history.
So I tend to be interested in all these like periods of history
that you never learn about in school.
So I don't know if I could pick a person.
There's plenty of time periods I'm interested in.
I'm interested in the dark ages and all these things that you just don't read about in a history book very much.
Fair enough.
All right.
Last one.
This one should be easy.
Have you ever touched a penguin?
I don't think I've touched one.
I've seen them, including not behind glass, but I don't think I've touched
one. Are you expecting anyone to answer this question in the affirmative?
I'm hoping someday.
For the record, they stink.
I liked that last week Terry asked stuffed or live, and then the answer was no to both. So I don't know why we needed the clarification.
Thus ends lightning round. Yes.
So let's get on to compiler engineer and what that means
and what compilers really are.
How many of you are there?
Way more than you would think.
Really?
Really.
It seems like a fairly rare thing.
I mean, you make a compiler
and everybody uses it
and then it's done.
It's one of those tasks that's never really done. You can always make it better in some way.
And it has certain characteristics as a sort of niche within engineering that
appeal to people with certain mindsets, where it's a very sort of well-specified problem,
right? You have to take in a particular input and produce a particular output.
There's no non-determinism. There's no timing aspect.
There's a lot of math
involved in certain parts
of it.
You're right that it is a very
niche topic within engineering
and people specialize into that.
Some of it
may be just that because I've done it a lot, I now know
lots of compiler people.
There's lots of us sprinkled around.
Okay.
So I write something in C, C++, Python.
No, no, no, C or C++, the compiled language.
And what happens to it before I download my binary, my .out or my .srec, to my hardware?
A huge amount of stuff.
First, you build the car, and then you drive it.
No, it's... I mean, generally, we break a compiler down
into sort of three phases,
which are not very meaningfully named,
like front-end, optimizer, and back-end.
The front-end would generally be the portion of the compiler
that deals with the language-specific portions of your compiler.
So you have a C front-end or a C++ front-end
or a Rust front-end or whatever language you're writing in.
It's the part that's responsible for giving you error messages
and working out all the types in your code and things like that.
And the preprocessing, right?
Sorry.
The preprocessing?
Not just preprocessing.
So, for instance, C++ has a fairly complex semantic analysis system
once you take into account templates, right?
Something has to expand out all those templates,
plug in all the types,
figure out does your program type check correctly?
Did you leave out a semicolon somewhere?
Okay.
So front-end runs,
it generates a representation of your program,
usually a representation that's somewhat reflective
of what the code you wrote looked like. And the second phase is the optimizer. Optimizers
generally perform sort of language and machine independent optimization. So things along the
lines of, you know, you're computing some values and never using them. Oh, let's just not compute those at all. And there's, you know, a giant laundry list
of all these sort of machine agnostic optimizations. And then finally, the backend component
maps your program onto the processor you're actually targeting.
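A minimal sketch of that kind of machine-independent optimization; the function and names here are invented for illustration:

    int add_one(int x) {
        int unused = x * 42;  // computed but never read: dead code the optimizer removes,
                              // regardless of which processor is being targeted
        return x + 1;
    }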
So there are some intermediate things that happen. There's IR, the intermediate language that goes from C into a completely different language.
Why? Why wouldn't it go straight to assembly?
Well, so very old compilers did.
All the sort of original 70s compilers are what we would now today call single-pass compilers or splat compilers,
where they read in one line of code and immediately wrote out the assembly to implement that line of code.
Oh, no, that's clearly inefficient. You definitely need an optimizer.
Yeah. The reason is that code, or even a sort of syntax tree, a data structure that represents code directly,
is a fine way for humans to understand what's going on. It's not a particularly great
representation for the compiler to actually do transformations on. It makes things harder
than they need to be in terms of the mechanics: when the compiler wants to make these changes,
how hard is it to actually make that change? And also in terms of analysis: a lot of compilation
and optimization in particular is the compiler has to demonstrate, is this fact definitely true?
And programming languages that are built for humans tend to make that much harder
than it needs to be. Like, what fact might be true?
So I can give you a very specific example
of an intermediate representation aspect from LLVM,
which is pretty widely used in a lot of compilers now,
which is called SSA form.
It stands for single static assignment form.
I always get the two S's backwards.
In C, you can have a variable I.
You can assign to it multiple times. You can say
I equals 1, I equals 2,
I equals 3. You can have, you can
assign to it in one side of an if and then assign
to it again in the else side.
And this actually makes it
really complicated
for the compiler to reason about,
well, maybe one of those assignments wasn't needed
because you never read it again after writing it.
But if they're all called i,
it's kind of tricky to reason about that.
So most compilers use a representation
where they would actually break those up into different variables. You would have sort of I0 and an I1 and an I2 that were all sort
of the different assignments of I, so that each one only gets assigned to once. You don't have this
overwriting, which then makes it very much easier to reason about, because you can say
I2 is equal to 2 and nothing ever uses I2, so I can get rid of it.
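A minimal sketch of what that renaming looks like, written as invented C for illustration (real compilers do this on the IR, not on your source):

    int f(int a) {
        int i = 1;       // in SSA form this becomes i0 = 1
        i = a + 2;       // ...and this becomes a brand-new value, i1 = a + 2
        return i * 3;    // this use refers to i1; nothing ever reads i0,
    }                    // so the assignment to i0 is trivially dead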
So the main value of the intermediate representation, the IR is the acronym people use,
the main value is in optimization. Is that right?
Generally speaking, yes. I mean, it also enables a significant amount of reuse
across different compilers, different languages,
and different machine backends.
So it turns out that writing an optimizer is fairly complicated.
It's an ongoing process of many years, right?
And it's fairly detail-oriented, right?
You have to write a lot of little pattern
recognizers that say, oh, I recognize this little pattern of you do X, we should simplify that into
Y. And that accumulated knowledge and that collection of pattern recognizers is very
valuable and you don't want to waste it. It seems like it would also be a big advantage
when you're targeting a vast array of different processors
to have that intermediate step
because then it's a shorter distance
between that and assembly necessarily.
Generally, yes.
So this sort of bleeds into an analogy I often make,
which I can't take credit for.
A co-worker originally came up with this,
which is that you can think of the process of compilation as sort of this mountain.
And you start at the beginning going up the mountain in a process that we call canonicalization, which is there's many ways to write the same program in terms of what the program does.
If you're writing a factorial function, you can write it recursively or non-recursively and all these things, right?
Canonicalization is the process of
trying to factor out all those differences.
Trying to collapse all those programs
that are equivalent in terms of what they do
so that they all become the same representation.
And then you go up the mountain,
you're canonicalizing, you're canonicalizing,
you reach enlightenment somewhere at the top
where it's perfectly all-factored.
And then you start going down the other side of the mountain, which is a process called lowering, where now that you have sort of your ideal representation of the program, you want to lower it towards the way that it is best expressed for the machine you're targeting.
And real compilation pipelines are not one mountain. It's actually more like peaks and valleys.
Well, yes, because if I want, I mean, determining
canonical representation, I mean, what you're doing there is trying to read my mind through
my code, which is an imperfect representation anyway, and now you're making assumptions about it.
Not assumptions.
So compilers have sort of this core mandate
that they have to always be safe, right?
The compiler can't break your code.
And, of course, what that actually means
is specified in the C or C++ standard,
which nobody actually reads and nobody actually understands,
which then leads to many compiler engineers spending many hours having to explain it to people when their code doesn't do what they think it should do.
The compiler, one of the reasons people often wonder why do compilers take a long time to run, especially in high optimization levels,
and that's because the compiler doesn't get to make very many assumptions about what you mean. If it wants to do something, it has to prove that it can do it, which means it
has to consider all these different cases, has to reason about, oh, if it took exactly this pattern
of control flow, could it violate some assumption I want to make? And that kind of theorem-proving
aspect is usually what ends up being the big time sinks.
Wouldn't it be more efficient for everybody if I just wrote my code in the intermediate representation?
In principle, maybe.
In practice, it's not generally very pleasant to write in.
You can actually look up LLVM's intermediate representation.
It's pretty well documented.
There's a document called the language reference.
It's not too bad to read.
It's a pain to write, is what I would say.
You can also play with it on Compiler Explorer.
If you use one of the Clang compilers
and you add the command line option -emit-llvm,
it will actually show you the intermediate representation.
And it's fairly readable.
It looks kind of like an abstract assembly language.
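For example, feeding a tiny function through Clang with clang -S -emit-llvm (with optimization enabled) prints IR along these rough lines; the exact output varies by Clang version and flags:

    // add.c -- compile with: clang -S -O1 -emit-llvm add.c
    int add(int a, int b) {
        return a + b;
    }
    // The resulting .ll file contains something roughly like:
    //   define i32 @add(i32 %a, i32 %b) {
    //     %sum = add nsw i32 %a, %b
    //     ret i32 %sum
    //   }
    // typed values, named virtual registers, SSA form: an "abstract assembly".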
And that's completely different from what GCC does, for example?
GCC has analogous things inside of it.
They don't have a nice readable form that they can
easily print out.
The general architectures
have a lot in common.
You've mentioned LLVM,
which used to stand
for Low Level Virtual Machine,
but now it stands for LLVM
because that acronym
doesn't have any
real application to what it is anymore.
Yep. We used to get a lot of questions about what is this LLVM thing and how is it related
to a Java virtual machine? And everyone got fed up with that.
So what is LLVM?
LLVM is an umbrella project for a toolkit for making compilers.
One of the big differences between LLVM and GCC is that LLVM is structured as a set of libraries that you can mix and match,
that you can build into your own software, and you can put them together to build compilers. The most well-known example is the Clang C and C++ compiler,
but it's also used in lots of other compilers.
Rust uses it.
There's a D compiler based on it.
Apple's Swift compiler uses it.
Lots of GPU compilers are based on LLVM.
And it provides a lot of the components you need
at various parts of the compilation pipeline,
front-end optimizer and back-end.
And one more time,
wouldn't it be more efficient to write
the assembly from the C code?
I'm just really stuck on this.
Why do we go somewhere else?
It depends on what
you want. It'll certainly get you faster
compile times, but
the problem is it will certainly get you
much worse code quality.
So it's
really all what you're optimizing for.
Some people want really fast compilers,
particularly for their debug builds.
I mean, I guess the thing
is, you have to do
this even if it's implicit,
because you can't directly optimize
C code.
You have to keep a bunch of state around.
So if you're going to keep a bunch of state around,
why not just make a separate language
which you can use as a scratch pad
to keep all of that state around and then optimize that?
It seems inefficient.
So maybe C is a little bit of an example
that makes it a little more confusing,
because C is a pretty low-level language to start with.
I would pick on something like Haskell,
which is a compiled language as well.
It's a functional language. It's pretty high-level.
Haskell is
the Haskell compiler,
the Glasgow Haskell compiler, which is sort of the standard
one, has an LLVM-based
backend. And they do
multiple levels of
you know, to think of that mountain
that I described, right, they have sort of multiple of them,
right, where they have a representation that's
closer to the Haskell language, and they
perform optimizations that are sort of derived from Haskell-y kind of knowledge.
And then they translate it into LLVM, and they let LLVM worry about all the low-level,
you know, simplify my divides into shifts kinds of things.
And it's sort of a separation of concerns kind of problem in that, you know, in Haskell, you have a lot more knowledge.
You know, if you have a representation that's very close to the Haskell language, you have a lot of sort of implicit knowledge where you know that everything is a pure function.
You know there are no side effects, or you know there are only side effects within the IO monad, and things like that. But it's hard to do little fiddly,
bit-twiddling optimizations
directly on a representation
that's very close to Haskell.
When you convert to LLVM,
you're trading benefits, right?
It's no longer as easy to reason about
those sort of mathematical properties of Haskell,
but you've opened up the ability to optimize at a finer granularity.
It's all about trade-offs, really.
Different representations of the program may be easier or harder
to perform different kinds of analysis and different kinds of transformation on.
And so most real compilers go through several different representations of the program.
And I guess the thing that might have been bouncing around your head, Alicia,
and came to my mind is, oh, okay, well, why not just optimize the assembly and use that as the IR?
It's because you've got tons of different architectures,
and you'd have to rewrite your general-purpose optimizer for every single architecture.
Exactly. So one of the things that people are very interested in with C++,
for instance, is optimizers that can eliminate virtual calls and turn them into direct calls.
And you really don't want to have to re-implement that for x86 and ARM and MIPS and every other
target you could possibly care about.
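A minimal sketch of the kind of devirtualization being described here, with invented class names:

    struct Shape {
        virtual ~Shape() = default;
        virtual int sides() const { return 0; }
    };
    struct Triangle : Shape {
        int sides() const override { return 3; }
    };

    int count_sides() {
        Triangle t;
        const Shape &s = t;
        // The optimizer can prove the dynamic type of s is Triangle, so this
        // virtual call can become a direct call (and then be inlined to just 3),
        // and that reasoning is the same whether the target is x86, ARM, or MIPS.
        return s.sides();
    }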
Are the compiler or are the instruction sets for the different processors all that different?
I mean, I look at the assemblies and some of them are quite different, but does it really matter?
It matters, but yeah, that's actually a really complicated question. It definitely matters, but it matters both more and less than people think, and it depends on who you
ask. I think the way I think of it is there's sort of a sweet spot in instruction set designs,
this instruction set design space, where you want your instruction set to
be sort of complete enough and dense enough that you're not just wasting instruction size or
decoding bandwidth. You want it to be sort of sane enough that the compiler doesn't have to
struggle with generating good code for it.
And this is where we start getting into very long instruction words.
Yes.
Whether we have lots of assembly instructions with a few modifiers,
or if we have a few assembly instructions
and you just have to do a lot of them in order to get to where you want to go.
Yeah, or going in the reverse, you go, like, very RISC-y, where you have only very, very simple
instructions, but then you have a lot of them. Like I said, I think there's sort of the sweet
spot in the middle, and as long as your architecture is somewhere in that sort of middle region, it
doesn't make a huge amount of difference. Obviously there will be differences within the compiler, because it needs to worry about whatever little
quirks your processor has. But in terms of will the compiler be able to generate something
reasonably good for my processor, you know, if you stick to that middle region, you're probably
going to be fine. And so many do stick to that region, because it is sort of the most obvious,
it's the one you fall into unless you have a good reason not to.
Yeah.
So I will,
so like to give some examples,
right?
So VLIW,
I think everyone knows why VLIW,
very long instruction word is kind of a pain.
Well,
I guess I don't know if everyone listening knows about VLIW.
VLIW is this idea of having instructions that are grouped by the programmer
or by the compiler
that are guaranteed to be independent of each other
so that they can be run in parallel.
So if you have an ALU
or you have three arithmetic units
and you have maybe a floating point unit
and a memory thing
and a memory thing.
And a memory accessor thing.
Now you can have a very long instruction that says,
add these three separate, entirely not dependent things as they need to be added and access this memory
and do some floating point operation.
And that's all one instruction word because we are in very long instruction word land.
And so if you wanted to do something else, you're going to end up with an entirely different word.
I like to think of it as German where you end up with 29 consonants all in a row.
I like that analogy.
Yeah, VLIW is painful for everybody who has to program it.
VLIW is fundamentally a trade-off to make your hardware simpler and make your programmers' lives more difficult.
And often the programmer ends up being really your compiler engineer.
VLIW is great for certain application domains where you really need to get peak performance out of the minimum number of transistors.
But you are making a trade-off that you're going to make the compiler's life very, very difficult.
And then on the opposite end, right, we have very RISC-y designs like MIPS or more recently RISC-V.
You have a different set of problems where the instructions are so simple that there's almost
nothing the compiler can do to help you out.
Like, there's no difference between good code and not good code.
So you have to make sure you're building very good CPUs
or else you're just going to end up with very poor performance.
I forgot what the question was that started me off.
I mean, it was risk versus very long instruction words.
And why does it
matter? I mean, you're in this intermediate
representation. Well, and there's the
other axis, which is complex instruction set.
The Intel side of things where
you have the Windows instruction.
I'll pick on
RISC-V a little bit.
And I don't want to be too
mean, because I actually know a bunch of people
who work on RISC-V
and they're all very smart people.
But I'm going to pick on it a little bit anyways, right?
RISC-V is a very,
if people don't know about it,
it's a new open source instruction set architecture
out of Berkeley.
And they're building open source processors based on it.
It's very, very RISC-ish.
You can see the sort of heritage through MIPS in it.
You know, it doesn't have any kind of complicated addressing modes.
It has a very simple instruction set,
which is great for teaching and very easy to learn.
It also means like the compiler can't do a lot to help you out
in terms of code quality.
And it gives rise to this sort of conundrum, right? If you look at performance of applications,
most applications are going to be performance limited
by some small number of relatively small loops in the program.
And if your loop expressed in RISC-V is two times as many instructions because every load has to have an accompanying add
in order to do the addressing mode,
then now your processor needs to have
twice as many instructions per clock
just to hold pace with one that could represent that loop
in half the number of instructions.
And a 2x increase in instructions per clock
is a big increase for a CPU designer.
Yes, yes.
We need you to go twice as fast.
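As a rough sketch of that addressing-mode point (instruction counts are approximate and depend on the exact ISA and compiler):

    // Loading one array element inside a hot loop.
    int load_elem(const int *a, long i) {
        return a[i];
        // With a scaled-index addressing mode (AArch64, x86), this is roughly one
        // load instruction. On a bare base-plus-offset ISA like RV64I, the compiler
        // must first shift i and add it to a to form the address, then load: about
        // three instructions for the same work, which is what inflates tight loops.
    }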
So there's sort of a sweet spot in there.
You look at like ARM or PowerPC
or even x86 to a degree, right?
Modern x86.
Most of the instructions are relatively simple.
And then you have complicated instructions
only when they're going to provide significant value.
So, you know, everyone sort of agrees
that having a reasonable set of addressing modes is pretty good.
Everyone agrees that having a pretty rich set of shifting
and bitwise operators is good.
Things like that.
If you're missing some of the, you know, there's little islands where you really want to have a few sort of complicated instructions just to make sure that you're not going to be quadrupling the size of some tight loops.
But otherwise, being RISC is just fine.
Okay.
So RISC, did we already say it stands for Reduced Instruction Set?
Yeah, RISC stands for Reduced Instruction Set Computer.
And it was, I think it originated in the 80s.
And ARM Cortex-M, which I know that's not what everybody uses, but it certainly is what I've been using a lot of, are considered RISC processors.
ARM is another one of those that started as an acronym.
It's no longer, as far as I know.
It's the Acorn RISC Machine.
Exactly.
So do you work much with the embedded processors?
Not specifically.
I mean, I worked on ARM a significant amount,
and I was one of the two engineers on the first 64-bit ARM functional compiler.
It's a little lazy because there were multiple compilers coming up at once,
but we were the first one to boot a kernel.
But that was all actually application processors,
not MCUs.
I have programmed MCUs a little bit on my own time.
That's not a requirement.
I just was wondering. What parts of LLVM did you work on?
So I've worked on the optimizer,
particularly sort of earlier in my career,
and then I moved more into the backend part of the compiler,
working specifically on retargeting LLVM to new architectures,
to 64-bit ARM, and then later to GPU architectures.
The GPU architectures.
I usually think of CUDA.
Or OpenCL.
Yeah. So is that what you're compiling, or is this a general C compiler? I guess, are you working on the back end for that, or...
Yeah, so I worked on the back ends that would be sort of the backing for all of those.
Okay.
And not just those, actually. Our primary clients were not any of those.
They were actually games and graphics programs
where you write small programs called shaders
in a shading language like GLSL or HLSL.
And you did this for Apple or Google
or one of the big giant names.
Yeah, I worked for Apple.
Why do for-profit companies support LLVM?
It seems like, aren't they giving things away that they should be charging for?
It turns out that there's actually no money in compilers, generally speaking,
or no money that's big enough for Apple or Google or anyone else to care about.
Even Microsoft has moved towards giving more and more of their tools away for free because the value to all these big companies is in getting developers to build applications for their platforms, for Mac or iOS or Android or Windows or whatever. And even for the hardware vendors,
their motivation is to get people to write applications for ARM or PowerPC or x86 or whatever, right?
The amount of money they could make
from charging people for a compiler
is a pittance compared to selling, you know,
10 times more iPhones or Android phones
or x86 processors or whatever else their goal is.
So it'd be real nice if ARM came out with a free...
Well, that's a different...
They have a compiler they charge for.
So why...
This is a different discussion.
Why are we still paying anybody for compilers? That's the question.
I mean, that's a great question.
Obviously, I come from the application processor domain
where most people don't pay for compilers.
There's very limited circumstances where people do.
And it's usually, I mean, there are still some domains
where people pay for compilers.
In scientific computing, people do still pay for specialty compilers
that have very sort of domain-specific capabilities.
And I remember Intel still sells theirs.
Yeah.
Yeah, they sell theirs.
I don't know anyone who has actually paid for it.
I think they actually sell that primarily
to sort of scientific and HPC users.
I believe like Cray charges for theirs similarly.
So I would guess that the reason is sort of similar
where there's sort of domain-specific knowledge
that needs to be applied to build a compiler
specifically for that domain.
And that's why there are embedded compilers for sale
because there are lots of embedded chips
and more than paying for even the compiler or the optimizer,
you're paying for being able to deal with the 9,000 different variations.
Yeah, that's what I would think.
It would be like having this complete index of memory maps
and register layouts, which is less about the compiler and more about a very small part of the back end.
Yeah, I mean, it's probably, maybe even, not part of the back end. It may actually just be a header file somewhere.
A five-thousand-dollar header file.
So does LLVM work for ARMs, for Cortex-Ms? I mean, you mentioned ARM64s.
Yeah.
So in principle, yes.
But there's no...
In principle, yes, but you would have to supply your own memory map and register layout.
And LLVM historically did not have its own linker,
though they're actually working on writing one.
And I'm not sure how mature it is for ARM right now.
But yes, in principle, you could use Clang to target ARM.
I actually have.
I have one of those, the DIP packaged LPC parts.
And I have actually written a little program in Clang
and then downloaded it onto it.
But there's definitely a DIY aspect where you're missing a bunch of the pieces
that would make up sort of a full tool chain,
and you have to supply those yourself.
And it's kind of sad because Clang is so good.
Yeah, you found Clang was nice to you recently, wasn't it, Christopher?
Oh, it was just today I was playing with something,
not in the embedded space where it was just like,
oh, you misspelled this.
You misspelled a variable name.
It wasn't that you have a type name.
It wasn't that you, I don't know what this type is.
It was a, I think you misspelled this.
Yeah, I think you meant this.
And I actually did that for a variable also,
because I had a variable somewhere upstream,
and I spelled it correctly, and then later spelled it differently.
It's like, I think you meant this.
That's so weird.
It's just really smart.
And some of the static analysis stuff it does, too.
It's just really amazing.
So I would love for it to be.
We tend to use Clang in our work for Embedded
not to target the processor,
but just to run it against the code
and then throw it away and take its static analysis
because it's that good at that.
Yeah, I mean, if you're running it on host too,
Clang also has a number of dynamic analysis utilities
that are very useful, address sanitizer
and undefined behavior sanitizer,
which have saved me a lot of pain in my life.
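As a minimal sketch of the kind of bug those tools catch, assuming a build with Clang's -fsanitize=address (or -fsanitize=undefined for UBSan):

    int main() {
        int buf[4] = {1, 2, 3, 4};
        // Reading one element past the end: AddressSanitizer reports a
        // stack-buffer-overflow at runtime instead of silently returning garbage.
        return buf[4];
    }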
So you mentioned undefined behavior,
and I know that there are some points in the C
spec where it says this is undefined behavior. Why is that so important?
That is one of the most contentious design points of the C family of languages and the source of
both a lot of optimization and a lot of pain for everybody.
Undefined behavior basically is the standard throwing the compiler a bone and saying,
you can assume that this thing never happens. This is sort of the exception where I was saying
earlier that the compiler doesn't get to assume anything about your code. Anything it wants to know, it has to prove it.
This is sort of the escape valve there.
It's sort of the standard provides almost, if you think of it in mathematical terms,
it's providing an axiom that says, whatever this thing is, axiomatically, it did not happen.
And so the way this comes up, people think that compilers go looking for undefined behavior
and then decide to format your hard drive when they find it.
That's not how it works, right?
The compiler is trying to prove some fact.
It says, I want to make this transformation.
In order to make this transformation,
I need to know that this thing is true.
And in order to prove that this thing is true,
it needs to consider, I don't know, some handful of
cases. And then it'll look at one case and it'll say, ah, this case, if it were true,
we would be in undefined behavior, and so I can just assume that this case is not true.
If you ever did a proof-oriented mathematics class, it's sort of saying any situation that would give rise to undefined behavior is just an automatic contradiction, can't happen.
You've been stunned into silence.
A little bit, a little bit, because that seems, I used to hate those proofs because it seemed like you were cheating.
I'm going to assume this isn't true.
And therefore, if I can prove that what you just told me leads to this, then what you just told me isn't true.
Well, there's more to it than that.
But there's some logical connections that, yes.
But it is proof by contradiction.
So like a very standard example here would be related to null checking.
That in a lot of places, the compiler may want to ask,
can I prove that this pointer wasn't null?
And the C and C++ standards provide various ways that the compiler can determine that.
For instance, if you've already dereferenced the pointer,
then it can't have been null.
And in fact, that is actually an example
of undefined behavior right there.
So if you had dereference A and then dereference A again,
if A was null, the first one was undefined behavior.
And so when you're talking about optimizing the second one,
you can assume it was not null,
because if it was null, you were already in undefined behavior.
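A minimal sketch of that reasoning, with invented names:

    int read_twice(int *p) {
        int a = *p;          // if p were null, this dereference is already undefined behavior
        if (p == nullptr) {  // ...so the compiler may assume p is non-null here
            return -1;       // ...and is allowed to delete this whole branch
        }
        return a + *p;
    }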
Let me ask you kind of a weird question.
Talking about a weird question, and then I have a follow-up.
And then I probably have a follow-up to that.
What does the structure of the optimizer
look like from a code perspective because it sounds like a collection of heuristics and it
seems like that could get into just a mess really quickly especially if you're reusing them all the
time yeah um so most compilers are broken down into, you keep using the phrase pipeline,
and that's actually pretty accurate.
They're broken down into a series of phases.
And I already sort of said there were three phases
from a coarse granularity,
and each of those phases is then subdivided
into smaller phases.
LLVM makes this actually very explicit.
So LLVM in the optimizer has this concept of a pass,
which is just an object with a runOnFunction
method. So it runs on each function, it processes each function in your compilation unit, it maybe
does some analysis, does some transformation. And those passes generally implement some sort of logical optimization. So examples there might be
dead store elimination
or de-virtualization
or global value numbering,
which is a horribly indescriptive name,
but it's basically redundancy elimination.
If you're computing the same thing twice,
don't compute it twice,
only compute it once.
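A minimal sketch of the redundancy that global value numbering removes, with invented names:

    int gvn_example(int a, int b) {
        int x = a * b + 1;
        int y = a * b + 1;  // same expression, provably the same value:
                            // GVN reuses x here instead of recomputing it
        return x + y;
    }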
And that's also
a nice abstraction, and a large part of the compiler
fits into that model. There are
places where that model sort of breaks down.
There's a pass in LLVM
called instcombine,
which is what we, compiler engineers,
would call a peephole optimizer. That is the
place where you just sort of pile in all the little
heuristics.
Say, like, hey, if you're dividing by a power of two,
just turn it into a shift kind of stuff.
And that, you know, the last time I looked,
instcombine was many tens of thousands of lines of code.
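A minimal sketch of one of those little patterns:

    unsigned div_by_eight(unsigned x) {
        return x / 8;  // instcombine-style peephole: rewritten as x >> 3
    }
    // For signed integers the shift needs extra fix-up to round toward zero,
    // which is exactly the kind of detail that makes these pattern matchers
    // grow to tens of thousands of lines.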
What language is LLVM written in?
LLVM is written in C++.
Fortran.
C++.
Okay.
You had follow-ons to that. I did have follow-ons.
You were going to go on for like an hour.
I heard that.
So.
So.
So the optimizer is a collection of heuristics and phases.
And that's very structured.
It looks for certain things. Is there a place for
machine learning and sort of
deep learning?
Has anybody looked at doing that? I know that sounds
pop science-y.
Many, many
PhD theses have been written
about trying to get
machine learning into compilers.
And lots of them are about using machine learning
to tune the thresholds for different things, right?
Like the decision for should I unroll this loop or not?
Okay.
It turns out that in general,
compilers is not a particularly great application domain
for machine learning.
Generally speaking, the way I think about it is that machine learning tends to work
well in domains where you have a lot of unstructured data.
Where you need a lot of flexibility because the data might change.
It might change and, like I said, sort of not structured.
Compilers are a very structured domain.
So, like, it's, I'm trying to think of a way to sort of explain this, right?
We don't...
A compiler doesn't have, like, big buffers full of data.
A compiler has a very complex web of objects
that are all interlinked and have, you know, connections to each other.
You have, like, this instruction,
which uses the output of that instruction.
And it's much more related to graph theory than it is to, say, numerical analysis or something like that.
Okay.
And generally speaking, my experience has been that for highly structured problems,
humans are much better at developing solutions than learning algorithms.
And the reverse is sometimes true for more what I would call dense domains or numerical domains.
I mean, you mentioned that the output needed to be deterministic.
Yes, absolutely.
People get very unhappy if they run their compiler twice and get different results.
And machine learning, I usually think it applies more to things that aren't as deterministic, that require more flexibility.
I mean, machine learning can be deterministic
in the sense that once you have a training set,
you're going to hope,
and as long as you don't change any of the inputs,
you're still going to get the same output.
But it's harder to understand why.
And yeah.
No, I'm done.
Okay.
I wanted to go back to GPUs.
Sure.
How are they different from regular CPUs?
Oh boy, that is a very broad question. So the first thing I would say is that a GPU is
not like a CPU. A CPU is not even necessarily the right comparison for a GPU off the start.
A GPU is much more than just a compute core.
It's a batch-oriented bandwidth processor.
And I'm going to pull out my favorite analogy,
which is that mainframes and GPUs have a lot of things in common, right? They're both systems that are designed
to process huge amounts of data
in a batch-oriented fashion
where you don't care so much
about how long it takes to process
any individual record in the case of a mainframe
or pixel in the case of a GPU,
but you care about the aggregate throughput
of the whole system.
And part, you know, there's sort of, I think of there being sort of four parts of that,
which is a high bandwidth computational core, accelerated input, accelerated output,
and conflict resolution. And both of those are systems that have built all four of those into hardware. So I've wanted to use GPUs on some projects.
And it's usually when I need to optimize signal processing, maybe for images, maybe for lots
of very similar but not quite the same audio channels, lots.
Those are potentially good domains for GPUs.
They're also potentially good domains for a DSP.
Well, and that was what we finally went to was DSPs
because they were simpler to explain to everybody,
but also because they were more power efficient
for what we were doing.
So it made sense.
When we think about GPUs,
when we do VR or computer graphics or games
and there are shaders and lots of things moving
and physics engines,
what else can GPUs do?
I mean, where are they interesting
when the G isn't really for graphics?
This is an interesting question, right?
So, I mean, GPUs are originally designed to do one algorithm really, really well,
which is the rasterization algorithm,
which is a total hack that is used for rendering pixels on the screen.
It just happens to work really well, so all computer graphics uses it. And GPGPU, or
compute uses of GPUs, are all about finding ways to take existing problems and sort of poke them until
they happen to look enough like rasterization that they then run well on a GPU.
Yeah.
So like I said, there are four core elements, I think, of what makes a GPU a GPU.
The one that's easiest for people to sort of grasp
is the high bandwidth compute core.
Yes.
And that's the part that's most, you can explain
how that aligns with what's in a CPU, right?
So a GPU's compute core is actually not that different
from a CPU, but with certain elements sort of turned up to 11.
If you're familiar with vector processing
or SIMD processing on a CPU,
GPU compute cores are even more so.
Your typical GPU is a vector processor
with a vector width of maybe 32 or 64
in contrast to most CPUs, which might have 4 or 8.
They're also highly hyper-threaded,
or simultaneously multi-threaded.
They have multiple hardware threads running on the same core.
So you can have one thread doing a memory access
while another thread is using the ALU.
And then finally, they're also multi-core.
Extremely multi-core, right?
Extremely.
Yes, though the technical marketing for GPU vendors tends to obscure that.
Or they make the numbers seem bigger than they really are.
When you look at the way marketing for GPU vendors
works, they often refer to, when they'll say, like, we have so many thousand cores, often what they're
saying is they're referring to each lane of the vector as being a separate core. So it's 32 times
the number.
Yeah, yeah. And then also they are probably also multiplying by the number of hardware threads per core.
Marketing lies.
It's all, yeah, marketing is, GPU marketing is interesting.
I mean, it's actually, you know, it's not that complicated a concept once you sort of break it down into what it's made up of.
And a lot of these sort of things that people think of as being weird on GPUs make sense once you sort of understand what it's built up out of.
For instance, people talk about control flow being expensive on a GPU.
That's because you're doing everything on this vector processor.
And if you want to have an if statement, the way you do it is you have predication per lane of the vector,
and you do both sides of the if statement with the predicate mask inverted.
And so what seemed like an if to choose a shorter path
now becomes a way to go both ways,
a much longer path.
Exactly.
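A minimal sketch of that per-lane predication, written as scalar C++ over an assumed vector width of four:

    void abs_lanes(const float x[4], float out[4]) {
        for (int lane = 0; lane < 4; ++lane) {            // conceptually one vector instruction stream
            bool take_then = x[lane] < 0.0f;              // per-lane predicate mask
            float then_val = -x[lane];                    // "then" side, computed for every lane
            float else_val = x[lane];                     // "else" side, also computed for every lane
            out[lane] = take_then ? then_val : else_val;  // predicated select, no branch taken
        }
    }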
I've done a little bit of GPU stuff
for a medical device company,
and it was an OpenCL.
And the hardest thing initially
was just getting my brain to switch away from that
kind of thinking to, okay, no, I don't have this list of sequential steps. I have this extremely wide
list of maybe very short things I want to do all at once to, you know, one grid square of the
screen, and divide it up that way. The parallelism is such a different way of thinking.
And some other people I was working with,
they would read that and go,
I don't understand how this works.
It's like, well, imagine this splatted for us,
you know, hundreds of whatever,
maybe not hundreds, but marketing hundreds.
And they all run at the same time.
And then we get it over a high bandwidth memory interface.
But it's still really tricky,
and there were times, like you said,
with the conditionals that I was thinking,
oh, I need to deal with it.
No, no, I don't need to do that at all.
But it was hard to turn your head around.
Yeah, I mean, the other thing that people sometimes run into
that's a little strange,
I mentioned the term bandwidth-oriented processor, right?
So their GPUs are all about running,
you hand them a batch of work
and they get that batch done quickly,
but they don't really try to make any one thread
run particularly quickly.
What that means is that they are trying
to keep every unit on the chip occupied all the time.
They're doing all kinds of crazy swapping threads around,
letting one thread run on the ALU
while another one is running on the memory unit or whatever, right?
And so people will sometimes write programs in a style
where maybe you have a kernel that fetches something from memory
and then uses that to do another fetch,
and they wonder why their program's really slow.
It's because your thread stopped and waited for that memory,
and you couldn't run again,
and was sitting there tying up space on the processor
that it couldn't then use for other threads.
It gets kind of complicated.
It sounds like the back-end problem for a compiler for a GPU
is much different than for a CPU because you have to have
it seems like you have to have more architectural knowledge of how the
structure of the thing works rather than just the instruction set.
Does that make sense? Or am I misrepresenting it?
There's some of that.
I mean, generally, when you're compiling for a GPU,
you're sort of transforming the...
You know, programmers, when they program a GPU,
they program in what is sometimes called a SPMD style,
which is a really terrible name,
but it stands for single-program multiple data.
And the idea is you write your program
as if you were programming one lane of the vector
machine, and then the compiler
sort of implicitly undoes
that for you and turns it into a normal
vector program.
And so
there's sort of that extra transformation
that the compiler is doing for you behind the scenes.
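A minimal sketch of that transformation, with an invented lane_id() standing in for whatever the real programming model provides:

    int lane_id();  // hypothetical intrinsic: which lane of the vector is this?

    // What the programmer writes, SPMD style: the body for a single lane.
    void scale_one_lane(float *out, const float *in, float k) {
        int i = lane_id();
        out[i] = in[i] * k;
    }

    // Roughly what the compiler produces: the same body applied across the
    // whole hardware vector (width assumed to be 32 here).
    void scale_all_lanes(float *out, const float *in, float k) {
        for (int lane = 0; lane < 32; ++lane)  // conceptually one vector operation
            out[lane] = in[lane] * k;
    }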
Also, in contrast
to your typical CPUs,
GPUs generally don't have publicly documented assembly formats, and they generally change much more frequently than CPUs would.
So your compiler can be a lot more tied to the specifics of the piece of hardware you're targeting.
That makes it challenging.
Yeah, it makes it fun and exciting, right?
As a compiler engineer.
Good.
You know, it's in contrast to a CPU, right,
where it may take years and years and years
to add three instructions.
On GPU, they can add new instructions
and remove old ones in any new generation they want
because there's no binaries that have to keep working.
Okay, that makes sense.
And it makes it fun and interesting if you're interested in architecture, then, because you can do all kinds of wild and crazy things.
Okay. So I have a question, switching subjects, that I'm asking you because you sort of asked me to ask you.
What's the worst bug you've seen that has been caused by a difference between C and C++?
And I am really looking forward to this story.
This one is great.
And I can't take all the credit for it because I was not the one who debugged this, but I did sit down the hall from the guy who debugged this.
And it involves everyone's favorite feature of C as well, which is volatile.
Oh, yes, yes, yes. Let's talk about this.
Volatile actually has a very specific meaning in both C and C++.
Everyone thinks that it means basically don't touch my code,
which is true to a first order approximation.
Sometimes true.
Most of the time, if that is your mental model of it, you will be correct.
And this is a story about one of those times when it was not correct.
So we got a bug report about a device driver that was not functioning properly.
And I don't really know the specifics of what this thing was doing,
other than it was something very, very deeply embedded
that had been programmed in C++ for no good reason that I understand.
It had been waking up from sleep.
And for some reason that I do not understand,
it woke up before its associated, like, the controller on its DRAM had come up.
And so it was in a little busy loop trying to manually strobe its RAM.
We've all been there.
I have not.
And the way it did this was it had a pointer,
which it dereferenced, and then cast the result to void,
which is sort of the standard way you avoid getting a warning about an unused variable.
And they'd been smart, right?
The pointer was volatile.
And yet, when it generated the code, there was no load in the generated code.
And it had taken them months and months and months to track this down because it was, you know, sporadic memory corruption in the DRAM, right? And it boiled down to a difference
between C and C++
in terms of the rules
of what volatile actually means.
Volatile, what volatile means
in C and C++ is that, you know,
there's some sort of sequence of events
that the standard says must happen.
And volatile says basically
the compiler can't elide or remove or add
any events to that list.
And in C++, the definition of when those events occur is different.
And in particular, for dereferencing a volatile value, dereferencing it and casting it to
void doesn't trigger what's called an lvalue-to-rvalue
conversion, and therefore doesn't
actually cause an event,
a volatile event, to happen.
You needed to cast it to volatile void.
No, you had to
actually assign it to something.
Oh.
And it was,
like I said, it was one of these
deep, dark corners of C++.
That seems like something that they would harmonize.
I believe they did in C++ 11 or 14.
Jeez.
But it was just horrible, right?
And because the rules of volatile say that the compiler can't add or remove volatile accesses,
the standard actually required us not to do the dereference there because we can't add them where they're not supposed to be.
And we'll say we made a special one-off compiler to fix that bug.
So this wasn't actually a compiler bug?
This was not a compiler bug.
We were implementing the specification exactly as written.
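A minimal sketch of the pattern in that story; the register address and names are invented:

    #include <cstdint>

    void strobe_ram() {
        volatile std::uint32_t *status =
            reinterpret_cast<volatile std::uint32_t *>(0x40001000u);  // hypothetical device register
        (void)*status;                  // under the C++ rules of the time, this discarded read did not
                                        // require an lvalue-to-rvalue conversion, so the compiler
                                        // could legally emit no load at all
        std::uint32_t value = *status;  // assigning the result forces the conversion,
        (void)value;                    // so the volatile load must actually happen
    }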
So how often do people, I know you live in Silicon Valley,
how often at dinner parties do people say,
well, I found a compiler bug, and you just put your head in your hands and cry?
Well, maybe not at dinner parties, but it's,
those would be very interesting dinner parties.
The life of compiler engineers is often spent debugging other people's code, which does not
work because they upgraded their compiler, and then finding all the places where they have
undefined behavior, and telling them, and then telling them again when they don't believe you,
and then telling them again when they don't believe you,
and then writing the patch for them that fixes their code.
I remember interviewing a company and they told me there was a compiler bug
and that was why they couldn't do X, Y, or Z.
And I sat there and tried not to say, I doubt it.
I doubt that it's a compiler bug.
What about when it says internal error?
Look, those are legit.
I mean, there are real compiler bugs, obviously.
Like, I've written my fair share of bugs.
And, you know, when you encounter a bug in,
actually, it's not the compiler bugs that scare me, right?
It's the assembler bugs.
You hit an assembler bug.
Oh, God.
And you're like, how does anything ever work?
But realistically,
if you look at the bugs we would get coming in from a,
if we deployed a new version of the compiler
and all the teams across the company updated,
the fraction that were real,
honest to God,
compiler bugs was
less than 10% probably.
It's hard because you don't want to believe it's your code.
Yeah.
And yet as a compiler engineer, you have to believe it's your code.
You're kind of the last defense.
Yeah, I mean, it's a tricky place to be.
And you get pretty good at identifying sort of common failure modes.
This is also a great motivator for compilers to have really good warnings and diagnostics so that you can tell people, yes, the compiler is doing weird things to your code because you
have undefined behavior and look, the compiler tried to tell you.
Right. Yes, that's a good reason to document what you do.
And we were discussing undefined behavior earlier.
Like I was saying, it's sort of the way undefined behavior creeps in
in sort of these insidious ways where, you know, it's not,
things that happened to work previously
could sometimes stop working in a new version of the compiler
because, you know, the theorem prover in the compiler got smarter
and suddenly it's able to make use of these axioms
to, you know, prove some transformation safe.
And now it's deleted all your code.
Which is better.
It's faster and more power efficient now.
I was about to say, we optimized it to return zero.
Yep.
Well, we've kept you for a while.
Now? I had one, but now I've forgotten it. I should have written it down. That's what I should do with these things.
Or type it at me in the little screen we share.
Well, that was too late.
Shockingly useful.
Yes.
All right.
Well, then I will ask you about our new toy.
Oh? We got an NVIDIA board sent to us from a friend who works at NVIDIA.
And it's the Jetson TX2.
And I gather it has more processing power in its little board than my entire garage full of dev kits.
It's got, gosh, I don't even know.
It seemed like it could run a whole college's worth of computers,
at least if you went to school in the 90s and that's what you think of as computers.
Well, to be fair, I think your phone could do that now too.
That's true.
What should I do with it?
I'm a little afraid of it because it clearly is smarter than I am.
Well, you can build an AI that takes over the world.
And now let's maybe not do that one.
You know, I'm not super familiar with the Jetson boards from NVIDIA,
but I'm sure it would make a great, you know, sort of small console
if you wanted to try to build, like, I don't know,
the modern equivalent of an Atari 2600 or whatever.
Those kinds of boards are often targeted
for computer vision applications.
Yeah.
So, you know, you can start building
your own self-driving car.
And they tend to be really good
for any kind of numerical application,
which generally means doing matrix multiplies.
Having worked with people
who are into numerical systems,
I learned that everything in the world
can be done with a matrix multiply.
So just find any problem
that can be done with a matrix multiply
and it'll be great at it.
Yeah, but it does it so much faster.
I have a lot of
matrix multiplies in my current code
and I'm still...
You can do bigger matrices.
Yes, exactly.
The biggest.
Yeah, I do. I think
computer vision is one of the
big things and maybe a little
bit of robot control.
Yeah, we saw that a lot when I was
working on GPUs.
Our primary client was, of course, traditional graphics rasterization, putting pixels on the screen.
But even when you looked at compute applications, the biggest clients of compute were still applications that were sort of graphics-y,
that just didn't happen to fit within the exact confines of the rasterization algorithm.
So computer vision type algorithms were a big application domain there.
That makes sense.
And every time I have tried to apply these things to non-computer vision, non-graphics,
it never quite works out because the DSP just makes a lot of sense
and it's simpler and easier, and you don't have Chris's "oh my god, why am I
trying to run everything through one thing when I don't have this many vectors to go in?"
Yeah. I mean, one of the things I learned from the architects I worked with was that the area
of a processor, like the square millimeters, is a pretty good proxy for the power that it's going to draw.
And if you think about a GPU, like if you're only using the vector processor, just because,
I don't know, you wanted to do some random computation on it and you wanted wide vectors,
you're not really using a large fraction of the transistors that are on there, right? Your GPU,
like I said, it has this high bandwidth compute core,
but it also has these IO accelerators
for reading and writing images very fast.
It does all the format conversions.
It can do gamma correction and all this stuff.
And it also has this conflict detection engine in it.
There's a lot of stuff in it besides just this vector core.
And so the larger the fraction of the stuff
in the GPU that you're using,
sort of the better bang for buck you're getting
in terms of value per watt, I guess.
That makes sense.
All right.
You know, this is why, like, for processing
deep learning applications on GPUs...
There's a huge amount of...
I mean, GPUs are already a decent thing for it
just because they have a lot of floating point flops.
But they're even better for it
when you can take advantage of the texture units.
You can pretend that your neural network weights
are the pixels of an image
and you use the texture input and output units to do the format conversions for you.
Suddenly you're getting value out of a much larger fraction of the chip and your performance per watt goes way up.
I like this metric.
Now I'm going to be thinking about what to do with this board, trying to utilize as much of it as I can instead of my previous.
Well, maybe I'll dip my toe in.
Now I have to use the whole thing.
Well, then you got to write some real graphics code.
We'll see.
How many different programming languages
are you comfortable programming in?
I mean, you've mentioned many.
I have dabbled in many.
The only ones that I've written large things in
are C++ and Python.
I have passing familiarity with many others.
And there's sort of a Dunning-Kruger kind of effect
where I've worked on C++ compilers.
I have fixed bugs in C++ compilers.
I have ported C++ compilers to new architectures.
And I absolutely do not consider myself an expert in C++.
But you're an expert in Haskell because you've never actually used it.
Maybe. I can pontificate about Haskell, certainly.
What new language would you want to use?
If you could choose all the languages that are out there.
Oh, geez. I mean, I'm very interested in Rust. I've never written anything beyond, like, a little baby program in it.
And certainly if I were, like, starting some significant new project from scratch, I would
at least consider that as an option. I don't know. Like I said, I have sort of diverse tastes
in languages. And, like, I like to go,
you know,
going along with your lightning round question.
Like I try them.
I write a little tiny program and then I'm like,
this is cool,
I should use it for more stuff, and then never end up revisiting it.
Yes.
I understand.
I understand quite a lot.
Well,
Owen,
thank you so much for being with us.
Do you have any thoughts you'd like to leave us with?
Yeah, I would encourage anyone who has any interest in compilers or processors or anything else,
just take a look at what your compiler is generating.
And in particular, if you're a student,
look into compilers,
and compilers in open source in particular.
I think compilers are not something
that gets talked about a lot in school.
It's actually really interesting.
There's a lot, particularly if you have
a little bit more of a mathematical bent.
It's really interesting.
There's a lot of fun challenges.
There's a lot of different sub-niches
you can find your area of comfort.
And it's something that, because a lot of it's done in open source these days,
it's very easy to get into as a student.
Google runs the Google Summer of Code program,
which is actually exactly how I got started.
I did a Google Summer of Code project in 2006,
which feels like forever ago.
So, you know, I encourage all students to look into things like that.
And you answered the question that I forgot,
which was how would you get into this
if you were interested in playing around with compilers?
I got really bored one summer is the answer.
And you have a job now at Oculus, is that right?
That's correct.
I work at Oculus now, not particularly on compilers.
And we are hiring for all kinds of different things from hardware engineers, electrical, mechanical, optical.
Don't ask me anything about optics, please. Firmware engineers, embedded systems engineers, CPU system engineers, all, you know,
the whole stack all the way up. Lots of fun and interesting challenges, and building virtual
reality, which is a fun and exciting, challenging area with a lot of interactions between, you
know, embedded systems and graphics and very demanding performance.
And so people will go to the Oculus website
and we'll link in a careers page?
Yeah, it's oculus.com/careers.
Or they can reach out to me directly
and I'll help people get connected with the right folks.
I guess the way to reach out to you
is either through Twitter
or more likely send email to show at embedded.fm
and we will connect you.
Awesome.
Our guest has been Owen Anderson,
an engineering manager at Oculus,
an LLVM old timer and self-proclaimed GPU crazy person.
You have to be pretty crazy.
Links to everything will be in the show notes.
You can find the show notes on embedded.fm along with the current March Madness rankings, the blog contact link, and assorted other goodies.
Owen, it's been great talking to you.
Thank you very much. It's been great.
Thank you also to my Christopher for producing and co-hosting.
And of course, thank you for listening.
I have a thought to leave you with from Donald Knuth.
The best programs are written so that computing machines can perform them quickly and so that
humans can understand them clearly.
A programmer is ideally an essayist who works with traditional aesthetic and literary forms, as well as mathematical concepts, to communicate the way that an algorithm works and to convince a reader that the results will be correct.
Embedded is an independently produced radio show that focuses on the many aspects of engineering.
It is a production of Logical Elegance, an embedded software consulting company in California.
If there are advertisements in the show,
we did not put them there
and do not receive money from them.
At this time, our sponsors are Logical Elegance
and listeners like you.