The Changelog: Software Development, Open Source - Moore's Law and High Performance Computing (Interview)
Episode Date: February 16, 2018
Todd Gamblin, a computer scientist at Lawrence Livermore National Laboratory, joined us to talk about Moore's Law, his work at Lawrence Livermore National Laboratory, the components of a microchip, and High Performance Computing.
Transcript
Bandwidth for ChangeLog is provided by Fastly.
Learn more at Fastly.com.
Error monitoring is provided by Rollbar.
Check them out at Rollbar.com.
And we're hosted on Linode servers.
Head to Linode.com slash ChangeLog.
This episode is brought to you by Rollbar.
Rollbar is real-time error monitoring, alerting, and analytics that helps you resolve production errors in minutes.
And I talked with Paul Biggar, the founder of CircleCI, a trusted customer of Rollbar,
and he says they don't deploy a service without installing Rollbar first.
It's that crucial to them.
We operate at serious scale, and literally the first thing we do
when we create a new service is we install Rollbar in it.
We need to have that visibility.
And without that visibility,
it would be impossible to run at the scale we do.
And certainly with the number of people that we have.
We're a relatively small team operating a major service.
And without the visibility that Rollbar gives us
into our exceptions, it just wouldn't be possible.
All right, to start deploying with confidence,
just like Paul and the team in CircleCI,
head to rollbar.com slash changelog. Once again, rollbar.com slash changelog.
All right, welcome back, everybody. This is The Change Log. I'm your host, Adam Stachowiak.
Today on the show, we're talking with Todd Gamblin, a computer scientist at Lawrence Livermore National Lab. And we got Moore's Law wrong
on a recent episode of the ChangeLog, episode 267, as a matter of fact. And Todd hopped in
Slack and said, hey, guys, you got it wrong. We should talk about it. And that's what the
show's about. We talked about Moore's Law, his work at Lawrence Livermore National Lab,
the components of a microchip, supercomputers, and high-performance computing.
So Todd, we got Moore's Law a little bit wrong
in our episode with Eric Normand.
That was episode 267 about functional programming.
And we were talking about Moore's Law,
and I might have even mentioned on the show
how lots of people get Moore's Law wrong,
and then I got it wrong.
So, you know,
embarrassed face, it's happening.
But you were gracious enough to hop into our Slack,
which is a place that we hang out
and talk about our shows and programming topics and random things, blockchain mostly.
A lot of blockchain.
A lot of blockchain in there.
And hop in our Slack community
and straighten us out a bit about it and the particulars.
And so we thought, well, if we need schooling,
perhaps more people than just us need a little bit of schooling.
So first of all, thanks for coming on the changelog.
And secondly, straighten us out on Moore's Law and what it actually is.
So I don't necessarily think that it was completely wrong on the show.
The gist of what you guys said was fine.
That there's no more free lunch. You don't get free performance out of your chips anymore like you used to when the clock speed was going up rapidly.
Right.
But Moore's Law is not dead, although it's fair to be confused, because there have been a lot of articles written about this. There was an article in the MIT Technology Review that said Moore's Law is dead.
Now what?
But it predicted the death of Moore's Law, I think, out in the 2020s.
And Intel's CEO says Moore's Law is fine.
The chips are going to continue to improve.
So I think it's kind of hard to see what's really happening in the processor landscape.
So what Moore's Law actually says is that the number of transistors that you can cram on a chip doubles every 18 to 24 months.
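As a rough sketch of what that observation implies as a formula, with N_0 the starting transistor count and T the doubling period:

$$N(t) = N_0 \cdot 2^{\,t/T}, \qquad T \approx 1.5 \text{ to } 2 \text{ years}$$

So over a decade at the historical pace, that's roughly $2^{10/2} \approx 32$ times as many transistors on a chip.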
And so that's the part that is still relatively true, although it's slowing down. And I think the interesting thing, and the thing that people typically get confused with this, is that there's something else called Dennard scaling, which broke down around 2006. And I think that's what has led to us having all these multi-core chips now, where you used to get a lot of performance out of your single-core chips before.
And so what Dennard scaling says is that as your transistors get smaller, the voltage and current scale down proportionally with them. So effectively, your power density is the same for a smaller transistor as it is for a larger one.
So what that means is that you can basically jack up the frequency
or the voltage on the chip as you scale the number of transistors.
And so you get clock speed for free over time.
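A back-of-the-envelope sketch of the classic Dennard scaling argument being described here: dynamic power per transistor is roughly

$$P \approx C V^2 f$$

and if you shrink feature size and voltage by a factor of $1/k$, capacitance and voltage each drop by about $1/k$ while frequency can rise by $k$, so power per transistor falls by roughly $1/k^2$. The transistor's area shrinks by about $1/k^2$ as well, so power density stays roughly constant, which is where the "free" clock speed came from.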
Just by increasing the power.
Yeah, just by increasing the frequency
as you scale it down.
So the chips have effectively the same power
for the area that you're putting
all those transistors in, right?
You want to keep the power envelope
relatively constant
because you're putting it in a device like,
I don't know, well, these days, like a phone
or a desktop computer.
And you don't want someone to have
a really high power desktop machine that ramps up their power bill, right?
Right.
So you've got a fixed power envelope, you're increasing the number of transistors, and it used to be that you could also increase the clock speed. But because of the breakdown of Dennard scaling, you see that in like 2006 or around there, the chips are kind of capped out at like two and a half gigahertz now.
Right.
Like they're all sort of hovering around there.
They get up to three sometimes.
And then you can find four gigahertz monsters in some of the bigger IBM System z machines.
But effectively, it's kind of capped out there. I don't know if you remember, I had 100 megahertz computers back in the day
or, you know, even I think my Apple IIgs
was like maybe kilohertz.
I'm not even remembering, but.
I don't think I go back quite that far.
You might go back a little further
than Adam and I in that regard.
I've always been in the megahertz.
I know that I was in the megahertz.
I think it was like maybe my first was like 750 megahertz,
somewhere around there.
Oh, wow.
So you guys are picking up like late 90s.
Yeah, that would have been like late 90s.
Yep, exactly.
I do recall having a 4 gig drive was my first computer,
and then the second one I had a 20 gig drive.
So, I mean, I don't relate it so much back to the chip, but mostly to how much space I have to put stuff on.
Sure.
Which sort of relates to the chip era, because it goes on a similar scale.
Yeah, they go hand in hand.
Yeah. So you've got a little seniority on us there, Todd, but nonetheless, we've definitely seen the topping out. I mean, I'm on a, what does it say, 2016 MacBook Pro, and I've got 3.1 gigahertz. So that's like, yeah, two and a half, three. Like you said, in the more server products you might have four gigahertz, but that's where it's definitely stopped.
Yeah. And the reason that's broken down is that Dennard scaling ignores current leakage. And so, as people packed all these transistors on the chip, you get something called thermal runaway, where you can't pack them that close without having a whole lot of power on the die.
So you basically are capped at how much clock speed you can have. But what they do is,
you know, you can still get these multi-core chips now, right? Like the number of cores on
your chip has definitely been increasing. And so that's what they're using the transistors for,
where they used to, you know, pack more transistors into things like out-of-order execution
and other stuff on the die.
Now you're just building out and replicating
chips of effectively the same size on the same,
well, cores of effectively the same size on the chip.
And so that's what your multi-core CPUs are doing.
They're becoming their own little massively parallel machines.
So even back in that show,
even then we were talking about how the proliferation of cores, at least at the consumer level, hasn't gone crazy. In terms of, you know, you're still talking about two core, four core, eight core.
Sure.
Probably from your purview inside the supercomputer labs, you can tell us about what machinery looks like inside there. But is that something that has also hit a threshold, or has it just slowed to where it doesn't make sense, maybe like you said, inside the same thermal envelope, to have 32 cores on a laptop, for instance?
So I'm actually not 100% clear on why they haven't, you know, jacked up the number of cores on a laptop. I mean, I would assume that it's because people don't need that many cores that much, and also because most of the parallelism that you're going to want to do on a desktop machine is going to be on the GPU. And on the phones, they have a lot of specialized hardware for things like video processing, all these AI units and things like that. Which is interesting, and has a lot to do with the eventual death of Moore's Law too.
But yeah, on the supercomputers, I mean, we buy effectively the same chips, at least for some of the machines, as your desktop machines. So our commodity clusters, as we call them, they're Linux boxes with Intel and maybe AMD chips.
And we're seeing a large increase in on-node parallelism.
So like where we might not have more cores per chip
than what your desktop would have,
we have a lot of sockets in the machines.
And so the amount of on-node parallelism that's there
is pretty high. And we try to use all of that.
So where do you see this all going in terms of Moore's Law? You said it was predicted that it would be dying in the 2020s, and we're getting near that range. But prognosticate out for us: what does it look like in the next five years?
So I mean, in the next five years, I think you'll see the rate of transistors that get packed on the chip start to slow down. I think currently the number of transistors on a chip is doubling at maybe 2x every three years instead of every one and a half. So it is slowing. And then, once that goes away, you're going to have to figure out how to increase your speed other ways. And so what that means in the hardware universe is I think you'll start to see a lot more specialized hardware. You're kind of already seeing that, right? Like we were talking about on the mobiles, you've got the iPhone X with this bionic processor or whatever it is.
You've got custom GPUs and things.
I mean, a lot of the processing that happens on your phone is just offloaded.
And most of that is for power consumption
because you can do it a lot more efficiently in hardware.
I think you're going to see HPC sites start to shift towards different architectures that can make use of the same
number of transistors more effectively for their particular workload.
So you'll see a lot more specialization.
But basically, if the number of transistors that you can fit on a chip becomes constant, then the only way that you can get more speed is to make more effective use of them. You can't continue getting performance either in terms of more parallelism or in terms of higher clock speed.
Right, because physics. Maybe for those out there that are catching up, or just trying to follow along to some degree if they're not schooled in transistors and chips, can you break down what a chip is, what the components of it are, the thresholds we've gone over the years, and where we're at today? Is that possible?
Sure, yeah, we can talk about that some. At the lowest level, people talk about transistors, and what a transistor is, is a thing that if you apply current to it, it changes the conductivity of the material.
And so what that means is that if you think of it as a wire with another wire coming into it, if you put some current on that thing, then the first wire either conducts or it doesn't.
And all that means is that now you have the ability, you have
a switch. So you can build out more complex logic from that switch. That's the fundamental thing
that enables us to build computers. And they can build that now by etching it on silicon. They oxidize silicon; that's what all these fabs and big chip plants are doing. They etch lots of transistors onto silicon with chemical reactions.
And there's different processes for doing that.
And so those processes are what enable us to cram more transistors on the chip over time.
It's improvements to them.
And I mean, I'm not a process scientist, so I don't know a whole lot about that.
But effectively, Moore's Law originated when Gordon Moore, back in 1965, observed that the process had resulted in the effective doubling of transistors every, I think, 18 months or two years.
He was looking at like the range from 1958 to 1965.
And so that's where that comes from.
So there's a general comment that turned into law.
It's an observation.
I wouldn't say that it's a law.
You can't go to jail for violating Moore's Law.
We call it Moore's Law.
Yeah, that's right.
So yeah, we call a lot of things laws.
You'll go to jail.
That's right.
We call them law until they get broken, and then we're like, well...
Even the way that it's usually stated, right?
18 to 24 months is fairly vague, right?
Right.
It's an observation of the cadence with which they can double the number of transistors on a chip.
Right.
And it's held pretty true.
Moore thought it would hold for, I think, 10 years.
And it's actually held since 1965 pretty well.
So it's somewhat remarkable in that sense. You know, it's more than just an observation when it holds for many, many more years than you thought it would.
And it's somewhat been co-opted and transformed into meaning general compute power doubling, and that's the way that we were using it. In fact, when I was looking it up a little bit here, an Intel executive, I think in the 90s, said that chip performance would double every 18 months, as opposed to transistor density.
And that's the general context in which most programmers and technologists talk about Moore's Law. It's generally computing power, and not the specific thing that Moore was talking about.
Yeah. Well, because the computing power is what enables you to do more and more with your computer, right? You can do many more computations. The thing gets faster. It's good. I mean, I think the main consequence of the breakdown there is that you don't get as much single-thread performance as you used to. That's kind of capped out.
So if you're using Microsoft Word or something,
like you're typing or something that has to execute sequentially is not going to go any faster.
But if you can get parallelism out of your workload,
then you can actually harness all the power that's on your chip.
And the difference is that if you increase the clock frequency,
that just means that everything on your chip is happening faster.
So that's effectively free.
If you had a program that ran on an older chip,
it would run just the same on the newer chip, just faster.
Whereas with parallelism, if you want to harness that performance,
you actually have to rework your program.
Divvy it up.
And so, yeah, you have to divide it up into smaller chunks, or figure out a way to do it that might involve changing your whole algorithm. And that's a lot harder, and not all workloads can do that. And, you know, not all consumer workloads can do that. So it's interesting to see how this will pan out on consumer chips. Although I think with all this machine learning stuff going on now,
it's not like there's a shortage of numerically intensive things
to do on your desktop machine or on your phone.
Or games have always taken advantage of things.
We could talk about GPUs some.
GPUs are an interesting design point,
right? Like I think in the functional programming podcast you were mentioning earlier, you guys
mentioned that, you know, oh, people told me I was going to have thousands of cores. Where are
my thousands of cores? And I think, you know, the answer is they're in your GPU because that's a
very different workload from what your desktop CPU is doing. It's data parallel. And so it's easy to divide up the work
that you have to do for graphics rendering.
And so the GPUs are basically these large parallel,
they call them vector processors
because they do lots of the same type of instruction at once.
And I think in a GV100 Volta,
there's I think like 5,000 cores on that thing if you count CUDA cores.
People debate whether or not you should count CUDA core as a real core because it's definitely not the same thing as the CPU in your system.
But, I mean, it is – it's 5,000-way parallel.
That's true.
You can do that many operations at once.
Yeah.
But very specific use case, not general purpose.
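To make "data parallel" concrete, here's a minimal sketch in plain C++ (the function and names are just illustrative; a real GPU version would be written as a CUDA or similar kernel, but the point is that no iteration depends on any other):

```cpp
#include <cstddef>
#include <vector>

// A classic data-parallel loop: y[i] = a * x[i] + y[i].
// Every iteration is independent, so the work can be spread across
// thousands of GPU cores (or SIMD lanes) with no coordination between them.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Graphics rendering has the same shape: each pixel or vertex can be handled independently, which is why it maps so well onto those thousands of CUDA cores.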
Well, you brought in a new terminology, though, too, in this conversation.
You got chip.
You got – so I'm still trying to paint the picture of what this thing is.
But you brought in a new term called cores.
You mentioned it earlier, too, but you got the chip made of silicon, and you got transistors on that.
Where do cores come into play?
What are cores?
Okay.
So what people used to just call a chip,
because you only had one core, is a core.
A core is basically a microprocessor.
Although even that term is kind of fuzzy these days,
because you can say the microprocessor has multiple cores.
I mean, people, you're right, there's a lot of ambiguity.
Okay, so let's go back to the transistors and build it up from there. I mean, so...
And then we can go to the GPUs, because that's where I want to go.
Yeah.
Yeah.
Okay.
I think that's the interesting direction.
So we talked about, you've got transistors on the chip. You can use those to do switching that enables you to build what they call logic gates. And you can do things like AND, OR, NOT. Basically, you're taking two signals and you're producing a result. So one AND one is one, one AND zero is zero, and so on, right? That's just basic logic. You can take that, and it turns out you can build anything with that. If you have a NAND gate, basically a not-AND gate, then you can build whatever you want.
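As a small sketch of that "NAND is enough" point, here's the usual construction written out in C++ (purely illustrative; real gates are transistors, not function calls):

```cpp
// Building NOT, AND, and OR purely out of NAND, as an illustration of why
// a NAND gate is enough to build any logic you want.
#include <iostream>

bool nand_gate(bool a, bool b) { return !(a && b); }

bool not_gate(bool a)          { return nand_gate(a, a); }          // NAND(a, a) = NOT a
bool and_gate(bool a, bool b)  { return not_gate(nand_gate(a, b)); } // NOT(NAND) = AND
bool or_gate(bool a, bool b)   { return nand_gate(not_gate(a), not_gate(b)); } // De Morgan

int main() {
    // Truth table for OR built only from NANDs: 0|0=0, 0|1=1, 1|0=1, 1|1=1.
    for (bool a : {false, true})
        for (bool b : {false, true})
            std::cout << a << " OR " << b << " = " << or_gate(a, b) << "\n";
}
```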
And so there's lots of ways to do that, but effectively they build this whole chip out of that. They're putting that logic on the die, and that implements what people recognize as a modern CPU. And so if we're coming at this from a high-level language,
I think most of the listeners here are familiar with JavaScript or Ruby, Python, or even C.
Those either get interpreted or compiled into machine instructions.
And effectively, you're taking all these logic gates
and you're building something that fetches instructions from memory.
It fetches data from memory,
the instructions tell the processor to do something with that data,
and then they write it back to memory,
and that's pretty much how your chip works.
So if you have that pipeline where you can pull instructions from memory,
you can do stuff to numbers and write it back to memory,
then that's effectively what a modern core, I guess, looks like.
That's a processor.
You can run programs on that.
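A toy sketch of that fetch, decode, execute, write-back loop; the instruction set, registers, and program here are entirely made up for illustration, but the cycle is the same one a real core runs in hardware:

```cpp
#include <array>
#include <cstdint>
#include <iostream>
#include <vector>

enum Op : std::uint8_t { LOAD, ADD, STORE, HALT };
struct Instr { Op op; int a; int b; };   // a, b: register/memory operands

int main() {
    std::array<int, 4> reg{};            // a few registers
    std::vector<int> mem = {7, 35, 0};   // data memory
    std::vector<Instr> program = {
        {LOAD, 0, 0},   // r0 <- mem[0]
        {LOAD, 1, 1},   // r1 <- mem[1]
        {ADD,  0, 1},   // r0 <- r0 + r1
        {STORE, 2, 0},  // mem[2] <- r0
        {HALT, 0, 0},
    };

    for (std::size_t pc = 0; pc < program.size(); ++pc) {  // fetch
        const Instr& in = program[pc];                     // decode
        switch (in.op) {                                   // execute
            case LOAD:  reg[in.a] = mem[in.b];   break;
            case ADD:   reg[in.a] += reg[in.b];  break;
            case STORE: mem[in.a] = reg[in.b];   break;    // write back
            case HALT:  pc = program.size();     break;
        }
    }
    std::cout << "mem[2] = " << mem[2] << "\n";  // prints 42
}
```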
And so it used to be that you had one core on the chip, and that was what you did.
You had one thread of execution.
You would fetch an instruction.
You'd do what it said.
You'd write the result back to memory, and you would go on and fetch the next one and do what it said. And there's just a whole lot of optimizations that
have happened over the course of processor history that led to, you know, what we have today.
Spectre and Meltdown have been in the news recently. So the chips do things like
speculative execution. They can say, hey, I'm not going to know whether I want to execute this stream of instructions
for a little while.
But while I'm waiting to figure it out, can I go and try that? And then, as we found out, you can get bugs from that, but it's also a huge performance increase.
There's things like just regular out-of-order execution, where effectively your chip has logic on it that looks at the instructions coming in. It figures out the dependencies between them, and it figures out which ones don't have any dependencies right now in terms of data that they need to read, or results of other calculations in the instruction stream. It'll pull those and execute those instructions concurrently with other ones that don't have any dependencies. And so that's called an out-of-order processor, or sometimes people call it a superscalar processor, because it can execute more than one instruction at once.
There's also vectorization. Most chips have some types of instructions that will do multiple things at once. So if you know that you have four numbers lined up in memory and you want to multiply them all by two, you can pull them all in at once and do those operations all at once, if they're the same.
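Here's a small sketch of that "four numbers at once" idea: the plain loop is the kind of thing a compiler will usually auto-vectorize at -O2/-O3, and the second version spells the same operation out with x86 SSE intrinsics (shown only as an illustration; other chips have their own vector instructions):

```cpp
#include <immintrin.h>
#include <cstdio>

void double_plain(float* x, int n) {
    for (int i = 0; i < n; ++i)  // a compiler will typically turn this into SIMD code
        x[i] *= 2.0f;
}

void double_sse(float* x) {
    __m128 v = _mm_loadu_ps(x);              // pull four floats in at once
    v = _mm_mul_ps(v, _mm_set1_ps(2.0f));    // one instruction, four multiplies
    _mm_storeu_ps(x, v);                     // write all four back
}

int main() {
    float a[4] = {1, 2, 3, 4};
    float b[4] = {5, 6, 7, 8};
    double_sse(a);
    double_plain(b, 4);
    std::printf("%g %g %g %g\n", a[0], a[1], a[2], a[3]);  // 2 4 6 8
    std::printf("%g %g %g %g\n", b[0], b[1], b[2], b[3]);  // 10 12 14 16
}
```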
And so there's lots of these different sequential optimizations that people have done, and that's what goes into your one chip. And so now that you have all of these extra transistors, because you can't increase the clock speed on the one chip, or on the one core, people are building out the number of cores that they have on a chip. They basically have the same core. They're not trying to cram too much into that one core and increase the power density to the point that it would cause problems. They're just scaling that out with the number of transistors. Does that make sense?
Totally. It's still, when you sit back and think about it, mind-numbingly awesome what we can actually build out of those core primitives, where we've gotten from where we started. You take it for granted. You don't think about it much, but when you do sit back and think about it... all ones and zeros and logic gates at the bottom of it all. What we actually can create out of that has been amazing.
That's true.
Oh, there's tons of layers. And I mean, if you think about it, people started out just programming to the chip, right? If you got a new machine back, say, in the founding days of this laboratory, in 1952, you would read the manual, and the programmer manual had assembly code in it. It said, here's the instructions you can execute on this chip. This is what it can do. And you'd have to actually think about memory: how you're managing it, what you're pulling into the core, how much memory you have, things like that. And now you don't even think about that. You can instantiate things dynamically. You don't have to think very much about memory in most of the modern languages. And
it's a pretty nice ecosystem. I mean, I think, you know, the reason that the multi-core stuff
doesn't, I think, change your perception of what's going on on the computer quite as much,
or at least from a programming perspective, is, I mean, one reason is that there are a lot of multithreaded programs.
And even your operating system is, you know, even before you had multicore chips,
your operating system was executing multiple things at the same time.
It was just doing it by timesharing.
And so you know what a context switch is.
It's when you're executing one program
and then the OS says,
oh, well, there's this other thing
that's running at the same time.
I'm going to swap that in,
execute it for a little bit,
then I'm going to preempt it
and switch back to the other thing that you were doing.
And effectively, that's how your OS did multitasking before you had multi-core chips: by just switching back and forth between different tasks really rapidly. And now, on your chip, you really can have things executing actually in parallel. And so to some extent it's kind of a natural transition, right? Because you can just execute different threads on different cores, and the operating system has to manage that. But you still have context switching too. So, you know, you can still execute many more tasks on your chip than you have cores.
This episode is brought to you by DigitalOcean.
DigitalOcean recently announced new, highly competitive droplet pricing on their standard plans,
on both the high and the low end of their pricing. They introduced a new flexible $15 plan where you can mix and match resources like RAM and the number of CPUs, and they're now leading the pack on CPU-optimized droplets, which are ideal for data analysis or CI pipelines. And they're also working on per-second billing. Here's a quote from the recent blog post on the new droplet plans.
Quote,
We understand that price-to-performance ratios are of the utmost consideration
when you're choosing a hosting provider
and we're committed to being a price-to-performance leader in the market.
As we continue to find ways to optimize our infrastructure,
we plan to pass those savings on to you, our customers.
End quote.
Head to do.co slash changelog to get started.
New accounts get $100 of hosting credit
to use in your first 60 days.
Once again, head to do.co slash changelog.
Well, Moore's Law is not dead, but dying. Murphy's Law, however: eternally true.
Still true, yes. Will always be true.
Adam, you know that one, yes? Murphy's Law?
Yeah, that's a real law. Yes. Anything that can go wrong will go wrong.
So I guess if you want to get back to the Moore's Law dying aspect,
I think GPUs are a good example of one way that you can take more effective advantage of some transistors
and sort of combat that power law or, you know, the Dennard scaling problem.
The GPUs are, in terms of the number of operations you can do on them, you get a lot more performance
per watt if you can exploit them than you do out of a CPU. So if you can actually, if you have a
workload that's data parallel, you can pose it that way.
Then you can execute it more efficiently on a GPU than you can on the CPU.
And so that's, and you have like 5,000 cores on there, right?
It's a big scale out machine. It's doing vector stuff.
It's very power efficient.
And that's one way to use the transistors more
effectively for certain workloads than for the cpu and i think you know that's that's where you're
going you're going to see other types of technologies take over that are better at
certain tasks i think you know in our community the other places that you know people are looking
um it's so there's quantum computing people talk about that a lot um i want to talk about that yeah okay um put that on
the sidelines say too much about it okay not an expert but i mean we have there's like a whole
beyond moore's law thrust in doe and i think in the broader cs research uh you know funding agencies
um what's the doe the department of energy okay just to be clear for for those not in the u.s
and hearing acronyms to know what they're talking
about.
Yeah, we could do the... we didn't do the origin story thing at the beginning, which we could do. I think there's some in that show, to some degree, about where you work and what you do. And so I'm pretty sure that's how Nadia opened it up.
Right. Yeah, that's true. So we can talk about the DOE some later.
DOE is where I work.
I work at Lawrence Livermore National Laboratory,
and we care about high-performance computing.
So yeah, quantum computing is one way that you can use a different type of technology
to do computation.
So far, people haven't really...
They've shown that it's useful for certain
problems. So like there's a D-Wave system, Los Alamos has a D-Wave system that they're looking
at. It's a type of quantum computer that can do something called quantum annealing, which allows
you to solve certain optimization problems very fast. But again, you know, that's a different
model of computation. It's not like a script.
It's another type of thing. So if you have to do optimization problems, that's a good thing to use.
And you can do it really fast. There's something called cognitive computing that we're looking at. So at Livermore, we have a partnership with IBM where we're looking at their TrueNorth architecture. And they call it a cognitive computer.
Effectively what it is, is it's a chip that you can basically put a neural network on
and you can evaluate it very quickly.
And so it's good for machine learning workloads.
If you need to do some machine learning evaluation along with your workload, where I'm distinguishing
between training and evaluation, then you could potentially do it faster with the TrueNorth chip. And then, to some extent, there are limitations to how you can do that. You have to discretize the neural net a certain way so that it fits on the chip, and you can only do certain types of neural nets. But you can pose a lot of different neural net problems that way. So we think it could be useful for helping to accelerate some of the simulations that we're doing, or to help solve problems that are really hard for humans to optimize at runtime.
So that's another model.
Are there private sector equivalents, Todd, to these things that you're speaking of?
Or are these the kinds of things that you only find in the public sector, in terms of these cognitive learning machines?
So I believe TrueNorth is available and you could buy it if you were in the private sector. It's an IBM product. I'm not 100% clear on whether it's just a research prototype that we're dealing with or whether you can actually buy these and play with them in industry. But I know that some industry players have D-Wave machines, so they're playing with those. So you can get them and play around with them. I definitely think that it's still in the research phases in terms of what you would actually do with it. The TrueNorth chip is interesting because it's a little closer in terms of actually deploying those, because people do have machine learning workloads, right? And if they want to accelerate them, they could use something like this to do that. What it doesn't accelerate is the training.
So you would still have these giant batch jobs to go and analyze data sets to build the neural net that you use either to classify or to analyze the data once you're done training that thing. But I think the theme across all these different areas is that it's more specialization and special purpose.
Yeah. So tell us real quick, you mentioned you work at Lawrence Livermore National Lab, and you said we care about high-performance computing. Maybe explain the specific use cases, as much as is public knowledge or not top-secret stuff, that you guys do and that you're applying these technologies to.
Okay, so I work for the Department of Energy.
I think the Department of Energy has been in the news as Trump has picked his cabinet lately. We deal with a couple of different things.
I think the DOE is the biggest funder of science research in the US alongside the NSF.
And that involves funding universities, and it involves funding the national laboratories. And we're also in charge of managing the US nuclear stockpile, and making
sure that it stays safe and reliable. And so across all of those different scientific domains,
there's a whole lot of physics simulation that needs to get done.
And effectively, you know, we are using simulation to look at things that you either, you know,
can't design an experiment for, or that it's too expensive to design an experiment for,
or that it would just be, you know, unsafe to design an experiment for or that you shouldn't design an experiment for. And I guess on the NNSA side, so Lawrence Livermore is part of the National Nuclear Security Administration, which is under the DOE. The unsafe thing that
we're looking at is, you know, how do nuclear weapons work? And so that's a lot of the simulation
workload that takes place here. We also do other types of simulation like climate science.
We have a lot of people working on that.
We look at fundamental material science, all these big either computational fluid dynamics or astrophysical simulations, geological simulations, earthquake simulations, all these physical phenomena. We, you know, we have simulations at various degrees of resolution, um, that we can look at to figure
out, you know, what would happen if, so like we have some guys who've done predictions about
earthquakes in the Bay area, where would the damage be? Um, we look at, will this weapon
continue to work? Um, we also do things like detection. Like if, if you had something like this type of device and someone was trying to ship it in a container, how might you figure out that it was there without opening every container?
There are lots of things like that that the DOE looks into.
And high-performance computing drives all sorts of different aspects of that. Yeah. So, and I guess the other interesting facility here that's in the news frequently is the
National Ignition Facility, which is a nuclear fusion experiment.
So we're trying to make a little star in a big, you know, building the size of three
football fields where we've got like 192 lasers that fire at this little target.
And so simulating how the lasers interact with the target,
how they deposit energy there,
is one of the things that we can simulate on the machines here.
You're building a star inside of a big building.
A little tiny star.
To me, every star is big, I guess.
So a tiny star relative to other stars, but a big building.
Well, let me be clear. It's a star in the sense that we're trying to get fusion burn to happen.
Right.
I was going to ask you, what exactly is a star? I was just waiting for Adam to hop in, because this is where he gets super excited. His ears are perking up.
Well, I was still stuck back at the size of this TrueNorth.
And I was thinking about the size of the thing. I was actually thinking about, because these things are really, really small, at what point does a chip, or microchip, or however you want to term this, get so small that it gets to the very, very small? If you study physics and things like that, you know life like we see it, then you see the very, very big, which is planet sizes and universe sizes, and then you get the very, very small, which is like atom sizes. Like, how small do these things get? But then this star conversation is far more interesting to me. I like that.
So there's lots of physics that goes on in the Department of Energy. So I guess I would shamelessly plug: I can endorse the Department of Energy. It's a good place to work, because you get to find out about stuff like this.
Yeah, sounds interesting.
So yeah, the interesting thing about NIF, that's the National Ignition Facility, is that you're simulating a star. It's very small; the target is like a few millimeters in diameter. But you're trying to cause the same kind of fusion burn that would happen in the Sun.
And so it's all these lasers colliding, right? The light from these lasers colliding, they create the fusion?
Yeah, that's right. The lasers come in, they hit this kind of cylindrical thing called a hohlraum. It's made of gold. That gets really hot, X-rays come out of it, and they implode the target in the middle.
And that's the idea.
Are you doing that a lot, or are you simulating on computers and then doing it very few times?
I'm guessing they can do it physically in this big building, but then these chips that he's talking about, they can do it simulated.
This is a good example of the type of work we do. So NIF is where you're trying to do it physically; we're trying to get fusion burn there. But to understand how this thing is working, we have to do simulation to prototype the designs. And I think we do about 400 real shots a year over at NIF, where we actually turn the lasers on and point them at a target.
Well, it's not too many.
Yeah, well, and we're ramping that up.
It's a scientific facility,
so you can do research for lots of different groups.
Yeah.
In conjunction with that, we do simulations to see if, you know,
what we're simulating matches what really happened, right?
And that's an iterative process.
So you do more simulations.
You say, okay, it matches.
How do I change the design to, better, to get more energy out?
And then you go simulate that.
It says it's going to do better.
You try it.
Maybe it doesn't.
And then you iterate on that until the two match.
And ideally, that's the process that we use for designing these things.
So that's where the HPC comes in, because simulating something like that takes an awful lot of compute cycles. So I work in Livermore Computing, which is a compute center, kind of like a data center. But we have machines that are dedicated, just in this building, to all of our computing needs, and we have some of the largest machines in the world here that people run these parallel applications on.
How many cores? That's a lot of...
Yeah, we have one machine with one and a half million, which is number four, I think, in the world now.
So that's Sequoia.
And we're installing the new machine right now.
It's called Sierra.
It's a big IBM system.
It's with Power 9 processors and NVIDIA GPUs.
This is highly specialized equipment for highly specialized tasks.
Yeah, that's true.
It seems so. You buy a different kind of machine for HPC than you do for like the cloud.
But, you know, some aspects of running a data center and a compute center are very similar.
Managing power, temperature, stuff like that.
I would say that security, yeah, exactly.
That's important.
We've been rolling out meltdown patches all across the facility.
I was just going to ask that.
Yeah.
And the interesting thing, and we see performance hits from that,
so we try to optimize that.
How big are the performance hits that you're seeing?
There are some reports it would be up to 30%,
but it doesn't sound like that's necessarily the case.
Yeah, I think that's in line depending on the workload.
I think it really depends on what application you're running
because it's that system call overhead that you're paying for.
So we have an interest in high-performance computing
because there's basically never an end to the computing capacity
that we need to simulate the stuff that we're looking into.
And so most of the place where we get into architecture
around here is in optimizing the performance of applications.
So, we have people who work with the application teams
and they say, okay, your simulation does this.
How do I, how can I make that execute more efficiently
on the hardware?
And then we also look at procurement.
So we're like, we have this workload. We know that
we need to run these things. So what's the next machine that we're going to buy? And so, you know,
I was talking about Sequoia. Sequoia is the, I guess, 16 realized 20 peak petaflop machine that
we have on the floor right now. Our next machine is going to be 125 petaflop machine.
And so the whole procurement process,
people get together and they look at the architectures from different vendors
and they say, you know, how is our workload going to execute on this?
And so I think, you know, in the future,
you're going to have to think more and more about matching the, you know,
the applications to the architecture.
And we had to think about that because our next machine is a big GPU system.
So, I mean, here's an example that probably gets at kind of the heart of this Moore's Law stuff.
Sequoia is the previous generation machine.
It's about 100,000 nodes.
Each node has a multi-core chip on it.
And they're all PowerPC chips.
And so our workloads could execute pretty effectively on that.
And it was fairly easy to scale things out to a large number of processors.
The GPUs have kind of won in terms of that's the thing that has a market out there for games and for other applications.
And so, you know, we have to ride along with the much bigger market for commodity computing.
And so our current machine is only 4000 nodes. It's got power nine processors on them, and it's
got four GPUs per node. And so that's, you know, in terms of number of nodes, it's a much smaller machine than Sequoia, but it's way faster.
And it had, you know, it's 125 petaflops versus 20.
And so that's where, you know, the GPUs will win.
But for us, that's a big shift because we haven't been, we haven't used GPUs as extensively before.
And so now we have to take our applications and port them so they can actually use the four GPUs per node.
And that's a challenge.
Give us an idea of what range we're looking at here.
U.S. dollars.
For the big machines?
Yeah, the big machines.
Are we talking like hundreds of thousands of dollars,
millions of dollars, tens of millions?
What's the order of magnitude? So for most of the big machines. Are we talking like hundreds of thousands of dollars, millions of dollars, tens of millions? What's the order of magnitude?
So for most of the big machines,
like if you're going to get a number one on the top 500 list,
which is like the place where they have the list of the top supercomputers,
is probably like around, has been $200 million,
at least in the DOE, for the system.
And that's procured over the course of five years.
We start five years out.
We talk to vendors and we get them to pitch.
They write a proposal that says,
here's how we could meet your specs.
And then we have a big meeting where we go
and we look at how they project this will work on our workloads.
They do experiments with some of our applications.
And we also look at the other specs on the machine, and different parameters: how much memory is it going to have? How much memory per node? How many nodes? Are we going to have to use GPUs? Are we going to have to use Intel Xeon Phi chips, or other things?
And then we pick the one that we think will best meet our needs going
forward.
How would you like to close that deal, Adam?
$200 million?
That's a lot of money.
Be the salesman on the front end of that thing?
Yeah, that's a long sales process.
You go out to dinner after you make that sale.
Yeah.
And if you want the details on our current machine, there's a nice article at Next Platform by our CTO who is in charge of that procurement process.
Awesome.
We'll make sure we link it up in the show notes.
All right, Todd, I have a suggested project for you
for the NIF folks
after you guys finish that star you're working on.
I'm sure they'll listen to me.
Next project.
Yes.
Sharks with laser beams on their heads.
I feel like people have come up with that idea before.
Just for your consideration.
Well, simulate that a few times.
I think you'll wind up with it.
Are you sure no one else is working on it?
Well, I think you'd be bleeding edge.
All right.
With simulation, we can make it better.
We can make more effective sharks with laser beams.
That sounds scary, though.
I think we should think about the consequences of doing that.
Looking to learn more about the cloud or Google Cloud Platform, but you don't know where to begin? Check out the Google Cloud Platform weekly podcast at gcppodcast.com, where Google developer advocates Melanie Warrick and myself, Mark Mandel, answer questions, get in the weeds, and talk to GCP teams, customers, and partners about best practices, from security to machine learning, Kubernetes, open source, and more. Listen to gcppodcast.com and learn what's new in cloud in about 30 minutes a week. Hear from technologists all across Google, like Vint Cerf, Peter Norvig, and Dr. Fei-Fei Li, all about lessons learned, trends, and cool things happening with our technology. Every week, gcppodcast.com takes questions submitted by our audience, and we'll answer them live on the podcast. Subscribe to the podcast at gcppodcast.com, follow us on Twitter at gcppodcast, or search for Google Cloud Platform Podcast on your favorite podcast app.
And by GoCD. GoCD is an open source continuous delivery server built by ThoughtWorks.
GoCD provides continuous delivery out of the box with its built-in pipelines,
advanced traceability, and value stream visualization. With GoCD, you can easily
model, orchestrate, and visualize complex workflows from end to end.
It supports modern infrastructure with elastic on-demand agents and cloud deployments.
And their plug-in ecosystem ensures GoCD will work well in your unique environment.
To learn more about GoCD, visit gocd.org.
It's free to use and has professional support for enterprise add-ons available from ThoughtWorks.
Once again, gocd.org slash changelog.
So I guess the question I have is, if you've got this 200 million dollar computer, it's got to be something that's in pretty high demand, right? People are going to want to use this thing, because you're not going to want to not get the return on investment for that thing. So what's it like scheduling, managing a project that's on it?
How do you schedule time for it?
Do you have to predict how long your project will take the compute time?
Like give us a day to day operation of using one of these computers.
OK, so I mean, I can't speak necessarily to what the actual application guys would do, because I'm a performance guy.
So I work with them to help
speed things up. But I mean, the usage model is basically you have to write a proposal
to get time on these things. For the bulk of our workload, and this is the case for other
Department of Energy laboratories too, you have to write something up that says, you know,
I have this scientific problem. It really needs a lot of CPU
cycles. It's not possible without that. And here's what it would enable. This is why it's worth,
you know, the time on the machines. And so those go through, you know, here and at Argonne and
Oak Ridge, all these other labs, a competitive process where reviewers look at the proposals,
they evaluate, does it have merit?
And then once that's done, you get assigned hours according to what you asked for on the machine.
So you get CPU hours. That's millions of CPU hours or more, depending on what the project is.
And the CPU hour is measured in terms, I think we may be doing node hours now. I'm not sure if it's CPU hours or node hours.
But basically it's just a measure of how much compute power you're allowed to use.
So that's how we justify it.
And the machines stay busy all the time because we have science projects that need them for their workloads.
We have more work than the computers could ever possibly do.
But they are doing it fast, so it enables new science.
So I think in a given day at the lab,
there's a bunch of users.
We have 3,000 users for the facility.
Some here, some are collaborators,
some are at universities that we collaborate with.
They're running jobs, applications.
It's like a big batch system.
You log into it.
You say, here's the job I want to run.
Here's how many CPUs it needs or how many nodes it needs.
And here's how much time it needs to do that approximately.
And then we have a scheduler that just goes and farms those jobs out to the system.
And so the people at the compute center, we look at what's going on.
We try to manage the scheduler so that it has a good policy for all these different users.
And we have performance teams who help the application teams actually optimize their code to run on the machine machine and that's an iterative process right so
for a machine like the new Sierra machine I was talking about we'll typically have a smaller
machine in advance of that that's similar you know we have a power 8 GPU system instead of a power 9
GPU system that we've been testing on and they'll get their code running on that in preparation for
the new system.
And in that process, we'll run profilers on the code.
We will look at traces to see if it's communicating effectively between all the nodes.
And we'll help out the application teams by saying, you know, you should modify this or we need to change this algorithm. I think one of the things that we've been helping people with a lot lately, especially with the GPUs and also with other centers using more exotic chips like Xeon Phi, which is like an Intel many core chip.
It's like a 64 core Intel chip.
We need the same code to execute well on all these different architectures, and that's not an easy process.
So if you have a numerically intensive code, you write it one way.
It might execute well on the CPU, but not on the GPU.
And we'd ideally like to have one code that the application developers maintain, and have essentially some other layer handle the mapping of that down to the architecture.
So one of the things we've developed is we call them performance portability frameworks.
We have this thing called RAJA. It's a C++ library where you can write a loop. Instead of a for loop, you write a forall. You pass a lambda to it, and you pass that forall a template parameter that says, hey, I want you to execute on the GPU, or I want you to execute on the CPU.
And that allows them to tune effectively for different architectures. They can kind of swap
out the parallelism model under there. And so tuning that, getting the compilers to optimize
it well for different machines, that's the kind of thing the performance folks have been working on.
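For a sense of what that looks like, here's a rough sketch of a RAJA-style forall based on the description above (exact headers and policy names vary between RAJA versions, so treat this as illustrative rather than as the definitive API):

```cpp
#include <RAJA/RAJA.hpp>

// Scale an array by a constant. The execution policy is a template
// parameter: RAJA::seq_exec runs it sequentially, and swapping in an
// OpenMP or CUDA policy retargets the same loop body to threads or a GPU.
void scale(double* x, int n, double a) {
    RAJA::forall<RAJA::seq_exec>(RAJA::RangeSegment(0, n),
        [=](int i) {
            x[i] *= a;   // the lambda is the loop body
        });
}
```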
So you answered the one question that I was thinking when you talked about scheduling is, do these things ever sit idle? Because that would be like the worst use of a huge,
you know, massive, powerful, expensive computer is idle time, right? So it's,
I guess it's heartening to find out that there's so much work to do, that that's not a problem
whatsoever.
In fact, the problem is the opposite, is that you need to start procuring some more to continue more and more research.
The other side, too, it sounds like you have a dashboard or something like that. Do you ever see the computer? Do you actually get next to it, or do you just operate whatever you need to do through some sort of, I don't know, portal or something like that?
We have people who get next to the machine, and we give tours of the facility to folks who visit the lab sometimes.
to folks who visit the lab sometimes.
But you don't have to like put your USB stick into it
to like put your program on it and run it, right?
You're like...
No, no, punch cards.
Okay.
No, so basically, these things look like servers, like you'd be used to, right? You have a desktop machine, you SSH into the computer, and then there's a resource manager running on it. So Slurm is the open source resource manager that we use; it was developed here, and now it's got a company, SchedMD, around it. And the users would say sbatch on the command line, and then it would take that command line, put it in the queue, and then eventually run it on however many nodes they asked for. Or srun, if they want to do it interactively and wait for some nodes to be available.
And, you know, the wait times can get pretty big if the queues are deep.
So, yeah.
So you get assigned hours, but you don't get assigned like 9 in the morning to 10 in the morning.
You get just hours and you're in a queue.
Whenever your queue comes up, you execute.
Right.
You get a bank that comes with your project.
We call it a bank.
That's how many total CPU hours you have.
If you submit a job, then when you submit it, you have to say, here's how long I expect it to run for, and the scheduler will kill it after that much time, and here's how many nodes I want. And then it runs for that long, and the length of time it runs, times the number of nodes it used, times the number of CPUs per node, is how much they subtract from your bank at the end of that. And so effectively, you get a multi-million CPU hour allocation. You can run that out pretty quickly if you run giant jobs that run for a long time.
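In other words, a sketch of the accounting just described (the node and CPU counts below are made-up, illustrative numbers):

$$\text{charge} = \text{wall-clock hours} \times \text{nodes used} \times \text{CPUs per node}$$

So a job that runs for 24 hours on 1,000 nodes with 36 CPUs per node would draw 24 × 1,000 × 36 = 864,000 CPU hours out of the bank, which is how a multi-million hour allocation can disappear quickly.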
So Todd, I first met you at the Sustain event
last spring, almost summertime, I suppose,
at GitHub headquarters.
You were very involved in that.
And in fact, that's when you hopped into our Slack for the first time and helped bring
some people from the lab to that event.
And so you have this interest and passion around sustaining open source because that's
why you were there and involved.
And we appreciated your help.
But tell us in the audience the intersection of where open source comes in with the work
you're doing with
the supercomputers and the lab work.
Sure. So I'd say two places, and they're big places. I think for our computer center, the folks who run it, we prefer open source for nearly everything: for the resource manager, for the file systems.
We have big parallel file systems like Lustre. Even for the compilers that we use, we're investing in Clang and LLVM to create a new Flang to do Fortran for some of our codes. And so I would say that the majority of what we do
at the Compute Center is open source
in terms of the infrastructure that we're using.
Our machines run Linux,
and we have a team downstairs
that manages a distribution for HPC.
We call it TOSS, which is the Tri-Lab Operating System Stack.
That's basically Linux distribution with our custom packages on top of it.
And that's how we manage our deployment for the machines.
So that's one way.
And then we have people working on those projects. Like, ZFS is used in Lustre; we have a guy who actually did the ZFS on Linux port
and manages that community.
And I think we get a lot out of that.
It's Brian Behlendorf at Livermore. Not the Brian Behlendorf who's doing blockchain stuff, but actually another Brian Behlendorf.
I was going to ask that. The same name?
Yeah, there's two Brian Behlendorfs in open source.
Same spelling and everything?
Yeah, everything. He said that they met once and talked to each other.
That is confusing. We just had the other Brian Behlendorf on the show. We interviewed him at OSCON last year. Listen to that.
Hyperledger.
So this is ZFS. So there's the ZFS Brian Behlendorf and there's the Hyperledger Brian Behlendorf.
One of them is in the building with me.
Yep. You know, we were talking about how we procure these big machines, and there's a contract associated with that, in that we allocate some time for the vendor to contribute to open source software. We require that as part of the contract. And so they work with us, and they make sure that our software, and other software that we care about from the DOE and elsewhere, actually works on the machine.
So that's another way we interface with the open source community.
On that note, then, it sounds like you're pretty intimate in terms of what's involved in the process or what's on these machines.
Do you have good relationships with those who sysadmin these machines? As a collective, are you able to say, well, we prefer this flavor of Linux? It seems like, since you choose open source, you have some sort of feedback loop into the preferences for what everyone can put on this machine and all the fun things you do.
Yeah. So at this center, there's a software development group and there's the system administration group, and they're all in the building that I'm in, which is attached to the compute center. There's a lot of crosstalk between those different areas, and then we also talk to the researchers, right, who run applications on the machines.
Yeah, I would say that, you know,
Livermore Computing, at least,
like on the infrastructure side,
is definitely involved in, you know,
choosing what open source we want to run on the machines
and when we maybe don't want to go open source.
Like we run proprietary compilers
because they're fast, for example.
But we also do things like invest
in the open source compilers
to try to say, you know, we want an open source alternative
so that we have something that we can run on the machine
that will work and work well.
The reason I ask that is because it seems like, you know, the application process is very protected, to manage the load on those machines and the time. And so I just wondered if the involvement in what's on the machines, and who manages them, and all that stuff is just as protected. It seems like one side is a little bit more loose, but to get the time, it's a big ceremony, a big process, and it could be gatekept to some degree.
Yeah, I guess I would say that HPC, it's a research computing field, right?
It's mostly researchers who need this much compute power.
And so the calls for proposals are not unlike the calls for funding that people put out for academia.
There are open ones. So, like, the Office of Science labs have the INCITE program, where you can apply for time on Oak Ridge and Argonne's machines, which are similarly large, if not larger. Oak Ridge has a larger machine than ours right now.
And for us, our customers are slightly different, because we work with Los Alamos
National Lab and Sandia National Lab. And so our proposal processes, at least on like the classified
side, are mostly between those labs because they're about the weapons program and stuff like that.
But then there are other places where you can get time, like for basic science runs. And we let early code teams, the folks who maybe have an important science application that isn't as complicated as some of our production codes, get on there early and, you know, show off the machine. There's a few months at the beginning where we let them use the time with allocations there. So I guess I'd say there's a lot of different ways to get time on the machines, and it's pretty low overhead. It's not quite like writing a full academic proposal. It's pretty open.
We're on this open source kick, so I was just curious how that flavored in. Because as you're describing your choices, and I guess the primary choice of choosing open source, and that's your preference, it seemed like, while there's a lot of process around the proposal flow, maybe there's a little bit more crosstalk, as you mentioned, and involvement with other teams that have access to these super expensive machines.
That's a huge privilege because I don't have access to a $200 million machine.
I can barely afford one that costs seven, you know? Like, and I gotta, like, borrow money from grandma or something like that.
Seven what? Seven million?
Okay, thousand.
You said $200 million, and then you said you could barely afford one that costs...
Well, I assumed everybody thought that was, for me, a thousand. Like, I assumed the denominator is staying the same. Grandma ain't got millions, man.
You can quote me on that.
I guess what we're trying to say, Todd, is: how can we get some time on this computer?
We've got some research.
Yeah, so I guess if I had to boil it down to something, you have to have justification for getting on the machine.
You have to be able to show that you can make scientific progress with your hours.
So that's what the process is about.
Sharks and lasers, man.
Sharks with laser beams on their heads.
I told you my justification already.
Ill-tempered sea bass.
Oh boy.
I guess the other elephant in the room, Todd, for justifications,
and you addressed this to us in the break,
but please for the audience's sake,
because I know that we probably have a fair amount of Bitcoin miners listening.
And so I think that is the other thing that has people kind of, you know, putting their pinky up to their mouth.
CPU time.
Yeah.
What about Bitcoin mining on these rigs?
Okay, so we're not allowed to mine Bitcoin on these machines.
It's not legal to use government machines to mine for Bitcoins.
But even if you did, it wouldn't be worth it. If you look at what people are using for Bitcoin mining, a lot of that is custom chips; they're very low power and only do hashing. So you'll do way better investing in that than you will in one of our machines. And I think at some other compute center, people have been fired for trying to mine Bitcoin on the machines.
Wow.
Yeah, yeah, yeah.
You can Google for news stories about that.
But I guess I want to be a little clearer on the openness front.
So the application for time is separate from software that you actually run on the machine, right?
So we have a lot of open source projects that live for much longer than any allocation or any machine really. And there are a lot of people who work on
those and those are open. Some of them, you know, you can even run on your laptop and scientists do.
So like a lot of development for these machines does happen on people's laptops and then they
scale the program up to run on the big machine. So, you know, there's a lot of open source software development
that happens,
even if the process for getting access
to a big machine isn't open.
And you can run that open source software
on the machine that you do have for 7K.
Or less.
Or easily 7K.
Yes.
Your grandma could run that, Adam.
That's right.
Grammys can run that.
So what about the open source community?
Seems like any time I think of a government operation, especially with the security constraints and a lot of the quote-unquote red tape, deep involvement with a community that's built on openness and freedom and all these things that are kind of the opposite of secrecy and closed... is there any give or take there? Is there red tape? Are there issues around that? Or has it all been pretty easy in terms of integrating your open source work into the greater open source community?
So I'd say that historically, Livermore is pretty good about open source.
I mean, we started using Linux clusters like in the late 90s.
And, you know, we've been working on the operating system for that.
We've developed, like, Slurm, the resource manager, which has been open source for a long time.
So putting stuff out there has not been such a problem.
There is a release process that you have to go through that's kind of cumbersome.
But once you do it, you know, like we did for Spack, the package manager I work on, you can work out in the open on that as long as you stay within a certain project scope.
So, yeah, I mean, there is some red tape around that.
Obviously, we don't want to release some of the things that we develop.
But then again, we use a lot of open source internally and benefit from the broader open source community. So I would say that DOE has a pretty active open source development ecosystem.
And we leverage things that are developed by other labs and other labs develop or leverage
things that are developed by us. And I think there's a lot of back and forth.
I would say that the interaction model on the projects is maybe not quite the same as a large infrastructure project like, I don't know, Kubernetes or Docker or something like that, just because it's scientific computing. People get funded to solve a particular problem, not to develop software. So there are, you know, sustainability issues around how much software development time we can actually put into a project.
On the production side, though, the facilities, I mean, their job is to keep the center running
and to do it efficiently.
So that's, I think that's why you see a lot of open source coming out of there.
But, you know, then again, there are long-lived research projects that are very widely used.
So one good example of that
is in the math library community.
So for large-scale parallel solvers,
the different labs have teams
working on that stuff
and there are some solver libraries
like Livermore has hypre,
Sandia has Trilinos,
and Berkeley Lab has some solvers.
And also things like finite element frameworks, things for meshing and for building these big models of physical systems.
So Livermore has a library called MFEM that has a big open source community around it.
Well, not big by JavaScript standards, but big by scientific computing standards. Right. Yeah. So some of them operate like communities, I would say, and then others kind of tend
to, you know, stay within a particular group or, you know, they maybe don't have a cross
lab community.
It just depends on the software and what the funding and the interaction model has been
historically.
I do think like more community could help a lot of projects. If people started
thinking more in terms of like, how do I sustain this over time? How do I get more contributors?
I don't necessarily think that we build research software with growing contributors in mind.
I think it's interesting that you've got the three I took note of earlier. The one you obviously
talked about back on Request for Commits.
That's the product you work on primarily.
Yes.
Then you've got Slurm, which I think is a workload manager, if I got you correctly.
That's actually what you interface with to put your products onto a supercomputer.
Is that right?
Yeah, it runs the jobs.
It runs the jobs.
It does the scheduling.
Yeah.
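For a flavor of what handing a job to Slurm looks like, here is a minimal sketch in Python that writes a batch script and submits it with sbatch. The script contents, the account name, and the executable are hypothetical examples; the #SBATCH options shown (job name, nodes, tasks per node, time limit, account) are standard Slurm flags, and the node count and time limit are what feed the bank accounting described earlier.

```python
# Minimal sketch of submitting a batch job through Slurm's sbatch command.
# The account, executable, and input file are hypothetical examples.
import subprocess
import tempfile

batch_script = """#!/bin/bash
#SBATCH --job-name=demo_sim
#SBATCH --nodes=4              # how many nodes you want
#SBATCH --ntasks-per-node=36   # tasks (e.g. MPI ranks) per node
#SBATCH --time=01:00:00        # the scheduler kills the job after this limit
#SBATCH --account=my_bank      # the allocation ("bank") to charge

srun ./my_simulation --input problem.yaml
"""

# Write the script to a temporary file and hand it to sbatch.
with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
    f.write(batch_script)
    script_path = f.name

# On success, sbatch prints something like "Submitted batch job 123456".
result = subprocess.run(["sbatch", script_path],
                        capture_output=True, text=True, check=True)
print(result.stdout.strip())
```

Slurm then queues the job, launches it when nodes are free, and enforces the time limit, which is the scheduling role being described here.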
Then you've got Lustre, which, I was just noticing, is a Seagate Technology trademark.
So that means that that's the file system.
So these things are important enough for you to have open source projects alongside them that I guess are more specific to, say, a supercomputer scenario versus, say, a laptop scenario.
Yeah, for sure. I mean, we have to pay for open source development, for, you know, a parallel file system that'll run fast on our machine. We have to keep the computers working.
Right.
Yeah, a lot of the infrastructure projects are aimed at that.
Seems like some of this stuff should come with a $200 million computer.
Well, so it does, to some extent.
I mean, I'm just thinking, like... do you get a t-shirt or anything?
Huh? So, we haven't gotten t-shirts, but you do get a mural.
You can get a mural painted on the side of the machine. Like, if you look at the NERSC machines, they have a picture of Grace Hopper painted on the side of their machine.
But I will say, there is a lot of software that comes from the vendor.
So like Cray provides a whole programming environment with their system.
It's not necessarily open in the same sense.
If you buy an IBM system, they will bundle like their file system, which is GPFS.
It's a competitor for Lustre.
It's proprietary.
And, you know, which one you go with depends on what value you get out of the procurement.
Which one do you think is going to perform better?
I'd say performance drives a lot of the decisions at the procurement level.
But openness is also a big factor.
Does it come with hard drives?
Yeah, so the system would come with a parallel file system.
So it's not just hard drives. It's like racks and racks and racks of hard drives.
Right.
I was going to make a joke and say, like, when you get it, do you just wipe the drives and put your own stuff on it like you do in the old days?
Do you do it at scale essentially?
Like when I get a machine, even a Mac,
I sometimes will just wipe it and put a brand new version of OS X on there
or Mac OS now because I just like it.
You know, I just do that, you know, my own way.
Yes. Yeah, we do that, effectively, with our Linux clusters. We build our own distribution, like I was saying, and so we have a TOSS image that we run across those. For the bigger machines, what we call our advanced technology machines, the ones in these large procurement packages, it's much more vendor-driven, because it's bleeding edge. So we rely a lot more on the vendor to provide the software stack. Although, I mean, the next machine is going to be Linux too, so the machines at least run Linux for the OS.
Which flavor of Linux?
We run, so across the center here, we run RHEL.
So that's the distro that typically is at the base of our TOSS distribution.
And then some machines run SUSE, but not at Livermore.
Like the Cray machines, I think, use SUSE as their base distro.
They also build their own kind of custom lightweight versions of Linux
for the actual compute nodes.
They want to reduce system noise,
so they don't want a lot of context switching going on
and stuff that would slow down the application.
When you say RHEL, that's R-H-E-L, right?
Yeah, Red Hat Enterprise Linux.
Gotcha.
Yeah. Gotcha. Yeah.
Cool.
This is pretty interesting, to kind of peel back the layers of a supercomputing research laboratory like this and, you know, see how open source fits in, how these $200 million machines get procured, how you propose time, how you interface with other teams that manage open source software, how you determine preferences.
And I mean, this is an interesting conversation.
That's not exactly the typical episode of the changelog.
So hopefully listeners, you really enjoyed this.
And if you did, there's a way you can tell us.
You can either hop in GitHub, so github.com slash the changelog slash ping, or join the community. Go to changelog.com slash members... what is it? No, it's slash community. Sorry. Go to changelog.com slash community. Go there, which is what Todd did one day. He's like, hey, y'all are doing this conference called Sustain, I'm gonna go, and I want to bring some friends. And wow, this is an awesome community. So maybe, Todd, to close this out:
what can you say about hanging out in Slack?
Hanging out in Slack?
With us.
With Slack and with us.
So I do that because it's nice to be in touch with the, well, with I guess a different open source community, right?
So I think the changelog is kind of heavy on web development.
I used to be a web developer before I came to DOE. So I like to keep up with that stuff and see what's going on
out in the cloud
as well as over here in the DOE.
So it's been a nice time.
Well, Todd, thank you
for coming on the show today, man.
We're huge fans of yours.
And just thanks so much
for schooling us on Moore's Law.
Appreciate it.
Cool.
All right, that's it for this show. Thank you for tuning in.
If you haven't been to changelog.com in a while, go there, check it out. We just launched a brand new version of the site. Go to changelog.com and subscribe to get the latest news and podcasts for developers. Get that in your inbox every single week. We make it super easy to keep up with developer news that matters.
I want to thank our sponsors: Rollbar, DigitalOcean, GCP Podcast, and GoCD.
And bandwidth for Changelog is provided by Fastly.
So head to fastly.com to learn more.
Error monitoring is by Rollbar.
Check them out at rollbar.com.
And we host everything we do on Linode cloud servers.
Head to linode.com slash changelog.
Check them out.
Support this show.
The Changelog is hosted by myself, Adam Stachowiak, and Jerod Santo.
Editing is by Jonathan Youngblood.
Music is by Breakmaster Cylinder.
And you can find more shows just like this at changelog.com or wherever you subscribe
to podcasts.
See you next week. Thank you. Bye.