The Changelog: Software Development, Open Source - Moore's Law and High Performance Computing (Interview)

Episode Date: February 16, 2018

Todd Gamblin, a computer scientist at Lawrence Livermore National Laboratory, joined us to talk about Moore’s Law, his work at Lawrence Livermore National Laboratory, the components of a micro-chip,... and High Performance Computing.

Transcript
Starting point is 00:00:00 Bandwidth for ChangeLog is provided by Fastly. Learn more at Fastly.com. Error monitoring is provided by Rollbar. Check them out at Rollbar.com. And we're hosted on Linode servers. Head to Linode.com slash ChangeLog. This episode is brought to you by Rollbar. Rollbar is real-time error monitoring, alerting, and analytics that helps you resolve production errors in minutes.
And I talked with Paul Biggar, the founder of CircleCI, a trusted customer of Rollbar, and he says they don't deploy a service without installing Rollbar first. It's that crucial to them. We operate at serious scale, and literally the first thing we do when we create a new service is we install Rollbar in it. We need to have that visibility. And without that visibility, it would be impossible to run at the scale we do.
Starting point is 00:00:49 And certainly with the number of people that we have. We're a relatively small team operating a major service. And without the visibility that Rollbar gives us into our exceptions, it just wouldn't be possible. All right, to start deploying with confidence, just like Paul and the team in CircleCI, head to rollbar.com slash changelog. Once again, rollbar.com slash changelog. All right, welcome back, everybody. This is The Change Log. I'm your host, Adam Stachowiak.
Starting point is 00:01:22 Today on the show, we're talking with Todd Gamblin, a computer scientist at Lawrence Livermore National Lab. And we got Moore's Law wrong on a recent episode of the ChangeLog, episode 267, as a matter of fact. And Todd hopped in Slack and said, hey, guys, you got it wrong. We should talk about it. And that's what the show's about. We talked about Moore's Law, his work at Lawrence Livermore National Lab, the components of a microchip, supercomputers, and high-performance computing. So Todd, we got Moore's Law a little bit wrong in our episode with Eric Norman. That was episode 267 about functional programming.
Starting point is 00:02:02 And we were talking about Moore's Law, and I might have even mentioned on the show how lots of people get Moore's Law wrong, and then I got it wrong. So, you know, embarrassed face, it's happening. But you were gracious enough to hop into our Slack, which is a place that we hang out
Starting point is 00:02:21 and talk about our shows and programming topics and random things, blockchain mostly. A lot of blockchain. A lot of blockchain in there. And hop in our Slack community and straighten us out a bit about it and the particulars. And so we thought, well, if we need schooling, perhaps more people than just us need a little bit of schooling. So first of all, thanks for coming on the changelog.
And secondly, straighten us out on Moore's Law and what it actually is. So I don't necessarily think that it was completely wrong on the show. The gist of what you guys said was fine: there's no more free lunch, you don't get free performance out of your chips anymore like you used to when the clock speed was going up rapidly, right? But Moore's Law is not dead, although it's fair to be confused, because there have been a lot of articles written about this. There was an article in the MIT Technology Review that said Moore's Law is dead. Now what?
But it predicted the death of Moore's Law, I think, out in the 2020s. And Intel's CEO says Moore's Law is fine, the chips are going to continue to improve. So I think it's kind of hard to see what's really happening in the processor landscape. So what Moore's Law actually says is that the number of transistors that you can cram on a chip doubles every 18 to 24 months. And so that's the part that is still relatively true, although it's slowing down. And I think the interesting thing, and the thing that people typically get confused with this, is that there's something else called Dennard scaling that broke down around 2006. And I think that's what has led to
Starting point is 00:04:17 us having all these multi-core chips now, um, where, you know, you got a lot of performance out of your single-core chips before. And so what Denner scaling says is that as your transistors get smaller, the voltage and current stay proportional to that. So effectively, your power density is the same for a smaller transistor than it is as it is for a larger one. So what that means is that you can basically jack up the frequency or the voltage on the chip as you scale the number of transistors. And so you get clock speed for free over time. Just by increasing the power.
Starting point is 00:05:04 Yeah, just by increasing the frequency as you scale it down. So the chips have effectively the same power for the area that you're putting all those transistors in, right? You want to keep the power envelope relatively constant because you're putting it in a device like,
I don't know, well, these days, like a phone or a desktop computer. And you don't want someone to have a really high-power desktop machine that ramps up their power bill, right? Right. So you've got a fixed power envelope, you're increasing the number of transistors, and it used to be that you could also increase the clock speed. But because of the breakdown of Dennard scaling, you see that in like 2006 or around there, the chips are kind of capped out at like two and a half gigahertz now.
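As a rough, back-of-the-envelope sketch of that scaling argument (my notation, not anything stated on the show): dynamic CMOS power goes roughly as capacitance times voltage squared times frequency, and ideal Dennard scaling shrinks the first two fast enough that frequency can rise for free.

```latex
% Dynamic power (leakage ignored):  P \approx C V^2 f
% Ideal Dennard scaling with shrink factor k:  C -> C/k,  V -> V/k,  f -> k f
P' \approx \frac{C}{k}\left(\frac{V}{k}\right)^{2}(k f)
   = \frac{C V^{2} f}{k^{2}} = \frac{P}{k^{2}},
\qquad
A' = \frac{A}{k^{2}}
\;\Rightarrow\;
\frac{P'}{A'} = \frac{P}{A}.
```

Power density stays flat even as the clock goes up; once leakage current keeps the voltage from shrinking any further, that free frequency increase is what disappears.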
Starting point is 00:05:41 or around there the chips are kind of capped out at like two and a half gigahertz now. Right. Like they're all sort of hovering around there. They get up to three sometimes. And then like you can find like four gigahertz monsters in like some of the bigger IBM like system Z systems. But but effectively, it's kind of capped out there. I don't know if you remember, like I had 100 megahertz computers back in the day or, you know, even I think my Apple IIgs was like maybe kilohertz.
Starting point is 00:06:13 I'm not even remembering, but. I don't think I go back quite that far. You might go back a little further than Adam and I in that regard. I've always been in the megahertz. I know that I was in the megahertz. I think it was like maybe my first was like 750 megahertz, somewhere around there.
Starting point is 00:06:28 Oh, wow. So you guys are picking up like late 90s. Yeah, that would have been like late 90s. Yep, exactly. I do recall having a 4 gig drive was my first computer, and then the second one I had a 20 gig drive. So, I mean, that's sort of like, I don't relate it so much back to the chip but most
Starting point is 00:06:45 mostly like how much space that i have to put stuff on sure which sort of like relates to the chip era because it kind of goes in a similar scale yeah they go hand in hand yeah so you got a little seniority on us there uh todd but nonetheless we definitely yeah we've definitely seen the the topping out i mean i'm on a a uh what does it say 2016 macbook pro and i got a 3.1 gigahertz so that's like yeah two and a half three like you said in the in the more server products you might have four gigahertz but that's what definitely has stopped yeah and the reason that that's broken down is that so den, Denner and Scaly ignores current leakage.
And so, as people packed all these transistors on the chip, you get something called thermal runaway, where, you know, you can't pack them that close without having a whole lot of power on the die. So you basically are capped at how much clock speed you can have. But what they do is, you know, you can still get these multi-core chips now, right? Like the number of cores on your chip has definitely been increasing. And so that's what they're using the transistors for, where they used to, you know, pack more transistors into things like out-of-order execution and other stuff on the die.
Starting point is 00:08:06 Now you're just building out and replicating chips of effectively the same size on the same, well, cores of effectively the same size on the chip. And so that's what your multi-core CPUs are doing. They're becoming their own little massively parallel machines. So even back in that show, even then we were talking about the the proliferation of cores at least at the consumer level um hasn't gone crazy in terms of you know you're still
Starting point is 00:08:34 talking about two core four core eight core sure um probably from your from your purview inside the supercomputer labs you can tell us about us about what machinery looks like inside there. But is that something that has also hit a threshold or it's just slowed to where it doesn't make sense, maybe like you said, inside the same thermal envelope to have 32 cores on a laptop, for instance? So I'm actually not 100% clear on why't you know jacked up the number of cores on on a laptop um i mean i would assume that it's because people don't need that many
Starting point is 00:09:10 cores that much and also because most of the parallelism that you're going to want to do on a desktop machine is going to be on the gpu and on like the phones they have a lot of specialized hardware for things like video processing like all this these like ai units and things like that which is interesting and has a lot to do with like, you know, the eventual death of Moore's law too. But yeah, on, on the supercomputers, I mean, we're, we buy the same, you know, effectively the same chips, um, at least for some of the machines as, um, as your desktop machines. So like our commodity clusters, as we call them, they're Linux boxes with Intel and maybe AMD chips. And we're seeing a large increase in on-node parallelism.
Starting point is 00:09:53 So like where we might not have more cores per chip than what your desktop would have, we have a lot of sockets in the machines. And so the amount of on-node parallelism that's there is pretty high. And we try to use all that so where do you see this all going in terms of Moore's law you said that it was said that it would be dying in the 2020s we're getting near that range but you're you know prognosticate out for us what's it look like in the next five years um so i mean i think next five
Starting point is 00:10:26 years i i think you'll see the the rate of transistors that get packed on the chip start to slow down um i think i currently the number of transistors on the chip is doubling it i think 2x every three years instead of every every you know one and a half half. So it is, it is slowing. Um, and then, you know, once that goes away, um, you're going to have to figure out how to take advantage of, uh, or how to increase your speed other ways. Um, and so what that means in the hardware universe is I think you'll start to see a lot more specialized hardware. You're kind of already seeing that, right? Like on the, like, I, like we were talking about on the, on the mobiles, you've got the iPhone X has this bionic processor or whatever it is. You've got custom GPUs and things.
Starting point is 00:11:12 I mean, a lot of the processing that happens on your phone is just offloaded. And most of that is for power consumption because you can do it a lot more efficiently in hardware. I think you're going to see HPC sites start to shift towards different architectures that can make use of the same number of transistors more effectively for their particular workload. So you'll see a lot more specialization. But you're not going to, you know, it's basically if the number of transistors that you can fit on a chip becomes constant, then the only way that you can get more speed is to make more effective
Starting point is 00:11:46 use of them you can't continue getting performance either in terms of more parallelism or in terms of higher clock speed right because physics maybe for those out there that are like catching up and trying to maybe maybe just trying to follow along to some degree if they're not schooled in transistors and chips can you break down what a chip is and like what the components of it are and the thresholds we've kind of gone over the years and where we're at today is that possible sure yeah we could talk about that some i mean if at the lowest level i mean people talk about transistors um and and what a transistor is is it's a it's a thing that if you apply current to it, it changes the conductivity of the material. And so what that means is that if you think of it as a wire with another wire coming into it, if you put some current on that thing, then the first wire either conducts or it doesn't.
Starting point is 00:12:42 And all that means is that now you have the ability, you have a switch. So you can build out more complex logic from that switch. That's the fundamental thing that enables us to build computers. And they can build that now by etching it on silicon. So there's a, they, they oxidize silicon. That's what all these, you know, fabs and big chip plants are doing. They etch lots of transistors onto silicon with chemical reactions. And there's different processes for doing that. And so those processes are what enable us to cram more transistors on the chip over time. It's improvements to them. And I mean, I'm not a process scientist, so I don't know a whole lot about that.
Starting point is 00:13:27 But effectively, Moore's Law originated when Gordon Moore in 1965 observed that that process had resulted in the effective doubling of transistors every, I think, 18 months or two years back in 1965. He was looking at like the range from 1958 to 1965. And so that's where that comes from. So there's a general comment that turned into law. It's an observation. I wouldn't say that it's a law. You can't go to jail for violating Moore's Law. We call it Moore's Law.
Yeah, that's right. So yeah, we call a lot of things laws. You'll go to jail. That's right. We call them laws until they get broken, and then we're like, well... Even the way that it's usually stated, right? 18 to 24 months is fairly vague, right? Right.
It's an observation of the cadence with which they can double the number of transistors on a chip. Right. And it's held pretty true. Moore thought it would hold for, I think, 10 years, and it's actually held since 1965 pretty well. So it's somewhat remarkable in that sense. It's more than just an observation when it holds for many, many more years than you thought it would. And it's somewhat been co-opted and transformed into meaning general compute power doubling, and that's kind of the way that we were using it. In fact, when I was looking it up a little bit here, an Intel executive, I think in the 90s, said that chip performance would double every 18 months, as opposed to transistor density. And that's the general context in which most programmers and technologists talk about Moore's Law.
Starting point is 00:15:02 It's generally computing power and not the specific thing that Moore was talking about. Yeah. Well, cause the computing power is what enables you to do more and more with your computer. Right. I mean, you can, you can do many more computations. The thing gets faster. It's good. Um, I mean, I think the main consequence of, of the, the breakdown there is that, you know, you don't get as much single thread performance as you used to, that's kind of capped out. So if you're using Microsoft Word or something, like you're typing or something that has to execute sequentially is not going to go any faster.
Starting point is 00:15:37 But if you can get parallelism out of your workload, then you can actually harness all the power that's on your chip. And the difference is that if you increase the clock frequency, that just means that everything on your chip is happening faster. So that's effectively free. If you had a program that ran on an older chip, it would run just the same on the newer chip, just faster. Whereas with parallelism, if you want to harness that performance,
Starting point is 00:16:02 you actually have to rework your program. Divvy it up. And so I think that, yeah, you have to divide it up into, into smaller chunks or figure out a way to do it might involve like changing your whole algorithm. Um, and that's, that's a lot harder. Um, and not all workloads can do that. And you know, not all consumer workloads can do that. So it's interesting to see how this will pan out on, on consumer chips. Although I think with all this machine learning stuff going on now, it's not like there's a shortage of numerically intensive things to do on your desktop machine or on your phone. Or games have always taken advantage of things.
Starting point is 00:16:41 We could talk about GPUs some. GPUs are an interesting design point, right? Like I think in the functional programming podcast you were mentioning earlier, you guys mentioned that, you know, oh, people told me I was going to have thousands of cores. Where are my thousands of cores? And I think, you know, the answer is they're in your GPU because that's a very different workload from what your desktop CPU is doing. It's data parallel. And so it's easy to divide up the work that you have to do for graphics rendering. And so the GPUs are basically these large parallel,
Starting point is 00:17:14 they call them vector processors because they do lots of the same type of instruction at once. And I think in a GV100 Volta, there's I think like 5,000 cores on that thing if you count CUDA cores. People debate whether or not you should count CUDA core as a real core because it's definitely not the same thing as the CPU in your system. But, I mean, it is – it's 5,000-way parallel. That's true. You can do that many operations at once.
Starting point is 00:17:41 Yeah. But very specific use case, not general purpose. Well, you brought in a new terminology, though, too, in this conversation. You got chip. You got – so I'm still trying to paint the picture of what this thing is. But you brought in a new term called cores. You mentioned it earlier, too, but you got the chip made of silicon, and you got transistors on that. Where do cores come into play?
Starting point is 00:18:02 What are cores? Okay. So what people used to just call a chip, because you only had one core, is a core. A core is basically a microprocessor. Although even that term is kind of fuzzy these days, because you can say the microprocessor has multiple cores. I mean, people, you're right, there's a lot of ambiguity.
Starting point is 00:18:25 Okay, so let's go back to the transistors and um, and, and, and build it up from there. I mean, so. And then we can go to the GPUs cause that's where I want to go. Yeah. Yeah. Okay. I think that's the interesting direction. So we talked about, you've got transistors on the chip.
Starting point is 00:18:36 You can use those to do switching that enables you to build what they call logic gates and, and you can do things like and or not basically you're taking two signals and you're producing a result so you know one and one is is one one and zero is zero and so on right that's just that's basic logic you can take that and it turns out you can build anything with that you can build if you if you have a nand gate basically a not and gate then you can build whatever you want um and so there's lots of ways to do that um but effectively they build this whole chip out of that and and that's they're they're putting that logic on on the die and that implements you know what what people recognize as a modern cpu and so like if we're coming at
Starting point is 00:19:22 this from like a high-level language, I think most of the listeners here are familiar with JavaScript or Ruby, Python, or even C. Those either get interpreted or compiled into machine instructions. And effectively, you're taking all these logic gates and you're building something that fetches instructions from memory. It fetches data from memory, the instructions tell the processor to do something with that data, and then they write it back to memory,
Starting point is 00:19:49 and that's pretty much how your chip works. So if you have that pipeline where you can pull instructions from memory, you can do stuff to numbers and write it back to memory, then that's effectively what a modern core, I guess, looks like. That's a processor. You can run programs on that. And so it used to be that you had one core on the chip, and that was what you did. You had one thread of execution.
You would fetch an instruction. You'd do what it said. You'd write the result back to memory, and you would go on and fetch the next one and do what it said. And there's just a whole lot of optimizations that have happened over the course of processor history that led to, you know, what we have today. Spectre and Meltdown have been in the news recently. So the chips do things like speculative execution. They can say, hey, I'm not going to know whether I want to execute this stream of instructions for a little while.
But while I'm waiting to figure it out, can I go and try that? And then, as we found out, you can get bugs from that, but it's also a huge performance increase. There's things like just regular out-of-order execution, where effectively your chip has logic on it that looks at the instructions coming in. It figures out the dependencies between them, and it figures out which ones don't have any dependencies right now, in terms of data that they need to read or results of other calculations in the instruction stream. It'll pull those and it'll execute those instructions concurrently with other ones that don't have any dependencies. And so that's called an out-of-order processor, or sometimes people call it a superscalar processor, because it can execute more than one instruction at once. There's vectorization. Most chips have some types of instructions that will do multiple things at once. So if you know that you have four numbers lined up in memory and you want to multiply them all by two, you can pull them all in at once and do those operations all at once, if they're the same. And so there's lots of these different sequential optimizations that people have done.
Starting point is 00:21:59 that's what goes into your one chip and so now that you have you know all of these extra transistors because you can't increase the clock speed on the one chip or on the on the one core um people are building out the number of cores that they have on a chip and so they basically just you know they have the same core they're not making it they're not trying to cram too much into that one core and increasing the power density to the point that it would cause problems but they're just scaling that out with the number of transistors does that make sense totally it's it's still when when you sit back and think about it it's still mind-numbingly awesome what we can actually build out of those core primitives
Starting point is 00:22:40 like you just where we've gotten from where from starts, you take it for granted. You don't think about it much, but when you do, you sit back and think about it. All ones and zeros and logic gates at the bottom of it all. What we actually can create out of that has been amazing. That's true. Oh, there's tons of layers. And I mean, if you think about it, the people started out just programming to the chip, right? If you got a new new machine, you know, back, you know, say in the founding days of this laboratory, you know, 1952,
you would read the manual, and the programmer's manual had assembly code in it. It said, here are the instructions you can execute on this chip. This is what it can do. And you'd have to actually think about memory: how you're managing it, what you're pulling into the core, how much memory you have, things like that. And now, you know, you don't even think about that. You can instantiate things dynamically. You don't have to think very much about memory in most of the modern languages. And
Starting point is 00:23:35 it's a pretty nice ecosystem. I mean, I think, you know, the reason that the multi-core stuff doesn't, I think, change your perception of what's going on on the computer quite as much, or at least from a programming perspective, is, I mean, one reason is that there are a lot of multithreaded programs. And even your operating system is, you know, even before you had multicore chips, your operating system was executing multiple things at the same time. It was just doing it by timesharing. And so you know what a context switch is. It's when you're executing one program
Starting point is 00:24:08 and then the OS says, oh, well, there's this other thing that's running at the same time. I'm going to swap that in, execute it for a little bit, then I'm going to preempt it and switch back to the other thing that you were doing. And effectively, that's how your OS did multitasking
Starting point is 00:24:20 before you had multi-core chips is by just splitting up the by switching back and forth between different tasks really rapidly um and now you know you're on your chip you really can have things executing actually in parallel and so it's to some extent it's kind of a natural transition right because you can just execute different threads on different cores and the operating system has to manage that but you still have context switching too. So, you know, you can still execute many more tasks on your chip than you have cores. This episode is brought to you by DigitalOcean. DigitalOcean recently announced new, highly competitive droplet pricing on their standard plans, on both the high and the low end scale of their pricing they introduced a new flexible $15 plan where you can mix and match resources
Starting point is 00:25:30 like ram and the number of cpus and they're now leading the pack on cpu optimized droplets which are ideal for data analysis or ci pipelines and they're also working on per second billing here's a quote from the recent blog post on the new drop-off plans. Quote, We understand that price-to-performance ratios are of the utmost consideration when you're choosing a hosting provider and we're committed to being a price-to-performance leader in the market. As we continue to find ways to optimize our infrastructure,
Starting point is 00:25:59 we plan to pass those savings on to you, our customers. End quote. Head to do.co slashchangelog to get started. New accounts get $100 of hosting credit to use in your first 60 days. Once again, head to do.co.changelog. well moore's law is not dead but dying murphy's law however eternally true still still still true yes will always be true adam you know that one yes mur? Murphy's Law? Yeah, that's a real law. Yes. Anything that can go wrong will go wrong. So I guess if you want to get back to the Moore's Law dying aspect,
Starting point is 00:26:53 I think GPUs are a good example of one way that you can take more effective advantage of some transistors and sort of combat that power law or the, you know, the Denner scaling problem. The GPUs are, in terms of the number of operations you can do on them, you get a lot more performance per watt if you can exploit them than you do out of a CPU. So if you can actually, if you have a workload that's data parallel, you can pose it that way. Then you can execute it more efficiently on a GPU than you can on the CPU. And so that's, and you have like 5,000 cores on there, right? It's a big scale out machine. It's doing vector stuff.
Starting point is 00:27:40 It's very power efficient. And that's one way to use the transistors more effectively for certain workloads than for the cpu and i think you know that's that's where you're going you're going to see other types of technologies take over that are better at certain tasks i think you know in our community the other places that you know people are looking um it's so there's quantum computing people talk about that a lot um i want to talk about that yeah okay um put that on the sidelines say too much about it okay not an expert but i mean we have there's like a whole beyond moore's law thrust in doe and i think in the broader cs research uh you know funding agencies
Starting point is 00:28:17 um what's the doe the department of energy okay just to be clear for for those not in the u.s and hearing acronyms to know what they're talking about. Yeah, we could do the, the, we didn't do the origin story, uh,
Starting point is 00:28:30 thing at the beginning, which we could do. I think there's some in that show to some degree about where you work and what you do. And so I'm pretty sure that's how Nadia opened it up. Right. Um, yeah,
Starting point is 00:28:44 that's, that's true. So we can talk about the DOE. Right. Yeah, that's true. So we can talk about the DOE some later. DOE is where I work. I work at Lawrence Livermore National Laboratory, and we care about high-performance computing. So yeah, quantum computing is one way that you can use a different type of technology
Starting point is 00:28:59 to do computation. So far, people haven't really... They've shown that it's useful for certain problems. So like there's a D-Wave system, Los Alamos has a D-Wave system that they're looking at. It's a type of quantum computer that can do something called quantum annealing, which allows you to solve certain optimization problems very fast. But again, you know, that's a different model of computation. It's not like a script. It's another type of thing. So if you have to do optimization problems, that's a good thing to use.
Starting point is 00:29:37 And you can do it really fast. There's something called cognitive computing that we're looking at. So at Livermore, we have a partnership with IBM where we're looking at their TrueNorth texture. And they call it a cognitive computer. Effectively what it is, is it's a chip that you can basically put a neural network on and you can evaluate it very quickly. And so it's good for machine learning workloads. If you need to do some machine learning evaluation along with your workload, where I'm distinguishing between training and evaluation, then you could potentially do it faster with the true North chip. And then, you know,
Starting point is 00:30:07 to some extent there's, there are limitations to how you can do that. You have to discretize the neural net a certain way so that it fits on the chip. Um, and you can only do certain types of neural nets. Um, but you know,
Starting point is 00:30:17 you can pose a lot of different neural net problems that way. So we think it could be useful, um, for helping to accelerate some of the simulations that we're doing or to help to solve problems that are really hard for humans to optimize at runtime. So that's another model. Are there private sector equivalents, Todd, to these things that you're speaking of? Or are these the kinds of things that you only find in the public sector in terms of these the cognitive learning machines so i believe true north is available and you could you could buy it um if
Starting point is 00:30:54 you were in the private sector um it's an ibm product um i'm not 100 clear on whether it's just a research prototype that we're dealing with or whether you can actually buy with these and play with them in industry but i mean i know that yeah i think some industry players have d-wave machines so they're playing with those so you know you can get them around with them um i definitely think that you know it's still in the research phases um in terms of what you would actually do with it yeah um the true north chip is interesting because it's a little closer in terms of, you know, actually deploying those because the people do have machine learning workloads, right? Like, and if they want to accelerate them, they could use something like this, um, to
Starting point is 00:31:34 do that. Um, what it doesn't accelerate is the training. So, you know, you would still have these giant batch jobs to go and analyze data sets to build the neural net that you use to, know either to classify or to analyze the data once you're done training that thing but i mean i think the theme across all these different areas is that you know it's more specialization and special purpose yeah so tell us real quick so you mentioned you know you work at lawrence livermore national lab what yes so you have these specific use and you said we care about high-performance computing.
Starting point is 00:32:07 Maybe explain the specific use cases as much as is public knowledge or not top-secret stuff that you guys do and you're applying these technologies to do. Okay, so I work for the Department of Energy. I think the Department of Energy has I think the Department of Energy has been in the news as Trump has picked his cabinet lately. We deal with a couple of different things. I think the DOE is the biggest funder of science research in the US alongside the NSF. And that involves funding universities. It involves funding the NSF. And, you know, that involves funding universities, it involves funding the national laboratories. And we're also in charge of managing the US nuclear stockpile, and making sure that it stays safe and reliable. And so across all of those different scientific domains,
Starting point is 00:33:01 there's a whole lot of physics simulation that needs to get done. And effectively, you know, we are using simulation to look at things that you either, you know, can't design an experiment for, or that it's too expensive to design an experiment for, or that it would just be, you know, unsafe to design an experiment for or that you shouldn't design an experiment for. And I guess on the NNSA side, so Lawrence Livermore is part of the National Nuclear Security Administration, which is under the DOE. The unsafe thing that we're looking at is, you know, how do nuclear weapons work? And so that's a lot of the simulation workload that takes place here. We also do other types of simulation like climate science. We have a lot of people working on that. We look at fundamental material science, all these big either computational fluid dynamics or astrophysical simulations, geological simulations, earthquake simulations, all these physical phenomena. We, you know, we have simulations at various degrees of resolution, um, that we can look at to figure
Starting point is 00:34:10 out, you know, what would happen if, so like we have some guys who've done predictions about earthquakes in the Bay area, where would the damage be? Um, we look at, will this weapon continue to work? Um, we also do things like detection. Like if, if you had something like this type of device and someone was trying to ship it in a container, how might you figure out that it was there without opening every container? There are lots of things like that that the DOE looks into. And high-performance computing drives all sorts of different aspects of that. Yeah. So, and I guess the other interesting facility here that's in the news frequently is the National Ignition Facility, which is a nuclear fusion experiment. So we're trying to make a little star in a big, you know, building the size of three football fields where we've got like 192 lasers that fire at this little target.
Starting point is 00:35:03 And so simulating how the lasers interact with the target, how they deposit energy there, is one of the things that we can simulate on the machines here. You're building a star inside of a big building. A little tiny star. To me, every star is big, I guess. So a tiny star relative to other stars, but a big building. Well, let me be clear. It's a star in the sense that we're So a tiny star relative to other stars, but a big building. Let me be clear.
Starting point is 00:35:25 It's, it's a star in the sense that we're trying to get fusion burn to happen. Right. I was going to ask you what's, what exactly is a star? I was just waiting for Adam to hop in. Cause this is like where he gets super excited. His ears are perking up.
Starting point is 00:35:36 Uh, well, I was still stuck back at the size of this, this, uh, true North. And I was thinking like the size of the thing. And I was actually thinking about at what point does because these things are really really small at what point does a you know chip or
Starting point is 00:35:53 microchip or whatever you however you want to term this gets so small that it gets to the very very small which if you study physics and things like that you know life like we see it then you see the very very big which is planet sizes and you know universe sizes then you get the very very small which is like atom sizes like how small do these things get but then this star conversation is far more interesting to me i like that so there's lots of physics that goes on in department of energy so i guess i would shameless plug it's a i can endorse the department of energy it's a good place to work because you get to find out about stuff like yeah sounds interesting so yeah so i mean yeah the interesting thing about nif is that that's the national ignition facility is that you're
Starting point is 00:36:34 simulating a star it's very small um it's it's you know the target is like a few millimeters in diameter versus but you're trying to cause the same kind of fusion burn that would happen in like the sun and so it's all these lasers colliding right the light from these lasers colliding they create the fusion yeah yeah that's right it's it's they the lasers come in they hit this kind of cylindrical thing called a whole rom it's made of gold um that gets really hot x-rays come out of it and implode the target in the middle and that's the idea are you doing that a lot or are you simulating
Starting point is 00:37:09 on computers and then doing it very few times I'm guessing they can do it physically in this big building but then these chips that he's talking about they can do it simulated
Starting point is 00:37:16 this is a good example of the type of work we do so NIF is where you're trying to do it physically we're trying to get fusion burn there. Um, but to understand how this thing is working, right. Um, we have to do simulation to, you know, prototype the designs. And I think we do about 400 real shots, um, in a year over at
Starting point is 00:37:38 NIF where we actually, you know, turn the lasers on point them at a target. Well, it's not too many. Yeah. We, well, it's in, and we're ramping that up. It's a scientific facility, so you can do research for lots of different groups. Yeah. In conjunction with that, we do simulations to see if, you know, what we're simulating matches what really happened, right? And that's an iterative process.
Starting point is 00:37:58 So you do more simulations. You say, okay, it matches. How do I change the design to, better, to get more energy out? And then you go simulate that. It says it's going to do better. You try it. Maybe it doesn't. And then you iterate on that until the two match.
Starting point is 00:38:15 And ideally, that's the process that we use for designing these things. So that's where the HPC comes in, is simulating something like that takes an awful lot of compute cycles. So like I work in Livermore Computing, which is it's a compute center, kind of like a data center. But we have machines that are dedicated to doing just in this building um for for all of our computing needs and we have some of the largest machines in the world here that people run these these parallel applications on two main cores huh that's a lot of yeah we have one machine with one and a half million which is number four i, I think, in the world now. So that's Sequoia.
Starting point is 00:39:07 And we're installing the new machine right now. It's called Sierra. It's a big IBM system. It's with Power 9 processors and NVIDIA GPUs. This is highly specialized equipment for highly specialized tasks. Yeah, that's true. It seems so. You buy a different kind of machine for HPC than you do for like the cloud. But, you know, some aspects of running a data center and a compute center are very similar.
Starting point is 00:39:34 Managing power, temperature, stuff like that. I would say that security, yeah, exactly. That's important. We've been rolling out meltdown patches all across the facility. I was just going to ask that. Yeah. And the interesting thing, and we see performance hits from that, so we try to optimize that.
Starting point is 00:39:53 How big are the performance hits that you're seeing? There are some reports it would be up to 30%, but it doesn't sound like that's necessarily the case. Yeah, I think that's in line depending on the workload. I think it really depends on what application you're running because it's that system call overhead that you're paying for. So we have an interest in high-performance computing because there's basically never an end to the computing capacity
Starting point is 00:40:19 that we need to simulate the stuff that we're looking into. And so most of the place where we get into architecture around here is in optimizing the performance of applications. So, we have people who work with the application teams and they say, okay, your simulation does this. How do I, how can I make that execute more efficiently on the hardware? And then we also look at procurement.
Starting point is 00:40:44 So we're like, we have this workload. We know that we need to run these things. So what's the next machine that we're going to buy? And so, you know, I was talking about Sequoia. Sequoia is the, I guess, 16 realized 20 peak petaflop machine that we have on the floor right now. Our next machine is going to be 125 petaflop machine. And so the whole procurement process, people get together and they look at the architectures from different vendors and they say, you know, how is our workload going to execute on this? And so I think, you know, in the future,
Starting point is 00:41:19 you're going to have to think more and more about matching the, you know, the applications to the architecture. And we had to think about that because our next machine is a big GPU system. So, I mean, here's an example that probably gets at kind of the heart of this Moore's Law stuff. Sequoia is the previous generation machine. It's about 100,000 nodes. Each node has a multi-core chip on it. And they're all PowerPC chips.
Starting point is 00:41:48 And so our workloads could execute pretty effectively on that. And it was fairly easy to scale things out to a large number of processors. The GPUs have kind of won in terms of that's the thing that has a market out there for games and for other applications. And so, you know, we have to ride along with the much bigger market for commodity computing. And so our current machine is only 4000 nodes. It's got power nine processors on them, and it's got four GPUs per node. And so that's, you know, in terms of number of nodes, it's a much smaller machine than Sequoia, but it's way faster. And it had, you know, it's 125 petaflops versus 20. And so that's where, you know, the GPUs will win.
Starting point is 00:42:37 But for us, that's a big shift because we haven't been, we haven't used GPUs as extensively before. And so now we have to take our applications, import them so they can actually use the four GPUs per node. And that's a challenge. Give us an idea of what range we're looking at here. U.S. dollars. For the big machines? Yeah, the big machines.
Starting point is 00:43:00 Are we talking like hundreds of thousands of dollars, millions of dollars, tens of millions? What's the order of magnitude? So for most of the big machines. Are we talking like hundreds of thousands of dollars, millions of dollars, tens of millions? What's the order of magnitude? So for most of the big machines, like if you're going to get a number one on the top 500 list, which is like the place where they have the list of the top supercomputers, is probably like around, has been $200 million, at least in the DOE, for the system.
Starting point is 00:43:23 And that's procured over the course of five years. We start five years out. We talk to vendors and we get them to pitch. They write a proposal that says, here's how we could meet your specs. And then we have a big meeting where we go and we look at how they project this will work on our workloads. They do experiments with some of our applications.
Starting point is 00:43:45 And we also look at the other specs on the machine how they project this will work on our workloads. They do experiments with some of our like applications. And, you know, we also look at the other specs on the machine and, and different parameters, you know, how much memory is it going to have? How much memory per node? How many nodes are we going to have to use GPUs?
Starting point is 00:43:55 Are we going to have to use like Intel Xeon Phi chips or, or other things? And then we pick the one that we think will best meet our needs going forward. How would you like to close that deal, Adam? $200 million? That's a lot of money. Be the salesman on the front end of that thing?
Starting point is 00:44:12 Yeah, that's a long sales process. You go out to dinner after you make that sale. Yeah. And if you want the details on our current machine, there's a nice article at Next Platform by our CTO who is in charge of that procurement process. Awesome. We'll make sure we link it up in the show notes. All right, Todd, I have a suggested project for you for the NIF folks
Starting point is 00:44:33 after you guys finish that star you're working on. I'm sure they'll listen to me. Next project. Yes. Sharks with laser beams on their heads. I feel like people have come up with that idea before. Just for your consideration. Well, simulate that a few times.
Starting point is 00:44:49 I think you'll wind up with it. Are you sure no one else is working on it? Well, I think you'd be bleeding edge. All right. With simulation, we can make it better. We can make more effective sharks with laser beams. That sounds scary, though. I think we should think about the consequences of doing that.
Starting point is 00:45:15 MARK MANDEL, JR.: Looking to learn more about the Cloud or Google Cloud Platform? MELANIE WARRICK, JR.: But you don't know where to begin? MARK MANDEL, JR.: Check out the Google Cloud Platform weekly podcast at gcbpodcast.com, where Google Developer Advocates Melanie Warrick. MELANIE WARRICK, JR.: Hello. MARK MANDEL, JR.: And myself Google Cloud Platform Weekly Podcast at gcppodcast.com, where Google developer advocates, Melanie Warrick, and myself, Mark Mandel, answer questions, get in the weeds, and talk to GCP teams, customers,
Starting point is 00:45:32 and partners about best practices from security to machine learning, Kubernetes, open source, and more. MELANIE WARRICK- Listen to gcppodcast.com and learn what's new in cloud in about 30 minutes a week. MARK MANDEL JR.: Hear from technologists all across Google, like Vint Cerf, Peter Norvig, and Dr. Fei-Fei Li, all about lessons learned, trends, and cool things happening with our technology.
Starting point is 00:45:51 MELANIE WARRICK- Every week, gcppodcast.com takes questions submitted by our audience, and we'll answer them live on the podcast. MARK MANDELMANN, JR.: Subscribe to the podcast at gcppodcast.com, follow us on Twitter at gcppodcast, or search for Google Cloud Platform Podcast on your favorite podcast app. And by GoCD. GoCD is an open source continuous delivery server built by ThoughtWorks.
Starting point is 00:46:12 GoCD provides continuous delivery out of the box with its built-in pipelines, advanced traceability, and value stream visualization. With GoCD, you can easily model, orchestrate, and visualize complex workflows from end to end. It supports modern infrastructure with elastic on-demand agents and cloud deployments. And their plug-in ecosystem ensures GoCD will work well in your unique environment. To learn more about GoCD, visit gocd.org. It's free to use and has professional support for enterprise add-ons available from ThoughtWorks. Once again, go cd.org slash changelog.
Starting point is 00:47:09 So I guess the question I have is like, if you've got this 200 million dollar computer it's got to be something that's pretty uh demanding right people are going to want to use this thing because you're not going to want to not get the return on investment for that thing so like what's what's it like scheduling, managing a project that's on it? How do you schedule time for it? Do you have to predict how long your project will take the compute time? Like give us a day to day operation of using one of these computers. OK, so I mean, I can't speak necessarily to what the the actual application guys would would do because I'm not I'm a performance guy. So I work with them to help
Starting point is 00:47:45 speed things up. But I mean, the usage model is basically you have to write a proposal to get time on these things. For the bulk of our workload, and this is the case for other Department of Energy laboratories too, you have to write something up that says, you know, I have this scientific problem. It really needs a lot of CPU cycles. It's not possible without that. And here's what it would enable. This is why it's worth, you know, the time on the machines. And so those go through, you know, here and at Argonne and Oak Ridge, all these other labs, a competitive process where reviewers look at the proposals, they evaluate, does it have merit?
Starting point is 00:48:26 And then once that's done, you get assigned hours according to what you asked for on the machine. So you get CPU hours. That's millions of CPU hours or more, depending on what the project is. And the CPU hour is measured in terms, I think we may be doing node hours now. I'm not sure if it's CPU hours or node hours. But basically it's just a measure of how much compute power you're allowed to use. So that's how we justify it. And the machines stay busy all the time because we have science projects that need them for their workloads. We have more work than the computers could ever possibly do. But they are doing it fast, so it enables new science.
Starting point is 00:49:09 So I think in a given day at the lab, there's a bunch of users. We have 3,000 users for the facility. Some here, some are collaborators, some are at universities that we collaborate with. They're running jobs, applications. It's like a big batch system. You log into it.
It's like a big batch system. You log into it. You say, here's the job I want to run. Here's how many CPUs it needs or how many nodes it needs. And here's how much time it needs to do that, approximately. And then we have a scheduler that just goes and farms those jobs out to the system. And so the people at the compute center, we look at what's going on. We try to manage the scheduler so that it has a good policy for all these different users. And we have performance teams who help the application teams actually optimize their code to run on the machine. And that's an iterative process, right? So for a machine like the new Sierra machine I was talking about, we'll typically have a smaller
Starting point is 00:50:12 machine in advance of that that's similar you know we have a power 8 GPU system instead of a power 9 GPU system that we've been testing on and they'll get their code running on that in preparation for the new system. And in that process, we'll run profilers on the code. We will look at traces to see if it's communicating effectively between all the nodes. And we'll help out the application teams by saying, you know, you should modify this or we need to change this algorithm. I think one of the things that we've been helping people with a lot lately, especially with the GPUs and also with other centers using more exotic chips like Xeon Phi, which is like an Intel many core chip. It's like a 64 core Intel chip. We need the same code to execute well on all these different architectures, and that's not an easy process.
Starting point is 00:51:06 So if you have a numerically intensive code, you write it one way. It might execute well on the CPU, but not on the GPU. And we'd ideally like to have one code that the application developers maintain and have that – have essentially some other layer handle the mapping of that down to the architecture. So one of the things we've developed is we call them performance portability frameworks. We have this thing called Raja. It's a C++ library where you can write a loop. Instead of a for loop, you write a for all, you pass a lambda to it, and you pass that for all a template parameter that says,
Starting point is 00:51:46 hey, I want you to execute on the GPU or I want you to execute on the CPU. And that allows them to tune effectively for different architectures. They can kind of swap out the parallelism model under there. And so tuning that, getting the compilers to optimize it well for different machines, that's the kind of thing the performance folks have been working on. So you answered the one question that I was thinking when you talked about scheduling is, do these things ever sit idle? Because that would be like the worst use of a huge, you know, massive, powerful, expensive computer is idle time, right? So it's, I guess it's heartening to find out that there's so much work to do, that that's not a problem whatsoever.
Starting point is 00:52:31 In fact, the problem is the opposite, is that you need to start procuring some more to continue more and more research. The other side, too, it sounds like you do a dashboard or something like that. Do you operate – do you ever see the computer? Do you actually get next to it, or do you just operate whatever you need to do through some sort of like, I don't know, like portal or something like that. We have people who get next to the machine and we give tours of the facility to folks who visit the lab sometimes. But you don't have to like put your USB stick into it to like put your program on it and run it, right? You're like...
Starting point is 00:52:56 No, no, punch cards. Okay. No, so basically, I mean, these things look like servers, like you'd be used to, right? Like you have a desktop machine, like you'd be used to, right? Like you, you have a desktop machine, you SSH into the computer, and then there's a resource manager running on it. So like Slurm is the open source resource manager that we use, uh, is developed
Starting point is 00:53:15 here. Um, and now it's got a company, SkedMD around it. Um, and the users would say, you know, S batch, um, command line. And then that they would take that command line, put it in the queue, and then eventually run it on however many nodes they asked for. Or, you know, S run if they want to do it interactively and wait for some nodes to be available. And, you know, the wait times can get pretty big if the queues are deep. So, yeah. So you get assigned hours, but you don't get assigned like 9 in the morning to 10 in the morning. You get just hours and you're in a queue. Whenever your queue comes up, you execute.
Starting point is 00:53:52 Right. You get a bank that comes with your project. We call it a bank. That's how many total CPU hours you have. If you submit a job, then, you know, when you submit it, you have to say, here's how long long I expect it to run for and the scheduler will kill it after that much time and then you nodes you want and then it runs for that long and however you know the length of time it runs times the number of nodes that used you know times the number of CPUs per node is how much they subtract from your bank
Starting point is 00:54:20 at the end of that and so so effectively, you get a few multi-million CPU hour allocation. You can run that out pretty quickly if you run giant jobs that run for a long time. So Todd, I first met you at the Sustain event last spring, almost summertime, I suppose, at GitHub headquarters. You were very involved in that.
Starting point is 00:54:43 And in fact, that's when you hopped into our Slack for the first time and helped bring some people from the lab to that event. And so you have this interest and passion around sustaining open source, because that's why you were there and involved, and we appreciated your help. But tell us and the audience about the intersection, where open source comes in with the work you're doing with the supercomputers and the lab work. Sure, so I'd say two places, and they're big places, I
Starting point is 00:55:13 think. For our computer center, the folks who run it, we prefer open source for nearly everything, um, for the resource manager, for the file systems. We have big parallel file systems like Lustre, for even, you know, the compilers that we use. We're investing in Clang, or in LLVM, to create a new Flang to do Fortran for some of our codes. And so, you know, I would say that the majority of what we do at the compute center is open source in terms of the infrastructure that we're using. Our machines run Linux, and we have a team downstairs
Starting point is 00:55:55 that manages a distribution for HPC. We call it TOSS, the Tri-Lab Operating System Stack. That's basically a Linux distribution with our custom packages on top of it, and that's how we manage our deployment for the machines. So that's one way. And then we have the people who work on those projects themselves, like ZFS, which is used in Lustre. We have a guy who actually did the ZFS on Linux port and manages that community.
Starting point is 00:56:29 And I think we get a lot out of that. It's Brian Behlendorf at Livermore. Not the Brian Behlendorf who's doing blockchain stuff, but actually another Brian Behlendorf. I was going to ask that. The same name. Yeah, there are two Brian Behlendorfs in open source. Same spelling and everything?
Starting point is 00:56:42 Yeah, everything. He said that they met once and talked to each other. That is confusing. We just had the other Brian Behlendorf on the show. We interviewed him at OSCON last year. Listen to that. Hyperledger. So this is ZFS.
Starting point is 00:56:57 So there's the ZFS Brian Behlendorf and there's the Hyperledger Brian Behlendorf. One of them is in the building with me. Yep. You know, we were talking about how we procure these big machines, and there's a contract associated with that, in that we allocate some time for the vendor to contribute to open source software. We require that as part of the contract. And so they work with us, and they make sure that our software, and other software that we care about from the DOE and elsewhere, actually works on the machine. So that's another way we interface with the open source community.
Starting point is 00:57:34 On that note, then, it sounds like you're pretty intimate in terms of what's involved in the process, or what's on these machines. Do you have good relationships with those who sysadmin these machines? Do you, as a collective, are you able to say, well, we prefer this flavor of Linux? It seems like, since you choose open source, you have some sort of feedback loop into the preferences that everyone can put on this machine, and all these fun things you do. Yeah, so at this center, I mean, there's a software development group, there's the system administration group, they're all in the building that I'm in, which is attached to the compute center, um, and there's a lot of crosstalk between those different areas. Okay. And then we also
Starting point is 00:58:22 talk to the researchers, right, um, who run applications on the machines. Yeah, I would say that, you know, Livermore Computing, at least, like on the infrastructure side, is definitely involved in, you know, choosing what open source we want to run on the machines and when we maybe don't want to go open source. Like, we run proprietary compilers
Starting point is 00:58:40 because they're fast, for example. But we also do things like invest in the open source compilers, to say, you know, we want an open source alternative so that we have something that we can run on the machine that will work, and work well. The reason I ask that is because it seems like, you know, the application process is very protected,
Starting point is 00:59:00 to manage the load on those machines and the time. And so I just wondered if, you know, the involvement of what's on the machines, and who manages them, and all that stuff is just as protected. It seems like one side is a little bit more loose, but to get the time, it's a big ceremony, a big process, and it could be gatekept to some degree. Yeah, I guess I would say that HPC is a research computing field, right? It's mostly researchers who need this much compute power.
Starting point is 00:59:35 And so the calls for proposals are not unlike the calls for funding that people put out in academia. There are open ones. And so, like, the Office of Science labs have the INCITE program where, you know, you can apply for time on Oak Ridge and Argonne's machines, which are similarly large, if not larger; Oak Ridge has a larger machine than ours right now. And for us, our customers are slightly different, because we work with Los Alamos National Lab and Sandia National Lab. And so our proposal processes, at least on the classified side, are mostly between those labs, because they're about the weapons program and stuff like that. But then there are other places where you can get time for basic science runs there. And we let, you know, early code teams, like the guys who maybe have an important science application that isn't as complicated as maybe some of our production codes, who can get on
Starting point is 01:00:30 there early and, you know, show off the machine. We let them run; there's a few months at the beginning where we let them, um, use the time with allocations there. So I guess I'd say there's a lot of different ways to get time on the machines, and, you know, it's pretty low overhead. It's not, you know, quite like writing a full academic proposal. It's pretty open. And we're on this open source kick; I was just curious how that factored in. Because as you're describing your choices, and I guess the primary choice of choosing open source, and that's your preference, it seemed like, you know, while there's a lot of process around the proposal flow, maybe there's a little bit more crosstalk, as you mentioned, and involvement with other teams that have access to these super expensive machines.
Starting point is 01:01:14 That's a huge privilege, because I don't have access to a $200 million machine. I can barely afford one that costs seven, you know, like, and I gotta, like, borrow money from grandma or something like that. Seven what? Seven million? Okay, thousand. You said 200 million, and then you said you could barely afford one that costs... Well, I assumed everybody thought that was me. A thousand. Like, I assumed the denominator is staying the same. Grandma ain't got millions, man. You can quote me on that. I guess what we're trying to say, Todd, is how can we get some time on this computer? We've got some research.
Starting point is 01:01:52 Yeah, so I guess if I had to boil it down to something, you have to have justification for getting on the machine. You have to be able to show that you can make scientific progress with your hours. So that's what the process is about. Sharks and lasers, man. Sharks with laser beams on their heads. I told you my justification already. Ill-tempered sea bass. Oh boy.
Starting point is 01:02:11 I guess the other elephant in the room, Todd, for justifications, and you addressed this to us in the break, but please for the audience's sake, because I know that we probably have a fair amount of Bitcoin miners listening. And so I think that is the other thing that has people kind of, you know, putting their pinky up to their mouth. CPU time. Yeah. What about Bitcoin mining on these rigs?
Starting point is 01:02:34 Okay, so we're not allowed to mine Bitcoin on these machines. It's not legal to use government machines to mine for Bitcoins. But even if you did, it wouldn't be worth it. If you look at what people are using for Bitcoin mining, a lot of that is, like, custom chips, uh, they're very low power and only do hashing. So you'll do way better investing in that than you will in one of our machines. And I think at some other compute center, people have been fired for trying to mine Bitcoin on the machines. Wow.
Starting point is 01:03:06 Yeah, yeah, yeah. You can Google for news stories about that. But I guess I want to be a little clearer on the openness front. So the application for time is separate from software that you actually run on the machine, right? So we have a lot of open source projects that live for much longer than any allocation or any machine really. And there are a lot of people who work on those and those are open. Some of them, you know, you can even run on your laptop and scientists do. So like a lot of development for these machines does happen on people's laptops and then they scale the program up to run on the big machine. So, you know, there's a lot of open source software development
Starting point is 01:03:46 that happens, even if the process for getting access to a big machine isn't open. And you can run that open source software on the machine that you do have for 7K. Or less. Or easily 7K. Yes.
Starting point is 01:04:03 Your grandma could run that, Adam. That's right. Grammys can run that. So what about the open source community? Seems like any time when I think of a government operation, you think especially with the security constraints and a lot of the quote-unquote red tape, seems like actually deep involvement with a community
Starting point is 01:04:23 that's built on openness and freedom and all these things that are kind of the opposite of, like, secrecy and closed. Is there any, um, give or take there? Is there red tape? Are there issues around that, or has it all been, you know, pretty easy in terms of integrating your open source work into the greater open source community? So I'd say that historically, Livermore is pretty good about open source. I mean, we started using Linux clusters like in the late 90s, and, you know, we've been working on the operating system for that. Things we've developed, like Slurm, the resource manager, have been open source for a long time.
Starting point is 01:04:59 So putting stuff out there has not been such a problem. There is a release process that you have to go through that's kind of cumbersome, but once you do it, you know, like we did for Spack, the package manager I work on, you can work out in the open on that, as long as you stay within a certain project scope. So, yeah, I mean, there is some red tape around that. Obviously, we don't want to release some of the things that we develop. But then again, we use a lot of open source internally and benefit from the broader open source community. So I would say that DOE has a pretty active open source development ecosystem, and we leverage things that are developed by other labs, and other labs leverage things that are developed by us. And I think there's a lot of back and forth.
Starting point is 01:05:46 I would say that like the, the interaction model on the, on the projects is maybe not quite the same as like a large infrastructure project. Like, I don't know, like Kubernetes or Docker or something like that, just because I mean, it's,
Starting point is 01:05:57 it's scientific computing. So people get funded to solve a particular problem, not to develop software. So there are, you know, sustainability issues around how much software development time can we actually put on this project. On the production side, though, the facilities, I mean, their job is to keep the center running
Starting point is 01:06:15 and to do it efficiently. So that's, I think that's why you see a lot of open source coming out of there. But, you know, then again, there are-lived research projects that are very widely used. So one good example of that is in the math library community. So for large-scale parallel solvers, the different labs have teams working on that stuff
Starting point is 01:06:37 and there are some solver libraries like Livermore has Hyper, Sandia has Trilinos, and Berkeley Lab has some solvers. And also things like finite element frameworks, things for meshing and for building these big models of physical systems. So Livermore has a library called MFEM that has a big open source community around it. Well, not big by JavaScript standards, but big by scientific computing standards. Right. Yeah. So some of them operate like communities, I would say, and then others kind of tend to, you know, stay within a particular group or, you know, they maybe don't have a cross
Starting point is 01:07:14 lab community. It just depends on the software and what the funding and the interaction model has been historically. I do think like more community could help a lot of projects. If people started thinking more in terms of like, how do I sustain this over time? How do I get more contributors? I don't necessarily think that we build research software with growing contributors in mind. I think it's interesting that you got the three I took note of earlier. The one you obviously talked about on request for commits back.
Starting point is 01:07:46 That's the product you work on primarily. Yes. Then you've got Slurm, which I think is a workload manager, if I got you correctly. That's actually what you interface with to put your products onto a supercomputer. Is that right? Yeah, it runs the jobs. It runs the jobs. It does the scheduling.
Starting point is 01:08:00 Yeah. Then you got Lustre, which I was just noticing down in the trademark is a Seagate technology trademark. So that means that that's the file system. So these things are important enough for you to have open source projects alongside them that I guess are more specific to, say, a supercomputer scenario versus, say, a laptop scenario. Yeah, for sure i mean we we have to pay um for open source development for you know like a parallel file system that'll run fast on our machine we have to keep the computers working so right yeah a lot of the infrastructure projects are aimed at that seems like some of this stuff should come with a 200 million dollar computer
Starting point is 01:08:38 well so i mean i'm just thinking like so it does to some extent like so you get a t-shirt or anything? Huh? So we haven't gotten t-shirts, you do get a mural. You can get a mural painted on the side of the machine. So like the, if you look at like the nurse machines, like the, they have a picture of Grace Hopper painted on the side of their machine. Um, but, but so I will, there is a lot of software that comes from the vendor. So like Cray provides a whole programming environment with their system.
Starting point is 01:09:10 It's not necessarily open in the same sense. If you buy an IBM system, they will bundle like their file system, which is GPFS. It's a competitor for Lustre. It's proprietary. And you know, which one you go with depends on what value do you get out of the procurement. Which one do you think is going to perform better? I'd say performance drives a lot of the decisions at the procurement level. But openness is also a big factor.
Starting point is 01:09:37 Does it come with hard drives? Yeah, so the system would come with a parallel file system. So it's not just hard drives. It's like racks and racks and racks of hard drives. Right. I was going to make a joke to say like when you get it, do you just wet the drives and put your own stuff on it like you do in the old? Do you do it at scale essentially? Like when I get a machine, even a Mac,
Starting point is 01:09:56 I sometimes will just wipe it and put a brand new version of OS X on there or Mac OS now because I just like it. You know, I just do that, you know, my own way. Yes, we, we, yeah, we do that effectively with our Linux clusters. So we, we build our own distribution, like I was saying. And so we have a toss image that we run across those for the bigger machines. Like, so for what we call our advanced technology machines that are in these, you know, large procurement packages, you know, it's much more vendor driven in this because it's bleeding edge. So we rely a lot more on the vendor to provide the software
Starting point is 01:10:36 stack. Although, I mean, our, you know, the next machine is going to be Linux. So the machines, for the OS at least, run Linux. Which flavor of Linux? So across the center here, we run RHEL. That's the distro that typically is at the base of our TOSS distribution. And then some machines run SUSE, but not at Livermore. Like, the Cray machines, I think, use SUSE as their base distro. They also build their own kind of custom lightweight versions of Linux for the actual compute nodes.
Starting point is 01:11:12 They want to reduce system noise, so they don't want a lot of context switching going on and stuff that would slow down the application. When you say RHEL, that's R-H-E-L, right? Yeah, Red Hat Enterprise Linux. Gotcha. Yeah. Gotcha. Yeah. Cool.
Starting point is 01:11:26 This is pretty interesting, to kind of peel back the layers of a supercomputing research laboratory like this and, you know, see how open source fits in, how, you know, $200 million would advise, you know, how you procure time, how you propose time, how you interface with other teams that manage open source software, how you determine preferences. And I mean, this is an interesting conversation. That's not exactly the typical episode of the Changelog. So hopefully, listeners, you really enjoyed this.
Starting point is 01:11:56 And if you did, there's a way you can tell us. You can either hop in GitHub, so github.com slash the changelog slash ping, or join the community. Go to changelog.com slash members... what is it? No, it's slash community. Sorry, go there. Which is what Todd did one day, and he's like, hey, uh, y'all are doing this conference called Sustain, I'm gonna go and I want to bring some friends, and wow, this is an awesome community. So maybe, Todd, to close this out, what can you say about hanging out in Slack? Hanging out in Slack?
Starting point is 01:12:27 With us. With Slack and with us. So I do that because it's nice to be in touch with the, well, with I guess a different open source community, right? So I think the changelog is kind of heavy on web development. I used to be a web developer before I came to DOE. So I like to keep up with that stuff and see what's going on out in the cloud as well as over here in the DOE. So it's been a nice time.
Starting point is 01:12:54 Well, Todd, thank you for coming on the show today, man. We're huge fans of yours. And just thanks so much for schooling us on Moore's Law. Appreciate it. Cool. All right, that's it for this show
Starting point is 01:13:07 Thank you for tuning in. If you haven't been to changelog.com in a while, go there, check it out. We just launched a brand new version
Starting point is 01:13:13 of the site. Go to changelog.com and subscribe to get the latest news and podcasts for developers. Get that in your inbox every single week.
Starting point is 01:13:20 We make it super easy to keep up with developer news that matters. I want to thank our sponsors: Rollbar, DigitalOcean, GCP Podcast, and GoCD. And bandwidth for Changelog is provided by Fastly. So head to fastly.com to learn more.
Starting point is 01:13:34 Error monitoring is by Rollbar. Check them out at rollbar.com. And we host everything we do on Linode cloud servers. Head to linode.com slash changelog. Check them out. Support this show. The Changelog is hosted by myself, Adam Stachowiak, and Jerod Santo. Editing is by Jonathan Youngblood.
Starting point is 01:13:51 Music is by Breakmaster Cylinder. And you can find more shows just like this at changelog.com or wherever you subscribe to podcasts. See you next week. Thank you. Bye.
