Big Compute - FedRAMP, GPUs and the future of Federal Agency Computing

Episode Date: November 20, 2019

Kevin Kelly hosts Christopher Chang, Computational Scientist and Acting HPC User Program Lead at the National Renewable Energy Laboratory. They discuss the incredible capability of scalable high-performance computing (HPC), now available securely through FedRAMP.

Transcript
Starting point is 00:00:00 Well, hello, I'm Kevin Kelly, and this is the Big Compute podcast. Today's topic is cloud HPC for government users using FedRAMP and other technologies. Today, I have with me Chris Chang from the National Renewable Energy Lab, also known as NREL, which is a research lab part of the Department of Energy. Hey, Chris, how you doing? Good, Kevin. Nice to talk with you. All right. Well, thanks for coming on. First thing I want to get out of the way is I just absolutely love going to the NREL lab in Boulder, Colorado. It is,
Starting point is 00:00:47 it's beautiful for a government building. It's just... Well, thank you. I take no responsibility for that, but I also find it a pretty nice place to work. Yeah. Well, I mean, hey, when you're approaching the lab every day for work, it's a whole lot different than approaching a large 70s-era concrete government building. I go into a lot of those, too. It's just the use of natural materials and different kinds of woods and things like that in the building. I love being there. Yeah, it is definitely a unique place. And it's kind of in keeping with the theme of the lab itself. So I wonder if you could talk about that for a few minutes, because, you know, most
people I talk to don't understand, or don't even realize, that the government is in, well, they're in the business of doing research for alternative energy. Yeah, yeah, definitely. So NREL is one of a complex of Department of Energy labs, and there are various kinds. There tend to be two main categories. One is the kind of general-purpose labs, if you like, so Oak Ridge, Argonne, they're general purpose. NREL is a mission-focused lab. And so there are a handful of labs in the DOE lab complex that are really focused on a mission. The National Renewable Energy Lab is focused on, obviously, renewable energy, but also energy efficiency.
Starting point is 00:02:38 I love reading and seeing about the different research projects that are coming out of there. They're pretty out in front in terms of a lot of the things they're looking at, and how they're trying very hard also to transfer a lot of that research and apply it in the commercial sector. So, hey, as a taxpayer, I love this. And as a resident of planet Earth, I'm really into it also. I'm very happy to see this going on. Yeah, absolutely. One of the things that makes NREL special as well is that it's focused on what might be called bridging the gap. So there is a fair bit of basic science going on, but a lot of the focus is on applied science, engineering, and techno-economic analysis to understand how new technologies might percolate into the marketplace, how they might get taken up by industry, that kind of thing. So, yeah, it is very practically minded. So with all those focus areas that you're talking
about, there's an awful lot of high performance computing and other types of research computing that goes on at the lab. Definitely, and it is always in short supply, so there's more demand than there is supply, and it's growing. So I think traditionally some of the more basic science research has embraced large-scale computing. That can be the only way they can get to where they want to go. Industry is kind of in that transition, I would say now, as I perceive it, in starting to see the possibilities that large-scale computing can bring. Obviously, in places like probably Boeing, they already have embraced it, doing huge engineering projects. But now that mindset, as large-scale computing is becoming more available,
Starting point is 00:04:47 is starting to percolate out into new stakeholders. Yeah, well, I can definitely see then how the demand just increases, right? So, I mean, on site, the lab has some pretty big HPC systems. I think it's a Peregrine and Eagle. Right. We are on our second, I would say, main self-hosted system now. Peregrine was our first. That was ordered two, two and a half petaflops and that just got retired last month actually and then Eagle they named them after birds so Eagle is our current system that
Starting point is 00:05:37 went into production early this year and that is eight half petaflops, about 2,200 nodes. So it's a fairly big, yeah, in-house computing capability. Yep. And I imagine, given the amount of demand, that it's a pretty busy machine, high utilization. Absolutely. Yeah, it's already oversubscribed, so we're still processing requests for fiscal year 20 coming up, and it was not quite UX oversubscribed, but close to that. So, yeah, as soon as it's there, it's already not enough in some ways. Wow. Well, then, what is a researcher to do, right? The machine is oversubscribed. We hear people looking in other places to go do their big compute.
Starting point is 00:06:38 The cloud seems like an interesting place, and it's been really a focus area for you or an area of investigation for you over the last couple of years you know computing is constantly changing but one of the currents that has been flowing past us is the commodification of computing so cloud is one of the manifestations of that where you think of computing almost more as a utility, the way you would get power in your house more than, you know, big iron on premises necessarily.
Starting point is 00:07:15 We're in that stage of investigating, you know, what can we think about in terms of that particular current? Well, I have to also think then that now the stakeholders within the lab kind of take a look at this and say, well, that sounds great, but there's a Pandora's box that you're opening up here on governance and compliance and economics of doing it, right? I mean, cloud computing has been around for a while now. Commercial clouds have been out there for a while, especially with the AWS and Azure and the others all ramping up.
Starting point is 00:07:57 But early on in the market, I heard a lot of people saying, hey, HPC is not something that we want to do in the cloud because of the special requirements for that. So you have this technology issue, and then you have this compliance issue. Are we at the point where we can actually go do it? It depends on how you define HPC. DOE has its leadership class facilities and the computing that goes on there is not what
Starting point is 00:08:31 one would migrate to the cloud first, let's say. Not necessarily if you think about the economic scales of the hyperscalers like Amazon, the systems that DOE is exploring are, you know, let's say 200 million. That's in some ways rounding error for, you know,
Starting point is 00:08:53 some of these companies. So it's not that they couldn't in principle host something like that, but whether or not it would fit into their business model is a different matter. So I think there's that type of HPC is going to stay probably the way it is for a little while. But we tend to have customers who want to do computing that doesn't necessarily fall into that model. So in some ways, it's more general technical computing. There you might have lots of just single node jobs and they just need to run a ton of them or modest scaling needs. So maybe they need a handful of nodes
Starting point is 00:09:35 that are connected through a high performance network, but they don't necessarily need huge scaling. And data centric computing is coming online now as well at the lab, and there that opens up a different set of challenges and opportunities. You know, can we do HPC in the cloud now? It sort of depends on which part of HPC you're referring to. I think some things could fit very well today. Other things might never really be a great fit.
Starting point is 00:10:08 But the cloud computing space, as you know, is evolving quite rapidly. So it's hard to say in a year whether that assessment will still be relevant. My experience with working with the labs on that is very consistent. I think there's just such a, like you're saying, such an economy of scale when you have some of these hyperscale machines, the cost per core hour is so hard to replicate in a commercial setting. But definitely for specialized projects or shorter term projects, there's an awful lot of things out on the edges that we've seen that. But, you know, the biggest roadblock that I experienced doing this over the last three, four years really had to do with compliance
Starting point is 00:10:54 and with security, right? I mean, look, you work in a federal lab. It's a secure facility, and the data there is protected for a lot of good reasons. And so when I've had conversations with folks about the concept, really the big thing that stopped us from moving forward had to do with figuring out how we could do this in a way that was secure and that everybody was satisfied with. It met the standards. There are, of course, the classified labs that are doing research that probably would not make it onto a public cloud in any case.
Starting point is 00:11:39 But for NREL, the biggest challenge, I think, is working with industrial partners. And so they have sensitive data that doesn't rise to a level of classified, but of course you still don't want it where anyone other than the stakeholders can see it. The lab has worked with Amazon, has worked with FedRAMP in the past and got the certification needed to host that kind of data. And Microsoft as well, I think, is in that pool. But, yes, compliance is definitely a challenge that's currently getting addressed in the cloud space.
Starting point is 00:12:19 And some of the cloud vendors can handle that and some can't yet. But I think it's moving in that direction. So I imagine you'll mention FedRAMP. My understanding of that is really to have the federal government have a single framework in which to think about these compliance issues in computing? Certainly cloud is relevant there, but more generally, how do you think about security in computing cybersecurity and unify that vision across the federal government?
Starting point is 00:12:58 Yeah, I'm glad you brought it up. So for folks that are listening that aren't familiar with FedRAMP, it's actually a program administered by GSA, the General Services Administration. And so the Federal Risk and Authorization Management Program, FedRAMP, is really designed to accelerate the adoption of cloud, but also have this like consistent security baseline and authorization process. Because I think the frustrating thing for folks like me that are on the commercial side is that for every government customer that I've gone to over the last few years to talk about this had a spreadsheet that had 200 you know, 200 things on it
Starting point is 00:13:46 that we needed to answer or approve. And of course, everywhere you go, that 200 question spreadsheet is different. Maybe there's 120 questions that are in common. And, you know, then you start with, you know, the ones that aren't. And so it made it hard. And then there was always that question as to whether it would get authorized up the chain, right, by the chief, the CISO, right? So the chief information security officer and the other folks. Yeah, so FedRAMP for what I've seen is is I guess it depends on how you look at it. It's either opening doors or it's greasing the skids because for a lot of folks, it is enabling them to see things that they like, oh, I didn't consider this before
Starting point is 00:14:36 because I just didn't think I couldn't do it. Or I've always wanted to do this, but now this is a way to accelerate it. Yeah. It's nice to have a single set of standards and even contacts to work against. So even if the lab is doing things in its own way, you know it has to map to that unified framework at the end. And that in itself I think is a value. Yeah, it's both combination combination of and opening doors. What's really helped cloud HPC on that
Starting point is 00:15:09 end has been non-HPC workloads because FedRAMP is for all of industry. It's not an HPC only thing. It is a general computing thing. By having
Starting point is 00:15:24 not just the, you know, the big players in the space get FedRAMP authorized, but seeing other more specialty area, you know, ISVs and folks go through the process. It's really helped kind of open that door for the HPC market because they're like, wow, this is possible. It's hard, but it's possible. And there's this momentum, right? So I was in contact with a company that does some financial management software. And it was great. Like, it was like, they were going through the process and we were able to talk about it and their experiences with the labs. So, um, in general, it's a, for me, it's, I look at it as a, uh, a federal program that yes, it's, it's helping in a lot of areas. It's definitely going to help drive HPC,
Starting point is 00:16:24 but there's a whole industry behind it that's going to drive all sorts of other things. And we can't even think about it yet. Yeah, and a lot of the business processes were migrated to the cloud long before we were talking about HPC. So I absolutely hear you there. And the other point that brings to mind is HPC, compared to commodity computing, is a pretty small fraction of the market. And there's a reason a lot of HPC is done on x86 processors instead of the vector processors of yore, because the commodity computing market went to x86, right? That's where all the money was. That's where all the development was.
Starting point is 00:17:07 That's where all the investment was made. And HPC is good at figuring out how to use and leverage that investment in industry, right, to get work done. So, yeah, as the rest of the world goes, so HPC will probably follow to some degree. No, I think that's a great way to look at it. I know you and I have talked about, you know, like video gaming and things like that. But when you think about it, right, like the GPU technology really came, you know, there was some smart HPC people that said, hey, we can go use that. Yeah.
Starting point is 00:17:55 Yeah. Yeah. It was a much larger industry that was driving the development of all that infrastructure before HPC got to the table. Yep. And of course, if you keep following that thread, now we're seeing a lot of, of, of AI and machine learning stuff coming out of, of that corner of the industry because of the, you know, the,
Starting point is 00:18:21 the great work that's been done with GPU technology. And so we're looking at that as another skipping stone to doing it in the cloud. Yeah, and certainly in my area of science, I sort of self-identify as computational chemistry, I guess. You're seeing more and more machine learning and AI applications in that space as well. So it's not just the computing, but the science that depends on the computing that follows that too.
Starting point is 00:18:57 So that'll be interesting to see how that develops. The challenge, of course, is the model for science. A lot of machine learning is still somewhat black box at the moment. And so even if you develop something that's 99% accurate, outperforms what you can do by hand. If you can't map it back to some set of basic principles, do you understand it? Does that count as science? It's certainly useful, but where does that fit? So I think there's kind of this connection between, you know, computing and AI and science engineering analysis
Starting point is 00:19:39 and how useful these technologies will be or how they will fit into the process of science engineering and analysis going forward. You know, there are some use cases that are spot on and others where, you know, it will take a little more thinking to figure out where it fits. Sorry to get on track. Well, you're doing great no what what it actually makes me think about you know some folks that i've dealt with where they want to go try it but the cost of entry or the barrier to entry is really hard or really high right where hey you know what i would like to look at that
Starting point is 00:20:21 but in order for my lab or my organization to give me the proper tools to go do it, it's an awful lot of investment, right? So some things never get an opportunity to happen because you're kind of waiting for, you know, that moment, right? When there's enough, there's enough inertia in the organization to go buy resources or procure resources, like, you know, let's look at GPUs, right, to go do it. And so I still, I, you know, for me, I think about cloud all day, right? And I'm like, well, this, this opens that door. So as a researcher, now there's a pathway for you to go do that and say, look, I just want to try rent equipment to go do that and then see what happens.
Starting point is 00:21:12 Yeah. That was the model I was thinking of as you were talking is, you know, going to rent an auger or something, you know, piece of equipment, you wouldn't buy it yourself for one job. And so if you, you know, you didn didn't and there weren't a rental market, you might just never get the job done. But because you have this ability to go out there and borrow or rent something for much less than you would pay if you wanted to buy things outright,
Starting point is 00:21:40 that opens some doors. And, yeah, I can see cloud being a great platform for exploring certain things that you wouldn't necessarily buy outright and bring it up in-house. On the flip side of that, there has to be enough interest for the cloud providers to actually put that on the floor as well. So, you know, if it's too esoteric, there's probably not going to be enough interest to actually get something in the cloud. It has to build some critical mass.
Starting point is 00:22:16 Enough people have to want that auger so that you can keep a rental market going. So, you know, FPGAs might be an example where there's probably enough interest that they can make a small investment and get a return on that investment. But specialized ASIC might not be worth the time and you just have to buy that and get it in-house if that's what you need. Yeah, I think you make a great point. It does help that we do have some commercial cloud providers that are big enough that, you know, what seems like grains of sand to us, they, there actually are. But when you are a large cloud provider, you can collect that information on interest, you know, interest within the market and then offer that. So, you know, it's a great example. There are FPGAs that you can, in effect, rent out on the commercial cloud right now. Yeah. Is there selection?
Starting point is 00:23:25 Is there the level of software support? You know, that probably could be a lot better. But if it was just one lab or one guy in a lab, one researcher, yeah, they'd have to go buy it, right? They wouldn't be able to do this shared model or the rental model to do that. Yeah, I also think kind of looping back to what we talked about earlier on fedramp is if now if you're doing the rental model you can do it in a way that is compliant and is not going to get get a talk i started talking to from security folks right because
Starting point is 00:24:01 it's all great that you can go out and on demand this stuff, but it still needs to be, you know, within the parameters and the guidelines and the governance that your organization has set up. Yeah. And, you know, it is a big deal. There are providers who we can work with and providers that we just flat out can't based on their FedRAMP status. And, you know, that can affect a lot of not only the choices that you make in terms of going out and procuring some capability, but also in the discussions you have that, you know, when we think about what the future looks like, you know, if something isn't FedRAMP certified,
Starting point is 00:24:50 there's sometimes not even a point in bringing that up. People might just get a little aggravated even hearing about it because, hey, if it's not doable, why talk about it? So, you know, FedRAM federal certification is pretty important I can definitely see as it isn't the acceptance of it accelerates that it'll be more it'll be the standard so right now there's actually legislation winding its way through Capitol Hill to make FedRAMP to actually you know authorize it so that it is a permanent part of the government.
Starting point is 00:25:28 So when that happens, right, and it is legislated, I think that you're actually going to see a lot more pressure or encouragement to use it. Yeah, and I think it'll probably get streamlined as well in the process but certain things that might be a little confusing or difficult now will be more manifest in terms of how to achieve that certification so once things get more standardized it's not just at the status of a program but it's it is the way to do things you know I think there'll be less mystery around it than perhaps you had seen or I had seen. Yeah, that's a great point. And that's on both sides of the table.
Starting point is 00:26:13 I think for the providers that, you know, that we're familiar with, that I work with, it can be a long road. It's a huge investment, you know, time, people, and money. And definitely for the government agencies that are working with it. can be a long road. It's a huge investment, you know, time, people, and money, and definitely for the government agencies that are working with it. But on the other side, the other agencies that can leverage what your agency has done or another agency has done, I think is very appealing, right, and very appetizing to somebody. so that, you know, if the Department
Starting point is 00:26:47 of Energy issues an ATO, which is an authority to operate for a provider, for FedRAMP, other agencies and the government can take that ATO and leverage it. They don't have to go through as much of the process and work. And so that's huge. For you, that means you can leverage what the Department of Defense or the Department of the Interior has done. It opens up all sorts of doors. Right, and that kind of accelerates the growth. So you get rapidly accelerating adoption, which, yeah, definitely be good.
Starting point is 00:27:27 Because you don't mind making the investments as long as you know that it's an investment, a good investment. When there's a lot of confusion or ambiguity around the process, it can be difficult to know, am I even doing the right thing? Or is it just not worth starting until somebody forces me to so i yeah welcome to standardization yep yeah and so um you know what what are we seeing on that end i see a lot of people asking right where has this been done already? Right? Who else is doing this? You know, can I, you know, basically, you know, hop on their back and have them, you know, bring me,
Starting point is 00:28:11 bring me part of the way or, or to the finish line. Um, so it's great. Again, you know, I made the, the wise guy remark earlier as a, as a taxpayer, but I love this stuff because it's, this is, this is what we should be doing, right? It's this collaboration between agencies. It's smart money management, and it's going to help us do better things because it's going to speed up things. This is acceleration. I love it. Yeah, I agree. So, Chris, I wanted to shift gears a little bit. Chris, I wanted to shift gears a little bit. Yeah, I wanted to shift gears a little bit because, you know, you said you self-identified as a chemist.
Starting point is 00:28:54 What are the things that, you know, you're looking at and saying, wow, you know, when we go to design our next environment, right, Whether it's an on-site machine, because we know that these are years in the making, or it's a cloud or a combination of that. Is there a technology or a capability or a trend that you're seeing that you're considering for, you know, hey, the next time we do this, here's what we're going to try? Yeah, so, you know, we're already in the process of investigating what the requirements and needs of the user community are for our next system.
Starting point is 00:29:35 And, you know, certainly the currents that are going on in the industry drive some of that. So people are – there's certain subsets of computing are adapting rapidly to GPU computing. And, you know, some of the backdrop for that is that what's called the exascale computing project, which is GOE's sort of vision for HPC and large-scale computing going forward is highly dependent on acceleration, so not CPU-driven computing as much as things that are plugged in sort of next to the CPU.
Starting point is 00:30:18 And that environment is basically condensed down to GPUs. So in some ways, GPUs are the only game in town. And so a lot of demand is starting to look more GPU-centric because basically the future of HPC is looking, at least the near future, is looking more GPU-centric. So we're thinking along the lines of how much GPU capability do we need in the system?
Starting point is 00:30:49 We have some workloads that are ready to go, machine learning, molecular dynamics. So that would be the chemistry inspiration. Or it can use GPUs very well. There are other pieces of computing that have not yet gotten there. And so, you know, as the system evolves, or as the world evolves, the GPU world evolves, I think more and more work will be able to be mapped onto GPUs
Starting point is 00:31:22 just by nature of the evolution of the tools and the software stack around GPUs. But there's also efforts by developers and user communities to adopt GPUs in their computing because there can be such gain from them. So that's one aspect of system design, certainly, that's relevant. Another is the rise of data-centric computing. So storage is becoming more of a focus. I guess I should say storage capacity. So there are very large data sets out there,
Starting point is 00:32:06 and people want to do new and different things with them. Some of those data sets are in-house. Some of them are already in the cloud. And so it can be impractical to move large data sets to where the computing is. It can make sense that when you design a system and you're supporting some fraction of data-centric computing, you need to make sure that the computing is done near the data rather than assuming that the data is going to get moved local to your compute.
Starting point is 00:32:40 And there are a variety of other design points, I suppose. We support a lot of interactive type of computing, so it's not necessarily traditional HPC. Our user community is kind of in some ways more engineering and analysis grounded. And the model for batch computing is not necessarily something that they've fully embraced yet. So we try to bridge computing is not necessarily something that they've fully embraced yet. So we try to bridge that, not necessarily just trying to get them into the batch system, but also supporting what their current needs are in computing.
Starting point is 00:33:16 Well, and it sounds to me, too, like you're going to have some sort of a transition or a transformation with how you are supporting these users, right? If all these changes you see coming down the road. And I have to imagine that they're not all the guys like you and me who started out submitting to a script. I'm guessing a lot of these folks haven't seen a script. They all know somebody who's seen a script. So some of our users, a script, and it worked the first time,
Starting point is 00:33:51 and then they will use it for the life of the project, and it does just fine for them. But, yes, there is kind of an evolution of models, computing models, and we're still figuring out the right way to support those. It calls to mind, I had a meeting last week with a government customer where for the first time somebody said to me, you know, we're talking about building a solution with them. And they said, is there an app for that? Because our users, our younger users, our millennials, will want to submit and monitor their jobs from their phone.
Starting point is 00:34:35 And that makes an awful lot of sense to me. Yeah, that's an interesting idea, not only to bring in infrastructure issues, but also compliance issues. So how much can you push out to somebody's personal phone through the web, and how much do you need to keep more secure, or how do you secure that? So yeah, that's an interesting point. It also taps into, you know, what I see is kind of this remote, always-on type of model for maybe not employment in general, but I think there is a movement toward being less tied to a physical space, driving into the office, you know, 8 to 5 or 9 to 5 or whatever, in favor of being more flexible in terms of space and in time. Yeah, you might want to go see how things are going, you know,
Starting point is 00:35:29 right after you've got your jammies on, and you don't really want to set up your workstation to go do that. You'd rather pull out a phone, bring up an app, have an interface that's seamless and be able to work through that and get useful things done. As the world goes, you know, HPC tends to follow. I know that if I've submitted a job that's going to run for a week, I'm going to keep an eye on it or I would like to. And having that convenience is great.
Starting point is 00:35:59 But when I talk to, like, my sons who are, you know, in their late 20s, early 30s, to them it's a necessity that they would have that kind of access. And I think, you know, you're right. I would have preferred to have done it that way, so I wouldn't be tied to a machine the way you described it. I think it's a reasonable demand. We have to build those types of interfaces and tools to enable that. You need to get on that, Kevin.
Starting point is 00:36:29 It's on the list, Chris. Look, we're coming up at the end of our time. This was really great and informative. I love hearing about the dynamite things that are going on at NREL and in your world. Looking forward to finding out more, but is there words of wisdom, Chris, that you want to leave us with? Other than, you know, keeping your eyes open to what's coming
Starting point is 00:36:58 and just being flexible and kind of keeping an open mind. There you go. No, that's great. That's terrific. Well, look, I, I really want to thank you for spending this time with us on the, the big compute podcast. And I'm looking forward to seeing you again, in person soon, maybe at super computing, maybe before that, it's in Denver this year, so it's just a short commute for you from Golden. I'll be there. Thanks again, Chris. Thank you for having me on. I appreciate it. I enjoyed the talk.
Starting point is 00:37:34 All right. Well, again, this is Kevin Kelly, and this is the Big Compute Podcast. Thanks for listening in, and we hope to catch up with you again soon. Thank you.
