Grey Beards on Systems - 149: GreyBeards talk HPC storage with Dustin Leverman, Group Leader, HPC storage at ORNL
Episode Date: May 31, 2023. Ran across an article discussing ORION, ORNL's new storage system, which had 100s of PB of file storage and supported TB/sec of bandwidth, so naturally I thought the GreyBeards need to talk to these people. I reached out and Dustin Leverman, Group Leader HPC storage at Oak Ridge National Labs (ORNL), answered the call.
Transcript
Hey everybody, Ray Lucchesi here with Keith Townsend.
Welcome to another sponsored episode of the Greybeards on Storage podcast,
a show where we get Greybeards bloggers together with storage and system vendors
to discuss upcoming products, technologies, and trends affecting the data center today.
We have with us here today, Dustin Leverman, Group Leader for HPC Storage at Oak Ridge
National Labs.
So, Dustin, why don't you tell us a little bit about yourself and what the labs are doing
with storage these days?
Yeah, so, like you mentioned, I'm the Group Leader for storage and archive, actually. So basically what my role is is to make sure that, you know,
based on our user use cases, that we're able to achieve those
with both our scratch storage, which we define as storage
that's connected to our compute resources,
but is basically built for comfort, not for speed.
And then we define our archival storage as,
I guess traditionally you would define archival storage
as write once, read never,
but we actually treat ours more like a,
just kind of a long-term store.
So there's usually like a robust policy engine
that exists between two different tiers,
like a disk tier and a tape tier.
And on a project-level basis, we try to meet our users' use cases,
on whether they're trying to actually write once, read never,
or if they need to have more interactive performance.
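To make the idea of a policy engine sitting between a disk tier and a tape tier concrete, here is a rough Python sketch of the kind of rule such an engine applies. It is purely illustrative: the 90-day threshold, the paths, and the migrate helper are assumptions, not how HPSS or Spectrum Archive are actually configured at ORNL.

```python
import os
import time

DISK_TIER = "/archive/disk"        # hypothetical disk-cache path, not an ORNL path
MIGRATE_AFTER_DAYS = 90            # illustrative threshold, not ORNL policy

def migrate_to_tape(path: str) -> None:
    """Placeholder for the real policy engine's disk-to-tape migration."""
    print(f"would migrate {path} to the tape tier")

def sweep(root: str = DISK_TIER) -> None:
    """Walk the disk tier and flag cold files for migration to tape.

    A real engine (HPSS, Spectrum Archive) tracks state in its own database;
    this sketch just uses file access times as a stand-in.
    """
    cutoff = time.time() - MIGRATE_AFTER_DAYS * 86400
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            if os.stat(full).st_atime < cutoff:   # not read recently
                migrate_to_tape(full)

if __name__ == "__main__":
    sweep()
```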
One thing, as I was reading about Orion,
which I guess is your storage environment, I guess I'd call it,
is the size of the data load, the data corpus that you guys are dealing with.
I think it was something on the order of hundreds of petabytes.
700, roughly. It's 688.7.
700 petabytes is quite a lot of data to be thrown around here.
And you mentioned comfort store.
I saw something about, I guess, multiple tiers that you have for this storage environment.
Do you want to talk a little bit about that, Dustin?
Sure.
So Orion is the, like I mentioned before, the difference between Scratch and Archive.
Orion is our Scratch
file system for the Frontier supercomputer. Now, it's not dedicated storage to Frontier. It's
mounted by other systems on the periphery of the center, but Frontier is the main system.
So, like you mentioned, it's a multi-tier system. The three tiers are, usually you wouldn't mention
a metadata tier. It's just meant
for metadata. But in the case of Orion, we actually use a feature of Lustre, the file system for
Orion. We use a feature called data on metadata. So basically you can choose how much of the big,
of the different component of a file you want to have stored in metadata.
Like the first bytes or something like that? Yeah. So basically what that does is you can specify a size. So let's say that I had a good
understanding of our file size distribution. If we know that we have lots of files that are, let's say, 64K or below, which really take up very little space on the system, you can specify a DoM, data-on-metadata, size of 64K,
where basically what that will do is under the hood when a client is trying to request data,
it will talk to the metadata server first and then the metadata server will direct it to a
server that actually has the data. But when you have really small files, that creates a bunch of
extra overhead. So this data-on-metadata feature allows the metadata server basically to just return the data to those clients, instead of having those clients have to go reach out to other servers. So really the advantage is...
Oh, go ahead. Yeah, so the data is sitting on the metadata tier rather than on the storage tiers, I guess?
Exactly.
Oh, that's interesting. That way you get around the small file challenge, right?
Exactly. And actually, I mentioned 64K, but we actually set our DoM size to 256K, just because when we look at our file size distribution, there's interesting things that happen, you know, in the one, two, four megabyte range and the 64K range, in those areas. So basically, we tried to pick
that component size to get as much of our small files in Flash as possible.
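To see why the choice of DoM size matters, here is a minimal Python sketch that estimates what fraction of files would live entirely on the metadata flash for a candidate data-on-metadata size. The file-size histogram below is invented for illustration; Orion's real distribution is only described qualitatively above.

```python
# Hypothetical file-size histogram: {file size in bytes: file count}.
# The counts are made up for illustration only.
histogram = {
    4 * 1024: 40_000_000,
    64 * 1024: 25_000_000,
    256 * 1024: 10_000_000,
    1 * 1024**2: 8_000_000,
    4 * 1024**2: 5_000_000,
    1 * 1024**3: 500_000,
}

def dom_coverage(dom_size: int) -> float:
    """Fraction of files whose entire contents fit within the DoM extent."""
    total = sum(histogram.values())
    covered = sum(count for size, count in histogram.items() if size <= dom_size)
    return covered / total

for candidate in (64 * 1024, 256 * 1024):
    print(f"DoM = {candidate // 1024}K covers {dom_coverage(candidate):.1%} of files")
```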
Yeah, yeah, yeah. So there is a Flash tier as well. I believe it's NVMe storage as well.
It is. Yes. So we actually have 10 petabytes of metadata storage.
We have an additional 11 petabytes of NVMe storage.
That's, like I just mentioned, you know, our DoM size is 256K. And then basically from 256K up to eight megabytes, all the components of every file live in flash. And, you know, like I mentioned, we have this bimodal file size distribution, so that allows us, you know, high IOPS for small files.
So when you say components of files... so let's say I have a, I don't know,
a hundred megabyte file and eight megabytes of it would be, you know,
like the hottest data or like the first data or some sort of a sequential
moving filter of it, or how would that work?
How does the system decide what goes into the flash tier versus, I guess, disk, right?
Great question. So Lustre, the file system, as the file is being written, allows you to specify different, essentially, components or rules for what happens as the file grows.
So what, like I mentioned,
you have the first 256K on metadata,
the next up to a megabyte on flash.
And then after that,
all of that additional data starts getting written
to the hard drive tier
And, you know, doing the math, like I just mentioned, we have 21 petabytes of flash, but it's a 688-petabyte file system. So the rest of that capacity, a very significant amount, is all in our disk tier. So it's not what I would call a hot data tier, as much as it's, you know, the first 256K is on metadata, the next part is on flash, and the rest goes to disk. A traditional hot/cold design would use a policy engine to move data between the hot tier and the cold tier. The disadvantage with that is, I mean, we couldn't afford 688 petabytes of flash, right?
That would be prohibitively expensive.
So we had to make a choice of what we do. And, you know, if we placed all data on the performance tier, there would be risk of that tier filling up, right? Or users abusing it, or the policy engine that migrates the data not being able to keep up. The progressive layout approach allows us basically to provide deterministic performance for individual files for our users
and eliminates the risks of tiers filling up.
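As a mental model of the progressive file layout being described, here is a small Python sketch that maps a file's byte ranges onto tiers using the 256K and 8 MB boundaries mentioned in the conversation. It captures only the placement logic; real Lustre PFL components also carry stripe counts, stripe sizes, and OST pool names.

```python
# Layout boundaries as described in the conversation: first 256K on the
# metadata flash, up to 8 MB on NVMe OSTs, everything beyond that on HDD OSTs.
KB, MB = 1024, 1024**2
LAYOUT = [
    (256 * KB, "metadata (DoM, flash)"),
    (8 * MB,   "NVMe OST pool"),
    (None,     "HDD OST pool"),      # None means the extent runs to end of file
]

def place(file_size: int):
    """Return (start, end, tier) extents for a file of the given size."""
    extents, start = [], 0
    for end, tier in LAYOUT:
        stop = file_size if end is None else min(end, file_size)
        if start >= stop:
            break
        extents.append((start, stop, tier))
        start = stop
    return extents

for size in (100 * KB, 3 * MB, 100 * MB):
    print(size, place(size))
```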
Right, right, right, right.
I was going to ask, how do you back up 688.7 petabytes?
Great question.
So one of the ideas we operate under is that we shouldn't back everything up;
basically users need to make
a conscious decision whether or not
something is worth keeping forever
I know that some places
you know you get a home directory you probably
want to always back up everything in your home directory
but home directories are relatively small
like we have single jobs on our system that could write 80 petabytes of data in a single run.
And that's, you know, that would also be prohibitively expensive to back up. And
so sometimes you care about the breadcrumbs of how you got to the result. And sometimes you just care
about the result. So, and the users know what's important and they can back things up accordingly. And so, like I mentioned before, we have our archival systems.
And in our case, you know, today we have two archival systems for different security zones.
One of them uses HPSS and the other one uses Spectrum Archive.
But basically, users will use Globus.
We have a data transfer cluster that uses the Globus data transfer tool.
And the expectation is that they would use Globus to transfer data between Scratch and Archive.
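For listeners who haven't used Globus, a scratch-to-archive copy typically looks something like the sketch below. The endpoint UUIDs and paths are placeholders rather than OLCF's actual collections, and the exact CLI options can vary, so treat this as an assumed example rather than ORNL's documented procedure.

```python
import subprocess

# Placeholder collection/endpoint UUIDs; substitute your site's real ones.
SCRATCH_ENDPOINT = "00000000-0000-0000-0000-000000000001"
ARCHIVE_ENDPOINT = "00000000-0000-0000-0000-000000000002"

def archive_directory(project: str, run_dir: str) -> None:
    """Submit an asynchronous Globus transfer from scratch to the archive.

    Assumes the Globus CLI is installed and you are already logged in
    (`globus login`); Globus handles retries and checksumming on its side.
    """
    src = f"{SCRATCH_ENDPOINT}:/lustre/orion/{project}/{run_dir}/"
    dst = f"{ARCHIVE_ENDPOINT}:/archive/{project}/{run_dir}/"
    subprocess.run(
        ["globus", "transfer", src, dst, "--recursive",
         "--label", f"{project} {run_dir} to archive"],
        check=True,
    )

if __name__ == "__main__":
    archive_directory("AST021", "run_0042")   # hypothetical project and run names
```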
So I'm looking at the website.
One, it's impressive.
You have a row of IBM Z systems.
And it is helping me wrap my head around these huge numbers.
As you think about the amount of data you have and the compute needed to address it, there are aspects of this scale that I still am having a little bit of trouble with? And one of the obvious things is networking. Where are the
clients coming from that's accessing all of this data? So like I mentioned, Orion is
primarily for Frontier. And so really you have to make decisions on where the system lives in your
center to determine where you need the performance,
right? So between Frontier and Orion, the performance tier is able to achieve, you know,
10 terabytes a second. So basically that's 450 servers with two 200 gigabit per second links
each. That would be, if those lived on different networks, you couldn't afford to
connect them. Like the amount of switch gear and hosts to route that traffic between two different
networks would be millions of dollars. So what we do is basically based on these high throughput
requirements, Orion actually lives on Frontier's HSN, you know, its high-speed network that it uses for MPI traffic.
And so that allows us to have that throughput where peripheral systems like an analysis
cluster or data transfer cluster, you know, those only need, you know, hundreds of gigabytes
per second.
So we have a set of routers that route traffic.
There's 160 routers that route traffic between Frontier
Slingshot Network and our, we call it Scion. It's our scalable IO network. Essentially,
it's our SAN. But all of these peripheral systems connect to the SAN to access the file system,
but that traffic is routed.
Yeah, I love the concept that the peripheral systems only need hundreds of gigabytes per second. Ten terabytes of data bandwidth per second, right? I mean, that's the number you're throwing around, Dustin.
Well, actually, that was the requirement in our contract with HPE. We can actually do 11 terabytes a second write, and right at 14 terabytes a second read from Frontier.
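The quoted figures hang together with a little arithmetic; here is the back-of-envelope version in Python, using only the numbers mentioned in the conversation (450 object storage servers, two 200 Gb/s links each).

```python
# Back-of-envelope link bandwidth for Orion's OSS fleet, from figures
# quoted in the conversation.
servers = 450
links_per_server = 2
link_gbits = 200                      # 200 Gb/s per link

raw_gbits = servers * links_per_server * link_gbits   # 180,000 Gb/s
raw_tbytes = raw_gbits / 8 / 1000                     # ~22.5 TB/s of raw link capacity

print(f"aggregate link capacity: {raw_tbytes:.1f} TB/s")
print("delivered: ~10 TB/s contractual, ~11 TB/s write, ~14 TB/s read quoted")
```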
I'm trying to write this down. This is phenomenal.
Just so you know, we're recording this.
Oh, yeah, of course.
Yeah, yeah, I understand that.
No, even for this conversation, because I'm still trying to wrap my head around these numbers. And kind of, you know, the concept that there's a 10-petabyte metadata tier.
That's a pretty impressive database by itself.
And the amount of optimization that needs to go into it.
And this is a file system that we're talking about. So seamlessly, your data scientists who are connecting to this are just using standard NFS type protocols, I'm assuming.
And there's all of this magic happening on the back end to make it seem like they're just using this really basic transfer protocol.
But there's a lot going on behind the scenes.
So Lustre would have a client, right? It's not like an NFS solution.
But the Lustre client does give it POSIX-like access to the file system. Similar to NFS, yeah.
So yeah, talk about this metadata system here. It's like one hell of a key-value store. Is that what it is? Or is this
something that Lustre is putting together? Are you guys doing some special work here or what?
So basically what Lustre does under the hood is it takes a bunch of file systems and then it has a
metadata subsystem that ties it all together, right? So when you go to write a file to Lustre, it'll create an inode that will have a metadata entry for that file. And it will say, you know, depending on how you configure that file: how big of stripes do you want? How many stripes do you want? What pool or tier do you want to be writing to? You know, you can make, like I mentioned before, the progressive file layout that can tune this for you, so users don't have to have all that system knowledge. But basically,
when you go to write this file, you know, you create an inode for it, and it will say where all of the different data blocks are mapped to. But like I mentioned, this new feature, data on metadata, allows you to tune how much data is actually written on the metadata target as well. It's not in the inode, it's not data-in-inode, but it lets us utilize such a large amount of metadata space. But really the goal of it is
not to accelerate the file IO necessarily. That's one of the side effects of keeping data on metadata.
What it really does is it makes it so for small files, you're not having to do an extra network
hop. So it's keeping a bunch of that traffic off of the network. And another side effect is, since the metadata is flash, it stops a lot of these very small IO transactions from
hitting hard disks,
which are mechanical devices and would have to seek the drive. So it does accelerate the files
in that way as well. So yeah, so on disk you're writing, I don't know, a stripe might be, what, a megabyte or...? It actually is, yeah, the stripe width. It is? Okay. So that makes it a
little bit easier to spin out on disk and stuff like that with the throughput.
So you're still talking, you know, 688 petabytes, so 20 petabytes of metadata and a flash tier.
You're still talking 660, 670 petabytes of disk, right?
Exactly.
And that's over thousands of disks.
And the disks are effectively attached to, I'll call it, storage servers?
Yes, yeah.
So Lustre has building blocks, right?
You have MDSs, metadata servers, and you have OSSs, object storage servers.
And each of them have data targets,
like a metadata target MDT or an object storage target,
an OST that lives on the OSS.
So in the case of Orion, Orion has 200, I'm sorry,
450 object storage servers,
and each one has one flash storage target
and two disk storage targets.
And so, you know, like the individual, I mean, across those 450 servers,
there are, I'm doing the math right now, there are 47,700.
450 flash drives, something like that.
There's 24 flash drives and 212 18-terabyte drives per server pair. So doing the math, it works out to a little over 47,000 18-terabyte hard drives. And I think it's 5,400,
roughly 3.84 terabyte NVMEs. So I'm going to ask the obvious operations question. How do you monitor that many hard drives?
There's probably some problems cropping up occasionally and stuff like that.
At any given time, there's a failed drive.
And as you would probably guess, the time that you spend having a failed drive introduces risk for data loss, right, when you use RAID. So we use ZFS dRAID, and dRAID2 to be specific. And so that allows us to have eight data blocks and two parity blocks. But we also have two spare drives in each one of our pools. Well, not spare drives, spare capacity, I guess, inside of our pools.
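Those dRAID numbers also line up with the capacity figures quoted earlier; here is the rough math in Python. It ignores ZFS overhead and the embedded spare capacity, so it is an approximation rather than ORNL's exact accounting.

```python
# Rough usable-capacity check from the drive counts and dRAID geometry
# mentioned in the conversation (8 data + 2 parity, spares ignored).
hdd_count = 47_700
hdd_tb = 18
data_fraction = 8 / (8 + 2)           # dRAID2: 8 data blocks, 2 parity blocks

raw_pb = hdd_count * hdd_tb / 1000    # ~858 PB raw
usable_pb = raw_pb * data_fraction    # ~687 PB before ZFS and spare overhead

print(f"raw:    {raw_pb:.0f} PB")
print(f"usable: {usable_pb:.0f} PB (quoted file system size: 688.7 PB)")
```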
And so what we do is, you know, since we expect that any drive can fail at any time, Lustre has a feature that will mark that pool as degraded and will deprioritize IO to it. So when users are writing, let's say they have a huge job that's writing a lot of data all at once, you don't want the metadata server directing new data to this pool that's being rebuilt. It'll try to avoid it, try to make it so you're not being blocked on this one slow pool, like an MPI barrier waiting on a write that has to wait on that pool. And you were asking about monitoring: there's monitoring inside the system, at the Lustre layer, that keeps track of whether the pools are degraded or not. But then we have a very complicated set of poll-based monitoring subsystems and also telemetry subsystems that constantly watch for slow drives and will fail them for you. They'll check SMART errors and make sure that a drive's not failed.
You know, there's checks that make sure that all the server's memory is present and healthy,
and the power supplies are present and healthy. There's a myriad of various checks; I think, just for Orion, it's something like 2,000 to 2,500 checks just for the poll-based monitoring. But then we also gather metrics using Telegraf and use Elasticsearch as our telemetry database, so we can query at any given time what kind of performance individual servers are seeing.
If somebody launches a job, we want to see, you know, what's the histogram of the metadata
activity for that job or the bandwidth activity for that job.
It generates a huge amount of telemetry.
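To give a flavor of what querying that telemetry could look like, here is a hedged Python sketch using the Elasticsearch client (8.x-style keyword arguments). The index name, field names, and aggregation shape are hypothetical; the conversation doesn't describe ORNL's actual schema.

```python
from elasticsearch import Elasticsearch

# Hypothetical telemetry cluster and index; not ORNL's actual deployment.
es = Elasticsearch("http://telemetry.example.com:9200")

def oss_bandwidth_histogram(job_start: str, job_end: str) -> dict:
    """Per-server write-bandwidth percentiles over a job's time window.

    Index and field names ("oss-metrics", "write_bytes_per_sec", "host")
    are illustrative assumptions about what a Telegraf pipeline might emit.
    """
    return es.search(
        index="oss-metrics",
        size=0,
        query={"range": {"@timestamp": {"gte": job_start, "lte": job_end}}},
        aggs={
            "per_server": {
                "terms": {"field": "host", "size": 450},
                "aggs": {
                    "bw": {"percentiles": {"field": "write_bytes_per_sec"}}
                },
            }
        },
    )

# Example (commented out; needs a reachable cluster):
# result = oss_bandwidth_histogram("2023-04-01T00:00:00", "2023-04-01T06:00:00")
```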
Oh, yeah.
Yeah, I would imagine.
Let's talk about some of the, you know, just more esoteric stuff.
What's the workload driving Frontier, I guess?
Is it physics or flow dynamics?
It's actually all over the place. So Oak Ridge, the Frontier supercomputer, is funded by the Department of Energy Office of Science. And basically, anyone can put in an application for hours on the system. And we have a science director and a team of people that look at these applications and decide whether or not they're going to grant time on the machine. So since this is an open call, you know, you'll see folks from, you know, fluid dynamics, astrophysics,
physics codes, chemistry codes. We even partner with some, some industry partners sometimes where
they may want to, to model the airflow over a wing of an airplane or, you know, it's all over the place.
So with that, the obvious question is injection of data.
When I'm thinking about moving petabytes of data into the system to run my analysis against,
what are the various methods of users of the frontier to ingest data for their projects
into the system?
So do you mean like the styles of like the power process?
So I guess there's two questions here.
You know, a lot of these workloads would be generating the data directly rather than ingesting
data from someplace else.
But there's certainly a lot of work that would require, you know,
massive amounts of data to be, I don't know, like weather simulation data.
You'd want to, you know, suck up all the weather history
for the last 10 years or something like that.
That would be a lot of data coming into the system.
How does that work? Is that what you're asking, Keith?
Yeah.
Actually, that's a great example, weather modeling.
So, like, for example, with weather modeling specifically, you know, it's important to understand that you would constantly have a stream of sensor data coming from satellites and airplanes and ground weather stations. You'd see a constant stream of that coming into the system. But also, for a weather model, you don't just launch a weather model with just the sensor data. Basically, the output from your last job is used to sort of train the new model, the new run that you're about to make. So you take a
combination of this, what they call cycling data, in addition to the sensor data, and it can give
you an accurate forecast. So it's a combination of a constant stream of sensor data, but also being
able to write out your cycling data at the end of the job and then have your next job pick up both of those things and create new output. So are you streaming that data from partners that have like internet two
or different connectivity? What's the connectivity coming in? So actually we don't do production
weather modeling on Frontier specifically. We do operate systems for NOAA, the National Oceanic and Atmospheric Administration. Yeah. And we also operate systems for weather prediction. But so anyway, so yes, how that
works for us is we actually just upgraded to a 400 gig link on ESNet, which allows us to connect
with our partner sites that will deliver this data to us. And then
honestly, many of our other users too, the DOE is a computing ecosystem. So we have
very high bandwidth to the other DOE labs, like Argonne National Lab and NERSC over in the Bay Area. Yeah, I can easily see that web having some pretty high-speed
layer two connectivity to each other and being able to ingest data and connect with other
partners. Back to kind of the Lustre conversation, where does Lustre end and kind of Oak Ridge begin when it comes to the overall capability of the underlay? Are you guys just constantly updating the Lustre project with things that you've learned, or is Lustre pretty much, you know, you take that component and just build separate supporting systems around it, or a combination of both?
I think if I'm understanding the question correctly,
so Lustre is an
open source file system
operated under OpenSFS,
and there's multiple companies
that contribute to that code
including Whamcloud, which is part of DDN, and then also, like, HPE is a significant contributor, and many, many others.
But so basically, and since it's open source, you know, individual sites can develop features that they want.
So like, for example, Oak Ridge National Laboratory, we partnered with Whamcloud to develop this progressive file layouts feature. And, you know, the gatekeeper for the code works for Whamcloud.
So, you know, there's best practices on coding and the way that they want everything to be structured.
And so, you know, we worked with them to develop this feature. So but then there also may be other features that Oak Ridge doesn't care about, but other sites may fund or other companies may fund.
Does that answer your question? Yeah. So, partially. So I guess the effort is kind of... Oak Ridge took, I'm sorry, Lustre out of the box, out of the open source box, and that got you 80% there. Now you've had to partner with these folks to get the other 20 percent there, or have you just contributed it all back to the project? And, you know, if I have the team, I can build exactly what you folks have built.
Yes. So Orion, the file system, was bought from HPE, and HPE as a company, they will take what is upstream in Lustre and they may make some additions, modifications, and enhancements to it. But we actually have an agreement with HPE where, if they're making enhancements for us, you know, we're taxpayer funded and we want to make sure that the people paying for us to do this work are able to reap the benefits of their funding. So, you know, we have our own in-house Lustre developers as well. Actually, James Simmons at Oak Ridge, he is, I believe, in the top three individual contributors to the Lustre code. And everything that he touches makes its way upstream. So to answer your
question, the goal is that everything that we do makes it upstream so everybody can benefit from it.
Huh. Very interesting. You mentioned somewhere in there about learning and stuff like that.
Are you guys doing a lot of AI kinds of workloads at the labs as well?
Absolutely. On both our Summit and Frontier computers.
Now, Orion usually isn't directly used for AI.
Generally, what the AI workload that we see looks like is lots of small random reads,
which a parallel file system struggles with. So we actually have another tier of storage
that's actually node local storage that our AI users will request access to. That's where their
workload will operate from. And so basically each one of
our compute nodes, let's use Frontier as an example, I believe it's 1.6 terabyte NVMEs.
There's two of them in each compute node that get RAIDed together with RAID 0. And so anyways,
so that will provide, you know, a few terabytes of space where they can copy data from the parallel
file system, run all their computations
and then copy it back. And so when you say parallel file system, you're talking Orion,
Lustre and the whole shebang, right? Exactly. Yeah. Where each file can be modified by,
you know, byte level locking allows different clients to be able to modify the same file at
the same time. And that gets really complicated when you have, you know,
potentially 10,000 nodes trying to access the same file at the same time.
But if you have a copy of it on each node, that's actually easy.
It's everybody doing this.
Yeah.
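The stage-in, compute, stage-out pattern described for the node-local NVMe is simple enough to sketch in Python. The mount point, project paths, and directory names below are placeholders; only the copy-in/copy-out structure reflects what Dustin describes.

```python
import shutil
from pathlib import Path

LUSTRE_RUN = Path("/lustre/orion/AST021/inputs")   # parallel file system (illustrative path)
NODE_LOCAL = Path("/mnt/nvme/job_scratch")          # hypothetical node-local NVMe mount

def stage_in() -> None:
    """Copy read-heavy input data from Lustre to the node-local drives."""
    shutil.copytree(LUSTRE_RUN, NODE_LOCAL / "inputs", dirs_exist_ok=True)

def stage_out() -> None:
    """Copy results back to Lustre so they survive past the job."""
    shutil.copytree(NODE_LOCAL / "results",
                    Path("/lustre/orion/AST021/results"),
                    dirs_exist_ok=True)

def main() -> None:
    stage_in()
    # ... run the small-random-read AI workload against NODE_LOCAL here ...
    (NODE_LOCAL / "results").mkdir(parents=True, exist_ok=True)
    stage_out()

if __name__ == "__main__":
    main()
```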
I'm also kind of curious about the culture of the customers you support.
I've spent a little bit of time supporting research folks, and I found the
kind of the CS knowledge to be a wide range from folks who are really, really good at data analysis
in general, and then folks who are good at data analysis but really don't care about the
infrastructure layer at any concept. But when
you're doing really big projects like this, you have to care about the details. So how do you
bridge the gap from people solving really big problems, but without the infrastructure knowledge
to know what size, what's the optimum size for striping files, et cetera, details where they just, you know, aren't in the weeds about the storage system. Sure. So, I feel like we're actually pretty good at
our documentation. So, that's usually kind of our first place to go. However, we do have an entire
group of folks. It's our scientific computing group at the lab that supports the INCITE program that I was mentioning before, where the goal for this group is to take domain scientists. Like, let's say you have an
astrophysicist that has this code where they want to do this really exciting thing. But, you know,
like you mentioned, you know, they may, they may not be computer scientists necessarily.
And, you know, so we have this scientific computing group that is domain scientists that are also computer scientists to help them tune their codes, port their codes to GPUs, make sure that they're using the IO system in a way that doesn't bring it to a halt, for example, or impact other users.
So we have a whole group of people that do this for us.
Interesting. You mentioned the data protection layers, that it was ZFS RAID, dRAID2, I guess?
Yes.
So is ZFS behind Lustre, or how does that work?
Yeah, so Lustre, like I mentioned about the OSTs before, you can format those different ways. So basically each of the servers has a set of ZFS zpools, like RAID sets. But each of those RAID sets can be ZFS, or they can be, if you're familiar with ext, right, there's a Lustre-specific file system, it has a few extensions on ext, but basically it's an ext4 file system called ldiskfs. There's different underlying file systems, the file system RAID sets, that Lustre is a superset of.
So Lustre uses those underlying file systems
as the store for the data.
Right, right, right, right, right.
So how do you upgrade something like this?
If you've got, I don't know, 450 OST servers,
if that's the right number,
and you decide you want to roll out a new feature of Lustre or something like that, how does that transpire in this environment?
Very complicated.
So with one of these large systems, you know, the systems are made of building blocks.
Like I just mentioned, there's 450 server pairs.
But, you know, there's 225 server pairs. I'm sorry, 450 servers, 225 pairs.
And so we have a single pair that we call our TDS, our test and development system.
So anytime that we want to perform an upgrade, we go and test the upgrade there. It also has a computer connected to it that's a TDS of Frontier. So we'll do this. We'll perform end-to-end upgrades of both, run a set of
regression tests to ensure that we're not going to cause an instability or a performance degradation
or any of those sorts of things. So we'll test everything on our test and development system.
But the problem is, is that proving that it works on a single unit is not the same. Versus 225.
Exactly.
So we don't usually do live updates unless there's something very specific that's low
impact that we're trying to test.
Usually it involves taking a full outage, and we do them on Tuesdays because, you know, Monday is planning day. Tuesday you do the upgrade, and then if something goes terribly wrong, you have a few days to work it out before the weekend. So, yeah, we'll run this at small scale and then we'll do the upgrade on the large system.
And the intention is always to have the infrastructure set up so that you can roll back.
So Orion runs on all stateless images so we can perform this upgrade, upgrade to the new image.
If it doesn't go well or we find an instability or something else that we've introduced, we just reboot the system back into the old image.
And it's exactly in the state it was in earlier that day before the upgrade. So the one thing that isn't stateless is the control plane. So how do you
guys back up the control plane itself? So we actually, I guess, in terms of system administration,
you know, there's the phrase, I guess, you know, treat your servers like cattle, not pets. Right. So, you know, our philosophy
there is that, you know, we want to have the recipe to regenerate the control plane, what we call our management system. It's the cluster management component. So basically we use
Puppet for configuration management, where any change we make goes through Puppet, which is
backed by Git. So if I'm going to go change a tunable on a system, I go make that change in Puppet,
and then it filters down into the image and into the individual storage nodes.
But Puppet is fully backed up, like very, very, and actually, you know, every sysadmin
has like essentially a backup on their laptops because they have a Git checkout of it.
But so that component's backed up.
Our images are backed up to our backup system. The recipe we use to put the operating system on our management cluster is also backed up. So basically, if my management server dies or something really bad happens, I can just throw another one in place and then regenerate it within a couple hours, get the system back online.
That's pretty impressive, actually.
Yeah, I'm thinking of how closely this is to like a hyperscaler environment.
But you also have the, I don't, I guess it's a luxury of not having the same SLA as a hyperscaler. So you can take this model in which, you know, downtime can happen,
maybe not, you know, the preference, but downtime can happen.
Right. Yeah. And I think one major difference too is that, you know, hyperscalers, I think, a lot of times are trying to solve more distributed problems, where we're trying to solve scale problems. So with a hyperscaler,
it'll be important. I guess from a hyperscalers perspective, it's going to be important to have
very, very high uptime. You want to have flexibility, right? Like you may use other
like Kubernetes or something to make sure that if you have a resource that needs to be running,
it's always running somewhere, right? Where for us, our systems are relatively static. So the complexity of how, you know,
like we put this system in place and the way it's designed, it's not, you can't really change it and
still keep the same performance attributes. So we put it in place, we accept it, we get our users
on it, we use it for five, six years, and then we move on to our next one
where a hyperscaler is constantly changing and that they need to have a flexible environment
that we don't necessarily have to have. Ours is about performance and data integrity.
And uptime actually is a huge consideration of ours. You know, generally, with these at-scale systems, the one that we are still running in production, that will be going away later this year, it still sees, you know, 98, 99 percent uptime. It's very good for a system of its scale.
Versus where one day there needs to be bare metal hosts, another day these need to be virtualized hosts, or yet another day these hosts are running containers.
The use cases are pretty static in the sense that you're providing the service.
You conform your workload or application to the infrastructure
versus the infrastructure conforming to the
workload. Exactly. That's exactly right. Yeah. Yeah. So, so geez, this is pretty amazing stuff.
Is this, is this currently running today or is this something that's in the process of being
built up or what's the status of it, I guess, Dustin?
Orion is in production as of,
I'm trying to remember the exact date,
it was in April.
But, you know, before we put the system in production,
you go through a huge amount of,
you know, acceptance process,
you know, making sure that every end to end
from a functionality, performance, stability,
that you've run through a huge number of tests
to ensure the system's ready for users. But we also put users on the system. We have a set of
what we call friendly users that we put on. It's like, hey, we know that you have a file
per process workload that's very hard on the system. Can you test it and share your results
with us or let us work with you along the way? Or you have a huge single shared file workload, which is one of the areas that we really want to make sure to exercise and works
well. Can you get on the system and run this for us? And honestly, the feedback, even when the
system is being used, like I said, we've seen 10, 11, 12 terabytes per second from it, even in a
mixed environment with lots of users in the system, our users are still seeing, you know, I've seen 7.9 terabytes a second.
So I think, yes, the system's in production.
It's in use.
And so far, I mean, our users love it.
So I guess with the, have you, it's a relatively new system.
So, you know, it's like bringing out your fast sports car and it seems fast for like
the first year.
The sports car performance hasn't changed, but the user's
expectations have. This is what us operators always have to deal with. And I don't think
you guys have gotten to the point where users are complaining that it's slow yet, but you have a unique system in that it is a shared system. So do you have to deal with noisy neighbors yet?
Yes. And it depends on which file system;
different file systems have different architectural considerations that can change the blind spots,
right? For like areas that you may get... You know, let's say, for example, that I have one user that is very friendly and, you know, they may have, let's say, a file per node for their workload.
Right. And let's say their job is 6,000 nodes.
Now, let's also say I have another user on the same metadata subsystem that is using 9,400 nodes, but they have, you know, 64 processes per node. Their workload would be very, very different, and could potentially, you know, have a negative interaction with that other job that's maybe doing something a little less intensive. So that potential is always there. But the way we solve that is,
you know, I mentioned that we have 10 petabytes of metadata that's actually spread across
40 metadata servers. And each of our projects is randomly assigned to a metadata server. And so what that allows us to do is to basically break up the problem domain. And so
if one user is misbehaving,
it'll only impact a subset of other projects
and not all of them.
So the system will remain on.
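A toy version of that blast-radius idea: map each project onto one of the 40 metadata servers so a misbehaving project only affects its own MDT. The hashing below is just for illustration; per the conversation, the real assignment is made randomly when a project is created.

```python
import hashlib

NUM_MDTS = 40   # Orion's metadata servers, per the conversation

def mdt_for_project(project: str) -> int:
    """Deterministically map a project ID to one of the 40 MDTs.

    Illustrative only: ORNL assigns projects randomly at creation time,
    but the effect is the same, each project lands on a single MDT.
    """
    digest = hashlib.sha256(project.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_MDTS

for proj in ("AST021", "CHM135", "FLD042"):   # AST021 from the episode; others invented
    print(proj, "->", f"MDT{mdt_for_project(proj):02d}")
```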
Effectively, a metadata server is like a file system
in your parlance then, is that?
Well, the metadata subsystem actually is composed
of ZFS file systems also,
the same as like the object storage systems.
So it is the part of Lustre that I'm trying to figure out the right way to put this.
It's not a file system of its own.
The metadata subsystem is made of smaller file systems that service the broader Lustre file system, if that makes sense.
But you mentioned two things here, Dustin. You mentioned that, number one, as jobs come in,
they're randomly assigned to a metadata server.
And number two-
Not the jobs, sorry, the projects.
The projects, okay.
So this isn't like an automated orchestration system.
So when I submit, to use Ray's language, a job... when I submit a job, it is... a project is a customer or a client doing a thing, and they're randomly assigned to a metadata server.
Right.
So let's say that I want to solve
a numeric weather prediction
or yeah, astrophysics problem.
I'll come in and I'll put a proposal
and saying, hey,
astrophysics, let's use an example.
Like, let's say I want to model the expansion of the universe.
And somebody's like, cool.
Yeah, we'll create AST021.
That'll be your astrophysics project name. And so they will have an area of the file system like slash Lustre, Orion, AST021.
And that directory, AST021, it's not just one person, it could be a whole group of people that work in that project. That project itself would be, not dedicated, it would be on a single metadata server that may be shared by other users.
And then do you guys have the ability to kind of move the projects to metadata servers based on use, or noisy neighbors, or something like that?
You can, but then you would have to move the metadata to the other metadata servers. So it would actually be a migration. It's not like you could just...
It would have to be worth the effort, like it had to be two long-term projects that obviously collide with each other, like, okay, this is worth that effort.
Well, and usually what we try to do instead is, like I mentioned, we have a very, very robust telemetry system, and if we identify a user as being problematic, we will work with those users and be like, okay, what are you doing, what are you trying to do? Like, let's figure out how to make it so that this isn't happening.
I don't want to like put users in jail
or put them in file system jail and move them off.
I'd rather just help them to do what they need,
what they're trying to do.
More effectively on Orion and stuff like that.
So speaking of, you know,
kind of frontier and Orion,
these project, these massive project names,
this is obviously not something that
you folks thought about three months ago and deployed today. How long was the planning process
of the design prior to building out, I would imagine it was iterations of the production
system and then the final production system? Give us a timeline.
So usually, I mean, it's years. So like, for example, we just deployed Orion and, you know, we're already talking about our next-gen file system. Like, we're designing it now. But usually what that is, is, you know, with my group at the lab, the first year of a system is usually spent stabilizing it. Which actually, since Orion was put in production, it's had
a hundred percent scheduled uptime. So it's been very good so far, actually.
But, you know, when you, when you get real users on it and not synthetic benchmarks,
you expose problems. So we'll spend the first year exposing problems on the current system
while concurrently working on building our next one. Before you can build your next one, you need to evaluate different products. So we'll reach out to vendors and say, what are you thinking about doing in three, four, five years? And if it's something that, let's say they have an early version of it, it's like, okay, we want to buy one of those, and we put it in what we call our test bed, so it has every generation of all these different storage systems from all these different vendors, and we'll put it through its paces and understand where its weaknesses are.
Are there things that we can work with that vendor to improve? And then we put out a call
for proposals, right? And we gather our use cases from our users. We turn those into requirements. We turn those requirements into an RFP, and then our vendors will respond to this request for proposals. And after we receive that, I mean, it can still be a couple of years before you, you know, take delivery of the final system, integrate it, and then you have to go through an acceptance phase and you get to get users on the system. And then you go through it,
you start all over again for your next-gen system.
Yeah. Yeah. That's amazing.
It's usually a five-year life cycle.
Yeah, per file system.
I saw something like this. Orion is a five-year lifetime is what you're saying?
Five to six.
Yeah.
And it depends on a lot of things.
You know, the Orion is just one file system we operate.
You know, we have 12 production file systems.
Right, right, right, right. And are Frontier and Orion kind of tied at the hip together?
Are these separate things?
So Orion may go away and Frontier will connect to, you know, the next file system, or are they...
are the life cycles tied together? They are, I would probably say they're generally tied together,
but they're not exclusively tied together. Like for example, let's say that we, let's say that
Orion's running great and we think we can squeeze another year out of it. You know, it may be that Frontier goes away and Orion stays active and is used by other systems.
Or it could be that we don't want to marry the next procurement with a file system.
And so it could be that the next gen compute system, whatever's past Frontier, ends up mounting Orion for its first year of production while we integrate the new file system.
They're not necessarily married.
I do not envy that process.
All right. Well, listen, so Keith, any last questions for Dustin before we close?
No, we could, if I ask more questions, we can go on for another hour.
This has been extremely interesting.
All right. Dustin,
is there anything that you'd like to say to our listening audience before we close?
I guess I just mentioned, you know, if you have interest in this, you know, we actually
participate heavily in the Supercomputing Conference, the Lustre Users Group, the Cray
Users Group.
You know, Oak Ridge is very active in those spaces, both from the compute, storage, networking, and, you know, facilities side. Obviously, operating these systems, I mean, we're talking 30 megawatts of power for a single system, so there's a lot of complexity there as well. But there's information about all this kind of cool stuff happening at the lab at all those places.
Okay, great, great. Well, this has been great, Dustin. Thank you very much for being on
our show today. Absolutely. It was great meeting with you. And that's it for now. Bye, Dustin.
Bye, Keith. Bye, Ray. Until next time. Next time, we will talk to the system storage technology
person. Any questions you want us to ask, please let us know. And if you enjoy our podcast,
tell your friends about it. Please review us on Apple Podcasts, Google Play, and Spotify, as this will help get the word out. Thank you.