Grey Beards on Systems - 149: GreyBeards talk HPC storage with Dustin Leverman, Group Leader, HPC storage at ORNL
Episode Date: May 31, 2023. Ran across an article discussing ORION, ORNL's new storage system, which had 100s of PB of file storage and supported TB/sec of bandwidth, so naturally I thought the GreyBeards need to talk to these people. I reached out and Dustin Leverman, Group Leader HPC storage at Oak Ridge National Labs (ORNL), answered the call.
Transcript
Hey everybody, Ray Lucchesi here with Keith Townsend.
Welcome to another sponsored episode of the Greybeards on Storage podcast,
a show where we get Greybeards bloggers together with storage and system vendors
to discuss upcoming products, technologies, and trends affecting the data center today.
We have with us here today, Dustin Leverman, Group Leader for HPC Storage at Oak Ridge
National Labs.
So, Dustin, why don't you tell us a little bit about yourself and what the labs are doing
with storage these days?
Yeah, so, like you mentioned, I'm the Group Leader for storage and archive, actually. So basically what my role is is to make sure that, you know,
based on our user use cases, that we're able to achieve those
with both our scratch storage, which we define as storage
that's connected to our compute resources,
but is basically built for comfort, not for speed.
And then we define our archival storage as,
I guess traditionally you would define archival storage
as write once, read never,
but we actually treat ours more like a,
just kind of a long-term store.
So there's usually like a robust policy engine
that exists between two different tiers,
like a disk tier and a tape tier.
And on a project-level basis, we try to meet our users' use cases,
on whether they're trying to actually write once, read never,
or if they need to have more interactive performance.
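To make the idea of a policy engine sitting between a disk tier and a tape tier concrete, here is a rough Python sketch of the kind of rule such an engine applies. It is purely illustrative: the 90-day threshold, the paths, and the migrate helper are assumptions, not how HPSS or Spectrum Archive are actually configured at ORNL.

```python
import os
import time

DISK_TIER = "/archive/disk"        # hypothetical disk-cache path, not an ORNL path
MIGRATE_AFTER_DAYS = 90            # illustrative threshold, not ORNL policy

def migrate_to_tape(path: str) -> None:
    """Placeholder for the real policy engine's disk-to-tape migration."""
    print(f"would migrate {path} to the tape tier")

def sweep(root: str = DISK_TIER) -> None:
    """Walk the disk tier and flag cold files for migration to tape.

    A real engine (HPSS, Spectrum Archive) tracks state in its own database;
    this sketch just uses file access times as a stand-in.
    """
    cutoff = time.time() - MIGRATE_AFTER_DAYS * 86400
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            if os.stat(full).st_atime < cutoff:   # not read recently
                migrate_to_tape(full)

if __name__ == "__main__":
    sweep()
```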
One thing, as I was reading about Orion,
which I guess is your storage environment, I guess I'd call it,
is the size of the data load, the data corpus that you guys are dealing with.
I think it was something on the order of hundreds of petabytes.
700, roughly. It's 688.7.
700 petabytes is quite a lot of data to be thrown around here.
And you mentioned comfort store.
I saw something about, I guess, multiple tiers that you have for this storage environment.
Do you want to talk a little bit about that, Dustin?
Sure.
So Orion is the, like I mentioned before, the difference between Scratch and Archive.
Orion is our Scratch
file system for the Frontier supercomputer. Now, it's not dedicated storage to Frontier. It's
mounted by other systems on the periphery of the center, but Frontier is the main system.
So, like you mentioned, it's a multi-tier system. The three tiers are, usually you wouldn't mention
a metadata tier. It's just meant
for metadata. But in the case of Orion, we actually use a feature of Lustre, the file system for
Orion. We use a feature called data on metadata. So basically you can choose how much of the big,
of the different component of a file you want to have stored in metadata.
Like the first bytes or something like that? Yeah. So basically what that does is you can specify a size. So let's say that I had a good
understanding of our file size distribution. If we know that we have lots of files that are, let's say, 64K or below, which really take up very little space on the system, you can specify a DoM, data-on-metadata, size of 64K,
where basically what that will do is under the hood when a client is trying to request data,
it will talk to the metadata server first and then the metadata server will direct it to a
server that actually has the data. But when you have really small files, that creates a bunch of
extra overhead. So this data-on-metadata feature allows the metadata server basically to just return the data to those clients, instead of having those clients have to go reach out to other servers. So really the advantage is...
Oh, go ahead. Yeah, so the data is sitting on the metadata tier rather than on the storage tiers, I guess?
Exactly.
Oh, that's interesting. That way you get around the small file challenge, right?
Exactly. And actually, I mentioned 64K, but we actually set our DoM size to 256K, just because when we look at our file size distribution, there's interesting things that happen, you know, in the one, two, four megabyte range and the 64K range, in those areas. So basically, we tried to pick
that component size to get as much of our small files in Flash as possible.
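To see why the choice of DoM size matters, here is a minimal Python sketch that estimates what fraction of files would live entirely on the metadata flash for a candidate data-on-metadata size. The file-size histogram below is invented for illustration; Orion's real distribution is only described qualitatively above.

```python
# Hypothetical file-size histogram: {file size in bytes: file count}.
# The counts are made up for illustration only.
histogram = {
    4 * 1024: 40_000_000,
    64 * 1024: 25_000_000,
    256 * 1024: 10_000_000,
    1 * 1024**2: 8_000_000,
    4 * 1024**2: 5_000_000,
    1 * 1024**3: 500_000,
}

def dom_coverage(dom_size: int) -> float:
    """Fraction of files whose entire contents fit within the DoM extent."""
    total = sum(histogram.values())
    covered = sum(count for size, count in histogram.items() if size <= dom_size)
    return covered / total

for candidate in (64 * 1024, 256 * 1024):
    print(f"DoM = {candidate // 1024}K covers {dom_coverage(candidate):.1%} of files")
```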
Yeah, yeah, yeah. So there is a Flash tier as well. I believe it's NVMe storage as well.
It is. Yes. So we actually have 10 petabytes of metadata storage.
We have an additional 11 petabytes of NVMe storage.
That's, like I just mentioned, you know, our DoM size is 256K. And then basically from 256K up to eight megabytes, all the components of every file live in flash. And, you know, like I mentioned, we have this bimodal file size distribution, so that allows us, you know, high IOPS for small files.
So when you say components of files... so let's say I have a, I don't know,
a hundred megabyte file and eight megabytes of it would be, you know,
like the hottest data or like the first data or some sort of a sequential
moving filter of it, or how would that work?
How does the system decide what goes into the flash tier versus, I guess, disk, right?
Great question. So Lustre, the file system, as the file is being written, allows you to specify different, essentially, components or rules for what happens as the file grows.
So what, like I mentioned,
you have the first 256K on metadata,
the next up to a megabyte on flash.
And then after that,
all of that additional data starts getting written
to the hard drive tier
And, you know, doing the math, like I just mentioned, we have 21 petabytes of flash, but it's a 688-petabyte file system. So the rest of that capacity, a very significant amount, is all in our disk tier. So it's not what I would call a hot data tier, as much as it's, you know, the first 256K is on metadata, the next part is on flash, and the rest goes to disk. A traditional hot/cold design would use a policy engine to move data between the hot tier and the cold tier. The disadvantage with that is, I mean, we couldn't afford 688 petabytes of flash, right?
That would be prohibitively expensive.
So we had to make a choice of what we do. And, you know, if we placed all data on the performance tier, there would be risk of that tier filling up, right? Or users abusing it, or the policy engine that migrates the data not being able to keep up. The progressive layout approach allows us basically to provide deterministic performance for individual files for our users
and eliminates the risks of tiers filling up.
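As a mental model of the progressive file layout being described, here is a small Python sketch that maps a file's byte ranges onto tiers using the 256K and 8 MB boundaries mentioned in the conversation. It captures only the placement logic; real Lustre PFL components also carry stripe counts, stripe sizes, and OST pool names.

```python
# Layout boundaries as described in the conversation: first 256K on the
# metadata flash, up to 8 MB on NVMe OSTs, everything beyond that on HDD OSTs.
KB, MB = 1024, 1024**2
LAYOUT = [
    (256 * KB, "metadata (DoM, flash)"),
    (8 * MB,   "NVMe OST pool"),
    (None,     "HDD OST pool"),      # None means the extent runs to end of file
]

def place(file_size: int):
    """Return (start, end, tier) extents for a file of the given size."""
    extents, start = [], 0
    for end, tier in LAYOUT:
        stop = file_size if end is None else min(end, file_size)
        if start >= stop:
            break
        extents.append((start, stop, tier))
        start = stop
    return extents

for size in (100 * KB, 3 * MB, 100 * MB):
    print(size, place(size))
```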
Right, right, right, right.
I was going to ask, how do you back up 688.7 petabytes?
Great question.
So one of the ideas we operate under is that we shouldn't back everything up;
basically users need to make
a conscious decision whether or not
something is worth keeping forever
I know that some places
you know you get a home directory you probably
want to always back up everything in your home directory
but home directories are relatively small
like we have single jobs on our system that could write 80 petabytes of data in a single run.
And that's, you know, that would also be prohibitively expensive to back up. And
so sometimes you care about the breadcrumbs of how you got to the result. And sometimes you just care
about the result. So, and the users know what's important and they can back things up accordingly. And so, like I mentioned before, we have our archival systems.
And in our case, you know, today we have two archival systems for different security zones.
One of them uses HPSS and the other one uses Spectrum Archive.
But basically, users will use Globus.
We have a data transfer cluster that uses the Globus data transfer tool.
And the expectation is that they would use Globus to transfer data between Scratch and Archive.
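For listeners who haven't used Globus, a scratch-to-archive copy typically looks something like the sketch below. The endpoint UUIDs and paths are placeholders rather than OLCF's actual collections, and the exact CLI options can vary, so treat this as an assumed example rather than ORNL's documented procedure.

```python
import subprocess

# Placeholder collection/endpoint UUIDs; substitute your site's real ones.
SCRATCH_ENDPOINT = "00000000-0000-0000-0000-000000000001"
ARCHIVE_ENDPOINT = "00000000-0000-0000-0000-000000000002"

def archive_directory(project: str, run_dir: str) -> None:
    """Submit an asynchronous Globus transfer from scratch to the archive.

    Assumes the Globus CLI is installed and you are already logged in
    (`globus login`); Globus handles retries and checksumming on its side.
    """
    src = f"{SCRATCH_ENDPOINT}:/lustre/orion/{project}/{run_dir}/"
    dst = f"{ARCHIVE_ENDPOINT}:/archive/{project}/{run_dir}/"
    subprocess.run(
        ["globus", "transfer", src, dst, "--recursive",
         "--label", f"{project} {run_dir} to archive"],
        check=True,
    )

if __name__ == "__main__":
    archive_directory("AST021", "run_0042")   # hypothetical project and run names
```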
So I'm looking at the website.
One, it's impressive.
You have a row of IBM Z systems.
And it is helping me wrap my head around these huge numbers.
As you think about the amount of data you have and the compute needed to address it, there are aspects of this scale that I still am having a little bit of trouble with? And one of the obvious things is networking. Where are the
clients coming from that's accessing all of this data? So like I mentioned, Orion is
primarily for Frontier. And so really you have to make decisions on where the system lives in your
center to determine where you need the performance,
right? So between Frontier and Orion, the performance tier is able to achieve, you know,
10 terabytes a second. So basically that's 450 servers with two 200 gigabit per second links
each. That would be, if those lived on different networks, you couldn't afford to
connect them. Like the amount of switch gear and hosts to route that traffic between two different
networks would be millions of dollars. So what we do is basically based on these high throughput
requirements, Orion actually lives on Frontier's HSN, you know, its high-speed network that it uses for MPI traffic.
And so that allows us to have that throughput where peripheral systems like an analysis
cluster or data transfer cluster, you know, those only need, you know, hundreds of gigabytes
per second.
So we have a set of routers that route traffic.
There's 160 routers that route traffic between Frontier
Slingshot Network and our, we call it Scion. It's our scalable IO network. Essentially,
it's our SAN. But all of these peripheral systems connect to the SAN to access the file system,
but that traffic is routed.
Yeah, I love the concept that the peripheral systems only need hundreds of gigabytes per second. Ten terabytes of data bandwidth per second, right? I mean, that's the number you're throwing around, Dustin.
Well, actually, that was the requirement in our contract with HPE. We can actually do 11 terabytes a second write, and right at 14 terabytes a second read from Frontier.
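The quoted figures hang together with a little arithmetic; here is the back-of-envelope version in Python, using only the numbers mentioned in the conversation (450 object storage servers, two 200 Gb/s links each).

```python
# Back-of-envelope link bandwidth for Orion's OSS fleet, from figures
# quoted in the conversation.
servers = 450
links_per_server = 2
link_gbits = 200                      # 200 Gb/s per link

raw_gbits = servers * links_per_server * link_gbits   # 180,000 Gb/s
raw_tbytes = raw_gbits / 8 / 1000                     # ~22.5 TB/s of raw link capacity

print(f"aggregate link capacity: {raw_tbytes:.1f} TB/s")
print("delivered: ~10 TB/s contractual, ~11 TB/s write, ~14 TB/s read quoted")
```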
I'm trying to write this down. This is phenomenal.
Just so you know, we're recording this.
Oh, yeah, of course.
Yeah, yeah, I understand that.
No, even for this conversation, because I'm still trying to wrap my head around these numbers. And kind of, you know, the concept that there's a 10-petabyte metadata tier.
That's a pretty impressive database by itself.
And the amount of optimization that needs to go into it.
And this is a file system that we're talking about. So seamlessly, your data scientists who are connecting to this are just using standard NFS type protocols, I'm assuming.
And there's all of this magic happening on the back end to make it seem like they're just using this really basic transfer protocol.
But there's a lot going on behind the scenes.
So Lustre would have a client, right? It's not like an NFS solution.
But the Lustre client does give it POSIX-like access to the file system. Similar to NFS, yeah.
So yeah, talk about this metadata system here. It's like one hell of a key-value store. Is that what it is? Or is this
something that Lustre is putting together? Are you guys doing some special work here or what?
So basically what Lustre does under the hood is it takes a bunch of file systems and then it has a
metadata subsystem that ties it all together, right? So when you go to write a file to Lustre, it'll create an inode that will have a metadata entry for that file. And it will say, you know, depending on how you configure that file: how big of stripes do you want? How many stripes do you want? What pool or tier do you want to be writing to? You know, you can make, like I mentioned before, the progressive file layout that can tune this for you, so users don't have to have all that system knowledge. But basically,
when you go to write this file, you know, you create an inode for it, and it will say where all of the different data blocks are mapped to. But like I mentioned, this new feature, data on metadata, allows you to tune how much data is actually written on the metadata target as well. It's not in the inode, it's not data-in-inode, but it lets us utilize such a large amount of metadata space. But really the goal of it is
not to accelerate the file IO necessarily. That's one of the side effects of keeping data on metadata.
What it really does is it makes it so for small files, you're not having to do an extra network
hop. So it's keeping a bunch of that traffic off of the network. And another side effect is, since the metadata is flash, it stops a lot of these very small IO transactions from
hitting hard disks,
which are mechanical devices and would have to seek the drive. So it does accelerate the files
in that way as well. So yeah, so on disk you're writing, I don't know, a stripe might be, what, a megabyte or...? It actually is, yeah, the stripe width. It is? Okay. So that makes it a
little bit easier to spin out on disk and stuff like that with the throughput.
So you're still talking, you know, 688 petabytes, so 20 petabytes of metadata and a flash tier.
You're still talking 660, 670 petabytes of disk, right?
Exactly.
And that's over thousands of disks.
And the disks are effectively attached to, I'll call it, storage servers?
Yes, yeah.
So Lustre has building blocks, right?
You have MDSs, metadata servers, and you have OSSs, object storage servers.
And each of them have data targets,
like a metadata target MDT or an object storage target,
an OST that lives on the OSS.
So in the case of Orion, Orion has 200, I'm sorry,
450 object storage servers,
and each one has one flash storage target
and two disk storage targets.
And so, you know, like the individual, I mean, across those 450 servers,
there are, I'm doing the math right now, there are 47,700.
450 flash drives, something like that.
There's 24 flash drives and 212 18-terabyte drives per server pair. So doing the math, it works out to a little over 47,000 18-terabyte hard drives. And I think it's 5,400,
roughly 3.84 terabyte NVMEs. So I'm going to ask the obvious operations question. How do you monitor that many hard drives?
There's probably some problems cropping up occasionally and stuff like that.
At any given time, there's a failed drive.
And as you would probably guess, the time that you spend having a failed drive introduces risk for data loss, right, when you use RAID. So we use ZFS dRAID, and dRAID2 to be specific. And so that allows us to have eight data blocks and two parity blocks. But we also have two spare drives in each one of our pools. Well, not spare drives, spare capacity, I guess, inside of our pools.
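Those dRAID numbers also line up with the capacity figures quoted earlier; here is the rough math in Python. It ignores ZFS overhead and the embedded spare capacity, so it is an approximation rather than ORNL's exact accounting.

```python
# Rough usable-capacity check from the drive counts and dRAID geometry
# mentioned in the conversation (8 data + 2 parity, spares ignored).
hdd_count = 47_700
hdd_tb = 18
data_fraction = 8 / (8 + 2)           # dRAID2: 8 data blocks, 2 parity blocks

raw_pb = hdd_count * hdd_tb / 1000    # ~858 PB raw
usable_pb = raw_pb * data_fraction    # ~687 PB before ZFS and spare overhead

print(f"raw:    {raw_pb:.0f} PB")
print(f"usable: {usable_pb:.0f} PB (quoted file system size: 688.7 PB)")
```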
And so what we do is, you know, since we expect that any drive can fail at any time, Lustre has a feature that will mark that pool as degraded and will deprioritize IO to it. So when users are writing, let's say they have a huge job that's writing a lot of data all at once, you don't want the metadata server directing new data to this pool that's being rebuilt. It'll try to avoid it, try to make it so you're not being blocked on this one slow pool, like an MPI barrier waiting on a write that has to wait on that pool. And you were asking about monitoring: there's monitoring inside the system, at the Lustre layer, that keeps track of whether the pools are degraded or not. But then we have a very complicated set of poll-based monitoring subsystems and also telemetry subsystems that constantly watch for slow drives and will fail them for you. They'll check SMART errors and make sure that a drive's not failed.
You know, there's checks that make sure that all the server's memory is present and healthy,
and the power supplies are present and healthy. There's a myriad of various checks; I think, just for Orion, it's something like 2,000 to 2,500 checks just for the poll-based monitoring. But then we also gather metrics using Telegraf and use Elasticsearch as our telemetry database, so we can query at any given time what kind of performance individual servers are seeing.
If somebody launches a job, we want to see, you know, what's the histogram of the metadata
activity for that job or the bandwidth activity for that job.
It generates a huge amount of telemetry.
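To give a flavor of what querying that telemetry could look like, here is a hedged Python sketch using the Elasticsearch client (8.x-style keyword arguments). The index name, field names, and aggregation shape are hypothetical; the conversation doesn't describe ORNL's actual schema.

```python
from elasticsearch import Elasticsearch

# Hypothetical telemetry cluster and index; not ORNL's actual deployment.
es = Elasticsearch("http://telemetry.example.com:9200")

def oss_bandwidth_histogram(job_start: str, job_end: str) -> dict:
    """Per-server write-bandwidth percentiles over a job's time window.

    Index and field names ("oss-metrics", "write_bytes_per_sec", "host")
    are illustrative assumptions about what a Telegraf pipeline might emit.
    """
    return es.search(
        index="oss-metrics",
        size=0,
        query={"range": {"@timestamp": {"gte": job_start, "lte": job_end}}},
        aggs={
            "per_server": {
                "terms": {"field": "host", "size": 450},
                "aggs": {
                    "bw": {"percentiles": {"field": "write_bytes_per_sec"}}
                },
            }
        },
    )

# Example (commented out; needs a reachable cluster):
# result = oss_bandwidth_histogram("2023-04-01T00:00:00", "2023-04-01T06:00:00")
```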
Oh, yeah.
Yeah, I would imagine.
Let's talk about some of the, you know, just more esoteric stuff.
What's the workload driving Frontier, I guess?
Is it physics or flow dynamics?
It's actually all over the place. So Oak Ridge, the Frontier supercomputer, is funded by the Department of Energy Office of Science. And basically, anyone can put in an application for hours on the system. And we have a science director and a team of people that look at these applications and decide whether or not they're going to grant time on the machine. So since this is an open call, you know, you'll see folks from, you know, fluid dynamics, astrophysics,
physics codes, chemistry codes. We even partner with some, some industry partners sometimes where
they may want to, to model the airflow over a wing of an airplane or, you know, it's all over the place.
So with that, the obvious question is injection of data.
When I'm thinking about moving petabytes of data into the system to run my analysis against,
what are the various methods of users of the frontier to ingest data for their projects
into the system?
So do you mean like the styles of like the power process?
So I guess there's two questions here.
You know, a lot of these workloads would be generating the data directly rather than ingesting
data from someplace else.
But there's certainly a lot of work that would require, you know,
massive amounts of data to be, I don't know, like weather simulation data.
You'd want to, you know, suck up all the weather history
for the last 10 years or something like that.
That would be a lot of data coming into the system.
How does that work? Is that what you're asking, Keith?
Yeah.
Actually, that's a great example, weather modeling.
So, like, for example, with weather modeling specifically, you know, it's important to understand that you would constantly have a stream of sensor data coming from satellites and airplanes and ground weather stations. You'd see a constant stream of that coming into the system. But also, for a weather model, you don't just launch a weather model with just the sensor data. Basically, the output from your last job is used to sort of train the new model, the new run that you're about to make. So you take a
combination of this, what they call cycling data, in addition to the sensor data, and it can give
you an accurate forecast. So it's a combination of a constant stream of sensor data, but also being
able to write out your cycling data at the end of the job and then have your next job pick up both of those things and create new output. So are you streaming that data from partners that have like internet two
or different connectivity? What's the connectivity coming in? So actually we don't do production
weather modeling on Frontier specifically. We do operate systems for NOAA, the National Oceanic and Atmospheric Administration. Yeah. And we also operate systems for weather prediction. But so anyway, so yes, how that
works for us is we actually just upgraded to a 400 gig link on ESNet, which allows us to connect
with our partner sites that will deliver this data to us. And then
honestly, many of our other users too, the DOE is a computing ecosystem. So we have
very high bandwidth to the other DOE labs, like Argonne National Lab and NERSC over in the Bay Area. Yeah, I can easily see that web having some pretty high-speed
layer two connectivity to each other and being able to ingest data and connect with other
partners. Back to kind of the Lustre conversation, where does Lustre end and kind of Oak Ridge begin when it comes to the overall capability of the underlay? Are you guys just constantly updating the Lustre project with things that you've learned, or is Lustre pretty much, you know, you take that component and just build separate supporting systems around it, or a combination of both?
I think if I'm understanding the question correctly,
so Lustre is an
open source file system
operated under OpenSFS,
and there's multiple companies
that contribute to that code
including Whamcloud, which is part of DDN, and then also, like, HPE is a significant contributor, and many, many others.
But so basically, and since it's open source, you know, individual sites can develop features that they want.
So like, for example, Oak Ridge National Laboratory, we partnered with Whamcloud to develop this progressive file layouts feature. And, you know, the gatekeeper for the code works for Whamcloud.
So, you know, there's best practices on coding and the way that they want everything to be structured.
And so, you know, we worked with them to develop this feature. So but then there also may be other features that Oak Ridge doesn't care about, but other sites may fund or other companies may fund.
Does that answer your question? Yeah. So, partially. So I guess the effort is kind of... Oak Ridge took, I'm sorry, Lustre out of the box, out of the open source box, and that got you 80% there. Now you've had to partner with these folks to get the other 20 percent there, or have you just contributed it all back to the project? And, you know, if I have the team, I can build exactly what you folks have built.
Yes. So Orion, the file system, was bought from HPE, and HPE as a company, they will take what is upstream in Lustre and they may make some additions, modifications, and enhancements to it. But we actually have an agreement with HPE where, if they're making enhancements for us, you know, we're taxpayer funded and we want to make sure that the people paying for us to do this work are able to reap the benefits of their funding. So, you know, we have our own in-house Lustre developers as well. Actually, James Simmons at Oak Ridge, he is, I believe, in the top three individual contributors to the Lustre code. And everything that he touches makes its way upstream. So to answer your
question, the goal is that everything that we do makes it upstream so everybody can benefit from it.
Huh. Very interesting. You mentioned somewhere in there about learning and stuff like that.
Are you guys doing a lot of AI kinds of workloads at the labs as well?
Absolutely. On both our Summit and Frontier computers.
Now, Orion usually isn't directly used for AI.
Generally, what the AI workload that we see looks like is lots of small random reads,
which a parallel file system struggles with. So we actually have another tier of storage
that's actually node local storage that our AI users will request access to. That's where their
workload will operate from. And so basically each one of
our compute nodes, let's use Frontier as an example, I believe it's 1.6 terabyte NVMEs.
There's two of them in each compute node that get RAIDed together with RAID 0. And so anyways,
so that will provide, you know, a few terabytes of space where they can copy data from the parallel
file system, run all their computations
and then copy it back. And so when you say parallel file system, you're talking Orion,
Lustre and the whole shebang, right? Exactly. Yeah. Where each file can be modified by,
you know, byte level locking allows different clients to be able to modify the same file at
the same time. And that gets really complicated when you have, you know,
potentially 10,000 nodes trying to access the same file at the same time.
But if you have a copy of it on each node, that's actually easy.
It's everybody doing this.
Yeah.
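The stage-in, compute, stage-out pattern described for the node-local NVMe is simple enough to sketch in Python. The mount point, project paths, and directory names below are placeholders; only the copy-in/copy-out structure reflects what Dustin describes.

```python
import shutil
from pathlib import Path

LUSTRE_RUN = Path("/lustre/orion/AST021/inputs")   # parallel file system (illustrative path)
NODE_LOCAL = Path("/mnt/nvme/job_scratch")          # hypothetical node-local NVMe mount

def stage_in() -> None:
    """Copy read-heavy input data from Lustre to the node-local drives."""
    shutil.copytree(LUSTRE_RUN, NODE_LOCAL / "inputs", dirs_exist_ok=True)

def stage_out() -> None:
    """Copy results back to Lustre so they survive past the job."""
    shutil.copytree(NODE_LOCAL / "results",
                    Path("/lustre/orion/AST021/results"),
                    dirs_exist_ok=True)

def main() -> None:
    stage_in()
    # ... run the small-random-read AI workload against NODE_LOCAL here ...
    (NODE_LOCAL / "results").mkdir(parents=True, exist_ok=True)
    stage_out()

if __name__ == "__main__":
    main()
```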
I'm also kind of curious about the culture of the customers you support.
I've spent a little bit of time supporting research folks, and I found the
kind of the CS knowledge to be a wide range from folks who are really, really good at data analysis
in general, and then folks who are good at data analysis but really don't care about the
infrastructure layer at any concept. But when
you're doing really big projects like this, you have to care about the details. So how do you
bridge the gap from people solving really big problems, but without the infrastructure knowledge
to know what size, what's the optimum size for striping files, et cetera, details where they just, you know, aren't in the weeds about the storage system. Sure. So, I feel like we're actually pretty good at
our documentation. So, that's usually kind of our first place to go. However, we do have an entire
group of folks. It's our scientific computing group at the lab that supports the INCITE program that I was mentioning before, where the goal for this group is to take domain scientists. Like, let's say you have an
astrophysicist that has this code where they want to do this really exciting thing. But, you know,
like you mentioned, you know, they may, they may not be computer scientists necessarily.
And, you know, so we have this scientific computing group that is domain scientists that are also computer scientists to help them tune their codes, port their codes to GPUs, make sure that they're using the IO system in a way that doesn't bring it to a halt, for example, or impact other users.
So we have a whole group of people that do this for us.
Interesting. You mentioned the data protection layers, that it was ZFS RAID, dRAID2, I guess?
Yes.
So is ZFS behind Lustre, or how does that work?
Yeah, so Lustre, like I mentioned about the OSTs before, you can format those different ways. So basically each of the servers has a set of ZFS zpools, like RAID sets. But each of those RAID sets can be ZFS, or they can be, if you're familiar with ext, right, there's a Lustre-specific file system, it has a few extensions on ext, but basically it's an ext4 file system called ldiskfs. There's different underlying file systems, the file system RAID sets, that Lustre is a superset of.
So Lustre uses those underlying file systems
as the store for the data.
Right, right, right, right, right.
So how do you upgrade something like this?
If you've got, I don't know, 450 OST servers,
if that's the right number,
and you decide you want to roll out a new feature of Lustre or something like that, how does that transpire in this environment?
Very complicated.
So with one of these large systems, you know, the systems are made of building blocks.
Like I just mentioned, there's 450 server pairs.
But, you know, there's 225 server pairs. I'm sorry, 450 servers, 225 pairs.
And so we have a single pair that we call our TDS, our test and development system.
So anytime that we want to perform an upgrade, we go and test the upgrade there. It also has a computer connected to it that's a TDS of Frontier. So we'll do this. We'll perform end-to-end upgrades of both, run a set of
regression tests to ensure that we're not going to cause an instability or a performance degradation
or any of those sorts of things. So we'll test everything on our test and development system.
But the problem is, is that proving that it works on a single unit is not the same. Versus 225.
Exactly.
So we don't usually do live updates unless there's something very specific that's low
impact that we're trying to test.
Usually it involves taking a full outage, and we do them on Tuesdays because, you know, Monday is planning day. Tuesday you do the upgrade, and then if something goes terribly wrong, you have a few days to work it out before the weekend. So, yeah, we'll run this at small scale and then we'll do the upgrade on the large system.
And the intention is always to have the infrastructure set up so that you can roll back.
So Orion runs on all stateless images so we can perform this upgrade, upgrade to the new image.
If it doesn't go well or we find an instability or something else that we've introduced, we just reboot the system back into the old image.
And it's exactly in the state it was in earlier that day before the upgrade. So the one thing that isn't stateless is the control plane. So how do you
guys back up the control plane itself? So we actually, I guess, in terms of system administration,
you know, there's the phrase, I guess, you know, treat your servers like cattle, not pets. Right. So, you know, our philosophy
there is that, you know, we want to have the recipe to regenerate the control plane, what we call our management system. It's the cluster management component. So basically we use
Puppet for configuration management, where any change we make goes through Puppet, which is
backed by Git. So if I'm going to go change a tunable on a system, I go make that change in Puppet,
and then it filters down into the image and into the individual storage nodes.
But Puppet is fully backed up, like very, very, and actually, you know, every sysadmin
has like essentially a backup on their laptops because they have a Git checkout of it.
But so that component's backed up.
Our images are backed up to our backup system. The recipe we use to put the operating system on our management cluster is also backed up. So basically, if my management server dies or something really bad happens, I can just throw another one in place and then regenerate it within a couple hours, get the system back online.
That's pretty impressive, actually.
Yeah, I'm thinking of how closely this is to like a hyperscaler environment.
But you also have the, I don't, I guess it's a luxury of not having the same SLA as a hyperscaler. So you can take this model in which, you know, downtime can happen,
maybe not, you know, the preference, but downtime can happen.
Right. Yeah. And I think one major difference too is that, you know, hyperscalers, I think, a lot of times are trying to solve more distributed problems, where we're trying to solve scale problems. So with a hyperscaler,
it'll be important. I guess from a hyperscalers perspective, it's going to be important to have
very, very high uptime. You want to have flexibility, right? Like you may use other
like Kubernetes or something to make sure that if you have a resource that needs to be running,
it's always running somewhere, right? Where for us, our systems are relatively static. So the complexity of how, you know,
like we put this system in place and the way it's designed, it's not, you can't really change it and
still keep the same performance attributes. So we put it in place, we accept it, we get our users
on it, we use it for five, six years, and then we move on to our next one
where a hyperscaler is constantly changing and that they need to have a flexible environment
that we don't necessarily have to have. Ours is about performance and data integrity.
And uptime actually is a huge consideration of ours. You know, generally, with these at-scale systems, the one that we are still running in production, that will be going away later this year, it still sees, you know, 98, 99 percent uptime. It's very good for a system of its scale.
Versus where one day there needs to be bare metal hosts, another day these need to be virtualized hosts, or yet another day these hosts are running containers.
The use cases are pretty static in the sense that you're providing the service.
You conform your workload or application to the infrastructure
versus the infrastructure conforming to the
workload. Exactly. That's exactly right. Yeah. Yeah. So, so geez, this is pretty amazing stuff.
Is this, is this currently running today or is this something that's in the process of being
built up or what's the status of it, I guess, Dustin?
Orion is in production as of,
I'm trying to remember the exact date,
it was in April.
But, you know, before we put the system in production,
you go through a huge amount of,
you know, acceptance process,
you know, making sure that every end to end
from a functionality, performance, stability,
that you've run through a huge number of tests
to ensure the system's ready for users. But we also put users on the system. We have a set of
what we call friendly users that we put on. It's like, hey, we know that you have a file
per process workload that's very hard on the system. Can you test it and share your results
with us or let us work with you along the way? Or you have a huge single shared file workload, which is one of the areas that we really want to make sure to exercise and works
well. Can you get on the system and run this for us? And honestly, the feedback, even when the
system is being used, like I said, we've seen 10, 11, 12 terabytes per second from it, even in a
mixed environment with lots of users in the system, our users are still seeing, you know, I've seen 7.9 terabytes a second.
So I think, yes, the system's in production.
It's in use.
And so far, I mean, our users love it.
So I guess with the, have you, it's a relatively new system.
So, you know, it's like bringing out your fast sports car and it seems fast for like
the first year.
The sports car performance hasn't changed, but the user's
expectations have. This is what us operators always have to deal with. And I don't think
you guys have gotten to the point where users are complaining that it's slow yet, but you have a unique system in that it is a shared system. So do you have to deal with noisy neighbors yet?
Yes. And it depends on which file system;
different file systems have different architectural considerations that can change the blind spots,
right? For like areas that you may get... You know, let's say, for example, that I have one user that is very friendly and, you know, they may have, let's say, a file per node for their workload.
Right. And let's say their job is 6,000 nodes.
Now, let's also say I have another user on the same metadata subsystem that is using 9,400 nodes, but they have, you know, 64 processes per node. Their workload would be very, very different, and could potentially, you know, have a negative interaction with that other job that's maybe doing something a little less intensive. So that potential is always there. But the way we solve that is,
you know, I mentioned that we have 10 petabytes of metadata that's actually spread across
40 metadata servers. And each of our projects is randomly assigned to a metadata server. And so what that allows us to do is to basically break up the problem domain. And so
if one user is misbehaving,
it'll only impact a subset of other projects
and not all of them.
So the system will remain on.
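A toy version of that blast-radius idea: map each project onto one of the 40 metadata servers so a misbehaving project only affects its own MDT. The hashing below is just for illustration; per the conversation, the real assignment is made randomly when a project is created.

```python
import hashlib

NUM_MDTS = 40   # Orion's metadata servers, per the conversation

def mdt_for_project(project: str) -> int:
    """Deterministically map a project ID to one of the 40 MDTs.

    Illustrative only: ORNL assigns projects randomly at creation time,
    but the effect is the same, each project lands on a single MDT.
    """
    digest = hashlib.sha256(project.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_MDTS

for proj in ("AST021", "CHM135", "FLD042"):   # AST021 from the episode; others invented
    print(proj, "->", f"MDT{mdt_for_project(proj):02d}")
```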
Effectively, a metadata server is like a file system
in your parlance then, is that?
Well, the metadata subsystem actually is composed
of ZFS file systems also,
the same as like the object storage systems.
So it is the part of Lustre that I'm trying to figure out the right way to put this.
It's not a file system of its own.
The metadata subsystem is made of smaller file systems that service the broader Lustre file system, if that makes sense.
But you mentioned two things here, Dustin. You mentioned that, number one, as jobs come in,
they're randomly assigned to a metadata server.
And number two-
Not the jobs, sorry, the projects.
The projects, okay.
So this isn't like an automated orchestration system.
So when I submit, to use Ray's language, a job... when I submit a job, it is... a project is a customer or a client doing a thing, and they're randomly assigned to a metadata server.
Right.
So let's say that I want to solve
a numeric weather prediction
or yeah, astrophysics problem.
I'll come in and I'll put a proposal
and saying, hey,
astrophysics, let's use an example.
Like, let's say I want to model the expansion of the universe.
And somebody's like, cool.
Yeah, we'll create AST021.
That'll be your astrophysics project name. And so they will have an area of the file system like slash Lustre, Orion, AST021.
And that directory, AST021, it's not just one person, it could be a whole group of people that work in that project. That project itself would be, not dedicated, it would be on a single metadata server that may be shared by other users.
And then do you guys have the ability to kind of move the projects to metadata servers based on use, or noisy neighbors, or something like that?
You can, but then you would have to move the metadata to the other metadata servers. So it would actually be a migration. It's not like you could just...
It would have to be worth the effort, like it had to be two long-term projects that obviously collide with each other, like, okay, this is worth that effort.
Well, and usually what we try to do instead is, like I mentioned, we have a very, very robust telemetry system, and if we identify a user as being problematic, we will work with those users and be like, okay, what are you doing, what are you trying to do? Like, let's figure out how to make it so that this isn't happening.
I don't want to like put users in jail
or put them in file system jail and move them off.
I'd rather just help them to do what they need,
what they're trying to do.
More effectively on Orion and stuff like that.
So speaking of, you know,
kind of frontier and Orion,
these project, these massive project names,
this is obviously not something that
you folks thought about three months ago and deployed today. How long was the planning process
of the design prior to building out, I would imagine it was iterations of the production
system and then the final production system? Give us a timeline.
So usually, I mean, it's years. So like, for example, we just deployed Orion and, you know, we're already talking about our next-gen file system. Like, we're designing it now. But usually what that is, is, you know, with my group at the lab, the first year of a system is usually spent stabilizing it. Which actually, since Orion was put in production, it's had
a hundred percent scheduled uptime. So it's been very good so far, actually.
But, you know, when you, when you get real users on it and not synthetic benchmarks,
you expose problems. So we'll spend the first year exposing problems on the current system
while concurrently working on building our next one. Before you can build your next one, you need to evaluate different products. So we'll reach out to vendors and say, what are you thinking about doing in three, four, five years? And if it's something that, let's say they have an early version of it, it's like, okay, we want to buy one of those, and we put it in what we call our test bed, so it has every generation of all these different storage systems from all these different vendors, and we'll put it through its paces and understand where its weaknesses are.
Are there things that we can work with that vendor to improve? And then we put out a call
for proposals, right? And we gather our use cases from our users. We turn those into requirements. We turn those requirements into an RFP, and then our vendors will respond to this request for proposals. And after we receive that, I mean, it can still be a couple of years before you, you know, take delivery of the final system, integrate it, and then you have to go through an acceptance phase and you get to get users on the system. And then you go through it,
you start all over again for your next-gen system.
Yeah. Yeah. That's amazing.
It's usually a five-year life cycle.
Yeah, per file system.
I saw something like this. Orion is a five-year lifetime is what you're saying?
Five to six.
Yeah.
And it depends on a lot of things.
You know, the Orion is just one file system we operate.
You know, we have 12 production file systems.
Right, right, right, right. And are Frontier and Orion kind of tied at the hip together?
Are these separate things?
So Orion may go away and Frontier will connect to, you know, the next file system, or are they...
are the life cycles tied together? They are, I would probably say they're generally tied together,
but they're not exclusively tied together. Like for example, let's say that we, let's say that
Orion's running great and we think we can squeeze another year out of it. You know, it may be that Frontier goes away and Orion stays active and is used by other systems.
Or it could be that we don't want to marry the next procurement with a file system.
And so it could be that the next gen compute system, whatever's past Frontier, ends up mounting Orion for its first year of production while we integrate the new file system.
They're not necessarily married.
I do not envy that process.
All right. Well, listen, so Keith, any last questions for Dustin before we close?
No, we could, if I ask more questions, we can go on for another hour.
This has been extremely interesting.
All right. Dustin,
is there anything that you'd like to say to our listening audience before we close?
I guess I just mentioned, you know, if you have interest in this, you know, we actually
participate heavily in the Supercomputing Conference, the Lustre Users Group, the Cray
Users Group.
You know, Oak Ridge is very active in those spaces, both from the compute, storage, networking, and, you know, facilities side. Obviously, operating these systems, I mean, we're talking 30 megawatts of power for a single system, so there's a lot of complexity there as well. But there's information about all this kind of cool stuff happening at the lab at all those places.
Okay, great, great. Well, this has been great, Dustin. Thank you very much for being on
our show today. Absolutely. It was great meeting with you. And that's it for now. Bye, Dustin.
Bye, Keith. Bye, Ray. Until next time. Next time, we will talk to the system storage technology
person. Any questions you want us to ask, please let us know. And if you enjoy our podcast,
tell your friends about it. Please review us on Apple Podcasts, Google Play, and Spotify, as this will help get the word out. Thank you.