Storage Developer Conference - #72: Innovations, Challenges, and Lessons Learned in HPC Storage Yesterday, Today, and Tomorrow
Episode Date: June 25, 2018...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNEA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org
slash podcasts. You are listening to SDC Podcast Episode 72. So I'm Gary Grider from Los Alamos
National Laboratory.
My co-presenter, John Bent, he used to work for Seagate.
Actually, now he's unemployed, which is why I'm here, I guess.
And he's going to work for Cray in a few days, so he's kind of between jobs.
So I agreed to give the talk.
He was going to give most of it, but I'm going to give it.
So you'd get stuck with me.
First is the SNIA legal notice.
You've probably seen this a few times, so I won't dwell on it.
John presented an abstract. I won't read it. It's in the slides. The slides are online.
So you now have the abstract.
This is one of my slides. I just want to make the point here that we've been at computing for a long time at Los Alamos.
We actually had people carry wires from Los Alamos to Pennsylvania
to run on the first machine in the United States at University of Pennsylvania.
In fact, my mentor was one of those programmers that carried wires from Los Alamos to Pennsylvania
to run the first set of code on the first machine, ENIAC.
We built our own machine that was much like that called MANIAC.
It was built by John von Neumann and Nick Metropolis,
who you see over there in the picture.
My son actually dated Nick Metropolis' daughter, so interesting.
So, actually, granddaughter.
So you see lots of machines there.
IBM Stretch was an interesting machine.
It was interesting because it was considered
to be an enormous failure.
It was supposed to be ten times faster
than the machine before it,
and it only was four times faster.
And so it was considered to be a miserable failure,
even though we had to build an entire building, to IBM spec, just to put it in. Turned out that
was the first machine that exemplified the IBM 360, 370 architecture. So it turns out to have
been a real success, but at the time it was considered to be an interesting failure.
Had lots of CDC machines. The first Cray-1 showed up and didn't work for six months
because they forgot that at 7,200 feet,
there's a whole lot more cosmic events than there are at sea level,
and so they couldn't make the machine run for more than a few minutes.
So we had to wait for them to put some sort of protection on the memory.
I certainly was there in the late period of that time frame.
Cray X-MP and Y-MP series machines came along, interesting stuff.
CM2s and CM5s were very interesting.
Why?
The introduction of data parallelism.
Data parallelism, you might now know, is things like CUDA.
So essentially those machines there, this one on this end and that on that end,
were among the first data parallel machines ever built.
Danny Hillis built them. He's a friend of mine, lives in Santa Fe.
And
interesting machines. This machine up here was
64,000 one-bit wide processors.
One-bit wide. You could specify
your word length to any
size bits you wanted. Pretty cool, huh?
I want my word length to be
18 bits.
You could do it. Pretty neat.
And data parallelism was really cool
because languages like C star and things like that
started to come along,
which predate CUDA by a lot.
And essentially these machines are what would be now called
an NVIDIA graphics processor.
The SGI Blue Mountain machine was interesting.
The DEC HP machine was interesting.
That was the last very, very large alpha machine ever built, I believe.
We built the first petaflop machine in 2009, IBM, using the Toshiba IBM cell processor.
It was the first petaflop machine ever.
Machines beyond that, the current machine we have installed is our Cray Trinity
machine down there. We also have a D-Wave machine, one of four or five in the world, and we're doing
some interesting things with that. And I'm in the process of buying a machine to be installed in
2020 called Crossroads, and it's becoming a rather large pain in the butt. When you buy a
quarter of a billion dollar machine,
it takes about four years to go through the procurement process to get that job done.
So it's a long process.
They're all quite interesting setups in many ways.
A view of our environment, quick view.
So we usually have one premier machine, and then we have a bunch of capacity machines.
Premier machine today is a few million cores, two petabytes of DRAM.
It has four petabytes of flash.
I'll tell you what the flash is for in a minute.
It has very large private scratch space, 100 petabytes.
It runs at about a terabyte a second.
And then we have a whole bunch of other machines to do smaller things. Think 100,000
cores and down or small machines. And we have lots and lots and lots and lots of them. And
they're for running capacity jobs. And then capability jobs are jobs that run on the big
machine, say a million cores and up. And those jobs typically run about six months on that
machine, a single job across the whole machine, sometimes as much as a year to solve a particular problem in the weapons complex.
We have some shared scratch space that's shared by all.
That's much, much slower.
Think hundreds of gigabytes a second.
And these are the capacity machines.
And then we've got these services.
We've got an archive, parallel tape archive.
In the last session, they talked a little bit about punch cards.
I actually have a punch card.
It's writable media.
You can take notes on it.
So, anyway, that's a parallel tape archive.
Of course, NFS services and various kinds of file transfer agents.
We have large clusters that are just around to move data from one file system to another
or from file system to archive to another file system out to the WAN.
These are a few hundreds of machines big, so they're not very big,
but they're just for moving data to and from things.
What's a large machine look like?
Well, it's a few acres. It's
20,000 nodes. It's a few million cores. It's two petabytes of DRAM.
It's four petabytes of flash and so forth. You've heard all that.
To service it, you need a 100 petabyte size scratch file
system. You need a campaign store, which I'll describe in a minute,
that grows at about 30 petabytes a year during the
lifetime of the machine and this tape archive. But what does it take to cool it and stuff? So
that's what it takes to cool it. Those are 36-inch steel water pipes. It took $27 million worth of
36-inch steel water pipes to hook up this last machine that we bought. 36-inch water pipes,
in case you're not aware of how heavy that is.
It will crush any floor, including an 8-inch concrete floor.
So we had to cut holes into the floor and build pylons and hang it from a steel brace
so that it wouldn't crush the building that it was going into.
And yes, I'm doing this again for the next machine that I'm putting in.
Oh my, do I love steel and copper.
Oh gosh.
It's a lot of megawatts.
This machine was supposed to be less than 18 megawatts.
It turned out to be less, about 15 megawatts, I think.
It's interesting, though, because the megawatts is a single job using 18 megawatts.
That's $18 million worth of power here.
But one of the problems you have with power is not just paying for it,
but what are you doing to the power company when you have things like, oh, gosh, the job aborted,
and within one AC cycle you send 18 megawatts back up the chain, right?
And so the power company goes, oh, well, that hurt.
I don't like that too much.
Please don't do that again.
And so we actually have to write software that runs on the
nodes that if the job crashes, it immediately
spins up a slowdown process.
Same thing when we launch
so that we don't launch too fast.
If we're going to take a long DST, we have to call
the local
power company and say, you know, we're going to go off
for a while. You guys can
spin them
down. Anyway, so it's interesting, we actually are
about this close, probably by 2020 we'll be buying power futures, because we'll be at
something like 80 megawatts for the computing there, and that's enough to get into that business.
It's not, you know, Google size, but it's pretty good size. Again, a single job that runs on millions of cores for long periods of time.
Soccer fields and soccer fields and buildings and semis and copper.
Anyway, some of the things we've done in the past that you might recognize,
you probably don't realize that the government has a fair amount to do with funding technology in the U.S.
And DOE and DOD, which we partner with heavily, actually do fund some of this stuff.
This is storage-related because this is a storage conference.
I could have put other stuff up here.
We came up with this idea called a burst buffer.
I'll explain what that is in a minute.
DDN built one.
EMC built one.
And Cray built this thing called DataWarp, which is a burst buffer
we had a fair amount to do with IBM GPFS
we basically paid for
Lustre almost lock, stock and barrel,
paid for a fair amount of Panasas.
pNFS
was a thing that came out of the
University of Michigan, that came out of
payments that we made to
Peter Honeyman and that group there.
Ceph, I'll have a slide on that next.
It's an interesting story, very cool.
HPSS, you may or may not know what that is.
It's a parallel tape system that you can use,
Unitree and these sorts of things.
Tokutek even was very interesting. It's a cache-oblivious B-tree based, sort of a log-structured merge tree kind of a thing.
Ceph begins.
So I see Sage back there in the back, so I can't lie too much.
So we actually were funding technologies inside of universities.
The government does a fair amount of that too,
and we decided to partner with the University of California, Santa Cruz on a proposal that they made to us on
building some essentially research infrastructure to do
research on scalable file systems, stuff like scalable namespaces and
scalable security and other sorts of things like that.
And we signed a contract with them back in 2001. I think we funded them
for nine years at $300,000 a year or something like that.
Lots of students.
And they produced cool, really cool stuff, RADOS and all the things that you have now come to know and love in Ceph.
Anyway, so we were at the beginning of that as well.
Some simulation background.
So you can't really understand what these machines do
unless you take some background.
First, there's scales, length scales.
So what do I mean by length scales?
Well, an application might be running at multiple length scales.
One is a continuum scale doing partial differential equations.
One may be running at molecular scale
because you couldn't get enough fidelity
using the partial differential equation. Maybe even
some at subatomic or molecular scales. And so
essentially a single problem, different parts of the problem may be running at
different scales. Different parts of the problem may be
meshed differently. So a mesh is essentially each one of these cells
in a mesh has like 50
floating point values, pressure and temperature and neutron density and all sorts of wonderfulness.
And you mesh your thing up. Of course, it's very seldom like that. It's more often an unstructured
mesh, a 3D unstructured mesh typically, and at different resolutions. And in fact, you don't ever run the problem at the same resolution all the time.
There's fun parts going on in the problem,
and then at the time, there's not so much fun parts, right?
So think about a shockwave going through a material.
I don't give a crap about the material over here,
and I don't give a crap about the material over here.
I just care about where the shockwave is.
And so essentially, I drill way down on that
to get some interesting stuff
and I drill way out on the other stuff
just to be able to fit it into my two petabytes of memory.
And so basically, this is called AMR,
adaptive mesh refinement.
Essentially, it refines the mesh as the application runs.
There's a couple of different kinds of scientific methods
used, Lagrangian and Eulerian.
Lagrangian is where the mesh deforms
and then you fix it.
So essentially the mesh follows the particles around,
and then in Eulerian,
the mesh stays the same
and the particles move within the mesh.
And then there's different kinds of applications that use both.
And then there's this AMR thing.
And when you came in, you probably saw that I had this little thing running here.
This is an example of what I'm talking about. And you can see in here that
the cell sizes change,
the mesh deforms,
it's correcting itself as you go,
it's subdividing and making some cells smaller
and bigger depending on what's happening
in the application.
This is a real simple little run done at Livermore,
a really nice example of
stepping through time steps. Time steps are, I run for a little while, not for very long, and then I
exchange information with my boundary cells about what's happening in me because they have to adjust
what's happening in them based on what's happening to me. That's the way physics is done. That's the
way these simulations work. And so it's an interesting little example of how that works.
Yes, yes.
And in fact, basically every cell is updated every time slice.
And, in fact, that's interesting because you might ask,
well, what's a time slice?
How long does it take to run a time slice?
It could be as long as a second.
We have one application that runs and synchronizes with its neighbors every millisecond.
So think about how wonderful that is on the power system.
The whole thing stops.
We exchange information on the network,
and the whole thing starts up again every millisecond.
It takes one hell of a lot of capacitors to get DC to behave when you do stuff like that
because that's way, way less than an AC cycle.
I mean, you drive power supplies nuts in this case.
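To make that grind-exchange-grind rhythm concrete, here is a minimal sketch of one process's view of a time-stepped simulation, written against the standard MPI C API. The 1-D decomposition, the cell count, and the do_physics stub are invented for illustration; the real codes are 3-D, unstructured, and adaptive, as described above.

    #include <mpi.h>
    #include <stdlib.h>

    #define NCELLS 1000000   /* cells owned by this rank (illustrative) */

    static void do_physics(double *cells, long n) {
        for (long i = 0; i < n; i++) cells[i] *= 0.999;   /* stand-in for the real grind */
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* local cells plus one ghost cell on each side (toy 1-D decomposition) */
        double *cells = calloc(NCELLS + 2, sizeof(double));
        int left  = (rank == 0)          ? MPI_PROC_NULL : rank - 1;
        int right = (rank == nprocs - 1) ? MPI_PROC_NULL : rank + 1;

        for (int step = 0; step < 1000; step++) {
            do_physics(cells + 1, NCELLS);                /* grind for a while */

            /* then exchange boundary values with the neighbors; every rank
               has to do this before anyone can start the next time step */
            MPI_Sendrecv(&cells[1], 1, MPI_DOUBLE, left, 0,
                         &cells[NCELLS + 1], 1, MPI_DOUBLE, right, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&cells[NCELLS], 1, MPI_DOUBLE, right, 1,
                         &cells[0], 1, MPI_DOUBLE, left, 1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        free(cells);
        MPI_Finalize();
        return 0;
    }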
So anyway,
that's how that works.
How do you program it? Well,
hopefully in the next two slides you'll convince yourself you damn well don't want to be a programmer
that programs one of these things because
it's hard. Scaling.
Scaling, either strong scaling
or weak scaling. Strong scaling is
where you just keep the same problem size,
but you want to run it faster.
Weak scaling is where you scale the size of the problem with the machine.
That's a little easier.
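A quick hypothetical to make the distinction concrete: strong scaling is keeping, say, the same billion-cell problem and doubling the core count in the hope that each time step takes roughly half as long; weak scaling is doubling the cell count along with the core count, so each core keeps the same amount of work and the time per step stays roughly flat while the problem you can afford to run gets bigger.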
Process parallelism is how we've done it in the past.
Explicit message passing between processes on the system,
and there's various kinds of processes that are message passing.
There's point-to-point.
There's all reduces, all gathers.
And in fact, one could have written map reduce in about 15 statements of MPI about 20 years ago,
and in fact, oh my God, somebody did.
And gosh, it was reinvented 15 years later.
Isn't that cool?
So then there's also the point-to-point, of course.
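As a rough illustration of that MapReduce-in-a-few-MPI-statements remark, and not anyone's actual code: each rank "maps" over its own slice of the data, and a single collective does the "reduce."

    #include <mpi.h>
    #include <stdio.h>

    /* toy "map": count how many of this rank's values exceed a threshold */
    static long map_local(const double *vals, long n, double threshold) {
        long hits = 0;
        for (long i = 0; i < n; i++)
            if (vals[i] > threshold) hits++;
        return hits;
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double vals[4] = { rank, rank + 0.5, rank + 1.0, rank + 1.5 };  /* stand-in data */
        long local = map_local(vals, 4, 2.0);

        /* the "reduce": one collective sums every rank's partial result on rank 0 */
        long total = 0;
        MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("total hits across all ranks: %ld\n", total);

        MPI_Finalize();
        return 0;
    }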
There's also MPI plus threads,
so we all know what threads are, right?
And so you run MPI across nodes,
and you run threads within a node,
and so you organize your program that way. You can also do MPI plus threads plus data parallelism.
What's data parallelism?
Well, that's SIMD.
That's like I have this vector,
and it's spread across this entire machine,
and I want to do a multiplication of this vector against that vector,
completely distributed all at once.
So you send the same instruction to everybody,
but they just do it on different data parts.
That's how a GPU works, guys.
That's what data parallelism is.
That's what that is.
So you can use both data parallelism and MPI,
MPI in threads or MPI threads plus data parallelism.
And, of course, thinking about this,
when you have a billion cores and three levels of memory,
that gets really fun.
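Here is a minimal sketch of that layering in C, with OpenMP standing in for the threads and a vectorizable inner loop standing in for the data-parallel part; the array sizes and the arithmetic are made up.

    #include <mpi.h>
    #include <stdlib.h>

    #define N 1000000

    int main(int argc, char **argv) {
        int provided;
        /* MPI across nodes; request thread support so OpenMP can run inside each rank */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        double *a = calloc(N, sizeof(double));
        double *b = calloc(N, sizeof(double));

        /* threads within the node, SIMD within each thread's chunk of the loop */
        #pragma omp parallel for simd
        for (long i = 0; i < N; i++)
            a[i] = a[i] * b[i] + 1.0;

        /* message passing between nodes, e.g. a global sum over all ranks */
        double local = 0.0, global = 0.0;
        #pragma omp parallel for reduction(+:local)
        for (long i = 0; i < N; i++) local += a[i];
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        free(a); free(b);
        MPI_Finalize();
        return 0;
    }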
Programmers hate this,
and so they like to use higher-level abstractions,
things like OpenSHMEM, Coarray Fortran, Chapel, X10.
Let me tell you a little bit about Coarray Fortran since I have a card in my pocket.
Coarray Fortran, for those of you that are programmers in Fortran, probably very few of you are,
but there were loops, right, and you looped through and you said, you know, do for whatever, multiply this by that, right?
So what this does is, there was a dimension statement that said, this array has two dimensions,
and I'm going to multiply these three things together. Well, Coarray Fortran gives
you a third dimension, and that third dimension is across the machine. And so when you set up a loop,
the two inner loops run local, and the outer loop runs across the machine. So you just write
Fortran, and you just add a little dimension thingy on,
and you push the parallel button,
and poof, it goes across the whole machine.
Isn't that cool?
Rice University built that for us.
We built that a long, long time ago.
There's one for C.
Actually, C++ with coarrays,
in case you want to know about C++ and where it's headed,
is actually getting this capability fairly soon.
Although, fairly soon, it's a standards body, right?
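Coarray Fortran is Fortran, not C, but the flavor of that extra dimension can be sketched with MPI in C, purely as an analogy (this is not how coarrays are implemented): every rank, or image, owns one slab of the global array, the local loops run on that slab, and touching another image's slab becomes communication.

    #include <mpi.h>
    #include <stdlib.h>

    #define LOCAL_N 1024   /* each image's slice of the "global" array (illustrative) */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int me, nimages;
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &nimages);

        /* conceptually: real :: a(LOCAL_N)[*]  -- the [*] is the image index */
        double *a = calloc(LOCAL_N, sizeof(double));
        double *b = calloc(LOCAL_N, sizeof(double));

        /* the inner loops run local: every image works on its own slab */
        for (int i = 0; i < LOCAL_N; i++)
            a[i] = a[i] * b[i];

        /* referencing another image's slab is what turns into communication;
           here image 0 pulls the first element of image 1's slab as an example */
        if (nimages > 1) {
            double remote = 0.0;
            if (me == 1)
                MPI_Send(&a[0], 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
            else if (me == 0)
                MPI_Recv(&remote, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        free(a); free(b);
        MPI_Finalize();
        return 0;
    }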
Anyway, so there's assists for making this easier
for application programmers. It's not all bad. Mostly bad, but not
all bad. How do you run these things? Well, we have these things called workflows.
The workflows are interesting. Essentially, you
run, you get into a loop, and you run a setup.
You get into a loop, and you run, you checkpoint, you run, you checkpoint, you run, you checkpoint.
And I'll tell you why you checkpoint in a minute.
And then eventually you get done running, and you do downsampling and stuff like that.
We did this workflow document, which is downloadable.
It was associated with the last machine we bought that described how we use the machines for the entire
science workflow of getting a job done. You can see that this was associated
with storage. Some of the stuff is very temporary. Some of it
lasts a little longer and some of it lasts forever.
Within that, like I said, within one of that
cycle, you end up doing this cycle of, you know, compute, checkpoint, compute, checkpoint.
Inside there is grind, communicate with your neighbors, grind, communicate with your neighbors because you're passing state information with your neighbors.
And this happens across the entirety of the memory of the machine essentially over and over and over again quite frequently.
What about Checkpoint? Why do I care about Checkpoint? Well, it's not just about computation.
It's about failure. These are very big machines. They fail all the time.
And the difference between these machines and
typical cloud machines or other kinds of things is the application is tightly coupled.
So if you lost one process,
you've essentially got to stop and start again
from the state of that process
because that process has some information in it
that may affect the calculation.
And so you've got to essentially do checkpoints.
So the way these applications work
is they calculate for a while,
about as long as they think it's going to take
for the machine to die,
because it will die fairly soon.
And so then they take a checkpoint.
How often is that today?
Think an hour or two.
So they compute for an hour or two,
and then they take a checkpoint of the entire memory.
So every hour I get to suck in two petabytes of memory
into some sort of stable storage.
Well, that's interesting, right?
That's a lot of stuff.
And so I've got to do that pretty fast,
so fast that it takes something like 30,000 to 50,000 disk drives,
all spinning in parallel to get it off in some reasonable amount of time
so that I'm not spending all my time doing checkpointing
because I really don't want to checkpoint.
I really want to compute.
And so anyway, that's what checkpointing is about.
It's very difficult to break these things up
into non-synchronous mechanisms
because that's the way physics works, unfortunately.
Yeah?
If you have a machine going down,
can you share a little bit
how you bring the cluster back up?
Do you just replace that machine
and bring it to its last checkpoint
and then continue to bring the whole cluster back up
to its previous checkpoint?
The way it works is something will die.
Usually it's a node.
Usually it's a DIMM,
because there's one shitload of DIMMs, right?
And so it'll die.
The machine, the job will abort.
The spin down will kick in.
It'll spin itself down.
The job is already submitted again to run.
So it's sitting there in the queue. It immediately
starts again and it knows I need to go to
the last checkpoint. So it says, I need all that two petabytes back, please. And so
those disks say, okay, here it comes, a terabyte a second.
And we load it all back up again and off they go.
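Putting the story together, here is a minimal sketch of that compute/checkpoint/restart loop from the application's point of view. The file naming, the one-file-per-rank layout, and the step counts are invented, and real codes go through I/O libraries and the burst buffers described next; the point is the rhythm. The arithmetic behind it is simple: dumping two petabytes at the scratch system's roughly one terabyte a second is on the order of 2,000 seconds, which is why the dump has to be this wide and this fast.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define STATE_DOUBLES (1024 * 1024)   /* stand-in for this rank's share of the 2 PB */
    #define STEPS_PER_CKPT 100            /* compute about as long as you trust the machine */

    /* hypothetical naming scheme: one file per rank per checkpoint */
    static void ckpt_path(char *buf, size_t n, int ckpt, int rank) {
        snprintf(buf, n, "ckpt.%d.rank%d", ckpt, rank);
    }

    static void write_ckpt(int rank, int ckpt, const double *s, long n) {
        char path[256];
        ckpt_path(path, sizeof path, ckpt, rank);
        FILE *f = fopen(path, "wb");
        if (f) { fwrite(s, sizeof(double), n, f); fclose(f); }
    }

    static int read_ckpt(int rank, int ckpt, double *s, long n) {
        char path[256];
        ckpt_path(path, sizeof path, ckpt, rank);
        FILE *f = fopen(path, "rb");
        if (!f) return 0;
        size_t got = fread(s, sizeof(double), n, f);
        fclose(f);
        return got == (size_t)n;
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *state = calloc(STATE_DOUBLES, sizeof(double));
        int last_ckpt = (argc > 1) ? atoi(argv[1]) : -1;  /* resubmitted job names the checkpoint */

        /* restart: if a previous run died, pull the whole state back in */
        if (last_ckpt >= 0) read_ckpt(rank, last_ckpt, state, STATE_DOUBLES);

        for (int ckpt = last_ckpt + 1; ckpt < 1000; ckpt++) {
            for (int step = 0; step < STEPS_PER_CKPT; step++) {
                /* grind plus neighbor exchange would go here */
            }
            /* everyone reaches the same point, then the whole memory goes to storage */
            MPI_Barrier(MPI_COMM_WORLD);
            write_ckpt(rank, ckpt, state, STATE_DOUBLES);
        }

        free(state);
        MPI_Finalize();
        return 0;
    }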
So I'm about to get into why burst buffers exist,
but I won't give that away quite yet.
Do you have them waiting?
Yep.
They absolutely are waiting.
They absolutely are waiting.
And that's a problem with space.
We use the entirety of the memory of the machine.
There are two sort of I.O. patterns that are used.
One is called N-to-N, or file-per-process.
File-per-process is every process, of course, writes its own file.
This is fun because all within a millisecond,
it decides to ask the metadata server of some poor parallel file system
that it wants to create a million files in one directory,
all within a millisecond of one another. And so the metadata server goes, oh God, that hurt, I think I'll try to
get around to that at some point. Eventually you get all the files created and then it works pretty
well, it writes in parallel. The other pattern is N-to-1, which is all the
processes write into the same file, and that's fun because it's really easy to do an open, but then they don't write in huge patterns
by themselves
so that they could
use disks by themselves.
They stride stuff together.
So they decide
that they want to put
all their temperatures together
and then all their pressures together.
So these are like 100K writes.
So there's a million cores
and each one's writing
a 100K write
next to one another
and then they all move down
and they
do it again, and so they beat the hell out of the lock manager and stuff like that to try to get
this stuff written. There's a reason they do this, it's not just because they're being mean to me.
But anyway, that's how it works. So both patterns suck pretty badly.
This is actually why Lustre was invented, just so you know.
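To picture the two patterns, here is a toy sketch of both in plain POSIX calls: file-per-process, where every rank creates its own file (the million-creates metadata storm), and the shared-file case, where every rank writes a small record at an offset computed from its rank and then everyone slides down and does it again. The 100K record size comes from the description above; the paths and counts are invented.

    #include <mpi.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <stdio.h>
    #include <string.h>

    #define REC_SIZE (100 * 1024)   /* ~100K writes, as described */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        static char buf[REC_SIZE];
        memset(buf, 'x', sizeof buf);

        /* Pattern 1: N-to-N, file-per-process. A million ranks doing this open()
           in the same directory, in the same millisecond, is the metadata storm. */
        char path[256];
        snprintf(path, sizeof path, "dump/rank%07d.out", rank);
        int fd = open(path, O_CREAT | O_WRONLY, 0644);
        if (fd >= 0) { write(fd, buf, sizeof buf); close(fd); }

        /* Pattern 2: N-to-1, one shared file. The open is easy, but every rank
           writes a small strided record, then everyone moves down and repeats,
           which is what beats up the file system's locking. */
        int shared = open("dump/shared.out", O_CREAT | O_WRONLY, 0644);
        if (shared >= 0) {
            for (int round = 0; round < 4; round++) {
                off_t off = ((off_t)round * nprocs + rank) * REC_SIZE;
                pwrite(shared, buf, sizeof buf, off);
            }
            close(shared);
        }

        MPI_Finalize();
        return 0;
    }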
They use libraries to put the data in interesting patterns into the file
because they want to retrieve stuff
using hyperslabs and various kinds of techniques
to try to get analysis out of these data.
And so they're writing this stuff out
from a three-dimensional thing
into a one-dimensional surface.
And so, anyway, they use interesting libraries
for shaping how that stuff happens.
This is an old, old, old, old slide deck
of why N to 1 really hurts
associated with erasure and RAID.
Basically, you end up doing this read-update-write operation
because you don't write an entire block,
and so you don't get much parallelism
because of RAID systems that work this way.
Of course, this is a really bad example,
and it's better than this,
and there's lots of tricks to play to get around this,
but this is kind of as bad as it could be. There was a
paper written in 2009, John and I and a few others,
and back when John worked for me and before he went off and became famous,
and we built this thing called PLFS,
the Parallel Log-Structured File System, which was a way to do log structuring in parallel,
and it speeded everything up a whole bunch.
Of course, it created metadata like crazy,
and so it was hard to deal with.
But the reason I think that I brought it up
is A, John put it in because he likes to show off
his publications.
But B, it was the second best paper
at supercomputing in 2009.
It was the first time that an I.O. paper
ever made it to the top five.
So we were pretty proud of that
because supercomputing and I.O., you know, they're not the same.
Okay, so how did it work?
It basically took N-to-1 workloads and turned them into N-to-N workloads by putting this virtual layer in.
You were writing this ungodly pattern of some kind, and it basically turned it into, virtually, just appends, and it made a big index
and a big distributed index, and that turned out to be hard to deal with,
but it was interesting. And it sped things up
a lot. Wow, John's slides, cool.
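The core trick, sketched very loosely here (this is the idea, not the PLFS code): every process appends its writes, in arrival order, to its own data log, and records in an index where each logical piece of the shared file actually landed; reads then consult the index instead of the real layout. The file names and record sizes are invented.

    #include <stdio.h>

    /* one index entry: bytes [logical_off, logical_off + len) of the shared
       file live at physical_off in this process's private log */
    struct log_entry {
        long logical_off;
        long len;
        long physical_off;
    };

    struct writer {
        FILE *log;      /* private, append-only data log for this process */
        FILE *index;    /* where each logical extent landed in the log */
        long  log_pos;
    };

    /* an ugly strided N-to-1 write becomes a cheap sequential append plus an index record */
    static void logged_write(struct writer *w, long logical_off, const void *buf, long len) {
        struct log_entry e = { logical_off, len, w->log_pos };
        fwrite(buf, 1, (size_t)len, w->log);
        fwrite(&e, sizeof e, 1, w->index);
        w->log_pos += len;
    }

    int main(void) {
        struct writer w = { fopen("rank0.data", "wb"), fopen("rank0.index", "wb"), 0 };
        if (!w.log || !w.index) return 1;

        static char rec[100 * 1024];
        /* three widely strided offsets in the logical file, all written as appends */
        logged_write(&w, 0L,         rec, sizeof rec);
        logged_write(&w, 400L << 20, rec, sizeof rec);
        logged_write(&w, 800L << 20, rec, sizeof rec);

        fclose(w.log); fclose(w.index);
        return 0;
    }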
So how did we, you know, circa 2002
to 2015, which is a long time,
not as long as cards, of course, but a long time, this is how we worked.
We had a tightly coupled application running in DRAM.
Excuse me.
We had a parallel file system and this ungodly, icky workload up here,
and then we had this parallel file system talking to a tape archive, which was very well aligned.
This is how things worked up until 2012 to 15
or something like that.
And then what happened is I got busy with my spreadsheets
and scared the hell out of everybody.
Being a manager, the spreadsheet, of course,
is the tool of choice.
For programmers, of course, it's kind of laughable.
But anyway, I did this thing where I said,
okay, I know what size machines I'm going to buy,
how big they are, how often they're going to fail,
and I know how much disks cost and how fast they're going to be,
and I know kind of how much flash it's going to cost
and how fast it's going to be.
And so what should I buy going into the future
to put this scratch file system together
to do these checkpoints to?
And there's three different slides.
There's one that's just buy flash, all flash.
Well, that was going to cost a lot
because the capacity was going to cost me a whole lot.
Then the other was buy all disk.
The problem with all disk is it was more expensive,
and so I used this hybrid approach,
some disk and some flash, which isn't a big surprise.
Tiering is an old trick in storage,
and that's actually where burst buffers were born,
is out of this hybrid thing,
and I'll tell you exactly what a burst buffer is in a minute.
My most proud moment about this particular graph
was it was the first time I ever put a million dollars on a log scale,
which I thought was a pretty cool concept.
I did the same thing for our archive.
I said, okay, gosh, we buy lots of tape drives and tapes.
How are we going to get these big files off to tape?
And it turned out that for the first time ever I had modeled archives forever,
the tape drives cost more than the tape. And I said, gosh, what's happening?
Well, there was a bandwidth driver that we hadn't had before that showed up because
the sizes of the data got so big. And so we realized that
at some point in time we needed to add a tier of disk in front of the tape
that we didn't have before to make it economical enough to do.
And that's where campaign storage was born,
and I'll tell you what that is in a minute.
Burst buffer is a layer of flash
that goes in between the parallel file system and the DRAM.
When I said that Trinity had two petabytes of DRAM,
it also has four petabytes of flash.
That's what it's for,
is for you to dump your stuff down at four or five terabytes a second
and then move on while it drains some of those,
not all of them, but some of them to the disk and vice versa.
So that's what burst buffers are.
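Run the numbers from a couple of slides back and the reason for the tier is plain: dumping the two petabytes of DRAM at the burst buffer's four or five terabytes a second takes roughly seven or eight minutes, versus something like half an hour going straight to the scratch file system at about a terabyte a second, and the drain to disk then happens while the application is already back to computing.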
People are using them for all kinds of things now,
including analysis, in situ analysis,
and they're also using them for producer-consumer models
so you have analysis running on the same machine
as you have the compute running on, and so forth.
Wow, John has a lot of stack slides, doesn't he?
So this other thing that we injected here was called a,
he called it an object store.
This is a campaign store.
It's called a campaign store because that's the sort of data
you need to keep around and fairly close by
during the campaign, the science campaign that you're running.
That's why it's called Campaign Store.
I didn't invent the word, but that's what it is.
That's an awful lot of layers, though.
How many should there be?
Well, really sort of two, right?
There needs to be one that's sort of performance-based
and one that's sort of capacity-based.
And there may be lots of physical layers within there,
but there's kind of really only two in a way.
So sort of like that, a performance tier, probably Flash today,
and some sort of a capacity tier, probably something like an object store,
but this may be more than one kind of device type,
which may force us to have more than,
that may look like more than one tier.
Why?
The way we think it's going to happen in the next five years,
maybe six or seven years at Los Alamos
is that we'll have a very fast all-flash scratch
file system within the supercomputer itself. In fact, our 2020 machine will probably be the first
machine to have something like between 100 and 200 petabytes of flash inside the machine. It won't
have a spinning scratch disk file system. It'll be inside the machine if it works out well for us.
So anyway, that'll be interesting.
The target there is 20 terabytes a second,
just so you know what the bandwidth is targeted for.
And then we'll have these capacity tiers.
The capacity tier here is this campaign store,
which is using disk.
Disk, of course, is basically I have this sort of nomenclature down here,
metadata and data reads and writes.
You can do anything to the agile tier.
You can't write to the campaign tier without a tool.
You can't just write willy-nilly to it because it's a bunch of SMR or Hammer
or some ungodly drives that are hard to write to,
and so you're going to have to make it behave on write.
And then the archive, which is probably tape or something worse,
and so you can't read or write to it without a special tool.
And so the only thing that you can read or write to directly is the Agile tier,
and then these you can read from it,
but you can't write to it directly without a tool, and so forth.
So that's why we think there'll be multiple layers in here.
Okay, so let's talk about this capacity tier,
this thing that was up here, that thing,
what's it going to look like.
We were starting to look around a couple,
three years at least ago,
to try to figure out what that tier would look like.
We were looking for software solutions to do this.
We said, gosh, you know, why can't we just go buy an object store and get this over with?
The problem really isn't that the object store wouldn't do the job from a bandwidth point of view.
The problem is the users were really used to having folders, and folders really meant something.
They weren't just this obtuse tag.
And they were used to having folder inheritance
and all the things that they're used to.
And so we decided, well, gosh, you know,
really what we need is for our trillion dollars worth of applications
and users that really want to use folders and things like that,
we really needed to look kind of sort of like a parallel file system
or like a file system in some ways. But it doesn't have to have all of POSIX,
just some of it. And so we basically try to figure out, is there
something out there that marries the best of both worlds, an object store, you know,
that scales widely and so forth, and
a POSIX system. And there were issues, of course. POSIX has scaling
issues on certain things.
There's mismatches between POSIX security and so forth and object.
How are we going to do this?
We looked all over for products to do this at the time,
which was admittedly three years ago, and we didn't find a whole lot.
We did look pretty far.
And so we decided, okay, fine, let's just write something,
not something, not a lot, a little.
Let's leverage as much as we can and write as little as we can.
The goal's 100 to 1,000 gigabytes a second,
which is about what this layer needs to go.
If the parallel file system is running 10 terabytes a second,
then this thing needs to run a tenth of that, roughly.
Maybe up to a billion files in a single directory,
maybe a trillion files total in the system.
Near POSIX, but not complete POSIX.
It doesn't have to have all of it,
but it needs folders and things like that.
Could we leverage commercial products underneath,
like scalable object systems and stuff?
We've actually run this on top of Scality and EMC ECS.
We actually, right now, are running over our own erasure that we put on top of ZFSs.
I'll tell you about that in a minute.
There's actually a talk about that Thursday.
It's not very much code.
It's a library or two and essentially uses object stores when it can,
and it uses POSIX namespaces,
many, many, many of them glued together
to put the namespace into.
It's friendly to object stores
by essentially spreading large files
across multiple objects
and packing small files into a single object.
Why does it need to split up large files
into many objects?
Well, I have petabyte-sized files.
You can't write that out as one object.
That won't work.
So think about it.
A gigabyte-sized object,
that's a million objects to do a petabyte.
So you've got to do it.
You've got to do it in parallel.
You've got to do it really wide.
How does it scale?
Down here at the bottom,
it scales across N object servers. So you can have as many as you want
and it'll scale across those, and you just set it up to stripe across them. Essentially this is sort
of a RAID 0, so within each one of those there needs to be protection. How does the namespace scale?
In two dimensions. It scales using namespace tree decomposition, so that's the normal
way to decompose a namespace
where you break up the tree
and split it across multiple metadata servers,
but it also has a file hashing mechanism
so that the files can be hashed
across metadata servers within a directory
so that you can get these very large directories
that have like a million or a billion files in them.
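A minimal sketch of that second dimension of scaling, with an invented hash and server count: within a directory, the file's name, not its parent directory, decides which metadata server owns the entry, so a billion creates in one directory land on all the servers instead of one.

    #include <stdint.h>
    #include <stdio.h>

    /* FNV-1a, a common non-cryptographic string hash; any decent hash works here */
    static uint64_t fnv1a(const char *s) {
        uint64_t h = 14695981039346656037ULL;
        for (; *s; s++) { h ^= (unsigned char)*s; h *= 1099511628211ULL; }
        return h;
    }

    /* tree decomposition picks the server set for a directory; within the
       directory, each file name hashes to one of those metadata servers */
    static int mds_for_file(const char *filename, int num_mds) {
        return (int)(fnv1a(filename) % (uint64_t)num_mds);
    }

    int main(void) {
        printf("file0000001 -> mds %d\n", mds_for_file("file0000001", 64));
        printf("file0000002 -> mds %d\n", mds_for_file("file0000002", 64));
        return 0;
    }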
We were retiring a machine a couple of years back,
actually last year,
called Cielo. It was an 8,000-node machine. And I asked the question, we always have this contest to see how mean we can be to machines before we actually ship them away. And one of the things
that we did was we said, what would it be like if you had an 8,000-node metadata server? Not a
storage server, but a metadata server. Just to hold your namespace.
Wouldn't that be fun?
And so we ran this thing across it,
essentially hashing in this way,
and the target was to create a billion files in a directory
and a trillion files across multiple directories.
And I was hoping to get a billion inserts a second
into that same directory.
I didn't make it.
I never get what I want.
It was only like 900 or some million inserts a second,
and we didn't quite get a trillion.
It was 900 and some billion files.
But anyway, we were close.
So it was a pretty good test.
It was a test essentially of the software
to see if it scaled from a metadata point of view.
How does it work?
It's really simple.
Essentially, it uses POSIX metadata
to store where the data is or store the names.
So this is a typical tree.
And then within a tree, you'll have a file.
And the file will have an extended attribute
that will say where in the object system it is, and striping and all that sort of junk, and it sticks the object down there.
This is called a unifile because the file is just one object.
The same thing is true for a multifile, which is a file that's too big and must be striped,
so it goes across multiple object systems.
And then there's this packed file thingy where you can put lots of small files into a
single pack. How do we get data
to and from it? Well, we could use copy or tar or something
really stupid like that, but it would go pretty slow and pretty serially. So we have
this thing that we've had around for a long time. It's been open source called pftool
and essentially walks trees in parallel,
does readdir in parallel, does stat in parallel,
breaks big files up and moves them in parallel,
and packages small files into packages
and moves them in parallel.
And it actually is restartable.
So if you only get through a certain amount
of your multi-petabytes that you want to move,
it keeps track of where it was at,
can restart and start moving again.
So essentially this particular code
runs on those file transfer agent clusters.
You submit a job and say move my petabytes
and it goes and does that for you
in parallel as fast as it can.
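The big-file half of that, reduced to a sketch with invented chunk sizes, paths, and bookkeeping: split the file into fixed chunks, hand each mover rank a disjoint set of chunks, and record finished chunks so a rerun can skip what has already been moved.

    #include <mpi.h>
    #include <stdio.h>

    #define CHUNK (1L << 30)   /* 1 GiB chunks, purely illustrative */

    /* stand-ins for the real copy and the restart bookkeeping */
    static void copy_range(const char *src, const char *dst, long off, long len) {
        printf("copy %s [%ld..%ld) -> %s\n", src, off, off + len, dst);
    }
    static int  already_done(long chunk) { (void)chunk; return 0; }  /* would consult a restart log */
    static void mark_done(long chunk)    { (void)chunk; }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nmovers;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nmovers);

        long file_size = 40L * CHUNK;   /* pretend: one 40 GiB file to move */
        long nchunks = (file_size + CHUNK - 1) / CHUNK;

        /* round-robin the chunks over the mover ranks; skip anything an
           interrupted earlier run already finished, so the job is restartable */
        for (long c = rank; c < nchunks; c += nmovers) {
            if (already_done(c)) continue;
            long off = c * CHUNK;
            long len = (off + CHUNK <= file_size) ? CHUNK : file_size - off;
            copy_range("/scratch/bigfile", "/campaign/bigfile", off, len);
            mark_done(c);
        }

        MPI_Finalize();
        return 0;
    }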
Oops.
Where does this campaign store fit
in this big drawing I said?
Well, it fits down here
and it connects up to our site-wide network.
So this site-wide network is, you know,
capable of something like a few terabytes a second.
That's an InfiniBand network
that connects all this stuff together.
Of course, this private network also
is over a terabyte a second,
and then within the machines, of course, it's faster.
So who all else is playing with campaign store-like ideas besides LANL?
There's a company called Spectralogic.
It's out of Colorado.
They make tape robots.
They also make an object server with disks that sit in front of tape.
And they partner with Peter Braam, who you probably know is one of the authors of Lustre,
a longtime friend of mine.
And essentially, they're essentially taking this code and making product out of it.
There's five sites that are trying to use it right now, I understand, across the world.
So it's being commercialized.
And then this is another John slide showing off one of his publications, or our publications.
We wrote a nice article for Usenix on this idea of where storage stacks are headed in HPC,
and it's in the login magazine.
Let's talk a little bit about the future.
The future is, for DOE, it's called Exascale
because we're currently at petascale machines.
What does Exascale mean?
Well, something like a billion-way parallelism.
So you've got something like a billion threads,
between a billion and 10 billion threads.
We're hoping to keep it down to something like 20 or 30 megawatts for such a machine,
which may be very difficult to do.
It'd be nice if it fit inside a football field.
If it didn't, that'd be bad, I guess, but maybe not a killer.
We'd like for it to be sort of productive, if possible, for users.
You know, 100 petabyte size working sets,
and so that's sort of the memory of the machine
and so forth.
So anyway, how do you do this?
Well, you spend a whole lot of money in industry
to fund a whole lot of technology.
So that's what these lines are about.
So we have funding going to IBM and NVIDIA
and AMD and ARM and HP and others to fund technologies
in this space to try to figure out what we need to do to get there from a power point of view,
from a resilience point of view, and so forth. You buy systems every once in a while. You write
RFPs to pull this technology up into those systems. So think of these as half-a-billion-dollar buys, two labs buying two machines or something every so often.
And eventually, you know, you get there, and finally, yes, you've got something that looks something like an exascale machine,
which, again, would be pretty large.
I don't know how much more to go into here.
Some of the sorts of storage-related things that we've been funding,
there's a project at Intel called DAOS that we've been funding.
Essentially, it's a container system for HPC.
We've come to the conclusion that keeping all the metadata,
the way that we've been keeping metadata before just simply won't work.
We're going to have to containerize it. And so we're doing a lot with research on trying to figure out how to build user space,
ephemeral namespaces that are associated with the application,
and then containerize all that and store it in parallel to objects.
That's kind of what DAOS was about.
It also had this idea of transactional
versioning so that we could
take advantage of the fact that
the applications do have some
asynchrony, but not a lot. You can see
these synchronization points that we could get a little
bit of gain from.
Essentially, the problem is
the stacks that we use today, HDF5,
Lustre, block layers,
running across some InfiniBand network
all the way over to some storage devices of some kind,
even if they're flash.
You're talking about 10 microseconds at least of latency,
if not more, 15, 20.
And the devices themselves are going to be faster than that.
So it would be a pretty big shame for us to buy a big machine with 100 petabytes of flash
and not get our latencies on our software stacks down.
So we're going to spend a lot of money in the next,
I don't know, five years or so
trying to get those software stacks to far less latency.
So essentially way less thick,
smaller servers, maybe even no servers, and so forth.
So that's what this is about.
There's some interesting technologies.
The HDF5 VOL technology is about this.
This is a plug-in to HDF5 to go to services that are smarter.
MDHIM is a very interesting project.
That was a parallel key value store framework.
So you think about running, say,
10,000 copies of React in parallel on your machine.
So you instantiate the job.
It instantiates 10,000 instances of React,
and then it stores the records in parallel.
It's a key value store,
and it does it in such a way
that you can get ordering across the whole thing,
which is very interesting.
It kept some key distribution information
so that you could query it and say,
hey, who's got the largest record
or who's got the smallest pressure,
and you would know which key value store
to go and beat up on.
It's a very interesting technology
that's still being worked on.
That was a POSIX namespace.
It was a user-based library that you link with your application,
and it stores data into a key value store as well.
So there's some interesting stuff going on there.
I'm getting close to the end.
I've got six minutes left.
I will say that I actually was at LANL at this time, but that's not me, because I
don't think I ever had that much hair. The software I talked about is at GitHub. Pretty much all this
stuff is open source. It's almost all BSD licensed, so don't blame us if it doesn't work. That's a great one. Those are IBM 3350 disk drives.
There's a CDC 7600.
That's about a 20,000 square foot room.
We have a whole bunch of those.
That building was built and started building in 1959,
and again, it was specced for Stretch.
Stretch was the first machine that went into that building,
and it took up that whole room.
I have five minutes.
Yeah.
Yeah.
What do you see...
How do you see InfiniBand working out for you guys?
You know, the cycle of... The seats are getting very... Well, we're fans of InfiniBand.
We had a fair amount to do with getting it started.
Us and actually work a lot, oddly enough.
Odd partner to have in HPC.
And we like where it's headed. It's always been two to three years ahead in
bandwidth per dollar, which is what we care about.
That's what we're buying. And actually, being religionless
government people, we don't really care. It's just most effective
to use for stuff like that. We also
like the fact that they're putting intelligence on the NICs. We actually had a
summer school this summer that one of the projects
was to try out their new erasure on
the card, which is interesting.
I think they got some pretty good results.
There's a paper at Supercomputing
that's been given about that.
But frankly, when we buy a machine,
we don't spec to say it has to have InfiniBand.
We say it has to be this big and run this big of a problem,
six times bigger than the old one we had, and so forth.
And InfiniBand gets bid sometimes and sometimes it doesn't.
But our centralized network is InfiniBand.
And I just, you know, it's really because of economics. It's the best dollar per gigabyte per second.
It's always been two years ahead of Ethernet, and it looks like it still will for quite a while.
And the tool set gets better and better.
OFED is a fairly healthy community.
Susan Coulter, who works for me, actually is the president of it. Anyway, so yeah, it's a good,
it's a, it's working out, it works out for us. I was going to tell you a little bit about the
erasure system. So we had a problem with the erasure. We were using something like
a 20 plus 4 or something like that,
and we were striping RAID 0 across a whole
crap load of those, and we had this problem.
We were used to having failures
of disk drives once in a while,
and we were used to having failures catastrophically
of maybe a whole rack or two
because of a blast or something like that,
but we had this odd failure that we lost
141 drives all roughly the same day,
which isn't a huge amount
of drives given the population, but it's still quite a bit.
And
it was spread out all over the place. So it wasn't just
one rack and it wasn't, you know, it was really
kind of odd. Turned out to be
serial number based and
all kinds of junk that we worked through with Seagate.
But at any rate,
we were worried about this and we started looking at the literature,
and we found that Facebook had actually done two tiers of erasure on top of memory
for some interesting reasons that they had.
And so we ended up looking around and saying,
is there a product that does two tiers of erasure, erasure on top of erasures?
We couldn't find anything, so we actually wrote our own
the top tier. The bottom tier is just buy as many
erasure boxes as you want. And we use the Intel Storage
Acceleration Library to do that. And that library is really quite
fast. It uses the AVX instruction set. We can get like 10 gigabytes
a second per node.
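A toy illustration of the two-tiers-of-erasure idea; the real system uses proper Reed-Solomon codes through the Intel library rather than the single XOR parity shown here, and the box and drive counts are made up. The point is just the layering: protection within each box, plus a second stripe of protection computed across the boxes, so a whole box, or a correlated batch of drives like those 141, is still recoverable.

    #include <stdio.h>
    #include <string.h>

    #define BOXES  4      /* erasure "boxes" in the bottom tier (illustrative) */
    #define DRIVES 5      /* data drives per box */
    #define BLOCK  4096   /* bytes per block */

    /* bottom tier: parity across the drives inside one box */
    static void box_parity(const unsigned char data[DRIVES][BLOCK], unsigned char parity[BLOCK]) {
        memset(parity, 0, BLOCK);
        for (int d = 0; d < DRIVES; d++)
            for (int i = 0; i < BLOCK; i++)
                parity[i] ^= data[d][i];
    }

    /* top tier: a second parity computed across the boxes */
    static void cross_box_parity(const unsigned char blk[BOXES][BLOCK], unsigned char parity[BLOCK]) {
        memset(parity, 0, BLOCK);
        for (int b = 0; b < BOXES; b++)
            for (int i = 0; i < BLOCK; i++)
                parity[i] ^= blk[b][i];
    }

    int main(void) {
        static unsigned char data[BOXES][DRIVES][BLOCK];
        static unsigned char box_p[BOXES][BLOCK];   /* per-box parity (bottom tier) */
        static unsigned char top_p[BLOCK];          /* cross-box parity (top tier) */
        static unsigned char firsts[BOXES][BLOCK];

        for (int b = 0; b < BOXES; b++)
            box_parity(data[b], box_p[b]);

        /* the top tier protects the first block of every box, as an example */
        for (int b = 0; b < BOXES; b++)
            memcpy(firsts[b], data[b][0], BLOCK);
        cross_box_parity(firsts, top_p);

        printf("computed %d per-box parities and one cross-box parity\n", BOXES);
        return 0;
    }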
There's a really nice talk that's going to be given Thursday about this event
and this software that we developed around it, which is also open source.
Just don't blame us if it doesn't work.
Looks like I have one minute.
Any last questions?
Yeah.
Do you guys look at different types of accelerators?
Do you guys have at any kind of special accelerators for SNEA? Do you guys worry about the data?
The question was, do we look at accelerators for preconditioning data written out on tape?
No, we don't.
About the only thing I can say about that is when we switch technologies,
which happens when you keep data around for 50 or 70 years, which we actually have, I thought it was interesting that SNIA's doing a 100-year archive and I almost have one. Actually, yeah,
it's true. And it's
important data. It's data from tests in Nevada that we can't lose. It's pretty important.
About the only thing I can say about that is
when we switch technologies,
when we move from tape to tape or when we move from tape to disk or whatever, that's
the time to add value to the data. And so we're certainly thinking about how we could,
as we move it, learn about it. One of the things that we have a really big goal for
is to keep more metadata about the data so that we know later, 50 years later, what we have.
And, in fact, back here in this slide right there, I skipped over something.
There's an index capability that we index literally every file across the whole complex.
This is the technology we're working on right now so that you could, with wide parallelism, say,
across the trillions and trillions of files we
have, I'm a user and I want to say, well, I'm trying to remember what I did with project whatever
the hell it was. Look across the whole complex and tell me what I have. That's what this is.
And I bet you there's a talk sometime in the next year about some really cool technology
that's going to enable us to do that. I'm out of time. Thank you very much.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with your peers in the storage developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.