Storage Developer Conference - #72: Innovations, Challenges, and Lessons Learned in HPC Storage Yesterday, Today, and Tomorrow
Episode Date: June 25, 2018...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNEA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org
slash podcasts. You are listening to SDC Podcast Episode 72. So I'm Gary Grider from Los Alamos
National Laboratory.
My co-presenter, John Bent, he used to work for Seagate.
Actually, now he's unemployed, which is why I'm here, I guess.
And he's going to work for Cray in a few days, so he's kind of between jobs.
So I agreed to give the talk.
He was going to give most of it, but I'm going to give it.
So you'd get stuck with me.
First is the SNIA legal notice.
You've probably seen this a few times, so I won't dwell on it.
John presented an abstract. I won't read it. It's in the slides. The slides are online.
So you now have the abstract.
This is one of my slides. I just want to make the point here that we've been at computing for a long time at Los Alamos.
We actually had people carry wires from Los Alamos to Pennsylvania
to run on the first machine in the United States at University of Pennsylvania.
In fact, my mentor was one of those programmers that carried wires from Los Alamos to Pennsylvania
to run the first set of code on the first machine, ENIAC.
We built our own machine that was much like that called MANIAC.
It was built by John von Neumann and Nick Metropolis,
who you see over there in the picture.
My son actually dated Nick Metropolis' daughter, so interesting.
So, actually, granddaughter.
So you see lots of machines there.
IBM Stretch was an interesting machine.
It was interesting because it was considered
to be an enormous failure.
It was supposed to be ten times faster
than the machine before it,
and it only was four times faster.
And so it was considered to be a miserable failure,
even though we had to build an entire building, to IBM spec, just to put it in. Turned out that
was the first machine that exemplified the IBM 360, 370 architecture. So it turns out to have
been a real success, but at the time it was considered to be an interesting failure.
Had lots of CDC machines. The first Cray-1 showed up and didn't work for six months
because they forgot that at 7,200 feet,
there's a whole lot more cosmic events than there are at sea level,
and so they couldn't make the machine run for more than a few minutes.
So we had to wait for them to put some sort of protection on the memory.
I certainly was there in the late period of that time frame.
Cray X-MP and Y-MP series machines came along, interesting stuff.
CM2s and CM5s were very interesting.
Why?
The introduction of data parallelism.
Data parallelism, you might now know, is things like CUDA.
So essentially those machines there, this one on this end and that on that end,
were among the first data parallel machines ever built.
Danny Hillis built them. He's a friend of mine, lives in Santa Fe.
And
interesting machines. This machine up here was
64,000 one-bit wide processors.
One-bit wide. You could specify
your word length to any
size bits you wanted. Pretty cool, huh?
I want my word length to be
18 bits.
You could do it. Pretty neat.
And data parallelism was really cool
because languages like C star and things like that
started to come along,
which predate CUDA by a lot.
And essentially these machines are what would be now called
an NVIDIA graphics processor.
The SGI Blue Mountain machine was interesting.
The DEC HP machine was interesting.
That was the last very, very large alpha machine ever built, I believe.
We built the first petaflop machine in 2009, IBM, using the Toshiba IBM cell processor.
It was the first petaflop machine ever.
Machines beyond that, the current machine we have installed is our Cray Trinity
machine down there. We also have a D-Wave machine, one of four or five in the world, and we're doing
some interesting things with that. And I'm in the process of buying a machine to be installed in
2020 called Crossroads, and it's becoming a rather large pain in the butt. When you buy a
quarter of a billion dollar machine,
it takes about four years to go through the procurement process to get that job done.
So it's a long process.
They're all quite interesting setups in many ways.
A view of our environment, quick view.
So we usually have one premier machine, and then we have a bunch of capacity machines.
Premier machine today is a few million cores, two petabytes of DRAM.
It has four petabytes of flash.
I'll tell you what the flash is for in a minute.
It has very large private scratch space, 100 petabytes.
It runs at about a terabyte a second.
And then we have a whole bunch of other machines to do smaller things. Think 100,000
cores and down or small machines. And we have lots and lots and lots and lots of them. And
they're for running capacity jobs. And then capability jobs are jobs that run on the big
machine, say a million cores and up. And those jobs typically run about six months on that
machine, a single job across the whole machine, sometimes as much as a year to solve a particular problem in the weapons complex.
We have some shared scratch space that's shared by all.
That's much, much slower.
Think hundreds of gigabytes a second.
And these are the capacity machines.
And then we've got these services.
We've got an archive, parallel tape archive.
In the last session, they talked a little bit about punch cards.
I actually have a punch card.
It's writable media.
You can take notes on it.
So, anyway, that's a parallel tape archive.
Of course, NFS services and various kinds of file transfer agents.
We have large clusters that are just around to move data from one file system to another
or from file system to archive to another file system out to the WAN.
These are a few hundreds of machines big, so they're not very big,
but they're just for moving data to and from things.
What's a large machine look like?
Well, it's a few acres. It's
20,000 nodes. It's a few million cores. It's two petabytes of DRAM.
It's four petabytes of flash and so forth. You've heard all that.
To service it, you need a 100 petabyte size scratch file
system. You need a campaign store, which I'll describe in a minute,
that grows at about 30 petabytes a year during the
lifetime of the machine and this tape archive. But what does it take to cool it and stuff? So
that's what it takes to cool it. Those are 36-inch steel water pipes. It took $27 million worth of
36-inch steel water pipes to hook up this last machine that we bought. 36-inch water pipes,
in case you're not aware of how heavy that is.
It will crush any floor, including an 8-inch concrete floor.
So we had to cut holes into the floor and build pylons and hang it from a steel brace
so that it wouldn't crush the building that it was going into.
And yes, I'm doing this again for the next machine that I'm putting in.
Oh my, do I love steel and copper.
Oh gosh.
It's a lot of megawatts.
This machine was supposed to be less than 18 megawatts.
It turned out to be less, about 15 megawatts, I think.
It's interesting, though, because the megawatts is a single job using 18 megawatts.
That's $18 million worth of power here.
But one of the problems you have with power is not just paying for it,
but what are you doing to the power company when you have things like, oh, gosh, the job aborted,
and within one AC cycle you send 18 megawatts back up the chain, right?
And so the power company goes, oh, well, that hurt.
I don't like that too much.
Please don't do that again.
And so we actually have to write software that runs on the
nodes that if the job crashes, it immediately
spins up a slowdown process.
Same thing when we launch
so that we don't launch too fast.
If we're going to take a long DST, we have to call
the local
power company and say, you know, we're going to go off
for a while. You guys can
spin them
down. Anyway, so it's interesting, we actually are
about this close, probably by 2020 we'll be buying power futures, because we'll be at
something like 80 megawatts for the computing there, and that's enough to get into that business.
It's not, you know, Google size, but it's pretty good size. Again, a single job that runs on millions of cores for long periods of time.
Soccer fields and soccer fields and buildings and semis and copper.
Anyway, some of the things we've done in the past that you might recognize,
you probably don't realize that the government has a fair amount to do with funding technology in the U.S.
And DOE and DOD, which we partner with heavily, actually do fund some of this stuff.
This is storage-related because this is a storage conference.
I could have put other stuff up here.
We came up with this idea called a burst buffer.
I'll explain what that is in a minute.
DDN built one.
EMC built one.
And Cray built this thing called DataWarp, which is a burst buffer
we had a fair amount to do with IBM GPFS
we basically paid for
Lustre almost lock, stock and barrel,
paid for a fair amount of Panasas.
pNFS
was a thing that came out of the
University of Michigan, that came out of
payments that we made to
Peter Honeyman and that group there.
Ceph, I'll have a slide on that next.
It's an interesting story, very cool.
HPSS, you may or may not know what that is.
It's a parallel tape system that you can use,
Unitree and these sorts of things.
Tokutek even was very interesting. It's a cache-oblivious B-tree based, sort of a log-structured merge tree kind of a thing.
Ceph begins.
So I see Sage back there in the back, so I can't lie too much.
So we actually were funding technologies inside of universities.
The government does a fair amount of that too,
and we decided to partner with the University of California, Santa Cruz on a proposal that they made to us on
building some essentially research infrastructure to do
research on scalable file systems, stuff like scalable namespaces and
scalable security and other sorts of things like that.
And we signed a contract with them back in 2001. I think we funded them
for nine years at $300,000 a year or something like that.
Lots of students.
And they produced cool, really cool stuff, RADOS and all the things that you have now come to know and love in Ceph.
Anyway, so we were at the beginning of that as well.
Some simulation background.
So you can't really understand what these machines do
unless you take some background.
First, there's scales, length scales.
So what do I mean by length scales?
Well, an application might be running at multiple length scales.
One is a continuum scale doing partial differential equations.
One may be running at molecular scale
because you couldn't get enough fidelity
using the partial differential equation. Maybe even
some at subatomic or molecular scales. And so
essentially a single problem, different parts of the problem may be running at
different scales. Different parts of the problem may be
meshed differently. So a mesh is essentially each one of these cells
in a mesh has like 50
floating point values, pressure and temperature and neutron density and all sorts of wonderfulness.
And you mesh your thing up. Of course, it's very seldom like that. It's more often an unstructured
mesh, a 3D unstructured mesh typically, and at different resolutions. And in fact, you don't ever run the problem at the same resolution all the time.
There's fun parts going on in the problem,
and then at the time, there's not so much fun parts, right?
So think about a shockwave going through a material.
I don't give a crap about the material over here,
and I don't give a crap about the material over here.
I just care about where the shockwave is.
And so essentially, I drill way down on that
to get some interesting stuff
and I drill way out on the other stuff
just to be able to fit it into my two petabytes of memory.
And so basically, this is called AMR,
adaptive mesh refinement.
Essentially, it refines the mesh as the application runs.
There's a couple of different kinds of scientific methods
used, Lagrangian and Eulerian.
Lagrangian is where the mesh deforms
and then you fix it.
So essentially the mesh follows the particles around,
and then in Eulerian,
the mesh stays the same
and the particles move within the mesh.
And then there's different kinds of applications that use both.
And then there's this AMR thing.
And when you came in, you probably saw that I had this little thing running here.
This is an example of what I'm talking about. And you can see in here that
the cell sizes change,
the mesh deforms,
it's correcting itself as you go,
it's subdividing and making some cells smaller
and bigger depending on what's happening
in the application.
This is a real simple little run done at Livermore,
a really nice example of
stepping through time steps. Time steps are, I run for a little while, not for very long, and then I
exchange information with my boundary cells about what's happening in me because they have to adjust
what's happening in them based on what's happening to me. That's the way physics is done. That's the
way these simulations work. And so it's an interesting little example of how that works.
Yes, yes.
And in fact, basically every cell is updated every time slice.
And, in fact, that's interesting because you might ask,
well, what's a time slice?
How long does it take to run a time slice?
It could be as long as a second.
We have one application that runs and synchronizes with its neighbors every millisecond.
So think about how wonderful that is on the power system.
The whole thing stops.
We exchange information on the network,
and the whole thing starts up again every millisecond.
It takes one hell of a lot of capacitors to get DC to behave when you do stuff like that
because that's way, way less than an AC cycle.
I mean, you drive power supplies nuts in this case.
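To make that grind-exchange-grind rhythm concrete, here is a minimal sketch of one process's view of a time-stepped simulation, written against the standard MPI C API. The 1-D decomposition, the cell count, and the do_physics stub are invented for illustration; the real codes are 3-D, unstructured, and adaptive, as described above.

    #include <mpi.h>
    #include <stdlib.h>

    #define NCELLS 1000000   /* cells owned by this rank (illustrative) */

    static void do_physics(double *cells, long n) {
        for (long i = 0; i < n; i++) cells[i] *= 0.999;   /* stand-in for the real grind */
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* local cells plus one ghost cell on each side (toy 1-D decomposition) */
        double *cells = calloc(NCELLS + 2, sizeof(double));
        int left  = (rank == 0)          ? MPI_PROC_NULL : rank - 1;
        int right = (rank == nprocs - 1) ? MPI_PROC_NULL : rank + 1;

        for (int step = 0; step < 1000; step++) {
            do_physics(cells + 1, NCELLS);                /* grind for a while */

            /* then exchange boundary values with the neighbors; every rank
               has to do this before anyone can start the next time step */
            MPI_Sendrecv(&cells[1], 1, MPI_DOUBLE, left, 0,
                         &cells[NCELLS + 1], 1, MPI_DOUBLE, right, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&cells[NCELLS], 1, MPI_DOUBLE, right, 1,
                         &cells[0], 1, MPI_DOUBLE, left, 1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        free(cells);
        MPI_Finalize();
        return 0;
    }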
So anyway,
that's how that works.
How do you program it? Well,
hopefully in the next two slides you'll convince yourself you damn well don't want to be a programmer
that programs one of these things because
it's hard. Scaling.
Scaling, either strong scaling
or weak scaling. Strong scaling is
where you just keep the same problem size,
but you want to run it faster.
Weak scaling is where you scale the size of the problem with the machine.
That's a little easier.
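A quick hypothetical to make the distinction concrete: strong scaling is keeping, say, the same billion-cell problem and doubling the core count in the hope that each time step takes roughly half as long; weak scaling is doubling the cell count along with the core count, so each core keeps the same amount of work and the time per step stays roughly flat while the problem you can afford to run gets bigger.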
Process parallelism is how we've done it in the past.
Explicit message passing between processes on the system,
and there's various kinds of processes that are message passing.
There's point-to-point.
There's all reduces, all gathers.
And in fact, one could have written map reduce in about 15 statements of MPI about 20 years ago,
and in fact, oh my God, somebody did.
And gosh, it was reinvented 15 years later.
Isn't that cool?
So then there's also the point-to-point, of course.
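As a rough illustration of that MapReduce-in-a-few-MPI-statements remark, and not anyone's actual code: each rank "maps" over its own slice of the data, and a single collective does the "reduce."

    #include <mpi.h>
    #include <stdio.h>

    /* toy "map": count how many of this rank's values exceed a threshold */
    static long map_local(const double *vals, long n, double threshold) {
        long hits = 0;
        for (long i = 0; i < n; i++)
            if (vals[i] > threshold) hits++;
        return hits;
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double vals[4] = { rank, rank + 0.5, rank + 1.0, rank + 1.5 };  /* stand-in data */
        long local = map_local(vals, 4, 2.0);

        /* the "reduce": one collective sums every rank's partial result on rank 0 */
        long total = 0;
        MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("total hits across all ranks: %ld\n", total);

        MPI_Finalize();
        return 0;
    }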
There's also MPI plus threads,
so we all know what threads are, right?
And so you run MPI across nodes,
and you run threads within a node,
and so you organize your program that way. You can also do MPI plus threads plus data parallelism.
What's data parallelism?
Well, that's SIMD.
That's like I have this vector,
and it's spread across this entire machine,
and I want to do a multiplication of this vector against that vector,
completely distributed all at once.
So you send the same instruction to everybody,
but they just do it on different data parts.
That's how a GPU works, guys.
That's what data parallelism is.
That's what that is.
So you can use both data parallelism and MPI,
MPI in threads or MPI threads plus data parallelism.
And, of course, thinking about this,
when you have a billion cores and three levels of memory,
that gets really fun.
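Here is a minimal sketch of that layering in C, with OpenMP standing in for the threads and a vectorizable inner loop standing in for the data-parallel part; the array sizes and the arithmetic are made up.

    #include <mpi.h>
    #include <stdlib.h>

    #define N 1000000

    int main(int argc, char **argv) {
        int provided;
        /* MPI across nodes; request thread support so OpenMP can run inside each rank */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        double *a = calloc(N, sizeof(double));
        double *b = calloc(N, sizeof(double));

        /* threads within the node, SIMD within each thread's chunk of the loop */
        #pragma omp parallel for simd
        for (long i = 0; i < N; i++)
            a[i] = a[i] * b[i] + 1.0;

        /* message passing between nodes, e.g. a global sum over all ranks */
        double local = 0.0, global = 0.0;
        #pragma omp parallel for reduction(+:local)
        for (long i = 0; i < N; i++) local += a[i];
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        free(a); free(b);
        MPI_Finalize();
        return 0;
    }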
Programmers hate this,
and so they like to use higher-level abstractions,
things like OpenSHMEM, Coarray Fortran, Chapel, X10.
Let me tell you a little bit about Coarray Fortran since I have a card in my pocket.
Coarray Fortran, for those of you that are programmers in Fortran, probably very few of you are,
but there were loops, right, and you looped through and you said, you know, do for whatever, multiply this by that, right?
So what this does is, there was a dimension statement that said, this array has two dimensions,
and I'm going to multiply these three things together. Well, Coarray Fortran gives
you a third dimension, and that third dimension is across the machine. And so when you set up a loop,
the two inner loops run local, and the outer loop runs across the machine. So you just write
Fortran, and you just add a little dimension thingy on,
and you push the parallel button,
and poof, it goes across the whole machine.
Isn't that cool?
Rice University built that for us.
We built that a long, long time ago.
There's one for C.
Actually, C++ with coarrays,
in case you want to know about C++ and where it's headed,
is actually getting this capability fairly soon.
Although, fairly soon, it's a standards body, right?
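Coarray Fortran is Fortran, not C, but the flavor of that extra dimension can be sketched with MPI in C, purely as an analogy (this is not how coarrays are implemented): every rank, or image, owns one slab of the global array, the local loops run on that slab, and touching another image's slab becomes communication.

    #include <mpi.h>
    #include <stdlib.h>

    #define LOCAL_N 1024   /* each image's slice of the "global" array (illustrative) */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int me, nimages;
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &nimages);

        /* conceptually: real :: a(LOCAL_N)[*]  -- the [*] is the image index */
        double *a = calloc(LOCAL_N, sizeof(double));
        double *b = calloc(LOCAL_N, sizeof(double));

        /* the inner loops run local: every image works on its own slab */
        for (int i = 0; i < LOCAL_N; i++)
            a[i] = a[i] * b[i];

        /* referencing another image's slab is what turns into communication;
           here image 0 pulls the first element of image 1's slab as an example */
        if (nimages > 1) {
            double remote = 0.0;
            if (me == 1)
                MPI_Send(&a[0], 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
            else if (me == 0)
                MPI_Recv(&remote, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        free(a); free(b);
        MPI_Finalize();
        return 0;
    }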
Anyway, so there's assists for making this easier
for application programmers. It's not all bad. Mostly bad, but not
all bad. How do you run these things? Well, we have these things called workflows.
The workflows are interesting. Essentially, you
run, you get into a loop, and you run a setup.
You get into a loop, and you run, you checkpoint, you run, you checkpoint, you run, you checkpoint.
And I'll tell you why you checkpoint in a minute.
And then eventually you get done running, and you do downsampling and stuff like that.
We did this workflow document, which is downloadable.
It was associated with the last machine we bought that described how we use the machines for the entire
science workflow of getting a job done. You can see that this was associated
with storage. Some of the stuff is very temporary. Some of it
lasts a little longer and some of it lasts forever.
Within that, like I said, within one of that
cycle, you end up doing this cycle of, you know, compute, checkpoint, compute, checkpoint.
Inside there is grind, communicate with your neighbors, grind, communicate with your neighbors because you're passing state information with your neighbors.
And this happens across the entirety of the memory of the machine essentially over and over and over again quite frequently.
What about Checkpoint? Why do I care about Checkpoint? Well, it's not just about computation.
It's about failure. These are very big machines. They fail all the time.
And the difference between these machines and
typical cloud machines or other kinds of things is the application is tightly coupled.
So if you lost one process,
you've essentially got to stop and start again
from the state of that process
because that process has some information in it
that may affect the calculation.
And so you've got to essentially do checkpoints.
So the way these applications work
is they calculate for a while,
about as long as they think it's going to take
for the machine to die,
because it will die fairly soon.
And so then they take a checkpoint.
How often is that today?
Think an hour or two.
So they compute for an hour or two,
and then they take a checkpoint of the entire memory.
So every hour I get to suck in two petabytes of memory
into some sort of stable storage.
Well, that's interesting, right?
That's a lot of stuff.
And so I've got to do that pretty fast,
so fast that it takes something like 30,000 to 50,000 disk drives,
all spinning in parallel to get it off in some reasonable amount of time
so that I'm not spending all my time doing checkpointing
because I really don't want to checkpoint.
I really want to compute.
And so anyway, that's what checkpointing is about.
It's very difficult to break these things up
into non-synchronous mechanisms
because that's the way physics works, unfortunately.
Yeah?
If you have a machine going down,
can you share a little bit
how you bring the cluster back up?
Do you just replace that machine
and bring it to its last checkpoint
and then continue to bring the whole cluster back up
to its previous checkpoint?
The way it works is something will die.
Usually it's a node.
Usually it's a DIMM,
because there's one shitload of DIMMs, right?
And so it'll die.
The machine, the job will abort.
The spin down will kick in.
It'll spin itself down.
The job is already submitted again to run.
So it's sitting there in the queue. It immediately
starts again and it knows I need to go to
the last checkpoint. So it says, I need all that two petabytes back, please. And so
those disks say, okay, here it comes, a terabyte a second.
And we load it all back up again and off they go.
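Putting the story together, here is a minimal sketch of that compute/checkpoint/restart loop from the application's point of view. The file naming, the one-file-per-rank layout, and the step counts are invented, and real codes go through I/O libraries and the burst buffers described next; the point is the rhythm. The arithmetic behind it is simple: dumping two petabytes at the scratch system's roughly one terabyte a second is on the order of 2,000 seconds, which is why the dump has to be this wide and this fast.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define STATE_DOUBLES (1024 * 1024)   /* stand-in for this rank's share of the 2 PB */
    #define STEPS_PER_CKPT 100            /* compute about as long as you trust the machine */

    /* hypothetical naming scheme: one file per rank per checkpoint */
    static void ckpt_path(char *buf, size_t n, int ckpt, int rank) {
        snprintf(buf, n, "ckpt.%d.rank%d", ckpt, rank);
    }

    static void write_ckpt(int rank, int ckpt, const double *s, long n) {
        char path[256];
        ckpt_path(path, sizeof path, ckpt, rank);
        FILE *f = fopen(path, "wb");
        if (f) { fwrite(s, sizeof(double), n, f); fclose(f); }
    }

    static int read_ckpt(int rank, int ckpt, double *s, long n) {
        char path[256];
        ckpt_path(path, sizeof path, ckpt, rank);
        FILE *f = fopen(path, "rb");
        if (!f) return 0;
        size_t got = fread(s, sizeof(double), n, f);
        fclose(f);
        return got == (size_t)n;
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *state = calloc(STATE_DOUBLES, sizeof(double));
        int last_ckpt = (argc > 1) ? atoi(argv[1]) : -1;  /* resubmitted job names the checkpoint */

        /* restart: if a previous run died, pull the whole state back in */
        if (last_ckpt >= 0) read_ckpt(rank, last_ckpt, state, STATE_DOUBLES);

        for (int ckpt = last_ckpt + 1; ckpt < 1000; ckpt++) {
            for (int step = 0; step < STEPS_PER_CKPT; step++) {
                /* grind plus neighbor exchange would go here */
            }
            /* everyone reaches the same point, then the whole memory goes to storage */
            MPI_Barrier(MPI_COMM_WORLD);
            write_ckpt(rank, ckpt, state, STATE_DOUBLES);
        }

        free(state);
        MPI_Finalize();
        return 0;
    }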
So I'm about to get into why burst buffers exist,
but I won't give that away quite yet.
Do you have them waiting?
Yep.
They absolutely are waiting.
They absolutely are waiting.
And that's a problem with space.
We use the entirety of the memory of the machine.
There are two sort of I.O. patterns that are used.
One is called N-to-N, or file-per-process.
File-per-process is every process, of course, writes its own file.
This is fun because all within a millisecond,
it decides to ask the metadata server of some poor parallel file system
that it wants to create a million files in one directory,
all within a millisecond of one another. And so the metadata server goes, oh God, that hurt, I think I'll try to
get around to that at some point. Eventually you get all the files created and then it works pretty
well, it writes in parallel. The other pattern is N-to-1, which is all the
processes write into the same file, and that's fun because it's really easy to do an open, but then they don't write in huge patterns
by themselves
so that they could
use disks by themselves.
They stride stuff together.
So they decide
that they want to put
all their temperatures together
and then all their pressures together.
So these are like 100K writes.
So there's a million cores
and each one's writing
a 100K write
next to one another
and then they all move down
and they
do it again, and so they beat the hell out of the lock manager and stuff like that to try to get
this stuff written. There's a reason they do this, it's not just because they're being mean to me.
But anyway, that's how it works. So both patterns suck pretty badly.
This is actually why Lustre was invented, just so you know.
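To picture the two patterns, here is a toy sketch of both in plain POSIX calls: file-per-process, where every rank creates its own file (the million-creates metadata storm), and the shared-file case, where every rank writes a small record at an offset computed from its rank and then everyone slides down and does it again. The 100K record size comes from the description above; the paths and counts are invented.

    #include <mpi.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <stdio.h>
    #include <string.h>

    #define REC_SIZE (100 * 1024)   /* ~100K writes, as described */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        static char buf[REC_SIZE];
        memset(buf, 'x', sizeof buf);

        /* Pattern 1: N-to-N, file-per-process. A million ranks doing this open()
           in the same directory, in the same millisecond, is the metadata storm. */
        char path[256];
        snprintf(path, sizeof path, "dump/rank%07d.out", rank);
        int fd = open(path, O_CREAT | O_WRONLY, 0644);
        if (fd >= 0) { write(fd, buf, sizeof buf); close(fd); }

        /* Pattern 2: N-to-1, one shared file. The open is easy, but every rank
           writes a small strided record, then everyone moves down and repeats,
           which is what beats up the file system's locking. */
        int shared = open("dump/shared.out", O_CREAT | O_WRONLY, 0644);
        if (shared >= 0) {
            for (int round = 0; round < 4; round++) {
                off_t off = ((off_t)round * nprocs + rank) * REC_SIZE;
                pwrite(shared, buf, sizeof buf, off);
            }
            close(shared);
        }

        MPI_Finalize();
        return 0;
    }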
They use libraries to put the data in interesting patterns into the file
because they want to retrieve stuff
using hyperslabs and various kinds of techniques
to try to get analysis out of these data.
And so they're writing this stuff out
from a three-dimensional thing
into a one-dimensional surface.
And so, anyway, they use interesting libraries
for shaping how that stuff happens.
This is an old, old, old, old slide deck
of why N to 1 really hurts
associated with erasure and RAID.
Basically, you end up doing this read-update-write operation
because you don't write an entire block,
and so you don't get much parallelism
because of RAID systems that work this way.
Of course, this is a really bad example,
and it's better than this,
and there's lots of tricks to play to get around this,
but this is kind of as bad as it could be. There was a
paper written in 2009, John and I and a few others,
and back when John worked for me and before he went off and became famous,
and we built this thing called PLFS,
the Parallel Log-Structured File System, which was a way to do log structuring in parallel,
and it speeded everything up a whole bunch.
Of course, it created metadata like crazy,
and so it was hard to deal with.
But the reason I think that I brought it up
is A, John put it in because he likes to show off
his publications.
But B, it was the second best paper
at supercomputing in 2009.
It was the first time that an I.O. paper
ever made it to the top five.
So we were pretty proud of that
because supercomputing and I.O., you know, they're not the same.
Okay, so how did it work?
It basically took N-to-1 workloads and turned them into N-to-N workloads by putting this virtual layer in.
You were writing this ungodly pattern of some kind, and it basically turned it into, virtually, just appends, and it made a big index
and a big distributed index, and that turned out to be hard to deal with,
but it was interesting. And it sped things up
a lot. Wow, John's slides, cool.
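The core trick, sketched very loosely here (this is the idea, not the PLFS code): every process appends its writes, in arrival order, to its own data log, and records in an index where each logical piece of the shared file actually landed; reads then consult the index instead of the real layout. The file names and record sizes are invented.

    #include <stdio.h>

    /* one index entry: bytes [logical_off, logical_off + len) of the shared
       file live at physical_off in this process's private log */
    struct log_entry {
        long logical_off;
        long len;
        long physical_off;
    };

    struct writer {
        FILE *log;      /* private, append-only data log for this process */
        FILE *index;    /* where each logical extent landed in the log */
        long  log_pos;
    };

    /* an ugly strided N-to-1 write becomes a cheap sequential append plus an index record */
    static void logged_write(struct writer *w, long logical_off, const void *buf, long len) {
        struct log_entry e = { logical_off, len, w->log_pos };
        fwrite(buf, 1, (size_t)len, w->log);
        fwrite(&e, sizeof e, 1, w->index);
        w->log_pos += len;
    }

    int main(void) {
        struct writer w = { fopen("rank0.data", "wb"), fopen("rank0.index", "wb"), 0 };
        if (!w.log || !w.index) return 1;

        static char rec[100 * 1024];
        /* three widely strided offsets in the logical file, all written as appends */
        logged_write(&w, 0L,         rec, sizeof rec);
        logged_write(&w, 400L << 20, rec, sizeof rec);
        logged_write(&w, 800L << 20, rec, sizeof rec);

        fclose(w.log); fclose(w.index);
        return 0;
    }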
So how did we, you know, circa 2002
to 2015, which is a long time,
not as long as cards, of course, but a long time, this is how we worked.
We had a tightly coupled application running in DRAM.
Excuse me.
We had a parallel file system and this ungodly, icky workload up here,
and then we had this parallel file system talking to a tape archive, which was very well aligned.
This is how things worked up until 2012 to 15
or something like that.
And then what happened is I got busy with my spreadsheets
and scared the hell out of everybody.
Being a manager, the spreadsheet, of course,
is the tool of choice.
For programmers, of course, it's kind of laughable.
But anyway, I did this thing where I said,
okay, I know what size machines I'm going to buy,
how big they are, how often they're going to fail,
and I know how much disks cost and how fast they're going to be,
and I know kind of how much flash it's going to cost
and how fast it's going to be.
And so what should I buy going into the future
to put this scratch file system together
to do these checkpoints to?
And there's three different slides.
There's one that's just buy flash, all flash.
Well, that was going to cost a lot
because the capacity was going to cost me a whole lot.
Then the other was buy all disk.
The problem with all disk is it was more expensive,
and so I used this hybrid approach,
some disk and some flash, which isn't a big surprise.
Tiering is an old trick in storage,
and that's actually where burst buffers were born,
is out of this hybrid thing,
and I'll tell you exactly what a burst buffer is in a minute.
My most proud moment about this particular graph
was it was the first time I ever put a million dollars on a log scale,
which I thought was a pretty cool concept.
I did the same thing for our archive.
I said, okay, gosh, we buy lots of tape drives and tapes.
How are we going to get these big files off to tape?
And it turned out that for the first time ever I had modeled archives forever,
the tape drives cost more than the tape. And I said, gosh, what's happening?
Well, there was a bandwidth driver that we hadn't had before that showed up because
the sizes of the data got so big. And so we realized that
at some point in time we needed to add a tier of disk in front of the tape
that we didn't have before to make it economical enough to do.
And that's where campaign storage was born,
and I'll tell you what that is in a minute.
Burst buffer is a layer of flash
that goes in between the parallel file system and the DRAM.
When I said that Trinity had two petabytes of DRAM,
it also has four petabytes of flash.
That's what it's for,
is for you to dump your stuff down at four or five terabytes a second
and then move on while it drains some of those,
not all of them, but some of them to the disk and vice versa.
So that's what burst buffers are.
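Run the numbers from a couple of slides back and the reason for the tier is plain: dumping the two petabytes of DRAM at the burst buffer's four or five terabytes a second takes roughly seven or eight minutes, versus something like half an hour going straight to the scratch file system at about a terabyte a second, and the drain to disk then happens while the application is already back to computing.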
People are using them for all kinds of things now,
including analysis, in situ analysis,
and they're also using them for producer-consumer models
so you have analysis running on the same machine
as you have the compute running on, and so forth.
Wow, John has a lot of stack slides, doesn't he?
So this other thing that we injected here was called a,
he called it an object store.
This is a campaign store.
It's called a campaign store because that's the sort of data
you need to keep around and fairly close by
during the campaign, the science campaign that you're running.
That's why it's called Campaign Store.
I didn't invent the word, but that's what it is.
That's an awful lot of layers, though.
How many should there be?
Well, really sort of two, right?
There needs to be one that's sort of performance-based
and one that's sort of capacity-based.
And there may be lots of physical layers within there,
but there's kind of really only two in a way.
So sort of like that, a performance tier, probably Flash today,
and some sort of a capacity tier, probably something like an object store,
but this may be more than one kind of device type,
which may force us to have more than,
that may look like more than one tier.
Why?
The way we think it's going to happen in the next five years,
maybe six or seven years at Los Alamos
is that we'll have a very fast all-flash scratch
file system within the supercomputer itself. In fact, our 2020 machine will probably be the first
machine to have something like between 100 and 200 petabytes of flash inside the machine. It won't
have a spinning scratch disk file system. It'll be inside the machine if it works out well for us.
So anyway, that'll be interesting.
The target there is 20 terabytes a second,
just so you know what the bandwidth is targeted for.
And then we'll have these capacity tiers.
The capacity tier here is this campaign store,
which is using disk.
Disk, of course, is basically I have this sort of nomenclature down here,
metadata and data reads and writes.
You can do anything to the agile tier.
You can't write to the campaign tier without a tool.
You can't just write willy-nilly to it because it's a bunch of SMR or Hammer
or some ungodly drives that are hard to write to,
and so you're going to have to make it behave on write.
And then the archive, which is probably tape or something worse,
and so you can't read or write to it without a special tool.
And so the only thing that you can read or write to directly is the Agile tier,
and then these you can read from it,
but you can't write to it directly without a tool, and so forth.
So that's why we think there'll be multiple layers in here.
Okay, so let's talk about this capacity tier,
this thing that was up here, that thing,
what's it going to look like.
We were starting to look around a couple,
three years at least ago,
to try to figure out what that tier would look like.
We were looking for software solutions to do this.
We said, gosh, you know, why can't we just go buy an object store and get this over with?
The problem really isn't that the object store wouldn't do the job from a bandwidth point of view.
The problem is the users were really used to having folders, and folders really meant something.
They weren't just this obtuse tag.
And they were used to having folder inheritance
and all the things that they're used to.
And so we decided, well, gosh, you know,
really what we need is for our trillion dollars worth of applications
and users that really want to use folders and things like that,
we really needed to look kind of sort of like a parallel file system
or like a file system in some ways. But it doesn't have to have all of POSIX,
just some of it. And so we basically try to figure out, is there
something out there that marries the best of both worlds, an object store, you know,
that scales widely and so forth, and
a POSIX system. And there were issues, of course. POSIX has scaling
issues on certain things.
There's mismatches between POSIX security and so forth and object.
How are we going to do this?
We looked all over for products to do this at the time,
which was admittedly three years ago, and we didn't find a whole lot.
We did look pretty far.
And so we decided, okay, fine, let's just write something,
not something, not a lot, a little.
Let's leverage as much as we can and write as little as we can.
The goal's 100 to 1,000 gigabytes a second,
which is about what this layer needs to go.
If the parallel file system is running 10 terabytes a second,
then this thing needs to run a tenth of that, roughly.
Maybe up to a billion files in a single directory,
maybe a trillion files total in the system.
Near POSIX, but not complete POSIX.
It doesn't have to have all of it,
but it needs folders and things like that.
Could we leverage commercial products underneath,
like scalable object systems and stuff?
We've actually run this on top of Scality and EMC ECS.
We actually, right now, are running over our own erasure that we put on top of ZFSs.
I'll tell you about that in a minute.
There's actually a talk about that Thursday.
It's not very much code.
It's a library or two and essentially uses object stores when it can,
and it uses POSIX namespaces,
many, many, many of them glued together
to put the namespace into.
It's friendly to object stores
by essentially spreading large files
across multiple objects
and packing small files into a single object.
Why does it need to split up large files
into many objects?
Well, I have petabyte-sized files.
You can't write that out as one object.
That won't work.
So think about it.
A gigabyte-sized object,
that's a million objects to do a petabyte.
So you've got to do it.
You've got to do it in parallel.
You've got to do it really wide.
How does it scale?
Down here at the bottom,
it scales across N object servers. So you can have as many as you want
and it'll scale across those, and you just set it up to stripe across them. Essentially this is sort
of a RAID 0, so within each one of those there needs to be protection. How does the namespace scale?
In two dimensions. It scales using namespace tree decomposition, so that's the normal
way to decompose a namespace
where you break up the tree
and split it across multiple metadata servers,
but it also has a file hashing mechanism
so that the files can be hashed
across metadata servers within a directory
so that you can get these very large directories
that have like a million or a billion files in them.
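A minimal sketch of that second dimension of scaling, with an invented hash and server count: within a directory, the file's name, not its parent directory, decides which metadata server owns the entry, so a billion creates in one directory land on all the servers instead of one.

    #include <stdint.h>
    #include <stdio.h>

    /* FNV-1a, a common non-cryptographic string hash; any decent hash works here */
    static uint64_t fnv1a(const char *s) {
        uint64_t h = 14695981039346656037ULL;
        for (; *s; s++) { h ^= (unsigned char)*s; h *= 1099511628211ULL; }
        return h;
    }

    /* tree decomposition picks the server set for a directory; within the
       directory, each file name hashes to one of those metadata servers */
    static int mds_for_file(const char *filename, int num_mds) {
        return (int)(fnv1a(filename) % (uint64_t)num_mds);
    }

    int main(void) {
        printf("file0000001 -> mds %d\n", mds_for_file("file0000001", 64));
        printf("file0000002 -> mds %d\n", mds_for_file("file0000002", 64));
        return 0;
    }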
We were retiring a machine a couple of years back,
actually last year,
called Cielo. It was an 8,000-node machine. And I asked the question, we always have this contest to see how mean we can be to machines before we actually ship them away. And one of the things
that we did was we said, what would it be like if you had an 8,000-node metadata server? Not a
storage server, but a metadata server. Just to hold your namespace.
Wouldn't that be fun?
And so we ran this thing across it,
essentially hashing in this way,
and the target was to create a billion files in a directory
and a trillion files across multiple directories.
And I was hoping to get a billion inserts a second
into that same directory.
I didn't make it.
I never get what I want.
It was only like 900 or some million inserts a second,
and we didn't quite get a trillion.
It was 900 and some billion files.
But anyway, we were close.
So it was a pretty good test.
It was a test essentially of the software
to see if it scaled from a metadata point of view.
How does it work?
It's really simple.
Essentially, it uses POSIX metadata
to store where the data is or store the names.
So this is a typical tree.
And then within a tree, you'll have a file.
And the file will have an extended attribute
that will say where in the object system it is, and striping and all that sort of junk, and it sticks the object down there.
This is called a unifile because the file is just one object.
The same thing is true for a multifile, which is a file that's too big and must be striped,
so it goes across multiple object systems.
And then there's this packed file thingy where you can put lots of small files into a
single pack. How do we get data
to and from it? Well, we could use copy or tar or something
really stupid like that, but it would go pretty slow and pretty serially. So we have
this thing that we've had around for a long time. It's been open source called pftool
and essentially walks trees in parallel,
does readdir in parallel, does stat in parallel,
breaks big files up and moves them in parallel,
and packages small files into packages
and moves them in parallel.
And it actually is restartable.
So if you only get through a certain amount
of your multi-petabytes that you want to move,
it keeps track of where it was at,
can restart and start moving again.
So essentially this particular code
runs on those file transfer agent clusters.
You submit a job and say move my petabytes
and it goes and does that for you
in parallel as fast as it can.
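The big-file half of that, reduced to a sketch with invented chunk sizes, paths, and bookkeeping: split the file into fixed chunks, hand each mover rank a disjoint set of chunks, and record finished chunks so a rerun can skip what has already been moved.

    #include <mpi.h>
    #include <stdio.h>

    #define CHUNK (1L << 30)   /* 1 GiB chunks, purely illustrative */

    /* stand-ins for the real copy and the restart bookkeeping */
    static void copy_range(const char *src, const char *dst, long off, long len) {
        printf("copy %s [%ld..%ld) -> %s\n", src, off, off + len, dst);
    }
    static int  already_done(long chunk) { (void)chunk; return 0; }  /* would consult a restart log */
    static void mark_done(long chunk)    { (void)chunk; }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nmovers;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nmovers);

        long file_size = 40L * CHUNK;   /* pretend: one 40 GiB file to move */
        long nchunks = (file_size + CHUNK - 1) / CHUNK;

        /* round-robin the chunks over the mover ranks; skip anything an
           interrupted earlier run already finished, so the job is restartable */
        for (long c = rank; c < nchunks; c += nmovers) {
            if (already_done(c)) continue;
            long off = c * CHUNK;
            long len = (off + CHUNK <= file_size) ? CHUNK : file_size - off;
            copy_range("/scratch/bigfile", "/campaign/bigfile", off, len);
            mark_done(c);
        }

        MPI_Finalize();
        return 0;
    }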
Oops.
Where does this campaign store fit
in this big drawing I said?
Well, it fits down here
and it connects up to our site-wide network.
So this site-wide network is, you know,
capable of something like a few terabytes a second.
That's an InfiniBand network
that connects all this stuff together.
Of course, this private network also
is over a terabyte a second,
and then within the machines, of course, it's faster.
So who all else is playing with campaign store-like ideas besides LANL?
There's a company called Spectralogic.
It's out of Colorado.
They make tape robots.
They also make an object server with disks that sit in front of tape.
And they partner with Peter Braam, who you probably know is one of the authors of Lustre,
a longtime friend of mine.
And essentially, they're essentially taking this code and making product out of it.
There's five sites that are trying to use it right now, I understand, across the world.
So it's being commercialized.
And then this is another John slide showing off one of his publications, or our publications.
We wrote a nice article for Usenix on this idea of where storage stacks are headed in HPC,
and it's in the login magazine.
Let's talk a little bit about the future.
The future is, for DOE, it's called Exascale
because we're currently at petascale machines.
What does Exascale mean?
Well, something like a billion-way parallelism.
So you've got something like a billion threads,
between a billion and 10 billion threads.
We're hoping to keep it down to something like 20 or 30 megawatts for such a machine,
which may be very difficult to do.
It'd be nice if it fit inside a football field.
If it didn't, that'd be bad, I guess, but maybe not a killer.
We'd like for it to be sort of productive, if possible, for users.
You know, 100 petabyte size working sets,
and so that's sort of the memory of the machine
and so forth.
So anyway, how do you do this?
Well, you spend a whole lot of money in industry
to fund a whole lot of technology.
So that's what these lines are about.
So we have funding going to IBM and NVIDIA
and AMD and ARM and HP and others to fund technologies
in this space to try to figure out what we need to do to get there from a power point of view,
from a resilience point of view, and so forth. You buy systems every once in a while. You write
RFPs to pull this technology up into those systems. So think of these as half-a-billion-dollar buys, two labs buying two machines or something every so often.
And eventually, you know, you get there, and finally, yes, you've got something that looks something like an exascale machine,
which, again, would be pretty large.
I don't know how much more to go into here.
Some of the sorts of storage-related things that we've been funding,
there's a project at Intel called DAOS that we've been funding.
Essentially, it's a container system for HPC.
We've come to the conclusion that keeping all the metadata,
the way that we've been keeping metadata before just simply won't work.
We're going to have to containerize it. And so we're doing a lot with research on trying to figure out how to build user space,
ephemeral namespaces that are associated with the application,
and then containerize all that and store it in parallel to objects.
That's kind of what DAOS was about.
It also had this idea of transactional
versioning so that we could
take advantage of the fact that
the applications do have some
asynchrony, but not a lot. You can see
these synchronization points that we could get a little
bit of gain from.
Essentially, the problem is
the stacks that we use today, HDF5,
Lustre, block layers,
running across some InfiniBand network
all the way over to some storage devices of some kind,
even if they're flash.
You're talking about 10 microseconds at least of latency,
if not more, 15, 20.
And the devices themselves are going to be faster than that.
So it would be a pretty big shame for us to buy a big machine with 100 petabytes of flash
and not get our latencies on our software stacks down.
So we're going to spend a lot of money in the next,
I don't know, five years or so
trying to get those software stacks to far less latency.
So essentially way less thick,
smaller servers, maybe even no servers, and so forth.
So that's what this is about.
There's some interesting technologies.
The HDF5 VOL technology is about this.
This is a plug-in to HDF5 to go to services that are smarter.
MDHIM is a very interesting project.
That was a parallel key value store framework.
So you think about running, say,
10,000 copies of React in parallel on your machine.
So you instantiate the job.
It instantiates 10,000 instances of React,
and then it stores the records in parallel.
It's a key value store,
and it does it in such a way
that you can get ordering across the whole thing,
which is very interesting.
It kept some key distribution information
so that you could query it and say,
hey, who's got the largest record
or who's got the smallest pressure,
and you would know which key value store
to go and beat up on.
It's a very interesting technology
that's still being worked on.
That was a POSIX namespace.
It was a user-based library that you link with your application,
and it stores data into a key value store as well.
So there's some interesting stuff going on there.
I'm getting close to the end.
I've got six minutes left.
I will say that I actually was at LANL at this time, but that's not me, because I
don't think I ever had that much hair. The software I talked about is at GitHub. Pretty much all this
stuff is open source. It's almost all BSD licensed, so don't blame us if it doesn't work. That's a great one. Those are IBM 3350 disk drives.
There's a CDC 7600.
That's about a 20,000 square foot room.
We have a whole bunch of those.
That building was built and started building in 1959,
and again, it was specced for Stretch.
Stretch was the first machine that went into that building,
and it took up that whole room.
I have five minutes.
Yeah.
Yeah.
What do you see...
How do you see InfiniBand working out for you guys?
You know, the cycle of... The seats are getting very... Well, we're fans of InfiniBand.
We had a fair amount to do with getting it started.
Us and actually work a lot, oddly enough.
Odd partner to have in HPC.
And we like where it's headed. It's always been two to three years ahead in
bandwidth per dollar, which is what we care about.
That's what we're buying. And actually, being religionless
government people, we don't really care. It's just most effective
to use for stuff like that. We also
like the fact that they're putting intelligence on the NICs. We actually had a
summer school this summer that one of the projects
was to try out their new erasure on
the card, which is interesting.
I think they got some pretty good results.
There's a paper at Supercomputing
that's been given about that.
But frankly, when we buy a machine,
we don't spec to say it has to have InfiniBand.
We say it has to be this big and run this big of a problem,
six times bigger than the old one we had, and so forth.
And InfiniBand gets bid sometimes and sometimes it doesn't.
But our centralized network is InfiniBand.
And I just, you know, it's really because of economics. It's the best dollar per gigabyte per second.
It's always been two years ahead of Ethernet, and it looks like it still will for quite a while.
And the tool set gets better and better.
OFED is a fairly healthy community.
Susan Coulter, who works for me, actually is the president of it. Anyway, so yeah, it's a good,
it's a, it's working out, it works out for us. I was going to tell you a little bit about the
erasure system. So we had a problem with the erasure. We were using something like
a 20 plus 4 or something like that,
and we were striping RAID 0 across a whole
crap load of those, and we had this problem.
We were used to having failures
of disk drives once in a while,
and we were used to having failures catastrophically
of maybe a whole rack or two
because of a blast or something like that,
but we had this odd failure that we lost
141 drives all roughly the same day,
which isn't a huge amount
of drives given the population, but it's still quite a bit.
And
it was spread out all over the place. So it wasn't just
one rack and it wasn't, you know, it was really
kind of odd. Turned out to be
serial number based and
all kinds of junk that we worked through with Seagate.
But at any rate,
we were worried about this and we started looking at the literature,
and we found that Facebook had actually done two tiers of erasure on top of memory
for some interesting reasons that they had.
And so we ended up looking around and saying,
is there a product that does two tiers of erasure, erasure on top of erasures?
We couldn't find anything, so we actually wrote our own
the top tier. The bottom tier is just buy as many
erasure boxes as you want. And we use the Intel Storage
Acceleration Library to do that. And that library is really quite
fast. It uses the AVX instruction set. We can get like 10 gigabytes
a second per node.
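A toy illustration of the two-tiers-of-erasure idea; the real system uses proper Reed-Solomon codes through the Intel library rather than the single XOR parity shown here, and the box and drive counts are made up. The point is just the layering: protection within each box, plus a second stripe of protection computed across the boxes, so a whole box, or a correlated batch of drives like those 141, is still recoverable.

    #include <stdio.h>
    #include <string.h>

    #define BOXES  4      /* erasure "boxes" in the bottom tier (illustrative) */
    #define DRIVES 5      /* data drives per box */
    #define BLOCK  4096   /* bytes per block */

    /* bottom tier: parity across the drives inside one box */
    static void box_parity(const unsigned char data[DRIVES][BLOCK], unsigned char parity[BLOCK]) {
        memset(parity, 0, BLOCK);
        for (int d = 0; d < DRIVES; d++)
            for (int i = 0; i < BLOCK; i++)
                parity[i] ^= data[d][i];
    }

    /* top tier: a second parity computed across the boxes */
    static void cross_box_parity(const unsigned char blk[BOXES][BLOCK], unsigned char parity[BLOCK]) {
        memset(parity, 0, BLOCK);
        for (int b = 0; b < BOXES; b++)
            for (int i = 0; i < BLOCK; i++)
                parity[i] ^= blk[b][i];
    }

    int main(void) {
        static unsigned char data[BOXES][DRIVES][BLOCK];
        static unsigned char box_p[BOXES][BLOCK];   /* per-box parity (bottom tier) */
        static unsigned char top_p[BLOCK];          /* cross-box parity (top tier) */
        static unsigned char firsts[BOXES][BLOCK];

        for (int b = 0; b < BOXES; b++)
            box_parity(data[b], box_p[b]);

        /* the top tier protects the first block of every box, as an example */
        for (int b = 0; b < BOXES; b++)
            memcpy(firsts[b], data[b][0], BLOCK);
        cross_box_parity(firsts, top_p);

        printf("computed %d per-box parities and one cross-box parity\n", BOXES);
        return 0;
    }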
There's a really nice talk that's going to be given Thursday about this event
and this software that we developed around it, which is also open source.
Just don't blame us if it doesn't work.
Looks like I have one minute.
Any last questions?
Yeah.
Do you guys look at different types of accelerators?
Do you guys have at any kind of special accelerators for SNEA? Do you guys worry about the data?
The question was, do we look at accelerators for preconditioning data written out on tape?
No, we don't.
About the only thing I can say about that is when we switch technologies,
which happens when you keep data around for 50 or 70 years, which we actually have, I thought it was interesting that SNIA's doing a 100-year archive and I almost have one. Actually, yeah,
it's true. And it's
important data. It's data from tests in Nevada that we can't lose. It's pretty important.
About the only thing I can say about that is
when we switch technologies,
when we move from tape to tape or when we move from tape to disk or whatever, that's
the time to add value to the data. And so we're certainly thinking about how we could,
as we move it, learn about it. One of the things that we have a really big goal for
is to keep more metadata about the data so that we know later, 50 years later, what we have.
And, in fact, back here in this slide right there, I skipped over something.
There's an index capability that we index literally every file across the whole complex.
This is the technology we're working on right now so that you could, with wide parallelism, say,
across the trillions and trillions of files we
have, I'm a user and I want to say, well, I'm trying to remember what I did with project whatever
the hell it was. Look across the whole complex and tell me what I have. That's what this is.
And I bet you there's a talk sometime in the next year about some really cool technology
that's going to enable us to do that. I'm out of time. Thank you very much.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with your peers in the storage developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.