Storage Developer Conference - #194: HPC Scientific Simulation Computational Storage Saga

Episode Date: July 11, 2023

...

Transcript
Starting point is 00:00:00 Hello, this is Bill Martin, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developers Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast Episode 194. So I want to take a quick second to introduce our first keynote presenter. Gary leads the high-performance computing division at Los Alamos. The HPC division at Los Alamos operates one of the largest supercomputing centers in the world, focused on U.S. national security for the U.S. DOE's National Nuclear Security Administration.
Starting point is 00:01:13 Gary's responsible for all aspects of high-performance computing research, technology development, and deployment at Los Alamos. He has 30 granted patents and over 13 pending in data storage. Of course, Gary's been working in HPC and HPC-related storage since 1984. And with that, please join me in giving Gary a warm welcome. Thank you, Gary. So thanks for inviting me to give a talk. Thanks that we're all back together
Starting point is 00:01:52 again. It's really nice to have an in-audience kind of presentation. I haven't done that in a while. I'm tired of sitting in front of the green screen, so it's all good now. I actually have a real screen to stand in front of. So my talk this morning is about
Starting point is 00:02:12 computational storage, and we've been working on computational storage stuff for a while at Lyle, and we think that there's several big wins in that space, and so we've started a number of projects, and I'm going to tell you a little bit about several of them. My apologies if you've heard some of this before. I don't really know how much of it you've heard before, but hopefully we'll get to the math part toward the end so we can start doing real math. Anyway, let's get started. So we've been at this for a long time at computing. We've been doing computing for about 80 years at LANL. In fact, my mentor when I went to LANL was a lady, and she was a programmer that took wires, which is how you program them,
Starting point is 00:03:00 to ENIAC in Pennsylvania and actually ran one of the first problems on ENIAC. And there's actually a nice IEEE recording of her on the internet you can search for. But anyway, she was a really interesting lady. Uh-oh. Stand over here. So anyway, I've been there for a while, as you know. So some background. At LANL, we have these very, very large machines,
Starting point is 00:03:27 and that's not terribly different than any big supercomputer site necessarily. All supercomputer sites have big machines, but LANL is a little bit different in that we buy a machine to run a single problem on for a very, very long time, which is really unusual. Most supercomputing sites, they buy a big machine, and thousands of scientists buy to use it and they split it up and use it in various ways,
Starting point is 00:03:49 time slice it and space slice it up. We do that too, but we also run very, very large jobs. The kind of jobs that were typical for us is say 10,000 nodes for six months or something like that. So a petabyte of DRAM tied up for six months trying to solve the problem. This is our current machine that's in production. The next one is being installed almost as we speak. But this is sort of a little bit of background on how big the machines are.
Starting point is 00:04:21 So our current machine is 20,000 nodes, a few million cores, a couple of petabytes of DRAM, four petabytes of NAND flash, which was a lot in 2016 that ran at, you know, like four terabytes a second, and a big scratch file system made out of disk that we probably won't have ever again, and 40 petabytes of site-wide campaign store,
Starting point is 00:04:44 which is a relatively cheap disk technology, and there's a tape archive that runs, you know, three gigabytes a second in parallel. In 2023, we're installing a machine, and this was sort of a guess at what it was going to look like. It's probably not going to be quite that size, but this is the kind of machines that will be being installed inside of DOE in the 2023 timeframe. 10 petabytes of DRAM, 100 petabytes of flash, and so forth. Half an exabyte of spinning disk. So that's how big the machines are getting to be.
Starting point is 00:05:19 And at LANL, can you imagine using that whole machine to solve a problem for a year? It must be a really big problem, and I'll describe that in a minute. So anyway, I know it's not tier one size site. It's maybe only 60 megawatts or something total, but it's an awful lot to apply to a single problem. So I'm going to take you through three different scenarios. Actually, this is already out of date, as of a couple of nights ago. But there are sort of three different ways we've approached this.
Starting point is 00:05:55 One, we approached format agnostic offloads, so think compression, erasure, things like that, where the offload device doesn't have to understand the format of the data to do its job. We also are working with key value, which of course that's format aware. The offload engine has to understand the format of the data to be able to deal with it. And then the final thing that we're planning to do is work in the columnar space, which again is a format aware, but it's just columnar instead of row based. And so I'll be talking about these three things throughout the talk. So the first one is this format agnostic operation. So we decided that the first thing we would try, of course, is something relatively easy, although we turned out to not be as easy as we thought,
Starting point is 00:06:43 to do format agnostic offloads. So the first question you got to ask yourself is, why would you want to do format agnostic offloads like compression, erasure, encoding, and things like that? And this sort of depicts the reason why. The reason is because buying Intel processors to do bandwidth operations like this is an expensive way to do it. The top line on this is the potential. If you could drive all the flash devices at the full rate, that's what they would run at. And then the lines below have to do with adding things like checksumming and adding things like erasure and adding things like compression and how much bandwidth you can get out of, I mean the percent essentially of the total bandwidth you can get when you start adding those things. And these are pretty hot servers. I mean,
Starting point is 00:07:39 this was an Intel Platinum dual socket and an AMD Apex second generation. I mean, they're pretty good servers. But the problem is you really can't saturate the flash, nor can you even really saturate the network in some cases because you don't have enough memory bandwidth on the processors to be able to do all these things on the fly to the data as it goes through. And so we said, there's an opportunity. Let's go close that gap. And so let's offload something to make that not the problem anymore.
Starting point is 00:08:12 And that's what we did. So we set some requirements. The requirements, of course, at our site are it's got to work underneath a parallel file system. Why? Well, because we have millions of processes writing and reading into a single file. So you can imagine that's a hard prospect to do, writing a million processes writing into a single file. But we've been doing that for the last 20 years. That's why we paid for Lustre to be developed.
Starting point is 00:08:37 That's what Lustre is for. If anybody tells you it's for something really, really different, tell them that's not true. That's why it was there. And anyway, so whatever we do has to work in that environment. Underneath Lustre, you can run several component file systems. One of the component file systems you can run is ZFS. ZFS is really popular with storage administrators. If there's any storage administrators in the room, they know how reliable and robust it is
Starting point is 00:09:04 because it's got all this log structure and all this encoding in it, and you can roll it back and all kinds of fun stuff. And so we have a lot of ZFS. In fact, at some point, probably all of our disk technology will have ZFS on top of it so that our system administrators can go home and sleep at night. The flexibility that we need, we needed flexibility in where things ran. Why? Well, over time, economics will change.
Starting point is 00:09:33 Over time, memory bandwidth will get faster, and memory bandwidth will get slower compared to network bandwidth, compared to flash bandwidth, and so forth. And so when you're trying to put a solution together for offloading, you don't necessarily want to offload it exactly to one thing. You want to offload it in such a way that next year when some new fangled memory device comes out that's faster, you can move where the piece parts are of what you offloaded around. And NVMe actually gives you this capability with its peering. So you can use peer movement to move data between peers, to move data to where the most economic place to run the piece of the offload that you want to do,
Starting point is 00:10:13 and then change your mind three years from now because economics change. And so that was a really important thing that we were looking for out of this was the ability to move the piece parts that we offload around to different hardware. Could be in the CPU running ZFS, could be in the BOF, in the NIC of the BOF, it could be in the device in the BOF, it could be at the array level and so forth. And so that was an important thing. We also wanted to attempt to follow
Starting point is 00:10:42 NVMe computational storage standards as best we could. Why? Because we wanted to be a good citizen and see what we could do with that standard and tell them where it might have some needs. That was one of the ideas was let's inform that emerging
Starting point is 00:10:59 standard. Then also, the solution needed to be broadly available. That's why we chose ZFS because ZFS is a pretty popular thing. So we wouldn't be the only people that could benefit from it. The computational storage benefits and opportunities that we have, one thing it could do is it could benefit us by giving us higher compression ratios. And why? Because before we weren't willing to spend an extra 10 or 20 or 50
Starting point is 00:11:28 or whatever it was, Intel machines, to do that extra compression. We just do the lightweight compression instead of the heavyweight compression and we use less Intel servers to do that. In this case, we probably think we can get more compression by offloading it to a device that's more appropriate for doing compression.
Starting point is 00:11:48 And 1.06 to 1 to 1.3 to 1, that's not astounding numbers. You know, people talk about compression. They talk about 2 to 1 or 3 to 1 or 4 to 1 or even 10 to 1. We don't get that. We can't get that because we're all floating point numbers. They're all almost random. There's high entropy. If there were compressible data,
Starting point is 00:12:12 that'd be compressible computation. And we've already pressed all that out 20 years ago. So you can't compress our data. There's a nice data set that's on the web that we put out there that said, if you can compress this, come talk to us. If you can't, leave us alone. And it's interesting, the first part of the file, they can compress a little bit, and they go, oh, we're going to win big, and then they
Starting point is 00:12:33 get into the file a little bit. The entropy goes crazy, and all of a sudden they go, sometimes even negative. And so anyway, so enabling expensive encoding is another thing we wanted to do we worry a lot about data loss at our scale and there's correlated failures and random failures random failures are fairly easy to just use simple erasure to cover but when you have correlated failures trying to protect yourself from that is pretty hard what do I mean by a correlated failure
Starting point is 00:13:04 maybe a row of the data center got too cold for a while, and it caused more failure in that row than any other row. Well, it's happened to us. We had a 400 disk loss day. It wasn't a fun day, but it was manageable because we anticipated that things like this would happen. In fact, we're our own worst enemy. We write stuff in parallel. We write a 30,000 disk drives and we beat them up
Starting point is 00:13:31 mercilessly for 20 minutes exactly the same way. So are they going to fail randomly? Probably not. We cause them to fail in correlated ways. So correlated failures are a thing for us and they're probably a thing for you if you're at any scale at all. Per server and per device, bandwidths could be higher just because of that. You're using less dollars on Intel servers and more dollars on Flash devices themselves
Starting point is 00:13:57 because of the curves in the previous slide, which will cause the cost of servers to go down, or maybe you use less of them. And the other thing that we're pretty proud of is we wanted to enable something other than a block ecosystem. And this kind of does this, right? Even though NVMe is a block thing, we've actually developed a distributed computing model over NVMe, right? We're taking piece parts and we're putting them on different places and running them as if it were a distributed process, which it is, which we recognize because we have all kinds of distributed processes in the supercomputers. And so that was kind of important
Starting point is 00:14:35 is to try to push this concept of let's get people thinking about computational storage more as a distributed computing paradigm and less as just an extension of a block protocol. And we pushed that concept pretty hard, and I think we got there. So here's, you'll actually hear a talk about this, actually. There's just one talk. I think it's just one talk about this today. And so at the top, you see a classical ZFS write operation. You know, you call write, copies the data out of the user space down into the kernel, and then the kernel says, okay, we're going to compress it, and we're going to check sum it, and we're going to RAID Z it, and we're going to issue an IO, and it's going to go off to an NVMe device, and that was all on the host.
Starting point is 00:15:27 And we said, okay, how are we going to do offloads? Well, what do we want to offload? Compression, checksum, erasure, all those things that require a lot of memory bandwidth to do. And how are we going to do that without moving a lot of data around, moving it back and forth to the computational device? well, the first thing you have to do is you've got to allocate memory on the storage device or on the computational storage device. And so you call malloc, sort of. It's not really a malloc. It's a remote malloc. And you're mallocing space to copy data down to,
Starting point is 00:15:58 and you're going to leave it there. So you copy it down to the computational storage device, and it sits there. And then the control returns back to ZFS. And then ZFS says, you know what? We should compress it. Hey, computational storage device, you're the right person for this job.
Starting point is 00:16:15 You go allocate some more memory and compress the thing and move it into that buffer. And then the same thing happens for checksum A and RAID Z and so forth. And each one of those pieces actually could occur on a different device. So if it were profitable, we could just use peering to move the data to a different peer and have that device go and do it. And so that was the idea was let's offload some of these functions.
Starting point is 00:16:40 Let's do it in a way that looks an awful lot like RPC slash distributed programming concepts. Let's not be religious about where we run things. And let's try to do it in a way that's pretty extensible. And so we at LANL have ZFS developer on site. And so we developed a library where you can register to do these things for ZFS or any storage system, actually. You can say, hey, I'm really good at doing compression. Send me compression stuff. And so you put devices in there and you register the devices to do these things. And then there's a version of ZFS that calls into that to get these functions implemented. And that's all available.
Starting point is 00:17:26 And ask us about it. We can tell you about it. But one of the talks today is going to be about that. What did we do to ZFS? And what does this library look like in the kernel that allows you to register computational storage services? Anyway, so that's how it was done. The performance win was really, really good. The offload was wonderful.
Starting point is 00:17:48 And so there was this big press release. Some of you may have seen it. And it was about this project. And essentially, it was, you know, essentially what I just talked about. It was computational storage. ZFS offloaded. We called it an ABOF, an accelerated BOF. And it offloads this popular file system called ZFS, all the memory bandwidth intensive functions, and all through NVMe.
Starting point is 00:18:18 And so here's the partners. The partners really did all the hard work. I mean, we did some of the hard work with modifying ZFS, but the partners really carried a lot of the load here. And we had some wonderful partners. I had a com host that has this no-load software, and the whole idea behind it is you're not religious about where you run stuff, and it's NVMe-based. Aon is an interesting company. They are wonderful.
Starting point is 00:18:40 They build these wonderful and nicely balanced enclosures so that you get the right amount of PCE in and the right amount of PCI out and so forth. It's amazing to me how many servers you buy that don't have balanced PCIe or have more PCIe switches than they actually have bandwidth or things like that. So we work with Aon to give us enclosures that actually are balanced. NVIDIA's Bluefield technology, so that was used as the NIC in the BOF, and the DPU on that was used as one of the targets for running things. We used SK Hynix Flash, and SK Hynix was a wonderful partner here for supporting
Starting point is 00:19:20 us, and then LANL did this ZFS offload function. And so that's how it worked. It really worked really well, and hopefully the NVMe Computational Storage Committee will get some value out of learning how we did this, and this is an example application of something you might want to offload. There's profit in offloading it in memory bandwidth savings, and it was an excellent partnership. In fact, the theme for this whole talk should be, you know,
Starting point is 00:19:48 you get stuff done by partnering. And these were great partners. So I'm going to move on to the next thing, which is the first one is format agnostic. So this is format aware. And the first part we're going to do with format aware is record-oriented format aware. So that first part we're going to do with format aware is record-oriented format aware. So that's what this is going to be,
Starting point is 00:20:07 is a talk about several things we're doing in that space that we have done and are doing. But first I have to go back and talk about the prereq for this work. And the prereq for this work was called DeltaFS.
Starting point is 00:20:21 It was done at Carnegie Mellon University. And it turned out to be a top paper at supercomputing for a student. And the student is sitting right there in the back of the room. And so anyway, it leverages that work. And let me tell you a little bit about that work. So first of all, it's important that you understand the application a little bit to understand why this is important. So we have this thing called a vector particle and cell code. Particle and cell means if I've got 100,000 cores, then I have a million or so cells that are spread out across those 100,000 cores,
Starting point is 00:21:02 or maybe 10 million or something like that. And then I've got a trillion particles. And the particles flow around in those cells. So based on something that's happened, some event that caused a high-energy thing to happen, these particles are flowing all over the place because they're hitting things and bouncing off of one another and all sorts of physical fun things to do. And essentially they're moving around. So every time step particles are moving around all over the place. And this is actually a plasma. And so the particles are moving really, really fast. And some of them are super energetic and some of them aren't and so forth. And so
Starting point is 00:21:40 anyway, you run this application. It's a fixed mesh, so it's really pretty simple. This mesh is fixed across all of the processor's memories, and the particles are flowing around in them. And the particles are really tiny, the amount of data. It's like 24 bytes or 32 bytes. There's a particle ID, which of course has to go up to a trillion, so it's a 64-bit number. But then it's got an energy field, and it's got a what cell, a MIN field, and so forth. And so the records are really
Starting point is 00:22:13 tiny. They're like 48-byte records or something. But there's a trillion of them. So essentially, I've got a trillion 48-byte records that I want to do analytics on. And it's a trillion every time step. And so every time this thing runs and particles move, it stops and it says, okay, write out these trillion. And it runs a little longer and it stops and it writes out these trillion and so forth. And so at the end, it's a large number of trillion files or trillion particles or records that you're dealing with. So it's kind of the edge case for record-oriented
Starting point is 00:22:46 processing, right? Tiny little records and a whole lot of them being spewed out of 10,000 machines and a million cores. At the end of this large run, the scientists will say magically, not really magically, I'll get into how they do this in a minute. There's a thousand of those trillion particles that are super interesting. They are so hot or they're so energetic. I'd really like to know where all they've been. So you've run this thing for a long time and then they want to know, they want a particle trace of where all the cells were in this thing over time. And so the question, how do you do this? How are you going to do this?
Starting point is 00:23:29 And how do records help you? And why offload? Why is offload a big deal for this? The answer is because the guy only cares about a thousand particles out of a trillion. The query is going to retrieve nine orders of magnitude less data than was written if there was ever an opportunity to offload all the way to the storage device this is it this is the the poster child for it and and so what was done is that we you know developed this solution in software and this is what the student did.
Starting point is 00:24:05 He's back in the back of the room. We essentially ran 10,000-ish copies of LevelDB or RocksDB all in parallel. We sharded up the key space. And what's the key? The key is the particle ID. And particle IDs are so simple, they shard trivially.
Starting point is 00:24:24 And so shard A has particle ID. And particle IDs are so simple, they shard trivially. And so shard A has particle ID zero to something, and shard B has particle ID something to something bigger, and so forth. Super trivial. And then what happens is, every time step, this application writes out all the particles and shards them up and sends them to the right place. Of course, they're coming from all over the place. So these particles are just jamming in, all trillion of them, going to 10,000 shards. And some of them came from here, but the next time those things may be over here because they moved and so forth. And so these things are just sifting in. And so the first problem we had was, how the heck are you going to do that? That's a trillion RPCs you're doing every time you turn around. That's
Starting point is 00:25:08 an awful lot of network RPCs, even on a $50 million network. That's a lot of RPCs. And so we did this thing called input. And so instead of doing a single record, everyone batches up the things that are going to the right shards, makes nice multi-record RPCs, and sends those to the right shards. We got two or three or four orders of magnitude improvement just because of that. And the idea here was you don't want to add any time to the right. So that graph in the middle is the right time, and the dark blue is the overhead we took over writing out this stuff out as just files, just flatten it out and write it out as files as big as we can, or can we put records into many, many key value stores at almost the same rate,
Starting point is 00:26:00 so that the application is being penalized almost none to get their data written out in record form as opposed to squashing it out and writing it in byte stream form. And if you write it out in record form, you can actually query it. And so that's the idea is don't spend any extra time or only a little bit of extra time writing it so that you get this huge win on read. And actually, the win on read is enormous. It's a factor of 1,000 or something. Before, we would have had to have read in all tens of petabytes, many, many, many, many trillions of records, and sift through them every time step,
Starting point is 00:26:39 finding these 1,000 particles. And now we just go ask a key value store, give me all the places that this particle has been and so it's really really fast and so you can see that the win is huge and it's interesting if the win the query time stays almost flat no money how many particles you have you could go to 10 trillion particles and you still take basically the same amount of time because the win is so big.
Starting point is 00:27:10 And so anyway, this is what was done. And why was it the best paper at supercomputing and so forth is because the numbers are really enormous. We got 8 billion particle inserts per second. Not 8 million, 8 billion. It's pretty big, right? And so anyway, this is a pretty cool thing. And we said, this win is so big, it's obvious we should push this down out of software,
Starting point is 00:27:33 put it near the storage devices. If there ever was a poster child for this, this is it. And so that's what we did. We decided to start working with a couple of companies on doing this. And the first one that I'll talk to you about is Pavilion. So Pavilion has an NVMe over Fabric storage array, and they happen to have a pretty smart storage controller. And so the question was, could we stick a fully ordered key value store onto that storage array and do this at the array level instead of in software in our servers. And so we worked with them. They're a really cool company and they put this stuff down
Starting point is 00:28:12 on their storage array. And we got super excellent performance because we weren't moving the data through hosts. It was being done all on the storage device. And they actually used extensions on the SNEA KV over NVMe standard. And the extensions are input and input and get and other kinds of things that we needed. And that worked out really, really well. And we're working with them in that space still. And please ask us for our semantic extensions to the KV. We actually wrote a document that said, here's the things we want out of a KV. It's not just get, put, and delete and compact. It's input and get and controlled. There's a few things we needed
Starting point is 00:29:00 to have added. We're happy to share that with anybody. I think it'd be useful for everybody to have it that's working in key value store space or contemplating it because there's some really nice additions to key value stores. The next one I want to talk to you about is our effort with SK Hynek. So SK Hynek has been a super good partner and it's kind of been interesting. It's like two giants trying to dance with one another right at first, right? I mean, SK is this huge giant and Los Alamos and DOE is this big, huge giant of bureaucracy and trying to figure out how to work with one another took a while, but man, has it been worth it? I think we ended up kind of trying to figure out
Starting point is 00:29:42 how to challenge one another. And it's been super. This relationship has been really great. At any rate, the question that we asked them was, I mean, they said challenge us. So we did. We said, you know, we've taken this key value thing and put it in software. We've taken this key value thing and we've put it at the array level. Could you take it all the way to the storage device? Could you push it all the way to a flash device and get it in or near or almost in a storage device? And that's a pretty tall order if you think about the way key value stores and
Starting point is 00:30:15 log-structured merge trees work. So they did that. In fact, they pushed it all the way into an FPGA sitting on top of a storage device. In fact, that storage device, that's a flash drive, right? It's as long thing as a flash drive. And that's a card with an FPGA on it. And this is all over NVMe. And so they actually push this functionality all the way to a storage device. This was demonstrated at Flash Memory Summit a couple of months ago or something,
Starting point is 00:30:45 and there were two talks about it. And it's really pretty. In fact, it was on the floor, and it was actually one of the most popular demos on the floor. It's pretty cool. We made a little demo where you could go up and guess a particle energy, and it would give you the trace for it.
Starting point is 00:31:03 It was kind of fun fun and they gave away a bunch of if you guessed right you got a ball or something like that anyway um this has really been the performance is pretty crazy good and um this has been a great partnership and it's a fully ordered key value store so it's not it's not like the early key value stores that were released by Samsung that were hash-based. You know the hash. You get the value. It's an actual ordered key value store. I don't know the value.
Starting point is 00:31:33 I want to punch into the index and iterate forward, and you give me all the ones that are next. Or I want to punch in the index and go backwards and give me all the ones that are last. That's what I mean by in order to key value store. And this is a full-blown thing. So anyway, this was pretty cool. And we're continuing on working with them on a few other things, but this was really neat. Let's talk about ABOF 2 plans for a little bit now,
Starting point is 00:32:00 which gets into columns. So we did this agnostic stuff with files, and then we did this really cool stuff with records. What could we do with columns, and why would we want to do it with columns? Well, Columnar is pretty popular right now. There's all these Columnar databases and all this Columnar technology, Apache Drill and DuckDB and all kinds of fun stuff for doing analytics. Columns have their pros, right? You can compress each column differently. You can add columns trivially. There's nice stuff about columns, and of course, there's bad stuff about columns. The question we asked ourselves was, do we have applications that could use column or kinds of treatment? And the answer is yes, in spades. So we have a bunch of applications that
Starting point is 00:32:47 use these things called grid methods. And a grid method is where you break up the thing you're physically simulating into grids, and you have a set of values per grid, per grid point. And the values you have are materials properties like the pressure and the temperature and the momentum and energy levels and things like that at each value in the mesh and the grid and the simulation. And there's these fancy grid methods like Lagrangian and Eulerian. Lagrangian is where the mesh deforms. Look at the black and white thing. The mesh is deforming itself. So as the simulation runs, the mesh deforms. But then you have Eulerian meshes like this one, lower one on the right, where the mesh now doesn't deform. It's just the
Starting point is 00:33:39 stuff inside of it that changes. And then you have this stuff on top of that called adaptive mesh refinement, AMR. Adaptive mesh refinement is really cool. See the bottom right: all the squares aren't the same size. Why aren't they the same size? Because there's a lot more going on where the little ones are than where the big ones are. Why do we do this? We do this because we want to run a problem that's two orders of magnitude bigger than the computer we buy. And you might say, Gary, you bought a stupid two-petabyte supercomputer. Why can't you fit your problem into it?
Starting point is 00:34:17 The answer is we can't. Nowhere close. We never will be able to. The scientists will always win. We will never win. We can only try to keep up with them. And so that's what's going on here. Adaptive mesh refinement is a way to run a huge problem on a tiny supercomputer. And the way you do it is you don't care too much about the parts where the fun stuff isn't happening, and you care an awful lot about where the fun stuff is. And the fun stuff, of course,
Starting point is 00:34:48 is moving. It's a shockwave going through a material, or whatever it is you're simulating. This is really common in science. This is done a lot. And it has its interesting issues. The issue it has is, how do you access memory in this scheme? It's ugly. Stuff is moving around all the time. You've got to use pointers to get at all your data,
Starting point is 00:35:11 because there's this layer of software that's moving the mesh around based on heuristics and things like that. But what's cool about it is, when I said that we've eliminated the ability to get compression, this is why. If we found that that mesh was compressible, that means that we messed up. AMR should have pushed all that uninteresting low-entropy crap out of the way and gotten to the part that really needs to be drilled down on, where all the entropy is.
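As a rough illustration of why AMR leaves nothing compressible behind, here is a toy one-dimensional refinement pass. The split-where-the-field-jumps criterion is a stand-in for the real heuristics, which the talk doesn't detail:

```python
def refine(cells, field, threshold):
    """One AMR-style refinement pass over a 1-D mesh (toy sketch).

    cells: list of (x_left, x_right) intervals
    field: function giving the simulated value at a point
    Splits any cell whose endpoint values differ by more than
    `threshold`, i.e. where the interesting high-entropy stuff is.
    """
    out = []
    for (a, b) in cells:
        if abs(field(b) - field(a)) > threshold:
            m = 0.5 * (a + b)
            out.extend([(a, m), (m, b)])   # refine: split the cell in two
        else:
            out.append((a, b))             # boring region: leave it coarse
    return out
```

Running a few passes with a step-function "shockwave" at x = 0.5 piles the small cells up around the shock and leaves one big cell over the flat region; the stored values per cell then all carry information, which is exactly why compression and dedupe find nothing to remove.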
Starting point is 00:35:42 And so if we ever do find compressible data at our laboratory, we go get the scientist and say, you messed up. You could have run a problem way bigger than this. Go fix it, right? And so compression, dedupe, none of that stuff works, because it's already been squeezed out. The computational method itself has said, get rid of it. That compressibility would be opportunity left on the table. So anyway, this is what we do. And these grid methods are interesting, because remember I said inside each one of those little cells there's pressure and temperature and all these values. So these applications think of themselves as a set of distributed arrays, say 50 to 100 distributed arrays, with the number of array values equal to the number of cells in the mesh. And so that's pretty cool.
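That set-of-distributed-arrays layout is the classic structure-of-arrays idea. A minimal sketch, with made-up field values, of how it differs from per-cell records:

```python
# Array-of-structures: one record per cell (a naive way to store the mesh)
aos = [
    {"pressure": 101.3, "temperature": 290.0, "energy": 1.1},
    {"pressure": 99.8,  "temperature": 310.5, "energy": 1.4},
    {"pressure": 100.1, "temperature": 305.2, "energy": 1.3},
]

# Structure-of-arrays: one array per physical quantity, the layout the
# grid codes actually use, which is effectively one column per field
soa = {
    "pressure":    [c["pressure"] for c in aos],
    "temperature": [c["temperature"] for c in aos],
    "energy":      [c["energy"] for c in aos],
}

# A column-oriented query ("max temperature anywhere in the mesh")
# touches exactly one array instead of every record:
hottest = max(soa["temperature"])
```

With 50 to 100 fields and a trillion cells, that difference in what a query has to touch is the whole columnar argument.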
Starting point is 00:36:37 Essentially, they are columnar applications. There's a column of pressure and a column of temperature and so on. And this slide is depicting multiple time steps of an application, time steps one through six, and you can see that it's AMR'd, so the fun stuff has finer squares and the boring stuff has bigger squares. So how do you turn this into something columnar? How we do this is, each one of those middle boxes there is a machine, a computer, and there are, say, 64 cores.
Starting point is 00:37:31 So there are 64 boxes. And you see the lines that connect those things together. Those lines are a Hilbert space-filling curve. Hilbert figured out how to fill space with a single curve that connects everything together. So essentially there is a serialized order for everything in the mesh, using Hilbert. And it's spread across all million cores of the machine and all trillion cells, using a Hilbert curve, and it's one serial array. And the way we write the data out is in Hilbert order: we write all the pressures in Hilbert order, and then we write all the temperatures in Hilbert order, and then we write
Starting point is 00:38:16 everything else in Hilbert order. And so this thing is a million cores all writing into one file to get it all in Hilbert order. And why do they do that? They do that because they want to restart the application with a different number of processors, and they can just go in and subdivide these things and restart with an arbitrary number of processors. So at any rate, we were sitting on a goldmine of columnar applications. We didn't realize that, but here we are. We've got all these columnar applications. What could we do with them?
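The Hilbert-order serialization described above can be sketched with the classic xy2d routine for mapping 2-D cell coordinates to a position along the curve; the 4x4 grid here is illustrative, not LANL's actual format:

```python
def hilbert_index(order, x, y):
    """Map cell coordinates (x, y) on a 2**order x 2**order grid to
    their position along a Hilbert space-filling curve (classic xy2d)."""
    d = 0
    s = 2 ** (order - 1)
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # rotate the quadrant so the sub-curve lines up
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d

# "Writing in Hilbert order" = sort the cells by curve position, then
# emit each column (all pressures, then all temperatures, ...) in that
# one shared order:
cells = [(x, y) for x in range(4) for y in range(4)]
hilbert_order = sorted(cells, key=lambda c: hilbert_index(2, *c))
```

The useful property is that consecutive positions on the curve are neighboring cells in space, so a contiguous byte range of any column corresponds to a compact spatial region, which is what makes resharding across a different processor count a matter of subdividing the serial array.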
Starting point is 00:38:49 What we could do with them is use columnar techniques like DuckDB and Apache Drill. All we really need to do is take this columnar format, add some indices after each column, and presto, you've got essentially what looks like a Parquet file. It's a parallel Parquet file spread across thousands of devices, but it's a Parquet file. And so the question we're asking ourselves is, can we use standard columnar analytics techniques to do analytics on these very, very large data sets that happen to be columnar in
Starting point is 00:39:26 nature? And so where we're headed with this ABOF 2 demo is to try to push this indexing and retrieval technology, using columnar functions, down to the computational storage devices, so that it doesn't run in our software on hosts anymore. It's pushed down to the storage devices. So that's where the ABOF 2 work is starting to happen. And then the next question is, if you could do it to an ABOF set of flash devices, could you do it to disks? So we have this really cool partnership going with Seagate to try to push this stuff all the way down
Starting point is 00:40:03 to an individual disk drive. And interestingly enough, not just to an individual disk drive, but to an individual disk drive that's part of an erasure group that's part of a set of parallel erasure groups. And so we're trying to figure out how to push analytics through the erasure level
Starting point is 00:40:22 all the way to the disk device, so the disk device can do it in parallel. And you're going to see a talk about this tomorrow, here at this conference, on how this was done. It's been demonstrated. It's pretty cool. Related talks: there have been some related talks on all this stuff. We had two talks at FMS. Our friends at SK hynix and us gave talks about that cool KV-CSD. There was a talk about ABOF at FMS. Then there are all these talks
Starting point is 00:40:56 at SDC that I've mentioned. And there's a talk at Supercomputing on GUFI, which is our indexing technology. And let me just leave you with some food for thought. We're working with Fungible, a company that actually gave a talk yesterday, I think, on scalable NVMe over Fabrics endpoint state management. And what do I mean by that?
Starting point is 00:41:15 Well, you've got 10,000 machines that may all talk to the same storage device. How will it deal with all the state required to do that? That seems pretty tough. So we're working in that space. We actually are working on a project to put SPDK on top of MPI so that we can test SPDK scalability at incredible scales, like millions of clients. That's a work in progress. We're looking for some lightweight security solutions for NVMe over Fabrics directly from clients, without kernel involvement. And we're also tinkering around with ideas on how you route NVMe over Fabrics between different media. Because we've done that before with Lustre.
Starting point is 00:41:58 There's this thing called LNet whose job in life is to route RDMA between different network media. And I think that's it. Caveat: everything in here was harder than I made it sound. I'm a manager, so I just wave my arms so that other people in the room do the work. Thanks for your time. Thanks for listening. For additional information on the material presented in this podcast,
Starting point is 00:42:31 be sure to check out our educational library at snea.org slash library. To learn more about the Storage Developer Conference, visit storagedeveloper.org.
