Storage Developer Conference - #194: HPC Scientific Simulation Computational Storage Saga
Episode Date: July 11, 2023...
Transcript
Hello, this is Bill Martin, SNIA Technical Council Co-Chair.
Welcome to the SDC Podcast.
Every week, the SDC Podcast presents important technical topics to the storage developer community.
Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org
slash podcasts. You are listening to SDC Podcast Episode 194. So I want to take a quick second to introduce our first keynote presenter.
Gary leads the high-performance computing division at Los Alamos.
The HPC division at Los Alamos operates one of the largest supercomputing centers in the world, focused on U.S. national security for the U.S. DOE National Nuclear Security Administration.
Gary's responsible for all aspects of high-performance computing research, technology development, and deployment at Los Alamos.
He has 30 granted patents and over 13 pending in data storage.
Of course, Gary's been working in HPC and HPC-related storage since 1984.
And with that, please join me in giving Gary a warm welcome.
Thank you, Gary.
So thanks for inviting me to give a talk.
Thanks that we're all back together again. It's really nice to have an in-audience kind of presentation. I haven't done that in a while.
I'm tired of sitting in front of the green screen, so it's all good now.
I actually have a real screen to stand in front of. So my talk this morning is about
computational storage, and we've been working on computational storage stuff for a while
at LANL. We think there are several big wins in that space, and so we've started
a number of projects, and I'm going to tell you a little bit about several of them. My apologies if you've heard some of this before. I don't
really know how much of it you've heard before, but hopefully we'll get to the math part toward
the end so we can start doing real math. Anyway, let's get started. So we've been at this computing thing for
a long time. We've been doing computing for about 80 years at LANL.
In fact, my mentor when I went to LANL was a woman who programmed ENIAC with wires, which is how you programmed it,
out in Pennsylvania, and she actually ran one of the first problems on ENIAC.
And there's actually a nice IEEE recording of her on the internet you can search for.
But anyway, she was a really interesting lady.
Uh-oh.
Stand over here.
So anyway, I've been there for a while, as you know.
So some background.
At LANL, we have these very, very large machines,
and that's not terribly different than any big supercomputer site necessarily.
All supercomputer sites have big machines,
but LANL is a little bit different in that we buy a machine
to run a single problem on for a very, very long time,
which is really unusual.
Most supercomputing sites buy a big machine,
and thousands of scientists vie to use it,
and they split it up and use it in various ways -
time-slice it and space-slice it up.
We do that too, but we also run very, very large jobs.
The kind of job that's typical for us
is, say, 10,000 nodes for six months or something like that.
So a petabyte of DRAM tied up for six months trying to solve the problem.
This is our current machine that's in production.
The next one is being installed almost as we speak.
But this is sort of a little bit of background on how big the machines are.
So our current machine is 20,000 nodes, a few million cores,
a couple of petabytes of DRAM,
four petabytes of NAND flash,
which was a lot in 2016
that ran at, you know, like four terabytes a second,
and a big scratch file system made out of disk
that we probably won't have ever again,
and 40 petabytes of site-wide campaign store,
which is a relatively cheap disk technology,
and there's a tape archive that runs, you know, three gigabytes a second in parallel.
In 2023, we're installing a machine, and this was sort of a guess at what it was going to look like.
It's probably not going to be quite that size,
but this is the kind of machines that will be being installed inside of DOE in the 2023 timeframe.
10 petabytes of DRAM, 100 petabytes of flash, and so forth.
Half an exabyte of spinning disk.
So that's how big the machines are getting to be.
And at LANL, can you imagine using that whole machine
to solve a problem for a year?
It must be a really big problem, and I'll describe that in a minute.
So anyway, I know it's not a tier-one-sized site.
It's maybe only 60 megawatts or something total, but it's an awful lot to apply to a
single problem.
So I'm going to take you through three different scenarios. Actually, this is already out of date, as of a couple of nights ago.
But there are sort of three different ways we've approached this.
First, we approached format-agnostic offloads - think compression, erasure, things like that -
where the offload device doesn't have to understand the format of the data to do its job. We're also working with key value, which of course is
format aware: the offload engine has to understand the format of the data to be able to deal with it.
And then the final thing we're planning to do is work in the columnar space, which again
is format aware, but columnar instead
of row-based. And so I'll be talking about these three things throughout the talk.
So the first one is this format-agnostic operation. We decided that the first thing we would try,
of course, is something relatively easy - although it turned out to not be as easy as we thought -
which was to do format-agnostic offloads.
So the first question you got to ask yourself is, why would you want to do format agnostic offloads
like compression, erasure, encoding, and things like that?
And this sort of depicts the reason why.
The reason is because buying Intel processors to do bandwidth operations like this is an expensive way to do it.
The top line on this is the potential: if you could drive all the flash devices at their full rate, that's what they would run at. The lines below show what happens when you add things like checksumming, erasure, and compression - essentially the percentage of the total bandwidth
you can get when you start adding those functions. And these are pretty hot servers. I mean,
this was a dual-socket Intel Platinum and a second-generation AMD EPYC. I mean, they're pretty good servers.
But the problem is you really can't saturate the flash,
nor can you even really saturate the network in some cases
because you don't have enough memory bandwidth on the processors
to be able to do all these things on the fly to the data as it goes through.
And so we said, there's an opportunity.
Let's go close that gap.
And so let's offload something to make that not the problem anymore.
And that's what we did.
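To give a feel for the shape of that curve, here is a minimal back-of-envelope sketch. The bandwidth figures and per-stage memory passes below are illustrative assumptions of mine, not numbers from the talk; the point is just that every data-touching service adds more trips over the host memory bus, so the effective rate falls toward memory bandwidth divided by total passes.

```python
# Back-of-envelope model: every host-side service adds memory passes, and the
# effective streaming rate is roughly memory bandwidth divided by total passes.
# All numbers are illustrative assumptions, not measurements from the slide.

MEM_BW_GBPS = 200.0          # assumed usable host memory bandwidth
FLASH_BW_GBPS = 80.0         # assumed aggregate flash bandwidth in the box

# rough passes over the data per stage (reads and writes counted separately)
stages = {
    "copy in/out": 2,
    "checksum":    1,
    "compression": 3,
    "erasure":     2,
}

passes = 0
print(f"potential (flash-limited): {FLASH_BW_GBPS:.0f} GB/s")
for name, extra_passes in stages.items():
    passes += extra_passes
    host_limit = MEM_BW_GBPS / passes
    effective = min(host_limit, FLASH_BW_GBPS)
    print(f"+ {name:12s}: host limit {host_limit:6.1f} GB/s -> "
          f"{100 * effective / FLASH_BW_GBPS:5.1f}% of flash potential")
```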
So we set some requirements.
The requirements, of course, at our site are it's got to work underneath a parallel file system.
Why?
Well, because we have millions of processes writing and reading into a single file.
So you can imagine that's a hard prospect - a million processes writing into a single file.
But we've been doing that for the last 20 years.
That's why we paid for Lustre to be developed.
That's what Lustre is for.
If anybody tells you it's for something really, really different, tell them that's not true.
That's why it was there. And anyway, so whatever we do has to work in that environment.
Underneath Lustre, you can run several component file systems.
One of the component file systems you can run is ZFS.
ZFS is really popular with storage administrators.
If there's any storage administrators in the room,
they know how reliable and robust it is
because it's got all this log structure and all this encoding in it,
and you can roll it back and all kinds of fun stuff.
And so we have a lot of ZFS.
In fact, at some point, probably all of our disk technology will have ZFS on top of it
so that our system administrators can go home and sleep at night.
On flexibility: we needed flexibility in where things ran.
Why?
Well, over time, economics will change.
Over time, memory bandwidth will get faster in absolute terms,
but slower relative to network bandwidth,
flash bandwidth, and so forth.
And so when you're trying to put a solution together for offloading,
you don't necessarily want to offload it exactly to one thing. You want to offload it in such a way that next year when some new fangled memory device comes out that's faster, you can
move where the piece parts are of what you offloaded around. And NVMe actually gives you this
capability with its peering. So you can use peer movement to move data between peers,
to move data to where the most economic place to run the piece of the offload that you want to do,
and then change your mind three years from now because economics change.
And so that was a really important thing that we were looking for out of this was the ability to move the piece parts that we offload around to different hardware.
Could be in the CPU running ZFS,
could be in the BOF, in the NIC of the BOF,
it could be in the device in the BOF,
it could be at the array level and so forth.
And so that was an important thing.
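As a toy illustration of that flexibility argument - with the peers, stages, and relative costs all invented for the example - you could imagine the placement decision as nothing more than a cost table that gets re-evaluated as the economics shift, with NVMe peer-to-peer moves carrying the data to whichever peer wins.

```python
# Hypothetical sketch of the "don't be religious about placement" idea.
# Given a cost estimate (arbitrary units) for running each offload stage on
# each peer, pick the cheapest peer per stage. Everything here is made up.

costs = {
    # stage:        {peer: relative cost}
    "compression":  {"host_cpu": 5.0, "nic_dpu": 2.0, "drive_fpga": 1.5, "array": 2.5},
    "checksum":     {"host_cpu": 2.0, "nic_dpu": 1.0, "drive_fpga": 1.0, "array": 1.5},
    "erasure":      {"host_cpu": 4.0, "nic_dpu": 1.5, "drive_fpga": 3.0, "array": 1.0},
}

placement = {stage: min(peers, key=peers.get) for stage, peers in costs.items()}
for stage, peer in placement.items():
    print(f"{stage:12s} -> run on {peer}")
# Three years from now, when the economics change, only the cost table changes;
# the pipeline itself stays the same.
```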
We also wanted to attempt to follow
NVMe computational storage standards as best we could. Why? Because we
wanted to be a good citizen
and see what we could
do with that standard and tell them
where it might have
some needs.
That was one of the ideas
was let's inform that emerging
standard. Then also, the
solution needed to be broadly available.
That's why we chose ZFS
because ZFS is a pretty popular thing. So we wouldn't be the only people that could
benefit from it. On the computational storage benefits and opportunities:
one thing it could do is give us higher compression ratios. Why?
Because before, we weren't willing
to spend an extra 10 or 20 or 50
or however many Intel machines
to do that extra compression.
We'd just do the lightweight compression
instead of the heavyweight compression
and use fewer Intel servers.
In this case, we think we can probably
get more compression
by offloading it to a device that's more appropriate for doing compression.
And 1.06 to 1 up to 1.3 to 1 - those aren't astounding numbers.
You know, people talk about compression.
They talk about 2 to 1 or 3 to 1 or 4 to 1 or even 10 to 1.
We don't get that.
We can't get that, because ours are all floating-point numbers.
They're almost random; there's high entropy.
If there were compressible data,
that would mean there was compressible computation,
and we already pressed all that out 20 years ago.
So you can't compress our data.
There's a nice data set on the web
that we put out there that says:
if you can compress this, come talk to us.
If you can't, leave us alone. And it's interesting - the first part of the
file, they can compress a little bit, and they go, oh, we're going to win big. And then they
get a little further into the file, the entropy goes crazy, and all of a sudden they sometimes even go negative.
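A quick way to see the entropy argument for yourself is to compress synthetic data with a stock codec. This little sketch uses my own stand-in data, not the LANL data set; the ratios will vary by machine and codec level, but the order of magnitude is the point.

```python
# High-entropy floating-point output barely compresses; a low-entropy
# ("uninteresting") region compresses enormously. Stand-in data only.
import array
import random
import zlib

random.seed(0)
n = 500_000

# High-entropy data: random doubles standing in for simulation output
noisy = array.array("d", (random.random() for _ in range(n))).tobytes()

# Low-entropy data: a region where every cell holds the same ambient value
flat = array.array("d", [1.0] * n).tobytes()

for name, buf in (("random doubles", noisy), ("constant field", flat)):
    ratio = len(buf) / len(zlib.compress(buf, 6))
    print(f"{name:14s}: {ratio:8.2f} : 1")
```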
So anyway, enabling expensive encoding is another thing we wanted to do.
We worry a lot about data loss at our scale,
and there are correlated failures and random failures.
Random failures are fairly easy to cover with
simple erasure, but when you have correlated failures,
trying to protect yourself from them is pretty hard.
What do I mean by a correlated failure?
Maybe a row of the data center got too cold for a while,
and it caused more failures in that row than any other row.
Well, it's happened to us.
We had a 400-disk-loss day.
It wasn't a fun day, but it was manageable
because we anticipated that things like this would happen.
In fact, we're our own worst enemy. We write stuff in parallel.
We write to 30,000 disk drives and we beat them up
mercilessly for 20 minutes in exactly the same way. So are they going to fail
randomly? Probably not. We cause them to fail in correlated ways.
So correlated failures are a thing for us, and they're probably a thing for you
if you're at any scale at all.
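Here is a deliberately toy Monte Carlo sketch of why the correlated case is the scary one. The group width, parity, and failure counts are invented and much smaller than anything real, but they show how the same number of failed drives is far more dangerous when the failures cluster inside failure domains that line up with erasure groups.

```python
# Toy model: 1000 erasure groups of 10 drives, each tolerating 2 failures.
# Compare 100 independently random failed drives against 100 failures that
# land in one contiguous slice of the floor ("one row got too cold").
import random

random.seed(1)
GROUPS, WIDTH, PARITY = 1000, 10, 2
TOTAL = GROUPS * WIDTH
FAILED = 100
TRIALS = 2000

def lost(failed_drives):
    """Data loss if any group has more failures than its parity can cover."""
    per_group = [0] * GROUPS
    for d in failed_drives:
        per_group[d // WIDTH] += 1
    return any(count > PARITY for count in per_group)

# Random: failed drives scattered uniformly across the whole floor
random_loss = sum(lost(random.sample(range(TOTAL), FAILED)) for _ in range(TRIALS))

# Correlated: the failures land in one contiguous slice at a random offset
def correlated_sample():
    start = random.randrange(TOTAL - FAILED)
    return range(start, start + FAILED)

corr_loss = sum(lost(correlated_sample()) for _ in range(TRIALS))

print(f"random failures:     data loss in {100 * random_loss / TRIALS:.1f}% of trials")
print(f"correlated failures: data loss in {100 * corr_loss / TRIALS:.1f}% of trials")
```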
Per-server and per-device bandwidths could be higher just because of that.
You're spending fewer dollars on Intel servers
and more dollars on the flash devices themselves,
because of the curves in the previous slide,
which will cause the cost of servers to go down,
or maybe you just use fewer of them.
And the other thing that we're pretty proud of is we wanted to enable something other than a block ecosystem. And this kind of does this, right? Even though NVMe is a block thing, we've actually
developed a distributed computing model over NVMe, right? We're taking piece parts and we're putting
them on different places
and running them as if it were a distributed process, which it is, which we recognize because
we have all kinds of distributed processes in the supercomputers. And so that was kind of important
is to try to push this concept of let's get people thinking about computational storage more as a
distributed computing paradigm and less as just an extension of a block protocol. And we pushed that concept pretty hard, and I think we got there.
So here's how it works - and you'll actually hear a talk about this today; I think
there's just one talk about it. At the top, you see a classical ZFS write operation.
You call write, the data gets copied out of user space down into the kernel,
and then the kernel says, okay, we're going to compress it, and we're going to checksum it,
and we're going to RAID-Z it, and we're going to issue an I/O,
and it's going to go off to an NVMe device - and that was all on the host.
And we said, okay, how are we going to do offloads? Well, what do we want to offload? Compression, checksum, erasure - all those things
that require a lot of memory bandwidth. And how are we going to do that without moving a lot
of data around, back and forth to the computational device? Well, the first thing you have to do is allocate memory on the storage device
or on the computational storage device.
And so you call malloc, sort of.
It's not really a malloc.
It's a remote malloc.
And you're mallocing space to copy data down to,
and you're going to leave it there.
So you copy it down to the computational storage device,
and it sits there.
And then the control returns back to ZFS.
And then ZFS says, you know what?
We should compress it.
Hey, computational storage device,
you're the right person for this job.
You go allocate some more memory
and compress the thing and move it into that buffer.
And then the same thing happens
for checksum and RAID-Z and so forth.
And each one of those pieces
actually could occur on a different device.
So if it were profitable, we could just use peering to move the data to a different peer and have that device go and do it.
And so that was the idea was let's offload some of these functions.
Let's do it in a way that looks an awful lot like RPC slash distributed programming concepts.
Let's not be religious about where we run things.
And let's try to do it in a way that's pretty extensible.
And so we at LANL have a ZFS developer on site,
and we developed a library where you can register to do these things for ZFS - or any storage system, actually. You can
say, hey, I'm really good at doing compression, send me compression stuff. So you put devices
in there and you register the devices to do these things. And then there's a version of ZFS that
calls into that library to get these functions implemented. And that's all available.
And ask us about it.
We can tell you about it.
But one of the talks today is going to be about that.
What did we do to ZFS?
And what does this library look like in the kernel that allows you to register computational storage services?
Anyway, so that's how it was done.
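The following is a hypothetical, much-simplified sketch of that registration-plus-pipeline idea, written in Python for readability; the class and function names are made up, and the real thing lives in the kernel underneath ZFS. What it tries to capture is the one-time copy into device-resident memory, the per-service provider registry, and stages that hand device-side buffer handles to one another instead of bouncing data back through the host.

```python
import zlib

class OffloadProvider:
    """Stands in for a computational storage device reachable over NVMe."""
    def __init__(self, name, services):
        self.name, self.services = name, set(services)
        self._buffers = {}              # handle -> bytes held "on the device"
        self._next_handle = 0

    def alloc_and_copy(self, data: bytes) -> int:
        """Allocate a device-side buffer (the 'remote malloc') and fill it."""
        handle, self._next_handle = self._next_handle, self._next_handle + 1
        self._buffers[handle] = data
        return handle

    def run(self, service: str, handle: int) -> int:
        """Run one stage device-side; the result stays device-resident."""
        assert service in self.services
        data = self._buffers[handle]
        if service == "compress":
            out = zlib.compress(data)
        elif service == "checksum":
            out = data + zlib.crc32(data).to_bytes(4, "little")
        else:                           # "raidz" stand-in: parity would go here
            out = data
        return self.alloc_and_copy(out)

registry = {}                           # service name -> provider registered for it
def register(provider):
    for service in provider.services:
        registry.setdefault(service, provider)

def offloaded_write(data: bytes, pipeline=("compress", "checksum", "raidz")):
    dev = registry[pipeline[0]]         # assume one device serves every stage here
    handle = dev.alloc_and_copy(data)   # data lands on the device once
    for stage in pipeline:
        handle = dev.run(stage, handle) # could hop to a peer if that were cheaper
    return dev.name, handle             # the I/O is then issued from device memory

register(OffloadProvider("abof-dpu", {"compress", "checksum", "raidz"}))
print(offloaded_write(b"x" * 4096))
```

Because each stage only sees a buffer handle, the same pipeline could in principle hop between peers - NIC DPU, drive, array - if one of them ever became the cheaper place to run a stage.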
The performance win was really, really good.
The offload was wonderful.
And so there was this big press release.
Some of you may have seen it.
And it was about this project.
And essentially, it was, you know, essentially what I just talked about.
It was computational storage.
ZFS offloaded.
We called it an ABOF, an accelerated BOF. And it offloads this popular
file system called ZFS, all the memory bandwidth intensive functions, and all through NVMe.
And so here's the partners. The partners really did all the hard work. I mean, we did some of
the hard work with modifying ZFS, but the partners really carried a lot of the load here.
And we had some wonderful partners.
Eideticom has this NoLoad software,
and the whole idea behind it is you're not religious
about where you run stuff, and it's NVMe-based.
Aeon is an interesting company.
They are wonderful.
They build these wonderful, nicely balanced enclosures,
so you get the right amount of PCIe in and the right amount of PCIe out and so forth.
It's amazing to me how many servers you can buy that don't have balanced PCIe, or have more PCIe switches than they actually have bandwidth for, or things like that.
So we worked with Aeon to give us enclosures that actually are balanced.
NVIDIA's BlueField technology was used as the
NIC in the BOF, and the DPU on it
was used as one of the targets for running things.
We used SK Hynix Flash, and SK Hynix was a wonderful partner here for supporting
us, and then LANL did this ZFS offload
function. And so that's how it worked.
It really worked really well, and hopefully the NVMe Computational Storage Committee
will get some value out of learning how we did this,
and this is an example application of something you might want to offload.
There's profit in offloading it in memory bandwidth savings,
and it was an excellent partnership.
In fact, the theme for this whole talk should be, you know,
you get stuff done by partnering.
And these were great partners.
So I'm going to move on to the next thing.
The first one was format agnostic;
this one is format aware.
And the first part we're going to do with format aware
is record-oriented format aware.
So that's what this is going to be:
a talk about several things
we have done and are doing
in that space.
But first I have to go back
and talk about the prereq
for this work.
And the prereq for this work
was called DeltaFS.
It was done at Carnegie Mellon University.
And it turned out to be a top paper
at supercomputing for a student. And the student is sitting right there in the back of the room.
And so anyway, it leverages that work. And let me tell you a little bit about that work.
So first of all, it's important that you understand the application a little bit to
understand why this is important. So we have this thing called a vector particle-in-cell code.
Particle-in-cell means if I've got 100,000 cores,
then I have a million or so cells spread out across those 100,000 cores,
or maybe 10 million or something like that.
And then I've got a trillion particles,
and the particles flow around in those cells.
So based on something that's happened, some event that caused a high-energy thing to happen,
these particles are flowing all over the place because they're hitting things
and bouncing off of one another and all sorts of physical fun things to do. And essentially they're moving around. So every time step particles are moving
around all over the place. And this is actually a plasma. And so the particles are moving really,
really fast. And some of them are super energetic and some of them aren't and so forth. And so
anyway, you run this application. It's a fixed mesh, so it's really pretty simple.
This mesh is fixed across all of the processor's memories,
and the particles are flowing around in them.
And the particles are really tiny in terms of data.
It's like 24 bytes or 32 bytes.
There's a particle ID, which of course has to go up to a trillion,
so it's a 64-bit number.
But then it's got an energy field, and what cell it's in, and so forth. So the records are really
tiny - they're like 48-byte records or something - but there's a trillion of them. So essentially,
I've got a trillion 48-byte records that I want to do analytics on. And it's a trillion every time step.
And so every time this thing runs and particles move,
it stops and it says, okay, write out these trillion.
And it runs a little longer and it stops and it writes out these trillion and so forth.
And so at the end, it's a large number of these trillion-particle dumps -
many trillions of records that you're dealing with.
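For a concrete sense of scale, here is a hedged sketch of what one of those tiny records might look like; the exact fields and sizes are illustrative guesses, not the application's actual layout.

```python
# Hypothetical particle record: a 64-bit ID plus a handful of floats packs
# into a few tens of bytes, and there's a trillion of them per dump.
import struct

PARTICLE = struct.Struct("<Q f f f f f f")   # id, energy, x, y, z, weight, cell
print(PARTICLE.size, "bytes per record")     # 32 bytes with this layout

rec = PARTICLE.pack(123_456_789_012, 9.8, 0.1, 0.2, 0.3, 1.0, 42.0)
pid, energy, *rest = PARTICLE.unpack(rec)

# Per-timestep volume at a trillion records of this size:
print(f"{1e12 * PARTICLE.size / 1e12:.0f} TB written per dump at this record size")
```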
So it's kind of the edge case for record-oriented
processing, right? Tiny little records and a whole lot of them being spewed out of 10,000 machines
and a million cores. At the end of this large run, the scientists will say - magically, though not really
magically, I'll get into how they do this in a minute - that there are a thousand of those trillion particles that are super interesting. They're
so hot or so energetic that they'd really like to know everywhere they've been. So you've run this
thing for a long time, and then they want a particle trace - which cells each of those particles
was in over time.
And so the question is, how do you do this?
How are you going to do this?
How do records help you?
And why offload?
Why is offload a big deal for this?
The answer is because the guy only cares about a thousand particles out of a trillion.
The query is going to retrieve nine orders of magnitude less data than was written -
roughly, a trillion 48-byte records is about 48 terabytes per dump, while a thousand of them is about 48 kilobytes.
If there was ever an opportunity to offload all the way to the storage device, this is it. This is
the poster child for it. And so what was done is that we developed this solution
in software, and this is what the student did.
He's back in the back of the room.
We essentially ran 10,000-ish copies
of LevelDB or RocksDB all in parallel.
We sharded up the key space.
And what's the key?
The key is the particle ID.
And particle IDs are so simple,
they shard trivially.
Shard A has particle IDs zero to something, and shard B has particle
IDs something to something bigger, and so forth.
Super trivial. And then what happens is, every time step, this application writes out
all the particles and shards them up and sends them to the right place. Of course,
they're coming from all over the place. So these particles are just jamming in, all trillion of
them, going to 10,000 shards. And some of them came from here, but the next time those things
may be over here because they moved and so forth. And so these things are just sifting in.
And so the first problem we had was, how the heck are you going to do that? That's a trillion RPCs you're doing every time you turn around. That's
an awful lot of network RPCs, even on a $50 million network. That's a lot of RPCs. And so
we did this thing called input. And so instead of doing a single record, everyone batches up
the things that are going to the right shards, makes nice multi-record
RPCs, and sends those to the right shards. We got two or three or four orders of magnitude
improvement just because of that. And the idea here was that you don't want to add any time to the
write. So that graph in the middle is the write time, and the dark blue is the overhead we took
for putting records into many, many key value stores, versus just flattening the data out and writing it as files
as big as we can. The application is penalized almost none to get its data written out in
record form as opposed to squashing it out and writing it in byte-stream form.
And if you write it out in record form, you can actually query it. So that's the idea: don't
spend any extra time, or only a little bit of extra time, writing it, so that you get this huge win on read.
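A minimal sketch of that shard-and-batch idea is below; the shard count, batch size, and the "RPC" are all stand-ins, but the structure - route by particle-ID range, accumulate per-shard buffers, ship multi-record messages - is the point.

```python
# Route each record to its shard by particle-ID range; buffer per shard so one
# network message carries many records instead of one. All parameters invented.
SHARDS = 10_000
MAX_ID = 10**12
IDS_PER_SHARD = MAX_ID // SHARDS

def shard_of(particle_id: int) -> int:
    return min(particle_id // IDS_PER_SHARD, SHARDS - 1)

class BatchingWriter:
    def __init__(self, batch_records=4096, send=print):
        self.batch_records = batch_records
        self.send = send                     # stand-in for the multi-record RPC
        self.pending = {}                    # shard -> list of (key, value)

    def put(self, particle_id: int, record: bytes):
        shard = shard_of(particle_id)
        buf = self.pending.setdefault(shard, [])
        buf.append((particle_id, record))
        if len(buf) >= self.batch_records:
            self.flush(shard)

    def flush(self, shard: int):
        batch = self.pending.pop(shard, [])
        if batch:
            self.send(f"shard {shard}: multi-record put of {len(batch)} records")

    def flush_all(self):
        for shard in list(self.pending):
            self.flush(shard)

w = BatchingWriter(batch_records=3)
for pid in (1, 2, 3, 999_999_999_999, 5):
    w.put(pid, b"...")
w.flush_all()
```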
And actually, the win on read is enormous.
It's a factor of 1,000 or something.
Before, we would have had to have read in all tens of petabytes,
many, many, many, many trillions of records, and sift through them every time step,
finding these 1,000 particles.
And now we just go ask a key value store:
give me all the places that
this particle has been. And so it's really, really fast, and you can see
that the win is huge. And it's interesting - the query time
stays almost flat no matter how many particles you have. You could go to 10
trillion particles and you'd still take basically the same amount of time,
because the win is so big.
And so anyway, this is what was done.
And the reason it was the best paper at Supercomputing and so forth
is because the numbers are really enormous.
We got 8 billion particle inserts per second.
Not 8 million, 8 billion.
It's pretty big, right?
And so anyway, this is a pretty cool
thing. And we said, this win is so big, it's obvious we should push this down out of software,
put it near the storage devices. If there ever was a poster child for this, this is it.
And so that's what we did. We decided to start working with a couple of companies on doing this. And the first one that I'll talk to you about is Pavilion.
So Pavilion has an NVMe over Fabric storage array,
and they happen to have a pretty smart storage controller.
And so the question was,
could we stick a fully ordered key value store
onto that storage array and do this at the array level instead of in software
in our servers. And so we worked with them. They're a really cool company and they put this stuff down
on their storage array. And we got super excellent performance because we weren't moving the data
through hosts. It was being done all on the storage device. And they actually used extensions on the SNIA KV over NVMe standard.
And the extensions are input and get and other kinds of things that we needed.
And that worked out really, really well.
And we're working with them in that space still.
And please ask us for our semantic extensions to the KV.
We actually wrote a document that said, here are the things we want out of a KV. It's not just
get, put, delete, and compact - it's input and get and control, and a few other things we needed
to have added. We're happy to share that with anybody. I think it'd be useful for everybody
that's working in the key value store space or contemplating it, because
there are some really nice additions to key value stores there. The next one I want to talk to you about
is our effort with SK hynix. So SK hynix has been a super good partner, and
it's kind of been interesting. It's like two giants trying to
dance with one another at first, right? I mean, SK is this huge giant, and Los Alamos and
DOE is this big, huge giant of bureaucracy, and trying to figure out how to work with one another
took a while. But man, has it been worth it. I think we ended up kind of trying to figure out
how to challenge one another. And it's been super.
This relationship has been really great.
At any rate, the question that we asked them was, I mean, they said challenge us.
So we did.
We said, you know, we've taken this key value thing and put it in software.
We've taken this key value thing and we've put it at the array level.
Could you take it all the way to the storage device? Could you push it all the way to a flash device and get it in or near or almost in a
storage device? And that's a pretty tall order if you think about the way key value stores and
log-structured merge trees work. So they did that. In fact, they pushed it all the way into an FPGA
sitting on top of a storage device.
In fact, that storage device there - that's a flash drive, right?
The long thing is a flash drive,
and that's a card with an FPGA on it.
And this is all over NVMe.
And so they actually push this functionality all the way to a storage device.
This was demonstrated at Flash Memory Summit a couple of months ago or something,
and there were two talks about it.
And it's really pretty cool.
In fact, it was on the floor,
and it was actually one of the most popular demos on the floor.
We made a little demo where you could go up
and guess a particle energy,
and it would give you the trace for it.
It was kind of fun, and they gave away
a bunch of stuff - if you guessed right, you got a ball or something like that. Anyway,
the performance is pretty crazy good, and this has been a great partnership. And it's a fully
ordered key value store, so it's not like the early key value stores that were released by Samsung that were hash-based.
With a hash-based store, you have to know the key exactly to get the value.
This is an actual ordered key value store:
I don't have to know the exact key - I can punch into the index and iterate forward, and you give me all the ones that are next.
Or I can punch into the index and go backward, and you give me all the ones that come before.
That's what I mean by an ordered key value store.
And this is a full-blown thing.
So anyway, this was pretty cool.
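To illustrate what "fully ordered" buys you, here is a tiny sketch that uses a sorted in-memory index as a stand-in for the on-device LSM tree; the API names are invented, but the seek-then-iterate behavior is exactly what a hash-based store cannot give you.

```python
# Ordered key-value sketch: seek to a key you may not know exactly, then
# iterate forward or backward from there. Sorted lists stand in for the LSM tree.
import bisect

class OrderedKV:
    def __init__(self):
        self.keys, self.vals = [], []

    def put(self, key, val):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.vals[i] = val
        else:
            self.keys.insert(i, key)
            self.vals.insert(i, val)

    def scan_forward(self, start_key, count):
        i = bisect.bisect_left(self.keys, start_key)
        return list(zip(self.keys[i:i + count], self.vals[i:i + count]))

    def scan_backward(self, start_key, count):
        i = bisect.bisect_right(self.keys, start_key)
        lo = max(0, i - count)
        return list(zip(self.keys[lo:i], self.vals[lo:i]))[::-1]

kv = OrderedKV()
for pid in (10, 42, 43, 44, 97):
    kv.put(pid, f"trace-of-{pid}")

# "Punch into the index and iterate forward" -- a hash-based store can't do this.
print(kv.scan_forward(42, 3))    # keys 42, 43, 44
print(kv.scan_backward(44, 2))   # keys 44, 43
```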
And we're continuing on working with them on a few other things,
but this was really neat.
Let's talk about ABOF 2 plans for a little bit now,
which gets into columns.
So we did this agnostic stuff with files, and then we did this
really cool stuff with records. What could we do with columns, and why would we want to do it with
columns? Well, columnar is pretty popular right now. There are all these columnar databases and
all this columnar technology - Apache Drill and DuckDB and all kinds of fun stuff for doing analytics.
Columns have their pros, right? You can compress each column differently, you can add columns trivially. There's nice stuff about columns, and of course there's bad stuff about columns.
The question we asked ourselves was, do we have applications that could use columnar
kinds of treatment? And the answer is yes, in spades. So we have a bunch of applications that
use these things called grid methods. And a grid method is where you break up the thing you're
physically simulating into a grid, and you have a set of values per grid point.
The values you have are material properties - things like the pressure and the temperature
and the momentum and the energy levels - at each point in the mesh in the simulation.
And there are these fancy grid methods like Lagrangian and Eulerian.
Lagrangian is where the mesh deforms. Look at the black and white figure: the
mesh is deforming itself as the simulation runs. But then you have Eulerian
meshes, like the lower one on the right, where the mesh doesn't deform; it's just the
stuff inside of it that changes. And then you have this stuff on top of that called adaptive
mesh refinement. And adaptive mesh refinement is really cool. It's this:
see this bottom right, all the squares aren't the same size. Why aren't they the same size?
Because there's a lot more stuff going on where the little ones are than the big ones. Why do we
do this? We do this because we want to run a problem
that's two orders of magnitude bigger than the computer we buy.
And you might say, Gary, you bought a stupid two petabyte supercomputer.
Why can't you fit your problem into it?
The answer is we can't.
Nowhere close.
We'll never be able to.
The scientists will always win.
We will never win.
We can only try to keep up with them.
And so that's what's going on here. So adaptive mesh refinement is a way to run a huge problem on a tiny supercomputer. And the way you do it is you use this adaptive mesh refinement so you
don't care too much about the part where the fun stuff isn't happening and you care an awful lot about where the fun stuff is. And the fun stuff, of course,
is moving. It's a shockwave going through a material or whatever it is you're simulating.
This is really common in science. This is done a lot.
And it
has its interesting issues. The issue it has is,
how do you access memory in this space?
It's ugly.
Stuff is moving around all the time.
You've got to use pointers to get all your data
because there's this layer of software
that's moving the mesh around
based on heuristics and things like that.
But what's cool about it is -
and when I said that we've eliminated
the ability to get compression, this is why -
if we find that that mesh is compressible, that means we messed up.
AMR should have pushed all that uninteresting low entropy crap out of the way and got to the part that really needs to be drilled down on where all the entropy is.
And so if we ever do find compressible data at our laboratory,
we go get the scientist and say, you messed up. You could have run a problem way bigger than this.
Go fix it, right? And so compression, dedupe, none of that stuff works because of this. It's
already been pressed out. The computational method itself has said, get rid of it. That's opportunity.
So anyway, this is what we do. And these grid methods are interesting because remember I said inside of each one of those little cells, there's pressure and temperature
and all these values. So
these applications think of themselves as a set of distributed arrays, say 50 to 100 distributed arrays with the number of array values equal to all the cells in the mesh.
And so that's pretty cool.
Essentially, they are columnar applications.
There's a column of pressure and a column of temperature and a column of whatever else.
And this is depicting multiple time steps of an application - time step one, two, three, four,
five, six - and you can see that it's AMR'd, so the fun stuff has
got finer squares, and the rest has bigger squares.
And so how do you turn this into something columnar?
How we do this is: each one of those middle boxes is a machine,
a computer, with, say, 64 cores,
so there are 64 boxes.
And you see the lines that connect those things together.
The lines that connect them are a Hilbert space-filling curve.
Hilbert figured out how to fill space by connecting things together,
so essentially there is a serialized order of how things are laid out in the mesh using Hilbert.
And it's spread across all million cores of the machine and all trillion cells using that Hilbert curve,
and it's one serial array. The way we write the data out is in Hilbert order: we write all the pressures
in Hilbert order, then we write all the temperatures in Hilbert order, and then we write
everything else in Hilbert order. So this thing is a million cores all writing to one part
of the file to get it all in order - in Hilbert order.
And why do they do that? They do that because they want to be able to restart the application
with a different number of processors - they can just go in and subdivide these ranges
and start the application with an arbitrary number of processors.
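For the curious, here is a minimal 2-D sketch of the Hilbert ordering idea, using the standard bit-twiddling mapping; the real codes do this in three dimensions across a trillion cells. Cells that are close in space end up close in the serialized order, which is what makes carving the file into contiguous ranges yield spatially compact sub-domains.

```python
def xy2d(n, x, y):
    """Map (x, y) on an n x n grid (n a power of two) to its Hilbert-curve index."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # rotate/flip the quadrant so the next level looks canonical
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Serialize an 8x8 mesh in Hilbert order: neighboring entries stay neighbors in space.
n = 8
cells = sorted(((x, y) for x in range(n) for y in range(n)),
               key=lambda c: xy2d(n, *c))
print(cells[:8])   # the first few cells of the traversal
```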
So at any rate, we were sitting on a goldmine of columnar applications. We didn't realize it, but here we are.
We've got all these columnar applications.
What could we do with them?
We could use columnar techniques
like DuckDB and Apache Drill.
All we really need to do is take this columnar format,
add some indices after each column,
and presto - you've got essentially what looks like a Parquet file.
It's a parallel Parquet file spread across thousands of devices, but it's a Parquet file.
And so the question we're asking ourselves is, can we use standard columnar analytics techniques
to do analytics on these very, very large data sets that happen to be columnar in
nature? That's where we're headed with this ABOF 2 demo: to try to push this indexing
and retrieval technology, using columnar functions, down to the computational storage devices so that
it doesn't run in our software on hosts anymore - it's pushed down to storage devices. So that's where the ABOF 2 work is starting to happen.
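Here is a toy sketch of the "columns plus a small index per column" idea - the format, chunk size, and data are all invented - showing the kind of min/max pruning you would like the storage device itself to perform.

```python
# Columnar chunks with a tiny (min, max) index per chunk; a query only decodes
# chunks whose index says they can match the predicate. Stand-in data only.
import array
import random

random.seed(0)
CHUNK = 1024
N = 16 * CHUNK

# A smooth "pressure" column: mostly boring, with the interesting values at the end
pressure = [100.0 * i / N + random.uniform(-0.5, 0.5) for i in range(N)]

# "Write" the column in chunks and keep a footer index of (min, max) per chunk
chunks, index = [], []
for i in range(0, N, CHUNK):
    col = array.array("d", pressure[i:i + CHUNK])
    chunks.append(col.tobytes())
    index.append((min(col), max(col)))

# Query: which cells have pressure > 95? Skip chunks that cannot match.
threshold, hits, chunks_read = 95.0, [], 0
for chunk_no, (lo, hi) in enumerate(index):
    if hi <= threshold:
        continue                          # predicate pushdown: skip this chunk
    chunks_read += 1
    vals = array.array("d")
    vals.frombytes(chunks[chunk_no])
    hits.extend((chunk_no * CHUNK + j, v) for j, v in enumerate(vals) if v > threshold)

print(f"decoded {chunks_read} of {len(chunks)} chunks, found {len(hits)} matching cells")
```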
And then the next question is,
if you could do it to flash an ABOF set of flash devices,
could you do it to disks?
So we have this really cool partnership going with Seagate
to try to push this stuff all the way down
to an individual disk drive.
And interestingly enough,
not just to an individual disk drive,
but to an individual disk drive
that's part of an erasure group
that's part of a set of parallel erasure groups.
And so we're trying to figure out
how to push analytics through the erasure level
all the way to the disk device
so the disk device can do it
in parallel. And you're going to see a talk about this tomorrow, here at this
conference, on how it was done. This has been demonstrated, and it's pretty cool. On related talks:
there have been some related talks on all this stuff. We had two talks at FMS - our friends at SK and us
gave talks about that cool KV-CSD -
and there was a talk about the ABOF at FMS.
Then there's all these talks
at SDC that I've mentioned.
And there's a talk at Supercomputing
on GUFI, which is our indexing technology.
And let me just leave you
with some food for thought.
We're working with Fungible, a company that actually gave a talk yesterday, I think, on
scalable NVMe over fabric endpoint state management.
And what do I mean by that?
Well, you've got 10,000 machines that may all talk to the same storage device.
How will it deal with all the state required to do that?
That seems pretty tough.
So we're working in that space. We actually are working on a project to put SPDK on top of MPI
so that we can test SPDK scalability at incredible scales, like millions. That's a work in progress.
We're looking for some lightweight security solutions for NVMe over Fabric directly from clients without kernels.
And we're also tinkering around with ideas on how you route NVMe over Fabrics between different kinds of network media.
Because we've done that before with Lustre.
There's this thing called LNet whose whole job in life is to route RDMA between different network media.
And I think that's it.
Caveat, everything in here was harder than I made it sound.
I'm a manager, so I just wave my arms
so that other people in the room do the work.
Thanks for your time.
Thanks for listening.
For additional information on the material presented in this podcast,
be sure to check out our educational library at snia.org slash library.
To learn more about the Storage Developer Conference, visit storagedeveloper.org.