Grey Beards on Systems - 106: Greybeards talk Intel’s new HPC file system with Kelsey Prantis, Senior Software Eng. Manager, Intel
Episode Date: September 17, 2020. We had talked with Intel at Storage Field Day 20 (SFD20) about a month ago. At the virtual event, Intel's focus was on their Optane PMEM (persistent memory) technology. Kelsey Prantis (@kelseyprantis), Senior Software Engineering Manager at Intel, was on the show and gave an introduction to Intel's DAOS (Distributed Asynchronous Object Storage, daos.io), a new …
Transcript
Hey everybody, Ray Lucchesi here with Matt Lieb.
Welcome to another episode of the Greybeards on Storage podcast, a show where we get Greybeards
bloggers together with storage and system vendors to discuss upcoming products, technologies,
and trends affecting the data center today.
This Greybeards on Storage episode was recorded September 11, 2020.
We have with us here today Kelsey Prantis, Senior Software Engineering Manager at Intel.
So Kelsey, why don't you tell us a little bit about yourself and Intel's DAOS product.
Hi, yeah, I'm so happy to be here. Thank you for having me.
So I entered the HPC storage business about
eight years ago. I was originally part of a startup that was called Whamcloud. Many of you
may be familiar with them. We worked on the Lustre product and then were later acquired by Intel.
And while we were at Intel, we started taking a look at what could be done with new
hardware technologies that were being developed at the time, right? NVMe was sort of off on the
horizon at that point, but it was clearly going to be, you know, pretty transformational to storage.
So we began taking a look at what was going to be involved with incorporating that into
the future of storage and what that might look like. We actually originally started it at a
position where we were looking at how we could modify Lustre itself to be able to take advantage
of these technologies. But as we really dived into what these new hardware capabilities were going to bring us, we realized that there was actually a need for a whole new look at how storage was architected for the new capabilities they brought along.
So that's sort of where the birth of DAOS came from, as we sort of splintered off from that group here inside of Intel
and started working on what we felt was the next generation of storage.
And DAOS is a fully supported Intel product.
Is it something that customers purchase or is it open source
and support is available or how does that work out?
So DAOS, at the moment, is completely open source.
We do all of our development out in the open. You can find DAOS out on GitHub and work together with our community there.
The commercial support offering is something that is still under consideration today and
still being worked out with us and our partners on exactly how that's
going to be offered. Of course, you know, commercial support is going to be absolutely
necessary to support any new open source technology, but exactly what that's going to
look like is something that we're figuring out currently still. And when you say like the new
hardware Intel is coming out with, I mean, besides the multi-core CPUs and NVMe and that sort of stuff,
are we talking about things like persistent memory,
Optane persistent memory and Optane SSDs as well?
Yeah, so the initial investigation was looking at all of these different technologies and what they could bring.
The one that we really honed in on in the end that we thought was really transformational, though, was the Intel persistent memory.
That brought some very new and unique opportunities for the storage system, you know, sort of leading up to this point, all of the different storage
media that we would use for storage, you know, is sort of fundamentally based on block-based IO.
And you end up with a certain amount of locking when accessing the different blocks, and that becomes a performance bottleneck for that media. But Intel Optane Persistent Memory was really a whole new form factor where you could have, you know, byte-granular, fine-grained access to your storage media, instead of trying to figure out how you can pack your data together in these larger blocks. And what that allows is a lot more parallel access to your data. And, you know, more parallelization and less serialization, of course, means a much more performant outcome. So that's really what we built DAOS on top of.
The way DAOS works is we put all of our metadata and our small file IO, and by small file IO, I mean IOs that are generally about 4K and below, and all of those get stored directly on the Intel Optane Persistent Memory DIMMs. And then at the other end, for your larger IOs and your bulk data transfers that are already more block friendly, those DAOS sends to your NVMe SSDs, and you can use either 3D NAND or Optane SSDs for this. So doing this split allows us to find a good balance of really being able to take advantage of the performance that we can have with the byte-granular access of the persistent memory, but still keep our TCO in an attractive space.
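To make that split concrete, here's a minimal sketch of the routing rule Kelsey describes, with small IOs of about 4K and below landing in persistent memory and larger transfers going to the NVMe SSDs. It's purely illustrative Python; the class and names are hypothetical, not the DAOS API.

```python
# Illustrative sketch of the media split described above: metadata and
# small IOs (~4K and below) go to persistent memory, bulk IO to NVMe SSDs.
# Class and attribute names are hypothetical, not the real DAOS API.

SMALL_IO_THRESHOLD = 4 * 1024  # ~4 KiB

class StorageNode:
    def __init__(self):
        self.pmem = {}   # byte-granular Optane PMEM: metadata + small file IO
        self.nvme = {}   # block-friendly NVMe SSDs: larger, bulk transfers

    def write(self, key: str, payload: bytes) -> str:
        if len(payload) <= SMALL_IO_THRESHOLD:
            self.pmem[key] = payload   # fine-grained write, no block packing
            return "pmem"
        self.nvme[key] = payload       # already block friendly
        return "nvme"

node = StorageNode()
assert node.write("meta/inode42", b"\x00" * 256) == "pmem"
assert node.write("data/chunk0", b"\x00" * (1 << 20)) == "nvme"
```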
Now, though, if you do want maximum performance,
you can have actually an all persistent memory
DAOS system as well,
but the inverse is not the case.
You have to have the persistent memory.
When you target this build, is it a standard x86 server? Most of my recollection is that the HPC stuff is sitting on RISC.
Yeah, so we're actually just using standard COTS servers with DAOS. You're just using your standard Xeon processors, which have support for the persistent memory DIMMs.
We handle our data protection a little differently than some of the existing technologies today
that required a lot of different custom hardware and things like that.
We use replication and, coming soon, erasure coding as well.
N-way replication allows you to, you know,
spread your data as replicas across your different nodes.
And that's how we manage your data protection rather than specialized
server platforms and specialized dual ported storage.
And this really allows our different OEM partners a lot more flexibility
in designing these platforms to be able to bring it to market.
That's interesting.
So you mentioned that the small file IO is done,
cached I guess is the word I would use, in PMEM,
and then the metadata is also
completely maintained in PMEM. This is a big change from what normal
HPC file systems had in the past. I mean, a lot of these systems had special metadata services,
servers, and things of that nature. Do you have a special metadata server in DAOS
or is it all pretty much a converged solution?
Yeah, we don't have any specialized servers in DAOS.
All of your DAOS servers are effectively of the same importance
and all of them have the different metadata.
So there's no point of bottleneck then that you have
in trying to access
your metadata because it's spread across your entire cluster. Though I will pick a nit there: I think one of the things that makes it even more transformational is it's not just using the persistent memory as a cache, right? There are some other technologies trying to take stabs at using DRAM as a cache. But when it comes to, you know, being able to measure performance, if something is a cache, at the end of the day, you still have to drain it to your disk before you've really saved that data.
But because Optane Persistent Memory is, it's right in the name, it's persistent, we actually use it as a first-class storage device, and your data will stay there unless you start to get to a place where you start to run out of space.
And then we have aggregation running in the background that will aggregate some of those small IOs into something more block-friendly and transfer it to the SSDs.
But it is a first-class storage device and not a cache, and that maybe seems like a subtle difference, but ends up having quite a different impact to the end result.
Most HPC environments seemed like they were large block, heavy bandwidth, consumptive solutions and less in the small file space.
Do you find that small file spaces, small files are starting to become more active?
Yes, yes. We find that there's always going to be those classic HPC workloads that do very well, focused on the bandwidth and don't need the small file I.O., but there are more and more.
I mean, there's always been some applications in the scientific space and things like that that have struggled with the existing systems, had small file IO, and been performance limited. But we're starting to see that more and more, right? As we're starting to see the continued growth
of AI and data analytics,
and they start to have more and more high performance needs,
they have very different access patterns to their data
than sort of just the, we'll call it the traditional HPC,
the modeling and simulation jobs.
And as these are coming into our data centers,
they're putting a lot of new demands
onto our storage systems. And it's not something that you even want to handle just by doing
different islands inside your cluster, because the workflows are already starting to get more
complex, right? We're already starting to see something like you may have, you know, one
workflow where let's say you have a weather modeling and simulation job, very traditional modeling and simulation job, but now you also want to add an AI component that's refining your results and doing some AI workloads to refine that and combine it.
If you had separate islands, you'd have to transfer your data between the islands.
That'd be a huge bottleneck.
So you really need something that can serve both.
It has the good small file IO, and it still does well in the bandwidth.
Right. We excel in the small file IO, but we still have a very healthy and competitive bandwidth at the same time.
Yeah, and these systems seem to have as many GPUs as they've got CPUs anymore, just to support these AI, ML, deep learning types of workloads and stuff like that. So I see those things starting to emerge as more and more important in the HPC space as well.
I did a seismic installation, Ray, for earthquake metrics, and the requirements,
not just to ingest the data,
but to process that data and come up with something useful,
as you can imagine in a,
in a seismic environment requires something practically immediate.
And it seems to me that the ability to handle both large and small file data sets
with the same sort of IO requirement is mission critical.
And it was always the bottleneck.
You're mentioning what's almost like a real-time solution to analyze seismic data.
That's pretty interesting.
We could do a whole podcast on that alone, Matt.
Or oil and gas is another one, sure.
Well, that's always what a lot of this comes down to as well, right?
A lot of these real-time applications and things, being able to be faster to your conclusions
can make real practical differences for your businesses.
Right. If you're talking financials and you're able to make better AI predictions and invest accordingly.
Right. That can be real money. Or if you're analyzing seismic data so that you can warn people sooner that there's going to be an earthquake where they are.
Those things, you know, are real tangible benefits to being able to have this better performance.
Right, right.
And so how much persistent memory?
So I guess the first question, are all the nodes homogeneous?
They all have to have the same amount of persistent memory and the same sorts of storage behind
it?
Or can they be heterogeneous?
Right now, the DAOS storage nodes are
homogeneous. You want to be able to do that so that you can have maximum
bandwidth to all your nodes. If you have some of your nodes and you know they
don't have as many DIMMs or things like that, that starts to affect the rate that
you can get your data to your media. We generally recommend there's a ratio between the persistent memory
and the NVMe SSDs per node, and then you replicate that across a number of identical nodes. So
you can tune it to your workload, right? And, you know, the DAOS engineering team can work with folks to tune it to their workload. But sort of the general rough estimate is you want about 6% as the ratio from your SSD capacity to your persistent memory, right? So if you take this is how much SSD you want, then 6% of that is how much persistent memory you want, so that you have enough room to fully utilize the system.
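As a quick worked example of that sizing guideline (the numbers are illustrative, not a validated configuration; the 6% figure is the rough rule of thumb from the discussion):

```python
# Worked example of the rough sizing rule from the discussion:
# persistent memory ~= 6% of SSD capacity, replicated across identical nodes.
# All numbers here are made up for illustration.

ssd_tb_per_node = 100.0            # NVMe SSD capacity per node (example)
pmem_ratio = 0.06                  # ~6% rule of thumb

pmem_tb_per_node = ssd_tb_per_node * pmem_ratio
print(f"PMEM per node: {pmem_tb_per_node:.1f} TB")        # -> 6.0 TB

nodes = 3                          # e.g. the minimal 3-node protected setup
print(f"Cluster totals: {nodes * ssd_tb_per_node:.0f} TB SSD, "
      f"{nodes * pmem_tb_per_node:.0f} TB PMEM")          # -> 300 TB / 18 TB
```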
Do you support normal HDDs as well, or is it just the SSD and PMEM?
No, we use SSDs. And I know that can be a controversial decision, but we think that what we're doing here is really trying to define the next generation of storage. For HDDs, there's already a lot of technologies out there, like Lustre, where they have spent a couple of decades perfecting that, and we just recommend folks use those. And if they need a mix, we can support data tiering, so they can have DAOS as a performance tier and still use something like Lustre or Spectrum Scale for their colder HDD storage. Because there's a lot of footprint, too, that you end up putting on your clients. You have to do all this buffering and a whole lot of different activity on your system once you start bringing HDDs into the picture, which you've sort of been freed from with NVMe.
Sure. Kelsey, the interconnect between servers.
Under Lustre, it was quite often an InfiniBand architecture.
What are we looking at here?
So I think you'll still find quite often you're going to see a high-speed RDMA network like InfiniBand.
If you're trying to bring this advantage that you can
do a lot of these low latency reads, you need a low latency network to see all of that benefit.
However, we do use the libfabric library underneath, so you're not restricted to that in any way, right?
Any libfabric-supported fabric, okay.
So we do see, you know, you can install DAOS and run it over Ethernet with RoCE and different configurations like that. And we expect, especially as the Ethernet market continues to evolve over time, to see that continue to be a, you know, maybe not dominant option, but certainly one that's present enough to continue to care about supporting, and one that will probably continue to erode into that market.
So what would be, you know,
the smallest configuration of nodes that you would support, number of nodes rather,
and what would be the largest ones that you've tested with at this
point? So, I mean, from a practical point of view, you could actually have one DAOS server.
I'm not sure how interesting that would necessarily be, but it would certainly work.
For the smallest installations, we usually look around three, because if you want
data protection, right, if you have three, you can have two replicas to protect your data.
So that's usually what you sort of look at at the smallest level if you want to fully explore the functionality of DAOS.
But if you didn't care about data protection, you could just have one.
At the largest size, we have tested up to about 512 clients currently.
That's going to, of course, grow very much.
I don't know if you guys are aware yet, but DAOS is going to be the primary storage on the Aurora supercomputer for Argonne National Lab.
So that system, you know, is 230 usable petabytes and has to be at least 25 terabytes per second.
And so that's not far off in the future.
So obviously we have to come right out of the gate supporting all the way up to exascale.
It's funny you mentioned Argonne.
Chicago, right?
Yep.
They're up here in Chicago and they were a customer of mine.
Yeah, I mean, they've been a close partner with us for
a number of years now. So, you know, a lot of what we're building here has been built for Argonne, but not as a one-off; it's built to be a generic solution, done in partnership together with Argonne.
What sort of protocols are you supporting
from a file perspective or an unstructured data perspective?
Yeah, so this is something that we've actually put a lot of thought into, right?
Because a lot of the file systems up to now have really been focused on POSIX.
At least we were talking about the high performance file systems, right?
Have been very focused on supporting a POSIX interface. But POSIX itself actually brings a lot of constraints
to your performance because it has what we call pessimistic conflict resolution. It assumes sort
of any metadata activity you're going to do, there could be a conflict and does a lot of locking
accordingly. But what we found is actually those sorts of conflicts are fairly rare. So when
you take a look at some other areas in the industry, look at databases and some other
things, they've come up with sort of some different ways to deal with this. So we use in
what we call sort of optimistic conflict resolution. So we basically start doing the write.
And then if there's a conflict that arises,
we resolve it at that time.
And that allows us to be much more performant.
But to do that,
you have to move away from the POSIX interface.
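As a sketch of the difference, here's the optimistic pattern in miniature, in Python with made-up names. It isn't DAOS code, just the general technique: write first under a version check and resolve only when a conflict actually happened, instead of locking pessimistically up front.

```python
# Toy contrast with POSIX-style pessimistic locking: do the write first
# under a version check, and retry only if a conflict actually occurred.
# Hypothetical names; this is the general technique, not DAOS code.
import threading

class VersionedRecord:
    def __init__(self, value):
        self.value, self.version = value, 0
        self._guard = threading.Lock()  # held only for the instant of the check

    def read(self):
        return self.value, self.version

    def try_write(self, new_value, expected_version) -> bool:
        with self._guard:
            if self.version != expected_version:
                return False            # conflict: someone else wrote first
            self.value, self.version = new_value, self.version + 1
            return True

def update(record, fn):
    while True:                         # retries are rare if conflicts are rare
        value, version = record.read()
        if record.try_write(fn(value), version):
            return

counter = VersionedRecord(0)
update(counter, lambda v: v + 1)
assert counter.read() == (1, 1)
```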
And I know this is something that's been talked about
in our industry for a long time now of,
you know, you see big headlines,
the death of POSIX and things,
but it really is a big issue.
So DAOS itself provides its own API and its own layer, but you can't expect that you're going to
get adoption if you come into the market and say what you need to do is to rewrite all of your
applications to not use POSIX anymore. It's not going to happen, right? So what we've done is we have the base DAOS API, but we've been building different
layers on top of it so that applications can go ahead and use it. So we do actually have our own
POSIX layer you can use on top of DAOS. So if your application's doing POSIX directly,
it can go ahead and use our POSIX layer and that will write to DAOS in the backend. So that
supports all your existing POSIX applications.
But we also realized, you know, as the industry's evolved,
there's also a number of other middleware and application frameworks
that applications are actually already using to do their IO
rather than the application sort of directly thinking about how they're doing their IO.
And that there was an opportunity to add some support for these
different middleware and applications where the applications and frameworks themselves were
DAOS aware and could talk to DAOS. And then they could have the full bandwidth of what DAOS has to
offer and not have the POSIX constraints, but they don't have to rewrite any of the IO in their application. So what we
have so far right now is support for MPI-IO, HDF5, and Apache Spark, right? So to use an example, an Apache Spark application can actually just use DAOS as its backend. And we want to keep expanding that list further out in the future, right? We already have in progress, you know, SEG-Y support and ROOT support.
We want to see on here, you know,
TensorFlow and PyTorch.
And there's a lot of options.
And actually, because we,
some of the other features we have,
like distributed transactions,
you could even build, you know,
SQL over DAOS or something like that.
So there's really sort of a broad range,
but we want to make it easy to
adopt, right? You can't start from the point of asking people to rewrite everything.
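To illustrate that layering idea, here's a schematic sketch. None of these classes are the real DAOS libraries or middleware; the names are hypothetical. The point is that applications keep their existing IO interface while a DAOS-aware layer translates underneath.

```python
# Schematic sketch of the layering idea: two application-facing interfaces,
# one object-store backend. None of these are real DAOS libraries; all
# names here are hypothetical.

class ObjectBackend:
    """Stand-in for a native DAOS-style object API."""
    def __init__(self):
        self.objects = {}
    def put(self, key, value):
        self.objects[key] = value

class PosixLayer:
    """POSIX-style file interface whose writes land in the object backend."""
    def __init__(self, backend):
        self.backend = backend
    def write_file(self, path, data: bytes):
        self.backend.put(("posix", path), data)

class Hdf5LikeLayer:
    """HDF5-style dataset interface over the same backend."""
    def __init__(self, backend):
        self.backend = backend
    def write_dataset(self, name, values):
        self.backend.put(("hdf5", name), values)

backend = ObjectBackend()
PosixLayer(backend).write_file("/results/run1.log", b"ok")
Hdf5LikeLayer(backend).write_dataset("/sim/temperature", [21.5, 22.1])
# The application keeps its existing interface; the layer talks to the store.
```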
So one thing I didn't hear was like NFS version four or something like that,
that supports parallel IO. There's no interest in that or it just requires so much of a bottleneck
to do that, it wouldn't be worth it. I think the latter. It just ends up being so much of a bottleneck
that it's not necessarily worth it. But I mean, to be clear, it's supporting
POSIX across your cluster, but you're still going to be doing
parallel IO to the back-end DAOS system, right?
It's not just one write happening.
So with the migration of metadata into persistent memory,
persistent memory has got byte access and that sort of stuff.
Did you find that there was a significant advantage in doing that?
Or was it a major change from your perspective to take advantage of this? I mean,
yeah, I think to some extent, the reason Intel would do something like that is to take advantage
of the technology that they're providing, right? Yeah. I mean, I think there really is a
fundamental advantage to that. So I sort of touched on it before, but right with this idea
where all of your SSDs or HDDs have these large blocks that you're writing to, right? So as your
application's writing maybe different unrelated pieces of data, the way they get serialized and
packed into blocks, you can have unrelated pieces of data end up sharing a block. And then if different
clients want to then access that data later, they have to put a lock on the block. And if they're
doing it at the same time, now that activity becomes serialized. You do that several million
times across your cluster, and it actually becomes a pretty significant bottleneck. So
obviously, of course, we at Intel want to find great ways to showcase
our own technology. But my group specifically sort of came from the other end where we were
working on storage and we saw what this could bring to us. And that really is different. You
couldn't really take DAOS and put it on SSDs alone and get the same sort of performance, because
that hardware capability really is a core functionality
of why it's able to be more performant.
One of the challenges with SSDs, depending on the SSD, of course, technology is that
its read performance is extremely good, but its write performance is not as good.
And the write endurance is an issue as
well. Does DAOS do anything to try to speed up write activity? You're not dependent on Optane
SSDs, are you?
No, we're not dependent on Optane SSDs. And while we're looking at ways to perhaps glean some additional performance from Optane SSDs, they're not required. You can use regular 3D NAND SSDs as well. You know, there's a few different
things, I think, to answer your question. The first one is that actually using the persistent
memory for the small IOs in your metadata actually does also help preserve the lifespan of your SSDs
because you're not having to do all those small writes to your SSDs to begin with. So that actually does help extend the lifetime of your SSD.
Also, you know, there are different types of SSD technologies, and we are actually currently
inside Intel doing a lot of benchmarking with future generation technologies of our Intel
SSDs as well, and taking a look at how we can feed requirements back into those future SSDs to be,
you know, the best optimized for that. But, you know, we find there's a typical lifespan for these sorts of systems, right? And we don't find that it is cost prohibitive to be able to use the SSDs in the way they're able to last. But the persistent memory is definitely part of that, that we're not sending all of those small writes to the SSDs all the time.
The Argonne solution was going to be something like 256 petabytes of storage.
Is that, I'm not sure if that's the right number,
but it's multiple petabytes of storage, right?
Yeah, yeah.
So that was probably the raw number at some point.
It's 230 usable petabytes.
So once you have the erasure coding in place,
it's 230 usable.
Yeah, well, even so, 230 petabytes of SSD,
NVMe SSD storage is going to cost quite a lot. And all that's going to be on DAOS?
Yep, that's all DAOS storage. Yeah, it's going to be quite a large system. I think it's probably
going to be, you know, the best storage system in the world for a number of years after it. You know, Argonne's really invested in the storage side and building
this new supercomputer. But of course, that doesn't mean you have to go up to that kind of
scale to start seeing it be, you know, worth using DAOS, even when you're down at those three and
four node systems, you know, or half a petabyte, you can still notice the difference quite a lot.
You mentioned you tested 512 clients in the lab.
How many nodes are you deploying at Argonne?
I'm afraid that's in the set of information with Argonne that I'm not allowed to share.
Okay.
Okay.
I'm going to say lots.
Yeah, it's going to be lots, but it's, you know, we're talking hundreds of petabytes.
It's huge.
In any event, obviously, Intel was at Storage Field Day 20 here almost, I guess, a month or so ago.
And one of the things that was mentioned there was, I think it's the IO500 benchmarks.
I've never heard of IO500 before.
It's a relatively new benchmark as far as I can discern, but you want to talk about some of the numbers you guys achieved or rankings maybe?
Yeah, sure. It's been around for a few years now. It was around when I was working back on Lustre.
They were working on forming this together, actually. Andreas Dilger, who worked
with us on Lustre, has been very involved with the IO500 as well. But it was really an attempt
to try to come up with a benchmark that could be consistently applied that gave people a better
idea of what the file system performance would be like when they started to really use it for
applications, right? You sort
of end up in that game, right? Everyone's showing you like their best benchmark, but then not showing
you where maybe they don't do that well. So they might show you, you know, they do really great at,
you know, sort of IOR-easy with large streaming reads and writes. But if you configure IOR differently and you're doing a lot of random reads and writes, you might see a drastic difference in performance, and they would kind of pick and choose. So it was an effort to say, okay, let's come together as an industry and
try and have a fair comparison. Let's build together something that kind of looks at all
the different angles and aggregates it into a score. And then much like the top 500 supercomputer list, right? It uses that to rank the top file systems in the world. And they sort of have two
lists. They have one that's sort of the full list, and that's very much like the top 500.
There's no constraints on it. It can be any size system who just has the best performance in the
world. And then they have a second list that they call the 10-node challenge. And the 10-node challenge is where every system is limited to exactly 10 clients.
So that also gives you a lot of interesting opportunities to compare the technologies and the solutions if you're in a similar scale system.
So this year at ISC was our second time submitting.
And it was also the first time where we were joined with
submissions from some of our partners as well. So in addition to a submission directly from my team
on our test bed, we also had submissions together with Argonne, and TACC did a submission as well.
And we're very proud of the results for it. Our Intel submission was number one on the full list
by almost double the second solution.
So I just mean double the score, overall score.
And if you dive into that and you start to take a look at the IOR-hard results and the MDtest-hard metadata results, those sorts of ones, those differences start to get even more dramatic. The IO500 list posts all of these details publicly online for each submission. But we were really proud to be able to really showcase what this technology was capable of, and with a pretty small system, right? We only had 30 servers and 52 clients. We didn't
even actually have enough clients to fully saturate our servers, but that's what we had in our development lab here.
And with that, you know, we were able to get the best performance, you know, out of any submission
of all the different supercomputers in the world, right? We're beating the top supercomputers with
effectively a handful of nodes, right? You're talking both of them together. You're only
looking at 82 nodes, and you're comparing to systems that maybe have, you know, several
hundreds to thousands of nodes that you're running against. And so we were very proud of
that. We're also very proud of our partners submitting along with us. Argonne and TACC got
fourth and third on the full list, respectively.
And then on the 10-node challenge, where it's the same number of clients,
actually all three DAOS submissions took the top three spots, right?
Once you start looking at similarly sized systems,
we took all three spots and we were over three times higher
than the top non-DAOS submission.
So we think this really does a good job in a way that's been, you know,
really vetted by a third party, and, you know, it can be picked apart by your customers,
and really still show that you stand up and hold the performance claims that you've been making.
And the IO500, again, is more or less an HPC simulation of the file services that are required in that space, but it's multi-application oriented and that sort of thing. Historically in HPC, small files, like we said, were not necessarily that active, but if the benchmark was doing something like machine learning or something, and I'm not sure whether it does that or not, obviously the solution would be faster.
Yeah, this is actually aiming to be more balanced, right? So it has a portion of the score that goes to different types of workloads. So there's a portion of the score that goes to your traditional large streaming reads and writes, and then a portion of the score that goes to, you know, what they call IOR-hard and MDtest-hard, as well as doing things like find on your system. And that's where you're going to start seeing, you know, sort of what's called poorly behaved IO, right? Small file IO,
or maybe IO that's not aligned to your blocks, things like
that. And that's where the existing solutions tend to really fall down, right? You start to see them
perform like, you know, the difference for some of these existing technologies, from one of these more poorly aligned or small IO benchmarks to the large streaming ones, might be like a 9x difference.
But with DAOS, right, it's much closer because we don't have the same limitations from the
underlying SSDs. It's much closer to each other. So because the IO500 publishes all of the specific
data for all the individual benchmarks, you can actually, it's kind of cool, you can dive in there
and look and do all the comparisons and try and look at the benchmarks
that's closer to your workload and see what might be the best technology. So if you know you're
doing a lot of small file IO or poorly aligned IO, you can go look and compare them in there.
And that's when those differences start being actually even more dramatic. You know, you start seeing 13, 14, 40x differences
between these different entries.
But the numbers I said before were about the overall score.
So that, you know, we really, we can still hold our own
in the large streaming reads and writes as well.
Absolutely. Yeah, yeah, yeah, yeah.
I report on benchmarks in my monthly electronic newsletter, plug for my newsletter.
And I haven't done anything on IO500 yet, but I'm planning to start.
So let's look forward to seeing something there.
As far as maturity goes, it seems like, you know, Lustre and comparable file systems are fairly serious long-term endeavors.
Lustre has been around for, I don't know, I want to say a 10-to-20-years kind of number.
GPFS, which is a precursor to Spectrum Scale, is probably the same sorts of timeframes.
DAOS coming out within, I don't know, four or five years?
Is that when development started?
Or it's not as mature, I guess, is the kindest thing I could say.
Right.
So everyone sort of has to start somewhere.
But no, you're definitely right.
Lustre's actually crossed its 20-year anniversary now.
It's more than 20 years old.
And, you know, a lot of our organization came
from the Lustre group originally. And we sort of have a little saying amongst ourselves that,
you know, to create any new file system and have it completely stable takes about 10 years.
And that's based on our experience from having helped build Lustre. But DAOS has actually been around a bit longer than that.
We started DAOS back in 2012.
So it's already been over eight years since we started.
And we're looking forward to our first production version in Q1 this upcoming year in 2021. So it's definitely newer, but we've put a lot of
thought and time into validating this and building it out to be a stable system and
not trying to necessarily just rush it out to the market as soon as possible.
Right, right, right. A lot of startups and stuff will take significantly less time than that to deploy a solution.
But I don't know if they play in that HPC space
as much as more enterprise-level kinds of workloads.
Well, a lot of times startups have to do that, right?
They have to show some progress to secure more funding.
One of the benefits of being part of Intel
is they can look to a longer term horizon for their investments.
About how many engineers are working on the project today?
I have about 50 engineers in my team working on this currently.
Then there's a number of people across other organizations, across Intel, that are contributing
in some way or another on the enablement side.
And then the Optane persistent memory, there are, you know,
various generations of that that exist and are planned and stuff like that. So you're already
probably, you know, working on support for, I'll call it the next gen of persistent memory.
Yes, definitely. So, you know, what's out there today, if someone wants to try it today, right, they can use the first generation. But the second generation, of course, we're already working on, and we plan to go further out with that. And we're engaging with those organizations to, you know, make sure what we're putting out together coordinates well and builds a stronger solution together.
Right, right, right. And from a marketing perspective, you mentioned a couple of partners.
I guess Argonne's a partner.
You mentioned another one that did another submission.
I can't recall what that was on the IO500.
That was TACC, the Texas Advanced Computing Center.
Okay.
And do you have actual solution partners out
there that are pushing DAOS as well to other environments? Or I guess I'd call these channel kinds of things, right?
Yeah, we're working with a number of partners. I can't necessarily disclose all of them, but Lenovo disclosed at the DAOS User Group last year that they're working on an integrated solution for DAOS. And actually, we did a fireside chat at ISC about that
as well. Virtually, of course. Yeah, of course. I guess I wasn't familiar with ISC. I've been to
SC19 and a couple of the other ones prior to that, but I hadn't been to the international version of that. That's typically in Europe someplace?
Yeah, it's usually in Frankfurt, in Germany. You should really come. It's fun. Yeah. It's a good travel trip. Yeah, when the
virus is over and all that stuff. I know, right? That's interesting. And so talk a little bit. So
you mentioned erasure coding is coming. You've got replication today and you support, you know, two to three replicas of the data.
Is that a correct statement?
Actually, it's N-way replication.
So it's configurable for you as the end user how much you want to replicate it.
So you could actually make it replicate to every node in your cluster, which may sound crazy.
Why would you ever need
that much protection? But that actually can come into play from an availability aspect as well,
right? So if you look at an artificial intelligence sort of job where they may need all their clients
to read the same data set at the same time, if you have that only on a couple nodes, you're going to
get a huge hotspot in your cluster and it's going to bottleneck there. But if you had something like that, you could take that data set and you can replicate it across all
of your storage nodes and then feed that out to your job and then that load gets spread across
your cluster. And something that's also kind of cool with DAOS is how much replication you want
is actually selectable down to the per object basis, right? So we're an object store, so per
object you can decide how much replication you want. So if your application has some data that is just throwaway, and you know you're going to throw it away if you need to go back to the last checkpoint or restart your application, you can actually go ahead and just mark that as not being replicated at all, so you don't have any extra resources in your system being consumed where that wouldn't be useful. But then the other pieces of data, the ones you do need to keep around for your checkpoint or to restart your application, you can protect to any level you desire. So there's a lot of flexibility in that compared to some of the more hardware-based data protection systems.
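A minimal sketch of that per-object choice (hypothetical names again, not the DAOS API; it just shows the idea of selecting a replica count per object):

```python
# Sketch of per-object replica counts: scratch data unreplicated, checkpoints
# protected, hot datasets replicated to every node for parallel reads.
# Hypothetical interface, not the DAOS API.
import itertools

class ObjectStore:
    def __init__(self, nodes):
        self.nodes = {n: {} for n in nodes}
        self._ring = itertools.cycle(nodes)   # round-robin placement

    def put(self, key, value, replicas=1):
        count = min(replicas, len(self.nodes))
        targets = [next(self._ring) for _ in range(count)]
        for node in targets:
            self.nodes[node][key] = value
        return targets

store = ObjectStore(["node1", "node2", "node3"])
store.put("scratch/tmp0", b"...", replicas=1)        # throwaway intermediate
store.put("checkpoint/step100", b"...", replicas=2)  # protected checkpoint
store.put("training/dataset", b"...", replicas=3)    # everywhere, for read spread
```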
Yeah, yeah, yeah. When you said per object, I'm thinking, are we talking about a shared?
I think you're talking about a file.
It could be a file, depending on how you use your object store and what middleware.
It could be a file.
It can even be smaller than that.
It can be a key-value pair.
That's very unusual.
I've never heard of anybody being able to specify the replication of a key value pair.
I guess if the value was sufficiently large, that would make a lot of sense, I guess.
I mean, most realistically, people are going to use it for groups of data, not individual key value pairs.
But it has that flexibility since it's not stored inside DAOS as a file, right?
Because we're an object store.
We have a different sort of structure inside of us.
It gives us some of that flexibility to whatever level of granularity that you want to use.
So a particular key value could potentially be a separate object in your object store?
Well, so I sort of use those interchangeably. The way our objects work, they effectively are
key values, right?
And there's sort of a few different layers of structure.
So let me start top down.
So you have at the top layer what we call pools.
These are actually similar in concept to a ZFS zpool, for Matt there.
But that's going to be more like your storage allocation.
That often is going to correlate to like your project, right? That's how much space this user is allowed to consume. So you have your pool allocation.
Inside your pool, you have containers. Containers generally correlate to a data set. It's one run
of your application from beginning to end, right? One job, all the data that's generated in that
usually goes into one container. Inside that container,
you have objects, and those are key-value pairs. But what makes the DAOS interface interesting
compared to something like S3 is our values actually can be additional keys or arrays of keys,
not just a blob. So that allows you to make basically any hierarchical structure that you could desire. And that's how
we're able to do things like build support for HDF5 and POSIX and things like that. We actually
build containers that have a structure similar to those file structures that is mimicked in that
hierarchy, right? And then can provide an appropriate interface out, but that's how it
gets stored inside of DAOS. So that gives us huge flexibility.
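Since the structure maps so naturally onto nested dictionaries, here's a toy model of the pool / container / object hierarchy as described, in plain Python; it's an illustration of the data model, not the DAOS API.

```python
# Toy model of the hierarchy just described: a pool holds containers,
# a container holds objects, and object values can themselves be keys or
# arrays. Plain Python dicts stand in for the storage engine; not the DAOS API.

pool = {}                                            # storage allocation (project)

container = pool.setdefault("climate-run-042", {})   # one dataset / one job

container["results"] = {                             # an object: keys of keys/arrays
    "day1": {"temperature": [21.5, 22.1, 23.0], "pressure": [1012, 1011]},
    "day2": {"temperature": [20.9, 21.7, 22.4], "pressure": [1013, 1012]},
}

print(pool["climate-run-042"]["results"]["day1"]["temperature"][2])  # -> 23.0
```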
It's almost like a native Python solution here, key values embedded in key values, et cetera, and arrays. It's pretty impressive, actually.
Well, that's really what we wanted
to do, right? People write their applications that way. And instead of having to then think
about it getting serialized out to disk, we want applications to be able to just store their data and the way their application is structuring the data. And we can go ahead and take care of
how you get that to the media most efficiently for you. Well, this has been great. Thank you
very much, Kelsey, for being on our show today. Anything else you'd like to say to our listening
audience before we close? Happy to be here. And if anybody wants to follow up and get more information,
we do have a community website at daos.io. That's it for now. Bye, Kelsey. Bye. Thank you.
Until next time. Thank you. Bye-bye.