Grey Beards on Systems - 106: Greybeards talk Intel’s new HPC file system with Kelsey Prantis, Senior Software Eng. Manager, Intel

Episode Date: September 17, 2020

We had talked with Intel at Storage Field Day 20 (SFD20) about a month ago. At the virtual event, Intel's focus was on their Optane PMEM (persistent memory) technology. Kelsey Prantis (@kelseyprant...is), Senior Software Engineering Manager, Intel, was on the show and gave an introduction to Intel's DAOS (Distributed Asynchronous Object Storage, DAOS.io), a new …

Transcript
[00:00:00] Hey everybody, Ray Lucchesi here with Matt Leib. Welcome to another episode of the Greybeards on Storage podcast, a show where we get Greybeards bloggers together with storage and system vendors to discuss upcoming products, technologies, and trends affecting the data center today. This Greybeards on Storage episode was recorded September 11, 2020. We have with us here today Kelsey Prantis, Senior Software Engineering Manager at Intel. So Kelsey, why don't you tell us a little bit about yourself and Intel's DAOS product? Hi, yeah, I'm so happy to be here. Thank you for having me.
[00:00:42] So I entered the HPC storage business about eight years ago. I was originally part of a startup called Whamcloud. Many of you may be familiar with them. We worked on the Lustre product and were later acquired by Intel. And while we were at Intel, we started taking a look at what could be done with the new hardware technologies that were being developed at the time. NVMe was sort of off on the horizon at that point, but it was clearly going to be pretty transformational to storage. So we began taking a look at what would be involved in incorporating that into the future of storage and what that might look like. We actually originally started at a
[00:01:31] position where we were looking at how we could modify Lustre itself to take advantage of these technologies. But as we really dived into what these new hardware capabilities were going to bring us, we realized that there was actually a need for a whole new look at how storage was architected for the new capabilities they brought along. So that's sort of where DAOS was born: we splintered off from that group here inside of Intel and started working on what we felt was the next generation of storage. And is DAOS a fully supported Intel product? Is it something that customers purchase, or is it open source with support available? How does that work out? So DAOS at the moment is completely open source.
[00:02:27] We do all of our development out in the open. You can find DAOS out on GitHub and work together with our community there. A commercial support offering is something that is still under consideration today and still being worked out with our partners, exactly how that's going to be offered. Of course, commercial support is going to be absolutely necessary to support any new open source technology, but exactly what that's going to look like is something we're still figuring out. And when you say the new hardware Intel is coming out with, I mean, besides the multi-core CPUs and NVMe and that sort of stuff, are we talking about things like persistent memory,
[00:03:15] Optane persistent memory and Optane SSDs as well? Yeah, so the initial investigation was looking at all of these different technologies and what they could bring. The one that we really homed in on in the end, though, that we thought was really transformational, was Intel persistent memory. That brought some very new and unique opportunities for the storage system. Sort of leading up to this point, all of the different storage media that we would use for storage is fundamentally based on block-based I/O. And you end up with a certain amount of locking, accessing the different blocks on that, which becomes a performance bottleneck in accessing the media. But Intel Optane persistent memory was really a whole new form factor where you could have byte-granular, fine-grained access to your
[00:04:11] storage media, instead of trying to figure out how you can pack your data together into these larger blocks. And what that allows is a lot more parallel access to your data. And more parallelization, so less serialization, of course, means a much more performant outcome. So that's really what we built DAOS on top of. The way DAOS works is we put all of our metadata and our small file I/O, and by small file I/O I mean I/Os that are generally about 4K and below, and all of those get stored directly on the Intel Optane persistent memory DIMMs. And then at the other end, for your larger I/Os and your bulk data transfers that are already more block-friendly, those DAOS sends to your NVMe SSDs. So you can use either 3D NAND or Optane
[00:05:01] SSDs for this. So doing this split allows us to find a good balance of really being able to take advantage of the performance that we can have with the byte-granular access to the persistent memory, but still keep our TCO in an attractive space. Now, though, if you do want maximum performance, you can actually have an all-persistent-memory DAOS system as well,
[00:05:22] but the inverse is not the case. You have to have the persistent memory. When you target this build, is it a standard x86 server? My recollection is that a lot of the HPC stuff is sitting on RISC. Yeah, so we're actually just using standard COTS servers with DAOS. You're just using your standard Xeon processors, which have support for the persistent memory DIMMs. We handle our data protection a little differently than some of the existing technologies today that required a lot of different custom hardware and things like that.
[00:06:02] We use replication and, coming soon, erasure coding on that as well. N-way replication allows you to spread your data as replicas across your different nodes, and that's how we manage your data protection, rather than specialized server platforms and specialized dual-ported storage. And this really allows our different OEM partners a lot more flexibility in designing these platforms to bring it to market. That's interesting.
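To make the placement policy Kelsey describes concrete, here is a minimal Python sketch of the split: metadata and I/Os of roughly 4K and below go to persistent memory, while larger, block-friendly transfers go to NVMe SSD. The function and media names are illustrative assumptions, not DAOS internals.

```python
# Illustrative model of the placement policy described above; the threshold
# matches the "about 4K and below" figure from the discussion, but the names
# and structure here are invented for illustration.

SMALL_IO_THRESHOLD = 4 * 1024  # bytes; "about 4K and below"

def place_write(kind: str, payload: bytes) -> str:
    """Return which media class a write would be directed to."""
    if kind == "metadata":
        return "pmem"                   # all metadata lives on Optane PMEM
    if len(payload) <= SMALL_IO_THRESHOLD:
        return "pmem"                   # small file I/O: byte-granular media
    return "nvme_ssd"                   # bulk transfers: block-friendly media

assert place_write("metadata", b"") == "pmem"
assert place_write("data", b"x" * 1024) == "pmem"           # 1 KiB record
assert place_write("data", b"x" * (1 << 20)) == "nvme_ssd"  # 1 MiB stripe
```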
[00:06:36] So you mentioned that the small file I/O is done, cached I guess is the word I would use, in PMEM, and the metadata is also completely maintained in PMEM. This is a big change from what normal HPC file systems had in the past. I mean, a lot of these systems had special metadata services, servers, and things of that nature. Do you have a special metadata server in DAOS, or is it all pretty much a converged solution? Yeah, we don't have any specialized servers in DAOS.
[00:07:15] All of your DAOS servers are effectively of the same importance, and all of them have the different metadata. So there's no point of bottleneck in trying to access your metadata, because it's spread across your entire cluster. Though I will pick a nit there: I think one of the things that makes it even more transformational is that it's not just using the persistent memory as a cache, right? There are some other technologies trying to take stabs at using DRAM as a cache.
[00:07:45] But when it comes to being able to measure performance, if something is a cache, at the end of the day you still have to drain it to your disk before you've really saved that data. But because Optane persistent memory is, it's right in the name, persistent, we actually use it as a first-class storage device, and your data will stay there unless you start to run out of space. And then we have aggregation running in the background that will aggregate some of those small I/Os into something more block-friendly and transfer it to the SSDs. But it is a first-class storage device and not a cache, and that may seem like a subtle difference, but it ends up having quite a different impact on the end result. Most HPC environments seemed like they were large block, heavy bandwidth, consumptive solutions and less in the small file space. Do you find that small files are starting to become more active? Yes, yes. We find that there's always going to be those classic HPC workloads that do very well focused on bandwidth and don't need the small file I/O, but there are more and more. I mean, there have always been some applications in the scientific space and things like that that have struggled with the existing systems, had small file I/O, and been performance limited. But we're starting to see that more and more, right? As we're starting to see the continued growth
[00:09:06] of AI and data analytics, and they start to have more and more high performance needs, they have very different access patterns to their data than, we'll call it the traditional HPC, the modeling and simulation jobs. And as these are coming into our data centers, they're putting a lot of new demands onto our storage systems. And it's not something that you even want to handle just by doing
[00:09:32] different islands inside your cluster, because the workflows are already starting to get more complex, right? We're already starting to see something like, you may have one workflow where, let's say, you have a weather modeling and simulation job, a very traditional modeling and simulation job, but now you also want to add an AI component that's refining your results, doing some AI workloads to refine and combine it. If you had separate islands, you'd have to transfer your data between the islands. That'd be a huge bottleneck. So you really need something that can serve both. It has the good small file I/O, and it still does well on the bandwidth. Right. We excel in the small file I/O, but we still have a very healthy and competitive bandwidth at the same time. Yeah, it seems like these systems have as many GPUs as they've got CPUs anymore, just to support these AI, ML, deep learning types of workloads and stuff like that. So I see those things starting to emerge as
[00:10:32] more and more important in the HPC space as well. I did a seismic installation, Ray, for earthquake metrics, and the requirements, not just to ingest the data but to process that data and come up with something useful, as you can imagine in a seismic environment, require something practically immediate. And it seems to me that the ability to handle both large and small file data sets with the same sort of I/O requirement is mission critical. And it was always the bottleneck.
[00:11:20] You're mentioning it's almost like a real-time solution to analyze seismic data. That's pretty interesting. We could have a whole podcast on that alone, Matt. Or oil and gas is another one, sure. Well, that's what a lot of this comes down to as well, right? For a lot of these real-time applications and things, being able to get to your conclusions faster can make real practical differences for your business. Right. If you're talking financials and you're able to make better AI predictions and invest accordingly,
[00:11:54] right, that can be real money. Or if you're analyzing seismic data so that you can warn people sooner that there's going to be an earthquake where they are. Those things are real, tangible benefits of being able to have this better performance. Right, right. And so how much persistent memory? So I guess the first question: are all the nodes homogeneous? Do they all have to have the same amount of persistent memory and the same sorts of storage behind them? Or can they be heterogeneous?
[00:12:23] Right now, the DAOS storage nodes are homogeneous. You want to be able to do that so that you can have maximum bandwidth to all your nodes. If some of your nodes don't have as many DIMMs or things like that, that starts to affect the rate at which you can get your data to your media. We generally recommend a ratio between the persistent memory and the NVMe SSDs per node, and then you replicate that across a number of identical nodes. So you can tune it to your workload, right? And DAOS engineering can work with folks to tune it to their workload, but the general rough estimate is you want about a 6% ratio from your SSD capacity to your persistent memory. So if you take how much SSD you want, 6% of that is how much persistent memory you want, so that you have enough room to fully utilize the system.
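As a quick worked example of that rule of thumb (the per-node capacity below is hypothetical, not a recommendation):

```python
# Sizing sketch for the ~6% PMEM-to-SSD capacity rule of thumb quoted above.
# The per-node SSD capacity is a made-up example.

ssd_per_node_tb = 32          # e.g., 8 x 4 TB NVMe SSDs per node (illustrative)
pmem_ratio = 0.06             # "about 6%" of SSD capacity

pmem_per_node_tb = ssd_per_node_tb * pmem_ratio
print(f"PMEM per node: {pmem_per_node_tb:.2f} TB")   # 1.92 TB for this node
```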
[00:13:12] Do you support normal HDDs as well, or is it just the SSD and PMEM? No, we use SSDs. And I know that can be a controversial decision, but what we're doing here is really trying to define the next generation of storage. For HDDs, there are already a lot of technologies out there like Lustre, where they have spent a couple of decades perfecting that, and we just recommend folks use those. And if they need a mix, we can support data tiering, so they can have DAOS as a performance tier and still use something like Lustre or Spectrum Scale for their colder HDD storage. Because there's a lot of footprint, too, that you end up putting on your clients, and you have to do all this buffering and a whole lot of different activity on your system once you start bringing HDDs into the picture, which you've sort of been freed from with NVMe. Sure. Kelsey, the interconnect between servers: under Lustre, it was quite often an InfiniBand architecture.
[00:14:32] What are we looking at here? So I think you'll still find, quite often, that you're going to see a high-speed RDMA network like InfiniBand. If you're trying to bring this advantage that you can do a lot of these low-latency reads, you need a low-latency network to see all of that benefit. However, we do use the libfabric library underneath, so you're not restricted to that in any way. Any libfabric-supported fabric, okay. And we do see, you know, you can install DAOS and run it over Ethernet with RoCE and different configurations like that. And we expect, especially as the Ethernet market continues to evolve over time, to see that
[00:15:18] continue to be a, maybe not dominant, option, but certainly one that's present enough to keep caring about supporting, and one that will probably continue to erode into that market. So what would be the smallest configuration of nodes that you would support, number of nodes rather, and what would be the largest you've tested with at this point? So, I mean, from a pragmatic point of view, you could actually have one DAOS server. I'm not sure how interesting that would necessarily be, but it would certainly work. For the smallest installations, we usually look at around three, because if you want data protection, right, if you have three, you can have two replicas to protect your data.
[00:16:06] So that's usually what you look at at the smallest level if you want to fully explore the functionality of DAOS. But if you didn't care about data protection, you could just have one. As for the largest size, we have tested up to about 512 clients currently. That's going to, of course, grow very much. I don't know if you guys are aware yet, but DAOS is going to be the primary storage on the Aurora supercomputer for Argonne National Lab. That system is 230 usable petabytes and has to be at least 25 terabytes per second. And that's not far off in the future. So obviously we have to come right out of the gate supporting all the way up to exascale.
[00:16:54] It's funny you mentioned Argonne. Chicago, right? Yep. They're up here in Chicago and they were a customer of mine. Yeah, I mean, they've been a close partner of ours for a number of years now. So a lot of what we're building here has been built for Argonne, but not as a one-off: it's built to be a generic solution, done in partnership together with Argonne. What sort of protocols are you supporting
[00:17:22] from a file perspective or an unstructured data perspective? Yeah, so this is something that we've actually put a lot of thought into. Because a lot of the file systems up to now, at least if we're talking about the high-performance file systems, have been very focused on supporting a POSIX interface. But POSIX itself actually brings a lot of constraints to your performance, because it has what we call pessimistic conflict resolution. It assumes that for any metadata activity you're going to do, there could be a conflict, and it does a lot of locking accordingly. But what we found is that those sorts of conflicts are actually fairly rare. So when
[00:18:09] you take a look at some other areas in the industry, look at databases and some other things, they've come up with some different ways to deal with this. So we use what we call optimistic conflict resolution. We basically start doing the write, and then if a conflict arises, we resolve it at that time. And that allows us to be much more performant. But to do that, you have to move away from the POSIX interface.
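A generic sketch of the optimistic approach she describes, using a per-key version check with retry; this models the idea only, not DAOS's actual transaction machinery:

```python
# Optimistic conflict resolution, sketched: do the write first, detect a
# conflict only if one actually arises, and retry. Contrast with POSIX-style
# pessimistic locking, which pays the locking cost even when (as noted above)
# conflicts are rare.

class ConflictError(Exception):
    pass

store = {}  # key -> (version, value)

def read(key):
    return store.get(key, (0, None))

def write(key, value, expected_version):
    version, _ = read(key)
    if version != expected_version:     # another writer won the race
        raise ConflictError(key)        # resolve at commit time, not up front
    store[key] = (version + 1, value)

def update(key, transform, retries=3):
    for _ in range(retries):
        version, value = read(key)
        try:
            write(key, transform(value), version)   # optimistic: just try it
            return
        except ConflictError:
            continue                    # rare case: re-read and retry
    raise RuntimeError("too many conflicts")

update("counter", lambda v: (v or 0) + 1)
```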
[00:18:36] And I know this is something that's been talked about in our industry for a long time now; you see big headlines, the death of POSIX and things, but it really is a big issue. So DAOS itself provides its own API and its own layer, but you can't expect that you're going to get adoption if you come into the market and say that what you need to do is rewrite all of your applications to not use POSIX anymore. It's not going to happen, right? So what we've done is we have the base DAOS API, but we've been building different
[00:19:07] layers on top of it so that applications can go ahead and use it. So we do actually have our own POSIX layer you can use on top of DAOS. If your application is doing POSIX directly, it can go ahead and use our POSIX layer, and that will write to DAOS on the backend. So that supports all your existing POSIX applications. But we also realized that, as the industry's evolved, there are also a number of other middleware and application frameworks that applications are actually already using to do their I/O, rather than the application directly thinking about how it's doing its I/O.
[00:19:42] And there was an opportunity to add support for these different middleware and application frameworks, so that the frameworks themselves were DAOS-aware and could talk to DAOS. Then they could have the full bandwidth of what DAOS has to offer without the POSIX constraints, but they don't have to rewrite any of the I/O in their application. So what we have so far right now is support for MPI-IO, HDF5, and Apache Spark. So to use an example, an Apache Spark application can actually just use DAOS as its backend. And we want to keep expanding that list further out in the future; we already have SEG-Y support and ROOT support in progress.
[00:20:27] We want to see on here, you know, TensorFlow and PyTorch. And there are a lot of options. And actually, because of some of the other features we have, like distributed transactions, you could even build SQL over DAOS or something like that. So there's really a broad range, but we want to make it easy to adopt, right? You can't start from the point of asking people to rewrite everything.
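The middleware point is worth a concrete illustration: an application whose I/O already goes through HDF5 keeps its code unchanged, and the DAOS backend is swapped in underneath via HDF5's VOL connector mechanism. The environment-based connector selection mentioned in the comment is an assumption about deployment, not verified syntax:

```python
# An application that already does its I/O through HDF5 middleware. Nothing
# in this code is DAOS-specific; whether it lands on a POSIX file system or
# on DAOS is decided by how the job environment configures HDF5 (e.g., a
# DAOS VOL connector selected via something like HDF5_VOL_CONNECTOR --
# treat that as an assumption about the deployment, not gospel).
import numpy as np
import h5py

with h5py.File("results.h5", "w") as f:
    f.create_dataset("temperature", data=np.random.rand(1024, 1024))
    f["temperature"].attrs["units"] = "K"
```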
[00:20:40] So one thing I didn't hear was NFS version 4 or something like that, which supports parallel I/O. Is there no interest in that, or does it just impose so much of a bottleneck that it wouldn't be worth it? I think the latter. It just ends up being so much of a bottleneck that it's not necessarily worth it. But I mean, to be clear, DAOS is supporting POSIX across your cluster, but you're still going to be doing
[00:21:15] parallel I/O to the back-end DAOS system, right? It's not just one write happening. So with the migration of metadata into persistent memory, persistent memory has byte access and that sort of stuff. Did you find that there was a significant advantage in doing that? Or was it a major change from your perspective to take advantage of this? I mean, to some extent, the reason Intel would do something like that is to take advantage of the technology that they're providing, right? Yeah. I mean, I think there really is a
[00:21:59] fundamental advantage to that. I sort of touched on it before, but it's this idea where all of your SSDs or HDDs have these large blocks that you're writing to. As your application is writing maybe different, unrelated pieces of data, the way they get serialized and packed into blocks, you can have unrelated pieces of data end up sharing a block. And then if different clients want to access that data later, they have to put a lock on the block. And if they're doing it at the same time, now that activity becomes serialized. You do that several million times across your cluster, and it actually becomes a pretty significant bottleneck.
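A toy illustration of that block-sharing effect may help; the sizes are arbitrary:

```python
# Why block media serializes unrelated work: two records that have nothing
# to do with each other get packed into the same block, so clients touching
# them contend on the same block lock. Byte-granular PMEM has no shared
# block to lock.

BLOCK_SIZE = 4096

def block_of(offset: int) -> int:
    return offset // BLOCK_SIZE

# Two unrelated 1 KiB records, packed back to back by the serializer:
record_a, record_b = 0, 1024

# Same block, so accesses to A and B serialize on one lock; multiply this
# by several million across a cluster and it becomes the bottleneck.
assert block_of(record_a) == block_of(record_b)
```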
[00:22:45] So obviously, of course, we at Intel want to find great ways to showcase our own technology. But my group specifically came from the other end, where we were working on storage and saw what this could bring to us. And it really is different. You couldn't take DAOS and put it on SSDs alone and get the same sort of performance, because that hardware capability really is core to why it's able to be more performant. One of the challenges with SSDs, depending on the SSD technology, of course, is that read performance is extremely good, but write performance is not as good. And write endurance is an issue as
[00:23:26] well. Does DAOS do anything to try to speed up write activity? You're not dependent on Optane SSDs, are you? No, we're not dependent on Optane SSDs. While we're looking at ways to perhaps glean some additional performance from Optane SSDs, they are not required; you can use regular 3D NAND SSDs as well. There are a few different things, I think, to answer your question. The first one is that using the persistent memory for the small I/Os and your metadata actually does also help preserve the lifespan of your SSDs, because you're not having to do all those small writes to your SSDs to begin with. So that does help extend the lifetime of your SSDs. Also, there are different types of SSD technologies, and we are actually currently
[00:24:20] inside Intel doing a lot of benchmarking with future generations of our Intel SSDs as well, and taking a look at how we can feed requirements back into those future SSDs to be best optimized for this. But we find there's a typical lifespan for these sorts of systems, and we don't find it cost prohibitive to use the SSDs within what they're able to last. And the persistent memory is definitely part of that, in that we're not sending all of those small writes to the SSDs all the time. The Argonne solution was going to be something like 256 petabytes of storage. I'm not sure if that's the right number, but it's multiple petabytes of storage, right? Yeah, yeah. That was probably the raw number at some point. It's 230 usable petabytes.
[00:25:19] So once you have the erasure coding in place, it's 230 usable. Yeah, well, even so, 230 petabytes of NVMe SSD storage is going to cost quite a lot. And all that's going to be on DAOS? Yep, that's all DAOS storage. Yeah, it's going to be quite a large system. I think it's probably going to be the best storage system in the world for a number of years after it. Argonne's really invested in the storage side in building this new supercomputer. But of course, that doesn't mean you have to go up to that kind of scale to start seeing it be worth using DAOS; even when you're down at those three- and
[00:25:57] four-node systems, or half a petabyte, you can still notice the difference quite a lot. You mentioned you tested 512 clients in the lab. How many nodes are you deploying at the lab, at Argonne? I'm afraid that's in the set of information with Argonne that I'm not allowed to share. Okay. Okay. I'm going to say lots. Yeah, it's going to be lots, but, you know, we're talking hundreds of petabytes.
[00:26:22] It's huge. In any event. In any event. Obviously, Intel was at Storage Field Day 20 here almost, I guess, a month or so ago. And one of the things that was mentioned there was, I think, the IO500 benchmark. I'd never heard of IO500 before. It's a relatively new benchmark as far as I can discern, but do you want to talk about some of the numbers you guys achieved, or rankings maybe? Yeah, sure. It's been around for a few years now. It was around when I was working back on Lustre; they were working on forming it together, actually. Andreas Dilger, who worked
[00:27:05] with us on Lustre, has been very involved with the IO500 as well. It was really an attempt to come up with a benchmark that could be consistently applied, that gave people a better idea of what the file system performance would be like when they started to really use it for applications. You sort of end up in that game, right? Everyone's showing you their best benchmark, but then not showing you where maybe they don't do that well. So they might show you that they do really great at, say, IOR-easy with large streaming reads and writes. But if you configure IOR differently and you're doing a lot of random reads and writes, you might see a drastic difference in performance, and they would kind of
[00:27:49] pick and choose them. So it was an effort to say, okay, let's come together as an industry and try to have a fair comparison. Let's build together something that looks at all the different angles and aggregates them into a score. And then, much like the Top500 supercomputer list, it uses that to rank the top file systems in the world. And they have two lists. They have one that's the full list, and that's very much like the Top500: there are no constraints on it; it can be any size system that just has the best performance in the world. And then they have a second list that they call the 10-node challenge, where every system is limited to exactly 10 clients. So that also gives you a lot of interesting opportunities to compare the technologies and the solutions if you're in a similar-scale system.
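The aggregation works roughly as follows: the bandwidth tests are combined by geometric mean, the metadata tests likewise, and the final score is the geometric mean of those two. The per-test numbers below are made up purely to show the arithmetic:

```python
# IO500-style score aggregation, sketched. The structure (geometric means of
# bandwidth and metadata results, then a geometric mean of those two) follows
# the benchmark's published approach; the figures are invented.
from math import prod

def geomean(xs):
    return prod(xs) ** (1.0 / len(xs))

bw_gib_s = [120.0, 35.0, 98.0, 41.0]      # hypothetical ior-* bandwidths
md_kiops = [900.0, 310.0, 770.0, 560.0]   # hypothetical mdtest-*/find rates

score = geomean([geomean(bw_gib_s), geomean(md_kiops)])
print(f"IO500-style score: {score:.1f}")
```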
[00:28:43] So this year at ISC was our second time submitting. And it was also the first time we were joined by submissions from some of our partners as well. In addition to a submission directly from my team on our test bed, we also had a submission together with Argonne, and TACC did a submission as well. And we're very proud of the results. Our Intel submission was number one on the full list by almost double the second solution, and I mean double the overall score. And if you dive into that and start to take a look at IOR-hard results
[00:29:20] and the metadata mdtest-hard results, those sorts of ones, those differences start to get even more dramatic, because the IO500 list posts all of these details publicly online for each submission. But we were really proud to be able to showcase what this technology was capable of, and with a pretty small system: we only had 30 servers and 52 clients. We didn't actually even have enough clients to fully saturate our servers, but that's what we had in our development lab here. And with that, we were able to get the best performance out of any submission, of all the different supercomputers in the world. We're beating the top supercomputers with
[00:29:59] effectively a handful of nodes. Both of them together, you're only looking at 82 nodes, and you're comparing to systems that maybe have several hundreds to thousands of systems that you're running against. And so we were very proud of that. We're also very proud of our partners submitting along with us: Argonne and TACC got fourth and third on the full list, respectively. And then on the 10-node challenge, where it's the same number of clients, all three DAOS submissions took the top three spots. Once you start looking at similarly sized systems,
[00:30:34] we took all three spots, and we were over three times higher than the top non-DAOS submission. So we think this does a good job, in a way that's been really vetted by a third party and can be picked apart by your customers, of showing that you stand up and hold the performance claims you've been making. And the IO500, again, is more or less an HPC simulation of the file services that are required in that space, but it's multi-application oriented and that sort of thing. So it wasn't necessarily focused on just the history of HPC, where small files, like we said, are not necessarily that active, but
[00:31:20] if the benchmark was doing something like machine learning or something, and I'm not sure whether it does that or not, obviously the solution would be faster. Yeah, this is actually aiming to be more balanced. So it has portions of the score that go to different types of workloads: there's a portion of the score that goes to your traditional large streaming reads and writes, and then a portion of the score that goes to what they call IOR-hard and mdtest-hard, as well as doing things like find on your system. And that's where you're going to start seeing what is sort of called poorly behaved I/O: small file I/O, or maybe I/O that's not aligned to your blocks, things like
[00:32:05] that. And that's where the existing solutions tend to really fall down. You start to see the difference for some of these existing technologies between one of these more poorly aligned or small I/O benchmarks and the large streaming reads be something like a 9x difference. But with DAOS, it's much closer, because we don't have the same limitations from the underlying SSDs; the results are much closer to each other. And because the IO500 publishes all of the specific data for all the individual benchmarks, you can actually, and it's kind of cool, dive in there, do all the comparisons, look at the benchmarks that are closest to your workload, and see what might be the best technology. So if you know you're
[00:32:52] doing a lot of small file I/O or poorly aligned I/O, you can go look and compare them in there. And that's when those differences start being even more dramatic; you start seeing 13, 14, 40x differences between these different entries. But the numbers I said before were about the overall score, so we really can still hold our own in the large streaming reads and writes as well. Absolutely. Yeah, yeah, yeah, yeah. I report on benchmarks in my monthly electronic newsletter,
[00:33:23] plug for my newsletter, and I haven't done anything on IO500 yet, but I'm planning to start. So let's look forward to seeing something there. As far as this goes, it seems like Lustre and its comparable file systems are fairly serious long-term endeavors. Lustre has been around for, I don't know, I want to say a 10-to-20-years kind of number. GPFS, which is the precursor to Spectrum Scale, is probably the same sort of timeframe. DAOS is coming out within, I don't know, four or five years of when development started?
[00:34:07] Or it's not as mature, I guess, is the kindest thing I could say. Right. So everyone has to start somewhere. But no, you're definitely right. Lustre's actually crossed its 20-year anniversary now; it's more than 20 years old. And a lot of our organization came from the Lustre group originally. And we have a little saying amongst ourselves that,
[00:34:32] you know, to create any new file system and have it completely stable takes about 10 years. And that's based on our experience from having helped build Lustre. But DAOS has actually been around a bit longer than that: we started DAOS back in 2012, so it's already been over eight years since we started. And we're looking forward to our first production version in Q1 of this upcoming year, 2021. So it's definitely newer, but we've put a lot of thought and time into validating this and building it out to be a stable system, not trying to rush it out to the market as soon as possible. Right, right, right. A lot of startups and such will take significantly less time than that to deploy a solution.
[00:35:26] But I don't know if they play in that HPC space as much as in more enterprise-level kinds of workloads. Well, a lot of times startups have to do that, right? They have to show some progress to secure more funding. One of the benefits of being part of Intel is that they can look to a longer-term horizon for their investments. About how many engineers are working on the project today? I have about 50 engineers on my team working on this currently.
[00:35:53] Then there are a number of people across other organizations within Intel that are contributing in some way or another on the enablement side. And for the Optane persistent memory, there are various generations of that that exist and are planned and stuff like that. So you're probably already working on support for, I'll call it, the next gen of persistent memory. Yes, definitely. So what's out there today, if someone wants to try it today, they can use the first generation. But the second generation, of course, we're already working on, and we plan to go further out with that. And we're engaging with those organizations to make sure what we're putting out together coordinates well and builds a stronger solution together. Right, right, right. And from a marketing perspective, you mentioned a couple of partners. I guess Argonne's a partner. You mentioned another one that did another submission.
[00:36:55] I can't recall what that was on the IO500. That was TACC, the Texas Advanced Computing Center. Okay. And do you have actual solution partners out there that are pushing DAOS as well into other environments? I guess I'd call these channel kinds of things, right? Yeah, we're working with a number of partners. I can't necessarily disclose all of them, but Lenovo disclosed at the DAOS User Group last year that they're working on an integrated solution for DAOS. And actually, we did a fireside chat at ISC about that
[00:37:32] as well. Virtually, of course. Yeah, of course. I guess I wasn't familiar with ISC. I've been to SC19 and a couple of the other ones prior to that, but I hadn't been to the international version. Is that typically in Europe someplace? Yeah, it's usually in Frankfurt, in Germany. You should really come. It's fun. Yeah, it's a good travel trip. Yeah, when the virus is over and all that stuff. I know, right? That's interesting. And so talk a little bit about this. You mentioned erasure coding is coming. You've got replication today, and you support two to three replicas of the data. Is that a correct statement? Actually, it's N-way replication.
[00:38:12] So it's configurable for you as the end user, how much you want to replicate. You could actually make it replicate to every node in your cluster, which may sound crazy: why would you ever need that much protection? But that can actually come into play from an availability aspect as well. If you look at an artificial intelligence sort of job, where they may need all their clients to read the same data set at the same time, if you have that data on only a couple of nodes, you're going to get a huge hotspot in your cluster and it's going to bottleneck there. But with something like that, you could take that data set, replicate it across all of your storage nodes, and then feed it out to your job, and that load gets spread across
[00:38:52] your cluster. And something that's also kind of cool with DAOS is that how much replication you want is actually selectable down to a per-object basis. We're an object store, so per object you can decide how much replication you want. So if your application has some data that is just throwaway, and you know you're going to throw it away if you need to go back to the last checkpoint or restart your application, you can go ahead and mark that as not being replicated at all, so you don't have any extra resources in your system being consumed where it wouldn't be useful. But then, at the same time, you can have your other pieces of data, the ones you do need to keep around for your checkpoint or to restart your application, and protect those to any level you desire. So there's a lot of flexibility in that compared to some of the more hardware-based data protection systems.
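A small sketch of what per-object protection levels look like in practice; the placement function and node names are invented for illustration:

```python
# Per-object data protection, modeled: each object carries its own replica
# count, so scratch data can go unreplicated while a checkpoint is widely
# replicated (even to every node, for read availability on hot data sets).
import hashlib

NODES = [f"node{i}" for i in range(8)]

def placement(obj_key: str, replicas: int):
    """Deterministically pick `replicas` distinct nodes for an object."""
    start = int(hashlib.sha256(obj_key.encode()).hexdigest(), 16) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(replicas)]

print(placement("scratch/tile-042", replicas=1))        # throwaway data
print(placement("checkpoint/step-100", replicas=3))     # must survive failures
print(placement("dataset/weather-2020", len(NODES)))    # hot, read-everywhere
```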
[00:39:40] Yeah, yeah, yeah. When you said per object, I'm thinking, are we talking about a shard? I think you're talking about a file. It could be a file, depending on how you use your object store and what middleware. It could be a file. It can even be smaller than that. It can be a key-value pair. That's very unusual. I've never heard of anybody being able to specify the replication of a key-value pair.
[00:40:06] I guess if the value was sufficiently large, that would make a lot of sense. I mean, most realistically, people are going to use it for groups of data, not individual key-value pairs. But it has that flexibility, since it's not stored inside DAOS as a file, right? Because we're an object store, we have a different sort of structure inside, and it gives us that flexibility down to whatever level of granularity you want to use. So a particular key-value could potentially be a separate object in your object store? Well, I sort of use those interchangeably. The way our objects work, they effectively are
[00:40:43] key-values, right? And there are a few different layers of structure, so let me start top down. At the top layer you have what we call pools. These are actually similar in concept to the ZFS zpool, for Matt there. But that's going to be more like your storage allocation; that often is going to correlate to your project, right? That's how much space this user is allowed to consume. So you have your pool allocation. Inside your pool, you have containers. Containers generally correlate to a data set; it's one run
[00:41:17] of your application from beginning to end. One job, and all the data that's generated in it, usually goes into one container. Inside that container, you have objects, and those are key-value pairs. But what makes the DAOS interface interesting compared to something like S3 is that our values can actually be additional keys, or arrays of keys, not just a blob. So that allows you to make basically any hierarchical structure you could desire. And that's how we're able to do things like build support for HDF5 and POSIX: we actually build containers whose structure mimics those file structures in that hierarchy, and then we can provide an appropriate interface out. But that's how it gets stored inside of DAOS. So that gives us huge flexibility.
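Since values can themselves hold keys, a plain-Python model of the pool/container/object hierarchy she walks through might look like this (a conceptual model only, not the DAOS data structures):

```python
# Pools hold containers, containers hold key-value objects, and values can
# themselves be keys or arrays -- which is what lets a POSIX- or HDF5-shaped
# tree be expressed on top. Conceptual model only.

pools = {                                 # pool: the storage allocation
    "project-weather": {                  # container: one dataset / job run
        "run-2020-09-11": {
            "root-object": {              # object: key-value pairs
                "inputs": {"mesh.dat": b"...", "params.yaml": b"..."},
                "outputs": {
                    "step-0001": [0.1, 0.2, 0.3],   # values can be arrays
                    "step-0002": [0.2, 0.3, 0.5],   # ... or nested keys
                },
            }
        }
    }
}

# Resolving a POSIX-style path is just walking nested keys:
obj = pools["project-weather"]["run-2020-09-11"]["root-object"]
print(obj["outputs"]["step-0001"])
```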
[00:42:01] It's almost like a native Python solution here: key-values embedded in key-values, et cetera, and arrays. It's pretty impressive, actually. Well, that's really what we wanted to do, right? People write their applications that way, and instead of having to think about it getting serialized out to disk, we want applications to be able to just store their data in the way the application structures it. And we can go ahead and take care of getting that to the media most efficiently for you. Well, this has been great. Thank you very much, Kelsey, for being on our show today. Anything else you'd like to say to our listening audience before we close? Happy to be here. And if anybody wants to follow up and get more information, we do have a community website at daos.io. That's it for now. Bye, Kelsey. Bye. Thank you.
[00:42:52] Until next time. Thank you. Bye-bye.
