Grey Beards on Systems - 93: GreyBeards talk HPC storage with Larry Jones, Dir. Storage Prod. Mngmt. and Mark Wiertalla, Dir. Storage Prod. Mkt., at Cray, an HPE Enterprise Company
Episode Date: November 12, 2019
Supercomputing Conference 2019 (SC19) is coming to Denver next week and, in anticipation of that show, we thought it would be good to talk with an HPC storage group. We contacted HPE and, given their recent acquisition of Cray, they offered up Larry and Mark to talk about their new ClusterStor E1000 storage system.
Transcript
Hey everybody, Ray Lucchesi here with Matt Leib.
Welcome to the next episode of the Greybeards on Storage podcast,
the show where we get Greybeards storage bloggers to talk with
system vendors to discuss upcoming products,
technologies, and trends affecting the data center today. This Greybeards on Storage episode was recorded on November 5th, 2019. We have with us here today Larry Jones, Director of Storage Product Management, and Mark Wiertalla, Director of Storage Product Marketing, at Cray, an HPE company. So Larry and Mark, why don't you tell us a little bit about yourselves and what's new with Cray ClusterStor?
Thanks, Ray. We have just introduced a brand new storage system
tailor-made for HPC and AI. The world will first see it at Supercomputing in Denver coming up here
in a couple of weeks. So we invite you all to come by the Cray or HPE booth to see the latest and greatest storage system. And I said HPC and AI. And really,
it was very much the combination of those two disciplines, which has caused us to rethink
our storage architectures for this new exascale era that we're entering. What we're seeing is that
there's a load of folks who in their HPC workflows are starting to use machine learning and deep
learning as part of the process of solving particular problems. For instance, we've seen it in a lot of the computational
fluid dynamics, in some of the seismic processing workflows. You would think that, you know,
computational fluid dynamics would not be a deep learning type of application. Yeah, it's only in
some portions of it that it makes sense. So as they're trying to really understand how certain metrics are derived,
the sort of pattern matching expertise of a deep learning system can help them kind of define some of the starting,
kind of the entry metrics for doing the computations.
And it's very similar in kind of seismic processing and other disciplines as well.
So, Mark, what's new in your side of things?
My role at Cray is more in the software side of things.
This new ClusterStor E1000 that Larry will tell you
more about in just a few moments,
comes with a suite of software.
Practical examples would be the parallel file system
that runs on top of it.
That's Lustre.
And that's a significant investment
by Cray to build out features, harden this thing at scale, test it, put support around it.
And there's a whole story, you know, just on talking about Lustre itself.
Larry's going to talk to you about Flash and disk and tiering in just a moment. And the engine behind all of that cool technology is a software suite called ClusterStor Data Services.
And that's the thing that, frankly, makes it usable, allows an administrator or a user to simply set it and let it go and not have to manually move data, track it.
It's all automatically handled by the software in the system itself.
So those would be two big components.
Lustre's been around, seems like, forever, right?
I mean, it seems like even HPE had products in that space before the acquisition, correct?
Yeah, Lustre's a couple of generations in at this point in time,
but it's very established in the top 500 high-performance computing centers
around the world, especially in the top 100 itself. And it does the type of work that
other file systems just can't match. They can't meet the scale, they can't meet the bandwidth performance, and soon we're going to deliver IOPS performance out of Lustre of a kind that high performance computing centers around the world can only get from a really robust parallel file system that's mature, that's reliable, that's robust, and has a full set of features behind it.

Is it still relying on InfiniBand?

So that's actually an interesting area. We will fully support InfiniBand.
In fact, both EDR and HDR, the latest generations of InfiniBand.
But Cray is also introducing their own interconnect, which is called Slingshot. So Slingshot is an Ethernet-based system
with a lot of enhancements for handling HPC kinds of problems,
things like being able to pass small messages,
MPI messages at extremely high rates.
And importantly for storage, it does a couple of things.
For the storage system, it has some great congestion management features. We have a problem in storage where you have lots and lots of compute nodes all trying to read and write to storage at the same time, which can cause congestion. Because of Slingshot, we can eliminate or minimize that congestion
and make both the compute side and the storage side work in much better harmony.
So why did you guys decide you needed to do a new version of storage like this?
That's a great question. And it's really around that convergence of
HPC and AI, which some of the analysts now, like our friends at Intersect 360, are saying more than
half the HPC sites are running AI with their HPC. And so typically, up to now, for a machine learning application you've got one storage infrastructure for AI and an entirely different one for HPC. The AI one can be all SSD, very focused on IOPS. The HPC one is racks and racks of disks, mainly focused on bandwidth. What we're doing with the E1000 is converging the two, so you get the IOPS performance out of the SSDs and you get the cost advantages and sequential performance out of the more traditional HPC infrastructure dominated by disk drives. And that's what Mark was referring to earlier with his data services.
It seemed to me Lustre was always kind of tied to high bandwidth types of workloads.
To move to more of an IOPS-level workload seems like a dramatic change.
It's absolutely a shift. But one of the reasons Lustre still makes sense is that we're also being driven to move into exascale.
We're moving beyond what we always thought was huge into exascale, which requires storage systems that have levels of performance that you really just can't economically build with disk drives any longer.
They're big, they're bulky, they frankly cost a lot in power.
And so this is where Flash comes in.
So Flash also serves the purpose of keeping the storage scalable, meeting the performance requirements, but also keeping it economical for customers at the very, very high end.
You know, once we achieve that, of course, it becomes consumable for everybody else.
That may not be a DOE lab or a supercomputing center.
So, I mean, from an AI perspective, it's trying to keep, I don't know what the terminology would be for HPC, but in the normal world, it would be GPUs busy and stuff like that.
And these GPUs have got thousands of cores and that sort of stuff.
And in an HPC compute cluster, there are thousands and thousands of cores of just compute stuff as well, right?
Exactly. And so really the secret that Mark was referring to of the E1000 system is that we're delivering,
I think it's the first NVMe generation 4 storage controller, certainly in the HPC market and perhaps elsewhere as well. But what Gen 4 does, as opposed to most of the current flash systems
that you see out in the market,
is it's capable of twice the performance of a Gen 3 system.
And so where does that matter?
As you're pointing out, I need to deliver a lot of IOPS to a lot of GPUs. And essentially,
that's, you know, if you're going to do training on a deep learning app, you want to be able to
go through a very big training set, pull out just the little bits and pieces that need to be trained
on. You have to do that randomly in order to get the proper results.
And so it requires a much higher level of small random I/O than you typically see in an HPC workload.
So that's really what we're trying to do is converge both of those things,
a little bit of sequential and a lot of random I/O.
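To make that point concrete, here's a minimal Python sketch, purely illustrative and not tied to Cray's stack or any Lustre API, of why shuffled training over a large dataset turns into many small reads at random offsets; the record size, record count, and file name are assumptions for the example.

```python
# Illustrative only: why deep-learning training turns into small random I/O.
# A training set is stored as fixed-size records; each epoch visits them in a
# shuffled order, so storage sees many small reads at random offsets rather
# than one long sequential stream.
import os
import random

RECORD_SIZE = 4096          # bytes per training sample (assumed for the sketch)
NUM_RECORDS = 10_000        # a stand-in for a much larger training set
BATCH_SIZE = 32

def write_dummy_dataset(path: str) -> None:
    """Create a file of NUM_RECORDS fixed-size records."""
    with open(path, "wb") as f:
        f.write(os.urandom(RECORD_SIZE * NUM_RECORDS))

def training_epoch(path: str):
    """Yield mini-batches by seeking to random record offsets."""
    order = list(range(NUM_RECORDS))
    random.shuffle(order)                     # random visit order each epoch
    with open(path, "rb") as f:
        for i in range(0, NUM_RECORDS, BATCH_SIZE):
            batch = []
            for idx in order[i:i + BATCH_SIZE]:
                f.seek(idx * RECORD_SIZE)     # random offset -> small random read
                batch.append(f.read(RECORD_SIZE))
            yield batch

if __name__ == "__main__":
    write_dummy_dataset("trainset.bin")
    n_batches = sum(1 for _ in training_epoch("trainset.bin"))
    print(f"issued {n_batches} mini-batches of {BATCH_SIZE} random {RECORD_SIZE}-byte reads")
```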
You said NVMe Gen 4.
You're really talking PCIe Gen 4 with NVMe on top of it,
I guess. Is that how I should characterize it?
That's correct. And the Gen 4 NVMe drives, which are just coming on the market, are the SSDs that deliver their throughput through a PCIe Gen 4 and NVMe Gen 4 configuration.
Okay. And does that mean that you guys are supporting NVMe over Fabric as the interconnect protocol?
But you're really a file system, right?
Yeah. NVMe over Fabric is typically focused on kind of block environments at this point in time.
And really what we're focused on is delivering that kind of high local throughput.
So we may do NVMe over Fabric for some applications in the future,
but this is really an NVMe Gen 4 controller.
I mean, this thing is capable, just this one box, never mind racks of boxes,
just one 2U24 box is capable of up to 80 gigabytes per second of read performance,
60 gigabytes per second of write performance.
And just to give you a sense, that would be more than a rack full of...
Yeah, I don't think I've ever seen a disk drive with 80 gigabytes per second of read performance, quite frankly.
Right, that's because it's running in parallel.
If you had quantum shift, I would agree.
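For a rough sense of why PCIe Gen 4 matters here, the back-of-envelope arithmetic below uses commonly cited approximate per-lane PCIe throughput figures (assumptions, not vendor specifications) to compare what 24 x4 NVMe drives could feed a controller on Gen 3 versus Gen 4.

```python
# Back-of-envelope arithmetic (assumed per-lane rates, not vendor numbers) for
# why PCIe Gen 4 matters in a 24-drive NVMe enclosure.
GEN3_GBPS_PER_LANE = 0.985   # ~8 GT/s with 128b/130b encoding
GEN4_GBPS_PER_LANE = 1.969   # ~16 GT/s, roughly double Gen 3

LANES_PER_NVME_DRIVE = 4
DRIVES_PER_2U24 = 24

for name, per_lane in (("Gen 3", GEN3_GBPS_PER_LANE), ("Gen 4", GEN4_GBPS_PER_LANE)):
    per_drive = per_lane * LANES_PER_NVME_DRIVE
    raw_aggregate = per_drive * DRIVES_PER_2U24
    print(f"PCIe {name}: ~{per_drive:.1f} GB/s per x4 drive, "
          f"~{raw_aggregate:.0f} GB/s raw across 24 drives")

# The quoted 80 GB/s read / 60 GB/s write for one 2U24 box sits comfortably
# under the raw Gen 4 drive aggregate, i.e. the drives stop being the bottleneck.
```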
And the marketplaces that you're planning on going into,
I would imagine that seismology is a big one.
The financial markets, biomedical engineering,
the real hitters for high performance,
right? Sure, that's right. Where Cray has typically focused their efforts there in the market has been
with government labs, defense, intelligence, certainly in weather as a big market.
Also the oil and gas and other energy environments.
Those have all been big sort of traditional Cray markets.
But now we're joining HPE.
So as of last month, we're now merged into HPE, and they have obviously many, many more salespeople and a much wider set of targets that they work at.
So to your point about financial services, biotech, many more of the kind of commercial markets. Another one that we talked about earlier is aerospace. Those are areas that Cray hasn't been as focused on, but are big, important markets for our friends at HPE.

The metadata services for something like this have got to be fairly sophisticated, I'm guessing.

It is. And it also needs to be able to scale in order to meet the exascale requirements around the world. So part of the improvements made in the Lustre file system itself is to build out the metadata system to support high-performance devices, but also to support more inodes in a file system, to be able to parallelize the metadata operations across multiple servers, and to be able to build out tiers of directories within the file system as well, to make it more efficient for the system to function.
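As a conceptual sketch of what parallelizing metadata across multiple servers means, and emphatically not Lustre's actual distributed-namespace implementation, the snippet below hashes each directory entry to one of several metadata servers so that creates and lookups spread across them; the server count and hashing scheme are assumptions for illustration.

```python
# Conceptual sketch only -- not Lustre's actual distributed-namespace code.
# Idea: hash each directory entry to one of several metadata servers so that
# metadata operations (create, lookup, unlink) can proceed in parallel.
import hashlib
from collections import Counter

NUM_MDS = 4   # assumed number of metadata servers for the sketch

def mds_for(parent_dir: str, name: str) -> int:
    """Pick a metadata server for a directory entry by hashing its full path."""
    digest = hashlib.sha1(f"{parent_dir}/{name}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_MDS

if __name__ == "__main__":
    counts = Counter(mds_for("/scratch/run42", f"file_{i:06d}") for i in range(100_000))
    # Entries land roughly evenly, so metadata load is shared across servers.
    print(dict(sorted(counts.items())))
```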
I kind of danced around a question.
I didn't really answer it succinctly a little while ago, Ray,
but you asked about IOPS and Lustre, and it is a significant change.
To that effect, Cray has invested a lot in putting features that optimize smaller IOs in the Lustre file system,
specifically to support the benefit of having a flash device.
It's not just Cray, it's the rest of the Lustre community as well. So these enhancements are just now coming into the
current Lustre releases like Lustre 2.12, which is the long-term support release from the community.
And there are many more that are being applied to the master branch going forward. So you'll
expect to see a change in Lustre as it's repositioned from being exclusive to high-bandwidth types of environments to now meet not just exascale, but the hybridization of AI and small IOPS.

Not to be a knock on Lustre, but a lot of the high-performance
computing storage environments were fairly difficult to configure and use. I mean, it required
a lot of knowledge and that sort of stuff. Have you made any inroads in that area as well?
Absolutely. You know, one of the challenges, of course, is Lustre is very flexible.
So anytime something's flexible, it requires a little bit more knowledge in how to exploit the flexibility. But, you know, what we do, our business model literally is to be
able to take Lustre, make it easier to use for end users and the administrators so that they
don't have to literally perform a science experiment to install Lustre. I'll say we almost make it appliance-like.
We install it, we configure it, set up the object storage servers,
the targets behind them, same for the metadata servers.
So when ClusterStor shows up on a customer's site,
in essence, the administrator's task is already half over.
And then we'll talk a little bit more about making it easier to use through ClusterStor Data Services, where we'll start to take many tools that the administrators use to, frankly, manage
and maintain Lustre. And we allow the administrator to now start using one interface and one set
of tools that have been designed and have been integrated with the solution by Cray
itself. It'll also have an impact upon end users who maybe have been forced to use Lustre
commands that work inefficiently. And in the new age with the E1000 and ClusterStor Data Services,
the end users will be able to literally use this new software to be able to do things like search
and find on their files and not have to go wait a day for the results to come back.
But we're talking exascale types of file dimensions or at least numbers of files in the billions, I'm guessing, things like that.
Is that?
Yeah, yes, absolutely, in the billions.
Yes, some of our early customers are some of the very first, in fact, all of the very first, U.S. exascale systems. The first one we will ship the system to, at least at scale, and we'll ship some smaller systems before that, will be Argonne National Labs.
So they're going to get a 400 petabyte system that will deliver over two and a half terabytes a second.
And that will support a new supercomputer called Aurora that will be delivered in 2021.
So we're delivering the storage system, which will actually be a site-wide storage system connecting to other supercomputers like Theta.
But it will happen in 2020 with the new system coming in 2021.
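A little back-of-envelope arithmetic, using only figures mentioned in this conversation and not Argonne's actual configuration, gives a feel for the scale involved.

```python
# Rough arithmetic from figures quoted in the conversation -- illustrative only,
# not Argonne's actual configuration.
SYSTEM_CAPACITY_PB = 400               # site-wide capacity quoted for the Aurora storage
SYSTEM_BANDWIDTH_TBS = 2.5             # "over two and a half terabytes a second"

USABLE_PB_PER_4U_DISK_ENCLOSURE = 1.2  # "over 1.2 PB usable in 4U" with this year's drives
READ_GBS_PER_2U24_FLASH = 80           # "up to 80 GB/s" per 2U24 flash unit

disk_enclosures = SYSTEM_CAPACITY_PB / USABLE_PB_PER_4U_DISK_ENCLOSURE
flash_units_for_bw = SYSTEM_BANDWIDTH_TBS * 1000 / READ_GBS_PER_2U24_FLASH

print(f"~{disk_enclosures:.0f} four-U disk enclosures to reach {SYSTEM_CAPACITY_PB} PB")
print(f"~{flash_units_for_bw:.0f} two-U flash units if all 2.5 TB/s came from flash alone")
```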
So for a 400 petabyte file system kind of thing,
how many files you think that's gonna represent?
Typically the requirements for these large systems are in the billions of inodes.
Billions of inodes. Now an inode would be a directory or a file, right?
Yeah, could be a directory or a file.
And it's,
yeah, so they'll be in the billions
and each one is going to be specified independently.
But as Mark said earlier,
a lot of work has been done by the Lustre community
and by Cray actually
to develop a scalable metadata solution
and all of these very large systems have this as a requirement. In fact,
we're delivering the first four awarded supercomputers and all of them have chosen
this new ClusterStor E1000 system. And so we're already looking at a backlog of over 1.3 exabytes of storage
that we sold and will be delivering over the next couple of years as these supercomputers roll out.
It's a nice place to be, actually.
Lots of talk about Flash and stuff like that. Are you using the Flash in the metadata side of things as well as the data?
And are you doing tiering and that sort of stuff?
I mean, there's lots of questions in my mind about how you're able to do the high level of IOPS as well as bandwidth, right?
Right.
The storage nuts and bolts.
Deduplication, even.
That's just one of the things.
One thing we haven't mentioned before I let Mark explain more about the tiering
is the other half of the story.
So there are two principal building blocks here.
We talked about the 2U24 flash system,
but the other one is also important here,
which is the disk array.
That enclosure will support in four rack units,
106 drives. So you're looking at being able to put, this year, over 1.2 petabytes of usable storage in a single 4U box. You know, if you fill up a rack of those things you're talking
about nearly 10 petabytes in a single rack. And by the time this rolls out in 2021 and 2022 to some of these larger systems, we'll be seeing 20 terabyte drives and over a petabyte and a half in a single one of these enclosures. So it's really those two building blocks going together, the flash and the disk, that allow us to build these storage systems that can serve these exascale supercomputers.
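Those capacity claims roughly check out with simple arithmetic; the drive sizes, parity overhead, and enclosures-per-rack below are assumptions for the sketch, not the actual ClusterStor layout.

```python
# Sanity-check of the capacity claims with assumed drive sizes and parity
# overhead (the real declustered-RAID layout may differ).
DRIVES_PER_4U = 106

def usable_pb(drive_tb: float, parity_overhead: float = 0.20) -> float:
    """Usable PB per 4U enclosure, assuming ~20% of raw capacity goes to parity."""
    raw_pb = DRIVES_PER_4U * drive_tb / 1000
    return raw_pb * (1 - parity_overhead)

print(f"16 TB drives: ~{usable_pb(16):.2f} PB usable per 4U")   # matches "over 1.2 PB"
print(f"20 TB drives: ~{usable_pb(20):.2f} PB usable per 4U")   # ~"a petabyte and a half"
print(f"8 enclosures per rack @ 16 TB: ~{8 * usable_pb(16):.1f} PB")  # ~"nearly 10 PB a rack"
```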
And in storage terms,
there's no place in this architecture for spinning disk.
Am I correct?
Oh, yeah, that's what I just mentioned. So that 4U-high enclosure has 106 spinning disk drives in it.
Wow, okay. I assumed that was all NVMe.
And they really provide the capacity. So if you think about
flash delivering the performance, as Mark was saying earlier, the capacity is just way too expensive to deliver in flash these days. So you need to
complement the flash with that disk. And so if you can get a petabyte or so of pretty inexpensive,
I mean, a disk system is well less than a penny per gigabyte stored. And so you're now getting a combination of flash plus disk. But as Mark said earlier, the magic is how do you make those into a single system, so that the user doesn't have to put this here, get this there, that sort of thing.

One more question, and that's the location of the solid state in this equation.
Is that going to be primarily used as a caching mechanism?
It's going to depend, I think, on the user's requirements.
Some of the data that they keep is actually cache. Think about it as application
level cache. But what we're doing actually is just building a file system. So this is having
files either in Flash, and typically you're going to want the files that you're actually executing
against in the Flash systems so you get the best performance both read and write.
And then when you're done with that, you need to get them out of the flash so that you can run the next job.
So I guess this question is for Mark.
Is that auto-tiering?
Does that look at the metadata and say, well, this file hasn't been accessed in X amount of hours, minutes, days, weeks, who knows, and automatically move that stuff?
The cache question was actually interesting, intriguing.
If you don't mind, I'd like to go back there for just a moment.
And that's kind of the way the market's consumed Flash so far. It's like we know it has appeal for performance's sake, but customers don't want to manage more than just a few things. What we learned a couple of years ago with another Flash-based product was that when either administrators or maybe end users, the researchers and scientists themselves, had to manage Flash as a separate entity from the parallel file system sitting on disk, it overwhelmed them. It became too much.
So the idea of using Flash storage as a cache naturally evolves from
that. If you can manage a single parallel file system without having to use a second thing
and just get the flash storage to be a cache, I get the benefit of having performance from that
technology and not having to manage more than
just the parallel file system. Now, there's also consequences. Is there a way to really guarantee
that cache is going to sustainably deliver performance? No, you can't say that about
cache. Cache doesn't really understand the application. It just understands data's moving, you know, back and forth.
It's reacting to the real demand that's occurring, right?
That sort of thing.
Yes.
And, you know, you can only predict so far into the future given any application.
So what we're going to do with ClusterStore and the E1000 is offer customers the option to, if you want to
configure this as a, we'll call it a transparent cache, you can do that. And then ClusterStor
will intuitively, it'll know that, well, maybe for optimized performance, I just want to read
straight from the disk if that's where the data is and write it back to flash. Maybe reread it from flash. Maybe keep it in flash for other steps in the workflow. And then turn it over to the
policies that the administrator has defined to determine how long do I keep it there before I
move it back down to disk. I want to optimize the use of flash. So I've got to keep some flash
available for the next job coming in. We all understand that, but there's another use case, and it's for those, I'll call them power users. It's the scientists, the researchers who understand what their application is doing and how best they can use the file system to optimize performance. In that case, you know, we're going to give the user tools to be able to say,
either through scripts or through the workload manager,
put my data in Flash, leave it there.
And leave it there until they tell you I'm done with it.
And I may not be done with it on this job step.
Maybe I'm going to keep it there for a day or perhaps even longer.
So there's a couple
of tools that we'll give to end users to be able to say, why don't you tell me what it is that you
want to do? And ClusterStor then automates that process. And these are lessons we've learned, where the technologies that are coming out of the former Cray DataWarp product will serve us well in
Lustre. So here we are. We've got one thing to manage. It's called Lustre, as opposed to different file systems. It is giving the user the choice of, you know,
make it easy for me. Just give me, you know, the performance out of Flash, or it gives them the
option. We talked about flexibility and complexity a while ago, right? Well, you know, some people
can handle that. Some people want that option. And in that case, we can give them the tools to say,
you can manage it, and if you want it placed in Flash, we'll do that for you and we'll automate
the process.

Back to the metadata side of things here. But I mean, does this sort of
caching or power use kind of thing also apply to the metadata as well as the data?
Yeah, kind of. We're working on a feature. It won't be in the first release, but we're working on a feature that would, in essence, allow an end user to spin up their own file system, if you will.
And that gives them the privileges to a reserved metadata system as well. So they can, in essence, have a kind of a guaranteed quality of service that, you know, would lock out competing
applications. So, you know, that's something on the drawing board and on the roadmap going forward.
We'll get there. Sort of like multi-tenancy, as I would call it, kind of thing. Is that?
I'd say more file system on demand or how about dynamic file system? That's interesting.
But that would be absolutely a metadata play.
Okay, okay. You keep talking about parallel file systems now.
In the old days, you know, parallel file systems would entail your own sort of API and host
level functionality.
But later versions of NFS have supported parallel file systems.
I mean, when you're talking about Lustre parallel file system, does that involve a
Lustre client? Is that the right word?
Yeah, I guess client is the right word. Sure, yes. And that would be one of the things that
the customers would look to Cray to, frankly, deliver simplicity for them. We would put the
Lustre client on the Cray, or soon to be HPE, servers and install that in the factory before delivering to the customer, so that the administrator wouldn't have to do that.
Okay, and that provides the parallel file system access to something.
You're not necessarily using NFS version 4.2 or 3 or something like that to do that.
Correct. It's a native client that talks to the file system.

I mean, in file systems, do you guys support things like, God forbid, snapshots and deduplication and, I don't know, replication, compression, security, encryption?
Those are like five different things I just mentioned for file systems.
So there's some of the things that we're doing immediately and some of the things in the longer run.
So in terms of some of the markets that I mentioned earlier, some of the customers there are very interested in strong security, as you can imagine.
So in particular, their interest is encryption at rest.
So this is something that we've supported in the previous generation of the product and will be supported in this generation as well.
So that in the event that a drive is removed from the system, it will be missing the key needed to unlock the data that's actually resident on that drive. So it becomes very secure for some, particularly some of our government customers. And is that done at
the drive level by the drive itself, or is it done through the E1000 control logic kinds of things?
Both, actually, but the actual encryption is done at the drive. And because that encryption is distributed to all of the drives themselves, whether they're flash or disk, it doesn't impact performance to any significant degree.
So you can deliver both the performance as well as the security that they require. Some of the other features you
were mentioning bring up another topic, which is some of the evolution of Lustre,
particularly in what's called the backing store, the store that's actually used to put the data down on the disks or the flash. Typically that
has been a Linux file system known as ext4 modified for Lustre and called LDiskFS.
But now there's a new champion on the horizon.
So ZFS, which has been around Oracle and Sun and other places for years now,
has been pressed into service for use with Lustre.
And that is something that we'll support also in the new E1000 ClusterStor system. So the customer will have the choice of either having the original LDiskFS system or the new ZFS system.
And what ZFS offers is the potential
to do things like snapshots and compression.
It would be a little bit hard to do dedupe at the kind of scale we're talking about.
That is mentioned as a joke, by the way.
I'm a big fan of ZFS.
I've worked with it for many years, and I can see the benefit there.
And I think you can probably tell by some of the questions I asked that I'm sort of bent that direction.
And many, many customers are with you on that one.
So, you know, when I talk about read and write cache in the SSD layer,
that's definitely from my ZFS days. Ah, yes.
So we'll be using ZFS to perform some of those functions and Mark can describe some
of them with some of the Lustre capabilities. We won't do that straight out of the gate here,
but some of the experiments that we've done to date show that compression can deliver, even with scientific data, which in some ways is often pre-compressed, 30%, in some cases 50%, improvements in capacity for the user. So one of the things that we'll offer is the ability to run LDiskFS, for instance, on the flash side and run ZFS on the disk side. So with ZFS you might be able to use the compression, and LDiskFS, because it's actually a simpler file system, we can get to run considerably faster on the flash technology.
So the combination of both may turn out to be quite useful for some customers.
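As simple arithmetic on the quoted numbers, a 30 to 50 percent capacity improvement from compression on the disk side stretches the roughly 1.2 PB usable per 4U enclosure mentioned earlier as follows (illustrative only).

```python
# Illustrative arithmetic: effect of the quoted 30-50% capacity improvement
# from compression on one 4U disk enclosure (~1.2 PB usable, per the discussion).
USABLE_PB_PER_4U = 1.2

for improvement in (0.30, 0.50):
    effective_pb = USABLE_PB_PER_4U * (1 + improvement)
    ratio = 1 + improvement            # equivalent average compression ratio, e.g. 1.3:1
    print(f"{int(improvement * 100)}% improvement -> ~{effective_pb:.2f} PB effective "
          f"(average compression ratio ~{ratio:.1f}:1)")
```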
That's interesting.
And so you're talking strictly the data layer,
but there's a metadata side of this as well.
And does it use LDiskFS as underlying backend storage?
Yes.
So it can use either LDiskFS or ZFS.
Again, LDiskFS is faster.
So typically we're going to advise customers to use the faster system.
But there are a lot of customers like yourself who are very familiar with ZFS and they like the attributes of ZFS.
So if they want ZFS metadata, they can have that too.
Way back, we started talking about AI and HPC and there are definitely different styles of workloads, but you're starting to see a lot of your customers doing AI in that space as well?
Yeah, it's really coming along as in almost every discipline we see, you know, there's a lot of researchers who, you know, have a new tool.
You know, they are used to having a hammer.
All of a sudden, somebody gave them a screwdriver and they're finding out, wait, hey, there's a lot of screws we can turn here in our workflow. And so they're finding ways to take advantage of this new tool as part of their HPC world. Now, HPE is strong in the kind of overall AI system for a lot of enterprise kinds of applications.
So it's not just HPC that the overall company is looking at, but what we're really interested in and what we think this product is super well suited for is the convergence of those two things. And so as we see, you know, 50% today and growing very fast,
the convergence of both this kind of machine,
particularly machine learning techniques,
into these workflows, we think this particular,
I mean, we can do 3 million read IOPS.
Like you can go find the brochures from a lot of the enterprise
all flash arrays, and they're kind of nowhere near that for a single 2U24 system. And that's
running Lustre. So, you know, we talked to our AI friends about this. They see, wow, that's a pretty
good system there. We'd like to see that,
even in areas like autonomous driving and that sort of thing, where you have the combination of
massive data sets that we can put on that very high-density disk, as well as huge IOPS requirements,
which can be delivered from this new 2U24 Gen 4 system.
And just to clarify, when you say something like the 2U24, the 4U106 drive,
that includes a controller for those drives as well as the basic storage?
Yeah, so let me describe that a little bit.
A basic building block is that 2U24.
When it's filled with 24 NVMe SSDs, you can use it as an all-flash array. It has two
servers built in that provide high availability. Those two servers are actually Rome servers, so they're based on the AMD Rome platform, which is how we get the very early PCIe Gen 4
capability into the system. And we can take that same box, we leave out most of the SSDs,
we attach it to that very high density disk enclosure, and we can use it as a disk enclosure because it still has the
same two servers built in. It still does high availability by failing over between those two
servers, and it serves both functions for us, as well as doing metadata, as we discussed earlier.
So it's really just giving different personalities to that controller allows us to build very complex systems, very flexible systems with only two building blocks that anyone has to sort through.
You don't have a specific, and I'm not sure if that's the right term, metadata server physical manifestation and a data server kind of side of things.
They're just all these building blocks that do both services?
Yeah.
Essentially, it's software-defined, to use a term.
Basically, you just load the personality that you need to have.
For instance, your metadata personality, load up the SSDs that the customer needs for their billion inodes or so, and you now have the metadata unit.
Then on the other hand, if I'm doing a big disk system, so one of the very first systems that I mentioned earlier, Argonne, will be primarily a disk-oriented system. So we're going to ship a lot of those 2U24 controllers,
but they're mainly going to be controlling that,
those disk enclosures.
So it's just what personality you add here.
And that really simplifies the life of the administrator
as well as the systems architect.
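A hypothetical sketch of the one-building-block, many-personalities idea follows; this is not Cray's management software, just an illustration of how one controller design can be configured as a metadata unit, an all-flash data unit, or a disk-backed data unit. All names and fields are invented for the example.

```python
# Hypothetical illustration of "one building block, different personalities".
# Not Cray's configuration software; names and fields are invented.
from dataclasses import dataclass
from typing import Literal

Personality = Literal["flash-data", "disk-data", "metadata"]

@dataclass
class BuildingBlock:
    """A 2U24 controller pair, optionally attached to 4U/106-drive enclosures."""
    name: str
    personality: Personality
    ssd_count: int = 0
    attached_disk_enclosures: int = 0

system = [
    BuildingBlock("mdt-0", "metadata",   ssd_count=12),               # serves the inodes
    BuildingBlock("ost-0", "flash-data", ssd_count=24),               # all-flash data unit
    BuildingBlock("ost-1", "disk-data",  attached_disk_enclosures=2), # capacity data unit
]

for block in system:
    print(f"{block.name}: {block.personality}, {block.ssd_count} SSDs, "
          f"{block.attached_disk_enclosures} disk enclosures attached")
```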
And I wanted to ask about ClusterStor. You talked about ClusterStor Data Services.
We haven't really talked about that too much.
Is there, besides the tiering, something else there, Mark, that you want to mention?
Yeah, I would actually.
You know, Larry talked about Argonne a while ago.
There are other exascale awards that, you know, Cray is working on, and they run the gamut from being all-disk systems to being all-flash systems to being hybrid systems that comprise flash and disk. So there's no one size fits all. And, you know, the ClusterStor Data Services value is most apparent when we
talk about tiering, because there's a movement of data from, you know, disk into flash and back
down to disk. But even in those environments where maybe it's all flash, or maybe it's all disk,
ClusterStor Data Services is adding technology that makes Lustre itself easier to use. We started to talk a little
while ago about the scalability challenges with billions of inodes. And you can imagine that
looking for a file or a directory within a file system that's that large, if you had to walk the file system sequentially, you know, a simple activity takes hours to do in a file system that size.
So we've actually used ClusterStor Data Services
to impart Cray IP into the Lustre system
to optimize how data is indexed, how it's searched,
and then provide interfaces to both the end user
and the administrator to, frankly,
to be able to use this product kind of seamlessly or uniquely.
So one of the key challenges has always been a user comes in and doesn't know where to find a file, and it takes hours because Lustre's walking through the metadata line by line to find the user's file. And that ties the system up, and an admin gets a phone call: why is my job running slow? It's slow because this user is doing an unnecessary find operation. So ClusterStor Data Services, even in a disk-only environment, has the ability to optimize the operation of Lustre itself.
And then there's going to be activities like my file system's filling up, so it's all on flash.
My file system's at 85% utilization.
I've got a big job that needs to start tomorrow.
How do I determine what data is not needed any longer?
So an administrator is going to say, well, I got a purge.
I don't necessarily need to move data.
I need to be able to trim the data out of the file system.
So ClusterStor Data Services would allow an administrator either to set policies to do this automagically or to be able to manually go in.
I'm looking for this kind of data.
Maybe it's more than a week old.
Quickly find that.
So that goes back to indexing and search.
And then be able to basically just say, I need to purge it now and get it ready for tomorrow's workload.
So ClusterStor Data Services is more than just tiering.
That's the end.
Right.
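The pattern Mark describes (keep a side index of file metadata so that a question like "what hasn't been touched in a week?" becomes an index query rather than a namespace walk, then purge or tier the results) can be sketched generically. This is not ClusterStor Data Services itself; the schema, paths, and the purge action are assumptions for illustration.

```python
# Generic sketch of indexed search plus policy-driven purge -- not ClusterStor
# Data Services, just the pattern it automates.
import os
import sqlite3
import time

def build_index(root: str, db_path: str = "fs_index.db") -> sqlite3.Connection:
    """Walk the tree once and record (path, size, mtime) in a small index."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, size INT, mtime REAL)")
    with con:
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                try:
                    st = os.stat(path)
                except OSError:
                    continue               # skip files that vanish or can't be stat'ed
                con.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
                            (path, st.st_size, st.st_mtime))
    return con

def purge_candidates(con: sqlite3.Connection, older_than_days: float):
    """Answer the age question from the index instead of re-walking the namespace."""
    cutoff = time.time() - older_than_days * 86400
    return [row[0] for row in
            con.execute("SELECT path FROM files WHERE mtime < ? ORDER BY mtime", (cutoff,))]

if __name__ == "__main__":
    con = build_index(".")
    for path in purge_candidates(con, older_than_days=7):
        print("would purge or tier down:", path)   # a real system would release/migrate here
```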
In a lot of cases, the HPC workload seems to be, you know, creating these
temporary files for a scratch workspace, if that's the right term, and then pretty much jettisoning
them at the end of that, whatever that workflow ends up to be. Is that kind of the world we're
talking about here?
Partly, yeah. You know, Lustre was kind of born as a scratch file system,
a high-performance scratch file system. But it's become, you know, over the years, it's become more than just scratch. It's meant now
to hold data for and protect data, that's most important, for extended periods of time. So,
you know, we still have all the obligations of providing the high performance, but, you know,
now we have to be a real file system. And that means protecting data, providing the tools to manage data.
That brings up a whole other discussion here.
We only have time for a little bit.
With 100 petabytes of storage, how do you protect something like that?
Do you replicate it or duplicate it?
You certainly don't fire up a backup job and take a couple thousand silos of tapes or something like that.
Well, I'll tease a bit. This would be my data management framework conversation
and how HPE's existing DMF7 would be able to look into the Lustre file system.
It would also have policies based upon age, size of the file, last time used, the user, the group. And it would be able to say, okay, rather than purge it, I want to take it and I want to put it someplace, and I want to make copies as I do that. And you really can't run backups on a file system that's petabytes in size.
You can't do it.
But DMF has the ability to not just optimize utilization of the Lustre file system, but also put data in zero-watt storage or object storage or tape and make copies, put it at a second site. And in doing the traditional
kind of archive workflow, it provides data protection services as well. So that would be
my pitch for DMF7 and the new age.
All right, we're going to have to end there. Matt, any last questions for Larry and Mark?
No, actually, that last point about data management services was top
of mind from the very beginning of the conversation.
Wanted to know, you know,
how we would manage the backups
of this massive amount of data.
And I think that the metadata tier
is mission critical
to finding those rarely used files
within the backup.
Be very interesting to see how this plays out.
I've really enjoyed the conversation.
Yeah, yeah.
Larry and Mark, anything you'd like to say to our listening audience before we close?
No, thank you for the opportunity.
And it was a great discussion.
I'd say that both Larry and I will be at Supercomputing in two weeks' time.
And both of us will be on and about the Cray booth on the Expo floor.
Okay, just a plug for Supercomputing.
It's in Denver again this year, so I will be attending.
Okay, so, well, this has been great.
Thank you very much, Larry and Mark, for being on our show today.
Thank you, Ray.
Thanks very much.
Next time, we'll talk to another system storage technology person.
Any questions you want us to ask, please let us know.
And if you enjoy our podcast, tell your friends about it.
Please review us on iTunes,
Google Play,
and Spotify,
as this will help get the word out.
That's it for now.
Bye, Matt.
Bye, Ray.
And bye, Larry and Mark.
Bye, Ray.
Until next time.
Thank you.