Grey Beards on Systems - 139: GreyBeards talk HPC file systems with Marc-André Vef and Alberto Miranda of GekkoFS

Episode Date: November 8, 2022

In honor of the SC22 conference this month in Dallas, we thought it time to check in with our HPC brethren to find out what's new in storage for their world. We happened to see that IO500 had some recent (ISC22) results using a relative newcomer, GekkoFS (@GekkoFS). So we reached out to the team …

Transcript
Hey everybody, Ray Lucchesi here. Jason Collier here. Welcome to another sponsored episode of the Greybeards on Storage podcast, a show where we get Greybeards bloggers together with storage and system vendors to discuss upcoming products, technologies, and trends affecting the data center today. And now it is my pleasure to introduce Marc-André Vef, a PhD student, and Dr. Alberto Miranda of the Barcelona Supercomputing Center. Since Supercomputing 22 is about to open in Dallas this month, we thought it would be
a good time to see what's happening in the HPC I/O space. So Marc and Alberto, why don't you tell us a little bit about yourself and what makes GekkoFS so unique? So, yeah, my name is Marc Vef. I'm from the Johannes Gutenberg University Mainz in Germany. And yeah, like you said, I'm a PhD student. I'm very happy being on here today. And in the past, during my master's already, I found my, let's say, passion for file systems and then went more into the PhD about actually then developing and designing my own file systems. And I did two of them. Mostly it was GekkoFS, which is a distributed file system that you can start up ad hoc. So what we have seen in the past was basically that there are some issues
with parallel file systems in metadata performance, and we wanted to look into new ways how we can create a very fast file system that does not have these limitations. And Alberto. Okay. Thank you. Thank you, Ray. Thank you, Jason, for having us here.
My name is Alberto Miranda. I'm a senior researcher at the Barcelona Supercomputing Center, where I started a really long time ago. I really don't want to remember how many years ago that was. I started basically as a, I will say, as a research assistant engineer doing coding in the storage systems research group, and over time I started with a PhD, and nowadays I'm the co-lead of the Storage Systems for Extreme Computing research group, where we're doing a lot of things where we're trying to optimize I/O, mostly
for applications running in MareNostrum, but we're also trying to see the bigger picture, doing things that everyone can use, and this is where GekkoFS came into the picture. We basically had an internal seminar where several people from the I/O world joined together and we started discussing things, and we found with André Brinkmann, who is Marc's advisor, that we were doing very similar things. And we decided to join two slightly different projects to try to come up with something that was really useful to everyone, to everybody. So what's so special about you?
So I just saw GekkoFS in some IO500 reports from, I think, earlier this summer at the last HPC conference, I think, in Europe. And out of nowhere, GekkoFS came up. I think there were at least two or three different top 10 slots in the IO500. What's so special about IO500? Or not IO500, but GekkoFS. I mean, it's not usual for something to, you know, come out of nowhere and all of a sudden be fast as slick. Yeah, so I think our first getting into IO500 was a couple of years ago. And then I think our last one was in Supercomputing 20.
But yeah, we were there quite at the beginning and were at the top five. I think what makes it really special is that, you know, we're not having a lot of the usual mechanisms that normal parallel file systems have, and therefore we can have a lot better performance. Just looking at what the HPC systems, HPC applications actually need, and then we can optimize around it. So we're in a very unique position where, when an application only runs us as their file system,
we can optimize for them and do not have to basically support all the applications like Lustre or GPFS would do it. So I read the paper that you guys wrote, and there was probably a dozen people, maybe half a dozen people on the paper. So it's kind of interesting. They mentioned quite a few different techniques that you guys were using, and hopefully we'll get a chance to get into all these things. But go ahead, Alberto, you were going to say something. Sorry, before, I interrupted. Yeah, well, I wanted to say that
the key point behind GekkoFS is that it's not actually a parallel file system for long-term storage. Basically, what we wanted to do was, we were basically finding a lot of problems trying to keep all the POSIX semantics in a distributed context. So we decided to try two things. One, to try to get rid of POSIX, which didn't really work, because everyone is using the POSIX interface to interact with I/O. And the other key thing was that we wanted to be able to containerize a bit the I/O for an application and to keep it separate from the parallel file system, in an attempt to reduce the problems that we were seeing in MareNostrum and in MOGON II. So what we ended up doing was creating a new file system that was using the resources, the node-local resources, that were assigned to a particular HPC job. So the difference between that is that only one application, only a single
application is using the storage. And then we can fine-tune what kind of semantics we use and improve performance. So it really seems that, you know, kind of the deep learning arena seems to be really tailored for GekkoFS, given the fact that you've got the ability to rapidly read from a lot of different parallel components. And I'm assuming you guys have some type of sharding mechanism in there that helps that out. Yeah, I mean, there was a whole bunch of stuff in the paper about the sharding stuff. Jason, you didn't do your damn homework. Oh, yeah, I tell you. So, yeah, the IO500 is not just deep learning, right?
It's everything that you would find in an HPC center, right? I mean, yeah, they've added deep learning over the last couple of iterations, I guess. But you guys did fine in other stuff too, right? Yeah, I mean, for deep learning applications, a little bit to go back to that: the distributed file systems that we have, like GekkoFS, they are pretty useful as long as the input data that you have really exceeds the storage capabilities of one node. Because what people are already doing nowadays, if they have the input data for a deep learning workflow, they have this already available on a single node. And then, of course, it isn't that much of an improvement.
But as soon as you have a big namespace like in GekkoFS that spans over multiple nodes, you can then have really big input files. And this is where, at that point, you don't need these shards anymore, which basically split up your data set. And for IO500, I think at the moment they don't really have a deep learning part, but when it comes to the I/O workloads that you see in an HPC center, they are very broadly represented there. Fine.
Yeah, yeah, yeah. So it's not really deep learning, but it's a simulation of deep learning is what you're telling me. Yeah, it's more of a simulation of what the HPC applications are doing. But yes, deep learning applications are also represented in some of the I/O patterns there. Yeah. And so typically a lot of these big parallel file systems, think of the GPFSs and Lustres, they were always really good at sequential I/O. Is that the same case with this, or can this get into a little bit more, you know, better for those random read scenarios? Well, in that case, GekkoFS is not particularly optimized for sequential reads or sequential writes, because basically what we wanted to do from the beginning,
well, in fact, what we wanted to do was to be able to tailor the data distribution approach to a particular application. But the first data distribution that we implemented into the system was something that we called wide striping, that is, basically we just chunk the data space of a file in the storage system and pseudo-randomly distribute it across all the available storage nodes, which in this case are actually all the compute nodes assigned to the job. So in this case,
when we're trying to read something sequentially, we're basically aggregating all the bandwidth possible from all the compute nodes that you were able to include into your job. And for random accesses, we're not very much affected by them, because we are actually doing a pseudo-random data distribution. Oh, I was just going to say, and how are you guys handling the metadata accesses, for basically where that data is split up and wide-striped to? Is there a single metadata server throughout the whole cluster, or is there, you know, separate metadata servers per shard? I guess that's not the right term, obviously, but yeah. So what we are doing very uniquely: for example, for Lustre and GPFS,
you have the central metadata service, right, where the requests go to. And we are basically just looking at the path of the file that you're accessing, hashing that, and then we basically have all the information we need on the client side to really know where the file and the metadata of that file is. So we simply have all the metadata spread across all servers, and depending on where it's hashed to, that's where the metadata will be stored. So we don't need to go to any central server and ask where the data or the metadata actually is. We can just access it. And this makes it very fast for metadata, of course. So the client component itself, then, the client itself is actually also managing the hash, so it knows where to go to. So I'm assuming you guys have a kind of custom client that
actually talks. So it's not like talking NFS or, you know, some other protocol. It's actually, you're using a client on the client side that's actually the GekkoFS. Yeah, yeah, exactly. Yeah, we're actually using an interposition library that we embed into the application using LD_PRELOAD, which brings a whole set of problems I would rather not discuss today. So, yeah, we have an interception library. We're also working on trying to offer something more sensible for people that can actually modify their applications and can link, can afford to link, against a different library. But the first use case that we needed to support was for applications, legacy applications
that already existed. So the first approach was basically just to try to interpose all the I/O API. The POSIX calls. Yeah. Well, in fact, we first tried to interpose on the libc I/O calls, and we rapidly found that this was a nightmare, so we tried going the other way and
naturally intercepting all the system calls. And, well, we are using a library from Intel that was developed for this purpose that is working quite well, the syscall_intercept library. And we expanded on that a bit. And that's basically the whole client that we're using today: intercepting system calls and managing them so that the application believes that there is a real mount point and a real file system handling the management of its data. So, I mean, the configuration, I'll say, is kind of multi-tiered. So you have, I'll call it, a regular file system behind GekkoFS.
And then you've got on each node, you've got, it's not a local file system because there's global aspects to it, but there's a distributed file system, like Marc said, I guess, across all the local nodes that are accessing local data. There's a staging process where you take the data from wherever the real file system has it and move it to the local nodes. Is that correct? Yeah, so basically what you can very easily imagine is that basically, when you have all the SSDs that are on the nodes, you just aggregate them together, and then you have the performance, of course, and the capacities all bundled together.
But of course, when you have these nodes and you have a job on them, of course the input data isn't there, and this is what the staging means. So staging basically means that at the beginning of a compute job, where the application needs this input, somehow this data needs to be put there. So this is what stage-in means. And then in the end, the results in GekkoFS need to be staged out, because the files will then be destroyed in the end.
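To make that stage-in/compute/stage-out flow concrete, here is a minimal Python sketch. It is an editorial illustration only, not GekkoFS tooling: the directory names are made up, and plain temporary directories stand in for both the parallel file system and the job's ad-hoc mount point.

```python
import shutil
import tempfile
from pathlib import Path

# Stand-in directories so the sketch runs anywhere; on a real system these would be
# the parallel file system and the job-private GekkoFS mount point (assumed names).
BASE = Path(tempfile.mkdtemp())
PFS_INPUT = BASE / "gpfs" / "input"       # long-term storage (stand-in)
PFS_RESULTS = BASE / "gpfs" / "results"   # long-term storage (stand-in)
ADHOC_MOUNT = BASE / "gkfs_mount"         # ad-hoc namespace that only this job sees (stand-in)

def stage_in() -> None:
    """Copy the job's input from the parallel file system into the ad-hoc namespace."""
    shutil.copytree(PFS_INPUT, ADHOC_MOUNT / "input", dirs_exist_ok=True)

def run_job() -> None:
    """The application does all of its I/O inside the ad-hoc namespace while it runs."""
    out = ADHOC_MOUNT / "out"
    out.mkdir(parents=True, exist_ok=True)
    data = (ADHOC_MOUNT / "input" / "params.txt").read_text()
    (out / "result.dat").write_text(f"processed: {data}")

def stage_out() -> None:
    """Only the useful results are copied back; everything else dies with the job."""
    shutil.copytree(ADHOC_MOUNT / "out", PFS_RESULTS, dirs_exist_ok=True)

if __name__ == "__main__":
    PFS_INPUT.mkdir(parents=True)
    (PFS_INPUT / "params.txt").write_text("42\n")
    stage_in()
    run_job()
    stage_out()
    print((PFS_RESULTS / "result.dat").read_text())
```

On a real system the copy steps would move data between the parallel file system and the job's GekkoFS mount point, and the ad-hoc namespace disappears when the job ends.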
So multi-tiered in a way that, like Alberto said, we have the syscall library, which is where we then are in our client, where we decide which way we want to go. Is it going to a normal path where we have nothing to do with it? Or is it going into the GekkoFS namespace? And if it does, we distribute it across the servers naturally. And that's pretty much it.
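A rough sketch of the two decisions Marc just described, purely for illustration (the mount point, server names, and hash function are assumptions, and the real client makes this decision inside a system-call interception library rather than in Python): first, does an intercepted path belong to the GekkoFS namespace at all; second, which daemon is responsible for it, derived by hashing the path so that no central metadata server is ever consulted.

```python
import hashlib
import os

MOUNTDIR = "/tmp/gkfs_mount"                          # assumed mount point of the job's namespace
SERVERS = ["node-0", "node-1", "node-2", "node-3"]    # one daemon per compute node (assumed)

def is_gekkofs_path(path: str) -> bool:
    """Intercepted calls for paths under the mount point go to GekkoFS; everything
    else is passed through untouched to the kernel and the normal file systems."""
    path = os.path.abspath(path)
    return os.path.commonpath([MOUNTDIR, path]) == MOUNTDIR

def responsible_server(path: str) -> str:
    """Hash the path itself to pick the daemon that owns this file.

    Every client computes the same answer locally, so there is no central
    metadata service to ask."""
    digest = hashlib.md5(path.encode()).digest()
    return SERVERS[int.from_bytes(digest[:8], "little") % len(SERVERS)]

for p in ("/tmp/gkfs_mount/run1/rank-0.out", "/tmp/gkfs_mount/run1/rank-1.out", "/etc/hosts"):
    if is_gekkofs_path(p):
        print(f"{p} -> GekkoFS daemon on {responsible_server(p)}")
    else:
        print(f"{p} -> pass through to the kernel")
```

Note how two files in the same directory can end up owned by different daemons, which is also why directory listings have to gather entries from all servers, a point that comes up again later in the conversation.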
You're effectively stitching together across all the local nodes, the file system into one sort of global file, global file system, I guess is the right term. Is that how this works? Yes, exactly. So basically instead of accessing each individual file system on each SSD, it is now one big basically SSD that you see as a user
for all the nodes that you're using and where the file system servers are actually running. And there's a lot of randomizing being done for deep learning to ensure that the same data is not processed the same way and stuff like that. Various batch sizes, which is one aspect, but the epochs are actually intended to be, you know, randomized between epochs and stuff like that. So how is that accomplished?
Is that something that's done by the deep learning framework, or is that something that you guys from an I/O perspective are supporting? Or are you just seeing reads, I guess, and writes and opens and closes? You're not doing any randomization for them. They're coming in and saying, okay, I want this record, I want record one now, record 77 next, or something like that. Is that how this works?
Yeah, that's exactly how it works. Well, we're basically implementing the POSIX API, even if we do change its semantics a bit. So right now we're not doing anything special for a particular application, except a couple of examples that we have in our own supercomputers. But this could easily be implemented into our framework, because that's precisely what we wanted to do in the first place. Once we have the API deployed and running, the development of which is in progress, we could easily tailor particular semantics for a single application. In this case, deep learning infrastructure, deep learning middleware.
Yeah, deep learning is a great opportunity here for what I consider smaller files, but HPC historically has been pretty large files and pretty large blocks and stuff like that. It's kind of hard for most systems, most file systems, to be optimized for both large blocks and small blocks. I mean, how do you guys manage that? Because, I mean, for you to be successful in IO500, top five, mind you, you have to be able to support both, right? Yeah. So the way we are doing it, in the end, similar to where file systems have blocks,
we have this in a similar way, but call them chunks. And a chunk, which could be, for example, just 500 kilobytes of a file, is then actually a normal file on the node-local file system on a node. So what we basically only have to do, if we have a very big I/O request that is megabytes big, is that this big buffer will then get chunked with the chunk size. So a two-megabyte I/O will then be split into four chunks
and can supposedly go to four servers in parallel. And if it's just a smaller request, it will just go to one server. So in that sense, it could be more beneficial, if there's a lot of smaller I/O, to use smaller chunk sizes, but we have basically only seen that 500 kilobytes works well for most use cases for us. But there can be optimizations, of course, depending on whatever application workloads we're looking at.
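A back-of-the-envelope sketch of that chunking, just to make the arithmetic visible (an illustration: the 512 KiB chunk size stands in for the roughly 500-kilobyte default Marc mentions, and the round-robin placement is a stand-in for the real pseudo-random distribution):

```python
CHUNK_SIZE = 512 * 1024                               # ~500 KiB, as in the discussion above
SERVERS = ["node-0", "node-1", "node-2", "node-3"]    # assumed job nodes

def chunks_for_write(offset: int, length: int):
    """Split one write request into per-chunk pieces and place each chunk on a server.

    Placement here is a simple round-robin on the chunk index; the real system
    distributes chunks pseudo-randomly across the job's nodes.
    """
    pieces = []
    end = offset + length
    pos = offset
    while pos < end:
        chunk_id = pos // CHUNK_SIZE
        chunk_end = min((chunk_id + 1) * CHUNK_SIZE, end)
        server = SERVERS[chunk_id % len(SERVERS)]
        pieces.append((chunk_id, server, pos, chunk_end - pos))
        pos = chunk_end
    return pieces

# A 2 MiB write becomes four chunks that four servers can service in parallel...
print(chunks_for_write(offset=0, length=2 * 1024 * 1024))
# ...while a 4 KiB write touches a single chunk on a single server.
print(chunks_for_write(offset=0, length=4 * 1024))
```

With these numbers a 2 MiB request is cut into four pieces, while a small request stays on one server, which is exactly the behaviour described above.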
Right. Also, if I may add a bit to that, what we also found is that you could basically classify applications between those that were doing really large I/O phases and those that were doing mostly smaller read and write phases. Even if sometimes they could be somewhat mixed, the major patterns were typically biased to one side or the other. So the good thing about GekkoFS, since you tailor it for the application, is that you can configure the settings for your application easily and you can fine-tune. If you
do a run on the supercomputer and you see that the performance is not as good as it could be, you can try tweaking the chunk size. You could also try doing an I/O profile of your application and then see exactly what your application is doing in terms of I/O, and see if there are some settings in GekkoFS that you could adjust that could benefit your application. Yeah, yeah. So, I mean, Alberto, Marc, the challenge with, I'll call it, open source systems has always been that there are plenty of different knobs, I would say. And it's a bit challenging to try to decide what knobs to set to what sort of levels in order to, you know, install the system and run it and get effective throughput out of it. But do you guys have any, you know, guidance in that respect?
I mean, I don't know how many knobs GekkoFS has, but I'm sure there's more than four, more than 10, more than 20. Well, there are really a couple of knobs that you can really, you know, turn. There is, for example, how many resources you actually want to use for the servers and so on. But in the end, where we started with it, there were actually very few settings.
And then we could basically incrementally see what impact every little knob has. But in the end, because we are not POSIX and we're not supporting everything, the knobs also decrease, right? So this is also not really an issue. No, no, but it is true that we started with very clear ideas on what could be tuned, but over time, the minimum set of things
that can be changed has grown significantly. So I would not say 20 knobs, but between how to distribute data, the chunk size, the number of compute nodes that you can use as I/O servers, whether you want them also behaving as compute nodes themselves for applications. I mean, if you start combining everything, yeah, it's not 20, but it's 10 to 15. Yeah, fewer knobs is not a bad thing. I think RFC 1925, the twelve networking truths, truth number 12 states that in protocol design, perfection has been reached not when there's nothing left to add, but when there's nothing left to take away. And I think that's very much the case with file systems as well.
Somebody else said that, Jason. I'm going to have to look that one up. But yeah, it was RFC 1925. Ross Callon wrote that back in 1996. It was one of those April Fools' Day RFCs, when he was working at DEC. So it's a good one. It's worth looking up. RFC 1925.
Okay. Okay. Back to GekkoFS. So I noticed in the paper you have both a memory mode and an SSD mode. I mean, so memory would be just using a chunk of memory as a device, I guess, as a storage device? Is that how it plays out? So basically, for normal node-local storage, we just use whatever is available. This could be an ext4, could be an XFS or anything; as long as we have a path, we're pretty much
happy. But of course, for memory, we're also depending kind of on a path. So in that sense, we need to have a tmpfs or a ramfs available where we can then store the files. So it is in a sense memory support, but not kind of on a byte level. Right, right, right. So what you're saying is that on the local node, you have some sort of a file system that you need to map GekkoFS on top of. And then you have a parallel file system that's behind it where the petabytes of data are sitting as well.
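Conceptually, each chunk ends up as an ordinary file under a root directory on whatever node-local file system is available, which is why pointing that root directory at a tmpfs or ramfs path gives you the memory mode. A small sketch of the idea (illustration only; the environment variable and the directory layout are made up, not the real on-disk format):

```python
import os
from pathlib import Path

# The backing root directory: an SSD-backed file system gives the SSD mode,
# while a tmpfs/ramfs path (e.g. somewhere under /dev/shm) gives the in-memory mode.
ROOTDIR = Path(os.environ.get("GKFS_DEMO_ROOTDIR", "/tmp/gkfs_demo_root"))  # hypothetical name

CHUNK_SIZE = 512 * 1024

def store_chunk(file_path: str, chunk_id: int, payload: bytes) -> Path:
    """Persist one chunk of a file in the namespace as a plain file on local storage."""
    chunk_dir = ROOTDIR / file_path.strip("/").replace("/", "_")   # made-up layout
    chunk_dir.mkdir(parents=True, exist_ok=True)
    chunk_file = chunk_dir / f"chunk_{chunk_id:08d}"
    chunk_file.write_bytes(payload)
    return chunk_file

print(store_chunk("/gkfs/run1/out.dat", 0, b"x" * CHUNK_SIZE))
```

Nothing about the chunk itself changes between the two modes; only the backing path does, which is what Marc means by memory support that is not on a byte level.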
I'm just trying to make sure I understand the configuration levels here. Well, there is actually a parallel file system below GekkoFS, but we are not accessing it continuously. I mean, in principle, the idea behind GekkoFS, which is where the current research is headed right now in our case, is actually to use GekkoFS for data staging workflows, where you link or you coordinate with the job scheduler,
you tell it what your input resources are going to be, what you expect your output resources are going to be, and this coordinates with GekkoFS, deploys it, stages in all the data that the application is going to need, and starts the application, which in principle will run using only the compute nodes, without interfering with the parallel file system for all the temporary data that it would generate
during the simulation. And at the end, when you're finished, you just stage out the useful results of the simulation to long-term parallel file system storage. I was going to say, that's pretty much the lineup. I got a couple of folks in my group that have done a lot of HPC, and that seems to be kind of the standard way, where there's the big data store sitting on the back, and then basically for every job that you do, you're basically spinning up these large temporary systems to go through, do all this processing, and then have it all torn down and put back up again. And the reality is, I'm sure you folks know,
is that, you know, doing that with like a GPFS or a Lustre, I mean, it's a very expensive and kind of encompassing, complex problem to deal with. Building those temp systems is not the easiest thing to effectively script out. Right. It takes a lot of effort to do that. So I'm sure any type of, you know, easing of that infrastructure rollout component, where you're building
that dynamically, that's a huge bonus for anything in HPC. We used to call those things scratch files, right? Yeah, 100%. I think one of the most important things, if you have such a system, is that it needs to really spin up in seconds. You cannot really wait. And what is also really, really important for our case is that it is all at user level, so that you don't have to bother your HPC admins for some special settings. It's sometimes difficult enough to get a file system and the permissions for that.
So in that sense, you can think of GekkoFS just as an I/O application where the admins don't really have to know anything about it, right? The users can just use it and that's it. I noticed in the paper, and it may or may not be current as well, but you support something you called eventual consistency for write activity. You want to describe how that works out or what that looks like? Is that correct? I guess.
Yeah, so that's probably the first question. Yeah, that's a little bit difficult. So in general, we have two kinds of views on that. One part is basically where an I/O request is accessing one specific thing in the file system. So this could be you're accessing one specific offset in a file. This could be you are accessing the metadata of one specific file. These ones are what we would say strongly consistent, because they kind of honor the order in which the requests are coming to the service. Where we don't really have strong consistency is when we're looking at listing a directory, for example.
Because in principle, what could happen is that, at the same time a directory is being listed, files can be inserted there. And then what the user will actually see is not really what is actually in the system, because files are also being created. But again, we are not having any locking mechanisms at all, and we're not locking anything; this is also the reason why our system scales so well.
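To illustrate why a listing is only eventually consistent while per-file operations stay ordered, here is a toy model (an editorial illustration, not the real protocol; the hash function and the mid-listing create are contrived):

```python
# Each daemon owns the entries whose paths hash to it; a listing must ask all of them.
servers = [set() for _ in range(4)]

def owner(path: str) -> int:
    return sum(path.encode()) % len(servers)   # toy hash, stands in for the real one

def create(path: str) -> None:
    """A per-path operation goes to exactly one server, so it is ordered there."""
    servers[owner(path)].add(path)

def list_directory(prefix: str, create_midway=None):
    """Ask every server in turn for matching entries -- no global lock is taken.

    `create_midway` simulates another client creating a file while the
    listing is still walking the servers.
    """
    seen = []
    for i, s in enumerate(servers):
        seen.extend(p for p in s if p.startswith(prefix))
        if create_midway and i == 0:
            create(create_midway)              # a concurrent create sneaks in mid-listing
    return sorted(seen)

create("/gkfs/run/a.dat")
# Depending on which server the new path hashes to, the in-flight listing
# may or may not include it -- that is the "eventual" part.
print(list_directory("/gkfs/run/", create_midway="/gkfs/run/b.dat"))
print(list_directory("/gkfs/run/"))            # a later listing sees both files
```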
There's always this situation that requests can also overtake each other. It happens very, very rarely. But usually, we would say these direct operations on files are strongly consistent. This is what users usually see. But ls operations, like directory listings, things that HPC applications usually don't do anyway, these are then eventually consistent. Yeah. Yeah. I used to run a GPFS system, and I remember ls hanging for, you know, five, ten minutes, waiting for one of those to return because of the file locking issue. Yeah, we actually had an issue in MareNostrum where, under certain conditions, an ls with GPFS could actually take up to five minutes, and we managed to fix it by disabling the coloring of the
ls output, because then the client didn't need to access all the metadata to know what it was printing and to actually print one thing in yellow or another thing in blue. It was kind of crazy. Wait, wait, wait. So I've got a problem with no locking here. So what happens if I'm opening up a new, if I'm creating a file and stuff like that? I mean, you know, you've got to have some locking on this metadata structure, even though, A, it's distributed. But, you know, if I'm creating a file, I mean, you're not locking the directory to create that file while you're creating it? No. Tell me you're locking the directory.
Starting point is 00:27:33 But, you know, if I'm creating a file, I mean, you're not locking the directory to create that file while you're creating it? No. Tell me you're locking the directory. No, we're're locking the directory. No. We're not locking the directory. Your guys' use case and application is pretty different, right? Primarily, most of the writes that are actually happening are during that whole stage in component. Is that right? Well, he's got to create output.
Starting point is 00:27:57 The model data's got to be. You can also create the output. There's output being done here, Jason. I know, but I create the output. There's output being done here, Jason. I know, but I create the output on a different volume. Like you've got one that's a read volume and one that's a write volume. That seems to make sense in an application like that.
Starting point is 00:28:15 But the difference is that we are not locking directories basically because directories themselves or as metadata entities don't exist in GeckoFS, at least in the white striping mode. Because basically what we, since we're hashing the path, basically each different path, even if they live under the same directory, is effectively different from each other and lives in a different IO server for GetOFS. For actually modifying the entries for a single path, we don't have login, but we do have transactions
Starting point is 00:28:55 in the internal database that we are using. Oh, okay. So you do transaction rollback if there's a problem or something like that, if that needs to be done, I guess. But I mean, really for the applications that you guys target to, I mean, this is, I mean, I'm not going to run this if I'm a large bank and all of my transactional data is going to go on it. This is probably the wrong thing to put on the back end. But for the HPC data that you're generating, I mean, especially when you look at like deep learning models and things like that, you know, it's not as necessary having a strong locking mechanism.
Starting point is 00:29:33 And like you said, you know, one of the biggest issues in, you know, those larger parallel file systems is the fact that there is locking and it is blocking you from getting tasks done. Yeah. But locking is a good thing. It's not necessarily. Yeah, it's slower.
Starting point is 00:29:49 Not when you've got 18,000 nodes trying to access one file. No, and also, locking is good, but it's not as good for performance when everyone is doing locking. When you have your application that it's trying to synchronize itself with a synchronization mechanism, and you also have your middleware that it's also using synchronization mechanisms, and then you also have your IO layer that has its internal mechanisms, then you see that everyone is trying to synchronize access. And in the end, everyone is doing the same.
Starting point is 00:30:23 So what we tried to do was to strip everything. If you already have an application that it's synchronizing itself to access or to read and write data, why do we need to do additional job to guarantee that? And if you are not doing it, then we can talk to you and see what the best way to synchronize the I.O. of your application would be. I'm not convinced. But, okay, for instance, like you talk a lot about deep learning. So with deep learning and you distributed the training data, so at some point those models have to be re-synced or re-sized.
Starting point is 00:31:02 I think you'd call it all reduced or something like that in the paper. That process is both, you know, we're going to write out each of our own individual models. Some process in this cluster of 18,000 nodes, Jason, is going to look at this model and try to reduce it to one model and then publish that to all these other nodes. And then at that point, they're going to take off with the next round of data. Isn't it? Isn't that how this works? Yeah.
Starting point is 00:31:28 But you're also looking at billions of objects, right? You're doing this. So it's like, you know, it's like, it'll, it'll get it on the next training cycle. Right.
Starting point is 00:31:38 Maybe. Yeah. No luck. Yeah. That's I mean, for, for deep learning applications, I mean,
Starting point is 00:31:43 mostly for the input, it's, it's, you know, read. But again, I mean, mostly for the input, it's, you know, read. But again, for writing this, this is not an issue. And normally, even if these models need to be synchronized, this is not really a problem for us. And again, the reason why what in deep learning applications are happening, where you need to basically move around the input data so that you have these randomness really, and you don't have any biases in the end.
Starting point is 00:32:09 These really only exist because you don't really have a distributed system and you use these charts that you are putting on each local drive. And if you're having GeckoFS, of course, you have one big space, you don't need to do any of that. The whole deep learning application has access to the complete input
Starting point is 00:32:25 data and can access it wherever they want. And they don't need to move the data around the input data so that the other nodes get access to it. I guess that was somewhat confusing in the paper because you talked about sharding versus non-sharding and things of that nature. So in effect, GeckoFS in a normal deep learning environment would say, okay, I've stitched together this, I don't know, 10 terabyte data set that you want to train on. And each of these guys have, you know, a 500 kilobyte, kilobyte, wrong, gigabyte SSD on it. And I've got, you know, 10 of these guys, so I can do 5 terabytes or something like that. I don't know. Right, exactly. And then you need to figure out how you wrap together these shards so that they don't end up biases. And this is very difficult because sometimes these biases are not really obvious.
Starting point is 00:33:15 If you don't have to do that, if you just push these 10 terabytes of data into your GeckoFS, which spans multiple SSDs, then you don't have to do any of that. So that's like the real advantage. And talking to the deep learning people at our university as well, they would be really happy for the workflow if they would use GECOFS, because then they have to deal with a whole lot less. Let's just say it that way. Oh, yeah, yeah.
Starting point is 00:33:39 They don't have to shard it. They don't have to decide what the biases are, what's non-biased approach to sharding and all of this stuff. Yeah, agree i understand that it's not too good with this locking stuff but i guess i'm gonna let that side yeah some some so maybe maybe uh you want more up to it if we say of course um there as alberto already said um on the method we have the metadata and rocks db of course there is some ordering there involved, right? It is not really 100% transactional, but in a sense. And also, we have a POSIX-compliant, a strong, consistent local file system on each server. So this has already local locking.
Starting point is 00:34:19 What we're saying is we just don't need this locking on a global remote node scale that this locking is happening over the network. Mark, you can't tell me there's such great advantage of having this, you know, single global file system across all the nodes so I can ultimately, you know, randomize them on demand. But then in the same case, this, you know, the locking within a node is good, but locking across nodes, not. I don't know. Like I said, we need to set that aside because that's a whole hour discussion or longer between us all. So, you know, there are plenty of players in this space. They all seem to be wanting to do this sort of thing. You know, BGFS, is it BGFS or BFS or something like that?
Starting point is 00:35:03 They're another one of these types of file system. What do you call them? What did you call them? It was booster, boosting? Burst buffer. Burst, BurstFS, right. So BFS. So why are you guys so much better than BFS?
Starting point is 00:35:19 BGFS? So for BGFS, I think you had Frank Harreld already on a couple of episodes. Yeah, we did. Yeah. So they have actually, so normally they're also having a parallel file system similar to what GPFS and Lustre is doing, but they also have these burst buffer mode that they call beyond. And in that sense, they are completely strongly POSIX compliant, but they're also kernel file systems. So spinning these beyond instances up requires administrative access. But once you run beyond,
Starting point is 00:35:54 similar to how you would run GECOFS, we have all of these comparisons in one of our papers. They're also quite strong. But again, I think one of the major advantages for us is really that just the user can do that and you really have this really good metadata performance. So user space versus kernel space, yeah. I would never think user space
Starting point is 00:36:16 would actually be quicker than kernel space. That's a different question. It may not be quicker, but it's also if something goes awry, you don't kernel panic the machine. I see. Okay, maybe. I've known GPFS many a times as well. And, yeah, GPFS always had the great ability to taint the kernel,
Starting point is 00:36:36 and then anytime you wanted to do any specific kernel modifications, it was a giant pain. Let's not go there. I understand the problem. All right, So you guys are open source. Is there like a support contract that somebody could buy if they were so
Starting point is 00:36:53 interested in doing something like this or not? Most of the guys we talk to that are open source have an enterprise solution that they're willing to you know have customers spend money on and stuff well we we're not at that point yet uh we believe uh we would really be delighted to to go that way uh there have been certain talks in that direction
Starting point is 00:37:18 but nothing is uh is yet uh confirmed so I cannot talk anymore. I understand. When you guys are looking for VC money, let me know. Jason's got this small company behind him for some reason. I don't want to talk about it. Let's see. So Big Data
Starting point is 00:37:44 is another solution like this where they do a lot of data reading and not a lot of data writing to some extent. You see this as a solution in that space as well? I mean, do you know what data analytics? I mean, I guess Hadoop, Spark kinds of things. Well, it's a bit difficult. It always a little bit depends on... So we're not a backend for, I don't know, an SQL database or anything like that. But it always really depends on kind of the workload as well there.
Starting point is 00:38:16 You can start to classify into these burst applications, checkpointing application, all of these things. But in the end, you really need to look more closely. Well, if you have a compute-bound application that doesn't really write a whole lot and where the parallel fastener without problems can keep up, then there's really not a point of using it. But the more applications are running on the parallel fastener, the more it gets interference. We have seen this already.
Starting point is 00:38:41 We have running on Maren nostrum a couple of experiments um over a month at different times this was always the same workload and you actually see like diff orders of magnitudes and differences for the performance that the user actually gets right so yeah it really depends on what the system state is what the application workload and so on yeah interesting i like orders of magnitude speed up. I'm not sure. Yeah, actually, that's okay. It's not the order of magnitude of speed up. It's a slowdown, depending on who you're sharing your IO with, your GPFS IO with, and what they're doing with that. You can see that your application is definitely slower than it should be
Starting point is 00:39:27 right right right right right right right right huh i was gonna talk so raid level you guys do any sort of raid level across these ssds uh across the nodes and stuff like that what sort of data protection do you offer? Erasure coding 13 plus two or something? Well, erasure coding is planned. We're actually working on it. To be due. TBD. No, no, no.
Starting point is 00:39:54 I mean, well, initially we didn't plan to have any kind of, because it actually didn't matter for the HPC workloads that we were working on. Because the data is elsewhere. It's temporary, right? It's basically all temporal data. Yeah, and most simulations already have in place a system for checkpointing. So it didn't make sense for us at the time to actually do the same work twice.
Starting point is 00:40:23 But down the line, if we're trying to move a bit beyond HPC and trying to make the file system more useful, we actually need to have some data protection schema in place. And given that our main work mode is fully distributed, the current design is going to issue codes for data protection. That would make sense. Right, right, right. I can't think of anything else, Jason.
Starting point is 00:40:56 You have any last questions for these guys? Besides the locking question. You don't want to ask the locking question. We have Super Compute coming up. Were you guys planning on actually going to the Super Compute coming up, were you guys planning on actually going to the Super Compute conference that's coming up?
Starting point is 00:41:08 I think it's in Dallas, I believe, this year. Yeah, so I will be there. Yeah, I will be there. So yeah, I'm happy to talk to everyone about it. But yeah,
Starting point is 00:41:21 I will be there and we'll also be looking forward to the next IFF100. If there is maybe a new surprise contender, you never know. It's usually every half a year, the new fastest one coming up. That's suddenly very fast. Yeah, yeah.
Starting point is 00:41:33 I got a couple of people on notice in your way. Yeah, it'll be nice. There you go. There you go. Listen, Mark and Alberto, is there anything you'd like to say to our listening audience before we close? So, yeah, really, thanks a lot for having us on. It was fantastic to talk to you. And it has been the first time for a podcast for us, but it has been very interesting.
Starting point is 00:41:56 And, of course, for GeckoFS, feel free to talk to us. We're always happy to talk to you about your workloads and how GeckoFS could could help if you have any issues with it or if you want to open any tickets. We are on GitLab. Just type it into Google and you will find it. And yeah, we're always happy to help and looking forward to it. Yeah. Alberto, anything? No, basically what Mark said. I mean, we're really happy for the exposure. We're really interested in workloads that people can come up with. We're really eager to come up with different ways to optimize GeckoFS and to find newer requirements.
Starting point is 00:42:42 I really would be really happy if we can actually come to a decision into the logging. Do we need it? Don't we need it? I'm thinking of another knob here, but that's another question. I don't know. I don't know. You know, there's trade-offs, right? There's always trade-offs.
Starting point is 00:43:04 And you've got to make some decisions early on on what trade-offs you want to go after and what trade-offs. There's always trade-offs. You've got to make some decisions early on. What trade-offs you want to go after and what trade-offs
Starting point is 00:43:09 you don't. I understand that. In your particular application, I think it makes a lot
Starting point is 00:43:15 of sense. I wasn't going to say that online, but I guess I had to. All right. All right.
Starting point is 00:43:22 That's it for now. Bye, Mark. I'm Alberto. It was great having you on the show today. Thank you, Ray. Thank you for having us. All right, that's it for now. Bye, Mark. I'm Alberto. It was great having you on the show today. Thank you, Ray. Thank you for having us.
Starting point is 00:43:29 Yeah. Thank you, Ray. Thank you, Jason. All right. And bye, Jason.
Starting point is 00:43:33 Until next time. Next time, we will talk to another system storage technology person. Any questions you want us to ask, please let us know. And if you enjoy our podcast, tell your friends about it. Please review us on Apple Podcasts, Google Play, and Spotify,
Starting point is 00:44:11 as this will help get the word out. Thank you.
