Grey Beards on Systems - 139: GreyBeards talk HPC file systems with Marc-André Vef and Alberto Miranda of GekkoFS
Episode Date: November 8, 2022. In honor of the SC22 conference this month in Dallas, we thought it time to check in with our HPC brethren to find out what's new in storage for their world. We happened to see that IO500 had some recent (ISC22) results using a relative newcomer, GekkoFS (@GekkoFS). So we reached out to the team.
Transcript
Hey everybody, Ray Lucchesi here.
Jason Collier here.
Welcome to another sponsored episode of the Greybeards on Storage podcast,
a show where we get Greybeards bloggers together with storage and system vendors
to discuss upcoming products, technologies, and trends affecting the data center today.
And now it is my pleasure to introduce Marc-André Vef, a PhD student, and Dr. Alberto Miranda
of the Barcelona Supercomputing Center.
Since Supercomputing 22 is about to open in Dallas this month, we thought it would be
a good time to see what's happening in the HPC I/O space.
So Marc and Alberto, why don't you tell us a little bit about yourself and what makes
GekkoFS so unique?
So, yeah, my name is Marc Vef. I'm from the Johannes Gutenberg University Mainz in Germany.
And yeah, like you said, I'm a PhD student. I'm very happy being on here today.
And in the past, during my master's already, I found my, let's say, passion for file systems and then went more into the PhD about actually then developing and designing my own file systems.
And I did two of them.
Mostly it was GekkoFS, which is a distributed file system that you can start up ad hoc. What we had seen in the past was basically that there are some issues with parallel file systems in metadata performance, and we wanted to look into new ways to create a very fast file system that does not have these limitations.
And Alberto.
Okay.
Thank you.
Thank you, Ray.
Thank you, Jason, for having us here.
My name is Alberto Miranda.
I'm a senior researcher at the Barcelona Supercomputing Center
where I started a really long time ago.
I really don't want to remember how many years ago that was.
I started basically as a, I will say, research assistant engineer doing coding in the storage systems research group, and over time I started with a PhD. Nowadays I'm the co-lead of the Storage Systems for Extreme Computing research group, where we're doing a lot of things trying to optimize I/O, mostly for applications running on MareNostrum, but we're also trying to see the bigger picture, doing things that everyone can use. And this is where GekkoFS came into the picture.
We basically had an internal seminar where several people from the I/O world joined together, and we started discussing things, and we found with André Brinkmann, who is Marc's advisor, that we were doing very similar things. And we decided to join two slightly different projects to try to come up with something that was really useful to everyone, to everybody.
So what's so special about you?
So I just saw GekkoFS in some IO500 reports from, I think, earlier this summer, at the last HPC conference in Europe, I think. And out of nowhere, GekkoFS came up. I think there were at least two or three different top 10 slots in the IO500. What's so special about IO500? Or not IO500, but GekkoFS. I mean, it's not usual for something to, you know, come out of nowhere and all of a sudden be fast as slick.
Yeah, so I think our first entry into the IO500 was a couple of years ago, and then I think our last one was at Supercomputing 20. But yeah, we were there quite early on and were in the top five. I think what makes it really special is that, you know, we don't have a lot of the usual mechanisms that normal parallel file systems have, and therefore we can get a lot better performance. We just look at what HPC systems, HPC applications, actually need, and then we can optimize around that. So we're in a very unique position where, when an application runs only us as their file system, we can optimize for them and do not have to basically support all the applications like Lustre or GPFS would.
So I read the paper that you guys wrote,
and there was probably a dozen people,
maybe half a dozen people on the paper.
So it's kind of interesting. It mentioned quite a few different techniques that you guys were using, and hopefully we'll get a chance to get into all these things. But go ahead, Alberto, you were going to say something, sorry, before I interrupted. Yeah, well, I wanted to say that
the key point behind GekkoFS is that it's not actually a parallel file system for long-term storage. Basically, we were finding a lot of problems trying to keep all the POSIX semantics in a distributed context. So we decided to try two things. One, to try to get rid of POSIX, which didn't really work, because everyone is using the POSIX interface to interact with I/O. And the other key thing was that we wanted to be able to containerize the I/O for an application a bit and keep it separate from the parallel file system, in an attempt to reduce the problems that we were seeing on MareNostrum and on MOGON II.
So what we ended up doing was creating a new file system that uses the node-local resources that were assigned to a particular HPC job. So the difference is that only one application, only a single application, is using the storage, and then we can fine-tune what kind of semantics we use and improve performance. So it really seems that, you know, kind of the deep learning
arena seems to be really tailored for GekkoFS, given the fact that you've got the ability to rapidly read from a lot of different parallel components.
And I'm assuming you guys have some type of sharding mechanism in there that helps that out.
Yeah, I mean, there was a whole bunch of stuff in the paper about the sharding stuff.
Jason, you didn't do your damn homework.
Oh, yeah, I tell you.
So, yeah, the IO500 is not just deep learning, right?
It's everything that you would find in an HPC center, right?
I mean, yeah, they've added deep learning over the last couple of iterations, I guess.
But you guys did fine in other stuff too, right?
Yeah, I mean, for deep learning applications, to go back to that a little bit: a distributed file system like GekkoFS is pretty useful as long as the input data that you have really exceeds the storage capabilities of one node. Because what people are already doing nowadays, if they have the input data for a deep learning workflow, is they have it already available on a single node, and then, of course, it isn't that much of an improvement. But as soon as you have a big namespace like in GekkoFS, which spans multiple nodes, you can then have really big input files. And this is where, at that point, you don't need these shards anymore, which basically split up your data set. And for IO500, I think at the moment they don't really have a deep learning part, but concerning the I/O workloads that you see in an HPC center, they are very broadly represented there.
Defined.
Yeah, yeah, yeah.
So it's not really a deep learning, but it's a simulation of deep learning is what you're telling me.
Yeah, it's more of a simulation of what the HPC applications are doing.
But yes, deep learning applications are also represented in some of the I/O patterns there. Yeah.
And so typically a lot of these big parallel file systems, think of the GPFSs and Lustres, they were always really good at sequential I/O. Is that the same case with this, or can this get into a little bit more, you know, better for those random read scenarios?
Well, in this case GekkoFS is not particularly optimized for sequential reads or sequential writes, because basically what we wanted to do from the beginning,
well, in fact, what we wanted to do was to be able to tailor the data distribution approach to a particular application. But the first data distribution that we implemented in the system was something that we called wide striping, which is basically just that we chunk the data space of a file in the storage system and pseudo-randomly distribute it across all the available storage nodes, which in this case are actually all the compute nodes assigned to the job.
So in this case, when we're trying to read something sequentially, we're basically aggregating all the bandwidth possible from all the compute nodes that you were able to include in your job. And for random access patterns, we're not very much affected by them, because we are actually doing a pseudo-random data distribution.
Oh, I was just going to say, and how are you guys handling the metadata accesses, for basically where that data is split up and wide-striped to? Yeah, is there a single metadata server throughout the whole cluster, or is there, you know, a separate metadata server per shard, I guess, not the right term obviously, but...
Yeah, so what we are doing very uniquely: for example, for Lustre and GPFS,
you have the central metadata service, right, where the requests go to. And we are basically just looking at the path of the file that you're accessing, hashing that, and then we have all the information we need on the client side to really know where the file and the metadata of that file is. So we simply have all the metadata spread across all servers, and depending on where it's hashed to, that's where the metadata will be stored. So we don't need to go to any central server and ask where the data or the metadata actually is. We can just access it. And that makes it very fast for metadata,
of course. So the client component itself, the client itself, is actually also managing the hash, so it knows where to go to? So I'm assuming you guys have a kind of custom client that actually talks. So it's not like talking NFS or, you know, some other protocol. It's actually, you're using a client on the client side that's actually GekkoFS? Yeah, yeah, yeah, exactly.
Yeah, we're actually using an interposition library that we embed into the application using LD_PRELOAD, which brings a whole set of problems I would rather not discuss today.
So, yeah, we have an interception library. We're also working on trying to offer something more sensible for people that
can actually modify their applications and can link, can afford to link against
a different library, but the first use case that we needed to support
was for applications, legacy applications
that already existed.
So the first approach was basically just try to interpose
all the IO API.
POSIX calls.
Yeah.
Well, in fact, we first tried to interpose on the libc I/O calls, and we rapidly found that this was a nightmare. So we tried going the other way and intercepting all the system calls instead. And we, well, we are using a library from Intel that was developed for this purpose, which is working quite well: the syscall_intercept library. And we expanded on that a bit.
And that's basically the whole client that we're using today,
intercepting system calls and managing them
so that the application believes that there is a real mount point
and a real file system handling the management of its data.
So, I mean, the configuration, I'll say, is kind of multi-tiered. So you have, I'll call it, a regular file system behind GekkoFS. And then on each node you've got, it's not a local file system because there's global aspects to it, but there's a distributed file system, like Marc said, I guess, across all the local nodes that are accessing local data. There's a staging process where you take the data from wherever the real file system has it and move it to the local nodes.
Is that correct? Yeah, so basically what you can very easily imagine
is that basically when you have all the SSDs that are on the nodes,
you just aggregate them together,
and then you have the performance, of course,
and the capacities all bundled together.
But of course, when you have these nodes and you have a job in them,
of course the input data isn't there,
and this is what the staging means.
So staging basically means that at the beginning of a compute job, where the application needs this input, somehow this data needs to be put there. So this is what stage-in means. And then at the end, the results in GekkoFS need to be staged out, because the files will be destroyed at the end.
So it's multi-tiered in a way. Like Alberto said, we have the syscall intercept library in our client, where we decide which way we want to go. Is it going to a normal path where we have nothing to do with it? Or is it going into the GekkoFS namespace? And if it is, we distribute it across the servers naturally. And that's pretty much it.
You're effectively stitching together across all the local nodes,
the file system into one sort of global file,
global file system, I guess is the right term.
Is that how this works?
Yes, exactly.
So basically instead of accessing each individual file system on each
SSD, it is now one big basically SSD that you see as a user
for all the nodes that you're using
and where the file system servers are actually running at.
And there's a lot of randomizing being done for deep learning
to ensure that the same data is not processed the same way
and stuff like that.
Various batch sizes, which is one aspect, but the epochs are actually intended to be, you know, randomized between epochs and stuff like that. So how is that accomplished?
Is that something that's done by the, the deep learning framework,
or is that something that,
that you guys from an IO perspective are supporting or you just seeing reads,
I guess, and, and writes and opens and closes.
You're not doing any randomization for them.
They're coming in and saying,
okay, I want this record, I want record one now,
record 77 next or something like, is that how this works?
Yeah, that's exactly how it works. Well, we're basically implementing the POSIX API, even if we do change its semantics a bit. So right now we're not doing anything special for a particular application, except a couple of examples that we have on our own supercomputers. But this could easily be implemented in our framework, because that's precisely what we wanted to do in the first place. Once we have the API deployed and running, the development of which is in progress, we could easily tailor particular semantics for a single application. In this case, deep learning infrastructure, deep learning middleware.
Yeah, deep learning is a great opportunity here for what I consider smaller files, but HPC historically has been pretty large files and pretty large blocks and stuff like that. It's kind of hard for most systems, most file systems, to be optimized for both large blocks and small blocks. I mean, how do you guys manage that? Because, I mean, for you to be successful in IO500, top five mind you, you have to be able to support both, right?
Yeah.
So the way we are doing it: in the end, similar to how file systems have blocks, we have something similar, but we call them chunks. A chunk could be, for example, just 500 kilobytes of a file, which is then actually a normal file on the node-local file system of a node. So what we basically have to do, if we have a very big I/O request that is megabytes big, is that this big buffer will get chunked with the chunk size. So a two-megabyte I/O will then have four chunks and, supposedly, will go to four servers in parallel. And if it's just a smaller request, it will just go to one server. So in that sense, it could be more beneficial, if there's a lot of smaller I/O, to use smaller chunk sizes, but we have basically seen that 500 kilobytes works well for most use cases for us. But there can be optimizations, of course, depending on whatever application workloads we're looking at.
Right.
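A minimal sketch of the chunking described above: one request is cut into fixed-size chunks and each chunk is mapped to a server. The 512 KiB constant and the hash-based placement are assumptions for illustration; the conversation only mentions chunks of roughly 500 kilobytes.

```cpp
// A sketch of splitting one write into fixed-size chunks and spreading
// the chunks across servers.
#include <cstddef>
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>

constexpr std::uint64_t kChunkSize = 512 * 1024;  // assumed chunk size

void map_request(const std::string& path, std::uint64_t offset,
                 std::uint64_t size, std::size_t num_servers) {
    const std::uint64_t first = offset / kChunkSize;
    const std::uint64_t last  = (offset + size - 1) / kChunkSize;
    for (std::uint64_t chunk = first; chunk <= last; ++chunk) {
        // Hash path + chunk id so consecutive chunks land on different
        // servers, letting one large request use several nodes in parallel.
        const std::size_t server =
            std::hash<std::string>{}(path + '#' + std::to_string(chunk)) %
            num_servers;
        std::cout << "chunk " << chunk << " -> server " << server << "\n";
    }
}

int main() {
    // A 2 MiB write at offset 0 touches four 512 KiB chunks, so it can be
    // sent to up to four servers in parallel; a small request touches one.
    map_request("/gkfs/job42/output.dat", 0, 2 * 1024 * 1024, 4);
    return 0;
}
```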
Also, if I may add a bit to that, what we also found is that you basically could classify applications between those that were doing really large I/O phases and those that were doing mostly smaller read and write phases. Even if sometimes they could be somewhat mixed, the major patterns were typically biased to one side or the other. So the good thing about GekkoFS, since you tailor it for the application, is that you can configure the settings for your application easily and you can fine-tune. If you do a run on the supercomputer and you see that the performance is not as good as it could be, you can try tweaking the chunk size. You could also try doing an I/O profile of your application, then see exactly what your application is doing in terms of I/O, and see if there are some settings in GekkoFS that you could adjust that could benefit your application.
Yeah, yeah. So, I mean, Alberto, Marc, the challenge with, I'll call it, open source systems has always been that there are plenty of different knobs, I would say. And it's a bit challenging to try to decide what knobs to set to what sort of levels in order to, you know, install the system and run it and get effective throughput out of it. Do you guys have anything that, you know, provides guidance in that respect? I mean, I don't know how many knobs GekkoFS has, but I'm sure there's more than four, more than 10, more than 20.
Well, there are really only a couple of knobs that you can really, you know, turn. There is, for example, how much of the resources you actually want to use for the servers and so on. But in the end, where we started with it, there were actually very few settings, and then we could incrementally see what impact every little knob has. But in the end, because we're not POSIX and we're not supporting everything, the knobs also decrease, right? So that's also not really true.
No, no, but it is true that we started with very clear ideas on what could be tuned, but over time the minimum set of things that can be changed has grown significantly. So I would not say 20 knobs, but between how to distribute data, the chunk size, the number of compute nodes that you can use as I/O servers, whether you want them also behaving as compute nodes themselves for applications... I mean, if you start combining everything, yeah, it's not 20, but it's 10 to 15.
Yeah, fewer knobs is not a bad thing. I think RFC 1925, the 12 networking truths, truth number 12, states that in protocol design, perfection has been reached not when there's nothing left to add, but when there's nothing left to take away.
And I think that's very much the case with file systems as well.
Somebody else said that, Jason.
I'm going to have to look that one up.
But yeah, it wasn't about RFP 12.
Ross Callon wrote that back in 1996.
It was one of those April Fool's Day RFCs when he was working at DEC.
So it's a good one.
It's worth looking up.
RFC 1925.
Okay.
Okay.
Back to GekkoFS.
So I noticed in the paper you have both a memory mode and SSD mode.
I mean, so memory would be just using a chunk of memory as a device, I guess, as a storage device?
Is that how it plays out?
So basically for normal node-local storage, we just use whatever is available. This could be an ext4, could be an XFS or anything; as long as we have a path, we're pretty much happy. But of course, for memory, we're also depending on a path, in a sense. So we need to have a tmpfs or a ramfs available where we can then store the files. So it is, in a sense, memory support, but not on a byte level.
Right, right, right. So what you're saying is that on the local node, you have some sort of a file system that you need to map GekkoFS on top of.
And then you have a parallel file system that's behind it
where the petabytes of data are sitting as well.
I'm just trying to make sure I understand
the configuration levels here.
Well, there is actually a parallel file system below GekkoFS, but we are not accessing it continuously. I mean, in principle, the idea behind GekkoFS, and where the current research is headed right now in our case, is actually to use GekkoFS for data staging workflows, where you link or coordinate with the job scheduler: you tell it what your input resources are going to be, what you expect your output resources to be, and this coordinates with GekkoFS, deploys it, stages in all the data that the application is going to need, and starts the application, which in principle will run using only the compute nodes, without interfering with the parallel file system for all the temporary data that it would generate during the simulation. And at the end, when you're finished, you just stage out the useful results of the simulation to long-term parallel file system storage.
I was going to say, that's pretty much the lineup.
I got a couple of folks in my group that have done a lot of HPC,
and that seems to be kind of the standard way of where there's,
here's the big data store sitting on the back, and then basically every job that you do,
you're basically spinning up these large temporary systems to go through, do all this processing and then
to have it all torn down and put back up again. And the reality is, I'm sure you folks know, that doing that with like a GPFS or a Lustre, I mean, it's a very expensive and kind of encompassing, complex problem to deal with. Building those temp systems is not the easiest thing to effectively script out, right?
It's a, it takes a lot of effort to do that. So I'm sure any,
any type of, you know, easing of the, the kind of that,
that infrastructure rollout component where,
where you're building
that dynamically, that's a huge bonus for anything
in HPC. We used to call those things scratch files, right?
Yeah, 100%. I think one of the most important things, if you
have such a system, it needs to really spin up in seconds. You cannot really
wait. And also what is also really, really important
for our case that it is all in user level
so that you don't have to bother your HPC admins for some special settings if it's almost
sometimes difficult enough to get a few system and the permissions for that.
So in that sense, you can think of Gecko just as an IO application where the admins don't
really have to know anything about it, right?
The users can just use it and that's it.
I noticed in the paper and it may or may not be current as well, but you support
something you called eventual consistency for write activity.
You want to describe how that works out or what that looks like?
Is that correct?
I guess.
Yeah, so that's probably the first question, yeah, that's a little bit difficult. So in general we have two kinds of, yeah, views on that. One part is basically where an I/O request is accessing one specific thing in the file system.
So this could be you're accessing one specific offset in a file.
This could be you are accessing the metadata of one specific file.
These ones are what we would say strongly consistent, because they kind of honor the order in which the requests come to the server.
Where we don't really have strong consistency is when we're looking at listing a directory, for example.
Because in principle, what could happen
is that at the same time while a directory is listed,
files can be inserted there.
And then what the user will actually see
is not really what is actually in the system
because files are also created.
But again, since we don't have any locking mechanisms at all, and we're not locking anything, this is also one of the reasons why our system scales so well.
There's always this situation that requests can also overtake each other.
It happens very, very rarely.
But usually, we would say these direct operations
on files are strongly consistent. This is what users usually see. But LS operation,
like directory listings, things that HPC applications usually don't do anyway,
these are then eventually consistent. Yeah. Yeah. I used to run a GPFS system and I remember
ls hanging for, you know, you've got five, ten minutes waiting for one of those to return because of the file locking issue.
Yeah, we actually had an issue on MareNostrum where under certain conditions an ls with GPFS could take actually up to five minutes, and we managed to fix it by disabling the coloring of the
LS output because then the client didn't need to access all the metadata
to know what it was printing and to actually print one thing in yellow or
another thing in blue. It was kind of crazy. Wait, wait, wait. So I got a problem with
no locking here. So what happens if I'm opening up a new,
if I'm creating a file and stuff like that?
I mean, you know,
so you've got to have some locking on this metadata structure,
even though it's A, it's distributed.
But, you know, if I'm creating a file, I mean,
you're not locking the directory to create that file while you're creating it?
No.
Tell me you're locking the directory.
No. We're not locking the directory. Your guys' use case and application is pretty different, right?
Primarily, most of the writes that are actually happening are during that whole stage in component.
Is that right?
Well, he's got to create output.
The model data's got to be.
You can also create the output.
There's output being done here, Jason.
I know, but I create the output on a different volume.
Like you've got one that's a read volume
and one that's a write volume.
That seems to make sense in an application like that.
But the difference is that we are not locking directories, basically because directories themselves, as metadata entities, don't exist in GekkoFS, at least in the wide striping mode. Because, since we're hashing the path, each different path, even if they live under the same directory, is effectively different from each other and lives on a different I/O server in GekkoFS. For actually modifying the entries for a single path, we don't have locking, but we do have transactions in the internal database that we are using.
Oh, okay.
So you do transaction rollback if there's a problem
or something like that, if that needs to be done, I guess.
But I mean, really for the applications that you guys target to, I mean, this is, I mean, I'm not going to run this if I'm a large bank and all of my transactional data is going to go on it.
This is probably the wrong thing to put on the back end. But for the HPC data that you're generating,
I mean, especially when you look at like deep learning models and things like that, you know, it's not as necessary
having a strong locking mechanism.
And like you said, you know, one of the biggest issues
in, you know, those larger parallel file systems
is the fact that there is locking
and it is blocking you from getting tasks done.
Yeah.
But locking is a good thing.
It's not necessarily.
Yeah, it's slower.
Not when you've got 18,000 nodes trying to access one file.
No, and also, locking is good, but it's not
as good for performance when everyone is doing locking.
When you have your application that is trying to synchronize itself with a synchronization mechanism, and you also have your middleware that is also using synchronization mechanisms, and then you also have your I/O layer that has its internal mechanisms, then you see
mechanisms, and then you also have your IO layer that has its internal mechanisms, then you see
that everyone is trying to synchronize access. And in the end, everyone is doing the same.
So what we tried to do was to strip everything.
If you already have an application that is synchronizing itself to read and write data, why do we need to do additional work to guarantee that? And if you are not doing it,
then we can talk to you and see what the best way to synchronize the I.O. of your application would be.
I'm not convinced.
But, okay, for instance, like you talk a lot about deep learning.
So with deep learning, you've distributed the training data, so at some point those models have to be re-synced or reduced. I think you'd call it all-reduce or something like that in the paper.
That process is both, you know, we're going to write out each of our own individual models.
Some process in this cluster of 18,000 nodes, Jason, is going to look at this model and try to
reduce it to one model and then publish that to all these other nodes. And then at that point,
they're going to take off with the next round of data.
Isn't it?
Isn't that how this works?
Yeah.
But you're also looking at billions of objects, right? You're doing this. So it's like, you know, it'll get it on the next training cycle. Right.
Maybe.
Yeah.
No luck.
Yeah. That's, I mean, for deep learning applications, I mean, mostly for the input, it's, you know, read.
But again, for writing this, this is not an issue.
And normally, even if these models need to be synchronized, this is not really a problem for us.
And again, the reason why, in deep learning applications, you need to basically move the input data around, so that you have this randomness and you don't have any biases in the end, these reasons really only exist because you don't really have a distributed system and you use these shards that you are putting on each local drive. And if you're using GekkoFS, of course, you have one big space, and you don't need to do any of that. The whole deep learning application has access to the complete input data and can access it wherever it wants. And it doesn't need to move the input data around so that the other nodes get access to it. I guess that was somewhat confusing in the paper
because you talked about sharding versus non-sharding and things of that nature. So
in effect, GekkoFS in a normal deep learning environment would say, okay, I've stitched together this, I don't know, 10 terabyte data set that you want to train on. And each of these guys has, you know, a 500 gigabyte SSD on it. And I've got, you know, 10 of these guys, so I can do 5 terabytes or something like that. I don't know.
Right, exactly. And then you need to figure out how you wrap together these shards so that they don't end up with biases. And this is very difficult, because sometimes these biases are not really obvious. If you don't have to do that, if you just push these 10 terabytes of data into your GekkoFS, which spans multiple SSDs, then you don't have to do any of that.
So that's like the real advantage.
And talking to the deep learning people at our university as well, they would be really happy with the workflow if they used GekkoFS, because then they'd have to deal with a whole lot less.
Let's just say it that way.
Oh, yeah, yeah.
They don't have to shard it.
They don't have to decide what the biases are, what's non-biased approach to sharding
and all of this stuff. Yeah, agreed. I understand that. I'm still not too good with this locking stuff, but I guess I'm going to let that slide.
Yeah, so maybe to add a bit more to it: of course, as Alberto already said, for the metadata we have RocksDB, and of course there is some ordering involved there, right? It is not really 100% transactional, but in a sense.
And also, we have a POSIX-compliant, strongly consistent local file system on each server. So this already has local locking. What we're saying is we just don't need this locking on a global, remote-node scale, where the locking happens over the network.
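A minimal sketch of how per-path metadata updates could be grouped into one atomic RocksDB write batch, giving local ordering without any cross-node locking; the key and value layout here is invented for illustration, not GekkoFS's actual schema.

```cpp
// A sketch of per-path metadata updates applied as one atomic RocksDB
// write batch on a single server.
#include <iostream>
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

int main() {
    rocksdb::DB* db = nullptr;
    rocksdb::Options options;
    options.create_if_missing = true;
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/gkfs_md_sketch", &db);
    if (!s.ok()) {
        std::cerr << "open failed: " << s.ToString() << "\n";
        return 1;
    }

    // "Create" a file: both fields for the hashed path's entry land
    // atomically, so a local reader never sees half an entry.
    rocksdb::WriteBatch batch;
    batch.Put("md:/gkfs/job42/output.dat:size", "0");
    batch.Put("md:/gkfs/job42/output.dat:mode", "100644");
    s = db->Write(rocksdb::WriteOptions(), &batch);
    if (!s.ok()) {
        std::cerr << "write failed: " << s.ToString() << "\n";
    }

    delete db;
    return 0;
}
```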
Marc, you can't tell me there's such a great advantage to having this, you know, single global file system across all the nodes, so I can ultimately, you know, randomize them on demand, but then in the same case say, you know, the locking within a node is good but locking across nodes is not.
I don't know.
Like I said, we need to set that aside because that's a whole hour discussion or longer between us all.
So, you know, there are plenty of players in this space.
They all seem to be wanting to do this sort of thing.
You know, BeeGFS, is it BeeGFS or BFS or something like that?
They're another one of these types of file systems.
What do you call them?
What did you call them?
It was booster, boosting?
Burst buffer.
Burst, BurstFS, right.
So BFS.
So why are you guys so much better than BFS?
BeeGFS?
So for BeeGFS, I think you had Frank Herold on already a couple of episodes ago.
Yeah, we did.
Yeah.
So normally they also have a parallel file system similar to what GPFS and Lustre are doing, but they also have this burst buffer mode that they call BeeOND. And in that sense, they are completely, strongly POSIX compliant, but they're also kernel file systems. So spinning these BeeOND instances up requires administrative access.
But once you run BeeOND, similar to how you would run GekkoFS, we have all of these comparisons in one of our papers. They're also quite strong. But again, I think one of the major advantages for us is really that just the user can do that, and you really have this really good metadata performance.
So user space versus kernel space, yeah.
So user space versus kernel space, yeah.
I would never think user space
would actually be quicker than kernel space.
That's a different question.
It may not be quicker, but it's also if something goes awry,
you don't kernel panic the machine.
I see.
Okay, maybe.
I've seen GPFS do that many a time as well.
And, yeah, GPFS always had the great ability to taint the kernel,
and then anytime you wanted to do any specific kernel modifications,
it was a giant pain.
Let's not go there.
I understand the problem.
All right. So you guys are open source. Is there like a support contract that somebody could buy if they were so interested in doing something like this, or not? Most of the guys we talk to that are open source have an enterprise solution that they're willing to, you know, have customers spend money on and stuff.
Well, we're not at that point yet. We believe, we would really be delighted to go that way. There have been certain talks in that direction, but nothing is yet confirmed, so I cannot say any more.
I understand.
When you guys are looking for VC money, let me know.
Jason's got this small company behind him for some reason.
I don't want to talk about it.
Let's see.
So Big Data is another area like this, where they do a lot of data reading and not a lot of data writing, to some extent. Do you see this as a solution in that space as well? I mean, you know, data analytics? I mean, I guess Hadoop, Spark kinds of things.
Well, it's a bit difficult.
It always depends a little bit on... So we're not a backend for, I don't know, an SQL database or anything like that. But it always really depends on the workload there as well. You can start to classify into these burst applications, checkpointing applications, all of these things. But in the end, you really need to look more closely.
Well, if you have a compute-bound application that doesn't really write a whole lot, and where the parallel file system can keep up without problems, then there's really no point in using it. But the more applications are running on the parallel file system, the more interference there is. We have seen this already. We have run a couple of experiments on MareNostrum over a month at different times. This was always the same workload, and you actually see, like, orders of magnitude differences in the performance that the user actually gets, right? So yeah, it really depends on what the system state is, what the application workload is, and so on. Yeah.
Interesting. I like orders of magnitude speedup. I'm not sure.
Yeah, actually, that's okay. It's not an order of magnitude speedup, it's a slowdown, depending on who you're sharing your I/O with, your GPFS I/O with, and what they're doing with it. You can see that your application is definitely slower than it should be.
Right, right, right.
Huh, I was going to ask, so, RAID level, do you guys do any sort of RAID level across these SSDs, across the nodes and stuff like that? What sort of data protection do you offer? Erasure coding, 13 plus 2 or something?
Well, erasure coding is planned.
We're actually working on it.
To be due.
TBD.
No, no, no.
I mean, well, initially we didn't plan
to have any kind of,
because it actually didn't matter
for the HPC workloads that we were working on.
Because the data is elsewhere.
It's temporary, right? It's basically all temporal data.
Yeah, and most simulations already have in place a system for checkpointing.
So it didn't make sense for us at the time to actually do the same work twice.
But down the line, if we're trying to move a bit beyond HPC
and trying to make the file system more useful,
we actually need to have some data protection schema in place.
And given that our main work mode is fully distributed, the current design is going to use erasure codes for data protection.
That would make sense.
Right, right, right.
I can't think of anything else, Jason.
You have any last questions for these guys?
Besides the locking question.
You don't want to ask the locking question.
We have Supercomputing coming up. Were you guys planning on actually going to the Supercomputing conference that's coming up? I think it's in Dallas, I believe, this year.
Yeah, so I will be there. So yeah, I'm happy to talk to everyone about it. But yeah, I will be there, and we'll also be looking forward to the next IO500.
If there is maybe a new surprise contender, you never know.
It's usually every half a year, the new fastest one coming up.
That's suddenly very fast.
Yeah, yeah.
I got a couple of people on notice in your way.
Yeah, it'll be nice.
There you go.
There you go.
Listen, Marc and Alberto, is there anything you'd like to say to our listening audience before we close?
So, yeah, really, thanks a lot for having us on.
It was fantastic to talk to you.
And it has been the first time for a podcast for us, but it has been very interesting.
And, of course, for GekkoFS, feel free to talk to us. We're always happy to talk to you about your workloads and how GekkoFS could help, if you have any issues with it or if you want to open any tickets. We are on GitLab; just type it into Google and you will find it. And yeah, we're always happy to help and looking forward to it.
Yeah. Alberto, anything?
No, basically what Marc said. I mean, we're really happy for the exposure.
We're really interested in the workloads that people can come up with.
We're really eager to come up with different ways to optimize GekkoFS and to find newer requirements.
I really would be really happy if we can actually come to a decision on the locking.
Do we need it?
Don't we need it?
I'm thinking of another knob here, but that's another question.
I don't know.
I don't know.
You know, there's trade-offs, right?
There's always trade-offs.
And you've got to make some decisions early on about what trade-offs you want to go after and what trade-offs you don't.
I understand that. In your particular application, I think it makes a lot of sense. I wasn't going to say that online, but I guess I had to.
All right.
All right.
That's it for now. Bye, Marc and Alberto. It was great having you on the show today.
Thank you, Ray. Thank you for having us.
Yeah. Thank you, Ray. Thank you, Jason.
All right. And bye, Jason.
Until next time.
Next time,
we will talk to another system storage technology person.
Any questions you want us to ask,
please let us know.
And if you enjoy our podcast,
tell your friends about it.
Please review us on Apple Podcasts, Google Play, and Spotify,
as this will help get the word out. Thank you.