Grey Beards on Systems - 117: GreyBeards talk HPC file systems with Frank Herold, CEO of ThinkParQ, makers of BeeGFS
Episode Date: April 19, 2021

We return to our storage thread with a discussion of HPC file systems with Frank Herold (@BeeGFS), CEO of ThinkParQ GmbH, the makers of BeeGFS. I've seen BeeGFS start to show up in some IO500 top storage benchmark results, and as more and more data keeps coming online every day, we thought it time …
Transcript
Hey everybody, Ray Lucchesi here with Matt Leib.
Welcome to the next episode of the Greybeards on Storage podcast, a show where we get Greybeards storage bloggers to talk with system vendors and other experts to discuss upcoming products, technologies, and trends affecting the data center today.
This Greybeards on Storage episode was recorded on April 13, 2021.
We have with us here today Frank Herold, CEO of ThinkParQ, the makers of BeeGFS.
So Frank, why don't you tell us a little bit about yourself and what BeeGFS is up to these days?
Thank you, Ray, and thank you, Matt, for the invitation and for having me on this call.
So as I said, I'm Frank Herold, CEO of ThinkParQ.
ThinkParQ is the company behind BeeGFS.
So most likely many people know BeeGFS; not that many know ThinkParQ.
We are a German-based company established in 2014 as a spin-off of the Fraunhofer research organization, where BeeGFS originally was developed. So as of today, we are still a small team, but we operate internationally. We have deployments worldwide. And I would also like to highlight that BeeGFS comes in two flavors. One is what we call publicly available source.
The other one is where we offer commercial support behind the product.
So we develop the product, we maintain the product, but we also offer support for the product.
There are a couple of functionalities in the product that require customers to sign up for a support contract. But for general use, everybody can also use BeeGFS as publicly available source.
So is BeeGFS focused on high-performance computing or some of the genomics types of stuff?
What's your general market for BeeGFS?
And tell us a little bit about what BeeGFS does.
Right.
So BeeGFS is a parallel file system which fits practically everywhere you have the need for high-performance throughput,
where you have a high number of compute nodes you would like to leverage your storage against.
So, yes, the classical approach, and that's also where we are coming from,
is HPC. But I think these days the term HPC is changing massively, and many different customer profiles go under HPC, besides the classical large HPC shops. But you are right, life science is a big market opportunity for us. We have quite a number of customers there, and it is everything above traditional filers, where you need more capacity, where you need more throughput, where you have more ambitious jobs running against a cluster. We also have quite a number of deployments in oil and gas and in manufacturing.
And recently, I mean, recently means in the last two or three years, specifically the deep learning segment has come up a lot in our customer base.
I mean, the challenge with deep learning is, it seems, lots of small files.
We used to say images were large, but images these days are fairly small compared to some of the other files that HPC is running, that sort of stuff.
Does BeeGFS handle both small files and large files effectively?
I mean, it's always been a challenge to mix and match those sorts of workloads.
I fully agree.
That's why we are also quite sensitive in discussing opportunities with customers.
I would say at some point you win, at some point you lose. It depends very much on how applications are accessing the data.
We have seen quite good performance results, also in benchmarks at customer sites where they compared BeeGFS against alternative file systems.
So I would say BeeGFS is a very high-performing file system.
We have, you can call it a feature:
we do not have that many tuning parameters.
So we get quite sufficient performance straight out of the box, without optimizing and tuning in one or the other direction, like small or large files.
And that is one of the advantages of BeeGFS: having excellent performance from small to large files in the same file system.
Frank, this is a server-based storage platform?
Yeah, it is. I mean, at a high level, on the architecture and how we operate:
we have the capability of what we are calling a converged setup, where you put all the services on a single machine, or quite a handful of machines, but you can also segregate.
And that is where consulting needs to come in and discuss with customers: what is the best approach? What is the best setup?
We can also spin out the metadata services, and that is specifically for very large file systems, if we talk millions and billions of files, and also when you have to handle small-file requests frequently, where metadata is probably better separated and isolated on dedicated machines.
Very high performance, low latency.
While on the storage server itself, you can play with anything between spinning disk up to NVMes, all in one namespace. And you can dictate where data goes while storing this data.
So you can tailor your performance to your workload requirements based on where you want to put the data in the backend storage.
And that makes an interesting point.
So this is really a software-defined solution, right?
You don't actually sell hardware solutions to this.
And in this sort of environment, you mentioned the converged solution.
So a converged solution would have, I guess, all your services running on a single server with lots of storage behind it.
Is that how that would look?
That could be one of the options.
You can also have two or three servers, which we are still calling a converged setup, where you have on each of those servers a metadata service and a storage service.
So you can balance the metadata a little bit across multiple nodes, but it doesn't give you all the value of fully separating them.
And that makes it super flexible, where we can also customize on customer needs exactly what they are looking for.
While on the storage server, I mean, as long as we see a device, we can map it into the file system.
And as we have talked about AI, we quite often see customers leveraging cost-sensitive storage for larger amounts of data, as the streaming performance out of spinning-disk RAID setups is still good enough, I would call it.
While for the small files, we leverage NVMes or SSDs. But the customer sees this as one single namespace. That is a key advantage.
Across the whole cluster of however many servers he has, and that sort of stuff.
So you do not create silos of storage, which makes it quite complicated in managing data and data flows.
And so you're effectively accessing the data on directly attached storage that's attached to these storage servers, per se. Is that how this works? And you don't care if it's NVMe or SAS-attached or SATA-attached storage, as long as it's storage, I guess.
I don't care.
Even if you sometimes have a storage virtualization layer in between, it still works.
So we are hooking in on top of the Linux file system presentation layer.
So if you have an XFS or ZFS displayed in the Linux OS, we can map the BeeGFS file system on top of it.
I got you. I got you. And as far as the host is concerned, is there client software that runs in there, or is it POSIX compliant, or how's that? Is it NFS and CIFS or SMB?
Per default, it is.
We call it POSIX compliant, but I think POSIX is still a big word,
and only very few file systems are really 100% POSIX compliant. So we say we are POSIX compliant, supporting the majority of commands defined in the POSIX standard.
We have never seen any trouble with that.
You have the possibility to re-share, re-export the file system to create NFS or SMB shares, as we still have quite a number of customers which quite often have clients with lower performance requirements but still need access into the same file system.
And when you say reshare, that would be done at the host level or that would be done at
the storage server level or the metadata server as a management level?
I'm trying to understand how that would play out.
The technical answer is it depends.
So you can do both.
Consultant.
Right.
I mean, you can do it straight out of the storage server.
And that's typically what we do in smaller deployments.
But in larger deployments, customers also quite often like to have dedicated server machines just for the reshare.
And is there a management layer here?
Is there a management server?
Management service, I guess, is the right word to use, right?
Right.
It's a software piece.
It's a management service, which needs to run on one or two of those machines.
But the management service is just, I would call it, the traffic cop. It does not really carry a lot of load; it just needs to be up and running as a service.
We also have a monitoring service on top, for checking the health status of the services and daemons we have running. But that is all background work that doesn't really require a lot of memory or CPU. So it can be on one of those machines.
Right, right, right. And do you guys support data protection, RAID kinds of solutions, or how does that work in BeeGFS?
As I said before, I mean, we span a virtual file system across the underlying XFS or ZFS.
So typically at this lower level, customers have some sort of data protection.
We do not have data protection per se in BeeGFS itself, but we have some capabilities, as of today, of mirroring data for customers and for specific data sets where it makes sense.
Of course, you can get a bit of a performance advantage by mirroring the data, specifically on reads, since you can read the data from both storage servers.
On the other hand, you need much more capacity, exactly twice the capacity, for storing the data.
What about replication strategies?
I mean, the mirroring is some sort of internal replication that we have. We have deployments in the field where we leverage third-party tools, where customers are replicating data into a second replica of BeeGFS. We do not have in-house-built replication in the product.
I see. So it's all done external to BeeGFS itself, I guess.
And so your metadata becomes an extremely important characteristic of these sorts of things.
So how is your metadata laid out? Is it, you know, cached, that sort of thing? I mean, there are so many things you can do with metadata to try to speed up the process, but the counterpart is that it has to maintain very strong data integrity so you don't lose files and stuff like that.
Exactly. That's the point. I mean, metadata is one of the most critical components. I mean, if you are losing the metadata, you are effectively losing all the data,
since, with data striped across multiple storage servers, you cannot recombine those blocks or chunks we have written on the storage backend without it. So metadata is critical.
For metadata operations, we cache in memory on the metadata server, quite a large amount of data, but we flush this continuously to the database underneath to have that consistency layer. But yes, metadata is absolutely critical in terms of keeping the data alive and intact on the storage devices, which should also be very low-latency devices.
So I've noticed on your website that one of your preferred architectures is to segregate the metadata layer out to a separate storage element.
Like almost having a metadata server kind of thing, right?
Effectively, it is a service,
but we prefer to have the metadata service running on isolated machines
where they have the full bandwidth, full memory,
and full CPU dedicated to the metadata service,
as this is where we see the best results,
specifically if we go into high-performance solutions.
And what types of servers are we talking about?
Are they all sort of x86-based,
or is there a leveraging of GPU
or coprocessor elements that the file system can take advantage of?
The majority of customers are still running on x86 machines. That's what we find all over. BeeGFS itself is quite hardware agnostic, so we also have deployments on OpenPOWER, on ARM, on AMD machines. That's what we're also supporting. From a BeeGFS vendor perspective, I must say we don't care per se. But yes, we have certain experiences and blueprints on what works best to reach the one or the other goal customers are defining.
And what about the underlying networking layer?
Do you have to have InfiniBand or RDMA, or what?
It's not or, it is and.
So we play on InfiniBand, we play on GigE, we play on Omni-Path.
We also have customers running mixed environments.
So that's up to the customer.
I mean, we all know the pros and cons on various interfaces.
But effectively, I mean, it is up to the customer to define.
And sometimes there are technical requirements.
Sometimes there are budget requirements.
Sometimes there's history with customers, where they're using one or the other technology.
But we also quite often see deployments where they have, for example, InfiniBand at the 100 and 200 gig interface layer. That's what we can play with.
And so, you know, obviously these sorts of environments like to scale pretty large.
Is there a limit to A, the amount of storage, B, the amount of files slash directories?
You keep mentioning a single namespace as well.
Is there a single namespace across the whole cluster?
Can it all be defined as a single namespace?
Yes. I mean, in contrast to competitors, I would say we are coming more from the smaller end and going step by step toward larger deployments. So we have double-digit-petabyte deployments at the larger scale, which we are calling large for BeeGFS as of today. I mean, large for me doesn't really mean capacity; it is more the number of files and the number of operations, as this is the headache for a file system.
That's where we play quite well. We have not seen any customer really suffering from scalability limits, and I'm not aware of hard limits in the architecture. Of course, I mean, if you have more metadata servers, at some point the metadata communication takes up more time than having it on fewer metadata servers.
So that is what you need to balance.
And I agree, at some point it might also be a good choice to put this in a second namespace, to isolate performance for dedicated jobs.
I see. I see. So the reason you'd want to split that is because you could split the metadata services across the two different namespaces.
Is that how it would work?
That could be one way. We could also play it so that you have a number of metadata servers assigned to namespace one, and a number of metadata servers assigned to namespace two.
And with the failover mechanisms, you can also leverage the second one.
I see.
And that would be a smart way, also a quite cost-effective way, where you have full capacity and full bandwidth while operating the entire system, and even in a degraded mode you can still take advantage of it.
You open up an interesting story here, failover.
So you can failover metadata services.
You mentioned that there is a capability to do that.
What about the storage side of things?
It's the same mechanism we are using underneath, so it's the same possibility there as well.
I have mentioned that we have the capability of mirroring the data on the storage servers. So that means that we have one active and one passive storage server. If one storage server, the primary, fails, the secondary will become the active one.
Right, right, right. So it's effectively a mirrored, RAID 1 kind of solution.
Keep in mind, it is on top of typical RAID protection on a storage layer customers are
using, I mean, soft RAID or hardware RAID.
So it is a double layer of security.
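[For illustration only, here is a minimal Python sketch of that active/passive read path. The class, method names, and failover logic are all invented for this example; they are not BeeGFS internals or its API.]

```python
# Toy sketch of active/passive mirrored reads -- invented for illustration,
# not BeeGFS internals. Each mirror group pairs a primary and a secondary
# target that hold identical copies of every chunk.
class MirrorGroup:
    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def read(self, chunk_id):
        # Reads may be served from either copy (which is where the extra
        # read performance mentioned earlier comes from); if the primary
        # is down, the secondary simply becomes the active one.
        for target in (self.primary, self.secondary):
            try:
                return target.read_chunk(chunk_id)
            except ConnectionError:
                continue
        raise IOError(f"both mirrors unavailable for chunk {chunk_id}")
```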
And my question quite often to customers while discussing this is also: how much security do you need for your data?
I mean, if we talk about scratch, just repeat the job if there is a failure.
Don't waste too much budget on the security layer.
If you have this as a permanent repository for your data, then data security and integrity become much more important.
Right, right, right, right.
I mean, a lot of HPC jobs are doing much of their data activity as scratch workload kinds of things.
But some of it obviously has to persist beyond that work environment and that sort of thing.
So I understand that logic.
Do you tier behind the solution? I mean, if you've got, you know, mixtures of storage that are, you know,
NVMe SSDs versus disk storage,
do you move data, hot data to NVMe
and slow data back to disk storage
or anything like that?
We have a functionality
which is called Storage Pool.
So with Storage Pool,
you can define on a per-directory basis where the data will land when writing into those directories.
If you have a performance directory, it goes on NVMe, while a capacity directory goes on HDD, as an example.
The admin also has, manually as of today, the capability of moving data on a per-job basis, or whatever the definition is, from hot into capacity. We have deployments, for example based on iRODS or Starfish, where they collect a lot of metadata and also have some policy definitions, actively using our CLI interface and moving data underneath from a performance tier into a capacity tier.
The reason why we have implemented it this way as of today is that I strongly believe that if you have a fixed, defined workflow, then you can also define fixed policies for when data will be moved from hot to capacity, for example.
While in many customer deployments we have, with all due respect, the data storage is quite chaotic.
So a researcher will never define which data needs to be where.
They all want to have data always on the performance tier.
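[As a concrete illustration of the per-directory storage pool idea, here is a minimal sketch in Python. The pool IDs and directory paths are assumptions, and the beegfs-ctl flags follow public BeeGFS documentation but should be checked against your installed version.]

```python
import subprocess

# Assumed layout: pool 1 = NVMe ("performance"), pool 2 = HDD ("capacity").
# Pool IDs, paths, and flags are assumptions; verify with
# `beegfs-ctl --liststoragepools` on your own cluster.
POOL_FOR_DIR = {
    "/mnt/beegfs/projects/hot": 1,      # latency-sensitive, small-file work
    "/mnt/beegfs/projects/archive": 2,  # large streaming data
}

for directory, pool_id in POOL_FOR_DIR.items():
    # New files created under `directory` will land on the given pool.
    subprocess.run(
        ["beegfs-ctl", "--setpattern",
         f"--storagepoolid={pool_id}", directory],
        check=True,
    )
```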
What about nuances, Frank, like compression and deduplication
and that level of control over size requirements?
We see those requirements.
We discuss them. We do not have them implemented yet, and we are still trying to make up our minds and watch how the acceptance in the market develops.
I get the point that all these technologies are some sort of data reduction, so for storage costs there are huge advantages. That I get.
But on the other hand, I'm saying, you know...
Well, you don't want to take a performance hit, right?
Yeah.
You are buying a Ferrari for a good reason,
and you're not trying to use the Ferrari to tow a caravan,
to put this in an analogy. And that's what we are still trying to discuss with customers and with partners: what is the best balanced approach to fulfill those kinds of requirements. We need to invest in this; we are still in the research phase on those functionalities.
Fair enough. It strikes me that with enough server-based storage componentry and enough processing power, you might be able to carve out those sorts of more nuanced functionalities that attempt to save space without performance hits.
But yeah, I understand completely what you're saying.
And that's always the trick, isn't it?
You've got to balance the potential benefits of functions like that against the value of
the performance data that you're getting.
Right. And that is exactly the point.
I mean, people using a parallel file system are looking primarily at performance.
And many customers, just using it as scratch, want super high performance and low latency for the data.
And I mean, that's also why many customers are really looking at NVMe deployments, or a mixture of different technologies, to really get performance out of the box.
Everything else kills performance.
And there are also technologies
like object stores on the market,
which are excellent in storing large amounts of data,
quite cost-effective,
with all these
deduplication technologies.
Excellent.
But they are purpose built for exactly this.
So you mentioned parallel file systems a couple of times.
Most of the enterprise listeners we have talked to don't understand or don't realize what
a parallel file system really represents. So why don't you kind of clarify the difference between a normal file system and a parallel file system, if such a thing exists. I know parallel file systems exist; I'm not sure what a normal file system would look like, but I guess NFS or SMB.
I mean, it is the same with NFS or SMB devices: you have one head and you have a massive amount of storage behind it.
At some point, the aggregated spindle performance of the storage is higher than what your head node can deliver on the network interfaces to the number of clients. That works as a repository file system that delivers good, solid performance. Nothing against this. If you want to go above that, then parallel file systems come in. So what parallel file systems
are doing is that you have multiple heads, and that's what we are calling the storage servers, or OSSes; there is different terminology everywhere, but effectively it is a head with storage underneath.
The files are spread across multiple servers, and that's the big difference.
So to make a simple example: you have a one-megabyte file. You divide it into four chunks of about 250K each.
You spread them across four servers.
While reading that data back, the client leverages the performance of up to four servers in this example.
And that brings performance.
And that performance is primarily throughput, right?
I mean, you're increasing the throughput capabilities of the system.
Sure.
And that is also back to the initial discussion we had, which is one of the problems in AI.
If you have very small files, this striping of data across multiple servers doesn't make much sense.
And that's where you need the intelligence in the system about when you want to split data, and where you probably put data on a single server only.
And as you take in large amounts of data, you also try to avoid creating hotspots on a single server. Otherwise you might have all the small files going to storage server one while larger files are spread across four servers, and storage server one gets overloaded over time.
So you have optimizations that try to ensure small files are spread across multiple servers, even though each file might only be on a single server, kind of thing.
Right.
And that is the default deployment.
That's nothing where you really need to tweak and tune.
And that's also why I said we get quite decent performance out of the box, without going into deep tuning sessions.
And that's also why BeeGFS is super easy to deploy and gets good out-of-the-box performance as is.
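[To make the striping discussion concrete, here is a toy Python sketch of the placement idea Frank describes. The chunk size, server names, and small-file rule are invented for illustration; they are not BeeGFS's actual defaults or code.]

```python
# Toy model of parallel-file-system placement -- invented for illustration.
CHUNK_SIZE = 256 * 1024                      # assume 256 KB chunks
SERVERS = ["oss1", "oss2", "oss3", "oss4"]   # four storage servers

def place_file(size: int, next_server: int) -> dict:
    """Map a file onto storage servers the way a parallel FS might."""
    if size <= CHUNK_SIZE:
        # Small file: keep it whole on one server, rotating the target
        # across files so no single server becomes a hotspot.
        return {SERVERS[next_server % len(SERVERS)]: [(0, size)]}
    # Large file: split into chunks and stripe round-robin, so a client
    # reading it back pulls from up to len(SERVERS) servers in parallel.
    layout = {}
    for i, offset in enumerate(range(0, size, CHUNK_SIZE)):
        end = min(offset + CHUNK_SIZE, size)
        layout.setdefault(SERVERS[i % len(SERVERS)], []).append((offset, end))
    return layout

# A 1 MB file lands as four 256 KB chunks, one per server:
print(place_file(1024 * 1024, next_server=0))
```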
So, you know, I noticed that there are a couple of vendors out there that are also partnering up with BeeGFS.
You want to talk about those?
I mean, are these partners, I guess?
Is that how you would consider them?
We have two kinds.
I mean, we have direct end customers, which mostly come from our history, where we started with direct contacts, mainly in Europe, with research organizations. But for pushing BeeGFS forward, and for scaling the operation of the company, it makes absolute sense to have partners. And as we have also seen, we are just software-defined, so you always need someone fulfilling the entire requirements and delivering an end-to-end solution to a customer base.
So you need to bring in hardware, infrastructure components, but also intellectual property to put everything together based on the requirements the customer has.
So we are the expert on the software-defined storage level for BeeGFS, but we want to partner with partners who know exactly what these hardware architectures can deliver, in hard numbers and figures.
We are not the hardware expert.
That's one of the arguments why we go with partners.
We have what I would call regional partners, operating in a specific region or a specific vertical, across the globe in Europe and North America, but also in the APAC region. And in the last two years, we have also started to engage with a couple of larger partners like NetApp and Dell, and specifically in China with Inspur.
Both Dell and Inspur have started to build appliances around the product, which helps us a lot, as we then have standard hardware setups along with BeeGFS which are burned-in, documented, and performance-specified, so quite easy out-of-the-box deployments that don't create a lot of hassle. Handmade one-off solutions, by contrast, can take a bit longer in the burn-in phase for a customer to get something really up and running to their exact requirements. We need both, since not every customer is a general-purpose buyer, and we have some customers with very specific requirements.
So that's why I think we play quite well on both sides.
And on top of these partner engagements, we also have what we are calling technology partnerships with some other hardware and software vendors, where we try to do things together.
I mentioned iRODS and Starfish before as examples for data movement, where we are working together.
With Bright, for example, we also work together.
With Slurm, we have a deeper integration.
We also talk with NVIDIA on a couple of things.
Slurm, you mentioned Slurm. That's like a job scheduling and orchestration layer kind of thing, I guess. I don't even want to go there. It's a different discussion.
I agree,
but I would, I mean, just
to give you a little idea of why I'm bringing this
up and why this is important for us.
We have one specific flavor customers really like, what we are calling BeeGFS On Demand, or BeeOND. If we think about the initial architecture we have discussed, where we have these dedicated storage servers and metadata servers and so on for a kind of repository performance file system, with BeeOND we have the possibility to spin up a kind of scratch file system on the fly on the compute nodes. So if you have thousands of compute nodes and an SSD or NVMe hanging around in those units, you can deploy BeeOND.
On your side, it's just firing up a script, deploying BeeOND on X number of compute nodes. And then you have a kind of temporary file system.
You run some jobs on this, or just a single job. And putting Slurm in place means you can have the scheduler integrated with BeeOND. So the Slurm integration spins up the temporary file system,
moves the data into it, runs the job,
moves the results from the temporary file system into the repository,
removes the temporary file system,
and restarts with the next job session.
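[To make that loop concrete, here is a rough sketch of a per-job BeeOND wrapper. The beeond flags follow the public BeeOND documentation but may differ by version; the node file, paths, and job command are placeholders.]

```python
import subprocess

# All paths and the job command are illustrative placeholders.
NODEFILE = "/tmp/job_nodes"     # nodes allocated to this job (e.g. by Slurm)
LOCAL_DIR = "/mnt/nvme/beeond"  # per-node NVMe scratch backing BeeOND
MOUNT = "/mnt/beeond"           # temporary parallel FS mountpoint
REPO = "/mnt/beegfs/results"    # permanent repository file system

def run(cmd: str) -> None:
    subprocess.run(cmd, shell=True, check=True)

# 1. Spin up a temporary BeeGFS across the job's compute nodes.
run(f"beeond start -n {NODEFILE} -d {LOCAL_DIR} -c {MOUNT}")
try:
    run(f"cp -r {REPO}/input {MOUNT}/")       # 2. Stage data in.
    run(f"srun my_job --data {MOUNT}/input")  # 3. Run the job on fast scratch.
    run(f"cp -r {MOUNT}/output {REPO}/")      # 4. Stage results out.
finally:
    # 5. Tear the temporary file system down, freeing the NVMe for the next job.
    run(f"beeond stop -n {NODEFILE} -L -d")
```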
That's very interesting.
So this is kind of like a try-it-and-buy-it kind of thing.
Try it and buy it.
I was going to ask if you have some sort of a downloadable, you know, software solution
that you could just, you know, people can look at and bring in and try it out and stuff
like that.
So this sort of, you called it BeeOND.
So BeeGFS BeeOND, is that how I'd call it?
Yeah.
I mean, we just call it BeeOND.
Okay.
BeeGFS is the file system name, and BeeGFS On Demand is the marketing brand name for this specific flavor.
It runs BeeGFS underneath, for sure.
But back to your initial question. I mean, we have this commercial offering as ThinkParQ for BeeGFS, which is a subscription to the software for X number of years, where we also provide commercial support.
But we also have what we call publicly available source.
And open source is a big term.
It is some sort of open source, but not 100% open source. That's why we call it publicly available source.
You can download it at beegfs.io.
You can look into the source.
Even the public source comes with full features and functions.
It's not restricted or limited in any way,
but we are asking customers who have extended use, specifically of what we are calling enterprise features, to sign up for a contract.
So we probably have quite a large customer base in the field, in the community, while we also have quite a number of customers on contract.
And so the contract provides both support as well as the extended feature functions, that sort of thing.
Right.
Right.
So back to this Beyond solution.
So effectively, you could deploy this across any number of compute servers that you have.
With Slurm, it fires it up.
It creates the file system.
It runs the jobs.
And then at the end of the jobs, you could move the data out of the file system to someplace else.
And then it deconstructs the file system.
That's the sort of thing that's going on?
You can run exactly that loop forever, as you have described.
And the repository file system underneath doesn't need to be BeeGFS.
It can be any installed file system on the customer's side.
So that is an excellent functionality where we have seen quite a number of
customers using it, but we also see quite some interest.
There was, I think, half a year back, Sandia National Laboratories in the US testing BeeOND intensively.
So that is waking up their interest, as this is a specific, unique use case they couldn't really cover with what they have as of today.
And if you think in thousands of compute nodes, I mean, with the Slurm integration in, you run a dedicated job against 100 compute nodes here, 15 nodes there, and that creates performance.
I mean, even if we are not talking about high capacity here, it creates an extra performance boost.
Or you can also isolate some IO pattern
from your general purpose file system, which always creates
some hassle in the normal file system operation.
Right. If you need the throughput, if you need the performance kind of capabilities,
you can fire this up temporarily, run your job against it, and then take it down.
And, believe it or not, I mean, it is just a single command line.
One line fires it up, defines the number of nodes you want to deploy on, done.
Well, gosh, I don't have any more questions.
Matt, do you have any last questions for Frank?
No, really don't.
Sounds cool.
Frank, anything you'd like to say to our listening audience before we close?
I appreciated the time spent on this.
I think the topics we have covered are quite interesting ones.
We could spend even more time going into each of those topics.
And I would like to follow up on these discussions at a later point in time, most likely also in a face-to-face discussion at some point.
Yeah, that would be great.
As soon as this COVID stuff is over.
Okay.
Well, this has been great.
Thank you very much, Frank, for being on our show today.
Thank you, Ray and Matt.
And that's it for now.
Bye, Matt.
Bye, Ray.
And bye, Frank.
All right.
Until next time.
Bye, Frank.
Take care.
Bye-bye.
Stay safe.
Until next time.
Next time, we will talk to another system storage technology person.
Any questions you want us to ask, please let us know.
And if you enjoy our podcast, tell your friends about it.
Please review us on Apple Podcasts, Google Play, and Spotify, as this will help get the word out. Thank you.