Grey Beards on Systems - 117: GreyBeards talk HPC file systems with Frank Herold, CEO of ThinkParQ, makers of BeeGFS

Episode Date: April 19, 2021

We return to our storage thread with a discussion of HPC file systems with Frank Herold (@BeeGFS), CEO of ThinkParQ GmbH, the makers of BeeGFS. I’ve seen BeeGFS start to show up in some IO500 top storage benchmark results and as more and more data keeps coming online every day, we thought it time …

Transcript
Starting point is 00:00:00 Hey everybody, Ray Lucchesi here with Matt Leib. Welcome to the next episode of the Greybeards on Storage podcast, a show where we get Greybeards Storage bloggers to talk with system vendors and other experts to discuss upcoming products, technologies, and trends affecting the data center today. This Greybeards on Storage episode was recorded on April 13, 2021. We have with us here today Frank Herold, CEO of ThinkParQ, the makers of BeeGFS. So Frank, why don't you tell us a little bit about yourself and what BeeGFS is up to these days? Thank you, Ray, and thank you, Matt, for the invitation and having me on this call. So as I said, I'm Frank Herold, CEO of ThinkParQ.
Starting point is 00:00:52 ThinkParQ is the company behind BeeGFS. So most likely many people know BeeGFS, not that many know ThinkParQ. We are a German-based company established in 2014 as a spin-off of the Fraunhofer Research Organization, where BeeGFS originally also was developed. So as of today, we are still a small team, but we operate internationally. We have deployments worldwide. And I would also like to highlight that BeeGFS comes in two flavors. The one is what we call publicly available source. The other one is where we offer commercial support behind the product. So we develop the product, we maintain the product, but we also offer support for the product.
Starting point is 00:01:38 There are a couple of functionalities in the product that ask customers to sign up for a support contract. But for general purpose, for general use, everybody can also use BeeGFS as publicly available source. So is BeeGFS focused on the high-performance computing or some of the genomics types of stuff? What's your general market for BeeGFS? And tell us a little bit about what BeeGFS does. Right. So BeeGFS is a parallel file system which fits practically everywhere where you have the need for high-performance throughput,
Starting point is 00:02:16 where you have a high number of compute nodes you would like to leverage your storage against. So, yes, the classical approach, and that's also where we are coming from, is HPC. But I think these days the term HPC is changing massively, and many different customer profiles go under HPC, besides the classical large shops of HPC. But you are right, life science is a big market opportunity for us. We have quite a number of customers, and it is everything above traditional filers, where you need more capacity, where you need more throughput, where you have more ambitious jobs running against a cluster. We also have quite a number of deployments in oil and gas and in manufacturing. And recently, I mean, recently means in the last two or three years,
Starting point is 00:03:10 specifically the deep learning sector comes up a lot in our customer base. I mean, the challenge with deep learning is, it seems like, lots of small files. We used to say images were large, but images these days are fairly small compared to some of the other files that HPC is running, that sort of stuff. Does BeeGFS handle both small files and large files effectively? I mean, it's always been a challenge to mix and match those sorts of workloads. I fully agree. That's why we are also quite sensitive in discussing opportunities with customers. I would say at some point you win, at some point you lose. It depends very much on how applications are accessing the data.
Starting point is 00:04:10 We have seen quite good performance results also in benchmarks on customer sites where they compared BeeGFS against alternative file systems. So I would say BeeGFS is a very performant file system. We have, you can call it a feature: we do not have that many tuning parameters. So we get quite sufficient performance straight out of the box, without optimizing and tuning
Starting point is 00:04:36 in the one or the other direction, like small or large files. And that is one of the advantages of BeeGFS, having excellent performance from small to large files in the same file system. Frank, this is a server-based storage platform? Yeah, it is. I mean, high level on the architecture, how we operate: we have the capability of what we are calling a converged setup, where you put all the services on a single machine or quite a handful of machines,
Starting point is 00:05:08 but you can also segregate. And that is where consulting comes in, to discuss with customers what is the best approach, what is the best setup, where we can also spin out metadata services. And that is specifically for very large file systems, if we talk millions and billions of files, and also when you have to handle small file requests frequently, where metadata is probably better to separate and isolate on dedicated machines, with very high performance and low latency.
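To make the converged-versus-segregated distinction concrete, here is a minimal sketch, in Python and purely illustrative, of how the service roles Frank describes (management, metadata, storage) might be mapped onto machines in each setup. The host names and counts are assumptions for the example, not a ThinkParQ reference design.

```python
# Illustrative only: two ways the BeeGFS service roles discussed here
# (management, metadata, storage) could be laid out across machines.
# Host names and counts are hypothetical, not a ThinkParQ reference design.

CONVERGED = {
    # A handful of machines, each running metadata and storage services;
    # one of them also carries the lightweight management service.
    "node01": ["management", "metadata", "storage"],
    "node02": ["metadata", "storage"],
    "node03": ["metadata", "storage"],
}

SEGREGATED = {
    # Metadata isolated on dedicated low-latency machines, the setup Frank
    # prefers for very large or small-file-heavy namespaces, with separate
    # capacity-oriented storage servers.
    "mgmt01": ["management"],
    "meta01": ["metadata"],
    "meta02": ["metadata"],
    "stor01": ["storage"],
    "stor02": ["storage"],
    "stor03": ["storage"],
    "stor04": ["storage"],
}


def describe(layout: dict[str, list[str]]) -> None:
    """Print which services share each box in a given layout."""
    for host, services in layout.items():
        print(f"{host}: {', '.join(services)}")


print("-- converged --")
describe(CONVERGED)
print("-- segregated --")
describe(SEGREGATED)
```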
Starting point is 00:05:36 While on the storage server itself, you can play between spinning disk up to NVMe, all in one namespace. And you can define where data goes while storing this data. So you can tailor your performance to your workload requirements based on where you want to put the data in the backend storage. And that makes an interesting point. So this is really a software-defined solution, right? You don't actually sell hardware solutions for this. And in this sort of environment, you mentioned the converged solution. So a converged solution would have, I guess, all your services running on a single server with lots of storage behind it.
Starting point is 00:06:22 Is that how that would look? That could be one of the options. You can also have two or three servers, which we are still calling a converged setup, where you have on each of those servers a metadata service and a storage service. So you can balance a little bit the metadata across multiple nodes,
Starting point is 00:06:39 but it doesn't give you the full value of it. And that makes it super flexible, where we can also customize to customer needs exactly what they are looking for. While on the storage server, I mean, as long as we see a device, we can map it into the file system. And as we have talked about AI,
Starting point is 00:07:01 we see quite often customers leveraging cost-sensitive storage for larger amounts of data, as the streaming performance out of spinning-disk RAID setups is still good enough, I would call it. While for the small files, we leverage NVMes or SSDs, but the customer sees this as one single namespace. That is a key advantage. Across the whole cluster of however many servers he has and that sort of stuff. So you do not create silos of storage pools, which makes it quite complicated in managing data and data flows. And so you're effectively accessing the data on directly attached storage that's attached to these storage servers per se. Is that how this works? And you don't care if it's NVMe or SAS attached or SATA attached storage,
Starting point is 00:07:48 as long as it's storage, I guess. I don't care. Even if you sometimes have a storage virtualization layer in between, it still works. So we are hooking up on top of the Linux file system presentation layer. So if you have an XFS or ZFS presented in the Linux OS, we can map the BeeGFS file system on top of it. I got you. I got you. And as far as the host is concerned, is there a client software that
Starting point is 00:08:17 runs in there or is it POSIX compliant or how's that? Is it NFS and CIFS or SMB? Sorry. It is, per default. We call it POSIX compliant, but I think POSIX is still a big word, and just very few file systems are really 100% POSIX compliant. So we say we are POSIX compliant, supporting the majority of commands which are defined in the POSIX standard. That's what I think we have never seen any trouble with. You have the possibility to re-share, re-export the file system to create NFS or SMB shares, as we still have customers, quite a number, which quite often have lower performance requirements, but still
Starting point is 00:09:05 access into the same file system. And when you say reshare, that would be done at the host level or that would be done at the storage server level or the metadata server as a management level? I'm trying to understand how that would play out. The technical answer is it depends. So you can do both. Consultant. Right.
Starting point is 00:09:28 I mean, you can do it straight out of the storage server. And that's typically what we do in smaller deployments. But in larger deployments, customers also quite often like to have dedicated server machines just for the reshare. And is there a management layer here? Is there a management server? Management service, I guess, is the right word to use, right? Right. It's a software piece.
Starting point is 00:10:01 It's a management service, which needs to run also on one or two of those machines. But the management service is just, I would call it, the traffic cop. So it does not really have a lot of load on it; it just needs to be up and running as a service. We also have a monitoring service on top of that for checking the health status of the services and daemons we have running. But that is all background work that doesn't really require a lot of memory or CPU, so it can be on one of those machines. Right, right, right. And do you guys support data protection, RAID kinds of solutions, or how does that work in BeeGFS? As I said before, I mean, we span a virtual file system across underlying XFS or ZFS. So typically on this lower level, customers have some sort of data protection.
Starting point is 00:10:52 And we do not have data protection per definition. But we have some capabilities of mirroring data as of today, for customers and for specific data sets where it makes sense. Of course, you can get a little bit more performance advantage by mirroring the data, specifically on the read, since you can read the data from both storage servers. On the other hand, you need much more capacity, exactly twice the capacity, for storing the data. What about replication strategies? I mean, the mirroring is some sort of internal replication, as we have. We have deployments in the field where we leverage third-party tools where customers are replicating
Starting point is 00:11:40 data into a second replica of BeeGFS. We do not have in-house built replication for the product. I see. So it's all done external to the storage system, I guess, to BeeGFS. And so your metadata becomes an extremely important characteristic of these sorts of things. So how's your metadata laid out? Is it, you know, is it cached? You know, that sort of thing. I mean, there's so many things you can do with metadata to try to speed up the process, but the counterpart is that you've got to have very good data integrity so you don't lose files and stuff like that. Exactly. That's the point. I mean, metadata is one of the most critical
Starting point is 00:12:25 components. I mean, if you are losing the metadata, you are losing effectively all the data. Since we stripe data across multiple storage servers, without the metadata you cannot recombine those blocks or chunks we have written on the storage backend. So metadata is critical. For the metadata operations, we cache in memory on the metadata server quite a large amount of data, but we flush this continuously against the database underneath to have this consistency layer. But yes, metadata is absolutely critical in terms of keeping the data alive and intact on the storage device, and also on very low latency devices. So I've noticed on your website that one of your preferred architectures is to segregate out to a separate storage element, the metadata layer. Like almost having a metadata server kind of thing, right?
Starting point is 00:13:28 Effectively, it is a service, but we prefer to have the metadata service running on isolated machines where they have the full bandwidth, full memory, and full CPU dedicated to the metadata service, as this is where we see the best results, specifically if we go in performance solutions. And what types of servers are we talking about? Are they all sort of x86-based,
Starting point is 00:13:56 or is there a leveraging of GPU or coprocessor elements that the file system can take advantage of? The majority of customers is still running on x86 machines. That's what we find all over. BeeGFS itself is quite hardware agnostic, so we also have deployments on OpenPOWER, on ARM, on AMD machines. That's what we're also supporting. From a BeeGFS vendor perspective, I must say we don't care per definition. But yes, we have certain experiences and blueprints on what works best to reach the one or the other goal customers are defining.
Starting point is 00:14:49 And what about the underlying networking layer? Is that you have to have InfiniBand or RDMA or is that? It's not or, it is and. So we play on InfiniBand, we play on GigE, we play on Omni-Path. We also have customers running mixed environments. So that's up to the customer. I mean, we all know the pros and cons on various interfaces. But effectively, I mean, it is up to the customer to define.
Starting point is 00:15:20 And sometimes there are technical requirements. Sometimes there are budget requirements. Sometimes there's history on the customer's side while they're using the one or the other technology. But we also quite often see deployments where they have, for example, InfiniBand on a 100 and 200 gig interface layer. That's what we can play with. And so how many, so, you know, obviously these sorts of environments like to scale pretty large. Is there a limit to A, the amount of storage, B, the amount of files slash directories? You keep mentioning a single namespace as well. Is there a single namespace across the whole cluster?
Starting point is 00:15:58 Can it be defined as a single namespace? Yes. I mean, in contrast to competitors, I would say we are coming more from the smaller end and going one step after the other into larger deployments. So we have double-digit petabyte deployments, also on the larger scale, which we are calling large for BeeGFS as of today. I mean, large for me doesn't really mean capacity, it is more the number of files and the number of operations, as this is the headache for a file system. That's where we play quite well. We have not seen any customer really suffering on scalability limits. I'm not aware that we have some limits in the architecture. Of course, I mean, if you have more metadata servers, at some point the metadata communication takes up more time than having them on fewer metadata servers.
Starting point is 00:16:55 So that is what you need to balance. And I agree, at some point it might also be a good choice to put this in a second namespace to isolate performance to dedicated jobs. I see. I see. So the reason you'd want to split that is because you could split the metadata services across the two different namespaces. Is that how it would work? That could be one. We could also play it so that you have a number of metadata servers assigned to namespace one and a number of metadata servers assigned to namespace two. And in case of failover mechanisms, you can leverage also the second one. I see. And that would be a smart way, also a quite cost-effective way,
Starting point is 00:17:36 where you have full capacity, full bandwidth while operating the entire system, while also in a degraded mode, you can take advantage. You open up an interesting story here, failover. So you can failover metadata services. You mentioned that there is a capability to do that. What about the storage side of things? It's the same mechanism we are using underneath. So it's the same possibility also.
Starting point is 00:18:02 I have mentioned that we have the capability of mirroring the data for the storage server. So that means that we have one active and one passive storage server. If one storage server, the primary, fails, the secondary will become the active one. Right, right, right. So it's definitely a mirrored, RAID 1 solution kind of thing. Keep in mind, it is on top of typical RAID protection on the storage layer customers are using, I mean, software RAID or hardware RAID. So it is a double layer of security.
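To illustrate the active/passive idea Frank just described, here is a toy model of a mirrored pair where I/O goes to the primary until it is marked down, at which point the secondary takes over. It is an assumption-laden sketch of the general concept, not BeeGFS's actual mirroring implementation.

```python
# Toy model of an active/passive mirrored pair of storage servers.
# Purely conceptual; this is not how BeeGFS implements its mirroring.
from dataclasses import dataclass, field


@dataclass
class MirroredTarget:
    primary: str
    secondary: str
    primary_up: bool = True
    chunks: dict[str, bytes] = field(default_factory=dict)

    def active_server(self) -> str:
        """The primary serves I/O; if it has failed, the secondary takes over."""
        return self.primary if self.primary_up else self.secondary

    def write(self, chunk_id: str, payload: bytes) -> str:
        # In a real mirror the write lands on both copies; here we just record
        # it and report which server acknowledged it.
        self.chunks[chunk_id] = payload
        return f"chunk {chunk_id} written via {self.active_server()}"

    def fail_primary(self) -> None:
        self.primary_up = False


pair = MirroredTarget(primary="stor01", secondary="stor02")
print(pair.write("c0001", b"results"))    # served by stor01
pair.fail_primary()
print(pair.write("c0002", b"more data"))  # now served by stor02
```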
Starting point is 00:18:35 And my question quite often to customers while discussing this is also how much security do you need for your data. I mean, if we talk about scratch, just repeat the job if there is a failure. Don't waste too much budget on the security layer. If you have this as a permanent repository for your data, security, I mean, data security and integrity, becomes much more important. Right, right, right, right.
Starting point is 00:19:06 I mean, a lot of HPC jobs are doing much of their data activity as scratch workload kinds of things. But some of it obviously has to persist beyond that work environment and that sort of thing. So I understand that logic. Do you tier behind the solution? I mean, if you've got, you know, mixtures of storage that are, you know, NVMe SSDs versus disk storage, do you move data, hot data to NVMe and slow data back to disk storage or anything like that?
Starting point is 00:19:37 We have a functionality which is called storage pools. So with storage pools, you can define on a per-directory basis where the data will land while writing into those directories. If you have a performance directory, it goes on NVMe, while a capacity directory goes on HDD, as an example. The admin also has, as of today, the capability of manually moving data on a per-job basis, or whatever the definition is, from hot into capacity. We have deployments, for example, based on iRODS or Starfish, where they collect a lot of metadata, where they also have some policy definitions, while actively using our CLI interface and moving
Starting point is 00:20:27 data underneath the file system from a performance tier into a capacity tier. The reason why we have implemented this this way as of today is that I strongly believe that if you have a fixed, defined workflow, then you can also define fixed policies for when data will be moved from hot to capacity, for example. While in many customer deployments we have, with all due respect, the data storage is quite chaotic. So a researcher will never define which data needs to be where. They all want to have data always on the performance tier.
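To give a sense of what such an external policy mover does, here is a minimal sketch, a hypothetical stand-in for the iRODS- or Starfish-driven workflows Frank mentions, that walks a performance directory and relocates files that have not been accessed for a while into a capacity directory. The paths and the age threshold are assumptions; in a real BeeGFS setup the two directories would be pinned to different storage pools, and a production mover would typically drive this through the BeeGFS command-line tooling Frank refers to.

```python
# Hypothetical cold-data sweep between two directories that an admin has
# mapped to different BeeGFS storage pools (e.g. NVMe vs. HDD). The paths,
# threshold, and policy are illustrative assumptions, not product behavior.
import shutil
import time
from pathlib import Path

PERFORMANCE_DIR = Path("/mnt/beegfs/hot")    # assumed to live on the NVMe pool
CAPACITY_DIR = Path("/mnt/beegfs/capacity")  # assumed to live on the HDD pool
MAX_IDLE_DAYS = 30                           # arbitrary example policy


def sweep_cold_files() -> None:
    """Move files not accessed for MAX_IDLE_DAYS from the hot to the capacity tier."""
    cutoff = time.time() - MAX_IDLE_DAYS * 86400
    for path in PERFORMANCE_DIR.rglob("*"):
        if path.is_file() and path.stat().st_atime < cutoff:
            destination = CAPACITY_DIR / path.relative_to(PERFORMANCE_DIR)
            destination.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(path), str(destination))
            print(f"demoted {path} -> {destination}")


if __name__ == "__main__":
    sweep_cold_files()
```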
Starting point is 00:21:06 What about nuances, Frank, like compression and deduplication and that level of control over size requirements? We see those requirements. We discuss those. We do not have them implemented, and we are still trying to make up our mind and also watch what the acceptance in the market space is. I'm getting the point that all these technologies are some sort of data reduction. So for storage costs, it has huge advantages. That's what I'm getting. But on the other hand, I'm saying, you know. Well, you don't want to take a performance hit, right? Yeah.
Starting point is 00:21:54 You are buying a Ferrari for a good reason, and you're not trying to use a Ferrari to pull a caravan, to put this in an analogy. And that's what we are still trying to discuss with customers and with partners: what is the best balanced approach to fulfill those kinds of requirements. We need to invest in this. We are still in the research phase on those functionalities. Fair enough. It strikes me that with enough server-based storage componentry and enough processing power, you might be able to divide out those sorts of more nuanced functionalities that are attempting to save space without performance hits.
Starting point is 00:22:47 But yeah, I understand completely what you're saying. And that's always the trick, isn't it? You've got to balance the potential benefits of functions like that against the value of the performance data that you're getting. Right. And that is exactly the point. I mean, people using a parallel file system looking primarily into performance. And many customers just using it as a scratch
Starting point is 00:23:16 want super high performance and low latency for the data. And I mean, that's also why many customers are really looking into NVMe deployments or a mixture of different technologies to really get performance out of the box. Everything else kills performance. And there are also technologies like object stores on the market,
Starting point is 00:23:39 which are excellent in storing large amounts of data, quite cost-effective, with all these deduplication technologies. Excellent. But they are purpose built for exactly this. So you mentioned parallel file systems a couple of times. Most of the enterprise listeners we have talked to don't understand or don't realize what
Starting point is 00:24:02 a parallel file system really represents. So why don't you, if you could kind of clarify the difference between a normal file system and a parallel file system, if such a thing exists. I know the parallel file systems exist. I'm not sure what a normal file system would look like, but I guess NFS or SMB. I mean, if you, it is the same with NFS or SMB devices. I mean, you have one head and you have massive amount of storage behind. At some point, the storage performance, I mean, the spindle performance, the aggregated spindle performance is higher than your head node can deliver on the network interfaces or whatever to the number of clients. That works as a repository file system that delivers a good performance, a solid performance. Nothing against this. If you want to go above, then parallel file systems shine in. So what parallel file systems
Starting point is 00:25:01 are doing is that you have multiple heads, and that's what we are calling the storage server, or OSS, or everywhere there is different terminology, but effectively it is a head with storage underneath. The files are spread across multiple servers, and that's a big difference. So to make a simple example, you have a one-megabyte file. You divide this into four chunks, each with 250K. You spread them across four servers. While reading that data back from the client, you leverage the performance of up to four servers in this example. And that brings performance. And that performance is primarily throughput, right?
Starting point is 00:25:42 I mean, you're increasing the throughput capabilities of the system. Sure. And that is also back to the initial discussion we had, which is one of the problems in AI. If you have very small files, this striping of data across multiple servers doesn't make much sense. And that's where you need the intelligence in the system of when you want to split data and where you probably put data on a single server only. And if you take in large amounts of data, you also try to avoid creating hotspots on a single server, so that you don't have all the small files going to storage server one, while you
Starting point is 00:26:19 spread larger files across four servers, and storage server one is getting overloaded over time. So you have optimizations to try to ensure that small files are spread across multiple servers, even though each might only be on a single server kind of thing. Right. And that is the default deployment. That's nothing where you really need to tweak and tune. And that's also why I said we have quite decent performance out of the box without going into deep tuning sessions.
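To make the one-megabyte example concrete, here is a small sketch of the striping idea: split a file into fixed-size chunks, place them round-robin across the storage servers starting from a per-file offset, and note that a file smaller than one chunk lands on a single server. The chunk size, server list, and placement rule are simplified assumptions for illustration, not BeeGFS's exact layout algorithm.

```python
# Illustrative striping of a file across parallel storage servers.
# Chunk size and placement policy are simplified assumptions for clarity.
SERVERS = ["stor01", "stor02", "stor03", "stor04"]
CHUNK_SIZE = 256 * 1024  # ~250K chunks, as in the 1 MB example above


def stripe(file_name: str, file_size: int) -> list[tuple[str, int, int]]:
    """Return (server, offset, length) for each chunk of a file.

    The starting server is rotated per file (here by hashing the name) so that
    many small files do not all pile up on the same server.
    """
    start = hash(file_name) % len(SERVERS)
    chunks = []
    offset = 0
    index = 0
    while offset < file_size:
        length = min(CHUNK_SIZE, file_size - offset)
        server = SERVERS[(start + index) % len(SERVERS)]
        chunks.append((server, offset, length))
        offset += length
        index += 1
    return chunks


# A 1 MB file spreads across all four servers, so a read can pull from four
# machines in parallel; a 100 KB file stays on one server.
print(stripe("simulation.dat", 1024 * 1024))
print(stripe("tiny_label.json", 100 * 1024))
```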
Starting point is 00:26:45 And that's also why BeeGFS is super easy to deploy and gets good out-of-the-box performance as is. So what sort of, you know, I noticed that there are a couple of vendors out there that are also partnering up with BeeGFS. You want to talk about those? I mean, are these partners, I guess? Is that how you would consider them? We have two. I mean, we have direct end customers, which are most likely coming from our history
Starting point is 00:27:17 where we started with direct contacts, mainly in Europe, research organizations, pushing BeeGFS. For scaling the operation of the company, it makes absolute sense to have partners. And as we have also seen, we are just software-defined, so you always need someone fulfilling the entire requirements and delivering an end-to-end solution to a customer base. So you need to bring in hardware, infrastructure components, but also intellectual property to put everything together based on the requirements the customer has. So we are the expert on the software-defined storage level for BeeGFS, but we want to partner with partners
Starting point is 00:28:08 which know exactly what these hardware architectures can deliver in hard numbers and figures. We are not the hardware expert. That's one of the arguments why we go with partners. We have, I would call them, regional partners operating in a specific region or in a specific vertical across the globe, in Europe and North America, but also in the APAC region. While in the last two years, we have also started to engage with a couple of larger partners like NetApp, like Dell, but specifically in China also with Inspur. Both Dell and Inspur have started to build appliances around the product,
Starting point is 00:28:53 which helps us a lot, as we have standards on hardware setup along with BeeGFS which are burned in, documented, and performance-specified, so quite easy out-of-the-box deployments not creating a lot of hassle. While these handmade one-off solutions can take a bit longer in the burn-in phase for a customer to get something really up and running to the perfect requirements of the customer. We need both, since not every customer is a general-purpose buyer, and we have some customers with very specific requirements. So that's why I think we play quite well on both sides. And on top of these partner engagements, we also have what we are calling technology partnerships with some other hardware and software vendors where we try to do things together.
Starting point is 00:29:48 I mentioned Starfish before as an example for data movement where we are working together. With Bright, for example, we work together. With Slurm, we have deeper integration. We also talk with NVIDIA on a couple of things.
Starting point is 00:30:10 layer kind of thing, I guess. I don't even want to go there. It's a different discussion. I agree, but I would, I mean, just to give you a little idea of why I'm bringing this up and why this is important for us. We have one specific flavor. Customers really like what we are
Starting point is 00:30:27 calling BeeGFS On Demand, or BeeOND. And if we think about the initial architecture we have discussed, where we have these dedicated storage servers and metadata servers and so on for a kind of repository performance file system, we have with BeeOND the possibility where you can spin up a kind of scratch file system on the fly on the compute nodes. So if you have thousands of compute nodes and an SSD or NVMe hanging around in those units, you can define BeeOND. It's just firing up a script on our side. You're firing up a script, deploying BeeOND on X number of compute nodes. And then you have a kind of temporary file system. You run some jobs on this or just a single job. And putting Slurm in place is where you can have the scheduler integrated with BeeOND. So the Slurm integration spins up the temporary file system,
Starting point is 00:31:28 moves the data into it, runs the job, moves the results from the temporary file system into the repository, removes the temporary file system space, and restarts with the next job session. That's very interesting. So this is kind of like a try it and buy it kind of thing. Try it to buy it.
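The BeeOND-plus-Slurm loop Frank walks through can be summed up in a short sketch. The function bodies below are placeholders, and the names, paths, and node list are assumptions; in practice the create and destroy steps would invoke the BeeOND tooling that ships with BeeGFS and the stage steps would be parallel copies, with the exact commands left to the BeeGFS documentation.

```python
# Sketch of the BeeOND lifecycle described above: spin up a temporary BeeGFS
# on the compute nodes of a job, stage data in, run, stage results out, tear
# it down, and repeat for the next job. All names and paths are illustrative;
# the real steps would invoke the BeeOND tooling (see the BeeGFS docs).
from typing import Sequence


def create_scratch_fs(nodes: Sequence[str], mountpoint: str) -> None:
    """Placeholder for firing up BeeOND on the job's compute nodes."""
    print(f"creating BeeOND on {len(nodes)} nodes, mounted at {mountpoint}")


def stage(src: str, dst: str) -> None:
    """Placeholder for a parallel copy between repository and scratch."""
    print(f"copying {src} -> {dst}")


def run_job(job_script: str, mountpoint: str) -> None:
    """Placeholder for the compute job reading and writing the scratch FS."""
    print(f"running {job_script} against {mountpoint}")


def destroy_scratch_fs(nodes: Sequence[str], mountpoint: str) -> None:
    """Placeholder for tearing the temporary file system back down."""
    print(f"removing BeeOND from {len(nodes)} nodes")


def beeond_job(nodes: Sequence[str], job_script: str, repo: str) -> None:
    scratch = "/mnt/beeond"                         # hypothetical mountpoint
    create_scratch_fs(nodes, scratch)               # scheduler prolog step
    stage(f"{repo}/input", f"{scratch}/input")      # stage-in
    run_job(job_script, scratch)                    # job runs on fast local NVMe
    stage(f"{scratch}/results", f"{repo}/results")  # stage-out
    destroy_scratch_fs(nodes, scratch)              # scheduler epilog step


beeond_job(nodes=[f"cn{i:03d}" for i in range(100)],
           job_script="train_model.sh",
           repo="/mnt/beegfs/project42")
```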
Starting point is 00:31:47 I was going to ask if you have some sort of a downloadable, you know, software solution that you could just, you know, people can look at and bring in and try it out and stuff like that. So this sort of, you called it BeeOND. So BeeGFS BeeOND, is that how I'd call it? Yeah. I mean, we just call it BeeOND. Okay.
Starting point is 00:32:07 BeeGFS is the file system name, and BeeGFS On Demand is the marketing brand name for this specific flavor. It runs BeeGFS underneath, for sure. But back to your initial question. I mean, we have this commercial offering as ThinkParQ for BeeGFS, which is a subscription to the software for X number of years, where we also provide commercial support. But we also have the publicly, we call it publicly available source. And there's a big term, open source. So it is some sort of, but not 100%, open source.
Starting point is 00:32:45 That's why we call it publicly available source. You can download it on beegfs.io. You can look into the source. Even the public source comes with full features and functions. It's not restricted or limited in some extent, but we are asking customers, if you have extended use, specifically of what we are calling enterprise features, to sign up for a contract.
Starting point is 00:33:11 So we have probably quite a large customer base in the field, in the community, while we also have quite a number of customers on the contract. And so the contract is providing both support as well as extended feature functions, that sort of thing. Right. Right. So back to this BeeOND solution. So effectively, you could deploy this across any number of compute servers that you have. With Slurm, it fires it up.
Starting point is 00:33:43 It creates the file system. It runs the jobs. And then at the end of the jobs, you could move the data out of the file system to someplace else. And then it deconstructs the file system. That's the sort of thing that's going on? That you can run forever, exactly in the loop as you have described. And the repository file system behind it doesn't need to be BeeGFS. It can be any installed base file system on the customer side.
Starting point is 00:34:09 So that is an excellent functionality where we have seen quite a number of customers using it, but we also see quite some interest. There was, I think, half a year back, Sandia Labs in the US testing BeeOND intensively. So that is waking up their interest, as this is a specific, unique use case they couldn't really do with what they have as of today. And if you think in thousands of compute nodes, I mean, with the Slurm integration in, and you run a dedicated job against 100 compute nodes here,
Starting point is 00:34:51 15 nodes there, that creates performance. I mean, even if we do not talk about high capacity on this, but it creates an extra performance boost. Or you can also isolate some IO pattern from your general purpose file system, which always creates some hassle in the normal file system operation. Right. If you need the throughput, if you need the performance kind of capabilities,
Starting point is 00:35:16 you can fire this up temporarily, run your job against it, and then take it down. And that is, believe it or not, I mean, just a single command-line interface. One line, firing it up, defining the number of nodes you want to deploy it on, done. Well, gosh, I don't have any more questions. Matt, do you have any last questions for Frank? No, I really don't. Sounds cool.
Starting point is 00:35:39 Frank, anything you'd like to say to our listening audience before we close? I appreciated the time spent on this. I think the topics we have covered are quite interesting ones. We could spend even more time going into each of those topics. And I would like to follow up on these discussions at a later point in time, most likely also in a face-to-face discussion at some point. Yeah, that would be great. As soon as this COVID stuff is over.
Starting point is 00:36:06 Okay. Well, this has been great. Thank you very much, Frank, for being on our show today. Thank you, Ray and Matt. And that's it for now. Bye, Matt. Bye, Ray. And bye, Frank.
Starting point is 00:36:16 All right. Until next time. Bye, Frank. Take care. Bye-bye. Stay safe. Until next time. Next time, we will talk to another system storage technology person.
Starting point is 00:36:27 Any questions you want us to ask, please let us know. And if you enjoy our podcast, tell your friends about it. Please review us on Apple Podcasts, Google Play, and Spotify, as this will help get the word out. Thank you.
