Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 07x06: Connecting Ceph Storage to AI with Clyso
Episode Date: July 8, 2024. Many of the largest-scale data storage environments use Ceph, an open source storage system, and are now connecting this to AI. This episode of Utilizing Tech, sponsored by Solidigm, features Dan van der Ster, CTO of Clyso, discussing Ceph for AI Data with Jeniece Wnorowski and Stephen Foskett. Ceph began in research and education but today is widely used in finance, entertainment, and commerce as well. All of these use cases require massive scalability and extreme reliability despite using commodity storage components, but Ceph is increasingly able to deliver high performance as well. AI workloads also require scalable metadata performance, an area in which Ceph developers are making great strides. The software has also proved itself adaptable to advanced hardware, including today’s large NVMe SSDs. As data infrastructure development has expanded from academia to HPC to the cloud and now AI, it’s important to see how the community is embracing and improving the software that underpins today’s compute stack. Hosts: Stephen Foskett, Organizer of Tech Field Day: https://www.linkedin.com/in/sfoskett/ Jeniece Wnorowski, Datacenter Product Marketing Manager at Solidigm: https://www.linkedin.com/in/jeniecewnorowski/ Guest: Dan van der Ster, CTO at CLYSO and Ceph Executive Council Member: https://www.linkedin.com/in/dan-vanderster/ Follow Utilizing Tech Website: https://www.UtilizingTech.com/ X/Twitter: https://www.twitter.com/UtilizingTech Tech Field Day Website: https://www.TechFieldDay.com LinkedIn: https://www.LinkedIn.com/company/Tech-Field-Day X/Twitter: https://www.Twitter.com/TechFieldDay Tags: #UtilizingTech, #Sponsored, #AIDataInfrastructure, #AI, @SFoskett, @TechFieldDay, @UtilizingTech, @Solidigm
Transcript
Many of the largest scale data storage environments use Ceph, an open source storage system, and are now connecting this to AI.
This episode of Utilizing Tech, sponsored by Solidigm, features Dan van der Ster, CTO of Clyso, discussing Ceph for AI data.
Welcome to Utilizing Tech, the podcast about emerging technology from Tech Field Day, part of the Futurum Group.
This season is presented by Solidigm and focuses on the questions of AI data infrastructure.
I'm your host, Stephen Foskett, organizer of the Tech Field Day event series. And joining me from
Solidigm is my co-host, Jeniece Wnorowski. Welcome to the show. Thank you, Stephen. It's great to be
back. Well, it's good to have you here. So as we've spoken about many times in the past, especially here on this whole season of Utilizing Tech, there's a lot of data out there. There's a lot of existing data, a lot of existing data sources, and a lot of existing data platforms. And all of that is going to eventually need to be integrated into the AI data pipeline and into the whole AI picture. Yeah, absolutely.
And, you know, there's lots of ways to go about it, right?
A lot of different hardware out there, software.
And folks are really trying to figure it all out.
How do I make all of this work together?
And there's one software tool out there that's open source
that many, many companies have been using for years, right?
And now with the advent of AI, they're kind of like,
how do I use this tool I've been using for a long time that's open source and free?
How do I make this work for my workloads going on today
and the ones that are ever evolving into the future?
So we're excited to talk with somebody from Ceph.
So I'll turn it back to you, Stephen, to introduce him, but excited to dive into this topic.
Absolutely, yeah.
And Ceph is one of those things storage nerds like me have seen for many, many years.
I've watched this project grow.
I've watched it become absolutely a critical component. Many people may not have heard of it,
but as our guest said, it kind of is the Linux of storage. It is everywhere. And it is used,
especially in environments that have lots and lots of data. And those are the environments
that are going to need to be integrated into the AI data pipeline. So let's welcome Dan van der Ster from Clyso here to talk a little bit about the importance of Ceph in AI.
Thanks, Stephen and Jeniece. Yeah, I'm Dan van der Ster. I'm CTO at Clyso.
I'm coming from CERN. I spent about a decade at CERN working in the IT department, working in storage. And I'm also
wearing multiple hats. I've had the pleasure of working with Ceph for
around a decade as well: early adoption,
testing at scales that hadn't been seen before,
getting more active in the community, and
now serving in the open source project as a member of the Executive Council leading the overall project.
So many people haven't heard of Ceph, like I said, but I'm sure it affects almost everybody now, because it really is everywhere.
I mean, essentially, this is a software-based storage solution.
So full disclosure, I was there at the beginning
when this was originally announced,
and I was excited about it because of what it is.
It uses unreliable components to build reliable,
high-performance, scalable storage solutions.
So essentially, it is massive scale, massively
distributed, and it's designed not just to be able to adapt to failure, but to be ready for failure.
And that's one reason that it's become so successful. So tell us a little bit more:
where is Ceph today in the overall picture of the world's information?
Right. I mean, you highlighted a lot of the points
that actually attracted us to Ceph early on and made us one of the early adopters.
You know, what organizations are looking for, and what the people
operating the storage infrastructure are looking for, is something that's reliable and can be
built out of, you know, low-cost commodity components.
And one of the main things is that if you're building a large-scale
storage system, you don't want to have to lift and shift and
migrate data from one appliance to
another every four or five years. You want something like an organic storage system. I think we
wrote an internal memo at CERN, 'Toward an Organic Storage System' for our cloud, and
that's how we got started. And yeah, it really proved to deliver what it promised, which was that kind
of scalable, reliable system where the operators can, you know, sleep at night. Things fail all
the time, but you end up with a reliable system that can grow and evolve with the organization.
So on that notion of creating something
that's scalable and reliable and kind of always on,
can you dive a little bit into how you're working with AI
with some of your partners today
and how you're bringing some of that Ceph goodness
to AI?
Yeah, I mean, the AI use case is certainly the hot topic with Ceph these days,
because in addition to the whole reliability aspect,
Ceph is also very flexible.
It's a low-level object store internally,
but on top it presents very familiar storage interfaces.
It's block storage for a private cloud, or it's object storage compatible with, you know, the public clouds,
and it's also a file system, a normal POSIX file system, like a typical NFS, what you'd expect
from NFS. So because it's so flexible, it's already used quite massively in those,
you know, cloud environments for object storage, especially self-hosted or hybrid
cloud or multi-cloud environments. So there's really just a lot of data out there that,
now in the AI context, organizations want to process more quickly, more rapidly. And it always
puts pressure on the project to deliver more and more features and performance needed for the
exponentially expanding processing capacity that we have in the AI world.
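As a concrete sketch of that object-storage compatibility (an illustration added here, not something from the episode; the endpoint, credentials, bucket, and file names are placeholder assumptions), standard S3 tooling works directly against Ceph's RADOS Gateway:

```python
import boto3

# Point a standard S3 client at a Ceph RADOS Gateway endpoint.
# The URL and credentials are placeholders; in practice they come from
# your RGW deployment and a user created with `radosgw-admin user create`.
s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:8080",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Create a bucket and upload a training-data shard, exactly as you would
# against any other S3-compatible object store.
s3.create_bucket(Bucket="training-data")
s3.upload_file("train-shard-0001.tar", "training-data",
               "shards/train-shard-0001.tar")
```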
So, I guess, what are the industries and use cases that are predominantly
using Ceph today? Let's talk about that, and then let's talk about how
those industries are going to be using AI with this data set.
Right. I mean, Ceph had its beginnings in the academic sphere. So initially it was really a lot of
universities, research centers,
research labs building and using it. It started out as an
HPC file system, and then it evolved these other flexible use cases. Today it's used by all sorts
of industries, every industry: let's say financial industries, you know, high-frequency
trading, or complicated algorithms crunching through
lots of financial data. Of course still supercomputer centers doing any kind of
research, let's say biotech research, things like that. And also, you know, on the entertainment side,
media storage. Ceph is massive in media storage,
backing, you know, some of the largest media companies,
as well as video games and things like that.
You know, you say it may not be well known,
but it really is backing the largest infrastructures that we have out there.
Ceph is behind the scenes powering it all.
And you mentioned a moment ago, too, that, you know, Ceph isn't always seen as being maybe a high-performance solution, right?
Can you talk a little bit more about, you know, how Ceph does bring forth that performance?
Yeah, I mean, it's important to understand that when Ceph was created, the idea was to do something better than
previous storage systems, and that was to put a very strong priority on the durability and
consistency and reliability of the storage. So, to not make any sacrifices on that data consistency,
whether for performance or for any other reason.
There are a lot of smarts internally in Ceph
to enable that to happen at high-performance speeds.
And we've done recent tests at Clyso,
published with the Ceph project on ceph.io,
to show what's possible
and to try to make a little bang,
to make the other file systems, you know,
pay attention a little more.
We did a one-terabyte-per-second demo.
I'm not sure another file system has demonstrated
one terabyte per second recently,
but this was, you know, just with open source software on commodity components: a few hundred
NVMe drives, standard servers that anyone can purchase, and you can build the best AI platform for
your organization with it. It seems like that is really true to the philosophy of Ceph too,
because right from the very beginning,
it was all about using mundane components.
It's not about extreme, specialized components.
It's about using ordinary components that are accessible and affordable and varied and
combining that into a unified system that offers massive scale. And one of the fundamental architecture
concepts of Ceph is distributing everything: not having a single point of failure,
not having a single bottleneck. So it makes sense to me that that approach would be able to deliver
high performance if that was one of the goals that folks had when kind of rebuilding it out
and tuning the system. Because that is very similar to how AI clusters and HPC clusters are
built. They're built out of, well, in some cases, more exotic components, but they're still built
out of massively scalable components and they're massively distributed. Because anytime you have
any kind of bottleneck, well, that's a bottleneck that's going to cause a performance hit. So do you find that you're
able to change, I guess, some of the tuning, or some of
the way that Ceph was designed, to really focus on performance and distribute
that workload? Is that how it works?
I mean, in the earlier days of Ceph, I don't know, maybe let's say eight years ago or 10 years ago,
I think we all in the community had the idea that, yeah, we call it like horizontal scalability. If you want more performance, just buy more servers or buy more devices and you just add them
and then you scale linearly the performance.
And for the very lowest levels of Ceph and for things like object storage, that's really true.
And we had the ambition and the goal that, like, okay, we can see beyond NFS. We won't have
POSIX file systems anymore. We don't need them anymore. We can just do everything with object storage.
But it still seems that POSIX file systems are popular and needed; they're just the normal expectation, for a variety of different reasons. And POSIX file
systems are not as easily horizontally scalable. That's the issue.
Scaling AI workloads brings new challenges to metadata performance.
Opening literally millions of files per second
is a very highly metadata-intensive task.
And so we work hard to have new ways to make the Ceph metadata features actually scale as well.
They're pretty good at it already, I would say.
It's like we anticipated this a few years ago, and it's already quite good.
But we're always trying to push to the next level and be ahead of the curve a bit.
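To make that concrete, CephFS can run multiple active metadata servers and lets operators pin directories to specific MDS ranks. Here is a minimal sketch (the filesystem name, mount path, and rank numbers are placeholder assumptions, not details from the episode):

```python
import os
import subprocess

# Allow two active MDS daemons for the filesystem named "cephfs"
# (the name is a placeholder), sharding metadata load across ranks.
subprocess.run(["ceph", "fs", "set", "cephfs", "max_mds", "2"], check=True)

# Pin hot dataset directories to specific ranks via the ceph.dir.pin
# virtual xattr, so each rank serves a predictable subtree.
os.setxattr("/mnt/cephfs/datasets/train", "ceph.dir.pin", b"0")
os.setxattr("/mnt/cephfs/datasets/validate", "ceph.dir.pin", b"1")
```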
Well, when it comes to AI training, I would think that that would be a problem
because of the scalability of the clusters
that are being used.
You have a ton of clients,
they're accessing the same data.
Like you said, they're opening lots and lots of files.
Is that what you're seeing?
That it tends to be a very, very massively
parallel access pattern?
That's right. Yeah, exactly.
And repeating the same ones again and again and again.
And it's really the number of files, that's the thing. It's very easy
to make a file system or a storage system which deals with large objects or large files, where you
just stream everything in parallel. And, you know, in the HPC space, we call this embarrassingly
parallel. So it's easy to make an embarrassingly parallel storage system.
But when you have contention, and you have clients, you know, all looking at the same files,
maybe modifying files in the same directories, then, because Ceph has this very strict view of never
allowing any client to have an outdated view of the current situation, that's where it gets hard. And with our developer
hat on, we're like, hmm, maybe for real life that kind of strict consistency is not always needed.
So maybe we should consider relaxing some of those guarantees to behave more like the rest of the file systems that we compete against.
I don't know if it's the right time to bring this up,
but I kind of want to talk a little bit about the hardware.
Obviously I'm with Solidigm, so I have a vested interest here.
But just curious, as we're building higher-capacity, denser storage,
what is your viewpoint on how this helps with Ceph? And if at all, is there any benefit to, say, a 61.44-terabyte QLC SSD?
Yeah, I mean, of course there's a benefit, right?
When NVMe drives arrived on the scene, Ceph was not prepared for this,
because Ceph is very smart software; there are a lot of lines of code
that go into storing the data. And, you know, it was written at a time when
we probably had the early days of flash, but it was written in the time of spinning disks.
And of course, there was a lot of work done in those early days of NVMe to really make sure that you could extract all the high performance out of that.
And there were some early workarounds done as well. Yeah, one of the low-level hacks that practitioners learned early on was to take large NVMe drives and split them into many different virtual devices, and then treat those as separate disks. That way Ceph could get closer to the native performance, because the devices had so many IOPS and so much bandwidth that the software couldn't keep up.
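That workaround survives today as a first-class flag in Ceph's provisioning tool, ceph-volume. A sketch for illustration (the device paths and the OSD count are assumptions, not a recommendation from the episode):

```python
import subprocess

# Carve each NVMe into several LVM-backed OSDs so that multiple OSD
# daemons, and therefore more threads, drive one physical device.
subprocess.run(
    ["ceph-volume", "lvm", "batch", "--osds-per-device", "4",
     "/dev/nvme0n1", "/dev/nvme1n1"],
    check=True,
)
```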
But that was maybe five years ago.
In the last two or three years, the Ceph project has spent a lot of time
reworking the internals to make sure that they can extract all of the performance.
We've achieved quite a lot already, and there's an ongoing major project called Crimson, which is a major rewrite of the low-level storage daemons, 100% focused on extracting the performance of large NVMe drives like you're talking about now.
So there isn't, in your opinion, any sort of issue with, say, the endurance, right, when it comes to working with QLC?
I mean, this is definitely a relevant issue that comes up.
When we're designing systems for customers, we pay attention to the endurance.
We pay attention to things like write amplification that might be relevant for their exact use case. So when we're working with a customer or a user,
we try to understand exactly how they're using it and then guide them. One of the things about Ceph
is that it has about 10,000 tuning options. And so you can really manipulate anything
about the sector sizes, the block sizes,
how data is chopped, diced, and sliced
and distributed across the cluster,
in a way that optimizes the usage
of the underlying devices, right?
So if you have that full holistic view of the hardware,
you can tune Ceph to extract maximal performance. And so out of the box
it might not always work optimally, but with a little insight and a little expertise and guidance, you can really get there. That's how you achieve results like one terabyte per second: paying
attention to the whole stack. We see a lot of Ceph users making use of those large devices.
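As a small taste of what that tuning looks like in practice, here is a hedged sketch using the standard `ceph config set` command; the two options are real Ceph settings, but the values are illustrative assumptions that depend entirely on your hardware and workload:

```python
import subprocess

def ceph_config_set(section: str, option: str, value: str) -> None:
    """Set a cluster-wide configuration option via the standard ceph CLI."""
    subprocess.run(["ceph", "config", "set", section, option, value],
                   check=True)

# Give each OSD daemon a larger memory budget on big-RAM NVMe nodes.
ceph_config_set("osd", "osd_memory_target", str(8 * 1024**3))

# Allow more concurrent backfill operations on clusters with fast devices.
ceph_config_set("osd", "osd_max_backfills", "4")
```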
Yeah. So Dan, you are out there helping people build Ceph systems and adapt
their Ceph systems, and basically bring them into the AI world on a daily basis, I assume.
Talk to us a little bit about
some of the real world here. I guess one of the questions that immediately springs to mind is,
you know, something you brought up, which is the object store versus file system versus block
question. You know, some of the performance questions. I'm just excited to hear how real world users are using Ceph storage with AI.
It's such a wide variety.
It's hard to find one classical explanation, like one classical model.
I mean, if you look at the typical, very large supercomputer right now,
I think many of the recently designed
supercomputer centers, let's say Top500 or IO500 systems,
you'll find a Ceph tier on the outside, the outer layer, as a sort of data lake. That's very common because it's really the
lowest-cost tier: Ceph has very efficient erasure coding, so you
can get a low-cost, standards-compliant object store. It speaks S3 and Swift, so you get a
compliant, normal object storage that you can deploy and run at a scale that
makes sense to different organizations, right?
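For a sense of the economics: with a k=4, m=2 erasure-code profile, each object is split into four data chunks plus two coding chunks, so the pool survives two failures at 1.5x raw overhead instead of the 3x of triple replication. A sketch of creating such a pool (the profile and pool names are placeholder assumptions):

```python
import subprocess

def ceph(*args: str) -> None:
    """Run a command through the standard ceph CLI."""
    subprocess.run(["ceph", *args], check=True)

# Define an erasure-code profile: 4 data chunks + 2 coding chunks,
# i.e. (4+2)/4 = 1.5x raw overhead versus 3x for 3-way replication.
ceph("osd", "erasure-code-profile", "set", "datalake-ec", "k=4", "m=2")

# Create a pool backed by that profile for bulk data-lake objects.
ceph("osd", "pool", "create", "datalake.data", "erasure", "datalake-ec")
```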
But, you know, that's the beauty of Ceph.
It can do whatever you need. It can work in that use case. We also have AI environments that are
building their solutions on top of block storage. So it's like: let's prepare some block
device images that have data sets, and we'll just mount those on the fly
and deploy the data like that.
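A minimal sketch of that block-image pattern using Ceph's own Python bindings (python3-rados and python3-rbd); the pool name, image name, and size are placeholder assumptions:

```python
import rados
import rbd

# Connect to the cluster using the local ceph.conf (path is an assumption).
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    # Open the pool that holds block-device images.
    ioctx = cluster.open_ioctx("rbd")
    try:
        # Create a 100 GiB image to hold a prepared data set; a client can
        # later map it (e.g. `rbd device map rbd/dataset-image`) and mount it.
        rbd.RBD().create(ioctx, "dataset-image", 100 * 1024**3)
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```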
Or of course the file system is always the standard:
let's have a file system, let's mount it,
let's put all the data there,
and then throw our massive GPU clusters at that file system.
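And a sketch of that last pattern with the CephFS kernel client (the monitor address, credentials, and paths are placeholder assumptions):

```python
import subprocess

# Mount CephFS so GPU nodes see an ordinary POSIX directory tree.
# Monitor address, user name, and secret file path are placeholders.
subprocess.run(
    ["mount", "-t", "ceph", "10.0.0.1:6789:/", "/mnt/cephfs",
     "-o", "name=admin,secretfile=/etc/ceph/admin.secret"],
    check=True,
)
```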
And then we get the call: it's like, did we overload it? Is it broken?
So we can help with those kinds of issues too,
because it's one of those things that, I don't know,
it is a good thing about Ceph, but one of the things that happens again and again is that it works so well that it absorbs more and
more of the use cases. AI here; oh, let's store the pension fund on it too. And then eventually
you have this thing that becomes the core critical back end of
the whole company or the whole university. And then it's, okay, let's re-architect this.
Let's move things around. We can do stretch clusters across multiple data centers. We can
take out that part that's really an analytics and AI-focused thing;
let's move that into a separate system, because it's not appropriate to have
it together with this other system. And, you know,
so then you can talk about a third or fourth generation, after they understand their use case a bit more. Yeah, that's a repeated thing that
happens quite often in the Ceph world. Well, I do love that, you know, Ceph is just
incredibly, as you said, flexible, right? And because it's been around for so long, there are so many different organizations that can take advantage of it and evolve with it, especially as we are in the world of AI, where everything seems to be evolving every second of every day.
Back to Stephen's point, can you give us an example of a particular partner? Gosh, I was just reading an article, you had something about,
you know, Ceph on Ubuntu. Or, you know, any partners that you can speak to specifics on
around how, you know, you're using Ceph for AI? I mean, the important thing is that, you know,
Clyso participates actively in the Ceph Foundation.
The Ceph Foundation was established
under the umbrella of the Linux Foundation.
So that's the open forum where the organizations can coordinate
their activities.
We have different hardware
vendors, software vendors, and everyone, and we find projects to work on together.
So Dan, you have a lot of experience in the HPC space. And I think that many of us have seen
very much a similarity between HPC and AI, both in terms of architecture, but also in terms of
the technical solutions that are
used as well as the community that's supporting it. Obviously, a lot of the AI space is built on
open source tools and open source projects. A lot of the HPC customers are out there deploying AI or experimenting with AI. There's certainly a lot
going on in academia. Talk to us a little bit about the community and the ways that AI and
Ceph and open source are working together. Yeah. So, I mean, it comes back to the roots of Ceph
and then how it evolved. It started, like you said, in that HPC-targeted
use case. Because of its flexibility, it was the right technology on the spot when private clouds were
coming. Then we went from private clouds to containerized workloads and Kubernetes.
It was the right technology for Kubernetes, and it remains so today with the Ceph CSI plugin.
And, you know, each time, you grow the community.
And then also data lakes, you know, with object storage,
the massive large-scale data lakes: it was the relevant technology.
And now with AI, it's the same.
It's this ever-growing community.
And, you know, for those of us who have been participating for a while
(you also had the pleasure; I wish I had been there at the initial announcement, but I was maybe a year later), the strength of Ceph has always been its
awesome community. With the mailing lists, with the events that we organize,
the Ceph Day events around the world,
and Cephalocon, our yearly large event; this year
it'll be in December at my former organization, CERN.
You know, we bring people together, and it's this friendly
open source community where everyone's sharing experience and contributing
valuable feedback to developers to make it better.
And it's, you know, it's really
the Linux of storage.
Everyone works together and makes something that is going to
solve real problems, make lives better, and make organizations more efficient.
And it really works.
So that's why I would just say: if people are just learning about Ceph through this podcast,
please go to ceph.io, follow the links, join the mailing list, join the Slack,
check out our YouTube, watch the videos, come to one of the events. And if you check the Clyso LinkedIn next month, I'll be giving a webinar
related to mastering some Ceph operations tasks: some tools that
we have to simplify the operations of some of the maintenance tasks that might be
tricky for some operators, and we have some tools to make it a lot easier. So that's it. Yeah.
Yeah. It really seems that the whole open source environment is still quite vibrant, quite alive and well. And I'm glad to see that because, as I said, it's important in HPC, and the point I made was that tools like this tend to find their way outside of their intended use cases.
And if they're useful and if they prove themselves in one area, they tend to absorb other use cases as well.
And as companies are deploying AI applications, the question is, how do we get data into this? Well, in many cases, the customers already have
data in something like Ceph, and they could absolutely think about using that as part of
their AI data pipeline, just like they would as part of their analytics pipeline or their research.
And I think that if there's one message that comes out of this, it's that Ceph is absolutely useful and will be useful in the AI space as well.
So thank you so much for that overview.
And definitely, I echo what you said.
Check out some of the open source work that's being done and get involved.
Thanks very much for having me.
It's been a pleasure.
Well, thanks for being here.
And Jeniece, thank you for joining me for
this episode of Utilizing Tech. Thank you also to the listeners. If you enjoyed this podcast,
please do leave us a rating, a review, a comment. You'll find it on your favorite podcast application
and you'll also find us on YouTube. This podcast is brought to you by Solidigm and by Tech Field
Day, part of the Futurum Group. For show notes and more episodes, head over to our dedicated website, which is utilizingtech.com,
or find us on X/Twitter and Mastodon at Utilizing Tech.
Thanks for listening, and we will see you next week with another episode of Utilizing Tech,
focused on AI data infrastructure with Solidigm. Thank you.