Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 07x06: Connecting Ceph Storage to AI with Clyso
Episode Date: July 8, 2024. Many of the largest-scale data storage environments use Ceph, an open source storage system, and are now connecting this to AI. This episode of Utilizing Tech, sponsored by Solidigm, features Dan van der Ster, CTO of Clyso, discussing Ceph for AI Data with Jeniece Wnorowski and Stephen Foskett. Ceph began in research and education but today is widely used in finance, entertainment, and commerce as well. All of these use cases require massive scalability and extreme reliability despite using commodity storage components, but Ceph is increasingly able to deliver high performance as well. AI workloads also require scalable metadata performance, an area in which Ceph developers are making great strides. The software has also proved itself adaptable to advanced hardware, including today’s large NVMe SSDs. As data infrastructure development has expanded from academia to HPC to the cloud and now AI, it’s important to see how the community is embracing and improving the software that underpins today’s compute stack. Hosts: Stephen Foskett, Organizer of Tech Field Day: https://www.linkedin.com/in/sfoskett/ Jeniece Wnorowski, Datacenter Product Marketing Manager at Solidigm: https://www.linkedin.com/in/jeniecewnorowski/ Guest: Dan van der Ster, CTO at CLYSO and Ceph Executive Council Member: https://www.linkedin.com/in/dan-vanderster/ Follow Utilizing Tech Website: https://www.UtilizingTech.com/ X/Twitter: https://www.twitter.com/UtilizingTech Tech Field Day Website: https://www.TechFieldDay.com LinkedIn: https://www.LinkedIn.com/company/Tech-Field-Day X/Twitter: https://www.Twitter.com/TechFieldDay Tags: #UtilizingTech, #Sponsored, #AIDataInfrastructure, #AI, @SFoskett, @TechFieldDay, @UtilizingTech, @Solidigm
Transcript
Many of the largest scale data storage environments use Ceph, an open source storage system, and are now connecting this to AI.
This episode of Utilizing Tech, sponsored by Solidigm, features Dan van der Ster, CTO of Clyso, discussing Ceph for AI data.
Welcome to Utilizing Tech, the podcast about emerging technology from Tech Field Day, part of the Futurum Group.
This season is presented by Solidigm and focuses on the questions of AI data infrastructure.
I'm your host, Stephen Foskett, organizer of the Tech Field Day event series. And joining me from
Solidigm is my co-host, Jeniece Wnorowski. Welcome to the show. Thank you, Stephen. It's great to be
back. Well, it's good to have you here. So as we've spoken about many times in the past, especially here on this whole season of Utilizing Tech, there's a lot of data out there. There's a lot of existing data, a lot of existing data sources, and a lot of existing data platforms. And all of that is going to eventually need to be integrated into the AI data pipeline and into the whole AI picture. Yeah, absolutely.
And, you know, there's lots of ways to go about it, right?
A lot of different hardware out there, software.
And folks are really trying to figure it all out.
How do I make all of this work together?
And there's one software tool out there that's open source
that many, many companies have been using for years, right?
And now with the advent of AI, they're kind of like,
how do I use this tool I've been using for a long time that's open source and free?
How do I make this work for my workloads going on today
and the ones that are ever evolving into the future?
So we're excited to talk with somebody from Ceph.
So I'll turn it back to you, Stephen, to introduce him, but excited to dive into this topic.
Absolutely, yeah.
And Ceph is one of those things storage nerds like me have seen for many, many years.
I've watched this project grow.
I've watched it become absolutely a critical component. Many people may not have heard of it,
but as our guest said, it kind of is the Linux of storage. It is everywhere. And it is used,
especially in environments that have lots and lots of data. And those are the environments
that are going to need to be integrated into the AI data pipeline. So let's welcome Dan van der Ster from Clyso here to talk a little bit about the importance of Ceph in AI.
Thanks, Stephen and Jeniece. Yeah, I'm Dan van der Ster. I'm CTO at Clyso.
I'm coming from CERN. I spent about a decade at CERN working in the IT department, working in storage. And I'm also
wearing multiple hats. I've had the pleasure of working with Ceph for
around a decade as well: early adoption,
testing at scales that hadn't been seen before,
getting more active in the community, and
now serving in the open source project as a member of the Executive Council leading the overall project.
So many people haven't heard of Ceph, like I said, but I'm sure it affects almost everybody now, because it really is everywhere.
I mean, essentially, this is a software-based storage solution.
So full disclosure, I was there at the beginning
when this was originally announced,
and I was excited about it because of what it is.
It uses unreliable components to build reliable,
high-performance, scalable storage solutions.
So essentially, it is massive scale, massively
distributed, and it's designed not just to be able to adapt to failure, but to be ready for failure.
And that's one reason that it's become so successful. So tell us a little bit more:
where is Ceph today in the overall picture of the world's information?
Right. I mean, you highlighted a lot of the points
that actually attracted us to Ceph early on and made us one of the early adopters.
You know, what organizations are looking for, and what the people
operating the storage infrastructure are looking for, is something that's reliable and can be
built out of, you know, low-cost commodity components.
And one of the main things is that if you're building a large-scale
storage system, you don't want to have to lift and shift and
migrate data from one appliance to
another every four or five years. You want something like an organic storage system. I think we
wrote an internal memo at CERN, 'Toward an Organic Storage System' for our cloud, and
that's how we got started. And yeah, it really proved to deliver what it promised, which was that kind
of scalable, reliable system where the operators can, you know, sleep at night. Things fail all
the time, but you end up with a reliable system that can grow and evolve with the organization.
So on that notion of creating something
that's scalable and reliable and kind of always on,
can you dive a little bit into how you're working with AI
with some of your partners today
and how you're bringing some of that Ceph goodness
to AI?
Yeah, I mean, the AI use case is certainly the hot topic with Ceph these days,
because in addition to the whole reliability aspect,
Ceph is also very flexible.
It's a low-level object store internally,
but on top it presents very familiar storage interfaces.
It's block storage for a private cloud, or it's object storage compatible with, you know, the public clouds,
and it's also a file system, a normal POSIX file system, like a typical NFS, what you'd expect
from NFS. So because it's so flexible, it's already used quite massively in those,
you know, cloud environments for object storage, especially self-hosted or hybrid
cloud or multi-cloud environments. So there's really just a lot of data out there that,
now in the AI context, organizations want to process more quickly, more rapidly. And it always
puts pressure on the project to deliver more and more features and performance needed for the
exponentially expanding processing capacity that we have in the AI world.
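As a concrete sketch of that object-storage compatibility (an illustration added here, not something from the episode; the endpoint, credentials, bucket, and file names are placeholder assumptions), standard S3 tooling works directly against Ceph's RADOS Gateway:

```python
import boto3

# Point a standard S3 client at a Ceph RADOS Gateway endpoint.
# The URL and credentials are placeholders; in practice they come from
# your RGW deployment and a user created with `radosgw-admin user create`.
s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:8080",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Create a bucket and upload a training-data shard, exactly as you would
# against any other S3-compatible object store.
s3.create_bucket(Bucket="training-data")
s3.upload_file("train-shard-0001.tar", "training-data",
               "shards/train-shard-0001.tar")
```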
So, I guess, what are the industries and use cases that are predominantly
using Ceph today? Let's talk about that, and then let's talk about how
those industries are going to be using AI with this data set.
Right. I mean, Ceph had its beginnings in the academic sphere. So initially it was really a lot of
universities, research centers,
research labs building and using it. It started out as an
HPC file system, and then it evolved these other flexible use cases. Today it's used by all sorts
of industries, every industry: let's say financial industries, you know, high-frequency
trading, or complicated algorithms crunching through
lots of financial data. Of course still supercomputer centers doing any kind of
research, let's say biotech research, things like that. And also, you know, on the entertainment side,
media storage. Ceph is massive in media storage,
backing, you know, some of the largest media companies,
as well as video games and things like that.
You know, you say it may not be well known,
but it really is backing the largest infrastructures that we have out there.
Ceph is behind the scenes powering it all.
And you mentioned a moment ago, too, that, you know, Ceph isn't always seen as being maybe a high-performance solution, right?
Can you talk a little bit more about, you know, how Ceph does bring forth that performance?
Yeah, I mean, it's important to understand that when Ceph was created, the idea was to do something better than
previous storage systems, and that was to put a very strong priority on the durability and
consistency and reliability of the storage. So, to not make any sacrifices on that data consistency,
whether for performance or for any other reason.
There are a lot of smarts internally in Ceph
to enable that to happen at high-performance speeds.
And we've done recent tests at Clyso,
published with the Ceph project on ceph.io,
to show what's possible
and to try to make a little bang,
to make the other file systems, you know,
pay attention a little more.
We did a one-terabyte-per-second demo.
I'm not sure another file system has demonstrated
one terabyte per second recently,
but this was, you know, just with open source software on commodity components: a few hundred
NVMe drives, standard servers that anyone can purchase, and you can build the best AI platform for
your organization with it. It seems like that is really true to the philosophy of Ceph too,
because right from the very beginning,
it was all about using mundane components.
It's not about extreme, specialized components.
It's about using ordinary components that are accessible and affordable and varied and
combining that into a unified system that offers massive scale. And one of the fundamental architecture
concepts of Ceph is distributing everything: not having a single point of failure,
not having a single bottleneck. So it makes sense to me that that approach would be able to deliver
high performance if that was one of the goals that folks had when kind of rebuilding it out
and tuning the system. Because that is very similar to how AI clusters and HPC clusters are
built. They're built out of, well, in some cases, more exotic components, but they're still built
out of massively scalable components and they're massively distributed. Because anytime you have
any kind of bottleneck, well, that's a bottleneck that's going to cause a performance hit. So do you find that you're
able to change, I guess, some of the tuning, or some of
the way that Ceph was designed, to really focus on performance and distribute
that workload? Is that how it works?
I mean, in the earlier days of Ceph, I don't know, maybe let's say eight years ago or 10 years ago,
I think we all in the community had the idea that, yeah, we call it like horizontal scalability. If you want more performance, just buy more servers or buy more devices and you just add them
and then you scale linearly the performance.
And for the very lowest levels of Ceph and for things like object storage, that's really true.
And we had the ambition and the goal that, like, okay, we can see beyond NFS. We won't have
POSIX file systems anymore. We don't need them anymore. We can just do everything with object storage.
But it still seems that POSIX file systems are popular and needed; they're just the normal expectation, for a variety of different reasons. And POSIX file
systems are not as easily horizontally scalable. That's the issue.
Scaling AI workloads brings new challenges to metadata performance.
Opening literally millions of files per second
is a very highly metadata-intensive task.
And so we work hard to have new ways to make the Ceph metadata features actually scale as well.
They're pretty good at it already, I would say.
It's like we anticipated this a few years ago, and it's already quite good.
But we're always trying to push to the next level and be ahead of the curve a bit.
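To make that concrete, CephFS can run multiple active metadata servers and lets operators pin directories to specific MDS ranks. Here is a minimal sketch (the filesystem name, mount path, and rank numbers are placeholder assumptions, not details from the episode):

```python
import os
import subprocess

# Allow two active MDS daemons for the filesystem named "cephfs"
# (the name is a placeholder), sharding metadata load across ranks.
subprocess.run(["ceph", "fs", "set", "cephfs", "max_mds", "2"], check=True)

# Pin hot dataset directories to specific ranks via the ceph.dir.pin
# virtual xattr, so each rank serves a predictable subtree.
os.setxattr("/mnt/cephfs/datasets/train", "ceph.dir.pin", b"0")
os.setxattr("/mnt/cephfs/datasets/validate", "ceph.dir.pin", b"1")
```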
Well, when it comes to AI training, I would think that that would be a problem
because of the scalability of the clusters
that are being used.
You have a ton of clients,
they're accessing the same data.
Like you said, they're opening lots and lots of files.
Is that what you're seeing?
That it tends to be a very, very massively
parallel access pattern?
That's right. Yeah, exactly.
And repeating the same ones again and again and again.
And it's really the number of files, that's the thing. It's very easy
to make a file system or a storage system which deals with large objects or large files, where you
just stream everything in parallel. And, you know, in the HPC space, we call this embarrassingly
parallel. So it's easy to make an embarrassingly parallel storage system.
But when you have contention, and you have clients, you know, all looking at the same files,
maybe modifying files in the same directories, then, because Ceph has this very strict view of never
allowing any client to have an outdated view of the current situation, that's where it gets hard. And with our developer
hat on, we're like, hmm, maybe for real life that kind of strict consistency is not always needed.
So maybe we should consider relaxing some of those guarantees to behave more like the rest of the file systems that we compete against.
I don't know if it's the right time to bring this up,
but I kind of want to talk a little bit about the hardware.
Obviously I'm with Solidigm, so I have a vested interest here.
But just curious, as we're building higher-capacity, denser storage,
what is your viewpoint on how this helps with Ceph? And if at all, is there any benefit to, say, a 61.44-terabyte QLC SSD?
Yeah, I mean, of course there's a benefit, right?
When NVMe drives arrived on the scene, Ceph was not prepared for this,
because Ceph is very smart software; there are a lot of lines of code
that go into storing the data. And, you know, it was written at a time when
we probably had the early days of flash, but it was written in the time of spinning disks.
And of course, there was a lot of work done in those early days of NVMe to really make sure that you could extract all the high performance out of that.
And there were some early workarounds done as well. Yeah, one of the low-level hacks that practitioners learned early on was to take large NVMe drives and split them into many different virtual devices, and then treat those as separate disks. That way Ceph could get closer to the native performance, because the devices had so many IOPS and so much bandwidth that the software couldn't keep up.
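That workaround survives today as a first-class flag in Ceph's provisioning tool, ceph-volume. A sketch for illustration (the device paths and the OSD count are assumptions, not a recommendation from the episode):

```python
import subprocess

# Carve each NVMe into several LVM-backed OSDs so that multiple OSD
# daemons, and therefore more threads, drive one physical device.
subprocess.run(
    ["ceph-volume", "lvm", "batch", "--osds-per-device", "4",
     "/dev/nvme0n1", "/dev/nvme1n1"],
    check=True,
)
```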
But that was maybe five years ago.
In the last two or three years, the Ceph project has spent a lot of time
reworking the internals to make sure that they can extract all of the performance.
We've achieved quite a lot already, and there's an ongoing major project called Crimson, which is a major rewrite of the low-level storage daemons, 100% focused on extracting the performance of large NVMe drives like you're talking about now.
So there isn't, in your opinion, any sort of issue with, say, the endurance, right, when it comes to working with QLC?
I mean, this is definitely a relevant issue that comes up.
When we're designing systems for customers, we pay attention to the endurance.
We pay attention to things like write amplification that might be relevant for their exact use case. So when we're working with a customer or a user,
we try to understand exactly how they're using it and then guide them. One of the things about Ceph
is that it has about 10,000 tuning options. And so you can really manipulate anything
about the sector sizes, the block sizes,
how data is chopped, diced, and sliced
and distributed across the cluster,
in a way that optimizes the usage
of the underlying devices, right?
So if you have that full holistic view of the hardware,
you can tune Ceph to extract maximal performance. And so out of the box
it might not always work optimally, but with a little insight and a little expertise and guidance, you can really get there. That's how you achieve results like one terabyte per second: paying
attention to the whole stack. We see a lot of Ceph users making use of those large devices.
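As a small taste of what that tuning looks like in practice, here is a hedged sketch using the standard `ceph config set` command; the two options are real Ceph settings, but the values are illustrative assumptions that depend entirely on your hardware and workload:

```python
import subprocess

def ceph_config_set(section: str, option: str, value: str) -> None:
    """Set a cluster-wide configuration option via the standard ceph CLI."""
    subprocess.run(["ceph", "config", "set", section, option, value],
                   check=True)

# Give each OSD daemon a larger memory budget on big-RAM NVMe nodes.
ceph_config_set("osd", "osd_memory_target", str(8 * 1024**3))

# Allow more concurrent backfill operations on clusters with fast devices.
ceph_config_set("osd", "osd_max_backfills", "4")
```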
Yeah. So Dan, you are out there helping people build Ceph systems and adapt
their Ceph systems, and basically bring them into the AI world on a daily basis, I assume.
Talk to us a little bit about
some of the real world here. I guess one of the questions that immediately springs to mind is,
you know, something you brought up, which is the object store versus file system versus block
question. You know, some of the performance questions. I'm just excited to hear how real world users are using Ceph storage with AI.
It's such a wide variety.
It's hard to find one classical explanation, like one classical model.
I mean, if you look at the typical, very large supercomputer right now,
I think many of the recently designed
supercomputer centers, let's say Top500 or IO500 systems,
you'll find a Ceph tier on the outside, the outer layer, as a sort of data lake. That's very common because it's really the
lowest-cost tier: Ceph has very efficient erasure coding, so you
can get a low-cost, standards-compliant object store. It speaks S3 and Swift, so you get a
compliant, normal object storage that you can deploy and run at a scale that
makes sense to different organizations, right?
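For a sense of the economics: with a k=4, m=2 erasure-code profile, each object is split into four data chunks plus two coding chunks, so the pool survives two failures at 1.5x raw overhead instead of the 3x of triple replication. A sketch of creating such a pool (the profile and pool names are placeholder assumptions):

```python
import subprocess

def ceph(*args: str) -> None:
    """Run a command through the standard ceph CLI."""
    subprocess.run(["ceph", *args], check=True)

# Define an erasure-code profile: 4 data chunks + 2 coding chunks,
# i.e. (4+2)/4 = 1.5x raw overhead versus 3x for 3-way replication.
ceph("osd", "erasure-code-profile", "set", "datalake-ec", "k=4", "m=2")

# Create a pool backed by that profile for bulk data-lake objects.
ceph("osd", "pool", "create", "datalake.data", "erasure", "datalake-ec")
```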
But, you know, that's the beauty of Ceph.
It can do whatever you need. It can work in that use case. We also have AI environments that are
building their solutions on top of block storage. So it's like: let's prepare some block
device images that have data sets, and we'll just mount those on the fly
and deploy the data like that.
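A minimal sketch of that block-image pattern using Ceph's own Python bindings (python3-rados and python3-rbd); the pool name, image name, and size are placeholder assumptions:

```python
import rados
import rbd

# Connect to the cluster using the local ceph.conf (path is an assumption).
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    # Open the pool that holds block-device images.
    ioctx = cluster.open_ioctx("rbd")
    try:
        # Create a 100 GiB image to hold a prepared data set; a client can
        # later map it (e.g. `rbd device map rbd/dataset-image`) and mount it.
        rbd.RBD().create(ioctx, "dataset-image", 100 * 1024**3)
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```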
Or of course the file system is always the standard:
let's have a file system, let's mount it,
let's put all the data there,
and then throw our massive GPU clusters at that file system.
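And a sketch of that last pattern with the CephFS kernel client (the monitor address, credentials, and paths are placeholder assumptions):

```python
import subprocess

# Mount CephFS so GPU nodes see an ordinary POSIX directory tree.
# Monitor address, user name, and secret file path are placeholders.
subprocess.run(
    ["mount", "-t", "ceph", "10.0.0.1:6789:/", "/mnt/cephfs",
     "-o", "name=admin,secretfile=/etc/ceph/admin.secret"],
    check=True,
)
```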
And then we get the call: it's like, did we overload it? Is it broken?
So we can help with those kinds of issues too,
because it's one of those things that, I don't know,
it is a good thing about Ceph, but one of the things that happens again and again is that it works so well that it absorbs more and
more of the use cases. AI here; oh, let's store the pension fund on it too. And then eventually
you have this thing that becomes the core critical back end of
the whole company or the whole university. And then it's, okay, let's re-architect this.
Let's move things around. We can do stretch clusters across multiple data centers. We can
take out that part that's really an analytics and AI-focused thing;
let's move that into a separate system, because it's not appropriate to have
it together with this other system. And, you know,
so then you can talk about a third or fourth generation, after they understand their use case a bit more. Yeah, that's a repeated thing that
happens quite often in the Ceph world. Well, I do love that, you know, Ceph is just
incredibly, as you said, flexible, right? And because it's been around for so long, there are so many different organizations that can take advantage of it and evolve with it, especially as we are in the world of AI, where everything seems to be evolving every second of every day.
Back to Stephen's point, can you give us an example of a particular partner? Gosh, I was just reading an article, you had something about,
you know, Ceph on Ubuntu. Or, you know, any partners that you can speak to specifics on
around how, you know, you're using Ceph for AI? I mean, the important thing is that, you know,
Clyso participates actively in the Ceph Foundation.
The Ceph Foundation was established
under the umbrella of the Linux Foundation.
So that's the open forum where the organizations can coordinate
their activities.
We have different hardware
vendors, software vendors, and everyone, and we find projects to work on together.
So Dan, you have a lot of experience in the HPC space. And I think that many of us have seen
very much a similarity between HPC and AI, both in terms of architecture, but also in terms of
the technical solutions that are
used as well as the community that's supporting it. Obviously, a lot of the AI space is built on
open source tools and open source projects. A lot of the HPC customers are out there deploying AI or experimenting with AI. There's certainly a lot
going on in academia. Talk to us a little bit about the community and the ways that AI and
Ceph and open source are working together. Yeah. So, I mean, it comes back to the roots of Ceph
and then how it evolved. It started, like you said, in that HPC-targeted
use case. Because of its flexibility, it was the right technology on the spot when private clouds were
coming. Then we went from private clouds to containerized workloads and Kubernetes.
It was the right technology for Kubernetes, and it remains so today with the Ceph CSI plugin.
And, you know, each time, you grow the community.
And then also data lakes, you know, with object storage,
the massive large-scale data lakes: it was the relevant technology.
And now with AI, it's the same.
It's this ever-growing community.
And, you know, for those of us who have been participating for a while
(you also had the pleasure; I wish I had been there at the initial announcement, but I was maybe a year later), the strength of Ceph has always been its
awesome community. With the mailing lists, with the events that we organize,
the Ceph Day events around the world,
and Cephalocon, our yearly large event; this year
it'll be in December at my former organization, CERN.
You know, we bring people together, and it's this friendly
open source community where everyone's sharing experience and contributing
valuable feedback to developers to make it better.
And it's, you know, it's really
the Linux of storage.
Everyone works together and makes something that is going to
solve real problems, make lives better, and make organizations more efficient.
And it really works.
So that's why I would just say: if people are just learning about Ceph through this podcast,
please go to ceph.io, follow the links, join the mailing list, join the Slack,
check out our YouTube, watch the videos, come to one of the events. And if you check the Clyso LinkedIn next month, I'll be giving a webinar
related to mastering some Ceph operations tasks: some tools that
we have to simplify the operations of some of the maintenance tasks that might be
tricky for some operators, and we have some tools to make it a lot easier. So that's it. Yeah.
Yeah. It really seems that the whole open source environment is still quite vibrant, quite alive and well. And I'm glad to see that because, as I said, it's important in HPC, and the point I made was that tools like this tend to find their way outside of their intended use cases.
And if they're useful and if they prove themselves in one area, they tend to absorb other use cases as well.
And as companies are deploying AI applications, the question is, how do we get data into this? Well, in many cases, the customers already have
data in something like Ceph, and they could absolutely think about using that as part of
their AI data pipeline, just like they would as part of their analytics pipeline or their research.
And I think that if there's one message that comes out of this, it's that Ceph is absolutely useful and will be useful in the AI space as well.
So thank you so much for that overview.
And definitely, I echo what you said.
Check out some of the open source work that's being done and get involved.
Thanks very much for having me.
It's been a pleasure.
Well, thanks for being here.
And Jeniece, thank you for joining me for
this episode of Utilizing Tech. Thank you also to the listeners. If you enjoyed this podcast,
please do leave us a rating, a review, a comment. You'll find it on your favorite podcast application
and you'll also find us on YouTube. This podcast is brought to you by Solidigm and by Tech Field
Day, part of the Futurum Group. For show notes and more episodes, head over to our dedicated website, which is utilizingtech.com,
or find us on X/Twitter and Mastodon at Utilizing Tech.
Thanks for listening, and we will see you next week with another episode of Utilizing Tech,
focused on AI data infrastructure with Solidigm. Thank you.