Grey Beards on Systems - 124: GreyBeards talk k8s storage orchestration using CNCF Rook Project with Sébastien Han & Travis Nielsen, Red Hat

Episode Date: October 9, 2021

Stateful containers are becoming a hot topic these days so we thought it a good time to talk to the CNCF (Cloud Native Computing Foundation) Rook team about what they are doing to make storage easier to use for k8s container apps. CNCF put us into contact with Sébastien Han (@leseb_), Ceph Storage Architect, and Travis Nielsen, Senior Principal Software Engineer, both at Red Hat.

Transcript
Starting point is 00:00:00 Hey everybody, Ray Lucchese here with Keith Townsend. Welcome to another sponsored episode of the Greybeards on Storage podcast, a show where we get Greybeards bloggers together with storage and system vendors to discuss upcoming products and technologies. With us today are Sébastien Han, Senior Principal Software Engineer and Ceph Storage Architect, and Travis Nielsen, Senior Principal Software Engineer, both at Red Hat. So, Sebastian and Travis, why don't you tell us a little bit about yourselves and what's going on with the Rook Storage Project?
Starting point is 00:00:44 Yeah, hi, this is Travis. Glad to be with you today. What's happening with Rook today? That's a big question. Lots is going on. Maybe I'll start with a little background first on what Rook is and what we're trying to accomplish. What Rook aims to do is really bring storage to Kubernetes in a way that's natural to work with Kubernetes. And where we started, the storage platform we started with as well was Ceph. So we knew Ceph was a great storage platform and it was built
Starting point is 00:01:21 long before Kubernetes ever existed. So where Rook started was we said, well, let's bring Ceph into Kubernetes. And the way we do that is with an operator. So an operator works with Kubernetes CRDs or custom resource definitions to, yeah, to respond to what the user's desired state is. So you want to deploy Ceph, so you create these custom resources
Starting point is 00:01:56 that tell Rook how you want to deploy Ceph. And then the Rook operator is the component that goes and makes that happen. It automates the install and everything around getting Ceph running in the cluster. So it's worth pausing here for a sec to really dig into this deeply for our audience, at least peel back the covers a bit. A CRD is an object you configure, let's say, in a generic sense, not just Kubernetes, but I want to describe a resource and how I'm going to attach the resource, etc. So it's a rich definition of how you're going to use a resource. And is it a Ceph concept or is it a Kubernetes concept?
Starting point is 00:02:43 No, it's Kubernetes. Yeah, it's an extension to the Kubernetes API. When there is something you want to define, or something you want to see happening in Kubernetes that Kubernetes has no idea about, and that can be a storage cluster, for example, then you can define a CRD that represents that cluster. And then on the side, you have an operator that will respond to requests like the instantiation of that CRD. Then the operator will respond and deploy a cluster, for example.
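To make that concrete, here is a minimal sketch of the kind of custom resource being described, using Rook's CephCluster CRD. Field names follow the ceph.rook.io/v1 API as documented around the time of this episode; the image tag and counts are illustrative placeholders, so check the Rook docs for your release.

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v16.2.6   # which Ceph release the operator should run
  dataDirHostPath: /var/lib/rook       # where Ceph configuration and state live on each host
  mon:
    count: 3                           # three monitors to maintain quorum
  storage:
    useAllNodes: true                  # let the operator place OSDs on every node
    useAllDevices: true                # ...and consume every empty raw device it finds

Applying this manifest is the whole request; the operator watches for it, creates the mon, mgr, and OSD pods, and keeps reconciling them toward that desired state.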
Starting point is 00:03:14 And really think of the operator as, you just take all of the operational expertise and then you bring all of that into a software entity that will be responsible for deploying, maintaining, and just managing the entire lifecycle of a piece of software. In our case, it's storage. So, Ray, if you think about the typical storage admin job, if I needed to attach a bunch of LUNs to, let's go ancient, to an HP-UX instance, and I wanted to replicate those LUNs locally or within, you know, in some block level storage, a CRD would be the equivalent of defining that in whatever your OS or platform is. An operator automates that thing. So instead of
Starting point is 00:04:02 having the administrator go back and repetitively do the same thing over and over again (it's an intelligent way to automate that thing), Kubernetes has this
Starting point is 00:04:12 concept of an operator which can intelligently do that repetitive task that can be software defined now. And does it take the place of CSI or something like that? Or is the CSI still in the environment as well? CSI is still in the environment as well.
Starting point is 00:04:32 I'd say what the custom resource definition is, the CRD, it's just really a way to extend the Kubernetes resources that are built into Kubernetes. Because Kubernetes couldn't possibly create or define all possible types of resources that people would want to use with Kubernetes. So they allow this plugin mechanism so you can define your own types.
Starting point is 00:04:55 So the concept of a CRD and sort of the framework for using CRDs is built into Kubernetes, but the actual definition and creation of those is defined by each project. Like Rook has its set of CRDs that define Ceph for the Ceph storage cluster that's operating in the Kubernetes pods. Is that how it plays out? Yeah, think of Rook as the management plane for the storage platform, which is Ceph. And I guess why this is needed is a really interesting rabbit hole to go down a bit, because a lot of times storage admins will look at something like a CSI driver and think, oh, OK, with this CSI driver for my VMAX array, the pods in the cluster can see the underlying storage.
Starting point is 00:06:09 What problem am I solving if I'm able to provide persistent block storage to a pod? And I think that's where I think you guys can help us really understand the value of Rook from a Kubernetes perspective. Yeah, with CSI specifically. So CSI is a specific extension to Kubernetes for storage. Of course, you can plug in your persistent volumes and mount them in your pods and all that. So you have CSI drivers that are implemented according to that CSI API. And so Rook actually installed a CSI driver for working with Ceph volumes.
Starting point is 00:06:54 And so that's one aspect of Rook, but Rook is not a CSI driver, Rook installed the Ceph CSI driver. And so CSI is a specific API, is how I think of it, for Kubernetes to work with the storage volume. So what I'm teasing out here is I can have a Ceph provider
Starting point is 00:07:15 outside of like my cluster. I can stand up, as you mentioned earlier in the introduction, Ceph has been around for long before Kubernetes. So I can have an existing set of Ceph resources and I can have a CSI driver for that and I can connect via the Ceph CSI driver to there. But that storage, by definition, sits outside of Kubernetes. It's not part of my cluster. So when my cluster fails or my storage fails, I lose that connectivity to the storage. As I understand it, it helps me think of storage in the same way that I think of Kubernetes resources.
Starting point is 00:08:06 So it effectively brings the Ceph cluster under the constraints and operational characteristics of a Kubernetes cluster. It takes that Ceph cluster you were talking about, Keith, and brings it inside Kubernetes, I guess. Is that what is going on here, gents? Yeah, yeah. That's exactly what it is. So your storage becomes, your Kubernetes becomes storage aware. In my Dell Technologies example where we have our VMAX array and we just have a CSI driver, if I want to move my workloads, I can move my workloads. But what about that persistent storage layer and that connectivity to the underlying storage?
Starting point is 00:08:50 The CSI driver doesn't magically make the new pods connect to the old storage. Something has to orchestrate all of that new connectivity and the movement of the physical data plane to a new set of pods. Rook does exactly that. And I think that's where you guys can help me out because this is where my kind of knowledge is failing because I'm not a Kubernetes guy. I know enough to be dangerous. As I think of like traditional Ceph
Starting point is 00:09:21 and traditional Ceph storage providers or NFS providers, because Rook works with more than just Ceph. As I think about that, and I think about the storage coming under Kubernetes control, what are the benefits over just CSI drivers to a traditional external resource? What's the distinction that bringing Ceph underneath the Kubernetes cluster versus having Ceph outside the Kubernetes cluster does for container apps, let's call it? You're just adding the management layer. If you have Rook deploying Ceph inside Kubernetes, that's what you get, essentially. Versus if you treat an existing external Ceph cluster
Starting point is 00:10:08 as external, then you're more into this consumer relationship than managing. You're not managing anything at this point. Rook is only responsible for bringing the connectivity of that external cluster onto Kubernetes and then passing it down to CSI so that you can provide persistent storage to your applications.
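As a sketch of that external mode, Rook's CephCluster CRD has an external flag; with it set, the operator only wires up the connection to the existing cluster and the CSI driver rather than deploying Ceph daemons itself. The field names below follow the ceph.rook.io/v1 API, and the monitor endpoints and keys would be imported separately through Rook's external-cluster import step, which is omitted here.

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph-external
  namespace: rook-ceph-external
spec:
  external:
    enable: true             # consume an existing, externally managed Ceph cluster
  crashCollector:
    disable: true            # no local Ceph daemons are deployed in this mode
  # connection details (mon endpoints, keyrings) come from secrets/configmaps
  # created by the external-cluster import step, not from this resource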
Starting point is 00:10:26 But by having that Ceph cluster inside of Kubernetes pods, let's say, what are the advantages of doing that versus having the Ceph cluster sitting outside? You're saying both can be Rook managed or Rook connected, I'll call it. Yeah, so when it's inside Kubernetes,
Starting point is 00:10:46 then you get, let's say, dynamic response over failures, for example. So if one of the components of Ceph fails, then it can be immediately rescheduled onto another host. And you can also do things like replica sets, for example, when you can decide, I want this particular interface of Ceph running multiple times. Because, Ceph, we haven't really dived into what Ceph is and what it does and how the storage interface is structured, but essentially Ceph provides three methods to interact with storage, so three storage interfaces: object, block, and file system.
Starting point is 00:11:26 And essentially the object piece is really similar to OpenStack Swift or Amazon S3; it is really compliant with that API. And these are actually taking the form of gateways, as we call them. And you could decide, okay, run three instances of all of those gateways and aggregate all of them through the service endpoint in Kubernetes. And if the scale goes up, then you can dynamically add more and also scale down to where you were before. So it's more responsiveness over scaling up and scaling down, as well as also responding to failures.
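A rough sketch of what running three of those gateways looks like in Rook terms: a CephObjectStore resource whose gateway section asks for three RGW instances, which the operator fronts with a Kubernetes Service. Names and pool settings are made up for illustration; fields follow the ceph.rook.io/v1 CephObjectStore CRD.

apiVersion: ceph.rook.io/v1
kind: CephObjectStore
metadata:
  name: my-store
  namespace: rook-ceph
spec:
  metadataPool:
    replicated:
      size: 3          # bucket index/metadata replicated three times
  dataPool:
    replicated:
      size: 3          # object data replicated three times
  gateway:
    port: 80
    instances: 3       # three S3-compatible RGW pods behind one Service endpoint

Bumping instances up or down and re-applying is how you scale the gateways; the operator adds or removes the pods.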
Starting point is 00:12:05 If one of the critical components of Ceph, for instance, the one maintaining the quorum, the monitors, is failing, then we can fail over and reestablish the quorum. That's one of the benefits we get from running in Kubernetes. So, I mean, the advantages of Kubernetes, obviously, is the auto-scaling kinds of things and auto high availability, restarting containers that fail and stuff like that. But now you have that sort of capability for the storage as well as the containerized application. What about the physical disks and stuff that's sitting behind some Ceph node someplace or Ceph pod or Ceph container? I'm not exactly certain what the right term is. Yeah, well, I think you're right. I mean, it depends on which environment you're deploying in. I mean, there is no magic.
Starting point is 00:12:52 If you run on bare metal, then the disk is going to stick on the machine. The disk is going to stick on the host. If you run on more dynamic environments, such as cloud environments, like Amazon or Azure, for instance, then the storage becomes portable because essentially there are VMs where you have attached block devices onto.
Starting point is 00:13:10 So if that fails, then you can move it onto another virtual machine. And yeah, if one machine fails, one VM fails, then the storage, let's say an EBS volume, will be rescheduled onto another VM. And then the Ceph cluster becomes healthy again. So let's, I really love where we're going with this conversation. And I think, again, as we get into the nuances, where we see the difference between Ceph external and Ceph internal as provided by the Rook operators. So if I wanted to build a highly scalable or
Starting point is 00:13:50 highly redundant, let's focus on redundancy, a highly redundant solution, I can build it with either model. I think the question becomes that of an operations management plane. If I do it with Ceph external, then I have to, as either an operator or a developer, I have to glue together the things, the automation for when failure happens. The reconnection, the visibility into the app to see that either the pod or the storage is down. Kubernetes tells me whether or not the pod is down. But in order to provide that replication, redundancy, et cetera, I have to build that myself when I'm consuming. I have to build that automation
Starting point is 00:14:40 myself when I'm consuming it via Ceph external. There's some tools out there that'll help me do it. But what Rook says is basically, when I define my workload and my storage, I can define it in a single set of YAML or whatever, and Kubernetes manages the whole thing for me with less manual thought on my end to make that work. Is that correct? It is, right. And that's where the Rook project was born, really. Ceph management generally requires that, I mean, you hire a specialist who knows how to deploy and upgrade and maintain and handle the failures. You need someone who really would understand Ceph deeply.
Starting point is 00:15:27 So what Rook does is it helps abstract that or remove some of that complexity. And so then you can just define, well, how do you want Ceph to run? How many Ceph mons do you want? Tell us where you want the OSDs to run and we'll just make it happen. And then, oh, if you want to upgrade to a new version of Ceph,
Starting point is 00:15:47 well, just tell us what version you want and we'll go sequentially and safely upgrade all of the pods in succession. So you don't have to worry about how Ceph upgrades work. And again, managing Ceph is Rook's job, the Rook operator's job, so that you don't have to worry so much about the Ceph internals. Yeah. Do you see a lot of, I don't know, implementations of Rook Ceph sitting in public cloud environments? Or is it more bare metal, or obviously a combination of both? You mentioned that public cloud has some interesting characteristics with respect to disk placement, I guess, or disk
Starting point is 00:16:26 floating connectivity. Yeah. Yeah. It's really bringing even more high availability to the storage when you deploy that, because the cloud provider is going to guarantee you X nines, I guess, for that storage. But if you put Ceph on top of that environment, then you're just extending that, because you will be replicating data also on top of those block devices, which hopefully are already replicated underneath as well, but you're just bringing more redundancy to the platform and more availability in general.
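For that cloud scenario, Rook can consume the attached block devices as PersistentVolumes rather than raw disks, via a storageClassDeviceSets entry in the CephCluster spec. A minimal sketch, assuming an EBS-backed storage class named gp2; the set name, count, and sizes are placeholders.

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v16.2.6
  mon:
    count: 3
  storage:
    storageClassDeviceSets:
      - name: set1
        count: 3                      # three OSDs, each backed by its own cloud volume
        portable: true                # the volume can follow its OSD to another node/VM
        volumeClaimTemplates:
          - metadata:
              name: data
            spec:
              storageClassName: gp2   # e.g. an EBS-backed class (assumed name)
              volumeMode: Block
              accessModes:
                - ReadWriteOnce
              resources:
                requests:
                  storage: 100Gi

If a VM dies, the claim (and the cloud volume behind it) can be re-attached to a replacement node, which is the rescheduling behavior Sebastian describes.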
Starting point is 00:17:06 Right. And again, when Rook started, I thought, why would anybody ever want to run Rook in the cloud? You've got the cloud storage solutions. So when it was initially created, it's like, let's target bare metal, your own data center type of scenarios. But what we really found is there are some common scenarios to run in the cloud, which is, for example, limitations of the cloud providers. Like you can only have a certain number of PVs per node.
Starting point is 00:17:32 So you quickly run out. I think it's like 32, and I forget how many exactly in some environments. But with Ceph you can have a practically limitless number of PVs, like thousands, or I've seen thousands in testing at least. So the other thing that this kind of crosses the boundary of, you know, obviously containers and Kubernetes were kind of designed around stateless computation, I'll call it. But, you know, we're bringing Ceph and Rook and persistent volumes. All of a sudden now we have stateful containers. Are you seeing a big adoption of stateful applications and stateful containers? In Kubernetes deployments?
Starting point is 00:18:19 Yeah, yeah, exactly. I mean, I guess you wouldn't need Rook without them, I suppose. Right, exactly. I mean, that's why people use Rook, because at the end of the day, how can you build an application that doesn't need some state or storage, right? You know, even the simplest website is backed by some storage. Typically, those things are sort of sitting outside the Kubernetes operational environment, right? I mean, it's like a database server or something like that sitting outside. So, Ray, I think you're hitting a key point of what Rook is solving with this concept. We need persistence, not only in database storage, but we need some type of file system persistence or data persistence for unstructured data across multiple pods.
Starting point is 00:19:10 Like the pods, the processes can be anywhere. We don't really care about that. But data has gravity, and we have to sometimes move the data with the gravity, I mean, with the workload. And once we run into that architectural problem, well, the question is, how do we solve it? Do operators build the capabilities within Kubernetes to be data aware? Or do you build it outside of Kubernetes and then kind of have a dotted-line relationship between the two? I think Rook solves that, or answers that question, in an opinionated way to say that, well, you make Kubernetes aware. We still don't want to treat the process, or the control plane, as a pet. And I think this is a control plane question. We don't want to treat the data control plane as a pet. It's still cattle. If the data control plane dies, we still want to move the entire application, including the data, to another set of pods or resources. And I think that's what Rook enables. I think there's isolation here between Ceph and
Starting point is 00:20:22 Rook that's important to understand. I mean, I see Rook as being the one that deploys the Ceph cluster. I see Rook being one that's sort of monitoring the Ceph cluster and Ceph operations and stuff. But, you know, the containers are using persistent volumes through the CSI to talk directly to Ceph that's sitting on the cluster. Isn't that how this works? Yeah, that's absolutely right. And that was one of our core architecture principles is the data path only goes from your application to Ceph. Rook is not in that data path.
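To make the data-path point concrete, a sketch of how an application actually consumes a Ceph-backed volume: a claim against a Ceph storage class and a pod that mounts it. The class name rook-ceph-block is the one Rook's example manifests typically create, so treat it as an assumption.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: rook-ceph-block   # an RBD-backed StorageClass (assumed name)
---
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
    - name: app
      image: nginx                    # stand-in application container
      volumeMounts:
        - name: data
          mountPath: /var/lib/app
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: app-data

Reads and writes on /var/lib/app go through the Ceph CSI driver straight to the Ceph daemons; the Rook operator never sits in that I/O path.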
Starting point is 00:21:01 Rook's only at the management layer. So we're not slowing anything down. It's just you get the stability. You know, Ceph has been around for a long time. You want storage that's stable and durable and all that. Right, right. And you mentioned lifecycle management. So I have no idea how you upgrade a Ceph cluster that's external to this world,
Starting point is 00:21:24 but you're providing some capabilities within Rook to, I don't know, go from version X to version Y in Ceph? Yeah, that's why you don't really have to know how to upgrade a Ceph cluster to upgrade one when you use Rook, because the only thing you have to do essentially is just tell us which version you want, and we'll just go ahead and kick off the upgrade. So the entire logic is, again, baked in the operator.
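A sketch of what "just tell us which version you want" amounts to in practice: you bump the Ceph image in the CephCluster resource and re-apply it. The tags are illustrative.

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    # change this tag, e.g. from v16.2.5 to v16.2.6, and re-apply; the operator
    # then restarts mons, mgrs, OSDs, and gateways one by one, checking cluster
    # health between steps, until everything runs the requested version
    image: quay.io/ceph/ceph:v16.2.6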
Starting point is 00:21:57 And the Rook operator is running in the Kubernetes cluster someplace, as containers, no doubt. That's interesting. So I guess part of the clarification I need as well is, you know, I watched the video. I think Ceph is obviously the most mature presentation protocol for Rook. Are there other supported underlays, like NFS or block storage directly, outside of what's provided via Ceph? So if I wanted to use Rook to present NFS instead of Ceph. You can do it through Ceph, actually. So it's always through Ceph.
Starting point is 00:22:32 Ceph is the primary presentation or management plane technology that we're using. Data? Yes, data. Yeah, technology, data gateway, I'm not sure exactly what the terms are. But so you're saying, if you wish to have a standard NFS box or a NAS box being serviced in this Kubernetes cluster, you could do this through Ceph file or something like that.
Starting point is 00:22:58 Well, Ceph has an interface to re-export with NFS. So, yes. I see. As far as do you have the same sort of export for, does Ceph also support object export like that, or does it have to be maintained within the Ceph storage functionality? Yeah, could you have some object storage endpoint beyond, let's say you wanted to use Amazon S3,
Starting point is 00:23:21 and you're running this Kubernetes cluster in Amazon EKS or something like that. Could you use actual S3 object storage or would you have to use EBS kinds of volumes? Yeah, Rook always consumes EBS block volumes, but then Ceph can turn around and expose the S3 endpoint from that. So if you want an AWS S3 endpoint, then you would just go straight to AWS S3.
Starting point is 00:23:54 So I guess the challenge is, and I guess, and it's fine if it doesn't solve this problem. Let's say that I have the problem that S3 is external object to me, external storage to my Kubernetes cluster. So I have all of the same challenges of, you know, accessing any CSI provider when it's outside of the control plane of Kubernetes. And I want to solve the problem of making my application or my cluster data aware. But I also want to use the power of S3 replication. I don't want to replicate via, I don't
Starting point is 00:24:36 want to have a layer of abstraction on top of S3. I want to replicate anywhere. But I want to orchestrate the movement of the data, or the attachment of the data, at the cluster level. Rook does not do that. If you want that capability, you have to use Ceph. Rook is only orchestrating the Ceph storage. Ceph does really a lot of things. For instance, you were mentioning S3. Ceph has an interface to consume objects, just like you were consuming them through S3. So it has gateways, which are S3 compliant.
Starting point is 00:25:15 So essentially, we're always playing catch-up with whatever comes next in the S3 spec. But Ceph has gateways that you can access through the S3 protocol, and you can also set this up in a multi-site fashion, so you can have geographically distributed clusters that all interact with each other and just replicate data across regions, for instance. This is possible with Rook to a certain extent, but it requires overall higher-level orchestration.
Starting point is 00:25:53 For this, ideally, we would need to have something like Kubernetes Federation, which we don't have yet. So that you would really orchestrate workloads between regions. But right now, we don't really have this. But out of the box, Ceph, Rook, and with a little bit of extra configuration, you could set up multi-site gateways pretty easily. Interesting.
Starting point is 00:26:16 For the data alone, and you could potentially do some disaster recovery things at the other regions if you wish to fire that sort of thing up there and that sort of stuff. So we mentioned lifecycle, we mentioned configuration, and we mentioned monitoring. Are there other capabilities of Rook with respect to the Ceph cluster that we should know about? Access methods, maybe. We haven't really discussed that through CSI. So what would that be, Sebastian? So as I said, Ceph is really capable of doing many things.
Starting point is 00:26:54 And it is really great when it comes to exposing storage through different interfaces. So it can be block, file system, and object. Within the CSI spec, it's always block-oriented. So it's not object. Within the CSI spec, it's always block-oriented. So it's not object. So what you're consuming is always a raw block device or a block device with a file system on top of it, which you can decide which one you want. And then you have different ways as an application to consume that storage. So you can say, and that's what we call access methods, essentially, where you can specify, I want
Starting point is 00:27:26 that block device to be mapped slash attached, if you want, to multiple applications at the same time. And they will all be writing and reading at the same time, too. So it's a... I mean, you'd use this for something like a file system, for instance, that would be shared across a number of containers, utilized by a number of containers. That's right. If you have an application that scales, like let's say your app has replica 3, it runs three times, but it has to access the same data
Starting point is 00:27:55 store always. And with Ceph, you could say, okay, use the same PV, but attach it three times on different containers, and they will all share the same store and they will be able to read and write from that storage too. This is the most advanced, I think, that people might be looking at. You can do this with block as well, if you want. You can do this with the file system, of course.
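A sketch of that shared-volume case in Kubernetes terms: a ReadWriteMany claim that every replica of the application mounts at once. The rook-cephfs class name follows Rook's usual CephFS example and is an assumption here.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
spec:
  accessModes:
    - ReadWriteMany              # many pods can mount and write concurrently
  resources:
    requests:
      storage: 50Gi
  storageClassName: rook-cephfs  # a CephFS-backed class (assumed name)

Each of the three replicas references this same claim in its pod spec, and CephFS coordinates the concurrent reads and writes underneath.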
Starting point is 00:28:21 So if I was to do like a database server under Kubernetes and I wanted to use Rook Ceph and stuff like that. I mean, this guy could potentially have multiple pods running to scale the database as it requires, and then behind it, I guess it would be block storage rather than file storage. But let's say it had file storage behind it. Then in this situation, multiple pods could be accessing that persistent volume that are supporting this database server in this Kubernetes cluster. So yes and yes, but it's a bit over-optimistic, I think, because doing database operations over distributed file systems is,
Starting point is 00:29:00 I mean, it is always super heavy in terms of metadata. So I guess, yeah, it is possible to... Nobody would seriously consider this suicidal method. I guess it depends. I mean, it depends on the workload. It depends on how heavy the workload is and how your writes are. But I mean, practically speaking, technically speaking, it works. But then would you want to do it is a different question.
Starting point is 00:29:25 You might be better looking at distributed databases instead. But yeah, I mean, it's a valid example. Just trying to put in a warning here. There's a clear performance cost to using the shared file system. So probably databases, yeah, you'd want the distributed database instead of the distributed files. So they would probably use a block storage option under this configuration, but that would work as well in this case. The block storage volume could be shared across a number of pods running the database server
Starting point is 00:29:59 itself. There is a downside to this. I mean, in that case, you wouldn't map the same block device multiple times. It will be a one-to-one relationship, and then you might have shards distributed. But then the database will do the coordination. Because if you map the same block device onto multiple machines, then the only thing you should do is just mount that file system as read-only. Have a single writer and multiple readers.
Starting point is 00:30:27 If you have multiple writers, then you're just corrupting everything. Yeah, unless there's some sort of synchronization across the writers and stuff. Yeah, well, you have to use a clustered file system, like the ancient clustered file systems, I guess. Well, this is what VMFS does, but it's proprietary to VMware. Yeah, they have GFS and the
Starting point is 00:30:52 old OCFS2 stuff. Yeah, Veritas and all that stuff. There are plenty of clustered file systems that have existed over time, and VMFS is the current one that VMware is using for much of their enterprise apps and stuff like that. Oh, okay.
Starting point is 00:31:09 Yeah, yeah, yeah. So how does this fit into like a CI/CD, DevOps kind of situation? Can you roll out Rook changes? You know, go from Rook version 10 to Rook version 11? I don't know what the versions are and stuff like that automatically. Or is that something outside of this? Well, you could do GitOps, yeah, for sure.
Starting point is 00:31:32 Yeah, we've definitely seen people doing that with CI/CD. And as far as upgrading Rook itself, the upgrade is usually just, well, update our CRDs, those custom resource definitions, update the role-based access control, the RBAC, which is just, sometimes the privileges change as far as what Rook needs. And then you update the operator, and then the operator automates everything after that. Yeah, so I would expect this just enables, the technology enables the approach. If your approach is to do rolling updates,
Starting point is 00:32:12 the technology is there from, let's say, that step one is to update the Ceph cluster and the components of that. You can schedule Rook to do that, or let's say step one is to update Rook. And you have, you know, the technology is there. You guys have kind of completely embraced, or you're agnostic to, the approach to managing Kubernetes in general. This just integrates with whatever operational approach you've adopted.
Starting point is 00:32:43 That's right. And it really is, I mean, to the Kubernetes admin, Rook should look like any other application that they already need automation for in the cluster. It's just another app. So what's the next, like, what are the big areas of interest the community would like to take Rook to? Like, what's in the hopper? Backup, disaster recovery, synchronous replication.
Starting point is 00:33:08 You know, these are enterprise kinds of situations, right? I mean, does the system support a backup solution or how would that play out in this environment? Is it a Ceph issue or is it a Rook issue? I don't know. Yeah, well, DR is definitely one area, well, probably the biggest area of focus we have right now because it is a complex architectural problem.
Starting point is 00:33:32 How do you really support disaster recovery? Where's the automation and where's the boundary between that manual trigger that says, yes, we do need the admin to decide when to failover? Yeah, and I think there are two things, right, Travis? Something you really worked on extensively is stretched clusters. Because before going DR, we have to determine whether a cluster can be stretched across two locations, because that is actually the ideal.
Starting point is 00:34:01 When you do a stretched cluster, then you get some sort of a DR for free, right? You don't really have to do much. If you're doing synchronous replication across a stretch cluster and things of that nature, or they all have access to the same servers, it depends on what you're doing, right? Yeah, for that stretch scenario, really, we're talking about still a single Kubernetes cluster, so the latency still can't be too high, but some people have, if you have two data centers that are
Starting point is 00:34:29 geographically close enough, and then you have, basically, we need a third tiebreaker node somewhere between the two data centers, then you can have that replication. You call that the witness node or something like that. Arbiter, we call it.
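For that stretch scenario, the tiebreaker is configured in the CephCluster mon settings. A minimal sketch, assuming the nodes carry the standard topology.kubernetes.io/zone label; the zone names are placeholders, and the field names follow Rook's stretch-cluster support as I recall it, so verify them against the docs for your release.

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  mon:
    count: 5                          # two mons per data center plus the arbiter
    stretchCluster:
      failureDomainLabel: topology.kubernetes.io/zone
      zones:
        - name: arbiter-zone
          arbiter: true               # the tiebreaker/witness location
        - name: datacenter-a
        - name: datacenter-b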
Starting point is 00:34:47 So let's talk about observability and visibility from an application pod orchestration perspective. Are there any, either roadmap or existing, features that allow operators to select Ceph clusters based on attributes? I guess the first question is, are there going to be multiple Ceph clusters in this Kubernetes environment, or is it just one? You can have as many as you want, but typically you have, I mean, typically you have one, but yeah, it just depends on how you want to.
Starting point is 00:35:21 So I guess that, I guess that, I guess Ray, you're asking a little bit better question than I am, which is we're storage guys. We like speeds and feeds. Not all underlying storage is the same. Sometimes I need cheap and deep storage. Sometimes I need super fast storage. And is the delineation a separate Ceph cluster? Is it the same cluster with different storage pools? Like how do I give my developers the choice they need?
Starting point is 00:35:53 So I got some database that needs, you know, real fast block storage. And I've got some machine learning solution that needs real sequential storage, and I've got some, I don't know, data analytics. Well, it is, I think it's one of the goodnesses of Ceph, honestly, is that with Ceph you don't need to have multiple clusters where each will be dedicated to a certain purpose, like, oh, this one is fast storage. This one is archive. This one is only file. This one is only block.
Starting point is 00:36:29 No, no, you can manage all of that through a single Ceph cluster. And you can really isolate pools through, like, you can build a logical reference of your architectural platform. Like how many servers you have, what type of disks are in those servers. And you can divide that up
Starting point is 00:36:47 and expose that particular storage. So you can say, okay, this set of machines will be fast block storage oriented, and then it will be exposed through a pool that will know, okay, I have this pool of machines available, and then I'm exposing and I'm serving storage in that manner. Then through CSI, I can expose multiple pools with multiple providers through storage classes,
Starting point is 00:37:13 and then the storage classes will say, okay, this is fast storage, go with it. This is archiving intended, and then go with it. And this is how the developers will determine the type of storage they should consume. So in the container YAML file, for instance, they would say, I want fast storage or I want this storage pool. And Ceph would automatically assign it to that. If you give it the right hardware, yes.
Starting point is 00:37:40 Yeah, so from the application's perspective, the application requests storage from the storage class, and you'd have to define a storage class that corresponds to the Ceph pool. And then if you've configured Ceph properly, then it... And Rook does that all for me, right? Right. No kidding. Rook won't read your mind for how to set it up, but... So once I have the Ceph-defined storage pools, then Rook will provide the storage class and the linkage between the two and all that stuff, is that what you're saying? In EBS, then, I can assign one level that's ultra fast, one level that's fast. And then if I'm doing, if Amazon even has spinning disk anymore, if I'm doing spinning disk,
Starting point is 00:38:36 I can put that as slow. I can define that as the Rook administrator. I can, you guys are making that easier for me to define that within Ceph. So I don't need a Ceph expert to do that definition for me. Right. Especially if you run in the cloud with Amazon and you would know that this particular type of provider is bringing you NVMes. So you tell Ceph, OK, use that provider and give me OSDs. And then you get a pool, and then you create your own storage class that points to that pool, and then you pass that storage class down to your developers, and they can start consuming it. So yeah, we're really trying to make that easy for you to consume.
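A sketch of that pool-to-storage-class chain in Rook terms, assuming the NVMe-backed OSDs show up under the nvme device class and the operator runs in the rook-ceph namespace. The pool and class names are made up, and a real StorageClass from the Rook examples also carries CSI secret and image parameters that are omitted here.

apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: fast-pool
  namespace: rook-ceph
spec:
  failureDomain: host
  deviceClass: nvme          # only place this pool's data on NVMe-backed OSDs
  replicated:
    size: 3
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-block
provisioner: rook-ceph.rbd.csi.ceph.com   # <operator namespace>.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: fast-pool
  # the Rook example manifests add the CSI provisioner/node secret parameters here
reclaimPolicy: Delete
allowVolumeExpansion: true

Developers then just put storageClassName: fast-block on their claims, which is the "I want fast storage" choice described above.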
Starting point is 00:39:18 I don't recall. Does Ceph do mirroring for data protection, or does it support something like RAID 5 or RAID 6 or something like that? Or, you know, what's the minimum number of Ceph nodes if such a thing exists? Yeah. Internally, Ceph will replicate the data for you that we recommend by default, three replicas. So it just mirrors the three and that's how it handles data protection if one of them goes down.
Starting point is 00:39:44 Yeah. So in a Ceph cluster, you'd really want at least three nodes because of that number. You want at least node redundancy. And there's other means of mirroring. The mirroring term, at least in the Ceph world, is more about mirroring across clusters. So you can take the RBD image and mirror it to the other cluster. It's just a question of how many copies of the data are we talking about? Is there one copy and there's parity, or is there multiple copies required? And in this case, you're saying multiple copies are required. Yeah, and then it's on what level of abstraction are you talking about? Are there multiple copies?
Starting point is 00:40:25 Because I could consume at the provisioning level, I could consume a service, a provider that has by its nature redundant copies, but they may not expose them. So there's a lot of complexity in these abstractions. Yeah, but I would say that if you get underlying replication, even if you don't know about it and you are somehow paying for it already, and you're also never really guaranteed that you will ever get your disk back
Starting point is 00:41:03 if something fails. And that's why you really have to have Ceph on top of it to do the replication for you of your data. And Ceph also has different types of replication. You can tell it how many replicas, or you can tell it to use erasure coding, which then breaks it up. So erasure coding would be what I would consider
Starting point is 00:41:24 a RAID-like functionality, which has data and parity, and how many parity groups might determine how many storage nodes go down or how many disks can go down and still recover the data. So that's good. It's good for availability, and it's also, it becomes a little bit cheaper if you want to reduce your cost per terabyte as well when you're deploying. It's a bit more expensive when it comes to computation, of course.
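What that erasure-coding option looks like on a Rook pool, as a sketch: dataChunks and codingChunks play the role of the data-plus-parity split just described. The pool name is hypothetical, and the 2+1 layout shown tolerates the loss of one host.

apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: archive-pool
  namespace: rook-ceph
spec:
  failureDomain: host
  erasureCoded:
    dataChunks: 2        # each object is split into 2 data chunks...
    codingChunks: 1      # ...plus 1 parity chunk, so one host can be lost

Compared with 3x replication, the overhead here is 1.5x the raw data, which is the cheaper-per-terabyte trade-off, paid for with extra CPU to compute parity on writes and during recovery.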
Starting point is 00:41:52 Yes, yes, or access itself. Yeah, yeah. So what would be a typical sized Rook Ceph environment? I mean, is it something you'd consider deploying petabytes to, or is it something more of a terabytes nature? I'm just trying to get a handle on what the typical environment might look like. The scale is definitely up to petabytes. That's what Ceph was designed for, at least petabytes. We do have, I'm trying to think,
Starting point is 00:42:26 I know of at least one or two clusters that are at a petabyte or two in Rook. But there are also so many people in the upstream community that we just never hear from. I wish I knew what the real size of clusters was. Yeah. But the most famous one are, because they have been working with Ceph for a very long time and they have really, they're part of the Ceph foundation.
Starting point is 00:42:48 They have been really advocating their usage of Ceph, such as CERN, which is the data scientists from Geneva. And I think there's also, like, an Australian one, Monash University, and they have a few petabytes there. But CERN has, I think they have, so they have multiple clusters, and no, everything I'm saying is public by the way, so I think they have like between 20 to 60 petabyte clusters on Ceph, and I'm sure there are much bigger clusters out there for the object use case when they want to have... I mean, something like CERN and Monash is actually using Rook to manage those clusters
Starting point is 00:43:31 under Kubernetes? Unfortunately, I think the answer is no. They have multiple clusters too, but I think they're experimenting. Yeah, and the cluster I was mentioning, and they've commented publicly too, is the University of California. I think they've got a hundred or low hundreds number of nodes type of thing, with a petabyte or two, supporting their universities across California. Well, this has been great. So Keith, any last questions for Sebastian or Travis? No, I think we've picked their brain as much as my brain will handle. I've learned way more about
Starting point is 00:44:12 Kubernetes than I ever wanted to learn about. So Sebastian or Travis, anything you'd like to say to our listening audience before we close? Well, I think it was great to have you guys. We could spend hours discussing this. I think you could tell that Travis and I are really passionate about what we do. But it has been great
Starting point is 00:44:33 chatting with you guys, for sure. Yeah, absolutely. And we'd love to hear from anybody interested in Rook. You can join the Rook Slack. Go to rook.io. That's our main website
Starting point is 00:44:43 and links to everything interesting there. There's KubeCon. For the last few years, we've always had talks at KubeCon you can go back and listen to, or the KubeCon North America that's coming up in mid-October. We'll have a couple of talks there.
Starting point is 00:44:58 So we'll look forward to hearing more from people. All right. Well, this has been great. Sebastian and Travis, thank you for being on our show today. Thank you. Yep. Thank you. And that's it for now.
Starting point is 00:45:09 Bye, Keith. Bye, Ray. And bye, Sebastian and Travis. Okay. Bye. Until next time. Next time, we will talk to
Starting point is 00:45:19 another system storage technology person. Any questions you want us to ask, please let us know. And if you enjoy our podcast, tell your friends about it. Please review us on Apple Podcasts, Google Play, and Spotify as this will help get the word out. Thank you.
