Storage Developer Conference - #98: Rethinking Ceph Architecture for Disaggregation Using NVMe-over-Fabrics
Episode Date: June 10, 2019...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast Episode 98.
So my name is Yi. I'm a research scientist from Intel Labs.
I'm here today with my colleague Arun, as well as my co-speaker.
As you can see, we have another co-speaker listed,
but unfortunately he couldn't make it due to a family emergency.
The topic today is rethinking Ceph architecture for disaggregation using NVMe over Fabrics.
I don't know how many people just attended the NVMe over Fabrics talk from Sujoy and Mohan.
If you missed it, it's worth chatting with them to get more detail on that. We'll touch on it a little bit more, but we'll focus more on Ceph as a very popular, successful SDS
solution in the era of disaggregation, and storage disaggregation particularly.
So the agenda for today: we will start with some background, a little bit of a
refresher on Ceph.
For people who are already familiar with that, it will be a very easy background introduction,
as well as disaggregation in general.
And we're going to try to illustrate what the problem is,
particularly for Ceph.
How do we combine the two?
You have a very specific problem with Ceph.
Now you want to introduce disaggregation.
How can we solve that problem?
And using replication, one of the most common practices,
we explain the replication flow as well as what we refer to as the data center tax
when you try to combine the two to provide good value for your storage services.
And based on that, we're going to show, at least based on our observation, the current
approaches, as well as our proposed approach. In particular, Arun is going to take you
through the details of the architecture design, at least the thinking we have been through and what the architecture details are in the context of Ceph,
and how we arrived at our
analytical results and some small-scale preliminary
evaluation results. And we'll conclude the talk.
So this is a very
popular picture of Ceph architecture.
You can Google this everywhere.
But it kind of summarizes the benefits from Ceph being very successful
as one of the default deployments in the OpenStack architecture as a block service.
It also provides, well, it's actually an object store,
and it has worked together with Swift in the past
as well as now.
And you can see an application,
depending on the storage service you want to provide,
could use a block-level service,
like your VM disk image,
through the thing called RBD,
which stands for RADOS Block Device.
Or, if you want to keep using your existing file system interface, you have the Ceph file system there, allowing you to access it with plain file reads and writes.
Or, like an image website,
if you want to just upload your picture as an object,
you have the object interface through the RADOS Gateway.
Underlying it all is this RADOS layer,
the Reliable
Autonomic
Distributed Object Store,
which achieves all these storage service functionalities.
It is software-defined, hardware-agnostic SDS.
It basically achieves the purpose of disaggregating your software from your hardware control plane.
Then it provides the thing that everybody in the storage industry cares about,
which is reliability: how I
can store my data reliably.
So it provides replication,
or erasure coding if you
care very much about the capacity savings.
And it can
rebalance in case one of
your OSDs
fails. The details
of how Ceph works are in the bottom right,
in that block diagram,
where the core part is the CRUSH algorithm
that figures out the best way
to place your data, and where.
And that takes into consideration
your network topology,
particularly your failure domains,
versus your clients' requirements on reliability.
Say you want replication:
you could have one pool with two-way replication,
another pool with three-way replication,
and another pool with whatever
erasure coding profile you favor.
And it will automatically figure out
where the data is going to end up,
eventually landing on your storage devices.
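To make that placement idea a little more concrete, here is a minimal sketch in Python. This is not the real CRUSH implementation; the pool names, PG counts, and OSD counts are made up for illustration. It only shows the general shape of the computation: an object name is hashed to a placement group, and the placement group is mapped to as many distinct OSDs as the pool's replication factor asks for.

    # Conceptual sketch of CRUSH-style placement (NOT the real CRUSH algorithm).
    # Hash the object name to a placement group, then pick `size` distinct OSDs.
    import hashlib

    POOLS = {
        "rbd-2x": {"pg_num": 128, "size": 2},   # hypothetical two-way replicated pool
        "rbd-3x": {"pg_num": 128, "size": 3},   # hypothetical three-way replicated pool
    }
    NUM_OSDS = 12  # example cluster size

    def place(pool, obj_name):
        cfg = POOLS[pool]
        h = int(hashlib.md5(obj_name.encode()).hexdigest(), 16)
        pg = h % cfg["pg_num"]                       # object -> placement group
        primary = pg % NUM_OSDS                      # pg -> primary OSD (toy mapping)
        osds = [(primary + i) % NUM_OSDS for i in range(cfg["size"])]
        return pg, osds                              # (pg, [primary, secondary, tertiary...])

    print(place("rbd-3x", "family-photo.jpg"))       # e.g. (pg number, [4, 5, 6])

The point to take away is that placement is computed, not looked up in a central table, which is why the topology the algorithm sees matters so much later in this talk.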
This is a little bit of background on disaggregation.
Disaggregation itself is not exactly new,
but there are some new industry trends that make it more appealing.
Software-defined storage, as I mentioned earlier,
is looking at a scale-out approach for storage guarantees,
and particularly looking at how I can disaggregate software from hardware.
And you probably already know there are numerous SDS offerings,
particularly in cloud storage.
It makes it easier to manage my commodity hardware with just
an SDS stack, so that I can focus on providing the service that is more related to
what clients want.
And another angle of disaggregation is basically separating servers into resource components.
So disaggregation is all about, as I quoted here from the Euthynics paper,
extreme resource modularity,
because that provides resource flexibility as well as utilization.
That is a direct TCO benefit,
and you can see the benefit while deploying your infrastructure.
And the deployment flexibility also means that,
if you want to do a hybrid or a hyper-converged deployment,
that is also possible,
because it eventually allows you to scale your storage,
or in this case the disaggregated resources, which can be storage, networking, as well as compute,
independently from your application growth.
So various workloads can all fit into the services provided by disaggregation.
And I mention that particularly because of the industry trend;
as the earlier talk from Sujoy and Mohan mentioned, with NVMe over Fabrics the one particularly appealing factor
here is the faster interconnect: a very well
designed standard like the NVMe protocol allows you to do
very good I/O over to the remote side.
And that latency makes it a very
appealing factor to say,
okay, I really want to have disaggregation with the faster interconnect,
so that I can actually manage my resources much more efficiently.
Well, as I work for Intel, you'll notice that I put in a picture from Intel,
particularly Intel's perspective with Rack Scale Design in the context of disaggregation.
And you can see that eventually you want to move from today's
physical aggregation to fabric integration and fully
modular resources. Modularity is the key
part that allows you to achieve all the benefits
you're looking for in your data center infrastructure.
So now, Ceph on NVMe-oF, how can we do it?
Number one is: what is the rationale for putting Ceph on NVMe over Fabrics?
You have one very popular SDS solution in Ceph, backed by a very popular and big open source community, as well as Red Hat.
You have NVMe over Fabrics being pushed in the industry.
A lot of vendors are coming into play, and we can see that the two trends are going to somehow merge together.
Then we have to figure out how to make them work together very efficiently,
taking advantage of both sides.
As I mentioned earlier, disaggregation allows you to
manage your
storage resources independently
without worrying about the
other side. Now,
if you still want to say, I want to
use disaggregation,
how do you support the multiple SDS offerings on disaggregation? All of them are going to say, okay, I own my storage; even though
I'm an SDS, I still own my own storage. They were not designed with the mindset that their data is going
to eventually land on a storage device that is located remotely. And I'm going to talk about the actual problem you're going to
see when you actually start doing that. So if you want
to scale compute and storage independently, you have to think
about how to solve this problem. And also,
this basically opens the door to a lot of optimization
opportunities, looking at various SDS architectures
and where we can do better to take advantage of both sides.
Well, from our observations, there are a couple of approaches people have already been practicing.
The first one is what we refer to as a host-based NVMe over Fabrics storage backend.
Well, this is kind of one extreme: okay, I want to reuse my NVMe over Fabrics target just like a SAN.
I have mirroring capability there; I will let it maintain my replication.
But essentially you're saying, okay, Ceph, don't worry about any of the reliability things; I'm going to
manage that. You have to set the
Ceph configuration parameter, say, to replication factor one,
because the other side is going to do the mirroring.
It's an approach, of course, and it will actually
work. But the point is, why do you need Ceph to begin with?
Now, another option is: I'm a Ceph administrator.
I'm very familiar with Ceph.
I want to just use it the way it is.
Okay, I just don't want to worry about anything related to NVMe over Fabrics.
I'm just going to treat it as a dumb disk.
You can still do that. Plug in the NVMe initiator;
it's going to bring up a volume in your system. Treat it as a local disk
with whatever I/O characteristics you see there.
Use that. No problem. You can still have that working.
We have tried that, and we actually have a comparison to it later.
However, as you can see, essentially the problem is
that Ceph, or Swift, or any of the others,
want to make sure your service is maintained very well
based on the service level requirements from the clients,
which basically means: how do I know where my failure domain is?
By doing that, essentially you're hiding that information from Ceph.
The Ceph CRUSH map doesn't really know about your remote targets.
There's no such layer in between that the algorithm cares about.
The algorithm itself is a consistent hashing algorithm.
You have to have that topology built in the right way
so that it actually covers your requirements
from that perspective.
So it will work.
Does it work to the point that we expect it to work?
Well, we'll see.
Decoupled Ceph control and data flow is actually what we're after,
and it's our proposed approach.
We're going to talk about that in detail later.
So as an example,
I'm going to focus more on
the Ceph replication flow, but,
as I mentioned earlier,
SDS reliability guarantees
come through data copies:
replication, or redundancy from
erasure coding. For durability,
there's
this long-running task called
scrubbing in Ceph,
or what I think Swift refers to as auditing.
It basically makes sure your data integrity check
is always successful.
If not, there's a certain mechanism
to make sure you bring the right data back
to the storage device.
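As a rough illustration of what a scrub-style check does, here is a small Python sketch. It is not Ceph's actual scrubbing code; it only shows the basic idea of comparing checksums across replicas and flagging the copy that disagrees, so a repair mechanism can then bring back the right data.

    # Toy illustration of a scrub: compare replica checksums and flag the odd one out.
    import hashlib
    from collections import Counter

    def scrub(replicas):
        """replicas: dict of osd_id -> bytes. Returns the OSDs whose copy needs repair."""
        digests = {osd: hashlib.sha256(data).hexdigest() for osd, data in replicas.items()}
        majority, _ = Counter(digests.values()).most_common(1)[0]
        return [osd for osd, d in digests.items() if d != majority]

    # Example: osd.2 holds a corrupted copy and would be flagged for repair.
    print(scrub({"osd.0": b"data", "osd.1": b"data", "osd.2": b"dat@"}))   # ['osd.2']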
And we're going to talk about the replication flow,
particularly in today's talk.
So this picture shows a very high-level simplified flow,
about six steps.
So client's going to try to say,
I want to upload my family photo there.
This ends up putting a new object in your storage cluster.
In this case, you can see there are three OSDs deployed, which we refer to as the primary,
secondary, and tertiary. The primary is going to say,
okay, where are my peers? Where are the secondary and tertiary?
Once they're identified through the algorithm,
the data comes from the client to the primary OSD,
and then the primary sends it to
the secondary as well as the tertiary. Once all three pieces of data are actually persisted on
your storage devices, an acknowledgement is sent back, and there is actually
a counter tracking how many acknowledgements have been received.
Eventually, it can say, okay, well, good.
Your data is placed there according to your rule.
Then you're done.
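Here is a minimal Python sketch of that six-step relay flow. The class and function names are ours, purely for illustration, not Ceph's internals; the point is that the primary persists its copy, relays full copies to the peers, and acknowledges the client only after counting acks from every replica.

    # Simplified stock-Ceph replication flow (illustrative only; names are ours, not Ceph's).
    # client -> primary OSD -> secondary/tertiary OSDs; client is acked only after
    # the primary has counted acknowledgements from every replica.
    class FakeOSD:
        def __init__(self, name):
            self.name, self.objects = name, {}
        def write(self, obj_id, data):
            self.objects[obj_id] = data            # persist locally (stand-in for the object store)
            return True                            # acknowledgement

    def primary_handle_write(primary, peers, obj_id, data):
        acks = 1 if primary.write(obj_id, data) else 0        # steps 1-2: primary persists
        for peer in peers:                                     # steps 3-4: relay full copies
            acks += 1 if peer.write(obj_id, data) else 0       # step 5: collect acks
        return "ack-to-client" if acks == 1 + len(peers) else "fail"   # step 6

    osds = [FakeOSD(n) for n in ("primary", "secondary", "tertiary")]
    print(primary_handle_write(osds[0], osds[1:], "family-photo", b"...jpeg bytes..."))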
This picture doesn't have NVMe over Fabrics in it yet.
Now, next, this is where I'm going to show you where what we refer to as the data center tax comes from.
Recall from an earlier slide, the Ceph best practice today is to provision a separate cluster network for internal traffic,
particularly because of, let me go back one more slide, the replication traffic between the OSDs.
You have to have a dedicated network for that purpose,
as well as for scrubbing, that kind of task.
Now, this is what we refer to as a stock Ceph deployment right now.
This network cost component grows as capacity scales up.
And you can see that in the replication case, it's almost always a linear factor in how it grows.
When you move to disaggregation,
this obviously exacerbates the data movement problem.
If you follow the arrow from the client,
the data block has to travel to the primary OSD.
That's required, as before.
But the point is, the OSD is playing a relay role here,
passing the data to the secondary and tertiary,
when eventually it just moves on to its actual location
on a remote target.
That red line is showing a fabric connection.
The data has really no purpose staying in the primary, because it's not going to eventually reside there.
This is what we refer to as the data center tax,
and it's also the focus of our research work: how to reduce it.
For the proposed approach, I'm going to have my colleague Arun
talk more and walk through the architecture details
of how we achieve the TCO benefit for Ceph on disaggregation.
Okay.
So like Yi mentioned just now, we have these extra hops of the data, so extra bandwidth is consumed,
and you have the relay replication, which adds the latency cost.
So now let's look at how we could try and resolve or improve the architecture to address these issues.
So the first thing, like I said, is the extra data hops.
What we'd rather do is have the data land directly from the primary OSD onto the eventual remote storage targets.
The issue here, of course, is that we need the final landing destinations.
In stock Ceph today, you have a CRUSH map which provides you the mapping
for a particular object: which OSD it should go to. What we are missing is the target
in this case where you're disaggregating the storage, right? What is the target for each
given OSD? So you need extra state, and what we are proposing is to maintain a map of the storage targets,
where you know, for a particular OSD, where the remote storage target is.
Once we have that, the primary OSD can directly land the data on the appropriate storage target.
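A sketch of what that extra state could look like is below. This is our own hypothetical table, not an existing Ceph structure, and the addresses and NQNs are made up; the idea is simply that, alongside the acting set CRUSH returns, the primary can look up each replica OSD's NVMe-oF landing target and open an initiator connection straight to it.

    # Hypothetical OSD -> remote NVMe-oF target map (addresses/NQNs are made up).
    # With this, the primary knows the eventual landing device for every replica
    # and can send the data there directly instead of relaying it via peer OSDs.
    OSD_TO_TARGET = {
        "osd.4": {"addr": "192.168.10.21", "port": 4420, "nqn": "nqn.2019-06.example:subsys4"},
        "osd.5": {"addr": "192.168.10.22", "port": 4420, "nqn": "nqn.2019-06.example:subsys5"},
        "osd.6": {"addr": "192.168.10.23", "port": 4420, "nqn": "nqn.2019-06.example:subsys6"},
    }

    def targets_for(acting_set):
        """acting_set: OSDs chosen by CRUSH, e.g. ['osd.4', 'osd.5', 'osd.6']."""
        return [OSD_TO_TARGET[osd] for osd in acting_set]

    print(targets_for(["osd.4", "osd.5", "osd.6"]))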
Okay, you did that.
But now we have the next issue as to who owns the device blocks.
So currently, the owner of the device blocks where the data eventually lands is the host file system,
or in the newer versions of Ceph, it's BlueStore.
And if we leave that as it is, and you want to land data directly from the primary OSD, you need to know which blocks to land it on,
which means you'll have extra traffic going back and forth between the OSDs just to learn where to land the data.
So to avoid that, what we're proposing is: let's move the block ownership to the remote side.
And I'll talk in a lot more detail about that in the next slide.
But just to complete the picture, the third part is what we refer to as the control plane.
And essentially what we're talking about there is the metadata that's associated with each object.
In stock Ceph today, that's tightly coupled with the data.
And that's where you have those extra data hops
that were pointed out in the previous slide, right?
So what we are proposing is to say,
hey, can we decouple this?
Have the OSD peers just exchange the metadata,
or the control traffic.
And with just that,
can we achieve the end goal
of the guarantees that Ceph offers, right?
If we are able to do that,
we can essentially eliminate
N minus 1 data copies,
where N is your replication factor.
There's one last bit that remains.
You've landed the data directly
on the remote storage targets from the primary, and you've sent the control messages to the peer OSDs. How do you connect the
two? What we are proposing there is essentially to use the unique ID associated with each object. We can
use that to correlate the metadata with the data. And in typical three-way replication, here's what you would achieve by doing something like this: what in stock Ceph would take six hops.
So, data from the client, first hop; primary OSD to the peer OSDs, the next two hops, so three hops there.
And then each of these OSDs landing the data on its remote storage target, three more hops. So you have six hops in stock Ceph today,
versus, if you do it this way,
you'll have the data from the client,
which we're counting as the first hop,
going directly to the remote storage targets,
the remaining three hops.
So that's your four hops.
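Counting it out for a general replication factor R (a back-of-the-envelope restatement of the argument above, counting the client hop as the speakers do here):

    # Data hops per object write, including the client -> primary hop (illustrative).
    def hops_stock(r):    return 1 + (r - 1) + r    # client + relay to peers + each OSD -> its target
    def hops_proposed(r): return 1 + r              # client + primary -> every target directly
    print(hops_stock(3), hops_proposed(3))          # 6 4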
So let's go to the next level down
on what we're referring to as the control and data plane, and what we're talking about separating here, right?
If you look at the Ceph OSD stack, there's actually a clean layering that's already in there, right?
So, the part we're referring to as the control plane is really the object mapping service, right? You get an object,
there's a placement algorithm, essentially the CRUSH map, which determines the placement group
for a particular object and the pool associated with it, right? So we're referring to those as
the control plane. And once the OSD actually gets the object, the details of where the data for that
object must be placed on the device,
that part is taken care of by BlueStore and the layers below, right?
We're referring to that as the data plane.
Now, if you're talking about disaggregation, below that you'll have to have an initiator,
in our case an NVMe over Fabrics initiator, talking to, or sending the data across to, the remote storage
target.
And like we spoke about a couple of slides earlier,
this approach is inefficient because you have the relay and the extra data copies.
So what we're basically saying when we say remote block management
is to move BlueStore and have it run standalone on the remote storage target.
You leave the Ceph top half on the OSD host.
You do all the operations that the Ceph OSD does today,
but you move the block management to the remote storage target.
And the key part here is the notion of the block ownership mapping table,
right? So how do you correlate, given an object ID, where the data blocks are? Now, there's nothing
new here. BlueStore already does all of this. It's just a question of where you're running BlueStore, right:
on the host versus on the remote target.
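As a rough sketch of the block ownership idea, here is an illustrative Python version of that mapping table living on the remote side. The layout and allocator here are ours, not BlueStore's real structures; the point is that an incoming data write tagged with an object ID can be matched to (or allocated) device extents on the target itself, with no round trip back to the OSD host, and the same object ID is what correlates the data with the metadata the peer OSDs hold.

    # Illustrative block-ownership table kept on the remote storage target
    # (not BlueStore's real structures): object id -> list of (offset, length) extents.
    class RemoteBlockOwner:
        def __init__(self, block_size=4096):
            self.block_size = block_size
            self.extents = {}          # object_id -> [(device_offset, length), ...]
            self.next_free = 0         # trivial bump allocator, just for the sketch

        def write_object(self, object_id, data):
            off = self.next_free
            self.next_free += -(-len(data) // self.block_size) * self.block_size
            self.extents.setdefault(object_id, []).append((off, len(data)))
            return off                 # where the data landed on the device

        def lookup(self, object_id):
            return self.extents.get(object_id, [])

    tgt = RemoteBlockOwner()
    tgt.write_object("family-photo", b"\x00" * 2 * 1024 * 1024)   # a 2 MiB object
    print(tgt.lookup("family-photo"))   # [(0, 2097152)] -- object id ties data to metadata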
So with these changes, here's our estimate of what kind of benefits you would get.
So the first part: remote block management. The picture on the left is stock Ceph with this pure
disaggregated approach using NVMe over Fabrics, right? And like we mentioned, you have your primary OSD copying the data over to the peer OSDs, and so that's
two hops, and then the three hops down to the targets. We're not counting the client data
hop because it's common between the two approaches. So you have five data hops there. And then
with our approach, or our proposal, the primary will basically be able to send the data directly to the remote storage targets,
with only control messaging between the peer OSDs.
So we have a formula there for interested folks.
We can get into the details of that offline.
But essentially, I mean, intuitively, you can kind of see we're eliminating the R minus 1
hops of the data between the peer OSDs, where R is your replication factor.
And just to give some more context, for three-way replication,
that would translate to about a 40% reduction in bandwidth consumption.
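The 40% figure can be reproduced with simple arithmetic (excluding the common client hop, as above): stock Ceph moves the object over the network 2R - 1 times and the proposed flow R times, so the saving is (R - 1) / (2R - 1), which for R = 3 is 2/5.

    # Cluster/fabric data transfers per object, excluding the common client -> primary hop.
    def reduction(r):
        stock = (r - 1) + r          # relay to peers + every OSD writing to its target
        proposed = r                 # primary writes each replica to its target directly
        return (stock - proposed) / stock
    print(reduction(3))              # 0.4 -> ~40% less data moved for three-way replication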
Yeah, so that's about the bandwidth benefits.
How about the latency benefits?
So what I'm showing here is a sequence chart.
And again, let's go from left to right.
Each column there is showing what happens at each of the entities involved, right? The client, the primary OSD and its related primary target,
and similarly for the replica OSDs, right? So in stock Ceph
today, the data comes in, the primary has to hop that, or relay that, over to your peer OSDs,
and then the peer OSDs, or each of the OSDs, will essentially write the data at its remote target. Contrast that with the new approach we are looking at.
And the key point to take away, really, is that we are sending the data concurrently.
So, from the primary OSD: if the primary OSD knows where the eventual storage targets are,
it can send the data at the same time to all of those targets.
And at that same time, it can send the control data to the peer OSDs.
And so, obviously, you don't incur the latency from the relaying that you incur with stock Ceph.
Now, there is the aspect of the primary having to keep track and make sure that the data landed correctly at each of the storage targets, as well as at the peer OSDs. So, just like stock Ceph today keeps track of counters to make sure the requisite number of replicas were landed,
you would need to have some more logic there
to make sure: was the data landed correctly
at the remote storage target,
did the metadata land at the peer OSDs, et cetera.
But that aside, by doing this concurrent operation,
you essentially eliminate, I mean, intuitively again, I won't go into the
details of the equation there, but intuitively, because you eliminate that relay hop, you're
essentially removing, you know, one hop and back of the network transfer, right? And so we
estimate that we'll get a 1.5x latency improvement by doing that.
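A small sketch of that concurrent dispatch is below, using Python asyncio purely for illustration; the real prototype does this inside the Ceph OSD with SPDK, and the function names here are ours. The primary issues the data writes to all remote targets and the metadata messages to the peer OSDs at the same time, then waits for every completion before acknowledging the client. On the latency side, one way to read the 1.5x estimate is that the relay path serializes roughly three network transfers (client to primary, primary to peer, peer to target) where the direct path needs roughly two (client to primary, primary to target).

    # Illustrative concurrent dispatch from the primary OSD (not the actual Ceph/SPDK code).
    import asyncio

    async def write_to_target(target, object_id, data):
        await asyncio.sleep(0.001)                 # stand-in for an NVMe-oF data write
        return ("data", target, object_id)

    async def send_metadata(peer_osd, object_id, meta):
        await asyncio.sleep(0.0005)                # stand-in for a control-plane message
        return ("meta", peer_osd, object_id)

    async def replicate(object_id, data, targets, peer_osds):
        ops  = [write_to_target(t, object_id, data) for t in targets]               # data plane
        ops += [send_metadata(p, object_id, {"len": len(data)}) for p in peer_osds]  # control plane
        done = await asyncio.gather(*ops)          # track every completion before acking client
        return len(done) == len(targets) + len(peer_osds)

    print(asyncio.run(replicate("family-photo", b"...", ["tgt1", "tgt2", "tgt3"], ["osd.5", "osd.6"])))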
Okay, so that was at the design level.
Here's what we actually went off and tried.
So this is a proof of concept, and we started with Ceph Luminous, and the picture here is showing two-way replication.
So what we went and modified is in the Ceph object store.
Logic was added to essentially provide NVMe over Fabrics initiator
functionality in the Ceph object store,
and that is used to land the data directly on all the remote storage targets.
The other aspect that was modified is in the
rep op operation, where currently
the primary sends the control and data to the peer OSDs.
There we just snipped off the data
copy part, so only the control messages are sent.
We're using SPDK, the user space
framework, for this. So the transport is SPDK RDMA. And on the remote side, we have the SPDK NVMe
over Fabrics target. Now, this receives the commands that we're sending from the host side, and we've added an SPDK block device,
which takes these commands
and essentially does a mapping
to make the relevant call to Ceph BlueStore.
It's just that instead of happening on the host side,
it's happening on the remote target side.
And currently we've implemented this on Soft-RoCE,
but even as we speak we are working to do this in a larger-scale deployment,
remove Soft-RoCE,
and actually do it on an actual Ceph network.
In the picture here, I'm basically showing one part
that's highlighted, and that's just
to set up the next slide, where I'll be showing some preliminary data that we've got.
And so the measurements that I'll talk about are basically on the Ceph cluster network,
and what we're really measuring is just the received and transmitted bytes.
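For readers who want to reproduce that kind of measurement, one simple way (our own snippet, not the tooling used in the talk) is to sample the kernel's per-interface byte counters before and after a run; the interface name here is a placeholder for whichever NIC carries the Ceph cluster network.

    # Sample received/transmitted byte counters for a Linux network interface
    # by reading /proc/net/dev before and after a test run (illustrative helper).
    def iface_bytes(iface):
        with open("/proc/net/dev") as f:
            for line in f:
                if line.strip().startswith(iface + ":"):
                    fields = line.split(":", 1)[1].split()
                    return int(fields[0]), int(fields[8])   # rx_bytes, tx_bytes
        raise ValueError(f"interface {iface} not found")

    rx0, tx0 = iface_bytes("eth1")     # replace with the Ceph cluster-network interface
    # ... run the RADOS put workload here ...
    rx1, tx1 = iface_bytes("eth1")
    print("received:", rx1 - rx0, "transmitted:", tx1 - tx0)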
Okay. So in terms of preliminary results and what we did: basically, RADOS put was used.
And again, we're working towards expanding the set of tests and doing more. But just for this
graph and what I'm showing here: 10 iterations of RADOS put, with two different object sizes, 2 MB and 6 MB.
And, as in the picture I showed before,
we are measuring the received and transmitted bytes
on the Ceph network.
And as expected, oh, and by the way,
this is with two-way replication.
So essentially, you know, for stock Ceph
with NVMe over Fabrics, you end up transferring
40 MB per put. And instead, with the approach we are proposing, you basically get about a 48%
reduction. And similarly for the larger object size. And again, this might seem a little more
than what we had estimated,
but I just want to clarify that what we're showing here
is only the Ceph network traffic.
We are not doing anything to reduce any bandwidth consumption
on the fabric side.
That remains as is.
But just wanted to contrast what's different.
So that's what we're showing here.
And so the initial results look promising,
but we want to expand this and do it at a larger scale,
as well as measure the latency,
for which we'll need to move off of Soft-RoCE
and do it on actual hardware.
So yeah, that's essentially what we are going after here. So to summarize:
if you use stock Ceph and you try to disaggregate the storage, basically Ceph hasn't been designed
for that, right? And because of that, you introduce extra hops and extra relay steps, which we're referring to as the data center tax.
And we want to eliminate that with two main ideas: decouple the control and data flow, and do block management remotely.
And we want to preserve the Ceph SDS value. So all the things that Ceph is great at
with respect to reliability, durability, all the operations that it does, we'd like to retain,
and do it in a way where you reduce the TCO for Ceph while bringing the NVMe over Fabrics value to Ceph deployments.
So in summary, that's kind of what we are going after.
And in terms of next steps, I think the first thing is to take this idea to the larger community,
and more specifically the Ceph community.
We'd like to validate whether our ideas make sense, get more input,
and make the whole design stronger, if you will. The other thing is integrating the storage
target information with the CRUSH map. That would be ideal, as opposed to having two separate
places where state is maintained. I already mentioned we'd like to evaluate the performance at scale. The other interesting thing is that when you remove the bottom half of Ceph and have that be remote,
you can actually start looking at, or moving toward, a direction where you can work with stateless OSDs, if you will. So what we are talking about is:
if an OSD crashes,
you can just quickly fail over,
get a new instance of an OSD to come up,
and it can connect to the same remote storage target,
and you're back online.
So that's a nice direction to go after.
And the typical question that we get asked when I mention this is:
what happens if the remote storage target crashes, right?
There, nothing changes with respect to what Ceph does today, right?
A particular OSD crashes, you know, your peer OSDs figure that out and take the relevant steps to bring
the number of data copies back to the appropriate level, et cetera.
There, nothing will change. That's what we are referring to here as
mechanisms to survive OSD node failures. We think that's an
interesting future area of work, as well as the additional
offloads that are possible when you take this kind of approach.
So I think that's basically what we have here today.
Thank you.
Thank you all.
Thanks for listening.
If you have questions about the material presented in this podcast, be sure to join our developers mailing list by sending an email to developers-subscribe@snia.org.
Here you can ask questions and discuss this topic further with your peers in the Storage Developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.