Storage Developer Conference - #98: Rethinking Ceph Architecture for Disaggregation Using NVMe-over-Fabrics
Episode Date: June 10, 2019...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast Episode 98.
So my name is Yi. I'm a research scientist from Intel Labs.
I'm here today with my colleague Arun, as well as my co-speaker.
As you can see, we have another co-speaker listed,
but unfortunately he couldn't make it due to a family emergency.
The topic today is rethinking Ceph architecture for disaggregation using NVMe over Fabrics.
I don't know how many people just attended the NVMe over Fabrics talk from Sujoy and Mohan.
If you missed it, it's worth chatting with them to get more detail on that. We'll touch on it a little bit more, but we'll focus more on Ceph as a very popular, successful SDS
solution in the era of disaggregation, and storage disaggregation particularly.
So the agenda for today: we will start with some background, a little bit of a
refresher on Ceph.
For people who are already familiar with that, it will be a very easy background introduction,
as well as disaggregation in general.
And we're going to try to illustrate what the problem is,
particularly for Ceph.
How do we combine the two?
You have a very specific problem with Ceph.
Now you want to introduce disaggregation.
How can we solve that problem?
And using replication, one of the most common practices,
we explain the replication flow as well as what we refer to as the data center tax
when you try to combine the two to provide good value for your storage services.
And based on that, we're going to show, at least based on our observation, the current
approaches, as well as our proposed approach. In particular, Arun is going to take you
through the details of the architecture design, at least the thinking we have been through and what the architecture details are in the context of Ceph,
and how we arrived at our
analytical results and some small-scale preliminary
evaluation results. And we'll conclude the talk.
So this is a very
popular picture of Ceph architecture.
You can Google this everywhere.
But it kind of summarizes the benefits from Ceph being very successful
as one of the default deployments in the OpenStack architecture as a block service.
It also provides, well, it's actually an object store,
and it has worked together with Swift in the past
as well as now.
And you can see an application,
depending on the storage service you want to provide,
could use a block-level service,
like your VM disk image,
through the thing called RBD,
which stands for RADOS Block Device.
Or, if you want to keep using your existing file system interface, you have the Ceph file system there, allowing you to access it with plain file reads and writes.
Or, like an image website,
if you want to just upload your picture as an object,
you have the object interface through the RADOS Gateway.
Underlying it all is this RADOS layer,
the Reliable
Autonomic
Distributed Object Store,
which achieves all these storage service functionalities.
It is software-defined, hardware-agnostic SDS.
It basically achieves the purpose of disaggregating your software from your hardware control plane.
Then it provides the thing that everybody in the storage industry cares about,
which is reliability: how I
can store my data reliably.
So it provides replication,
or erasure coding if you
care very much about the capacity savings.
And it can
rebalance in case one of
your OSDs
fails. The details
of how Ceph works are in the bottom right,
in that block diagram,
where the core part is the CRUSH algorithm
that figures out the best way
to place your data, and where.
And that takes into consideration
your network topology,
particularly your failure domains,
versus your clients' requirements on reliability.
Say you want replication:
you could have one pool with two-way replication,
another pool with three-way replication,
and another pool with whatever
erasure coding profile you favor.
And it will automatically figure out
where the data is going to end up,
eventually landing on your storage devices.
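To make that placement idea a little more concrete, here is a minimal sketch in Python. This is not the real CRUSH implementation; the pool names, PG counts, and OSD counts are made up for illustration. It only shows the general shape of the computation: an object name is hashed to a placement group, and the placement group is mapped to as many distinct OSDs as the pool's replication factor asks for.

    # Conceptual sketch of CRUSH-style placement (NOT the real CRUSH algorithm).
    # Hash the object name to a placement group, then pick `size` distinct OSDs.
    import hashlib

    POOLS = {
        "rbd-2x": {"pg_num": 128, "size": 2},   # hypothetical two-way replicated pool
        "rbd-3x": {"pg_num": 128, "size": 3},   # hypothetical three-way replicated pool
    }
    NUM_OSDS = 12  # example cluster size

    def place(pool, obj_name):
        cfg = POOLS[pool]
        h = int(hashlib.md5(obj_name.encode()).hexdigest(), 16)
        pg = h % cfg["pg_num"]                       # object -> placement group
        primary = pg % NUM_OSDS                      # pg -> primary OSD (toy mapping)
        osds = [(primary + i) % NUM_OSDS for i in range(cfg["size"])]
        return pg, osds                              # (pg, [primary, secondary, tertiary...])

    print(place("rbd-3x", "family-photo.jpg"))       # e.g. (pg number, [4, 5, 6])

The point to take away is that placement is computed, not looked up in a central table, which is why the topology the algorithm sees matters so much later in this talk.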
This is a little bit of background on disaggregation.
Disaggregation itself is not exactly new,
but there are some new industry trends that make it more appealing.
Software-defined storage, as I mentioned earlier,
is looking at a scale-out approach for storage guarantees,
and particularly looking at how I can disaggregate software from hardware.
And you probably already know there are numerous SDS offerings,
particularly in cloud storage.
It makes it easier to manage my commodity hardware with just
an SDS stack, so that I can focus on providing the service that is more related to
what clients want.
And another angle of disaggregation is basically separating servers into resource components.
So disaggregation is all about, as I quoted here from the Euthynics paper,
extreme resource modularity,
because that provides resource flexibility as well as utilization.
That is a direct TCO benefit,
and you can see the benefit while deploying your infrastructure.
And the deployment flexibility also means that,
if you want to do a hybrid or a hyper-converged deployment,
that is also possible,
because it eventually allows you to scale your storage,
or in this case the disaggregated resources, which can be storage, networking, as well as compute,
independently from your application growth.
So various workloads can all fit into the services provided by disaggregation.
And I mention that particularly because of the industry trend;
as the earlier talk from Sujoy and Mohan mentioned, with NVMe over Fabrics the one particularly appealing factor
here is the faster interconnect: a very well
designed standard like the NVMe protocol allows you to do
very good I/O over to the remote side.
And that latency makes it a very
appealing factor to say,
okay, I really want to have disaggregation with the faster interconnect,
so that I can actually manage my resources much more efficiently.
Well, as I work for Intel, you'll notice that I put in a picture from Intel,
particularly Intel's perspective with Rack Scale Design in the context of disaggregation.
And you can see that eventually you want to move from today's
physical aggregation to fabric integration and fully
modular resources. Modularity is the key
part that allows you to achieve all the benefits
you're looking for in your data center infrastructure.
So now, Ceph on NVMe-oF, how can we do it?
Number one is: what is the rationale for putting Ceph on NVMe over Fabrics?
You have one very popular SDS solution in Ceph, backed by a very popular and big open source community, as well as Red Hat.
You have NVMe over Fabrics being pushed in the industry.
A lot of vendors are coming into play, and we can see that the two trends are going to somehow merge together.
Then we have to figure out how to make them work together very efficiently,
taking advantage of both sides.
As I mentioned earlier, disaggregation allows you to
manage your
storage resources independently
without worrying about the
other side. Now,
if you still want to say, I want to
use disaggregation,
how do you support the multiple SDS offerings on disaggregation? All of them are going to say, okay, I own my storage; even though
I'm an SDS, I still own my own storage. They were not designed with the mindset that their data is going
to eventually land on a storage device that is located remotely. And I'm going to talk about the actual problem you're going to
see when you actually start doing that. So if you want
to scale compute and storage independently, you have to think
about how to solve this problem. And also,
this basically opens the door to a lot of optimization
opportunities, looking at various SDS architectures
and where we can do better to take advantage of both sides.
Well, from our observations, there are a couple of approaches people have already been practicing.
The first one is what we refer to as a host-based NVMe over Fabrics storage backend.
Well, this is kind of one extreme: okay, I want to reuse my NVMe over Fabrics target just like a SAN.
I have mirroring capability there; I will let it maintain my replication.
But essentially you're saying, okay, Ceph, don't worry about any of the reliability things; I'm going to
manage that. You have to set the
Ceph configuration parameter, say, to replication factor one,
because the other side is going to do the mirroring.
It's an approach, of course, and it will actually
work. But the point is, why do you need Ceph to begin with?
Now, another option is: I'm a Ceph administrator.
I'm very familiar with Ceph.
I want to just use it the way it is.
Okay, I just don't want to worry about anything related to NVMe over Fabrics.
I'm just going to treat it as a dumb disk.
You can still do that. Plug in the NVMe initiator;
it's going to bring up a volume in your system. Treat it as a local disk
with whatever I/O characteristics you see there.
Use that. No problem. You can still have that working.
We have tried that, and we actually have a comparison to it later.
However, as you can see, essentially the problem is
that Ceph, or Swift, or any of the others,
want to make sure your service is maintained very well
based on the service level requirements from the clients,
which basically means: how do I know where my failure domain is?
By doing that, essentially you're hiding that information from Ceph.
The Ceph CRUSH map doesn't really know about your remote targets.
There's no such layer in between that the algorithm cares about.
The algorithm itself is a consistent hashing algorithm.
You have to have that topology built in the right way
so that it actually covers your requirements
from that perspective.
So it will work.
Does it work to the point that we expect it to work?
Well, we'll see.
Decoupled Ceph control and data flow is actually what we're after,
and it's our proposed approach.
We're going to talk about that in detail later.
So as an example,
I'm going to focus more on
the Ceph replication flow, but,
as I mentioned earlier,
SDS reliability guarantees
come through data copies:
replication, or redundancy from
erasure coding. For durability,
there's
this long-running task called
scrubbing in Ceph,
or what I think Swift refers to as auditing.
It basically makes sure your data integrity check
is always successful.
If not, there's a certain mechanism
to make sure you bring the right data back
to the storage device.
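As a rough illustration of what a scrub-style check does, here is a small Python sketch. It is not Ceph's actual scrubbing code; it only shows the basic idea of comparing checksums across replicas and flagging the copy that disagrees, so a repair mechanism can then bring back the right data.

    # Toy illustration of a scrub: compare replica checksums and flag the odd one out.
    import hashlib
    from collections import Counter

    def scrub(replicas):
        """replicas: dict of osd_id -> bytes. Returns the OSDs whose copy needs repair."""
        digests = {osd: hashlib.sha256(data).hexdigest() for osd, data in replicas.items()}
        majority, _ = Counter(digests.values()).most_common(1)[0]
        return [osd for osd, d in digests.items() if d != majority]

    # Example: osd.2 holds a corrupted copy and would be flagged for repair.
    print(scrub({"osd.0": b"data", "osd.1": b"data", "osd.2": b"dat@"}))   # ['osd.2']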
And we're going to talk about the replication flow,
particularly in today's talk.
So this picture shows a very high-level simplified flow,
about six steps.
So client's going to try to say,
I want to upload my family photo there.
This ends up putting a new object in your storage cluster.
In this case, you can see there are three OSDs deployed, which we refer to as the primary,
secondary, and tertiary. The primary is going to say,
okay, where are my peers? Where are the secondary and tertiary?
Once they're identified through the algorithm,
the data comes from the client to the primary OSD,
and then the primary sends it to
the secondary as well as the tertiary. Once all three pieces of data are actually persisted on
your storage devices, an acknowledgement is sent back, and there is actually
a counter tracking how many acknowledgements have been received.
Eventually, it can say, okay, well, good.
Your data is placed there according to your rule.
Then you're done.
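Here is a minimal Python sketch of that six-step relay flow. The class and function names are ours, purely for illustration, not Ceph's internals; the point is that the primary persists its copy, relays full copies to the peers, and acknowledges the client only after counting acks from every replica.

    # Simplified stock-Ceph replication flow (illustrative only; names are ours, not Ceph's).
    # client -> primary OSD -> secondary/tertiary OSDs; client is acked only after
    # the primary has counted acknowledgements from every replica.
    class FakeOSD:
        def __init__(self, name):
            self.name, self.objects = name, {}
        def write(self, obj_id, data):
            self.objects[obj_id] = data            # persist locally (stand-in for the object store)
            return True                            # acknowledgement

    def primary_handle_write(primary, peers, obj_id, data):
        acks = 1 if primary.write(obj_id, data) else 0        # steps 1-2: primary persists
        for peer in peers:                                     # steps 3-4: relay full copies
            acks += 1 if peer.write(obj_id, data) else 0       # step 5: collect acks
        return "ack-to-client" if acks == 1 + len(peers) else "fail"   # step 6

    osds = [FakeOSD(n) for n in ("primary", "secondary", "tertiary")]
    print(primary_handle_write(osds[0], osds[1:], "family-photo", b"...jpeg bytes..."))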
This picture doesn't have NVMe over Fabrics in it yet.
Now, next, this is where I'm going to show you where what we refer to as the data center tax comes from.
Recall from an earlier slide, the Ceph best practice today is to provision a separate cluster network for internal traffic,
particularly because of, let me go back one more slide, the replication traffic between the OSDs.
You have to have a dedicated network for that purpose,
as well as for scrubbing, that kind of task.
Now, this is what we refer to as a stock Ceph deployment right now.
This network cost component grows as capacity scales up.
And you can see that in the replication case, it's almost always a linear factor in how it grows.
When you move to disaggregation,
this obviously exacerbates the data movement problem.
If you follow the arrow from the client,
the data block has to travel to the primary OSD.
That's required, as before.
But the point is, the OSD is playing a relay role here,
passing the data to the secondary and tertiary,
when eventually it just moves on to its actual location
on a remote target.
That red line is showing a fabric connection.
The data has really no purpose staying in the primary, because it's not going to eventually reside there.
This is what we refer to as the data center tax,
and it's also the focus of our research work: how to reduce it.
For the proposed approach, I'm going to have my colleague Arun
talk more and walk through the architecture details
of how we achieve the TCO benefit for Ceph on disaggregation.
Okay.
So like Yi mentioned just now, we have these extra hops of the data, so extra bandwidth is consumed,
and you have the relay replication, which adds the latency cost.
So now let's look at how we could try and resolve or improve the architecture to address these issues.
So the first thing, like I said, is the extra data hops.
What we'd rather do is have the data land directly from the primary OSD onto the eventual remote storage targets.
The issue here, of course, is that we need the final landing destinations.
In stock Ceph today, you have a CRUSH map which provides you the mapping
for a particular object: which OSD it should go to. What we are missing is the target
in this case where you're disaggregating the storage, right? What is the target for each
given OSD? So you need extra state, and what we are proposing is to maintain a map of the storage targets,
where you know, for a particular OSD, where the remote storage target is.
Once we have that, the primary OSD can directly land the data on the appropriate storage target.
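A sketch of what that extra state could look like is below. This is our own hypothetical table, not an existing Ceph structure, and the addresses and NQNs are made up; the idea is simply that, alongside the acting set CRUSH returns, the primary can look up each replica OSD's NVMe-oF landing target and open an initiator connection straight to it.

    # Hypothetical OSD -> remote NVMe-oF target map (addresses/NQNs are made up).
    # With this, the primary knows the eventual landing device for every replica
    # and can send the data there directly instead of relaying it via peer OSDs.
    OSD_TO_TARGET = {
        "osd.4": {"addr": "192.168.10.21", "port": 4420, "nqn": "nqn.2019-06.example:subsys4"},
        "osd.5": {"addr": "192.168.10.22", "port": 4420, "nqn": "nqn.2019-06.example:subsys5"},
        "osd.6": {"addr": "192.168.10.23", "port": 4420, "nqn": "nqn.2019-06.example:subsys6"},
    }

    def targets_for(acting_set):
        """acting_set: OSDs chosen by CRUSH, e.g. ['osd.4', 'osd.5', 'osd.6']."""
        return [OSD_TO_TARGET[osd] for osd in acting_set]

    print(targets_for(["osd.4", "osd.5", "osd.6"]))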
Okay, you did that.
But now we have the next issue as to who owns the device blocks.
So currently, the owner of the device blocks where the data eventually lands is the host file system,
or in the newer versions of Ceph, it's BlueStore.
And if we leave that as it is, and you want to land data directly from the primary OSD, you need to know which blocks to land it on,
which means you'll have extra traffic going back and forth between the OSDs just to learn where to land the data.
So to avoid that, what we're proposing is: let's move the block ownership to the remote side.
And I'll talk in a lot more detail about that in the next slide.
But just to complete the picture, the third part is what we refer to as the control plane.
And essentially what we're talking about there is the metadata that's associated with each object.
In stock Ceph today, that's tightly coupled with the data.
And that's where you have those extra data hops
that were pointed out in the previous slide, right?
So what we are proposing is to say,
hey, can we decouple this?
Have the OSD peers just exchange the metadata,
or the control traffic.
And with just that,
can we achieve the end goal
of the guarantees that Ceph offers, right?
If we are able to do that,
we can essentially eliminate
N minus 1 data copies,
where N is your replication factor.
There's one last bit that remains.
You've landed the data directly
on the remote storage targets from the primary, and you've sent the control messages to the peer OSDs. How do you connect the
two? What we are proposing there is essentially to use the unique ID associated with each object. We can
use that to correlate the metadata with the data. And in typical three-way replication, here's what you would achieve by doing something like this: what in stock Ceph would take six hops.
So, data from the client, first hop; primary OSD to the peer OSDs, the next two hops, so three hops there.
And then each of these OSDs landing the data on its remote storage target, three more hops. So you have six hops in stock Ceph today,
versus, if you do it this way,
you'll have the data from the client,
which we're counting as the first hop,
going directly to the remote storage targets,
the remaining three hops.
So that's your four hops.
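Counting it out for a general replication factor R (a back-of-the-envelope restatement of the argument above, counting the client hop as the speakers do here):

    # Data hops per object write, including the client -> primary hop (illustrative).
    def hops_stock(r):    return 1 + (r - 1) + r    # client + relay to peers + each OSD -> its target
    def hops_proposed(r): return 1 + r              # client + primary -> every target directly
    print(hops_stock(3), hops_proposed(3))          # 6 4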
So let's go to the next level down
on what we're referring to as the control and data plane, and what we're talking about separating here, right?
If you look at the Ceph OSD stack, there's actually a clean layering that's already in there, right?
So, the part we're referring to as the control plane is really the object mapping service, right? You get an object,
there's a placement algorithm, essentially the CRUSH map, which determines the placement group
for a particular object and the pool associated with it, right? So we're referring to those as
the control plane. And once the OSD actually gets the object, the details of where the data for that
object must be placed on the device,
that part is taken care of by BlueStore and the layers below, right?
We're referring to that as the data plane.
Now, if you're talking about disaggregation, below that you'll have to have an initiator,
in our case an NVMe over Fabrics initiator, talking to, or sending the data across to, the remote storage
target.
And like we spoke about a couple of slides earlier,
this approach is inefficient because you have the relay and the extra data copies.
So what we're basically saying when we say remote block management
is to move BlueStore and have it run standalone on the remote storage target.
You leave the Ceph top half on the OSD host.
You do all the operations that the Ceph OSD does today,
but you move the block management to the remote storage target.
And the key part here is the notion of the block ownership mapping table,
right? So how do you correlate, given an object ID, where the data blocks are? Now, there's nothing
new here. BlueStore already does all of this. It's just a question of where you're running BlueStore, right:
on the host versus on the remote target.
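As a rough sketch of the block ownership idea, here is an illustrative Python version of that mapping table living on the remote side. The layout and allocator here are ours, not BlueStore's real structures; the point is that an incoming data write tagged with an object ID can be matched to (or allocated) device extents on the target itself, with no round trip back to the OSD host, and the same object ID is what correlates the data with the metadata the peer OSDs hold.

    # Illustrative block-ownership table kept on the remote storage target
    # (not BlueStore's real structures): object id -> list of (offset, length) extents.
    class RemoteBlockOwner:
        def __init__(self, block_size=4096):
            self.block_size = block_size
            self.extents = {}          # object_id -> [(device_offset, length), ...]
            self.next_free = 0         # trivial bump allocator, just for the sketch

        def write_object(self, object_id, data):
            off = self.next_free
            self.next_free += -(-len(data) // self.block_size) * self.block_size
            self.extents.setdefault(object_id, []).append((off, len(data)))
            return off                 # where the data landed on the device

        def lookup(self, object_id):
            return self.extents.get(object_id, [])

    tgt = RemoteBlockOwner()
    tgt.write_object("family-photo", b"\x00" * 2 * 1024 * 1024)   # a 2 MiB object
    print(tgt.lookup("family-photo"))   # [(0, 2097152)] -- object id ties data to metadata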
So with these changes, here's our estimate of what kind of benefits you would get.
So the first part: remote block management. The picture on the left is stock Ceph with this pure
disaggregated approach using NVMe over Fabrics, right? And like we mentioned, you have your primary OSD copying the data over to the peer OSDs, and so that's
two hops, and then the three hops down to the targets. We're not counting the client data
hop because it's common between the two approaches. So you have five data hops there. And then
with our approach, or our proposal, the primary will basically be able to send the data directly to the remote storage targets,
with only control messaging between the peer OSDs.
So we have a formula there for interested folks.
We can get into the details of that offline.
But essentially, I mean, intuitively, you can kind of see we're eliminating the R minus 1
hops of the data between the peer OSDs, where R is your replication factor.
And just to give some more context, for three-way replication,
that would translate to about a 40% reduction in bandwidth consumption.
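The 40% figure can be reproduced with simple arithmetic (excluding the common client hop, as above): stock Ceph moves the object over the network 2R - 1 times and the proposed flow R times, so the saving is (R - 1) / (2R - 1), which for R = 3 is 2/5.

    # Cluster/fabric data transfers per object, excluding the common client -> primary hop.
    def reduction(r):
        stock = (r - 1) + r          # relay to peers + every OSD writing to its target
        proposed = r                 # primary writes each replica to its target directly
        return (stock - proposed) / stock
    print(reduction(3))              # 0.4 -> ~40% less data moved for three-way replication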
Yeah, so that's about the bandwidth benefits.
How about the latency benefits?
So what I'm showing here is a sequence chart.
And again, let's go from left to right.
Each column there is showing what happens at each of the entities involved, right? The client, the primary OSD and its related primary target,
and similarly for the replica OSDs, right? So in stock Ceph
today, the data comes in, the primary has to hop that, or relay that, over to your peer OSDs,
and then the peer OSDs, or each of the OSDs, will essentially write the data at its remote target. Contrast that with the new approach we are looking at.
And the key point to take away, really, is that we are sending the data concurrently.
So, from the primary OSD: if the primary OSD knows where the eventual storage targets are,
it can send the data at the same time to all of those targets.
And at that same time, it can send the control data to the peer OSDs.
And so, obviously, you don't incur the latency from the relaying that you incur with stock Ceph.
Now, there is the aspect of the primary having to keep track and make sure that the data landed correctly at each of the storage targets, as well as at the peer OSDs. So, just like stock Ceph today keeps track of counters to make sure the requisite number of replicas were landed,
you would need to have some more logic there
to make sure: was the data landed correctly
at the remote storage target,
did the metadata land at the peer OSDs, et cetera.
But that aside, by doing this concurrent operation,
you essentially eliminate, I mean, intuitively again, I won't go into the
details of the equation there, but intuitively, because you eliminate that relay hop, you're
essentially removing, you know, one hop and back of the network transfer, right? And so we
estimate that we'll get a 1.5x latency improvement by doing that.
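A small sketch of that concurrent dispatch is below, using Python asyncio purely for illustration; the real prototype does this inside the Ceph OSD with SPDK, and the function names here are ours. The primary issues the data writes to all remote targets and the metadata messages to the peer OSDs at the same time, then waits for every completion before acknowledging the client. On the latency side, one way to read the 1.5x estimate is that the relay path serializes roughly three network transfers (client to primary, primary to peer, peer to target) where the direct path needs roughly two (client to primary, primary to target).

    # Illustrative concurrent dispatch from the primary OSD (not the actual Ceph/SPDK code).
    import asyncio

    async def write_to_target(target, object_id, data):
        await asyncio.sleep(0.001)                 # stand-in for an NVMe-oF data write
        return ("data", target, object_id)

    async def send_metadata(peer_osd, object_id, meta):
        await asyncio.sleep(0.0005)                # stand-in for a control-plane message
        return ("meta", peer_osd, object_id)

    async def replicate(object_id, data, targets, peer_osds):
        ops  = [write_to_target(t, object_id, data) for t in targets]               # data plane
        ops += [send_metadata(p, object_id, {"len": len(data)}) for p in peer_osds]  # control plane
        done = await asyncio.gather(*ops)          # track every completion before acking client
        return len(done) == len(targets) + len(peer_osds)

    print(asyncio.run(replicate("family-photo", b"...", ["tgt1", "tgt2", "tgt3"], ["osd.5", "osd.6"])))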
Okay, so that was at the design level.
Here's what we actually went off and tried.
So this is a proof of concept, and we started with Ceph Luminous, and the picture here is showing two-way replication.
So what we went and modified is in the Ceph object store.
Logic was added to essentially provide NVMe over Fabrics initiator
functionality in the Ceph object store,
and that is used to land the data directly on all the remote storage targets.
The other aspect that was modified is in the
rep op operation, where currently
the primary sends the control and data to the peer OSDs.
There we just snipped off the data
copy part, so only the control messages are sent.
We're using SPDK, the user space
framework, for this. So the transport is SPDK RDMA. And on the remote side, we have the SPDK NVMe
over Fabrics target. Now, this receives the commands that we're sending from the host side, and we've added an SPDK block device,
which takes these commands
and essentially does a mapping
to make the relevant call to Ceph BlueStore.
It's just that instead of happening on the host side,
it's happening on the remote target side.
And currently we've implemented this on Soft-RoCE,
but even as we speak we are working to do this in a larger-scale deployment,
remove Soft-RoCE,
and actually do it on an actual Ceph network.
In the picture here, I'm basically showing one part
that's highlighted, and that's just
to set up the next slide, where I'll be showing some preliminary data that we've got.
And so the measurements that I'll talk about are basically on the Ceph cluster network,
and what we're really measuring is just the received and transmitted bytes.
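For readers who want to reproduce that kind of measurement, one simple way (our own snippet, not the tooling used in the talk) is to sample the kernel's per-interface byte counters before and after a run; the interface name here is a placeholder for whichever NIC carries the Ceph cluster network.

    # Sample received/transmitted byte counters for a Linux network interface
    # by reading /proc/net/dev before and after a test run (illustrative helper).
    def iface_bytes(iface):
        with open("/proc/net/dev") as f:
            for line in f:
                if line.strip().startswith(iface + ":"):
                    fields = line.split(":", 1)[1].split()
                    return int(fields[0]), int(fields[8])   # rx_bytes, tx_bytes
        raise ValueError(f"interface {iface} not found")

    rx0, tx0 = iface_bytes("eth1")     # replace with the Ceph cluster-network interface
    # ... run the RADOS put workload here ...
    rx1, tx1 = iface_bytes("eth1")
    print("received:", rx1 - rx0, "transmitted:", tx1 - tx0)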
Okay. So in terms of preliminary results and what we did: basically, RADOS put was used.
And again, we're working towards expanding the set of tests and doing more. But just for this
graph and what I'm showing here: 10 iterations of RADOS put, with two different object sizes, 2 MB and 6 MB.
And, as in the picture I showed before,
we are measuring the received and transmitted bytes
on the Ceph network.
And as expected, oh, and by the way,
this is with two-way replication.
So essentially, you know, for stock Ceph
with NVMe over Fabrics, you end up transferring
40 MB per put. And instead, with the approach we are proposing, you basically get about a 48%
reduction. And similarly for the larger object size. And again, this might seem a little more
than what we had estimated,
but I just want to clarify that what we're showing here
is only the Ceph network traffic.
We are not doing anything to reduce any bandwidth consumption
on the fabric side.
That remains as is.
But just wanted to contrast what's different.
So that's what we're showing here.
And so the initial results look promising,
but we want to expand this and do it at a larger scale,
as well as measure the latency,
for which we'll need to move off of Soft-RoCE
and do it on actual hardware.
So yeah, that's essentially what we are going after here. So to summarize:
if you use stock Ceph and you try to disaggregate the storage, basically Ceph hasn't been designed
for that, right? And because of that, you introduce extra hops and extra relay steps, which we're referring to as the data center tax.
And we want to eliminate that with two main ideas: decouple the control and data flow, and do block management remotely.
And we want to preserve the Ceph SDS value. So all the things that Ceph is great at
with respect to reliability, durability, all the operations that it does, we'd like to retain,
and do it in a way where you reduce the TCO for Ceph while bringing the NVMe over Fabrics value to Ceph deployments.
So in summary, that's kind of what we are going after.
And in terms of next steps, I think the first thing is to take this idea to the larger community,
and more specifically the Ceph community.
We'd like to validate whether our ideas make sense, get more input,
and make the whole design stronger, if you will. The other thing is integrating the storage
target information with the CRUSH map. That would be ideal, as opposed to having two separate
places where state is maintained. I already mentioned we'd like to evaluate the performance at scale. The other interesting thing is that when you remove the bottom half of Ceph and have that be remote,
you can actually start looking at, or moving toward, a direction where you can work with stateless OSDs, if you will. So what we are talking about is:
if an OSD crashes,
you can just quickly fail over,
get a new instance of an OSD to come up,
and it can connect to the same remote storage target,
and you're back online.
So that's a nice direction to go after.
And the typical question that we get asked when I mention this is:
what happens if the remote storage target crashes, right?
There, nothing changes with respect to what Ceph does today, right?
A particular OSD crashes, you know, your peer OSDs figure that out and take the relevant steps to bring
the number of data copies back to the appropriate level, et cetera.
There, nothing will change. That's what we are referring to here as
mechanisms to survive OSD node failures. We think that's an
interesting future area of work, as well as the additional
offloads that are possible when you take this kind of approach.
So I think that's basically what we have here today.
Thank you.
Thank you all.
Thanks for listening.
If you have questions about the material presented in this podcast, be sure to join our developers mailing list by sending an email to developers-subscribe@snia.org.
Here you can ask questions and discuss this topic further with your peers in the Storage Developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.