Storage Developer Conference - #9: Using CDMI to Manage Swift, S3, and Ceph Object Repositories

Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast. You are listening to SDC Podcast Episode 9. Today we hear from David Slik, Technical Director with NetApp, as he presents Using CDMI to Manage Swift, S3, and Ceph Object Repositories from the 2015 Storage Developer Conference.

Starting point is 00:00:49 Hey, good morning. My name's David Slik. I'm from NetApp, and this morning I'll be presenting about using the CDMI standard for management of a variety of different storage systems, including S3, Swift, Ceph, etc. So, we're ready to get started. So I'm going to start with a brief overview of CDMI. How many here in the audience have heard of CDMI before? A good smattering. So CDMI is the Cloud Data Management Interface,

Starting point is 00:01:27 and I want to emphasize the management in management interface because one of the things that makes CDMI unique is it's focused on management tasks that you don't typically see in data-centric protocols. When you look at a lot of protocols out there, the primary goal and the primary design point is how to get data back and forth between various clients and various repositories. CDMI is a little different.

Starting point is 00:01:56 CDMI came out of SNIA, and as most of you are probably aware, one of SNIA's main areas of standard development is around storage management protocols. For example, SMI-S, which is widely used for storage management, is also another SNIA standard. So when we started looking at CDMI quite a while ago, we said, okay, well, you know, cloud storage is starting to happen. We're starting to see protocols. Protocols are focused on data access. There's going to need to be a way to manage this as well. And that's what I'm going to focus on in today's presentation, how some vendors are using CDMI to manage all sorts of different storage objects you may see up in the cloud. So to recap, the major APIs that we see adopted and used in the cloud are S3, CDMI, Azure, and Swift. I'm

Starting point is 00:02:57 sure everyone here doesn't have any quibbles about me saying S3 is by far the most widely used cloud data access protocol. But one thing that we found talking with customers is Azure is extremely widely used as well. We don't see that because most of the time when you think of applications talking to clouds, you're thinking of, well, I download this application or I'm using this web service, etc. That's only part of the market. In the enterprise storage and enterprise application segment of the market, applications developed inside enterprise organizations using Microsoft toolchains, .NET, etc., are extremely widespread. And the amount of data going in and out via the Azure API is quite significant.

Starting point is 00:03:53 I can't tell you which is more. If I had to make a guess, I'd say that S3 is slightly edging out Azure in terms of being the most popular cloud-based data access API. But this is always something that I emphasize during talks because I think people underestimate the degree to which Azure is widely deployed. And likewise, we have the Swift API, which is part of the Swift project inside OpenStack. Sorry, that's my VPN wanting to connect here.

Starting point is 00:04:29 Let me kill that. There we are. And the Swift API is in many ways very similar to S3, only it's part of the OpenStack project. And that's getting more and more popularity as OpenStack deployments have grown in number and scale. So with respect to CDMI, we've seen quite a lot of CDMI implementation as well. CDMI has a number

Starting point is 00:05:00 of advantages, namely that it provides functionality not found in these other three APIs, and as an open standard it has advantages in the realm of documentation, interoperability, and patent protection. So we've seen about 30 different CDMI implementations that we're aware of. There's a lot of CDMI implementations internal to organizations, government bodies, educational facilities that we aren't necessarily aware of. We'll see little bits of evidence of them here and there, people on newsgroups coming and asking us questions, etc. But we're quite happy with the degree to which adoption's happened. Standards-based protocols tend to have a slightly slower adoption curve

Starting point is 00:05:54 than vendor-driven protocols, primarily because, as a standards body, we don't have a vested interest in trying to get immediate adoption and lock-in through a protocol. But once again, we believe that over time we're going to continue to see CDMI adoption as well, for some of the reasons that we're going to talk about today. Okay, so just as a note, CDMI also has a plug-in for OpenStack that allows you to have OpenStack talk CDMI as well as Swift. And there's also a gateway that allows you to run an instance on EC2 and access your S3 objects via CDMI. So this is an example of multi-protocol support, and a lot of the commercial vendors will have products that support the CDMI API, the S3 API, the Swift API, and even in some cases

Starting point is 00:06:55 files and LUNs, bringing us to our topic today. So to give a quick timeline, we started working on CDMI back in 2009 when we determined that, you know, there's a lot of stuff going on in cloud storage. We think this is about right for standardization. Our philosophy for standardizing things is if you do it too early, you're inventing things. And the last thing you want is standards bodies inventing things. If you do it too late, the industry will have solidified around a series of practices and tangible ways of doing things and it's just too late to get adoption. The right time to do that is after people have figured out how to do things, but not so late that the way that it's done has crystallized. So, we worked from 2009 through 2011 getting CDMI to become an official technical architecture

Starting point is 00:07:52 which is SNIA-speak for a standard through that body. And then we worked on errata as we got more and more implementations on board. In 2011 we did 1.0.1 and then in 2012 we did 1.0.2. And this all led up to CDMI being submitted and going through the process to become an ISO/IEC standard, specifically 17826. And this has been a really good thing for CDMI because being an official ISO standard means CDMI has worldwide recognition that this is an endorsed way to allow systems to work interoperably on an international stage.

Starting point is 00:08:38 And we've seen a lot of adoption coming from the fact that it's an international standard. There's a lot of academic and government work, especially in Europe, where the mandates of the research projects are how do we connect these different clouds together, how do we provide interoperability, how do we provide a playing field where all sorts of different clouds can provide comparable services. And one of the things we did in CDMI, once again I don't know how familiar you are with the details of the spec, but one thing that's different about CDMI is you don't have to implement the whole thing. With a lot of standards, if you're going to have it work, you have to be bug-for-bug

Starting point is 00:09:19 compatible with the whole base feature set. CDMI has the concept of what are called capabilities. I'll show this a little bit later on. But in essence, what capabilities are is a way for me as the implementation to describe to you as the client what I can do. So if I need to, for example, use CDMI to manage LUNs in the cloud, all I have to do is implement the core sets of functionality for that. I may support the objects that correspond to LUNs, I may support the metadata that allows me to manage those LUNs, and I may support those basic operations. But I may not support some of

Starting point is 00:09:59 the more advanced features that have nothing to do with LUNs. For example, I may not be supporting any of the parts of CDMI that deal with managing NFS or CIFS. So that flexibility, being able to arbitrarily select which subsets of the standard to implement, has been something that has really spurred on adoption, and the capabilities are the means by which I as a client can figure out what a server does. Because, you know, with other protocols you can just assume it does everything, because that's the definition of the protocol; with CDMI it's not so. So moving on, the way that we handle enhancements in CDMI is through what we call the extensions process. In the CDMI working group we are very much use-driven, so we don't want to, once

Starting point is 00:10:44 again, we don't want to just be inventing things. If we're going to spend time standardizing things, we want these to correspond to real customer needs driving real vendor needs driving revenue. So vendors that are involved in the process (and if anyone wants to get involved in CDMI, we'd welcome you to join the technical working group) bring forward extensions, the concept that we want to extend CDMI to handle a given use case. We talk it through in committee. We publish it for public review. And when someone

Starting point is 00:11:15 shows up with an implementation, then we can go and start to look at adding it to the spec. And it doesn't go into the full spec until we have two verified interoperable implementations. And we do that as a series of plugfests. I think it's Plugfest 21 we're on right now, and that's happening this week here at SDC. There's the CIFS plugfest, so the SMB plugfest,

Starting point is 00:11:41 and then there's the CDMI Plugfest, the Cloud Plugfest, going on concurrently. So we have 18 extensions that were submitted, and some of those were folded into CDMI 1.1, which is primarily an errata release. And I'm proud to say that as of just a couple weeks ago, CDMI 1.1.1 has been submitted to ISO, and that will become 17826:2016, hopefully not 2017, as an update to the ISO standard. So there's a lot of history behind here. We're pretty proud of where we've gotten to and looking forward to sharing that. So ultimately, you know, why does

Starting point is 00:12:24 this matter? You know, a lot of people are doing great stuff with all these other object protocols. We're really happy about that. The more that object storage and cloud storage is being used, the more stuff that's going on means the more stuff that needs to be managed. And that's ultimately why, you know, this is all a good thing. Just like with NFS and CIFS, there doesn't need to be a protocol winner, because ultimately, regardless of what you're using, NFS, CIFS, iSCSI, LUNs, S3, Swift, Ceph, etc., all these different systems, this is just more and more data that needs to be managed, and managing

Starting point is 00:13:00 the data is a hard problem that we in industry have, to be honest, struggled with for decades. We can move the data around, we can keep it reliably, we can provide the performance, but dealing with the terabytes to petabytes to exabytes of data and connecting dozens to thousands to millions of different endpoints to this data, this is becoming the key challenge for enterprise storage, cloud scale storage, and an area where we think CDMI has some real value to offer. So why does it matter? Well, first of all, it's simple and easy to implement. Talked a little bit about how pretty much everything's optional. You can pick from the standard what's specific to your needs, your customer needs. But also it has a lot of functionality that either hasn't been in other APIs or is just

Starting point is 00:13:53 starting to be implemented in other APIs. For example, let's say you're providing LUN services for a very large compute system in the cloud. You may have tens of thousands of customers managing millions of LUNs. Well, how can you describe these sorts of things? Sure, you can carve it up into little mini namespaces, one for each customer, and that's manageable from the customer standpoint. But there's a lot of management operations you're going to want to do

Starting point is 00:14:23 on subsets of your entire pool of resources. One way that CDMI deals with this is by having rich metadata standards. So CDMI allows you to have hierarchical and structured metadata attached to every single object. And I'll show this in a little bit. And then, for example, you can do query. You can say, I want to operate on the set of objects that match these criteria, matched against the metadata using a query. So that's an example. Or, for example, notifications: being able to

Starting point is 00:14:57 have a standardized way to link your storage into your notifications, so when someone creates, for example, a LUN with a given set of characteristics, your back-end systems can get a notification, or the client's front-end systems can get a notification. These are examples of a number of tested and robust parts of the CDMI standard that we're only just now seeing elsewhere. For example, notification was added into the S3 ecosystem late last year, and it still isn't there for the other ones. So this is an example of where CDMI provides ways to do things that haven't yet been integrated into other APIs.
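
The metadata query idea described above can be illustrated with a minimal sketch. This is not the CDMI query syntax from the specification; it is a plain in-memory stand-in, and the object paths, metadata keys (`customer`, `tier`, `size_gb`), and the `query` helper are all hypothetical names chosen for illustration.

```python
# Hypothetical in-memory stand-in for managed objects with structured metadata.
# Paths and metadata keys are illustrative, not taken from the CDMI spec.
objects = [
    {"name": "/luns/cust-a/lun-001", "metadata": {"customer": "cust-a", "tier": "flash", "size_gb": 500}},
    {"name": "/luns/cust-a/lun-002", "metadata": {"customer": "cust-a", "tier": "archive", "size_gb": 2000}},
    {"name": "/luns/cust-b/lun-001", "metadata": {"customer": "cust-b", "tier": "flash", "size_gb": 100}},
]


def query(candidates, **criteria):
    """Return the objects whose metadata equals every given key/value pair."""
    return [
        obj
        for obj in candidates
        if all(obj["metadata"].get(key) == value for key, value in criteria.items())
    ]


# Select a subset of the whole pool, across customers, by metadata alone.
flash_luns = query(objects, tier="flash")
print([obj["name"] for obj in flash_luns])
```

The point of the sketch is only that the selection crosses the per-customer namespaces: the management operation is addressed to "everything matching these metadata criteria" rather than to one customer's container.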

Starting point is 00:15:41 So there's a lot of maturity there. As an open industry standard, it's not controlled by any one vendor. Everyone's welcome to join. And there's protection against patents. The companies involved in the creation of CDMI are pretty much a who's who of the storage industry. There's well over 100 participants. And part of the SNIA process is everyone has to sign an intellectual property agreement which says that they're not going to sue each other over patents, and you're not going to have a situation where if you come out with a product that uses CDMI, five years down the road someone says, oh, you're using something we have patented, you owe

Starting point is 00:16:19 us a huge amount of money. So that's a real degree of comfort, not just for commercial vendors, but also for the open source folks, because they don't want to get involved with patents. A lot of the time they're just kind of, "there's no patents, I'm going to ignore this," but with CDMI that's really not an issue. And then finally, you know, we have a well-defined standard document, there's testing infrastructure. I encourage folks, if they're interested: we have a test conformance system that we're going to be using at the Plugfest this week. Come on, drop by the room and

Starting point is 00:16:55 we can show you what it looks like. And once again, adoption, as I mentioned. So what does CDMI standardize? Well, CDMI does have the standard CRUD operations, but not necessarily for the reason you think. Because typically in a cloud storage standard, you know, create, read, update, delete is the core of everything. Because that's the whole point. You want to store your data, you want to retrieve your data, you have to delete your data, etc. In CDMI, this is almost incidental. You have to have this to manage your namespace. And that's what I'm going to be spending most of the time today talking about, is the concept of namespace and managing namespaces. So while CDMI does have this, and while you can use CDMI as a data protocol, that's almost incidental because we're ultimately

Starting point is 00:17:46 interested in these management functionalities. So CDMI, when it talks about managing things, the object model has four primary entities, and we'll look in a few slides on how these overlay onto what's being managed. So there's data objects of whatever type. There's container objects of whatever type. There's queue objects of whatever type. And then there's something called a domain object. So to talk through this really quickly, we're all familiar with things like a file, an ISO image, etc. Things that you want to deal with as a concrete entity, those map to data objects. Then there's things like containers.

Starting point is 00:18:33 All my LUNs on a given zone. A directory, a bucket, etc. Those map to containers. Now one of the interesting things about CDMI is CDMI does not force you into the model that it's either a container or a data object. CDMI is based on REST and one of the things about the REST architecture and model is we're dealing with representations. So everything in CDMI is designed around representations that sit beside your existing models. So I'll show this in the demo, but you know, I have an S3 storage system or a

Starting point is 00:19:15 Ceph storage system with an object. I do an HTTP GET that's going through Swift or S3 or Ceph. But if I tell the system that I want it as a CDMI representation, that then tells the system I'm no longer just doing my data path. I now want to do management functionality, and the CDMI standard starts to get invoked. And I'll demonstrate that for you in a few minutes. But the reason why I mention this, because CDMI is representation-based, you can actually have a data object representation and a container representation

Starting point is 00:19:55 for the same thing. And this happens actually quite a lot in our industry. For example, what is a VMDK? It's a data object, a file. But is it really? Well, actually, it's a container with a bunch of stuff inside it. And sometimes when you're doing management, you want to deal with your LUN as a single thing. But sometimes when you're doing management, you want to crack open that LUN and go and do operations on what's inside it. If you're doing virus scanning, for example, if you're doing analytics, you want to go through and you want to, if you do your querying into what's inside your LUNs, etc.
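
As a rough illustration of the representation switch described above, the following sketch builds (but does not send) three HTTP GET requests for the same stored item: one plain data-path request, one asking for a CDMI data object representation, and one asking for a CDMI container representation. The `application/cdmi-object` and `application/cdmi-container` media types and the `X-CDMI-Specification-Version` header come from the CDMI specification; the host name, the path, the version value, and the helper function names are assumptions made up for the example.

```python
from urllib.request import Request

BASE_URL = "https://storage.example.com"  # hypothetical endpoint, not a real service


def data_path_request(path):
    """Plain GET: the ordinary data path (S3, Swift, and so on)."""
    return Request(BASE_URL + path, method="GET")


def cdmi_request(path, kind="object", spec_version="1.1"):
    """GET asking the server for a CDMI representation of the same resource.

    kind is "object" or "container"; spec_version is an assumed value.
    """
    return Request(
        BASE_URL + path,
        method="GET",
        headers={
            "Accept": f"application/cdmi-{kind}",
            "X-CDMI-Specification-Version": spec_version,
        },
    )


plain = data_path_request("/bucket/disk.vmdk")
as_object = cdmi_request("/bucket/disk.vmdk", kind="object")
as_container = cdmi_request("/bucket/disk.vmdk/", kind="container")

for request in (plain, as_object, as_container):
    print(request.get_method(), request.full_url, dict(request.header_items()))
```

Whether a given server exposes a container view of something like a VMDK depends on what that implementation chooses to support; the sketch only shows that the choice of representation travels in the request headers rather than in a separate management endpoint.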

Starting point is 00:20:37 And this pattern actually we see a lot of times. When you're doing backup, you often don't want to work on the granularity of individual files. You want to be able to manage a whole collection of files for backup, for SLOs, etc. So this duality, being able to say it's a data object and or it's a container, is something we use quite a bit through the standard and is one of the neat aspects of CDMI. So just to quickly talk about domains, the other thing that CDMI has as part of the standard is the concept that in a cloud you're not necessarily going to have everything managed by the same administrative entity. Because storage systems from a management standpoint have

Starting point is 00:21:26 typically come from an environment where there's only one managing entity, this is typically a big limitation. If I have a cloud and I have data, say, from the IEEE and data from ACM and data from other organizations, ultimately these all want to sit in one namespace. You don't want to have to segregate the namespace. Here's the IEEE papers, here's the ACM papers, etc. You just want to have your papers. And if it's an ACM paper sitting right beside an IEEE paper, those should be managed by who owns them, who has the administrative rights for them. So CDMI has the concept of multiple administrative domains. For every object, there's a link to a corresponding domain. So if you have an ACM domain and an IEEE domain, when I come in to do an operation against an object,

Starting point is 00:22:18 that object belongs to a domain, and that domain controls how my credentials are resolved and therefore what I can do administratively against that object. So if I try to delete an ACM paper and it says, you know, "dslik credentials, I don't know who you are," it's going to deny all those operations. But if I go to delete the IEEE paper, you know, it's my paper, credentials are matched, yep, permissions match up, that operation is approved. Two objects sitting in the same directory, the same part of the namespace, totally different administrative mappings. And this is something really important

Starting point is 00:22:57 for management as we move towards, you know, cloud scale, global namespaces, because you're not going to have the assumption that administrative control is segregated on a namespace by namespace basis. Bucket granularity management, that's just a stepping stone towards where customers want to go with the true data mobility, especially between organizations,

Starting point is 00:23:22 especially in the cloud, where you have anywhere from thousands to millions of customers. Okay, so I spent a little bit of time about that, touched on identity and access control through talking about domains. Metadata, we're going to look at that in more detail. Query and notifications already touched on. If you're interested in these in more detail, please take a look at the spec. It goes into great detail on how these are used and implemented. Versioning is another interesting

Starting point is 00:23:51 one, and serialization, deserialization. So let's get into the meat. What do we mean by converged data management? Well, we have all these different things, all these different types of data that we're generating and consuming in the cloud, but also in the enterprise. And being able to move things back and forth, hybrid cloud, et cetera, is another area that's really interesting. But that would be a whole separate talk.

Starting point is 00:24:21 So what do we see? We have things like objects and files and LUNs. We also have things like NoSQL databases. And if I have a hundred NoSQL databases up in my cloud, I need a way to manage that too. I need a way to list them. I need a way to get information about them. How much space are they consuming? How much is it costing me? I need a way to specify things like SLOs. This database I may want to have on flash; I need really low latency. This database I may just be archiving because it's a snapshot of another database. This database is a set of data I do batch analytics on once a month. Most of the time I don't want to be

Starting point is 00:25:07 paying for really high performance storage, but you know when that month end comes up, I want the cloud system to move it into flash and then run all my analytics on it, and when I'm done, oh, I want to move it back onto something cheaper. So one of the neat things about CDMI is it's equally applicable for managing all these new and emerging data types. You can have one namespace that covers all these different data types for management purposes. And this makes everything a lot easier because when you have one namespace that you can use in a common way to specify and discover, all of a sudden now you have to write a lot less code in the cloud, and your client has to have a lot less code to deal with a way for managing LUNs,

Starting point is 00:25:53 and a way for managing files, and a way for managing databases, etc. And we're only starting to see this with RESTful interfaces; there's convergence in the stacks out there. We're seeing less variability. It's not like in the old days where you had to have, say, an Oracle-specific API for managing one type of database and another API for managing other types of databases, and proprietary means for each of your file types, etc. But, you know, this is a really nice thing about CDMI. So this presentation is talking about how this works. Another thing that this is valuable for is data mobility.

Starting point is 00:26:34 One of the things that CDMI provides, because it's designed as a superset from a namespace, each of these namespaces, the way I store files in a file system, the way I store objects in S3, for example, or Swift, the way LUNs are named and managed, these all have different restrictions. And these are often subtle, little, annoying restrictions, what characters are allowed, lengths of strings in various places.

Starting point is 00:27:02 These become really nasty interoperability challenges. So what CDMI has done is we tried to make sure that it's a superset namespace. And by being a superset namespace, you can express objects from these various systems in CDMI. You can always go from X to CDMI. You may have challenges going from CDMI to Y, but it's a lot easier to write an adapter from CDMI to Y than it is to write an adapter from every system to every other system. So one of the areas we've seen a lot of interesting work, especially out of Europe, is using CDMI as an interchange format, a vendor-neutral way to get data from system A to system B and to archive data when you're moving between clouds.
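
The adapter argument above is a counting argument, and a small sketch makes the difference concrete. It assumes one adapter per direction: with direct pairwise conversion, every ordered pair of systems needs its own adapter, while with a common interchange format each system needs only one adapter to the format and one from it. The function names are hypothetical.

```python
def pairwise_adapters(n_systems):
    """Direct conversion: one adapter for each ordered pair of distinct systems."""
    return n_systems * (n_systems - 1)


def hub_adapters(n_systems):
    """Common interchange format: one adapter in and one adapter out per system."""
    return 2 * n_systems


for n in (3, 5, 10):
    print(f"{n} systems: {pairwise_adapters(n)} pairwise adapters, {hub_adapters(n)} via a hub format")
```

With three systems the two approaches cost the same, and beyond that the pairwise count grows quadratically while the hub count grows linearly.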

Starting point is 00:27:53 Because what's a really big thing in Europe is about being able to have the right to move your data, the right to repatriate your data, rather than it getting stuck in a cloud or stuck in a storage system. OK, so let's go through some definitions. So we talked a little bit about what objects are at a generic level. It's something you want to manage. Examples, files, directories, LUNs, you know, you can get into systems. But one thing that's important is CDMI is a cloud standard. And by a cloud standard, I mean the internals are abstracted. I shouldn't have to know how it's implemented.

Starting point is 00:28:31 In fact, if I can tell how it's implemented, that's bad. Because every piece of information that I know about the internal implementation of the storage system is a way that I can potentially constrain the freedom of the cloud provider to do stuff behind the scenes. The whole point is I shouldn't have to know. I should be able to tell the cloud provider what I want, what I expect without caring about the implementation. If the cloud provider chooses to use erasure coding or replicas or RAID 6 or some other new technology, I shouldn't care. I should care about the reliability of the data and the performance

Starting point is 00:29:11 of the data because ultimately that's what I want to pay for. It's not my job how they do it. I should define my service levels and pay and have it provided. So CDMI takes a very different approach for management than what we see, for example, in SMI-S and Redfish and a lot of the other management protocols out there. We're not dealing with systems and disks and RAID networks. We're dealing with namespaces and objects and SLOs. So what is a namespace? I've used the term a whole bunch of times. It's one of those words that's a little overloaded. So ultimately your namespace is the complete set, and the organization within, of your objects. A namespace has to be addressable. You have to be able to describe how to get to an object within a namespace. That's an important part of the

Starting point is 00:30:12 definition. Namespaces can have various different internal structures. You can have flat namespaces, like for example within a bucket. It's flat. You have a key value. You can have a hierarchical namespace. Going to S3 and Swift as examples, people create hierarchical namespaces inside that flat S3 namespace by giving special meaning to a delimiter, typically slash, inside the name of the object to create, in essence, a synthetic hierarchical namespace within a flat namespace.
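
The synthetic hierarchy described above can be sketched in a few lines. The example below takes a flat list of keys, as in a bucket, and derives one "directory level" by giving special meaning to a delimiter. It mirrors the prefix-and-delimiter style of listing in spirit only; the `list_level` helper and the sample keys are hypothetical and not any particular service's API.

```python
def list_level(keys, prefix="", delimiter="/"):
    """Return (objects, sub_prefixes) found directly under prefix in a flat key list."""
    objects = []
    sub_prefixes = set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        remainder = key[len(prefix):]
        head, separator, _ = remainder.partition(delimiter)
        if separator:
            # The key continues past a delimiter, so it lives in a synthetic "directory".
            sub_prefixes.add(prefix + head + delimiter)
        else:
            objects.append(key)
    return objects, sorted(sub_prefixes)


# A flat namespace: every entry is just a key, with no real directories.
keys = ["readme.txt", "papers/ieee/a.pdf", "papers/acm/b.pdf", "papers/index.html"]

print(list_level(keys))
print(list_level(keys, prefix="papers/"))
```

Nothing in the store itself knows about `papers/`; the hierarchy exists only because the client and the listing logic agree on what the delimiter means.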

Starting point is 00:30:48 File systems have never really had that problem because they've been built on the concept of hierarchies. Well, when I first got started in computing, I actually had an operating system with disks with no hierarchies, no directories, but it's been around for at least as long as I've been in the industry, and probably longer. And we're starting to see a lot of other relationships. Graph relationships are more and more popular. We started to see this with symlinks, but it's becoming a lot richer. When you look at, for example, what Facebook's doing, everything they're managing is all about graphs.

Starting point is 00:31:29 And my bet is, as we see things continue to develop over the next 10 years, we're going to see graph-based namespaces become more and more important, especially from management standpoints, because a lot of the time when you're doing a management function you want to look at relationships. What are all the associated objects? Projects, owned by, etc. If I'm a

Starting point is 00:31:58 client and I discontinue my account, you may want to have it such that everything owned by dslik goes on to a really cheap archival tier for a few months before it gets deleted, as an example. So namespaces, once again, because everything in a namespace has to be addressable, this is where all the restrictions come in. What's an allowable name? What's your delimiter for hierarchies? What's your symbol representing a graph traversal, etc.? And CDMI uses Unicode. Unicode's not a panacea. There's actually some really quirky and difficult parts of Unicode, but compared to what we had before, it's really nice. You can represent

Starting point is 00:32:46 pretty much every set of characters in there. Binary data is a problem, but this has been great for international tech support and being able to handle all the different character sets. And the more restrictive a namespace is, the less it's able to accommodate different types of objects. So this is where, for example, CDMI serialization comes in. The concept of CDMI serialization is I may have these really complex sets of hierarchical or graph-related data. I may have a NoSQL database and a hierarchy of objects and some LUNs and metadata, etc. Well, when you serialize it, you turn it into a bog-flat bitstream that can go on pretty much anything, any storage device that we'll ever encounter, because it goes down to the lowest common denominator. And because,

Starting point is 00:33:40 once again, in CDMI, you have those object and container representations, if you have a system that manages even something with a really restrictive namespace, if you step into that serialized container, now all of a sudden you can have all the richness of CDMI. Okay, so we talked about namespaces. Well, what do we mean by management? The informal definition: whatever doesn't fit in the data path. All the leftovers, you know. Well, we'll think about that later.

Starting point is 00:34:12 But this is actually something interesting. Management isn't in the SNIA dictionary. You'd think, given what we do, it would be in the dictionary. But it's not, because it's so common we don't think about it. Ultimately, my definition is that management is the operations that are applied against objects and sets of objects. It's not the operations you do directly to the object. It's ones that are applied against. I want to snapshot this object. I want to migrate these objects. I want to change the permissions of these objects. These aren't things that change

Starting point is 00:34:55 the essence of the object. They change all of the associated parameters and ways that these objects are used in the system. So when you put it all together, CDMI's key value is it provides a superset namespace to allow management operations to be performed against cloud resident objects, and this can sit on pretty much any type of a system. So let's get down to some meat. CDMI namespaces.

Starting point is 00:35:24 So CDMI supports your arbitrary tree-based... Oh, that's interesting. Did I step on something? There we are. Good, good. Yes. So arbitrary tree-based hierarchies. This is a superset of what you see in Swift, S3, Azure, and other cloud namespaces. It's also a superset of your standard file system namespaces, which means if you have a filer with NFS and CIFS shares, CDMI can layer on top of it. And if you have S3 and Swift and Ceph and Sheepdog and all these other platforms out there that are providing multi-protocol support,

Starting point is 00:36:13 it can layer on top. And you can even do this for your LUNs and other managed objects. CDMI also has the concept of references, which allow your arbitrary symlink-like cross-tree linkages, which we see more commonly. And there's been some work tossed around inside the standards body for having a full extension for graph relationships, i.e. allowing any arbitrary object to have structured metadata to say "I am a relationship to X object." And that's really important to be generic, because there's as many relationships as there are business problems out there. I am a parent of, I am a child of, is what we're used to,

Starting point is 00:36:58 but I am a derivative of, I am a friend of, I am a, etc. Being able to have these arbitrary relationships moves us a lot closer to this kind of graph world where, at least we think, things are going. So this allows CDMI to encapsulate all these different data types into a single namespace for management purposes. So what does this look like in principle? Well, up in the corner here we have the CDMI object model. And the way it works is there's a root. And this is something that's typical in a namespace. You need some place to enter.

Starting point is 00:37:38 And the root has a couple special properties. The root is where you can start your discovery. What can a CDMI system do? One principle that's important, not just for namespaces: you want to be able to discover and browse and traverse your namespace, but you also want to be able to discover and traverse and browse your API. What can I do? What does this system let me do? So, for example, slash cdmi slash v1 slash can be an example of a CDMI root. But just slash could be a CDMI root, because that's what we're used to in files. Or, you know, slash lun slash might be my root. CDMI is very agnostic about this because we want

Starting point is 00:38:22 people to be able to layer this onto the namespaces they already have for their data path. We don't want to have to force them to create a whole new system. It's representational, right? I do HTTP GET. I say accept CDMI. The server now knows I'm talking about a CDMI representation for an existing namespace, and we just want it to work. So under the root you have containers, and, you know, like a file system, your containers can be nested. You can have containers within containers within containers, and at any of those levels, including at the root, you can have data objects or queue objects, etc. And connected by a dotted line to each of these are capabilities. This is the self-describing mechanism. Is this object read-only? I can go and discover that. I don't have to actually

Starting point is 00:39:13 try to delete it and have it come back with an error. I can ask, on a per object or per container or per system basis, what actions, what can I do from a management standpoint for this object? Does it support the super-low-latency SLO, for example? So if we look at, for example, how this overlays, with respect to the file system world, this is quite straightforward. My root container is my root directory.
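The capability discovery described here can be sketched in a few lines of Python. This is a minimal illustration, not a client for any particular server: the capabilities URI, the sample response, and the assumed spec version are made up for the example, and no network call is made. The capability names follow the pattern used in the CDMI specification, but check them against the version of the spec you target.

```python
import json

CDMI_VERSION = "1.1.1"  # assumption: the spec version the client speaks


def capability_request(capabilities_uri):
    """Describe the GET that would fetch a capabilities object (no network I/O)."""
    return {
        "method": "GET",
        "path": capabilities_uri,
        "headers": {
            "Accept": "application/cdmi-capability",
            "X-CDMI-Specification-Version": CDMI_VERSION,
        },
    }


def allows(capabilities_json, name):
    """Return True if the capabilities document marks `name` as "true"."""
    caps = json.loads(capabilities_json).get("capabilities", {})
    return caps.get(name) == "true"


# Hypothetical response for a read-only data object.
SAMPLE_RESPONSE = json.dumps({
    "objectType": "application/cdmi-capability",
    "capabilities": {
        "cdmi_read_value": "true",
        "cdmi_delete_dataobject": "false",
    },
})

# Hypothetical capabilities URI, used only for illustration.
request = capability_request("/cdmi_capabilities/dataobject/")
can_read = allows(SAMPLE_RESPONSE, "cdmi_read_value")
can_delete = allows(SAMPLE_RESPONSE, "cdmi_delete_dataobject")
```

With a document like this in hand, a client can decide not to offer a delete action at all, rather than issuing the delete and handling the error.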

Starting point is 00:39:43 My container is my directory. My data object is my file. Boom, one-to-one, clean. When we look at, for example, Swift, my CDMI root is my Swift root. I have a container representing a Swift account. This allows me to do administrative operations on accounts. If I want to be able to get statistics or migrate a whole account or do those sorts of operations, there's a manageable object representing that layer.
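The overlay idea can be illustrated with a small Python sketch. It assumes the usual Swift path layout of account, then container (the "bucket" in this talk), then object, and it treats everything below the bucket as a flat object name. The function names and the returned labels are invented for this example and are not part of CDMI or Swift.

```python
def filesystem_to_cdmi(path, is_directory):
    """Classify a file system entry as the CDMI object that would represent it."""
    if path == "/":
        return "root container"
    return "container" if is_directory else "data object"


def swift_to_cdmi(path):
    """Classify a Swift-style path (/account/bucket/object) in the flat view."""
    parts = [p for p in path.split("/") if p]
    if not parts:
        return "root container"
    if len(parts) == 1:
        return "container (account)"
    if len(parts) == 2:
        return "container (bucket)"
    # Anything deeper is part of the object name in the flat view.
    return "data object"


# Hypothetical paths, used only for illustration.
examples = {
    "/": swift_to_cdmi("/"),
    "/acct1": swift_to_cdmi("/acct1"),
    "/acct1/photos": swift_to_cdmi("/acct1/photos"),
    "/acct1/photos/2015/cat.jpg": swift_to_cdmi("/acct1/photos/2015/cat.jpg"),
}
```

Because the account shows up as a container in its own right, there is a concrete object to attach account-level operations to, such as statistics or migration.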

Starting point is 00:40:14 Your bucket, the next layer down, once again, is just another CDMI container. So you can manage these things, name them, move them around, serialize them, set SLOs, etc. on them, and then finally you've got data objects in there. And that's if you view it just as the flat namespace. If you go to the next step, you can say, okay, well I'm gonna expand this another level and say people create synthetic hierarchies inside their bucket. Let's layer that out with more containers. Similar with S3, S3 is a little different than Swift. S3 has a root.

Starting point is 00:40:47 It has buckets and objects. And instead of the account being an explicit container, in S3, it's a dotted line to an account. Now, once again, CDMI has the concept of domains. Every object has a dotted line to the domain. So in the S3 model, what you do is the dotted line goes to your domain, where there's one per customer or one per bucket, and you've got that correspondence. So the overlays work quite well. So let's look at this. Let's put this all together. I have a multi-protocol storage system in the cloud that's providing me file access. It's providing LUN access.

Starting point is 00:41:33 It's providing S3 and Swift and all these different protocols. Well, here's an example. I have my root. I have a bunch of files. And I can export all these as an NFS or as a CIFS file system. So my compute, my cloud compute, can then mount those and see all those resources. But not just that, I can go to specific objects in that file system and say, this VMDK, I want it as iSCSI. This ISO, I want it as iSCSI.

Starting point is 00:42:07 And then the cloud compute can go and mount that too. So you can start to have really interesting workflows where someone just drags and drops a VMDK in, and then that triggers a notification, and then that is then exported via iSCSI and bound to a new VM that's instantiated, and boom, you just drag the file in, and that app's running. Question?

Starting point is 00:42:27 Two questions, short ones. In the previous slide, you showed the CDMI object model as consisting of containers and data objects, but you mentioned that CDMI does not distinguish between container and object. I mean, everything's an object. Yeah, that's...

Starting point is 00:42:45 Why is that separation? Oh, I'm just showing it for how the overlay is. If you view something as a container representation, you can look at its children. But if you view something as a data object representation, you're just looking at its value. The other one is the multiple boxes that are shown. Are they showing multiple instances of that container?

Starting point is 00:43:13 Is that the semantic there? You can have that. I'm trying to be a little more generic than I probably should be for a concrete example. But yeah, you can have as many containers as you want as long as they have unique names. Question? So I wanna be a believer,

Starting point is 00:43:32 I've been following it for about four years. Yep. No customers I ever talk to ever mention CDMI. Yep. Right, so there's a question here, and sort of, I've always thought about it as more of a front control plane interface. And I've never found a killer app. But the way it's presented, I'm wondering if it actually is more useful as an orchestration layer

Starting point is 00:43:55 to sort of manage various storage offerings within sort of a hybrid cloud. Like it's sort of varying in. It's not for the customer front side. Am I thinking wrong about it? No, you're not thinking wrong about it at all. And it really depends on which target market. Selling into enterprises, that orchestration layer is their front end.

Starting point is 00:44:15 So what is the killer aspect of it? What is the thing that would drive CDMI adoption? You said there's 30 plus vendors that have implemented it, but I don't have any customer demand for it. So what is driving the... Why would I care, I guess, is my question. Why would I do it? Primary reason is, do you want to invent your own? If you want to invent...

Starting point is 00:44:37 But if I'm deploying OpenStack, I get the hybrid model. I get the sort of management of multiple sources. I don't know. Maybe it's a question we can take offline. Like I say, I want to be a believer. I've been a follower for over four years. It has not percolated up to the level of a lot of sort of emotional. And it's a good question.

Starting point is 00:45:03 Unlike S3, where there's S3 apps everywhere and there's all this buzz about it, you don't hear much about CDMI. And most of the implementations that we've been involved in are inside a corporation. So I went to Lee to answer a question. There are some of our customers who work for IBM. Some of our customers, they have needed, they need to move between clouds and schemes.

Starting point is 00:45:34 That's what these customers need. Precisely those are the layers. So are they building their own? There's a lot of open source digital products. Yeah. We can pick our plan. Okay, yeah, let's pick our plan.

Starting point is 00:45:51 But, you know, once again, the question about customer demand is a big one. The customers, they give an example with Aki. Aki built their orchestration layer around linking iSCSI LUNs in the cloud to VMs in the cloud,

Starting point is 00:46:07 all based around CDMI's orchestration layer. So the point about the orchestration layer is a really big deal. That's who consumes a lot of this management functionality. So quickly go through here. At the same time, you can have your hierarchies, your buckets. Not only is this available via Swift and S3, you can have those two layers for account management work at the same time. This can also, you can just go in via NFS and access the same files. A number of companies have done quite well with this sort of triple protocol access scale. These got a lot of traction with their multi-protocol access, speaking of a specific implementation.

Starting point is 00:46:48 So let's quickly go and look at what this looks like. So let me jump out here, and what I'm going to do is I'm going to go into this here. So what this is is this is a little AJAX JavaScript-based CDMI client. This is running in the browser, it goes out, and it's actually doing CDMI operations against a cloud running on my laptop. And this is another thing a lot of people really like about HTTP-based protocols, is you can talk directly and do management operations right in the HTTP. You don't need a layer between to do these sorts of operations. So I have my namespace. This index.html is actually this program itself.

Starting point is 00:47:33 So inside, I have a namespace that's a bunch of files. And here, for example, I have these. And I click on this, and these are actually my slides that I'm presenting. And this is part of that namespace. And I have some images, and I can just go through. This is all in CDMI Server. And if I go back up here, I also have, from my filer,

Starting point is 00:48:01 I have some buckets that are via an S3 gateway. I have some volumes that are file system, and I have some LUNs. So if I go into my LUNs, I have my filer, I have my volumes, I have inside my volumes here, for example, and there's my LUN. So if I go and I take this, so I'm gonna take this URL and copy it. And then I'm going to go over to a tool here that lets me actually run an operation against this. Okay, what I've done here is I've gone and I've said, this is my resource, and CDMI specification, so I want to talk CDMI, and if I run that, it's going to go through, and boom, here's the

Starting point is 00:48:43 CDMI JSON as per the CDMI spec. So if I went to this URL, if it's a file, the HTTP, I'm going to get back via those data protocols. Here I'm seeing the management side. And in here I have my exports. This is exported via iSCSI. I've got two portals corresponding to my IP address access. I've got some identifiers. I've got LUN information. I've got permissions. All this was able to be specified as part of the structured metadata inside the CDMI object. So there's an example of that ability to easily do the

Starting point is 00:49:23 orchestration as you mentioned there. So what are your typical LUN operations? You want to create LUNs, you want to discover them. Well, discovering them is just listing the CDMI objects in a container. To create them, create a new CDMI object and specify metadata. Do I want it thin provisioned, how big do I want it provisioned, etc. Specifying permissions, SLOs, exports, this links into all of those parameters I talked about earlier. If I want this on low latency, high throughput, different levels of data protection, there are standardized mechanisms which describe all of this. And then, of course, the standard way

Starting point is 00:50:05 to take a snapshot. We're not talking about how to implement the snapshot. This is just a signaling layer. Same thing with serialization, et cetera. Now, given that we're running close to when we should do questions, I already showed you the LUNs here. I did all my three demos in one quick demo.
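The two LUN operations just described, discovery by listing a container and creation by writing a new object with metadata, can be sketched as follows. This is an illustration under stated assumptions rather than a working client: the `/luns/` container, the sample listing, and the `example_*` metadata names are hypothetical, and the requests are only built as dictionaries, never sent. Which metadata items a real system honors for sizing or thin provisioning depends on that system.

```python
import json

CDMI_VERSION = "1.1.1"  # assumption: the spec version the client speaks


def list_luns_request(container_uri):
    """Describe the GET that lists the children of a container."""
    return {
        "method": "GET",
        "path": container_uri,
        "headers": {
            "Accept": "application/cdmi-container",
            "X-CDMI-Specification-Version": CDMI_VERSION,
        },
    }


def lun_names(container_json):
    """Return child names that are not sub-containers (those end with '/')."""
    children = json.loads(container_json).get("children", [])
    return [name for name in children if not name.endswith("/")]


def create_lun_request(container_uri, name, size_bytes, thin):
    """Describe the PUT that creates a new object carrying LUN metadata."""
    body = {
        "metadata": {
            # Hypothetical metadata names, chosen only for this example.
            "example_size_bytes": str(size_bytes),
            "example_thin_provisioned": "true" if thin else "false",
        }
    }
    return {
        "method": "PUT",
        "path": container_uri + name,
        "headers": {
            "Content-Type": "application/cdmi-object",
            "Accept": "application/cdmi-object",
            "X-CDMI-Specification-Version": CDMI_VERSION,
        },
        "body": json.dumps(body),
    }


# Hypothetical container listing as a server might return it.
SAMPLE_LISTING = json.dumps({
    "objectType": "application/cdmi-container",
    "children": ["lun0", "lun1", "snapshots/"],
})

discovered = lun_names(SAMPLE_LISTING)
create = create_lun_request("/luns/", "lun2", 10 * 2**30, thin=True)
```

Neither request says anything about how the storage system carves out the LUN. As in the talk, this layer only signals intent.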

Starting point is 00:50:29 File management, I browse through, traverse through all the files. They're all there. I can go and look at each file. I can request it. Let me just jump back here. If I go up to my root here, and I go into my files, and I select a given file. Here's my file and I just pull it up on the web browser. Standard HTTP, everything's good there.

Starting point is 00:50:58 If I go in and I pull this up, just a standard HTTP, boom. There's my HTTP request, my content type image, JPEG, and a bunch of binary data. But when I go and I say I want to do CDMI 1.1.1, and accept cdmi-object. And boom. I'm seeing now my management representation. I flipped it over. Instead of the data, I have this JSON with this metadata, not just things like when it was

Starting point is 00:51:46 created, its count, how many times it's been modified, who owns it, ACLs, but also things like hash, for example, and arbitrary user metadata that can be attached to these objects, and these are for management purposes. I can go and attach a piece of metadata that says, I want this to be well-protected, or I don't want this to be protected at all. Okay, so going back, that's how you can do those. So I guess to summarize, you can view and manage different types of objects

Starting point is 00:52:22 in a unified namespace that covers everything from files to LUNs to objects to NoSQL databases to graphs, et cetera. This is a common management API and approaches for management. And it's a way to bundle all these things together. Because you have one namespace, you can talk about all these things in a common way instead of having to have separate APIs for each

Starting point is 00:52:43 of these. And once again, CDMI is very extensible. So if you guys are interested in doing these sorts of management and having interoperable management, definitely encourage you to take a look at CDMI. We meet every week. We're always open to talking about ways CDMI can be used, and we'd love to see more people adopting it. So thank you very much. If there's any questions I'd be happy to take them.

Starting point is 00:53:16 And also giving one more talk on Wednesday about some of the security work we're doing related to CDMI with respect to encrypted objects and mobility around encrypted objects. Question. It's about the scalability of CDMI. Yep. Do you have some kind of limits? We think it's like, okay, we know this works like for a billion objects.

Starting point is 00:53:43 Or that we can manage what is growing, like having this in a cluster of, I don't know, 100 nodes. Yeah, so a lot of that's an implementation question, but there actually are scalability issues around namespaces. If you have, say, a flat namespace with a billion objects in it, I can tell you that pretty much every system today will fall over. You need to then look at different ways to manage it. And that's where querying and metadata

Starting point is 00:54:10 starts to become important, because just a flat listing at that scale breaks down. Is there some performance penalty on having the NFS and CIFS being managed by CDMI? Thanks.

Starting point is 00:54:25 Not really. That would be an implementation side. CDMI doesn't impose that. So you showed us a slide with the metadata. Yep. CDMI metadata. Is there any sense of how much overhead over another object? Let's say S3.

Starting point is 00:54:44 So CDMI over S3, that metadata over here, how much is it? So that has three different answers. The first answer is there's no overhead because you have to store it somewhere. And how it's stored in the storage system is up to the storage system. That's not something that CDMI imposes. CDMI is just on the wire. The second answer is how much overhead is there for it to go over the wire? We're standard JSON. It's a little verbose. Luckily, it works with all the compression that comes along with

Starting point is 00:55:16 HTTP, etc. The third answer is, as a storage system, if you're going to even implement this metadata, how much does that cost you? And ultimately, the answer to that question is how much value does it bring you? Question in the back. You mentioned credentials once. One of the things I wanted to come back to a little bit is that CDMI standardizes the flow of credential information, but CDMI doesn't standardize how that's done. So let's say you include a Kerberos ticket or a key, an access key in your HTTP request.

Starting point is 00:56:02 CDMI defines how that gets routed, but it doesn't define the actual interpretation of that. So I do a delete. I include my credentials, however those are represented in the request. CDMI says you're doing it against this object. This object belongs to this domain. This domain's associated with this system,

Starting point is 00:56:23 whether that's LDAP or Active Directory or something else, it routes my credentials through to that system which then resolves my credentials and comes back with an ACL name. That's outside of the CDMI spec. CDMI is just doing the routing. Once the ACL name comes back, then CDMI, because CDMI represents things as ACLs over the wire, the system can map the ACL name against the operations. But ultimately, that's the internals of the system. It doesn't even have to use ACLs internally. We just represent them as ACLs. And ACLs are optional.
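The routing flow described in this answer can be sketched in Python. Everything here is hypothetical and in-memory: the tables, the `ldap_resolve` stand-in, the credential strings, and the paths exist only for illustration. The point is the order of the steps: object to domain, domain to authentication system, credentials to ACL name, ACL name to permitted operations.

```python
# Hypothetical lookup tables standing in for a real system's internal state.
OBJECT_DOMAIN = {"/files/report.txt": "/cdmi_domains/customer_a/"}
DOMAIN_BACKEND = {"/cdmi_domains/customer_a/": "ldap"}
OBJECT_ACL = {
    "/files/report.txt": {
        "alice": {"read", "delete"},
        "bob": {"read"},
    }
}


def ldap_resolve(credentials):
    """Stand-in for an external directory: map credentials to an ACL name."""
    known = {"alice:secret": "alice", "bob:hunter2": "bob"}
    return known.get(credentials)


# How credentials are interpreted is outside CDMI; backends are pluggable.
BACKENDS = {"ldap": ldap_resolve}


def is_allowed(path, credentials, operation):
    """Route credentials via the object's domain, then check the ACL."""
    domain = OBJECT_DOMAIN[path]                 # object -> domain
    resolver = BACKENDS[DOMAIN_BACKEND[domain]]  # domain -> auth system
    acl_name = resolver(credentials)             # auth system -> ACL name
    if acl_name is None:
        return False
    allowed = OBJECT_ACL.get(path, {}).get(acl_name, set())
    return operation in allowed                  # ACL name -> operation check


alice_can_delete = is_allowed("/files/report.txt", "alice:secret", "delete")
bob_can_delete = is_allowed("/files/report.txt", "bob:hunter2", "delete")
```

Swapping `ldap_resolve` for a Kerberos or Keystone resolver would not change `is_allowed`, which is the sense in which the protocol routes credentials rather than standardizing authentication.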

Starting point is 00:56:57 So yeah, it's routing as opposed to standardizing authentication. Do you assume a better term for authentication? Pretty much, I mean, people have implemented S3, Keystone, Kerberos, HTTP basic, custom authentication. It's really agnostic to that. We just add in that routing layer so you can have the namespace. So how do you ensure that your other team can find this? So once again,

Starting point is 00:57:27 you can discover what authentication methods are required and/or supported. And that can be on an object-by-object basis. You may have a namespace where a bunch of objects, anybody can access with no authentication, a bunch of objects where HTTP basic

Starting point is 00:57:43 and digest are good enough, you know, TLS mandated, and a bunch of objects where you need Kerberos, all in one namespace. And that's going to be important as we go global beyond single client, you know, hybrid cloud because you want your security policies to go with the object. Good question. Any further questions before I get ready for the next presenter? Oh, well, thank you all. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic further

Starting point is 00:58:27 with your peers in the developer community. For additional information about the Storage Developer Conference, visit storagedeveloper.org.
