Storage Developer Conference - #9: Using CDMI to Manage Swift, S3, and Ceph Object Repositories
Episode Date: May 30, 2016...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC
Podcast. Every week, the SDC Podcast presents important technical topics to the developer
community. Each episode is hand-selected by the SNIA Technical Council from the presentations
at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast.
You are listening to SDC Podcast Episode 9.
Today we hear from David Slik, Technical Director with NetApp,
as he presents Using CDMI to Manage Swift, S3, and Ceph Object Repositories
from the 2015 Storage Developer Conference.
Hey, good morning. My name's David Slik.
I'm from NetApp, and this morning I'll be presenting about using the CDMI standard
for management of a variety of different storage systems, including S3, Swift, Ceph, etc.
So, we're ready to get started.
So I'm going to start with a brief overview of CDMI.
How many here in the audience have heard of CDMI before?
A good smattering.
So CDMI is the Cloud Data Management Interface,
and I want to emphasize the management in management interface
because one of the things that makes CDMI unique
is it's focused on management tasks
that you don't typically see in data-centric protocols.
When you look at a lot of protocols out there,
the primary goal and the primary design point is how to get data back and forth
between various clients and various repositories.
CDMI is a little different.
CDMI came out of SNIA, and as most of you are probably aware,
one of SNIA's main areas of standards development is around storage management
protocols. For example, SMI-S, which is widely used for storage management, is another
SNIA standard. So when we started looking at CDMI quite a while ago, we said, okay, well,
you know, cloud storage is starting to happen. We're starting to see protocols. Those protocols are
focused on data access. There's going to need to be a way to
manage this as well. And that's what I'm going to focus on in today's presentation: how some
vendors are using CDMI to manage all sorts of different storage objects you may see up in the
cloud. So to recap, the major
APIs that we see adopted and used in the cloud are S3, CDMI, Azure, and Swift. I'm
sure no one here has any quibbles about me saying S3 is by far
the most widely used cloud data access protocol.
But one thing that we found talking with customers is Azure is extremely widely used as well.
We don't see that because most of the time when you think of applications talking to clouds,
you're thinking of, well, I download this application or I'm using this web service, etc.
That's only part of the market. In the enterprise storage and enterprise application segment of the market, applications developed
inside enterprise organizations using Microsoft toolchains, .NET, etc., are extremely widespread.
And the amount of data going in and out via the Azure API is quite significant.
I can't tell you which is more.
If I had to make a guess, I'd say that S3 is slightly edging out Azure in terms of being the most popular cloud-based data access API.
But this is always something that I emphasize during talks
because I think people underestimate the degree
to which Azure is widely deployed.
And likewise, we have the Swift API,
which is part of the Swift project inside OpenStack.
Sorry, that's my VPN wanting to connect here.
Let me kill that.
There we are.
And the Swift API is in many ways very similar to S3,
only it's part of the OpenStack project.
And that's getting more and more popular as
OpenStack deployments have grown in number and scale. So with respect to CDMI,
we've seen quite a lot of CDMI implementations as well. CDMI has a number
of advantages, namely that it provides functionality not found in
these other three APIs and as an open standard it has advantages in the
realm of documentation, interoperability, and patent protection. So we've seen
about 30 different CDMI implementations that we're aware of.
There's a lot of CDMI implementations internal to organizations, government bodies, educational facilities that we aren't necessarily aware of.
We'll see little bits of evidence of them here and there, people on newsgroups coming and asking us questions, etc.
But we're quite happy with the degree to which adoption has happened.
Standards-based protocols tend to have a slightly slower adoption curve
than vendor-driven protocols,
primarily because, as a standards body,
we don't have a vested interest in trying to get immediate adoption and lock-in
through a protocol. But once again, we believe that over time we're going to
continue to see CDMI adoption, for some of the reasons that we're going to
talk about today. Okay, so just as a note, CDMI also has a plug-in for OpenStack that allows you to have OpenStack talk CDMI as well as Swift.
And there's also a gateway that allows you to run an instance on EC2 and access your S3 objects via CDMI. So this is an example of multi-protocol support, and a lot of the commercial vendors will
have products that support the CDMI API, the S3 API, the Swift API, and even in some cases
files and LUNs, bringing us to our topic today. So to give a quick timeline, we started working on CDMI back in 2009 when we determined that, you know, there's a lot of stuff going on in cloud storage.
We think this is about right for standardization.
Our philosophy for standardizing things is if you do it too early, you're inventing things.
And the last thing you want is standards bodies inventing things. If you do it too late, the industry will have solidified around
a series of practices and tangible ways of doing things and it's just too late
to get adoption. The right time to do that is after people have figured out
how to do things, but not so late that the way that it's done has crystallized. So, we worked from 2009 through 2011
getting CDMI to become an official technical architecture,
which is SNIA-speak for a standard through that
body. And then we worked on errata as we got more and more
implementations on board. In 2011 we did
1.0.1 and then in 2012 we did 1.0.2. And this all led up to
CDMI being submitted and going through the process to become an ISO/IEC standard, specifically ISO/IEC 17826.
And this has been a really good thing for CDMI because being an official ISO standard means CDMI has worldwide recognition
that this is an endorsed way to allow systems to work interoperably on an international
stage.
And we've seen a lot of adoption coming from the fact that it's an international standard. There's a lot of academic
and government work, especially in Europe, where the mandates of the research projects
are how do we connect these different clouds together, how do we provide interoperability,
how do we provide a playing field where all sorts of different clouds can provide comparable
services. And one of the things we did in CDMI, once again
I don't know how familiar you are with the details of the spec, but one thing
that's different about CDMI is you don't have to implement the whole thing. A lot
of standards if you're gonna have it work you have to be bug for bug
compatible with the whole base feature set. CDMI has the concept of what are called capabilities.
I'll show this a little bit later on.
But in essence, what capabilities are is a way for me as the implementation
to describe to you as the client what I can do.
So if I need to, for example, use CDMI to manage LUNs in the cloud,
all I have to do is implement the core sets of functionality
for that. I may support the objects that correspond to LUNs, I may support the metadata that allows me
to manage those LUNs, and I may support those basic operations. But I may not support some of
the more advanced features that have nothing to do with LUNs. For example, I may not be supporting any of the parts of CDMI that deal with managing NFS or CIFS. So that flexibility,
being able to arbitrarily select which subsets of the standard to implement, has been
something that has really spurred on adoption, and capabilities are the
means by which I as a client can figure out what a server does. Because, you
know, with other protocols you
can just assume it does everything because that's the definition of the protocol. With CDMI it's not so.
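To make the capabilities idea concrete, here is a minimal Python sketch of a client deciding what it can use before it tries anything. The sample document, the capability names in it, and the helper functions are illustrative assumptions rather than text taken from the CDMI specification; a real client would fetch the capabilities object from the server and treat the spec as the authority on names and values.

```python
# Hypothetical capabilities document, shaped like JSON a server might return.
# The capability names and values here are made up for illustration.
SAMPLE_CAPABILITIES = {
    "objectType": "application/cdmi-capability",
    "capabilities": {
        "cdmi_queues": "true",
        "cdmi_notification": "false",
        "cdmi_query": "true",
    },
}


def supports(capabilities_doc, name):
    """Return True only if the server explicitly advertises the capability."""
    value = capabilities_doc.get("capabilities", {}).get(name, "false")
    return str(value).lower() == "true"


def plan_features(capabilities_doc, wanted):
    """Split the wanted features into those that are usable and those to skip."""
    usable = [name for name in wanted if supports(capabilities_doc, name)]
    skipped = [name for name in wanted if name not in usable]
    return usable, skipped


usable, skipped = plan_features(
    SAMPLE_CAPABILITIES,
    ["cdmi_query", "cdmi_notification", "example_unlisted_feature"],
)
print("usable:", usable)
print("skipped:", skipped)
```

The design point is that anything not advertised is treated as unsupported, so the client never has to probe by trial and error.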
So moving on, the way that we handle enhancements in CDMI is through what we call the extensions
process. In the CDMI working group we are very much use-driven. Once
again, we don't want to just be inventing things.
If we're going to spend time standardizing things,
we want these to correspond to real customer needs driving real vendor needs driving revenue.
So vendors that are involved in the process,
and if anyone wants to get involved in CDMI, we'd welcome you to join the technical working group.
Bring forward extensions, the concept that we want to extend CDMI
to handle a given use case. We talk it through in committee.
We publish it for public review. And when someone
shows up with an implementation, then we can
go and start to look at adding it to the spec. And it doesn't go
into the full spec until we have two verified interoperable implementations.
And we do that as a series of plugfests.
I think it's Plugfest 21 we're on right now,
and that's happening this week here at SDC.
There's the CIFS plugfest,
or rather the SMB plugfest,
and then there's the CDMI plugfest, the Cloud Plugfest,
going on concurrently.
So we have 18 extensions that were submitted, and some of those were folded into CDMI 1.1,
which is primarily an errata release.
And I'm proud to say that as of just a couple weeks ago,
CDMI 1.1.1 has been submitted to ISO, and that will become 17826:2016, hopefully not 2017,
as an update to the ISO standard. So there's a lot of history behind here. We're pretty proud
of where we've gotten to and looking forward to sharing that. So ultimately, you know, why does
this matter? You know, a lot of people are doing great stuff with all these other object
protocols. We're really happy about that. The more that object storage and
cloud storage is being used, the more stuff that's going on means the more
stuff that needs to be managed.
And that's ultimately why, you know, this is all a good thing.
Just like with NFS and CIFS, there doesn't need to be a protocol winner, because ultimately, regardless of what
you're using, NFS, CIFS, iSCSI, LUNs, S3, Swift, Ceph, etc., across all these different
systems this is just more and more data that needs to be managed. And managing
the data is a hard problem that we as an industry have, to be honest, struggled with for decades.
We can move the data around, we can keep it reliably, we can provide the performance,
but dealing with the terabytes to petabytes to exabytes of data and connecting dozens to thousands to millions of different endpoints to this data, this is becoming the
key challenge for enterprise storage, cloud scale storage, and an area where we think CDMI has some
real value to offer. So why does it matter? Well, first of all, it's simple and easy to implement.
Talked a little bit about how pretty much everything's optional. You can pick from the
standard what's specific to your needs, your customer needs.
But also it has a lot of functionality that either hasn't been in other APIs or is just
starting to be implemented in other APIs.
For example, let's say you're providing LUN services for a very large compute system in
the cloud.
You may have tens of thousands of customers managing millions of LUNs.
Well, how can you describe these sorts of things?
Sure, you can carve it up into little mini namespaces, one for each customer,
and that's manageable from the customer standpoint.
But there's a lot of management operations you're going to want to do
on subsets of your entire pool of resources.
One way that CDMI deals with this is by having rich metadata standards.
So CDMI allows you to have hierarchical and structured metadata attached to every single object.
And I'll show this in a little bit.
And then, for example, you can do query.
You can say, I want to operate on
the set of objects that match these criteria, matched against the metadata
using a query. So that's an example. Or, for example, notifications: having
a standardized way to link your storage into your notification systems, so
when someone creates, for example, a LUN with a given set of
characteristics, your back-end systems can get a notification, or the client's front-end systems
can get a notification. These are examples of a number of tested and robust parts of the CDMI
standard that we're only just now seeing elsewhere. For example, notification was added into the S3 ecosystem late last year,
and it still isn't there for the other ones.
So this is an example of where CDMI provides ways to do things
that haven't been yet integrated into other APIs.
So there's a lot of maturity there.
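As a rough illustration of the metadata query idea described above, the following Python sketch selects objects whose structured metadata matches a set of criteria. This is not the CDMI query syntax; the catalog, the metadata keys, and the matching rule are assumptions made up for the example.

```python
# Hypothetical in-memory catalog: each managed object carries nested metadata.
OBJECTS = [
    {"name": "/luns/db01", "metadata": {"owner": "acme", "tier": {"media": "flash"}}},
    {"name": "/luns/db02", "metadata": {"owner": "acme", "tier": {"media": "disk"}}},
    {"name": "/luns/web01", "metadata": {"owner": "globex", "tier": {"media": "flash"}}},
]


def matches(metadata, criteria):
    """True if every key in criteria is present in metadata with an equal value.

    Nested dictionaries in criteria are matched recursively, so a criterion
    can reach into hierarchical metadata.
    """
    for key, expected in criteria.items():
        actual = metadata.get(key)
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(actual, expected):
                return False
        elif actual != expected:
            return False
    return True


def query(objects, criteria):
    """Return the names of all objects whose metadata satisfies the criteria."""
    return [obj["name"] for obj in objects if matches(obj["metadata"], criteria)]


selected = query(OBJECTS, {"owner": "acme", "tier": {"media": "flash"}})
print(selected)
```

A management operation (migrate, snapshot, change SLO) could then be applied to the selected set rather than to one customer namespace at a time.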
As an open industry standard, it's not
controlled by any one vendor. Everyone's welcome to join. And there's protection against patents.
The companies involved in the creation of CDMI are pretty much a who's who of the storage industry.
There's well over 100 participants. And part of the SNIA process is everyone has to sign an intellectual property
agreement which says that they're not going to sue each other over patents, and you're
not going to have a situation where if you come out with a product that uses CDMI, five
years down the road someone says, oh, you're using something we have patented, you owe
us a huge amount of money.
So that's a real degree of comfort, not just for commercial vendors, but also for the open source folks, because they don't want to get
involved with patents. A lot of the time they're just kind of, "there's no patents,
I'm going to ignore this," but with CDMI that's really not an issue. And
then finally, you know, we have a well-defined standard document, there's
testing infrastructure, I encourage folks, if they're interested, we have
a test conformance system that we're going to be using
at the Plugfest this week. Come on, drop by the room and
we can show you what it looks like. And once again,
adoption as I mentioned. So what does CDMI standardize?
Well, CDMI does have the standard CRUD operations, but not necessarily for the reason why you think.
Because typically in a cloud storage standard, you know, create, read, update, delete is the core of everything.
Because that's the whole point. You want to store your data, you want to retrieve your data, you have to delete your data, etc. In CDMI, this is almost incidental. You have to have this to manage your namespace. And that's
what I'm going to be spending most of the time today talking about, is the concept of namespace
and managing namespaces. So while CDMI does have this, and while you can use CDMI as a data
protocol, that's almost incidental because we're ultimately
interested in these management functionalities. So CDMI, when it talks about managing things,
the object model has four primary entities, and we'll look in a few slides on how these overlay
onto what's being managed. So there's data objects of whatever type.
There's container objects of whatever type. There's queue objects of whatever type.
And then there's something called a domain object. So to talk through this really quickly,
we're all familiar with things like a file, an ISO image, etc. Things that you want to deal with as a concrete entity,
those map to data objects.
Then there's things like containers.
All my LUNs on a given zone.
A directory, a bucket, etc.
Those map to containers.
Now one of the interesting things about CDMI is CDMI does not
force you into the model that it's either a container or a data object. CDMI is based on
REST and one of the things about the REST architecture and model is we're dealing with
representations. So everything in CDMI is designed around representations that sit beside your
existing models. So I'll show this in the demo, but you know, I have an S3 storage system or a
Ceph storage system with an object. I do an HTTP GET that's going through Swift or S3 or Ceph. But if I tell the system that I want it as a CDMI representation,
that then tells the system I'm no longer just doing my data path.
I now want to do management functionality,
and the CDMI standard starts to get invoked.
And I'll demonstrate that for you in a few minutes.
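The representation switch can be sketched as two requests for the same path that differ only in their headers. This is a minimal Python sketch that builds request descriptions without sending anything; the path is hypothetical, and the version string passed in the version header is an assumption that a real client would take from the specification release it targets.

```python
def build_get(path, want_cdmi, spec_version="1.1"):
    """Describe an HTTP GET for either the plain data path or the CDMI view.

    The spec_version default is an assumed value for illustration only.
    """
    headers = {}
    if want_cdmi:
        # Asking for the CDMI media type selects the management representation.
        headers["Accept"] = "application/cdmi-object"
        headers["X-CDMI-Specification-Version"] = spec_version
    return {"method": "GET", "path": path, "headers": headers}


# Same object, two representations. The path is a made-up example.
data_request = build_get("/bucket/report.pdf", want_cdmi=False)
cdmi_request = build_get("/bucket/report.pdf", want_cdmi=True)

for request in (data_request, cdmi_request):
    print(request["method"], request["path"], request["headers"])
```

Nothing about the URL changes, which is the point being made in the talk: the existing data path is left alone and the management view sits beside it.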
But the reason why I mention this, because CDMI is
representation-based, you can actually have a data object
representation and a container representation
for the same thing.
And this happens actually quite a lot in our industry.
For example, what is a VMDK?
It's a data object, a file. But is it really? Well, actually, it's a container
with a bunch of stuff inside it. And sometimes when you're doing management, you want to deal
with your LUN as a single thing. But sometimes when you're doing management, you want to crack
open that LUN and go and do operations on what's inside it. If you're doing virus scanning, for example, or if you're doing analytics, you want to
go through and do your querying into what's inside your LUNs, etc.
And this pattern actually we see a lot of times. When you're doing backup, you often don't want to work on the granularity of individual files.
You want to be able to manage a whole collection of files for backup, for SLOs, etc.
So this duality, being able to say it's a data object and or it's a container,
is something we use quite a bit through the standard and is one of
the neat aspects of CDMI. So just to quickly talk about domains, the other
thing that CDMI has as part of the standard is the concept that in a cloud
you're not necessarily going to have everything managed by the same
administrative entity. Because storage systems from a management standpoint have
typically come from an environment where there's only one managing entity, this is typically a big
limitation. If I have a cloud and I have data, say, from the IEEE and data from ACM and data from
other organizations, ultimately these all want to sit in one namespace. You don't want to have to segregate the namespace.
Here's the IEEE papers, here's the ACM papers, etc. You just want to have your papers.
And if it's an ACM paper sitting right beside an IEEE paper, those should be managed by who owns them, who has the administrative rights for them. So CDMI has the concept of multiple administrative domains.
For every object, there's a link to a corresponding domain.
So if you have an ACM domain and an IEEE domain,
when I come in to do an operation against an object,
that object belongs to a domain,
and that domain controls how my credentials are resolved and therefore
what I can do administratively against that object. So if I try to delete an ACM
paper and it says, you know, "dslik credentials, I don't know who you are,
you know, I'm going to deny all those operations," but if I go to delete the IEEE
paper, you know, it's my paper, credentials are matched, yep, permissions match up, that
operation is approved. Two objects sitting in the same directory, the same part of the
namespace, totally different administrative mappings. And this is something really important
for management as we move towards, you know, cloud scale, global namespaces, because you're
not going to have the assumption
that administrative control is segregated on a namespace
by namespace basis.
Bucket granularity management, that's
just a stepping stone towards where
customers want to go with the true data mobility,
especially between organizations,
especially in the cloud, where you
have anywhere from thousands to millions of customers.
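The per-object domain idea can be sketched in a few lines of Python. Everything here is hypothetical: the domain tables, the user names, and the permission sets are invented for illustration, and a real CDMI system resolves credentials through whatever mechanism each domain is configured with.

```python
# Hypothetical domains: each one knows which users it recognises and what
# operations those users may perform.
DOMAINS = {
    "ieee": {"author": {"read", "delete"}},
    "acm": {"alice": {"read", "delete"}},
}

# Two objects in the same container that belong to different domains.
OBJECT_DOMAIN = {
    "/papers/ieee-paper.pdf": "ieee",
    "/papers/acm-paper.pdf": "acm",
}


def is_allowed(user, operation, path):
    """Resolve the user against the domain that owns the object, not the path."""
    domain_users = DOMAINS[OBJECT_DOMAIN[path]]
    return operation in domain_users.get(user, set())


for path in sorted(OBJECT_DOMAIN):
    print(path, "delete allowed for author:", is_allowed("author", "delete", path))
```

Both objects share a parent container, yet the same user gets different answers because the decision follows the object's domain.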
Okay, so I spent a little bit of time on that,
touched on identity and access control through talking about domains.
Metadata, we're going to look at that in more detail.
Query and notifications already touched on.
If you're interested in these in more detail, please take a look at the spec.
It goes into great detail on how these are used and implemented. Versioning is another interesting
one, and serialization, deserialization. So let's get into the meat. What do we mean by
converged data management? Well, we have all these different things,
all these different types of data
that we're generating and consuming in the cloud,
but also in the enterprise.
And being able to move things back and forth, hybrid cloud,
et cetera, is another area that's really interesting.
But that would be a whole separate talk.
So what do we see?
We have things like objects and files and LUNs. We also have things like
NoSQL databases. And if I have a hundred NoSQL databases up in my cloud, I need a way to
manage that too. I need a way to list them. I need a way to get information about them.
How much space are they consuming? How much is it costing me? I need a way to specify things like SLOs. This database I may want to have on flash;
I need really low latency. This database I may just be archiving, because it's a snapshot of
another database. This database is a set of data
I do a batch of analytics on once a month. Most of the time I don't want to be
paying for really high performance storage, but you know when that month end comes up, I want the
cloud system to move it into Flash and then run all my analytics on it, and when I'm done,
oh, I want to move it back onto something cheaper. So one of the neat things about CDMI is it's equally applicable for managing all these
new and emerging data types. You can have one namespace that covers all these different data
types for management purposes. And this makes everything a lot easier because when you have
one namespace that you can use in a common way to specify and discover,
all of a sudden now you have to write a lot less code in the cloud,
and your client has to have a lot less code to deal with a way of managing LUNs,
and a way of managing files, and a way of managing databases, etc.
And we're only starting to see this. With RESTful interfaces, there's convergence in the stacks out there.
We're seeing less variability.
It's not like in the old days where you had to have, say, an Oracle-specific API for managing one type of database
and another API for managing other types of databases, and proprietary means for
each of your file types, etc. But, you know, this is a really nice thing about
CDMI. So this presentation is talking about how this works.
Another thing that this is valuable for is data mobility.
One of the things that CDMI provides is a superset namespace. Each of these namespaces, the way I store files in a file system,
the way I store objects in S3, for example, or Swift,
the way LUNs are named and managed,
these all have different restrictions.
And these are often subtle, little, annoying restrictions,
what characters are allowed,
lengths of strings in various places.
These become really nasty interoperability challenges.
So what CDMI has done is we tried to make sure that it's a superset namespace. And by being a superset namespace, you can express objects from these
various systems in CDMI. You can always go from X to CDMI. You may have challenges going from CDMI to Y, but it's a lot
easier to write an adapter from CDMI to Y than it is to write an adapter from every system to every other system.
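The arithmetic behind that claim is simple enough to sketch. Assuming n storage systems and counting one converter per direction, direct conversion between every pair grows quadratically while going through a common interchange format grows linearly. The function names below are made up for this illustration.

```python
def pairwise_adapters(n):
    """Converters needed if every system converts directly to every other one."""
    return n * (n - 1)


def hub_adapters(n):
    """Converters needed if each system only converts to and from one common format."""
    return 2 * n


for n in (3, 5, 10):
    print(n, "systems:", pairwise_adapters(n), "direct vs", hub_adapters(n), "via a hub")
```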
So one of the areas we've seen a lot of interesting work, especially out of Europe,
is using CDMI as an interchange format, a vendor-neutral way to get data from system A
to system B and to archive data when
you're moving between clouds.
Because what's a really big thing in Europe
is about being able to have the right to move your data,
the right to repatriate your data,
rather than it getting stuck in a cloud
or stuck in a storage system. OK, so let's go through some definitions. So we talked a little bit about what objects are at a
generic level. It's something you want to manage. Examples, files, directories, LUNs, you know,
you can get into systems. But one thing that's important is CDMI is a cloud standard. And by a cloud standard, I mean the internals are abstracted.
I shouldn't have to know how it's implemented.
In fact, if I can tell how it's implemented, that's bad.
Because every piece of information that I know about the internal implementation of the storage system
is a way that I can potentially constrain the freedom of the cloud
provider to do stuff behind the scenes. The whole point is I shouldn't have to
know. I should be able to tell the cloud provider what I want, what I expect
without caring about the implementation. If the cloud provider chooses to use
erasure coding or replicas or RAID 6 or some other new technology,
I shouldn't care. I should care about the reliability of the data and the performance
of the data because ultimately that's what I want to pay for. That's not my job how they do it. I
should define my service levels and pay and have it provided. So CDMI takes a very different approach for management
than what we see, for example, in SMI-S and Redfish and a lot of the other management protocols out
there. We're not dealing with systems and disks and RAID networks. We're dealing with namespaces
and objects and SLOs. So what is a namespace? I've used the term a whole bunch of times. It's one of
those words that's a little overloaded. So ultimately your namespace is the complete set
of your objects and the organization within it. A namespace has to be addressable. You have to be able to describe
how to get to an object within a namespace. That's an important part of the
definition.
Namespaces can have various different internal structures.
You can have flat namespaces, like for example within a bucket.
It's flat. You have a key value. You can have a hierarchical namespace.
Going to S3 and Swift as examples, people create hierarchical namespaces
inside that flat S3 namespace by giving special meaning to a delimiter,
typically slash, inside the name of the object to create, in essence, a synthetic
hierarchical namespace within a flat namespace.
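The delimiter trick can be sketched in Python as a listing function over a flat set of keys. This is a simplified model of prefix-and-delimiter listing, not any particular service's API; the key names and the function are hypothetical.

```python
def list_level(keys, prefix="", delimiter="/"):
    """Return (objects, pseudo_directories) found directly under prefix.

    The keys are flat strings; the hierarchy exists only because the
    delimiter is given special meaning when listing.
    """
    objects, prefixes = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # Something deeper exists: report it once as a pseudo-directory.
            prefixes.add(prefix + head + delimiter)
        else:
            objects.append(key)
    return sorted(objects), sorted(prefixes)


# Made-up keys in a single flat bucket.
KEYS = ["readme.txt", "photos/2015/sdc.jpg", "photos/cat.jpg", "logs/app.log"]

print(list_level(KEYS))
print(list_level(KEYS, prefix="photos/"))
```

The store never holds a directory; "photos/" appears only because the listing logic groups keys that share a prefix.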
File systems have never really had that problem because they've been built around the concept of hierarchies.
Well, when I first got started in computing, I actually had an operating system with disks with no hierarchies, no directories, but hierarchy has been around for at
least as long as I've been in the industry, and probably longer.
And we're starting to see a lot of other relationships.
Graph relationships are more and more popular.
We started to see this with symlinks, but it's becoming a lot richer.
When you look at, for example, what Facebook's doing,
everything they're managing is all about graphs.
And my bet is, as we see things continue to develop
over the next 10 years,
we're going to see graph-based namespaces
become more and more important,
especially from management standpoints,
because a lot of the times
when you're doing a management function you want to look at relationships. What are all the
associated objects? Projects, owned by, etc. If I'm a
client and I discontinue my account, you may want to have it such that everything
owned by dslik goes on to a really
cheap archival tier for a few months before it gets deleted, as an example. So namespaces,
once again, because everything in a namespace has to be addressable, this is where all the
restrictions come in. What's an allowable name? What's your delimiter for hierarchies? What's your symbol
representing a graph traversal, etc.? And CDMI uses Unicode. Unicode's not a panacea. There's
actually some really quirky and difficult parts of Unicode, but compared to what we had before,
it's really nice. You can represent
pretty much every set of characters in there. Binary data is a problem, but this has been
great for international tech support and being able to handle all the different character sets.
And the more restrictive a namespace is, the less it's able to accommodate different types of objects.
So this is where, for example, CDMI serialization comes in. The concept of CDMI serialization is I
may have these really complex sets of hierarchical or graph-related data. I may have a NoSQL database
and a hierarchy of objects and some LUNs and metadata, etc. Well, when you serialize
it, you turn it into a bog-flat bitstream that can go on pretty much anything, any storage device
that we'll ever encounter, because it goes down to the lowest common denominator. And because,
once again, in CDMI, you have those object and container representations.
If you have a system that manages even something with a really restrictive namespace,
if you step into that serialized container, now all of a sudden you can have all the richness of CDMI.
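The general idea of turning a rich hierarchy into a flat byte stream and back can be sketched with plain JSON. This is not the CDMI serialization format; the tree layout, field names, and helper functions are assumptions chosen only to show the round trip.

```python
import json

# Hypothetical tree: a container holding a data object and a nested container.
tree = {
    "name": "projects",
    "type": "container",
    "metadata": {"owner": "example-user"},
    "children": [
        {
            "name": "notes.txt",
            "type": "dataobject",
            "metadata": {"lang": "en"},
            "value": "hello",
        },
        {"name": "archive", "type": "container", "metadata": {}, "children": []},
    ],
}


def serialize(node):
    """Flatten the whole tree, metadata included, into one byte string."""
    return json.dumps(node, sort_keys=True).encode("utf-8")


def deserialize(blob):
    """Rebuild the tree from the flat byte string."""
    return json.loads(blob.decode("utf-8"))


blob = serialize(tree)
restored = deserialize(blob)
print(len(blob), "bytes; round trip intact:", restored == tree)
```

Because the result is just bytes, it can be stored on a system with a very restrictive namespace and expanded again later with the structure and metadata intact.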
Okay, so we talked about namespaces.
Well, what do we mean by management?
The informal definition, whatever doesn't fit in the data path.
All the leftovers, you know.
Well, we'll think about that later.
But this is actually something interesting.
Management isn't in the SNIA dictionary.
You'd think, given what we do, it would be in the dictionary.
But it's not, because it's so common we don't think about it.
Ultimately, my definition is that management is the operations that are applied against objects and sets of objects.
It's not the operations you do directly to the object.
It's ones that are applied against. I want to snapshot this object. I want to migrate
these objects. I want to change the permissions of these objects. These aren't things that change
the essence of the object. They change all of the associated parameters and ways that these
objects are used in the system.
So when you put it all together,
CDMI's key value is it provides a superset namespace to allow management operations to be performed
against cloud resident objects,
and this can sit on pretty much any type of a system.
So let's get down to some meat.
CDMI namespaces.
So CDMI supports your arbitrary tree-based...
Oh, that's interesting.
Did I step on something?
There we are. Good, good. Yes.
So arbitrary tree-based hierarchies. This is a superset of what you see in Swift, S3, Azure, and other cloud namespaces.
It's also a superset of your standard file system namespaces, which means if you have a filer with NFS and CIFS shares, CDMI can layer on top of it. And if you have S3 and Swift and Ceph and Sheepdog
and all these other platforms out there
that are providing multi-protocol support,
it can layer on top.
And you can even do this for your LUNs
and other managed objects.
CDMI also has the concept of references,
which allow your arbitrary symlink-like cross-tree linkages, which we see more commonly.
And there's been some work tossed around inside the standards body for having a full extension for graph relationships, i.e. allowing any arbitrary object to have structured metadata to say "I have relationship X to object Y."
And then that's really important to be generic because there's as many relationships as there
are business problems out there. I am a parent of, I am a child of, is what we're used to,
but I am a derivative of, I am a friend of, I am a etc. Being able to have these arbitrary relationships
moves us a lot closer to this kind of graph world that at least we think things are going.
So this allows CDMI to encapsulate all these different data types into a single namespace
for management purposes. So what does this look like in principle?
Well, up in the corner here we have the CDMI object model.
And the way it works is there's a root.
And this is something that's typical in a namespace.
You need some place to enter.
And the root has a couple special properties.
The root is where you can start your discovery.
What can a CDMI system do? One principle that's
important, not just for namespaces, you want to be able to discover and browse and traverse your
namespace, but you also want to be able to discover and traverse and browse your API.
What can I do? What does this system let me do? So, for example, slash cdmi slash v1 slash can be an example
of a CDMI root. But just slash could be a CDMI root because that's what we're used to in files.
Or, you know, slash lun slash might be my root. CDMI is very agnostic about this because we want
people to be able to layer this onto the namespaces they
already have for their data path. We don't want to have to force them to create a whole new system.
It's representational, right? I do an HTTP GET. I say accept CDMI. The server now knows I'm talking
about a CDMI representation for an existing namespace, and we just want it to work. So under the root you have containers and,
you know, like a file system, your containers can be nested. You can have containers within
containers within containers, and at any of those levels, including at the root, you can have data objects or
queue objects, etc. And connected by a dotted line to each of these are capabilities. This is the self-describing mechanism.
Is this object read-only? I can go and discover that. I don't have to actually
try to delete it and it comes back with an error. I can ask on a per object or per
container or per system basis
what actions, what can I do from a management standpoint for this object?
Does it support the super low latency SLO?
For example.
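As a rough illustration of that "ask before you try" idea, here is a minimal Python sketch. It assumes a capabilities document has already been fetched and parsed as JSON. The sample document is made up for this example, and the capability names are written in the style of the CDMI specification but should be checked against the spec before use.

```python
import json

# Hypothetical capabilities document for one object. The field names and
# values are illustrative assumptions, not output from a real CDMI server.
SAMPLE_CAPABILITIES = json.loads("""
{
  "capabilities": {
    "cdmi_read_value": "true",
    "cdmi_read_metadata": "true",
    "cdmi_delete_dataobject": "false"
  }
}
""")


def is_allowed(capability_doc, capability_name):
    """Return True only if the capability is present and set to "true"."""
    value = capability_doc.get("capabilities", {}).get(capability_name, "false")
    return str(value).lower() == "true"


def can_delete(capability_doc):
    """Check deletability up front instead of attempting a DELETE and failing."""
    return is_allowed(capability_doc, "cdmi_delete_dataobject")


if __name__ == "__main__":
    print("readable:", is_allowed(SAMPLE_CAPABILITIES, "cdmi_read_value"))
    print("deletable:", can_delete(SAMPLE_CAPABILITIES))
```

A missing capability is treated as not allowed, which is a conservative choice made for this sketch rather than something the talk states.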
So if we look at, for example, how this overlays,
with respect to the file system world, this is quite straightforward.
My root container is my root directory.
My container is my directory.
My data object is my file.
Boom, one-to-one, clean.
When we look at, for example, Swift, my CDMI root is my Swift root.
I have a container representing a Swift account.
This allows me to do administrative operations on accounts.
If I want to be able to get statistics or migrate a whole account or do those sorts of operations,
there's a manageable object representing that layer.
Your bucket, the next layer down, once again, is just another CDMI container.
So you can manage these things, name them, move them around, serialize them, set SLOs, etc. on them,
and then finally you've got data objects in there. And that's if you view it just
as the flat namespace. If you go to the next step, you can say, okay, well I'm
gonna expand this another level and say people create synthetic hierarchies
inside their bucket. Let's layer that out with more containers. Similar with S3,
S3 is a little different than Swift.
S3 has a root.
It has buckets and objects.
And instead of being an explicit container, in S3, it's a dotted line to an account.
Now, once again, CDMI has the concept of domains.
Every object has a dotted line to the domain. So in the S3 model, what you do is the dotted line
goes to your domain, where there's one per customer, one per bucket, and you've got that
correspondence. So the overlays work quite well. So let's look at this. Let's put this all together.
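The Swift and S3 overlays just described can be sketched in a few lines of Python. This is only an illustration of the flat-namespace view: the function names, the tuple layout, and the example domain path are assumptions made for this sketch, not part of any CDMI implementation.

```python
def overlay_swift(path):
    """Map a Swift-style '/account/container/object' path onto CDMI kinds.

    The account and the bucket each become a CDMI container, and everything
    after the bucket is treated as one data object name (the flat view).
    """
    account, bucket, obj = path.strip("/").split("/", 2)
    return [
        ("/", "container"),      # CDMI root is the Swift root
        (account, "container"),  # manageable object for the account
        (bucket, "container"),   # the bucket is just another container
        (obj, "dataobject"),
    ]


def overlay_s3(path, domain_uri):
    """Map an S3-style '/bucket/key' path onto CDMI kinds.

    There is no explicit account container; the account is represented by
    the domain the objects point to (the "dotted line" in the talk).
    """
    bucket, key = path.strip("/").split("/", 1)
    return {
        "levels": [("/", "container"), (bucket, "container"), (key, "dataobject")],
        "domainURI": domain_uri,  # hypothetical domain path for this sketch
    }


if __name__ == "__main__":
    print(overlay_swift("/acct/photos/2015/cat.jpg"))
    print(overlay_s3("/photos/2015/cat.jpg", "/cdmi_domains/customer1/"))
```

Expanding the synthetic hierarchy inside a bucket would mean splitting the object name on its separator and emitting one more container per level.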
I have a multi-protocol storage system in the cloud that's providing me file access.
It's providing LUN access.
It's providing S3 and Swift and all these different protocols.
Well, here's an example.
I have my root.
I have a bunch of files.
And I can export all these as an NFS or CIFS file system.
So my compute, my cloud compute, can then mount those and see all those resources.
But not just that, I can go to specific objects in that file system and say, this VMDK, I want it as iSCSI.
This ISO, I want it as iSCSI.
And then the cloud compute can go and mount that too.
So you can start to have really interesting workflows where someone just drags and drops a VMDK in,
and then that triggers a notification,
and then that is then exported via iSCSI
and bound to a new VM that's instantiated,
and boom, you just drag the file in,
and that app's running.
Question?
Two questions, short ones.
In the previous slide,
you showed the CDMI object model
as consisting of a container and a data object,
but you mentioned that CDMI
does not distinguish between container and object.
I mean, everything's an object.
Why is that separation here?
Yeah, that's...
Oh, I'm just showing it for how the overlay is.
If you view something as a container representation,
you can look at its children.
But if you view something as a data object representation,
you're just looking at its value.
The other one is the multiple boxes that are shown.
Are they showing multiple instances of that container?
Is that the semantic there?
You can have that.
I'm trying to be a little more generic than I probably should
be for a concrete example.
But yeah, you can have as many containers as you want
as long as they have unique names.
Question?
So I wanna be a believer,
I've been following it for about four years.
Yep.
No customers I ever talk to ever mention CDMI.
Yep.
Right, so there's a question here,
and sort of, I've always thought about it as more of a front control plane interface.
And I've never found a killer app.
But the way it's presented, I'm wondering if it actually is more useful as an orchestration layer
to sort of manage various storage offerings within sort of a hybrid cloud.
Like it's sort of varying in.
It's not for the customer front side. Am I thinking
wrong about it? No, you're not thinking wrong
about it at all. And it really depends on which
target market. Selling into enterprises,
that orchestration
layer is their front end.
So what is the killer aspect of it? What is the thing that would
drive CDMI adoption? You said there's 30 plus
vendors that have implemented it, but I don't
have any customer demand for it.
So what is driving the...
Why would I care, I guess, is my question. Why would I do it?
Primary reason is, do you want to invent your own?
If you want to invent...
But if I'm deploying OpenStack, I get the hybrid model.
I get the sort of management of multiple sources.
I don't know.
Maybe it's a question we can take offline.
Like I say, I want to be a believer.
I've been a follower for over four years.
It has not percolated up to the level of a lot of sort of emotional.
And it's a good question.
Unlike S3, where there's S3 apps everywhere
and there's all this buzz about it,
you don't hear much about CDMI.
And most of the implementations that we've been involved in
are inside a corporation.
So I went to Lee to answer a question.
There are some of our customers who work for IBM.
Some of our customers, they have needed, they need to move between clouds and schemes.
That's what these customers need.
Precisely those are the layers.
So are they building their own?
There's a lot of open source
digital products.
Yeah.
We can pick our plan.
Okay, yeah, let's pick our plan.
But, you know, once again,
the question about customer demand
is a big one.
The customers,
they give an example with Aki.
Aki built their orchestration layer
around linking iSCSI LUNs in the cloud
to VMs in the cloud,
all based around CDMI. So the point about the orchestration layer is a
really big deal. That's who consumes a lot of this management functionality. So quickly go through
here. At the same time, you can have your hierarchies, your buckets. Not only is this
available via Swift and S3, you can have those two layers for
account management work at the same time. You can also just go in via NFS
and access the same files. A number of companies have done quite well with this sort of triple
protocol access scale. These got a lot of traction with their multi-protocol access,
speaking of a specific implementation.
So let's quickly go and look at what this looks like. So let me jump out here, and what I'm going to do is I'm going to go into this here.
So what this is is this is a little AJAX JavaScript-based CDMI client. This is running in the browser, it goes out, and it's
actually doing CDMI operations against a cloud running on my laptop. And this is another thing
a lot of people really like about HTTP-based protocols, is you can talk directly and do
management operations right in the HTTP. You don't need a layer between to do
these sorts of operations.
So I have my namespace.
This index.html is actually this program itself.
So inside, I have a namespace that's a bunch of files.
And here, for example, I have these.
And I click on this, and these are actually my slides
that I'm presenting.
And this is part of that namespace.
And I have some images, and I can just go through.
This is all in CDMI Server.
And if I go back up here, I also have, from my filer,
I have some buckets that are via an S3 gateway. I have some volumes that
are file system and I have some LUNs. So if I go into my LUNs, I have my filer, I
have my volumes, I have inside my volumes here for example and there's my LUN. So
if I go and I take this, so I'm gonna take this URL and copy it. And then I'm going to go over to a tool here that lets me actually run
an operation against this.
Okay, what I've done here is I've gone and I've said, this is my resource,
and a CDMI specification version, so I want to talk CDMI,
and if I run that, it's going to go through, and boom, here's the
CDMI JSON as per the CDMI spec.
So if I went to this URL over plain HTTP, and it's a file, I'm going to get it back via those data protocols.
Here I'm seeing the management side.
And in here I have my exports.
This is exported via iSCSI.
I've got two portals corresponding to my IP address access.
I've got some identifiers. I've got LUN information. I got permissions. All this was able to be specified as part of the structured
metadata inside the CDMI object. So there's an example of that ability to easily do the
orchestration as you mentioned there.
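To make the shape of that structured metadata a little more concrete, here is a small Python sketch that reads an iSCSI export entry out of a CDMI-style JSON object. The sample object, the field names under "exports", the addresses, and the IQN strings are all illustrative assumptions for this example. The talk does not show the actual JSON, so the CDMI specification should be consulted for the real field names.

```python
import json

# Hypothetical management representation of a LUN. Every name and value
# below is made up for illustration and is not taken from the demo.
SAMPLE_LUN = json.loads("""
{
  "objectName": "lun0",
  "exports": {
    "iSCSI": {
      "portals": ["192.0.2.10:3260", "192.0.2.11:3260"],
      "target_identifier": "iqn.2015-09.example:filer1.vol1",
      "logical_unit_number": "0",
      "permissions": ["iqn.2015-09.example:client1"]
    }
  }
}
""")


def iscsi_export(cdmi_object):
    """Return the iSCSI export entry, or None if the object is not exported that way."""
    return cdmi_object.get("exports", {}).get("iSCSI")


def describe(cdmi_object):
    """Summarize how an orchestration layer could reach this LUN."""
    export = iscsi_export(cdmi_object)
    name = cdmi_object["objectName"]
    if export is None:
        return f"{name}: not exported via iSCSI"
    portals = ", ".join(export.get("portals", []))
    return f"{name}: LUN {export.get('logical_unit_number')} at {portals}"


if __name__ == "__main__":
    print(describe(SAMPLE_LUN))
```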
So what are your typical LUN operations? You want to create LUNs, you want to
discover them. Well, discovering them is just listing the CDMI objects in a
container. Creating them is creating a new CDMI object and specifying
metadata. Do I want it thin provisioned, how big do I want it provisioned, etc. Specifying permissions, SLOs, exports, this links into all of
those parameters I talked about earlier. If I want this on low latency, high
throughput, different levels of data protection, there are standardized
mechanisms to describe all of this. And then, of course, there's the standard way
to take a snapshot.
We're not talking about how to implement the snapshot.
This is just a signaling layer.
Same thing with serialization, et cetera.
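As a sketch of what "create a new CDMI object and specify metadata" might look like from a client, the Python below only builds JSON request bodies and sends nothing. The `example_` metadata keys are hypothetical names invented for this illustration. `cdmi_latency` and the `snapshot` field are written as this sketch assumes the CDMI specification names them, and should be verified against the spec.

```python
import json


def create_lun_body(size_bytes, thin=True, max_latency_ms=None):
    """Build a request body describing the LUN we would like to exist."""
    metadata = {
        "example_size_bytes": str(size_bytes),                    # hypothetical key
        "example_thin_provisioned": "true" if thin else "false",  # hypothetical key
    }
    if max_latency_ms is not None:
        # Assumed name of the latency service-level metadata item.
        metadata["cdmi_latency"] = str(max_latency_ms)
    return {"metadata": metadata}


def snapshot_body(snapshot_name):
    """Build a request body that signals "take a snapshot with this name".

    This is only the signaling; how the snapshot is implemented is up to
    the storage system.
    """
    return {"snapshot": snapshot_name}  # assumed field name


if __name__ == "__main__":
    print(json.dumps(create_lun_body(10 * 2**30, thin=True, max_latency_ms=5), indent=2))
    print(json.dumps(snapshot_body("before-upgrade")))
```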
Now, given that we're running close to when
we should do questions, I already
showed you the LUNs here.
I did all my three demos in one quick demo.
File management, I browse through, traverse through all the files.
They're all there.
I can go and look at each file.
I can request it.
Let me just jump back here.
If I go up to my root here, and I go into my files, and I select a given file.
Here's my file and I just pull it up on the web browser.
Standard HTTP, everything's good there.
If I go in and I pull this up,
just a standard HTTP, boom.
There's my HTTP request, my content type image, JPEG, and a bunch of binary data. But when I go and I say I want to do CDMI 1.1.1, and accept CDMI object.
And boom.
I'm seeing now my management representation.
I flipped it over.
Instead of the data, I have this JSON with this metadata,
not just things like when it was
created, its count of how many times it's been modified, who owns it, ACLs, but also things
like hash, for example, and arbitrary user metadata that can be attached to these objects,
and these are for management purposes. I can go and attach a piece of metadata that says,
I want this to be well-protected,
or I don't want this to be protected at all.
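The "flip" between the data view and the management view of the same URL comes down to request headers. A minimal Python sketch, using only the standard library and building the requests without sending them, might look like this. The URL is a placeholder for a server such as the one running on the laptop in the demo, and the helper function names are made up for this example.

```python
from urllib.request import Request

CDMI_VERSION = "1.1.1"  # the version mentioned in the demo


def data_request(url):
    """Plain HTTP GET: the server returns the object's value (e.g. JPEG bytes)."""
    return Request(url, method="GET")


def management_request(url):
    """Same URL, but ask for the CDMI representation (JSON with metadata)."""
    return Request(
        url,
        method="GET",
        headers={
            "X-CDMI-Specification-Version": CDMI_VERSION,
            "Accept": "application/cdmi-object",
        },
    )


if __name__ == "__main__":
    url = "http://localhost:8080/files/slide1.jpg"  # hypothetical address
    for req in (data_request(url), management_request(url)):
        print(req.get_method(), req.full_url, dict(req.header_items()))
```

Passing either request to `urllib.request.urlopen` would actually send it. Note that `urllib` normalizes the capitalization of header names when storing them, which does not matter on the wire because HTTP header names are case-insensitive.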
Okay, so going back, that's how you can do those.
So I guess to summarize,
you can view and manage different types of objects
in a unified namespace that covers everything
from files to LUNs to objects to NoSQL databases
to graphs, et cetera.
This is a common management API and a common approach to management.
And it's a way to bundle all these things together.
Because you have one namespace, you
can talk about all these things in a common way
instead of having to have separate APIs for each
of these.
And once again, CDMI is very extensible.
So if you guys are interested in doing these sorts of management and having interoperable
management, definitely encourage you to take a look at CDMI.
We meet every week.
We're always open to talking about ways CDMI can be used, and we'd love to see more people adopting it. So thank you
very much. If there's any questions I'd be happy to take them.
I'm also giving one more talk on Wednesday about some of the security
work we're doing related to CDMI with respect to encrypted objects and
mobility around encrypted objects.
Question.
It's about the scalability of CDMI.
Yep.
Do you have some kind of limits?
Something like, okay, we know this works for a billion objects?
Or that we can manage as it's growing,
like having this in a cluster of, I don't know, 100 nodes?
Yeah, so a lot of that's an implementation question,
but there actually are scalability issues around namespaces.
If you have, say, a flat namespace with a billion objects in it,
I can tell you that pretty much every system today will fall over. You need
to then look at different ways to manage
it. And that's where querying and metadata
starts to become important, because
just a flat listing of that scale
breaks down.
Is there some performance penalty on
having the NFS
and CIFS
being managed by CDMI?
Thanks.
Not really.
That would be on the implementation side.
CDMI doesn't impose that.
So you showed us a slide with the metadata.
Yep.
CDMI metadata.
Is there any sense of how much overhead over another object?
Let's say S3.
So CDMI over S3, that metadata over here, how much is it?
So that has three different answers.
The first answer is there's no overhead because you have to store it somewhere.
And how it's stored in the storage system is up to the storage system.
That's not something that CDMI imposes.
CDMI is just on the wire.
The second answer is about how much overhead there is for it to go over the wire. It's standard JSON, it's a
little verbose, but luckily it works with all the compression that comes along with
HTTP, etc. The third answer is, as a storage system, if you're going to even
implement this metadata how much does that cost you?
And ultimately, the answer to that question is how much value does it bring you?
Question in the back.
You mentioned credentials.
One of the things I wanted to come back to a little bit is that CDMI standardizes the flow of credential information, but CDMI doesn't standardize
how that's done.
So let's say you include a Kerberos ticket or a key, an access key in your HTTP request.
CDMI defines how that gets routed,
but it doesn't define the actual interpretation of that.
So I do a delete.
I include my credentials,
however those are represented in the request.
CDMI says you're doing it against this object.
This object belongs to this domain.
This domain's associated with this system,
whether that's LDAP or Active Directory or something else, it routes my credentials through to that
system which then resolves my credentials and comes back with an ACL
name. That's outside of the CDMI spec. CDMI is just doing the routing. Once the
ACL name comes back, then CDMI, because CDMI represents things as ACLs over the wire, the system can map the ACL name against the operations.
But ultimately, that's the internals of the system.
It doesn't even have to use ACLs internally.
We just represent them as ACLs.
And ACLs are optional.
So yeah, it's routing as opposed to standardizing authentication.
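The routing steps just described (object, then domain, then authentication system, then ACL name, then operation check) can be sketched in Python as follows. Every table, path, user name, and credential string here is a made-up stand-in for this illustration. A real system would call out to LDAP, Active Directory, Kerberos, and so on, and need not use ACLs internally at all.

```python
# Hypothetical lookup tables standing in for a real CDMI server's state.
OBJECT_DOMAIN = {"/files/report.pdf": "/cdmi_domains/engineering/"}
DOMAIN_AUTH_SYSTEM = {"/cdmi_domains/engineering/": "ldap"}
OBJECT_ACL = {"/files/report.pdf": {"alice": {"read", "delete"}, "bob": {"read"}}}


def resolve_ldap(credentials):
    """Stand-in for a directory lookup: credentials in, ACL name out (or None)."""
    return {"alice:secret": "alice", "bob:hunter2": "bob"}.get(credentials)


# One resolver per authentication system; how each interprets the
# credentials is outside what CDMI defines.
RESOLVERS = {"ldap": resolve_ldap}


def authorize(path, credentials, operation):
    """Route credentials to the object's domain and check the resulting ACL name."""
    domain = OBJECT_DOMAIN[path]               # object belongs to a domain
    system = DOMAIN_AUTH_SYSTEM[domain]        # domain names its auth system
    acl_name = RESOLVERS[system](credentials)  # that system resolves the credentials
    if acl_name is None:
        return False
    return operation in OBJECT_ACL.get(path, {}).get(acl_name, set())


if __name__ == "__main__":
    print(authorize("/files/report.pdf", "alice:secret", "delete"))
    print(authorize("/files/report.pdf", "bob:hunter2", "delete"))
```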
Do you assume a better term for authentication?
Pretty much, I mean, people have implemented S3,
Keystone, Kerberos, HTTP basic, custom authentication.
It's really agnostic to that.
We just add in that routing layer so you can have the namespace.
So how do you ensure that your other team can find this?
So once again,
you can discover what authentication
methods are
required and or supported.
And that can be on an object
by object basis. You may have a namespace
where a bunch of objects, anybody
can access with no authentication, a bunch
of objects where HTTP basic
and digest are good enough,
you know, TLS mandated, and a bunch of objects where you need Kerberos, all in one namespace.
And that's going to be important as we go global beyond single client, you know, hybrid
cloud because you want your security policies to go with the object.
Good question. Any further questions before I get ready for the next
presenter? Oh, well, thank you all. Thanks for listening. If you have questions about the
material presented in this podcast, be sure and join our developers mailing list by sending an
email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic further
with your peers in the developer community.
For additional information about the Storage Developer Conference,
visit storagedeveloper.org.