Storage Developer Conference - #110: Datacenter Management of NVMe Drives
Episode Date: October 8, 2019...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts.
You are listening to SDC Podcast, Episode 110. My name is Mark Carlson. I'm
co-chair of the Technical Council with Bill
and co-chair of the Object Drive TWG with Bill,
co-chair of the Cloud TWG with David.
And what I'm going to talk about today is...
Which one?
Hmm, which one?
Co-chair of the Provisional...
Oh, yeah, that's just a provisional thing, yeah.
So there's some recent work in NVMe I'll talk about with the management interface.
It was expected to be released by now, but they keep delaying it.
So I can't really say much about that yet. But there's some technology coming in from the DMTF we should look at,
possibly around what's called binary encoded JSON, as well as Redfish device enablement.
So let's jump right in.
When you look at what it means to scale out management,
there's a data center, right?
And the way hyperscalers run their data center is they have trays of drives that fit in racks.
And then each rack has these trays of drives
and maybe some top of rack management.
Maybe there's some compute in there with the storage.
And then a storage system is really a row of these racks.
So this system has dozens of...
This data center has dozens of systems,
but the entire rack is what we would call a storage system
and provides an instance of whatever their software-defined storage is.
So how do you manage that?
Suppose all these are NVMe drives in here.
You're going to go into each of the hosts that are connected to those drives, and you're going to have a little agent that you put there,
and that's going to gather up the drive information and send it back to observability and data center management. So the host-based agents are mainly deployed in those hyperscaler systems,
but that doesn't work for vendors of enterprise storage.
Why?
Because it doesn't work to sell somebody a system and say,
you must run my software on it.
That's how we manage our drives.
Whereas in hyperscalers, they have complete control
over whatever's running on every system,
so they can just add a host agent to it.
It's no big deal.
But they have sort of custom management software
that's managing their systems and trays and racks and pods. And
it's constantly changing. They're pushing out new versions of the software multiple
times a day in some cases. And they don't have a separate management network. Everything
is on one network and the entire data center, the entire complex, in fact, worldwide,
they don't want to create storage networks.
They don't want to create management networks.
They just want a network that talks to everything.
And so that networking traffic, if it was heavy networking traffic,
would be taking away from the networking bandwidth
that they can use to sell to their customers.
So you need to be able to throttle back the management traffic,
maybe just get events.
When you get certain events,
then you want to go and dump a bunch of log files and other things
to find out what might be going on.
And so network quality of service comes into play
where you do want to get those alerts.
But then when you actually start examining things, maybe that has a lower priority than the other customer-based traffic that's on there.
And it uses in-band commands to the drive.
We'll talk about why that is in a little bit.
But there's also this other route.
It's a BMC, a Baseboard Management Controller.
It's a piece of management hardware
that has a slow little serial bus
that talks to all the components,
not just drives, but NICs and other components
that are sitting inside that system.
And so most enterprise boxes use a BMC to manage the box, the whole box.
But in the hyperscalers' case, it's mainly just controlling the fans.
That's it.
Its job is to keep that system, that tray of drives, at a certain temperature by spinning the fans at a certain speed.
Only when it can't do that would it even notify the host agent and so forth.
Any questions so far?
All right.
So what sort of operation and metrics are they looking for?
As I said, the BMC is mainly used to automate the box,
and the host agent manages the NVMe drive very similarly to the previous SCSI and SATA drives,
which did not have an external management port.
So what it does is it sends, in band, a bunch of NVMe admin commands
and gets log pages, temperatures,
and wear-level information.
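Just to make that in-band path concrete, here is a minimal sketch of the kind of loop such a host agent might run. It shells out to the nvme-cli tool (assumed to be installed) and pushes the result upstream; the device path, the collector URL, and the JSON key names are all illustrative assumptions, not anything from the talk.

```python
#!/usr/bin/env python3
"""Minimal host-agent sketch: gather NVMe health data in band and ship it upstream.

Assumes the nvme-cli tool is installed; the device path, collector URL, and JSON
key names are illustrative and may vary by nvme-cli version.
"""
import json
import subprocess
import urllib.request

DEVICE = "/dev/nvme0"                               # example device path
COLLECTOR = "http://observability.example/ingest"   # hypothetical endpoint


def get_smart_log(device: str) -> dict:
    # 'nvme smart-log' issues the Get Log Page admin command in band
    # and can emit the result as JSON.
    result = subprocess.run(
        ["nvme", "smart-log", device, "--output-format=json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


def ship(metrics: dict) -> None:
    # Push the JSON upstream. A real agent would batch, retry, and throttle
    # so management traffic doesn't compete with customer traffic.
    req = urllib.request.Request(
        COLLECTOR,
        data=json.dumps(metrics).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)


if __name__ == "__main__":
    log = get_smart_log(DEVICE)
    ship({
        "device": DEVICE,
        "temperature": log.get("temperature"),      # key names vary by version
        "percent_used": log.get("percent_used"),
    })
```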
But there's also sort of active management as well.
You can update the firmware.
You can format the drive.
You can create and
delete namespaces all through those admin commands. And so it's for historical reasons,
right? That's how they wrote their host agent before. And moving over to NVMe wasn't too
difficult. I know of none of those host agents that use the MI interface,
even though we added the in-band NVMe admin command
to do MI.
So there's a...
MI is still not widely adopted.
We are starting to see actual systems
from enterprise storage vendors and server vendors
where the BMC can talk NVMe at some point.
But, like I said, most of the
host agents are going to use the NVMe admin commands.
They could do an NVMe-MI Send/Receive admin
command, which does the equivalent of sending it through the internal low-speed network.
And then if the host agent talks to the BMC at all,
in a hyperscaler environment, it's just, you know,
hey, I can't keep the temperature of the box down. And in an enterprise case,
you may be using the BMC to manage the box.
And it's not just controlling the fans in that case.
It's actually proxying for the management information
coming from each of these components down here.
In the case of the hyperscaler,
the host agent does all the proxying.
Okay?
So what's missing?
Well, there is no host agent support for MI,
so if they wanted to use the send and receive MI
from the host agents,
they'd have to write that code, extract their metrics from it for their own observability purposes.
And then if you wanted to access it via the slow-speed network, you could use SMBus.
Otherwise, VDM is vendor-defined messaging within the PCIe standard.
But our command line NVMe tool has not been extended to use that command.
And that's likely what they would use in some sort of script.
BMC support, if you're talking to BMC in, let's say, a tier two cloud environment
where these enterprises are looking to use some of the same techniques as the hyperscalers,
but in their new data center build-outs,
they don't have the wherewithal to actually write their own management software.
So they'd like to have some example code, at least, to cut and paste from.
So when you look inside a BMC,
it's actually running perhaps a version of Linux, embedded Linux, right?
And so we need to add MCTP support for that
if that BMC is going to be used in a similar way
as the enterprise guys use it.
And then you need an adapter for external BMC connection
if you want to have a management network.
The major vendors would need to support that as well,
people like AMI.
And then there is a project,
OCP, Open Compute Project,
called OpenBMC.
And so that might be a place
that we could add some firmware as well.
But it's not clear that
the people that pick up OpenBMC are going to
use the BMC in the same way
as the enterprise guys or as the hyperscaler
guys.
But the real issue is
proliferation of models because
NVMe MI is a model.
It's the old
bit packing kind of model
where this field and this bit
means this particular thing, right?
And again, that's historical.
That's the way SCSI and ATA have done it for decades.
So why not?
And then hyperscalers,
each hyperscaler has their own model
because they developed it from scratch
and they didn't talk to each other
and they didn't see any of our standards
because it was all their own software.
So they're the only ones that can write any new code
to take advantage of any new interface like MI.
And so far they have not done that.
And then if you are using a BMC as a proxy for the box management,
that has a model.
We've had the SCSI Enclosure Services (SES) standard as a model for those BMCs forever.
And NVMe-MI decided, hey, well, let's just use that.
So through the MI interface,
you can actually send SES queries and responses
through that interface now,
and that's yet another model.
And then the real issue comes in
when I'm writing a data center manager
and I have to talk to all this stuff,
how many different models do I have to adapt
my own way of programming
to actually execute down on the devices that I'm managing?
So if I can use something in common that seems to be working
and seems to be rolling out and seems to have success,
I'm way ahead of doing my own thing.
So enter Redfish.
Redfish, Jeff Hilland gave a great talk yesterday.
It's a DMTF standard specifically for scale-out data center management.
And it uses REST.
You've probably been hearing about that for 15 years now.
And it is based on other standards as well,
such as OData and other IETF standards.
And the basic philosophy is to have a set of URLs
to describe all the things in the data center.
And wherever possible, we want each component
to sort of represent its
functionality directly in Redfish.
Because right now,
there's a bunch of situations
where the drive
is giving you
structures with bits and bytes,
and what the manager
is asking for is, I want a JSON
object that describes this in something that's human-readable.
So Redfish is becoming more and more widely adopted.
The Intel RackScale design uses Redfish and Swordfish,
and if we're producing components that have a Redfish model inside them,
that may allow the hyperscalers to move over to that common model
or at least have a single adapter between their internal model and Redfish.
And so that could be one adapter per hyperscaler,
and that might also increase its adoption as well.
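To give a feel for what that REST access looks like, here is a hedged sketch of fetching a Drive resource from a Redfish service with Python. The host name, credentials, and the exact chassis/drive path are placeholders; real services differ.

```python
import requests  # third-party package, assumed available

BASE = "https://bmc.example.com"   # placeholder Redfish service endpoint
AUTH = ("admin", "password")       # placeholder credentials

# A typical, but implementation-specific, path to a drive resource.
drive_url = f"{BASE}/redfish/v1/Chassis/1/Drives/NVMeSlot1"

resp = requests.get(drive_url, auth=AUTH, verify=False, timeout=10)
resp.raise_for_status()
drive = resp.json()  # a plain, human-readable JSON object

# Drive properties defined by the standard Redfish schema.
print(drive.get("SerialNumber"),
      drive.get("CapacityBytes"),
      drive.get("Status", {}).get("Health"))
```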
So if you look in here, there's basically two kinds of networks.
The SMBus I2C network is an inside-the-box network, right?
And there's multiple reasons why you want to do that, but the main reason is cost and complexity, right? So if these things need to talk to each other inside the box,
you don't want to have to slap on an Ethernet connection to do that.
So the MI then runs on top of MCTP, which runs on top of SMBus I2C.
There's like an EEPROM in there
that has the static information, et cetera.
But what's useful outside the box
is a Redfish interface
that runs over a TCP/IP network, using HTTP,
and it just stands on the shoulders of everybody else
that's already figured out things like networking problems.
And then the NVMe admin commands down here are in band
and do pretty much the same thing you can do out of band
through these simple networks.
So in this case, maybe the BMC is converting
internal MI commands to Redfish here.
Same with the JBOF server, or JBOF.
You may be using a BMC to proxy all the management of the box.
In this case, there is no possibility of having a host agent
unless, you know,
you've got some sort of PCIe connection
between the JBOF and the server,
which is lightning, basically.
So to convert from NVMe to Redfish,
we need some sort of adaptation software.
And this could be an open source project.
It could be under OCP, it could be under SNIA,
that would adapt from the NVMe MI commands
and so forth to Redfish, back and forth.
And this one piece of software,
because it understands both this and this,
can be used by multiple drives
for at least the core elements of MI.
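A rough sketch of what that adaptation layer might do is below. It assumes some helper has already parsed the Identify and SMART/Health data out of the drive; the input field names, and the exact Redfish schema version, are illustrative assumptions.

```python
def nvme_to_redfish_drive(identify: dict, smart: dict) -> dict:
    """Map fields gathered via NVMe admin/MI commands onto a Redfish-style Drive object.

    'identify' and 'smart' are assumed to be already-parsed structures (e.g. from
    Identify Controller and the SMART / Health log); the key names are illustrative.
    """
    return {
        "@odata.type": "#Drive.v1_9_0.Drive",   # schema version chosen as an example
        "Id": identify["serial_number"],
        "Model": identify["model_number"],
        "SerialNumber": identify["serial_number"],
        "Revision": identify["firmware_revision"],
        "Protocol": "NVMe",
        "MediaType": "SSD",
        "CapacityBytes": identify["total_capacity_bytes"],
        "PredictedMediaLifeLeftPercent": 100 - smart["percent_used"],
        "Status": {"Health": "Warning" if smart["critical_warning"] else "OK"},
    }
```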
Now, if you're selling drives into enterprise storage vendors,
they may tell you to do some other things
besides what's in the standard.
And that's fine, because they own the adaptation software.
But if you do that and you offer it to all your customers,
everybody has to adapt that little piece that's extra
because it's not in the standard,
and the standard adaptation software
isn't going to convert it either.
Similar situation over here.
Your adaptation software now is using in-band commands, but it's the same semantics, right?
You're pulling out NVMe structures, and you're converting those to JSON objects and sending them back to the data center manager.
So one of the things is the other side of the interface is the Redfish interface.
And that's, you know, is there a standard Redfish model for NVMe drives?
And no, there isn't.
There's drives, right?
And it's pretty simplistic, right?
That's what's in Redfish today. The idea is we can do a profile of an NVMe drive,
and then everybody implements those same parameters,
and now we have interoperability.
This management software doesn't need to be rewritten
when a new NVMe drive comes out.
Redfish is an easily extensible model,
so anybody can add their own proprietary features to it
without breaking everybody else.
So what a profile does is it establishes which functions,
which properties must be filled in
in order to pass a conformance test, for example.
And DMTF and SNIA both have some open source tools
that can generate sort of test clients from that profile,
run those test clients against the implementation
and say what passed and what failed.
And then the beauty is you can extend a profile as well.
So I could have a Toshiba Memory profile that takes the standardized Redfish profile for NVMe drives
and adds anything that our various customers have asked for on one-off drives.
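For flavor, here is a trimmed-down sketch of what such a profile might look like, loosely following the DMTF interoperability-profile layout. The profile name, version, and property choices are illustrative only, not the actual TWG output.

```python
# A trimmed-down, illustrative profile: it lists the Drive properties an NVMe
# drive's Redfish model would have to supply in order to conform. The names and
# requirement levels here are examples, not the actual TWG profile.
NVME_DRIVE_PROFILE = {
    "ProfileName": "ExampleNVMeDrive",
    "ProfileVersion": "0.1.0",
    "Resources": {
        "Drive": {
            "PropertyRequirements": {
                "SerialNumber":  {"ReadRequirement": "Mandatory"},
                "Model":         {"ReadRequirement": "Mandatory"},
                "Revision":      {"ReadRequirement": "Mandatory"},    # firmware revision
                "CapacityBytes": {"ReadRequirement": "Mandatory"},
                "Protocol":      {"ReadRequirement": "Mandatory"},    # expected "NVMe"
                "Status":        {"ReadRequirement": "Mandatory"},
                "PredictedMediaLifeLeftPercent": {"ReadRequirement": "Recommended"},
            }
        }
    },
}
```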
Any questions so far?
All right. So the Open Compute Project is using these Redfish profiles
to develop a profile for network interface cards.
And Sneha has some experience with this,
creating a standard profile for IP-based drives.
So the Object Drive TWG is now working on an NVMe Redfish profile
that would be in common across all NVMe drives.
And so part of what we've been doing
is just bringing everybody up to speed in SNIA
on what NVMe is and NVMe MI.
So we will document that profile.
It's a JSON object body.
And we'll review it with the NVMe Management Interface Working Group.
And then the idea is to register it with DMTF so that any application can go out,
fetch that NVMe Redfish profile off of their website
and use it either to compile or run their management software.
Okay?
So then there's this thing called Redfish device enablement.
And this is a way of doing Redfish implementations
on something that may not have enough resources
to do a full-on TCP, IP, HTTP kind of implementation.
So the idea is convert this very verbose
Redfish model in JSON
back down to one of these compact structures
with bits and bytes
associated with each location.
And that's what the binary encoded JSON (BEJ) I'll talk about in a minute
attempts to do.
But this supports a range of capabilities
from primitive to more advanced devices.
And it uses something called PLDM,
which is the Platform Level Data Model,
which really is going into the management of each kind of device type,
including things like firmware management of devices, et cetera.
And if you live in a world where the BMC is the center of the management universe,
it makes a lot of sense for that BMC code to use one protocol
and one way of doing things to all the devices that it's managing.
But as I said, hyperscalers couldn't care less, because they don't want to use that
cheap internal network to do the management.
Costs money.
So what are we talking about?
Well, so on the left here, and I'm sorry that this is white and you really can't see it,
this is the HTTP operation.
So if you're familiar with HTTP,
it's got verbs like get, put, patch, post, delete, and head.
That is part of the HTTP protocol.
And so if Redfish is going to work, we need equivalent RDE operations that do read, replace, update,
some sort of action or create for a post.
You know what a post is, right?
Delete is delete, and head reads headers.
If we're going to add Redfish support to a device,
we have to have some sort of functions like this
that match the HTTP verbs that we started with.
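The correspondence being described boils down to roughly this mapping (the RDE operation names here paraphrase the description above rather than quote the spec):

```python
# Rough HTTP-verb to RDE-operation correspondence, as described above.
# Operation names paraphrase the talk; see DMTF's RDE specification for the exact set.
HTTP_TO_RDE = {
    "GET":    "Read",
    "PUT":    "Replace",
    "PATCH":  "Update",
    "POST":   "Action / Create",
    "DELETE": "Delete",
    "HEAD":   "Head (read the headers)",
}
```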
Now, having a sort of compressed form
of that JSON object body, message body,
means that there's some negotiations you have to do.
Because the way you make it so compact
is you get rid of the verbose keys.
You know, you have a key value, key value, key value,
property name value, property name value kind of thing.
So instead, they replace it with just a number.
Okay, for this particular JSON object body, timestamp equals number three.
And there's a dictionary that lets you reverse that.
So when you get from the device, this is property number three.
Here's something that looks like a time.
Take that and put it in a JSON body with the actual verbose timestamp property name in it.
And then so you've got to get the dictionary out of the device
that tells you how to do that.
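Here is a heavily simplified sketch of that dictionary idea. Real BEJ is a binary, tuple-based encoding with dictionaries defined per schema; this toy version only shows the name-to-number substitution and its reversal.

```python
# Toy illustration of dictionary-based compaction: verbose property names are
# replaced by small integers on the wire, and the dictionary lets the consumer
# reverse it. Real BEJ encodes binary tuples, not a JSON-shaped dict like this.
DICTIONARY = {1: "Id", 2: "SerialNumber", 3: "Timestamp", 4: "Temperature"}
REVERSE = {name: num for num, name in DICTIONARY.items()}


def compact(body: dict) -> dict:
    """Replace verbose property names with their dictionary numbers."""
    return {REVERSE[key]: value for key, value in body.items() if key in REVERSE}


def expand(packed: dict) -> dict:
    """Turn dictionary numbers back into the verbose Redfish property names."""
    return {DICTIONARY[num]: value for num, value in packed.items()}


# expand(compact({"Timestamp": "2019-09-24T10:00:00Z"})) round-trips the body.
```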
And there's also a way to get the actual schema URI.
And remember I said there's a schema up on the website,
DMTF website.
This is that URI to that website
so that anything that wants to manage this
can go and get that piece of the Redfish schema
to be able to interpret what's going on.
So I mentioned this binary encoded JSON,
and it's really, you know,
because of the design choices around the slow network and devices with very little RAM and other resources to process this stuff,
the idea is that, look, you're going to need to have a common way, a common format across all these devices. NVMe-MI works fine for NVMe
solid state drives, but does it work for NICs? Does it work for GPUs, FPGAs, right? So that's why
the movement towards Redfish allows us to create something, perhaps in NVMe, that's agnostic to
even the fact that it's a storage device.
In other words, if these functions are just passing Redfish objects back and forth,
you could use NVMe to now manage a GPU device or an accelerator device
or any of the computational storage-type devices that are being talked about here this week.
So that's the binary encoded JSON.
There's a spec out there.
It's part of the RDE for PLDM spec.
But I'm not sure we really even need this at this point for these NVMe devices because NVMe devices seem to have a lot more resources
than some of the old SCSI and SATA drives that we're maybe used to.
And maybe they can be more verbose.
You've got a frigging submission queue and completion queue in NVMe.
You're really not worried about verbosity because that's going to happen
like that.
And so maybe the approach that they're using on the sort of management,
you know, enterprise storage manager where BMC is part of the world,
might instead be used for sort of an in-band set of stuff.
So,
can an NVMe device support BEJ?
it's really just a different encoding.
It's still a structure with bits packed into bytes
packed into double words, right?
But we do need to support additional commands,
the commands that correspond to the HTTP verbs.
Whether it's BEJ or not,
and whether it's a combination of BEJ and dictionary or not,
I'm not sure it's needed.
But there will be discussions in these various groups. But the real beauty is these
vendor-specific functions and properties are really just an extension of the profile. So it's
very easy to say, well, I can interoperate you on the basis of what's in DMTF, or maybe I can write
some extensions for the newest features in NVMe, but I also have some properties that aren't in any spec that I give to the public.
So you don't need to keep adapting to Redfish in that case.
But if you do have to have an adapter, it's greatly simplified
because now I can have a generic adapter that really doesn't understand anything about the device
because the device is giving me Redfish.
All I'm doing is converting the BEJ compaction
into the regular JSON output, back and forth.
And I can be doing that adaptation
without understanding the semantics of what the actual device is producing,
which is really cool.
So we can use this for the computational storage guys as well.
Speaking of which...
So, you know, the whole approach
is you're disaggregating components of a server or a hyperconverged thing.
It's like hyperconverged took the pendulum all the way to one side, and it's starting to swing back the other way, right?
And so if they use an NVMe interface, how are they going to manage them, right? If you manage them via this network, you know, SMBus I2C,
that restricts them to maybe all being in the same enclosure.
Do you want that?
Even though you might have five enclosures,
are you going to go to five different BMCs to sort of create a picture of the
whole system or what, right?
And so do BMCs still even make sense
when you've got all these disaggregated components
sitting on a fabric somewhere?
And then why not just run the agent
in a computational component, right?
So you've got something that's not being used right now
for a CPU, why not turn it into a host agent,
quote-unquote host agent?
It's really a floating agent now
in that it can be discovering
and talking to all those things on the NVMe fabric
and then bringing them back in
and giving the user option to, let's say,
compose a virtual system out of these components,
get some namespace from this NVMe drive,
you know, set up this accelerator,
set up this analytics that's going to run on the accelerated data,
et cetera, right?
So you sort of, like, create a virtual system,
use it for some tasks, and then put it back in the pool.
And in that case, you know,
having the BMCs try and get in the middle of that,
if anything, right?
You really just want to be able to send
sort of in-band NVMe commands
to whatever component you're drawing resources from.
So if we can add some admin commands for Redfish,
then the different components can each provide their own profiles.
I have a GPU profile.
I have an FPGA profile.
I have a data analytics profile.
And the NVMe interface didn't have to change
because it's still just fetching Redfish.
And so then you can get management
that actually can talk to all the things
it has to compose something out of on the fly
for a particular case.
Comments?
Yeah.
So this is just up here. Right.
The question is what?
Okay. Right, exactly.
That's right.
So it's just a comment on the second bullet here about implementing these new admin commands
with the Redfish functionality.
And one of those would be download by dictionary, for example.
Yeah, you only need a dictionary if you're using BEJ,
if you're compacting things, right?
So dictionary tells you how to pull it apart.
But it could be the NVMe admin commands
just put on the completion queue
a JSON message body,
and the adaptation is like,
take that and put it in a TCP/IP packet.
That's all they have to do, right?
And that's ideal.
I mean, you don't want to have to put
an Ethernet port on every drive. That doesn't work.
You know, somewhere there's got to be an Ethernet switch now,
even though you're using Fibre Channel for the data path, right?
But if you're using something like Fibre Channel or NVMe over Fabrics,
the fact that you can do in-band command now means you can manage a remote drive,
and you don't have to have one of these inside-the-box networks for it.
If it's in-band, it doesn't matter what the transport is.
It can be any of the NVMe over Fabrics transports.
In the future, it can be TCP/IP.
So I think what we're doing is kind of laying the foundation
for all the computational storage elements as well
to pick up the same way of doing things.
So then if you've done that, you have a standardized profile.
This adaptation software in the BMC case
is doing the BEJ dictionary compaction.
But over here for the host agent,
it's really simple.
Like I said, if it's able to get a JSON message body
out of the NVMe drive,
it's just wrapping it with TCP/IP and HTTP.
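In that simple case, the host-agent side could be little more than the following sketch, where read_redfish_body() is a hypothetical stand-in for an admin command that returns a ready-made JSON body from the drive.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import json


def read_redfish_body(path: str) -> bytes:
    # Hypothetical stand-in for an NVMe admin command that returns a Redfish
    # JSON body from the drive; here it just fabricates a tiny example object.
    return json.dumps({"@odata.id": path, "Protocol": "NVMe"}).encode()


class AgentHandler(BaseHTTPRequestHandler):
    # The agent adds no semantics of its own: it wraps whatever JSON the device
    # produced in HTTP over TCP/IP and hands it to the data center manager.
    def do_GET(self):
        body = read_redfish_body(self.path)
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), AgentHandler).serve_forever()
```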
And it's really just a matter of, you know,
does the BMC need to get the data out of here using some sort of compact representation?
And some of the enterprise storage guys I talked to said,
no, we already implement MI.
Why would we want a different way of doing it?
So they're doing that anyway.
Okay, so I talked a little about PLDM.
It's DSP0240,
and it is the basis of the RDE work
that includes the badge and the dictionary
and the new verbs
and again it uses MCTP,
maybe a later version than MI does.
But enterprise system vendors
are pushing their device vendors
in that direction, to implement PLDM.
But, of course, that is not compatible with NVMe MI.
So, you know, once NVMe-MI is adopted, it's going to be there for a while, and you're going to have to support it.
And that's the unfortunate thing.
And the first type of device that they're pushing this PLDM on is NICs,
and they're having variable success on that.
That's the OCP NIC.
Because they put out this perfectly adequate standard called NCSI that they all adopted.
And now they're saying, well, we got NCSI.
Why do we have to do PLDM?
Sounds familiar.
So if you can imagine, PLDM is now the highest-level protocol running on this little SMBus network.
Then BEJ, of course, goes with the RDE.
So in the BMC-centric world, you know,
you can expect probably in a couple years to get custom requirements for supporting PLDM.
For the host agent, I'm not sure PLDM even makes sense, right?
It's for a network that they don't want to use.
Any questions on that?
So does that mean that if my drive has out-of-band MI connections,
SMBus, and supports in-band, I'm going to have to support both? That's what I'm trying to get away from.
See, if inside the device, right, we can implement Redfish
instead of something for this connection and something different for this connection,
which is today's story, right? We as drive manufacturers win, right?
We have one place to do the instrumentation,
one format to put it in.
You outside world just figure it out.
Which one do you want to use, right?
So, and, you know,
things like Gen-Z are changing the computer architectures
going forward.
And so there may not be such a thing as a host anymore, right?
Everything's on the bus.
It all works peer-to-peer.
And you have to manage it like that.
You really can't depend on there being one place where everything connects to like a host.
Yeah?
Right, so I think your question is along the lines of,
given some RAM, would I be better off making my drive go faster with it or doing better management with it, right?
And inside your customer, you'll find that there's camps in both.
Say, like, if I can't manage it, I don't want it, right?
And on the other side, it's like, can't you squeeze a few more IOPS out of
this?
Yeah, that takes resources.
Yeah,
that does take resources, but
typically we're not so concerned with
the management performance, more
the data path
performance, right?
And at least my understanding is, in the case of BEJ,
it doesn't require any more resources than the existing MI.
It's almost as compact a format.
And the real difficulty in doing that management software on the device
is reaching out to all the places where you have to gather metrics, right? It's not packing it into a structure to send back. You pack it into the
structure, you send it, and you free up that memory. So it's very short-term use of that
kind of resource. Does that make sense? Okay?
So what about Ethernet NVMe drives?
Now, this is the best world we could have because if you have an Ethernet port on your drive already,
you can put a little TCP IP stack there just to do management
and do HTTP over that, put a little web server on there,
and you don't even need any adaptation at all.
You can be a Redfish endpoint directly
and serve up your information
if the industry goes sort of to Ethernet drives.
I mean, and there's different ways to do Ethernet, too.
RoCE is one kind of Ethernet.
And, you know, at FMS,
we were already seeing sort of RoCE drives, right? So NVMe over Fabrics talking directly to the drive.
That's kind of interesting,
especially if you can get the Ethernet components down.
And then we do have a project to do NVMe over Fabrics for TCP/IP.
And, of course, if you want to do TCP/IP in the data path,
your drive gets a lot more expensive
because now you need to sort of execute that TCP/IP stack.
Whereas in Ethernet, you're just picking apart the bytes
in the submission queue and the completion queue.
So you don't, you know,
you could do an Ethernet drive
that supports TCP/IP just for management,
not for actual data path.
And then you can do the scale-out management
without the need for a host and a host-based agent
or a BMC, both of which are sort of intermediaries.
When you get away from the intermediaries,
that increases the scale-out as well.
Does that make sense?
Okay?
All right.
So I want to tell people about our BOF this evening.
It's the NVMe BoF, and we did this last year.
It was very popular.
I think it's going to be in here again tonight at about 7 o'clock.
And every year we invite this Bay Area NVMe meetup group
that Olga runs, if you know Olga.
So it'll be a combination of SDC attendees
and this local meetup group.
That should be fun.
It's a panel.
Bill will be on it.
Got your slides ready, Bill?
Thank you. Thank you.
Thanks for listening.
If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe@snia.org.
Here you can ask questions and discuss this topic further with your peers in the Storage Developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.