Storage Developer Conference - #110: Datacenter Management of NVMe Drives

Episode Date: October 8, 2019

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, Episode 110. My name is Mark Carlson. I'm co-chair of the Technical Council with Bill and co-chair of the Object Drive TWG with Bill, co-chair of the Cloud TWG with David.
Starting point is 00:00:54 And what I'm going to talk about today is... Which one? Hmm, which one? Co-chair of the Provisional... Oh, yeah, that's just a provisional thing, yeah. So there's some recent work in NVMe I'll talk about with the management interface. It was expected to be released by now, but they keep delaying it. So I can't really say much about that yet. But there's some technology coming in from the DMTF we should look at,
Starting point is 00:01:30 possibly around what's called binary encoded JSON, as well as Redfish device enablement. So let's jump right in. When you look at what does it mean to scale out management, there's a data center, right? And the way hyperscalers run their data center is they have trays of drives that fit in racks. And then each rack has these trays of drives and maybe some top of rack management. Maybe there's some compute in there with the storage.
Starting point is 00:02:18 And then a storage system is really a row of these racks. So this system has dozens of... This data center has dozens of systems, but the entire rack is what we would call a storage system and provides an instance of whatever their software-defined storage is. So how do you manage that? Suppose all these are NVMe drives in here. You're going to go into each of the hosts that are connected to those drives, and you're going to have a little agent that you put there,
Starting point is 00:02:53 and that's going to gather up the drives and send it back to observability and data center management. So the host-based agents are mainly deployed in those hyperscaler systems, but that doesn't work for vendors of enterprise storage. Why? Because it doesn't work to sell somebody a system and say, you must run my software on it. That's how we manage our drives. Whereas in hyperscalers, they have completely control over whatever's running on every system,
Starting point is 00:03:29 so they can just add a host agent to it. It's no big deal. But they have sort of custom management software that's managing their systems and trays and racks and pods. And it's constantly changing. They're pushing out new versions of the software multiple times a day in some cases. And they don't have a separate management network. Everything is on one network and the entire data center, the entire complex, in fact, worldwide, they don't want to create storage networks.
Starting point is 00:04:09 They don't want to create management networks. They just want a network that talks to everything. And so that networking traffic, if it was heavy networking traffic, would be taking away from the networking bandwidth that they can use to sell to their customers. So you need to be able to throttle back the management traffic, maybe just get events. When you get certain events,
Starting point is 00:04:33 then you want to go and dump a bunch of log files and other things to find out what might be going on. And so network quality of service comes into play where you do want to get those alerts. But then when you actually start examining things, maybe that has a lower priority than the other customer-based traffic that's on there. And it uses in-band commands to the drive. We'll talk about why that is in a little bit. But there's also this other route.
Starting point is 00:05:05 It's a BMC, a Baseboard Management Controller. It's a piece of management hardware that has a little slow little bus, serial bus that talks to all the components, not just drives, but NICs and other components that are sitting inside that system. And so most enterprise boxes use a BMC to manage the box, the whole box. But in the hyperscale's case, it's mainly just controlling the fans.
Starting point is 00:05:40 That's it. Its job is to keep that system, that tray of drives, at a certain temperature by spinning the fans at a certain speed. Only when it can't do that would it even notify the host agent and so forth. Any questions so far? All right. So what sort of operation and metrics are they looking for? As I said, the BMC is mainly used to automate the box, and the host agent manages the NVMe drive very similarly to the previous SCSI and SATA drives,
Starting point is 00:06:21 which did not have an external management port. So what it does is it sends in band a bunch of NVMe admin commands and gets to log pages, temperatures, and where level information is. But there's also sort of active management as well. You can update the firmware. You can format the drive. You can create and
Starting point is 00:06:45 delete namespaces all through those admin commands. And so it's for historical reasons, right? That's how they wrote their host agent before. And moving over to NVMe wasn't too difficult. I know of none of those host agents that use the MI interface, even though we added the in-band NVMe admin command to do MI. So there's a... MI is still not widely adopted. We are starting to see actual systems
Starting point is 00:07:22 from enterprise storage vendors and server vendors where the BMC can talk NVMe at some point. But, like I said, most of the host agents are going to use the NVMe admin commands. They could do a NVMe MI send receive admin command, which does the equivalent of sending it through the internal low-speed network. And then if the host agent talks to the BMC at all, in a hyperscaler environment, it's just, you know,
Starting point is 00:07:58 hey, I can't keep the temperature of the box down. And in an enterprise case, you may be using the BMC to manage the box. And it's not just controlling the fans in that case. It's actually proxying for the management information coming from each of these components down here. In the case of the hyperscaler, the host agent does all the proxying. Okay?
Starting point is 00:08:32 So what's missing? Well, there is no host agent support for MI, so if they wanted to use the send and receive MI from the host agents, they'd have to write that code, extract their metrics from it for their own observability purposes. And then if you wanted to access it via the slow-speed network, you could use SMBus. Otherwise, VDM is vendor-defined messaging within the PCIe standard. But our command line NVMe tool has not been extended to use that command.
Starting point is 00:09:19 And that's likely what they would use in some sort of script. BMC support, if you're talking to BMC in, let's say, a tier two cloud environment where these enterprises are looking to use some of the same techniques as the hyperscalers, but in their new data center build-outs, they don't have the wherewithal to actually write their own management software. So they'd like to have some example code, at least, to cut and paste from. So when you look inside a BMC, it's actually running perhaps a version of Linux, embedded Linux, right?
Starting point is 00:10:04 And so we need to add MCTP support for that. perhaps a version of Linux, embedded Linux, right? And so we need to add MCTP support for that if that BMC is going to be used in a similar way as to the enterprise guys. And then you need an adapter for external BMC connection if you want to have a management network. The major vendors would need to support that as well, people like AMI. And then there is a project,
Starting point is 00:10:32 OCP, Open Compute Project, called OpenBMC. And so that might be a place that we could add some firmware as well. But it's not clear that the people that pick up OpenBMC are going to use the BMC in the same way as the enterprise guys or as the hyperscaler
Starting point is 00:10:52 guys. But the real issue is proliferation of models because NVMe MI is a model. It's the old bit packing kind of model where this field and this bit means this particular thing, right?
Starting point is 00:11:10 And again, that's historical. That's the way SCSI and ATA have done it for decades. So why not? And then hyperscalers, each hyperscaler has their own model because they developed it from scratch and they didn't talk to each other and they didn't see any of our standard
Starting point is 00:11:28 because it was all other software. So they're the only ones that can write any new code to take advantage of any new interface like MI. And so far they have not done that. And then if you are using a BMC as a proxy for the box management, that has a model. We've had SCSI enclosure standards as a model for those BMCs forever. And MVME-MI decided, hey, well, let's just use that.
Starting point is 00:12:02 So through the MI decided, hey, well, let's just use that. So through the MI interface, you can actually send SES queries and responses through that interface now, and that's yet another model. And then the real issue comes in when I'm guiding a data center manager and I have to talk to all this stuff, how many different models do I have to adapt
Starting point is 00:12:24 my own way of programming to actually execute down on the devices that I'm managing? So if I can use something in common that seems to be working and seems to be rolling out and seems to have success, I'm way ahead of doing my own thing. So enter Redfish. Redfish, Jeff Hillen gave a great talk yesterday. It's a DMTF standard specifically for scale-out data center management.
Starting point is 00:13:01 And it uses REST. You've probably been hearing about that for 15 years now. And it is based on other standards as well, such as OData and other ITF standards as well. And the basic philosophy is to have a set of URLs to describe all the things in the data center. And wherever possible, we want each component to sort of represent its
Starting point is 00:13:28 functionality directly in Redfish. Because right now, there's a bunch of situations where the drive is giving you structures with bits and bytes, and what the manager is asking for is, I want a JSON
Starting point is 00:13:44 object that describes this in something that's human-readable. So Redfish is becoming more and more widely adopted. The Intel RackScale design uses Redfish and Swordfish, and if we're producing components that have a redfish model inside them, that may allow the hyperscalers to move over to that common model or at least have a single adapter between their internal model and redfish. And so that could be one adapter per hyperscaler, and that might also increase its adoption as well.
Starting point is 00:14:29 So if you look in here, there's basically two kinds of networks. The SMBus I2C network is an inside-the-box network, right? And there's multiple reasons why you want to do that, but the main reason is cost and complexity, right? And there's multiple reasons why you want to do that, but the main reason is cost and complexity, right? So if these things need to talk to each other inside the box, you don't want to have to slap on an Ethernet connection to do that. So the MI then runs on top of MCTP, which runs on top of SMBus I2C. There's like an EPROM in there that has the static information, et cetera.
Starting point is 00:15:13 But what's useful for outside the BOS is a Redfish interface that runs over TCPI network, IP network, using HTTP, and it just stands on the shoulders of everybody else that's already figured out things like networking problems. And then the NVMe admin commands down here are in band and do pretty much the same thing you can do out of band through these simple networks.
Starting point is 00:15:45 So in this case, maybe the BMC is converting internal MI commands to Redfish here. Same with the JBoff server or JBoff. You may be using a BMC to proxy all the management of the box. In this case, there is no possibility of having a host agent unless you've got some sort of PCIe connection management of the box. In this case, there is no possibility of having a host agent unless, you know, you've got some sort of PCIe connection between the JBoff and the server,
Starting point is 00:16:11 which is lightning, basically. So to convert from NVMe to Redfish, we need some sort of adaptation software. And this could be an open source project. It could be under OCP, it could be under SNEA, that would adapt from the NVMe MI commands and so forth to Redfish, back and forth. And this one piece of software,
Starting point is 00:16:46 because it understands both this and this, can be used by multiple drives for at least the core elements of MI. Now, if you're selling drives into enterprise storage vendors, he may tell you to do some other things besides what's in the standard. And that's fine because he owns the adaptation software. But if you do that and you offer it to all your customers,
Starting point is 00:17:10 everybody has to adapt that little piece that's extra because it's not in the standard, and the standard adaptation software isn't going to convert it either. Similar situation over here. Your adaptation software now is using in-band commands, but it's the same semantics, right? You're pulling out NVMe structures, and you're converting those to JSON objects and sending them back to the data center manager. So one of the things is the other side of the interface is the Redfish interface.
Starting point is 00:17:48 And that's, you know, is there a standard Redfish model for NVMe drives? And no, there isn't. There's drives, right? And it's pretty simplistic, right? That's what's in Redfish today. The idea is we can do a profile of an NVMe drive, and then everybody implements those same parameters, and now we have interoperability. This management software doesn't need to be rewritten
Starting point is 00:18:17 when a new NVMe drive comes out. Redfish is an easily extensible model, so anybody can add their own proprietary features to it without breaking everybody else. So what a profile does is it establishes which functions, which properties must be filled in in order to pass a conformance test, for example. And DMTF and SNEA both have some open source tools
Starting point is 00:18:49 that can generate sort of test clients from that profile, run those test clients against the implementation and say what passed and what failed. And then the beauty is you can extend a profile as well. So I could have a Toshiba memory profile that takes the standardized Redfish profile for NVMe drives and adds anything that our various customers have asked for on one-off drives. Any questions so far? All right. Any questions so far? So the Open Compute project is using these red profiles
Starting point is 00:19:34 to develop a profile for network interface cards. And Sneha has some experience with this, creating a standard profile for IP-based drives. So the object drive Twig is now working on an NVMe Redfish profile that would be in common across all NVMe drives. And so part of what we've been doing is just bringing everybody up to speed in SNEA on what NVMe is and NVMe MI.
Starting point is 00:20:08 So we will document that profile. It's a JSON object body. And we'll review it with the NVMe Management Interface Working Group. And then the idea is to register it with DMTF so that any application can go out, fetch that NVMe Redfish profile off of their website and use it either to compile or run their management software. Okay? So then there's this thing called Redfish device enablement.
Starting point is 00:20:45 And this is a way of doing Redfish implementations on something that may not have enough resources to do a full-on TCP, IP, HTTP kind of implementation. So the idea is convert this very verbose Redfish model in JSON back down to one of these compact structures with bits and bytes associated with each location.
Starting point is 00:21:18 And that's what I'll talk about in a minute is the binary encoded JSON attempts to do. But this supports a range of capabilities from primitive to more advanced devices. And it uses something called PLDM, which is a platform layer data model, which really is going into the management of each kind of device type, including things like firmware management of devices, et cetera.
Starting point is 00:21:52 And if you live in a world where the BMC is the center of the management universe, it makes a lot of sense for that BMC code to use one protocol and one way of doing things to all the devices that it's managing. But as I said, hyperscalers could care less because they don't want to use that cheap internal network to do the management. Costs money. So what are we talking about? Well, so on the left here, and I'm sorry that this is white and you really can't see it,
Starting point is 00:22:28 this is the HTTP operation. So if you're familiar with HTTP, it's got verbs like get, put, patch, post, delete, and head. That is part of the HTTP protocol. And so if Redfish is going to work, we need equivalent RDE operations that do read, replace, update, some sort of action or create for a post. You know what a post is, right? Delete is delete and headers read headers.
Starting point is 00:23:02 If we're going to add Redfish support to a device, we have to have some sort of functions like this that match the HTTP verbs that we started with. Now, having a sort of compressed form of that JSON object body, message body, means that there's some negotiations you have to do. Because the way you make it so compact is you get rid of the verbose keys.
Starting point is 00:23:33 You know, you have a key value, key value, key value, property name value, property name value kind of thing. So instead, they replace it with just a number. Okay, for this particular JSON object body, object body, timestamp equals number three. And there's a dictionary that lets you reverse that. So when you get from the device, this is property number three. Here's something that looks like a time. Take that and put it in a JSON body with the actual verbose timestamp property name in it.
Starting point is 00:24:08 And then so you've got to get the dictionary out of the device that tells you how to do that. And there's also a way to get the actual schema URI. And remember I said there's a schema up on the website, DMTF website. This is that URI to that website so that anything that wants to manage this can go and get that piece of the Redfish schema
Starting point is 00:24:31 to be able to interpret what's going on. So I mentioned this binary encoded JSON, and it's really, you know, because of the design choices around the slow network and devices with very little RAM and other resources to process this stuff, the idea is that, look, you're going to need to have a common way, a common format across all these devices. NVMe AMI works fine for NVMe solid state drives, but does it work for NICs? Does it work for GPUs, FPGAs, right? So that's why the movement towards Redfish allows us to create something, perhaps an NVMe, that's agnostic to even the fact that it's a storage device.
Starting point is 00:25:26 In other words, if these functions are just passing Redfish objects back and forth, you could use NVMe to now manage a GPU device or an accelerator device or any of the computational storage-type devices that are being talked about here this week. So that's the binary encoded JSON. There's a spec out there. It's part of the RDE for PLDM spec. But I'm not sure we really even need this at this point for these NVMe devices because NVMe devices seem to have a lot more resources than some of the old SCSI and SATA drives that we're maybe used to.
Starting point is 00:26:13 And maybe they can be more verbose. You've got a frigging submission queue and completion queue in NVMe. You're really not worried about verbosity because that's going to happen like that. And so maybe the approach that they're using on the sort of management, you know, enterprise storage manager where BMC is part of the world, might instead be used for sort of an in-band set of stuff. So,
Starting point is 00:26:52 Canon NVMe device support badge, it's really just a different encoding. It's still a structure with bits packed into bytes packed into double words, right? But we do need to support additional commands, the commands that correspond to the HTTP verbs. Whether it's a Badger or not, and whether it's a combination of Badger and dictionary or not,
Starting point is 00:27:19 I'm not sure is needed. But there will be discussions in these various groups. But the real beauty is these vendor-specific functions and properties are really just an extension of the profile. So it's very easy to say, well, I can interoperate you on the basis of what's in DMTF, or maybe I can write some extensions for the newest features in NVMe. but I also have some properties that aren't in any spec that I give to the public. So you don't need to keep adapting to Redfish in that case. But if you do have to have an adapter, it's greatly simplified because now I can have a generic adapter that really doesn't understand anything about the device
Starting point is 00:28:06 because the device is giving me redfish. All I'm doing is converting the badge compaction into the regular JSON output back and forth. And I can be doing that adaptation without understanding the semantics of what the actual device is producing, which is really cool. So we can use this for the computational storage guys as well. Speaking of which...
Starting point is 00:28:38 So, you know, the whole approach is you're desegregating components of a server or a hyperconverged thing. It's like hyperconverged took the pendulum all the way to its part, and it's starting to swing back the other way, right? And so if they use an NVMe interface, how are they going to manage, right? If you manage them via this network, you know, SMBus I squared C, that restricts them to maybe all being in the same enclosure. Do you want that? Even though you might have five enclosures, are you going to go to five different BMCs to sort of create a picture of the
Starting point is 00:29:19 whole system or what, right? And so do BMCs still even make sense when you've got all these disaggregated components sitting on a fabric somewhere? And then why not just run the agent in a computational component, right? So you've got something that's not being used right now for a CPU, why not turn it into a host agent,
Starting point is 00:29:44 quote-unquote host agent? Quote, unquote, host agent. It's really a floating agent now in that it can be discovering and talking to all those things on the NVMe fabric and then bringing them back in and giving the user option to, let's say, compose a virtual system out of these components, get some namespace from this NVMe drive,
Starting point is 00:30:10 you know, set up this accelerator, set up this analytics that's going to run on the accelerated data, et cetera, right? So you sort of, like, create a virtual system, use it for some tasks, and then put it back in the pool. And in that case, you know, having the BMCs try and get in the middle of that, if anything, right?
Starting point is 00:30:32 You really just want to be able to send sort of in-band NVMe commands to whatever component you're drawing resources from. So if we can add some admin commands for Redfish, then the different components can each provide their own profiles. I have a GPU profile. I have an FPGA profile. I have a data analytics profile.
Starting point is 00:30:55 And the NDME interface didn't have to change because it's still just fetching Redfish. And so then you can get management that actually can talk to all the things it has to compose something out of on the fly for a particular case. Comments? Yeah.
Starting point is 00:31:19 So this is just up here. Right. The question is what? Okay. Right, exactly. That's right. So it's just a comment on the second bullet here about implementing these new admin commands with the Redfish functionality. And one of those would be download by dictionary, for example. Yeah, you only need a dictionary if you're using Bej,
Starting point is 00:32:37 if you're compacting things, right? So dictionary tells you how to pull it apart. But it could be the NVM admin commands just put on the completion queue a JSON message body, and the adaptation is like, take that and put it in a TCP IP packet. That's all they have to do, right?
Starting point is 00:32:58 And that's ideal. I mean, you don't want to have to put an Ethernet port on every drive. That doesn't work. You know, somewhere's have to put an Ethernet port on every drive. That doesn't work. Somewhere's got to be an Ethernet switch now, even though you're using Fibre Channel for the data path, right? But if you're using something like Fibre Channel or NVMe over Fabrics, the fact that you can do in-band command now means you can manage a remote drive,
Starting point is 00:33:24 and you don't have to have one of these inside-the-box networks for it. If it's in-band, it doesn't matter what the transport is. It can be any of the NVMe over Fabrics transports. In the future, it can be DCP IP. So I think what we're doing is kind of laying the foundation for all the computational storage elements as well to pick up the same way of doing things. So then if you've done that, you had a standardized profile.
Starting point is 00:34:01 This adaptation software in the BMC case is doing the Bege dictionary compaction. But over here for the host agent, he's really simple. Like I said, if he's able to get a JSON message body out of the NVMe drive, he's just wrapping it with TCP, IP, and HTTP. And it's really just a matter of, you know,
Starting point is 00:34:34 does the BMC need to get the data out of here using some sort of compact representation? And some of the enterprise storage guys I talked to said, no, we already implement MI. Why would we want a different way of doing it? So they're doing that anyway. Okay, so I talked a little about PLDM. It's DSP420, and it is the basis of the RDE work
Starting point is 00:35:07 that includes the badge and the dictionary and the new verbs and again it uses MCTP maybe a later version than MI does but this is what enterprise system vendors are pushing their device vendors in that direction to implement in PLDM. But, of course, that is not compatible with NVMe MI.
Starting point is 00:35:35 So, you know, once NVMe is adopted, it's going to be there for a while, and you're going to have to support it. And that's the unfortunate thing. And the first type of device that they're pushing this PLDM on is NICs, and they're having variable success on that. That's the OCP NIC. Because they put out this perfectly adequate standard called NCSI that they all adopted. And now they're saying, well, we got NCSI. Why do we have to do PLDM?
Starting point is 00:36:12 Sounds familiar. So if you can imagine where PLDM is now the highest level network running on this little SM bus network. Then the badge, of course, goes with the RDE. So in the BMC-centric world, you know, you can expect probably in a couple years to get custom requirements for supporting PLDM. For the host agent, I'm not sure PLDM even makes sense, right? It's for a network that they don't want to use. Any questions on that?
Starting point is 00:36:53 So does that mean that my price has out-of-band MI connections, SM bus, and supports in-band, I'm going to have to support both? That's what I'm trying to get away from. See, if inside the device, right, we can implement Redfish instead of something for this connection and something different for this connection, which is today's story, right, We as drive manufacturers win, right? We have one place to do the instrumentation, one format to put it in. You outside world just figure it out.
Starting point is 00:37:31 Which one do you want to use, right? So, and, you know, that whole, things like Gen Z are changing the computer architectures going forward. And so there may not be such a thing as a host anymore, right? Everything's on the bus. It all works peer-to-peer. And you have to manage it like that.
Starting point is 00:37:54 You really can't depend on there being one place where everything connects to like a host. Yeah? to like a host. Yeah. Right, so I think your question is along the lines of given some RAM, would I be better off making my drive So I think your question is along the lines of, given some RAM, would I be better off making my drive go faster with it or doing better management with it, right? And inside your customer, you'll find that there's camps in both. Say, like, if I can't manage it, I don't want it, right? And on the other side, it's like, can't you squeeze a few more IOPS out of
Starting point is 00:38:46 this? Yeah, that takes resources. Yeah, that does take resources, but typically we're not so concerned with the management performance, more the data path performance, right?
Starting point is 00:39:05 And at least my understanding is, in the case of Bej, it doesn't require any more resources than the existing MI. It's almost as compact a format. And the real difficulty in doing that management software on the device is reaching out to all the places where you have to gather metrics, right? It's not packing it into a structure to send back. You pack it into the structure, you send it, and you free up that memory. So it's very short-term use of that kind of resource. Does that make sense? Okay? So what about Ethernet NVMe drives?
Starting point is 00:39:52 Now, this is the best world we could have because if you have an Ethernet port on your drive already, you can put a little TCP IP stack there just to do management and do HTTP over that, put a little web server on there, and you don't even need any adaptation at all. You can be a Redfish endpoint directly and serve up your information if the industry goes sort of to Ethernet drives. I mean, and there's different ways to do Ethernet, too.
Starting point is 00:40:22 Rocky is one kind of Ethernet. And, you know, at FMS, we were already seeing sort of Rocky drives, right? So NVMe over Fabrics talking directly to the drive. That's kind of interesting, especially if you can get the Ethernet components down. And then we do have a project to do MVM or Fabrics for TCP IP. And, of course, if you want to do TCP IP in the data path, your drive gets a lot more expensive
Starting point is 00:40:57 because now you need to sort of execute that TCP IP stack. Whereas in Ethernet, you're just picking apart the bytes in the submission queue and the completion queue. So you don't, you know, you could do an Ethernet drive that supports TCP IP just for management, not for actual data path. And then you can do the scale-out management
Starting point is 00:41:23 without the need for a host and a host-based agent or a BMC, both of which are sort of intermediaries. When you get away from the intermediaries, that increases the scale-out as well. Does that make sense? Okay? Okay? All right.
Starting point is 00:41:46 So I want to tell people about our BOF this evening. It's the NVME BOF, and we did this last year. It was very popular. I think it's going to be in here again tonight at about 7 o'clock. And every year we invite this Bay Area NVME meetup group that Olga runs, if you know Olga. So it'll be a combination of SDC attendees and this local meetup group.
Starting point is 00:42:16 That should be fun. It's a panel. Bill will be on it. Got your slides ready, Bill? Thank you. Thank you. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at sneha.org. Here you can ask questions and discuss this topic further with your peers in the Storage Developer community.
Starting point is 00:42:55 For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
