Storage Developer Conference - #96: Solid State Datacenter Transformation
Episode Date: May 20, 2019...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts.
You are listening to SDC Podcast Episode 96. So what I wanted to talk to you today about
was the solid state transformation of the data center.
I was asked to maybe do a walkthrough, a little walk back in time.
So I had fun going back in my archives and seeing what we've done together as an industry.
So, just the disclaimer. There's going to be a test, so make sure you read all that.
So one thing I wanted to start out with is zettabytes. I didn't really even know what
zettabytes were until the past few years, when I went from hearing about petabytes to hearing
about zettabytes. And what you can see is that the edge and data centers and endpoints are driving
just a tsunami of data. And so there's no better time for being in storage
than what we see now.
And there's only more coming. Dave talked eloquently
about how in AI you absolutely need
more and more data, and that the thirst for data
is increasing, so there's a lot here.
And what we need is faster and faster solid state drives
and persistent memory to take advantage of that.
So in today's talk, what I wanted to do is talk about the industry transformation to
embrace NVM and how we can continue to do that together.
So I've got almost a look back at the past 10 years of what we've done together and
how we've shifted as we move forward to really realize what we have in the NVM industry.
So if we look at the beginning of the journey,
going back to 2000: when I started at Intel,
I was working on software for ATA,
which was even the precursor to Serial ATA.
How could I make hard drives faster
and offload them into system memory
with faster algorithms?
It was fascinating,
but we didn't have these tiers in the middle.
So how did we get to where we are? If you go back a decade at IDF in 2007, I was talking about,
hey, the PC could be a driver of NAND growth, and the SSD could become the new killer application of NAND flash.
It's hard to remember, for those of us in this room, when it was only potential
that SSDs would be the killer application for NAND.
So where have we come? What we've done over the past decade is go from asking whether NAND SSDs could be part of this tiering, this memory hierarchy, to delivering it.
And how did we get here?
So the first thing I wanted to talk about was the need for the open NAND flash interface.
So if we scroll back, we had no standard NAND interface going back to the time in 2005, 2004.
So at least in my experience, when I was starting on SSDs at Intel, one of our issues was how do I even design an SSD interface? We had Toshiba,
we had Samsung, we had many smaller NAND vendors. And what we were struggling with was how do I even
solidly implement an ASIC for an SSD? What could I do? And so the challenge was basically the way
the industry worked at that time was Samsung would publish their latest data sheet and everybody else
would then adjust for whatever the new developments were.
And so, to grow the ambitions of what we wanted to do
in the SSD industry, we really had to address that.
So then enter ONFI.
And so what we did is, if you go back,
and sorry for the odd formatting,
one of the fonts did not translate.
If you go back to ONFI, back in 2006 what we did first was codify the commonalities, and then we started to scale
for SSDs, increasing the performance. I don't know if we all
remember what we've done together over the past decade, but if you scroll back, NAND in 2005 had a 40 megabyte per second interface to the NAND chip. From there,
in ONFI 1.0 we went to 133 megabytes per second, then to
266, then 400, bringing over the DRAM learnings, and we scaled that in less than two years. And then we kept scaling for the SSD industry.
So obviously one thing that many of you in this room know is
ONFI was a coalition of the smaller companies, the Microns and the SK Hynixes and the Intels.
And so what we did is we worked with JEDEC and formed an ONFI-JEDEC collaboration.
And then what we were able to do through that is work with Samsung and Toshiba
to really bring together and unify the industry
on how we had commonality
in the most important parts of the NAND interface
such that all of us could build SSD controllers confidently
and move forward with the scale of our industry ambitions.
Then we had steady innovation, year over year over year. We added features, we went to,
you know, 800 megatransfers per second, and I'll talk in a second about 1,200, but we just continued
over time to deliver steady innovation and keep pace with SSD needs. So up to today:
in 2014 we delivered 800 megatransfers per second, and in 2017 we delivered 1,200 megatransfers per second.
And so we've continued to keep pace with where NAND needs to go
in order to hit SSD needs and develop that NAND tier.
And so what are we seeing then?
Should we just keep going?
I mean, I started out at 40 megabytes per second and now we're 1,200,
so we should just go 2,400.
So one thing I wanted to say is NAND is an amazing technology,
but at the scaling of the individual NAND device, we've really reached where
I believe we should go. I mean, maybe we go slightly higher, but what these charts are
saying is that as you keep increasing your data rate, you're getting diminishing returns at the
NAND. So I'm not going to be back here in two years telling you we've scaled the 1,600 megatransfers
per second or 2,000 at the NAND.
What this is saying is we've reached the end of our ambitions with NAND in terms of at
an individual NAND layer unless we get a breakthrough in the NAND media.
You just get scaling where I'm paying power and I'm not getting
a system-level benefit or an SSD-level benefit to bring forward. So that was our first
pass: if we're going to deliver a NAND SSD tier, first off we have to be able to attach NAND
in a big way, as many of us have 40 NAND sites, 80 NAND sites, 120 NAND sites in our devices.
And how do we go forward and be confident about, I can use Samsung, I can use Toshiba, I can use Intel, I can use Micron.
So as an industry together, we've delivered that.
And what I wanted to get across in this talk is one of the most important things is those of us in this room,
us working together and figuring out what we want to do.
When we first started talking in 2004 about starting ONFI, we kind of got laughed at quite a bit.
Because it's like, Intel, you don't even own NAND.
What are you talking about?
Because this is pre-Intel Micron NAND merger.
So I just wanted to say: us in this room,
that's how we move things forward.
So now I wanted to talk about the path to NVM Express
and what we need to continue to do.
So the ONFI work is pretty much done,
and it's not really continuing, but with NVM Express,
we have way more. Are you guys asleep yet? You seem very sleepy. So I'm going to disclose something
shocking. I'm terrible at names. NVM HCI, that was the original name of NVM Express.
So this is back at Flash Memory Summit in 2009. I was out there talking about, hey, look at all this awesomeness of NVM HCI and optimized interface for NVM.
And what you can see here is we were really looking at it from a, hey, I've got an HCI controller, an NVM HCI,
and I'm going to do this awesome caching and client, and I'm going to have this really cool thing where I just have NAND on a DIMM,
and I just attach the NAND, and it's going to be awesome. And it was meant to be this very
low-cost client capability. Well, sometimes our ambitions are failures, right? Total misfire;
that did not go anywhere. So, one of the things:
I love books, as Dave does, and one of the things that I find fascinating is failure.
The growth mindset books by Carol Dweck of Stanford are fascinating on how we really evolve things.
It's by failing and then picking yourself up and moving to the next thing.
So NVMHCI, big failure.
But it was a springboard for enterprise. So what we figured out as an industry was,
hey, look at all these PCIe SSDs that are happening,
all these Fusion-io SSDs.
In the same way that with ONFI we thought about how to enable NAND
for a broad-based ambition of adoption in the industry through standards,
we as the industry needed to do the same thing for PCIe SSDs in the enterprise.
You saw Fusion-io and you saw a lot of different companies doing things,
but not in a way that would scale for the industry. And so,
why not put a word like enterprise in front of NVMHCI? That sounds really good.
So that's what we did. So how do we address the gap? We talked
about Enterprise NVMHCI, how we deliver it, and the goals and timeline.
And of course, the first thing I did was say, you know, we should just extend client NVMHCI, which was
originally intended to be just an extension of AHCI. And so I was like, for enterprise, okay, I have to tweak a few things, but that's fine.
Well, that's where all of us come together and solve problems together.
And I remember Don Walker from Dell, he just shook me and said,
no, you are not going to just extend this client thing to enterprise.
We need to clean this up.
And what he said is, hey, NVM Express, or Enterprise NVMHCI at the time, is going to revolutionize the industry. We cannot
build it off of some, you know, house of cards. We have to have a really solid
foundation and do the absolute best we can. And so Don pushed me and said, no, we have
to start from scratch. And that's how Don, Peter, several of us got together and said, okay,
what are the principles of Enterprise NVMHCI? We started out, back in 2010,
we talked about, okay, the first bucket was to get rid of the performance bottlenecks we see in other
interfaces. So for example, if you look at AHCI, you have register MMIO reads that cost you a microsecond each; you've got to get rid of all
of those. You have to simplify the decoding. One of the challenges we saw with really
large-scale SSD ambitions is that I need to be able to hardware automate everything, and if you look at
SAS or Serial ATA, it becomes pretty challenging to hardware automate certain things because of boutique features. So how do we get to
one read and one write? Another big bucket was a streamlined command set, again to realize the
ambition of hardware automation. Then into enterprise features: how do I have encryption? How
do I have end-to-end data protection? How do I have solid error reporting? And then lastly, how do I get a scalable architecture?
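To make that bucket concrete, here is a rough, hedged C sketch of the queuing model those goals led to, as I recall it from the base specification: the host builds a 64-byte command in a submission queue that lives in ordinary host DRAM, rings the submission queue tail doorbell with a single posted MMIO write, and then finds completions by polling a completion queue that also lives in host DRAM, so the fast path never stalls on an MMIO register read. The field layout is abbreviated and the little self-test uses plain memory in place of a device, so treat it as an illustration, not a real driver.

```c
/* Simplified sketch of the NVMe queuing model: abbreviated structures,
 * no real hardware. A fake doorbell variable stands in for the device. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct nvme_sqe {                 /* 64-byte submission queue entry */
    uint8_t  opcode;              /* 0x02 = Read for the NVM command set */
    uint8_t  flags;
    uint16_t cid;                 /* command ID, echoed back in the CQE  */
    uint32_t nsid;                /* namespace ID                        */
    uint64_t rsvd, mptr;
    uint64_t prp1, prp2;          /* physical address(es) of the buffer  */
    uint32_t cdw10, cdw11;        /* Read: starting LBA, low/high        */
    uint32_t cdw12;               /* Read: number of blocks minus one    */
    uint32_t cdw13, cdw14, cdw15;
};

struct nvme_cqe {                 /* 16-byte completion queue entry */
    uint32_t dw0, rsvd;
    uint16_t sq_head, sq_id;
    uint16_t cid;
    uint16_t status;              /* bit 0 of this halfword = phase tag  */
};

struct queue_pair {
    struct nvme_sqe   *sq;          /* submission queue in host DRAM     */
    struct nvme_cqe   *cq;          /* completion queue in host DRAM     */
    volatile uint32_t *sq_doorbell; /* MMIO tail doorbell register       */
    uint16_t sq_tail, cq_head, depth, phase;
};

/* Build the command in host memory, then issue one posted doorbell write. */
static void submit_read(struct queue_pair *qp, uint32_t nsid, uint64_t lba,
                        uint16_t nblocks, uint64_t buf_phys, uint16_t cid)
{
    struct nvme_sqe *sqe = &qp->sq[qp->sq_tail];
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = 0x02;
    sqe->cid    = cid;
    sqe->nsid   = nsid;
    sqe->prp1   = buf_phys;
    sqe->cdw10  = (uint32_t)lba;
    sqe->cdw11  = (uint32_t)(lba >> 32);
    sqe->cdw12  = (uint32_t)(nblocks - 1);

    qp->sq_tail = (uint16_t)((qp->sq_tail + 1) % qp->depth);
    __sync_synchronize();             /* make the SQE visible first      */
    *qp->sq_doorbell = qp->sq_tail;   /* single MMIO write, never a read */
}

/* Reap completions by polling host memory; a real driver also updates the
 * CQ head doorbell periodically, which is again a write, not a read. */
static int poll_completion(struct queue_pair *qp, uint16_t *cid_out)
{
    struct nvme_cqe *cqe = &qp->cq[qp->cq_head];
    if ((cqe->status & 1) != qp->phase)
        return 0;                     /* no new completion yet           */
    *cid_out = cqe->cid;
    if (++qp->cq_head == qp->depth) { qp->cq_head = 0; qp->phase ^= 1; }
    return 1;
}

int main(void)                        /* self-test with plain memory     */
{
    static struct nvme_sqe sq[16];
    static struct nvme_cqe cq[16];
    static uint32_t fake_doorbell;
    struct queue_pair qp = { .sq = sq, .cq = cq,
                             .sq_doorbell = &fake_doorbell,
                             .depth = 16, .phase = 1 };
    submit_read(&qp, 1, 0x1000, 8, 0x1000000ULL, 42);

    cq[0].cid = 42;                   /* pretend the device completed it */
    cq[0].status = 1;
    uint16_t cid;
    if (poll_completion(&qp, &cid))
        printf("command %u completed, doorbell=%u\n",
               (unsigned)cid, (unsigned)fake_doorbell);
    return 0;
}
```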
So something I'll come back to more and more at the end: we look back and this is 2010, but we really
started working on this in like 2007. And I'll give away a big secret: NVMe didn't ship until 2014,
and it really didn't gain steam until 2016. And so one of the things that we as an industry have
to be super clear on is where we are going. It takes a long time to get anywhere, as you guys
in SNIA know from working through the persistent memory TWG and all of these things.
It takes a long time to gain that momentum and truly deliver a massive innovation.
So as an industry, we really have to know where we're going.
And so that's why I think it's super, super important that we focus on what are we trying to build.
And so from the very beginning with NAND, we knew. I mean, NAND has been going to die, you know, ever since I've been working on it, but it's not dead yet, right? But we always knew there was something else coming,
and so we needed to define NVM Express from the ground up to make sure that we could accomplish
the ambition of things like 3D Crosspoint. So NVM Express was born in 2011, in March of 2011.
And so, we had this board of directors that included a few companies that no longer exist.
But Micron, EMC, many of us in this room.
And it really delivered efficient SSD performance.
And so, if we look all the way back, we got rid of the uncacheable register reads.
We got MSI-X and interrupt steering so we could get up to a million IOPS.
We really maximized queue depth and efficiency for 4K commands.
And then we really focused on how do we develop the ecosystem.
We looked at the interop program from the very start.
How do we broadly scale and make sure that we address things together?
And major drivers for all OSs.
I think something that was a critical breakthrough in NVMe
was we had drivers available before the first SSDs.
That's really one of the things that made NVMe go
versus potential competitors of the time.
Many of you in this room might remember
the quote-unquote war with SCSI Express, right?
A lot of it was just putting your head to the wind
and really delivering and building out the ecosystem
where everyone could thrive.
We then focused on how do we build out the enterprise feature set?
How did we add multi-path support and reservations
and other capabilities and just continue to build?
And so another key thing that I see that SNIA does so well is pick a point to go
and just keep going and build out the capabilities over time. And we get there as an industry
together. And then finally we shipped. I don't know if I'm going to get in trouble if Intel sees
that I've included a Samsung picture first, but Intel shipped very shortly thereafter.
So we had the first plugfest in May of 2013 with 11 companies. We had the first product
announcement from Samsung in July of 2013, and then they really shipped in volume later
in the year, as did Intel. And so we had two drives out in the market in 2013. And it really
delivered in performance. So this talks
about, hey, let's look at the performance. We had awesome random reads, awesome random writes,
very good sequential performance, really taking advantage of PCI Express. And so how do we build
on a foundation that scales is another key thing that I think through. So if we look at today,
PCIe Gen 4 is starting to come out into the marketplace
over the next year. And you even have people working on PCIe Gen 5; the PCIe Gen 5 specification
is on track to finish in the next year or so. I shouldn't announce anything on PCI SIG. Many of
you are SIG members and know much more than I do. But we're just scaling. And that sequential
performance, we can continue to deliver with NAND or other media. And then how do we do it efficiently and with low latency?
There was a lot of debate early on in NVM Express about why are we worrying about all this
efficiency? Why does that matter? It doesn't matter for NAND. I mean, when I look at AHCI,
the Serial ATA programming interface that I was also the author of, and how bad it is versus NVMe,
people would say, why are you focused so much on the efficiency and latency of NVMe?
And it wasn't for NAND.
It was for making sure that we didn't have to redo NVMe for new memories like 3D crosspoint in the
future. So it was super important for how do we get the average latency down. So this
is just showing, going all the way back, that if we look at the average latency, just NVMe running
on PCIe Gen 3 off the CPU is so much lower latency than SAS or SATA, and that's
really the focus we had to have. And with better quality of service.
So getting to this curve here:
what does it look like for my interface latency
for 4K random reads?
I just fall off a cliff on quality of service
with SATA as well as with SAS,
whereas I get consistent performance with NVMe.
And shockingly, the analysts noticed over time.
And so people started to project that PCIe,
mostly NVMe, would replace SATA and SAS over time,
and one of the things that was exciting to me
in the last quarter is that Intel, for the first time,
shipped more terabytes in capacity on NVM Express
than on serial ATA,
so we truly are hitting a crossover in the industry.
And we continue to add capabilities.
So if we take a look at NVMe, we're just adding capabilities over time.
One of the things we did is in 2014, we added NVMe 1.2.
We added both client and server features.
So, for example, host memory buffer was a client feature allowing you to utilize host DRAM with the SSD.
Whereas controller memory buffer, well, again,
my whole theme is confusing people with naming.
So host memory buffer, good thing for client.
Controller memory buffer, more of a good thing for enterprise.
So it's all about how you suck people into your work groups
because they can't tell otherwise what you're talking about.
No, no, no.
And then we headed to new frontiers.
So we went on to the management interface,
then on to fabrics,
and then driving innovation in the cloud.
So somewhere we could really see we were hitting on something is,
just like with AI today, how we got embraced
by some of the up-and-coming players, you know, that were not entrenched in any traditional focus area.
So if we take a look, you had the Lightning design that Facebook did.
And then just talking through why did they embrace NVMe, you had Google Cloud Platform all the way
back in 2016, differentiating SCSI versus NVMe.
And it all comes down to what we're showing here: this is the HDD latency on SAS and SATA.
Then you go over to NAND on SAS or SATA, and you get a really nice jump, you know, from milliseconds to microseconds.
Then you go to NVMe, you get another bump, but you still have all of this latency in the media with NAND.
And that last step is, you know, the dream to attain, and I'll talk more about that dream.
And then even more capabilities in NVMe revision 1.3. And so in revision 1.3, we added virtualization, we added directives, and just continuing, you know, about a two-year cadence
on the specification for what people need and how we
keep delivering for the industry. So we delivered streams, we delivered virtualization, and more.
And so streams starts to hit on a key point I want to talk to you about:
what's our North Star as we move forward in NVMe? Streams is heading toward what we do to
deal with some of the conflicts when I
have multiple tenants using a single drive. When I have so many tenants using one drive,
I get these inherent conflicts. What am I going to do to really address that and make sure that
I can have this very big drive be used by so many tenants and feel like a small drive?
That's, I think, the biggest
challenge of what we have moving forward. So here's the latest NVMe roadmap. I think where
you're sitting, you should put on your binoculars and make sure that you study this deeply.
What we're doing is we have NVMe MI 1.1, so we have the management interface.
We're adding enclosure management capabilities, so enclosure management features,
making sure that you can have your LEDs and have everything else.
We have NVMe over Fabrics 1.1.
There are two key things in NVMe over Fabrics 1.1.
One is a new TCP transport that I'll talk a
little bit more about. The other is enhanced discovery. When we originally delivered
NVMe over Fabrics, our ambitions were not completely massive scale
to start; we wanted to take things in chunks.
And so now, as we've gotten much more broad adoption,
what we need to do is make sure that when you're out trying to figure out what's on the network,
we have enhanced discovery capabilities so you can more easily find what's changed in the network.
So you actually can be able to query, hey, what got added on, what moved, all of those things.
And then NVMe 1.4.
Both NVMe over Fabrics 1.1 and NVMe Management Interface 1.1 we're expecting to release by the end of this year.
NVMe itself, the core specification, revision 1.4, we're anticipating to deliver mid next year.
What that's got as anchor features is a lot of multi-pathing capability.
One of the key things is just building out further and further a lot of our multi-pathing.
So asymmetric namespace access is one of the key things.
If Fred Knight is here, Fred can tell you at great length about the wonderful capabilities in ANA.
What that's trying to do is make sure that if you have two paths to an SSD,
you know whether the path you're on is currently the optimal path,
and you can fail over to the less optimal one when you need to.
Then we have persistent memory region.
One of the capabilities we're doing there: you take the controller memory buffer,
which has such an awesome name, and you think, okay, the controller memory buffer is for when it's volatile.
What persistent memory region gives you is essentially a persistent controller memory buffer,
but we just renamed it. So persistent memory region: what it's really giving you is a capability
to have memory access to your SSD. You can imagine some people, thinking about database applications,
who have a portion of the device that they want to use in a memory-like fashion for certain small updates,
and then they want to use the vast majority of
the SSD as NAND in the normal fashion. So that's what persistent memory region gives you.
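To give a feel for the access model a controller memory buffer or persistent memory region exposes: both live inside a PCIe BAR on the SSD, and on Linux a BAR can be memory-mapped from sysfs and then touched with plain loads and stores. The sketch below is a generic, hedged illustration only: the PCI address and BAR index are placeholders, it needs root, and a real application would discover the CMB or PMR offset and size from the controller's registers rather than hard-coding anything.

```c
/* Generic illustration: mmap a PCIe BAR from sysfs and access it with
 * ordinary loads and stores. Device path and BAR index are placeholders. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *bar_path = "/sys/bus/pci/devices/0000:03:00.0/resource2";
    const size_t map_len = 4096;

    int fd = open(bar_path, O_RDWR | O_SYNC);
    if (fd < 0) { perror("open BAR"); return 1; }

    volatile uint8_t *bar = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) { perror("mmap BAR"); close(fd); return 1; }

    /* Byte-addressable access to device memory: no block I/O involved. */
    bar[0] = 0xA5;
    printf("first byte of mapped region: 0x%02x\n", bar[0]);

    munmap((void *)bar, map_len);
    close(fd);
    return 0;
}
```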
And then lastly, IO determinism. Let me lead into that: IO determinism is at the forefront of what I think
we need to focus on next. So let me dive in. Today's challenge, as I really see it, is quality of service at scale. And IO determinism
is one mechanism going after that, but I wanted to talk with us about how we align on the real
problem we need to go after next and how we do that together. So I am shamelessly stealing lots
of pictures from a presentation that Chris Peterson and I gave together at Flash Memory Summit in 2017.
So Facebook at scale, these are numbers from 2017.
And so it's only gotten way, way bigger.
Right.
So you talk about, you know, two billion people on Facebook each month.
It just doesn't even make sense. Two billion of something, it's something of something,
but they're big numbers, right? They're really big numbers. That's my point: at scale, it's big.
I sound like an engineer, right? So one of the things that's occurring is disaggregation, right?
Disaggregation of storage is happening over time. You have a server where you really
want to disaggregate because your CPU and your flash, in year one, you've balanced them. In year
two, well, hey, on this particular server maybe I need more flash, and so your needs have outstripped the box.
Or on the converse side, okay, I've got wasted CPU resource. So a critical thing about
disaggregation of flash, or storage pooling as some people like to call it, is:
how do I not tie together two things that don't need to be tied together? How do I
not tie my CPU scaling to my flash scaling? And we're seeing that in all sorts of things. For example, I was in
a conversation, you know, earlier this week about how do I not tie my AI scaling to my core scaling,
right? How do I make sure that I can independently scale these variables to deliver the best
performance and the most optimized? And one thing you'll hear Chris talk about is dark flash and the
evil of dark flash, right? Flash is not the cheapest
thing ever. I was very disappointed to learn, as an example, that for one of the major
cloud service providers, you know, one of the top Super Seven as they call them,
the most money spent is on memory, on DRAM. The second most
is on SSDs, and the third is on CPUs. And being from Intel, that was super sad. So a key thing is
how do we disaggregate flash so that you can avoid buying more flash than you need and really right-size your variables of CPU and memory and
flash. So the challenge that Chris is seeing is that the NAND flash trend, if you look at it,
is kind of crazy, right? So the NAND flash trend is we just keep growing our NAND flash capacity,
which is awesome for everybody in this room. But what that leads to is a really big challenge.
If you roll back a few years,
people were attaching, say, I attach a terabyte M.2 SSD. That's awesome. Now I've got a four
terabyte M.2 SSD. However, my application doesn't need four terabytes now. How do I now land more
applications on a single SSD? And what does that mean to me as a shared resource?
So what that means is I cry.
So what you have here is, you know,
you've got your noisy neighbor problem.
And so I have application B and C
that are being very well behaved
and just doing read requests.
And then application A is just doing all these writes
and just thrashing my SSD.
So how do I live in that environment moving forward?
And so Chris was trying to give a great example;
he couldn't get the rights to an image that I liked even more,
which was a Ferrari in a herd of sheep.
So how do you deal with that?
I got this Ferrari. It's awesome.
But what am I going to do with it?
So we need to create those individual lanes, that individual racetrack, for our awesome devices to really realize
our ambitions moving forward. So if you look at this, this is a more realistic graph, real data,
showing what the problem is. What Chris is showing here is my read latency as a histogram of my requests.
Most of my requests come in at between 90 and 100 microseconds of latency; that's what this
is showing. I don't have much of a tail. Pretty much everything is, you know, cut off at 100 microseconds. So that's awesome. I do have a tail,
and my tail doesn't go that far. It only goes to like 130 microseconds. If you notice, what happens
is I introduce 10% 4K writes alongside 90% 4K reads. That's my noisy neighbor from the last slide, and all of a sudden
my worst-case outlier goes from 130 microseconds to four milliseconds. That's super sad, which is
why he's crying. That 35x, that is our industry challenge. What do we do to address this? So one thing that I've worked on the past few years
is something called IO determinism, or NVM sets,
and that's pretty cool.
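Since the whole quality-of-service-at-scale problem keeps coming back to tail latency, here is a small, self-contained C sketch of the measurement itself, with made-up numbers: collect per-I/O latencies, sort them, and read off the percentiles. It is only meant to show the shape of that 35x effect: the median barely moves while the upper percentiles explode.

```c
/* Tiny percentile calculator over synthetic per-I/O latencies. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* p-th percentile (0..100) of n sorted samples, nearest-rank style. */
static double percentile(const double *sorted, size_t n, double p)
{
    size_t idx = (size_t)((p / 100.0) * (double)(n - 1));
    return sorted[idx];
}

int main(void)
{
    enum { N = 1000 };
    static double lat_us[N];

    /* Made-up samples: most reads land around 90-100 us, but roughly 2%
     * get stuck behind writes and come back at 4 ms. */
    for (size_t i = 0; i < N; i++)
        lat_us[i] = (i % 50 == 0) ? 4000.0 : 90.0 + (double)(i % 11);

    qsort(lat_us, N, sizeof(lat_us[0]), cmp_double);
    printf("p50   = %6.0f us\n", percentile(lat_us, N, 50.0));
    printf("p99   = %6.0f us\n", percentile(lat_us, N, 99.0));
    printf("p99.9 = %6.0f us\n", percentile(lat_us, N, 99.9));
    return 0;
}
```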
But what you guys heard about from Matthias yesterday
was zone namespaces, and that's pretty cool.
We need to get aligned on what the really cool thing is
that we're all going to pull together on
and deliver to really solve that problem
of quality of service at scale
that we're just running into now.
There's all sorts of ways to go after it,
but I see us almost in the same space as when I called back to, you know,
when we were still Enterprise NVMHCI
and noticed you had Fusion-io doing this
and a different company doing that.
We have to get together on what the true way is to solve quality of service at scale because this problem is only becoming
worse. And that's where I'd encourage you to get involved at NVMe and take a look at what we're
doing in zone namespaces and other approaches. But we have to get on the same page of where we're
going so that all of us can benefit because it's a universal problem for our industry moving forward.
So then I wanted to briefly touch on scaling the number of SSDs and how fabrics came to be.
So if we talk about fabrics, why did we do NVMe over Fabrics? I think it was because we were bored. I'm not sure. So this is the problem that EMC laid out,
from Mike Shapiro of DSSD, which then got acquired by EMC for a very large amount of money.
He was showing that if you look at, on the blue, it's the scale of the number of devices that people want to attach.
I mean, in 2015, he was calling out, you know, tens of SSDs.
I remember trying to make the case to my management that you would really want to have four SSDs in
the server, and they thought I was crazy back in the day. And then to hundreds of SSDs, and then
to thousands of SSDs. But at the same time, on the right, you have the latency. So NVM Express
with NAND, you know, you're up here at 100-plus microseconds, and then you get to next-gen NVM, something like 3D crosspoint, and you end up very much down here, where I want 20 microseconds.
So the why is: when I get to next-gen NVM, I want to realize that
benefit even if I'm across Ethernet or I'm across OmniPath or I'm across InfiniBand. I want to
realize the benefit of the very fast SSDs and not lose it in translation. Because one thing that
keeps happening is when you translate to something like iSCSI, you lose a ton, including the end-to-end
features. So for example, in SCSI, you know, your command queue is 256 deep, but in NVMe
I can have 32 queues of commands that are all very deep, or 200 queues of commands that are all very
deep, and I have the full feature set, and I want to be able to use that end-to-end to get rid of
translation that only adds latency. And so what we did is we
created NVMe over Fabrics. And what this is showing down here is, okay, I am creating a very thin
host-side transport abstraction and a thin controller-side abstraction, but it's NVMe
end-to-end. And so how do we realize that we only add 10 or 20 microseconds of latency?
So this is back in 2015, and how do we focus on it? We wanted
to leverage everything we did in NVMe. So we leveraged the namespaces, the controllers,
the queuing, all of that we leveraged and kept the commonality. And so 90% common between them.
So common commands, common as much as possible. And we delivered it in 2016. So we had,
if you saw, in June of 2016 we delivered NVMe over Fabrics. We had a Linux host and target. We had
more than 20 companies, and 10 companies showing less than 10 microseconds
of added latency, and we brought that to market. So, of course, you can tell my theme: we just have to keep adding over time.
In the original NVMe over Fabrics, we delivered RDMA as one transport,
and that encompassed InfiniBand, it encompassed Ethernet and whatnot.
We also had Fibre Channel.
What we've added now is TCP.
So TCP is the new transport;
it should be imminently ratified in NVMe. And so why add TCP? There are a couple of reasons.
One is our ambition to scale the discovery capabilities;
part of that comes from TCP being added, which makes the
discovery mechanism much simpler in terms of management of the network. The second thing
is that RDMA is great, Fibre Channel is great, there's a lot of stuff that's really great.
But how do I realize the benefit of NVMe if I have an existing infrastructure with a bunch of existing
NICs that are plain TCP? If I don't have an RDMA NIC in my infrastructure, and I have this
huge infrastructure, say I'm a Facebook or an Azure or a Google and I've got all these NICs,
I need to be able to go in and take advantage of NVMe without ripping out all of my
network cards. And so that's what TCP is giving you. There is absolutely zero
debate that TCP is not the most performance-optimal solution versus what you could build.
But it is the good-enough one: if I have an existing infrastructure and I want NVMe,
I'm going to get 90% of the benefit or more.
So if I take a look at using iSCSI, I'm talking about 100 microseconds of additional latency,
even with a fast implementation. If I go with NVMe over TCP, I'm still
in that 10 to 20 microsecond added ballpark, and I get the NVMe capabilities end to end.
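For a sense of how thin the host side is in practice, here is a hedged sketch of what an NVMe/TCP connect boils down to on Linux underneath nvme-cli, as I understand the fabrics interface: you hand a comma-separated option string to the kernel's nvme-fabrics module through /dev/nvme-fabrics, naming the transport, the target address and port, and the subsystem NQN. The address, port, and NQN below are placeholders, and the exact option keys should be checked against the kernel and nvme-cli you are running; in practice you would simply use nvme discover and nvme connect.

```c
/* Hedged sketch: connect to an NVMe/TCP subsystem via /dev/nvme-fabrics,
 * which is roughly what nvme-cli does underneath. Address, port, and NQN
 * are placeholders; needs the nvme-tcp module and root privileges. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *opts =
        "transport=tcp,"
        "traddr=192.0.2.10,"                   /* placeholder target IP  */
        "trsvcid=4420,"                        /* usual NVMe/TCP port    */
        "nqn=nqn.2016-06.io.example:subsys1";  /* placeholder NQN        */

    int fd = open("/dev/nvme-fabrics", O_RDWR);
    if (fd < 0) { perror("open /dev/nvme-fabrics"); return 1; }

    if (write(fd, opts, strlen(opts)) < 0) {
        perror("fabrics connect");
        close(fd);
        return 1;
    }

    /* On success the kernel typically reports the new controller back. */
    char reply[128] = {0};
    ssize_t n = read(fd, reply, sizeof(reply) - 1);
    if (n > 0)
        printf("connected: %s\n", reply);

    close(fd);
    return 0;
}
```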
So it's a really exciting alternative to iSCSI to really realize the ambition of how do you get
more NVMe SSDs in your network. And then I wanted to highlight software. I try to, you know, forget
about software as much as I can,
but then the software people poke me,
like Andy Rudoff's back there somewhere poking me.
So software plays a critical role.
So one of the things to take a look at here
is this is an Alibaba Cloud storage infrastructure example.
And what we're showing here is that with SATA SSDs,
they got only 200 IOPS per CPU core, and they were sad. Then when they went to NVMe but still used the kernel driver,
they got some scaling. But when they used SPDK, they ended up scaling by, you know, 10x, actually more like 15x.
And so the benefit there is getting to optimized software,
such that when we're talking about the latency we're talking about,
software matters.
And so how do we deal with that and how do we optimize?
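As a conceptual illustration of why the optimized-software numbers look the way they do, and explicitly not SPDK's actual API, here is the shape of the run-to-completion polling model that user-space drivers like SPDK use: one thread stands in for the device posting completions into a shared ring, and the host thread busy-polls that ring, so there are no interrupts and no syscalls on the per-I/O path. The ring size and I/O count are arbitrary.

```c
/* Conceptual only: a busy-polled completion ring, with a thread standing
 * in for the device. Not SPDK code; it just shows the polling model. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define RING_SLOTS 1024
#define TOTAL_IOS  100000

static _Atomic unsigned ring[RING_SLOTS];   /* 0 = empty, else command id */

static void *device_thread(void *arg)
{
    (void)arg;
    for (unsigned id = 1; id <= TOTAL_IOS; id++) {
        unsigned slot = id % RING_SLOTS;
        /* wait for the host to consume the previous entry in this slot */
        while (atomic_load_explicit(&ring[slot], memory_order_acquire) != 0)
            ;
        atomic_store_explicit(&ring[slot], id, memory_order_release);
    }
    return NULL;
}

int main(void)
{
    pthread_t dev;
    pthread_create(&dev, NULL, device_thread, NULL);

    unsigned completed = 0, next = 1;
    while (completed < TOTAL_IOS) {
        unsigned slot = next % RING_SLOTS;
        if (atomic_load_explicit(&ring[slot], memory_order_acquire) == 0)
            continue;                        /* nothing yet: keep polling */
        /* "complete" the I/O; a real driver would invoke a callback here */
        atomic_store_explicit(&ring[slot], 0, memory_order_release);
        completed++;
        next++;
    }

    pthread_join(dev, NULL);
    printf("reaped %u completions by polling, no interrupts involved\n",
           completed);
    return 0;
}
```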
So I can't finish without connectors and form factors,
and it was funny, going through my old archives of old presentations, what I found.
So Jim will recognize this slide about U.2.
So the predecessor, if we go back to 2011
and the SSD form factor work group
that many of you were probably a part of,
we talked about, hey, we need to create
a two and a half inch form factor for enterprise SSDs.
And then it really started to take shape.
We had the capability where you could plug in SAS,
you could plug in SATA, you could plug in a x4 PCIe, or you could do two x2s to get
dual-ported. You could do all sorts of exciting stuff. So that was all the way back, you know,
2013. We as an industry got together in 2010, 2011, and realized it in 2013.
And then I found us inventing M.2. I mean, I've almost tried to put this out of
my memory, but if we look at this, this is a slide from 2011 of the Apple gumstick and mSATA. And
what do we do? These are not great things. So what about an optimized caseless
form factor? That turned into M.2. And M.2 had this ambition of sockets: if you guys remember, there's socket 1, socket 2,
and socket 3, which is storage only. There was a lot of frustration from many people about socket 3,
about creating a by-four, storage-only capability. And we always thought it was
only for the future, for this far-off future, but people started using it right away. And so this all came from client.
So what happened?
We created U.2 and M.2,
and in the data center we have been living with them.
We've lived with U.2, which came from I must have the capability of having hard drives in my system,
and with M.2, a client-derived solution
that we've forced into the data center. And what I wanted
to highlight is, you know, legacy keeps us in a legacy box, those legacy mindsets. One of the
things that's emerging now is how we start to take
essentially the same clean whiteboard that Don shook me back to for NVMe and say, on form factors, we need to
shake ourselves. We can't use form factors
derived from hard drives or from client for the data center transformation we're doing now.
And so what we've been doing over the past two years is putting together something called EDSFF.
And so EDSFF is a scalable form factor for different usages. We have both a short and a long,
and there's both a 1U and a 2U height. And what it
does is let you address another key thing in our data centers: we have a precious number of
PCIe lanes. We can't strand them all on a storage-only connector like M.2 or U.2. We have to
be able to use them for GPUs or for AI accelerators or for other things.
And so that's what EDSFF is giving us, is the ability to put a connector down and then decide later, two years after I've designed my motherboard, what I use it for.
So that's something I encourage you to get more involved in and take a look at, is EDSFF is really that next generation on form factors for where to go.
And I never thought at the beginning of my career I'd ever talk mechanicals.
So with all of that, I just wanted to highlight,
we've delivered this NAND SSD tier.
We have done that together over the past decade.
It's been a lot of work.
It's been connectors.
It's been software.
It's been fabrics.
It's been fundamentally designing a NAND interface for the industry.
But we've done that, and now we have the scale and a multi-billion dollar business that many
of us enjoy for NAND SSDs.
But there's more.
There's future NVM.
So if I go back to IDF in 2011, we were talking about scalability for future NVM. We were dreaming of
the future NVM coming to fruition. And so this was just an example of how do I get as low as I
possibly can? There's a device called Chatham that we built at Intel. It was an FPGA that
simulated the kind of latencies we thought that a 3D crosspoint could achieve, and just how do we
hyper-optimize from the start to hit a million IOPS with very low latency. So then we were looking
at, this is from Flash Memory Summit and Hot Chips back all the way in 2013, and it was just saying,
hey, there's all these different memories that Dave talked much more about than I could ever
talk about. But if you look at it, what this is saying is, how do I fully exploit the next-gen non-volatile
memory? These dark blue and lightish blue portions, all of those are media latency, and they make you
sad. Then you get to the next-generation memory and you're less sad. But then you've got this red. What's that red? That red is the evilness of software. So we had to work on the red, which is what NVMe was attempting
to address, but we really had to go even further. So then in 2016, you know, Intel and
Micron both announced, hey, 3D crosspoint technology, it's awesome, we love it.
And then on to how we reimagine the hierarchy.
So if you look at, you know, we have DRAM, we have NAND SSDs.
3D cross-point is a perfect technology to fit in between.
So why do I care?
This is an SSD picture.
Why do I really care?
Because I can reach a region of operation with 3D cross point and Intel Optane SSDs
that I never could before.
So what this shows is this is an Intel P37, blah, blah, blah.
You know, I'll get in trouble
with the marketing people later.
This is a NAND SSD along this curve,
and it's showing the latency in microseconds at your 99th percentile.
So what it's saying is this is my tail latency,
and if we look at what's our problem, quality of service, right?
Quality of service at scale is what matters to so many people
because they're going for service-level agreements with their customers.
And what this is saying is I've got this tail with NAND, and NAND is just an evil thing because of its programs
and erases, right? It's a beautiful but evil thing. What this is saying is, with 3D crosspoint
I don't have the tail. There's no tail. I can be just about perfectly vertical, because I don't have all
that erase and program activity keeping me from giving the quality of service that I need to give.
So why does that matter?
Let me show a proof point
and talk about where are we going to the future.
So this is an Intel,
and sorry for the horrible font translation.
So first off, if I take a P3700,
so again, a NAND SSD,
and I put an Optane in the box and don't do anything, I get 30% better performance on Cassandra database, which is an important database to many cloud customers.
Then I go optimize with DirectIO and Java optimizations, and I do that software thing that apparently software matters, and I get another 2.5x, right?
So if I really focus on software, I can get so much more at a solution level. And then if I evolve and I go to persistent
memory, I get a 9x, because essentially what persistent memory is about is getting rid of me.
With persistent memory, you don't need NVMe; it just goes away. It's load/store. You don't need NVMe,
we can just go away, or I could go away. So cheer for me going away,
and we really get that 9x performance and really realize the ambition of persistent memory.
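To make that load/store point concrete, here is a minimal sketch using PMDK's libpmem, with the path as a placeholder: map a file on a DAX-capable persistent memory filesystem, store to it directly, and flush for durability, with no block I/O and no NVMe command anywhere in the path. On a machine without real persistent memory the same code still runs; libpmem simply falls back to msync-based flushing.

```c
/* Minimal libpmem (PMDK) sketch of memory-mapped persistence.
 * The path is a placeholder DAX mount. Build with: cc demo.c -lpmem */
#include <libpmem.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *path = "/mnt/pmem0/demo";    /* placeholder */
    size_t mapped_len;
    int is_pmem;

    char *addr = pmem_map_file(path, 4096, PMEM_FILE_CREATE, 0666,
                               &mapped_len, &is_pmem);
    if (addr == NULL) { perror("pmem_map_file"); return 1; }

    /* The "interface" is just the CPU's load/store path. */
    strcpy(addr, "hello, persistent memory");

    /* Flush so the store survives power loss. */
    if (is_pmem)
        pmem_persist(addr, mapped_len);
    else
        pmem_msync(addr, mapped_len);

    printf("wrote %zu-byte mapping (real pmem: %s)\n",
           mapped_len, is_pmem ? "yes" : "no");
    pmem_unmap(addr, mapped_len);
    return 0;
}
```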
And one of the things that I'm excited to say is Intel has actually sold, for revenue,
our first qual sample of our persistent memory DIMM. So despite it always taking a long time,
and if I think about NVMe, that was 2007 to 2014 to ship,
we have a long history of persistent memory coming, but it is here.
And so on that emerging hierarchy, from an Intel perspective we would say we think Optane SSDs and Optane
persistent memory are a great addition to this hierarchy so that you can optimize for that TCO.
And with that, we have so many zettabytes, and we need to make sure that in that hierarchy
we are putting the zettabytes where you need them, based on the workload and the TCO and the value
you deliver to the customers.
And so, we invented this future together.
There are so many people in this room that I worked with so closely over the past 10 years, and this has been us, together, with fits and starts, with really stupid ideas by people like me, like NVMHCI for client,
and then we fix each other and we deliver something great together.
And I really think over the next decade, we have more opportunity with scaling quality of service, with taking advantage
of the new memory and the opportunity that provides. And I'm really excited to work with
all of you on that. So that's my talk. Thanks. I think you guys get a break now, but if you want to, I can answer questions.
We have time for a question or two.
Excellent. Go ahead.
Lucy, can you wait for Lucy?
Oh, there's a mic. You must have a mic.
Great talk.
We're being live streamed.
Do you have an update?
I know that there's been yield issues with 3D Crosspoint,
and the marriage is breaking up between Micron and Intel.
Do you have any update on when?
You have said nothing about price.
All you said was availability about 3D Crosspoint.
Have you ever learned about letting engineers know anything about price?
They intentionally don't tell me the price, so I see the price, you know, just like you do. The only thing I would say about the quote-unquote marriage breaking up
is that I think you just have to look at what is Intel's ambition. My ambition, so I actually
joined the data center group at Intel 18 months ago, and I actually do real data center stuff now,
which is frightening. But we're about how do we optimize Optane
for the data center?
And so I think that, fundamentally,
there's more drama made of this
than there needs to be.
We are laser focused on delivering data-center-optimized performance.
And, this is my perspective, mine only, not Intel's:
that's where I've seen we're going.
And that's where we're going to deliver the most optimal Optane SSD
and Optane persistent memory solution we can
to deliver for the industry
for the data center ambitions we have.
And so we have so much AI,
the amount of data AI needs,
you know, as Dave was talking about,
it just is crazy.
And so we really see we need to optimize for AI and where the data center is going.
All right, one last question.
Right here.
Hi, Amber.
Hi.
NVMe was designed primarily as a block interface, but we see now the emergence of CMB and PMR,
and you mentioned those as part of your talk.
How do you see the emergence of those new byte-level access mechanisms?
Should they be part of NVMe, or should there be a new standard for accessing those types of memories?
I mean, my opinion is that the amount of investment for the industry to truly optimize yet another thing in between NVMe and persistent memory,
I think it's a big lift for a tweener.
That's where controller memory buffer and persistent memory region are a good opportunity: for something that's out over PCIe,
being able to optimize your transfer to it.
But fundamentally, I think you really end up with a persistent memory solution
being what you do, rather than trying to find yet another interface for that tweener. So, yeah.
Thank you guys for your time. I appreciate it. Yeah. Thanks Amber.
Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending
an email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic
further with your peers in the storage developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.