Storage Developer Conference - #96: Solid State Datacenter Transformation
Episode Date: May 20, 2019...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts.
You are listening to SDC Podcast Episode 96. So what I wanted to talk to you today about
was the solid state transformation of the data center.
I was asked to maybe do a walkthrough, a little walk back in time.
So I had fun going back in my archives and seeing what we've done together as an industry.
So, just the disclaimer. There's going to be a test, so make sure you read all that.
So one thing I wanted to start out with is zettabytes. I didn't really even know what
zettabytes were until the past few years, when I went from hearing about petabytes to hearing
about zettabytes. And what you can see is that the edge and data centers and endpoints are driving
just a tsunami of data. And so there's no better time for being in storage
than what we see now.
And there's only more coming. Dave talked eloquently
about how in AI you absolutely need
more and more data, and that the thirst for data
is increasing, so there's a lot here.
And what we need is faster and faster solid state drives
and persistent memory to take advantage of that.
So in today's talk, what I wanted to do is talk about the industry transformation to
embrace NVM and how we can continue to do that together.
So I've got almost a look back at the past 10 years of what we've done together and
how we've shifted as we move forward to really realize what we have in the NVM industry.
So if we look at the beginning of the journey,
going back to 2000: when I started at Intel,
I was working on software for ATA,
which was even the precursor to Serial ATA.
How could I make hard drives faster
and offload them into system memory
with faster algorithms?
It was fascinating,
but we didn't have these tiers in the middle.
So how did we get to where we are? If you go back a decade at IDF in 2007, I was talking about,
hey, the PC could be a driver of NAND growth, and the SSD could become the new killer application of NAND flash.
It's hard to remember, for those of us in this room, when it was only potential
that SSDs would be the killer application for NAND.
So where have we come? What we've done over the past decade is go from asking whether NAND SSDs could be part of this tiering, this memory hierarchy, to delivering it.
And how did we get here?
So the first thing I wanted to talk about was the need for the open NAND flash interface.
So if we scroll back, we had no standard NAND interface going back to the time in 2005, 2004.
So at least in my experience, when I was starting on SSDs at Intel, one of our issues was how do I even design an SSD interface? We had Toshiba,
we had Samsung, we had many smaller NAND vendors. And what we were struggling with was how do I even
solidly implement an ASIC for an SSD? What could I do? And so the challenge was basically the way
the industry worked at that time was Samsung would publish their latest data sheet and everybody else
would then adjust for whatever the new developments were.
And so, to grow the ambitions of what we wanted to do
in the SSD industry, we really had to address that.
So then enter ONFI.
And so what we did is, if you go back,
and sorry for the odd formatting,
one of the fonts did not translate.
If you go back to ONFI, back in 2006 what we did first was codify the commonalities, and then we started to scale
for SSDs, increasing the performance. I don't know if we all
remember what we've done together over the past decade, but if you scroll back, NAND in 2005 had a 40 megabyte per second interface to the NAND chip. From there,
in ONFI 1.0 we went to 133 megabytes per second, then to
266, then 400, bringing over the DRAM learnings, and we scaled that in less than two years. And then we kept scaling for the SSD industry.
So obviously one thing that many of you in this room know is
ONFI was a coalition of the smaller companies, the Microns and the SK Hynixes and the Intels.
And so what we did is we worked with JEDEC and formed an ONFI-JEDEC collaboration.
And then what we were able to do through that is work with Samsung and Toshiba
to really bring together and unify the industry
on how we had commonality
in the most important parts of the NAND interface
such that all of us could build SSD controllers confidently
and move forward with the scale of our industry ambitions.
Then we had steady innovation, year over year over year. We added features, we went to,
you know, 800 megatransfers per second, and I'll talk in a second about 1,200, but we just continued
over time to deliver steady innovation and keep pace with SSD needs. So up to today:
in 2014 we delivered 800 megatransfers per second, and in 2017 we delivered 1,200 megatransfers per second.
And so we've continued to keep pace with where NAND needs to go
in order to hit SSD needs and develop that NAND tier.
And so what are we seeing then?
Should we just keep going?
I mean, I started out at 40 megabytes per second and now we're 1,200,
so we should just go 2,400.
So one thing I wanted to say is NAND is an amazing technology,
but at the scaling of the individual NAND device, we've really reached where
I believe we should go. I mean, maybe we go slightly higher, but what these charts are
saying is that as you keep increasing your data rate, you're getting diminishing returns at the
NAND. So I'm not going to be back here in two years telling you we've scaled the 1,600 megatransfers
per second or 2,000 at the NAND.
What this is saying is we've reached the end of our ambitions with NAND in terms of at
an individual NAND layer unless we get a breakthrough in the NAND media.
You just get scaling where I'm paying power and I'm not getting
a system-level benefit or an SSD-level benefit to bring forward. So that was our first
pass: if we're going to deliver a NAND SSD tier, first off we have to be able to attach NAND
in a big way, as many of us have 40 NAND sites, 80 NAND sites, 120 NAND sites in our devices.
And how do we go forward and be confident about, I can use Samsung, I can use Toshiba, I can use Intel, I can use Micron.
So as an industry together, we've delivered that.
And what I wanted to get across in this talk is one of the most important things is those of us in this room,
us working together and figuring out what we want to do.
When we first started talking in 2004 about starting ONFI, we kind of got laughed at quite a bit.
Because it's like, Intel, you don't even own NAND.
What are you talking about?
Because this is pre-Intel Micron NAND merger.
So I just wanted to say: us in this room,
that's how we move things forward.
So now I wanted to talk about the path to NVM Express
and what we need to continue to do.
So the ONFI work is pretty much done,
and it's not really continuing, but with NVM Express,
we have way more. Are you guys asleep yet? You seem very sleepy. So I'm going to disclose something
shocking. I'm terrible at names. NVM HCI, that was the original name of NVM Express.
So this is back at Flash Memory Summit in 2009. I was out there talking about, hey, look at all this awesomeness of NVM HCI and optimized interface for NVM.
And what you can see here is we were really looking at it from a, hey, I've got an HCI controller, an NVM HCI,
and I'm going to do this awesome caching and client, and I'm going to have this really cool thing where I just have NAND on a DIMM,
and I just attach the NAND, and it's going to be awesome. And it was meant to be this very
low-cost client capability. Well, sometimes our ambitions are failures, right? Total misfire;
that did not go anywhere. So, one of the things:
I love books, as Dave does, and one of the things that I find fascinating is failure.
The growth mindset books by Carol Dweck of Stanford are fascinating on how we really evolve things.
It's by failing and then picking yourself up and moving to the next thing.
So NVMHCI, big failure.
But it was a springboard for enterprise. So what we figured out as an industry was,
hey, look at all these PCIe SSDs that are happening,
all these Fusion-io SSDs.
In the same way that with ONFI we thought about how to enable NAND
for a broad-based ambition of adoption in the industry through standards,
we as the industry needed to do the same thing for PCIe SSDs in the enterprise.
You saw Fusion-io and you saw a lot of different companies doing things,
but not in a way that would scale for the industry. And so,
why not put a word like enterprise in front of NVMHCI? That sounds really good.
So that's what we did. So how do we address the gap? We talked
about Enterprise NVMHCI, how we deliver it, and the goals and timeline.
And of course, the first thing I did was say, you know, we should just extend client NVMHCI, which was
originally intended to be just an extension of AHCI. And so I was like, for enterprise, okay, I have to tweak a few things, but that's fine.
Well, that's where all of us come together and solve problems together.
And I remember Don Walker from Dell, he just shook me and said,
no, you are not going to just extend this client thing to enterprise.
We need to clean this up.
And what he said is, hey, NVM Express, or Enterprise NVMHCI at the time, is going to revolutionize the industry. We cannot
build it off of some, you know, house of cards. We have to have a really solid
foundation and do the absolute best we can. And so Don pushed me and said, no, we have
to start from scratch. And that's how Don, Peter, several of us got together and said, okay,
what are the principles of Enterprise NVMHCI? We started out, back in 2010,
we talked about, okay, the first bucket was to get rid of the performance bottlenecks we see in other
interfaces. So for example, if you look at AHCI, you have register MMIO reads that cost you a microsecond each; you've got to get rid of all
of those. You have to simplify the decoding. One of the challenges we saw with really
large-scale SSD ambitions is that I need to be able to hardware automate everything, and if you look at
SAS or Serial ATA, it becomes pretty challenging to hardware automate certain things because of boutique features. So how do we get to
one read and one write? Another big bucket was a streamlined command set, again to realize the
ambition of hardware automation. Then into enterprise features: how do I have encryption? How
do I have end-to-end data protection? How do I have solid error reporting? And then lastly, how do I get a scalable architecture?
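To make that bucket concrete, here is a rough, hedged C sketch of the queuing model those goals led to, as I recall it from the base specification: the host builds a 64-byte command in a submission queue that lives in ordinary host DRAM, rings the submission queue tail doorbell with a single posted MMIO write, and then finds completions by polling a completion queue that also lives in host DRAM, so the fast path never stalls on an MMIO register read. The field layout is abbreviated and the little self-test uses plain memory in place of a device, so treat it as an illustration, not a real driver.

```c
/* Simplified sketch of the NVMe queuing model: abbreviated structures,
 * no real hardware. A fake doorbell variable stands in for the device. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct nvme_sqe {                 /* 64-byte submission queue entry */
    uint8_t  opcode;              /* 0x02 = Read for the NVM command set */
    uint8_t  flags;
    uint16_t cid;                 /* command ID, echoed back in the CQE  */
    uint32_t nsid;                /* namespace ID                        */
    uint64_t rsvd, mptr;
    uint64_t prp1, prp2;          /* physical address(es) of the buffer  */
    uint32_t cdw10, cdw11;        /* Read: starting LBA, low/high        */
    uint32_t cdw12;               /* Read: number of blocks minus one    */
    uint32_t cdw13, cdw14, cdw15;
};

struct nvme_cqe {                 /* 16-byte completion queue entry */
    uint32_t dw0, rsvd;
    uint16_t sq_head, sq_id;
    uint16_t cid;
    uint16_t status;              /* bit 0 of this halfword = phase tag  */
};

struct queue_pair {
    struct nvme_sqe   *sq;          /* submission queue in host DRAM     */
    struct nvme_cqe   *cq;          /* completion queue in host DRAM     */
    volatile uint32_t *sq_doorbell; /* MMIO tail doorbell register       */
    uint16_t sq_tail, cq_head, depth, phase;
};

/* Build the command in host memory, then issue one posted doorbell write. */
static void submit_read(struct queue_pair *qp, uint32_t nsid, uint64_t lba,
                        uint16_t nblocks, uint64_t buf_phys, uint16_t cid)
{
    struct nvme_sqe *sqe = &qp->sq[qp->sq_tail];
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = 0x02;
    sqe->cid    = cid;
    sqe->nsid   = nsid;
    sqe->prp1   = buf_phys;
    sqe->cdw10  = (uint32_t)lba;
    sqe->cdw11  = (uint32_t)(lba >> 32);
    sqe->cdw12  = (uint32_t)(nblocks - 1);

    qp->sq_tail = (uint16_t)((qp->sq_tail + 1) % qp->depth);
    __sync_synchronize();             /* make the SQE visible first      */
    *qp->sq_doorbell = qp->sq_tail;   /* single MMIO write, never a read */
}

/* Reap completions by polling host memory; a real driver also updates the
 * CQ head doorbell periodically, which is again a write, not a read. */
static int poll_completion(struct queue_pair *qp, uint16_t *cid_out)
{
    struct nvme_cqe *cqe = &qp->cq[qp->cq_head];
    if ((cqe->status & 1) != qp->phase)
        return 0;                     /* no new completion yet           */
    *cid_out = cqe->cid;
    if (++qp->cq_head == qp->depth) { qp->cq_head = 0; qp->phase ^= 1; }
    return 1;
}

int main(void)                        /* self-test with plain memory     */
{
    static struct nvme_sqe sq[16];
    static struct nvme_cqe cq[16];
    static uint32_t fake_doorbell;
    struct queue_pair qp = { .sq = sq, .cq = cq,
                             .sq_doorbell = &fake_doorbell,
                             .depth = 16, .phase = 1 };
    submit_read(&qp, 1, 0x1000, 8, 0x1000000ULL, 42);

    cq[0].cid = 42;                   /* pretend the device completed it */
    cq[0].status = 1;
    uint16_t cid;
    if (poll_completion(&qp, &cid))
        printf("command %u completed, doorbell=%u\n",
               (unsigned)cid, (unsigned)fake_doorbell);
    return 0;
}
```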
So something I'll come back to more and more at the end: we look back and this is 2010, but we really
started working on this in like 2007. And I'll give away a big secret: NVMe didn't ship until 2014,
and it really didn't gain steam until 2016. And so one of the things that we as an industry have
to be super clear on is where we are going. It takes a long time to get anywhere, as you guys
in SNIA know from working through the persistent memory TWG and all of these things.
It takes a long time to gain that momentum and truly deliver a massive innovation.
So as an industry, we really have to know where we're going.
And so that's why I think it's super, super important that we focus on what are we trying to build.
And so from the very beginning with NAND, we knew. I mean, NAND has been going to die, you know, ever since I've been working on it, but it's not dead yet, right? But we always knew there was something else coming,
and so we needed to define NVM Express from the ground up to make sure that we could accomplish
the ambition of things like 3D Crosspoint. So NVM Express was born in 2011, in March of 2011.
And so, we had this board of directors that included a few companies that no longer exist.
But Micron, EMC, many of us in this room.
And it really delivered efficient SSD performance.
And so, if we look all the way back, we got rid of the uncacheable register reads.
We got MSI-X and interrupt steering so we could get up to a million IOPS.
We really maximized queue depth and efficiency for 4K commands.
And then we really focused on how do we develop the ecosystem.
We looked at the interop program from the very start.
How do we broadly scale and make sure that we address things together?
And major drivers for all OSs.
I think something that was a critical breakthrough in NVMe
was we had drivers available before the first SSDs.
That's really one of the things that made NVMe go
versus potential competitors of the time.
Many of you in this room might remember
the quote-unquote war with SCSI Express, right?
A lot of it was just putting your head to the wind
and really delivering and building out the ecosystem
where everyone could thrive.
We then focused on how do we build out the enterprise feature set?
How did we add multi-path support and reservations
and other capabilities and just continue to build?
And so another key thing that I see that SNIA does so well is pick a point to go
and just keep going and build out the capabilities over time. And we get there as an industry
together. And then finally we shipped. I don't know if I'm going to get in trouble if Intel sees
that I've included a Samsung picture first, but Intel shipped very shortly thereafter.
So we had the first plugfest in May of 2013 with 11 companies. We had the first product
announcement from Samsung in July of 2013, and then they really shipped in volume later
in the year, as did Intel. And so we had two drives out in the market in 2013. And it really
delivered in performance. So this talks
about, hey, let's look at the performance. We had awesome random reads, awesome random writes,
very good sequential performance, really taking advantage of PCI Express. And so how do we build
on a foundation that scales is another key thing that I think through. So if we look at today,
PCIe Gen 4 is starting to come out into the marketplace
over the next year. And you even have people working on PCIe Gen 5; the PCIe Gen 5 specification
is on track to finish in the next year or so. I shouldn't announce anything on PCI SIG. Many of
you are SIG members and know much more than I do. But we're just scaling. And that sequential
performance, we can continue to deliver with NAND or other media. And then how do we do it efficiently and with low latency?
There was a lot of debate early on in NVM Express about why are we worrying about all this
efficiency? Why does that matter? It doesn't matter for NAND. I mean, when I look at AHCI,
the Serial ATA programming interface that I was also the author of, and how bad it is versus NVMe,
people would say, why are you focused so much on the efficiency and latency of NVMe?
And it wasn't for NAND.
It was for making sure that we didn't have to redo NVMe for new memories like 3D crosspoint in the
future. So it was super important for how do we get the average latency down. So this
is just showing, going all the way back, that if we look at the average latency, just NVMe running
on PCIe Gen 3 off the CPU is so much lower latency than SAS or SATA, and that's
really the focus we had to have. And with better quality of service.
So getting to this curve here:
what does it look like for my interface latency
for 4K random reads?
I just fall off a cliff on quality of service
with SATA as well as with SAS,
whereas I get consistent performance with NVMe.
And shockingly, the analysts noticed over time.
And so people started to project that PCIe,
mostly NVMe, would replace SATA and SAS over time,
and one of the things that was exciting to me
in the last quarter is that Intel, for the first time,
shipped more terabytes in capacity on NVM Express
than on serial ATA,
so we truly are hitting a crossover in the industry.
And we continue to add capabilities.
So if we take a look at NVMe, we're just adding capabilities over time.
One of the things we did is in 2014, we added NVMe 1.2.
We added both client and server features.
So, for example, host memory buffer was a client feature allowing you to utilize host DRAM with the SSD.
Whereas controller memory buffer, well, again,
my whole theme is confusing people with naming.
So host memory buffer, good thing for client.
Controller memory buffer, more of a good thing for enterprise.
So it's all about how you suck people into your work groups
because they can't tell otherwise what you're talking about.
No, no, no.
And then we headed to new frontiers.
So we went on to the management interface,
then on to fabrics,
and then driving innovation in the cloud.
So somewhere we could really see we were hitting on something is,
just like with AI today, how we got embraced
by some of the up-and-coming players, you know, that were not entrenched in any traditional focus area.
So if we take a look, you had the Lightning design that Facebook did.
And then just talking through why did they embrace NVMe, you had Google Cloud Platform all the way
back in 2016, differentiating SCSI versus NVMe.
And it all comes down to what we're showing here: this is the HDD latency on SAS and SATA.
Then you go over to NAND on SAS or SATA, and you get a really nice jump, you know, from milliseconds to microseconds.
Then you go to NVMe, you get another bump, but you still have all of this latency in the media with NAND.
And that last step is, you know, the dream to attain, and I'll talk more about that dream.
And then even more capabilities in NVMe revision 1.3. And so in revision 1.3, we added virtualization, we added directives, and just continuing, you know, about a two-year cadence
on the specification for what people need and how we
keep delivering for the industry. So we delivered streams, we delivered virtualization, and more.
And so streams starts to hit on a key point I want to talk to you about:
what's our North Star as we move forward in NVMe? Streams is heading toward what we do to
deal with some of the conflicts when I
have multiple tenants using a single drive. When I have so many tenants using one drive,
I get these inherent conflicts. What am I going to do to really address that and make sure that
I can have this very big drive be used by so many tenants and feel like a small drive?
That's, I think, the biggest
challenge of what we have moving forward. So here's the latest NVMe roadmap. I think where
you're sitting, you should put on your binoculars and make sure that you study this deeply.
What we're doing is we have NVMe MI 1.1, so we have the management interface.
We're adding enclosure management capabilities, so enclosure management features,
making sure that you can have your LEDs and have everything else.
We have NVMe over Fabrics 1.1.
There are two key things in NVMe over Fabrics 1.1.
One is a new TCP transport that I'll talk a
little bit more about. The other is enhanced discovery. When we originally delivered
NVMe over Fabrics, our ambitions were not completely massive scale
to start; we wanted to take things in chunks.
And so now, as we've gotten much more broad adoption,
what we need to do is make sure that when you're out trying to figure out what's on the network,
we have enhanced discovery capabilities so you can more easily find what's changed in the network.
So you actually can be able to query, hey, what got added on, what moved, all of those things.
And then NVMe 1.4.
Both NVMe over Fabrics 1.1 and NVMe Management Interface 1.1 we're expecting to release by the end of this year.
NVMe itself, the core specification, revision 1.4, we're anticipating to deliver mid next year.
What that's got as anchor features is a lot of multi-pathing capability.
One of the key things is just building out further and further a lot of our multi-pathing.
So asymmetric namespace access is one of the key things.
If Fred Knight is here, Fred can tell you at great length about the wonderful capabilities in ANA.
What that's trying to do is make sure that if you have two paths to an SSD,
you know whether the path you're on is currently the optimal path,
and you can fail over to the less optimal one when you need to.
Then we have persistent memory region.
One of the capabilities we're doing there: you take the controller memory buffer,
which has such an awesome name, and you think, okay, the controller memory buffer is for when it's volatile.
What persistent memory region gives you is essentially a persistent controller memory buffer,
but we just renamed it. So persistent memory region: what it's really giving you is a capability
to have memory access to your SSD. You can imagine some people, thinking about database applications,
who have a portion of the device that they want to use in a memory-like fashion for certain small updates,
and then they want to use the vast majority of
the SSD as NAND in the normal fashion. So that's what persistent memory region gives you.
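To give a feel for the access model a controller memory buffer or persistent memory region exposes: both live inside a PCIe BAR on the SSD, and on Linux a BAR can be memory-mapped from sysfs and then touched with plain loads and stores. The sketch below is a generic, hedged illustration only: the PCI address and BAR index are placeholders, it needs root, and a real application would discover the CMB or PMR offset and size from the controller's registers rather than hard-coding anything.

```c
/* Generic illustration: mmap a PCIe BAR from sysfs and access it with
 * ordinary loads and stores. Device path and BAR index are placeholders. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *bar_path = "/sys/bus/pci/devices/0000:03:00.0/resource2";
    const size_t map_len = 4096;

    int fd = open(bar_path, O_RDWR | O_SYNC);
    if (fd < 0) { perror("open BAR"); return 1; }

    volatile uint8_t *bar = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) { perror("mmap BAR"); close(fd); return 1; }

    /* Byte-addressable access to device memory: no block I/O involved. */
    bar[0] = 0xA5;
    printf("first byte of mapped region: 0x%02x\n", bar[0]);

    munmap((void *)bar, map_len);
    close(fd);
    return 0;
}
```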
And then lastly, IO determinism. Let me lead into that: IO determinism is at the forefront of what I think
we need to focus on next. So let me dive in. Today's challenge, as I really see it, is quality of service at scale. And IO determinism
is one mechanism going after that, but I wanted to talk with us about how we align on the real
problem we need to go after next and how we do that together. So I am shamelessly stealing lots
of pictures from a presentation that Chris Peterson and I gave together at Flash Memory Summit in 2017.
So Facebook at scale, these are numbers from 2017.
And so it's only gotten way, way bigger.
Right.
So you talk about, you know, two billion people on Facebook each month.
It just doesn't even make sense. Two billion of something, it's something of something,
but they're big numbers, right? They're really big numbers. That's my point: at scale, it's big.
I sound like an engineer, right? So one of the things that's occurring is disaggregation, right?
Disaggregation of storage is happening over time. You have a server where you really
want to disaggregate because your CPU and your flash, in year one, you've balanced them. In year
two, well, hey, on this particular server maybe I need more flash, and so your needs have outstripped the box.
Or on the converse side, okay, I've got wasted CPU resource. So a critical thing about
disaggregation of flash, or storage pooling as some people like to call it, is:
how do I not tie together two things that don't need to be tied together? How do I
not tie my CPU scaling to my flash scaling? And we're seeing that in all sorts of things. For example, I was in
a conversation, you know, earlier this week about how do I not tie my AI scaling to my core scaling,
right? How do I make sure that I can independently scale these variables to deliver the best
performance and the most optimized? And one thing you'll hear Chris talk about is dark flash and the
evil of dark flash, right? Flash is not the cheapest
thing ever. I was very disappointed to learn, as an example, that for one of the major
cloud service providers, you know, one of the top Super Seven as they call them,
the most money spent is on memory, on DRAM. The second most
is on SSDs, and the third is on CPUs. And being from Intel, that was super sad. So a key thing is
how do we disaggregate flash so that you can avoid buying more flash than you need and really right-size your variables of CPU and memory and
flash. So the challenge that Chris is seeing is that the NAND flash trend, if you look at it,
is kind of crazy, right? So the NAND flash trend is we just keep growing our NAND flash capacity,
which is awesome for everybody in this room. But what that leads to is a really big challenge.
If you roll back a few years,
people were attaching, say, I attach a terabyte M.2 SSD. That's awesome. Now I've got a four
terabyte M.2 SSD. However, my application doesn't need four terabytes now. How do I now land more
applications on a single SSD? And what does that mean to me as a shared resource?
So what that means is I cry.
So what you have here is, you know,
you've got your noisy neighbor problem.
And so I have application B and C
that are being very well behaved
and just doing read requests.
And then application A is just doing all these writes
and just thrashing my SSD.
So how do I live in that environment moving forward?
And so Chris was trying to give a great example;
he couldn't get the rights to an image that I liked even more,
which was a Ferrari in a herd of sheep.
So how do you deal with that?
I got this Ferrari. It's awesome.
But what am I going to do with it?
So we need to create those individual lanes, that individual racetrack, for our awesome devices to really realize
our ambitions moving forward. So if you look at this, this is a more realistic graph, real data,
showing what the problem is. What Chris is showing here is my read latency as a histogram of my requests.
Most of my requests come in at between 90 and 100 microseconds of latency; that's what this
is showing. I don't have much of a tail. Pretty much everything is, you know, cut off at 100 microseconds. So that's awesome. I do have a tail,
and my tail doesn't go that far. It only goes to like 130 microseconds. If you notice, what happens
is I introduce 10% 4K writes alongside 90% 4K reads. That's my noisy neighbor from the last slide, and all of a sudden
my worst-case outlier goes from 130 microseconds to four milliseconds. That's super sad, which is
why he's crying. That 35x, that is our industry challenge. What do we do to address this? So one thing that I've worked on the past few years
is something called IO determinism, or NVM sets,
and that's pretty cool.
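Since the whole quality-of-service-at-scale problem keeps coming back to tail latency, here is a small, self-contained C sketch of the measurement itself, with made-up numbers: collect per-I/O latencies, sort them, and read off the percentiles. It is only meant to show the shape of that 35x effect: the median barely moves while the upper percentiles explode.

```c
/* Tiny percentile calculator over synthetic per-I/O latencies. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* p-th percentile (0..100) of n sorted samples, nearest-rank style. */
static double percentile(const double *sorted, size_t n, double p)
{
    size_t idx = (size_t)((p / 100.0) * (double)(n - 1));
    return sorted[idx];
}

int main(void)
{
    enum { N = 1000 };
    static double lat_us[N];

    /* Made-up samples: most reads land around 90-100 us, but roughly 2%
     * get stuck behind writes and come back at 4 ms. */
    for (size_t i = 0; i < N; i++)
        lat_us[i] = (i % 50 == 0) ? 4000.0 : 90.0 + (double)(i % 11);

    qsort(lat_us, N, sizeof(lat_us[0]), cmp_double);
    printf("p50   = %6.0f us\n", percentile(lat_us, N, 50.0));
    printf("p99   = %6.0f us\n", percentile(lat_us, N, 99.0));
    printf("p99.9 = %6.0f us\n", percentile(lat_us, N, 99.9));
    return 0;
}
```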
But what you guys heard about from Matthias yesterday
was zone namespaces, and that's pretty cool.
We need to get aligned on what the really cool thing is
that we're all going to pull together on
and deliver to really solve that problem
of quality of service at scale
that we're just running into now.
There's all sorts of ways to go after it,
but I see us almost in the same space as when I called back to, you know,
when we were still Enterprise NVMHCI
and noticed you had Fusion-io doing this
and a different company doing that.
We have to get together on what the true way is to solve quality of service at scale because this problem is only becoming
worse. And that's where I'd encourage you to get involved at NVMe and take a look at what we're
doing in zone namespaces and other approaches. But we have to get on the same page of where we're
going so that all of us can benefit because it's a universal problem for our industry moving forward.
So then I wanted to briefly touch on scaling the number of SSDs and how fabrics came to be.
So if we talk about fabrics, why did we do NVMe over Fabrics? I think it was because we were bored. I'm not sure. So this is the problem that EMC laid out,
from Mike Shapiro of DSSD, which then got acquired by EMC for a very large amount of money.
He was showing that if you look at, on the blue, it's the scale of the number of devices that people want to attach.
I mean, in 2015, he was calling out, you know, tens of SSDs.
I remember trying to make the case to my management that you would really want to have four SSDs in
the server, and they thought I was crazy back in the day. And then to hundreds of SSDs, and then
to thousands of SSDs. But at the same time, on the right, you have the latency. So NVM Express
with NAND, you know, you're up here at 100-plus microseconds, and then you get to next-gen NVM, something like 3D crosspoint, and you end up very much down here, where I want 20 microseconds.
So the why is: when I get to next-gen NVM, I want to realize that
benefit even if I'm across Ethernet or I'm across OmniPath or I'm across InfiniBand. I want to
realize the benefit of the very fast SSDs and not lose it in translation. Because one thing that
keeps happening is when you translate to something like iSCSI, you lose a ton, including the end-to-end
features. So for example, in SCSI, you know, your command queue is 256 deep, but in NVMe
I can have 32 queues of commands that are all very deep, or 200 queues of commands that are all very
deep, and I have the full feature set, and I want to be able to use that end-to-end to get rid of
translation that only adds latency. And so what we did is we
created NVMe over Fabrics. And what this is showing down here is, okay, I am creating a very thin
host-side transport abstraction and a thin controller-side abstraction, but it's NVMe
end-to-end. And so how do we realize that we only add 10 or 20 microseconds of latency?
So this is back in 2015, and how do we focus on it? We wanted
to leverage everything we did in NVMe. So we leveraged the namespaces, the controllers,
the queuing, all of that we leveraged and kept the commonality. And so 90% common between them.
So common commands, common as much as possible. And we delivered it in 2016. So we had,
if you saw, in June of 2016 we delivered NVMe over Fabrics. We had a Linux host and target. We had
more than 20 companies, and 10 companies showing less than 10 microseconds
of added latency, and we brought that to market. So, of course, you can tell my theme: we just have to keep adding over time.
In the original NVMe over Fabrics, we delivered RDMA as one transport,
and that encompassed InfiniBand, it encompassed Ethernet and whatnot.
We also had Fibre Channel.
What we've added now is TCP.
So TCP is the new transport;
it should be imminently ratified in NVMe. And so why add TCP? There are a couple of reasons.
One is our ambition to scale the discovery capabilities;
part of that comes from TCP being added, which makes the
discovery mechanism much simpler in terms of management of the network. The second thing
is that RDMA is great, Fibre Channel is great, there's a lot of stuff that's really great.
But how do I realize the benefit of NVMe if I have an existing infrastructure with a bunch of existing
NICs that are plain TCP? If I don't have an RDMA NIC in my infrastructure, and I have this
huge infrastructure, say I'm a Facebook or an Azure or a Google and I've got all these NICs,
I need to be able to go in and take advantage of NVMe without ripping out all of my
network cards. And so that's what TCP is giving you. There is absolutely zero
debate that TCP is not the most performance-optimal solution versus what you could build.
But it is the good-enough one: if I have an existing infrastructure and I want NVMe,
I'm going to get 90% of the benefit or more.
So if I take a look at using iSCSI, I'm talking about 100 microseconds of additional latency,
even with a fast implementation. If I go with NVMe over TCP, I'm still
in that 10 to 20 microsecond added ballpark, and I get the NVMe capabilities end to end.
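For a sense of how thin the host side is in practice, here is a hedged sketch of what an NVMe/TCP connect boils down to on Linux underneath nvme-cli, as I understand the fabrics interface: you hand a comma-separated option string to the kernel's nvme-fabrics module through /dev/nvme-fabrics, naming the transport, the target address and port, and the subsystem NQN. The address, port, and NQN below are placeholders, and the exact option keys should be checked against the kernel and nvme-cli you are running; in practice you would simply use nvme discover and nvme connect.

```c
/* Hedged sketch: connect to an NVMe/TCP subsystem via /dev/nvme-fabrics,
 * which is roughly what nvme-cli does underneath. Address, port, and NQN
 * are placeholders; needs the nvme-tcp module and root privileges. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *opts =
        "transport=tcp,"
        "traddr=192.0.2.10,"                   /* placeholder target IP  */
        "trsvcid=4420,"                        /* usual NVMe/TCP port    */
        "nqn=nqn.2016-06.io.example:subsys1";  /* placeholder NQN        */

    int fd = open("/dev/nvme-fabrics", O_RDWR);
    if (fd < 0) { perror("open /dev/nvme-fabrics"); return 1; }

    if (write(fd, opts, strlen(opts)) < 0) {
        perror("fabrics connect");
        close(fd);
        return 1;
    }

    /* On success the kernel typically reports the new controller back. */
    char reply[128] = {0};
    ssize_t n = read(fd, reply, sizeof(reply) - 1);
    if (n > 0)
        printf("connected: %s\n", reply);

    close(fd);
    return 0;
}
```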
So it's a really exciting alternative to iSCSI to really realize the ambition of how do you get
more NVMe SSDs in your network. And then I wanted to highlight software. I try to, you know, forget
about software as much as I can,
but then the software people poke me,
like Andy Rudoff's back there somewhere poking me.
So software plays a critical role.
So one of the things to take a look at here
is this is an Alibaba Cloud storage infrastructure example.
And what we're showing here is that with SATA SSDs,
they got only 200 IOPS per CPU core, and they were sad. Then when they went to NVMe but still used the kernel driver,
they got some scaling. But when they used SPDK, they ended up scaling by, you know, 10x, actually more like 15x.
And so the benefit there is getting to optimized software,
such that when we're talking about the latency we're talking about,
software matters.
And so how do we deal with that and how do we optimize?
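As a conceptual illustration of why the optimized-software numbers look the way they do, and explicitly not SPDK's actual API, here is the shape of the run-to-completion polling model that user-space drivers like SPDK use: one thread stands in for the device posting completions into a shared ring, and the host thread busy-polls that ring, so there are no interrupts and no syscalls on the per-I/O path. The ring size and I/O count are arbitrary.

```c
/* Conceptual only: a busy-polled completion ring, with a thread standing
 * in for the device. Not SPDK code; it just shows the polling model. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define RING_SLOTS 1024
#define TOTAL_IOS  100000

static _Atomic unsigned ring[RING_SLOTS];   /* 0 = empty, else command id */

static void *device_thread(void *arg)
{
    (void)arg;
    for (unsigned id = 1; id <= TOTAL_IOS; id++) {
        unsigned slot = id % RING_SLOTS;
        /* wait for the host to consume the previous entry in this slot */
        while (atomic_load_explicit(&ring[slot], memory_order_acquire) != 0)
            ;
        atomic_store_explicit(&ring[slot], id, memory_order_release);
    }
    return NULL;
}

int main(void)
{
    pthread_t dev;
    pthread_create(&dev, NULL, device_thread, NULL);

    unsigned completed = 0, next = 1;
    while (completed < TOTAL_IOS) {
        unsigned slot = next % RING_SLOTS;
        if (atomic_load_explicit(&ring[slot], memory_order_acquire) == 0)
            continue;                        /* nothing yet: keep polling */
        /* "complete" the I/O; a real driver would invoke a callback here */
        atomic_store_explicit(&ring[slot], 0, memory_order_release);
        completed++;
        next++;
    }

    pthread_join(dev, NULL);
    printf("reaped %u completions by polling, no interrupts involved\n",
           completed);
    return 0;
}
```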
So I can't finish without connectors and form factors,
and it was funny, going through my old archives of old presentations, what I found.
So Jim will recognize this slide about U.2.
So the predecessor, if we go back to 2011
and the SSD form factor work group
that many of you were probably a part of,
we talked about, hey, we need to create
a two and a half inch form factor for enterprise SSDs.
And then it really started to take shape.
We had the capability where you could plug in SAS,
you could plug in SATA, you could plug in a x4 PCIe, or you could do two x2s to get
dual-ported. You could do all sorts of exciting stuff. So that was all the way back, you know,
2013. We as an industry got together in 2010, 2011, and realized it in 2013.
And then I found us inventing M.2. I mean, I've almost tried to put this out of
my memory, but if we look at this, this is a slide from 2011 of the Apple gumstick and mSATA. And
what do we do? These are not great things. So what about an optimized caseless
form factor? That turned into M.2. And M.2 had this ambition of sockets: if you guys remember, there's socket 1, socket 2,
and socket 3, which is storage only. There was a lot of frustration from many people about socket 3,
about creating a by-four, storage-only capability. And we always thought it was
only for the future, for this far-off future, but people started using it right away. And so this all came from client.
So what happened?
We created U.2 and M.2,
and in the data center we have been living with them.
We've lived with U.2, which came from I must have the capability of having hard drives in my system,
and with M.2, a client-derived solution
that we've forced into the data center. And what I wanted
to highlight is, you know, legacy keeps us in a legacy box, those legacy mindsets. One of the
things that's emerging now is how we start to take
essentially the same clean whiteboard that Don shook me back to for NVMe and say, on form factors, we need to
shake ourselves. We can't use form factors
derived from hard drives or from client for the data center transformation we're doing now.
And so what we've been doing over the past two years is putting together something called EDSFF.
And so EDSFF is a scalable form factor for different usages. We have both a short and a long,
and there's both a 1U and a 2U height. And what it
does is let you address another key thing in our data centers: we have a precious number of
PCIe lanes. We can't strand them all on a storage-only connector like M.2 or U.2. We have to
be able to use them for GPUs or for AI accelerators or for other things.
And so that's what EDSFF is giving us, is the ability to put a connector down and then decide later, two years after I've designed my motherboard, what I use it for.
So that's something I encourage you to get more involved in and take a look at, is EDSFF is really that next generation on form factors for where to go.
And I never thought at the beginning of my career I'd ever talk mechanicals.
So with all of that, I just wanted to highlight,
we've delivered this NAND SSD tier.
We have done that together over the past decade.
It's been a lot of work.
It's been connectors.
It's been software.
It's been fabrics.
It's been fundamentally designing a NAND interface for the industry.
But we've done that, and now we have the scale and a multi-billion dollar business that many
of us enjoy for NAND SSDs.
But there's more.
There's future NVM.
So if I go back to IDF in 2011, we were talking about scalability for future NVM. We were dreaming of
the future NVM coming to fruition. And so this was just an example of how do I get as low as I
possibly can? There's a device called Chatham that we built at Intel. It was an FPGA that
simulated the kind of latencies we thought that a 3D crosspoint could achieve, and just how do we
hyper-optimize from the start to hit a million IOPS with very low latency. So then we were looking
at, this is from Flash Memory Summit and Hot Chips back all the way in 2013, and it was just saying,
hey, there's all these different memories that Dave talked much more about than I could ever
talk about. But if you look at it, what this is saying is, how do I fully exploit the next-gen non-volatile
memory? These dark blue and lightish blue portions, all of those are media latency, and they make you
sad. Then you get to the next-generation memory and you're less sad. But then you've got this red. What's that red? That red is the evilness of software. So we had to work on the red, which is what NVMe was attempting
to address, but we really had to go even further. So then in 2016, you know, Intel and
Micron both announced, hey, 3D crosspoint technology, it's awesome, we love it.
And then on to how we reimagine the hierarchy.
So if you look at, you know, we have DRAM, we have NAND SSDs.
3D cross-point is a perfect technology to fit in between.
So why do I care?
This is an SSD picture.
Why do I really care?
Because I can reach a region of operation with 3D cross point and Intel Optane SSDs
that I never could before.
So what this shows is this is an Intel P37, blah, blah, blah.
You know, I'll get in trouble
with the marketing people later.
This is a NAND SSD along this curve,
and it's showing the latency in microseconds at your 99th percentile.
So what it's saying is this is my tail latency,
and if we look at what's our problem, quality of service, right?
Quality of service at scale is what matters to so many people
because they're going for service-level agreements with their customers.
And what this is saying is I've got this tail with NAND, and NAND is just an evil thing because of its programs
and erases, right? It's a beautiful but evil thing. What this is saying is, with 3D crosspoint
I don't have the tail. There's no tail. I can be just about perfectly vertical, because I don't have all
that erase and program activity keeping me from giving the quality of service that I need to give.
So why does that matter?
Let me show a proof point
and talk about where are we going to the future.
So this is an Intel,
and sorry for the horrible font translation.
So first off, if I take a P3700,
so again, a NAND SSD,
and I put an Optane in the box and don't do anything, I get 30% better performance on Cassandra database, which is an important database to many cloud customers.
Then I go optimize with DirectIO and Java optimizations, and I do that software thing that apparently software matters, and I get another 2.5x, right?
So if I really focus on software, I can get so much more at a solution level. And then if I evolve and I go to persistent
memory, I get a 9x, because essentially what persistent memory is about is getting rid of me.
With persistent memory, you don't need NVMe; it just goes away. It's load/store. You don't need NVMe,
we can just go away, or I could go away. So cheer for me going away,
and we really get that 9x performance and really realize the ambition of persistent memory.
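To make that load/store point concrete, here is a minimal sketch using PMDK's libpmem, with the path as a placeholder: map a file on a DAX-capable persistent memory filesystem, store to it directly, and flush for durability, with no block I/O and no NVMe command anywhere in the path. On a machine without real persistent memory the same code still runs; libpmem simply falls back to msync-based flushing.

```c
/* Minimal libpmem (PMDK) sketch of memory-mapped persistence.
 * The path is a placeholder DAX mount. Build with: cc demo.c -lpmem */
#include <libpmem.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *path = "/mnt/pmem0/demo";    /* placeholder */
    size_t mapped_len;
    int is_pmem;

    char *addr = pmem_map_file(path, 4096, PMEM_FILE_CREATE, 0666,
                               &mapped_len, &is_pmem);
    if (addr == NULL) { perror("pmem_map_file"); return 1; }

    /* The "interface" is just the CPU's load/store path. */
    strcpy(addr, "hello, persistent memory");

    /* Flush so the store survives power loss. */
    if (is_pmem)
        pmem_persist(addr, mapped_len);
    else
        pmem_msync(addr, mapped_len);

    printf("wrote %zu-byte mapping (real pmem: %s)\n",
           mapped_len, is_pmem ? "yes" : "no");
    pmem_unmap(addr, mapped_len);
    return 0;
}
```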
And one of the things that I'm excited to say is Intel has actually sold, for revenue,
our first qual sample of our persistent memory DIMM. So despite it always taking a long time,
and if I think about NVMe, that was 2007 to 2014 to ship,
we have a long history of persistent memory coming, but it is here.
And so on that emerging hierarchy, from an Intel perspective we would say we think Optane SSDs and Optane
persistent memory are a great addition to this hierarchy so that you can optimize for that TCO.
And with that, we have so many zettabytes, and we need to make sure that in that hierarchy
we are putting the zettabytes where you need them, based on the workload and the TCO and the value
you deliver to the customers.
And so, we invented this future together.
There are so many people in this room that I worked with so closely over the past 10 years, and this has been us, together, with fits and starts, with really stupid ideas by people like me, like NVMHCI for client,
and then we fix each other and we deliver something great together.
And I really think over the next decade, we have more opportunity with scaling quality of service, with taking advantage
of the new memory and the opportunity that provides. And I'm really excited to work with
all of you on that. So that's my talk. Thanks. I think you guys get a break now, but if you want to, I can answer questions.
We have time for a question or two.
Excellent. Go ahead.
Lucy, can you wait for Lucy?
Oh, there's a mic. You must have a mic.
Great talk.
We're being live streamed.
Do you have an update?
I know that there's been yield issues with 3D Crosspoint,
and the marriage is breaking up between Micron and Intel.
Do you have any update on when?
You have said nothing about price.
All you said was availability about 3D Crosspoint.
Have you ever learned about letting engineers know anything about price?
They intentionally don't tell me the price, so I see the price, you know, just like you do. The only thing I would say about the quote-unquote marriage breaking up
is that I think you just have to look at what is Intel's ambition. My ambition, so I actually
joined the data center group at Intel 18 months ago, and I actually do real data center stuff now,
which is frightening. But we're about how do we optimize Optane
for the data center?
And so I think that, fundamentally,
there's more drama made of this
than there needs to be.
We are laser focused on delivering data-center-optimized performance.
And, this is my perspective, mine only, not Intel's:
that's where I've seen we're going.
And that's where we're going to deliver the most optimal Optane SSD
and Optane persistent memory solution we can
to deliver for the industry
for the data center ambitions we have.
And so we have so much AI,
the amount of data AI needs,
you know, as Dave was talking about,
it just is crazy.
And so we really see we need to optimize for AI and where the data center is going.
All right, one last question.
Right here.
Hi, Amber.
Hi.
NVMe was designed primarily as a block interface, but we see now the emergence of CMB and PMR,
and you mentioned those as part of your talk.
How do you see the emergence of those new byte-level access mechanisms?
Should they be part of NVMe, or should there be a new standard for accessing those types of memories?
I mean, my opinion is that the amount of investment for the industry to truly optimize yet another thing in between NVMe and persistent memory,
I think it's a big lift for a tweener.
That's where controller memory buffer and persistent memory region are a good opportunity: for something that's out over PCIe,
being able to optimize your transfer to it.
But fundamentally, I think you really end up with a persistent memory solution
being what you do, rather than trying to find yet another interface for that tweener. So, yeah.
Thank you guys for your time. I appreciate it. Yeah. Thanks Amber.
Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending
an email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic
further with your peers in the storage developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.