Storage Developer Conference - #79: Various All Flash Solution Architectures
Episode Date: November 12, 2018...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org
slash podcasts. You are listening to SDC Podcast Episode 79. We're going to talk about various
different architectures for all-flash.
So if you've thought about it, you've thought about all-flash,
and you're like, these NVMe things are really cool,
but how do I just set up a system that will actually work?
What kind of architecture?
And you may go, well, how hard is that?
I mean, it's pretty obvious, right?
Well, if you start thinking about it, it's not very obvious.
So how many hardware people do we have?
Hardware people?
I'm going to be interactive.
Okay, the hardware people, I think, are going to really enjoy this because it's going to be very hardware-centric. Okay? Sorry, software guys. But software people, which is probably a lot of
you, I want you to think about what are the aspects of what my software needs to do depending on what that hardware architecture is.
Any marketing people or sales,
those kind of people,
I guess it's a developer's conference
and so they're not invited.
I was going to say, if you're here, get out.
No, I'm kidding.
No, for those people,
what you really want to think about,
and even the hardware and software guys,
think about your customer.
That's going to be the bottom line.
Think about what the customer wants, what the customer is looking for, because what you're going to discover is there's no one clean answer.
And that's the part that's kind of the pitfall.
So what are we going to do for the agenda?
I'm going to talk about some architectural considerations.
What are the things we need to think about? What do you think about? And then we're going to go into some direct
connect designs because direct connect designs are fantastic. You don't need a switch. Everything's
wonderful. Well, kind of. It's got its problems too. And then you say, well, we'll solve that
with a switch. So put a switch in there. Let's talk about that too. And because everybody's got their own
approach on what to do with switches too, and what are the impacts of that switch and how's that
going to change things. And then there's this new concept called a JBOF. You've all heard of
JBODs, right? Just a bunch of drives. Well, a JBOF is just a bunch of flash, same sort of concept. No CPU, no memory, none of that. Direct connect-ish kind of NVMe drives without the CPU.
And so that's what a JBOF is.
And we'll talk about that and how we can use that.
And then go into just one simple high availability solution,
because there's a lot of people out there,
especially Hewlett-Packard people.
They're all about, well, where's your dual port drives?
Where's your high availability?
So we'll talk about that too.
So those are the approaches we're going to take.
So the first thing that you really need to think about when you're talking about NVMe is you're limited on how many PCIe lanes you have.
If you've got an Intel design, you've got 96 lanes.
And we all got excited when AMD came out with 128.
Yes!
Well, okay, that's still not enough.
So AMD's got 128.
But then there's also other limitations.
And this is what we're going to get into when we start looking at some of these architectures.
Because some of them, it's like, it's not an issue of how many lanes you've got.
You flat out don't have any physical space to put some things. And so there are physical
limitations that you've got to take into consideration when you start looking at a design.
And then there's also cable limitations. If you're going to cable anything, that can become a
limitation. You go, what do I need a cable for? Everything's inside. Well, a JBOF has outside
cables, but inside
you almost definitely have a row of fans that go right in the middle of your design.
How do you get the signals from one side to the other? A cable. How does the cable impact things?
So the cable limitations also have a big issue there. And then you've got thermals. Thermals
are a big deal. And that's one of the things I'm really excited about.
If you go to our booth table, you'll see where there are improvements in NVMe storage that are really improving the thermals.
And so that's kind of cool. And then there's this concept of a balanced design. And I'm going to
get into what I mean by a balanced design in just a second. So first of all, the PCIe trade-off. What do I want? I want
lots of network, right? Because I need all of that network capability to get into the box.
But the more network you provide, the less you have PCIe lanes for your NVMe. So NVMe drives, they need PCIe lanes. Network, it needs PCIe lanes.
How do you divide them up? And that really becomes one of the key questions in your design.
How do I divide it up? Because there's aspects that you're trying to do, and undoubtedly,
it's going to be a trade-off. What do I mean by a balanced design? What I mean by a balanced design, there's asymmetric designs like this one over here
where I put all of my NVMe drives off of one CPU
and all of my network off of another.
And you go, why would you do that?
Well, there have been some people that have said,
you know what?
I can handle my RAID software a whole lot easier
if all the drives were just on one CPU.
Wouldn't that
simplify things? So we'll just put them all on one. And if I have network stuff and software that I
want to do with what's coming in off the network, I'll stick them all on one. What a great idea.
And in some cases, depending on your software, and so all you software geeks out there,
you figure out what you like better. But other people say, no,
no, no, no, we need a symmetric design where the data flow comes in from the network, goes right
to that CPU, and goes right to the solid-state drive. Why is that important? Because with an
asymmetric design, your UPI bus becomes really, really critical. Now, granted, Intel's now come
out with three UPI buses, right? So that helps.
That's good.
They're getting faster.
That helps.
But you're going to find that the UPI bus
becomes really, really crucial in your performance
if you have an asymmetric design.
If you don't have an asymmetric design,
you've got a symmetric design,
now the idea is you minimize
what's going to happen on that UPI bus
just so you can avoid any latencies.
So that's what I mean by the difference between asymmetric and symmetric.
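For the software folks, one concrete thing you can do about this on Linux is check which NUMA node each NVMe controller and each NIC actually hangs off of, so you can keep threads and buffers on the same socket and stay off the UPI bus. Here's a minimal sketch, assuming Linux sysfs paths and nothing about any particular box:

    import glob

    def numa_node(path):
        # Each PCIe device exposes its NUMA node in sysfs; -1 means unknown.
        try:
            with open(path) as f:
                return int(f.read().strip())
        except OSError:
            return -1

    # NVMe controllers: /sys/class/nvme/nvme0/device/numa_node, ...
    for path in sorted(glob.glob("/sys/class/nvme/nvme*/device/numa_node")):
        print(f"{path.split('/')[4]}: NUMA node {numa_node(path)}")

    # Physical NICs: /sys/class/net/<ifname>/device/numa_node
    for path in sorted(glob.glob("/sys/class/net/*/device/numa_node")):
        print(f"{path.split('/')[4]}: NUMA node {numa_node(path)}")

If the drives report one node and the NIC reports the other, you're looking at the asymmetric picture above, and every I/O is going to cross UPI.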
So let's go into direct connect. So when we have direct connect systems, and I'm just going
to talk about three of them. The marketing people insisted that I put up pictures. I promise you,
this is not about Supermicro products, but there's some pictures to match up.
There are 1U10 systems, there are 20-drive systems that are direct connect, and multi-node systems that
are direct connect. We're going to go into all of those in detail. Well, not that much detail.
So let's talk about a 1U10. So 1U10 is great because it's only 1U. So it's very small.
It's very neat.
You can fit 10 drives in there.
You can direct connect them because if you direct connect 10 NVMe drives, that's 40 lanes.
Great.
Out of my 96, I just use up 40.
I've got 56 lanes.
Plenty for network, right?
I can just cram out network all day long.
Well, here's the problem.
It's a 1U.
The back of a 1U, the most you're ever going to really get is maybe three PCIe slots.
That's it.
You can't fit anymore.
So you may have all the lanes in the world for your network,
but you can't physically fit more than about 40, maybe.
I mean, really. So that's one of the limitations you're going to start running into when you start
thinking about your 1U space and how you can do a direct connect with a
1U10. Make sense? So we say, okay, let's go to a 1U 20-drive system. And there are 1U 20-drive systems in the world.
Great thing about that. Now, what I've done here, I've done a fantastic job of maximizing out my
PCIe buses to my drives. That means 80 of my 96 lanes will go all the way to drives. Okay?
And that is a great design for one specific kind of
application. If you've got a lot of data analytics, if you're trying to take the data off the drive,
use it to compute and put it back on the drive. This is a fantastic design. There's no switches
involved. It's direct connect. Everything is lovely. And you get some of the most amazing
IOP numbers when you've got a design like this. Fantastic, amazing IOP numbers. Here's the problem.
How do you ever get any data in and out of the box? Because now what you've got is only 16 PCI
lanes left to do your network. That gives you about a five to one ratio of your PCI
lanes to drives versus PCI lanes to network. Now, if you have a heavily IO intensive design that
goes in and out of that box, this is going to be terrible, horrible performance. If you have data
analytics that's inside the box, it's going to run fantastic.
You see the difference? So five to one ratio. So it's like, oh, I don't like that. There's some
people that go, oh, that'd be terrible because they're only looking at the data from the outside,
not necessarily from the inside. So you have to think about where is your data going to be?
And then once again, with a 1U, you have limited I/O in terms of how much slot space you're going to have. But that isn't as big of a deal with this one because you've only got two slots you can even use anyway.
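Just to make that arithmetic concrete, here's a rough sketch of the lane budget for the two direct-connect boxes, using the numbers from above (96 usable lanes, x4 per NVMe drive):

    # Lane budget for a direct-connect design: whatever the drives take,
    # the network gets what's left (physical slot space permitting).
    TOTAL_LANES = 96
    LANES_PER_DRIVE = 4

    def budget(drives):
        drive_lanes = drives * LANES_PER_DRIVE
        return drive_lanes, TOTAL_LANES - drive_lanes

    for drives in (10, 20):
        d, n = budget(drives)
        print(f"{drives} drives: {d} lanes to drives, {n} lanes left for network")
    # 10 drives: 40 lanes to drives, 56 left -> plenty of network, not enough slots
    # 20 drives: 80 lanes to drives, 16 left -> the five-to-one ratio above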
So let's talk about multi-node.
What I mean by a multi-node system,
this is where you take a 2U
and you've got four nodes in the 2U.
So four individual nodes, compute nodes, but then they connect to 24 drives in the front.
And how that works out is you divide up those nodes such that each node connects to six drives.
Now, there are people in this world, and I was one of them when I was working for a previous company,
where the big concern we had was if I've got, say, that 20-drive system, what if the node goes down?
I just lost 20 drives. 20 drives. What is known as the blast radius is big. I just lost 20 drives.
How do I recover from losing 20 drives?
That's huge.
It'll take me forever to rebuild that, right?
You can have all the erasure coding in the world.
It's still going to take forever
to recover from 20 drive loss.
So in this kind of a situation,
if you've got a customer that says,
I care about my blast radius,
this is where multi-nodes are really, really great.
Why?
Because they only have six drives.
There's only six drives connected to each node.
You still have a 2U with 24 drives,
but your blast radius just shrunk down to only six
if I lose a node.
So that's where it has a big advantage.
Each node has six, so that means 24 lanes of PCIe.
That leaves you 32 lanes for slots and 16 lanes for your LOM.
So now I can have a lot of bandwidth going out.
And now think about that ratio.
Now I actually have more bandwidth out in my network than I do have my drives.
But there's still a physical limitation out the back because now each node
is pretty small, right? So you've got to worry about the physics of that. And that's a problem
that Supermicro solved in a really, really cool way. But the great advantage of this is that low
blast radius because now I'm not going to be so impacted. So it's rather balanced. And then we're going
to come back to this one in a minute because there's a cool thing you can do with this multi-node
and another system we were looking at.
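To put rough numbers on the blast-radius point, a quick sketch; the per-drive capacity is just a made-up example:

    # Blast radius: how much goes offline when one node fails.
    DRIVE_TB = 8   # hypothetical capacity per drive

    configs = {
        "single-node, 20 drives behind one CPU": 20,
        "2U 4-node, 6 drives per node": 6,
    }
    for name, drives_per_node in configs.items():
        print(f"{name}: lose a node -> {drives_per_node} drives, "
              f"~{drives_per_node * DRIVE_TB} TB to rebuild")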
So how about switch solutions?
This solves a lot of problems.
I just need more PCIe lanes, right?
That's the problem.
I don't have enough.
So what do I do?
I'll throw in a switch.
And when I throw in a switch,
that'll allow me to break out to a lot more drives.
And that'll give me a lot more capability
so I can do a lot more things.
We're going to talk about three of those.
The classic one.
The classic one that everybody and their dog has
is the 2U24.
Because everybody's designed a 2U24 system,
and now they're all saying,
gee, I can convert that to an NVMe system, right?
So what do you do when you do that?
When you do that, you've got these CPUs here, okay?
They go down to PCIe switches and pick your switch.
There's a couple vendors.
I will remain nameless.
And then we go down to our drives.
You'll notice in this case, it's very symmetric.
So I've got 12 drives here and 12 drives here to each CPU. I'm going through a switch and then I've got 16
lanes going off to the switch and I've got 32 lanes going off to my LOM and slots. Okay. What
does that give me? That gives me a one to two ratio on my SSDs. So now I have plenty of network. I've got twice as much network as I
have drive. Okay. And I've got, now we've got to talk about another concept, which is your switch
ratio. How many lanes down versus how many lanes up? That's your switch ratio. Why do you care?
Well, you care because especially if you're doing sequential reads, just constant
read, what's going to happen? This switch is going to become a bottleneck. You're still
only going to have the bandwidth of this x16. So in a Gen 3 world, that's going to be maxed
out at 16 gigabytes per second. And in reality, you're not going to get 16 gigabytes. You're going to get less than that.
So maybe 14-ish.
13, actually, from what we've seen.
So you get about 13 gigabytes here.
You've got all this bandwidth here,
but you're bottlenecked here.
You're sucking through a little coffee straw, okay?
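Here's that coffee-straw math spelled out, treating a Gen3 lane as roughly 1 GB/s of usable bandwidth; the ~13 GB/s is the observed figure quoted above, not a spec number:

    # Oversubscription through one PCIe switch in the 2U24:
    # 12 x4 drives behind an x16 uplink.
    GBPS_PER_GEN3_LANE = 1.0          # approximate usable GB/s per lane
    drive_lanes = 12 * 4              # 48 lanes of drives behind the switch
    uplink_lanes = 16

    print(f"switch ratio: {drive_lanes // uplink_lanes}:1")
    print(f"drive side: ~{drive_lanes * GBPS_PER_GEN3_LANE:.0f} GB/s, "
          f"uplink: ~{uplink_lanes * GBPS_PER_GEN3_LANE:.0f} GB/s theoretical, ~13 GB/s observed")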
Now, there is one thing about NVMe that's kind of nice. You can
post a write and back off and free up your bus. That's really good because now I can get that
write out of there. If you're mixing reads and writes, I can quickly dump the write, come back
and do more reads. So the ratio, it'll help with that. But with the three to one ratio, you're
going to see the bottleneck here. So that's where
you have a pitfall here. So here, now we have lots of bandwidth. We solved our problem unlike our
direct connect where we didn't, where we had a lot more drive bandwidth than we had network. Now we
have lots more network than we have drives. Two to one or one to two, depending on how you look at it,
and then the ratio there. Okay? So people say, well,
how do we solve that problem? Well, let's talk about another solution. What about a 1U32?
And you go, you can't do that. Yes, you can. You can do this with U.2s. You can do this with the
new ruler form factor. You can do this with the new NF1 form factor.
You can do this with what's coming very soon,
EDSFF form factors.
So you can have 32 drives in a 1U.
Well, how do you do that?
This is one way you can do that.
Take 32 lanes off your CPU
and bring it down to a switch.
Then branch that off into 64 lanes
that come off of here off to...
That is a typo.
Doggone it.
Hold that thought.
I copied and pasted from the other one.
Don't you hate it?
I've looked at these slides a number of times.
And then you go and you go,
wait a minute, that ain't right.
There, to those 16 drives,
because we've got to add up to 32, right?
So we've got 16 drives here and 16 drives there.
That's going to take 64 lanes.
What you find is now we only have a two to one ratio on our switch.
This is really nice. Now we're getting much closer to that nice ratio of NVMe with a two to one ratio. So it's not quite like sucking through a tiny coffee straw. We've got a much bigger straw
to suck through. So we only have a two to one ratio. So this is a nice ratio to have, I think, with PCIe. So two to one there. And now
our drives, though, are two to one on the network, too, because now we've got an x16 here,
an x32 there. So now we just flipped it the other way. Instead of one to two, we just became two
to one. So now my drives have more bandwidth than my network.
So we just flip the equation on the architecture.
Once again, anytime you have more compute going on internally,
you want more PCIe to your drives.
Anytime you're more interested in getting data in and out of the box,
focus on the network.
But you can't take things to extremes.
Well, you can.
We've seen some extreme cases.
So that's the way a 1U32 works in your networks and drives.
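Same arithmetic for the 1U32, per CPU, with the lane counts described above:

    # 1U32, per CPU: x32 down to the switch, 16 x4 drives behind it, x16 for network.
    cpu_to_switch = 32
    behind_switch = 16 * 4            # 64 lanes of drives
    network = 16

    print(f"switch ratio: {behind_switch // cpu_to_switch}:1")          # 2:1
    print(f"drive lanes : network lanes = {cpu_to_switch}:{network}")   # 2:1, flipped vs the 2U24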
Next one, people said,
no, no, no, no, no, you've missed the point.
We should not have one to two.
We should not have two to one.
We should have one to one.
One to one is obviously the best answer.
So here's an architecture where it's a 1U36.
And you can do this with NVMe, I mean, with NF1.
I've seen an NF1 system where they actually got a one to one ratio.
So what did they do?
They took 24 lanes for LOMs and slots, okay, 24 lanes
over here, very symmetric design, so they went 24 up and 24 down. If I've got 48 lanes, just split
them in half, right? That sounds like a great solution. So I've got 24 lanes up, 24 lanes down,
and when I do that in this 1U36, what I end up with is 72 lanes here and a 3-to-1 ratio.
So what did I just sacrifice?
Yes, I have a perfectly balanced network to drive off of my CPUs.
But I'm back to sucking through a tiny coffee straw again with my switches.
So there's always some place where you have to
sacrifice. And that's the crappy part of architecture because there's no perfect answer
because we don't have all the PCIe lanes in the world. I was thinking about this this morning in the
shower. I want 512 PCIe lanes. That's what I want. And some way to get all that out of a 1U, which is impossible,
at least today. So, this is the way that the system sets up in a... Doggone it.
There it is again. What is with this? I keep screwing up. I obviously copied and pasted and did it
wrong. Wait, wait for it. I hope they let me distribute new slides. There's 36 instead of 24.
I'm surprised you guys aren't all over me going, Mike, your drives are wrong. So now all of a sudden this makes more sense.
So this solves the question of I want perfect balance,
but it doesn't solve the question of now I've got a bottleneck through my switch.
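Putting the three switched designs side by side, per CPU socket, with the lane counts as described; the 18-drives-per-switch figure for the 1U36 assumes the 36 drives split evenly across two switches:

    # Two ratios matter: lanes behind the switch vs lanes up to the CPU,
    # and drive lanes vs network lanes at the CPU.
    designs = {
        "2U24": {"to_switch": 16, "behind_switch": 12 * 4, "network": 32},
        "1U32": {"to_switch": 32, "behind_switch": 16 * 4, "network": 16},
        "1U36": {"to_switch": 24, "behind_switch": 18 * 4, "network": 24},
    }
    for name, d in designs.items():
        print(f"{name}: switch {d['behind_switch'] / d['to_switch']:.0f}:1, "
              f"drives:network {d['to_switch'] / d['network']:.2f}:1")
    # 2U24: switch 3:1, drives:network 0.50:1 -> lots of network, choked switch
    # 1U32: switch 2:1, drives:network 2.00:1 -> more drive bandwidth than network
    # 1U36: switch 3:1, drives:network 1.00:1 -> balanced at the CPU, choked switch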
So there's another approach to take.
Why do we even need a processor?
We've got NVMe over Fabrics, right?
What's the purpose of NVMe over Fabrics?
Avoid the processor.
Go around it.
Use RDMA to go around the processor.
So why do you need a processor?
Just get the stupid processor out of it.
Take away the processor.
Take away the memory.
I just get rid of that whole problem.
That's a JBOF.
So how does a JBOF work?
A JBOF, you take, you don't have any processors at all,
but you've got PCIe switches.
Now, this first row of switches, what we did,
so this is a design of JBOF, is we did a one-to-one ratio.
So I've got 32 lanes that can go off to servers
and another 32 lanes that are going off to servers. Once again,
now it's servers. We're not going off to network because our network is the PCIe bus.
So I can go off to multiple servers here and multiple servers here, but I want to make sure
that the servers on both sides have access to all of the drives. So to do that, I use this PCIe
switch just to crisscross for me. So I can crisscross
with the PCI switch so that the servers over here have access to these drives and to these drives.
And then I put another set of PCI switch. So there's two layers of PCI switches,
and this one has a two-to-one ratio. So overall, you get a two-to-one ratio from the outside of
your box all the way to the inside of the box.
That's really nice because I reduced it down to that nice two to one ratio.
The problem you're going to see, obviously, is wait a minute.
There's 150 nanoseconds of latency through this switch and through this switch.
So I just added latency.
No matter what, there's always going to be a problem. So I add latency here, but the beautiful thing about this is now I have 64 lanes of PCIe coming out of the box.
That, in theoretical terms, is 64 gigabytes per second.
In reality, we've seen 52 gigabytes per second of real raw data out of a box.
That's kind of unprecedented. You don't get
that out of the box because who's got that many lanes coming out of the box? That's the key. So
this is where JBOFs work out really, really well. If you want high bandwidth to multiple servers,
why do you want to do that? Because one of the things that's happened in this world, and if you've been to enough of these conferences, you've heard about it. NVMe drives are so much
more powerful than spinning rust that you now have a situation where you have more capability
in the drive than you have in any one node. If you've got 32 drives out there,
I've heard lots of people say there's no two-socket processor
that can possibly keep them all busy.
So share them.
Send them out to multiple servers
and share that resource.
If you're into disaggregation,
this is a great solution.
So RackScale Design loves this solution
with the new RackScale Design 2.1,
because now I can disaggregate all of that. I don't have to associate my compute to my drives.
I can keep them disaggregated. So that's another advantage of a JBOF solution.
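A quick sanity check on those JBOF numbers, again treating a Gen3 lane as roughly 1 GB/s usable; the even four-way x16 split at the end is just an example of how you might carve it up, not part of the design above:

    # JBOF front side: 64 external Gen3 lanes shared by the attached hosts.
    external_lanes = 64
    theoretical = external_lanes * 1.0        # ~64 GB/s
    measured = 52                             # GB/s, the figure quoted above
    print(f"~{theoretical:.0f} GB/s theoretical, ~{measured} GB/s measured "
          f"({measured / theoretical:.0%})")

    # Example carve-up: four hosts at x16 each would see ~16 GB/s apiece
    # before contention, with two switch hops (~150 ns each) added per I/O.
    print(f"x16 per host -> ~{16 * 1.0:.0f} GB/s per host")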
Now, one thing that I love to do, just to really mix it up, is what if you took a multi-node and a JBOF and put them together?
What do you get? Well, you've got a four-node in a 2U. You've got a JBOF that's a 1U. So in 3U,
I can get 56 drives. So now I've got a whole lot of capability in the drives.
The network kind of looks like this. I end up with an x16 LOM over here and an
x16 slot over here, so I can get 200 gigabits coming off of my CPUs. And then, if you remember,
each one of the nodes has six drives directly connected, and now each one of the nodes can have an x16 going
off to the JBOF, if you just divide them up evenly, depending on how you want to divide them.
So now I can take 56 total drives.
I've got 40 lanes for SSDs.
I've got 16 lanes for the slot, 16 lanes for the LAN.
You start looking at that.
Now my ratio here is actually pretty close,
32 to 40, real close.
And I've got 800 gigabits of Ethernet
that can go out of the box.
So now not only do I have a nice little ratio,
I've got the ability to have an enormous amount of bandwidth coming in and out
of a four-node server system that I'm distributing out the drives.
So this is combining the multi-node with the JBOF.
And that comes out with kind of a really slick solution. And it's very symmetric. Now, you hear me talk about symmetry. One of the things we learned
at Samsung, at Samsung I designed, I think one of the first, it was a 40 drive system. It was just
a proof of concept, and designed that out, but I designed it such that we could do a symmetric
design or an asymmetric design. We could do both.
I cabled it such that I could do both.
And one of the things that we discovered in that, yes, there are advantages if your software really wants all of the drives off of one CPU for RAID.
But if it's just performance, a symmetric design is so much better.
If you can avoid that UPI bus, it really, really helps. So this is one of the areas
where you can take a multi-node,
which is notoriously not symmetric,
and turn it into a symmetric design
by adding a JBOF.
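Per node, the combined multi-node-plus-JBOF budget works out like this, with the lane counts as described above:

    # Per node: six direct drives plus an x16 share of the JBOF,
    # against an x16 LOM and an x16 slot for the network.
    direct_drive_lanes = 6 * 4        # 24
    jbof_lanes = 16
    ssd_lanes = direct_drive_lanes + jbof_lanes     # 40
    network_lanes = 16 + 16                         # 32

    print(f"per node: {ssd_lanes} SSD lanes vs {network_lanes} network lanes")
    print(f"chassis aggregate network: ~{4 * 200} Gb/s of Ethernet")   # 4 nodes x ~200 Gb/s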
Okay, last but not least,
and I have no idea how I'm doing on time.
How's my timekeeper?
Good?
I've only spent 25 minutes.
Actually, I'm going too fast.
I should slow down.
2U24, high availability.
How do you do that?
Well, what we do in that, you've got two nodes,
so you've got two CPUs, one in each of the two nodes.
In this case, we take x16 slots,
send it off to a PCIe switch
that's got a three-to-one ratio so that I can do dual-ported drives. So in this case,
I've got a drive that's x4, but each port is x2. So I cut my bandwidth in half to each port,
but it allows me to connect to two different nodes. And that's the way dual-ported
drives work. So now what I can do is I can take, on these 24 drives, 48 lanes
over here, 48 lanes over here, combine it with my x16, put a three-to-one ratio, connect it up to
that CPU, and now I've got 48 lanes of PCIe going out the back.
Okay?
And this gives you a dual-ported solution with a standard classic 2U24.
Pictures.
Stealing it.
Now, the one thing you can do,
and I'll leave this as an exercise to the audience,
you can do one more thing that'll take this into a 2U48,
which some people have done, but I'm not supposed
to talk about products, so I didn't. But you could also move this and add another layer that would
then be a 2U48 using the same sort of architecture. So now you get a three to one ratio on the
switches and you get a one to three ratio on your SSDs to network. Okay?
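And the dual-port lane math for that HA box, spelled out per side:

    # Dual-ported 2U24: each x4 drive splits into two x2 ports,
    # one to each node's switch, behind an x16 uplink per side.
    drives = 24
    lanes_per_port = 2
    drive_lanes_per_side = drives * lanes_per_port   # 48 into each switch
    uplink = 16

    print(f"switch ratio per side: {drive_lanes_per_side // uplink}:1")   # 3:1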
So the bottom line, what's the right one?
I mean, I just showed you a whole bunch of different architectures.
Which one is correct?
Because we would all love to design them all, but we only have enough resources to design one system,
so what system should we design?
Well, there is no right answer. The right answer is the
customer. Who is your customer and what do they care about? Do they care about blast radius? Then
don't sell them a 1U32 server. Oh my gosh, now you've got 32 drives that are all going to die,
and when they all die, when you lose that node, they're going to go crazy. That's not the solution,
right?
If they care about blast radius, you want to minimize that.
Flip side is, if they care about capacity and some great bandwidth, you want to go to that 1U32.
So what's the right answer?
The answer is it really, really depends on your configuration.
It really, really depends on your customer application.
And what's their usage model?
What are they doing?
What's important to them? Do they care about analytics and data and computation that's going to be inside? Or do I
just say, I want to use this as a storage application and I just want an enormous amount
of bandwidth into and out of that box? It totally depends on what they want. The answer, of course, is design them all, right?
The nice thing is, and I'm not supposed to do this, so don't tell anybody I said this. At Supermicro,
we have almost every one of those. There is one on there that we actually don't have,
and I'll let you go to our website and figure out which one it is. But what we've done is we've just
designed them all because we know that we
have customers that want each and every one of those different kinds of solutions. And then we
still have customers that come to us and come up with some bizarre idea that's like, oh no,
that doesn't work. We need a completely different thing. And it's like, are you kidding?
We have all that. Anyhow, so there is no one answer. It depends on your application. It depends on what your software is going to do.
Think about write amplification.
That's another aspect.
If you've got heavy write amplification that's going,
if you've got a server here and you've got write amplification
to each one of the storage systems,
then you care enormously about the bandwidth
going in and out the network, right?
But if I've got software
where the write amplification is internal,
in other words, I write to the device
and then I amplify out,
now all of a sudden I care much more
about how many PCI lanes I've got going to my drives
and less about what I've got going to the network.
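A toy way to see that, with made-up numbers, say 10 GB/s of client writes and a 3x amplification factor:

    # Where the write amplification happens decides which pipe has to carry it.
    client_writes = 10     # GB/s of writes arriving from clients (example)
    amplification = 3      # replication / erasure-coding factor (example)

    # Amplified before it reaches the box (e.g., client-side replication):
    print(f"outside: network {client_writes * amplification} GB/s, "
          f"drives {client_writes * amplification} GB/s")

    # Amplified inside the box (write once over the wire, fan out internally):
    print(f"inside:  network {client_writes} GB/s, "
          f"drives {client_writes * amplification} GB/s")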
That's why, in some ways, like I designed one system,
it's got a two-to-one ratio.
I kind of like that because from my standpoint,
if I'm writing the data in and then amplify it out internally,
well, then that's a great solution.
But someone pointed out to me,
yeah, but what if I do write amplification outside the box?
Well, then your solution sucks.
And it's like, okay, good point.
And so you really have to think about what your architecture is.
And for all you software guys, how is that data going to be distributed?
How is it going to be analyzed?
How is it going to be used?
That's what's key. Thanks for listening. If you have questions about the material presented in
this podcast, be sure and join our developers mailing list by sending an email to
developers-subscribe@snia.org. Here you can ask questions and discuss this topic further
with your peers in the Storage Developer community.
For additional information about the Storage Developer Conference,
visit www.storagedeveloper.org.