Storage Developer Conference - #79: Various All Flash Solution Architectures

Episode Date: November 12, 2018

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast Episode 79. We're going to talk about various different architectures for all-flash. So if you've thought about it, you've thought about all-flash, and you're like, these NVMe things are really cool,
Starting point is 00:00:50 but how do I just set up a system that will actually work? What kind of architecture? And you may go, well, how hard is that? I mean, it's pretty obvious, right? Well, if you start thinking about it, it's not very obvious. So how many hardware people we have? Hardware people? I'm going to be interactive.
Starting point is 00:01:10 Okay, the hardware people, I think, are going to really enjoy this because it's going to be very hardware-centric. Okay? Sorry, software guys. But software people, which is probably a lot of you, I want you to think about what are the aspects of what my software needs to do depending on what that hardware architecture is. Any marketing people or sales, those kind of people, I guess it's a developer's conference and so they're not invited. I was going to say, if you're here, get out. No, I'm kidding.
Starting point is 00:01:39 No, for those people, what you really want to think about, and even the hardware and software guys, think about your customer. That's going to be the bottom line. Think about what the customer wants, what the customer is looking for, because what you're going to discover is there's no one clean answer. And that's the part that's kind of the pitfall. So what are we going to do for the agenda?
Starting point is 00:02:01 I'm going to talk about some architectural considerations. What are the things we need to think about? What do you think about? And then we're going to go into some direct connect designs because direct connect designs are fantastic. You don't need a switch. Everything's wonderful. Well, kind of. It's got its problems too. And then you say, well, we'll solve that with a switch. So put a switch in there. Let's talk about that too. And because everybody's got their own approach on what to do with switches too, and what are the impacts of that switch and how's that going to affect things. And then there's this new concept called a JBOF. You've all heard of JBODs, right? Just a bunch of drives. Well, a JBOF is just a bunch of flash, same sort of concept. No CPU, no memory, none of that. Direct connect-ish kind of NVMe drives without the CPU.
Starting point is 00:02:51 And so that's what a JBOF is. And we'll talk about that and how we can use that. And then go into just one simple high availability solution, because there's a lot of people out there, especially Hewlett-Packard people. They're all about, well, where's your dual port drives? Where's your high availability? So we'll talk about that too.
Starting point is 00:03:10 So those are the approaches we're going to take. So the first thing that you really need to think about when you're talking about NVMe is you're limited on how many PCIe lanes you have. If you've got an Intel design, you've got 96 lanes. And we all got excited when AMD came out with 128. Yes! Well, okay, that's still not enough. So AMD's got 128. But then there's also other limitations.
Starting point is 00:03:37 And this is what we're going to get into when we start looking at some of these architectures. Because some of them, it's like, it's not an issue of how many lanes you've got. You flat out don't have any physical space to put some things. And so there are physical limitations that you've got to take into consideration when you start looking at a design. And then there's also cable limitations. If you're going to cable anything, that can become a limitation. You go, what do I need a cable for? Everything's inside. Well, a JBOF has outside cables, but inside you almost definitely have a row of fans that go right in the middle of your design.
Starting point is 00:04:10 How do you get the signals from one side to the other? A cable. How does the cable impact things? So the cable limitations also have a big issue there. And then you've got thermals. Thermals are a big deal. And that's one of the things where I'm really excited to see. And you can see at our booth, that booth table, you'll see where there's improvements in NVMe storage that are really improving the thermals. And so that's kind of cool. And then there's this concept of a balanced design. And I'm going to get into what I mean by a balanced design in just a second. So first of all, the PCIe trade-off. What do I want? I want lots of network, right? Because I need all of that network capability to get into the box.
Starting point is 00:04:56 But the more network you provide, the fewer PCIe lanes you have left for your NVMe. So NVMe drives, they need PCIe lanes. Network, it needs PCIe lanes. How do you divide them up? And that really becomes one of the key questions in your design. How do I divide it up? Because there's aspects that you're trying to do, and undoubtedly, it's going to be a trade-off. What do I mean by a balanced design? What I mean by a balanced design, there's asymmetric designs like this one over here where I put all of my NVMe drives off of one CPU and all of my network off of another. And you go, why would you do that? Well, there have been some people that have said,
Starting point is 00:05:38 you know what? I can handle my RAID software a whole lot easier if all the drives were just on one CPU. Wouldn't that simplify things? So we'll just put them all on one. And if I have network stuff and software that I want to do with what's coming in off the network, I'll stick them all on one. What a great idea. And in some cases, depending on your software, and so all you software geeks out there, you figure out what you like better. But other people say, no,
Starting point is 00:06:05 no, no, no, we need a symmetric design where the data flow comes in from the network, goes right to that CPU, and goes right to the solid-state drive. Why is that important? Because with an asymmetric design, your UPI bus becomes really, really critical. Now, granted, Intel's now come out with three UPI buses, right? So that helps. That's good. They're getting faster. That helps. But you're going to find that the UPI bus
Starting point is 00:06:30 becomes really, really crucial in your performance if you have an asymmetric design. If you don't have an asymmetric design, you've got a symmetric design, now the idea is you minimize what's going to happen on that UPI bus just so you can avoid any latencies. So that's what I mean by difference between asymmetric and symmetric.
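To make the symmetric-versus-asymmetric point concrete, here is a minimal back-of-the-envelope sketch in Python: in an asymmetric layout every byte moving between the NICs and the drives crosses the socket interconnect, while a symmetric layout keeps the data path NUMA-local. The lane counts and the per-lane and per-UPI-link throughput figures are illustrative assumptions, not numbers from the talk.

```python
# Rough model of how much traffic must cross the socket interconnect (UPI).
# Per-lane and per-link figures below are assumptions for illustration.

GEN3_GBPS_PER_LANE = 1.0      # ~1 GB/s usable per PCIe Gen3 lane (rule of thumb)
UPI_GBPS_PER_LINK = 20.0      # rough usable bandwidth per UPI link (assumed)

def upi_traffic_gbps(drive_lanes, nic_lanes, symmetric):
    """Traffic (GB/s) that must cross UPI when streaming between NICs and drives."""
    io_gbps = min(drive_lanes, nic_lanes) * GEN3_GBPS_PER_LANE
    if symmetric:
        # NICs and their drives share a socket, so the data path stays local.
        return 0.0
    # Asymmetric: drives on one socket, NICs on the other, so every byte
    # moved between them crosses the socket interconnect once.
    return io_gbps

for symmetric in (True, False):
    gbps = upi_traffic_gbps(drive_lanes=40, nic_lanes=32, symmetric=symmetric)
    print(f"symmetric={symmetric}: ~{gbps:.0f} GB/s over UPI "
          f"(~{gbps / UPI_GBPS_PER_LINK:.1f} links' worth)")
```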
Starting point is 00:06:50 So let's go into direct connect. So when we have direct connect systems, I'm just going to talk about three of them. The marketing people insisted that I put up pictures. I promise you, this is not about Supermicro products, but there's some pictures to match up. There's 1U10 systems, there's 20-drive systems that are direct connect, and multi-node systems that are direct connect. We're going to go into all of those in detail. Well, not that much detail. So let's talk about a 1U10. So a 1U10 is great because it's only 1U. So it's very small. It's very neat. You can fit 10 drives in there.
Starting point is 00:07:28 You can direct connect them because if you direct connect 10 NVMe drives, that's 40 lanes. Great. Out of my 96, I just use up 40. I've got 56 lanes. Plenty for network, right? I can just cram out network all day long. Well, here's the problem. It's a 1U.
Starting point is 00:07:46 The back of a 1U, the most you're ever going to really get is maybe three PCIe slots. That's it. You can't fit any more. So you may have all the lanes in the world for your network, but you can't physically fit more than about 40, maybe. I mean, really. So that's one of the limitations you're going to start running into when you start thinking about your 1U space and how you can do a direct connect with a 1U10. Make sense? So we say, okay, let's go to a 1U or 2U 20-drive system. And there are 1U20 drive systems in the world.
Starting point is 00:08:29 Great thing about that. Now, what I've done here, I've done a fantastic job of maximizing my PCIe buses to my drives. That means 80 of my 96 lanes will go all the way to drives. Okay? And that is a great design for one specific kind of application. If you've got a lot of data analytics, if you're trying to take the data off the drive, use it to compute and put it back on the drive. This is a fantastic design. There's no switches involved. It's direct connect. Everything is lovely. And you get some of the most amazing IOP numbers when you've got a design like this. Fantastic, amazing IOP numbers. Here's the problem. How do you ever get any data in and out of the box? Because now what you've got is only 16 PCI
Starting point is 00:09:19 lanes left to do your network. That gives you about a five to one ratio of your PCI lanes to drives versus PCI lanes to network. Now, if you have a heavily IO intensive design that goes in and out of that box, this is going to be terrible, horrible performance. If you have data analytics that's inside the box, it's going to run fantastic. You see the difference? So five to one ratio. So it's like, oh, I don't like that. There's some people that go, oh, that'd be terrible because they're only looking at the data from the outside, not necessarily from the inside. So you have to think about where is your data going to be? And then once again, with a 1U, you have limited IO.
Starting point is 00:10:07 So of how much space you're going to have. But that isn't as big of a deal than this one because you've only got two slots you can even do anyway. So let's talk about multi-node. What I mean by a multi-node system, this is where you take a 2U and you've got four nodes in the 2U. So four individual nodes, compute nodes, but then they connect to 24 drives in the front.
Starting point is 00:10:31 And how that works out is you divide up those nodes such that each node connects to six drives. Now, there are people in this world, and I was one of them when I was working for a previous company, where the big concern we had was, if I've got, say, that 20-drive system, what if the node goes down? I just lost 20 drives. 20 drives. What is known as the blast radius is big. I just lost 20 drives. How do I recover from losing 20 drives? That's huge. It'll take me forever to rebuild that, right? You can have all the erasure coding in the world.
Starting point is 00:11:12 It's still going to take forever to recover from 20 drive loss. So in this kind of a situation, if you've got a customer that says, I care about my blast radius, this is where multi-nodes are really, really great. Why? Because they only have six drives.
Starting point is 00:11:27 There's only six drives connected to each node. You still have a 2U with 24 drives, but your blast radius just shrunk down to only six if I lose a node. So that's where it has a big advantage. Each node has six, so that means 24 lanes of PCIe. That leaves you 32 lanes for slots and 16 lanes for your loan. So now I can have a lot of bandwidth going out.
Starting point is 00:11:53 And now think about that ratio. Now I actually have more bandwidth out in my network than I do have my drives. But there's still a physical limitation out the back because now each node is pretty small, right? So you've got to worry about the physics of that. And that's a problem that Supermicro solved in a really, really cool way. But the great advantage of this is that low blast radius because now I'm not going to be so impacted. So rather balanced. And then we're going to come back to this one in a minute because there's a cool thing you can do with this multi-node and another system we were looking at.
Starting point is 00:12:28 So how about switch solutions? This solves a lot of problems. I just need more PCIe lanes, right? That's the problem. I don't have enough. So what do I do? I'll throw in a switch. And when I throw in a switch,
Starting point is 00:12:42 that'll allow me to break out to a lot more drives. And that'll give me a lot more capability so I can do a lot more things. We're going to talk about three of those. The classic one. The classic one that everybody and their dog has is the 2U24. Because everybody's designed a 2U24 system,
Starting point is 00:13:02 and now they're all saying, gee, I can convert that to an NVMe system, right? So what do you do when you do that? When you do that, you've got these CPUs here, okay? They go down to PCIe switches, and pick your switch. There's a couple of vendors, who will remain nameless. And then we go down to our drives.
Starting point is 00:13:21 And if you've got, you'll notice in this case, it's very symmetric. So I've got 12 drives here and 12 drives here to each CPU. I'm going through a switch and then I've got 16 lanes going off to the switch and I've got 32 lanes going off to my LOM and slots. Okay. What does that give me? That gives me a one to two ratio on my SSDs. So now I have plenty of network. I've got twice as much network as I have drive. Okay. And I've got, now we've got to talk about another concept, which is your switch ratio. How many lanes down versus how many lanes up? That's your switch ratio. Why do you care? Well, you care because especially if you're doing sequential reads, just constant read, what's going to happen? This switch is going to become a bottleneck. You're still
Starting point is 00:14:13 only going to have the bandwidth of this x16. So in a Gen 3 world, that's going to be maxed out at 16 gigabytes per second. And in reality, you're not going to get 16 gigabytes. You're going to get less than that. So maybe 14-ish. 13, actually, from what we've seen. So you get about 13 gigabytes here. You've got all this bandwidth here, but you're bottlenecked here. You're sucking through a little coffee straw, okay?
Starting point is 00:14:42 Now, there is one thing about NVMe that's kind of nice. You can post a write and back off and free up your bus. That's really good because now I can get that write out of there. If you're mixing reads and writes, I can quickly dump the write, come back and do more reads. So the ratio, it'll help with that. But with the three to one ratio, you're going to see the bottleneck here. So that's where you have a pitfall here. So here, now we have lots of bandwidth. We solved our problem unlike our direct connect where we didn't, where we had a lot more drive bandwidth than we had network. Now we have lots more network than we have drives. Two to one or one to two, depending on how you look at it,
Starting point is 00:15:23 and then the ratio there. Okay? So people say, well, how do we solve that problem? Well, let's talk about another solution. What happens about a 1U32? And you go, you can't do that. Yes, you can. You can do this with U.2s. You can do this with the new ruler form factor. You can do this with the new NF1 form factor. You can do this with the new ruler form factor. You can do this with the new NF1 form factor. You can do this with what's coming very soon, EDSFF form factors. So you can have 32 drives in a 1U. Well, how do you do that?
Starting point is 00:15:55 This is one way you can do that. Take 32 lanes off your CPU and bring it down to a switch. Then branch that off into 64 lanes that come off of here off to... That is a typo. Doggone it. Hold that thought.
Starting point is 00:16:17 I copied and pasted from the other one. Don't you hate it? I've looked at these slides a number of times. And then you go and you go, wait a minute, that ain't right. There, to those 16 drives, because we've got to add up to 32, right? So we've got 16 drives here and 16 drives there.
Starting point is 00:16:41 That's going to take 64 lanes. What you find is now we only have a two to one ratio on our switch. This is really nice. Now we're getting much closer to that nice ratio of NVMe with a two to one ratio. So it's not quite like sucking through a tiny coffee straw. We've got a much bigger straw to suck through. So we only have a two to one ratio. So this is a nice ratio to have, I think, with PCIe. So two to one there. And now our drives, though, are two to one on the network, too, because now we've got a by 16 here, a 32 here. So now we just flipped it the other way. Instead of one to two, we just became two to one. So now my drives have more bandwidth than my network. So we just flip the equation on the architecture.
Starting point is 00:17:33 Once again, anytime you have more compute going on internally, you want more PCIe to your drives. Anytime you're more interested in getting data in and out of the box, focus on the network. But you can't take things to extremes. Well, you can. We've seen some extreme cases. So that's the way a 1U32 works in your networks and drives.
Starting point is 00:17:55 Next one, people said, no, no, no, no, no, you've missed the point. We should not have one to two. We should not have two to one. We should have one to one. One to one is obviously the best answer. So here's an architecture where it's a 1U36. And you can do this with NVMe, I mean, with NF1.
Starting point is 00:18:16 I've seen an NF1 system where they actually got a one to one ratio. So what did they do? They took 24 lanes for LOMs and slots, okay, 24 lanes over here, very symmetric design, so they went 24 up and 24 down. If I've got 48 lanes, just split them in half, right? That sounds like a great solution. So I've got 24 lanes up, 24 lanes down, and when I do that in a 2U24, what that ends up with is I end up with 72 lanes here and a 3 to 1 ratio. So what did I just sacrifice? Yes, I have a perfectly balanced network-to-drive ratio off of my CPUs.
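Both sides of that trade-off in one sketch. The 24-up/24-down split and the 72 lanes below the switch are from the talk; the 18-drives-per-switch reading (36 drives across two switches, x4 each) is my interpretation of the layout.

```python
# 1U36 "one-to-one" layout: each socket splits its lanes evenly between
# the network and a PCIe switch.
lanes_per_socket = 48
to_network = lanes_per_socket // 2          # 24 lanes up to LOMs and slots
to_switch = lanes_per_socket // 2           # 24 lanes down to the switch

drives_per_switch = 18                      # 36 drives across two switches (assumed split)
lanes_below_switch = drives_per_switch * 4  # 72, as quoted in the talk

cpu_balance = to_switch / to_network            # 1.0 -> perfectly balanced at the CPU
switch_ratio = lanes_below_switch / to_switch   # 3.0 -> but oversubscribed below

print(f"CPU drives:network {cpu_balance:.0f}:1, switch ratio {switch_ratio:.0f}:1")
```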
Starting point is 00:18:57 But I'm back to sucking through a tiny coffee straw again with my switches. So there's always some place where you have to sacrifice. And that's the crappy part of architecture because there's no perfect answer because we don't have all the PCI lanes in the world. I was thinking about this morning in the shower. I want 512 PCI lanes. That's what I want. And some way to get all that out of a 1U, which is impossible. So, at least today. So, this is the way that the system sets up in a... Doggone it. There it is again. What is with this? I keep screwing up. You'd think I would have, I obviously copy and paste and did it wrong. Wait, wait for it. I hope they let me distribute new slides. There's 36 instead of 24.
Starting point is 00:19:59 I'm surprised you guys aren't all over me going, Mike, your drives are wrong. So now all of a sudden this makes more sense. So this solves the question of I want perfect balance, but it doesn't solve the question of now I've got a bottleneck through my switch. So there's another approach to take. Why do we need? We've got NVMe over Fabric, right? What's the purpose of NVMe over Fabric? Avoid the processor.
Starting point is 00:20:26 Go around it. Use RDMA to go around the processor. So why do you need a processor? Just get the stupid processor out of it. Take away the processor. Take away the memory. I just get rid of that whole problem. That's a JBOF.
Starting point is 00:20:43 So how does a JBOF work? A JBOF, you take, you don't have any processors at all, but you've got PCIe switches. Now, this first row of switches, what we did, so this is a design of JBOF, is we did a one-to-one ratio. So I've got 32 lanes that can go off to servers and another 32 lanes that are going off to servers. Once again, now it's servers. We're not going off to network because our network is the PCIe bus.
Starting point is 00:21:11 So I can go off to multiple servers here and multiple servers here, but I want to make sure that the servers on both sides have access to all of the drives. So to do that, I use this PCIe switch just to crisscross for me. So I can crisscross with the PCI switch so that the servers over here have access to these drives and to these drives. And then I put another set of PCI switch. So there's two layers of PCI switches, and this one has a two-to-one ratio. So overall, you get a two-to-one ratio from the outside of your box all the way to the inside of the box. That's really nice because I reduced it down to that nice two to one ratio.
Starting point is 00:21:50 The problem you're going to see, obviously, is wait a minute. There's 150 nanoseconds of latency through this switch and through this switch. So I just added latency. No matter what, there's always going to be a problem. So I add latency here, but the beautiful thing about this is now I have 64 lanes of PCIe coming out of the box. That, in theoretical terms, is 64 gigabytes per second. In reality, we've seen 52 gigabytes per second of real raw data out of a box. That's kind of unprecedented. You don't get that out of the box because who's got that many lanes coming out of the box? That's the key. So
Starting point is 00:22:34 this is where JBoffs work out really, really well. If you want high bandwidth to multiple servers, why do you want to do that? Because one of the things that's happened in this world, and if you've been to enough of these conferences, you've heard about it. NVMe drives are so much more powerful than spinning rust that you now have a situation where you have more capability in the drive than you have in any one node. If you've got 32 drives out there, I've heard lots of people say there's no two-socket processor that can possibly keep them all busy. So share them. Send them out to multiple servers
Starting point is 00:23:15 and share that resource. If you're into disaggregation, this is a great solution. So RackScale Design loves this solution with the new RackScale Design 2.1, because now I can disaggregate all of that. I don't have to associate my compute to my drives. I can keep them disaggregated. So that's another advantage of a JBoff solution. Now, one thing that I love to do, just to really mix it up, is what if you took a multi-node and a JBoff and put them together?
Starting point is 00:23:47 What do you get? Well, you've got a four-node in a 2U. You've got a JBOF that's a 1U. So in 3U, I can get 56 drives. So now I've got a whole lot of capability in the drives. The network kind of looks like this. I end up with an x16 LOM over here and an x16 slot over here, so I can get 200 gigabits coming off of my CPUs. And then, if you remember, each one of the nodes has six drives directly connected, and now each one of the nodes can have an x8 going off to the JBOF, if you just divide them up evenly, depending on how you want to divide them. So now I can take 56 total drives. I've got 40 lanes for SSDs.
Starting point is 00:24:30 I've got 16 lanes for the slot, 16 lanes for the LAN. You start looking at that. Now my ratio here is actually pretty close, 32 to 40, real close. And I've got 800 gigabits of Ethernet that can go out of the box. So now not only do I have a nice little ratio, I've got the ability to have an enormous amount of bandwidth coming in and out
Starting point is 00:24:52 of a four-node server system that I'm distributing out the drives. So this is combining the multi-node with the JBOF. And that comes out with kind of a really slick solution. And it's very symmetric. Now, you hear me talk about symmetry. One of the things we learned at Samsung, at Samsung I designed, I think one of the first, it was a 40 drive system. It was just a proof of concept, and designed that out, but I designed it such that we could do a symmetric design or an asymmetric design. We could do both. I cabled it such that I could do a symmetric design or an asymmetric design. We could do both. I cabled it such that I could do both. And one of the things that we discovered in that, yes, there are advantages if your software really wants all of the drives off of one CPU for RAID.
Starting point is 00:25:40 But if it's just performance, a symmetric design is so much better. If you can avoid that UPI bus, it really, really helps. So this is one of the areas where you can take a multi-node, which is notoriously not symmetric, and turn it into a symmetric design by adding a JBOF. Okay, last but not least, and I have no idea how I'm doing on time.
Starting point is 00:26:01 How's my timekeeper? Good? I've only spent 25 minutes. Actually, I'm going too fast. I should slow down. 2U24, high availability. How do you do that? Well, what we do in that, you've got two nodes,
Starting point is 00:26:18 so you've got two CPUs, each in two nodes. In this case, we take by 16 slots, 16 slots, send it off to a PCIe switch that's got a three to one ratio so that I can do dual ported drives. So in this case, I've got a drive that's by four, but each port is by two. So I cut my bandwidth in half to each port, but it allows me to connect to two different nodes. And that's the way dual-ported drives work. So now what I can do is I can take, on these 24 drives, I can take 48 nodes, 48 lanes over here, 48 lanes over here, combine it with my BI-16, put a three-to-one ratio, connect it up to
Starting point is 00:26:58 that CPU, and now I've got 48 lanes of PCIe going out the back. Okay? And this gives you a dual-ported solution with a standard classic 2U24. Pictures. Stealing it. Now, the one thing you can do, and I'll leave this as an exercise to the audience, you can do one more thing that'll take this into a 2U48,
Starting point is 00:27:23 which some people have done, but I'm not supposed to talk about products, so I didn't. But you could also move this and add another layer that would then be a 2U48 using the same sort of architecture. So now you get a three to one ratio on the switches and you get a one to three ratio on your SSDs to network. Okay? So the bottom line, what's the right one? I mean, I just showed you a whole bunch of different architectures. Which one is correct? Because we would all love to, we only have enough resources to design one system,
Starting point is 00:28:00 so what system should we design? Well, there is no right answer. The right answer is the customer. Who is your customer and what do they care about? Do they care about blast radius? Then don't sell them a 1U32 server. Oh my gosh, now you've got 32 drives that are all going to die, and when they all die, when you lose that node, they're going to go crazy. That's not the solution, right? If they care about blast radius, you want to minimize that. Flip side is, if they care about capacity and some great bandwidth, you want to go to that 1U32.
Starting point is 00:28:34 So what's the right answer? The answer is it really, really depends on your configuration. It really, really depends on your customer application. And what's their usage model? What are they doing? What's important to them? Do they care about analytics and data and computation that's going to be inside? Or do I just say, I want to use this as a storage application and I just want an enormous amount of bandwidth into and out of that box? It totally depends on what they want. The answer, of course, is design them all, right?
Starting point is 00:29:07 The nice thing is, and I'm not supposed to do this, so don't tell anybody I said this. At Super Micro, we have almost every one of those. There is one on there that we actually don't have, and I'll let you go to our website and figure out which one it is. But what we've done is we've just designed them all because we know that we have customers that want each and every one of those different kinds of solutions. And then we still have customers that come to us and come up with some bizarre idea that's like, oh no, that doesn't work. We need a completely different thing. And it's like, are you kidding? We have all that. Anyhow, so there is no one answer. It depends on your application. It depends on what your software is going to do.
Starting point is 00:29:46 Think about write amplification. That's another aspect. If you've got heavy write amplification that's going, if you've got a server here and you've got write amplification to each one of the storage systems, then you care enormously about the bandwidth going in and out the network, right? But if I've got software
Starting point is 00:30:11 where the write amplification is internal, in other words, I write to the device and then I amplify out, now all of a sudden I care much more about how many PCI lanes I've got going to my drives and less about what I've got going to the network. That's why, in some ways, like I designed one system, it's got a two-to-one ratio.
Starting point is 00:30:33 I kind of like that because from my standpoint, if I'm writing the data in and then amplify it out internal, well, then that's a great solution. But someone pointed out to me, yeah, but what if I do write amplification outside the box? Well, then that's a great solution. But someone pointed out to me, yeah, but what if I do write amplification outside the box? Well, then your solution sucks. And it's like, okay, good point. And so you really have to think about what your architecture is.
Starting point is 00:30:56 And for all you software guys, how is that data going to be distributed? How is it going to be analyzed? How is it going to be used? That's what's key. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe@snia.org. Here you can ask questions and discuss this topic further with your peers in the Storage Developer community. For additional information about the Storage Developer Conference,
Starting point is 00:31:35 visit www.storagedeveloper.org.
