Storage Developer Conference - #40: Breaking Through Performance and Scale Out Barriers

Episode Date: April 10, 2017

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcast. You are listening to SDC Podcast Episode 40. Today we hear from Bob Hanson, VP Systems Architecture at Apeiron Data Systems, as he presents Breaking Through Performance and Scale-Out Barriers, a storage solution for today's hot scale-out applications, from the 2016 Storage Developer
Starting point is 00:00:54 Conference. I'm Bob Hanson from Apeiron Data Systems, but the way that you remember this is that you look at it and you go, ape iron. Apeiron is the way I say it. But every customer, all the executive team, everybody says it differently. But it's a Greek word that evidently means like infinite or something. You know, wonderful like that. So we've been in business for about two and a half years. We're still an early stage startup. We actually, now I want to hear, some of you guys,
Starting point is 00:01:30 how many people have been in startups before? All right. So Monday we closed our first deal for 500 grand. Yay! We got three more queued up and it looks like it's really going to go. But anyway, I'm going to tell you about how the company got to where we're at and why we did it. This is a storage development conference, right?
Starting point is 00:01:54 I'm going to show you why we did it and why we're doing it a little differently than everybody else. Assuming I get my devices turned on here. So we don't need to go through the agenda because we're about to go through it. So IDC, years ago I was in a technical marketing role, and I really, I learned a lot about the industry at that point from a whole different point of view
Starting point is 00:02:25 than my whole engineering career before that. And it was amazing. I learned that when how market projections were done, and I was working with career marketing people, right, for the first time, and I couldn't believe, you know, how they were just going, yeah, iSCSI is going to be this big in that amount of time. I couldn't believe how they were just going, yeah, iSCSI is going to be this big in that amount of time. I couldn't handle that. I put like a, I don't know, a five-page spreadsheet model together to do that. And when I left the course, nobody wanted
Starting point is 00:02:55 to use that. They went back to putting their finger in the wind. But the other amazing thing was that I would talk to the analysts like IDC and then in a month that projection would show up as being facts from the analysts. And it was really amazing because iSCSI was really going to take off fast because everybody was telling them the same thing. Anyway, that was IDC's role but more recently they came up with a really interesting insight about this third platform thing and you guys have probably seen this before. But the third platform being, and I'll talk about the first two platforms here in a little bit.
Starting point is 00:03:33 But this being cloud, big data analytics, driven by a lot of different things, big data is a huge part of it. But the part that we're focusing on is the analytics part. And of course, the hyperscaler guys drove this initially. But that's one of the things that changed that got my company in the business here. Another one was NVMe. I'm not going to spend any time on that. It's wonderful. It's fast. It's got a huge amount of support. It finally gets rid of that, what, 30-year-old SCSI stack or whatever we've been dealing with for a very long time in different forms and pieces.
Starting point is 00:04:20 I mean, I worked on SCSI and Fib Channel and SAS and all of that and we finally get the start over which is really a godsend given the flash capabilities that we have now. So you get performance manageability and it's all being invented right now. I mean, we got the 1.0 fabric standard out, NVME standards been out for a couple of years but they're still working on it. I just listened to a presentation where they're really optimizing it now for, to give controllers and storage stacks more direct control over the actual flash chips in the way that, not the flash chips, but the way that the flash system works so that you can actually optimize for streams and things like that.
Starting point is 00:05:06 So this is getting better and better as we go along. It's really complicated stuff and I'm really surprised it works at all but it seems to be moving along quite well. And then the really nice thing is that there's a robust community of vendors now. I put a bunch of them up there. OCZ is now Toshiba. Did I put it Toshiba Drive? Yeah. So these guys got bought by these guys and they're actually shipping both of those drives now but it won't be for long. And the really nice thing is that they're all working to be the best in the world. Like, you know, you got Samsung out there and probably selling more than anybody else. But everybody else is innovating.
Starting point is 00:05:46 There's a drive out there that's optimized for writes, that does better write performance than all the rest. There's drives out there that you can flip a bit, and you get high endurance or low endurance, and the capacity changes, which is really a challenge. Because one of my jobs is to work with my ODM partner to qualify drives, and I'm trying to convince them it's a Japanese company. I'm trying to convince them that it's just a bit change.
Starting point is 00:06:13 You know, you should test one of them and call it good for both. And somehow I'm having a really hard time with that. And there's several other innovations on the way from these different vendors and they're pushing the prices coming down really fast. So it's a really interesting ecosystem that's developing here. And it feels like it's going to take off like this year, next year kind of thing. And certainly with the fabrics development, when we get some fabric solutions going, then that'll really start pushing it out of the high-performance area, which is where I'm working into more of a mainstream kind of thing. So I was thinking about this, and I've been in the storage industry for a long time
Starting point is 00:06:55 and made a lot of money on broken hard disk drives. So I was thinking back, well, the first platform, using the IDC thing again, was the mainframe. And some of you guys might remember those things. But you reach back then, and then you had this really dumb mechanical device. And then the mainframe told it what to do, like move the head to here and write something and then stop and move it back. And it was really direct control by the server itself, or the operating system, actually, to dumb hardware. That worked really well.
Starting point is 00:07:31 And in fact, we've been kind of reinventing the mainframe ever since that point in time. I worked a lot in the second platform area where we went client server. You guys can't read this, but this is thousands of apps and mainframe terminal and millions of users. This is hundreds of millions, client server with networking, and tens of thousands of apps.
Starting point is 00:08:00 So then we got SCSI, what a godsend. We got smarter drives. And we got network attached storage, which was a tremendous innovation And in this time frame we also got the array controller So now you have these hard disk drives. They fail you're shipping a lot of them You've got to have raid and then they started layering more and more software on top of there. And you've got this, actually a computer with a bunch of high availability features based on it between the servers or the clients and the hard disk drives.
Starting point is 00:08:35 That worked out great. And it created these huge companies, of which I've worked for three or four of them, with huge lines of code. In fact, the last big company I worked for, they had more than 25 million lines of code in the operating system that they had to put together to get a release out. And, you know, I'm a kind of a hardware firmware kind of guy. Thinking about that, it just makes my head hurt. I don't know how they ever got a release out.
Starting point is 00:09:08 In fact, their stock price shows that they're not doing too well on it. The third platform. So now we have, in addition to millions of apps and billions of users, we've got millions of developers on open source software. We've got OpenStack. We've got people going crazy on all kinds of capabilities built into Linux. And so we've got all of these developers doing really clever things.
Starting point is 00:09:32 We have storage-aware applications and operating systems. You have people in NoSQL that have just bypassed the whole storage stack inside Linux, and they go directly to the block storage devices. You've got other companies that are so hungry for performance that they skip the file system and go straight down into there. You have many applications that are built for tiering and I'll talk about that a little bit more. So very storage smart applications.
Starting point is 00:10:05 And of course, we've got Flash now with NVMe interface that really has been an incredible performance improvement and a reliability improvement. And those devices are really intelligent. One of my vendors told me that they're putting about a little bit more than 500,000 lines of code into a flash controller chip. And that's really interesting. They have very powerful processors in there.
Starting point is 00:10:37 They've got to manage the flash a completely different way. It's a little scary because that's the reliability problem right there is the firmware most of the time. And now we have people moving back to direct attached storage for these higher performance applications. But that's moving into network storage. And that's another thing that got my company going. So I thought I might have to wake people up at this point. So I put that headline on there. Of course, that's a major overstatement because all of the data in the world is sitting behind array controllers today.
Starting point is 00:11:22 There just is a little bit of DAS that's going on in these scale-out markets, and that's about it So the array controller's going to be around forever. But the value proposition for a lot of the new apps that are being developed is disappearing for array controllers. Let's see if I can make my case with you guys. So if you look at a lot of the scale out apps that are coming out now, that are storage aware and they're storage-aware and they're real-time analytics that they're pushing this on. Now, that's a narrow part of the market.
Starting point is 00:11:50 We're not talking about keeping your bank account straight or anything. But this has to do with big data analytics, and a lot of these things, the applications are targeted at applications that require real-time decision-making, like ad tech, where you have to, I've talked about this in other programs, but there are several companies down the peninsula. They've got to be able to identify you as a user, figure out what ad you might click on, go bid for ads. Well, they run an analysis on which of the ads that are available are your, they run a probabilistic sort of
Starting point is 00:12:32 algorithm on there to find out which one you're liable to click through. They have to bid for the ads. They get the ads back, and they put it up on your web page, and they have to do all, and then they track what happened, and all that has to happen in about 200 milliseconds. So that's an example of if you can make their storage go 20%
Starting point is 00:12:49 faster and save them tens of milliseconds, they can make more money because they've got more time to run this algorithm against those ads. And then when you click through, they make dough. So we've got that. You've got fraud detection where people are trying to understand whether your credit card is legit or not before Before you close the deal on Amazon or something
Starting point is 00:13:09 and of course, there's all the security related Issues that are out there and then you have apps like Splunk where the name of the game is ingesting huge amounts of data and That's not really a real time sort of thing, although they do run analytics against it. But the faster your storage is, then the more data
Starting point is 00:13:32 you can take in on a terabyte per hour kind of basis. So these applications are all new. They're designed for scale out from the beginning. That's the group of apps that I'm concentrating on, because not many people are writing banking software anymore, you know. They're looking at this new world of the Internet of Things that's going to keep driving these huge very fast data requirements to be able to ingest this and then run analysis against
Starting point is 00:13:59 it. In all of these apps, the application manages the data placement across the compute cluster and that's because the server has now become the center of the universe. Before it was the, you know, you had your database down here and you would pull that data in and then you would run analysis on it, run your joins or whatever and it would be paging data in and out of the, through the array controllers down to the hard disk. They don't do that anymore. They split the database up in the first place.
Starting point is 00:14:29 They put shards of it in all of these different servers up here. If they could, they'd run it all out of DRAM. But of course, the working sets are getting larger. So they're tending to move to SSDs. And now they're moving off the box into external SSD boxes. But that's a big change. You know, the center is now the server and they all have algorithms to manage tiers of
Starting point is 00:14:53 data and to be able to failover servers because that's the element that's most important. You've got data that's associated with that server directly. So if the server goes down, they have to have a strategy to be able to fail over to another server, warm that data back up, and get started again so that they can keep their throughput going. So that's a big change. This application manages high availability.
Starting point is 00:15:20 And because of that, they've got data tiering and data migration schemes going on. You've got complex array-centric storage software. If you put an array behind these things and you turn on any of the really wonderful capability that's been developed over the years, like dedup and compression and snapshots and all of that sort of thing, what you're doing is you're putting huge amounts of software stack in between the application and the raw performance that you can get out of the NVMe drives these days. And every feature that you add to that, if it's not hardware accelerated, just takes the performance right out of it. Just went to the VMware presentation, what,
Starting point is 00:16:13 two hours ago, and they were really happy that they were, when they put their, I think it was vSAN was the product, on top of the NVMe drives, they were only an order of magnitude slower. That's a whole order of magnitude. They're talking millisecond response time. And the drives will do in the hundreds of microseconds, low hundreds. You can get that easily. You can get better than that.
Starting point is 00:16:39 So you don't want to do this. And these new applications just don't want to see that kind of thing. And then you've got multiple tiers of storage, starting with DRAM and then going to maybe SSDs that are captive inside the servers, then going outside the server into more of a network storage sort of thing.
Starting point is 00:16:59 So the value proposition for array controllers is going away, because that's where the center of all this software that slows everything down exists. And so we're kind of in an era right now where Flash came along, and a lot of companies started up, and they reproduced what the big guys had done with hard disk drives, only they had Flash behind it. So it got a little bit faster.
Starting point is 00:17:25 But they're adding value by throwing more and more software features and functions on top of it, like the tier two guys. And some of them are being very successful doing that. But I would contend that this new set of applications that are being developed, it just doesn't work for them. So the NVMe storage devices, and I'm no expert on this, but because of the difficulties in managing flash, they've put all kinds of reliability firmware into those drives.
Starting point is 00:17:58 And they've been working for years now trying to make that work very fast. So they're getting better at garbage collection so that we get more reliable or more consistent latencies. They're getting better at all of this, and the NVMe standards evolving along with them. But because of all those HA features that are built down into those drives, especially with the NVMe side of things
Starting point is 00:18:21 and the maturity of the market, there's about an order of magnitude difference in reliability between the solid state drives and the hard disk drives. I was just talking to Dr. Jivy from NetApp, which was in this last presentation. And he gave me that bit of data based on tens of thousands of drives that NetApp has information on. So there's another part of the array controller value proposition that is just kind of going away these days.
Starting point is 00:18:51 And besides the fact that you don't have time to do like RAID 6, and that's something that really kills performance, right? So you get to a mirroring sort of situation at best, and away you go. So those are the things that were driving our design. So let's build an old style second platform storage array. You've got a bunch of servers.
Starting point is 00:19:14 You've got HBAs in the systems that would be fiber channel or so. Well, usually fiber channel or NICs to communicate with the storage array controllers. The array controllers are actually custom-made CPU complexes with a lot of DRAM that's actually mirrored on both sides to make that reliable and with NICs on one end or fiber channel on one end and then an HBA out the back that talks to, oh, before we get there, you've got a switching fabric in front to be able to get into these things.
Starting point is 00:19:51 And these days, you can start with an array controller that's just mirrored like that, but there's clustered controllers now where you can scale out to a certain extent. I don't think anybody scales out, well, that's not true. It's very difficult to scale this architecture out to say even four nodes. Two nodes is kind of doable. But you go beyond four nodes and this architecture becomes difficult. And in the back end, you got a whole bunch of JBODs with hard drives or SSDs behind it
Starting point is 00:20:23 connected by another network. Now this network also might have switches on it. And then of course if you're clustering controller sets then you have to have a very high speed communications path that goes between these guys. Sometimes InfiniBand, other people are using other methods to do that, but that's the way it's done in the second platform storage. EMC, NetApp, HP, all those guys have made huge amounts of money, and so did I building these things.
Starting point is 00:20:57 But what we decided to do is very different. So we also have servers. We have our own driver because we have hardware accelerated HBA that's ours also that sits there in the servers. But our storage device looks very, very different. I'll build it out, and then I'll talk about it. So we've got switches inside of the, this is basically a direct attached JBOD sort of thing. There's no CPU complex in here. The first thing that the data path sees going into
Starting point is 00:21:39 the box is a large port count 40 gigabit ethernet switch, which allows us to directly connect a lot of HBAs to the box with very high bandwidth, low latency, standard layer 2 ethernet connections. And you can scale these things out, because they have 32 ports on the back. You can scale them out for quite a ways by, and this is
Starting point is 00:22:04 simplified, of course, but you've got the connections go to the servers. You can have as many HBAs, and those are dual port HBAs in the servers that you want. You connect all those up to the boxes. If you need more boxes, you can interconnect them. It's all just one big ethernet network. Oh, and there's a connection across here, eight lanes going
Starting point is 00:22:24 across to route traffic back and forth across that side. And then down at the bottom are the NVMe SSDs. So this is, you're going to see this sort of architecture, I think some of the NVMe over fabrics companies will probably work on something similar to this. But I'll talk about the difference between the protocols because we have a proprietary protocol we started two and a half years ago or maybe three years ago for a couple of people to invent this thing. So that was before fabrics. There was a gleam in the eye of the fabrics
Starting point is 00:23:00 community. So that's what we're building. So we don't have external switches until you get very large. We don't have many layers. We don't have CPU complexes inside here. It's a very simple design, except for the storage controllers here and the HPAs. Those things both have FPGAs that provides hardware acceleration for our protocol. And I'll talk more about that.
Starting point is 00:23:35 It's designed to scale it. So designed for software application-defined storage that I was talking about. The smarts, we've got some smarts in the driver. We support mirroring and quality of service and drive virtualization, striping and making big ones out of several drives, and permissions. But we're not going to support any kind of software that gets in the way of the data path. If somebody wants to do that, they can do it,
Starting point is 00:24:03 but it'll sit above our driver intelligence. And they'll be responsible for slowing everything down. Yep? To what? It's a tunneling protocol, right? I'll talk about that in a second. Very simple tunneling. Anyway, this thing will scale out to hundreds of servers,
Starting point is 00:24:27 multiple petabytes. It uses any standard NVMe drive. We've got three different drives qualified to go in there right now in four different flavors, I think. So now I'm going to go back and this is like the frequently asked questions section and I'll get to the protocol here in a little bit so why not use the server captive storage and of course you guys are all experienced you know why right so this is good as long as you as long as you can live with the limitations you can plug the storage directly in there and that'll be as fast as you can go, with some exception. Our first demonstrations that we did, we put NVMe cards, actually, in a slot in the server,
Starting point is 00:25:13 and then we put SSDs on the end of our network out there, and we compared the performance between the two of them. And you can't see the difference in performance because we're running, I think right now we're not optimized completely yet, but we got about a three microsecond round trip latency on our network because it's all hardware accelerated layer two, right? But the weird thing is that the numbers would actually come out in our favor and I could never figure out what that was about. And then I think I talked to an Intel guy at the DSI conference or one before that and he said, yeah, we've seen that too. He was doing an NVMO over fabrics kind of demo and I said, we've seen that too. There's some sort of interrupt issue that the PCIe cards that plug directly in and we
Starting point is 00:26:04 were using an Intel card at the time, don't get the priorities right quite as well as the card that we had in there. So it was actually faster. So finally got it described. Anyway, the problem with the captive storage is, of course, it's tied to the server. And so depending on how many slots you have, your capacity is limited. You can't share that stuff across servers, except going over the top of rack network that bogs everything down. There's no dynamic scaling.
Starting point is 00:26:32 When you run out of slots, you're done. SSD virtualization is not happening. These just show up as slash devs of a particular size, and that's what you use in the application. It's a management challenge. If you need to add 800 gigabytes in about another year, you're not going to be able to buy those things. They're too small.
Starting point is 00:27:01 You still have to plug in a 1.6 terabyte drive in or something to get that. You can't really carve it up and use it in different solutions. It's inefficient for cooling and rack space. And as things get bigger and the working sets get larger, there are quite a few companies that are having to buy servers to put SSDs in to increase the size of their overall data set.
Starting point is 00:27:28 And they've got tons of extra CPU that they didn't really need to buy, just because they need more slots in the servers. So the answer is, of course, to get out of that box and make this thing work. Oh, my antivirus stuff kicked in. How convenient. And get the storage out of the server again. So some of this history is repeating again, right? Let's do a SAN kind of thing
Starting point is 00:28:05 and take care of this problem a second time around. So that's what we did. So the protocol that we're using is simple and fast and effective. We've used Layer 2 Ethernet and it's a storage network. You can't really think about it as a general-purpose Ethernet network, although those switch chips are standard switch chips. You could run any kind of Ethernet traffic over there that you wanted to. We don't support that because, again, we're a performance play. We've hardened the Layer 2 Ethernet with some additional error recovery things that are going on in the FPGA.
Starting point is 00:28:49 So if you've got dropped packets, we'll retry things automatically at hardware speeds. It's a fully integrated NVMe fabric with no external switching until you get really large. If you're scaling out like multiple racks with 20 servers apiece in a couple of these boxes, you're going to have to, to get enough interconnection, you're going to have to add switches. Any standard 40 gig Ethernet switch will work. And it's, today anyway, it's the industry's lowest latency transport protocol that's shipping. In fact, I'm not sure anybody else is shipping over Ethernet this way. So what it is, is it's a tunneling protocol.
Starting point is 00:29:30 We are transporting NVMe commands. We are not transporting data like RDMA stuff, like the Fabrics Group. It's a different way to do it. And it's dead simple. The other question always comes up is, what is the difference? Well, that's the basic one.
Starting point is 00:29:50 And here's kind of an illustration of how NVMe over Fabrics might look, the stack of this whole thing with RDMA and, in this case, some verbs sitting on top. If you actually implemented this, now, some of you guys are way more experts on fabrics than I am. But what we came out with, if you did an InfiniBand version of this thing, you would end up with 42 bytes of overhead, with a total of 212, versus our 22 bytes that we have in here.
Starting point is 00:30:27 Because all we're really adding to the headers that are already in place for Ethernet is four bytes here. And that's all we need to get out to a pretty large network and get that stuff routed. We only work on Ethernet, so we're using all the tricks that we can to use things like we key a lot of data placement off the MAC address. So each SSD has a MAC address, each HBA port has a MAC address.
Starting point is 00:30:56 So by tying the transaction to the MAC address, then we don't need extra bytes of overhead that you would need that Fabrics is using so that they can be transport agnostic. So it's really fast. Very little overhead. More comparisons. I think I probably mentioned all of this. Yeah, you know, any time you're trying to make a standard, things are going to go really slow and you're
Starting point is 00:31:25 going to get lots of pages of standards. And anytime you go transport independent, then you have to have shim layers and well-defined interfaces so that you can actually slide different transports underneath. And it's a really good way to go, but it's very complex and it slows things down a little bit. Now today's NAND speeds, the difference between 10 microseconds and 3 microseconds probably doesn't matter at all. But where this will come into play is the next generation of storage class memory where Intel's I think it's all public out there that they're shooting for something like 10 microseconds or less from the
Starting point is 00:32:06 3D cross point. So at that point, if you've got a 10 microsecond transport delay and a 10 microsecond latency from your device, then you just doubled the latency. And I think we can actually probably cut in half what we have at about, get down to about a microsecond and a half round trip after we play all the tricks. And that's really a part of the market that we're really going for. I mean, it works great with NAND and we're going to sell the heck out of that,
Starting point is 00:32:34 but then the next generation of memory, it's really going to work, whereas the Fabrics guys may have a hard time. Now, I know there's chip development underway, a lot of great innovation going over there, and most customers will never need that kind of performance or latency, but there's a lot of them out there that will. And there'll be a growing number of customers that'll be very interested in that. So before I go on, does anybody from FabricSide want to throw darts at me or anything about the comparison. Because, again, I listen to the conversation. I've never participated in the fabrics development. But I've been watching what's going on over there.
Starting point is 00:33:14 And it's really good work. It's a great standard. And it's going to make a lot of money for a lot of companies. But we're going to ship for a couple of years before that really starts rolling in 18. So this is what the hardware looks like. It's a 2U24 box with a whole mess of fans. We can handle a full 25 watts on all of those drives at the
Starting point is 00:33:45 same time at, I think, 35 degrees is our spec for the ambient temperatures. As I mentioned before, it's got 32, 16 on each IOM, two IOMs, 32 40-gig ports ports and QSFP plus is coming out the back. It will handle optical or copper connections. This is what the IOM module looks like. Here are the 40 gig ports. And there's the massive switch chip that's actually a 36 port 40 gig switch chip. Here's our storage controllers here. Each one of these FPGAs has four slices in it.
Starting point is 00:34:30 So a slice of FPGA handles an individual drive. So that gives you some interesting capabilities also, because now you've got the PCIe by 4 interface to the SSD out there, isolated from the network by this storage controller. If anything goes wrong with the SSD or the PCIe bus out there, it'll never propagate into the rest of the system. You also have an intelligent piece of FPGA with a little microprocessor sitting next to it that can do error recovery, and it's responsible for initializing the drive
Starting point is 00:35:05 and getting it all ready to go. And then it advertises it onto the system. That also plays into our discovery mechanism, because we have that intelligence there when that drive comes up. We've got kind of a, we use all NVMe style commands. And the native stuff just goes to the drive. But we have APRN-specific NVMe commands that we use for administrative purposes.
Starting point is 00:35:27 And so that FPGA comes up, that little processor comes up, and we can immediately talk to it and find out what's going on with the drive. We can send it commands, we can get all the smart data. Well, that comes from the drive, but we can get... There's a certain amount of data that that chip collects so that before the thing even comes online we can pull model numbers and serial numbers and the rest of that stuff so that users can figure out what's going on. And error recovery I mentioned. And if you hold your tongue in your mouth just right, you know and stand on one leg and you got the right SSDs We can get eighteen point four million IOPS out of to you
Starting point is 00:36:09 And I don't think anybody's come close to that yet And that's just because we're exposing Everything the SSD can do to the server It's just like you plug that many SSDs into the server. You'd get the same number and that was our objective. Because now the, we've got standard SSDs that are going to keep innovating down there at the SSD, NVMe SSD level and we can use anything that's in that form factor with that box. And then we can go to 100 gig Ethernet or beyond if that becomes a bottleneck.
Starting point is 00:36:43 And we can keep pushing the bottleneck back either to the SSD or to the compute complex. We limited out at about 2.5 million IOPS on a two socket Intel chassis. That's about all we could get out of the servers that we're using. So you have to have multiple servers to be able to handle the performance coming out of one of these boxes.
Starting point is 00:37:11 This is some marketing blah about performance. One interesting thing is that we mirror the writes, and the mirrors handled in the driver, and the consistency checks. And we handle hot spares and building mirrors. And we can go to three-way mirroring actually at this point but the interesting thing is that because We mirror to two drives when you do reads We can pull data off both those drives and you get almost double the performance read performance coming back so you define a virtual volume that's mirrored and The rights go down at about the same speed because speed because it's just two writes right after each other. So you don't really see
Starting point is 00:37:50 a difference there. But then you, when you read, you can just about double the performance if you do some tricks inside the driver. Yes? . No, no, no, like a ratio coding or anything. Not at this time. It's just all mirroring. But we can go up to three ways. Anything else would take a lot more processing power and more development time than we have right now. I mean, you can put something, a shim above the driver and do that. You know, you've just got a whole bunch of LUNs coming out and you can define those to be any size, so you can split up the drives if you want to run some erasure coding above
Starting point is 00:38:29 that, you could certainly do that. Again, it's, you know, application aware or software defined storage applications. There's a group that we're working with that started out using our box as a caching layer for software defined storage because they're running a bunch of virtual machines above that and the prices for the ssd costs have come down so fast that they're considering having a version that they just use all ssds with they just don't tear before they would go to us for a caching layer and then come up and then through the top of rack network go out to like a hard disk farm.
Starting point is 00:39:06 But I think they're going to ship both ways. There's our advertising for our 18 million IOPS out of the box. And again, for these scale out apps, the way that they shard the data and you associate cores with a particular data set that is essentially directly attached to those cores, you can scale this thing forever. Because there's no cross traffic. If you had a big shared storage sort of array, then you'd have to worry about a lot of traffic
Starting point is 00:39:39 going through a particular switch and lost packets and that sort of thing. But for this class of applications that are already dealing, they started out putting memory and DRAM that's not an issue this is just an extension of that so that's where we're starting and the investors like this one because we can ride compared to a lot of companies with custom implementations of NVMe cards, we can ride the innovation curve for the standard 2.5-inch SSDs, and we've already been doing that. We started out with 800-gig drives, then 1.6, now 3.2, now we've got 6.4s under test at the moment.
Starting point is 00:40:23 We're supposed to have 8s by the end of the year and maybe 16s by the end of next year. And then there's versions of that that are way cheaper and versions that are fast. And so a customer requests a certain quality of service, essentially, and we can deliver that to them. Oh, and 3D Crosspoint, I think, I know it's going to come out in a two and a half inch form factor and I think it might come out of there pretty fast.
Starting point is 00:40:51 It might be one of the first things they ship. And then, of course, you had Samsung today talking about was it their Z series SSDs that get down into the 20 microseconds sort of latency range, the 3D stuff. And that'll be out next year also. NVMe interface, the right form factor, we plug it in and qualify it, and we can use it. And you can have multiple types of drives
Starting point is 00:41:20 in one enclosure. It's just a drive is associated with a MAC address. That all gets associated with the serial number on the drive so that you can move them around. But that drive can be any place in the storage network. So you set up a mirror. If you've got one box, you're on both sides of the box. If you've got two boxes, you set the mirror up on two boxes.
Starting point is 00:41:44 If you add another one, you can add a third one. It's just one big network. There's one other NVMe system out there, which is called System A. And so that's another question that we get, how does it compare to that? System A is in a 5U box, has bandwidth of 100 gigabytes per second. In this case, because they had a 5U enclosure, we went to two
Starting point is 00:42:14 boxes here, so we'd look much better. So now we've got 4U with 48 drives at 144 gigabytes a second, 37 million IOPS. These guys top out at 10. They have proprietary SSDs and a latency of 100 microseconds average, which is about the same as what we have. And then you can read the rest there. These guys are doing parallel PCIE connect, so they can't scale out beyond that box, I don't believe. Whereas we're using the Ethernet. And we're ready to go as soon as Intel gives us a Crosspoint SSD, we'll plug it in and qualify it.
Starting point is 00:43:03 These guys will have to do a custom card. And we don't cost so much either. And that's what I brought. My good friend, Achmed Shihab, worked with that guy in Zyrotex and then in NetApp. He had a great quote, all the simplicity and promise of direct attached storage with the capabilities of network attached storage. And that's, in a nutshell, that's what it is.
Starting point is 00:43:38 Male Speaker 2 You pick two drives and it drives on a mirror. Thank you. It mirrors on an SSD basis. So today we can't take a slice of a drive and mirror that someplace. We have to take the whole drive. How many SSD drives do you think would fit into the 18-million number?
Starting point is 00:44:05 24. One box full. That's the 18-million number. Like I say, it's only one SSD that's that fast. Let's see, there's a new one that's even a little bit faster, I think. But we haven't got a hold of that many drives on it yet. But, yeah, somebody can do... If you can squeeze a million
Starting point is 00:44:27 IOPS from a drive through PCIe Gen 3x4 link, we'd have 24 million. You know, it's just a matter of where the drive manufacturers go. You know, 800,000 IOPS, maybe a little, maybe 900 is about, I think you're running out of gas on the PCIe link. The 40 gig links will handle a couple of SSDs before they max out. And then you've got a by 8 PCIe link on the HBA. So it's pretty well balanced, right? You got to buy 8 PCIe going to 240 gig ports, and then that runs down the network to as many drives as you want to hook up to.
Starting point is 00:45:11 Yes? Yeah, there's the extra four bytes and what kind of added . It's four per packet, ethernet packet. It's been identified really? Yeah. So is there some protocol that's been put in for four bytes? That's what we carry back and forth.
Starting point is 00:45:40 There's admin commands that run separately, but that's basically like a data packet overhead. It's just the four. And no, I can't go into any more detail on that. . Yeah, we handle all that stuff in the FPGAs. So yeah, the driver directs the commands. The FPGA creates the packets and handles all the
Starting point is 00:46:08 interchange between the two ends. It gets to the other end, and we split out the metadata and drop the NVM commands into a queue that drops into the drive. When it comes back up, we put it back together and send it back to wherever it came from. It's very simple. Although it's taken two years to make it work. Yes?
Starting point is 00:46:27 You have to use our HBA that has the FPGA on it. It's just a board with one single big FPGA. That has the PCIe interface and the 40 gig and our magic stuff in the middle. . Yeah, so just a by eight, half height PCIe card running 17 watts, I think. So it'll fit in any server. Should work in a.
Starting point is 00:46:58 Pardon me? . We advertise virtual volumes. Actually, right now we're, or just slash devs that relate to the drives if you want to do something simple. But the driver will virtualize, we highly recommend that you do it this way, that the driver will virtualize that drive which means you've got a handle that then tracks the serial number of the drive. So if it moves around, that handle will move with it, and you'll always know what's going on as far as if somebody pulls the drive
Starting point is 00:47:35 and slides it into another box or another location. It'll come back up and reconfigure like it was before. But you also get standard slash dev, NVMe slash devs coming out. So it's basically an HBA with your own proprietary driver to give you the features? Yeah. Yeah, and we started with the standard NVMe driver.
Starting point is 00:48:00 And there's less and less of that in ours and more and more of our own stuff now. But we started with that. And it's just and less of that in ours and more and more of our own stuff now. But we started with that. And it's just a standard block interface. There's nothing special that happens above that. We've got a storage manager that handles the configuration and setup of the mirrors and the virtual volumes and all of that, interfaces with the customer.
Starting point is 00:48:20 We've got a management processor in the box that handles all of the usual enclosure services and the box is all HA and no single point of failure. And so all the standard enterprise kind of stuff is there. Yes? So with the DSI, you said that you might be willing to share the protocol and you know just that you wouldn't work. Well, so we would be willing to work with the fabrics group to have our protocol as an alternative transport.
Starting point is 00:48:49 We'd love to do that. We'd love to have a standard to hang our hat on. And we're still willing to do that, but somehow the committee hasn't got back to me on that. I've talked to Amber every couple of months or something, but... . But, you know, this is also what happened, you know, ATA over Ethernet could have been a standard. That was, yeah, ATA over Ethernet is the same idea, right?
Starting point is 00:49:23 Yeah, but it also. So I'd love to work with the standards committee on a, you know, kind of like fiber channels, an alternate transport for the fabrics thing. And then we could still put our secret stuff above it. We've got the hardware acceleration going. I mean, it would change. It would, you know, put it into standards, it would change.
Starting point is 00:49:45 We'd have to change everything. But that's OK. Yeah? If it's a valid use case, but if you need to connect more than 16 servers to your devices, you need to insert a internet switch? You can add a switch, yeah. It's got 32 ports out the back, though.
Starting point is 00:50:00 Right. If you have dual connections. So if you have more than 32, then yeah. Or if you've got dual connections. Dual connections from the server 16, so. Yeah, it switches. And I'm just not sure if it has a proprietary switch.
Starting point is 00:50:13 No, it's just standard layer 2 ethernet. Just please don't run standard TCP IP over the top of it or performance and collisions and stuff. So I think it's about time to go have Intel buy us all a beer. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snea.org. Here you can ask questions and discuss this topic further
Starting point is 00:50:46 with your peers in the developer community. For additional information about the Storage Developer Conference, visit storagedeveloper.org.
