Storage Developer Conference - #40: Breaking Through Performance and Scale Out Barriers
Episode Date: April 10, 2017...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC
Podcast. Every week, the SDC Podcast presents important technical topics to the developer
community. Each episode is hand-selected by the SNIA Technical Council from the presentations
at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcast.
You are listening to SDC Podcast Episode 40.
Today we hear from Bob Hanson, VP Systems Architecture at Apeiron Data Systems,
as he presents Breaking Through Performance and Scale-Out Barriers,
a storage solution for today's hot scale-out applications, from the 2016 Storage Developer
Conference. I'm Bob Hanson from Apeiron Data Systems, but the way that you remember this is
that you look at it and you go, ape iron.
Apeiron is the way I say it.
But every customer, all the executive team, everybody says it differently.
But it's a Greek word that evidently means like infinite or something.
You know, wonderful like that.
So we've been in business for about two and a half years.
We're still an early stage startup. We actually, now I want to hear, some of you guys,
how many people have been in startups before?
All right.
So Monday we closed our first deal for 500 grand.
Yay!
We got three more queued up and it looks like it's really going to go.
But anyway, I'm going to tell you about how the company got to where we're at and why
we did it.
This is a storage development conference, right?
I'm going to show you why we did it and why we're doing it a little differently than everybody
else.
Assuming I get my devices turned on here.
So we don't need to go through the agenda
because we're about to go through it.
So IDC, years ago I was in a technical marketing role,
and I really, I learned a lot about the industry at that point
from a whole different point of view
than my whole engineering career before that.
And it was amazing.
I learned how market projections were done,
and I was working with career marketing people, right, for the first time,
and I couldn't believe, you know, how they were just going,
yeah, iSCSI is going to be this big in that amount of time. I couldn't handle that. I put like a, I don't know, a
five-page spreadsheet model together to do that. And when I left, of course, nobody wanted
to use that. They went back to putting their finger in the wind.
But the other amazing thing was that I would talk to the analysts like IDC and then in a month that projection would
show up as being facts from the analysts. And it was really amazing because iSCSI was
really going to take off fast because everybody was telling them the same thing. Anyway, that
was IDC's role but more recently they came up with a really interesting insight about
this third platform thing and you guys have probably seen this before.
But the third platform being, and I'll talk about the first
two platforms here in a little bit.
But this being cloud, big data analytics, driven by a lot of
different things, big data is a huge part of it.
But the part that we're focusing on is the analytics
part. And of course, the hyperscaler guys drove this initially. But that's one of the
things that changed that got my company in the business here. Another one was NVMe. I'm
not going to spend any time on that. It's wonderful. It's fast.
It's got a huge amount of support. It finally gets rid of that, what, 30-year-old SCSI stack
or whatever we've been dealing with for a very long time in different forms and pieces.
I mean, I worked on SCSI and Fibre Channel and SAS and all of that, and we finally get
to start over, which is really a godsend given the flash capabilities that we have now.
So you get performance and manageability, and it's all being invented right now.
I mean, we got the 1.0 fabric standard out, the NVMe standard's been out for a couple of years,
but they're still working on it. I just listened to a presentation where they're really optimizing it now to give
controllers and storage stacks more direct control over, not the flash chips, but the way that the flash system works, so that you can
actually optimize for streams and things like that.
So this is getting better and better as we go along. It's really complicated stuff and
I'm really surprised it works at all but it seems to be moving along quite well. And then
the really nice thing is that there's a robust community of vendors now. I put a bunch of
them up there. OCZ is now Toshiba. Did I put a Toshiba drive up there?
Yeah. So these guys got bought by these guys and they're actually shipping both of those
drives now but it won't be for long. And the really nice thing is that they're all working
to be the best in the world. Like, you know, you got Samsung out there and probably selling
more than anybody else. But everybody else is innovating.
There's a drive out there that's optimized for writes,
that does better write performance than all the rest.
There's drives out there that you can flip a bit, and you
get high endurance or low endurance, and the capacity
changes, which is really a challenge.
Because one of my jobs is to work with my ODM partner to qualify drives,
and I'm trying to convince them, it's a Japanese company,
that it's just a bit change.
You know, you should test one of them and call it good for both.
And somehow I'm having a really hard time with that.
And there's several other innovations on the way from these different vendors, and they're pushing prices down really fast.
So it's a really interesting ecosystem that's developing here.
And it feels like it's going to take off like this year, next year kind of thing.
And certainly with the fabrics development, when we get some fabric solutions going, then that'll really start pushing it out of the high-performance area,
which is where I'm working into more of a mainstream kind of thing.
So I was thinking about this, and I've been in the storage industry for a long time
and made a lot of money on broken hard disk drives.
So I was thinking back, well, the first platform, using the IDC thing again, was the mainframe.
And some of you guys might remember those things.
But you reach back then, and then you had this really dumb mechanical device.
And then the mainframe told it what to do, like move the head to here and write something and then stop and move it back.
And it was really direct control by the server itself,
or the operating system, actually, to dumb hardware.
That worked really well.
And in fact, we've been kind of reinventing the mainframe
ever since that point in time.
I worked a lot in the second platform area
where we went client server.
You guys can't read this, but this is thousands of apps,
mainframe and terminals, and millions of users.
This is hundreds of millions of users, client server with networking,
and tens of thousands of apps.
So then we got SCSI, what a godsend.
We got smarter drives. And we got network attached storage, which was a tremendous innovation.
And in this time frame we also got the array controller.
So now you have these hard disk drives. They fail, you're shipping a lot of them,
you've got to have RAID, and then they started layering more and more software on top of there. And you've got this, actually a computer
with a bunch of high availability features,
sitting between the servers or the clients
and the hard disk drives.
That worked out great.
And it created these huge companies,
of which I've worked for three or four of them,
with huge lines of code.
In fact, the last big company I worked for, they had
more than 25 million lines of code in the operating system that they had to put together
to get a release out. And, you know, I'm a kind of a hardware firmware kind of guy. Thinking
about that, it just makes my head hurt. I don't know how they ever got a release out.
In fact, their stock price shows that they're not doing too well on it.
The third platform.
So now we have, in addition to millions of apps and billions of users, we've got millions of developers on open source software.
We've got OpenStack.
We've got people going crazy on all kinds of capabilities
built into Linux.
And so we've got all of these developers
doing really clever things.
We have storage-aware applications and operating
systems.
You have people in NoSQL that have just bypassed
the whole storage stack inside Linux,
and they go directly to the block storage
devices. You've got other companies that are so hungry for performance that they skip the
file system and go straight down into there. You have many applications that are built
for tiering and I'll talk about that a little bit more. So very storage smart applications.
And of course, we've got Flash now with NVMe interface that
really has been an incredible performance improvement
and a reliability improvement.
And those devices are really intelligent.
One of my vendors told me that they're putting a little bit more than 500,000 lines
of code into a flash controller chip.
And that's really interesting.
They have very powerful processors in there.
They've got to manage the flash a completely different way.
It's a little scary, because the reliability problem right there is the firmware, most of
the time.
And now we have people moving back to direct attached storage for these higher performance applications.
But that's moving into network storage.
And that's another thing that got my company going.
So I thought I might have to wake people up at this point.
So I put that headline on there. Of course, that's a major overstatement because all of the data in the world is sitting behind array controllers today.
There's just a little bit of DAS that's going on in these scale-out markets, and that's about it. So the array controller's going to be around forever.
But the value proposition for a lot of the new apps that are
being developed is disappearing for array controllers.
Let's see if I can make my case with you guys.
So if you look at a lot of the scale-out apps that are coming
out now, they're storage-aware, and they're real-time analytics
that they're pushing this on.
Now, that's a narrow part of the market.
We're not talking about keeping your bank account straight or anything.
But this has to do with big data analytics,
and a lot of these things, the applications are targeted at
applications that require real-time decision-making, like
ad tech, where you have to, I've talked about this in other programs, but there are several
companies down the peninsula. They've got to be able to identify you as a user, figure
out what ad you might click on, go bid for ads. Well, they run an analysis on which of the ads that are
available, a probabilistic sort of
algorithm, to find out which one you're liable to
click through.
They have to bid for the ads.
They get the ads back, and they put it up on your web
page, and they have to do all, and then they track what
happened, and all that has to happen in
about 200 milliseconds.
So that's an example of if you can make their storage go 20%
faster and save them tens of milliseconds, they can make
more money because they've got more time to run this
algorithm against those ads.
And then when you click through, they make dough.
So we've got that.
You've got fraud detection where people are trying to
understand whether your credit card is legit or not
before you close the deal on Amazon or something,
and of course, there's all the
security-related
issues that are out there.
And then you have apps like Splunk, where
the name of the game is ingesting huge amounts of data.
That's not really a real-time sort of thing,
although they do run analytics against it.
But the faster your storage is, then the more data
you can take in on a terabytes-per-hour kind of basis.
So these applications are all new.
They're designed for scale out from the beginning.
That's the group of apps that I'm concentrating on,
because not many people are writing banking
software anymore, you know.
They're looking at this new world of the Internet of Things that's going to keep driving these
huge very fast data requirements to be able to ingest this and then run analysis against
it.
In all of these apps, the application manages the data placement across the compute cluster
and that's because the server has now become the center of the universe.
Before it was the, you know, you had your database down here and you would pull that
data in and then you would run analysis on it, run your joins or whatever and it would
be paging data in and out of the, through the array controllers down to the hard disk.
They don't do that anymore.
They split the database up in the first place.
They put shards of it in all of these different servers
up here.
If they could, they'd run it all out of DRAM.
But of course, the working sets are getting larger.
So they're tending to move to SSDs.
And now they're moving off the box into external SSD boxes.
But that's a big change.
You know, the center is now the server, and they all have algorithms to manage tiers of
data and to be able to fail over servers, because that's the element that's most important.
You've got data that's associated with that server directly.
So if the server goes down, they have to have a strategy
to be able to fail over to another server, warm that data
back up, and get started again so that they can keep their
throughput going.
So that's a big change.
This application manages high availability.
And because of that, they've got data tiering and data
migration schemes going on.
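To make that application-managed placement concrete, here is a toy sketch, invented purely for illustration and not taken from the talk: a key is hashed to a shard, each shard has a primary server and a replica, and when the primary dies the application routes to the replica and warms the data back up.

```c
/* Toy sketch of application-managed shard placement and failover.
 * All names and sizes here are hypothetical illustrations. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define NUM_SHARDS  16
#define NUM_SERVERS 4

struct shard {
    int primary;    /* server index currently owning the shard */
    int replica;    /* server holding the standby copy */
};

static struct shard shards[NUM_SHARDS];
static bool server_up[NUM_SERVERS] = {true, true, true, true};

static uint32_t hash_key(const char *key) {       /* simple FNV-1a hash */
    uint32_t h = 2166136261u;
    while (*key) { h ^= (uint8_t)*key++; h *= 16777619u; }
    return h;
}

/* The application, not the storage array, decides where a key lives. */
static int server_for_key(const char *key) {
    struct shard *s = &shards[hash_key(key) % NUM_SHARDS];
    return server_up[s->primary] ? s->primary : s->replica;  /* fail over */
}

int main(void) {
    for (int i = 0; i < NUM_SHARDS; i++) {
        shards[i].primary = i % NUM_SERVERS;
        shards[i].replica = (i + 1) % NUM_SERVERS;
    }
    printf("key 'user:42' lives on server %d\n", server_for_key("user:42"));
    server_up[server_for_key("user:42")] = false;             /* primary dies */
    printf("after failover it's served from server %d\n", server_for_key("user:42"));
    return 0;
}
```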
You've got complex array-centric storage software.
If you put an array behind these things and you turn on any of the really wonderful capability that's been developed over the years, like dedup
and compression and snapshots and all of that sort of thing, what you're doing is you're
putting huge amounts of software stack in between the application and the raw performance
that you can get out of the NVMe drives these days. And every feature that you add to that, if it's not hardware accelerated,
just takes the performance right out of it. I just went to the VMware presentation, what,
two hours ago, and they were really happy that when they put their product, I think
it was vSAN, on top of the NVMe drives, they were only an order of magnitude slower.
That's a whole order of magnitude.
They're talking millisecond response time.
And the drives will do in the hundreds of microseconds,
low hundreds.
You can get that easily.
You can get better than that.
So you don't want to do this.
And these new applications just don't
want to see that kind of thing.
And then you've got multiple tiers of storage,
starting with DRAM and then going
to maybe SSDs that are captive inside the servers,
then going outside the server into more of a network storage
sort of thing.
So the value proposition for array controllers is going away,
because that's where the center of all this software that
slows everything down exists.
And so we're kind of in an era right now where Flash came
along, and a lot of companies started up, and they
reproduced what the big guys had done with hard disk drives,
only they had Flash behind it.
So it got a little bit faster.
But they're adding value by throwing more and more software
features and functions on top of it, like the tier two guys.
And some of them are being very successful doing that.
But I would contend that this new set of applications that
are being developed, it just doesn't work for them.
So the NVMe storage devices, and I'm no expert on this, but
because of the difficulties in managing flash, they've put
all kinds of reliability firmware into those drives.
And they've been working for years now trying to make that
work very fast.
So they're getting better at garbage collection so that we
get more reliable or more consistent latencies.
They're getting better at all of this, and the NVMe
standard's evolving along with them.
But because of all those HA features that are built down
into those drives, especially with the NVMe side of things
and the maturity of the market, there's about an order of magnitude difference in reliability
between the solid state drives and the hard disk drives.
I was just talking to Dr. Jivy from NetApp, who was in this
last presentation.
And he gave me that bit of data based on tens of
thousands of drives that NetApp has information on.
So there's another part of the array controller value
proposition that is just kind of going away these days.
And besides the fact that you don't have time to do like
RAID 6, and that's something that really kills performance,
right?
So you get to a mirroring sort of situation at best, and
away you go.
So those are the things that were driving our design.
So let's build an old style second platform storage array.
You've got a bunch of servers.
You've got HBAs in the systems that would be Fibre Channel
or so.
Well, usually Fibre Channel or NICs
to communicate with the storage array controllers. The array controllers
are actually custom-made CPU complexes with a lot of DRAM that's actually
mirrored on both sides to make that reliable and with NICs on one end or
Fibre Channel on one end, and then an HBA out the back that talks to, oh, before we get there, you've got a switching
fabric in front to be able to get into these things.
And these days, you can start with an array controller
that's just mirrored like that, but there's clustered
controllers now where you can scale out to a certain extent.
I don't think anybody scales out, well, that's not true.
It's very difficult to scale this architecture out to say even four nodes.
Two nodes is kind of doable.
But you go beyond four nodes and this architecture becomes difficult.
And in the back end, you got a whole bunch of JBODs with hard drives or SSDs behind it
connected by another network.
Now this network also might have switches on it.
And then of course if you're clustering
controller sets then you have to have a very high speed communications path
that goes between these guys. Sometimes InfiniBand, other people are using
other methods to do that, but that's the way it's done in the second platform storage.
EMC, NetApp, HP, all those guys have made huge amounts
of money, and so did I building these things.
But what we decided to do is very different.
So we also have servers.
We have our own driver because we have hardware accelerated
HBA that's ours also that sits there in the servers.
But our storage device looks very, very different.
I'll build it out, and then I'll talk about it. So we've got switches inside of the, this is basically a
direct attached JBOD sort of thing. There's no CPU complex
in here. The first thing that the data path sees going into
the box is a large port count 40 gigabit ethernet switch,
which allows
us to directly connect a lot of HBAs to the box with very
high bandwidth, low latency, standard layer 2 ethernet
connections.
And you can scale these things out, because they have 32
ports on the back.
You can scale them out for quite a ways by, and this is
simplified, of course, but you've got the connections go
to the servers.
You can have as many HBAs in
the servers as you want, and those are dual-port HBAs.
You connect all those up to the boxes.
If you need more boxes, you can interconnect them.
It's all just one big ethernet network.
Oh, and there's a connection across here, eight lanes going
across to route traffic back and
forth across that side.
And then down at the bottom are the NVMe SSDs.
So this is, you're going to see this sort of architecture, I think some of the NVMe
over Fabrics companies will probably work on something similar to this. But I'll talk
about the difference between the protocols, because we have a proprietary protocol we
started two and a half, maybe three years ago, when it took a couple of people to
invent this thing. So that was before Fabrics. It was a gleam in the eye of the Fabrics
community. So that's what we're building. So we don't have external switches
until you get very large.
We don't have many layers.
We don't have CPU complexes inside here.
It's a very simple design, except for the storage
controllers here and the HBAs.
Those things both have FPGAs that provide hardware acceleration for our protocol.
And I'll talk more about that.
It's designed to scale out.
So it's designed for the software-defined, application-defined storage that I was talking about.
The smarts, we've got some smarts in the driver. We support mirroring and quality of service
and drive virtualization, striping and making big ones
out of several drives, and permissions.
But we're not going to support any kind of software
that gets in the way of the data path.
If somebody wants to do that, they can do it,
but it'll sit above our driver intelligence.
And they'll be responsible for slowing everything down.
Yep?
To what?
It's a tunneling protocol, right?
I'll talk about that in a second.
Very simple tunneling.
Anyway, this thing will scale out to hundreds of servers,
multiple petabytes. It uses any standard NVMe drive. We've got three different drives qualified
to go in there right now in four different flavors, I think. So now I'm going to go back
and this is like the frequently asked questions section, and
I'll get to the protocol here in a little bit. So, why not use the server-captive storage?
And of course you guys are all experienced, you know why, right?
So this is good as long as you can live with the limitations. You
can plug the storage directly in there, and that'll be as fast as you can go, with some exceptions.
Our first demonstrations that we did, we put NVMe cards, actually, in a slot in the server,
and then we put SSDs on the end of our network out there, and we compared the performance between the two of them.
And you can't see the difference in performance because we're running,
I think right now we're not optimized completely yet, but we got about a three microsecond
round trip latency on our network because it's all hardware accelerated layer two, right?
But the weird thing is that the numbers would actually come out in our favor and I could
never figure out what that was about. And then I think I talked to an Intel guy at the DSI conference or one before that and he said, yeah, we've seen
that too. He was doing an NVMe over Fabrics kind of demo, and I said, we've seen that too.
There's some sort of interrupt issue where the PCIe cards that plug directly in, and we
were using an Intel card
at the time, don't get the priorities right quite as well as the card that we had in there.
So it was actually faster. So we finally got it explained. Anyway, the problem with the
captive storage is, of course, it's tied to the server. And so depending on how many slots
you have, your capacity is limited. You can't share that stuff across servers,
except going over the top of rack network
that bogs everything down.
There's no dynamic scaling.
When you run out of slots, you're done.
SSD virtualization is not happening.
These just show up as slash devs of a particular size,
and that's what you use in the application.
It's a management challenge.
If you need to add 800 gigabytes in about another
year, you're not going to be able to buy those things.
They're too small.
You still have to plug a 1.6 terabyte drive in or
something to get that.
You can't really carve it up and use it in
different solutions.
It's inefficient for cooling and rack space.
And as things get bigger and the working sets get larger,
there are quite a few companies that are having to
buy servers to put SSDs in to increase the size of their overall data set.
And they've got tons of extra CPU that they didn't really
need to buy, just because they need more slots in the servers.
So the answer is, of course, to get out of that box and make this thing work.
Oh, my antivirus stuff kicked in.
How convenient.
And get the storage out of the server again.
So some of this history is repeating again, right?
Let's do a SAN kind of thing
and take care of this problem a second time around. So that's what we did.
So the protocol that we're using is simple and fast and effective. We've used Layer 2
Ethernet and it's a storage network.
You can't really think about it as a general-purpose Ethernet network,
although those switch chips are standard switch chips.
You could run any kind of Ethernet traffic over there that you wanted to.
We don't support that because, again, we're a performance play.
We've hardened the Layer 2 Ethernet with some additional error recovery things that are going on in the FPGA.
So if you've got dropped packets, we'll retry things automatically at hardware speeds.
It's a fully integrated NVMe fabric with no external switching until you get really large.
If you're scaling out, like multiple racks with 20 servers apiece and a couple of
these boxes, then to get enough interconnection, you're going to have
to add switches. Any standard 40 gig Ethernet switch will work. And it's, today anyway,
it's the industry's lowest latency transport protocol that's shipping. In fact, I'm not
sure anybody else is shipping over Ethernet this way.
So what it is, is it's a tunneling protocol.
We are transporting NVMe commands.
We are not transporting data with RDMA stuff, like the
Fabrics group.
It's a different way to do it.
And it's dead simple.
The other question that always comes up is, what is the
difference?
Well, that's the basic one.
And here's kind of an illustration of how NVMe over
Fabrics might look, the stack of this whole thing with RDMA
and, in this case, some verbs sitting on top.
If you actually implemented this, now, some of you guys
are way more experts on fabrics than I am.
But what we came out with, if you did an InfiniBand version
of this thing, you would end up with 42 bytes of overhead,
with a total of 212, versus our 22 bytes that we have in here.
Because all we're really adding to the headers that are already
in place for Ethernet is four bytes here.
And that's all we need to get out to a pretty large network
and get that stuff routed.
We only work on Ethernet, so we're
using all the tricks that we can; for example,
we key a lot of data placement off the MAC address.
So each SSD has a MAC address, each HBA port has a MAC address.
So by tying the transaction to the MAC address, we don't need the extra bytes of overhead
that Fabrics is using so that they can be transport-agnostic.
So it's really fast.
Very little overhead.
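To make the byte counting concrete, here's a hypothetical sketch of how a tunneled command frame could be laid out, assuming the 22-byte figure breaks down as the standard 14-byte Ethernet header plus the 4-byte Ethernet FCS plus the 4 added bytes mentioned above. The talk doesn't disclose the tag's contents or the EtherType, so everything named apeiron_tag or opaque here is invented for illustration.

```c
#include <stdint.h>
#include <stdio.h>

#pragma pack(push, 1)
struct eth_hdr {                /* standard Ethernet header: 14 bytes */
    uint8_t  dst_mac[6];        /* e.g. the target SSD's MAC address */
    uint8_t  src_mac[6];        /* the HBA port's MAC address */
    uint16_t ethertype;         /* assumed vendor-specific EtherType */
};

struct apeiron_tag {            /* the 4 extra bytes mentioned in the talk */
    uint32_t opaque;            /* real field layout was not disclosed */
};

struct nvme_sqe {               /* a standard 64-byte NVMe submission entry */
    uint8_t bytes[64];
};

struct tunneled_cmd_frame {     /* how a command capsule might sit on the wire */
    struct eth_hdr     eth;
    struct apeiron_tag tag;
    struct nvme_sqe    cmd;
    /* 4-byte Ethernet FCS follows on the wire */
};
#pragma pack(pop)

int main(void) {
    /* 14 (Ethernet header) + 4 (FCS) + 4 (tag) = 22 bytes of transport overhead */
    printf("header + tag = %zu bytes, plus 4-byte FCS = %zu bytes overhead\n",
           sizeof(struct eth_hdr) + sizeof(struct apeiron_tag),
           sizeof(struct eth_hdr) + sizeof(struct apeiron_tag) + 4);
    return 0;
}
```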
More comparisons.
I think I probably mentioned all of this.
Yeah, you know, any time you're trying to make a
standard, things are going to go really slow and you're
going to get lots of pages of standards.
And anytime you go transport independent, then you have to have shim layers and well-defined
interfaces so that you can actually slide different transports underneath.
And it's a really good way to go, but it's very complex and it slows things down a little bit.
Now at today's NAND speeds, the difference between 10 microseconds and 3 microseconds probably
doesn't matter at all.
But where this will come into play is the next generation of storage class memory, where
Intel, I think it's all public out there, is shooting for something like 10 microseconds or less from the
3D XPoint.
So at that point, if you've got a 10 microsecond transport
delay and a 10 microsecond latency from your device, then
you just doubled the latency.
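As a quick back-of-the-envelope check of that argument, a sketch using round numbers from the talk rather than measured data: with NAND at around 100 microseconds, a few microseconds of transport is noise, but with a roughly 10 microsecond storage-class memory device, a 10 microsecond transport doubles the end-to-end latency while a roughly 3 microsecond one adds about 30 percent.

```c
#include <stdio.h>

int main(void) {
    /* Round numbers from the talk: NAND in the low hundreds of microseconds
       vs. ~10 us storage-class memory, and a ~10 us fabric-style transport
       vs. a ~3 us hardware-tunneled transport. */
    double media_us[]     = {100.0, 10.0};
    double transport_us[] = {10.0, 3.0};

    for (int m = 0; m < 2; m++) {
        for (int t = 0; t < 2; t++) {
            double total = media_us[m] + transport_us[t];
            printf("media %6.1f us + transport %4.1f us = %6.1f us (transport adds %3.0f%%)\n",
                   media_us[m], transport_us[t], total,
                   100.0 * transport_us[t] / media_us[m]);
        }
    }
    return 0;
}
```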
And I think we can actually probably cut in half what we
have, get down to about a microsecond and a half
round trip, after we play all the tricks. And that's really a part of the market that we're really going
for. I mean, it works great with NAND and we're going to sell the heck out of that,
but then the next generation of memory, it's really going to work, whereas the Fabrics
guys may have a hard time. Now, I know there's chip development underway, a lot of great
innovation going over there, and most customers will never need that kind of performance or latency,
but there's a lot of them out there that will. And there'll be a growing number of customers
that'll be very interested in that. So before I go on, does anybody from the Fabrics side want
to throw darts at me or anything about the comparison? Because, again, I listen to the conversations.
I've never participated in the fabrics development.
But I've been watching what's going on over there.
And it's really good work.
It's a great standard.
And it's going to make a lot of money for a lot of companies.
But we're going to ship for a couple of years before that
really starts rolling in '18.
So this is what the hardware looks like.
It's a 2U24 box with a whole mess of fans.
We can handle a full 25 watts on all of those drives at the
same time at, I think, 35 degrees C, which is our spec for the ambient temperature.
As I mentioned before, it's got 32 ports, 16 on each IOM, two IOMs, 32 40-gig ports with QSFP+ coming out the back.
It will handle optical or copper connections.
This is what the IOM module looks like.
Here are the 40 gig ports.
And there's the massive switch chip that's actually a 36 port 40 gig switch chip.
Here's our storage controllers here.
Each one of these FPGAs has four slices in it.
So a slice of FPGA handles an individual drive.
So that gives you some interesting capabilities also,
because now you've got the PCIe x4 interface to the SSD out
there, isolated from the network by this storage
controller. If anything goes wrong with the SSD or the PCIe bus out there, it'll never
propagate into the rest of the system. You also have an intelligent piece of FPGA with
a little microprocessor sitting next to it that can do error recovery, and it's responsible
for initializing the drive
and getting it all ready to go.
And then it advertises it onto the system.
That also plays into our discovery mechanism,
because we have that intelligence there
when that drive comes up.
We've got, kind of, we use all NVMe-style commands.
And the native stuff just goes to the drive.
But we have Apeiron-specific NVMe commands that we use for administrative purposes.
And so that FPGA comes up, that little processor comes up, and we can
immediately talk to it
and find out what's going on with the drive. We can send it commands, we can get
all the smart data.
Well, that comes from the drive, but we can get... There's a certain amount of data that
that chip collects so that before the thing even comes online we can pull model numbers and serial numbers and the rest of
that stuff so that users can figure out what's going on. And error recovery I mentioned.
And if you hold your tongue in your mouth just right, you know, and stand on one leg, and you've got the right SSDs, we can get 18.4 million IOPS out of 2U.
And I don't think anybody's come close to that yet.
And that's just because we're exposing
everything the SSD can do to the server.
It's just like if you plugged that many SSDs into the server, you'd get the same number, and
that was our objective.
Because now the, we've got standard SSDs that are going to keep innovating down there at
the SSD, NVMe SSD level and we can use anything that's in that form factor with that box.
And then we can go to 100 gig Ethernet or beyond if that becomes a bottleneck.
And we can keep pushing the bottleneck back either to the
SSD or to the compute complex.
We limited out at about 2.5 million IOPS on a two socket
Intel chassis.
That's about all we could get out of the servers that we're
using.
So you have to have multiple servers to be able to handle
the performance coming out of one of these boxes.
This is some marketing blah about performance. One interesting thing is that we mirror the writes, and the
mirroring is handled in the driver, and the consistency checks.
And we handle hot spares and building mirrors. And we can go to three-way mirroring, actually, at this point.
But the interesting thing is that, because
we mirror to two drives, when you do reads
we can pull data off both those drives, and you get almost double the read performance coming back.
So you define a virtual volume that's mirrored, and
the writes go down at about the same speed, because it's just two writes right after each other. So you don't really see
a difference there. But then, when you read, you can just about double the performance
if you do some tricks inside the driver.
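Here is a minimal sketch of that read-splitting idea: every write is issued to both mirror members, while reads alternate between them, so a two-way mirror can serve roughly twice the read throughput. This is purely an illustration of the concept, not Apeiron's driver code, and all the names are invented.

```c
#include <stdio.h>
#include <stdint.h>

#define MIRROR_WAYS 2                 /* two-way today; three-way is possible */

struct mirrored_volume {
    int      member[MIRROR_WAYS];     /* identifiers of the two SSD copies */
    unsigned next_read;               /* round-robin cursor for reads */
};

/* Every write goes to all members, issued back to back. */
static void submit_write(struct mirrored_volume *v, uint64_t lba) {
    for (int i = 0; i < MIRROR_WAYS; i++)
        printf("write LBA %llu -> member %d\n", (unsigned long long)lba, v->member[i]);
}

/* Reads alternate across members, so both copies serve traffic. */
static int pick_read_member(struct mirrored_volume *v) {
    return v->member[v->next_read++ % MIRROR_WAYS];
}

int main(void) {
    struct mirrored_volume vol = { .member = {0, 1}, .next_read = 0 };
    submit_write(&vol, 1000);
    for (int i = 0; i < 4; i++)
        printf("read %d served by member %d\n", i, pick_read_member(&vol));
    return 0;
}
```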
Yes? [Inaudible audience question.]
No, no, no, nothing like erasure coding or anything. Not at this time. It's just all
mirroring. But we can go up to three ways. Anything else would take a lot more processing
power and more development time than we have right now. I mean, you can put something,
a shim above the driver and do that. You know, you've just got a whole bunch of LUNs coming
out and you can define those to be any size, so you can split up the drives if you want to run some erasure coding above
that, you could certainly do that.
Again, it's, you know, application aware or software defined storage applications.
There's a group that we're working with that started out using our box as a caching layer
for software-defined storage, because they're running
a bunch of virtual machines above that. And SSD prices have come down so fast
that they're considering having a version where they just use all SSDs and they just don't tier.
Before, they would go to us for a caching layer, and then come up and go through the
top-of-rack network out to, like, a hard disk farm.
But I think they're going to ship both ways. There's our advertising for our 18 million IOPS
out of the box. And again, for these scale out apps, the way that they shard the data and
you associate cores with a particular data set that
is essentially directly attached to those cores,
you can scale this thing forever.
Because there's no cross traffic.
If you had a big shared storage sort of array,
then you'd have to worry about a lot of traffic
going through a particular switch and lost packets
and that sort of thing.
But for this class of applications
that are already dealing with that, they started out putting data in memory, in DRAM,
so that's not an issue. This is just an extension of that. So that's where we're
starting. And the investors like this one because, compared to a lot of companies with custom implementations of NVMe cards,
we can ride the innovation curve for the standard 2.5-inch SSDs, and we've already been doing that.
We started out with 800-gig drives, then 1.6, now 3.2, now we've got 6.4s under test at the moment.
We're supposed to have 8s by the end of the year
and maybe 16s by the end of next year.
And then there's versions of that that are way cheaper
and versions that are fast.
And so a customer requests a certain quality of service,
essentially, and we can deliver that to them.
Oh, and 3D XPoint, I think, I know it's going to come out in a two-and-a-half-inch
form factor, and I think it might come out there pretty fast.
It might be one of the first things they ship.
And then, of course, you had Samsung today talking about, was it their Z-series SSDs, that
get down into the 20 microseconds sort of
latency range, the 3D stuff.
And that'll be out next year also.
NVMe interface, the right form factor, we plug it in and
qualify it, and we can use it.
And you can have multiple types of drives
in one enclosure.
It's just a drive is associated with a MAC address.
That all gets associated with the serial number on the drive
so that you can move them around.
But that drive can be any place in the storage network.
So you set up a mirror.
If you've got one box, you're on both sides of the box.
If you've got two boxes, you set the mirror up on two boxes.
If you add another one, you can add a third one.
It's just one big network.
There's one other NVMe system out there, which
is called System A. And so that's another question that
we get, how does it compare to that?
System A is in a 5U box, has bandwidth of 100 gigabytes per
second.
In this case, because they had a 5U enclosure, we went to two
boxes here, so we'd look much better.
So now we've got 4U with 48 drives at 144 gigabytes a second, 37 million IOPS. These guys top out at 10.
They have proprietary SSDs and a latency of 100 microseconds average, which is about the
same as what we have. And then you can read the rest there. These guys are doing parallel PCIe connect, so they can't scale out beyond that box, I
don't believe.
Whereas we're using the Ethernet.
And we're ready to go as soon as Intel gives us a Crosspoint SSD, we'll plug it in and
qualify it.
These guys will have to do a custom card.
And we don't cost so much either.
And that's what I brought.
My good friend, Ahmed Shihab, worked with that guy at
Xyratex and then at NetApp.
He had a great quote, all the simplicity and promise of direct attached storage with the
capabilities of network attached storage.
And that's, in a nutshell, that's what it is.
Male Speaker 2: You pick two drives and it does the mirror?
Thank you.
It mirrors on an SSD basis.
So today we can't take a slice of a drive
and mirror that someplace.
We have to take the whole drive.
How many SSD drives do you think
would fit into the 18-million number?
24.
One box full.
That's the 18-million number.
Like I say, it's only one SSD that's that fast.
Let's see, there's a new one that's even a little bit faster, I think.
But we haven't got a hold of that many drives on it yet.
But, yeah, somebody can do...
If you can squeeze a million
IOPS from a drive through a PCIe Gen 3 x4 link, we'd have 24 million.
You know, it's just a matter of where the drive manufacturers go.
You know, 800,000 IOPS, maybe a little more, maybe 900, is about where I think you're running out of gas on the PCIe link.
The 40 gig links will handle a couple of SSDs before they max out.
And then you've got an x8 PCIe link on the HBA.
So it's pretty well balanced, right?
You've got an x8 PCIe going to two 40-gig ports,
and then that runs down the network to as many drives as you want to hook up to.
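A quick sanity check on those numbers, a sketch with round figures rather than a benchmark: 18.4 million IOPS across 24 drives implies roughly 766K IOPS per drive, 24 drives at a full million each would indeed be 24 million, and a PCIe Gen3 x8 HBA link (about 8 GB/s) feeding two 40 GbE ports (10 GB/s raw) is reasonably balanced, as the speaker says.

```c
#include <stdio.h>

int main(void) {
    /* Per-drive rate implied by the 18.4M IOPS, 24-drive figure */
    const int    drives = 24;
    const double implied_per_drive = 18.4e6 / drives;             /* ~766K IOPS */

    /* PCIe Gen3: 8 GT/s per lane, 128b/130b encoding */
    const double pcie_gen3_lane_gbps = 8.0 * 128.0 / 130.0 / 8.0; /* ~0.985 GB/s */
    const double hba_x8_gbps   = 8 * pcie_gen3_lane_gbps;         /* ~7.9 GB/s */
    const double two_40gbe_gbps = 2 * 40.0 / 8.0;                 /* 10 GB/s raw */

    printf("implied per-drive rate: %.0fK IOPS\n", implied_per_drive / 1e3);
    printf("24 drives at 1M IOPS each would be %d M IOPS\n", drives);
    printf("HBA balance: PCIe Gen3 x8 ~%.1f GB/s vs. 2 x 40GbE ~%.1f GB/s\n",
           hba_x8_gbps, two_40gbe_gbps);
    return 0;
}
```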
Yes? Yeah, there's the extra four bytes, and what kind of added... [inaudible]
It's four per packet, per Ethernet packet.
It's been identified really?
Yeah.
So is there some protocol that's
been put in for four bytes?
That's what we carry back and forth.
There's admin commands that run separately,
but that's basically like a data packet overhead.
It's just the four.
And no, I can't go into any more detail on that.
[Inaudible audience question.]
Yeah, we handle all that stuff in the FPGAs.
So yeah, the driver directs the commands.
The FPGA creates the packets and handles all the
interchange between the two ends.
It gets to the other end, and we split out the metadata and
drop the NVM commands into a queue that
drops into the drive.
When it comes back up, we put it back together and send it
back to wherever it came from.
It's very simple.
Although it's taken two years to make it work. Yes?
You have to use our HBA that has the FPGA on it. It's just a board with one single
big FPGA. That has the PCIe interface and the 40 gig and our magic
stuff in the middle.
[Inaudible audience question.]
Yeah, so it's just an x8, half-height PCIe card running
17 watts, I think.
So it'll fit in any server.
Should work in a...
Pardon me?
[Inaudible audience question.]
We advertise virtual volumes, actually, right now, or just slash devs that
relate to the drives if you want to do something simple. But, and we
highly recommend that you do it this way, the driver will virtualize that drive,
which means you've got a handle that then tracks the serial number of the drive.
So if it moves around, that handle will move with it,
and you'll always know what's going on as far as if somebody pulls the drive
and slides it into another box or another location.
It'll come back up and reconfigure like it was before.
But you also get standard slash dev, NVMe slash devs
coming out.
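Here's a toy sketch of that "handle follows the drive" behavior, purely illustrative and not Apeiron's code: the stable key is the SSD's serial number, and when a drive reappears behind a different slot or MAC address, the same handle is re-bound to its new location. The serial number and MAC values below are made up.

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define MAX_DRIVES 8

struct drive_binding {
    char    serial[21];    /* NVMe serial number (up to 20 chars) is the stable key */
    uint8_t mac[6];        /* current location on the storage network */
    int     handle;        /* stable handle exposed to the host */
};

static struct drive_binding table[MAX_DRIVES];
static int next_handle = 1;

/* Called when a drive is discovered, or rediscovered at a new location. */
static int bind_drive(const char *serial, const uint8_t mac[6]) {
    for (int i = 0; i < MAX_DRIVES; i++)
        if (table[i].handle && strcmp(table[i].serial, serial) == 0) {
            memcpy(table[i].mac, mac, 6);          /* drive moved: keep its handle */
            return table[i].handle;
        }
    for (int i = 0; i < MAX_DRIVES; i++)
        if (!table[i].handle) {                    /* first time seen: new handle */
            strncpy(table[i].serial, serial, sizeof table[i].serial - 1);
            memcpy(table[i].mac, mac, 6);
            table[i].handle = next_handle++;
            return table[i].handle;
        }
    return -1;                                     /* table full */
}

int main(void) {
    uint8_t slot_a[6] = {0x02, 0, 0, 0, 0, 0x0a};
    uint8_t slot_b[6] = {0x02, 0, 0, 0, 0, 0x0b};
    int h1 = bind_drive("S3EVNX0J500123", slot_a);  /* drive first appears in slot A */
    int h2 = bind_drive("S3EVNX0J500123", slot_b);  /* same drive moved to slot B */
    printf("handle before move: %d, after move: %d\n", h1, h2);  /* identical */
    return 0;
}
```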
So it's basically an HBA with your own proprietary
driver to give you the features?
Yeah.
Yeah, and we started with the standard NVMe driver.
And there's less and less of that in ours and more and more
of our own stuff now.
But we started with that.
And it's just a standard block interface.
There's nothing special that happens above that.
We've got a storage manager that handles the configuration
and setup of the mirrors and the virtual volumes and all of
that, interfaces with the customer.
We've got a management processor in the box that
handles all of the usual enclosure services
and the box is all HA and no single point of failure.
And so all the standard enterprise kind of stuff is there.
Yes?
So at the DSI, you said that you might be willing to share the protocol
and, you know, just how that would work?
Well, so we would be willing to work with the fabrics group to have our protocol as an alternative transport.
We'd love to do that.
We'd love to have a standard to hang our hat on.
And we're still willing to do that, but somehow the committee hasn't got back to me on that.
I've talked to Amber every couple of months or something, but...
[Inaudible audience comment.]
But, you know, this is also what happened, you know, ATA over Ethernet could have been
a standard.
That was, yeah, ATA over Ethernet is the same idea, right?
Yeah, but it also.
So I'd love to work with the standards committee on a,
you know, kind of like fiber channels, an alternate
transport for the fabrics thing.
And then we could still put our secret stuff above it.
We've got the hardware acceleration going.
I mean, it would change.
It would, you know, put it into standards, it would change.
We'd have to change everything.
But that's OK.
Yeah?
If it's a valid use case, but if you need to connect more
than 16 servers to your devices, you need to insert an
Ethernet switch?
You can add a switch, yeah.
It's got 32 ports out the back, though.
Right.
If you have dual connections.
So if you have more than 32, then yeah.
Or if you've got dual connections.
Dual connections from the server 16, so.
Yeah, it switches.
And I'm just not sure if it has
a proprietary switch.
No, it's just standard layer 2 ethernet.
Just please don't run standard TCP/IP over the top of it,
or you'll hit performance and collision issues and stuff.
So I think it's about time to go have Intel buy us all a beer.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further
with your peers in the developer community.
For additional information about the Storage Developer Conference,
visit storagedeveloper.org.