Grey Beards on Systems - 43: GreyBeards talk Tier 0 again with Yaniv Romem CTO/Founder & Josh Goldenhar VP Products of Excelero

Episode Date: April 19, 2017

In this episode, we talk with another next gen, Tier 0 storage provider. This time our guests are Yaniv Romem, CTO/Founder, & Josh Goldenhar (@eeschwa), VP Products, from Excelero, another new storage startup out of Israel. Both Howard and I talked with Excelero at SFD12 (videos here) earlier last month in San Jose. I was very impressed …

Transcript
Starting point is 00:00:00 Hey everybody, Ray Lucchesi here with Howard Marks here. Welcome to the next episode of Greybeards on Storage monthly podcast, a show where we get Greybeard storage and system bloggers to talk with storage and system vendors to discuss upcoming products, technologies, and trends affecting the data center today. This is our 43rd episode of Greybeards on Storage, which was recorded April 12th, 2017. We have with us here today Yaniv Romem, CTO, and Josh Goldenhar, VP of Products of Excelero. So, Yaniv and Josh, why don't you tell us a little bit about yourself and your company? Okay, I'll go first, I guess. My name is Yaniv Romem. I am CTO, as mentioned,
Starting point is 00:00:51 for Excelero. I'm also one of the co-founders. We founded the company back in 2014, and it's been a really exciting rollercoaster ride so far. I'm Josh Goldenhar, the VP of products for Excelero. And as Yaniv said, it has been a rollercoaster ride ever since late 2014. And that's where we set out to take an idea that Yaniv and some of the co-founders had technically, which was to use these at the time, brand new, barely released NVMe products, but use them over a network. And use them not just over a network, like, wow, I can access them, but use them at extremely low latencies as if they were local storage. Over a network? Yeah.
Starting point is 00:01:35 At the time, since NVMe over Fabric didn't exist, this was somewhat unheard of, especially since I believe there was only one NVMe drive actually released by anybody at that point, which was Intel. That and the general commentary from the peanut gallery that the network's too slow, you can't possibly do that. Yeah. Which we're still hearing from some of our friends in the hyper-converged world. We are, but the background of Yaniv and our other founders was very heavy in RDMA already.
Starting point is 00:02:07 So this was second nature to them. I think from the very beginning, there was no question. In fact, I had to, in the very beginning, and Yaniv, I think, can attest to this, say, you know, you guys, you can't just think InfiniBand. There's got to be an Ethernet offering. But they were thinking very low latency and have familiarity with the networks that could carry that kind of traffic from the beginning. How fast is the network that you're using these days for the product? So the network we're using today goes all the way up to 100 gig. But we actually know how to hook up a machine to multiple ports of 100 gig.
Starting point is 00:02:39 So you can get bandwidths of around 400 gig from a standard commodity, almost commodity server. Boxes such as those sold by various vendors today that are 2U boxes with 24 drives inside, you can hook them up to 400 gig of networking nowadays, which gives you plenty of bandwidth. 400 gig of networking plus a few PCIe NVMe disks. Well, you guys are limited by PCIe lanes, aren't you? Yes, absolutely. We, on a regular basis, bump up against that limit. We are constantly finding that the bandwidth limit inside the box or inside the PCIe complexes is what overall is limiting IO, even from an IOPS standpoint. When you're pushing millions of IOPS, even 4K IOPS on a box,
Starting point is 00:03:26 you can actually bump up against the bandwidth limits of individual NICs very easily. And if you put enough of these drives together in aggregate, you can hit the bandwidth limits of even what people view as very, very wide bandwidth cards. Yep. And that also sort of can be seen as a precursor to why working in a converged environment is really critical when you want to get the utmost performance out of your system. By using lots of smaller machines with a few drives inside and NICs and also running compute on those machines, you can actually get an overall much higher or much larger amount of performance, thanks again to the fact that you're no longer limited by the PCI lanes
Starting point is 00:04:06 and those kinds of configurations. When you say convergent, are you talking about running virtual hypervisors on the storage itself? So you could do that, but I'm being very careful in saying convergent, not hyperconverged, because our solution does work with hypervised systems, but it can really work also with bare metal systems. And the real focus from our perspective is that by putting the storage devices into machines that are also running compute, by allowing you to do that,
Starting point is 00:04:34 you can still disaggregate if you want to, or you can do a hybrid model where some of the machines are dedicated storage boxes and you also have storage devices within your compute nodes. But by bringing storage devices into the compute nodes and using the PCI lanes there for networking and for the storage devices themselves, you can get much better overall performance. You can do that with better mental environments, and you can use it also in environments where you're running hypervisors. Okay, guys, so let's talk about the product for a minute.
Starting point is 00:05:03 You guys are software-delivered storage, right? Your product is software? Yeah, it is. We made a decision early on that we wanted to be hardware agnostic as much as possible, that we are a software offering. And this is really because we see the trends with the kind of customers we talk to, and that is the largest customers on earth. Google, Facebook, Baidu, Yanex, Amazon.
Starting point is 00:05:37 And what they all share is going to completely standards-based, and some people say commodity-based servers. That is, they've hyper-optimized the hardware to wring all the cost out of it and make it as efficient as possible from a hardware standpoint. And this, of course, has also spurred projects like the Open Compute project, like Open19, which was featured at our launch, which is how do you get this hardware to be even less expensive? So enterprise, small business, the larger end of small business that is, they like to jump on these trends as well. They want to go towards standard hardware. Now, they may buy their hardware from Dell, HP, Lenovo, standard servers, but they still want to go ahead and standardize on servers. In short, what we're seeing is backlash against these proprietary appliances. Appliances and storage, the trend has become to offer a box,
Starting point is 00:06:26 paint it a different color with your own bezel, and say that this is something really special. But underneath the covers, we know that many, if not almost all, competitive storage products out there are based on standard components, if not standard servers. And so people really resent paying the uplift two to three or four times as much money for this standard hardware on top of software costs and then maintenance costs. They don't like the thought that they're going to pay two to three times the street price for a disk drive just because they're getting it from a vendor for the quote-unquote proprietary appliance. So with this movement, we said we have to work on standard hardware.
Starting point is 00:07:07 And so we made that decision very early on to be standardized software, which basically today runs on any x86-based platform. We believe we can work on any 64-bit Linux-based platform. From there, yeah, we offer just software that gives you this kind of incredible unheard of performance. I don't think I can find an NVMe SSD for my Raspberry Pi. So I don't think you really have to worry about the ARM market. Yet, yet, yet.
Starting point is 00:07:37 Yet, yeah, but there are folks coming up. ARM, as you mentioned earlier about restraints on the PCIe bus, that tends to be the problem we see with ARM processors today. Some of them are catching up in processing power and could be competitive to leaders out there, but they tend to have only 8 or 16 PCIe lanes. So that is a problem. The other thing that's kind of obscene here is the fact that you guys are running these gazillion IOPS per second on commodity hardware, effectively low-end commodity hardware, other than the networking and perhaps the NVMe SSDs. That's the other thing that's kind of odd here. I'm not sure there are high-end and low-end once you start talking about Xeon servers.
Starting point is 00:08:21 High-end and low-end is kind of more which Xeon you picked than which vendor you bought it from. You're bringing up a funny point. In all our testing and our demos, we tend to use Xeon processors because we feel that's what the customers have racked in their data centers because they're thinking about the compute side or that's what you buy for a data center is dual socket systems. But we do work very well with single socket systems. In fact, again, as pointed out earlier, we need the PCIe lanes. Our product is unique in that on the target side, that is what holds the NVMe drives, we don't use any CPU.
Starting point is 00:08:57 So one of our largest demos we had done in the past internally was where we got 11.5 million random read 4K IOPS on a cluster of 10 systems. And these were Intel Core i7 processors. So they were desktop processors. It was a converge test. So they had to be a little bit fast so they could actually run the synthetic benchmark. But on the service side, we weren't using any CPU at all. So I think optimally, we would love to have a mobile processor from Intel or an Atom. If we could get an Atom processor, but one that had 48 or 64 PCIe lanes, that would be the best balance for us as far as a target system. Yeah, unfortunately, I think you're the only customer for that product.
Starting point is 00:09:42 Today, today, today, yeah. Yep. If you're looking for a million IOPS, I'm of the opinion that you can probably afford at least a one socket Xeon. So tell me a little bit about what's going on here. Between the storage system and the, I'll call it the host, you're using RDMA across, you know, gigabit Ethernet or 40 gig or 100 gig Ethernet, but you've got software that's running mainly on the host, not much running on the storage system. Is that a good read of this, Yaniv? It is, absolutely.
Starting point is 00:10:19 So one of the things that is very different in our architecture, and it has been that from day one, is really to try and see how you can make this software-defined storage system relevant for very large data centers. That's really sort of been, from the outset, what we've been trying to achieve. And so if you want to be able to do that in a converged environment, one of the things that you don't want to do is you don't want to affect resource planning. You don't want to make something that takes a lot of CPU cycles on the target side, and then you have a conflict between the application running on a node and the fact
Starting point is 00:10:53 that it's also serving as storage. So to go and avoid that, we've really set ourselves as a target to try and minimize, if not zero out, the amount of data path or commonly used CPU for accessing the data. And instead of implementing storage services on the target side, we do it on the client side. So if I have an application that needs a lot of IO, I can expect to utilize some of my CPU resources in order to implement that IO. In fact, even though we're doing that currently, our I.O. stack is very efficient and it doesn't take a lot of overhead in order to implement storage services on top. And by doing that, you can really work in that kind of converged environment and you can work across a large network and not affect the way resource planning is done. So if you have a
Starting point is 00:11:39 scale-out application and you want to increase its size and it needs now more CPU processes or processors or it needs more memory, you can go and you can have it scale out and you want to increase its size, and it needs now more CPU processes, or processors, or it needs more memory, you can go and you can have it scale out. And you don't care whether it's scaling out to machines that are also serving as storage or not, because the target site doesn't take any CPU. With that in place, it did mean that we had to go and re-architect a lot of the ways, or a lot of elements of how the storage itself is implemented. Most, if not all, current storage services are implemented on the target site. And we're going and implementing them on the client side.
Starting point is 00:12:14 And we're doing it on volumes that can be shared between different clients. So you can use clustered file system, for instance, if you want to, in order to share data between multiple nodes. So in order to go and implement services that are scalable, that are done from the client side, we had to re-architect a lot of the stuff, and that's where actually most of our intellectual property actually lies in how you go and implement storage services from the client side. So you mentioned shared volume.
Starting point is 00:12:40 So you can share a physical volume that's residing on the target storage across multiple hosts? Absolutely. So our current offering, our 1.10 product, is the one that we released as part of our marketing launch in March. It includes RAID 10 functionality, so you can have multiple drives that are hooked up or connected into a single volume. Different replicas of the data will be kept on separate nodes in order to ensure true high-level availability for the data. And then those volumes can be shared among multiple different clients. So you could run a clustered file system on top of that, or a database that
Starting point is 00:13:22 wanted to have a shared storage layer underneath and have it running across multiple different nodes. And in that sense, you can run a scaled application even on top of a protected volume on practically whatever scale you want to. So you mentioned multiple targets. Do you support high availability dual controller types of solutions here? I'm just trying to understand how this all hangs together here. Absolutely. So we can work with dual controllers, but it's sort of using those kinds of controls negates a lot of the benefits of what we're doing. What we're basically
Starting point is 00:14:00 saying is you can use your standard hardware. You don't have to go for any exotic hardware. You can use regular standard servers that aren't dual motherboards or dual controllers in that sense. And we can still ensure that the data will be highly available by ensuring that the copies of the data, the replicas within the RAID are stored out onto or deployed onto different nodes. So in terms of data protection, it's like I'm running mirroring in the logical volume manager of my guest, and it's writing to two single controller arrays. I'm getting resiliency because I'm writing to two of them, not because each one is resilient, right?
Starting point is 00:14:39 Exactly. And this is really the key to the system. Janiv mentioned a lot about when you do this, when you put the intelligence in the client, it makes things very scalable. It lets you share. But what it really gets rid of is the very common problem of a single bottleneck. That is when you have a dual controller system,
Starting point is 00:14:59 regardless of the services it's doing, you have all the IO going to that one system, which means you really try to scale up the interfaces completely redundantly. You're using a lot of investment in this one box. And at the end of the day, you get a noisy neighbor problem, even in the box. That is, if you have a lot of clients hitting the box, the box is going to get to a limit of what it can serve. And then that's going to affect IO from other clients because it's all centralized. And at some point, you're going to get to a fairly small, in terms of what NVMe drives can do, bottleneck. That is, if you look at a single NVMe drive, and this is evidenced by any of the newer, or even the older Intel drives, the newer Intel drives,
Starting point is 00:15:42 the Samsung drives, the HGST drives, which, by the way, we work with all of these drives. When you look at these drives, a single drive can do as many IOPS as an entire, for instance, pure storage array. But of course, you wouldn't use it like that because the pure storage array is redundant and gives you services. But the IOPS level is high enough. So what happens when you put a bunch of NVMe drives inside an existing all-flash array today? What you're doing is you're really limiting the performance of those drives. You're getting some services, but you are so bottlenecking those drives. There's really no way to get around that if you have this centralized system.
Starting point is 00:16:21 So by distributing the intelligence for logical volumes, for data protection, for multi-pathing in the client itself, what looks like a block device on the client, this is actually a logical volume manager that's also doing data protection. When you completely distribute that, you not only eliminate that bottleneck, but you also eliminate the noisy neighbor problem. That is, if you have one host, one client going crazy, consuming millions of IOPS if you wanted to, it's not necessarily affecting other clients because it's not going into a centralized system that's a bottleneck. So the idea is that we build a grid of multiple systems that are providing storage from their NVMe SSDs
Starting point is 00:17:09 and multiple systems that are consuming it via your software, right? Yeah, optimally, you might even call it a non-volatile mesh. So, no, I'm sorry for that. NVMesh is the name of the product for anybody who's not catching that one. Optimally, to get the highest levels of aggregate performance in either IOPS or bandwidth, and to experience the least amount of problems with a noisy neighbor or contention on the network, that is. Optimally, yes, you do want to do as Yanni had said and put only a handful of NVMe drives per host, and then spread those out over multiple
Starting point is 00:17:46 hosts because that way you're never going to tax the network or the NICs. Optimally, you want to balance the NVMe drives and their capabilities to the bandwidth of the host that they're installed in. And in this way, you get very high performance. Are there limits to the number of target storage systems, I'll call them, that there are nodes on this network? Sure. So there's really two different types of limits that you can look at. One is how many target nodes can you have that will contribute to a single volume? And currently we're limiting that to 128 nodes or 128 targets. We actually have a large-scale deployment that is using that number for specific volumes.
Starting point is 00:18:31 And then the other one is what's the overall deployment of the whole SDS or the whole software-defined storage system. And there, we really architected the product to be as limitless as possible so that you could really scale it up to a full data center scale that is really the target so you'd be talking about tens of thousands potentially tens of thousands of target devices within a single system gee and for a second there i thought you were going to say and we used a one byte node address so you can only have 255 wait a minute wait a minute did you you say 10,000 storage systems,
Starting point is 00:19:06 storage targets, potentially? And you're not even talking the number of clients here. All you said, the limits from a client perspective is that you can have at most 128 for a single volume. But I mean, you could have multiple single volumes, obviously. So 10,000 storage systems? Somehow, this doesn't make any sense to me. Sure. So if you go, as Josh said, to a setup where you really want to, you know, spread things out. And so you use all of your nodes or all of your standard servers within your data center. And you just, you know, you've got a NIC there anyway, so you might as well have a high speed NIC. And you can use it in a converged mode in the sense that it serves your standard networking and it also does your storage.
Starting point is 00:19:46 And then you put in a drive on average on each one of these standard servers. Then you could have each one of these nodes would be both a client and a target. And in that way, you'd really be getting the utmost performance out of your system. And you'd really be able to also spread your data out so that if you wanted to have high availability, you could ensure that. You could put data replicas on different rows within the same data center or within different racks. You could really make sure that it was always available from that perspective. Yeah, but stretching it from Manhattan to Jersey City would kind of defeat the whole low latency story. Yes, it would. That's true.
Starting point is 00:20:23 Yeah. Okay. So let's talk about the client software to storage target protocol. Before we do that, at this point I usually ask the, okay, so those are the theoretical limits and you used a 32-bit number for your node address so you can have tens of thousands of nodes. How big have you actually tested? Although I think we got a hint to that
Starting point is 00:20:50 about one of your real customers already hit the 128 nodes per volume limit. That's correct. As was mentioned in our launch, and there's action materials on our website, NASA Ames is the largest as far as widest volume. NASA Ames out in Moffett Field using their cluster for visualization for some analytics on files that come off a supercomputer.
Starting point is 00:21:17 They are a single volume spread across 128 nodes. Each of those nodes only has a single drive. So it's only a 256 terabyte virtual flash drive. But that one virtual SSD is attached to, as Yeniv mentioned earlier, it's multi-attached. It's attached to every one of those 128 nodes. So every single node in this compute cluster sees the 256 terabyte device as if it's a local device. And then we've, since the initial deployment, layered in partnership with SGI, now HPE, we've layered the CXFS file system on top of that.
Starting point is 00:21:57 So it's a clustered file system. And every node sees this shared file system, but that file system performs on a seek to any file anywhere in that file system as if it's a local NVMe drive. So they're still achieving about 140 gigabytes per second of aggregate bandwidth from all the nodes. And that's limited by their network architecture, by the way, not by the devices. And somewhere in the neighborhood of 30 million random read IOPS 4K at about a 200 microsecond average response time. They're kind of perfect template customer because they need a very high bandwidth, both read and write load for certain parts of the computations.
Starting point is 00:22:39 And then when they're doing the analytics on this file to look for trends, to do some processing, it's a random IO load. So they're both hitting this with random and throughput at different parts of their compute cycle. And this actually mirrors what many large customers have to do in the world of analytics, which is you use streaming to bring in very large data sets. And then when you start examining the data and looking for certain relationships, you may hit it with a random iOpload. Oh, God. And each one with one drive.
Starting point is 00:23:08 Yeah. Okay. Usually it's Ray, but now you've blown my mind. Yeah, it's blowing my mind too. The key, though, is not to... This is very tempting to get caught up in the numbers, but what we want people to make sure they understand is the numbers really are relevant. You don't want to get caught up in these very large numbers. But the important thing to remember is that we are allowing you to unleash the performance of the media you're buying. In other words, even if you
Starting point is 00:23:33 go to a new system, if one of our, I'll call them competitors out there, if a traditional all-flash array goes to NVMe drives, and they go to 24 NVMe drives, and they use the middle-of-the-road drives that, let's say, do 500,000 IOPS each. They go 24. That's 12 million IOPS. But that same box maybe is going to give you 700,000 IOPS, maybe 800,000 IOPS max. So you've already paid for that IOPS capability. You just can't access it. And that's the big differentiation with our system. It's not the top end number. It's whatever you've paid for in your NVMe media, we're going to allow you to unleash that. We're going to allow you to use it all. You don't have to use it all, but you can use as much as you want. And you'll get that at very low
Starting point is 00:24:22 latencies. You'll get it at the kind of latencies that the media was made for. So if you have 10 million IOPS available and you only use a million, at least when you use a million, you're going to get extremely low latencies, probably under 100 microsecond read response time. And even with protection under 30 microsecond write response time. And what this means is for storage planners is you no longer have to worry about, am I going to have one client affect another? Am I going to run out of horsepower? You're not going to, because you'll be able to extract with our software,
Starting point is 00:24:56 all the performance that you already paid for in the media. Okay. So let's go, go deep here. So the technology, you're actually not using standard NVMe over Fabric kinds of protocols, if I understand this correctly. Is that correct, Yanni? So that's partially correct. Our product works with two different protocols, and it works in practically the same way with both of them. We provide our own flavor of NVMe over Fabrics, something called RDDA, which is Remote Direct Disk Drive Access. And that's what allows us to avoid using
Starting point is 00:25:33 any target-side CPU for the data path. That protocol is inherently different than NVMe over Fabrics for two reasons. First of all, we devised it before NVMe over Fabrics was defined. But also, it has been built this way so that it can achieve that feat of not using any target-type CPU. And when you're working in a converged environment, you typically want to use that, again, so that you don't require any target-type CPU. And so that makes your resource planning very simple, and it avoids a noisy neighbor problem.
Starting point is 00:26:03 Our product also supports using the standard NVMe over Fabrics protocol for the data transfer. And we've exhibited that with some vendors that are pre-release themselves that are coming out with NVMe over Fabrics hardware. And we've shown that the product works seamlessly with them. And then in that scenario, some target site CPU is used. But those are in scenarios where the vendor is coming out with a box that has a special piece of hardware, typically an ASIC or something of the sort, to go and really bring down the hardware requirements or the cost of providing that NVMe over Fabrics target connectivity. So we really do support both of these today within the product. When you go and you implement things using NVMe over Fabrics
Starting point is 00:26:51 in order to generate shared functionality of volumes connected to multiple clients with rated functionality, you still need to perform some kind of remote locking. And for that, we still use some additional RDMA communication on top of the NVMe of the fabrics that's being used there. So even if you're going to use standard NVMe of the fabrics, you still need to have some additional communication to ensure data consistency. Right. So this is somewhat newer than what we saw at Storage Field Day, although you may have had it there. You may have just not discussed it at that point.
Starting point is 00:27:26 So we've got a Linux client that runs your proprietary RDMA protocol and presents a block interface. And we've got a Linux target piece that runs on a server that has the NVMe SSD and delivers up that storage to the target. And they, of course, can both run on the same system. We've talked about mirroring.
Starting point is 00:27:52 Are there any other services yet? That depends on what your definition of services is. So to be fair, and I think what most of your listeners will say, is that services are things like thin provisioning, compression, deduplication, snapshots, et cetera, what they've gotten used to in the all-flash array space. We consider services, storage services, when we're talking at the low level, even things like logical volumes, basic RAID protection. So we would argue that we do have services in our client. Those services are that ability to do logical volumes at all, which are dynamic, resizable,
Starting point is 00:28:27 multi-path in an active-active fashion, different logical volume types. That is the protection on them. They can be concatenated, RAID 0, RAID 1, and RAID 10. And so these things are all built in and are services. But the important thing to understand is that is where the majority of, and Yanni said this earlier, the majority of IP investment has been in how in the world do you do this? How do you get clients, especially clients that can share a logical volume, a logical volume that's not being hosted and processed in a centrally managed solution? How can multiple
Starting point is 00:29:03 different clients attach the same logical volume with all the intelligence being the client side? So we've got that worked out from here on out, without getting too detailed, since we're talking to a public audience, we will be adding other features. This has been a really good conversation. Thanks. But it's been, no, it's, we, of course, now that we have this base technology established and we've done it the right way. And if people are really curious and need some late night reading, you know, you can Google patents and Accelero and you can find patents that are filed
Starting point is 00:29:38 describing some of this distributed metadata. And that's what's really behind this functionality that will lead to, in very soon, upcoming releases, having different erasure coding levels, different data protection schemes, up and through eventually having things like virtualized blocks that offer thin provisioning, clones, or snapshots. So services that people are accustomed to will get in the product on the roadmap. We're not saying here exactly when, but that is absolutely in progress. There will be a trade-off, of course. When you do more processing, there is a hit on latency.
Starting point is 00:30:18 So, we can't get around the physics of the problem there. Yeah, I mean, anytime you take the data and examine it before you store it, that examining takes time. Right. But because we're doing it all in the client, it's 100% distributed. So, again, this is where we'll really shine and avoid that central system bottleneck because every time you add a client, that client is also adding, if you want to think about it, storage processing power. Let's talk here about the distributed metadata, which has got to be some sort of a mechanism that provides for the locking of logical volume and control structure updates and things of that nature. Presumably, it's not in the critical path for the data I.O., but it could be if you start doing things like different protection schemes, virtualized blocks, that sort of thing. Is that where things start to become more complex, I guess? Yeah, that's exactly true. So once you move to rated volumes that are shared among multiple clients, to ensure data consistency, you start to build out some metadata. And we currently
Starting point is 00:31:18 control that metadata in a distributed fashion, as Josh mentioned, and we use RDMA Atomics in order to make it really highly performant and, again, to avoid target site CPU usage. As we progress to more evolved rates that are in development and as we progress to thin provision or virtualized blocks, as Josh mentioned also, that is also in development, the metadata structures do get more complex. We'll be glad to have you back on Greybeards to discuss how you manage all of this with data deduplication. I didn't say data deduplication, did I?
Starting point is 00:31:55 I can assure you that our planning has gone all the way through snapshots and clonings and data duplication, and we know how to do all of that metadata management from the client side in an efficient way. It's very easy to say it. The proof is the hard part, proving it in the field. But we'll do that as well. And that is really where it does get complicated. You need to have the right metadata structures to make that efficient,
Starting point is 00:32:20 especially when you access it from the client side. But that is where the complication comes in. Not that we can get into deep, dark secrets about things to come, especially when you access it from the client side. But that is where the complication comes in. Not that we can get into deep, dark secrets about things to come, but a light just went off in my head as we were talking about how RDMA and that distributed hash table could get very interesting. Well, I mean, lots of nice things about the NVMe over Fabric protocol that makes sorts of things, you know, 4K and under block IO activity almost embedded in the protocol, I guess is what I would call it. Yeah.
Starting point is 00:32:53 I think for Acceleron NVMe, NVMe over Fabric standard protocol is really more interesting as a way to support non-Linux clients. But even the RDMA stuff, I mean, ultimately you are talking NVMe to the SSD protocol. So, I mean, that protocol provides almost embedded 4K data blocks without having to set up data transfers. It's pretty interesting how it all works. It's almost bizarre in my sense. It's like, it's like taking a command, a SCSI command and embedded,
Starting point is 00:33:29 you know, 4k worth of data into the command itself packet rather than the, you know, a data packet or something. Well, why have a chatty protocol? Yeah, I suppose,
Starting point is 00:33:39 especially when you're talking microseconds count here. Yeah. Yeah. And that's especially true when you start looking at newer kinds of non-volatile memory devices that are coming out. The Intel Optane, for instance, where the basic latency for reads as well as for writes is around 10 microseconds. It's around 10 microseconds. And so every network round trip that you have to do really becomes critical. And so if you embed the data within the request itself, you're saving a roundtrip.
Starting point is 00:34:11 It becomes relevant, especially when you go to those kinds of non-vital memory. Yeah, that roundtrip is, you know, almost a microsecond in itself. Jesus Christ. Oh, my God. Okay, So we've got this system and it delivers astounding performance. And you know, certainly NASA aims is a lovely client. Uh, you guys were talking about the big boys, the Facebooks and by dues of the world at the beginning of the podcast. So is your go-to-market strategy for the moment elephant hunting? It is, but not that big an elephant, if it's fair to say. So while we have the utmost and deepest respect for these largest companies, the reality is
Starting point is 00:34:59 they're probably going to do their own thing. They have hundreds or thousands of developers working on these kind of things, and they tune it to their exact hardware environment and the way that they do provisioning and offer services. And at the end of the day, you could almost look at something like Facebook and say that it's one application,
Starting point is 00:35:21 that Facebook is a single, large application made up of many pieces, but it's one very large application that can be hyper-optimized even to the hardware underneath. Whereas I was talking to a finance customer early last week and a very typical kind of enterprise financial services entity, and they said that they have over 8,000 applications in their environment. And so that's who we want to solve for. We want to bring the power of what we can do and offer this hyper-efficient, very high-performance storage that's very flexible. Being at the block layer, you can use it as block, you can use it as file. You can put an object layer on top of it if you wanted to. We did block storage because it's ubiquitous underneath everything else. We'll target those kind of customers. Basically, if you take off those very top level, the highest level customers, Amazon, Microsoft, Azure, etc., and you take the next 200 or so, that's a lot of the folks that we target. So they are elephants.
Starting point is 00:36:26 They're not the biggest elephants in the world. They're still very, very large, well-known customers. And we announce some of those publicly as customers. So we have GE Digital in their Predix Cloud. PayPal is using our software for more kind of intrusion detection and some network security issues there. And we also have Hulu as a customer. Okay.
Starting point is 00:36:51 So, I mean, the big web guys make sense as target customers for you. When you start talking about finance, outside of the Fidelities and the Goldman's of the world who really have IT groups that think more like Web 2.0 companies than like brokerages have traditionally, isn't your penetration problem in corporate America going to be VMware support? You'd think that. But by the very same token that FSI customer I was talking to, one was running OpenStack, and we're perfectly happy to work with OpenStack. We can work under the KVM hypervisor either within a virtual machine and offer faster than local flash performance over the network. And that is because we're not incurring the wrath of the hypervisor IO overhead. So we can work within a KVM environment under OpenStack. And then many
Starting point is 00:37:50 of these same kind of customers that are very forward thinking are looking at containers, in which case there is no virtualization. And we also can work inside a provisioning framework with containers. We have an internal demo we've done for a very large telco who's looking at this, where we showed containerized applications with persistent data running on a container host, and you pull the plug on that host and the orchestration layer, whether it's Docker Swarm or Mesos or Kubernetes, restarts that application container on a different physical host and attaches that persistent volume. And then that container goes ahead and picks up where it left off.
Starting point is 00:38:29 And it's getting local flash performance, but it's free from being bound to a physical host. And that's really the key. And you mentioned it exactly. VMware, because we are a kernel module, it would be impossible for us to get into VMware. They just simply don't expose the APIs. However, when VMware supports the NVMe over Fabric client, then they can use us as a target. And so that will free up that environment, and you could use us with VMware as well. Okay, so where are you with the Docker volumes driver and Cinder driver and such?
Starting point is 00:39:07 So Cinder, we did not push it into the release for OpenStack, but we do have a driver ready to go, a Cinder driver. The reason we did not push that yet is, as we mentioned earlier, we don't have snapshot functionality yet. So we would go ahead and we would very happily perform a snapshot by literally copying all the blocks, which might be a surprise to some people. Yeah. An unwelcome surprise.
Starting point is 00:39:34 Ray and I are old enough to remember business continuity volumes, split mirror snapshots. So yeah, we understand. Yep. So we've not pushed that up. That will likely not get pushed up until we have that functionality or if a customer really wants it. But it is functional today. So if someone was running OpenStack and understood that you should use this only for data volumes and not for the root volumes where you're going to snap, you could use this there. The plugin for the Docker environment, that I would call beta level. And that is because it has to tie into the orchestration layer, either Kubernetes, Swarm, or Mesos.
Starting point is 00:40:13 And we haven't identified a clear leader or partner there yet, honestly. So anybody listening, we are looking for partnership opportunities there. This is another way of saying we're very customer-driven. And the real honest answer from a practical standpoint to your question is the first customer who says, I need you to support this kind of container in this environment, that's going to be the one that we support. But we're ready to go. We have functionality now. That's the nature of being a brand new newborn startup. Your first seven, eight paying customers get pretty much what they want.
Starting point is 00:40:49 Plus, it's such a radical departure from the other storage world that we're familiar with. It is, and we have to see who's going to win. Containers are very interesting, but VMware is a very mature, rich, established environment, to say the least. So we're not crystal balling here, predicting the demise of VMware by any means. So it's a very important environment, well-established. We'll be around a long time. Oh, come on. That would be news. That would be news. It would be. Well, you know what happens to the doomsday people once they actually make the prediction and they say what the date is. Once that date passes, we tend not to hear from them again. I like to learn from the mistakes of other people in the past, not myself, but on my own. There was that one reverend from South Carolina who was announcing the end of the world for the fourth time.
Starting point is 00:41:41 Yes. All right, gents. We're getting off the deep end here. We're at the end of the show is there anything uh howard any final questions you have well i mean i i'm really intrigued by what acceleros doing and by you know some of your other competitors in what i've dubbed the new tier zero uh this very low latency, very high performance world. The real question I have is how big an impact you guys think will come from other vendors moving into NVMe, not as a greenfield, but as a modification to their existing systems. I mean, yesterday, Pure announced that they were allowing users,
Starting point is 00:42:30 coming out with NVMe flash modules, and now the FlashArrayX was an all-in NVMe system. Now, they also are still delivering storage via Fiber Channel and iSCSI, but they're promising NVMe over fabrics in the future. How big is the slice for high performance storage when Pure delivers NVMe over fabrics and maybe 500 microsecond latency? So that's a valid question. And the answer to the first part of that question is welcoming other folks. We welcome everyone in the industry to adopt NVMe. We think it only pushes our agenda forward. NVMe by some is still looked at as kind of exotic, especially NVMe over Fabric.
Starting point is 00:43:26 So we welcome folks like Pure into the arena to go ahead and support that because as Yaniv pointed out, we already support NVMe over Fabric as a standard. Can use that as targets. You can use us as a target. We can use NVMe over Fabric targets. We could integrate into ecosystems that support that. So at the end of the day, they're still going to have in their architecture the same bottleneck limits that they have today. Yes, maybe they'll go down to 500 microseconds of response time. But if you have a very large
Starting point is 00:43:57 database that has certain transactional limits that are serialized, an individual IOP is what matters, how fast you can finish that. So you'll still need a solution like ours versus that that you can get from the recent announcement by Pure. And today, as you kind of alluded to in your question, it's a smaller market. But I was involved in some of the early all-flash arrays, and when it first came out, we were getting the very same questions. People were saying, who needs an average one millisecond response time or 1,000 microseconds? Who needs that when spinning drives were giving us 8,000 microsecond response times at best? Why do you need consistent one millisecond response time? And now you look at what storage is being deployed, all flash arrays are the fastest
Starting point is 00:44:54 growing sector of storage sales when you look at IDC or Gartner. So they've become the norm. And we certainly feel and hope that this is going to be the same kind of pattern that's followed here. There is the once you've seen faster, that's all you're willing to accept. Yeah, there's that old joke in storage, right? Who's ever complained about having too much capacity or too much performance? The backup guy. Yes, yes. The capacity, yeah. Performance, guy. Yes. The capacity, yeah. Performance, no.
Starting point is 00:45:28 All right. Well, that was good. Yanev and Joshua, anything you'd like to say as a final statement to our listening audience? I'd just like to thank everybody for listening. Remember that we're out there. Take a look at our site. We think we're really doing things differently.
Starting point is 00:45:46 We think that's going to make a difference. And really, we're looking forward to where compute and storage is going, rather than simply making tiny improvements on where it's been. Yaniv is actually the one who, I believe, said something to this. And Yaniv, correct me if I'm wrong, but I love the analogy, which is is imagine if people had taken candles and just worked on constantly improving candles, improving candlelight, and making it a little brighter, a little bolder, maybe candles lasting longer. Imagine where we'd be, but sometimes you have to make that leap, and that leap was to the light bulb, to the electric light. We feel we're in the same kind of position. Centralized storage models have been around forever.
Starting point is 00:46:30 All flash arrays did this a little differently, but yet they're still the centralized dual controller, and it constantly makes small improvements, like moving from SAS or SATA now to NVMe media. They're tiny, iterative improvements. But at some point, to really go to the next level, you have to take that leap. And Accelero NVMesh is that leap. So are you really calling Pure's FlashArray Henry Ford's faster horse? It's dangerous to do so, but I think in a way I am. When I look at their announcements, there was a different, uh, you know, apologies to you guys. There's a lot of folks out there, but there was a different blogger who tweeted something last night. I said, they'd love to
Starting point is 00:47:15 see us go up against, uh, Pure with their new announcements. And I did some quick back of the, the napkin calculations on what we support today. And while Pure is talking about supporting an 18.3 terabyte flash configuration in this new announcement, today, right now, you could build out a system with us and you could go to 23.5 petabytes, which would give you 2.8 terabytes per second of bandwidth. Yeah, to be fair, that 18 terabytes is a module, not an array. So even, okay, in an array of, we're still, we're talking the max you can go to an array. It's, I think you could scale us out to, so nearly three terabytes per second of bandwidth and nearly 600 million IOPS. You could build this with standard off-the-shelf servers today.
Starting point is 00:48:02 Yeah, and there's this discussion here between scale-out and scale-up and things of that nature. And the fact is you could deploy 10,000 pure systems and potentially reach, you know, I don't know, terabytes per second. Well, that would make Scott happy. Yeah, that would make lots of people. But the issue is, you know, these guys are doing a different thing. They really have come up with a new approach to storage, not unlike another customer, another client that's looking at, you know, re-architecting storage per se. And it's an interesting approach, definitely. Yaniv, did you have any final things you wanted to say? I think I just want to emphasize one thing, and it goes back to Pure also. One of the
Starting point is 00:48:46 things that's been really important for us is to leverage standard hardware and to really try and avoid using something which is very special. And that's why we've gone with NVMe drives, which we thought when we started, but it's proved, it's really become apparent now that NVMe is something that's going to be very widely adopted. And so we can use SSDs from any vendor, any NVMe SSD. I think Pure are taking a little bit of a different approach there because while they're providing an NVMe interface, they've gone and generated their own module and they've moved away from standard NVMe drives. And so when another vendor goes and improves their NVMe drives, if you're buying Pure, you can't leverage that.
Starting point is 00:49:30 And I think that's really sort of one large differentiation between us and most of the storage appliance vendors is that we're not dictating any hardware. We're letting you leverage whatever you can find out there. And if you want to use Optane drives, you can go and use Optane drives. And if you need a different mix, you can go and do that. We're not really trying to tie you down to any specific hardware. And I think that's a large key to what we're doing. It's really being purely software only.
Starting point is 00:49:56 Okay. Well, this has been great. And even Josh, thanks very much for being on our show today. Next month, we'll talk to another startup storage technology person. Any questions you want us to ask, please let us know. That's it for now. Bye, Howard. Bye, Ray. Until next time.
