Grey Beards on Systems - 62: GreyBeards talk NVMeoF storage with VR Satish, Founder & CTO Pavilion Data Systems
Episode Date: June 16, 2018
In this episode, we continue on our NVMeoF track by talking with VR Satish (@satish_vr), Founder and CTO of Pavilion Data Systems (@PavilionData). Howard had talked with Pavilion Data over the last year or so and I just had a briefing with them over the past week. Pavilion Data is taking a different tack to …
Transcript
Hey everybody, Ray Lucchesi here with Howard Marks.
Welcome to the next episode of the Greybeards on Storage monthly podcast,
a show where we get Greybeard storage system bloggers to talk with storage system vendors
to discuss upcoming products, technologies, and trends
affecting the data center today.
This Greybeards on Storage episode
was recorded on June 8th, 2018.
We have with us here today, VR Satish,
founder and CTO of Pavilion Data.
VR, why don't you tell us a little bit about yourself
and Pavilion Data?
Thanks a lot, guys.
I'm VR Satish, founder here at Pavilion Data.
I've actually been a lifer at Veritas,
and then it got acquired by Symantec.
So I ended up there.
Towards the end, I was the CTO for the storage division.
That's where I am.
So here I am, excited to be doing my own company,
working on bleeding-edge technology around NVMe and NVMe over Fabrics.
Well, bleeding-edge is the best technology.
So what is Pavilion Data, and what's going on with it from the perspective of availability and that sort of stuff?
So in terms of the product, I mean, the basic premise is what interested me, let me start with that. You know, we have had storage systems that have been pretty much stagnant in their design for
the last 25 to 30 years, and they've been designed for spinning media. You know, as time goes by,
with new innovations from Intel with NVMe and the whole consortium of companies,
I mean, we have gotten to the era of solid state everything. Now here we have media,
which is tremendously fast, reaching as fast as network speeds, if I may use that word.
And it's time for us to just look at, is it still valid to do what we were doing for the last 30
years in this new environment? Or does the entire storage system need to be completely redesigned
from the ground up so that we can keep up with the speeds of the media?
And that's precisely what we're doing. We believe we have a storage system
that has been designed from the ground up with the faster medias
in mind. And we have a great product and the product
is available right now and it's shipping and
people can buy it and use it and try it out for themselves. So are you using NVMe storage SSDs
and that sort of stuff?
That's exactly right. It's not only just on the backside using the SSDs alone.
See, let me get a little bit philosophical here. You know, I am a big believer that great companies are formed
because they came up with a very simple question
and answered it with a very simple answer, a profound answer.
So we ask this fundamental question saying, okay, like I said,
media is reaching the speeds of network.
I mean, if you take a bunch of NVMe drives,
they can actually
sink up to a terabit per second. And that's pretty much what you have on the network side.
So if that's the case, you know, if you go walk into a data center and ask the question,
what is the one device which you see in the data center that can ingress a terabit per second and
egress a terabit per second at line rate?
These are not the servers.
These are the networking devices.
They figured out how to do this at scale.
So we said, okay, if that's the case,
why isn't storage made like a network device and why is it made like a server?
And isn't all that network stuff a lot of hardware-intensive development,
that sort of thing, to get to those speeds?
Well, that was the case long back when Cisco originally started.
There are lots of ASICs and FPGAs and all that stuff.
But if you go to the world today, you know, you have the Quanta guys who are using merchant silicon from Broadcom and all these people
and building up a system.
It's just good hardware design, but not necessarily at the component level.
You can go get the components at Fry's, if I may use that word.
You know, it's the local department store for electronics. And that's just fair. You have to just put them together and
have a really good design, a system design. I'll give you that the network gear can carry
packets at essentially line rate at a couple of terabytes per second for a switch.
But that's just plumbing.
They're not actually doing anything.
They're not storing the data.
They're not retrieving the data.
They're just moving it from port to port.
That's the easy part.
Okay.
That's an excellent question.
The first question you have to ask is, how do I move the data?
Now, once you know you can move the data, you have to now ask the second question,
the tougher one, according to you,
is how do I process this data?
How do I store this effectively
with whatever format you want to store it
at that line rate?
Maybe give or take a little bit.
I'm not saying, look, you're going to do
a terabit per second at line rate.
I mean, I don't think that's needed.
But you take the fundamental concepts
of how they were moving data so fast
and interject the storage concepts in between and say, maybe I will not move it at line rate. I'll be close
enough. So what we essentially did was borrowed some of the basic underpinning concepts of the
network design, network switch design, but we incorporated a lot of storage concepts on top
of that. Because I'll give you an example. Network world is very forgiving.
You can drop packets and it's okay.
Somebody will retry it.
In storage, you can't do such things.
Depends on the network you're talking about.
In the typical Ethernet network, just drop the packet.
In storage, you can't do that.
The quality of service is much more profound.
I mean, the resiliency requirements are much more profound.
So it's not that, I mean, if you look at the concept there,
you have line cards, you have port processors, you have ports.
People can buy more ports as they want.
We try to bring in similar concepts in this world.
Just as you have line cards here, we have controller cards.
You know, controller line cards, as we call them.
And you have a whole bunch of CPUs, and you can expand them as you want.
You can add more as you want.
Now we put in a larger CPU
when compared to a very, very low powered one,
which you may get in the network world
or sometimes driven by ASICs.
We put a Xeon D, you know,
a little SOC, which is low powered
inside our line cards.
And we had enough juice there.
Xeon D has got a usable amount of horsepower.
It's got a usable amount of horsepower.
That's a big step up from an ARM.
Yep.
And that was good enough for us
to actually do some of the RAID calculations that we wanted to do,
and some of the metadata computations that we wanted to do.
And we found out, we were surprised
when we did this as a prototype in our lab long back,
we were surprised that we could actually clock speeds
up to 120 gigabytes per second.
128 gigabytes a second?
120 gigabytes a second is what we could clock.
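(For reference, a quick back-of-the-envelope conversion, simple arithmetic rather than a figure from the show, ties that prototype number to the terabit-per-second framing used earlier:

# Back-of-the-envelope check (not from the show): how does 120 GB/s
# compare with the "terabit per second" line rate mentioned earlier?
gigabytes_per_second = 120
bits_per_byte = 8
terabits_per_second = gigabytes_per_second * bits_per_byte / 1000  # decimal units

print(f"{gigabytes_per_second} GB/s is {terabits_per_second:.2f} Tb/s")
# -> 120 GB/s is 0.96 Tb/s, i.e. roughly the terabit per second
#    used in the networking comparison.
)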
And then we added data management.
Reads are actually really fast
because data management really does not affect that.
When you do writes, obviously you're not going to get that.
You're going to have some impact
because you're doing RAID computation and all that.
But at the end of the day, what it proved to us was looking at it from the outside in, not following the traditional model
of buy a server, put some disks in it, put some software on top of that, and voila, this dual
proc server is now going to become your storage box. That's not what we did. We did a complete
outside-in thinking. And we said, we're going to look at it in a completely different way.
Not only do it for the speeds of today,
but how is this going to be for the speeds of tomorrow?
You know, you see Intel making announcements
on the 3D XPoint stuff.
Very interesting work that's going on
in the industry, in the media.
This is, in my word,
this is the age to be innovating in storage.
And it's exciting for me.
I'm a storage guy.
I'm a storage lifer.
Yeah, well, it got boring there at the end of the disk drive era for about three or four years.
It didn't last.
Things haven't been boring for a while, though.
Absolutely.
And I love what I'm doing, by the way.
Otherwise, you know, when we started this company, you know, everybody used to come and say, oh, another storage company?
Why are we doing this?
You know, how many skeletons are there in the closet?
But when you have a compelling value proposition, when you think you can disrupt the market,
when you have a compelling business model, all these three start to come together and we think we have something here and we want to make this big. Okay, so we haven't actually come out and
said it yet, but we're talking about a storage appliance, right?
That's correct. We're talking about a 4U storage appliance which is based off of NVMe over Fabrics. That means the connectivity is NVMe over Fabrics.
Okay.
We do two kinds of fabric. We can talk over regular RDMA, RoCE v2 or RoCE, or we can also do TCP, both. So for those hosts which want to connect to our box,
which don't have the luxury of having any of the new fancy RDMA cards,
that's game, too.
We can actually change the personality of a controller to say,
you will now talk TCP and everything will be over TCP.
So legacy compatibility is also there. Well, I can see where you're coming from with legacy,
but the NVMe over RoCE code is in the latest Linux kernel.
That's been through the standards organization.
The NVMe over TCP hasn't yet.
That is correct.
So do you have a driver for that, or are you sending people to SolarFlare?
No, we have a driver for that.
We have actually ported a Linux version, and we go all the way back a few versions.
And those of you who want to install our driver
on the host side for TCP, that's fair.
And for RoCE, we need nothing.
It's completely agentless.
Agents are evil.
So just do a Linux 7.4, boom, power it up,
and you're done to connect.
As long as you have a RoCE NIC.
As long as you have a RoCE NIC, you're done.
Exactly.
Wow.
There's the configuring the network magic,
but this is a storage podcast, so we ignore that part.
So the storage appliance has line cards?
Yes.
So see, when we went about designing this, we had, you know, this was the tough part.
We had five guiding principles for us.
The first one was we wanted to go after the new trend that is happening in data centers at scale, which is rack scale design.
In the sense, people are starting to look at the rack as the unit of computing. If I may call a glorified server, right?
And they do hyper-convergence across racks, if I may use that word.
I mean, they want to deal at the rack level.
Now, there's also a lot of talk about disaggregation inside the rack.
That means keep the CPU separate, keep the memory separate, keep the storage separate.
Unfortunately, technology for the CPU and memory is not out of the labs yet,
but storage definitely is out of the labs with NVMe-oF. So we wanted to do something for the
rack scale. This means we wanted to build a storage box, which was fast, small, and dense.
Why fast? Because, I mean, think about it. In a rack, you could have 20 servers. And assume that
each of these 20 servers
have two NVMe drives.
Let's say, take two NVMe drives per server.
Now, if you want to give the equivalent of that,
that's two NVMe drives times 20
worth of performance you need to give from your box.
So it has to be fast.
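(As a rough illustration of that sizing argument, using my own assumed per-drive number rather than anything Pavilion published:

# Rough rack-scale sizing sketch. The per-drive figure is an assumption,
# roughly what a PCIe Gen3 x4 NVMe drive can stream sequentially.
servers_per_rack = 20
drives_per_server = 2
gb_per_sec_per_drive = 3.2

aggregate = servers_per_rack * drives_per_server * gb_per_sec_per_drive
print(f"DAS-equivalent bandwidth to match: ~{aggregate:.0f} GB/s")
# -> ~128 GB/s, which is why a rack-scale consolidation box has to live
#    in the same ballpark as the ~120 GB/s prototype figure above.
)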
The second one was it had to be small
because what good is a rack scale architecture storage
when you're going to use three-fourths of the rack?
It's completely dumb, right?
So we had the challenge that we had to keep it small
and inconspicuous within a rack.
So we came up with 4U.
In my opinion, personally, even 4U is kind of just about there.
If I can do it in 1U, I want to do it in 1U.
But then you can do density in 1U,
but not necessarily performance with data management in 1U, right? And then it has to be dense,
because you have to take all the storage that is there in each of these servers and consolidate
them at the bottom of the rack. So it has to be dense also. The second guiding principle we used
was it had to use commodity components, because hey, at the end of the day, we don't want to build
an exotic piece of a system and then charge an arm and a leg, like $3, $4 million per box.
That's kind of a non-starter.
We wanted to democratize this whole notion of what we were trying to do.
When you say commodity components, you're talking about chips and that sort of stuff, rather than something you had custom designed?
That's correct.
All the chips we use are readily available.
You can go and get them. All the chips we use are readily available. The third one is...
Okay, but you guys did do your own circuit board and mechanical engineering?
That is correct. We did the system design. We didn't do any of the silicon design.
We also talk to people who say, you know, commodity components, meaning Dell servers. Well, that's because, for them,
anything that is software-defined has to run only in a server.
I think it's a little crooked thinking, in my opinion.
The third one was reliability.
I mean, what good is a fast system
and your consolidating storage for the rack if it's not reliable?
So we had to build availability, reliability, fault tolerance into the system, which was key for us.
The fourth one was we had to add data management, sufficient data management,
because, you know, the blast radius obviously increases when you consolidate.
So we had to make sure that, you know, we had, you know, RAID 6.
We supported some of the things which were wanting in the DAS world.
For example, how do you do a backup?
How do I do a consistency group backup
for a distributed application?
How do I do thin provisioning?
How do I do snapshots?
These things were a little foreign in the DAS market.
Though we have known this in storage.
You don't do thin provisioning in DAS.
You stick an SSD in and that's how much that server has.
Exactly.
Because we had to give the economics of pooling, we support all that stuff. The fifth big
guiding principle was ease of use and frictionless deployment.
At the end of the day, if you have a product that requires
let's say, you know, an exotic purple cable, if I may use that word.
I mean, you're lost. You don't want that.
It had to fit into what people normally do. So frictionless deployment,
not necessarily put agents across all the 1,000, 2,000
servers that are out there and upgrade them and keep the lifecycle on. So it has to be
plug and play. And ease of use was very important for us.
So these are the five. Let me get back to one of the things you said. You mentioned data
management. Not a lot of the NVMe over fabric startups these days
have data management at all.
Other than, you know, it handles multiple NVMe namespaces and that sort of stuff.
But that's about it. Snapshots and things like that
are kind of foreign concepts to this world.
You know, like I said, I'm a storage lifer.
For me, storage without this data management doesn't exist, in my opinion,
as a shared storage.
So you're absolutely right.
You really were in charge of volume manager for a while, weren't you?
Yes, I was a CTO for volume manager, file system, clustering, VCS,
all that stuff.
So, you know, look,
that's where the real challenge is.
When you start doing data management is when the CPU starts
getting involved. When the CPU
starts getting involved, that's when,
once you start to tax the CPU,
you have to look at the number of PCIe
lanes and how it scales,
all that stuff starts coming in.
One way to scale with data management is to say,
look, I'm not going to do any data management in my box,
I'm going to put agents on all the hosts,
use one core of every host.
If there are 20 servers, I'm going to use 20 cores,
one per server, and I'm going to get the lanes out,
I'm going to do all that stuff.
That's one way of doing it.
But doing data management inside the box at speed is rocket science. It's not easy. And that's exactly what made us
look at this entire design outside in. You know, how do we do this at speed? What's the
number of CPUs we need? That's why we have 20 controllers to do all this work inside.
20 controllers?
Absolutely.
The box actually has, you know,
so the way we designed it was,
you know, you don't want to build a God box here, right?
It's a petabyte and 20 controllers
and you go try to sell that,
you'll never get anywhere.
What we did was,
it had to be consumable in the sense,
you should be able to increase the number of controllers as you want.
That's increasing the performance.
You should be able to increase the amount of disks inside as you want.
So they are independent from each other.
So you can start off with four controllers
and go all the way to 20 controllers.
And that allows us to scale the performance
independent of the capacity.
So there's 20 controllers.
That's what we do.
And each controller is one Xeon D?
That's correct.
And then there's a PCIe switching fabric
connecting them to all the SSDs?
That's correct.
So that was another challenge.
This is not as easy as it seems.
One of the classic problems is, if you look at PCIe,
the PCIe device tree is actually rooted at the CPU complex.
Now, if you have 20 controllers, which one?
This is a challenge.
Well, there's MR-IOV, but nobody ever did that.
It has been a challenge. The specs are kind of there, kind of not there.
People tried doing this.
It was not easy.
In fact, I'll tell you honestly, we don't do MR-IOV.
We actually found out an ingenious way to do this with what is standards today.
And I will not go into the details of that because that's a lot of the IP that we have.
But in essence, we've got the ability
for all these controllers to look at all the drives
and do rewrites to these drives,
and we have achieved that.
Dual path?
Oh, yeah, dual path.
Actually, any number of paths.
There's no limitation.
We support dual paths today from the external host side.
But between the controllers and the media itself,
we provide availability. So one of the things we did inside, this would be interesting,
is we use standard U.2, you know, NVMe drives. Okay, so it's commodity drives
you can go buy from any of the big vendors that are out there, and we don't require dual-path drives at all.
We don't need that.
We can actually work with single-path drives,
and that's perfectly okay for us
because we have a small, you know, mezz card kind of...
Oh, like a SATA to SAS interposer?
Yeah, it's an interposer.
That's a perfect word.
So interposer, which actually detects if a fabric fails
and then switches over to the next fabric.
So in essence, you can actually get all four lanes active
even though you have availability.
It's not divided by two, one for each side, and we don't do that.
So get full bandwidth.
And it's cheaper. It's not redundant to use a single path one.
So I've now got this petabyte of storage and 20 controllers.
And I want to provision an NVMe namespace to a server.
Is there a GUI, a CLI, a REST API, God willing?
Excellent question.
So before we go there, I actually want to correct something that he said.
It's not an all or nothing.
So you can start with 14 terabytes, depending on the drive capacity.
Okay, I have 500 terabytes and 12 controllers.
12 controllers.
So we have a web-based UI.
So the first principle is, the way we did this stuff was,
we had an API first model.
That means we developed the web services API.
On top of that, we built a CLI,
a web user interface on top of the API,
as well as expose the API.
If you go to our web user interface,
there's a small link out there that says,
go read the API.
So it's self-documenting.
You can go and use it and all that good stuff.
So there's a full-fledged web user interface.
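(Just to make the API-first idea concrete, here is a minimal sketch of what driving such a web services API from a script could look like. The address, endpoint paths, and field names are hypothetical, invented for illustration; the real API is whatever the self-documenting link in Pavilion's UI describes.

# Hypothetical sketch of scripting an API-first array. The host name,
# endpoint paths, and JSON fields below are invented for illustration;
# they are NOT Pavilion's documented API.
import requests

BASE = "https://array.example.com/api/v1"   # hypothetical management address
auth = ("admin", "password")

# Create a thin volume in an existing RAID group, then read back its details.
vol = requests.post(f"{BASE}/volumes", auth=auth, verify=False, json={
    "name": "mongo-data-01",
    "raid_group": "rg1",
    "size_gib": 2048,
    "thin": True,
}).json()                                   # verify=False only because arrays
                                            # often ship self-signed certs
print(requests.get(f"{BASE}/volumes/{vol['id']}", auth=auth, verify=False).json())
)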
So if you look at the workflow, it's as simple as you create an equivalent of a RAID group inside
with a minimum of 18 disks.
That's how we do wide striping inside.
And then once you create a RAID group,
you create a volume, assign it to a controller,
and on the host side, you just do an NVMe discover
with that IP address for the controller
and then connect to the namespace that you want
and you're done and you're running.
So that's how easy it is.
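(On the host side, with a recent Linux kernel and the standard nvme-cli package, the discover-and-connect step he describes looks roughly like the sketch below; the IP address and NQN are placeholders.

# Host-side sketch driving the standard nvme-cli tool from Python.
# RoCE uses transport "rdma"; an NVMe/TCP setup would use "tcp" instead,
# with the vendor's host driver, since NVMe/TCP was not yet standardized
# at the time of this episode. Needs root privileges.
import subprocess

PORTAL = "192.168.10.50"                          # placeholder controller IP
NQN = "nqn.2018-06.com.example:subsystem1"        # placeholder subsystem NQN

# Ask the target what subsystems it exposes.
subprocess.run(["nvme", "discover", "-t", "rdma", "-a", PORTAL, "-s", "4420"], check=True)

# Connect; the namespaces then show up as /dev/nvmeXnY block devices.
subprocess.run(["nvme", "connect", "-t", "rdma", "-n", NQN, "-a", PORTAL, "-s", "4420"], check=True)

subprocess.run(["nvme", "list"], check=True)
)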
I mean, we also have security authentication
and all that stuff if you want to do all that. But in essence, see, this is the beauty of NVMe.
It's as simple as using something like an NFS, right? You know, the IP address,
and you know, the share export from the NFS box, you just mount it and you're good to go.
That's how simple it should be to consume.
That's how simple it should be to failover.
That's what we do.
You have a 20 controller cluster in 4U.
It's amazing.
That's correct.
I mean, these are like racks of storage here.
I mean, 20 controller.
Exactly.
By comparison, we're used to their little
controllers, right?
I understand. I understand. But they're
also blazingly fast controllers.
Yeah. I mean, look, that's the
beauty of using commodity parts,
right? You get Intel
doing great innovation around
the Z on Ds, and the
ARM guys are also catching up. So
you know, why build my own NAND and my own ASICs for controllers
when these are readily available?
Okay, and in terms of data services,
I've already heard thin provisioning, which is just logical.
Yeah, so thin provisioning, we also do RAID 6.
We do snapshots and clones, writable clones.
We don't do compression yet.
It's in the works.
We don't do deduplication.
Because, you know, look, at this speed,
people who really care about going to NVMe at this speed,
they don't want you to do deduplication.
In fact, that's what we've heard.
I think that's probably true for the next two years or so.
And we'll catch up.
As it becomes normal, then, I mean,
you deduplicate to make it more cost-effective.
Yes, exactly.
And as that performance level
becomes more normal,
then people will start putting
more cost-sensitive applications on.
And, you know,
maybe there'll be a new company
and maybe it'll be us
which will do, you know,
dedupe for NVMe storage first?
I don't know, you know,
but it's a natural progression.
You know, here's a joke I use all the time:
the best part about being a CTO in the IT industry is
you really just have to look back to see the future.
History has been repeating itself.
Well, you don't have to tell us, we've been around a while.
It's basically the premise of this show.
Okay, so you do RAID 6, and is it always 16 plus 2, or can I add drives to that?
So it's always 16 plus 2.
Okay, so expansion is in 18s, which is fine.
In 18s, yeah.
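(For readers keeping score, the 16+2 geometry works out as follows; straightforward arithmetic, with an example drive size assumed for illustration:

# RAID 6 16+2 arithmetic: usable fraction and expansion increment.
data_drives, parity_drives = 16, 2
group_size = data_drives + parity_drives          # 18 drives per group
usable_fraction = data_drives / group_size        # ~0.889

drive_tb = 2.0   # example drive capacity; any equal-sized drives work
print(f"Each {group_size}-drive group adds {group_size * drive_tb:.0f} TB raw, "
      f"{data_drives * drive_tb:.0f} TB usable ({usable_fraction:.1%})")
)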
And how do you handle controller resiliency?
Good question.
So, you know, what we have built, we serve blocks.
We don't do files, by the way.
We serve blocks.
That's NVMe over Fabrics, right?
But in order to do things like the provisioning and snapshots and clones,
we pretty much had to build the equivalent of a cluster file system without the namespace.
Oh yeah, there's always metadata.
Each controller, yeah, there's always metadata.
Now when we built that, when we built something
like a cluster file system inside,
we built such a way that things are active-active.
Even a volume can actually be assigned to multiple controllers,
and you should be able to do IO on both the controllers.
That's what we built for.
We have not enabled that in this release.
Active-active on a per-volume basis is not enabled,
but even though the controllers are all active,
they can be doing other work.
Okay, so it's kind of a broad-spread ALUA?
Yeah, it's a broad-spread ALUA.
So when one of the controllers die,
we have an internal heartbeat mechanism
which detects a controller going away,
or the host side detects that there's a link breakage
and starts the connection over to the other side,
and we do a failover and I/O continues.
And we have to do the failover in a small window of time,
otherwise you start getting timeouts.
So we've taken care of all that stuff.
Do you need custom path selection plugins for operating systems and hypervisors, or no?
Right now, see, look, right now NVMe-oF is in its infancy. We work with Linux,
and the Linux community is very active with multi-pathing, and we have made some enhancements to the current Linux open source,
and we put it back on the open source
with multi-pathing enabled.
So the next version of Red Hat,
you'll probably have full multi-pathing enabled.
So we've gotten some flags.
Some people say, oh, they're doing multi-pathing,
so it's no longer agentless.
You've got to put an agent.
My thing is, an agent, in my opinion, is not all that bad,
but I don't want to maintain the agent.
Next release, if Red Hat and Windows and VMware catches up,
I'm not in that business, right?
My technology should allow for that.
Yeah, as soon as the standard covers those things,
you don't want to be doing it.
That's correct.
That's correct, yeah.
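(As a side note, not Pavilion-specific: on stock Linux kernels built with CONFIG_NVME_MULTIPATH, native NVMe multipathing is a kernel knob you can inspect, roughly like this.

# Quick check of the stock Linux native NVMe multipath setting.
# On supporting kernels this sysfs parameter reads "Y" or "N"; it can be
# set at boot time with nvme_core.multipath=Y.
from pathlib import Path

param = Path("/sys/module/nvme_core/parameters/multipath")
if param.exists():
    print("native NVMe multipath:", param.read_text().strip())
else:
    print("nvme_core has no multipath parameter (older kernel or module not loaded)")
)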
So the clone stuff, you mentioned that you support read-write access.
Is clone a physical copy, or is it more or less another version of Snapshot?
And I assume Snapshot is space efficient, right?
That's correct.
So we just duplicate the metadata that's supported, not the data.
And we use a redirect on write model.
So clones are very efficient.
They have no impact on the performance of the
original. And you can write as long as you have space and then you'll get a bunch of alerts when
you're getting close. And that's about it.
And the granularity on that is in the order of kilobytes, not gigabytes?
In the order of four kilobytes.
Okay, I'm sorry, you can snapshot a four-kilobyte block? I thought this would be per volume.
No, no, no, no. Yeah, that's the granularity.
I gotcha, redirect on write. I understand.
By the way, just a disclaimer, I'm not saying it'll always be four kilobytes. We could go to 16 KB as the drive sizes increase, right?
Yeah, and small changes like that should have small effects.
That's correct.
We have dealt with systems with 16-megabyte page sizes,
and they go and do snapshots.
It's hard to call that space efficient, that's all.
Yeah, yeah.
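(To illustrate what redirect-on-write at a fixed block granularity means in general terms: the sketch below is a toy model for illustration only, not Pavilion's metadata design.

# Toy model of redirect-on-write snapshots at 4 KiB granularity.
BLOCK = 4096

class Volume:
    def __init__(self):
        self.table = {}        # logical block number -> physical location
        self.snapshots = []    # each snapshot is a frozen copy of the block map

    def snapshot(self):
        # Snapshot duplicates only the (small) block map, never the data.
        self.snapshots.append(dict(self.table))

    def write(self, lbn, new_location):
        # Redirect-on-write: new data always lands in a new physical block,
        # so snapshot tables keep pointing at the old, untouched blocks.
        self.table[lbn] = new_location

vol = Volume()
vol.write(0, "phys-100")
vol.snapshot()                  # snapshot still maps LBN 0 -> phys-100
vol.write(0, "phys-215")        # live volume now maps LBN 0 -> phys-215
print(vol.table[0], vol.snapshots[0][0])
)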
So, I mean, internally, in order to maintain four kilobyte block snapshots,
pointer redirects, and that sort of stuff,
typically you would need some sort of shared storage,
shared memory across the cluster of controllers.
You don't seem like you have anything like that.
Yeah.
So one thing, in this version, we don't use any NVRAM or anything like that.
But one of the beauties of what we have done is we have a PCIe fabric in between,
which connects the controllers to the drives and also connects the controllers to the controllers.
So in essence, each controller can actually watch a portion of memory of another controller. So one of the ways we do for really fast metadata
is we actually replicate our metadata
across multiple memory systems inside the box
so that we have high availability there.
And then we have enough hold-up time through supercaps
to actually give us enough time to flush the metadata
in case there's a power outage or a failure.
So that's how we get resiliency.
And performance is, look, we're writing at memory speed,
you know, pretty much with the latency of PCIe.
So look, a typical write across from one controller to another controller
will be in the order of 500 to 600 nanoseconds.
Can't beat that.
Can't beat that.
So this system is designed to, hey, cost less, number one,
because we didn't use fancy parts.
Second one is designed for speed and designed for the new class of media
from the ground up.
This is not a retrofit.
And we did not pick open-source ZFS and just hack away at it
and then call it a file system for NVMe media.
We didn't do that.
We designed the system,
we designed the software for our hardware,
for the media that we are trying to address,
for today and for what's going to happen tomorrow.
So, I mean, have you guys done any testing
with Optane and that sort of stuff,
the storage class memories?
We have, but I think, you know, look,
we're not serious about it today.
I'm sure we'll get serious about it very soon.
The price has to drop a little bit more.
I think even...
Intel has to be able to produce enough
for the price to drop a little bit more.
Yeah, so it's a chicken and egg thing.
So, I mean, look, just now,
I think the U.2s have started rolling out,
which have decent capacity.
If any U.2, I mean, look, the box is open,
so you can pop out one of the drives,
a bank of 18 drives,
pop in your Optane drives
and measure the performance,
and it should be fast.
I mean, look, we're not publishing numbers
on the Optane.
Do the drives in a RAID 6 group
have to be the same capacity?
They have to be the same capacity,
but not across different groups.
They have to be the same capacity, that's one thing.
We recommend they be from the same vendors,
having the same characteristics
in terms of performance.
Because if even one guy slows down,
it's going to slow down all the others.
Because we wide stripe it.
Yeah, so all the usual RAID concerns.
That's correct.
So, I mean, what sort of market are you trying to go after with this solution?
Well, excellent question.
So even though the solution, as it stands, is more horizontal in nature and can go after anything,
as a startup, the most important thing for us is to have focus.
Focus both in our strategy as well as the market we go after.
We see a big gap where there's a need for performance.
And if you look at the last five, ten years,
people are writing applications with predominantly
the open systems architecture, right?
In the sense, they pick up a MySQL from somewhere
or a MongoDB or a CouchDB.
And it's all scale-out in nature,
Cassandra and all that stuff.
And they start building this in the department level
and they transition over to the IT.
And IT says, oh my God, what am I dealing with here?
I have, you know, a big checklist of 20 things
before something is compliant in my IT world.
And they find out, you know,
how these things are not able to fully comply
because they're all distributed,
sharded all over the place.
And they're trying to grapple with it.
So we believe that we can, A, go to the market by saying,
look, we're going after these new age applications
where there's an absolute need for speed.
There's an absolute requirement for being compliant.
And you want to, you know, data is growing so much,
the cost of NVMe drive is still not as cheap as disks, not yet.
So they're expensive.
So you want to conserve and increase the utilization on these things
and not put three copies and all that stuff.
So that's what we're going after.
And we're seeing this, you know, go after the Splunk world,
go after the MongoDB world, go after the, you know,
Couchbase worlds and Cassandra's.
And that's the market we're going after, the new age application.
We're not going after the traditional place where you have Fibre Channel SANs,
and that's a heavy lift.
And, you know, there's too many cooks there,
and they have great account control,
and we don't even see them where we go to.
Okay, and because those applications have their own replication models
and replication isn't that important?
Yeah, they have their own replication model. Absolutely.
I mean, one of our customers actually made this basic
statement saying that within the rack, it's your
job to provide resiliency, right?
So they call that local resiliency. Local resiliency is
the job of Pavilion Data and its appliance to provide. Global resiliency, as they call it, which is going across racks or
going across data centers or whatever, they actually put the onus more on the apps. You know,
it was the old days where people were doing SRDF and replication with the box. Today, the apps are much,
much more efficient in doing, you know, replication. Look at Oracle Data Guard.
They do it directly at the SQL level, transaction level.
Well, you call it the old days,
but most checks in the world still get generated
out of applications that work that way.
I mean, I'm sure if I'm talking to a bunch of guys in mainframe,
they would say the same thing too if I call them old days.
Oh, yeah.
A lot of business is still run on that.
So, you know, I mean,
I'm talking about, you know, chronologically.
I mean, the new era,
as we call it, the modern era.
And so, you guys, I mean, as a product,
you know, startups typically have a challenge trying to go outside of
the U.S. or things of that nature.
I mean, are you available globally,
or how does that play out?
Well, we are... well, it depends.
So we're nimble.
Let me put it this way.
Not Nimble as in the company, but we're nimble in our actions.
And if we see a big opportunity in a particular area,
we're nimble enough to go and stand up something out there.
We're currently available in Europe, and we are available in USA.
Okay, and you're shipping now?
We're shipping now.
And when a customer purchases a product, is it on a
capacity basis, on a controller
basis, or how does that work out?
So right now, it's on a capacity basis. And that
includes like a line card,
so they get a certain
number of line cards per terabyte or something
like that, is that how it would work?
That's correct.
We sell in units of five controllers.
Five controllers?
How can you deal with an odd number of controllers?
There's something illogical here.
Well, yeah.
So actually, we have 20 controllers.
And we have four virtual arrays.
Right.
No, I understand.
So there would be four sets of five controllers.
That's correct.
We'll take the hit on the one.
Yeah, yeah, I guess it's easy to explain to them: one fourth is what you get. So it's one fourth, half, three fourths, and one to grow on.
Or actually one as a spare.
Yeah. So when you go to a, and I'm not sure I've got the right term,
the second subsystem that would be in a subsequent rack.
Does your management system command all those rack storage systems,
or is it a separate management system per rack?
Yeah, today, look, we're all walking in the other way.
Today, we are one rack at a time.
Unfortunately, that's how it is. And with my background in Veritas, I've built several of the
global distributed systems to manage. You know, one thing we did was we, we built it such a way
that the entire management or the API layer works on a PubSub model, right? So it's easy for us to
federate across multiple boxes.
In fact, there's a component here which
sends proactive
telemetrics over to a cloud
component that we have so that
we can do support, intelligent support.
We have that also in the product.
So we can find
out if a drive is failing and we can
call up and we can tell you, hey,
did you know that you've not configured your volumes correctly? We can actually do that.
No data, we don't send data. It's just telemetrics about, you know, some configuration stuff,
and if there's any outage going on or potential components not working, we send that over.
So VR, do you know how many, I'll call it sensors, you're collecting on a periodic basis like that?
I can't even count.
I mean, there's like IO, throughput, bandwidth, latency, temperature.
Yeah, yeah.
For different various components.
There's a lot of things we collect.
There's a lot.
Right, right.
And we don't send it in real time.
I mean, we bunch them together and send it, unless it's an alert.
If it's an alert, we send it in real time.
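(The batch-versus-alert behavior he describes can be sketched in a few lines; this is purely illustrative, with invented names and thresholds, not Pavilion's telemetry code.

# Illustrative sketch: routine metrics are buffered and uploaded in batches,
# alerts bypass the buffer and go out immediately.
import time

class TelemetryUploader:
    def __init__(self, batch_size=100):
        self.batch_size = batch_size
        self.buffer = []

    def send(self, payload):
        count = len(payload) if isinstance(payload, list) else 1
        print("uploading", count, "record(s) to the cloud component")

    def record(self, sensor, value, alert=False):
        sample = {"ts": time.time(), "sensor": sensor, "value": value}
        if alert:
            self.send(sample)            # alerts are sent in real time
            return
        self.buffer.append(sample)
        if len(self.buffer) >= self.batch_size:
            self.send(self.buffer)       # periodic bulk upload
            self.buffer = []

t = TelemetryUploader(batch_size=3)
t.record("drive7.temp_c", 41)
t.record("ctrl3.latency_us", 130)
t.record("drive7.temp_c", 72, alert=True)   # sent immediately
)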
So we also have the other way.
We can actually push firmware updates from the cloud
if you so want it.
So we can just do installation from there.
You know, those things are fancy stuff.
We don't do that in our release one.
But all the mechanics are there.
We're actually not releasing that feature yet.
But the sense is, you know,
these should be autonomous entities
which are out there, and we should be able to
update the firmware on demand from a customer
or allow a portal for a customer into our global website
and they can see their arrays and they can manage it themselves.
So that's the thought process that we have.
You mentioned that you don't have any NVRAM.
I assume that means you're not
using memory caching for
the data? So you're just going directly?
No, we don't.
NVMe is so fast.
I understand. I just wanted to make sure I clarified
that up front.
When you write something and we
acknowledge that it's written, it's really written.
Yeah, I understand.
So the beauty of that is it allowed all our controllers
to be completely stateless.
There's no state stored in any of the controllers.
That's the beauty of something which we did.
When you have 20 of them,
each controller can be serving volume one right now.
It could just disappear.
Next time it comes up, it could be serving volume two,
so on and so forth.
Yeah, but you do have to optimize your RAID algorithm.
Yeah, but there's a group of controllers
that are assigned to a particular RAID group, you know.
So, and you can float around in that,
and that's perfectly just possible.
But the essence of the design
where the controllers were very stateless,
and part of that guiding principle
is what made us say,
look, we're not going to have cache,
we're not going to have NVRAM inside the controllers,
then you crash.
What kind of 4K write latency are we talking about?
Let me put it this way.
Roughly end-to-end, we're looking at, for write latencies,
around 100 to 125, 150 microseconds.
That's what we're looking at.
That's device-level latency.
I mean, that's what you get from an NVMe device.
No, no, no, no, no, no.
NVMe writes are actually pretty fast.
You can actually do it at under 40 microseconds.
Because, you know, most of these medias, they have DRAM inside.
Yeah.
It would be interesting to see in your architecture the performance difference between DRAM heavy and DRAM light SSDs.
Hmm.
Interesting. Interesting.
Yeah.
Because it looks like you're the perfect use for a smart SSD.
Smart SSD?
Give us more ideas.
Give us more ideas.
You know, we love it.
I mean, this is the best part of being in a startup.
You know, we are so nimble.
Right.
I mean, again, nimble, right? I mean, again, flexible.
We're so nimble that we could actually do interesting stuff which is just
not possible in large companies.
So this has been great. Howard, are there any last questions for VR before we sign off?
No, I just had started having a very strange thought about this architecture and host-managed SSDs, where you could direct all of your writes to one of the four banks and have the other ones doing garbage collection at the moment.
But I need to think that through a little bit.
So, you know, now that you mentioned that, one interesting thing we do here
is we actually manage space inline.
In the sense that we don't go around in the background
doing garbage collection at our layer.
Because every layer that does some form of redirect on write
has to do some form of garbage collection.
The SSDs do that, and sometimes the software,
the host software or the controller software that runs; if it does redirect on write, it also has
to do it on top. In our layer, for our software, we don't do it. The SSDs may still be doing it. One of the
big reasons why we did that is we could actually interface with and understand the SSDs more, so that
we know exactly where they're doing garbage collection
and actually avoid going to those areas when we have to
to give predictable performance, right?
So we can do that.
And also, it sets the stage very nicely
for integrating with things like Project Denali,
which is coming out.
One last question for me.
Are you guys using a log-structured file system
on the back end of this, on the NVMe devices?
That is correct.
Hey, VR, is there anything you'd like to say to our listening audience before we sign off?
Thanks, by the way.
Thanks for giving this opportunity to talk to you guys, by the way.
And I hope I answered all the questions that you have.
Do come over and visit our site, www.paviliondata.com.
We have a Twitter handle as well as we're on LinkedIn.
Please subscribe to that stuff.
I think we have a great product,
and our initial customer traction has been pretty good.
The feedback has been pretty good.
And I'm hoping that this fits in pretty much every enterprise
that we go out today.
So thank you very much for this time.
Okay.
Well, this has been great.
Thank you very much, VR, for being on our show today.
Thank you.
Next month, we will talk to another system storage technology person.
Any questions you want us to ask, please let us know.
And if you enjoy our podcast, tell your friends about it,
and please review us on iTunes as this will also help get the word out.
That's it for now.
Bye, Howard.
Bye, Ray.
Until next time.