Grey Beards on Systems - GreyBeards talk VMware agentless backup with Chris Wahl, Tech Evangelist, Rubrik

Episode Date: August 19, 2015

In this edition we discuss Rubrik’s converged data backup with Chris Wahl (@ChrisWahl), Tech Evangelist for Rubrik. You may recall Chris as a blogger on a number of Tech, Virtualization and Storage Field Days (VFD2, TFD extra at VMworld2014, SFD4, etc.), which is where I met him. Chris is one of the bloggers that complains …

Transcript
Starting point is 00:00:00 Hey everybody, Ray Lucchesi here and Howard Marks here. Welcome to the next episode of the GreyBeards on Storage monthly podcast, a show where we get GreyBeards storage and system bloggers to talk with storage and system vendors to discuss upcoming products, technologies, and trends affecting the data center today. Welcome to the 23rd episode of GreyBeards on Storage, which was recorded on August 5th, 2015. We have with us here today Chris Wahl, tech evangelist at Rubrik Inc. Why don't you tell us a little bit about yourself and Rubrik, Chris? Oh, absolutely. Thanks for having me on.
Starting point is 00:00:43 So my name is Chris Wahl. I'm @ChrisWahl on the Twitters. I've done a number of things over the years around VMware virtualization. I'm a VMware Certified Design Expert for data center and network virtualization. I wrote a book called Networking for VMware Administrators, and I run a blog called wahlnetwork.com. That's kind of the things about me that I think are somewhat interesting. As Sean Connery would say, I read that book. Was it any good, Howard? Halsey was an idiot.
Starting point is 00:01:12 Oh, wait, sorry. Your book, yeah, your book was fun. All right. Gosh, okay. So what about Rubrik? What the heck is this Rubrik thing? Yeah, I'm like three and some change weeks fresh at the Rubrik group. So you should know everything then.
Starting point is 00:01:29 I've been trying to drink the Kool-Aid. It's pretty tasty, I've got to say. I think it's fruit punch flavored, which is my favorite. Oh, God. Okay. I'm not a marketing kind of guy. I'm very technical, so we'll keep it straightforward. Rubrik is all about data protection. They market it as converged data management. And the idea is to make a lot of the complexity around data protection, you know, doing backups, being able to offer restores, archival, et cetera, go away, make it really
Starting point is 00:01:57 simple, offer it as a 2U appliance where there's no backup software even required and offer cloud as an archive choice instead of tape. So I'd say that's the elevator pitch of what Rubrik is. Yeah, I saw something on your website. Take the backup out of recovery. What the heck is that? How can you not do backup? Well, it's taking the backup software out of recovery. You don't have any software to deploy
Starting point is 00:02:22 because that's the converged part of the converged data management. You take all of the backup software that you used to have to run and the proxy servers and catalog servers and search servers and all those servers that you'd have to use in order to ingest data and index it. And you bake that into an appliance. And then you've taken care of the storage problem as well, because we'll ingest the data, catalog it, index it, figure out what the metadata needs to be, as well as putting the data on the platform and doing all those fancy data efficiency things that everyone loves, you know, dedupe, compression, all that jazz. And then we'll archive it off to cloud, and that also gets the benefits of the dedupe. So you're not just a backup appliance, but you're also a storage appliance? It's almost a convergence between storage and backup. Is that what you've got?
Starting point is 00:03:08 Yeah, you're getting it. We've been seeing a lot of, you know, I mean, really, backup appliances have to include storage unless you're, you know, here's the server with NetBackup pre-installed. Good luck. But, you know, I'm kind of interested in the, we don't need any backup software because some of us have heard this before. Yeah. Well, go ahead with your question. It's so deeper. There's, there's a whole series of things I rely on backup software for. And making the copy is one,
Starting point is 00:03:46 but not the most important of those things. The cataloging, you know, and the finger, you know, so what happens if I, you know, inadvertently delete my Oracle database or something like that? Is it automatically being copied to some off-site location and stuff like that? Or how does that work? Yeah, I think it would help if we go in kind of that workflow a little bit. And that may shake out questions.
Starting point is 00:04:12 Let's start at the beginning. Where do you collect your data, young man? All right. So today we're a VMware-associated appliance. We'll do backups for a VMware environment. And in the future, we're going to do it for Hyper-V, KVM, bare metal, et cetera. But that's what we are today. And so when you set up the appliance, you tell it, hey, here's where my vCenter server or servers are. You give it, you know, name and password, all that kind of jazz. It will say, okay, you have
Starting point is 00:04:40 these number of virtual machines. What do you want to do with them? And you build a policy. And so the policy will say, I want to take backups three times a day and twice a week and four times a month, et cetera. And that's the SLA that we call it. The SLA then overlays on the virtual machine. And at that point, Rubrik takes care of figuring out when and how to back up the virtual machines. So there's no backup jobs that you create. It's all SLA-driven. But you're also creating the storage for those virtual machines? I mean, the VMDKs, VVols, whatever the case may be? There's no magic LUN, VVols-type stuff.
Starting point is 00:05:18 There's no storage to configure at all. It's all handled for you. So you don't have to build RAID groups or anything like that. Right, because your appliance is the target. Yeah. And we're scale-out storage. It was written by these really smart guys at Google and Facebook and, you know, the typical next-gen startup kind of portfolio of talent. And so they've written the scale-out file system as well as the, you know, masterless system so that every node within the 2U appliance can be a backup server. And they all handle dedupe and the metadata amongst one another. So it'll reach out using VADP, the vSphere
Starting point is 00:05:52 API for data protection, to the host that has the virtual machine that is associated with that SLA domain and grab a backup using VADP. So it'll send the data over the wire. The first full is all the data, and then everything after that uses VMware's Changed Block Tracking, or CBT. Okay. Pretty standard on that front, right? And I can tell it, you know, I don't have to tell it by policy when to run the backup. I just tell it, this is the level of protection I want. But I can tell it, I can
Starting point is 00:06:26 throttle back the load you're presenting, right? Yeah, there's a couple nerd knobs that you kind of get for that. You know, if you wanted to just let it... Yeah, it's one of my favorite terms. If you want to just let it do its thing, you can say exactly what you were intending, Howard.
Starting point is 00:06:41 Do it every four hours and it'll just figure it out. But you can also put bumpers on it and say only run at night or only run in the morning if that is your requirement. So are you acting as the VMDK or VVOL storage at that point? Or are you just a backup solution for the VM itself? Yeah, we will grab. So we act like a backup target. Typically, we just look like an NFS data store, but it doesn't really matter for the case of doing the VADP backup
Starting point is 00:07:09 because it's just going over the wire to one of the nodes, sending the VMDKs of the virtual machine, and we can crack those open once they've hit the appliance because they're actually being written to a flash drive because every node inside of our system has a combination of flash and disk. So we ingest to flash, try to make it super quick, and every node is doing an ingest, so there's no, like, choke point.
Starting point is 00:07:33 And at that point, we can crack open the VMDK and index all the files that are in there and then share that index in a distributed manner. So we know that we have the virtual machine, and we also know we have all the files within the virtual machine, and they're all searchable in real time, no matter if they're on the appliance or in the cloud. So it really does converge a lot of those, you know, backup servers, proxies, catalog databases, all that
Starting point is 00:07:54 stuff into one platform. Okay. Does that index extend to common applications like Exchange and SharePoint? That's a goal, right? So right now we have, with version 2.0, which came out on August 19th, we have the ability to use our own VSS providers to do very granular and quality quiesce of applications like Exchange and SQL and Active Directory and so forth. So we can make sure to do transactionally consistent backups of those workloads.
Starting point is 00:08:28 I think in the future the goal is to offer more support around single table restore, single mailbox restore, etc. Okay. Go ahead. You mentioned cloud as a solution here. So behind the appliance scale-out file system is cloud? Yeah, there's a couple options that you have today. So remember I talked about that SLA policy for doing data protection?
Starting point is 00:08:57 Within the policy, you can also say, all right, I want, you know, let's say you have a seven-year retention policy on this SLA. You could say I want three months of it to be local, and everything after that, I want to be in Amazon S3 or OpenStack Swift or any other, you know, S3- or Swift-compatible object store. So you just make that part of your policy, and it'll figure out, okay, I've hit three months on this particular data point. Let me start shuffling it off to whatever that archival cloud destination may be. And that's all handled automatically for you. So it kind of prunes it out or ages it out to your cloud target. In the case of Swift, it could be even local. It could be a local cloud, you know, on-premises cloud versus using Amazon's public cloud. Back to VSS for a second. So your
Starting point is 00:09:41 VSS provider does application support and log truncation and the like? Yep. Basically, you just provide some credentials so that the Rubrik appliance can tap into the guest operating system and pop up or drop in the VSS provider. And then we'll call that instead of using the... I know you love VMware snapshots. So we don't use that one unless we have to. Right. Okay. So do you support, I would call it, replication of the backup store? So one of the reasons I do backup, in my case, is for disasters here at the Silverton Consulting worldwide headquarters.
Starting point is 00:10:27 So how would I do that from an off-site perspective? Is it something I would automatically do it to the cloud? Archive it to the cloud? Archive has been around since we came out. We GA'd in May of 2015. And actually, when we did our announcement on August 19th, what we did introduce was replication. So three months after GA, we've already got replication of the product. It's a box-to-box, so you can go from primary to secondary site, and they all have those SLAs
Starting point is 00:10:57 associated with them. So the replication's baked into the SLA. So if you have a gold SLA and you tell it, archive after three months, but I also want all the data replicated for the next three weeks, you can do that. You can get kind of complex with that. At the same time, the destination that you're going to, so that secondary site, can also be going to the cloud as an archive. So there's no limitation saying this is only a, you know, replication target. It can't do anything else. Ah. That's very nice. Well, thank you. Synchronous or asynchronous? I'm assuming asynchronous is a backup,
Starting point is 00:11:32 so that would be a reasonable way to go. Yeah, asynchronous replication based on... We don't want to hose your network connection between the WAN, so we'll make sure that what we're sending is data efficient. We're just sending the changes. Okay, so you're doing hash exchange and... Yeah, it's whatever's left after dedupe compression
Starting point is 00:11:50 and, you know, whatever we actually need to send to the other side gets sent. Nothing more, nothing less. And you mentioned dedupe and compression. So is it... The dedupe can have different regionality perspectives of it. You know, it can be global dedupe across the whole, you know, file system. It could be across a vSphere cluster. It could be across the replicated instances of the data.
Starting point is 00:12:14 Do you have, how does that work in your scenario here? Is it at a VM level or? Yeah, that's usually a pain point, right, is all these little silos of dedupe. It's something I've certainly not liked. The dedupe with Rubrik is global. It expands from the local copies of the data that we have across every single SLA. There's no boundary for the deduplication, and it also extends into the cloud if you archive to cloud. Okay.
Starting point is 00:12:49 So generally, as I build a retention scheme, I want less granularity as I go further back in time. Yes, I want to be able to go back seven years, but I don't want to go back to every day of seven years ago. Right. You guys can consolidate backups together, right? Oh, absolutely. Yeah. I think our gold SLA by default is something like every four hours and then once a day and then three or four times a week. And the system will go through that. It'll prune out. If I have 10 backups and I only need two of them, it'll go ahead and remove what it doesn't need. It'll prune those as the backup, you know, hits that older demarcation point. Okay, we have all these
Starting point is 00:13:29 weeklies. Let's get rid of all but the one that we need. And that's all automated. The same goes for if you deleted the virtual machine. There's nothing you have to do in Rubrik to say that you deleted it. It'll still hold on to that virtual machine based on the SLA policy until it's aged out and then eventually flushed away. And the SLA policy is at a virtual machine level? Well, it is its own thing. You can assign it to a virtual machine. You can assign it to a folder or a host or data center. It's really just saying anything within this logical object picks up the SLA policy. And then once that's happened, you're absolutely correct, Ray. It would be a per VM type of protection.
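The age-based pruning described above can be sketched roughly as follows. This is a minimal illustration and not Rubrik's implementation: the `Tier` type, the `snapshots_to_keep` function, and the tier windows are all hypothetical, and the schedule is invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Tier:
    """Hypothetical retention tier: for snapshots no older than
    `max_age`, keep the newest snapshot in each `bucket`-sized window."""
    max_age: timedelta
    bucket: timedelta


def snapshots_to_keep(snapshots, tiers, now):
    """Return the sorted snapshot times that survive pruning.

    A snapshot survives if at least one tier selects it; everything
    else is eligible for removal.
    """
    epoch = datetime(1970, 1, 1)
    keep = set()
    for tier in tiers:
        newest_per_bucket = {}
        for snap in snapshots:
            age = now - snap
            if age < timedelta(0) or age > tier.max_age:
                continue  # outside this tier's window
            bucket_id = (snap - epoch) // tier.bucket
            best = newest_per_bucket.get(bucket_id)
            if best is None or snap > best:
                newest_per_bucket[bucket_id] = snap
        keep.update(newest_per_bucket.values())
    return sorted(keep)


# Example: a backup every four hours for ten days, thinned to
# "every four hours for two days, then one per day" (assumed numbers).
now = datetime(2015, 8, 5)
snapshots = [now - timedelta(hours=4 * i) for i in range(60)]
tiers = [
    Tier(max_age=timedelta(days=2), bucket=timedelta(hours=4)),
    Tier(max_age=timedelta(days=10), bucket=timedelta(days=1)),
]
kept = snapshots_to_keep(snapshots, tiers, now)
print(f"kept {len(kept)} of {len(snapshots)} snapshots")
```

Because the decision depends only on snapshot age and the policy, the same logic keeps working for a virtual machine that has since been deleted: its snapshots simply age out of every tier over time.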
Starting point is 00:14:11 Because we're not going to say, oh, this disk is an SLA policy of gold and this one's silver. That would be a little mucky. So you wouldn't want to protect a VM half on gold and half on silver. It wouldn't make a lot of sense. So we'll take the whole VM as the granularity. Well, I update the database every 20 seconds, but I only patch the operating system once a day.
Starting point is 00:14:32 Yeah. I think it might be. I can see it. I can see where you're going, Howard. But at the same time, I'm thinking that's a lot of complexity. I'm not saying it wouldn't be stupid. It's just. All right. All right. So
Starting point is 00:14:45 deduplication at the file level, block level, chunk level. I believe it's fixed block, but... Fixed block. Yeah, it's... But it is global, like you're saying. By comparison to, say, a Data Domain that has to be variable block,
Starting point is 00:15:02 fixed block works fine for this because everything is already aligned because you're reading the VMDKs. Right, right, right. Yeah, from my background, it felt like the fixed versus variable block thing was kind of an old school discussion when the storage array wasn't really involved in the data stream. But now it is, so it's not a huge deal. What the variable is really needed for is when your backup software thinks it's talking to a tape drive.
Starting point is 00:15:28 Yeah. And, you know, so first it backs up the root folder, and then it backs up the lowest alphabetical folder, and then it hits stuff it's backed up before, but that's all shifted over four bytes, so now it doesn't dedupe anymore. And so Data Domain was to address that problem. But if you're using vStorage API DP,
Starting point is 00:15:49 you're getting everything aligned on 4K blocks already. You got it. You mentioned change block tracking. So, I mean, one of the challenges with backups, obviously, is this big scan they have to go to find out what's changed. With VADP, you get sort of a free view of just the changes to a VMDK or a VM? Yeah, it'll list all the blocks that have changed since we last grabbed a copy of the backup. But we'll still, when we ingest the data that's changed,
Starting point is 00:16:21 we'll still crack it open, check and see what's changed from a file level, figure out what we actually need to write to disk. The Flash layer is just kind of a landing pad. I mean, we keep other stuff at that layer as well. We keep metadata so that you can search in real time without having to pull off a disk. But for the most part, we ingest, we figure out what we need to write, then we flush it down to disk.
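The fixed-block point made a moment ago can be illustrated with a small sketch: fixed-size chunking dedupes aligned data perfectly, but the same bytes shifted by four no longer match any stored block. This is an illustration only, under assumed parameters; the 4K block size follows the discussion, while `block_hashes` and `new_blocks` are hypothetical helpers and not any product's API.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # 4K blocks, as mentioned in the discussion


def block_hashes(data, block_size=BLOCK_SIZE):
    """Split data into fixed-size blocks and fingerprint each one."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def new_blocks(store, data):
    """Add data's block fingerprints to the store.

    Returns how many blocks were not already present, i.e. how many
    blocks would actually have to be written.
    """
    added = 0
    for digest in block_hashes(data):
        if digest not in store:
            store.add(digest)
            added += 1
    return added


original = os.urandom(BLOCK_SIZE * 64)   # stand-in for a 64-block disk image
aligned_copy = original                  # same data, same block boundaries
shifted_copy = b"\x00" * 4 + original    # same data, shifted by four bytes

store = set()
first = new_blocks(store, original)       # everything is new
second = new_blocks(store, aligned_copy)  # aligned copy dedupes fully
third = new_blocks(store, shifted_copy)   # shifted copy dedupes not at all
print(first, second, third)
```

Variable-block chunking exists to recover from that kind of shift; when every read is already block-aligned, as with VMDK reads, the simpler fixed-block scheme loses very little.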
Starting point is 00:16:38 Yeah. I mean, the other big challenge with backup, of course, is the recovery. You know, how is that handled? How's it done? I mean, who's got access to it? Is it user level? Is it admin level? vCenter? Those sorts of things. So what's the recovery environment look like for Rubrik? This is where it gets a little bit different, I think. You know, you've got your traditional recovery type, you know, processes and workflows that you have in your head where,
Starting point is 00:17:06 okay, I want a file or a VM, and I'll go through that real quick. So if you wanted to grab a whole virtual machine out, you can do what we call an instant restore. And that's where you go to, you know, Rubrik, you type in the search engine, hey, I want this whatever VM, it's going to list every point where we have a backup. And at that point, you can kind of click on one and say, I want the VM at this point, and I'm doing a recovery, so I don't want the old one anymore. We'll actually mark the existing VM that's running there as a deprecated virtual machine. We'll change the name.
Starting point is 00:17:35 We'll power it off. And then we'll restore the backup that you're restoring from into the environment. So we just kind of replace the old one with the new one, or actually the fresher one with the backed up older copy. Thank you so much. Thank you so much for not overwriting. Yeah. Yeah. So, and that's helpful because you might, you know, say, ooh, I really porked this thing, but I'm not even sure if the two day old copy is the one I want. So it gives you the opportunity to kind of make sure that it's the one you want. We'll power it on and it'll run and, you know, typical restore. Or you can do single file restore.
Starting point is 00:18:08 And that works just like Google search. You'll type in a search bar. Hey, this is roughly the name that I'm looking for. And what I like is that it'll start auto filling in what it thinks you're trying to search for. Just like Google, you know, try to figure it out. And it's all, you know, real time, no matter if it's local or in the cloud. So I've done this a number of times where I'm grabbing files and it could be forever ago
Starting point is 00:18:27 in the cloud. And that way you can restore it based on whatever you want to do with it, into the virtual machine, on your desktop, whatever. Wait a minute. Let me go back. Howard mentioned something about you not overwriting the current VMDKs
Starting point is 00:18:44 when you do a restore like that. What does that mean? So let's say you wanted to restore a copy of the virtual machine from a week ago. Yeah. You know, typically that would mean you blow away the current one that's running in the system. Right, right. But that may not always be the greatest idea. You might actually find out that the one you restored, A, doesn't work, or B, that you left something on the original one that you wanted. So it just gives you more choice. You could just throw it away.
Starting point is 00:19:12 You don't have to keep it around. Oh, okay. But we want to keep it safe. I like the extra step that I have to decide to throw it away, and I can't throw it away accidentally. Yeah. Yeah. And this is on the primary storage.
Starting point is 00:19:24 We're not just talking... So the backups are not an issue, but you were talking the primary storage at the point of recovery creates a new VMDK for the VM and fires that up. Oh, no, we can run the virtual machine's data right on Rubrik. So we'll give you the VMDKs directly on Rubrik
Starting point is 00:19:40 running on that flash tier. Oh, I didn't know that. Yeah, and then if you want it... if you find it's going to be good. Just skip the important stuff, Chris. It's fine. Sorry. When did that show up? Did I miss that? It's misdirection. I have the rabbit over on my left, and I'm pointing to this thing on my right. Yeah, that's also part of that converged story is, you know,
Starting point is 00:19:59 we got that flash and capacity disk combo. So we'll run it on our flash for you, and, you know, you can get anywhere from 10 to 30K IO on a 2U box. So it's not going to hurt. So you can run it there, and if you decide you like it, just Storage vMotion it over into your environment. Okay, I got you. And so some portion of that flash is used as a cache for the running VMs? Yeah, there's a pretty significant amount of the SSD left over for running virtual machines on. Only a small amount has to be used for ingesting data and our metadata and the operating
Starting point is 00:20:30 system, things like that. Right. Right. Because you're scale out. And so that keeps moving across multiple nodes. You don't have to keep all the metadata in one place. Exactly. So I've said it gets a little bit different. Here's where it gets a little different. We also have the concept of instant mount. And so just like I said earlier, where we can run a virtual machine from a storage perspective on Rubrik, we can actually kind of do what would be called maybe a clone or a fork of the virtual machine. We can run a completely separate copy and not even touch the primary with the instant mount at any point in time. So you could say, I have a backup from last night. I want to run a copy of it. We'll actually spin up a copy of that virtual machine. We'll mount Rubrik to VMware as if it were an NFS data store. We'll
Starting point is 00:21:20 power on the virtual machine, but we'll disconnect its NIC, the virtual machine's NIC, so that we don't have an IP conflict. And at that point, you can log in and either change the IP and reconnect the NIC or just do a test on it, whatever you want. And there's also no limit on the number that you can spin up. You could say, I want 20 of these things, and we'll go ahead and fire up 20 copies of the virtual machine for you. And it's space efficient because we only need to write any changes that occur. So, I mean, you're effectively cloning the VM from a backup that you've created. Yeah. I mean, frankly, while these are both great features, we've seen this before.
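The space-efficiency claim for instant mounts ("we only need to write any changes that occur") is the usual copy-on-write idea, sketched below. This is a toy model under stated assumptions, not a description of how Rubrik stores data: `BackupImage` and `InstantMount` are hypothetical classes, and blocks are modeled as a plain dictionary.

```python
class BackupImage:
    """Read-only point-in-time image, modeled as block number -> bytes."""

    def __init__(self, blocks):
        self._blocks = dict(blocks)

    def read(self, block_no):
        return self._blocks[block_no]

    def __len__(self):
        return len(self._blocks)


class InstantMount:
    """Writable clone that stores only blocks changed after mounting."""

    def __init__(self, base):
        self._base = base
        self._delta = {}  # block number -> bytes written by this clone

    def read(self, block_no):
        # Prefer this clone's own changes, fall back to the shared image.
        if block_no in self._delta:
            return self._delta[block_no]
        return self._base.read(block_no)

    def write(self, block_no, data):
        # The shared backup image is never modified.
        self._delta[block_no] = data

    @property
    def blocks_written(self):
        return len(self._delta)


# Twenty clones of a 1000-block backup, with one block changed in one clone.
base = BackupImage({i: b"base" for i in range(1000)})
clones = [InstantMount(base) for _ in range(20)]
clones[0].write(7, b"changed")

extra = sum(clone.blocks_written for clone in clones)
print(f"{len(clones)} clones of {len(base)} blocks, {extra} extra block(s) stored")
```

In this model the cost of twenty clones is the size of their combined changes, not twenty full copies, and the backup itself stays untouched no matter what the clones do.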
Starting point is 00:22:03 Yeah. The guys in Columbus can do this. Yeah, or whoever they are. But the problem is that if you used a backup program and a backup target, you always scale the performance of the backup target to match the backup, not to match "and I have to run this VM for three days till the weekend when I can afford the Storage vMotion." And so the flash in the Rubrik is new. You know, I've been telling people for a year or two the only reason to use the VMware Flash Read Cache is to speed up an instant recovery from your Veeam that's just too slow. Yeah, yeah.
Starting point is 00:22:53 Yep, and you hit the nail on the head there. I mean, you're getting really great performance on it, and we think that for pre-production workloads, dev tests, you might have some golden backups that you just spin up 30 copies for the day, let your developers go nuts on them, check in their code, throw them away the next day, and spin up a whole fresh new set of copies. And you're doing it all on your backup appliance. I mean, how crazy is that? But you're getting awesome performance, so who cares? Well, your secondary storage, but secondary here doesn't necessarily mean performance. It's just,
Starting point is 00:23:26 you know, if I need to actually deal with something that failed, those test and dev machines have to go power down. Exactly. Or you just have the headroom, you know, you may be running enough nodes that you don't care, you know, unless the whole data center went down and you're like, all right, we need to recover it. There comes a point where I would start accusing you of selling me too much flash. Sure. That, you know, to say, yes, and you can do a recovery when that whole rack failed and still run your test and dev at the same time, I think that was over-specced a little. Yeah, it depends on the dev work, right?
Starting point is 00:24:01 Okay, okay, okay. So when the data is written to the Rubrik appliance, you can have multiple of the appliances. Is the data replicated across or mirrored across those appliances? So if one of the appliances goes down, I can still do these sorts of things, or how does that work? Yeah, and it'll depend on how many nodes you have. By default, it's a four node in the 2U, and we triple mirror the data to our SATA disks as well as across the nodes. So when we ingest to the SATA layer, we'll go ahead and mirror across the three nodes, basically stripe across all three
Starting point is 00:24:38 of the capacity disks so that we get pretty good write speeds, as well as copying all that data to two other nodes so that you have three complete copies of the data. And so as you scale out more nodes, obviously just more targets for that data to be written to from a replication perspective, yeah. All right. Well, triple mirror is good. It's good.
Starting point is 00:24:58 And so you're also striping across all the three disks in a node. So in a typical node, there would be three SATA disks and a flash SSD? Yep. Yeah, there's two choices. You can do three disks at four terabytes a pop or three disks at eight terabytes a pop. So those are your choices for how much capacity you want, basically. Right, right, right, right, right, right.
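As a back-of-the-envelope check on the numbers just given (four nodes per 2U, three capacity disks per node, three complete copies of the data), the arithmetic can be sketched as below. The function name is hypothetical, and the result ignores file system overhead, metadata, dedupe, and compression, so it is not a published capacity figure.

```python
def capacity_tb(nodes, disks_per_node, disk_tb, copies=3):
    """Return (raw_tb, logical_tb) for a cluster that keeps `copies`
    full copies of every block. Overheads and data reduction are ignored."""
    raw_tb = nodes * disks_per_node * disk_tb
    return raw_tb, raw_tb / copies


for disk_tb in (4, 8):
    raw, logical = capacity_tb(nodes=4, disks_per_node=3, disk_tb=disk_tb)
    print(f"{disk_tb} TB disks: {raw} TB raw, {logical:.0f} TB before dedupe and compression")
```

Under these assumptions a 2U box with 4 TB disks has 48 TB raw and 16 TB of triple-mirrored space, and the 8 TB option doubles both; how much backup data actually fits depends on the dedupe and compression ratios, which the discussion does not quantify.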
Starting point is 00:25:23 Now, that's a lot of smarts for that much storage. You want more storage or less smarts? Um, both. Well, you know, on the one hand, having a lot of smarts means you can do a lot of things. Yeah, yeah. Um, you know, and I want all the smarts that I'm willing to get as long as I can afford it. Yeah, yeah. You know, it's like the relationship between selling price and what you have to pay Intel is disconnected enough that I'm not necessarily making an assumption. Yeah, yeah. You didn't mention how much SSD is in that. Is it a 4-terabyte SSD?
Starting point is 00:26:04 If there is such a thing, it's 400 gigabytes. It's not an insanely huge SSD. 400 gig per node, along with three 4- or 8-terabyte capacity drives. And SanDisk does make a 4-terabyte SSD, but it's got really bad write endurance. Yeah, no doubt. Isn't it the price of a Jaguar or something? No, no, no. It's under $3,000, actually.
Starting point is 00:26:27 Oh, okay. A 20-year-old SUV or something like that, yes. I saw someone posting on Twitter, like, oh, it's the price of a midsize luxury car. I was like, what SSD is this? Yeah, yeah. Fusion IO. Yeah. Oh, okay.
Starting point is 00:26:40 Naturally, naturally. Okay, so backup and recovery and replication and instant recovery. What else do I need? I need a full-text index. You want to crack open the files. For the archive use case, when, you know, in the backup use case, I care about the catalog, I care about where the data came from. In the archive use case, somebody from legal just came down to me and said, who was John Smith, when did he work here, and what did he do? And so now I need to find all the
Starting point is 00:27:18 files that say john smith inside them yeah i've heard that one a few times. So, I mean, I'll be perfectly blunt. For right now, we look at the file itself, but not inside the file. So if you knew, you know, John Smith's files, we could grab them. But if you wanted us to search inside of a Word doc for the word John Smith, it's no bueno today. But I think that's a good idea, though.
Starting point is 00:27:40 Yeah, yeah. Well, in the indexing, there's a cost to that, obviously, but it's all in the cloud anyway. They got a lot of smarts in those boxes. Yeah, yeah. At the very least, if you wanted to search, you know, every, if you knew the file and you wanted to grab a copy of it from six years ago, it's literally type in the search box, click a button on the file, and it comes to you. It's not, let's go find the tape that has the right catalog for that particular year and then ask Iron Mountain to ship you a tape from 2008. Although we weren't around back in 2008, so you'll have to do that for a little while. But you get the drift.
Starting point is 00:28:19 The problem is just I put in the perfect system, and now I've got at least seven years of running both of them. Yeah. You mentioned data compression and deduplication. So you're compressing the data as well? That's correct. Yeah, it all happens when we ingest it. We pretty much take it and do all the data efficiency procedures
Starting point is 00:28:41 while it's in the flash layer and then write whatever comes out. So it's kind of lazy deduplication. Isn't it always? The problem is the state of the data, right? As opposed, you know, it's not really post-process because the delay
Starting point is 00:28:57 is measured in seconds at most. But it's not really in line either. Yeah. The goal is just to make sure that we spend as little time dealing with the virtual machine as possible. You know, kick off the VADP process, grab that snapped, you know, the underlying VMDK, pull it in as quickly as possible, you know, avoid stunning the VM for as much as we can, give it back to the virtual machine, and then we do our thing. And every node can do that. So it's actually a lot less stressful on the virtual machine. If you think
Starting point is 00:29:27 about this, because it doesn't have to go through a proxy to a backup server to a storage target, we've eliminated a lot of the middlemen so we can reduce the stun on the virtual machine pretty significantly. Yeah. A question on the appliance. Is it like a dual controller appliance or is it a single controller environment, or what's that look like? It's an infinite controller appliance. No, no, I understand, you can have scalable nodes, you can have multiple nodes, but one particular node, what does that look like? It's four nodes in a 2U. It's using the Supermicro, I think it's the TwinPro, you know, the pretty standard 2U Supermicro box. Right. It's not like a NetApp controller or EMC, you know, where it's got dual heads, that kind of thing. Every node in the 4, you know, in the 2U has four nodes.
Starting point is 00:30:11 Every node is, you know, a master, quote-unquote. There's no master realistically. Right. So they all share kind of a job scheduling system and then figure out based on the SLAs, okay, node one, go back up VM number three. Node two, I'm going to grab VM number four, et cetera. Right.
Starting point is 00:30:31 I think the challenge is when you start operating as primary storage, even if it's only for a certain period of time, the availability of the node becomes one of the crucial aspects here. Even though it's scalable and the data is located in three other nodes, maybe that's enough. I don't know. Well, it's better than vSAN. Oh, well, I won't go there.
Starting point is 00:30:56 Guns blazing already. Well, if you run vSAN in the default mode, it's only two-way replication across the nodes. Yeah, yeah, yeah. And those nodes are busy running VMware and all those applications, so they've got a lot more ways to fail. I'm glad you brought that up, Howard. I did see some folks saying this is hyper-converged backup or what is hyper-converged.
Starting point is 00:31:21 Hyper-converged would be running the VMs on the compute on Rubrik. We only run the storage, so you'll still need servers somewhere to run the compute. You mentioned that converge word. I mean, something like Unitrends, where when you do the instant recovery, they run KVM on their box and spin up the VM. You could call that hyper-converged backup, but what a stupid term. Yeah, I agree. I don't like the term hyper-converged backup.
Starting point is 00:31:46 So I've been trying to, I don't know where that came from. I'm trying to make sure that it dies quickly. Yeah. So is there a limit to the number of nodes? I mean, 64 nodes, is that a reasonable configuration here? We don't have any published limit. There shouldn't be a limit. You know, we obviously, you know, we can only test so much hardware being a startup,
Starting point is 00:32:06 but we've tested quite a few just fine. Okay, the node address is only 64 bits eventually. Yeah. How many bits with 64 bits? That's a lot of address space. Okay, another question. Can one, I'll call it a Rubrik cluster, can it span vSphere clusters? Or does it have to be limited to one vSphere cluster?
Starting point is 00:32:28 You can have as many vSphere clusters or vCenters as you want. Talking to the Rubrik cluster. Yeah. Well, so one Rubrik cluster can be as big as you want. But, yeah, you can plug in. When you go to plug in that association with vCenter, you can plug in multiple vCenters. So you can have eight vCenters all talking to one Rubrik fabric and that's fine.
Starting point is 00:32:51 You could even apply the SLA policy at a cluster layer and say, you know, this cluster's gold, this one's silver, and delineate with clusters based on SLA, if that makes sense. Well, this is the VDI cluster. Back it up once a month. Yeah. Never back it up. No one likes that cluster.
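A minimal sketch of assigning SLA policies at the cluster layer, as described above. All names and numbers here (`SlaPolicy`, the gold and silver settings, the cluster and VM names) are invented for illustration and are not Rubrik's API or defaults.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SlaPolicy:
    name: str
    backup_every_hours: int  # hypothetical backup frequency
    keep_days: int           # hypothetical retention period


GOLD = SlaPolicy("gold", backup_every_hours=4, keep_days=365)
SILVER = SlaPolicy("silver", backup_every_hours=24, keep_days=90)

# Policy is attached to the vSphere cluster, not to each VM individually.
cluster_policy = {"prod-cluster": GOLD, "dev-cluster": SILVER}

# Which cluster each VM lives in (this would come from vCenter inventory).
vm_cluster = {
    "sql01": "prod-cluster",
    "build07": "dev-cluster",
    "vdi-123": "vdi-cluster",
}


def policy_for(vm: str) -> Optional[SlaPolicy]:
    """Return the policy a VM inherits from its cluster, or None if unprotected."""
    return cluster_policy.get(vm_cluster.get(vm, ""))


for vm in vm_cluster:
    policy = policy_for(vm)
    print(vm, policy.name if policy else "no SLA assigned")
```

In this sketch the VDI cluster has no policy at all, so its desktops are simply never backed up.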
Starting point is 00:33:12 Or just back up the file server serving those user files. Right. Another example. It's supposed to be stateless. Yeah, that's the goal, right? Yeah, yeah. And you mentioned VSS providers. VMware snapshots, how does that play into this discussion?
Starting point is 00:33:28 Or you guys pretty much avoid all that? We shouldn't really ever have to use the VMware snapshot unless the virtual machine doesn't have any VMware Tools installed upon it or we're just not able to install our VSS provider into the virtual machine. With VMware Tools, it would use a VMware snapshot. So the VMware Tools is required so that we can put our VSS provider into the guest OS. Otherwise, we don't have a vehicle to get in there. Yeah, but if they just have VMware Tools, then it would create a VMware snapshot until
Starting point is 00:34:01 you finish copying the change blocks. Yeah, it'll definitely always look like a snapshot's kicking off in vCenter. You know, you'll see a snapshot operation occur every time just so that we can shift the IO off the VMDK and grab a backup, that kind of jazz. It's just a matter of how do we, how do we quiesce the IO from the application? And that's where the VSS provider comes into play. Right. Well, it's also how you, it's also how you create the stable image
Starting point is 00:34:28 that you're going to back up. Yeah. Correct. Yeah, I just know... Where do you send the writes while the backup's running? Is, you know, and... Yeah, so when that snapshot kicks off,
Starting point is 00:34:45 you'll have that, you know, snap, basically, delta file there that just writes, you know, it's basically a bitmap writing the changes. Yeah. That's your temporary holding pit for writes, and then we, you know, it comes back together and writes it down onto the underlying VMDK when the snap is done. Yeah. And so those transient snaps, as I will call them, from VMware,
Starting point is 00:35:06 they exist only for the moment, and then they are effectively gone after the backup's completed? Is that how it works? Correct. Yeah, as soon as the backup's done. That's how the storage API works. Yeah, yeah, I got you. It creates a snapshot so that the backup engine can mount the VMDK directly from the storage and it'll be fixed content. Yeah. What about SRM and stuff like that? Do you guys have any support
Starting point is 00:35:35 built in for that? No, not today. You know, it definitely is something that's tickling my brain as a potential future. You have to have an SRA, you know, plugged into SRM in order to be one of their targets. Right, right. Especially when you start playing primary storage games and stuff like that. Oh, we're secondary storage. We're good. Yeah, sometimes you're secondary
Starting point is 00:35:57 storage. I put my virtual hands up. Secondary, but not necessarily, you know, down the end of the chain. Like secondary with the guy shrugging next to it, like, I don't know. Look, if you're going to have backup systems
Starting point is 00:36:18 that you're afraid to use, it doesn't do you any good. Yeah. Sure. And that's one of the nice things about it. You can actually, it's not just data that sits there and basically lies dormant forever. Well, it's not just insurance. Right.
Starting point is 00:36:35 So you mentioned the ultimate goal of doing this for bare metal and other virtualization environments. Hyper-V would be the next one from my perspective that somehow needs to be there. You see this playing in that space? Well, I can't give you any spoilers on specifically what the prioritization of the backlog is, but definitely it has been brought up. Hyper-V, KVM, bare metal, they're all three things we've publicly come out and said we're going to go after, those use cases, as time permits, basically. Just because it's rough. You know, you got some products that
Starting point is 00:37:09 are VM only. You got some that are physical that sort of do virtual machines through an agent or some other crusty stuff. And we'd rather be able to do all of that seamlessly and not require agents. Yeah, yeah, yeah. It's a bit, it's much more of a challenge in a bare metal environment, in my mind, to try to do this agentless. Yeah, I'm interested to see how we kick that off. We've got some pretty brilliant folks there. Everybody's got a PhD or something. I'm just the tech evangelist, right? I just try to disseminate the wizardry that they do and turn it into stuff that we can consume.
Starting point is 00:37:42 I'm sure you guys could talk one-to-one. Sure. I could just listen upon and gaze. No doubt. I still wake up in a cold sweat every once in a while about the system that ran an HR app that Arthur Andersen consultants created
Starting point is 00:38:00 and ran Windows NT and nobody knew how to maintain it at all. And then Arthur Andersen went out of business. Oh, God, yeah. And so there was just this box that we had to figure out how to deal with. I was so happy when we virtualized that sucker. Yeah, you would think.
Starting point is 00:38:18 It's like, oh, look, we haven't been able to buy that server model for six years, and the operating system has been out of support for two. Oh, God. Well, if you're having bad dreams or nightmares, I should say, I got one thing that I think you'll appreciate is that as you get a pretty decent backup set, let's say you've got seven years of backups that you're managing and you've got all these SLAs that you're being held to as a business. I want X number of dailies, weeklies, monthlies, et cetera. We'll actually real-time give you a compliance report based on the SLA compliance and policy that you built into the system. So we'll tell you, okay, over the last seven years, are you compliant?
Starting point is 00:38:56 Where are you not compliant? What are you missing? There's no work to do that. You just go into the system and the report tells you exactly what's going on with that particular workload. So I thought that was pretty snazzy because there's no, you know, there's no, it's not just telling you, all right, you have 300 backups or you have backups for seven years, but you don't know exactly what you have. You can look across the whole ecosystem or across the whole fabric and say, here's exactly what I have as far as dailies, weeklies, et cetera.
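The compliance check described here boils down to comparing the backups you actually hold against what the SLA says you should hold. A minimal sketch for the "dailies" case, assuming a hypothetical helper `missing_dailies` and made-up dates (the real report format is not described in the episode):

```python
from datetime import date, timedelta


def missing_dailies(backup_dates, today, required_days):
    """List the days within the SLA window that have no backup."""
    have = set(backup_dates)
    expected = [today - timedelta(days=i) for i in range(required_days)]
    return [day for day in expected if day not in have]


today = date(2015, 8, 5)
backups = [date(2015, 8, 5), date(2015, 8, 4), date(2015, 8, 2)]

# Hypothetical SLA: one daily backup for each of the last four days.
gaps = missing_dailies(backups, today, required_days=4)
compliant = not gaps

print("compliant:", compliant)
print("missing:", [day.isoformat() for day in gaps])
```

The same comparison would be repeated for weeklies and monthlies, each with its own window and interval.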
Starting point is 00:39:23 And how does it compare to my SLA? So has Rubrik been around seven years now? Am I missing something? Well, I invented time travel. In seven years, if we buy Rubrik today, we will be very happy. I like where this guy's going. He's a good guy. How long has Rubrik been out nowadays? So when did you guys get general availability?
Starting point is 00:39:47 GA was in May. The company was founded January of 2014. So we're a young, scrappy startup. Yeah, yeah, yeah, yeah. We're all hungry and excited. Mere infants. That's right. I mean, it's not bad considering, what is that, 17 months to go GA and then all the features.
Starting point is 00:40:03 No, it's actually quick. Three, four months. Yeah, yeah, yeah. Okay, we're about to the end of the podcast. Howard, do you have any final questions for my man Chris here? No, I'm just going to miss having Chris sitting on my left at Field Day. What are we going to do now?
Starting point is 00:40:18 We'll have to find somebody else to complain about my keyboard usage. I'll keep complaining. I won't be there. Alright, Chris, do you have anything else you'd like to tell the audience? Definitely, if you're at VMworld, come out to the booth.
Starting point is 00:40:33 I've got stickers and stuff I'm giving away. Say hi. Tell me what you like and don't like and we'll have a virtual beer. A real one. Speaking of VMworld, ah. Greybeards on Storage will actually be a session at VMworld this year. That's correct.
Starting point is 00:40:52 Our friends at VMworld have decided that this podcast is just so wonderful that at 2 p.m. on Monday, August 31st, Gray Hairs on Storage, live on stage, will be Ray and I and a veritable murderers' row of storage luminaries, including Paula Long, who we just could not make wear a beard,
Starting point is 00:41:21 so we had to change the name of the session. Gray Hairs, yes. Yes. Well, thanks for that plug, Howard, and I really appreciate it. I should have done it myself, I guess. Well, this has been great. It's been a pleasure to have Chris with us here on our podcast. Next month, we'll talk to another startup storage technology person.
Starting point is 00:41:38 Any questions you have, please let us know. That's it for now. Bye, Howard. Bye, Ray. Bye, Chris. Until next time, thanks again, Chris. Hasta luego. Have a good day.
