Grey Beards on Systems - 148: GreyBeards talk software defined infrastructure with Anthony Cinelli and Brian Dean, Dell PowerFlex

Episode Date: May 18, 2023

Sponsored By: This is one of a series of podcasts the GreyBeards are doing with Dell PowerFlex software defined infrastructure. Today, we talked with Anthony Cinelli, Sr. Director Dell Technologies, and Brian Dean, Technical Marketing for PowerFlex. We have talked with Brian before but this is the first time we’ve met Anthony. They were both …

Transcript
Starting point is 00:00:00 Hey everybody, Ray Lucchesi here with Keith Townsend. Welcome to another sponsored episode of the Greybeards on Storage podcast, a show where we get Greybeards bloggers together with storage and system vendors to discuss upcoming products, technologies, and trends affecting the data center today. This Greybeards on Storage episode is brought to you today by Dell PowerFlex Storage. And now it is my great pleasure to introduce Brian Dean, PowerFlex Technical Marketing, and Anthony Cinelli, Senior Director, Global PowerFlex Software Defined and Multi-Cloud Solutions. So Brian and Anthony, why don't you tell us a little bit about yourselves and what's new with Dell PowerFlex?
Starting point is 00:00:44 Awesome. Thank you. Thank you so much, Ray. And pumped to be on the podcast today. PowerFlex is at a pretty exciting inflection point here for Dell Technologies. There's just so many things going on in the world of infrastructure and cloud where customers are consistently looking for ways to drive standardization across the data center. They're trying to eliminate all different silos of architectures and silos of platforms that they've brought in over the years. They're trying to simplify and consolidate all of that. And they're trying to get some level of consistency and experience between what they're doing on-prem and what they're doing in the cloud.
Starting point is 00:01:28 And what's really exciting for PowerFlex right now is it literally lives in the middle of all of that. And it's helping deliver some really transformational outcomes to customers, helping them drive towards these true infrastructure transformational outcomes. And we're doing it at some of the largest organizations on the planet, for the largest data centers on the planet, handling some of the hairiest, most mission-critical workloads that are out there. So it's a really exciting time to be in and around the world of PowerFlex in general, driving some of this transformation out there with customers. And Brian, you want to talk a little bit about yourself and what's going on? That's a really great way of putting it.
Starting point is 00:02:11 So, like you said, Brian Dean with PowerFlex Technical Marketing. I've been familiar with the product and working with it for the last six years. And as a distributed software-defined infrastructure platform, I think it has been transformational in the customers that have been able to make use of it. One of the things we'd like to probably get into and clarify as we go through the work today is, what exactly is it? It's been marketed and positioned as high-end storage, as converged infrastructure, or hyper-converged infrastructure, or software-defined storage. What does that even mean? And so it's a little bit of all of these. I like to use an analogy out of like The Princess
Starting point is 00:02:59 Bride, right? So when this movie came out, it was a flop. And part of it was they didn't know how to market it. Is it a comedy? Is it an adventure story? A fantasy love story? Well, it's all of them, but it doesn't fit into any box. It's none of them exactly. It's a little more than all of these. But once people started watching it, then it became its own, you know, classic hit. It's classic. People understood it. Whereas with PowerFlex, I think it's the same kind of thing. One thing is like, it's a little of all of these.
Starting point is 00:03:31 It's none of them in particular. Once people start using it, they see the power of it and how much it can do for them. Anthony talked a little bit early on about some of the current issues facing the data center today. Would you like to maybe talk a little bit more about that and how that plays out and what the infrastructure, I guess, evolves into?
Starting point is 00:03:54 Yeah, sure thing. So what we've seen is when you go to any typical customer of any type of size or know, size or scale, they all kind of have an infrastructure that looks the same in that it's made up of a lot of different things. And I like to break it down into like three different categories. You know, first you'll have what I'll call the general purpose estate, right? You got a large virtual workload environment, maybe a whole bunch of different, you know, database workloads. And typically we'll see things here like your traditional three-tier stack. Maybe you see customers dabbling in things like hyper-converged.
Starting point is 00:04:33 This tends to be a large part of the environment and customers are after, how do I do things cost-effectively and how do I make it simple, right? That's kind of what they're after. But that's only one part of the environment. Then you step outside of that world and you oftentimes also find some type of specialized systems where you have a certain platform there to just being a unicorn in the data center it's just there to do one thing and then on the other end of the spectrum you have this whole new emergent world of what i'll call modern scale-out workloads these are your you know your no sql databases maybe your modern analytics workloads and the infrastructure here also tends to look different because it needs to scale out by nature right so? So you end up with maybe a lot of servers with just direct attached storage. You go to any customer's environment, they usually have those three categories all at the same time.
Starting point is 00:05:36 And there's just an incredible amount of complexity there. And you're not even talking about things like acquisitions where company A buys company B and there's a whole different environment that they have, right? Exactly. There's just so much complexity because workloads tend to look a little bit different and customers just end up with all these different architectures that they then need to operate. If I kind of summarize PowerFlex in one word, it's consolidation. It has a very unique architecture that allows us to take all those different platforms and consolidate it down into one universal infrastructure that can solve for your traditional virtual workloads, can run your databases, can deliver the specialized performance, and has this scale-out capability to also deliver for the modern workloads. So PowerFlex is really all about consolidation. And there's just such a focus right now for customers to think about, how do I get standardized infrastructure? How do I consolidate? How do I drive out cost? Because the reality is having all these different platforms, you end up with tremendous amounts of waste across the infrastructure.
Starting point is 00:07:01 Utilization and all that stuff. Yeah even utilization, there's just waste, right? And by driving consolidation, it helps you drive up utilization. And that's what's so unique about PowerFlex right now. And what's exciting about it is, again, we don't need to sell customers on the value of consolidation, right? That's been proven in IT over 30 years. If I can go from five widgets to one widget, I'm going to save some money. I'm going to become more agile. I'm going to be able to move faster, all that stuff. What's so unique about PowerFlex is it has an architecture that actually allows you to consolidate at the modern scale that customers deal with today. And that's just a really, really exciting thing that just frankly,
Starting point is 00:07:45 nothing out there we've seen has the capability to do to the same degree. That's a huge ask to be able to do all those sorts of things with the same storage architecture and stuff like that. I mean, databases, right? NoSQL, big data, AI and machine learning, that sort of stuff. I mean, this has got very diverse, I'll call it performance characteristics, those sorts of things, right? Totally. And I kind of group it into like there's three things you need to solve for, right? The first one is what I'll call ecosystem supportability, right? So today, a lot of modern software-defined
Starting point is 00:08:26 architectures, they're interesting, but they're interesting and useful only if you're running a general-purpose virtual machine. So right out of the gate, one of the things that makes PowerFlex super unique is I can run a virtual machine. I can run a bare metal database workload, and I can provide persistence for a containerized workload, whether that container is running on a VM or running on bare metal. So from an ecosystem perspective, we can support yesterday's workloads, today's workloads, and tomorrow's workloads. The second piece is what I'll call the architectural scale. The unique thing that PowerFlex brings to the table, it has a truly disaggregated architecture where I can provide you a full stack value prop with
Starting point is 00:09:14 simplicity, automation, and lifecycle management like a hyper-converged appliance would, but I do that with a complete physical decoupling of compute and storage. The value in that is I can now help drive economics at scale. As your compute requirements grow, you simply grow compute. As your storage requirements grow, you simply grow storage. You have a complete decoupling in how you're able to scale those resources. That means you're never adding or paying for a resource you don't need. You're never licensing a resource you don't need. And it allows you to optimize for things like database licensing, which is incredibly expensive.
Starting point is 00:09:59 The third piece, and one that ties it all together, is PowerFlex has a very, very unique IO architecture that allows us to deliver incredible game-changing performance. And as you scale and grow the environment, that performance will scale and grow linearly along with it. The value in that is not that customers need the millions of IOs that PowerFlex delivers. They don't. The value in that is that they don't need to think about or worry about performance as they consolidate. They can truly consolidate with confidence. And that's the superpower of PowerFlex. I now have this scale-out architecture where I can run my Oracle. I can run physical SQL.
Starting point is 00:10:47 I can run my general-purpose VMware. I can run my containers all on the same platform. I can scale my compute and storage in a completely disaggregated way, driven by whatever my application requirements are. And then I know I have more than enough performance to go around to truly consolidate all these different workloads and not have to worry about noisy neighbor, not have to worry about, you know, huge operational burden from performance management. All that goes away. We're going to have to get into all this technical stuff too, Anthony, but I understand where you're going. It's kind of a huge, potentially huge system. But I mean, you talk about on your website, talk about thousands of nodes.
Starting point is 00:11:34 Are there customers out there with these sorts of configurations? So believe it or not, yes. And this is what's really exciting. The cool part is when we bring PowerFlex to a customer today, we talk about this value prop, but then the most exciting thing is no matter what that customer scale is, we would not be learning on them. PowerFlex is already proven at the single largest scales imaginable. It's deployed in production at four of the five largest banking institutions in the U.S. Our single largest customer has over 800 petabytes of PowerFlex deployed, running core banking workloads and banking applications.
Starting point is 00:12:19 Now, that doesn't mean PowerFlex is just for customers that have 100 petabytes or more, right? Absolutely not. Now, that doesn't mean PowerFlex is just for customers that have 100 petabytes or more, right? Absolutely not. But what's really exciting here is it's not often that you're able to kind of approach a truly transformational technology with a unique architecture, yet also see it as something that's incredibly proven at the largest scales imaginable. And that's exciting because I can now go to a customer, general enterprise, general kind of mid-market customer, and we have tremendous confidence in our ability to execute because we've already proven the value prop in these massive environments with these super heavy hitting, hairy workloads. And again, we're happy to get into the architecture specifics of what enables this. But what's really cool here is it's already proven at some really, really large customers.
Starting point is 00:13:09 And that's exciting. And that's the biggest scale. That's battle testing at the largest scale. But it doesn't have to start in the hundreds and multi-hundreds and bigger setups. I mean, we'll start with a four-node storage cluster. So if I was doing some sort of an edge environment, I could start PowerFlex with a four-node configuration and build from there? Yeah, so PowerFlex, you know, minimums, right?
Starting point is 00:13:37 Starts at four nodes, four storage nodes. And then we also have a concept of PowerFlex compute nodes. And you could literally have one of those, three of those, you know, whatever. And then we also have a concept of PowerFlex compute notes. And you could literally have one of those, three of those, whatever. And important to know, PowerFlex, it's not just a storage thing. It truly is about how do we not just deliver really scalable, really performant storage, but it's also how do we help customer simplify and transform the operations? So when you're buying a PowerFlex solution, yes, you're bringing in PowerFlex storage, but you're also running PowerFlex compute nodes
Starting point is 00:14:11 that give you the benefits of automation and lifecycle management for the compute layer as well, inclusive of ESX, for example. And it's that full stack value prop that becomes really interesting to customers. So essentially, there's three form factors for PowerFlex. We have storage nodes, we have compute nodes, and we also have hyperconverge nodes. So whether you're doing, you know, four nodes in an edge location, you're doing anywhere from four nodes to, you know, 14, 40, 400 and beyond nodes in a data center. PowerFlex has the ability to solve for that, which is pretty powerful. So this sounds more than a product. It sounds like a operating model. Can you talk to me
Starting point is 00:14:57 about that operating model? Because when I'm thinking about my bare metal Oracle workloads, my Kubernetes bare metal solutions, my ESXi, my stuff that's running on Red Hat virtualization platforms. These are all models of operating that I have to select and I have to be thoughtful about. What you're trying to sell me is this idea to kind of forego those models as the primary method of addressing storage, compute, and automation in my environment and kind of go with a PowerFlex first model. Not quite. If I can rephrase that, it's to use all of them with PowerFlex as your sort of universal stored framework behind everything. Maybe that's a different way of putting it. It's not that we're replacing it with XI or RHV or something. It's that we can simply operationalize and standardize everything underneath and behind those. So the practical challenge, you know, this on paper, that sounds great, but in practice,
Starting point is 00:16:12 you know, there's minimum firmware needed for, to make sure that I get support on my SAP workload. My, my solution has to be validated. So when I upload, when I upgrade my storage firmware, I got to make sure that all the underlying components, and this is where the complexity comes in. All the underlying components and requirements are in line with this. And this is where customers get stuck. And this is where silos end up being created. How are you helping customers solve that that siloed problem that this is a great solution? Again, I'm sure if I had all HCI solution with ESXi or if I had all Oracle solution, it gives me that flexibility. But when we're thinking about these mixed workloads and it's mixed operating model, this is where my skepticism starts to come into play. Yeah. You bring up a great point, Keith. And this is one of those
Starting point is 00:17:11 areas that I think the uniqueness of PowerFlex becomes really interesting, right? Because one of the challenges of HCI is exactly what you bring up. It's super simple, as long as you're not trying to do anything outside the norm with it of what I'll just call a straight VMware general purpose ESX type workload. Where PowerFlex becomes interesting is it gives you that HCI-like concept, but with the flexibility of that kind of traditional three-tier stack, where your compute and storage have the ability to be operated somewhat independently. And on the compute, you can run a variety of different stacks. So within PowerFlex,
Starting point is 00:17:52 there's a tool, if you will, called PowerFlex Manager. And what PowerFlex Manager does is it provides the operational aspect of the environment. And to kind of work from the bottom of the stack up, it will do at the storage layer, all of the storage, you know, deployment, add, remove nodes, lifecycle management, firmware, BIOS updates of the hardware, of the software-defined storage. All of that gets done by PowerFlex Manager. As you move up the stack to compute, you have the ability to operate it multiple ways. You can have it fully update lifecycle and operate that VMware kind of compute node, if you will, where PowerFlex will literally deploy ESX, all the right BIOS, firmware, and driver levels that have been pre-tested, pre-validated. When it comes time to upgrade or apply patches, it will automate and do all of that on the system.
Starting point is 00:18:46 But then let's say you step to a workload that's maybe not running on a hypervisor, not running on VMware. Maybe it's an Oracle running on a physical Linux host. You have the ability for PowerFlex Manager to treat that host with what we call bare metal, where it will go and treat that host and just talk to the BIOS and firmware on it. So it'll update just the hardware of the server, leaving the customer to operate their Oracle and Linux OS image as they have historically. Same would apply to something like OpenShift, right? Where let's say the customer wants to run OpenShift on bare metal, they can have PowerFlex manager operate the server hardware, if you will, where I have Hypervisor integrated into the lifecycle experience of the platform. And within the same system, using this concept known as services,
Starting point is 00:19:53 you can also have additional platforms running on top where we don't have to do all the bits, giving you this one singular landing zone that you can have all these different stacks operating on top of according to the way you need to operate them. So a practical problem that we've run into, and it sounds like you've thought this through, a practical problem that we run into is when we have a converged system that's fully integrated. There's ESXi. There's some type of lifecycle management tool. There's the collapsed storage and compute. And I want to go from one version of VMware vSphere or VMware vSAN or whatever the control plane is to the next. You solve this firmware problem of needing to go out and get the latest and greatest firmware. That automation is there. And then it seems like in that same platform, if I understand you correctly, if you just want to consume this as a NFS compliant, iSCSI compliant, standards based solution, you can do that. So I can connect a consistent, persistent volume in a Linux host to deliver as Kubernetes.
Starting point is 00:21:09 And I've abstracted away that storage piece of it, the software control plane of it. And I'm just consuming it as compliant storage. So let me back up the conversation just a touch. I think I can answer that question and maybe some remainders from the previous one. So what PowerFlex is, we'll call it a software-defined infrastructure. And I like to call it a software-first architecture because even though it has to run on hardware, in theory, in principle, at the base base it is just software you bring it some x86 some ethernet and some direct attached storage and install the right pieces of software on a compatible
Starting point is 00:21:54 operating system and off we go you know there's different pieces of software that allow it to do different things fundamentally there's just three of them. There's a software-defined, I mean, there's a storage server that works on a node to aggregate those local disks, bind them together with other nodes, create storage pools and different layers of complexity. There's a client, and this is partly getting at what you were just asking,
Starting point is 00:22:21 a software SDC, so a storage data client, which runs normally in, well, not normally, it runs in various operating systems or hypervisors in the kernel and is able to map to those storage nodes and consume storage from them, presenting volumes to the hypervisor or the operating system. And we have these for all kinds of platforms. There's also pieces of software that do all the metadata management, additional pieces of software that enable different features like replication or NVMe over TCP. But fundamentally, you just got these storage creator, storage consumer, and some management layers. And it doesn't matter whether you put those on separate nodes and they talk like a disaggregated thing, or you put them into sort of the same node
Starting point is 00:23:11 and you do a hyper-converged thing for an individual node, both creates and consume storage in a cluster. That allows us, and then they don't care, right? And you can mix them up. Well, you can have some things in a cluster just providing storage, some just consuming and some doing both. That's okay with us. This allows all that great architectural flexibility. We work with lots of different hypervisors and operating systems, but it also provides many layers of complexity that users don't want to get into. So I think it's the beauty of what Anthony was talking about here a minute ago with PowerFlex Manager is now we'll take all of the complexity that's possible and we'll provide you easy templates to deploy and manage it
Starting point is 00:23:56 along with all the hardware life cycling that that happens to be sitting on to provide the ease of operations across the board. Now, Brian, I understand how the client and the server can facilitate what I'll call block access. Do you also offer file access? Yes, that is new with version 4.0. And does that support like standard NFS or standard SMB services?
Starting point is 00:24:24 Yep, all the standard protocols. You don't necessarily need a client to support that. Is that what you're saying? No. We have our file controller nodes that sit there in front of the rest of the cluster, and all of the PowerFlex juice and scalability sits underneath the file systems that those will serve out.
Starting point is 00:24:58 But it is from the client perspective, from the file client perspective, it looks just like any other file system that's being presented to it. So standard SIFs, NFS, S&B, et cetera. Right, right, right. So all the normal file operations. Right. And so you've got this file controller node that, that provides, I'll call it, you know, file services and uses the PowerFlex backend for its bulk storage, storage pools and that sort of stuff. Correct. Yep. And you mentioned lifecycle management.
Starting point is 00:25:27 We're talking lifecycle management for some bare metal solutions. Are you just talking the client side of that, the storage data client? Or are you talking like the whole OS and firmware and hardware? And, you know, there's plenty of, I'll call it, different x86 systems out there, not necessarily all of which are from Dell. I know that's kind of a foreign concept, but do you also offer those sorts of services for non-Dell hardware? So the lifecycle management in PowerFlix is obviously the storage node hardware and software layer.
Starting point is 00:26:03 On the compute node front, it is the server itself, right? BIOS, firmware, et cetera. And then it also has the ability to lifecycle ESX. For a bare metal host or a non-ESX operating system, PowerFlex Manager, we're not going to want to touch that, right? So like, again, if a customer is running Oracle on physical Linux, that's going to be the customer's Linux OS. They're going to be responsible for that lifecycle at Patchit. We're going to handle the compute node that it runs on. Now, to the question of, hey, you know, whose x86 is it? What we have found over the years is that when it comes to software defined, customers want the value and the consolidation that a scale out software defined architecture can provide, but they don't necessarily want the science projectiness of it,
Starting point is 00:27:09 of mixing and matching different hardware from different vendors. So while PowerFlex as a core software-defined storage can run on any x86 hardware from any vendor out there, the experience around the automation and the lifecycle management is when specifically deploying it on PowerFlex nodes, PowerFlex hardware, which obviously is Dell PowerEdge based. but doing that with simplified operations by delivering it as a full stack experience and not simply saying to a customer, hey, here's some software, go build infrastructure out of it, because the reality is that's not core to pretty much any customer's business these days. So they want to kind of consume the outcome that can be provided without having to really put it together or build it themselves in any way. What does a typical, I'll call it storage node, look like in this environment? Is there any special hardware requirements for those sorts of things, or is it just a standard Dell PowerEdge server with SSD storage behind it? I mean... Yeah, it's actually pretty straightforward.
Starting point is 00:28:29 When we call it, quote, PowerFlex nodes, they're PowerEdge servers with SSD or NVMe in them, standard CPUs. There's nothing special or proprietary about the hardware itself. It's standard PowerEdge. If we can put it this way, the hardware gets designed and tuned to enable the software to behave at its best. The software doesn't require particular hardware. So you don't have to have NVRAM or things of that nature
Starting point is 00:29:01 for special buffering caches or something like that? So we're not doing any type of caching within the system. And this is one of the things that's really interesting about PowerFlex's IO architecture and how we deliver the type of consistent performance that we do. All of the IOs in a PowerFlex system, and I'm kind of changing gears on you in case you haven't noticed, Ray, because I was looking for an opportunity to talk about our IOPath because it's so cool. All of the IOs go directly to the underlying media in the distributed cluster. So for example, let's say I have very simply a 10 node, 10 storage nodes in my PowerFlex
Starting point is 00:29:46 cluster, and each of those nodes have 10 NVMe devices on them. When you create a volume in PowerFlex, that volume is evenly distributed across all 100 of those NVMe devices. And now every IO coming from the compute nodes, all those IOs are being evenly distributed across all 100 of those NVMe devices at all times. There's no cache layer. All the reads and all the writes are coming directly off of that persistent media. That delivers two things. Number one, it delivers incredible amounts of performance because now I'm not relying on a cash drive or two to, you know, deliver my performance. But the other benefit of that, moreOs. There's no concept of like cache hits, cache misses, skew, workload skew, all that stuff goes away. And when you kind of run performance testing on
Starting point is 00:30:54 PowerFlex, you literally look at it, no matter how hard you push it, latency just, it's kind of a straight line across because there's no gimmicks in the IO path. What you see is what you get and it is incredibly consistent. And then as you scale or grow the cluster, so if I go from 10 nodes to 20 nodes or 100 NVMe drives to 200 NVMe drives, my volume will now get automatically redistributed across all 200 NVMe drives. And I now just doubled my IO performance. And there's no storage controllers in the way. I'm adding, every time I add a node, I'm adding more storage processing power.
Starting point is 00:31:34 I'm adding more drives. I'm adding more network bandwidth. Well, I think the other piece of this too is it's not just linearity with respect to, you're linearly, you're scaling the capacity but getting performance to scale linearly along with that. It's very predictable. There aren't any
Starting point is 00:31:52 points at which if you add a couple more nodes you start hitting a plateau or a choke point at which now you've made it too big it's going to start underperforming. It will as long as we're staying inside the theoretical maxims which are giant, keep growing. And also, it doesn't start coming apart when you start filling up the storage.
Starting point is 00:32:14 And I think that's part of the result. Another result of this architecture is that you get it 75% full and it doesn't start tipping over. So, you know, we've all done this a really long time. And the one thing that we know is that scale breaks everything. The computer science is undefeated. So let's test this model a little bit. 10 NVMe drives across 10 servers will saturate a network path. What do you guys do to help us mitigate the IO path itself? Like the, obviously the drives and the servers.
Starting point is 00:32:52 Once, once I, you know, once I get up to, you know, 16, uh, to 32 nodes of this stuff with all NVMe drive, I have way more, I have more, I have more storage IO than I have network. Bandwidth, bandwidth, bandwidth, right? For sure. I mean, to be clear, we're not, you know, we're not breaking the laws of physics, to your point. Those laws are undefeated. The cool part is, and I'll steal an analogy from one of our great pre-sales team members that created this. Think of PowerFlex software, almost like a bed sheet. If I put it over a bed, it takes the shape of a bed. If I put it over a chair, it takes the shape of a chair. And the analogy for that is this. Today, if I'm running PowerFlex on a bunch of NVMe devices with four 25 gig NICs on it, it's going to run at the speed
Starting point is 00:33:46 of those four 25 gig network ports. Tomorrow, if I'm running it over a bunch of NVMe drives with notes that have four 100 gig ports in them, I'm going to run at the speed of those four 100 gig ports. So the exciting part is because the software is not the bottleneck. As hardware increases, as networks get bigger, you just get to ride that curve of performance. As those things grow, you immediately take advantage of them. So yes, we're not going to do anything to perform faster than 425 gig connections will allow you. We'll run at line speed, but we won't run faster than that. But now as soon as you upgrade to 100 gig or 400 gigs becoming a thing, you can now operate at those speeds immediately. And that's what's really exciting because again, there's no
Starting point is 00:34:37 storage controller kind of bottleneck. It's not a dual controller array that, hey, no matter how much I put behind it, I can only do what those controllers give me. It's not a cash based architecture where you can only run as fast as the cash allows. It truly is this distributed IO architecture where you will run at the speeds of the, the network architecture, but then as that increases, so will your speed. And it's not just network, right? So you can also end up in a situation where you've got, this is part of being sort of disaggregated, where you could have the storage backend being able to provide a lot more storage than the compute you currently have available to it can consume, right? They can be, you could have, you know, six compute nodes running
Starting point is 00:35:21 a heavy workload and they're all pegged at a hundred percent, but you have only tapped out 20% of what the storage backend can provide. So you just keep adding compute, right? Theoretically, you can get to a point where now your compute is getting to the point where it's starting to saturate what the storage backend can provide. So you just add more storage. We can keep moving this in different directions to ensure that we're not network bound, CPU bound or disk bound or whatever. So talk to me about data protection and your environment. We haven't talked about, you know, how you protect for drive failures or node failures and those sorts of things.
Starting point is 00:36:03 Yeah. Great, great call out. And this is one of the areas we're super proud of because, and again, you know, going back to when Brian and I first started doing this, we used to have to talk about this in theory, but now what's awesome is we have a whole bunch of customer data to back it up. And when you look across, you know,
Starting point is 00:36:18 PowerFlex's big deployments, this is a true like tier zero, mission critical type of resiliency platform, which is really exciting. Right. You know, I'm sure you guys are very familiar of the long history we have with a platform like PowerMax, which is like the gold standard of uptime and resiliency. Like PowerFlex should be thought of in that same breath from a resiliency perspective. And we have the customer data to prove it. And it relies on the same concepts that we use to deliver IO performance, which is many hands make light work. So the way we protect data and PowerFlex is through something we call a parallel mesh, which is a fancy term for a many-to-many RAID 10, if you will, in that, again, we'll go back
Starting point is 00:37:07 to our example of 10 nodes with 100 NVMe devices spread across them. My volume and my protection of that volume is done in a many-to-many fashion across all 100 of those NVMe devices. So in the event I have, for example, a drive fail, all the remaining devices, so one drive fails, all 99 remaining devices evenly participate in that rebuild. Same thing, if I have a 30-node cluster, one node fails, all 29 remaining nodes perfectly evenly participate in that rebuild. And the, the outcome of that is an incredible amount of rebuild speed. You know, if you think, you know, traditional kind of storage system, you have a, you know, call it a four terabyte
Starting point is 00:37:58 flash drive fails, you know, your rebuild is measured in hours, right? That same failure on PowerFlex, your rebuild is measured in literally minutes. And that's exciting. Depending on how big it is and how small the data set is. Yeah, I mean, potentially seconds, right? So it's all about the speed of rebuild and taking advantage of that distributed architecture in order to deliver that incredible
Starting point is 00:38:26 resiliency. And I remember literally doing this and, you know, going back years where we talk about the math behind it, but now it's really exciting. We just, we literally have all the customer data, the customer references running these, you know, core banking workloads at scale, kind of proving out the resiliency of this architecture, which is exciting. And then I'll supplement that with a couple ideas, though. So we're actually, we're not protecting the infrastructure. We're protecting the data. Right.
Starting point is 00:38:52 Right. The data and that mesh mirror copying. Right. That's what we're protecting. We really could care less in this context about the underlying disks or nodes. The expectation is that things fail. Disks fail. Disks fail, nodes fail. And the system is designed to bend and flex around that, but not break.
Starting point is 00:39:11 So when we lose a disk, we're immediately re-protecting the data in this many-to-many pattern that Anthony described. And as soon as we're done with that, we're done as far as any re-protection scheme is after. And unlike traditional RAID where, yes, you can lose a disk or two disks or more, but your data is protected in that first instance, but you're not really healthy as far as your data protection goes until you've replaced the hardware and rebuilt the RAID structure. Right, right. So there is a hardware problem. We don't care about that. as far as your data protection goes, until you've replaced the hardware and rebuilt the RAID structure. Right, right. So there is a hardware problem.
Starting point is 00:39:49 We don't care about that, right? So in our context, you lose a disk, we rebuild the data. It may take, you know, 60 seconds, two minutes, depends on how much data is on there, how many nodes are contributing to this process. Once we're done with that, that's it. You can replace the disk at your leisure or not.
Starting point is 00:40:06 You'd be down a little capacity, but whatever. Does that make sense? Same with the node, right? Another piece on resiliency that I think actually ties back to one of the points Keith brought up earlier, which is it's the operational concepts of resiliency. And, you know, look, I think we're all familiar with like hyper-converged and the rise that's had in a lot of data center environments. And one of the challenges that a pure hyper-converged model has, especially in today's day and age where security patching seems to be happening more than ever, is every time I update a host, I'm taking compute and I'm taking storage offline.
Starting point is 00:40:42 And I've got to think about that, right? Like that's a resiliency concept I need to think about that, right? Like that's a resiliency concept I need to think about when it comes to things like patching. And because of that, we've seen a lot of customers say, hey, you know, I need to remain with that three-tier kind of centralized storage model because I want the ability to patch my compute and not have to think about what does this mean for my storage? Am I taking storage capacity or performance offline? The beauty of PowerFlex is that it gives you the flexibility of three-tier with the operational simplicity of HCI. So when you think about something like
Starting point is 00:41:18 patching, I'm getting all the scale out and the automation concepts that HCI kind of mainstreamed. But because my storage and compute are physically separate, I can go patch and reboot a whole bunch of hosts. I'm doing nothing to my storage. I'm not taking any storage capacity or performance offline. So I maintain that operational benefit that exists in the three-tier world of patching, updating my compute and my storage as each sees fit without an interdependency on the other. And that's something especially larger customers have really seen a ton of value in that has been one of the drawbacks of hyperconverged at scale. Yeah.
Starting point is 00:42:02 So, I mean, now just to be clear, ultimately you have to update the storage nodes as well. And during that update, there would be sort of a cycling through the various nodes or drives in order to perform that update while they're, I'd say the node goes offline, if that's even a terminology kind of thing. Into a maintenance mode where we're expecting, there's different ways of handling it. But yes, of course, we roll and cycle node by node
Starting point is 00:42:34 through the storage cluster to update all of the different components of it. And it says my data is basically sharded across 10 nodes. This is a rolling update. I'm not taking that outage, service outage to do this. You know, because one node goes down for maintenance because it's turn in the upgrade cycle. Remember, there's extra copies on all the other fault units,
Starting point is 00:42:57 which are all the other nodes. And so it just simply moves to the other stuff in the meantime. And then that does finishes its job and we move on to the next one. Or we can, you know, if you really needed to, you could bring a node out completely and bring a node back in or add extras. And the elasticity is there to rebalance anyway. So you guys mentioned the gold standard and availability within the Dell
Starting point is 00:43:23 world, which is the VMAX. PowerMAX. I'm sorry, the PowerMAX. Apologies. When I think of protecting uptime with a PowerMAX, I'm generally thinking of, you know, I've lived in an environment where I supported mission-critical SAP. I just had two PowerMaxes, believe it or not.
Starting point is 00:43:49 If a customer's that concerned about the flexibility, but they want the software defined nature of that, how do they match the availability? What are some of the, I guess the question is, what's the fault domain when I'm thinking about availability of my services? Right. What do what's the fault domain when I'm thinking about availability of my services?
Starting point is 00:44:06 Right. What do you mean by fault domain? So the fault domain of a VMAX is a VMAX. So I have a second VMAX. This is more software defined. So if I have 10 nodes, there's kind of a cluster. So do I have two clusters or how do I design for availability? Yeah, great, great question. So the answer is there's a lot of ways you can do that. So within PowerFlex, there's a concept of cluster, which is my overall system. Then within that cluster, I have a concept of protection domains, which is grouping of nodes. So let's say I have 30 nodes. I can have a 10 node, a 15 node, and a five node protection domain. Each of those are completely their own fault unit from
Starting point is 00:44:53 a failure perspective. So it's almost like creating software defined arrays within the overarching cluster. And then there's a third concept called storage pools, which doesn't get used as much, but actually allows you to drive that segregated protection down to the disk level, if you'd like. So essentially, you can obviously create, you know, multiple PowerFlex clusters, and then use, you know, replication, whether it's at the PowerFlex layer or the application layer, to have data between two completely separate PowerFlex clusters, either within a site or across sites, or you can create that segregation within the PowerFlex cluster itself at the protection domain level with the advantage there being, all right, you know, think of,
Starting point is 00:45:41 I have application A, B, and C each running in a different protection domain. Application B gets shipped off, shut down, retired, sent to the cloud, whatever. I can now take those nodes in protection domain B, and I can just redeploy them into A and C. So you get a lot of flexibility there. The third concept we have, or fourth one rather, is a really interesting one called fault sets. And what fault sets allow you to do is create groupings within a protection domain to protect against specific failure scenarios. So an example being-
Starting point is 00:46:17 Rack or something like that. Rack level failure, exactly. I can actually tell PowerFlex, I have a protection domain that's spread across, I don't know, let's call it four cabinets. But I want to make each cabinet a fault set. PowerFlex will ensure your data is protected outside of the fault set. So you can actually lose an entire cabinet of nodes.
Starting point is 00:46:37 No problem. You're still up running serving data. Application doesn't see a thing. And that starts to become really interesting. And that's one of those benefits of software defined. So remember what he said earlier about the RAID 10-ish like behavior of the protection in the background, obviously you'd never put the two relevant copies on the same disk, but you never put the two relevant copies on the same node, right? The idea is that you want to allow for anything to fail at any time.
Starting point is 00:47:10 But the idea here with the fault sets is you never put the relevant copies in the same fault set. So we're okay with an entire grouping of things going down altogether and never seeing a disruption in your event. And outside of failure, where we actually see fault sets mostly used is with larger customers for maintenance. So we talked earlier, hey, I want to update that. I do a node at a time. I cannot do a cabinet at a time or a fault set at a time of upgrading all those nodes at the same time without any type of outage. Now, here's where I'm going to pull up a little curveball. This fault set concept is super interesting. It's even more interesting when we talk about the ability to deploy PowerFlex directly into the public cloud, which we now have the ability to do. Because what I can now do is using that same fault set concept, I can deploy a PowerFlex software-defined array in Amazon, let's say, but I could deploy that across availability zones. And I can use the fault set concept where I make each Amazon availability zone a fault set, provide AZ level failure protection at the storage layer in the cloud without needing to replicate
Starting point is 00:48:28 your data set into every single availability zone that you want protection from. So Anthony, we're in bonus time. You kind of set me up. I was going to ask about the question around hybrid cloud. You can't have a conversation today about hybrid data without having a conversation about being able to manage data in the public cloud. So is this concept for AWS, is this integrated into the AWS control plane, not control plane, am I consuming the AWS control plane natively or is there an approach to abstract it away so I can take this concept and deploy it in Google Cloud, Azure or some other cloud that I'm interested in taking the underlying cloud's capability in applying this PowerFlex model to? Yeah, great question. So this is software you model to? Yeah, great question. So this is software you bring to the cloud. So the cloud, it's not a integrated into the tools portal, if you will,
Starting point is 00:49:38 experience. It's truly, I'm using, instead of using PowerFlex nodes on-prem, I'm creating virtual instances in the cloud that effectively become my nodes that PowerFlex then gets deployed on top of. And I'm now using that to create a PowerFlex block storage system in the cloud. and that you can kind of bring it anywhere, right? Because it's just running on top of a standard cloud instance, allowing you to create a private block storage array in the cloud. And that now unlocks some really, really interesting capability, right? We talked about one, that multi-availability zone protection concept. Another basic one is data mobility, right? I got PowerFlex on-prem.
Starting point is 00:50:22 I got PowerFlex in the cloud. I can now replicate. I got PowerFlex in the cloud. I can now replicate. I got PowerFlex in Amazon, PowerFlex in Azure. I can move my data between them. The third one, though, that becomes really interesting is performance. So, Anthony, before you go down that path, you mentioned replication and migration in the same breath. And to me, those are two different functions. I mean, can you explain what you're talking about there? Yeah, for sure. I agree with you. are two different functions. I mean, can you explain what you're talking about there? Yeah, for sure.
Starting point is 00:50:47 I agree with you. Definitely two different functions. The way to think of PowerFlex in that construct is very much moves the data from A to B. So your kind of traditional storage replication concept, if you will. So if I have PowerFlex in my data center, I have PowerFlex in the cloud, I now have a very easy way to get the data from A to B, whether that's for disaster recovery purposes or some type of migration of that data set into the cloud or vice versa. And so when you're thinking of replication in PowerFlex 2, it's a volume by volume or volume group level operation so that it's not that the entire cluster
Starting point is 00:51:29 has to be configured like this is my primary and that's my target over there. But for any given volume, one side of the equation is designated as that's your source, that's your target. So you can be running PowerFlex on both sides, on-prem and cloud, and decide, okay, now I'm going to move this from on-prem
Starting point is 00:51:50 over to the cloud side, start operating from that as my primary, scale up the storage underneath that for power, run tests, do whatever you're looking for. Does that make sense? Yeah, yeah, yeah. No, it's, yeah, certainly there. We didn't talk about some of the data services around here.
Starting point is 00:52:10 I mean, so you mentioned replication, compression, snapshotting. You mentioned RAID 10 kind of thing. Do you guys support compression and data snapshotting? We do. Yep. Data reduction, snapshots, really high-performance penalty-free snapshots. All the typical things, if you will, thin provisioning,
Starting point is 00:52:33 pretty standard suite of storage services. Consider those table stakes included. Yeah, exactly. Where it gets really interesting though, is when you think about the Cloud, bringing some of those services capabilities to the cloud, right, where the cloud, we kind of solve both of those challenges. And uniquely, we solve it with a scale-out architecture. And that's what we think is so different about PowerFlex in the cloud versus maybe some other offerings out there where it's, hey, here's a dual-controller virtual appliance that can scale to 60 terabytes if you're deploying the cloud. Like, that's not that interesting.
Starting point is 00:53:25 When I could take the petabyte type scale of PowerFlex, use that to create a true enterprise storage system in the cloud that gives me some of those traditional storage services, but with the scale out kind of native agility that the cloud provides, that becomes a really, really interesting concept that we're getting tons of great feedback and interest in right now with customers.
Starting point is 00:53:49 I mean, imagine how, you know, we can say how elastic PowerFlex is on-premises, but you have to still keep providing some physical nodes to do it, whereas you can keep spinning up EC2 instances nonstop. Yeah. And maybe an interesting story from a customer that has PowerFlex deployed in one of the hyperscalers. On-prem, they kind of run their standard thin provisioning across their storage platforms, of which PowerFlex is one. On-prem, they kind of stay around two to one, right? So they take advantage of over-provisioning, but look, if they need more hardware, more capacity, it takes about a month, right? Place orders, lead times, people showing up to plug it in, et cetera. In the cloud, they're actually running their over-provisioning rate north of 4 to 1 on PowerFlex.
Starting point is 00:54:35 And the sole reason is, well, we're comfortable doing that because if we need capacity, I click buttons and 15 minutes later, it's in my cluster. So they're kind of taking advantage of that cloud agility to drive out cost by running at a higher oversubscription rate. And that's even more interesting because the cloud on its own does not provide thin provisioning. When your PPA goes and provisions a four terabyte volume, but then only stores 500 gig of data. You're still paying the cloud for four terabytes of storage. You're saying that you could drive up the utilization rate from two to one to four to one based on what parameters for the solution? I mean, like thin provisioning or not, I understand, but that would be the same regardless of whether it's on-prem or in the cloud. Totally on things like compression or something like that?
Starting point is 00:55:27 No, I think the operational difference, and this is one of the things I've pinged Dell in the past about and Dell competitors, about when you build storage arrays in a public cloud, you're trying to bring this operating model from the private data center into the public cloud, and it doesn't work because of cost. If I provision a four terabyte storage array, I have to pay for four terabytes of storage, whether I use it or not. And it becomes obscenely expensive. And so I guess the, the, the follow on question to you, Anthony, I like the approach of being able to say, okay,
Starting point is 00:56:02 I'm going to provision 50 gig of real AWS instances behind my PowerFlex and advertise two terabytes or whatever the size I want to advertise to my customers, that process, talk to me about that process of when I actually need to back that with real instances. Is this something that I can automate? Is there APIs in which I can, my platform engineering team can build auto scaling rules that basically says, hey, add additional provision, additional AWS instances, and then go into PowerFlex and assign those instances to my storage pool. You've got the idea exactly. And the programmability of the infrastructure is one of its other highlights.
Starting point is 00:57:00 It is software. There's an API for everything on our side and there's APIs for everything on the cloud provider side too. So everything can be programmed. Bingo. And that's one of the interesting pieces right now. And we realized there are going to be customers who want to take advantage of those APIs. They want to build that automation themselves and great, thumbs up, ready to go. Stay tuned. You'll hear us making some announcements shortly about us for those customers
Starting point is 00:57:31 that maybe don't have the ability to build that automation themselves or just don't want to. Some of that will be provided for them. So without spoiling, stay tuned, if you will. Okay. Well, hey, this has been great. So Keith, any last questions for Brian or Anthony? with, you know, partners around the edges of what they do, you know, like Terraform integrations, et cetera. All the cloudy stuff that I can envision is possible with the platform.
Starting point is 00:58:13 Right, right, right. So, Brian or Anthony, is there anything you'd like to say to our listening audience before we close? For me, you know, thank you both for the time. This was, you know, great discussion. Great, great, great questions. You know, are kind of asked to the audience out there. If you're out there is a really, really interesting platform right now that's growing absolutely crazy for Dell. We'd love the opportunity to come in and talk to you about it. So really appreciate the time. And Brian, any closing comments from you? I'll echo that.
Starting point is 00:58:56 Thanks for having us. This was really great. And to also echo something you mentioned, Dell Tech World is coming up. It's right on our heels here. You'll be hearing a lot about it. If you happen to be there, we have several deep dive sessions, ask the experts, hands-on labs, come get a hands-on look at it, ask a lot more questions if you've got them. Yeah, I guess I have one question. It was actually a whole set of questions,
Starting point is 00:59:22 but I'll just ask one of them. Is there some sort of like a trialability capability? Can I log on to PowerFlex.com and download the software and run it for a four-node system for some number of hundreds of gigabytes or something like that? No. So remember, we don't want customers to say, oh, let me take software and build my own PowerFlex system. Right. There's there's a level of science project projectiness to that that we just found customers are just not that interested in. So if you're in a position where, you know, you're designing infrastructure, you want to get hands on with PowerFlex. We have a whole array of options, everything from very, very rich labs with very high performance systems at Dell facilities that customers can access, or we do
Starting point is 01:00:10 have the ability to send gear to customers in their data center where they can test with their workloads and applications specifically. So to truly kind of properly evaluate, you have the ability to do that. And then the same extends to cloud as well, where we have to work with customers, stand up, you know, PowerFlex in the cloud with them where, you know, they can go and do all the testing and validate, you know, a hold of some pretty robust hardware within the Dell ecosystem. And we found that as a need that we didn't need to bring the physical nodes into our data center, which is kind of our bread and butter. It's a little known hack as a Dell customer to leverage these EBCs for pretty much whatever you want to test. Interesting. Interesting. All right. Well, this has been great, Brian and Anthony. Thanks for being on our show today. And thanks again to Dell PowerPlex for sponsoring this podcast.
Starting point is 01:01:15 Thank you both. Have a great day. And that's it for now. Bye, Brian. Bye, Anthony. And bye, Keith. Bye, Ray. Until next time. Next time, we will talk to another system storage technology person. Any questions you want us to ask, please let us know. And if you enjoy our podcast, tell your friends about it. Please review us on Apple Podcasts, Google Play, and Spotify, as this will help get the word out. Thank you.
