Grey Beards on Systems - 146: GreyBeards talk K8s cloud storage with Brian Carmody, Field CTO, Volumez
Episode Date: April 13, 2023. We've known Brian Carmody (@initzero), Field CTO, Volumez for over a decade now and he's always been very technically astute. He moved to Volumez earlier this year and has once again joined a storage startup. Volumez is a cloud K8s storage provider with a new twist, K8s persistent volumes hosted on ephemeral storage.
Transcript
Hey everybody, Ray Lucchesi here.
Jason Collier here.
Welcome to another sponsored episode of the Greybeards on Storage podcast,
a show where we get Greybeards bloggers together with storage and system vendors
to discuss upcoming products, technologies, and trends affecting the data center today.
We have with us here today Brian Carmody, field CTO of Volumez and an old friend.
So Brian, why don't you tell us a little bit about yourself and all about what Volumez is doing with its composable data infrastructure?
Hey, Ray. Hey, Jason. Great to be with you guys.
Yeah, so I'm a storage engineer, so I work on systems for storing information.
And my company is called Volumez. And it's composable infrastructure for the cloud.
So you tell it what you need in a declarative fashion. Give me one terabyte of storage that
does a million IOPS with 400 microseconds latency and spread it across
three availability zones. And it goes and it builds that in your infrastructure and presents
it to your application.
That's pretty unusual. I mean, because, you know, most composable infrastructure systems I'm aware of have, you know, PCIe switches and, you know, JBOFs of flash and other types of boxes.
There's lots of hardware involved in composable infrastructure.
You're doing this in the cloud, which means it's only software, right?
Yeah.
So we don't have any type of proprietary data path code.
We don't have a single, we haven't written a single line of data path code.
We use Linux, off-the-shelf Linux instances, as the end-to-end data path.
So we are strictly control plane logic. And by doing that, by getting rid of all the proprietary
stuff in the data plane, it makes it fast, it makes it easy, and it makes it so that it can work anywhere. We launched on AWS in the third quarter of last year.
We just had our soft launch on Azure in the first quarter of this year.
We're working on GCP and on-prem, which will be ready by the end of the year.
But since it's just Linux, if you give me a cluster of Raspberry Pis, we can orchestrate it.
I do have Raspberry Pis.
It's not quite a cluster yet, but I'm working on that.
So this sort of stuff is kind of unusual for cloud environments.
I mean, YAML and all that stuff is pretty prevalent in the Kubernetes space.
Do you work outside the Kubernetes space as well as there?
Or how does this all play out?
Yeah, so our largest customers are all Kubernetes environments.
But we can work perfectly with instance-based storage,
just a virtual machine.
We can give you a mount point.
We can give you a raw block device.
But most of our production workload customers are running Kubernetes.
So in that, do you present yourself as, like, a container storage interface, like a CSI interface? Is that how you present? And then persistent volume claims are made against that?
Yeah, that's exactly right. So you install our Helm chart. It takes about 30 seconds,
and 30 seconds later, you can be provisioning storage to your pods.
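As a rough sketch of what that looks like from the consuming side, assuming a made-up storage class name (the real class and provisioner names aren't spelled out in this conversation), a claim is just ordinary Kubernetes YAML:

# Hypothetical example: a PVC against a CSI-backed class created by the Helm chart.
# The class name below is illustrative only.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: volumez-csi-class   # assumed name, not the vendor's actual one
  resources:
    requests:
      storage: 1Ti                      # the one-terabyte example from earlier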
Yeah. I was looking at some of the stuff on your website, and it talks about
IOPS and availability. These sorts of things are not necessarily what I'd consider normal PVC policy kinds of things, right?
I mean.
Yeah, so we were kind of inspired by the model of Kubernetes itself.
So for a platform engineer or a DevOps engineer, when she's orchestrating and she's getting ready to deploy a container, what do you tell Kubernetes? This is how many vCPUs I want. This is how much memory I
want. And our idea is that it should be as similar to that for storage. So tell it how much you need,
tell it the resilience you need, tell it the performance you need, and let the orchestrator take care of the rest. Don't think about RAID, don't think about
multi-pathing and all those types of things that storage people in the old days used to
worry about. DevOps people don't care about that. They don't want to know about that. They just want
to specify their requirements and let the robot figure out the rest. So basically in your
definition,
when you're basically providing that persistent volume claim,
you're effectively setting what you want the capacity to be,
what you want the IOPS to be,
and what you want the latency to be, right?
Yeah, so we have a little trick to make it simpler.
We have these objects called policies.
So the first thing you do is you create a policy, which is a template for a PV
or for a volume. So the policy specifies the IOPS, read and write, the bandwidth, read and write, the latency guarantee, the resilience: how many availability zones, and does this have to be resilient against X number of failures?
And whether you have any thin provisioning requirements, whether you want encryption in flight.
And those templates, those policies are basically templates.
So then from Kubernetes, the operator will call that and say, give me this much storage and use the regional availability high performance policy.
So we make it so that you don't have to actually directly in your YAML file specify IOPS and latency.
You just order off the menu.
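To make the menu idea concrete, here's a minimal sketch under stated assumptions: the policy fields and the parameter key are made up for illustration, since the actual object names aren't given in this conversation. The platform team would define a policy once, and a storage class simply names it:

# Hypothetical policy, defined once by the platform engineering team (field names assumed):
#   name: regional-ha-high-perf
#   iops: 100000 read / 100000 write
#   latency: 400 microseconds
#   zones: 3, resilient to a zone failure
#   encryptionInFlight: true
# A storage class that "orders off the menu" by referencing that policy:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: regional-ha-high-perf
provisioner: csi.volumez.example        # placeholder provisioner name
parameters:
  policy: regional-ha-high-perf         # assumed parameter key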
So you effectively create a set of storage policies that are generic, that you can support in
that environment.
And then the customers that want to use those volume claims just identify the policy that
they're going to use and a capacity, I guess.
Yeah, exactly.
So the way it kind of works in production is the platform engineering team will own the policies.
And the policies are, you know, completely dynamic. You can create and modify and delete them on the fly. That's like creating the menu, and then the DevOps teams call and order items off that menu, which is what ends up being provisioned to their applications.
Yeah. Somewhere I saw that you were using ephemeral storage rather than EBS io2 or something like that, which would be, you know, high-performing, persistent storage.
Yeah, I mean, it's not really that high-performing.
Well, I know, but it's high-performing for EBS, I guess, or whatever. Yeah, but a terabyte of high-performance storage using a hyperscaler software-defined storage
system is like buying a Volkswagen every month. Okay, I won't go there, but I'll accept that for
now. So how does this work with ephemeral storage? I mean...
Yeah, so think about how RAID works on an enterprise storage system.
You use software to create a highly available logical device out of a bunch of unreliable devices.
So we're doing the same thing in the cloud.
Ephemeral storage is ridiculously high performance. The latency is exceptional. But they can disappear at any moment. And so what our composable infrastructure does is when you deploy in, let's say you deploy our CSI driver, what it does is it scans your environment for ephemeral storage. It profiles each of those
drives individually to see what's the mixture of read and write performance, how much IOPS,
how much bandwidth, and at what latency. And then it makes that ephemeral storage available
to store slices of data. Then at provisioning time, when a request comes in, our algorithm evaluates, okay, this
application, this request is for a terabyte of storage, 100,000 IOPS with two-zone resiliency.
And then it looks at the available capacity and the available IOPS and bandwidth of all those
ephemeral drives. It builds a protection scheme, which depending on the specified requirements
could be erasure coding, it could be mirroring, and then it constructs a Linux data path using LVM, multi-pathing, NVMe over TCP, all the kind of standard stuff, and ultimately connects those slices of data carved off of those chosen ephemeral drives and presents them to the application.
But I mean, so in this case, those drives were effectively assigned to an EC2 instance, let's say in AWS, to, I'll call it, your Linux storage solution.
Yeah. So imagine we have a couple of instances that have, we'll use AWS as an example, a couple of instances that have instance storage.
So the CSI driver will profile those drives, or if they're outside of Kubernetes, we have what's called the Volumez connector, and it's just a control plane agent. And it will register that node with our SaaS control plane. It will look, and if there are any ephemeral drives, it'll profile them. And then it'll set up a simple target service to get that drive onto the network and visible over the network.
So on-premises this would just be a JBOF drive, but in the cloud we need to deploy a little target service that basically just passes through and gets that drive
on the network.
Like a proxy or something like that to get access to the drive?
Yeah. And what's unique about our architecture is all of the processing is done on the host, the application server itself. From erasure coding to snapshots, thin provisioning, encryption in flight. All of that happens running in device
mapper on the application server. So we don't have any controllers. It's a controllerless
architecture. And there's enormous benefits from a performance and scalability perspective that
come with getting away from controllers.
You don't have a controller software entity sitting out there in front, with these drives behind it and stuff like that, doing stuff?
No, no.
So, kind of fundamental to our architecture thesis is that storage controllers are kind of at a technological dead end.
Ouch, that hurts.
No, no. So think about it. So the performance of media is doubling every 18 months. With PCIe Gen 4, Gen 5, Gen 6, we're now at the point where you can hold in your hand a device that can do over 10 million IOPS, a little slice of NAND flash. And so if you go from around the beginning of XIV, go from like 2008 to the present, the throughput of what you can hold in your hand has increased by 100,000x. But CPUs on the controllers are absolutely not increasing at that speed.
So we're at the point now in the industry where the controllers themselves are the bottleneck.
Whether it's a controller-based architecture like an Infinidat or something like that, or whether it's a metadata controller that
separates the metadata server from the data server, or if it's even just a distributed lock mechanism
like some of the shared file systems that are very popular right now. Even in that case, the amount of
time for the CPUs to release and manage the locks is longer than it takes to do the IO.
So the controllers, the CPUs themselves are the bottlenecks. So our big idea is in order to scale
and support the next generation of workloads, we have to get rid of the controllers because they
themselves are the bottleneck and we have to make it distributed, and we have to connect media directly to the application servers. And thankfully, things like NVMe over TCP and DeviceMapper
make that all easy and all built right into the Linux kernel.
This is heresy. You know that.
The world according to Ray.
I'm just saying.
I just kind of put that out there, but we'll let that slide for now.
Brian, I think it's awesome.
You would, Jason, of course.
Of course I would.
So this puts requirements on the application environments? They have to be at the latest level of Linux and things of that nature? I mean, snapshots, erasure coding,
all this stuff that's, I guess,
present in the Linux kernel today,
but it's not present in the older kernels
or things of that nature.
So would there be requirements
for the application environment?
Oh, yeah.
We definitely like customers
to be on the latest Linux releases.
I mean, there has been over the past decade, somebody did a calculation about the huge investment that the Linux kernel development team has globally put into the storage stack over the past 10 years.
And the number I heard was it was about $2 billion.
And they need to do that, I might add, but that's a different story. Right. But here's the crazy thing.
They've built all the primitives directly into the OS to build incredibly high performance,
highly resilient systems, systems that beat proprietary data paths in every dimension that matters to customers with just one problem. It's so complex to configure that it can't be reasonably done by humans.
And so we built a robot, we built an orchestrator that makes it really simple. You just give it a
YAML file or a couple of clicks in the GUI and it sets it up perfectly every time. And it's infinitely scalable.
So, I mean, okay.
And the control plane, let's call it, in this case, is a SaaS solution that operates in the cloud and anywhere kind of thing?
Yeah, exactly. That's how we deliver the control plane, as a managed service, as a SaaS product.
And this SaaS product runs today on AWS and Azure, and soon to be GCP, and someday to be on-prem? Or services those environments, maybe, is the right word.
Yeah, so I don't – I'm not sure if we're going to do an on-prem version of the control plane, because you can create so much higher levels of resilience by following the SaaS and the site reliability engineering model.
That is kind of a solved problem in tech is how do you build web-based applications that are 100% available?
That's a very well-understood problem.
So I'm not sure if we'll do the control plane on-prem, but no matter where the control plane is running, as long as you have internet access, we can orchestrate anything. If you were running on a Linux laptop right
now, all you would have to do is sign up for our service, do a yum install VLZ connector
after adding our repo, and your Linux laptop is going to show up in our control plane as
a registered thing ready to do work.
Interesting stuff.
You mentioned snapshots and stuff like that.
I mean, so with ephemeral storage,
obviously it's not persistent when the instance goes away.
You use snapshots to protect that data?
Is that how it would work?
So the snapshots are typically used for business continuity. So we use RAID mirroring or erasure coding to protect against hardware failures.
And then snapshots are for, you want to roll back your database, you want to restore an application
to a previous state. That's kind of what the snapshot function is for.
You know, instances fail, right? And when instances fail, the ephemeral storage goes
away. Or when instances terminate, the ephemeral storage goes away. I guess if they terminate,
I guess the expectation is that the data is no longer needed, but that's not quite true either.
Yeah. So let's think of like a really simple example.
So let's say you're in AWS.
You're using our composable storage.
And you have an application that needs a one terabyte file system that does whatever, 100,000 IOPS.
And you need 300 microsecond latency at 100,000 IOPS.
And you say that you want dual zone resiliency. So what our orchestrator is probably going to do
is it's probably going to compose a RAID 10, where the number of stripes is
whatever is needed based on the capabilities of your medium.
And then we're going to keep two copies of it, one in zone A and one in zone B.
So even if you have, forget about an instance terminating, even if you have a zone go offline, an entire zone failure, you still have half of the RAID mirror.
So your application continues to run.
And then once the zone comes back online, or depending on what your resilience policy is,
we'll just, in the background, rebuild the raid and get you back to full redundancy.
Somehow, somewhere, ephemeral storage has this problem in my brain that goes away
when the application goes away and stuff like that.
But in this case, the application that's holding the ephemeral storage is, no, it's not a controller.
You don't have a controller in this environment.
Yeah, so in order to get the performance and scalability that we deliver,
you have to get rid of the controllers.
The application server has to talk directly to the...
I struggle with that, Brian, but I think I'm getting to understand it.
We should give you a demo or sign up.
We talked about that.
That's unfair.
That was unfair, Brian.
The challenge with ephemeral storage is that it goes away. It goes away when the EC2 instances that it's assigned to go away, and the data is gone. So what, you know, I would say a lot
of the containers have all been around stateless storage kind of environments,
but lately they've been changing.
I mean, a lot of the databases and stuff like that need to have stateful storage that stays around with or without the database being there.
How does this sort of thing work in that environment?
Because it's not Kubernetes?
Yeah, so this is absolutely the
decade of stateful Kubernetes. There are so many companies in every industry that are building
highly complex, stateful, data-driven applications. And Kubernetes is the go-to
way to deploy. So think about it like this. Imagine instead of being in the cloud,
we were running Kubernetes on-premises and we were using an on-premises excellent technology,
something like Portworx. Imagine we were using something really good like Portworx or OnDat
or something like that. And ultimately behind the CSI driver and the Kubernetes stuff, there's a storage system that has RAID.
And what happens if one of those NVMe drives fails?
Rebuild.
A rebuild.
Well, it's the same thing with our technology in the cloud.
It's the same thing if an instance is terminated.
It's a RAID rebuild.
Yeah, but it's ephemeral.
It goes away.
You're saying that in the case where you have resilience specifications that says, let's say, you know, two availability zones, one of those guys
is always going to be operating.
Nobody's going to sit here and actually terminate the application is what you're saying.
Why would you terminate an application like a database or a web app that's got to be 100%
up and operational?
Even in upgrade kind of scenarios, you're still running half the applications while the other half are being upgraded.
So the applications live forever in this environment, someplace.
It's clear the location of the application and the location of the ephemeral storage
are almost certainly not on the same instance. So imagine if you want to make it a really easy
picture, imagine you have a Kubernetes cluster running AKS in Azure or EKS in AWS, or you're doing your own self-managed
Kubernetes. And imagine outside of your Kubernetes cluster, you also have, let's say, a cluster of
eight instances that have instance storage or ephemeral storage as Azure calls it.
So you would install our VLZ connector on each of those eight nodes.
They would register with our control plane.
We would recognize the ephemeral storage.
We would profile it and make it available for orchestration.
And then with the CSI driver installed
in your Kubernetes environment over here,
when a persistent volume claim is submitted,
our CSI driver is going to pick that up.
It's going to say, okay, these are the requirements,
IOPS, bandwidth, latency, resilience.
Then our orchestrator is going to say,
okay, in order to deliver those,
we're going to carve out a slice from, let's say,
four out of those eight NVMe instance medias,
two from this zone and perhaps two from this zone.
Then we're going to establish NVMe over TCP connections
from those media instances to the worker node in Kubernetes that that pod is scheduled for.
And then on the worker node itself, we're going to set up a Linux-based data path, all the data services.
So LVM, device mapper, multi-pathing, everything that's needed.
And ultimately create a file system and then mount that in the container.
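From the application's point of view, that last step looks like ordinary Kubernetes; a minimal sketch of a pod consuming such a claim, reusing the hypothetical claim name from the earlier example (image and paths are illustrative):

# Sketch only: a pod mounting the hypothetical claim.
apiVersion: v1
kind: Pod
metadata:
  name: postgres
spec:
  containers:
    - name: postgres
      image: postgres:15
      volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data   # the composed file system ends up mounted here
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: postgres-data                # the claim from the earlier sketch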
Make sense?
Yeah.
So you're basically then taking, you're creating this storage subsystem out of instances that
you create, right?
So, and then that's what is being connected to from your CSI driver.
And it becomes a pass-through.
Is that how you're saying it?
Effectively?
Yeah.
Yeah, that's a good way to think about it.
Think about it as a pass-through.
All right.
Now I see what's going on.
So those instances will never die.
They're resilient. They're across multiple AZs.
And the data that is carved out of those instances can be multi-AZ or not.
But the data will continue to exist as long as those instances exist.
Right.
Almost like storage nodes within a storage cluster, right? Without being called controllers.
Right.
Well, then here's the thing, too, when you're talking about that.
So data and metadata distribution across that, I mean, what's being done to manage cluster consistency in that?
Yeah, so that's, Jason, that's a really good question.
So the reason why we don't call them controllers,
the reason why they're not controllers
is the data services don't run on them.
So unlike, I'm not sure how Scribe works.
Right.
So maybe we can talk about that
and use that to contrast.
But there is no shared state.
So let's say we have two applications.
And so we've orchestrated or composed two data paths leading from those instance storage nodes up to two application servers.
There's no shared state between them.
They don't know about one another.
So there's no, within the data plane,
the only state that has to be managed
is within the Linux kernel on the application server itself.
And this is absolutely bulletproof code.
This is LVM.
This is device mapper.
It's MD, the RAID device.
This is like some of the highest quality storage code in the world.
And then the control plane state,
what are the names of the volumes,
who's mapped to who in the event of a failure or moving, let's say the Kubernetes scheduler decides to move a container,
a stateful container from worker node A to worker node B.
Our control plane has an asynchronous state that it keeps of the overall picture.
And then it's able to go and issue orchestration commands to respond to changes in the environment.
Also, all the observability, all the metrics, all that stuff lives in Prometheus in our control plane. I got you.
That makes sense.
So effectively, the control plane's maintaining the data path, metadata, if such a thing exists
for this particular environment.
Yes, but it's asynchronous.
So for example, let's say, guys, we had an environment set up in your VPC, and you had your application running, and there was an internet outage in your hyperscaler. So you lost connection to our control plane. All of your applications continue to run.
Yeah.
Because the synchronous state is just running.
It's maintained in the Linux kernel.
The data path is there.
It exists.
And all those services are resident on the application servers,
EC2 instances.
What you wouldn't be able to do is you wouldn't be able to provision
more until that internet connection came back.
So that was a huge part.
Because, you know,
there were some early experiments
of delivering storage as a service over the network,
sometimes with caching devices, you know,
to improve local read performance.
But they all had this Achilles heel
that internet connections are not anywhere near
the reliability that's required for a storage data path.
So we expressly built this thing so that if connectivity with the control plane became
intermittent, that it does not affect the well-being of the applications.
All it would do is would degrade collection of metrics and the ability to do new provisioning
until that connection was restored.
This is pretty interesting stuff, Brian.
It is cool.
And I like that overall design, basically. Scribe was pretty much designed the same way, with the assumption that everything is going to fail.
And you design around basically the failure paths that are there?
And how do you maintain consistency within data when you assume everything is unreliable?
Well, yeah.
I mean, that's why I was so attracted to this company.
I definitely didn't think I was going to do another storage company after Infinidat.
That's what everybody in storage says.
Yeah, because I didn't think there were any interesting solutions left. There were tons of very, very important problems. How do we deliver
microsecond latency in the public cloud? That doesn't exist. How do we make it affordable?
And how do we give enterprise-grade data services in the cloud? These are all the problems that customers were asking for, but there were no interesting,
I didn't think there were any interesting solutions left.
And the flaw in my logic is I was stuck in that controller paradigm.
Tell me about it.
But once you get rid of the controllers and you make it composable, you address the media directly.
Suddenly now, instead of talking about milliseconds, we're talking about microseconds.
And we can create a universal data fabric that is the same in every cloud, in every data center on-premises.
End-to-end, it's just Linux.
So there's no vendor lock-in.
All the trickery and all the bad user experience stuff
that people complain about proprietary,
open source solves all of those.
And that's why I think it's really appealing to our customers
because essentially what we're telling them is
you don't have to trust a bunch of punks in Tel Aviv
to build a data path that's going to not ruin your applications, lose data, and get you fired.
This is Linux.
This is storage software that you already use, potentially depending on the size of your company.
You might have hundreds of thousands of instances deployed using this storage. What we will give you is a
better way to configure and manage that. And the end result is better performance, better scalability,
better data services than any of the proprietary products, data paths that came before it.
So essentially, it's about unlocking the potential of open source. Yeah.
Storage is still like the storage is always going to be the critical metric on which businesses are measured. Right.
Tell me about it.
You can lose, like, you know, you lose compute.
You basically, you effectively lose, you know,
capacity on your ability to do things. You lose network.
Basically that's a plumbing issue and you know,
you basically get another switch and fix it. You lose customer data,
you're out of business.
Somebody falls under a sword someplace. Blood flows freely.
Storage is still the most important thing in any business.
You lose all your contacts in your CRM system
and you're kind of screwed.
Let's get back to data protection for these things.
So backups and stuff still work in this environment?
I mean –
Yeah, of course.
Take a snapshot.
And a snapshot would be a Linux snapshot within the – I don't know what the term is.
The file system of Linux?
Is that – Is that LVM based?
Yeah. So LVM is the engine that powers much of our data services. But remember,
it's a little more complicated on the backend because a particular file system that's visible to a container will be composed of slices of data on a bunch of instance storage nodes, potentially in different zones.
So in order to take the snapshot, the snapshot is stored on the same node that has the instance storage.
And those snapshots then have to be composed. Once the snapshot of a slice is taken, those snapshot slices have to be composed into a data path, just like the primary storage, in order to present it to a media server, or to mount it back to the same server and push it out to MinIO or to S3 or something like that.
No, I got you.
I got you.
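For reference, from the Kubernetes side a snapshot like the one being discussed would normally be requested through the standard CSI snapshot objects; a minimal sketch, with the snapshot class name assumed:

# Sketch: requesting a point-in-time snapshot of the hypothetical claim via the CSI snapshot API.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-data-snap
spec:
  volumeSnapshotClassName: volumez-snapclass    # assumed class name
  source:
    persistentVolumeClaimName: postgres-data    # the claim from the earlier sketch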
How does that affect the recovery time?
Let's say you took a snapshot, you need to revert to a snapshot.
What's basically that recovery period look like?
Yeah, so it's almost instantaneous.
One of the things that we've spent
an enormous amount of time optimizing
is making snapshots fast.
Because what we heard from customers in pretty much every industry and every cloud is that the data services in the public cloud are a regression from what they have with enterprise storage.
And even some of their solutions that are called fast snapshots and stuff like that, it turns out that they're not that fast. They need to be virtually instantaneous because that's the standard that was set by modern enterprise storage systems, NetApp, Pure Storage, and Infinidat.
So it has to be at least that good in order to get the next generation of workloads running in the cloud.
We have all the easy workloads in the cloud already.
Right. And there's so many options that it brings to the table,
especially when you're doing testing patches and doing things like that. I mean, if you've got the ability to do kind of
these instantly promotable, like snapshot clones,
if you will, that allows you to basically clone out
an infrastructure, run it in a test environment,
see if all your software patches work
and then put it live.
And then in the modern application development lifecycle,
that's critical.
Yeah, Jason, and that's where we are today.
Yeah.
So we have some really cool stuff.
It's not finished yet.
We don't let customers do it in production yet.
But what if we could do exactly what you just said, Jason,
but instead of doing it in the same zone or even the same region, what if we allowed you to do it in a different region, or do it from one cloud to another, from GCP to Azure, or from on-prem to any of the public clouds, or vice versa?
So it's not done yet. It's a monster feature.
Again, a really big part of the big picture is we're very, very inspired by what Kubernetes did by standardizing compute.
Linux and Kubernetes is the global standard. I can write
code once, I can build a container, a manifest, a Docker file, and I can run it on any cloud
in any environment, and it's kind of guaranteed to work. We need to make the storage part of the
solution architecture as standardized. And the only way we can do that is with open source.
So we want Linux to be the storage system and Volumez to be the Kubernetes for data.
All right.
So let me talk about pricing for this sort of solution.
How do you price something like this?
Is it on capacity?
Is it on instances, ephemeral storage?
Yeah. So it's a credit model. So it's based on the number of media devices. So it's the number of instance storage media or ephemeral media, if you're in the cloud, or the number of actual physical NVMe drives for on-prem. So that's one basis. The more pieces of media you have,
the more credits you consume. And then it's based on IOPS. So if you want a volume that does 16,000 IOPS, 100,000, a million IOPS, we support up to 2 million IOPS per volume for a volume of any size.
So a 2 million IOP volume is going to consume more credits than a 2,000 IOP volume.
Right, right, right.
And so you consume credits.
It's very similar if you've, like Snowflake has a very similar model.
They have different dimensions.
Obviously, they're a database.
But that's how it works.
And then we have different tiers based on what the support requirements are.
Do you need community support only?
Do you need business hours?
Or do you need 24 seven? And then also with the
enterprise premium tiers, you also, along with the support, you get things like SAML for single sign
on integration into your single sign on environment and stuff like that. So that's it. It's pretty
straightforward.
How many, I mean, so customer-wise, what's a big customer look like?
Our biggest customers are what we call data platform companies.
So these are companies who their product that they sell to customers is a data product.
So database as a service companies.
Think of a name of one of them in your mind.
We're probably talking to them. The hyperscalers themselves all have database-as-a-service products. So these are all the data platform parts of the industry.
So I would say that would be an example of a large customer. And those are workload forecasts
that are measured in petabytes and hundreds of millions of aggregate IOP requirements looking at their
tenant dashboard. And then for direct enterprise customers, which are probably smaller,
think about cybersecurity. Think about all the interesting AI, ML-type workloads in those spaces. Think about manufacturing, particularly automotive.
Again, there's a lot of AI ML, very data-driven stuff.
Think of oil and gas,
the high-performance computing type workloads.
So that would be more of our direct enterprise type stuff.
But we're very early in our ramp.
Again, we're less than a year live on AWS. We're
brand new on Azure. So we're early in our ramp, but we have wildly ambitious plans for the future.
And as far as your sales model, I'm assuming you guys are primarily direct or working with
basically the cloud marketplaces, like AWS marketplace versus being kind of more of a
channel based company, right?
So we're pursuing both. So we have the reseller model.
So this is working with either managed service providers or pure play
services and reseller companies.
And then a lot of the platform companies themselves have marketplaces.
And those are kind of our force multipliers. But one thing that we found is very similar
in the bleeding edge of cloud data management that's similar to on-prem kind of old school enterprise storage is it's very
high touch. It's vendor led and you have to bring developer level technical skills as a vendor
to the table or you're not going to be able to interface with these customers because you're
talking to DevOps and platform engineering teams.
Right.
Have you found any specific vertical markets
where this has been kind of like the killer storage app for?
Yeah, so I would say that it's more horizontal than vertical.
So AI ML is a highly data-driven space. They're having huge cost and scalability problems in the cloud, so these are workloads that are kind of allergic to shared cloud, and they tend to run on-prem. They want to get those workloads into the cloud. And, you know, if you look at the stats for where enterprises are deploying AI ML workloads, I mean, it's across every industry and it's across every business function.
It's HR, it's manufacturing and production, it's cybersecurity. It's sales and marketing.
So that's part of the challenge is what we have built is a horizontal application.
Think of Excel. Excel can be used for lots of things, for construction management or accounting
or whatever. But in their ramp, they had to find the killer first use,
and the first use was finance.
For us, it appears to be, again, we're very early in our ramp.
We're still running tons of go-to-market experiments,
but where we're getting the pull rather than having to push
is in data platforms and AI ML.
And I would say high-performance computing
is another one that seems to be gaining.
We're so early in the ramp, though.
It's impossible.
Yeah, and those two are so closely related as well.
I mean, there's so much like AI ML stuff that's going on in HPC.
They're always kind of unique workloads.
But yeah, I mean, it's an exceptionally hot topic.
And it does seem like, you know, like being able to take, you know, the lower cost ephemeral storage that is offered by cloud providers, aggregating that.
I mean, there's a good possibility for some actual real savings for customers that are deploying this at scale.
I looked for a cost for instance store on AWS.
I couldn't find one.
So it must be embedded in the EC2 instance level.
Is that how this plays out?
Yeah, that's exactly correct.
It's like no-cost storage. If you want compute, this is where you go, and this is the storage that comes along with it.
Yeah, so what we think is going to happen is in the hyperscalers, we're going to be helping shift customers' consumption from the software-defined storage products.
EBS and all sorts of things, yeah.
Right, to the instance storage products. Let's talk about AWS.
Instance storage on AWS, which runs right off the Nitro cards, is phenomenally good NVMe storage.
Ridiculously low latency.
Often hundreds of thousands of IOPS, or close to it.
The only thing it needs is resiliency.
So what does Elastic Block Store do?
It builds resilience out of ephemeral or unreliable non-resilient storage.
So we're just offering a new model where the customer can bring their own data path and buy or rent the media itself from the hyperscalers. And it seems to be, as a model, working pretty well.
Interesting stuff.
It's an interesting product, Brian.
Okay, Jason, so any last questions for Brian before we leave?
No, I just think it's a good business model.
I approve.
With my founder hat on, I'm just like, I like the business model.
Okay. Brian, anything you'd like to say to our listening audience before we close?
Well, first of all, I just wanted to thank both of you guys for having me on. This is a really
fun conversation. I hope we get a chance to do it again. Yeah, it's a pleasure.
And yeah, and to listeners, I would say definitely follow us on social media and come to KubeCon in Chicago and come hang out.
And we can talk in more detail about all the cool stuff we're doing.
One thing that I probably should have asked was, do you have a demo solution that people can download and use and just try out or something like that?
Yeah. Go to volumez.com, hit sign up, give us your email address, create a password,
and it'll go into demo mode where you're limited on capacity, but all the features are turned on.
Um, and if you want to do it together, just contact us.
Send us an email.
hey@volumez.com.
All right.
We'll do it together.
All right.
That's it for now.
Bye, Brian.
Take it easy, guys.
And bye, Ray.
Bye, Jason.
Until next time.
Next time, we will talk to another system storage technology person. Any questions you want us to ask, please let us know. And if you enjoy our podcast, tell your friends about it. Please review us on Apple Podcasts, Google Play, and Spotify, as this will help get the word out. Thank you.