Grey Beards on Systems - 167: GreyBeards talk Distributed S3 storage with Enrico Signoretti, VP Product & Partnerships, Cubbit

Episode Date: November 4, 2024

Cubbit is a Geo-distributed/Geo-fenced S3-compatible object storage where the customer supplies the hardware and Cubbit the software. Presently available in Europe only, it will be coming to the USA in 2025.

Transcript
Starting point is 00:00:00 Hey everybody, Ray Lucchesi here. Welcome to another sponsored episode of the Greybeards on Storage podcast, a show where we get Greybeards bloggers together with storage system vendors to discuss upcoming products, technologies, and trends affecting the data center today. We have with us here today a longtime friend, Enrico Signoretti, VP of Products and Partnerships at Cubbit. So Enrico, tell us a little bit about yourself and what's new with Cubbit. Hi Ray, hi Keith, thank you very much for having me today. After my long journey as an analyst going back and forth between Italy and the US, I just decided to join Cubbit last year.
Starting point is 00:00:52 And well, it was a very nice thing. So when I started working with this company, I mean, I really loved the concept behind the technology, practically building a geo-distributed object store. And maybe you remember that I'm really fond of object storage in general. Yeah. Wasn't that your last job too, before the analyst stuff? Yes. Yes. Again, I tried it with a French startup a few years back. It didn't work out, but it was a great experience, man. Great team and
Starting point is 00:01:28 just too many engineers in the same room probably, but... You mean there was more than one? Yes. Ah, that hurt. Keith. You know, when I put on my engineering helmet, engineers can't agree on
Starting point is 00:01:44 anything. Yes, they can't agree on anything. Yes, they can. It's fun conversations, but I don't know how effective we are getting work done, especially at a startup. Enrico, back to your journey. No problem. When I joined the company, the company was still a service provider. And the idea at the time was to build this geo-distributed storage that was meant to be something different compared to the AWS of this world. So where you have a gateway, traditional access point via S3,
Starting point is 00:02:22 and in the backend, all the data were fragmented and then distributed geographically with an erasure coding algorithm. What we changed from last year, we practically modified the control, the level of control that the customer has. So now that everybody can build their own geo-distributed object store. So practically, we have this SaaS backplane. So the service is provided as a SaaS service. You connect to the system and you choose the locations. So you can rent hardware from wherever you want.
Starting point is 00:03:02 I mean, from Equinix, for which we have an integration with Equinix Metal or any other service provider that provides dedicated server or co-location or anything. And you install our agent. The agent practically transforms every single piece of hardware
Starting point is 00:03:23 with some storage in a storage node. And so you can build this large network. It could be in a single data center, but could be spread across Europe, across the US, wherever you like it. And then you have also the access points that you can, in a similar way, install in Linux machines and create your network. So in the end, you have this cloud experience that is very similar to any other object storage and something in between the object storage that you have on-premises and the cloud storage service that you can buy from everybody.
Starting point is 00:04:00 But you have this complete control over the infrastructure, costs, and of course, your data. All right. Let's start unpacking some of this stuff here. You mentioned geo-distributed storage a number of times. So if I create an S3 bucket, let's say, and I start putting objects in this S3 bucket, where is the data? You said geo-distributed, so that would assume I could have data here in, let's say, if I was in the U.S., I could have data in Tennessee, I could have data in New York, I could have data in California, all for that same bucket?
Starting point is 00:04:41 Yes, practically what we do,, when data enters in one of our gateways, the first thing we do of course is encrypt everything. And if you have your own keys, you can encrypt it with your own keys. And immediately after, we apply this erasure coding mechanism that actually we are patenting because it's a quite sophisticated way to do it. And we split the file into many segments that you decide how many, and then you decide also how many additional segments you want for redundance. And then, yes, the algorithm in Rakan decides where to put the data. Of course, as I said at the beginning, you have to choose, And then, yes, the algorithm in RackN decides where to put the data.
Starting point is 00:05:28 Of course, as I said at the beginning, you have to choose, you have to install the storage node, the agent in these servers. So this means that the first step is to build the network. So you have a data center in Los Angeles, one in Seattle, one, I don't know, in New York. That's fine. And then we start putting data in these data centers. You can keep adding new data centers. You can mix and match different kinds of servers.
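As a rough illustration of the segment-plus-redundancy scheme Enrico describes (not Cubbit's actual, patent-pending algorithm; the site names and the 8+2 split are invented), placing k data segments and m redundancy segments across sites so that losing any single site stays recoverable might look like:

```python
# Hedged sketch: round-robin placement of k data + m redundancy
# segments across sites. Erasure coding can rebuild up to m lost
# segments, so no site should hold more than m of them.

def place_segments(k: int, m: int, sites: list) -> dict:
    """Spread k+m segments across sites; returns segments per site."""
    per_site = {s: 0 for s in sites}
    for i in range(k + m):
        per_site[sites[i % len(sites)]] += 1
    return per_site

sites = ["milan", "rome", "arezzo", "turin", "bologna"]
layout = place_segments(k=8, m=2, sites=sites)
# 10 segments over 5 sites: 2 per site, so losing any one site
# destroys exactly m=2 segments, which the coding can rebuild.
assert max(layout.values()) <= 2
```

The real placement also has to weigh node capacity and availability, which the round-robin above deliberately ignores.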
Starting point is 00:06:01 So we always keep track of what is available in the network and we use all the possible resources. So you mentioned the storage. So you build a geo-distributed S3 bucket or an S3 storage server, which could have multiple buckets. I'm trying to understand. So practically, it's a full-fledged cloud service. It comes out with everything. I mean, when you start one of our subscriptions, you get access to the backplane. You even have the sign-up forms for the service. You have user management.
Starting point is 00:06:41 You have multiple ID mechanisms for locking, so you can use Active Directory, for example, or OpenID or Google ID, Microsoft ID, whatever. And then, practically, you are becoming a service provider.
Starting point is 00:07:00 So, the idea is that everybody can become a service provider in 15 minutes. So I think I get the general idea of this. We saw it on the workload side with a company like Platform9, being able to put agents onto your bare metal servers or VMs so they can be worker nodes for Kubernetes. You folks are taking care of the storage equation, saying, okay, I have worker nodes that are object storage. But one of the things that, you know,
Starting point is 00:07:37 now that we're talking about storage and not the entire workload, not the, you know, the daemons and all of that, we're talking specifically the storage services. I guess the question workload, not the daemons and all of that. We're talking specifically the storage services. I guess the question is, give me a couple of primary use cases for when I would want to do this. Well, of course, all the S3 use cases are good. Most of our customers start with backup
Starting point is 00:08:04 because this is just a low-hanging fruit. I mean, everybody wants a copy. Now, especially, you want this secondary copy that you can have in a different site, well-protected. And so you do a secondary copy in the cloud. And this is a way to have your copy in the cloud without going to a cloud provider, but you being the cloud provider.
Starting point is 00:08:30 The other use cases are more about the fact that, especially in Europe, as you know, I mean, the company is an Italian company, so a European company, and everybody in Europe is very concerned about data sovereignty in general. So when you think about sovereign cloud, you want to keep control of your data. You want to know where the data is. There are a lot of regulations now all across Europe about this. So the fact that we give you this infrastructure
Starting point is 00:09:13 means also that you can expand to use cases that are more in the financial sector or in banking, etc. So we are dealing with a lot of customers. If I look at our customers, usually they are small to very large telco providers or MSPs or cloud service providers that want to compete both on price, but also on service levels that are comparable to
Starting point is 00:09:51 to hyperscalers, multi-region storage. On the other hand, you have large enterprises that maybe are building big data lakes or are in this journey about hybridization of their cloud, things like this. So second generation kind of cloud customers. And they really love the technology because think about it. The fact that you can place the gateway where you need it means that you can have all your data in your country, but then you can, the gateway has a cache for performance, but it's just the gateway. You don't have to deploy any storage. So you don't create a second silo. You don't need a copy of your data. All the data is in one place. It's accessible. So you have to set up only one security policy. So it's not that if I have a multi-cloud environment or an hybrid environment, the problem is, okay, I have some data in Google, some in Amazon, some maybe in Azure and some on-premises.
Starting point is 00:11:12 And I'm using three different storage services, each one of them with their own policies, their own tools, and, you know, sometimes even different protocols. It's a mess. So what you're creating is almost a hybrid storage system here. You can run the storage nodes just about anywhere you want. I assume the storage nodes are all software, and the gateways all software as well? Yes. Yes. Everything is software.
Starting point is 00:11:40 It's an agent that you install. Actually, it's very simple because there is a process where you, when you define your, one of these availability zones that in our lexicon is a Nexus, and you put a, and you have a wizard where you decide how many servers I have with how many disks. So, and we ask you, you know, some questions, let's say several questions. At the end, you get an Ansible playbook. And by running this playbook, you configure all the nodes at the same moment. So that even if you have hundreds of nodes, in a few minutes, you are able to start your service. So let's talk about data path a little bit.
Starting point is 00:12:26 Let's start with the most basic scenario. I want to run object storage in my local data center. But for obvious reasons, I don't want to manage the object storage. I want to outsource that to you folks. So I install the agent, run the Ansible playbooks, the agents are configured. What am I pointing to as a DNS target? Is it your gateway, and the gateway handles, you know, the communication? Like, talk to me about the data path. Okay, so the gateway is a piece of software, again, that runs on a server, physical or virtual, that is on your premises or, you know,
Starting point is 00:13:22 in your cloud environment, okay. It's not ours. And of course, the URL will probably be s3.yourcompany.com and you have full control of it. When you access it, you create the credentials from the front end, and then you have your application, you configure it, you start working with it. Every single operation is through this gateway, and the gateway contacts what we call in the back-end the coordinator, which is our
Starting point is 00:14:05 control plane. And again, the control plane could be the SaaS service that we provide. At the moment there are a couple of coordinators in Europe. We plan to expand to the US next year, in 2025. It is of course important to have the coordinator not too far from the storage nodes and the gateway nodes, just because the latency could become an issue. So it's very good, when you are in Europe, for example, to have the coordinator in Europe. But for a U.S. customer, maybe having the coordinator in Europe is too much latency. And then, so the gateway asks the coordinator where to put the data or where to retrieve the data, et cetera, et cetera.
Starting point is 00:15:00 And so the data is spread across storage nodes. Storage nodes are talking back and forth to the gateway. The gateway is talking back and forth to the coordinator to understand. It's almost like a metadata handler in this environment. It understands the location of the data and the path to the data. Is that how it works? Yes, you are totally right. In fact, our subscription model is only on the data you are managing, so the raw data you are managing, just because in the end, everything is based on the metadata. And for us, it's very simple to give you a very flat price on the amount of data that you have, you know, considering the average file size that our customers have, et cetera. We also have a special price for media files that is really, really low because, of course, you know, with media files, the ratio between data and metadata is very big.
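A toy sketch of the data path just confirmed above: the gateway fronts the S3-style requests and keeps a local cache, while the coordinator holds only metadata (segment placements) and never touches the data itself. All class and method names here are invented for illustration; real traffic would be encrypted, erasure-coded, and parallel:

```python
# Hypothetical model of gateway / coordinator / storage-node roles.

class Coordinator:
    """Control plane: tracks which node holds which segment. Metadata only."""
    def __init__(self, nodes):
        self.nodes = nodes
        self.placements = {}          # key -> [(node, segment_index), ...]

    def plan_put(self, key, n_segments):
        plan = [(self.nodes[i % len(self.nodes)], i) for i in range(n_segments)]
        self.placements[key] = plan
        return plan

    def locate(self, key):
        return self.placements[key]

class Gateway:
    """Data plane entry point with a local read cache."""
    def __init__(self, coordinator, storage):
        self.coord = coordinator
        self.storage = storage        # {node: {(key, idx): bytes}}
        self.cache = {}

    def put(self, key, segments):
        # ask the coordinator where each segment goes, then write directly
        for (node, idx), seg in zip(self.coord.plan_put(key, len(segments)), segments):
            self.storage[node][(key, idx)] = seg

    def get(self, key):
        if key not in self.cache:
            segs = [self.storage[n][(key, i)] for n, i in self.coord.locate(key)]
            self.cache[key] = b"".join(segs)
        return self.cache[key]

nodes = ["la", "seattle", "ny"]
storage = {n: {} for n in nodes}
gw = Gateway(Coordinator(nodes), storage)
gw.put("backup.tar", [b"ab", b"cd", b"ef"])
assert gw.get("backup.tar") == b"abcdef"
```

Note how losing the coordinator here would block new puts and uncached gets, which matches the failure mode Enrico describes later.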
Starting point is 00:16:04 So they're big files. Yeah, I understand. So from the metadata perspective, obviously the gateway is caching some level of metadata. And then the coordinator has the job of distributing the metadata across my various object clusters. So let's go to that next level of design. I have two data centers, so at each data center I have a gateway. Let's start at the most basic level. We have the storage clusters. We have the gateways.
Starting point is 00:16:48 And the gateways communicate with a coordinator or a set of coordinators, which is your SaaS service or some private service that we need to install. So that's the data plane, the control plane, and how information flows between them. The assumption is, if we lose connectivity to the coordinator, that should be fine in the short term; just as long as we're not making changes or doing replication from site to site, we should be locally fine. Then from a just capability perspective, I can
Starting point is 00:17:31 push policy down, new policy, and manage the actual storage from the coordinator, correct? At no point is data stored in your cloud infrastructure. There's the metadata for operating the service, and then there's the data, and I own my data. You guys manage my metadata. Okay.
Starting point is 00:17:56 Where is the metadata, Enrico? Okay. I let Keith talk, but actually there is a small difference between what he said and what really happens. It's that the coordinator manages all the metadata. The metadata is not actually distributed, because you could have metadata... If you start moving metadata around,
Starting point is 00:18:18 it becomes really, really difficult to synchronize all the metadata across a large network. Consider that some of our customers have 5,000 nodes now. And that could be challenging for the metadata. So metadata is centralized. So the coordinators manage the metadata. Everything is a cloud service anyway. So if you lose connection, it's like losing connection to your cloud service provider. It's the same way. So if you think about one of
Starting point is 00:18:50 the many cloud storage providers that you have available in the market, if you're losing connection to the cloud provider, you don't have access to the data. The same happens for us. It's true that it's very difficult for us to lose access to the metadata service because we run our service in a multi-availability zone data center, plus we have a replica in a fourth data center in another country. So it's really difficult. But except for that, and so this is why latency is important, we are not doing cross-continent swarms at the moment. This will change next year because we will have additional coordinators. So you can think about coordinators as our regions. So we will have additional regions in the US and potentially in Asia.
Starting point is 00:19:55 So let's talk about, can you have multiple coordinators or is there only one coordinator across multiple zones? No, you have only one coordinator per zone. Okay. You have multiple coordinators in the sense that there is high availability. Right. Yes. But it's not... there's a single coordinator managing everything. There's a single destination. So for you folks, that's invisible. The redundancy is invisible to us, but there's one coordinator. The one thing that I want to highlight on your explanation, thanks for the clarification, was that this is why
Starting point is 00:20:35 low latency to the coordinator is important. You want to be in a coordinator that's relatively close to where your gateways are. Yes. Also, we have other mechanisms to cheat a little bit with latency. So everything happens in parallel in the backend, of course. So when the gateway, for example, starts the operation, it has a very optimized query. So it asks, for example, more nodes than are really necessary, and it manages the data placement accordingly, all in parallel. So when we achieve a number of segments saved that is safe enough compared to the data protection level, we give the okay on the front end. So for example, this eliminates the risk of
Starting point is 00:21:25 some nodes being slower than others or we have a cache in the single gateway so that everything that you read frequently is already cached locally that minimizes
Starting point is 00:21:42 their access to the network. Also, optimize communication in the backend. So there are several other things that we do. Also, we start doing some operations anyway. And then if we get some errors or some acknowledgement, then we proceed with the next steps. But actually, we try to anticipate some of the answers from the coordinator
Starting point is 00:22:09 or the front end so that it helps a lot to work with even small files sometimes where the risk is to have the impact of latency worsening the entire experience. Yeah, yeah. So you mentioned there's a single gateway as well as a single coordinator, although it could be multi-AZ as well?
Starting point is 00:22:39 You can have as many gateways as you want. In fact, one of the things that we really like, actually our customers like, is that they can have a gateway for each single tenant or even having edge gateways that they can deploy remotely and take advantage of the cache, for example, or different environments. We have a customer that has three data centers and the swarm is in the three data centers. And then they have one gateway in AWS, one gateway in their data centers,
Starting point is 00:23:16 and a bunch of gateways in the edge location. So they collect logs at the edge locations. Everything runs encrypted inside the network. So everything moves totally encrypted. And then data stays in the country. They access the data from AWS to do some analytics. They have an application in AWS. I mean, the compute power is cheap.
Starting point is 00:23:42 It's storage that is the problem. And then they have another analytics application that uses the same data sets to do other stuff that runs in their data center. But again, it's the same data lake in the end. And this application accesses it from different environments. So where's the storage in that solution? Is it storage sitting in the data center?
Starting point is 00:24:08 Is the storage sitting in AWS? It's the data center that I mentioned. They are one in... This customer is actually in Italy. It's one in Milan, one in Rome, and the other one is in a small town called Arezzo. So it's one petabyte of storage. It's not that much, but actually it's very
Starting point is 00:24:27 compelling as a case history, because they were coming from AWS. Yeah. So they told us, I mean, we have some TCO calculations, but sometimes you do the math and you say, well, this is
Starting point is 00:24:43 marketing and stuff, but actually they came back saying, we are saving 80%, real, really 80%, from what we were spending before, with all the movement of data and all the stuff that they are doing. So you don't charge for ingress or egress or anything like that. You just charge for storage under management.
Starting point is 00:25:01 Yes. Consider that, you know, now I'm used to the European market, okay, so you can find hardware from most of the service providers, like OVH and many others, that is around 1.42 euro per terabyte per month raw. So meaning that you buy a dedicated server that includes bandwidth, includes service on the machine, everything, firewall, and you pay as little as that. And then on top, you put our license and you can be as competitive as the cheapest of the storage services out there, but with the geo-distribution included. Usually when you go to one of these cheap cloud storage services, you get everything from a single data center.
Starting point is 00:26:07 And if you want a remote copy, then you need to pay double the price. So even if you start at six, seven euro, then it becomes easily 14. Plus, sometimes, hidden fees and other stuff. With us, it's just the fee that you see there. It's already geo-distributed. There are all the options that I told you about, edge gateways and stuff like that. So it could be very inexpensive. But the most interesting project that we have at the moment
Starting point is 00:26:40 is at Elko. They have 16 data centers. And if you go in 16 data centers, actually we are starting with 10, but it doesn't change much. If you think about it, you put 10 data centers and you want to sustain like a couple of data center failures, meaning major failures.
Starting point is 00:26:58 So two data centers down, you still have eight data centers. And you start doing eight plus two, meaning that you have a very, very small data protection overhead, but you have a massive... What is the probability that you are going to lose two entire data centers? Pretty low, hopefully. It depends on the environment, I guess, right? That would be a very bad day. Yeah.
Starting point is 00:27:27 And also, consider the power consumption. If you think about a traditional technology where you have to replicate the data between data centers, and you have a business continuity scenario plus disaster recovery, it's three copies. So two close to each other and one
Starting point is 00:27:43 in a remote site. It's very expensive. So this adds up. If you also add local data protection on top of it, you are going to have between 4.5 to 5 times the initial storage that you needed to save. Each single terabyte becomes 5 terabytes. And the footprint is massive. With us, you can have 1.6, 1.8 in a similar scenario. And so if you have 1.8 instead of five,
Starting point is 00:28:17 consider how much storage you are saving. So meaning less servers, less RAM, less CPU. So the nodes cost less. But it's not only that. It's that the power consumption and the CO2 footprint, everything. Yeah, sustainability and all that. So let's talk about the erasure coding. Can you configure the erasure coding pretty much at will?
Starting point is 00:28:42 So let's say in this configuration with 10 data servers, I want to be able to handle two data centers going out plus maybe a storage node going out, which would require, you know... Yes, yes. In fact, this is what we are pretending practically. So we have... Seven plus three, something like that.
Starting point is 00:29:02 We have... So independent... You can do in two levels, okay, inside the data center and at the geographic level, first of all, okay, so it's a nested ratio coding, put it this way. So you decide first
Starting point is 00:29:16 how many data centers you have and how many you can lose of these data centers. And then inside the single data center, you decide the second level of data protection. And this is just one redundancy class. And then on top of it, you can add additional redundancy classes and you can decide policies where you see, okay, if it's a PDF file, for example,
Starting point is 00:29:43 I want this redundancy class. If this is a movie, I want this other redundancy class. So you can play with different setups and different costs of storing data. Okay. So all the combinations are possible. And of course, because in the end, it's all about metadata, right? So we know the metadata, and we know what you can do with your files. So you can decide, for example, well, this level of erasure coding is very inefficient with 32-kilobyte files.
Starting point is 00:30:18 Okay, so just change the level of data protection for these small files and you can do it. And all that's done through metadata supplied with the put request or something like that? Yes, practically when you put the data, you have a put, you already know how big is the file and of course you have all the other metadata that builds up on top of it. I like this nested erasure code. I'm not sure I've ever seen that before. So that's very interesting. This is why we are pretending it. Yeah, yeah, yeah.
Starting point is 00:30:53 I understand that. So performance. You mentioned performance. So the gateways have cache. I assume that's something that they could configure if they want more cache or less or something like that. It's totally configurable. There are no limits on the amount of cache.
Starting point is 00:31:08 Of course, it means that more cache, more expensive. So you have to, we have some customers that do deep archiving. They don't do cache at all. And other customers that are heavy on the cache because they want to have data always, hot data always available in the cache, of course, to minimize latency. So especially for one customer, that's a big one,
Starting point is 00:31:35 we are developing a feature that is cache preheating. So practically you have your gateway, and you can decide, with a sort of cron job, to do a metadata query at a specific time and then get the cache preheated for some workloads. So, for example, I want every Monday morning all the data that was produced last week in the cache, because I have to run a batch job of some sort. Everything is already close to the compute. Or maybe it's Christmas and I want all the movies with the word Christmas in the title downloaded.
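The scheduled metadata query Enrico describes might look something like this sketch, with the cron scheduling omitted and all names hypothetical (this is a feature still in development, so the shape below is purely illustrative):

```python
# Hedged sketch of cache preheating: a metadata query selects object
# keys by pattern, and the gateway pulls each one into its local cache
# ahead of a batch job.

import fnmatch

def preheat(gateway_cache: dict, keys, pattern: str, fetch):
    """Warm the cache with every object whose key matches `pattern`."""
    for key in keys:
        if fnmatch.fnmatch(key.lower(), pattern):
            gateway_cache[key] = fetch(key)   # pull object to the edge
    return gateway_cache

keys = ["christmas_carol.mp4", "q3_report.pdf", "white_christmas.mkv"]
cache = preheat({}, keys, "*christmas*", fetch=lambda k: f"<{k}>")
assert sorted(cache) == ["christmas_carol.mp4", "white_christmas.mkv"]
```

Running a different pattern per edge gateway gives the per-workload behavior mentioned just below.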
Starting point is 00:32:26 I mean, it's not a CDN. No, it's not a CDN. I got it. But you are providing almost cache control. It's not. So you're really preloading the cache with a portion of the data based on some sort of a metadata query. Is that how it's working? Say again, sorry?
Starting point is 00:32:47 You're preloading the cache? Yes. So in this case, for Christmas data, it's anybody that, any film that has Christmas in it, you would preload the cache with so that it would be more responsive during that time. Yes, exactly. And you can do it for each single edge gateway. So if you have 10
Starting point is 00:33:07 different edge gateways, you can run the same query, but also different queries. So depending on the workloads, depending on, you know, your business needs in the end. So is that data then exportable so I can run analytics in like a different, as I'm doing prep for AI training or for RAG, can I export that into another system? Well, the metadata, no, okay. But there is, I mean, this is one of the things that we are developing right now, so all the metadata tagging,
Starting point is 00:33:45 all the stuff is really cool. And of course, by adding a functions on top of it, I'm not saying anything here, but you know, you can think where I'm going, right? So every time you put operation or, you know, some metadata update, whatever, with a function, you can potentially do some metadata augmentation. And then when you have augmented metadata,
Starting point is 00:34:12 you can start to run queries to get the specific data that you need. Maybe you need a set of data to train an AI and you need that specific data, you can do it. So there's some type of message bus associated with this that I can trigger off of? Or you can request
Starting point is 00:34:30 services from. Yes, but it's internal for us. What I was thinking is the function will come to the gateway. So what would be really interesting because this is one of the, as you folks mature, what would be really interesting is this ability, and this is one of the big gaps between, you know, the cloud providers and private object store solutions in general, is this ability to do, not queries, but alerts, message bus alerts off of any,
Starting point is 00:35:05 when an object is written or object is received, et cetera, so that now I can create functions myself. You know, I can go to open source functions as a service platform, like OpenFast and, you know, and trigger OpenFast type functions off of services. This is where the cloud providers are really locking folks in. So it's like Lambda. I can't really talk too much about this, but keep an eye on us.
Starting point is 00:35:42 All right. So let's talk about sizes of the clusters and stuff like that. So the storage servers don't have to necessarily be in one location. Obviously, you'd want them to be in multiple locations for geo-distribution. Yes. So with our technology, you can start as small as three servers. They could be three Raspberry Pis, to be honest. So the first implementation of Cubbit was on machines that were the size of Raspberry Pis with drives attached to them.
Starting point is 00:36:22 And so we still support ARM as well as x86, of course. And you can start from this. So one single location, three servers, and three drives. That's it. And then, I mean,
Starting point is 00:36:42 we have customers now with data center with 40, 40, 12 drive systems and in multiple locations. I mean, you can really configure the system in the way you want it. Again, because we manage all the metadata, also the infrastructure. So the ability of each hard drive or each single hard drive to receive the data, we can then do the data placement
Starting point is 00:37:11 in the best way possible. So this also means that when you start, potentially you can start with these three servers that I mentioned, and then you start growing. And after a while, your service is successful, so you keep adding hardware, but you need bigger hardware. In the meantime, a new generation of CPUs came out, etc.
Starting point is 00:37:34 You can mix- Storage and stuff like that. Yeah. Yeah. So you can mix every sort of hardware so that we don't really care about the kind of hardware. To be honest, many of our customers actually start with a recycled hardware. They commission that hardware from something else
Starting point is 00:37:50 because at the very beginning it's just okay, let's try this technology. It's inexpensive to start with a bunch of old servers. Then when they realize that it's good and it works for them,
Starting point is 00:38:06 so they keep adding hardware. But they start to decommission the whole hardware sometimes. But in other cases, they just let the hardware die. So the lifespan of the cluster is really long. So usually you buy hardware for a three-year lifespan with a three-year contract. But actually, with us, you can keep the hardware. And when the disks start failing, you just add new hardware. And then when the level of that single server is below, let's say, 40%, 50%,
Starting point is 00:38:39 then you say, okay, this is no longer efficient to keep this hardware running. So remove all the hardware. We migrate the single nodes, the single hard drives, and then you're good to go with the new hardware. Go ahead. No, no, it's just to say that it's really inexpensive to manage an infrastructure like this. So when you add a server, let's say,
Starting point is 00:39:05 and you've got humongous disks on it and stuff like that, are you going to spread the data around immediately or only will you use that for new data that comes into the system? So we don't usually do rebalancing. It doesn't make a lot of sense because you have the cache in the front end that manages the performance. So doing the rebalancing is it doesn't make a lot of sense because you have the cash in the front end that manages
Starting point is 00:39:25 the performance. So doing the rebalancing is a lot of effort, especially at the geographic level. So you don't really need to do that. We can do it. We can migrate from a redundancy
Starting point is 00:39:41 class to a new redundancy class that keeps, you know, it's the same class but with that new hardware accounted for, etc. You can do some data movements
Starting point is 00:39:58 but it doesn't really make any sense. I mean, if it's not a lot of data, you can do it. Yeah, yeah, yeah. No, I understand. Okay. No, I'm good. I'm good with that, Enrico.
Starting point is 00:40:11 I was going to say something about: can a customer just supply storage servers and have other customers come in and use that data? I guess it's really typical to have both the storage services and the gateway that accesses it in the same customer. So one of the problems with these services is that you don't really know who is managing your hardware, okay? Yeah, yeah. And the problem is what happens. I mean, sometimes you have these very nice services and they work very well, okay? Maybe spread across many countries, but maybe some countries are not of your liking,
Starting point is 00:40:52 but that's another problem, okay? We are back to the data sovereignty issues. But even if you are okay with somebody renting, you know, sorry, lending you some hardware, that's fine. But the problem is, if you don't know who's lending you the hardware,
Starting point is 00:41:17 then maybe it's a guy in a basement playing with his PC, and he has some free space and he lends you the storage. And then one day a new game comes out, and he needs space, and he erases everything. So yes, there are multiple copies and there is rebuilding in the backend, et cetera, et cetera. But I mean, as an enterprise, I don't like knowing that. Right, right. So most enterprise customers would provide their own storage, plus provide their own gateways and clients to those gateways.
Starting point is 00:41:56 So all that would be within the infrastructure of one customer, wherever it lies. It could be in the cloud. It could be anywhere, right? I mean, as far as he's concerned, he could have any of his infrastructure be deployed as storage services or gateways. Yes. Yes. And you can mix some of your on-premises stuff with cloud stuff, and it works. Yeah. Yeah.
Starting point is 00:42:16 Do you do any preferential access based on access speed? I mean, so if I'm a big shop, I've got my own data center with very fast servers, very fast networking. Plus I've got some other data centers out in the boonies that don't have the high-speed networking and high-speed storage, but I am using storage in those data centers for redundancy and things of that nature. Yeah, so we have two ways to do it. One is we have an internal ranking of the nodes, so we know the nodes that work better than
Starting point is 00:42:51 others and we choose the best nodes when it's possible. And the other thing is that you can build two redundancy classes. One with the good nodes and another one with the bad nodes, and then decide where you place the data more granularly, let's say.
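The node-ranking and two-class placement Enrico describes could be sketched roughly like this. To be clear, Cubbit's actual ranking and placement logic is not public, so the scoring, class names, and node data below are all invented for illustration:

```python
# Illustrative sketch only: the health "score", the fast/slow split, and the
# A/B tier mapping are hypothetical stand-ins for Cubbit's internal ranking.

def rank_nodes(nodes):
    """Order nodes by a simple health/speed score (higher is better)."""
    return sorted(nodes, key=lambda n: n["score"], reverse=True)

def build_redundancy_classes(nodes, split=0.5):
    """Partition ranked nodes into a 'fast' class and a 'slow' class."""
    ranked = rank_nodes(nodes)
    cut = max(1, int(len(ranked) * split))
    return {"fast": ranked[:cut], "slow": ranked[cut:]}

def place_object(tier, classes):
    """A-level data goes to the fast class, B-level to the slow one."""
    return classes["fast"] if tier == "A" else classes["slow"]

nodes = [
    {"name": "dc1-fast", "score": 95},
    {"name": "dc2-fast", "score": 90},
    {"name": "edge-1", "score": 40},
    {"name": "edge-2", "score": 35},
]
classes = build_redundancy_classes(nodes)
targets = place_object("A", classes)  # A-level data lands on the best nodes
```

The point of the two-class setup is exactly what comes next in the conversation: the placement decision is made per redundancy class, not per individual node.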
Starting point is 00:43:14 Right, based on which cluster you decide to put the data in. Yes, and you have some A-level data and B-level data. Right, right. I noticed on your website, you're integrated with some of the backup providers. I saw Veeam and stuff like that. In that case, you're a target for their backups. Is that how I read that?
Starting point is 00:43:40 Yes, most of the solutions, backup solutions, use us as a secondary copy. When you put the gateway on-premises, we can be very fast, actually, because you have the cache that gives you a boost in performance. So you write very quickly data to the gateway. Then maybe you don't have enough bandwidth to go that fast to the backend. But actually, the gateway acts as a buffer.
Starting point is 00:44:13 So you write quickly and you finish your job quickly. Maybe your cache is also redundant, maybe it's the right kind, SSDs, et cetera, et cetera. And then you just write to the backend at a lower speed. So it's possible. Do I then have the same or similar controls as the cloud providers when it comes to immutable data, so what can and cannot be deleted? Yes. I'm considering using this for, like, vaulting of backups. So we support S3 object lock, both in governance mode and compliance mode. So, I mean, it works pretty well for everybody. So yes.
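Since Cubbit is S3-compatible and supports Object Lock in both modes, a backup vault upload could use the standard S3 API parameters. A minimal sketch, where the bucket, key, endpoint, and retention period are made up, but the keyword names match the standard S3 PutObject API as used by boto3:

```python
# Sketch of building S3 Object Lock parameters for a backup vault write.
# The helper is hypothetical; only the ObjectLock* keyword names come from
# the standard S3 API.
from datetime import datetime, timedelta, timezone

def object_lock_args(mode: str, retain_days: int) -> dict:
    """Build the Object Lock parameters for an S3 put_object call.

    mode: "GOVERNANCE" (privileged users may shorten or lift the lock)
          or "COMPLIANCE" (nobody can delete until the date passes).
    """
    if mode not in ("GOVERNANCE", "COMPLIANCE"):
        raise ValueError("mode must be GOVERNANCE or COMPLIANCE")
    retain_until = datetime.now(timezone.utc) + timedelta(days=retain_days)
    return {
        "ObjectLockMode": mode,
        "ObjectLockRetainUntilDate": retain_until,
    }

# Usage with boto3 against an S3-compatible gateway (endpoint is hypothetical):
#   s3 = boto3.client("s3", endpoint_url="https://gateway.example")
#   s3.put_object(Bucket="backups", Key="veeam/job1.vbk",
#                 Body=data, **object_lock_args("COMPLIANCE", 30))
args = object_lock_args("COMPLIANCE", 30)
```

Compliance mode is the one that matters for ransomware-proof vaulting: even the account owner cannot delete the object before the retain-until date.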
Starting point is 00:44:56 Right, right. Are you guys strongly consistent or eventually consistent? I mean, how does that play out in this environment? This is a great question. Until last year, we didn't have strong consistency. But this year we developed a new algorithm that allows us to practically check the local cache first, check the swarm second, and then, if we have the metadata but we don't have the updated data, check other gateways to see where the data is. So gateways are configured in a sort of pool, and you know that all these gateways have the same rules. So, say maybe I'm writing something in Seattle, and then I need to read it a few milliseconds later in LA.
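The three-step lookup order just described (local cache, then swarm, then peer gateways) can be sketched in a few lines. This only mirrors the order from the conversation; Cubbit's real protocol is not public, and the data structures here are toy stand-ins:

```python
# Hedged sketch of the cache -> swarm -> peer-gateway read path. Dicts stand
# in for the local cache, the geo-distributed swarm, and remote gateways.

def read_object(key, local_cache, swarm, peer_gateways):
    """Return (value, source) following the cache -> swarm -> peer order."""
    if key in local_cache:            # fastest path: local gateway hit
        return local_cache[key], "cache"
    if key in swarm:                  # geo-distributed backend store
        return swarm[key], "swarm"
    for gw in peer_gateways:          # worst case: fetch from the gateway
        if key in gw:                 # that took the original write
            return gw[key], "peer"
    raise KeyError(key)

# Write lands in Seattle; a read from LA a few ms later has no local cache
# and the swarm hasn't absorbed the object yet, so it falls through to the
# Seattle gateway -- slower, but the read is still strongly consistent.
seattle_gw = {"obj1": b"fresh-write"}
value, source = read_object("obj1", local_cache={}, swarm={},
                            peer_gateways=[seattle_gw])
```

The worst case costs extra round trips, but the fallback guarantees the read never returns stale data, which is the strong-consistency claim.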
Starting point is 00:45:54 So then what I'm going to do is, in LA, I don't have anything cached, of course. I check, and if you have a high-speed connection, potentially it's already in the swarm that is in the U.S. If it's not, we go back to the initial gateway and take the data directly from there. It's not the fastest way to get data back, because in the end it's three hops. But you are sure that even in the worst-case scenario we are strongly consistent. So that actually brings up a question I didn't ask that I wanted to ask: are you folks facilitating the network access, or is that something that clients have to handle on their own? So the network access is theoretically not our problem, and we have specifications for the amount of bandwidth that
Starting point is 00:46:45 we need for the storage nodes, and what you can expect on the gateway depending on the resources that you give us. But if you need a certain amount of performance, I can tell you the amount of hardware and networking needed to make it work. Yeah, but from the logical connectivity side, I'm responsible for making sure that one gateway can hit another gateway, and that that gateway can access the internet. And your level of troubleshooting help, from a responsibility perspective, is to make sure that your coordinators are reachable via the Internet.
Starting point is 00:47:28 And then I'm responsible for the physical and logical connectivity outside of that. Yes, we have all the metrics and everything exposed, both internally with the APIs and externally within the user interface. So you can see if something is going wrong, or there is a bottleneck, or something is not really performing the way you expect it to perform. So yes, I mean, there are all the tools to make, you know, your life simple. Yeah. Yeah. Yeah. Well, this has been great. Uh, Keith,
Starting point is 00:48:11 any last questions for Enrico before we close? Uh, you know what, this is really interesting. I don't have any closing questions. This is actually a really creative solution. I'm interested to one day peel back the layers on it. Yeah, yeah, yeah. Enrico, is there anything you'd like to say to our listening audience before we close? Well, actually, there are a couple of things.
Starting point is 00:48:36 One is that we are still very active in Europe. We are expanding very quickly in all major European countries. And in fact, we will be, for example, at the Cloud Expo in Paris in November. So if you are a European listener, that could be a good event to catch up. We are also going to Cloud Expo in London and other events. And next year, you can expect to see more of us in the U.S. as well. That's great. I'd like to see how this plays out
Starting point is 00:49:14 in the rest of the world. Europe has GDPR requirements that are, I would say, more stringent than the U.S. and stuff like that. So I could see how, and we didn't really talk about it, but you could geofence the data as well. You could say that this data is only going to reside in these five data centers and not the rest of the 20 data centers that I have, and stuff like that, right? Yes, in fact.
Starting point is 00:49:40 I mean, the biggest customer we have at the moment is a defense company. And they are using us just for this. I mean, they are building services for other defense companies. They have a very huge cyber-warfare division, and they are using us for, you know, some traditional use cases, but also to build their own cloud services for other defense companies. So it's pretty cool, because the attack surface is minimal: if you attack a server, you don't find anything except segments of encrypted data, and there is no way to go back to the source to rebuild the entire information. And at the same time, I mean, you can really say, okay, all this data stays in these three data centers
Starting point is 00:50:29 in the UK or in the US, and it doesn't have to move. Also, we are developing with this customer another feature that changes the level of data protection depending on the level of crisis. For example, I mean, yes, maybe this data center is in a country where there is a risk of an attack. I want to move only the data that is in that data center
Starting point is 00:50:52 in another data center. Or I want to change the security level to a different level. Crypto algorithm? Stuff like that? Not a crypto algorithm, but you can change the level of data protection. So maybe you are
Starting point is 00:51:07 in three data centers now, but there is a risk of a war, for example. And you say, okay, so let's make it four. Then you do all the data movements to distribute the data in four. Right, right. Well, Enrico, this has been great.
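The crisis-driven protection change Enrico outlines, widening the same data from three data centers to four, could look roughly like this. The feature is still in development and its interface is not public, so the site names, crisis levels, and baseline here are all invented:

```python
# Hypothetical sketch of crisis-level-driven redundancy: a higher crisis
# level spreads copies of the data over more data centers.

def protection_plan(data_centers, crisis_level):
    """Pick how many (and which) data centers hold a copy.

    crisis_level: "normal" keeps the baseline of three sites;
                  "elevated" adds a fourth site outside the risk zone.
    """
    baseline = 3
    copies = baseline + (1 if crisis_level == "elevated" else 0)
    if copies > len(data_centers):
        raise ValueError("not enough data centers for this protection level")
    return data_centers[:copies]

sites = ["uk-1", "uk-2", "us-1", "us-2"]
plan = protection_plan(sites, "elevated")  # war risk: distribute to four sites
```

Raising the level triggers the data movements Enrico mentions: the system redistributes existing objects so the new, wider redundancy class is satisfied.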
Starting point is 00:51:23 Thanks again for being on our show today. Thank you, guys. It was my pleasure. And check out www.cubbit.io. And for anything else, I mean, you can find me on social media. I'm very active on LinkedIn. Right, right. That's it for now.
Starting point is 00:51:44 Bye, Enrico. Bye, Keith. Until next time. Next time, we will talk to another system storage technology person. Any questions you want us to ask, please let us know. And if you enjoy our podcast, tell your friends about it.
Starting point is 00:51:59 Please review us on Apple Podcasts, Google Play, and Spotify, as this will help get the word out.
