GreyBeards on Storage - 167: GreyBeards talk Distributed S3 storage with Enrico Signoretti, VP Product & Partnerships, Cubbit
Episode Date: November 4, 2024
Cubbit is a geo-distributed/geo-fenced S3-compatible object storage where the customer supplies the hardware and Cubbit the software. Presently available in Europe only, it will be coming to the USA in 2025.
Transcript
Hey everybody, Ray Lucchesi here.
Welcome to another sponsored episode of the Greybeards on Storage podcast,
a show where we get Greybeards bloggers together with storage and system vendors
to discuss upcoming products, technologies, and trends affecting the data center today.
We have with us here today a longtime friend, Enrico Signoretti, VP of Products and Partnerships at Cubbit.
So Enrico, tell us a little bit about yourself and what's new with Cubbit.
Hi Ray, hi Keith, thank you very much for having me today. After my long journey as an analyst, going back and forth between Italy and the US, I just decided
to join Cubbit last year.
And well, it was a very nice thing.
So when I started working with this company, I mean, I really love the concept behind the
technology, practically building a geo-distributed object store.
And maybe you remember that I'm really fond of object storage in general.
Yeah. Wasn't that your last job too, before the analyst stuff?
Yes. Yes. Again, I tried it with a French startup a few years back.
It didn't work out, but it was a great experience, man. Great team, and just too many engineers in the same room probably, but...
You mean there was more than one?
Yes.
Ah, that hurt.
Keith.
You know, when I put on my engineering helmet, engineers can't agree on anything.
Yes, they can't agree on anything.
Yes, they can.
It's fun conversations, but I don't know how effective we are at getting work done, especially at a startup.
Enrico, back to your journey.
No problem.
When I joined the company, the company was still a service provider. And the idea at the time was to build this geo-distributed storage
that was meant to be something different compared to the AWS of this world.
So where you have a gateway, traditional access point via S3,
and in the backend, all the data were fragmented
and then distributed geographically with an erasure coding algorithm.
What we changed from last year is that we practically modified the level of control that the customer has. So now everybody can build their own geo-distributed object store.
So practically, we have this SaaS backplane.
So the service is provided as a SaaS service.
You connect to the system and you choose the locations.
So you can rent hardware from wherever you want.
I mean, from Equinix, for which we have an integration
with Equinix Metal
or any other service provider
that provides dedicated servers
or co-location or anything.
And you install our agent.
The agent practically transforms
every single piece of hardware
with some storage into a storage node.
And so you can build this large network.
It could be in a single data center, but could be spread across Europe, across the US, wherever you like it.
And then you have also the access points that you can, in a similar way, install in Linux machines and create your network. So in the end, you have this cloud experience
that is very similar to any other object storage
and something in between the object storage
that you have on-premises and the cloud storage service
that you can buy from everybody.
But you have this complete control over the infrastructure,
costs, and of course, your data.
All right.
Let's start unpacking some of this stuff here.
You mentioned geo-distributed storage a number of times.
So if I create an S3 bucket, let's say, and I start putting objects in this S3 bucket, where is the data?
You said geo-distributed, so that would assume I could have data here in, let's say, if I was in the U.S.,
I could have data in Tennessee, I could have data in New York, I could have data in California, all for that same bucket?
Yes, practically what we do, when data enters one of our
gateways, the first thing we do of course is encrypt everything. And if you have
your own keys, you can encrypt it with your own keys. And immediately after, we
apply this erasure coding mechanism that actually we are patenting because it's a quite sophisticated way to do it.
And we split the file into many segments that you decide how many,
and then you decide also how many additional segments you want for redundancy.
And then, yes, the algorithm in the backend decides where to put the data.
Of course, as I said at the beginning, you have to choose,
you have to install the storage node, the agent in these servers. So this means that the first step is to build the network.
So you have a data center in Los Angeles,
one in Seattle, one, I don't know, in New York.
That's fine.
And then we start putting data in these data centers.
You can keep adding new data centers.
You can mix and match different kinds of servers.
So we always keep track of what is available in the network and we use all the possible resources.
So, you mentioned the storage. So you build a geo-distributed S3 bucket, or an S3 storage server, which could have multiple buckets? I'm trying to understand.
So practically, it's a full-fledged cloud service.
It comes out with everything.
I mean, when you start one of our subscriptions,
you get access to the backplane.
You even have the sign-up forms for the service.
You have user management.
You have multiple identity mechanisms for login,
so you can use Active Directory, for example,
or OpenID or Google ID, Microsoft ID, whatever.
And then, practically, you are becoming a service provider.
So, the idea is that everybody can become a service provider in 15 minutes.
So I think I get the general idea of this.
We saw it on the workload side with a company like Platform9 and being able to put agents onto your bare metal servers or VMs, and they can be worker nodes for Kubernetes.
You folks are taking care of the storage equation of saying, okay, I have worker nodes
that are object storage.
But one of the things that, you know,
now that we're talking about storage
and not the entire workload,
not the, you know, the daemons and all of that,
we're talking specifically about the storage services.
I guess the question is, give me a couple of primary use cases
for when I would want to do this.
Well, of course, all the S3 use cases are good.
Most of our customers start with backup
because this is just a low-hanging fruit.
I mean, everybody wants a copy.
Now, especially, you want this secondary copy
that you can have in a different site, well-protected.
And so you do a secondary copy in the cloud.
And this is a way to have your copy in the cloud
without going to a cloud provider,
but you being the cloud provider.
The second, the other use cases are more about the fact that,
especially in Europe, as you know,
I mean, the company is an Italian company,
so a European company, and everybody in Europe is very concerned about data sovereignty in general.
So when you think about sovereign cloud, you want to keep control on your data.
You want to know where the data is.
There are a lot of regulations now all across Europe about this.
So the fact that we give you this infrastructure
means also that you can expand to use cases
that are more in the financial sector or in the banking, etc.
So we are dealing with a lot of customers.
If I look at our customers, usually
they are small to very large telco providers
or MSPs or cloud service providers
that want to compete both on price,
but also on service levels that are comparable
to hyperscalers' multi-region storage.
On the other hand, you have large enterprises that maybe are building big data lakes or are on this journey about hybridization of their cloud, things
like this.
So second generation kind of cloud customers.
And they really love the technology because think about it.
The fact that you can place the gateway where you need it means that you can have all your data in your country. The gateway has a cache for performance, but it's just the gateway.
You don't have to deploy any storage. So you don't create a second silo. You don't need a copy of your data. All the data is in one place. It's accessible. So you have to set up only one
security policy. It's not like if I have a multi-cloud environment or a hybrid environment, where the problem is, okay, I have some data in Google, some in Amazon, some maybe in Azure and some on-premises.
And I'm using three different storage services, each one of them with their policy, their tools, their, you know, sometimes even different protocols.
It's a mess.
So what you're creating is almost a hybrid storage system here that can run.
You can run the storage nodes just about anywhere you want.
I assume the storage nodes, all software and the gateways, all software as well.
Yes.
Yes.
Everything is software.
It's an agent that you install. Actually, it's very simple, because there is a process where, when you define
one of these availability zones, which in our lexicon is a Nexus, you
have a wizard where you decide how many servers you have and with how many disks.
So, and we ask you, you know, some questions, let's say several questions.
At the end, you get an Ansible playbook.
And by running this playbook, you configure all the nodes at the same moment.
So that even if you have hundreds of nodes, in a few minutes, you are able to start your service.
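For readers who want to picture that step, here is a minimal sketch of running the generated playbook against its inventory; the file names are placeholders, since the wizard produces the real artifacts, and it assumes Ansible is installed on the machine running it.

```python
# Hypothetical sketch: run the wizard-generated playbook against the generated
# inventory. File names are illustrative, not the real artifact names.
import subprocess

subprocess.run(
    ["ansible-playbook", "-i", "inventory.ini", "cubbit-nodes.yml"],
    check=True,  # raise if any node fails to configure
)
```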
So let's talk about data path a little bit.
So let's say that I do, let's come with the most basic scenario.
I want to run object storage in my local data center.
But for obvious reasons, I don't want to manage the object storage.
I want to outsource that to you folks. So I install the agent, run the Ansible playbooks, the agents are configured. What am I pointing to
as a DNS target? Is it your gateway, and the gateway handles that, you know, the communication?
Like, talk to me about the data path. Okay, so the gateway is a piece of software,
again, that runs in a server, physical or virtual, that is in your premises or in, you know,
in your cloud environment, okay. It's not ours.
And of course, the URL will probably be s3.yourcompany.com,
and you have full control of it. When you access it, of course, you
access it from the front end, so you create the credentials,
and then you have all your
applications, you configure them, you start working with it.
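As a rough illustration of that client setup, a standard S3 client such as boto3 can simply be pointed at the customer-owned gateway endpoint; the endpoint, credentials, and bucket name below are placeholders.

```python
# Minimal sketch: any standard S3 client pointed at your own gateway endpoint.
# Endpoint, credentials and bucket name are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.yourcompany.com",   # the gateway you deployed
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

s3.create_bucket(Bucket="backups")
s3.put_object(Bucket="backups", Key="hello.txt", Body=b"hello, geo-distributed world")
print(s3.get_object(Bucket="backups", Key="hello.txt")["Body"].read())
```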
Every single operation is through this gateway and the gateway contacts
what we call in the back-end the coordinator which is our
control plane and again the control plane could be the SaaS service that we provide. At the moment
there are a couple of coordinators in Europe. We plan to move to the US next year, in 2025.
It is of course important to have the coordinator not too far from the storage nodes and the gateway nodes, just because the latency could become an issue.
So it's very good when you are in Europe, for example, to have the coordinator in Europe.
But for a U.S. customer, maybe having the coordinator in Europe is too much of a
latency.
And then, so the gateway asks the coordinator where to put the data or where to retrieve
the data and et cetera, et cetera.
And so the data is spread across storage nodes.
Storage nodes are talking back and forth to the gateway. The gateway is talking back and forth to the coordinator to understand. It's like a metadata handler, almost, in this environment.
It understands the location of the data and the path to the data. Is that how it works?
Yes, you are totally right. In fact, our
subscription model is only on the data you are managing. So the raw data you are managing,
just because in the end, everything is based on the metadata. And for us, it's very simple to give
you a very flat price on the amount of data that you have, you know,
considering the average file size that our customers have, et cetera.
We also have a special price for media files that is really, really low because, of course, you know, in media files the ratio between data and metadata is very big.
So it's big files.
Yeah, I understand.
So from the metadata perspective,
so obviously the gateway is caching some level of metadata.
And then the coordinator has, you know, it has the job of distributing the metadata
across my various object clusters. So let's say, you know, let's go to that next level of design. I
have two data centers, so at each data center I have a gateway. So let's start at the most basic level. We have the storage clusters.
We have the gateways.
And the gateways communicate with a coordinator or a set of coordinators, which is your SaaS service or some private service that we need to install.
So this keeps the data plane and the control plane separate,
and that's how we're flowing information in between.
If we lose, the assumption is if we lose connectivity to the coordinator, that should be fine in the short term,
just as long as we're not making changes
or doing replication from site to site,
we should be locally fine.
Then from just a capability perspective, I can
push policy down, new policy, and do my
management of the actual storage from the coordinator, correct?
At no point is data stored
in your cloud infrastructure.
That's all just, there's the metadata for operating the service,
and then there's the data, and I own my data.
You guys manage my metadata.
Okay.
Where is the metadata, Enrico?
Okay.
I let Keith talk, but actually there is a small difference
between what he said and what really happens.
It's that the coordinator manages all the metadata.
The metadata is not actually distributed
because you could have metadata...
If you start moving metadata around,
it becomes really, really difficult to synchronize
all the metadata across a large network.
Consider that some of our customers have 5,000 nodes now.
And that could be challenging for the metadata.
So metadata is centralized.
So the coordinators manage the metadata.
Everything is a cloud service anyway.
So if you lose connection, it's like losing connection to your cloud service provider. It's the same way. So if you think about one of
the many cloud storage providers that you have available in the market, if you're losing connection
to the cloud provider, you don't have access to the data. The same happens for us.
It's true that it's very difficult for us to lose access to the metadata service
because we run our service in a multi-availability zone data center, plus we have a replica in a
fourth data center in another country. So it's really difficult. But except for that, and so this is why
latency is important, we are not doing cross-continent
swarms at the moment. This will change next year because we will have additional coordinators. So you can think about coordinators as our regions.
So we will have additional regions in the US and potentially in Asia.
So let's talk about, so can you have multiple coordinators or is there only one coordinator
across multiple zones?
No, you have only one coordinator per zone. Okay. You have
multiple coordinators in the sense that there is high availability. Right. Yes. But it's not a...
there's a single coordinator managing everything. There's a single destination. So you,
you folks, that's invisible. The redundancy is invisible to us, but there's one
coordinator. The one thing that I want to highlight
on your explanation, thanks for the clarification, was that this is why
low latency to the coordinator is important.
You want to be in a coordinator that's relatively close to where your gateways
are.
Yes. Also, we have other mechanisms to cheat a little bit with latency. So everything happens in parallel in the backend, of course. So when the gateway, for example, starts the operation,
it has a very optimized query. So it asks, for example, more nodes than are really necessary,
and it manages to do the data placements accordingly, and all in parallel. So when we achieve a number
of segments saved that is safe enough compared to the data protection level, we give the okay
in the front end. So for example, this eliminates the risk of
some nodes being slower than others.
Or we have a cache in the single gateway, so that everything that you read frequently is already cached locally, and that minimizes the access to the network.
Also, optimize communication in the backend.
So there are several other things that we do.
Also, we start doing some operations anyway.
And then if we get some errors or some acknowledgement,
then we proceed with the next steps.
But actually, we try to anticipate
some of the answers from the coordinator
or the front end
so that it helps a lot to work
with even small files sometimes
where the risk is to have
the impact of latency worsening the entire experience.
Yeah, yeah.
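Here is a minimal sketch, not Cubbit's actual code, of the early-acknowledgement pattern Enrico describes: push segments to more nodes than strictly necessary, in parallel, and acknowledge the client as soon as enough segments are durable for the chosen protection level.

```python
# Illustrative only: parallel segment writes with an early ack once a quorum
# of segments is durable. Node behavior is simulated with random delays.
import asyncio
import random

async def store_segment(node: str, segment: bytes) -> str:
    # Stand-in for a network write; some nodes are slower than others.
    await asyncio.sleep(random.uniform(0.01, 0.5))
    return node

async def put_object(segments: list[bytes], nodes: list[str], needed: int) -> None:
    # Push every segment to its node in parallel.
    tasks = [asyncio.create_task(store_segment(n, s)) for n, s in zip(nodes, segments)]
    done = 0
    for fut in asyncio.as_completed(tasks):
        await fut
        done += 1
        if done == needed:
            # Enough segments are durable for the protection level: ack the client.
            print(f"ack to client after {done}/{len(tasks)} segments")
    # The loop keeps draining the slower writes after the ack has gone out.

asyncio.run(put_object([b"segment"] * 12, [f"node-{i}" for i in range(12)], needed=10))
```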
So you mentioned there's a single gateway as well as a single coordinator,
although it could be multi-AZ as well?
You can have as many gateways as you want.
In fact, one of the things that we really like,
actually our customers like, is that they can have a gateway for
each single tenant or even having edge gateways that they can deploy remotely
and take advantage of the cache, for example, or different environments.
We have a customer that has three data centers and the swarm is in the three data centers.
And then they have one gateway in AWS,
one gateway in their data centers,
and a bunch of gateways in the edge location.
So they collect logs at the edge locations.
Everything runs encrypted inside the network.
So everything moves totally encrypted.
And then data stays in the country.
They access the data from AWS to do some analytics.
They have an application in AWS.
I mean, the compute power is cheap.
It's storage that is the problem.
And then they have another analytics application
that uses the same data sets to do other stuff
that runs in their data center.
But again, same data lake in the end.
And this application accesses it from different environments.
So where's the storage in that solution?
Is it storage sitting in the data center?
Is the storage sitting in AWS?
It's the data center that I mentioned.
They are one in...
This customer is actually in Italy.
It's one in Milan, one in Rome,
and the other one is in a small town called Arezzo.
So it's one petabyte of storage. It's not that much,
but actually it's a very compelling case history, because they
were coming from AWS.
Yeah.
So they told us, I mean,
we have some TCO calculations,
but sometimes you
do the math and you say, well, this is
marketing and stuff, but actually they came back saying,
we are saving 80% real,
really 80% from what we were spending before
with all the movement of data
and all the stuff that they are doing.
So you don't charge for ingress or egress
or anything like that.
You just charge for storage under management.
Yes.
Consider that if you look at, you know, now I'm used to the
European market, okay, so you can find hardware from most of the service providers, like OVH and many others, that is around 1.42 euro per terabyte per month raw.
So meaning that you buy a dedicated server that includes bandwidth, includes service
on the machine, everything, firewall, and you pay as little as that.
And then on top, you put our license and you can be as competitive as
the cheapest of the storage service out there, but with the geo-distribution included.
Usually when you go to one of these cheap cloud storage services, you get everything from a single data center.
And if you want a remote copy, then you need to pay double the price. So even if you start at
six, seven euro, then it becomes easily 14. Plus, sometimes hidden fees and other stuff.
With us, it's just the fee that you see there.
It's already geo-distributed.
There are all the options that I told you about,
about edge gateways and stuff like that.
So it could be very inexpensive.
But the most interesting project that we have at the moment
is at a telco.
They have 16 data centers.
And if you go in 16 data centers,
actually we are starting with 10,
but it doesn't change much.
If you think about it, you put 10 data centers
and you want to sustain like a couple of data center failures,
meaning major failures.
So two data centers down, you still have eight data centers.
And you start doing eight plus two,
meaning that you have a very, very small data protection overhead, but you have a massive...
What is the probability that you are going to lose two entire data centers?
Pretty low, hopefully.
It depends on the environment, I guess, right?
That would be a very bad day.
Yeah, yeah.
And also, consider the power consumption. If you
start thinking about a traditional technology where
you have to replicate the data between
data centers, and you have a
business continuity scenario
plus disaster recovery, it's three copies.
So two close to each other and one
in a remote site.
It's very expensive.
So this adds up. If you also add local data protection on top of it,
you are going to have between 4.5 to 5 times the initial storage that you needed to save.
Each single terabyte becomes 5 terabytes.
And the footprint is massive.
With us, you can have 1.6, 1.8 in a similar scenario.
And so if you have 1.8 instead of five,
consider how much storage you are saving.
So meaning less servers, less RAM, less CPU.
So the nodes cost less.
But it's not only that.
It's that the power consumption and the CO2 footprint, everything.
Yeah, sustainability and all that.
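A rough worked comparison of those capacity multipliers; the local 4+1 layout is an assumed example, not a quoted configuration.

```python
# Illustrative capacity-footprint comparison from the discussion above.
usable_tb = 100

replication_factor = 3 * 1.5                 # 3 full copies x ~1.5x local protection (~4.5x)
geo_ec = (10 / 8) * (5 / 4)                  # e.g. 8+2 across sites, hypothetical 4+1 inside a site
print(f"replication: {usable_tb * replication_factor:.0f} TB raw")
print(f"nested erasure coding: {usable_tb * geo_ec:.0f} TB raw (~{geo_ec:.2f}x)")
```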
So let's talk about the erasure coding.
Can you configure the erasure coding pretty much at will?
So let's say in this configuration with 10 data centers,
I want to be able to handle two data centers going out
plus maybe a storage node going out,
which would require, you know...
Yes, yes.
In fact, this is what we are patenting, practically.
So we have...
Seven plus three, something like that.
We have... so, independent...
You can do it in two levels, okay: inside the data center and at the geographic level, first of all. Okay, so it's a nested erasure coding, put it this way.
So you decide first
how many data centers you have and
how many you can
lose of these data centers.
And then inside the single data center,
you decide the second level of data protection.
And this is just one redundancy class. And then on top of it, you can add additional redundancy classes and
you can decide
policies where you see, okay, if it's a PDF file, for example,
I want this redundancy class.
If this is a movie, I want this other redundancy class.
So you can play with different setups and different costs of storing data.
Okay.
So all the combinations are possible.
And of course, because in the end, it's everything about metadata, right?
So we know the metadata, and we know what you can do with your files.
So you can decide, for example, well, this level of erasure coding is very inefficient with 32-kilobyte files.
Okay, so just change the level of data protection for these small files and you can do it.
And all that's done through metadata supplied with the put request or something like that?
Yes, practically when you put the data, you have a put, you already know how big the file is, and of course you have all the other metadata that builds up on top of it.
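To make that concrete, here is a hypothetical sketch of picking a redundancy class from object properties and attaching it as user metadata on the put; the metadata key and class names are invented for illustration and are not Cubbit's documented API.

```python
# Hypothetical: choose a redundancy class from object properties and send it
# as user metadata with the put. Key name and class names are made up.
import boto3

def pick_redundancy_class(key: str, size_bytes: int) -> str:
    if size_bytes < 64 * 1024:
        return "small-files"        # cheaper protection for tiny objects
    if key.lower().endswith((".mp4", ".mov")):
        return "media"              # wide geo stripe for big media files
    return "default"

s3 = boto3.client("s3", endpoint_url="https://s3.yourcompany.com")
body = b"...pdf bytes..."           # placeholder payload
s3.put_object(
    Bucket="documents",
    Key="report.pdf",
    Body=body,
    Metadata={"redundancy-class": pick_redundancy_class("report.pdf", len(body))},
)
```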
I like this nested erasure code.
I'm not sure I've ever seen that before.
So that's very interesting.
This is why we are patenting it.
Yeah, yeah, yeah.
I understand that.
So performance.
You mentioned performance.
So the gateways have cache.
I assume that's something that they could configure
if they want more cache or less or something like that.
It's totally configurable.
There are no limits on the amount of cache.
Of course, it means that more cache, more expensive.
So you have to, we have some customers that do deep archiving.
They don't do cache at all.
And other customers that are heavy on the cache because they want to have data always,
hot data always available in the cache,
of course, to minimize latency.
So especially for one customer,
that's a big one,
we are developing a feature
that is cache preheating.
So practically you have your gateway
and you can decide with a sort of cron job.
So you can do a metadata query at a specific time and then get the cache preheated for some workloads.
So, for example, I want every Monday morning all the data that was produced last week in the cache so that I have to run a batch job of some sort.
Everything is already close to the compute.
Or maybe it's Christmas and I want all the movies with the word Christmas in the title downloaded.
I mean, it's not a CDN.
No, it's not a CDN.
I got it.
But you are providing almost cache control.
It's not.
So you're really preloading the cache with a portion of the data based on some sort of a metadata query.
Is that how it's working?
Say again, sorry?
You're preloading the cache?
Yes.
So in this case, for Christmas data,
it's anybody that, any film that has Christmas in it,
you would preload the cache with
so that it would be more responsive during that time.
Yes, exactly.
And you can do it for each single edge gateway. So if you have 10
different edge gateways, you can run the same query, but also different queries.
So depending on the workloads, depending on, you know, your business needs in the end.
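As a rough sketch of the preheating idea: Cubbit's feature is driven by a metadata query on its side, but the same effect can be approximated by listing recent objects and reading them through the edge gateway on a schedule. Bucket and endpoint names are placeholders.

```python
# Illustrative cache preheat: on a cron schedule, read last week's objects
# through the edge gateway so its cache is warm before the batch job runs.
import datetime as dt
import boto3

s3 = boto3.client("s3", endpoint_url="https://edge-gateway.yourcompany.com")
cutoff = dt.datetime.now(dt.timezone.utc) - dt.timedelta(days=7)

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="telemetry"):
    for obj in page.get("Contents", []):
        if obj["LastModified"] >= cutoff:
            # Reading the object pulls it into the gateway cache.
            s3.get_object(Bucket="telemetry", Key=obj["Key"])["Body"].read()
```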
So is that data then exportable so I can run analytics in like a different, as I'm doing prep for AI training or for RAG,
can I export that into another system?
Well, the metadata, no, okay.
But there is, I mean, this is one of the things
that we are developing right now,
so all the metadata tagging,
all the stuff is really cool.
And of course, by adding functions on top of it,
I'm not saying anything here, but you know,
you can think where I'm going, right?
So every time you do a put operation or, you know,
some metadata update, whatever, with a function,
you can potentially do some metadata augmentation.
And then when you have augmented metadata,
you can start to run queries
to get the specific data that you need.
Maybe you need a set of data to train an AI
and you need that specific data,
you can do it.
So there's some type of message bus associated
with this that I can trigger off
of? Or you can request
services from. Yes, but
it's internal for us.
What I was thinking
is the function will come to the
gateway.
So what would be really interesting, as you folks mature, and this is one of the big gaps between, you know, the cloud providers and private object store solutions in general, is this ability to do, not queries, but alerts, message bus alerts off of any,
when an object is written or an object is received, et cetera,
so that now I can create functions myself.
You know, I can go to an open source functions-as-a-service platform,
like OpenFaaS, and, you know,
trigger OpenFaaS-type functions off of services.
This is where the cloud providers are really locking folks in.
So it's like Lambda.
I can't really talk too much about this, but keep an eye on us.
All right.
So let's talk about sizes of the clusters and stuff like that.
So you can have the storage servers don't have to necessarily be in one location.
Obviously, you'd want them to be in multiple locations for geo-distribution.
Yes.
So with our technology, you can start as small as three servers.
They could be three Raspberry Pis, to be honest.
The first implementation of Cubbit was on machines that were the size of Raspberry Pis with drives attached to them.
And so we still support ARM
as well as x86, of course.
And you can start from this.
So one single location,
three servers,
and three drives.
That's it.
And then, I mean,
we have customers now
with data centers with forty
12-drive systems, and in multiple locations.
I mean, you can really configure the system in the way you want it.
Again, because we manage all the metadata, also about the infrastructure,
so the ability of each single hard drive to receive the data,
we can then do the data placement
in the best way possible.
So this also means that when you start,
potentially you can start with these three servers
that I mentioned, and then you start growing.
And after a while, your service is successful,
so you keep adding hardware, but you need
bigger hardware.
In the meantime, a new generation of CPUs came out, etc.
You can mix-
Storage and stuff like that.
Yeah.
Yeah.
So you can mix every sort of hardware so that we don't really care about the kind of hardware.
To be honest, many of our customers actually
start with recycled hardware.
They decommissioned that hardware from something else,
because at the very beginning it's just
okay, let's try this technology.
It's inexpensive
to start with a bunch of
old servers.
Then when they realize that it's
good and it works
for them,
so they keep adding hardware. But sometimes they start to decommission the old hardware.
But in other cases, they just let the hardware die.
So the lifespan of the cluster is really long.
So usually you buy hardware for a three-year lifespan
with a three-year contract.
But actually, with us, you can keep the hardware.
And when the disks start failing, you just add new hardware.
And then when the level of that single server is below, let's say, 40%, 50%,
then you say, okay, this is no longer efficient to keep this hardware running.
So remove all the hardware.
We migrate the single nodes, the single hard drives,
and then you're good to go with the new hardware.
Go ahead.
No, no, it's just to say that it's really inexpensive
to manage an infrastructure like this.
So when you add a server, let's say,
and you've got humongous disks on it and stuff like that,
are you going to spread the data around immediately
or only will you use that for new data
that comes into the system?
So we don't usually do rebalancing.
It doesn't make a lot of sense
because you have the cache in the front end
that manages the performance. So doing the rebalancing is it doesn't make a lot of sense because you have the cash in the front end that manages
the performance. So doing the rebalancing
is a lot of
effort, especially
at the geographic level.
So you don't really need to do that.
We can do it.
We can migrate from
a redundancy
class to a new redundancy class that
keeps, you know, it's the same class but
with
that
new hardware
accounted for, etc.
You can do some
data movements
but it doesn't really make any sense.
I mean,
if it's not a lot of data, you can do it.
Yeah, yeah, yeah.
No, I understand.
Okay.
No, I'm good.
I'm good with that, Enrico.
I was going to say something about, can a customer just supply storage
servers and have other customers come in and use that data?
I guess it's really typical to have both the storage, yes, services and
the gateway that accesses it in the same customer. So one of the problems with these
services is that you don't really know who is managing your hardware, okay? Yeah, yeah. And
the problem is what happens. I mean, sometimes you have these very nice services and they work very well, okay?
Maybe spread across many countries,
but maybe some countries are not of your liking,
but that's another problem, okay?
We are back to the data sovereignty issues.
But even if you are okay,
that somebody's rent, you know,
sorry, lends you some hardware.
It's okay.
But the problem is,
so if you don't know who's lending you the hardware,
then maybe it's a guy in a basement
playing with his PC
and he has some free space and he lends you the storage. And
then one day a new game comes out and he needs space and he erases everything. So yes, there
are multiple copies and there is a rebuilding in the backend, et cetera, et cetera. But I mean,
I don't like it as an enterprise to know that.
Right, right.
So most enterprise customers would provide their own storage, plus provide their own gateways and clients to those gateways.
So all that would be within the infrastructure of one customer, wherever it lies.
It could be in the cloud.
It could be anywhere, right?
I mean, as far as he's concerned, he could have any of his infrastructure deployed as storage services or gateways.
Yes.
Yes. And you can mix some of your on-premises stuff with cloud stuff and it works.
Yeah.
Yeah.
Do you do any preferential access based on access speed?
I mean, so if I'm a big data center, I've got my own data center with very fast
servers, very fast networking. Plus I've got some other data centers out in the boonies that don't
have the high speed networking and high speed storage, but I am using storage in those data
centers for redundancy and things of that nature. Yeah. so we have two ways to do it. One is
we have an internal ranking of the nodes
so we know the nodes
that work better than
others and we choose the
best nodes when it's possible.
And the other
thing is that you can build two
redundancy classes.
One with the
good nodes and another one with the bad nodes, and then
decide where you place the data more granularly, let's say.
Right, based on which cluster you decide to put the data in.
Yes, and you have some A-level data and B-level data.
Right, right.
I noticed on your website,
you're integrated with some of the backup providers.
I saw Veeam and stuff like that.
In that case, you're a target for their backups.
Is that how I read that?
Yes, most of the solutions,
backup solutions, use us as a secondary copy.
When you put the gateway on-premises,
we can be very fast, actually, because you have the cache that gives you a boost in performance.
So you write very quickly data to the gateway.
Then maybe you don't have enough bandwidth
to go that fast to the backend.
But actually, the gateway acts as a buffer.
So you write quickly and you finish your job quickly.
Maybe your cache is also redundant.
So maybe it's the right one, SSDs, et cetera, et cetera.
And then you just write in the backend
at a lower speed. So it's possible.
Do I then have the same or similar controls as the cloud providers when it comes to immutable data, so what can and cannot be deleted? Yes. Considering using
this for, like, vaulting of backups? So we support S3 Object Lock, both in governance mode and compliance mode.
So, I mean, it works pretty well for everybody. So yes.
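For illustration, writing an immutable object with S3 Object Lock uses the same client calls as on the big clouds; the endpoint, bucket, and retention date below are placeholders, and the bucket must have been created with object lock enabled.

```python
# Sketch of an immutable backup write using S3 Object Lock (placeholder names).
import datetime as dt
import boto3

s3 = boto3.client("s3", endpoint_url="https://s3.yourcompany.com")
s3.put_object(
    Bucket="backup-vault",
    Key="weekly/2024-11-04.bak",
    Body=b"...backup payload...",
    ObjectLockMode="COMPLIANCE",          # or "GOVERNANCE"
    ObjectLockRetainUntilDate=dt.datetime(2025, 11, 4, tzinfo=dt.timezone.utc),
)
```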
Right, right. Are you guys strongly consistent or eventually consistent? I mean,
how does that play out in this environment?
This is a great question.
So, until last year, we didn't have strong consistency.
But this year we developed a new algorithm that allows us to practically check the local cache first, check the swarm second, and then, if we have the metadata but we don't have the
data that is updated, then we check on other gateways to see where the data is. So gateways
are configured in a sort of pool and you know that all these gateways have the same rules.
So you see, maybe I'm writing something in Seattle, and then I need to read it a few milliseconds later in LA.
So then what I'm going to do is, in LA, I don't have anything cached, of course.
I check, if you have a high-speed connection, potentially it's already in the swarm that is in the U.S.
If it's not, we are going back to the initial gateway and we take the data directly from there.
It's not the fastest way to get data back, because in the end it's three hops.
But you are sure that even in the worst-case scenario, we are strongly consistent.
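A schematic sketch of that read path, not Cubbit's implementation: local gateway cache first, then the swarm, then the gateway that originally took the write.

```python
# Illustrative read fallback: cache -> swarm -> originating gateway.
from typing import Optional

def read(key: str, local_cache: dict, swarm: dict, origin_gateway: dict) -> Optional[bytes]:
    if key in local_cache:                 # 1. hit in this gateway's cache
        return local_cache[key]
    if key in swarm:                       # 2. segments already reached the swarm
        data = swarm[key]
    else:                                  # 3. fall back to the gateway that wrote it
        data = origin_gateway.get(key)
    if data is not None:
        local_cache[key] = data            # warm the cache for the next read
    return data
```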
So that actually brings up a question I didn't ask that I wanted to ask: are you folks facilitating the network access,
or is that something that clients have to handle on their own? So the network access is theoretically
not our problem, and we have specifications for the amount of bandwidth that
we need for the storage nodes and what you can expect on the gateway, depending on the resources
that you give us. But if you need a certain amount of performance, I can give you the amount of
hardware and networking that you need to make it work. Yeah, but from the logical connectivity,
I'm responsible for making sure that one gateway can hit another gateway.
And that gateway can access the internet.
And your level of kind of help troubleshooting from a responsibility perspective
is to make sure that your
coordinators are reachable via the Internet.
And then I'm responsible for the physical
and logical connectivity outside of that.
Yes, we have all the metrics and everything that are exposed,
both internally with the APIs or externally within the user interface.
So you can see if something is going wrong, or
there is a bottleneck, or something that is not really performing the way you
expect it to perform. So yes, I mean, there are all the tools to make, you know, your life simple.
Yeah. Yeah. Yeah. Well, this has been great. Uh, Keith,
any last questions for Enrico before we close?
You know what, this is really interesting. I don't have any closing questions.
This is actually a really creative solution.
I'm interested to one day peel back the layers on it.
Yeah, yeah, yeah.
Enrico, is there anything you'd like to say
to our listening audience before we close?
Well, actually, so there are a couple of things.
One is that we are still very active in Europe.
We are expanding very quickly
in all major European countries.
And in fact, we will be, for example, at the Cloud Expo in Paris in November. So if you are a European listener, that could be a good event to catch up.
We are also going to Cloud Expo in London and other events.
And next year, you can expect to see more
of us in the U.S. as well.
That's great. I'd like to see how this plays out
in the rest of the world. Europe has GDPR
requirements that are, I would say, more stringent than
the U.S.'s, and stuff like that.
So I could see how, and we didn't really talk about it, but you could geofence the data
as well.
You could say that this data is only going to reside in these five data centers
and not the rest of the 20 data centers that I have, and stuff like that, right?
Yes, in fact.
I mean, the biggest customer we have at the moment is a defense company.
And they are using us just for this.
I mean, so they are building services for other defense companies.
So they have a very huge cyber war division, and they are using us for, you know, some traditional use cases, but also to build their own cloud services for other defense companies.
So it's pretty cool because the surface of attack is minimal because if you attack a server,
you don't find anything except segments of encrypted data and there is no way to go back
to the source to rebuild the entire information. And at the same time, I mean, you can really say,
okay, all this data stays in these three data centers
in UK or in the US and it doesn't have to move.
Also, we are developing with this customer
another feature that changes the level of data protection
depending on the level of crisis.
For example, I mean, yes, this data center maybe is in a country
where there is a risk of an attack.
I want to move only the data
that is in that data center
to another data center.
Or I want to change the security level
to a different level.
Crypto algorithm?
Stuff like that?
Not a crypto algorithm,
but you can change the level of
data protection. So maybe you are
in three data centers now, but there is
a risk of a war, for example.
And you say, okay, so let's
make it four. Then you
do all the data movements to
distribute the data in four.
Right, right.
Well, Enrico, this has been great.
Thanks again for being on our show today.
Thank you, guys.
It was my pleasure.
And check out www.cubbit.io.
And for anything else, I mean, you can find me on the social media.
I'm very active on LinkedIn.
Right, right.
That's it for now.
Bye, Enrico.
Bye, Keith.
Until next time.
Next time, we will talk to another system storage technology person.
Any questions you want us to ask, please let us know.
And if you enjoy our podcast,
tell your friends about it.
Please review us on Apple Podcasts,
Google Play, and Spotify,
as this will help get the word out.