Grey Beards on Systems - 142: GreyBeards talk scale-out, software defined storage with Bjorn Kolbeck, Co-Founder & CEO, Quobyte

Episode Date: February 2, 2023

Software defined storage is a pretty full segment of the market these days. So, it's surprising when a new entrant comes along. We saw a story on Quobyte in Blocks and Files and thought it would be great to talk with Bjorn Kolbeck (LinkedIn), Co-Founder & CEO, Quobyte. Bjorn got his PhD working on scale-out, distributed file systems, which is where the Quobyte story begins.

Transcript
Starting point is 00:00:00 Hey everybody, Ray Lucchesi here. Jason Collier here. Welcome to another sponsored episode of the Greybeards on Storage podcast, a show where we get Greybeards bloggers together with storage and system vendors to discuss upcoming products, technologies, and trends affecting the data center today. And now it is my pleasure to introduce Bjorn Kolbeck, co-founder and CEO of Quobyte. So Bjorn, why don't you tell us a little bit about yourself and what's new at Quobyte? Hey Ray, yeah, thanks for having me here.
Starting point is 00:00:37 I got into storage more or less by accident. I started my career in high performance computing and was part of a large European research project where we got the data management work package, and that opened up the possibility to try things. We decided that we wanted to build a distributed parallel file system for grid computing, and that's how I met my co-founders and how it all started. After finishing my PhD, I worked at Google. I wanted to do anything but storage.
Starting point is 00:01:13 But of course, when you work at Google, you use the infrastructure. And it's pretty impressive to see how very small teams run this global infrastructure. And that was the inspiration for my co-founder and me to start Quobyte. We wanted to bring together the massive scale-out of high performance computing storage and the ease of operation, and also the scalability on the operational side, that you see at Google, which basically comes from true software. So that's how basically my career and my research
Starting point is 00:01:47 led to what we have with Quobyte today. So why don't you tell us, what are the big changes? What does Quobyte bring to the table that some of the other software storage players don't have? Ease of use, obviously. Yeah, ease of use is a big one. That's one of our focus areas. And we believe that Quobyte is easy to use.
Starting point is 00:02:11 That's why we have a free edition that you can just download and install. I think that's the major differentiator to all the other commercial software storage products out there. They're not available as a download, and my guess is that's because they're very complex to install and use. So I think that's the biggest differentiator. And then there are a lot of details on how we do things differently, how we focus on fault tolerance, how we, for example, do the data
Starting point is 00:02:43 management of policies with a lot of flexibility. I think this is where Quobyte is truly unique. And on the other hand, we are a parallel high-performance file system, so you can get excellent throughput, IOPS, and small file performance from Quobyte. So when you say parallel file system, are you using like NFS version 3
Starting point is 00:03:00 or are you using your own client software, POSIX? How does the customer access Quobyte? We do have our native clients for Linux, Mac, and Windows, and that's the preferred model to access the storage. It's a POSIX-compatible file system. But with our own protocol, we avoid all the problems you have with NFS: bottlenecks, failover problems. And we can do parallel IO and also parallel metadata. And then for applications that somehow require NFS, we also support version 3 and version 4 of that protocol. But ideally, and I think most of our customers do, you would use the native client to get the best performance and also the best fault tolerance. And Bjorn, on that, so you mentioned you can go and download that. Do you have,
Starting point is 00:03:53 are there clients available, like you said, for basically Mac, Linux and Windows? Can you also, is that also the server as well? Can you utilize that to basically, I guess, take and create storage pools off of that? On the server side, we only support Linux. Okay. We run in user space, so we support all the major Linux distributions, but no Windows and Mac.
Starting point is 00:04:22 That's really only the client side. Yeah, yeah, yeah. You mentioned somewhere in there parallel metadata. You want to clarify what that means? I mean, I'm trying to understand what that looks like in the world today. Yeah, it means that we basically scale out metadata as well. So you can have a large number of metadata services.
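Editor's note: a rough, conceptual sketch of what the parallel IO and parallel metadata just described buy you compared with funnelling everything through a single NFS endpoint. The server names, the read_stripe helper, and the striping layout below are invented for illustration; this is not Quobyte's client code.

```python
# Conceptual sketch only: after one metadata lookup returns the stripe layout,
# a parallel-file-system client can fetch the stripes of a file from several
# data servers at once, instead of pushing every byte through one gateway.
from concurrent.futures import ThreadPoolExecutor

def read_stripe(server: str, file_id: str, index: int, stripe_size: int) -> bytes:
    """Placeholder for a network read of one stripe from one data server."""
    # A real client would issue an RPC here; we just return dummy bytes.
    return b"\0" * stripe_size

def parallel_read(file_id: str, stripe_map: list[str], stripe_size: int) -> bytes:
    """stripe_map[i] is the data server holding stripe i (as reported by a metadata service)."""
    with ThreadPoolExecutor(max_workers=len(stripe_map)) as pool:
        futures = [
            pool.submit(read_stripe, server, file_id, i, stripe_size)
            for i, server in enumerate(stripe_map)
        ]
        return b"".join(f.result() for f in futures)

# One metadata lookup yields the layout; the data path then fans out in parallel.
layout = ["ds01:7860", "ds02:7860", "ds03:7860", "ds04:7860"]  # hypothetical servers
data = parallel_read("vol1/home/ray/results.dat", layout, stripe_size=1 << 20)
print(len(data))
```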
Starting point is 00:04:40 So we have the split between metadata and data. It's very common in the HPC world. And you mentioned NFS initially, and there's also parallel NFS. And one of the big downsides with both NFS versions is that even in the parallel version, you talk to one controller node metadata server. So this is where our protocol allows you to talk to a number of metadata servers, not just one, which means that you can scale out your metadata as well. And so you mentioned storage servers running Linux software. So are the metadata servers in the storage servers both running on those servers or are
Starting point is 00:05:19 they distinct services or how has that kind of deployment played out? So we do have them as separate services. So there is a metadata service and a separate data service, but you can run them on the same machine in the end. That's up to the customer, to the user, how they want to do it. The default setup is to have all the services on the same machine, because then you basically utilize the scale-out better. But we do have use cases that are very metadata heavy, where customers decide that they want to run dedicated metadata servers. Kind of on that, say if you've got a mix of flash SSD
Starting point is 00:06:09 and then basically spinning rust on there too, can you actually split those up and run those services on different drives within a system? Absolutely. We basically have metadata drives and then data drives. Those are separated because you want the performance isolation. And then the metadata side should be a flash drive, simply because when you access file system parts that are cold, you want to page them in from the drive; you don't want to wait for a hard drive. And then on the data side, we support both flash and hard drives and can use them as tiers inside the same storage cluster and
Starting point is 00:06:55 even inside the same file. How many nodes or systems do you need to start with to create this distributed system? Minimum is four nodes, simply for fault tolerance. So we do either data replication with a quorum approach, which means you need three replicas, or erasure coding. And so four is the minimum number of nodes to have a production cluster where you can lose one node and the system will still continue to operate properly. And then you can
Starting point is 00:07:32 scale that as much as you like. Yeah. So when you said parallel metadata, metadata is presumably protected as well. Is it protected through replication, or is it something like the same as the data protection? It's different. So for the metadata, we use a replicated key value store. So the mechanism is kind of the same, with primary election and a quorum approach. But where data is basically file blocks,
Starting point is 00:08:08 our metadata is stored in an LSM-tree database that's replicated. And so we have one primary per volume with two backups, and the backups are actually what you would call hot standbys. So they have the state and can take over immediately. And when your cluster grows to like, let's say 16 nodes,
Starting point is 00:08:24 you've got multiple primaries in that configuration or there's just still one primary and multiple secondaries for the metadata? No, then you can scale that out and have as many primaries as you like. And they basically, our system tries to evenly distribute that across your metadata servers so that every metadata server is responsible for part of the system. I'm sorry, did you mention that the metadata is partitioned across those primaries or it's effectively the same metadata across all the primaries?
Starting point is 00:09:02 It's partitioned into what we call volumes. So we have a very lightweight volume concept, and then you can have a large number. So what our users typically do is create, for example, one volume per home directory instead of this giant home directory volume. Yeah, that's interesting. That's interesting. On these machines, is this designed to run bare metal?
Starting point is 00:09:30 Or I assume since it's pretty much a software-defined stack, you could actually run this within virtual machines or containers or anything like that. Is there a specific kind of deployment style that you look at? Or do you support one specific or multiples? Multiples. You're right, it's software. And as I mentioned, we run in user space, so we don't have a lot of requirements when it comes to the kernel version. That's why, on bare metal, you just take your favorite Linux distribution, install it on the servers, and then install our software in there. The same works on virtual machines on the cloud
Starting point is 00:10:10 or bare metal machines on the cloud. And we also run on Kubernetes. So we provide a Helm chart to install the Quobyte servers, the clients, and our CSI plugin. And then you can run Quobyte in Kubernetes and also provide a shared file system to Kubernetes. Ah, that's interesting. That is interesting. So you're effectively running as containers
Starting point is 00:10:34 under Kubernetes, but not as containers under bare metal, right? Is that how I would read that? Yeah, exactly. So we don't require you to have containers on bare metal. Our server processes just run as a regular user with cgroups. There's nothing special; you basically take a bare Linux and install our software. Yeah, yeah. Let's talk about security; there was something on your website mentioning that the solution is very... You know, the other thing I noticed on the website was it's not only file, but it's also object. You want to talk about what your object protocol support is? Sure. I mean, this is where we basically combined it into a single platform.
Starting point is 00:11:25 So we made sure that our object storage interface translates to the file system layer. So it's not different namespaces. It's actually when you go through the object storage interface, you can access the file system, which means you can access and share the same files, volumes, everything through the object storage interface. It's really seamless. And it's like S3 then or compatible? Yeah, it's S3 compatible. I would say most applications don't notice the difference. Admins will notice a difference because they're operating on the file system
Starting point is 00:11:58 and can use file system permissions and access control lists. And then those are also effective for the object storage access, so that makes their life much easier. Okay, that's interesting. Users don't see a difference. Okay, now back to security. What are the security options for the solution then? Yeah, security is a complex topic. So it's many layers. In the end, to make a storage system secure, you try to put as many fences around it as possible.
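Editor's note: a brief aside on the S3-compatible interface described just above, before the security layers are laid out. This is a generic sketch of S3-style access with boto3 pointed at a custom endpoint; the endpoint URL, bucket name, and credentials are placeholders, not real Quobyte values.

```python
# Generic S3-compatible access via boto3 against a non-AWS endpoint.
# Endpoint, bucket, and credentials below are placeholders for illustration.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.storage.example.com",  # hypothetical S3 gateway address
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Because the object namespace maps onto the file system namespace (as described
# above), an object key such as "results/run42.csv" corresponds to a path that
# native-client users can see and share as a regular file.
s3.put_object(Bucket="project-volume", Key="results/run42.csv", Body=b"a,b,c\n1,2,3\n")
obj = s3.get_object(Bucket="project-volume", Key="results/run42.csv")
print(obj["Body"].read().decode())
```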
Starting point is 00:12:31 Again, we have the advantage of our own protocol that allowed us to add more security features than you have in NFS. So the first layer is end-to-end data encryption, which is done inside the Quobyte client. So your data is encrypted and decrypted only on the client machine where it's actually produced or consumed. And then the rest of the system, like the network, even the storage servers, don't need to be trusted, because the data is never stored in plain text anywhere, on the drives or over the network. So that real end-to-end encryption is something you can't do with NFS. Where are the keys for that then? Is that on the client side, or is that something that you configure?
Starting point is 00:13:19 That's something you can configure. They can either be in one of our services, or you can put the keys in an external key store. Okay, that's good. I like that. Okay, go on. So you were saying end-to-end encryption is kind of the first layer of security for the solution? Yeah, because when you set that up correctly and your keys are stored somewhere else, you
Starting point is 00:13:45 basically have a system where you don't need to trust the storage admins because they will never have access to the data, to the clear text content. Then we have the second layer that's basically for the network communication. There we have optional TLS support. So once you enable that, you can define which networks you don't trust. And then our clients and services will use TLS to communicate with each other.
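Editor's note: a toy illustration of that first layer, where data is encrypted and decrypted only on the client and the servers and network only ever see ciphertext. It uses the off-the-shelf Fernet recipe from the Python cryptography package purely to show the idea; it is not Quobyte's actual encryption scheme, and key management is reduced to a single variable.

```python
# Toy end-to-end encryption: encrypt on the client before anything leaves it,
# decrypt only on a client that holds the key. Illustration only.
from cryptography.fernet import Fernet

key = Fernet.generate_key()        # in practice this lives in a key store, not in code
client_cipher = Fernet(key)

plaintext = b"sample batch 42, sequencing run 7"
ciphertext = client_cipher.encrypt(plaintext)   # this is what the storage servers receive

# A storage admin reading the drives, or anyone sniffing the network, sees only
# ciphertext; only a client configured with the key can recover the data.
assert Fernet(key).decrypt(ciphertext) == plaintext
```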
Starting point is 00:14:13 TLS comes, of course, at a price. TLS connections are CPU intensive, slower. So often that's good for wide area communication, but you don't want that inside your cluster, for example inside the compute cluster. And then, kind of hand in hand with that, goes the support for X.509 certificates. That's again optional. You can have a cluster that uses the typical IP network restrictions
Starting point is 00:14:47 to allow components to access the cluster, or you can use X.509 certificates. In that case, all servers and also all clients need a valid certificate to join the cluster, and the advantage here is that you don't need to trust the network. Even if someone gets access to the network and is able to squeeze a node in there, without a certificate they can't access the storage system. And the other advantage is that admins can put restrictions on certificates, and that way invalidate them if a certificate leaked, or restrict them to certain users.
Starting point is 00:15:26 There's this typical scenario, especially in research, where the researchers or data scientists love to have root access on the workstation. In that case, with the certificates, you give them one certificate that's tied to the user, and then you don't care that they have root access or could impersonate other users. So that's a level of extra security that you get with the X.509 certificates. Interesting. What about things like role-based access controls and things of that nature? We support something that's similar to NFS v4 ACLs internally, and then we translate back and forth to the native representation. So in Windows, we support a subset of the Windows access control.
Starting point is 00:16:18 So, for example, no inheritance because that can't be translated, but everything else you can modify it in the Windows settings. Our client translates that and then modifies your NFS ACLs, same with POSIX or S3. So those are unified, which makes it fairly easy to manage them and also audit those access control settings. Do you guys have any type of ability to replicate from one kind of data store to another data store for like disaster recovery style of capabilities? Yeah, we have basically two mechanisms. One is what we call volume mirroring, which is based on an event stream. So the remote cluster gets the updates and then pulls the changes near instantaneous
Starting point is 00:17:11 from the source cluster. That's one option for DR. Then the other option is a bit more versatile. It's what we call the data mover. It's a component that you can use to move or copy or synchronize data between Quobyte clusters. It's highly parallelized, but in the end it's something you trigger, so you can run that every hour, or you can run that in the evening, depending on your needs. So that's for cases where you have, for example, limited network bandwidth between the clusters or
Starting point is 00:17:45 where the new instantaneous updates would be too much. So an asynchronous based kind of mode that you can trigger? Exactly. Both of them are asynchronous, but one is, depending on your bandwidth, I would say near real time. And the other one is something that's really more like eventual consistency. That's nice. So you mentioned the multiple services. You mentioned the client and the storage and metadata services.
Starting point is 00:18:19 Is there like a CLI or a GUI-level operational console, dashboard kind of thing as well, and where would that run, I guess? Yeah, both of them are available. So we have a command line tool, we have a web interface, and both of them talk to our API. So we're kind of API-first, and admins can use the API to automate or do anything they can do with the command line tool or the web console. So both of those tools actually talk to our API, and the API and web console are just additional services that you run on one or multiple of the machines
Starting point is 00:19:03 where the Quobyte data and metadata services are running. And so you mentioned erasure coding, obviously a 3+1 for the four nodes. As you go to, let's say, 16 nodes, does that erasure coding change, or how is that? Is that something the customer configures, or something that you automatically change, or does it just stay with four to one or three to one? No, you can adapt that. And customers can select the erasure coding schema.
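Editor's note: for reference while reading the next few exchanges, the arithmetic behind replication and the k+m erasure coding schemas mentioned in this episode. Pure arithmetic, nothing Quobyte-specific; the comments state the assumptions.

```python
# Compare 3-way quorum replication with a few k+m erasure-coding layouts:
# how many node/chunk losses each survives and how much raw capacity it costs.
def replication(replicas: int) -> dict:
    quorum = replicas // 2 + 1
    return {
        "nodes_per_file": replicas,                    # one copy per node
        "losses_survived": replicas - quorum,          # a quorum must stay reachable
        "capacity_overhead": f"{replicas:.2f}x",
    }

def erasure_coding(k: int, m: int) -> dict:
    return {
        "nodes_per_file": k + m,                       # each chunk on a different node
        "losses_survived": m,                          # any m chunks can be lost
        "capacity_overhead": f"{(k + m) / k:.2f}x",
    }

print("3 replicas:", replication(3))
for k, m in [(4, 2), (8, 3), (12, 3)]:
    print(f"{k}+{m}:", erasure_coding(k, m))
# 3 replicas survives 1 loss at 3.00x; 8+3 survives 3 losses at ~1.38x,
# which is why wider schemas become attractive once the cluster has enough nodes.
```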
Starting point is 00:19:36 There are a few schemas that we recommend because you need enough redundancy to survive node outages. And customers can select them if they have enough machines and disks in the system. So as you add machines, you can go wider up to, for example, 12 plus three, I think it is. And then our data mover can also recode files. So that's the use case where you have files that were, I don't know, replicated or 4 plus 2 erasure coded, then you add more nodes, and then the data mover can recode them in the background if you want that. Is that something that's automatically done as you add nodes to the cluster?
Starting point is 00:20:19 Is that something that somebody would have to trigger through API, CLI, or GUI? That's something you have to trigger because the recoding is pretty expensive. It basically reads all your data, writes it out again, needs to recompute or compute the parity. So that's a pretty expensive task. It's something you don't want the system to do automatically. You mentioned being able to scale it out. Are there any limits to the scaling?
Starting point is 00:20:46 How far have you practically scaled it out? And are there any theoretical maximums, I guess? In terms of production, the customers we can talk about, I think, are in the range of 250 servers. And we have seen 20,000 clients accessing the storage; that's what we can talk about. Theoretically there aren't any limits in our system in terms of the number of servers or clients. We use an architecture where we do the primary election decentralized. So only the nodes that have a copy of the file or the metadata need to talk to each other to coordinate who becomes primary. And this allows us to scale without, you know, things like a centralized lock service, which would easily become a bottleneck; we don't have that.
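Editor's note: a deliberately naive toy model of the idea that only the nodes holding a file's replicas need to coordinate, so no central lock service is required. Hash-based placement and a lowest-ID-wins election are stand-ins chosen for brevity; they are not the Paxos-inspired, disk-IO-free protocol Bjorn alludes to below.

```python
# Toy model only: per-file coordination instead of a global lock service.
# Placement by hashing and "lowest live node wins" are naive illustrative choices.
import hashlib

NODES = [f"node{i:02d}" for i in range(16)]

def replica_set(path: str, copies: int = 3) -> list[str]:
    """Deterministically pick which nodes hold a file's replicas."""
    digest = int(hashlib.sha256(path.encode()).hexdigest(), 16)
    start = digest % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(copies)]

def elect_primary(replicas: list[str], live: set[str]) -> str:
    """Only the replica holders take part; the rest of the cluster is uninvolved."""
    candidates = [n for n in replicas if n in live]
    if len(candidates) < len(replicas) // 2 + 1:
        raise RuntimeError("no quorum for this file")
    return min(candidates)

live_nodes = set(NODES) - {"node03"}                 # one node happens to be down
replicas = replica_set("/vol1/home/ray/data.bin")
print(replicas, "-> primary:", elect_primary(replicas, live_nodes))
```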
Starting point is 00:21:40 We really have linear scaling to a large number of nodes. Is there any specific cluster consensus algorithms that you use for doing that? Or is that proprietary? It is proprietary. I mean, if you look into our publication list, you'll find it. It's basically how would I best describe that? Inspired by Paxos might be the right word. Yeah.
Starting point is 00:22:08 I've dealt with Paxos and Raft and a few others in my day. You might. Yeah. Go ahead. The tricky part is to avoid the disk IO, the persistent storage, that you need to do with both. So that's the main thing: we don't use disk IO for the consensus, which would basically ruin the performance of your storage system. Yeah, it gets very chatty the more nodes you add into it.
Starting point is 00:22:39 So yeah, I remember going through that many times with Paxos as well. I'm not that familiar with Paxos, but that's okay. You mentioned primary election or something, I can't even think what the other word was. So if, let's say, in a cluster I'm trying to access a file and it's erasure coded, is there one storage server that becomes the primary conduit for that file IO, and then the data could go to any of the storage servers associated with the erasure code? Or, I guess the first question is, is erasure coding done on a node-by-node basis or on a drive-by-drive basis? It appears node by node. Yeah, so maybe we have to take a step back. When you create a file, the metadata server that's responsible for that volume
Starting point is 00:23:35 looks at the policy engine and the policies you configured, because that allows you to do things at a very fine granularity. So you could have a policy that says Ray's files are always erasure coded and on flash, and Bjorn's files are replicated and end up on hard drives. So you have a lot of options there. So the metadata server first looks at the policies to decide how to handle your file.
Starting point is 00:24:02 If you use synchronous replication, it would assign three copies on three different machines and also three different drives. And then we have a primary election to make sure that the three copies are in sync, so that you have consistent copies of your file. If it's erasure coded,
Starting point is 00:24:29 then we actually spread it across ideally eight plus three servers. So if you have 12 servers or 11 servers, we would spread it across them so that each chunk is on a different server so that you have really the full fault tolerance for erasure coding. And then there is no primary election vote because that's the replication. So in that situation, typically, well, I guess I don't know. So in an NFS perspective, the file request would go to an IP address, which would be what I would consider
Starting point is 00:25:02 like the primary storage conduit for the data. The data could be anywhere on the cluster. But in the case of your solution, I could actually go to any of the storage nodes on the cluster, whether I'm NFS or client. I guess I don't understand how it connects. Yeah, it's very different from NFS. You're right. When you're a client and you talk to NFS, you basically have this one NFS endpoint and then something happens behind that server. That's also the problem because you're talking to that one NFS endpoint. So that's a bottleneck.
Starting point is 00:25:46 If everyone's talking to the same one, you know, it's a typical problem you have in NFS-based solutions. So our client takes a different approach. Our client directly communicates with the right metadata service and then ask the metadata service for the file location. And then the metadata service will tell the client that this file,
Starting point is 00:26:04 for example, is erasure coded and the stripes are spread across these 11 drives and servers. And then the client can directly go to the data services and retrieve the data from there or write the data, keeping the metadata servers out of the loop. And that's the great thing for scalability, because on one hand, you don't need to talk to the metadata service anymore. On the other hand, the clients are able to talk directly to all the data services that have the necessary data. So you have a system where you don't have those bottlenecks of NFS gateways that need to redirect the data. Okay. So typically, let's say you've got an 8 plus 3 stripe here. The mapping between, let's say, a file record, if such a thing exists, or a byte offset, let's say, and, you know, a location within that stripe is typically non-computational. I mean, it's something that has to be looked up in some table, and that typically requires
Starting point is 00:27:09 some sort of metadata service. But in your case, you don't do that. I'm trying to understand how that all plays out. Yeah. Often when metadata in distributed systems isn't the same for all systems. Some do distributed metadata where they consider the block allocation, the metadata. In that case, you're basically similar to a local file system. You have the block allocation table for a file, which can get very big.
Starting point is 00:27:38 And then you constantly talk to the metadata. Yeah, yeah, yeah. Yeah. OK. But in your case, andFS are part of that. And then we have an architecture that's somewhat similar to the Lustre file system where we don't do block allocation at global level. The metadata basically has the allocation
Starting point is 00:27:57 in large chunks, think gigabytes, for a file, what we call segments. And then for each segment, it has the list of drives for the chunks. So for eight plus three erasure coded files, it would be 11 drives. And then the client based on the offset and this very small table
Starting point is 00:28:17 can actually locate the servers that have the data, and then the servers do the block allocation locally. But that's something the client doesn't care about. I got it. The client takes the offset, goes to this tiny table, and then hardly ever talks to the metadata service again. Yeah, well, NFS is pretty chatty with respect to a lot of the metadata stuff, so that could be a potential bottleneck. Let's talk about performance. You mentioned that you can have a multi-tiered storage pool. Is the tiering on a file basis, or is the tiering on a storage pool slash volume basis, I guess, is the question I would ask. It's actually on a per-file basis.
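Editor's note, circling back to the segment table just described before the tiering discussion continues: a minimal sketch of how one small per-file table maps a byte offset to the servers holding that part of the file, so the client rarely needs the metadata service again. The segment size, table layout, and server names are illustrative assumptions, not Quobyte's on-disk format.

```python
# Minimal sketch: map a byte offset to the servers holding that segment of the
# file via a small per-file table (gigabyte-sized segments assumed here), so IO
# goes straight to the data servers. Names and sizes are illustrative only.
SEGMENT_SIZE = 1 << 30   # assume 1 GiB segments

# segment index -> servers holding that segment's chunks (e.g. 8+3 -> 11 entries)
segment_table = {
    0: ["ds01", "ds02", "ds03", "ds04", "ds05", "ds06", "ds07", "ds08", "ds09", "ds10", "ds11"],
    1: ["ds02", "ds03", "ds04", "ds05", "ds06", "ds07", "ds08", "ds09", "ds10", "ds11", "ds12"],
}

def servers_for_offset(offset: int) -> list[str]:
    return segment_table[offset // SEGMENT_SIZE]

# A read 1.5 GiB into the file lands in segment 1; the client contacts those
# servers directly, and block allocation within each server stays local to it.
print(servers_for_offset(3 * SEGMENT_SIZE // 2))
```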
Starting point is 00:29:00 What? That's, again, where our policy engine comes in. So you could say or you could create a policy that says if you want to do that, your own files are tiered down from flash to hard drives after 10 days. Things like that. So of course, usually you don't do that on that granularity, but we have customers that say, for example, certain files like the big, in life science, it's BAM files, those are huge, several gigabytes. Those files should go directly to hard drives
Starting point is 00:29:39 because they're only read sequentially, they're never modified. So just put them directly on hard drives, don't even bother with the flash tier. So it's really on a file level and tiering is just a different kind of policy. We have policies where we automatically switch from writing to flash with replication
Starting point is 00:30:00 to erasure coding on hard drives inside the same file. Because you often have those use cases where you have tiny files or metadata files, and then you have large files that contain the actual data. Think about machine learning. That's fine. Wait. So let me try to understand.
Starting point is 00:30:17 So let's say it's Bjorn's files, and I want to tier them. I can have it go to Flash and be replicated on the Flash tier for, let's say, 30 days. And after that, it can go to disk and be erasure coded. Is that what you said? That's one option. What? Nobody does that. Nobody does that. Yes, that's what we do. How many different tier levels can you have? As many as you like. In the end, we tag devices. So we have the policies.
Starting point is 00:30:58 We have devices that have tags, and then we have files. And for tags, the system automatically creates HDD, SSD, and NVMe tags. So this would be your traditional three tiers that you have. But we have customers that, for example, have high-density hard drive servers for archival. Then they create an archival tag for those drives and use that for very cold storage, for example. So that's- Ray likes to tier to tape.
Starting point is 00:31:28 No, no, no. Let's not talk about that stuff. Thanks, Jason. That was a low blow. Or maybe it was a high blow. I don't know. But this is pretty unusual to change the configuration
Starting point is 00:31:40 of the data protection based on what tier. And the fact that you've got unlimited numbers of tiers is also pretty unusual, I would say. And then you also mentioned that there are some other policy engines, like you could define a policy, but do you have any automated policies that go in and kind of check the heat maps of data or anything like that? The automated policies would be what I described, that we automatically switch from replication on flash to erasure coding on hard drives inside a file when it grows. That's because most customers have this problem that they have a ton of tiny files and then very large files, and this policy automatically handles them correctly.
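Editor's note: a toy rendering of the kind of policy logic being described, where small files land on flash with replication, large or cold files land on hard drives with erasure coding, and extra rules key off file metadata such as extension or age. The rule structure and thresholds are invented for illustration; Quobyte's real policy engine is configured through its own interfaces, not Python.

```python
# Toy policy engine: choose placement from file metadata. Thresholds and rule
# names are invented for illustration only.
from dataclasses import dataclass

@dataclass
class FileInfo:
    path: str
    owner: str
    size_bytes: int
    age_days: int

LARGE_FILE = 64 * 1024 * 1024   # assumed cut-over size, not a real default

def placement(f: FileInfo) -> dict:
    if f.path.endswith(".bam"):                                   # huge, sequential, read-mostly
        return {"tier": "hdd", "protection": "erasure_coding"}
    if f.size_bytes < LARGE_FILE and f.age_days <= 30:            # small and warm: IOPS-friendly
        return {"tier": "flash", "protection": "replication_3x"}
    return {"tier": "hdd", "protection": "erasure_coding"}        # large or cold: throughput

print(placement(FileInfo("/vol1/ray/notes.txt", "ray", 12_000, 2)))
print(placement(FileInfo("/vol1/lab/sample42.bam", "bjorn", 8_000_000_000, 1)))
```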
Starting point is 00:32:26 So the small files end up on Flash. Your large files end up on hard drives. So you don't need to tune anything, and you don't need to do tiering. Because tiering is also expensive. You read the data, and then you push it down. In that case, the system automatically optimizes the file based on the file size. What other policy specifications are there besides the tiering stuff? I mean, you mentioned security, obviously.
Starting point is 00:32:52 I guess that's also on a file level, the encryption? Encryption is, well, in theory, it's on a file level, but you would define that per volume. Okay. What other sort of policy attributes are there? Security settings like which users are considered super users for Kubernetes, for example, how to translate user IDs and how to store them when you don't have a username associated with that. Client settings like metadata cache times. What else is there?
Starting point is 00:33:32 Quality of service levels, all of that. Basically everything is controlled through policies. Yeah, yeah, yeah. So you mentioned the free solution. Is that sort of on a, you know, how many gigabytes you have for storage, or is there a certain timeframe for that? Or how does the free versus non-free work? The free edition is free forever. And it includes up to 150 terabytes of hard drives, and I believe 30 terabytes of flash. So it's something you can actually use. I'm sorry, did you say the free service was up to 150 terabytes of disk and 30 terabytes of flash? Yes. Jesus.
Starting point is 00:34:25 What are you... Okay, is there a limitation on the number of servers? Is that what you said? No limitation on the number of servers. There's a limitation on the number of clients that can access the storage system. And there are certain enterprise features that you don't get with the free edition.
Starting point is 00:34:41 Okay. So, for example, the security features aren't included. But yeah, in the end, we're a scale out file system and we want to make sure that our free edition is actually meaningful. That's why you get 150 terabytes. 150 terabytes could go for a pretty long time for most of my application.
Starting point is 00:34:59 I don't know about yours, Jason, but I think I'd go pretty much forever. Yeah, then just download it from our website. I'm going to. Okay, so tell me, how is the solution priced after 150 terabytes of disk? Jesus. I'm sorry. So our infrastructure edition is priced as an annual subscription.
Starting point is 00:35:27 So we basically give you a subscription based on the capacity for flash and hard drives. Subscription includes support and then obviously also updates to the software. And it's a typical volume discount. So the more capacity you buy, the lower the per gigabyte price is. That's nice. Nice. And so is that licensed for one cluster or for as many clusters as I want? So, I mean, let's say I'm licensed for, I don't know, a petabyte.
Starting point is 00:35:58 That would be a Jason thing. A petabyte of disk storage. Can I run 10 clusters, or is that just one cluster, or how does that play out? That's just one cluster. Okay, per cluster. Per cluster. I got you. But you could buy ten 100-terabyte licenses
Starting point is 00:36:18 to spread it across your clusters. Yeah, yeah, exactly, exactly. And the replication data mover or event driven, is that something added to the solution that you have to purchase or is that available with the base price? That's available with the base price. So we just have, I would say a very transparent price model
Starting point is 00:36:43 like a lot of other storage companies. We have the free edition and what we call the infrastructure edition, and the infrastructure edition comes with all features. Okay. Yeah. Nice. Nice. So tell me about this. So recently there was a client, I guess in life sciences, that purchased the solution. Can you tell me why they ended up with Quobyte? Yeah, that's Hudson Alpha. They're based in Huntsville, Alabama. They decided to go with Quobyte because they like the simplicity and also the way they can offer storage as a service to their internal customers,
Starting point is 00:37:25 because that's... customers might be the wrong word, users. That's something that we see more frequently now, that especially in life science, but also in academia, research institutes and companies try to consolidate the storage. So often every group does their own storage, and they're trying to bring it onto a centralized storage platform to have better quality of service and also better cost in the end. And then you need to do it similar to what you have on the cloud, where you can offer storage as a service to your users, and this is where our flexibility
Starting point is 00:38:05 with the API is helpful, where we have, for example, multi-tenancy features so that you can isolate your different tenants, thinly provisioned volumes so that you can do oversubscription and sell 100 terabytes without reserving 100 terabytes, all of those things. Right, right, right. Storage itself is very, I mean, it's a very horizontal market space. Have there been any specific vertical markets that have been really a good, kind of a really good selling proposition for you?
Starting point is 00:38:49 and the customers are struggling with that. So life science is a pretty good example because, you know, if you look at genome sequencing or anything that works with images, whether it's microscopy, cryo-EM, the resolutions are getting higher and they just swamp with data. Resolutions are going up. Yeah, yeah, yeah. Yeah. Media entertainment is similar because same thing,
Starting point is 00:39:16 4K, 6K, 8K. So resolutions are going up. Data is very valuable. So the amount of data they have to store and process is just growing every year, so that's a good market. Then the whole machine learning sector is very demanding on the storage as well, and if you think about autonomous driving, you just have huge amounts of data there, so that's a good market. And then also traditional high-performance computing. I think COVID definitely brought more attention
Starting point is 00:39:51 to the high-performance computing side. Yeah. And also a lot of funding from the governments for research institutes. So that has been a phenomenal market for us too. Right, right, right, right. So a lot of file systems do well with small files or well with large files. It's kind of hard to do well with both. Is it because of the way you are doing tiering?
Starting point is 00:40:17 So you mentioned one of the policies was to take the file and put it on flash at first. And as it grows to, let's say, oh, I don't know, 10, 100 gigabytes, you move it to disk. Is that a policy option for you? Yeah. And in the end, it's exactly the reason why we can handle both small files and large files very well.
Starting point is 00:40:39 We use our synchronous replication for small files, and for large files we use erasure coding to make them efficient; for large files, it's throughput optimized. And then by combining the two inside the same file, we make that automatic, so that you don't really care whether you have small and large files or a mix in a volume. So by implementing both mechanisms, we can deliver high performance for small files, high performance for random IOPS like databases or VMs, and high performance for throughput workloads with the erasure coding.
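Editor's note: a back-of-the-envelope illustration of why that combination works, with assumed file sizes and no modelling of chunking or padding: replication is cheap in absolute terms for tiny files, while erasure coding keeps the relative overhead low for big ones.

```python
# Raw bytes consumed by a tiny file and a big file under 3-way replication
# versus 8+3 erasure coding. Assumed sizes; padding and chunking not modelled.
def stored_bytes(size: int, scheme: str) -> int:
    if scheme == "3x":
        return size * 3
    if scheme == "8+3":
        return size * 11 // 8
    raise ValueError(scheme)

for size, label in [(4 * 1024, "4 KiB file"), (10 * 1024**3, "10 GiB file")]:
    print(label, "| 3x:", stored_bytes(size, "3x"), "bytes",
          "| 8+3:", stored_bytes(size, "8+3"), "bytes")
# The 4 KiB file costs only 8 KiB extra under replication, while the 10 GiB file
# would cost 20 GiB extra; 8+3 trims that to about 3.75 GiB, hence the automatic
# per-file switch from replication to erasure coding as a file grows.
```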
Starting point is 00:41:15 Yeah, interesting. So the policy could be time-based, it could be size-based. I guess it could be change-based and stuff like that. I mean, I guess that doesn't make any sense, sorry, don't even go there, Bjorn. Yeah, it could be file-extension based. You know, anything you have in the file metadata you can actually use for the policies: username, file extension, last access time. All of that stuff. All of the attributes.
Starting point is 00:41:47 Yeah, yeah, yeah, yeah. There's something else rattling around here. I was trying to think what it was. So you mentioned usability, API. So Google is all about API and I'll call it DevOps kind of world where everything is code managed and stuff like that. Do your customers use your solution like that? I mean, obviously, because it's API driven, it could be.
Starting point is 00:42:13 It depends on the customer, I would say. We have customers that definitely use us, you know, as infrastructure as code. We also have Ansible scripts, and it's very easy to set up Quobyte clusters. We have customers that spin up Quobyte clusters on the cloud temporarily to provide the storage for a workflow that runs
Starting point is 00:42:32 on the cloud, for example. Yeah, that was the other question. Which clouds do you work on? And are you in, quote unquote, marketplaces on those clouds, or is that something else? We are in the Google Cloud Marketplace.
Starting point is 00:42:51 But we also run on Azure, Amazon, and Oracle Cloud, all of them tested. Okay. And as far as the Marketplace perspective, you're in the Google Cloud Marketplace, but not on the other ones I gather? Yeah, but that will change. Yeah, over time, obviously. God, I think I'm done with questions. I was going to ask about backups and stuff like that. Do you talk with any backup solutions? I mean, it's one thing to have erasure coding and synchronous replication, but if the cluster goes down, the cluster goes down, you want to try to restore it someplace else, right? Yeah, you should have backups.
Starting point is 00:43:31 We should all have backups. Pretty clear. So yeah, we work with some backup vendors. It's pretty straightforward, because we're a parallel POSIX file system, so you add, you know, extra nodes to run the backup from, and anything that does backups of POSIX file systems works with us. We're also working with vendors that have solutions for tiering to tape, for example, and then it's pretty easy. They often have an S3 interface, and our next release will support that, so that you can tier off to them or copy onto such a solution that then moves it to tape. I mean, external tiering always comes with footnotes, right? When you tier to some external system, you need to tier it back in. I'm not a big fan
Starting point is 00:44:19 of external tiering. It's often very cumbersome, but customers ask for it, and that's why we're adding it. Okay. You know, I think that's about it. Jason, do you have any last questions for Bjorn? I don't think so. It's an impressive looking solution. Yeah, exactly. Okay, Bjorn,
Starting point is 00:44:39 anything you'd like to say to our listening audience before we close? I think, yeah, try the free edition. That's the best way to actually see for yourself all the things we can do, and that storage actually can be easy. Yeah, don't say that too loud, but yeah. Well, this has been great, Bjorn. Thanks very much for being on our show today. Thank you, Ray. And that's it for now. Bye, Bjorn. Bye, Jason.
Starting point is 00:45:08 Until next time. Next time, we will talk to another system storage technology person. Any questions you want us to ask, please let us know. And if you enjoy our podcast, tell your friends about it. Please review us on Apple Podcasts, Google Play, and Spotify, as this will help get the word out.
