Orchestrate all the Things - From Raw Performance to Price Performance: A decade of Evolution at ScyllaDB. Featuring Felipe Mendes and Guilherme Nogueira

Episode Date: May 5, 2025

In business, they say it takes ten years to become an overnight success. In technology, they say it takes ten years to build a file system. ScyllaDB is in the technology business, offering a distributed NoSQL database that is monstrously fast and scalable. It turns out that it also takes ten years or more to build a successful database. This is something that Felipe Mendes and Guilherme Nogueira know well. Mendes and Nogueira are Technical Directors at ScyllaDB, working directly on the product as well as consulting clients. Recently, they presented some of the things they've been working on at ScyllaDB's Monster Scale Summit, and they shared their insights in an exclusive fireside chat. Read the article published on ScyllaDB's blog here: https://www.scylladb.com/2025/05/05/from-raw-performance-to-price-performance/ #NoSQL #Database #DatabaseEvolution #RaftProtocol #Cloud #DataConsistency #DatabaseScaling #TechInnovation #Opensource

Transcript
Starting point is 00:00:00 Welcome to Orchestrate All The Things. I'm George Anadiotis and we'll be connecting the dots together. Stories about technology, data, AI and media, and how they flow into each other, shaping our lives. In business, they say it takes 10 years to become an overnight success. In technology, they say it takes 10 years to build a file system. ScyllaDB is in the technology business, offering a distributed NoSQL database that is monstrously fast and scalable. It turns out that it also takes 10 years or more
Starting point is 00:00:31 to build a successful database. This is something that Felipe Mendes and Guilherme Nogueira know well. Mendes and Nogueira are technical directors at ScyllaDB, working directly on the product, as well as consulting clients. Recently, they presented some of the things they have been working on at ScyllaDB's Monster Scale Summit, and they shared their insights in an exclusive fireside chat. I hope you will enjoy this. If you like my work on Orchestrate All The Things, you can subscribe to my podcast,
Starting point is 00:01:02 available on all major platforms. My self-published newsletter also syndicated on Substack, Hacker.me, Medium and D-Zone, or follow and register all the things on your social media of choice. So I'm Guilherme Noviere, I'm a technical director here at SILA and I've been a Solutions Architect and a Macarier here, so I've got a lot of contact with prospects that are really looking to SILA to learn more and possibly use that for production. And for that matter, SILA DB is a really fast
Starting point is 00:01:33 NoSQL database really aimed towards performance and extreme scale that allows a work scale to escalate to multiple millions of operations per second at a really low tail latency. As for me, I am Felipe Mendez. Just like Guilherme, I also work at TulaDB as a technical director. I've been here at the company for about four years now. Wow, time flies. But I mean, as Guilherme said, we call him Gui,
Starting point is 00:02:06 just so you know. I love this database. I think its architecture is really beautiful. And we can definitely discuss more about that, what really makes CillaDB unique on its own. Things like its hard core architecture, its unique cache. But I mean, as for my time here at SillaDB, I co-authored a book called Database Performance at Scale. I also contributed code to SillaDB. So I tried to be pretty much involved in all the things in multiple areas here inside of the company. Nice to meet you, George. Great. Well, thank you.
Starting point is 00:02:47 Thank you both for the introduction. And I guess a little bit of background about myself, contextual background maybe in order. So I just realized just by browsing your recent event, actually, that it's been 10 years that SillaDB has been around. And out of those 10 years, I've been in one way or another involved, let's say, or familiar with this database for eight of those years. So it was 2017 when I first became aware. It was also the time that I happened to meet Sila De Biss, founder and CEO, or co-founder actually, Dorlaur. And so I've been kind of keeping track as much as
Starting point is 00:03:36 one can keep track, you know, on a kind of yearly, let's say, check, or sometimes I have to admit even less than yearly. But I'm fairly familiar with what SILA does, and also its evolution over the years. And I realized that because apparently of the fact that it's been 10 years since its inception, let's say, at your recent SILADB event, there were a number of talks that were actually dedicated precisely to this topic. So the evolution of the database over the years and the major milestones.
Starting point is 00:04:11 And I think, Felipe, one of those talks was actually yours. So you're probably the right person to ask you to give us just a bit of a review, let's say, of CILA's evolution over the years and what you think were the major milestones along this course? Of course, I mean I think that's a really great question and this was in fact the keynote that our CEO basically made. He made an analogy during his keynote where he said that, well it takes 10 years for one to develop a database. Well, it turns out that it also takes 10 years and even more for one to basically build a database. So as you said during my talk on Monster Scale Summit,
Starting point is 00:04:56 I also give a very quick overview over how I actually see the multiple Siladb generations. And when SiladB first started, it was all about raw performance, right? We wanted to be, and we still are, the fastest NoSQL database available in the market. And we did that. One analogy that I like to make without going too deep into the weeds
Starting point is 00:05:23 is basically imagine that we were basically a team of operating system engineers and we basically brought all this knowledge on how to build really low-level high-performance stuff but to the realm of databases and that's how CELA started. That's how ideas such as the shard per-core architecture essentially began. And back at the time, we were basically able to break several records. However, as you can imagine, simply raw speed does not really make it a good database. We needed to be compatible. We needed to have compatible, we needed to have features, things like materialized views, secondary indexes, we needed to have integrations with other
Starting point is 00:06:13 third-party solutions which are widely used in the market. And that's basically when we started going throughout our second generation, where basically Siladb started to actually catch up with the Cassandra protocol. We call it CQL, the Cassandra Query Language, and we started implementing all the technical depth that we had. A few years after, we eventually landed into our third generation,
Starting point is 00:06:43 which basically marked our shift to the cloud. That's when we announced that we're fully managed Siladb cloud service, and we built several features for customers to integrate their existing Siladb deployments with their upstream systems, things like Apache Kafka, Apache Spark, Elasticsearch. That's when our change data capture was born. Another feature that many people don't know about and Guy can perhaps talk about a little bit about it more later is basically
Starting point is 00:07:14 CillaDB Alternator. So we also basically have a fully compatible DynamoDB API, which is what we call Ciladb Alternator today. And the next generation of Ciladb was basically when we started breaking apart from some of the deficiencies, that's how I would at least refer to it, deficiencies existing in the Cassandra architecture, where basically we started introducing Raft and it's basically marketing to our road to eventually get to a strongly consistent system. The next generation was basically, it's basically where we currently stand at which is our road towards elasticity with tablets. So lots of features, lots of very interesting things for us to discuss. I hope I cheesed some words, which unfortunately,
Starting point is 00:08:10 some of the audience may not know about it, but I try to be very succinct. Yes, indeed. So thanks, thanks, Felipe. And just to pick up on some of the things you said, that again, since through my own involvement, let's say, on superficial as it has been with the database, I've been able to identify, let's say,
Starting point is 00:08:34 some of the themes over the years. And one of the things that you mentioned, so the CLA-DB Cloud has definitely been one of those. And so evolution of CLA-DB Cloud and the fact that CLA-DB has been increasingly used as a cache and all the migration paths from other data management systems and databases. These were the key themes that we identified in the previous CLA-DB chat that in the previous CLA-DB chat that we had with Dor. And looking at this year, Monster Scale Summit, it seems that these are still there.
Starting point is 00:09:11 And some of the key themes that have always characterized CLA-DB are also there. So cost reduction, high availability, scalability and elasticity. I also noticed with interest that there are a few things that are new such as the addition of vector capabilities and some things that I originally thought they were new like the workload prioritization but it turns out that maybe they are not so new after all. So I guess I'm going to ask you then next, Keith, any out of those themes that you'd like to pick
Starting point is 00:09:49 and just share a few words about. Sure. So in general, vector capabilities is one thing that we're recently implementing and working on. So this will actually match some of the Cassandra features, but also extend it in a way that will be unique to SILA in the future. And the way that that happens is currently
Starting point is 00:10:12 we have the implemented data types, data structures, and query capabilities of working with vector types. Those are actually basic data types of the database. And as we extend that project, we also cover a service to do indexing and work on indexing for those data points in vector. So vector search will be part of the future of SILA DB, especially in SILA Cloud, where we aim there to be a fully managed service, not only for SILA, the base core database, but also for this extended layer.
Starting point is 00:10:53 Okay. And in terms of workload prioritization, you also mentioned that it's seemingly new, but it has actually gaining a lot of traction because customers realize that they can run multiple workloads on the same cluster. So even if they have concurring live, real-time, and analytics workloads, they can run both of them in the same cluster, which is not the case with Cassandra, that you can have a target for latency on your online and batch or analytics operations and still hit those
Starting point is 00:11:26 marks while saving a lot of infrastructure and licensing costs as you're running or you're averaging the same hardware. Indeed, and just to give a bit of my own perspective, let's say on this, the reason why I thought originally that this is a new feature is, well, a couple of reasons, in fact. First, the fact that I wasn't aware of it, so I thought, well, if I didn't know, it seems like a really useful feature, so if I wasn't aware of it, it must be new. Plus the fact that there was a talk on this in the recent event, again from Felipe, so I thought, well, this looks really cool. I didn't know about it.
Starting point is 00:12:06 There's a talk, so it must be new. So Felipe, do you want to clarify? So when exactly was this introduced? And is there a specific reason why you chose to highlight it this year? Is there maybe something new added to the existing capabilities? So thanks for your question, George.
Starting point is 00:12:26 Workload prioritization was introduced in 2019. There are always new features and capabilities. But I can't recall a finishing specific for this year. But a few years back, probably last year or the year before, we actually introduced a feature called Workload Characterization, which basically extended the previous workload prioritization capabilities. The idea of characterizing a workload is that if we stop and think about it,
Starting point is 00:13:04 users basically run and use a database in a variety of ways. is that if you stop and think about it, users basically run and use a database in a variety of ways, right? But the main reason why most organizations typically run CELADB is when they need very high throughput and low latency workloads. Yet, I mean, sometimes they may need to run some very intensive batch workloads, some sort of analytics, or they may have a user who is running ad hoc queries, either because
Starting point is 00:13:31 they're debugging something or doing any sort of data analysis on his own. And what happens is that it's very challenging for the database, a database alone, to actually know how to prioritize one query versus another, right? And this is really the power that workload partization gives to the user. So you basically can assign different service levels, that's the technical term we came to it. And each service level is going to be isolated from another.
Starting point is 00:14:15 Which means that if you have a service level which has lower priority than your main workload and your database is currently running under contention, meaning that either it's CPU or memory or just resources are basically overutilized, then CELADB will start prioritizing the workload you defined for it to prioritize. So it's a very interesting feature. It's unique to CELADB. No other database has anything close to that. And I believe the reason why we are calling out to this year is because basically,
Starting point is 00:14:55 as you said it yourself, you were not aware of it, right? So many users are still not aware of all the unique CillaDB features that we have in comparison to other databases and we every now and then want to call it out again so that we bring more awareness to our users and community. Okay, cool. That's a great segue for me to ask. There was quite a variety, lots of themes in your lineup for the event you just had,
Starting point is 00:15:29 and lots of user presentations as well, which is always good. Do you have, I'm guessing that you must have watched them all or at your own time, because they're also available on demand. So I'm going to ask you both, starting from Yogi, do you have a couple of favorite use cases?
Starting point is 00:15:52 I have one that is really close to my heart, which is Mediums Feature Store Redesign. I work closely with them in DSA or Solutions Art Debt Capability, and I help them discuss and remodel their data set in order to achieve the results that they presented during the event. So that one, seeing that materialized
Starting point is 00:16:17 as a public presentation, even though I was not mentioned, I was still greatly happy to see them on screen and gather so much attention of how much impact and positive impact a good data modeling and reasoning about your access patterns makes for CLDB success. I would also highlight Udmose, which is another prospect that I work a bit close with, which migrated from Dynamo into CLDB Alternator with extreme great success, reducing cost, improving latency,
Starting point is 00:16:54 and being able to be flexible on their vendors, on cloud vendors in general. Okay. Thank you. How about you, Felipe? Well, that's... I could probably speak for an hour just on that. But, I mean, before I actually talk about customers who actually spoke at our recent conference, Monster Scale Summit, I would like to basically call out back on what Guy said, because he basically mentioned about VJU's feature store. And I work at, somehow closely with Clearview AI,
Starting point is 00:17:32 they gave a talk about how they are using Celerb as an AI workload. And that's very interesting because I see, nowadays we have lots of new AI and machine learning stuff coming up, right? Another particular customer that I had the opportunity to speak along was TripAdvisor. So last year at AWS re.Invent, TripAdvisor and I spoke at AWS re.Invent, where TripAdvisor basically spoke how they use SiloDB to basically service
Starting point is 00:18:07 the real-time features. So imagine whenever you hit the TripAdvisor website, they quickly need to figure out who you are because probably you use a TripAdvisor in the past, right? I don't know, maybe you were planning a trip to go, I don't know, to Lisbon, Portugal, and you decided to book something with them. And then as soon as you visit their website, they use Ciladb to quickly head through this data, identify who you are, and then provide recommendations in real time on their website. So as you can imagine, this has to be really fast because users attention spans is not very high, right? So if it just takes, I don't know, 300 milliseconds, that are very high chance they may lose the user. So they have a very interesting talk on how really Silladb helps them
Starting point is 00:18:58 to provide those real-time recommendations to their users. Now back to the main question, which is recommendations to their users. Now back to the main question which is talks specific to our Monster Scale Summit. A particular customer talk that I also love was basically Discord's on how they basically, well they had two talks right, how they basically run trillions of searches at scale. This was how Discord basically manages their Elasticsearch infrastructure. But related to CillaDB closely is basically how they run upgrades really at scale. So I also had the opportunity to work really closely
Starting point is 00:19:39 with the Discord team. I think they're super smart. I always learn a lot of working with them. So if they are listening to this somehow guys, thank you so much for everything you guys taught me throughout all those years. But yeah, it's a very interesting talk and the way they basically manage their distributed systems is really amazing. Cool. All right.
Starting point is 00:20:03 So I guess it's a good time to head back to one of the features, or rather two of the features you talked about early in the conversation, which I think, based on my understanding, is kind of core to the evolution of CLA-DB. And they're also very, very new. These, we are certain that they are new. So we're talking about tablets and strong consistency. So tablets is a new data distribution algorithm that in the latest CILADP version is replacing the legacy VNodes approach
Starting point is 00:20:41 that was inherited from Cassandra. And there's a strongly consistent topology updates for metadata. And I think before we get to discuss about what they are and the motivation and some use cases, I think it's useful if we start by introducing a bit of background. So first on the Paxos versus Raft protocols,
Starting point is 00:21:06 because this is central to understanding tablets and also a little bit of background knowledge on strong versus eventual consistency, because again, this is kind of fundamental knowledge to be able to follow this. So who wants to pick what? So with with regards to the Paxos and Raft comparison, the initial algorithm that was used and inherited from Cassandra implemented for strongly consistent operations, data operations, rely on Paxos. And now SILA is not only applying these to topology
Starting point is 00:21:48 updates, but also to metadata updates, meaning that whenever there is any change in metadata or any topology change in the cluster, that change is approved and committed by the RAPT Distributed Consensus algorithm. So in this case, we can do that fairly rapidly, consistently, and really in a distributed scale. So for instance, only on this aspect,
Starting point is 00:22:15 we can talk about tablets later, it allows clusters to be scaled or joined in the nodes to be joined in a cluster in mere seconds rather than the previous distributed algorithm, which was gossip for cluster topology, which allowed only a single node to join the cluster at a time. So if you had to scale from three to six,
Starting point is 00:22:37 that is a linear operation of adding one additional node three at a time from buckets of three. So in that sense, when we change the strong consistency for topology updates and also for metadata, we can also now add three nodes instantaneously to the cluster and then allow the cluster to rebalance itself using tablets, which we'll touch later on. Yeah, and with regards to your second question, which was the difference between
Starting point is 00:23:16 eventual consistency and struggle consistency, I would like rather to give you an example. So Guy mentioned about the gossip protocol. And the thing about gossip, and by the way, gossip is still used inside Scylla, but we are, as part of this raft effort, which we are still going on with, we are basically more and more relying less and less in gossip. So let me give you an example on how things work before we actually had Raft. So before Raft, if you were to basically bring down a node, suppose you are going to bring down a node for changing a configuration, upgrading it, whatever.
Starting point is 00:24:00 You bring down a node and because Gossip is basically a pandemic protocol, as soon as a node goes down, not all nodes in the cluster are aware of the fact that this node went down immediately. This information takes some time to propagate, right? So imagine you have a cluster which for Ciladb is very easy, I don't know, 100 nodes. is very easy, I don't know, 100 nodes. So you bring one node down, and then because Gosp is a pandemic protocol, one node communicates with another, and then this other node communicates with another, and then this one node communicates with another. So there are lots of HoundTrips going on out here,
Starting point is 00:24:38 and it takes time for this information to disseminate throughout the cluster. So it could happen a situation like this. You shut down a node because you were going to do a maintenance operation, but another node took too long to recognize the fact that the node was down. As a result, simply because you shut down a node,
Starting point is 00:24:57 boom, your latency spikes, and you never understand why. Until you realize that it's because it took too long for Gossip to actually disseminate that information. With Raft, basically we have a state machine. So Raft, we will have a single leader and whenever those metadata changes and state's change actually happen in the cluster, this information is first sent to that state machine, and then all nodes communicate with that state machine,
Starting point is 00:25:28 and they automatically will realize that, okay, one node is down, so we have two flagged, this node down, and then your application can continue running without taking it too long for a specific node or a group of nodes to realize that something changed in your cluster. That's basically a good way to differentiate or a group of nodes structure realize that something changed in your cluster. So that's basically a good way to differentiate the eventually consistency from stronger consistency.
Starting point is 00:25:52 In a way or another, both systems will eventually converge to the same state at the end of the day, no matter how long it takes, our systems will eventually realize that one of the nodes went down. The problem is, or the solution is, how fast can that realization actually came to be? Okay, so I have a question. You mentioned in the second case, not the GOSIP, but using the Raft protocol that is a state machine, and you have a leader that the rest of the nodes communicate with in order to get informed
Starting point is 00:26:32 about the up-to-date status of the cluster. So obviously, I'm going to ask you, all right, so what happens if your leader goes down? I guess there must be some kind of resilience. So by default, there's an election process. One difference between Paxos and Raft is that basically Paxos is a leaderless consensus protocol, right? So there is pretty much no leader, but with Raft, you always have a single replica, which is considered to be a leader. So naturally you might realize that, okay, isn't this a single point of failure? Yes, it is. But with CYLA-DB, basically we never have a single leader. For example, with tablets specifically,
Starting point is 00:27:20 we have one leader per tablet. And the idea of tablets, and because I know we are eventually going to diverge to discuss more in depth about tablets, is that a tablet is basically a logical abstraction that basically partitions your data, your tables into smaller fragments. And the name of those fragments,
Starting point is 00:27:48 we decided to call them tablets. And those tablets, they are independent from the raft, which means that CYLA-DB with raft can atomically and in a strongly consistent way, move them to other nodes as needed. And as your workload grows or shrinks on demand. So basically we have lots of raft leaders, which means that if one fails you pretty much don't mind much. An election will happen for those leaders which fail and you also have many other leaders in other nodes and replicas on a per-tablet basis. Cool.
Starting point is 00:28:27 So we already started getting into the details about tablets, so you quickly described what they are. Can either of you give a little bit of motivation? So what are some use cases in which tablets are useful? So I'll get started, Phil. You can add more use cases later. cases in which tablets are useful? So I'll get started, Fai, you can add more use cases later. So the way tablets work by breaking down those token ranges into smaller and more manageable and in a strongly consistent way is that we're able to really quickly move them out along
Starting point is 00:29:02 servers. So this is really aimed towards fast scaling of a cluster. So when I said that you can add three nodes really quickly into a cluster, that also means that those three nodes join the cluster and then rebalance these tablets in a parallel and consistent way, but also extremely fast. You can scale terabytes of data in your minutes rather than hours,
Starting point is 00:29:29 which was the previous algorithm. In this case, when we're talking about machines that have higher capacity, that also means that they have a higher storage density to be used, and tablets also balance out in a way that will fill those disks in an even way. So all nodes in the cluster will have a similar utilization because tablets who address the number of tablets within each node
Starting point is 00:29:57 according to the number of ECPUs which is always tied to storage in cloud nodes. So in this sense, as storage utilization is more flexible now, so as we can scale more quickly, it also allows users to run at a much higher storage utilization. Philippe is showcasing here that our aim is to run at, instead of 50 to 60%, up to 90 percent storage utilization because tablets and automations in the cloud will also allow us to very rapidly scale the cluster once those storage thresholds are exceeded. If you stop and think about it, I mean, you must be crazy to run a database
Starting point is 00:30:45 close to 90% storage utilization, right? But with Cilla, that's pretty much normal. That's one of the benefits of tablets. So what we are seeing here is exactly as Guil was alluding to. We first start, imagine that this diagonal purple line is basically storage utilization. We start with a small cluster with a single node.
Starting point is 00:31:13 This is basically per zone, so it does not really reflect a real three node cluster, which would be the minimum we recommend for production purposes. But the idea remains. So imagine that your storage utilization grows over time. And as you reach close to 90% storage utilization, what we do is simply we add a new node.
Starting point is 00:31:34 And thanks to tablets, we can really stream that data very, very quickly. We, in fact, we have some demos showcasing how it works. And we even run very many tests on a daily basis using our Celeradb Cloud, which guarantee that if you hit 90% storage utilization, tablets will have enough time to actually scale out your infrastructure without letting you run out of disk space.
Starting point is 00:32:07 Now, what is the value of this? Why do I want to run my database close to 90% storage utilization? Well, many people say that nowadays storage is actually cheap, right? I mean, compared to memory utilization, they are right. Storage is definitely cheap.
Starting point is 00:32:27 But the problem is that it still costs something, right? If you imagine, think about those hyperscalers. Companies like Apple and Netflix where they basically have thousands of Cassandra nodes. Very likely they are running their storage utilization somewhat close to 50-70% utilization, which means that they have at least 30% per node disk space that they don't use. Why? Because it's too dangerous, because it takes too long to scale. As Guy also said, I mean, one really benefit of tablets is that you can add several nodes in parallel. So if you want, I don't know, to 10x your traffic, you simply go and you add 10x times the number of instances you have. All that and so on will handle all the rest for you.
Starting point is 00:33:21 In comparison with other databases, you typically have to run those operations serially. You add one node, then you wait, then you go and add another and then you wait, and just takes time. So with tablets, you can easily scale out as well as scaling your infrastructure on demand, which means that if you have a workload which is very seasonal, for example,
Starting point is 00:33:43 throughout the day you have, I don't know, to run your peak capacity, but overnight you don't need that spare capacity, you can simply scale your database up and down on demand. And just at the end of the day, results in even higher costs beyond simply storage utilization. So those are really the main benefits of tablets.
Starting point is 00:34:04 But there's a third one that I personally feel is also important. Celeradb with tablets bring to users the ability that previously was only available in cloud infrastructure. Think about it. How many databases out there support this kind of elasticity that you can simply install on your own deploy on on premises on your own facilities. I can think only of Celeradb, other than the ones like Bigtable, Google Cloud Spanner, AWS DynamoDB and so on. So tablets also allow users to actually have this feature which was previously only tied
Starting point is 00:34:42 to cloud workloads, now to the hands of users. Okay, so it seems like the main idea here is to get your storage to be more granular basically. And by doing that, you gain in flexibility. However, usually, nothing comes for free and there's always a trade-off. So I'm guessing that the trade-off you make in this case is additional complexity in implementation. I wonder if that translates to additional CPU usage, for example, at runtime.
Starting point is 00:35:21 Is it something that you have benchmarks? So in terms of complexity, CLDB and the company CLDB drivers that are compatible with tablets handle everything for the user. So of course, that once the user is planning a cluster and a table and some configs for the workload, it has to have some considerations as to, OK, how much data you expect to be in this table, how many breakdowns or how many talking ranges
Starting point is 00:35:52 should that data be split at creation time. But other than that, that's really non-additional concerns that a user, an end user, might have as the administrator cluster. In terms of CPU utilization, it is even much more efficient in the way that we're doing this now, because as we discussed, we can transfer a tablet from one node to the other as we're scaling.
Starting point is 00:36:17 Not only we can do that in terms of tablets, but also the previous way was streaming data. Roles were actually being read and streamed each one in buckets over the networks. And that was not as efficient as it could be. And now tablets will also transfer a single file through the network as it's scaling. So now there is no row per row breakdown of the operations,
Starting point is 00:36:44 just a single single very bulk transfer that reaches the maximum network capacity as it scales. That's why it's extremely fast to scale a cluster under these circumstances. Felipe, do you have anything else to add? Yeah. The feature you just mentioned is our zero-cop streaming, which we called file-based tablet streaming, yeah. But you do have a good point, George,
Starting point is 00:37:11 because once it turned out, tablet was a major effort. And it still is, in a sense. So for example, there are some features that today we do not support with tablets, but we are working really, really hard to actually bring supports to those features. All those are basically available in our documentation. But in general, the feature is stable, the performance is great, and the benefits are many. So we have run some performance tests. We haven't seen any performance degradation of any sort.
Starting point is 00:37:49 But again, it's in a way or another, it's still a new feature. So if there is any performance problem, you know how we basically handle those things here at CLEDB. Then it's a bug and we got to fix it. Just to clarify, I wasn't thinking of complexity for the end user, let's say, but more in terms of complexity in implementing the feature. So for you, guys, the engineers in the crew basically, and just out of curiosity, how hard was that to implement? It definitely took some years to get it ready. Basically, Toplets involved re-engineering critical paths of our database and many of
Starting point is 00:38:36 those critical paths. For example, compactions, which is a very fundamental background process which exists both in Cassandra and Siladibi. We needed to introduce a concept known as compaction groups to basically deal with those tablets and ensure that we do not compact SS tables which are from a different tablet incorrectly. Repair also needed to come with its own changes. As Guillermo said, our drivers all needed to support what we call today as tablet awareness. So as you can imagine, when a client application is going to communicate with the database, the most efficient request is when it hits a node that's also a replica for the data that he's after.
Starting point is 00:39:31 So basically tablet awareness ensures that the driver always knows which node is a replica for the tablet that he's querying against. So those were outings that we needed to implement from ground zero. And yeah, so it wasn't easy, but we finally got to a state where we can finally say, hey, tablets is production ready folks. Great. So I guess that means that it's already used in production by end users. Correct. OK.
Starting point is 00:40:09 And is it also, by the way, it's a good, it just made me think, so what kind of feature parity do you have comparing the on-premise version to CLA-DB Cloud? And the why I came up with this question is because I just thought, well, all right, so is it also deployed on the SILADB Cloud? So the way that SILADB Cloud is designed is to actually leverage the very same binary down to the actual binary that you would run if you were running SILA Enterprise where you manage it yourself.
Starting point is 00:40:46 The only limitations of SILADB on the Cloud is that there's few customizations in terms of integration with the rest of one's infrastructure. Such as, for instance, the database deployment is isolated in a single VPC, so there's no outward connection from the CLDB deployment out into the customer VPCs or the customer networks. And then LDAP or any sort of network authentication doesn't make a lot of sense in that scenario. So that's one of the features that are not
Starting point is 00:41:19 available in the cloud. Recently, we also introduced data encryption, so encryption with REST, as we call it. And we can also leverage KMS, which is a customer-provided key management system that will manage the keys for the encrypted SS tables. And that can be deployed on SILA Cloud and could also be deployed on SILA Enterprise. So as time goes by, we are usually aiming towards covering and minimizing the gap
Starting point is 00:41:48 so our users have a much higher value of looking to migrate into cloud rather than managing SILA on by themselves, which can be a pretty complex task. Yeah, indeed. So last time we caught up with with Door, actually in the two of the previous times that we caught up, so the last one and the one right before it, CLA-DB Cloud was a major conversation point because each of those times, Door kept reporting year-on-year growth on the CLA-DB usage and also on the CLA-DB cloud usage and also on the revenue. So I'm wondering if you are aware of whether that growth has been going on basically.
Starting point is 00:42:38 I don't have access to exact financial numbers, but definitely. I wasn't so much talking about that but more about usage. No, yeah, no, definitely we are growing. I mean we see that option of CWDB growing on a daily basis. It's actually really, really impressive. And as you can imagine, that's also one of the reasons why we are, we spoke during our summit about many new features we are bringing. Because during our summit, as we already spoke, we spoke about how we are going to implement vector search. But beyond that, we also implemented about our tiered storage with support for S3 object storage. Not just S3, right?
Starting point is 00:43:22 But when we refer to S3, we basically refer to any object storage for any cloud. And those are the things we should really, we will allow Celeradb, also those features are in to also attract even more users than what we are seeing today. So I would say that, Guy correct me if I'm wrong, I think on a weekly basis I at least see a new SillyDB user coming in using our cloud. And yeah, I mean, the
Starting point is 00:43:54 growth has been impressive and the acceptance of our database is really, really, really, really good. I mean, we have lots of very good logos showcased in our website. We spoke about a few already like Trippadvisor, G Squared, we have Hulu, right? Wish Others, they might be seeing you. We spoke about Udimo and that they're awesome. Medium, digital, yeah, and so on. I think I also recall seeing American Express as a user, and yeah, a couple of, well, more than a couple of other big names as well. But I just realized we're very close to wrapping up,
Starting point is 00:44:37 and you just mentioned the magic word, so new features. And we've already kind of talked a little bit about vector. And this is something that honestly I was expecting to see, because I don't think there's like a single database today which is not adding vector capabilities. And for very obvious reasons. You also mentioned tiered storage. And again, that's something that in one way or another
Starting point is 00:45:05 I think has been ongoing for a while. I remember Dorr mentioning an effort towards utilizing S3 for storage which I guess is still ongoing. So if you want to talk about those or any other upcoming new features. So with regards to shared storage specifically, it somehow ties back to what I previously said that users use their databases in a variety of ways, right? When CillaDB was originally conceived, again it was all about raw performance. We want to give you the lowest latency possible, but sometimes raw performance. We want to give you the lowest latency possible. But sometimes users, they don't want to, you know,
Starting point is 00:45:47 string their data to another database. Sometimes their data grow too large and the costs become a problem because they have to scale out of their infrastructure in order to accommodate the storage growth. And many times users have data that they read very infrequently. So the idea of cheer storage is actually to give those users a choice.
Starting point is 00:46:12 Why are we still using CillaDB? Of course, the performance is not going to be the same. We cannot do miracles. I mean, the speed, the latency of accessing object storage is much higher than accessing a local NVMe. But with supporting things like object storage, there are many interesting things we could do. Besides only cheering your storage by itself,
Starting point is 00:46:39 we could come up with use cases where, for example, your local NVMes would only be used as a fast cache and all your data could be stored on only on object storage. I believe Doin already gave this example a few times back. When a node fails for example instead of you having to stream data from the existing replicas you could simply come and retrieve data directly from object storage and continue servicing traffic from there. So S3 object storage in a way will allow Citadb to make many more things than it's currently able to do today.
Starting point is 00:47:19 Okay, Kip, any new feature that you'd like to pick on? I would personally be very willing to hear more about the vector indexing and vector storage. Particularly because of the fact that some of the early benchmarks that I so mentioned seemed very promising. missing? Yeah, so I think the vector approach really leverages SILA's speed and throughput and low latency combined with a very well-designed processing and vector search layer. So our teams are still developing the layer. Those were very early numbers from benchmarks. But still, I think that's a great sign that how CELUV can integrate really well with those
Starting point is 00:48:13 really complex scenarios which require extreme scalability because you can easily scale to meters of operations per second just to to fulfill a couple of hundred thousand vector inquiries. And you can also leverage SILA's really extreme low latency and predictable latency in order to fulfill those in a really timely scenario. So our aim is that for applications that really need low latency and high throughput for those scenarios,
Starting point is 00:48:46 they can also continue to leverage SILA as a product on the back end. I would also highlight strongly consistent user data. We talked about cluster topology changes and actual metadata changes being strongly consistent on the refs, but we're also working towards implementing that on the user layer. So this is extremely initial, so there's no even a PLC to demonstrate, but that is an end goal that we have to really leverage the power of refs as a distributed consensus algorithm to make sure that users have the best capabilities
Starting point is 00:49:26 in terms of designing the applications on a strongly consistent manner while running Mozilla. All right, so I think we talked about a number of things actually and time flew by really, really fast. And we can probably wrap up here unless there's any other, whatever new feature or any use case or anything you feel that we've left out. And now is the time to mention it. Well, I just would like to say to our audience that if you're currently struggling with your
Starting point is 00:50:05 database, if your performance is not really well acceptable, you might want to check out CillaDB. Okay, cool. That's a good message to close the conversation with. So thank you. Thank you both gentlemen for joining me. It's been a pleasure. Thanks for sticking around. For more stories like this, check the link in bio and follow Link Data Orchestration.
